Stratified K-Fold Cross-Validation Practice Problem

This data science coding problem gives you practice with model validation: you implement stratified k-fold cross-validation from scratch. Read the problem statement, then write your own solution.

Problem ID: 154
Problem key: 154-stratified-k-fold-cross-validation
URL: https://datacrack.app/solve/154-stratified-k-fold-cross-validation
Difficulty: hard
Topic: Model Validation
Module: Introduction to Machine Learning

Problem Statement

# 🧩 Stratified K-Fold Cross-Validation

---

### 🎯 Goal

Create K-Fold splits that preserve the **class distribution** of the target labels as much as possible.

---

### 📖 Introduction

Plain K-Fold Cross-Validation creates folds without checking the class labels. In classification problems, this can create validation folds that do not represent the full dataset well. For example, one fold may contain mostly class `0`, while another fold may contain very few examples from class `1`.

The risk grows when the dataset is imbalanced, because a rare class can end up concentrated in one fold or missing from a fold altogether.
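
A small sketch can make the problem concrete. It assumes a label-blind splitter that cuts the rows into contiguous blocks; `plain_k_fold_val_folds` is a hypothetical helper written for this illustration, and the labels are made up.

```python
from collections import Counter

def plain_k_fold_val_folds(n, k):
    """Contiguous validation folds that never look at the labels (hypothetical helper)."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more row
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# Made-up imbalanced labels, sorted by class: eight 0s followed by four 1s.
y = [0] * 8 + [1] * 4

folds = plain_k_fold_val_folds(len(y), 3)
fold_counts = [dict(Counter(y[i] for i in fold)) for fold in folds]
print(fold_counts)  # [{0: 4}, {0: 4}, {1: 4}]
```

Here two validation folds contain no class `1` rows at all, and the third contains nothing else.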

Stratified K-Fold solves this by:

- grouping examples by class label
- splitting each class across the folds
- keeping each fold’s class distribution closer to the full dataset

> **Note:** Stratified K-Fold is mainly used for classification problems because it depends on class labels.

---

### 💻 Task

Implement `stratified_k_fold_indices` from scratch.

Steps:

1. Create a dictionary to store the positions of each class in `y`.

   For example, if:

   ```python
   y = [0, 0, 0, 1, 1, 1]
   ```

   then:

   ```python
   # class label -> positions of that label in y
   positions = {0: [0, 1, 2], 1: [3, 4, 5]}
   ```

2. Create `k` empty validation folds.

3. If `shuffle=True`, shuffle the positions inside each class.

4. For each class, deal its positions to the folds in round-robin order, starting from fold 0 for every class.

   Example with `k = 3`:

   ```text
   first position  → fold 0
   second position → fold 1
   third position  → fold 2
   fourth position → fold 0 again
   ```

5. After the validation folds are built, create the training indices for each round.

   The training indices are all dataset indices except the validation indices for that fold.

6. Return all `[train_indices, val_indices]` pairs.
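
The six steps above can be sketched as follows. This is one possible reading of the steps, not the platform's reference solution. The name `stratified_k_fold_indices_sketch` is made up for this illustration, and two details are assumptions: shuffling uses a local `random.Random(random_state)` generator, and validation indices stay in the order they were dealt (class by class) rather than being sorted.

```python
import random

def stratified_k_fold_indices_sketch(y, k, shuffle=False, random_state=None):
    # Step 1: map each class label to the positions where it occurs,
    # keeping classes in order of first appearance.
    positions_by_class = {}
    for index, label in enumerate(y):
        positions_by_class.setdefault(label, []).append(index)

    # Step 2: one empty validation fold per round.
    val_folds = [[] for _ in range(k)]

    # Assumption: a private generator, so the global random state is untouched.
    rng = random.Random(random_state)
    for positions in positions_by_class.values():
        # Step 3: optionally shuffle the positions inside this class.
        if shuffle:
            rng.shuffle(positions)
        # Step 4: deal positions round-robin, restarting at fold 0 per class.
        for offset, index in enumerate(positions):
            val_folds[offset % k].append(index)

    # Steps 5 and 6: training indices are all rows not held out in that fold.
    splits = []
    for val_indices in val_folds:
        held_out = set(val_indices)
        train_indices = [i for i in range(len(y)) if i not in held_out]
        splits.append([train_indices, val_indices])
    return splits

result = stratified_k_fold_indices_sketch([0, 0, 0, 1, 1, 1], k=3)
print(result)
```

For `y = [0, 0, 0, 1, 1, 1]` and `k = 3` without shuffling, the validation folds come out as `[0, 3]`, `[1, 4]`, and `[2, 5]`.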

---


### 📥 Input / 📤 Output

**Input**
- `y` (`list`): class labels
- `k` (`int`): number of folds
- `shuffle` (`bool`): whether to shuffle examples inside each class before distributing them
- `random_state` (`int` or `None`): seed used when `shuffle=True`
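
To illustrate how `random_state` can make shuffling reproducible, here is a minimal sketch. It assumes a local `random.Random` instance seeded with `random_state`; the problem statement does not say whether the intended solution seeds the global generator instead. `shuffled_copy` is a hypothetical helper.

```python
import random

def shuffled_copy(positions, random_state=None):
    """Return a shuffled copy of `positions` (hypothetical helper)."""
    rng = random.Random(random_state)  # local generator; global state is untouched
    shuffled = list(positions)         # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    return shuffled

class_0_positions = [0, 1, 2, 3, 4]
first = shuffled_copy(class_0_positions, random_state=42)
second = shuffled_copy(class_0_positions, random_state=42)
print(first == second)  # True: the same seed gives the same order
```

With `random_state=None`, the generator is seeded from a system source, so repeated calls may return different orders.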

**Output**
- `list`: a list of validation rounds
- Each round should be `[train_indices, val_indices]`
- `train_indices`: row indices used for training in that round
- `val_indices`: row indices used for validation in that round
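
The output format can be sanity-checked with a short helper. The sketch below is an assumption about what a well-formed result should satisfy (no overlap inside a round, every row held out exactly once across rounds); `check_splits` is a hypothetical name, not part of the problem.

```python
def check_splits(splits, n_samples, k):
    """Sanity-check a list of [train_indices, val_indices] rounds (hypothetical helper)."""
    assert len(splits) == k, "expected one round per fold"
    seen_in_validation = []
    for train_indices, val_indices in splits:
        # Within a round, train and validation must not overlap
        # and together must cover every row.
        assert set(train_indices).isdisjoint(val_indices)
        assert sorted(train_indices + val_indices) == list(range(n_samples))
        seen_in_validation.extend(val_indices)
    # Across all rounds, every row is held out exactly once.
    assert sorted(seen_in_validation) == list(range(n_samples))
    return True

example_splits = [
    [[1, 2, 4, 5], [0, 3]],
    [[0, 2, 3, 5], [1, 4]],
    [[0, 1, 3, 4], [2, 5]],
]
ok = check_splits(example_splits, n_samples=6, k=3)
print(ok)  # True
```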

---



### 🧩 Starter Code

```python
import random

def stratified_k_fold_indices(y, k, shuffle=False, random_state=None):
    """
    Return train/validation index splits for Stratified K-Fold Cross-Validation.
    """
    # TODO 1: Find the positions of each class in y
    # TODO 2: Create k empty validation folds
    # TODO 3: Shuffle positions inside each class if requested
    # TODO 4: Distribute each class across folds one by one
    # TODO 5: Build train indices for each fold
    # TODO 6: Return all [train_indices, val_indices] pairs
    pass
```

---

### 💡 Example

```python
y = [0, 0, 0, 1, 1, 1]

stratified_k_fold_indices(y, k=3, shuffle=False)
```

**Expected Output**

```python
[
    [[1, 2, 4, 5], [0, 3]],
    [[0, 2, 3, 5], [1, 4]],
    [[0, 1, 3, 4], [2, 5]]
]
```
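
To see why this output counts as stratified, the class counts in each round can be tallied. The short check below uses only the example labels and the expected output above.

```python
from collections import Counter

y = [0, 0, 0, 1, 1, 1]
expected = [
    [[1, 2, 4, 5], [0, 3]],
    [[0, 2, 3, 5], [1, 4]],
    [[0, 1, 3, 4], [2, 5]],
]

# Count how many rows of each class land in validation and in training.
val_class_counts = [dict(Counter(y[i] for i in val)) for _, val in expected]
train_class_counts = [dict(Counter(y[i] for i in train)) for train, _ in expected]

print(val_class_counts)    # [{0: 1, 1: 1}, {0: 1, 1: 1}, {0: 1, 1: 1}]
print(train_class_counts)  # [{0: 2, 1: 2}, {0: 2, 1: 2}, {0: 2, 1: 2}]
```

Every validation fold keeps the 50/50 class ratio of the full dataset.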

---

### 🧭 Hint

For each class, send its first index to fold 0, second index to fold 1, and so on. When you reach the last fold, wrap around to fold 0 again.
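
The wrap-around in the hint can be written with the modulo operator. The sketch below uses made-up labels and also shows a side effect of restarting at fold 0 for every class: when class sizes are not multiples of `k`, the early folds collect the leftover rows.

```python
k = 3
# Made-up labels: five rows of class "a" followed by four rows of class "b".
y = ["a"] * 5 + ["b"] * 4

fold_sizes = [0] * k
seen_per_class = {}
for label in y:
    offset = seen_per_class.get(label, 0)  # rows of this class seen so far
    fold_sizes[offset % k] += 1            # modulo wraps back to fold 0
    seen_per_class[label] = offset + 1

print(fold_sizes)  # [4, 3, 2]
```

Class `"a"` contributes 2, 2, and 1 rows to folds 0, 1, and 2, and class `"b"` contributes 2, 1, and 1, so fold 0 ends up twice the size of fold 2.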
