Stratified K-Fold Cross-Validation Practice Problem
This data science coding problem helps you practice Model Validation, stratified k-fold cross-validation, and implementation skills. Read the problem statement, write your solution, and strengthen your understanding of Model Validation.
- Problem ID: 154
- Problem key: 154-stratified-k-fold-cross-validation
- URL: https://datacrack.app/solve/154-stratified-k-fold-cross-validation
- Difficulty: hard
- Topic: Model Validation
- Module: Introduction to Machine Learning
Problem Statement
# 🧩 Stratified K-Fold Cross-Validation
---
### 🎯 Goal
Create K-Fold splits that preserve the **class distribution** of the target labels as much as possible.
---
### 📖 Introduction
Plain K-Fold Cross-Validation creates folds without checking the class labels. In classification problems, this can create validation folds that do not represent the full dataset well. For example, one fold may contain mostly class `0`, while another fold may contain very few examples from class `1`.
This becomes riskier when the dataset is imbalanced.
Stratified K-Fold solves this by:
- grouping examples by class label
- splitting each class across the folds
- keeping each fold’s class distribution closer to the full dataset
> **Note:** Stratified K-Fold is mainly used for classification problems because it depends on class labels.
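To make the risk concrete, here is a minimal sketch of how plain K-Fold can go wrong. It assumes an unshuffled plain K-Fold that cuts the rows into contiguous blocks, and it uses hypothetical labels that happen to be sorted by class. Neither assumption comes from the problem statement. They are chosen only to show the failure mode.

```python
from collections import Counter

# Hypothetical imbalanced labels, sorted by class: eight 0s, then four 1s.
y = [0] * 8 + [1] * 4
k = 3
fold_size = len(y) // k  # 4 rows per fold (12 divides evenly by 3)

# Assumed plain K-Fold without shuffling: each fold is a contiguous block.
plain_folds = [list(range(f * fold_size, (f + 1) * fold_size)) for f in range(k)]

# Count the class labels that land in each validation fold.
plain_counts = [dict(Counter(y[i] for i in fold)) for fold in plain_folds]
print(plain_counts)  # [{0: 4}, {0: 4}, {1: 4}]
```

In this sketch, two validation folds contain no class `1` rows and the third contains nothing else, so no fold resembles the full dataset.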
---
### 💻 Task
Implement `stratified_k_fold_indices` from scratch.
Steps:
1. Create a dictionary to store the positions of each class in `y`.
For example, if:
```python
y = [0, 0, 0, 1, 1, 1]
```
then:
```python
positions = {
    0: [0, 1, 2],  # positions of class 0
    1: [3, 4, 5],  # positions of class 1
}
```
2. Create `k` empty validation folds.
3. If `shuffle=True`, shuffle the positions inside each class.
4. For each class, spread its positions across the folds one by one.
Example with `k = 3`:
```text
first position → fold 0
second position → fold 1
third position → fold 2
fourth position → fold 0 again
```
5. After the validation folds are built, create the training indices for each round.
The training indices are all dataset indices except the validation indices for that fold.
6. Return all `[train_indices, val_indices]` pairs.
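The six steps above can be sketched as follows. This is one possible way to fill them in, not the reference solution. It assumes that classes are processed in order of first appearance in `y`, that shuffling uses a local `random.Random(random_state)` generator rather than the global seed, and that `k` is a valid positive integer (no input validation is done).

```python
import random


def stratified_k_fold_indices(y, k, shuffle=False, random_state=None):
    """Return a list of [train_indices, val_indices] pairs, one per fold."""
    # Step 1: record the positions of each class (order of first appearance).
    positions = {}
    for idx, label in enumerate(y):
        positions.setdefault(label, []).append(idx)

    # Step 2: create k empty validation folds.
    folds = [[] for _ in range(k)]

    # Step 3: optionally shuffle inside each class with a seeded generator
    # (assumption: a local generator, so the global random state is untouched).
    if shuffle:
        rng = random.Random(random_state)
        for class_positions in positions.values():
            rng.shuffle(class_positions)

    # Step 4: deal each class's positions to the folds one by one,
    # wrapping back to fold 0 after fold k - 1.
    for class_positions in positions.values():
        for i, idx in enumerate(class_positions):
            folds[i % k].append(idx)

    # Steps 5 and 6: training indices are all indices not in the fold.
    splits = []
    for val_indices in folds:
        held_out = set(val_indices)
        train_indices = [i for i in range(len(y)) if i not in held_out]
        splits.append([train_indices, val_indices])
    return splits
```

Because every class starts dealing at fold 0, the earlier folds receive the extra rows whenever a class size is not a multiple of `k`. This follows directly from the steps as written.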
---
### 📥 Input / 📤 Output
**Input**
- `y` (`list`): class labels
- `k` (`int`): number of folds
- `shuffle` (`bool`): whether to shuffle examples inside each class before distributing them
- `random_state` (`int` or `None`): seed used when `shuffle=True`
**Output**
- `list`: a list of validation rounds
  - Each round should be `[train_indices, val_indices]`
  - `train_indices`: row indices used for training in that round
  - `val_indices`: row indices used for validation in that round
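One way to sanity-check a returned value against this output format is a small helper like the one below. The name `check_splits` and its parameters are hypothetical and not part of the problem. The checks encode properties implied by the task: there are `k` rounds, the train and validation indices of a round do not overlap, together they cover every row, and each row is validated exactly once.

```python
def check_splits(splits, n_samples, k):
    """Hypothetical helper: True if `splits` has the documented structure."""
    if len(splits) != k:
        return False
    all_rows = list(range(n_samples))
    validated = []
    for train_indices, val_indices in splits:
        # Train and validation indices must not overlap within a round.
        if set(train_indices) & set(val_indices):
            return False
        # Together they must cover every row exactly once.
        if sorted(list(train_indices) + list(val_indices)) != all_rows:
            return False
        validated.extend(val_indices)
    # Across all rounds, each row is used for validation exactly once.
    return sorted(validated) == all_rows


# Hand-written splits for 4 rows and k = 2.
good = [[[1, 3], [0, 2]], [[0, 2], [1, 3]]]
bad = [[[1, 3], [0, 2]], [[0, 2], [0, 3]]]  # row 0 is in both train and val
print(check_splits(good, n_samples=4, k=2))  # True
print(check_splits(bad, n_samples=4, k=2))   # False
```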
---
### 🧩 Starter Code
```python
import random


def stratified_k_fold_indices(y, k, shuffle=False, random_state=None):
    """
    Return train/validation index splits for Stratified K-Fold Cross-Validation.
    """
    # TODO 1: Find the positions of each class in y
    # TODO 2: Create k empty validation folds
    # TODO 3: Shuffle positions inside each class if requested
    # TODO 4: Distribute each class across folds one by one
    # TODO 5: Build train indices for each fold
    # TODO 6: Return all [train_indices, val_indices] pairs
    pass
```
---
### 💡 Example
```python
y = [0, 0, 0, 1, 1, 1]
stratified_k_fold_indices(y, k=3, shuffle=False)
```
**Expected Output**
```python
[
    [[1, 2, 4, 5], [0, 3]],
    [[0, 2, 3, 5], [1, 4]],
    [[0, 1, 3, 4], [2, 5]]
]
```
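As a quick check that this output is stratified, the short sketch below counts the class labels in each round of the expected output. The variable names (`expected`, `summary`) are illustrative only.

```python
from collections import Counter

y = [0, 0, 0, 1, 1, 1]
expected = [
    [[1, 2, 4, 5], [0, 3]],
    [[0, 2, 3, 5], [1, 4]],
    [[0, 1, 3, 4], [2, 5]],
]

# For each round, count the class labels on the validation and training side.
summary = []
for train_indices, val_indices in expected:
    val_counts = dict(Counter(y[i] for i in val_indices))
    train_counts = dict(Counter(y[i] for i in train_indices))
    summary.append((val_counts, train_counts))
    print(val_counts, train_counts)  # {0: 1, 1: 1} {0: 2, 1: 2}
```

Every validation fold holds one example of each class, which matches the 50/50 class balance of the full `y`.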
---
### 🧠 Hint
For each class, send its first index to fold 0, second index to fold 1, and so on. When you reach the last fold, wrap around to fold 0 again.
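The wrap-around in the hint can be expressed with the modulo operator. The snippet below is a minimal sketch for a single class. The positions `[10, 11, 12, 13]` are hypothetical values chosen for illustration.

```python
k = 3
class_positions = [10, 11, 12, 13]  # hypothetical positions of one class
folds = [[] for _ in range(k)]

for i, idx in enumerate(class_positions):
    # i % k cycles 0, 1, 2, 0, ... so the fourth position wraps to fold 0.
    folds[i % k].append(idx)

print(folds)  # [[10, 13], [11], [12]]
```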