Validation Set Practice Problem
This data science coding problem gives you practice with model validation: you implement a train/validation/test split from scratch. Read the problem statement, write your solution, and strengthen your understanding of why tuning and final evaluation need separate data.
- Problem ID: 156
- Problem key: 156-validation-set
- URL: https://datacrack.app/solve/156-validation-set
- Difficulty: easy
- Topic: Model Validation
- Module: Introduction to Machine Learning
Problem Statement
# 🧩 Validation Set
---
### 🎯 Goal
Split data into **training**, **validation**, and **test** sets so that model tuning and final evaluation stay separate.
---
### 📖 Introduction
A train/test split gives us a clean final evaluation set. But during model development, we often need to make choices:
- Which polynomial degree should we use?
- Which regularization strength is best?
- Which model performs better?
If we use the test set to make these choices, the test score becomes optimistically biased: the chosen settings were picked precisely because they scored well on that data. That is why we add a **validation set**.
The training set is seen by the model during fitting. The validation set is seen by us during model building, because we use it to choose the best model or settings. The test set should therefore stay unseen by both the training step and our model-selection decisions, which gives a more honest estimate of how the final model may perform on completely new examples.
| Split | Purpose |
|:------|:--------|
| **Training set** | Fit model parameters |
| **Validation set** | Tune model choices and hyperparameters |
| **Test set** | Final evaluation after model choices are finished |
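To make the division of labor in the table concrete, here is a minimal sketch that is not part of the original problem. It assumes a toy one-feature ridge model through the origin, hand-made data that is already split, and hypothetical helper names (`fit_slope`, `mse`). The candidate regularization strengths are arbitrary.

```python
def fit_slope(xs, ys, lam):
    """Closed-form ridge fit for y ~ w * x (no intercept); illustrative only."""
    sxy = sum(x * t for x, t in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(w, xs, ys):
    """Mean squared error of the line y = w * x on the given examples."""
    return sum((w * x - t) ** 2 for x, t in zip(xs, ys)) / len(xs)


# Toy data, assumed to be split already (values are made up).
X_train, y_train = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
X_val, y_val = [5, 6], [10.1, 11.8]
X_test, y_test = [7, 8], [14.2, 15.9]

# Training set: fit parameters for every candidate setting.
# Validation set: compare the candidates and pick one.
candidates = [0.0, 1.0, 10.0]
val_scores = {
    lam: mse(fit_slope(X_train, y_train, lam), X_val, y_val)
    for lam in candidates
}
best_lam = min(val_scores, key=val_scores.get)

# Test set: touched exactly once, after the choice is final.
final_w = fit_slope(X_train, y_train, best_lam)
test_mse = mse(final_w, X_test, y_test)
print(best_lam, test_mse)
```

The test examples never influence which `lam` is chosen, so `test_mse` is not inflated by the selection step.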
---
### 💻 Task
Implement `train_val_test_split` from scratch.
Steps:
1. Pair each feature example with its target.
2. If `shuffle=True`, shuffle the pairs using `random.Random(random_state)`.
3. Convert `val_size` and `test_size` into integer counts: a `float` is a fraction of the total number of examples, an `int` is an exact count.
4. Put the last `test_count` examples into the test set.
5. Put the `val_count` examples immediately before the test set into the validation set.
6. Put the remaining examples into the training set.
7. Return `[X_train, X_val, X_test, y_train, y_val, y_test]`.
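Steps 1 to 3 can be sketched as follows. This is one possible reading, not the reference solution: the helper name `size_to_count` is hypothetical, the seed `42` is arbitrary, and truncating with `int()` is an assumption, since the problem statement does not say how fractional counts should be rounded.

```python
import random


def size_to_count(size, n):
    """Hypothetical helper: a float is a fraction of n, an int is an exact count."""
    if isinstance(size, float):
        return int(n * size)  # assumption: truncate toward zero
    return size


X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10, 20, 30, 40, 50, 60, 70, 80]

# Step 1: pair each feature with its target so they stay aligned.
pairs = list(zip(X, y))

# Step 2: shuffle in place with a dedicated, seeded generator.
random.Random(42).shuffle(pairs)

# Step 3: both a fraction and an exact count resolve to an integer.
val_count = size_to_count(0.25, len(pairs))
test_count = size_to_count(2, len(pairs))
print(val_count, test_count)
```

Because the pairs are shuffled as units, every feature keeps its own target, and a private `random.Random` instance leaves the global random state untouched.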
---
### 📥 Input / 📤 Output
**Input**
- `X` (`list`): feature values or feature rows
- `y` (`list`): target values with the same length as `X`
- `val_size` (`float` or `int`): fraction or exact number of validation examples
- `test_size` (`float` or `int`): fraction or exact number of test examples
- `shuffle` (`bool`): whether to shuffle before splitting
- `random_state` (`int` or `None`): seed used when `shuffle=True`
**Output**
- `list`: `[X_train, X_val, X_test, y_train, y_val, y_test]`
---
### 🧩 Starter Code
```python
import random


def train_val_test_split(X, y, val_size=0.2, test_size=0.2, shuffle=True, random_state=None):
    """
    Split features and targets into train, validation, and test sets.
    """
    # TODO 1: Pair X and y together
    # TODO 2: Shuffle the pairs when requested
    # TODO 3: Compute validation and test counts
    # TODO 4: Slice train, validation, and test pairs
    # TODO 5: Separate X and y for each split
    pass
```
---
### 💡 Example
```python
X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10, 20, 30, 40, 50, 60, 70, 80]
train_val_test_split(X, y, val_size=0.25, test_size=0.25, shuffle=False)
```
**Expected Output**
```python
[[1, 2, 3, 4], [5, 6], [7, 8], [10, 20, 30, 40], [50, 60], [70, 80]]
```
---
### 🧭 Hint
Think of the split as three consecutive blocks after shuffling:
`train | validation | test`
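The three-block picture can be sketched with explicit boundary indices. This is an illustrative fragment under assumptions: the counts are hard-coded to match the example above, and `unzip` is a hypothetical helper name.

```python
# Pairs in their final order (unshuffled here, as in the example).
pairs = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50), (6, 60), (7, 70), (8, 80)]
val_count, test_count = 2, 2

# Compute where each block starts, counting back from the end.
n = len(pairs)
test_start = n - test_count
val_start = test_start - val_count

train_pairs = pairs[:val_start]           # train
val_pairs = pairs[val_start:test_start]   # validation
test_pairs = pairs[test_start:]           # test


def unzip(block):
    """Hypothetical helper: split a list of (x, y) pairs into two lists."""
    return [x for x, _ in block], [t for _, t in block]


X_train, y_train = unzip(train_pairs)
X_val, y_val = unzip(val_pairs)
X_test, y_test = unzip(test_pairs)

result = [X_train, X_val, X_test, y_train, y_val, y_test]
print(result)
```

Using explicit start indices rather than negative slices such as `pairs[-test_count:]` avoids a common pitfall: when `test_count` is `0`, `pairs[-0:]` returns the whole list instead of an empty one.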