Train Test Split Practice Problem
This data science coding problem covers model validation: you implement a train/test split from scratch. Read the problem statement, then write your solution.
- Problem ID: 155
- Problem key: 155-train-test-split
- URL: https://datacrack.app/solve/155-train-test-split
- Difficulty: easy
- Topic: Model Validation
- Module: Introduction to Machine Learning
Problem Statement
# 🧩 Train/Test Split
---
### 🎯 Goal
Learn how to split a dataset into a **training set** and a **test set** while keeping each feature row matched with its correct target value.
---
### 📖 Introduction
A machine learning model should be evaluated on data it has not already seen.
If we train and test on the same examples, the score can look better than it really is, because the model may simply memorize the training examples instead of learning patterns that generalize.
To avoid this, we divide the dataset into two parts:
| Split | Purpose |
|:------|:--------|
| **Training set** | Used to fit the model parameters |
| **Test set** | Used to estimate performance on unseen examples |
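To make the risk concrete, here is a minimal sketch using a toy "model" that only memorizes its training examples. The helpers `fit_memorizer`, `predict`, and `accuracy` and the data are hypothetical, chosen only for illustration; they are not part of the problem.

```python
# Hypothetical toy model: a lookup table of the training examples.
def fit_memorizer(X_train, y_train):
    return dict(zip(X_train, y_train))

def predict(model, x, default=0):
    # Unseen inputs fall back to a fixed default guess.
    return model.get(x, default)

def accuracy(model, X, y):
    correct = sum(1 for xi, yi in zip(X, y) if predict(model, xi) == yi)
    return correct / len(X)

X_train, y_train = [1, 2, 3, 4], [1, 0, 1, 0]
X_test, y_test = [5, 6], [1, 0]

model = fit_memorizer(X_train, y_train)
train_acc = accuracy(model, X_train, y_train)  # 1.0: every example was memorized
test_acc = accuracy(model, X_test, y_test)     # 0.5: only the default guess is available
```

The perfect training score says nothing about new inputs; only the held-out score reveals that the model has not learned anything general.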
---
### 💻 Task
Implement `train_test_split_basic` from scratch.
Steps:
1. Pair each feature example with its target using `zip(X, y)`.
2. If `shuffle=True`, shuffle the paired examples using `random.Random(random_state)`.
3. Convert `test_size` into an integer number of test examples:
   - if `test_size` is a float, use `int(len(X) * test_size)`
   - if `test_size` is an integer, use it directly
   - for example, if `len(X) = 5` and `test_size = 0.4`, then `test_count = int(5 * 0.4) = 2`
4. Use the last `test_count` examples as the test set.
5. Use the remaining examples as the training set.
6. Return `[X_train, X_test, y_train, y_test]`.
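Steps 3 to 5 can be sketched on their own, as one possible approach rather than the reference solution. The helper names `to_test_count` and `split_last` are hypothetical. The sketch slices at an explicit index because a negative slice such as `items[-test_count:]` returns the whole list when `test_count` is 0.

```python
def to_test_count(n_samples, test_size):
    """Hypothetical helper: turn test_size into a number of test examples."""
    if isinstance(test_size, float):
        return int(n_samples * test_size)  # truncates toward zero
    return test_size

def split_last(items, test_count):
    """Hypothetical helper: the last test_count items form the test group."""
    split_at = len(items) - test_count
    return items[:split_at], items[split_at:]

count = to_test_count(5, 0.4)                       # 2
train, test = split_last([1, 2, 3, 4, 5], count)    # ([1, 2, 3], [4, 5])
```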
---
### 📥 Input / 📤 Output
**Input**
- `X` (`list`): feature values or feature rows
- `y` (`list`): target values with the same length as `X`
- `test_size` (`float` or `int`): if float, it is a fraction of the dataset; if int, it is the exact number of test examples
- `shuffle` (`bool`): whether to shuffle the examples before splitting
- `random_state` (`int` or `None`): seed used when `shuffle=True`
**Output**
- `list`: `[X_train, X_test, y_train, y_test]`
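Before returning, the pairs have to be separated back into feature and target lists. One way to do this, shown as a sketch with made-up data, is a pair of list comprehensions. Unlike unpacking `zip(*pairs)` into two names, they also work when a group is empty.

```python
pairs = [(10, 1), (20, 2), (30, 3)]

# Pull the features and the targets back out, preserving order.
X_part = [x for x, _ in pairs]
y_part = [t for _, t in pairs]

# An empty group simply yields empty lists.
empty_pairs = []
X_empty = [x for x, _ in empty_pairs]
y_empty = [t for _, t in empty_pairs]
```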
---
### 🧩 Starter Code
```python
import random


def train_test_split_basic(X, y, test_size=0.2, shuffle=True, random_state=None):
    """
    Split features and targets into train and test sets.
    """
    # TODO 1: Pair each feature value with its matching target
    # TODO 2: Shuffle the pairs if shuffle=True
    # TODO 3: Convert test_size into a number of test examples
    # TODO 4: Split pairs into train and test groups
    # TODO 5: Separate features and targets again
    pass
```
---
### 💡 Example
```python
X = [10, 20, 30, 40, 50]
y = [1, 2, 3, 4, 5]
train_test_split_basic(X, y, test_size=0.4, shuffle=False)
```
**Expected Output**
```python
[[10, 20, 30], [40, 50], [1, 2, 3], [4, 5]]
```
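The example above uses `shuffle=False`, so its output is fully determined by the input order. With `shuffle=True`, the result depends on `random_state`. The sketch below, using the seed 42 as an arbitrary assumption, shows that two `random.Random` instances created with the same seed shuffle a list into the same order, which is what makes a seeded split repeatable.

```python
import random

pairs = list(zip([10, 20, 30, 40, 50], [1, 2, 3, 4, 5]))

first = pairs[:]
random.Random(42).shuffle(first)   # shuffles the copy in place

second = pairs[:]
random.Random(42).shuffle(second)  # same seed, fresh generator

same_order = first == second                 # True: identical seeds give identical orders
same_items = sorted(first) == sorted(pairs)  # True: shuffling only reorders
```

With `random_state=None`, `random.Random(None)` is seeded from a system source instead, so the order generally changes between runs.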
---
### 🧭 Hint
Shuffle the `(X, y)` pairs together, not `X` and `y` separately.
That is what keeps each row connected to the correct label.
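The difference can be checked directly. In this sketch, with illustrative data and an arbitrary seed of 0, the dictionary `expected` records which target belongs to which feature, so any broken pairing is easy to count.

```python
import random

X = [10, 20, 30, 40, 50]
y = [1, 2, 3, 4, 5]
expected = dict(zip(X, y))  # original feature -> target mapping

# Shuffle the pairs together: each feature keeps its own target.
pairs = list(zip(X, y))
random.Random(0).shuffle(pairs)
paired_ok = all(expected[x] == t for x, t in pairs)  # True

# Shuffle only X (a mistake): y keeps its old order, so every row
# that moved now sits next to the wrong target.
X_alone = X[:]
random.Random(0).shuffle(X_alone)
mismatches = sum(1 for x, t in zip(X_alone, y) if expected[x] != t)
```

Here `mismatches` counts the rows whose feature no longer lines up with its target; any value above zero means the labels have been silently corrupted.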