Train Test Split Practice Problem

This data science coding problem gives you practice with model validation: you implement a train/test split from scratch. Read the problem statement, then write your solution.

Problem ID: 155
Problem key: 155-train-test-split
URL: https://datacrack.app/solve/155-train-test-split
Difficulty: easy
Topic: Model Validation
Module: Introduction to Machine Learning

Problem Statement

# 🧩 Train/Test Split

---

### 🎯 Goal

Learn how to split a dataset into a **training set** and a **test set** while keeping each feature row matched with its correct target value.

---

### 📖 Introduction

A machine learning model should be evaluated on data it has not already seen.

If we train and test on the same examples, the score can look better than it really is, because the model may simply memorize those examples instead of learning patterns that generalize.

To avoid this, we divide the dataset into two parts:

| Split | Purpose |
|:------|:--------|
| **Training set** | Used to fit the model parameters |
| **Test set** | Used to estimate performance on unseen examples |

---

### 💻 Task

Implement `train_test_split_basic` from scratch.

Steps:

1. Pair each feature example with its target using `zip(X, y)`, and store the pairs in a list so they can be shuffled and sliced.
2. If `shuffle=True`, shuffle the paired examples using `random.Random(random_state)`.
3. Convert `test_size` into an integer number of test examples:
   - if `test_size` is a float, use `int(len(X) * test_size)`
   - if `test_size` is an integer, use it directly
   - for example, if `len(X) = 5` and `test_size = 0.4`, then `test_count = int(5 * 0.4) = 2`
4. Use the last `test_count` examples as the test set.
5. Use the remaining examples as the training set.
6. Return `[X_train, X_test, y_train, y_test]`.
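Steps 3 to 5 can be sketched in isolation as follows. This is a minimal illustration, not the reference solution: the helper names `resolve_test_count` and `split_tail` are hypothetical and do not appear in the problem. The sketch cuts at an explicit index rather than using a negative slice, on the assumption that `test_count` may be 0.

```python
def resolve_test_count(n_samples, test_size):
    """Turn test_size into an integer count (hypothetical helper name)."""
    if isinstance(test_size, float):
        # A float is a fraction of the dataset; int() truncates toward zero.
        return int(n_samples * test_size)
    # An integer is taken as the exact number of test examples.
    return test_size


def split_tail(pairs, test_count):
    """Put the last test_count pairs in the test group, the rest in train."""
    # Cut at an explicit index: pairs[-0:] would return the whole list,
    # so a negative slice gives the wrong answer when test_count is 0.
    cut = len(pairs) - test_count
    return pairs[:cut], pairs[cut:]


pairs = list(zip([10, 20, 30, 40, 50], [1, 2, 3, 4, 5]))
count = resolve_test_count(len(pairs), 0.4)
train_pairs, test_pairs = split_tail(pairs, count)
print(count)        # 2
print(test_pairs)   # [(40, 4), (50, 5)]
```

With `test_count = 0`, `split_tail` returns every pair in the training group and an empty test group.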

---

### 📥 Input / 📤 Output

**Input**
- `X` (`list`): feature values or feature rows
- `y` (`list`): target values with the same length as `X`
- `test_size` (`float` or `int`): if float, it is a fraction of the dataset; if int, it is the exact number of test examples
- `shuffle` (`bool`): whether to shuffle the examples before splitting
- `random_state` (`int` or `None`): seed used when `shuffle=True`; with `None` the shuffle is not reproducible across runs

**Output**
- `list`: `[X_train, X_test, y_train, y_test]`
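The effect of `random_state` can be seen with a small sketch. The helper name `shuffled_copy` is hypothetical and only illustrates how a seeded `random.Random` instance behaves; it is not part of the required function.

```python
import random


def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a private generator (hypothetical helper)."""
    rng = random.Random(seed)   # seed=None means the generator is seeded unpredictably
    out = list(items)           # copy, so the caller's list is left untouched
    rng.shuffle(out)            # shuffle() reorders the list in place
    return out


data = list(range(10))
first = shuffled_copy(data, 42)
second = shuffled_copy(data, 42)
print(first == second)          # True: same seed, same order
print(sorted(first) == data)    # True: shuffling only reorders the items
```

Using a dedicated `random.Random` instance, rather than the module-level `random.shuffle`, keeps the split from disturbing any other code that relies on the global random state.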

---

### 🧩 Starter Code

```python
import random

def train_test_split_basic(X, y, test_size=0.2, shuffle=True, random_state=None):
    """
    Split features and targets into train and test sets.
    """
    # TODO 1: Pair each feature value with its matching target

    # TODO 2: Shuffle the pairs if shuffle=True

    # TODO 3: Convert test_size into a number of test examples

    # TODO 4: Split pairs into train and test groups

    # TODO 5: Separate features and targets again

    pass
```

---

### 💡 Example

```python
X = [10, 20, 30, 40, 50]
y = [1, 2, 3, 4, 5]

train_test_split_basic(X, y, test_size=0.4, shuffle=False)
```

**Expected Output**

```python
[[10, 20, 30], [40, 50], [1, 2, 3], [4, 5]]
```

---

### 🧭 Hint

Shuffle the `(X, y)` pairs together, not `X` and `y` separately.  
That is what keeps each row connected to the correct label.
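
The hint can be illustrated with a short sketch. The data and the seed `0` are arbitrary choices for the illustration; the targets are picked so that alignment is easy to verify.

```python
import random

X = [10, 20, 30, 40, 50]
y = [1, 2, 3, 4, 5]      # here each target is its feature divided by 10

# Shuffle the pairs as a unit so features and targets move together.
pairs = list(zip(X, y))
random.Random(0).shuffle(pairs)

# zip(*pairs) transposes the list of pairs back into two columns.
X_shuffled, y_shuffled = (list(col) for col in zip(*pairs))

# Alignment survives: every feature still sits next to its own target.
print(all(x == 10 * t for x, t in zip(X_shuffled, y_shuffled)))  # True
```

Note that `zip(*pairs)` produces nothing when `pairs` is empty, so the two-name unpacking raises `ValueError`; an empty train or test group needs separate handling.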
