Detect Data Leakage Practice Problem
This data science coding problem helps you practice Model Generalization, detect data leakage, and implementation skills. Read the problem statement, write your solution, and strengthen your understanding of Model Generalization.
- Problem ID: 158
- Problem key: 158-detect-data-leakage
- URL: https://datacrack.app/solve/158-detect-data-leakage
- Difficulty: medium
- Topic: Model Generalization
- Module: Introduction to Machine Learning
Problem Statement
# 🧩 Detect Data Leakage
---
### 🎯 Goal
Detect common signs of **data leakage** before trusting a model's validation score.
> **Note:** This function catches simple leakage signals. Real leakage can be more subtle and may require understanding how each feature was created.
---
### 📖 Introduction
Data leakage happens when information from the training process leaks into the validation or testing sets, making the evaluation score look better than it should. Leakage can make validation scores look unrealistically good.
Two common leakage signals are:
1. The same sample appears in both training and validation data.
2. A feature reveals the target or uses future information.
For example, if we are predicting whether a customer will churn, a feature called `future_churn_status` would be suspicious because it may contain information from after the prediction time.
---
### 💻 Task
Implement `detect_data_leakage`.
Steps:
1. If `leakage_keywords` is `None`, use this default list:
```python
["target", "label", "outcome", "future", "after", "post", "leak"]
```
These words are suspicious because they may suggest that a feature contains the target, the final result, or information from after the prediction period.
2. Find sample IDs that appear in both `train_ids` and `validation_ids`.
3. Keep overlapping IDs in the order they appear in `train_ids`, without duplicates.
4. Check each feature name for possible leakage.
A feature is suspicious if its lowercase name:
- is exactly the same as the lowercase target name
- contains the lowercase target name
- contains any leakage keyword
5. Set `has_leakage` to `True` if there is at least one overlapping ID or one suspicious feature.
6. Return a dictionary with:
- `overlap_ids`
- `suspicious_features`
- `has_leakage`
---
### 📥 Input / 📤 Output
**Input**
- `train_ids` (`list`): sample IDs used for training
- `validation_ids` (`list`): sample IDs used for validation
- `feature_names` (`list[str]`): names of model input features
- `target_name` (`str`): name of the target being predicted
- `leakage_keywords` (`list[str]` or `None`): suspicious keywords; if `None`, use the default list
**Output**
- `dict`: containing `overlap_ids`, `suspicious_features`, and `has_leakage`
---
### 🧩 Starter Code
```python
def detect_data_leakage(train_ids, validation_ids, feature_names, target_name, leakage_keywords=None):
"""
Detect simple data leakage signals from split IDs and feature names.
Args:
train_ids (list): Sample IDs used for training.
validation_ids (list): Sample IDs used for validation.
feature_names (list): Names of model input features.
target_name (str): Name of the target being predicted.
leakage_keywords (list or None): Suspicious words to check in feature names.
Returns:
dict: Leakage summary with overlap_ids, suspicious_features, and has_leakage.
"""
# TODO 1: Set default leakage keywords if needed
# TODO 2: Find overlapping sample IDs
# TODO 3: Keep overlapping IDs in train_ids order without duplicates
# TODO 4: Find suspicious feature names
# TODO 5: Set has_leakage based on overlaps or suspicious features
# TODO 6: Return leakage summary
pass
```
---
### 💡 Example
```python
detect_data_leakage(
train_ids=[1, 2, 3, 4],
validation_ids=[4, 5, 6],
feature_names=["age", "income", "default_label"],
target_name="default"
)
```
**Expected Output**
```python
{
"overlap_ids": [4],
"suspicious_features": ["default_label"],
"has_leakage": True
}
```
---
### 🧭 Hint
A great validation score is not useful if validation data or future information leaked into training.
# 🧩 Detect Data Leakage
---
### 🎯 Goal
Detect common signs of **data leakage** before trusting a model's validation score.
> **Note:** This function catches simple leakage signals. Real leakage can be more subtle and may require understanding how each feature was created.
---
### 📖 Introduction
Data leakage happens when information from the training process leaks into the validation or testing sets, making the evaluation score look better than it should. Leakage can make validation scores look unrealistically good.
Two common leakage signals are:
1. The same sample appears in both training and validation data.
2. A feature reveals the target or uses future information.
For example, if we are predicting whether a customer will churn, a feature called `future_churn_status` would be suspicious because it may contain information from after the prediction time.
---
### 💻 Task
Implement `detect_data_leakage`.
Steps:
1. If `leakage_keywords` is `None`, use this default list:
```python
["target", "label", "outcome", "future", "after", "post", "leak"]
```
These words are suspicious because they may suggest that a feature contains the target, the final result, or information from after the prediction period.
2. Find sample IDs that appear in both `train_ids` and `validation_ids`.
3. Keep overlapping IDs in the order they appear in `train_ids`, without duplicates.
4. Check each feature name for possible leakage.
A feature is suspicious if its lowercase name:
- is exactly the same as the lowercase target name
- contains the lowercase target name
- contains any leakage keyword
5. Set `has_leakage` to `True` if there is at least one overlapping ID or one suspicious feature.
6. Return a dictionary with:
- `overlap_ids`
- `suspicious_features`
- `has_leakage`
---
### 📥 Input / 📤 Output
**Input**
- `train_ids` (`list`): sample IDs used for training
- `validation_ids` (`list`): sample IDs used for validation
- `feature_names` (`list[str]`): names of model input features
- `target_name` (`str`): name of the target being predicted
- `leakage_keywords` (`list[str]` or `None`): suspicious keywords; if `None`, use the default list
**Output**
- `dict`: containing `overlap_ids`, `suspicious_features`, and `has_leakage`
---
### 🧩 Starter Code
```python
def detect_data_leakage(train_ids, validation_ids, feature_names, target_name, leakage_keywords=None):
"""
Detect simple data leakage signals from split IDs and feature names.
"""
# TODO 1: Set default leakage keywords
# TODO 2: Find overlapping sample IDs
# TODO 3: Find suspicious feature names
# TODO 4: Return leakage summary
pass
```
---
### 💡 Example
```python
detect_data_leakage(
train_ids=[1, 2, 3, 4],
validation_ids=[4, 5, 6],
feature_names=["age", "income", "default_label"],
target_name="default"
)
```
**Expected Output**
```python
{
"overlap_ids": [4],
"suspicious_features": ["default_label"],
"has_leakage": True
}
```
---
### 🧭 Hint
A great validation score is not useful if validation data or future information leaked into training.