Cross-Column Validation Practice Problem
This data science coding problem helps you practice Data Consistency & Validation, cross-column validation, and implementation skills. Read the problem statement, write your solution, and strengthen your understanding of Data Consistency & Validation.
- Problem ID: 177
- Problem key: 177-cross-column-validation
- URL: https://datacrack.app/solve/177-cross-column-validation
- Difficulty: easy
- Topic: Data Consistency & Validation
- Module: Data Cleaning
Problem Statement
# Cross-Column Validation
### 🎯 Goal
Some data-quality rules cannot be checked by looking at one column in isolation — they depend on the *relationship* between columns. A classic example: a **start date must never come after an end date**. A row where `start = 2023-05-01` and `end = 2023-04-01` is internally inconsistent, even though each date is individually valid.
This function compares two date columns row-by-row and flags whether their ordering makes sense.
### 💻 Task
Implement `validate_date_order(data, start_column, end_column)` that:
1. Converts the input dictionary to a DataFrame
2. Parses both columns as datetimes (using `pd.to_datetime`)
3. Marks a row **valid** when `start <= end` (equal dates are allowed)
4. Adds a new boolean column named `"valid"` holding the result
5. Returns the DataFrame as a dictionary (original date strings unchanged)
**Important:** Equal dates are valid. Any row whose date cannot be parsed should be treated as **not valid**.
---
### 📥 Input
- `data`: A dictionary where keys are column names and values are lists
- `start_column`: Name of the column holding start dates
- `end_column`: Name of the column holding end dates
### 📤 Output
- A dictionary representing the DataFrame with an added `"valid"` boolean column
---
### 🧩 Starter Code
```python
import pandas as pd
def validate_date_order(data, start_column, end_column):
"""
Cross-column check: the start date must not come after the end date.
Args:
data (dict): Input data as dictionary
start_column (str): Column holding the start dates
end_column (str): Column holding the end dates
Returns:
dict: DataFrame as dictionary with an added "valid" boolean column
"""
# TODO: Convert input dictionary to DataFrame
# TODO: Parse both columns with pd.to_datetime
# TODO: Compute valid = (start <= end)
# TODO: Store the result in a "valid" column
# TODO: Return DataFrame as dictionary
pass
```
---
### 💡 Examples
**Example 1:** Mixed valid and invalid ordering
```python
data = {"start": ["2023-01-01", "2023-05-01", "2023-03-15"],
"end": ["2023-06-01", "2023-04-01", "2023-03-20"]}
validate_date_order(data, "start", "end")
```
```
{"start": ["2023-01-01", "2023-05-01", "2023-03-15"],
"end": ["2023-06-01", "2023-04-01", "2023-03-20"],
"valid": [True, False, True]}
```
**Example 2:** Custom column names
```python
data = {"project": ["A", "B"],
"begin": ["2022-01-01", "2022-12-31"],
"finish": ["2022-12-31", "2022-01-01"]}
validate_date_order(data, "begin", "finish")
```
```
{"project": ["A", "B"],
"begin": ["2022-01-01", "2022-12-31"],
"finish": ["2022-12-31", "2022-01-01"],
"valid": [True, False]}
```
**Example 3:** Equal dates are allowed
```python
data = {"s": ["2024-02-01", "2024-02-01"],
"e": ["2024-02-01", "2024-01-31"]}
validate_date_order(data, "s", "e")
```
```
{"s": ["2024-02-01", "2024-02-01"],
"e": ["2024-02-01", "2024-01-31"],
"valid": [True, False]}
```