Unit Standardization Practice Problem
This data science coding problem gives you practice with string standardization: parsing measurement strings and rewriting their units in one consistent form. Read the problem statement, then write your solution.
- Problem ID: 182
- Problem key: 182-unit-standardization
- URL: https://datacrack.app/solve/182-unit-standardization
- Difficulty: medium
- Topic: String Standardization
- Module: Data Cleaning
Problem Statement
# Unit Standardization
### 🎯 Goal
Measurements get written every which way: `"5kg"`, `"5 kilograms"`, and `"5 KG"` are all the same weight, but as strings they're three different values. Using a **unit-variant mapping you're given**, this function rewrites each measurement into a consistent `"<number> <unit>"` format with canonical unit abbreviations.
### 📦 The mapping you're given
The `unit_map` (variant → canonical abbreviation) is passed to the function. For these examples and tests it is:
```python
unit_map = {
    "kg": "kg", "kgs": "kg", "kilogram": "kg", "kilograms": "kg",
    "g": "g", "gram": "g", "grams": "g",
    "lb": "lb", "lbs": "lb", "pound": "lb", "pounds": "lb",
    "l": "l", "liter": "l", "liters": "l", "litre": "l", "litres": "l",
    "ml": "ml", "milliliter": "ml", "milliliters": "ml",
}
```
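To see which variants collapse onto which abbreviation, the mapping can be inverted. The snippet below is only an illustrative sketch: `variants_by_canonical` is a name introduced here, not part of the problem. It also checks two properties that hold for this particular map: every key is lowercase, and every canonical abbreviation maps to itself.

```python
from collections import defaultdict

unit_map = {
    "kg": "kg", "kgs": "kg", "kilogram": "kg", "kilograms": "kg",
    "g": "g", "gram": "g", "grams": "g",
    "lb": "lb", "lbs": "lb", "pound": "lb", "pounds": "lb",
    "l": "l", "liter": "l", "liters": "l", "litre": "l", "litres": "l",
    "ml": "ml", "milliliter": "ml", "milliliters": "ml",
}

# Group every variant under the canonical abbreviation it maps to.
variants_by_canonical = defaultdict(list)
for variant, canonical in unit_map.items():
    variants_by_canonical[canonical].append(variant)

for canonical, variants in variants_by_canonical.items():
    print(canonical, "<-", variants)

# Keys are all lowercase, which is why the unit is lowercased before lookup.
all_keys_lowercase = all(key == key.lower() for key in unit_map)
# Each canonical abbreviation is itself a key that maps to itself.
canonicals_self_map = all(unit_map.get(c) == c for c in variants_by_canonical)
print(all_keys_lowercase, canonicals_self_map)
```

Because the keys are lowercase, an input such as `"KG"` only matches after it has been lowercased.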
### 💻 Task
Implement `standardize_units(data, column, unit_map)` that:
1. Converts the input dictionary to a DataFrame
2. For each measurement in `column`, extracts the leading number and the unit word (e.g. `"2.5 Pounds"` → `2.5` + `Pounds`)
3. Lowercases the unit and looks it up in `unit_map` to get its canonical abbreviation
4. Rebuilds each value as `"<number> <canonical_unit>"` (single space)
5. Returns the cleaned DataFrame as a dictionary
**Important:** The `unit_map` is provided as an argument, so nothing is hardcoded. Matching is case-insensitive: lowercase the unit before looking it up. If the unit isn't in the map, use the lowercased unit as the output unit. A value that doesn't match the `number + unit` pattern is returned unchanged.
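The per-value rules can be sketched without pandas. This is one possible approach, not the reference solution. The regex is an assumption, since the problem does not specify a pattern: here it accepts an unsigned integer or decimal followed by a run of letters. `standardize_value` and `sample_map` are hypothetical names used only for illustration. The number is kept as text rather than converted to a float, so `"5"` stays `"5"`, as in the examples.

```python
import re

# Assumed pattern (not given in the problem): an unsigned integer or decimal,
# optional whitespace, then a run of letters as the unit word.
MEASUREMENT_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def standardize_value(value, unit_map):
    """Hypothetical per-value helper: returns "<number> <canonical_unit>"."""
    if not isinstance(value, str):
        return value  # non-strings cannot match the pattern; leave unchanged
    match = MEASUREMENT_RE.match(value)
    if match is None:
        return value  # no number + unit pattern; leave unchanged
    number = match.group(1)
    unit = match.group(2).lower()
    # Fall back to the lowercased unit when it is not in the map.
    return f"{number} {unit_map.get(unit, unit)}"


sample_map = {"kg": "kg", "kilograms": "kg", "pounds": "lb"}
for raw in ["5kg", "5 kilograms", "5 KG", "2.5 Pounds", "3 Stones", "n/a"]:
    print(repr(raw), "->", repr(standardize_value(raw, sample_map)))
```

With this pattern, `"3 Stones"` becomes `"3 stones"` (unit not in the map, kept lowercased) and `"n/a"` comes back unchanged. Negative numbers, thousands separators, and units containing digits or symbols are not handled; whether they should be depends on the test data.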
---
### 📥 Input
- `data`: A dictionary where keys are column names and values are lists
- `column`: The column holding the measurement strings
- `unit_map`: A dictionary mapping unit variants to their canonical abbreviation (the `unit_map` above)
### 📤 Output
- A dictionary representing the cleaned DataFrame
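
The examples show the output as `{column: [values, ...]}`. With pandas, that shape comes from `DataFrame.to_dict(orient="list")`; the default `to_dict()` instead nests values under their row index. A minimal round-trip sketch follows, in which the `item` column and the whitespace-stripping step are placeholders rather than part of the problem:

```python
import pandas as pd

# "item" is a hypothetical extra column showing that other columns pass through.
data = {"weight": ["5kg", "2.5 Pounds"], "item": ["rice", "flour"]}
df = pd.DataFrame(data)

# Placeholder transformation standing in for the real cleaning step.
df["weight"] = df["weight"].map(str.strip)

result = df.to_dict(orient="list")
print(result)

# For comparison: the default orientation nests values under the row index.
default_result = df.to_dict()
print(default_result["weight"])
```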
---
### 🧩 Starter Code
```python
import re

import pandas as pd


def standardize_units(data, column, unit_map):
    """
    Standardize measurement strings into a consistent "<number> <unit>" format
    using a provided unit-variant -> canonical-abbreviation mapping.

    Args:
        data (dict): Input data as dictionary
        column (str): Column holding the measurement strings
        unit_map (dict): Maps unit variants to canonical abbreviations, e.g. {"kilograms": "kg"}

    Returns:
        dict: DataFrame as dictionary with standardized "<number> <unit>" strings
    """
    # TODO: For each value, regex-extract the number and the unit word
    # TODO: Lowercase the unit and look up its canonical form in unit_map
    # TODO: Rebuild as "<number> <canonical_unit>" (unmatched unit -> keep lowercased)
    # TODO: Return the DataFrame as a dictionary
    pass
```
---
### 💡 Examples
*(all use the `unit_map` shown above)*
**Example 1:** Weights written four ways
```python
data = {"weight": ["5kg", "5 kilograms", "5 KG", "2.5 Pounds"]}
standardize_units(data, "weight", unit_map)
```
```
{"weight": ["5 kg", "5 kg", "5 kg", "2.5 lb"]}
```
**Example 2:** Volume units
```python
data = {"volume": ["10 Liters", "250ml", "3 L"]}
standardize_units(data, "volume", unit_map)
```
```
{"volume": ["10 l", "250 ml", "3 l"]}
```
**Example 3:** Grams, kilograms, pounds
```python
data = {"weight": ["500 g", "1.5kg", "2 lbs"]}
standardize_units(data, "weight", unit_map)
```
```
{"weight": ["500 g", "1.5 kg", "2 lb"]}
```