Unit Standardization Practice Problem

This data science coding problem lets you practice string standardization by normalizing measurement units with a supplied mapping. Read the problem statement, then write your solution.

Problem ID: 182
Problem key: 182-unit-standardization
URL: https://datacrack.app/solve/182-unit-standardization
Difficulty: medium
Topic: String Standardization
Module: Data Cleaning

Problem Statement

# Unit Standardization

### 🎯 Goal
Measurements get written every which way: `"5kg"`, `"5 kilograms"`, and `"5 KG"` are all the same weight, but as strings they're three different values. Using a **unit-variant mapping you're given**, your function rewrites each measurement into a consistent `"<number> <unit>"` format with canonical unit abbreviations.

### 📦 The mapping you're given
The `unit_map` (variant → canonical abbreviation) is passed to the function. For these examples and tests it is:
```python
unit_map = {
    "kg": "kg", "kgs": "kg", "kilogram": "kg", "kilograms": "kg",
    "g": "g", "gram": "g", "grams": "g",
    "lb": "lb", "lbs": "lb", "pound": "lb", "pounds": "lb",
    "l": "l", "liter": "l", "liters": "l", "litre": "l", "litres": "l",
    "ml": "ml", "milliliter": "ml", "milliliters": "ml",
}
```
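The lookup itself can be sketched in a few lines. This is only an illustration: `canonical_unit` is a hypothetical helper name, the mapping is a shortened copy of the one above, and the fallback to the lowercased unit follows the rule stated in the Task section.

```python
# Shortened copy of the mapping above, for illustration only.
unit_map = {"kg": "kg", "kilograms": "kg", "lbs": "lb", "pounds": "lb"}

def canonical_unit(raw_unit, unit_map):
    """Hypothetical helper: lowercase the unit, fall back to it if unmapped."""
    key = raw_unit.lower()
    # dict.get returns the second argument when the key is missing.
    return unit_map.get(key, key)

print(canonical_unit("Pounds", unit_map))  # lb
print(canonical_unit("KG", unit_map))      # kg
print(canonical_unit("Oz", unit_map))      # oz (not in the map, kept lowercased)
```

Because the map is passed in as an argument, the same helper works unchanged with a larger or different mapping.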

### 💻 Task
Implement `standardize_units(data, column, unit_map)` that:
1. Converts the input dictionary to a DataFrame
2. For each value in `column`, extracts the leading number and the unit word (e.g. `"2.5 Pounds"` → `2.5` + `Pounds`)
3. Lowercases the unit and looks it up in `unit_map` to get its canonical abbreviation
4. Rebuilds each value as `"<number> <canonical_unit>"` (single space)
5. Returns the cleaned DataFrame as a dictionary

**Important:** The `unit_map` is provided as an argument, so nothing is hardcoded. Matching is case-insensitive: lowercase the unit before lookup. If the lowercased unit isn't in the map, keep it lowercased (for example, `"3 Oz"` becomes `"3 oz"`). A value that doesn't match the `number + unit` pattern is returned unchanged.
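One possible way to handle a single value is sketched below. The regular expression is an assumption, since the problem does not spell out the exact pattern: it accepts an integer or decimal number followed by a letters-only unit, and does not cover signs, thousands separators, or numbers such as `.5`. `standardize_value` and `MEASUREMENT` are hypothetical names, and passing non-string values through untouched is also an assumption.

```python
import re

# Assumed pattern: optional whitespace, a number (integer or decimal),
# optional whitespace, then a unit made of letters only.
MEASUREMENT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")

def standardize_value(value, unit_map):
    """Hypothetical per-value helper, not part of the starter code."""
    if not isinstance(value, str):
        return value  # assumption: non-strings (e.g. None) pass through
    match = MEASUREMENT.match(value)
    if match is None:
        return value  # no number + unit pattern: return unchanged
    number, unit = match.groups()
    unit = unit.lower()
    # Unmapped units keep their lowercased form.
    return f"{number} {unit_map.get(unit, unit)}"

unit_map = {"kg": "kg", "kilograms": "kg", "pounds": "lb"}
print(standardize_value("2.5 Pounds", unit_map))  # 2.5 lb
print(standardize_value("5KG", unit_map))         # 5 kg
print(standardize_value("3 Oz", unit_map))        # 3 oz
print(standardize_value("unknown", unit_map))     # unknown
```

The number is kept as the matched text rather than converted to a float, so `"5"` stays `"5"` and does not become `"5.0"`.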

---

### 📥 Input
- `data`: A dictionary where keys are column names and values are lists
- `column`: The column holding the measurement strings
- `unit_map`: A dictionary mapping unit variants to their canonical abbreviation (the `unit_map` above)

### 📤 Output
- A dictionary representing the cleaned DataFrame, mapping each column name to a list of values (the shape shown in the examples)
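The round trip from dictionary to DataFrame and back can be sketched as follows. The sample data and the `str.strip` step are stand-ins for illustration, not the real cleaning logic. Note that pandas' default `to_dict()` returns nested `{column: {index: value}}` dictionaries, so `orient="list"` is used here to get the column-to-list shape that the examples show.

```python
import pandas as pd

# Illustrative input with a second column that should pass through untouched.
data = {"weight": [" 5kg ", "2 lbs"], "item": ["rice", "flour"]}

df = pd.DataFrame(data)

# Stand-in transformation; the unit standardization would go here.
df["weight"] = df["weight"].map(str.strip)

# orient="list" yields {column name: [values, ...]}.
result = df.to_dict(orient="list")
print(result)
```

Only the target column is modified; every other column comes back in the output as it went in.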

---

### 🧩 Starter Code

```python
import re
import pandas as pd

def standardize_units(data, column, unit_map):
    """
    Standardize measurement strings into a consistent "<number> <unit>" format
    using a provided unit-variant -> canonical-abbreviation mapping.

    Args:
        data (dict): Input data as dictionary
        column (str): Column holding the measurement strings
        unit_map (dict): Maps unit variants to canonical abbreviations, e.g. {"kilograms": "kg"}

    Returns:
        dict: DataFrame as dictionary with standardized "<number> <unit>" strings
    """
    # TODO: For each value, regex-extract the number and the unit word
    # TODO: Lowercase the unit and look up its canonical form in unit_map
    # TODO: Rebuild as "<number> <canonical_unit>" (unmatched unit -> keep lowercased)
    # TODO: Return the DataFrame as a dictionary
    pass
```

---

### 💡 Examples
*(all use the `unit_map` shown above)*

**Example 1:** Weights written four ways
```python
data = {"weight": ["5kg", "5 kilograms", "5 KG", "2.5 Pounds"]}
standardize_units(data, "weight", unit_map)
```
```
{"weight": ["5 kg", "5 kg", "5 kg", "2.5 lb"]}
```

**Example 2:** Volume units
```python
data = {"volume": ["10 Liters", "250ml", "3 L"]}
standardize_units(data, "volume", unit_map)
```
```
{"volume": ["10 l", "250 ml", "3 l"]}
```

**Example 3:** Grams, kilograms, pounds
```python
data = {"weight": ["500 g", "1.5kg", "2 lbs"]}
standardize_units(data, "weight", unit_map)
```
```
{"weight": ["500 g", "1.5 kg", "2 lb"]}
```
