Parse and Clean Names Practice Problem
This data science coding problem gives you practice with string standardization: parsing person names and cleaning them into a consistent form. Read the problem statement, then write your solution.
- Problem ID: 180
- Problem key: 180-parse-and-clean-names
- URL: https://datacrack.app/solve/180-parse-and-clean-names
- Difficulty: medium
- Topic: String Standardization
- Module: Data Cleaning
Problem Statement
# Parse and Clean Names
### 🎯 Goal
Person names arrive cluttered with honorific titles (`Dr.`, `Mrs.`) and generational suffixes (`Jr.`, `III`), in inconsistent casing. To compare or match people, we need to separate the *core name* from these decorations and normalize the capitalization.
### 💻 Task
Implement `parse_name(data, column)`, which:
1. Converts the input dictionary to a DataFrame
2. Splits each full name into tokens, stripping stray periods/commas
3. Detects, case-insensitively, a leading **title** (`Mr`, `Mrs`, `Ms`, `Dr`, `Prof`) and a trailing **suffix** (`Jr`, `Sr`, `II`, `III`, `IV`)
4. Capitalizes the remaining tokens and joins them with single spaces to form the clean name
5. Replaces `column` with the clean name and adds two new columns: `"title"` and `"suffix"` (empty string `""` when absent)
6. Returns the DataFrame as a dictionary mapping column names to lists
**Important:** Titles and suffixes are matched case-insensitively but output in canonical form (`"dr"`→`"Dr"`, `"iii"`→`"III"`). When no title/suffix is present, use an empty string.
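
The per-name part of the steps above (tokenize, pull off title and suffix, capitalize the rest) can be sketched in plain Python. This is one possible approach, not the reference solution: the helper name `split_name` and the constants `TITLES` and `SUFFIXES` are hypothetical, and the sketch assumes tokens are separated by whitespace.

```python
# Lookup tables: lowercase key -> canonical display form (values from the task list).
TITLES = {"mr": "Mr", "mrs": "Mrs", "ms": "Ms", "dr": "Dr", "prof": "Prof"}
SUFFIXES = {"jr": "Jr", "sr": "Sr", "ii": "II", "iii": "III", "iv": "IV"}


def split_name(full_name):
    """Hypothetical helper: return (clean_name, title, suffix) for one name."""
    # Split on whitespace, then strip stray periods/commas from each token.
    tokens = [t.strip(".,") for t in str(full_name).split()]
    tokens = [t for t in tokens if t]  # drop tokens that were only punctuation

    title = suffix = ""
    # A title counts only in the first position, a suffix only in the last.
    if tokens and tokens[0].lower() in TITLES:
        title = TITLES[tokens.pop(0).lower()]
    if tokens and tokens[-1].lower() in SUFFIXES:
        suffix = SUFFIXES[tokens.pop().lower()]

    # str.capitalize() uppercases the first letter and lowercases the rest.
    clean = " ".join(t.capitalize() for t in tokens)
    return clean, title, suffix


print(split_name("mr. TOM HANKS sr"))
```

Because `str.capitalize()` lowercases everything after the first letter, a name such as `McDonald` would come out as `Mcdonald`; the problem's examples do not cover that case, so treat it as an open design choice.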
---
### 📥 Input
- `data`: A dictionary where keys are column names and values are lists
- `column`: The column holding the full-name strings
### 📤 Output
- A dictionary representing the DataFrame: `column` cleaned, plus `"title"` and `"suffix"` columns
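
For the dictionary-to-DataFrame round trip, a minimal sketch follows. It assumes the expected output is a dictionary of lists, as in the examples, which `DataFrame.to_dict(orient="list")` produces. The `age` column and the pre-computed `parsed` tuples are hypothetical and only illustrate that other columns pass through unchanged.

```python
import pandas as pd

# Hypothetical input; "age" only shows that unrelated columns are preserved.
data = {"name": ["bob jones", "ms. sara lee"], "age": [41, 29]}
# Hypothetical per-row results as (clean_name, title, suffix) tuples.
parsed = [("Bob Jones", "", ""), ("Sara Lee", "Ms", "")]

df = pd.DataFrame(data)

# Unzip the tuples into three parallel sequences and write them back.
names, titles, suffixes = zip(*parsed)
df["name"] = list(names)       # overwrite the original column
df["title"] = list(titles)     # new column
df["suffix"] = list(suffixes)  # new column

# orient="list" yields {column name: list of values}.
result = df.to_dict(orient="list")
print(result)
```

Calling `to_dict()` with no arguments would instead return nested `{column: {index: value}}` dictionaries, which does not match the shape shown in the examples.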
---
### 🧩 Starter Code
```python
import pandas as pd


def parse_name(data, column):
    """
    Parse a full name into components, stripping titles (Mr, Dr, ...) and
    suffixes (Jr, III, ...).

    Args:
        data (dict): Input data as dictionary
        column (str): Column holding the full names

    Returns:
        dict: DataFrame as dictionary with cleaned name plus "title" and "suffix" columns
    """
    # TODO: Define title and suffix lookup tables (lowercase key -> display form)
    # TODO: For each name: tokenize and strip punctuation
    # TODO: Pull off a leading title and a trailing suffix if present
    # TODO: Capitalize the remaining tokens for the clean name
    # TODO: Write back the name and add "title" / "suffix" columns
    pass
```
---
### 💡 Examples
**Example 1:** Titles, suffix, and plain lowercase
```python
data = {"name": ["Dr. John Smith", "Mrs. Jane Doe Jr.", "bob jones"]}
parse_name(data, "name")
```
```
{"name": ["John Smith", "Jane Doe", "Bob Jones"],
"title": ["Dr", "Mrs", ""],
"suffix": ["", "Jr", ""]}
```
**Example 2:** Roman-numeral suffix and a dotted title
```python
data = {"name": ["prof albert king III", "ms. sara lee"]}
parse_name(data, "name")
```
```
{"name": ["Albert King", "Sara Lee"],
"title": ["Prof", "Ms"],
"suffix": ["III", ""]}
```
**Example 3:** Uppercase name with title and suffix
```python
data = {"name": ["mr. TOM HANKS sr"]}
parse_name(data, "name")
```
```
{"name": ["Tom Hanks"],
"title": ["Mr"],
"suffix": ["Sr"]}
```