Fuzzy Duplicate Detection Practice Problem

This data science coding problem helps you practice Duplicate Detection & Removal, fuzzy duplicate detection, and implementation skills. Read the problem statement, write your solution, and strengthen your understanding of Duplicate Detection & Removal.

Problem ID: 29
Problem key: 29-fuzzy-duplicate-detection
URL: https://datacrack.app/solve/29-fuzzy-duplicate-detection
Difficulty: hard
Topic: Duplicate Detection & Removal
Module: Data Cleaning

Problem Statement

# Fuzzy Duplicate Detection using Edit Distance

### 🎯 Goal
Find **near-duplicate** entries in a column by measuring how similar two strings are using Levenshtein distance.

### 💻 Task
Implement `fuzzy_duplicate_detection(data, column, threshold)` that:
1. Converts the input dictionary to a DataFrame
2. Computes the Levenshtein distance between every pair of values in the specified column
3. Returns all pairs where the distance is **≤ threshold**

You must implement the Levenshtein distance yourself (no external libraries).

### 🔣 Levenshtein Distance
The minimum number of single-character edits (insert, delete, substitute) to transform one string into another.

`d("kitten", "sitting") = 3`

---

### 📥 Input
- `data`: A dictionary where keys are column names and values are lists of data
- `column`: The column name to compare values in
- `threshold`: Maximum edit distance to consider a \"fuzzy match\"

### 📤 Output
- A list of dictionaries, each with keys: `row_i`, `row_j`, `value_i`, `value_j`, `distance`
- Only include pairs where `distance ≤ threshold`
- Pairs are ordered by `(row_i, row_j)` with `row_i < row_j`

---

### 🧩 Starter Code

```python
import pandas as pd
import numpy as np

def fuzzy_duplicate_detection(data, column, threshold):
    """
    Find near-duplicate entries in a column using
    Levenshtein (edit) distance.

    Args:
        data (dict): Input data as dictionary (from JSON)
        column (str): Column name to compare values in
        threshold (int): Maximum edit distance for a fuzzy match

    Returns:
        list[dict]: List of dicts with row_i, row_j, value_i, value_j, distance
    """
    # TODO: Implement a helper function for Levenshtein distance
    # TODO: Convert the input dictionary to a DataFrame
    # TODO: Compare every pair of values in the specified column
    # TODO: Return pairs where distance <= threshold
    pass
```

---

### 💡 Examples

**Example 1:**
```python
data = {"name": ["John", "Jon", "Jane", "Janet"]}
fuzzy_duplicate_detection(data, column="name", threshold=1)
```
```python
[
    {"row_i": 0, "row_j": 1, "value_i": "John", "value_j": "Jon", "distance": 1},
    {"row_i": 2, "row_j": 3, "value_i": "Jane", "value_j": "Janet", "distance": 1}
]
```

**Example 2:**
```python
data = {"city": ["New York", "new york", "Los Angeles", "San Francisco"]}
fuzzy_duplicate_detection(data, column="city", threshold=2)
```
```python
[
    {"row_i": 0, "row_j": 1, "value_i": "New York", "value_j": "new york", "distance": 2}
]
```

Fuzzy Duplicate Detection Practice Problem

Problem ID: 29
Problem key: 29-fuzzy-duplicate-detection
URL: https://datacrack.app/solve/29-fuzzy-duplicate-detection
Difficulty: hard
Topic: Duplicate Detection & Removal
Module: Data Cleaning

Problem Statement

# Fuzzy Duplicate Detection using Edit Distance

### 🎯 Goal
Find **near-duplicate** entries in a column by measuring how similar two strings are using Levenshtein distance.

### 💻 Task
Implement `fuzzy_duplicate_detection(data, column, threshold)` that:
1. Converts the input dictionary to a DataFrame
2. Computes the Levenshtein distance between every pair of values in the specified column
3. Returns all pairs where the distance is **≤ threshold**

You must implement the Levenshtein distance yourself (no external libraries).

### 🔣 Levenshtein Distance
The minimum number of single-character edits (insert, delete, substitute) to transform one string into another.

`d("kitten", "sitting") = 3`

---

### 📥 Input
- `data`: A dictionary where keys are column names and values are lists of data
- `column`: The column name to compare values in
- `threshold`: Maximum edit distance to consider a \"fuzzy match\"

### 📤 Output
- A list of dictionaries, each with keys: `row_i`, `row_j`, `value_i`, `value_j`, `distance`
- Only include pairs where `distance ≤ threshold`
- Pairs are ordered by `(row_i, row_j)` with `row_i < row_j`

---

### 🧩 Starter Code

```python
import pandas as pd
import numpy as np

def fuzzy_duplicate_detection(data, column, threshold):
    """
    Find near-duplicate entries in a column using
    Levenshtein (edit) distance.

    Args:
        data (dict): Input data as dictionary (from JSON)
        column (str): Column name to compare values in
        threshold (int): Maximum edit distance for a fuzzy match

    Returns:
        list[dict]: List of dicts with row_i, row_j, value_i, value_j, distance
    """
    # TODO: Implement a helper function for Levenshtein distance
    # TODO: Convert the input dictionary to a DataFrame
    # TODO: Compare every pair of values in the specified column
    # TODO: Return pairs where distance <= threshold
    pass
```

---

### 💡 Examples

**Example 1:**
```python
data = {"name": ["John", "Jon", "Jane", "Janet"]}
fuzzy_duplicate_detection(data, column="name", threshold=1)
```
```python
[
    {"row_i": 0, "row_j": 1, "value_i": "John", "value_j": "Jon", "distance": 1},
    {"row_i": 2, "row_j": 3, "value_i": "Jane", "value_j": "Janet", "distance": 1}
]
```

**Example 2:**
```python
data = {"city": ["New York", "new york", "Los Angeles", "San Francisco"]}
fuzzy_duplicate_detection(data, column="city", threshold=2)
```
```python
[
    {"row_i": 0, "row_j": 1, "value_i": "New York", "value_j": "new york", "distance": 2}
]
```

Starter Code

import pandas as pd
import numpy as np

def fuzzy_duplicate_detection(data, column, threshold):
    """
    Find near-duplicate entries in a column using
    Levenshtein (edit) distance.

    Args:
        data (dict): Input data as dictionary (from JSON)
        column (str): Column name to compare values in
        threshold (int): Maximum edit distance for a fuzzy match

    Returns:
        list[dict]: List of dicts with row_i, row_j, value_i, value_j, distance
    """
    # TODO: Implement a helper function for Levenshtein distance
    # TODO: Convert the input dictionary to a DataFrame
    # TODO: Compare every pair of values in the specified column
    # TODO: Return pairs where distance <= threshold
    pass

Fuzzy Duplicate Detection Practice Problem

Problem Statement

Fuzzy Duplicate Detection Practice Problem

Problem Statement

Starter Code

Internal Links