Clean URLs and Emails Practice Problem
This data science coding problem lets you practice text data cleaning: removing URLs and email addresses from a text column with regular expressions. Read the problem statement, write your solution, and strengthen your understanding of Text Data Cleaning.
- Problem ID: 165
- Problem key: 165-clean-urls-emails
- URL: https://datacrack.app/solve/165-clean-urls-emails
- Difficulty: medium
- Topic: Text Data Cleaning
- Module: Data Cleaning
Problem Statement
# Clean URLs and Emails
### 🎯 Goal
User-generated text often contains URLs and email addresses that are irrelevant to text analysis — they add noise to NLP models and can leak private information. Selectively removing URLs and/or emails produces cleaner, more focused text for downstream processing.
### 💻 Task
Implement `clean_urls_emails(data, column, remove_urls=True, remove_emails=True)` that:
1. Converts the input dictionary to a DataFrame
2. If `remove_urls=True`, removes all URLs matching the pattern `https?://\S+`
3. If `remove_emails=True`, removes all email addresses matching `\S+@\S+\.\S+`
4. Collapses any resulting extra whitespace
5. Returns the cleaned DataFrame as a dictionary that maps each column name to a list of values (the format shown in the examples)
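
Steps 2 to 4 can be tried on a single string before moving to a DataFrame. The sketch below is only an illustration: the helper name `strip_patterns` and the constant names are hypothetical, not part of the required solution. It also shows two side effects of the given patterns: `https?://\S+` takes trailing punctuation with it, and `\S+@\S+\.\S+` matches any whitespace-free token containing `@` followed later by a dot, so a URL with `@` in it is removed by the email pattern even when `remove_urls=False`.

```python
import re

# The two patterns are copied from the task; the names are illustrative.
URL_PATTERN = re.compile(r"https?://\S+")
EMAIL_PATTERN = re.compile(r"\S+@\S+\.\S+")

def strip_patterns(text, remove_urls=True, remove_emails=True):
    """Hypothetical helper: apply the task's patterns to one string."""
    if remove_urls:
        text = URL_PATTERN.sub("", text)
    if remove_emails:
        text = EMAIL_PATTERN.sub("", text)
    # Collapse runs of whitespace left behind and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

print(strip_patterns("Visit https://example.com for info"))
# Visit for info

# The trailing comma is not whitespace, so it is removed with the URL.
print(strip_patterns("See https://example.com, then reply."))
# See then reply.

# With URL removal off, the email pattern still matches this URL.
print(strip_patterns("Go https://user@host.com now", remove_urls=False))
# Go now
```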
---
### 📥 Input
- `data`: A dictionary where keys are column names and values are lists of strings
- `column`: The name of the column to clean
- `remove_urls`: Boolean — whether to remove URLs (default: `True`)
- `remove_emails`: Boolean — whether to remove email addresses (default: `True`)
### 📤 Output
- A dictionary representing the cleaned DataFrame, with column names as keys and lists of values as values
---
### 🧩 Starter Code
```python
import pandas as pd
import re

def clean_urls_emails(data, column, remove_urls=True, remove_emails=True):
    """
    Remove URLs and/or email addresses from a text column.

    Args:
        data (dict): Input data as dictionary
        column (str): Column name to clean
        remove_urls (bool): Whether to remove URLs
        remove_emails (bool): Whether to remove email addresses

    Returns:
        dict: Cleaned DataFrame as dictionary
    """
    # TODO: Convert input dictionary to DataFrame
    # TODO: Remove URLs using regex if remove_urls is True
    # TODO: Remove emails using regex if remove_emails is True
    # TODO: Collapse extra whitespace and strip
    # TODO: Return cleaned DataFrame as dictionary
    pass
```
---
### 💡 Examples
**Example 1:** Remove both URLs and emails
```python
data = {"text": ["Visit https://example.com for info", "Contact user@mail.com today", "No links here"]}
clean_urls_emails(data, "text", remove_urls=True, remove_emails=True)
```
```
{'text': ['Visit for info', 'Contact today', 'No links here']}
```
**Example 2:** Multiple URLs and emails
```python
data = {"text": ["Check http://test.org and https://abc.com", "Email a@b.com or c@d.org"]}
clean_urls_emails(data, "text", remove_urls=True, remove_emails=True)
```
```
{'text': ['Check and', 'Email or']}
```
**Example 3:** URLs only
```python
data = {"text": ["No special content", "Go to http://site.com now"]}
clean_urls_emails(data, "text", remove_urls=True, remove_emails=False)
```
```
{'text': ['No special content', 'Go to now']}
```
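One possible way to reproduce the examples above is sketched below. It is not an official solution. It assumes every value in the target column is a string, and it assumes the expected return format is column names mapped to lists, which is what the examples show and what `to_dict(orient="list")` produces.

```python
import pandas as pd

def clean_urls_emails(data, column, remove_urls=True, remove_emails=True):
    """Sketch of one possible solution; assumes the column holds only strings."""
    df = pd.DataFrame(data)
    cleaned = df[column]
    if remove_urls:
        # Pattern taken from step 2 of the task.
        cleaned = cleaned.str.replace(r"https?://\S+", "", regex=True)
    if remove_emails:
        # Pattern taken from step 3 of the task.
        cleaned = cleaned.str.replace(r"\S+@\S+\.\S+", "", regex=True)
    # Collapse whitespace runs left by the removals, then trim both ends.
    df[column] = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()
    # orient="list" gives {column: [values, ...]}, matching the examples.
    return df.to_dict(orient="list")

data = {"text": ["Check http://test.org and https://abc.com", "Email a@b.com or c@d.org"]}
print(clean_urls_emails(data, "text"))
# {'text': ['Check and', 'Email or']}
```

Only the named column is modified; any other columns in `data` pass through unchanged.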