Remove Stop Words Practice Problem
This data science coding problem helps you practice text data cleaning, specifically stop-word removal, along with general implementation skills. Read the problem statement and write your solution to strengthen your understanding of text data cleaning.
- Problem ID: 169
- Problem key: 169-remove-stop-words
- URL: https://datacrack.app/solve/169-remove-stop-words
- Difficulty: medium
- Topic: Text Data Cleaning
- Module: Data Cleaning
Problem Statement
# Remove Stop Words
### 🎯 Goal
Stop words like "the", "is", "and", and "a" appear frequently in text but carry little meaning for analysis. Removing them reduces noise and helps NLP models focus on the words that matter, which can improve both performance and interpretability.
### 💻 Task
Implement `remove_stop_words(data, column, stop_words=None)` that:
1. Converts the input dictionary to a DataFrame
2. If `stop_words` is `None`, uses this default list: `["the", "a", "an", "is", "are", "was", "were", "in", "on", "at", "to", "for", "of", "and", "or", "but", "not", "with", "by", "from", "as", "it", "this", "that"]`
3. Splits each text in `column` into words, filters out stop words (case-insensitive comparison), and joins the remaining words back with single spaces
4. Preserves original case of non-stop words
5. Returns the cleaned DataFrame as a dictionary
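
Steps 3 and 4 can be tried on a single string before applying them to a whole column. The sketch below is only an illustration: the helper name `filter_words` is hypothetical, and splitting on whitespace is an assumption that matches the examples. Under that assumption, a word with punctuation attached (such as `"is."`) would not be treated as a stop word.

```python
def filter_words(text, stop_words):
    """Drop stop words from one string, comparing case-insensitively."""
    # Lowercase the stop words once so membership checks ignore case.
    stop_set = {word.lower() for word in stop_words}
    # str.split() with no arguments splits on any run of whitespace.
    kept = [word for word in text.split() if word.lower() not in stop_set]
    # Surviving words keep their original casing.
    return " ".join(kept)


print(filter_words("The Cat is ON the mat", ["the", "is", "on"]))  # Cat mat
```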
---
### 📥 Input
- `data`: A dictionary where keys are column names and values are lists of strings
- `column`: The name of the column to clean
- `stop_words`: Optional list of stop words (if `None`, use the default list)
### 📤 Output
- A dictionary representing the cleaned DataFrame
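
The examples in this problem show each column as a plain list of values. In pandas, that shape corresponds to `DataFrame.to_dict(orient="list")`; the default orientation nests values under row indices instead. A brief illustration, using made-up sample data:

```python
import pandas as pd

df = pd.DataFrame({"text": ["cat mat", "dog cat"]})

# Default orientation: {column: {row_index: value}}
print(df.to_dict())               # {'text': {0: 'cat mat', 1: 'dog cat'}}

# orient="list": {column: [values]}, the shape used in the examples
print(df.to_dict(orient="list"))  # {'text': ['cat mat', 'dog cat']}
```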
---
### 🧩 Starter Code
```python
import pandas as pd


def remove_stop_words(data, column, stop_words=None):
    """
    Remove common stop words from a text column.

    Args:
        data (dict): Input data as dictionary
        column (str): Column name to clean
        stop_words (list): Optional custom stop word list

    Returns:
        dict: Cleaned DataFrame as dictionary
    """
    # TODO: Define default stop words if none provided
    # TODO: Convert input dictionary to DataFrame
    # TODO: Split text into words, filter stop words, rejoin
    # TODO: Return cleaned DataFrame as dictionary
    pass
```
---
### 💡 Examples
**Example 1:** Default stop words
```python
data = {"text": ["the cat is on the mat", "a dog and a cat", "this is a test"]}
remove_stop_words(data, "text")
```
```
{'text': ['cat mat', 'dog cat', 'test']}
```
**Example 2:** Case-preserving removal
```python
data = {"text": ["I love the weather", "She is at the park"]}
remove_stop_words(data, "text")
```
```
{'text': ['I love weather', 'She park']}
```
**Example 3:** Custom stop words
```python
data = {"text": ["remove these words please"]}
remove_stop_words(data, "text", stop_words=["these", "please"])
```
```
{'text': ['remove words']}
```