Multiclass Cross-Entropy Loss Practice Problem
This data science coding problem helps you practice Logistic Regression, multiclass cross-entropy loss, and implementation skills. Read the problem statement, write your solution, and strengthen your understanding of Logistic Regression.
- Problem ID: 14
- Problem key: 14-multiclass-cross-entropy-loss
- URL: https://datacrack.app/solve/14-multiclass-cross-entropy-loss
- Difficulty: hard
- Topic: Logistic Regression
- Module: Introduction to Machine Learning
Problem Statement
# 🧩 Deriving the Multiclass Cross-Entropy Loss
---
### 🎯 Goal
- Understand how the **Multiclass Cross-Entropy Loss** emerges from **probability theory**.
- Learn how **Softmax** models probabilities over multiple classes.
- Derive the **loss expression** step by step from the **categorical likelihood function**.
---
### 🔍 Explanation of Symbols
| Symbol | Meaning | Shape / Type |
|:------:|:--------|:-------------|
| **$y_i$** | True class label for sample *i* | integer (0 to K–1) |
| **$\mathbf{y}_i$** | One-hot encoded true label vector | $(K,)$ |
| **$\hat{\mathbf{y}}_i$** | Softmax-predicted probabilities for all classes | $(K,)$ |
| **$L$** | Cross-entropy loss (how wrong the model is) | float |
| **$K$** | Number of classes | integer |
| **$N$** | Number of samples | integer |
---
### 🧮 Background & Intuition
Let’s recall how we modeled probabilities in **binary classification**.
For a single sample $i$, the probability of observing the label $y_i \in \{0,1\}$,
given the model’s predicted probability $\hat{y}_i$ for class 1, is:
$$
P(y_i \mid \hat{y}_i) = \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{\,(1 - y_i)}
$$
This works because:
* If $y_i = 1$, only $\hat{y}_i$ remains.
* If $y_i = 0$, only $(1 - \hat{y}_i)$ remains.
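For reference, taking the negative logarithm of this likelihood recovers the familiar per-sample **binary cross-entropy**:
$$
-\log P(y_i \mid \hat{y}_i) = -\left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right]
$$
The multiclass derivation below follows exactly the same pattern.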
---
Now, we extend this same idea to **multiclass classification**,
where each sample can belong to one of $K$ possible classes.
Instead of predicting a single probability, our model now produces a **probability distribution** over all classes — typically using the **Softmax function**.
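As a reminder, Softmax turns the model's raw scores (logits) $z_{i1}, \ldots, z_{iK}$ for sample $i$ into a valid distribution — in this problem you are given the resulting probabilities directly:
$$
\hat{y}_{ik} = \frac{e^{z_{ik}}}{\sum_{j=1}^{K} e^{z_{ij}}}, \qquad \hat{y}_{ik} > 0, \quad \sum_{k=1}^{K} \hat{y}_{ik} = 1
$$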
The probability of observing a single label $y_i$,
given the predicted distribution $\hat{\mathbf{y}}_i$, is written as:
$$ P(y_i \mid \hat{\mathbf{y}}_i) = \prod_{k=1}^{K} \hat{y}_{ik}^{\,y_{ik}} $$
Here:
* $\hat{y}_{ik}$ is the predicted probability for class $k$ (from Softmax).
* $y_{ik}$ is 1 for the correct class and 0 for all others.
Only the correct class term contributes to the product because all other $y_{ik}=0$.
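For example, if $K = 3$, the true class is 2 (one-hot $\mathbf{y}_i = (0, 0, 1)$) and the model predicts $\hat{\mathbf{y}}_i = (0.2, 0.2, 0.6)$, then
$$
P(y_i \mid \hat{\mathbf{y}}_i) = 0.2^{0} \cdot 0.2^{0} \cdot 0.6^{1} = 0.6
$$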
---
🧠 **In short:**
This expression is the **multiclass extension** of the binary likelihood —
a single, unified formula that works for any number of classes.
---
### 📥 Input / 📤 Output
**Input:**
* `y_true` (`list[list[int]]`):
A list of **one-hot encoded** true labels, where each inner list represents the class indicator for one sample.
Example: `[1, 0, 0]` means the sample belongs to class 0.
* `y_pred` (`list[list[float]]`):
A list of predicted **Softmax probabilities** for each class.
Each inner list corresponds to the model’s predicted probability distribution for one sample.
---
**Output:**
* `float`:
The **average Multiclass Cross-Entropy Loss** — a single numeric value representing how far the predicted distributions are from the true one-hot labels.
Lower values indicate better predictions (i.e., probabilities closer to 1 for the correct class).
---
### 🧭 Derivation Task
#### 🧩 1️⃣ Step 1 — Write the Likelihood for All Samples
Extend the single-sample likelihood to the entire dataset.
Assume each sample is **independent**, and express the overall likelihood as a **product** of all sample probabilities.
💡 *Hint:*
You should have one product over all samples ($i = 1 \ldots N$)
and another product over all classes ($k = 1 \ldots K$) inside it.
---
#### 🧩 2️⃣ Step 2 — Take the Logarithm to Simplify
Multiplying many probabilities leads to extremely small numbers.
Taking the logarithm turns the product into a **sum**,
which is easier to work with and far more numerically stable.
💡 *Hint:*
Use the logarithmic property:
$\log(ab) = \log a + \log b$
You’ll end up with a **double summation** over samples and classes.
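As a quick numerical check of this property (using natural logarithms):
$$
\log(0.9 \cdot 0.8 \cdot 0.6) = \log 0.9 + \log 0.8 + \log 0.6 \approx -0.839
$$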
---
#### 🧩 3️⃣ Step 3 — Turn Maximization into Minimization
We usually want to **maximize** the log-likelihood,
but in optimization frameworks, we minimize a **loss** instead.
So take the **negative** of your expression and **average** it across all samples.
💡 *Hint:*
Your final result should represent the **average negative log-probability**
of the true class under the model’s predicted probabilities.
---
### 💡 What to Do
Follow the mathematical derivation above to implement the function below.
Your function should:
1. Accept one-hot encoded `y_true` and Softmax-predicted `y_pred`.
2. Compute the negative average log-probability for the correct classes.
3. Use `np.clip()` to avoid taking `log(0)`.
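For instance, here is a minimal sketch of the clipping idea; the `1e-15` epsilon and the toy values are assumptions for illustration, not part of the problem statement:
```python
import numpy as np

# Hypothetical predicted probabilities, including the problematic endpoints 0 and 1.
probs = np.array([0.0, 0.5, 1.0])

eps = 1e-15                           # assumed epsilon; any tiny constant works
safe = np.clip(probs, eps, 1 - eps)   # keeps every value strictly inside (0, 1)

print(np.log(safe))                   # all finite; np.log(0.0) would have been -inf
```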
---
### 🧩 Starter Code
```python
import numpy as np


def multiclass_cross_entropy(y_true, y_pred):
    """
    Derive and implement the Multiclass Cross-Entropy Loss function
    starting from the categorical likelihood.

    Args:
        y_true (list[list[int]]): One-hot encoded true labels.
        y_pred (list[list[float]]): Predicted Softmax probabilities.

    Returns:
        float: Cross-entropy loss value.
    """
    y_true = np.array(y_true, dtype=float)
    y_pred = np.array(y_pred, dtype=float)
    # TODO: implement your derived expression here
    pass
```
---
### 💡 Example
```python
y_true = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1]
]
y_pred = [
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6]
]
print(multiclass_cross_entropy(y_true, y_pred))
```
**Expected Output:**
```
0.2797765635793423
```
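As a sanity check, and consistent with the hint in Step 3, this value equals the average negative log-probability assigned to each sample's true class:
$$
-\tfrac{1}{3}\left(\log 0.9 + \log 0.8 + \log 0.6\right) \approx 0.2798
$$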