Multiclass Cross-Entropy Loss Practice Problem
This data science coding problem helps you practice Logistic Regression, multiclass cross-entropy loss, and implementation skills. Read the problem statement, write your solution, and strengthen your understanding of Logistic Regression.
- Problem ID: 14
- Problem key: 14-multiclass-cross-entropy-loss
- URL: https://datacrack.app/solve/14-multiclass-cross-entropy-loss
- Difficulty: hard
- Topic: Logistic Regression
- Module: Introduction to Machine Learning
Problem Statement
# 🧩 Deriving the Multiclass Cross-Entropy Loss
---
### 🎯 Goal
- Understand how the **Multiclass Cross-Entropy Loss** emerges from **probability theory**.
- Learn how **Softmax** models probabilities over multiple classes.
- Derive the **loss expression** step by step from the **categorical likelihood function**.
---
### 🔍 Explanation of Symbols
| Symbol | Meaning | Shape / Type |
|:------:|:--------|:-------------|
| **$y_i$** | True class label for sample *i* | integer (0 to K–1) |
| **$\mathbf{y}_i$** | One-hot encoded true label vector | $(K,)$ |
| **$\hat{\mathbf{y}}_i$** | Softmax-predicted probabilities for all classes | $(K,)$ |
| **$L$** | Cross-entropy loss (how wrong the model is) | float |
| **$K$** | Number of classes | integer |
| **$N$** | Number of samples | integer |
---
### 🧮 Background & Intuition
Let’s recall how we modeled probabilities in **binary classification**.
For a single sample $i$, the probability of observing the label $y_i \in \{0,1\}$,
given the model’s predicted probability $\hat{y}_i$ for class 1, is:
$$
P(y_i \mid \hat{y}_i) = \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{\,(1 - y_i)}
$$
This works because:
* If $y_i = 1$, only $\hat{y}_i$ remains.
* If $y_i = 0$, only $(1 - \hat{y}_i)$ remains.
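For reference, taking the negative logarithm of this likelihood recovers the familiar per-sample **binary cross-entropy**:
$$
-\log P(y_i \mid \hat{y}_i) = -\left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right]
$$
The multiclass derivation below follows exactly the same pattern.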
---
Now, we extend this same idea to **multiclass classification**,
where each sample can belong to one of $K$ possible classes.
Instead of predicting a single probability, our model now produces a **probability distribution** over all classes — typically using the **Softmax function**.
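As a reminder, Softmax turns the model's raw scores (logits) $z_{i1}, \ldots, z_{iK}$ for sample $i$ into a valid distribution — in this problem you are given the resulting probabilities directly:
$$
\hat{y}_{ik} = \frac{e^{z_{ik}}}{\sum_{j=1}^{K} e^{z_{ij}}}, \qquad \hat{y}_{ik} > 0, \quad \sum_{k=1}^{K} \hat{y}_{ik} = 1
$$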
The probability of observing a single label $y_i$,
given the predicted distribution $\hat{\mathbf{y}}_i$, is written as:
$$ P(y_i \mid \hat{\mathbf{y}}_i) = \prod_{k=1}^{K} \hat{y}_{ik}^{\,y_{ik}} $$
Here:
* $\hat{y}_{ik}$ is the predicted probability for class $k$ (from Softmax).
* $y_{ik}$ is 1 for the correct class and 0 for all others.
Only the correct class term contributes to the product because all other $y_{ik}=0$.
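For example, if $K = 3$, the true class is 2 (one-hot $\mathbf{y}_i = (0, 0, 1)$) and the model predicts $\hat{\mathbf{y}}_i = (0.2, 0.2, 0.6)$, then
$$
P(y_i \mid \hat{\mathbf{y}}_i) = 0.2^{0} \cdot 0.2^{0} \cdot 0.6^{1} = 0.6
$$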
---
🧠 **In short:**
This expression is the **multiclass extension** of the binary likelihood —
a single, unified formula that works for any number of classes.
---
### 📥 Input / 📤 Output
**Input:**
* `y_true` (`list[list[int]]`):
A list of **one-hot encoded** true labels, where each inner list represents the class indicator for one sample.
Example: `[1, 0, 0]` means the sample belongs to class 0.
* `y_pred` (`list[list[float]]`):
A list of predicted **Softmax probabilities** for each class.
Each inner list corresponds to the model’s predicted probability distribution for one sample.
---
**Output:**
* `float`:
The **average Multiclass Cross-Entropy Loss** — a single numeric value representing how far the predicted distributions are from the true one-hot labels.
Lower values indicate better predictions (i.e., probabilities closer to 1 for the correct class).
---
### 🧭 Derivation Task
#### 🧩 1️⃣ Step 1 — Write the Likelihood for All Samples
Extend the single-sample likelihood to the entire dataset.
Assume each sample is **independent**, and express the overall likelihood as a **product** of all sample probabilities.
💡 *Hint:*
You should have one product over all samples ($i = 1 \ldots N$)
and another product over all classes ($k = 1 \ldots K$) inside it.
---
#### 🧩 2️⃣ Step 2 — Take the Logarithm to Simplify
Multiplying many probabilities leads to extremely small numbers.
Taking the logarithm turns the product into a **sum**,
which is easier to work with and far more numerically stable.
💡 *Hint:*
Use the logarithmic property:
$\log(ab) = \log a + \log b$
You’ll end up with a **double summation** over samples and classes.
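As a quick numerical check of this property (using natural logarithms):
$$
\log(0.9 \cdot 0.8 \cdot 0.6) = \log 0.9 + \log 0.8 + \log 0.6 \approx -0.839
$$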
---
#### 🧩 3️⃣ Step 3 — Turn Maximization into Minimization
We usually want to **maximize** the log-likelihood,
but in optimization frameworks, we minimize a **loss** instead.
So take the **negative** of your expression and **average** it across all samples.
💡 *Hint:*
Your final result should represent the **average negative log-probability**
of the true class under the model’s predicted probabilities.
---
### 💡 What to Do
Follow the mathematical derivation above to implement the function below.
Your function should:
1. Accept one-hot encoded `y_true` and Softmax-predicted `y_pred`.
2. Compute the negative average log-probability for the correct classes.
3. Use `np.clip()` to avoid taking `log(0)`.
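For instance, here is a minimal sketch of the clipping idea; the `1e-15` epsilon and the toy values are assumptions for illustration, not part of the problem statement:
```python
import numpy as np

# Hypothetical predicted probabilities, including the problematic endpoints 0 and 1.
probs = np.array([0.0, 0.5, 1.0])

eps = 1e-15                           # assumed epsilon; any tiny constant works
safe = np.clip(probs, eps, 1 - eps)   # keeps every value strictly inside (0, 1)

print(np.log(safe))                   # all finite; np.log(0.0) would have been -inf
```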
---
### 🧩 Starter Code
```python
import numpy as np


def multiclass_cross_entropy(y_true, y_pred):
    """
    Derive and implement the Multiclass Cross-Entropy Loss function
    starting from the categorical likelihood.

    Args:
        y_true (list[list[int]]): One-hot encoded true labels.
        y_pred (list[list[float]]): Predicted Softmax probabilities.

    Returns:
        float: Cross-entropy loss value.
    """
    y_true = np.array(y_true, dtype=float)
    y_pred = np.array(y_pred, dtype=float)
    # TODO: implement your derived expression here
    pass
```
---
### 💡 Example
```python
y_true = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1]
]
y_pred = [
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6]
]
print(multiclass_cross_entropy(y_true, y_pred))
```
**Expected Output:**
```
0.2797765635793423
```
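As a sanity check, and consistent with the hint in Step 3, this value equals the average negative log-probability assigned to each sample's true class:
$$
-\tfrac{1}{3}\left(\log 0.9 + \log 0.8 + \log 0.6\right) \approx 0.2798
$$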