Gradients in Multiclass Logistic Regression Practice Problem
This data science coding problem helps you practice gradients in multiclass logistic regression and sharpen your implementation skills. Read the problem statement, write your solution, and strengthen your understanding of Logistic Regression.
- Problem ID: 124
- Problem key: 124-gradients-in-multiclass-logistic-regression
- URL: https://datacrack.app/solve/124-gradients-in-multiclass-logistic-regression
- Difficulty: hard
- Topic: Logistic Regression
- Module: Introduction to Machine Learning
Problem Statement
# 🧩 Gradients in Multiclass Logistic Regression
---
### 🎯 Goal
* Extend the gradient derivation from binary logistic regression to the **multiclass** setting.
* Compute the gradients of the **Multiclass Cross-Entropy Loss** with respect to the weight matrix $W$ and bias vector $b$.
* These gradients will be used in the next exercise, **Gradient Descent for Multiclass Logistic Regression**, to update the model parameters.
---
### 💻 Task
You are given input features $X$, one-hot encoded true labels $Y$, and predicted Softmax probabilities $\hat{Y}$.
Steps:
1. Compute the **error matrix** $E = \hat{Y} - Y$.
2. Derive and implement the gradient of the loss with respect to $W$: combine $X$ (transposed) with $E$.
3. Derive and implement the gradient of the loss with respect to $b$: average the errors across all samples for each class.
4. Return both gradients as `(dW, db)`; the expected shapes are sketched just below this list.
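As a sanity check on the steps above, the shapes line up as follows:

$$
\underbrace{X^{\top}}_{(d,\,N)}\;\underbrace{E}_{(N,\,K)} \in \mathbb{R}^{d \times K},
\qquad
\text{column-wise average of } E \in \mathbb{R}^{K}
$$

so combining $X^{\top}$ with $E$ matches the shape of $W$ (and hence of `dW`), while the averaged errors match the shape of $b$ (and hence of `db`).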
---
### 📊 Explanation of Symbols
| Symbol | Meaning | Shape / Type |
| :------------: | :--------------------------------------------- | :-------------- |
| **$X$** | Input feature matrix | $(N, d)$ |
| **$Y$** | One-hot encoded true labels | $(N, K)$ |
| **$\hat{Y}$** | Predicted Softmax probabilities (model output) | $(N, K)$ |
| **$W$** | Weight matrix | $(d, K)$ |
| **$b$** | Bias vector | $(K,)$ |
| **$N$** | Number of samples | integer |
| **$K$** | Number of classes | integer |
| **$d$** | Number of features | integer |
| **$L$** | Multiclass Cross-Entropy Loss | float |
---
### 📥 Input / 📤 Output
* **Input:**
  * `X`: list or 2D array; input features with shape $(N, d)$.
  * `y_true`: list or 2D array; one-hot encoded true labels with shape $(N, K)$.
  * `y_pred`: list or 2D array; predicted Softmax probabilities with shape $(N, K)$.
* **Output:**
  * Tuple: `(dW, db)`
    * `dW`: list (2D); gradient of the loss w.r.t. the weight matrix, shape $(d, K)$.
    * `db`: list (1D); gradient of the loss w.r.t. the bias vector, shape $(K,)$.
---
### 🧩 Starter Code
```python
import numpy as np


def compute_multiclass_gradients(X, y_true, y_pred):
    """
    Compute the gradients of the Multiclass Cross-Entropy Loss
    with respect to weight matrix W and bias vector b.

    Args:
        X (list or np.ndarray): input features, shape (N, d)
        y_true (list or np.ndarray): one-hot true labels, shape (N, K)
        y_pred (list or np.ndarray): predicted Softmax probabilities, shape (N, K)

    Returns:
        tuple: (dW, db)
            dW (list): gradient w.r.t. weight matrix, shape (d, K)
            db (list): gradient w.r.t. bias vector, shape (K,)
    """
    X = np.array(X, dtype=np.float64)
    y_true = np.array(y_true, dtype=np.float64)
    y_pred = np.array(y_pred, dtype=np.float64)
    n = X.shape[0]

    # TODO: Implement your derived gradient equations here
    pass
```
---
### 💡 Example
```python
X = [[1, 2], [2, 3], [3, 4]]
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]
dW, db = compute_multiclass_gradients(X, y_true, y_pred)
print("dW:", dW)
print("db:", db)
```
**Expected Output:**
```
dW: [[0.16666666666666666, 0.03333333333333335, -0.19999999999999998],
[0.16666666666666666, 0.0666666666666667, -0.2333333333333333]]
db: [0.0, 0.03333333333333335, -0.033333333333333326]
```
---
### 🧮 Background & Intuition
In **multiclass logistic regression** (also called *softmax regression*), the model predicts class probabilities using:
$$
\hat{Y} = \text{Softmax}(XW + b)
$$
where $\text{Softmax}$ converts raw logits into a probability distribution over $K$ classes for each sample.
During training, we minimize the **Multiclass Cross-Entropy Loss** you derived previously:
$$
L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\log(\hat{y}_{ik})
$$
To minimize this loss using Gradient Descent, we must compute $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial b}$.
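For concreteness, here is a minimal NumPy sketch of the forward pass above. The `W` and `b` values are arbitrary placeholders chosen only for illustration; they are not part of the exercise:

```python
import numpy as np

# Toy forward pass: logits -> Softmax probabilities -> cross-entropy loss.
# W and b are arbitrary illustrative values, not fitted parameters.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])        # (N, d) = (3, 2)
W = np.array([[0.1, -0.2, 0.0],
              [0.3,  0.1, -0.1]])                          # (d, K) = (2, 3)
b = np.array([0.0, 0.1, -0.1])                             # (K,)

Z = X @ W + b                                              # logits, shape (N, K)
Z = Z - Z.max(axis=1, keepdims=True)                       # stabilize before exponentiating
Y_hat = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)   # Softmax: each row sums to 1

Y = np.eye(3)                                              # one-hot labels, shape (N, K)
L = -(Y * np.log(Y_hat)).sum() / X.shape[0]                # multiclass cross-entropy loss
print(Y_hat.round(3), L.round(4), sep="\n")
```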
---
### 🧠 Derivation Hints
🧩 **1️⃣ The Error Matrix**
Just as in the binary case, the key quantity is the **error** between predictions and true labels:
$$
E = \hat{Y} - Y
$$
where $E$ has shape $(N, K)$: each row represents the error across all classes for one sample.
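As a concrete illustration with the data from the 💡 Example section:

```python
import numpy as np

# Error matrix for the example data above; shape (N, K) = (3, 3).
Y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.1, 0.7]])
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)

E = Y_hat - Y
print(E)
# Up to floating-point rounding:
# [[-0.3  0.2  0.1]
#  [ 0.1 -0.2  0.1]
#  [ 0.2  0.1 -0.3]]
```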
---
🧩 **2️⃣ Gradient with respect to $W$**
> 💡 **Hint:** Think about how each feature in $X$ contributes to the error across all classes.
> The gradient $\frac{\partial L}{\partial W}$ has the same shape as $W$, i.e., $(d, K)$.
> You need to combine $X$ (transposed) with the error matrix $E$; a worked chain-rule step is sketched below.
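If you want to check your derivation, one route is to differentiate through the logits $Z = XW + b$: combining the Softmax with the cross-entropy loss gives a particularly clean gradient with respect to $Z$, and the chain rule then carries it over to $W$ (this also shows where the $\frac{1}{N}$ factor comes from):

$$
\frac{\partial L}{\partial Z} = \frac{1}{N}\left(\hat{Y} - Y\right) = \frac{E}{N},
\qquad
\frac{\partial L}{\partial W} = X^{\top}\,\frac{\partial L}{\partial Z} = \frac{1}{N}\,X^{\top} E
$$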
---
🧩 **3️⃣ Gradient with respect to $b$**
> 💡 **Hint:** The bias gradient doesn't involve the features; just average the errors across all samples for each class.
> The result should be a vector of shape $(K,)$; the corresponding formula is written out below.
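Since every logit $z_{ik}$ receives the bias $b_k$ with coefficient 1, the bias gradient is simply $\frac{\partial L}{\partial Z}$ summed over the samples, i.e. the per-class average of the errors:

$$
\frac{\partial L}{\partial b} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_{i} - y_{i}\right) = \frac{1}{N}\sum_{i=1}^{N} E_{i}
$$

where $E_{i}$ denotes the $i$-th row of $E$.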
---