Gradients in Multiclass Logistic Regression Practice Problem
This data science coding problem helps you practice gradients in multiclass logistic regression and sharpen your implementation skills. Read the problem statement, write your solution, and strengthen your understanding of Logistic Regression.
- Problem ID: 124
- Problem key: 124-gradients-in-multiclass-logistic-regression
- URL: https://datacrack.app/solve/124-gradients-in-multiclass-logistic-regression
- Difficulty: hard
- Topic: Logistic Regression
- Module: Introduction to Machine Learning
Problem Statement
# 🧩 Gradients in Multiclass Logistic Regression
---
### 🎯 Goal
* Extend the gradient derivation from binary logistic regression to the **multiclass** setting.
* Compute the gradients of the **Multiclass Cross-Entropy Loss** with respect to the weight matrix $W$ and bias vector $b$.
* These gradients will be used in the next exercise, **Gradient Descent for Multiclass Logistic Regression**, to update the model parameters.
---
### 💻 Task
You are given input features $X$, one-hot encoded true labels $Y$, and predicted Softmax probabilities $\hat{Y}$.
Steps:
1. Compute the **error matrix** $E = \hat{Y} - Y$.
2. Derive and implement the gradient of the loss with respect to $W$: combine $X$ (transposed) with $E$.
3. Derive and implement the gradient of the loss with respect to $b$: average the errors across all samples for each class.
4. Return both gradients as `(dW, db)`; the expected shapes are sketched just below this list.
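As a sanity check on the steps above, the shapes line up as follows:

$$
\underbrace{X^{\top}}_{(d,\,N)}\;\underbrace{E}_{(N,\,K)} \in \mathbb{R}^{d \times K},
\qquad
\text{column-wise average of } E \in \mathbb{R}^{K}
$$

so combining $X^{\top}$ with $E$ matches the shape of $W$ (and hence of `dW`), while the averaged errors match the shape of $b$ (and hence of `db`).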
---
### 📊 Explanation of Symbols
| Symbol | Meaning | Shape / Type |
| :------------: | :--------------------------------------------- | :-------------- |
| **$X$** | Input feature matrix | $(N, d)$ |
| **$Y$** | One-hot encoded true labels | $(N, K)$ |
| **$\hat{Y}$** | Predicted Softmax probabilities (model output) | $(N, K)$ |
| **$W$** | Weight matrix | $(d, K)$ |
| **$b$** | Bias vector | $(K,)$ |
| **$N$** | Number of samples | integer |
| **$K$** | Number of classes | integer |
| **$d$** | Number of features | integer |
| **$L$** | Multiclass Cross-Entropy Loss | float |
---
### 📥 Input / 📤 Output
* **Input:**
  * `X`: list or 2D array; input features with shape $(N, d)$.
  * `y_true`: list or 2D array; one-hot encoded true labels with shape $(N, K)$.
  * `y_pred`: list or 2D array; predicted Softmax probabilities with shape $(N, K)$.
* **Output:**
  * Tuple: `(dW, db)`
    * `dW`: list (2D); gradient of the loss w.r.t. the weight matrix, shape $(d, K)$.
    * `db`: list (1D); gradient of the loss w.r.t. the bias vector, shape $(K,)$.
---
### 🧩 Starter Code
```python
import numpy as np


def compute_multiclass_gradients(X, y_true, y_pred):
    """
    Compute the gradients of the Multiclass Cross-Entropy Loss
    with respect to weight matrix W and bias vector b.

    Args:
        X (list or np.ndarray): input features, shape (N, d)
        y_true (list or np.ndarray): one-hot true labels, shape (N, K)
        y_pred (list or np.ndarray): predicted Softmax probabilities, shape (N, K)

    Returns:
        tuple: (dW, db)
            dW (list): gradient w.r.t. weight matrix, shape (d, K)
            db (list): gradient w.r.t. bias vector, shape (K,)
    """
    X = np.array(X, dtype=np.float64)
    y_true = np.array(y_true, dtype=np.float64)
    y_pred = np.array(y_pred, dtype=np.float64)
    n = X.shape[0]

    # TODO: Implement your derived gradient equations here
    pass
```
---
### 💡 Example
```python
X = [[1, 2], [2, 3], [3, 4]]
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]
dW, db = compute_multiclass_gradients(X, y_true, y_pred)
print("dW:", dW)
print("db:", db)
```
**Expected Output:**
```
dW: [[0.16666666666666666, 0.03333333333333335, -0.19999999999999998],
[0.16666666666666666, 0.0666666666666667, -0.2333333333333333]]
db: [0.0, 0.03333333333333335, -0.033333333333333326]
```
---
### 🧮 Background & Intuition
In **multiclass logistic regression** (also called *softmax regression*), the model predicts class probabilities using:
$$
\hat{Y} = \text{Softmax}(XW + b)
$$
where $\text{Softmax}$ converts raw logits into a probability distribution over $K$ classes for each sample.
During training, we minimize the **Multiclass Cross-Entropy Loss** you derived previously:
$$
L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\log(\hat{y}_{ik})
$$
To minimize this loss using Gradient Descent, we must compute $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial b}$.
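For concreteness, here is a minimal NumPy sketch of the forward pass above. The `W` and `b` values are arbitrary placeholders chosen only for illustration; they are not part of the exercise:

```python
import numpy as np

# Toy forward pass: logits -> Softmax probabilities -> cross-entropy loss.
# W and b are arbitrary illustrative values, not fitted parameters.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])        # (N, d) = (3, 2)
W = np.array([[0.1, -0.2, 0.0],
              [0.3,  0.1, -0.1]])                          # (d, K) = (2, 3)
b = np.array([0.0, 0.1, -0.1])                             # (K,)

Z = X @ W + b                                              # logits, shape (N, K)
Z = Z - Z.max(axis=1, keepdims=True)                       # stabilize before exponentiating
Y_hat = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)   # Softmax: each row sums to 1

Y = np.eye(3)                                              # one-hot labels, shape (N, K)
L = -(Y * np.log(Y_hat)).sum() / X.shape[0]                # multiclass cross-entropy loss
print(Y_hat.round(3), L.round(4), sep="\n")
```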
---
### 🧠 Derivation Hints
🧩 **1️⃣ The Error Matrix**
Just as in the binary case, the key quantity is the **error** between predictions and true labels:
$$
E = \hat{Y} - Y
$$
where $E$ has shape $(N, K)$: each row represents the error across all classes for one sample.
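As a concrete illustration with the data from the 💡 Example section:

```python
import numpy as np

# Error matrix for the example data above; shape (N, K) = (3, 3).
Y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.1, 0.7]])
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)

E = Y_hat - Y
print(E)
# Up to floating-point rounding:
# [[-0.3  0.2  0.1]
#  [ 0.1 -0.2  0.1]
#  [ 0.2  0.1 -0.3]]
```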
---
🧩 **2️⃣ Gradient with respect to $W$**
> 💡 **Hint:** Think about how each feature in $X$ contributes to the error across all classes.
> The gradient $\frac{\partial L}{\partial W}$ has the same shape as $W$, i.e., $(d, K)$.
> You need to combine $X$ (transposed) with the error matrix $E$; a worked chain-rule step is sketched below.
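If you want to check your derivation, one route is to differentiate through the logits $Z = XW + b$: combining the Softmax with the cross-entropy loss gives a particularly clean gradient with respect to $Z$, and the chain rule then carries it over to $W$ (this also shows where the $\frac{1}{N}$ factor comes from):

$$
\frac{\partial L}{\partial Z} = \frac{1}{N}\left(\hat{Y} - Y\right) = \frac{E}{N},
\qquad
\frac{\partial L}{\partial W} = X^{\top}\,\frac{\partial L}{\partial Z} = \frac{1}{N}\,X^{\top} E
$$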
---
🧩 **3️⃣ Gradient with respect to $b$**
> 💡 **Hint:** The bias gradient doesn't involve the features; just average the errors across all samples for each class.
> The result should be a vector of shape $(K,)$; the corresponding formula is written out below.
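Since every logit $z_{ik}$ receives the bias $b_k$ with coefficient 1, the bias gradient is simply $\frac{\partial L}{\partial Z}$ summed over the samples, i.e. the per-class average of the errors:

$$
\frac{\partial L}{\partial b} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_{i} - y_{i}\right) = \frac{1}{N}\sum_{i=1}^{N} E_{i}
$$

where $E_{i}$ denotes the $i$-th row of $E$.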
---