
Loss Function, Gradient Descent, and Backpropagation (Basics)

By understanding these three concepts, we can better understand how a Transformer adjusts its weights.

Loss Function

It’s the first step: we need to measure the mistake before we can fix it.

What it does:
The loss function measures how wrong a prediction is compared to the actual value. It gives us a number that shows the size of the mistake.

How it works:

  • You have an actual value (the real answer).
  • You have a predicted value (the model’s guess).
  • The loss function calculates the difference between these two values.
  • A smaller difference means a smaller loss number; a bigger difference means a bigger loss number.

Example:
Suppose the actual price of a house is $300,000, and the model predicts $280,000.

  • The difference is $300,000 - $280,000 = $20,000.
  • The loss could simply be this difference: $20,000.
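
In code, a loss function is just a small function of the actual and predicted values. Here is a minimal Python sketch using the numbers from the example above (squared error is a common variant in practice, not part of the original example):

```python
def absolute_error(actual, predicted):
    # Loss as the plain size of the mistake, as in the example above.
    return abs(actual - predicted)

def squared_error(actual, predicted):
    # A common alternative: squaring punishes big mistakes much more.
    return (actual - predicted) ** 2

print(absolute_error(300_000, 280_000))  # 20000
print(squared_error(300_000, 280_000))   # 400000000
```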

Why we need it:
We use the loss to see how good or bad the prediction is. Without it, we wouldn’t know if the model is right or wrong, or by how much.

Gradient Descent

What it does:
Gradient descent figures out how to adjust the model’s weights (numbers that control the prediction) to make the loss smaller.
It decides the direction in which each weight should move for the model to improve.

How it works:

  • The model uses weights to make predictions.
  • The loss tells us how wrong the prediction is.
  • Gradient descent looks at how each weight affects the loss.
  • It adjusts each weight a small step in the direction that lowers the loss (the size of that step is called the learning rate).
  • We repeat this process many times until the loss gets as small as possible.
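
As a minimal sketch, this whole loop fits in a few lines of Python. The bowl-shaped loss, starting weight, learning rate, and step count below are all made-up illustrative choices:

```python
def loss(w):
    # A made-up bowl-shaped loss: it is smallest when w = 3.
    return (w - 3) ** 2

w = 0.0                # starting weight (an arbitrary guess)
learning_rate = 0.1    # how big each adjustment step is

for step in range(100):
    gradient = 2 * (w - 3)           # slope of the loss at the current weight
    w -= learning_rate * gradient    # move a little bit against the slope

print(round(w, 4))  # ~3.0, the weight where the loss is smallest
```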

Example:
Let’s say our model predicts the house price using a single weight: predicted price = weight × square feet. We start with weight = 280, so for a 1,000-square-foot house the prediction is 280 × 1000 = $280,000 and the loss is $20,000.

  • Gradient descent checks: “If I change the weight from 280 to 281, does the loss get smaller?”
  • New prediction = 281 × 1000 = $281,000; loss = $19,000.
  • The loss went down, so we keep adjusting the weight upward a little at a time (e.g., to 282, 283) until the prediction is closer to $300,000.
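
That probing strategy can be written out directly in Python. A minimal sketch, matching the example above (the one-unit steps are illustrative; real gradient descent uses the slope of the loss instead of trying weights one by one):

```python
square_feet = 1_000
actual_price = 300_000

def loss(weight):
    # Loss as the plain difference, matching the example above.
    return abs(actual_price - weight * square_feet)

weight = 280
# Keep nudging the weight upward while that makes the loss smaller,
# exactly like the "280 -> 281" check above.
while loss(weight + 1) < loss(weight):
    weight += 1

print(weight)        # 300
print(loss(weight))  # 0, i.e. the prediction reached $300,000
```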

Why we need it:
It’s the step that improves the model by reducing the loss, using the mistake we measured earlier.

Backpropagation

What it does:
Backpropagation figures out how much each weight in the model is responsible for the loss, so we can adjust them using gradient descent.

How it works:

  • A model can have many weights, not just one, spread across layers (like steps in a calculation).
  • First, we send the input through all layers to get a prediction (the forward pass).
  • Then, we calculate the loss.
  • Next, we go backwards: starting from the loss, we check how each weight in each layer contributed to it.
  • We calculate how to adjust each weight to lower the loss.

Example:
Let’s make our house price model a bit more complex with two weights:

  • Weight 1 (W1) = 200, Weight 2 (W2) = 1.4.
  • Step 1: Input (1000 square feet) × W1 = 200 × 1000 = 200,000.
  • Step 2: 200,000 × W2 = 200,000 × 1.4 = $280,000 (prediction).
  • Actual price = $300,000, so loss = $20,000.
  • Backpropagation asks:
    • How much did W2 (1.4) affect the loss? If W2 were 1.5, prediction = 200,000 × 1.5 = $300,000, loss = $0. So, W2 needs to increase.
    • How much did W1 (200) affect the loss? It made the 200,000 that W2 used. If W1 were 210, then 210 × 1000 = 210,000, and 210,000 × 1.4 = $294,000, loss = $6,000. So, W1 also needs to adjust.
  • Backpropagation calculates these effects backwards, from the loss to W2, then to W1.
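
Here is a minimal Python sketch of that backward pass, using the chain rule (the squared loss is an illustrative choice; the example above used raw differences):

```python
# Two-step model: hidden = input * W1, prediction = hidden * W2
x = 1_000              # square feet
w1, w2 = 200.0, 1.4    # the two weights from the example
actual = 300_000

# Forward pass: send the input through both steps.
hidden = x * w1          # 200,000
predicted = hidden * w2  # 280,000
loss = (actual - predicted) ** 2

# Backward pass: start from the loss and work back through each step.
d_loss_d_pred = -2 * (actual - predicted)  # how the loss reacts to the prediction
d_loss_d_w2 = d_loss_d_pred * hidden       # W2's share: prediction = hidden * w2
d_loss_d_hidden = d_loss_d_pred * w2       # pass the blame back through W2
d_loss_d_w1 = d_loss_d_hidden * x          # W1's share: hidden = x * w1

print(d_loss_d_w2)  # negative, so gradient descent should increase W2
print(d_loss_d_w1)  # negative, so gradient descent should increase W1 too
```

With these two numbers, gradient descent can update W1 and W2 at the same time, each in proportion to its own share of the blame.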

Why we need it:
With many weights, we need to know which ones to change and by how much. Backpropagation tells us this so gradient descent can work.