Neural Network Fundamentals
Part 5: Training - Learning from Mistakes
The Brain's Decision Committee - Chapter 5
The Story So Far...
In Part 4, our committee member attempted their first classification task. They looked at images of vertical and horizontal lines and tried to identify them. The results were... not great. With random weights, they achieved about 50% accuracy - no better than flipping a coin.
But here's the beautiful thing about neural networks: they can learn from their mistakes.
In this notebook, we'll teach our Perceptron how to improve. We'll show it examples, tell it when it's wrong, and let it gradually adjust its weights until it becomes an expert line detector.
This is training - the heart of machine learning.
What You'll Learn in Part 5
This is one of the most important notebooks in the series. By the end, you will understand:
- Loss Functions - How to measure "how wrong" a prediction is
- Why We Square Errors - The mathematical reason behind MSE
- Binary Cross-Entropy - The preferred loss for classification (and why!)
- Gradient Descent - The algorithm that finds better weights
- Learning Rate - How fast to adjust (and what happens if it's wrong)
- The Gradient - The direction of steepest improvement
- Backpropagation - How errors flow backward through the network
- The Training Loop - Putting it all together
- Watch It Learn - See the Perceptron go from 50% to 95%+ accuracy!
Prerequisites
Make sure you've completed:
- Parts 0-1: Matrices (neural_network_fundamentals.ipynb)
- Part 2: Single Neuron (part_2_single_neuron.ipynb)
- Part 3: Activation Functions (part_3_activation_functions.ipynb)
- Part 4: The Perceptron (part_4_perceptron.ipynb)
Setup: Import Dependencies and Recreate Our Tools
Let's bring in everything we built in previous notebooks.
```python
# =============================================================================
# PART 5: TRAINING - SETUP AND IMPORTS
# =============================================================================

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, clear_output

# Try to import ipywidgets for interactive features
try:
    import ipywidgets as widgets
    WIDGETS_AVAILABLE = True
except ImportError:
    WIDGETS_AVAILABLE = False
    print("Note: ipywidgets not installed. Interactive features will be limited.")

# Set up matplotlib style
style_options = ['seaborn-v0_8-whitegrid', 'seaborn-whitegrid', 'ggplot', 'default']
for style in style_options:
    try:
        plt.style.use(style)
        break
    except OSError:
        continue

plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 12
np.random.seed(42)

print("Setup complete!")
print("="*60)
```
```python
# =============================================================================
# RECREATE OUR TOOLS FROM PREVIOUS NOTEBOOKS
# =============================================================================

# -----------------------------------------------------------------------------
# Our canonical line images (from Part 1)
# -----------------------------------------------------------------------------
vertical_line = np.array([[0, 1, 0],
                          [0, 1, 0],
                          [0, 1, 0]])
horizontal_line = np.array([[0, 0, 0],
                            [1, 1, 1],
                            [0, 0, 0]])
vertical_flat = vertical_line.flatten()
horizontal_flat = horizontal_line.flatten()

# -----------------------------------------------------------------------------
# Dataset generator (from Part 4)
# -----------------------------------------------------------------------------
def generate_line_dataset(n_samples=100, noise_level=0.0, seed=None):
    """Generate vertical (label=1) and horizontal (label=0) line images."""
    if seed is not None:
        np.random.seed(seed)

    X, y = [], []
    for i in range(n_samples):
        image = np.zeros((3, 3))
        if i < n_samples // 2:
            # Vertical lines
            col = np.random.randint(0, 3)
            image[:, col] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(1)
        else:
            # Horizontal lines
            row = np.random.randint(0, 3)
            image[row, :] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(0)

    X, y = np.array(X), np.array(y)
    shuffle_idx = np.random.permutation(n_samples)
    return X[shuffle_idx], y[shuffle_idx]

# -----------------------------------------------------------------------------
# Sigmoid activation function (from Part 3)
# -----------------------------------------------------------------------------
def sigmoid(z):
    """Sigmoid activation: maps any value to range (0, 1)."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# -----------------------------------------------------------------------------
# Basic Perceptron class (from Part 4) - We'll add training later!
# -----------------------------------------------------------------------------
class Perceptron:
    """A single-layer Perceptron for binary classification."""

    def __init__(self, n_inputs):
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0
        self.n_inputs = n_inputs

    def forward(self, x):
        """Compute the forward pass."""
        x = np.array(x).flatten()
        z = np.dot(self.weights, x) + self.bias
        return sigmoid(z)

    def predict(self, x):
        """Make a binary prediction (0 or 1)."""
        return 1 if self.forward(x) >= 0.5 else 0

# Generate our training dataset
X_train, y_train = generate_line_dataset(n_samples=100, noise_level=0.0, seed=42)

print("Tools recreated from previous notebooks!")
print(f" - Vertical/Horizontal line templates")
print(f" - Dataset generator")
print(f" - Sigmoid activation")
print(f" - Basic Perceptron class")
print(f"\nTraining dataset: {len(X_train)} samples")
print(f" - {sum(y_train)} vertical lines (label=1)")
print(f" - {len(y_train) - sum(y_train)} horizontal lines (label=0)")
```
5.1 The Error: How Wrong Are We?
Before we can improve, we need to measure how wrong our predictions are. This is the foundation of learning.
The Basic Idea
When our Perceptron makes a prediction, we compare it to the actual answer:
Error = Actual Value - Predicted Value
= y - ŷ
A Concrete Example
Let's say we show the Perceptron a vertical line (actual label y = 1):
| Scenario | Prediction (ŷ) | Error (y - ŷ) | Interpretation |
|---|---|---|---|
| Perfect | 1.0 | 1.0 - 1.0 = 0.0 | No error! |
| Good | 0.9 | 1.0 - 0.9 = 0.1 | Small error |
| Bad | 0.3 | 1.0 - 0.3 = 0.7 | Big error! |
| Terrible | 0.0 | 1.0 - 0.0 = 1.0 | Maximum error |
Committee Analogy
"The committee member votes on a case. After the vote, the supervisor reveals the correct answer. The difference between their vote and the correct answer is their ERROR - and they need to learn from it."
Why Error Matters
The error tells us two things:
- How much to adjust (larger error = bigger adjustment needed)
- Which direction to adjust (positive error = increase output, negative = decrease)
Let's see this with real numbers:
```python
# =============================================================================
# CALCULATING ERROR: Step by Step
# =============================================================================

# Create an untrained Perceptron
perceptron = Perceptron(n_inputs=9)

print("="*70)
print("CALCULATING ERROR: Step by Step")
print("="*70)

# Test on a vertical line (actual label = 1)
print("\n" + "-"*70)
print("Example 1: Testing on a VERTICAL line")
print("-"*70)

y_actual = 1  # The true label (it IS a vertical line)
y_predicted = perceptron.forward(vertical_flat)

print(f"\n Step 1: Get the actual label")
print(f" y (actual) = {y_actual}")
print(f" This means: 'This IS a vertical line'")

print(f"\n Step 2: Get the prediction from our Perceptron")
print(f" ŷ (predicted) = {y_predicted:.4f}")
print(f" This means: '{y_predicted*100:.1f}% confident it's vertical'")

print(f"\n Step 3: Calculate the error")
print(f" error = y - ŷ")
print(f" error = {y_actual} - {y_predicted:.4f}")
error_vertical = y_actual - y_predicted
print(f" error = {error_vertical:.4f}")

print(f"\n Interpretation:")
if error_vertical > 0:
    print(f" The error is POSITIVE ({error_vertical:.4f})")
    print(f" This means: The Perceptron underestimated! It should output HIGHER.")
else:
    print(f" The error is NEGATIVE ({error_vertical:.4f})")
    print(f" This means: The Perceptron overestimated! It should output LOWER.")

# Test on a horizontal line (actual label = 0)
print("\n" + "-"*70)
print("Example 2: Testing on a HORIZONTAL line")
print("-"*70)

y_actual_h = 0  # The true label (it is NOT a vertical line)
y_predicted_h = perceptron.forward(horizontal_flat)

print(f"\n Step 1: Get the actual label")
print(f" y (actual) = {y_actual_h}")
print(f" This means: 'This is NOT a vertical line'")

print(f"\n Step 2: Get the prediction from our Perceptron")
print(f" ŷ (predicted) = {y_predicted_h:.4f}")
print(f" This means: '{y_predicted_h*100:.1f}% confident it's vertical'")

print(f"\n Step 3: Calculate the error")
print(f" error = y - ŷ")
print(f" error = {y_actual_h} - {y_predicted_h:.4f}")
error_horizontal = y_actual_h - y_predicted_h
print(f" error = {error_horizontal:.4f}")

print(f"\n Interpretation:")
if error_horizontal > 0:
    print(f" The error is POSITIVE ({error_horizontal:.4f})")
    print(f" This means: The Perceptron underestimated!")
elif error_horizontal < 0:
    print(f" The error is NEGATIVE ({error_horizontal:.4f})")
    print(f" This means: The Perceptron overestimated! It should output LOWER.")
else:
    print(f" The error is ZERO - perfect prediction!")
```
5.2 Loss Functions: The Teacher's Grading System
Before we look at specific formulas, let's understand what a loss function is and why we need one.
What is a Loss Function?
A loss function (also called a "cost function" or "objective function") is a mathematical formula that:
- Takes in predictions and actual labels
- Outputs a single number representing "how wrong" the predictions are
- Lower is better - a loss of 0 means perfect predictions
Why Do We Need Loss Functions?
Think about learning anything - you need feedback to improve. The loss function provides that feedback:
| Without Loss Function | With Loss Function |
|---|---|
| "Your predictions are wrong" | "Your predictions are 0.25 wrong" |
| Vague, not actionable | Precise, quantifiable |
| Can't compare methods | Can compare: 0.25 vs 0.15 |
| Can't track progress | Can see improvement over time |
The Role of Loss in Training
Loss functions are the heart of machine learning. The entire training process is:
- Make predictions
- Calculate loss (how wrong?)
- Adjust weights to reduce loss
- Repeat
The weights that minimize loss are the "best" weights - that's the entire goal of training!
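The four steps above can be sketched on a deliberately tiny problem. This is only an illustration under stated assumptions: the data, the one-weight model `y_pred = w * x`, the squared-error loss, and the update rule are all placeholders chosen for brevity, not the Perceptron training code built later in this notebook.

```python
import numpy as np

# Toy data (made up for illustration): the true relationship is y = 3x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

w = 0.0               # hypothetical single weight, starting from zero
learning_rate = 0.05  # assumed step size for this toy problem
loss_history = []

for step in range(50):
    y_pred = w * x                             # 1. Make predictions
    loss = np.mean((y - y_pred) ** 2)          # 2. Calculate loss (how wrong?)
    gradient = np.mean(-2 * x * (y - y_pred))  # slope of the loss with respect to w
    w = w - learning_rate * gradient           # 3. Adjust the weight to reduce loss
    loss_history.append(loss)                  # 4. Repeat

print(f"Learned w = {w:.4f}")
print(f"First loss = {loss_history[0]:.4f}, last loss = {loss_history[-1]:.8f}")
```

The loss shrinks on every pass and the weight settles near 3, the value that generated the toy data. The details of how the loss and the adjustment are computed are what the rest of this notebook explains.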
Committee Analogy
"The loss function is like a performance review score. Every time the committee member makes a decision, they get a score. A perfect decision scores 0. A terrible decision scores high. The member's goal is to adjust their behavior to minimize this score over time."
5.2.1 Mean Squared Error (MSE): Our First Loss Function
Now let's look at a specific loss function: Mean Squared Error (MSE).
Simple error (y - ŷ) has a problem: positive and negative errors can cancel out!
Example: If we have two predictions:
- Prediction 1: error = +0.5 (underestimated)
- Prediction 2: error = -0.5 (overestimated)
- Average error = (+0.5 + -0.5) / 2 = 0 ← Looks perfect, but it's NOT!
The Solution: Square the Errors
By squaring each error before averaging, we solve this problem:
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Let's break this formula down piece by piece:
| Symbol | Meaning | Example |
|---|---|---|
| $n$ | Number of samples | 100 images |
| $y_i$ | Actual label for sample $i$ | 1 (vertical) |
| $\hat{y}_i$ | Predicted value for sample $i$ | 0.7 |
| $(y_i - \hat{y}_i)$ | Error for sample $i$ | 1 - 0.7 = 0.3 |
| $(y_i - \hat{y}_i)^2$ | Squared error | 0.3² = 0.09 |
| $\frac{1}{n}\sum$ | Average of all squared errors | Mean |
Why Square?
Squaring the errors has three important benefits:
- No Cancellation: Positive and negative errors both become positive
- Penalize Big Errors: A small error (0.1) becomes tiny (0.01), but a big error (0.9) becomes large (0.81)
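- Smooth Landscape: Squared error is differentiable everywhere and, viewed as a function of the prediction, forms a smooth "bowl" with a single lowest point, which makes it easy to optimize (more on this later)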
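The first two benefits can be checked with a few lines of arithmetic. The sketch below reuses the +0.5 / -0.5 errors and the 0.1 / 0.9 errors from the text; the variable names are just for illustration.

```python
import numpy as np

# Benefit 1: no cancellation
errors = np.array([0.5, -0.5])             # one underestimate, one overestimate
mean_error = np.mean(errors)               # the signs cancel each other
mean_squared_error = np.mean(errors ** 2)  # squaring removes the sign first

print(f"Average error:         {mean_error:.2f}")
print(f"Average squared error: {mean_squared_error:.2f}")

# Benefit 2: big errors are penalized disproportionately
small_error, big_error = 0.1, 0.9
error_ratio = big_error / small_error
squared_ratio = big_error ** 2 / small_error ** 2

print(f"\nThe big error is {error_ratio:.0f}x the small error")
print(f"After squaring, it is {squared_ratio:.0f}x the small one")
```

The plain average reports 0.00 even though both predictions were off by 0.5, while the squared version reports 0.25. Squaring also turns a 9x gap between the errors into an 81x gap between the penalties.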
Let's Calculate MSE Step by Step:
```python
# =============================================================================
# MEAN SQUARED ERROR: Step by Step Calculation
# =============================================================================

print("="*70)
print("MEAN SQUARED ERROR (MSE): Step by Step")
print("="*70)

# Let's use 5 samples to make this clear
sample_actuals = np.array([1, 1, 0, 0, 1])                 # True labels
sample_predictions = np.array([0.9, 0.6, 0.3, 0.1, 0.5])   # Our predictions

print("\nOur data:")
print(f" Actual labels (y): {sample_actuals}")
print(f" Predictions (ŷ): {sample_predictions}")

# Step 1: Calculate each error
print("\n" + "-"*70)
print("STEP 1: Calculate each error (y - ŷ)")
print("-"*70)
errors = sample_actuals - sample_predictions
print(f"\n Sample 1: {sample_actuals[0]} - {sample_predictions[0]} = {errors[0]:.2f}")
print(f" Sample 2: {sample_actuals[1]} - {sample_predictions[1]} = {errors[1]:.2f}")
print(f" Sample 3: {sample_actuals[2]} - {sample_predictions[2]} = {errors[2]:.2f}")
print(f" Sample 4: {sample_actuals[3]} - {sample_predictions[3]} = {errors[3]:.2f}")
print(f" Sample 5: {sample_actuals[4]} - {sample_predictions[4]} = {errors[4]:.2f}")
print(f"\n All errors: {errors}")

# Step 2: Square each error
print("\n" + "-"*70)
print("STEP 2: Square each error (to make all positive)")
print("-"*70)
squared_errors = errors ** 2
print(f"\n Sample 1: ({errors[0]:.2f})² = {squared_errors[0]:.4f}")
print(f" Sample 2: ({errors[1]:.2f})² = {squared_errors[1]:.4f}")
print(f" Sample 3: ({errors[2]:.2f})² = {squared_errors[2]:.4f}")
print(f" Sample 4: ({errors[3]:.2f})² = {squared_errors[3]:.4f}")
print(f" Sample 5: ({errors[4]:.2f})² = {squared_errors[4]:.4f}")
print(f"\n Squared errors: {squared_errors}")

# Step 3: Take the mean
print("\n" + "-"*70)
print("STEP 3: Take the mean (average)")
print("-"*70)
mse = np.mean(squared_errors)
print(f"\n Sum of squared errors: {np.sum(squared_errors):.4f}")
print(f" Number of samples: {len(squared_errors)}")
print(f" MSE = Sum / n = {np.sum(squared_errors):.4f} / {len(squared_errors)}")
print(f" MSE = {mse:.4f}")

# The MSE function
print("\n" + "-"*70)
print("THE MSE FUNCTION (for reuse)")
print("-"*70)

def mse_loss(y_true, y_pred):
    """
    Mean Squared Error loss function.

    Formula: MSE = (1/n) * Σ(y - ŷ)²

    Parameters:
        y_true: Array of actual labels (0 or 1)
        y_pred: Array of predicted probabilities (0 to 1)

    Returns:
        Single value representing average squared error
    """
    return np.mean((y_true - y_pred) ** 2)

# Verify our calculation
print(f"\n Using our function: mse_loss(y, ŷ) = {mse_loss(sample_actuals, sample_predictions):.4f}")
print(f" Our manual calculation: {mse:.4f}")
print(f" Match: {'Yes!' if abs(mse_loss(sample_actuals, sample_predictions) - mse) < 0.0001 else 'No'}")
```
5.3 Binary Cross-Entropy: The Better Loss for Classification
MSE works, but for classification problems (like our V/H detection), there's a better loss function: Binary Cross-Entropy (BCE).
First, Let's Understand the Name
The name "Binary Cross-Entropy" has three parts:
| Term | Meaning | Our Context |
|---|---|---|
| Binary | Two classes only | Vertical (1) or Horizontal (0) |
| Cross | Comparing two distributions | Comparing predictions vs reality |
| Entropy | Measure of uncertainty/surprise | How "surprised" we are by the outcome |
Entropy comes from information theory. It measures uncertainty:
- If something is certain (100% probability), entropy is 0 - no surprise!
- If something is uncertain (50/50), entropy is high - maximum surprise!
Cross-entropy compares what we PREDICTED against what ACTUALLY happened.
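To make the "certain means zero, 50/50 means maximum" claim concrete, here is a small sketch of the entropy of a yes/no event. The helper name `binary_entropy` and the clipping constant are assumptions for this illustration, and the result is measured in nats because it uses the natural logarithm.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in nats) of a yes/no event that happens with probability p."""
    p = np.clip(p, 1e-15, 1 - 1e-15)  # keep p away from 0 and 1 to avoid log(0)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

for p in [1.0, 0.9, 0.5]:
    print(f"p = {p:.1f}  ->  entropy = {binary_entropy(p):.4f}")
```

The entropy is essentially 0 when the outcome is certain and reaches its largest value, log(2) ≈ 0.69, at p = 0.5.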
Why Not Just Use MSE?
MSE works for regression (predicting continuous values like house prices), but classification has a special property: we're predicting probabilities.
The Problem with MSE: When the prediction is very wrong (e.g., predicting 0.01 for a true label of 1), MSE gives an error of 0.99² = 0.98. That's bad, but is it bad enough?
Consider: predicting 0.01 when you should predict 1.0 means you were 99% confident and COMPLETELY wrong. That deserves a HUGE penalty!
BCE's Solution: BCE uses logarithms, which give much harsher penalties for confident wrong answers.
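A quick numerical comparison shows the difference in penalties. This sketch assumes the true label is 1 and uses -log(ŷ) as the logarithmic penalty, which is the form BCE takes in that case (see The Mathematics below); the variable names are illustrative.

```python
import numpy as np

y_true = 1.0
penalties = {}

for y_pred in [0.5, 0.1, 0.01, 0.001]:
    mse_penalty = (y_true - y_pred) ** 2  # squared error, can never exceed 1
    bce_penalty = -np.log(y_pred)         # logarithmic penalty, grows without bound
    penalties[y_pred] = (mse_penalty, bce_penalty)
    print(f"ŷ = {y_pred:<6} MSE = {mse_penalty:.4f}   BCE = {bce_penalty:.4f}")
```

As the prediction gets more confidently wrong, the squared error flattens out just below 1, while the logarithmic penalty keeps climbing.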
The Logarithm: Why It's Perfect for This
The logarithm is a special mathematical function. Here's why it works for measuring surprise:
log(1) = 0 → If probability was 100%, no surprise at all
log(0.5) ≈ -0.69 → Uncertain, some surprise
log(0.1) ≈ -2.30 → Low probability, big surprise!
log(0.01) ≈ -4.61 → Very low probability, huge surprise!
log(0) → -∞ → Approaching zero probability, surprise grows without bound (log(0) itself is undefined)
The negative sign flips these to positive loss values: -log(0.01) = 4.61
The Intuition: Measuring "Surprise"
Think of cross-entropy as measuring how surprised you are by the actual answer:
| Prediction | Actual | BCE Value | Interpretation |
|---|---|---|---|
| 0.99 | 1 | 0.01 | "Not surprised at all - I expected this!" |
| 0.5 | 1 | 0.69 | "Somewhat surprised - I was uncertain" |
| 0.01 | 1 | 4.61 | "VERY surprised! I was confident it was NOT 1!" |
The Mathematics
$$\text{BCE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i) \right]$$
This looks scary! Let's break it down:
When the actual label y = 1 (it IS vertical):
- The formula simplifies to: $-\log(\hat{y})$
- If we predicted high (ŷ = 0.9): $-\log(0.9) \approx 0.105$ (low loss - good!)
- If we predicted low (ŷ = 0.1): $-\log(0.1) \approx 2.303$ (high loss - bad!)
When the actual label y = 0 (it is NOT vertical):
- The formula simplifies to: $-\log(1 - \hat{y})$
- If we predicted low (ŷ = 0.1): $-\log(0.9) \approx 0.105$ (low loss - good!)
- If we predicted high (ŷ = 0.9): $-\log(0.1) \approx 2.303$ (high loss - bad!)
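The four cases above can be verified by plugging them into the full single-sample formula and letting the unused term drop out on its own. The helper `bce_single` is a name made up for this sketch.

```python
import numpy as np

def bce_single(y, y_hat):
    """BCE for one sample: -[y*log(ŷ) + (1-y)*log(1-ŷ)]."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

cases = [(1, 0.9), (1, 0.1), (0, 0.1), (0, 0.9)]
for y, y_hat in cases:
    print(f"y = {y}, ŷ = {y_hat}: BCE = {bce_single(y, y_hat):.3f}")
```

When y = 1 the second term is multiplied by zero, and when y = 0 the first term is, so only one logarithm ever contributes for a given sample.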
Committee Analogy
"BCE measures how embarrassed the committee member should be. If they confidently voted 'definitely vertical!' (0.99) and it turned out to be horizontal, they should be VERY embarrassed. The logarithm captures this severe penalty for confident wrong answers."
Let's Implement and Compare:
```python
# =============================================================================
# BINARY CROSS-ENTROPY: Step by Step
# =============================================================================

print("="*70)
print("BINARY CROSS-ENTROPY (BCE): Step by Step")
print("="*70)

# First, let's understand the log function
print("\n" + "-"*70)
print("UNDERSTANDING THE LOGARITHM")
print("-"*70)
print("""
The natural log (ln or log) has a special property:
 - log(1) = 0 (no surprise when probability matches reality)
 - log(0.5) = -0.69 (some surprise)
 - log(0.1) = -2.30 (very surprised!)
 - log(0.01) = -4.61 (extremely surprised!)

As the probability gets closer to 0, log goes to -infinity.
That's why BCE severely punishes confident wrong predictions!
""")

# Show the log curve
print(" Let's calculate -log(ŷ) for different predictions:")
predictions = [0.99, 0.9, 0.7, 0.5, 0.3, 0.1, 0.01]
print(f"\n {'Prediction (ŷ)':<18} {'-log(ŷ)':<15} {'Interpretation'}")
print(" " + "-"*60)
for p in predictions:
    neg_log = -np.log(p)
    if neg_log < 0.5:
        interp = "Low loss (good prediction)"
    elif neg_log < 1.5:
        interp = "Medium loss"
    else:
        interp = "High loss (bad prediction!)"
    print(f" {p:<18} {neg_log:<15.4f} {interp}")

print("\n" + "-"*70)
print("BCE CALCULATION FOR A SINGLE SAMPLE")
print("-"*70)

# Example 1: Actual is 1, prediction is 0.9 (good prediction)
y_true_1 = 1
y_pred_1 = 0.9

print(f"\n Example 1: Actual y = {y_true_1}, Predicted ŷ = {y_pred_1}")
print(f" (This is a GOOD prediction for a vertical line)")
print(f"\n BCE formula: -[y * log(ŷ) + (1-y) * log(1-ŷ)]")
print(f"\n Since y = 1, the (1-y) term becomes 0, so:")
print(f" BCE = -[{y_true_1} * log({y_pred_1}) + 0]")
print(f" BCE = -log({y_pred_1})")
# Parentheses keep the printed sign readable: the log itself is negative
print(f" BCE = -({np.log(y_pred_1):.4f})")
bce_1 = -np.log(y_pred_1)
print(f" BCE = {bce_1:.4f}")

# Example 2: Actual is 1, prediction is 0.1 (bad prediction)
y_true_2 = 1
y_pred_2 = 0.1

print(f"\n Example 2: Actual y = {y_true_2}, Predicted ŷ = {y_pred_2}")
print(f" (This is a BAD prediction for a vertical line)")
print(f"\n Since y = 1:")
print(f" BCE = -log({y_pred_2})")
print(f" BCE = -({np.log(y_pred_2):.4f})")
bce_2 = -np.log(y_pred_2)
print(f" BCE = {bce_2:.4f}")

print(f"\n Notice: The bad prediction has {bce_2/bce_1:.1f}x higher loss!")

# The BCE function
print("\n" + "-"*70)
print("THE BCE FUNCTION (for reuse)")
print("-"*70)

def binary_cross_entropy(y_true, y_pred):
    """
    Binary Cross-Entropy loss function.

    Formula: BCE = -(1/n) * Σ[y*log(ŷ) + (1-y)*log(1-ŷ)]

    Parameters:
        y_true: Array of actual labels (0 or 1)
        y_pred: Array of predicted probabilities (0 to 1)

    Returns:
        Single value representing average cross-entropy loss
    """
    # Clip predictions to avoid log(0) which is undefined
    epsilon = 1e-15  # A tiny number
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

    # Calculate BCE
    bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return bce

print("""
def binary_cross_entropy(y_true, y_pred):
    # Clip to avoid log(0) - would be undefined!
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

    # BCE formula
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
""")
```
```python
# =============================================================================
# VISUALIZE: MSE vs BCE
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Generate prediction values from 0.01 to 0.99
y_pred_range = np.linspace(0.01, 0.99, 100)

# When actual label is 1 (vertical line)
mse_when_y_is_1 = (1 - y_pred_range) ** 2
bce_when_y_is_1 = -np.log(y_pred_range)

# Plot for y = 1
ax1 = axes[0]
ax1.plot(y_pred_range, mse_when_y_is_1, 'b-', linewidth=2, label='MSE')
ax1.plot(y_pred_range, bce_when_y_is_1, 'r-', linewidth=2, label='BCE')
ax1.set_xlabel('Prediction (ŷ)', fontsize=12)
ax1.set_ylabel('Loss', fontsize=12)
ax1.set_title('When Actual y = 1 (Vertical Line)\nLower prediction = Higher loss',
              fontsize=12, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_xlim(0, 1)
ax1.set_ylim(0, 5)
ax1.axvline(x=0.5, color='gray', linestyle=':', alpha=0.5)
ax1.annotate('If we predict 0.1\nBCE = 2.3 (harsh!)\nMSE = 0.81',
             xy=(0.1, 2.3), xytext=(0.3, 3.5), fontsize=9,
             arrowprops=dict(arrowstyle='->', color='red'))

# When actual label is 0 (horizontal line)
mse_when_y_is_0 = y_pred_range ** 2
bce_when_y_is_0 = -np.log(1 - y_pred_range)

# Plot for y = 0
ax2 = axes[1]
ax2.plot(y_pred_range, mse_when_y_is_0, 'b-', linewidth=2, label='MSE')
ax2.plot(y_pred_range, bce_when_y_is_0, 'r-', linewidth=2, label='BCE')
ax2.set_xlabel('Prediction (ŷ)', fontsize=12)
ax2.set_ylabel('Loss', fontsize=12)
ax2.set_title('When Actual y = 0 (Horizontal Line)\nHigher prediction = Higher loss',
              fontsize=12, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_xlim(0, 1)
ax2.set_ylim(0, 5)
ax2.axvline(x=0.5, color='gray', linestyle=':', alpha=0.5)
ax2.annotate('If we predict 0.9\nBCE = 2.3 (harsh!)\nMSE = 0.81',
             xy=(0.9, 2.3), xytext=(0.5, 3.5), fontsize=9,
             arrowprops=dict(arrowstyle='->', color='red'))

plt.tight_layout()
plt.show()

print("\nKEY INSIGHT: BCE vs MSE")
print("="*60)
print("""
Notice how BCE (red line) rises much more steeply than MSE (blue line)
as predictions get worse?

This is why BCE is preferred for classification:
 - It SEVERELY punishes confident wrong predictions
 - A prediction of 0.1 when the answer is 1 has BCE loss of 2.3
 - The same prediction has MSE loss of only 0.81

BCE creates stronger learning signals when the model is very wrong,
which helps it learn faster and more reliably!
""")
```
5.4 Gradient Descent: Finding Better Weights
Now we know HOW WRONG we are (the loss). But how do we make our predictions BETTER?
This is where optimization comes in - the process of finding the best values for our weights.
The Optimization Problem
Our Perceptron has 9 weights + 1 bias = 10 numbers to choose. Each combination of these 10 numbers gives different predictions and a different loss.
The Question: Out of the infinite possible combinations, which gives the LOWEST loss?
The Naive Approach: Try all combinations!
- But with continuous numbers, there are infinitely many combinations
- Even with just 100 values per parameter: 100^10 = 10^20 combinations
- That's more than the number of grains of sand on Earth!
The Smart Approach: Use mathematics to guide our search toward better values.
What is a Derivative? (A Quick Refresher)
The derivative tells you how much one quantity changes when you change another.
Simple Example: You're driving a car.
- Position = where you are
- Derivative of position = speed (how fast position changes)
- Derivative of speed = acceleration (how fast speed changes)
For Our Loss Function:
- Loss = how wrong we are
- Derivative of loss w.r.t. weight = how much loss changes when we change the weight
If the derivative is:
- Positive: Increasing the weight increases loss → we should DECREASE the weight
- Negative: Increasing the weight decreases loss → we should INCREASE the weight
- Zero: We're at a minimum (or maximum)!
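The three cases can be checked numerically by nudging a weight slightly and watching how the loss responds. This is a sketch under assumptions: the toy loss L(w) = w² + 0.5 and the helper `numerical_derivative` are chosen for illustration, not taken from the Perceptron itself.

```python
def loss(w):
    """Toy loss with its minimum at w = 0 (an assumption for illustration)."""
    return w ** 2 + 0.5

def numerical_derivative(f, w, h=1e-5):
    """Central-difference estimate of how fast f changes around w."""
    return (f(w + h) - f(w - h)) / (2 * h)

for w in [2.0, -2.0, 0.0]:
    slope = numerical_derivative(loss, w)
    if slope > 1e-8:
        advice = "positive slope -> DECREASE the weight"
    elif slope < -1e-8:
        advice = "negative slope -> INCREASE the weight"
    else:
        advice = "zero slope -> at a flat point (here, the minimum)"
    print(f"w = {w:5.1f}   derivative ≈ {slope:7.4f}   {advice}")
```

In each case, moving the weight in the direction opposite to the sign of the derivative takes it toward w = 0, where this toy loss is lowest.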
The Key Idea: The Loss Landscape
Imagine the loss as a landscape where:
- The height at any point = how wrong we are (higher = worse)
- The position = our current weights
- Our goal = find the lowest point (minimum loss)
We want to "roll downhill" until we find the bottom!
The Algorithm: Gradient Descent
Gradient means "slope" - it tells us which way is uphill.
Gradient Descent means:
- Look at the slope where we are
- Take a step in the opposite direction (downhill)
- Repeat until we reach the bottom
The Mathematics
$$w_{\text{new}} = w_{\text{old}} - \alpha \cdot \frac{\partial L}{\partial w}$$
Let's break this down:
| Symbol | Meaning | Intuition |
|---|---|---|
| $w_{\text{new}}$ | Updated weight | Where we're going |
| $w_{\text{old}}$ | Current weight | Where we are |
| $\alpha$ | Learning rate | How big a step to take |
| $\frac{\partial L}{\partial w}$ | Gradient (slope) | Which way is uphill |
| $-$ | Subtraction | We go OPPOSITE to uphill (= downhill) |
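Here is a single application of the update rule with concrete numbers. The starting weight, the learning rate, and the toy loss L(w) = w² + 0.5 (whose derivative is 2w) are assumptions picked to keep the arithmetic easy to follow.

```python
w_old = 2.5
learning_rate = 0.3   # α
gradient = 2 * w_old  # dL/dw for the toy loss L(w) = w² + 0.5

w_new = w_old - learning_rate * gradient

loss_old = w_old ** 2 + 0.5
loss_new = w_new ** 2 + 0.5

print(f"gradient = {gradient:.2f} (positive, so uphill is to the right)")
print(f"w_new    = {w_old} - {learning_rate} * {gradient:.2f} = {w_new:.2f}")
print(f"loss went from {loss_old:.2f} to {loss_new:.2f}")
```

One step moves the weight from 2.5 to 1.0 and drops the loss from 6.75 to 1.50. Repeating the same rule keeps shrinking the loss toward its minimum.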
Committee Analogy
"The gradient is like a compass that always points uphill. We want to go DOWNHILL (less error), so we walk in the opposite direction. The learning rate decides whether we take small careful steps or big bold leaps."
Let's Visualize This:
```python
# =============================================================================
# VISUALIZE: The Loss Landscape and Gradient Descent
# =============================================================================

# Create a simple 1D loss landscape (parabola)
# This represents how loss changes as we change ONE weight
weight_values = np.linspace(-3, 3, 100)
loss_landscape = weight_values ** 2 + 0.5  # Simple parabola with minimum at w=0

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: The loss landscape
ax1 = axes[0]
ax1.plot(weight_values, loss_landscape, 'b-', linewidth=3)
ax1.fill_between(weight_values, loss_landscape, alpha=0.2)
ax1.set_xlabel('Weight Value (w)', fontsize=12)
ax1.set_ylabel('Loss (L)', fontsize=12)
ax1.set_title('The Loss Landscape\n(For a Single Weight)', fontsize=12, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Mark the minimum
ax1.scatter([0], [0.5], color='green', s=200, zorder=5, marker='*', label='Minimum (goal)')
ax1.annotate('Our goal: Find this minimum!', xy=(0, 0.5), xytext=(0.5, 2),
             fontsize=10, arrowprops=dict(arrowstyle='->', color='green'))

# Show current position
current_w = 2.0
current_loss = current_w ** 2 + 0.5
ax1.scatter([current_w], [current_loss], color='red', s=150, zorder=5, label='Current position')
ax1.axvline(x=current_w, color='red', linestyle='--', alpha=0.3)
ax1.legend()

# Plot 2: Gradient descent animation (multiple steps)
ax2 = axes[1]
ax2.plot(weight_values, loss_landscape, 'b-', linewidth=2, alpha=0.5)
ax2.fill_between(weight_values, loss_landscape, alpha=0.1)
ax2.set_xlabel('Weight Value (w)', fontsize=12)
ax2.set_ylabel('Loss (L)', fontsize=12)
ax2.set_title('Gradient Descent: Rolling Downhill\n(Learning Rate α = 0.3)',
              fontsize=12, fontweight='bold')
ax2.grid(True, alpha=0.3)

# Simulate gradient descent
learning_rate = 0.3
w = 2.5  # Starting position
path = [(w, w**2 + 0.5)]

for step in range(8):
    gradient = 2 * w                  # Derivative of w² is 2w
    w = w - learning_rate * gradient  # Gradient descent update
    loss = w ** 2 + 0.5
    path.append((w, loss))

# Plot the path
path = np.array(path)
ax2.plot(path[:, 0], path[:, 1], 'ro-', markersize=8, linewidth=2, label='Gradient descent path')
ax2.scatter([path[0, 0]], [path[0, 1]], color='red', s=200, zorder=5, marker='o', label='Start')
ax2.scatter([path[-1, 0]], [path[-1, 1]], color='green', s=200, zorder=5, marker='*',
            label='End (near minimum)')

# Add step numbers
for i, (w_val, l_val) in enumerate(path):
    ax2.annotate(f'{i}', xy=(w_val, l_val), xytext=(w_val+0.1, l_val+0.3),
                 fontsize=9, fontweight='bold')

ax2.legend(loc='upper right')

plt.tight_layout()
plt.show()

print("\nGRADIENT DESCENT STEPS:")
print("="*60)
print(f"{'Step':<6} {'Weight (w)':<15} {'Gradient (2w)':<15} {'Update':<20} {'Loss'}")
print("-"*60)
w = 2.5
for step in range(6):
    gradient = 2 * w
    update = -learning_rate * gradient
    loss = w ** 2 + 0.5
    print(f"{step:<6} {w:<15.4f} {gradient:<15.4f} {update:<20.4f} {loss:.4f}")
    w = w + update  # Same as w = w - learning_rate * gradient

print("-"*60)
print(f"\nStarted at w = 2.5 (loss = 6.75)")
# The loop above applies 6 updates (steps 0 through 5), so w is the weight after 6 steps
print(f"After 6 steps: w = {w:.4f} (loss = {w**2 + 0.5:.4f})")
print(f"Getting closer to the minimum at w = 0 (loss = 0.5)!")
```
5.5 Learning Rate: How Fast to Adjust
The learning rate (α, alpha) controls how big each step is. It's one of the most important choices in training!
Parameters vs Hyperparameters
Before we dive in, let's clarify an important distinction:
| Term | What It Is | Examples | Who Sets It? |
|---|---|---|---|
| Parameters | Values the model LEARNS | Weights, Bias | The training algorithm |
| Hyperparameters | Settings WE choose before training | Learning rate, number of epochs | The human (you!) |
The learning rate is a hyperparameter - we choose it before training, and it affects HOW the model learns (but is not learned itself).
Why Learning Rate Matters So Much
The learning rate multiplies the gradient to determine the step size:
step = learning_rate × gradient
new_weight = old_weight - step
The Problem: Gradients can vary wildly:
- Sometimes the gradient is 10.0 (steep slope)
- Sometimes it's 0.001 (nearly flat)
The Learning Rate's Job: Scale these gradients to reasonable step sizes.
The Goldilocks Problem
| Learning Rate | Effect | Problem |
|---|---|---|
| Too Large (α = 1.0) | Big steps | Overshoot! Miss the minimum, bounce around |
| Too Small (α = 0.001) | Tiny steps | Takes forever, might get stuck |
| Just Right (α = 0.1) | Medium steps | Steady progress toward minimum |
The Mathematics
Remember our update formula:
$$w_{\text{new}} = w_{\text{old}} - \alpha \cdot \text{gradient}$$
- If gradient = 10 and α = 0.1: step size = 1.0 (reasonable)
- If gradient = 10 and α = 1.0: step size = 10.0 (too big!)
- If gradient = 10 and α = 0.001: step size = 0.01 (too small!)
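The three step sizes above come straight from multiplying the learning rate by the gradient. The short sketch below reproduces them; the dictionary `step_sizes` is just a convenience for this example.

```python
gradient = 10.0  # the steep-slope gradient used in the bullets above

step_sizes = {}
for alpha in [0.1, 1.0, 0.001]:
    step_sizes[alpha] = alpha * gradient  # step = learning_rate × gradient
    print(f"α = {alpha:<6} ->  step size = {step_sizes[alpha]:.2f}")
```

The same gradient produces steps that differ by a factor of 1,000 depending on α, which is why this one hyperparameter has such a large effect on training.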
Committee Analogy
"The learning rate is how much the committee member adjusts after each mistake. Too much adjustment, and they overcorrect wildly. Too little, and they never improve. The right amount leads to steady learning."
Let's See All Three Scenarios:
```python
# =============================================================================
# VISUALIZE: Learning Rate Effects
# =============================================================================

def run_gradient_descent(start_w, learning_rate, steps=15):
    """Run gradient descent and return the path."""
    w = start_w
    path = [(w, w**2 + 0.5)]

    for _ in range(steps):
        gradient = 2 * w  # Derivative of w²
        w = w - learning_rate * gradient
        w = np.clip(w, -5, 5)  # Prevent explosion
        loss = w ** 2 + 0.5
        path.append((w, loss))

    return np.array(path)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
weight_values = np.linspace(-3, 3, 100)
loss_landscape = weight_values ** 2 + 0.5

scenarios = [
    (0.9, 'TOO LARGE (α=0.9)', 'red', 'Overshoots and bounces!'),
    (0.3, 'JUST RIGHT (α=0.3)', 'green', 'Steady progress!'),
    (0.05, 'TOO SMALL (α=0.05)', 'blue', 'Very slow progress...')
]

for ax, (lr, title, color, desc) in zip(axes, scenarios):
    ax.plot(weight_values, loss_landscape, 'k-', linewidth=1, alpha=0.3)
    ax.fill_between(weight_values, loss_landscape, alpha=0.1, color='gray')

    path = run_gradient_descent(start_w=2.5, learning_rate=lr)

    ax.plot(path[:, 0], path[:, 1], 'o-', color=color, markersize=6, linewidth=1.5)
    ax.scatter([path[0, 0]], [path[0, 1]], color=color, s=150, zorder=5, marker='s', label='Start')
    ax.scatter([path[-1, 0]], [path[-1, 1]], color='black', s=150, zorder=5, marker='*', label='End')

    ax.set_xlabel('Weight (w)', fontsize=11)
    ax.set_ylabel('Loss', fontsize=11)
    ax.set_title(f'{title}\n{desc}', fontsize=11, fontweight='bold')
    ax.set_xlim(-3, 3)
    ax.set_ylim(0, 10)
    ax.grid(True, alpha=0.3)
    ax.legend(loc='upper right', fontsize=9)

    # Show final loss
    ax.annotate(f'Final loss: {path[-1, 1]:.2f}', xy=(0, 8), fontsize=10, ha='center')

plt.tight_layout()
plt.show()

print("\nLEARNING RATE COMPARISON:")
print("="*60)
for lr, title, _, _ in scenarios:
    path = run_gradient_descent(start_w=2.5, learning_rate=lr)
    print(f"\n{title}")
    print(f" Final weight: {path[-1, 0]:.4f}")
    print(f" Final loss: {path[-1, 1]:.4f}")
    print(f" Optimal loss: 0.5000 (at w=0)")
    print(f" Distance from optimal: {abs(path[-1, 1] - 0.5):.4f}")
```
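Why do the three learning rates behave so differently on this particular curve? For L(w) = w² + 0.5 the gradient is 2w, so one update is w ← w − α·2w = (1 − 2α)·w: every step just multiplies w by a fixed factor. The sketch below (the helper name `shrink_factor` is hypothetical, and the result only holds for this simple quadratic) compares that closed form with the step-by-step loop:

```python
def shrink_factor(learning_rate):
    """For L(w) = w**2 + 0.5, one update multiplies w by (1 - 2 * learning_rate)."""
    return 1 - 2 * learning_rate

start_w, n_steps = 2.5, 15
for lr in (0.9, 0.3, 0.05):
    factor = shrink_factor(lr)

    # Closed form: w after n_steps updates
    closed_form = start_w * factor ** n_steps

    # Same result by iterating the update rule
    w = start_w
    for _ in range(n_steps):
        w = w - lr * 2 * w

    print(f"lr = {lr:<5} factor = {factor:+.2f}  w after {n_steps} steps = {w:+.5f}")
    assert abs(w - closed_form) < 1e-9
```

A negative factor means w hops across the minimum on every step (the "bouncing" in the first panel), a factor close to 1 means slow progress, and a factor with magnitude 1 or more (α ≥ 1.0 here) never settles at all.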
5.6 The Gradient: Which Way is Down?
We've been using the word "gradient" - but what IS it exactly, and how do we calculate it?
What is a Gradient?
The gradient is the derivative (slope) of the loss with respect to each weight. It tells us:
- How much the loss changes when we change a weight
- Which direction increases the loss (so we go the opposite way!)
Regular Derivatives vs Partial Derivatives
Regular derivative: When you have ONE variable.
- Example: If f(x) = x², then df/dx = 2x
Partial derivative (∂): When you have MULTIPLE variables and you want to see the effect of changing just ONE while keeping others fixed.
- Example: If f(x, y) = x² + y², then:
  - ∂f/∂x = 2x (how f changes when x changes, y held constant)
  - ∂f/∂y = 2y (how f changes when y changes, x held constant)
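"Holding the other variable constant" can be made concrete by nudging one input at a time and measuring how the output moves. This is a small sketch using central differences; the helper names `partial_x` and `partial_y` and the test point are made up for illustration:

```python
def f(x, y):
    return x ** 2 + y ** 2

def partial_x(func, x, y, h=1e-6):
    """Central-difference estimate of df/dx with y held constant."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)

def partial_y(func, x, y, h=1e-6):
    """Central-difference estimate of df/dy with x held constant."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

x0, y0 = 3.0, -2.0
dfdx = partial_x(f, x0, y0)
dfdy = partial_y(f, x0, y0)
print(f"df/dx ~ {dfdx:.4f} (formula 2x = {2 * x0})")
print(f"df/dy ~ {dfdy:.4f} (formula 2y = {2 * y0})")
```

The estimates should agree with the formulas 2x and 2y up to floating-point rounding.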
In our Perceptron:
- Loss depends on 9 weights + 1 bias = 10 variables
- We need 10 partial derivatives (one for each parameter)
- The gradient is the collection of ALL these partial derivatives
The Notation "w.r.t." (With Respect To)
You'll often see "gradient of L w.r.t. w" - this means "how does L change when we change w?"
∂L/∂w is read as "partial derivative of L with respect to w"
The Chain Rule: Breaking Down Complex Functions
Our Perceptron has multiple operations chained together:
x → [weighted sum: w · x + b] → z → [sigmoid] → ŷ → [BCE loss] → L
To find how changing w affects the final loss L, we use the chain rule:
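$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$$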
This looks complicated, but each piece is simple!
The Beautiful Simplification
For sigmoid activation with BCE loss, all the calculus simplifies to:
$$\frac{\partial L}{\partial w} = (\hat{y} - y) \cdot x$$
And for the bias:
$$\frac{\partial L}{\partial b} = (\hat{y} - y)$$
That's it! The gradient is just:
- (prediction - actual) × input
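A formula this compact is worth double-checking. The sketch below compares (ŷ − y) · x against a numerical gradient obtained by nudging each weight and re-evaluating the BCE loss. The weights are random illustrative values, and the input is an assumed 3x3 vertical line flattened to 9 values (not the notebook's own variable):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, x, y):
    """BCE loss of a single sigmoid neuron on one sample."""
    y_hat = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(0)
w = rng.normal(size=9) * 0.1                             # illustrative weights
b = 0.0
x = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0], dtype=float)   # assumed vertical-line input
y = 1.0

# Analytic gradient from the simplified formula
y_hat = sigmoid(np.dot(w, x) + b)
analytic = (y_hat - y) * x

# Numerical gradient: nudge one weight at a time (central difference)
h = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    nudge = np.zeros_like(w)
    nudge[i] = h
    numeric[i] = (bce_loss(w + nudge, b, x, y) - bce_loss(w - nudge, b, x, y)) / (2 * h)

print("max difference:", np.max(np.abs(analytic - numeric)))
```

This kind of comparison is often called a gradient check; the two results should match to several decimal places.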
Why This Formula Makes Intuitive Sense
| Part | Meaning | Intuition |
|---|---|---|
| $(\hat{y} - y)$ | Error | How wrong we are (and in which direction) |
| $x$ | Input | Which inputs contributed to the output |
If we predicted too high (ŷ > y), the error is positive, so the update decreases every weight whose input was positive.
The larger the input, the larger the decrease (because that input had more influence on the output); weights whose input was 0 do not change.
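Here is that intuition with numbers. All values are made up for illustration: a prediction of 0.8 for a true label of 0, three inputs of different sizes, and a learning rate of 0.5:

```python
import numpy as np

y_true, y_pred = 0.0, 0.8              # predicted too high
x = np.array([0.0, 1.0, 0.5])          # made-up inputs of different sizes
learning_rate = 0.5

error = y_pred - y_true                # positive: prediction was too high
gradient = error * x                   # (prediction - actual) * input
weight_change = -learning_rate * gradient

print("error:        ", error)
print("weight change:", weight_change)
```

The weight attached to the input of 1.0 moves twice as far as the one attached to 0.5, and the weight whose input was 0 stays where it is.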
Let's Calculate Gradients Step by Step:
```python
# =============================================================================
# CALCULATING GRADIENTS: Step by Step
# =============================================================================

print("="*70)
print("CALCULATING GRADIENTS: Step by Step")
print("="*70)

# Use our Perceptron on a vertical line
x = vertical_flat.copy()
y_true = 1  # It IS vertical

# Get the prediction
y_pred = perceptron.forward(x)

print(f"\nInput (x): {x}")
print(f"Actual label (y): {y_true}")
print(f"Prediction (ŷ): {y_pred:.4f}")

# Calculate the error term
print("\n" + "-"*70)
print("STEP 1: Calculate the error term (ŷ - y)")
print("-"*70)
error = y_pred - y_true
print(f"\n error = ŷ - y")
print(f" error = {y_pred:.4f} - {y_true}")
print(f" error = {error:.4f}")

if error > 0:
    print(f"\n Interpretation: Error is POSITIVE ({error:.4f})")
    print(f" This means we predicted TOO HIGH - need to decrease output")
else:
    print(f"\n Interpretation: Error is NEGATIVE ({error:.4f})")
    print(f" This means we predicted TOO LOW - need to increase output")

# Calculate gradient for weights
print("\n" + "-"*70)
print("STEP 2: Calculate gradient for each weight")
print("-"*70)
print(f"\n Formula: ∂L/∂w = (ŷ - y) × x = error × x")
print(f"\n For each weight w_i, the gradient is: error × x_i")

gradient_weights = error * x
print(f"\n Gradients for all 9 weights:")
print(f" error × x = {error:.4f} × {x}")
print(f" = [{', '.join([f'{g:.4f}' for g in gradient_weights])}]")

# Show which weights should change
print(f"\n Let's interpret this (as a 3x3 grid):")
grad_grid = gradient_weights.reshape(3, 3)
print(f" {grad_grid[0]}")
print(f" {grad_grid[1]}")
print(f" {grad_grid[2]}")

print(f"\n Notice: Only the middle column has non-zero gradients!")
print(f" That's because only those pixels had value 1 in the input.")
print(f" Weights for other pixels don't need to change (input was 0).")

# Calculate gradient for bias
print("\n" + "-"*70)
print("STEP 3: Calculate gradient for bias")
print("-"*70)
gradient_bias = error
print(f"\n Formula: ∂L/∂b = (ŷ - y) = error")
print(f" Bias gradient = {gradient_bias:.4f}")

# Show the update
print("\n" + "-"*70)
print("STEP 4: Apply the update (with learning rate α = 0.5)")
print("-"*70)
learning_rate = 0.5
print(f"\n Update formula: w_new = w_old - α × gradient")
print(f"\n For weight w₁ (position 1, middle column):")
old_w1 = perceptron.weights[1]
new_w1 = old_w1 - learning_rate * gradient_weights[1]
print(f" w₁_new = {old_w1:.4f} - {learning_rate} × {gradient_weights[1]:.4f}")
print(f" w₁_new = {old_w1:.4f} - {learning_rate * gradient_weights[1]:.4f}")
print(f" w₁_new = {new_w1:.4f}")
print(f"\n Since error was negative, w₁ INCREASED to make output higher next time!")
```
5.7 Backpropagation: Tracing the Blame
Backpropagation ("backprop") is the algorithm that calculates gradients by flowing errors BACKWARD through the network.
Why Backpropagation is Revolutionary
Before backpropagation was popularized in 1986 (by Rumelhart, Hinton, and Williams), training neural networks was incredibly difficult. People didn't know how to efficiently calculate gradients for networks with many layers.
The Problem: In a network with multiple layers, changing one weight affects EVERYTHING that comes after it. How do you figure out exactly how much each weight contributed to the final error?
The Solution: Backpropagation! It uses the chain rule to efficiently calculate ALL gradients in ONE backward pass through the network.
The Name Explained
- Back: We start from the OUTPUT (the error) and work BACKWARD
- Propagation: The error "propagates" (spreads) to earlier layers
Think of it like blame assignment:
- The final output was wrong
- What caused it to be wrong? Trace backward...
- These specific weights were most responsible
- Adjust them accordingly
For Our Single Neuron
In our simple Perceptron, backpropagation is straightforward:
FORWARD PASS (left to right):
x → [w·x + b] → z → [sigmoid] → ŷ → [compare to y] → Loss

BACKWARD PASS (right to left):
Loss → ∂L/∂ŷ → ∂L/∂z → ∂L/∂w, ∂L/∂b

- ∂L/∂ŷ: "How does loss change with ŷ?"
- ∂L/∂z: "How does loss change with z?"
- ∂L/∂w, ∂L/∂b: "How does loss change with w, b?"
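The backward pass can be written out one factor at a time. The sketch below uses made-up weights and an assumed vertical-line input; the variable names (`dL_dyhat`, `dyhat_dz`, and so on) are just labels for the three questions in the diagram:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up values for illustration
x = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
w = np.full(9, 0.05)
b = 0.0
y = 1.0

# Forward pass (left to right)
z = np.dot(w, x) + b
y_hat = sigmoid(z)

# Backward pass (right to left), one chain-rule factor at a time
dL_dyhat = (y_hat - y) / (y_hat * (1 - y_hat))   # derivative of BCE w.r.t. y_hat
dyhat_dz = y_hat * (1 - y_hat)                   # derivative of sigmoid w.r.t. z
dL_dz = dL_dyhat * dyhat_dz                      # collapses to (y_hat - y)
dL_dw = dL_dz * x                                # because dz/dw = x
dL_db = dL_dz                                    # because dz/db = 1

print("dL/dz:", dL_dz, " (y_hat - y):", y_hat - y)
print("dL/dw:", dL_dw)
```

Each intermediate value is computed once and reused by the step before it, which is the idea that makes backpropagation efficient in deeper networks.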
Committee Analogy
"Backpropagation is like a post-mortem after a mistake. The committee asks: 'What went wrong?' They trace the decision back: 'The final vote was wrong. Why? The weighted sum was off. Why? These specific weights gave too much importance to the wrong evidence.' Then they adjust those specific weights."
The Backprop Flow for Our Perceptron
| Step | Calculation | Formula |
|---|---|---|
| 1 | Loss gradient w.r.t. output | $\frac{\partial L}{\partial \hat{y}}$ (from BCE) |
| 2 | Output gradient w.r.t. pre-activation | $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$ (sigmoid derivative) |
| 3 | Pre-activation gradient w.r.t. weights | $\frac{\partial z}{\partial w} = x$ |
| 4 | Chain them together | $\frac{\partial L}{\partial w} = (\hat{y} - y) \cdot x$ |
The beautiful thing: steps 1 and 2 combine to give us just $(\hat{y} - y)$!
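For readers who want to see why the first two steps collapse so neatly, here is the derivation for a single sample, assuming the BCE loss and sigmoid activation used throughout this part:

```latex
\begin{aligned}
L &= -\bigl[\, y \ln \hat{y} + (1 - y) \ln (1 - \hat{y}) \,\bigr] \\
\frac{\partial L}{\partial \hat{y}} &= -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
  = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \\
\frac{\partial \hat{y}}{\partial z} &= \hat{y}(1 - \hat{y}) \\
\frac{\partial L}{\partial z} &= \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y})
  = \hat{y} - y \\
\frac{\partial L}{\partial w} &= (\hat{y} - y) \cdot x,
  \qquad \frac{\partial L}{\partial b} = \hat{y} - y
\end{aligned}
```

The factor ŷ(1 − ŷ) from the sigmoid cancels the denominator that comes from the BCE derivative, which is one reason this loss and activation are so often paired.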
5.8 The Training Loop: Putting It All Together
Now we have all the pieces! Let's build the complete training algorithm.
Why Do We Need a Loop?
A single gradient descent step makes only a TINY improvement. To go from random weights to good weights, we need MANY small steps.
Illustrative example (typical-looking numbers, not measured output):
- Start: Loss = 0.7, Accuracy = 50%
- After 1 step: Loss = 0.69, Accuracy = 51% (tiny improvement)
- After 10 steps: Loss = 0.5, Accuracy = 65%
- After 100 steps: Loss = 0.1, Accuracy = 95%
Each step nudges the weights slightly. Over many steps, these tiny nudges accumulate into major improvements!
Why Multiple Epochs?
Problem: One pass through the data isn't enough.
- With 100 samples, we only make 100 weight updates
- The model might not have "seen" enough patterns
- Early samples were processed with very different weights than late samples
Solution: Go through the data MULTIPLE times (epochs).
- Epoch 1: First exposure to all samples
- Epoch 2: Second look, with better weights now
- Epoch 3: Refinement continues
- ...
Each epoch, the model gets better at the task!
The Training Loop Algorithm
FOR each epoch (pass through the data):
    FOR each sample (x, y) in the training data:
        1. FORWARD PASS: Get prediction
           ŷ = sigmoid(w · x + b)
        2. COMPUTE LOSS: How wrong?
           L = BCE(y, ŷ)
        3. COMPUTE GRADIENTS: Which way to go?
           ∂L/∂w = (ŷ - y) × x
           ∂L/∂b = (ŷ - y)
        4. UPDATE WEIGHTS: Take a step downhill
           w = w - α × ∂L/∂w
           b = b - α × ∂L/∂b
    Record average loss for this epoch
Key Terms
| Term | Meaning |
|---|---|
| Epoch | One complete pass through all training data |
| Sample | One training example (input + label) |
| Batch | Group of samples processed together (we use batch size = 1 here) |
| Iteration | One weight update |
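These terms fit together with simple arithmetic: each batch triggers one weight update, so iterations = batches per epoch × epochs. A small sketch, where the helper name `count_iterations` is made up and the sample and epoch counts are only examples:

```python
import math

def count_iterations(n_samples, batch_size, epochs):
    """Number of weight updates: one per batch, for every epoch."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch * epochs

# Batch size 1 (as in this notebook): one update per sample
print(count_iterations(n_samples=100, batch_size=1, epochs=50))    # 5000

# A larger batch size means fewer, but less noisy, updates
print(count_iterations(n_samples=100, batch_size=32, epochs=50))   # 200
```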
Let's Build Our Trainable Perceptron!
```python
# =============================================================================
# THE TRAINABLE PERCEPTRON: Complete Implementation
# =============================================================================

class TrainablePerceptron:
    """
    A Perceptron that can learn from examples!

    This class includes:
    - Forward pass (prediction)
    - Loss calculation (BCE)
    - Gradient calculation (backpropagation)
    - Weight update (gradient descent)
    - Full training loop
    """

    def __init__(self, n_inputs):
        """Initialize with random weights."""
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0
        self.n_inputs = n_inputs

        # For tracking training progress
        self.loss_history = []
        self.accuracy_history = []

    def forward(self, x):
        """Forward pass: compute prediction."""
        x = np.array(x).flatten()
        z = np.dot(self.weights, x) + self.bias
        return sigmoid(z)

    def predict(self, x):
        """Binary prediction (0 or 1)."""
        return 1 if self.forward(x) >= 0.5 else 0

    def compute_loss(self, y_true, y_pred):
        """Compute BCE loss for one sample."""
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    def train(self, X, y, learning_rate=0.1, epochs=100, verbose=True):
        """
        Train the Perceptron using gradient descent.

        Parameters:
            X: Training inputs, shape (n_samples, n_features)
            y: Training labels, shape (n_samples,)
            learning_rate: Step size for gradient descent
            epochs: Number of passes through the data
            verbose: Whether to print progress

        Returns:
            List of losses for each epoch
        """
        self.loss_history = []
        self.accuracy_history = []

        if verbose:
            print("="*70)
            print("TRAINING STARTED")
            print("="*70)
            print(f" Samples: {len(X)}")
            print(f" Epochs: {epochs}")
            print(f" Learning rate: {learning_rate}")
            print()

        for epoch in range(epochs):
            total_loss = 0
            correct = 0

            # Go through each training sample
            for i in range(len(X)):
                xi = X[i]  # Input
                yi = y[i]  # True label

                # ===== STEP 1: FORWARD PASS =====
                y_pred = self.forward(xi)

                # ===== STEP 2: COMPUTE LOSS =====
                loss = self.compute_loss(yi, y_pred)
                total_loss += loss

                # Count correct predictions
                if (y_pred >= 0.5 and yi == 1) or (y_pred < 0.5 and yi == 0):
                    correct += 1

                # ===== STEP 3: COMPUTE GRADIENTS =====
                # The beautiful simplification: gradient = (prediction - actual) * input
                error = y_pred - yi
                gradient_weights = error * xi
                gradient_bias = error

                # ===== STEP 4: UPDATE WEIGHTS =====
                self.weights = self.weights - learning_rate * gradient_weights
                self.bias = self.bias - learning_rate * gradient_bias

            # Record progress
            avg_loss = total_loss / len(X)
            accuracy = correct / len(X)
            self.loss_history.append(avg_loss)
            self.accuracy_history.append(accuracy)

            # Print progress every 10 epochs
            if verbose and (epoch + 1) % 10 == 0:
                print(f" Epoch {epoch+1:3d}/{epochs}: Loss = {avg_loss:.4f}, Accuracy = {accuracy*100:.1f}%")

        if verbose:
            print()
            print("="*70)
            print("TRAINING COMPLETE!")
            print("="*70)
            print(f" Final Loss: {self.loss_history[-1]:.4f}")
            print(f" Final Accuracy: {self.accuracy_history[-1]*100:.1f}%")

        return self.loss_history

print("TrainablePerceptron class created!")
print("Now let's train it and watch it learn...")
```
5.9 Watching It Learn!
This is the moment we've been building toward. Let's train our Perceptron and watch it transform from a confused guesser into an expert line detector!
What to Watch For
During training, you'll see:
- Loss decreasing - The model is making fewer/smaller mistakes
- Accuracy increasing - More predictions are correct
- Eventually plateauing - When the model has learned all it can
Convergence: When Has the Model Learned Enough?
Convergence means the model has stopped improving significantly. Signs of convergence:
| Sign | What It Looks Like | What It Means |
|---|---|---|
| Loss plateaus | Loss curve flattens out | Further epochs bring little or no improvement |
| Loss oscillates | Jumps up and down slightly | Near the minimum |
| Accuracy stable | Stays at same level | Model has learned the pattern |
When to Stop Training:
- When loss stops decreasing for several epochs
- When accuracy reaches acceptable level (e.g., 95%+)
- When you've run out of patience!
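The first rule ("loss stops decreasing for several epochs") can be turned into a simple check on the recorded loss history. This is a minimal sketch, not part of the notebook's `TrainablePerceptron`; the helper name `should_stop` and the `patience` and `min_delta` defaults are assumptions chosen for illustration:

```python
def should_stop(loss_history, patience=5, min_delta=1e-4):
    """Return True if the best loss has not improved by at least
    min_delta during the last `patience` epochs (hypothetical helper)."""
    if len(loss_history) <= patience:
        return False  # Not enough history to judge yet
    best_before = min(loss_history[:-patience])
    best_recent = min(loss_history[-patience:])
    return best_recent > best_before - min_delta

improving = [0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
flat = [0.70, 0.30, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]

print(should_stop(improving))  # False: still getting better
print(should_stop(flat))       # True: no improvement in the last 5 epochs
```

This idea is commonly called early stopping. A check like this would be called once per epoch, breaking out of the training loop when it returns True.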
```python
# =============================================================================
# TRAINING THE PERCEPTRON: Watch It Learn!
# =============================================================================

# Create a fresh Perceptron
np.random.seed(42)  # For reproducibility
model = TrainablePerceptron(n_inputs=9)

# Check initial performance (before training)
print("BEFORE TRAINING:")
print("-"*40)
correct_before = sum(model.predict(X_train[i]) == y_train[i] for i in range(len(X_train)))
print(f"Accuracy: {correct_before}/{len(y_train)} = {correct_before/len(y_train)*100:.1f}%")
print(f"(This is basically random guessing)")
print()

# Train the model!
losses = model.train(X_train, y_train, learning_rate=0.5, epochs=50)

# Check final performance
print("\nAFTER TRAINING:")
print("-"*40)
correct_after = sum(model.predict(X_train[i]) == y_train[i] for i in range(len(X_train)))
print(f"Accuracy: {correct_after}/{len(y_train)} = {correct_after/len(y_train)*100:.1f}%")
```
```python
# =============================================================================
# VISUALIZE THE LEARNING PROGRESS
# =============================================================================

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Loss over time
ax1 = axes[0]
ax1.plot(model.loss_history, 'b-', linewidth=2)
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Loss (BCE)', fontsize=12)
ax1.set_title('Loss Decreasing Over Time', fontsize=12, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.annotate(f'Start: {model.loss_history[0]:.2f}',
             xy=(0, model.loss_history[0]),
             xytext=(5, model.loss_history[0]+0.1), fontsize=10)
ax1.annotate(f'End: {model.loss_history[-1]:.2f}',
             xy=(len(model.loss_history)-1, model.loss_history[-1]),
             xytext=(len(model.loss_history)-15, model.loss_history[-1]+0.1), fontsize=10)

# Plot 2: Accuracy over time
ax2 = axes[1]
ax2.plot([a*100 for a in model.accuracy_history], 'g-', linewidth=2)
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Accuracy (%)', fontsize=12)
ax2.set_title('Accuracy Increasing Over Time', fontsize=12, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.set_ylim(0, 105)
ax2.axhline(y=50, color='red', linestyle='--', alpha=0.5, label='Random guessing')
ax2.legend()

# Plot 3: Learned weights (as 3x3 grid)
ax3 = axes[2]
weights_grid = model.weights.reshape(3, 3)
im = ax3.imshow(weights_grid, cmap='RdBu', vmin=-2, vmax=2)
ax3.set_title('Learned Weights\n(What the model looks for)', fontsize=12, fontweight='bold')
for i in range(3):
    for j in range(3):
        ax3.text(j, i, f'{weights_grid[i,j]:.2f}', ha='center', va='center',
                 fontsize=11, fontweight='bold')
plt.colorbar(im, ax=ax3)
ax3.set_xticks([])
ax3.set_yticks([])

plt.tight_layout()
plt.show()

print("\nKEY OBSERVATIONS:")
print("="*60)
print(f"1. Loss decreased from {model.loss_history[0]:.4f} to {model.loss_history[-1]:.4f}")
print(f"2. Accuracy improved from ~50% to {model.accuracy_history[-1]*100:.1f}%")
print(f"3. The learned weights show HIGH values in the middle column!")
print(f" This is exactly what we'd expect for a vertical line detector!")
```
```python
# =============================================================================
# TEST ON CANONICAL EXAMPLES: Before vs After
# =============================================================================

print("="*70)
print("TESTING ON CANONICAL EXAMPLES")
print("="*70)

# Test on vertical line
v_pred = model.forward(vertical_flat)
print(f"\nVertical Line:")
print(f" Prediction: {v_pred:.4f} ({v_pred*100:.1f}% confident it's vertical)")
print(f" Actual: 1 (IS vertical)")
print(f" Result: {'CORRECT!' if v_pred >= 0.5 else 'Wrong'}")

# Test on horizontal line
h_pred = model.forward(horizontal_flat)
print(f"\nHorizontal Line:")
print(f" Prediction: {h_pred:.4f} ({h_pred*100:.1f}% confident it's vertical)")
print(f" Actual: 0 (NOT vertical)")
print(f" Result: {'CORRECT!' if h_pred < 0.5 else 'Wrong'}")

print("\n" + "="*70)
print("THE PERCEPTRON HAS LEARNED!")
print("="*70)
print("""
From random weights giving ~50% accuracy,
our Perceptron now confidently classifies lines!

It learned that:
 - The MIDDLE COLUMN matters most for vertical lines
 - Other pixels should have low/negative weights

This happened automatically through gradient descent -
we never told it what a vertical line looks like!
""")
```
Part 5 Summary: What We've Learned
This was the most important notebook in the series! You've learned the core of how neural networks learn.
Key Concepts Mastered
| Concept | Formula | Why It Matters |
|---|---|---|
| Error | ŷ - y | Measures how wrong we are (and in which direction) |
| MSE Loss | (1/n)Σ(y-ŷ)² | Penalizes errors, larger errors more |
| BCE Loss | -[y·log(ŷ) + (1-y)·log(1-ŷ)] | Better for classification, harsh on confident mistakes |
| Gradient | (ŷ - y) · x | Direction and magnitude of improvement |
| Gradient Descent | w = w - α·∇L | Algorithm to find better weights |
| Learning Rate | α | Controls step size (too big = overshoot, too small = slow) |
| Backpropagation | Chain rule backward | Calculates gradients for all weights |
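The "harsh on confident mistakes" entry is easiest to appreciate with numbers. The sketch below compares the squared error and the BCE loss for a single sample whose true label is 1, as the prediction gets more confidently wrong. The function names are made up for illustration, and the clipping mirrors the epsilon guard used in the notebook's loss code:

```python
import numpy as np

def squared_error(y_true, y_pred):
    """Squared error for one sample."""
    return (y_true - y_pred) ** 2

def bce(y_true, y_pred, epsilon=1e-15):
    """Binary cross-entropy for one sample, clipped to avoid log(0)."""
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = 1
for y_pred in (0.9, 0.5, 0.1, 0.01):
    print(f"y_pred = {y_pred:<5} squared error = {squared_error(y_true, y_pred):.4f}  "
          f"BCE = {bce(y_true, y_pred):.4f}")
```

The squared error can never exceed 1 for predictions between 0 and 1, while BCE keeps growing as the prediction approaches the wrong extreme.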
The Training Loop (Memorize This!)
for epoch in range(epochs):
    for x, y in training_data:
        y_pred = forward(x)
        loss = bce(y, y_pred)
        gradient = (y_pred - y) * x
        weights -= lr * gradient
        bias -= lr * (y_pred - y)
Committee Analogy Progress
| Part | Committee Story |
|---|---|
| Part 1-3 | Member learned procedures (math, weights, voting) |
| Part 4 | First case - confused, random guessing |
| Part 5 | Member receives feedback and LEARNS! |
| Part 6 | (Next) Evaluating the trained expert |
The Big Picture
Before Training: Random weights → Random predictions → ~50% accuracy
After Training: Learned weights → Meaningful predictions → ~95%+ accuracy
The Perceptron discovered ON ITS OWN that vertical lines have pixels in the middle column!
Knowledge Check
```python
# =============================================================================
# KNOWLEDGE CHECK - Part 5
# =============================================================================

print("KNOWLEDGE CHECK - Part 5: Training")
print("="*60)
print("\nAnswer these questions to test your understanding:\n")

questions = [
    {
        "q": "1. Why do we square errors in MSE?",
        "options": [
            "A) To make the math easier",
            "B) To prevent positive and negative errors from canceling out",
            "C) To make errors smaller",
            "D) Because computers prefer square numbers"
        ],
        "answer": "B",
        "explanation": "Squaring makes all errors positive, so they add up rather than cancel. It also penalizes larger errors more."
    },
    {
        "q": "2. Why is BCE preferred over MSE for classification?",
        "options": [
            "A) BCE is faster to compute",
            "B) BCE uses less memory",
            "C) BCE severely punishes confident wrong predictions",
            "D) BCE always gives lower values"
        ],
        "answer": "C",
        "explanation": "BCE uses logarithms which give very large penalties when the model is confident but wrong (e.g., predicting 0.01 when answer is 1)."
    },
    {
        "q": "3. What happens if the learning rate is too high?",
        "options": [
            "A) Training is faster and better",
            "B) The model overshoots the minimum and may never converge",
            "C) The model learns more features",
            "D) Nothing bad, higher is always better"
        ],
        "answer": "B",
        "explanation": "A high learning rate causes big jumps that overshoot the minimum, causing the loss to bounce around or even increase."
    },
    {
        "q": "4. The gradient formula for our Perceptron is (ŷ - y) × x. What does the 'x' part mean?",
        "options": [
            "A) Larger inputs get larger weight updates",
            "B) The input is added to the gradient",
            "C) X marks the spot",
            "D) Nothing, it's just mathematical convention"
        ],
        "answer": "A",
        "explanation": "The input 'x' determines which weights contributed to the output. Weights connected to larger inputs get larger updates because they had more influence."
    },
    {
        "q": "5. What is an 'epoch' in training?",
        "options": [
            "A) One weight update",
            "B) One forward pass",
            "C) One complete pass through all training data",
            "D) When the model reaches 100% accuracy"
        ],
        "answer": "C",
        "explanation": "An epoch is one complete pass through the entire training dataset. We typically train for many epochs until the model converges."
    }
]

for q in questions:
    print(q["q"])
    for opt in q["options"]:
        print(f" {opt}")
    print()

print("\n" + "="*60)
print("Scroll down for answers...")
print("="*60)
```
```python
# =============================================================================
# ANSWERS - Knowledge Check Part 5
# =============================================================================

print("ANSWERS - Part 5 Knowledge Check")
print("="*60)

for i, q in enumerate(questions, 1):
    print(f"\n{i}. Answer: {q['answer']}")
    print(f" {q['explanation']}")

print("\n" + "="*60)
print("How did you do?")
print(" 5/5: Training Master!")
print(" 4/5: Solid understanding!")
print(" 3/5: Review the sections you missed")
print(" <3: Re-read Part 5 - these concepts are crucial!")
print("="*60)
```
What's Next?
Congratulations! You've completed the most important notebook in this series!
You now understand how neural networks learn - loss functions, gradient descent, and backpropagation are the foundation of ALL deep learning.
Coming Up in Part 6: Evaluation - The Trained Expert
- Training vs Inference - Learning mode vs using mode
- Accuracy Metrics - Precision, recall, F1 score
- Confusion Matrix - Detailed prediction breakdown
- Interpretability - What did the model actually learn?
Continue to Part 6: part_6_evaluation.ipynb
"The Perceptron has learned. Now it's time to see what it REALLY knows."
The Brain's Decision Committee - From Confusion to Competence