AgenticWorks

A community for developers awakening to agentic AI. Hands-on lessons, enterprise-grade context engineering, and a forum that earns its quiet.


© 2026 AgenticWorks. Built in public.


Track 1 · ML foundations

Brain's Decision Committee
  1. The first neuron
  2. A single neuron
  3. Activation functions
  4. The perceptron
  5. Training
  6. Evaluation
  7. Hidden layers
  8. Deep learning challenges
  9. Full implementation
  10. What's next

Backpropagation
Part 5 · 60 min · intermediate

Learning from mistakes

Measure error, apply gradient descent, tune learning rate, and implement the training loop.


Neural Network Fundamentals

Part 5: Training - Learning from Mistakes

The Brain's Decision Committee - Chapter 5


The Story So Far...

In Part 4, our committee member attempted their first classification task. They looked at images of vertical and horizontal lines and tried to identify them. The results were... not great. With random weights, they achieved about 50% accuracy - no better than flipping a coin.

But here's the beautiful thing about neural networks: they can learn from their mistakes.

In this notebook, we'll teach our Perceptron how to improve. We'll show it examples, tell it when it's wrong, and let it gradually adjust its weights until it becomes an expert line detector.

This is training - the heart of machine learning.


What You'll Learn in Part 5

This is one of the most important notebooks in the series. By the end, you will understand:

  1. Loss Functions - How to measure "how wrong" a prediction is
  2. Why We Square Errors - The mathematical reason behind MSE
  3. Binary Cross-Entropy - The preferred loss for classification (and why!)
  4. Gradient Descent - The algorithm that finds better weights
  5. Learning Rate - How fast to adjust (and what happens if it's wrong)
  6. The Gradient - The direction of steepest improvement
  7. Backpropagation - How errors flow backward through the network
  8. The Training Loop - Putting it all together
  9. Watch It Learn - See the Perceptron go from 50% to 95%+ accuracy!

Prerequisites

Make sure you've completed:

  • Parts 0-1: Matrices (neural_network_fundamentals.ipynb)
  • Part 2: Single Neuron (part_2_single_neuron.ipynb)
  • Part 3: Activation Functions (part_3_activation_functions.ipynb)
  • Part 4: The Perceptron (part_4_perceptron.ipynb)

Setup: Import Dependencies and Recreate Our Tools

Let's bring in everything we built in previous notebooks.

cell 003
# =============================================================================
# PART 5: TRAINING - SETUP AND IMPORTS
# =============================================================================

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, clear_output

# Try to import ipywidgets for interactive features
try:
    import ipywidgets as widgets
    WIDGETS_AVAILABLE = True
except ImportError:
    WIDGETS_AVAILABLE = False
    print("Note: ipywidgets not installed. Interactive features will be limited.")

# Set up matplotlib style
style_options = ['seaborn-v0_8-whitegrid', 'seaborn-whitegrid', 'ggplot', 'default']
for style in style_options:
    try:
        plt.style.use(style)
        break
    except OSError:
        continue

plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 12
np.random.seed(42)

print("Setup complete!")
print("="*60)
cell 004
# =============================================================================
# RECREATE OUR TOOLS FROM PREVIOUS NOTEBOOKS
# =============================================================================

# -----------------------------------------------------------------------------
# Our canonical line images (from Part 1)
# -----------------------------------------------------------------------------
vertical_line = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
horizontal_line = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]])
vertical_flat = vertical_line.flatten()
horizontal_flat = horizontal_line.flatten()

# -----------------------------------------------------------------------------
# Dataset generator (from Part 4)
# -----------------------------------------------------------------------------
def generate_line_dataset(n_samples=100, noise_level=0.0, seed=None):
    """Generate vertical (label=1) and horizontal (label=0) line images."""
    if seed is not None:
        np.random.seed(seed)

    X, y = [], []

    for i in range(n_samples):
        image = np.zeros((3, 3))

        if i < n_samples // 2:  # Vertical lines
            col = np.random.randint(0, 3)
            image[:, col] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(1)
        else:  # Horizontal lines
            row = np.random.randint(0, 3)
            image[row, :] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(0)

    X, y = np.array(X), np.array(y)
    shuffle_idx = np.random.permutation(n_samples)
    return X[shuffle_idx], y[shuffle_idx]

# -----------------------------------------------------------------------------
# Sigmoid activation function (from Part 3)
# -----------------------------------------------------------------------------
def sigmoid(z):
    """Sigmoid activation: maps any value to range (0, 1)."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# -----------------------------------------------------------------------------
# Basic Perceptron class (from Part 4) - We'll add training later!
# -----------------------------------------------------------------------------
class Perceptron:
    """A single-layer Perceptron for binary classification."""

    def __init__(self, n_inputs):
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0
        self.n_inputs = n_inputs

    def forward(self, x):
        """Compute the forward pass."""
        x = np.array(x).flatten()
        z = np.dot(self.weights, x) + self.bias
        return sigmoid(z)

    def predict(self, x):
        """Make a binary prediction (0 or 1)."""
        return 1 if self.forward(x) >= 0.5 else 0

# Generate our training dataset
X_train, y_train = generate_line_dataset(n_samples=100, noise_level=0.0, seed=42)

print("Tools recreated from previous notebooks!")
print(f"  - Vertical/Horizontal line templates")
print(f"  - Dataset generator")
print(f"  - Sigmoid activation")
print(f"  - Basic Perceptron class")
print(f"\nTraining dataset: {len(X_train)} samples")
print(f"  - {sum(y_train)} vertical lines (label=1)")
print(f"  - {len(y_train) - sum(y_train)} horizontal lines (label=0)")

5.1 The Error: How Wrong Are We?

Before we can improve, we need to measure how wrong our predictions are. This is the foundation of learning.

The Basic Idea

When our Perceptron makes a prediction, we compare it to the actual answer:

Error = Actual Value - Predicted Value
      = y - ŷ

A Concrete Example

Let's say we show the Perceptron a vertical line (actual label y = 1):

| Scenario | Prediction (ŷ) | Error (y - ŷ) | Interpretation |
|---|---|---|---|
| Perfect | 1.0 | 1.0 - 1.0 = 0.0 | No error! |
| Good | 0.9 | 1.0 - 0.9 = 0.1 | Small error |
| Bad | 0.3 | 1.0 - 0.3 = 0.7 | Big error! |
| Terrible | 0.0 | 1.0 - 0.0 = 1.0 | Maximum error |

Committee Analogy

"The committee member votes on a case. After the vote, the supervisor reveals the correct answer. The difference between their vote and the correct answer is their ERROR - and they need to learn from it."

Why Error Matters

The error tells us two things:

  1. How much to adjust (larger error = bigger adjustment needed)
  2. Which direction to adjust (positive error = increase output, negative = decrease)

Let's see this with real numbers:

cell 006
# =============================================================================
# CALCULATING ERROR: Step by Step
# =============================================================================

# Create an untrained Perceptron
perceptron = Perceptron(n_inputs=9)

print("="*70)
print("CALCULATING ERROR: Step by Step")
print("="*70)

# Test on a vertical line (actual label = 1)
print("\n" + "-"*70)
print("Example 1: Testing on a VERTICAL line")
print("-"*70)

y_actual = 1  # The true label (it IS a vertical line)
y_predicted = perceptron.forward(vertical_flat)

print(f"\n  Step 1: Get the actual label")
print(f"          y (actual) = {y_actual}")
print(f"          This means: 'This IS a vertical line'")

print(f"\n  Step 2: Get the prediction from our Perceptron")
print(f"          ŷ (predicted) = {y_predicted:.4f}")
print(f"          This means: '{y_predicted*100:.1f}% confident it's vertical'")

print(f"\n  Step 3: Calculate the error")
print(f"          error = y - ŷ")
print(f"          error = {y_actual} - {y_predicted:.4f}")
error_vertical = y_actual - y_predicted
print(f"          error = {error_vertical:.4f}")

print(f"\n  Interpretation:")
if error_vertical > 0:
    print(f"          The error is POSITIVE ({error_vertical:.4f})")
    print(f"          This means: The Perceptron underestimated! It should output HIGHER.")
else:
    print(f"          The error is NEGATIVE ({error_vertical:.4f})")
    print(f"          This means: The Perceptron overestimated! It should output LOWER.")

# Test on a horizontal line (actual label = 0)
print("\n" + "-"*70)
print("Example 2: Testing on a HORIZONTAL line")
print("-"*70)

y_actual_h = 0  # The true label (it is NOT a vertical line)
y_predicted_h = perceptron.forward(horizontal_flat)

print(f"\n  Step 1: Get the actual label")
print(f"          y (actual) = {y_actual_h}")
print(f"          This means: 'This is NOT a vertical line'")

print(f"\n  Step 2: Get the prediction from our Perceptron")
print(f"          ŷ (predicted) = {y_predicted_h:.4f}")
print(f"          This means: '{y_predicted_h*100:.1f}% confident it's vertical'")

print(f"\n  Step 3: Calculate the error")
print(f"          error = y - ŷ")
print(f"          error = {y_actual_h} - {y_predicted_h:.4f}")
error_horizontal = y_actual_h - y_predicted_h
print(f"          error = {error_horizontal:.4f}")

print(f"\n  Interpretation:")
if error_horizontal > 0:
    print(f"          The error is POSITIVE ({error_horizontal:.4f})")
    print(f"          This means: The Perceptron underestimated!")
elif error_horizontal < 0:
    print(f"          The error is NEGATIVE ({error_horizontal:.4f})")
    print(f"          This means: The Perceptron overestimated! It should output LOWER.")
else:
    print(f"          The error is ZERO - perfect prediction!")

5.2 Loss Functions: The Teacher's Grading System

Before we look at specific formulas, let's understand what a loss function is and why we need one.

What is a Loss Function?

A loss function (also called a "cost function" or "objective function") is a mathematical formula that:

  • Takes in predictions and actual labels
  • Outputs a single number representing "how wrong" the predictions are
  • Lower is better - a loss of 0 means perfect predictions

Why Do We Need Loss Functions?

Think about learning anything - you need feedback to improve. The loss function provides that feedback:

| Without Loss Function | With Loss Function |
|---|---|
| "Your predictions are wrong" | "Your predictions are 0.25 wrong" |
| Vague, not actionable | Precise, quantifiable |
| Can't compare methods | Can compare: 0.25 vs 0.15 |
| Can't track progress | Can see improvement over time |

The Role of Loss in Training

Loss functions are the heart of machine learning. The entire training process is:

  1. Make predictions
  2. Calculate loss (how wrong?)
  3. Adjust weights to reduce loss
  4. Repeat

The weights that minimize loss are the "best" weights - that's the entire goal of training!
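The four steps above can be sketched on a deliberately tiny problem. This is only an illustration under stated assumptions: the data (`y = 2 * x`), the single weight `w`, the `step_size` value, and the squared-error loss are all hypothetical choices made for this sketch, not the Perceptron training code built later in this notebook.

```python
import numpy as np

# Hypothetical toy data: the "right" weight is 2.0, because y = 2 * x.
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x

w = 0.0            # starting weight (arbitrary choice for this sketch)
step_size = 0.05   # how far to move on each adjustment (assumed value)
losses = []

for _ in range(50):
    y_pred = w * x                            # 1. make predictions
    loss = np.mean((y - y_pred) ** 2)         # 2. calculate loss (average squared error)
    losses.append(loss)
    slope = np.mean(-2 * x * (y - y_pred))    # how the loss changes as w increases
    w = w - step_size * slope                 # 3. adjust the weight to reduce loss
                                              # 4. repeat (next loop iteration)

print(f"First loss: {losses[0]:.4f}")
print(f"Last loss:  {losses[-1]:.6f}")
print(f"Learned w:  {w:.4f}")
```

The loss shrinks on every pass because each adjustment moves `w` toward the value that makes the predictions match the labels. How the slope is obtained and how the step size is chosen are covered in sections 5.4 and 5.5.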

Committee Analogy

"The loss function is like a performance review score. Every time the committee member makes a decision, they get a score. A perfect decision scores 0. A terrible decision scores high. The member's goal is to adjust their behavior to minimize this score over time."


5.2.1 Mean Squared Error (MSE): Our First Loss Function

Now let's look at a specific loss function: Mean Squared Error (MSE).

Simple error (y - ŷ) has a problem: positive and negative errors can cancel out!

Example: If we have two predictions:

  • Prediction 1: error = +0.5 (underestimated)
  • Prediction 2: error = -0.5 (overestimated)
  • Average error = (+0.5 + -0.5) / 2 = 0 ← Looks perfect, but it's NOT!
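The cancellation is easy to reproduce. A minimal sketch, using the two hypothetical errors from the example above:

```python
import numpy as np

# Two errors (y - y_hat) that are equal in size but opposite in sign.
errors = np.array([0.5, -0.5])

mean_error = errors.mean()                   # signed errors cancel each other
mean_squared_error = (errors ** 2).mean()    # squared errors cannot cancel

print(f"Mean error:         {mean_error:.2f}")
print(f"Mean squared error: {mean_squared_error:.2f}")
```

The plain average reports zero even though both predictions missed by 0.5, while the squared version reports a non-zero loss.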

The Solution: Square the Errors

By squaring each error before averaging, we solve this problem:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Let's break this formula down piece by piece:

| Symbol | Meaning | Example |
|---|---|---|
| $n$ | Number of samples | 100 images |
| $y_i$ | Actual label for sample $i$ | 1 (vertical) |
| $\hat{y}_i$ | Predicted value for sample $i$ | 0.7 |
| $(y_i - \hat{y}_i)$ | Error for sample $i$ | 1 - 0.7 = 0.3 |
| $(y_i - \hat{y}_i)^2$ | Squared error | 0.3² = 0.09 |
| $\frac{1}{n}\sum$ | Average of all squared errors | Mean |

Why Square?

Squaring the errors has three important benefits:

  1. No Cancellation: Positive and negative errors both become positive
  2. Penalize Big Errors: A small error (0.1) becomes tiny (0.01), but a big error (0.9) becomes large (0.81)
  3. Smooth Landscape: Squared error is smooth and differentiable everywhere, and for a simple linear model it forms a "bowl" shape that's easy to optimize (more on this later)
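The "penalize big errors" point can be checked with the two error sizes from the list above. A small sketch (the variable names are chosen here for illustration):

```python
small_error, big_error = 0.1, 0.9

# Before squaring, the big error is 9 times the small one.
ratio_raw = big_error / small_error

# After squaring, the gap widens to 81 times.
ratio_squared = big_error ** 2 / small_error ** 2

print(f"Raw errors:     {small_error} vs {big_error}  (ratio {ratio_raw:.0f}x)")
print(f"Squared errors: {small_error ** 2:.2f} vs {big_error ** 2:.2f}  (ratio {ratio_squared:.0f}x)")
```

Squaring therefore makes the loss care far more about a few large mistakes than about many small ones.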

Let's Calculate MSE Step by Step:

cell 008
# =============================================================================
# MEAN SQUARED ERROR: Step by Step Calculation
# =============================================================================

print("="*70)
print("MEAN SQUARED ERROR (MSE): Step by Step")
print("="*70)

# Let's use 5 samples to make this clear
sample_actuals = np.array([1, 1, 0, 0, 1])       # True labels
sample_predictions = np.array([0.9, 0.6, 0.3, 0.1, 0.5])  # Our predictions

print("\nOur data:")
print(f"  Actual labels (y):      {sample_actuals}")
print(f"  Predictions (ŷ):        {sample_predictions}")

# Step 1: Calculate each error
print("\n" + "-"*70)
print("STEP 1: Calculate each error (y - ŷ)")
print("-"*70)
errors = sample_actuals - sample_predictions
print(f"\n  Sample 1: {sample_actuals[0]} - {sample_predictions[0]} = {errors[0]:.2f}")
print(f"  Sample 2: {sample_actuals[1]} - {sample_predictions[1]} = {errors[1]:.2f}")
print(f"  Sample 3: {sample_actuals[2]} - {sample_predictions[2]} = {errors[2]:.2f}")
print(f"  Sample 4: {sample_actuals[3]} - {sample_predictions[3]} = {errors[3]:.2f}")
print(f"  Sample 5: {sample_actuals[4]} - {sample_predictions[4]} = {errors[4]:.2f}")
print(f"\n  All errors: {errors}")

# Step 2: Square each error
print("\n" + "-"*70)
print("STEP 2: Square each error (to make all positive)")
print("-"*70)
squared_errors = errors ** 2
print(f"\n  Sample 1: ({errors[0]:.2f})² = {squared_errors[0]:.4f}")
print(f"  Sample 2: ({errors[1]:.2f})² = {squared_errors[1]:.4f}")
print(f"  Sample 3: ({errors[2]:.2f})² = {squared_errors[2]:.4f}")
print(f"  Sample 4: ({errors[3]:.2f})² = {squared_errors[3]:.4f}")
print(f"  Sample 5: ({errors[4]:.2f})² = {squared_errors[4]:.4f}")
print(f"\n  Squared errors: {squared_errors}")

# Step 3: Take the mean
print("\n" + "-"*70)
print("STEP 3: Take the mean (average)")
print("-"*70)
mse = np.mean(squared_errors)
print(f"\n  Sum of squared errors: {np.sum(squared_errors):.4f}")
print(f"  Number of samples: {len(squared_errors)}")
print(f"  MSE = Sum / n = {np.sum(squared_errors):.4f} / {len(squared_errors)}")
print(f"  MSE = {mse:.4f}")

# The MSE function
print("\n" + "-"*70)
print("THE MSE FUNCTION (for reuse)")
print("-"*70)

def mse_loss(y_true, y_pred):
    """
    Mean Squared Error loss function.

    Formula: MSE = (1/n) * Σ(y - ŷ)²

    Parameters:
        y_true: Array of actual labels (0 or 1)
        y_pred: Array of predicted probabilities (0 to 1)

    Returns:
        Single value representing average squared error
    """
    return np.mean((y_true - y_pred) ** 2)

# Verify our calculation
print(f"\n  Using our function: mse_loss(y, ŷ) = {mse_loss(sample_actuals, sample_predictions):.4f}")
print(f"  Our manual calculation: {mse:.4f}")
print(f"  Match: {'Yes!' if abs(mse_loss(sample_actuals, sample_predictions) - mse) < 0.0001 else 'No'}")

5.3 Binary Cross-Entropy: The Better Loss for Classification

MSE works, but for classification problems (like our V/H detection), there's a better loss function: Binary Cross-Entropy (BCE).

First, Let's Understand the Name

The name "Binary Cross-Entropy" has three parts:

| Term | Meaning | Our Context |
|---|---|---|
| Binary | Two classes only | Vertical (1) or Horizontal (0) |
| Cross | Comparing two distributions | Comparing predictions vs reality |
| Entropy | Measure of uncertainty/surprise | How "surprised" we are by the outcome |

Entropy comes from information theory. It measures uncertainty:

  • If something is certain (100% probability), entropy is 0 - no surprise!
  • If something is uncertain (50/50), entropy is high - maximum surprise!

Cross-entropy compares what we PREDICTED against what ACTUALLY happened.
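The two entropy claims above (zero when the outcome is certain, highest at 50/50) can be checked numerically. A minimal sketch: `binary_entropy` is a helper written for this illustration, using the standard entropy formula for a yes/no event with the natural log, and the clipping constant is an assumption to avoid evaluating log(0).

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in nats) of a yes/no event that happens with probability p."""
    # Clip so log(0) is never evaluated; the result at p = 0 or 1 is then ~0.
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

for p in [0.5, 0.9, 0.99, 1.0]:
    print(f"p = {p:<5} entropy = {binary_entropy(p):.4f}")
```

Entropy falls steadily as the probability moves away from 0.5 toward certainty.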

Why Not Just Use MSE?

MSE works for regression (predicting continuous values like house prices), but classification has a special property: we're predicting probabilities.

The Problem with MSE: When the prediction is very wrong (e.g., predicting 0.01 for a true label of 1), MSE gives an error of 0.99² = 0.98. That's bad, but is it bad enough?

Consider: predicting 0.01 when you should predict 1.0 means you were 99% confident and COMPLETELY wrong. That deserves a HUGE penalty!

BCE's Solution: BCE uses logarithms, which give much harsher penalties for confident wrong answers.

The Logarithm: Why It's Perfect for This

The logarithm is a special mathematical function. Here's why it works for measuring surprise:

  • log(1) = 0 → If probability was 100%, no surprise at all
  • log(0.5) ≈ -0.69 → Uncertain, some surprise
  • log(0.1) ≈ -2.30 → Low probability, big surprise!
  • log(0.01) ≈ -4.61 → Very low probability, huge surprise!
  • log(p) → -∞ as p → 0 → Zero probability, infinite surprise (the "impossible" event happened!)

The negative sign flips these to positive loss values: -log(0.01) = 4.61
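To see how the log-based penalty compares with the squared-error penalty at the same probabilities, here is a small sketch. It assumes the true label is 1, so `p` is the probability the model assigned to the correct answer; the array names are illustrative.

```python
import numpy as np

# Probability assigned to the correct answer (true label = 1).
probabilities = np.array([1.0, 0.5, 0.1, 0.01])

# log(1/p) is the same quantity as -log(p).
surprise = np.log(1.0 / probabilities)

# Squared-error penalty for the same predictions.
squared_error = (1.0 - probabilities) ** 2

for p, s, sq in zip(probabilities, surprise, squared_error):
    print(f"p = {p:<5} -log(p) = {s:.2f}   (1 - p)^2 = {sq:.2f}")
```

The squared error can never exceed 1, while the log penalty keeps growing as the probability assigned to the truth approaches 0.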

The Intuition: Measuring "Surprise"

Think of cross-entropy as measuring how surprised you are by the actual answer:

| Prediction | Actual | BCE Value | Interpretation |
|---|---|---|---|
| 0.99 | 1 | 0.01 | "Not surprised at all - I expected this!" |
| 0.5 | 1 | 0.69 | "Somewhat surprised - I was uncertain" |
| 0.01 | 1 | 4.61 | "VERY surprised! I was confident it was NOT 1!" |

The Mathematics

$$\text{BCE} = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \cdot \log(\hat{y}_i) + (1-y_i) \cdot \log(1-\hat{y}_i) \right]$$

This looks scary! Let's break it down:

When the actual label y = 1 (it IS vertical):

  • The formula simplifies to: $-\log(\hat{y})$
  • If we predicted high (ŷ = 0.9): $-\log(0.9) = 0.105$ (low loss - good!)
  • If we predicted low (ŷ = 0.1): $-\log(0.1) = 2.303$ (high loss - bad!)

When the actual label y = 0 (it is NOT vertical):

  • The formula simplifies to: $-\log(1 - \hat{y})$
  • If we predicted low (ŷ = 0.1): $-\log(0.9) = 0.105$ (low loss - good!)
  • If we predicted high (ŷ = 0.9): $-\log(0.1) = 2.303$ (high loss - bad!)

Committee Analogy

"BCE measures how embarrassed the committee member should be. If they confidently voted 'definitely vertical!' (0.99) and it turned out to be horizontal, they should be VERY embarrassed. The logarithm captures this severe penalty for confident wrong answers."

Let's Implement and Compare:

cell 010
# =============================================================================
# BINARY CROSS-ENTROPY: Step by Step
# =============================================================================

print("="*70)
print("BINARY CROSS-ENTROPY (BCE): Step by Step")
print("="*70)

# First, let's understand the log function
print("\n" + "-"*70)
print("UNDERSTANDING THE LOGARITHM")
print("-"*70)
print("""
The natural log (ln or log) has a special property:
  - log(1) = 0        (no surprise when probability matches reality)
  - log(0.5) = -0.69  (some surprise)
  - log(0.1) = -2.30  (very surprised!)
  - log(0.01) = -4.61 (extremely surprised!)

As the probability gets closer to 0, log goes to -infinity.
That's why BCE severely punishes confident wrong predictions!
""")

# Show the log curve
print("  Let's calculate -log(ŷ) for different predictions:")
predictions = [0.99, 0.9, 0.7, 0.5, 0.3, 0.1, 0.01]
print(f"\n  {'Prediction (ŷ)':<18} {'-log(ŷ)':<15} {'Interpretation'}")
print("  " + "-"*60)
for p in predictions:
    neg_log = -np.log(p)
    if neg_log < 0.5:
        interp = "Low loss (good prediction)"
    elif neg_log < 1.5:
        interp = "Medium loss"
    else:
        interp = "High loss (bad prediction!)"
    print(f"  {p:<18} {neg_log:<15.4f} {interp}")

print("\n" + "-"*70)
print("BCE CALCULATION FOR A SINGLE SAMPLE")
print("-"*70)

# Example 1: Actual is 1, prediction is 0.9 (good prediction)
y_true_1 = 1
y_pred_1 = 0.9

print(f"\n  Example 1: Actual y = {y_true_1}, Predicted ŷ = {y_pred_1}")
print(f"  (This is a GOOD prediction for a vertical line)")
print(f"\n  BCE formula: -[y * log(ŷ) + (1-y) * log(1-ŷ)]")
print(f"\n  Since y = 1, the (1-y) term becomes 0, so:")
print(f"  BCE = -[{y_true_1} * log({y_pred_1}) + 0]")
print(f"  BCE = -log({y_pred_1})")
# Parentheses keep the already-negative log value from printing as "--0.1054"
print(f"  BCE = -({np.log(y_pred_1):.4f})")
bce_1 = -np.log(y_pred_1)
print(f"  BCE = {bce_1:.4f}")

# Example 2: Actual is 1, prediction is 0.1 (bad prediction)
y_true_2 = 1
y_pred_2 = 0.1

print(f"\n  Example 2: Actual y = {y_true_2}, Predicted ŷ = {y_pred_2}")
print(f"  (This is a BAD prediction for a vertical line)")
print(f"\n  Since y = 1:")
print(f"  BCE = -log({y_pred_2})")
print(f"  BCE = -({np.log(y_pred_2):.4f})")
bce_2 = -np.log(y_pred_2)
print(f"  BCE = {bce_2:.4f}")

print(f"\n  Notice: The bad prediction has {bce_2/bce_1:.1f}x higher loss!")

# The BCE function
print("\n" + "-"*70)
print("THE BCE FUNCTION (for reuse)")
print("-"*70)

def binary_cross_entropy(y_true, y_pred):
    """
    Binary Cross-Entropy loss function.

    Formula: BCE = -(1/n) * Σ[y*log(ŷ) + (1-y)*log(1-ŷ)]

    Parameters:
        y_true: Array of actual labels (0 or 1)
        y_pred: Array of predicted probabilities (0 to 1)

    Returns:
        Single value representing average cross-entropy loss
    """
    # Clip predictions to avoid log(0) which is undefined
    epsilon = 1e-15  # A tiny number
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

    # Calculate BCE
    bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return bce

print("""
def binary_cross_entropy(y_true, y_pred):
    # Clip to avoid log(0) - would be undefined!
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

    # BCE formula
    return -np.mean(y_true * np.log(y_pred) +
                    (1 - y_true) * np.log(1 - y_pred))
""")
cell 011
# =============================================================================
# VISUALIZE: MSE vs BCE
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Generate prediction values from 0.01 to 0.99
y_pred_range = np.linspace(0.01, 0.99, 100)

# When actual label is 1 (vertical line)
mse_when_y_is_1 = (1 - y_pred_range) ** 2
bce_when_y_is_1 = -np.log(y_pred_range)

# Plot for y = 1
ax1 = axes[0]
ax1.plot(y_pred_range, mse_when_y_is_1, 'b-', linewidth=2, label='MSE')
ax1.plot(y_pred_range, bce_when_y_is_1, 'r-', linewidth=2, label='BCE')
ax1.set_xlabel('Prediction (ŷ)', fontsize=12)
ax1.set_ylabel('Loss', fontsize=12)
ax1.set_title('When Actual y = 1 (Vertical Line)\nLower prediction = Higher loss', fontsize=12, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_xlim(0, 1)
ax1.set_ylim(0, 5)
ax1.axvline(x=0.5, color='gray', linestyle=':', alpha=0.5)
ax1.annotate('If we predict 0.1\nBCE = 2.3 (harsh!)\nMSE = 0.81',
             xy=(0.1, 2.3), xytext=(0.3, 3.5),
             fontsize=9, arrowprops=dict(arrowstyle='->', color='red'))

# When actual label is 0 (horizontal line)
mse_when_y_is_0 = y_pred_range ** 2
bce_when_y_is_0 = -np.log(1 - y_pred_range)

# Plot for y = 0
ax2 = axes[1]
ax2.plot(y_pred_range, mse_when_y_is_0, 'b-', linewidth=2, label='MSE')
ax2.plot(y_pred_range, bce_when_y_is_0, 'r-', linewidth=2, label='BCE')
ax2.set_xlabel('Prediction (ŷ)', fontsize=12)
ax2.set_ylabel('Loss', fontsize=12)
ax2.set_title('When Actual y = 0 (Horizontal Line)\nHigher prediction = Higher loss', fontsize=12, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_xlim(0, 1)
ax2.set_ylim(0, 5)
ax2.axvline(x=0.5, color='gray', linestyle=':', alpha=0.5)
ax2.annotate('If we predict 0.9\nBCE = 2.3 (harsh!)\nMSE = 0.81',
             xy=(0.9, 2.3), xytext=(0.5, 3.5),
             fontsize=9, arrowprops=dict(arrowstyle='->', color='red'))

plt.tight_layout()
plt.show()

print("\nKEY INSIGHT: BCE vs MSE")
print("="*60)
print("""
Notice how BCE (red line) rises much more steeply than MSE (blue line)
as predictions get worse? This is why BCE is preferred for classification:
  - It SEVERELY punishes confident wrong predictions
  - A prediction of 0.1 when the answer is 1 has BCE loss of 2.3
  - The same prediction has MSE loss of only 0.81

BCE creates stronger learning signals when the model is very wrong,
which helps it learn faster and more reliably!
""")

5.4 Gradient Descent: Finding Better Weights

Now we know HOW WRONG we are (the loss). But how do we make our predictions BETTER?

This is where optimization comes in - the process of finding the best values for our weights.

The Optimization Problem

Our Perceptron has 9 weights + 1 bias = 10 numbers to choose. Each combination of these 10 numbers gives different predictions and a different loss.

The Question: Out of the infinite possible combinations, which gives the LOWEST loss?

The Naive Approach: Try all combinations!

  • But with continuous numbers, there are infinitely many combinations
  • Even with just 100 values per parameter: 100^10 = 10^20 combinations
  • That's more than the number of grains of sand on Earth!
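The size of that search space can be made concrete with a quick calculation. In this sketch, `evals_per_second` is a purely hypothetical speed (one billion loss evaluations per second) chosen only to give a sense of scale.

```python
values_per_parameter = 100
n_parameters = 9 + 1                    # 9 weights + 1 bias

combinations = values_per_parameter ** n_parameters

# Hypothetical speed: one billion loss evaluations per second.
evals_per_second = 1_000_000_000
seconds_needed = combinations / evals_per_second
years_needed = seconds_needed / (60 * 60 * 24 * 365)

print(f"Combinations to try: {combinations:.0e}")
print(f"Years at 1e9 evaluations/second: {years_needed:,.0f}")
```

Even this coarse 100-value grid would take thousands of years at that assumed speed, and it still would not cover the continuous range of possible weights.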

The Smart Approach: Use mathematics to guide our search toward better values.

What is a Derivative? (A Quick Refresher)

The derivative tells you how much one quantity changes when you change another.

Simple Example: You're driving a car.

  • Position = where you are
  • Derivative of position = speed (how fast position changes)
  • Derivative of speed = acceleration (how fast speed changes)

For Our Loss Function:

  • Loss = how wrong we are
  • Derivative of loss w.r.t. weight = how much loss changes when we change the weight

If the derivative is:

  • Positive: Increasing the weight increases loss → we should DECREASE the weight
  • Negative: Increasing the weight decreases loss → we should INCREASE the weight
  • Zero: We're at a minimum (or maximum)!
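The three cases above can be checked numerically. This sketch assumes a toy loss `L(w) = w² + 0.5` with its minimum at `w = 0`, and estimates the derivative with a central difference; both `loss` and `numerical_derivative` are helpers written for this illustration.

```python
def loss(w):
    """Toy loss (an assumed example) with its minimum at w = 0."""
    return w ** 2 + 0.5

def numerical_derivative(f, w, h=1e-5):
    """Central-difference estimate of df/dw at w."""
    return (f(w + h) - f(w - h)) / (2 * h)

for w in [2.0, -2.0, 0.0]:
    slope = numerical_derivative(loss, w)
    if slope > 1e-8:
        action = "positive slope: DECREASE w"
    elif slope < -1e-8:
        action = "negative slope: INCREASE w"
    else:
        action = "flat: at a minimum (or maximum)"
    print(f"w = {w:+.1f}  dL/dw ≈ {slope:+.4f}  {action}")
```

In each case the suggested move is opposite to the sign of the derivative, which is the rule gradient descent applies.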

The Key Idea: The Loss Landscape

Imagine the loss as a landscape where:

  • The height at any point = how wrong we are (higher = worse)
  • The position = our current weights
  • Our goal = find the lowest point (minimum loss)

We want to "roll downhill" until we find the bottom!

The Algorithm: Gradient Descent

Gradient means "slope" - it tells us which way is uphill.

Gradient Descent means:

  1. Look at the slope where we are
  2. Take a step in the opposite direction (downhill)
  3. Repeat until we reach the bottom

The Mathematics

$$w_{new} = w_{old} - \alpha \cdot \frac{\partial L}{\partial w}$$

Let's break this down:

| Symbol | Meaning | Intuition |
|---|---|---|
| $w_{new}$ | Updated weight | Where we're going |
| $w_{old}$ | Current weight | Where we are |
| $\alpha$ | Learning rate | How big a step to take |
| $\frac{\partial L}{\partial w}$ | Gradient (slope) | Which way is uphill |
| $-$ | Subtraction | We go OPPOSITE to uphill (= downhill) |
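As a worked instance of the update rule, assume the toy loss $L(w) = w^2 + 0.5$ (an illustrative choice, not the Perceptron's real loss), a current weight of 2.5, and a learning rate of 0.3:

```latex
\begin{aligned}
\frac{\partial L}{\partial w} &= 2w = 2(2.5) = 5.0 \\
w_{new} &= w_{old} - \alpha \cdot \frac{\partial L}{\partial w} = 2.5 - 0.3 \cdot 5.0 = 1.0 \\
L(w_{old}) &= 2.5^2 + 0.5 = 6.75, \qquad L(w_{new}) = 1.0^2 + 0.5 = 1.5
\end{aligned}
```

The gradient is positive (uphill is to the right), so the weight moves left, and one step cuts the loss from 6.75 to 1.5.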

Committee Analogy

"The gradient is like a compass that always points uphill. We want to go DOWNHILL (less error), so we walk in the opposite direction. The learning rate decides whether we take small careful steps or big bold leaps."

Let's Visualize This:

cell 013
# =============================================================================
# VISUALIZE: The Loss Landscape and Gradient Descent
# =============================================================================

# Create a simple 1D loss landscape (parabola)
# This represents how loss changes as we change ONE weight
weight_values = np.linspace(-3, 3, 100)
loss_landscape = weight_values ** 2 + 0.5  # Simple parabola with minimum at w=0

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: The loss landscape
ax1 = axes[0]
ax1.plot(weight_values, loss_landscape, 'b-', linewidth=3)
ax1.fill_between(weight_values, loss_landscape, alpha=0.2)
ax1.set_xlabel('Weight Value (w)', fontsize=12)
ax1.set_ylabel('Loss (L)', fontsize=12)
ax1.set_title('The Loss Landscape\n(For a Single Weight)', fontsize=12, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Mark the minimum
ax1.scatter([0], [0.5], color='green', s=200, zorder=5, marker='*', label='Minimum (goal)')
ax1.annotate('Our goal: Find this minimum!', xy=(0, 0.5), xytext=(0.5, 2),
            fontsize=10, arrowprops=dict(arrowstyle='->', color='green'))

# Show current position
current_w = 2.0
current_loss = current_w ** 2 + 0.5
ax1.scatter([current_w], [current_loss], color='red', s=150, zorder=5, label='Current position')
ax1.axvline(x=current_w, color='red', linestyle='--', alpha=0.3)
ax1.legend()

# Plot 2: Gradient descent animation (multiple steps)
ax2 = axes[1]
ax2.plot(weight_values, loss_landscape, 'b-', linewidth=2, alpha=0.5)
ax2.fill_between(weight_values, loss_landscape, alpha=0.1)
ax2.set_xlabel('Weight Value (w)', fontsize=12)
ax2.set_ylabel('Loss (L)', fontsize=12)
ax2.set_title('Gradient Descent: Rolling Downhill\n(Learning Rate α = 0.3)', fontsize=12, fontweight='bold')
ax2.grid(True, alpha=0.3)

# Simulate gradient descent
learning_rate = 0.3
w = 2.5  # Starting position
path = [(w, w**2 + 0.5)]

for step in range(8):
    gradient = 2 * w  # Derivative of w² is 2w
    w = w - learning_rate * gradient  # Gradient descent update
    loss = w ** 2 + 0.5
    path.append((w, loss))

# Plot the path
path = np.array(path)
ax2.plot(path[:, 0], path[:, 1], 'ro-', markersize=8, linewidth=2, label='Gradient descent path')
ax2.scatter([path[0, 0]], [path[0, 1]], color='red', s=200, zorder=5, marker='o', label='Start')
ax2.scatter([path[-1, 0]], [path[-1, 1]], color='green', s=200, zorder=5, marker='*', label='End (near minimum)')

# Add step numbers
for i, (w_val, l_val) in enumerate(path):
    ax2.annotate(f'{i}', xy=(w_val, l_val), xytext=(w_val+0.1, l_val+0.3),
                fontsize=9, fontweight='bold')

ax2.legend(loc='upper right')

plt.tight_layout()
plt.show()

print("\nGRADIENT DESCENT STEPS:")
print("="*60)
print(f"{'Step':<6} {'Weight (w)':<15} {'Gradient (2w)':<15} {'Update':<20} {'Loss'}")
print("-"*60)
w = 2.5
for step in range(6):
    gradient = 2 * w
    update = -learning_rate * gradient
    loss = w ** 2 + 0.5
    print(f"{step:<6} {w:<15.4f} {gradient:<15.4f} {update:<20.4f} {loss:.4f}")
    w = w + update  # Same as w = w - learning_rate * gradient

print("-"*60)
print(f"\nStarted at w = 2.5 (loss = 6.75)")
# The loop above applies 6 updates (steps 0 through 5), so w is the weight after 6 steps
print(f"After 6 steps: w = {w:.4f} (loss = {w**2 + 0.5:.4f})")
print(f"Getting closer to the minimum at w = 0 (loss = 0.5)!")

5.5 Learning Rate: How Fast to Adjust

The learning rate (α, alpha) controls how big each step is. It's one of the most important choices in training!

Parameters vs Hyperparameters

Before we dive in, let's clarify an important distinction:

| Term | What It Is | Examples | Who Sets It? |
|---|---|---|---|
| Parameters | Values the model LEARNS | Weights, Bias | The training algorithm |
| Hyperparameters | Settings WE choose before training | Learning rate, number of epochs | The human (you!) |

The learning rate is a hyperparameter - we choose it before training, and it affects HOW the model learns (but is not learned itself).

Why Learning Rate Matters So Much

The learning rate multiplies the gradient to determine the step size:

step = learning_rate × gradient
new_weight = old_weight - step

The Problem: Gradients can vary wildly:

  • Sometimes the gradient is 10.0 (steep slope)
  • Sometimes it's 0.001 (nearly flat)

The Learning Rate's Job: Scale these gradients to reasonable step sizes.

The Goldilocks Problem

| Learning Rate | Effect | Result |
|---|---|---|
| Too Large (α = 1.0) | Big steps | Overshoot! Miss the minimum, bounce around |
| Too Small (α = 0.001) | Tiny steps | Takes forever, might get stuck |
| Just Right (α = 0.1) | Medium steps | Steady progress toward minimum |

The Mathematics

Remember our update formula:

$$w_{new} = w_{old} - \alpha \cdot \text{gradient}$$

  • If gradient = 10 and α = 0.1: step size = 1.0 (reasonable)
  • If gradient = 10 and α = 1.0: step size = 10.0 (too big!)
  • If gradient = 10 and α = 0.001: step size = 0.01 (too small!)
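
The arithmetic above can be reproduced in a few lines. This is a minimal sketch; the names `gradient`, `learning_rates`, and `step_sizes` are illustrative and do not appear elsewhere in the notebook.

```python
# Step size = learning rate × gradient, for a fixed gradient of 10
gradient = 10.0
learning_rates = [0.1, 1.0, 0.001]

# Map each learning rate to the step it would produce
step_sizes = {lr: lr * gradient for lr in learning_rates}

for lr, step in step_sizes.items():
    print(f"alpha = {lr:<6} -> step size = {step:.2f}")
```

Whether a step of 1.0 is actually "reasonable" depends on the scale of the weights; the point is only that the same gradient yields very different steps under different learning rates.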

Committee Analogy

"The learning rate is how much the committee member adjusts after each mistake. Too much adjustment, and they overcorrect wildly. Too little, and they never improve. The right amount leads to steady learning."

Let's See All Three Scenarios:

cell 015
# =============================================================================
# VISUALIZE: Learning Rate Effects
# =============================================================================

def run_gradient_descent(start_w, learning_rate, steps=15):
    """Run gradient descent and return the path."""
    w = start_w
    path = [(w, w**2 + 0.5)]
    for _ in range(steps):
        gradient = 2 * w  # Derivative of w²
        w = w - learning_rate * gradient
        w = np.clip(w, -5, 5)  # Prevent explosion
        loss = w ** 2 + 0.5
        path.append((w, loss))
    return np.array(path)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
weight_values = np.linspace(-3, 3, 100)
loss_landscape = weight_values ** 2 + 0.5

scenarios = [
    (0.9, 'TOO LARGE (α=0.9)', 'red', 'Overshoots and bounces!'),
    (0.3, 'JUST RIGHT (α=0.3)', 'green', 'Steady progress!'),
    (0.05, 'TOO SMALL (α=0.05)', 'blue', 'Very slow progress...')
]

for ax, (lr, title, color, desc) in zip(axes, scenarios):
    ax.plot(weight_values, loss_landscape, 'k-', linewidth=1, alpha=0.3)
    ax.fill_between(weight_values, loss_landscape, alpha=0.1, color='gray')

    path = run_gradient_descent(start_w=2.5, learning_rate=lr)
    ax.plot(path[:, 0], path[:, 1], 'o-', color=color, markersize=6, linewidth=1.5)
    ax.scatter([path[0, 0]], [path[0, 1]], color=color, s=150, zorder=5, marker='s', label='Start')
    ax.scatter([path[-1, 0]], [path[-1, 1]], color='black', s=150, zorder=5, marker='*', label='End')

    ax.set_xlabel('Weight (w)', fontsize=11)
    ax.set_ylabel('Loss', fontsize=11)
    ax.set_title(f'{title}\n{desc}', fontsize=11, fontweight='bold')
    ax.set_xlim(-3, 3)
    ax.set_ylim(0, 10)
    ax.grid(True, alpha=0.3)
    ax.legend(loc='upper right', fontsize=9)

    # Show final loss
    ax.annotate(f'Final loss: {path[-1, 1]:.2f}', xy=(0, 8), fontsize=10, ha='center')

plt.tight_layout()
plt.show()

print("\nLEARNING RATE COMPARISON:")
print("="*60)
for lr, title, _, _ in scenarios:
    path = run_gradient_descent(start_w=2.5, learning_rate=lr)
    print(f"\n{title}")
    print(f"  Final weight: {path[-1, 0]:.4f}")
    print(f"  Final loss:   {path[-1, 1]:.4f}")
    print(f"  Optimal loss: 0.5000 (at w=0)")
    print(f"  Distance from optimal: {abs(path[-1, 1] - 0.5):.4f}")
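
Why each learning rate behaves the way it does can be read off the update rule. The following is a short worked derivation, assuming the toy loss $L(w) = w^2 + 0.5$ used in the cell above; it is specific to this parabola and does not carry over unchanged to other loss surfaces.

```latex
\begin{aligned}
w_{t+1} &= w_t - \alpha \cdot 2 w_t = (1 - 2\alpha)\, w_t \\
w_t &= (1 - 2\alpha)^t \, w_0 \\[4pt]
\alpha = 0.05 &: \quad 1 - 2\alpha = 0.9 \quad \text{(shrinks 10\% per step: slow)} \\
\alpha = 0.3 &: \quad 1 - 2\alpha = 0.4 \quad \text{(shrinks 60\% per step)} \\
\alpha = 0.9 &: \quad 1 - 2\alpha = -0.8 \quad \text{(sign flips every step, magnitude shrinks 20\%)} \\
\alpha = 1.0 &: \quad 1 - 2\alpha = -1 \quad \text{(bounces between } \pm w_0 \text{ forever)}
\end{aligned}
```

On this parabola the weight converges exactly when $|1 - 2\alpha| < 1$, that is $0 < \alpha < 1$. A negative factor is what shows up in the plot as bouncing from one side of the minimum to the other.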

5.6 The Gradient: Which Way is Down?

We've been using the word "gradient" - but what IS it exactly, and how do we calculate it?

What is a Gradient?

The gradient is the derivative (slope) of the loss with respect to each weight. It tells us:

  • How much the loss changes when we change a weight
  • Which direction increases the loss (so we go the opposite way!)
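
Both bullet points can be checked numerically on the toy parabola from the earlier plots. This is a minimal sketch; `loss` and `numerical_slope` are hypothetical helper names introduced only for this illustration.

```python
def loss(w):
    """Toy loss from the earlier plots: a parabola with its minimum at w = 0."""
    return w ** 2 + 0.5

def numerical_slope(f, w, h=1e-5):
    """Central-difference estimate of df/dw at w."""
    return (f(w + h) - f(w - h)) / (2 * h)

slope_right = numerical_slope(loss, 2.0)   # to the right of the minimum
slope_left = numerical_slope(loss, -2.0)   # to the left of the minimum

print(f"slope at w =  2.0: {slope_right:+.4f}")
print(f"slope at w = -2.0: {slope_left:+.4f}")
```

A positive slope means the loss rises as w increases, so the downhill move is to decrease w; a negative slope means the opposite.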

Regular Derivatives vs Partial Derivatives

Regular derivative: When you have ONE variable.

  • Example: If f(x) = x², then df/dx = 2x

Partial derivative (∂): When you have MULTIPLE variables and you want to see the effect of changing just ONE while keeping others fixed.

  • Example: If f(x, y) = x² + y², then:
    • ∂f/∂x = 2x (how f changes when x changes, y held constant)
    • ∂f/∂y = 2y (how f changes when y changes, x held constant)
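
The "hold the other variable fixed" idea can be made concrete with a numerical estimate. This is a sketch under the example function above; `partial_x` and `partial_y` are hypothetical helper names, and the point (3, -1) is chosen arbitrarily.

```python
def f(x, y):
    """Example function from the text: f(x, y) = x² + y²."""
    return x ** 2 + y ** 2

def partial_x(func, x, y, h=1e-5):
    """Nudge x only; y is held constant."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)

def partial_y(func, x, y, h=1e-5):
    """Nudge y only; x is held constant."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

x0, y0 = 3.0, -1.0
df_dx = partial_x(f, x0, y0)   # analytic value: 2 * x0
df_dy = partial_y(f, x0, y0)   # analytic value: 2 * y0

# The gradient is simply the collection of all partial derivatives
gradient = (df_dx, df_dy)
print(gradient)
```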

In our Perceptron:

  • Loss depends on 9 weights + 1 bias = 10 variables
  • We need 10 partial derivatives (one for each parameter)
  • The gradient is the collection of ALL these partial derivatives

The Notation "w.r.t." (With Respect To)

You'll often see "gradient of L w.r.t. w" - this means "how does L change when we change w?"

∂L/∂w is read as "partial derivative of L with respect to w"

The Chain Rule: Breaking Down Complex Functions

Our Perceptron has multiple operations chained together:

x → [weighted sum] → z → [sigmoid] → ŷ → [BCE loss] → L
    w · x + b                                

To find how changing w affects the final loss L, we use the chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$$

This looks complicated, but each piece is simple!

The Beautiful Simplification

For sigmoid activation with BCE loss, all the calculus simplifies to:

$$\frac{\partial L}{\partial w} = (\hat{y} - y) \cdot x$$

And for the bias:

$$\frac{\partial L}{\partial b} = (\hat{y} - y)$$

That's it! The gradient is just:

  • (prediction - actual) × input
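
A sketch of where the simplification comes from, assuming the single-sample BCE loss with natural logarithm and the sigmoid activation used in this notebook:

```latex
\begin{aligned}
L &= -\big[\, y \ln \hat{y} + (1 - y) \ln(1 - \hat{y}) \,\big] \\
\frac{\partial L}{\partial \hat{y}} &= -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \\
\frac{\partial \hat{y}}{\partial z} &= \hat{y}(1 - \hat{y}), \qquad \frac{\partial z}{\partial w} = x, \qquad \frac{\partial z}{\partial b} = 1 \\
\frac{\partial L}{\partial w} &= \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y}) \cdot x = (\hat{y} - y) \cdot x \\
\frac{\partial L}{\partial b} &= \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y}) \cdot 1 = (\hat{y} - y)
\end{aligned}
```

The factor $\hat{y}(1 - \hat{y})$ from the sigmoid cancels the denominator from the BCE derivative, which is why the final formula is so short.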

Why This Formula Makes Intuitive Sense

| Part | Meaning | Intuition |
|---|---|---|
| $(\hat{y} - y)$ | Error | How wrong we are (and in which direction) |
| $x$ | Input | Which inputs contributed to the output |

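If we predicted too high (ŷ > y), the error is positive, so the update decreases the weights on the inputs that were active (positive). The larger the input, the larger the decrease, because that input had more influence on the output. Weights whose input was 0 get a zero gradient and stay unchanged.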
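
The "predicted too high" case can be traced with concrete numbers. This is a minimal sketch: the input pattern, the starting weights of 0.5, and α = 0.5 are illustrative assumptions, not values from the notebook's dataset or its `perceptron` object.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical flattened 3x3 horizontal line, labelled 0 ("not vertical")
x = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0], dtype=float)
y = 0.0

# Illustrative starting weights that make the neuron predict too high
w = np.full(9, 0.5)
b = 0.0
alpha = 0.5

y_hat_before = sigmoid(np.dot(w, x) + b)   # sigmoid(1.5), well above the label 0
error = y_hat_before - y                   # positive: predicted too high
grad_w = error * x                         # non-zero only where the input is 1

# One gradient descent step
w_new = w - alpha * grad_w
b_new = b - alpha * error
y_hat_after = sigmoid(np.dot(w_new, x) + b_new)

print(f"prediction before: {y_hat_before:.4f}")
print(f"prediction after:  {y_hat_after:.4f}")
print(f"weights (3x3):\n{w_new.reshape(3, 3)}")
```

Only the middle row of weights moves, because only those pixels were switched on, and the prediction for this input drops after the update.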

Let's Calculate Gradients Step by Step:

cell 017
# =============================================================================
# CALCULATING GRADIENTS: Step by Step
# =============================================================================

print("="*70)
print("CALCULATING GRADIENTS: Step by Step")
print("="*70)

# Use our Perceptron on a vertical line
x = vertical_flat.copy()
y_true = 1  # It IS vertical

# Get the prediction
y_pred = perceptron.forward(x)

print(f"\nInput (x): {x}")
print(f"Actual label (y): {y_true}")
print(f"Prediction (ŷ): {y_pred:.4f}")

# Calculate the error term
print("\n" + "-"*70)
print("STEP 1: Calculate the error term (ŷ - y)")
print("-"*70)
error = y_pred - y_true
print(f"\n  error = ŷ - y")
print(f"  error = {y_pred:.4f} - {y_true}")
print(f"  error = {error:.4f}")

if error > 0:
    print(f"\n  Interpretation: Error is POSITIVE ({error:.4f})")
    print(f"  This means we predicted TOO HIGH - need to decrease output")
else:
    print(f"\n  Interpretation: Error is NEGATIVE ({error:.4f})")
    print(f"  This means we predicted TOO LOW - need to increase output")

# Calculate gradient for weights
print("\n" + "-"*70)
print("STEP 2: Calculate gradient for each weight")
print("-"*70)
print(f"\n  Formula: ∂L/∂w = (ŷ - y) × x = error × x")
print(f"\n  For each weight w_i, the gradient is: error × x_i")

gradient_weights = error * x
print(f"\n  Gradients for all 9 weights:")
print(f"  error × x = {error:.4f} × {x}")
print(f"           = [{', '.join([f'{g:.4f}' for g in gradient_weights])}]")

# Show which weights should change
print(f"\n  Let's interpret this (as a 3x3 grid):")
grad_grid = gradient_weights.reshape(3, 3)
print(f"    {grad_grid[0]}")
print(f"    {grad_grid[1]}")
print(f"    {grad_grid[2]}")

print(f"\n  Notice: Only the middle column has non-zero gradients!")
print(f"  That's because only those pixels had value 1 in the input.")
print(f"  Weights for other pixels don't need to change (input was 0).")

# Calculate gradient for bias
print("\n" + "-"*70)
print("STEP 3: Calculate gradient for bias")
print("-"*70)
gradient_bias = error
print(f"\n  Formula: ∂L/∂b = (ŷ - y) = error")
print(f"  Bias gradient = {gradient_bias:.4f}")

# Show the update
print("\n" + "-"*70)
print("STEP 4: Apply the update (with learning rate α = 0.5)")
print("-"*70)
learning_rate = 0.5
print(f"\n  Update formula: w_new = w_old - α × gradient")
print(f"\n  For weight w₁ (position 1, middle column):")
old_w1 = perceptron.weights[1]
new_w1 = old_w1 - learning_rate * gradient_weights[1]
print(f"    w₁_new = {old_w1:.4f} - {learning_rate} × {gradient_weights[1]:.4f}")
print(f"    w₁_new = {old_w1:.4f} - {learning_rate * gradient_weights[1]:.4f}")
print(f"    w₁_new = {new_w1:.4f}")
print(f"\n  Since error was negative, w₁ INCREASED to make output higher next time!")

5.7 Backpropagation: Tracing the Blame

Backpropagation ("backprop") is the algorithm that calculates gradients by flowing errors BACKWARD through the network.

Why Backpropagation is Revolutionary

Before backpropagation was popularized in 1986 (by Rumelhart, Hinton, and Williams), training neural networks was incredibly difficult. People didn't know how to efficiently calculate gradients for networks with many layers.

The Problem: In a network with multiple layers, changing one weight affects EVERYTHING that comes after it. How do you figure out exactly how much each weight contributed to the final error?

The Solution: Backpropagation! It uses the chain rule to efficiently calculate ALL gradients in ONE backward pass through the network.

The Name Explained

  • Back: We start from the OUTPUT (the error) and work BACKWARD
  • Propagation: The error "propagates" (spreads) to earlier layers

Think of it like blame assignment:

  1. The final output was wrong
  2. What caused it to be wrong? Trace backward...
  3. These specific weights were most responsible
  4. Adjust them accordingly

For Our Single Neuron

In our simple Perceptron, backpropagation is straightforward:

    FORWARD PASS (left to right):
    x → [w·x + b] → z → [sigmoid] → ŷ → [compare to y] → Loss
    
    BACKWARD PASS (right to left):
    Loss → ∂L/∂ŷ → ∂L/∂z → ∂L/∂w, ∂L/∂b
           ↑          ↑          ↑
        "How does   "How does   "How does
         loss       loss        loss
         change     change      change
         with ŷ?"   with z?"    with w,b?"

Committee Analogy

"Backpropagation is like a post-mortem after a mistake. The committee asks: 'What went wrong?' They trace the decision back: 'The final vote was wrong. Why? The weighted sum was off. Why? These specific weights gave too much importance to the wrong evidence.' Then they adjust those specific weights."

The Backprop Flow for Our Perceptron

| Step | Calculation | Formula |
|---|---|---|
| 1 | Loss gradient w.r.t. output | $\frac{\partial L}{\partial \hat{y}}$ (from BCE) |
| 2 | Output gradient w.r.t. pre-activation | $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y})$ (sigmoid derivative) |
| 3 | Pre-activation gradient w.r.t. weights | $\frac{\partial z}{\partial w} = x$ |
| 4 | Chain them together | $\frac{\partial L}{\partial w} = (\hat{y} - y) \cdot x$ |

The beautiful thing: steps 1 and 2 combine to give us just $(\hat{y} - y)$!
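
The four steps in the table can be checked numerically. The sketch below multiplies the three chain-rule factors, compares the result with the simplified formula, and then verifies both against a finite-difference estimate. The input, weights, and bias are illustrative values rather than the notebook's data, and step 1 uses the standard BCE derivative $(\hat{y} - y) / (\hat{y}(1 - \hat{y}))$, which the table does not spell out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, y_hat):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Illustrative values (assumptions for this sketch)
x = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0], dtype=float)
y = 1.0
w = np.linspace(-0.4, 0.4, 9)
b = 0.1

# Forward pass
z = np.dot(w, x) + b
y_hat = sigmoid(z)

# Backward pass, one factor per table row
dL_dyhat = (y_hat - y) / (y_hat * (1 - y_hat))   # step 1: BCE derivative
dyhat_dz = y_hat * (1 - y_hat)                   # step 2: sigmoid derivative
dz_dw = x                                        # step 3

grad_chain = dL_dyhat * dyhat_dz * dz_dw         # step 4: multiply the factors
grad_short = (y_hat - y) * x                     # the simplified formula

# Independent check: central finite differences on each weight
h = 1e-6
grad_numeric = np.zeros_like(w)
for i in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += h
    w_minus[i] -= h
    loss_plus = bce(y, sigmoid(np.dot(w_plus, x) + b))
    loss_minus = bce(y, sigmoid(np.dot(w_minus, x) + b))
    grad_numeric[i] = (loss_plus - loss_minus) / (2 * h)

print("chain rule == shortcut:        ", np.allclose(grad_chain, grad_short))
print("finite differences == shortcut:", np.allclose(grad_numeric, grad_short, atol=1e-6))
```

Comparing an analytic gradient against finite differences like this is a common sanity check when writing backpropagation by hand.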


5.8 The Training Loop: Putting It All Together

Now we have all the pieces! Let's build the complete training algorithm.

Why Do We Need a Loop?

A single gradient descent step makes only a TINY improvement. To go from random weights to good weights, we need MANY small steps.

Illustrative example (the numbers show a typical trend, not measured results):

  • Start: Loss = 0.7, Accuracy = 50%
  • After 1 step: Loss = 0.69, Accuracy = 51% (tiny improvement)
  • After 10 steps: Loss = 0.5, Accuracy = 65%
  • After 100 steps: Loss = 0.1, Accuracy = 95%

Each step nudges the weights slightly. Over many steps, these tiny nudges accumulate into major improvements!

Why Multiple Epochs?

Problem: One pass through the data isn't enough.

  • With 100 samples, we only make 100 weight updates
  • The model might not have "seen" enough patterns
  • Early samples were processed with very different weights than late samples

Solution: Go through the data MULTIPLE times (epochs).

  • Epoch 1: First exposure to all samples
  • Epoch 2: Second look, with better weights now
  • Epoch 3: Refinement continues
  • ...

Each epoch, the model gets better at the task!

The Training Loop Algorithm

FOR each epoch (pass through the data):
    FOR each sample (x, y) in the training data:
        
        1. FORWARD PASS: Get prediction
           ŷ = sigmoid(w · x + b)
        
        2. COMPUTE LOSS: How wrong?
           L = BCE(y, ŷ)
        
        3. COMPUTE GRADIENTS: Which way to go?
           ∂L/∂w = (ŷ - y) × x
           ∂L/∂b = (ŷ - y)
        
        4. UPDATE WEIGHTS: Take a step downhill
           w = w - α × ∂L/∂w
           b = b - α × ∂L/∂b
    
    Record average loss for this epoch

Key Terms

| Term | Meaning |
|---|---|
| Epoch | One complete pass through all training data |
| Sample | One training example (input + label) |
| Batch | Group of samples processed together (we use batch size = 1 here) |
| Iteration | One weight update |
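
How these terms relate can be shown with a small calculation. This is a sketch; `updates_per_run` is a hypothetical helper, and it assumes a final partial batch still counts as one update.

```python
import math

def updates_per_run(n_samples, batch_size, epochs):
    """Number of weight updates (iterations) in a full training run."""
    updates_per_epoch = math.ceil(n_samples / batch_size)
    return updates_per_epoch * epochs

# Batch size 1, as in this notebook: one update per sample
per_sample = updates_per_run(n_samples=100, batch_size=1, epochs=50)
# Same data in mini-batches of 10 (hypothetical setting, not used here)
mini_batch = updates_per_run(n_samples=100, batch_size=10, epochs=50)

print(per_sample)   # 5000
print(mini_batch)   # 500
```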

Let's Build Our Trainable Perceptron!

cell 020
# =============================================================================
# THE TRAINABLE PERCEPTRON: Complete Implementation
# =============================================================================

class TrainablePerceptron:
    """
    A Perceptron that can learn from examples!

    This class includes:
    - Forward pass (prediction)
    - Loss calculation (BCE)
    - Gradient calculation (backpropagation)
    - Weight update (gradient descent)
    - Full training loop
    """

    def __init__(self, n_inputs):
        """Initialize with random weights."""
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0
        self.n_inputs = n_inputs

        # For tracking training progress
        self.loss_history = []
        self.accuracy_history = []

    def forward(self, x):
        """Forward pass: compute prediction."""
        x = np.array(x).flatten()
        z = np.dot(self.weights, x) + self.bias
        return sigmoid(z)

    def predict(self, x):
        """Binary prediction (0 or 1)."""
        return 1 if self.forward(x) >= 0.5 else 0

    def compute_loss(self, y_true, y_pred):
        """Compute BCE loss for one sample."""
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    def train(self, X, y, learning_rate=0.1, epochs=100, verbose=True):
        """
        Train the Perceptron using gradient descent.

        Parameters:
            X: Training inputs, shape (n_samples, n_features)
            y: Training labels, shape (n_samples,)
            learning_rate: Step size for gradient descent
            epochs: Number of passes through the data
            verbose: Whether to print progress

        Returns:
            List of losses for each epoch
        """
        self.loss_history = []
        self.accuracy_history = []

        if verbose:
            print("="*70)
            print("TRAINING STARTED")
            print("="*70)
            print(f"  Samples: {len(X)}")
            print(f"  Epochs: {epochs}")
            print(f"  Learning rate: {learning_rate}")
            print()

        for epoch in range(epochs):
            total_loss = 0
            correct = 0

            # Go through each training sample
            for i in range(len(X)):
                xi = X[i]  # Input
                yi = y[i]  # True label

                # ===== STEP 1: FORWARD PASS =====
                y_pred = self.forward(xi)

                # ===== STEP 2: COMPUTE LOSS =====
                loss = self.compute_loss(yi, y_pred)
                total_loss += loss

                # Count correct predictions
                if (y_pred >= 0.5 and yi == 1) or (y_pred < 0.5 and yi == 0):
                    correct += 1

                # ===== STEP 3: COMPUTE GRADIENTS =====
                # The beautiful simplification: gradient = (prediction - actual) * input
                error = y_pred - yi
                gradient_weights = error * xi
                gradient_bias = error

                # ===== STEP 4: UPDATE WEIGHTS =====
                self.weights = self.weights - learning_rate * gradient_weights
                self.bias = self.bias - learning_rate * gradient_bias

            # Record progress
            avg_loss = total_loss / len(X)
            accuracy = correct / len(X)
            self.loss_history.append(avg_loss)
            self.accuracy_history.append(accuracy)

            # Print progress every 10 epochs
            if verbose and (epoch + 1) % 10 == 0:
                print(f"  Epoch {epoch+1:3d}/{epochs}: Loss = {avg_loss:.4f}, Accuracy = {accuracy*100:.1f}%")

        if verbose:
            print()
            print("="*70)
            print("TRAINING COMPLETE!")
            print("="*70)
            print(f"  Final Loss: {self.loss_history[-1]:.4f}")
            print(f"  Final Accuracy: {self.accuracy_history[-1]*100:.1f}%")

        return self.loss_history

print("TrainablePerceptron class created!")
print("Now let's train it and watch it learn...")

5.9 Watching It Learn!

This is the moment we've been building toward. Let's train our Perceptron and watch it transform from a confused guesser into an expert line detector!

What to Watch For

During training, you'll see:

  1. Loss decreasing - The model is making fewer/smaller mistakes
  2. Accuracy increasing - More predictions are correct
  3. Eventually plateauing - When the model has learned all it can

Convergence: When Has the Model Learned Enough?

Convergence means the model has stopped improving significantly. Signs of convergence:

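| Sign | What It Looks Like | What It Means |
|---|---|---|
| Loss plateaus | Loss curve flattens out | Little or no further improvement with the current settings |
| Loss oscillates | Jumps up and down slightly | Near the minimum, but the steps are too large to settle |
| Accuracy stable | Stays at same level | Model has learned what it can from this data |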

When to Stop Training:

  • When loss stops decreasing for several epochs
  • When accuracy reaches acceptable level (e.g., 95%+)
  • When you've run out of patience!
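
The first stopping rule can be turned into a simple check on the recorded losses. This is a minimal sketch: `has_plateaued`, `patience`, and `min_delta` are hypothetical names and thresholds chosen for illustration, not part of the notebook's `TrainablePerceptron` class.

```python
def has_plateaued(loss_history, patience=5, min_delta=1e-3):
    """Return True if the best loss of the last `patience` epochs is not at
    least `min_delta` better than the best loss seen before them."""
    if len(loss_history) <= patience:
        return False  # Not enough history to judge yet
    best_before = min(loss_history[:-patience])
    best_recent = min(loss_history[-patience:])
    return best_before - best_recent < min_delta

# Two made-up loss curves for demonstration
still_improving = [0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
flat = [0.70, 0.30, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]

print(has_plateaued(still_improving))  # False
print(has_plateaued(flat))             # True
```

Under these assumptions, a check like this could be run on a per-epoch loss list (such as `model.loss_history` after training) to decide whether more epochs are likely to help.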

cell 022
# =============================================================================
# TRAINING THE PERCEPTRON: Watch It Learn!
# =============================================================================

# Create a fresh Perceptron
np.random.seed(42)  # For reproducibility
model = TrainablePerceptron(n_inputs=9)

# Check initial performance (before training)
print("BEFORE TRAINING:")
print("-"*40)
correct_before = sum(model.predict(X_train[i]) == y_train[i] for i in range(len(X_train)))
print(f"Accuracy: {correct_before}/{len(y_train)} = {correct_before/len(y_train)*100:.1f}%")
print(f"(This is basically random guessing)")
print()

# Train the model!
losses = model.train(X_train, y_train, learning_rate=0.5, epochs=50)

# Check final performance
print("\nAFTER TRAINING:")
print("-"*40)
correct_after = sum(model.predict(X_train[i]) == y_train[i] for i in range(len(X_train)))
print(f"Accuracy: {correct_after}/{len(y_train)} = {correct_after/len(y_train)*100:.1f}%")
cell 023
# =============================================================================
# VISUALIZE THE LEARNING PROGRESS
# =============================================================================

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Loss over time
ax1 = axes[0]
ax1.plot(model.loss_history, 'b-', linewidth=2)
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Loss (BCE)', fontsize=12)
ax1.set_title('Loss Decreasing Over Time', fontsize=12, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.annotate(f'Start: {model.loss_history[0]:.2f}', xy=(0, model.loss_history[0]),
             xytext=(5, model.loss_history[0]+0.1), fontsize=10)
ax1.annotate(f'End: {model.loss_history[-1]:.2f}', xy=(len(model.loss_history)-1, model.loss_history[-1]),
             xytext=(len(model.loss_history)-15, model.loss_history[-1]+0.1), fontsize=10)

# Plot 2: Accuracy over time
ax2 = axes[1]
ax2.plot([a*100 for a in model.accuracy_history], 'g-', linewidth=2)
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Accuracy (%)', fontsize=12)
ax2.set_title('Accuracy Increasing Over Time', fontsize=12, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.set_ylim(0, 105)
ax2.axhline(y=50, color='red', linestyle='--', alpha=0.5, label='Random guessing')
ax2.legend()

# Plot 3: Learned weights (as 3x3 grid)
ax3 = axes[2]
weights_grid = model.weights.reshape(3, 3)
im = ax3.imshow(weights_grid, cmap='RdBu', vmin=-2, vmax=2)
ax3.set_title('Learned Weights\n(What the model looks for)', fontsize=12, fontweight='bold')
for i in range(3):
    for j in range(3):
        ax3.text(j, i, f'{weights_grid[i,j]:.2f}', ha='center', va='center', fontsize=11, fontweight='bold')
plt.colorbar(im, ax=ax3)
ax3.set_xticks([])
ax3.set_yticks([])

plt.tight_layout()
plt.show()

print("\nKEY OBSERVATIONS:")
print("="*60)
print(f"1. Loss decreased from {model.loss_history[0]:.4f} to {model.loss_history[-1]:.4f}")
print(f"2. Accuracy improved from ~50% to {model.accuracy_history[-1]*100:.1f}%")
print(f"3. The learned weights show HIGH values in the middle column!")
print(f"   This is exactly what we'd expect for a vertical line detector!")
cell 024
# =============================================================================
# TEST ON CANONICAL EXAMPLES: Before vs After
# =============================================================================

print("="*70)
print("TESTING ON CANONICAL EXAMPLES")
print("="*70)

# Test on vertical line
v_pred = model.forward(vertical_flat)
print(f"\nVertical Line:")
print(f"  Prediction: {v_pred:.4f} ({v_pred*100:.1f}% confident it's vertical)")
print(f"  Actual: 1 (IS vertical)")
print(f"  Result: {'CORRECT!' if v_pred >= 0.5 else 'Wrong'}")

# Test on horizontal line
h_pred = model.forward(horizontal_flat)
print(f"\nHorizontal Line:")
print(f"  Prediction: {h_pred:.4f} ({h_pred*100:.1f}% confident it's vertical)")
print(f"  Actual: 0 (NOT vertical)")
print(f"  Result: {'CORRECT!' if h_pred < 0.5 else 'Wrong'}")

print("\n" + "="*70)
print("THE PERCEPTRON HAS LEARNED!")
print("="*70)
print("""
From random weights giving ~50% accuracy,
our Perceptron now confidently classifies lines!

It learned that:
  - The MIDDLE COLUMN matters most for vertical lines
  - Other pixels should have low/negative weights

This happened automatically through gradient descent -
we never told it what a vertical line looks like!
""")

Part 5 Summary: What We've Learned

This was the most important notebook in the series! You've learned the core of how neural networks learn.

Key Concepts Mastered

| Concept | Formula | Why It Matters |
|---|---|---|
| Error | ŷ - y | Measures how wrong we are (and in which direction) |
| MSE Loss | (1/n)Σ(y-ŷ)² | Penalizes errors, larger errors more |
| BCE Loss | -[y·log(ŷ) + (1-y)·log(1-ŷ)] | Better for classification, harsh on confident mistakes |
| Gradient | (ŷ - y) · x | Direction and magnitude of improvement |
| Gradient Descent | w = w - α·∇L | Algorithm to find better weights |
| Learning Rate | α | Controls step size (too big = overshoot, too small = slow) |
| Backpropagation | Chain rule backward | Calculates gradients for all weights |

The Training Loop (Memorize This!)

for epoch in range(epochs):
    for x, y in training_data:
        y_pred = forward(x)           # 1. Predict
        loss = bce(y, y_pred)         # 2. Measure error
        gradient = (y_pred - y) * x   # 3. Calculate gradient
        weights -= lr * gradient      # 4. Update weights
        bias -= lr * (y_pred - y)     #    ...and the bias

Committee Analogy Progress

| Part | Committee Story |
|---|---|
| Part 1-3 | Member learned procedures (math, weights, voting) |
| Part 4 | First case - confused, random guessing |
| Part 5 | Member receives feedback and LEARNS! |
| Part 6 | (Next) Evaluating the trained expert |

The Big Picture

Before Training: Random weights → Random predictions → ~50% accuracy

After Training: Learned weights → Meaningful predictions → ~95%+ accuracy

The Perceptron discovered ON ITS OWN that vertical lines have pixels in the middle column!


Knowledge Check

cell 026
# =============================================================================
# KNOWLEDGE CHECK - Part 5
# =============================================================================

print("KNOWLEDGE CHECK - Part 5: Training")
print("="*60)
print("\nAnswer these questions to test your understanding:\n")

questions = [
    {
        "q": "1. Why do we square errors in MSE?",
        "options": [
            "A) To make the math easier",
            "B) To prevent positive and negative errors from canceling out",
            "C) To make errors smaller",
            "D) Because computers prefer square numbers"
        ],
        "answer": "B",
        "explanation": "Squaring makes all errors positive, so they add up rather than cancel. It also penalizes larger errors more."
    },
    {
        "q": "2. Why is BCE preferred over MSE for classification?",
        "options": [
            "A) BCE is faster to compute",
            "B) BCE uses less memory",
            "C) BCE severely punishes confident wrong predictions",
            "D) BCE always gives lower values"
        ],
        "answer": "C",
        "explanation": "BCE uses logarithms which give very large penalties when the model is confident but wrong (e.g., predicting 0.01 when answer is 1)."
    },
    {
        "q": "3. What happens if the learning rate is too high?",
        "options": [
            "A) Training is faster and better",
            "B) The model overshoots the minimum and may never converge",
            "C) The model learns more features",
            "D) Nothing bad, higher is always better"
        ],
        "answer": "B",
        "explanation": "A high learning rate causes big jumps that overshoot the minimum, causing the loss to bounce around or even increase."
    },
    {
        "q": "4. The gradient formula for our Perceptron is (ŷ - y) × x. What does the 'x' part mean?",
        "options": [
            "A) Larger inputs get larger weight updates",
            "B) The input is added to the gradient",
            "C) X marks the spot",
            "D) Nothing, it's just mathematical convention"
        ],
        "answer": "A",
        "explanation": "The input 'x' determines which weights contributed to the output. Weights connected to larger inputs get larger updates because they had more influence."
    },
    {
        "q": "5. What is an 'epoch' in training?",
        "options": [
            "A) One weight update",
            "B) One forward pass",
            "C) One complete pass through all training data",
            "D) When the model reaches 100% accuracy"
        ],
        "answer": "C",
        "explanation": "An epoch is one complete pass through the entire training dataset. We typically train for many epochs until the model converges."
    }
]

for q in questions:
    print(q["q"])
    for opt in q["options"]:
        print(f"   {opt}")
    print()

print("\n" + "="*60)
print("Scroll down for answers...")
print("="*60)
cell 027
# =============================================================================
# ANSWERS - Knowledge Check Part 5
# =============================================================================

print("ANSWERS - Part 5 Knowledge Check")
print("="*60)

for i, q in enumerate(questions, 1):
    print(f"\n{i}. Answer: {q['answer']}")
    print(f"   {q['explanation']}")

print("\n" + "="*60)
print("How did you do?")
print("  5/5: Training Master!")
print("  4/5: Solid understanding!")
print("  3/5: Review the sections you missed")
print("  <3:  Re-read Part 5 - these concepts are crucial!")
print("="*60)

What's Next?

Congratulations! You've completed the most important notebook in this series!

You now understand how neural networks learn - loss functions, gradient descent, and backpropagation are the foundation of ALL deep learning.

Coming Up in Part 6: Evaluation - The Trained Expert

  • Training vs Inference - Learning mode vs using mode
  • Accuracy Metrics - Precision, recall, F1 score
  • Confusion Matrix - Detailed prediction breakdown
  • Interpretability - What did the model actually learn?

Continue to Part 6: part_6_evaluation.ipynb


"The Perceptron has learned. Now it's time to see what it REALLY knows."

The Brain's Decision Committee - From Confusion to Competence

Illustrated step

  • Loss (how wrong the vote was): Loss turns a bad decision into a number the model can reduce.
  • Gradient descent (walk downhill): Each step moves weights in the direction that lowers loss.
  • Backpropagation (tracing blame): The committee works backward to see which weights caused the mistake.
