Backpropagation · Part 5 · 60 min · intermediate

Learning from mistakes

Measure error, apply gradient descent, tune learning rate, and implement the training loop.


Neural Network Fundamentals

Part 5: Training - Learning from Mistakes

The Brain's Decision Committee - Chapter 5

The Story So Far...

In Part 4, our committee member attempted their first classification task. They looked at images of vertical and horizontal lines and tried to identify them. The results were... not great. With random weights, they achieved about 50% accuracy - no better than flipping a coin.

But here's the beautiful thing about neural networks: they can learn from their mistakes.

In this notebook, we'll teach our Perceptron how to improve. We'll show it examples, tell it when it's wrong, and let it gradually adjust its weights until it becomes an expert line detector.

This is training - the heart of machine learning.

What You'll Learn in Part 5

This is one of the most important notebooks in the series. By the end, you will understand:

Loss Functions - How to measure "how wrong" a prediction is
Why We Square Errors - The mathematical reason behind MSE
Binary Cross-Entropy - The preferred loss for classification (and why!)
Gradient Descent - The algorithm that finds better weights
Learning Rate - How fast to adjust (and what happens if it's wrong)
The Gradient - The direction of steepest improvement
Backpropagation - How errors flow backward through the network
The Training Loop - Putting it all together
Watch It Learn - See the Perceptron go from 50% to 95%+ accuracy!

Prerequisites

Make sure you've completed:

Parts 0-1: Matrices (neural_network_fundamentals.ipynb)
Part 2: Single Neuron (part_2_single_neuron.ipynb)
Part 3: Activation Functions (part_3_activation_functions.ipynb)
Part 4: The Perceptron (part_4_perceptron.ipynb)

Setup: Import Dependencies and Recreate Our Tools

Let's bring in everything we built in previous notebooks.

cell 003

# =============================================================================
# PART 5: TRAINING - SETUP AND IMPORTS
# =============================================================================

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, clear_output

# Try to import ipywidgets for interactive features
try:
    import ipywidgets as widgets
    WIDGETS_AVAILABLE = True
except ImportError:
    WIDGETS_AVAILABLE = False
    print("Note: ipywidgets not installed. Interactive features will be limited.")

# Set up matplotlib style
style_options = ['seaborn-v0_8-whitegrid', 'seaborn-whitegrid', 'ggplot', 'default']
for style in style_options:
    try:
        plt.style.use(style)
        break
    except OSError:
        continue

plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 12
np.random.seed(42)

print("Setup complete!")
print("="*60)

cell 004


# =============================================================================
# RECREATE OUR TOOLS FROM PREVIOUS NOTEBOOKS
# =============================================================================

# -----------------------------------------------------------------------------
# Our canonical line images (from Part 1)
# -----------------------------------------------------------------------------
vertical_line = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
horizontal_line = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]])
vertical_flat = vertical_line.flatten()
horizontal_flat = horizontal_line.flatten()

# -----------------------------------------------------------------------------
# Dataset generator (from Part 4)
# -----------------------------------------------------------------------------
def generate_line_dataset(n_samples=100, noise_level=0.0, seed=None):
    """Generate vertical (label=1) and horizontal (label=0) line images."""
    if seed is not None:
        np.random.seed(seed)
    
    X, y = [], []
    
    for i in range(n_samples):
        image = np.zeros((3, 3))
        
        if i < n_samples // 2:  # Vertical lines
            col = np.random.randint(0, 3)
            image[:, col] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(1)
        else:  # Horizontal lines
            row = np.random.randint(0, 3)
            image[row, :] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(0)
    
    X, y = np.array(X), np.array(y)
    shuffle_idx = np.random.permutation(n_samples)
    return X[shuffle_idx], y[shuffle_idx]

# -----------------------------------------------------------------------------
# Sigmoid activation function (from Part 3)
# -----------------------------------------------------------------------------
def sigmoid(z):
    """Sigmoid activation: maps any value to range (0, 1)."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# -----------------------------------------------------------------------------
# Basic Perceptron class (from Part 4) - We'll add training later!
# -----------------------------------------------------------------------------
class Perceptron:
    """A single-layer Perceptron for binary classification."""
    
    def __init__(self, n_inputs):
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0
        self.n_inputs = n_inputs
    
    def forward(self, x):
        """Compute the forward pass."""
        x = np.array(x).flatten()
        z = np.dot(self.weights, x) + self.bias
        return sigmoid(z)
    
    def predict(self, x):
        """Make a binary prediction (0 or 1)."""
        return 1 if self.forward(x) >= 0.5 else 0

# Generate our training dataset
X_train, y_train = generate_line_dataset(n_samples=100, noise_level=0.0, seed=42)

print("Tools recreated from previous notebooks!")
print(f"  - Vertical/Horizontal line templates")
print(f"  - Dataset generator")
print(f"  - Sigmoid activation")
print(f"  - Basic Perceptron class")
print(f"\nTraining dataset: {len(X_train)} samples")
print(f"  - {sum(y_train)} vertical lines (label=1)")
print(f"  - {len(y_train) - sum(y_train)} horizontal lines (label=0)")

5.1 The Error: How Wrong Are We?

Before we can improve, we need to measure how wrong our predictions are. This is the foundation of learning.

The Basic Idea

When our Perceptron makes a prediction, we compare it to the actual answer:

Error = Actual Value - Predicted Value
      = y - ŷ

A Concrete Example

Let's say we show the Perceptron a vertical line (actual label y = 1):

Scenario	Prediction (ŷ)	Error (y - ŷ)	Interpretation
Perfect	1.0	1.0 - 1.0 = 0.0	No error!
Good	0.9	1.0 - 0.9 = 0.1	Small error
Bad	0.3	1.0 - 0.3 = 0.7	Big error!
Terrible	0.0	1.0 - 0.0 = 1.0	Maximum error

Committee Analogy

"The committee member votes on a case. After the vote, the supervisor reveals the correct answer. The difference between their vote and the correct answer is their ERROR - and they need to learn from it."

Why Error Matters

The error tells us two things:

How much to adjust (larger error = bigger adjustment needed)
Which direction to adjust (positive error = increase output, negative = decrease)

Let's see this with real numbers:

cell 006


# =============================================================================
# CALCULATING ERROR: Step by Step
# =============================================================================

# Create an untrained Perceptron
perceptron = Perceptron(n_inputs=9)

print("="*70)
print("CALCULATING ERROR: Step by Step")
print("="*70)

# Test on a vertical line (actual label = 1)
print("\n" + "-"*70)
print("Example 1: Testing on a VERTICAL line")
print("-"*70)

y_actual = 1  # The true label (it IS a vertical line)
y_predicted = perceptron.forward(vertical_flat)

print(f"\n  Step 1: Get the actual label")
print(f"          y (actual) = {y_actual}")
print(f"          This means: 'This IS a vertical line'")

print(f"\n  Step 2: Get the prediction from our Perceptron")
print(f"          ŷ (predicted) = {y_predicted:.4f}")
print(f"          This means: '{y_predicted*100:.1f}% confident it's vertical'")

print(f"\n  Step 3: Calculate the error")
print(f"          error = y - ŷ")
print(f"          error = {y_actual} - {y_predicted:.4f}")
error_vertical = y_actual - y_predicted
print(f"          error = {error_vertical:.4f}")

print(f"\n  Interpretation:")
if error_vertical > 0:
    print(f"          The error is POSITIVE ({error_vertical:.4f})")
    print(f"          This means: The Perceptron underestimated! It should output HIGHER.")
else:
    print(f"          The error is NEGATIVE ({error_vertical:.4f})")
    print(f"          This means: The Perceptron overestimated! It should output LOWER.")

# Test on a horizontal line (actual label = 0)
print("\n" + "-"*70)
print("Example 2: Testing on a HORIZONTAL line")
print("-"*70)

y_actual_h = 0  # The true label (it is NOT a vertical line)
y_predicted_h = perceptron.forward(horizontal_flat)

print(f"\n  Step 1: Get the actual label")
print(f"          y (actual) = {y_actual_h}")
print(f"          This means: 'This is NOT a vertical line'")

print(f"\n  Step 2: Get the prediction from our Perceptron")
print(f"          ŷ (predicted) = {y_predicted_h:.4f}")
print(f"          This means: '{y_predicted_h*100:.1f}% confident it's vertical'")

print(f"\n  Step 3: Calculate the error")
print(f"          error = y - ŷ")
print(f"          error = {y_actual_h} - {y_predicted_h:.4f}")
error_horizontal = y_actual_h - y_predicted_h
print(f"          error = {error_horizontal:.4f}")

print(f"\n  Interpretation:")
if error_horizontal > 0:
    print(f"          The error is POSITIVE ({error_horizontal:.4f})")
    print(f"          This means: The Perceptron underestimated!")
elif error_horizontal < 0:
    print(f"          The error is NEGATIVE ({error_horizontal:.4f})")
    print(f"          This means: The Perceptron overestimated! It should output LOWER.")
else:
    print(f"          The error is ZERO - perfect prediction!")

5.2 Loss Functions: The Teacher's Grading System

Before we look at specific formulas, let's understand what a loss function is and why we need one.

What is a Loss Function?

A loss function (also called a "cost function" or "objective function") is a mathematical formula that:

Takes in predictions and actual labels
Outputs a single number representing "how wrong" the predictions are
Lower is better - a loss of 0 means perfect predictions

Why Do We Need Loss Functions?

Think about learning anything - you need feedback to improve. The loss function provides that feedback:

Without Loss Function	With Loss Function
"Your predictions are wrong"	"Your predictions are 0.25 wrong"
Vague, not actionable	Precise, quantifiable
Can't compare methods	Can compare: 0.25 vs 0.15
Can't track progress	Can see improvement over time

The Role of Loss in Training

Loss functions are the heart of machine learning. The entire training process is:

Make predictions
Calculate loss (how wrong?)
Adjust weights to reduce loss
Repeat

The weights that minimize loss are the "best" weights - that's the entire goal of training!
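The four steps above can be sketched on a toy problem. This is a minimal illustration, not the Perceptron training loop built later in this notebook: the data, the single weight `w`, and the `learning_rate` value are all assumptions chosen so the loop is easy to follow, and the weight-adjustment rule (gradient descent) is explained in later sections.

```python
import numpy as np

# Toy data (hypothetical): the targets are exactly 2 * x,
# so the "best" weight is 2.0
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x

w = 0.0               # a single weight, starting from a poor guess
learning_rate = 0.05  # assumed step size for this illustration
losses = []

for step in range(50):
    y_pred = w * x                          # 1. make predictions
    loss = np.mean((y - y_pred) ** 2)       # 2. calculate loss (how wrong?)
    grad = np.mean(-2 * x * (y - y_pred))   # slope of the loss with respect to w
    w -= learning_rate * grad               # 3. adjust the weight to reduce loss
    losses.append(loss)                     # 4. repeat

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.2e}, w = {w:.4f}")
```

The loss shrinks on every pass and `w` settles near 2.0, the weight that minimizes the loss for this toy data.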

Committee Analogy

"The loss function is like a performance review score. Every time the committee member makes a decision, they get a score. A perfect decision scores 0. A terrible decision scores high. The member's goal is to adjust their behavior to minimize this score over time."

5.2.1 Mean Squared Error (MSE): Our First Loss Function

Now let's look at a specific loss function: Mean Squared Error (MSE).

Simple error (y - ŷ) has a problem: positive and negative errors can cancel out!

Example: If we have two predictions:

Prediction 1: error = +0.5 (underestimated)
Prediction 2: error = -0.5 (overestimated)
Average error = (+0.5 + -0.5) / 2 = 0 ← Looks perfect, but it's NOT!
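The cancellation problem is easy to check numerically. A small sketch using the two errors from the example above (the variable names are illustrative only):

```python
import numpy as np

# The two errors from the example: one underestimate, one overestimate
errors = np.array([0.5, -0.5])

mean_error = errors.mean()                   # opposite signs cancel out
mean_squared_error = (errors ** 2).mean()    # squaring removes the sign first

print(f"Average error:         {mean_error:.2f}")
print(f"Average squared error: {mean_squared_error:.2f}")
```

The plain average reports 0, while the average of the squared errors reports 0.25, which reflects that both predictions missed by 0.5.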

The Solution: Square the Errors

By squaring each error before averaging, we solve this problem:

$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}$

Let's break this formula down piece by piece:

Symbol	Meaning	Example
$n$	Number of samples	100 images
$y_{i}$	Actual label for sample $i$	1 (vertical)
$\hat{y}_{i}$	Predicted value for sample $i$	0.7
$(y_{i} - \hat{y}_{i})$	Error for sample $i$	1 - 0.7 = 0.3
$(y_{i} - \hat{y}_{i})^{2}$	Squared error for sample $i$	0.3² = 0.09
$\frac{1}{n} \sum_{i=1}^{n}$	Average over all $n$ samples	Mean of the squared errors

Why Square?

Squaring the errors has three important benefits:

No Cancellation: Positive and negative errors both become positive
Penalize Big Errors: A small error (0.1) becomes tiny (0.01), but a big error (0.9) becomes large (0.81)
Smooth Landscape: As a function of the error, the squared error is a smooth "bowl" (a parabola) with no sharp corners, which makes it easy to optimize (more on this later)

Let's Calculate MSE Step by Step:

cell 008


# =============================================================================
# MEAN SQUARED ERROR: Step by Step Calculation
# =============================================================================

print("="*70)
print("MEAN SQUARED ERROR (MSE): Step by Step")
print("="*70)

# Let's use 5 samples to make this clear
sample_actuals = np.array([1, 1, 0, 0, 1])       # True labels
sample_predictions = np.array([0.9, 0.6, 0.3, 0.1, 0.5])  # Our predictions

print("\nOur data:")
print(f"  Actual labels (y):      {sample_actuals}")
print(f"  Predictions (ŷ):        {sample_predictions}")

# Step 1: Calculate each error
print("\n" + "-"*70)
print("STEP 1: Calculate each error (y - ŷ)")
print("-"*70)
errors = sample_actuals - sample_predictions
print(f"\n  Sample 1: {sample_actuals[0]} - {sample_predictions[0]} = {errors[0]:.2f}")
print(f"  Sample 2: {sample_actuals[1]} - {sample_predictions[1]} = {errors[1]:.2f}")
print(f"  Sample 3: {sample_actuals[2]} - {sample_predictions[2]} = {errors[2]:.2f}")
print(f"  Sample 4: {sample_actuals[3]} - {sample_predictions[3]} = {errors[3]:.2f}")
print(f"  Sample 5: {sample_actuals[4]} - {sample_predictions[4]} = {errors[4]:.2f}")
print(f"\n  All errors: {errors}")

# Step 2: Square each error
print("\n" + "-"*70)
print("STEP 2: Square each error (to make all positive)")
print("-"*70)
squared_errors = errors ** 2
print(f"\n  Sample 1: ({errors[0]:.2f})² = {squared_errors[0]:.4f}")
print(f"  Sample 2: ({errors[1]:.2f})² = {squared_errors[1]:.4f}")
print(f"  Sample 3: ({errors[2]:.2f})² = {squared_errors[2]:.4f}")
print(f"  Sample 4: ({errors[3]:.2f})² = {squared_errors[3]:.4f}")
print(f"  Sample 5: ({errors[4]:.2f})² = {squared_errors[4]:.4f}")
print(f"\n  Squared errors: {squared_errors}")

# Step 3: Take the mean
print("\n" + "-"*70)
print("STEP 3: Take the mean (average)")
print("-"*70)
mse = np.mean(squared_errors)
print(f"\n  Sum of squared errors: {np.sum(squared_errors):.4f}")
print(f"  Number of samples: {len(squared_errors)}")
print(f"  MSE = Sum / n = {np.sum(squared_errors):.4f} / {len(squared_errors)}")
print(f"  MSE = {mse:.4f}")

# The MSE function
print("\n" + "-"*70)
print("THE MSE FUNCTION (for reuse)")
print("-"*70)

def mse_loss(y_true, y_pred):
    """
    Mean Squared Error loss function.
    
    Formula: MSE = (1/n) * Σ(y - ŷ)²
    
    Parameters:
        y_true: Array of actual labels (0 or 1)
        y_pred: Array of predicted probabilities (0 to 1)
    
    Returns:
        Single value representing average squared error
    """
    return np.mean((y_true - y_pred) ** 2)

# Verify our calculation
print(f"\n  Using our function: mse_loss(y, ŷ) = {mse_loss(sample_actuals, sample_predictions):.4f}")
print(f"  Our manual calculation: {mse:.4f}")
print(f"  Match: {'Yes!' if abs(mse_loss(sample_actuals, sample_predictions) - mse) < 0.0001 else 'No'}")

5.3 Binary Cross-Entropy: The Better Loss for Classification

MSE works, but for classification problems (like our V/H detection), there's a better loss function: Binary Cross-Entropy (BCE).

First, Let's Understand the Name

The name "Binary Cross-Entropy" has three parts:

Term	Meaning	Our Context
Binary	Two classes only	Vertical (1) or Horizontal (0)
Cross	Comparing two distributions	Comparing predictions vs reality
Entropy	Measure of uncertainty/surprise	How "surprised" we are by the outcome

Entropy comes from information theory. It measures uncertainty:

If something is certain (100% probability), entropy is 0 - no surprise!
If something is uncertain (50/50), entropy is high - maximum surprise!
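The two cases above can be checked with the standard entropy formula for a yes/no event, $H(p) = -p \log(p) - (1 - p) \log(1 - p)$. The helper name `binary_entropy` below is hypothetical (it is not part of this notebook's toolkit), and the sketch uses the natural log, so values are in nats.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in nats) of a yes/no event that happens with probability p."""
    p = np.asarray(p, dtype=float)
    q = 1.0 - p
    # Treat 0 * log(0) as 0, the usual convention (it is the limiting value).
    # The inner np.where swaps in 1.0 so np.log is never called on 0.
    term_p = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    term_q = np.where(q > 0, q * np.log(np.where(q > 0, q, 1.0)), 0.0)
    return -(term_p + term_q)

for p in [1.0, 0.9, 0.5]:
    print(f"p = {p}: entropy = {float(binary_entropy(p)):.3f}")
```

A certain outcome (p = 1.0) gives an entropy of 0, and the 50/50 case gives the largest value, log(2) ≈ 0.693.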

Cross-entropy compares what we PREDICTED against what ACTUALLY happened.

Why Not Just Use MSE?

MSE works for regression (predicting continuous values like house prices), but classification has a special property: we're predicting probabilities.

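The Problem with MSE: When the prediction is very wrong (e.g., predicting 0.01 for a true label of 1), the squared error is 0.99² ≈ 0.98. Because predictions and labels both lie between 0 and 1, the squared error for a single sample can never exceed 1. That's bad, but is it bad enough?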

Consider: predicting 0.01 when you should predict 1.0 means you were 99% confident and COMPLETELY wrong. That deserves a HUGE penalty!

BCE's Solution: BCE uses logarithms, which give much harsher penalties for confident wrong answers.
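To make the difference concrete, here is a short comparison of the two penalties for a sample whose true label is 1. It is a sketch only: the prediction values are picked for illustration, and `penalties` is just a list used to collect the results.

```python
import numpy as np

y_true = 1.0
penalties = []  # (prediction, squared error, log-based penalty)

for y_hat in [0.9, 0.5, 0.1, 0.01]:
    squared_error = (y_true - y_hat) ** 2   # per-sample MSE term
    log_penalty = -np.log(y_hat)            # per-sample BCE term when y = 1
    penalties.append((y_hat, squared_error, log_penalty))
    print(f"prediction = {y_hat:<5} squared error = {squared_error:.4f}   "
          f"-log(prediction) = {log_penalty:.4f}")
```

The squared error stays below 1 no matter how wrong the prediction is, while the log-based penalty keeps growing as the prediction approaches 0.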

The Logarithm: Why It's Perfect for This

The logarithm is a special mathematical function. Here's why it works for measuring surprise:

log(1) = 0 → If probability was 100%, no surprise at all
log(0.5) ≈ -0.69 → Uncertain, some surprise
log(0.1) ≈ -2.30 → Low probability, big surprise!
log(0.01) ≈ -4.61 → Very low probability, huge surprise!
log(p) heads to -∞ as p approaches 0 (log(0) itself is undefined) → Near-zero probability, unbounded surprise!

The negative sign flips these to positive loss values: -log(0.01) = 4.61

The Intuition: Measuring "Surprise"

Think of cross-entropy as measuring how surprised you are by the actual answer:

Prediction	Actual	BCE Value	Interpretation
0.99	1	0.01	"Not surprised at all - I expected this!"
0.5	1	0.69	"Somewhat surprised - I was uncertain"
0.01	1	4.61	"VERY surprised! I was confident it was NOT 1!"

The Mathematics

$BCE = -\frac{1}{n} \sum_{i=1}^{n} [y_{i} \cdot \log(\hat{y}_{i}) + (1 - y_{i}) \cdot \log(1 - \hat{y}_{i})]$

This looks scary! Let's break it down:

When the actual label y = 1 (it IS vertical):

The formula simplifies to: $-\log(\hat{y})$
If we predicted high (ŷ = 0.9): $- \log (0.9) = 0.105$ (low loss - good!)
If we predicted low (ŷ = 0.1): $- \log (0.1) = 2.303$ (high loss - bad!)

When the actual label y = 0 (it is NOT vertical):

The formula simplifies to: $-\log(1 - \hat{y})$
If we predicted low (ŷ = 0.1): $- \log (0.9) = 0.105$ (low loss - good!)
If we predicted high (ŷ = 0.9): $- \log (0.1) = 2.303$ (high loss - bad!)
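The four cases above can be reproduced with the single-sample form of the formula. This is a minimal sketch: `bce_single` is a hypothetical helper name, and it assumes the prediction is strictly between 0 and 1, so no clipping is needed here.

```python
import numpy as np

def bce_single(y, y_hat):
    """BCE for one sample: y is 0 or 1, y_hat is strictly between 0 and 1."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# (actual label, prediction) pairs from the two cases above
cases = [(1, 0.9), (1, 0.1), (0, 0.1), (0, 0.9)]
for y, y_hat in cases:
    print(f"y = {y}, prediction = {y_hat}: BCE = {bce_single(y, y_hat):.3f}")
```

Only one of the two terms is active for any sample: when y = 1 the second term is multiplied by 0, and when y = 0 the first term is.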

Committee Analogy

"BCE measures how embarrassed the committee member should be. If they confidently voted 'definitely vertical!' (0.99) and it turned out to be horizontal, they should be VERY embarrassed. The logarithm captures this severe penalty for confident wrong answers."

Let's Implement and Compare:

cell 010


# =============================================================================
# BINARY CROSS-ENTROPY: Step by Step
# =============================================================================

print("="*70)
print("BINARY CROSS-ENTROPY (BCE): Step by Step")
print("="*70)

# First, let's understand the log function
print("\n" + "-"*70)
print("UNDERSTANDING THE LOGARITHM")
print("-"*70)
print("""
The natural log (ln or log) has a special property:
  - log(1) = 0        (no surprise when probability matches reality)
  - log(0.5) = -0.69  (some surprise)
  - log(0.1) = -2.30  (very surprised!)
  - log(0.01) = -4.61 (extremely surprised!)

As the probability gets closer to 0, log goes to -infinity.
That's why BCE severely punishes confident wrong predictions!
""")

# Show the log curve
print("  Let's calculate -log(ŷ) for different predictions:")
predictions = [0.99, 0.9, 0.7, 0.5, 0.3, 0.1, 0.01]
print(f"\n  {'Prediction (ŷ)':<18} {'-log(ŷ)':<15} {'Interpretation'}")
print("  " + "-"*60)
for p in predictions:
    neg_log = -np.log(p)
    if neg_log < 0.5:
        interp = "Low loss (good prediction)"
    elif neg_log < 1.5:
        interp = "Medium loss"
    else:
        interp = "High loss (bad prediction!)"
    print(f"  {p:<18} {neg_log:<15.4f} {interp}")

print("\n" + "-"*70)
print("BCE CALCULATION FOR A SINGLE SAMPLE")
print("-"*70)

# Example 1: Actual is 1, prediction is 0.9 (good prediction)
y_true_1 = 1
y_pred_1 = 0.9

print(f"\n  Example 1: Actual y = {y_true_1}, Predicted ŷ = {y_pred_1}")
print(f"  (This is a GOOD prediction for a vertical line)")
print(f"\n  BCE formula: -[y * log(ŷ) + (1-y) * log(1-ŷ)]")
print(f"\n  Since y = 1, the (1-y) term becomes 0, so:")
print(f"  BCE = -[{y_true_1} * log({y_pred_1}) + 0]")
print(f"  BCE = -log({y_pred_1})")
print(f"  BCE = -({np.log(y_pred_1):.4f})")
bce_1 = -np.log(y_pred_1)
print(f"  BCE = {bce_1:.4f}")

# Example 2: Actual is 1, prediction is 0.1 (bad prediction)
y_true_2 = 1
y_pred_2 = 0.1

print(f"\n  Example 2: Actual y = {y_true_2}, Predicted ŷ = {y_pred_2}")
print(f"  (This is a BAD prediction for a vertical line)")
print(f"\n  Since y = 1:")
print(f"  BCE = -log({y_pred_2})")
print(f"  BCE = -({np.log(y_pred_2):.4f})")
bce_2 = -np.log(y_pred_2)
print(f"  BCE = {bce_2:.4f}")

print(f"\n  Notice: The bad prediction has {bce_2/bce_1:.1f}x higher loss!")

# The BCE function
print("\n" + "-"*70)
print("THE BCE FUNCTION (for reuse)")
print("-"*70)

def binary_cross_entropy(y_true, y_pred):
    """
    Binary Cross-Entropy loss function.
    
    Formula: BCE = -(1/n) * Σ[y*log(ŷ) + (1-y)*log(1-ŷ)]
    
    Parameters:
        y_true: Array of actual labels (0 or 1)
        y_pred: Array of predicted probabilities (0 to 1)
    
    Returns:
        Single value representing average cross-entropy loss
    """
    # Clip predictions to avoid log(0) which is undefined
    epsilon = 1e-15  # A tiny number
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    
    # Calculate BCE
    bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return bce

print("""
def binary_cross_entropy(y_true, y_pred):
    # Clip to avoid log(0) - would be undefined!
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    
    # BCE formula
    return -np.mean(y_true * np.log(y_pred) + 
                    (1 - y_true) * np.log(1 - y_pred))
""")

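As a quick check of the function above, the sketch below restates it so the block runs on its own, then applies it to a small batch. The label and prediction arrays are made-up illustration values, not data from this series.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    """Average binary cross-entropy, restated from the cell above."""
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Hypothetical batch: two vertical lines (label 1) and two horizontal lines (label 0)
y_true = np.array([1, 1, 0, 0])
good_preds = np.array([0.9, 0.8, 0.1, 0.2])  # confident and correct
bad_preds = np.array([0.1, 0.2, 0.9, 0.8])   # confident and wrong

loss_good = binary_cross_entropy(y_true, good_preds)
loss_bad = binary_cross_entropy(y_true, bad_preds)
print(f"Good predictions: BCE = {loss_good:.4f}")
print(f"Bad predictions:  BCE = {loss_bad:.4f}")

# Clipping keeps the loss finite even when a prediction is exactly 0 or 1
extreme = binary_cross_entropy(np.array([1]), np.array([0.0]))
print(f"Prediction of 0.0 for label 1: BCE = {extreme:.2f} (finite because of clipping)")
```

The batch loss is just the average of the per-sample losses worked out by hand earlier, so the same "confident and wrong costs a lot" behavior carries over to whole datasets.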

# =============================================================================
# VISUALIZE: MSE vs BCE
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Generate prediction values from 0.01 to 0.99
y_pred_range = np.linspace(0.01, 0.99, 100)

# When actual label is 1 (vertical line)
mse_when_y_is_1 = (1 - y_pred_range) ** 2
bce_when_y_is_1 = -np.log(y_pred_range)

# Plot for y = 1
ax1 = axes[0]
ax1.plot(y_pred_range, mse_when_y_is_1, 'b-', linewidth=2, label='MSE')
ax1.plot(y_pred_range, bce_when_y_is_1, 'r-', linewidth=2, label='BCE')
ax1.set_xlabel('Prediction (ŷ)', fontsize=12)
ax1.set_ylabel('Loss', fontsize=12)
ax1.set_title('When Actual y = 1 (Vertical Line)\nLower prediction = Higher loss', fontsize=12, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_xlim(0, 1)
ax1.set_ylim(0, 5)
ax1.axvline(x=0.5, color='gray', linestyle=':', alpha=0.5)
ax1.annotate('If we predict 0.1\nBCE = 2.3 (harsh!)\nMSE = 0.81', 
             xy=(0.1, 2.3), xytext=(0.3, 3.5),
             fontsize=9, arrowprops=dict(arrowstyle='->', color='red'))

# When actual label is 0 (horizontal line)
mse_when_y_is_0 = y_pred_range ** 2
bce_when_y_is_0 = -np.log(1 - y_pred_range)

# Plot for y = 0
ax2 = axes[1]
ax2.plot(y_pred_range, mse_when_y_is_0, 'b-', linewidth=2, label='MSE')
ax2.plot(y_pred_range, bce_when_y_is_0, 'r-', linewidth=2, label='BCE')
ax2.set_xlabel('Prediction (ŷ)', fontsize=12)
ax2.set_ylabel('Loss', fontsize=12)
ax2.set_title('When Actual y = 0 (Horizontal Line)\nHigher prediction = Higher loss', fontsize=12, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_xlim(0, 1)
ax2.set_ylim(0, 5)
ax2.axvline(x=0.5, color='gray', linestyle=':', alpha=0.5)
ax2.annotate('If we predict 0.9\nBCE = 2.3 (harsh!)\nMSE = 0.81', 
             xy=(0.9, 2.3), xytext=(0.5, 3.5),
             fontsize=9, arrowprops=dict(arrowstyle='->', color='red'))

plt.tight_layout()
plt.show()

print("\nKEY INSIGHT: BCE vs MSE")
print("="*60)
print("""
Notice how BCE (red line) rises much more steeply than MSE (blue line)
as predictions get worse?

This is why BCE is preferred for classification:
  - It SEVERELY punishes confident wrong predictions
  - A prediction of 0.1 when the answer is 1 has BCE loss of 2.3
  - The same prediction has MSE loss of only 0.81

BCE creates stronger learning signals when the model is very wrong,
which helps it learn faster and more reliably!
""")

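One way to look at the "stronger learning signal" claim is to compare the gradients with respect to the pre-activation z when the output passes through a sigmoid. The sketch below assumes a single sample with label 1 and uses the standard results that the BCE gradient is ŷ - y, while the squared-error gradient carries an extra ŷ(1 - ŷ) factor. The z values are arbitrary illustration choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0  # actual label
z_values = np.array([-6.0, -3.0, -1.0, 0.0, 2.0])  # hypothetical pre-activations
y_hat = sigmoid(z_values)

# Gradient of each loss with respect to z, for one sample
grad_bce = y_hat - y                              # sigmoid + BCE
grad_mse = 2 * (y_hat - y) * y_hat * (1 - y_hat)  # sigmoid + squared error

print(f"{'z':>6} {'y_hat':>8} {'dBCE/dz':>10} {'dMSE/dz':>10}")
for z, p, gb, gm in zip(z_values, y_hat, grad_bce, grad_mse):
    print(f"{z:>6.1f} {p:>8.4f} {gb:>10.4f} {gm:>10.4f}")
```

When the model is confidently wrong (very negative z here), the squared-error gradient shrinks toward zero because of the ŷ(1 - ŷ) factor, while the BCE gradient stays close to -1.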
5.4 Gradient Descent: Finding Better Weights

Now we know HOW WRONG we are (the loss). But how do we make our predictions BETTER?

This is where optimization comes in - the process of finding the best values for our weights.

The Optimization Problem

Our Perceptron has 9 weights + 1 bias = 10 numbers to choose. Each combination of these 10 numbers gives different predictions and a different loss.

The Question: Out of the infinite possible combinations, which gives the LOWEST loss?

The Naive Approach: Try all combinations!

But with continuous numbers, there are infinitely many combinations
Even with just 100 values per parameter: 100^10 = 10^20 combinations
That's more than the number of grains of sand on Earth!

The Smart Approach: Use mathematics to guide our search toward better values.

What is a Derivative? (A Quick Refresher)

The derivative tells you how much one quantity changes when you change another.

Simple Example: You're driving a car.

Position = where you are
Derivative of position = speed (how fast position changes)
Derivative of speed = acceleration (how fast speed changes)

For Our Loss Function:

Loss = how wrong we are
Derivative of loss w.r.t. weight = how much loss changes when we change the weight

If the derivative is:

Positive: Increasing the weight increases loss → we should DECREASE the weight
Negative: Increasing the weight decreases loss → we should INCREASE the weight
Zero: The slope is flat, so we're at a minimum, a maximum, or a flat spot in between!
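The sign rule above can be checked numerically. This minimal sketch uses a made-up one-weight loss, L(w) = (w - 1)², and a central finite difference to estimate the derivative. The loss function and the helper name `numerical_derivative` are illustrative and not part of the notebook.

```python
def loss(w):
    """Hypothetical one-weight loss with its minimum at w = 1."""
    return (w - 1.0) ** 2

def numerical_derivative(f, w, h=1e-5):
    """Central finite-difference estimate of df/dw."""
    return (f(w + h) - f(w - h)) / (2 * h)

results = {}
for w in [3.0, -2.0, 1.0]:
    slope = numerical_derivative(loss, w)
    if slope > 1e-6:
        advice = "decrease the weight"
    elif slope < -1e-6:
        advice = "increase the weight"
    else:
        advice = "flat: at a stationary point"
    results[w] = (slope, advice)
    print(f"w = {w:+.1f}  slope = {slope:+.4f}  ->  {advice}")
```

To the right of the minimum the slope is positive, to the left it is negative, and at the minimum it is (numerically) zero.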

The Key Idea: The Loss Landscape

Imagine the loss as a landscape where:

The height at any point = how wrong we are (higher = worse)
The position = our current weights
Our goal = find the lowest point (minimum loss)

We want to "roll downhill" until we find the bottom!

The Algorithm: Gradient Descent

Gradient means "slope" - it tells us which way is uphill.

Gradient Descent means:

Look at the slope where we are
Take a step in the opposite direction (downhill)
Repeat until we reach the bottom

The Mathematics

$w_{new} = w_{old} - \alpha \cdot \frac{\partial L}{\partial w}$

Let's break this down:

Symbol	Meaning	Intuition
$w_{new}$	Updated weight	Where we're going
$w_{old}$	Current weight	Where we are
$\alpha$	Learning rate	How big a step to take
$\frac{\partial L}{\partial w}$	Gradient (slope)	Which way is uphill
$-$	Subtraction	We go OPPOSITE to uphill (= downhill)
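To make the update rule concrete, here is a small sketch of one step applied to several weights at once. The weights, gradients, and learning rate are made-up numbers chosen only for illustration.

```python
import numpy as np

w_old = np.array([0.5, -1.2, 2.0])     # hypothetical current weights
gradient = np.array([0.8, -0.4, 0.0])  # hypothetical dL/dw for each weight
alpha = 0.1                            # learning rate

# One gradient descent step: move against the gradient
w_new = w_old - alpha * gradient

for i, (old, g, new) in enumerate(zip(w_old, gradient, w_new)):
    print(f"w[{i}]: {old:+.2f} - {alpha} * ({g:+.2f}) = {new:+.2f}")
```

The weight with a positive gradient goes down, the one with a negative gradient goes up, and the one with a zero gradient stays where it is.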

Committee Analogy

"The gradient is like a compass that always points uphill. We want to go DOWNHILL (less error), so we walk in the opposite direction. The learning rate decides whether we take small careful steps or big bold leaps."

Let's Visualize This:


# =============================================================================
# VISUALIZE: The Loss Landscape and Gradient Descent
# =============================================================================

# Create a simple 1D loss landscape (parabola)
# This represents how loss changes as we change ONE weight
weight_values = np.linspace(-3, 3, 100)
loss_landscape = weight_values ** 2 + 0.5  # Simple parabola with minimum at w=0

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: The loss landscape
ax1 = axes[0]
ax1.plot(weight_values, loss_landscape, 'b-', linewidth=3)
ax1.fill_between(weight_values, loss_landscape, alpha=0.2)
ax1.set_xlabel('Weight Value (w)', fontsize=12)
ax1.set_ylabel('Loss (L)', fontsize=12)
ax1.set_title('The Loss Landscape\n(For a Single Weight)', fontsize=12, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Mark the minimum
ax1.scatter([0], [0.5], color='green', s=200, zorder=5, marker='*', label='Minimum (goal)')
ax1.annotate('Our goal: Find this minimum!', xy=(0, 0.5), xytext=(0.5, 2),
            fontsize=10, arrowprops=dict(arrowstyle='->', color='green'))

# Show current position
current_w = 2.0
current_loss = current_w ** 2 + 0.5
ax1.scatter([current_w], [current_loss], color='red', s=150, zorder=5, label='Current position')
ax1.axvline(x=current_w, color='red', linestyle='--', alpha=0.3)
ax1.legend()

# Plot 2: Gradient descent path (multiple steps drawn on one static plot)
ax2 = axes[1]
ax2.plot(weight_values, loss_landscape, 'b-', linewidth=2, alpha=0.5)
ax2.fill_between(weight_values, loss_landscape, alpha=0.1)
ax2.set_xlabel('Weight Value (w)', fontsize=12)
ax2.set_ylabel('Loss (L)', fontsize=12)
ax2.set_title('Gradient Descent: Rolling Downhill\n(Learning Rate α = 0.3)', fontsize=12, fontweight='bold')
ax2.grid(True, alpha=0.3)

# Simulate gradient descent
learning_rate = 0.3
w = 2.5  # Starting position
path = [(w, w**2 + 0.5)]

for step in range(8):
    gradient = 2 * w  # Derivative of w² is 2w
    w = w - learning_rate * gradient  # Gradient descent update
    loss = w ** 2 + 0.5
    path.append((w, loss))

# Plot the path
path = np.array(path)
ax2.plot(path[:, 0], path[:, 1], 'ro-', markersize=8, linewidth=2, label='Gradient descent path')
ax2.scatter([path[0, 0]], [path[0, 1]], color='red', s=200, zorder=5, marker='o', label='Start')
ax2.scatter([path[-1, 0]], [path[-1, 1]], color='green', s=200, zorder=5, marker='*', label='End (near minimum)')

# Add step numbers
for i, (w_val, l_val) in enumerate(path):
    ax2.annotate(f'{i}', xy=(w_val, l_val), xytext=(w_val+0.1, l_val+0.3),
                fontsize=9, fontweight='bold')

ax2.legend(loc='upper right')

plt.tight_layout()
plt.show()

print("\nGRADIENT DESCENT STEPS:")
print("="*60)
print(f"{'Step':<6} {'Weight (w)':<15} {'Gradient (2w)':<15} {'Update':<20} {'Loss'}")
print("-"*60)
w = 2.5
for step in range(6):
    gradient = 2 * w
    update = -learning_rate * gradient
    loss = w ** 2 + 0.5
    print(f"{step:<6} {w:<15.4f} {gradient:<15.4f} {update:<20.4f} {loss:.4f}")
    w = w + update  # Same as w = w - learning_rate * gradient

print("-"*60)
print(f"\nStarted at w = 2.5 (loss = 6.75)")
print(f"After 6 updates: w = {w:.4f} (loss = {w**2 + 0.5:.4f})")
print(f"Getting closer to the minimum at w = 0 (loss = 0.5)!")

5.5 Learning Rate: How Fast to Adjust

The learning rate (α, alpha) controls how big each step is. It's one of the most important choices in training!

Parameters vs Hyperparameters

Before we dive in, let's clarify an important distinction:

Term	What It Is	Examples	Who Sets It?
Parameters	Values the model LEARNS	Weights, Bias	The training algorithm
Hyperparameters	Settings WE choose before training	Learning rate, number of epochs	The human (you!)

The learning rate is a hyperparameter - we choose it before training, and it affects HOW the model learns (but is not learned itself).

Why Learning Rate Matters So Much

The learning rate multiplies the gradient to determine the step size:

step = learning_rate × gradient
new_weight = old_weight - step

The Problem: Gradients can vary wildly:

Sometimes the gradient is 10.0 (steep slope)
Sometimes it's 0.001 (nearly flat)

The Learning Rate's Job: Scale these gradients to reasonable step sizes.

The Goldilocks Problem

Learning Rate	Effect	Problem
Too Large (α = 1.0)	Big steps	Overshoot! Miss the minimum, bounce around
Too Small (α = 0.001)	Tiny steps	Takes forever, might get stuck
Just Right (α = 0.1)	Medium steps	Steady progress toward minimum

The Mathematics

Remember our update formula:

$w_{new} = w_{old} - \alpha \cdot gradient$

If gradient = 10 and α = 0.1: step size = 1.0 (reasonable)
If gradient = 10 and α = 1.0: step size = 10.0 (too big!)
If gradient = 10 and α = 0.001: step size = 0.01 (too small!)
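For the simple parabola L(w) = w² + 0.5 used in this notebook's plots, the gradient is 2w, so each update multiplies w by (1 - 2α). The sketch below uses that fact to show how the learning rate alone decides whether the steps shrink, bounce, or grow. The α values are illustrative, and the thresholds apply to this parabola only, not to loss functions in general.

```python
results = {}
for alpha in [0.05, 0.3, 0.9, 1.0, 1.1]:
    factor = 1 - 2 * alpha       # per-step multiplier on w for this parabola
    w = 2.5                      # same starting point as the plots
    for _ in range(15):
        w = w - alpha * (2 * w)  # gradient descent update, gradient = 2w

    if abs(factor) < 1:
        behavior = "converges" + (" (oscillating)" if factor < 0 else "")
    elif abs(factor) == 1:
        behavior = "bounces forever"
    else:
        behavior = "diverges"

    results[alpha] = (factor, w, behavior)
    print(f"alpha = {alpha:<5} factor = {factor:+.2f}  w after 15 steps = {w:+.6f}  {behavior}")
```

A negative factor means every step jumps to the other side of the minimum. Progress is still made as long as the factor's magnitude stays below 1.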

Committee Analogy

"The learning rate is how much the committee member adjusts after each mistake. Too much adjustment, and they overcorrect wildly. Too little, and they never improve. The right amount leads to steady learning."

Let's See All Three Scenarios:


# =============================================================================
# VISUALIZE: Learning Rate Effects
# =============================================================================

def run_gradient_descent(start_w, learning_rate, steps=15):
    """Run gradient descent and return the path."""
    w = start_w
    path = [(w, w**2 + 0.5)]
    for _ in range(steps):
        gradient = 2 * w  # Derivative of w²
        w = w - learning_rate * gradient
        w = np.clip(w, -5, 5)  # Prevent explosion
        loss = w ** 2 + 0.5
        path.append((w, loss))
    return np.array(path)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
weight_values = np.linspace(-3, 3, 100)
loss_landscape = weight_values ** 2 + 0.5

scenarios = [
    (0.9, 'TOO LARGE (α=0.9)', 'red', 'Overshoots and bounces!'),
    (0.3, 'JUST RIGHT (α=0.3)', 'green', 'Steady progress!'),
    (0.05, 'TOO SMALL (α=0.05)', 'blue', 'Very slow progress...')
]

for ax, (lr, title, color, desc) in zip(axes, scenarios):
    ax.plot(weight_values, loss_landscape, 'k-', linewidth=1, alpha=0.3)
    ax.fill_between(weight_values, loss_landscape, alpha=0.1, color='gray')
    
    path = run_gradient_descent(start_w=2.5, learning_rate=lr)
    ax.plot(path[:, 0], path[:, 1], 'o-', color=color, markersize=6, linewidth=1.5)
    ax.scatter([path[0, 0]], [path[0, 1]], color=color, s=150, zorder=5, marker='s', label='Start')
    ax.scatter([path[-1, 0]], [path[-1, 1]], color='black', s=150, zorder=5, marker='*', label='End')
    
    ax.set_xlabel('Weight (w)', fontsize=11)
    ax.set_ylabel('Loss', fontsize=11)
    ax.set_title(f'{title}\n{desc}', fontsize=11, fontweight='bold')
    ax.set_xlim(-3, 3)
    ax.set_ylim(0, 10)
    ax.grid(True, alpha=0.3)
    ax.legend(loc='upper right', fontsize=9)
    
    # Show final loss
    ax.annotate(f'Final loss: {path[-1, 1]:.2f}', xy=(0, 8), fontsize=10, ha='center')

plt.tight_layout()
plt.show()

print("\nLEARNING RATE COMPARISON:")
print("="*60)
for lr, title, _, _ in scenarios:
    path = run_gradient_descent(start_w=2.5, learning_rate=lr)
    print(f"\n{title}")
    print(f"  Final weight: {path[-1, 0]:.4f}")
    print(f"  Final loss:   {path[-1, 1]:.4f}")
    print(f"  Optimal loss: 0.5000 (at w=0)")
    print(f"  Distance from optimal: {abs(path[-1, 1] - 0.5):.4f}")

5.6 The Gradient: Which Way is Down?

We've been using the word "gradient" - but what IS it exactly, and how do we calculate it?

What is a Gradient?

The gradient is the derivative (slope) of the loss with respect to each weight. It tells us:

How much the loss changes when we change a weight
Which direction increases the loss (so we go the opposite way!)

Regular Derivatives vs Partial Derivatives

Regular derivative: When you have ONE variable.

Example: If f(x) = x², then df/dx = 2x

Partial derivative (∂): When you have MULTIPLE variables and you want to see the effect of changing just ONE while keeping others fixed.

Example: If f(x, y) = x² + y², then:
- ∂f/∂x = 2x (how f changes when x changes, y held constant)
- ∂f/∂y = 2y (how f changes when y changes, x held constant)
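The partial derivatives in this example can be estimated numerically by nudging one variable at a time while holding the other fixed. The sketch below does that with central differences. The helper name `numerical_gradient` and the evaluation point are assumptions made for illustration.

```python
import numpy as np

def f(x, y):
    return x ** 2 + y ** 2

def numerical_gradient(func, point, h=1e-5):
    """Estimate each partial derivative by nudging one coordinate at a time."""
    point = np.asarray(point, dtype=float)
    grad = np.zeros_like(point)
    for i in range(point.size):
        step = np.zeros_like(point)
        step[i] = h  # change only coordinate i, hold the others constant
        grad[i] = (func(*(point + step)) - func(*(point - step))) / (2 * h)
    return grad

point = [3.0, -1.0]
grad = numerical_gradient(f, point)
analytic = np.array([2 * point[0], 2 * point[1]])  # (2x, 2y)

print(f"Numerical gradient: {grad}")
print(f"Analytic gradient:  {analytic}")
```

Collecting the partial derivatives into one array is exactly what "the gradient" means for a function of several variables.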

In our Perceptron:

Loss depends on 9 weights + 1 bias = 10 variables
We need 10 partial derivatives (one for each parameter)
The gradient is the collection of ALL these partial derivatives

The Notation "w.r.t." (With Respect To)

You'll often see "gradient of L w.r.t. w" - this means "how does L change when we change w?"

∂L/∂w is read as "partial derivative of L with respect to w"

The Chain Rule: Breaking Down Complex Functions

Our Perceptron has multiple operations chained together:

x → [weighted sum] → z → [sigmoid] → ŷ → [BCE loss] → L
    w · x + b

To find how changing w affects the final loss L, we use the chain rule:

$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$

This looks complicated, but each piece is simple!

The Beautiful Simplification

For sigmoid activation with BCE loss, all the calculus simplifies to:

$\frac{\partial L}{\partial w} = (\hat{y} - y) \cdot x$

And for the bias:

$\frac{\partial L}{\partial b} = (\hat{y} - y)$

That's it! The gradient is just:

(prediction - actual) × input
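For readers who want to see where the simplification comes from, the derivation can be sketched for a single sample, using the chain-rule factors named above with z = w · x + b and ŷ = sigmoid(z):

```latex
\begin{aligned}
L &= -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right] \\
\frac{\partial L}{\partial \hat{y}} &= -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
  = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \\
\frac{\partial \hat{y}}{\partial z} &= \hat{y}(1 - \hat{y}), \qquad
\frac{\partial z}{\partial w} = x \\
\frac{\partial L}{\partial w} &= \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y}) \cdot x
  = (\hat{y} - y) \cdot x
\end{aligned}
```

The sigmoid derivative cancels the denominator of the BCE derivative exactly, which is why the final expression is so short. For the bias, the last factor is 1 instead of x.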

Why This Formula Makes Intuitive Sense

Part	Meaning	Intuition
$(\hat{y} - y)$	Error	How wrong we are (and in which direction)
$x$	Input	Which inputs contributed to the output

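If we predicted too high (ŷ > y), the error is positive, so the update decreases the weights attached to active (positive) inputs. The larger an input, the larger the decrease, because that input had more influence on the output. Weights whose input was 0 are left unchanged.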
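The formula (ŷ - y) · x can be sanity-checked against a finite-difference estimate of the BCE loss. The sketch below is self-contained: the random weights, the bias value, and the vertical-line input (middle column set to 1) are assumptions for illustration rather than the notebook's actual `perceptron` object.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, x, y):
    """BCE for one sample, computed through the full forward pass."""
    y_hat = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(0)
w = rng.normal(size=9)  # hypothetical random weights
b = 0.1                 # hypothetical bias
x = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0], dtype=float)  # assumed vertical line
y = 1.0

# Analytic gradient from the simplified formula
y_hat = sigmoid(np.dot(w, x) + b)
analytic_grad = (y_hat - y) * x

# Numerical gradient: nudge each weight and measure the change in loss
h = 1e-6
numeric_grad = np.zeros_like(w)
for i in range(w.size):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += h
    w_minus[i] -= h
    numeric_grad[i] = (bce_loss(w_plus, b, x, y) - bce_loss(w_minus, b, x, y)) / (2 * h)

max_diff = np.max(np.abs(analytic_grad - numeric_grad))
print(f"Analytic:  {np.round(analytic_grad, 4)}")
print(f"Numerical: {np.round(numeric_grad, 4)}")
print(f"Largest difference: {max_diff:.2e}")
```

A check like this is a common way to catch mistakes in hand-derived gradients before using them in training.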

Let's Calculate Gradients Step by Step:


# =============================================================================
# CALCULATING GRADIENTS: Step by Step
# =============================================================================

print("="*70)
print("CALCULATING GRADIENTS: Step by Step")
print("="*70)

# Use our Perceptron on a vertical line
x = vertical_flat.copy()
y_true = 1  # It IS vertical

# Get the prediction
y_pred = perceptron.forward(x)

print(f"\nInput (x): {x}")
print(f"Actual label (y): {y_true}")
print(f"Prediction (ŷ): {y_pred:.4f}")

# Calculate the error term
print("\n" + "-"*70)
print("STEP 1: Calculate the error term (ŷ - y)")
print("-"*70)
error = y_pred - y_true
print(f"\n  error = ŷ - y")
print(f"  error = {y_pred:.4f} - {y_true}")
print(f"  error = {error:.4f}")

if error > 0:
    print(f"\n  Interpretation: Error is POSITIVE ({error:.4f})")
    print(f"  This means we predicted TOO HIGH - need to decrease output")
else:
    print(f"\n  Interpretation: Error is NEGATIVE ({error:.4f})")
    print(f"  This means we predicted TOO LOW - need to increase output")

# Calculate gradient for weights
print("\n" + "-"*70)
print("STEP 2: Calculate gradient for each weight")
print("-"*70)
print(f"\n  Formula: ∂L/∂w = (ŷ - y) × x = error × x")
print(f"\n  For each weight w_i, the gradient is: error × x_i")

gradient_weights = error * x
print(f"\n  Gradients for all 9 weights:")
print(f"  error × x = {error:.4f} × {x}")
print(f"           = [{', '.join([f'{g:.4f}' for g in gradient_weights])}]")

# Show which weights should change
print(f"\n  Let's interpret this (as a 3x3 grid):")
grad_grid = gradient_weights.reshape(3, 3)
print(f"    {grad_grid[0]}")
print(f"    {grad_grid[1]}")
print(f"    {grad_grid[2]}")

print(f"\n  Notice: Only the middle column has non-zero gradients!")
print(f"  That's because only those pixels had value 1 in the input.")
print(f"  Weights for other pixels don't need to change (input was 0).")

# Calculate gradient for bias
print("\n" + "-"*70)
print("STEP 3: Calculate gradient for bias")
print("-"*70)
gradient_bias = error
print(f"\n  Formula: ∂L/∂b = (ŷ - y) = error")
print(f"  Bias gradient = {gradient_bias:.4f}")

# Show the update
print("\n" + "-"*70)
print("STEP 4: Apply the update (with learning rate α = 0.5)")
print("-"*70)
learning_rate = 0.5
print(f"\n  Update formula: w_new = w_old - α × gradient")
print(f"\n  For weight w₁ (position 1, middle column):")
old_w1 = perceptron.weights[1]
new_w1 = old_w1 - learning_rate * gradient_weights[1]
print(f"    w₁_new = {old_w1:.4f} - {learning_rate} × {gradient_weights[1]:.4f}")
print(f"    w₁_new = {old_w1:.4f} - {learning_rate * gradient_weights[1]:.4f}")
print(f"    w₁_new = {new_w1:.4f}")
print(f"\n  Since error was negative, w₁ INCREASED to make output higher next time!")

5.7 Backpropagation: Tracing the Blame

Backpropagation ("backprop") is the algorithm that calculates gradients by flowing errors BACKWARD through the network.

Why Backpropagation is Revolutionary

Before backpropagation was popularized in 1986 (by Rumelhart, Hinton, and Williams), training neural networks was incredibly difficult. People didn't know how to efficiently calculate gradients for networks with many layers.

The Problem: In a network with multiple layers, changing one weight affects EVERYTHING that comes after it. How do you figure out exactly how much each weight contributed to the final error?

The Solution: Backpropagation! It uses the chain rule to efficiently calculate ALL gradients in ONE backward pass through the network.

The Name Explained

Back: We start from the OUTPUT (the error) and work BACKWARD
Propagation: The error "propagates" (spreads) to earlier layers

Think of it like blame assignment:

The final output was wrong
What caused it to be wrong? Trace backward...
These specific weights were most responsible
Adjust them accordingly

For Our Single Neuron

In our simple Perceptron, backpropagation is straightforward:

    FORWARD PASS (left to right):
    x → [w·x + b] → z → [sigmoid] → ŷ → [compare to y] → Loss
    
    BACKWARD PASS (right to left):
    Loss → ∂L/∂ŷ → ∂L/∂z → ∂L/∂w, ∂L/∂b
           ↑          ↑          ↑
        "How does   "How does   "How does
         loss       loss        loss
         change     change      change
         with ŷ?"   with z?"    with w,b?"
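The backward pass in the diagram can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's own implementation: the input, weights, and label below are hypothetical values chosen for the example, and `sigmoid` is redefined here only so the snippet runs on its own.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid (redefined here so the sketch is self-contained)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical example values (not taken from the notebook's dataset)
x = np.array([0.0, 1.0, 0.0])    # input
w = np.array([0.2, -0.4, 0.1])   # weights
b = 0.0                          # bias
y = 1.0                          # true label

# FORWARD PASS (left to right)
z = np.dot(w, x) + b
y_hat = sigmoid(z)

# BACKWARD PASS (right to left): one local derivative per arrow
dL_dyhat = -(y / y_hat) + (1 - y) / (1 - y_hat)  # BCE derivative w.r.t. y_hat
dyhat_dz = y_hat * (1 - y_hat)                   # sigmoid derivative
dL_dz = dL_dyhat * dyhat_dz                      # chain rule
dL_dw = dL_dz * x                                # because dz/dw = x
dL_db = dL_dz                                    # because dz/db = 1

# The shortcut form used in the rest of this notebook
shortcut_dw = (y_hat - y) * x
shortcut_db = y_hat - y

print("Chain rule dL/dw:", dL_dw)
print("Shortcut   dL/dw:", shortcut_dw)
```

Both lines should print the same vector, which shows that multiplying the local derivatives step by step agrees with the compact formula.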

Committee Analogy

"Backpropagation is like a post-mortem after a mistake. The committee asks: 'What went wrong?' They trace the decision back: 'The final vote was wrong. Why? The weighted sum was off. Why? These specific weights gave too much importance to the wrong evidence.' Then they adjust those specific weights."

The Backprop Flow for Our Perceptron

Step	Calculation	Formula
1	Loss gradient w.r.t. output	$\frac{\partial L}{\partial \hat{y}}$ (from BCE)
2	Output gradient w.r.t. pre-activation	$\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$ (sigmoid derivative)
3	Pre-activation gradient w.r.t. weights	$\frac{\partial z}{\partial w} = x$
4	Chain them together	$\frac{\partial L}{\partial w} = (\hat{y} - y) \cdot x$

The beautiful thing: steps 1 and 2 combine to give us just $(\hat{y} - y)$!
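For readers who want to see why the sigmoid derivative cancels, the algebra can be written out as follows. This assumes the single-sample BCE loss $L = -[y \log \hat{y} + (1 - y)\log(1 - \hat{y})]$ used in this notebook.

```latex
\begin{aligned}
\frac{\partial L}{\partial \hat{y}}
  &= -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
   = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \\[4pt]
\frac{\partial L}{\partial z}
  &= \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}
   = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y})
   = \hat{y} - y \\[4pt]
\frac{\partial L}{\partial w}
  &= \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial w}
   = (\hat{y} - y)\, x
\end{aligned}
```

The denominator of the BCE derivative is exactly the sigmoid derivative, so the two factors cancel and only the prediction error remains.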

5.8 The Training Loop: Putting It All Together

Now we have all the pieces! Let's build the complete training algorithm.

Why Do We Need a Loop?

A single gradient descent step makes only a TINY improvement. To go from random weights to good weights, we need MANY small steps.

Illustrative example (the numbers are for intuition only, not measured results):

Start: Loss = 0.7, Accuracy = 50%
After 1 step: Loss = 0.69, Accuracy = 51% (tiny improvement)
After 10 steps: Loss = 0.5, Accuracy = 65%
After 100 steps: Loss = 0.1, Accuracy = 95%

Each step nudges the weights slightly. Over many steps, these tiny nudges accumulate into major improvements!
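To make the "many small steps" idea concrete, here is a small sketch that applies the same update repeatedly to a single sample and records the loss. The sample, starting weights, and learning rate are hypothetical; `sigmoid` and `bce` are redefined here only so the snippet runs on its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, y_hat, eps=1e-15):
    """Binary cross-entropy for one sample, clipped to avoid log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical single training sample and starting parameters
x = np.array([0.0, 1.0, 0.0])
y = 1.0
w = np.zeros(3)
b = 0.0
learning_rate = 0.5

losses = []
for step in range(100):
    y_hat = sigmoid(np.dot(w, x) + b)   # forward pass
    losses.append(bce(y, y_hat))        # loss before this update
    error = y_hat - y                   # gradient of the loss w.r.t. z
    w = w - learning_rate * error * x   # update weights
    b = b - learning_rate * error       # update bias

for step in (0, 1, 10, 99):
    print(f"step {step:3d}: loss = {losses[step]:.4f}")
```

With one sample the loss falls at every step. On a real dataset the per-sample loss can go up and down, which is why the training loop tracks the average loss per epoch.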

Why Multiple Epochs?

Problem: One pass through the data isn't enough.

With 100 samples, we only make 100 weight updates
The model might not have "seen" enough patterns
Early samples were processed with very different weights than late samples

Solution: Go through the data MULTIPLE times (epochs).

Epoch 1: First exposure to all samples
Epoch 2: Second look, with better weights now
Epoch 3: Refinement continues
...

Each epoch, the model gets better at the task!

The Training Loop Algorithm

FOR each epoch (pass through the data):
    FOR each sample (x, y) in the training data:
        
        1. FORWARD PASS: Get prediction
           ŷ = sigmoid(w · x + b)
        
        2. COMPUTE LOSS: How wrong?
           L = BCE(y, ŷ)
        
        3. COMPUTE GRADIENTS: Which way to go?
           ∂L/∂w = (ŷ - y) × x
           ∂L/∂b = (ŷ - y)
        
        4. UPDATE WEIGHTS: Take a step downhill
           w = w - α × ∂L/∂w
           b = b - α × ∂L/∂b
    
    Record average loss for this epoch

Key Terms

Term	Meaning
Epoch	One complete pass through all training data
Sample	One training example (input + label)
Batch	Group of samples processed together (we use batch size = 1 here)
Iteration	One weight update
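The relationship between these terms can be sketched with a little arithmetic. The helper `count_iterations` below is a hypothetical name introduced for illustration, and it assumes one weight update per batch.

```python
import math

def count_iterations(n_samples, batch_size, epochs):
    """Number of weight updates, assuming one update per batch."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch * epochs

# Batch size 1 (as in this notebook): one update per sample
print(count_iterations(n_samples=100, batch_size=1, epochs=50))   # 5000

# A hypothetical batch size of 32, for comparison
print(count_iterations(n_samples=100, batch_size=32, epochs=50))  # 200
```

With batch size 1, every sample triggers its own update, so the number of iterations per epoch equals the number of samples.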

Let's Build Our Trainable Perceptron!

cell 020

# =============================================================================
# THE TRAINABLE PERCEPTRON: Complete Implementation
# =============================================================================

class TrainablePerceptron:
    """
    A Perceptron that can learn from examples!
    
    This class includes:
    - Forward pass (prediction)
    - Loss calculation (BCE)
    - Gradient calculation (backpropagation)
    - Weight update (gradient descent)
    - Full training loop
    """
    
    def __init__(self, n_inputs):
        """Initialize with random weights."""
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0
        self.n_inputs = n_inputs
        
        # For tracking training progress
        self.loss_history = []
        self.accuracy_history = []
    
    def forward(self, x):
        """Forward pass: compute prediction."""
        x = np.array(x).flatten()
        z = np.dot(self.weights, x) + self.bias
        return sigmoid(z)
    
    def predict(self, x):
        """Binary prediction (0 or 1)."""
        return 1 if self.forward(x) >= 0.5 else 0
    
    def compute_loss(self, y_true, y_pred):
        """Compute BCE loss for one sample."""
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    
    def train(self, X, y, learning_rate=0.1, epochs=100, verbose=True):
        """
        Train the Perceptron using gradient descent.
        
        Parameters:
            X: Training inputs, shape (n_samples, n_features)
            y: Training labels, shape (n_samples,)
            learning_rate: Step size for gradient descent
            epochs: Number of passes through the data
            verbose: Whether to print progress
        
        Returns:
            List of losses for each epoch
        """
        self.loss_history = []
        self.accuracy_history = []
        
        if verbose:
            print("="*70)
            print("TRAINING STARTED")
            print("="*70)
            print(f"  Samples: {len(X)}")
            print(f"  Epochs: {epochs}")
            print(f"  Learning rate: {learning_rate}")
            print()
        
        for epoch in range(epochs):
            total_loss = 0
            correct = 0
            
            # Go through each training sample
            for i in range(len(X)):
                xi = X[i]  # Input
                yi = y[i]  # True label
                
                # ===== STEP 1: FORWARD PASS =====
                y_pred = self.forward(xi)
                
                # ===== STEP 2: COMPUTE LOSS =====
                loss = self.compute_loss(yi, y_pred)
                total_loss += loss
                
                # Count correct predictions
                if (y_pred >= 0.5 and yi == 1) or (y_pred < 0.5 and yi == 0):
                    correct += 1
                
                # ===== STEP 3: COMPUTE GRADIENTS =====
                # The beautiful simplification: gradient = (prediction - actual) * input
                error = y_pred - yi
                gradient_weights = error * xi
                gradient_bias = error
                
                # ===== STEP 4: UPDATE WEIGHTS =====
                self.weights = self.weights - learning_rate * gradient_weights
                self.bias = self.bias - learning_rate * gradient_bias
            
            # Record progress
            avg_loss = total_loss / len(X)
            accuracy = correct / len(X)
            self.loss_history.append(avg_loss)
            self.accuracy_history.append(accuracy)
            
            # Print progress every 10 epochs
            if verbose and (epoch + 1) % 10 == 0:
                print(f"  Epoch {epoch+1:3d}/{epochs}: Loss = {avg_loss:.4f}, Accuracy = {accuracy*100:.1f}%")
        
        if verbose:
            print()
            print("="*70)
            print("TRAINING COMPLETE!")
            print("="*70)
            print(f"  Final Loss: {self.loss_history[-1]:.4f}")
            print(f"  Final Accuracy: {self.accuracy_history[-1]*100:.1f}%")
        
        return self.loss_history

print("TrainablePerceptron class created!")
print("Now let's train it and watch it learn...")

5.9 Watching It Learn!

This is the moment we've been building toward. Let's train our Perceptron and watch it transform from a confused guesser into an expert line detector!

What to Watch For

During training, you'll see:

Loss decreasing - The model is making fewer/smaller mistakes
Accuracy increasing - More predictions are correct
Eventually plateauing - When the model has learned all it can

Convergence: When Has the Model Learned Enough?

Convergence means the model has stopped improving significantly. Signs of convergence:

Sign	What It Looks Like	What It Means
Loss plateaus	Loss curve flattens out	Little further improvement with the current settings
Loss oscillates	Jumps up and down slightly	Near the minimum (or the learning rate is a bit too high)
Accuracy stable	Stays at same level	Model has learned the pattern

When to Stop Training:

When loss stops decreasing for several epochs
When accuracy reaches acceptable level (e.g., 95%+)
When you've run out of patience!
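The first stopping rule can be turned into a simple check on the recorded loss history. This is one possible sketch, not part of the notebook's `TrainablePerceptron`: the function name `has_converged` and the parameters `patience` and `min_delta` are assumptions introduced here.

```python
def has_converged(loss_history, patience=5, min_delta=1e-4):
    """
    Return True if the best loss in the last `patience` epochs is not
    at least `min_delta` lower than the best loss seen before them.
    """
    if len(loss_history) <= patience:
        return False  # not enough history to judge yet
    best_before = min(loss_history[:-patience])
    best_recent = min(loss_history[-patience:])
    return best_before - best_recent < min_delta

# Still improving: the recent epochs are clearly better than the earlier ones
print(has_converged([0.7, 0.5, 0.3, 0.2, 0.1, 0.05, 0.02]))  # False

# Flat for the last five epochs: treated as converged
print(has_converged([0.7, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]))    # True
```

A check like this could be called once per epoch on a list such as `loss_history` to decide whether to stop early.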

cell 022

# =============================================================================
# TRAINING THE PERCEPTRON: Watch It Learn!
# =============================================================================

# Create a fresh Perceptron
np.random.seed(42)  # For reproducibility
model = TrainablePerceptron(n_inputs=9)

# Check initial performance (before training)
print("BEFORE TRAINING:")
print("-"*40)
correct_before = sum(model.predict(X_train[i]) == y_train[i] for i in range(len(X_train)))
print(f"Accuracy: {correct_before}/{len(y_train)} = {correct_before/len(y_train)*100:.1f}%")
print(f"(This is basically random guessing)")
print()

# Train the model!
losses = model.train(X_train, y_train, learning_rate=0.5, epochs=50)

# Check final performance
print("\nAFTER TRAINING:")
print("-"*40)
correct_after = sum(model.predict(X_train[i]) == y_train[i] for i in range(len(X_train)))
print(f"Accuracy: {correct_after}/{len(y_train)} = {correct_after/len(y_train)*100:.1f}%")

cell 023

# =============================================================================
# VISUALIZE THE LEARNING PROGRESS
# =============================================================================

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Loss over time
ax1 = axes[0]
ax1.plot(model.loss_history, 'b-', linewidth=2)
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Loss (BCE)', fontsize=12)
ax1.set_title('Loss Decreasing Over Time', fontsize=12, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.annotate(f'Start: {model.loss_history[0]:.2f}', xy=(0, model.loss_history[0]), 
            xytext=(5, model.loss_history[0]+0.1), fontsize=10)
ax1.annotate(f'End: {model.loss_history[-1]:.2f}', xy=(len(model.loss_history)-1, model.loss_history[-1]), 
            xytext=(len(model.loss_history)-15, model.loss_history[-1]+0.1), fontsize=10)

# Plot 2: Accuracy over time
ax2 = axes[1]
ax2.plot([a*100 for a in model.accuracy_history], 'g-', linewidth=2)
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Accuracy (%)', fontsize=12)
ax2.set_title('Accuracy Increasing Over Time', fontsize=12, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.set_ylim(0, 105)
ax2.axhline(y=50, color='red', linestyle='--', alpha=0.5, label='Random guessing')
ax2.legend()

# Plot 3: Learned weights (as 3x3 grid)
ax3 = axes[2]
weights_grid = model.weights.reshape(3, 3)
im = ax3.imshow(weights_grid, cmap='RdBu', vmin=-2, vmax=2)
ax3.set_title('Learned Weights\n(What the model looks for)', fontsize=12, fontweight='bold')
for i in range(3):
    for j in range(3):
        ax3.text(j, i, f'{weights_grid[i,j]:.2f}', ha='center', va='center', fontsize=11, fontweight='bold')
plt.colorbar(im, ax=ax3)
ax3.set_xticks([])
ax3.set_yticks([])

plt.tight_layout()
plt.show()

print("\nKEY OBSERVATIONS:")
print("="*60)
print(f"1. Loss decreased from {model.loss_history[0]:.4f} to {model.loss_history[-1]:.4f}")
print(f"2. Accuracy improved from ~50% to {model.accuracy_history[-1]*100:.1f}%")
print(f"3. The learned weights show HIGH values in the middle column!")
print(f"   This is exactly what we'd expect for a vertical line detector!")

cell 024

# =============================================================================
# TEST ON CANONICAL EXAMPLES: Before vs After
# =============================================================================

print("="*70)
print("TESTING ON CANONICAL EXAMPLES")
print("="*70)

# Test on vertical line
v_pred = model.forward(vertical_flat)
print(f"\nVertical Line:")
print(f"  Prediction: {v_pred:.4f} ({v_pred*100:.1f}% confident it's vertical)")
print(f"  Actual: 1 (IS vertical)")
print(f"  Result: {'CORRECT!' if v_pred >= 0.5 else 'Wrong'}")

# Test on horizontal line
h_pred = model.forward(horizontal_flat)
print(f"\nHorizontal Line:")
print(f"  Prediction: {h_pred:.4f} ({h_pred*100:.1f}% confident it's vertical)")
print(f"  Actual: 0 (NOT vertical)")
print(f"  Result: {'CORRECT!' if h_pred < 0.5 else 'Wrong'}")

print("\n" + "="*70)
print("THE PERCEPTRON HAS LEARNED!")
print("="*70)
print("""
From random weights giving ~50% accuracy,
our Perceptron now confidently classifies lines!

It learned that:
  - The MIDDLE COLUMN matters most for vertical lines
  - Other pixels should have low/negative weights
  
This happened automatically through gradient descent -
we never told it what a vertical line looks like!
""")

Part 5 Summary: What We've Learned

This was the most important notebook in the series! You've learned the core of how neural networks learn.

Key Concepts Mastered

Concept	Formula	Why It Matters
Error	y - ŷ	Measures how wrong we are
MSE Loss	(1/n)Σ(y-ŷ)²	Penalizes errors, larger errors more
BCE Loss	-[y·log(ŷ) + (1-y)·log(1-ŷ)]	Better for classification, harsh on confident mistakes
Gradient	(ŷ - y) · x	Direction and magnitude of improvement
Gradient Descent	w = w - α·∇L	Algorithm to find better weights
Learning Rate	α	Controls step size (too big = overshoot, too small = slow)
Backpropagation	Chain rule backward	Calculates gradients for all weights
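The "harsh on confident mistakes" row can be checked numerically. The sketch below compares squared error and BCE for a true label of 1 at a few hypothetical predictions; the helper names are introduced here for illustration.

```python
import numpy as np

def squared_error(y, y_hat):
    """Squared error for one sample (the per-sample term inside MSE)."""
    return (y - y_hat) ** 2

def bce(y, y_hat, eps=1e-15):
    """Binary cross-entropy for one sample, clipped to avoid log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = 1.0  # true label
for y_hat in (0.9, 0.5, 0.1, 0.01):
    print(f"y_hat = {y_hat:<4}: squared error = {squared_error(y, y_hat):.4f}, "
          f"BCE = {bce(y, y_hat):.4f}")
```

Squared error can never exceed 1 for predictions between 0 and 1, while BCE keeps growing as the prediction approaches the wrong extreme.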

The Training Loop (Memorize This!)

for epoch in range(epochs):
    for x, y in training_data:
        y_pred = forward(x)           # 1. Predict
        loss = bce(y, y_pred)         # 2. Measure error
        gradient = (y_pred - y) * x   # 3. Calculate gradient
        weights -= lr * gradient      # 4. Update weights
        bias -= lr * (y_pred - y)     #    ...and the bias

Committee Analogy Progress

Part	Committee Story
Part 1-3	Member learned procedures (math, weights, voting)
Part 4	First case - confused, random guessing
Part 5	Member receives feedback and LEARNS!
Part 6	(Next) Evaluating the trained expert

The Big Picture

Before Training: Random weights → Random predictions → ~50% accuracy

After Training: Learned weights → Meaningful predictions → ~95%+ accuracy

The Perceptron discovered ON ITS OWN that vertical lines have pixels in the middle column!

Knowledge Check

cell 026

# =============================================================================
# KNOWLEDGE CHECK - Part 5
# =============================================================================

print("KNOWLEDGE CHECK - Part 5: Training")
print("="*60)
print("\nAnswer these questions to test your understanding:\n")

questions = [
    {
        "q": "1. Why do we square errors in MSE?",
        "options": [
            "A) To make the math easier",
            "B) To prevent positive and negative errors from canceling out",
            "C) To make errors smaller",
            "D) Because computers prefer square numbers"
        ],
        "answer": "B",
        "explanation": "Squaring makes all errors positive, so they add up rather than cancel. It also penalizes larger errors more."
    },
    {
        "q": "2. Why is BCE preferred over MSE for classification?",
        "options": [
            "A) BCE is faster to compute",
            "B) BCE uses less memory",
            "C) BCE severely punishes confident wrong predictions",
            "D) BCE always gives lower values"
        ],
        "answer": "C",
        "explanation": "BCE uses logarithms which give very large penalties when the model is confident but wrong (e.g., predicting 0.01 when answer is 1)."
    },
    {
        "q": "3. What happens if the learning rate is too high?",
        "options": [
            "A) Training is faster and better",
            "B) The model overshoots the minimum and may never converge",
            "C) The model learns more features",
            "D) Nothing bad, higher is always better"
        ],
        "answer": "B",
        "explanation": "A high learning rate causes big jumps that overshoot the minimum, causing the loss to bounce around or even increase."
    },
    {
        "q": "4. The gradient formula for our Perceptron is (ŷ - y) × x. What does the 'x' part mean?",
        "options": [
            "A) Larger inputs get larger weight updates",
            "B) The input is added to the gradient",
            "C) X marks the spot",
            "D) Nothing, it's just mathematical convention"
        ],
        "answer": "A",
        "explanation": "The input 'x' determines which weights contributed to the output. Weights connected to larger inputs get larger updates because they had more influence."
    },
    {
        "q": "5. What is an 'epoch' in training?",
        "options": [
            "A) One weight update",
            "B) One forward pass",
            "C) One complete pass through all training data",
            "D) When the model reaches 100% accuracy"
        ],
        "answer": "C",
        "explanation": "An epoch is one complete pass through the entire training dataset. We typically train for many epochs until the model converges."
    }
]

for q in questions:
    print(q["q"])
    for opt in q["options"]:
        print(f"   {opt}")
    print()

print("\n" + "="*60)
print("Scroll down for answers...")
print("="*60)

cell 027

# =============================================================================
# ANSWERS - Knowledge Check Part 5
# =============================================================================

print("ANSWERS - Part 5 Knowledge Check")
print("="*60)

for i, q in enumerate(questions, 1):
    print(f"\n{i}. Answer: {q['answer']}")
    print(f"   {q['explanation']}")

print("\n" + "="*60)
print("How did you do?")
print("  5/5: Training Master!")
print("  4/5: Solid understanding!")
print("  3/5: Review the sections you missed")
print("  <3:  Re-read Part 5 - these concepts are crucial!")
print("="*60)

What's Next?

Congratulations! You've completed the most important notebook in this series!

You now understand how neural networks learn - loss functions, gradient descent, and backpropagation are the foundation of ALL deep learning.

Coming Up in Part 6: Evaluation - The Trained Expert

Training vs Inference - Learning mode vs using mode
Accuracy Metrics - Precision, recall, F1 score
Confusion Matrix - Detailed prediction breakdown
Interpretability - What did the model actually learn?

Continue to Part 6: part_6_evaluation.ipynb

"The Perceptron has learned. Now it's time to see what it REALLY knows."

The Brain's Decision Committee - From Confusion to Competence