Metrics · Part 6 · 45 min · intermediate

The trained expert

Separate training from inference, compute evaluation metrics, and inspect what the model learned.

Neural Network Fundamentals

Part 6: Evaluation - The Trained Expert

The Brain's Decision Committee - Chapter 6

The Story So Far...

In Part 5, something remarkable happened: our committee member learned. Starting with random weights and ~50% accuracy, they adjusted their priorities through gradient descent until they became an expert vertical line detector with 95%+ accuracy.

But how do we know they're actually good? Getting 95% on training data is one thing, but:

What kinds of mistakes do they still make?
Are some errors worse than others?
Can we understand why they make the decisions they do?

This is evaluation - properly assessing our trained model and understanding what it has learned.

What You'll Learn in Part 6

By the end of this notebook, you will understand:

Training vs Inference - The difference between learning mode and using mode
Accuracy - The simplest metric (and its limitations)
Confusion Matrix - A detailed breakdown of all prediction types
Precision & Recall - Measuring different kinds of correctness
F1 Score - Balancing precision and recall
Saliency/Interpretability - What did the model actually learn?
Test Sets - Why we need data the model has never seen

Prerequisites

Make sure you've completed:

Parts 0-1: Matrices (neural_network_fundamentals.ipynb)
Part 2: Single Neuron (part_2_single_neuron.ipynb)
Part 3: Activation Functions (part_3_activation_functions.ipynb)
Part 4: The Perceptron (part_4_perceptron.ipynb)
Part 5: Training (part_5_training.ipynb)

Setup: Import Dependencies and Recreate Our Trained Model

Let's bring in everything we need and train a model to evaluate.

cell 003

# =============================================================================
# PART 6: EVALUATION - SETUP AND IMPORTS
# =============================================================================

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, clear_output

# Try to import ipywidgets for interactive features
try:
    import ipywidgets as widgets
    WIDGETS_AVAILABLE = True
except ImportError:
    WIDGETS_AVAILABLE = False
    print("Note: ipywidgets not installed. Interactive features will be limited.")

# Set up matplotlib style
style_options = ['seaborn-v0_8-whitegrid', 'seaborn-whitegrid', 'ggplot', 'default']
for style in style_options:
    try:
        plt.style.use(style)
        break
    except OSError:
        continue

plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 12
np.random.seed(42)

print("Setup complete!")
print("="*60)

cell 004

# =============================================================================
# RECREATE OUR TOOLS FROM PREVIOUS NOTEBOOKS
# =============================================================================

# -----------------------------------------------------------------------------
# Our canonical line images (from Part 1)
# -----------------------------------------------------------------------------
vertical_line = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
horizontal_line = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]])
vertical_flat = vertical_line.flatten()
horizontal_flat = horizontal_line.flatten()

# -----------------------------------------------------------------------------
# Dataset generator (from Part 4)
# -----------------------------------------------------------------------------
def generate_line_dataset(n_samples=100, noise_level=0.0, seed=None):
    """Generate vertical (label=1) and horizontal (label=0) line images."""
    if seed is not None:
        np.random.seed(seed)
    
    X, y = [], []
    
    for i in range(n_samples):
        image = np.zeros((3, 3))
        
        if i < n_samples // 2:  # Vertical lines
            col = np.random.randint(0, 3)
            image[:, col] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(1)
        else:  # Horizontal lines
            row = np.random.randint(0, 3)
            image[row, :] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(0)
    
    X, y = np.array(X), np.array(y)
    shuffle_idx = np.random.permutation(n_samples)
    return X[shuffle_idx], y[shuffle_idx]

# -----------------------------------------------------------------------------
# Sigmoid activation function (from Part 3)
# -----------------------------------------------------------------------------
def sigmoid(z):
    """Sigmoid activation: maps any value to range (0, 1)."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# -----------------------------------------------------------------------------
# TrainablePerceptron class (from Part 5)
# -----------------------------------------------------------------------------
class TrainablePerceptron:
    """A Perceptron that can learn from examples."""
    
    def __init__(self, n_inputs):
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0
        self.n_inputs = n_inputs
        self.loss_history = []
        self.accuracy_history = []
        self.is_trained = False  # Track if model has been trained
    
    def forward(self, x):
        x = np.array(x).flatten()
        z = np.dot(self.weights, x) + self.bias
        return sigmoid(z)
    
    def predict(self, x):
        return 1 if self.forward(x) >= 0.5 else 0
    
    def compute_loss(self, y_true, y_pred):
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    
    def train(self, X, y, learning_rate=0.1, epochs=100, verbose=True):
        self.loss_history = []
        self.accuracy_history = []
        
        for epoch in range(epochs):
            total_loss = 0
            correct = 0
            
            for i in range(len(X)):
                xi, yi = X[i], y[i]
                y_pred = self.forward(xi)
                loss = self.compute_loss(yi, y_pred)
                total_loss += loss
                
                if (y_pred >= 0.5 and yi == 1) or (y_pred < 0.5 and yi == 0):
                    correct += 1
                
                error = y_pred - yi
                self.weights = self.weights - learning_rate * error * xi
                self.bias = self.bias - learning_rate * error
            
            avg_loss = total_loss / len(X)
            accuracy = correct / len(X)
            self.loss_history.append(avg_loss)
            self.accuracy_history.append(accuracy)
            
            if verbose and (epoch + 1) % 10 == 0:
                print(f"  Epoch {epoch+1:3d}/{epochs}: Loss = {avg_loss:.4f}, Accuracy = {accuracy*100:.1f}%")
        
        self.is_trained = True
        
        if verbose:
            print(f"\nTraining complete! Final accuracy: {self.accuracy_history[-1]*100:.1f}%")
        
        return self.loss_history

print("Tools recreated from previous notebooks!")
print("  - Line image templates")
print("  - Dataset generator")
print("  - Sigmoid activation")
print("  - TrainablePerceptron class")

cell 005

# =============================================================================
# TRAIN OUR MODEL (Quick recap from Part 5)
# =============================================================================

print("="*70)
print("TRAINING OUR MODEL (to have something to evaluate)")
print("="*70)

# Generate training data
np.random.seed(42)
X_train, y_train = generate_line_dataset(n_samples=100, noise_level=0.0, seed=42)

# Generate TEST data (NEW! - data the model has never seen)
X_test, y_test = generate_line_dataset(n_samples=50, noise_level=0.0, seed=999)

print(f"\nTraining set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples (model has NEVER seen these!)")

# Create and train model
model = TrainablePerceptron(n_inputs=9)
print("\nTraining...")
model.train(X_train, y_train, learning_rate=0.5, epochs=50, verbose=True)

print("\n" + "="*70)
print("Model is trained and ready for evaluation!")
print("="*70)

6.1 Training vs Inference: The Committee's Memory

Before we evaluate, let's understand an important distinction: training mode vs inference mode.

What IS Inference?

The word "inference" comes from Latin inferre meaning "to bring in" or "to conclude." In machine learning:

Inference = Using a trained model to make predictions on new data

Think of it like this:

Training = Teaching someone how to do a job
Inference = That person actually doing the job

Why Two Different Modes?

Aspect	Training Mode	Inference Mode
Purpose	Learn from examples	Make predictions
Weights	Being updated constantly	Frozen (fixed)
Data	Training set (with labels)	New, unseen data
Speed	Slower (computing gradients)	Fast (forward pass only)
Goal	Minimize loss	Predict accurately

Committee Analogy

"During training, the committee is in a meeting room, debating cases, learning from mistakes, and updating their rulebook. Once trained, they compile their final rulebook and hand it to the front desk. The front desk uses this rulebook to make quick decisions without calling the committee for every case."

Training: The committee meeting (slow, learning, updating)
Inference: The front desk using the final rulebook (fast, fixed, no learning)

Why Does This Distinction Matter?

Scenario	Why It Matters
Deployment	In production, you use inference mode for speed
Evaluation	We evaluate in inference mode (weights must be fixed!)
Consistency	Same weights give same predictions every time
Resources	Inference uses less memory (no gradients stored)

The Key Insight

During inference, the model does NOT learn anything new. The weights are "frozen" - they don't change. This is essential because:

Reproducibility: Same input always gives same output
Speed: No gradient computation needed
Fairness: Test data doesn't influence the model

Why "Frozen" Weights Matter Mathematically

During training, after each prediction, we do:

weights = weights - learning_rate × gradient

During inference, we SKIP this step entirely. The weights stay exactly as they were after training finished.

Why does this matter?

If we kept updating during inference...	Consequence
Weights would change with each new input	Same input could give different outputs!
Model would "drift" over time	Yesterday's predictions wouldn't match today's
Hard to reproduce results	"But it worked yesterday!"
Unfair for test evaluation	Test data would influence the model

The mathematical guarantee: With frozen weights, $f(x) = \sigma(w \cdot x + b)$ is a deterministic function - same input ALWAYS gives same output.
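The difference between the two modes can be sketched in a few lines. This is a minimal illustration, not the notebook's `TrainablePerceptron`: the helper names `forward` and `training_step` and the all-zero starting weights are assumptions chosen so the arithmetic is easy to follow. The update rule is the same one shown above.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps any value to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, b, x):
    """Inference: forward pass only. Parameters are read, never written."""
    return sigmoid(np.dot(w, x) + b)

def training_step(w, b, x, y, learning_rate=0.1):
    """Training: forward pass plus one gradient-descent update."""
    error = forward(w, b, x) - y  # gradient of cross-entropy loss w.r.t. z
    return w - learning_rate * error * x, b - learning_rate * error

# Hypothetical starting parameters and one flattened 3x3 vertical line
w = np.zeros(9)
b = 0.0
x = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0], dtype=float)

# Inference twice: same parameters + same input = same output
first = forward(w, b, x)
second = forward(w, b, x)
same_output = bool(first == second)

# One training step on label 1 changes the parameters, so the output moves
w_new, b_new = training_step(w, b, x, y=1)
after_update = forward(w_new, b_new, x)

print(f"Inference, call 1: {first:.4f}")
print(f"Inference, call 2: {second:.4f} (identical: {same_output})")
print(f"After one update:  {after_update:.4f}")
```

Only `training_step` returns new parameters. If evaluation code never calls it, the model cannot drift between predictions.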

In Code

cell 007

# =============================================================================
# TRAINING VS INFERENCE: Demonstration
# =============================================================================

print("="*70)
print("TRAINING vs INFERENCE MODE")
print("="*70)

# Show the model's state
print(f"\nModel state: {'TRAINED' if model.is_trained else 'UNTRAINED'}")

# In training mode, weights change after each sample
print("\n" + "-"*70)
print("DURING TRAINING (weights change):")
print("-"*70)
print("  For each sample:")
print("    1. Forward pass → get prediction")
print("    2. Compute loss → how wrong?")
print("    3. Compute gradients → which direction?")
print("    4. Update weights → improve! (weights CHANGE)")

# In inference mode, weights are frozen
print("\n" + "-"*70)
print("DURING INFERENCE (weights frozen):")
print("-"*70)
print("  For each sample:")
print("    1. Forward pass → get prediction")
print("    2. Done! (NO weight updates)")

# Demonstrate inference
print("\n" + "-"*70)
print("INFERENCE EXAMPLE:")
print("-"*70)

# Save weights before
weights_before = model.weights.copy()

# Make predictions (inference)
pred_v = model.forward(vertical_flat)
pred_h = model.forward(horizontal_flat)

# Check weights after
weights_after = model.weights.copy()

print(f"\n  Vertical line:   {pred_v:.4f} ({pred_v*100:.1f}% confident it's vertical)")
print(f"  Horizontal line: {pred_h:.4f} ({pred_h*100:.1f}% confident it's vertical)")
print(f"\n  Weights changed? {not np.allclose(weights_before, weights_after)}")
print(f"  (In inference mode, weights stay fixed!)")

6.2 Accuracy: The Simplest Metric

We've been using accuracy throughout our notebooks, but let's formally define it and understand its limitations.

What IS Accuracy?

Accuracy answers the question: "Of all the predictions I made, what fraction was correct?"

$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$

Breaking Down the Formula

Let's understand each part:

Component	What It Means	Our Example
Correct Predictions	Cases where prediction matches truth	Said "vertical" for vertical, "horizontal" for horizontal
Total Predictions	All cases we predicted on	All 50 test images
Accuracy	The ratio (0 to 1, or 0% to 100%)	48/50 = 0.96 = 96%

Computing Accuracy Step by Step

Step 1: Make predictions on all samples
Step 2: Compare each prediction to the true label
Step 3: Count how many match (correct)
Step 4: Divide by total number of predictions
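The four steps map directly onto NumPy array operations. The sketch below uses made-up labels and model outputs (eight samples, not the notebook's test set) purely to show the mechanics.

```python
import numpy as np

# Made-up example data: true labels and sigmoid outputs for 8 samples
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
probs = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])

# Step 1: make predictions (threshold the outputs at 0.5)
y_pred = (probs >= 0.5).astype(int)

# Step 2: compare each prediction to the true label
matches = (y_pred == y_true)

# Step 3: count how many match
n_correct = int(matches.sum())

# Step 4: divide by the total number of predictions
accuracy = n_correct / len(y_true)

print(f"Predictions: {y_pred}")
print(f"Correct: {n_correct}/{len(y_true)} -> accuracy = {accuracy:.2f}")
```

Here samples 3 and 6 are misclassified, so 6 of 8 predictions match and the accuracy is 0.75.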

Why Accuracy Can Be Misleading

Accuracy has a hidden flaw: it treats all mistakes equally and ignores class imbalance.

Example - Fraud Detection:

Suppose 99% of transactions are legitimate, 1% are fraud.

Model Strategy	Accuracy	Is It Good?
Say "legitimate" for EVERYTHING	99%	NO! Catches 0% of fraud!
Actually detect fraud	97%	YES! Even though lower accuracy

The "dumb" model gets 99% accuracy by ignoring the problem entirely!

Example - Medical Diagnosis:

Scenario	Type of Error	Consequence
Say "healthy" when patient is sick	Miss a disease	Patient doesn't get treatment! (VERY bad)
Say "sick" when patient is healthy	False alarm	Unnecessary tests (annoying but not dangerous)

Both are "wrong" but one is much worse! Accuracy treats them the same.

When Accuracy Works Well

Accuracy is a good metric when:

Classes are balanced (roughly 50/50 split)
All mistakes have equal cost
You want a quick overall view

Our vertical/horizontal classifier is a good case for accuracy: balanced classes, equal mistake costs.

Understanding Why Class Imbalance Breaks Accuracy

Let's do the math to see WHY accuracy is misleading with imbalanced data:

Scenario: Fraud Detection (1% fraud, 99% legitimate)

Strategy	Fraud Caught	Accuracy Calculation
"Always say legitimate"	0 of 100 frauds	(0 + 9900) / 10000 = 99%
Good detector	90 of 100 frauds	(90 + 9800) / 10000 = 98.9%

The "dumb" strategy has HIGHER accuracy but catches ZERO fraud!

Why this happens mathematically:

$\text{Accuracy} = \frac{TP + TN}{\text{Total}}$

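When 99% of data is class 0, you can get 99% accuracy by predicting 0 for everything. In the 10,000-transaction example that gives TN = 9900, FN = 100, and TP = FP = 0: every fraud case is a false negative, yet accuracy is still 9900 / 10000 = 99%.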

The lesson: When classes are imbalanced, accuracy is dominated by the majority class. We need metrics that focus on the minority class (precision, recall).
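The arithmetic from the fraud table can be checked directly. The counts below come from that table (10,000 transactions, 100 of them fraud); the function names are illustrative, and `fraud_caught_rate` is simply the fraction of actual fraud cases that were flagged.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def fraud_caught_rate(tp, fn):
    """Fraction of actual fraud cases (class 1) that the model flagged."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Strategy 1: always say "legitimate" -> all 100 frauds are missed
always_legit = dict(tp=0, tn=9900, fp=0, fn=100)

# Strategy 2: good detector -> catches 90 frauds, raises 100 false alarms
good_detector = dict(tp=90, tn=9800, fp=100, fn=10)

for name, c in [("Always legitimate", always_legit), ("Good detector", good_detector)]:
    acc = accuracy(**c)
    caught = fraud_caught_rate(c["tp"], c["fn"])
    print(f"{name:18s} accuracy = {acc:.3f}, fraud caught = {caught:.0%}")
```

The first strategy scores 0.990 accuracy while catching 0% of fraud; the second scores 0.989 while catching 90%. Accuracy alone ranks them the wrong way round.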

Let's Calculate Accuracy Properly

cell 009

# =============================================================================
# ACCURACY: Step-by-Step Calculation
# =============================================================================

print("="*70)
print("CALCULATING ACCURACY: Step by Step")
print("="*70)

def calculate_accuracy(model, X, y, verbose=True):
    """
    Calculate accuracy of model on given data.
    
    Parameters:
        model: Trained model with predict() method
        X: Input data (n_samples, n_features)
        y: True labels (n_samples,)
        verbose: Whether to print details
    
    Returns:
        accuracy: Float between 0 and 1
        predictions: Array of predicted labels
    """
    predictions = []
    correct = 0
    
    for i in range(len(X)):
        pred = model.predict(X[i])
        predictions.append(pred)
        if pred == y[i]:
            correct += 1
    
    accuracy = correct / len(y)
    
    if verbose:
        print(f"\n  Total samples: {len(y)}")
        print(f"  Correct: {correct}")
        print(f"  Wrong: {len(y) - correct}")
        print(f"  Accuracy: {correct}/{len(y)} = {accuracy:.4f} = {accuracy*100:.1f}%")
    
    return accuracy, np.array(predictions)

# Calculate accuracy on TRAINING data
print("\n" + "-"*70)
print("TRAINING SET ACCURACY:")
print("-"*70)
train_accuracy, train_preds = calculate_accuracy(model, X_train, y_train)

# Calculate accuracy on TEST data (NEW!)
print("\n" + "-"*70)
print("TEST SET ACCURACY:")
print("-"*70)
test_accuracy, test_preds = calculate_accuracy(model, X_test, y_test)

print("\n" + "="*70)
print("KEY INSIGHT: Training vs Test Accuracy")
print("="*70)
print(f"""
Training accuracy: {train_accuracy*100:.1f}%
Test accuracy:     {test_accuracy*100:.1f}%

The TEST accuracy is what really matters!
Training accuracy can be misleadingly high if the model "memorizes" the data.
Test accuracy shows how well the model generalizes to NEW data.
""")

6.3 The Confusion Matrix: A Detailed Report Card

Accuracy gives us one number. But what if we want to understand WHICH mistakes the model makes?

What IS a Confusion Matrix?

A confusion matrix is a table that breaks down all predictions into four categories based on two questions:

What did we predict?
What was the actual truth?

                      PREDICTED
                    0        1
              ┌─────────┬─────────┐
        0     │   TN    │   FP    │
   ACTUAL     ├─────────┼─────────┤
        1     │   FN    │   TP    │
              └─────────┴─────────┘
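The same layout can be held in a 2x2 array, with the actual label as the row index and the predicted label as the column index. The sketch below uses made-up labels; the helper name `confusion_table` is an assumption for illustration.

```python
import numpy as np

def confusion_table(y_true, y_pred):
    """Return a 2x2 array: rows = actual class, columns = predicted class."""
    table = np.zeros((2, 2), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        table[actual, predicted] += 1
    return table

# Made-up labels and predictions for 8 samples
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

table = confusion_table(y_true, y_pred)

# Cell positions match the diagram above
tn, fp = table[0, 0], table[0, 1]
fn, tp = table[1, 0], table[1, 1]

print(table)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Each row sums to the number of samples of that actual class, which is a quick sanity check on any confusion matrix.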

Why "Confusion" Matrix?

The name comes from the fact that it shows how the model gets "confused" - where it mixes up one class for another.

Understanding the Four Categories

Abbrev	Full Name	Meaning	Our Example
TP	True Positive	Predicted 1, was actually 1	Said "vertical", WAS vertical ✓
TN	True Negative	Predicted 0, was actually 0	Said "horizontal", WAS horizontal ✓
FP	False Positive	Predicted 1, was actually 0	Said "vertical", was horizontal ✗
FN	False Negative	Predicted 0, was actually 1	Said "horizontal", was vertical ✗

Memory Trick for TP/TN/FP/FN

Think of it as TWO questions:

True/False: Was the prediction correct?
- True = correct
- False = wrong
Positive/Negative: What did we predict?
- Positive = predicted class 1 (vertical)
- Negative = predicted class 0 (horizontal)

So:

True Positive = We were True (correct) when we predicted Positive (vertical)
False Positive = We were False (wrong) when we predicted Positive (vertical)
True Negative = We were True (correct) when we predicted Negative (horizontal)
False Negative = We were False (wrong) when we predicted Negative (horizontal)
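The two-question trick translates almost word for word into code. The function name `outcome_name` is hypothetical; it answers each question separately and then joins the two answers.

```python
def outcome_name(y_true, y_pred):
    """Name the confusion-matrix category for one (truth, prediction) pair."""
    # Question 1: was the prediction correct?
    first = "True" if y_pred == y_true else "False"
    # Question 2: what did we predict?
    second = "Positive" if y_pred == 1 else "Negative"
    return f"{first} {second}"

# All four combinations, with 1 = vertical and 0 = horizontal
for y_true, y_pred in [(1, 1), (0, 0), (0, 1), (1, 0)]:
    print(f"truth={y_true}, predicted={y_pred} -> {outcome_name(y_true, y_pred)}")
```

Note that the second word always describes the prediction, never the truth. That is the detail most often mixed up.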

Committee Analogy

"The confusion matrix is like a detailed performance review for our committee member:

TP: Cases they correctly identified as vertical
TN: Cases they correctly identified as NOT vertical
FP: Cases they wrongly called vertical (a false alarm!)
FN: Cases they missed (should have said vertical but didn't)"

Alternative Names You'll See

Our Term	Also Called	When Used
False Positive	Type I Error	Statistics
False Negative	Type II Error	Statistics
True Positive Rate	Sensitivity, Recall	Medical
True Negative Rate	Specificity	Medical
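The two rates in the table are each computed from one row of the confusion matrix. The counts below are hypothetical (they are not results from the trained model) and the function names are illustrative.

```python
def sensitivity(tp, fn):
    """True Positive Rate (recall): share of actual positives that were found."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True Negative Rate: share of actual negatives that were correctly cleared."""
    return tn / (tn + fp)

# Hypothetical counts for a 50-sample test set (25 vertical, 25 horizontal)
tp, fn = 23, 2   # actual vertical lines: 23 found, 2 missed
tn, fp = 24, 1   # actual horizontal lines: 24 cleared, 1 false alarm

tpr = sensitivity(tp, fn)
tnr = specificity(tn, fp)
false_positive_rate = 1 - tnr   # rate of Type I errors among actual negatives

print(f"Sensitivity (TPR): {tpr:.2f}")
print(f"Specificity (TNR): {tnr:.2f}")
print(f"False positive rate: {false_positive_rate:.2f}")
```

Sensitivity uses only the actual-positive row (TP and FN) and specificity only the actual-negative row (TN and FP), so neither is affected by how many samples the other class has.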

Real-World Examples of Each Error Type

Understanding these errors is easier with concrete examples:

Outcome	Medical Example	Email Example	Self-Driving Car
TP	Correctly diagnose sick patient	Correctly mark spam	Correctly detect pedestrian
TN	Correctly clear healthy patient	Correctly allow good email	Correctly see that the road is clear
FP	Diagnose healthy as sick	Mark good email as spam	Brake for nothing (annoying)
FN	Miss a sick patient	Allow spam through	Miss a pedestrian (FATAL!)

Notice: The consequences of FP vs FN are very different depending on the application!

Medical: FN is worse (missed diagnosis can be fatal)
Spam filter: FP is worse (losing important emails)
Self-driving: FN is MUCH worse (hitting someone)

This is why we have precision and recall - to measure these separately.

Let's Build a Confusion Matrix

cell 011

# =============================================================================
# CONFUSION MATRIX: Implementation and Explanation
# =============================================================================

def confusion_matrix(y_true, y_pred):
    """
    Compute the confusion matrix.
    
    The logic behind each calculation:
    - TP: prediction=1 AND truth=1 (both conditions true)
    - TN: prediction=0 AND truth=0 (both conditions true)
    - FP: prediction=1 AND truth=0 (predicted positive, was negative)
    - FN: prediction=0 AND truth=1 (predicted negative, was positive)
    
    Parameters:
        y_true: Array of true labels (0 or 1)
        y_pred: Array of predicted labels (0 or 1)
    
    Returns:
        dict with TP, TN, FP, FN counts
    """
    # True Positive: We said 1, it was 1
    TP = np.sum((y_pred == 1) & (y_true == 1))
    
    # True Negative: We said 0, it was 0
    TN = np.sum((y_pred == 0) & (y_true == 0))
    
    # False Positive: We said 1, but it was 0 (false alarm!)
    FP = np.sum((y_pred == 1) & (y_true == 0))
    
    # False Negative: We said 0, but it was 1 (missed it!)
    FN = np.sum((y_pred == 0) & (y_true == 1))
    
    return {'TP': TP, 'TN': TN, 'FP': FP, 'FN': FN}

print("="*70)
print("CONFUSION MATRIX: Step by Step")
print("="*70)

# Calculate confusion matrix for test set
cm = confusion_matrix(y_test, test_preds)

print("\nFor our TEST set:")
print(f"  Total samples: {len(y_test)}")
print(f"  Vertical lines (label=1): {np.sum(y_test == 1)}")
print(f"  Horizontal lines (label=0): {np.sum(y_test == 0)}")

print("\n" + "-"*70)
print("CONFUSION MATRIX BREAKDOWN:")
print("-"*70)

print(f"""
                       PREDICTED
                   Horizontal(0)  Vertical(1)
              ┌─────────────────┬─────────────────┐
   Horiz.(0)  │  TN = {cm['TN']:3d}       │  FP = {cm['FP']:3d}       │
   ACTUAL     ├─────────────────┼─────────────────┤
   Vert.(1)   │  FN = {cm['FN']:3d}       │  TP = {cm['TP']:3d}       │
              └─────────────────┴─────────────────┘
""")

print("Interpretation (reading the matrix):")
print(f"  ✓ True Positives (TP = {cm['TP']}): Correctly identified as VERTICAL")
print(f"  ✓ True Negatives (TN = {cm['TN']}): Correctly identified as HORIZONTAL")
print(f"  ✗ False Positives (FP = {cm['FP']}): Wrongly called VERTICAL (was horizontal)")
print(f"  ✗ False Negatives (FN = {cm['FN']}): Wrongly called HORIZONTAL (was vertical)")

# Verify: TP + TN + FP + FN should equal total samples
total = cm['TP'] + cm['TN'] + cm['FP'] + cm['FN']
check = "✓" if total == len(y_test) else "✗ MISMATCH"
print(f"\n  Verification: TP + TN + FP + FN = {total} (should equal {len(y_test)}) {check}")

# Show how accuracy relates to confusion matrix
print("\n" + "-"*70)
print("ACCURACY FROM CONFUSION MATRIX:")
print("-"*70)
print(f"""
  Accuracy = (TP + TN) / (TP + TN + FP + FN)
           = ({cm['TP']} + {cm['TN']}) / ({cm['TP']} + {cm['TN']} + {cm['FP']} + {cm['FN']})
           = {cm['TP'] + cm['TN']} / {total}
           = {(cm['TP'] + cm['TN']) / total:.4f}
           = {(cm['TP'] + cm['TN']) / total * 100:.1f}%
""")

cell 012


# =============================================================================
# VISUALIZE THE CONFUSION MATRIX
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Confusion Matrix as heatmap
ax1 = axes[0]
cm_matrix = np.array([[cm['TN'], cm['FP']], 
                       [cm['FN'], cm['TP']]])

im = ax1.imshow(cm_matrix, cmap='Blues')
ax1.set_xticks([0, 1])
ax1.set_yticks([0, 1])
ax1.set_xticklabels(['Horizontal (0)', 'Vertical (1)'])
ax1.set_yticklabels(['Horizontal (0)', 'Vertical (1)'])
ax1.set_xlabel('Predicted Label', fontsize=12)
ax1.set_ylabel('Actual Label', fontsize=12)
ax1.set_title('Confusion Matrix', fontsize=14, fontweight='bold')

# Add text annotations
labels = [['TN', 'FP'], ['FN', 'TP']]
for i in range(2):
    for j in range(2):
        text_color = 'white' if cm_matrix[i, j] > cm_matrix.max()/2 else 'black'
        ax1.text(j, i, f'{labels[i][j]}\n{cm_matrix[i, j]}', 
                ha='center', va='center', fontsize=14, fontweight='bold', color=text_color)

plt.colorbar(im, ax=ax1)

# Plot 2: Visual explanation
ax2 = axes[1]
ax2.axis('off')

explanation_text = f"""
READING THE CONFUSION MATRIX
{'='*45}

The DIAGONAL (top-left to bottom-right) shows 
CORRECT predictions:
  • TN ({cm['TN']}): Horizontal predicted as Horizontal ✓
  • TP ({cm['TP']}): Vertical predicted as Vertical ✓

The OFF-DIAGONAL shows ERRORS:
  • FP ({cm['FP']}): Horizontal wrongly called Vertical ✗
  • FN ({cm['FN']}): Vertical wrongly called Horizontal ✗

A PERFECT model has:
  • All values on the diagonal
  • Zeros everywhere else
"""

ax2.text(0.1, 0.5, explanation_text, fontsize=11, family='monospace',
        verticalalignment='center', transform=ax2.transAxes,
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.show()

6.4 Precision, Recall, and F1 Score

The confusion matrix gives us four numbers. From these, we can calculate more specific metrics that answer different questions.

Precision: "When I Say Positive, Am I Right?"

Precision answers: "Of all the times I predicted 'positive' (vertical), how many were actually positive?"

$\text{Precision} = \frac{TP}{TP + FP}$

Breaking it down:

Numerator (TP): Cases we correctly called positive
Denominator (TP + FP): ALL cases we called positive (right or wrong)

High precision means: When we say "vertical", we're usually right. Few false alarms.

When to prioritize precision:

Spam filters (don't delete legitimate emails!)
Recommender systems (don't recommend things users hate!)
Any case where false alarms are costly

Recall: "Did I Catch All the Positives?"

Recall (also called Sensitivity) answers: "Of all the actual positives, how many did I catch?"

$\text{Recall} = \frac{TP}{TP + FN}$

Breaking it down:

Numerator (TP): Cases we correctly caught
Denominator (TP + FN): ALL actual positives (caught or missed)

High recall means: We catch most of the actual vertical lines. Few misses.

When to prioritize recall:

Disease detection (don't miss sick patients!)
Fraud detection (don't miss fraudulent transactions!)
Any case where missing positives is costly
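
A small worked example may help keep the two denominators apart. The counts below are invented for illustration and are not the notebook's results.

```python
# Hypothetical counts, chosen only to make the arithmetic easy to follow
TP, FP, FN = 40, 10, 20

# Precision divides by everything we PREDICTED positive
precision = TP / (TP + FP)   # 40 / 50

# Recall divides by everything that was ACTUALLY positive
recall = TP / (TP + FN)      # 40 / 60

print(f"Precision = {precision:.1%}")
print(f"Recall    = {recall:.1%}")
```

The numerator is the same in both cases; only the denominator changes, which is why the two metrics can diverge.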

The Precision-Recall Trade-off

Here's the fundamental tension:

Strategy	Precision	Recall	Problem
"Only say vertical when 100% sure"	HIGH (few false alarms)	LOW (miss many)	Miss too many positives
"Say vertical for anything remotely vertical"	LOW (many false alarms)	HIGH (catch most)	Too many false alarms

You often can't maximize both! This is called the precision-recall trade-off.

Concrete Example: Airport Security

Imagine a security scanner detecting threats:

Setting	Precision	Recall	Outcome
Super sensitive	10%	99%	Catches nearly all threats, but 90% of alarms are false. Massive delays!
Super strict	95%	20%	Few false alarms but misses 80% of real threats. Dangerous!
Balanced	70%	70%	Some false alarms, catches most threats. Practical!
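
To see what those percentages imply in head counts, the sketch below assumes 100 real threats pass through the scanner (an invented figure, as is the helper name `alarm_counts`) and derives how many alarms each setting would raise from its precision and recall.

```python
def alarm_counts(n_threats, precision, recall):
    """Derive (TP, FP, FN) from precision and recall for a given number of real threats."""
    tp = recall * n_threats          # threats caught
    fn = n_threats - tp              # threats missed
    total_alarms = tp / precision    # because precision = TP / (TP + FP)
    fp = total_alarms - tp           # false alarms
    return round(tp), round(fp), round(fn)

# (precision, recall) pairs taken from the table above
settings = {
    "Super sensitive": (0.10, 0.99),
    "Super strict":    (0.95, 0.20),
    "Balanced":        (0.70, 0.70),
}

for name, (p, r) in settings.items():
    tp, fp, fn = alarm_counts(100, p, r)
    print(f"{name:16s} caught={tp:3d}  missed={fn:3d}  false alarms={fp:3d}")
```

Under this assumption the sensitive setting raises several hundred false alarms to catch one extra threat, which is the cost the table describes.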

Why the trade-off exists:

When we lower the threshold for saying "positive":

We catch MORE true positives (recall goes UP ↑)
But we also let in MORE false positives (precision usually goes DOWN ↓)

When we raise the threshold:

We have FEWER false positives (precision usually goes UP ↑)
But we miss MORE true positives (recall goes DOWN ↓)

There's no free lunch! The art is finding the right balance for your specific application.
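
The effect of moving the threshold can be sketched on a handful of scores without any trained model. The scores and labels below are made up for illustration, and `precision_recall_at` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Hypothetical model outputs (probabilities) and true labels, made up for illustration
scores = np.array([0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20])
labels = np.array([1,    1,    0,    1,    0,    1,    0,    0])

def precision_recall_at(threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    preds = (scores >= threshold).astype(int)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

for t in [0.25, 0.5, 0.8]:
    p, r = precision_recall_at(t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold trades recall for precision at every step; on real data precision can occasionally dip, but recall never rises as the threshold goes up.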

F1 Score: Finding the Balance

The F1 Score is the harmonic mean of precision and recall - a single number that balances both:

$\text{F1} = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

What IS a Harmonic Mean and Why Use It?

You might wonder: "Why not just use a regular average (arithmetic mean)?"

Three Types of Means:

Mean Type	Formula	Example: (99%, 10%)
Arithmetic	(a + b) / 2	(99 + 10) / 2 = 54.5%
Geometric	√(a × b)	√(99 × 10) = 31.5%
Harmonic	2ab / (a + b)	2×99×10 / (99+10) = 18.2%

Why harmonic mean is better for F1:

The harmonic mean punishes imbalance between the two values. If you have 99% precision but only 10% recall:

Arithmetic mean says "54.5% - not bad!"
Harmonic mean says "18.2% - this is terrible!"

The harmonic mean forces BOTH values to be reasonably high to get a good score.
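
The three means in the table can be checked in a few lines. The function names below are introduced for this sketch only.

```python
import math

def arithmetic_mean(a, b):
    return (a + b) / 2

def geometric_mean(a, b):
    return math.sqrt(a * b)

def harmonic_mean(a, b):
    # Guard against division by zero when both values are 0
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

precision, recall = 0.99, 0.10
print(f"Arithmetic: {arithmetic_mean(precision, recall):.1%}")
print(f"Geometric:  {geometric_mean(precision, recall):.1%}")
print(f"Harmonic:   {harmonic_mean(precision, recall):.1%}")
```

When the two inputs are equal, all three means coincide, so the harmonic mean only penalizes a model for being lopsided.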

Intuition: Think about speed. If you drive half the distance of a trip at 60 mph and the other half at 20 mph, your average speed isn't 40 mph; it's 30 mph (the harmonic mean), because you spend three times as long on the slow half. The slow part dominates.

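Why this matters for ML: A model that predicts "positive" for everything gets 100% recall, but its precision is only the fraction of samples that are actually positive (close to 0% when positives are rare). In that rare-positive case the harmonic mean stays close to the low precision and correctly flags the model as terrible.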
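
For an "always positive" classifier, precision works out to the share of positives in the data, so it approaches 0% only when positives are rare. The sketch below illustrates this on two invented label sets; `always_positive_metrics` is a hypothetical helper.

```python
import numpy as np

def always_positive_metrics(y_true):
    """Precision, recall, and F1 for a classifier that predicts 1 for every sample."""
    y_true = np.asarray(y_true)
    y_pred = np.ones_like(y_true)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # always 0 here
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1

# Hypothetical label sets: one balanced, one where positives are rare (1%)
balanced = np.array([1] * 50 + [0] * 50)
rare = np.array([1] * 1 + [0] * 99)

for name, y in [("balanced", balanced), ("rare positives", rare)]:
    p, r, f1 = always_positive_metrics(y)
    print(f"{name:15s} precision={p:.2f}  recall={r:.2f}  F1={f1:.3f}")
```

Recall is perfect in both cases, while F1 drops sharply once positives become rare, which is where this degenerate strategy does the most damage.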

Precision	Recall	F1 Score	Verdict
90%	90%	90%	Great! Both balanced
99%	10%	18%	Terrible! Very unbalanced
50%	50%	50%	Mediocre

F1 is high only when BOTH precision AND recall are reasonably high.

cell 014


# =============================================================================
# PRECISION, RECALL, F1: Calculation
# =============================================================================

def calculate_metrics(cm):
    """
    Calculate precision, recall, F1 from confusion matrix.
    
    Parameters:
        cm: dict with TP, TN, FP, FN
    
    Returns:
        dict with precision, recall, f1, accuracy
    """
    TP, TN, FP, FN = cm['TP'], cm['TN'], cm['FP'], cm['FN']
    
    # Precision: When we say positive, are we right?
    # Note: We add a check to avoid division by zero
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    
    # Recall: Did we catch all the positives?
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    
    # F1: Harmonic mean of precision and recall
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    # Accuracy (for comparison)
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    
    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

print("="*70)
print("PRECISION, RECALL, AND F1 SCORE")
print("="*70)

metrics = calculate_metrics(cm)

print("\n" + "-"*70)
print("STEP-BY-STEP CALCULATION:")
print("-"*70)

print(f"""
From our confusion matrix:
  TP = {cm['TP']} (correctly identified vertical lines)
  TN = {cm['TN']} (correctly identified horizontal lines)
  FP = {cm['FP']} (horizontal lines wrongly called vertical)
  FN = {cm['FN']} (vertical lines wrongly called horizontal)

PRECISION: "When I say vertical, am I right?"
  Formula: Precision = TP / (TP + FP)
  
  Precision = {cm['TP']} / ({cm['TP']} + {cm['FP']})
            = {cm['TP']} / {cm['TP'] + cm['FP']}
            = {metrics['precision']:.4f}
            = {metrics['precision']*100:.1f}%

RECALL: "Did I catch all the vertical lines?"
  Formula: Recall = TP / (TP + FN)
  
  Recall = {cm['TP']} / ({cm['TP']} + {cm['FN']})
         = {cm['TP']} / {cm['TP'] + cm['FN']}
         = {metrics['recall']:.4f}
         = {metrics['recall']*100:.1f}%

F1 SCORE: "Balance of precision and recall"
  Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
  
  F1 = 2 × ({metrics['precision']:.4f} × {metrics['recall']:.4f}) / ({metrics['precision']:.4f} + {metrics['recall']:.4f})
     = 2 × {metrics['precision'] * metrics['recall']:.4f} / {metrics['precision'] + metrics['recall']:.4f}
     = {metrics['f1']:.4f}
     = {metrics['f1']*100:.1f}%

ACCURACY (for comparison):
  Formula: Accuracy = (TP + TN) / Total
  
  Accuracy = ({cm['TP']} + {cm['TN']}) / {cm['TP'] + cm['TN'] + cm['FP'] + cm['FN']}
           = {cm['TP'] + cm['TN']} / {cm['TP'] + cm['TN'] + cm['FP'] + cm['FN']}
           = {metrics['accuracy']:.4f}
           = {metrics['accuracy']*100:.1f}%
""")

cell 015


# =============================================================================
# DEMONSTRATING THE PRECISION-RECALL TRADE-OFF
# =============================================================================

print("="*70)
print("THE PRECISION-RECALL TRADE-OFF: A Visual Demonstration")
print("="*70)

print("""
To understand the trade-off, let's see what happens when we change
our THRESHOLD for saying "vertical" (positive).

Currently we use: threshold = 0.5
  - If output >= 0.5 → predict "vertical"
  - If output < 0.5 → predict "horizontal"

But what if we change this threshold?
""")

# Try different thresholds
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
results = []

for threshold in thresholds:
    # Make predictions at this threshold
    preds = np.array([1 if model.forward(x) >= threshold else 0 for x in X_test])
    
    # Calculate confusion matrix
    TP = np.sum((preds == 1) & (y_test == 1))
    TN = np.sum((preds == 0) & (y_test == 0))
    FP = np.sum((preds == 1) & (y_test == 0))
    FN = np.sum((preds == 0) & (y_test == 1))
    
    # Calculate metrics
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    
    results.append({
        'threshold': threshold,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'TP': TP, 'FP': FP, 'FN': FN
    })
    
    print(f"Threshold = {threshold}:")
    print(f"  TP={TP:2d}, FP={FP:2d}, FN={FN:2d}")
    print(f"  Precision={precision:.1%}, Recall={recall:.1%}, F1={f1:.1%}")
    print()

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Precision vs Recall at different thresholds
ax = axes[0]
precisions = [r['precision'] for r in results]
recalls = [r['recall'] for r in results]

ax.plot(recalls, precisions, 'b-o', linewidth=2, markersize=10)
for r in results:
    ax.annotate(f"  t={r['threshold']}", 
               (r['recall'], r['precision']), fontsize=9)

ax.set_xlabel('Recall', fontsize=12)
ax.set_ylabel('Precision', fontsize=12)
ax.set_title('Precision-Recall Trade-off\n(Each point is a different threshold)', 
            fontsize=12, fontweight='bold')
ax.set_xlim(-0.05, 1.05)
ax.set_ylim(-0.05, 1.05)
ax.grid(True, alpha=0.3)

# Add ideal point
ax.scatter([1], [1], color='gold', s=200, marker='*', zorder=5, label='Ideal (1,1)')
ax.legend()

# Plot 2: Bar chart showing trade-off
ax = axes[1]
x = np.arange(len(thresholds))
width = 0.25

bars1 = ax.bar(x - width, precisions, width, label='Precision', color='#e74c3c')
bars2 = ax.bar(x, recalls, width, label='Recall', color='#27ae60')
bars3 = ax.bar(x + width, [r['f1'] for r in results], width, label='F1', color='#9b59b6')

ax.set_xlabel('Threshold', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Metrics at Different Thresholds', fontsize=12, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels([f'{t}' for t in thresholds])
ax.legend()
ax.set_ylim(0, 1.1)

plt.tight_layout()
plt.show()

print("""
KEY INSIGHT:
════════════════════════════════════════════════════════════════════════

• LOW threshold (0.1): "Say vertical for almost everything!"
  → High recall (catch most verticals) but low precision (many false alarms)
  
• HIGH threshold (0.9): "Only say vertical when VERY confident!"
  → High precision (rarely wrong when we say vertical) but low recall (miss many)
  
• MIDDLE threshold (0.5): Balanced trade-off

Notice how the precision-recall curve shows the trade-off: as one goes up, 
the other tends to go down. F1 score helps us find a good balance!
""")

cell 016


# =============================================================================
# VISUALIZE ALL METRICS
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Bar chart of all metrics
ax1 = axes[0]
metric_names = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
metric_values = [metrics['accuracy'], metrics['precision'], metrics['recall'], metrics['f1']]
colors = ['#3498db', '#e74c3c', '#27ae60', '#9b59b6']

bars = ax1.bar(metric_names, metric_values, color=colors, edgecolor='white', linewidth=2)
ax1.set_ylim(0, 1.1)
ax1.set_ylabel('Score', fontsize=12)
ax1.set_title('Model Performance Metrics', fontsize=14, fontweight='bold')
ax1.axhline(y=1.0, color='gray', linestyle='--', alpha=0.5, label='Perfect score')

# Add value labels
for bar, val in zip(bars, metric_values):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02, 
            f'{val:.1%}', ha='center', va='bottom', fontsize=12, fontweight='bold')

# Plot 2: Which metric to use guide
ax2 = axes[1]
ax2.axis('off')

metrics_explanation = """
WHICH METRIC SHOULD YOU USE?
═══════════════════════════════════════════════════

ACCURACY
  • Best when: Classes are balanced (50/50)
  • Misleading when: Rare events (e.g., 1% fraud)
  
PRECISION
  • Best when: False alarms are COSTLY
  • Examples: 
    - Spam filter (don't delete real email!)
    - Criminal conviction (don't jail innocent!)
  
RECALL
  • Best when: Missing positives is COSTLY
  • Examples: 
    - Disease detection (don't miss sick patients!)
    - Fraud detection (don't miss fraud!)
  
F1 SCORE
  • Best when: You need balance between P & R
  • Common single-number summary when
    classes are imbalanced

═══════════════════════════════════════════════════
For our V/H classifier, all metrics are similar 
because our dataset is balanced and the model works well!
"""

ax2.text(0.05, 0.5, metrics_explanation, fontsize=10, family='monospace',
        verticalalignment='center', transform=ax2.transAxes,
        bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))

plt.tight_layout()
plt.show()

6.5 The Committee Report: Saliency and Interpretability

We know our model works well, but WHY does it work? What has it actually learned?

What IS Interpretability?

Interpretability (also called Explainability) means understanding:

What patterns did the model learn?
Why does it make specific predictions?
Is it using the "right" features?

Question	How to Answer
What patterns did it learn?	Look at the weights
Why did it predict "vertical"?	Look at which inputs contributed most
Is it using the right features?	Visualize the saliency map

What IS Saliency?

The word "saliency" comes from Latin salire meaning "to leap." In machine learning:

Saliency = Which parts of the input "leap out" as important to the model

For our Perceptron, saliency is beautifully simple:

$\text{Saliency}_{i} = | w_{i} \times x_{i} |$

Where:

$w_{i}$ = weight for input $i$
$x_{i}$ = value of input $i$
$| \cdots |$ = absolute value (we care about magnitude, not sign)

Why Absolute Value?

Weight × Input	Meaning	Contribution
+2.0 × 1.0 = +2.0	Strongly SUPPORTS vertical	HIGH
-2.0 × 1.0 = -2.0	Strongly OPPOSES vertical	HIGH
+0.1 × 1.0 = +0.1	Weakly supports vertical	LOW

Both +2.0 and -2.0 are strong contributions - just in opposite directions. The absolute value captures the strength of influence.

Committee Analogy

"We ask the committee: 'Show us your reasoning. Highlight the evidence that most influenced your decision.' They produce a report where the most influential pieces of evidence glow brightly. This is the saliency map - a visual explanation of the committee's thought process."

Why Interpretability Matters

Reason	Example
Trust	Can we trust this medical diagnosis?
Debugging	Why is the model getting this wrong?
Discovery	What features actually matter?
Fairness	Is it unfairly using race or gender?
Legal	Regulations such as GDPR are often read as granting a "right to explanation"

The Math Behind Saliency

For our Perceptron, let's trace WHY $| w_{i} \times x_{i} |$ measures importance:

Step 1: The Neuron's Decision

$z = w_{1} x_{1} + w_{2} x_{2} + \dots + w_{9} x_{9} + b$

Each term $w_{i} x_{i}$ is that pixel's contribution to the final sum $z$.

Step 2: How Much Did Each Pixel Contribute?

Pixel	Weight ( $w_{i}$ )	Input ( $x_{i}$ )	Contribution ( $w_{i} \times x_{i}$ )
0	0.5	0	0.5 × 0 = 0 (no contribution)
1	1.2	1	1.2 × 1 = 1.2 (strong positive)
4	-0.8	1	-0.8 × 1 = -0.8 (strong negative)

Step 3: Why Absolute Value?

Both +1.2 and -0.8 are strong influences on the decision - they just push in opposite directions. The absolute value captures strength of influence regardless of direction.

$\text{Saliency}_{i} = | w_{i} \times x_{i} |$

Interpretation:

High saliency = This pixel strongly influenced the decision (positively OR negatively)
Low saliency = This pixel didn't matter much for this prediction
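
The calculation can be sketched on a single 3x3 input. The weights below are hypothetical: positions 0, 1, and 4 reuse the values from the Step 2 table, and the remaining six are invented for illustration, so they do not reflect what the trained model actually learned.

```python
import numpy as np

# Hypothetical weights for a 3x3 image, flattened row by row.
# Positions 0, 1, and 4 match the Step 2 table; the rest are invented.
weights = np.array([ 0.5,  1.2,  0.3,
                    -0.4, -0.8,  0.1,
                     0.2,  0.9, -0.6])

# A vertical line down the middle column (pixels 1, 4, 7 are on)
x = np.array([0, 1, 0,
              0, 1, 0,
              0, 1, 0])

contributions = weights * x          # signed contribution of each pixel to z
saliency = np.abs(contributions)     # strength of influence, ignoring direction

# Only pixels that are "on" can contribute; report those
for i in np.flatnonzero(x):
    print(f"pixel {i}: contribution {contributions[i]:+.1f}, saliency {saliency[i]:.1f}")

print("Most influential pixel:", int(np.argmax(saliency)))
```

Pixels with an input of 0 always get zero saliency regardless of their weight, which is why a saliency map is specific to one input rather than a property of the model alone.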

Looking at What Our Model Learned

cell 018


# =============================================================================
# SALIENCY: What Did the Model Learn?
# =============================================================================

print("="*70)
print("THE COMMITTEE REPORT: What Did the Model Learn?")
print("="*70)

# First, let's look at the learned weights
print("\n" + "-"*70)
print("STEP 1: Examine the Learned Weights")
print("-"*70)

weights_grid = model.weights.reshape(3, 3)
print("""
Remember our pixel positions:

    Position Index:     Image Layout:
    [0] [1] [2]         [row 0]
    [3] [4] [5]   →     [row 1]
    [6] [7] [8]         [row 2]

Our model's learned weights (as 3x3 grid):
""")
for i, row in enumerate(weights_grid):
    print(f"  Row {i}: [{row[0]:6.3f}, {row[1]:6.3f}, {row[2]:6.3f}]")

print(f"\n  Bias: {model.bias:.4f}")

# Interpret the weights
print("\n" + "-"*70)
print("STEP 2: Interpret What the Weights Mean")
print("-"*70)

print("""
HOW TO READ WEIGHTS:
  • Positive weight → This pixel being bright INCREASES "vertical" confidence
  • Negative weight → This pixel being bright DECREASES "vertical" confidence
  • Near-zero weight → This pixel doesn't matter much
""")

# Find which positions have highest/lowest weights
flat_weights = model.weights
max_idx = np.argmax(flat_weights)
min_idx = np.argmin(flat_weights)

print(f"""
KEY OBSERVATIONS:

  Maximum weight: position {max_idx} (row {max_idx//3}, col {max_idx%3}) = {flat_weights[max_idx]:.3f}
    → If this pixel is bright, model is MORE confident it's vertical
    
  Minimum weight: position {min_idx} (row {min_idx//3}, col {min_idx%3}) = {flat_weights[min_idx]:.3f}
    → If this pixel is bright, model is LESS confident it's vertical
    
  Positions with HIGH positive weights: {np.where(flat_weights > 0.3)[0].tolist()}
    → These pixels SUPPORT "vertical" classification
    
  Positions with HIGH negative weights: {np.where(flat_weights < -0.3)[0].tolist()}
    → These pixels OPPOSE "vertical" classification
""")

cell 019

# =============================================================================
# VISUALIZE: Weights and Saliency Maps - THE "AHA!" MOMENT
# =============================================================================

def compute_saliency(model, x):
    """
    Compute saliency map for an input.
    
    Saliency = |weight × input|
    
    This tells us: "How much did each input pixel 
    contribute to the final decision?"
    
    Parameters:
        model: Trained model with weights
        x: Input image (flattened)
    
    Returns:
        saliency: Array of contribution magnitudes
    """
    x = np.array(x).flatten()
    # Multiply each input by its weight, take absolute value
    return np.abs(model.weights * x)

fig, axes = plt.subplots(2, 4, figsize=(16, 8))

# =================
# Top row: Vertical line analysis
# =================

# 1. Input image
ax = axes[0, 0]
ax.imshow(vertical_line, cmap='Blues', vmin=0, vmax=1)
ax.set_title('INPUT:\nVertical Line', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        ax.text(j, i, f'{vertical_line[i,j]:.0f}', ha='center', va='center', fontsize=12)
ax.axis('off')

# 2. Model weights
ax = axes[0, 1]
im = ax.imshow(weights_grid, cmap='RdBu', vmin=-2, vmax=2)
ax.set_title('WEIGHTS:\nLearned by Model', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        color = 'white' if abs(weights_grid[i,j]) > 1 else 'black'
        ax.text(j, i, f'{weights_grid[i,j]:.2f}', ha='center', va='center', fontsize=9, color=color)
ax.axis('off')

# 3. Saliency map
ax = axes[0, 2]
saliency_v = compute_saliency(model, vertical_flat).reshape(3, 3)
im = ax.imshow(saliency_v, cmap='hot', vmin=0)
ax.set_title('SALIENCY MAP:\n|Weight × Input|', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        color = 'white' if saliency_v[i,j] > saliency_v.max()/2 else 'black'
        ax.text(j, i, f'{saliency_v[i,j]:.2f}', ha='center', va='center', fontsize=9, color=color)
ax.axis('off')

# 4. Prediction result
ax = axes[0, 3]
ax.axis('off')
pred_v = model.forward(vertical_flat)
result_text = f"""PREDICTION

Raw output: {pred_v:.4f}
Confidence: {pred_v*100:.1f}%

Decision: {"VERTICAL" if pred_v >= 0.5 else "HORIZONTAL"}

Correct! ✓"""
ax.text(0.5, 0.5, result_text, fontsize=11, fontweight='bold',
       ha='center', va='center', transform=ax.transAxes,
       bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.8))

# =================
# Bottom row: Horizontal line analysis
# =================

# 1. Input image
ax = axes[1, 0]
ax.imshow(horizontal_line, cmap='Blues', vmin=0, vmax=1)
ax.set_title('INPUT:\nHorizontal Line', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        ax.text(j, i, f'{horizontal_line[i,j]:.0f}', ha='center', va='center', fontsize=12)
ax.axis('off')

# 2. Model weights (same)
ax = axes[1, 1]
im = ax.imshow(weights_grid, cmap='RdBu', vmin=-2, vmax=2)
ax.set_title('WEIGHTS:\n(Same model)', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        color = 'white' if abs(weights_grid[i,j]) > 1 else 'black'
        ax.text(j, i, f'{weights_grid[i,j]:.2f}', ha='center', va='center', fontsize=9, color=color)
ax.axis('off')

# 3. Saliency map
ax = axes[1, 2]
saliency_h = compute_saliency(model, horizontal_flat).reshape(3, 3)
im = ax.imshow(saliency_h, cmap='hot', vmin=0)
ax.set_title('SALIENCY MAP:\n|Weight × Input|', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        color = 'white' if saliency_h[i,j] > saliency_h.max()/2 else 'black'
        ax.text(j, i, f'{saliency_h[i,j]:.2f}', ha='center', va='center', fontsize=9, color=color)
ax.axis('off')

# 4. Prediction result
ax = axes[1, 3]
ax.axis('off')
pred_h = model.forward(horizontal_flat)
result_text = f"""PREDICTION

Raw output: {pred_h:.4f}
Confidence: {(1-pred_h)*100:.1f}% horizontal

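Decision: {"VERTICAL" if pred_h >= 0.5 else "HORIZONTAL"}

Correct! ✓"""
ax.text(0.5, 0.5, result_text, fontsize=11, fontweight='bold',
       ha='center', va='center', transform=ax.transAxes,
       bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.8))

plt.suptitle('THE COMMITTEE REPORT: How the Model Makes Decisions', fontsize=14, fontweight='bold', y=1.02)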
plt.tight_layout()
plt.show()
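
The saliency formula |weight × input| drops the sign of each contribution. The sketch below keeps both views side by side. It assumes the model is a single sigmoid neuron, as in the earlier parts, and uses made-up values for `w` and `b` (they are hypothetical, not the weights trained in this notebook). The helper `explain` is also illustrative only. The point is that the signed products plus the bias add up to the pre-activation `z`, while their absolute values form the saliency map.

```python
import numpy as np

def sigmoid(z):
    """Squash a pre-activation value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for illustration only; not the notebook's trained values.
w = np.array([-0.4, 0.9, -0.3,
              -0.5, 0.8, -0.6,
              -0.2, 1.1, -0.4])
b = -0.1

# Flattened 3x3 images: a middle-column line and a middle-row line.
vertical = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0], dtype=float)
horizontal = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0], dtype=float)

def explain(x):
    """Return signed contributions, saliency, pre-activation z, and output."""
    signed = w * x              # each pixel's push toward (+) or away from (-) "vertical"
    saliency = np.abs(signed)   # magnitude only, as in compute_saliency
    z = signed.sum() + b        # pre-activation of the single neuron
    return signed, saliency, z, sigmoid(z)

for name, image in [("vertical", vertical), ("horizontal", horizontal)]:
    signed, saliency, z, output = explain(image)
    print(f"{name}: z = {z:.2f}, output = {output:.3f}")
    print(saliency.reshape(3, 3))
```

With these example values the horizontal image still has non-zero saliency in the middle row, but the signed contributions mostly cancel, which is why the output stays below 0.5. Saliency magnitude alone does not say which way a pixel pushed the decision.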

cell 020

# =============================================================================
# THE "AHA!" MOMENT: Understanding What the Model Learned
# =============================================================================

print("="*70)
print("THE KEY INSIGHT: What Did the Model ACTUALLY Learn?")
print("="*70)

print("""
Looking at the visualizations above, we can see something beautiful:

FOR VERTICAL LINES:
  • The middle column (positions 1, 4, 7) has POSITIVE weights
  • When bright pixels appear in the middle column, the model says "VERTICAL!"
  • The saliency map lights up exactly where the vertical line is
  
FOR HORIZONTAL LINES:
  • The side pixels of the middle row (positions 3 and 5) have NEGATIVE or low weights
  • When bright pixels appear across a row, they don't activate the "vertical" detector
  • The output is LOW, meaning "not vertical" = "horizontal"

THE MODEL LEARNED THE RIGHT PATTERN!
═══════════════════════════════════════════════════════════════════════

Our model didn't just memorize examples. It learned a GENERAL RULE:

   "Vertical lines have bright pixels stacked in a column.
   Horizontal lines have bright pixels spread across a row."

This is exactly what we hoped it would learn!

═══════════════════════════════════════════════════════════════════════
""")

# Show the pattern it learned
print("\nVisualized Pattern Recognition:")
print("-"*50)
print("""
  VERTICAL LINE:          MODEL LOOKS AT:
  [ ] [●] [ ]             [ ] [HIGH] [ ]
  [ ] [●] [ ]     →       [ ] [HIGH] [ ]
  [ ] [●] [ ]             [ ] [HIGH] [ ]
                          (Middle column weights are positive)
  
  HORIZONTAL LINE:        MODEL LOOKS AT:
  [ ] [ ] [ ]             [ ] [ ] [ ]
  [●] [●] [●]     →       [LOW] [LOW] [LOW]
  [ ] [ ] [ ]             [ ] [ ] [ ]
                          (Row weights don't support "vertical")
""")

6.6 Train/Test Split: Why We Need Separate Data

Throughout this notebook, we've used separate training and test data. This is crucial for honest evaluation.

The Problem: Memorization vs Learning

A model could achieve 100% accuracy on training data by simply memorizing every example - like a student who memorizes test answers instead of understanding concepts.

But memorization isn't useful - we need the model to generalize to NEW data it has never seen.

Approach	Training Accuracy	Test Accuracy	What Happened?
True learning	95%	93%	Learned the general pattern
Memorization	100%	50%	Memorized training, fails on new

What IS a Train/Test Split?

We divide our data into two groups:

ALL DATA (150 samples)
    │
    ├── TRAINING SET (100 samples) ──→ Used to TRAIN the model
    │                                  Model sees these during learning
    │
    └── TEST SET (50 samples) ───────→ Used to EVALUATE the model
                                       Model NEVER sees these during training

Why This Works

Data Set	Model Sees During Training?	Purpose
Training	YES	Learn patterns
Test	NO	Evaluate generalization

The test set acts as a "final exam" - questions the model has never seen.
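
As a rough illustration of the 100/50 split described above, the sketch below shuffles the sample indices once and slices them into two groups that cannot overlap. The arrays `X_all` and `y_all` are random stand-ins invented for this example, and the `_demo` names are hypothetical; the notebook's own `X_train` and `X_test` are created elsewhere and may be built differently.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in data: 150 flattened 3x3 images with binary labels (random, for shape only).
X_all = rng.random((150, 9))
y_all = rng.integers(0, 2, size=150)

# Shuffle the indices once, then slice so that no sample lands in both sets.
indices = rng.permutation(len(X_all))
train_idx, test_idx = indices[:100], indices[100:]

X_train_demo, y_train_demo = X_all[train_idx], y_all[train_idx]
X_test_demo, y_test_demo = X_all[test_idx], y_all[test_idx]

# Sanity check: the two index sets must not share any sample.
overlap = set(train_idx.tolist()) & set(test_idx.tolist())
print(f"train: {len(X_train_demo)}, test: {len(X_test_demo)}, shared samples: {len(overlap)}")
```

Shuffling before slicing matters when the data is stored in class order; otherwise the test set could end up containing only one class.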

Committee Analogy

"It's like preparing for an exam:

Training data = study materials (examples you practice with)
Test data = the actual exam (new questions you've never seen)

If you just memorize your notes without understanding, you'll ace the practice problems but fail the exam. If you truly learned the concepts, you'll do well on both."

The Golden Rule

NEVER use test data for training!

If the model sees test data during training, it can memorize those examples too, and our evaluation becomes meaningless.

Common Split Ratios

Split	Training	Test	When to Use
80/20	80%	20%	Large datasets (>10,000 samples)
70/30	70%	30%	Medium datasets (1,000-10,000)
60/40	60%	40%	Small datasets (<1,000)

More test data gives a more reliable evaluation but leaves less data for training. Treat these ratios as rules of thumb rather than fixed requirements.

Understanding Overfitting Mathematically

What IS Overfitting?

Overfitting is when a model learns the noise in the training data, not just the signal.

Analogy: Imagine studying for an exam by memorizing the exact wording of practice questions instead of understanding the concepts. You'd ace those exact questions but fail on new ones.

How Train/Test Split Reveals Overfitting:

Scenario	Training Accuracy	Test Accuracy	What's Happening
Good learning	95%	93%	Learned the pattern!
Mild overfitting	99%	85%	Some memorization
Severe overfitting	100%	50%	Memorized everything, learned nothing

The Math:

If a model memorizes all 100 training examples, it can get 100% training accuracy
But those memorized patterns don't apply to new data
Test accuracy reveals true generalization

The Gap: $\text{Overfitting Gap} = \text{Training Accuracy} - \text{Test Accuracy}$

Gap < 5%: Great! Model generalizes well
Gap 5-15%: Some overfitting, might need more data or simpler model
Gap > 15%: Serious overfitting, model is memorizing
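
The thresholds above can be sketched as a small helper and applied to the three scenarios from the table. The function names `overfitting_gap` and `interpret_gap` are hypothetical, and the 5% and 15% cut-offs are this notebook's rules of thumb rather than universal constants.

```python
def overfitting_gap(train_accuracy, test_accuracy):
    """Gap = training accuracy minus test accuracy (both as fractions)."""
    return train_accuracy - test_accuracy

def interpret_gap(gap):
    """Map a gap to the rough categories used in this section."""
    if gap < 0.05:
        return "generalizes well"
    if gap <= 0.15:
        return "some overfitting"
    return "serious overfitting"

# (training accuracy, test accuracy) pairs taken from the scenario table.
scenarios = {
    "Good learning": (0.95, 0.93),
    "Mild overfitting": (0.99, 0.85),
    "Severe overfitting": (1.00, 0.50),
}

for name, (train_acc, test_acc) in scenarios.items():
    gap = overfitting_gap(train_acc, test_acc)
    print(f"{name}: gap = {gap:.0%} -> {interpret_gap(gap)}")
```

A negative gap (test accuracy above training accuracy) falls into the first category here; on small test sets that usually reflects sampling noise rather than a better-than-trained model.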

Why These Specific Ratios?

More Training Data	More Test Data
Model can learn more	More reliable evaluation
Better final accuracy	Smaller margin of error
Downside: less reliable evaluation	Downside: model might underfit

The sweet spot: Enough training data to learn well, enough test data to evaluate reliably. With 100 samples, 80/20 gives 80 for training (decent) and 20 for testing (acceptable). With 10,000 samples, even 90/10 gives 1,000 test samples (very reliable).
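
To put a number on "acceptable" versus "very reliable", one common approximation treats each test prediction as an independent coin flip and uses the binomial standard error $\sqrt{p(1-p)/n}$. This formula is a standard statistical approximation brought in for illustration, not something derived in this notebook, and it is only rough for very small test sets. The function name `accuracy_standard_error` is hypothetical.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Approximate standard error of an accuracy measured on n_test samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

# Same measured accuracy, different test-set sizes.
for n_test in (20, 50, 1000):
    se = accuracy_standard_error(0.93, n_test)
    # 1.96 standard errors is the usual half-width of a 95% interval.
    print(f"n_test = {n_test:>4}: 93% accuracy, roughly +/- {1.96 * se:.1%}")
```

The uncertainty shrinks with the square root of the test-set size, so going from 20 to 1,000 test samples narrows the interval by a factor of about seven.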

cell 022

# =============================================================================
# TRAIN/TEST SPLIT: Our Results
# =============================================================================

print("="*70)
print("TRAIN/TEST SPLIT: Checking for Generalization")
print("="*70)

print(f"""
OUR DATA SPLIT:
  • Training set: {len(X_train)} samples (used for learning)
  • Test set: {len(X_test)} samples (used for evaluation only)
  • Split ratio: {len(X_train)}/{len(X_train)+len(X_test)} = {len(X_train)/(len(X_train)+len(X_test))*100:.0f}% training
  
RESULTS:
  • Training accuracy: {train_accuracy:.1%}
  • Test accuracy: {test_accuracy:.1%}
  • Difference: {abs(train_accuracy - test_accuracy):.1%}
""")

# Interpret the gap
diff = train_accuracy - test_accuracy

print("-"*70)
print("INTERPRETATION:")
print("-"*70)

if diff < 0.05:
    print("""
  ✓ EXCELLENT! Training and test accuracy are very similar.
  
  This suggests the model has LEARNED the general pattern,
  not just memorized the training data.
  
  Our model generalizes well to new data!
""")
elif diff < 0.15:
    print(f"""
  ⚠ CAUTION: Training accuracy is {diff:.1%} higher than test accuracy.
  
  Some memorization may have occurred.
  The model might be slightly "overfitting" to training data.
""")
else:
    print(f"""
  ⚠ WARNING: Training accuracy is {diff:.1%} higher than test accuracy!
  
  This suggests OVERFITTING - the model memorized training data
  but doesn't generalize well to new data.
  
  Possible solutions:
    - Get more training data
    - Use regularization
    - Simplify the model
""")

Part 6 Summary: What We've Learned

Key Concepts Mastered

Concept	Definition/Formula	Why It Matters
Training vs Inference	Learning mode vs using mode	Different behaviors, same weights
Accuracy	(TP + TN) / Total	Simple overall view (but can mislead)
Confusion Matrix	TP, TN, FP, FN breakdown	Shows WHAT mistakes are made
Precision	TP / (TP + FP)	"When I say yes, am I right?"
Recall	TP / (TP + FN)	"Did I catch all the positives?"
F1 Score	2 × (P × R) / (P + R)	Balance precision and recall
Saliency	|weight × input|	What did the model look at?
Train/Test Split	Separate data for evaluation	Detect memorization vs learning

The Four Categories Explained

Category	Model Said	Truth Was	Meaning
TP (True Positive)	Vertical	Vertical	Correct detection
TN (True Negative)	Horizontal	Horizontal	Correct rejection
FP (False Positive)	Vertical	Horizontal	False alarm
FN (False Negative)	Horizontal	Vertical	Missed detection
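
The four categories and the formulas in the summary table can be tied together in a few lines. This is a minimal sketch assuming labels are encoded as 1 = vertical (positive) and 0 = horizontal (negative); the label lists and the helper names `confusion_counts` and `summarize` are made up for illustration.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = vertical, 0 = horizontal)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # correct detection
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # correct rejection
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false alarm
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # missed detection
    return tp, tn, fp, fn

def summarize(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1, guarding against division by zero."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Made-up labels: ten samples with one false alarm and one missed detection.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
accuracy, precision, recall, f1 = summarize(*counts)
print("TP, TN, FP, FN:", counts)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Returning 0.0 when a denominator is zero is one convention among several; some libraries raise a warning or return NaN instead.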

Committee Analogy Progress

Part	What Happened
Parts 1-3	Committee member learned procedures
Part 4	First case - confused, random guessing
Part 5	Learned from feedback, became expert
Part 6	Performance review: verified expertise and understood reasoning
Part 7	(Next) One expert isn't enough - building the full committee

The Big Picture

We now have a complete, evaluated model that:

Achieves high accuracy on both training and test data
Makes few mistakes (low FP and FN)
Has interpretable learned weights
Uses the RIGHT features (column patterns for vertical detection)
Generalizes well to new data

Knowledge Check

cell 024

# =============================================================================
# KNOWLEDGE CHECK - Part 6
# =============================================================================

print("KNOWLEDGE CHECK - Part 6: Evaluation")
print("="*60)
print("\nAnswer these questions to test your understanding:\n")

questions = [
    {
        "q": "1. What's the difference between training and inference mode?",
        "options": [
            "A) Training is faster than inference",
            "B) In training, weights update; in inference, weights are frozen",
            "C) Inference uses more data than training",
            "D) They're the same thing with different names"
        ],
        "answer": "B",
        "explanation": "During training, the model learns and weights change after each example. During inference, weights are frozen and we just make predictions - no learning happens."
    },
    {
        "q": "2. A model predicts 'sick' for a healthy patient. What type of error is this?",
        "options": [
            "A) True Positive (TP)",
            "B) True Negative (TN)",
            "C) False Positive (FP)",
            "D) False Negative (FN)"
        ],
        "answer": "C",
        "explanation": "False Positive: We predicted Positive (sick), but we were False (wrong) - the patient was actually healthy. This is a 'false alarm'."
    },
    {
        "q": "3. You're building a disease detection system. Missing a sick patient is VERY bad.\n   Which metric should you prioritize?",
        "options": [
            "A) Accuracy",
            "B) Precision",
            "C) Recall",
            "D) F1 Score"
        ],
        "answer": "C",
        "explanation": "Recall measures 'did we catch all the positives?' High recall means we catch most sick patients, even if we have some false alarms. When missing positives is costly, prioritize recall."
    },
    {
        "q": "4. Why do we use a separate test set?",
        "options": [
            "A) To have more data for training",
            "B) To make training faster",
            "C) To check if the model memorized vs truly learned",
            "D) It's optional and not really needed"
        ],
        "answer": "C",
        "explanation": "A model could memorize training data and fail on new data. The test set (unseen data) reveals if it truly learned the general pattern or just memorized examples."
    },
    {
        "q": "5. What does a saliency map show?",
        "options": [
            "A) The accuracy of the model over time",
            "B) Which inputs the model focused on for its decision",
            "C) The training loss curve",
            "D) How fast the model runs"
        ],
        "answer": "B",
        "explanation": "Saliency maps highlight which parts of the input were most important for the model's decision. It's a form of interpretability - understanding WHY the model made its prediction."
    }
]

for q in questions:
    print(q["q"])
    for opt in q["options"]:
        print(f"   {opt}")
    print()

print("\n" + "="*60)
print("Scroll down for answers...")
print("="*60)

cell 025

# =============================================================================
# ANSWERS - Knowledge Check Part 6
# =============================================================================

print("ANSWERS - Part 6 Knowledge Check")
print("="*60)

for i, q in enumerate(questions, 1):
    print(f"\n{i}. Answer: {q['answer']}")
    print(f"   {q['explanation']}")

print("\n" + "="*60)
print("How did you do?")
print("  5/5: Evaluation Master! Ready for Part 7!")
print("  4/5: Solid understanding - great job!")
print("  3/5: Review the sections you missed")
print("  <3:  Re-read Part 6 before continuing")
print("="*60)

What's Next?

Congratulations! You've completed Part 6!

Our single neuron is now a verified expert - we've evaluated its performance, understood its decision-making process, and confirmed it learned the RIGHT patterns.

But Here's the Thing...

A single neuron (Perceptron) can only learn linearly separable patterns: classes that can be split by a straight line (or, with more than two inputs, by a flat plane). For more complex problems, one expert isn't enough.

The Limitation of Single Neurons

Some problems are not linearly separable. The classic example is the XOR problem:

Input A	Input B	Output (XOR)
0	0	0
0	1	1
1	0	1
1	1	0

No single neuron can learn this pattern! We need multiple neurons working together.
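
A quick way to see this limitation is to try many weight and bias settings for a single threshold neuron and record the best accuracy found. The sketch below is an illustration, not a proof: it searches only a coarse grid, and the function `best_linear_accuracy` and the grid range are assumptions chosen for this example. The same search on OR, which is linearly separable, is included for contrast.

```python
import itertools
import numpy as np

# The four possible inputs and their XOR / OR targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])
y_or = np.array([0, 1, 1, 1])

def best_linear_accuracy(X, y, grid):
    """Try every (w1, w2, b) on the grid and return the best accuracy found."""
    best = 0.0
    for w1, w2, b in itertools.product(grid, repeat=3):
        # Single neuron with a hard threshold: predict 1 when w.x + b >= 0.
        predictions = (X @ np.array([w1, w2]) + b >= 0).astype(int)
        best = max(best, float(np.mean(predictions == y)))
    return best

grid = np.linspace(-2.0, 2.0, 17)  # coarse grid with step 0.25

best_xor = best_linear_accuracy(X, y_xor, grid)
best_or = best_linear_accuracy(X, y_or, grid)
print(f"Best single-neuron accuracy on XOR: {best_xor:.2f}")
print(f"Best single-neuron accuracy on OR:  {best_or:.2f}")
```

The search finds settings that classify all four OR cases correctly but never more than three of the four XOR cases, which matches the fact that no straight line separates XOR's two classes.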

Coming Up in Part 7: Hidden Layers - The Full Committee

In the next notebook, we'll explore:

Why one neuron isn't enough - The XOR problem demonstration
Hidden layers - Adding more neurons between input and output
The full committee - Multiple experts with different perspectives
Universal approximation - Why deep networks can learn (almost) anything

Continue to Part 7: part_7_hidden_layers.ipynb

"One expert is good. A committee of experts is powerful."

The Brain's Decision Committee - From Expert to Team