AgenticWorks

A community for developers awakening to agentic AI. Hands-on lessons, enterprise-grade context engineering, and a forum that earns its quiet.

© 2026 AgenticWorks. Built in public.

Track 1 · ML foundations

Brain's Decision Committee
  1. The first neuron
  2. A single neuron
  3. Activation functions
  4. The perceptron
  5. Training
  6. Evaluation
  7. Hidden layers
  8. Deep learning challenges
  9. Full implementation
  10. What's next

Metrics · Part 6 · 45 min · intermediate

The trained expert

Separate training from inference, compute evaluation metrics, and inspect what the model learned.


Neural Network Fundamentals

Part 6: Evaluation - The Trained Expert

The Brain's Decision Committee - Chapter 6


The Story So Far...

In Part 5, something remarkable happened: our committee member learned. Starting with random weights and ~50% accuracy, they adjusted their priorities through gradient descent until they became an expert vertical line detector with 95%+ accuracy.

But how do we know they're actually good? Getting 95% on training data is one thing, but:

  • What kinds of mistakes do they still make?
  • Are some errors worse than others?
  • Can we understand why they make the decisions they do?

This is evaluation - properly assessing our trained model and understanding what it has learned.


What You'll Learn in Part 6

By the end of this notebook, you will understand:

  1. Training vs Inference - The difference between learning mode and using mode
  2. Accuracy - The simplest metric (and its limitations)
  3. Confusion Matrix - A detailed breakdown of all prediction types
  4. Precision & Recall - Measuring different kinds of correctness
  5. F1 Score - Balancing precision and recall
  6. Saliency/Interpretability - What did the model actually learn?
  7. Test Sets - Why we need data the model has never seen

Prerequisites

Make sure you've completed:

  • Parts 0-1: Matrices (neural_network_fundamentals.ipynb)
  • Part 2: Single Neuron (part_2_single_neuron.ipynb)
  • Part 3: Activation Functions (part_3_activation_functions.ipynb)
  • Part 4: The Perceptron (part_4_perceptron.ipynb)
  • Part 5: Training (part_5_training.ipynb)

Setup: Import Dependencies and Recreate Our Trained Model

Let's bring in everything we need and train a model to evaluate.

cell 003
# =============================================================================
# PART 6: EVALUATION - SETUP AND IMPORTS
# =============================================================================

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, clear_output

# Try to import ipywidgets for interactive features
try:
    import ipywidgets as widgets
    WIDGETS_AVAILABLE = True
except ImportError:
    WIDGETS_AVAILABLE = False
    print("Note: ipywidgets not installed. Interactive features will be limited.")

# Set up matplotlib style
style_options = ['seaborn-v0_8-whitegrid', 'seaborn-whitegrid', 'ggplot', 'default']
for style in style_options:
    try:
        plt.style.use(style)
        break
    except OSError:
        continue

plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 12
np.random.seed(42)

print("Setup complete!")
print("="*60)

cell 004
# =============================================================================
# RECREATE OUR TOOLS FROM PREVIOUS NOTEBOOKS
# =============================================================================

# -----------------------------------------------------------------------------
# Our canonical line images (from Part 1)
# -----------------------------------------------------------------------------
vertical_line = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
horizontal_line = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]])
vertical_flat = vertical_line.flatten()
horizontal_flat = horizontal_line.flatten()

# -----------------------------------------------------------------------------
# Dataset generator (from Part 4)
# -----------------------------------------------------------------------------
def generate_line_dataset(n_samples=100, noise_level=0.0, seed=None):
    """Generate vertical (label=1) and horizontal (label=0) line images."""
    if seed is not None:
        np.random.seed(seed)

    X, y = [], []

    for i in range(n_samples):
        image = np.zeros((3, 3))

        if i < n_samples // 2:  # Vertical lines
            col = np.random.randint(0, 3)
            image[:, col] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(1)
        else:  # Horizontal lines
            row = np.random.randint(0, 3)
            image[row, :] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(0)

    X, y = np.array(X), np.array(y)
    shuffle_idx = np.random.permutation(n_samples)
    return X[shuffle_idx], y[shuffle_idx]

# -----------------------------------------------------------------------------
# Sigmoid activation function (from Part 3)
# -----------------------------------------------------------------------------
def sigmoid(z):
    """Sigmoid activation: maps any value to range (0, 1)."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# -----------------------------------------------------------------------------
# TrainablePerceptron class (from Part 5)
# -----------------------------------------------------------------------------
class TrainablePerceptron:
    """A Perceptron that can learn from examples."""

    def __init__(self, n_inputs):
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0
        self.n_inputs = n_inputs
        self.loss_history = []
        self.accuracy_history = []
        self.is_trained = False  # Track if model has been trained

    def forward(self, x):
        x = np.array(x).flatten()
        z = np.dot(self.weights, x) + self.bias
        return sigmoid(z)

    def predict(self, x):
        return 1 if self.forward(x) >= 0.5 else 0

    def compute_loss(self, y_true, y_pred):
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    def train(self, X, y, learning_rate=0.1, epochs=100, verbose=True):
        self.loss_history = []
        self.accuracy_history = []

        for epoch in range(epochs):
            total_loss = 0
            correct = 0

            for i in range(len(X)):
                xi, yi = X[i], y[i]
                y_pred = self.forward(xi)
                loss = self.compute_loss(yi, y_pred)
                total_loss += loss

                if (y_pred >= 0.5 and yi == 1) or (y_pred < 0.5 and yi == 0):
                    correct += 1

                error = y_pred - yi
                self.weights = self.weights - learning_rate * error * xi
                self.bias = self.bias - learning_rate * error

            avg_loss = total_loss / len(X)
            accuracy = correct / len(X)
            self.loss_history.append(avg_loss)
            self.accuracy_history.append(accuracy)

            if verbose and (epoch + 1) % 10 == 0:
                print(f"  Epoch {epoch+1:3d}/{epochs}: Loss = {avg_loss:.4f}, Accuracy = {accuracy*100:.1f}%")

        self.is_trained = True

        if verbose:
            print(f"\nTraining complete! Final accuracy: {self.accuracy_history[-1]*100:.1f}%")

        return self.loss_history

print("Tools recreated from previous notebooks!")
print("  - Line image templates")
print("  - Dataset generator")
print("  - Sigmoid activation")
print("  - TrainablePerceptron class")

cell 005
# =============================================================================
# TRAIN OUR MODEL (Quick recap from Part 5)
# =============================================================================

print("="*70)
print("TRAINING OUR MODEL (to have something to evaluate)")
print("="*70)

# Generate training data
np.random.seed(42)
X_train, y_train = generate_line_dataset(n_samples=100, noise_level=0.0, seed=42)

# Generate TEST data (NEW! - a separate set that is never used for training)
# Note: with noise_level=0.0 only six distinct images exist (3 vertical,
# 3 horizontal), so the test images repeat patterns from the training set.
# Pass noise_level > 0 to get test images the model has truly never seen.
X_test, y_test = generate_line_dataset(n_samples=50, noise_level=0.0, seed=999)

print(f"\nTraining set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples (held out - never used for training)")

# Create and train model
model = TrainablePerceptron(n_inputs=9)
print("\nTraining...")
model.train(X_train, y_train, learning_rate=0.5, epochs=50, verbose=True)

print("\n" + "="*70)
print("Model is trained and ready for evaluation!")
print("="*70)

6.1 Training vs Inference: The Committee's Memory

Before we evaluate, let's understand an important distinction: training mode vs inference mode.

What IS Inference?

The word "inference" comes from Latin inferre meaning "to bring in" or "to conclude." In machine learning:

Inference = Using a trained model to make predictions on new data

Think of it like this:

  • Training = Teaching someone how to do a job
  • Inference = That person actually doing the job

Why Two Different Modes?

| Aspect | Training Mode | Inference Mode |
| --- | --- | --- |
| Purpose | Learn from examples | Make predictions |
| Weights | Being updated constantly | Frozen (fixed) |
| Data | Training set (with labels) | New, unseen data |
| Speed | Slower (computing gradients) | Fast (forward pass only) |
| Goal | Minimize loss | Predict accurately |

Committee Analogy

"During training, the committee is in a meeting room, debating cases, learning from mistakes, and updating their rulebook. Once trained, they compile their final rulebook and hand it to the front desk. The front desk uses this rulebook to make quick decisions without calling the committee for every case."

  • Training: The committee meeting (slow, learning, updating)
  • Inference: The front desk using the final rulebook (fast, fixed, no learning)

Why Does This Distinction Matter?

| Scenario | Why It Matters |
| --- | --- |
| Deployment | In production, you use inference mode for speed |
| Evaluation | We evaluate in inference mode (weights must be fixed!) |
| Consistency | Same weights give same predictions every time |
| Resources | Inference uses less memory (no gradients stored) |

The Key Insight

During inference, the model does NOT learn anything new. The weights are "frozen" - they don't change. This is essential because:

  1. Reproducibility: Same input always gives same output
  2. Speed: No gradient computation needed
  3. Fairness: Test data doesn't influence the model

Why "Frozen" Weights Matter Mathematically

During training, after each prediction, we do:

weights = weights - learning_rate × gradient

During inference, we SKIP this step entirely. The weights stay exactly as they were after training finished.

Why does this matter?

| If we kept updating during inference... | Consequence |
| --- | --- |
| Weights would change with each new input | Same input could give different outputs! |
| Model would "drift" over time | Yesterday's predictions wouldn't match today's |
| Hard to reproduce results | "But it worked yesterday!" |
| Unfair for test evaluation | Test data would influence the model |

The mathematical guarantee: With frozen weights, $f(x) = \sigma(w \cdot x + b)$ is a deterministic function - same input ALWAYS gives same output.
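To make the guarantee concrete, here is a minimal sketch. The weights, bias, input, and learning rate are made-up values chosen only for illustration; they are not taken from the trained model in this notebook. It calls the same forward function twice with frozen parameters, then shows what a single training-style update in between would do to the output.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps any value to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, b, x):
    """Inference: a pure function of the frozen parameters and the input."""
    return sigmoid(np.dot(w, x) + b)

# Hypothetical parameters and input, chosen only for illustration.
w = np.array([0.5, -0.25, 1.0])
b = 0.1
x = np.array([1.0, 0.0, 1.0])

# Frozen weights: repeated calls on the same input agree exactly.
first = forward(w, b, x)
second = forward(w, b, x)
same_output = bool(first == second)

# If we (wrongly) applied one training-style update between calls,
# the same input would produce a different output.
learning_rate = 0.5
y_true = 0.0
error = first - y_true
w_updated = w - learning_rate * error * x
b_updated = b - learning_rate * error
drifted = forward(w_updated, b_updated, x)

print(f"frozen:  {first:.4f} then {second:.4f} (identical: {same_output})")
print(f"updated: {drifted:.4f} (output changed: {bool(drifted != first)})")
```

The update rule used here mirrors the one in the TrainablePerceptron class; skipping it is all that "inference mode" means for a model this simple.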

In Code

cell 007
# =============================================================================
# TRAINING VS INFERENCE: Demonstration
# =============================================================================

print("="*70)
print("TRAINING vs INFERENCE MODE")
print("="*70)

# Show the model's state
print(f"\nModel state: {'TRAINED' if model.is_trained else 'UNTRAINED'}")

# In training mode, weights change after each sample
print("\n" + "-"*70)
print("DURING TRAINING (weights change):")
print("-"*70)
print("  For each sample:")
print("    1. Forward pass → get prediction")
print("    2. Compute loss → how wrong?")
print("    3. Compute gradients → which direction?")
print("    4. Update weights → improve! (weights CHANGE)")

# In inference mode, weights are frozen
print("\n" + "-"*70)
print("DURING INFERENCE (weights frozen):")
print("-"*70)
print("  For each sample:")
print("    1. Forward pass → get prediction")
print("    2. Done! (NO weight updates)")

# Demonstrate inference
print("\n" + "-"*70)
print("INFERENCE EXAMPLE:")
print("-"*70)

# Save weights before
weights_before = model.weights.copy()

# Make predictions (inference)
pred_v = model.forward(vertical_flat)
pred_h = model.forward(horizontal_flat)

# Check weights after
weights_after = model.weights.copy()

print(f"\n  Vertical line:   {pred_v:.4f} ({pred_v*100:.1f}% confident it's vertical)")
print(f"  Horizontal line: {pred_h:.4f} ({pred_h*100:.1f}% confident it's vertical)")
print(f"\n  Weights changed? {not np.allclose(weights_before, weights_after)}")
print(f"  (In inference mode, weights stay fixed!)")

6.2 Accuracy: The Simplest Metric

We've been using accuracy throughout our notebooks, but let's formally define it and understand its limitations.

What IS Accuracy?

Accuracy answers the question: "Of all the predictions I made, what fraction was correct?"

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$

Breaking Down the Formula

Let's understand each part:

| Component | What It Means | Our Example |
| --- | --- | --- |
| Correct Predictions | Cases where prediction matches truth | Said "vertical" for vertical, "horizontal" for horizontal |
| Total Predictions | All cases we predicted on | All 50 test images |
| Accuracy | The ratio (0 to 1, or 0% to 100%) | 48/50 = 0.96 = 96% |

Computing Accuracy Step by Step

Step 1: Make predictions on all samples
Step 2: Compare each prediction to the true label
Step 3: Count how many match (correct)
Step 4: Divide by total number of predictions
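The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration on made-up labels; the helper name simple_accuracy and the example arrays are assumptions for this sketch, not part of the notebook's own code.

```python
import numpy as np

def simple_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    matches = (y_pred == y_true)       # Step 2: compare each prediction to the truth
    n_correct = int(matches.sum())     # Step 3: count how many match
    return n_correct / len(y_true)     # Step 4: divide by the total

# Step 1 (making predictions) is assumed to have happened already.
# These labels are invented for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 1, 1, 0])

acc = simple_accuracy(y_true, y_pred)
print(f"Accuracy: {acc:.2f}")  # 8 of the 10 predictions match
```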

Why Accuracy Can Be Misleading

Accuracy has a hidden flaw: it treats all mistakes equally and ignores class imbalance.

Example - Fraud Detection:

Suppose 99% of transactions are legitimate, 1% are fraud.

| Model Strategy | Accuracy | Is It Good? |
| --- | --- | --- |
| Say "legitimate" for EVERYTHING | 99% | NO! Catches 0% of fraud! |
| Actually detect fraud | 97% | YES! Even though lower accuracy |

The "dumb" model gets 99% accuracy by ignoring the problem entirely!

Example - Medical Diagnosis:

| Scenario | Type of Error | Consequence |
| --- | --- | --- |
| Say "healthy" when patient is sick | Miss a disease | Patient doesn't get treatment! (VERY bad) |
| Say "sick" when patient is healthy | False alarm | Unnecessary tests (annoying but not dangerous) |

Both are "wrong" but one is much worse! Accuracy treats them the same.

When Accuracy Works Well

Accuracy is a good metric when:

  1. Classes are balanced (roughly 50/50 split)
  2. All mistakes have equal cost
  3. You want a quick overall view

Our V/H classifier is a good case for accuracy: balanced classes, equal mistake costs.

Understanding Why Class Imbalance Breaks Accuracy

Let's do the math to see WHY accuracy is misleading with imbalanced data:

Scenario: Fraud Detection (1% fraud, 99% legitimate)

| Strategy | Fraud Caught | Accuracy Calculation |
| --- | --- | --- |
| "Always say legitimate" | 0 of 100 frauds | (0 + 9900) / 10000 = 99% |
| Good detector | 90 of 100 frauds | (90 + 9800) / 10000 = 98.9% |

The "dumb" strategy has HIGHER accuracy but catches ZERO fraud!

Why this happens mathematically:

$$\text{Accuracy} = \frac{TP + TN}{\text{Total}}$$

When 99% of data is class 0, you can get 99% accuracy by predicting 0 for everything (TN = 9900, FN = 100, TP = FP = 0).

The lesson: When classes are imbalanced, accuracy is dominated by the majority class. We need metrics that focus on the minority class (precision, recall).
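The arithmetic in the fraud table can be checked with a short sketch. The counts below follow from the scenario above (10,000 transactions, 100 of them fraud; the good detector's 9,800 true negatives imply 100 false positives). The helper name summarize is an assumption made for this example.

```python
def summarize(tp, tn, fp, fn):
    """Return (accuracy, recall) for a set of confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, recall

# "Always say legitimate": every fraud case is missed.
always_legit = summarize(tp=0, tn=9900, fp=0, fn=100)

# Good detector: catches 90 of 100 frauds at the cost of 100 false alarms.
good_detector = summarize(tp=90, tn=9800, fp=100, fn=10)

for name, (acc, rec) in [("Always legitimate", always_legit),
                         ("Good detector", good_detector)]:
    print(f"{name:18s} accuracy = {acc:.1%}, recall = {rec:.1%}")
```

Accuracy differs by only a tenth of a percentage point between the two strategies, while recall separates them completely.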

Let's Calculate Accuracy Properly

cell 009
# =============================================================================
# ACCURACY: Step-by-Step Calculation
# =============================================================================

print("="*70)
print("CALCULATING ACCURACY: Step by Step")
print("="*70)

def calculate_accuracy(model, X, y, verbose=True):
    """
    Calculate accuracy of model on given data.

    Parameters:
        model: Trained model with predict() method
        X: Input data (n_samples, n_features)
        y: True labels (n_samples,)
        verbose: Whether to print details

    Returns:
        accuracy: Float between 0 and 1
        predictions: Array of predicted labels
    """
    predictions = []
    correct = 0

    for i in range(len(X)):
        pred = model.predict(X[i])
        predictions.append(pred)
        if pred == y[i]:
            correct += 1

    accuracy = correct / len(y)

    if verbose:
        print(f"\n  Total samples: {len(y)}")
        print(f"  Correct: {correct}")
        print(f"  Wrong: {len(y) - correct}")
        print(f"  Accuracy: {correct}/{len(y)} = {accuracy:.4f} = {accuracy*100:.1f}%")

    return accuracy, np.array(predictions)

# Calculate accuracy on TRAINING data
print("\n" + "-"*70)
print("TRAINING SET ACCURACY:")
print("-"*70)
train_accuracy, train_preds = calculate_accuracy(model, X_train, y_train)

# Calculate accuracy on TEST data (NEW!)
print("\n" + "-"*70)
print("TEST SET ACCURACY:")
print("-"*70)
test_accuracy, test_preds = calculate_accuracy(model, X_test, y_test)

print("\n" + "="*70)
print("KEY INSIGHT: Training vs Test Accuracy")
print("="*70)
print(f"""
Training accuracy: {train_accuracy*100:.1f}%
Test accuracy:     {test_accuracy*100:.1f}%

The TEST accuracy is what really matters!
Training accuracy can be misleadingly high if the model "memorizes" the data.
Test accuracy shows how well the model generalizes to NEW data.
""")

6.3 The Confusion Matrix: A Detailed Report Card

Accuracy gives us one number. But what if we want to understand WHICH mistakes the model makes?

What IS a Confusion Matrix?

A confusion matrix is a table that breaks down all predictions into four categories based on two questions:

  1. What did we predict?
  2. What was the actual truth?
                      PREDICTED
                    0        1
              ┌─────────┬─────────┐
        0     │   TN    │   FP    │
   ACTUAL     ├─────────┼─────────┤
        1     │   FN    │   TP    │
              └─────────┴─────────┘

Why "Confusion" Matrix?

The name comes from the fact that it shows how the model gets "confused" - where it mixes up one class for another.

Understanding the Four Categories

| Abbrev | Full Name | Meaning | Our Example |
| --- | --- | --- | --- |
| TP | True Positive | Predicted 1, was actually 1 | Said "vertical", WAS vertical ✓ |
| TN | True Negative | Predicted 0, was actually 0 | Said "horizontal", WAS horizontal ✓ |
| FP | False Positive | Predicted 1, was actually 0 | Said "vertical", was horizontal ✗ |
| FN | False Negative | Predicted 0, was actually 1 | Said "horizontal", was vertical ✗ |

Memory Trick for TP/TN/FP/FN

Think of it as TWO questions:

  1. True/False: Was the prediction correct?

    • True = correct
    • False = wrong
  2. Positive/Negative: What did we predict?

    • Positive = predicted class 1 (vertical)
    • Negative = predicted class 0 (horizontal)

So:

  • True Positive = We were True (correct) when we predicted Positive (vertical)
  • False Positive = We were False (wrong) when we predicted Positive (vertical)
  • True Negative = We were True (correct) when we predicted Negative (horizontal)
  • False Negative = We were False (wrong) when we predicted Negative (horizontal)
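The two-question memory trick translates directly into code. The sketch below is only an illustration; the function name outcome and the example pairs are assumptions, not part of the notebook.

```python
def outcome(y_true, y_pred):
    """Name a single prediction using the two-question trick."""
    # Question 1: was the prediction correct? -> True / False
    correctness = "T" if y_pred == y_true else "F"
    # Question 2: what did we predict? -> Positive (1) / Negative (0)
    predicted = "P" if y_pred == 1 else "N"
    return correctness + predicted

# (actual, predicted) pairs covering all four cases
pairs = [(1, 1), (0, 0), (0, 1), (1, 0)]
labels = [outcome(actual, predicted) for actual, predicted in pairs]

for (actual, predicted), label in zip(pairs, labels):
    print(f"actual={actual}, predicted={predicted} -> {label}")
```

Note that the second letter always describes the prediction, never the truth: a False Negative is a case we called negative that was actually positive.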

Committee Analogy

"The confusion matrix is like a detailed performance review for our committee member:

  • TP: Cases they correctly identified as vertical
  • TN: Cases they correctly identified as NOT vertical
  • FP: Cases they wrongly called vertical (a false alarm!)
  • FN: Cases they missed (should have said vertical but didn't)"

Alternative Names You'll See

| Our Term | Also Called | When Used |
| --- | --- | --- |
| False Positive | Type I Error | Statistics |
| False Negative | Type II Error | Statistics |
| True Positive Rate | Sensitivity, Recall | Medical |
| True Negative Rate | Specificity | Medical |
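The two rates in the table can be computed straight from the four confusion-matrix counts. This is a minimal sketch with invented counts; the function name is an assumption for illustration.

```python
def sensitivity_specificity(tp, tn, fp, fn):
    """Return (true positive rate, true negative rate)."""
    # Sensitivity (recall): share of actual positives we caught
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # Specificity: share of actual negatives we correctly cleared
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0.0
    return sensitivity, specificity

# Made-up counts: 50 actual positives and 50 actual negatives
sens, spec = sensitivity_specificity(tp=45, tn=40, fp=10, fn=5)
print(f"Sensitivity (TPR): {sens:.0%}")
print(f"Specificity (TNR): {spec:.0%}")
```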

Real-World Examples of Each Error Type

Understanding these errors is easier with concrete examples:

| Outcome | Medical Example | Email Example | Self-Driving Car |
| --- | --- | --- | --- |
| TP | Correctly diagnose sick patient | Correctly mark spam | Correctly detect pedestrian |
| TN | Correctly clear healthy patient | Correctly allow good email | Correctly recognize a clear road |
| FP | Diagnose healthy as sick | Mark good email as spam | Brake for nothing (annoying) |
| FN | Miss a sick patient | Allow spam through | Miss a pedestrian (FATAL!) |

Notice: The consequences of FP vs FN are very different depending on the application!

  • Medical: FN is worse (missed diagnosis can be fatal)
  • Spam filter: FP is worse (losing important emails)
  • Self-driving: FN is MUCH worse (hitting someone)

This is why we have precision and recall - to measure these separately.
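One way to see why the two error types need separate treatment is to attach a cost to each. The sketch below is purely illustrative: the cost values and the two models' error counts are hypothetical numbers invented for this example, not measurements.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a model's errors given a price per error type."""
    return fp * cost_fp + fn * cost_fn

# Two hypothetical models with the SAME number of errors (12 each),
# so accuracy cannot tell them apart.
model_a = {"fp": 10, "fn": 2}   # many false alarms, few misses
model_b = {"fp": 2, "fn": 10}   # few false alarms, many misses

# Hypothetical costs: in a medical setting a miss is far worse than a false alarm...
medical_a = total_cost(**model_a, cost_fp=1, cost_fn=50)
medical_b = total_cost(**model_b, cost_fp=1, cost_fn=50)

# ...while in a spam filter a false alarm (lost email) is worse than a miss.
spam_a = total_cost(**model_a, cost_fp=20, cost_fn=1)
spam_b = total_cost(**model_b, cost_fp=20, cost_fn=1)

print(f"Medical: model A costs {medical_a}, model B costs {medical_b}")
print(f"Spam:    model A costs {spam_a}, model B costs {spam_b}")
```

With these made-up costs, model A is the better choice for the medical case and model B for the spam filter, even though both make 12 mistakes.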

Let's Build a Confusion Matrix

cell 011
# =============================================================================
# CONFUSION MATRIX: Implementation and Explanation
# =============================================================================

def confusion_matrix(y_true, y_pred):
    """
    Compute the confusion matrix.

    The logic behind each calculation:
    - TP: prediction=1 AND truth=1 (both conditions true)
    - TN: prediction=0 AND truth=0 (both conditions true)
    - FP: prediction=1 AND truth=0 (predicted positive, was negative)
    - FN: prediction=0 AND truth=1 (predicted negative, was positive)

    Parameters:
        y_true: Array of true labels (0 or 1)
        y_pred: Array of predicted labels (0 or 1)

    Returns:
        dict with TP, TN, FP, FN counts
    """
    # True Positive: We said 1, it was 1
    TP = np.sum((y_pred == 1) & (y_true == 1))

    # True Negative: We said 0, it was 0
    TN = np.sum((y_pred == 0) & (y_true == 0))

    # False Positive: We said 1, but it was 0 (false alarm!)
    FP = np.sum((y_pred == 1) & (y_true == 0))

    # False Negative: We said 0, but it was 1 (missed it!)
    FN = np.sum((y_pred == 0) & (y_true == 1))

    return {'TP': TP, 'TN': TN, 'FP': FP, 'FN': FN}

print("="*70)
print("CONFUSION MATRIX: Step by Step")
print("="*70)

# Calculate confusion matrix for test set
cm = confusion_matrix(y_test, test_preds)

print("\nFor our TEST set:")
print(f"  Total samples: {len(y_test)}")
print(f"  Vertical lines (label=1): {np.sum(y_test == 1)}")
print(f"  Horizontal lines (label=0): {np.sum(y_test == 0)}")

print("\n" + "-"*70)
print("CONFUSION MATRIX BREAKDOWN:")
print("-"*70)

print(f"""
                       PREDICTED
                   Horizontal(0)  Vertical(1)
              ┌─────────────────┬─────────────────┐
   Horiz.(0)  │  TN = {cm['TN']:3d}       │  FP = {cm['FP']:3d}       │
   ACTUAL     ├─────────────────┼─────────────────┤
   Vert.(1)   │  FN = {cm['FN']:3d}       │  TP = {cm['TP']:3d}       │
              └─────────────────┴─────────────────┘
""")

print("Interpretation (reading the matrix):")
print(f"  ✓ True Positives (TP = {cm['TP']}): Correctly identified as VERTICAL")
print(f"  ✓ True Negatives (TN = {cm['TN']}): Correctly identified as HORIZONTAL")
print(f"  ✗ False Positives (FP = {cm['FP']}): Wrongly called VERTICAL (was horizontal)")
print(f"  ✗ False Negatives (FN = {cm['FN']}): Wrongly called HORIZONTAL (was vertical)")

# Verify: TP + TN + FP + FN should equal total samples
total = cm['TP'] + cm['TN'] + cm['FP'] + cm['FN']
print(f"\n  Verification: TP + TN + FP + FN = {total} (should equal {len(y_test)}) ✓")

# Show how accuracy relates to confusion matrix
print("\n" + "-"*70)
print("ACCURACY FROM CONFUSION MATRIX:")
print("-"*70)
print(f"""
  Accuracy = (TP + TN) / (TP + TN + FP + FN)
           = ({cm['TP']} + {cm['TN']}) / ({cm['TP']} + {cm['TN']} + {cm['FP']} + {cm['FN']})
           = {cm['TP'] + cm['TN']} / {total}
           = {(cm['TP'] + cm['TN']) / total:.4f}
           = {(cm['TP'] + cm['TN']) / total * 100:.1f}%
""")

cell 012
# =============================================================================
# VISUALIZE THE CONFUSION MATRIX
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Confusion Matrix as heatmap
ax1 = axes[0]
cm_matrix = np.array([[cm['TN'], cm['FP']],
                      [cm['FN'], cm['TP']]])

im = ax1.imshow(cm_matrix, cmap='Blues')
ax1.set_xticks([0, 1])
ax1.set_yticks([0, 1])
ax1.set_xticklabels(['Horizontal (0)', 'Vertical (1)'])
ax1.set_yticklabels(['Horizontal (0)', 'Vertical (1)'])
ax1.set_xlabel('Predicted Label', fontsize=12)
ax1.set_ylabel('Actual Label', fontsize=12)
ax1.set_title('Confusion Matrix', fontsize=14, fontweight='bold')

# Add text annotations
labels = [['TN', 'FP'], ['FN', 'TP']]
for i in range(2):
    for j in range(2):
        text_color = 'white' if cm_matrix[i, j] > cm_matrix.max()/2 else 'black'
        ax1.text(j, i, f'{labels[i][j]}\n{cm_matrix[i, j]}',
                 ha='center', va='center', fontsize=14, fontweight='bold', color=text_color)

plt.colorbar(im, ax=ax1)

# Plot 2: Visual explanation
ax2 = axes[1]
ax2.axis('off')

explanation_text = f"""
READING THE CONFUSION MATRIX
{'='*45}

The DIAGONAL (top-left to bottom-right) shows CORRECT predictions:
  • TN ({cm['TN']}): Horizontal predicted as Horizontal ✓
  • TP ({cm['TP']}): Vertical predicted as Vertical ✓

The OFF-DIAGONAL shows ERRORS:
  • FP ({cm['FP']}): Horizontal wrongly called Vertical ✗
  • FN ({cm['FN']}): Vertical wrongly called Horizontal ✗

A PERFECT model has:
  • All values on the diagonal
  • Zeros everywhere else
"""

ax2.text(0.1, 0.5, explanation_text, fontsize=11, family='monospace',
         verticalalignment='center', transform=ax2.transAxes,
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.show()

6.4 Precision, Recall, and F1 Score

The confusion matrix gives us four numbers. From these, we can calculate more specific metrics that answer different questions.

Precision: "When I Say Positive, Am I Right?"

Precision answers: "Of all the times I predicted 'positive' (vertical), how many were actually positive?"

$$\text{Precision} = \frac{TP}{TP + FP}$$

Breaking it down:

  • Numerator (TP): Cases we correctly called positive
  • Denominator (TP + FP): ALL cases we called positive (right or wrong)

High precision means: When we say "vertical", we're usually right. Few false alarms.

When to prioritize precision:

  • Spam filters (don't delete legitimate emails!)
  • Recommender systems (don't recommend things users hate!)
  • Any case where false alarms are costly

Recall: "Did I Catch All the Positives?"

Recall (also called Sensitivity) answers: "Of all the actual positives, how many did I catch?"

$$\text{Recall} = \frac{TP}{TP + FN}$$

Breaking it down:

  • Numerator (TP): Cases we correctly caught
  • Denominator (TP + FN): ALL actual positives (caught or missed)

High recall means: We catch most of the actual vertical lines. Few misses.

When to prioritize recall:

  • Disease detection (don't miss sick patients!)
  • Fraud detection (don't miss fraudulent transactions!)
  • Any case where missing positives is costly

The Precision-Recall Trade-off

Here's the fundamental tension:

| Strategy | Precision | Recall | Problem |
| --- | --- | --- | --- |
| "Only say vertical when 100% sure" | HIGH (few false alarms) | LOW (miss many) | Miss too many positives |
| "Say vertical for anything remotely vertical" | LOW (many false alarms) | HIGH (catch most) | Too many false alarms |

You often can't maximize both! This is called the precision-recall trade-off.

Concrete Example: Airport Security

Imagine a security scanner detecting threats:

| Setting | Precision | Recall | Outcome |
| --- | --- | --- | --- |
| Super sensitive | 10% | 99% | Catches nearly all threats, but 90% of flagged "threats" are false alarms. Massive delays! |
| Super strict | 95% | 20% | Few false alarms but misses 80% of real threats. Dangerous! |
| Balanced | 70% | 70% | Some false alarms, catches most threats. Practical! |

Why the trade-off exists:

When we lower the threshold for saying "positive":

  • We catch MORE true positives (recall goes UP ↑)
  • But we also catch MORE false positives (precision usually goes DOWN ↓)

When we raise the threshold:

  • We have FEWER false positives (precision usually goes UP ↑)
  • But we miss MORE true positives (recall goes DOWN ↓)

There's no free lunch! The art is finding the right balance for your specific application.
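The effect of moving the threshold can be seen on a tiny hand-made example. The scores and labels below are invented so that the two classes overlap; they do not come from the trained model in this notebook, and the helper name precision_recall_at is an assumption for this sketch.

```python
import numpy as np

# Made-up model outputs (probability of "positive") and true labels.
# The classes overlap on purpose so that no threshold is perfect.
scores = np.array([0.95, 0.85, 0.70, 0.60, 0.55, 0.45, 0.40, 0.30, 0.20, 0.05])
y_true = np.array([1,    1,    1,    0,    1,    0,    1,    0,    0,    0])

def precision_recall_at(threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    preds = (scores >= threshold).astype(int)
    tp = int(np.sum((preds == 1) & (y_true == 1)))
    fp = int(np.sum((preds == 1) & (y_true == 0)))
    fn = int(np.sum((preds == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

results = {}
for threshold in (0.25, 0.5, 0.75):
    results[threshold] = precision_recall_at(threshold)
    precision, recall = results[threshold]
    print(f"threshold={threshold:.2f}: precision={precision:.1%}, recall={recall:.1%}")
```

On this data, raising the threshold from 0.25 to 0.75 lifts precision from 62.5% to 100% while recall falls from 100% to 40%.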

F1 Score: Finding the Balance

The F1 Score is the harmonic mean of precision and recall - a single number that balances both:

$$F1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

What IS a Harmonic Mean and Why Use It?

You might wonder: "Why not just use a regular average (arithmetic mean)?"

Three Types of Means:

| Mean Type | Formula | Example: (99%, 10%) |
| --- | --- | --- |
| Arithmetic | (a + b) / 2 | (99 + 10) / 2 = 54.5% |
| Geometric | √(a × b) | √(99 × 10) = 31.5% |
| Harmonic | 2ab / (a + b) | 2×99×10 / (99+10) = 18.2% |

Why harmonic mean is better for F1:

The harmonic mean is punishing when values are imbalanced. If you have 99% precision but only 10% recall:

  • Arithmetic mean says "54.5% - not bad!"
  • Harmonic mean says "18.2% - this is terrible!"

The harmonic mean forces BOTH values to be reasonably high to get a good score.
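The three means are easy to compare in code. This is a small sketch using only the standard library; the function names are chosen for this example.

```python
import math

def arithmetic_mean(a, b):
    return (a + b) / 2

def geometric_mean(a, b):
    return math.sqrt(a * b)

def harmonic_mean(a, b):
    # Guard against division by zero when both values are 0
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

# 99% precision with only 10% recall
p, r = 0.99, 0.10

print(f"Arithmetic: {arithmetic_mean(p, r):.1%}")
print(f"Geometric:  {geometric_mean(p, r):.1%}")
print(f"Harmonic:   {harmonic_mean(p, r):.1%}")
```

The harmonic mean is always the smallest of the three for unequal inputs, which is why it is the strictest choice for combining precision and recall.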

Intuition: Think about speed. If you drive 60 mph for the first half of a trip's distance and 20 mph for the second half, your average speed isn't 40 mph - it's 30 mph (the harmonic mean). The slow part dominates because you spend more time in it.
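The speed example can be worked through explicitly. The derivation below assumes that "half a trip" means half the distance, with each half having length $d$ (a symbol introduced here for illustration).

```latex
\begin{aligned}
t_{\text{total}} &= \frac{d}{60} + \frac{d}{20}
                  = \frac{d}{60} + \frac{3d}{60}
                  = \frac{d}{15} \\[4pt]
v_{\text{avg}}   &= \frac{2d}{t_{\text{total}}}
                  = \frac{2d}{d/15}
                  = 30 \text{ mph} \\[4pt]
\text{Harmonic mean of } 60 \text{ and } 20
                 &= \frac{2 \cdot 60 \cdot 20}{60 + 20}
                  = \frac{2400}{80}
                  = 30
\end{aligned}
```

Three quarters of the travel time is spent on the slow half, which is why the average lands well below the arithmetic mean of 40 mph.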

Why this matters for ML: A model that predicts "positive" for everything gets 100% recall, but its precision is only the fraction of examples that are truly positive (near 0% when positives are rare). The harmonic mean correctly identifies this as a terrible model.
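A short sketch makes the "predict positive for everything" case concrete. The dataset below is invented for illustration, with an assumed 1% share of positives.

```python
import numpy as np

# Made-up dataset: 1,000 examples, only 10 of them positive (1%)
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# The lazy model: predict "positive" for every example
y_pred = np.ones_like(y_true)

tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))

precision = tp / (tp + fp)   # equals the share of positives in the data
recall = tp / (tp + fn)      # nothing is missed, so recall is perfect
f1 = 2 * precision * recall / (precision + recall)
arithmetic = (precision + recall) / 2

print(f"Precision: {precision:.1%}")
print(f"Recall:    {recall:.1%}")
print(f"Arithmetic mean: {arithmetic:.1%}")
print(f"F1 (harmonic):   {f1:.1%}")
```

The arithmetic mean lands just above 50% and makes the lazy model look mediocre, while F1 stays below 2% and exposes it.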

| Precision | Recall | F1 Score | Verdict |
| --- | --- | --- | --- |
| 90% | 90% | 90% | Great! Both balanced |
| 99% | 10% | 18% | Terrible! Very unbalanced |
| 50% | 50% | 50% | Mediocre |

F1 is high only when BOTH precision AND recall are reasonably high.

cell 014
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
# =============================================================================
# PRECISION, RECALL, F1: Calculation
# =============================================================================

def calculate_metrics(cm):
    """
    Calculate precision, recall, F1 from confusion matrix.

    Parameters:
        cm: dict with TP, TN, FP, FN

    Returns:
        dict with precision, recall, f1, accuracy
    """
    TP, TN, FP, FN = cm['TP'], cm['TN'], cm['FP'], cm['FN']

    # Precision: When we say positive, are we right?
    # Note: We add a check to avoid division by zero
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0

    # Recall: Did we catch all the positives?
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0

    # F1: Harmonic mean of precision and recall
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    # Accuracy (for comparison)
    accuracy = (TP + TN) / (TP + TN + FP + FN)

    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

print("="*70)
print("PRECISION, RECALL, AND F1 SCORE")
print("="*70)

metrics = calculate_metrics(cm)

print("\n" + "-"*70)
print("STEP-BY-STEP CALCULATION:")
print("-"*70)

print(f"""
From our confusion matrix:
  TP = {cm['TP']} (correctly identified vertical lines)
  TN = {cm['TN']} (correctly identified horizontal lines)
  FP = {cm['FP']} (horizontal lines wrongly called vertical)
  FN = {cm['FN']} (vertical lines wrongly called horizontal)

PRECISION: "When I say vertical, am I right?"
  Formula: Precision = TP / (TP + FP)

  Precision = {cm['TP']} / ({cm['TP']} + {cm['FP']})
            = {cm['TP']} / {cm['TP'] + cm['FP']}
            = {metrics['precision']:.4f}
            = {metrics['precision']*100:.1f}%

RECALL: "Did I catch all the vertical lines?"
  Formula: Recall = TP / (TP + FN)

  Recall = {cm['TP']} / ({cm['TP']} + {cm['FN']})
         = {cm['TP']} / {cm['TP'] + cm['FN']}
         = {metrics['recall']:.4f}
         = {metrics['recall']*100:.1f}%

F1 SCORE: "Balance of precision and recall"
  Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)

  F1 = 2 × ({metrics['precision']:.4f} × {metrics['recall']:.4f}) / ({metrics['precision']:.4f} + {metrics['recall']:.4f})
     = 2 × {metrics['precision'] * metrics['recall']:.4f} / {metrics['precision'] + metrics['recall']:.4f}
     = {metrics['f1']:.4f}
     = {metrics['f1']*100:.1f}%

ACCURACY (for comparison):
  Formula: Accuracy = (TP + TN) / Total

  Accuracy = ({cm['TP']} + {cm['TN']}) / {cm['TP'] + cm['TN'] + cm['FP'] + cm['FN']}
           = {cm['TP'] + cm['TN']} / {cm['TP'] + cm['TN'] + cm['FP'] + cm['FN']}
           = {metrics['accuracy']:.4f}
           = {metrics['accuracy']*100:.1f}%
""")
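Why does F1 use a harmonic mean rather than a plain average? A small sketch may help. The precision and recall values below are hypothetical, chosen only to show that the harmonic mean stays low when one of the two metrics is poor, whereas an arithmetic mean would hide the problem.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision, recall):
    """Plain average, shown only for comparison."""
    return (precision + recall) / 2


# Hypothetical detectors: one lopsided, one balanced
lopsided = (1.0, 0.1)     # always right when it says "vertical", but misses 90% of them
balanced = (0.55, 0.55)

for name, (p, r) in [("lopsided", lopsided), ("balanced", balanced)]:
    print(f"{name}: arithmetic mean = {arithmetic_mean(p, r):.3f}, F1 = {f1_score(p, r):.3f}")
```

Both detectors have the same arithmetic mean, but the lopsided one gets a much lower F1, which is the behavior we want from a "balance" metric.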
cell 015
# =============================================================================
# DEMONSTRATING THE PRECISION-RECALL TRADE-OFF
# =============================================================================

print("="*70)
print("THE PRECISION-RECALL TRADE-OFF: A Visual Demonstration")
print("="*70)

print("""
To understand the trade-off, let's see what happens when we change
our THRESHOLD for saying "vertical" (positive).

Currently we use: threshold = 0.5
  - If output >= 0.5 → predict "vertical"
  - If output < 0.5 → predict "horizontal"

But what if we change this threshold?
""")

# Try different thresholds
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
results = []

for threshold in thresholds:
    # Make predictions at this threshold
    preds = np.array([1 if model.forward(x) >= threshold else 0 for x in X_test])

    # Calculate confusion matrix
    TP = np.sum((preds == 1) & (y_test == 1))
    TN = np.sum((preds == 0) & (y_test == 0))
    FP = np.sum((preds == 1) & (y_test == 0))
    FN = np.sum((preds == 0) & (y_test == 1))

    # Calculate metrics
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    results.append({
        'threshold': threshold,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'TP': TP, 'FP': FP, 'FN': FN
    })

    print(f"Threshold = {threshold}:")
    print(f"  TP={TP:2d}, FP={FP:2d}, FN={FN:2d}")
    print(f"  Precision={precision:.1%}, Recall={recall:.1%}, F1={f1:.1%}")
    print()

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Precision vs Recall at different thresholds
ax = axes[0]
precisions = [r['precision'] for r in results]
recalls = [r['recall'] for r in results]

ax.plot(recalls, precisions, 'b-o', linewidth=2, markersize=10)
for r in results:
    ax.annotate(f"  t={r['threshold']}",
                (r['recall'], r['precision']), fontsize=9)

ax.set_xlabel('Recall', fontsize=12)
ax.set_ylabel('Precision', fontsize=12)
ax.set_title('Precision-Recall Trade-off\n(Each point is a different threshold)',
             fontsize=12, fontweight='bold')
ax.set_xlim(-0.05, 1.05)
ax.set_ylim(-0.05, 1.05)
ax.grid(True, alpha=0.3)

# Add ideal point
ax.scatter([1], [1], color='gold', s=200, marker='*', zorder=5, label='Ideal (1,1)')
ax.legend()

# Plot 2: Bar chart showing trade-off
ax = axes[1]
x = np.arange(len(thresholds))
width = 0.25

bars1 = ax.bar(x - width, precisions, width, label='Precision', color='#e74c3c')
bars2 = ax.bar(x, recalls, width, label='Recall', color='#27ae60')
bars3 = ax.bar(x + width, [r['f1'] for r in results], width, label='F1', color='#9b59b6')

ax.set_xlabel('Threshold', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Metrics at Different Thresholds', fontsize=12, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels([f'{t}' for t in thresholds])
ax.legend()
ax.set_ylim(0, 1.1)

plt.tight_layout()
plt.show()

print("""
KEY INSIGHT:
════════════════════════════════════════════════════════════════════════

• LOW threshold (0.1): "Say vertical for almost everything!"
  → High recall (catch most verticals) but low precision (many false alarms)

• HIGH threshold (0.9): "Only say vertical when VERY confident!"
  → High precision (rarely wrong when we say vertical) but low recall (miss many)

• MIDDLE threshold (0.5): Balanced trade-off

Notice how the precision-recall curve shows the trade-off: as one goes up,
the other tends to go down. F1 score helps us find a good balance!
""")
cell 016
# =============================================================================
# VISUALIZE ALL METRICS
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Bar chart of all metrics
ax1 = axes[0]
metric_names = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
metric_values = [metrics['accuracy'], metrics['precision'], metrics['recall'], metrics['f1']]
colors = ['#3498db', '#e74c3c', '#27ae60', '#9b59b6']

bars = ax1.bar(metric_names, metric_values, color=colors, edgecolor='white', linewidth=2)
ax1.set_ylim(0, 1.1)
ax1.set_ylabel('Score', fontsize=12)
ax1.set_title('Model Performance Metrics', fontsize=14, fontweight='bold')
ax1.axhline(y=1.0, color='gray', linestyle='--', alpha=0.5, label='Perfect score')

# Add value labels
for bar, val in zip(bars, metric_values):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
             f'{val:.1%}', ha='center', va='bottom', fontsize=12, fontweight='bold')

# Plot 2: Which metric to use guide
ax2 = axes[1]
ax2.axis('off')

metrics_explanation = """
WHICH METRIC SHOULD YOU USE?
═══════════════════════════════════════════════════

ACCURACY
  • Best when: Classes are balanced (50/50)
  • Misleading when: Rare events (e.g., 1% fraud)

PRECISION
  • Best when: False alarms are COSTLY
  • Examples:
    - Spam filter (don't delete real email!)
    - Criminal conviction (don't jail innocent!)

RECALL
  • Best when: Missing positives is COSTLY
  • Examples:
    - Disease detection (don't miss sick patients!)
    - Fraud detection (don't miss fraud!)

F1 SCORE
  • Best when: You need balance between P & R
  • A common default when both error types matter

═══════════════════════════════════════════════════
For our V/H classifier, all metrics are similar
because our dataset is balanced and the model works well!
"""

ax2.text(0.05, 0.5, metrics_explanation, fontsize=10, family='monospace',
        verticalalignment='center', transform=ax2.transAxes,
        bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))

plt.tight_layout()
plt.show()
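The warning that accuracy misleads on rare events can be checked with a few lines of arithmetic. This is a minimal sketch on made-up data (1,000 hypothetical transactions, 1% fraud) and a deliberately useless "model" that always predicts "not fraud"; none of it comes from the lesson's dataset.

```python
import numpy as np

# Hypothetical imbalanced labels: 10 fraud cases (1) among 1,000 transactions
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "model" that never raises an alarm
y_pred = np.zeros(1000, dtype=int)

TP = int(np.sum((y_pred == 1) & (y_true == 1)))
TN = int(np.sum((y_pred == 0) & (y_true == 0)))
FP = int(np.sum((y_pred == 1) & (y_true == 0)))
FN = int(np.sum((y_pred == 0) & (y_true == 1)))

accuracy = (TP + TN) / len(y_true)
recall = TP / (TP + FN) if (TP + FN) > 0 else 0.0
precision = TP / (TP + FP) if (TP + FP) > 0 else 0.0

print(f"Accuracy:  {accuracy:.1%}")   # looks excellent
print(f"Recall:    {recall:.1%}")     # every fraud case was missed
print(f"Precision: {precision:.1%}")  # no positive predictions at all
```

The accuracy is high only because the negative class dominates; recall exposes that the model catches nothing.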

6.5 The Committee Report: Saliency and Interpretability

We know our model works well, but WHY does it work? What has it actually learned?

What IS Interpretability?

Interpretability (also called Explainability) means understanding:

  1. What patterns did the model learn?
  2. Why does it make specific predictions?
  3. Is it using the "right" features?

| Question | How to Answer |
|----------|---------------|
| What patterns did it learn? | Look at the weights |
| Why did it predict "vertical"? | Look at which inputs contributed most |
| Is it using the right features? | Visualize the saliency map |

What IS Saliency?

The word "saliency" comes from Latin salire meaning "to leap." In machine learning:

Saliency = Which parts of the input "leap out" as important to the model

For our Perceptron, saliency is beautifully simple:

$$\text{Saliency}_i = |w_i \times x_i|$$

Where:

  • $w_i$ = weight for input $i$
  • $x_i$ = value of input $i$
  • $|\ldots|$ = absolute value (we care about magnitude, not sign)

Why Absolute Value?

| Weight × Input | Meaning | Contribution |
|----------------|---------|--------------|
| +2.0 × 1.0 = +2.0 | Strongly SUPPORTS vertical | HIGH |
| -2.0 × 1.0 = -2.0 | Strongly OPPOSES vertical | HIGH |
| +0.1 × 1.0 = +0.1 | Weakly supports vertical | LOW |

Both +2.0 and -2.0 are strong contributions - just in opposite directions. The absolute value captures the strength of influence.

Committee Analogy

"We ask the committee: 'Show us your reasoning. Highlight the evidence that most influenced your decision.' They produce a report where the most influential pieces of evidence glow brightly. This is the saliency map - a visual explanation of the committee's thought process."

Why Interpretability Matters

| Reason | Example |
|--------|---------|
| Trust | Can we trust this medical diagnosis? |
| Debugging | Why is the model getting this wrong? |
| Discovery | What features actually matter? |
| Fairness | Is it unfairly using race or gender? |
| Legal | GDPR is often read as granting a "right to explanation" for automated decisions |

The Math Behind Saliency

For our Perceptron, let's trace WHY $|w_i \times x_i|$ measures importance:

Step 1: The Neuron's Decision

$$z = w_1 x_1 + w_2 x_2 + \dots + w_9 x_9 + b$$

Each term $w_i x_i$ is that pixel's contribution to the final sum $z$.

Step 2: How Much Did Each Pixel Contribute?

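| Pixel | Weight ($w_i$) | Input ($x_i$) | Contribution ($w_i \times x_i$) |
|-------|----------------|---------------|----------------------------------|
| 0 | 0.5 | 0 | 0.5 × 0 = 0 (no contribution) |
| 1 | 1.2 | 1 | 1.2 × 1 = 1.2 (strong positive) |
| 4 | -0.8 | 1 | -0.8 × 1 = -0.8 (strong negative) |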

Step 3: Why Absolute Value?

Both +1.2 and -0.8 are strong influences on the decision - they just push in opposite directions. The absolute value captures strength of influence regardless of direction.

$$\text{Saliency}_i = |w_i \times x_i|$$

Interpretation:

  • High saliency = This pixel strongly influenced the decision (positively OR negatively)
  • Low saliency = This pixel didn't matter much for this prediction
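The three steps can be reproduced in a few lines of NumPy. This sketch uses only the illustrative values from the Step 2 table (pixels 0, 1 and 4), not the weights our model actually learned; the variable names are placeholders.

```python
import numpy as np

# Illustrative values from the Step 2 table (pixels 0, 1 and 4 only)
weights = np.array([0.5, 1.2, -0.8])
inputs = np.array([0.0, 1.0, 1.0])

contributions = weights * inputs      # signed: pushes z up (+) or down (-)
saliency = np.abs(contributions)      # unsigned: strength of influence
z_partial = contributions.sum()       # these pixels' share of z (bias not included)

# Rank the pixels from most to least influential
ranking = np.argsort(-saliency)

print("Contributions:", contributions)
print("Saliency:     ", saliency)
print("Partial sum z:", round(float(z_partial), 4))
print("Most influential first (index into the arrays above):", ranking.tolist())
```

Note that the dark pixel (input 0) has zero saliency even though its weight is not zero: saliency describes one specific prediction, while the weights describe the model as a whole.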

Looking at What Our Model Learned

cell 018
# =============================================================================
# SALIENCY: What Did the Model Learn?
# =============================================================================

print("="*70)
print("THE COMMITTEE REPORT: What Did the Model Learn?")
print("="*70)

# First, let's look at the learned weights
print("\n" + "-"*70)
print("STEP 1: Examine the Learned Weights")
print("-"*70)

weights_grid = model.weights.reshape(3, 3)
print("""
Remember our pixel positions:

    Position Index:     Image Layout:
    [0] [1] [2]         [row 0]
    [3] [4] [5]   →     [row 1]
    [6] [7] [8]         [row 2]

Our model's learned weights (as 3x3 grid):
""")
for i, row in enumerate(weights_grid):
    print(f"  Row {i}: [{row[0]:6.3f}, {row[1]:6.3f}, {row[2]:6.3f}]")

print(f"\n  Bias: {model.bias:.4f}")

# Interpret the weights
print("\n" + "-"*70)
print("STEP 2: Interpret What the Weights Mean")
print("-"*70)

print("""
HOW TO READ WEIGHTS:
  • Positive weight → This pixel being bright INCREASES "vertical" confidence
  • Negative weight → This pixel being bright DECREASES "vertical" confidence
  • Near-zero weight → This pixel doesn't matter much
""")

# Find which positions have highest/lowest weights
flat_weights = model.weights
max_idx = np.argmax(flat_weights)
min_idx = np.argmin(flat_weights)

print(f"""
KEY OBSERVATIONS:

  Maximum weight: position {max_idx} (row {max_idx//3}, col {max_idx%3}) = {flat_weights[max_idx]:.3f}
    → If this pixel is bright, model is MORE confident it's vertical

  Minimum weight: position {min_idx} (row {min_idx//3}, col {min_idx%3}) = {flat_weights[min_idx]:.3f}
    → If this pixel is bright, model is LESS confident it's vertical

  Positions with HIGH positive weights: {np.where(flat_weights > 0.3)[0].tolist()}
    → These pixels SUPPORT "vertical" classification

  Positions with HIGH negative weights: {np.where(flat_weights < -0.3)[0].tolist()}
    → These pixels OPPOSE "vertical" classification
""")
cell 019
# =============================================================================
# VISUALIZE: Weights and Saliency Maps - THE "AHA!" MOMENT
# =============================================================================

def compute_saliency(model, x):
    """
    Compute saliency map for an input.

    Saliency = |weight × input|

    This tells us: "How much did each input pixel
    contribute to the final decision?"

    Parameters:
        model: Trained model with weights
        x: Input image (flattened)

    Returns:
        saliency: Array of contribution magnitudes
    """
    x = np.array(x).flatten()
    # Multiply each input by its weight, take absolute value
    return np.abs(model.weights * x)

fig, axes = plt.subplots(2, 4, figsize=(16, 8))

# =================
# Top row: Vertical line analysis
# =================

# 1. Input image
ax = axes[0, 0]
ax.imshow(vertical_line, cmap='Blues', vmin=0, vmax=1)
ax.set_title('INPUT:\nVertical Line', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        ax.text(j, i, f'{vertical_line[i,j]:.0f}', ha='center', va='center', fontsize=12)
ax.axis('off')

# 2. Model weights
ax = axes[0, 1]
im = ax.imshow(weights_grid, cmap='RdBu', vmin=-2, vmax=2)
ax.set_title('WEIGHTS:\nLearned by Model', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        color = 'white' if abs(weights_grid[i,j]) > 1 else 'black'
        ax.text(j, i, f'{weights_grid[i,j]:.2f}', ha='center', va='center', fontsize=9, color=color)
ax.axis('off')

# 3. Saliency map
ax = axes[0, 2]
saliency_v = compute_saliency(model, vertical_flat).reshape(3, 3)
im = ax.imshow(saliency_v, cmap='hot', vmin=0)
ax.set_title('SALIENCY MAP:\n|Weight × Input|', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        # The 'hot' colormap is dark at low values and bright at high values,
        # so use black text on bright cells and white text on dark cells
        color = 'black' if saliency_v[i,j] > saliency_v.max()/2 else 'white'
        ax.text(j, i, f'{saliency_v[i,j]:.2f}', ha='center', va='center', fontsize=9, color=color)
ax.axis('off')

# 4. Prediction result
ax = axes[0, 3]
ax.axis('off')
pred_v = model.forward(vertical_flat)
# The true label is "vertical", so the prediction is correct when output >= 0.5
verdict_v = "Correct! ✓" if pred_v >= 0.5 else "Incorrect ✗"
result_text = f"""PREDICTION

Raw output: {pred_v:.4f}
Confidence: {pred_v*100:.1f}%

Decision: {"VERTICAL" if pred_v >= 0.5 else "HORIZONTAL"}

{verdict_v}"""
ax.text(0.5, 0.5, result_text, fontsize=11, fontweight='bold',
       ha='center', va='center', transform=ax.transAxes,
       bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.8))

# =================
# Bottom row: Horizontal line analysis
# =================

# 1. Input image
ax = axes[1, 0]
ax.imshow(horizontal_line, cmap='Blues', vmin=0, vmax=1)
ax.set_title('INPUT:\nHorizontal Line', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        ax.text(j, i, f'{horizontal_line[i,j]:.0f}', ha='center', va='center', fontsize=12)
ax.axis('off')

# 2. Model weights (same)
ax = axes[1, 1]
im = ax.imshow(weights_grid, cmap='RdBu', vmin=-2, vmax=2)
ax.set_title('WEIGHTS:\n(Same model)', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        color = 'white' if abs(weights_grid[i,j]) > 1 else 'black'
        ax.text(j, i, f'{weights_grid[i,j]:.2f}', ha='center', va='center', fontsize=9, color=color)
ax.axis('off')

# 3. Saliency map
ax = axes[1, 2]
saliency_h = compute_saliency(model, horizontal_flat).reshape(3, 3)
im = ax.imshow(saliency_h, cmap='hot', vmin=0)
ax.set_title('SALIENCY MAP:\n|Weight × Input|', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        # Same contrast rule as above: black text on bright cells, white on dark
        color = 'black' if saliency_h[i,j] > saliency_h.max()/2 else 'white'
        ax.text(j, i, f'{saliency_h[i,j]:.2f}', ha='center', va='center', fontsize=9, color=color)
ax.axis('off')

# 4. Prediction result
ax = axes[1, 3]
ax.axis('off')
pred_h = model.forward(horizontal_flat)
# The true label is "horizontal", so the prediction is correct when output < 0.5
verdict_h = "Correct! ✓" if pred_h < 0.5 else "Incorrect ✗"
result_text = f"""PREDICTION

Raw output: {pred_h:.4f}
Confidence: {(1-pred_h)*100:.1f}% horizontal

Decision: {"VERTICAL" if pred_h >= 0.5 else "HORIZONTAL"}

{verdict_h}"""
ax.text(0.5, 0.5, result_text, fontsize=11, fontweight='bold',
       ha='center', va='center', transform=ax.transAxes,
       bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.8))

plt.suptitle('THE COMMITTEE REPORT: How the Model Makes Decisions', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
cell 020
# =============================================================================
# THE "AHA!" MOMENT: Understanding What the Model Learned
# =============================================================================

print("="*70)
print("THE KEY INSIGHT: What Did the Model ACTUALLY Learn?")
print("="*70)

print("""
Looking at the visualizations above, we can see something beautiful:

FOR VERTICAL LINES:
  • The middle column (positions 1, 4, 7) has POSITIVE weights
  • When bright pixels appear in the middle column, the model says "VERTICAL!"
  • The saliency map lights up exactly where the vertical line is

FOR HORIZONTAL LINES:
  • The side pixels of the middle row (positions 3 and 5) have NEGATIVE or low weights
  • When bright pixels appear across a row, they don't activate the "vertical" detector
  • The output is LOW, meaning "not vertical" = "horizontal"

THE MODEL LEARNED THE RIGHT PATTERN!
═══════════════════════════════════════════════════════════════════════

Our model didn't just memorize examples. It learned a GENERAL RULE:

  "Vertical lines have bright pixels stacked in a column.
   Horizontal lines have bright pixels spread across a row."

This is exactly what we hoped it would learn!

═══════════════════════════════════════════════════════════════════════
""")

# Show the pattern it learned
print("\nVisualized Pattern Recognition:")
print("-"*50)
print("""
  VERTICAL LINE:          MODEL LOOKS AT:
  [ ] [●] [ ]             [ ] [HIGH] [ ]
  [ ] [●] [ ]     →       [ ] [HIGH] [ ]
  [ ] [●] [ ]             [ ] [HIGH] [ ]
                          (Middle column weights are positive)

  HORIZONTAL LINE:        MODEL LOOKS AT:
  [ ] [ ] [ ]             [ ] [ ] [ ]
  [●] [●] [●]     →       [LOW] [LOW] [LOW]
  [ ] [ ] [ ]             [ ] [ ] [ ]
                          (Row weights don't support "vertical")
""")

6.6 Train/Test Split: Why We Need Separate Data

Throughout this notebook, we've used separate training and test data. This is crucial for honest evaluation.

The Problem: Memorization vs Learning

A model could achieve 100% accuracy on training data by simply memorizing every example - like a student who memorizes test answers instead of understanding concepts.

But memorization isn't useful - we need the model to generalize to NEW data it has never seen.

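| Approach | Training Accuracy | Test Accuracy | What Happened? |
|----------|-------------------|---------------|----------------|
| True learning | 95% | 93% | Learned the general pattern |
| Memorization | 100% | 50% | Memorized training, fails on new |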

What IS a Train/Test Split?

We divide our data into two groups:

ALL DATA (150 samples)
    │
    ├── TRAINING SET (100 samples) ──→ Used to TRAIN the model
    │                                  Model sees these during learning
    │
    └── TEST SET (50 samples) ───────→ Used to EVALUATE the model
                                       Model NEVER sees these during training

Why This Works

| Data Set | Model Sees During Training? | Purpose |
|----------|-----------------------------|---------|
| Training | YES | Learn patterns |
| Test | NO | Evaluate generalization |

The test set acts as a "final exam" - questions the model has never seen.
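A split like the 100/50 one in the diagram can be sketched with NumPy alone. The helper name `split_indices`, the seed, and the placeholder data below are assumptions made for illustration; the lesson's own split may have been produced differently.

```python
import numpy as np

def split_indices(n_samples, n_test, seed=0):
    """Shuffle sample indices, then carve off n_test of them as the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    return order[n_test:], order[:n_test]   # (train indices, test indices)

# Placeholder data shaped like the lesson's: 150 flattened 3x3 images with 0/1 labels
demo_X = np.random.default_rng(1).integers(0, 2, size=(150, 9)).astype(float)
demo_y = np.arange(150) % 2

train_idx, test_idx = split_indices(len(demo_X), n_test=50)
demo_X_train, demo_y_train = demo_X[train_idx], demo_y[train_idx]
demo_X_test, demo_y_test = demo_X[test_idx], demo_y[test_idx]

# Sanity check: no sample may appear in both sets
overlap = set(train_idx.tolist()) & set(test_idx.tolist())
print(len(demo_X_train), len(demo_X_test), len(overlap))
```

Shuffling before splitting matters when the data is stored in a meaningful order (for example, all vertical lines first); without it, the test set could contain only one class.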

Committee Analogy

"It's like preparing for an exam:

  • Training data = study materials (examples you practice with)
  • Test data = the actual exam (new questions you've never seen)

If you just memorize your notes without understanding, you'll ace the practice problems but fail the exam. If you truly learned the concepts, you'll do well on both."

The Golden Rule

NEVER use test data for training!

If the model sees test data during training, it can memorize those examples too, and our evaluation becomes meaningless.

Common Split Ratios

| Split | Training | Test | When to Use |
|-------|----------|------|-------------|
| 80/20 | 80% | 20% | Large datasets (>10,000 samples) |
| 70/30 | 70% | 30% | Medium datasets (1,000-10,000) |
| 60/40 | 60% | 40% | Small datasets (<1,000) |

More test data = more reliable evaluation, but less training data.

Understanding Overfitting Mathematically

What IS Overfitting?

Overfitting is when a model learns the noise in the training data, not just the signal.

Analogy: Imagine studying for an exam by memorizing the exact wording of practice questions instead of understanding the concepts. You'd ace those exact questions but fail on new ones.

How Train/Test Split Reveals Overfitting:

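| Scenario | Training Accuracy | Test Accuracy | What's Happening |
|----------|-------------------|---------------|------------------|
| Good learning | 95% | 93% | Learned the pattern! |
| Mild overfitting | 99% | 85% | Some memorization |
| Severe overfitting | 100% | 50% | Memorized everything, learned nothing |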

The Math:

  • If a model memorizes all 100 training examples, it can get 100% training accuracy
  • But those memorized patterns don't apply to new data
  • Test accuracy reveals true generalization

The Gap:

$$\text{Overfitting Gap} = \text{Training Accuracy} - \text{Test Accuracy}$$

  • Gap < 5%: Great! Model generalizes well
  • Gap 5-15%: Some overfitting, might need more data or simpler model
  • Gap > 15%: Serious overfitting, model is memorizing
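The gap thresholds translate directly into a small helper. This is a sketch: the function name is hypothetical, the cut-offs are the rules of thumb listed above rather than universal constants, and a gap of exactly 15% is treated as "some overfitting".

```python
def overfitting_gap_label(train_acc, test_acc):
    """Return (gap, label) using the rule-of-thumb thresholds from the text."""
    gap = train_acc - test_acc
    if gap < 0.05:
        label = "generalizes well"
    elif gap <= 0.15:
        label = "some overfitting"
    else:
        label = "serious overfitting"
    return gap, label


# The three scenarios from the table: (training accuracy, test accuracy)
scenarios = [(0.95, 0.93), (0.99, 0.85), (1.00, 0.50)]
for train_acc, test_acc in scenarios:
    gap, label = overfitting_gap_label(train_acc, test_acc)
    print(f"train={train_acc:.0%}, test={test_acc:.0%}, gap={gap:.0%} -> {label}")
```

A negative gap (test accuracy above training accuracy) also falls in the first category; on a small test set that usually reflects sampling luck rather than a better model.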

Why These Specific Ratios?

| More Training Data | More Test Data |
|--------------------|----------------|
| Model can learn more | More reliable evaluation |
| Better final accuracy | Smaller margin of error |
| Less reliable evaluation | Model might underfit |

The sweet spot: Enough training data to learn well, enough test data to evaluate reliably. With 100 samples, 80/20 gives 80 for training (decent) and 20 for testing (acceptable). With 10,000 samples, even 90/10 gives 1,000 test samples (very reliable).
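The "margin of error" can be made concrete with the binomial standard error of an accuracy estimate, $\sqrt{p(1-p)/n}$. This formula is a standard approximation that the lesson does not state, so treat the sketch below as an illustration under that assumption; the 93% accuracy is simply the example figure used in the tables, and the interval is only rough for small test sets.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test samples."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)


# How uncertain is a measured 93% accuracy at different test-set sizes?
for n_test in (20, 50, 1000):
    se = accuracy_standard_error(0.93, n_test)
    # 1.96 standard errors gives an approximate 95% interval
    print(f"n_test={n_test:5d}: 93% +/- {1.96 * se:.1%}")
```

The uncertainty shrinks with the square root of the test-set size, which is why a 10% test split is plenty for 10,000 samples but shaky for 100.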

cell 022
# =============================================================================
# TRAIN/TEST SPLIT: Our Results
# =============================================================================

print("="*70)
print("TRAIN/TEST SPLIT: Checking for Generalization")
print("="*70)

print(f"""
OUR DATA SPLIT:
  • Training set: {len(X_train)} samples (used for learning)
  • Test set: {len(X_test)} samples (used for evaluation only)
  • Split ratio: {len(X_train)}/{len(X_train)+len(X_test)} = {len(X_train)/(len(X_train)+len(X_test))*100:.0f}% training

RESULTS:
  • Training accuracy: {train_accuracy:.1%}
  • Test accuracy: {test_accuracy:.1%}
  • Difference: {abs(train_accuracy - test_accuracy):.1%}
""")

# Interpret the gap
diff = train_accuracy - test_accuracy

print("-"*70)
print("INTERPRETATION:")
print("-"*70)

if diff < 0.05:
    print("""
  ✓ EXCELLENT! Training and test accuracy are very similar.

  This suggests the model has LEARNED the general pattern,
  not just memorized the training data.

  Our model generalizes well to new data!
""")
elif diff < 0.15:
    print(f"""
  ⚠ CAUTION: Training accuracy is {diff:.1%} higher than test accuracy.

  Some memorization may have occurred.
  The model might be slightly "overfitting" to training data.
""")
else:
    print(f"""
  ⚠ WARNING: Training accuracy is {diff:.1%} higher than test accuracy!

  This suggests OVERFITTING - the model memorized training data
  but doesn't generalize well to new data.

  Possible solutions:
    - Get more training data
    - Use regularization
    - Simplify the model
""")

Part 6 Summary: What We've Learned

Key Concepts Mastered

| Concept | Definition/Formula | Why It Matters |
|---------|--------------------|----------------|
| Training vs Inference | Learning mode vs using mode | Different behaviors, same weights |
| Accuracy | (TP + TN) / Total | Simple overall view (but can mislead) |
| Confusion Matrix | TP, TN, FP, FN breakdown | Shows WHAT mistakes are made |
| Precision | TP / (TP + FP) | "When I say yes, am I right?" |
| Recall | TP / (TP + FN) | "Did I catch all the positives?" |
| F1 Score | 2 × (P × R) / (P + R) | Balance precision and recall |
| Saliency | \|weight × input\| | What did the model look at? |
| Train/Test Split | Separate data for evaluation | Detect memorization vs learning |

The Four Categories Explained

| Category | Model Said | Truth Was | Meaning |
|----------|------------|-----------|---------|
| TP (True Positive) | Vertical | Vertical | Correct detection |
| TN (True Negative) | Horizontal | Horizontal | Correct rejection |
| FP (False Positive) | Vertical | Horizontal | False alarm |
| FN (False Negative) | Horizontal | Vertical | Missed detection |

Committee Analogy Progress

| Part | What Happened |
|------|--------------|
| Parts 1-3 | Committee member learned procedures |
| Part 4 | First case - confused, random guessing |
| Part 5 | Learned from feedback, became expert |
| Part 6 | Performance review: verified expertise and understood reasoning |
| Part 7 | (Next) One expert isn't enough - building the full committee |

The Big Picture

We now have a complete, evaluated model that:

  • Achieves high accuracy on both training and test data
  • Makes few mistakes (low FP and FN)
  • Has interpretable learned weights
  • Uses the RIGHT features (column patterns for vertical detection)
  • Generalizes well to new data

Knowledge Check

cell 024
# =============================================================================
# KNOWLEDGE CHECK - Part 6
# =============================================================================

print("KNOWLEDGE CHECK - Part 6: Evaluation")
print("="*60)
print("\nAnswer these questions to test your understanding:\n")

questions = [
    {
        "q": "1. What's the difference between training and inference mode?",
        "options": [
            "A) Training is faster than inference",
            "B) In training, weights update; in inference, weights are frozen",
            "C) Inference uses more data than training",
            "D) They're the same thing with different names"
        ],
        "answer": "B",
        "explanation": "During training, the model learns and weights change after each example. During inference, weights are frozen and we just make predictions - no learning happens."
    },
    {
        "q": "2. A model predicts 'sick' for a healthy patient. What type of error is this?",
        "options": [
            "A) True Positive (TP)",
            "B) True Negative (TN)",
            "C) False Positive (FP)",
            "D) False Negative (FN)"
        ],
        "answer": "C",
        "explanation": "False Positive: We predicted Positive (sick), but we were False (wrong) - the patient was actually healthy. This is a 'false alarm'."
    },
    {
        "q": "3. You're building a disease detection system. Missing a sick patient is VERY bad.\n   Which metric should you prioritize?",
        "options": [
            "A) Accuracy",
            "B) Precision",
            "C) Recall",
            "D) F1 Score"
        ],
        "answer": "C",
        "explanation": "Recall measures 'did we catch all the positives?' High recall means we catch most sick patients, even if we have some false alarms. When missing positives is costly, prioritize recall."
    },
    {
        "q": "4. Why do we use a separate test set?",
        "options": [
            "A) To have more data for training",
            "B) To make training faster",
            "C) To check if the model memorized vs truly learned",
            "D) It's optional and not really needed"
        ],
        "answer": "C",
        "explanation": "A model could memorize training data and fail on new data. The test set (unseen data) reveals if it truly learned the general pattern or just memorized examples."
    },
    {
        "q": "5. What does a saliency map show?",
        "options": [
            "A) The accuracy of the model over time",
            "B) Which inputs the model focused on for its decision",
            "C) The training loss curve",
            "D) How fast the model runs"
        ],
        "answer": "B",
        "explanation": "Saliency maps highlight which parts of the input were most important for the model's decision. It's a form of interpretability - understanding WHY the model made its prediction."
    }
]

for q in questions:
    print(q["q"])
    for opt in q["options"]:
        print(f"   {opt}")
    print()

print("\n" + "="*60)
print("Scroll down for answers...")
print("="*60)
cell 025
# =============================================================================
# ANSWERS - Knowledge Check Part 6
# =============================================================================

print("ANSWERS - Part 6 Knowledge Check")
print("="*60)

for i, q in enumerate(questions, 1):
    print(f"\n{i}. Answer: {q['answer']}")
    print(f"   {q['explanation']}")

print("\n" + "="*60)
print("How did you do?")
print("  5/5: Evaluation Master! Ready for Part 7!")
print("  4/5: Solid understanding - great job!")
print("  3/5: Review the sections you missed")
print("  <3:  Re-read Part 6 before continuing")
print("="*60)

What's Next?

Congratulations! You've completed Part 6!

Our single neuron is now a verified expert - we've evaluated its performance, understood its decision-making process, and confirmed it learned the RIGHT patterns.

But Here's the Thing...

A single neuron (Perceptron) can only learn linearly separable patterns - classes that can be divided by a straight line (or, with more than two inputs, a flat hyperplane). For more complex problems, one expert isn't enough.

The Limitation of Single Neurons

Some problems are not linearly separable. The classic example is the XOR problem:

| Input A | Input B | Output (XOR) |
|---------|---------|--------------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

No single neuron can learn this pattern! We need multiple neurons working together.
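A brute-force search makes the claim tangible. The sketch below tries every combination of two weights and a bias on a coarse grid (the grid range and step are arbitrary choices) and records the best accuracy a single threshold neuron reaches. Predicting 1 when $z \ge 0$ is equivalent to a sigmoid output of at least 0.5. A grid search is a demonstration, not a proof, but the result matches the known fact that XOR is not linearly separable.

```python
import itertools
import numpy as np

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = {
    "AND": np.array([0, 0, 0, 1]),   # linearly separable, for comparison
    "XOR": np.array([0, 1, 1, 0]),   # not linearly separable
}

def best_single_neuron_accuracy(y, grid):
    """Best accuracy of any neuron z = w1*a + w2*b + bias on the grid (predict 1 if z >= 0)."""
    best = 0.0
    for w1, w2, bias in itertools.product(grid, repeat=3):
        preds = (inputs @ np.array([w1, w2]) + bias >= 0).astype(int)
        best = max(best, float(np.mean(preds == y)))
    return best

grid = np.linspace(-2, 2, 17)   # -2.0, -1.75, ..., 2.0
results = {name: best_single_neuron_accuracy(y, grid) for name, y in targets.items()}

for name, acc in results.items():
    print(f"{name}: best single-neuron accuracy = {acc:.0%}")
```

AND reaches 100%, while XOR tops out at 75%: whichever line the neuron draws, at least one of the four points ends up on the wrong side.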

Coming Up in Part 7: Hidden Layers - The Full Committee

In the next notebook, we'll explore:

  • Why one neuron isn't enough - The XOR problem demonstration
  • Hidden layers - Adding more neurons between input and output
  • The full committee - Multiple experts with different perspectives
  • Universal approximation - Why deep networks can learn (almost) anything

Continue to Part 7: part_7_hidden_layers.ipynb


"One expert is good. A committee of experts is powerful."

The Brain's Decision Committee - From Expert to Team

Illustrated step

  • Inference (concept: final handbook) - The model uses learned rules without changing them.
  • Confusion matrix (concept: detailed report card) - Correct and incorrect decisions are split into useful categories.
  • Saliency (concept: highlighted evidence) - The model shows which pixels mattered most to the vote.
