Neural Network Fundamentals
Part 6: Evaluation - The Trained Expert
The Brain's Decision Committee - Chapter 6
The Story So Far...
In Part 5, something remarkable happened: our committee member learned. Starting with random weights and ~50% accuracy, they adjusted their priorities through gradient descent until they became an expert vertical line detector with 95%+ accuracy.
But how do we know they're actually good? Getting 95% on training data is one thing, but:
- What kinds of mistakes do they still make?
- Are some errors worse than others?
- Can we understand why they make the decisions they do?
This is evaluation - properly assessing our trained model and understanding what it has learned.
What You'll Learn in Part 6
By the end of this notebook, you will understand:
- Training vs Inference - The difference between learning mode and using mode
- Accuracy - The simplest metric (and its limitations)
- Confusion Matrix - A detailed breakdown of all prediction types
- Precision & Recall - Measuring different kinds of correctness
- F1 Score - Balancing precision and recall
- Saliency/Interpretability - What did the model actually learn?
- Test Sets - Why we need data the model has never seen
Prerequisites
Make sure you've completed:
- Parts 0-1: Matrices (neural_network_fundamentals.ipynb)
- Part 2: Single Neuron (part_2_single_neuron.ipynb)
- Part 3: Activation Functions (part_3_activation_functions.ipynb)
- Part 4: The Perceptron (part_4_perceptron.ipynb)
- Part 5: Training (part_5_training.ipynb)
Setup: Import Dependencies and Recreate Our Trained Model
Let's bring in everything we need and train a model to evaluate.
# =============================================================================
# PART 6: EVALUATION - SETUP AND IMPORTS
# =============================================================================

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, clear_output

# Try to import ipywidgets for interactive features
try:
    import ipywidgets as widgets
    WIDGETS_AVAILABLE = True
except ImportError:
    WIDGETS_AVAILABLE = False
    print("Note: ipywidgets not installed. Interactive features will be limited.")

# Set up matplotlib style
style_options = ['seaborn-v0_8-whitegrid', 'seaborn-whitegrid', 'ggplot', 'default']
for style in style_options:
    try:
        plt.style.use(style)
        break
    except OSError:
        continue

plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 12
np.random.seed(42)

print("Setup complete!")
print("="*60)
# =============================================================================
# RECREATE OUR TOOLS FROM PREVIOUS NOTEBOOKS
# =============================================================================

# -----------------------------------------------------------------------------
# Our canonical line images (from Part 1)
# -----------------------------------------------------------------------------
vertical_line = np.array([[0, 1, 0],
                          [0, 1, 0],
                          [0, 1, 0]])
horizontal_line = np.array([[0, 0, 0],
                            [1, 1, 1],
                            [0, 0, 0]])
vertical_flat = vertical_line.flatten()
horizontal_flat = horizontal_line.flatten()

# -----------------------------------------------------------------------------
# Dataset generator (from Part 4)
# -----------------------------------------------------------------------------
def generate_line_dataset(n_samples=100, noise_level=0.0, seed=None):
    """Generate vertical (label=1) and horizontal (label=0) line images."""
    if seed is not None:
        np.random.seed(seed)

    X, y = [], []
    for i in range(n_samples):
        image = np.zeros((3, 3))
        if i < n_samples // 2:
            # Vertical lines
            col = np.random.randint(0, 3)
            image[:, col] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(1)
        else:
            # Horizontal lines
            row = np.random.randint(0, 3)
            image[row, :] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(0)

    X, y = np.array(X), np.array(y)
    shuffle_idx = np.random.permutation(n_samples)
    return X[shuffle_idx], y[shuffle_idx]

# -----------------------------------------------------------------------------
# Sigmoid activation function (from Part 3)
# -----------------------------------------------------------------------------
def sigmoid(z):
    """Sigmoid activation: maps any value to range (0, 1)."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# -----------------------------------------------------------------------------
# TrainablePerceptron class (from Part 5)
# -----------------------------------------------------------------------------
class TrainablePerceptron:
    """A Perceptron that can learn from examples."""

    def __init__(self, n_inputs):
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0
        self.n_inputs = n_inputs
        self.loss_history = []
        self.accuracy_history = []
        self.is_trained = False  # Track if model has been trained

    def forward(self, x):
        x = np.array(x).flatten()
        z = np.dot(self.weights, x) + self.bias
        return sigmoid(z)

    def predict(self, x):
        return 1 if self.forward(x) >= 0.5 else 0

    def compute_loss(self, y_true, y_pred):
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    def train(self, X, y, learning_rate=0.1, epochs=100, verbose=True):
        self.loss_history = []
        self.accuracy_history = []

        for epoch in range(epochs):
            total_loss = 0
            correct = 0

            for i in range(len(X)):
                xi, yi = X[i], y[i]
                y_pred = self.forward(xi)
                loss = self.compute_loss(yi, y_pred)
                total_loss += loss

                if (y_pred >= 0.5 and yi == 1) or (y_pred < 0.5 and yi == 0):
                    correct += 1

                # Gradient of the cross-entropy loss for a sigmoid output
                error = y_pred - yi
                self.weights = self.weights - learning_rate * error * xi
                self.bias = self.bias - learning_rate * error

            avg_loss = total_loss / len(X)
            accuracy = correct / len(X)
            self.loss_history.append(avg_loss)
            self.accuracy_history.append(accuracy)

            if verbose and (epoch + 1) % 10 == 0:
                print(f"  Epoch {epoch+1:3d}/{epochs}: Loss = {avg_loss:.4f}, Accuracy = {accuracy*100:.1f}%")

        self.is_trained = True
        if verbose:
            print(f"\nTraining complete! Final accuracy: {self.accuracy_history[-1]*100:.1f}%")

        return self.loss_history

print("Tools recreated from previous notebooks!")
print("  - Line image templates")
print("  - Dataset generator")
print("  - Sigmoid activation")
print("  - TrainablePerceptron class")
# =============================================================================
# TRAIN OUR MODEL (Quick recap from Part 5)
# =============================================================================

print("="*70)
print("TRAINING OUR MODEL (to have something to evaluate)")
print("="*70)

# Generate training data
np.random.seed(42)
X_train, y_train = generate_line_dataset(n_samples=100, noise_level=0.0, seed=42)

# Generate TEST data (NEW! - data the model has never seen)
# Caveat: with noise_level=0.0 the generator can only produce six distinct
# images (three vertical, three horizontal), so this test set is a separately
# drawn sample rather than a set of truly novel patterns.
X_test, y_test = generate_line_dataset(n_samples=50, noise_level=0.0, seed=999)

print(f"\nTraining set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples (model has NEVER seen these!)")

# Create and train model
model = TrainablePerceptron(n_inputs=9)
print("\nTraining...")
model.train(X_train, y_train, learning_rate=0.5, epochs=50, verbose=True)

print("\n" + "="*70)
print("Model is trained and ready for evaluation!")
print("="*70)
6.1 Training vs Inference: The Committee's Memory
Before we evaluate, let's understand an important distinction: training mode vs inference mode.
What IS Inference?
The word "inference" comes from Latin inferre meaning "to bring in" or "to conclude." In machine learning:
Inference = Using a trained model to make predictions on new data
Think of it like this:
- Training = Teaching someone how to do a job
- Inference = That person actually doing the job
Why Two Different Modes?
| Aspect | Training Mode | Inference Mode |
|---|---|---|
| Purpose | Learn from examples | Make predictions |
| Weights | Being updated constantly | Frozen (fixed) |
| Data | Training set (with labels) | New, unseen data |
| Speed | Slower (computing gradients) | Fast (forward pass only) |
| Goal | Minimize loss | Predict accurately |
Committee Analogy
"During training, the committee is in a meeting room, debating cases, learning from mistakes, and updating their rulebook. Once trained, they compile their final rulebook and hand it to the front desk. The front desk uses this rulebook to make quick decisions without calling the committee for every case."
- Training: The committee meeting (slow, learning, updating)
- Inference: The front desk using the final rulebook (fast, fixed, no learning)
Why Does This Distinction Matter?
| Scenario | Why It Matters |
|---|---|
| Deployment | In production, you use inference mode for speed |
| Evaluation | We evaluate in inference mode (weights must be fixed!) |
| Consistency | Same weights give same predictions every time |
| Resources | Inference uses less memory (no gradients stored) |
The Key Insight
During inference, the model does NOT learn anything new. The weights are "frozen" - they don't change. This is essential because:
- Reproducibility: Same input always gives same output
- Speed: No gradient computation needed
- Fairness: Test data doesn't influence the model
Why "Frozen" Weights Matter Mathematically
During training, after each prediction, we do:
weights = weights - learning_rate × gradient
During inference, we SKIP this step entirely. The weights stay exactly as they were after training finished.
Why does this matter?
| If we kept updating during inference... | Consequence |
|---|---|
| Weights would change with each new input | Same input could give different outputs! |
| Model would "drift" over time | Yesterday's predictions wouldn't match today's |
| Hard to reproduce results | "But it worked yesterday!" |
| Unfair for test evaluation | Test data would influence the model |
The mathematical guarantee: With frozen weights, f(x)=σ(w⋅x+b) is a deterministic function - same input ALWAYS gives same output.
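The guarantee above can be sketched in a few lines. This is a minimal illustration, assuming a stand-alone sigmoid neuron with made-up weights; the names `frozen_forward` and `training_step` are hypothetical helpers, not part of the notebook's `TrainablePerceptron` class.

```python
import numpy as np

def sigmoid(z):
    """Map any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def frozen_forward(w, b, x):
    """Inference: forward pass only. The weights are read but never modified."""
    return sigmoid(np.dot(w, x) + b)

def training_step(w, b, x, y, learning_rate=0.5):
    """Training: forward pass followed by one gradient-descent update.

    For a sigmoid output with cross-entropy loss, the gradient with respect
    to the weights is (prediction - y) * x. New values are returned instead
    of modifying the inputs in place.
    """
    error = frozen_forward(w, b, x) - y
    return w - learning_rate * error * x, b - learning_rate * error

# Hypothetical weights and a vertical-line image, chosen only for illustration
w = np.array([0.1, 0.4, -0.2, 0.0, 0.3, -0.1, 0.2, 0.5, -0.3])
b = 0.0
x = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0], dtype=float)

first = frozen_forward(w, b, x)
second = frozen_forward(w, b, x)
print("Inference twice, identical output:", first == second)

w_new, b_new = training_step(w, b, x, y=1)
moved = frozen_forward(w_new, b_new, x)
print("Output after one training step is higher:", moved > first)
```

Calling `frozen_forward` any number of times returns the same value, while a single `training_step` on the same input already shifts the prediction.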
In Code
# =============================================================================
# TRAINING VS INFERENCE: Demonstration
# =============================================================================

print("="*70)
print("TRAINING vs INFERENCE MODE")
print("="*70)

# Show the model's state
print(f"\nModel state: {'TRAINED' if model.is_trained else 'UNTRAINED'}")

# In training mode, weights change after each sample
print("\n" + "-"*70)
print("DURING TRAINING (weights change):")
print("-"*70)
print("  For each sample:")
print("    1. Forward pass → get prediction")
print("    2. Compute loss → how wrong?")
print("    3. Compute gradients → which direction?")
print("    4. Update weights → improve! (weights CHANGE)")

# In inference mode, weights are frozen
print("\n" + "-"*70)
print("DURING INFERENCE (weights frozen):")
print("-"*70)
print("  For each sample:")
print("    1. Forward pass → get prediction")
print("    2. Done! (NO weight updates)")

# Demonstrate inference
print("\n" + "-"*70)
print("INFERENCE EXAMPLE:")
print("-"*70)

# Save weights before
weights_before = model.weights.copy()

# Make predictions (inference)
pred_v = model.forward(vertical_flat)
pred_h = model.forward(horizontal_flat)

# Check weights after
weights_after = model.weights.copy()

print(f"\n  Vertical line:   {pred_v:.4f} ({pred_v*100:.1f}% confident it's vertical)")
print(f"  Horizontal line: {pred_h:.4f} ({pred_h*100:.1f}% confident it's vertical)")
print(f"\n  Weights changed? {not np.allclose(weights_before, weights_after)}")
print("  (In inference mode, weights stay fixed!)")
6.2 Accuracy: The Simplest Metric
We've been using accuracy throughout our notebooks, but let's formally define it and understand its limitations.
What IS Accuracy?
Accuracy answers the question: "Of all the predictions I made, what fraction was correct?"
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$
Breaking Down the Formula
Let's understand each part:
| Component | What It Means | Our Example |
|---|---|---|
| Correct Predictions | Cases where prediction matches truth | Said "vertical" for vertical, "horizontal" for horizontal |
| Total Predictions | All cases we predicted on | All 50 test images |
| Accuracy | The ratio (0 to 1, or 0% to 100%) | 48/50 = 0.96 = 96% |
Computing Accuracy Step by Step
Step 1: Make predictions on all samples
Step 2: Compare each prediction to the true label
Step 3: Count how many match (correct)
Step 4: Divide by total number of predictions
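The four steps collapse into a single NumPy expression. The sketch below assumes the predictions from Step 1 are already available as a list of 0/1 labels; `simple_accuracy` and the sample labels are made up for illustration.

```python
import numpy as np

def simple_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    # Steps 2-4: compare element-wise, count the matches, divide by the total.
    # The mean of a boolean array is exactly (number of True) / (length).
    return float(np.mean(y_true == y_pred))

# Made-up labels: 10 samples, 2 of them predicted incorrectly
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]
print(f"Accuracy: {simple_accuracy(y_true, y_pred):.0%}")
```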
Why Accuracy Can Be Misleading
Accuracy has a hidden flaw: it treats all mistakes equally and ignores class imbalance.
Example - Fraud Detection:
Suppose 99% of transactions are legitimate, 1% are fraud.
| Model Strategy | Accuracy | Is It Good? |
|---|---|---|
| Say "legitimate" for EVERYTHING | 99% | NO! Catches 0% of fraud! |
| Actually detect fraud | 97% | YES! Even though lower accuracy |
The "dumb" model gets 99% accuracy by ignoring the problem entirely!
Example - Medical Diagnosis:
| Scenario | Type of Error | Consequence |
|---|---|---|
| Say "healthy" when patient is sick | Miss a disease | Patient doesn't get treatment! (VERY bad) |
| Say "sick" when patient is healthy | False alarm | Unnecessary tests (annoying but not dangerous) |
Both are "wrong" but one is much worse! Accuracy treats them the same.
When Accuracy Works Well
Accuracy is a good metric when:
- Classes are balanced (roughly 50/50 split)
- All mistakes have equal cost
- You want a quick overall view
Our V/H classifier is a good case for accuracy: balanced classes, equal mistake costs.
Understanding Why Class Imbalance Breaks Accuracy
Let's do the math to see WHY accuracy is misleading with imbalanced data:
Scenario: Fraud Detection (1% fraud, 99% legitimate)
| Strategy | Fraud Caught | Accuracy Calculation |
|---|---|---|
| "Always say legitimate" | 0 of 100 frauds | (0 + 9900) / 10000 = 99% |
| Good detector | 90 of 100 frauds | (90 + 9800) / 10000 = 98.9% |
The "dumb" strategy has HIGHER accuracy but catches ZERO fraud!
Why this happens mathematically:
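$$\text{Accuracy} = \frac{TP + TN}{\text{Total}}$$
Here TP and TN are the correctly predicted positives and negatives (defined formally in Section 6.3). When 99% of data is class 0, you can get 99% accuracy by predicting 0 for everything (TN = 9900, FN = 100, TP = FP = 0).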
The lesson: When classes are imbalanced, accuracy is dominated by the majority class. We need metrics that focus on the minority class (precision, recall).
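The arithmetic in the table can be reproduced directly. This sketch builds a hypothetical data set matching the scenario (10,000 transactions, 100 of them fraud); the 100 false alarms for the "good detector" are the number implied by the 9,800 correct legitimate predictions in the table.

```python
import numpy as np

# Hypothetical labels: 100 fraud cases (1) followed by 9,900 legitimate ones (0)
y_true = np.array([1] * 100 + [0] * 9900)

# Strategy A: always predict "legitimate" (class 0)
always_legit = np.zeros_like(y_true)

# Strategy B: catches 90 of the 100 frauds and raises 100 false alarms
detector = np.array([1] * 90 + [0] * 10 + [1] * 100 + [0] * 9800)

def summarize(y_true, y_pred):
    """Return (accuracy, fraction of fraud cases caught)."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    accuracy = (tp + tn) / len(y_true)
    fraud_caught = tp / int(np.sum(y_true == 1))
    return accuracy, fraud_caught

for name, preds in [("always legitimate", always_legit), ("detector", detector)]:
    acc, caught = summarize(y_true, preds)
    print(f"{name:>17}: accuracy = {acc:.1%}, fraud caught = {caught:.0%}")
```

The strategy that never flags anything wins on accuracy by a tenth of a percentage point while catching no fraud at all.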
Let's Calculate Accuracy Properly
# =============================================================================
# ACCURACY: Step-by-Step Calculation
# =============================================================================

print("="*70)
print("CALCULATING ACCURACY: Step by Step")
print("="*70)

def calculate_accuracy(model, X, y, verbose=True):
    """
    Calculate accuracy of model on given data.

    Parameters:
        model: Trained model with predict() method
        X: Input data (n_samples, n_features)
        y: True labels (n_samples,)
        verbose: Whether to print details

    Returns:
        accuracy: Float between 0 and 1
        predictions: Array of predicted labels
    """
    predictions = []
    correct = 0

    for i in range(len(X)):
        pred = model.predict(X[i])
        predictions.append(pred)
        if pred == y[i]:
            correct += 1

    accuracy = correct / len(y)

    if verbose:
        print(f"\n  Total samples: {len(y)}")
        print(f"  Correct: {correct}")
        print(f"  Wrong: {len(y) - correct}")
        print(f"  Accuracy: {correct}/{len(y)} = {accuracy:.4f} = {accuracy*100:.1f}%")

    return accuracy, np.array(predictions)

# Calculate accuracy on TRAINING data
print("\n" + "-"*70)
print("TRAINING SET ACCURACY:")
print("-"*70)
train_accuracy, train_preds = calculate_accuracy(model, X_train, y_train)

# Calculate accuracy on TEST data (NEW!)
print("\n" + "-"*70)
print("TEST SET ACCURACY:")
print("-"*70)
test_accuracy, test_preds = calculate_accuracy(model, X_test, y_test)

print("\n" + "="*70)
print("KEY INSIGHT: Training vs Test Accuracy")
print("="*70)
print(f"""
Training accuracy: {train_accuracy*100:.1f}%
Test accuracy: {test_accuracy*100:.1f}%

The TEST accuracy is what really matters!
Training accuracy can be misleadingly high if the model "memorizes" the data.
Test accuracy shows how well the model generalizes to NEW data.
""")
6.3 The Confusion Matrix: A Detailed Report Card
Accuracy gives us one number. But what if we want to understand WHICH mistakes the model makes?
What IS a Confusion Matrix?
A confusion matrix is a table that breaks down all predictions into four categories based on two questions:
- What did we predict?
- What was the actual truth?
                 PREDICTED
               0         1
          ┌─────────┬─────────┐
        0 │   TN    │   FP    │
ACTUAL    ├─────────┼─────────┤
        1 │   FN    │   TP    │
          └─────────┴─────────┘
Why "Confusion" Matrix?
The name comes from the fact that it shows how the model gets "confused" - where it mixes up one class for another.
Understanding the Four Categories
| Abbrev | Full Name | Meaning | Our Example |
|---|---|---|---|
| TP | True Positive | Predicted 1, was actually 1 | Said "vertical", WAS vertical ✓ |
| TN | True Negative | Predicted 0, was actually 0 | Said "horizontal", WAS horizontal ✓ |
| FP | False Positive | Predicted 1, was actually 0 | Said "vertical", was horizontal ✗ |
| FN | False Negative | Predicted 0, was actually 1 | Said "horizontal", was vertical ✗ |
Memory Trick for TP/TN/FP/FN
Think of it as TWO questions:
- True/False: Was the prediction correct?
  - True = correct
  - False = wrong
- Positive/Negative: What did we predict?
  - Positive = predicted class 1 (vertical)
  - Negative = predicted class 0 (horizontal)
So:
- True Positive = We were True (correct) when we predicted Positive (vertical)
- False Positive = We were False (wrong) when we predicted Positive (vertical)
- True Negative = We were True (correct) when we predicted Negative (horizontal)
- False Negative = We were False (wrong) when we predicted Negative (horizontal)
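The two-question trick translates almost word for word into code. This is a small sketch; `outcome_name` is a hypothetical helper introduced only for illustration, and labels follow the convention used here (1 = vertical, 0 = horizontal).

```python
def outcome_name(actual, predicted):
    """Name the confusion-matrix cell for one (actual, predicted) pair of 0/1 labels."""
    # Question 1: was the prediction correct?
    correctness = "True" if actual == predicted else "False"
    # Question 2: what did we predict?
    polarity = "Positive" if predicted == 1 else "Negative"
    return f"{correctness} {polarity}"

# Illustrative pairs: (actual label, predicted label)
pairs = [(1, 1), (0, 0), (0, 1), (1, 0)]
for actual, predicted in pairs:
    print(f"actual={actual}, predicted={predicted} -> {outcome_name(actual, predicted)}")
```

Note that the second word depends only on the prediction, never on the actual label.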
Committee Analogy
"The confusion matrix is like a detailed performance review for our committee member:
- TP: Cases they correctly identified as vertical
- TN: Cases they correctly identified as NOT vertical
- FP: Cases they wrongly called vertical (a false alarm!)
- FN: Cases they missed (should have said vertical but didn't)"
Alternative Names You'll See
| Our Term | Also Called | When Used |
|---|---|---|
| False Positive | Type I Error | Statistics |
| False Negative | Type II Error | Statistics |
| True Positive Rate | Sensitivity, Recall | Medical |
| True Negative Rate | Specificity | Medical |
Real-World Examples of Each Error Type
Understanding these errors is easier with concrete examples:
| Outcome | Medical Example | Email Example | Self-Driving Car |
|---|---|---|---|
| TP | Correctly diagnose sick patient | Correctly mark spam | Correctly detect pedestrian |
| TN | Correctly clear healthy patient | Correctly allow good email | Correctly keep driving on a clear road |
| FP | Diagnose healthy as sick | Mark good email as spam | Brake for nothing (annoying) |
| FN | Miss a sick patient | Allow spam through | Miss a pedestrian (FATAL!) |
Notice: The consequences of FP vs FN are very different depending on the application!
- Medical: FN is worse (missed diagnosis can be fatal)
- Spam filter: FP is worse (losing important emails)
- Self-driving: FN is MUCH worse (hitting someone)
This is why we have precision and recall - to measure these separately.
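One way to make the asymmetry concrete is to attach a cost to each error type. The sketch below is only an illustration: the two models, their error counts, and the 10x cost ratios are invented, and `total_error_cost` is a hypothetical helper.

```python
def total_error_cost(fp, fn, cost_fp, cost_fn):
    """Weight each kind of error by how much it hurts in a given application."""
    return fp * cost_fp + fn * cost_fn

# Two hypothetical models with the same number of errors (so the same accuracy)
model_a = {"FP": 8, "FN": 2}   # errs toward false alarms
model_b = {"FP": 2, "FN": 8}   # errs toward misses

# Made-up cost ratios, chosen only to illustrate the asymmetry
applications = {
    "medical screening (a miss costs 10x)": {"cost_fp": 1, "cost_fn": 10},
    "spam filter (a false alarm costs 10x)": {"cost_fp": 10, "cost_fn": 1},
}

for name, costs in applications.items():
    cost_a = total_error_cost(model_a["FP"], model_a["FN"], **costs)
    cost_b = total_error_cost(model_b["FP"], model_b["FN"], **costs)
    better = "A" if cost_a < cost_b else "B"
    print(f"{name}: model A = {cost_a}, model B = {cost_b}, prefer {better}")
```

Accuracy cannot tell these two models apart, yet the preferred one flips depending on which error is more expensive.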
Let's Build a Confusion Matrix
# =============================================================================
# CONFUSION MATRIX: Implementation and Explanation
# =============================================================================

def confusion_matrix(y_true, y_pred):
    """
    Compute the confusion matrix.

    The logic behind each calculation:
    - TP: prediction=1 AND truth=1 (both conditions true)
    - TN: prediction=0 AND truth=0 (both conditions true)
    - FP: prediction=1 AND truth=0 (predicted positive, was negative)
    - FN: prediction=0 AND truth=1 (predicted negative, was positive)

    Parameters:
        y_true: Array of true labels (0 or 1)
        y_pred: Array of predicted labels (0 or 1)

    Returns:
        dict with TP, TN, FP, FN counts
    """
    # True Positive: We said 1, it was 1
    TP = np.sum((y_pred == 1) & (y_true == 1))

    # True Negative: We said 0, it was 0
    TN = np.sum((y_pred == 0) & (y_true == 0))

    # False Positive: We said 1, but it was 0 (false alarm!)
    FP = np.sum((y_pred == 1) & (y_true == 0))

    # False Negative: We said 0, but it was 1 (missed it!)
    FN = np.sum((y_pred == 0) & (y_true == 1))

    return {'TP': TP, 'TN': TN, 'FP': FP, 'FN': FN}

print("="*70)
print("CONFUSION MATRIX: Step by Step")
print("="*70)

# Calculate confusion matrix for test set
cm = confusion_matrix(y_test, test_preds)

print("\nFor our TEST set:")
print(f"  Total samples: {len(y_test)}")
print(f"  Vertical lines (label=1): {np.sum(y_test == 1)}")
print(f"  Horizontal lines (label=0): {np.sum(y_test == 0)}")

print("\n" + "-"*70)
print("CONFUSION MATRIX BREAKDOWN:")
print("-"*70)

print(f"""
                          PREDICTED
                 Horizontal(0)       Vertical(1)
               ┌─────────────────┬─────────────────┐
    Horiz.(0)  │    TN = {cm['TN']:3d}     │    FP = {cm['FP']:3d}     │
ACTUAL         ├─────────────────┼─────────────────┤
    Vert.(1)   │    FN = {cm['FN']:3d}     │    TP = {cm['TP']:3d}     │
               └─────────────────┴─────────────────┘
""")

print("Interpretation (reading the matrix):")
print(f"  ✓ True Positives (TP = {cm['TP']}): Correctly identified as VERTICAL")
print(f"  ✓ True Negatives (TN = {cm['TN']}): Correctly identified as HORIZONTAL")
print(f"  ✗ False Positives (FP = {cm['FP']}): Wrongly called VERTICAL (was horizontal)")
print(f"  ✗ False Negatives (FN = {cm['FN']}): Wrongly called HORIZONTAL (was vertical)")

# Verify: TP + TN + FP + FN should equal total samples
total = cm['TP'] + cm['TN'] + cm['FP'] + cm['FN']
print(f"\n  Verification: TP + TN + FP + FN = {total} (should equal {len(y_test)}) ✓")

# Show how accuracy relates to confusion matrix
print("\n" + "-"*70)
print("ACCURACY FROM CONFUSION MATRIX:")
print("-"*70)
print(f"""
    Accuracy = (TP + TN) / (TP + TN + FP + FN)
             = ({cm['TP']} + {cm['TN']}) / ({cm['TP']} + {cm['TN']} + {cm['FP']} + {cm['FN']})
             = {cm['TP'] + cm['TN']} / {total}
             = {(cm['TP'] + cm['TN']) / total:.4f}
             = {(cm['TP'] + cm['TN']) / total * 100:.1f}%
""")
# =============================================================================
# VISUALIZE THE CONFUSION MATRIX
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Confusion Matrix as heatmap
ax1 = axes[0]
cm_matrix = np.array([[cm['TN'], cm['FP']],
                      [cm['FN'], cm['TP']]])

im = ax1.imshow(cm_matrix, cmap='Blues')
ax1.set_xticks([0, 1])
ax1.set_yticks([0, 1])
ax1.set_xticklabels(['Horizontal (0)', 'Vertical (1)'])
ax1.set_yticklabels(['Horizontal (0)', 'Vertical (1)'])
ax1.set_xlabel('Predicted Label', fontsize=12)
ax1.set_ylabel('Actual Label', fontsize=12)
ax1.set_title('Confusion Matrix', fontsize=14, fontweight='bold')

# Add text annotations
labels = [['TN', 'FP'], ['FN', 'TP']]
for i in range(2):
    for j in range(2):
        text_color = 'white' if cm_matrix[i, j] > cm_matrix.max()/2 else 'black'
        ax1.text(j, i, f'{labels[i][j]}\n{cm_matrix[i, j]}',
                 ha='center', va='center', fontsize=14,
                 fontweight='bold', color=text_color)

plt.colorbar(im, ax=ax1)

# Plot 2: Visual explanation
ax2 = axes[1]
ax2.axis('off')

explanation_text = f"""
READING THE CONFUSION MATRIX
{'='*45}

The DIAGONAL (top-left to bottom-right) shows CORRECT predictions:
  • TN ({cm['TN']}): Horizontal predicted as Horizontal ✓
  • TP ({cm['TP']}): Vertical predicted as Vertical ✓

The OFF-DIAGONAL shows ERRORS:
  • FP ({cm['FP']}): Horizontal wrongly called Vertical ✗
  • FN ({cm['FN']}): Vertical wrongly called Horizontal ✗

A PERFECT model has:
  • All values on the diagonal
  • Zeros everywhere else
"""

ax2.text(0.1, 0.5, explanation_text, fontsize=11, family='monospace',
         verticalalignment='center', transform=ax2.transAxes,
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.show()
6.4 Precision, Recall, and F1 Score
The confusion matrix gives us four numbers. From these, we can calculate more specific metrics that answer different questions.
Precision: "When I Say Positive, Am I Right?"
Precision answers: "Of all the times I predicted 'positive' (vertical), how many were actually positive?"
$$\text{Precision} = \frac{TP}{TP + FP}$$
Breaking it down:
- Numerator (TP): Cases we correctly called positive
- Denominator (TP + FP): ALL cases we called positive (right or wrong)
High precision means: When we say "vertical", we're usually right. Few false alarms.
When to prioritize precision:
- Spam filters (don't delete legitimate emails!)
- Recommender systems (don't recommend things users hate!)
- Any case where false alarms are costly
Recall: "Did I Catch All the Positives?"
Recall (also called Sensitivity) answers: "Of all the actual positives, how many did I catch?"
$$\text{Recall} = \frac{TP}{TP + FN}$$
Breaking it down:
- Numerator (TP): Cases we correctly caught
- Denominator (TP + FN): ALL actual positives (caught or missed)
High recall means: We catch most of the actual vertical lines. Few misses.
When to prioritize recall:
- Disease detection (don't miss sick patients!)
- Fraud detection (don't miss fraudulent transactions!)
- Any case where missing positives is costly
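A worked example makes the two denominators easier to tell apart. The counts below are hypothetical, and `precision_recall` is an illustrative helper that reports 0.0 when a denominator is empty (one common convention, assumed here).

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); a ratio with an empty denominator is reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Hypothetical counts: the model said "vertical" 25 times and was right 20 times.
# There were 40 vertical lines in total, so 20 of them were missed.
precision, recall = precision_recall(tp=20, fp=5, fn=20)
print(f"Precision = 20 / (20 + 5)  = {precision:.0%}")
print(f"Recall    = 20 / (20 + 20) = {recall:.0%}")
```

The same 20 true positives look good from the precision side and mediocre from the recall side, because each metric divides by a different group.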
The Precision-Recall Trade-off
Here's the fundamental tension:
| Strategy | Precision | Recall | Problem |
|---|---|---|---|
| "Only say vertical when 100% sure" | HIGH (few false alarms) | LOW (miss many) | Miss too many positives |
| "Say vertical for anything remotely vertical" | LOW (many false alarms) | HIGH (catch most) | Too many false alarms |
You often can't maximize both! This is called the precision-recall trade-off.
Concrete Example: Airport Security
Imagine a security scanner detecting threats:
| Setting | Precision | Recall | Outcome |
|---|---|---|---|
| Super sensitive | 10% | 99% | Catches nearly all threats but 90% of "threats" are false alarms. Massive delays! |
| Super strict | 95% | 20% | Few false alarms but misses 80% of real threats. Dangerous! |
| Balanced | 70% | 70% | Some false alarms, catches most threats. Practical! |
Why the trade-off exists:
When we lower the threshold for saying "positive":
- We catch MORE true positives (recall goes UP ↑)
- But we also catch MORE false positives (precision goes DOWN ↓)
When we raise the threshold:
- We have FEWER false positives (precision goes UP ↑)
- But we miss MORE true positives (recall goes DOWN ↓)
There's no free lunch! The art is finding the right balance for your specific application.
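The effect of moving the threshold can be seen on a tiny hand-made example. The scores and labels below are invented for illustration (they are not outputs of the notebook's model), and `precision_recall_at` is a hypothetical helper.

```python
import numpy as np

# Made-up model outputs (probability of "vertical") with their true labels
scores = np.array([0.95, 0.85, 0.70, 0.60, 0.55, 0.45, 0.40, 0.30, 0.20, 0.05])
labels = np.array([1,    1,    1,    0,    1,    0,    1,    0,    0,    0])

def precision_recall_at(threshold):
    """Predict positive when score >= threshold, then compute (precision, recall)."""
    preds = (scores >= threshold).astype(int)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

for threshold in (0.25, 0.50, 0.75):
    p, r = precision_recall_at(threshold)
    print(f"threshold = {threshold:.2f}: precision = {p:.3f}, recall = {r:.3f}")
```

On this toy data, raising the threshold from 0.25 to 0.75 moves precision up at every step while recall moves down.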
F1 Score: Finding the Balance
The F1 Score is the harmonic mean of precision and recall - a single number that balances both:
$$F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
What IS a Harmonic Mean and Why Use It?
You might wonder: "Why not just use a regular average (arithmetic mean)?"
Three Types of Means:
| Mean Type | Formula | Example: (99%, 10%) |
|---|---|---|
| Arithmetic | (a + b) / 2 | (99 + 10) / 2 = 54.5% |
| Geometric | √(a × b) | √(99 × 10) = 31.5% |
| Harmonic | 2ab / (a + b) | 2×99×10 / (99+10) = 18.2% |
Why harmonic mean is better for F1:
The harmonic mean is punishing when values are imbalanced. If you have 99% precision but only 10% recall:
- Arithmetic mean says "54.5% - not bad!"
- Harmonic mean says "18.2% - this is terrible!"
The harmonic mean forces BOTH values to be reasonably high to get a good score.
Intuition: Think about speed. If you drive 60 mph for the first half of a trip's distance and 20 mph for the second half, your average speed isn't 40 mph. It's 30 mph (the harmonic mean), because you spend three times as long on the slow half. The slow part dominates.
Why this matters for ML:
A model that predicts "positive" for everything gets 100% recall, but its precision is only the fraction of positives in the data (about 1% in the fraud example). The harmonic mean correctly identifies this as a terrible model.
| Precision | Recall | F1 Score | Verdict |
|---|---|---|---|
| 90% | 90% | 90% | Great! Both balanced |
| 99% | 10% | 18% | Terrible! Very unbalanced |
| 50% | 50% | 50% | Mediocre |
F1 is high only when BOTH precision AND recall are reasonably high.
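The three means can be compared numerically with the standard library alone. This sketch uses the 99% precision / 10% recall pair from the table; the function names are illustrative.

```python
import math

def arithmetic_mean(a, b):
    return (a + b) / 2

def geometric_mean(a, b):
    return math.sqrt(a * b)

def harmonic_mean(a, b):
    """2ab / (a + b); reported as 0.0 when both inputs are 0."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

precision, recall = 0.99, 0.10
print(f"Arithmetic: {arithmetic_mean(precision, recall):.1%}")
print(f"Geometric:  {geometric_mean(precision, recall):.1%}")
print(f"Harmonic:   {harmonic_mean(precision, recall):.1%}  (this is F1)")

# The driving analogy: equal distances covered at 60 mph and at 20 mph
print(f"Average speed: {harmonic_mean(60, 20):.0f} mph")
```

Of the three, only the harmonic mean stays close to the weaker of the two values.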
# =============================================================================
# PRECISION, RECALL, F1: Calculation
# =============================================================================

def calculate_metrics(cm):
    """
    Calculate precision, recall, F1 from confusion matrix.

    Parameters:
        cm: dict with TP, TN, FP, FN

    Returns:
        dict with precision, recall, f1, accuracy
    """
    TP, TN, FP, FN = cm['TP'], cm['TN'], cm['FP'], cm['FN']

    # Precision: When we say positive, are we right?
    # Note: We add a check to avoid division by zero
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0

    # Recall: Did we catch all the positives?
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0

    # F1: Harmonic mean of precision and recall
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    # Accuracy (for comparison)
    accuracy = (TP + TN) / (TP + TN + FP + FN)

    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

print("="*70)
print("PRECISION, RECALL, AND F1 SCORE")
print("="*70)

metrics = calculate_metrics(cm)

print("\n" + "-"*70)
print("STEP-BY-STEP CALCULATION:")
print("-"*70)

print(f"""
From our confusion matrix:
    TP = {cm['TP']} (correctly identified vertical lines)
    TN = {cm['TN']} (correctly identified horizontal lines)
    FP = {cm['FP']} (horizontal lines wrongly called vertical)
    FN = {cm['FN']} (vertical lines wrongly called horizontal)

PRECISION: "When I say vertical, am I right?"
    Formula: Precision = TP / (TP + FP)
    Precision = {cm['TP']} / ({cm['TP']} + {cm['FP']})
              = {cm['TP']} / {cm['TP'] + cm['FP']}
              = {metrics['precision']:.4f}
              = {metrics['precision']*100:.1f}%

RECALL: "Did I catch all the vertical lines?"
    Formula: Recall = TP / (TP + FN)
    Recall = {cm['TP']} / ({cm['TP']} + {cm['FN']})
           = {cm['TP']} / {cm['TP'] + cm['FN']}
           = {metrics['recall']:.4f}
           = {metrics['recall']*100:.1f}%

F1 SCORE: "Balance of precision and recall"
    Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
    F1 = 2 × ({metrics['precision']:.4f} × {metrics['recall']:.4f}) / ({metrics['precision']:.4f} + {metrics['recall']:.4f})
       = 2 × {metrics['precision'] * metrics['recall']:.4f} / {metrics['precision'] + metrics['recall']:.4f}
       = {metrics['f1']:.4f}
       = {metrics['f1']*100:.1f}%

ACCURACY (for comparison):
    Formula: Accuracy = (TP + TN) / Total
    Accuracy = ({cm['TP']} + {cm['TN']}) / {cm['TP'] + cm['TN'] + cm['FP'] + cm['FN']}
             = {cm['TP'] + cm['TN']} / {cm['TP'] + cm['TN'] + cm['FP'] + cm['FN']}
             = {metrics['accuracy']:.4f}
             = {metrics['accuracy']*100:.1f}%
""")
# =============================================================================
# DEMONSTRATING THE PRECISION-RECALL TRADE-OFF
# =============================================================================

print("="*70)
print("THE PRECISION-RECALL TRADE-OFF: A Visual Demonstration")
print("="*70)

print("""
To understand the trade-off, let's see what happens when we change
our THRESHOLD for saying "vertical" (positive).

Currently we use: threshold = 0.5
  - If output >= 0.5 → predict "vertical"
  - If output < 0.5 → predict "horizontal"

But what if we change this threshold?
""")

# Try different thresholds
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
results = []

for threshold in thresholds:
    # Make predictions at this threshold
    preds = np.array([1 if model.forward(x) >= threshold else 0 for x in X_test])

    # Calculate confusion matrix
    TP = np.sum((preds == 1) & (y_test == 1))
    TN = np.sum((preds == 0) & (y_test == 0))
    FP = np.sum((preds == 1) & (y_test == 0))
    FN = np.sum((preds == 0) & (y_test == 1))

    # Calculate metrics
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    results.append({
        'threshold': threshold,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'TP': TP, 'FP': FP, 'FN': FN
    })

    print(f"Threshold = {threshold}:")
    print(f"  TP={TP:2d}, FP={FP:2d}, FN={FN:2d}")
    print(f"  Precision={precision:.1%}, Recall={recall:.1%}, F1={f1:.1%}")
    print()

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Precision vs Recall at different thresholds
ax = axes[0]
precisions = [r['precision'] for r in results]
recalls = [r['recall'] for r in results]

ax.plot(recalls, precisions, 'b-o', linewidth=2, markersize=10)
for r in results:
    ax.annotate(f"  t={r['threshold']}", (r['recall'], r['precision']), fontsize=9)

ax.set_xlabel('Recall', fontsize=12)
ax.set_ylabel('Precision', fontsize=12)
ax.set_title('Precision-Recall Trade-off\n(Each point is a different threshold)',
             fontsize=12, fontweight='bold')
ax.set_xlim(-0.05, 1.05)
ax.set_ylim(-0.05, 1.05)
ax.grid(True, alpha=0.3)

# Add ideal point
ax.scatter([1], [1], color='gold', s=200, marker='*', zorder=5, label='Ideal (1,1)')
ax.legend()

# Plot 2: Bar chart showing trade-off
ax = axes[1]
x = np.arange(len(thresholds))
width = 0.25

bars1 = ax.bar(x - width, precisions, width, label='Precision', color='#e74c3c')
bars2 = ax.bar(x, recalls, width, label='Recall', color='#27ae60')
bars3 = ax.bar(x + width, [r['f1'] for r in results], width, label='F1', color='#9b59b6')

ax.set_xlabel('Threshold', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Metrics at Different Thresholds', fontsize=12, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels([f'{t}' for t in thresholds])
ax.legend()
ax.set_ylim(0, 1.1)

plt.tight_layout()
plt.show()

print("""
KEY INSIGHT:
════════════════════════════════════════════════════════════════════════
  • LOW threshold (0.1): "Say vertical for almost everything!"
    → High recall (catch most verticals) but low precision (many false alarms)

  • HIGH threshold (0.9): "Only say vertical when VERY confident!"
    → High precision (rarely wrong when we say vertical) but low recall (miss many)

  • MIDDLE threshold (0.5): Balanced trade-off

Notice how the precision-recall curve shows the trade-off:
as one goes up, the other tends to go down. F1 score helps us find a good balance!
""")
# =============================================================================
# VISUALIZE ALL METRICS
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Bar chart of all metrics
ax1 = axes[0]
metric_names = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
metric_values = [metrics['accuracy'], metrics['precision'],
                 metrics['recall'], metrics['f1']]
colors = ['#3498db', '#e74c3c', '#27ae60', '#9b59b6']

bars = ax1.bar(metric_names, metric_values, color=colors, edgecolor='white', linewidth=2)
ax1.set_ylim(0, 1.1)
ax1.set_ylabel('Score', fontsize=12)
ax1.set_title('Model Performance Metrics', fontsize=14, fontweight='bold')
ax1.axhline(y=1.0, color='gray', linestyle='--', alpha=0.5, label='Perfect score')

# Add value labels above each bar
for bar, val in zip(bars, metric_values):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
             f'{val:.1%}', ha='center', va='bottom', fontsize=12, fontweight='bold')

# Plot 2: Which metric to use guide
ax2 = axes[1]
ax2.axis('off')

metrics_explanation = """
WHICH METRIC SHOULD YOU USE?
═══════════════════════════════════════════════════

ACCURACY
  • Best when: Classes are balanced (50/50)
  • Misleading when: Rare events (e.g., 1% fraud)

PRECISION
  • Best when: False alarms are COSTLY
  • Examples:
    - Spam filter (don't delete real email!)
    - Criminal conviction (don't jail innocent!)

RECALL
  • Best when: Missing positives is COSTLY
  • Examples:
    - Disease detection (don't miss sick patients!)
    - Fraud detection (don't miss fraud!)

F1 SCORE
  • Best when: You need balance between P & R
  • A common default when both error types matter

═══════════════════════════════════════════════════
For our V/H classifier, all metrics are similar
because our dataset is balanced and model works well!
"""

ax2.text(0.05, 0.5, metrics_explanation, fontsize=10, family='monospace',
         verticalalignment='center', transform=ax2.transAxes,
         bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))

plt.tight_layout()
plt.show()
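The guide's warning that accuracy misleads on rare events can be checked in a few lines. The sketch below is a hypothetical illustration, not part of the notebook's V/H dataset: it assumes 1,000 made-up transactions of which 1% are fraud, and a "model" that always answers "not fraud". The names `fraud_truth`, `always_negative`, and the `rare_*` variables are invented for this example.

```python
import numpy as np

# Hypothetical data: 1,000 transactions, 10 of them (1%) are fraud (label 1).
fraud_truth = np.zeros(1000, dtype=int)
fraud_truth[:10] = 1

# A useless "model" that always predicts the majority class (0 = not fraud).
always_negative = np.zeros_like(fraud_truth)

rare_accuracy = np.mean(always_negative == fraud_truth)

# Recall asks: of the real fraud cases, how many did we catch?
rare_tp = np.sum((always_negative == 1) & (fraud_truth == 1))
rare_fn = np.sum((always_negative == 0) & (fraud_truth == 1))
rare_recall = rare_tp / (rare_tp + rare_fn) if (rare_tp + rare_fn) > 0 else 0.0

print(f"Accuracy: {rare_accuracy:.1%}")
print(f"Recall:   {rare_recall:.1%}")
```

The accuracy looks excellent even though the model never catches a single fraud case, which is why recall (or precision, or F1) is the more informative number when one class is rare.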
6.5 The Committee Report: Saliency and Interpretability
We know our model works well, but WHY does it work? What has it actually learned?
What IS Interpretability?
Interpretability (also called Explainability) means understanding:
- What patterns did the model learn?
- Why does it make specific predictions?
- Is it using the "right" features?
| Question | How to Answer |
|---|---|
| What patterns did it learn? | Look at the weights |
| Why did it predict "vertical"? | Look at which inputs contributed most |
| Is it using the right features? | Visualize the saliency map |
What IS Saliency?
The word "saliency" comes from Latin salire meaning "to leap." In machine learning:
Saliency = Which parts of the input "leap out" as important to the model
For our Perceptron, saliency is beautifully simple:
$$\text{Saliency}_i = \lvert w_i \times x_i \rvert$$
Where:
- $w_i$ = weight for input $i$
- $x_i$ = value of input $i$
- $\lvert \cdot \rvert$ = absolute value (we care about magnitude, not sign)
Why Absolute Value?
| Weight × Input | Meaning | Contribution |
|---|---|---|
| +2.0 × 1.0 = +2.0 | Strongly SUPPORTS vertical | HIGH |
| -2.0 × 1.0 = -2.0 | Strongly OPPOSES vertical | HIGH |
| +0.1 × 1.0 = +0.1 | Weakly supports vertical | LOW |
Both +2.0 and -2.0 are strong contributions - just in opposite directions. The absolute value captures the strength of influence.
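As a quick check of the table above, this small sketch computes the signed contribution and the saliency for the same three weight/input pairs. The variable names (`example_weights`, `example_inputs`, and so on) are made up for illustration and are not part of the notebook's model.

```python
import numpy as np

# The three (weight, input) pairs from the table above.
example_weights = np.array([2.0, -2.0, 0.1])
example_inputs = np.array([1.0, 1.0, 1.0])

signed_contrib = example_weights * example_inputs   # direction of the push
example_saliency = np.abs(signed_contrib)           # strength of the push

for c, s in zip(signed_contrib, example_saliency):
    direction = "supports" if c > 0 else "opposes"
    print(f"contribution {c:+.1f} -> {direction} vertical, saliency {s:.1f}")
```

The first two rows end up with the same saliency even though they push the decision in opposite directions, which is exactly what taking the absolute value is meant to do.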
Committee Analogy
"We ask the committee: 'Show us your reasoning. Highlight the evidence that most influenced your decision.' They produce a report where the most influential pieces of evidence glow brightly. This is the saliency map - a visual explanation of the committee's thought process."
Why Interpretability Matters
| Reason | Example |
|---|---|
| Trust | Can we trust this medical diagnosis? |
| Debugging | Why is the model getting this wrong? |
| Discovery | What features actually matter? |
| Fairness | Is it unfairly using race or gender? |
| Legal | GDPR is often read as granting a "right to explanation" for automated decisions |
The Math Behind Saliency
For our Perceptron, let's trace WHY $\lvert w_i \times x_i \rvert$ measures importance:
Step 1: The Neuron's Decision
$$z = w_1 x_1 + w_2 x_2 + \dots + w_9 x_9 + b$$
Each term $w_i x_i$ is that pixel's contribution to the final sum $z$.
Step 2: How Much Did Each Pixel Contribute?
| Pixel | Weight ($w_i$) | Input ($x_i$) | Contribution ($w_i \times x_i$) |
|---|---|---|---|
| 0 | 0.5 | 0 | 0.5 × 0 = 0 (no contribution) |
| 1 | 1.2 | 1 | 1.2 × 1 = 1.2 (strong positive) |
| 4 | -0.8 | 1 | -0.8 × 1 = -0.8 (strong negative) |
Step 3: Why Absolute Value?
Both +1.2 and -0.8 are strong influences on the decision - they just push in opposite directions. The absolute value captures strength of influence regardless of direction.
$$\text{Saliency}_i = \lvert w_i \times x_i \rvert$$
Interpretation:
- High saliency = This pixel strongly influenced the decision (positively OR negatively)
- Low saliency = This pixel didn't matter much for this prediction
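The three steps above can be traced end to end on one image. This is a minimal sketch under stated assumptions: the weights for pixels 0, 1, and 4 are taken from the Step 2 table, while the other six weights and the bias are invented purely for illustration, and the input is a vertical line in the middle column. It is not the notebook's trained model.

```python
import numpy as np

# Weights: pixels 0, 1, 4 match the Step 2 table; the rest are ASSUMED values.
demo_w = np.array([0.5, 1.2, -0.3,
                   -0.6, -0.8, -0.4,
                   0.2, 0.9, -0.1])
demo_b = 0.1  # ASSUMED bias

# Input: a vertical line in the middle column (pixels 1, 4, 7 are on).
demo_x = np.array([0, 1, 0,
                   0, 1, 0,
                   0, 1, 0])

demo_contrib = demo_w * demo_x           # Step 2: per-pixel contribution w_i * x_i
demo_z = demo_contrib.sum() + demo_b     # Step 1: z is the sum of contributions plus b
demo_saliency = np.abs(demo_contrib)     # Step 3: magnitude only

print("Contributions:\n", demo_contrib.reshape(3, 3))
print("Saliency:\n", demo_saliency.reshape(3, 3))
print(f"z = {demo_z:.2f}")
print("Most salient pixel:", int(np.argmax(demo_saliency)))
```

Pixels that are off (input 0) get zero saliency no matter how large their weight is, so a saliency map always describes one specific input, whereas the weights describe the model as a whole.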
Looking at What Our Model Learned
# =============================================================================
# SALIENCY: What Did the Model Learn?
# =============================================================================

print("="*70)
print("THE COMMITTEE REPORT: What Did the Model Learn?")
print("="*70)

# First, let's look at the learned weights
print("\n" + "-"*70)
print("STEP 1: Examine the Learned Weights")
print("-"*70)

weights_grid = model.weights.reshape(3, 3)
print("""
Remember our pixel positions:

    Position Index:        Image Layout:
    [0] [1] [2]            [row 0]
    [3] [4] [5]     →      [row 1]
    [6] [7] [8]            [row 2]

Our model's learned weights (as 3x3 grid):
""")
for i, row in enumerate(weights_grid):
    print(f"    Row {i}: [{row[0]:6.3f}, {row[1]:6.3f}, {row[2]:6.3f}]")

print(f"\n    Bias: {model.bias:.4f}")

# Interpret the weights
print("\n" + "-"*70)
print("STEP 2: Interpret What the Weights Mean")
print("-"*70)

print("""
HOW TO READ WEIGHTS:
  • Positive weight  → This pixel being bright INCREASES "vertical" confidence
  • Negative weight  → This pixel being bright DECREASES "vertical" confidence
  • Near-zero weight → This pixel doesn't matter much
""")

# Find which positions have highest/lowest weights
flat_weights = model.weights
max_idx = np.argmax(flat_weights)
min_idx = np.argmin(flat_weights)

print(f"""
KEY OBSERVATIONS:

  Maximum weight: position {max_idx} (row {max_idx//3}, col {max_idx%3}) = {flat_weights[max_idx]:.3f}
  → If this pixel is bright, model is MORE confident it's vertical

  Minimum weight: position {min_idx} (row {min_idx//3}, col {min_idx%3}) = {flat_weights[min_idx]:.3f}
  → If this pixel is bright, model is LESS confident it's vertical

  Positions with HIGH positive weights: {np.where(flat_weights > 0.3)[0].tolist()}
  → These pixels SUPPORT "vertical" classification

  Positions with HIGH negative weights: {np.where(flat_weights < -0.3)[0].tolist()}
  → These pixels OPPOSE "vertical" classification
""")
# =============================================================================
# VISUALIZE: Weights and Saliency Maps - THE "AHA!" MOMENT
# =============================================================================

def compute_saliency(model, x):
    """
    Compute saliency map for an input.

    Saliency = |weight × input|

    This tells us: "How much did each input pixel contribute
    to the final decision?"

    Parameters:
        model: Trained model with weights
        x: Input image (flattened)

    Returns:
        saliency: Array of contribution magnitudes
    """
    x = np.array(x).flatten()
    # Multiply each input by its weight, take absolute value
    return np.abs(model.weights * x)

fig, axes = plt.subplots(2, 4, figsize=(16, 8))

# =================
# Top row: Vertical line analysis
# =================

# 1. Input image
ax = axes[0, 0]
ax.imshow(vertical_line, cmap='Blues', vmin=0, vmax=1)
ax.set_title('INPUT:\nVertical Line', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        ax.text(j, i, f'{vertical_line[i,j]:.0f}', ha='center', va='center', fontsize=12)
ax.axis('off')

# 2. Model weights
ax = axes[0, 1]
im = ax.imshow(weights_grid, cmap='RdBu', vmin=-2, vmax=2)
ax.set_title('WEIGHTS:\nLearned by Model', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        color = 'white' if abs(weights_grid[i,j]) > 1 else 'black'
        ax.text(j, i, f'{weights_grid[i,j]:.2f}', ha='center', va='center',
                fontsize=9, color=color)
ax.axis('off')

# 3. Saliency map
ax = axes[0, 2]
saliency_v = compute_saliency(model, vertical_flat).reshape(3, 3)
im = ax.imshow(saliency_v, cmap='hot', vmin=0)
ax.set_title('SALIENCY MAP:\n|Weight × Input|', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        color = 'white' if saliency_v[i,j] > saliency_v.max()/2 else 'black'
        ax.text(j, i, f'{saliency_v[i,j]:.2f}', ha='center', va='center',
                fontsize=9, color=color)
ax.axis('off')

# 4. Prediction result
ax = axes[0, 3]
ax.axis('off')
pred_v = model.forward(vertical_flat)
# Report correctness from the actual prediction instead of hard-coding it
result_text = f"""PREDICTION

Raw output: {pred_v:.4f}
Confidence: {pred_v*100:.1f}%

Decision: {"VERTICAL" if pred_v >= 0.5 else "HORIZONTAL"}

{"Correct! ✓" if pred_v >= 0.5 else "Incorrect ✗"}"""
ax.text(0.5, 0.5, result_text, fontsize=11, fontweight='bold',
        ha='center', va='center', transform=ax.transAxes,
        bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.8))

# =================
# Bottom row: Horizontal line analysis
# =================

# 1. Input image
ax = axes[1, 0]
ax.imshow(horizontal_line, cmap='Blues', vmin=0, vmax=1)
ax.set_title('INPUT:\nHorizontal Line', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        ax.text(j, i, f'{horizontal_line[i,j]:.0f}', ha='center', va='center', fontsize=12)
ax.axis('off')

# 2. Model weights (same)
ax = axes[1, 1]
im = ax.imshow(weights_grid, cmap='RdBu', vmin=-2, vmax=2)
ax.set_title('WEIGHTS:\n(Same model)', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        color = 'white' if abs(weights_grid[i,j]) > 1 else 'black'
        ax.text(j, i, f'{weights_grid[i,j]:.2f}', ha='center', va='center',
                fontsize=9, color=color)
ax.axis('off')

# 3. Saliency map
ax = axes[1, 2]
saliency_h = compute_saliency(model, horizontal_flat).reshape(3, 3)
im = ax.imshow(saliency_h, cmap='hot', vmin=0)
ax.set_title('SALIENCY MAP:\n|Weight × Input|', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        color = 'white' if saliency_h[i,j] > saliency_h.max()/2 else 'black'
        ax.text(j, i, f'{saliency_h[i,j]:.2f}', ha='center', va='center',
                fontsize=9, color=color)
ax.axis('off')

# 4. Prediction result
ax = axes[1, 3]
ax.axis('off')
pred_h = model.forward(horizontal_flat)
# Report correctness from the actual prediction instead of hard-coding it
result_text = f"""PREDICTION

Raw output: {pred_h:.4f}
Confidence: {(1-pred_h)*100:.1f}% horizontal

Decision: {"VERTICAL" if pred_h >= 0.5 else "HORIZONTAL"}

{"Correct! ✓" if pred_h < 0.5 else "Incorrect ✗"}"""
ax.text(0.5, 0.5, result_text, fontsize=11, fontweight='bold',
        ha='center', va='center', transform=ax.transAxes,
        bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.8))

plt.suptitle('THE COMMITTEE REPORT: How the Model Makes Decisions',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
# =============================================================================
# THE "AHA!" MOMENT: Understanding What the Model Learned
# =============================================================================

print("="*70)
print("THE KEY INSIGHT: What Did the Model ACTUALLY Learn?")
print("="*70)

print("""
Looking at the visualizations above, we can see something beautiful:

FOR VERTICAL LINES:
  • The middle column (positions 1, 4, 7) has POSITIVE weights
  • When bright pixels appear in the middle column, the model says "VERTICAL!"
  • The saliency map lights up exactly where the vertical line is

FOR HORIZONTAL LINES:
  • The middle row (positions 3, 4, 5) has NEGATIVE or low weights for the sides
  • When bright pixels appear across a row, they don't activate the "vertical" detector
  • The output is LOW, meaning "not vertical" = "horizontal"

THE MODEL LEARNED THE RIGHT PATTERN!
═══════════════════════════════════════════════════════════════════════

Our model didn't just memorize examples. It learned a GENERAL RULE:

  "Vertical lines have bright pixels stacked in a column.
   Horizontal lines have bright pixels spread across a row."

This is exactly what we hoped it would learn!

═══════════════════════════════════════════════════════════════════════
""")

# Show the pattern it learned
print("\nVisualized Pattern Recognition:")
print("-"*50)
print("""
    VERTICAL LINE:          MODEL LOOKS AT:
    [ ] [●] [ ]             [ ] [HIGH] [ ]
    [ ] [●] [ ]      →      [ ] [HIGH] [ ]
    [ ] [●] [ ]             [ ] [HIGH] [ ]
                            (Middle column weights are positive)

    HORIZONTAL LINE:        MODEL LOOKS AT:
    [ ] [ ] [ ]             [ ] [ ] [ ]
    [●] [●] [●]      →      [LOW] [LOW] [LOW]
    [ ] [ ] [ ]             [ ] [ ] [ ]
                            (Row weights don't support "vertical")
""")
6.6 Train/Test Split: Why We Need Separate Data
Throughout this notebook, we've used separate training and test data. This is crucial for honest evaluation.
The Problem: Memorization vs Learning
A model could achieve 100% accuracy on training data by simply memorizing every example - like a student who memorizes test answers instead of understanding concepts.
But memorization isn't useful - we need the model to generalize to NEW data it has never seen.
| Approach | Training Accuracy | Test Accuracy | What Happened? |
|---|---|---|---|
| True learning | 95% | 93% | Learned the general pattern |
| Memorization | 100% | 50% | Memorized training, fails on new |
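The "memorization" row can be reproduced with a deliberately extreme sketch. Everything here is hypothetical and separate from the notebook's Perceptron: the data are random inputs with coin-flip labels (so there is no pattern to learn), and `memorizer_predict` is a made-up lookup-table "model" that simply stores every training example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data with NO learnable pattern: random inputs, coin-flip labels.
X_seen = rng.random((100, 9))
y_seen = rng.integers(0, 2, size=100)
X_new = rng.random((1000, 9))
y_new = rng.integers(0, 2, size=1000)

# A "model" that memorizes: a lookup table from exact input bytes to label.
memory = {x.tobytes(): int(y) for x, y in zip(X_seen, y_seen)}

def memorizer_predict(x, default=0):
    """Return the stored label if x was seen in training, else a fixed guess."""
    return memory.get(x.tobytes(), default)

memo_train_acc = np.mean([memorizer_predict(x) == y for x, y in zip(X_seen, y_seen)])
memo_test_acc = np.mean([memorizer_predict(x) == y for x, y in zip(X_new, y_new)])

print(f"Training accuracy: {memo_train_acc:.1%}")
print(f"Test accuracy:     {memo_test_acc:.1%}")
```

The lookup table is perfect on the examples it stored and no better than guessing on anything new. Training accuracy alone cannot tell this "model" apart from one that truly learned something.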
What IS a Train/Test Split?
We divide our data into two groups:
ALL DATA (150 samples)
│
├── TRAINING SET (100 samples) ──→ Used to TRAIN the model
│                                   Model sees these during learning
│
└── TEST SET (50 samples) ───────→ Used to EVALUATE the model
                                    Model NEVER sees these during training
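A split like the one in the diagram can be sketched with a shuffled index array. This is a minimal illustration, not necessarily how the notebook built its own split; the helper name `split_indices` and the seed are assumptions made for the example.

```python
import numpy as np

def split_indices(n_samples, n_train, seed=0):
    """Shuffle the indices 0..n_samples-1, then cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    return shuffled[:n_train], shuffled[n_train:]

# 150 samples in total: 100 for training, 50 held out for testing.
train_idx, test_idx = split_indices(n_samples=150, n_train=100)

# Sanity check: no sample may appear in both sets.
shared = set(train_idx.tolist()) & set(test_idx.tolist())

print(f"Train: {len(train_idx)}, Test: {len(test_idx)}, Overlap: {len(shared)}")
```

Given data arrays `X` and `y`, the two sets would be `X[train_idx], y[train_idx]` and `X[test_idx], y[test_idx]`. Shuffling before cutting matters when the data is stored in class order, since otherwise the test set could contain only one class.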
Why This Works
| Data Set | Model Sees During Training? | Purpose |
|---|---|---|
| Training | YES | Learn patterns |
| Test | NO | Evaluate generalization |
The test set acts as a "final exam" - questions the model has never seen.
Committee Analogy
"It's like preparing for an exam:
- Training data = study materials (examples you practice with)
- Test data = the actual exam (new questions you've never seen)
If you just memorize your notes without understanding, you'll ace the practice problems but fail the exam. If you truly learned the concepts, you'll do well on both."
The Golden Rule
NEVER use test data for training!
If the model sees test data during training, it can memorize those examples too, and our evaluation becomes meaningless.
Common Split Ratios
| Split | Training | Test | When to Use |
|---|---|---|---|
| 80/20 | 80% | 20% | Large datasets (>10,000 samples) |
| 70/30 | 70% | 30% | Medium datasets (1,000-10,000) |
| 60/40 | 60% | 40% | Small datasets (<1,000) |
More test data = more reliable evaluation, but less training data.
Understanding Overfitting Mathematically
What IS Overfitting?
Overfitting is when a model learns the noise in the training data, not just the signal.
Analogy: Imagine studying for an exam by memorizing the exact wording of practice questions instead of understanding the concepts. You'd ace those exact questions but fail on new ones.
How Train/Test Split Reveals Overfitting:
| Scenario | Training Accuracy | Test Accuracy | What's Happening |
|---|---|---|---|
| Good learning | 95% | 93% | Learned the pattern! |
| Mild overfitting | 99% | 85% | Some memorization |
| Severe overfitting | 100% | 50% | Memorized everything, learned nothing |
The Math:
- If a model memorizes all 100 training examples, it can get 100% training accuracy
- But those memorized patterns don't apply to new data
- Test accuracy reveals true generalization
The Gap:
$$\text{Overfitting Gap} = \text{Training Accuracy} - \text{Test Accuracy}$$
- Gap < 5%: Great! Model generalizes well
- Gap 5-15%: Some overfitting, might need more data or simpler model
- Gap > 15%: Serious overfitting, model is memorizing
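The gap rule above is easy to turn into a small helper. The sketch below applies it to the three scenarios from the table in this section; the function names `overfitting_gap` and `describe_gap` are invented for illustration, and the 5% and 15% cut-offs are the rules of thumb stated above rather than universal constants.

```python
def overfitting_gap(train_acc, test_acc):
    """Gap = training accuracy minus test accuracy (both as fractions)."""
    return train_acc - test_acc

def describe_gap(gap):
    """Map a gap to the rule-of-thumb categories from the text."""
    if gap < 0.05:
        return "generalizes well"
    if gap <= 0.15:
        return "some overfitting"
    return "serious overfitting"

# The three scenarios from the table above: (training accuracy, test accuracy).
scenarios = {
    "Good learning": (0.95, 0.93),
    "Mild overfitting": (0.99, 0.85),
    "Severe overfitting": (1.00, 0.50),
}

for name, (train_acc, test_acc) in scenarios.items():
    gap = overfitting_gap(train_acc, test_acc)
    print(f"{name:20s} gap = {gap:.0%} -> {describe_gap(gap)}")
```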
Why These Specific Ratios?
| | More Training Data | More Test Data |
|---|---|---|
| Benefit | Model can learn more | More reliable evaluation |
| Benefit | Better final accuracy | Smaller margin of error |
| Cost | Less reliable evaluation | Less data left for the model to learn from |
The sweet spot: Enough training data to learn well, enough test data to evaluate reliably. With 100 samples, 80/20 gives 80 for training (decent) and 20 for testing (acceptable). With 10,000 samples, even 90/10 gives 1,000 test samples (very reliable).
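To put rough numbers on "acceptable" versus "very reliable", one can estimate the margin of error of a measured accuracy using the binomial standard error, sqrt(p(1 - p) / n). This is a sketch under stated assumptions: the test samples are independent, the measured accuracy of 90% is a made-up value, and the ±1.96 standard errors interval is the usual normal approximation, which is crude for very small test sets.

```python
import math

def accuracy_standard_error(acc, n_test):
    """Binomial standard error of an accuracy measured on n_test samples."""
    return math.sqrt(acc * (1 - acc) / n_test)

assumed_acc = 0.90  # hypothetical measured test accuracy

for n_test in (20, 1000):
    se = accuracy_standard_error(assumed_acc, n_test)
    margin = 1.96 * se  # approximate 95% interval half-width
    print(f"n_test = {n_test:4d}: {assumed_acc:.0%} ± {margin:.1%}")
```

With 20 test samples the margin is more than ten percentage points, so two models that differ by a few points cannot be told apart. With 1,000 samples the margin shrinks to about two points.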
# =============================================================================
# TRAIN/TEST SPLIT: Our Results
# =============================================================================

print("="*70)
print("TRAIN/TEST SPLIT: Checking for Generalization")
print("="*70)

print(f"""
OUR DATA SPLIT:
  • Training set: {len(X_train)} samples (used for learning)
  • Test set:     {len(X_test)} samples (used for evaluation only)
  • Split ratio:  {len(X_train)}/{len(X_train)+len(X_test)} = {len(X_train)/(len(X_train)+len(X_test))*100:.0f}% training

RESULTS:
  • Training accuracy: {train_accuracy:.1%}
  • Test accuracy:     {test_accuracy:.1%}
  • Difference:        {abs(train_accuracy - test_accuracy):.1%}
""")

# Interpret the gap
diff = train_accuracy - test_accuracy

print("-"*70)
print("INTERPRETATION:")
print("-"*70)

if diff < 0.05:
    print("""
  ✓ EXCELLENT! Training and test accuracy are very similar.

  This suggests the model has LEARNED the general pattern,
  not just memorized the training data.

  Our model generalizes well to new data!
""")
elif diff < 0.15:
    print(f"""
  ⚠ CAUTION: Training accuracy is {diff:.1%} higher than test accuracy.

  Some memorization may have occurred. The model might be
  slightly "overfitting" to training data.
""")
else:
    print(f"""
  ⚠ WARNING: Training accuracy is {diff:.1%} higher than test accuracy!

  This suggests OVERFITTING - the model memorized training data
  but doesn't generalize well to new data.

  Possible solutions:
    - Get more training data
    - Use regularization
    - Simplify the model
""")
Part 6 Summary: What We've Learned
Key Concepts Mastered
| Concept | Definition/Formula | Why It Matters |
|---|---|---|
| Training vs Inference | Learning mode vs using mode | Different behaviors, same weights |
| Accuracy | (TP + TN) / Total | Simple overall view (but can mislead) |
| Confusion Matrix | TP, TN, FP, FN breakdown | Shows WHAT mistakes are made |
| Precision | TP / (TP + FP) | "When I say yes, am I right?" |
| Recall | TP / (TP + FN) | "Did I catch all the positives?" |
| F1 Score | 2 × (P × R) / (P + R) | Balance precision and recall |
| Saliency | \|weight × input\| | What did the model look at? |
| Train/Test Split | Separate data for evaluation | Detect memorization vs learning |
The Four Categories Explained
| Category | Model Said | Truth Was | Meaning |
|---|---|---|---|
| TP (True Positive) | Vertical | Vertical | Correct detection |
| TN (True Negative) | Horizontal | Horizontal | Correct rejection |
| FP (False Positive) | Vertical | Horizontal | False alarm |
| FN (False Negative) | Horizontal | Vertical | Missed detection |
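The four categories and the formulas from the summary table fit together in a few lines. The sketch below uses a small made-up set of eight labels (1 = vertical, 0 = horizontal); the helper names `confusion_counts` and `summary_metrics` are hypothetical and not taken from the notebook.

```python
import numpy as np

def confusion_counts(truth, preds):
    """Count TP, TN, FP, FN for binary labels (1 = vertical, 0 = horizontal)."""
    truth = np.asarray(truth)
    preds = np.asarray(preds)
    tp = int(np.sum((preds == 1) & (truth == 1)))  # said vertical, was vertical
    tn = int(np.sum((preds == 0) & (truth == 0)))  # said horizontal, was horizontal
    fp = int(np.sum((preds == 1) & (truth == 0)))  # false alarm
    fn = int(np.sum((preds == 0) & (truth == 1)))  # missed detection
    return tp, tn, fp, fn

def summary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1, guarding against division by zero."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total > 0 else 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up example: 8 images, the model makes one false alarm and one miss.
toy_truth = [1, 1, 1, 0, 0, 0, 1, 0]
toy_preds = [1, 1, 0, 0, 0, 1, 1, 0]

toy_counts = confusion_counts(toy_truth, toy_preds)
toy_metrics = summary_metrics(*toy_counts)

print("TP, TN, FP, FN =", toy_counts)
for metric_name, value in toy_metrics.items():
    print(f"{metric_name:10s} {value:.1%}")
```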
Committee Analogy Progress
| Part | What Happened |
|------|--------------|
| Parts 1-3 | Committee member learned procedures |
| Part 4 | First case - confused, random guessing |
| Part 5 | Learned from feedback, became expert |
| Part 6 | Performance review: verified expertise and understood reasoning |
| Part 7 | (Next) One expert isn't enough - building the full committee |
The Big Picture
We now have a complete, evaluated model that:
- Achieves high accuracy on both training and test data
- Makes few mistakes (low FP and FN)
- Has interpretable learned weights
- Uses the RIGHT features (column patterns for vertical detection)
- Generalizes well to new data
Knowledge Check
# =============================================================================
# KNOWLEDGE CHECK - Part 6
# =============================================================================

print("KNOWLEDGE CHECK - Part 6: Evaluation")
print("="*60)
print("\nAnswer these questions to test your understanding:\n")

questions = [
    {
        "q": "1. What's the difference between training and inference mode?",
        "options": [
            "A) Training is faster than inference",
            "B) In training, weights update; in inference, weights are frozen",
            "C) Inference uses more data than training",
            "D) They're the same thing with different names"
        ],
        "answer": "B",
        "explanation": "During training, the model learns and weights change after each example. During inference, weights are frozen and we just make predictions - no learning happens."
    },
    {
        "q": "2. A model predicts 'sick' for a healthy patient. What type of error is this?",
        "options": [
            "A) True Positive (TP)",
            "B) True Negative (TN)",
            "C) False Positive (FP)",
            "D) False Negative (FN)"
        ],
        "answer": "C",
        "explanation": "False Positive: We predicted Positive (sick), but we were False (wrong) - the patient was actually healthy. This is a 'false alarm'."
    },
    {
        "q": "3. You're building a disease detection system. Missing a sick patient is VERY bad.\n   Which metric should you prioritize?",
        "options": [
            "A) Accuracy",
            "B) Precision",
            "C) Recall",
            "D) F1 Score"
        ],
        "answer": "C",
        "explanation": "Recall measures 'did we catch all the positives?' High recall means we catch most sick patients, even if we have some false alarms. When missing positives is costly, prioritize recall."
    },
    {
        "q": "4. Why do we use a separate test set?",
        "options": [
            "A) To have more data for training",
            "B) To make training faster",
            "C) To check if the model memorized vs truly learned",
            "D) It's optional and not really needed"
        ],
        "answer": "C",
        "explanation": "A model could memorize training data and fail on new data. The test set (unseen data) reveals if it truly learned the general pattern or just memorized examples."
    },
    {
        "q": "5. What does a saliency map show?",
        "options": [
            "A) The accuracy of the model over time",
            "B) Which inputs the model focused on for its decision",
            "C) The training loss curve",
            "D) How fast the model runs"
        ],
        "answer": "B",
        "explanation": "Saliency maps highlight which parts of the input were most important for the model's decision. It's a form of interpretability - understanding WHY the model made its prediction."
    }
]

for q in questions:
    print(q["q"])
    for opt in q["options"]:
        print(f"   {opt}")
    print()

print("\n" + "="*60)
print("Scroll down for answers...")
print("="*60)
# =============================================================================
# ANSWERS - Knowledge Check Part 6
# =============================================================================

print("ANSWERS - Part 6 Knowledge Check")
print("="*60)

for i, q in enumerate(questions, 1):
    print(f"\n{i}. Answer: {q['answer']}")
    print(f"   {q['explanation']}")

print("\n" + "="*60)
print("How did you do?")
print("  5/5: Evaluation Master! Ready for Part 7!")
print("  4/5: Solid understanding - great job!")
print("  3/5: Review the sections you missed")
print("  <3:  Re-read Part 6 before continuing")
print("="*60)
What's Next?
Congratulations! You've completed Part 6!
Our single neuron is now a verified expert - we've evaluated its performance, understood its decision-making process, and confirmed it learned the RIGHT patterns.
But Here's the Thing...
A single neuron (Perceptron) can only learn linear patterns - patterns whose classes can be separated by a straight line (or, with more than two inputs, by a flat boundary called a hyperplane). For more complex problems, one expert isn't enough.
The Limitation of Single Neurons
Some problems are not linearly separable. The classic example is the XOR problem:
| Input A | Input B | Output (XOR) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
No single neuron can learn this pattern! We need multiple neurons working together.
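A brute-force search makes this limitation concrete. The sketch below assumes a single neuron with two weights, a bias, and a step activation (fire when the weighted sum is above 0), and tries every combination on an arbitrary grid of values from -2 to 2. The grid and the variable names are choices made for this illustration. A grid search is not a proof, but it shows the pattern clearly.

```python
import numpy as np

xor_inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor_targets = np.array([0, 1, 1, 0])

# Candidate values for w1, w2, and the bias (an arbitrary grid for illustration).
grid = np.linspace(-2.0, 2.0, 41)
W1, W2, B = np.meshgrid(grid, grid, grid, indexing="ij")

# For every (w1, w2, b) combination at once, count how many XOR rows it gets right.
n_correct = np.zeros(W1.shape, dtype=int)
for (in_a, in_b), target in zip(xor_inputs, xor_targets):
    fires = (W1 * in_a + W2 * in_b + B) > 0          # step activation
    n_correct += (fires.astype(int) == target).astype(int)

best_xor_score = int(n_correct.max())
print(f"Combinations tried: {n_correct.size}")
print(f"Best single neuron on this grid: {best_xor_score}/4 XOR rows correct")
```

No combination gets all four rows right, and a finer grid would not help. Getting row (0,0) right requires b ≤ 0, and getting rows (0,1) and (1,0) right requires w2 + b > 0 and w1 + b > 0. Adding the last two gives w1 + w2 + 2b > 0, so w1 + w2 + b > -b ≥ 0, which means the neuron must also fire on (1,1), where the target is 0.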
Coming Up in Part 7: Hidden Layers - The Full Committee
In the next notebook, we'll explore:
- Why one neuron isn't enough - The XOR problem demonstration
- Hidden layers - Adding more neurons between input and output
- The full committee - Multiple experts with different perspectives
- Universal approximation - Why deep networks can learn (almost) anything
Continue to Part 7: part_7_hidden_layers.ipynb
"One expert is good. A committee of experts is powerful."
The Brain's Decision Committee - From Expert to Team