AgenticWorks

A community for developers awakening to agentic AI. Hands-on lessons, enterprise-grade context engineering, and a forum that earns its quiet.


© 2026 AgenticWorks. Built in public.


Track 1 · ML foundations

Brain's Decision Committee
  1. The first neuron
  2. A single neuron
  3. Activation functions
  4. The perceptron
  5. Training
  6. Evaluation
  7. Hidden layers
  8. Deep learning challenges
  9. Full implementation
  10. What's next
MLP · Part 7 · 55 min · intermediate

The full committee

Move beyond one neuron with hidden layers, XOR, and a multi-layer perceptron.


Neural Network Fundamentals

Part 7: Hidden Layers - The Full Committee

The Brain's Decision Committee - Chapter 7


The Story So Far...

In Parts 1-6, we built and trained a single neuron (Perceptron) that became an expert at detecting vertical vs horizontal lines. We evaluated its performance, understood its decision-making through saliency maps, and confirmed it learned the right patterns.

But our expert has a limitation.

A single neuron can only draw ONE straight line to separate categories. Some problems require more complex boundaries - curves, multiple regions, or intricate patterns.

"Our single committee member has done well, but some problems are too complex for one person. It's time to assemble a full committee with specialists."


What You'll Learn in Part 7

By the end of this notebook, you will understand:

  1. Why single neurons fail - The famous XOR problem AND challenging V/H variations
  2. What hidden layers are - Adding neurons between input and output
  3. How hidden neurons specialize - Different neurons detect different features
  4. The Multi-Layer Perceptron (MLP) - A complete neural network architecture
  5. Forward propagation - How data flows through multiple layers
  6. Backpropagation through layers - Training with chain rule
  7. Universal approximation - Why deep networks can learn (almost) anything

Two Complementary Examples

In this notebook, we'll explore limitations of single neurons through two lenses:

Example | Why Include It?
XOR Problem | The famous textbook example - you'll encounter this everywhere in ML literature
Challenging V/H Lines | Our continuing story - noisy images, multiple positions, harder patterns

Both examples teach the same lesson: some problems need multiple neurons working together.


Prerequisites

Make sure you've completed:

  • Parts 0-1: Matrices (neural_network_fundamentals.ipynb)
  • Part 2: Single Neuron (part_2_single_neuron.ipynb)
  • Part 3: Activation Functions (part_3_activation_functions.ipynb)
  • Part 4: The Perceptron (part_4_perceptron.ipynb)
  • Part 5: Training (part_5_training.ipynb)
  • Part 6: Evaluation (part_6_evaluation.ipynb)

Setup: Import Dependencies

cell 003

# =============================================================================
# PART 7: HIDDEN LAYERS - SETUP AND IMPORTS
# =============================================================================

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, clear_output

# Try to import ipywidgets for interactive features
try:
    import ipywidgets as widgets
    WIDGETS_AVAILABLE = True
except ImportError:
    WIDGETS_AVAILABLE = False
    print("Note: ipywidgets not installed. Interactive features will be limited.")

# Set up matplotlib style
style_options = ['seaborn-v0_8-whitegrid', 'seaborn-whitegrid', 'ggplot', 'default']
for style in style_options:
    try:
        plt.style.use(style)
        break
    except OSError:
        continue

plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 12
np.random.seed(42)

# -----------------------------------------------------------------------------
# Helper functions from previous notebooks
# -----------------------------------------------------------------------------

def sigmoid(z):
    """Sigmoid activation: maps any value to range (0, 1)."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_derivative(z):
    """Derivative of sigmoid: σ(z) * (1 - σ(z))"""
    s = sigmoid(z)
    return s * (1 - s)

def relu(z):
    """ReLU activation: max(0, z)"""
    return np.maximum(0, z)

def relu_derivative(z):
    """Derivative of ReLU: 1 if z > 0, else 0"""
    return (z > 0).astype(float)

print("Setup complete!")
print("="*60)

7.1 The Limitation of Single Neurons: The XOR Problem

Our Perceptron works great for vertical vs horizontal lines. But there's a famous problem that NO single neuron can solve: the XOR problem.

What IS XOR?

XOR (exclusive OR) is a logical operation that outputs TRUE when inputs are DIFFERENT:

Input A | Input B | XOR Output
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

In words: "TRUE if one or the other, but not both."

Real-world examples of XOR:

  • A hallway light with two switches: flipping EITHER switch changes the light, and when BOTH switches are in the same position (both up or both down), the light is off
  • A menu choice: "soup OR salad" is included with the meal - one or the other, but not both

Why Can't a Single Neuron Solve XOR?

A single neuron creates a linear decision boundary - a straight line that separates the two classes.

What IS a Decision Boundary?

A decision boundary is the line (in 2D), plane (in 3D), or hyperplane (in higher dimensions) where the model switches from predicting one class to another.

For a single neuron: $z = w_1 x_1 + w_2 x_2 + b = 0$

This equation defines a straight line. Points on one side get z > 0 (predict class 1), points on the other side get z < 0 (predict class 0).

Why is this a line? Rearranging (assuming $w_2 \neq 0$): $x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}$

This is the equation of a line with slope $-\frac{w_1}{w_2}$ and intercept $-\frac{b}{w_2}$.

The Problem: No matter what values we choose for $w_1$, $w_2$, and $b$, we can only draw ONE straight line!
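One way to build intuition for this claim is a brute-force search: try many (w1, w2, b) combinations and record the best score a single neuron reaches on the four XOR points. The sketch below is illustrative only. The grid range and resolution are arbitrary choices that are not part of the lesson, and a search by itself cannot prove impossibility; it only shows that nothing on this grid gets all four points right.

```python
import itertools
import numpy as np

# XOR truth table
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

def single_neuron_predict(X, w1, w2, b):
    """Predict class 1 where z = w1*x1 + w2*x2 + b is positive, else class 0."""
    z = w1 * X[:, 0] + w2 * X[:, 1] + b
    return (z > 0).astype(int)

# Hypothetical search grid: 41 evenly spaced values per parameter in [-5, 5]
grid = np.linspace(-5, 5, 41)

best_correct = 0
for w1, w2, b in itertools.product(grid, repeat=3):
    predictions = single_neuron_predict(X_xor, w1, w2, b)
    correct = int(np.sum(predictions == y_xor))
    best_correct = max(best_correct, correct)

print(f"Best single-neuron result on XOR: {best_correct}/4 points correct")
```

The search tops out at 3 of 4 points: for example, w1 = w2 = 1 with b = -0.5 behaves like OR and only misclassifies (1,1).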

Let's visualize the problem:

cell 005

# =============================================================================
# THE XOR PROBLEM: Visualizing Why Single Neurons Fail
# =============================================================================

print("="*70)
print("THE XOR PROBLEM: A Single Neuron's Nightmare")
print("="*70)

# XOR data
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: The XOR problem
ax = axes[0]
colors = ['red' if y == 0 else 'blue' for y in y_xor]
ax.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=300, edgecolor='black', linewidth=2)
for i, (x, y, label) in enumerate(zip(X_xor[:, 0], X_xor[:, 1], y_xor)):
    ax.annotate(f'({x},{y})→{label}', (x, y), xytext=(10, 10),
                textcoords='offset points', fontsize=10)
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('Input A', fontsize=12)
ax.set_ylabel('Input B', fontsize=12)
ax.set_title('XOR Data Points\n(Red=0, Blue=1)', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)

# Plot 2: Can you draw ONE line to separate them?
ax = axes[1]
ax.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=300, edgecolor='black', linewidth=2)

# Try some lines
x_line = np.linspace(-0.5, 1.5, 100)
ax.plot(x_line, x_line, 'g--', linewidth=2, label='Diagonal?')
ax.plot(x_line, 0.5 * np.ones_like(x_line), 'm--', linewidth=2, label='Horizontal?')
ax.plot(0.5 * np.ones_like(x_line), x_line, 'c--', linewidth=2, label='Vertical?')

ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('Input A', fontsize=12)
ax.set_ylabel('Input B', fontsize=12)
ax.set_title('Try to Draw ONE Line\nto Separate Red from Blue', fontsize=14, fontweight='bold')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

# Plot 3: The solution requires TWO lines (or a curve)
ax = axes[2]
ax.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=300, edgecolor='black', linewidth=2)

# Two lines that together solve XOR.
# Both have slope -1 (A + B = 0.5 and A + B = 1.5), so the band between them
# contains the blue points (0,1) and (1,0) and excludes the red corners.
# (Lines with slope +1 would enclose the red points (0,0) and (1,1) instead.)
lower = 0.5 - x_line
upper = 1.5 - x_line
ax.plot(x_line, lower, 'g-', linewidth=2, label='Line 1')
ax.plot(x_line, upper, 'g-', linewidth=2, label='Line 2')
ax.fill_between(x_line, lower, upper, alpha=0.2, color='blue', label='Blue region')

ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('Input A', fontsize=12)
ax.set_ylabel('Input B', fontsize=12)
ax.set_title('Solution: TWO Lines\n(Requires Hidden Layer!)', fontsize=14, fontweight='bold')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("""
KEY INSIGHT: The XOR Problem
════════════════════════════════════════════════════════════════════════

The red points (0) are at corners (0,0) and (1,1).
The blue points (1) are at corners (0,1) and (1,0).

NO SINGLE STRAIGHT LINE can separate red from blue!

This is called being "not linearly separable."

Why it matters:
• A single neuron can only create ONE linear boundary
• XOR requires a more complex, non-linear boundary
• This was proven impossible for Perceptrons in 1969 (Minsky & Papert)
• The solution: ADD MORE NEURONS → Hidden Layers!
""")

What IS Linear Separability?

Linear Separability is a property of a dataset where the classes can be separated by a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions).

Problem Type | Linearly Separable? | Single Neuron Can Solve?
AND gate | Yes | ✓
OR gate | Yes | ✓
XOR gate | No | ✗
Vertical vs Horizontal lines (clean) | Yes | ✓
Noisy/partial V/H lines | Harder | Struggles!
Complex overlapping patterns | No | ✗
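To make the first two rows of the table concrete, here is a minimal sketch of a single neuron computing AND and OR. The weights are hand-picked for illustration rather than learned, and the `neuron` helper is a hypothetical name introduced only for this example; many other weight choices work equally well.

```python
import numpy as np

# All four input combinations for two binary inputs A and B
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

def neuron(X, w, b):
    """Single neuron with a hard threshold: predict 1 where w·x + b > 0."""
    return (X @ w + b > 0).astype(int)

# Same weights, different bias: the bias slides the boundary line A + B = -b
and_pred = neuron(X, w=np.array([1.0, 1.0]), b=-1.5)  # fires only when A + B > 1.5
or_pred = neuron(X, w=np.array([1.0, 1.0]), b=-0.5)   # fires when A + B > 0.5

print("AND:", and_pred)
print("OR: ", or_pred)
```

Both gates use one straight boundary; only its position differs. XOR would need the region between those two boundaries, which a single line cannot carve out.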

Why Does Linear Separability Matter?

This is the fundamental limit of single-layer neural networks:

Model | Decision Boundary | What It Can Learn
Single neuron | One line/plane | Only linearly separable patterns
MLP (hidden layer) | Multiple lines → curves | Non-linear patterns
Deep MLP | Very complex shapes | Almost anything!

Mathematically: A single neuron computes $\sigma(w \cdot x + b)$. The activation function $\sigma$ is monotonic (always increasing or flat), so thresholding its output is equivalent to thresholding $w \cdot x + b$ itself. The neuron can therefore only split the input space with ONE hyperplane. That's the fundamental constraint.
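For XOR the constraint can be checked by hand. A short derivation, assuming the convention used above (predict class 1 when $z > 0$ and class 0 when $z < 0$), writes down one inequality per XOR point:

```latex
\begin{aligned}
(0,0) \mapsto 0:&\quad b < 0 \\
(0,1) \mapsto 1:&\quad w_2 + b > 0 \\
(1,0) \mapsto 1:&\quad w_1 + b > 0 \\
(1,1) \mapsto 0:&\quad w_1 + w_2 + b < 0
\end{aligned}
```

Adding the two middle inequalities gives $w_1 + w_2 + 2b > 0$, so $w_1 + w_2 + b > -b > 0$ (using $b < 0$). That contradicts the last line, so no choice of $w_1$, $w_2$, $b$ satisfies all four conditions.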

The Historical "AI Winter"

In 1969, Marvin Minsky and Seymour Papert published a book called "Perceptrons" proving that single-layer networks couldn't solve XOR or any non-linearly-separable problem.

Why was this so damaging? They proved it was a MATHEMATICAL impossibility, not just a training difficulty. No amount of training could make a single neuron learn XOR - it literally cannot represent that function.


7.1.5 Back to Our Story: When V/H Classification Gets Hard

XOR is the famous textbook example, but let's see how the same limitation affects our vertical/horizontal line detection problem.

Our Perceptron's Success... and Its Limits

In Parts 4-6, our single-neuron Perceptron achieved ~95-100% accuracy on clean V/H lines. But what happens when the problem gets harder?

Challenge | What Changes | Why It's Harder
Noisy images | Random pixels added | Pattern obscured
Lines in ANY position | Not just middle | One "middle detector" isn't enough
Partial/broken lines | Missing pixels | Incomplete evidence
Thin vs thick lines | Different widths | Multiple patterns to detect
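The "lines in ANY position" row can be made precise with a counting argument. The sketch below assumes clean, full-length lines on the 3×3 grid, one weight $w_{rc}$ per pixel (row $r$, column $c$), and the same rule as in Section 7.1 (class 1 when $z > 0$, class 0 when $z < 0$); it is an illustration of the idea, not a result stated in the earlier parts.

```latex
\begin{aligned}
\text{vertical line in column } c \ (\text{class } 1):&\quad \sum_{r=1}^{3} w_{rc} + b > 0, \qquad c = 1, 2, 3 \\
\text{horizontal line in row } r \ (\text{class } 0):&\quad \sum_{c=1}^{3} w_{rc} + b < 0, \qquad r = 1, 2, 3 \\
\text{adding the three column conditions}:&\quad \sum_{r,c} w_{rc} + 3b > 0 \\
\text{adding the three row conditions}:&\quad \sum_{r,c} w_{rc} + 3b < 0
\end{aligned}
```

The last two lines cannot both hold, because the three columns and the three rows cover exactly the same nine weights. Under these assumptions, a single neuron cannot classify all six clean line positions correctly, even before any noise is added.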

Let's see if our single neuron can handle these challenges:

The Historical "AI Winter"

Minsky and Papert's result is widely credited with contributing to the first "AI Winter" - a period where funding for neural network research dried up because people thought these networks were fundamentally limited.

The Solution Was Simple: Add More Layers!

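The fix was known in principle, but training multi-layer networks stayed impractical until backpropagation was popularized in 1986 (Rumelhart, Hinton & Williams) and computers became fast enough: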

Instead of one expert, use a TEAM of experts (neurons) working together!


7.2 The Panel of Experts: Hidden Layers

What IS a Hidden Layer?

A hidden layer is a layer of neurons that sits between the input and output:

INPUT (9 pixels) → [HIDDEN LAYER] → OUTPUT (1 prediction)
                   (multiple neurons)

Why "hidden"? Because we never directly see their values during normal use - they're internal to the network.

Why Multiple Neurons Help

Each neuron in the hidden layer can detect a different feature:

Hidden Neuron | What It Might Detect
Neuron 1 | "Is there a vertical pattern on the LEFT?"
Neuron 2 | "Is there a vertical pattern in the MIDDLE?"
Neuron 3 | "Is there a vertical pattern on the RIGHT?"
Neuron 4 | "Is there a horizontal pattern on TOP?"

The output neuron then combines these feature detections to make a final decision.

The Critical Role of Activation Functions

Why do we NEED activation functions between layers?

Without activations, stacking layers does nothing! Here's why:

Without activation: $\text{Output} = W_2 \cdot (W_1 \cdot x) = (W_2 \cdot W_1) \cdot x = W_{combined} \cdot x$

The composition of two linear transformations is just... another linear transformation! We could replace the entire network with a single layer.

With activation: $\text{Output} = W_2 \cdot \sigma(W_1 \cdot x)$

The non-linear $\sigma$ "breaks" the linearity. Now we have:

  • Layer 1 creates multiple linear boundaries
  • Activation function "bends" these boundaries
  • Layer 2 combines the bent boundaries

This is how MLPs create curves from straight lines!
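The collapse of stacked linear layers is easy to check numerically. The sketch below uses random matrices with the 9 → 4 → 1 shapes from this part; the seed and the scaling test (a linear map must satisfy f(2x) = 2·f(x)) are illustrative choices, not part of the lesson's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

W1 = rng.normal(size=(4, 9))   # input -> hidden
W2 = rng.normal(size=(1, 4))   # hidden -> output
x = rng.normal(size=9)

# Without activation: two layers behave exactly like one layer W_combined
two_layers = W2 @ (W1 @ x)
W_combined = W2 @ W1
one_layer = W_combined @ x
linear_collapses = np.allclose(two_layers, one_layer)

# With activation: the network no longer passes the scaling test for linearity
def with_activation(v):
    return W2 @ sigmoid(W1 @ v)

still_linear = np.allclose(with_activation(2 * x), 2 * with_activation(x))

print("Linear stack equals a single layer:", linear_collapses)
print("Network with sigmoid is still linear:", still_linear)
```

The first check succeeds because matrix multiplication is associative; the second fails because sigmoid squashes large and small values differently.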

Committee Analogy: The Sub-Committee

"Before, we had ONE committee member who had to look at everything. Now we have a sub-committee of specialists:

  • Specialist 1 checks for patterns in the left region
  • Specialist 2 checks the middle region
  • Specialist 3 checks the right region
  • The final committee member listens to all specialists and makes the decision

This division of labor lets us solve more complex problems!"

Diversity of Opinion

Key insight from Part 1.7: If all hidden neurons look for the same thing, they're redundant!

We need diversity - each hidden neuron should specialize in detecting something different. This happens naturally during training as they adjust to minimize error.

cell 007

# =============================================================================
# THE V/H CHALLENGE: When Our Perceptron Struggles
# =============================================================================

print("="*70)
print("BACK TO OUR STORY: Challenging V/H Classification")
print("="*70)

# Dataset generator from previous parts
def generate_line_dataset(n_samples=100, noise_level=0.0, seed=None):
    """Generate vertical (1) and horizontal (0) line images."""
    if seed is not None:
        np.random.seed(seed)

    X, y = [], []
    for i in range(n_samples):
        image = np.zeros((3, 3))
        if i < n_samples // 2:  # Vertical
            col = np.random.randint(0, 3)  # ANY column, not just middle
            image[:, col] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(1)
        else:  # Horizontal
            row = np.random.randint(0, 3)
            image[row, :] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(0)

    X, y = np.array(X), np.array(y)
    shuffle_idx = np.random.permutation(n_samples)
    return X[shuffle_idx], y[shuffle_idx]

# Simple Perceptron from Part 5
class SimplePerceptron:
    def __init__(self, n_inputs):
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0

    def forward(self, x):
        z = np.dot(self.weights, x.flatten()) + self.bias
        return sigmoid(z)

    def predict(self, x):
        return 1 if self.forward(x) >= 0.5 else 0

    def train(self, X, y, lr=0.5, epochs=100):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                pred = self.forward(xi)
                error = pred - yi
                self.weights -= lr * error * xi.flatten()
                self.bias -= lr * error
        return self

# Test on different difficulty levels
print("\nTesting Single Neuron on Increasingly Difficult V/H Problems:\n")

difficulties = [
    ("Clean (0% noise)", 0.0),
    ("Light noise (10%)", 0.1),
    ("Medium noise (20%)", 0.2),
    ("Heavy noise (30%)", 0.3)
]

results = []
for name, noise in difficulties:
    np.random.seed(42)
    X_train, y_train = generate_line_dataset(100, noise_level=noise, seed=42)
    X_test, y_test = generate_line_dataset(50, noise_level=noise, seed=999)

    perceptron = SimplePerceptron(9)
    perceptron.train(X_train, y_train, epochs=100)

    correct = sum(1 for x, y in zip(X_test, y_test) if perceptron.predict(x) == y)
    accuracy = correct / len(y_test) * 100
    results.append((name, accuracy))
    print(f"  {name:25s} → Accuracy: {accuracy:5.1f}%")

print("\n" + "="*70)
print("KEY OBSERVATION:")
print("="*70)
print("""
As noise increases, our single neuron struggles more!

Why? The single neuron learned ONE pattern (e.g., "middle column = vertical").
But noisy images have:
  • Extra bright pixels confusing the detector
  • Lines in different positions the single "template" doesn't match
  • Partial patterns that need multiple feature detectors

Just like XOR, complex V/H patterns need MULTIPLE SPECIALISTS!
""")

cell 008

# =============================================================================
# VISUALIZING THE CHALLENGE: Clean vs Noisy V/H Images
# =============================================================================

fig, axes = plt.subplots(2, 5, figsize=(15, 6))

# Generate examples at different noise levels
np.random.seed(123)

# Top row: Vertical lines with increasing noise
noises = [0.0, 0.1, 0.2, 0.3, 0.4]
for i, noise in enumerate(noises):
    ax = axes[0, i]
    image = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]], dtype=float)
    if noise > 0:
        image = np.clip(image + np.random.randn(3, 3) * noise, 0, 1)
    ax.imshow(image, cmap='Blues', vmin=0, vmax=1)
    ax.set_title(f'{int(noise*100)}% Noise', fontsize=11)
    # Hide ticks only: ax.axis('off') would also hide the row label set below
    ax.set_xticks([])
    ax.set_yticks([])
    if i == 0:
        ax.set_ylabel('VERTICAL', fontsize=12, fontweight='bold')

# Bottom row: Horizontal lines with increasing noise
for i, noise in enumerate(noises):
    ax = axes[1, i]
    image = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]], dtype=float)
    if noise > 0:
        image = np.clip(image + np.random.randn(3, 3) * noise, 0, 1)
    ax.imshow(image, cmap='Blues', vmin=0, vmax=1)
    ax.set_xticks([])
    ax.set_yticks([])
    if i == 0:
        ax.set_ylabel('HORIZONTAL', fontsize=12, fontweight='bold')

plt.suptitle('The Challenge: As Noise Increases, Patterns Become Harder to Detect',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("""
With heavy noise, even WE have trouble seeing the pattern!

A single neuron that learned "middle column bright = vertical" will struggle
when noise makes OTHER pixels bright too.

SOLUTION: Multiple specialists, each detecting different aspects of the pattern.
""")

cell 009

# =============================================================================
# VISUALIZING THE MLP ARCHITECTURE
# =============================================================================

def draw_neural_network(ax, layer_sizes, layer_names=None):
    """Draw a neural network architecture diagram."""
    n_layers = len(layer_sizes)
    max_neurons = max(layer_sizes)

    # Spacing
    layer_spacing = 1.5
    neuron_spacing = 0.8

    positions = []

    for layer_idx, n_neurons in enumerate(layer_sizes):
        layer_positions = []
        x = layer_idx * layer_spacing

        # Center the neurons vertically
        start_y = (max_neurons - n_neurons) * neuron_spacing / 2

        for neuron_idx in range(n_neurons):
            y = start_y + neuron_idx * neuron_spacing
            layer_positions.append((x, y))

            # Draw neuron (blue = input, red = output, green = hidden)
            color = '#3498db' if layer_idx == 0 else '#e74c3c' if layer_idx == n_layers - 1 else '#27ae60'
            # Use facecolor/edgecolor explicitly: passing color= together with ec=
            # lets color override the black edge.
            circle = plt.Circle((x, y), 0.15, facecolor=color, edgecolor='black',
                                linewidth=2, zorder=3)
            ax.add_patch(circle)

        positions.append(layer_positions)

    # Draw connections
    for layer_idx in range(n_layers - 1):
        for start_pos in positions[layer_idx]:
            for end_pos in positions[layer_idx + 1]:
                ax.plot([start_pos[0], end_pos[0]], [start_pos[1], end_pos[1]],
                        'gray', alpha=0.3, linewidth=0.5, zorder=1)

    # Add layer labels
    if layer_names:
        for layer_idx, name in enumerate(layer_names):
            x = layer_idx * layer_spacing
            ax.text(x, -1, name, ha='center', fontsize=10, fontweight='bold')

    ax.set_xlim(-0.5, (n_layers - 1) * layer_spacing + 0.5)
    ax.set_ylim(-1.5, max_neurons * neuron_spacing)
    ax.set_aspect('equal')
    ax.axis('off')

# Create visualization
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Plot 1: Perceptron (Part 4-6)
ax = axes[0]
draw_neural_network(ax, [9, 1], ['Input\n(9 pixels)', 'Output\n(1 neuron)'])
ax.set_title('PERCEPTRON (Parts 4-6)\nSingle Layer', fontsize=12, fontweight='bold')

# Plot 2: Simple MLP
ax = axes[1]
draw_neural_network(ax, [9, 4, 1], ['Input\n(9 pixels)', 'Hidden\n(4 neurons)', 'Output\n(1 neuron)'])
ax.set_title('MLP: One Hidden Layer\nThe "Panel of Experts"', fontsize=12, fontweight='bold')

# Plot 3: Deeper MLP
ax = axes[2]
draw_neural_network(ax, [9, 6, 4, 1], ['Input\n(9)', 'Hidden 1\n(6)', 'Hidden 2\n(4)', 'Output\n(1)'])
ax.set_title('DEEP MLP: Two Hidden Layers\n"Hierarchy of Specialists"', fontsize=12, fontweight='bold')

plt.suptitle('Evolution of Neural Network Architectures', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("""
ARCHITECTURE COMPARISON:
════════════════════════════════════════════════════════════════════════

PERCEPTRON (Parts 4-6):
  • Input → Output directly
  • Can only learn linear boundaries
  • Limited to simple problems

MLP WITH ONE HIDDEN LAYER:
  • Input → Hidden → Output
  • Each hidden neuron detects different features
  • Can learn non-linear boundaries (like XOR!)

DEEP MLP (Multiple Hidden Layers):
  • Input → Hidden 1 → Hidden 2 → ... → Output
  • Each layer builds on the previous layer's features
  • Can learn very complex patterns

═══════════════════════════════════════════════════════════════════════
Color Legend: 🔵 Input | 🟢 Hidden | 🔴 Output
""")

7.3 The Multi-Layer Perceptron (MLP): Architecture and Math

Now let's understand the mathematics behind multi-layer networks.

What IS an MLP?

A Multi-Layer Perceptron (MLP) is a neural network with:

  • One input layer
  • One or more hidden layers
  • One output layer

Each layer is fully connected to the next (every neuron connects to every neuron in the next layer).

The Math: Forward Propagation

For an MLP with one hidden layer, the computation flows in two stages:

Stage 1: Input → Hidden $\mathbf{h} = \sigma(\mathbf{W}_1 \cdot \mathbf{x} + \mathbf{b}_1)$

Stage 2: Hidden → Output $\hat{y} = \sigma(\mathbf{W}_2 \cdot \mathbf{h} + \mathbf{b}_2)$

Where:

  • $\mathbf{x}$ = input vector (our 9 pixels)
  • $\mathbf{W}_1$ = weights from input to hidden layer (matrix!)
  • $\mathbf{b}_1$ = biases for hidden neurons
  • $\mathbf{h}$ = hidden layer activations
  • $\mathbf{W}_2$ = weights from hidden to output
  • $\mathbf{b}_2$ = bias for output neuron
  • $\sigma$ = activation function (sigmoid, ReLU, etc.)
  • $\hat{y}$ = final prediction

Breaking It Down Step by Step

Let's trace through with concrete dimensions:

Component | Shape | Example
Input $\mathbf{x}$ | (9,) | 9 pixels
Weights $\mathbf{W}_1$ | (4, 9) | 4 hidden neurons, each with 9 weights
Biases $\mathbf{b}_1$ | (4,) | 4 biases, one per hidden neuron
Hidden $\mathbf{h}$ | (4,) | 4 hidden activations
Weights $\mathbf{W}_2$ | (1, 4) | 1 output neuron, 4 weights (from hidden)
Bias $\mathbf{b}_2$ | (1,) | 1 bias for output
Output $\hat{y}$ | (1,) | Final prediction

Why These Specific Shapes?

Matrix multiplication rule: $(m \times n) \cdot (n \times 1) = (m \times 1)$

The shapes MUST align:

  • $W_1$ is $(4, 9)$ because we have 4 hidden neurons, each looking at 9 inputs
  • $W_1 \cdot x$ gives us $(4, 9) \cdot (9, 1) = (4, 1)$ - one value per hidden neuron ✓
  • $W_2$ is $(1, 4)$ because we have 1 output looking at 4 hidden neurons
  • $W_2 \cdot h$ gives us $(1, 4) \cdot (4, 1) = (1, 1)$ - our single output ✓

The key insight: Each row of $W_1$ represents ONE hidden neuron's "view" of the input. Each column of $W_2$ represents how much the output trusts each hidden neuron.
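The shape rules can be confirmed directly in NumPy. This is a minimal sketch using zero-filled arrays (the values do not matter for shapes) and omitting the activation, since applying it element-wise never changes a shape.

```python
import numpy as np

x = np.zeros(9)           # 9 input pixels
W1 = np.zeros((4, 9))     # 4 hidden neurons, each with 9 weights
b1 = np.zeros(4)          # one bias per hidden neuron
W2 = np.zeros((1, 4))     # 1 output neuron, 4 weights
b2 = np.zeros(1)          # one bias for the output

h = W1 @ x + b1           # (4, 9) · (9,) -> (4,)
y_hat = W2 @ h + b2       # (1, 4) · (4,) -> (1,)
print(h.shape, y_hat.shape)

# A transposed weight matrix fails immediately instead of silently misbehaving
shape_error = None
try:
    np.zeros((9, 4)) @ x  # inner dimensions 4 and 9 do not match
except ValueError as err:
    shape_error = err
    print("Shape mismatch:", err)
```

Checking shapes like this before training is a cheap way to catch a transposed weight matrix early.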

Why This Works for XOR

Each hidden neuron can learn ONE linear boundary. With multiple hidden neurons, we can combine their boundaries to create complex, non-linear decision regions!

Concrete XOR example with 2 hidden neurons:

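  • Hidden neuron 1 might learn: "A OR B" (a boundary near the bottom-left corner, e.g. the line A + B = 0.5, which cuts off only (0,0))
  • Hidden neuron 2 might learn: "A AND B" (a boundary near the top-right corner, e.g. the line A + B = 1.5, which cuts off only (1,1))
  • Output combines them: "(A OR B) AND NOT (A AND B)" = XOR!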
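The OR/AND decomposition above can be written out as a tiny network with hand-picked weights. This is a sketch under stated assumptions: it uses a hard step function instead of sigmoid so the arithmetic stays exact, and the weights are chosen by hand rather than learned, so a trained network may well find a different solution.

```python
import numpy as np

def step(z):
    """Hard threshold (assumption: used here instead of sigmoid for exact arithmetic)."""
    return (z > 0).astype(float)

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden layer: row 0 acts as "A OR B", row 1 as "A AND B"
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])

# Output neuron: fires when OR is on and AND is off
W2 = np.array([[1.0, -1.0]])
b2 = np.array([-0.5])

outputs = []
for x in X_xor:
    h = step(W1 @ x + b1)            # hidden activations [OR, AND]
    y_hat = step(W2 @ h + b2)[0]     # final decision
    outputs.append(int(y_hat))
    print(f"A={int(x[0])}, B={int(x[1])} -> hidden={h.astype(int)}, output={int(y_hat)}")
```

The four outputs are 0, 1, 1, 0, matching the XOR truth table: two linear boundaries in the hidden layer plus one combining neuron are enough.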

cell 011

# =============================================================================
# BUILDING THE MLP: Step by Step Implementation
# =============================================================================

class MLP:
    """
    Multi-Layer Perceptron with one hidden layer.

    Architecture: Input → Hidden (with activation) → Output (with sigmoid)

    This is the "Full Committee" - multiple experts working together!
    """

    def __init__(self, n_inputs, n_hidden, n_outputs=1):
        """
        Initialize the MLP with random weights.

        Parameters:
            n_inputs: Number of input features (e.g., 9 for 3x3 image)
            n_hidden: Number of neurons in hidden layer (the "specialists")
            n_outputs: Number of output neurons (1 for binary classification)
        """
        self.n_inputs = n_inputs
        self.n_hidden = n_hidden
        self.n_outputs = n_outputs

        # Initialize weights with small random values scaled by sqrt(2 / fan_in).
        # (Strictly this is He-style scaling; the lesson text calls it Xavier initialization.)
        # W1: weights from input to hidden (shape: n_hidden x n_inputs)
        self.W1 = np.random.randn(n_hidden, n_inputs) * np.sqrt(2.0 / n_inputs)
        self.b1 = np.zeros(n_hidden)

        # W2: weights from hidden to output (shape: n_outputs x n_hidden)
        self.W2 = np.random.randn(n_outputs, n_hidden) * np.sqrt(2.0 / n_hidden)
        self.b2 = np.zeros(n_outputs)

        # For storing values during forward pass (needed for backprop)
        self.z1 = None  # Pre-activation of hidden layer
        self.h = None   # Hidden layer activations
        self.z2 = None  # Pre-activation of output
        self.output = None

        # Training history
        self.loss_history = []
        self.accuracy_history = []

    def forward(self, x):
        """
        Forward propagation: Input → Hidden → Output

        This is like the "Committee Meeting" where:
        1. Each specialist (hidden neuron) examines the evidence
        2. The final decision maker combines their opinions
        """
        x = np.array(x).flatten()

        # Stage 1: Input → Hidden
        # Each hidden neuron computes its weighted sum and activates
        self.z1 = np.dot(self.W1, x) + self.b1  # (n_hidden,)
        self.h = sigmoid(self.z1)               # (n_hidden,)

        # Stage 2: Hidden → Output
        # The output neuron combines hidden activations
        self.z2 = np.dot(self.W2, self.h) + self.b2  # (n_outputs,)
        self.output = sigmoid(self.z2)               # (n_outputs,)

        return self.output[0] if self.n_outputs == 1 else self.output

    def predict(self, x):
        """Make a binary prediction (0 or 1)."""
        return 1 if self.forward(x) >= 0.5 else 0

print("="*70)
print("MLP CLASS: The Full Committee Implementation")
print("="*70)

# Create an example MLP
mlp = MLP(n_inputs=9, n_hidden=4, n_outputs=1)

print(f"""
MLP Architecture Created:
  • Input layer: {mlp.n_inputs} neurons (our 9 pixels)
  • Hidden layer: {mlp.n_hidden} neurons (the specialists)
  • Output layer: {mlp.n_outputs} neuron (final decision)

Weight Shapes:
  • W1 (input→hidden): {mlp.W1.shape} = {mlp.n_hidden} hidden neurons × {mlp.n_inputs} inputs
  • b1 (hidden biases): {mlp.b1.shape} = {mlp.n_hidden} biases
  • W2 (hidden→output): {mlp.W2.shape} = {mlp.n_outputs} output × {mlp.n_hidden} hidden
  • b2 (output bias): {mlp.b2.shape} = {mlp.n_outputs} bias

Total Parameters: {mlp.W1.size + mlp.b1.size + mlp.W2.size + mlp.b2.size}
  (Compare to Perceptron: {9 + 1} parameters)
""")

Understanding Xavier Initialization

In the code above, we used:

self.W1 = np.random.randn(n_hidden, n_inputs) * np.sqrt(2.0 / n_inputs)

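What IS Xavier initialization and WHY do we need it?

Initialization | Formula | Effect
All zeros | w = 0 | All neurons learn the same thing! (symmetry is never broken)
Large random | w ~ N(0, 1) | Signals explode or saturate
Xavier (Glorot) | Var(w) = 1/n | Keeps signal variance stable (suits sigmoid/tanh)
He | Var(w) = 2/n | Same idea, with an extra factor of 2 to compensate for ReLU

Here n is the number of inputs to the neuron (its fan-in). A note on naming: the code above scales by $\sqrt{2/n}$, which is strictly He initialization. Xavier initialization in its simplest form uses $\text{Var}(w) = 1/n$ (the original formulation averages fan-in and fan-out: $\text{Var}(w) = 2/(n_{in} + n_{out})$). Both follow from the same variance argument.

The Math Behind Xavier:

When we compute $z = w_1 x_1 + w_2 x_2 + ... + w_n x_n$:

  • Each $w_i x_i$ term has variance ≈ $\text{Var}(w) \times \text{Var}(x)$ (for independent, zero-mean weights and inputs)
  • With n terms, total variance ≈ $n \times \text{Var}(w) \times \text{Var}(x)$

The problem: If $\text{Var}(w) = 1$, then the pre-activation variance grows by a factor of n at each layer!

  • Layer 1: variance × 9
  • Layer 2: variance × 9 × 4 (ignoring the squashing effect of the activation)
  • Values explode exponentially with depth, and sigmoid neurons saturate!

The solution: Set $\text{Var}(w) = 1/n$ so that output variance ≈ input variance. With ReLU, which zeroes roughly half of its inputs, $2/n$ restores the balance.

This keeps signals "healthy" as they flow through the network.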
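The variance argument can be checked empirically. The sketch below measures the variance of the pre-activation z = W·x for three weight scales. The layer width of 256, the sample count, and the helper name `preactivation_variance` are illustrative assumptions (a wide layer is used so the statistics are stable; the lesson's own network has only 9 inputs), and the activation function is deliberately left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs = 256     # fan-in (assumed large for stable statistics)
n_hidden = 256     # number of neurons in the layer
n_samples = 2000   # number of random input vectors

# Inputs with zero mean and variance ~1
x = rng.normal(size=(n_samples, n_inputs))

def preactivation_variance(weight_std):
    """Variance of z = x @ W when W has entries drawn from N(0, weight_std^2)."""
    W = rng.normal(scale=weight_std, size=(n_inputs, n_hidden))
    return float(np.var(x @ W))

var_unit = preactivation_variance(1.0)                         # Var(w) = 1
var_xavier = preactivation_variance(np.sqrt(1.0 / n_inputs))   # Var(w) = 1/n
var_he = preactivation_variance(np.sqrt(2.0 / n_inputs))       # Var(w) = 2/n

print(f"Var(w) = 1   -> Var(z) ≈ {var_unit:.1f}  (about n = {n_inputs})")
print(f"Var(w) = 1/n -> Var(z) ≈ {var_xavier:.2f} (about 1)")
print(f"Var(w) = 2/n -> Var(z) ≈ {var_he:.2f} (about 2)")
```

With unit-variance weights the signal variance is multiplied by roughly n in a single layer, while the 1/n scaling keeps it close to the input variance and the 2/n scaling roughly doubles it.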

Tracing Through the Forward Pass: What's Actually Happening?

Before we run the code, let's understand what the forward pass computes at each step:

Stage 1: Input → Hidden (What each specialist sees)

For hidden neuron $i$:

  1. Weighted sum: $z_1[i] = W_1[i,0] \cdot x[0] + W_1[i,1] \cdot x[1] + ... + W_1[i,8] \cdot x[8] + b_1[i]$
  2. Activation: $h[i] = \sigma(z_1[i])$ → transforms to range (0, 1)

Each hidden neuron is essentially asking: "How strongly does this input match MY pattern?"

Stage 2: Hidden → Output (The final vote)

  1. Combine opinions: $z_2 = W_2[0] \cdot h[0] + W_2[1] \cdot h[1] + ... + W_2[3] \cdot h[3] + b_2$
  2. Final decision: $\text{output} = \sigma(z_2)$ → probability of class 1

The output neuron asks: "Given what all specialists reported, what's my final decision?"

```python
# =============================================================================
# FORWARD PASS: Step-by-Step Demonstration
# =============================================================================

print("="*70)
print("FORWARD PASS: Tracing Data Through the Network")
print("="*70)

# Create test input (vertical line from Part 1)
vertical_line = np.array([[0, 1, 0],
                          [0, 1, 0],
                          [0, 1, 0]])
x = vertical_line.flatten()

print("\nInput (vertical line as 9 pixels):")
print(f"  x = {x}")

print("\n" + "-"*70)
print("STAGE 1: Input → Hidden Layer")
print("-"*70)

# Stage 1: Compute hidden layer
z1 = np.dot(mlp.W1, x) + mlp.b1
h = sigmoid(z1)

print(f"""
Step 1a: Compute weighted sums for each hidden neuron
  z1 = W1 · x + b1

  For each hidden neuron i:
    z1[i] = Σ(W1[i,j] × x[j]) + b1[i]

  z1 = {z1}

Step 1b: Apply activation function
  h = sigmoid(z1)

  For each hidden neuron:
    h[i] = 1 / (1 + e^(-z1[i]))

  h = {h}

  These are the "opinions" from our {mlp.n_hidden} specialists!
""")

print("-"*70)
print("STAGE 2: Hidden Layer → Output")
print("-"*70)

# Stage 2: Compute output
z2 = np.dot(mlp.W2, h) + mlp.b2
output = sigmoid(z2)

print(f"""
Step 2a: Combine hidden activations
  z2 = W2 · h + b2

  The output neuron combines all specialist opinions:
    z2 = Σ(W2[j] × h[j]) + b2

  z2 = {z2}

Step 2b: Apply sigmoid for final prediction
  output = sigmoid(z2)

  output = {output}

  Final decision: {"VERTICAL" if output[0] >= 0.5 else "HORIZONTAL"}
  (With random weights, this is just a guess!)
""")
```

7.4 Backpropagation Through Multiple Layers

In Part 5, we learned backpropagation for a single neuron. With multiple layers, we need to chain the gradients - passing blame backward through each layer.

The Challenge: Who's Responsible for the Error?

When the network makes a mistake, we need to figure out:

  1. How much should we adjust the output weights (W2)?
  2. How much should we adjust the hidden weights (W1)?

The difficulty: W1 doesn't directly produce the output! It influences the hidden layer, which THEN influences the output. This is like asking: "If a manager's employee made a mistake, how much is the manager responsible?"

The Chain Rule: Passing Blame Backward

The key mathematical tool is the chain rule from calculus:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial W_1}$$

What IS the Chain Rule?

The chain rule says: if A affects B, and B affects C, then A's effect on C is:

$$\frac{dC}{dA} = \frac{dC}{dB} \times \frac{dB}{dA}$$

Intuitive example: If increasing temperature by 1°C increases pressure by 2 units, and increasing pressure by 1 unit increases volume by 3 units, then increasing temperature by 1°C increases volume by 2 × 3 = 6 units.
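The temperature example can be checked numerically. This is a small sketch that assumes the two relationships are exactly linear (pressure = 2 × temperature, volume = 3 × pressure), which is a simplification made for illustration; the helper `slope` is a hypothetical finite-difference estimator.

```python
def pressure(temperature):
    return 2.0 * temperature      # +1 degree C -> +2 pressure units

def volume(p):
    return 3.0 * p                # +1 pressure unit -> +3 volume units

def slope(f, x, eps=1e-4):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

t0 = 20.0
dP_dT = slope(pressure, t0)                        # effect of T on P
dV_dP = slope(volume, pressure(t0))                # effect of P on V
dV_dT = slope(lambda t: volume(pressure(t)), t0)   # effect of T on V, end to end

print(f"dP/dT = {dP_dT:.3f}, dV/dP = {dV_dP:.3f}")
print(f"dV/dT measured directly      = {dV_dT:.3f}")
print(f"dV/dP * dP/dT (chain rule)   = {dV_dP * dP_dT:.3f}")
```

Measuring the end-to-end slope and multiplying the two intermediate slopes give the same number, which is all the chain rule claims.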

Think of it as a blame chain:

  • Loss depends on output prediction (how wrong is the answer?)
  • Output prediction depends on hidden activations (what did specialists say?)
  • Hidden activations depend on hidden weights (what were specialists looking for?)

Committee Analogy: Tracing Blame

"When the committee makes a wrong decision:

  1. First, we see how wrong the final decision was (output error)
  2. Then we ask: 'Which specialists contributed to this error?' (hidden layer blame)
  3. Finally: 'What evidence did each specialist focus on that led them astray?' (input weights)

The blame flows BACKWARD through the committee hierarchy."

The Backpropagation Steps

Step 1: Output Error

$$\delta_2 = \hat{y} - y$$

Step 2: Hidden Layer Error (via chain rule)

$$\delta_1 = (W_2^T \cdot \delta_2) \odot \sigma'(z_1)$$

Where $\odot$ is element-wise multiplication and $\sigma'$ is the derivative of sigmoid.

Step 3: Update Weights

$$W_2 = W_2 - \alpha \cdot \delta_2 \cdot h^T$$

$$W_1 = W_1 - \alpha \cdot \delta_1 \cdot x^T$$

Here $\alpha$ is the learning rate. The biases are updated the same way: $b_2 = b_2 - \alpha \cdot \delta_2$ and $b_1 = b_1 - \alpha \cdot \delta_1$.
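One way to build trust in these formulas is a gradient check: compute the gradient with the δ rules, then compare it against a finite-difference estimate of the loss. The sketch below does this in plain Python for a tiny 2-2-1 network. The weights, input, and helper names are hypothetical values chosen for the demonstration, and the loss is assumed to be binary cross-entropy with a sigmoid output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W1, b1, W2, b2, x):
    """Return (hidden activations, prediction) for a 2-2-1 network."""
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [sigmoid(z) for z in z1]
    z2 = sum(w * hi for w, hi in zip(W2, h)) + b2
    return h, sigmoid(z2)

def bce(y, y_hat):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Hypothetical hand-picked parameters and one training example
W1 = [[0.5, -0.3], [0.8, 0.2]]
b1 = [0.1, -0.1]
W2 = [0.7, -0.4]
b2 = 0.05
x, y = [1.0, 0.5], 1.0

h, y_hat = forward(W1, b1, W2, b2, x)

# Step 1: output error
delta2 = y_hat - y
# Step 2: hidden error, using sigma'(z1) = h * (1 - h)
delta1 = [W2[i] * delta2 * h[i] * (1 - h[i]) for i in range(2)]
# Gradient used in Step 3 for W1 (outer product of delta1 and x)
dW1 = [[delta1[i] * x[j] for j in range(2)] for i in range(2)]

def numeric_dW1(i, j, eps=1e-6):
    """Finite-difference estimate of dL/dW1[i][j]."""
    plus = [row[:] for row in W1]
    minus = [row[:] for row in W1]
    plus[i][j] += eps
    minus[i][j] -= eps
    loss_plus = bce(y, forward(plus, b1, W2, b2, x)[1])
    loss_minus = bce(y, forward(minus, b1, W2, b2, x)[1])
    return (loss_plus - loss_minus) / (2 * eps)

max_gap = max(abs(dW1[i][j] - numeric_dW1(i, j))
              for i in range(2) for j in range(2))
print(f"Largest gap between backprop and numeric gradient: {max_gap:.2e}")
```

A gap close to floating-point noise suggests the δ formulas and the loss agree; a large gap usually points to a sign or indexing mistake.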

Why We Store Values During Forward Pass

Notice that backpropagation needs values computed during forward pass:

  • $h$ (hidden activations) - needed to update W2
  • $z_1$ (pre-activation) - needed for sigmoid derivative
  • $x$ (input) - needed to update W1

This is why neural networks use memory! We can't compute gradients without remembering what happened during the forward pass. This creates a fundamental trade-off:

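| Memory Usage | Gradient Computation |
|---|---|
| Store all intermediate values | Exact gradients, no recomputation (standard backprop) |
| Store only some values | Exact gradients, but the missing values are recomputed during the backward pass (gradient checkpointing) |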

For deep networks with billions of parameters, memory management becomes critical!

```python
# =============================================================================
# COMPLETE MLP WITH TRAINING
# =============================================================================

class TrainableMLP:
    """
    Multi-Layer Perceptron with training capability.

    This is the complete "Full Committee" that can learn!
    """

    def __init__(self, n_inputs, n_hidden, n_outputs=1):
        """Initialize the MLP with Xavier initialization."""
        self.n_inputs = n_inputs
        self.n_hidden = n_hidden
        self.n_outputs = n_outputs

        # Xavier initialization for better training
        self.W1 = np.random.randn(n_hidden, n_inputs) * np.sqrt(2.0 / n_inputs)
        self.b1 = np.zeros(n_hidden)
        self.W2 = np.random.randn(n_outputs, n_hidden) * np.sqrt(2.0 / n_hidden)
        self.b2 = np.zeros(n_outputs)

        # Cache for forward pass values
        self.x = None
        self.z1 = None
        self.h = None
        self.z2 = None
        self.output = None

        # Training history
        self.loss_history = []
        self.accuracy_history = []

    def forward(self, x):
        """Forward propagation."""
        self.x = np.array(x).flatten()

        # Hidden layer
        self.z1 = np.dot(self.W1, self.x) + self.b1
        self.h = sigmoid(self.z1)

        # Output layer
        self.z2 = np.dot(self.W2, self.h) + self.b2
        self.output = sigmoid(self.z2)

        return self.output[0] if self.n_outputs == 1 else self.output

    def predict(self, x):
        """Binary prediction."""
        return 1 if self.forward(x) >= 0.5 else 0

    def backward(self, y_true, learning_rate):
        """
        Backpropagation: compute gradients and update weights.

        This is where the "blame assignment" happens!
        """
        # Output layer error
        delta2 = self.output - y_true  # Shape: (1,) or (n_outputs,)

        # Hidden layer error (chain rule!)
        # delta1 = (W2.T @ delta2) * sigmoid_derivative(z1)
        delta1 = np.dot(self.W2.T, delta2) * sigmoid_derivative(self.z1)

        # Update output weights (W2, b2)
        # dW2 = delta2 @ h.T (outer product)
        dW2 = np.outer(delta2, self.h)
        db2 = delta2

        # Update hidden weights (W1, b1)
        # dW1 = delta1 @ x.T (outer product)
        dW1 = np.outer(delta1, self.x)
        db1 = delta1

        # Apply updates
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1

    def compute_loss(self, y_true, y_pred):
        """Binary cross-entropy loss."""
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    def train(self, X, y, learning_rate=0.5, epochs=100, verbose=True):
        """
        Train the MLP on data.

        Parameters:
            X: Training inputs (n_samples, n_features)
            y: Training labels (n_samples,)
            learning_rate: Step size for gradient descent
            epochs: Number of passes through the dataset
            verbose: Whether to print progress
        """
        self.loss_history = []
        self.accuracy_history = []

        for epoch in range(epochs):
            total_loss = 0
            correct = 0

            for i in range(len(X)):
                # Forward pass
                y_pred = self.forward(X[i])

                # Compute loss
                loss = self.compute_loss(y[i], y_pred)
                total_loss += loss

                # Check accuracy
                if (y_pred >= 0.5 and y[i] == 1) or (y_pred < 0.5 and y[i] == 0):
                    correct += 1

                # Backward pass (this is where learning happens!)
                self.backward(np.array([y[i]]), learning_rate)

            # Record history
            avg_loss = total_loss / len(X)
            accuracy = correct / len(X)
            self.loss_history.append(avg_loss)
            self.accuracy_history.append(accuracy)

            if verbose and (epoch + 1) % 20 == 0:
                print(f"  Epoch {epoch+1:3d}/{epochs}: Loss = {avg_loss:.4f}, Accuracy = {accuracy*100:.1f}%")

        if verbose:
            print(f"\nTraining complete! Final accuracy: {self.accuracy_history[-1]*100:.1f}%")

        return self.loss_history


print("TrainableMLP class defined!")
print("This MLP can learn through backpropagation.")
```

Understanding the Backward Method

Let's trace through what backward actually computes:

Output error: delta2 = self.output - y_true

  • If predicted 0.8 but true is 0, error = +0.8 (need to decrease)
  • This comes from the derivative of the BCE loss with respect to z2 when the output activation is sigmoid
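For reference, here is a short derivation of that result. It assumes the binary cross-entropy loss and sigmoid output used by `compute_loss` and `forward` above, and it reuses the symbols from the backpropagation steps.

```latex
L = -\bigl[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\bigr],
\qquad \hat{y} = \sigma(z_2)

\frac{\partial L}{\partial \hat{y}}
  = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
  = \frac{\hat{y} - y}{\hat{y}\,(1 - \hat{y})}

\frac{\partial \hat{y}}{\partial z_2} = \hat{y}\,(1 - \hat{y})

\delta_2 = \frac{\partial L}{\partial z_2}
  = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2}
  = \hat{y} - y
```

The factor $\hat{y}(1 - \hat{y})$ cancels, which is why the output error takes such a simple form for this particular pairing of loss and activation.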

Hidden error: delta1 = np.dot(self.W2.T, delta2) * sigmoid_derivative(self.z1)

  • First part: Distribute output error to hidden neurons based on their weights
  • Second part: Scale by how "sensitive" each neuron was

Why the outer product for updates?

dW2 = np.outer(delta2, self.h) computes: error × what hidden neurons said

Each weight connects ONE hidden neuron to output. If that hidden neuron was highly active AND error was large, that weight contributed a lot → big update.
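To make the outer product concrete, here is a small sketch with a plain-Python stand-in for `np.outer`. The hidden activations are made-up numbers for illustration; the error value reuses the 0.8 example above.

```python
def outer(a, b):
    """Outer product: result[i][j] = a[i] * b[j] (what np.outer computes for 1-D inputs)."""
    return [[ai * bj for bj in b] for ai in a]

delta2 = [0.8]                 # predicted 0.8, true label 0
h = [0.9, 0.1, 0.5, 0.0]       # hypothetical activations of 4 hidden neurons

dW2 = outer(delta2, h)         # shape (1, 4), the same shape as W2

for j, grad in enumerate(dW2[0]):
    print(f"weight from hidden neuron {j}: gradient = {grad:.2f}")
```

The very active neuron (0.9) gets the largest gradient and the silent neuron (0.0) gets none: a neuron that said nothing cannot be blamed for the output.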


7.5 The MLP Solves XOR!

Now let's prove that our MLP can solve the XOR problem that defeated single neurons.

```python
# =============================================================================
# MLP SOLVES XOR: Proof That Hidden Layers Work!
# =============================================================================

print("="*70)
print("MLP vs XOR: The Hidden Layer Advantage")
print("="*70)

# XOR data
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# Create and train MLP with 4 hidden neurons
np.random.seed(42)
xor_mlp = TrainableMLP(n_inputs=2, n_hidden=4, n_outputs=1)

print("\nTraining MLP on XOR problem...")
print("(Remember: A single neuron CANNOT solve this!)\n")

xor_mlp.train(X_xor, y_xor, learning_rate=1.0, epochs=1000, verbose=True)

# Test predictions
print("\n" + "-"*70)
print("XOR PREDICTIONS:")
print("-"*70)
print("\n  Input A | Input B | Expected | Predicted | Correct?")
print("  " + "-"*50)

all_correct = True
for i, (x, y_true) in enumerate(zip(X_xor, y_xor)):
    y_pred = xor_mlp.predict(x)
    prob = xor_mlp.forward(x)
    correct = "Yes" if y_pred == y_true else "No"
    if y_pred != y_true:
        all_correct = False
    print(f"    {x[0]}     |    {x[1]}    |    {y_true}     |     {y_pred}     |   {correct}")

print("\n" + "="*70)
if all_correct:
    print("SUCCESS! The MLP solved XOR!")
    print("Hidden layers enable learning non-linear patterns!")
else:
    print("Still learning... (try running training again)")
print("="*70)
```

What XOR Taught Us

The XOR success proves several important points:

| Lesson | Why It Matters |
|---|---|
| Hidden layers enable non-linear boundaries | We can now solve problems impossible for single neurons |
| 4 hidden neurons > 2 for XOR | Sometimes extra capacity helps training |
| Higher learning rate (1.0) | Training on XOR tends to stall on plateaus; larger steps help it move on |
| More epochs (1000) | Non-linear problems can take longer to converge |

The key insight: Each hidden neuron learned to detect one "piece" of the XOR pattern. The output neuron combined these pieces into the full solution.
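To see what "pieces" can look like, here is a hand-built XOR network with just two hidden neurons. The weights are hypothetical values picked by hand for clarity, not the ones training finds; a trained network may split the problem differently.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(a, b):
    """One hidden layer, two hidden neurons, hand-picked weights."""
    h_or = sigmoid(20 * a + 20 * b - 10)    # fires if at least one input is 1
    h_and = sigmoid(20 * a + 20 * b - 30)   # fires only if both inputs are 1
    # Output neuron: "OR, but not AND"
    return sigmoid(20 * h_or - 20 * h_and - 10)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
predictions = [int(xor_net(a, b) >= 0.5) for a, b in inputs]

for (a, b), pred in zip(inputs, predictions):
    print(f"{a} XOR {b} -> {pred}")
```

Each hidden neuron draws one straight line (OR and AND are both linearly separable); the output neuron combines the two lines into a region no single line could carve out.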

Now let's return to our V/H classification story and see if this same power translates to real image problems!


7.6 Back to Our Through-Line: MLP vs Perceptron on V/H

We've proven the MLP can solve XOR. Now let's return to our continuing V/H story and see if the MLP can handle the challenging noisy images that stumped our single neuron.

The Comparison We've Been Building To

| Model | Clean V/H | Noisy V/H (20%) | Why? |
|---|---|---|---|
| Perceptron | ~95-100% | ~70-80% | One pattern detector isn't enough |
| MLP | ~95-100% | ? | Multiple specialists should help! |

Why Should MLP Help With Noise?

The Perceptron's problem with noise:

  • It learned ONE template (e.g., "middle column bright = vertical")
  • Noise adds random bright pixels everywhere
  • Random brightness confuses the single template

How MLP specialists help:

| Specialist | What It Might Detect | Why Noise-Robust |
|---|---|---|
| Hidden 1 | Left column pattern | Noise in right columns doesn't affect it |
| Hidden 2 | Middle column pattern | Noise in left columns doesn't affect it |
| Hidden 3 | Vertical vs horizontal ratio | Looks at overall shape |
| Hidden 4 | Edge patterns | Different view of same data |

Even if noise confuses ONE specialist, the others can "vote" correctly!

This is similar in spirit to ensemble robustness: multiple diverse detectors tend to be more reliable than one.

Let's find out:

```python
# =============================================================================
# MLP vs PERCEPTRON: The Showdown on Noisy V/H Images
# =============================================================================

print("="*70)
print("THE SHOWDOWN: Perceptron vs MLP on Noisy V/H Images")
print("="*70)

# Compare Perceptron vs MLP at different noise levels
print("\nComparing performance at different noise levels:\n")
print("  Noise Level | Perceptron | MLP (4 hidden) | Winner")
print("  " + "-"*55)

noise_levels = [0.0, 0.1, 0.2, 0.3]
perceptron_scores = []
mlp_scores = []

for noise in noise_levels:
    np.random.seed(42)
    X_train, y_train = generate_line_dataset(100, noise_level=noise, seed=42)
    X_test, y_test = generate_line_dataset(50, noise_level=noise, seed=999)

    # Train Perceptron
    perceptron = SimplePerceptron(9)
    perceptron.train(X_train, y_train, epochs=100)
    p_correct = sum(1 for x, y in zip(X_test, y_test) if perceptron.predict(x) == y)
    p_acc = p_correct / len(y_test) * 100
    perceptron_scores.append(p_acc)

    # Train MLP
    mlp_model = TrainableMLP(n_inputs=9, n_hidden=4, n_outputs=1)
    mlp_model.train(X_train, y_train, learning_rate=0.5, epochs=100, verbose=False)
    m_correct = sum(1 for x, y in zip(X_test, y_test) if mlp_model.predict(x) == y)
    m_acc = m_correct / len(y_test) * 100
    mlp_scores.append(m_acc)

    winner = "TIE" if abs(p_acc - m_acc) < 2 else ("Perceptron" if p_acc > m_acc else "MLP ✓")
    print(f"    {int(noise*100):3d}%       |   {p_acc:5.1f}%   |    {m_acc:5.1f}%     | {winner}")

# Store the final trained MLP for later visualization
np.random.seed(42)
X_train, y_train = generate_line_dataset(100, noise_level=0.2, seed=42)
X_test, y_test = generate_line_dataset(50, noise_level=0.2, seed=999)
vh_mlp = TrainableMLP(n_inputs=9, n_hidden=4, n_outputs=1)
vh_mlp.train(X_train, y_train, learning_rate=0.5, epochs=100, verbose=False)

print("\n" + "="*70)
print("KEY RESULT:")
print("="*70)
print("""
Compare the columns above: as noise increases, the MLP is expected
to maintain higher accuracy.

WHY? The MLP has MULTIPLE SPECIALISTS:
  • One hidden neuron might detect "left column patterns"
  • Another detects "middle column patterns"
  • Another detects "right column patterns"
  • The output combines their votes

Even if noise confuses one specialist, others can still contribute!
This is the power of the FULL COMMITTEE.
""")
```
```python
# =============================================================================
# VISUALIZING THE COMPARISON: Perceptron vs MLP
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Accuracy comparison bar chart
ax = axes[0]
x = np.arange(len(noise_levels))
width = 0.35

bars1 = ax.bar(x - width/2, perceptron_scores, width, label='Perceptron (1 neuron)', color='#e74c3c')
bars2 = ax.bar(x + width/2, mlp_scores, width, label='MLP (4 hidden neurons)', color='#27ae60')

ax.set_xlabel('Noise Level', fontsize=12)
ax.set_ylabel('Accuracy (%)', fontsize=12)
ax.set_title('The Showdown: Perceptron vs MLP\non Noisy V/H Images', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels([f'{int(n*100)}%' for n in noise_levels])
ax.legend()
ax.set_ylim(50, 105)

# Add value labels
for bar in bars1:
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
            f'{bar.get_height():.0f}%', ha='center', va='bottom', fontsize=9)
for bar in bars2:
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
            f'{bar.get_height():.0f}%', ha='center', va='bottom', fontsize=9)

# Plot 2: The insight
ax = axes[1]
ax.axis('off')

insight_text = """
WHY MLP WINS ON NOISY DATA
════════════════════════════════════════════════════

PERCEPTRON (Single Expert):
┌─────────────────────────────────────┐
│  "I look for ONE pattern:           │
│   middle column = vertical"         │
│                                     │
│  Problem: Noise activates other     │
│  pixels, confusing my ONE detector  │
└─────────────────────────────────────┘

MLP (Committee of Specialists):
┌─────────────────────────────────────┐
│  Specialist 1: "I check LEFT"       │
│  Specialist 2: "I check MIDDLE"     │
│  Specialist 3: "I check RIGHT"      │
│  Specialist 4: "I check PATTERNS"   │
│                                     │
│  Even if noise fools one of us,     │
│  the others provide backup!         │
└─────────────────────────────────────┘

This is REDUNDANCY and SPECIALIZATION working together!
"""

ax.text(0.05, 0.5, insight_text, fontsize=10, family='monospace',
        verticalalignment='center', transform=ax.transAxes,
        bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.9))

plt.tight_layout()
plt.show()
```
```python
# =============================================================================
# VISUALIZING HIDDEN NEURON SPECIALIZATION
# =============================================================================

fig, axes = plt.subplots(2, 3, figsize=(14, 8))

# Top row: Hidden neuron weights (what each specialist looks for)
for i in range(min(4, vh_mlp.n_hidden)):
    ax = axes[0, i] if i < 3 else axes[1, 0]
    weights = vh_mlp.W1[i].reshape(3, 3)
    im = ax.imshow(weights, cmap='RdBu', vmin=-2, vmax=2)
    ax.set_title(f'Hidden Neuron {i+1}\nWeights', fontsize=11, fontweight='bold')
    for r in range(3):
        for c in range(3):
            color = 'white' if abs(weights[r,c]) > 1 else 'black'
            ax.text(c, r, f'{weights[r,c]:.2f}', ha='center', va='center', fontsize=9, color=color)
    ax.axis('off')
    plt.colorbar(im, ax=ax, fraction=0.046)

# Bottom row: Output weights and explanation
ax = axes[1, 1]
ax.bar(range(vh_mlp.n_hidden), vh_mlp.W2[0],
       color=['#e74c3c' if w < 0 else '#27ae60' for w in vh_mlp.W2[0]])
ax.set_xlabel('Hidden Neuron', fontsize=11)
ax.set_ylabel('Output Weight', fontsize=11)
ax.set_title('How Output Combines\nHidden Neurons', fontsize=11, fontweight='bold')
ax.axhline(y=0, color='black', linestyle='-', linewidth=0.5)

# Explanation
ax = axes[1, 2]
ax.axis('off')
explanation = """
WHAT EACH HIDDEN NEURON LEARNED
════════════════════════════════════════

Each hidden neuron became a "specialist":

• Some neurons learned to detect
  VERTICAL patterns (strong middle column)

• Some neurons learned to detect
  HORIZONTAL patterns (strong middle row)

• The output neuron COMBINES these
  specialist opinions:
  - Positive weight = "trust this specialist"
  - Negative weight = "opposite of this specialist"

This is DIVERSITY OF OPINION in action!
"""
ax.text(0.1, 0.5, explanation, fontsize=10, family='monospace',
        verticalalignment='center', transform=ax.transAxes,
        bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))

plt.suptitle('The Committee of Specialists: What Each Hidden Neuron Learned',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
```

7.7 The Universal Approximation Theorem

One of the most powerful results in neural network theory is the Universal Approximation Theorem.

What Does It Say?

"A neural network with a single hidden layer containing enough neurons can approximate ANY continuous function to arbitrary accuracy."

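In simpler terms: with enough hidden neurons, a neural network can represent virtually any pattern! (Formally, the guarantee covers continuous functions on a bounded input region, with a non-linear activation such as sigmoid.)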

What This Means

| Statement | Implication |
|---|---|
| "Any continuous function" | Any input-output relationship without sudden jumps |
| "Single hidden layer" | You only NEED one hidden layer (in theory) |
| "Enough neurons" | May need many neurons for complex functions |
| "Arbitrary accuracy" | Can get as close as you want to the true function |

The Catch

The theorem tells us networks CAN represent any function, but NOT:

  • How to FIND the right weights (training is still hard!)
  • How MANY neurons are needed (could be huge!)
  • Whether training will converge

Why Add MORE Layers?

If one hidden layer is theoretically enough, why do modern networks have many layers?

Deep networks (more layers) are more EFFICIENT:

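| Architecture | Parameters Needed | Why? |
|---|---|---|
| Wide (1 layer, many neurons) | Can grow exponentially for some functions | Each neuron works independently |
| Deep (many layers, fewer neurons) | Often far fewer (polynomial for those same functions) | Layers build on each other |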

The Compositionality Argument: Why Depth Wins

Key insight: Complex functions often have hierarchical structure.

Consider recognizing a face:

  1. Layer 1: Detect edges (simple lines, curves)
  2. Layer 2: Combine edges into parts (eyes, nose, mouth)
  3. Layer 3: Combine parts into faces

Each layer REUSES what the previous layer learned!

With a single wide layer: Each neuron must independently learn to detect "face" from raw pixels. No reuse.

With deep layers: Edge detectors are shared across eye detectors, nose detectors, etc. Massive reuse!

Mathematical example:

  • To represent $f(x) = x^{2^n}$ with a wide network: need on the order of $2^n$ neurons
  • With a deep network: just n layers, each computing the square ($x^2$) of the previous layer's output
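The deep half of this example is easy to sketch: n squaring steps applied in sequence reach the power $2^n$. The function name `deep_power` is illustrative, and each "layer" here is an exact squaring operation rather than a trained layer.

```python
def deep_power(x, n_layers):
    """Apply n_layers squaring steps in sequence; the result is x ** (2 ** n_layers)."""
    value = x
    for _ in range(n_layers):
        value = value * value   # each layer squares the previous layer's output
    return value

x, n = 1.1, 5
print(f"{n} squaring layers: {deep_power(x, n):.6f}")
print(f"direct x ** {2 ** n}:    {x ** (2 ** n):.6f}")
```

Five reused steps reach the 32nd power; each additional layer doubles the exponent, which is the reuse that depth buys.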

What Does "Arbitrary Accuracy" Mean?

The theorem says we can get "arbitrarily close" to any function. Concretely:

$$|f(x) - \hat{f}(x)| < \epsilon \quad \text{for any } \epsilon > 0$$

Where $f$ is the true function, $\hat{f}$ is the network's approximation, and the bound holds for every input $x$ in the region of interest.

Catch: The number of neurons needed grows as $\epsilon$ gets smaller. For very precise approximations, you might need astronomically many neurons!
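The link between neuron count and accuracy can be illustrated with a hand-built network. The sketch below does not train anything: it places one steep sigmoid "step" per sub-interval so that a single hidden layer traces a staircase under sin(x) on [0, π]. The construction, the steepness constant, and the function names are assumptions made for illustration, not the theorem's proof or the notebook's method.

```python
import math

def sigmoid(z):
    # Numerically stable form: avoids overflow in exp() for very negative z
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def build_approximator(f, a, b, n_hidden):
    """Hand-build a one-hidden-layer sigmoid network approximating f on [a, b].

    Each hidden neuron is a steep sigmoid centered on one sub-interval;
    its output weight is the change in f across that sub-interval.
    """
    width = (b - a) / n_hidden
    steepness = 40.0 / width    # assumption: steep enough to behave like a step
    knots = [a + k * width for k in range(n_hidden + 1)]
    centers = [(knots[k] + knots[k + 1]) / 2 for k in range(n_hidden)]
    out_weights = [f(knots[k + 1]) - f(knots[k]) for k in range(n_hidden)]
    bias = f(a)

    def f_hat(x):
        return bias + sum(w * sigmoid(steepness * (x - c))
                          for w, c in zip(out_weights, centers))
    return f_hat

def max_error(f, f_hat, a, b, n_points=1000):
    """Largest |f(x) - f_hat(x)| over an evenly spaced grid on [a, b]."""
    xs = [a + (b - a) * i / (n_points - 1) for i in range(n_points)]
    return max(abs(f(x) - f_hat(x)) for x in xs)

errors = {}
for n_hidden in (4, 16, 64):
    f_hat = build_approximator(math.sin, 0.0, math.pi, n_hidden)
    errors[n_hidden] = max_error(math.sin, f_hat, 0.0, math.pi)
    print(f"{n_hidden:3d} hidden neurons -> max error {errors[n_hidden]:.4f}")
```

The worst-case error shrinks as neurons are added, but only in proportion to the step width, so each extra digit of accuracy costs roughly ten times more neurons with this particular construction.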

Committee Analogy

"One giant room of 1000 generalist committee members CAN solve any problem. But a hierarchical organization with specialists (layer 1: evidence gatherers, layer 2: pattern detectors, layer 3: decision makers) can solve it with fewer people and better organization."


Part 7 Summary: What We've Learned

Key Concepts Mastered

| Concept | Definition | Why It Matters |
|---|---|---|
| Linear Separability | Can separate with one line | Determines what single neurons can learn |
| XOR Problem | Non-linearly separable | Proves single neurons have limits |
| Hidden Layer | Neurons between input and output | Enable non-linear boundaries |
| MLP | Multi-Layer Perceptron | Network with hidden layers |
| Forward Propagation | Input → Hidden → Output | How predictions are made |
| Backpropagation | Chain rule through layers | How MLPs learn |
| Universal Approximation | One hidden layer with enough neurons can approximate any continuous function | Theoretical foundation |

Architecture Comparison

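| Model | Layers | XOR | Clean V/H | Noisy V/H (20%) | Why? |
|---|---|---|---|---|---|
| Perceptron | 1 | ✗ | ~95% | ~70-80% | One detector isn't enough |
| MLP (4 hidden) | 2 | ✓ | ~95% | ~85-95% | Multiple specialists! |
| Deep MLP | 3+ | ✓ | ✓ | ✓ | Even more capacity |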

Two Complementary Examples

| Example | What We Learned |
|---|---|
| XOR Problem | Classic proof that single neurons have fundamental limits |
| Noisy V/H Lines | Practical demonstration using our continuing story |

Both examples taught the same lesson: complex problems need multiple specialists working together.

Committee Analogy Progress

| Part | What Happened |
|---|---|
| Parts 1-3 | Single member learned procedures |
| Part 4 | First case - confused |
| Part 5 | Learned from feedback |
| Part 6 | Performance review |
| Part 7 | Assembled the full committee with specialists! |
| Part 8 | (Next) The committee faces growing pains |

Knowledge Check

How Many Hidden Neurons Do We Need?

A natural question: "Should I use 4 hidden neurons? 10? 100?"

Understanding Network Capacity:

| Hidden Neurons | Capacity | Risk |
|---|---|---|
| Too few (1-2) | Can't represent complex patterns | Underfitting |
| Just right (4-8 for V/H) | Captures patterns without memorizing | Good generalization |
| Too many (50+) | Can memorize training data | Overfitting |

Rules of Thumb:

  1. Start small, increase if needed - Begin with 2-4 hidden neurons, add more if accuracy plateaus
  2. Watch train vs test gap - If training accuracy >> test accuracy, reduce neurons
  3. Problem complexity guides size - Simple patterns need fewer neurons

For our V/H problem:

  • 9 input pixels
  • 2 classes (binary)
  • 4 hidden neurons is reasonable: enough for specialization, not so many that overfitting occurs
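One rough way to compare these sizes is to count trainable parameters. The helper below is a hypothetical sketch for a one-hidden-layer MLP shaped like `TrainableMLP` (weights W1, W2 plus biases b1, b2); parameter count is only a proxy for capacity, not a rule for choosing the size.

```python
def count_parameters(n_inputs, n_hidden, n_outputs=1):
    """Weights plus biases for a one-hidden-layer MLP."""
    hidden_params = n_hidden * n_inputs + n_hidden      # W1 and b1
    output_params = n_outputs * n_hidden + n_outputs    # W2 and b2
    return hidden_params + output_params

for n_hidden in (2, 4, 8, 50):
    total = count_parameters(n_inputs=9, n_hidden=n_hidden)
    print(f"{n_hidden:2d} hidden neurons -> {total:3d} parameters")
```

With 9 inputs, 4 hidden neurons give 45 parameters, while 50 hidden neurons give 551, several times more than the 100 training images used in the showdown cell.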

We'll explore overfitting in detail in Part 8!

```python
# =============================================================================
# KNOWLEDGE CHECK - Part 7
# =============================================================================

print("KNOWLEDGE CHECK - Part 7: Hidden Layers")
print("="*60)

questions = [
    {
        "q": "1. Why can't a single neuron solve the XOR problem?",
        "options": [
            "A) XOR has too many inputs",
            "B) XOR is not linearly separable - can't draw one line to separate classes",
            "C) XOR requires too much memory",
            "D) Single neurons can solve XOR, it just takes longer"
        ],
        "answer": "B",
        "explanation": "XOR points cannot be separated by a single straight line. The (0,0) and (1,1) points are class 0, while (0,1) and (1,0) are class 1 - no line can separate them."
    },
    {
        "q": "2. Why does MLP outperform Perceptron on noisy V/H images?",
        "options": [
            "A) MLP runs faster",
            "B) MLP has multiple specialists - if noise fools one, others provide backup",
            "C) MLP uses less memory",
            "D) Perceptron can't process images"
        ],
        "answer": "B",
        "explanation": "MLP has multiple hidden neurons that each detect different features. Even if noise confuses one specialist, the others can still detect patterns and contribute to the correct answer."
    },
    {
        "q": "3. What is a 'hidden layer' in a neural network?",
        "options": [
            "A) A layer that is invisible to users",
            "B) A layer of neurons between the input and output layers",
            "C) A layer that stores hidden data",
            "D) A layer that only activates sometimes"
        ],
        "answer": "B",
        "explanation": "Hidden layers sit between input and output. They're 'hidden' because we don't directly observe their values - they're internal to the network."
    },
    {
        "q": "4. What does each hidden neuron typically learn to detect?",
        "options": [
            "A) The same pattern as other neurons",
            "B) Random noise",
            "C) Different features or patterns (specialization)",
            "D) Only the output labels"
        ],
        "answer": "C",
        "explanation": "Each hidden neuron specializes in detecting different features. This 'diversity of opinion' is what gives MLPs their power to learn complex patterns."
    },
    {
        "q": "5. In backpropagation through multiple layers, how does error flow?",
        "options": [
            "A) Forward, from input to output",
            "B) Backward, from output to input via chain rule",
            "C) Randomly through the network",
            "D) Only through the hidden layer"
        ],
        "answer": "B",
        "explanation": "Backpropagation passes error backward using the chain rule. Output error → hidden layer error → input weight updates."
    },
    {
        "q": "6. What does the Universal Approximation Theorem tell us?",
        "options": [
            "A) Neural networks always converge",
            "B) One hidden layer with enough neurons can approximate any continuous function",
            "C) Deep networks are always better than shallow ones",
            "D) Training is guaranteed to find optimal weights"
        ],
        "answer": "B",
        "explanation": "The theorem says MLPs CAN represent any continuous function, but it doesn't tell us how to find the weights or how many neurons we need."
    }
]

for q in questions:
    print(f"\n{q['q']}")
    for opt in q["options"]:
        print(f"   {opt}")

print("\n" + "="*60)
print("Scroll down for answers...")
print("="*60)
```
```python
# ANSWERS
print("ANSWERS - Part 7 Knowledge Check")
print("="*60)
for i, q in enumerate(questions, 1):
    print(f"\n{i}. Answer: {q['answer']}")
    print(f"   {q['explanation']}")
```

What's Next?

Congratulations! You've completed Part 7!

We've assembled the full committee - a Multi-Layer Perceptron with hidden layers that can solve problems single neurons cannot. We proved this by solving XOR and saw how hidden neurons specialize in detecting different features.

But There's a Problem...

As neural networks grow deeper and more complex, they face new challenges:

  • Overfitting: The committee memorizes cases instead of learning patterns
  • Vanishing Gradients: Feedback becomes too weak in deep networks
  • Dead Neurons: Some specialists stop contributing entirely

Coming Up in Part 8: Deep Learning Challenges

In the next notebook, we'll explore:

  • Overfitting - When the committee memorizes instead of learns
  • Regularization - Rules to prevent over-specialization
  • Vanishing/Exploding Gradients - The deep network dilemma
  • Solutions - Dropout, batch normalization, and more

Continue to Part 8: part_8_deep_learning_challenges.ipynb


"With great power comes great responsibility - and new challenges."

The Brain's Decision Committee - Growing Pains

  • Hidden layer (specialist sub-committee): Different members learn different intermediate features.
  • XOR (the case one member cannot solve): Some patterns need a team because one straight boundary is not enough.
  • Universal approximation (enough specialists can model the shape): A network can approximate complex functions with the right structure.
