MLP · Part 7 · 55 min · intermediate

The full committee

Move beyond one neuron with hidden layers, XOR, and a multi-layer perceptron.

Neural Network Fundamentals

Part 7: Hidden Layers - The Full Committee

The Brain's Decision Committee - Chapter 7

The Story So Far...

In Parts 1-6, we built and trained a single neuron (Perceptron) that became an expert at detecting vertical vs horizontal lines. We evaluated its performance, understood its decision-making through saliency maps, and confirmed it learned the right patterns.

But our expert has a limitation.

A single neuron can only draw ONE straight line to separate categories. Some problems require more complex boundaries - curves, multiple regions, or intricate patterns.

"Our single committee member has done well, but some problems are too complex for one person. It's time to assemble a full committee with specialists."

What You'll Learn in Part 7

By the end of this notebook, you will understand:

Why single neurons fail - The famous XOR problem AND challenging V/H variations
What hidden layers are - Adding neurons between input and output
How hidden neurons specialize - Different neurons detect different features
The Multi-Layer Perceptron (MLP) - A complete neural network architecture
Forward propagation - How data flows through multiple layers
Backpropagation through layers - Training with chain rule
Universal approximation - Why deep networks can learn (almost) anything

Two Complementary Examples

In this notebook, we'll explore limitations of single neurons through two lenses:

Example	Why Include It?
XOR Problem	The famous textbook example - you'll encounter this everywhere in ML literature
Challenging V/H Lines	Our continuing story - noisy images, multiple positions, harder patterns

Both examples teach the same lesson: some problems need multiple neurons working together.

Prerequisites

Make sure you've completed:

Parts 0-1: Matrices (neural_network_fundamentals.ipynb)
Part 2: Single Neuron (part_2_single_neuron.ipynb)
Part 3: Activation Functions (part_3_activation_functions.ipynb)
Part 4: The Perceptron (part_4_perceptron.ipynb)
Part 5: Training (part_5_training.ipynb)
Part 6: Evaluation (part_6_evaluation.ipynb)

Setup: Import Dependencies

# =============================================================================
# PART 7: HIDDEN LAYERS - SETUP AND IMPORTS
# =============================================================================

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, clear_output

# Try to import ipywidgets for interactive features
try:
    import ipywidgets as widgets
    WIDGETS_AVAILABLE = True
except ImportError:
    WIDGETS_AVAILABLE = False
    print("Note: ipywidgets not installed. Interactive features will be limited.")

# Set up matplotlib style
style_options = ['seaborn-v0_8-whitegrid', 'seaborn-whitegrid', 'ggplot', 'default']
for style in style_options:
    try:
        plt.style.use(style)
        break
    except OSError:
        continue

plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 12
np.random.seed(42)

# -----------------------------------------------------------------------------
# Helper functions from previous notebooks
# -----------------------------------------------------------------------------

def sigmoid(z):
    """Sigmoid activation: maps any value to range (0, 1)."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_derivative(z):
    """Derivative of sigmoid: σ(z) * (1 - σ(z))"""
    s = sigmoid(z)
    return s * (1 - s)

def relu(z):
    """ReLU activation: max(0, z)"""
    return np.maximum(0, z)

def relu_derivative(z):
    """Derivative of ReLU: 1 if z > 0, else 0"""
    return (z > 0).astype(float)

print("Setup complete!")
print("="*60)

7.1 The Limitation of Single Neurons: The XOR Problem

Our Perceptron works great for vertical vs horizontal lines. But there's a famous problem that NO single neuron can solve: the XOR problem.

What IS XOR?

XOR (exclusive OR) is a logical operation that outputs TRUE when inputs are DIFFERENT:

Input A	Input B	XOR Output
0	0	0
0	1	1
1	0	1
1	1	0

In words: "TRUE if one or the other, but not both."

Real-world examples of XOR:

A two-way (hallway) light circuit: the light is on when exactly ONE of the two switches is up; if BOTH are up (or both down), it's off
A menu choice: "Soup OR salad comes with your meal" means one or the other, but not both
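The truth table above can be checked in a few lines. This is a small sketch that is not part of the original notebook; it relies on Python's built-in `^` operator, and the name `xor_table` is purely illustrative.

```python
# XOR on 0/1 integers: Python's ^ operator is bitwise exclusive OR.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_table = [(a, b, a ^ b) for a, b in inputs]

for a, b, out in xor_table:
    print(f"{a} XOR {b} = {out}")

# "The inputs are different" is the same test written another way.
assert all(out == int(a != b) for a, b, out in xor_table)
```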

Why Can't a Single Neuron Solve XOR?

A single neuron creates a linear decision boundary - a straight line that separates the two classes.

What IS a Decision Boundary?

A decision boundary is the line (in 2D), plane (in 3D), or hyperplane (in higher dimensions) where the model switches from predicting one class to another.

For a single neuron: $z = w_{1} x_{1} + w_{2} x_{2} + b = 0$

This equation defines a straight line. Points on one side get z > 0 (predict class 1), points on the other side get z < 0 (predict class 0).

Why is this a line? Rearranging (assuming $w_{2} \neq 0$): $x_{2} = - \frac{w_{1}}{w_{2}} x_{1} - \frac{b}{w_{2}}$

This is the equation of a line with slope $- \frac{w_{1}}{w_{2}}$ and intercept $- \frac{b}{w_{2}}$. (If $w_{2} = 0$, the boundary is the vertical line $x_{1} = - \frac{b}{w_{1}}$.)

The Problem: No matter what values we choose for $w_{1}$, $w_{2}$, and $b$, we can only draw ONE straight line!
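To make the boundary equation concrete, here is a minimal sketch with hand-picked weights. The values `w1 = w2 = 1` and `b = -0.5` are an assumption chosen for illustration (they happen to implement an OR gate); nothing here was trained in this series. The code computes the slope and intercept from the formula above and classifies the four corner points by the sign of z.

```python
import numpy as np

# Hand-picked (hypothetical) parameters for illustration: an OR-gate neuron.
w1, w2, b = 1.0, 1.0, -0.5

# Boundary line x2 = slope * x1 + intercept (requires w2 != 0).
slope = -w1 / w2
intercept = -b / w2
print(f"Boundary: x2 = {slope:.1f} * x1 + {intercept:.1f}")

# The four corner points of the unit square.
points = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# z > 0 -> class 1, otherwise class 0.
z = points @ np.array([w1, w2]) + b
boundary_preds = (z > 0).astype(int)

for (x1, x2), zi, p in zip(points, z, boundary_preds):
    print(f"({x1}, {x2}): z = {zi:+.1f} -> class {p}")
```

Changing the three numbers moves or rotates the line, but the output is always "one side versus the other side" of a single straight boundary.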

Let's visualize the problem:

# =============================================================================
# THE XOR PROBLEM: Visualizing Why Single Neurons Fail
# =============================================================================

print("="*70)
print("THE XOR PROBLEM: A Single Neuron's Nightmare")
print("="*70)

# XOR data
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: The XOR problem
ax = axes[0]
colors = ['red' if y == 0 else 'blue' for y in y_xor]
ax.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=300, edgecolor='black', linewidth=2)
for i, (x, y, label) in enumerate(zip(X_xor[:, 0], X_xor[:, 1], y_xor)):
    ax.annotate(f'({x},{y})→{label}', (x, y), xytext=(10, 10), 
                textcoords='offset points', fontsize=10)
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('Input A', fontsize=12)
ax.set_ylabel('Input B', fontsize=12)
ax.set_title('XOR Data Points\n(Red=0, Blue=1)', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)

# Plot 2: Can you draw ONE line to separate them?
ax = axes[1]
ax.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=300, edgecolor='black', linewidth=2)

# Try some lines
x_line = np.linspace(-0.5, 1.5, 100)
ax.plot(x_line, x_line, 'g--', linewidth=2, label='Diagonal?')
ax.plot(x_line, 0.5 * np.ones_like(x_line), 'm--', linewidth=2, label='Horizontal?')
ax.plot(0.5 * np.ones_like(x_line), x_line, 'c--', linewidth=2, label='Vertical?')

ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('Input A', fontsize=12)
ax.set_ylabel('Input B', fontsize=12)
ax.set_title('Try to Draw ONE Line\nto Separate Red from Blue', fontsize=14, fontweight='bold')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

# Plot 3: The solution requires TWO lines (or a curve)
ax = axes[2]
ax.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=300, edgecolor='black', linewidth=2)

# Two lines that together solve XOR
ax.plot(x_line, x_line - 0.3, 'g-', linewidth=2, label='Line 1')
ax.plot(x_line, x_line + 0.3, 'g-', linewidth=2, label='Line 2')
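# The band between the two lines contains the RED points (0,0) and (1,1);
# the blue points (0,1) and (1,0) lie outside it.
ax.fill_between(x_line, x_line - 0.3, x_line + 0.3, alpha=0.2, color='red', label='Red region (class 0)')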

ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('Input A', fontsize=12)
ax.set_ylabel('Input B', fontsize=12)
ax.set_title('Solution: TWO Lines\n(Requires Hidden Layer!)', fontsize=14, fontweight='bold')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("""
KEY INSIGHT: The XOR Problem
════════════════════════════════════════════════════════════════════════

The red points (0) are at corners (0,0) and (1,1).
The blue points (1) are at corners (0,1) and (1,0).

NO SINGLE STRAIGHT LINE can separate red from blue!

This is called being "not linearly separable."

Why it matters:
• A single neuron can only create ONE linear boundary
• XOR requires a more complex, non-linear boundary
• This was proven impossible for single-layer Perceptrons in 1969 (Minsky & Papert)
• The solution: ADD MORE NEURONS → Hidden Layers!
""")

What IS Linear Separability?

Linear Separability is a property of a dataset where the classes can be separated by a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions).

Problem Type	Linearly Separable?	Single Neuron Can Solve?
AND gate	Yes	✓
OR gate	Yes	✓
XOR gate	No	✗
Vertical vs Horizontal lines (clean, fixed position)	Yes	✓
Noisy/partial V/H lines	Harder	Struggles!
Complex overlapping patterns	No	✗
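The first three rows of the table can be probed with a brute-force search. The sketch below is an illustration under stated assumptions: the helper name `find_separating_neuron` and the coarse grid of candidate values are made up for this example. A failed grid search does not prove impossibility on its own; it only agrees with the mathematical result for XOR.

```python
import itertools
import numpy as np

# Inputs shared by all three gates.
X_gates = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
gates = {
    "AND": np.array([0, 0, 0, 1]),
    "OR":  np.array([0, 1, 1, 1]),
    "XOR": np.array([0, 1, 1, 0]),
}

# Coarse, arbitrary grid of candidate values for w1, w2 and b.
grid = np.linspace(-2, 2, 9)  # -2.0, -1.5, ..., 2.0

def find_separating_neuron(X, y):
    """Return the first (w1, w2, b) on the grid that classifies every point, or None."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        preds = (X @ np.array([w1, w2]) + b > 0).astype(int)
        if np.array_equal(preds, y):
            return (float(w1), float(w2), float(b))
    return None

gate_results = {name: find_separating_neuron(X_gates, y) for name, y in gates.items()}
for name, params in gate_results.items():
    if params is None:
        print(f"{name}: no setting on this grid works")
    else:
        print(f"{name}: works with (w1, w2, b) = {params}")
```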

Why Does Linear Separability Matter?

This is the fundamental limit of single-layer neural networks:

Model	Decision Boundary	What It Can Learn
Single neuron	One line/plane	Only linearly separable patterns
MLP (hidden layer)	Multiple lines → curves	Non-linear patterns
Deep MLP	Very complex shapes	Almost anything!

Mathematically: A single neuron computes $σ(w \cdot x + b)$. The activation function $σ$ is monotonic (never decreasing), so thresholding its output is the same as thresholding $w \cdot x + b$ itself. That means it can only split the input space with ONE hyperplane. That's the fundamental constraint.

The Historical "AI Winter"

In 1969, Marvin Minsky and Seymour Papert published a book called "Perceptrons" proving that single-layer networks couldn't solve XOR or any non-linearly-separable problem.

Why was this so damaging? They proved it was a MATHEMATICAL impossibility, not just a training difficulty. No amount of training could make a single neuron learn XOR - it literally cannot represent that function.
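The impossibility argument is short enough to write out. What follows is the standard textbook derivation, sketched with the notation of the decision-boundary equation above and the convention "predict 1 when $z > 0$, otherwise 0". Each XOR row becomes one inequality:

```latex
\begin{align*}
(0,0) \mapsto 0 &: \quad b \le 0 \\
(0,1) \mapsto 1 &: \quad w_{2} + b > 0 \\
(1,0) \mapsto 1 &: \quad w_{1} + b > 0 \\
(1,1) \mapsto 0 &: \quad w_{1} + w_{2} + b \le 0
\end{align*}
% Adding the two middle inequalities and using b <= 0:
\[
w_{1} + w_{2} + 2b > 0
\;\Longrightarrow\;
w_{1} + w_{2} + b > -b \ge 0 ,
\]
% which contradicts the last requirement w_1 + w_2 + b <= 0.
```

No choice of $w_{1}$, $w_{2}$, and $b$ satisfies all four conditions at once, so the failure is one of representation rather than training.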

7.1.5 Back to Our Story: When V/H Classification Gets Hard

XOR is the famous textbook example, but let's see how the same limitation affects our vertical/horizontal line detection problem.

Our Perceptron's Success... and Its Limits

In Parts 4-6, our single-neuron Perceptron achieved ~95-100% accuracy on clean V/H lines. But what happens when the problem gets harder?

Challenge	What Changes	Why It's Harder
Noisy images	Random pixels added	Pattern obscured
Lines in ANY position	Not just middle	One "middle detector" isn't enough
Partial/broken lines	Missing pixels	Incomplete evidence
Thin vs thick lines	Different widths	Multiple patterns to detect

Let's see if our single neuron can handle these challenges:

The Historical "AI Winter"

Minsky and Papert's result contributed to the first "AI Winter" - a period when funding for neural network research dried up because many people concluded that neural networks were fundamentally limited.

The Solution Was Simple: Add More Layers!

The idea of stacking layers was already known, but there was no practical way to train such networks until backpropagation became widely used in the 1980s:

Instead of one expert, use a TEAM of experts (neurons) working together!

7.2 The Panel of Experts: Hidden Layers

What IS a Hidden Layer?

A hidden layer is a layer of neurons that sits between the input and output:

INPUT (9 pixels) → [HIDDEN LAYER] → OUTPUT (1 prediction)
                   (multiple neurons)

Why "hidden"? Because we never directly see their values during normal use - they're internal to the network.

Why Multiple Neurons Help

Each neuron in the hidden layer can detect a different feature:

Hidden Neuron	What It Might Detect
Neuron 1	"Is there a vertical pattern on the LEFT?"
Neuron 2	"Is there a vertical pattern in the MIDDLE?"
Neuron 3	"Is there a vertical pattern on the RIGHT?"
Neuron 4	"Is there a horizontal pattern on TOP?"

The output neuron then combines these feature detections to make a final decision.
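As a rough illustration of "specialists plus a combiner", the sketch below wires up a tiny network by hand. Everything in it is an assumption made for the example: the weights are set manually rather than learned, there is one hidden ReLU neuron per column, and the names `W_hidden`, `tiny_mlp_predict`, and `make_line` are hypothetical. A trained network would not necessarily discover these exact detectors.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hand-set (not trained) hidden layer: one neuron per column of the 3x3 image.
# Each row of W_hidden has 1s on the three pixels of "its" column.
W_hidden = np.zeros((3, 9))
for col in range(3):
    column_mask = np.zeros((3, 3))
    column_mask[:, col] = 1
    W_hidden[col] = column_mask.flatten()
b_hidden = np.full(3, -2.0)  # fire only when all three pixels in the column are lit

# Output neuron: adds up the three detectors; any detection means "vertical".
w_out = np.ones(3)
b_out = -0.5

def tiny_mlp_predict(image):
    """Forward pass: 9 pixels -> 3 column detectors -> 1 decision (1 = vertical)."""
    hidden = relu(W_hidden @ image.flatten() + b_hidden)
    return int(w_out @ hidden + b_out > 0)

def make_line(kind, index):
    """Build a clean 3x3 line image; kind is 'v' or 'h', index is 0..2."""
    image = np.zeros((3, 3))
    if kind == "v":
        image[:, index] = 1
    else:
        image[index, :] = 1
    return image

for kind in ("v", "h"):
    for index in range(3):
        print(kind, index, "->", tiny_mlp_predict(make_line(kind, index)))
```

Each hidden neuron answers one narrow question, and the output neuron only has to ask whether any specialist fired.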

The Critical Role of Activation Functions

Why do we NEED activation functions between layers?

Without activations, stacking layers does nothing! Here's why:

Without activation: $\text{Output} = W_{2} \cdot (W_{1} \cdot x) = (W_{2} \cdot W_{1}) \cdot x = W_{\text{combined}} \cdot x$

The composition of two linear transformations is just... another linear transformation! We could replace the entire network with a single layer.
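The collapse is easy to confirm numerically. This is a minimal sketch: the layer sizes (9 inputs, 4 hidden, 1 output), the random values, and the seed are arbitrary assumptions, and biases are left out to match the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Hypothetical layer sizes: 9 inputs -> 4 hidden -> 1 output, no biases.
W1 = rng.normal(size=(4, 9))
W2 = rng.normal(size=(1, 4))
x = rng.normal(size=9)

# Two linear layers applied one after the other...
two_layer_output = W2 @ (W1 @ x)

# ...equal one linear layer with the pre-multiplied weight matrix.
W_combined = W2 @ W1          # shape (1, 9)
one_layer_output = W_combined @ x

print(two_layer_output, one_layer_output)
print("Same output:", np.allclose(two_layer_output, one_layer_output))
```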

With activation: $\text{Output} = W_{2} \cdot σ(W_{1} \cdot x)$

The non-linear $σ$ "breaks" the linearity. Now we have:

Layer 1 creates multiple linear boundaries
Activation function "bends" these boundaries
Layer 2 combines the bent boundaries

This is how MLPs create curves from straight lines!
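To see the effect of the non-linearity, here is a sketch of a two-neuron hidden layer that computes XOR exactly. The weights are a well-known hand-set construction using ReLU (an assumption for this example; they are not learned, and the notebook's own MLP may use different activations). Running the same weights with the activation removed shows the network falling back to a plain linear function.

```python
import numpy as np

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# Hand-set weights (a textbook construction, not learned here).
W1 = np.array([[1.0, 1.0],    # hidden neuron 1: x1 + x2
               [1.0, 1.0]])   # hidden neuron 2: x1 + x2 - 1 (bias below)
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])    # output: h1 - 2 * h2

def forward(X, activation):
    """Hidden layer with the given activation, then a linear output neuron."""
    hidden = activation(X @ W1.T + b1)
    return hidden @ w2

relu_out = forward(X_xor, lambda z: np.maximum(0, z))
linear_out = forward(X_xor, lambda z: z)   # activation removed

print("With ReLU:   ", relu_out)    # matches y_xor
print("Without ReLU:", linear_out)  # equals 2 - (x1 + x2), a linear function
```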

Committee Analogy: The Sub-Committee

"Before, we had ONE committee member who had to look at everything. Now we have a sub-committee of specialists:

Specialist 1 checks for patterns in the left region
Specialist 2 checks the middle region
Specialist 3 checks the right region
The final committee member listens to all specialists and makes the decision

This division of labor lets us solve more complex problems!"

Diversity of Opinion

Key insight from Part 1.7: If all hidden neurons look for the same thing, they're redundant!

We need diversity - each hidden neuron should specialize in detecting something different. This happens naturally during training as they adjust to minimize error.
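One caveat is worth a quick experiment: specialization depends on the hidden neurons starting out different. The sketch below assumes a 2-2-1 sigmoid network on XOR with biases fixed at zero and a cross-entropy loss; the helper name `hidden_weight_gradients` and the seed are hypothetical. It compares the gradients the two hidden neurons receive when they start identical versus when they start random.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

def hidden_weight_gradients(W1, w2):
    """Loss gradient w.r.t. W1, summed over the four examples (biases fixed at 0)."""
    a1 = sigmoid(X @ W1.T)                     # hidden activations, shape (4, 2)
    a2 = sigmoid(a1 @ w2)                      # network output, shape (4,)
    dz2 = a2 - y                               # output error (sigmoid + cross-entropy)
    dz1 = np.outer(dz2, w2) * a1 * (1 - a1)    # error sent back to each hidden neuron
    return dz1.T @ X                           # one row of gradients per hidden neuron

# Case 1: both hidden neurons start out identical.
grad_same = hidden_weight_gradients(np.full((2, 2), 0.5), np.array([0.5, 0.5]))

# Case 2: small random starting weights (arbitrary seed).
rng = np.random.default_rng(0)
grad_random = hidden_weight_gradients(rng.normal(scale=0.5, size=(2, 2)),
                                      rng.normal(scale=0.5, size=2))

print("Identical start ->", grad_same)
print("Random start    ->", grad_random)
```

With an identical start, both rows of the gradient match, so the two neurons would receive the same update and stay copies of each other. Random initial weights are what give each neuron room to drift toward its own feature.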

# =============================================================================
# THE V/H CHALLENGE: When Our Perceptron Struggles
# =============================================================================

print("="*70)
print("BACK TO OUR STORY: Challenging V/H Classification")
print("="*70)

# Dataset generator from previous parts
def generate_line_dataset(n_samples=100, noise_level=0.0, seed=None):
    """Generate vertical (1) and horizontal (0) line images."""
    if seed is not None:
        np.random.seed(seed)
    
    X, y = [], []
    for i in range(n_samples):
        image = np.zeros((3, 3))
        if i < n_samples // 2:  # Vertical
            col = np.random.randint(0, 3)  # ANY column, not just middle
            image[:, col] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(1)
        else:  # Horizontal
            row = np.random.randint(0, 3)
            image[row, :] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(0)
    
    X, y = np.array(X), np.array(y)
    shuffle_idx = np.random.permutation(n_samples)
    return X[shuffle_idx], y[shuffle_idx]

# Simple Perceptron from Part 5
class SimplePerceptron:
    def __init__(self, n_inputs):
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0
    
    def forward(self, x):
        z = np.dot(self.weights, x.flatten()) + self.bias
        return sigmoid(z)
    
    def predict(self, x):
        return 1 if self.forward(x) >= 0.5 else 0
    
    def train(self, X, y, lr=0.5, epochs=100):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                pred = self.forward(xi)
                error = pred - yi
                self.weights -= lr * error * xi.flatten()
                self.bias -= lr * error
        return self

# Test on different difficulty levels
print("\nTesting Single Neuron on Increasingly Difficult V/H Problems:\n")

difficulties = [
    ("Clean (0% noise)", 0.0),
    ("Light noise (10%)", 0.1),
    ("Medium noise (20%)", 0.2),
    ("Heavy noise (30%)", 0.3)
]

results = []
for name, noise in difficulties:
    np.random.seed(42)
    X_train, y_train = generate_line_dataset(100, noise_level=noise, seed=42)
    X_test, y_test = generate_line_dataset(50, noise_level=noise, seed=999)
    
    perceptron = SimplePerceptron(9)
    perceptron.train(X_train, y_train, epochs=100)
    
    correct = sum(1 for x, y in zip(X_test, y_test) if perceptron.predict(x) == y)
    accuracy = correct / len(y_test) * 100
    results.append((name, accuracy))
    print(f"  {name:25s} → Accuracy: {accuracy:5.1f}%")

print("\n" + "="*70)
print("KEY OBSERVATION:")
print("="*70)
print("""
Compare the accuracies above: our single neuron struggles on these harder V/H problems!

Why? A single neuron can only learn ONE weighted template.
But this dataset has:
  • Lines in different positions that one template cannot match
  • Extra bright pixels (noise) confusing the detector
  • Partial patterns that need multiple feature detectors

Just like XOR, complex V/H patterns need MULTIPLE SPECIALISTS!
""")

# =============================================================================
# VISUALIZING THE CHALLENGE: Clean vs Noisy V/H Images
# =============================================================================

fig, axes = plt.subplots(2, 5, figsize=(15, 6))

# Generate examples at different noise levels
np.random.seed(123)

# Top row: Vertical lines with increasing noise
noises = [0.0, 0.1, 0.2, 0.3, 0.4]
for i, noise in enumerate(noises):
    ax = axes[0, i]
    image = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]], dtype=float)
    if noise > 0:
        image = np.clip(image + np.random.randn(3, 3) * noise, 0, 1)
    ax.imshow(image, cmap='Blues', vmin=0, vmax=1)
    ax.set_title(f'{int(noise*100)}% Noise', fontsize=11)
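    # Hide ticks instead of calling ax.axis('off'), which would also hide the row label
    ax.set_xticks([])
    ax.set_yticks([])
    if i == 0:
        ax.set_ylabel('VERTICAL', fontsize=12, fontweight='bold')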

# Bottom row: Horizontal lines with increasing noise  
for i, noise in enumerate(noises):
    ax = axes[1, i]
    image = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]], dtype=float)
    if noise > 0:
        image = np.clip(image + np.random.randn(3, 3) * noise, 0, 1)
    ax.imshow(image, cmap='Blues', vmin=0, vmax=1)
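    # Hide ticks instead of calling ax.axis('off'), which would also hide the row label
    ax.set_xticks([])
    ax.set_yticks([])
    if i == 0:
        ax.set_ylabel('HORIZONTAL', fontsize=12, fontweight='bold')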

plt.suptitle('The Challenge: As Noise Increases, Patterns Become Harder to Detect', 
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("""
With heavy noise, even WE have trouble seeing the pattern!

A single neuron that learned "middle column bright = vertical" will struggle
when noise makes OTHER pixels bright too.

SOLUTION: Multiple specialists, each detecting different aspects of the pattern.
""")

# =============================================================================
# VISUALIZING THE MLP ARCHITECTURE
# =============================================================================

def draw_neural_network(ax, layer_sizes, layer_names=None):
    """Draw a neural network architecture diagram."""
    n_layers = len(layer_sizes)
    max_neurons = max(layer_sizes)
    
    # Spacing
    layer_spacing = 1.5
    neuron_spacing = 0.8
    
    positions = []
    
    for layer_idx, n_neurons in enumerate(layer_sizes):
        layer_positions = []
        x = layer_idx * layer_spacing
        
        # Center the neurons vertically
        start_y = (max_neurons - n_neurons) * neuron_spacing / 2
        
        for neuron_idx in range(n_neurons):
            y = start_y + neuron_idx * neuron_spacing
            layer_positions.append((x, y))
            
            # Draw neuron
            color = '#3498db' if layer_idx == 0 else '#e74c3c' if layer_idx == n_layers - 1 else '#27ae60'
            # Use facecolor/edgecolor: passing color= together with ec= overrides the black edge
            circle = plt.Circle((x, y), 0.15, facecolor=color, edgecolor='black', linewidth=2, zorder=3)
            ax.add_patch(circle)
        
        positions.append(layer_positions)
    
    # Draw connections
    for layer_idx in range(n_layers - 1):
        for start_pos in positions[layer_idx]:
            for end_pos in positions[layer_idx + 1]:
                ax.plot([start_pos[0], end_pos[0]], [start_pos[1], end_pos[1]], 
                       'gray', alpha=0.3, linewidth=0.5, zorder=1)
    
    # Add layer labels
    if layer_names:
        for layer_idx, name in enumerate(layer_names):
            x = layer_idx * layer_spacing
            ax.text(x, -1, name, ha='center', fontsize=10, fontweight='bold')
    
    ax.set_xlim(-0.5, (n_layers - 1) * layer_spacing + 0.5)
    ax.set_ylim(-1.5, max_neurons * neuron_spacing)
    ax.set_aspect('equal')
    ax.axis('off')

# Create visualization
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Plot 1: Perceptron (Part 4-6)
ax = axes[0]
draw_neural_network(ax, [9, 1], ['Input\n(9 pixels)', 'Output\n(1 neuron)'])
ax.set_title('PERCEPTRON (Parts 4-6)\nSingle Layer', fontsize=12, fontweight='bold')

# Plot 2: Simple MLP
ax = axes[1]
draw_neural_network(ax, [9, 4, 1], ['Input\n(9 pixels)', 'Hidden\n(4 neurons)', 'Output\n(1 neuron)'])
ax.set_title('MLP: One Hidden Layer\nThe "Panel of Experts"', fontsize=12, fontweight='bold')

# Plot 3: Deeper MLP
ax = axes[2]
draw_neural_network(ax, [9, 6, 4, 1], ['Input\n(9)', 'Hidden 1\n(6)', 'Hidden 2\n(4)', 'Output\n(1)'])
ax.set_title('DEEP MLP: Two Hidden Layers\n"Hierarchy of Specialists"', fontsize=12, fontweight='bold')

plt.suptitle('Evolution of Neural Network Architectures', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("""
ARCHITECTURE COMPARISON:
════════════════════════════════════════════════════════════════════════

PERCEPTRON (Parts 4-6):
  • Input → Output directly
  • Can only learn linear boundaries
  • Limited to simple problems

MLP WITH ONE HIDDEN LAYER:
  • Input → Hidden → Output
  • Each hidden neuron detects different features
  • Can learn non-linear boundaries (like XOR!)

DEEP MLP (Multiple Hidden Layers):
  • Input → Hidden 1 → Hidden 2 → ... → Output
  • Each layer builds on the previous layer's features
  • Can learn very complex patterns

═══════════════════════════════════════════════════════════════════════
Color Legend: 🔵 Input | 🟢 Hidden | 🔴 Output
""")

7.3 The Multi-Layer Perceptron (MLP): Architecture and Math

Now let's understand the mathematics behind multi-layer networks.

What IS an MLP?

A Multi-Layer Perceptron (MLP) is a neural network with:

One input layer
One or more hidden layers
One output layer

Each layer is fully connected to the next (every neuron connects to every neuron in the next layer).

The Math: Forward Propagation

For an MLP with one hidden layer, the computation flows in two stages:

Stage 1: Input → Hidden $h = σ (W_{1} \cdot x + b_{1})$

Stage 2: Hidden → Output $\hat{y} = σ (W_{2} \cdot h + b_{2})$

Where:

$x$ = input vector (our 9 pixels)
$W_{1}$ = weights from input to hidden layer (matrix!)
$b_{1}$ = biases for hidden neurons
$h$ = hidden layer activations
$W_{2}$ = weights from hidden to output
$b_{2}$ = bias for output neuron
$σ$ = activation function (sigmoid, ReLU, etc.)
$\hat{y}$ = final prediction

Breaking It Down Step by Step

Let's trace through with concrete dimensions:

Component	Shape	Example
Input $x$	(9,)	9 pixels
Weights $W_{1}$	(4, 9)	4 hidden neurons, each with 9 weights
Biases $b_{1}$	(4,)	4 biases, one per hidden neuron
Hidden $h$	(4,)	4 hidden activations
Weights $W_{2}$	(1, 4)	1 output neuron, 4 weights (from hidden)
Bias $b_{2}$	(1,)	1 bias for output
Output $\hat{y}$	(1,)	Final prediction

Why These Specific Shapes?

Matrix multiplication rule: $(m \times n) \cdot (n \times 1) = (m \times 1)$

The shapes MUST align:

$W_{1}$ is $(4, 9)$ because we have 4 hidden neurons, each looking at 9 inputs
$W_{1} \cdot x$ gives us $(4, 9) \cdot (9, 1) = (4, 1)$ - one value per hidden neuron ✓
$W_{2}$ is $(1, 4)$ because we have 1 output looking at 4 hidden neurons
$W_{2} \cdot h$ gives us $(1, 4) \cdot (4, 1) = (1, 1)$ - our single output ✓

The key insight: Each row of $W_{1}$ represents ONE hidden neuron's "view" of the input. Each entry in the single row of $W_{2}$ represents how much the output trusts one hidden neuron.
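The shape rules above can be checked in a few lines. This is a minimal sketch with random placeholder weights (the seed and values are arbitrary assumptions; only the shapes matter), and the activation is left out because it never changes a shape:

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed, for repeatability only

x = rng.random(9)                   # input: 9 pixels, shape (9,)
W1 = rng.standard_normal((4, 9))    # 4 hidden neurons, 9 weights each
b1 = np.zeros(4)
W2 = rng.standard_normal((1, 4))    # 1 output neuron, 4 weights
b2 = np.zeros(1)

z1 = W1 @ x + b1                    # (4, 9) @ (9,) -> (4,)
z2 = W2 @ z1 + b2                   # (1, 4) @ (4,) -> (1,)
print(z1.shape, z2.shape)

# Row i of W1 is hidden neuron i's private set of 9 weights
neuron_0 = W1[0] @ x + b1[0]
print(np.isclose(neuron_0, z1[0]))
```

If a shape is wrong (say `W2` were `(4, 1)`), the `@` operator raises a `ValueError`, which is usually the fastest way to catch a wiring mistake.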

Why This Works for XOR

Each hidden neuron can learn ONE linear boundary. With multiple hidden neurons, we can combine their boundaries to create complex, non-linear decision regions!

Concrete XOR example with 2 hidden neurons:

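Hidden neuron 1 might learn "A OR B": a diagonal line near the bottom-left corner that separates (0, 0) from the other three points
Hidden neuron 2 might learn "A AND B": a diagonal line near the top-right corner that separates (1, 1) from the other three points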
Output combines them: "(A OR B) AND NOT (A AND B)" = XOR!
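To make the OR / AND / combine idea concrete, here is a small sketch with hand-picked weights. These weights are an illustrative assumption, not what training would find, and a hard threshold (`step`) stands in for the sigmoid so the logic is easy to read:

```python
import numpy as np

def step(z):
    """Hard threshold: 1 where z > 0, else 0 (a stand-in for a saturated sigmoid)."""
    return (np.asarray(z) > 0).astype(int)

# Hand-picked (not learned) weights, chosen only for illustration
W1 = np.array([[1.0, 1.0],     # hidden neuron 1: fires when A + B > 0.5 -> A OR B
               [1.0, 1.0]])    # hidden neuron 2: fires when A + B > 1.5 -> A AND B
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -1.0])     # output: OR minus AND
b2 = -0.5

def xor_net(a, b):
    h = step(W1 @ np.array([a, b]) + b1)   # the two specialists
    return int(step(W2 @ h + b2))          # (A OR B) AND NOT (A AND B)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_net(a, b))
```

The two hidden neurons share the same weights and differ only in their bias, so they draw two parallel lines. XOR is the strip between those lines.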

# =============================================================================
# BUILDING THE MLP: Step by Step Implementation
# =============================================================================

class MLP:
    """
    Multi-Layer Perceptron with one hidden layer.
    
    Architecture: Input → Hidden (with activation) → Output (with sigmoid)
    
    This is the "Full Committee" - multiple experts working together!
    """
    
    def __init__(self, n_inputs, n_hidden, n_outputs=1):
        """
        Initialize the MLP with random weights.
        
        Parameters:
            n_inputs: Number of input features (e.g., 9 for 3x3 image)
            n_hidden: Number of neurons in hidden layer (the "specialists")
            n_outputs: Number of output neurons (1 for binary classification)
        """
        self.n_inputs = n_inputs
        self.n_hidden = n_hidden
        self.n_outputs = n_outputs
        
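        # Initialize weights with small random values.
        # Note: sqrt(2 / fan_in) is He initialization; Xavier (Glorot) would use
        # sqrt(1 / fan_in) or sqrt(2 / (fan_in + fan_out)). Both follow the same idea.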
        # W1: weights from input to hidden (shape: n_hidden x n_inputs)
        self.W1 = np.random.randn(n_hidden, n_inputs) * np.sqrt(2.0 / n_inputs)
        self.b1 = np.zeros(n_hidden)
        
        # W2: weights from hidden to output (shape: n_outputs x n_hidden)
        self.W2 = np.random.randn(n_outputs, n_hidden) * np.sqrt(2.0 / n_hidden)
        self.b2 = np.zeros(n_outputs)
        
        # For storing values during forward pass (needed for backprop)
        self.z1 = None  # Pre-activation of hidden layer
        self.h = None   # Hidden layer activations
        self.z2 = None  # Pre-activation of output
        self.output = None
        
        # Training history
        self.loss_history = []
        self.accuracy_history = []
    
    def forward(self, x):
        """
        Forward propagation: Input → Hidden → Output
        
        This is like the "Committee Meeting" where:
        1. Each specialist (hidden neuron) examines the evidence
        2. The final decision maker combines their opinions
        """
        x = np.array(x).flatten()
        
        # Stage 1: Input → Hidden
        # Each hidden neuron computes its weighted sum and activates
        self.z1 = np.dot(self.W1, x) + self.b1  # (n_hidden,)
        self.h = sigmoid(self.z1)               # (n_hidden,)
        
        # Stage 2: Hidden → Output
        # The output neuron combines hidden activations
        self.z2 = np.dot(self.W2, self.h) + self.b2  # (n_outputs,)
        self.output = sigmoid(self.z2)               # (n_outputs,)
        
        return self.output[0] if self.n_outputs == 1 else self.output
    
    def predict(self, x):
        """Make a binary prediction (0 or 1)."""
        return 1 if self.forward(x) >= 0.5 else 0

print("="*70)
print("MLP CLASS: The Full Committee Implementation")
print("="*70)

# Create an example MLP
mlp = MLP(n_inputs=9, n_hidden=4, n_outputs=1)

print(f"""
MLP Architecture Created:
  • Input layer: {mlp.n_inputs} neurons (our 9 pixels)
  • Hidden layer: {mlp.n_hidden} neurons (the specialists)
  • Output layer: {mlp.n_outputs} neuron (final decision)

Weight Shapes:
  • W1 (input→hidden): {mlp.W1.shape} = {mlp.n_hidden} hidden neurons × {mlp.n_inputs} inputs
  • b1 (hidden biases): {mlp.b1.shape} = {mlp.n_hidden} biases
  • W2 (hidden→output): {mlp.W2.shape} = {mlp.n_outputs} output × {mlp.n_hidden} hidden
  • b2 (output bias): {mlp.b2.shape} = {mlp.n_outputs} bias

Total Parameters: {mlp.W1.size + mlp.b1.size + mlp.W2.size + mlp.b2.size}
  (Compare to Perceptron: {9 + 1} parameters)
""")

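Understanding Xavier Initialization

In the code above, we used:

self.W1 = np.random.randn(n_hidden, n_inputs) * np.sqrt(2.0 / n_inputs)

What IS Xavier initialization, and WHY do we need it?

Initialization	Formula	Effect
All zeros	w = 0	All neurons learn the same thing (symmetry is never broken)
Large random	w ~ N(0, 1)	Signals explode, sigmoids saturate, gradients vanish
Xavier (Glorot)	Var(w) = 1/n, or 2/(n_in + n_out)	Keeps signal variance roughly stable
He	Var(w) = 2/n	Same idea with an extra factor of 2, designed for ReLU

A note on naming: strictly speaking, the factor np.sqrt(2.0 / n_inputs) in the code is He initialization. Xavier initialization would use np.sqrt(1.0 / n_inputs) or np.sqrt(2.0 / (n_inputs + n_hidden)). Both follow the same variance argument, and for a network this small either one works.

The Math Behind Xavier:

When we compute $z = w_{1} x_{1} + w_{2} x_{2} + \dots + w_{n} x_{n}$:

Each $w_{i} x_{i}$ term has variance ≈ $Var (w) \times Var (x)$ (assuming zero-mean, independent weights and inputs)
With n terms, total variance ≈ $n \times Var (w) \times Var (x)$

The problem: If $Var (w) = 1$, the variance grows by a factor of n at each layer!

Layer 1: variance × 9
Layer 2: variance × 9 × 4
The values grow multiplicatively with every layer (ignoring the squashing effect of the activation function)!

The solution: Set $Var (w) = 1 / n$ so that $n \times Var (w) \times Var (x) ≈ Var (x)$, i.e. output variance ≈ input variance. He initialization doubles this to $2 / n$ because ReLU zeroes out roughly half of its inputs.

This keeps signals "healthy" as they flow through the network.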
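The variance argument is easy to test empirically. The sketch below is an illustration under stated assumptions: the layer size `n = 256`, the sample counts, and the helper name `preactivation_variance` are all made up for this check. It compares three weight scales: Var(w) = 1, Var(w) = 1/n, and Var(w) = 2/n (the scale produced by `np.sqrt(2.0 / n_inputs)`).

```python
import numpy as np

rng = np.random.default_rng(0)     # arbitrary seed
n = 256                            # inputs per neuron (illustrative size)
x = rng.standard_normal((2000, n)) # 2000 random inputs, each entry with variance ~1

def preactivation_variance(weight_std, n_neurons=128):
    """Empirical variance of z = W x when weights are drawn from N(0, weight_std^2)."""
    W = rng.standard_normal((n_neurons, n)) * weight_std
    z = x @ W.T                    # shape (2000, n_neurons)
    return z.var()

var_unit = preactivation_variance(1.0)                  # Var(w) = 1
var_xavier = preactivation_variance(np.sqrt(1.0 / n))   # Var(w) = 1/n
var_he = preactivation_variance(np.sqrt(2.0 / n))       # Var(w) = 2/n

print(f"Var(w) = 1   -> Var(z) ~ {var_unit:.1f}   (about n = {n})")
print(f"Var(w) = 1/n -> Var(z) ~ {var_xavier:.2f} (about 1)")
print(f"Var(w) = 2/n -> Var(z) ~ {var_he:.2f} (about 2)")
```

With unit-variance weights the pre-activations are roughly n times larger in variance than the inputs, which would push a sigmoid deep into saturation. The 1/n scale keeps the variance near 1, and the 2/n scale gives about 2 before the activation.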

Tracing Through the Forward Pass: What's Actually Happening?

Before we run the code, let's understand what the forward pass computes at each step:

Stage 1: Input → Hidden (What each specialist sees)

For hidden neuron $i$ :

Weighted sum: $z_{1} [i] = W_{1} [i, 0] \cdot x [0] + W_{1} [i, 1] \cdot x [1] + \dots + W_{1} [i, 8] \cdot x [8] + b_{1} [i]$
Activation: $h [i] = σ (z_{1} [i])$ → transforms to range (0, 1)

Each hidden neuron is essentially asking: "How strongly does this input match MY pattern?"

Stage 2: Hidden → Output (The final vote)

Combine opinions: $z_{2} = W_{2} [0, 0] \cdot h [0] + W_{2} [0, 1] \cdot h [1] + \dots + W_{2} [0, 3] \cdot h [3] + b_{2}$
Final decision: $output = σ (z_{2})$ → probability of class 1

The output neuron asks: "Given what all specialists reported, what's my final decision?"
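The per-neuron sums above and the matrix form are the same computation. The sketch below writes out both and checks that they agree. It assumes random placeholder weights and defines its own `sigmoid` so it runs on its own (the notebook's `sigmoid` from earlier parts is assumed to behave the same way):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)                       # arbitrary seed
W1, b1 = rng.standard_normal((4, 9)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal(1)
x = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0], dtype=float)  # vertical line, flattened

# Stage 1: one specialist at a time
h_loop = np.zeros(4)
for i in range(4):
    z1_i = sum(W1[i, j] * x[j] for j in range(9)) + b1[i]
    h_loop[i] = sigmoid(z1_i)

# Stage 2: the final vote
z2 = sum(W2[0, j] * h_loop[j] for j in range(4)) + b2[0]
out_loop = sigmoid(z2)

# The same computation in matrix form
h_vec = sigmoid(W1 @ x + b1)
out_vec = sigmoid(W2 @ h_vec + b2)[0]

print(np.allclose(h_loop, h_vec), np.isclose(out_loop, out_vec))
```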

# =============================================================================
# FORWARD PASS: Step-by-Step Demonstration
# =============================================================================

print("="*70)
print("FORWARD PASS: Tracing Data Through the Network")
print("="*70)

# Create test input (vertical line from Part 1)
vertical_line = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
x = vertical_line.flatten()

print(f"\nInput (vertical line as 9 pixels):")
print(f"  x = {x}")

print("\n" + "-"*70)
print("STAGE 1: Input → Hidden Layer")
print("-"*70)

# Stage 1: Compute hidden layer
z1 = np.dot(mlp.W1, x) + mlp.b1
h = sigmoid(z1)

print(f"""
Step 1a: Compute weighted sums for each hidden neuron
  z1 = W1 · x + b1
  
  For each hidden neuron i:
    z1[i] = Σ(W1[i,j] × x[j]) + b1[i]
    
  z1 = {z1}

Step 1b: Apply activation function
  h = sigmoid(z1)
  
  For each hidden neuron:
    h[i] = 1 / (1 + e^(-z1[i]))
    
  h = {h}
  
  These are the "opinions" from our {mlp.n_hidden} specialists!
""")

print("-"*70)
print("STAGE 2: Hidden Layer → Output")
print("-"*70)

# Stage 2: Compute output
z2 = np.dot(mlp.W2, h) + mlp.b2
output = sigmoid(z2)

print(f"""
Step 2a: Combine hidden activations
  z2 = W2 · h + b2
  
  The output neuron combines all specialist opinions:
    z2 = Σ(W2[j] × h[j]) + b2
    
  z2 = {z2}

Step 2b: Apply sigmoid for final prediction
  output = sigmoid(z2)
  
  output = {output}
  
  Final decision: {"VERTICAL" if output[0] >= 0.5 else "HORIZONTAL"}
  (With random weights, this is just a guess!)
""")

7.4 Backpropagation Through Multiple Layers

In Part 5, we learned backpropagation for a single neuron. With multiple layers, we need to chain the gradients - passing blame backward through each layer.

The Challenge: Who's Responsible for the Error?

When the network makes a mistake, we need to figure out:

How much should we adjust the output weights (W2)?
How much should we adjust the hidden weights (W1)?

The difficulty: W1 doesn't directly produce the output! It influences the hidden layer, which THEN influences the output. This is like asking: "If a manager's employee made a mistake, how much is the manager responsible?"

The Chain Rule: Passing Blame Backward

The key mathematical tool is the chain rule from calculus:

$\frac{\partial L}{\partial W_{1}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial W_{1}}$

What IS the Chain Rule?

The chain rule says: if A affects B, and B affects C, then A's effect on C is:

$\frac{d C}{d A} = \frac{d C}{d B} \times \frac{d B}{d A}$

Intuitive example: If increasing temperature by 1°C increases pressure by 2 units, and increasing pressure by 1 unit increases volume by 3 units, then increasing temperature by 1°C increases volume by 2 × 3 = 6 units.
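The temperature example can be checked numerically. In this sketch the functions `pressure` and `volume` are hypothetical stand-ins that encode the example's rates (2 units per degree and 3 units per pressure unit), and derivatives are estimated with central differences:

```python
def pressure(temp):
    return 2.0 * temp        # +2 pressure units per degree (from the example)

def volume(p):
    return 3.0 * p           # +3 volume units per pressure unit

def numerical_derivative(f, x, eps=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

t0 = 20.0
dP_dT = numerical_derivative(pressure, t0)
dV_dP = numerical_derivative(volume, pressure(t0))
dV_dT = numerical_derivative(lambda t: volume(pressure(t)), t0)

print(f"dP/dT = {dP_dT:.3f}, dV/dP = {dV_dP:.3f}")
print(f"product = {dP_dT * dV_dP:.3f}, direct dV/dT = {dV_dT:.3f}")
```

The product of the two local rates matches the rate measured end to end, which is what backpropagation relies on at every layer.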

Think of it as a blame chain:

Loss depends on output prediction (how wrong is the answer?)
Output prediction depends on hidden activations (what did specialists say?)
Hidden activations depend on hidden weights (what were specialists looking for?)

Committee Analogy: Tracing Blame

"When the committee makes a wrong decision:

First, we see how wrong the final decision was (output error)
Then we ask: 'Which specialists contributed to this error?' (hidden layer blame)
Finally: 'What evidence did each specialist focus on that led them astray?' (input weights)

The blame flows BACKWARD through the committee hierarchy."

The Backpropagation Steps

Step 1: Output Error $δ_{2} = \hat{y} - y$

Step 2: Hidden Layer Error (via chain rule) $δ_{1} = (W_{2}^{T} \cdot δ_{2}) ⊙ σ' (z_{1})$

Where $⊙$ is element-wise multiplication and $σ'$ is the derivative of sigmoid.

Step 3: Update Weights

$W_{2} = W_{2} - α \cdot δ_{2} \cdot h^{T}$

$W_{1} = W_{1} - α \cdot δ_{1} \cdot x^{T}$
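A common way to gain confidence in these formulas is a gradient check: compute the gradients from the delta equations, then compare them with finite-difference estimates of the loss. The sketch below assumes a sigmoid output with binary cross-entropy loss (the setting in which δ2 = ŷ − y holds). The helper names `forward`, `loss`, and `numerical_grad`, and the random placeholder weights, are illustrative and not part of the notebook's classes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, b1, W2, b2, x):
    z1 = W1 @ x + b1
    h = sigmoid(z1)
    y_hat = sigmoid(W2 @ h + b2)
    return z1, h, y_hat

rng = np.random.default_rng(0)                          # arbitrary seed
W1, b1 = rng.standard_normal((4, 9)), np.zeros(4)
W2, b2 = rng.standard_normal((1, 4)), np.zeros(1)
x = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0], dtype=float)  # vertical line
y = 1.0

def loss():
    """Binary cross-entropy for the current (global) weights."""
    y_hat = forward(W1, b1, W2, b2, x)[2][0]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Analytic gradients from Steps 1 and 2
z1, h, y_hat = forward(W1, b1, W2, b2, x)
delta2 = y_hat - y                          # shape (1,)
delta1 = (W2.T @ delta2) * h * (1 - h)      # sigmoid'(z1) = h * (1 - h), shape (4,)
dW2 = np.outer(delta2, h)                   # shape (1, 4)
dW1 = np.outer(delta1, x)                   # shape (4, 9)

def numerical_grad(param, eps=1e-5):
    """Central differences: nudge one weight at a time, then restore it."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + eps
        loss_plus = loss()
        param[idx] = original - eps
        loss_minus = loss()
        param[idx] = original
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

max_err_W1 = np.max(np.abs(dW1 - numerical_grad(W1)))
max_err_W2 = np.max(np.abs(dW2 - numerical_grad(W2)))
print(f"max |analytic - numerical|: W1 {max_err_W1:.2e}, W2 {max_err_W2:.2e}")
```

Both differences should be tiny (many orders of magnitude below the gradients themselves). A large mismatch usually points to a transposed matrix or a missing activation derivative.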

Why We Store Values During Forward Pass

Notice that backpropagation needs values computed during forward pass:

$h$ (hidden activations) - needed to update W2
$z_{1}$ (pre-activation) - needed for sigmoid derivative
$x$ (input) - needed to update W1

This is why training a neural network uses memory! We can't compute gradients without the values from the forward pass. This creates a fundamental trade-off between memory and computation:

Memory Usage	Gradient Computation
Store all intermediate values	Gradients computed directly from the cache (standard backprop)
Store only some values	Missing values are recomputed during the backward pass (gradient checkpointing); gradients stay exact, at the cost of extra compute

For deep networks with billions of parameters, memory management becomes critical!

# =============================================================================
# COMPLETE MLP WITH TRAINING
# =============================================================================

class TrainableMLP:
    """
    Multi-Layer Perceptron with training capability.
    
    This is the complete "Full Committee" that can learn!
    """
    
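    def __init__(self, n_inputs, n_hidden, n_outputs=1):
        """Initialize the MLP with scaled random weights (He-style variant of Xavier)."""
        self.n_inputs = n_inputs
        self.n_hidden = n_hidden
        self.n_outputs = n_outputs
        
        # sqrt(2 / fan_in) scaling (He initialization) for better training;
        # Xavier (Glorot) would use sqrt(1 / fan_in) or sqrt(2 / (fan_in + fan_out))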
        self.W1 = np.random.randn(n_hidden, n_inputs) * np.sqrt(2.0 / n_inputs)
        self.b1 = np.zeros(n_hidden)
        self.W2 = np.random.randn(n_outputs, n_hidden) * np.sqrt(2.0 / n_hidden)
        self.b2 = np.zeros(n_outputs)
        
        # Cache for forward pass values
        self.x = None
        self.z1 = None
        self.h = None
        self.z2 = None
        self.output = None
        
        # Training history
        self.loss_history = []
        self.accuracy_history = []
    
    def forward(self, x):
        """Forward propagation."""
        self.x = np.array(x).flatten()
        
        # Hidden layer
        self.z1 = np.dot(self.W1, self.x) + self.b1
        self.h = sigmoid(self.z1)
        
        # Output layer
        self.z2 = np.dot(self.W2, self.h) + self.b2
        self.output = sigmoid(self.z2)
        
        return self.output[0] if self.n_outputs == 1 else self.output
    
    def predict(self, x):
        """Binary prediction."""
        return 1 if self.forward(x) >= 0.5 else 0
    
    def backward(self, y_true, learning_rate):
        """
        Backpropagation: compute gradients and update weights.
        
        This is where the "blame assignment" happens!
        """
        # Output layer error
        delta2 = self.output - y_true  # Shape: (1,) or (n_outputs,)
        
        # Hidden layer error (chain rule!)
        # delta1 = (W2.T @ delta2) * sigmoid_derivative(z1)
        delta1 = np.dot(self.W2.T, delta2) * sigmoid_derivative(self.z1)
        
        # Update output weights (W2, b2)
        # dW2 = delta2 @ h.T (outer product)
        dW2 = np.outer(delta2, self.h)
        db2 = delta2
        
        # Update hidden weights (W1, b1)
        # dW1 = delta1 @ x.T (outer product)
        dW1 = np.outer(delta1, self.x)
        db1 = delta1
        
        # Apply updates
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
    
    def compute_loss(self, y_true, y_pred):
        """Binary cross-entropy loss."""
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    
    def train(self, X, y, learning_rate=0.5, epochs=100, verbose=True):
        """
        Train the MLP on data.
        
        Parameters:
            X: Training inputs (n_samples, n_features)
            y: Training labels (n_samples,)
            learning_rate: Step size for gradient descent
            epochs: Number of passes through the dataset
            verbose: Whether to print progress
        """
        self.loss_history = []
        self.accuracy_history = []
        
        for epoch in range(epochs):
            total_loss = 0
            correct = 0
            
            for i in range(len(X)):
                # Forward pass
                y_pred = self.forward(X[i])
                
                # Compute loss
                loss = self.compute_loss(y[i], y_pred)
                total_loss += loss
                
                # Check accuracy
                if (y_pred >= 0.5 and y[i] == 1) or (y_pred < 0.5 and y[i] == 0):
                    correct += 1
                
                # Backward pass (this is where learning happens!)
                self.backward(np.array([y[i]]), learning_rate)
            
            # Record history
            avg_loss = total_loss / len(X)
            accuracy = correct / len(X)
            self.loss_history.append(avg_loss)
            self.accuracy_history.append(accuracy)
            
            if verbose and (epoch + 1) % 20 == 0:
                print(f"  Epoch {epoch+1:3d}/{epochs}: Loss = {avg_loss:.4f}, Accuracy = {accuracy*100:.1f}%")
        
        if verbose:
            print(f"\nTraining complete! Final accuracy: {self.accuracy_history[-1]*100:.1f}%")
        
        return self.loss_history

print("TrainableMLP class defined!")
print("This MLP can learn through backpropagation.")

Understanding the Backward Method

Let's trace through what backward actually computes:

Output error: delta2 = self.output - y_true

If predicted 0.8 but true is 0, error = +0.8 (need to decrease)
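This comes from the derivative of the BCE loss with respect to z2 when the output activation is sigmoid: the sigmoid derivative cancels, leaving just (prediction - target)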

Hidden error: delta1 = np.dot(self.W2.T, delta2) * sigmoid_derivative(self.z1)

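First part: Distribute the output error to the hidden neurons in proportion to their output weights (W2.T)
Second part: Scale by how "sensitive" each neuron was, i.e. the sigmoid derivative at z1 (close to 0 for a saturated neuron)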

Why the outer product for updates?

dW2 = np.outer(delta2, self.h) computes: error × what hidden neurons said

Each weight connects ONE hidden neuron to output. If that hidden neuron was highly active AND error was large, that weight contributed a lot → big update.
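A tiny numeric example makes the outer product tangible. The hidden activations and learning rate below are made-up numbers chosen for illustration; the error of 0.8 is the "predicted 0.8, true 0" case from above:

```python
import numpy as np

delta2 = np.array([0.8])               # output error: predicted 0.8, target 0
h = np.array([0.9, 0.1, 0.5, 0.0])     # made-up hidden activations

dW2 = np.outer(delta2, h)              # shape (1, 4), same shape as W2
print(dW2)                             # approximately [[0.72 0.08 0.40 0.00]]

# One gradient-descent step on a placeholder W2
W2 = np.zeros((1, 4))
learning_rate = 0.5                    # illustrative value
W2_new = W2 - learning_rate * dW2
print(W2_new)
```

The most active hidden neuron (0.9) receives the largest correction, while the silent one (0.0) is left untouched: it said nothing, so it takes no blame.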

7.5 The MLP Solves XOR!

Now let's prove that our MLP can solve the XOR problem that defeated single neurons.

# =============================================================================
# MLP SOLVES XOR: Proof That Hidden Layers Work!
# =============================================================================

print("="*70)
print("MLP vs XOR: The Hidden Layer Advantage")
print("="*70)

# XOR data
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# Create and train MLP with 4 hidden neurons
np.random.seed(42)
xor_mlp = TrainableMLP(n_inputs=2, n_hidden=4, n_outputs=1)

print("\nTraining MLP on XOR problem...")
print("(Remember: A single neuron CANNOT solve this!)\n")

xor_mlp.train(X_xor, y_xor, learning_rate=1.0, epochs=1000, verbose=True)

# Test predictions
print("\n" + "-"*70)
print("XOR PREDICTIONS:")
print("-"*70)
print("\n  Input A | Input B | Expected | Predicted | Correct?")
print("  " + "-"*50)

all_correct = True
for x, y_true in zip(X_xor, y_xor):
    y_pred = xor_mlp.predict(x)
    correct = "Yes" if y_pred == y_true else "No"
    if y_pred != y_true:
        all_correct = False
    print(f"    {x[0]}     |    {x[1]}    |    {y_true}     |     {y_pred}     |   {correct}")

print("\n" + "="*70)
if all_correct:
    print("SUCCESS! The MLP solved XOR!")
    print("Hidden layers enable learning non-linear patterns!")
else:
    print("Still learning... (try running training again)")
print("="*70)

What XOR Taught Us

The XOR success proves several important points:

Lesson	Why It Matters
Hidden layers enable non-linear boundaries	We can now solve problems impossible for single neurons
4 hidden neurons > 2 for XOR	2 hidden neurons are enough in principle, but extra capacity makes training less likely to get stuck
Higher learning rate (1.0)	With only four training points, larger updates help training converge in fewer epochs
More epochs (1000)	Non-linear problems can take longer to converge

The key insight: Each hidden neuron learned to detect one "piece" of the XOR pattern. The output neuron combined these pieces into the full solution.
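To make the "pieces" idea concrete, here is a minimal sketch of an XOR network whose weights are chosen by hand rather than learned. It assumes a simple threshold (step) activation and a 2-2-1 layout; the names `step`, `W_hidden`, `b_hidden`, `w_out`, `b_out`, and `xor_by_hand` are illustrative and not part of the notebook's `TrainableMLP`. A trained network will usually find different weights, but the division of labour is similar.

```python
import numpy as np

def step(z):
    """Threshold activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

# Hand-picked (NOT learned) weights for a 2-2-1 network.
W_hidden = np.array([[1.0, 1.0],   # hidden neuron 1: fires if A OR B
                     [1.0, 1.0]])  # hidden neuron 2: fires if A AND B
b_hidden = np.array([-0.5, -1.5])
w_out = np.array([1.0, -1.0])      # output: "OR" but NOT "AND"
b_out = -0.5

def xor_by_hand(X):
    """Forward pass: inputs -> two hidden 'pieces' -> combined output."""
    hidden = step(X @ W_hidden.T + b_hidden)   # shape (n_samples, 2)
    return step(hidden @ w_out + b_out)        # shape (n_samples,)

X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
hand_preds = xor_by_hand(X_demo)
print(hand_preds)  # [0 1 1 0]
```

Hidden neuron 1 answers "is at least one input on?", hidden neuron 2 answers "are both inputs on?", and the output neuron fires only when the first is true and the second is false. Neither hidden neuron solves XOR alone; the combination does.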

Now let's return to our V/H classification story and see if this same power translates to real image problems!

7.6 Back to Our Through-Line: MLP vs Perceptron on V/H

We've proven the MLP can solve XOR. Now let's return to our continuing V/H story and see if the MLP can handle the challenging noisy images that stumped our single neuron.

The Comparison We've Been Building To

Model	Clean V/H	Noisy V/H (20%)	Why?
Perceptron	~95-100%	~70-80%	One pattern detector isn't enough
MLP	~95-100%	?	Multiple specialists should help!

Why Should MLP Help With Noise?

The Perceptron's problem with noise:

It learned ONE template (e.g., "middle column bright = vertical")
Noise adds random bright pixels everywhere
Random brightness confuses the single template

How MLP specialists help:

Specialist	What It Might Detect	Why Noise-Robust
Hidden 1	Left column pattern	Noise in right columns doesn't affect it
Hidden 2	Middle column pattern	Noise in left columns doesn't affect it
Hidden 3	Vertical vs horizontal ratio	Looks at overall shape
Hidden 4	Edge patterns	Different view of same data

Even if noise confuses ONE specialist, the others can "vote" correctly!

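This is similar in spirit to ensemble robustness - multiple diverse detectors are often more reliable than one. (In an MLP the "vote" is a learned weighted sum, and the detectors are trained together rather than independently.)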
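The voting intuition can be checked with a little probability. The sketch below is an idealized model, not a description of what the MLP actually computes: it assumes each detector is wrong independently with the same probability and that the final answer is a plain majority vote. The function name `majority_vote_error` and the 20% error rate are assumptions made for illustration.

```python
from math import comb

def majority_vote_error(n_voters, p_wrong):
    """Probability that more than half of n independent voters are wrong."""
    k_min = n_voters // 2 + 1  # smallest number of wrong votes that flips the result
    return sum(
        comb(n_voters, k) * p_wrong**k * (1 - p_wrong) ** (n_voters - k)
        for k in range(k_min, n_voters + 1)
    )

single = majority_vote_error(1, 0.2)      # one detector on its own
committee = majority_vote_error(5, 0.2)   # five independent detectors voting

print(f"1 detector:  {single:.3f}")     # 0.200
print(f"5 detectors: {committee:.3f}")  # 0.058
```

Under these assumptions, five detectors that are each wrong 20% of the time are collectively wrong only about 6% of the time. Real hidden neurons are correlated, so the benefit in practice is smaller and has to be measured.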

Let's find out:

cell 022


# =============================================================================
# MLP vs PERCEPTRON: The Showdown on Noisy V/H Images
# =============================================================================

print("="*70)
print("THE SHOWDOWN: Perceptron vs MLP on Noisy V/H Images")
print("="*70)

# Compare Perceptron vs MLP at different noise levels
print("\nComparing performance at different noise levels:\n")
print("  Noise Level | Perceptron | MLP (4 hidden) | Winner")
print("  " + "-"*55)

noise_levels = [0.0, 0.1, 0.2, 0.3]
perceptron_scores = []
mlp_scores = []

for noise in noise_levels:
    np.random.seed(42)
    X_train, y_train = generate_line_dataset(100, noise_level=noise, seed=42)
    X_test, y_test = generate_line_dataset(50, noise_level=noise, seed=999)
    
    # Train Perceptron
    perceptron = SimplePerceptron(9)
    perceptron.train(X_train, y_train, epochs=100)
    p_correct = sum(1 for x, y in zip(X_test, y_test) if perceptron.predict(x) == y)
    p_acc = p_correct / len(y_test) * 100
    perceptron_scores.append(p_acc)
    
    # Train MLP
    mlp_model = TrainableMLP(n_inputs=9, n_hidden=4, n_outputs=1)
    mlp_model.train(X_train, y_train, learning_rate=0.5, epochs=100, verbose=False)
    m_correct = sum(1 for x, y in zip(X_test, y_test) if mlp_model.predict(x) == y)
    m_acc = m_correct / len(y_test) * 100
    mlp_scores.append(m_acc)
    
    winner = "TIE" if abs(p_acc - m_acc) < 2 else ("Perceptron" if p_acc > m_acc else "MLP ✓")
    print(f"    {int(noise*100):3d}%       |   {p_acc:5.1f}%   |    {m_acc:5.1f}%     | {winner}")

# Store the final trained MLP for later visualization
np.random.seed(42)
X_train, y_train = generate_line_dataset(100, noise_level=0.2, seed=42)
X_test, y_test = generate_line_dataset(50, noise_level=0.2, seed=999)
vh_mlp = TrainableMLP(n_inputs=9, n_hidden=4, n_outputs=1)
vh_mlp.train(X_train, y_train, learning_rate=0.5, epochs=100, verbose=False)

print("\n" + "="*70)
print("KEY RESULT:")
print("="*70)
print("""
As noise increases, the MLP often maintains higher accuracy (check the table above for this run).

WHY? The MLP has MULTIPLE SPECIALISTS:
  • One hidden neuron might detect "left column patterns"
  • Another detects "middle column patterns"  
  • Another detects "right column patterns"
  • The output combines their votes

Even if noise confuses one specialist, others can still contribute!
This is the power of the FULL COMMITTEE.
""")

cell 023


# =============================================================================
# VISUALIZING THE COMPARISON: Perceptron vs MLP
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Accuracy comparison bar chart
ax = axes[0]
x = np.arange(len(noise_levels))
width = 0.35

bars1 = ax.bar(x - width/2, perceptron_scores, width, label='Perceptron (1 neuron)', color='#e74c3c')
bars2 = ax.bar(x + width/2, mlp_scores, width, label='MLP (4 hidden neurons)', color='#27ae60')

ax.set_xlabel('Noise Level', fontsize=12)
ax.set_ylabel('Accuracy (%)', fontsize=12)
ax.set_title('The Showdown: Perceptron vs MLP\non Noisy V/H Images', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels([f'{int(n*100)}%' for n in noise_levels])
ax.legend()
ax.set_ylim(50, 105)

# Add value labels
for bar in bars1:
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, 
            f'{bar.get_height():.0f}%', ha='center', va='bottom', fontsize=9)
for bar in bars2:
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, 
            f'{bar.get_height():.0f}%', ha='center', va='bottom', fontsize=9)

# Plot 2: The insight
ax = axes[1]
ax.axis('off')

insight_text = """
WHY MLP WINS ON NOISY DATA
════════════════════════════════════════════════════

PERCEPTRON (Single Expert):
┌─────────────────────────────────────┐
│  "I look for ONE pattern:           │
│   middle column = vertical"         │
│                                     │
│  Problem: Noise activates other     │
│  pixels, confusing my ONE detector  │
└─────────────────────────────────────┘

MLP (Committee of Specialists):
┌─────────────────────────────────────┐
│  Specialist 1: "I check LEFT"       │
│  Specialist 2: "I check MIDDLE"     │
│  Specialist 3: "I check RIGHT"      │
│  Specialist 4: "I check PATTERNS"   │
│                                     │
│  Even if noise fools one of us,     │
│  the others provide backup!         │
└─────────────────────────────────────┘

This is REDUNDANCY and SPECIALIZATION working together!
"""

ax.text(0.05, 0.5, insight_text, fontsize=10, family='monospace',
        verticalalignment='center', transform=ax.transAxes,
        bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.9))

plt.tight_layout()
plt.show()

cell 024


# =============================================================================
# VISUALIZING HIDDEN NEURON SPECIALIZATION
# =============================================================================

fig, axes = plt.subplots(2, 3, figsize=(14, 8))

# Top row: Hidden neuron weights (what each specialist looks for)
for i in range(min(4, vh_mlp.n_hidden)):
    ax = axes[0, i] if i < 3 else axes[1, 0]
    weights = vh_mlp.W1[i].reshape(3, 3)
    im = ax.imshow(weights, cmap='RdBu', vmin=-2, vmax=2)
    ax.set_title(f'Hidden Neuron {i+1}\nWeights', fontsize=11, fontweight='bold')
    for r in range(3):
        for c in range(3):
            color = 'white' if abs(weights[r,c]) > 1 else 'black'
            ax.text(c, r, f'{weights[r,c]:.2f}', ha='center', va='center', fontsize=9, color=color)
    ax.axis('off')
    plt.colorbar(im, ax=ax, fraction=0.046)

# Bottom row: Output weights and explanation
ax = axes[1, 1]
ax.bar(range(vh_mlp.n_hidden), vh_mlp.W2[0], color=['#e74c3c' if w < 0 else '#27ae60' for w in vh_mlp.W2[0]])
ax.set_xlabel('Hidden Neuron', fontsize=11)
ax.set_ylabel('Output Weight', fontsize=11)
ax.set_title('How Output Combines\nHidden Neurons', fontsize=11, fontweight='bold')
ax.axhline(y=0, color='black', linestyle='-', linewidth=0.5)

# Explanation
ax = axes[1, 2]
ax.axis('off')
explanation = """
WHAT EACH HIDDEN NEURON LEARNED
════════════════════════════════════════

Each hidden neuron became a "specialist":

• Some neurons learned to detect
  VERTICAL patterns (strong middle column)
  
• Some neurons learned to detect  
  HORIZONTAL patterns (strong middle row)

• The output neuron COMBINES these
  specialist opinions:
  - Positive weight = "trust this specialist"
  - Negative weight = "opposite of this specialist"

This is DIVERSITY OF OPINION in action!
"""
ax.text(0.1, 0.5, explanation, fontsize=10, family='monospace',
        verticalalignment='center', transform=ax.transAxes,
        bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))

plt.suptitle('The Committee of Specialists: What Each Hidden Neuron Learned', 
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

7.7 The Universal Approximation Theorem

One of the most powerful results in neural network theory is the Universal Approximation Theorem.

What Does It Say?

"A neural network with a single hidden layer containing enough neurons (and a suitable non-linear activation) can approximate ANY continuous function on a bounded input region to arbitrary accuracy."

In simpler terms: with enough hidden neurons, a neural network can represent virtually any continuous pattern!

What This Means

Statement	Implication
"Any continuous function"	Any input-output relationship without sudden jumps
"Single hidden layer"	You only NEED one hidden layer (in theory)
"Enough neurons"	May need many neurons for complex functions
"Arbitrary accuracy"	Can get as close as you want to the true function

The Catch

The theorem tells us networks CAN represent any function, but NOT:

How to FIND the right weights (training is still hard!)
How MANY neurons are needed (could be huge!)
Whether training will converge

Why Add MORE Layers?

If one hidden layer is theoretically enough, why do modern networks have many layers?

Deep networks (more layers) are more EFFICIENT:

Architecture	Parameters Needed	Why?
Wide (1 layer, many neurons)	Can grow exponentially for some functions	Each neuron works independently
Deep (many layers, fewer neurons)	Often polynomial for the same functions	Layers build on each other

The Compositionality Argument: Why Depth Wins

Key insight: Complex functions often have hierarchical structure.

Consider recognizing a face:

Layer 1: Detect edges (simple lines, curves)
Layer 2: Combine edges into parts (eyes, nose, mouth)
Layer 3: Combine parts into faces

Each layer REUSES what the previous layer learned!

With a single wide layer: Each neuron must independently learn to detect "face" from raw pixels. No reuse.

With deep layers: Edge detectors are shared across eye detectors, nose detectors, etc. Massive reuse!

Mathematical example:

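To represent $f(x) = x^{2^{n}}$ with a wide network: the single layer must handle a polynomial of degree $2^{n}$ all at once, which can take on the order of $2^{n}$ neurons
With a deep network: just $n$ layers, each squaring the output of the previous layer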
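The composition idea behind this example can be sketched with plain arithmetic. This is not a neural network: each "layer" here squares its input exactly, whereas a real layer would only approximate squaring with its neurons. The names `deep_power`, `n_demo`, and `x_demo` are illustrative.

```python
def deep_power(x, n_layers):
    """Compute x ** (2 ** n_layers) by squaring n_layers times."""
    value = x
    for _ in range(n_layers):
        value = value * value  # one "layer": square the previous layer's output
    return value

n_demo = 5
x_demo = 1.1
deep_result = deep_power(x_demo, n_demo)  # 5 squaring steps
direct_result = x_demo ** (2 ** n_demo)   # x ** 32 computed directly

print(deep_result, direct_result)
print(deep_power(2, 3))  # 256, i.e. 2 ** 8
```

Five reused squaring steps reach the 32nd power; each extra step doubles the exponent. That doubling per layer is the kind of reuse that depth provides.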

What Does "Arbitrary Accuracy" Mean?

The theorem says we can get "arbitrarily close" to any function. Concretely:

$|f(x) - \hat{f}(x)| < \epsilon$ for any $\epsilon > 0$

Where $f$ is the true function and $\hat{f}$ is the network's approximation.

Catch: The number of neurons needed grows as $\epsilon$ gets smaller. For very precise approximations, you might need astronomically many neurons!
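The trade-off between accuracy and neuron count can be seen in a small experiment. The sketch below builds, by hand and without any training, a one-hidden-layer network that approximates $\sin(x)$ on $[0, \pi]$ with straight-line segments. It assumes ReLU hidden units, which may differ from the activation used by the notebook's `TrainableMLP`; `relu`, `build_relu_interpolant`, and `max_errors` are illustrative names.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def build_relu_interpolant(f, a, b, n_hidden):
    """One-hidden-layer ReLU network that linearly interpolates f on [a, b]."""
    knots = np.linspace(a, b, n_hidden + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)    # slope of each line segment
    # Output weight of hidden unit i = change in slope at knot i.
    out_weights = np.diff(slopes, prepend=0.0)
    hinge_points = knots[:-1]                    # unit i computes relu(x - knot_i)
    bias = values[0]

    def network(x):
        x = np.asarray(x, dtype=float)
        hidden = relu(x[..., None] - hinge_points)  # shape (..., n_hidden)
        return bias + hidden @ out_weights
    return network

grid = np.linspace(0.0, np.pi, 2001)
max_errors = {}
for n_hidden in (4, 16, 64):
    net = build_relu_interpolant(np.sin, 0.0, np.pi, n_hidden)
    max_errors[n_hidden] = float(np.max(np.abs(np.sin(grid) - net(grid))))
    print(f"{n_hidden:3d} hidden units -> max error {max_errors[n_hidden]:.5f}")
```

Each additional hidden unit adds one more "bend" to the approximation, so a smaller $\epsilon$ costs more neurons. For this smooth one-dimensional function the cost is modest; for high-dimensional or rapidly varying functions it can grow much faster.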

Committee Analogy

"One giant room of 1000 generalist committee members CAN solve any problem. But a hierarchical organization with specialists (layer 1: evidence gatherers, layer 2: pattern detectors, layer 3: decision makers) can solve it with fewer people and better organization."

Part 7 Summary: What We've Learned

Key Concepts Mastered

Concept	Definition	Why It Matters
Linear Separability	Can separate with one line	Determines what single neurons can learn
XOR Problem	Not linearly separable	Proves single neurons have limits
Hidden Layer	Neurons between input and output	Enable non-linear boundaries
MLP	Multi-Layer Perceptron	Network with hidden layers
Forward Propagation	Input → Hidden → Output	How predictions are made
Backpropagation	Chain rule through layers	How MLPs learn
Universal Approximation	With enough hidden neurons, an MLP can approximate any continuous function	Theoretical foundation

Architecture Comparison

Model	Layers	XOR	Clean V/H	Noisy V/H (20%)	Why?
Perceptron	1	✗	~95%	~70-80%	One detector isn't enough
MLP (4 hidden)	2	✓	~95%	~85-95%	Multiple specialists!
Deep MLP	3+	✓	✓	✓	Even more capacity

Two Complementary Examples

Example	What We Learned
XOR Problem	Classic proof that single neurons have fundamental limits
Noisy V/H Lines	Practical demonstration using our continuing story

Both examples taught the same lesson: complex problems need multiple specialists working together.

Committee Analogy Progress

Part	What Happened
Parts 1-3	Single member learned procedures
Part 4	First case - confused
Part 5	Learned from feedback
Part 6	Performance review
Part 7	Assembled the full committee with specialists!
Part 8	(Next) The committee faces growing pains

Knowledge Check

How Many Hidden Neurons Do We Need?

A natural question: "Should I use 4 hidden neurons? 10? 100?"

Understanding Network Capacity:

Hidden Neurons	Capacity	Risk
Too few (1-2)	Can't represent complex patterns	Underfitting
Just right (4-8 for V/H)	Captures patterns without memorizing	Good generalization
Too many (50+)	Can memorize training data	Overfitting

Rules of Thumb:

Start small, increase if needed - Begin with 2-4 hidden neurons, add more if accuracy plateaus
Watch train vs test gap - If training accuracy >> test accuracy, reduce neurons
Problem complexity guides size - Simple patterns need fewer neurons

For our V/H problem:

9 input pixels
2 classes (binary)
4 hidden neurons is reasonable: enough for specialization, not so many that overfitting occurs
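One rough way to reason about capacity is to count trainable parameters and compare that number with the size of the training set. The sketch below assumes a fully connected input → hidden → output network with one bias per neuron; `count_mlp_parameters` is a hypothetical helper, and the notebook's `TrainableMLP` may store its parameters differently.

```python
def count_mlp_parameters(n_inputs, n_hidden, n_outputs):
    """Weights plus biases for a fully connected input -> hidden -> output network."""
    hidden_params = n_inputs * n_hidden + n_hidden    # weights into hidden + hidden biases
    output_params = n_hidden * n_outputs + n_outputs  # weights into output + output biases
    return hidden_params + output_params

for h in (2, 4, 8, 50):
    print(f"{h:2d} hidden neurons -> {count_mlp_parameters(9, h, 1)} parameters")
```

With 9 inputs and 1 output, 4 hidden neurons give 45 parameters, while 50 hidden neurons give 551. The V/H experiments above train on 100 images, so the larger network has several times more adjustable numbers than training examples, which is one reason memorization becomes a risk.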

We'll explore overfitting in detail in Part 8!

cell 027


# =============================================================================
# KNOWLEDGE CHECK - Part 7
# =============================================================================

print("KNOWLEDGE CHECK - Part 7: Hidden Layers")
print("="*60)

questions = [
    {
        "q": "1. Why can't a single neuron solve the XOR problem?",
        "options": [
            "A) XOR has too many inputs",
            "B) XOR is not linearly separable - can't draw one line to separate classes",
            "C) XOR requires too much memory",
            "D) Single neurons can solve XOR, it just takes longer"
        ],
        "answer": "B",
        "explanation": "XOR points cannot be separated by a single straight line. The (0,0) and (1,1) points are class 0, while (0,1) and (1,0) are class 1 - no line can separate them."
    },
    {
        "q": "2. Why does MLP outperform Perceptron on noisy V/H images?",
        "options": [
            "A) MLP runs faster",
            "B) MLP has multiple specialists - if noise fools one, others provide backup",
            "C) MLP uses less memory",
            "D) Perceptron can't process images"
        ],
        "answer": "B",
        "explanation": "MLP has multiple hidden neurons that each detect different features. Even if noise confuses one specialist, the others can still detect patterns and contribute to the correct answer."
    },
    {
        "q": "3. What is a 'hidden layer' in a neural network?",
        "options": [
            "A) A layer that is invisible to users",
            "B) A layer of neurons between the input and output layers",
            "C) A layer that stores hidden data",
            "D) A layer that only activates sometimes"
        ],
        "answer": "B",
        "explanation": "Hidden layers sit between input and output. They're 'hidden' because we don't directly observe their values - they're internal to the network."
    },
    {
        "q": "4. What does each hidden neuron typically learn to detect?",
        "options": [
            "A) The same pattern as other neurons",
            "B) Random noise",
            "C) Different features or patterns (specialization)",
            "D) Only the output labels"
        ],
        "answer": "C",
        "explanation": "Each hidden neuron specializes in detecting different features. This 'diversity of opinion' is what gives MLPs their power to learn complex patterns."
    },
    {
        "q": "5. In backpropagation through multiple layers, how does error flow?",
        "options": [
            "A) Forward, from input to output",
            "B) Backward, from output to input via chain rule",
            "C) Randomly through the network",
            "D) Only through the hidden layer"
        ],
        "answer": "B",
        "explanation": "Backpropagation passes error backward using the chain rule. Output error → hidden layer error → input weight updates."
    },
    {
        "q": "6. What does the Universal Approximation Theorem tell us?",
        "options": [
            "A) Neural networks always converge",
            "B) One hidden layer with enough neurons can approximate any continuous function",
            "C) Deep networks are always better than shallow ones",
            "D) Training is guaranteed to find optimal weights"
        ],
        "answer": "B",
        "explanation": "The theorem says MLPs CAN represent any continuous function, but it doesn't tell us how to find the weights or how many neurons we need."
    }
]

for q in questions:
    print(f"\n{q['q']}")
    for opt in q["options"]:
        print(f"   {opt}")

print("\n" + "="*60)
print("Scroll down for answers...")
print("="*60)

cell 028

# ANSWERS
print("ANSWERS - Part 7 Knowledge Check")
print("="*60)
for i, q in enumerate(questions, 1):
    print(f"\n{i}. Answer: {q['answer']}")
    print(f"   {q['explanation']}")

What's Next?

Congratulations! You've completed Part 7!

We've assembled the full committee - a Multi-Layer Perceptron with hidden layers that can solve problems single neurons cannot. We proved this by solving XOR and saw how hidden neurons specialize in detecting different features.

But There's a Problem...

As neural networks grow deeper and more complex, they face new challenges:

Overfitting: The committee memorizes cases instead of learning patterns
Vanishing Gradients: Feedback becomes too weak in deep networks
Dead Neurons: Some specialists stop contributing entirely

Coming Up in Part 8: Deep Learning Challenges

In the next notebook, we'll explore:

Overfitting - When the committee memorizes instead of learns
Regularization - Rules to prevent over-specialization
Vanishing/Exploding Gradients - The deep network dilemma
Solutions - Dropout, batch normalization, and more

Continue to Part 8: part_8_deep_learning_challenges.ipynb

"With great power comes great responsibility - and new challenges."

The Brain's Decision Committee - Growing Pains