In Parts 1-6, we built and trained a single neuron (Perceptron) that became an expert at detecting vertical vs horizontal lines. We evaluated its performance, understood its decision-making through saliency maps, and confirmed it learned the right patterns.
But our expert has a limitation.
A single neuron can only draw ONE straight line to separate categories. Some problems require more complex boundaries - curves, multiple regions, or intricate patterns.
"Our single committee member has done well, but some problems are too complex for one person. It's time to assemble a full committee with specialists."
What You'll Learn in Part 7
By the end of this notebook, you will understand:
Why single neurons fail - The famous XOR problem AND challenging V/H variations
What hidden layers are - Adding neurons between input and output
How hidden neurons specialize - Different neurons detect different features
The Multi-Layer Perceptron (MLP) - A complete neural network architecture
Forward propagation - How data flows through multiple layers
Backpropagation through layers - Training with the chain rule
Universal approximation - Why neural networks can approximate (almost) any function
Two Complementary Examples
In this notebook, we'll explore limitations of single neurons through two lenses:
| Example | Why Include It? |
|---|---|
| XOR Problem | The famous textbook example - you'll encounter this everywhere in ML literature |
| Challenging V/H Lines | Our continuing story - noisy images, multiple positions, harder patterns |
Both examples teach the same lesson: some problems need multiple neurons working together.
Prerequisites
Make sure you've completed:
Parts 0-1: Matrices (neural_network_fundamentals.ipynb)
Part 2: Single Neuron (part_2_single_neuron.ipynb)
Part 3: Activation Functions (part_3_activation_functions.ipynb)
Part 4: The Perceptron (part_4_perceptron.ipynb)
Part 5: Training (part_5_training.ipynb)
Part 6: Evaluation (part_6_evaluation.ipynb)
Setup: Import Dependencies
# =============================================================================
# PART 7: HIDDEN LAYERS - SETUP AND IMPORTS
# =============================================================================
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, clear_output

# Try to import ipywidgets for interactive features
try:
    import ipywidgets as widgets
    WIDGETS_AVAILABLE = True
except ImportError:
    WIDGETS_AVAILABLE = False
    print("Note: ipywidgets not installed. Interactive features will be limited.")

# Set up matplotlib style (fall back through older style names)
style_options = ['seaborn-v0_8-whitegrid', 'seaborn-whitegrid', 'ggplot', 'default']
for style in style_options:
    try:
        plt.style.use(style)
        break
    except OSError:
        continue

plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 12
np.random.seed(42)

# -----------------------------------------------------------------------------
# Helper functions from previous notebooks
# -----------------------------------------------------------------------------
def sigmoid(z):
    """Sigmoid activation: maps any value to range (0, 1)."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_derivative(z):
    """Derivative of sigmoid: σ(z) * (1 - σ(z))"""
    s = sigmoid(z)
    return s * (1 - s)

def relu(z):
    """ReLU activation: max(0, z)"""
    return np.maximum(0, z)

def relu_derivative(z):
    """Derivative of ReLU: 1 if z > 0, else 0"""
    return (z > 0).astype(float)

print("Setup complete!")
print("=" * 60)
7.1 The Limitation of Single Neurons: The XOR Problem
Our Perceptron works great for vertical vs horizontal lines. But there's a famous problem that NO single neuron can solve: the XOR problem.
What IS XOR?
XOR (exclusive OR) is a logical operation that outputs TRUE when inputs are DIFFERENT:
| Input A | Input B | XOR Output |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
In words: "TRUE if one or the other, but not both."
Real-world examples of XOR:
A two-way light switch (like on a staircase): the light is on when the two switches are in DIFFERENT positions, and off when BOTH are up (or both down)
A menu choice: "soup OR salad" with your meal means one or the other, but not both
Why Can't a Single Neuron Solve XOR?
A single neuron creates a linear decision boundary - a straight line that separates the two classes.
What IS a Decision Boundary?
A decision boundary is the line (in 2D), plane (in 3D), or hyperplane (in higher dimensions) where the model switches from predicting one class to another.
For a single neuron: $z = w_1 x_1 + w_2 x_2 + b = 0$
This equation defines a straight line. Points on one side get z > 0 (predict class 1), points on the other side get z < 0 (predict class 0).
Why is this a line? Rearranging:
$$x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}$$
This is the equation of a line with slope $-\frac{w_1}{w_2}$ and intercept $-\frac{b}{w_2}$ (assuming $w_2 \neq 0$; if $w_2 = 0$ the boundary is a vertical line).
The Problem: No matter what values we choose for $w_1$, $w_2$, and $b$, we can only draw ONE straight line!
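To make the "only ONE straight line" limit concrete, here is a small sketch that brute-forces a grid of candidate values for w1, w2, and b and checks whether any combination classifies a gate's four points correctly. The grid range, the `find_single_neuron` helper, and the hard threshold (class 1 when z > 0) are assumptions chosen for illustration; they are not part of the notebook's own code.

```python
import itertools
import numpy as np

# Truth tables for three logic gates on inputs (A, B).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
GATES = {
    "AND": np.array([0, 0, 0, 1]),
    "OR": np.array([0, 1, 1, 1]),
    "XOR": np.array([0, 1, 1, 0]),
}

def find_single_neuron(y, grid):
    """Return the first (w1, w2, b) on the grid that classifies all four points, else None."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        z = w1 * X[:, 0] + w2 * X[:, 1] + b       # one linear boundary: z = 0
        if np.array_equal((z > 0).astype(int), y):
            return (float(w1), float(w2), float(b))
    return None

# Hypothetical search grid: 41 evenly spaced values from -5 to 5 per parameter.
grid = np.linspace(-5, 5, 41)
solutions = {name: find_single_neuron(y, grid) for name, y in GATES.items()}

for name, params in solutions.items():
    print(f"{name}: {params}")
```

AND and OR each have a solution on the grid, while XOR comes back as `None`. A finite grid search is only evidence, not a proof, but it matches the geometric argument: every candidate here is a single straight line.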
Let's visualize the problem:
# =============================================================================
# THE XOR PROBLEM: Visualizing Why Single Neurons Fail
# =============================================================================
print("=" * 70)
print("THE XOR PROBLEM: A Single Neuron's Nightmare")
print("=" * 70)

# XOR data
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: The XOR problem
ax = axes[0]
colors = ['red' if y == 0 else 'blue' for y in y_xor]
ax.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=300, edgecolor='black', linewidth=2)
for i, (x, y, label) in enumerate(zip(X_xor[:, 0], X_xor[:, 1], y_xor)):
    ax.annotate(f'({x},{y})→{label}', (x, y), xytext=(10, 10),
                textcoords='offset points', fontsize=10)
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('Input A', fontsize=12)
ax.set_ylabel('Input B', fontsize=12)
ax.set_title('XOR Data Points\n(Red=0, Blue=1)', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)

# Plot 2: Can you draw ONE line to separate them?
ax = axes[1]
ax.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=300, edgecolor='black', linewidth=2)

# Try some lines
x_line = np.linspace(-0.5, 1.5, 100)
ax.plot(x_line, x_line, 'g--', linewidth=2, label='Diagonal?')
ax.plot(x_line, 0.5 * np.ones_like(x_line), 'm--', linewidth=2, label='Horizontal?')
ax.plot(0.5 * np.ones_like(x_line), x_line, 'c--', linewidth=2, label='Vertical?')
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('Input A', fontsize=12)
ax.set_ylabel('Input B', fontsize=12)
ax.set_title('Try to Draw ONE Line\nto Separate Red from Blue', fontsize=14, fontweight='bold')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

# Plot 3: The solution requires TWO lines (or a curve)
ax = axes[2]
ax.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=300, edgecolor='black', linewidth=2)

# Two lines that together solve XOR
ax.plot(x_line, x_line - 0.3, 'g-', linewidth=2, label='Line 1')
ax.plot(x_line, x_line + 0.3, 'g-', linewidth=2, label='Line 2')
# The band between the two lines contains (0,0) and (1,1), the red (class 0) points;
# the blue (class 1) points lie outside it, so the band is shaded red.
ax.fill_between(x_line, x_line - 0.3, x_line + 0.3, alpha=0.2, color='red',
                label='Red region (class 0)')
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('Input A', fontsize=12)
ax.set_ylabel('Input B', fontsize=12)
ax.set_title('Solution: TWO Lines\n(Requires Hidden Layer!)', fontsize=14, fontweight='bold')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("""
KEY INSIGHT: The XOR Problem
════════════════════════════════════════════════════════════════════════

The red points (0) are at corners (0,0) and (1,1).
The blue points (1) are at corners (0,1) and (1,0).

NO SINGLE STRAIGHT LINE can separate red from blue!

This is called being "not linearly separable."

Why it matters:
  • A single neuron can only create ONE linear boundary
  • XOR requires a more complex, non-linear boundary
  • This was proven impossible for Perceptrons in 1969 (Minsky & Papert)
  • The solution: ADD MORE NEURONS → Hidden Layers!
""")
What IS Linear Separability?
Linear Separability is a property of a dataset where the classes can be separated by a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions).
| Problem Type | Linearly Separable? | Single Neuron Can Solve? |
|---|---|---|
| AND gate | Yes | ✓ |
| OR gate | Yes | ✓ |
| XOR gate | No | ✗ |
| Vertical vs Horizontal lines (clean) | Yes | ✓ |
| Noisy/partial V/H lines | Harder | Struggles! |
| Complex overlapping patterns | No | ✗ |
Why Does Linear Separability Matter?
This is the fundamental limit of single-layer neural networks:
| Model | Decision Boundary | What It Can Learn |
|---|---|---|
| Single neuron | One line/plane | Only linearly separable patterns |
| MLP (hidden layer) | Multiple lines → curves | Non-linear patterns |
| Deep MLP | Very complex shapes | Almost anything! |
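Mathematically: A single neuron computes $\sigma(w \cdot x + b)$. The activation function σ is monotonic (always increasing or flat), so thresholding the sigmoid output at 0.5 is the same as thresholding $w \cdot x + b$ at 0. The neuron can therefore only split the input space with ONE hyperplane. That's the fundamental constraint.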
The Historical "AI Winter"
In 1969, Marvin Minsky and Seymour Papert published a book called "Perceptrons" proving that single-layer networks couldn't solve XOR or any non-linearly-separable problem.
Why was this so damaging? They proved it was a MATHEMATICAL impossibility, not just a training difficulty. No amount of training could make a single neuron learn XOR - it literally cannot represent that function.
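The impossibility can be checked in a few lines for the two-input case. The sketch below is a standard argument, not taken from Minsky and Papert's text: it writes down what the four XOR points require of $w_1$, $w_2$, and $b$, using the convention from earlier that $z > 0$ means class 1 and $z < 0$ means class 0.

```latex
\begin{aligned}
(0,0) \mapsto 0 &: \quad b < 0 \\
(0,1) \mapsto 1 &: \quad w_2 + b > 0 \\
(1,0) \mapsto 1 &: \quad w_1 + b > 0 \\
(1,1) \mapsto 0 &: \quad w_1 + w_2 + b < 0 \\[4pt]
\text{Adding the two middle conditions:} &\quad w_1 + w_2 + 2b > 0
\;\Rightarrow\; w_1 + w_2 + b > -b > 0
\end{aligned}
```

The last line says $w_1 + w_2 + b$ must be positive, while the fourth condition says it must be negative. No choice of weights satisfies both, so no single neuron can represent XOR.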
7.1.5 Back to Our Story: When V/H Classification Gets Hard
XOR is the famous textbook example, but let's see how the same limitation affects our vertical/horizontal line detection problem.
Our Perceptron's Success... and Its Limits
In Parts 4-6, our single-neuron Perceptron achieved ~95-100% accuracy on clean V/H lines. But what happens when the problem gets harder?
| Challenge | What Changes | Why It's Harder |
|---|---|---|
| Noisy images | Random pixels added | Pattern obscured |
| Lines in ANY position | Not just middle | One "middle detector" isn't enough |
| Partial/broken lines | Missing pixels | Incomplete evidence |
| Thin vs thick lines | Different widths | Multiple patterns to detect |
Let's see if our single neuron can handle these challenges:
The Historical "AI Winter"
Minsky and Papert's result contributed to the first "AI Winter" - a period where funding for neural network research dried up because people thought these networks were fundamentally limited.
The Solution Was Simple: Add More Layers!
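The fix, stacking layers of neurons, was known in principle, but training multi-layer networks was not practical until backpropagation was popularized in the 1980s: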
Instead of one expert, use a TEAM of experts (neurons) working together!
7.2 The Panel of Experts: Hidden Layers
What IS a Hidden Layer?
A hidden layer is a layer of neurons that sits between the input and output:
Why "hidden"? Because we never directly see their values during normal use - they're internal to the network.
Why Multiple Neurons Help
Each neuron in the hidden layer can detect a different feature:
| Hidden Neuron | What It Might Detect |
|---|---|
| Neuron 1 | "Is there a vertical pattern on the LEFT?" |
| Neuron 2 | "Is there a vertical pattern in the MIDDLE?" |
| Neuron 3 | "Is there a vertical pattern on the RIGHT?" |
| Neuron 4 | "Is there a horizontal pattern on TOP?" |
The output neuron then combines these feature detections to make a final decision.
The Critical Role of Activation Functions
Why do we NEED activation functions between layers?
Without activations, stacking layers does nothing! Here's why:
Without activation: $\text{Output} = W_2 \cdot (W_1 \cdot x) = (W_2 \cdot W_1) \cdot x = W_{\text{combined}} \cdot x$
The composition of two linear transformations is just... another linear transformation! We could replace the entire network with a single layer.
With activation: $\text{Output} = W_2 \cdot \sigma(W_1 \cdot x)$
The non-linear σ "breaks" the linearity. Now we have:
Layer 1 creates multiple linear boundaries
Activation function "bends" these boundaries
Layer 2 combines the bent boundaries
This is how MLPs create curves from straight lines!
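The collapse argument above can be checked numerically. This is a minimal sketch with randomly drawn matrices (the shapes 4×9 and 1×4 match the architecture used later; the random values and the `net` helper are illustrative assumptions). It verifies that two stacked linear layers equal one combined layer, and that inserting a sigmoid breaks linearity, using the fact that any linear map must send the zero vector to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 9))   # input -> hidden weights
W2 = rng.normal(size=(1, 4))   # hidden -> output weights
x = rng.normal(size=9)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Without activation: two layers give the same result as one combined layer.
two_layers = W2 @ (W1 @ x)
W_combined = W2 @ W1                       # shape (1, 9)
one_layer = W_combined @ x
collapses = np.allclose(two_layers, one_layer)

# With activation: the network is no longer a linear map.
def net(v):
    return W2 @ sigmoid(W1 @ v)

# A linear map sends the zero vector to zero; sigmoid(0) = 0.5 prevents that here.
maps_zero_to_zero = np.allclose(net(np.zeros(9)), 0.0)

print("Stacked linear layers collapse to one layer:", collapses)
print("Sigmoid network maps zero to zero:", maps_zero_to_zero)
```

The first check holds for any matrices because matrix multiplication is associative. The second check fails for essentially any random W2, which is enough to show that the activated network cannot be rewritten as a single weight matrix.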
Committee Analogy: The Sub-Committee
"Before, we had ONE committee member who had to look at everything. Now we have a sub-committee of specialists:
Specialist 1 checks for patterns in the left region
Specialist 2 checks the middle region
Specialist 3 checks the right region
The final committee member listens to all specialists and makes the decision
This division of labor lets us solve more complex problems!"
Diversity of Opinion
Key insight from Part 1.7: If all hidden neurons look for the same thing, they're redundant!
We need diversity - each hidden neuron should specialize in detecting something different. Random initial weights give each neuron a different starting point, so during training they adjust in different directions as they minimize error.
# =============================================================================
# THE V/H CHALLENGE: When Our Perceptron Struggles
# =============================================================================
print("=" * 70)
print("BACK TO OUR STORY: Challenging V/H Classification")
print("=" * 70)

# Dataset generator from previous parts
def generate_line_dataset(n_samples=100, noise_level=0.0, seed=None):
    """Generate vertical (1) and horizontal (0) line images."""
    if seed is not None:
        np.random.seed(seed)
    X, y = [], []
    for i in range(n_samples):
        image = np.zeros((3, 3))
        if i < n_samples // 2:  # Vertical
            col = np.random.randint(0, 3)  # ANY column, not just middle
            image[:, col] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(1)
        else:  # Horizontal
            row = np.random.randint(0, 3)
            image[row, :] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(0)
    X, y = np.array(X), np.array(y)
    shuffle_idx = np.random.permutation(n_samples)
    return X[shuffle_idx], y[shuffle_idx]

# Simple Perceptron from Part 5
class SimplePerceptron:
    def __init__(self, n_inputs):
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0

    def forward(self, x):
        z = np.dot(self.weights, x.flatten()) + self.bias
        return sigmoid(z)

    def predict(self, x):
        return 1 if self.forward(x) >= 0.5 else 0

    def train(self, X, y, lr=0.5, epochs=100):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                pred = self.forward(xi)
                error = pred - yi
                self.weights -= lr * error * xi.flatten()
                self.bias -= lr * error
        return self

# Test on different difficulty levels
print("\nTesting Single Neuron on Increasingly Difficult V/H Problems:\n")

difficulties = [
    ("Clean (0% noise)", 0.0),
    ("Light noise (10%)", 0.1),
    ("Medium noise (20%)", 0.2),
    ("Heavy noise (30%)", 0.3),
]

results = []
for name, noise in difficulties:
    np.random.seed(42)
    X_train, y_train = generate_line_dataset(100, noise_level=noise, seed=42)
    X_test, y_test = generate_line_dataset(50, noise_level=noise, seed=999)

    perceptron = SimplePerceptron(9)
    perceptron.train(X_train, y_train, epochs=100)

    correct = sum(1 for x, y in zip(X_test, y_test) if perceptron.predict(x) == y)
    accuracy = correct / len(y_test) * 100
    results.append((name, accuracy))
    print(f"  {name:25s} → Accuracy: {accuracy:5.1f}%")

print("\n" + "=" * 70)
print("KEY OBSERVATION:")
print("=" * 70)
print("""
As noise increases, our single neuron struggles more!

Why? The single neuron learned ONE pattern (e.g., "middle column = vertical").
But noisy images have:
  • Extra bright pixels confusing the detector
  • Lines in different positions the single "template" doesn't match
  • Partial patterns that need multiple feature detectors

Just like XOR, complex V/H patterns need MULTIPLE SPECIALISTS!
""")
7.3 The Multi-Layer Perceptron (MLP): Architecture and Math
Now let's understand the mathematics behind multi-layer networks.
What IS an MLP?
A Multi-Layer Perceptron (MLP) is a neural network with:
One input layer
One or more hidden layers
One output layer
Each layer is fully connected to the next (every neuron connects to every neuron in the next layer).
The Math: Forward Propagation
For an MLP with one hidden layer, the computation flows in two stages:
Stage 1: Input → Hidden: $h = \sigma(W_1 \cdot x + b_1)$
Stage 2: Hidden → Output: $\hat{y} = \sigma(W_2 \cdot h + b_2)$
Where:
$x$ = input vector (our 9 pixels)
$W_1$ = weights from input to hidden layer (matrix!)
$b_1$ = biases for hidden neurons
$h$ = hidden layer activations
$W_2$ = weights from hidden to output
$b_2$ = bias for output neuron
$\sigma$ = activation function (sigmoid, ReLU, etc.)
$\hat{y}$ = final prediction
Breaking It Down Step by Step
Let's trace through with concrete dimensions:
| Component | Shape | Example |
|---|---|---|
| Input $x$ | (9,) | 9 pixels |
| Weights $W_1$ | (4, 9) | 4 hidden neurons, each with 9 weights |
| Biases $b_1$ | (4,) | 4 biases, one per hidden neuron |
| Hidden $h$ | (4,) | 4 hidden activations |
| Weights $W_2$ | (1, 4) | 1 output neuron, 4 weights (from hidden) |
| Bias $b_2$ | (1,) | 1 bias for output |
| Output $\hat{y}$ | (1,) | Final prediction |
Why These Specific Shapes?
Matrix multiplication rule: $(m \times n) \cdot (n \times 1) = (m \times 1)$
The shapes MUST align:
$W_1$ is $(4, 9)$ because we have 4 hidden neurons, each looking at 9 inputs
$W_1 \cdot x$ gives us $(4, 9) \cdot (9, 1) = (4, 1)$ - one value per hidden neuron ✓
$W_2$ is $(1, 4)$ because we have 1 output looking at 4 hidden neurons
$W_2 \cdot h$ gives us $(1, 4) \cdot (4, 1) = (1, 1)$ - our single output ✓
The key insight: Each row of $W_1$ represents ONE hidden neuron's "view" of the input. Each column of $W_2$ represents how much the output trusts each hidden neuron.
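The shape bookkeeping above can be confirmed directly in NumPy. This is a small sketch with random placeholder weights (the values are arbitrary; only the shapes matter), using the same symbols as the table.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.random(9)                  # 9 pixel values
W1 = rng.normal(size=(4, 9))       # 4 hidden neurons, each with 9 weights
b1 = np.zeros(4)                   # one bias per hidden neuron
W2 = rng.normal(size=(1, 4))       # 1 output neuron, 4 weights
b2 = np.zeros(1)                   # one bias for the output

h = sigmoid(W1 @ x + b1)           # Stage 1: (4, 9) @ (9,) -> (4,)
y_hat = sigmoid(W2 @ h + b2)       # Stage 2: (1, 4) @ (4,) -> (1,)

n_params = W1.size + b1.size + W2.size + b2.size
print("h shape:", h.shape)
print("y_hat shape:", y_hat.shape)
print("total parameters:", n_params)
```

If any of the shapes were swapped, for example W1 created as (9, 4), the `@` operator would raise a shape-mismatch error, which is a quick way to catch wiring mistakes.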
Why This Works for XOR
Each hidden neuron can learn ONE linear boundary. With multiple hidden neurons, we can combine their boundaries to create complex, non-linear decision regions!
Concrete XOR example with 2 hidden neurons:
Hidden neuron 1 might learn: "A OR B" (draw diagonal from bottom-left)
Hidden neuron 2 might learn: "A AND B" (draw diagonal from top-right)
Output combines them: "(A OR B) AND NOT (A AND B)" = XOR!
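The OR/AND decomposition above can be wired by hand. The sketch below is an illustration under two assumptions: it uses a hard step activation instead of the sigmoid used elsewhere in this notebook, and the weights are hand-picked rather than learned. The name `xor_net` is made up for this example.

```python
import numpy as np

def step(z):
    """Hard threshold: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

# Hidden layer: row 0 implements "A OR B", row 1 implements "A AND B".
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])

# Output neuron: fires when OR is on and AND is off.
W2 = np.array([[1.0, -1.0]])
b2 = np.array([-0.5])

def xor_net(x):
    h = step(W1 @ x + b1)              # two linear boundaries
    return int(step(W2 @ h + b2)[0])   # combine them

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [xor_net(np.array(p, dtype=float)) for p in inputs]

for p, out in zip(inputs, outputs):
    print(p, "->", out)
```

Each hidden neuron still draws only one straight line. The output neuron combines the two half-planes into the region between the lines, which is what a single neuron could not do.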
# =============================================================================
# BUILDING THE MLP: Step by Step Implementation
# =============================================================================
class MLP:
    """
    Multi-Layer Perceptron with one hidden layer.

    Architecture: Input → Hidden (with activation) → Output (with sigmoid)

    This is the "Full Committee" - multiple experts working together!
    """

    def __init__(self, n_inputs, n_hidden, n_outputs=1):
        """
        Initialize the MLP with random weights.

        Parameters:
            n_inputs: Number of input features (e.g., 9 for 3x3 image)
            n_hidden: Number of neurons in hidden layer (the "specialists")
            n_outputs: Number of output neurons (1 for binary classification)
        """
        self.n_inputs = n_inputs
        self.n_hidden = n_hidden
        self.n_outputs = n_outputs

        # Initialize weights with small random values.
        # The sqrt(2 / fan_in) scale used here is He-style scaling
        # (Xavier/Glorot scaling would use a different factor).
        # W1: weights from input to hidden (shape: n_hidden x n_inputs)
        self.W1 = np.random.randn(n_hidden, n_inputs) * np.sqrt(2.0 / n_inputs)
        self.b1 = np.zeros(n_hidden)

        # W2: weights from hidden to output (shape: n_outputs x n_hidden)
        self.W2 = np.random.randn(n_outputs, n_hidden) * np.sqrt(2.0 / n_hidden)
        self.b2 = np.zeros(n_outputs)

        # For storing values during forward pass (needed for backprop)
        self.z1 = None      # Pre-activation of hidden layer
        self.h = None       # Hidden layer activations
        self.z2 = None      # Pre-activation of output
        self.output = None

        # Training history
        self.loss_history = []
        self.accuracy_history = []

    def forward(self, x):
        """
        Forward propagation: Input → Hidden → Output

        This is like the "Committee Meeting" where:
        1. Each specialist (hidden neuron) examines the evidence
        2. The final decision maker combines their opinions
        """
        x = np.array(x).flatten()

        # Stage 1: Input → Hidden
        # Each hidden neuron computes its weighted sum and activates
        self.z1 = np.dot(self.W1, x) + self.b1        # (n_hidden,)
        self.h = sigmoid(self.z1)                     # (n_hidden,)

        # Stage 2: Hidden → Output
        # The output neuron combines hidden activations
        self.z2 = np.dot(self.W2, self.h) + self.b2   # (n_outputs,)
        self.output = sigmoid(self.z2)                # (n_outputs,)

        return self.output[0] if self.n_outputs == 1 else self.output

    def predict(self, x):
        """Make a binary prediction (0 or 1)."""
        return 1 if self.forward(x) >= 0.5 else 0

print("=" * 70)
print("MLP CLASS: The Full Committee Implementation")
print("=" * 70)

# Create an example MLP
mlp = MLP(n_inputs=9, n_hidden=4, n_outputs=1)

print(f"""
MLP Architecture Created:
  • Input layer:  {mlp.n_inputs} neurons (our 9 pixels)
  • Hidden layer: {mlp.n_hidden} neurons (the specialists)
  • Output layer: {mlp.n_outputs} neuron (final decision)

Weight Shapes:
  • W1 (input→hidden):  {mlp.W1.shape} = {mlp.n_hidden} hidden neurons × {mlp.n_inputs} inputs
  • b1 (hidden biases): {mlp.b1.shape} = {mlp.n_hidden} biases
  • W2 (hidden→output): {mlp.W2.shape} = {mlp.n_outputs} output × {mlp.n_hidden} hidden
  • b2 (output bias):   {mlp.b2.shape} = {mlp.n_outputs} bias

Total Parameters: {mlp.W1.size + mlp.b1.size + mlp.W2.size + mlp.b2.size}
(Compare to Perceptron: {9 + 1} parameters)
""")
In Part 5, we learned backpropagation for a single neuron. With multiple layers, we need to chain the gradients - passing blame backward through each layer.
The Challenge: Who's Responsible for the Error?
When the network makes a mistake, we need to figure out:
How much should we adjust the output weights (W2)?
How much should we adjust the hidden weights (W1)?
The difficulty: W1 doesn't directly produce the output! It influences the hidden layer, which THEN influences the output. This is like asking: "If a manager's employee made a mistake, how much is the manager responsible?"
The Chain Rule: Passing Blame Backward
The key mathematical tool is the chain rule from calculus:
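$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial W_1}$$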
What IS the Chain Rule?
The chain rule says: if A affects B, and B affects C, then A's effect on C is:
$$\frac{dC}{dA} = \frac{dC}{dB} \times \frac{dB}{dA}$$
Intuitive example: If increasing temperature by 1°C increases pressure by 2 units, and increasing pressure by 1 unit increases volume by 3 units, then increasing temperature by 1°C increases volume by 2 × 3 = 6 units.
Think of it as a blame chain:
Loss depends on output prediction (how wrong is the answer?)
Output prediction depends on hidden activations (what did specialists say?)
Hidden activations depend on hidden weights (what were specialists looking for?)
Committee Analogy: Tracing Blame
"When the committee makes a wrong decision:
First, we see how wrong the final decision was (output error)
Then we ask: 'Which specialists contributed to this error?' (hidden layer blame)
Finally: 'What evidence did each specialist focus on that led them astray?' (input weights)
The blame flows BACKWARD through the committee hierarchy."
Notice that backpropagation needs values computed during forward pass:
h (hidden activations) - needed to update W2
z1 (pre-activation) - needed for sigmoid derivative
x (input) - needed to update W1
This is why neural networks use memory! We can't compute gradients without remembering what happened during the forward pass. This creates a fundamental trade-off:
| Memory Usage | Gradient Computation |
|---|---|
| Store all intermediate values | Exact gradients, no extra compute (standard backprop) |
| Store some values, recompute the rest | Exact gradients, extra compute (gradient checkpointing) |
For deep networks with billions of parameters, memory management becomes critical!
First part: Distribute output error to hidden neurons based on their weights
Second part: Scale by how "sensitive" each neuron was
Why the outer product for updates?
dW2 = np.outer(delta2, self.h) computes: error × what hidden neurons said
Each weight connects ONE hidden neuron to output. If that hidden neuron was highly active AND error was large, that weight contributed a lot → big update.
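The blame-passing steps described above can be sketched as a standalone backward pass and checked against a numerical gradient. This is a minimal illustration, not the notebook's training class: it assumes a squared-error loss for one sample and sigmoid activations in both layers, and the function names (`forward`, `backward`, `loss_fn`) are chosen for this example. The names `delta2` and `dW2` follow the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    h = sigmoid(z1)
    z2 = W2 @ h + b2
    y_hat = sigmoid(z2)
    return z1, h, z2, y_hat

def loss_fn(params, x, y):
    # Assumed loss: squared error for a single sample.
    y_hat = forward(params, x)[3]
    return 0.5 * float(np.sum((y_hat - y) ** 2))

def backward(params, x, y):
    W1, b1, W2, b2 = params
    z1, h, z2, y_hat = forward(params, x)
    # Output error, scaled by the output sigmoid's slope.
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)        # (n_outputs,)
    dW2 = np.outer(delta2, h)                         # error x what hidden neurons said
    db2 = delta2
    # First part: distribute output error back through W2.
    # Second part: scale by each hidden neuron's sensitivity (sigmoid slope).
    delta1 = (W2.T @ delta2) * h * (1 - h)            # (n_hidden,)
    dW1 = np.outer(delta1, x)                         # (n_hidden, n_inputs)
    db1 = delta1
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
params = [rng.normal(size=(4, 9)), np.zeros(4), rng.normal(size=(1, 4)), np.zeros(1)]
x, y = rng.random(9), 1.0

dW1, db1, dW2, db2 = backward(params, x, y)

# Finite-difference check on a single entry of W1.
eps = 1e-6
W1_plus = params[0].copy()
W1_plus[2, 5] += eps
W1_minus = params[0].copy()
W1_minus[2, 5] -= eps
numeric = (loss_fn([W1_plus] + params[1:], x, y)
           - loss_fn([W1_minus] + params[1:], x, y)) / (2 * eps)

print("analytic dL/dW1[2,5]:", dW1[2, 5])
print("numeric  dL/dW1[2,5]:", numeric)
```

The two printed values should agree to several decimal places. A gradient check like this is a common way to catch sign or shape mistakes in hand-written backpropagation before training.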
7.5 The MLP Solves XOR!
Now let's prove that our MLP can solve the XOR problem that defeated single neurons.
# =============================================================================
# MLP SOLVES XOR: Proof That Hidden Layers Work!
# =============================================================================
print("=" * 70)
print("MLP vs XOR: The Hidden Layer Advantage")
print("=" * 70)

# XOR data
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# Create and train MLP with 4 hidden neurons
np.random.seed(42)
xor_mlp = TrainableMLP(n_inputs=2, n_hidden=4, n_outputs=1)

print("\nTraining MLP on XOR problem...")
print("(Remember: A single neuron CANNOT solve this!)\n")

xor_mlp.train(X_xor, y_xor, learning_rate=1.0, epochs=1000, verbose=True)

# Test predictions
print("\n" + "-" * 70)
print("XOR PREDICTIONS:")
print("-" * 70)
print("\n  Input A | Input B | Expected | Predicted | Correct?")
print("  " + "-" * 50)

all_correct = True
for i, (x, y_true) in enumerate(zip(X_xor, y_xor)):
    y_pred = xor_mlp.predict(x)
    prob = xor_mlp.forward(x)
    correct = "Yes" if y_pred == y_true else "No"
    if y_pred != y_true:
        all_correct = False
    print(f"     {x[0]}    |    {x[1]}    |    {y_true}     |     {y_pred}     |   {correct}")

print("\n" + "=" * 70)
if all_correct:
    print("SUCCESS! The MLP solved XOR!")
    print("Hidden layers enable learning non-linear patterns!")
else:
    print("Still learning... (try running training again)")
print("=" * 70)
What XOR Taught Us
The XOR success proves several important points:
| Lesson | Why It Matters |
|---|---|
| Hidden layers enable non-linear boundaries | We can now solve problems impossible for single neurons |
| 4 hidden neurons > 2 for XOR | Sometimes extra capacity helps training |
| Higher learning rate (1.0) | XOR has sharp boundaries, needs aggressive updates |
| More epochs (1000) | Non-linear problems can take longer to converge |
The key insight: Each hidden neuron learned to detect one "piece" of the XOR pattern. The output neuron combined these pieces into the full solution.
Now let's return to our V/H classification story and see if this same power translates to real image problems!
7.6 Back to Our Through-Line: MLP vs Perceptron on V/H
We've proven the MLP can solve XOR. Now let's return to our continuing V/H story and see if the MLP can handle the challenging noisy images that stumped our single neuron.
The Comparison We've Been Building To
| Model | Clean V/H | Noisy V/H (20%) | Why? |
|---|---|---|---|
| Perceptron | ~95-100% | ~70-80% | One pattern detector isn't enough |
| MLP | ~95-100% | ? | Multiple specialists should help! |
Why Should MLP Help With Noise?
The Perceptron's problem with noise:
It learned ONE template (e.g., "middle column bright = vertical")
Noise adds random bright pixels everywhere
Random brightness confuses the single template
How MLP specialists help:
| Specialist | What It Might Detect | Why Noise-Robust |
|---|---|---|
| Hidden 1 | Left column pattern | Noise in right columns doesn't affect it |
| Hidden 2 | Middle column pattern | Noise in left columns doesn't affect it |
| Hidden 3 | Vertical vs horizontal ratio | Looks at overall shape |
| Hidden 4 | Edge patterns | Different view of same data |
Even if noise confuses ONE specialist, the others can "vote" correctly!
This is similar in spirit to ensemble robustness - multiple diverse detectors are more reliable than one.
One of the most powerful results in neural network theory is the Universal Approximation Theorem.
What Does It Say?
"A neural network with a single hidden layer containing enough neurons can approximate ANY continuous function on a bounded input region to arbitrary accuracy."
In simpler terms: with enough hidden neurons, a neural network can learn to represent virtually any pattern!
What This Means
| Statement | Implication |
|---|---|
| "Any continuous function" | Any input-output relationship without sudden jumps |
| "Single hidden layer" | You only NEED one hidden layer (in theory) |
| "Enough neurons" | May need many neurons for complex functions |
| "Arbitrary accuracy" | Can get as close as you want to the true function |
The Catch
The theorem tells us networks CAN represent any function, but NOT:
How to FIND the right weights (training is still hard!)
How MANY neurons are needed (could be huge!)
Whether training will converge
Why Add MORE Layers?
If one hidden layer is theoretically enough, why do modern networks have many layers?
Deep networks (more layers) are more EFFICIENT:
| Architecture | Parameters Needed | Why? |
|---|---|---|
| Wide (1 layer, many neurons) | Can be exponential for some functions | Each neuron works independently |
| Deep (many layers, fewer neurons) | Often polynomial for the same functions | Layers build on each other |
The Compositionality Argument: Why Depth Wins
Key insight: Complex functions often have hierarchical structure.
Consider recognizing a face:
Layer 1: Detect edges (simple lines, curves)
Layer 2: Combine edges into parts (eyes, nose, mouth)
Layer 3: Combine parts into faces
Each layer REUSES what the previous layer learned!
With a single wide layer: Each neuron must independently learn to detect "face" from raw pixels. No reuse.
With deep layers: Edge detectors are shared across eye detectors, nose detectors, etc. Massive reuse!
Mathematical example:
To represent $f(x) = x^{2^n}$ with a wide network: need on the order of $2^n$ neurons
With a deep network: just $n$ layers, each squaring ($x^2$) the output of the previous layer
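The depth argument rests on composition: n squaring steps in sequence reach the exponent 2^n. The sketch below only counts those steps with exact arithmetic. It is not a neural network, since a real network would have to approximate each squaring with its own neurons, and the helper name `power_by_squaring` is made up for this example.

```python
def power_by_squaring(x, n_layers):
    """Apply n_layers squaring steps; the result equals x ** (2 ** n_layers)."""
    value = x
    for _ in range(n_layers):
        value = value * value    # one "layer": square the previous layer's output
    return value

x, n = 3, 4
deep = power_by_squaring(x, n)   # 4 sequential steps
direct = x ** (2 ** n)           # exponent 16 computed directly

print(f"{n} squaring steps give x^{2 ** n} = {deep}")
```

Four steps reach the 16th power and ten steps would reach the 1024th, so the exponent grows exponentially while the number of layers grows linearly.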
What Does "Arbitrary Accuracy" Mean?
The theorem says we can get "arbitrarily close" to any function. Concretely:
$$|f(x) - \hat{f}(x)| < \epsilon \quad \text{for any } \epsilon > 0$$
Where f is the true function and f^ is the network's approximation.
Catch: The number of neurons needed grows as ϵ gets smaller. For very precise approximations, you might need astronomically many neurons!
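The link between neuron count and accuracy can be illustrated with a small experiment. This sketch makes several assumptions that differ from the rest of the notebook: the hidden weights are random and fixed, only the output weights are fitted (by least squares, not backpropagation), the activation is tanh, the target is sin(x), and the error is measured as RMSE on a grid rather than the worst-case error the theorem refers to.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200)
target = np.sin(x)

# Fixed random hidden layer with up to 64 tanh neurons (illustrative choice).
max_hidden = 64
w = rng.normal(scale=2.0, size=max_hidden)
b = rng.uniform(-np.pi, np.pi, size=max_hidden)
H = np.tanh(np.outer(x, w) + b)            # hidden activations, shape (200, 64)

def rmse_with(k):
    """Fit output weights on the first k hidden neurons and return the RMSE."""
    features = np.column_stack([H[:, :k], np.ones_like(x)])   # k neurons + output bias
    coeffs, *_ = np.linalg.lstsq(features, target, rcond=None)
    return float(np.sqrt(np.mean((features @ coeffs - target) ** 2)))

errors = {k: rmse_with(k) for k in (1, 4, 16, 64)}
for k, err in errors.items():
    print(f"{k:3d} hidden neurons -> RMSE {err:.6f}")
```

Because each larger network contains the smaller one's neurons, the best achievable fit can only improve as neurons are added. How quickly the error falls depends on the target function, which is the practical side of the "enough neurons" caveat.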
Committee Analogy
"One giant room of 1000 generalist committee members CAN solve any problem. But a hierarchical organization with specialists (layer 1: evidence gatherers, layer 2: pattern detectors, layer 3: decision makers) can solve it with fewer people and better organization."
Part 7 Summary: What We've Learned
Key Concepts Mastered
| Concept | Definition | Why It Matters |
|---|---|---|
| Linear Separability | Classes can be separated by one line (or hyperplane) | Determines what single neurons can learn |
| XOR Problem | Non-linearly separable | Proves single neurons have limits |
| Hidden Layer | Neurons between input and output | Enable non-linear boundaries |
| MLP | Multi-Layer Perceptron | Network with hidden layers |
| Forward Propagation | Input → Hidden → Output | How predictions are made |
| Backpropagation | Chain rule through layers | How MLPs learn |
| Universal Approximation | MLPs can approximate any continuous function | Theoretical foundation |
Architecture Comparison
| Model | Layers | XOR | Clean V/H | Noisy V/H (20%) | Why? |
|---|---|---|---|---|---|
| Perceptron | 1 | ✗ | ~95% | ~70-80% | One detector isn't enough |
| MLP (4 hidden) | 2 | ✓ | ~95% | ~85-95% | Multiple specialists! |
| Deep MLP | 3+ | ✓ | ✓ | ✓ | Even more capacity |
Two Complementary Examples
| Example | What We Learned |
|---|---|
| XOR Problem | Classic proof that single neurons have fundamental limits |
| Noisy V/H Lines | Practical demonstration using our continuing story |
Both examples taught the same lesson: complex problems need multiple specialists working together.
Committee Analogy Progress
| Part | What Happened |
|---|---|
| Parts 1-3 | Single member learned procedures |
| Part 4 | First case - confused |
| Part 5 | Learned from feedback |
| Part 6 | Performance review |
| Part 7 | Assembled the full committee with specialists! |
| Part 8 | (Next) The committee faces growing pains |
Knowledge Check
How Many Hidden Neurons Do We Need?
A natural question: "Should I use 4 hidden neurons? 10? 100?"
Understanding Network Capacity:
| Hidden Neurons | Capacity | Outcome |
|---|---|---|
| Too few (1-2) | Can't represent complex patterns | Underfitting |
| Just right (4-8 for V/H) | Captures patterns without memorizing | Good generalization |
| Too many (50+) | Can memorize training data | Overfitting |
Rules of Thumb:
Start small, increase if needed - Begin with 2-4 hidden neurons, add more if accuracy plateaus
Watch train vs test gap - If training accuracy >> test accuracy, reduce neurons
Problem complexity guides size - Simple patterns need fewer neurons
For our V/H problem:
9 input pixels
2 classes (binary)
4 hidden neurons is reasonable: enough for specialization, not so many that overfitting occurs
We'll explore overfitting in detail in Part 8!
# =============================================================================
# KNOWLEDGE CHECK - Part 7
# =============================================================================
print("KNOWLEDGE CHECK - Part 7: Hidden Layers")
print("=" * 60)

questions = [
    {
        "q": "1. Why can't a single neuron solve the XOR problem?",
        "options": [
            "A) XOR has too many inputs",
            "B) XOR is not linearly separable - can't draw one line to separate classes",
            "C) XOR requires too much memory",
            "D) Single neurons can solve XOR, it just takes longer"
        ],
        "answer": "B",
        "explanation": "XOR points cannot be separated by a single straight line. The (0,0) and (1,1) points are class 0, while (0,1) and (1,0) are class 1 - no line can separate them."
    },
    {
        "q": "2. Why does MLP outperform Perceptron on noisy V/H images?",
        "options": [
            "A) MLP runs faster",
            "B) MLP has multiple specialists - if noise fools one, others provide backup",
            "C) MLP uses less memory",
            "D) Perceptron can't process images"
        ],
        "answer": "B",
        "explanation": "MLP has multiple hidden neurons that each detect different features. Even if noise confuses one specialist, the others can still detect patterns and contribute to the correct answer."
    },
    {
        "q": "3. What is a 'hidden layer' in a neural network?",
        "options": [
            "A) A layer that is invisible to users",
            "B) A layer of neurons between the input and output layers",
            "C) A layer that stores hidden data",
            "D) A layer that only activates sometimes"
        ],
        "answer": "B",
        "explanation": "Hidden layers sit between input and output. They're 'hidden' because we don't directly observe their values - they're internal to the network."
    },
    {
        "q": "4. What does each hidden neuron typically learn to detect?",
        "options": [
            "A) The same pattern as other neurons",
            "B) Random noise",
            "C) Different features or patterns (specialization)",
            "D) Only the output labels"
        ],
        "answer": "C",
        "explanation": "Each hidden neuron specializes in detecting different features. This 'diversity of opinion' is what gives MLPs their power to learn complex patterns."
    },
    {
        "q": "5. In backpropagation through multiple layers, how does error flow?",
        "options": [
            "A) Forward, from input to output",
            "B) Backward, from output to input via chain rule",
            "C) Randomly through the network",
            "D) Only through the hidden layer"
        ],
        "answer": "B",
        "explanation": "Backpropagation passes error backward using the chain rule. Output error → hidden layer error → input weight updates."
    },
    {
        "q": "6. What does the Universal Approximation Theorem tell us?",
        "options": [
            "A) Neural networks always converge",
            "B) One hidden layer with enough neurons can approximate any continuous function",
            "C) Deep networks are always better than shallow ones",
            "D) Training is guaranteed to find optimal weights"
        ],
        "answer": "B",
        "explanation": "The theorem says MLPs CAN approximate any continuous function, but it doesn't tell us how to find the weights or how many neurons we need."
    }
]

for q in questions:
    print(f"\n{q['q']}")
    for opt in q["options"]:
        print(f"    {opt}")

print("\n" + "=" * 60)
print("Scroll down for answers...")
print("=" * 60)
We've assembled the full committee - a Multi-Layer Perceptron with hidden layers that can solve problems single neurons cannot. We proved this by solving XOR and saw how hidden neurons specialize in detecting different features.
But There's a Problem...
As neural networks grow deeper and more complex, they face new challenges:
Overfitting: The committee memorizes cases instead of learning patterns
Vanishing Gradients: Feedback becomes too weak in deep networks
Dead Neurons: Some specialists stop contributing entirely
Coming Up in Part 8: Deep Learning Challenges
In the next notebook, we'll explore:
Overfitting - When the committee memorizes instead of learns
Regularization - Rules to prevent over-specialization
Vanishing/Exploding Gradients - The deep network dilemma
Solutions - Dropout, batch normalization, and more
Continue to Part 8: part_8_deep_learning_challenges.ipynb
"With great power comes great responsibility - and new challenges."