Previously: In Part 2, our first committee member learned to gather evidence (inputs), weigh it (weights), add their personal threshold (bias), and calculate a score (weighted sum). But a score isn't a decision.
Today's Mission: Our committee member needs to cast their vote. The score tells them how convinced they are, but now they need to turn that into an actual decision. There are several ways to vote - and choosing the right one matters.
What You'll Learn
By the end of this notebook, you will understand:
Why activation functions are necessary - Linear isn't enough!
Step Function - The simplest "yes or no" vote
Sigmoid - A smooth probability between 0 and 1
Tanh - A centered vote from -1 to +1
ReLU - The modern workhorse (and its quirks)
Softmax - For multi-class decisions
How to choose the right activation for your problem
Prerequisites
Make sure you've completed:
Part 0: Welcome & Introduction (neural_network_fundamentals.ipynb)
Part 1: Matrices - The Language of the Brain (neural_network_fundamentals.ipynb)
Part 2: The First Committee Member (part_2_single_neuron.ipynb)
🔧 Setup: Import Dependencies
Let's import our tools and set up our environment. We'll also recreate our V/H line examples from previous notebooks.
# =============================================================================
# PART 3: ACTIVATION FUNCTIONS - SETUP
# =============================================================================

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib.colors import LinearSegmentedColormap
from IPython.display import display, clear_output, HTML

# Try to import ipywidgets for interactive features
try:
    import ipywidgets as widgets
    WIDGETS_AVAILABLE = True
except ImportError:
    WIDGETS_AVAILABLE = False
    print(" ipywidgets not installed. Interactive features will be limited.")
    print(" Install with: pip install ipywidgets")

# Set up matplotlib style (fall back through the list until one is available)
style_options = ['seaborn-v0_8-whitegrid', 'seaborn-whitegrid', 'ggplot', 'default']
for style in style_options:
    try:
        plt.style.use(style)
        break
    except OSError:
        continue

plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 12
plt.rcParams['axes.titlesize'] = 14

np.random.seed(42)

# =============================================================================
# RECREATE OUR V/H LINE EXAMPLES (from previous notebooks)
# =============================================================================

# Our canonical vertical and horizontal lines
vertical_line = np.array([
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0]
])

horizontal_line = np.array([
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 0]
])

# Flattened versions for neuron input
vertical_flat = vertical_line.flatten()
horizontal_flat = horizontal_line.flatten()

# =============================================================================
# RECREATE OUR SIMPLE NEURON CLASS (from Part 2)
# =============================================================================

class SimpleNeuron:
    """
    A simple neuron that computes weighted sum + bias (no activation yet).
    From Part 2: The First Committee Member.
    """

    def __init__(self, n_inputs):
        """Initialize neuron with random weights and bias."""
        self.weights = np.random.randn(n_inputs) * 0.5
        self.bias = np.random.randn() * 0.1
        self.n_inputs = n_inputs

    def forward(self, x):
        """Compute the weighted sum (pre-activation value)."""
        x = np.array(x).flatten()
        return np.dot(self.weights, x) + self.bias

    def set_weights(self, weights, bias=None):
        """Manually set weights and bias."""
        self.weights = np.array(weights).flatten()
        if bias is not None:
            self.bias = bias

print(" All libraries imported successfully!")
print(f" NumPy version: {np.__version__}")
print(f" Matplotlib version: {plt.matplotlib.__version__}")
if WIDGETS_AVAILABLE:
    print(f" IPyWidgets available: Yes")
else:
    print(f" IPyWidgets available: No (interactive labs won't work)")
print("\n Welcome to Part 3: Activation Functions!")
print(" Time to learn how to cast our vote.")
3.1 Why Activate? The Problem with Pure Mathematics
First, What IS an Activation Function?
An activation function is a mathematical function that transforms the neuron's raw score into a useful output.
| Term | Meaning | Example |
|---|---|---|
| Input to activation | The weighted sum z = w·x + b | z = 2.3 |
| Activation function | A formula that transforms z | sigmoid(z) |
| Output of activation | The transformed value | 0.91 |
Think of it as a "translator" - it takes the neuron's raw calculation and converts it into something meaningful.
Committee Analogy
Our committee member has done the math:
Gathered all the evidence (inputs)
Multiplied by how much each piece matters (weights)
Added their personal threshold (bias)
Got a final score
But here's the problem: That score could be anything. It might be -37.5, or 102.3, or 0.0001.
A vote needs to be meaningful. When you ask "Is this a vertical line?", you don't want to hear "-14.7". You want to hear:
"Yes" or "No" (binary)
"I'm 85% confident it's vertical" (probability)
"On a scale of -1 to 1, I'd say 0.7 toward vertical" (centered scale)
Activation functions transform raw scores into meaningful decisions.
The Deeper Problem: Linear Functions Aren't Enough
What Does "Linear" Mean?
A linear function is one where the output changes at a constant rate as the input changes - draw it on a graph and you get a straight line.
Linear: y = 3x + 2 (always a straight line)
Non-linear: y = x² (a curve)
Non-linear: y = sigmoid(x) (an S-curve)
Why does this matter? Many real-world patterns are NOT straight lines!
There's a mathematical reason why activation functions are essential, beyond just making outputs meaningful.
Without activation: If our neuron just outputs the weighted sum, it's a linear function:
$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$$
The problem? Stacking linear functions gives you another linear function!
If we chain two neurons without activation:
Neuron 1: $z_1 = W_1 \cdot x + b_1$
Neuron 2: $z_2 = W_2 \cdot z_1 + b_2 = W_2 \cdot (W_1 \cdot x + b_1) + b_2$
This simplifies to: $z_2 = (W_2 \cdot W_1) \cdot x + (W_2 \cdot b_1 + b_2)$
That's just another linear function! No matter how many layers we stack, we're still just drawing straight lines. But many real problems need curved decision boundaries.
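The collapse can also be checked numerically. The sketch below is a minimal illustration, assuming small random matrices `W1`, `W2` and bias vectors `b1`, `b2` (hypothetical values and layer sizes, not taken from the notebook). It pushes one input through two activation-free layers and compares the result with the single combined layer from the simplification above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical layer sizes: 3 inputs -> 4 hidden values -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked layers with NO activation in between
z1 = W1 @ x + b1
z2 = W2 @ z1 + b2

# One combined linear layer: (W2·W1)·x + (W2·b1 + b2)
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
z_single = W_combined @ x + b_combined

print("Two layers :", z2)
print("One layer  :", z_single)
print("Same result:", np.allclose(z2, z_single))
```

The two results agree up to floating-point rounding, so the extra layer bought nothing. Putting any non-linear function between the two layers breaks this equivalence, which is the whole point of an activation.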
Let's visualize this:
# =============================================================================
# VISUALIZE: Linear vs Non-Linear Decision Boundaries
# =============================================================================

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Problem 1: Linearly Separable (can solve with linear)
ax1 = axes[0]
np.random.seed(42)

# Class A: upper left
class_a = np.random.randn(30, 2) * 0.5 + [-1, 1]
# Class B: lower right
class_b = np.random.randn(30, 2) * 0.5 + [1, -1]

ax1.scatter(class_a[:, 0], class_a[:, 1], c='#3498db', s=60, label='Class A', edgecolors='white')
ax1.scatter(class_b[:, 0], class_b[:, 1], c='#e74c3c', s=60, label='Class B', edgecolors='white')

# Linear boundary: the line y = x runs between the upper-left and lower-right
# clusters (y = -x would pass through both cluster centers)
x_line = np.linspace(-3, 3, 100)
ax1.plot(x_line, x_line, 'g--', linewidth=2, label='Linear Boundary')

ax1.set_xlim(-3, 3)
ax1.set_ylim(-3, 3)
ax1.set_title(' Linearly Separable\n(Linear neuron CAN solve)', fontsize=12, fontweight='bold')
ax1.legend()
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')

# Problem 2: XOR Pattern (cannot solve with linear)
ax2 = axes[1]

# XOR: opposite corners are same class
xor_class_a = np.array([[-1, -1], [1, 1]])  # Diagonal pair 1
xor_class_b = np.array([[-1, 1], [1, -1]])  # Diagonal pair 2

# Add some noise around each point
np.random.seed(42)
xor_a_noisy = np.vstack([xor_class_a[0] + np.random.randn(15, 2) * 0.3,
                         xor_class_a[1] + np.random.randn(15, 2) * 0.3])
xor_b_noisy = np.vstack([xor_class_b[0] + np.random.randn(15, 2) * 0.3,
                         xor_class_b[1] + np.random.randn(15, 2) * 0.3])

ax2.scatter(xor_a_noisy[:, 0], xor_a_noisy[:, 1], c='#3498db', s=60, label='Class A', edgecolors='white')
ax2.scatter(xor_b_noisy[:, 0], xor_b_noisy[:, 1], c='#e74c3c', s=60, label='Class B', edgecolors='white')

# Try to draw a line - it fails!
ax2.plot(x_line, -x_line * 0.3, 'r--', linewidth=2, alpha=0.5, label='Linear attempt (fails)')

ax2.set_xlim(-2.5, 2.5)
ax2.set_ylim(-2.5, 2.5)
ax2.set_title('❌ XOR Pattern\n(Linear neuron CANNOT solve)', fontsize=12, fontweight='bold')
ax2.legend()
ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')

# Problem 3: XOR with non-linear boundary
ax3 = axes[2]
ax3.scatter(xor_a_noisy[:, 0], xor_a_noisy[:, 1], c='#3498db', s=60, label='Class A', edgecolors='white')
ax3.scatter(xor_b_noisy[:, 0], xor_b_noisy[:, 1], c='#e74c3c', s=60, label='Class B', edgecolors='white')

# Draw a non-linear boundary: one closed curve around each Class A cluster,
# centered on (-1, -1) and (1, 1) so that Class B stays outside both curves
theta = np.linspace(0, 2 * np.pi, 100)
ax3.plot(0.9 * np.cos(theta) - 1, 0.9 * np.sin(theta) - 1, 'g-', linewidth=2)
ax3.plot(0.9 * np.cos(theta) + 1, 0.9 * np.sin(theta) + 1, 'g-', linewidth=2, label='Non-linear boundary')

ax3.set_xlim(-2.5, 2.5)
ax3.set_ylim(-2.5, 2.5)
ax3.set_title('✅ XOR with Non-Linear Boundary\n(Activation functions enable this!)', fontsize=12, fontweight='bold')
ax3.legend()
ax3.set_xlabel('Feature 1')
ax3.set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

print("\n Key Insight:")
print(" Without activation functions, neural networks can only draw straight lines.")
print(" With activation functions, they can draw curves, circles, and complex shapes!")
print(" This is what makes deep learning powerful.")
3.2 Step Function - The Binary Vote
Committee Analogy: "Yes or No, Nothing In Between"
The Step Function is the simplest voting method. It's like a committee member who can only say:
"YES" (vote = 1) if the score is at or above 0
"NO" (vote = 0) if the score is below 0
No hesitation. No "maybe". Just a crisp, decisive vote.
The Mathematics
$$f(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$
Where z is the weighted sum (pre-activation value) from our neuron.
Historical Context
This was the activation function used in the original Perceptron (Rosenblatt, 1958). It's historically significant - the first neural networks used this simple approach.
Implementation
# =============================================================================
# STEP FUNCTION: Implementation and Visualization
# =============================================================================

def step_function(z):
    """
    Step activation function.
    Returns 1 if z >= 0, else 0.

    Parameters:
        z: input value (weighted sum)

    Returns:
        0 or 1
    """
    return np.where(z >= 0, 1, 0)

# Create a range of z values to visualize
z_values = np.linspace(-5, 5, 1000)
step_output = step_function(z_values)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: The step function curve
ax1 = axes[0]
ax1.plot(z_values, step_output, 'b-', linewidth=3, label='Step Function')
ax1.axhline(y=0.5, color='gray', linestyle=':', alpha=0.5)
ax1.axvline(x=0, color='gray', linestyle=':', alpha=0.5)
ax1.scatter([0], [1], color='blue', s=100, zorder=5)  # Point at z=0
ax1.set_xlabel('z (weighted sum)', fontsize=12)
ax1.set_ylabel('f(z) (output)', fontsize=12)
ax1.set_title('Step Function: The Binary Vote', fontsize=14, fontweight='bold')
ax1.set_ylim(-0.2, 1.3)
ax1.legend()
ax1.grid(True, alpha=0.3)

# Annotate
ax1.annotate('If z ≥ 0 → Output 1', xy=(2, 1), fontsize=11, color='green')
ax1.annotate('If z < 0 → Output 0', xy=(-4.5, 0.1), fontsize=11, color='red')
ax1.annotate('Decision boundary\nat z = 0', xy=(0.2, 0.5), fontsize=10, color='gray')

# Plot 2: Show how it works with example values
ax2 = axes[1]
example_z = np.array([-2.5, -1.0, -0.1, 0, 0.1, 1.0, 2.5])
example_output = step_function(example_z)

colors = ['#e74c3c' if o == 0 else '#27ae60' for o in example_output]
bars = ax2.bar(range(len(example_z)), example_output, color=colors, edgecolor='white', linewidth=2)
ax2.set_xticks(range(len(example_z)))
ax2.set_xticklabels([f'z={z:.1f}' for z in example_z], rotation=45)
ax2.set_ylabel('Output', fontsize=12)
ax2.set_title('Step Function Examples', fontsize=14, fontweight='bold')
ax2.set_ylim(0, 1.3)

# Add value labels on bars
for i, (z, out) in enumerate(zip(example_z, example_output)):
    ax2.annotate(f'{int(out)}', xy=(i, out + 0.05), ha='center', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

# Test with our neuron
print("\n🔬 Testing Step Function with our V/H Detector:")
print("=" * 50)

# Create a neuron with hand-designed vertical detector weights
vertical_detector = SimpleNeuron(9)
detector_weights = np.array([
    -1, +2, -1,  # Top row: high weight in middle
    -1, +2, -1,  # Middle row: high weight in middle
    -1, +2, -1   # Bottom row: high weight in middle
]).astype(float) * 0.5
vertical_detector.set_weights(detector_weights, bias=-1.0)

# Get raw scores
v_score = vertical_detector.forward(vertical_flat)
h_score = vertical_detector.forward(horizontal_flat)

print(f"\nVertical Line:")
print(f" Raw score (z): {v_score:.3f}")
print(f" Step activation: {step_function(v_score)} {'✓ Correct (Vertical)' if step_function(v_score) == 1 else '✗ Wrong'}")

print(f"\nHorizontal Line:")
print(f" Raw score (z): {h_score:.3f}")
print(f" Step activation: {step_function(h_score)} {'✓ Correct (Not Vertical)' if step_function(h_score) == 0 else '✗ Wrong'}")
⚠️ The Problem with Step Functions
The step function has a critical flaw: It's not differentiable at z = 0, and the gradient is 0 everywhere else.
What Does "Differentiable" Mean?
A function is differentiable if you can calculate its slope (how steep it is) at every point.
| Function | At z = 0.5 | Differentiable? |
|---|---|---|
| Sigmoid | Smooth curve, slope = 0.24 | Yes |
| Step | Flat (slope = 0) | No (at z=0 it jumps!) |
The step function has a "jump" - it goes from 0 to 1 instantly at z = 0. There's no smooth slope at that point. It's like asking "what's the steepness of a cliff edge?" - the question doesn't have a good answer.
Why Does Differentiability Matter for Learning?
Neural networks learn by:
Making a prediction
Measuring how wrong it is (loss)
Using the derivative to figure out which direction to adjust weights
Adjusting weights slightly in that direction
If the derivative is always 0 (flat) or undefined (jump), the network has no signal to tell it how to improve. It's like trying to find the bottom of a valley while blindfolded - you need to feel the slope to know which way is down!
Why does this matter? In Part 5, we'll learn that neural networks learn by using gradients to figure out how to adjust weights. If the gradient is always 0, the network can't learn!
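To make "no signal" concrete, here is a small sketch that estimates the slope of each function with a finite difference. The helper name `numerical_slope` and the step size `h` are choices made for this illustration, not part of the notebook's code.

```python
import numpy as np

def step_function(z):
    return np.where(z >= 0, 1, 0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def numerical_slope(f, z, h=1e-4):
    """Central finite-difference estimate of the slope of f at z."""
    return (f(z + h) - f(z - h)) / (2 * h)

z_points = np.array([-2.0, 0.5, 2.0])  # sample points away from the jump at z = 0
step_slopes = numerical_slope(step_function, z_points)
sigmoid_slopes = numerical_slope(sigmoid, z_points)

for z, s_step, s_sig in zip(z_points, step_slopes, sigmoid_slopes):
    print(f"z = {z:+.1f}   step slope = {s_step:.3f}   sigmoid slope = {s_sig:.3f}")
```

Away from z = 0 the step function's slope comes out as exactly 0, so nudging a weight changes nothing the network can measure. The sigmoid's slope is small but non-zero at every sampled point, which is enough for gradient-based learning to pick a direction.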
| Pros | Cons |
|---|---|
| ✅ Simple to understand | ❌ Not differentiable (can't use gradient descent) |
| ✅ Clear binary output | ❌ No gradient signal for learning |
| ✅ Historically important | ❌ All-or-nothing (loses information) |
The step function is great for understanding, but modern networks need smoother alternatives.
3.3 Sigmoid - The Confidence Vote
Committee Analogy: "I'm 85% Sure It's Vertical"
The Sigmoid function is like a committee member who expresses confidence levels:
Instead of "YES" or "NO", they say "I'm 85% confident it's vertical"
Output is always between 0 and 1
Can be interpreted as a probability
This is much more nuanced than the step function!
The Mathematics
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Where:
z is the weighted sum (pre-activation value)
e is Euler's number (≈ 2.718)
σ(z) smoothly maps any input to the range (0, 1)
What is Euler's Number (e)?
You'll see e ≈ 2.71828 in many formulas. It's a special mathematical constant (like π ≈ 3.14159).
Why e? When you raise e to a power, you get a curve that:
Grows smoothly and never stops
Has a special property: its slope at any point equals its value at that point!
For sigmoid: The e^(-z) term creates the smooth S-curve:
When z is very negative: e^(-z) is huge → output ≈ 0
When z is very positive: e^(-z) is tiny → output ≈ 1
When z = 0: e^0 = 1 → output = 0.5
Don't worry about memorizing e - just know it creates smooth curves!
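As a quick check of those three cases, here is a minimal sketch of the formula. The notebook's own `sigmoid` cell may differ in detail. The clipping to [-500, 500] is borrowed from the `Neuron` class later in this part and only guards against overflow in `np.exp`.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid; accepts a scalar or a NumPy array."""
    z = np.clip(z, -500, 500)  # keep exp() within floating-point range
    return 1.0 / (1.0 + np.exp(-z))

z_samples = np.array([-10.0, 0.0, 10.0])
outputs = sigmoid(z_samples)

for z, out in zip(z_samples, outputs):
    print(f"sigmoid({z:+.0f}) = {out:.5f}")
```

Very negative scores land just above 0, very positive scores land just below 1, and a score of exactly 0 gives 0.5.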
The Sigmoid Derivative (Why It Matters for Learning)
First, What IS a Derivative?
The derivative tells you the slope (steepness) of a curve at any point.
| Slope | What It Means | Learning Impact |
|---|---|---|
| Steep (large derivative) | Output changes quickly when input changes | Strong learning signal |
| Flat (small derivative) | Output barely changes when input changes | Weak learning signal |
| Zero | Output doesn't change at all | NO learning signal! |
Simple Example: If you're walking up a hill:
Steep section: Each step moves you a lot higher → large derivative
Flat section: Each step barely changes your height → small derivative
Why Derivatives Matter for Training
Neural networks learn by asking: "If I change this weight slightly, how does the output change?"
The derivative answers this question! If the derivative is:
Large: Small weight change → big output change → we know which direction helps
Small: Small weight change → tiny output change → hard to tell which direction helps
Zero: Weight change → no output change → we're stuck!
The Beautiful Sigmoid Derivative Formula
A beautiful property of sigmoid: its derivative has a simple form!
$$\frac{d\sigma}{dz} = \sigma(z) \cdot (1 - \sigma(z))$$
This elegant formula means:
If we know the output, we can easily compute the gradient
The gradient is highest around z = 0 (where the function is steepest)
The gradient approaches 0 as z becomes very positive or very negative
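The first point, computing the gradient from the output alone, can be sketched as follows. The helper name `gradient_from_output` is hypothetical. The finite-difference comparison is only there as a sanity check on the formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_from_output(a):
    """Sigmoid slope computed from the output a = sigmoid(z) alone."""
    return a * (1.0 - a)

z = np.array([-4.0, 0.0, 4.0])
a = sigmoid(z)                      # outputs we would already have from the forward pass
analytic = gradient_from_output(a)  # no extra exp() needed

# Independent check: central finite difference of sigmoid itself
h = 1e-5
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

for zi, g_a, g_n in zip(z, analytic, numeric):
    print(f"z = {zi:+.1f}   from output = {g_a:.5f}   finite difference = {g_n:.5f}")
```

The two columns match closely, the peak of 0.25 sits at z = 0, and the slope at z = ±4 is already below 0.02.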
⚠️ The Vanishing Gradient Problem
This is important: Look at the derivative plot below. Notice how the gradient is nearly 0 for large positive or negative z values.
When the gradient is nearly 0, learning almost stops! The network can't figure out which direction to adjust weights.
This causes the vanishing gradient problem in deep networks - a topic we'll explore deeply in Part 8. For now, just know that sigmoid can make deep networks hard to train.
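A rough sense of scale, as a sketch: if every layer contributes at most the sigmoid's peak slope of 0.25, the best-case gradient factor after n layers is 0.25 raised to the power n. This deliberately ignores the weights, which also scale the gradient in a real network, so treat it as an illustration rather than a bound on any particular model.

```python
MAX_SIGMOID_GRADIENT = 0.25  # slope of sigmoid at z = 0, its steepest point

# Best-case factor contributed by the activations alone after n layers
factors = {n: MAX_SIGMOID_GRADIENT ** n for n in (1, 3, 5, 10)}

for n_layers, factor in factors.items():
    print(f"{n_layers:2d} layers: activation gradient factor <= {factor:.2e}")
```

Even in this best case, ten sigmoid layers shrink the signal to less than one millionth of its original size.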
# =============================================================================
# SIGMOID DERIVATIVE: Visualization
# =============================================================================

def sigmoid_derivative(z):
    """
    Derivative of sigmoid function.
    σ'(z) = σ(z) * (1 - σ(z))
    """
    s = sigmoid(z)
    return s * (1 - s)

# Visualize sigmoid and its derivative
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

z_values = np.linspace(-6, 6, 1000)

# Plot sigmoid and derivative together
ax1 = axes[0]
ax1.plot(z_values, sigmoid(z_values), 'b-', linewidth=3, label='Sigmoid σ(z)')
ax1.plot(z_values, sigmoid_derivative(z_values), 'r-', linewidth=3, label="Derivative σ'(z)")
ax1.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
ax1.axhline(y=0.25, color='gray', linestyle=':', alpha=0.5)
ax1.axvline(x=0, color='gray', linestyle=':', alpha=0.5)
ax1.set_xlabel('z (weighted sum)', fontsize=12)
ax1.set_ylabel('Output', fontsize=12)
ax1.set_title('Sigmoid and Its Derivative', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Annotate
ax1.annotate('Max gradient = 0.25\nat z = 0', xy=(0.5, 0.25), fontsize=10, color='red')

# Plot: Show where gradient vanishes
ax2 = axes[1]
ax2.fill_between(z_values, sigmoid_derivative(z_values), alpha=0.3, color='red')
ax2.plot(z_values, sigmoid_derivative(z_values), 'r-', linewidth=3, label="Derivative σ'(z)")
ax2.axhline(y=0.01, color='orange', linestyle='--', linewidth=2, label='Vanishing threshold')

# Highlight vanishing regions
ax2.axvspan(-6, -4, alpha=0.2, color='gray', label='Vanishing gradient zone')
ax2.axvspan(4, 6, alpha=0.2, color='gray')

ax2.set_xlabel('z (weighted sum)', fontsize=12)
ax2.set_ylabel('Gradient magnitude', fontsize=12)
ax2.set_title('⚠️ The Vanishing Gradient Problem', fontsize=14, fontweight='bold')
ax2.legend(loc='upper right')
ax2.grid(True, alpha=0.3)
ax2.set_ylim(-0.02, 0.3)

plt.tight_layout()
plt.show()

print("\n⚠️ Key Warning about Sigmoid:")
print(" The maximum gradient is only 0.25 (at z = 0)")
print(" In deep networks, gradients multiply through layers:")
print(" 0.25 × 0.25 × 0.25 = 0.016 (after just 3 layers!)")
print(" This makes deep networks very hard to train with sigmoid.")
3.4 Tanh - The Centered Vote
🧠 Committee Analogy: "From Strongly Against to Strongly For"
Tanh (hyperbolic tangent) is like a committee member who can express the full spectrum of opinion:
-1: "I strongly believe this is NOT vertical"
0: "I'm completely neutral / unsure"
+1: "I strongly believe this IS vertical"
The key difference from sigmoid: the neutral point is 0 rather than 0.5, so the sign of the output tells you which way the member is leaning.
The Mathematics
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
Or equivalently:
$$\tanh(z) = 2\sigma(2z) - 1$$
(It's just a rescaled, shifted sigmoid!)
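The identity is easy to confirm numerically. A minimal sketch, with `tanh_via_sigmoid` as a hypothetical helper name:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh_via_sigmoid(z):
    """tanh written as a stretched and shifted sigmoid: 2*sigmoid(2z) - 1."""
    return 2.0 * sigmoid(2.0 * z) - 1.0

z = np.linspace(-3, 3, 13)
direct = np.tanh(z)
rescaled = tanh_via_sigmoid(z)

print("Largest difference:", np.max(np.abs(direct - rescaled)))
```

The largest difference is on the order of floating-point rounding error, so the two forms are interchangeable in practice.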
What Does "Zero-Centered" Mean?
Zero-centered means the average output is around 0, with equal spread in both directions.
When training, the gradient for each weight is calculated as (the input feeding that weight, which is the previous neuron's output) × (some error term).
| With Sigmoid | With Tanh |
|---|---|
| Output always > 0 | Output can be + or - |
| Gradient always same sign | Gradient can be + or - |
| Weights zig-zag | Weights can go directly |
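The Result: With sigmoid, every input to the next neuron is positive, so on each step all of that neuron's weight updates share the same sign. The weights move together in one direction, overshoot, then move back, creating inefficient zig-zag learning. Tanh avoids this!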
Zero-centered outputs are helpful for training because:
Negative inputs can push weights in one direction
Positive inputs can push weights in the other direction
This creates a balanced learning signal
With sigmoid (outputs 0 to 1), all outputs are positive, which can cause slower, zig-zagging learning.
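A small sketch of what "zero-centered" means in numbers, assuming pre-activations spread symmetrically around 0 (an idealized assumption made for this example, not something a real network guarantees):

```python
import numpy as np

z = np.linspace(-3, 3, 61)  # symmetric grid of pre-activation values

sigmoid_out = 1.0 / (1.0 + np.exp(-z))
tanh_out = np.tanh(z)

mean_sigmoid = sigmoid_out.mean()
mean_tanh = tanh_out.mean()

print(f"Sigmoid: mean = {mean_sigmoid:.3f}, min = {sigmoid_out.min():.3f}")
print(f"Tanh:    mean = {mean_tanh:.3f}, min = {tanh_out.min():.3f}")
```

Sigmoid's outputs average about 0.5 and never go negative, while tanh's outputs average about 0 and take both signs. That is the property the next layer's weight updates benefit from.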
🧠 Committee Analogy: "The Permanently Skeptical Member"
Imagine a committee member who becomes so pessimistic that they always compute a negative weighted sum. With ReLU, they'll output 0 forever. Since the gradient is also 0 when the output is 0, they can never recover - no learning signal gets through!
This member is "dead" to the committee. They contribute nothing and can't be revived.
How Neurons Die
Bad initialization: If weights start too negative, z might always be < 0
Large learning rate: A big update might push weights into permanently negative territory
Unlucky data: Some neurons just never activate for the training data
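Here is a minimal sketch of why such a neuron is stuck. The `relu` and `relu_derivative` definitions below are assumptions (the standard forms, with the gradient at z = 0 taken as 0), and the weights, bias, and inputs are made-up values chosen so that z is negative for every example.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    # Convention assumed here: the gradient at exactly z = 0 is 0
    return np.where(z > 0, 1.0, 0.0)

# Hypothetical neuron whose bias has drifted far into negative territory
weights = np.array([0.2, -0.4, 0.1])
bias = -10.0
inputs = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 1.0, 1.0]])  # three example inputs, one per row

z = inputs @ weights + bias      # pre-activations, all well below 0
outputs = relu(z)                # all zeros
local_grads = relu_derivative(z)

# Gradient of the output with respect to each weight is relu'(z) * x
weight_grads = local_grads[:, None] * inputs

print("z:           ", z)
print("outputs:     ", outputs)
print("weight grads:\n", weight_grads)
```

Every entry of `weight_grads` is 0, so no update can move the weights, and the neuron stays at 0 for these inputs no matter how long training runs.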
The Solution: Leaky ReLU
Instead of outputting 0 for negative inputs, Leaky ReLU outputs a small negative value:
$$f(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}$$
Where α is a small constant (typically 0.01).
This ensures there's always some gradient, so dead neurons can potentially recover.
# =============================================================================
# DEAD RELU & LEAKY RELU: Visualization
# =============================================================================

def leaky_relu(z, alpha=0.01):
    """
    Leaky ReLU activation function.
    Returns z if z > 0, else alpha * z.

    Parameters:
        z: input value (weighted sum)
        alpha: slope for negative values (default 0.01)

    Returns:
        z if z > 0, else alpha * z
    """
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    """Derivative of Leaky ReLU."""
    return np.where(z > 0, 1, alpha)

# Visualize ReLU vs Leaky ReLU
z_values = np.linspace(-5, 5, 1000)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: ReLU vs Leaky ReLU
ax1 = axes[0]
ax1.plot(z_values, relu(z_values), 'green', linewidth=3, label='ReLU')
ax1.plot(z_values, leaky_relu(z_values, alpha=0.1), 'orange', linewidth=3, label='Leaky ReLU (α=0.1)')
ax1.axhline(y=0, color='gray', linestyle='-', linewidth=0.5)
ax1.axvline(x=0, color='gray', linestyle=':', alpha=0.5)
ax1.set_xlabel('z (weighted sum)', fontsize=12)
ax1.set_ylabel('Output', fontsize=12)
ax1.set_title('ReLU vs Leaky ReLU', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_ylim(-1, 5)

# Annotate the difference
ax1.annotate('ReLU: completely flat\n(no gradient, neuron dead!)', xy=(-3, 0.1), fontsize=10, color='green')
ax1.annotate('Leaky ReLU: small slope\n(gradient exists, can recover)', xy=(-3, -0.8), fontsize=10, color='orange')

# Plot 2: Compare gradients
ax2 = axes[1]
ax2.plot(z_values, relu_derivative(z_values), 'green', linewidth=3, label='ReLU gradient')
ax2.plot(z_values, leaky_relu_derivative(z_values, alpha=0.1), 'orange', linewidth=3, label='Leaky ReLU gradient (α=0.1)')
ax2.axhline(y=0, color='gray', linestyle='-', linewidth=0.5)
ax2.axvline(x=0, color='gray', linestyle=':', alpha=0.5)
ax2.set_xlabel('z (weighted sum)', fontsize=12)
ax2.set_ylabel('Gradient', fontsize=12)
ax2.set_title('Gradient Comparison: Dead vs Alive', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_ylim(-0.1, 1.2)

# Highlight the dead zone
ax2.axvspan(-5, 0, alpha=0.1, color='red')
ax2.annotate('DEAD ZONE (ReLU)\nNo gradient = no learning!', xy=(-2.5, 0.6), fontsize=10, color='red', ha='center')

plt.tight_layout()
plt.show()

# Demonstrate dead neuron scenario
print("\n☠️ Dead Neuron Demonstration:")
print("=" * 50)

# Neuron with very negative weights (simulating a "dead" neuron)
dead_neuron = SimpleNeuron(9)
dead_neuron.set_weights(np.array([-2] * 9), bias=-5)  # Always outputs very negative z

print("\nDead neuron (very negative weights):")
for name, img in [("Vertical", vertical_flat), ("Horizontal", horizontal_flat)]:
    z = dead_neuron.forward(img)
    print(f" {name}: z = {z:.2f}, ReLU output = {relu(z):.2f}, Leaky ReLU = {leaky_relu(z, 0.1):.2f}")

print("\n⚠️ With ReLU: Both outputs are 0 - this neuron is DEAD!")
print(" With Leaky ReLU: Small negative outputs - can still learn!")
3.6 Softmax - The Committee Consensus (Multi-Class)
🧠 Committee Analogy: "All Votes Must Sum to 100%"
So far, our activation functions have been for binary decisions (Vertical vs Horizontal). But what if we had three or more options?
Vertical
Horizontal
Diagonal
Softmax is like a committee voting system where:
Each option gets a score
Scores are converted to probabilities
All probabilities must sum to 1 (100%)
This is consensus voting - every option gets a slice of the pie.
What is a Probability Distribution?
A probability distribution is a list of numbers that:
Are all between 0 and 1 (0% to 100%)
Add up to exactly 1 (100% total)
Example - Weather Forecast:
| Outcome | Probability |
|---|---|
| Sunny | 0.60 (60%) |
| Cloudy | 0.30 (30%) |
| Rainy | 0.10 (10%) |
| Total | 1.00 (100%) |
This makes sense - one of these outcomes WILL happen, and together they cover all possibilities.
Why Must Probabilities Sum to 1?
The practical reason: In classification, we're asking "which class does this belong to?" The answer must be ONE of the classes - so the confidence in all classes together must be 100%.
Each output represents the probability of that class.
In Plain English: Take e^(each score), then divide by the sum of all e^(scores). This guarantees the outputs sum to 1!
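Written as a formula, the plain-English recipe above corresponds to the standard softmax definition. The symbols $z_i$ (the raw score for class $i$) and $K$ (the number of classes) are notation chosen for this sketch:

```latex
\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
```

As a worked example with scores (2.0, 1.0, 0.1): the exponentials are roughly 7.389, 2.718, and 1.105, which sum to about 11.212. Dividing each by that sum gives probabilities of about 0.659, 0.242, and 0.099, which add up to 1.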
Key Properties
| Property | Explanation |
|---|---|
| Outputs sum to 1 | All probabilities add up to exactly 1 |
| All positive | Even negative inputs give positive probabilities |
| Amplifies differences | Larger scores get disproportionately more probability, because the exponential exaggerates gaps between scores |
| Used for output layer | Specifically for multi-class classification |
Softmax vs Sigmoid: When to Use Which
| Scenario | Use This | Why |
|---|---|---|
| 2 classes (V vs H) | Sigmoid | Single output: P(vertical), P(horizontal) = 1 - P(vertical) |
| 3+ classes (V vs H vs D) | Softmax | Need probabilities for EACH class that sum to 1 |
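The two rows are more closely related than they look: a two-class softmax gives the same probability as a sigmoid applied to the difference between the two scores. The sketch below checks this with made-up scores. The variable names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return exp_z / np.sum(exp_z)

z_vertical, z_horizontal = 1.7, 0.4  # hypothetical raw scores for the two classes

p_softmax = softmax(np.array([z_vertical, z_horizontal]))[0]
p_sigmoid = sigmoid(z_vertical - z_horizontal)

print(f"P(vertical) via 2-class softmax:        {p_softmax:.6f}")
print(f"P(vertical) via sigmoid of the gap:     {p_sigmoid:.6f}")
```

Both lines print the same value, which is why a single sigmoid output is enough for binary problems.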
Implementation
# =============================================================================
# SOFTMAX FUNCTION: Implementation and Visualization
# =============================================================================

def softmax(z):
    """
    Softmax activation function for multi-class classification.
    Converts a vector of scores into probabilities that sum to 1.

    Parameters:
        z: array of scores (one per class)

    Returns:
        Array of probabilities (same shape as z, sum = 1)
    """
    # Subtract max for numerical stability (prevents overflow)
    z_stable = z - np.max(z)
    exp_z = np.exp(z_stable)
    return exp_z / np.sum(exp_z)

# Example: Multi-class classification (Vertical, Horizontal, Diagonal)
# Suppose our network outputs these raw scores:
scores_example1 = np.array([2.0, 1.0, 0.1])  # Prefers class 0 (Vertical)
scores_example2 = np.array([0.1, 2.0, 0.5])  # Prefers class 1 (Horizontal)
scores_example3 = np.array([1.0, 1.0, 1.0])  # Uncertain (equal scores)

# Apply softmax
probs_example1 = softmax(scores_example1)
probs_example2 = softmax(scores_example2)
probs_example3 = softmax(scores_example3)

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

class_names = ['Vertical', 'Horizontal', 'Diagonal']
colors = ['#3498db', '#e74c3c', '#2ecc71']

for ax, scores, probs, title in zip(axes,
                                    [scores_example1, scores_example2, scores_example3],
                                    [probs_example1, probs_example2, probs_example3],
                                    ['Confident: Vertical', 'Confident: Horizontal', 'Uncertain']):
    x = np.arange(len(class_names))
    width = 0.35

    # Plot raw scores and probabilities
    bars1 = ax.bar(x - width/2, scores, width, label='Raw scores (z)', color='gray', alpha=0.5)
    bars2 = ax.bar(x + width/2, probs, width, label='Softmax (probabilities)', color=colors)

    ax.set_ylabel('Value', fontsize=12)
    ax.set_title(title, fontsize=14, fontweight='bold')
    ax.set_xticks(x)
    ax.set_xticklabels(class_names)
    ax.legend()
    ax.set_ylim(0, max(max(scores), 1.1))

    # Add probability labels
    for i, (bar, prob) in enumerate(zip(bars2, probs)):
        ax.annotate(f'{prob:.1%}', xy=(bar.get_x() + bar.get_width()/2, bar.get_height()),
                    ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

# Show the math
print("\n📊 Softmax Examples:")
print("=" * 60)

print("\nExample 1: Raw scores [2.0, 1.0, 0.1]")
print(f" Softmax output: [{', '.join([f'{p:.3f}' for p in probs_example1])}]")
print(f" Sum of probabilities: {sum(probs_example1):.3f} ✓")
print(f" Prediction: {class_names[np.argmax(probs_example1)]} ({probs_example1.max():.1%} confident)")

print("\nExample 2: Raw scores [0.1, 2.0, 0.5]")
print(f" Softmax output: [{', '.join([f'{p:.3f}' for p in probs_example2])}]")
print(f" Sum of probabilities: {sum(probs_example2):.3f} ✓")
print(f" Prediction: {class_names[np.argmax(probs_example2)]} ({probs_example2.max():.1%} confident)")

print("\nExample 3: Equal scores [1.0, 1.0, 1.0]")
print(f" Softmax output: [{', '.join([f'{p:.3f}' for p in probs_example3])}]")
print(f" Sum of probabilities: {sum(probs_example3):.3f} ✓")
print(f" All equal! The network is completely uncertain.")

print("\n💡 Notice: Softmax always outputs valid probabilities that sum to 1!")
3.7 Activation Comparison - When to Use Which?
The Complete Picture
Now that we've seen all the major activation functions, let's visualize them together and understand when to use each.
Quick Reference Table
| Activation | Output Range | Use Case | Pros | Cons |
|---|---|---|---|---|
| Step | {0, 1} | Historical only | Simple | Not differentiable |
| Sigmoid | (0, 1) | Binary output layer | Probability interpretation | Vanishing gradient |
| Tanh | (-1, 1) | Hidden layers | Zero-centered | Vanishing gradient |
| ReLU | [0, ∞) | Hidden layers (default) | Fast, no vanishing | Dead neurons |
| Leaky ReLU | (-∞, ∞) | Hidden layers | No dead neurons | Slight complexity |
| Softmax | (0, 1), Σ=1 | Multi-class output | Probability distribution | Only for final layer |
Decision Flowchart
What layer are you building?
│
├─► Output Layer
│   │
│   ├─► Binary classification (V vs H)
│   │   └─► Use SIGMOID
│   │
│   ├─► Multi-class (V vs H vs Diagonal)
│   │   └─► Use SOFTMAX
│   │
│   └─► Regression (predict a number)
│       └─► Use LINEAR (no activation)
│
└─► Hidden Layer
    │
    ├─► Default choice
    │   └─► Use RELU
    │
    ├─► Worried about dead neurons?
    │   └─► Use LEAKY RELU
    │
    └─► Legacy/special cases
        └─► Use TANH
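The flowchart can also be written as a small lookup function. This is only a sketch: the function name `suggest_activation` and all of its argument names are hypothetical, and the rules are nothing more than the branches drawn above.

```python
def suggest_activation(layer, task=None, worried_about_dead_neurons=False, legacy=False):
    """Return the activation name that the flowchart points to.

    All names here are hypothetical and exist only for this illustration.
    """
    if layer == "output":
        choices = {
            "binary": "sigmoid",      # V vs H
            "multiclass": "softmax",  # V vs H vs Diagonal
            "regression": "linear",   # predict a number, no activation
        }
        if task not in choices:
            raise ValueError(f"Unknown output task: {task!r}")
        return choices[task]

    if layer == "hidden":
        if legacy:
            return "tanh"
        if worried_about_dead_neurons:
            return "leaky_relu"
        return "relu"  # default choice

    raise ValueError(f"Unknown layer type: {layer!r}")

print(suggest_activation("output", task="multiclass"))
print(suggest_activation("hidden"))
print(suggest_activation("hidden", worried_about_dead_neurons=True))
```

Treat the result as a starting point. The flowchart captures common defaults, not rules that hold for every problem.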
Now let's upgrade our SimpleNeuron from Part 2 to include activation functions. This will be our complete, working neuron!
The Neuron Class with Activation
Our neuron now has all the pieces:
Inputs (x) - The flattened image data
Weights (w) - What each pixel contributes
Bias (b) - The personal threshold
Weighted Sum (z) - The raw score: z = w·x + b
Activation (a) - The final vote: a = f(z)
The full formula:
a = f(w·x + b)
Where f is the chosen activation function.
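To make the formula concrete, here is a worked example using plain NumPy, with sigmoid as f. The weights and bias are illustrative values for a vertical-line detector (positive on the middle column, negative elsewhere); the variable names are assumptions for this sketch.

```python
import numpy as np

# Flattened 3x3 images (row by row)
x_vertical = np.array([0, 1, 0,
                       0, 1, 0,
                       0, 1, 0], dtype=float)
x_horizontal = np.array([0, 0, 0,
                         1, 1, 1,
                         0, 0, 0], dtype=float)

# Illustrative hand-designed weights and bias (assumed values for this sketch)
w = 0.5 * np.array([-1, 2, -1,
                    -1, 2, -1,
                    -1, 2, -1], dtype=float)
b = -1.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: weighted sum z = w·x + b, Step 2: activation a = f(z)
z_v = np.dot(w, x_vertical) + b      # 1.0 + 1.0 + 1.0 - 1.0 = 2.0
z_h = np.dot(w, x_horizontal) + b    # -0.5 + 1.0 - 0.5 - 1.0 = -1.0
a_v = sigmoid(z_v)
a_h = sigmoid(z_h)

print(f"Vertical:   z = {z_v:.1f}, a = {a_v:.4f}")   # z = 2.0, a = 0.8808
print(f"Horizontal: z = {z_h:.1f}, a = {a_h:.4f}")   # z = -1.0, a = 0.2689
```

The vertical line lands well above 0.5 and the horizontal line well below it, so the same score-then-vote pipeline separates the two patterns.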
cell 023
# =============================================================================
# COMPLETE NEURON CLASS WITH ACTIVATION
# =============================================================================

class Neuron:
    """
    A complete single neuron with configurable activation function.

    This is the upgraded version of SimpleNeuron from Part 2,
    now with the ability to choose how to "cast its vote."

    Attributes:
        weights: numpy array of weights (one per input)
        bias: single bias value
        activation: string name of activation function
    """

    def __init__(self, n_inputs, activation='sigmoid'):
        """
        Initialize the neuron with random weights and chosen activation.

        Parameters:
            n_inputs: number of input features (9 for our 3x3 images)
            activation: 'sigmoid', 'relu', 'tanh', 'leaky_relu', or 'step'
        """
        # Initialize weights (small random values work best)
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0
        self.activation = activation.lower()
        self.n_inputs = n_inputs

        # Store the last computed values (useful for visualization)
        self.last_z = None  # Pre-activation (weighted sum)
        self.last_a = None  # Post-activation (output)

    def _activate(self, z):
        """
        Apply the chosen activation function.
        This is where the neuron "casts its vote."
        """
        if self.activation == 'sigmoid':
            # Clip z so np.exp cannot overflow for extreme scores
            return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
        elif self.activation == 'relu':
            return np.maximum(0, z)
        elif self.activation == 'leaky_relu':
            return np.where(z > 0, z, 0.01 * z)
        elif self.activation == 'tanh':
            return np.tanh(z)
        elif self.activation == 'step':
            return 1 if z >= 0 else 0
        else:
            raise ValueError(f"Unknown activation: {self.activation}")

    def forward(self, x):
        """
        Compute the neuron's output (forward pass).

        Parameters:
            x: input array (can be 2D image or 1D flattened)

        Returns:
            The activated output (the "vote")
        """
        # Flatten input if needed
        x = np.array(x).flatten()

        # Step 1: Weighted sum (the "score")
        self.last_z = np.dot(self.weights, x) + self.bias

        # Step 2: Apply activation (the "vote")
        self.last_a = self._activate(self.last_z)
        return self.last_a

    def predict(self, x, threshold=0.5):
        """
        Make a binary prediction (0 or 1).

        Parameters:
            x: input array
            threshold: value above which we predict 1 (default 0.5)

        Returns:
            0 or 1
        """
        output = self.forward(x)
        return 1 if output >= threshold else 0

    def set_weights(self, weights, bias=None):
        """Manually set weights and optionally bias."""
        self.weights = np.array(weights).flatten()
        if bias is not None:
            self.bias = bias

    def __repr__(self):
        return f"Neuron(inputs={self.n_inputs}, activation='{self.activation}')"


# Test our complete neuron!
print("🧠 Complete Neuron Class Created!")
print("=" * 60)

# Create a neuron with hand-designed weights for vertical detection
vertical_neuron = Neuron(9, activation='sigmoid')
vertical_neuron.set_weights(
    weights=np.array([
        -1, +2, -1,  # Top row
        -1, +2, -1,  # Middle row
        -1, +2, -1   # Bottom row
    ]) * 0.5,
    bias=-1.0
)

print(f"\nOur vertical detector: {vertical_neuron}")
print("  Weights (reshaped to 3x3):")
print(f"  {vertical_neuron.weights.reshape(3, 3)}")
print(f"  Bias: {vertical_neuron.bias}")

print("\n" + "=" * 60)
print("Testing with different activation functions:")
print("=" * 60)

for activation in ['sigmoid', 'relu', 'tanh', 'step']:
    vertical_neuron.activation = activation

    # Capture last_z right after each forward pass, before it is overwritten
    v_output = vertical_neuron.forward(vertical_flat)
    v_z = vertical_neuron.last_z
    h_output = vertical_neuron.forward(horizontal_flat)
    h_z = vertical_neuron.last_z

    print(f"\n{activation.upper()}:")
    print(f"  Vertical line:   output={v_output:.4f} (raw z={v_z:.3f})")
    print(f"  Horizontal line: output={h_output:.4f} (raw z={h_z:.3f})")
3.9 🎮 Interactive Lab: Activation Explorer
Now it's your turn to experiment! Use the interactive widget below to:
Choose an activation function - See how each one transforms the same input
Slide the z value - Watch the output change in real-time
Compare visually - See all activations plotted together
This hands-on exploration will build your intuition for how activations work!
cell 025
# =============================================================================
# INTERACTIVE ACTIVATION EXPLORER
# =============================================================================
# Relies on sigmoid, relu, leaky_relu and step_function defined earlier in this notebook.

if WIDGETS_AVAILABLE:

    def create_activation_explorer():
        """Create an interactive activation function explorer."""
        # Create widgets
        activation_dropdown = widgets.Dropdown(
            options=['sigmoid', 'relu', 'leaky_relu', 'tanh', 'step'],
            value='sigmoid',
            description='Activation:',
            style={'description_width': '100px'}
        )
        z_slider = widgets.FloatSlider(
            value=0.0, min=-5.0, max=5.0, step=0.1,
            description='z value:',
            continuous_update=True,
            style={'description_width': '100px'}
        )
        output_area = widgets.Output()

        def update_plot(activation, z_value):
            """Update the visualization."""
            with output_area:
                clear_output(wait=True)
                fig, axes = plt.subplots(1, 2, figsize=(14, 5))
                z_range = np.linspace(-5, 5, 500)

                # Get activation values (alpha=0.1 makes the leaky slope visible)
                if activation == 'sigmoid':
                    values = sigmoid(z_range)
                    current = sigmoid(z_value)
                    color = '#3498db'
                elif activation == 'relu':
                    values = relu(z_range)
                    current = relu(z_value)
                    color = '#27ae60'
                elif activation == 'leaky_relu':
                    values = leaky_relu(z_range, alpha=0.1)
                    current = leaky_relu(z_value, alpha=0.1)
                    color = '#e67e22'
                elif activation == 'tanh':
                    values = np.tanh(z_range)
                    current = np.tanh(z_value)
                    color = '#9b59b6'
                elif activation == 'step':
                    values = step_function(z_range)
                    current = 1 if z_value >= 0 else 0
                    color = '#95a5a6'

                # Plot 1: The activation function with current point
                ax1 = axes[0]
                ax1.plot(z_range, values, color=color, linewidth=3, label=activation)
                ax1.scatter([z_value], [current], color='red', s=200, zorder=5,
                            edgecolors='white', linewidth=2)
                ax1.axhline(y=0, color='gray', linestyle='-', linewidth=0.5)
                ax1.axvline(x=0, color='gray', linestyle=':', alpha=0.5)
                ax1.axvline(x=z_value, color='red', linestyle='--', alpha=0.5)
                ax1.axhline(y=current, color='red', linestyle='--', alpha=0.5)
                ax1.set_xlabel('z (weighted sum)', fontsize=12)
                ax1.set_ylabel('Output', fontsize=12)
                ax1.set_title(f'{activation.upper()}: z = {z_value:.2f} → output = {current:.4f}',
                              fontsize=14, fontweight='bold')
                ax1.legend()
                ax1.grid(True, alpha=0.3)
                y_min = -1.5 if activation == 'tanh' else -0.5
                y_max = 5 if activation in ['relu', 'leaky_relu'] else 1.5
                ax1.set_ylim(y_min, y_max)

                # Plot 2: Compare all activations at this z value
                ax2 = axes[1]
                all_activations = {
                    'sigmoid': sigmoid(z_value),
                    'relu': relu(z_value),
                    'leaky_relu': leaky_relu(z_value, alpha=0.1),
                    'tanh': np.tanh(z_value),
                    'step': 1 if z_value >= 0 else 0
                }
                names = list(all_activations.keys())
                bars = ax2.bar(names, list(all_activations.values()),
                               color=['#3498db', '#27ae60', '#e67e22', '#9b59b6', '#95a5a6'])
                ax2.axhline(y=0, color='gray', linestyle='-', linewidth=0.5)
                ax2.set_ylabel('Output', fontsize=12)
                ax2.set_title(f'All Activations at z = {z_value:.2f}',
                              fontsize=14, fontweight='bold')
                # Rotate the existing tick labels instead of re-setting them
                plt.setp(ax2.get_xticklabels(), rotation=45, ha='right')

                # Add value labels
                for bar, val in zip(bars, all_activations.values()):
                    ax2.annotate(f'{val:.3f}',
                                 xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
                                 ha='center', va='bottom' if val >= 0 else 'top',
                                 fontsize=10, fontweight='bold')

                # Highlight selected activation
                idx = names.index(activation)
                bars[idx].set_edgecolor('red')
                bars[idx].set_linewidth(3)

                plt.tight_layout()
                plt.show()

                # Print interpretation
                print("\n" + "=" * 60)
                print(f"🎯 Interpretation for {activation.upper()} at z = {z_value:.2f}:")
                print("=" * 60)
                if activation == 'sigmoid':
                    print(f"  Output: {current:.4f} = {current * 100:.1f}% confidence")
                    print(f"  The neuron is {current * 100:.0f}% confident in class 1 (Vertical)")
                elif activation == 'relu':
                    if current == 0:
                        print("  Output: 0 (neuron is silent - not convinced)")
                    else:
                        print(f"  Output: {current:.4f} (neuron is active with intensity {current:.2f})")
                elif activation == 'leaky_relu':
                    print(f"  Output: {current:.4f} (negative scores leak through at a reduced scale)")
                elif activation == 'tanh':
                    if current > 0:
                        print(f"  Output: {current:.4f} (leaning toward class 1)")
                    elif current < 0:
                        print(f"  Output: {current:.4f} (leaning toward class 0)")
                    else:
                        print("  Output: 0 (completely neutral)")
                elif activation == 'step':
                    print(f"  Output: {int(current)} ({'YES - Vertical' if current == 1 else 'NO - Not Vertical'})")

        # Connect widgets to update function
        widgets.interactive_output(
            update_plot,
            {'activation': activation_dropdown, 'z_value': z_slider}
        )

        # Create layout
        controls = widgets.HBox([activation_dropdown, z_slider])
        display(widgets.VBox([controls, output_area]))

        # Initial plot
        update_plot('sigmoid', 0.0)

    print("🎮 Interactive Activation Explorer")
    print("=" * 60)
    print("Use the controls below to explore different activation functions!")
    print("• Select an activation from the dropdown")
    print("• Slide the z value to see how the output changes")
    print("")
    create_activation_explorer()

else:
    print("⚠️ Interactive widgets not available.")
    print("   Install ipywidgets to enable: pip install ipywidgets")
    print("\n   Showing static comparison instead...")

    # Static fallback
    z_test_values = [-3, -1, 0, 1, 3]
    print("\n" + "=" * 70)
    print(f"{'z value':<10} {'Sigmoid':<12} {'ReLU':<12} {'Tanh':<12} {'Step':<12}")
    print("=" * 70)
    for z in z_test_values:
        print(f"{z:<10} {sigmoid(z):<12.4f} {relu(z):<12.4f} {np.tanh(z):<12.4f} {1 if z >= 0 else 0:<12}")
Part 3 Summary
What We Learned
In this notebook, our committee member learned how to cast their vote. Here's what we covered:
| Concept | What It Is | Committee Analogy |
|---|---|---|
| Activation Function | Transforms raw score into meaningful output | How the member casts their vote |
| Step Function | Binary 0 or 1 output | "Yes" or "No", no middle ground |
| Sigmoid | Smooth curve from 0 to 1 | Confidence level (0-100%) |
| Tanh | Smooth curve from -1 to +1 | Opinion spectrum (against → for) |
| ReLU | max(0, z) - silent if negative | "If not convinced, stay silent" |
| Leaky ReLU | Small slope for negatives | Allows recovery from "death" |
| Softmax | Probabilities summing to 1 | Consensus vote across options |
Key Takeaways
Activation functions are essential - Without them, any stack of layers collapses into a single linear function, so the network can only draw straight-line decision boundaries
Choose based on layer position:
Output layer: Sigmoid (binary) or Softmax (multi-class)
Hidden layers: ReLU (default) or Leaky ReLU
Watch out for:
Vanishing gradients with Sigmoid/Tanh in deep networks
Dead neurons with ReLU
The complete neuron formula: a = f(w·x + b)
Where f is your chosen activation function
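The first takeaway can be verified numerically. The sketch below uses small hand-picked matrices (illustrative values, not taken from the notebook) to show that two stacked linear layers equal one combined linear layer, and that inserting ReLU between them produces something a single linear layer cannot: here, the absolute difference |x1 - x2|.

```python
import numpy as np

# Hand-picked two-layer network: 2 inputs -> 2 hidden units -> 1 output
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]])
b2 = np.zeros(1)

# Collapse the two linear layers into one: W = W2 @ W1, b = W2 @ b1 + b2
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2

inputs = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]])

for x in inputs:
    stacked_linear = W2 @ (W1 @ x + b1) + b2             # no activation
    single_linear = W_combined @ x + b_combined          # one equivalent layer
    with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2     # ReLU in between
    print(f"x={x}  stacked={stacked_linear[0]:.1f}  "
          f"single={single_linear[0]:.1f}  with_relu={with_relu[0]:.1f}")
```

Without an activation the stacked and single-layer outputs match for every input (both are 0 here), while the ReLU version outputs 1, 1, 0 and 2, which is |x1 - x2| and not a linear function of x.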
Connection to the Journey
| Part | What We Learned |
|---|---|
| Part 1 | How to read images as numbers (matrices) |
| Part 2 | How to weigh evidence and add bias |
| Part 3 | How to turn scores into meaningful votes |
| Part 4 | (Next) Making our first prediction! |
📝 Knowledge Check
Test your understanding of activation functions!
cell 027
# =============================================================================
# KNOWLEDGE CHECK - Part 3
# =============================================================================

print(" KNOWLEDGE CHECK - Part 3: Activation Functions")
print("=" * 60)
print("\nAnswer these questions to test your understanding.\n")

questions = [
    {
        "question": "1. Why can't we just use the raw weighted sum (z) as the neuron output?",
        "options": [
            "A) It's too slow to compute",
            "B) It can be any value, and stacking linear functions gives linear functions",
            "C) It uses too much memory",
            "D) It only works for images"
        ],
        "answer": "B",
        "explanation": "Without activation, outputs can be any value (not meaningful), and stacking linear functions just gives another linear function - you can't learn complex patterns!"
    },
    {
        "question": "2. What output range does Sigmoid produce?",
        "options": [
            "A) (-∞, +∞)",
            "B) (-1, +1)",
            "C) (0, 1)",
            "D) {0, 1}"
        ],
        "answer": "C",
        "explanation": "Sigmoid squashes any input to a value between 0 and 1, which can be interpreted as a probability."
    },
    {
        "question": "3. What activation would you use for the OUTPUT layer of a 3-class classifier?",
        "options": [
            "A) ReLU",
            "B) Sigmoid",
            "C) Softmax",
            "D) Tanh"
        ],
        "answer": "C",
        "explanation": "Softmax is perfect for multi-class classification because it outputs probabilities that sum to 1."
    },
    {
        "question": "4. What is the 'Dead ReLU' problem?",
        "options": [
            "A) ReLU is too slow",
            "B) Neurons with negative weighted sums output 0 and can't recover",
            "C) ReLU uses too much memory",
            "D) ReLU only works on images"
        ],
        "answer": "B",
        "explanation": "If a neuron always computes negative z, ReLU outputs 0, gradient is 0, and it can never learn again - it's 'dead'!"
    },
    {
        "question": "5. What's the maximum gradient (derivative) of sigmoid?",
        "options": [
            "A) 1.0",
            "B) 0.5",
            "C) 0.25",
            "D) 0.1"
        ],
        "answer": "C",
        "explanation": "Sigmoid's derivative σ'(z) = σ(z)(1-σ(z)) has maximum 0.25 at z=0. This is why deep networks with sigmoid suffer from vanishing gradients."
    }
]

# Display questions
for q in questions:
    print(q["question"])
    for opt in q["options"]:
        print(f"   {opt}")
    print()

print("\n" + "=" * 60)
print("Scroll down to see the answers...")
print("=" * 60)
cell 028
# =============================================================================
# ANSWERS - Knowledge Check Part 3
# =============================================================================

print("✅ ANSWERS - Knowledge Check Part 3")
print("=" * 60)

for i, q in enumerate(questions, 1):
    print(f"\n{i}. Answer: {q['answer']}")
    print(f"   💡 {q['explanation']}")

print("\n" + "=" * 60)
print("How did you do?")
print("  5/5: 🌟 Activation Expert!")
print("  4/5: 👍 Great understanding!")
print("  3/5: 📚 Review the sections you missed")
print("  <3:  🔄 Re-read Part 3 before continuing")
print("=" * 60)
What's Next?
Congratulations! You've completed Part 3 of the Neural Network Fundamentals series!
Our committee member now knows:
✅ How to read evidence (matrices - Part 1)
✅ How to weigh evidence and set thresholds (weights & bias - Part 2)
✅ How to cast a meaningful vote (activation functions - Part 3)
Coming Up in Part 4: The Perceptron - First Prediction
Now our committee member is fully equipped! In the next notebook, we'll:
Build a complete Perceptron - One of the earliest trainable neural networks (Rosenblatt, 1958!)
Generate our V/H line dataset - Create training examples on-the-fly
Make predictions - See our untrained network in action
Witness the confusion - Spoiler: random weights = random guesses!
Prepare for training - Set the stage for learning from mistakes
The perceptron is where theory meets practice. Our committee member will finally attempt to classify lines - even if they fail spectacularly at first!
Continue to Part 4: part_4_perceptron.ipynb
Full Notebook Series
| Notebook | Topic | Status |
|---|---|---|
| neural_network_fundamentals.ipynb | Part 0: Welcome & Part 1: Matrices | ✅ Complete |
| part_2_single_neuron.ipynb | Part 2: The First Committee Member | ✅ Complete |
| part_3_activation_functions.ipynb | Part 3: The Vote - Activation Functions | ✅ Complete |
| part_4_perceptron.ipynb | Part 4: The Perceptron - First Prediction | ⏳ Next |
| part_5_training.ipynb | Part 5: Training - Learning from Mistakes | 🔜 Coming Soon |
| part_6_evaluation.ipynb | Part 6: The Trained Expert | 🔜 Coming Soon |
| part_7_hidden_layers.ipynb | Part 7: The Full Committee | 🔜 Coming Soon |
| part_8_deep_learning_challenges.ipynb | Part 8: Dangers of Deep Learning | 🔜 Coming Soon |
| part_9_full_implementation.ipynb | Part 9: Complete Journey | 🔜 Coming Soon |
| part_10_whats_next.ipynb | Part 10: What's Next & Appendix | 🔜 Coming Soon |
"A vote must be decisive. Activation functions turn raw scores into meaningful decisions."
The Brain's Decision Committee - Learning to See, One Step at a Time