Neural Networks

The brain behind AI


Matteo Di Paolantonio

Definition

“machine learning algorithm inspired by the structure and function of the human brain.”

Structure

Neurons
Layers
Weights

Neurons and layers

[Image: a handwritten "6"]

The white pattern activates some neurons in your brain.

Defining the pattern

28 x 28 = 784 pixels

Defining the pattern

784 greyscale values (0.0 - 1.0)
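As a rough illustration (not part of the slides), this is how a 28 x 28 greyscale image could be flattened into the 784 input values; `image` is a hypothetical stand-in for a real MNIST digit.

```python
import numpy as np

image = np.random.randint(0, 256, size=(28, 28))  # stand-in for a real MNIST digit
inputs = image.reshape(784) / 255.0               # 784 greyscale values in 0.0 - 1.0
print(inputs.shape)                               # (784,)
```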

Defining the pattern

0 1 2 3 4 5 6 7 8 9

Feedforward

Multilayer Perceptron

Pattern as input

784 neurons

Pattern values as first layer

Activation

Activation value: 0.4

First and last layer

0.0 → inactive
0.4 → active
1.0 → really active

Sub-patterns

Loops and lines

Sub-patterns

Fragments of loops and lines

Middle layers

Recap

  • There are many types of neural networks and we are focusing on the simplest one: the Multilayer Perceptron (MLP).
  • A neuron is a basic unit of a neural network. For the sake of simplicity let's say that it holds a value between 0.0 and 1.0. The greater the value, the more active the neuron is.
  • Neurons are organized in layers. The first is the input layer, the last is the output layer, and the middle layers are the hidden layers.
  • Neurons in a given layer are connected to the neurons in the previous layer. Activation values are propagated forward through the network.

Doubts or thoughts?

Weights

Weights propagate activation

Weights

Activation: ranges from 0.0 to 1.0
Weight: can be any value, negative or positive

Weights

w₁ = 7.0   w₂ = -4.6   w₃ = 5.8   w₄ = 7.3   w₅ = -0.5   w₆ = -3.2   ...   w₇₈₄ = 2.2

Activation

w₁a₁ + w₂a₂ + ... + wₙaₙ

Activation

w₁a₁ + w₂a₂ + ... + wₙaₙ

The weighted sum is not limited to the 0.0 - 1.0 range: weights can be any value, negative or positive.

Sigmoid function

σ(x) = 1 / (1 + e⁻ˣ)
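A minimal sketch of the sigmoid in Python, just to show how it squashes any real number into the 0.0 - 1.0 range:

```python
import numpy as np

def sigmoid(x):
    # σ(x) = 1 / (1 + e^(-x)): maps any real number into the 0.0 - 1.0 range.
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(-4.0), sigmoid(0.0), sigmoid(4.0))  # ≈0.018, 0.5, ≈0.982
```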

Activation

σ( w₁a₁ + w₂a₂ + ... + wₙaₙ )

Activation

σ( w₁a₁ + w₂a₂ + ... + wₙaₙ + b )

Bias
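Putting the pieces together, a minimal sketch of a single neuron's activation: the weighted sum of the previous layer's activations plus a bias, squashed by the sigmoid. The weights, activations, and bias below are made-up example values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

prev_activations = np.array([0.0, 0.4, 0.0, 1.0])   # a₁ ... aₙ (made up)
weights          = np.array([7.0, -4.6, 5.8, 7.3])  # w₁ ... wₙ (made up)
bias             = -3.0                             # b (made up)

# σ( w₁a₁ + ... + wₙaₙ + b )
activation = sigmoid(weights @ prev_activations + bias)
print(activation)
```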

Some numbers

  • 4 layers
  • 784 + 16 + 16 + 10 = 826 neurons
  • (784 x 16) + (16 x 16) + (16 x 10) = 12,960 weights

Activation

a₀(1) = σ( w₀,₀a₀(0) + w₀,₁a₁(0) + ... + w₀,ₙaₙ(0) + b₀(1) )

Layer Activation

a(1) = σ ( [ w₀,₀  w₀,₁  ...  w₀,ₙ ]   [ a₀(0) ]   [ b₀(1) ]
           [ w₁,₀  w₁,₁  ...  w₁,ₙ ] · [ a₁(0) ] + [ b₁(1) ]
           [  ...   ...  ...   ... ]   [  ...  ]   [  ...  ]
           [ wₖ,₀  wₖ,₁  ...  wₖ,ₙ ]   [ aₙ(0) ]   [ bₖ(1) ] )

Activation

a(1) = σ( W(1,0)a(0) + b(1) )

a(2) = σ( W(2,1)a(1) + b(2) )

a(3) = σ( W(3,2)a(2) + b(3) )


a(3) = σ( W(3,2)σ( W(2,1)σ( W(1,0)a(0) + b(1) ) + b(2) ) + b(3) )
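A minimal sketch of the whole feedforward pass for the 784-16-16-10 network, using random stand-ins for the trained weights and biases:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 784)), rng.normal(size=16)  # W(1,0), b(1)
W2, b2 = rng.normal(size=(16, 16)),  rng.normal(size=16)  # W(2,1), b(2)
W3, b3 = rng.normal(size=(10, 16)),  rng.normal(size=10)  # W(3,2), b(3)

def feedforward(a0):
    a1 = sigmoid(W1 @ a0 + b1)  # a(1) = σ( W(1,0)a(0) + b(1) )
    a2 = sigmoid(W2 @ a1 + b2)  # a(2) = σ( W(2,1)a(1) + b(2) )
    a3 = sigmoid(W3 @ a2 + b3)  # a(3) = σ( W(3,2)a(2) + b(3) )
    return a3

a0 = rng.random(784)          # stand-in for the 784 pixel values
print(feedforward(a0).shape)  # (10,): one value per digit 0-9
```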

Just a function...

ƒ(a₀, ..., a₇₈₃) = [ y₀, ..., y₉ ]

...with many parameters


  • (784 x 16) + (16 x 16) + (16 x 10) = 12,960 weights
  • 16 + 16 + 10 = 42 biases

13,002 parameters in total
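The parameter count can be checked with a few lines of Python (a quick sketch, not from the slides):

```python
layers = [784, 16, 16, 10]
weights = sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))  # 12,960
biases  = sum(layers[1:])                                               # 42
print(weights + biases)                                                 # 13002
```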

Recap

  • Neurons are wired together by weights: each neuron is connected to all the neurons in the previous layer and to all the neurons in the next layer.
  • A neuron's activation value is determined by the previous layer's activations and the weights that connect them, plus a bias.
  • A neural network is just a function, an overly complex one with many parameters. Weights (and biases) are the parameters of the function, its dials and knobs.
  • The model implemented by a neural network consists of this huge set of parameters and the matrix operations performed to compute the function's output.

Doubts or thoughts?

Functions


MNIST dataset

Test and train


Cost

(0.22 - 0.00)² + (0.86 - 0.00)² + (0.38 - 0.00)² + (0.92 - 0.00)² + (0.75 - 0.00)² + (0.12 - 0.00)² + (0.66 - 1.00)² + (0.88 - 0.00)² + (0.43 - 0.00)² + (0.15 - 0.00)²

Single sample cost

Cost

(0.22 - 0.00)² = 0.0484
(0.86 - 0.00)² = 0.7396
(0.38 - 0.00)² = 0.1444
(0.92 - 0.00)² = 0.8464
(0.75 - 0.00)² = 0.5625
(0.12 - 0.00)² = 0.0144
(0.66 - 1.00)² = 0.1156
(0.88 - 0.00)² = 0.7744
(0.43 - 0.00)² = 0.1849
(0.15 - 0.00)² = 0.0225

0.0484 + 0.7396 + 0.1444 + 0.8464 + 0.5625 + 0.0144 + 0.1156 + 0.7744 + 0.1849 + 0.0225 = 3.4531

Single sample cost
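A minimal sketch of the same single-sample cost in Python: the squared differences between the network's 10 outputs and the desired outputs (1.0 for the correct digit, 0.0 for the others). The numbers are the ones from the example above.

```python
import numpy as np

output  = np.array([0.22, 0.86, 0.38, 0.92, 0.75, 0.12, 0.66, 0.88, 0.43, 0.15])
desired = np.array([0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00])

cost = np.sum((output - desired) ** 2)  # single sample cost
print(cost)                             # ≈3.4531
```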

Functions

Neural Network Function

Input: 784 numbers (pixel values)
Parameters: 13,002 numbers (weights and biases)
Output: 10 numbers (digits)

Cost Function

Input: 13,002 numbers (weights and biases)
Parameters: tens of thousands of training samples (pixel values)
Output: 1 number (cost)

Function minima

C(w)


Gradient descent

-∇ C(w)
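A minimal sketch of gradient descent, assuming some `gradient(w)` function that computes ∇C(w) (in the real network that job is done by backpropagation). The cost here is a toy stand-in, C(w) = Σ w², whose minimum is at w = 0:

```python
import numpy as np

def gradient(w):
    # Toy stand-in for ∇C(w), using C(w) = sum(w²) whose minimum is at w = 0.
    return 2 * w

w = np.random.default_rng(1).normal(size=13002)  # all weights and biases
learning_rate = 0.1

for _ in range(100):
    w = w - learning_rate * gradient(w)          # step along -∇C(w)

print(np.abs(w).max())  # close to 0: the toy cost has been minimized
```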

Recap

  • A neural network is a function whose parameters are the weights and biases, whereas in the cost function the weights and biases are the inputs and the training samples act as the parameters.
  • The cost function is a function that takes the network's parameters as inputs and returns a single number, the cost. The lower the cost, the better the network is performing.
  • The gradient descent algorithm is a way to find the parameters that minimize the cost function.
  • The algorithm that efficiently computes this gradient and forms the core of how a neural network learns is known as backpropagation.

Doubts or thoughts?

Backpropagation

Backpropagation

0.66 = σ( w₁a₁ + ... + wₙaₙ + b )
  1. Increase the bias: b
  2. Increase the weights: wᵢ
  3. Increase the activations of the previous layer: aᵢ

Keep propagating

a = σ( w₁a₁ + ... + wₙaₙ + b )
  1. Increase the bias: b
  2. Increase the weights: wᵢ
  3. Keep propagating backwards (see the sketch below)
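A minimal backpropagation sketch for a single training sample on the 784-16-16-10 network with sigmoid activations and the squared-error cost used earlier. The parameters and the input are random stand-ins; this is an illustration of the idea, not a full training loop.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 784)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(16, 16)) * 0.1,  np.zeros(16)
W3, b3 = rng.normal(size=(10, 16)) * 0.1,  np.zeros(10)

a0 = rng.random(784)                  # stand-in for one digit image
y = np.zeros(10); y[6] = 1.0          # desired output: the digit 6

# Forward pass, keeping every layer's activation.
a1 = sigmoid(W1 @ a0 + b1)
a2 = sigmoid(W2 @ a1 + b2)
a3 = sigmoid(W3 @ a2 + b3)

# Backward pass: how much each weight and bias should change (∂C/∂w, ∂C/∂b).
d3 = 2 * (a3 - y) * a3 * (1 - a3)     # error at the output layer
d2 = (W3.T @ d3) * a2 * (1 - a2)      # propagate the error backwards...
d1 = (W2.T @ d2) * a1 * (1 - a1)      # ...one layer at a time

grads = {"W3": np.outer(d3, a2), "b3": d3,
         "W2": np.outer(d2, a1), "b2": d2,
         "W1": np.outer(d1, a0), "b1": d1}

# One gradient-descent step on this single sample: tweak the knobs and dials.
lr = 0.5
W3 -= lr * grads["W3"]; b3 -= lr * grads["b3"]
W2 -= lr * grads["W2"]; b2 -= lr * grads["b2"]
W1 -= lr * grads["W1"]; b1 -= lr * grads["b1"]
```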

Tweak knobs and dials

Recap

  • Training a neural network is the process of adjusting its parameters (weights and biases) to minimize the cost function.
  • A properly labeled dataset is crucial for training a neural network, but the network will only know about what appears in the training samples.
  • Starting with random weights and biases, the network iteratively adjusts them based on the desired output of the training samples.

Doubts or thoughts?

A bit of history

  • 1943 - Warren McCulloch and Walter Pitts conceptualized the first neural network, comparing the human brain to an electrical circuit.
  • 1958 - Frank Rosenblatt invented the perceptron, a type of neural network with one layer that could learn to recognize patterns in data.
  • 1980 - The first neural networks with multiple layers were developed by David Rumelhart, Geoffrey Hinton, and Ronald Williams.
  • 1986 - Backpropagation, a technique for training neural networks with multiple layers, was developed by Rumelhart, Hinton, and Williams.
  • So-called AI winter (1974-1980) - funding cuts, lack of progress, and skepticism about the future of AI because of the computational complexity.
  • 1997 - The first Long Short-Term Memory (LSTM) neural network was developed by Sepp Hochreiter and Jürgen Schmidhuber. It is capable of learning long-term dependencies and well suited for sequence prediction tasks.
  • 1999 - GPUs become a perfect fit for executing the matrix operations required by neural networks.
  • 2012 - AlexNet, a convolutional neural network (CNN), wins the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an image recognition challenge.
  • 2017 - "Attention Is All You Need" (Google paper) introduced the transformer architecture, a type of neural network that can process sequence data in parallel. This architecture is the foundation of modern LLMs.

It's been 82 years...

Thank you!

Thanks to 3Blue1Brown