Outline:
How the brain works
Neural Networks
Perceptrons
Multilayer Feed-forward Networks
Application of Neural Networks
Computational viewpoint:
to represent functions using a network of simple arithmetic computing
elements, and methods for learning such representations from examples.
Biological viewpoint:
mathematical models for the operation of the brain.
Neural network:
a network of interconnected neurons.
Simple arithmetic elements <=> neurons (brain cells).
Comparing brains with digital computers:
1. Speed: a computer switches in ns, the brain in ms; but neurons &
synapses activate simultaneously, with very flexible connectivity.
2. A NN simulated on a serial computer takes hundreds of steps to
decide whether a single neuron-like unit will fire (cf. brain: 1 step).
The advantages to mimic a brain with NNs:
1. Fault-tolerant (keeps working when some units go down) & self-healing
2. Capability to handle new (unseen) inputs
3. Graceful degradation (performance declines gradually when
something breaks down)
4. training & learning
Neural Networks:
Node & link:
A NN is composed of a number of nodes, or units, connected by
links. Some are input/output nodes.
Weight:
Each link has a numeric weight associated with it. Weights are
the primary means of long-term storage. Learning by updating
weights.
Activation:
Each unit has a set of input links from other units, a set of
output links to other units, a current activation level and a
means to compute the activation level at the next step given
its inputs and weights.
Input function:
A linear component that computes the weighted sum of the
unit's input values.
Activation function (g):
A non-linear component that transforms the weighted sum into
the final value serving as the unit's activation value (a_i).
Practical activation functions:
1. step(x) = 1, if x >= t;
           = 0, if x < t.
2. sign(x) = +1, if x >= t;
           = -1, if x < t.
3. sigmoid(x) = 1/(1+e^(-x)). --(differentiable)
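These three functions can be sketched directly (a minimal sketch; the threshold t defaults to 0 here, a choice the notes leave unspecified):

```python
import math

def step(x, t=0.0):
    # Hard threshold: fires (1) iff the input reaches the threshold t
    return 1 if x >= t else 0

def sign(x, t=0.0):
    # Bipolar threshold: +1 iff x >= t, else -1
    return 1 if x >= t else -1

def sigmoid(x):
    # Smooth squashing into (0, 1); differentiable, with
    # g'(x) = g(x) * (1 - g(x)), which back-propagation exploits
    return 1.0 / (1.0 + math.exp(-x))
```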
Structures:
1. feed-forward
Links are unidirectional; no cycle.
=> A directed acyclic graph (DAG).
Each unit is linked only to units in the next layer:
no links between units in the same layer, no links back
to a previous layer, and no links that skip a layer.
Feed-forward networks with no hidden units are called
perceptrons; otherwise, they are called multilayer
feed-forward networks.
For a fixed structure and g, learning is a process of
tuning the parameters to fit the data in the training
set, a process known in statistics as nonlinear regression.
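A forward pass through such a layered network can be sketched as follows (sigmoid units and the absence of bias weights are simplifying assumptions, not part of the notes):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(weight_layers, x):
    """Propagate input x through a feed-forward network.
    weight_layers[k][i] holds the weights into unit i of
    layer k+1, so each layer feeds only the next one (a DAG)."""
    a = x
    for W in weight_layers:
        # Each unit: weighted sum of inputs, then the activation g
        a = [sigmoid(sum(w * ai for w, ai in zip(row, a))) for row in W]
    return a
```

For example, a single output unit with weights [1, 1] on input [0, 0] produces sigmoid(0) = 0.5.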
2. recurrent
Internal state is stored in the activation levels of the units.
Can model more complex agents with internal states.
Learning is more difficult -> the network can become unstable.
Example recurrent networks:
1. Hopfield Network
2. Boltzmann Machines
Optimal network structure:
Problems:
1. Too small --> incapability;
2. Too big --> overfitting, time-consuming.
A feed-forward network with 1 hidden layer can
approximate any continuous function; with 2 hidden
layers, it can approximate any function at all.
Methods:
1. Genetic algorithm (time-consuming);
2. hill-climbing searches.
Perceptron:
Definition:
Single-layer, feed-forward networks.
Use:
Representing Boolean functions AND, OR & NOT.
Learning linearly separable functions.
Limitation:
Perceptrons can only handle linearly separable
functions; e.g., they cannot represent XOR.
Training process:
Most NN learning, including the perceptron learning method,
follows the current-best-hypothesis (CBH) scheme:
◦ Set initial weights randomly, usually [-0.5, 0.5]
◦ Update the weight to make the network consistent
with the examples. Make small adjustments to reduce
the difference between the observed and predicted
values.
◦ Repeat for each weight.
Each epoch involves updating all the weights.
Perceptron learning method:
Each input contributes W_j*I_j to O, so
if I_j is positive, increasing W_j
⇒ increases O
⇒ decreases Err (Err = T - O); and vice versa.
To achieve the effect, use the following learning rule:
W_j <- W_j + alpha * I_j * Err
where the alpha constant is a term called the learning rate,
which needs tuning.
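The rule above can be sketched as a small training loop (a sketch, not the notes' exact procedure: the bias weight fed by a constant +1 input and the zero initial weights are assumptions added here):

```python
def train_perceptron(examples, alpha=0.1, epochs=25):
    """examples: list of (inputs, target) pairs with 0/1 targets."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)              # last weight acts as the bias
    for _ in range(epochs):          # each epoch updates all weights
        for inputs, target in examples:
            x = list(inputs) + [1.0]        # constant bias input
            o = predict(w, inputs)
            err = target - o                # Err = T - O
            for j in range(len(w)):
                w[j] += alpha * x[j] * err  # W_j <- W_j + alpha*I_j*Err
    return w

def predict(w, inputs):
    x = list(inputs) + [1.0]
    s = sum(wj * xj for wj, xj in zip(w, x))  # input function: weighted sum
    return 1 if s >= 0 else 0                 # step activation, t = 0
```

Trained on the four examples of AND (linearly separable), the learned weights classify all of them correctly; trained on XOR, no weight vector can.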
Comparison:
Decision tree: discrete (multivalued) attributes only.
NN: continuous inputs (real numbers in some fixed range).
Multilayer Feed-Forward:
Most popular learning method: Back-propagation.
Notations:
I^(k) --- layer k (input units);
W^(k,j) --- weights between layer k and j
a^(j) --- activation of layer j
W^(j,i) --- weights between layer j and i
O^(i) --- layer i (output units);
in_i --- sum(W^(j,i)*a^(j)) for all j
g'(x) --- derivative of the activation function
The detailed algorithm summary:
Compute the Δ values for the output units using the
observed error.
Starting with the output layer, repeat the following for
each layer in the network, until the earliest hidden
layer is reached:
Propagate the values back to the previous layer.
Update the weights between the 2 layers.
During the observed-error computation, save intermediate
values for later use; in particular, cache g'(in_i).
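Under the notation above, one back-propagation update for a network with a single hidden layer can be sketched as follows (sigmoid g, no bias weights, and the W <- W + alpha * a * Δ update are details filled in here; the notes give only the outline):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(x, t, W_kj, W_ji, alpha=0.5):
    """One update. x: inputs I^(k); t: target outputs;
    W_kj[j]: weights into hidden unit j; W_ji[i]: weights into
    output unit i. Returns the outputs O^(i) before the update."""
    # Forward pass, caching in_j, in_i and the activations
    in_j = [sum(w * xk for w, xk in zip(row, x)) for row in W_kj]
    a_j = [sigmoid(v) for v in in_j]
    in_i = [sum(w * aj for w, aj in zip(row, a_j)) for row in W_ji]
    o_i = [sigmoid(v) for v in in_i]
    # Delta for the output units, using cached g'(in_i) = o*(1-o)
    delta_i = [oi * (1 - oi) * (ti - oi) for oi, ti in zip(o_i, t)]
    # Propagate the deltas back to the hidden layer
    delta_j = [a_j[j] * (1 - a_j[j]) *
               sum(W_ji[i][j] * delta_i[i] for i in range(len(delta_i)))
               for j in range(len(a_j))]
    # Update the weights between each pair of adjacent layers
    for i, row in enumerate(W_ji):
        for j in range(len(row)):
            row[j] += alpha * a_j[j] * delta_i[i]
    for j, row in enumerate(W_kj):
        for k in range(len(row)):
            row[k] += alpha * x[k] * delta_j[j]
    return o_i
```

Repeated calls on the same example move the output toward the target, since each update follows the error gradient for that example.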