# Recurrent neural networks

Recurrent neural networks are a variety of artificial neural network (ANN) with persistent state. The state persists via cyclical connections or integrators (which resemble capacitors).

## Perceptrons

The basic unit of a Neural Network is a Perceptron, or something similar. A Perceptron has several numerical inputs. It computes a weighted sum of its inputs, and if the result is greater than a certain number, it outputs a 1. Otherwise, it outputs 0 (or -1, depending on your design). The data that describes a Perceptron is its vector of inputs $x_i$, its vector of weights $w_i$, and its threshold c.

For convenience, instead of storing a separate threshold, we can have a "bias" input $x_0$ that is always 1, and make the threshold always zero. This is equivalent to having a threshold of $-w_0$ under the previous system.

### Standard Perceptron Output Function

$output = sgn(-c + \sum_{i=1}^N w_i*x_i)$

which is equivalent to:

$output = sgn(\sum_{i=0}^N w_i * x_i)$

with $w_0 = -c$ and $x_0=1$

Fig.1 The sigmoid function.[1]

This function is called "local" because it doesn't need to know anything outside of this node and its direct inputs. Sometimes the sign function is replaced with a sigmoid function, which is a continuous and differentiable approximation of the sign function.

### Alternative Types of Neurons

Some neurons use a Gaussian-shaped activation function, so that the output is high when the input is in some narrow range, and low otherwise[2].

Some neurons in RNNs use output functions that integrate the input over time. Such output functions can usually be described with a differential equation where the one of the variables is the weighted sum of the inputs, and another variable is the output of the neuron.

## Acyclic Neural Networks

Multiple Perceptrons can be hooked up to form neural networks. In that case, an output of one perceptron may be the input to another perceptron. The final output from the network is the numbers from the perceptrons that are not feeding back in to other perceptrons, and these perceptrons are called the output layer. The perceptrons that recieve numerical input from outside the neural network are called the input layer. All remaining perceptrons are said to be in the hidden layer(s). Connections are generally viewed as directed edges from the information source to the information consumer. In the simplest case, these neural networks are acyclic.

An example of a basic acyclic neural network. (This image is borrowed from Wikipedia: Artificial Neural Network)

### Calculating the Output

Initially, all the external inputs are known. When a neural network is acyclic the outputs of the neurons must be calculated in the order of a valid topological sort. That is, while there are still perceptrons with unknown outputs, calculate the output of a perceptron whose inputs are all known. Since the network is acyclic you are guaranteed that there is always at least one such node, until you are done.

### Backpropagation

Backpropagation is an algorithm for training neural networks. The simple version, when you are dealing with an acyclic neural network, involves the following steps:

1. Calculate each external output of the neural network, and compare these with what the outputs are "supposed to be".

2. For each output: If it "should have been" higher, than increment the weights of the current "1" inputs to the perceptron that created it. If it "should have been" lower, then decrement the weights of the current "1" inputs to the perceptron that created it.

3. Assume the inputs whose weights were decremented "should have been lower" and the inputs whose weights were incremented "should have been higher".

4. Repeat steps 2-3 recursively until you reach the input layer.

Note that simple backpropagation can work with any acyclic neural network (It need not have a number of discrete layers). It just evaluates a local procedure at each node, in the order of any breadth first search starting with the set of nodes in the output layer (and following edges backwards).

An example of a simple recurrent neural network.

Backpropagation must be run repeatedly, on many different training examples, before the neural network will approximate the target function.

## Recurrent Neural Network Definition

A recurrent neural network is to an acyclic neural network as a sequential circuit is to a combinational circuit. It requires a "clock" pulse. At each time step, the output of each perceptron is re-evaluated individually, using current external inputs and the outputs of other perceptrons from the previous time step. Therefore it is necessary to initialize the outputs of the perceptrons before any execution begins. Perceptrons may be arranged to form any graph, with any number of cycles, inputs, and outputs.

## Training Recurrent Neural Networks

### Backpropagation

#### Time Unfolding

One approach is to unfold the recurrent network over several time steps, chop off the cycles in the first/last stages, and then apply traditional backpropagation to the resulting approximation of the RNN.[3] Then the weight vector of a neuron in the RNN would be the average of the weight vectors of the neurons that represent it in the unfolded version.

#### Exponential Decay

As you apply backpropagation, beginning at the output nodes and traveling backwards, the magnitude of the changes you make decreases after each hop. This prevents making too much change to one node if you visit it many times in a row due to the cycles in the network. When the magnitude gets below a certain threshold, the recursion stops.

### Genetic Algorithms

• Randomly make some mutated copies/hybrids of existing models, then kill off models with a probability inversely proportional to their fitness according to some performance metric.
• Goto the previous step, unless you decide your best model so far is "good enough".

## Architecture Zoo

Here are a bunch of variations of recurrent neural networks.

### Elman Neural Network

• Only three discrete layers
• The previous outputs of the normal part of the hidden layer are stored in the "context units" of the hidden layer. Each context unit has one input, and a weight fixed at 1. The outputs of the context units feed back into the normal part of the hidden layer.
• There is a clock pulse, of course. The local output function is evaluated once at each node, for each time step. Also a local update rule for training may be applied at each node, during each time step.
• Good enough to learn some data sequences

### Jordan Network

Same thing, except the context unit inputs come from the input layer.

### Hopfield Net

In a Hopfield Net, the weight on the connection from A to B must be the same as the weight on the connection from node B to node A. A node must not be connected to itself, but can be connected to any other node. It can serve as content-addressable memory. Hopfield nets are trained by applying a deterministic update rule at each time-step.[4]

### Echo State Network

ESNs are based on a "reservoir" that consists of a random web of neurons with random weights, in between the input nodes and the output nodes. To conduct supervised learning, the random web is driven by some training inputs, and then only the output weights are modified. This algorithm is claimed in a patent. Its reservoir neurons use a Leaky Integrator function of the form dx/dt = -Ax + C, where C is the total input to the neuron, x is the output of the neuron, and A is a constant. [5] [6]

### Long Short Term Memory Network

Its basic unit is a self-connected linear integrator, called the "error carousel".[7]

### RNN with parametric bias

An RNNPB is basically a three-layer feedforward network, plus some connections from some output nodes to some input nodes, plus some input nodes which are the output nodes of a different hierarchical neural network.[8]

### Continuous-time RNN

The number of neurons is required to be equal to the number of inputs, for some unknown reason.[9]. Each neuron's output is specified by a differential equation like dx/dt = -Ax + g(C + u), where A is a constant, g is an activation function, C is the weighted sum of the inputs to the neuron, and u is one of the external inputs to the neural network.

### Hierarchical RNN

This approach divides the RNN into separate components that have different functions and are sparsely connected to each other. They are designed to be self-organizing.[10]

### Recurrent Multilayer Perceptron

This is like a layered feedforward neural network, except each neuron is allowed to have crosstalk-links with other neurons in the same layer, and each neuron may have a self-loop that is delayed by 1 time step.[11]

Mostly Harmless[12]

The original source is out of print.[13]

## Applications

RNNs can process time-series data where there may be several inputs in each tick, unlike NFAs and DFAs. RNNs have a persistent and continuously variable internal state, unlike HMMs, feedforward networks, and SVMs. RNNs are used for analyzing speech and composing music.[14] RNNs might also be used for video recognition.

Even non-sequential problems may benefit from RNNs. For example, the problem of determining the parity of a set of bits.[15]. This is very simple with RNNs, but doing it with a feedforward neural network would require excessive complexity.

Edited by Pengfei

Here are some applications of Back Propagation:

End edit by Pengfei

## Why other sequence learning techniques sometimes fail

• HMMs: discrete and lacking good algorithms for learning structure
• Symbolic grammars: Don't work well with noisy data and natural language.
• Genetic Programming: Too slow.[16]