
Artificial neural networks

Artificial neural networks (ANNs) are computer-generated adaptive systems that are thought to work similarly to biological neural networks, in particular the human brain. In economics and statistics, ANNs are used to find structures and regularities in complex data sets.

Artificial neural networks consist of a large number of basically identical or very similar components, neurons, that interact with each other through a rather simple mechanism, a so-called threshold function. In other words, they operate on the basis of a distributed representation of their "knowledge". Crucial for their performance is not so much the specialization of individual components but a complex connection structure of many similar and surprisingly simple components with a collective, aggregated performance. Epistemologically, this opposes the traditional philosophical attention to a "lonely" but highly complex unity, a subject or an "I".

The perceptron

The essential aspect of a neural network's operation is the excitation states of its neurons or, more exactly, the transmission of these excitation states, which triggers excitation in other neurons and amplifies or weakens the connections between them. This transmission can be explained using the example of the so-called perceptron, a predecessor of ANNs, which Frank Rosenblatt proposed in 1958 in order to simulate the receptors of the retina.

Consider the task of teaching a machine the logical operation of the inclusive OR function. The machine is to learn to provide the outputs listed in the following table when confronted with the input values \(I_1\) and \(I_2\):

\(I_1\) \(I_2\) output
0 0 0
0 1 1
1 0 1
1 1 1
[Figure: perceptron]

Following the model of the brain, the computer generates a "virtual" network which consists of two so-called input neurons, a hidden neuron, and one output neuron. The two input neurons have connections to the hidden neuron, which in turn is connected to the output neuron. The connections between all these neurons are weighted with an initially randomly assigned numeric value between \(-1\) and \(+1\); any information passing through a connection is multiplied by this weight. The actual work is done by the hidden neuron. It combines the input values, in this case those of the logical operation listed in the table above, according to its connections' weights to produce the actual input (net). That is: \(net = w_1 * I_1 + w_2 * I_2\)

The actual output value is then determined with respect to a threshold value, which here, for example, could be \(0.5\): values of \(net\) larger than or equal to the threshold are rounded to \(1\), smaller ones to \(0\). This is possible in this case since the example only has binary outputs. The result is then presented in the output neuron.
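The net computation and thresholding can be sketched in a few lines of Python. As an illustration, the sketch uses the initial weights \(0.1\) and \(0.3\) that the walkthrough below also starts from; the variable names are made up:

```python
# Untrained perceptron: weighted sum of the inputs, then "rounding"
# on a threshold. Weights are the example's initial (arbitrary) values.
w1, w2 = 0.1, 0.3
threshold = 0.5

def output(i1, i2):
    net = w1 * i1 + w2 * i2              # net = w1*I1 + w2*I2
    return 1 if net >= threshold else 0  # round on the threshold

for i1, i2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(i1, i2, "->", output(i1, i2))
```

With these initial weights every case yields \(0\), so three of the four OR cases are still wrong; adjusting the weights is exactly what the learning process described next is for.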

The actual learning process now consists of iteratively increasing or decreasing the (initially randomly assigned) connection weights by a certain learning rate until the output value coincides with the desired result in each of the four cases possible in the OR operation.

So if, for example, the initial connection weights are assigned as \(w_1 = 0.1\) and \(w_2 = 0.3\) and the learning rate as \(0.2\), the learning process would proceed in the following sequence:

In the first step, according to the above table, the two values \(0\) and \(0\) are introduced to the input neurons and forwarded to the hidden neuron. Weighted with \(0.1\) and \(0.3\) (i.e., multiplied by these values), this produces the \(net\) value \(0 + 0 = 0\) and thus (even without rounding on the threshold \(0.5\)) the desired result. Hence, the weights are not changed in this case.

In the next step, the next two values in the table, \(0\) and \(1\), are passed into the input neurons. With the set weights this gives \(net = 0 + 0.3\) and thus, rounded on the threshold \(0.5\), an output of \(0\), which in this case is not the desired result. Hence, a weight has to be changed. The weight belonging to the input that has changed in comparison with the previous case is increased by the learning rate of \(0.2\) to \(w_2 = 0.5\). Rounding the \(net\) value on the threshold now yields \(1\), which corresponds to the desired result.

With the next row's entries of \(1\) and \(0\), the weighting yields \(0.1\) and \(0\) and thus, rounded on the threshold value \(0.5\), an output of \(0\), which again is not the desired result. Once more the weight for the changed input, now \(I_1\), is increased by the learning rate \(0.2\) to \(w_1 = 0.3\). The \(net\) value now is \(round(0.3 * 1 + 0.5 * 0) = 0\), which does not yet meet the desired result, but is accepted for the moment.

With the last entry of this first training round, \(1\) and \(1\), the weighting yields \(0.3\) and \(0.5\) and thus, rounded on the threshold \(0.5\), an output of \(1\), which corresponds to the desired result. The weights need not be changed.

The learning process has now completed one pass through all possible cases of the data and thus starts over again at the first row of the above table.

Again the input values \(0\) and \(0\) are passed to the input neurons and now, with the weights \(0.3\) and \(0.5\), yield a \(net\) value of \(0\), the desired result. The weights are not changed.

With the next entry of \(0\) and \(1\) the weighting yields \(0\) and \(0.5\), and thus, rounded on the threshold, an output of \(1\), which corresponds to the desired result. The weights are not changed.

With the entries \(1\) and \(0\), the weighting yields \(0.3\) and \(0\) and thus, rounded on the threshold, an output of \(0\), which does not correspond to the desired result. The weight corresponding to the changed input, i.e. for \(I_1\), is increased by the learning rate of \(0.2\) to \(w_1 = 0.5\). The output now, with \(round(0.5 * 1 + 0.5 * 0)\), is \(1\), the desired result.

The last input of this round is \(1\) and \(1\), which yields a weighting of \(0.5 + 0.5 = 1\) and thus (even without rounding) the desired result.

Again, the learning process has completed one pass through all possible cases and starts over. As it turns out, however, with the current weighting all values are already output correctly. The learning process is thus completed in this case. The weighting of the network connections is "tuned" so that each time one of the possible binary combinations is introduced into the input neurons, the output corresponds to the result of the logical OR operation.
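The walkthrough above can be reproduced with the classical perceptron learning rule (weight change = learning rate × error × input; with the binary errors of this example, this amounts to the steps of \(0.2\) described above). A minimal sketch:

```python
# Perceptron learning of the inclusive OR, reproducing the walkthrough.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = [0.1, 0.3]             # the example's initial weights
lr, threshold = 0.2, 0.5   # learning rate and threshold

for epoch in range(10):    # a few passes suffice here
    all_correct = True
    for (i1, i2), target in data:
        net = w[0] * i1 + w[1] * i2
        out = 1 if net >= threshold else 0
        error = target - out          # binary error: -1, 0 or +1
        if error != 0:
            all_correct = False
            w[0] += lr * error * i1   # change = learning rate * error * input
            w[1] += lr * error * i2
    if all_correct:                   # a full pass without errors: done
        break

print(w)  # converges to (approximately) [0.5, 0.5], as in the walkthrough
```

Note that only the weight whose input was \(1\) gets changed, since the update is multiplied by the input; this is why, in the walkthrough, only the weight of the "changed" input is adjusted.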

The iterative process of approximating consistent weights is the essential aspect of neural networks. It works analogously for much more complex problems, for which, however, more hidden neurons and, depending on the size of the input and the output, also more input and output neurons are deployed. In cases where the deviation from the desired result is not just binary, as in the above example, an error term is taken into account: the connection weights are changed according to the product of learning rate and error. (In the above example, a multiplication by \(1\) would not have any effect on the weight change.)

Back propagation

An intermittent irritation in the development of artificial neural networks was caused by the so-called XOR function, that is, the logical link corresponding to an exclusive OR which, in contrast to the inclusive OR, returns the output \(0\) for the input values \(1\) and \(1\). The XOR function is not linearly separable: in two dimensions, its solution space cannot be separated by a simple straight line. In their 1969 book on perceptrons, Marvin Minsky and Seymour Papert showed that neural networks, in the form described above, are not able to learn the XOR function and similarly complex relations. As a consequence, the initial euphoria about the possibilities of neural networks in the field of Artificial Intelligence research vanished.
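This failure can be demonstrated directly: training the simple perceptron from the OR example on the XOR targets never converges, because no pair of weights can simultaneously satisfy \(w_2 \geq 0.5\), \(w_1 \geq 0.5\), and \(w_1 + w_2 < 0.5\). A sketch (the limit of 100 epochs is an arbitrary illustrative choice):

```python
# A single-layer perceptron cannot learn XOR: no straight line separates
# (0,1) and (1,0) from (0,0) and (1,1).
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
w = [0.1, 0.3]
lr, threshold = 0.2, 0.5

misclassified = []                    # wrong outputs per epoch
for epoch in range(100):
    wrong = 0
    for (i1, i2), target in xor_data:
        out = 1 if w[0] * i1 + w[1] * i2 >= threshold else 0
        error = target - out
        wrong += error != 0
        w[0] += lr * error * i1
        w[1] += lr * error * i2
    misclassified.append(wrong)

print(misclassified[-1])  # still at least one error after 100 epochs
```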

In the 1970s, it was proposed to connect several layers of hidden neurons (so-called hidden layers) in series. It could be shown that with such multi-layer perceptrons (MLPs), in which multiple layers of internal neurons provide for higher resolution, relations that are not linearly separable, such as the XOR function, can be learned. In terms of calculation, an MLP corresponds to several single-layer perceptrons, where the internal "hidden" layers have multiple output neurons, which in turn function as the input neurons for the next layer.

In single-layer perceptrons, the adjustment of weights takes place immediately after each calculation. The network reacts to inputs by processing them forward from input to output. This process is therefore called forward propagation.

In more complex multi-layer perceptrons, the weight adjustments can be made only after the information has passed through all the layers of hidden neurons and the calculated output has been compared with the desired output. Forward propagation is therefore supplemented with a second process known as back propagation, in the course of which the weights are adjusted from back to front. After an output has been generated as described above, the connection weights are adjusted backwards, as before with respect to the difference (the error) between output and expected output, starting from the connections of the output neurons to the neurons of the last hidden layer and proceeding back to the first connections of the input neurons.

In each run, that is, with each data set, the network thus is traversed forward in order to calculate an output, and a second time backwards in order to adjust the respective weights on the basis of this output.

[Figure: multi-layer perceptron]

As in the example of the OR operator, it can happen that relevant inputs consist of zeros. Multiplied by the weights of the connections, these inputs would always yield zero and therefore would not cause any reaction of the network. For this reason, the neuron layers of back-propagation networks are equipped with so-called bias neurons, which provide a constant "neutral input" of \(1\) that always causes a reaction of the network.

What is more, not every data landscape whose regularities are to be learned can easily be captured with binary values. It is therefore not always possible to simply round up or down on a threshold. For this reason, sigmoid functions of the form \(y = \frac{1}{1 + e^{-x}}\) are often used, which squeeze input values, according to their proximity to the threshold, into the interval between \(0\) and \(1\); the result is considered the strength of activation of the neuron.
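Putting the pieces together (a hidden layer, bias neurons, the sigmoid function, forward and back propagation), a minimal multi-layer perceptron that learns the XOR function can be sketched as follows. The layer sizes, learning rate, and iteration count are illustrative assumptions, not values prescribed by the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # squeezes net values into (0, 1)

np.random.seed(0)  # connection weights are initialized randomly

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

def add_bias(a):
    # Append the bias neuron's constant "neutral input" of 1.
    return np.hstack([a, np.ones((a.shape[0], 1))])

# Weights between -1 and +1; the extra row is the bias neuron's weight.
W1 = np.random.uniform(-1, 1, (3, 4))  # inputs (+bias) -> 4 hidden neurons
W2 = np.random.uniform(-1, 1, (5, 1))  # hidden (+bias) -> output neuron

lr = 1.0
for _ in range(20000):
    # Forward propagation: input -> hidden -> output.
    H = sigmoid(add_bias(X) @ W1)
    O = sigmoid(add_bias(H) @ W2)
    # Back propagation: errors scaled by the sigmoid's derivative,
    # weights adjusted from back to front.
    delta_O = (T - O) * O * (1 - O)
    delta_H = (delta_O @ W2[:-1].T) * H * (1 - H)  # bias row has no input
    W2 += lr * add_bias(H).T @ delta_O
    W1 += lr * add_bias(X).T @ delta_H

print(O.ravel())  # should be close to the XOR targets 0, 1, 1, 0
```

Each iteration performs one forward pass over all four data points and one backward pass that adjusts both weight matrices, exactly the two traversals described above.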

[Figure: logistic (sigmoid) function]

As mentioned above, the essential aspect of the functioning of ANNs is not so much the neurons themselves but the connections between them and their structure. This is evident from the fact that learning processes can be equally effective in yielding the same desired results while the networks deployed for these effects show completely different structures. Depending on random initial differences (as mentioned, connection weights are often initialized randomly), weights can develop completely differently while still allowing the network to perform exactly the same task. The following figures show two simulated networks, both of which have learned to correctly reproduce the logical AND operator. As can be seen from the connection weights, represented by thickness and color (blue for positive values, green for negative), the structure of these networks is different, although their performance is exactly the same. If one is willing to consider such an analogy, this would suggest that the same ability of two people to play the cello could be based on completely different network connections in their brains.

[Figures: two trained ANNs with different weight structures, both reproducing the logical AND operator]

Applications

Back-propagation networks are instances of supervised machine learning. This means that the expected output of these algorithms is known for part of the data, and the network is trained to minimize the difference (the error) between its actual output and the expected output. To test this training, the input data is often split into a training set and a test set with which the learning success can be evaluated. Once the network is trained accordingly, it can be used to predict corresponding regularities in so far unsurveyed data. One practical application is Optical Character Recognition (OCR), as used to process scanned documents, for instance, or for handwriting recognition on smartphones and personal digital assistants.

A handwriting recognition program works on a probability distribution of individual handwriting samples, which usually have to be presented to the algorithm several times. For this, the input fields of the respective devices consist of a grid of cells which correspond to the input neurons of the artificial neural network. As an individual handwriting sample will touch some of these cells and not others, the input neurons are activated repeatedly in the above-described way.
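The mapping from input grid to input neurons can be sketched as follows; the 5×5 grid size and the sample stroke are made-up illustrations:

```python
# A hypothetical 5x5 input grid: 1 where the handwriting sample touches
# a cell, 0 elsewhere (here a vertical stroke down the middle column).
grid = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

# Flatten the grid row by row into the activation vector presented to
# the network's 25 input neurons.
inputs = [cell for row in grid for cell in row]

print(len(inputs), sum(inputs))  # 25 input neurons, 5 of them activated
```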

[Figure: OCR input grid]

The network iteratively learns from these activation inputs. It adjusts its weights and the activation of its neurons so that its structure eventually is able to recognize input data that differs from earlier inputs, perhaps because it has been written somewhat hastily.

The adaptation of the network does not have to be completed with the initial training. ANNs owe much of their attractiveness to the fact that they can continue to learn while being used. They are able to adapt continuously to dynamic conditions. They are not, like hard-wired circuits, fixed once and for all in the way they operate.

Graceful degradation

This circumstance is responsible for the high reliability and resilience of ANNs. While in hard-wired circuits the failure of a single, perhaps even rather irrelevant component can lead to a breakdown of the entire system, networks show remarkable resistance to blackout. A striking example is provided by accident victims or stroke patients whose brains were damaged by lack of oxygen but who, after appropriate therapy, sometimes find their way back to behaviors that, in view of the injuries, can be considered relatively normal. Obviously, the undamaged parts of the brain are able to relearn the lost functions and to compensate for the damage.

This property of networks is called graceful degradation and is deployed quite deliberately in areas in which partial failures should not cause total blackouts. The electricity supply in most OECD countries, for instance, follows this principle, and the internet, too, has been designed with regard to the particular reliability and resilience of networks.