necroforest wrote:basically, the weights of a neuron forms a line (or hyperplane) through your input space, and the activation function tells you which side of the line you're on (and how far from the line, if you're using a sigmoid function).
if you have a bunch of neurons organized in layers, the first layer chops up your input space into sections and "transforms" your input into a new space, then each layer transforms that space into another one, etc.
This explains the part "how do feed-forward neural networks interpret patterns". To put it slightly more precisely:
For a start, you must realise that nodes in connectionist networks (this name is somewhat more correct than "neural networks", but it means the same thing) essentially only say "yes" or "no". Like "the pixel is on" or "the pixel is off". Often a node is allowed to say how much "yes" or "no", as with a sigmoid activation function, but it's still essentially just "yes" or "no". So when "the activation function tells you which side of the line you're on", as necroforest put it, the node in question is really telling you whether the input is on the "positive side" of the line (or hyperplane), yes or no.
Suppose you have a set of input nodes that all link to one output node, with no layers in between. We call such a network a perceptron. A perceptron can tell (through the output value of the output node) how far a given input vector is from the linear separator, just as necroforest said. A different way to say this is that perceptrons can (always and only) represent functions that are linearly separable.
Functions like "left or right", "majority", "AND" and "inclusive OR" are linearly separable. "XOR" is not linearly separable, so it cannot be represented by a perceptron.
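To make that concrete, here is a small sketch in Python. The weights for AND and OR are hand-picked by me for illustration (many other values work just as well), and the grid search at the end is only a sanity check on a coarse grid, not a proof that XOR is impossible.

```python
from itertools import product

def perceptron(weights, bias, inputs):
    """Step-activation perceptron: 1 if the input is on the positive side of the line."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

INPUTS = list(product([0, 1], repeat=2))  # (0,0), (0,1), (1,0), (1,1)

def truth_table(weights, bias):
    """Outputs of the perceptron for all four 2-bit inputs, in INPUTS order."""
    return [perceptron(weights, bias, x) for x in INPUTS]

# Hand-picked, illustrative weights.
AND_TABLE = truth_table([1, 1], -1.5)
OR_TABLE = truth_table([1, 1], -0.5)

# Coarse grid search over (w1, w2, bias): nothing on the grid reproduces XOR.
XOR_TARGET = [0, 1, 1, 0]
grid = [i / 2 for i in range(-8, 9)]  # -4.0, -3.5, ..., 4.0
xor_solutions = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if truth_table([w1, w2], b) == XOR_TARGET
]

print(AND_TABLE)           # [0, 0, 0, 1]
print(OR_TABLE)            # [0, 1, 1, 1]
print(len(xor_solutions))  # 0
```

Geometrically: for AND and OR you can draw one straight line that puts all the "yes" corners of the unit square on one side. For XOR the "yes" corners sit on opposite diagonals, so no single line does the job.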
Now, take a feed-forward network with any number of layers and any number of nodes in each of the layers. You could consider each node in the network that is not in the input layer, together with all nodes from the previous layer (= all nodes that link to it), as a perceptron. So for example if we have 6 nodes in the input layer and 3 nodes in the next layer, then we have 3 perceptrons that each process information from the 6 input nodes. Each perceptron can represent a different linearly separable function (and that's most probably the case, because otherwise you could do with fewer perceptrons). So we are able to cut up our input space into up to 8 sections (why?) and determine in which one our input vector resides.
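Here is a rough sketch of the 6-inputs-to-3-nodes example. The weights and biases are made up by me purely for illustration; the point is only that three yes/no answers together name one of at most 2**3 = 8 sections of the input space.

```python
def step(total):
    return 1 if total > 0 else 0

def layer(weight_rows, biases, inputs):
    """Apply one perceptron per weight row; returns a tuple of 0/1 outputs."""
    return tuple(
        step(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weight_rows, biases)
    )

# Hypothetical weights: 3 perceptrons, each looking at the same 6 inputs.
WEIGHTS = [
    [1, 1, 1, -1, -1, -1],  # more "on" in the first half than in the second?
    [1, -1, 1, -1, 1, -1],  # more "on" at positions 0, 2, 4 than at 1, 3, 5?
    [1, 1, 1, 1, 1, 1],     # at least three inputs on? (together with bias -2.5)
]
BIASES = [0, 0, -2.5]

def region(inputs):
    """The three yes/no answers together identify the section the input lies in."""
    return layer(WEIGHTS, BIASES, inputs)

print(region([1, 1, 1, 0, 0, 0]))  # (1, 1, 1)
print(region([0, 0, 0, 1, 1, 1]))  # (0, 0, 1)
print(region([0, 0, 0, 0, 0, 0]))  # (0, 0, 0)
```

Not every one of the 8 combinations has to be reachable for a given set of weights, which is why 8 is an upper bound.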
We repeat this pattern with our 3 intermediate nodes and the layer after that. The output nodes of our old perceptrons have now become variables in a new input space. So the new perceptrons are functions over a space of partition locations in the original input space. This is rather abstract, but fortunately you don't really need to grasp this because some smart people already figured out the consequence of it quite long ago, also for networks with even more layers:
While perceptrons can only represent linearly separable functions, feed-forward networks with a hidden layer can approximate any continuous function (on a bounded input range) as closely as you like, given enough hidden nodes, and networks with two hidden layers can even represent any discontinuous function. (EDIT: the part about networks with two hidden layers and discontinuous functions turned out to be incorrect. See my later post below for details.) However, maybe this wasn't exactly the question the OP meant to ask.
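Before moving on to that, a quick illustration of what a hidden layer buys you: XOR, which a single perceptron cannot represent, becomes easy once two hidden perceptrons have transformed the input into a new space. The weights are hand-set by me (not learned), and this is just one of many possible constructions.

```python
def step(total):
    return 1 if total > 0 else 0

def neuron(weights, bias, inputs):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor_net(x1, x2):
    # Hidden layer: two perceptrons over the original inputs.
    h_or = neuron([1, 1], -0.5, [x1, x2])   # x1 OR x2
    h_and = neuron([1, 1], -1.5, [x1, x2])  # x1 AND x2
    # The output perceptron works in the new (h_or, h_and) space: "OR but not AND".
    return neuron([1, -1], -0.5, [h_or, h_and])

XOR_TABLE = [xor_net(a, b) for a in (0, 1) for b in (0, 1)]
print(XOR_TABLE)  # [0, 1, 1, 0]
```

In the (h_or, h_and) space the four inputs land on only three points, and those three are linearly separable, so the output node is again an ordinary perceptron.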
gabrielkfl wrote:I've even seen some examples that seem to work, I just can't understand HOW the hell they work. I mean, they're freaky - you start changing the weight values and suddently your network is able to tell apart a '4' and a '5', what the f*ck?
I think this boils down to the question "how is it possible that we can train a network to represent some function when starting out with a completely random weight pattern?". The answer is this: when we try an input on a network, we can see how far the output is from what it should have been. We adjust the weights slightly in a direction that would yield a better output. When we repeat this very often with many different examples, we will eventually have nudged the network enough to more or less do what we want.
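As a minimal sketch of that nudging, here is a single perceptron that starts from random weights and learns AND. I'm assuming the classic perceptron learning rule here; the function names, learning rate and seed are my own choices, not taken from any code in this thread.

```python
import random

def predict(weights, bias, inputs):
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train(examples, learning_rate=0.1, max_epochs=1000, seed=0):
    """Perceptron learning rule, starting from random weights."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1, 1) for _ in range(2)]
    bias = rng.uniform(-1, 1)
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)  # -1, 0 or +1
            if error != 0:
                mistakes += 1
                # Nudge each weight slightly in the direction that reduces the error.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if mistakes == 0:  # every example is classified correctly: done
            break
    return weights, bias

AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(AND_EXAMPLES)
learned = [predict(weights, bias, x) for x, _ in AND_EXAMPLES]
print(learned)  # [0, 0, 0, 1]
```

Nothing magical happens at any single step. Each wrong answer moves the line a little, and because AND is linearly separable the line eventually ends up in a place where all four examples come out right.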
Again, the idea of a perceptron can help to understand this procedure. The weight-adjustment procedure of backpropagation really works on a per-perceptron basis, from the output side of the network towards the input side. In a single perceptron, it is easy to tell how the weights can be adjusted to better reflect the function that you target. From the old weights in your perceptron, you can see how "wrong" the outputs from the perceptrons of the previous layer were, and you adjust those perceptrons in proportion to their share of the "wrongness".
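A rough sketch of that per-perceptron view, for a tiny 2-2-1 network with sigmoid nodes trained on XOR. The network size, learning rate, seed and all names are assumptions of mine for illustration. Depending on the random start, a network this small may get stuck before it represents XOR cleanly, so the only thing the sketch is meant to show is the error going down.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def forward(net, inputs):
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
              for ws, b in zip(net["w_hidden"], net["b_hidden"])]
    output = sigmoid(sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"])
    return hidden, output

def mean_squared_error(net):
    return sum((t - forward(net, x)[1]) ** 2 for x, t in XOR) / len(XOR)

def train_step(net, inputs, target, rate):
    hidden, output = forward(net, inputs)
    # Output perceptron: how wrong was it, scaled by the slope of the sigmoid?
    delta_out = (target - output) * output * (1 - output)
    # Hidden perceptrons: blame is passed back through the OLD output weights.
    delta_hidden = [delta_out * w * h * (1 - h)
                    for w, h in zip(net["w_out"], hidden)]
    # Each perceptron then adjusts its own incoming weights.
    net["w_out"] = [w + rate * delta_out * h for w, h in zip(net["w_out"], hidden)]
    net["b_out"] += rate * delta_out
    for j, d in enumerate(delta_hidden):
        net["w_hidden"][j] = [w + rate * d * x
                              for w, x in zip(net["w_hidden"][j], inputs)]
        net["b_hidden"][j] += rate * d

rng = random.Random(1)
net = {
    "w_hidden": [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)],
    "b_hidden": [rng.uniform(-1, 1) for _ in range(2)],
    "w_out": [rng.uniform(-1, 1) for _ in range(2)],
    "b_out": rng.uniform(-1, 1),
}

error_before = mean_squared_error(net)
for _ in range(5000):
    for inputs, target in XOR:
        train_step(net, inputs, target, rate=0.5)
error_after = mean_squared_error(net)
print(error_before, error_after)
```

Note that `delta_hidden` is computed before `w_out` is updated. That is the "old weights" part: the blame for the hidden nodes is worked out with the weights that actually produced the wrong output.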
You should look again at the code that you already know, and now see it in the light of perceptrons. I think it should help!