- 16th April 2018
- Posted by: Manolis
The aim of this blog is simply to get you acquainted with the theory of Neural Networks.
Neural Networks started off as an attempt to replicate the working of the human brain in order to make things more intelligent. Even something like this is not necessarily complex. At its heart, it is just multiplication and differentiation.
I am really bad at remembering maths. But once I tried learning the concepts, I realised I am pretty decent at visualising maths. 3Blue1Brown has some amazing visualisations of mathematical concepts.
There are many ways to understand a concept; choose whichever suits you and be persistent about the learning part. In the end, knowing the maths is a useful tool when it comes to optimisation or experimentation.
A Neural Network is, usually, a supervised method of learning. This means there is a training set. Ideally this set contains examples with their ground-truth values (tags, classes etc). In the case of sentiment analysis, the training set would be a list of sentences and their respective correct sentiments.
Note: Unlabelled datasets can also be used to train neural networks, but we are considering the basic case here.
Let’s refer to the texts as X and their labels/tags as Y. There is some function that defines the relationship between X and Y. In the case of sentiment analysis, that would be which features (words / phrases / sentence structure etc) lead to a sentence being negative or positive. Earlier, people used to find these features manually; this was called feature engineering. Neural Networks automate this process.
An Artificial Neural Network is made up of 3 components:
- Input Layer
- Hidden (computation) Layers
- Output Layer
Furthermore, the learning happens in two steps: forward propagation and back propagation.
In simple words:
- forward propagation is making a guess about the answer
- back propagation is minimising the error between the actual answer and the guessed answer
Forward Propagation
Randomly initialise the weights.
The data at the input layer is multiplied by the weights to form the hidden layer:
- h1 = (x1 * w1) + (x2 * w1)
- h2 = (x1 * w2) + (x2 * w2)
- h3 = (x1 * w3) + (x2 * w3)
The output of the hidden layer is passed through a non-linear function, also known as an activation function, to form the guessed output:
- y_ = fn( h1 , h2, h3 )
Back Propagation
- total_error is calculated by passing the expected value ‘y’ (value in the training set) and the observed value ‘y_’ (value attained from forward propagation) through a cost function, which measures the difference between them.
- The partial derivative of the error is calculated w.r.t. each weight (these partial derivatives measure the contribution of each weight to total_error)
- The derivatives are then multiplied by a small number called the learning rate ( η )
- The result is then subtracted from the respective weight
The result of backprop is the following updated weights:
- w1 = w1 – (η * ∂(err) / ∂(w1))
- w2 = w2 – (η * ∂(err) / ∂(w2))
- w3 = w3 – (η * ∂(err) / ∂(w3))
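The forward and backward steps above can be sketched in a few lines of Python. This is only a rough illustration under some assumptions of mine: it follows the simplified notation used here (each hidden unit has a single weight shared by both inputs), takes fn to be a sigmoid of the sum of the hidden values, and uses a quadratic cost. The input values, target and starting weights are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w):
    """Forward propagation: make a guess y_ from the inputs and weights."""
    h = [(x1 * wj) + (x2 * wj) for wj in w]   # h1, h2, h3 as in the text
    y_ = sigmoid(sum(h))                      # assumed form of fn(h1, h2, h3)
    return h, y_

def train_step(x1, x2, y, w, lr):
    """One forward pass followed by one back-propagation update."""
    _, y_ = forward(x1, x2, w)
    err = 0.5 * (y - y_) ** 2                 # assumed quadratic cost
    # Chain rule: d(err)/d(wj) = d(err)/d(y_) * d(y_)/d(sum h) * d(hj)/d(wj)
    d_err_d_y = y_ - y
    d_y_d_sum = y_ * (1.0 - y_)
    grads = [d_err_d_y * d_y_d_sum * (x1 + x2) for _ in w]
    new_w = [wj - lr * g for wj, g in zip(w, grads)]   # w = w - (lr * gradient)
    return new_w, err

w = [0.1, -0.2, 0.05]   # stand-in for randomly initialised weights
errors = []
for _ in range(200):
    w, err = train_step(1.0, 0.5, 1.0, w, lr=0.5)
    errors.append(err)

print(f"first error: {errors[0]:.4f}, last error: {errors[-1]:.4f}")
```

Each pass makes a guess, measures the error, and nudges every weight against its own partial derivative, so the error shrinks over the iterations.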
Basically, we initialise random weights, assume they will produce accurate answers, and then correct them step by step. It sounds trivial because it is, but it seems to work pretty well.
For those familiar with the Taylor series: the weight update can be seen as working with a truncated Taylor expansion of the error. Instead of the infinite series, we optimise using only the first-order term.
Biases are weights added to the hidden layers. They too are randomly initialised and updated in a similar manner to the hidden layer weights. While the role of the hidden layer is to map the shape of the underlying function in the data, the role of the bias is to laterally shift the learned function so that it overlaps with the original function.
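To see the "lateral shift" concretely, here is a tiny sketch with a single sigmoid unit. The unit, the weight and the bias values are hypothetical choices of mine: the weight sets how steep the curve is, while the bias moves the point where the output crosses 0.5.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """A single unit: weight w sets the shape (steepness), bias b shifts it."""
    return sigmoid(w * x + b)

w = 2.0
for b in (0.0, -4.0):
    midpoint = (0.0 - b) / w   # the x at which the output crosses 0.5
    print(f"b={b:+.1f}: output is 0.5 at x={midpoint:.1f}, "
          f"neuron(1.0)={neuron(1.0, w, b):.3f}")
```

With the same weight, changing only the bias moves the whole curve along the x axis without changing its shape.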
Partial Derivatives are calculated so that we know how much each weight contributed to the error.
The need for derivatives is obvious if you think about it.
For example, think of a neural network trying to find the optimal speed (velocity) of a self driving car. If the car finds out that it is either faster or slower than the desired speed, the neural network will change its speed by either accelerating or decelerating the car. What is accelerating/decelerating? Derivatives of speed.
Let’s explain the need of ‘Partial Derivatives’ with an example as well:
Let’s say that a few kids were asked to throw dart at a dart-board, aiming at the center. The initial results were:
Now if we found the total loss and simply subtracted it from all the weights, we would generalize the mistakes made by each student. So let’s say one kid aimed too low but we ask all the kids to aim higher; then it results in:
The error of a few students might decrease but the overall error still increases.
By finding partial derivatives we find the error contributed by each weight individually. Correcting each weight individually gives the following results:
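The dart-board story can be played out with numbers. The offsets below are hypothetical (they are not the values from the original experiment), and the "error" is assumed to be the sum of squared distances from the centre.

```python
def total_error(aims):
    """Sum of squared distances from the centre (0.0 is a bullseye)."""
    return sum(a ** 2 for a in aims)

# Hypothetical vertical offsets of three kids' darts; negative means too low.
aims = [-2.0, 0.5, 1.5]

# Same correction for everyone: one kid was 2.0 too low, so all aim 2.0 higher.
uniform = [a + 2.0 for a in aims]

# Individual correction: the partial derivative of total_error w.r.t. each
# aim is 2 * a, so each kid moves against their own error.
learning_rate = 0.25
individual = [a - learning_rate * (2 * a) for a in aims]

print(total_error(aims), total_error(uniform), total_error(individual))
# prints: 6.5 18.5 1.625
```

The shared correction fixes the first kid but makes the total error worse, while the per-kid correction brings everyone closer to the centre.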
While a neural network automates feature selection, there are still a few parameters that we have to set manually.
The Learning Rate is a very crucial hyper-parameter. If the learning rate is too small then, even after training the neural network for a long time, it will still be far from the optimal result. Results would look something like:
If, instead, the learning rate is too high, the learner jumps to conclusions too soon, producing the following results:
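A quick way to get a feel for this is to run gradient descent on a toy error function. The function err(w) = w ** 2, the starting point and the three learning rates below are arbitrary choices of mine for illustration.

```python
def descend(learning_rate, steps=50, w=1.0):
    """Gradient descent on the toy error err(w) = w ** 2 (minimum at w = 0)."""
    for _ in range(steps):
        gradient = 2 * w                    # d(err)/d(w)
        w = w - learning_rate * gradient
    return w

too_small = descend(0.001)   # barely moves away from the starting point
reasonable = descend(0.1)    # ends up close to the minimum
too_large = descend(1.1)     # overshoots further on every step and diverges

print(f"too small: {too_small:.3f}, reasonable: {reasonable:.6f}, too large: {too_large:.1f}")
```

With the same number of steps, the tiny rate is still near where it started, the moderate rate has practically reached the minimum, and the large rate has bounced further and further away from it.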
The Activation Function is one of the most powerful tools in the arsenal, and is responsible for much of the power Neural Networks are advertised to have. Loosely, it decides which neurons will be activated, in other words what information is passed on to further layers.
Without activation functions, deep nets lose the bulk of their representation learning power.
The non-linearity of these functions is responsible for the increased degrees of freedom of the learners, enabling them to generalize problems of high dimensionality in lower dimensions.
Below are a few examples of popular Activation Functions:
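As a rough sketch, here are three commonly used activation functions written in plain Python. The selection (sigmoid, tanh and ReLU) is my own and may differ from the ones originally pictured.

```python
import math

def sigmoid(z):
    """Squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes any input into the range (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Passes positive inputs through unchanged and blocks negative ones."""
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  tanh={tanh(z):.3f}  relu={relu(z):.1f}")
```

All three are non-linear, which is what lets stacked layers represent more than a single linear mapping.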
The Cost Function is at the centre of a Neural Network. It is used to calculate the loss given the real and observed results. Our aim throughout is to minimise this loss. So the Cost Function effectively drives the learning of the neural network towards its goal.
A cost function is a measure of “how good” a neural network did with respect to its given training sample and the expected output. It may also depend on variables such as weights and biases.
A cost function returns a single value, not a vector, because it rates how well the neural network did as a whole.
Some of the most famous cost functions are:
- Quadratic Cost (Root Mean Square)
- Cross Entropy
- Exponential (AdaBoost)
- Kullback–Leibler divergence or Information Gain
Root Mean Square is the simplest and most used of them all. It is simply defined as:
Loss = √( mean( (expected_output - real_output) ** 2 ) )
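Here is a minimal sketch of two of the cost functions from the list, assuming the usual definitions: Root Mean Square as the square root of the mean squared difference, and the binary form of Cross Entropy. The example values are made up.

```python
import math

def rms_loss(expected, real):
    """Root of the mean squared difference between expected and real outputs."""
    squared = [(e - r) ** 2 for e, r in zip(expected, real)]
    return math.sqrt(sum(squared) / len(squared))

def cross_entropy(expected, real):
    """Binary cross entropy; expected values are 0 or 1, real are in (0, 1)."""
    terms = [e * math.log(r) + (1 - e) * math.log(1 - r)
             for e, r in zip(expected, real)]
    return -sum(terms) / len(terms)

expected = [1.0, 0.0, 1.0]
good_guess = [0.9, 0.1, 0.8]
bad_guess = [0.2, 0.7, 0.4]

print(f"RMS:           good={rms_loss(expected, good_guess):.3f}  bad={rms_loss(expected, bad_guess):.3f}")
print(f"Cross entropy: good={cross_entropy(expected, good_guess):.3f}  bad={cross_entropy(expected, bad_guess):.3f}")
```

Both return a single number for the whole batch, and both give a smaller value to the guess that is closer to the expected output.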
The cost function in a NN should satisfy two conditions:
- The cost function must be able to be written as an average
- The cost function must not be dependent on any activation values of a neural network except the output values
Deep learning is a class of machine learning algorithms that learn deeper (more abstract) insights from data.
In more formal terms, deep learning algorithms:
- Use a cascade (a pipeline-like flow, with outputs successively passed on) of many layers of nonlinear processing units for feature extraction and transformation.
- Are based on learning features (representations of knowledge about the data) in an unsupervised manner. Higher level features (found in the later processing layers) are derived from lower level features (found in the initial processing layers).
- Learn multiple levels of representations that correspond to different levels of abstraction; the levels form a hierarchy of concepts.
Let’s consider a single layer Neural Network:
Here, whatever the first layer (green neurons) learns, it simply passes on to the output.
In the case of a two layer Neural Network, whatever the green hidden layer learns is passed on to the blue hidden layer, where it learns further (about the learning in the green layer).
Hence, the more hidden layers there are, the more learning happens on top of previously learned concepts.
This is not to be confused with Wide Neural Networks.
In that case, the presence of more neurons in one layer does not result in learning an insight in more depth. Instead, it results in learning a larger number of concepts.
Learning English grammar requires understanding a huge number of concepts. In this case a single layer Wide Neural Network works much better than a Deep Neural Network that is significantly narrower.
In the case of learning the Fourier Transform, the learner (Neural Network) needs to be a Deep one, because there aren’t many concepts to be learned but each of these concepts is complex enough to require deep learning.
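To make the difference in shape concrete, a fully connected network can be described by a list of layer sizes. The sizes below are arbitrary examples of mine, and the helper simply counts weights and biases.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with these layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

wide = [2, 64, 1]          # one hidden layer with many neurons
deep = [2, 8, 8, 8, 1]     # several hidden layers with few neurons each

print("wide:", len(wide) - 2, "hidden layer(s),", count_parameters(wide), "parameters")
print("deep:", len(deep) - 2, "hidden layer(s),", count_parameters(deep), "parameters")
```

The wide network spends its parameters on many units working side by side, while the deep one spends them on layers that build on each other.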
Balance is Key
It’s very tempting to use deep and wide neural networks for every task. That might be a very bad idea because:
- Both require significantly more data to train on to reach a minimum desirable accuracy
- Both are significantly more expensive to train and run
- A too Deep NN will try to break a fundamental concept down further, but at this point it will be making wrong assumptions about the concept and trying to find pseudo patterns that do not exist
- A too Wide NN will try to find more features (individual measurable properties of a pattern being observed) than actually exist. So, similar to the last point, it will start making wrong assumptions about the data.
Curse of Dimensionality
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings.
Something like the grammar of the English language or the price of stocks has a humongous number of features affecting it. With Machine Learning, we have to represent these features with an array/matrix of finite and comparatively much smaller length (than the number of features that actually exist). To do this, learners generalize the concepts. This gives rise to two problems:
- Bias arises due to wrong assumptions made by a learner. High bias can cause an algorithm to miss the relevant relations between features and target outputs. This phenomenon is called underfitting.
- Variance arises from sensitivity to small fluctuations in the training set, due to insufficient learning about a feature. High variance results in overfitting: learning errors (noise) as relevant information.
Early in training the Bias is large because the network output is far from the desired output. The Variance is very small as the data has had little influence yet.
Late in training the Bias is small because the network has learned the underlying function. However if trained too long, the network will also learn the noise specific to that dataset. This results in high variance in results when tested on different datasets as noise varies from dataset to dataset.
Algorithms with high bias typically produce simpler models that don’t tend to overfit, but may underfit their training data, failing to capture important patterns or features’ properties.
Models with low bias and high variance are usually more complex in terms of their structure, enabling them to represent the training set more accurately. In the process, however, they may also represent a large noise component in the training set, making their predictions less accurate — despite their added complexity.
Hence, it is typically impossible to have both low bias and low variance at the same time; reducing one tends to increase the other.
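The trade-off can be seen on a toy problem. Everything here is a made-up illustration: the underlying function is assumed to be y = x, the noise is a simple alternating pattern that differs between the training and test sets, and the three "models" are stand-ins for an underfitting learner, an overfitting learner and something in between.

```python
def mse(model, xs, ys):
    """Mean squared error of a model's predictions on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = list(range(10))
# Hypothetical data: the underlying function is y = x, plus noise that
# differs between the training set and the test set.
noise = [1.0 if i % 2 == 0 else -1.0 for i in xs]
train_ys = [x + n for x, n in zip(xs, noise)]
test_ys = [x - n for x, n in zip(xs, noise)]

# High bias: ignores x and always predicts the training mean.
mean_y = sum(train_ys) / len(train_ys)
def constant(x):
    return mean_y

# High variance: memorises every training point, noise included.
table = dict(zip(xs, train_ys))
def memoriser(x):
    return table[x]

# In between: a least-squares straight line fitted to the training set.
mean_x = sum(xs) / len(xs)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, train_ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
def line(x):
    return slope * x + intercept

for name, model in [("constant", constant), ("memoriser", memoriser), ("line", line)]:
    print(f"{name:9s} train={mse(model, xs, train_ys):.2f} test={mse(model, xs, test_ys):.2f}")
```

The constant model is wrong on both sets (underfitting), the memoriser is perfect on the training set but noticeably worse on the test set (overfitting), and the straight line does reasonably on both.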
Today, with an abundance of data and tools that make it easy to create complex ML models, over-fitting takes centre stage. This is because bias effectively occurs when the learner is not provided with enough information. But more examples mean more variation, as both the number of patterns and the variation within those patterns increase.