September 18, 2015

DON'T PANIC

Helpful resources

Perceptron

Frank Rosenblatt & Mark 1 Perceptron

Single layer perceptron

Take input \(x_{p\times 1}\) and calculate the weighted input \(w\cdot x + b\), where

  • \(w_{p\times 1}\) is a vector of weights applied to the input, and
  • \(b\) is a scalar bias term that offsets the weighted sum

Now, classify x according to

\[output = \begin{cases} 0 & \text{if } w\cdot x + b \leq 0 \\ 1 & \text{if } w\cdot x + b > 0 \end{cases}\]
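
A minimal sketch of this decision rule in NumPy; the example weights and bias are arbitrary choices for illustration:

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Classify x as 1 if the weighted input w.x + b is positive, else 0."""
    return int(np.dot(w, x) + b > 0)

# Illustrative example: two inputs, hand-picked weights and bias.
w = np.array([0.7, -0.4])
b = -0.1
print(perceptron_predict(np.array([1.0, 0.5]), w, b))  # 1, since 0.7 - 0.2 - 0.1 = 0.4 > 0
```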

Binary classification

End of the perceptron…

Training may result in wildly different classification boundaries.

Only applicable to linearly separable data.

Minsky and Papert (1969) highlighted the inability to learn a simple XOR function.

Dealt a critical blow to the perceptron, research largely vanished.

1970s: *crickets*

… or is it?

Two Ideas:

  1. Use a more flexible function than the perceptron, and
  2. Train multiple neurons within a layer or include additional layers.

This produces the multilayer perceptron (MLP).

Performs non-linear classification (i.e., does not require linear separability) and regression.

1980s: Resurgence in neural network research as a result.

Sigmoid \(\;\sigma(z) = (1+e^{-z})^{-1}\)

Artificial neural network

The data \(x\) are fed to the network through the input layer.

Predictions are made at the output layer.

So what happens in between?

Some notation

For layers \(l = 1, 2, \ldots, L\), define:

  • \(w^l_{jk}\) is the weight from the \(k\)th neuron in layer \(l-1\) to the \(j\)th neuron in layer \(l\)
  • \(b^{l}_j\) is the bias of the \(j\)th neuron in layer \(l\)
  • \(a^l_j\) is the activation from the \(j\)th neuron in layer \(l\) defined by \[ a^l_j = \sigma\left(\sum_k w^l_{jk}a^{l-1}_k + b^l_j\right) \] where \(\sigma(\cdot)\) denotes the sigmoid activation function.
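
As a rough illustration of the notation, one layer's activations can be computed as below, where the matrix `W` collects the weights \(w^l_{jk}\); the layer sizes and random values are assumptions for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_activation(W_l, b_l, a_prev):
    """Compute a^l = sigmoid(W^l a^{l-1} + b^l) for one layer."""
    z_l = W_l @ a_prev + b_l   # weighted input z^l
    return sigmoid(z_l)        # activation a^l

# Illustrative 3-neuron layer fed by a 2-neuron layer.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))    # W[j, k] plays the role of w^l_{jk}
b = rng.normal(size=3)
a_prev = np.array([0.2, 0.8])
print(layer_activation(W, b, a_prev))
```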

How to obtain optimal weights and bias terms?

Loss function \(C\)

Idea: Define a loss function and minimize it with respect to all weights and biases using, say, gradient descent.

For a single training example \((x, y)\),

  • Mean squared error: \(C(a^L) = \frac{1}{2}\sum_j (y_j - a^L_j)^2\)

  • Cross-entropy: \[C(a^L) = -\sum_j\left[y_j\ln a^L_j + (1-y_j)\ln(1-a^L_j)\right]\]

which are functions of the predicted outputs (i.e., activations of the final layer), which are, in turn, functions of the network's weights and biases.
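
For concreteness, both losses for a single training example, where `y` and `a` are the target and output-layer activation vectors (the example values are arbitrary):

```python
import numpy as np

def mse(y, a):
    """Mean squared error for one example: 0.5 * sum_j (y_j - a_j)^2."""
    return 0.5 * np.sum((y - a) ** 2)

def cross_entropy(y, a):
    """Cross-entropy for one example: -sum_j [y_j ln a_j + (1 - y_j) ln(1 - a_j)]."""
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

y = np.array([0.0, 1.0])
a = np.array([0.2, 0.9])   # hypothetical output-layer activations
print(mse(y, a), cross_entropy(y, a))
```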

Backpropagation algorithm

Fast algorithm for calculating gradients by working backward from outputs to early layers (Rumelhart, Hinton, and Williams 1988).

Define the error at neuron \(j\) in layer \(l\) by \(\delta^l_j = \frac{\partial C}{\partial z^l_j}\).

Feedforward \(\quad a^l_j = \sigma(z^l_j)\;\text{where}\;z^l_j = \sum_k w^l_{jk}a^{l-1}_k + b^l_j \)

Error at output layer \(\quad\delta^L_j = \frac{\partial C}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j}\frac{\partial a^L_j}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j}\sigma'(z^L_j)\)

Error at earlier layers \(\quad\delta^{l-1}_k = \left(\sum_jw^l_{jk}\delta^l_j\right)\sigma'(z^{l-1}_k)\)

Gradients \(\quad\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\delta^l_j\quad\text{and}\quad\frac{\partial C}{\partial b^l_j} = \delta^l_j\)
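
Putting the four pieces together, a compact NumPy sketch for a fully connected network with sigmoid activations; the quadratic cost (so that \(\partial C/\partial a^L = a^L - y\)) and the weight shapes are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(weights, biases, x, y):
    """Return gradients dC/dW^l and dC/db^l for one example (quadratic cost)."""
    # Feedforward: store the weighted inputs z^l and activations a^l.
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Output-layer error: delta^L = (a^L - y) * sigma'(z^L).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W = [np.outer(delta, activations[-2])]
    grads_b = [delta]
    # Propagate backward: delta^{l-1} = (W^l)^T delta^l * sigma'(z^{l-1}).
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grads_W.insert(0, np.outer(delta, activations[-l - 1]))
        grads_b.insert(0, delta)
    return grads_W, grads_b
```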

Gradient descent

Iterating through this process to convergence, we continually update the weights and bias terms according to

\[ w^l_{jk} \leftarrow w^l_{jk} - \eta\frac{\partial C}{\partial w^l_{jk}}, \quad b^l_j \leftarrow b^l_j - \eta\frac{\partial C}{\partial b^l_j} \]

where \(\eta\) is the learning rate chosen by the user.

Unfortunately, successful convergence is problematic due to the high-dimensional loss surface with a multitude of local minima.

Idea: Operate on random "mini-batches" of the training data instead; called stochastic gradient descent (SGD).
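
A sketch of the resulting mini-batch SGD loop; it assumes a `backprop` routine like the one sketched earlier that returns per-example gradients, and the learning rate, batch size, and epoch count are arbitrary choices:

```python
import random
import numpy as np

def sgd(weights, biases, data, backprop, eta=0.1, batch_size=32, epochs=10):
    """Update each w and b by -eta times the mini-batch-averaged gradient."""
    for _ in range(epochs):
        random.shuffle(data)  # new random mini-batches each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_W = [np.zeros_like(W) for W in weights]
            grad_b = [np.zeros_like(b) for b in biases]
            for x, y in batch:
                gW, gb = backprop(weights, biases, x, y)
                grad_W = [acc + g for acc, g in zip(grad_W, gW)]
                grad_b = [acc + g for acc, g in zip(grad_b, gb)]
            # Gradient-descent step: w <- w - eta * dC/dw, averaged over the batch.
            for l in range(len(weights)):
                weights[l] -= (eta / len(batch)) * grad_W[l]
                biases[l] -= (eta / len(batch)) * grad_b[l]
    return weights, biases
```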

Hello world!

… Well, sorta. With these modifications, neural networks are competitive classifiers but still unable to learn a wide variety of patterns.

Idea: Pass inputs and weighted inputs through numerous hidden layers.

May better learn intricate patterns that just one or two layers are incapable of capturing.

Problem: Adding more layers, even just more than two, quickly leads to the unstable gradient problem.

Unstable gradient problem

Gradient in early layers is the product of terms from all later layers.

These terms rarely "play nicely" with one another, causing the gradients to either vanish (approach 0) or explode (to \(\pm\infty\)).

Thus, these early layers learn very slowly, if at all.
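
To see why, treat each layer as a single neuron for simplicity: backpropagation then gives, roughly,

\[ \delta^l \approx \sigma'(z^l)\left(\prod_{i=l+1}^{L} w^i\,\sigma'(z^i)\right)\frac{\partial C}{\partial a^L}. \]

Since \(|\sigma'(z)| \leq 1/4\), modest weights make each factor in the product less than one and the gradient shrinks geometrically with depth, while large weights can instead make the product blow up.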

Since the NN's weights and biases are randomly initialized, they are likely far from optimal and, due to the unstable gradient problem, never even get close.

Result: Networks with many hidden layers (or those with many neurons) perform no better than those with only 1-2 hidden layers.

Early 1990s: Once hailed as the panacea of artificial intelligence, neural network research is again largely abandoned in favor of other machine learning methods, most notably the support vector machine (SVM).

Late 1990s: *mostly crickets*

Early 2000s: *still mostly crickets*

2006: Hinton, Osindero, and Teh (2006) introduce a deep belief network (DBN) that circumvents the unstable gradient problem via unsupervised pre-training of each layer's weights and biases. Bengio et al. (2007) generalize this method to other techniques.

Incorporation of many hidden layers \(\rightarrow\) "deeper" network architectures are possible, hence…

Deep learning

Deep architecture allows the network to learn patterns from the raw data, rather than relying on "feature engineering" as is necessary in shallow architectures.

Further developments in DL

  • Rectified linear units (ReLU) \(f(z) = \max(z, 0)\) learn quickly and, unlike the sigmoid, do not require unsupervised pre-training (Glorot, Bordes, and Bengio 2011).

  • Dropout (Srivastava et al. 2014) prevents overfitting by randomly "dropping" neurons (and their connections) during training, which effectively averages over an ensemble of thinned networks (a minimal sketch of both ideas follows this list).
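
The "inverted" dropout scaling used here is one common convention, and the keep probability is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """Rectified linear unit: max(z, 0), applied elementwise."""
    return np.maximum(z, 0.0)

def dropout(a, keep_prob=0.5, training=True):
    """Randomly zero activations during training ("inverted" dropout:
    survivors are rescaled by 1/keep_prob so nothing changes at test time)."""
    if not training:
        return a
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob

a = np.array([-1.0, 0.5, 2.0, -0.2])
print(relu(a))           # [0.  0.5  2.  0. ]
print(dropout(relu(a)))  # roughly half the activations zeroed, the rest doubled
```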

Forms of deep learning

Supervised learning

  • Convolutional neural nets in computer vision, and
  • Recurrent neural nets & long short-term memory in speech recognition and natural language processing
  • Combined, these networks can learn image captioning.

Unsupervised learning (typically for pre-training)

  • Deep autoencoders,
  • Deep belief networks,
  • Restricted Boltzmann machines

Convolutional neural nets (CNNs)

Designed for data that are highly correlated in space, e.g., pixel intensities in an RGB image.

Introduces two alternating types of layers:

  • Convolutional layer detects local features from previous layer
  • Pooling layer aggregates similar features together

Current state-of-the-art in object recognition and detection tasks in image processing.
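
To make the two layer types concrete, here is a toy single-channel sketch: a 2-D convolution (implemented, as in most DL libraries, as cross-correlation) that detects a local feature, followed by 2x2 max pooling that aggregates nearby responses. The image, filter, and "valid" padding are illustrative assumptions:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a small filter over the image to detect a local feature ('valid' padding)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Aggregate nearby responses by keeping the maximum in each size x size block."""
    H, W = fmap.shape
    return np.array([[fmap[i:i + size, j:j + size].max()
                      for j in range(0, W - size + 1, size)]
                     for i in range(0, H - size + 1, size)])

image = np.arange(36, dtype=float).reshape(6, 6)     # toy 6x6 "image"
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])   # crude vertical-edge filter
features = conv2d(image, edge_filter)                # 5x5 feature map
print(max_pool2d(features))                          # 2x2-pooled summary
```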

Convolutional neural nets (CNNs)

2012 ImageNet (Krizhevsky, Sutskever, and Hinton 2012)

Task: Perform object recognition on dataset of ~1,000,000 images with ~1,000 different classes.

Halved the error rate of its closest competitors.

2014 Galaxy Zoo (Dieleman, Willett, and Dambre 2015)

Key insight? "Exploiting invariances in the data using data augmentation."

"Using a feature learning approach does not excuse you from having to get to know the data;" […] feature engineering "just happens at a higher level of abstraction." – Dieleman in his Kaggle interview

Recurrent neural nets (RNNs)

Designed for sequential inputs (such as those over time).

Introduces a state vector into hidden layer neurons to retain a history of past elements; essentially, a form of short-term memory.

An RNN is simply a very deep NN "unfolded in time," sharing the same weights and biases across layers (time steps).
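
A bare-bones sketch of one recurrent step and its unfolding over a toy sequence; the tanh nonlinearity and the shapes are assumptions, and note that the same `W_xh`, `W_hh`, and `b` are reused at every time step (the weight sharing described above):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One time step: the new state mixes the current input with the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

def rnn_unfold(xs, W_xh, W_hh, b):
    """Unfold the recurrence over a whole sequence, reusing the same weights."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b)
        states.append(h)
    return states

rng = np.random.default_rng(0)
W_xh, W_hh, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
xs = [rng.normal(size=3) for _ in range(5)]   # toy length-5 sequence
print(rnn_unfold(xs, W_xh, W_hh, b)[-1])      # final hidden state
```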

Huge successes in speech recognition and natural language processing.

Still, learning long-term dependencies is hard…

Long Short-Term Memory (LSTM)

Originally developed by Hochreiter and Schmidhuber (1997), LSTM networks also introduce a memory cell that stores states for longer periods of time before clearing.

Able to "recall" past themes and events.

State-of-the-art in speech recognition systems, from raw audio waveforms to word-for-word transcription.
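
A sketch of a single LSTM step in the same NumPy style; the gate layout and concatenated-input convention follow common practice rather than the original 1997 formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wo, Wc, bf, bi, bo, bc):
    """One LSTM step: gates decide what the memory cell forgets, stores, and exposes."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z + bf)           # forget gate: how much old memory to keep
    i = sigmoid(Wi @ z + bi)           # input gate: how much new content to write
    o = sigmoid(Wo @ z + bo)           # output gate: how much memory to expose
    c_tilde = np.tanh(Wc @ z + bc)     # candidate memory content
    c = f * c_prev + i * c_tilde       # memory cell carries long-range state
    h = o * np.tanh(c)                 # hidden state passed on to the next step
    return h, c

rng = np.random.default_rng(0)
n_h, n_x = 4, 3
Wf, Wi, Wo, Wc = (rng.normal(size=(n_h, n_h + n_x)) for _ in range(4))
bf, bi, bo, bc = (np.zeros(n_h) for _ in range(4))
h, c = lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h),
                 Wf, Wi, Wo, Wc, bf, bi, bo, bc)
print(h)
```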

Unsupervised learning

Instrumental in reviving NN research through pre-training weights and biases.

Still much more progress to be made, likely where the next "big hits" will come from.

Reinforcement learning is another encouraging area of research.

Programming frameworks

Theano (Python)

CPU/GPU symbolic expression compiler from the Montreal Institute for Learning Algorithms (MILA) at the University of Montreal.

Numerous higher-level libraries are built on top of Theano.
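
As a flavor of what "symbolic expression compiler" means, a tiny (circa-2015) Theano snippet that builds a sigmoid neuron symbolically, derives its gradient, and compiles both into a callable; the toy inputs are arbitrary:

```python
import theano
import theano.tensor as T

x = T.dvector('x')                              # symbolic input vector
w = T.dvector('w')
b = T.dscalar('b')
p = 1.0 / (1.0 + T.exp(-(T.dot(w, x) + b)))     # sigmoid neuron, built symbolically
grad_w = T.grad(p, w)                           # symbolic gradient w.r.t. the weights
neuron = theano.function([x, w, b], [p, grad_w])  # compiled for CPU or GPU

print(neuron([1.0, 0.5], [0.7, -0.4], -0.1))
```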

Torch7 (Lua + LuaJIT)

Lua is a lightweight scripting language designed to be embedded in other applications.

Widely used in industry at Facebook, Google, and Twitter.

It is no coincidence that LeCun and Hinton head DL research labs at Facebook and Google, respectively.

And the rest…

Remember: DON'T PANIC

References

Bengio, Yoshua, Pascal Lamblin, Dan Popovici, Hugo Larochelle, and others. 2007. “Greedy Layer-Wise Training of Deep Networks.” Advances in Neural Information Processing Systems 19. MIT Press: 153.

Dieleman, Sander, Kyle W Willett, and Joni Dambre. 2015. “Rotation-Invariant Convolutional Neural Networks for Galaxy Morphology Prediction.” Monthly Notices of the Royal Astronomical Society 450 (2). Oxford University Press: 1441–59.

Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. 2011. “Deep Sparse Rectifier Neural Networks.” In International Conference on Artificial Intelligence and Statistics, 315–23.

Hinton, Geoffrey E, Simon Osindero, and Yee-Whye Teh. 2006. “A Fast Learning Algorithm for Deep Belief Nets.” Neural Computation 18 (7). MIT Press: 1527–54.

Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation 9 (8). MIT Press: 1735–80.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.” In Advances in Neural Information Processing Systems 25, edited by F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, 1097–1105. Curran Associates, Inc. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

Minsky, Marvin, and Seymour Papert. 1969. Perceptrons. MIT Press.

Rumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1988. “Learning Representations by Back-Propagating Errors.” Cognitive Modeling 5: 3.

Schmidhuber, Jürgen. 2015. “Deep Learning in Neural Networks: An Overview.” Neural Networks 61. Elsevier: 85–117.

Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” The Journal of Machine Learning Research 15 (1). JMLR.org: 1929–58.