W2 - Neural network training

Discover neural network training, activation functions, multiclass classification, and advanced optimization techniques in TensorFlow.

This week, you’ll learn how to train your model in TensorFlow, and also learn about other important activation functions (besides the sigmoid function), and where to use each type in a neural network. You’ll also learn how to go beyond binary classification to multiclass classification (3 or more categories). Multiclass classification will introduce you to a new activation function and a new loss function. Optionally, you can also learn about the difference between multiclass classification and multi-label classification. You’ll learn about the Adam optimizer, and why it’s an improvement upon regular gradient descent for neural network training. Finally, you will get a brief introduction to other layer types besides the one you’ve seen thus far.

Learning Objectives

  • Train a neural network on data using TensorFlow
  • Understand the difference between various activation functions (sigmoid, ReLU, and linear)
  • Understand which activation functions to use for which type of layer
  • Understand why we need non-linear activation functions
  • Understand multiclass classification
  • Calculate the softmax activation for implementing multiclass classification
  • Use the categorical cross entropy loss function for multiclass classification
  • Use the recommended method for implementing multiclass classification in code
  • (Optional): Explain the difference between multi-label and multiclass classification

Neural network training

TensorFlow implementation

Training Details

Step 1 - create the model
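A minimal sketch of this step in Keras; the three-layer architecture and unit counts below are illustrative assumptions, not the only possible choice:

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Step 1: define the model, i.e. how to compute the output from the input
model = Sequential([
    Dense(units=25, activation='sigmoid'),  # layer 1
    Dense(units=15, activation='sigmoid'),  # layer 2
    Dense(units=1,  activation='sigmoid'),  # layer 3 (output layer)
])
```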

Step 2 - Loss and Cost functions

The cost function $J_{W,B}(\vec{X})$ is a function of all the parameters of the neural network:

  • $W$, which includes the matrices $W^{[1]}$, $W^{[2]}$ and $W^{[3]}$ for layers 1, 2 and 3 respectively (some of these may reduce to vectors, e.g. when a unit has only one weight such as $w_1^{[1]}$)
  • $B$, which includes the vectors $\vec{b}^{[1]}$, $\vec{b}^{[2]}$ and $\vec{b}^{[3]}$ for layers 1, 2 and 3 respectively

Optimizing the cost function with respect to $W$ and $B$ means optimizing it with respect to all of the parameters in the neural network (all layers).

In TensorFlow, we can use the following loss functions:

  • tf.keras.losses.BinaryCrossentropy for binary classification (the name cross entropy comes from statistics; binary re-emphasizes that this is a binary classification problem)
  • tf.keras.losses.MeanSquaredError for regression

Keras was originally a library developed independently of TensorFlow; it was actually a totally separate project from TensorFlow before being merged into it.
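A sketch of this step, reusing the model defined in the previous sketch:

```python
# Step 2: specify the loss (binary cross entropy for a binary classification problem)
model.compile(loss=tf.keras.losses.BinaryCrossentropy())

# For a regression problem we would instead use:
# model.compile(loss=tf.keras.losses.MeanSquaredError())
```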

Step 3 - Gradient descent

To use gradient descent to train the parameters of a neural network, we repeatedly update $w_j^{[l]}$ (and $b_j^{[l]}$) for every layer $l$ and every unit $j$.

This update depends on:

  • the learning rate $\alpha$
  • the partial derivative of the cost function $J_{W,B}(\vec{X})$ with respect to the parameters $w_j^{[l]}$ and $b_j^{[l]}$ (see the update rule below)
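Written out, the updates applied simultaneously to every parameter, with learning rate $\alpha$:

$$w_j^{[l]} = w_j^{[l]} - \alpha \frac{\partial}{\partial w_j^{[l]}} J(W,B) \qquad b_j^{[l]} = b_j^{[l]} - \alpha \frac{\partial}{\partial b_j^{[l]}} J(W,B)$$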

So in order to use gradient descent, the key activity is computing these partial derivative terms.

TensorFlow does all of this for you: it implements backpropagation inside the function called fit().
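A sketch of this step, continuing the previous sketches; X, Y and the number of epochs are placeholders for the training data and the number of passes over it:

```python
# Step 3: train the model; fit() runs gradient descent / backpropagation for us
model.fit(X, Y, epochs=100)
```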

TensorFlow can also use an algorithm that is even a little bit faster than gradient descent (see Advanced Optimization below).

Frameworks

Activation Functions

Alternatives to the sigmoid activation

Using the sigmoid activation function forces awareness (in the demand-prediction example) to be modeled as a binary number, 0 or 1. But maybe awareness should be any non-negative number, because there can be any non-negative value of awareness, going from 0 up to very large numbers.

ReLU (rectified linear unit) is another activation function: $g(z) = \max(0, z)$.

Here are the most commonly used activation functions:

  • Sigmoid: $g(z) = \frac{1}{1+e^{-z}}$
  • ReLU: $g(z) = \max(0, z)$
  • Linear activation function: $g(z) = z$ (i.e. no activation function)

Choosing activation functions

Output Layer:

  • Linear when $y$ can take both negative and positive values
  • ReLU when $y$ can only take non-negative values
  • Sigmoid for binary classification

Hidden Layer:

  • Even though we initially described neural networks using the sigmoid activation function, the field has evolved to use ReLU much more often and sigmoid hardly ever (except in the output layer of binary classification problems)
  • ReLU is preferred for two reasons:
    • it is faster to compute, because it only requires a max, whereas the sigmoid requires an exponentiation
    • the sigmoid is flat in two places (both tails) versus only one for ReLU, and flat regions make gradient descent slower (an intuitive rather than rigorous reason)

Summary
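A minimal Keras sketch summarizing these choices for a binary classification problem (the hidden-layer sizes are illustrative assumptions):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='relu'),     # hidden layer: ReLU
    Dense(units=15, activation='relu'),     # hidden layer: ReLU
    Dense(units=1,  activation='sigmoid'),  # output layer: sigmoid for binary classification
])
```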

Why do we need activation functions?

If we were to use a linear activation function for all of the nodes, the neural network would act just like linear regression.

Demonstration with a simple example of a neural network:
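Following the lecture's simple case (one hidden unit and one output unit, both with the linear activation $g(z)=z$):

$$a^{[1]} = w^{[1]} x + b^{[1]}, \qquad a^{[2]} = w^{[2]} a^{[1]} + b^{[2]} = (w^{[2]} w^{[1]})\,x + (w^{[2]} b^{[1]} + b^{[2]}) = wx + b$$

so the output is just a linear function of the input, which is exactly linear regression.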

In the general case, if you have a neural network with multiple layers and you use a linear activation function for all of the hidden layers and for the output layer, then this model computes an output that is equivalent to linear regression.

Alternatively, if we still use a linear activation function for all the hidden layers but a logistic (sigmoid) activation function for the output layer, then the model becomes equivalent to logistic regression.

ReLU activation

Multiclass classification

Multiclass

For the handwritten digit classification problems we’ve looked at so far, we were just trying to distinguish between the handwritten digits 0 and 1. But if you’re trying to read postal codes or zip codes on an envelope, there are actually 10 possible digits you might want to recognize.

Another multiclass classification problem: a data set whose examples may belong to 4 different classes.

Softmax

Wikipedia:

The softmax function (or normalized exponential function) converts a vector of K real numbers into a probability distribution of K possible outcomes. It is a generalization of the logistic function to multiple dimensions. The softmax function takes as input a vector z of K real numbers and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative or greater than one, and might not sum to 1; but after applying softmax, each component will be in the interval [0,1] and the components will add up to 1, so that they can be interpreted as probabilities.
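In the course’s notation, with $N$ output classes and $z_j = \vec{w}_j \cdot \vec{x} + b_j$:

$$a_j = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}} = P(y = j \mid \vec{x})$$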

Softmax regression with $N = 2$ is equivalent to logistic regression (not proven here).

Neural Network with Softmax output

Previously, for handwritten digit recognition with just two classes, we used a neural network whose output layer had a single neuron (unit).

If you now want to do handwritten digit classification with 10 classes (all the digits from zero to nine), then we change this neural network to have 10 output units, and this new output layer will be a softmax output layer.

The softmax layer (sometimes also called the softmax activation function) is a little bit unusual in one respect compared to the other activation functions: $a_1$ is a function of $z_1, z_2, \ldots, z_{10}$, so each activation value depends on all of the values of $z$.

TensorFlow implementation (there is a better version of the code, covered in the next section, that makes TensorFlow work better)
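A sketch of the straightforward version (the one the lecture recommends against using in practice; layer sizes, X and Y are illustrative placeholders):

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax'),  # softmax output layer, 10 classes
])
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy())
model.fit(X, Y, epochs=100)
```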

Improved implementation of softmax

Computing the activation $a$ as an intermediate term can create round-off errors.

These settings allow TensorFlow to avoid computing $a$ as an intermediate term for logistic regression:

  • from_logits=True in the loss function
  • a 'linear' activation on the output layer (instead of sigmoid)

Full algorithm:
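A sketch of this numerically more stable version for logistic regression (layer sizes, X and Y are placeholders): the output layer is linear, so the model outputs the logit $z$, and the loss is told so with from_logits=True:

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=1, activation='linear'),  # outputs z (a logit) instead of a = sigmoid(z)
])
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
model.fit(X, Y, epochs=100)

# To recover probabilities, apply the sigmoid to the logits afterwards
logits = model(X)
a = tf.nn.sigmoid(logits)
```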

The same settings allow TensorFlow to avoid computing $a$ as an intermediate term for softmax regression (a 'linear' activation on the output layer instead of softmax, and from_logits=True in the loss).

Full algorithm:
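The same idea for softmax regression, as a sketch (layer sizes, X and Y are placeholders):

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='linear'),  # outputs z_1 ... z_10 instead of a_1 ... a_10
])
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(X, Y, epochs=100)

# To recover probabilities, apply the softmax to the logits afterwards
logits = model(X)
a = tf.nn.softmax(logits)
```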

Classification with multiple outputs (Optional)

Example of multi-label classification (several labels can be true for the same input), which differs from multiclass classification (exactly one class per input).

Implemented with an output layer of multiple sigmoid units (neurons), one per label.
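A minimal sketch of such an output layer, assuming 3 independent labels (the number of labels and layer sizes are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=3, activation='sigmoid'),  # one sigmoid unit per label
])
# Each output is an independent yes/no decision, so binary cross entropy is applied per label
model.compile(loss=tf.keras.losses.BinaryCrossentropy())
```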

Softmax

Multiclass

Additional Neural Network concepts

Advanced Optimization

Gradient descent is an optimization algorithm that was the foundation of many algorithms like linear regression, logistic regression and early implementations of neural networks. But there are now some other optimization algorithms for minimizing the cost function, that are even better than gradient descent.

Depending on how gradient descent is proceeding, we may want to increase or decrease the learning rate $\alpha$.

The Adam algorithm (Adaptive Moment estimation) uses more than one $\alpha$: a different learning rate for every parameter.

Intuition: increase a parameter's learning rate when its updates keep moving in the same direction, and decrease it when the updates oscillate.

Implementation in TensorFlow:
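A sketch of the compile step with the Keras Adam optimizer, continuing the earlier sketches; the initial global learning rate of 1e-3 is the value used in the lecture, but it is a tunable hyperparameter:

```python
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # initial learning rate
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```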

Additional Layer Types

With a dense layer, every neuron in the layer gets as its inputs all the activations from the previous layer. It turns out that just using the dense layer type, you can actually build some pretty powerful learning algorithms.

We can also design a neural network with a different type of layer. One other layer type that you may see in some work is called a convolutional layer.

With a convolutional layer, each unit looks at only a limited window of the input.

Example with an electrocardiogram (ECG or EKG)

Back propagation (Optional)

Reminder:

  • Inference (making predictions) uses forward propagation
  • Learning, in contrast, uses backward propagation (backprop)

What is a derivative? (Optional)

Example

Definition of derivative
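Informally: if $w$ increases by a tiny amount $\varepsilon$ and $J(w)$ increases by about $k \times \varepsilon$, then $\frac{dJ}{dw} = k$. Formally:

$$\frac{dJ}{dw} = \lim_{\varepsilon \to 0} \frac{J(w + \varepsilon) - J(w)}{\varepsilon}$$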

Derivative and slope

Compute derivative in python with sympy
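A minimal sketch with SymPy, using a simple cost $J = w^2$ as in the optional lab (the specific cost and the evaluation point $w = 3$ are illustrative):

```python
import sympy

w = sympy.symbols('w')
J = w**2                      # illustrative cost function
dJ_dw = sympy.diff(J, w)      # symbolic derivative: 2*w
print(dJ_dw)                  # 2*w
print(dJ_dw.subs(w, 3))       # 6, the slope of J at w = 3
```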

Derivative Notation

  • Derivative $\frac{dJ}{dw}$
  • Partial derivative $\frac{\partial J}{\partial w_i}$

Computation graph (Optional)

The computation graph is a key idea in deep learning; it is used by programming frameworks like TensorFlow to automatically compute derivatives for neural networks.

To compute the output $a$ of the neural network, we execute the following forward-prop steps of the computation graph to calculate $a = wx + b$ and then $J = \frac{1}{2}(a-y)^2 = \frac{1}{2}((wx + b) - y)^2$:

  1. c = w . x
  2. a = c + b
  3. d = a - y
  4. J = 1/2 d²

Then we execute the backprop, from right to left.

  1. J = 1/2 d²
    • if d += 0.001, J += 0.002, so dJ/dd = 2
  2. d = a - y
    • if a += 0.001, d += 0.001, so dd/da =1
    • with step 1, if a += 0.001 then J+= 0.002, so we have dJ/da = dd/da . dJ/dd
  3. a = c + b
    • if c += 0.001, a += 0.001
    • with step 2, if c += 0.001 then J+= 0.002, so we have dJ/dc = da/dc . dJ/da
    • likewise, if b += 0.001 then J += 0.002, so dJ/db = 2
  4. c = w . x
    • if w += 0.001 then c -= 0.002 (so dc/dw = -2, since x = -2 in this example), and J -= 0.004, giving dJ/dw = dc/dw . dJ/dc = -4

The chain rule: $\frac{du}{dx} = \frac{du}{dv} \cdot \frac{dv}{dx}$, or equivalently $\frac{d}{dx}\,f(g(x)) = f'(g(x)) \cdot g'(x)$

Double check:
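A small numeric double-check of dJ/dw, assuming the values $w = 2$, $b = 8$, $x = -2$, $y = 2$ (these are consistent with the numbers in the walkthrough above, but treat them as assumptions):

```python
# J(w) = 1/2 * ((w*x + b) - y)**2
def J(w, b=8.0, x=-2.0, y=2.0):
    return 0.5 * ((w * x + b) - y) ** 2

eps = 0.001
w = 2.0
print((J(w + eps) - J(w)) / eps)  # approximately -4.0, matching dJ/dw = -4 from backprop
```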

Backprop efficiency

Larger neural network example (Optional)

Many years ago, before the rise of frameworks like TensorFlow and PyTorch, researchers had to manually use calculus to compute the derivatives of the neural networks they wanted to train.

In modern programming frameworks you can specify forward prop and have the framework take care of backprop for you, thanks to the computation graph and these techniques for automatically carrying out derivative calculations. This is sometimes called autodiff, for automatic differentiation. The process of researchers manually using calculus to take derivatives is no longer necessary.
