
Deep Learning - Study Notes
These notes are based on https://www.coursera.org/specializations/deep-learning
LaTeX may not render correctly in some viewers.
1. Neural Networks and Deep Learning
1.1 Introduction to Deep Learning
1.1.1 Supervised Learning with Deep Learning
- Structured Data: tables/databases, where each feature has a well-defined meaning.
- Unstructured Data: Audio, Image, Text.
1.1.2 Scale drives deep learning progress
- The larger the amount of data, the better a large neural network performs compared to a smaller network or a traditional learning algorithm.
- Switching from sigmoid to ReLU makes gradient descent much faster, because the sigmoid gradient goes to 0 in its flat regions (saturation) and slows learning, while the ReLU gradient stays 1 for positive inputs.
1.2 Basics of Neural Network Programming
1.2.1 Binary Classification
- Input: $x \in \mathbb{R}^{n_x}$
- Output: $y \in \{0, 1\}$
1.2.2 Logistic Regression
Given $x$, want $\hat{y} = P(y = 1 \mid x)$
Input: $x \in \mathbb{R}^{n_x}$
Parameters: $w \in \mathbb{R}^{n_x}$, $b \in \mathbb{R}$
Output: $\hat{y} = \sigma(w^T x + b)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$
- If $z$ is large, $\sigma(z) \approx 1$
- If $z$ is a large negative number, $\sigma(z) \approx 0$
Loss (error) function: $\mathcal{L}(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big)$
Cost function: $J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$
1.2.3 Gradient Descent
- Repeat: $w := w - \alpha \frac{\partial J(w, b)}{\partial w}$, $b := b - \alpha \frac{\partial J(w, b)}{\partial b}$
- $\alpha$: learning rate
- On the right side of the minimum the derivative is positive, so $w$ decreases toward the minimum; on the left side it is negative, so $w$ increases toward the minimum.
- Logistic Regression Gradient Descent
- Computation graph: $z = w^T x + b \rightarrow a = \sigma(z) \rightarrow \mathcal{L}(a, y)$
- $dz = a - y$ (from $dz = da \cdot \sigma'(z)$ with $da = -\frac{y}{a} + \frac{1-y}{1-a}$); then $dw_1 = x_1 dz$, $dw_2 = x_2 dz$, $db = dz$
- Gradient descent on $m$ examples (see the loop sketch below)
- for $i = 1$ to $m$: $z^{(i)} = w^T x^{(i)} + b$, $a^{(i)} = \sigma(z^{(i)})$, accumulate $J$ += $\mathcal{L}(a^{(i)}, y^{(i)})$ and $dz^{(i)} = a^{(i)} - y^{(i)}$
- $dw_1$ += $x_1^{(i)} dz^{(i)}$, $dw_2$ += $x_2^{(i)} dz^{(i)}$ (for $n = 2$)
- $db$ += $dz^{(i)}$; finally divide $J$, $dw_1$, $dw_2$, $db$ by $m$ (for $n = 2$)
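A minimal NumPy sketch of the loop above for $n = 2$ features; the function name `grads_with_loop` and the array shapes are my own choices for illustration, not from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grads_with_loop(X, Y, w, b):
    """Cost and gradients of logistic regression with an explicit loop over the m examples.
    Assumed shapes: X (2, m), Y (1, m), w (2, 1), b scalar."""
    m = X.shape[1]
    J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
    for i in range(m):                              # loop over the m training examples
        z = w[0, 0] * X[0, i] + w[1, 0] * X[1, i] + b
        a = sigmoid(z)
        J += -(Y[0, i] * np.log(a) + (1 - Y[0, i]) * np.log(1 - a))
        dz = a - Y[0, i]                            # dL/dz = a - y
        dw1 += X[0, i] * dz                         # accumulate dw1 (n = 2 features)
        dw2 += X[1, i] * dz                         # accumulate dw2
        db += dz
    return J / m, dw1 / m, dw2 / m, db / m          # average over the m examples
```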
1.2.4 Computational Graph
1.2.5 Vectorization
Vectorization lets you avoid explicit for-loops.
- Non-vectorized: for $i = 1$ to $n_x$: $z$ += $w_i x_i$, then $z$ += $b$
- Vectorized: `z = np.dot(w.T, x) + b`; `b` is a single (1, 1) number that gets broadcast
Vectorizing Logistic Regression
- Get rid of $dw_1, dw_2, \ldots$ (use a single vector $dw$) and of the explicit for loop over examples.
- New form of logistic regression (see the vectorized sketch below): $Z = w^T X + b$, $A = \sigma(Z)$, $dZ = A - Y$, $dw = \frac{1}{m} X\, dZ^T$, $db = \frac{1}{m}\,\mathrm{np.sum}(dZ)$; update $w := w - \alpha\, dw$, $b := b - \alpha\, db$
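The same computation without any explicit loop; a sketch assuming $X$ has shape $(n_x, m)$, $Y$ shape $(1, m)$, $w$ shape $(n_x, 1)$ (the helper name is mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grads_vectorized(X, Y, w, b):
    """Vectorized cost and gradients for logistic regression over all m examples at once."""
    m = X.shape[1]
    Z = w.T @ X + b                    # Z = w^T X + b, shape (1, m); the scalar b is broadcast
    A = sigmoid(Z)                     # A = sigma(Z)
    dZ = A - Y                         # dZ = A - Y
    dw = (X @ dZ.T) / m                # dw = (1/m) X dZ^T, shape (n_x, 1)
    db = np.sum(dZ) / m                # db = (1/m) sum(dZ)
    J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    return J, dw, db

# One gradient-descent step would then be: w -= alpha * dw; b -= alpha * db
```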
Broadcasting (same as bsxfun in MATLAB/Octave)
- $(m, n)$ matrix $+ - * /$ a $(1, n)$ matrix: the $(1, n)$ row is copied $m$ times, then the operation is applied element-wise.
- $(m, n)$ matrix $+ - * /$ an $(m, 1)$ matrix: the $(m, 1)$ column is copied $n$ times, then the operation is applied element-wise.
- Don't use "rank 1 arrays" (shape `(n,)`).
- Use `np.random.randn(5, 1)` (column vector) or `np.random.randn(1, 5)` (row vector) instead of `np.random.randn(5)`.
- Check with `assert(a.shape == (5, 1))`.
- Fix a rank 1 array with `a = a.reshape((5, 1))`. (See the example below.)
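A small demonstration of why rank 1 arrays are best avoided (the sizes are arbitrary):

```python
import numpy as np

a = np.random.randn(5)            # rank 1 array, shape (5,): avoid
print(a.shape)                    # (5,)
print(a.T.shape)                  # still (5,): transposing does nothing useful

b = np.random.randn(5, 1)         # column vector, shape (5, 1): preferred
assert b.shape == (5, 1)          # cheap sanity check on the shape
print((b @ b.T).shape)            # (5, 5) outer product, as expected

a = a.reshape((5, 1))             # fix a rank 1 array by reshaping it
```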
Logistic Regression Cost Function
- Loss: interpret $\hat{y}$ as $P(y = 1 \mid x)$, so $p(y \mid x) = \hat{y}^{\,y}(1 - \hat{y})^{1 - y}$
- If $y = 1$: $p(y \mid x) = \hat{y}$
- If $y = 0$: $p(y \mid x) = 1 - \hat{y}$
- Cost: $\log p(\text{labels in training set}) = \sum_{i=1}^{m}\log p(y^{(i)} \mid x^{(i)}) = -\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})$
- Use maximum likelihood estimation (MLE): maximizing the log-likelihood is the same as minimizing the sum of the losses.
- Cost (minimize): $J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})$
1.3 Shallow Neural Networks
1.3.1 Neural Network Representation

Input layer, hidden layer, output layer
- $a^{[0]} = x \rightarrow a^{[1]} \rightarrow a^{[2]} = \hat{y}$
- Layers are counted as # of hidden layers + the output layer (the input layer is not counted).
- Each node computes $z = w^T x + b \rightarrow a = \sigma(z)$; for a hidden layer with 4 units:
- First hidden node: $z_1^{[1]} = w_1^{[1]T}x + b_1^{[1]}$, $a_1^{[1]} = \sigma(z_1^{[1]})$
- Second hidden node: $z_2^{[1]} = w_2^{[1]T}x + b_2^{[1]}$, $a_2^{[1]} = \sigma(z_2^{[1]})$
- Third hidden node: $z_3^{[1]} = w_3^{[1]T}x + b_3^{[1]}$, $a_3^{[1]} = \sigma(z_3^{[1]})$
- Fourth hidden node: $z_4^{[1]} = w_4^{[1]T}x + b_4^{[1]}$, $a_4^{[1]} = \sigma(z_4^{[1]})$
Vectorization
- Notation: $a^{[l](i)}$ means layer $l$, training example $i$.
for i=1 to m: $z^{[1](i)} = W^{[1]}x^{(i)} + b^{[1]}$, $a^{[1](i)} = \sigma(z^{[1](i)})$, $z^{[2](i)} = W^{[2]}a^{[1](i)} + b^{[2]}$, $a^{[2](i)} = \sigma(z^{[2](i)})$
Vectorizing the above for loop (see the sketch below): $Z^{[1]} = W^{[1]}X + b^{[1]}$, $A^{[1]} = \sigma(Z^{[1]})$, $Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$, $A^{[2]} = \sigma(Z^{[2]})$
- In $Z^{[1]}$ and $A^{[1]}$, the rows index the different hidden units.
- Horizontally: training examples; vertically: hidden units.
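A sketch of the vectorized forward pass for a network with one hidden layer; the layer sizes (3 inputs, 4 hidden units, m = 5 examples) are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# X stacks the m training examples as columns: shape (n_x, m)
X = np.random.randn(3, 5)                                # n_x = 3 features, m = 5 examples
W1, b1 = np.random.randn(4, 3) * 0.01, np.zeros((4, 1))  # 4 hidden units
W2, b2 = np.random.randn(1, 4) * 0.01, np.zeros((1, 1))

Z1 = W1 @ X + b1          # (4, m): rows = hidden units, columns = training examples
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2         # (1, m)
A2 = sigmoid(Z2)          # predictions for all m examples at once
```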
1.3.2 Activation Functions
- $g^{[l]}$: activation function of layer $l$; $a^{[l]} = g^{[l]}(z^{[l]})$. (See the sketch after this list.)
- Sigmoid: $a = \frac{1}{1 + e^{-z}}$
- Tanh: $a = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
- ReLU: $a = \max(0, z)$
- Leaky ReLU: $a = \max(0.01z, z)$
- Rules to choose an activation function
- If the output is binary ($y \in \{0, 1\}$), use sigmoid at the output layer.
- Otherwise, ReLU is the default choice.
- Why a non-linear activation function is needed
- With linear (identity) activations, stacking hidden layers is useless: the composition of linear functions is still linear, so the network collapses to a single $a = w'x + b'$.
- A linear activation is sometimes used at the output layer (e.g. for regression), together with non-linear hidden layers.
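The four activation functions as NumPy one-liners (a sketch; the function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # a in (0, 1), useful for binary outputs

def tanh(z):
    return np.tanh(z)                     # a in (-1, 1), zero-centered

def relu(z):
    return np.maximum(0, z)               # a = max(0, z), the default for hidden layers

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)  # a = max(0.01 z, z)
```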
1.3.3 Forward and Backward Propagation
- Derivatives of the activation functions
- Sigmoid: $g'(z) = a(1 - a)$
- Tanh: $g'(z) = 1 - a^2$
- ReLU: $g'(z) = 0$ if $z < 0$, $1$ if $z > 0$
- Leaky ReLU: $g'(z) = 0.01$ if $z < 0$, $1$ if $z > 0$
- Gradient descent for neural networks (one hidden layer; see the sketch after this list)
- Parameters: $W^{[1]} (n^{[1]}, n^{[0]})$, $b^{[1]} (n^{[1]}, 1)$, $W^{[2]} (n^{[2]}, n^{[1]})$, $b^{[2]} (n^{[2]}, 1)$
- Cost function: $J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})$
- Forward propagation: $Z^{[1]} = W^{[1]}X + b^{[1]}$, $A^{[1]} = g^{[1]}(Z^{[1]})$, $Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$, $A^{[2]} = g^{[2]}(Z^{[2]}) = \sigma(Z^{[2]})$
- Back propagation: $dZ^{[2]} = A^{[2]} - Y$, $dW^{[2]} = \frac{1}{m}dZ^{[2]}A^{[1]T}$, $db^{[2]} = \frac{1}{m}$ np.sum($dZ^{[2]}$, axis=1, keepdims=True), $dZ^{[1]} = W^{[2]T}dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$ (element-wise product), $dW^{[1]} = \frac{1}{m}dZ^{[1]}X^{T}$, $db^{[1]} = \frac{1}{m}$ np.sum($dZ^{[1]}$, axis=1, keepdims=True)
- Random initialization: initialize $W^{[l]}$ to small random values (e.g. `np.random.randn(shape) * 0.01`), not zeros, otherwise all hidden units compute the same function (symmetry problem); $b^{[l]}$ can be initialized to zeros.
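A sketch of one full iteration (forward, backward, update) for a network with one hidden tanh layer and a sigmoid output; the dictionary layout and function names are my own choices, not from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n0, n1, n2):
    """Random initialization: small random W (breaks symmetry), zero b."""
    return {"W1": np.random.randn(n1, n0) * 0.01, "b1": np.zeros((n1, 1)),
            "W2": np.random.randn(n2, n1) * 0.01, "b2": np.zeros((n2, 1))}

def one_iteration(X, Y, params, alpha=0.01):
    """One gradient-descent step. X: (n0, m) inputs, Y: (1, m) labels."""
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]
    m = X.shape[1]

    # Forward propagation
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)

    # Backward propagation
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)       # tanh'(z) = 1 - a^2
    dW1 = (dZ1 @ X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient-descent update
    params["W1"], params["b1"] = W1 - alpha * dW1, b1 - alpha * db1
    params["W2"], params["b2"] = W2 - alpha * dW2, b2 - alpha * db2
    return params
```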
1.4 Deep Neural Networks
1.4.1 Deep L-Layer Neural Network
- Deep neural network notation

- $L$ = # of layers (the input layer is not counted); $n^{[l]}$ = # of units in layer $l$; $a^{[l]} = g^{[l]}(z^{[l]})$ = activations in layer $l$; $W^{[l]}, b^{[l]}$ = parameters for computing $z^{[l]}$; $a^{[0]} = x$, $a^{[L]} = \hat{y}$
1.4.2 Forward Propagation in a Deep Network
- General: $z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$, $a^{[l]} = g^{[l]}(z^{[l]})$
- Repeat for $l = 1, \ldots, L$, with $a^{[0]} = x$.
- Vectorized: $Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}$, $A^{[l]} = g^{[l]}(Z^{[l]})$, with $A^{[0]} = X$; the explicit for loop over the layers cannot be avoided. (See the sketch below.)
- Matrix dimensions: $W^{[l]}, dW^{[l]}: (n^{[l]}, n^{[l-1]})$; $b^{[l]}, db^{[l]}: (n^{[l]}, 1)$; $Z^{[l]}, A^{[l]}: (n^{[l]}, m)$ (or $(n^{[l]}, 1)$ for a single example)
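A compact sketch of the layer loop, assuming ReLU hidden layers, a sigmoid output, and a `params` dictionary keyed "W1", "b1", ..., "WL", "bL" (my own layout, not from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_L_layers(X, params, L):
    """Forward propagation through L layers: Z[l] = W[l] A[l-1] + b[l], A[l] = g[l](Z[l])."""
    A = X                                        # A[0] = X
    for l in range(1, L + 1):                    # this loop over layers cannot be vectorized away
        W, b = params["W" + str(l)], params["b" + str(l)]
        Z = W @ A + b
        A = sigmoid(Z) if l == L else np.maximum(0, Z)   # sigmoid output, ReLU hidden layers
    return A                                     # A[L] = Y_hat
```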

- Why deep representation?
- Earlier layers learn simple features (e.g. edges); deeper layers compose them to detect more complex things (e.g. faces).
- Circuit theory and deep learning: Informally: There are functions you can compute with a “small” L-layer deep neural network that shallower networks require exponentially more hidden units to compute.
1.4.3 Building Blocks of Deep Neural Networks
- Forward and backward functions

- Layer $l$: parameters $W^{[l]}$, $b^{[l]}$
- Forward: input $a^{[l-1]}$, output $a^{[l]}$; cache $z^{[l]}$ (plus $W^{[l]}, b^{[l]}$) for the backward pass.
- Backward: input $da^{[l]}$ and the cache, output $da^{[l-1]}$, $dW^{[l]}$, $db^{[l]}$.
- One iteration of gradient descent of the neural network: forward pass through all layers, backward pass through all layers, then update $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$, $b^{[l]} := b^{[l]} - \alpha\, db^{[l]}$.
- How to implement? (See the sketch below.)
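A sketch of the two per-layer building blocks with the cache pattern described above; the function names and the choice of ReLU/sigmoid activations are assumptions for illustration:

```python
import numpy as np

def linear_activation_forward(A_prev, W, b, activation):
    """Forward block for layer l: input a[l-1], output a[l] plus a cache for the backward pass."""
    Z = W @ A_prev + b
    A = np.maximum(0, Z) if activation == "relu" else 1 / (1 + np.exp(-Z))
    cache = (A_prev, W, b, Z)                     # cache z[l] (and W[l], b[l], a[l-1])
    return A, cache

def linear_activation_backward(dA, cache, activation):
    """Backward block for layer l: input da[l] and the cache, output da[l-1], dW[l], db[l]."""
    A_prev, W, b, Z = cache
    m = A_prev.shape[1]
    if activation == "relu":
        dZ = dA * (Z > 0)                         # ReLU: g'(z) = 1 for z > 0, else 0
    else:
        s = 1 / (1 + np.exp(-Z))
        dZ = dA * s * (1 - s)                     # sigmoid: g'(z) = a (1 - a)
    dW = (dZ @ A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ
    return dA_prev, dW, db
```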
1.4.4 Parameters vs. Hyperparameters
- Parameters: $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, \ldots, W^{[L]}, b^{[L]}$
- Hyperparameters (will affect/control/determine parameters):
- learning rate
- # iterations
- # of hidden units
- # of hidden layers
- Choice of activation function
- Later: momentum, mini-batch size, regularization parameters, …
2. Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization
2.1 Practical Aspects of Deep Learning
2.1.1 Train / Dev / Test sets
- With big data, the dev/test sets may need only 1% of the data, or even less.
- Mismatched data: make sure the dev and test sets come from the same distribution.
- Not having a test set might be okay. (Only dev set.)
2.1.2 Bias / Variance


- Assume the optimal (Bayes) error is about 0%.
- High bias (underfitting): the model cannot classify the different elements as we want, even on the training set.
- Training set error is high (well above the Bayes error); dev set error is about as high.
- If the dev set error is much higher than the already-high training error, the model has both high bias and high variance.
- "Just right": the model classifies the different elements about as well as the Bayes error allows.
- Training set error is low; dev set error is close to it.
- High variance (overfitting): the model classifies the training examples (nearly) 100% correctly but does not generalize.
- Training set error is low; dev set error is much higher.
- The gap between training error and Bayes error indicates the bias; the gap between dev error and training error indicates the variance.
2.1.3 Basic Recipe for Machine Learning
2.1.3.1 Basic Recipe
- High bias (training data performance)
- Bigger network
- Train longer
- (NN architecture search)
- High variance (dev set performance)
- More data
- Regularization
- (NN architecture search)
2.1.3.2 Regularization
- Logistic regression: $J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_2^2$
- L2 regularization: $\frac{\lambda}{2m}\|w\|_2^2 = \frac{\lambda}{2m}\sum_{j=1}^{n_x}w_j^2 = \frac{\lambda}{2m}w^T w$
- L1 regularization: $\frac{\lambda}{2m}\|w\|_1 = \frac{\lambda}{2m}\sum_{j=1}^{n_x}|w_j|$
- $w$ will be sparse (for L1): it will have lots of zeros in it, which only helps a little (e.g. for compressing the model).
- Neural network: $J(W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L}\|W^{[l]}\|_F^2$
- Frobenius norm: the square root of the sum of the squares of all elements of a matrix, $\|W^{[l]}\|_F^2 = \sum_i\sum_j (w_{ij}^{[l]})^2$.
- $dW^{[l]} = (\text{term from backprop}) + \frac{\lambda}{m}W^{[l]}$ (the backprop term is kept the same).
- Weight decay
- $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]} = W^{[l]} - \alpha\big[(\text{from backprop}) + \frac{\lambda}{m}W^{[l]}\big]$
- $= \big(1 - \frac{\alpha\lambda}{m}\big)W^{[l]} - \alpha(\text{from backprop})$, so $W^{[l]}$ is multiplied by a factor slightly less than 1 on every update.
- How does regularization prevent overfitting: bigger $\lambda$ → smaller $W^{[l]}$ → smaller $z^{[l]}$, which keeps the activations in their roughly linear regime (take tanh as an example). The network then behaves closer to a linear function, so it is hard for it to draw very curvy decision boundaries.
- Dropout regularization

- Implementing dropout ("inverted dropout")
- Illustrate with layer $l = 3$ and `keep_prob = 0.8` (meaning a 0.2 chance that a unit gets dropped / zeroed out); see the sketch after this list.
- `d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob` # d3 is a True (1) / False (0) mask with the same shape as a3
- `a3 = np.multiply(a3, d3)` # equivalently a3 *= d3; this drops (zeroes out) some neurons
- `a3 /= keep_prob` # inverted dropout: keeps the expected total activation the same before and after dropout
- Why it works: a unit can't rely on any one feature, so it has to spread out the weights (which shrinks the weights).
- First make sure $J$ is monotonically decreasing over iterations with dropout turned off (`keep_prob = 1`), then turn dropout on (with dropout, $J$ is no longer well defined).
- Data augmentation
- Images: random crops, horizontal flips, small rotations/distortions, …
- Early stopping
- Stop training where the dev set error is lowest; the weights end up mid-size ($\|w\|$ somewhere between the small initial values and the fully trained values).
- Downside: it tries to optimize the cost function and avoid overfitting at the same time, coupling the two goals.
- Orthogonalization
- Work on one goal at a time: either optimizing the cost function or reducing overfitting, not both with a single knob.
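A sketch of the inverted-dropout step from the list above, applied to the layer-3 activations a3; the function name is my own:

```python
import numpy as np

def inverted_dropout(a3, keep_prob=0.8):
    """Apply inverted dropout to the activations a3 (shape: units x examples)."""
    d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # True (keep) / False (drop) mask
    a3 = a3 * d3                       # zero out the dropped neurons
    a3 = a3 / keep_prob                # scale up so the expected activation is unchanged
    return a3, d3                      # keep d3 so the same mask can be reused in backprop
```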
2.1.3.3 Setting up your optimization problem
- Normalizing training sets

- Subtract the mean: $\mu = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}$, $x := x - \mu$
- Normalize the variance: $\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}x^{(i)} ** 2$, then $x := x / \sigma$
- "**" denotes element-wise squaring.
- Use the same $\mu$ and $\sigma^2$ to normalize the test set. (See the sketch below.)
- Why normalize inputs?
- When input features are on very different scales, normalization makes the cost function more symmetric, so gradient descent can use a larger learning rate and converge faster.
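A sketch of input normalization with the same $\mu$ and $\sigma^2$ reused for the test set; the function name and the small epsilon (to avoid division by zero) are my own additions:

```python
import numpy as np

def normalize(X_train, X_test, eps=1e-8):
    """Normalize features to zero mean and unit variance; examples are columns."""
    mu = np.mean(X_train, axis=1, keepdims=True)            # mu = (1/m) sum x(i), per feature
    X_train = X_train - mu                                  # subtract the mean
    sigma2 = np.mean(X_train ** 2, axis=1, keepdims=True)   # sigma^2 = (1/m) sum x(i)**2 (element-wise)
    X_train = X_train / np.sqrt(sigma2 + eps)               # normalize the variance
    X_test = (X_test - mu) / np.sqrt(sigma2 + eps)          # use the SAME mu and sigma^2 on the test set
    return X_train, X_test
```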

- Vanishing/exploding gradients
- Weights just slightly larger than the identity make the activations and gradients grow exponentially with depth (exploding).
- Weights just slightly smaller than the identity make them shrink exponentially with depth, so learning becomes really slow (vanishing).
- Weight initialization (single-neuron intuition)
- Larger $n$ (number of input features) → want smaller $w_i$, so set $\mathrm{Var}(w_i) = \frac{1}{n}$.
- sigmoid/tanh: $W^{[l]} =$ `np.random.randn(shape)` $\cdot \sqrt{\frac{1}{n^{[l-1]}}}$ (the variance scale can be seen as a hyperparameter, but DO NOT tune it as a priority).
- ReLU: use variance $\frac{2}{n^{[l-1]}}$, i.e. multiply by $\sqrt{\frac{2}{n^{[l-1]}}}$.
- Xavier initialization (for tanh): $\sqrt{\frac{1}{n^{[l-1]}}}$; sometimes $\sqrt{\frac{2}{n^{[l-1]} + n^{[l]}}}$ is used.
- Numerical approximation of gradients: use the two-sided difference $\frac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon} \approx f'(\theta)$, whose error is $O(\varepsilon^2)$ (more accurate than the one-sided difference).
- Gradient checking (grad check); see the sketch at the end of this section
- Take $W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]}$ and reshape/concatenate them into a big vector $\theta$.
- Take $dW^{[1]}, db^{[1]}, \ldots, dW^{[L]}, db^{[L]}$ and reshape/concatenate them into a big vector $d\theta$.
- For each $i$: $d\theta_{\text{approx}}[i] = \frac{J(\theta_1, \ldots, \theta_i + \varepsilon, \ldots) - J(\theta_1, \ldots, \theta_i - \varepsilon, \ldots)}{2\varepsilon} \approx d\theta[i]$
- Check the relative Euclidean distance $\frac{\|d\theta_{\text{approx}} - d\theta\|_2}{\|d\theta_{\text{approx}}\|_2 + \|d\theta\|_2}$ ($\|\cdot\|_2$ is the Euclidean norm: the square root of the sum of the squared elements).
- Take $\varepsilon = 10^{-7}$: if the distance above is about $10^{-7}$ or smaller, great.
- If it is around $10^{-5}$, double-check.
- If it is $10^{-3}$ or bigger, worry: there is probably a bug. Check which components $d\theta_{\text{approx}}[i]$ differ most from the real $d\theta[i]$.
- notes:
- Don't use grad check in training - only to debug.
- If the algorithm fails grad check, look at the individual components to try to identify the bug.
- Remember regularization: include the regularization term $\frac{\lambda}{2m}\sum_{l}\|W^{[l]}\|_F^2$ in $J$ (and its gradient in $d\theta$).
- Doesn't work with dropout: the dropout mask is random, so $J$ is not well defined; run grad check with dropout turned off (keep_prob = 1).
- Run at random initialization, and perhaps again after some training (a bug may only show up once $w$ and $b$ have grown away from their small initial values).
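A sketch of grad check on an already-flattened parameter vector theta; `J` is assumed to be a function that maps theta to the scalar cost, and the function name is mine:

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient dtheta with a two-sided numerical approximation."""
    dtheta_approx = np.zeros_like(theta, dtype=float)
    for i in range(theta.shape[0]):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps                       # J(..., theta_i + eps, ...)
        theta_minus[i] -= eps                      # J(..., theta_i - eps, ...)
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    # relative Euclidean distance between the approximate and analytic gradients
    diff = np.linalg.norm(dtheta_approx - dtheta) / (
        np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta))
    return diff      # ~1e-7: great; ~1e-5: double-check; >= 1e-3: probably a bug
```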
2.2 Optimization Algorithms
2.2.1 Mini-batch gradient descent
2.2.2 Exponentially weighted averages
2.2.3 RMSprop and Adam optimization
2.3 Hyperparameter Tuning, Batch Normalization, and Programming Frameworks
2.3.1 Tuning process
2.3.2 Using an appropriate scale to pick hyperparameters
2.3.3 Batch Normalization
2.3.4 Multi-class classification
3. Structuring Machine Learning Projects
4. Convolutional Neural Networks
5. Sequence Models