# Deep Learning - Study Notes

These notes are based on https://www.coursera.org/specializations/deep-learning

LaTeX may have some rendering issues when displayed.

## 1. Neural Networks and Deep Learning

### 1.1 Introduction to Deep Learning

#### 1.1.1 Supervised Learning with Deep Learning

• Structured data: tabular data (e.g. database tables/charts).
• Unstructured data: audio, images, text.

#### 1.1.2 Scale drives deep learning progress

• As the amount of data grows, larger neural networks keep improving and outperform smaller networks and traditional supervised learning algorithms.
• Switching from sigmoid to ReLU makes gradient descent much faster, because the gradient no longer saturates toward 0 for large inputs.

### 1.2 Basics of Neural Network Programming

#### 1.2.1 Binary Classification

• Input: $x \in \mathbb{R}^{n_x}$
• Output: $y \in \{0, 1\}$

#### 1.2.2 Logistic Regression

• Given $x$, want $\hat{y} = P(y=1|x)$

• Input: $x \in R^{n_x}$

• Parameters: $w \in R^{n_x}, b \in R$

• Output $\hat{y} = \sigma(w^Tx + b)$

• $\sigma(z)=\dfrac{1}{1+e^{-z}}$
• If $z$ is large, $\sigma(z)\approx\dfrac{1}{1+0}\approx1$
• If $z$ is a large negative number, $\sigma(z)\approx\dfrac{1}{1+\text{Bignum}}\approx0$
• Loss (error) function:

• $\hat{y} = \sigma(w^Tx + b)$, where $\sigma(z)=\dfrac{1}{1+e^{-z}}$
• $z^{(i)}=w^Tx^{(i)}+b$
• Want $y^{(i)} \approx \hat{y}^{(i)}$

• $L(y, \hat{y}) = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]$
• If $y=1$: $L(\hat{y}, y)=-\log{\hat{y}}$ <- want $\log{\hat{y}}$ as large as possible, so want $\hat{y}$ large
• If $y=0$: $L(\hat{y}, y)=-\log{(1-\hat{y})}$ <- want $\log{(1-\hat{y})}$ as large as possible, so want $\hat{y}$ small
• Cost function

• $J(w, b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})=-\dfrac{1}{m}\sum\limits_{i=1}^{m}[y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})]$

#### 1.2.3 Gradient Descent

• Repeat: $w:=w-\alpha\dfrac{\partial J(w,b)}{\partial w}$; $b:=b-\alpha\dfrac{\partial J(w,b)}{\partial b}$
• $\alpha$: learning rate
• On the right side of the minimum, $\dfrac{dJ(w)}{dw} > 0$; on the left side of the minimum, $\dfrac{dJ(w)}{dw} < 0$
• Logistic regression derivatives, on a computation graph with inputs $x_1,x_2,w_1,w_2,b$:
• $z=w_1x_1+w_2x_2+b$ --> $a=\sigma(z)$ --> $L(a,y)$
• $da=\dfrac{dL(a,y)}{da}=-\dfrac{y}{a}+\dfrac{1-y}{1-a}$
• $\dfrac{dL(y,a)}{da} = \dfrac{d}{da}(-y\log(a) - (1-y)\log(1-a))$
• $\dfrac{d}{da} (-y\log(a)) = -\dfrac{y}{a}$
• $\dfrac{d}{da} (-(1-y)\log(1-a)) = -\dfrac{1-y}{1-a} \times (-1) = \dfrac{1-y}{1-a}$
• $=-\dfrac{y}{a} + \dfrac{1-y}{1-a} = -\dfrac{y}{a} - \dfrac{y-1}{1-a}$
• $dz=\dfrac{dL}{dz}=\dfrac{dL(a,y)}{dz}=a-y$
• $dz=\dfrac{dL}{da}\cdot\dfrac{da}{dz}$, where $\dfrac{da}{dz}=a(1-a)$
• $\dfrac{dL}{dw_1}="dw_1"=x_1\cdot dz$
• $\dfrac{dL}{dw_2}="dw_2"=x_2\cdot dz$
• $db=dz$
• Gradient descent on $m$ examples
• $J(w, b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(a^{(i)},y^{(i)})$
• $\dfrac{\partial}{\partial w_1}J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}\dfrac{\partial}{\partial w_1}L(a^{(i)},y^{(i)})$
• $J=0;dw_1=0;dw_2=0;db=0$
• for $i=1$ to $m$:
• $z^{(i)}=w^Tx^{(i)}+b$
• $a^{(i)}=\sigma (z^{(i)})$
• $J+=-[y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})]$
• $dz^{(i)}=a^{(i)}-y^{(i)}$
• $dw_1+=x_1^{(i)}dz^{(i)}$ (for $n=2$)
• $dw_2+=x_2^{(i)}dz^{(i)}$ (for $n=2$)
• $db+=dz^{(i)}$
• $J/=m;\space dw_1/=m;\space dw_2/=m;\space db/=m$
• $dw_1=\dfrac{\partial J}{\partial w_1}; dw_2=\dfrac{\partial J}{\partial w_2}$
• $w_1:=w_1-\alpha dw_1$
• $w_2:=w_2-\alpha dw_2$
• $b:=b-\alpha db$
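
A minimal NumPy sketch of one iteration of the loop above (the function and variable names are illustrative, not from the course): it accumulates the cost and gradients example by example, averages them, then takes one gradient-descent step.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def one_step_loop(X, y, w, b, alpha=0.01):
    """One gradient-descent step with an explicit loop over the m examples.

    X: (n_x, m) inputs, y: (m,) labels, w: (n_x, 1) weights, b: scalar bias.
    """
    n_x, m = X.shape
    J, dw, db = 0.0, np.zeros((n_x, 1)), 0.0
    for i in range(m):
        z_i = (w.T @ X[:, i:i+1]).item() + b      # z = w^T x + b
        a_i = sigmoid(z_i)                        # a = sigma(z)
        J += -(y[i] * np.log(a_i) + (1 - y[i]) * np.log(1 - a_i))
        dz_i = a_i - y[i]                         # dz = a - y
        dw += X[:, i:i+1] * dz_i                  # dw_j += x_j * dz
        db += dz_i
    J, dw, db = J / m, dw / m, db / m
    w = w - alpha * dw                            # gradient-descent update
    b = b - alpha * db
    return J, w, b
```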

#### 1.2.4 Computational Graph

• $J(a,b,c)=3(a+bc)$
• $u=bc$
• $v=a+u$
• $J=3v$
• Left-to-right (forward) computation gives $J$; derivatives are computed right to left.
• Derivatives with a Computation Graph

• $\dfrac{dJ}{dv}=3$
• $\dfrac{dJ}{da}=3$
• $\dfrac{dv}{da}=1$
• Chain Rule: $\dfrac{dJ}{da}=\dfrac{dJ}{dv}\cdot\dfrac{dv}{da}$
• $\dfrac{dJ}{du}=3; \dfrac{du}{db}=2; \dfrac{dJ}{db}=6$
• $\dfrac{du}{dc}=3; \dfrac{dJ}{dc}=9$
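
A tiny numeric sketch of the forward pass and the chain-rule backward pass on this graph; the values $a=5, b=3, c=2$ (so $u=6$, $v=11$, $J=33$) are illustrative and match the derivative values listed above.

```python
# Forward pass through the computation graph J(a, b, c) = 3(a + b*c)
a, b, c = 5.0, 3.0, 2.0
u = b * c          # u = bc
v = a + u          # v = a + u
J = 3 * v          # J = 3v

# Backward pass (chain rule, right to left)
dJ_dv = 3.0                 # dJ/dv
dJ_da = dJ_dv * 1.0         # dv/da = 1      -> dJ/da = 3
dJ_du = dJ_dv * 1.0         # dv/du = 1      -> dJ/du = 3
dJ_db = dJ_du * c           # du/db = c = 2  -> dJ/db = 6
dJ_dc = dJ_du * b           # du/dc = b = 3  -> dJ/dc = 9
```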

#### 1.2.5 Vectorization

• avoid explicit for-loops.

• $J=0;\space dw=np.zeros((n_x,1));\space db=0$
• for $i=1$ to $m$:
• $z^{(i)}=w^Tx^{(i)}+b$
• $a^{(i)}=\sigma (z^{(i)})$
• $J+=-[y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})]$
• $dz^{(i)}=a^{(i)}-y^{(i)}$
• $dw+=x^{(i)}dz^{(i)}$
• $db+=dz^{(i)}$
• $J/=m;\space dw/=m;\space db/=m$
• $Z=np.dot(w.T,X)+b$; $b$ is $(1,1)$ and is broadcast across the $(1,m)$ row (broadcasting)
• Vectorizing logistic regression

• $dz^{(1)}=a^{(1)}-y^{(1)};\space dz^{(2)}=a^{(2)}-y^{(2)};\space ...$
• $dZ=[dz^{(1)}, dz^{(2)},...,dz^{(m)}]$, a $1\times m$ row vector
• $A=[a^{(1)}, a^{(2)}, ..., a^{(m)}]$, $Y=[y^{(1)}, y^{(2)}, ..., y^{(m)}]$
• $dZ=A-Y=[a^{(1)}-y^{(1)}, a^{(2)}-y^{(2)}, ...]$
• Get rid of $db$ and $dw$ in the for loop:
• $db=\dfrac{1}{m}\sum\limits_{i=1}^{m}dz^{(i)}=\dfrac{1}{m}\space np.sum(dZ)$
• $dw=\dfrac{1}{m}\cdot X\cdot dZ^T=\dfrac{1}{m}[x^{(1)}\space ...\space x^{(m)}][dz^{(1)}\space ...\space dz^{(m)}]^T=\dfrac{1}{m}[x^{(1)}dz^{(1)}+...+x^{(m)}dz^{(m)}]$, an $n_x\times 1$ vector
• New form (one iteration) of logistic regression, fully vectorized (see the NumPy sketch at the end of this section):
• $Z=w^TX+b=np.dot(w.T, X)+b$
• $A=\sigma (Z)$
• $dZ=A-Y$
• $dw=\dfrac{1}{m}\cdot X \cdot dZ^T$
• $db=\dfrac{1}{m}np.sum(dZ)$
• $w:=w-\alpha\space dw$
• $b:=b-\alpha\space db$
• Broadcasting (analogous to bsxfun in Matlab/Octave)

• $(m,n)$ op $(1,n)$ -> $(m,n)$: the $(1,n)$ row is copied $m$ times (op is $+,-,*,/$)
• $(m,n)$ op $(m,1)$ -> $(m,n)$: the $(m,1)$ column is copied $n$ times
• Don't use $a = np.random.randn(5)$, which gives $a.shape = (5,)$, a "rank 1 array"
• Use $a = np.random.randn(5,1)$ or $a = np.random.randn(1,5)$
• Check with $assert(a.shape == (5,1))$
• Fix a rank 1 array with $a = a.reshape((5,1))$
• Logistic Regression Cost Function

• Loss
• $p(y|x)=\hat{y}^y(1-\hat{y})^{(1-y)}$
• If $y=1$: $p(y|x)=\hat{y}$
• If $y=0$: $p(y|x)=(1-\hat{y})$
• $\log p(y|x)=\log [\hat{y}^y(1-\hat{y})^{(1-y)}]=y\log \hat{y}+(1-y)\log(1-\hat{y})=-L(\hat{y},y)$
• Cost
• $\log p(\text{labels in training set})=\log \prod\limits_{i=1}^{m}p(y^{(i)}|x^{(i)})$
• $\log p(\text{labels in training set})=\sum\limits_{i=1}^m\log p(y^{(i)}|x^{(i)})=-\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})$
• Use maximum likelihood estimation (MLE)
• Cost (to minimize): $J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})$; minimizing the cost maximizes the log-likelihood.
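
A minimal NumPy sketch of the fully vectorized form above; the names (`logistic_regression_step`, the shapes of `X` and `Y`) are illustrative assumptions, with `w` kept as an $(n_x,1)$ column to avoid rank-1 arrays.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def logistic_regression_step(X, Y, w, b, alpha=0.01):
    """One vectorized iteration of logistic regression gradient descent."""
    m = X.shape[1]
    Z = np.dot(w.T, X) + b                 # (1, m); scalar b is broadcast
    A = sigmoid(Z)                         # (1, m) predictions
    J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                             # (1, m)
    dw = np.dot(X, dZ.T) / m               # (n_x, 1)
    db = np.sum(dZ) / m                    # scalar
    w = w - alpha * dw
    b = b - alpha * db
    return J, w, b

# Usage on toy data
X = np.random.randn(3, 100)                # 3 features, 100 examples
Y = (np.random.rand(1, 100) > 0.5).astype(float)
w, b = np.zeros((3, 1)), 0.0
J, w, b = logistic_regression_step(X, Y, w, b)
assert w.shape == (3, 1)                   # not a rank-1 array
```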

### 1.3 Shallow Neural Networks

#### 1.3.1 Neural Network Representation

• Input layer, hidden layer, output layer

• $a^{[0]}=x$ -> $a^{[1]}=[a^{[1]}_1,a^{[1]}_2,a^{[1]}_3,a^{[1]}_4]^T$ -> $a^{[2]}=\hat{y}$
• Layers are counted as # of hidden layers + the output layer (the input layer is layer 0 and is not counted).
• $x_1,x_2,x_3$ -> $4$ hidden nodes -> output layer
• First hidden node: $z^{[1]}_1=w^{[1]T}_1x+b^{[1]}_1,\space a^{[1]}_1=\sigma(z^{[1]}_1)$
• Second hidden node: $z^{[1]}_2=w^{[1]T}_2x+b^{[1]}_2,\space a^{[1]}_2=\sigma(z^{[1]}_2)$
• Third hidden node: $z^{[1]}_3=w^{[1]T}_3x+b^{[1]}_3,\space a^{[1]}_3=\sigma(z^{[1]}_3)$
• Fourth hidden node: $z^{[1]}_4=w^{[1]T}_4x+b^{[1]}_4,\space a^{[1]}_4=\sigma(z^{[1]}_4)$
• Vectorization

• $W^{[1]}=\begin{bmatrix}-w^{[1]T}_1- \\ -w^{[1]T}_2- \\ -w^{[1]T}_3- \\ -w^{[1]T}_4- \end{bmatrix}$, a $(4,3)$ matrix
• $z^{[1]}=\begin{bmatrix}-w^{[1]T}_1- \\ -w^{[1]T}_2- \\ -w^{[1]T}_3- \\ -w^{[1]T}_4- \end{bmatrix}\cdot \begin{bmatrix}x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix}b^{[1]}_1 \\ b^{[1]}_2 \\ b^{[1]}_3 \\ b^{[1]}_4 \end{bmatrix} =\begin{bmatrix}w^{[1]T}_1\cdot x+b^{[1]}_1 \\ w^{[1]T}_2\cdot x+b^{[1]}_2 \\ w^{[1]T}_3\cdot x+b^{[1]}_3 \\ w^{[1]T}_4\cdot x+b^{[1]}_4 \end{bmatrix}=\begin{bmatrix}z^{[1]}_1 \\ z^{[1]}_2 \\ z^{[1]}_3 \\ z^{[1]}_4 \end{bmatrix}$
• $a^{[1]}=\begin{bmatrix}a^{[1]}_1 \\ a^{[1]}_2 \\ a^{[1]}_3 \\ a^{[1]}_4 \end{bmatrix}=\sigma(z^{[1]})$
• $z^{[2]}=W^{[2]}\cdot a^{[1]}+b^{[2]}$, with shapes $(1,1),(1,4),(4,1),(1,1)$
• $a^{[2]}=\sigma(z^{[2]})$, with shapes $(1,1),(1,1)$
• $a^{[2](i)}$: layer $2$, example $i$
• for $i=1$ to $m$:

• $z^{[1](i)}=W^{[1]}\cdot x^{(i)}+b^{[1]}$
• $a^{[1](i)}=\sigma(z^{[1](i)})$
• $z^{[2](i)}=W^{[2]}\cdot a^{[1](i)}+b^{[2]}$
• $a^{[2](i)}=\sigma(z^{[2](i)})$
• Vectorizing of the above for loop

• $X=\begin{bmatrix}| & | & | & | \\ x^{(1)} & x^{(2)} & ... & x^{(m)} \\ | & | & | & |\end{bmatrix}$, an $(n_x,m)$ matrix
• $Z^{[1]}=W^{[1]}\cdot X+b^{[1]}$
• $A^{[1]}=\sigma(Z^{[1]})$
• $Z^{[2]}=W^{[2]}\cdot A^{[1]}+b^{[2]}$
• $A^{[2]}=\sigma(Z^{[2]})$
• In $Z^{[l]}$ and $A^{[l]}$, horizontally the columns index training examples; vertically the rows index hidden units.
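
A small NumPy sketch of this vectorized forward pass for one hidden layer of 4 units with sigmoid activations; the sizes ($n_x=3$, $m=5$) and variable names are illustrative.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

n_x, n_1, m = 3, 4, 5                     # input features, hidden units, examples
X = np.random.randn(n_x, m)               # each column is one training example

W1 = np.random.randn(n_1, n_x) * 0.01     # (4, 3)
b1 = np.zeros((n_1, 1))                   # (4, 1), broadcast across the m columns
W2 = np.random.randn(1, n_1) * 0.01       # (1, 4)
b2 = np.zeros((1, 1))                     # (1, 1)

Z1 = np.dot(W1, X) + b1                   # (4, m)
A1 = sigmoid(Z1)                          # (4, m): rows = hidden units, cols = examples
Z2 = np.dot(W2, A1) + b2                  # (1, m)
A2 = sigmoid(Z2)                          # (1, m) predictions
```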

#### 1.3.2 Activation Functions

• $g^{[i]}$: activation function of layer $i$
• Sigmoid: $a=\dfrac{1}{1+e^{-z}}$
• Tanh: $a=\dfrac{e^z-e^{-z}}{e^z+e^{-z}}$
• ReLU: $a=\max(0,z)$
• Leaky ReLU: $a=\max(0.01z, z)$
• Rules for choosing an activation function
1. If the output is a binary label in $\{0, 1\}$, use sigmoid for the output layer.
2. Otherwise, ReLU is the default choice for hidden layers.
• Why a non-linear activation function is needed
• With linear (identity) activations, stacking hidden layers is pointless: the whole network collapses to a single linear function $a=w'x+b'$.
• A linear activation is sometimes used at the output layer (e.g. for regression), but the hidden layers should still be non-linear.
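
For reference, a minimal NumPy sketch of these four activation functions (the function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))             # output in (0, 1); use for a binary output layer

def tanh(z):
    return np.tanh(z)                        # output in (-1, 1); zero-centered

def relu(z):
    return np.maximum(0, z)                  # default choice for hidden layers

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)          # small slope for z < 0 avoids dead units
```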

#### 1.3.3 Forward and Backward Propagation

• Derivatives of activation functions
• Sigmoid: $g'(z)=\dfrac{d}{dz}g(z)=\dfrac{1}{1+e^{-z}}(1-\dfrac{1}{1+e^{-z}})=g(z)(1-g(z))=a(1-a)$
• Tanh: $g'(z)=\dfrac{d}{dz}g(z)=1-(\tanh(z))^2$
• ReLU: $g'(z)=\begin{cases}0 & \text{if } z<0 \\ 1 & \text{if } z>0 \\ \text{undefined} & \text{if } z=0\text{ (in practice use 0 or 1)}\end{cases}$
• Leaky ReLU: $g'(z)=\begin{cases}0.01 & \text{if } z<0 \\ 1 & \text{if } z\geq0\end{cases}$
• Gradient descent for neural networks
• Parameters: $W^{[1]}\space (n^{[1]},n^{[0]}),\space b^{[1]}\space (n^{[1]},1),\space W^{[2]}\space (n^{[2]},n^{[1]}),\space b^{[2]}\space (n^{[2]},1)$
• $n^{[0]}=n_x$ input features, $n^{[1]}$ hidden units, $n^{[2]}=1$ output unit
• Cost function: $J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]})=\dfrac{1}{m}\sum\limits_{i=1}^mL(\hat{y},y)$
• Forward propagation:
• $Z^{[1]}=W^{[1]}\cdot X+b^{[1]}$
• $A^{[1]}=g^{[1]}(Z^{[1]})$
• $Z^{[2]}=W^{[2]}\cdot A^{[1]}+b^{[2]}$
• $A^{[2]}=g^{[2]}(Z^{[2]})=\sigma(Z^{[2]})$
• Backpropagation:
• $dZ^{[2]}=A^{[2]}-Y$, where $Y=[y^{(1)},y^{(2)},...,y^{(m)}]$
• $dW^{[2]}=\dfrac{1}{m}dZ^{[2]}A^{[1]T}$
• $db^{[2]}=\dfrac{1}{m}np.sum(dZ^{[2]},axis=1,keepdims=True)$
• $dZ^{[1]}=W^{[2]T}dZ^{[2]}*g'^{[1]}(Z^{[1]})$
• $(n^{[1]},m)$ -> element-wise product -> $(n^{[1]},m)$
• $dW^{[1]}=\dfrac{1}{m}dZ^{[1]}X^{T}$
• $db^{[1]}=\dfrac{1}{m}np.sum(dZ^{[1]},axis=1,keepdims=True)$
• Random Initialization
• $x_1,x_2$ -> $a_1^{[1]},a_2^{[1]}$ -> $a_1^{[2]}$ -> $\hat{y}$
• $W^{[1]}=np.random.randn(2,2)*0.01$ (random, small values; initializing $W$ to zeros would make all hidden units compute the same function, so symmetry is never broken)
• $b^{[1]}=np.zeros((2,1))$
• $W^{[2]}=np.random.randn(1,2)*0.01$
• $b^{[2]}=0$
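
Putting the pieces together, a compact NumPy sketch of one training iteration (forward pass, backward pass, parameter update) for this one-hidden-layer network, assuming tanh hidden units and a sigmoid output; the dictionary layout and sizes are illustrative.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def train_step(X, Y, params, alpha=0.1):
    """One gradient-descent iteration for a 1-hidden-layer network (tanh + sigmoid)."""
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]
    m = X.shape[1]

    # Forward propagation
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)

    # Backward propagation
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)       # g'(z) = 1 - tanh(z)^2
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient-descent update
    params["W1"] = W1 - alpha * dW1
    params["b1"] = b1 - alpha * db1
    params["W2"] = W2 - alpha * dW2
    params["b2"] = b2 - alpha * db2
    return params

# Random initialization for n_x=2 inputs, 2 hidden units, 1 output
params = {"W1": np.random.randn(2, 2) * 0.01, "b1": np.zeros((2, 1)),
          "W2": np.random.randn(1, 2) * 0.01, "b2": np.zeros((1, 1))}
```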

### 1.4 Deep Neural Networks

#### 1.4.1 Deep L-Layer Neural Network

• Deep neural network notation
• $L=4$ (# of layers)
• $n^{[l]}=$ # of units in layer $l$
• $n^{[1]}=5,\space n^{[2]}=5,\space n^{[3]}=3,\space n^{[4]}=n^{[L]}=1$
• $n^{[0]}=n_x=3$
• $a^{[l]}=$ activations in layer $l$
• $a^{[l]}=g^{[l]}(z^{[l]})$; $w^{[l]}=$ weights for $z^{[l]}$; $b^{[l]}=$ bias for $z^{[l]}$
• $x=a^{[0]}$, $\hat{y}=a^{[L]}$

#### 1.4.2 Forward Propagation in a Deep Network

• General rule: $Z^{[l]}=W^{[l]}A^{[l-1]}+b^{[l]},\space A^{[l]}=g^{[l]}(Z^{[l]})$
• Per example: $z^{[1]}=w^{[1]}a^{[0]}+b^{[1]},\space a^{[1]}=g^{[1]}(z^{[1]})$, with $a^{[0]}=x$
• $z^{[2]}=w^{[2]}a^{[1]}+b^{[2]},\space a^{[2]}=g^{[2]}(z^{[2]})$
• ... up to $z^{[L]}=w^{[L]}a^{[L-1]}+b^{[L]},\space a^{[L]}=g^{[L]}(z^{[L]})=\hat{y}$
• Vectorizing over the $m$ examples (see the sketch at the end of this section):
• $Z^{[1]}=W^{[1]}A^{[0]}+b^{[1]},\space A^{[1]}=g^{[1]}(Z^{[1]})$, with $A^{[0]}=X$
• $Z^{[2]}=W^{[2]}A^{[1]}+b^{[2]},\space A^{[2]}=g^{[2]}(Z^{[2]})$
• $\hat{Y}=g^{[L]}(Z^{[L]})=A^{[L]}$
• Matrix dimensions
• $z^{[1]}=w^{[1]}\cdot x+b^{[1]}$
• Example: $z^{[1]}=(3,1),\space w^{[1]}=(3,2),\space x=(2,1),\space b^{[1]}=(3,1)$
• In general: $z^{[1]}=(n^{[1]},1),\space w^{[1]}=(n^{[1]},n^{[0]}),\space x=(n^{[0]},1),\space b^{[1]}=(n^{[1]},1)$
• $w^{[l]}/dw^{[l]}=(n^{[l]},n^{[l-1]}),\space b^{[l]}/db^{[l]}=(n^{[l]},1)$
• $z^{[l]},a^{[l]}=(n^{[l]},1)$; vectorized: $Z^{[l]}/dZ^{[l]},A^{[l]}/dA^{[l]}=(n^{[l]},m)$; for $l=0$, $A^{[0]}=X=(n^{[0]},m)$
• Why deep representation?
• Earlier layers learn simple features (e.g. edges); deeper layers compose them to detect more complex things (e.g. faces).
• Circuit theory and deep learning: Informally: There are functions you can compute with a “small” L-layer deep neural network that shallower networks require exponentially more hidden units to compute.
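
A hedged sketch of the vectorized forward pass over all $L$ layers, caching each $Z^{[l]}$ for the backward pass; the parameter-dictionary keys (`"W1"`, `"b1"`, ...) and the ReLU-hidden / sigmoid-output choice are illustrative assumptions.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def relu(Z):
    return np.maximum(0, Z)

def forward_propagation(X, params, L):
    """Forward pass through an L-layer network: ReLU hidden layers, sigmoid output."""
    A = X                                    # A^[0] = X, shape (n_x, m)
    caches = []
    for l in range(1, L + 1):
        W, b = params["W" + str(l)], params["b" + str(l)]
        Z = np.dot(W, A) + b                 # Z^[l] = W^[l] A^[l-1] + b^[l]
        A = sigmoid(Z) if l == L else relu(Z)
        caches.append((Z, A))                # cache Z^[l] (and A^[l]) for backprop
    return A, caches                         # A is A^[L] = Y_hat
```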

#### 1.4.3 Building Blocks of Deep Neural Networks

• Forward and backward functions
• Layer $l$: parameters $w^{[l]},b^{[l]}$
• Forward: Input $a^{[l-1]}$, output $a^{[l]}$
• $z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]}$ (cache $z^{[l]}$)
• $a^{[l]}=g^{[l]}(z^{[l]})$
• Backward: Input $da^{[l]}, cache(z^{[l]})$, output $da^{[l-1]},dw^{[l]},db^{[l]}$
• One iteration of gradient descent of neural network
• How to implement?
• Forward propagation for layer $l$
• Input $a^{[l-1]}$, output $a^{[l]},cache\space (z^{[l]})$
• $z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]}$
• $a^{[l]}=g^{[l]}(z^{[l]})$

• Vectorized:
• $Z^{[l]}=W^{[l]}A^{[l-1]}+b^{[l]}$
• $A^{[l]}=g^{[l]}(Z^{[l]})$
• Backward propagation for layer $l$
• Input $da^{[l]}, cache(z^{[l]})$, output $da^{[l-1]},dw^{[l]},db^{[l]}$
• $dz^{[l]}=da^{[l]}*g'^{[l]}(z^{[l]})$
• $dw^{[l]}=dz^{[l]}\cdot a^{[l-1]T}$
• $db^{[l]}=dz^{[l]}$
• $da^{[l-1]}=w^{[l]T}\cdot dz^{[l]}$
• Vectorized:
• $dZ^{[l]}=dA^{[l]}*g'^{[l]}(Z^{[l]})$
• $dW^{[l]}=\dfrac{1}{m}dZ^{[l]}A^{[l-1]T}$
• $db^{[l]}=\dfrac{1}{m}np.sum(dZ^{[l]},axis=1,keepdims=True)$
• $dA^{[l-1]}=W^{[l]T}\cdot dZ^{[l]}$
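
A minimal sketch of one layer's backward building block, matching the vectorized formulas above and assuming ReLU hidden activations; the cache layout `(A_prev, W, Z)` is an illustrative assumption.

```python
import numpy as np

def relu_backward(dA, Z):
    """dZ = dA * g'(Z) for ReLU: g'(z) = 1 if z > 0 else 0."""
    return dA * (Z > 0)

def linear_activation_backward(dA, cache):
    """Backward building block for one ReLU layer.

    cache = (A_prev, W, Z) stored during the forward pass.
    Returns dA_prev, dW, db following the vectorized formulas above.
    """
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    dZ = relu_backward(dA, Z)                      # dZ^[l] = dA^[l] * g'^[l](Z^[l])
    dW = np.dot(dZ, A_prev.T) / m                  # dW^[l] = (1/m) dZ^[l] A^[l-1]T
    db = np.sum(dZ, axis=1, keepdims=True) / m     # db^[l]
    dA_prev = np.dot(W.T, dZ)                      # dA^[l-1] = W^[l]T dZ^[l]
    return dA_prev, dW, db
```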

#### 1.4.4 Parameters vs. Hyperparameters

• Parameters: $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]},...$
• Hyperparameters (will affect/control/determine parameters):
• learning rate $\alpha$
• # iterations
• # of hidden units $n^{[1]},n^{[2]},...$
• # of hidden layers
• Choice of activation function
• Later: momentum, mini-batch size, regularization parameters, …

## 2. Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization

### 2.1 Practical Aspects of Deep Learning

#### 2.1.1 Train / Dev / Test sets

• With big data, the dev/test sets may only need 1% of the data or even less.
• Mismatched data: make sure the dev and test sets come from the same distribution.
• Not having a test set might be okay. (Only dev set.)

#### 2.1.2 Bias / Variance

• Assume optimal (Bayes) error $\approx0\%$.
• High bias (underfitting): the model cannot classify the training examples as well as we want.
• Training set error $15\%$, dev set error $16\%$.
• Training set error $15\%$, dev set error $30\%$ (high bias and high variance).
• "Just right": the model classifies the examples about as well as possible.
• Training set error $0.5\%$, dev set error $1\%$.
• High variance (overfitting): the model fits the training set too closely and fails to generalize to the dev set.
• Training set error $1\%$, dev set error $11\%$.
• Training set error $15\%$, dev set error $30\%$ (high bias and high variance).

#### 2.1.3 Basic Recipe for Machine Learning

##### 2.1.3.1 Basic Recipe
• High bias? (training data performance)
• Bigger network
• Train longer
• (NN architecture search)
• High variance? (dev set performance)
• More data
• Regularization
• (NN architecture search)
##### 2.1.3.2 Regularization
• Logistic regression. $\min\limits_{w,b}J(w,b)$
• $w\in\mathbb{R}^{n_x}, b\in\mathbb{R}$
• $\lambda=regularization\space parameter$
• $J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})+\dfrac{\lambda}{2m}||w||_2^2$
• L2 regularization: $||w||_2^2=\sum\limits_{j=1}^{n_x}w_j^2=w^Tw$
• L1 regularization: $\dfrac{\lambda}{2m}\sum\limits_{j=1}^{n_x}|w_j|=\dfrac{\lambda}{2m}||w||_1$
• With L1 regularization, $w$ ends up sparse (many zeros in it), which only helps a little (e.g. slightly smaller models).
• Neural network
• $J(w^{[1]},b^{[1]},...,w^{[L]},b^{[L]})=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})+\dfrac{\lambda}{2m}\sum\limits_{l=1}^{L}||w^{[l]}||_F^2$
• $||w^{[l]}||_F^2=\sum\limits_{i=1}^{n^{[l-1]}}\sum\limits_{j=1}^{n^{[l]}}(w_{ij}^{[l]})^2$, where $w^{[l]}$ has shape $(n^{[l]},n^{[l-1]})$
• Frobenius norm: Square root of square sum of all elements in a matrix.
• $dw^{[l]}=(from\space backprop)+\dfrac{\lambda}{m}w^{[l]}$
• $w^{[l]}:=w^{[l]}-\alpha dw^{[l]}$(keep the same)
• Weight decay
• $w^{[l]}:=w^{[l]}-\alpha[(from\space backprop)+\dfrac{\lambda}{m}w^{[l]}]$
• $=w^{[l]}-\dfrac{\alpha\lambda}{m}w^{[l]}-\alpha(from\space backprop)$
• $=(1-\dfrac{\alpha\lambda}{m})w^{[l]}-\alpha(from\space backprop)$
• How does regularization prevent overfitting: a larger $\lambda$ makes $w^{[l]}$ smaller, which makes $z^{[l]}$ smaller, so the activation stays in its nearly linear region (take tanh as an example). The network then behaves almost linearly and cannot draw highly curved decision boundaries.
• Dropout regularization
• Implementing dropout ("inverted dropout"); see the sketch at the end of this subsection.
• Illustrate with layer $l=3$ and $keep\_prob=0.8$ (each unit has a $0.2$ chance of being dropped/zeroed out).
• $d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep\_prob$ # d3 is a boolean mask with the same shape as a3 (True = keep, False = drop).
• $a3 = np.multiply(a3, d3)$ # equivalently a3 *= d3; zeroes out the dropped neurons.
• $a3 /= keep\_prob$ # inverted dropout: scale up so the expected total activation is the same before and after dropout.
• Why it works: a unit can't rely on any single feature, so it has to spread out (shrink) the weights, similar in effect to L2 regularization.
• First make sure $J$ decreases monotonically over iterations with dropout turned off, then turn dropout on.
• Data augmentation
• Images: random crops, flips, distortions, …
• Early stopping
• Stops training at a mid-size $||w||_F^2$, before the weights grow large.
• Downside: it couples optimizing the cost function with preventing overfitting, so the two cannot be tuned independently.
• Orthogonalization
• Work on one task at a time: either optimize the cost function or address overfitting, not both with the same knob.
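
A minimal sketch of the L2-regularized gradient and an inverted-dropout forward step described above; the function names and the `keep_prob` default are illustrative.

```python
import numpy as np

def l2_regularized_grad(dW_backprop, W, lambd, m):
    """Add the L2 term: dW = (from backprop) + (lambda/m) * W."""
    return dW_backprop + (lambd / m) * W

def dropout_forward(a3, keep_prob=0.8):
    """Inverted dropout applied to the activations of (here) layer 3."""
    d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean keep-mask
    a3 = a3 * d3                      # zero out the dropped units
    a3 = a3 / keep_prob               # scale up so expected activation is unchanged
    return a3, d3                     # keep d3 to apply the same mask in backprop
```
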
##### 2.1.3.3 Setting up your optimization problem
• Normalizing training sets
• $x=\begin{bmatrix}x_1 \\ x_2\end{bmatrix}$
• Subtract mean:
• $\mu=\dfrac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}$
• $x:=x-\mu$
• Normalize variance:
• $\sigma^2=\dfrac{1}{m}\sum\limits_{i=1}^{m}(x^{(i)})^2$ (element-wise square, computed after subtracting the mean)
• $x /= \sigma$ (divide by the standard deviation so each feature has unit variance)
• Use the same $\mu,\sigma^2$ to normalize the test set.
• Why normalize inputs?
• When input features are on very different scales, normalizing them makes the cost function more symmetric, so gradient descent can use a larger learning rate and converge faster.
• Vanishing / exploding gradients (in very deep networks):
• If $w^{[l]}>I$ (even just slightly), activations and gradients grow exponentially with depth (exploding).
• If $w^{[l]}<I$ (even just slightly), activations and gradients shrink exponentially with depth (vanishing).
• Weight initialization (single neuron view)
• Large $n$ (number of input features) --> want smaller $w_i$
• $Var(w_i)=\dfrac{1}{n}$ for sigmoid/tanh; $\dfrac{2}{n}$ for ReLU (the scaling constant could be treated as a hyperparameter, but do not tune it in practice)
• $w^{[l]}=np.random.randn(shapeOfMatrix)*np.sqrt(\dfrac{1}{n^{[l-1]}})$; for ReLU use $np.sqrt(\dfrac{2}{n^{[l-1]}})$ (He initialization)
• Xavier initialization: $\sqrt{\dfrac{1}{n^{[l-1]}}}$; sometimes $\sqrt{\dfrac{2}{n^{[l-1]}+n^{[l]}}}$ is used
• Gradient checking: numerically approximate each derivative with the two-sided difference $\dfrac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}\approx f'(\theta)$ (see the sketch at the end of this section).
• Take $W^{[1]},b^{[1]},...,W^{[L]},b^{[L]}$ and reshape them into a big vector $\theta$.
• Take $dW^{[1]},db^{[1]},...,dW^{[L]},db^{[L]}$ and reshape them into a big vector $d\theta$.
• for each i:
• $d\theta_{approx}[i]=\dfrac{J(\theta_1,\theta_2,...,\theta_i+\epsilon,...)-J(\theta_1,\theta_2,...,\theta_i-\epsilon,...)}{2\epsilon}\approx d\theta[i]=\dfrac{\partial J}{\partial \theta_i}$
• Check the relative Euclidean distance $\dfrac{||d\theta_{approx}-d\theta||_2}{||d\theta_{approx}||_2+||d\theta||_2}$ ($||.||_2$ is the Euclidean norm: the square root of the sum of the squared elements).
• Take $\epsilon=10^{-7}$; if the distance above is $\approx10^{-7}$ or smaller, great.
• If it is around $10^{-5}$, take a careful look.
• If it is $10^{-3}$ or bigger, worry: there may be a bug. Check which components of $d\theta_{approx}$ differ most from the corresponding $d\theta[i]$.
• notes:
• Don’t use in training - only to debug.
• If algorithm fails grad check, look at components to try to identify bug.
• Remember regularization: include the $\dfrac{\lambda}{2m}$ term in $J$ and in the gradients.
• Doesn't work with dropout (dropout is random); run grad check with dropout turned off.
• Run at random initialization, and perhaps again after some training (some bugs only appear once $w,b$ move away from $\approx0$).
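
A self-contained sketch of this gradient check on a toy cost function where the true gradient is known; all names here are illustrative, standing in for the reshaped $\theta$ and $d\theta$ of a real network.

```python
import numpy as np

def J(theta):
    """Toy cost function with a known gradient, standing in for the network's J."""
    return theta[0] ** 2 + 3 * theta[1]

def grad(theta):
    """Analytical gradient of J (in a real network this would come from backprop)."""
    return np.array([2 * theta[0], 3.0])

def gradient_check(J, grad, theta, epsilon=1e-7):
    d_theta = grad(theta)
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon                  # J(theta_1, ..., theta_i + eps, ...)
        theta_minus[i] -= epsilon                 # J(theta_1, ..., theta_i - eps, ...)
        d_theta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    # Relative Euclidean distance between approximate and analytic gradients
    numerator = np.linalg.norm(d_theta_approx - d_theta)
    denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return numerator / denominator                # ~1e-7 is great, >1e-3 suggests a bug

theta = np.array([1.5, -2.0])
print(gradient_check(J, grad, theta))
```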