ZL Asica


Deep Learning Study Notes

These notes are based on the Coursera Deep Learning Specialization: https://www.coursera.org/specializations/deep-learning

LaTeX may not render correctly in some places.

1. Neural Networks and Deep Learning

1.1 Introduction to Deep Learning

1.1.1 Supervised Learning with Deep Learning

  • Structured Data: Tables (rows and columns of features).
  • Unstructured Data: Audio, Image, Text.

1.1.2 Scale drives deep learning progress

  • The more data available, the larger the advantage of a large neural network over a smaller network or a traditional learning algorithm.
  • Switching from sigmoid to ReLU makes gradient descent much faster, because the ReLU gradient does not shrink toward 0 for large positive inputs.

1.2 Basics of Neural Network Programming

1.2.1 Binary Classification

  • Input: $x \in \mathbb{R}^{n_x}$
  • Output: $y \in \{0, 1\}$

1.2.2 Logistic Regression

  • Given $x$, want $\hat{y} = P(y=1 \mid x)$
  • Input: $x \in \mathbb{R}^{n_x}$
  • Parameters: $w \in \mathbb{R}^{n_x}$, $b \in \mathbb{R}$
  • Output: $\hat{y} = \sigma(w^Tx + b)$
    • $\sigma(z)=\dfrac{1}{1+e^{-z}}$
    • If $z$ is large and positive, $\sigma(z)\approx\dfrac{1}{1+0}=1$
    • If $z$ is large and negative, $\sigma(z)\approx\dfrac{1}{1+\text{big number}}\approx0$
  • Loss (error) function:
    • $\hat{y}^{(i)} = \sigma(z^{(i)})$, where $z^{(i)}=w^Tx^{(i)}+b$ and $\sigma(z)=\dfrac{1}{1+e^{-z}}$
    • Want $\hat{y}^{(i)} \approx y^{(i)}$
    • $L(\hat{y}, y) = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]$
      • If $y=1$: $L(\hat{y}, y)=-\log\hat{y}$, so we want $\log\hat{y}$ as large as possible, i.e. $\hat{y}$ large
      • If $y=0$: $L(\hat{y}, y)=-\log(1-\hat{y})$, so we want $\log(1-\hat{y})$ as large as possible, i.e. $\hat{y}$ small
  • Cost function:
    • $J(w, b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})=-\dfrac{1}{m}\sum\limits_{i=1}^{m}[y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})]$
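
The prediction and cost formulas above can be sketched in NumPy as follows. This is only a minimal illustration: the toy `X`, `Y`, and the zero-initialized `w`, `b` are made-up values, not part of the course material.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    """Average cross-entropy loss J(w, b) over the m columns of X."""
    A = sigmoid(w.T @ X + b)  # predictions y_hat, shape (1, m)
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return float(np.mean(losses))

# Hypothetical toy data: n_x = 2 features, m = 3 examples (one per column).
X = np.array([[1.0, -2.0, 0.5],
              [0.0,  1.0, 2.0]])
Y = np.array([[1, 0, 1]])
w = np.zeros((2, 1))
b = 0.0

# With w = 0 and b = 0 every prediction is 0.5, so the cost equals log(2).
J = cost(w, b, X, Y)
```

Starting from all-zero parameters is fine for logistic regression; the cost at that point, $\log 2$, is a handy sanity check.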

1.2.3 Gradient Descent

  • Repeat: $w:=w-\alpha\dfrac{\partial J(w,b)}{\partial w}$; $b:=b-\alpha\dfrac{\partial J(w,b)}{\partial b}$
    • $\alpha$: learning rate
    • For a one-parameter cost $J(w)$: to the right of the minimum $\dfrac{dJ(w)}{dw} > 0$, so the update decreases $w$; to the left of the minimum $\dfrac{dJ(w)}{dw} < 0$, so the update increases $w$
  • Logistic Regression Gradient Descent
    • Variables: $x_1,x_2,w_1,w_2,b$
      • $z=w_1x_1+w_2x_2+b$ --> $a=\sigma(z)$ --> $L(a,y)$
      • $da=\dfrac{dL(a,y)}{da}=-\dfrac{y}{a}+\dfrac{1-y}{1-a}$
        • $\dfrac{dL(a,y)}{da} = \dfrac{d}{da}(-y\log(a) - (1-y)\log(1-a))$
        • $\dfrac{d}{da} (-y\log(a)) = -\dfrac{y}{a}$
        • $\dfrac{d}{da} (-(1-y)\log(1-a)) = -\dfrac{1-y}{1-a} \times (-1) = \dfrac{1-y}{1-a}$
        • Sum: $-\dfrac{y}{a} + \dfrac{1-y}{1-a} = -\dfrac{y}{a} - \dfrac{y-1}{1-a}$
      • $dz=\dfrac{dL(a,y)}{dz}=a-y$
        • $=\dfrac{dL}{da}\cdot\dfrac{da}{dz}$, with $\dfrac{da}{dz}=a(1-a)$
      • $dw_1=\dfrac{dL}{dw_1}=x_1\cdot dz$
      • $dw_2=\dfrac{dL}{dw_2}=x_2\cdot dz$
      • $db=\dfrac{dL}{db}=dz$
  • Gradient Descent on $m$ examples
    • $J(w, b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(a^{(i)},y^{(i)})$
    • $\dfrac{\partial}{\partial w_1}J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}\dfrac{\partial}{\partial w_1}L(a^{(i)},y^{(i)})$
    • Initialize $J=0$; $dw_1=0$; $dw_2=0$; $db=0$
      • for $i=1$ to $m$
        • $z^{(i)}=w^Tx^{(i)}+b$
        • $a^{(i)}=\sigma (z^{(i)})$
        • $J \mathrel{+}= -[y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})]$
        • $dz^{(i)}=a^{(i)}-y^{(i)}$
        • $dw_1 \mathrel{+}= x_1^{(i)}dz^{(i)}$ (for $n_x = 2$)
        • $dw_2 \mathrel{+}= x_2^{(i)}dz^{(i)}$ (for $n_x = 2$)
        • $db \mathrel{+}= dz^{(i)}$
      • $J \mathrel{/}= m$; $dw_1 \mathrel{/}= m$; $dw_2 \mathrel{/}= m$; $db \mathrel{/}= m$
      • Now $dw_1=\dfrac{\partial J}{\partial w_1}$ and $dw_2=\dfrac{\partial J}{\partial w_2}$, so update:
        • $w_1:=w_1-\alpha\, dw_1$
        • $w_2:=w_2-\alpha\, dw_2$
        • $b:=b-\alpha\, db$

1.2.4 Computational Graph

  • $J(a,b,c)=3(a+bc)$
    • $u=bc$
    • $v=a+u$
    • $J=3v$
    • Forward pass: compute from left to right
  • Derivatives with a Computation Graph (computed from right to left)
    • $\dfrac{dJ}{dv}=3$
      • $\dfrac{dv}{da}=1$
      • Chain rule: $\dfrac{dJ}{da}=\dfrac{dJ}{dv}\cdot\dfrac{dv}{da}=3$
      • $\dfrac{dJ}{du}=3$; $\dfrac{du}{db}=c=2$; $\dfrac{dJ}{db}=6$
      • $\dfrac{du}{dc}=b=3$; $\dfrac{dJ}{dc}=9$
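
A small sketch of the same graph in plain Python, forward pass first and then the chain rule from right to left. The values $b=3$ and $c=2$ are the ones implied by the derivatives above; $a=5$ is an arbitrary assumption, since none of the derivatives depend on it.

```python
# Assumed inputs: b = 3 and c = 2 match du/db = 2 and du/dc = 3; a = 5 is arbitrary.
a, b, c = 5.0, 3.0, 2.0

# Forward pass (left to right).
u = b * c      # u = bc
v = a + u      # v = a + u
J = 3 * v      # J = 3v

# Backward pass (right to left), applying the chain rule at each node.
dJ_dv = 3.0            # J = 3v
dJ_da = dJ_dv * 1.0    # dv/da = 1
dJ_du = dJ_dv * 1.0    # dv/du = 1
dJ_db = dJ_du * c      # du/db = c
dJ_dc = dJ_du * b      # du/dc = b
```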

1.2.5 Vectorization

  • Avoid explicit for-loops.
  • Loop version with $dw$ as a vector: `J = 0; dw = np.zeros((n_x, 1)); db = 0`
    • for $i=1$ to $m$
      • $z^{(i)}=w^Tx^{(i)}+b$
      • $a^{(i)}=\sigma (z^{(i)})$
      • $J \mathrel{+}= -[y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})]$
      • $dz^{(i)}=a^{(i)}-y^{(i)}$
      • $dw \mathrel{+}= x^{(i)}dz^{(i)}$
      • $db \mathrel{+}= dz^{(i)}$
    • $J \mathrel{/}= m$; $dw \mathrel{/}= m$; $db \mathrel{/}= m$
  • Vectorized forward step: `Z = np.dot(w.T, X) + b`; the scalar $b$, shape $(1,1)$, is expanded to $(1,m)$ by broadcasting
  • Vectorizing Logistic Regression
    • $dz^{(1)}=a^{(1)}-y^{(1)}$; $dz^{(2)}=a^{(2)}-y^{(2)}$; ...
    • $dZ=[dz^{(1)}, dz^{(2)},...,dz^{(m)}]$, shape $1\times m$
    • $A=[a^{(1)}, a^{(2)}, ..., a^{(m)}]$, $Y=[y^{(1)}, y^{(2)}, ..., y^{(m)}]$
    • $dZ=A-Y=[a^{(1)}-y^{(1)}, a^{(2)}-y^{(2)}, ...]$
    • Get rid of $db$ and $dw$ in the for loop
      • $db=\dfrac{1}{m}\sum\limits_{i=1}^{m}dz^{(i)}$, i.e. `db = np.sum(dZ) / m`
      • $dw=\dfrac{1}{m}X\,dZ^T=\dfrac{1}{m}\begin{bmatrix}x^{(1)} & \cdots & x^{(m)}\end{bmatrix}\begin{bmatrix}dz^{(1)} \\ \vdots \\ dz^{(m)}\end{bmatrix}=\dfrac{1}{m}[x^{(1)}dz^{(1)}+...+x^{(m)}dz^{(m)}]$, shape $n_x\times 1$
    • New form of logistic regression (one iteration)
      • $Z=w^TX+b$, i.e. `Z = np.dot(w.T, X) + b`
      • $A=\sigma (Z)$
      • $dZ=A-Y$
      • $dw=\dfrac{1}{m}X\,dZ^T$
      • $db=\dfrac{1}{m}\sum\limits_{i=1}^{m}dz^{(i)}$, i.e. `db = np.sum(dZ) / m`
      • $w:=w-\alpha\, dw$
      • $b:=b-\alpha\, db$
  • Broadcasting (same as bsxfun in Matlab/Octave)
    • $(m,n)$ [+ - * /] $(1,n)$ -> $(m,n)$: the $(1,n)$ row is copied $m$ times.
    • $(m,n)$ [+ - * /] $(m,1)$ -> $(m,n)$: the $(m,1)$ column is copied $n$ times.
    • Don’t use `a = np.random.randn(5)`: `a.shape` is `(5,)`, a “rank 1 array”
    • Use `a = np.random.randn(5, 1)` or `a = np.random.randn(1, 5)`
    • Check with `assert a.shape == (5, 1)`
    • Fix a rank 1 array with `a = a.reshape((5, 1))`
  • Logistic Regression Cost Function
    • Loss
      • $p(y|x)=\hat{y}^y(1-\hat{y})^{(1-y)}$
      • If $y=1$: $p(y|x)=\hat{y}$
      • If $y=0$: $p(y|x)=1-\hat{y}$
      • $\log p(y|x)=\log\left[\hat{y}^y(1-\hat{y})^{(1-y)}\right]=y\log \hat{y}+(1-y)\log(1-\hat{y})=-L(\hat{y},y)$
    • Cost
      • $\log p(\text{labels in training set})=\log \prod\limits_{i=1}^{m}p(y^{(i)}|x^{(i)})$
      • $\log p(\text{labels in training set})=\sum\limits_{i=1}^m\log p(y^{(i)}|x^{(i)})=-\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})$
      • Use maximum likelihood estimation (MLE): maximizing this log-likelihood is the same as minimizing the summed loss
      • Cost (minimize): $J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})$
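
The vectorized iteration listed above ("New form of logistic regression") can be sketched as one NumPy function with no loop over examples. The synthetic data, the learning rate `alpha=0.5`, and the 100 iterations are assumptions chosen only to make the example run.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, Y, alpha):
    """One vectorized gradient-descent iteration for logistic regression."""
    m = X.shape[1]
    Z = np.dot(w.T, X) + b           # (1, m); scalar b is broadcast
    A = sigmoid(Z)                   # (1, m)
    J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    dZ = A - Y                       # (1, m)
    dw = np.dot(X, dZ.T) / m         # (n_x, 1)
    db = np.sum(dZ) / m              # scalar
    return w - alpha * dw, b - alpha * db, J

# Hypothetical synthetic data: label is 1 when x_1 + x_2 > 0.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)

w, b = np.zeros((2, 1)), 0.0
costs = []
for _ in range(100):
    w, b, J = gradient_step(w, b, X, Y, alpha=0.5)
    costs.append(J)
```

Note that `w` is created with shape `(2, 1)` rather than `(2,)`, following the advice above about rank 1 arrays.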

1.3 Shallow Neural Networks

1.3.1 Neural Network Representation

  • [Figure: deep-learning-notes_1-3-1]
  • Input layer, hidden layer, output layer
    • $a^{[0]}=x$ -> $a^{[1]}=[a^{[1]}_1,a^{[1]}_2,a^{[1]}_3,a^{[1]}_4]^T$ -> $a^{[2]}$
    • The layer count is the number of hidden layers plus the output layer (the input layer is not counted).
  • $x_1,x_2,x_3$ -> 4 hidden nodes -> output layer
    • First hidden node: $z^{[1]}_1=w^{[1]T}_1x+b^{[1]}_1$, $a^{[1]}_1=\sigma(z^{[1]}_1)$
    • Second hidden node: $z^{[1]}_2=w^{[1]T}_2x+b^{[1]}_2$, $a^{[1]}_2=\sigma(z^{[1]}_2)$
    • Third hidden node: $z^{[1]}_3=w^{[1]T}_3x+b^{[1]}_3$, $a^{[1]}_3=\sigma(z^{[1]}_3)$
    • Fourth hidden node: $z^{[1]}_4=w^{[1]T}_4x+b^{[1]}_4$, $a^{[1]}_4=\sigma(z^{[1]}_4)$
  • Vectorization
    • $W^{[1]}=\begin{bmatrix}- w^{[1]T}_1 - \\ - w^{[1]T}_2 - \\ - w^{[1]T}_3 - \\ - w^{[1]T}_4 - \end{bmatrix}$, a $(4,3)$ matrix
    • $z^{[1]}=\begin{bmatrix}- w^{[1]T}_1 - \\ - w^{[1]T}_2 - \\ - w^{[1]T}_3 - \\ - w^{[1]T}_4 - \end{bmatrix}\cdot\begin{bmatrix}x_1 \\ x_2 \\ x_3 \end{bmatrix}+\begin{bmatrix}b^{[1]}_1 \\ b^{[1]}_2 \\ b^{[1]}_3 \\ b^{[1]}_4 \end{bmatrix}=\begin{bmatrix}w^{[1]T}_1 x+b^{[1]}_1 \\ w^{[1]T}_2 x+b^{[1]}_2 \\ w^{[1]T}_3 x+b^{[1]}_3 \\ w^{[1]T}_4 x+b^{[1]}_4 \end{bmatrix}=\begin{bmatrix}z^{[1]}_1 \\ z^{[1]}_2 \\ z^{[1]}_3 \\ z^{[1]}_4 \end{bmatrix}$
    • $a^{[1]}=\begin{bmatrix}a^{[1]}_1 \\ a^{[1]}_2 \\ a^{[1]}_3 \\ a^{[1]}_4 \end{bmatrix}=\sigma(z^{[1]})$
    • $z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}$, with shapes $(1,1)=(1,4)\cdot(4,1)+(1,1)$
    • $a^{[2]}=\sigma(z^{[2]})$, with shapes $(1,1)$ and $(1,1)$
    • $a^{[2](i)}$: layer $2$; example $i$
  • for i = 1 to m:
    • $z^{[1](i)}=W^{[1]}x^{(i)}+b^{[1]}$
    • $a^{[1](i)}=\sigma(z^{[1](i)})$
    • $z^{[2](i)}=W^{[2]}a^{[1](i)}+b^{[2]}$
    • $a^{[2](i)}=\sigma(z^{[2](i)})$
  • Vectorizing the above for loop
    • $X=\begin{bmatrix}| & | & & | \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ | & | & & |\end{bmatrix}$, an $(n_x,m)$ matrix
    • $Z^{[1]}=W^{[1]}X+b^{[1]}$
    • $A^{[1]}=\sigma(Z^{[1]})$
    • $Z^{[2]}=W^{[2]}A^{[1]}+b^{[2]}$
    • $A^{[2]}=\sigma(Z^{[2]})$
    • In $Z^{[l]}$ and $A^{[l]}$: horizontally, training examples; vertically, hidden units

1.3.2 Activation Functions

  • $g^{[i]}$: activation function of layer $i$
    • Sigmoid: $a=\dfrac{1}{1+e^{-z}}$
    • Tanh: $a=\dfrac{e^z-e^{-z}}{e^z+e^{-z}}$
    • ReLU: $a=\max(0,z)$
    • Leaky ReLU: $a=\max(0.01z, z)$
  • Rules for choosing an activation function
    1. If the output is binary ($y \in \{0, 1\}$), use sigmoid for the output layer.
    2. Otherwise, default to ReLU.
  • Why a non-linear activation function is needed
    • With linear activations in the hidden layers, stacking multiple layers is useless: the whole network collapses to $a=w'x+b'$.
    • A linear activation is sometimes used at the output layer, but with non-linear activations in the hidden layers.
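
A short NumPy sketch of the four activations, followed by a numerical check of the collapse claim: two layers with a linear (identity) activation equal a single layer $w'x+b'$. The layer sizes and random values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

z = np.array([-2.0, 0.0, 3.0])
acts = {
    "sigmoid": sigmoid(z),
    "tanh": np.tanh(z),
    "relu": relu(z),
    "leaky_relu": leaky_relu(z),
}

# Two layers with identity activation (hypothetical sizes 3 -> 4 -> 1).
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))
x = rng.standard_normal((3, 1))

two_layers = W2 @ (W1 @ x + b1) + b2

# The same function written as one linear layer: w' = W2 W1, b' = W2 b1 + b2.
W_eff, b_eff = W2 @ W1, W2 @ b1 + b2
one_layer = W_eff @ x + b_eff
```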

1.3.3 Forward and Backward Propagation

  • Derivative of activation function
    • Sigmoid: $g'(z)=\dfrac{d}{dz}g(z)=\dfrac{1}{1+e^{-z}}\left(1-\dfrac{1}{1+e^{-z}}\right)=g(z)(1-g(z))=a(1-a)$
    • Tanh: $g'(z)=\dfrac{d}{dz}g(z)=1-(\tanh(z))^2$
    • ReLU: $g'(z)=\left\{\begin{array}{ll}0 & \text{if } z<0 \\ 1 & \text{if } z>0 \\ \text{undefined} & \text{if } z=0\end{array}\right.$ (in practice $g'(0)$ is set to 0 or 1)
    • Leaky ReLU: $g'(z)=\left\{\begin{array}{ll}0.01 & \text{if } z<0 \\ 1 & \text{if } z\geq0\end{array}\right.$
  • Gradient descent for neural networks
    • Parameters: $W^{[1]}$ with shape $(n^{[1]},n^{[0]})$, $b^{[1]}$ with shape $(n^{[1]},1)$, $W^{[2]}$ with shape $(n^{[2]},n^{[1]})$, $b^{[2]}$ with shape $(n^{[2]},1)$
    • Layer sizes: $n_x=n^{[0]}$, $n^{[1]}$, $n^{[2]}=1$
    • Cost function: $J(W^{[1]}, b^{[1]},W^{[2]}, b^{[2]})=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})$
  • Forward propagation:
    • $Z^{[1]}=W^{[1]}X+b^{[1]}$
    • $A^{[1]}=g^{[1]}(Z^{[1]})$
    • $Z^{[2]}=W^{[2]}A^{[1]}+b^{[2]}$
    • $A^{[2]}=g^{[2]}(Z^{[2]})=\sigma(Z^{[2]})$
  • Backpropagation:
    • $dZ^{[2]}=A^{[2]}-Y$, where $Y=[y^{(1)},y^{(2)},...,y^{(m)}]$
    • $dW^{[2]}=\dfrac{1}{m}dZ^{[2]}A^{[1]T}$
    • $db^{[2]}=\dfrac{1}{m}\,\text{np.sum}(dZ^{[2]},\ \text{axis}=1,\ \text{keepdims}=\text{True})$
    • $dZ^{[1]}=W^{[2]T}dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$
      • $(n^{[1]},m)$ -> element-wise product -> $(n^{[1]},m)$
    • $dW^{[1]}=\dfrac{1}{m}dZ^{[1]}X^{T}$
    • $db^{[1]}=\dfrac{1}{m}\,\text{np.sum}(dZ^{[1]},\ \text{axis}=1,\ \text{keepdims}=\text{True})$
  • Random Initialization (if all weights start at zero, every hidden unit computes the same function)
    • $x_1,x_2$ -> $a_1^{[1]},a_2^{[1]}$ -> $a_1^{[2]}$ -> $\hat{y}$
    • `W1 = np.random.randn(2, 2) * 0.01`
    • `b1 = np.zeros((2, 1))`
    • `W2 = np.random.randn(1, 2) * 0.01`
    • `b2 = 0`
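
The forward and backward equations of this section, sketched for a two-layer network. Using tanh as $g^{[1]}$ is an assumption (the notes leave $g^{[1]}$ generic), and the sizes and random data are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_x, n_h, n_y, rng):
    """Small random weights, zero biases."""
    return {
        "W1": rng.standard_normal((n_h, n_x)) * 0.01,
        "b1": np.zeros((n_h, 1)),
        "W2": rng.standard_normal((n_y, n_h)) * 0.01,
        "b2": np.zeros((n_y, 1)),
    }

def forward(X, p):
    Z1 = p["W1"] @ X + p["b1"]
    A1 = np.tanh(Z1)                  # assumed hidden activation g1 = tanh
    Z2 = p["W2"] @ A1 + p["b2"]
    A2 = sigmoid(Z2)                  # output activation g2 = sigmoid
    return A1, A2

def backward(X, Y, p, A1, A2):
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (p["W2"].T @ dZ2) * (1 - A1 ** 2)   # tanh'(z) = 1 - a^2
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

# Hypothetical data: n_x = 2 inputs, 4 hidden units, m = 5 examples.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))
Y = np.array([[0.0, 1.0, 1.0, 0.0, 1.0]])
params = init_params(2, 4, 1, rng)

A1, A2 = forward(X, params)
grads = backward(X, Y, params, A1, A2)

# One gradient-descent update with an assumed learning rate.
alpha = 0.1
updated = {k: params[k] - alpha * grads[k] for k in params}
```

Every gradient has the same shape as the parameter it updates, which is a quick check worth running after any change to the backward pass.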

1.4 Deep Neural Networks

1.4.1 Deep L-Layer Neural Network

  • Deep neural network notation
    • [Figure: deep-learning-notes_1-4-1]
    • $L=4$ (number of layers)
    • $n^{[l]}$ = number of units in layer $l$
      • $n^{[1]}=5,n^{[2]}=5,n^{[3]}=3,n^{[4]}=n^{[L]}=1$
      • $n^{[0]}=n_x=3$
    • $a^{[l]}$ = activations in layer $l$
    • $a^{[l]}=g^{[l]}(z^{[l]})$, $W^{[l]}$ = weights for $z^{[l]}$, $b^{[l]}$ = bias for $z^{[l]}$
    • $x=a^{[0]}$, $\hat{y}=a^{[L]}$

1.4.2 Forward Propagation in a Deep Network

  • General: $Z^{[l]}=W^{[l]}A^{[l-1]}+b^{[l]}$, $A^{[l]}=g^{[l]}(Z^{[l]})$
    • Single example $x$: $z^{[1]}=W^{[1]}a^{[0]}+b^{[1]}$, $a^{[1]}=g^{[1]}(z^{[1]})$, with $a^{[0]}=x$
    • $z^{[2]}=W^{[2]}a^{[1]}+b^{[2]}$, $a^{[2]}=g^{[2]}(z^{[2]})$
    • ...
    • $z^{[4]}=W^{[4]}a^{[3]}+b^{[4]}$, $a^{[4]}=g^{[4]}(z^{[4]})=\hat{y}$
  • Vectorizing:
    • $Z^{[1]}=W^{[1]}A^{[0]}+b^{[1]}$, $A^{[1]}=g^{[1]}(Z^{[1]})$, with $A^{[0]}=X$
    • $Z^{[2]}=W^{[2]}A^{[1]}+b^{[2]}$, $A^{[2]}=g^{[2]}(Z^{[2]})$
    • ...
    • $\hat{Y}=g^{[4]}(Z^{[4]})=A^{[4]}$
  • Matrix dimensions
    • [Figure: deep-learning-notes_1-4-2]
    • $z^{[1]}=W^{[1]}x+b^{[1]}$
    • Example: $z^{[1]}$ is $(3,1)$, $W^{[1]}$ is $(3,2)$, $x$ is $(2,1)$, $b^{[1]}$ is $(3,1)$
    • In general: $z^{[1]}$ is $(n^{[1]},1)$, $W^{[1]}$ is $(n^{[1]},n^{[0]})$, $x$ is $(n^{[0]},1)$, $b^{[1]}$ is $(n^{[1]},1)$
    • $W^{[l]}$ and $dW^{[l]}$: $(n^{[l]},n^{[l-1]})$; $b^{[l]}$ and $db^{[l]}$: $(n^{[l]},1)$
    • Single example: $z^{[l]},a^{[l]}$ are $(n^{[l]},1)$. Vectorized: $Z^{[l]},dZ^{[l]},A^{[l]},dA^{[l]}$ are $(n^{[l]},m)$; for $l=0$, $A^{[0]}=X$ is $(n^{[0]},m)$
  • Why deep representation?
    • Earlier layers learn simple features; later, deeper layers combine them to detect more complex things.
    • Circuit theory and deep learning: Informally: There are functions you can compute with a “small” L-layer deep neural network that shallower networks require exponentially more hidden units to compute.
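
A sketch of vectorized forward propagation through the $L=4$ network used in the notation above ($n^{[0]}=3$, then 5, 5, 3, 1 units), recording the shape of each $A^{[l]}$. ReLU in the hidden layers, sigmoid at the output, and $m=7$ examples are assumptions for this example.

```python
import numpy as np

layer_dims = [3, 5, 5, 3, 1]          # n[0], n[1], ..., n[4]
rng = np.random.default_rng(1)

# W[l] has shape (n[l], n[l-1]); b[l] has shape (n[l], 1).
params = {}
for l in range(1, len(layer_dims)):
    params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
    params[f"b{l}"] = np.zeros((layer_dims[l], 1))

def forward(X, params, L):
    """Compute A[L] and record the shape of every A[l]."""
    A = X
    shapes = []
    for l in range(1, L + 1):
        Z = params[f"W{l}"] @ A + params[f"b{l}"]   # b is broadcast over m columns
        if l < L:
            A = np.maximum(0.0, Z)                  # assumed hidden activation: ReLU
        else:
            A = 1.0 / (1.0 + np.exp(-Z))            # assumed output activation: sigmoid
        shapes.append(A.shape)
    return A, shapes

m = 7
X = rng.standard_normal((layer_dims[0], m))
Y_hat, shapes = forward(X, params, L=4)
```

The explicit loop over layers is fine here; only the loop over training examples is replaced by vectorization.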

1.4.3 Building Blocks of Deep Neural Networks

  • Forward and backward functions
    • [Figure: deep-learning-notes_1-4-3]
    • Layer $l$: $W^{[l]},b^{[l]}$
    • Forward: input $a^{[l-1]}$, output $a^{[l]}$
      • $z^{[l]}=W^{[l]}a^{[l-1]}+b^{[l]}$ (cache $z^{[l]}$)
      • $a^{[l]}=g^{[l]}(z^{[l]})$
    • Backward: input $da^{[l]}$ and cache ($z^{[l]}$), output $da^{[l-1]},dW^{[l]},db^{[l]}$
  • One iteration of gradient descent for a neural network
    • [Figure: deep-learning-notes_1-4-3-2]
  • How to implement?
    • Forward propagation for layer $l$
      • Input $a^{[l-1]}$, output $a^{[l]}$ and cache ($z^{[l]}$)
      • $z^{[l]}=W^{[l]}a^{[l-1]}+b^{[l]}$
      • $a^{[l]}=g^{[l]}(z^{[l]})$
      • Vectorized:
        • $Z^{[l]}=W^{[l]}A^{[l-1]}+b^{[l]}$
        • $A^{[l]}=g^{[l]}(Z^{[l]})$
    • Backward propagation for layer $l$
      • Input $da^{[l]}$ and cache ($z^{[l]}$), output $da^{[l-1]},dW^{[l]},db^{[l]}$
        • $dz^{[l]}=da^{[l]} * g^{[l]\prime}(z^{[l]})$
        • $dW^{[l]}=dz^{[l]}\cdot a^{[l-1]T}$
        • $db^{[l]}=dz^{[l]}$
        • $da^{[l-1]}=W^{[l]T}\cdot dz^{[l]}$
      • Vectorized:
        • $dZ^{[l]}=dA^{[l]} * g^{[l]\prime}(Z^{[l]})$
        • $dW^{[l]}=\dfrac{1}{m}dZ^{[l]}A^{[l-1]T}$
        • $db^{[l]}=\dfrac{1}{m}\,\text{np.sum}(dZ^{[l]},\ \text{axis}=1,\ \text{keepdims}=\text{True})$
        • $dA^{[l-1]}=W^{[l]T}\cdot dZ^{[l]}$
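
The forward and backward building blocks for a single layer, sketched with ReLU as $g^{[l]}$ (an assumption). The notes cache only $z^{[l]}$; this sketch also stores $A^{[l-1]}$ and $W^{[l]}$ in the cache because the vectorized backward formulas need them. The function names `layer_forward` and `layer_backward` are hypothetical.

```python
import numpy as np

def layer_forward(A_prev, W, b):
    """Forward step for one layer: returns A[l] and a cache for backprop."""
    Z = W @ A_prev + b
    A = np.maximum(0.0, Z)            # assumed activation: ReLU
    cache = (A_prev, W, Z)            # Z as in the notes, plus A_prev and W
    return A, cache

def layer_backward(dA, cache):
    """Backward step for one layer: returns dA[l-1], dW[l], db[l]."""
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    dZ = dA * (Z > 0)                 # ReLU'(z): 1 where z > 0, else 0
    dW = dZ @ A_prev.T / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

# Hypothetical layer: 3 inputs, 4 units, m = 6 examples.
rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 6))
W = rng.standard_normal((4, 3)) * 0.01
b = np.zeros((4, 1))

A, cache = layer_forward(A_prev, W, b)
dA = rng.standard_normal(A.shape)     # stand-in for the gradient from layer l+1
dA_prev, dW, db = layer_backward(dA, cache)
```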

1.4.4 Parameters vs. Hyperparameters

  • Parameters: $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]},...$
  • Hyperparameters (will affect/control/determine parameters):
    • learning rate $\alpha$
    • # iterations
    • # of hidden units $n^{[1]},n^{[2]},...$
    • # of hidden layers
    • Choice of activation function
  • Later: momentum, mini-batch size, regularization parameters,…

2. Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization

2.1 Practical Aspects of Deep Learning

2.1.1 Train / Dev / Test sets

  • With very large datasets, the dev and test sets may need only 1% of the data, or even less.
  • Mismatched train/test distributions: make sure the dev and test sets come from the same distribution.
  • Not having a test set might be okay (only a dev set).

2.1.2 Bias / Variance



  • Assume optimal (Bayes) error: $\approx0\%$
  • High bias (underfitting): the model cannot fit even the training set well.
    • Training set error $15\%$, dev set error $16\%$.
    • Training set error $15\%$, dev set error $30\%$ (high bias and high variance).
  • “Just right”: low error on both the training set and the dev set.
    • Training set error $0.5\%$, dev set error $1\%$.
  • High variance (overfitting): the model fits the training set almost perfectly but generalizes poorly to the dev set.
    • Training set error $1\%$, dev set error $11\%$.
    • Training set error $15\%$, dev set error $30\%$ (high bias and high variance).

2.1.3 Basic Recipe for Machine Learning

Basic Recipe

  • High bias (check training set performance)
    • Bigger network
    • Train longer
    • (NN architecture search)
  • High variance (check dev set performance)
    • More data
    • Regularization
    • (NN architecture search)

Regularization

  • Logistic regression: $\min\limits_{w,b}J(w,b)$
    • $w\in\mathbb{R}^{n_x}, b\in\mathbb{R}$
    • $\lambda$ = regularization parameter
    • $J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})+\dfrac{\lambda}{2m}||w||_2^2$
    • L2 regularization: $||w||_2^2=\sum\limits_{j=1}^{n_x}w_j^2=w^Tw$
    • L1 regularization: $\dfrac{\lambda}{2m}\sum\limits_{j=1}^{n_x}|w_j|=\dfrac{\lambda}{2m}||w||_1$
      • With L1, $w$ will be sparse (it will have lots of zeros in it); in practice this helps only a little.
  • Neural network
    • $J(W^{[1]},b^{[1]},...,W^{[L]},b^{[L]})=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})+\dfrac{\lambda}{2m}\sum\limits_{l=1}^{L}||W^{[l]}||_F^2$
    • $||W^{[l]}||_F^2=\sum\limits_{i=1}^{n^{[l]}}\sum\limits_{j=1}^{n^{[l-1]}}(W_{ij}^{[l]})^2$, where $W^{[l]}$ has shape $(n^{[l]},n^{[l-1]})$
      • Frobenius norm: square root of the sum of squares of all elements in a matrix (the penalty uses the squared norm).
    • $dW^{[l]}=(\text{from backprop})+\dfrac{\lambda}{m}W^{[l]}$
      • $W^{[l]}:=W^{[l]}-\alpha\, dW^{[l]}$ (the update rule stays the same)
      • Weight decay
        • $W^{[l]}:=W^{[l]}-\alpha\left[(\text{from backprop})+\dfrac{\lambda}{m}W^{[l]}\right]$
        • $=W^{[l]}-\dfrac{\alpha\lambda}{m}W^{[l]}-\alpha(\text{from backprop})$
        • $=\left(1-\dfrac{\alpha\lambda}{m}\right)W^{[l]}-\alpha(\text{from backprop})$
  • How does regularization prevent overfitting: a bigger $\lambda$ makes $W^{[l]}$ smaller, which makes $z^{[l]}$ smaller, so the activation function operates in its nearly linear region (take tanh as an example). A nearly linear network has a hard time drawing a highly curved decision boundary.
  • Dropout regularization
    • [Figure: deep-learning-notes_2-1-3-2]
    • Implementing dropout (“inverted dropout”)
      • Illustrate with layer $l=3$ and `keep_prob = 0.8` (each unit has a 0.2 chance of being dropped, i.e. zeroed out)
      • `d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob` # d3 is a mask with the same shape as a3, holding True (1) / False (0) values
      • `a3 = np.multiply(a3, d3)` # same as a3 *= d3; zeroes out the dropped neurons
      • `a3 /= keep_prob` # inverted dropout: keeps the expected total activation the same before and after dropout
    • Why it works: a unit can’t rely on any one feature, so it has to spread out its weights (which shrinks the weights).
    • First check, with dropout turned off, that $J$ decreases over the iterations; then turn on dropout.
  • Data augmentation
    • Images: crop, flip, distort, …
  • Early stopping
    • Stops training at a mid-size $||W||_F^2$
    • Downside: it tries to optimize the cost function and avoid overfitting at the same time, so the two tasks can no longer be handled independently.
  • Orthogonalization
    • Work on one task at a time: either optimize the cost function or reduce overfitting.

Setting up your optimization problem

  • Normalizing training sets
    • [Figure: deep-learning-notes_2-1-3-3]
    • $x=\begin{bmatrix}x_1 \\ x_2\end{bmatrix}$
    • Subtract mean:
      • $\mu=\dfrac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}$
      • $x:=x-\mu$
    • Normalize variance:
      • $\sigma^2=\dfrac{1}{m}\sum\limits_{i=1}^{m}(x^{(i)})^2$ (element-wise square, computed after subtracting the mean)
      • $x:=x/\sigma$ (element-wise, so each feature has unit variance)
    • Use the same $\mu,\sigma^2$ (computed on the training set) to normalize the test set.
    • Why normalize inputs?
      • When the input features are on very different scales, normalizing helps a lot: the cost surface becomes more symmetric, so gradient descent can use a larger learning rate and converges faster.
      • [Figure: deep-learning-notes_2-1-3-3-2]
  • Vanishing/exploding gradients
    • If $W^{[l]}$ is just slightly larger than $I$, activations and gradients in a deep network grow exponentially with depth (exploding).
    • If $W^{[l]}$ is just slightly smaller than $I$, activations and gradients shrink exponentially with depth (vanishing).
  • Weight initialization (single neuron)
    • Larger $n$ (number of input features) --> smaller $w_i$
    • $Var(w_i)=\dfrac{1}{n}$ (sigmoid/tanh); ReLU: $\dfrac{2}{n}$ (the variance could be treated as a hyperparameter, but the notes advise against tuning it)
    • `W = np.random.randn(n_l, n_prev) * np.sqrt(1 / n_prev)`, where `n_l` is $n^{[l]}$ and `n_prev` is $n^{[l-1]}$; for ReLU use $\dfrac{2}{n^{[l-1]}}$ inside the square root
    • Xavier initialization: $\sqrt{\dfrac{1}{n^{[l-1]}}}$; sometimes $\sqrt{\dfrac{2}{n^{[l-1]}+n^{[l]}}}$
  • Numerical approximation of gradients
    • Two-sided difference: $\dfrac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}$
  • Gradient checking (Grad check)
    • Take $W^{[1]},b^{[1]},...,W^{[L]},b^{[L]}$ and reshape into a big vector $\theta$.
    • Take $dW^{[1]},db^{[1]},...,dW^{[L]},db^{[L]}$ and reshape into a big vector $d\theta$.
    • for each i:
      • $d\theta_{approx}[i]=\dfrac{J(\theta_1,\theta_2,...,\theta_i+\epsilon,...)-J(\theta_1,\theta_2,...,\theta_i-\epsilon,...)}{2\epsilon}\approx d\theta[i]=\dfrac{\partial J}{\partial \theta_i}$
    • Check the normalized Euclidean distance $\dfrac{||d\theta_{approx}-d\theta||_2}{||d\theta_{approx}||_2+||d\theta||_2}$ ($||\cdot||_2$ is the Euclidean norm: the square root of the sum of the squared elements)
      • Take $\epsilon=10^{-7}$; if the distance is $\approx10^{-7}$ or smaller, that is great.
      • If it is $10^{-5}$ or bigger, it may need a closer look.
      • If it is $10^{-3}$ or bigger, worry: there may be a bug. Check which components $i$ of $d\theta_{approx}$ differ from $d\theta$.
    • Notes:
      • Don’t use in training - only to debug.
      • If algorithm fails grad check, look at components to try to identify bug.
      • Remember regularization (include the $\dfrac{\lambda}{2m}$ term in $J$).
      • Doesn’t work with dropout (dropout is random, so run the check with dropout turned off).
      • Run at random initialization; perhaps again after some training (a bug may not show up while $w,b\approx0$).
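
A minimal sketch of the gradient check described above, applied to a small made-up cost $J(\theta)=\sum_i(\theta_i^2+\sin\theta_i)$ rather than a real network. The function name `grad_check` and the test function are assumptions; the distance formula and $\epsilon=10^{-7}$ follow the notes.

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Relative distance between the two-sided numerical gradient and dtheta."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus = theta.copy()
        minus = theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(approx - dtheta)
    den = np.linalg.norm(approx) + np.linalg.norm(dtheta)
    return num / den

# Hypothetical cost and its analytic gradient.
def J(theta):
    return float(np.sum(theta ** 2) + np.sum(np.sin(theta)))

def grad(theta):
    return 2 * theta + np.cos(theta)

theta = np.array([0.3, -1.2, 2.0])

good = grad_check(J, theta, grad(theta))   # correct gradient: tiny distance
bad = grad_check(J, theta, 2 * theta)      # deliberately buggy: cos term missing
```

The loop calls `J` twice per parameter, which is why the check is used only for debugging and not during training.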

2.2 Optimization Algorithms

2.2.1 Mini-batch gradient descent

  • Batch vs. mini-batch gradient descent
    • A full batch can be very large, e.g. millions of examples.
      • Set $m=5,000,000$
      • $X=[x^{(1)},x^{(2)},x^{(3)},...,x^{(1000)},x^{(1001)},...,x^{(2000)},...,x^{(m)}]$, shape $(n_x,m)$
      • $Y=[y^{(1)},y^{(2)},y^{(3)},...,y^{(m)}]$, shape $(1,m)$
    • Split into mini-batches of 1,000 examples each.
      • Mini-batch number $t$: $X^{\{t\}},Y^{\{t\}}$
        • Notation: $x^{(i)}$ is the $i$-th example in the training set, $z^{[l]}$ refers to layer $l$ of the network, $X^{\{t\}}$ is the $t$-th mini-batch
      • $X = [X^{\{1\}},X^{\{2\}},...,X^{\{5000\}}]$
      • $Y=[Y^{\{1\}},Y^{\{2\}},Y^{\{3\}},...,Y^{\{5000\}}]$
  • Mini-batch gradient descent
    • 1 step of gradient descent uses one mini-batch $X^{\{t\}},Y^{\{t\}}$ (1000 examples)
      • 1 epoch: a single pass through the training set.
    • for $t=1,...,5000$
      • Forward prop on $X^{\{t\}}$
        • $Z^{[1]}=W^{[1]}X^{\{t\}}+b^{[1]}$
        • $A^{[1]}=g^{[1]}(Z^{[1]})$
        • ...
        • $A^{[L]}=g^{[L]}(Z^{[L]})$
      • Compute cost $J^{\{t\}}=\dfrac{1}{1000}\sum\limits_{i=1}^{1000}L(\hat{y}^{(i)},y^{(i)})+\dfrac{\lambda}{2\cdot1000}\sum\limits_{l=1}^{L}||W^{[l]}||_F^2$
        • $\hat{y}^{(i)},y^{(i)}$ --> from $X^{\{t\}},Y^{\{t\}}$
      • Backprop to compute the gradients of the cost $J^{\{t\}}$ (using $(X^{\{t\}},Y^{\{t\}})$)
      • $W^{[l]}:=W^{[l]}-\alpha\, dW^{[l]}$, $b^{[l]}:=b^{[l]}-\alpha\, db^{[l]}$
  • Understanding mini-batch gradient descent
    • [Figure: deep-learning-notes_2-2-1]
    • If mini-batch size = $m$: batch gradient descent (too long per iteration). Here $(X^{\{t\}},Y^{\{t\}})=(X,Y)$.
    • If mini-batch size = 1: stochastic gradient descent (noisy, never fully converges, loses the speedup from vectorization). Every example is its own mini-batch.
    • In practice: select a size in between 1 and $m$.
      • Get lots of vectorization
      • Make progress without needing to wait for the entire training set.
  • Choosing mini-batch size
    • Small training set ($m<2000$): no need for mini-batches, use batch gradient descent.
    • Typical mini-batch sizes: 64, 128, 256, 512 (use a power of 2).
    • Make sure a mini-batch fits in CPU/GPU memory.
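
A sketch of how a training set might be split into mini-batches $X^{\{t\}},Y^{\{t\}}$. Shuffling the columns first is a common practice but is an assumption here (the notes do not mention it), and the function name `random_mini_batches` and the toy data are hypothetical.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the m columns of X and Y together, then slice into mini-batches."""
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    X_s, Y_s = X[:, perm], Y[:, perm]     # same permutation keeps (x, y) pairs aligned
    return [
        (X_s[:, k:k + batch_size], Y_s[:, k:k + batch_size])
        for k in range(0, m, batch_size)  # the last batch may be smaller
    ]

# Hypothetical data: m = 150 examples; X[0, j] = j and Y[0, j] = j % 2,
# so the alignment of examples and labels can be verified after shuffling.
X = np.arange(2 * 150, dtype=float).reshape(2, 150)
Y = (np.arange(150) % 2).reshape(1, 150)

batches = random_mini_batches(X, Y, batch_size=64)
sizes = [bx.shape[1] for bx, _ in batches]
```

With $m=150$ and a batch size of 64, one epoch is three gradient steps, the last on a smaller batch of 22 examples.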

2.2.2 Exponentially weighted averages

  • $V_t = \beta V_{t-1} + (1 - \beta) \theta_t$
    • $V_t$ is the weighted average at time $t$.
    • $\theta_t$ is the actual observed value at time $t$.
    • $\beta$ is the decay rate (usually between 0 and 1).
    • $V_{t-1}$ is the weighted average at the previous time step.
  • Impact of the decay rate $\beta$: the value of $\beta$ significantly affects the smoothness of the weighted average curve:
    • A larger $\beta$ makes the curve smoother, as it gives more weight to past observations, thereby reducing the impact of recent changes on the weighted average.
    • A smaller $\beta$ makes the curve more responsive to recent changes, as it gives more weight to recent observations.
  • Interpretation of $(1-\epsilon)^{\frac{1}{\epsilon}} \approx \dfrac{1}{e}$
    • Define $\epsilon = 1 - \beta$. The weight on the observation from $k$ steps ago is $(1-\beta)\beta^k$, so after $k=\dfrac{1}{\epsilon}$ steps it has decayed by a factor of $\beta^{\frac{1}{\epsilon}}=(1-\epsilon)^{\frac{1}{\epsilon}}$.
    • As $\epsilon$ approaches 0 (i.e., $\beta$ approaches 1), $(1-\epsilon)^{\frac{1}{\epsilon}}$ approaches $\dfrac{1}{e}\approx0.37$. So $V_t$ is roughly an average over the last $\dfrac{1}{1-\beta}$ observations (e.g., about 10 for $\beta=0.9$); older observations contribute little.
  • Implementation
    • $v_{\theta}:=0$
    • Repeat for each day:
      • Get the next $\theta_t$
      • $v_\theta:=\beta v_\theta+(1-\beta)\theta_t$
  • Bias correction in exponentially weighted averages
    • Bias correction is applied to counteract the initial bias in exponentially weighted averages, especially when the number of data points is small or at the start of the calculation.
    • $\dfrac{v_t}{1-\beta^t}$, where $v_t$ is the uncorrected exponentially weighted average at time $t$ and $\beta$ is the decay rate.
    • It ensures that the moving averages are not underestimated, particularly when $\beta$ is high and in the early stages of the iteration. As the iterations go on, the effect of the correction becomes smaller, since $\beta^t$ approaches 0 and the denominator approaches 1.
  • Gradient descent with momentum
    • On iteration t:
      • Compute $dw, db$ on the current mini-batch (the whole batch if not using mini-batches)
      • $v_{dw}=\beta v_{dw}+(1-\beta)dw$
      • $v_{db}=\beta v_{db}+(1-\beta)db$
      • $w:=w-\alpha v_{dw}$, $b:=b-\alpha v_{db}$
    • Initialize $v_{dw}=0$ and $v_{db}=0$
    • Smooth out gradient descent
      • The momentum term $v$ effectively provides a smoothing effect since it is an average of past gradients. This means that extreme gradient changes in a single iteration are averaged out, reducing the volatility of the update steps.
      • This smoothing effect is particularly useful on loss function surfaces that are not flat or have many local minima.
    • $\beta=0.9$ is a common choice (roughly the average of the last 10 gradients). In the physical analogy, $v_{dw}$ is the velocity and $dw$ is the acceleration. A smaller $\beta$ puts less weight on the accumulated velocity $v_{dw}$ and more weight on the current gradient $dw$.
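
A small sketch of the exponentially weighted average with and without bias correction. The constant input sequence is an assumption chosen so the effect is easy to see: the uncorrected average starts far too low, while the corrected one recovers the true value at every step.

```python
def ewa(thetas, beta=0.9, bias_correction=True):
    """Exponentially weighted average of a sequence, optionally bias-corrected."""
    v = 0.0
    out = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

# Hypothetical observations: the value 10.0 on five consecutive days.
observations = [10.0] * 5

raw = ewa(observations, bias_correction=False)   # starts at (1 - beta) * 10 = 1.0
corrected = ewa(observations)                    # every entry is (close to) 10.0

# Weight decay after 1 / (1 - beta) = 10 steps, to compare with 1/e (about 0.37).
decay_after_10 = 0.9 ** 10
```

The same update with $\theta_t$ replaced by the gradient $dw$ gives the momentum term $v_{dw}$.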

2.2.3 RMSprop and Adam optimization

  • RMSprop (Root Mean Square Propagation)
    • On iteration t:
      • Compute $dw, db$ on the current mini-batch
      • $s_{dw}=\beta_2 s_{dw}+(1-\beta_2)dw^2$ (hoped to be relatively small, so the update to $w$ stays large)
      • $s_{db}=\beta_2 s_{db}+(1-\beta_2)db^2$ (hoped to be relatively large, so the update to $b$ is damped)
      • $w:=w-\alpha\dfrac{dw}{\sqrt{s_{dw}}+\epsilon}$, $b:=b-\alpha\dfrac{db}{\sqrt{s_{db}}+\epsilon}$, where $\epsilon$ is a relatively small number ($10^{-8}$) that prevents the denominator from being 0.
    • Slows down the oscillating vertical direction ($b$) and speeds up the horizontal direction ($w$).
  • Adam (Adaptive moment estimation) optimization algorithm
    • Initialize $v_{dw}=0,\ s_{dw}=0,\ v_{db}=0,\ s_{db}=0$
    • On iteration t:
      • Compute $dw, db$ using the current mini-batch
      • $v_{dw}=\beta_1 v_{dw}+(1-\beta_1)dw,\ v_{db}=\beta_1 v_{db}+(1-\beta_1)db$
      • $s_{dw}=\beta_2 s_{dw}+(1-\beta_2)dw^2,\ s_{db}=\beta_2 s_{db}+(1-\beta_2)db^2$
      • $v_{dw}^{corrected}=\dfrac{v_{dw}}{1-\beta_1^t},\ v_{db}^{corrected}=\dfrac{v_{db}}{1-\beta_1^t}$
      • $s_{dw}^{corrected}=\dfrac{s_{dw}}{1-\beta_2^t},\ s_{db}^{corrected}=\dfrac{s_{db}}{1-\beta_2^t}$
      • $w:=w-\alpha\dfrac{v_{dw}^{corrected}}{\sqrt{s_{dw}^{corrected}}+\epsilon},\ b:=b-\alpha\dfrac{v_{db}^{corrected}}{\sqrt{s_{db}^{corrected}}+\epsilon}$
    • Hyperparameters choice:
      • $\alpha$: needs to be tuned
      • $\beta_1$: 0.9 (moving average of $dw$, the first moment)
      • $\beta_2$: 0.999 (moving average of $dw^2$, the second moment)
      • $\epsilon$: $10^{-8}$ (rarely affects performance)
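
As a rough sketch of the Adam steps listed above, the following function performs one update on a single scalar parameter. The name `adam_step` and the toy cost $(w-3)^2$ are assumptions for illustration; real implementations apply the same arithmetic element-wise to every weight array.

```python
import math


def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based iteration count."""
    v = beta1 * v + (1 - beta1) * dw          # first moment (momentum-like term)
    s = beta2 * s + (1 - beta2) * dw * dw     # second moment (RMSprop-like term)
    v_corrected = v / (1 - beta1 ** t)        # bias correction
    s_corrected = s / (1 - beta2 ** t)
    w = w - alpha * v_corrected / (math.sqrt(s_corrected) + eps)
    return w, v, s


# Hypothetical toy cost J(w) = (w - 3)^2, so dw = 2 (w - 3).
w, v, s = 0.0, 0.0, 0.0
for t in range(1, 4):
    dw = 2 * (w - 3.0)
    w, v, s = adam_step(w, dw, v, s, t, alpha=0.1)
```

Because of the bias correction, the very first step has size close to `alpha` regardless of the gradient's magnitude.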
  • Learning rate decay
    • 1 epoch = 1 pass through the data
    • $\alpha=\dfrac{1}{1+\text{decay\_rate}\times\text{epoch\_num}}\alpha_0$
    • Other methods
      • $\alpha=0.95^{\text{epoch\_num}}\cdot\alpha_0$ (exponential decay)
      • $\alpha=\dfrac{k}{\sqrt{\text{epoch\_num}}}\cdot\alpha_0$ or $\dfrac{k}{\sqrt{t}}\cdot\alpha_0$, where $t$ is the mini-batch number
      • Discrete staircase (reduce $\alpha$ in steps)
      • Manual decay (feasible only when training a small number of models)
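
The decay formulas above can be written as small helper functions. This is only a sketch; the function names and the example values ($\alpha_0=0.2$, decay rate 1) are assumptions chosen for illustration.

```python
import math


def inverse_decay(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)


def exponential_decay(alpha0, epoch_num, base=0.95):
    """alpha = base^epoch_num * alpha0."""
    return (base ** epoch_num) * alpha0


def sqrt_decay(alpha0, epoch_num, k=1.0):
    """alpha = k / sqrt(epoch_num) * alpha0, for epoch_num >= 1."""
    return k / math.sqrt(epoch_num) * alpha0


# Learning rate over the first four epochs with alpha0 = 0.2, decay_rate = 1.
schedule = [inverse_decay(0.2, 1.0, epoch) for epoch in range(1, 5)]
```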
  • The problem of local optima
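    • It is unlikely to get stuck in a bad local optimum: in a high-dimensional parameter space, most points where the gradient is 0 are saddle points rather than local optima.
    • Saddle point: gradient = 0, but the surface curves up in some directions and down in others.
    • Problem of plateaus: regions where the gradient stays close to 0 make learning slow.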

2.3.1 Tuning process

  • Hyperparameters
    • α\alpha: learning rate (1st)
    • β\beta: momentum (2nd)
    • β1,β2,ϵ\beta_1, \beta_2, \epsilon
    • # of layers (3rd)
    • # of hidden units (2nd)
    • learning rate decay (3rd)
    • mini-batch size (2nd)
  • Try random values: Don’t use a grid
  • Coarse to fine: sample coarsely at random first, then sample more densely within the region that works well.

2.3.2 Using an appropriate scale to pick hyperparameters

  • Learning rate: $\alpha = 0.0001,...,1$

    • $r=-4\cdot\text{np.random.rand}()$, so $r\in[-4,0]$
      • In general, $r\in[a,b]$
      • $a=\log_{10}0.0001=-4,\ b=\log_{10}1=0$
    • $\alpha=10^r$, so $\alpha\in[10^{-4},10^0]$
  • Exponentially weighted average decay rate: $\beta=0.9$ (last 10), ..., $0.999$ (last 1000)

    • $1-\beta=0.1,...,0.001$, so sample $r\in[-3,-1]$
      • Reason for sampling $1-\beta$ instead of $\beta$ directly: when $\beta$ is close to 1, small changes have big effects.
    • $1-\beta=10^r$
    • $\beta=1-10^r$
  • In practice:

    • Re-test/re-evaluate hyperparameters occasionally.
    • Babysitting one model (Panda approach): when training capacity is limited, train and adjust one model at a time.
    • Training many models in parallel (Caviar approach): when capacity allows, try many hyperparameter settings at the same time.
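
The log-scale sampling described in this section can be sketched with the standard library alone. The helper names `sample_log_uniform` and `sample_beta` are assumptions for the example; the notes use `np.random.rand()`, and `random.uniform` is swapped in here to keep the snippet dependency-free.

```python
import math
import random


def sample_log_uniform(low, high, rng=random):
    """Sample a value between low and high, uniformly on a log10 scale."""
    a, b = math.log10(low), math.log10(high)
    r = rng.uniform(a, b)  # r in [a, b]
    return 10 ** r


def sample_beta(beta_low=0.9, beta_high=0.999, rng=random):
    """Sample beta by drawing 1 - beta on a log scale."""
    one_minus_beta = sample_log_uniform(1 - beta_high, 1 - beta_low, rng)
    return 1 - one_minus_beta


rng = random.Random(0)  # seeded only to make the example repeatable
alphas = [sample_log_uniform(1e-4, 1.0, rng) for _ in range(5)]
betas = [sample_beta(rng=rng) for _ in range(5)]
```

Sampling on the log scale spends as many trials between 0.0001 and 0.001 as between 0.1 and 1, which a uniform draw on the linear scale would not do.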

2.3.3 Batch Normalization

  • Implementing Batch Norm
    • Batch Norm: make sure hidden units have standardized mean and variance.
    • Given some intermediate values in the NN, $z^{(1)},...,z^{(m)}$ (short for $z^{[l](i)}$, with $l$ indexing a hidden layer and $i$ running from 1 through $m$)
      • $\mu=\dfrac{1}{m}\sum\limits_{i}z^{(i)}$ (mean)
      • $\sigma^2=\dfrac{1}{m}\sum\limits_i(z^{(i)}-\mu)^2$ (variance)
      • $z_{norm}^{(i)}=\dfrac{z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}$ (gives mean 0 and variance 1; $\epsilon$ prevents a zero denominator)
      • $\widetilde{z}^{(i)}=\gamma z_{norm}^{(i)}+\beta$ ($\gamma, \beta$ are learnable parameters of the model)
        • If
          • $\gamma=\sqrt{\sigma^2+\epsilon}$
          • $\beta=\mu$
        • Then $\widetilde{z}^{(i)}=z^{(i)}$, i.e. the normalization is exactly undone
    • Use $\widetilde{z}^{[l](i)}$ instead of $z^{[l](i)}$
  • Adding Batch Norm to a network
    • Parameters: $w^{[1]},b^{[1]},\beta^{[1]},\gamma^{[1]},...,w^{[L]},b^{[L]},\beta^{[L]},\gamma^{[L]}$
      • $\beta^{[l]}$ and $\gamma^{[l]}$ can be updated with gradient descent, momentum, or Adam, e.g. $\beta^{[l]}:=\beta^{[l]}-\alpha\,d\beta^{[l]}$
    • Working with mini-batches: Batch Norm works the same way, but $\mu$ and $\sigma^2$ are computed on each mini-batch separately. $b^{[l]}$ is not needed, because subtracting the mean cancels any constant added to $z^{[l]}$; $\beta^{[l]}$ takes over its role. $\beta^{[l]}$ and $\gamma^{[l]}$ have the same dimension as $b^{[l]}$.
  • Implementing gradient descent (works with momentum, RMSprop, Adam)
    • for $t=1,\dots,$ numMiniBatches
      • Compute forward prop on $X^{\{t\}}$.
        • In each hidden layer, use BN to replace $z^{[l]}$ with $\widetilde{z}^{[l]}$
      • Use backprop to compute $dw^{[l]},d\beta^{[l]},d\gamma^{[l]}$ (no $db^{[l]}$, because $b^{[l]}$ has been dropped)
      • Update parameters
        • $w^{[l]}:=w^{[l]}-\alpha\,dw^{[l]}$
        • $\beta^{[l]}:=\beta^{[l]}-\alpha\,d\beta^{[l]}$
        • $\gamma^{[l]}:=\gamma^{[l]}-\alpha\,d\gamma^{[l]}$
  • Why does Batch Norm work
    • Covariate shift: the input distribution differs between training and test data (e.g. training on black cats but testing on cats of other colors).
      • Internal covariate shift: as the parameters of earlier layers change during training, the distribution of inputs to each later layer changes too. This can make training unstable and less efficient.
      • Batch norm limits how much the distribution of a layer's inputs can shift, which makes the inputs more stable and lets each layer learn more independently of the others.
    • Batch norm as regularization
      • Each mini-batch is scaled by the mean/variance computed on just that mini-batch, not on the whole training set. This adds some noise to each hidden layer's activations (similar to dropout).
      • This has a slight regularization effect. (A larger mini-batch size reduces the noise and therefore the regularization.)
  • Batch Norm at test time
    • $\mu, \sigma^2$: estimate using an exponentially weighted average (across mini-batches). Here $\beta$ is the averaging decay rate, not the Batch Norm shift parameter.
      • $\mu_{\text{global}} = \beta \mu_{\text{global}} + (1 - \beta) \mu$
      • $\sigma^2_{\text{global}} = \beta \sigma^2_{\text{global}} + (1 - \beta) \sigma^2$
    • During testing, use the global mean and variance estimates for normalization, instead of the statistics from the current test sample or mini-batch.
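
The Batch Norm computations in this section can be sketched for a single hidden unit over one mini-batch. This is a minimal illustration using plain Python lists; the function names and the `decay` argument (the averaging rate written as $\beta$ in the test-time formulas) are assumptions for the example.

```python
import math


def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """Normalize one unit's values over a mini-batch, then scale and shift."""
    m = len(z)
    mu = sum(z) / m
    var = sum((zi - mu) ** 2 for zi in z) / m
    z_norm = [(zi - mu) / math.sqrt(var + eps) for zi in z]
    z_tilde = [gamma * zn + beta for zn in z_norm]
    return z_tilde, mu, var


def update_running(running, batch_value, decay=0.9):
    """Exponentially weighted average of a mini-batch statistic."""
    return decay * running + (1 - decay) * batch_value


def batch_norm_inference(z, gamma, beta, running_mu, running_var, eps=1e-8):
    """At test time, normalize with the running (global) statistics."""
    scale = math.sqrt(running_var + eps)
    return [gamma * (zi - running_mu) / scale + beta for zi in z]


z_batch = [1.0, 2.0, 3.0, 4.0]
z_tilde, mu, var = batch_norm_forward(z_batch, gamma=1.0, beta=0.0)
running_mu = update_running(0.0, mu)
running_var = update_running(1.0, var)
```

Setting `gamma = sqrt(var + eps)` and `beta = mu` recovers the original values, which is the special case noted in the list above.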

2.3.4 Multi-class classification

  • Softmax regression
    • $C$ = # classes = 4 (0,...,3)
    • Output layer: 4 nodes, one per class. $\hat{y}$ is a (4,1) vector whose entries sum to 1.
    • $z^{[L]}=w^{[L]}a^{[L-1]}+b^{[L]}$, a (4,1) vector ($L$ denotes the output layer)
    • Activation function:
      • $t=e^{z^{[L]}}$ (element-wise), a (4,1) vector
    • $a^{[L]}=\dfrac{e^{z^{[L]}}}{\sum\limits_{j=1}^4t_j}$, a (4,1) vector with $a^{[L]}_i=\dfrac{t_i}{\sum\limits_{j=1}^4t_j}$
  • Hardmax: set the biggest element to 1 and all the others to 0.
  • Training a softmax classifier
    • If C=2C=2, softmax reduces to logistic regression.
    • Loss function:
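      • $z^{[L]}\rightarrow a^{[L]}=\hat{y}\rightarrow L(\hat{y},y)$, where $z^{[L]}$ and $\hat{y}$ are (4,1) vectors and $L(\hat{y},y)=-\sum\limits_{j=1}^{4}y_j\log\hat{y}_j$
      • Backprop: $dz^{[L]}=\hat{y}-y$, a (4,1) vector, where $dz^{[L]}=\dfrac{\partial J}{\partial z^{[L]}}$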
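
The softmax activation, its loss, and the output-layer gradient can be sketched as below. Subtracting `max(z)` before exponentiating is a standard numerical-stability trick that is not part of the notes; it does not change the result. The function names are assumptions for the example.

```python
import math


def softmax(z):
    """Softmax over a list of scores."""
    shift = max(z)  # stability trick (an addition, not from the notes)
    t = [math.exp(zi - shift) for zi in z]
    total = sum(t)
    return [ti / total for ti in t]


def cross_entropy(y_hat, y):
    """L(y_hat, y) = -sum_j y_j log(y_hat_j) for a one-hot label y."""
    return -sum(yj * math.log(pj) for yj, pj in zip(y, y_hat) if yj > 0)


def dz_output(y_hat, y):
    """Gradient of the loss with respect to z[L]: y_hat - y."""
    return [pj - yj for pj, yj in zip(y_hat, y)]


z_L = [5.0, 2.0, -1.0, 3.0]
y = [0, 1, 0, 0]  # true class is index 1
y_hat = softmax(z_L)
loss = cross_entropy(y_hat, y)
dz = dz_output(y_hat, y)
```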
  • Deep Learning frameworks
    • TensorFlow

III. Structuring Machine Learning Projects

3.1 ML Strategy (1)

3.1.1 Setting up your goal

  • Orthogonalization

    • Chain of assumptions in ML
      • Fit training set well on cost function: bigger network; better optimizer (e.g. Adam)
      • Fit dev set well on cost function: regularization; bigger training set
      • Fit test set well on cost function: bigger dev set
      • Performs well in real world: change dev set or cost function
  • Single number evaluation metric

    • Precision: of the examples the classifier recognized as positive, what percentage are actually positive.
    • Recall: of all the actual positives in the set, what percentage are correctly recognized.
    • F1 score: the harmonic mean of precision and recall, $F_1=\dfrac{2}{\dfrac{1}{P}+\dfrac{1}{R}}$
    • Dev set + single number evaluation metric: speeds up iteration
    • When evaluating many classes (or groups) at once, use the average error rate instead of tracking a separate error rate for each class.
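
Precision, recall, and the F1 score can be computed from counts of true positives, false positives, and false negatives. The sketch below assumes those three counts are available; the function name and the example counts are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean: 2 / (1/P + 1/R), rewritten to avoid dividing by zero.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical classifier: 8 true positives, 2 false positives, 8 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

The harmonic mean stays low when either precision or recall is low, which is why it is preferred over the arithmetic mean as a single number.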
  • Satisficing and optimizing metrics

    • Consider classifiers evaluated on accuracy and running time.

      • Maximize accuracy subject to running time <= 100 ms
      • Accuracy: optimizing metric
      • Running time: satisficing metric
    • With N metrics: 1 optimizing, N-1 satisficing

  • Train/dev/test distributions

    • Dev and test sets should come from the same distribution (randomly shuffle the data before splitting).
    • Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on.
  • Size of dev/test set

    • For large data set, use 98% training, 1% dev, 1% test
    • Size of test set: Set your test set to be big enough to give high confidence in the overall performance of your system.
    • Sometimes only train + dev sets are used, without a test set.
  • When to change dev/test sets and metrics

    • Example: if a classifier with a lower error rate lets pornographic images through, change the error metric so that such mistakes are penalized much more heavily.
    • Two Steps

      1. How to define a metric to evaluate classifiers.
      2. How to do well on this metric.
    • If doing well on your metric + dev/test set does not correspond to doing well on your application, change your metric and/or dev/test set.

3.1.2 Comparing to human-level performance

  • Why human-level performance:
    • Bayes (optimal) error: the best possible error; no model can do better than it.
    • Humans are quite good at a lot of tasks. So long as ML is worse than humans, you can:
      • Get labeled data from humans.
      • Gain insight from manual error analysis: Why did a person get this right?
      • Better analysis of bias/variance.

3.1.3 Analyzing bias and variance

  • Avoidable bias
    • If training error is far from human-level error (used as an estimate of Bayes error), focus on reducing bias (the gap is the avoidable bias). If training error is close to human-level error but far from dev error, focus on reducing variance.
    • Use human-level error as a proxy for Bayes error, since for many tasks the two are not far apart.
  • Understanding human-level performance:
    • Choose which definition of human-level error to use based on the purpose of the analysis.
    • If humans perform very well on the task, human-level error is a good proxy for Bayes error.
  • Surpassing human-level performance
    • Happens mostly on problems that are not natural perception tasks.
    • Usually requires lots of data.
  • Improving your model performance
    • The two fundamental assumptions of supervised learning
      • You can fit the training set pretty well. (Avoidable bias)
      • The training set performance generalizes pretty well to the dev/test set. (Variance)
    • Reducing (avoidable) bias and variance
      • Avoidable bias:
        • Train bigger model.
        • Train longer or use better optimization algorithms (momentum, RMSprop, Adam).
        • NN architecture/hyperparameters search (RNN, CNN).
      • Variance:
        • More data.
        • Regularization (L2, dropout, data augmentation).
        • NN architecture/hyperparameters search.

3.2 ML Strategy (2)

3.2.1 Error analysis

  • Carrying out error analysis
    • Error analysis: count how many misclassified examples fall into a category; that fraction of the error rate is the ceiling on how much fixing the category can help.
      • Get ~100 mislabeled dev set examples.
      • Count up how many are dogs.
    • Evaluate multiple ideas in parallel (ideas for cat detection)
      • Fix pictures of dogs being recognized as cats
      • Fix great cats (lions, panthers, etc.) being misrecognized
      • Improve performance on blurry images
      • Check the details of mislabeled images (only few minutes/hours)
  • Cleaning up incorrectly labeled data
    • DL algorithms are quite robust to random errors in the training set. (random error will not affect the algorithm too much)
    • DL algorithms are less robust to systematic errors.
    • When a high fraction of the mistakes is due to incorrect labels, it is worth spending time to fix them.
    • Correcting incorrect dev/test set examples
      • Apply same process to your dev and test sets to make sure they continue to come from the same distribution.
      • Consider examining examples your algorithm got right as well as ones it got wrong.
      • Train and dev/test data may now come from slightly different distributions.
  • Build your first system quickly, then iterate
    • Set up dev/test set and metric
    • Build initial system quickly
    • Use Bias/Variance analysis & Error analysis to prioritize next steps.
  • Training and testing on different distributions
    • Example: 200,000 images from high-quality webpages and 10,000 from a lower-quality mobile app (the app data is what we care about).
      • Option 1: shuffle all the data together before splitting. Not a good option, because the data we care about would then have very little influence on the dev/test sets.
      • Option 2: build the dev/test sets from mobile app data only and put the remaining app data into the training set, e.g. 50% of the app data in training, 25% in dev, and 25% in test. This aims the target at what we actually want.

3.2.2 Mismatched training and dev/test set

  • Training-dev set: Same distribution as training set, but not used for training.
  • Compare human-level error, training error, training-dev error, dev error, and test error; each gap points to a different problem:
    • Human-level error vs. training error: avoidable bias
    • Training error vs. training-dev error: variance
    • Training-dev error vs. dev error: data mismatch
    • Dev error vs. test error: degree of overfitting to the dev set
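
The gap analysis above amounts to a few subtractions. The sketch below uses hypothetical error rates (in percent) and a made-up helper name, `error_gaps`, to show how the largest gap suggests what to work on first.

```python
def error_gaps(human, train, train_dev, dev, test):
    """Split the total error into the four gaps described above."""
    return {
        "avoidable bias": train - human,
        "variance": train_dev - train,
        "data mismatch": dev - train_dev,
        "overfitting to dev set": test - dev,
    }


# Hypothetical error rates in percent.
gaps = error_gaps(human=1.0, train=2.0, train_dev=9.0, dev=10.0, test=10.5)
largest = max(gaps, key=gaps.get)  # the gap to prioritize
```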
  • Addressing data mismatch
    • Carry out manual error analysis to try to understand difference between training and dev/test sets.
    • Make training data more similar; or collect more data similar to dev/test sets.
    • Artificial data synthesis:
      • Possible issue (overfitting): if 10,000 original examples are all combined with the same single noise sample, the model may overfit to that one noise sample.
  • Transfer learning
    • Pre-training/Fine-tune
    • From relatively large data to relatively small data.
    • If the target data set is too small, transfer learning alone may not be enough; depending on the outcome we want, collecting more target data is valuable.
    • When transfer learning makes sense (transfer from A -> B):
      • Task A and B have the same input x.
      • You have a lot more data for Task A than for Task B (the task you actually care about).
      • Low level features from A could be helpful for learning B.

3.2.3 Learning from multiple tasks

  • Loss function for multiple tasks
    • Cost: $J=\dfrac{1}{m}\sum\limits_{i=1}^m\sum\limits_{j=1}^4L(\hat{y}_j^{(i)},y_j^{(i)})$, where $\hat{y}^{(i)}$ is a (4,1) vector
    • Sum only over the values of $j$ that have a 0/1 label (some examples may be labeled for only some of the tasks).
    • Unlike softmax regression, one image can have multiple labels.
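
A minimal sketch of the multi-task cost with missing labels follows. It assumes the per-task loss $L$ is the logistic loss and marks unlabeled entries with `None`; both choices, and the function name, are assumptions for the example.

```python
import math


def multitask_cost(y_hat, y):
    """Average over examples of the summed logistic loss, skipping unlabeled tasks.

    y_hat: list of per-example prediction lists (probabilities in (0, 1)).
    y: list of per-example label lists with entries 0, 1, or None (unlabeled).
    """
    total = 0.0
    for preds, labels in zip(y_hat, y):
        for p, t in zip(preds, labels):
            if t is None:
                continue  # sum only over tasks that have a 0/1 label
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y)


y_hat = [[0.5, 0.5], [0.5, 0.5]]
y = [[1, 0], [None, 1]]  # the second example is labeled for one task only
cost = multitask_cost(y_hat, y)
```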
  • When multi-task learning makes sense
    • Training on a set of tasks that could benefit from having shared lower-level features.
    • Usually: Amount of data you have for each task is quite similar.
    • Can train a big enough neural network to do well on all the tasks.

3.2.4 End-to-end deep learning

  • End-to-end needs lots of data to work well.
  • With little data, breaking the problem into smaller steps, each handled by its own learning component, often gives better results.
  • Whether to use end-to-end learning
    • Pros:
      • Let the data speak.
      • Less hand-designing of components needed.
    • Cons:
      • May need large amount of data
      • Excludes potentially useful hand-designed components.
    • Key question: Do you have sufficient data to learn a function of the complexity needed to map x to y?
      • Use DL to learn individual components.
      • Carefully choose the X->Y mapping depending on which tasks you can get data for.

IV. Convolutional Neural Networks

4.1 Foundations of Convolutional Neural Networks

4.1.1 Convolution operation

  • Vertical Edge Detection

    • Used to identify vertical edges in images, which is a crucial step in image analysis and understanding.
    • A small matrix, typically 3x3 or 5x5, is used as a convolution kernel to detect vertical edges.
    • The kernel slides over the image, moving one pixel at a time.
    • At each position, element-wise multiplication is performed between the kernel and the overlapping image area, followed by a sum to produce an output feature map.
    • High values in the output feature map indicate the presence of a vertical edge at that location.
    • $\begin{bmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{bmatrix}$
    • This filter responds to edges that are lighter on the left and darker on the right.
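
The sliding-window computation described above can be sketched in plain Python. The helper name `cross_correlate2d` and the 6x6 test image (bright left half, dark right half) are assumptions for the example; no padding and a stride of 1 are assumed.

```python
def cross_correlate2d(image, kernel):
    """Slide a square kernel over a 2D image (no padding, stride 1)."""
    n_h, n_w = len(image), len(image[0])
    f = len(kernel)
    output = []
    for i in range(n_h - f + 1):
        row = []
        for j in range(n_w - f + 1):
            # Element-wise multiply the kernel with the overlapping patch, then sum.
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(f) for b in range(f)))
        output.append(row)
    return output


# 6x6 image: bright (10) on the left, dark (0) on the right.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
vertical_edge = [[1, 0, -1] for _ in range(3)]
feature_map = cross_correlate2d(image, vertical_edge)
```

The 4x4 output has large values in its two middle columns, marking the vertical edge in the middle of the image.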
  • Horizontal Edge Detection

    • Brighter on the top and darker on the bottom
    • $\begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{bmatrix}$
    • TBC
  • Other Common Filters

    • Sobel filter

      • $\begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix}$
    • Scharr filter

      • $\begin{bmatrix} 3 & 0 & -3 \\ 10 & 0 & -10 \\ 3 & 0 & -3 \end{bmatrix}$
  • Padding

    • $(n\times n) * (f\times f) \rightarrow (n-f+1)\times(n-f+1)$

    • Problems of convolution:

      • Shrinking output
      • Throws away information from the edges.
    • Add a padding ($p$) of zeros

      • $(n+2p)\times(n+2p) * (f\times f) \rightarrow (n+2p-f+1)\times(n+2p-f+1)$
    • Valid convolutions: No padding

    • Same convolutions: Pad so that the output size is the same as the input size ($p=\dfrac{f-1}{2}$).

    • f is usually odd.

  • Strided convolution

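    • Move the filter $s$ positions at a time instead of 1.
    • Output size: $\left\lfloor\dfrac{n+2p-f}{s}+1\right\rfloor\times\left\lfloor\dfrac{n+2p-f}{s}+1\right\rfloor$ (if not an integer, round down to the nearest integer).
  • Strictly speaking, the operation called convolution in DL is cross-correlation.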

  • Convolution over volume

    • The filter must have the same number of channels as the input (e.g. an RGB image with a 3x3x3 filter).
    • To look at an individual channel only, set the filter weights for the other channels to 0.
    • If vertical and horizontal edges are detected separately, each giving a 4x4 output, the outputs can be stacked into a 4x4x2 volume.
    • $(n\times n\times n_c) * (f\times f\times n_c) \rightarrow (n-f+1)\times(n-f+1)\times n_c'$, where $n_c'$ is the # of filters
  • One layer of a CNN

    • Add a bias to each filter's output and apply a non-linearity: ReLU(output + b). Stacking the results for all filters gives a volume that plays the role of $a$ in $a=g(z)$.
    • The filters play the role of $w$ in $z=wa+b$.
    • Number of parameters in one layer: if a layer has 10 filters of size 3x3x3, how many parameters does it have? (Each filter has 3x3x3 = 27 weights plus 1 bias = 28 parameters, so the layer has 280 parameters.)
  • Summary of notation (if layer $l$ is a convolution layer)

    • $f^{[l]}$ = filter size (a 3x3 filter has $f=3$)
    • $p^{[l]}$ = padding
    • $s^{[l]}$ = stride
    • Input: $n_H^{[l-1]}\times n_W^{[l-1]}\times n_C^{[l-1]}$
    • Output: $n_H^{[l]}\times n_W^{[l]}\times n_C^{[l]}$
    • $n_{H/W}^{[l]}=\left\lfloor\dfrac{n_{H/W}^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1\right\rfloor$ (round down to the nearest integer)
    • Each filter is: $f^{[l]}\times f^{[l]}\times n_C^{[l-1]}$
    • Activations: $a^{[l]}$ has shape $n_H^{[l]}\times n_W^{[l]}\times n_C^{[l]}$; for a batch of $m$ examples, $A^{[l]}$ has shape $m\times n_H^{[l]}\times n_W^{[l]}\times n_C^{[l]}$
    • Weights: $f^{[l]}\times f^{[l]}\times n_C^{[l-1]}\times n_C^{[l]}$ (the last quantity is the # of filters in layer $l$)
    • Bias: $n_C^{[l]}$ values, stored with shape $(1,1,1,n_C^{[l]})$
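
The notation above can be turned into a small shape calculator. This is only a sketch; the function name and dictionary keys are assumptions, and the example layer sizes are hypothetical values chosen so that the last layer ends at 7x7x40.

```python
def conv_layer_summary(n_h_prev, n_w_prev, n_c_prev, f, p, s, n_c):
    """Output shape, weight shape, bias shape, and parameter count of one conv layer."""
    n_h = (n_h_prev + 2 * p - f) // s + 1  # integer division rounds down
    n_w = (n_w_prev + 2 * p - f) // s + 1
    return {
        "output_shape": (n_h, n_w, n_c),
        "weights_shape": (f, f, n_c_prev, n_c),
        "bias_shape": (1, 1, 1, n_c),
        "num_params": (f * f * n_c_prev + 1) * n_c,  # +1 for each filter's bias
    }


layer1 = conv_layer_summary(39, 39, 3, f=3, p=0, s=1, n_c=10)
layer2 = conv_layer_summary(37, 37, 10, f=5, p=0, s=2, n_c=20)
layer3 = conv_layer_summary(17, 17, 20, f=5, p=0, s=2, n_c=40)
```

The first layer reproduces the 280-parameter count worked out above (10 filters of size 3x3x3, each with one bias).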
  • A simple example ConvNet

    • Take the final output volume (7x7x40), flatten it into a vector of 1960 values, and pass it through a logistic/softmax unit to get the final prediction.
  • Types of layer in a convolutional network

    • Convolution (CONV)
    • Pooling (POOL)
    • Fully connected (FC)

4.1.2 Pooling layers

  • No parameters to learn.
  • Max pooling
    • With a 4x4 input, f (filter size) = 2, and s (stride) = 2, the output is 2x2: take the max of each 2x2 region of the input and write it into one cell of the output.
    • Hyperparameters: f (filter size) and s (stride).
  • Average pooling
    • Instead of taking the maximum, take the average.
  • Input: $n_H\times n_W\times n_C$
  • Output: $\left\lfloor\dfrac{n_H-f}{s}+1\right\rfloor\times\left\lfloor\dfrac{n_W-f}{s}+1\right\rfloor\times n_C$ (round down to the nearest integer)
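
Max and average pooling on a single channel can be sketched with one helper that takes the reduction as an argument. The names `pool2d` and `mean`, and the 4x4 example input, are assumptions for illustration.

```python
def pool2d(x, f=2, s=2, reduce_fn=max):
    """Pool a 2D input with window size f and stride s (no padding)."""
    n_h, n_w = len(x), len(x[0])
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [x[i * s + a][j * s + b] for a in range(f) for b in range(f)]
            row.append(reduce_fn(window))
        output.append(row)
    return output


def mean(values):
    return sum(values) / len(values)


x = [[1, 3, 2, 1],
     [2, 9, 1, 1],
     [1, 3, 2, 3],
     [5, 6, 1, 2]]
max_pooled = pool2d(x, f=2, s=2, reduce_fn=max)
avg_pooled = pool2d(x, f=2, s=2, reduce_fn=mean)
```

There is nothing to learn here: `f` and `s` are fixed hyperparameters, and the same function is applied independently to every channel.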

4.1.3 CNN example

  • Fully Connected layer
    • After several convolutional and pooling layers, the high-level reasoning in the neural network is done via FC layers. The output of the last pooling or convolutional layer, which is typically a multi-dimensional array, is flattened into a single vector of values. This vector is then fed into one or more FC layers.
    • Role:
      • Integration of Learned Features: FC layers combine all the features learned by previous convolutional layers across the entire image. While convolutional layers are good at identifying features in local areas of the input image, FC layers help in learning global patterns in the data.
      • Dimensionality Reduction: FC layers can be seen as a form of dimensionality reduction, where the high-level, spatially hierarchical features extracted by the convolutional layers are compacted into a form where predictions can be made.
      • Classification or Regression: In classification tasks, the final FC layer typically has as many neurons as the number of classes, with a softmax activation function being applied to the output. For regression tasks, the final FC layer’s output size and activation function are adjusted according to the specific requirements of the task.
    • Operation is similar to neurons in a standard neural network.
  • Why convolutions?
    • Parameter sharing: A feature detector (such as a vertical edge detector) that’s useful in one part of the image is probably useful in another part of the image.
    • Sparsity of connections: In each layer, each output value depends only on a small number of inputs.
  • Training set: $(x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)})$
  • Cost: $J=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})$

4.2 Deep Convolutional Models: Case Studies

4.2.1 Case studies (LeNet-5, AlexNet, VGG, ResNets)

  • Red annotations in the figures mark parts of the original designs that are no longer common practice.

LeNet-5

  • Pioneer in CNNs: One of the earliest Convolutional Neural Networks, primarily used for digit recognition tasks.
  • Architecture:
    • Consists of 7 layers (excluding input).
    • Includes convolutional layers, average pooling layers, and fully connected layers.
  • Activation Functions: Uses sigmoid and tanh activation functions in different layers (rarely used today).
  • Local Receptive Fields: Utilizes 5x5 convolution filters to capture spatial features.
  • Subsampling Layers: Employs average pooling for subsampling (max pooling is more common today).

AlexNet

  • The multi-GPU training scheme described in the paper is outdated today, and later research found Local Response Normalization (LRN) to be of little use.
  • Deeper Architecture: Contains 8 learned layers, 5 convolutional layers followed by 3 fully connected layers.
  • ReLU Activation: One of the first CNNs to use ReLU (Rectified Linear Unit) activation function for faster training.
  • Overlapping Pooling: Uses overlapping max pooling, reducing the network’s size and overfitting.
  • Data Augmentation and Dropout: Employs data augmentation and dropout techniques for better generalization.

VGG-16

  • Simplicity and Depth: Known for its simplicity and depth, with 16 learned layers.
  • Uniform Architecture: Features a very uniform architecture, using 3x3 convolution filters with stride and pad of 1, max pooling, and fully connected layers.
  • Convolutional Layers: Stacks convolutional layers (2-4 layers) before each max pooling layer.
  • Large Number of Parameters: Has a high number of parameters (around 138 million), making it computationally intensive.
  • Transfer Learning: Proved to be an excellent model for transfer learning due to its performance and simplicity.

ResNets

  • Residual block

    • Main path: $a^{[l]}$ -> Linear -> ReLU -> $a^{[l+1]}$ -> Linear -> ReLU -> $a^{[l+2]}$
      • $z^{[l+1]}=W^{[l+1]}a^{[l]}+b^{[l+1]}$, $a^{[l+1]}=g(z^{[l+1]})$
      • $z^{[l+2]}=W^{[l+2]}a^{[l+1]}+b^{[l+2]}$, $a^{[l+2]}=g(z^{[l+2]})$
    • Shortcut / skip connection: $a^{[l]}$ is added to $z^{[l+2]}$ just before the final ReLU
      • $a^{[l+2]}=g(z^{[l+2]}+a^{[l]})$
  • In a plain network, the training error should in theory keep decreasing as layers are added. In practice it decreases up to a point and then increases again. With a ResNet, the training error keeps decreasing as the number of layers grows.

  • Why do residual networks work?

    • Residual networks introduce a shortcut or skip connection that allows the network to learn identity functions effectively.

    • This is crucial for training very deep networks by avoiding the vanishing gradient problem.

    • In a residual block:

      • $X$ -> BigNN -> $a^{[l]}$ -> Residual block -> $a^{[l+2]}$
      • Input $X$ is passed through a standard neural network (BigNN) to obtain $a^{[l]}$, which then goes through the residual block to produce $a^{[l+2]}$.
      • The residual block computes: $a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(w^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})$
        • Here, $g$ is the activation function.
        • $z^{[l+2]}$ is the output of the layer just before the activation function.
        • $w^{[l+2]}$ and $b^{[l+2]}$ are the weight and bias of the layer, respectively.
        • If $w^{[l+2]} = 0$ and $b^{[l+2]} = 0$, then $a^{[l+2]} = g(a^{[l]}) = a^{[l]}$ (for ReLU, since $a^{[l]}\ge 0$), so the block can easily learn the identity function.
      • When the dimensions of $a^{[l+2]}$ and $a^{[l]}$ differ (e.g., $a^{[l]} \in \mathbb{R}^{128}$ and $a^{[l+2]} \in \mathbb{R}^{256}$), a linear transformation $w_s$ (e.g., $w_s \in \mathbb{R}^{256 \times 128}$) is applied to $a^{[l]}$ to match the dimensions.
    • This architecture enables training deeper models without performance degradation, which was a significant challenge in deep learning before the development of ResNet.

    • Understanding through backprop (personal notes, not from the course content)

      • Let $x$ be the block's input and $F(x)$ the residual block's computation. The identity mapping carries $x$ forward and adds it to the block's output, so $y=F(x)+x$.
      • Backprop then works as follows:
        • We want $\dfrac{\partial y}{\partial w}$, the gradient of the output $y$ with respect to a weight $w$ of an earlier layer (one that $x$ depends on).
        • By the chain rule: $\dfrac{\partial y}{\partial w} = \dfrac{\partial y}{\partial F(x)}\dfrac{\partial F(x)}{\partial x}\dfrac{\partial x}{\partial w} + \dfrac{\partial y}{\partial x}\dfrac{\partial x}{\partial w}$
        • Since $y=F(x)+x$, both $\dfrac{\partial y}{\partial F(x)}$ and the direct $\dfrac{\partial y}{\partial x}$ are 1.
        • So the formula becomes $\dfrac{\partial y}{\partial w} = \dfrac{\partial F(x)}{\partial x}\dfrac{\partial x}{\partial w} + \dfrac{\partial x}{\partial w}$
      • Without the identity mapping, $\dfrac{\partial y}{\partial w} = \dfrac{\partial F(x)}{\partial x}\dfrac{\partial x}{\partial w}$, which lacks the extra $\dfrac{\partial x}{\partial w}$ term. With the skip connection, the gradient still reaches earlier layers when $\dfrac{\partial F(x)}{\partial x}$ is small, so adding the block should not make the network worse than before.
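
The identity-function property of a residual block can be checked numerically. The sketch below uses tiny fully connected layers on plain lists; the helper names and the example weights are hypothetical, and ReLU is assumed as the activation $g$.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def linear(W, a, b):
    """Compute W a + b for a weight matrix W (list of rows) and vectors a, b."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
            for row, b_i in zip(W, b)]


def residual_block(a_l, W1, b1, W2, b2):
    """a[l+2] = g(z[l+2] + a[l]) with g = ReLU."""
    a_l1 = relu(linear(W1, a_l, b1))
    z_l2 = linear(W2, a_l1, b2)
    return relu([z + a for z, a in zip(z_l2, a_l)])


a_l = [1.0, 0.0, 2.5]  # non-negative, as the output of an earlier ReLU would be
W1 = [[0.5, -1.0, 0.0], [0.0, 1.0, 2.0], [1.0, 0.0, -0.5]]
zeros_matrix = [[0.0] * 3 for _ in range(3)]
zeros_vector = [0.0] * 3

# With w[l+2] = 0 and b[l+2] = 0 the block passes a[l] through unchanged.
a_l2 = residual_block(a_l, W1, zeros_vector, zeros_matrix, zeros_vector)
```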

4.2.2 Network in Network and 1 X 1 convolutions

  • 1x1 convolutions
    • Functionality of 1x1 Convolutions: A 1x1 convolution, despite its simplicity, acts as a fully connected layer applied to each pixel separately across depth. It’s effectively used for channel-wise interactions and dimensionality reduction.
    • Increasing Network Depth: 1x1 convolutions can increase the depth of the network without a significant increase in computational complexity.
    • Dimensionality Reduction: They are often used for reducing the number of channels (depth) before applying expensive 3x3 or 5x5 convolutions, thus reducing the computational cost.
    • Feature Re-calibration: 1x1 convolutions can recalibrate the feature maps channel-wise, enhancing the representational power of the network.
  • Using 1x1 convolutions:
    • Reduce dimension: Consider a 28x28x192 input with CONV 1x1 with 32 filters, the output will be 28x28x32.

4.2.3 Inception network

  • Motivation for inception network
    • Input 28x28x192
      • 1x1 convolution (1x1x192) with 64 filters, output 28x28x64
      • 3x3 same convolution (3x3x192) with 128 filters, output 28x28x128
      • 5x5 same convolution (5x5x192) with 32 filters, output 28x28x32
      • Max-pool with same padding and s=1, reduced to 32 channels, output 28x28x32
    • Concatenate along the channel dimension: final output 28x28x256.
    • The problem of computational cost (consider the 5x5x192 convolution)
      • 5x5x192x28x28x32 multiplications is really big, about 120M.
      • Bottleneck layer (using a 1x1 convolution): 28x28x192 -> CONV 1x1, 16 filters of size 1x1x192 -> 28x28x16 (bottleneck layer) -> CONV 5x5, 32 filters of size 5x5x16 -> 28x28x32
      • In total only 28x28x16x192 (about 2.4M) + 28x28x32x5x5x16 (about 10.0M) = about 12.4M
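
The multiplication counts above are easy to verify. The helper below counts multiplications for one convolution layer as output positions x number of filters x filter size; its name and argument order are assumptions for the example.

```python
def conv_multiplications(out_h, out_w, n_filters, f, n_c_in):
    """Multiplications for a conv layer: every output value needs f*f*n_c_in products."""
    return out_h * out_w * n_filters * f * f * n_c_in


# Direct 5x5 same convolution: 28x28x192 -> 28x28x32.
direct = conv_multiplications(28, 28, 32, 5, 192)

# With a 1x1 bottleneck: 28x28x192 -> 28x28x16 -> 28x28x32.
bottleneck = (conv_multiplications(28, 28, 16, 1, 192)
              + conv_multiplications(28, 28, 32, 5, 16))

savings_factor = direct / bottleneck
```

The bottleneck version needs roughly a tenth of the multiplications of the direct 5x5 convolution.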
  • Inception module
    • The softmax branches at intermediate positions act as regularization and help avoid overfitting.

4.2.4 MobileNet

  • Depthwise Separable Convolution

    • Depthwise Convolution
      • Computational cost = #filter params x #filter positions x # of filters
    • Pointwise Convolution
      • Computational cost = #filter params x #filter positions x # of filters
        • For $1\times1\times n_c$ filters: $n_c$ x #filter positions x $n_c'$, where $n_c'$ is the number of filters
    • Cost of depthwise separable convolution / normal convolution
      • $\dfrac{1}{n_c'} + \dfrac{1}{f^2}$, where $n_c'$ is the number of output channels
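
The cost ratio above can be checked with a small calculation. The sketch assumes a hypothetical 6x6x3 input, a 3x3 filter, and a 4x4x5 output; the function names and argument names are made up for the example.

```python
def normal_conv_cost(f, n_c, n_out, n_c_out):
    """#filter params (f*f*n_c) x #filter positions (n_out*n_out) x # of filters."""
    return f * f * n_c * n_out * n_out * n_c_out


def depthwise_separable_cost(f, n_c, n_out, n_c_out):
    depthwise = f * f * n_out * n_out * n_c        # one f x f filter per input channel
    pointwise = n_c * n_out * n_out * n_c_out      # n_c_out filters of size 1 x 1 x n_c
    return depthwise + pointwise


f, n_c, n_out, n_c_out = 3, 3, 4, 5
normal = normal_conv_cost(f, n_c, n_out, n_c_out)
separable = depthwise_separable_cost(f, n_c, n_out, n_c_out)
ratio = separable / normal
```

Here the ratio equals `1 / n_c_out + 1 / f**2`; with many output channels, the depthwise separable version costs roughly `1 / f**2` of a normal convolution.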
  • MobileNet v2 Bottleneck

    • Residual Connection

      • Expansion
      • Depthwise
      • Pointwise (Projection)
    • Similar computational cost as v1

      • MobileNet V2 improves upon V1 by introducing an inverted residual structure with linear bottlenecks, which enhances the efficiency of feature extraction and information flow through the network. This architectural advancement allows V2 to achieve better performance than V1, despite having similar computational costs. Essentially, V2 optimizes the way features are processed and combined, providing more effective and complex feature representation within the same computational budget as V1.

4.2.5 EfficientNet

  • EfficientNet is a series of deep learning models known for high efficiency and accuracy in image classification tasks.
  • Compound Scaling:
    • It introduces a novel compound scaling method, scaling network depth, width, and resolution uniformly with a set of fixed coefficients.
  • High Efficiency and Accuracy:
    • EfficientNets provide state-of-the-art accuracy for image classification while being more computationally efficient compared to other models.

4.2.6 Practical advice for using ConvNets

  • Transfer Learning
    • Small training set: freeze all hidden layers (their activations can be precomputed and saved to disk) and train only the softmax unit.
    • Bigger training set: freeze fewer hidden layers; train the remaining hidden layers (or replace them with new hidden units) together with your own softmax unit.
    • Lots of data: use the already trained weights and biases as initialization and retrain the whole network, including the softmax unit.
  • Data augmentation
    • Common augmentation method: Mirroring, Random Cropping, (Rotation, Shearing, Local warping, …)
    • Color shifting: add to or subtract from the R, G, and B channels. Advanced: PCA color augmentation.
    • Implementing distortions during training: One CPU thread doing augmentation, and other threads or GPU doing the training at same time.
  • State of CV
    • Amount of data (from little to lots): object detection < image recognition < speech recognition
    • Lots of data: simpler algorithms, less hand-engineering
    • Little data: more hand-engineering ("hacks"); transfer learning helps
    • Two sources of knowledge
      • Labeled data
      • Hand engineered features/network architecture/other components
    • Tips for doing well on benchmarks/winning competitions
      • Ensembling: Train several networks (3-15) independently and average their outputs $\hat{y}$; this can give 1-2% better results.
      • Multi-crop at test time: Run classifier on multiple versions of test images and average results. (10-crop: center, four corner, also on mirror image the same 5 crops)
    • Use open source code
      • Use architectures of networks published in the literature.
      • Use open source implementations if possible.
      • Use pretrained models and fine-tune on your dataset.

4.3 Object Detection

4.3.1 Object localization

  • Want to detect 4 classes: 1-pedestrian, 2-car, 3-motorcycle, 4-background.
  • Defining the target label y: Need to output $b_x, b_y, b_h, b_w$ and the class label (1-4), plus $p_c$, which says whether there is an object at all. (In total 8 elements in the output vector.)
  • $y=[p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]$
    • There is an object: $y=[1, b_x, b_y, b_h, b_w, c_1, c_2, c_3]$
    • No object: $y=[0, ?, ?, ?, ?, ?, ?, ?]$ (don’t care about any of the other components)
  • Loss function:
    • $L(\hat{y}, y)=(\hat{y}_1 - y_1)^2 + (\hat{y}_2 - y_2)^2 + ... + (\hat{y}_8 - y_8)^2$ if $y_1=1$
    • $L(\hat{y}, y)=(\hat{y}_1 - y_1)^2$ if $y_1=0$
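The two-case loss above can be written as a short function. This is a minimal sketch; the name `localization_loss` and the use of `None` for the "don't care" entries are assumptions made for illustration.

```python
def localization_loss(y_hat, y):
    """Squared-error loss for y = [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3].

    If an object is present (y[0] == 1) all 8 components count;
    otherwise only the p_c component is penalized.
    """
    if y[0] == 1:
        return sum((p - t) ** 2 for p, t in zip(y_hat, y))
    return (y_hat[0] - y[0]) ** 2


# Object present: every component contributes to the loss.
y = [1, 0.5, 0.5, 0.2, 0.3, 0, 1, 0]
y_hat = [0.9, 0.5, 0.5, 0.2, 0.3, 0, 1, 0]
print(localization_loss(y_hat, y))

# No object: the "don't care" entries (None here) are never read.
y_bg = [0] + [None] * 7
print(localization_loss([0.5, 9, 9, 9, 9, 9, 9, 9], y_bg))
```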

4.3.2 Landmark detection

  • Annotate key positions (points given as x, y coordinates) as landmarks.

4.3.3 Object detection

  • Object detection

    • Start by training a ConvNet on closely cropped images.
    • Slide a window from the top left to the bottom right of the image, one step at a time. If nothing is found, increase the window size and repeat the sliding.
    • Feed each window crop into the ConvNet.
  • Turning FC layer into convolutional layers

    • Instead of feeding directly into FC layers, use convolutional filters that produce the same outputs.
  • Convolution implementation of sliding windows

    • Instead of running the ConvNet 4 separate times on 14x14x3 crops, the convolutional version shares the computation across windows and directly outputs a 2x2x4 volume.
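A small shape calculation shows where the 2x2x4 output comes from. The layer sizes used here (5x5 conv, 2x2 max pool with stride 2, then a 5x5 conv replacing the first FC layer, followed by 1x1 convs) are an assumption based on the usual course example, not something spelled out in these notes.

```python
def conv_output_size(n, f, stride=1):
    """Spatial size after a 'valid' convolution or pooling with an f x f window."""
    return (n - f) // stride + 1


def sliding_window_grid(n):
    """Trace the spatial size of an n x n input through the assumed layers."""
    n = conv_output_size(n, 5)            # 5x5 conv
    n = conv_output_size(n, 2, stride=2)  # 2x2 max pool, stride 2
    n = conv_output_size(n, 5)            # 5x5 conv standing in for the first FC layer
    # The remaining 1x1 convs (second FC layer and softmax) keep the spatial size.
    return n


print(sliding_window_grid(14))  # 1: a single 1x1x4 prediction
print(sliding_window_grid(16))  # 2: a 2x2x4 volume, one prediction per window
```

A 16x16x3 input therefore yields four window predictions in one forward pass instead of four separate 14x14x3 passes.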
  • Output accurate bounding boxes

    • YOLO algorithm

      • Find the midpoint of each object and assign the object to the grid cell that contains that point; that cell is responsible for predicting the object's bounding box.
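The midpoint-to-grid-cell assignment can be sketched as follows. This assumes box coordinates are given as fractions of the image size and that the box position and size are expressed relative to the grid cell; `encode_box` is a hypothetical helper name.

```python
def encode_box(cx, cy, w, h, grid=3):
    """Assign a box to the grid cell containing its midpoint.

    cx, cy, w, h are fractions of the image size (0 to 1).
    Returns (row, col, [p_c, b_x, b_y, b_h, b_w]) where b_x, b_y are the
    midpoint's offset inside the cell and b_h, b_w are in units of cell size.
    """
    col = min(int(cx * grid), grid - 1)
    row = min(int(cy * grid), grid - 1)
    b_x = cx * grid - col
    b_y = cy * grid - row
    b_h = h * grid
    b_w = w * grid
    return row, col, [1, b_x, b_y, b_h, b_w]


row, col, label = encode_box(cx=0.5, cy=0.5, w=0.3, h=0.6)
print(row, col, label)
```

Note that `b_h` and `b_w` can exceed 1 when an object is larger than a single grid cell.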
    • Intersection over union (IoU)

      • Used to check the accuracy of a predicted bounding box.
      • Size of intersection / size of union (normally “correct” if IoU $\geq 0.5$)
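A minimal IoU sketch, assuming each box is given by its corners `(x1, y1, x2, y2)` rather than the midpoint/height/width form used for the labels above:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap of 1 over a union of 7
```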
    • Non-max suppression

      • Keep the box with the highest probability, and suppress all other boxes that have a high IoU with it.
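Non-max suppression can be sketched as a greedy loop. This is an illustrative version for a single class, assuming corner-format boxes `(x1, y1, x2, y2)` and an IoU threshold of 0.5; the helper names are hypothetical.

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # most confident remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if _iou(boxes[best], boxes[i]) < iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed; the third box does not overlap and is kept.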
    • Anchor Boxes

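      • Predefine anchor boxes and associate each object with the anchor box whose shape matches it best.
      • Does not work well if a grid cell contains more objects than there are anchor boxes, or if two objects in the same cell match the same anchor box shape.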
    • Training set

      • y is 3x3x2x8: (# of grid cells, 3x3) x (# of anchors) x (5 + # of classes), where the 5 values are $p_c, b_x, b_y, b_h, b_w$.
  • Region Proposals

    • R-CNN: Propose regions. Classify proposed regions one at a time. Output label + bounding box.
    • Fast R-CNN: Propose regions. Use convolution implementation of sliding windows to classify all the proposed regions.
    • Faster R-CNN: Use convolutional network to propose regions.
  • Semantic Segmentation with U-Net

    • Per-pixel class labels

    • Deep Learning for Semantic Segmentation

    • Transpose Convolution

      • Increases the height and width of the feature map (upsampling).
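A bare-bones transpose convolution, to show how the output grows: each input value scales a copy of the filter, which is added into the output at a stride-spaced position. This sketch assumes a square single-channel input, no padding, and plain nested lists; it is an illustration, not an efficient implementation.

```python
def transpose_conv2d(x, kernel, stride=2):
    """Transpose convolution of a square input x with a square kernel, no padding."""
    n, f = len(x), len(kernel)
    out_size = (n - 1) * stride + f
    out = [[0.0] * out_size for _ in range(out_size)]
    for i in range(n):
        for j in range(n):
            # Paste x[i][j] * kernel into the output; overlapping regions add up.
            for a in range(f):
                for b in range(f):
                    out[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return out


x = [[1, 2],
     [3, 4]]
ones_3x3 = [[1, 1, 1] for _ in range(3)]
for row in transpose_conv2d(x, ones_3x3, stride=2):
    print(row)
```

Here a 2x2 input becomes a 5x5 output, since the output size is `(n - 1) * stride + f` when no padding is used.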
    • U-Net Architecture

      • Skip connections: the left (downsampling) side passes along fine details such as color and texture, while the right (upsampling) side carries more spatial information for figuring out where the object really is.

4.4 Special Applications: Face Recognition & Neural Style Transfer

4.4.1 Face recognition

  • Face verification vs. face recognition

    • Verification vs. recognition: 1:1 vs. 1:K
    • Verification
      • Input image, name/ID.
      • Output whether the input image is that of the claimed person.
    • Recognition
      • Has a database of K persons
      • Get an input image
      • Output ID if the image is any of the K persons (or “not recognized”)
  • One-shot learning

    • Learning from one example to recognize the person again.
    • Learning a “similarity” function
      • d(img1, img2) = degree of difference between images
      • If d(img1, img2) $\le \tau$: “same”; if d(img1, img2) $> \tau$: “different”
  • Siamese network

    • Feed two different images through the same CNN (two copies sharing the same parameters) and compare the results.
    • For example, feed $x^{(1)}, x^{(2)}$ separately through the network; the outputs are the encodings $f(x^{(1)}), f(x^{(2)})$.
    • Then compare the distance between them: $d(x^{(1)}, x^{(2)}) = ||f(x^{(1)}) - f(x^{(2)})||_2^2$
    • Parameters of the NN define an encoding $f(x^{(i)})$
    • Learn parameters so that:
      • If $x^{(i)}, x^{(j)}$ are the same person, $||f(x^{(i)}) - f(x^{(j)})||^2$ is small.
      • If $x^{(i)}, x^{(j)}$ are different persons, $||f(x^{(i)}) - f(x^{(j)})||^2$ is large.
  • Triplet Loss

    • Learning objective: (Anchor, Positive), (Anchor, Negative)

      • Want: $||f(A) - f(P)||^2 + \alpha \le ||f(A) - f(N)||^2$, where $\alpha$ is the margin (similar to SVM)
      • Equivalently: $||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + \alpha \le 0$
    • Loss function

      • Given 3 images A, P, N:
      • $L(A, P, N) = \max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + \alpha, 0)$
      • $J = \sum\limits_{i=1}^m L(A^{(i)},P^{(i)},N^{(i)})$
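The triplet loss and cost above translate directly into code. This is a minimal sketch operating on encodings that are already computed (plain lists of floats); the margin `alpha=0.2` is only an illustrative value.

```python
def squared_distance(u, v):
    """Squared Euclidean distance ||u - v||^2 between two encodings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """L(A, P, N) = max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    return max(squared_distance(f_a, f_p) - squared_distance(f_a, f_n) + alpha, 0.0)


def triplet_cost(triplets, alpha=0.2):
    """J = sum of the triplet losses over all (A, P, N) encodings."""
    return sum(triplet_loss(a, p, n, alpha) for a, p, n in triplets)


easy = ([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])  # negative is far away: loss is 0
hard = ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0])  # d(A,P) == d(A,N): loss equals alpha
print(triplet_loss(*easy), triplet_loss(*hard), triplet_cost([easy, hard]))
```

The easy triplet already satisfies the margin and contributes nothing to the cost, which is why hard triplets matter for training.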
    • Example: with a training set of 10K pictures of 1K persons, form triplets (A, P, N) from those 10K pictures and feed them to the loss function. Each person needs several pictures so that (A, P) pairs exist.

    • Choosing the triplets A, P, N

      • During training, if A, P, N are chosen randomly, $d(A,P) + \alpha \le d(A,N)$ is easily satisfied, so the network learns little from those triplets.
      • Choose triplets that are “hard” to train on (for example, with $d(A,P) \approx d(A,N)$).
    • Training with the triplet loss makes J smaller, which makes the distance d small for the same person and large for different persons.

  • Face Verification and Binary Classification

    • Learning the similarity function
    • $\hat{y} = \sigma (\sum\limits_{k=1}^{128}w_k|f(x^{(i)})_k-f(x^{(j)})_k| + b)$
    • Only the precomputed encodings $f(x^{(j)})$ of the database images need to be stored, not the raw images, which saves storage and computation.
    • Face verification then becomes supervised learning: a binary classification on pairs of images (1 = same person, 0 = different persons).
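The similarity function above is a logistic regression unit on the element-wise absolute difference of two encodings. A minimal sketch, assuming the encodings, weights `w`, and bias `b` are already available (the values below are made up, and 3-dimensional instead of 128-dimensional):

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def same_person_probability(f_i, f_j, w, b):
    """y_hat = sigmoid(sum_k w_k * |f_i[k] - f_j[k]| + b)."""
    z = sum(w_k * abs(a - c) for w_k, a, c in zip(w, f_i, f_j)) + b
    return sigmoid(z)


w = [-2.0, -2.0, -2.0]  # negative weights: larger differences lower the probability
b = 1.0
stored = [0.1, 0.4, 0.9]  # precomputed encoding f(x^(j)) from the database
print(same_person_probability([0.1, 0.4, 0.9], stored, w, b))  # identical encodings
print(same_person_probability([0.9, 0.0, 0.1], stored, w, b))  # very different encodings
```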

4.4.2 Neural style transfer

  • What is it?
    • Generate a new image G that renders the content of a content image C in the style of a style image S.
  • Cost Function
    • $J(G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G)$
    • Find the generated image G
      1. Initialize G randomly, e.g. G: 100x100x3
      2. Use gradient descent to minimize J(G): $G:=G-\dfrac{\partial}{\partial G}J(G)$
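The two steps above can be illustrated with a toy example in which the pixels of G themselves are the variables being optimized. The quadratic terms used here are stand-ins chosen so the gradient is easy to write; they are NOT the real content and style costs, and the learning rate `lr` is an added assumption.

```python
import random


def toy_cost_gradient(G, C, S, alpha, beta):
    """Gradient of alpha*||G-C||^2 + beta*||G-S||^2 with respect to G.

    Stand-in costs for illustration only, not the real J_content / J_style.
    """
    return [2 * alpha * (g - c) + 2 * beta * (g - s) for g, c, s in zip(G, C, S)]


def generate(C, S, alpha=1.0, beta=1.0, lr=0.1, steps=200, seed=0):
    rng = random.Random(seed)
    G = [rng.random() for _ in C]          # 1. initialize G randomly
    for _ in range(steps):                 # 2. gradient descent on G itself
        grad = toy_cost_gradient(G, C, S, alpha, beta)
        G = [g - lr * dg for g, dg in zip(G, grad)]
    return G


C = [0.0, 0.2, 0.4]  # flattened "content" pixels (made up)
S = [1.0, 0.8, 0.6]  # flattened "style" pixels (made up)
print(generate(C, S))
```

With these stand-in costs G settles between C and S, weighted by `alpha` and `beta`, which mirrors how the two weights trade off content against style in the real cost.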

5. Sequence Models

5.1 Recurrent Neural Networks

5.1.1 RNN model

5.1.2 Backpropagation through time

5.1.3 Different types of RNNs

5.2 Natural Language Processing & Word Embeddings

5.2.1 Word Representation

5.2.2 Embedding matrix

5.2.3 Word embeddings in TensorFlow

5.3 Sequence Models & Attention Mechanism

5.3.1 Sequence to sequence model

5.3.3 Attention model

Author: ZL Asica
Copyright notice: This article is licensed under CC BY-NC-SA 4.0 DEED.