# Deep Learning - Study Notes

These notes are based on https://www.coursera.org/specializations/deep-learning

LaTeX may have some rendering issues when displayed.

## 1. Neural Networks and Deep Learning

### 1.1 Introduction to Deep Learning

#### 1.1.1 Supervised Learning with Deep Learning

• Structured data: tabular data (e.g. database tables/charts).
• Unstructured data: audio, images, text.

#### 1.1.2 Scale drives deep learning progress

• As the amount of data grows, larger neural networks keep improving and outperform smaller networks and traditional supervised learning algorithms.
• Switching from sigmoid to ReLU makes gradient descent much faster, because the gradient no longer saturates toward 0 for large inputs.

### 1.2 Basics of Neural Network Programming

#### 1.2.1 Binary Classification

• Input: $x \in \mathbb{R}^{n_x}$
• Output: $y \in \{0, 1\}$

#### 1.2.2 Logistic Regression

• Given $x$, want $\hat{y} = P(y=1|x)$

• Input: $x \in R^{n_x}$

• Parameters: $w \in R^{n_x}, b \in R$

• Output $\hat{y} = \sigma(w^Tx + b)$

• $\sigma(z)=\dfrac{1}{1+e^{-z}}$
• If $z$ is large, $\sigma(z)\approx\dfrac{1}{1+0}\approx1$
• If $z$ is a large negative number, $\sigma(z)\approx\dfrac{1}{1+\text{Bignum}}\approx0$
• Loss (error) function:

• $\hat{y} = \sigma(w^Tx + b)$, where $\sigma(z)=\dfrac{1}{1+e^{-z}}$
• $z^{(i)}=w^Tx^{(i)}+b$
• Want $y^{(i)} \approx \hat{y}^{(i)}$

• $L(y, \hat{y}) = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]$
• If $y=1$: $L(\hat{y}, y)=-\log{\hat{y}}$ <- want $\log{\hat{y}}$ as large as possible, so want $\hat{y}$ large
• If $y=0$: $L(\hat{y}, y)=-\log{(1-\hat{y})}$ <- want $\log{(1-\hat{y})}$ as large as possible, so want $\hat{y}$ small
• Cost function

• $J(w, b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})=-\dfrac{1}{m}\sum\limits_{i=1}^{m}[y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})]$

#### 1.2.3 Gradient Descent

• Repeat: $w:=w-\alpha\dfrac{\partial J(w,b)}{\partial w}$; $b:=b-\alpha\dfrac{\partial J(w,b)}{\partial b}$
• $\alpha$: learning rate
• On the right side of the minimum, $\dfrac{dJ(w)}{dw} > 0$; on the left side of the minimum, $\dfrac{dJ(w)}{dw} < 0$
• Logistic regression derivatives, on a computation graph with inputs $x_1,x_2,w_1,w_2,b$:
• $z=w_1x_1+w_2x_2+b$ --> $a=\sigma(z)$ --> $L(a,y)$
• $da=\dfrac{dL(a,y)}{da}=-\dfrac{y}{a}+\dfrac{1-y}{1-a}$
• $\dfrac{dL(y,a)}{da} = \dfrac{d}{da}(-y\log(a) - (1-y)\log(1-a))$
• $\dfrac{d}{da} (-y\log(a)) = -\dfrac{y}{a}$
• $\dfrac{d}{da} (-(1-y)\log(1-a)) = -\dfrac{1-y}{1-a} \times (-1) = \dfrac{1-y}{1-a}$
• $=-\dfrac{y}{a} + \dfrac{1-y}{1-a} = -\dfrac{y}{a} - \dfrac{y-1}{1-a}$
• $dz=\dfrac{dL}{dz}=\dfrac{dL(a,y)}{dz}=a-y$
• $dz=\dfrac{dL}{da}\cdot\dfrac{da}{dz}$, where $\dfrac{da}{dz}=a(1-a)$
• $\dfrac{dL}{dw_1}="dw_1"=x_1\cdot dz$
• $\dfrac{dL}{dw_2}="dw_2"=x_2\cdot dz$
• $db=dz$
• Gradient descent on $m$ examples
• $J(w, b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(a^{(i)},y^{(i)})$
• $\dfrac{\partial}{\partial w_1}J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^{m}\dfrac{\partial}{\partial w_1}L(a^{(i)},y^{(i)})$
• $J=0;dw_1=0;dw_2=0;db=0$
• for $i=1$ to $m$:
• $z^{(i)}=w^Tx^{(i)}+b$
• $a^{(i)}=\sigma (z^{(i)})$
• $J+=-[y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})]$
• $dz^{(i)}=a^{(i)}-y^{(i)}$
• $dw_1+=x_1^{(i)}dz^{(i)}$ (for $n=2$)
• $dw_2+=x_2^{(i)}dz^{(i)}$ (for $n=2$)
• $db+=dz^{(i)}$
• $J/=m;\space dw_1/=m;\space dw_2/=m;\space db/=m$
• $dw_1=\dfrac{\partial J}{\partial w_1}; dw_2=\dfrac{\partial J}{\partial w_2}$
• $w_1:=w_1-\alpha dw_1$
• $w_2:=w_2-\alpha dw_2$
• $b:=b-\alpha db$
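
A minimal NumPy sketch of one iteration of the loop above (the function and variable names are illustrative, not from the course): it accumulates the cost and gradients example by example, averages them, then takes one gradient-descent step.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def one_step_loop(X, y, w, b, alpha=0.01):
    """One gradient-descent step with an explicit loop over the m examples.

    X: (n_x, m) inputs, y: (m,) labels, w: (n_x, 1) weights, b: scalar bias.
    """
    n_x, m = X.shape
    J, dw, db = 0.0, np.zeros((n_x, 1)), 0.0
    for i in range(m):
        z_i = (w.T @ X[:, i:i+1]).item() + b      # z = w^T x + b
        a_i = sigmoid(z_i)                        # a = sigma(z)
        J += -(y[i] * np.log(a_i) + (1 - y[i]) * np.log(1 - a_i))
        dz_i = a_i - y[i]                         # dz = a - y
        dw += X[:, i:i+1] * dz_i                  # dw_j += x_j * dz
        db += dz_i
    J, dw, db = J / m, dw / m, db / m
    w = w - alpha * dw                            # gradient-descent update
    b = b - alpha * db
    return J, w, b
```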

#### 1.2.4 Computational Graph

• $J(a,b,c)=3(a+bc)$
• $u=bc$
• $v=a+u$
• $J=3v$
• Left-to-right (forward) computation gives $J$; derivatives are computed right to left.
• Derivatives with a Computation Graph

• $\dfrac{dJ}{dv}=3$
• $\dfrac{dJ}{da}=3$
• $\dfrac{dv}{da}=1$
• Chain Rule: $\dfrac{dJ}{da}=\dfrac{dJ}{dv}\cdot\dfrac{dv}{da}$
• $\dfrac{dJ}{du}=3; \dfrac{du}{db}=2; \dfrac{dJ}{db}=6$
• $\dfrac{du}{dc}=3; \dfrac{dJ}{dc}=9$
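
A tiny numeric sketch of the forward pass and the chain-rule backward pass on this graph; the values $a=5, b=3, c=2$ (so $u=6$, $v=11$, $J=33$) are illustrative and match the derivative values listed above.

```python
# Forward pass through the computation graph J(a, b, c) = 3(a + b*c)
a, b, c = 5.0, 3.0, 2.0
u = b * c          # u = bc
v = a + u          # v = a + u
J = 3 * v          # J = 3v

# Backward pass (chain rule, right to left)
dJ_dv = 3.0                 # dJ/dv
dJ_da = dJ_dv * 1.0         # dv/da = 1      -> dJ/da = 3
dJ_du = dJ_dv * 1.0         # dv/du = 1      -> dJ/du = 3
dJ_db = dJ_du * c           # du/db = c = 2  -> dJ/db = 6
dJ_dc = dJ_du * b           # du/dc = b = 3  -> dJ/dc = 9
```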

#### 1.2.5 Vectorization

• avoid explicit for-loops.

• $J=0;\space dw=np.zeros((n_x,1));\space db=0$
• for $i=1$ to $m$:
• $z^{(i)}=w^Tx^{(i)}+b$
• $a^{(i)}=\sigma (z^{(i)})$
• $J+=-[y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})]$
• $dz^{(i)}=a^{(i)}-y^{(i)}$
• $dw+=x^{(i)}dz^{(i)}$
• $db+=dz^{(i)}$
• $J/=m;\space dw/=m;\space db/=m$
• $Z=np.dot(w.T,X)+b$; $b$ is $(1,1)$ and is broadcast across the $(1,m)$ row (broadcasting)
• Vectorizing logistic regression

• $dz^{(1)}=a^{(1)}-y^{(1)};\space dz^{(2)}=a^{(2)}-y^{(2)};\space ...$
• $dZ=[dz^{(1)}, dz^{(2)},...,dz^{(m)}]$, a $1\times m$ row vector
• $A=[a^{(1)}, a^{(2)}, ..., a^{(m)}]$, $Y=[y^{(1)}, y^{(2)}, ..., y^{(m)}]$
• $dZ=A-Y=[a^{(1)}-y^{(1)}, a^{(2)}-y^{(2)}, ...]$
• Get rid of $db$ and $dw$ in the for loop:
• $db=\dfrac{1}{m}\sum\limits_{i=1}^{m}dz^{(i)}=\dfrac{1}{m}\space np.sum(dZ)$
• $dw=\dfrac{1}{m}\cdot X\cdot dZ^T=\dfrac{1}{m}[x^{(1)}\space ...\space x^{(m)}][dz^{(1)}\space ...\space dz^{(m)}]^T=\dfrac{1}{m}[x^{(1)}dz^{(1)}+...+x^{(m)}dz^{(m)}]$, an $n_x\times 1$ vector
• New form (one iteration) of logistic regression, fully vectorized (see the NumPy sketch at the end of this section):
• $Z=w^TX+b=np.dot(w.T, X)+b$
• $A=\sigma (Z)$
• $dZ=A-Y$
• $dw=\dfrac{1}{m}\cdot X \cdot dZ^T$
• $db=\dfrac{1}{m}np.sum(dZ)$
• $w:=w-\alpha\space dw$
• $b:=b-\alpha\space db$
• Broadcasting (analogous to bsxfun in Matlab/Octave)

• $(m,n)$ op $(1,n)$ -> $(m,n)$: the $(1,n)$ row is copied $m$ times (op is $+,-,*,/$)
• $(m,n)$ op $(m,1)$ -> $(m,n)$: the $(m,1)$ column is copied $n$ times
• Don't use $a = np.random.randn(5)$, which gives $a.shape = (5,)$, a "rank 1 array"
• Use $a = np.random.randn(5,1)$ or $a = np.random.randn(1,5)$
• Check with $assert(a.shape == (5,1))$
• Fix a rank 1 array with $a = a.reshape((5,1))$
• Logistic Regression Cost Function

• Loss
• $p(y|x)=\hat{y}^y(1-\hat{y})^{(1-y)}$
• If $y=1$: $p(y|x)=\hat{y}$
• If $y=0$: $p(y|x)=(1-\hat{y})$
• $\log p(y|x)=\log [\hat{y}^y(1-\hat{y})^{(1-y)}]=y\log \hat{y}+(1-y)\log(1-\hat{y})=-L(\hat{y},y)$
• Cost
• $\log p(\text{labels in training set})=\log \prod\limits_{i=1}^{m}p(y^{(i)}|x^{(i)})$
• $\log p(\text{labels in training set})=\sum\limits_{i=1}^m\log p(y^{(i)}|x^{(i)})=-\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})$
• Use maximum likelihood estimation (MLE)
• Cost (to minimize): $J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})$; minimizing the cost maximizes the log-likelihood.
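
A minimal NumPy sketch of the fully vectorized form above; the names (`logistic_regression_step`, the shapes of `X` and `Y`) are illustrative assumptions, with `w` kept as an $(n_x,1)$ column to avoid rank-1 arrays.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def logistic_regression_step(X, Y, w, b, alpha=0.01):
    """One vectorized iteration of logistic regression gradient descent."""
    m = X.shape[1]
    Z = np.dot(w.T, X) + b                 # (1, m); scalar b is broadcast
    A = sigmoid(Z)                         # (1, m) predictions
    J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                             # (1, m)
    dw = np.dot(X, dZ.T) / m               # (n_x, 1)
    db = np.sum(dZ) / m                    # scalar
    w = w - alpha * dw
    b = b - alpha * db
    return J, w, b

# Usage on toy data
X = np.random.randn(3, 100)                # 3 features, 100 examples
Y = (np.random.rand(1, 100) > 0.5).astype(float)
w, b = np.zeros((3, 1)), 0.0
J, w, b = logistic_regression_step(X, Y, w, b)
assert w.shape == (3, 1)                   # not a rank-1 array
```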

### 1.3 Shallow Neural Networks

#### 1.3.1 Neural Network Representation

• Input layer, hidden layer, output layer

• $a^{[0]}=x$ -> $a^{[1]}=[a^{[1]}_1,a^{[1]}_2,a^{[1]}_3,a^{[1]}_4]^T$ -> $a^{[2]}=\hat{y}$
• Layers are counted as # of hidden layers + the output layer (the input layer is layer 0 and is not counted).
• $x_1,x_2,x_3$ -> $4$ hidden nodes -> output layer
• First hidden node: $z^{[1]}_1=w^{[1]T}_1x+b^{[1]}_1,\space a^{[1]}_1=\sigma(z^{[1]}_1)$
• Second hidden node: $z^{[1]}_2=w^{[1]T}_2x+b^{[1]}_2,\space a^{[1]}_2=\sigma(z^{[1]}_2)$
• Third hidden node: $z^{[1]}_3=w^{[1]T}_3x+b^{[1]}_3,\space a^{[1]}_3=\sigma(z^{[1]}_3)$
• Fourth hidden node: $z^{[1]}_4=w^{[1]T}_4x+b^{[1]}_4,\space a^{[1]}_4=\sigma(z^{[1]}_4)$
• Vectorization

• $W^{[1]}=\begin{bmatrix}-w^{[1]T}_1- \\ -w^{[1]T}_2- \\ -w^{[1]T}_3- \\ -w^{[1]T}_4- \end{bmatrix}$, a $(4,3)$ matrix
• $z^{[1]}=\begin{bmatrix}-w^{[1]T}_1- \\ -w^{[1]T}_2- \\ -w^{[1]T}_3- \\ -w^{[1]T}_4- \end{bmatrix}\cdot \begin{bmatrix}x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix}b^{[1]}_1 \\ b^{[1]}_2 \\ b^{[1]}_3 \\ b^{[1]}_4 \end{bmatrix} =\begin{bmatrix}w^{[1]T}_1\cdot x+b^{[1]}_1 \\ w^{[1]T}_2\cdot x+b^{[1]}_2 \\ w^{[1]T}_3\cdot x+b^{[1]}_3 \\ w^{[1]T}_4\cdot x+b^{[1]}_4 \end{bmatrix}=\begin{bmatrix}z^{[1]}_1 \\ z^{[1]}_2 \\ z^{[1]}_3 \\ z^{[1]}_4 \end{bmatrix}$
• $a^{[1]}=\begin{bmatrix}a^{[1]}_1 \\ a^{[1]}_2 \\ a^{[1]}_3 \\ a^{[1]}_4 \end{bmatrix}=\sigma(z^{[1]})$
• $z^{[2]}=W^{[2]}\cdot a^{[1]}+b^{[2]}$, with shapes $(1,1),(1,4),(4,1),(1,1)$
• $a^{[2]}=\sigma(z^{[2]})$, with shapes $(1,1),(1,1)$
• $a^{[2](i)}$: layer $2$, example $i$
• for $i=1$ to $m$:

• $z^{[1](i)}=W^{[1]}\cdot x^{(i)}+b^{[1]}$
• $a^{[1](i)}=\sigma(z^{[1](i)})$
• $z^{[2](i)}=W^{[2]}\cdot a^{[1](i)}+b^{[2]}$
• $a^{[2](i)}=\sigma(z^{[2](i)})$
• Vectorizing of the above for loop

• $X=\begin{bmatrix}| & | & | & | \\ x^{(1)} & x^{(2)} & ... & x^{(m)} \\ | & | & | & |\end{bmatrix}$, an $(n_x,m)$ matrix
• $Z^{[1]}=W^{[1]}\cdot X+b^{[1]}$
• $A^{[1]}=\sigma(Z^{[1]})$
• $Z^{[2]}=W^{[2]}\cdot A^{[1]}+b^{[2]}$
• $A^{[2]}=\sigma(Z^{[2]})$
• In $Z^{[l]}$ and $A^{[l]}$, horizontally the columns index training examples; vertically the rows index hidden units.
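
A small NumPy sketch of this vectorized forward pass for one hidden layer of 4 units with sigmoid activations; the sizes ($n_x=3$, $m=5$) and variable names are illustrative.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

n_x, n_1, m = 3, 4, 5                     # input features, hidden units, examples
X = np.random.randn(n_x, m)               # each column is one training example

W1 = np.random.randn(n_1, n_x) * 0.01     # (4, 3)
b1 = np.zeros((n_1, 1))                   # (4, 1), broadcast across the m columns
W2 = np.random.randn(1, n_1) * 0.01       # (1, 4)
b2 = np.zeros((1, 1))                     # (1, 1)

Z1 = np.dot(W1, X) + b1                   # (4, m)
A1 = sigmoid(Z1)                          # (4, m): rows = hidden units, cols = examples
Z2 = np.dot(W2, A1) + b2                  # (1, m)
A2 = sigmoid(Z2)                          # (1, m) predictions
```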

#### 1.3.2 Activation Functions

• $g^{[i]}$: activation function of layer $i$
• Sigmoid: $a=\dfrac{1}{1+e^{-z}}$
• Tanh: $a=\dfrac{e^z-e^{-z}}{e^z+e^{-z}}$
• ReLU: $a=\max(0,z)$
• Leaky ReLU: $a=\max(0.01z, z)$
• Rules for choosing an activation function
1. If the output is a binary label in $\{0, 1\}$, use sigmoid for the output layer.
2. Otherwise, ReLU is the default choice for hidden layers.
• Why a non-linear activation function is needed
• With linear (identity) activations, stacking hidden layers is pointless: the whole network collapses to a single linear function $a=w'x+b'$.
• A linear activation is sometimes used at the output layer (e.g. for regression), but the hidden layers should still be non-linear.
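
For reference, a minimal NumPy sketch of these four activation functions (the function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))             # output in (0, 1); use for a binary output layer

def tanh(z):
    return np.tanh(z)                        # output in (-1, 1); zero-centered

def relu(z):
    return np.maximum(0, z)                  # default choice for hidden layers

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)          # small slope for z < 0 avoids dead units
```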

#### 1.3.3 Forward and Backward Propagation

• Derivatives of activation functions
• Sigmoid: $g'(z)=\dfrac{d}{dz}g(z)=\dfrac{1}{1+e^{-z}}(1-\dfrac{1}{1+e^{-z}})=g(z)(1-g(z))=a(1-a)$
• Tanh: $g'(z)=\dfrac{d}{dz}g(z)=1-(\tanh(z))^2$
• ReLU: $g'(z)=\begin{cases}0 & \text{if } z<0 \\ 1 & \text{if } z>0 \\ \text{undefined} & \text{if } z=0\text{ (in practice use 0 or 1)}\end{cases}$
• Leaky ReLU: $g'(z)=\begin{cases}0.01 & \text{if } z<0 \\ 1 & \text{if } z\geq0\end{cases}$
• Gradient descent for neural networks
• Parameters: $W^{[1]}\space (n^{[1]},n^{[0]}),\space b^{[1]}\space (n^{[1]},1),\space W^{[2]}\space (n^{[2]},n^{[1]}),\space b^{[2]}\space (n^{[2]},1)$
• $n^{[0]}=n_x$ input features, $n^{[1]}$ hidden units, $n^{[2]}=1$ output unit
• Cost function: $J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]})=\dfrac{1}{m}\sum\limits_{i=1}^mL(\hat{y},y)$
• Forward propagation:
• $Z^{[1]}=W^{[1]}\cdot X+b^{[1]}$
• $A^{[1]}=g^{[1]}(Z^{[1]})$
• $Z^{[2]}=W^{[2]}\cdot A^{[1]}+b^{[2]}$
• $A^{[2]}=g^{[2]}(Z^{[2]})=\sigma(Z^{[2]})$
• Backpropagation:
• $dZ^{[2]}=A^{[2]}-Y$, where $Y=[y^{(1)},y^{(2)},...,y^{(m)}]$
• $dW^{[2]}=\dfrac{1}{m}dZ^{[2]}A^{[1]T}$
• $db^{[2]}=\dfrac{1}{m}np.sum(dZ^{[2]},axis=1,keepdims=True)$
• $dZ^{[1]}=W^{[2]T}dZ^{[2]}*g'^{[1]}(Z^{[1]})$
• $(n^{[1]},m)$ -> element-wise product -> $(n^{[1]},m)$
• $dW^{[1]}=\dfrac{1}{m}dZ^{[1]}X^{T}$
• $db^{[1]}=\dfrac{1}{m}np.sum(dZ^{[1]},axis=1,keepdims=True)$
• Random Initialization
• $x_1,x_2$ -> $a_1^{[1]},a_2^{[1]}$ -> $a_1^{[2]}$ -> $\hat{y}$
• $W^{[1]}=np.random.randn(2,2)*0.01$ (random, small values; initializing $W$ to zeros would make all hidden units compute the same function, so symmetry is never broken)
• $b^{[1]}=np.zeros((2,1))$
• $W^{[2]}=np.random.randn(1,2)*0.01$
• $b^{[2]}=0$
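
Putting the pieces together, a compact NumPy sketch of one training iteration (forward pass, backward pass, parameter update) for this one-hidden-layer network, assuming tanh hidden units and a sigmoid output; the dictionary layout and sizes are illustrative.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def train_step(X, Y, params, alpha=0.1):
    """One gradient-descent iteration for a 1-hidden-layer network (tanh + sigmoid)."""
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]
    m = X.shape[1]

    # Forward propagation
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)

    # Backward propagation
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)       # g'(z) = 1 - tanh(z)^2
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient-descent update
    params["W1"] = W1 - alpha * dW1
    params["b1"] = b1 - alpha * db1
    params["W2"] = W2 - alpha * dW2
    params["b2"] = b2 - alpha * db2
    return params

# Random initialization for n_x=2 inputs, 2 hidden units, 1 output
params = {"W1": np.random.randn(2, 2) * 0.01, "b1": np.zeros((2, 1)),
          "W2": np.random.randn(1, 2) * 0.01, "b2": np.zeros((1, 1))}
```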

### 1.4 Deep Neural Networks

#### 1.4.1 Deep L-Layer Neural Network

• Deep neural network notation
• $L=4$ (# of layers)
• $n^{[l]}=$ # of units in layer $l$
• $n^{[1]}=5,\space n^{[2]}=5,\space n^{[3]}=3,\space n^{[4]}=n^{[L]}=1$
• $n^{[0]}=n_x=3$
• $a^{[l]}=$ activations in layer $l$
• $a^{[l]}=g^{[l]}(z^{[l]})$; $w^{[l]}=$ weights for $z^{[l]}$; $b^{[l]}=$ bias for $z^{[l]}$
• $x=a^{[0]}$, $\hat{y}=a^{[L]}$

#### 1.4.2 Forward Propagation in a Deep Network

• General rule: $Z^{[l]}=W^{[l]}A^{[l-1]}+b^{[l]},\space A^{[l]}=g^{[l]}(Z^{[l]})$
• Per example: $z^{[1]}=w^{[1]}a^{[0]}+b^{[1]},\space a^{[1]}=g^{[1]}(z^{[1]})$, with $a^{[0]}=x$
• $z^{[2]}=w^{[2]}a^{[1]}+b^{[2]},\space a^{[2]}=g^{[2]}(z^{[2]})$
• ... up to $z^{[L]}=w^{[L]}a^{[L-1]}+b^{[L]},\space a^{[L]}=g^{[L]}(z^{[L]})=\hat{y}$
• Vectorizing over the $m$ examples (see the sketch at the end of this section):
• $Z^{[1]}=W^{[1]}A^{[0]}+b^{[1]},\space A^{[1]}=g^{[1]}(Z^{[1]})$, with $A^{[0]}=X$
• $Z^{[2]}=W^{[2]}A^{[1]}+b^{[2]},\space A^{[2]}=g^{[2]}(Z^{[2]})$
• $\hat{Y}=g^{[L]}(Z^{[L]})=A^{[L]}$
• Matrix dimensions
• $z^{[1]}=w^{[1]}\cdot x+b^{[1]}$
• Example: $z^{[1]}=(3,1),\space w^{[1]}=(3,2),\space x=(2,1),\space b^{[1]}=(3,1)$
• In general: $z^{[1]}=(n^{[1]},1),\space w^{[1]}=(n^{[1]},n^{[0]}),\space x=(n^{[0]},1),\space b^{[1]}=(n^{[1]},1)$
• $w^{[l]}/dw^{[l]}=(n^{[l]},n^{[l-1]}),\space b^{[l]}/db^{[l]}=(n^{[l]},1)$
• $z^{[l]},a^{[l]}=(n^{[l]},1)$; vectorized: $Z^{[l]}/dZ^{[l]},A^{[l]}/dA^{[l]}=(n^{[l]},m)$; for $l=0$, $A^{[0]}=X=(n^{[0]},m)$
• Why deep representation?
• Earlier layers learn simple features (e.g. edges); deeper layers compose them to detect more complex things (e.g. faces).
• Circuit theory and deep learning: Informally: There are functions you can compute with a “small” L-layer deep neural network that shallower networks require exponentially more hidden units to compute.
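
A hedged sketch of the vectorized forward pass over all $L$ layers, caching each $Z^{[l]}$ for the backward pass; the parameter-dictionary keys (`"W1"`, `"b1"`, ...) and the ReLU-hidden / sigmoid-output choice are illustrative assumptions.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def relu(Z):
    return np.maximum(0, Z)

def forward_propagation(X, params, L):
    """Forward pass through an L-layer network: ReLU hidden layers, sigmoid output."""
    A = X                                    # A^[0] = X, shape (n_x, m)
    caches = []
    for l in range(1, L + 1):
        W, b = params["W" + str(l)], params["b" + str(l)]
        Z = np.dot(W, A) + b                 # Z^[l] = W^[l] A^[l-1] + b^[l]
        A = sigmoid(Z) if l == L else relu(Z)
        caches.append((Z, A))                # cache Z^[l] (and A^[l]) for backprop
    return A, caches                         # A is A^[L] = Y_hat
```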

#### 1.4.3 Building Blocks of Deep Neural Networks

• Forward and backward functions
• Layer $l$: parameters $w^{[l]},b^{[l]}$
• Forward: Input $a^{[l-1]}$, output $a^{[l]}$
• $z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]}$ (cache $z^{[l]}$)
• $a^{[l]}=g^{[l]}(z^{[l]})$
• Backward: Input $da^{[l]}, cache(z^{[l]})$, output $da^{[l-1]},dw^{[l]},db^{[l]}$
• One iteration of gradient descent of neural network
• How to implement?
• Forward propagation for layer $l$
• Input $a^{[l-1]}$, output $a^{[l]},cache\space (z^{[l]})$
• $z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]}$
• $a^{[l]}=g^{[l]}(z^{[l]})$

• Vectorized:
• $Z^{[l]}=W^{[l]}A^{[l-1]}+b^{[l]}$
• $A^{[l]}=g^{[l]}(Z^{[l]})$
• Backward propagation for layer $l$
• Input $da^{[l]}, cache(z^{[l]})$, output $da^{[l-1]},dw^{[l]},db^{[l]}$
• $dz^{[l]}=da^{[l]}*g'^{[l]}(z^{[l]})$
• $dw^{[l]}=dz^{[l]}\cdot a^{[l-1]T}$
• $db^{[l]}=dz^{[l]}$
• $da^{[l-1]}=w^{[l]T}\cdot dz^{[l]}$
• Vectorized:
• $dZ^{[l]}=dA^{[l]}*g'^{[l]}(Z^{[l]})$
• $dW^{[l]}=\dfrac{1}{m}dZ^{[l]}A^{[l-1]T}$
• $db^{[l]}=\dfrac{1}{m}np.sum(dZ^{[l]},axis=1,keepdims=True)$
• $dA^{[l-1]}=W^{[l]T}\cdot dZ^{[l]}$
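
A minimal sketch of one layer's backward building block, matching the vectorized formulas above and assuming ReLU hidden activations; the cache layout `(A_prev, W, Z)` is an illustrative assumption.

```python
import numpy as np

def relu_backward(dA, Z):
    """dZ = dA * g'(Z) for ReLU: g'(z) = 1 if z > 0 else 0."""
    return dA * (Z > 0)

def linear_activation_backward(dA, cache):
    """Backward building block for one ReLU layer.

    cache = (A_prev, W, Z) stored during the forward pass.
    Returns dA_prev, dW, db following the vectorized formulas above.
    """
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    dZ = relu_backward(dA, Z)                      # dZ^[l] = dA^[l] * g'^[l](Z^[l])
    dW = np.dot(dZ, A_prev.T) / m                  # dW^[l] = (1/m) dZ^[l] A^[l-1]T
    db = np.sum(dZ, axis=1, keepdims=True) / m     # db^[l]
    dA_prev = np.dot(W.T, dZ)                      # dA^[l-1] = W^[l]T dZ^[l]
    return dA_prev, dW, db
```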

#### 1.4.4 Parameters vs. Hyperparameters

• Parameters: $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]},...$
• Hyperparameters (will affect/control/determine parameters):
• learning rate $\alpha$
• # iterations
• # of hidden units $n^{[1]},n^{[2]},...$
• # of hidden layers
• Choice of activation function
• Later: momentum, mini-batch size, regularization parameters, …

## 2. Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization

### 2.1 Practical Aspects of Deep Learning

#### 2.1.1 Train / Dev / Test sets

• With big data, the dev/test sets may only need 1% of the data or even less.
• Mismatched data: make sure the dev and test sets come from the same distribution.
• Not having a test set might be okay. (Only dev set.)

#### 2.1.2 Bias / Variance

• Assume optimal (Bayes) error $\approx0\%$.
• High bias (underfitting): the model cannot classify the training examples as well as we want.
• Training set error $15\%$, dev set error $16\%$.
• Training set error $15\%$, dev set error $30\%$ (high bias and high variance).
• "Just right": the model classifies the examples about as well as possible.
• Training set error $0.5\%$, dev set error $1\%$.
• High variance (overfitting): the model fits the training set too closely and fails to generalize to the dev set.
• Training set error $1\%$, dev set error $11\%$.
• Training set error $15\%$, dev set error $30\%$ (high bias and high variance).

#### 2.1.3 Basic Recipe for Machine Learning

##### 2.1.3.1 Basic Recipe
• High bias? (training data performance)
• Bigger network
• Train longer
• (NN architecture search)
• High variance? (dev set performance)
• More data
• Regularization
• (NN architecture search)
##### 2.1.3.2 Regularization
• Logistic regression. $\min\limits_{w,b}J(w,b)$
• $w\in\mathbb{R}^{n_x}, b\in\mathbb{R}$
• $\lambda=regularization\space parameter$
• $J(w,b)=\dfrac{1}{m}\sum\limits_{i=1}^mL(\hat{y}^{(i)},y^{(i)})+\dfrac{\lambda}{2m}||w||_2^2$
• L2 regularization: $||w||_2^2=\sum\limits_{j=1}^{n_x}w_j^2=w^Tw$
• L1 regularization: $\dfrac{\lambda}{2m}\sum\limits_{j=1}^{n_x}|w_j|=\dfrac{\lambda}{2m}||w||_1$
• With L1 regularization, $w$ ends up sparse (many zeros in it), which only helps a little (e.g. slightly smaller models).
• Neural network
• $J(w^{[1]},b^{[1]},...,w^{[L]},b^{[L]})=\dfrac{1}{m}\sum\limits_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})+\dfrac{\lambda}{2m}\sum\limits_{l=1}^{L}||w^{[l]}||_F^2$
• $||w^{[l]}||_F^2=\sum\limits_{i=1}^{n^{[l-1]}}\sum\limits_{j=1}^{n^{[l]}}(w_{ij}^{[l]})^2$, where $w^{[l]}$ has shape $(n^{[l]},n^{[l-1]})$
• Frobenius norm: Square root of square sum of all elements in a matrix.
• $dw^{[l]}=(from\space backprop)+\dfrac{\lambda}{m}w^{[l]}$
• $w^{[l]}:=w^{[l]}-\alpha dw^{[l]}$(keep the same)
• Weight decay
• $w^{[l]}:=w^{[l]}-\alpha[(from\space backprop)+\dfrac{\lambda}{m}w^{[l]}]$
• $=w^{[l]}-\dfrac{\alpha\lambda}{m}w^{[l]}-\alpha(from\space backprop)$
• $=(1-\dfrac{\alpha\lambda}{m})w^{[l]}-\alpha(from\space backprop)$
• How does regularization prevent overfitting: a larger $\lambda$ makes $w^{[l]}$ smaller, which makes $z^{[l]}$ smaller, so the activation stays in its nearly linear region (take tanh as an example). The network then behaves almost linearly and cannot draw highly curved decision boundaries.
• Dropout regularization
• Implementing dropout ("inverted dropout"); see the sketch at the end of this subsection.
• Illustrate with layer $l=3$ and $keep\_prob=0.8$ (each unit has a $0.2$ chance of being dropped/zeroed out).
• $d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep\_prob$ # d3 is a boolean mask with the same shape as a3 (True = keep, False = drop).
• $a3 = np.multiply(a3, d3)$ # equivalently a3 *= d3; zeroes out the dropped neurons.
• $a3 /= keep\_prob$ # inverted dropout: scale up so the expected total activation is the same before and after dropout.
• Why it works: a unit can't rely on any single feature, so it has to spread out (shrink) the weights, similar in effect to L2 regularization.
• First make sure $J$ decreases monotonically over iterations with dropout turned off, then turn dropout on.
• Data augmentation
• Images: random crops, flips, distortions, …
• Early stopping
• Stops training at a mid-size $||w||_F^2$, before the weights grow large.
• Downside: it couples optimizing the cost function with preventing overfitting, so the two cannot be tuned independently.
• Orthogonalization
• Work on one task at a time: either optimize the cost function or address overfitting, not both with the same knob.
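
A minimal sketch of the L2-regularized gradient and an inverted-dropout forward step described above; the function names and the `keep_prob` default are illustrative.

```python
import numpy as np

def l2_regularized_grad(dW_backprop, W, lambd, m):
    """Add the L2 term: dW = (from backprop) + (lambda/m) * W."""
    return dW_backprop + (lambd / m) * W

def dropout_forward(a3, keep_prob=0.8):
    """Inverted dropout applied to the activations of (here) layer 3."""
    d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean keep-mask
    a3 = a3 * d3                      # zero out the dropped units
    a3 = a3 / keep_prob               # scale up so expected activation is unchanged
    return a3, d3                     # keep d3 to apply the same mask in backprop
```
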
##### 2.1.3.3 Setting up your optimization problem
• Normalizing training sets
• $x=\begin{bmatrix}x_1 \\ x_2\end{bmatrix}$
• Subtract mean:
• $\mu=\dfrac{1}{m}\sum\limits_{i=1}^{m}x^{(i)}$
• $x:=x-\mu$
• Normalize variance:
• $\sigma^2=\dfrac{1}{m}\sum\limits_{i=1}^{m}(x^{(i)})^2$ (element-wise square, computed after subtracting the mean)
• $x /= \sigma$ (divide by the standard deviation so each feature has unit variance)
• Use the same $\mu,\sigma^2$ to normalize the test set.
• Why normalize inputs?
• When input features are on very different scales, normalizing them makes the cost function more symmetric, so gradient descent can use a larger learning rate and converge faster.
• Vanishing / exploding gradients (in very deep networks):
• If $w^{[l]}>I$ (even just slightly), activations and gradients grow exponentially with depth (exploding).
• If $w^{[l]}<I$ (even just slightly), activations and gradients shrink exponentially with depth (vanishing).
• Weight initialization (single neuron view)
• Large $n$ (number of input features) --> want smaller $w_i$
• $Var(w_i)=\dfrac{1}{n}$ for sigmoid/tanh; $\dfrac{2}{n}$ for ReLU (the scaling constant could be treated as a hyperparameter, but do not tune it in practice)
• $w^{[l]}=np.random.randn(shapeOfMatrix)*np.sqrt(\dfrac{1}{n^{[l-1]}})$; for ReLU use $np.sqrt(\dfrac{2}{n^{[l-1]}})$ (He initialization)
• Xavier initialization: $\sqrt{\dfrac{1}{n^{[l-1]}}}$; sometimes $\sqrt{\dfrac{2}{n^{[l-1]}+n^{[l]}}}$ is used
• Gradient checking: numerically approximate each derivative with the two-sided difference $\dfrac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}\approx f'(\theta)$ (see the sketch at the end of this section).
• Take $W^{[1]},b^{[1]},...,W^{[L]},b^{[L]}$ and reshape them into a big vector $\theta$.
• Take $dW^{[1]},db^{[1]},...,dW^{[L]},db^{[L]}$ and reshape them into a big vector $d\theta$.
• for each i:
• $d\theta_{approx}[i]=\dfrac{J(\theta_1,\theta_2,...,\theta_i+\epsilon,...)-J(\theta_1,\theta_2,...,\theta_i-\epsilon,...)}{2\epsilon}\approx d\theta[i]=\dfrac{\partial J}{\partial \theta_i}$
• Check the relative Euclidean distance $\dfrac{||d\theta_{approx}-d\theta||_2}{||d\theta_{approx}||_2+||d\theta||_2}$ ($||.||_2$ is the Euclidean norm: the square root of the sum of the squared elements).
• Take $\epsilon=10^{-7}$; if the distance above is $\approx10^{-7}$ or smaller, great.
• If it is around $10^{-5}$, take a careful look.
• If it is $10^{-3}$ or bigger, worry: there may be a bug. Check which components of $d\theta_{approx}$ differ most from the corresponding $d\theta[i]$.
• notes:
• Don’t use in training - only to debug.
• If algorithm fails grad check, look at components to try to identify bug.
• Remember regularization: include the $\dfrac{\lambda}{2m}$ term in $J$ and in the gradients.
• Doesn't work with dropout (dropout is random); run grad check with dropout turned off.
• Run at random initialization, and perhaps again after some training (some bugs only appear once $w,b$ move away from $\approx0$).
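
A self-contained sketch of this gradient check on a toy cost function where the true gradient is known; all names here are illustrative, standing in for the reshaped $\theta$ and $d\theta$ of a real network.

```python
import numpy as np

def J(theta):
    """Toy cost function with a known gradient, standing in for the network's J."""
    return theta[0] ** 2 + 3 * theta[1]

def grad(theta):
    """Analytical gradient of J (in a real network this would come from backprop)."""
    return np.array([2 * theta[0], 3.0])

def gradient_check(J, grad, theta, epsilon=1e-7):
    d_theta = grad(theta)
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon                  # J(theta_1, ..., theta_i + eps, ...)
        theta_minus[i] -= epsilon                 # J(theta_1, ..., theta_i - eps, ...)
        d_theta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    # Relative Euclidean distance between approximate and analytic gradients
    numerator = np.linalg.norm(d_theta_approx - d_theta)
    denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return numerator / denominator                # ~1e-7 is great, >1e-3 suggests a bug

theta = np.array([1.5, -2.0])
print(gradient_check(J, grad, theta))
```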