Intro & Linear regression with one variable

supervised learning & unsupervised learning

  • supervised learning: learn from labeled data; input → output

    • regression
    • classification
  • unsupervised learning: only unlabeled data

    • clustering
    • dimensionality reduction
    • anomaly detection

linear regression

$$y=wx+b$$

  • training set: data used to train the model

  • cost function:

    $$J(w,b)=\frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})^2$$

    • goal: minimize $J(w,b)$
    • the cost surface is bowl-shaped (convex)

gradient descent

how to move downhill

  • algorithm

$$w \leftarrow w-\alpha\frac{\partial}{\partial w}J(w,b)$$

$$b \leftarrow b-\alpha\frac{\partial}{\partial b}J(w,b)$$

$$\frac{\partial}{\partial w}J(w,b)=\frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})x^{(i)}$$

$$\frac{\partial}{\partial b}J(w,b)=\frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})$$

where $\alpha$ is the learning rate, a value between 0 and 1 that controls how large a step we take downhill. If $\alpha$ is too small, gradient descent is slow; if $\alpha$ is too large, the update can overshoot the minimum and fail to converge.

repeat both updates (simultaneously) until convergence

  • batch gradient descent: each step uses all training data
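
a minimal numpy sketch of batch gradient descent for one-variable linear regression, following the update rules above (the toy data, learning rate, and iteration count are made-up placeholders):

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, num_iters=1000):
    """Minimize J(w, b) for y ≈ w*x + b, using the full training set each step."""
    m = len(x)
    w, b = 0.0, 0.0
    for _ in range(num_iters):
        y_hat = w * x + b                            # predictions on all m examples
        dj_dw = (1 / m) * np.sum((y_hat - y) * x)    # ∂J/∂w
        dj_db = (1 / m) * np.sum(y_hat - y)          # ∂J/∂b
        w -= alpha * dj_dw                           # simultaneous update
        b -= alpha * dj_db
    return w, b

# toy data roughly on the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.1, 6.9, 9.2])
w, b = batch_gradient_descent(x, y)
```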

Linear regression with multiple variables

multiple features

use vectors

$$f_{\vec{w},b}(\vec{x})=\vec{w}\cdot\vec{x}+b$$

  • vectorization (with numpy)

    ```python
    f = np.dot(w, x) + b
    ```

    • compared to an explicit loop over features, this is a much more efficient way to compute the prediction
  • normal equation

    • solves for w and b directly, without iterations, but only applies to linear regression
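
a minimal sketch of the normal equation, assuming a design matrix X of shape (m, n) and target vector y; a bias column of ones is prepended so that b is solved together with w:

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form solution theta = (X^T X)^{-1} X^T y, no iterations needed."""
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])           # prepend a bias column of ones
    theta = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y   # pinv is more robust than a plain inverse
    b, w = theta[0], theta[1:]
    return w, b
```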

feature scaling

make gradient descent faster

  • Z-score normalization

    $$x_1=\frac{x_1-\mu_1}{\sigma_1}$$
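
a minimal numpy sketch of Z-score normalization applied column-wise to an assumed (m, n) feature matrix X:

```python
import numpy as np

def zscore_normalize(X):
    """Rescale each feature (column) to zero mean and unit standard deviation."""
    mu = np.mean(X, axis=0)       # per-feature mean
    sigma = np.std(X, axis=0)     # per-feature standard deviation
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma      # keep mu, sigma to scale new data the same way
```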

check convergence

if $J(\vec{w},b)$ decreases by $\leqslant\epsilon$ in one iteration, we declare convergence
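
a possible way to code this check, assuming cost values are recorded in a list cost_history once per gradient-descent iteration (a hypothetical bookkeeping variable):

```python
def converged(cost_history, epsilon=1e-3):
    """Declare convergence when the last iteration decreased J by at most epsilon."""
    if len(cost_history) < 2:
        return False
    return (cost_history[-2] - cost_history[-1]) <= epsilon
```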

feature engineering

design new features by transforming or combining the original ones

polynomial regression

with scikit-learn
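
a minimal scikit-learn sketch: expand the input into polynomial features, scale them, then fit linear regression (the degree and the toy data are arbitrary choices):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

# toy 1-D data with a quadratic trend
X = np.arange(0, 20, 1).reshape(-1, 1)
y = 1 + X.ravel() ** 2

# degree-2 feature expansion -> z-score scaling -> linear regression
model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LinearRegression())
model.fit(X, y)
y_pred = model.predict(X)
```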

Classification

binary classification: true(1) or false(0)

logistic regression

  • sigmoid function:

$$g(z)=\frac{1}{1+e^{-z}}$$

  • logistic regression

    $$f_{\vec{w},b}(\vec{x})=\frac{1}{1+e^{-(\vec{w}\cdot\vec{x}+b)}}$$
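
a minimal numpy sketch of the sigmoid and the resulting logistic-regression prediction, assuming a feature matrix X, weight vector w, and bias b:

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z})"""
    return 1 / (1 + np.exp(-z))

def predict(X, w, b, threshold=0.5):
    """Logistic regression: probability that y = 1, then apply the threshold."""
    prob = sigmoid(X @ w + b)
    return (prob >= threshold).astype(int)
```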

decision boundary

the boundary where the model is neutral between y=0 and y=1, i.e. where the prediction equals the 0.5 threshold

$$z=\vec{w}\cdot\vec{x}+b=0$$

cost function

if we use the squared error cost, as in linear regression, the resulting cost surface is non-convex, with many local minima.

  • loss function

    measures how well we are doing on a single training example

    • loss function for squared error cost

      $$L(f_{\vec{w},b}(\vec{x}^{(i)}),y^{(i)})=\frac{1}{2}\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)^2$$

    • logistic loss function

      $$L(f_{\vec{w},b}(\vec{x}^{(i)}),y^{(i)})=\begin{cases}-\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) & \text{if } y^{(i)}=1\\ -\log\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right) & \text{if } y^{(i)}=0\end{cases}$$

      • $y^{(i)}=1$

        $\text{loss}\rightarrow 0$ as $f_{\vec{w},b}(\vec{x}^{(i)})\rightarrow 1$

      • $y^{(i)}=0$

        $\text{loss}\rightarrow 0$ as $f_{\vec{w},b}(\vec{x}^{(i)})\rightarrow 0$

  • cost function

    $$J(\vec{w},b)=\frac{1}{m}\sum_{i=1}^{m}L\left(f_{\vec{w},b}(\vec{x}^{(i)}),y^{(i)}\right)$$

  • simplified loss function

    $$L(f_{\vec{w},b}(\vec{x}^{(i)}),y^{(i)})=-y^{(i)}\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right)-(1-y^{(i)})\log\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right)$$

  • simplified cost function

    $$J(\vec{w},b)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right)+(1-y^{(i)})\log\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right)\right]$$
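
a minimal numpy sketch of this simplified cost (X, y, w, b are assumed inputs; no numerical safeguards such as clipping are included):

```python
import numpy as np

def logistic_cost(X, y, w, b):
    """Simplified logistic cost J(w, b), averaged over all m examples."""
    m = X.shape[0]
    f = 1 / (1 + np.exp(-(X @ w + b)))                 # predictions in (0, 1)
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)    # per-example logistic loss
    return np.sum(loss) / m
```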

gradient descent

$$\frac{\partial}{\partial w_j}J(\vec{w},b)=\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)x_j^{(i)}\quad(*)$$

$$\frac{\partial}{\partial b}J(\vec{w},b)=\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)$$

  • how to get $(*)$

    $$\begin{aligned}\frac{\partial}{\partial w_j}J(\vec{w},b)&=-\frac{1}{m}\sum_{i=1}^{m}\left(\frac{y^{(i)}}{f}-\frac{1-y^{(i)}}{1-f}\right)\frac{\partial f}{\partial w_j}\\&=-\frac{1}{m}\sum_{i=1}^{m}\frac{y^{(i)}-f}{f(1-f)}\cdot\frac{\partial f}{\partial w_j}\\&=-\frac{1}{m}\sum_{i=1}^{m}\frac{y^{(i)}-f}{f(1-f)}\,f(1-f)\,x_j^{(i)}\\&=\frac{1}{m}\sum_{i=1}^{m}\left(f-y^{(i)}\right)x_j^{(i)}\end{aligned}$$

    where $f=f_{\vec{w},b}(\vec{x}^{(i)})=\frac{1}{1+e^{-(\vec{w}\cdot\vec{x}^{(i)}+b)}}$, so $\frac{\partial f}{\partial w_j}=f(1-f)\,x_j^{(i)}$ (used in the third line)
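
a minimal numpy sketch of the gradients $(*)$ in vectorized form, assuming X is an (m, n) matrix and y a length-m 0/1 label vector:

```python
import numpy as np

def logistic_gradient(X, y, w, b):
    """Vectorized gradients from (*)."""
    m = X.shape[0]
    f = 1 / (1 + np.exp(-(X @ w + b)))   # f_{w,b}(x^(i)) for every example
    err = f - y
    dj_dw = (X.T @ err) / m              # shape (n,)
    dj_db = np.sum(err) / m
    return dj_dw, dj_db
```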

overfitting

high variance

fits the training set extremely well, but generalizes poorly to new examples

  • how to address it

    • collect more training examples
    • select which features to include / exclude
    • regularization: shrink the size of the parameters $w_j$
  • regularization

    $$\min_{\vec{w},b}J(\vec{w},b)=\min_{\vec{w},b}\left[\frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)^2+\frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2\right]$$

    • gradient descent

      $$w_j \leftarrow w_j\left(1-\alpha\frac{\lambda}{m}\right)-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)x_j^{(i)}$$

      $$b \leftarrow b-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)$$
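
a minimal sketch of one regularized gradient-descent step for linear regression (the same structure applies to logistic regression if the prediction is passed through the sigmoid); X, y, and the hyperparameters are assumed inputs:

```python
import numpy as np

def regularized_gradient_step(X, y, w, b, alpha, lambda_):
    """One gradient-descent step with L2 regularization on w (b is not regularized)."""
    m = X.shape[0]
    # errors for regularized linear regression; for logistic regression,
    # replace X @ w + b with sigmoid(X @ w + b)
    err = (X @ w + b) - y
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w
    dj_db = np.sum(err) / m
    w = w - alpha * dj_dw
    b = b - alpha * dj_db
    return w, b
```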

Neural network

dense layer: every neuron takes as input all the activations from the previous layer.

$$\vec{x}=\vec{a}^{[0]}$$

$$a_j^{[l]}=g\left(\vec{w}_j^{[l]}\cdot\vec{a}^{[l-1]}+b_j^{[l]}\right)$$

where $a_j^{[l]}$ is the activation value of unit $j$ in layer $l$, and $g$ is the sigmoid function (an activation function).

  • forward propagation
```python
import numpy as np

def dense(a_in, W, b, g):
    """One dense layer, unit by unit: column j of W holds the weights of unit j."""
    units = W.shape[1]
    a_out = np.zeros(units)
    for j in range(units):
        w = W[:, j]
        z = np.dot(w, a_in) + b[j]
        a_out[j] = g(z)
    return a_out


# simplification: a single matrix multiply replaces the loop over units
def dense(AT, W, b, g):
    z = np.matmul(AT, W) + b
    a_out = g(z)
    return a_out


def sequential(x, g):
    # W1..W4 and b1..b4 are the (already trained) parameters of the four layers
    a1 = dense(x, W1, b1, g)
    a2 = dense(a1, W2, b2, g)
    a3 = dense(a2, W3, b3, g)
    a4 = dense(a3, W4, b4, g)
    f_x = a4
    return f_x
```
  • matrix multiplication

$$Z=A^TW$$

```python
Z = np.matmul(AT, W)
# or equivalently
Z = AT @ W
```
  • sample code

    ```python
    from tensorflow.keras.layers import Dense
    from tensorflow.keras import Sequential
    from tensorflow.keras.losses import BinaryCrossentropy

    model = Sequential([
        Dense(units=25, activation='sigmoid'),
        Dense(units=15, activation='sigmoid'),
        Dense(units=1, activation='sigmoid'),
    ])

    # the loss object must be instantiated
    model.compile(loss=BinaryCrossentropy())
    model.fit(X, Y, epochs=100)
    ```

activation function

  • linear (equivalent to using no activation function)

    • output: any real value (negative or positive)
  • sigmoid

    • output: between 0 and 1
  • ReLU

    $$g(z)=\max(0,z)$$

    • output: 0 or positive
    • most common choice
      • faster to compute
      • flat on only one side; flat regions slow gradient descent down

activation function issue

why do we need non-linear activation functions?

  • linear activation for all hidden layers and the output layer ⟺ the whole network is equivalent to linear regression
  • linear activation for all hidden layers but sigmoid for the output layer ⟺ the whole network is equivalent to logistic regression

adam algorithm

Adaptive Moment estimation: maintains a separate, automatically adapted learning rate for each parameter (not just one global $\alpha$)

```python
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=1e-3), loss=...)
```

convolutional neural network

convolutional layer

Each neuron only looks at a limited region of the previous layer's activations.

  • why
    • faster
    • less training data

Multiclass Classification

softmax

$$z_j=\vec{w}_j\cdot\vec{x}+b_j,\quad j=1,\dots,N$$

$$a_j=\frac{e^{z_j}}{\sum_{k=1}^{N}e^{z_k}}=P(y=j\mid\vec{x})$$

  • loss function

    $$\text{loss}(a_1,\dots,a_N,y)=\begin{cases}-\log(a_1) & \text{if } y=1\\ -\log(a_2) & \text{if } y=2\\ \;\;\vdots \\ -\log(a_N) & \text{if } y=N\end{cases}$$

  • sample code

    ```python
    from tensorflow.keras.layers import Dense
    from tensorflow.keras import Sequential
    from tensorflow.keras.losses import SparseCategoricalCrossentropy

    model = Sequential([
        Dense(units=25, activation='relu'),
        Dense(units=15, activation='relu'),
        Dense(units=10, activation='linear'),  # one unit per class (10 classes here as an example)
    ])

    # more numerically accurate implementation of softmax:
    # keep the output layer linear and fold softmax into the loss via from_logits=True
    model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))
    model.fit(X, Y, epochs=100)
    ```
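
    since the output layer is linear and `from_logits=True` is used, the model outputs logits; apply a softmax afterwards to read them as probabilities (a sketch, with `X_new` a hypothetical batch of new inputs):

    ```python
    import tensorflow as tf

    logits = model(X_new)                       # raw scores z_1..z_N from the linear output layer
    probs = tf.nn.softmax(logits)               # convert logits to class probabilities
    predicted_class = tf.argmax(probs, axis=1)  # most likely class for each input
    ```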

Evaluating a model

  • training set – 60%
  • cross validation set – 20%
    • $J_{cv}(\vec{w},b)$ is used to select a model (pick a set of parameters)
  • test set – 20%
    • $J_{test}(\vec{w},b)$ is a better estimate of how well the model will generalize to new data

The errors above do not include the regularization term.

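a common way to produce this 60/20/20 split, using scikit-learn's train_test_split twice since it only splits into two parts at a time (X and y are assumed data arrays):

```python
from sklearn.model_selection import train_test_split

# first carve out 60% for training, then split the remaining 40% in half
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.40, random_state=1)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=1)
```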

bias & variance

  • high bias – underfit ← $J_{train}(\vec{w},b)$ is high, $J_{cv}(\vec{w},b)$ is high
  • just right ← $J_{train}(\vec{w},b)$ is low, $J_{cv}(\vec{w},b)$ is low
  • high variance – overfit ← $J_{train}(\vec{w},b)$ is low, $J_{cv}(\vec{w},b)$ is high

regularization issue

  • large $\lambda$ ⇒ high bias
  • small $\lambda$ ⇒ high variance

learning curve

Decision trees

  • choose feature to be split on – maximize purity

  • when to stop?

    • a node is 100% one class
    • exceeding a maximum depth
    • improvement in purity score is below a threshold
    • number of examples in a node is below a threshold

impurity (entropy)

$$p_0=1-p_1$$

$$H(p_1)=-p_1\log_2(p_1)-p_0\log_2(p_0)$$

information gain

$$\text{Information Gain}=H(p_1^{root})-\left(w^{left}H(p_1^{left})+w^{right}H(p_1^{right})\right)$$

  • the feature with the highest information gain is chosen to split on.
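
a minimal numpy sketch of these two formulas, assuming 0/1 label arrays for the root node and the two branches of a candidate split:

```python
import numpy as np

def entropy(p1):
    """H(p1), with the convention 0 * log2(0) = 0."""
    if p1 == 0 or p1 == 1:
        return 0.0
    p0 = 1 - p1
    return -p1 * np.log2(p1) - p0 * np.log2(p0)

def information_gain(y_root, y_left, y_right):
    """y_* are 0/1 label arrays at the root node and in the two branches of a split."""
    w_left = len(y_left) / len(y_root)
    w_right = len(y_right) / len(y_root)

    def node_entropy(y):
        return entropy(np.mean(y)) if len(y) > 0 else 0.0

    return node_entropy(y_root) - (w_left * node_entropy(y_left) + w_right * node_entropy(y_right))
```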

one hot encoding

why? – because a categorical feature can take on more than two values

If a categorical feature can take on $k$ values, create $k$ binary features (0/1 valued).

|       | pointy | floppy | oval |
| ----- | ------ | ------ | ---- |
| cat a | 1      | 0      | 0    |
| cat b | 0      | 0      | 1    |
| cat c | 0      | 1      | 0    |
| dog d | 1      | 0      | 0    |
| cat e | 0      | 0      | 1    |
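
a minimal pandas sketch of one-hot encoding; the ear_shape column and its values are hypothetical, chosen to mirror the table above:

```python
import pandas as pd

# hypothetical categorical feature with k = 3 values
df = pd.DataFrame({"ear_shape": ["pointy", "oval", "floppy", "pointy", "oval"]})

# one 0/1-valued column per category
one_hot = pd.get_dummies(df, columns=["ear_shape"], dtype=int)
print(one_hot)
```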