Machine Learning
Intro & Linear regression with one variable
supervised learning & unsupervised learning
supervised learning: learns from labeled data; the model maps an input $x$ to an output $y$
- regression
- classification
unsupervised learning: only unlabeled data
- clustering
- dimensionality reduction
- anomaly detection
linear regression
training set: data used to train the model
cost function: $J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$, where $f_{w,b}(x) = wx + b$
- goal: minimize $J(w,b)$
- it's a bowl-shaped surface
gradient descent
how to move downhill
- algorithm: repeat $w := w - \alpha\frac{\partial J(w,b)}{\partial w}$ and $b := b - \alpha\frac{\partial J(w,b)}{\partial b}$ (simultaneous update); a numpy sketch follows this list
where $\alpha$ is the learning rate, between 0 and 1; it controls how large a step downhill we take. If $\alpha$ is too small, gradient descent will be slow; if $\alpha$ is too large, it can overshoot and fail to converge.
repeat until convergence
- batch gradient descent: each step uses all training data
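A minimal numpy sketch of batch gradient descent for one-variable linear regression; the learning rate, iteration count, and toy data below are arbitrary choices for illustration, not values from these notes:

```python
import numpy as np

def gradient_descent(x, y, alpha, num_iters):
    """Batch gradient descent for f(x) = w*x + b with the squared error cost."""
    w, b = 0.0, 0.0
    for _ in range(num_iters):
        f = w * x + b                  # predictions on all m examples
        dj_dw = np.mean((f - y) * x)   # dJ/dw
        dj_db = np.mean(f - y)         # dJ/db
        w -= alpha * dj_dw             # simultaneous update of w and b
        b -= alpha * dj_db
    return w, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([300.0, 500.0, 700.0, 900.0])
w, b = gradient_descent(x, y, alpha=0.05, num_iters=10000)
```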
Linear regression with multiple variables
multiple features
use vectors: $f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b = w_1x_1 + w_2x_2 + \dots + w_nx_n + b$
vectorization (with numpy)
```python
f = np.dot(w, x) + b
```
- compared to an explicit loop over the features, it's a more efficient way to compute the prediction
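A small sketch contrasting the loop version with the vectorized `np.dot` call (the toy vectors are made up for illustration):

```python
import numpy as np

w = np.array([1.0, 2.5, -3.3])
x = np.array([10.0, 20.0, 30.0])
b = 4.0

# loop version: one multiply-add per feature
f_loop = b
for j in range(len(w)):
    f_loop += w[j] * x[j]

# vectorized version: one call into optimized numpy code, same result
f_vec = np.dot(w, x) + b
```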
normal equation
- only works for linear regression; solves for $w, b$ directly without iterations: $\vec{w} = (X^TX)^{-1}X^T\vec{y}$
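A numpy sketch of the normal equation; prepending a column of ones to absorb the intercept $b$, and using the pseudo-inverse, are choices of this example rather than anything stated in the notes:

```python
import numpy as np

# toy design matrix: the first column of ones absorbs the intercept b
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0],
              [1.0,  852.0]])
y = np.array([400.0, 232.0, 178.0])

# normal equation: theta = (X^T X)^(-1) X^T y, computed with the pseudo-inverse
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
b, w = theta[0], theta[1:]
```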
feature scaling
makes gradient descent run faster
- Z-score normalization: $x_j' = \dfrac{x_j - \mu_j}{\sigma_j}$, where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$
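A sketch of Z-score normalization with numpy; the helper name `zscore_normalize` and the toy matrix are my own:

```python
import numpy as np

def zscore_normalize(X):
    """Scale each feature (column) to zero mean and unit standard deviation."""
    mu = np.mean(X, axis=0)      # per-feature mean
    sigma = np.std(X, axis=0)    # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma

X = np.array([[2104.0, 5.0], [1416.0, 3.0], [852.0, 2.0]])
X_norm, mu, sigma = zscore_normalize(X)
```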
check convergence
if $J(\vec{w},b)$ decreases by less than a small threshold $\varepsilon$ (e.g. $10^{-3}$) in one iteration, we declare convergence
feature engineering
use intuition to design new features, e.g. by transforming or combining the original features
polynomial regression
create polynomial features such as $x^2$ and $x^3$; can be done with scikit-learn (sketch below)
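One way this could look with scikit-learn, assuming `PolynomialFeatures` to engineer the powers of $x$, a scaler, and an ordinary linear model on top; degree 2 and the toy data are arbitrary choices of this sketch:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

x = np.arange(0, 20, 1).reshape(-1, 1)   # single raw feature
y = 1 + x.ravel() ** 2                   # quadratic target

# engineer x, x^2 as features, scale them, then fit ordinary linear regression
model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LinearRegression())
model.fit(x, y)
y_pred = model.predict(x)
```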
Classification
binary classification: true(1) or false(0)
logistic regression
- sigmoid function: $g(z) = \dfrac{1}{1+e^{-z}}$, with $0 < g(z) < 1$
- logistic regression model: $f_{\vec{w},b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) = \dfrac{1}{1+e^{-(\vec{w} \cdot \vec{x} + b)}}$
decision boundary
the line where the model is neutral (with threshold 0.5) between $y=0$ and $y=1$, i.e. where $\vec{w} \cdot \vec{x} + b = 0$
cost function
if we use the squared error cost from linear regression, the cost surface is non-convex (many local minima), so gradient descent can get stuck.
loss function
measures how well we are doing on one training example
loss function for squared error cost
- $L\left(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}\right) = \frac{1}{2}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)^2$
logistic loss function
- $L\left(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}\right) = -\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right)$ if $y^{(i)} = 1$
- $L\left(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}\right) = -\log\left(1 - f_{\vec{w},b}(\vec{x}^{(i)})\right)$ if $y^{(i)} = 0$
- the further the prediction $f_{\vec{w},b}(\vec{x}^{(i)})$ is from the target $y^{(i)}$, the higher the loss
cost function
- $J(\vec{w},b) = \frac{1}{m}\sum_{i=1}^{m} L\left(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}\right)$
simplified loss function
- $L\left(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}\right) = -y^{(i)}\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) - \left(1-y^{(i)}\right)\log\left(1 - f_{\vec{w},b}(\vec{x}^{(i)})\right)$
simplified cost function
- $J(\vec{w},b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1 - f_{\vec{w},b}(\vec{x}^{(i)})\right)\right]$
gradient descent
- how to get $\vec{w}, b$: repeat $w_j := w_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)}$ and $b := b - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)$, where $f_{\vec{w},b}(\vec{x}) = \frac{1}{1+e^{-(\vec{w} \cdot \vec{x} + b)}}$; same update form as linear regression, but with a different $f$
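A numpy sketch of the simplified cost and one gradient-descent step for logistic regression; function and variable names and the toy data are mine, while the formulas follow the expressions above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_cost(X, y, w, b):
    """Simplified cross-entropy cost averaged over the m examples."""
    f = sigmoid(X @ w + b)
    return -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))

def gradient_step(X, y, w, b, alpha):
    """One batch gradient-descent update; same form as linear regression."""
    m = len(y)
    err = sigmoid(X @ w + b) - y          # f_wb(x) - y for every example
    w = w - alpha * (X.T @ err) / m
    b = b - alpha * np.mean(err)
    return w, b

X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y = np.array([0, 0, 1, 1])
w, b = np.zeros(2), 0.0
for _ in range(1000):
    w, b = gradient_step(X, y, w, b, alpha=0.1)
```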
overfitting
high variance
fits the training set extremely well but fails to generalize to new examples
how to address it
- collect more training examples
- select features to include/ exclude
- regularization: reduce the size of the parameters $w_j$
regularization
- cost function with a regularization term: $J(\vec{w},b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2$ ($b$ is usually not regularized)
- gradient descent (regularized linear regression): $w_j := w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}w_j\right]$
- gradient descent (regularized logistic regression): same update, with $f_{\vec{w},b}(\vec{x}) = \frac{1}{1+e^{-(\vec{w} \cdot \vec{x} + b)}}$
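A sketch of the regularized cost and gradients for linear regression in numpy; the λ value and toy data are placeholders, and note that `b` is not regularized:

```python
import numpy as np

def regularized_cost(X, y, w, b, lambda_):
    """Squared error cost plus the (lambda / 2m) * sum(w_j^2) penalty."""
    m = X.shape[0]
    err = X @ w + b - y
    return np.sum(err ** 2) / (2 * m) + lambda_ * np.sum(w ** 2) / (2 * m)

def regularized_gradient(X, y, w, b, lambda_):
    """Gradients of the regularized cost; only w gets the extra (lambda/m) * w_j term."""
    m = X.shape[0]
    err = X @ w + b - y
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w
    dj_db = np.mean(err)
    return dj_dw, dj_db

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]])
y = np.array([3.0, 3.0, 7.0])
dj_dw, dj_db = regularized_gradient(X, y, w=np.array([0.5, 0.5]), b=0.0, lambda_=1.0)
```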
Neural network
dense layer: every neuron gets as its inputs all the activations from the previous layer.
$a_j^{[l]} = g\left(\vec{w}_j^{[l]} \cdot \vec{a}^{[l-1]} + b_j^{[l]}\right)$, where $a_j^{[l]}$ is the activation value of layer $l$, unit $j$, and $g$ is the sigmoid function (an activation function).
- forward propagation: each layer computes `dense(a_in, W, b, g)` from the previous layer's activations (a sketch of the function is below)
- matrix multiplication (vectorized version): `Z = np.matmul(AT, W)`, then apply `g` to `Z`
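One way the `dense` forward-propagation step could be filled in, both the per-unit loop and the `np.matmul` version; the `sigmoid` helper, the array shapes, and the toy numbers are assumptions of this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def dense(a_in, W, b, g):
    """Loop version: W has shape (n_in, n_units), one column per unit."""
    units = W.shape[1]
    a_out = np.zeros(units)
    for j in range(units):
        z = np.dot(W[:, j], a_in) + b[j]   # weighted sum for unit j
        a_out[j] = g(z)                    # apply the activation function
    return a_out

def dense_vectorized(AT, W, b, g):
    """Matrix version: AT has one example per row; all units computed at once."""
    Z = np.matmul(AT, W) + b               # bias term added here
    return g(Z)

a1 = dense(np.array([0.5, 1.5]),
           np.array([[1.0, -3.0, 5.0], [-2.0, 4.0, -6.0]]),
           np.array([0.1, 0.2, 0.3]),
           sigmoid)
```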
sample code
python1
2
3
4
5
6
7
8
9
10
11
12from tensorflow.python.keras.layers import Dense
from tensorflow.python.keras import Sequential
from tensorflow.python.keras.losses import BinaryCrossentropy
model=Sequential([
Dense(units=25,activation='sigmoid'),
Dense(units=15,activation='sigmoid'),
Dense(units=1,activation='sigmoid')
])
model.compile(loss=BinaryCrossentropy)
model.fit(X,Y,epochs=100)
activation function
linear (also means no activation function): $g(z) = z$
- output: any value, negative or positive
sigmoid: $g(z) = \dfrac{1}{1+e^{-z}}$
- output: between 0 and 1
ReLU: $g(z) = \max(0, z)$
- output: 0 or positive
- most common choice for hidden layers
- faster to compute
- flat on only one side; flat regions make gradient descent slow
activation function issue
why do we use non-linear activation functions?
- if we use the linear function for all hidden layers and the output layer, the whole network is equivalent to linear regression
- if we use the linear function for all hidden layers but sigmoid for the output layer, the network is equivalent to logistic regression
adam algorithm
Adaptive Moment Estimation (not just one global learning rate $\alpha$; each parameter gets its own learning rate, adjusted automatically)
```python
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=1e-3), loss=...)
```
convolutional neural network
convolutional layer
Each neuron only looks at part of the previous layer’s inputs.
- why use it?
- faster computation
- needs less training data (less prone to overfitting)
Multiclass Classification
softmax
- $a_j = \dfrac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}}$ for $j = 1, \dots, N$, where $z_j = \vec{w}_j \cdot \vec{x} + b_j$
loss function
- $L = -\log a_j$ if $y = j$ (sparse categorical cross-entropy)
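A small numpy sketch of the softmax and its loss; the max-subtraction is a standard numerical-stability trick added here, not something stated in the notes above:

```python
import numpy as np

def softmax(z):
    """Convert logits z into probabilities that sum to 1."""
    ez = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return ez / ez.sum()

def sparse_categorical_loss(z, y):
    """Cross-entropy loss for the true class index y."""
    a = softmax(z)
    return -np.log(a[y])

z = np.array([2.0, 1.0, 0.1])
loss = sparse_categorical_loss(z, y=0)
```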
sample code
```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    # one linear output unit per class (10 classes assumed here)
    Dense(units=10, activation='linear'),
])
# more numerically accurate implementation of softmax:
# output raw logits and let the loss apply the softmax internally
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))
model.fit(X, Y, epochs=100)
```
Evaluating a model
- training set --60%
- cross validation set --20%
- used to select a model and pick a set of parameters
- test set --20%
- gives a better estimate of how well the model will generalize to new data
the cross-validation and test errors ($J_{cv}$, $J_{test}$) do not include the regularization term
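A sketch of the 60/20/20 split using scikit-learn's `train_test_split`, done as two consecutive calls; the toy arrays and random seed are placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy features
y = np.arange(10)                  # toy targets

# 60% train, then split the remaining 40% evenly into cross-validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.40, random_state=1)
X_cv, X_test, y_cv, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=1)
```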
bias & variance
- high bias (underfit): $J_{train}$ is high, $J_{cv}$ is high
- just right: $J_{train}$ is low, $J_{cv}$ is low
- high variance (overfit): $J_{train}$ is low, $J_{cv}$ is high
regularization issue
- large $\lambda$: high bias (underfit)
- small $\lambda$: high variance (overfit)
learning curve
plot $J_{train}$ and $J_{cv}$ against the number of training examples
Decision trees
choose the feature to split on so as to maximize purity (or minimize impurity)
when to stop?
- a node is 100% one class
- exceeding a maximum depth
- improvement in purity score is below a threshold
- number of examples in a node is below a threshold
impurity (entropy): $H(p_1) = -p_1\log_2(p_1) - (1-p_1)\log_2(1-p_1)$, where $p_1$ is the fraction of examples in the node that are positive
information gain
- information gain $= H\left(p_1^{\text{root}}\right) - \left(w^{\text{left}} H\left(p_1^{\text{left}}\right) + w^{\text{right}} H\left(p_1^{\text{right}}\right)\right)$, where $w^{\text{left}}$, $w^{\text{right}}$ are the fractions of examples going to each branch
- the feature with the highest information gain is chosen to split on
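A sketch of the entropy and information-gain computations described above, for binary labels; the function and variable names and the toy label arrays are my own:

```python
import numpy as np

def entropy(p1):
    """Entropy of a node where a fraction p1 of the examples is positive."""
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(y_node, y_left, y_right):
    """Reduction in entropy from splitting a node into left/right branches."""
    w_left = len(y_left) / len(y_node)
    w_right = len(y_right) / len(y_node)
    return entropy(np.mean(y_node)) - (w_left * entropy(np.mean(y_left))
                                       + w_right * entropy(np.mean(y_right)))

gain = information_gain(np.array([1, 1, 0, 0, 1]), np.array([1, 1, 1]), np.array([0, 0]))
```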
one hot encoding
why? – a categorical feature can take on more than two values
If a categorical feature can take on $k$ values, create $k$ binary features (0 or 1 valued).
| | pointy | floppy | oval |
| --- | --- | --- | --- |
| cat a | 1 | 0 | 0 |
| cat b | 0 | 0 | 1 |
| cat c | 0 | 1 | 0 |
| dog d | 1 | 0 | 0 |
| cat e | 0 | 0 | 1 |
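A sketch of building one-hot columns like those with pandas `get_dummies`; the column name `ear_shape` is an assumption matching the table above:

```python
import pandas as pd

df = pd.DataFrame({"ear_shape": ["pointy", "oval", "floppy", "pointy", "oval"]})
# replace the single categorical column with one 0/1 column per value
one_hot = pd.get_dummies(df, columns=["ear_shape"], dtype=int)
```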