Sunday, November 10, 2019

Artificial Neural Network

Logical Computations With Neurons
* A neuron has one or more binary inputs and one binary output.
* The output activates when a certain number of inputs are active.

Perceptron
* Based on the threshold logic unit (TLU).
* A TLU applies a weight to each input, computes the weighted sum, then passes it through a step function.
* The step function can be the Heaviside or the sign function.
* To solve the XOR problem, a multi-layer perceptron is needed.
* Called a feedforward neural network because the signal flows in one direction only:
* Input layer - hidden layer(s) - output layer
* When there are many hidden layers, it is called a deep neural network; this is what deep learning studies.

Backpropagation
* Used to train multi-layer perceptrons.
* Run the instances forward through the network -> compute the loss -> go backwards to measure how much each connection contributed to the error -> tweak the weights to minimise the error.
* It is important to initialise the weights randomly.
* The activation function must be changed away from Heaviside or sign, because those are flat (zero gradient). The original paper used the logistic function: 1 / (1 + exp(-z)).

Keras
* A high-level wrapper to access TensorFlow.
* See my own code to see how it works.
* The flow: create model -> compile -> train -> evaluate
* Start with the Sequential API.
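A minimal sketch of that flow with the Sequential API, assuming a toy classification task with 8 features and 3 classes (the data below is random placeholder data, just to make it runnable):

```python
import numpy as np
from tensorflow import keras

X_train = np.random.rand(1000, 8)             # placeholder data
y_train = np.random.randint(0, 3, size=1000)  # placeholder labels (3 classes)

# create model
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=(8,)),
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])
# compile -> train -> evaluate
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5)
model.evaluate(X_train, y_train)
```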

Functional API
* The output layer gets a direct connection to the input, alongside the stack of hidden layers.
* This enables the network to learn both deep patterns (through the stack) and simple patterns (through the short path), as in the sketch below.
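A minimal Functional API sketch of this wide-and-deep idea; the input shape and layer sizes are assumed values:

```python
from tensorflow import keras

input_ = keras.layers.Input(shape=(8,))
hidden1 = keras.layers.Dense(30, activation="relu")(input_)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.Concatenate()([input_, hidden2])  # direct (wide) path + deep path
output = keras.layers.Dense(1)(concat)
model = keras.Model(inputs=[input_], outputs=[output])
```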

Subclassing API
* Subclass keras.Model and build your own architecture.
* Everything the Sequential and Functional APIs can do fits inside it, but we lose Keras's ability to inspect and analyse the model graph.
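A minimal Subclassing sketch of the same wide-and-deep idea, with the forward pass written imperatively in call(); layer sizes are assumed:

```python
from tensorflow import keras

class WideAndDeepModel(keras.Model):
    def __init__(self, units=30, **kwargs):
        super().__init__(**kwargs)
        self.hidden1 = keras.layers.Dense(units, activation="relu")
        self.hidden2 = keras.layers.Dense(units, activation="relu")
        self.out = keras.layers.Dense(1)

    def call(self, inputs):
        hidden1 = self.hidden1(inputs)
        hidden2 = self.hidden2(hidden1)
        concat = keras.layers.concatenate([inputs, hidden2])  # wide + deep paths
        return self.out(concat)

model = WideAndDeepModel()
```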

Hyperparameters
* Layers: start from one, then maybe up to three. Deeper networks are actually more parameter-efficient: they can reach the same performance with fewer parameters overall.
* Number of neurons per layer: either a pyramid shape or the same number in every layer. The number of layers matters more.

Saturday, November 9, 2019

Unsupervised Learning

* It works without labels: just data.
* Three tasks: clustering, anomaly detection, density estimation.

Clustering
* Separates the data into different clusters.
* Hard clustering: each instance must belong to exactly one cluster.
* Soft clustering: each instance gets a score per cluster, e.g. based on its distance from the centroid.

K-Means
* If we already have the clusters, it is easy to assign each training instance to its closest centroid.
* If we already have the labels, it is easy to calculate the centroid (mean) of each cluster.
* No clusters and no labels? => pick k random centroids, label the instances, update the centroids (compute the means), relabel, and so on until it converges.
* A poor initialisation might lead to a local minimum.
* How to select the initial centroids?
** use another algorithm to roughly know where they are.
** run the algorithm multiple times and select the best run.
** inertia = the sum of the squared distances between each instance and its closest centroid; the run with the lowest inertia wins.
* K-Means++ improves the initialisation:
** after the first centroid is picked, each subsequent centroid is chosen by weighted randomisation that favours points far away from the centroids already selected.
* Accelerated K-Means uses the triangle inequality to skip distance computations.
* Mini-batch K-Means does not use the whole training set per iteration.
** A bit worse inertia, but it is faster than the normal one.
* Deciding the number of clusters
** inertia alone is not valid for this, because it keeps shrinking as the number of clusters grows.
** picking the "elbow" on the inertia graph is a bit coarse.
** instead, use the silhouette score: (b - a) / max(a, b), where a is the mean distance to the other instances in the same cluster and b is the mean distance to the instances of the next closest cluster (see the sketch after this list).
** when the score approaches +1, the instance is well inside its cluster
** when the score is around zero, it is near a cluster boundary
** when the score approaches -1, it might belong to another cluster
** a silhouette diagram (one silhouette per cluster) is also useful
* K-Means has its limits: it struggles with clusters of varying sizes, varying densities, or non-spherical shapes.
* K-Means can be used for image segmentation, dimensionality reduction, and semi-supervised learning.
* Semi-supervised: label the representative instances, i.e. the instances closest to each centroid, and propagate those labels.
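A minimal sketch of picking k with inertia and the silhouette score, assuming synthetic blob data from scikit-learn:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(2, 7):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)
    # inertia keeps dropping as k grows; the silhouette score peaks near the right k
    print(k, kmeans.inertia_, silhouette_score(X, labels))
```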

DBSCAN
* Defines a cluster as a continuous region of high density.
* It works well when the clusters are separated by low-density regions.
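A minimal sketch on moon-shaped data, where DBSCAN finds the two dense regions; eps and min_samples are assumed values:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)   # label -1 marks points treated as noise
```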

Gaussian Mixture Model
* It assumes the instances were generated from a mixture of several Gaussian distributions.
* Decide the number of clusters using a theoretical information criterion:
** Bayesian Information Criterion (BIC)
** Akaike Information Criterion (AIC)
** BIC tends to select simpler models, while AIC tends to fit the data better.
* The probability function gives the possibility of an outcome x given parameters theta.
* The likelihood function gives the plausibility of parameters theta given an observed outcome x.
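A minimal sketch of choosing the number of components with BIC/AIC, assuming synthetic blob data:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

for k in range(1, 6):
    gm = GaussianMixture(n_components=k, n_init=10, random_state=42)
    gm.fit(X)
    print(k, gm.bic(X), gm.aic(X))   # lower is better for both criteria
```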

Friday, November 8, 2019

Dimensionality Reduction

* It is hard to imagine anything higher than 3D.
* In 2D, the chance that a random point lies very close to the border is low; in high dimensions, almost every point is close to a border.
* In 2D, two random points are on average close to each other, whereas in high dimensions they are far apart.
* It is called the Curse of Dimensionality for a reason.

Projection
* Project the data from a higher-dimensional space down to a lower-dimensional subspace.

Manifold Learning
* Models the manifold on which the training instances lie.
* Based on the manifold hypothesis: most real-world high-dimensional datasets lie close to a much lower-dimensional manifold.
* Think of a Swiss roll.

PCA
* Principal Component Analysis
* Identifies the hyperplane closest to the data and projects the data onto that hyperplane.
* Choose the axis that preserves the maximum amount of variance.
* Principal components: the first is the axis with the largest variance of the training set; the second is the axis, orthogonal to the first, with the largest remaining variance; and so on.
* With n features (high-dimensional data), PCA finds n principal components, one axis per dimension, ordered by the variance they capture.
* An easy way to compute them is Singular Value Decomposition (SVD): X = U S V_T, where the columns of V are the principal components.
* The SVD must be applied to data centred on the origin.
* The projected matrix is X_d = X . W_d, where W_d holds the first d columns of V; choose d, the number of dimensions to project onto. In terms of shapes: (m x n)(n x d) = m x d => d dimensions.
* explained_variance_ratio describes the proportion of the dataset's variance that lies along the axis of each principal component.
** From these values, we know which axes have no significant impact.
** It can be used to determine the number of dimensions we want; normally we keep 95% of the variance.
* We can decompress the data by applying X_recovered = X_d . W_d_T (see the sketch below).
* Some information is lost, depending on the discarded variance. The mean squared distance between the original data and the reconstructed data is called the reconstruction error.
* Randomized PCA: faster, uses a stochastic approach to find the first d principal components.
* Incremental PCA: does not need the whole training set to be loaded into memory.
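A minimal PCA sketch, where X is just a random placeholder matrix: keep 95% of the variance, inspect the ratios, then reconstruct to estimate the reconstruction error:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 10)          # placeholder data (m x n)

pca = PCA(n_components=0.95)         # choose d so that 95% of the variance is preserved
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_) # variance carried by each kept axis

X_recovered = pca.inverse_transform(X_reduced)
reconstruction_error = np.mean(np.sum((X - X_recovered) ** 2, axis=1))
```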

LLE
* Locally Linear Embedding
* Relies on each instance's closest neighbours.
* Looks for a low-dimensional representation in which these local neighbour relationships are preserved (see the sketch below).
* First step: find the weights w such that the distance between each x and sum(w . x_neighbours) is minimal.
* Second step: keeping the weights fixed, find the projected z such that the distance between each z and sum(w . z_neighbours) is minimal.
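A minimal LLE sketch on the Swiss roll mentioned above; n_neighbors is an assumed value:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=42)
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X)   # 2D embedding that preserves local neighbourhoods
```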

Ensemble

* Use diversified classifiers so that their errors are as independent of each other as possible.

Voting
* Combine a few algorithms and predict the class with the most votes.
* Law of large numbers:
** as the number of predictors increases, the ensemble's accuracy gets closer and closer to the probability value it is supposed to have.
* Soft voting vs hard voting, e.g. with predicted probabilities 0.3, 0.3, 0.9 for the positive class:
** Hard voting: No (two "no" votes win)
** Soft voting: Yes (the average is 0.5)
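A minimal voting sketch combining three different algorithms; with voting="soft" the averaged class probabilities decide, with voting="hard" the majority class wins (data and hyperparameters are assumed):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
voting_clf = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("dt", DecisionTreeClassifier()),
                ("svc", SVC(probability=True))],  # probability=True is needed for soft voting
    voting="soft")
voting_clf.fit(X, y)
```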

Bagging and Pasting
* Instead of using different algorithms, use the same algorithm for every predictor,
* but train each one on a different random subset of the training set.
* Bagging samples the subsets with replacement (the same instance can appear more than once).
* Pasting samples the subsets without replacement.
* Training can be parallelised across different cores.
* It is good because it scales well.
* Sampling features is supported as well:
** sampling both features and instances is called Random Patches;
** sampling features only is called Random Subspaces.
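A minimal bagging sketch: 500 trees, each trained on a bootstrap sample of 100 instances; setting bootstrap=False would give pasting instead (data and sizes are assumed):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True,   # bootstrap=False => pasting
    n_jobs=-1, random_state=42)        # n_jobs=-1 uses all available cores
bag_clf.fit(X, y)
```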

Random Forest
* A bagging ensemble of decision trees.
* Introduces extra randomness by searching for the best feature among a random subset of features at each split.
* Extremely Randomized Trees (Extra-Trees) go further by also choosing a random threshold for each candidate feature.
* Compare a normal Random Forest and an Extra-Trees model to see which one performs better.
* Feature importance: Random Forests are also good for measuring the relative importance of each feature.
** Measured by how much the feature reduces impurity on average,
** weighted by the number of training samples associated with each node.
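A minimal sketch on the iris dataset, fitting both a Random Forest and an Extra-Trees model and printing the feature importances:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
ext_clf = ExtraTreesClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)
ext_clf.fit(iris.data, iris.target)

# relative importance of each feature, as measured by the impurity reduction
for name, score in zip(iris.feature_names, rnd_clf.feature_importances_):
    print(name, score)
```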

Boosting
* Combines several weak predictors into a stronger predictor.

Adaptive Boosting (AdaBoost)
* Train a predictor -> compute its error rate -> increase the weight of the instances it got wrong -> train the next predictor -> rinse and repeat until the number of predictors is reached or a perfect predictor is found.
* To predict, each predictor's vote is weighted according to the predictor weight computed from its error rate.
* Error rate -> predictor weight -> instance weights updated -> retrain.
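A minimal AdaBoost sketch: 200 shallow trees trained sequentially, each paying more attention to the instances its predecessors got wrong (data and hyperparameters are assumed):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # weak learner: a decision stump
    n_estimators=200, learning_rate=0.5, random_state=42)
ada_clf.fit(X, y)
```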

Gradient Boosting
* Instead of reweighting instances, it feeds the residual errors of one predictor to the next predictor.
* Combine it with early stopping to pick the right number of trees (see the sketch below).
* Instead of the full training set, each tree can be trained on a random subset; this is called Stochastic Gradient Boosting.
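A minimal sketch of gradient boosting with early stopping via staged_predict, on assumed synthetic regression data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=200, random_state=42)
gbrt.fit(X_train, y_train)

# validation error after each added tree; keep only the best number of trees
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
best_n = int(np.argmin(errors)) + 1

# subsample=0.25 here would give Stochastic Gradient Boosting
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_n, random_state=42)
gbrt_best.fit(X_train, y_train)
```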

Stacking
* Use another predictor as the aggregator of all the predictors.
* A common strategy is to use a hold-out set (blending):
** split the training set into one part for the base predictors and one hold-out part;
** get each base predictor's prediction on the hold-out instances;
** these predictions become the new input features;
** train another predictor (the blender) on this new training set.
* The hold-out idea can be repeated to build multiple blending layers.
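A minimal hand-rolled blending sketch (not a library API): base predictors are trained on one half, their predictions on the hold-out half become the new features, and a blender is trained on those; all estimators and sizes are assumed choices:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=42)

base_predictors = [RandomForestClassifier(random_state=42), SVC(random_state=42)]
for clf in base_predictors:
    clf.fit(X_train, y_train)

# predictions on the hold-out set become the new input features
X_blend = np.column_stack([clf.predict(X_hold) for clf in base_predictors])
blender = LogisticRegression()
blender.fit(X_blend, y_hold)
```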

Decision Tree

* Gini impurity => G = 1 - sum_k (p_k)^2, where p_k is the ratio of class k instances among the node's training instances.
* A pure node has zero impurity.
* The Classification and Regression Tree (CART) algorithm is used for training.

* Classifier cost function:
** (m_left/m * G_left) + (m_right/m * G_right)
** G is the impurity

* Sometimes entropy is used instead of Gini.
** Gini tends to isolate the most frequent class in its own branch;
** entropy tends to produce slightly more balanced trees;
** Gini is faster to compute.

* Regression cost function
** (m_left/m * MSE_left) + (m_right/m * MSE_right)

* It is a nonparametric model because the model structure is free to stick closely to the data.
* A parametric model has a predefined number of parameters.
* Nonparametric models tend to overfit the data, while parametric models tend to underfit.

* Decision trees also love orthogonal decision boundaries, which causes problems with data whose boundaries are not orthogonal, e.g. a rotated dataset.
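A minimal sketch on the iris dataset: restricting max_depth is one way to keep this nonparametric model from sticking too closely to the data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)  # criterion="gini" by default
tree_clf.fit(iris.data, iris.target)

# class probabilities for one (made-up) flower, estimated from its leaf node
print(tree_clf.predict_proba([[5.0, 3.0, 4.5, 1.5]]))
```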

Sunday, November 3, 2019

Support Vector Machine

Linear SVM Classification
* Creates a decision boundary (a "street") that separates the instances as widely as possible.
* The instances located on the edge of the street are called the support vectors.
* Hard margin classification is often not possible because of outliers (and it requires linearly separable data).
* Soft margin classification is more flexible.
* Use the hyperparameter C to control how narrow or wide the street is (a smaller C gives a wider street with more margin violations).

Nonlinear SVM Classification
* Sometimes the data is not linearly separable.
* Method 1: add polynomial features,
* or use a polynomial kernel instead.
* Method 2: add similarity features (Gaussian Radial Basis Function),
* or use the RBF kernel instead (see the sketch below).
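A minimal sketch of both kernel approaches on moon-shaped data; degree, coef0, gamma and C are assumed values:

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.15, random_state=42)

poly_kernel_svm = make_pipeline(StandardScaler(),
                                SVC(kernel="poly", degree=3, coef0=1, C=5))
rbf_kernel_svm = make_pipeline(StandardScaler(),
                               SVC(kernel="rbf", gamma=5, C=0.001))
poly_kernel_svm.fit(X, y)
rbf_kernel_svm.fit(X, y)
```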

How to choose which kernel to use?
* Try the linear kernel first.
* If that is not enough, try the RBF kernel.
* If neither works, use cross-validation and grid search to tune the hyperparameters before trying other kernels.

SVM Regression
* Reverses the objective of SVM classification:
* instead of fitting the widest street that keeps the instances off it, fit as many instances as possible on the street,
* so the margin violations become the instances that fall off the street (see the sketch below).
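A minimal SVM regression sketch on made-up linear data; epsilon controls the width of the street, and instances inside the street do not affect the model:

```python
import numpy as np
from sklearn.svm import LinearSVR

X = np.random.rand(200, 1)
y = 4 + 3 * X.ravel() + np.random.randn(200)   # noisy linear data

svm_reg = LinearSVR(epsilon=1.5, random_state=42)
svm_reg.fit(X, y)
```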

Under the hood
y = 0 if wx + b < 0, 1 if wx + b >= 0
Differences with Linear Regression
* there is no need to add the bias b as a feature x0 to x
* w is used instead of theta

Cost function
* The weight vector of the decision function is w.
* Minimise ||w|| so that the margin of the street is big.
* At the same time, all positive instances need a decision function >= 1 and all negative instances <= -1.
* It is a quadratic programming (QP) problem.
* The function to minimise is 0.5 * w.T * w
* subject to: t(i) * (w.x(i) + b) >= 1, where t(i) = -1 when y(i) = 0 and t(i) = +1 when y(i) = 1.
* Introduce slack variables to measure how much violation we allow (soft margin).
* C => the hyperparameter that trades off the minimisation objective against the constraint violations.

Dual Problem
* The QP problem above is the primal problem of the SVM; it can also be solved through its dual problem.
* The dual form of the cost function is: 0.5 * sum_{i=1..m} sum_{j=1..m} alpha(i) alpha(j) t(i) t(j) x(i).x(j) - sum_{i=1..m} alpha(i).
* Objective: find the alpha vector that minimises this equation (with alpha(i) >= 0).
* From the alphas, we can calculate w and b with another equation.

Kernel Trick
* We can skip the polynomial transformation: for degree 2, just take the dot product of the original vectors and square it.
* Common kernels: linear, polynomial, Gaussian RBF, sigmoid.
* The idea: the kernel computes the dot product of the transformed vectors without ever transforming them.
* Why does this matter? Most of the time the data is not linearly separable, so a transformation to a higher-dimensional space is needed; this is where the kernel trick comes in. It lets us operate in the original space without computing the coordinates of the transformed space (see the check below).
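A small numeric check of that idea for a degree-2 polynomial kernel: (a.b)^2 equals the dot product of the explicitly transformed vectors phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so the transformation never has to be computed:

```python
import numpy as np

def phi(x):
    # explicit degree-2 feature map for a 2D vector
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

a = np.array([2.0, 3.0])
b = np.array([4.0, 5.0])

print(np.dot(a, b) ** 2)       # kernel trick: (a.b)^2 = 529
print(np.dot(phi(a), phi(b)))  # explicit transformation gives the same 529
```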

Hinge Loss
max(0, 1 - t): it is 0 when t >= 1; otherwise it is 1 - t, which grows as the instance moves further into (or across) the street.

Logistic Regression

Instead of outputting a value like linear regression, it computes a probability from that value:
p = logistic(X.theta)
* logistic is the inverse of the logit function:
logistic(t) = 1 / (1 + exp(-t))

y = 0 if p < 0.5, which also means t < 0
y = 1 if p >= 0.5, which also means t >= 0

The cost function is:
c(theta) = -log(p) if y = 1, -log(1-p) if y = 0
* it pushes the probability up for positive instances
* and pushes it down for negative instances
* there is no normal equation to solve it; a gradient descent approach is needed
* the partial derivative is similar to linear regression's, with the predicted probability in place of the predicted value

Decision Boundary
* The boundary on which y = 0 and y = 1 are equally likely, i.e. p = 0.5.

* LogisticRegression.predict_proba returns two columns: column 0 is the probability of class 0, column 1 is the probability of class 1 (see the sketch below).
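A minimal sketch on one iris feature (petal width), showing the two columns returned by predict_proba:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 3:]                  # petal width only
y = (iris.target == 2).astype(int)    # 1 if Iris virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)
print(log_reg.predict_proba([[1.7], [2.5]]))  # column 0: P(y=0), column 1: P(y=1)
```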

Support for Multiple Classes
* Use Softmax Regression.
* Compute a score sk(x) = x.theta(k) for each class k.
* Each class has its own parameter vector theta(k); all the parameter vectors can be stored as the rows of a parameter matrix.
* The probability of class k is given by the softmax function: the exponential of sk divided by the sum of all the exponentials (see the sketch below).

Cost function of Softmax Regression
* The cross entropy; when there are only 2 classes, it is equivalent to the logistic regression cost function.
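A minimal NumPy sketch of the softmax step; theta here is a made-up parameter matrix with one row per class:

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exps / np.sum(exps)

x = np.array([1.0, 2.0])                     # one instance with two features
theta = np.array([[0.5, -0.2],               # one parameter vector per class (3 classes)
                  [0.1, 0.4],
                  [-0.3, 0.2]])
scores = theta @ x                           # s_k(x) for k = 0, 1, 2
probs = softmax(scores)
print(probs, probs.sum())                    # class probabilities, summing to 1
```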

Logical Computation With Neuron * It has one or more binary input and one output. * Activate output when certain number of input is active...