Logical Computation With Neuron
* It has one or more binary inputs and one binary output.
* The output activates when a certain number of its inputs are active.
Perceptron
* Threshold logic unit (TLU)
* applies a weight to each input, computes the weighted sum, then passes the result through a step function.
* The step function can be the Heaviside function or the sign function.
* To solve the XOR problem, a multi-layer perceptron (MLP) is needed.
* It is a feedforward neural network because the signal flows in one direction only:
* Input layer -> hidden layer(s) -> output layer
* When there are many hidden layers, it is called a deep neural network; this is what deep learning studies.
Backpropagation
* used to train multi-layer perceptrons
* forward pass through the instance -> compute the loss -> backward pass to measure each connection's contribution to the error -> tweak the weights to minimise the error
* it is important to initialise the weights randomly
* the activation function must be replaced because Heaviside and sign are flat (zero gradient); the original paper used the logistic function, logistic(z) = 1 / (1 + exp(-z))
Keras
* high-level API that wraps TensorFlow
* see my own code to see how it works.
* The flow: create the model -> compile -> train -> evaluate (see the sketch below)
* Start with the Sequential API
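A minimal sketch of that flow with the Sequential API (the dataset, layer sizes, and hyperparameters are placeholders for illustration, not from the original notes):

from tensorflow import keras

# create the model: a plain stack of layers (sizes chosen for illustration)
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])

# compile: pick loss, optimizer, and metrics
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

# train and evaluate (X_train, y_train, X_test, y_test assumed to exist)
history = model.fit(X_train, y_train, epochs=5, validation_split=0.1)
model.evaluate(X_test, y_test)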
Functional API
* the output layer gets a direct connection to the input, in addition to the stack of hidden layers (the deep path).
* this enables the network to learn both deep patterns and simple rules (see the sketch below).
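A rough sketch of this wide-and-deep idea with the Functional API (input size and layer widths are made up for illustration):

from tensorflow import keras

input_ = keras.layers.Input(shape=[8])
hidden1 = keras.layers.Dense(30, activation="relu")(input_)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.Concatenate()([input_, hidden2])  # direct (wide) path + hidden stack (deep path)
output = keras.layers.Dense(1)(concat)
model = keras.Model(inputs=[input_], outputs=[output])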
Subclassing API
* subclass keras.Model
* define your own layers in the constructor and the forward pass in call()
* anything from the Sequential and Functional styles can go inside, plus arbitrary logic; the downside is that Keras loses its ability to inspect, save, and clone the model structure (see the sketch below).
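A small sketch of the subclassing style (layer sizes are illustrative assumptions):

from tensorflow import keras

class WideAndDeepModel(keras.Model):
    def __init__(self, units=30, **kwargs):
        super().__init__(**kwargs)          # define the layers in the constructor
        self.hidden1 = keras.layers.Dense(units, activation="relu")
        self.hidden2 = keras.layers.Dense(units, activation="relu")
        self.out = keras.layers.Dense(1)

    def call(self, inputs):
        # the forward pass; arbitrary imperative logic could go here
        hidden = self.hidden2(self.hidden1(inputs))
        concat = keras.layers.concatenate([inputs, hidden])
        return self.out(concat)

model = WideAndDeepModel()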
Hyperparameter
* Number of hidden layers: start with one, often no more than a few for simple problems. Deeper networks are actually more parameter-efficient than wide, shallow ones.
* Number of neurons per layer: either a pyramid shape (fewer neurons in each successive layer) or the same count in every layer. The number of layers matters more.
Sunday, November 10, 2019
Saturday, November 9, 2019
Unsupervised Learning
* Learning without labels: just the data.
* Three types: Clustering, Anomaly detection, Density Estimation
Clustering
* Separate the data into clusters.
* Hard clustering: each instance is assigned to exactly one cluster.
* Soft clustering: each instance gets a score per cluster, e.g. based on its distance to the centroid.
K-Means
* if we knew the clusters (centroids), it would be easy to assign each training instance to the nearest one.
* if we had the labels, it would be easy to compute each cluster's centroid as the mean of its instances.
* No clusters + no labels? => pick k random centroids, label each instance with the nearest centroid, update the centroids (recompute each cluster's mean), relabel, and so on until it converges.
* Poor initialisation can make it converge to a suboptimal local minimum.
* How to select a good initialisation?
** use another algorithm first to get a rough idea of where the centroids are.
** run K-Means multiple times and select the best solution.
** the best solution is the one with the lowest inertia: the sum of squared distances between each instance and its closest centroid.
* K-Means++ improves the initialisation by
** picking each subsequent centroid via a weighted randomisation that favours instances far from the centroids already chosen.
* Accelerated K-Means uses the triangle inequality to skip unnecessary distance computations.
* Mini-batch K-Means does not use the whole training set at each step
** slightly worse inertia, but much faster than the regular algorithm.
* Deciding the number of clusters
** inertia alone is not a valid criterion here, because it keeps shrinking as the number of clusters grows.
** plotting inertia against k and looking for an "elbow" works, but it is rather coarse.
** instead, use the silhouette score: per instance, (b - a) / max(a, b), where a is the mean distance to the other instances in the same cluster and b is the mean distance to the instances of the nearest other cluster.
** when the score approaches +1, the instance is well inside its cluster
** when the score is around zero, the instance sits near a cluster boundary
** when the score approaches -1, the instance may belong to another cluster
** a silhouette diagram (sorted silhouette coefficients per cluster) gives an even richer picture
* K-Means has its limits: it struggles when clusters have varying sizes, varying densities, or non-spherical shapes.
* K-Means can be used for image segmentation, dimensionality reduction, and semi-supervised learning (see the sketch below).
* semi-supervised: label only the representative instances, i.e. the instance closest to each centroid.
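A minimal scikit-learn sketch of the above (synthetic blobs; k=5 is an assumption for illustration):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)  # k-means++ initialisation by default
labels = kmeans.fit_predict(X)

print(kmeans.inertia_)              # sum of squared distances to the closest centroid
print(silhouette_score(X, labels))  # the closer to +1, the better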
DBSCAN
* defines a cluster as a contiguous region of high density
* it works well when the clusters are separated by low-density regions; instances left in low-density regions are treated as noise/anomalies (see the sketch below).
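A minimal sketch (eps and min_samples are illustrative values, not tuned):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)   # label -1 marks instances DBSCAN considers noise/anomalies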
Gaussian Mixture Model
* It assumes the instances were generated from a mixture of several Gaussian distributions.
* the number of clusters can be chosen with a theoretical information criterion (see the sketch below)
** Bayesian Information Criterion (BIC)
** Akaike Information Criterion (AIC)
** BIC tends to select simpler models, while AIC tends to fit the data better.
* the probability function measures how plausible an outcome x is given parameters theta
* the likelihood function measures how plausible parameters theta are given an observed outcome x
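A minimal sketch of choosing the number of clusters with BIC/AIC (X is assumed to be some feature matrix, e.g. the one from the clustering example above):

from sklearn.mixture import GaussianMixture

for k in range(1, 8):
    gm = GaussianMixture(n_components=k, n_init=10, random_state=42).fit(X)
    print(k, gm.bic(X), gm.aic(X))   # lower is better for both criteria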
Friday, November 8, 2019
Dimensionality Reduction
* It is hard to picture anything beyond 3D.
* In 2D, a random point is unlikely to sit close to the boundary; in high dimensions, almost every point is close to a boundary.
* In 2D, two random points are fairly close on average; in high dimensions, the average distance between random points is huge.
* This is why it is called the curse of dimensionality.
Projection
* Project the data from the high-dimensional space onto a lower-dimensional subspace.
Manifold Learning
* models the manifold on which the training instances lie
* based on the manifold hypothesis: most real-world high-dimensional datasets lie close to a much lower-dimensional manifold.
* Think of a Swiss roll.
PCA
* Principal Component Analysis
* Identifies the hyperplane closest to the data and projects the data onto that hyperplane.
* chooses the axis that preserves the maximum amount of variance.
* The first principal component is the axis along which the training set has the largest variance; the second is the orthogonal axis capturing the largest remaining variance, and so on.
* For an n-dimensional dataset, PCA finds n principal components, one per dimension.
* An easy way to find them is Singular Value Decomposition (SVD): X = U S V_T, where the columns of V (the rows of V_T) are the principal component vectors.
* PCA assumes the data is centred on the origin, so subtract the mean before applying SVD.
* To project onto d dimensions, compute X_proj = X . W_d, where W_d holds the first d columns of V. In terms of dimensions: (m x n) . (n x d) = (m x d) => d dimensions.
* explained_variance_ratio_ gives the proportion of the dataset's variance that lies along each principal component's axis.
** From these values we can tell which axes carry little information.
** It can also be used to choose the number of dimensions, e.g. keep enough components to preserve 95% of the variance (see the sketch below).
* We can decompress the data by applying X_recovered = X_proj . W_d_T.
* Some information is lost, depending on how much variance was dropped. The mean squared distance between the original and reconstructed data is called the reconstruction error.
* Randomized PCA: faster; uses a stochastic algorithm to approximate the first d principal components.
* Incremental PCA: does not need the whole training set in memory; feed it mini-batches.
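A minimal sketch (X_train is assumed to exist): keep just enough components to preserve 95% of the variance:

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)            # a fraction tells scikit-learn to pick d automatically
X_reduced = pca.fit_transform(X_train)

print(pca.explained_variance_ratio_)    # variance carried by each kept component
X_recovered = pca.inverse_transform(X_reduced)   # approximate reconstruction (decompression)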
LLE
* Locally Linear Embedding
* relies on each instance's closest neighbours.
* looks for a low-dimensional representation in which these local neighbour relationships are preserved (see the sketch below).
* Step 1: find weights w such that each instance x(i) is best reconstructed as a weighted sum of its neighbours, i.e. minimise the distance between x(i) and sum_j w(i,j) x(j).
* Step 2: keeping the weights fixed, find low-dimensional points z(i) that minimise the distance between z(i) and sum_j w(i,j) z(j).
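A minimal sketch on a Swiss roll (the number of neighbours is an illustrative choice):

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X)   # the 3D roll flattened to 2D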
Ensemble
* Use diverse classifiers so that their errors are as independent of each other as possible.
Voting
* Combine a few different algorithms and predict the class that receives the most votes (see the sketch below).
* Law of large numbers
** as the number of votes increases, the observed ratio gets closer and closer to the true underlying probability.
* Soft voting vs hard voting, e.g. three classifiers giving the positive class probabilities 0.3, 0.3, 0.9:
** Hard voting: No (two of the three classifiers predict No)
** Soft voting: Yes (the average probability is 0.5, which reaches the decision threshold)
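A minimal sketch of a voting ensemble (X_train and y_train are assumed to exist; the base estimators are arbitrary choices):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),   # probabilities needed for soft voting
    ],
    voting="soft",   # "hard" = majority class vote, "soft" = average the predicted probabilities
)
voting_clf.fit(X_train, y_train)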
Bagging and Pasting
* Instead of using different algorithms, use the same algorithm for every predictor,
* but train each one on a different random subset of the training set.
* Bagging samples the subsets with replacement, so the same instance can appear several times in one subset.
* Pasting samples the subsets without replacement.
* Training and prediction can be parallelised across CPU cores,
* which is why bagging and pasting scale so well (see the sketch below).
* Sampling can also be applied to the features:
** sampling both features and instances is called Random Patches
** sampling features only is called Random Subspaces
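A minimal bagging sketch (X_train, y_train assumed to exist; set bootstrap=False to get pasting instead):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True,   # bootstrap=True => sampling with replacement (bagging)
    n_jobs=-1, random_state=42,        # n_jobs=-1 parallelises across all CPU cores
)
bag_clf.fit(X_train, y_train)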
Random Forest
* A bagging ensemble of decision trees.
* Introduces extra randomness by searching for the best split feature only within a random subset of features at each node.
* Extremely Randomized Trees (Extra-Trees) go further by also using random thresholds for each candidate feature.
* It is hard to tell in advance which will perform better, so compare a regular Random Forest and Extra-Trees (e.g. with cross-validation).
* Feature importance: Random Forests also provide a handy measure of each feature's relative importance (see the sketch below).
** measured by how much the feature reduces impurity on average,
** weighted by the number of training samples associated with each node.
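A minimal sketch of feature_importances_ on the iris dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris.data, iris.target)

for name, score in zip(iris.feature_names, rnd_clf.feature_importances_):
    print(name, round(score, 3))   # the scores sum to 1; higher = more important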
Boosting
* Combine several weak predictors sequentially into a strong predictor.
Adaptive Boosting
* Train a predictor -> compute its error rate -> increase the weights of the misclassified instances -> train the next predictor on the re-weighted data -> rinse and repeat until the desired number of predictors is reached or a perfect predictor is found.
* The ensemble predicts with a weighted vote of all predictors, each weighted according to its accuracy (see the sketch below).
* Error rate -> predictor weight -> instance weights updated -> retrain
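A minimal AdaBoost sketch over decision stumps (X_train, y_train assumed to exist; hyperparameters are illustrative):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),   # weak learner: a decision stump
    n_estimators=200, learning_rate=0.5, random_state=42,
)
ada_clf.fit(X_train, y_train)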
Gradient Boosting
* Instead of re-weighting instances, each new predictor is fit to the residual errors of the previous one.
* Combine it with early stopping to find the optimal number of trees.
* Training each tree on a random subset of the instances instead of the full training set is called Stochastic Gradient Boosting (see the sketch below).
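A minimal sketch of stochastic gradient boosting with scikit-learn's built-in early stopping (X_train, y_train assumed to exist; hyperparameters are illustrative):

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(
    max_depth=2, n_estimators=500, learning_rate=0.1,
    subsample=0.25,             # each tree sees 25% of the instances => stochastic gradient boosting
    n_iter_no_change=10,        # early stopping: halt when the validation score stops improving
    validation_fraction=0.1, random_state=42,
)
gbrt.fit(X_train, y_train)
print(gbrt.n_estimators_)       # number of trees actually kept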
Stacking
* Use another predictor (the blender, or meta-learner) to aggregate the predictions of all the base predictors.
* A common training strategy uses a hold-out set (blending):
** split the training set into one part for the base predictors and one hold-out part.
** get each base predictor's predictions on the hold-out set.
** these predictions become the input features of a new training set (keeping the hold-out labels as targets).
** train the blender on this new training set.
* The trick can be repeated to build multiple blending layers. A sketch follows.
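scikit-learn's StackingClassifier automates this idea, though it uses cross-validated predictions rather than a single hold-out split; a minimal sketch (X_train, y_train assumed to exist):

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stack_clf = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=42)),
                ("svc", SVC(random_state=42))],
    final_estimator=LogisticRegression(),   # the blender / meta-learner
    cv=5,
)
stack_clf.fit(X_train, y_train)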
Decision Tree
* Gini impurity: G = 1 - sum_k p_k^2, where p_k is the ratio of class-k instances among the node's training instances.
* A pure node has zero Gini impurity.
* The Classification and Regression Tree (CART) algorithm is used for training.
* Classifier cost function for a split:
** J = (m_left/m * G_left) + (m_right/m * G_right)
** G is the impurity and m the number of instances on each side of the split
* Sometimes entropy is used instead of Gini:
** Gini tends to isolate the most frequent class in its own branch
** entropy tends to produce slightly more balanced trees
** Gini is faster to compute.
* Regression cost function for a split:
** (m_left/m * MSE_left) + (m_right/m * MSE_right)
* A decision tree is a nonparametric model, because its structure is free to stick closely to the data.
* A parametric model has a predefined number of parameters.
* Nonparametric models tend to overfit the data, while parametric models tend to underfit.
* Decision trees also favour orthogonal decision boundaries (splits perpendicular to an axis), which causes trouble when the data has no such boundaries, e.g. after a rotation. A sketch follows.
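A minimal sketch on the iris dataset; restricting max_depth acts as regularisation against the overfitting tendency noted above:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree_clf = DecisionTreeClassifier(max_depth=2, criterion="gini", random_state=42)
tree_clf.fit(iris.data, iris.target)
print(tree_clf.predict_proba([[5.1, 3.5, 1.4, 0.2]]))  # class ratios in the leaf this flower falls into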
Sunday, November 3, 2019
Support Vector Machine
Linear SVM Classification
* creates a decision boundary (a "street") that separates the instances as widely as possible
* the instances located on the edges of the street are called support vectors
* hard margin classification is often impossible, and very sensitive to outliers
* soft margin classification, which tolerates some violations, is more practical.
* the hyperparameter C controls how narrow or wide the street is (smaller C = wider street, more violations).
Nonlinear SVM Classification
* sometimes the data is not linearly separable.
* Method 1: add polynomial features
* or use a polynomial kernel instead
* Method 2: add similarity features (Gaussian Radial Basis Function)
* or use the RBF kernel instead
How to choose which kernel to use?
* try the linear kernel first
* if that does not work, try the RBF kernel
* if neither works well, use grid search with cross-validation before trying other kernels (see the sketch below).
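A minimal sketch comparing a linear SVM and an RBF-kernel SVM (X_train, y_train assumed to exist; scaling the features first matters for SVMs):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC

linear_clf = make_pipeline(StandardScaler(), LinearSVC(C=1, random_state=42))
rbf_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale", C=1))

linear_clf.fit(X_train, y_train)
rbf_clf.fit(X_train, y_train)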
SVM Regression
* Reverses the objective of SVM classification:
* instead of keeping instances off the street, fit as many instances as possible on the street
* margin violations are now the instances that fall off the street
Under the hood
y = 0 if wx + b < 0, 1 if wx + b >= 0
Diff with Linear Regression
* the bias term b is kept separate instead of being added as an extra feature x0 to x
* w is used instead of theta
Cost function
* the weight vector of the decision function is w
* minimise the norm of w so that the margin of the street is as wide as possible
* at the same time, the decision function must be >= 1 for every positive instance and <= -1 for every negative instance (hard margin).
* this is a quadratic programming (QP) problem
* The function to minimise is 0.5 * w.T . w
* subject to: t(i) * (w . x(i) + b) >= 1, where t(i) = -1 when y(i) = 0 and t(i) = 1 when y(i) = 1
* For the soft margin, introduce slack variables that measure how much violation we allow per instance
* C => the hyperparameter that balances the minimisation objective against the constraint violations
Dual Problem
* the QP problem of the SVM is the primal problem, but it can also be solved through its dual problem.
* The dual form of the cost function is: minimise 0.5 * sum_{i=1..m} sum_{j=1..m} alpha(i) alpha(j) t(i) t(j) x(i).x(j) - sum_{i=1..m} alpha(i), with alpha(i) >= 0
* Objective: find the alpha vector that minimises this equation
* from alpha we can then compute w and b with separate equations
Kernel Trick
* the polynomial transformation can be skipped: for a degree-2 mapping, the dot product of the transformed vectors equals the square of the dot product of the original vectors
* Common kernels: linear, polynomial, Gaussian RBF, sigmoid
* The idea is that the kernel computes the dot product of the transformed vectors without ever performing the transformation.
* Why does this matter? Most of the time the data is not linearly separable, so a transformation to a higher-dimensional space is needed before a separating hyperplane exists; the kernel trick lets us operate in the original space without computing the coordinates in the transformed space.
Hinge Loss
max(0, 1 - t): it is 0 when t >= 1, and 1 - t otherwise, i.e. when the instance is inside the street or on the wrong side of it.
Logistic Regression
Instead of outputting a value directly like linear regression, it turns that value into a probability.
p = logistic(X.theta)
* logistic is inverse of logit function.
logistic(t) = 1 / ( 1 + exp(-t) )
y = 0 if p < 0.5, which also means t < 0
y = 1 if p >= 0.5, which also means t >= 0
The cost function of it is:
c(theta) = -log(p) if y = 1, -log(1-p) if y = 0
* the cost pushes probabilities up for positive instances
* and pushes probabilities down for negative instances
* there is no normal equation to solve; a gradient descent approach is needed
* The partial derivatives look like linear regression's, with the predicted probability in place of the predicted value.
Decision Boundary
* the boundary at which y = 0 and y = 1 are equally likely, i.e. p = 0.5
* LogisticRegression.predict_proba returns two columns: column 0 is the probability of class 0 and column 1 the probability of class 1 (see the sketch below).
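A minimal sketch following the classic petal-width example on the iris dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 3:]                    # petal width only
y = (iris.target == 2).astype(int)      # 1 if Iris virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)
print(log_reg.predict_proba([[1.7]]))   # [[P(class 0), P(class 1)]]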
Support for Multiple Classes
* Using Softmax regression
* Compute a score s_k(x) = x . theta(k) for each class k
* Each class has its own parameter vector theta(k). Stacking these vectors as rows gives the parameter matrix.
* The probability of class k is the softmax of the scores: exp(s_k(x)) divided by the sum of all the exponentials (see the sketch below).
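A tiny sketch of the softmax function itself (the scores are made-up values):

import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))    # probabilities that sum to 1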
Cost function of Softmax Regression
* When k = 2, it is the same as logistic regression.
Saturday, November 2, 2019
Gradient Descent
Linear Regression
y = ax + b
* the goal is to find a and b from the training set.
* define a cost function, typically the Mean Squared Error (MSE),
* and minimise it.
Ways to minimise
* Normal equation
theta_best = inv(X.T . X) . X.T . y
Computational complexity: roughly O(n^2.4) to O(n^3) in the number of features n.
* Gradient Descent
Start from a random initialisation, then repeatedly adjust the parameters in the direction opposite to the gradient (the partial derivatives of the cost) until it reaches a minimum.
Important parameter: learning rate.
* Stochastic Gradient Descent
Instead of using the whole training set, each step uses a single randomly picked instance.
* Mini-batch Gradient Descent
A compromise between batch Gradient Descent and Stochastic Gradient Descent:
each step uses a small random set of instances called a mini-batch. (A sketch of plain batch gradient descent follows.)
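A minimal NumPy sketch of plain batch gradient descent on synthetic linear data (learning rate and iteration count are illustrative):

import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]          # add x0 = 1 for the bias term

eta, n_iterations, m = 0.1, 1000, len(X_b)
theta = np.random.randn(2, 1)              # random initialisation
for _ in range(n_iterations):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # partial derivatives of the MSE
    theta -= eta * gradients

print(theta)   # should land close to [[4], [3]]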
Polynomial Regression
Add polynomial terms as extra features (PolynomialFeatures), then use LinearRegression on the expanded features to solve it (see the sketch below).
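A minimal sketch (X and y assumed to be a single-feature matrix and its targets, e.g. curved data):

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)              # adds the squared term as a new feature
lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)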
Learning Curves
* plot how the model's error evolves as the training set size grows
* one way, besides cross-validation, to tell whether the model is too simple (underfitting) or too complex (overfitting).
How to reduce overfitting
* Regularized Linear Models
There are multiple ways to regularise a linear model (see the sketch below):
* Ridge Regression: add half the squared l2 norm of the weight vector (scaled by alpha) to the MSE.
* Lasso Regression: add the l1 norm of the weight vector (scaled by alpha) to the MSE.
* Lasso Regression tends to perform feature selection, driving the weights of the least important features to zero.
* Elastic Net: a mix of both Ridge and Lasso regularisation.
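A minimal sketch of the three regularised models (X and y assumed to be a feature matrix and a 1-D target vector; the alpha values are illustrative, not tuned):

from sklearn.linear_model import ElasticNet, Lasso, Ridge

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # l1_ratio mixes the Lasso and Ridge penalties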
Classification
Binary Classifier
* try SGDClassifier as a case study
Performance Measure
* use cross validation
con: measuring accuracy this way is misleading on skewed datasets
* use confusion matrix
use cross_val_predict followed by confusion_matrix.
* Receiver Operating Characteristic (ROC) Curve
Confusion matrix
True Negative | False Positive
False Negative | True Positive
(rows are the actual class, columns the predicted class)
Precision = TP / (TP + FP), e.g. a filter for kid-safe videos (false positives are costly).
Recall/TPR = TP / (TP + FN), e.g. detecting shoplifters on video surveillance (false negatives are costly).
Precision and recall trade off against each other: with SGDClassifier, moving the decision threshold to the right raises precision but lowers recall, and vice versa.
ROC
* plots the true positive rate (recall) against the false positive rate; equivalently, recall versus 1 - specificity, where specificity is the true negative rate.
* the curve shows that the higher the recall, the more false positives the classifier produces.
* One way to summarise a classifier is the Area Under the Curve (roc_auc_score).
* A perfect classifier has AUC = 1.0, while a purely random one has 0.5.
Prefer the precision-recall curve when the positive class is rare or when false positives matter more than false negatives; otherwise use the ROC curve. A sketch follows.
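A minimal sketch tying these metrics together (X_train and binary labels y_train assumed to exist):

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score

sgd_clf = SGDClassifier(random_state=42)
y_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=3)

print(confusion_matrix(y_train, y_pred))   # [[TN, FP], [FN, TP]]
print(precision_score(y_train, y_pred))    # TP / (TP + FP)
print(recall_score(y_train, y_pred))       # TP / (TP + FN)

y_scores = cross_val_predict(sgd_clf, X_train, y_train, cv=3, method="decision_function")
print(roc_auc_score(y_train, y_scores))    # 1.0 = perfect classifier, 0.5 = random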
Other Types of classification
* Multiclass classification
extends binary classification to more than two classes, either through One-versus-One or One-versus-Rest (One-versus-All)
* Multilabel classification
outputs multiple binary labels per instance because it is trained on multi-label targets
* Multioutput classification
a generalisation of multilabel classification where each label can itself be multiclass (take multiple values)
Error Analysis
* By plotting the confusion matrix, we can see which errors are most common for our model.
* That alone is not enough, though.
* Divide each value in the confusion matrix by the number of images in the corresponding class,
* so the error rates, rather than the absolute counts, stand out clearly.
First Attempt Into Creating Model
Performance Measure
A typical one is the Root Mean Squared Error (RMSE). It gives an idea of how far the predictions are from the labels.
Another is the Mean Squared Error (MSE).
Some data comes in string format and needs to be converted to something numeric.
How to create train and test set
* A train/test split can be obtained with sklearn's train_test_split; typically 20% goes to the test set.
* It is important that the split preserves the proportions found in the real population (stratified sampling), e.g. if 30% of the real-world population are men, the sampled data should also contain about 30% men.
How to analyse the feature
* through graphs
* through the correlation coefficient, e.g. data.corr()
it gives coefficients between -1 and 1.
* another option: pandas' scatter_matrix, which plots each attribute against every other attribute.
* try combining attributes that are not useful individually into more informative ones.
* data cleaning: remove unrelated attributes.
How to handle text
* Convert to ordinal values using OrdinalEncoder()
con: the algorithm might wrongly assume that nearby ordinal values are more similar than distant ones.
* OneHotEncoder creates a column per category and puts a 1 in the column that applies to the instance.
it returns a sparse matrix to save space.
Fit and transform
* fit learns from the existing data
* transform changes the data
* fit_transform does both in one call
Feature scaling
It is important to bring all features to the same scale; otherwise many algorithms perform badly.
* MinMaxScaler
rescales values to the 0-1 range
* Standardization
subtract the mean and divide by the standard deviation; it is less affected by outliers.
Custom Transformer and Transformation Pipelines
you can create your own transformer (implementing fit and transform)
and build a pipeline that chains a bunch of transformers and a final model (see the sketch below).
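A minimal sketch of such a pipeline (the DataFrame df and the column names are hypothetical):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("num", num_pipeline, ["feature_a", "feature_b"]),   # numeric columns (assumed names)
    ("cat", OneHotEncoder(), ["category_col"]),          # categorical column (assumed name)
])
X_prepared = preprocess.fit_transform(df)                # df assumed to be a pandas DataFrame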
How to validate the model
* use mean_squared_error
but on its own it is not enough, because a low training error can simply mean overfitting
* use cross validation
split the training set further into folds, train on all folds but one, validate on the held-out fold, rotate, and average the error (e.g. the MSE).
Finetuning model
* GridSearchCV
* RandomizedSearchCV
* Ensemble: combine models
Intro to Machine Learning
Machine learning, in simple words, is asking a machine to deduce something from data, lots of data.
From this data, an algorithm builds a model.
argmax returns the value of the variable that maximises a function.
Machine learning is about:
* turn raw data into feature vectors
* analyse the features and try different algorithms to come up with a model
* try the model
* rinse and repeat
Types of machine learning:
* supervised vs semisupervised vs unsupervised
supervised learning has labelled training data (semisupervised has only some labels; unsupervised has none)
* online vs batch
online means the model keeps learning incrementally on the go, while batch learning trains offline on the full dataset
* instance-based vs model-based
instance-based: learn the examples by heart and compare new instances against them
model-based: build a model from the examples, then use the model to predict
Overfitting
the model performs well on the training set but poorly on test data
Underfitting
the model performs poorly even on the training set.
First Post
This space is to document my learnings of interesting computer science and programming subjects.
Why? Because trying to explain something you have learned forces you to understand it better.