Sunday, November 3, 2019

Support Vector Machine

Linear SVM Classification
* creates a decision boundary (a "street") that separates the instances of the classes
* the instances sitting on the edge of the street are called support vectors
* hard margin classification requires perfectly separable data and is very sensitive to outliers
* soft margin classification is more practical: it allows some margin violations
* use the hyperparameter C to control how narrow or wide the street is (sketched below)
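A minimal sketch, assuming scikit-learn (the iris data and the value C=1 are illustrative choices, not from the post): a soft margin linear classifier where a smaller C gives a wider street with more margin violations, and a larger C gives a narrower street closer to hard margin.

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X = X[:, 2:]              # petal length and petal width only
y = (y == 2).astype(int)  # Iris virginica vs. the rest

# Small C = wide street, more violations; large C = narrow street, fewer violations.
svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=1, loss="hinge"))
svm_clf.fit(X, y)
print(svm_clf.predict([[5.5, 1.7]]))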

Nonlinear SVM Classification
* sometimes the data is not linearly separable.
* Method 1: add polynomial features, then train a linear SVM
* Or use a polynomial kernel directly
* Method 2: add similarity features (Gaussian Radial Basis Function)
* Or use the RBF kernel directly (both methods are sketched below)
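Both methods, sketched with scikit-learn on the moons toy dataset (the hyperparameter values are illustrative, not from the post):

from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC, SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# Method 1: add polynomial features explicitly, then train a linear SVM.
poly_svm = make_pipeline(PolynomialFeatures(degree=3), StandardScaler(),
                         LinearSVC(C=10, loss="hinge", max_iter=10000))
poly_svm.fit(X, y)

# Or let a polynomial kernel do the same implicitly (kernel trick).
poly_kernel_svm = make_pipeline(StandardScaler(),
                                SVC(kernel="poly", degree=3, coef0=1, C=5))
poly_kernel_svm.fit(X, y)

# Method 2: Gaussian RBF kernel (similarity features computed implicitly).
rbf_kernel_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=5, C=0.001))
rbf_kernel_svm.fit(X, y)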

How to choose which kernel to use?
* try the linear kernel first
* if that does not work well, try the RBF kernel
* if neither works, use grid search (e.g. GridSearchCV) to tune the hyperparameters before trying other kernels (see below)
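A sketch of that grid search step, assuming scikit-learn (the parameter grid is an illustrative choice):

from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=42)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
param_grid = [
    {"svc__kernel": ["rbf"], "svc__gamma": [0.1, 1, 5], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["poly"], "svc__degree": [2, 3], "svc__C": [0.1, 1, 10]},
]
search = GridSearchCV(pipe, param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)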

SVM Regression
* reverses the objective of SVM classification
* instead of keeping instances off the street, try to fit as many instances as possible on the street
* margin violations are now the instances that fall off the street; the width of the street is controlled by the hyperparameter epsilon (sketch below)
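A minimal sketch, assuming scikit-learn (the synthetic data and epsilon values are illustrative): epsilon-insensitive SVM regression, where a larger epsilon gives a wider street so more instances fit inside it.

import numpy as np
from sklearn.svm import LinearSVR, SVR

rng = np.random.RandomState(42)
X = 2 * rng.rand(100, 1)
y = (4 + 3 * X + rng.randn(100, 1)).ravel()

# Linear SVM regression: epsilon sets the width of the street.
svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)

# Nonlinear SVM regression via a polynomial kernel.
svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)
print(svm_reg.predict([[1.0]]), svm_poly_reg.predict([[1.0]]))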

Under the hood
y = 0 if wx + b < 0, 1 if wx + b >= 0
Diff with Linear Regression
* the bias term b is kept separate, so there is no need to add a bias feature x0 = 1 to x
* the feature weights are called w instead of theta
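A tiny numeric illustration of the prediction rule above (the values of w, b and x are made up):

import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector
b = -0.5                    # hypothetical bias term, kept separate from x
x = np.array([1.0, 0.5])    # one instance

score = w.dot(x) + b        # decision function value: 2*1 - 1*0.5 - 0.5 = 1.0
y_pred = int(score >= 0)    # 1 because the score is non-negative
print(score, y_pred)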

Cost function
* the weight vector of the decision function is w
* minimizing ||w|| makes the margin (the street) as wide as possible
* at the same time, all positive instances need a decision function value >= 1 and all negative instances <= -1 (i.e. off the street, consistent with the constraint below)
* this is a quadratic programming (QP) problem
* the function to minimize is 0.5 * w.T * w
* constrained to: t(i)(w·x(i) + b) >= 1 for every instance i, where t(i) = -1 when y(i) = 0 and t(i) = +1 when y(i) = 1
* introduce slack variables zeta(i) to measure how much margin violation we allow (soft margin)
* C => hyperparameter that trades off minimizing w (wide margin) against keeping the total slack small (few violations); the combined objective is written out below
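For reference, the usual textbook form of this soft margin QP (standard result, not quoted from the post), written in LaTeX:

\min_{w,\,b,\,\zeta}\;\; \frac{1}{2}\, w^\top w \;+\; C \sum_{i=1}^{m} \zeta^{(i)}
\quad\text{subject to}\quad
t^{(i)}\!\left(w^\top x^{(i)} + b\right) \ge 1 - \zeta^{(i)}
\quad\text{and}\quad \zeta^{(i)} \ge 0
\quad\text{for } i = 1, \dots, m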

Dual Problem
* the QP problem above is the primal problem, but it can also be solved through its dual problem (which gives the same solution for SVM and makes the kernel trick possible)
* The dual form of the cost function is: minimize over alpha: 0.5 * sum(i=1..m) sum(j=1..m) alpha(i) alpha(j) t(i) t(j) x(i)·x(j) − sum(i=1..m) alpha(i), subject to alpha(i) >= 0 for all i
* Objective: find the vector alpha that minimizes this equation
* from alpha we can then compute w and b with closed-form equations (given below)
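The standard formulas for recovering w and b from the dual solution (textbook result, added here for completeness), where n_s is the number of support vectors, i.e. instances with alpha(i) > 0:

\hat{w} = \sum_{i=1}^{m} \hat{\alpha}^{(i)}\, t^{(i)}\, x^{(i)}
\qquad
\hat{b} = \frac{1}{n_s} \sum_{\substack{i=1 \\ \hat{\alpha}^{(i)} > 0}}^{m} \left( t^{(i)} - \hat{w}^\top x^{(i)} \right)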

Kernel Trick
* we can skip the explicit polynomial transformation: for degree = 2, the kernel is simply the dot product of the original vectors, squared
* common kernels: linear, polynomial, Gaussian RBF, sigmoid
* the idea: a kernel computes the dot product of the transformed vectors without ever performing the transformation
* why? Most of the time the data is not linearly separable, so a transformation into a higher-dimensional space is needed; this is where the kernel trick comes in. It lets us operate in the original space without computing the coordinates of the transformed space (numeric check below).
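A quick numeric check of the degree-2 case (my own example vectors): the kernel (a·b)^2 equals the dot product of the explicitly transformed vectors.

import numpy as np

def phi(v):
    # Explicit degree-2 polynomial mapping of a 2D vector
    # (the classic textbook mapping, no linear or constant terms).
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

print(phi(a).dot(phi(b)))   # dot product in the transformed space: 121.0
print(a.dot(b) ** 2)        # same value, without ever transforming: 121.0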

Hinge Loss
max(0, 1 - t): the loss is 0 when t >= 1, i.e. the instance is off the street and on the correct side; otherwise the loss is 1 - t, which grows the further the instance crosses into or past the street.
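A minimal sketch of the hinge loss, where t stands for the signed score t(i)(w·x(i) + b) of an instance:

import numpy as np

def hinge_loss(t):
    # 0 when t >= 1 (off the street, correct side), 1 - t otherwise.
    return np.maximum(0.0, 1.0 - t)

print(hinge_loss(np.array([2.0, 1.0, 0.5, -1.0])))   # [0.  0.  0.5 2. ]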
