* create a decision boundary (a "street") that separates the instances of the two classes as widely as possible
* the instances sitting on the edge of the street are called support vectors
* hard margin classification requires linearly separable data and is very sensitive to outliers
* soft margin classification allows some margin violations and is more practical
* use the hyperparameter C to control how narrow or wide the street is: a smaller C gives a wider street but more margin violations (see the sketch below)
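A minimal scikit-learn sketch of a soft margin linear SVM classifier; the iris subset and C=1 are illustrative choices, not tuned settings:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X = X[:, 2:]                    # petal length, petal width
y = (y == 2).astype(np.int64)   # Iris virginica vs the rest

# smaller C -> wider street but more margin violations
svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=1))
svm_clf.fit(X, y)
print(svm_clf.predict([[5.5, 1.7]]))  # expected: class 1 (virginica)
```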
Nonlinear SVM Classification
* sometimes the data is not linearly separable.
* Method 1: add polynomial features
* or use the polynomial kernel, which gives the same result without actually adding the features
* Method 2: add similarity features computed with a Gaussian Radial Basis Function (RBF)
* this also has a kernel version: the Gaussian RBF kernel (both methods are sketched below)
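A hedged sketch of both approaches with scikit-learn; the degree, coef0, gamma, and C values are placeholders rather than tuned hyperparameters, and the moons dataset is just a convenient nonlinear toy set:

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC, SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# Method 1a: explicitly add polynomial features, then a linear SVM
poly_features_clf = make_pipeline(
    PolynomialFeatures(degree=3), StandardScaler(), LinearSVC(C=10)
)

# Method 1b: polynomial kernel -- same idea without generating the features
poly_kernel_clf = make_pipeline(
    StandardScaler(), SVC(kernel="poly", degree=3, coef0=1, C=5)
)

# Method 2: Gaussian RBF kernel (similarity features via the kernel trick)
rbf_kernel_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=5, C=0.001))

for clf in (poly_features_clf, poly_kernel_clf, rbf_kernel_clf):
    clf.fit(X, y)
```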
How to choose which kernel to use?
* try the linear kernel first
* if that does not work well, try the Gaussian RBF kernel
* if neither works, run a grid search (e.g. with GridSearchCV) over the hyperparameters before moving on to other kernels (see the sketch below)
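A possible way to run that grid search over the RBF kernel's hyperparameters; the parameter grid and dataset below are only examples:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# candidate hyperparameters for the RBF kernel (values are illustrative)
param_grid = {"gamma": [0.01, 0.1, 1, 10], "C": [0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
```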
SVM Regression
* reverse the objective of SVM classification
* instead of keeping instances off the street, fit as many instances as possible on the street
* the margin violations are now the instances that fall off the street
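A rough sketch with scikit-learn's SVM regressors on a synthetic quadratic dataset; the epsilon, degree, and C values are assumptions for illustration (epsilon controls the street width):

```python
import numpy as np
from sklearn.svm import LinearSVR, SVR

rng = np.random.default_rng(42)
X = 2 * rng.random((50, 1)) - 1
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(0, 0.1, 50)

# linear SVM regression: epsilon is the half-width of the street
svm_reg = LinearSVR(epsilon=0.5)
svm_reg.fit(X, y)

# nonlinear SVM regression with a polynomial kernel
svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)
```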
Under the hood
ŷ = 0 if w.x + b < 0, and 1 if w.x + b >= 0
Diff with Linear Regression
* the bias term b is kept separate instead of being added as a bias feature x0 = 1
* the weight vector is called w instead of theta
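A tiny numpy sketch of that decision rule; the w, b, and instances below are toy values, not a trained model:

```python
import numpy as np

def svm_predict(w, b, X):
    """Linear SVM decision rule: 1 if w.x + b >= 0, else 0."""
    return (X @ w + b >= 0).astype(int)

w = np.array([2.0, -1.0])                 # toy weights
b = -0.5                                  # toy bias
X_new = np.array([[1.0, 1.0], [0.0, 2.0]])
print(svm_predict(w, b, X_new))           # -> [1 0]
```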
Cost function
* the decision function's weights are the vector w
* minimizing ||w|| makes the street (margin) as wide as possible
* at the same time, the decision function must be >= 1 for every positive instance and <= -1 for every negative instance
* this is a quadratic programming (QP) problem
* the hard margin objective is to minimize 0.5 * w.T * w
* subject to: t(i) * (w.x(i) + b) >= 1 for every instance i, where t(i) = -1 when y(i) = 0 and t(i) = 1 when y(i) = 1
* for the soft margin, introduce a slack variable zeta(i) >= 0 per instance, measuring how much margin violation we allow it
* C is the hyperparameter that trades off between the minimization objective (wide street) and the constraint violations: minimize 0.5 * w.T * w + C * sum(zeta(i)) (a numeric sketch follows this list)
Dual Problem
* the SVM QP problem above is the primal problem, but it can also be solved through its dual problem
* the dual form of the cost function is: 0.5 * sum_{i=1..m} sum_{j=1..m} alpha(i) alpha(j) t(i) t(j) (x(i) . x(j)) - sum_{i=1..m} alpha(i)
* Objective: find alpha that minimise the equation
* from the alpha values, w and b can then be computed with closed-form equations (see the sketch below)
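A numpy sketch of those equations, assuming the alpha values have already been found by a QP solver (the toy dataset below is just for illustration):

```python
import numpy as np

def w_and_b_from_dual(alpha, t, X):
    # w = sum_i alpha(i) * t(i) * x(i)
    w = (alpha * t) @ X
    # b averaged over the support vectors (alpha > 0): b = t(s) - w.x(s)
    sv = alpha > 1e-6
    b = np.mean(t[sv] - X[sv] @ w)
    return w, b

# toy example: two support vectors on opposite sides of the street
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
t = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])
print(w_and_b_from_dual(alpha, t, X))  # -> (array([0.5, 0.5]), 0.0)
```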
Kernel Trick
* for a degree-2 polynomial mapping, the explicit transformation can be skipped: the dot product of the transformed vectors equals the square of the dot product of the original vectors
* common kernels: linear, polynomial, Gaussian RBF, sigmoid
* the idea is that a kernel can compute the dot product of the transformed vectors without ever performing the transformation
* why is this useful? Most of the time the data is not linearly separable, so it has to be transformed into a higher-dimensional space where a hyperplane can separate it; this is where the kernel trick comes in, letting us operate in the original space without computing the coordinates of the transformed space (a quick check follows below)
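A quick numpy check of the degree-2 case described above, using the standard second-degree polynomial mapping phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2); the vectors a and b are arbitrary examples:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial mapping of a 2-D vector."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a = np.array([2.0, 3.0])
b = np.array([4.0, 1.0])

print(phi(a) @ phi(b))  # dot product in the transformed space -> 121.0
print((a @ b) ** 2)     # kernel trick: same value, no transformation -> 121.0
```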
Hinge Loss
max(0, 1 - t): it is 0 when t >= 1; otherwise it equals 1 - t, meaning the instance is inside the margin or on the wrong side of the street.
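The hinge loss as a numpy one-liner; here t stands for the signed score t(i) * (w.x(i) + b), and the sample values are illustrative:

```python
import numpy as np

def hinge_loss(t):
    """max(0, 1 - t): zero once t >= 1, linear penalty below that."""
    return np.maximum(0, 1 - t)

# t >= 1: off the street on the correct side; t < 1: margin violation
print(hinge_loss(np.array([2.0, 1.0, 0.5, -1.0])))  # -> [0.  0.  0.5 2. ]
```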