Performance Measure
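A typical choice for regression is Root Mean Squared Error (RMSE). It gives an idea of how far the predictions are from the labels, and it gives more weight to large errors.
Another one is Mean Squared Error (MSE), which is the same quantity before taking the square root.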
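As a minimal sketch of the two measures, assuming plain NumPy arrays of labels and predictions (the helper names `mse` and `rmse` and the sample numbers are made up for illustration):

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean of the squared differences between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def rmse(y_true, y_pred):
    """Square root of the MSE, so it is in the same unit as the label."""
    return float(np.sqrt(mse(y_true, y_pred)))


# Illustrative values only; the errors are 1, 0, -2 and 0.
labels = [3.0, 5.0, 2.0, 7.0]
predictions = [2.0, 5.0, 4.0, 7.0]

print(mse(labels, predictions))   # (1 + 0 + 4 + 0) / 4 = 1.25
print(rmse(labels, predictions))  # square root of 1.25
```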
Some data is in string format and needs to be converted to numbers before a model can use it.
How to create train and test set
* A train and test set can be created with sklearn's train_test_split. Normally, 20% of the data is held out as the test set.
* It is important that the data has the same proportions as the real case (stratified sampling), e.g. if 30% of the real-world population are men, the data should also contain about 30% men.
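The two points above can be sketched as follows. The toy DataFrame and its column names are assumptions for illustration; `stratify` asks train_test_split to keep the given column's proportions in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up): 30% "m" and 70% "f".
data = pd.DataFrame({
    "gender": ["m"] * 30 + ["f"] * 70,
    "value": range(100),
})

# Hold out 20% as the test set and preserve the gender proportions.
train_set, test_set = train_test_split(
    data,
    test_size=0.2,
    stratify=data["gender"],
    random_state=42,
)

print(len(train_set), len(test_set))
print(test_set["gender"].value_counts(normalize=True))
```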
How to analyse the features
* through graphs
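* through the correlation coefficient, e.g. data.corr()
it shows a coefficient between -1 and 1 for each pair of attributes. values close to 1 or -1 mean a strong positive or negative linear correlation; values close to 0 mean no linear correlation.
* another one, use pandas scatter_matrix, which plots each attribute against every other attribute.
* combine attributes when they are of little use individually.
* data cleaning: remove unrelated attributes.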
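A small sketch of the correlation check and of combining attributes. The column names and the synthetic data are assumptions, built so that the combined attribute matters more than its parts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data (made up for illustration).
data = pd.DataFrame({
    "total_rooms": rng.integers(100, 1000, size=200),
    "households": rng.integers(50, 300, size=200),
})

# Combine two attributes into one that may be more informative.
data["rooms_per_household"] = data["total_rooms"] / data["households"]

# Here the label is built from the combined attribute plus some noise.
data["price"] = 50 * data["rooms_per_household"] + rng.normal(0, 5, size=200)

# Pairwise correlation coefficients, each between -1 and 1.
corr = data.corr()
print(corr["price"].sort_values(ascending=False))

# For the graphical version (requires matplotlib):
# from pandas.plotting import scatter_matrix
# scatter_matrix(data, figsize=(10, 8))
```

In this toy setup the combined attribute correlates with the label more strongly than either raw attribute, which is the reason for trying such combinations.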
How to handle text
* Convert to ordinal using OrdinalEncoder()
cons: the algorithm might assume that two nearby values are more similar than two distant ones, even when the categories have no natural order.
* OneHotEncoder creates one column for every category and puts 1 in the column of the category that the instance belongs to, and 0 in the others.
the output is a sparse matrix, which optimises space.
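A minimal sketch of both encoders on a made-up colour column. With string input, the categories are sorted alphabetically, so blue=0, green=1, red=2.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One categorical attribute with three categories (illustrative values).
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# Ordinal: one number per category.
ordinal = OrdinalEncoder().fit_transform(colors)
print(ordinal.ravel())

# One-hot: one column per category; the result is a sparse matrix.
one_hot = OneHotEncoder().fit_transform(colors)
dense = one_hot.toarray()  # only convert to dense for small data
print(dense)
```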
Fit and transform
* fit learns parameters from the existing data (e.g. the min and max, or the list of categories)
* transform uses the learned parameters to change the data
* fit_transform does the fit and transform together
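The difference can be sketched with MinMaxScaler on made-up numbers: fit learns the min and max from the training data, and transform reuses them on any other data, so unseen values may fall outside the 0-1 range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [20.0]])

scaler = MinMaxScaler()
scaler.fit(train)  # learns min=0 and max=10 from the training data only

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # reuses the training min and max

# fit_transform is the same as fit followed by transform on the same data.
same = MinMaxScaler().fit_transform(train)

print(train_scaled.ravel())
print(test_scaled.ravel())
```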
Feature scaling
It is important to bring all attributes to the same scale; otherwise certain algorithms might perform badly.
* MinMaxScaler
rescales the values to the 0-1 range
* Standardization (StandardScaler)
subtracts the mean value and divides by the standard deviation. it is less affected by outliers than min-max scaling.
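A short sketch comparing the two scalers on a made-up column that contains one outlier (100.0):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative values; the last one is an outlier.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max: (x - min) / (max - min), so the outlier squeezes
# the other values into a tiny part of the 0-1 range.
min_max = MinMaxScaler().fit_transform(values)

# Standardization: (x - mean) / std, giving mean 0 and std 1
# without a fixed output range.
standard = StandardScaler().fit_transform(values)

print(min_max.ravel())
print(standard.ravel())
```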
Custom Transformer and Transformation Pipelines
create your own transformer by implementing fit and transform.
create a pipeline that chains a sequence of transformers, optionally ending with a model.
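A minimal sketch of a custom transformer inside a pipeline. `RatioAdder` is a hypothetical transformer written for this example, and the data is made up so that the label equals the ratio of the two columns.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RatioAdder(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: appends column 0 / column 1 as a new feature."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so it can be chained.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        ratio = X[:, 0] / X[:, 1]
        return np.c_[X, ratio]


# Each step is a (name, estimator) pair; the last step is the model.
pipeline = Pipeline([
    ("add_ratio", RatioAdder()),
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])

X = np.array([[10.0, 2.0], [9.0, 3.0], [8.0, 4.0], [20.0, 5.0], [6.0, 6.0]])
y = X[:, 0] / X[:, 1]  # the label is exactly the ratio in this toy example

pipeline.fit(X, y)
predictions = pipeline.predict(X)
print(predictions)
```

TransformerMixin supplies fit_transform automatically once fit and transform are defined.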
How to validate the model
* use mean_squared_error
but measuring it on the training set alone is not enough, because a low training error may only mean that the model is overfitting
* use cross validation
split the training set further into folds. train on all folds except one and validate against the remaining fold, then calculate the MSE. repeat so that every fold is used once for validation.
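Both approaches can be sketched on synthetic data (the data and the choice of LinearRegression are assumptions for illustration). cross_val_score expects a score where higher is better, so it returns the negative MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic data: a linear relation plus noise.
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=100)

# Error measured on the same data the model was trained on.
model = LinearRegression().fit(X, y)
train_rmse = np.sqrt(mean_squared_error(y, model.predict(X)))

# 5-fold cross validation: each fold is used once for validation.
scores = cross_val_score(
    LinearRegression(), X, y,
    scoring="neg_mean_squared_error",
    cv=5,
)
cv_rmse = np.sqrt(-scores)

print(train_rmse)
print(cv_rmse.mean(), cv_rmse.std())
```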
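Fine-tuning the model
* GridSearchCV: tries every combination of the given hyperparameter values
* RandomizedSearchCV: tries a fixed number of random combinations
* Ensemble: combine several models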
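A small sketch of both searches. The synthetic data and the hyperparameter values are arbitrary assumptions; RandomForestRegressor is used as the model because it is itself an ensemble of decision trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

rng = np.random.default_rng(0)

# Synthetic data (made up for illustration).
X = rng.uniform(0, 10, size=(80, 2))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(0, 0.5, size=80)

# Grid search: evaluates all 2 x 2 = 4 combinations with 3-fold CV.
param_grid = {"n_estimators": [5, 10], "max_depth": [2, 4]}
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)

# Randomized search: evaluates only n_iter sampled combinations.
param_distributions = {"n_estimators": [5, 10, 20], "max_depth": [2, 3, 4, 5]}
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=3,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,
)
random_search.fit(X, y)

print(grid.best_params_)
print(random_search.best_params_)
```

After fitting, best_estimator_ holds the model retrained on the whole training set with the best combination found.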