Peter dev blog: First Attempt Into Creating Model

Performance Measure
A typical measure is the Root Mean Squared Error (RMSE). It gives an idea of how far the predictions are from the labels, in the same unit as the label.
Another one is the Mean Squared Error (MSE), which is the RMSE before taking the square root.
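
A minimal sketch of both measures, assuming scikit-learn and NumPy are installed. The labels and predictions are made-up numbers, only there to show how the two values relate.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up labels and predictions, for illustration only.
labels = np.array([3.0, 5.0, 7.0, 9.0])
predictions = np.array([2.0, 5.0, 8.0, 11.0])

mse = mean_squared_error(labels, predictions)  # mean of the squared errors
rmse = np.sqrt(mse)                            # back in the unit of the label

print(mse, rmse)
```

Taking the square root by hand avoids depending on which RMSE helper a given scikit-learn version offers.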

Some data is in string format. It needs to be converted to numbers before most algorithms can use it.

How to create the train and test sets
* The train and test sets can be created with sklearn's train_test_split. Normally, 20% of the data goes to the test set.
* It is important that both sets represent the real case in the same proportions (stratified sampling). e.g. if 30% of the real-world population are men, the data should also contain 30% men.
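
The two points above can be sketched as follows. The DataFrame and its column names (gender, value) are hypothetical. The stratify argument is one way to keep the proportions equal, not necessarily the only one.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 30% men and 70% women, as in the example above.
data = pd.DataFrame({
    "gender": ["man"] * 30 + ["woman"] * 70,
    "value": range(100),
})

# test_size=0.2 keeps 20% of the rows for the test set; stratify keeps
# the gender proportions the same in both sets.
train_set, test_set = train_test_split(
    data, test_size=0.2, stratify=data["gender"], random_state=42
)

print(train_set["gender"].value_counts(normalize=True))
print(test_set["gender"].value_counts(normalize=True))
```

Setting random_state makes the split repeatable, which helps when comparing models later.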

How to analyse the features
* through graphs
* through the correlation coefficient. e.g. data.corr()
it shows a coefficient between -1 and 1 for each pair of attributes. Values close to 1 or -1 mean a strong linear correlation; values close to 0 mean no linear correlation.
* another one, use pandas scatter_matrix. It plots each attribute against every other attribute.
* combine attributes when they are of little use individually.
* data cleaning. Remove unrelated attributes.
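
A small sketch of the correlation check and of combining attributes, assuming pandas. The column names (total_rooms, households, price) and all the numbers are invented for illustration.

```python
import pandas as pd

# Made-up housing-style data; all column names are hypothetical.
data = pd.DataFrame({
    "total_rooms": [10, 40, 30, 80, 60],
    "households": [5, 10, 10, 16, 15],
    "price": [100, 210, 160, 260, 205],
})

# Combine two attributes into one that may be more informative.
data["rooms_per_household"] = data["total_rooms"] / data["households"]

# Correlation of every attribute against the label, strongest first.
corr_matrix = data.corr()
print(corr_matrix["price"].sort_values(ascending=False))
```

To see the same relationships as plots, pandas.plotting.scatter_matrix(data) draws each attribute against the others (it needs matplotlib installed).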

How to handle text
* Convert to ordinal numbers using OrdinalEncoder()
cons: the algorithm might assume that two nearby values are more similar than two distant ones, even when the categories have no order.
* OneHotEncoder, which creates a column for every category and puts 1 in the column that applies to the instance (0 in the others).
The result is stored as a sparse matrix to save space.
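
Both encoders side by side, as a rough sketch. The colour column is a made-up example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical column with three categories.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# OrdinalEncoder: one number per category (categories are sorted).
ordinal = OrdinalEncoder().fit_transform(colors)

# OneHotEncoder: one column per category; the result is sparse by default.
one_hot = OneHotEncoder().fit_transform(colors)

print(ordinal.ravel())    # blue=0, green=1, red=2
print(one_hot.toarray())  # dense view, only for printing
```

Calling toarray() is fine for a tiny example, but on a column with thousands of categories it would undo the space saving.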

Fit and transform
* fit learns the parameters from existing data
* transform applies them to change the data
* fit_transform does the fit and the transform together
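
To make the three methods concrete, here is a sketch using StandardScaler as the example transformer (my choice; any sklearn transformer follows the same pattern). The arrays are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(train)                       # learns the mean and standard deviation of train
train_scaled = scaler.transform(train)  # applies what was learned
test_scaled = scaler.transform(test)    # reuses the same parameters; no fit on test

# fit_transform is shorthand for fit followed by transform on the same data.
same = StandardScaler().fit_transform(train)

print(scaler.mean_, np.allclose(train_scaled, same))
```

Fitting only on the training data keeps information from the test set out of the model.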

Feature scaling
It is important to bring all attributes to the same scale; otherwise certain algorithms might perform badly.
* MinMaxScaler
rescales the values to the 0-1 range
* Standardization (StandardScaler)
subtracts the mean value and divides by the standard deviation. It is less affected by outliers than min-max scaling.
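
A quick comparison of the two scalers on one made-up feature that contains an outlier. This is only a sketch to show what each scaler does to the values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature with an outlier (100.0); the values are made up.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

min_max = MinMaxScaler().fit_transform(values)     # squeezed into the 0-1 range
standard = StandardScaler().fit_transform(values)  # zero mean, unit standard deviation

print(min_max.ravel())
print(standard.ravel())
```

With min-max scaling the outlier takes the value 1 and pushes the other values close to 0. Standardization does not force the values into a fixed range.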

Custom Transformer and Transformation Pipelines
Create your own transformer by writing a class with fit and transform methods.
Create a pipeline that puts together a sequence of transformers, optionally ending with a model.
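
A minimal sketch of a custom transformer inside a pipeline. RatioAdder is a hypothetical class written for this example, and the data is made up. The sklearn base classes are real.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends column 0 divided by column 1."""

    def fit(self, X, y=None):
        return self  # nothing to learn for this transformer

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        ratio = X[:, 0] / X[:, 1]
        return np.c_[X, ratio]


# The steps run in order: add the ratio, scale, then fit the model.
pipeline = Pipeline([
    ("ratio_adder", RatioAdder()),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

X = np.array([[10.0, 5.0], [40.0, 10.0], [30.0, 10.0], [80.0, 16.0]])
y = np.array([100.0, 210.0, 160.0, 260.0])

pipeline.fit(X, y)
predictions = pipeline.predict(X)
print(predictions)
```

Inheriting from TransformerMixin gives the class fit_transform for free, so only fit and transform need to be written.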

How to validate the model
* use mean_squared_error
but measured on the training data alone it is not enough, because a low error there can simply mean the model is overfitting
* use cross validation
split the training set further into folds. Train on all folds but one, validate against the remaining fold, and calculate the MSE. Repeat so that each fold is used for validation once.
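
Cross validation as described above can be sketched with cross_val_score. The data is synthetic (a noisy straight line) and the choice of LinearRegression and 5 folds is only an assumption for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: a noisy straight line.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=100)

# cross_val_score expects "higher is better", hence the negated MSE.
scores = cross_val_score(
    LinearRegression(), X, y, scoring="neg_mean_squared_error", cv=5
)
mse_per_fold = -scores
rmse_per_fold = np.sqrt(mse_per_fold)

print(rmse_per_fold.mean(), rmse_per_fold.std())
```

Looking at the spread across folds, and not only the mean, gives a feel for how stable the estimate is.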

Fine-tuning the model
* GridSearchCV
* RandomizedSearchCV
* Ensemble: combine several models
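
A rough sketch of both search tools on synthetic data. RandomForestRegressor is used here because it is itself an ensemble of trees. The parameter values in param_grid are arbitrary picks for the example, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 2))
y = X[:, 0] * 2.0 + X[:, 1] + rng.normal(0, 0.5, size=60)

param_grid = {"n_estimators": [5, 10], "max_depth": [2, 4]}

# GridSearchCV tries every combination (2 x 2 = 4 here).
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X, y)

# RandomizedSearchCV tries only n_iter randomly chosen combinations.
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_grid,
    n_iter=2,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,
)
random_search.fit(X, y)

print(grid_search.best_params_)
print(random_search.best_params_)
```

The randomized search is the cheaper option when the grid of combinations gets large.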

Saturday, November 2, 2019
