Regression

1. Overview

Linear regression is mainly used to model the relationship between a dependent variable and one or more independent variables. The model is fitted by minimizing the sum of squared differences between the actual and predicted values; the optimization runs round by round until it arrives at the weight and bias of the model.
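To make the round-by-round optimization concrete, here is a minimal sketch of batch gradient descent for a single-feature model. It is only an illustration: the function name `fit_linear_gd`, the learning rate, the number of epochs, and the toy data are all assumptions, not what the notebook linked later actually runs. It minimizes the mean of the squared errors, which has the same minimizer as the sum.

```python
def fit_linear_gd(xs, ys, lr=0.01, epochs=5000):
    """Fit y ≈ w * x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Residuals of the current model on every sample.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (illustrative, not the sales data).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear_gd(xs, ys)
print(f"weight={w:.4f}, bias={b:.4f}")
```

On this noise-free toy data the fitted weight and bias settle very close to 2 and 1. On real prices the features would normally need scaling first, otherwise a learning rate this large may diverge.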

On the other hand, the limitations of linear regression are obvious. It only captures linear relationships, so it usually performs poorly on non-linear ones (e.g. points lying on a circle). It is also sensitive to outliers, and it can overfit, particularly when there are many features relative to the number of samples.
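The sensitivity to outliers is easy to see on a tiny example. The sketch below uses the closed-form least-squares solution for one feature and made-up points (none of this comes from the sales data); the helper name `fit_least_squares` is my own.

```python
def fit_least_squares(xs, ys):
    """Closed-form least-squares fit of y ≈ w * x + b for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


# Five points lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
clean_w, clean_b = fit_least_squares(xs, ys)

# The same points plus a single extreme value at x = 6.
outlier_w, outlier_b = fit_least_squares(xs + [6.0], ys + [60.0])

print(f"slope without outlier: {clean_w:.2f}")
print(f"slope with one outlier: {outlier_w:.2f}")
```

A single extreme point is enough to pull the slope far away from 2, because squaring the residuals gives large errors a disproportionate weight.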

lr_concept
@Image Source: https://editor.analyticsvidhya.com/uploads/375512.jpg

2. Data Prep

This section uses the same Amazon store sales data as before:
sales_raw_data

I took the following steps to clean and optimize the data:

  1. Keep only numerical values as input features for training
  2. Clean and format the numerical data
  3. Remove outliers from some columns
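The steps above can be sketched roughly as follows. This is not the notebook's code: the column names, the price string format (a currency symbol plus thousands separators), and the 1.5 × IQR outlier rule are all assumptions chosen for illustration, and the rows are invented.

```python
import statistics


def parse_price(text):
    """Turn a price string such as '₹1,099' into a float; return None if it cannot be parsed."""
    cleaned = "".join(ch for ch in str(text) if ch.isdigit() or ch == ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


def remove_outliers_iqr(rows, key, k=1.5):
    """Keep rows whose `key` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [row[key] for row in rows]
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [row for row in rows if low <= row[key] <= high]


# Hypothetical raw rows; column names and values are made up for this sketch.
raw_rows = [
    {"actual_price": "₹1,099", "discounted_price": "₹399"},
    {"actual_price": "₹349", "discounted_price": "₹199"},
    {"actual_price": "₹1,899", "discounted_price": "₹999"},
    {"actual_price": "₹699", "discounted_price": "₹349"},
    {"actual_price": "₹399", "discounted_price": "₹249"},
    {"actual_price": "₹999", "discounted_price": "₹599"},
    {"actual_price": "₹85,000", "discounted_price": "₹61,999"},
    {"actual_price": "N/A", "discounted_price": "₹150"},
]

# Steps 1 and 2: keep only the numerical columns and convert them to floats,
# dropping rows where a value cannot be parsed.
parsed_rows = []
for row in raw_rows:
    actual = parse_price(row["actual_price"])
    discounted = parse_price(row["discounted_price"])
    if actual is not None and discounted is not None:
        parsed_rows.append({"actual_price": actual, "discounted_price": discounted})

# Step 3: remove outliers from one column.
filtered_rows = remove_outliers_iqr(parsed_rows, "actual_price")
print(len(raw_rows), len(parsed_rows), len(filtered_rows))
```

The unparseable row is dropped during parsing, and the one very expensive item is dropped by the IQR filter.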

Code Step: https://github.com/BraydenZheng/Product_Recommendation/blob/master/nn/data_prepare.ipynb

The data is split into training and testing sets in an 80% / 20% proportion.
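A minimal sketch of such an 80/20 split, assuming a simple shuffle with a fixed seed. The helper `split_train_test` and the seed value are hypothetical; the notebook may well use a library function instead.

```python
import random


def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) parts."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the cleaned rows: 100 dummy records.
rows = list(range(100))
train_rows, test_rows = split_train_test(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before splitting matters when the rows are ordered (for example by category), so that both parts see a similar price distribution.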

‘Actual Price’ is taken as the independent variable and ‘Discounted Price’ as the dependent variable.

Cleaned data:
lr_dt

3. Code

Model training and evaluation: https://github.com/BraydenZheng/Product_Recommendation/blob/master/nn/regression.ipynb

4. Results

score: 0.57

lr_vs

To take one example, the first data row has X = 1099, meaning the product’s actual_price is 1099. Passing it into the trained model (weight = 0.33, bias = 228) gives 0.33 * 1099 + 228 ≈ 590. Here 590 is the discounted price predicted by the model, which can then be compared with the ground-truth discounted price for that row.
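The same arithmetic as a small sketch. The weight and bias are the rounded values quoted in this post, so the result only approximates what the trained model would return; the function name `predict` is my own.

```python
def predict(actual_price, weight=0.33, bias=228.0):
    """Apply the fitted line: discounted_price ≈ weight * actual_price + bias."""
    return weight * actual_price + bias


predicted = predict(1099)
print(f"predicted discounted price: {predicted:.2f}")
```

With the rounded parameters this comes out at about 590.67, which the text truncates to 590.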

As the linear regression graph shows, the data points are concentrated where the actual price is below 2000 and become much more spread out beyond that. The red regression line reflects the general trend of the dataset well. The computed score of 0.57 is not high, but given the visualization and the data distribution, I believe it is a solid training result.
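For readers wondering how a score like this is computed: I am assuming here that it is the coefficient of determination (R²), which is what scikit-learn's `LinearRegression.score` returns. Under that assumption, a minimal sketch with made-up numbers looks like this (the helper name `r_squared` is hypothetical):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Squared error of always predicting the mean.
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only, not taken from the sales data.
y_true = [3.0, 5.0, 7.0, 9.0]
perfect_score = r_squared(y_true, y_true)
mean_score = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])
close_score = r_squared(y_true, [4.0, 5.0, 7.0, 8.0])
print(perfect_score, mean_score, close_score)
```

A score of 1 means a perfect fit and 0 means no better than always predicting the mean, so 0.57 would say the line explains a bit more than half of the variance in the discounted price.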

5. Conclusion

Linear regression can be quite useful for simple relationships between variables, especially in a two-variable case like this one, with a single dependent and a single independent variable. Its major benefits, model simplicity and speed, can make it an effective component in retail and other big data applications.
