Regression
1. Overview
Linear regression is mainly used to model the relationship between a dependent variable and one or more independent variables. It works by minimizing the sum of squared differences between the actual and predicted values; after round-by-round optimization, this yields the weight and bias of the model.
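As a minimal sketch of that round-by-round optimization, the snippet below fits a weight and bias with plain gradient descent on the mean squared error. The toy data, learning rate, and step count are illustrative assumptions, not values from the notebook, which may well rely on a library solver instead.

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]


def fit_gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Fit y ≈ w * x + b by minimizing the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current model on every point.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # One optimization round: step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w, b = fit_gradient_descent(xs, ys)
print(f"weight={w:.3f}, bias={b:.3f}")
```

On this noise-free toy data the loop should recover a weight close to 2 and a bias close to 1.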
On the other hand, the limitations of linear regression are clear. It only captures linear relationships, so it usually performs poorly on non-linear ones (for example, points lying on a circle). It is also sensitive to outliers, and it can overfit when the model has many features relative to the amount of data.
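Both limitations can be seen on small made-up datasets. The sketch below uses a hand-written closed-form least-squares fit (the helper `fit_line` and all data points are assumptions for illustration) on points around a circle, and on a straight line with and without one extreme value.

```python
import math


def fit_line(xs, ys):
    """Closed-form least-squares fit of y ≈ w * x + b."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    return w, mean_y - w * mean_x


# Non-linear case: 8 evenly spaced points on the unit circle.
angles = [2 * math.pi * k / 8 for k in range(8)]
circle_w, circle_b = fit_line([math.cos(a) for a in angles],
                              [math.sin(a) for a in angles])

# Outlier case: points on y = 2x, then the same points plus one extreme value.
clean_w, _ = fit_line([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
outlier_w, _ = fit_line([1, 2, 3, 4, 5, 6], [2, 4, 6, 8, 10, 100])

print(f"circle slope: {circle_w:.3f}")
print(f"slope without outlier: {clean_w:.3f}, with outlier: {outlier_w:.3f}")
```

The fitted line for the circle is essentially flat, so it says nothing useful about y, and a single extreme point is enough to pull the slope far away from 2.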
@Image Source: https://editor.analyticsvidhya.com/uploads/375512.jpg
2. Data Prep
This post uses the same Amazon store sales data as before.
I took the following steps to clean and prepare the data:
- Keep only numerical values as input features for training
- Clean and format the numerical data
- Remove outliers from some columns
Code Step: https://github.com/BraydenZheng/Product_Recommendation/blob/master/nn/data_prepare.ipynb
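For readers who do not want to open the notebook, here is a rough sketch of what the cleaning and outlier removal steps could look like. It is not the notebook's code: the raw price format (a '₹' prefix with comma separators), the 1.5 × IQR rule, and both helper names are assumptions made for illustration.

```python
import statistics


def parse_price(raw):
    """Convert a price string such as '₹1,099' to a float.

    The '₹' prefix and comma separators are assumptions about the raw format.
    """
    return float(raw.replace("₹", "").replace(",", "").strip())


def remove_outliers_iqr(values, k=1.5):
    """Keep values within [Q1 - k * IQR, Q3 + k * IQR] (an assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Made-up raw values, not rows from the dataset.
raw_prices = ["₹1,099", "₹1,200", "₹1,150", "₹1,300", "₹1,250", "₹1,100", "₹50,000"]
prices = [parse_price(p) for p in raw_prices]
print(remove_outliers_iqr(prices))
```

In this made-up sample only the 50,000 entry falls outside the IQR bounds and is dropped.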
The data is split into training and testing sets in an 80% / 20% ratio.
‘Actual Price’ is taken as the independent variable and ‘Discounted Price’ as the dependent variable.
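A minimal sketch of an 80/20 split, using only the standard library, is shown below. The helper `train_test_split`, the fixed seed, and the sample rows are assumptions for illustration; the notebook may use a library function for this step.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and split off the last test_ratio share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    split_at = len(shuffled) - n_test
    return shuffled[:split_at], shuffled[split_at:]


# Hypothetical (actual_price, discounted_price) pairs, not rows from the dataset.
rows = [(1000 + 100 * i, 500 + 40 * i) for i in range(10)]
train, test = train_test_split(rows)

# Independent variable (X) and dependent variable (y) for training.
X_train = [actual for actual, _ in train]
y_train = [discounted for _, discounted in train]
print(len(train), len(test))
```

Shuffling before splitting keeps the test set from being biased by the original row order.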
Cleaned Data
3. Code
Model training and evaluation: https://github.com/BraydenZheng/Product_Recommendation/blob/master/nn/regression.ipynb
4. Results
score: 0.57
To take one example, the first data row has X = 1099, meaning this product’s actual_price is 1099. Passing it into the trained model (weight = 0.33, bias = 228) gives 0.33 * 1099 + 228 ≈ 590. Here 590 is the discounted price predicted by the model, which can be compared with the ground truth discounted price for that row.
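The same arithmetic as a tiny sketch, using the rounded weight and bias quoted above (the function name is made up for illustration):

```python
WEIGHT = 0.33  # rounded weight quoted in the text
BIAS = 228.0   # rounded bias quoted in the text


def predict_discounted_price(actual_price):
    """Apply the fitted line: prediction = weight * actual_price + bias."""
    return WEIGHT * actual_price + BIAS


prediction = predict_discounted_price(1099)
print(f"{prediction:.2f}")  # 590.67
```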
As the linear regression graph shows, the data points are concentrated where the actual price is below 2000 and become quite spread out above that. The red regression line reflects the general trend of the dataset well. The computed score for the model is 0.57, which is not high, but based on the visualization and the data distribution, I believe it is a good training result.
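The text does not say how the score is computed. A common choice for regression, and the one assumed in this sketch, is the coefficient of determination R², where 1.0 means a perfect fit and 0.0 means the model does no better than always predicting the mean. The function below is hand-written and the sample values are made up.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up ground truth and predictions, not values from the dataset.
y_true = [400.0, 600.0, 800.0, 1000.0]
y_pred = [450.0, 550.0, 850.0, 950.0]
score = r2_score(y_true, y_pred)
print(f"{score:.2f}")
```

Under this reading, a score of 0.57 would mean the line explains a bit more than half of the variance in discounted price.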
5. Conclusion
Linear regression can be quite useful for simple relationships between variables, especially in a two-variable case like this one, with a single independent variable and a single dependent variable. Its main benefits, simplicity and speed, can make it an effective component in retail and other big data applications.