Python Regression Models
Simple prediction models find patterns in structured data and use them to predict the output for new inputs. Linear regression models predict numeric values, while classification models predict which category something belongs to. The data must be structured in an array-like format such as an NDArray or a DataFrame. The data must also be numeric, so boolean values (True/False) and strings (category labels) have to be converted to numbers first. When training a prediction model, a small portion of the available data should be set aside for testing the model later; this portion is called the test set.
Linear Regression
(Optional) Generate example data
- The matrix X has a row for each of 100 data entries, and each row has three values (x1, x2, and x3).
- The target values (y) are calculated with the equation y = 2*x1 + 3*x2 + x3 + 4
| x1 | x2 | x3 | y |
|---|---|---|---|
| 8 | 5 | 6 | 41 |
| 7 | 6 | 8 | 44 |
| 5 | 9 | 10 | 51 |
| ... | ... | ... | ... |
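The example data described above could be generated along these lines. This is a minimal sketch, not necessarily the original code: the use of NumPy's random generator, the seed, and the integer range 1 to 10 are assumptions chosen to resemble the sample rows.

```python
import numpy as np

# Assumption: a seeded generator so the example is repeatable
rng = np.random.default_rng(seed=0)

# X: 100 data entries (rows), each with three values x1, x2, x3
# Assumption: random integers from 1 to 10 inclusive
X = rng.integers(1, 11, size=(100, 3))

# y: target values from the equation y = 2*x1 + 3*x2 + x3 + 4
y = 2 * X[:, 0] + 3 * X[:, 1] + X[:, 2] + 4

print(X[:3])  # first three input rows
print(y[:3])  # their target values
```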
Split the data
- We split the data so that the model can be tested on values different from those it was trained on.
- X and y can be NDArrays or DataFrames. X is a matrix with a row for the inputs of each data entry, and y is an array with the outputs.
- The test_size parameter indicates what share of the available data should be set aside for testing (20% in this example). You can adjust it according to the amount of available data.
- In this example, X_tr will be a matrix with 80% of the data entries in X, and X_tst will get the other 20% of the data entries. Likewise y_tr and y_tst will be split from y.
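The split described above might look like the following sketch. It assumes scikit-learn's `train_test_split`, which the `test_size` parameter suggests; the `random_state` value and the data generation step are assumptions added so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: example data generated as in the previous section
rng = np.random.default_rng(seed=0)
X = rng.integers(1, 11, size=(100, 3))
y = 2 * X[:, 0] + 3 * X[:, 1] + X[:, 2] + 4

# Set aside 20% of the entries for testing; the rest is used for training.
# random_state is an assumption that makes the shuffle repeatable.
X_tr, X_tst, y_tr, y_tst = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_tr.shape, X_tst.shape)  # training rows vs. test rows
print(y_tr.shape, y_tst.shape)
```

Rows are shuffled before splitting by default, so each entry in `X_tr` stays paired with the matching entry in `y_tr`.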
Create a linear regression model
- The fit function trains the model. It takes a matrix of input entries and an array of their corresponding outputs, and finds the pattern used for prediction.
- The trained model can also be represented mathematically as an equation with coefficients (the numbers that multiply each variable) and an intercept.
- For example, if the coefficients are 2, 3, and 1 and the y-intercept is 4, the equation is y = 2*x1 + 3*x2 + x3 + 4.
- A data entry with the inputs [5, 7, 1] would be input into the example equation as y = 2*(5) + 3*(7) + (1) + 4, which gives y = 36.
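Creating and training the model could be sketched as follows, assuming scikit-learn's `LinearRegression`. The data generation and split are repeated here (with the same assumed seed and range) only so the block is self-contained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumption: example data and split as in the earlier sections
rng = np.random.default_rng(seed=0)
X = rng.integers(1, 11, size=(100, 3))
y = 2 * X[:, 0] + 3 * X[:, 1] + X[:, 2] + 4
X_tr, X_tst, y_tr, y_tst = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Create the model and train it on the training portion only
model = LinearRegression()
model.fit(X_tr, y_tr)

# The learned equation: one coefficient per input column, plus the intercept
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

Because this example data follows the equation exactly, the learned coefficients should come out very close to 2, 3, and 1, and the intercept very close to 4.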
Test the accuracy of your model
- After the model has been trained, you can use it for prediction. It accepts multiple data entries at a time in the form of a matrix of inputs, and it produces an array of outputs.
- Check the accuracy by comparing the predicted results with the actual results to see whether the model needs improvement.
- The Mean Squared Error (MSE) is the average of the squared differences between the predicted values and the actual values. A perfect score is 0.
- Some ways to improve the model are training with more data, scaling the data, adding or removing variables, removing outliers, regularization, cross-validation, etc.
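The evaluation step can be sketched like this, assuming scikit-learn's `mean_squared_error`. The setup lines repeat the earlier assumed data, split, and model so the block runs by itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: example data, split, and model as in the earlier sections
rng = np.random.default_rng(seed=0)
X = rng.integers(1, 11, size=(100, 3))
y = 2 * X[:, 0] + 3 * X[:, 1] + X[:, 2] + 4
X_tr, X_tst, y_tr, y_tst = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_tr, y_tr)

# Predict outputs for all test entries at once (matrix in, array out)
y_pred = model.predict(X_tst)

# Compare predictions with the actual test values
mse = mean_squared_error(y_tst, y_pred)
print("MSE:", mse)
```

With this noise-free example data the MSE should be practically 0; real data will normally give a larger value.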
Predict individual values
- This inputs the values 9, 7, and 1 into the model and displays the predicted value.
- Since the model is made to work with multiple entries at a time, we pass the inputs as a 2D array with a single row (data entry).
- The model output is in the form of an array, so we use '[0]' to get the first (and only) value from it.
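A single prediction might be written as in the sketch below. For brevity it trains on all of the assumed example data rather than on a training split; that shortcut is an assumption of this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: example data generated as in the earlier sections
rng = np.random.default_rng(seed=0)
X = rng.integers(1, 11, size=(100, 3))
y = 2 * X[:, 0] + 3 * X[:, 1] + X[:, 2] + 4
model = LinearRegression().fit(X, y)

# One data entry, wrapped in an outer list to form a 2D array with one row
single_entry = [[9, 7, 1]]

# predict returns an array, so [0] extracts the single predicted value
prediction = model.predict(single_entry)[0]
print(prediction)
```

For the example equation, the expected value is 2*(9) + 3*(7) + (1) + 4 = 44, so the printed prediction should be approximately 44.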
All together
- This creates, trains, runs, and evaluates a linear regression model for predicting values.
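Put together, the whole workflow could look like the sketch below. It assumes scikit-learn and NumPy as in the individual steps; the seed, integer range, and `random_state` are illustrative assumptions rather than required values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# 1. Generate example data (assumed: integers 1-10, seeded for repeatability)
rng = np.random.default_rng(seed=0)
X = rng.integers(1, 11, size=(100, 3))
y = 2 * X[:, 0] + 3 * X[:, 1] + X[:, 2] + 4

# 2. Split into 80% training data and 20% test data
X_tr, X_tst, y_tr, y_tst = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 3. Create and train the model
model = LinearRegression()
model.fit(X_tr, y_tr)
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)

# 4. Evaluate on the test set
y_pred = model.predict(X_tst)
mse = mean_squared_error(y_tst, y_pred)
print("MSE:", mse)

# 5. Predict a single new entry
new_value = model.predict([[9, 7, 1]])[0]
print("prediction for [9, 7, 1]:", new_value)
```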
Challenge
Write code to generate two columns of 200 random input values stored in an NDArray. Then generate a single-column NDArray with the actual values, calculated according to the equation y = 5a - 4b + 7 with a small random value (between -2 and 2) added for variation. Then create a linear regression model that can predict a value when given two inputs. Train the model on the first 180 values and test it on the last 20 values. Evaluate the model and display the accuracy. Display the coefficients and check that the model's estimates are close to the actual coefficients (5 and -4).