Python Prediction Models

Python Regression Models

Simple prediction models find patterns in structured data so that they can predict the output for new inputs. Linear regression models predict numeric values, while classification models predict which category something belongs to. The data must be structured in an array-like format such as a NumPy array (ndarray) or a pandas DataFrame. The data must also be numeric, so boolean values (True/False) and strings (category labels) have to be converted to numbers first. When training a prediction model, a portion of the available data should be set aside for testing the model later. This portion is called the test set.
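As a minimal sketch of the conversion step, assuming two hypothetical columns named is_member and colors that are not part of this tutorial's data, booleans can be cast to integers and category labels can be expanded into one 0/1 column per category (one-hot encoding) using only NumPy:

```python
import numpy as np

# Hypothetical raw columns, invented for illustration only
is_member = np.array([True, False, True, True])
colors = np.array(["red", "green", "red", "blue"])

# Booleans become 1 and 0
member_num = is_member.astype(int)

# Sorted unique labels: ['blue' 'green' 'red']
categories = np.unique(colors)

# One column per category, 1 where the row has that label
one_hot = (colors[:, None] == categories).astype(int)

# Combine into a single numeric input matrix
X = np.column_stack([member_num, one_hot])
print(X)
```

One-hot columns avoid implying an order among the categories. pandas and scikit-learn provide their own encoders for the same job, which may be more convenient for larger datasets.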

Linear Regression

(Optional) Generate example data

import numpy as np

X = np.random.randint(5, 11, size=(100, 3))  # Create a 100x3 matrix of random integers from 5 to 10
y = X[:, 0]*2 + X[:, 1]*3 + X[:, 2] + 4      # Generate actual output values from an equation

The matrix X has a row for each of 100 data entries, and each row has three values (x1, x2, and x3).
The target values (y) are calculated with the equation y = 2*x1 + 3*x2 + x3 + 4

x1	x2	x3
8	5	6
7	6	8
5	9	10
...	...	...

y
41
44
51
...

Split the data

from sklearn.model_selection import train_test_split

# Split input values (X) and target values (y) into training sets and testing sets
X_tr, X_tst, y_tr, y_tst = train_test_split(X, y, test_size=0.2)

We split the data so that we can test the model on values different from those it was trained on.
X and y can be NumPy arrays or DataFrames. X is a matrix with one row of inputs per data entry, and y is an array with the corresponding outputs.
The test_size parameter sets the fraction of the available data to set aside for testing. You can adjust it according to the amount of available data.
In this example, X_tr will be a matrix with 80% of the data entries in X, and X_tst will get the other 20% of the data entries. Likewise y_tr and y_tst will be split from y.
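To make the 80/20 split concrete, here is a small sketch that checks the resulting shapes. It uses placeholder data (consecutive integers rather than the random matrix above) and passes random_state=0, an optional parameter not used in this tutorial's code, so that the same split is produced on every run:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: row i of X is [3i, 3i+1, 3i+2], and y[i] is i
X = np.arange(300).reshape(100, 3)
y = np.arange(100)

# random_state fixes the shuffle so the split is reproducible
X_tr, X_tst, y_tr, y_tst = train_test_split(X, y, test_size=0.2, random_state=0)

print(X_tr.shape, X_tst.shape)  # (80, 3) (20, 3)
print(y_tr.shape, y_tst.shape)  # (80,) (20,)

# Each input row stays paired with its own output after shuffling
print(np.array_equal(X_tr[:, 0] // 3, y_tr))  # True
```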

Create a linear regression model

from sklearn.linear_model import LinearRegression

myModel = LinearRegression()  # Create the model
myModel.fit(X_tr, y_tr)       # Train the model
print(myModel.coef_)          # Display the coefficients of the prediction equation
print(myModel.intercept_)     # Display the y-intercept of the prediction equation

The fit method trains the model. It takes a matrix of input entries and an array of their corresponding outputs, and uses them to find the pattern for prediction.
The trained model can also be represented mathematically as an equation with the coefficients (numbers before variables) and the intercept.
For example, if the coefficients are 2, 3 and 1 and the y-intercept is 4, the equation is y = 2*x1 + 3*x2 + x3 + 4
A data entry with the inputs [5, 7, 1] would be input into the example equation as y = 2*(5) + 3*(7) + (1) + 4
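The arithmetic above can be checked with a few lines of NumPy. This sketch hard-codes the example coefficients and intercept instead of reading them from a trained model, and the variable names are chosen here for illustration:

```python
import numpy as np

coef = np.array([2, 3, 1])  # example coefficients for x1, x2, x3
intercept = 4               # example y-intercept

# Single entry: 2*(5) + 3*(7) + (1) + 4 = 10 + 21 + 1 + 4
entry = np.array([5, 7, 1])
y_single = np.dot(coef, entry) + intercept
print(y_single)  # 36

# Several entries at once: the three rows shown in the example table
rows = np.array([[8, 5, 6],
                 [7, 6, 8],
                 [5, 9, 10]])
y_rows = rows @ coef + intercept
print(y_rows)  # [41 44 51]
```

The second calculation reproduces the y values listed in the example table.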

Test the accuracy of your model

from sklearn.metrics import mean_squared_error

myPreds = myModel.predict(X_tst)              # Predict the outputs for all entries in the test set
myError = mean_squared_error(y_tst, myPreds)  # Get the error score
print("MSE:", round(myError, 3))              # Display the error, rounded to 3 decimals

After the model has been trained, you can use it for prediction. It accepts multiple data entries at a time in the form of a matrix of inputs, and it produces an array of outputs.
You check the accuracy by comparing the predicted results to the actual results so you can see if the model needs improvement.
The Mean Squared Error (MSE) is the average of the squared differences between the predicted values and the actual values. A perfect score would be 0.
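To show what the MSE measures, this sketch computes it by hand with NumPy. The predicted values are made up for illustration and do not come from a trained model:

```python
import numpy as np

actual = np.array([41, 44, 51])
predicted = np.array([40, 44, 54])  # made-up predictions

errors = predicted - actual         # [-1, 0, 3]
mse = np.mean(errors ** 2)          # (1 + 0 + 9) / 3
print("MSE:", round(mse, 3))        # MSE: 3.333

# The square root puts the error back in the same units as y
rmse = np.sqrt(mse)
print("RMSE:", round(rmse, 3))
```

Because the differences are squared, a few large misses raise the score much more than many small ones.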
Some ways to improve the model include training with more data, scaling the data, adding or removing variables, removing outliers, applying regularization, and using cross-validation.
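As one illustration of the techniques listed, here is a hedged sketch of 5-fold cross-validation using scikit-learn's cross_val_score. The random noise added to y, the seed, and the number of folds are assumptions made for this example and are not part of the tutorial's data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)                # seeded for repeatable results
X = rng.integers(5, 11, size=(100, 3))        # integers from 5 to 10
noise = rng.normal(0, 1, size=100)            # assumed noise, for illustration
y = X[:, 0]*2 + X[:, 1]*3 + X[:, 2] + 4 + noise

# Train and test 5 times, each time holding out a different fifth of the data
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")

mse_per_fold = -scores                        # the scorer returns negated MSE
print("MSE per fold:", mse_per_fold.round(3))
print("Average MSE:", round(mse_per_fold.mean(), 3))
```

Cross-validation does not change the model by itself. It gives a steadier error estimate than a single split, which helps when comparing the other adjustments.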

Predict individual values

myPred = myModel.predict(np.array([[9, 7, 1]]))  # Predict the value from the inputs
print("Prediction:", myPred[0])                  # Display the predicted value

This inputs the values 9, 7, and 1 into the model and displays the predicted value.
Since the model is made to work with multiple entries at a time, we pass the inputs as a 2D array with a single row (data entry).
The model output is in the form of an array, so we use '[0]' to get the first (and only) value from it.
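The shapes involved can be sketched as follows. This example trains its own small model on noise-free data from the same equation, and the names model, entry, and row are chosen here for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.integers(5, 11, size=(100, 3))
y = X[:, 0]*2 + X[:, 1]*3 + X[:, 2] + 4
model = LinearRegression().fit(X, y)

entry = np.array([9, 7, 1])
print(entry.shape)          # (3,)   a 1D array of three inputs

row = entry.reshape(1, -1)  # one row, as many columns as needed
print(row.shape)            # (1, 3) a 2D array with a single data entry

pred = model.predict(row)
print(pred.shape)           # (1,)   one output per input row
print(round(pred[0], 3))    # close to 44, since 2*9 + 3*7 + 1 + 4 = 44
```

Passing the 1D array entry directly to predict raises an error in scikit-learn, which is why the reshape (or the double brackets used above) is needed.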

All together

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.random.randint(5, 11, size=(100, 3))  # Create a 100x3 matrix of random integers from 5 to 10
y = X[:, 0]*2 + X[:, 1]*3 + X[:, 2] + 4      # Generate actual output values from an equation

X_tr, X_tst, y_tr, y_tst = train_test_split(X, y, test_size=0.2)

myModel = LinearRegression()  # Create the model
myModel.fit(X_tr, y_tr)       # Train the model

myPreds = myModel.predict(X_tst)              # Predict the outputs for all entries in the test set
myError = mean_squared_error(y_tst, myPreds)  # Get the error score
print("MSE:", round(myError, 3))              # Display the error, rounded to 3 decimals

myPred = myModel.predict(np.array([[9, 7, 1]]))  # Predict the value from the inputs
print("Prediction:", myPred[0])                  # Display the predicted value

This creates, trains, runs, and evaluates a linear regression model for predicting values.

Challenge

Write code to generate two columns of 200 random input values stored in a NumPy array. Then generate a single-column NumPy array with the actual values calculated according to the equation y = 5a - 4b + 7, with a small random value (between -2 and 2) added for variation. Then create a linear regression model that can predict a value when given two inputs. Train the model on the first 180 values and test it on the last 20 values. Evaluate the model and display the error. Display the coefficients and check that the model's estimates are close to the actual coefficients.
