Python Classification Models

Simple classification models find patterns in structured data and learn to predict an item's class from its features. The data must be structured in an array-like format such as an NDArray or a DataFrame. The data must also be numeric, so boolean values (True/False) and strings (category labels) must first be converted to numbers. When training a prediction model, a small portion of the available data should be set aside for testing the model later; this portion is called the test set.
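For example, boolean and string columns can be converted to numbers before training. A minimal sketch using pandas, with a made-up DataFrame (the column names here are my own, for illustration):

```python
import pandas as pd

# Hypothetical raw data with a boolean column and a string (category label) column
df = pd.DataFrame({
    "is_member": [True, False, True],
    "color": ["red", "blue", "red"],
})

df["is_member"] = df["is_member"].astype(int)           # True/False -> 1/0
df["color"] = df["color"].astype("category").cat.codes  # labels -> integer codes
```

After this, every column is numeric and the DataFrame can be used as model input.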


Binary Classification

(Optional) Generate demo data for binary classification

import numpy as np

X = np.random.randint(0,6,size=(100,2))    # Creates a 100x2 matrix with random values from 0 to 5
y = np.where( X[:,0] + X[:,1]*2 > 7, 1, 0) # Have the outputs be 1 only if x1 + 2*x2 > 7
  • This generates training data labeled 1 whenever the first value plus double the second value is greater than 7, and 0 otherwise.
 x1  x2  |  y
  1   5  |  1
  4   2  |  1
  3   0  |  0
 ... ... | ...

Split the data

from sklearn.model_selection import train_test_split

# Split input values (X) and target values (y) into training sets and testing sets
X_tr, X_tst, y_tr, y_tst = train_test_split(X, y, test_size=0.2)
  • We split the data so that we can test the model on values it was not trained on.
  • X and y can be NDArrays or DataFrames. X is a matrix with a row for the inputs of each data entry, and y is an array with the outputs.
  • The test_size parameter indicates how much of the available data should be set aside for testing. You can adjust it according to the amount of available data.
  • In this example, X_tr will be a matrix with 80% of the data entries in X, and X_tst will get the other 20% of the data entries. Likewise y_tr and y_tst will be split from y.
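The 80/20 proportions can be checked directly from the shapes of the returned arrays. A quick sketch with made-up placeholder data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 entries with 2 features each, and 100 labels
X = np.arange(200).reshape(100, 2)
y = np.zeros(100, dtype=int)

X_tr, X_tst, y_tr, y_tst = train_test_split(X, y, test_size=0.2)

print(X_tr.shape, X_tst.shape)  # (80, 2) (20, 2)
print(y_tr.shape, y_tst.shape)  # (80,) (20,)
```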

Create a binary classification model (logistic regression)

from sklearn.linear_model import LogisticRegression

myModel = LogisticRegression() # Create the model
myModel.fit(X_tr, y_tr)        # Train the model
  • Binary classification is for when the output can be one of two options (e.g. True or False).
  • Despite its name, logistic regression is a classification method: like linear regression it computes a linear combination of the inputs, but it then passes that value through a sigmoid function to get a probability and classifies the entry as 1 if the probability meets a threshold (0.5 by default).
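The thresholding idea can be sketched by hand. This is a simplified illustration with made-up weights and bias, not the model's actual fitted values:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up weights and bias, for illustration only
w = np.array([1.0, 2.0])
b = -7.0

x = np.array([4, 2])
p = sigmoid(w @ x + b)        # probability that the class is 1
label = 1 if p >= 0.5 else 0  # classify by thresholding the probability at 0.5
print(p, label)
```

Here the linear part (w @ x + b) plays the role of linear regression's output, and the sigmoid squashes it into a probability between 0 and 1.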

Test the accuracy of a classification model

from sklearn.metrics import accuracy_score

myPreds = myModel.predict(X_tst)         # Predict the outputs for all entries in the test set
myScore = accuracy_score(y_tst, myPreds) # Get the accuracy score
print("Accuracy:", myScore)              # Display the accuracy
  • For classification models, we measure accuracy by checking whether each predicted value exactly matches the actual value, unlike regression models, where we measure how close each prediction is to the expected value.
  • The accuracy score returns a value between 0 and 1, with 1 being the best (100% accuracy).
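Accuracy is simply the fraction of matching predictions, so accuracy_score is equivalent to a manual comparison. A sketch with made-up label arrays:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0])  # made-up actual labels
y_pred = np.array([1, 0, 0, 1, 0])  # made-up predicted labels

manual = np.mean(y_true == y_pred)  # 4 of 5 predictions match -> 0.8
print(manual, accuracy_score(y_true, y_pred))
```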

Classify an individual entry

myPred = myModel.predict(np.array([[4,2]])) # Predict the classification from the inputs
print("Prediction:", myPred[0])             # Display the prediction
  • This inputs the values 4 and 2 into the model and displays the predicted value.
  • Since the model is made to work with multiple entries at a time, we pass the inputs as a 2D array with a single row (data entry).
  • The model output is in the form of an array, so we use '[0]' to get the first (and only) value from it.

See probabilities for classification predictions

myProbs = myModel.predict_proba(np.array([[4,2]])) # Predict the probabilities for classifications
print("Probabilities:", myProbs)                   # Display the predicted probabilities
  • When you use the predict_proba() function instead of predict(), the output is a 2D array in which each row contains the probability of class 0 followed by the probability of class 1.
  • For example, the output may be [[0.34 0.66]], meaning there is a probability of 34% that it should be classified as a 0, and 66% that it should be a 1.
  • If both numbers are close to 0.5, that means that the model is not predicting with much confidence, and you might want to train it with more data.
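The probabilities in each row always sum to 1, and predict() effectively picks the class with the highest probability. A sketch using a made-up probability row shaped like predict_proba() output:

```python
import numpy as np

probs = np.array([[0.34, 0.66]])  # made-up row shaped like predict_proba() output

row_sums = probs.sum(axis=1)      # each row's probabilities sum to 1
label = probs.argmax(axis=1)[0]   # the class predict() would choose: index of the largest probability
print(row_sums, label)
```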

All together

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = np.random.randint(0,6,size=(100,2))    # Creates a 100x2 matrix with random values from 0 to 5
y = np.where( X[:,0] + X[:,1]*2 > 7, 1, 0) # Have the outputs be 1 only if x1 + 2*x2 > 7

X_tr, X_tst, y_tr, y_tst = train_test_split(X, y, test_size=0.2)

myModel = LogisticRegression() # Create the model
myModel.fit(X_tr, y_tr)        # Train the model

myPreds = myModel.predict(X_tst)         # Predict the outputs for all entries in the test set
myScore = accuracy_score(y_tst, myPreds) # Get the accuracy score
print("Accuracy:", myScore)              # Display the accuracy

myPred = myModel.predict(np.array([[4,2]])) # Predict the value from the inputs
print("Prediction:", myPred[0])             # Display the predicted value

myProbs = myModel.predict_proba(np.array([[4,2]])) # Predict the probabilities for classifications
print("Probabilities:", myProbs)                   # Display the predicted probabilities
  • This creates, trains, and runs a logistic regression classification model, but there are other types.

Multiclass Classification

* Modifications to Binary Classification code above

(Optional) Generate example data for multiclass classification

import numpy as np

X = np.random.randint(-10,11,size=(100,2)) # Creates a 100x2 matrix with random values between -10 and 10
y = np.where( X[:,1] > 0,
              np.where(X[:,0] > 0, 1, 2),
              np.where(X[:,0] > 0, 4, 3) ) # Label data points according to quadrants
  • This generates data as if on coordinates of a grid with 'x1' as the horizontal axis and 'x2' as the vertical axis.
  • It classifies data points by the quadrant they fall in (1 = Top Right (TR), 2 = TL, 3 = BL, 4 = BR).
 x1  x2  |  y
 -4   8  |  2
  7   1  |  1
 -3  -3  |  3
 ... ... | ...
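The nested np.where can be read as a per-point quadrant rule. A sketch checking a plain-Python version against the vectorized one (the helper name is my own):

```python
import numpy as np

def quadrant(x1, x2):
    # 1 = top right, 2 = top left, 3 = bottom left, 4 = bottom right
    if x2 > 0:
        return 1 if x1 > 0 else 2
    return 4 if x1 > 0 else 3

X = np.array([[-4, 8], [7, 1], [-3, -3], [5, -2]])
y_vec = np.where(X[:,1] > 0,
                 np.where(X[:,0] > 0, 1, 2),
                 np.where(X[:,0] > 0, 4, 3))
y_loop = np.array([quadrant(x1, x2) for x1, x2 in X])
print(y_vec)  # [2 1 3 4]
```

Note that points lying exactly on an axis (a coordinate of 0) fall into the "not greater than 0" branch under this rule.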

Create a multiclass classification model

from sklearn.linear_model import LogisticRegression

myModel = LogisticRegression(solver='lbfgs') # Create the model
myModel.fit(X_tr, y_tr)                      # Train the model
  • Multiclass classification is for when the output can be one of more than two options.
  • There are other models that can be used for this, such as SVC and RandomForestClassifier.
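For instance, RandomForestClassifier is a drop-in swap because it exposes the same fit/predict interface. A sketch that regenerates the quadrant data locally (the fixed seeds are my own, so the sketch is repeatable):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

np.random.seed(0)  # fixed seed for repeatability
X = np.random.randint(-10, 11, size=(200, 2))
y = np.where(X[:,1] > 0,
             np.where(X[:,0] > 0, 1, 2),
             np.where(X[:,0] > 0, 4, 3))

X_tr, X_tst, y_tr, y_tst = train_test_split(X, y, test_size=0.2, random_state=0)

myModel = RandomForestClassifier()  # same interface as LogisticRegression
myModel.fit(X_tr, y_tr)
print(myModel.score(X_tst, y_tst))  # mean accuracy on the test set
```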

Challenge

Write code to generate 2 columns of 100 random input values of 0 or 1 stored in an NDArray. Then generate a single column NDArray that classifies them according to the AND operator (as 1 if both inputs are 1, and as 0 otherwise). Then create a classification model that can predict the classification when provided the inputs. Train the model on the first 90 values and test it on the last 10 values. Evaluate the model and display the accuracy.
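One possible solution sketch (try the challenge yourself first; variable names and the fixed seed are my own, and the split uses slicing rather than train_test_split so the first 90 rows train the model and the last 10 test it):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

np.random.seed(42)  # fixed seed so the sketch is repeatable
X = np.random.randint(0, 2, size=(100, 2))         # 100 rows of random 0/1 inputs
y = np.where((X[:,0] == 1) & (X[:,1] == 1), 1, 0)  # AND: 1 only if both inputs are 1

X_tr, y_tr = X[:90], y[:90]    # train on the first 90 entries
X_tst, y_tst = X[90:], y[90:]  # test on the last 10 entries

myModel = LogisticRegression()
myModel.fit(X_tr, y_tr)
myPreds = myModel.predict(X_tst)
print("Accuracy:", accuracy_score(y_tst, myPreds))
```

Since AND is linearly separable, logistic regression should classify the test rows correctly or nearly so.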
