Linear/Multiple Linear Regression Sklearn/Python

Python Code

The following code illustrates 2D and 3D Linear Regression using Python and Sklearn's LinearRegression. Some fun visualization options are also included.

REMEMBER – this is my code, and it runs on my machine with my current version of Python. Use this code as a reference. If you just copy/paste it, it may not run. You will also need to update any paths in this code to match the paths on your computer.

There are TWO Code Examples Here.

Scroll Down to see the second one.


Code Example 1 for 2D Data

# -*- coding: utf-8 -*-
"""
@author: profa
"""

######################---------------------------------------
##
## Linear Regression
##
## Gates
## NOTICE: Please read the comments as they will explain
## steps and will sometimes offer links.
######################---------------------------------------
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
#https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
from sklearn.model_selection import train_test_split
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
import matplotlib.pyplot as plt
#https://matplotlib.org/3.5.3/api/_as_gen/matplotlib.pyplot.html
##
## LINK TO THE DATASET
## https://drive.google.com/file/d/1FEZ5tsTRBZsakBqXPB0c4TZCR17CNDsf/view?usp=sharing
##
## -- Read in and print the dataset. 
## If you are working with a large dataset, just print the first 10 - 15 rows.

## !! ATTENTION - Remember - this is my path. 
## You will need to update this to your path
filepath="/Datasets/Small_Pretend_Student_Grade_Dataset_Quant_noLabels_2D.csv"

MyDataSet=pd.read_csv(filepath)
print(MyDataSet)
## Because this is 2D data, let's visualize the entire dataset
## This will help us to confirm that the data has a linear relationship
MyDataSet.plot(kind="scatter", x="TestScore", y="HoursStudy")
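## Optional extra (a sketch, not required): a quick numeric check of the relationship.
## DataFrame.corr() returns the pairwise (Pearson) correlations between the columns.
print(MyDataSet.corr())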

## Next - because we are performing Linear Regression
## using a 2D dataset, the first column will be 
## our independent variable "X" and the second column will be
## our dependent variable "Y"  (what we want to predict.)
##-------------------------------------------
## Break the dataset into X and Y first
## You can do this using the name of the column or the location of the column. 
X = MyDataSet[["TestScore"]]
Y = MyDataSet[["HoursStudy"]]
## !! Using the [[]] will create the *shape* we need
## Want to learn more: https://medium.com/geekculture/sklearn-expects-data-to-be-in-shape-64fbcaf80a8c
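## Optional illustration (a sketch): single brackets return a 1D Series, while
## double brackets return a 2D DataFrame - the shape Sklearn expects.
print(MyDataSet["TestScore"].shape)    ## e.g. (n,)   - 1D Series
print(MyDataSet[["TestScore"]].shape)  ## e.g. (n, 1) - 2D DataFrame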
print(X)
print(type(X))
print(X.shape)
print(Y)
## Let's also save the max and min values of X and Y (the X values are used later when plotting the regression line)
x_max=X.max()
print(x_max)
x_min=X.min()

y_max=Y.max()
print(y_max)
y_min=Y.min()
##--------------------------------------------
## At this time, we have X and Y which are
## the two entire columns of our data.
## However, if we want to TRAIN the model
## and then TEST it - we need to split X 
## and Y into training and testing data.
##-----------------------------------------------
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size =0.33)
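## Optional (a sketch): train_test_split also accepts random_state if you want the
## same split every time you run the code, e.g.:
#x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size =0.33, random_state=1)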
print(x_train)
print(y_train)
print(type(x_train))
print(x_train.shape)

##----------------------------------------------------------
## Run Linear Regression. The goal here is to BUILD
## a model. In this case, the model will be a linear equation
## so we need a value for the slope and for the y-intercept.
##-------------------------------------------------------------
##Instantiate first (this is always true when using Sklearn)
MyLR = LinearRegression()
##Fit the model to X and Y of the Training data. 
MyLR.fit(x_train, y_train)
print(MyLR.coef_) ## This is the slope
print(MyLR.intercept_) ## This is the y-intercept
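## Optional (a sketch): print the fitted model as an equation, like we do in Example 2.
## Here coef_ has shape (1, 1) and intercept_ has shape (1,), so we index into them.
print("Our MODEL is: y = %3.4fx + %3.4f" % (MyLR.coef_[0, 0], MyLR.intercept_[0]))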

plt.scatter(x_train, y_train, label="Training Data", color="r", alpha=.6)
plt.scatter(x_test, y_test, label="Testing Data", color="g", alpha=.6)
plt.title("Plot of Training Data (red) and Testing Data (green)")
plt.show()

## Here, let's use our trained model to predict
## We will give it our independent variable testing data
## It will predict the dependent variable results.
MyPrediction = MyLR.predict(x_test)
## The above will predict the study hours
print(MyPrediction)
## Here, we can also look at the actual Testing data Y values 
print(y_test)
## If our model is doing well, the predictions and the actual values 
## will be similar.
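## Optional (a sketch): one way to compare them side by side.
## (MyComparison is just an illustrative variable name.)
MyComparison = pd.DataFrame({"Actual": y_test.values.ravel(),
                             "Predicted": MyPrediction.ravel()})
print(MyComparison)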

## How accurate is the model? score() returns R^2 (here, on the training data)
print(MyLR.score(x_train,y_train))
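## score() can also be used on the held-out test data (a sketch):
print(MyLR.score(x_test, y_test))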

## Visualization of the Model - the linear equation over the data
x_vals = np.linspace(x_min, x_max, 100) ## The 100 gives us 100 evenly spaced x values between x_min and x_max

fig = plt.figure(figsize = (10, 5))
y_vals = MyLR.intercept_ + (MyLR.coef_ * x_vals)
#print(y_vals.shape)
#y_vals=y_vals.reshape(100,)
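## Optional alternative (a sketch): the fitted model can generate the same line values.
## Shown commented out so the plot below is unchanged.
#y_vals = MyLR.predict(np.asarray(x_vals).reshape(-1, 1))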

# Create the plot
plt.plot(x_vals, y_vals)
plt.scatter(x_train,y_train,label="Training Data", color="g", alpha=.6)
## Learn more about printing formats like %
## here: https://www.geeksforgeeks.org/python-output-formatting/
plt.title('Data and Regression Line\n$y=%3.7fx+%3.7f$'%(MyLR.coef_[0,0] , MyLR.intercept_[0]))
plt.show()

Code Example 2 for 3D Data

# -*- coding: utf-8 -*-
"""
@author: profa
"""

###############################################################################
##   Linear Regression in 3D with 2 independent variables
##
##   3D dataset
## The DATASET IS HERE:
## https://drive.google.com/file/d/1uRiMe4FskLp27SbPb5OgVPrrsUMFfPw5/view?usp=sharing
##
## For this example, the dataset has three quantitative variables:
## TestScore, HoursStudy, and GPA
##
## We will predict TestScore using both HoursStudy and GPA
## Therefore, for this example, TestScore is "Y" (the dependent variable)
## and HoursStudy is X1 and GPA is X2. 
##
## NOTE: It is up to the data scientist to determine which 
## variable they want to predict using the others.
##
######################################################################
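## In equation form, the model we will build below has the general shape:
##
##      TestScore = b0 + b1*HoursStudy + b2*GPA
##
## where b1 and b2 come from MyLR.coef_ and b0 from MyLR.intercept_ (see below).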


import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error 
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
#https://matplotlib.org/3.5.3/api/_as_gen/matplotlib.pyplot.html

## !! ATTENTION-Remember-this is my path. You will need to update this to your path
filepath="/Datasets/Small_Pretend_Student_Grade_Dataset_Quant_noLabels_3D.csv"
MyDataSet=pd.read_csv(filepath)
print(MyDataSet)
## Because this is 3D data, let's visualize the entire dataset
## This will help us to confirm that the data has a linear relationship
fig = plt.figure(figsize = (10, 7))
ax = plt.axes(projection ="3d")
ax.scatter3D(MyDataSet.iloc[:,0], MyDataSet.iloc[:,1], MyDataSet.iloc[:,2], color = "green")
plt.title("TestScore - HoursStudy - GPA plot")
plt.show()

## Next - because we are performing Linear Regression with 3 variables
## We will place all independent variables into X
## and the dependent variable (the one we want to predict) into Y.

Y = MyDataSet[["TestScore"]] ## Remember, WE can determine which variable we want to predict
## In this case, I have decided to predict TestScore using HoursStudy and GPA. 
X = MyDataSet.drop("TestScore", axis= 1) 
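## Equivalent alternative (a sketch), assuming the columns are named HoursStudy and GPA:
#X = MyDataSet[["HoursStudy", "GPA"]]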

print(X) ## Here X has 2 variables
## !!! IMPORTANT !!! - Because X contains 2 variables, we can call them X1 and X2
## In our case X1 is HoursStudy and X2 is GPA
## This will become important below after we run the LinearRegression 
## and create the equation of the best fit line - our model.
print(type(X))
print(X.shape)
print(Y)

##--------------------------------------------
## At this time, we have X and Y 
## However, if we want to TRAIN the model
## and then TEST it - we need to split X 
## and Y into training and testing data.
##-----------------------------------------------
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size =0.33)
print(x_train)
print(y_train)
print(type(x_train))
print(x_train.shape)

##----------------------------------------------------------
## Run Linear Regression. The goal here is to BUILD
## a model. In this case, the model will be a linear equation
## so we need values for the coefficients of X1 and X2 and for the y-intercept.
##-------------------------------------------------------------
##Instantiate first (this is always true when using Sklearn)
MyLR = LinearRegression()
##Fit the model to X and Y of the Training data. 
MyLR.fit(x_train, y_train)
print(MyLR.coef_) ## These are the coefficients of X1 and X2
print(MyLR.intercept_) ## This is the y-intercept
print("Our MODEL is:\n")
print("y=%3.2fx1 + %3.2fx2 + %3.2f"%(MyLR.coef_[0,0] , MyLR.coef_[0,1], MyLR.intercept_))

## Here, let's use our trained model to predict
## We will give it our independent variable testing data
## It will predict the dependent variable results.
MyPrediction = MyLR.predict(x_test)
## The above will predict the TestScore values
print(MyPrediction)
## Here, we can also look at the actual Testing data Y values 
print(y_test)
## If our model is doing well, the predictions and the actual values 
## will be similar.

## How accurate is the model? score() returns R^2 (here, on the training data)
print(MyLR.score(x_train,y_train))
# model evaluation 
print('mean_squared_error : ', mean_squared_error(y_test, MyPrediction)) 
print('mean_absolute_error : ', mean_absolute_error(y_test, MyPrediction)) 
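## Optional (a sketch): RMSE is in the same units as TestScore, which can be easier to read.
print('root_mean_squared_error : ', np.sqrt(mean_squared_error(y_test, MyPrediction)))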

## Visualize the results in 3D
# Create the plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Define the dependent variable (y) and the independent variables (x1, x2)
y = Y
print(y)
x1 = X.iloc[:,0]
print(x1)
x2 = X.iloc[:,1]
print(x2)

# Add the data points
ax.scatter(x1, x2, y, color = "green")

# Fit a plane using np.linalg.lstsq
A = np.vstack([x1, x2, np.ones_like(x1)]).T
plane_coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)

# Create a meshgrid for the plane
x_plane, y_plane = np.meshgrid(x1, x2)
z_plane = plane_coef[0] * x_plane + plane_coef[1] * y_plane + plane_coef[2]
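## Optional alternative (a sketch): build the plane from the Sklearn model instead.
## Note: the lstsq fit above uses ALL of the data, while MyLR was fit on the training
## data only, so the two planes can differ slightly. The coefficient order matches the
## column order of X (HoursStudy, then GPA). Shown commented out so the plot is unchanged.
#z_plane = MyLR.coef_[0, 0] * x_plane + MyLR.coef_[0, 1] * y_plane + MyLR.intercept_[0]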

# Add the regression plane
ax.plot_surface(x_plane, y_plane, z_plane, alpha=0.1)

# Add labels and title
ax.set_xlabel('Hours of Study')
ax.set_ylabel('GPA')
ax.set_zlabel('TestScore')
plt.title('Multiple Linear Regression')

# Show the plot
plt.show()