Logistic Regression Python/Sklearn – Gates Bolton Analytics

NOTICE: Folks - this is my code ;)
Whenever you use code that you did not write, it is important to:
1) Use it as a reference or resource
2) Do not just copy/paste and click play. It is likely not to work.
3) Review each line of code so that you can make NEEDED UPDATES
4) Read the comments in the code. This code is also a tutorial.

UPDATES MIGHT INCLUDE
1) Updating paths to paths on your computer. 
2) Each version of Python can be a little different. 
3) Make sure comments render as comments.

---------------------------------------------------------------------------

###################################
##
## Logistic Regression
## https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
##
## Gates
##
## The Dataset Used Here:
## https://drive.google.com/file/d/1ZGwtkFu3vPwqn_1-3cXI2oS29xhAiAdN/view?usp=sharing
##################################################################################################

## Import Libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns  ## This will be for the prettier confusion matrix vis
from sklearn.metrics import confusion_matrix
import numpy as np

## Read in the data

filename="Admission_StudentData_LogisticRegression_3D.csv"
MyData=pd.read_csv(filename)
print(MyData)

## We want to Train the model and then we want to Test the model
## To do this, we need to split up the data into:
    ## Training Data and Training Label
    ## Testing Data and Testing Label
## There are many ways to do this. 

TrainingData, TestingData = train_test_split(MyData, test_size=.3)
print(TrainingData)
print(TestingData)
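
## The split above is random, so the rows in each set change from run to run.
## A minimal sketch of a repeatable split follows. Assumptions: the DataFrame
## DemoData is made-up stand-in data (only its column names come from this
## tutorial), and the random_state value of 42 is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; the values are random and NOT the real dataset
rng = np.random.default_rng(0)
DemoData = pd.DataFrame({
    "GPA": rng.uniform(2.0, 4.0, size=100),
    "Q_GRE": rng.integers(130, 171, size=100),
    "Months_Int": rng.integers(0, 13, size=100),
    "Admission": np.repeat([0, 1], 50),
})

# random_state fixes the shuffle so the same rows land in each set every run.
# stratify keeps the share of 0 and 1 labels similar in both sets.
DemoTrain, DemoTest = train_test_split(
    DemoData, test_size=0.3, random_state=42, stratify=DemoData["Admission"]
)
print(DemoTrain.shape, DemoTest.shape)
print(DemoTest["Admission"].value_counts())
```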

## Next, remove and save the labels from the Training Data
TrainingLabels = TrainingData["Admission"]
## Drop the label from the TrainingData now that we have saved it
TrainingData=TrainingData.drop(["Admission"], axis=1)
## print everything to make sure it looks right
print("The Training Labels are:")
print(TrainingLabels)
print("The Training Data is:")
print(TrainingData)

##Now - repeat this for the Testing Data so that you 
## end up with Testing Data and Testing Labels
TestingLabels = TestingData["Admission"]
## Drop the label from the TestingData now that we have saved it
TestingData=TestingData.drop(["Admission"], axis=1)
## print everything to make sure it looks right
print("The Testing Labels are:")
print(TestingLabels)
print("The Testing Data is:")
print(TestingData)

##-------------------------
## Make sure you understand what we have.
## We now have Training Data and the Training Labels 
## We also have the Testing Data and the Testing Labels
##-------------------------------------------------------
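
## Another of the "many ways" to get these four pieces is to separate the
## label column first and let train_test_split return all four at once.
## This is only a sketch: DemoData is made-up stand-in data, and the names
## DemoFeatures, DemoLabels, X_train, X_test, y_train, y_test are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; the values are random and NOT the real dataset
rng = np.random.default_rng(1)
DemoData = pd.DataFrame({
    "GPA": rng.uniform(2.0, 4.0, size=100),
    "Q_GRE": rng.integers(130, 171, size=100),
    "Months_Int": rng.integers(0, 13, size=100),
    "Admission": rng.integers(0, 2, size=100),
})

# Separate the label column from the feature columns first
DemoLabels = DemoData["Admission"]
DemoFeatures = DemoData.drop(columns=["Admission"])

# Passing both objects returns train/test pieces for each, split on the same rows
X_train, X_test, y_train, y_test = train_test_split(
    DemoFeatures, DemoLabels, test_size=0.3, random_state=0
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```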

##########################################
## Perform Logistic Regression
##
## In Sklearn, there are many parameter options
## for Logistic Regression. 
## https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
## We will use the defaults.
###########################################################################

## Instantiate Logistic Regression (create your own copy)
MyLR = LogisticRegression()
##Perform Logistic Regression - using your copy - on the Training Data and Training Labels
My_LR_Model=MyLR.fit(TrainingData, TrainingLabels)
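
## With the defaults, features on very different scales (a GPA near 4, a GRE
## score near 160) can make the solver slow to converge, and scikit-learn may
## print a ConvergenceWarning. One possible remedy, sketched below on made-up
## data, is to standardize the features inside a pipeline. This is an optional
## variation, not what the tutorial code above does, and the coefficients it
## prints are in standardized units, so they are not comparable to unscaled ones.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; the values are random and NOT the real dataset
rng = np.random.default_rng(3)
DemoGPA = rng.uniform(2.0, 4.0, size=200)
DemoGRE = rng.integers(130, 171, size=200)
DemoMonths = rng.integers(0, 13, size=200)
DemoX = np.column_stack([DemoGPA, DemoGRE, DemoMonths])

# Made-up labeling rule, only so the model has something to learn
DemoY = (DemoGRE + 10 * DemoGPA + rng.normal(scale=5, size=200) > 180).astype(int)

# The scaler is fit on the training data and applied automatically before the model
DemoPipeline = make_pipeline(StandardScaler(), LogisticRegression())
DemoPipeline.fit(DemoX, DemoY)

print(DemoPipeline.score(DemoX, DemoY))
print(DemoPipeline.named_steps["logisticregression"].coef_)
```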

## Now that we created the model, we can do several things.
## 1) We can use the model to predict the Testing Data labels
##    and then we can compare the predictions to the actual labels.
##    We can use a Confusion Matrix to compare. 
## 2) We can get the parameters of the model we created so that
##    we can write down the actual model

## -----------------------------------
## Use the model to predict the Test data
## -------------------------------------------------
MyModelPredictions=My_LR_Model.predict(TestingData)
print(MyModelPredictions)

## Print the actual labels
print(TestingLabels)

## Create a standard Confusion Matrix to compare the actual and predicted labels
MyCM=confusion_matrix(TestingLabels, MyModelPredictions)
print(MyCM)
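
## In a scikit-learn confusion matrix, rows are the actual labels and columns
## are the predicted labels. For 0/1 labels the four cells can be unpacked and
## turned into accuracy, precision, and recall. The small label arrays below
## are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels for eight hypothetical students
DemoActual = np.array([0, 0, 0, 1, 1, 1, 1, 0])
DemoPredicted = np.array([0, 1, 0, 1, 1, 0, 1, 0])

DemoCM = confusion_matrix(DemoActual, DemoPredicted)
print(DemoCM)

# For binary labels, ravel() returns the cells in this order:
# true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = DemoCM.ravel()

Accuracy = (tp + tn) / DemoCM.sum()   # share of all predictions that were right
Precision = tp / (tp + fp)            # of those predicted 1, how many really are 1
Recall = tp / (tp + fn)               # of those that really are 1, how many were found
print(Accuracy, Precision, Recall)
```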

## Use Seaborn to create a pretty confusion matrix visualization
sns.heatmap(MyCM/np.sum(MyCM), annot=True, fmt='.2%', cmap='Blues')

##-------------------------------------------
## Print some properties of the model
##---------------------------------------------
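## We can print an accuracy score for the model.
## The score on the Training Data shows how well the model fits data it has already seen.
## The score on the Testing Data estimates how well it does on new data.
print(My_LR_Model.score(TrainingData, TrainingLabels))
print(My_LR_Model.score(TestingData, TestingLabels))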
## Recall that Logistic Regression calculates a value between
## 0 and 1 (a probability) for each prediction using the Sigmoid. 
## A threshold (cut-off) of .5 is used for the final
## prediction of 0 or 1. 
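## However, you have the option of seeing the original probabilities
## as they were before they were thresholded to 0 or 1.
## Each row has one column per class, in the order given by My_LR_Model.classes_
## (for 0/1 labels: the probability of 0, then the probability of 1).
print(My_LR_Model.predict_proba(TestingData))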
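
## To see the .5 threshold at work, the sketch below thresholds the
## probabilities by hand and compares the result with predict(). The data is
## made up, and the stricter cut-off of .8 is an arbitrary illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: three features and a 0/1 label that depends on the first two
rng = np.random.default_rng(2)
DemoX = rng.normal(size=(200, 3))
DemoY = (DemoX[:, 0] + 0.5 * DemoX[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

DemoModel = LogisticRegression().fit(DemoX, DemoY)

# Columns follow DemoModel.classes_, so column 1 is the probability of label 1
DemoProbs = DemoModel.predict_proba(DemoX)
ProbOfOne = DemoProbs[:, 1]

# Thresholding at .5 by hand should reproduce what predict() returns
ManualPredictions = (ProbOfOne > 0.5).astype(int)
print(np.array_equal(ManualPredictions, DemoModel.predict(DemoX)))

# A stricter cut-off labels fewer rows as 1
StricterPredictions = (ProbOfOne > 0.8).astype(int)
print(ManualPredictions.sum(), StricterPredictions.sum())
```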

##------------
## Printing the model
##-------------------------
## Recall the formula:
    ## y = w1x1 + w2x2 + ... + wnxn + b
## We can print the coefficients (the weights) of the model
print(My_LR_Model.coef_)
## and we can print the intercept (b) for the model
print(My_LR_Model.intercept_)
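
## As a check that coef_ and intercept_ really are the whole model, the sketch
## below rebuilds the probabilities by hand (weights times features, plus the
## intercept, then the Sigmoid) and compares them with predict_proba.
## The data and the Demo names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: three features and a 0/1 label
rng = np.random.default_rng(4)
DemoX = rng.normal(size=(150, 3))
DemoY = (DemoX[:, 0] - DemoX[:, 2] + rng.normal(scale=0.5, size=150) > 0).astype(int)

DemoModel = LogisticRegression().fit(DemoX, DemoY)

# coef_ has shape (1, number of features) for a 0/1 problem; intercept_ has shape (1,)
Weights = DemoModel.coef_[0]
b = DemoModel.intercept_[0]

# y = w1x1 + w2x2 + w3x3 + b for every row, then the Sigmoid
y = DemoX @ Weights + b
ManualProbs = 1 / (1 + np.exp(-y))

# The hand-built values should match the model's probability of label 1
print(np.allclose(ManualProbs, DemoModel.predict_proba(DemoX)[:, 1]))
```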

## For a coef_ output of:
    ## [[0.02394083 0.54278591 0.13992734]]
## and for an intercept output of [-82.43127784]
## Our actual model is (rounded):
    ## y = .024x1 + .543x2 + .140x3  - 82.43
## We can plug in our variable values here, calculate y
## and then apply the Sigmoid to get the final classification
## for any new data.

## For example, suppose we have a new student with 
## a GPA (x1) of 3.82, a Q_GRE (x2) of 166, and Months_Int (x3) of 5
## 
## Then we have:
    ## y = .024x1 + .543x2 + .140x3  - 82.43
    ## y = .024(3.82) + .543(166) + .140(5)  - 82.43
    ## y = 8.5 (approximately)
    
    ## Then, applying the Sigmoid:
        ## S(8.5) = 1/ (1 + e^(-8.5))  = .9998
        ## Which is above the .5 threshold, so the prediction is "1"
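
## The same hand calculation can be done in code. This sketch uses the
## unrounded coef_ and intercept_ values quoted above, so y comes out near
## 8.46 rather than the rounded 8.5, and the probability still rounds to .9998.
## The variable names are illustrative.

```python
import numpy as np

# Weights and intercept copied from the example output quoted above
Weights = np.array([0.02394083, 0.54278591, 0.13992734])
Intercept = -82.43127784

# New student: GPA (x1), Q_GRE (x2), Months_Int (x3)
NewStudent = np.array([3.82, 166, 5])

# y = w1x1 + w2x2 + w3x3 + b
y = NewStudent @ Weights + Intercept

# Sigmoid turns y into a probability between 0 and 1
Probability = 1 / (1 + np.exp(-y))

# Apply the .5 threshold to get the final 0 or 1
Prediction = int(Probability > 0.5)

print(round(y, 2), round(Probability, 4), Prediction)
```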