
Below, you will find one option for cleaning and preparing record-type data using Python and pandas.
Note that there are countless ways to do this; this is one.
If you are looking for more complicated examples, choose other ML topics, as each contains code that shows the steps for preparing and analyzing data.
Finally, always remember that preparing and "cleaning" data is unique to each dataset and each goal. Cleaning/preparing is an active process that you must think about, plan, and carry out at each step.
NOTE: If you are new to Python, it might help to review the Python tutorials before using the following code.
### UPDATED 1/20/26
### Data Exploration
### Data Cleaning
### Data Processing
###
### Gates
### This example uses the Kaggle Titanic Training dataset
import pandas as pd
import numpy as np
import statistics as stat
import seaborn as sns
import matplotlib.pyplot as plt
## Data *always* needs to be cleaned and prepared
## Create an account with Kaggle so you can login
## Source: https://www.kaggle.com/sudhirtk/titanic-train-dataset
## Once you download the data - save it as .csv
## Use a smart name. I used Titanic_Kaggle_Train_Data.csv
## Make sure you can see the dataset to the right under File explorer
#################################################
############# HERE IS THE DATASET
#################################################
## https://drive.google.com/file/d/1J4hQPngIfP5rez1ciFQL4mXvCrGKXk2t/view?usp=sharing
##### Bring the data in using pandas
filename="C:/Users/profa/Documents/Python Scripts/ANLY501/Titanic_Kaggle_Train_Data.csv"
#filename="Titanic_Kaggle_Train_Data.csv"
## Notice that the filename is stored in a variable rather than typed directly into read_csv()
TitanicDF = pd.read_csv(filename)
## Check to see if this worked
print(type(TitanicDF))
print(TitanicDF.head(15))
## I notice right away that Cabin has a lot of
## NaN values *and* does not contain information
## that I need. I also do not want to keep Ticket.
############ DROP a COLUMN
drop_these = ["Cabin", "Ticket"]
TitanicDF.drop(drop_these, inplace=True, axis=1)
## axis=1 drops columns; axis=0 drops rows
## Let's see what this did - you will see that we now have only 10 cols
print(TitanicDF.head(15))
## We can also use the column number
## The "Name" column is column 3, and column 0 is "PassengerId".
## Recall that Python starts counting at 0.
## Let's drop both of these columns as well
cols=[0,3]
print(TitanicDF.columns[cols])
TitanicDF.drop(TitanicDF.columns[cols], inplace=True, axis=1)
print(TitanicDF.head(15))
## Now we have 8 columns remaining
######### DEALING WITH MISSING VALUES #################
print(TitanicDF.dtypes)
## Missing values find their way into many places.
## A good first step is to explore - see how many are there
## Get a list of all the column names
ColumnNamesList = TitanicDF.columns.values
print(ColumnNamesList)
## Checking for missing values in a column by name
## HUGE !! - isna() only detects true NaN values.
## If missing entries have already been converted to text (for example,
## the literal string "nan" after a type conversion), isna() will NOT count them.
## You will need to clean in an order that makes sense
## and then update data types....
print(TitanicDF["Survived"].isna().sum())
## The above works because Survived is still a float
print(TitanicDF["Sex"].isna().sum())
## Print the number of missing values in all variables
for name in ColumnNamesList:
    total_nas = TitanicDF[name].isna().sum()
    print(name, ": ", total_nas)
###########################################################
## Remember - you can double check anything you code
## by creating a small dataset and seeing if it works
#### EXAMPLE
#Small=pd.read_csv("SmallExampleDataWithMissingValues.csv")
#print(Small)
## Results: this replaced NaN, NA, and blanks with NaN
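## For instance, a tiny in-memory check (this DataFrame is invented for
## illustration, standing in for the small CSV mentioned above) might look like:

```python
import pandas as pd
import numpy as np

## A tiny, made-up dataset with missing values
Small = pd.DataFrame({
    "Age":  [22, np.nan, 35],
    "City": ["Rome", None, "Oslo"],
})
print(Small)
print(Small.isna().sum())  ## one missing value per column
```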
## So - back to the missing data in our Titanic dataset
## Let's again look at one variable at a time
##################### AGE and EMBARKED ########################
## "Age" has 177 missing values! That's a lot
OriginalDataSetShape=TitanicDF.shape
print(OriginalDataSetShape)
## So I have 891 rows and 8 columns
## This means that 177 NA values is 177/891 or about 20%
## We have some options.
## First, we can remove all the rows with missing values.
## If we do this, we lose about 20% of the data - not great!
## We can replace the values with a mean or median.
## This is risky - especially if Age is critical to the
## analysis.
## Because this is a tutorial - I will offer two examples:
## 1) We will remove all rows for which "Age" is NaN
## 2) We will replace the NaN values under Embarked with the *mode*
### First - update the NaNs under Embarked to be the mode
print(TitanicDF.Embarked)
print(TitanicDF.Embarked.isna().sum())
GetMode=stat.mode(TitanicDF.Embarked)
print(GetMode)
## Pandas provides the fillna() function for replacing missing
## values with a specific value.
## TitanicDF.Embarked.fillna(GetMode, inplace=True) - deprecated in newer pandas
TitanicDF['Embarked'] = TitanicDF['Embarked'].fillna(GetMode)
## Count the NaN's again...
print(TitanicDF.isna().sum())
print(TitanicDF.Embarked)
## OK - so far so good. Now we need to worry about the 177 NaN
## values under the Age column.
## In some cases, one might choose to retain these rows and
## replace the missing values with the mean or median.
## However, Titanic data (and survival) is correlated to Age
## and we do not want to lose that!
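## As a sketch of that alternative (median imputation, which we are *not*
## using for Age here), on a made-up mini-DataFrame:

```python
import pandas as pd
import numpy as np

## Hypothetical ages, invented for illustration
df = pd.DataFrame({"Age": [22.0, np.nan, 35.0, 58.0, np.nan]})

## Fill missing ages with the median of the observed ages
median_age = df["Age"].median()          ## median of 22, 35, 58 -> 35.0
df["Age"] = df["Age"].fillna(median_age)
print(df["Age"].tolist())                ## [22.0, 35.0, 35.0, 58.0, 35.0]
```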
## HOW TO drop all rows with NaN
# TitanicDF.dropna(inplace=True)
## Check it now
# print(TitanicDF.isna().sum())
## The AGE for Titanic data is so critical,
## that we cannot keep any rows with missing
## age data
## Notice that the A in "Age" is capitalized
print(TitanicDF)
NumRowsBefore=len(TitanicDF)
print(NumRowsBefore)
TitanicDF.dropna(inplace=True, subset=['Age'], axis=0)
## axis = 0 will drop rows if a missing value is discovered
print(TitanicDF)
NumRowsAfter=len(TitanicDF)
print(NumRowsAfter)
PercentRowsRemoved = 1 - (NumRowsAfter / NumRowsBefore)
print(np.round(PercentRowsRemoved, 2))
## ALWAYS review, discuss, and illustrate BEFORE and AFTER
## cleaning
CurrentDataSetShape=TitanicDF.shape
print("Original Shape: ", OriginalDataSetShape, "\n")
print("Current Shape: ", CurrentDataSetShape, "\n")
## Correlation heatmap
plt.figure(figsize=(8, 6))
numeric_df = TitanicDF[['Age', 'Survived', 'Pclass', 'Fare', 'Parch']]
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Titanic Variables')
plt.show()
############## GET THE TYPES OF EACH COLUMN/VARIABLE
###############
### Changing Data Types ###############
########################################
print(TitanicDF.dtypes)
print(TitanicDF.isna().sum())
#############
## There is still more to do here!
#############
## The next steps include looking at each variable individually
## We will determine if it has the type we want, or if it needs
## to change.
### Survived has type float64
### However, Survived is really categorical (0 or 1), not a true numeric
### quantity, so it should not stay a float or integer type.
### We can change it to category to see how that works
## WARNING !!
## When you convert a column to a string (if you do),
## your real missing values (NaN) are converted into the literal
## text string "nan", which is not a missing value but rather just
## three letters of text. Pandas isna() no longer recognizes it as missing.
## BE CAREFUL to clean in an order that makes sense
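## Here is a quick demonstration of that pitfall on a throwaway Series
## (invented for illustration):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])
print(s.isna().sum())          ## 1 - the NaN is detected

s_str = s.astype(str)          ## the NaN becomes the text "nan"
print(s_str.isna().sum())      ## 0 - isna() no longer sees it
print((s_str == "nan").sum())  ## 1 - it is now just three letters of text
```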
print(TitanicDF.Survived.isna().sum()) ## Check how many are missing before converting
## Now change to category
TitanicDF["Survived"]= TitanicDF["Survived"].astype('category')
print(TitanicDF.head(15))
print(TitanicDF.dtypes)
##-----------------------------------------------------
### The "Survived" column is very important. Why?
##----------------------------------------------------------
## This column offers a *label* for this dataset
## Because this is labeled data, supervised methods can
## be applied to it later.
### !!!!!!!!!! IMPORTANT !!!!!!!!!!!! ##
## To use a label to train any ML methods, the label should
## be a "factor" / "category" data type. Not numeric or char, etc.
## So, let's confirm that "Survived" is now category
print(TitanicDF.dtypes)
## NOTES:
## Categoricals (category) are a pandas data type
## corresponding to categorical variables in statistics.
## A categorical variable takes on a limited, and usually fixed,
## number of possible values (categories are called "levels" in R)
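## As a small, self-contained illustration of the category dtype
## (the values here are made up):

```python
import pandas as pd

s = pd.Series(["S", "C", "S", "Q"], dtype="category")
print(s.dtype)                 ## category
print(list(s.cat.categories))  ## ['C', 'Q', 'S'] - the "levels"
print(s.cat.codes.tolist())    ## [2, 0, 2, 1] - integer codes behind the values
```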
## Now, we can move a bit faster....let's find all the variables
## that should be categories and change them all at once
TitanicDF["Sex"] = TitanicDF["Sex"].astype('category')
TitanicDF["Pclass"] = TitanicDF["Pclass"].astype('category')
TitanicDF["Embarked"] = TitanicDF["Embarked"].astype('category')
print(TitanicDF.dtypes)
### Next, I see that "Age" is a float64 or 64 bit decimal value
## This is fine. Age is a numeric value, as are SibSp and Fare
## So - the data types have now been successfully corrected.
################# INCORRECT VALUES ######################
## The steps above are often the easier part.
## Incorrect values are harder to find and harder to fix.
## As a first step, let's explore the data
#### Step 1 - Look at each variable and some basic stats and vis
print(TitanicDF.columns.values)
print(TitanicDF.describe())
plt.figure()
plot1=sns.boxplot(x="Age",data=TitanicDF)
plt.show()
plt.figure()
plot2=sns.swarmplot(x="Age", data=TitanicDF, color=".25")
plt.show()
## To get plots to pop up and not be inline (the reverse...)
## Tools --> Preferences --> IPython --> Graphics --> Automatic
## Look at the min for Age!
## It is .42 - outside the range we will allow - so there are some
## out-of-range values, and we will need to fix or remove those rows.
#################### OPTIONS
## Here, we can simply remove any row from
## the dataset for which the age does not fall
## within a range we choose - for example, dropping ages
## less than 1 or more than 100.
## There are no RULES!! It is up to you. For example,
## we could instead keep ages between 0 and 120.
## Sometimes, removing entire rows (or columns) is something
## we feel needs to be done.
## In other cases, we might want to *correct* incorrect or
## outlier values.
## SIMPLE EXAMPLE - FIXING the age of 220 rather than removing the row
## Here, we will see steps to correct the value of the age
## that is currently 220.
## One method is to simply replace that 220 age value with the
## mean, median, or mode age of the entire dataset.
## HOWEVER - this is not the best method. We can do better!
## Because age (in this case!) is slightly correlated with other variables
## we can use that correlation to correct the age to a closer guess.
## Here, younger people tended to survive more often.
## The Age of 220 that we are trying to fix is associated
## with a "Survived" value of 0 (so did not survive)
## Therefore, we can replace this age with the median of all the
## "0" for Survived ages.
## Here is how....
## First, get the median of all ages for those who did not
## survive (Survived = 0)
median_age_non_survivors = TitanicDF[TitanicDF['Survived'] == 0]['Age'].median()
print(median_age_non_survivors)
## Next, replace the incorrect age value with this median
TitanicDF['Age'] = TitanicDF['Age'].replace(220, median_age_non_survivors)
print(TitanicDF.describe())
## Replot
plt.figure()
plot7=sns.boxplot(x="Age",data=TitanicDF)
plt.show()
## Another option is to simply remove all rows that
## contain ages not in range.
## Example...
## Remove any ages below 1 and above 120
##
TitanicDF=TitanicDF[TitanicDF.Age > 1 ]
TitanicDF=TitanicDF[TitanicDF.Age < 120 ]
print(TitanicDF.head(15))
print(TitanicDF.describe())
print(TitanicDF.shape)
## Look at all of the other variables to see if there are any issues
## The Fare is interesting...the max is very large - let's
## use a boxplot to take a closer look
plt.figure()
plot3=sns.boxplot(x="Fare",data=TitanicDF)
plt.show()
#plt.cla()
## We have some outliers!
## Let's look at the Fares that are many standard deviations above the mean
## How many deviations counts as "many" is subjective - I am playing it safe
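## One way to make "many deviations" concrete is a mean-plus-k-standard-deviations
## cutoff (a sketch; the fares and the choice of k = 3 here are arbitrary, not a rule):

```python
import pandas as pd

## Hypothetical fares, invented for illustration
fares = pd.Series([10.0] * 10 + [512.0])

## Flag fares more than 3 standard deviations above the mean
cutoff = fares.mean() + 3 * fares.std()
print(cutoff)
print((fares > cutoff).sum())  ## count of flagged fares
```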
## There are three Fares that are above $400 - let's remove these rows
TitanicDF=TitanicDF[TitanicDF.Fare < 400]
print(TitanicDF.describe())
## This is good. We only removed three rows and the new max is
## almost half the size it was before
## Our SibSp looks OK
## Let's make some plots for the non-numeric variables....
print(TitanicDF.head(15))
plt.figure()
sns.countplot(x="Survived",data=TitanicDF)
plt.show()
#plt.cla()
plt.figure()
sns.countplot(x="Pclass",data=TitanicDF)
plt.show()
#plt.cla()
plt.figure()
sns.countplot(x="Sex",data=TitanicDF)
plt.show()
#plt.cla()
### Write to csv
TitanicDF.to_csv("TempCleanTitanic.csv",index = False)
## Important !! The above are just examples/tools/options
## Each dataset you clean is unique and should be treated as such
## Always THINK about what you want to do when you are cleaning
## and check as you go.