Decision Trees in R

ATTENTION:

This page contains code examples in R for decision trees: one for mixed record data, one for text data, and a final one that works through record data and text data together.

SCROLL DOWN to see the text data example.

The code also includes examples of visualization options for both record and text data.


Example 1: Decision Trees in R using labeled record data – and visualizations

################################
##
##  Decision Tree Tutorial
##
##  Gates
##
##  Dataset
##  https://drive.google.com/file/d/1wxcPBcYHUFh7gPSC0ApPSnQZ6VRUe-C3/view?usp=sharing
##
##
#######################################

## LIBRARIES
library(rpart)   ## FOR Decision Trees
library(rattle)  ## FOR Decision Tree Vis
library(rpart.plot)
library(RColorBrewer)
library(Cairo)
library(network)
library(ggplot2)
##If you install from the source....
#Sys.setenv(NOAWT=TRUE)
## ONCE: install.packages("wordcloud")
library(wordcloud)
## ONCE: install.packages("tm")

library(slam)
library(quanteda)
## ONCE: install.packages("quanteda")
## Note - this includes SnowballC
#library(SnowballC)

library(proxy)
## ONCE: if needed:  install.packages("stringr")
library(stringr)
## ONCE: install.packages("textmineR")
library(textmineR)
library(igraph)
library(caret)
#library(lsa)


###############
## Read in the dataset you want to work with....
#################################
## This is my path - not yours :) Update this for your path
MyPath="C:/Users/profa/Documents/RStudioFolder_1/DrGExamples/ANLY501/DATA/"
setwd(MyPath)

RecordDatasetName="Labeled_ThreeClasses_Mammals_Fish_Reptiles3.csv"
RecordDF_A<-read.csv(RecordDatasetName, stringsAsFactors=TRUE)
head(RecordDF_A)




##########################################
##
##  Let's start with the record dataset
##
##  We need to split it into a TRAINING
##  and a TESTING set  - AND - 
##  we will remove the label and save it.
##  Finally, we will remove and save the 
##  Name of the animal.
#########################################

## While we do this - let's check data types

str(RecordDF_A)
#RecordDF_A$label<-as.factor(RecordDF_A$label)

## We MUST convert the label (called label) into type FACTOR!
## If you do not do this, your modeling will not work as well 
## (or at all)
## I did this above using stringsAsFactors=TRUE


#####################
## Our data is already clean and it is MIXED
## data. I will not normalize it.
######
## What is mixed data? It is data made up of many data types
##
#####################################################

## However, let's explore just a little bit to look for 
## BALANCE in the variables AND in the label
##
##  !!!!!! We will also need to remove the "name" column
##   Why??

########################################################

## Simple tables

apply(RecordDF_A, 2, table)  # 2 means columns

## NOTE: Our data and label are balanced pretty well.

## Think about what you see. Are there columns to remove 
## from the data?
## 
##################################################
## Here is a fancy method to use a function to
## create a bar graph for each variable. 
#####################################################
## Define the function: x is the NAME of a column in RecordDF_A
GoPlot <- function(x) {
  
  G <-ggplot(data=RecordDF_A, aes(.data[[x]], y="") ) +
  geom_bar(stat="identity", aes(fill =.data[[x]])) 
  
  return(G)
}

## Use the function in lapply
lapply(names(RecordDF_A), function(x) GoPlot(x))


##################################################
##
##  Next - let's look at the DF and remove
## columns we do not want to use in our model
##
###################################################

#RecordDF_A

## We MUST remove the name column or the model will 
## not be good. Think about why. 

(AnimalName <- RecordDF_A$name)
(RecordDF_A<-RecordDF_A[-c(1)])

############################################
## 
## Next - split into TRAIN and TEST data
##
## !!!! Sampling Matters !!!
##
## In our case, we will use random sampling without
## replacement.
##
## Why without replacement?
##
## !!!! IMPORTANT - always clean, prepare, etc. BEFORE
## splitting data into train and test. NEVER after.
##
######################################################
(DataSize=nrow(RecordDF_A)) ## how many rows?
(TrainingSet_Size<-floor(DataSize*(3/4))) ## Size for training set
(TestSet_Size <- DataSize - TrainingSet_Size) ## Size for testing set

## Random sample WITHOUT replacement (why?)
## set a seed if you want it to be the same each time you
## run the code. The number (like 1234) does not matter
set.seed(1234)

## This is the sample of row numbers
(MyTrainSample <- sample(nrow(RecordDF_A),
                                    TrainingSet_Size,replace=FALSE))

## Use the sample of row numbers to grab those rows only from
## the dataframe....
(MyTrainingSET <- RecordDF_A[MyTrainSample,])
table(MyTrainingSET$label)

## Use the negation of those row numbers (the - sign) to get
## the rows that were NOT sampled for training. Those rows
## become the test set.

## Training and Testing datasets MUST be disjoint. Why?
(MyTestSET <- RecordDF_A[-MyTrainSample,])
table(MyTestSET$label)
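
## ------------------------------------------------------------
## SKETCH (an addition, not part of the original tutorial code):
## one way to check that a train/test split is disjoint and
## complete. The toy dataframe and every Sketch* name below are
## made up for illustration; swap in your own dataframe.
## ------------------------------------------------------------
SketchDF <- data.frame(id = 1:20, label = rep(c("a", "b"), each = 10))
SketchTrainRows <- sample(nrow(SketchDF), floor(nrow(SketchDF) * 3/4),
                          replace = FALSE)
SketchTrain <- SketchDF[SketchTrainRows, ]
SketchTest  <- SketchDF[-SketchTrainRows, ]
## No row id may appear in both sets (disjoint)...
length(intersect(SketchTrain$id, SketchTest$id)) == 0
## ...and the two sets together must account for every row
nrow(SketchTrain) + nrow(SketchTest) == nrow(SketchDF)
## With replace=TRUE the same row could be drawn more than once,
## so the training set would contain duplicate rows.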

##Make sure your Training and Testing datasets are BALANCED
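
## ------------------------------------------------------------
## SKETCH (an addition, not part of the original tutorial code):
## if a plain random sample leaves the labels unbalanced, one
## option is a STRATIFIED split: sample 3/4 of the rows within
## each label separately. This is a base R sketch on made-up
## data; the Sketch* names and class sizes are assumptions.
## ------------------------------------------------------------
SketchStratDF <- data.frame(
  x = 1:36,
  label = factor(rep(c("fish", "mammal", "reptile"), times = c(16, 12, 8)))
)
## Row numbers grouped by label
SketchRowsByLabel <- split(seq_len(nrow(SketchStratDF)), SketchStratDF$label)
## Within each label, draw 3/4 of that label's rows without replacement
SketchStratRows <- unlist(lapply(SketchRowsByLabel, function(rows) {
  rows[sample.int(length(rows), floor(length(rows) * 3/4))]
}))
SketchStratTrain <- SketchStratDF[SketchStratRows, ]
SketchStratTest  <- SketchStratDF[-SketchStratRows, ]
## Each label keeps the same 3:1 train/test ratio
table(SketchStratTrain$label)
table(SketchStratTest$label)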

###########
## NEXT - 
## REMOVE THE LABELS from the test set!!! - and keep them
################################################

(TestKnownLabels <- MyTestSET$label)
(MyTestSET <- MyTestSET[ , -which(names(MyTestSET) %in% c("label"))])


###################################################
##
##     Decision Trees
##
##      First - train the model with your training data
##
##      Second - test the model - get predictions - compare
##               to the known labels you have.
###########################################################
MyTrainingSET
str(MyTrainingSET)

## This code uses rpart to create a decision tree.
## Here, the ~ .  means to train using all data variables.
## The MyTrainingSET$label tells it what the label is called.
## In this dataset, the label is called "label".

DT <- rpart(MyTrainingSET$label ~ ., data = MyTrainingSET, method="class")
summary(DT)

## Let's make another tree...here we will use cp
DT2<-rpart(MyTrainingSET$label ~ ., data = MyTrainingSET,cp=.27, method="class")
## The smaller the cp, the larger the tree. If cp is too small, the tree overfits.
summary(DT2)

plotcp(DT) ## This is the cp plot
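
## ------------------------------------------------------------
## SKETCH (an addition, not part of the original tutorial code):
## the cp plot is drawn from the tree's cptable. One common way
## to use it is to grow a large tree, pick the cp with the lowest
## cross-validated error (xerror), and prune back to that cp.
## The built-in iris data stands in for the animal dataset here,
## and the Sketch* names are made up for illustration.
## ------------------------------------------------------------
library(rpart)
set.seed(1234)  ## xerror depends on random cross-validation folds
SketchFullTree <- rpart(Species ~ ., data = iris, method = "class",
                        cp = 0, minsplit = 2)
printcp(SketchFullTree)  ## the table behind plotcp()
SketchCPTable <- SketchFullTree$cptable
## cp value on the row with the smallest cross-validated error
SketchBestCP <- SketchCPTable[which.min(SketchCPTable[, "xerror"]), "CP"]
SketchPruned <- prune(SketchFullTree, cp = SketchBestCP)
## Count the leaves before and after pruning
sum(SketchFullTree$frame$var == "<leaf>")
sum(SketchPruned$frame$var == "<leaf>")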

## Let's make a third tree - here we use cp = 0 and 
## "information" instead of the default which is GINI
DT3<-rpart(MyTrainingSET$label ~ ., 
           data = MyTrainingSET,cp=0, method="class",
           parms = list(split="information"),minsplit=2)
## The smaller the cp, the larger the tree. If cp is too small, the tree overfits.
summary(DT3)


## Let's make a 4th tree - but here, we will only use SOME
## of the variables in the dataset to train the model
DT4<-rpart(MyTrainingSET$label ~ feathers + livebirth + blood, 
           data = MyTrainingSET,cp=0, method="class",
           parms = list(split="information"),minsplit=2)
## The smaller the cp, the larger the tree. If cp is too small, the tree overfits.
summary(DT4)

#################################################################
## Extra notes about the output/summary
## - (Root node error) x (xerror) is the cross-validated error rate,
##   which is a more objective measure of predictive accuracy.
## - (Root node error) x (rel error) is the resubstitution
##   error rate (the error rate computed on the training sample).

## Variable Importance: The values are calculated by summing up all the
## improvement measures that each variable contributes,
## i.e., the sum of the goodness-of-split measures for each split
## for which it was the primary variable.

## In the summary, the variable importance values sum to 100.

## NOTE: variable.importance is a named numeric vector giving 
## the importance of each variable. (Only present
## if there are any splits.) 
## When printed by summary.rpart these are rescaled to
## add to 100.
###########################################################
DT3$variable.importance  ## raw values, before rescaling to sum to 100
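
## ------------------------------------------------------------
## SKETCH (an addition, not part of the original tutorial code):
## rescaling raw variable importance so it sums to 100, which
## mirrors the rescaling described in the notes above. The
## built-in iris data and the Sketch* names are stand-ins chosen
## for illustration.
## ------------------------------------------------------------
library(rpart)
SketchTree <- rpart(Species ~ ., data = iris, method = "class")
SketchVI <- SketchTree$variable.importance        ## raw values
SketchVIPercent <- 100 * SketchVI / sum(SketchVI) ## shares of the total
round(SketchVIPercent)  ## rounded shares may add to 99 or 101
sum(SketchVIPercent)    ## the unrounded shares add to 100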

############################################################
##
## Predict the Testset using all 4 trees - 
## Let's see what we get. 
## We will plot a tree and build a confusion matrix for all 4
##
###############################################################
## 
## DT---------------------------------
(DT_Prediction= predict(DT, MyTestSET, type="class"))
## Confusion Matrix
table(DT_Prediction,TestKnownLabels) ## one way to make a confusion matrix
## VIS..................
fancyRpartPlot(DT)
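
## ------------------------------------------------------------
## SKETCH (an addition, not part of the original tutorial code):
## reading accuracy off a confusion matrix made with table().
## The predictions and known labels below are made up; both
## factors must share the same levels so the diagonal lines up.
## ------------------------------------------------------------
SketchLevels <- c("fish", "mammal", "reptile")
SketchPred  <- factor(c("fish", "fish", "mammal", "reptile", "mammal", "fish"),
                      levels = SketchLevels)
SketchTruth <- factor(c("fish", "mammal", "mammal", "reptile", "mammal", "fish"),
                      levels = SketchLevels)
SketchCM <- table(Predicted = SketchPred, Actual = SketchTruth)
SketchCM
## Overall accuracy: correct predictions (the diagonal) over all predictions
SketchAccuracy <- sum(diag(SketchCM)) / sum(SketchCM)
SketchAccuracy  ## 5 of the 6 toy predictions are correct
## Per-class recall: correct predictions over the actual count of each class
SketchRecall <- diag(SketchCM) / colSums(SketchCM)
SketchRecall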

## DT2-----------------------------
### Example two with cp - a lower cp value is a bigger tree
(DT_Prediction2= predict(DT2, MyTestSET, type = "class"))
## Another way to make a confusion matrix
## (the positive argument only applies to two-class labels, so it is omitted)
caret::confusionMatrix(DT_Prediction2, TestKnownLabels)
fancyRpartPlot(DT2)
## Example three with information gain and lower cp

##DT3---------------------------------------------------------
(DT_Prediction3= predict(DT3, MyTestSET, type = "class"))
confusionMatrix(DT_Prediction3, TestKnownLabels)
rattle::fancyRpartPlot(DT3,main="Decision Tree", cex=.5)


##DT4---------------------------------------------------------
(DT_Prediction4= predict(DT4, MyTestSET, type = "class"))
confusionMatrix(DT_Prediction4, TestKnownLabels)
rattle::fancyRpartPlot(DT4,main="Decision Tree", cex=.5)


### NOTES:
## In this code, we only created one training and one testing set.
##
## If we instead split the data into, say, 5 disjoint parts (folds)
## and repeat the above process 5 times, each time testing on a
## different fold and training on the other 4,
## it is called 5-fold cross validation.
############################################################
## READ MORE about cross-validation:
## 1) http://inferate.blogspot.com/2015/05/k-fold-cross-validation-with-decision.html
## 2) This one is mathy and complicated for those who are interested
##    https://rafalab.github.io/dsbook/cross-validation.html
## 3) I like this one...
##    https://www.kaggle.com/satishgunjal/tutorial-k-fold-cross-validation
##
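
## ------------------------------------------------------------
## SKETCH (an addition, not part of the original tutorial code):
## a minimal 5-fold cross validation loop for an rpart tree.
## The built-in iris data stands in for the animal dataset, and
## the Sketch* names are made up for illustration.
## ------------------------------------------------------------
library(rpart)
set.seed(1234)
SketchK <- 5
## Assign every row to exactly one of the K folds, in random order
SketchFolds <- sample(rep(1:SketchK, length.out = nrow(iris)))
SketchFoldAcc <- sapply(1:SketchK, function(k) {
  TrainFold <- iris[SketchFolds != k, ]  ## train on the other K-1 folds
  TestFold  <- iris[SketchFolds == k, ]  ## test on the held-out fold
  FoldTree  <- rpart(Species ~ ., data = TrainFold, method = "class")
  FoldPred  <- predict(FoldTree, TestFold, type = "class")
  mean(FoldPred == TestFold$Species)     ## accuracy on this fold
})
SketchFoldAcc        ## one accuracy per fold
mean(SketchFoldAcc)  ## the cross-validated accuracy estimate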

Example 2: Decision Trees for text data in R and visualizations

################################################
## 
##  Basic R Decision Trees Tutorial 
##  numeric text data via corpus
##
##  Gates
################################################
## The text corpus used is here:
## https://drive.google.com/drive/folders/1xtMjNKb1-W9rh67_ZtESwxaArCuUbApd?usp=sharing
##
################################################

library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(Cairo)
library(network)
##If you install from the source....
#Sys.setenv(NOAWT=TRUE)
## Do ONCE: install.packages("wordcloud")
library(wordcloud)
## Do ONCE: install.packages("tm")
library(tm)
# Do ONCE: install.packages("Snowball")
## NOTE Snowball is not yet available for R v 3.5.x
## So I cannot use it  - yet...
##library("Snowball")
##set working directory
## ONCE: install.packages("slam")
library(slam)
library(quanteda)
## ONCE: install.packages("quanteda")
## Note - this includes SnowballC
library(SnowballC)
library(arules)
##ONCE: install.packages('proxy')
library(proxy)
## ONCE: if needed:  install.packages("stringr")
library(stringr)
## ONCE: install.packages("textmineR")
library(textmineR)
library(igraph)
library(lsa)


## This is my path - update to your path
TextCorpusPath="C:/Users/profa/Documents/Python Scripts/TextMining/DATA/Sent_Corpus"

#######################################
##  Read in and prepare the data.
######################################################

##-------------------------------------------------------
## Read in and prep the text data from the corpus
(NCorpus <- Corpus(DirSource(TextCorpusPath)))
## Here, I am keeping a list of the filenames so I can associate them with the DTM
(FileNames <- list.files(TextCorpusPath, pattern=NULL))
NCorpus_DTM <- DocumentTermMatrix(NCorpus,
                                  control = list(
                                    stopwords = TRUE, ## remove normal stopwords
                                    wordLengths=c(3, 10), ## get rid of words of len 2 or smaller or larger than 10
                                    removePunctuation = TRUE,
                                    removeNumbers = TRUE,
                                    tolower=TRUE
                                    #stemming = TRUE,
                                  ))

#inspect(NCorpus_DTM)
## Convert to DF
NCorpusDF <- as.data.frame(as.matrix(NCorpus_DTM))
#head(NCorpusDF)
NCorpus_MAT <- as.matrix(NCorpus_DTM)
(NCorpus_MAT[1:12,1:10])
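
## ------------------------------------------------------------
## SKETCH (an addition, not part of the original tutorial code):
## the same control options applied to two made-up sentences, so
## the effect of each option is easy to see without the corpus
## folder. The sentences and Sketch* names are assumptions.
## ------------------------------------------------------------
library(tm)
SketchDocs <- c("The movie was unbelievably wonderful!!",
                "I watched 2 movies; the plot was dull.")
SketchCorpus <- VCorpus(VectorSource(SketchDocs))
SketchDTM <- DocumentTermMatrix(SketchCorpus,
                                control = list(
                                  stopwords = TRUE,
                                  wordLengths = c(3, 10),
                                  removePunctuation = TRUE,
                                  removeNumbers = TRUE,
                                  tolower = TRUE
                                ))
SketchTermList <- colnames(as.matrix(SketchDTM))
SketchTermList                ## the terms that survived the cleaning
range(nchar(SketchTermList))  ## every kept term has 3 to 10 characters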

## WORDCLOUD ##_---------------------------------------
## Column sums of the DTM matrix give the total count of each word
word.freq <- sort(colSums(NCorpus_MAT), decreasing = TRUE)
wordcloud(words = names(word.freq), freq = word.freq*2, min.freq = 2,
          random.order = FALSE, max.words=30)

## Place the LABELS onto the DF---------------------------------
## Recall that the filenames must be converted and added as labels
FileNames
(Temp1<-strsplit(FileNames,"_"))
#Temp1[[5]][1]
MyList<-list()

for(i in 1:length(FileNames)){
  MyList[[i]] <- Temp1[[i]][1]
  
}

MyList
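
## ------------------------------------------------------------
## SKETCH (an addition, not part of the original tutorial code):
## the loop above can also be written without a loop. The file
## names below are hypothetical; they only assume the same
## "label_rest-of-name" pattern that the loop relies on.
## ------------------------------------------------------------
SketchFileNames <- c("neg_review01.txt", "pos_review02.txt", "neg_review03.txt")
## Option 1: take the first piece of each split file name
SketchLabels <- sapply(strsplit(SketchFileNames, "_"), `[`, 1)
## Option 2: delete everything from the first underscore onward
SketchLabels2 <- sub("_.*$", "", SketchFileNames)
SketchLabels
identical(SketchLabels, SketchLabels2)  ## both options give the same labels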

## Add MyList to the dataframe
(NCorpusDF$LABEL <- unlist(MyList))
(NumCol<-ncol(NCorpusDF))  ## get the num columns
## Have a look to check for the label - which is at the end
NCorpusDF[1:5, (NumCol-5):NumCol] 

#####################################################
## We have labeled text data
## as a dataframe
##
######################################################
NCorpusDF[1:3, ((NumCol/2)-5):(NumCol/2)]   #label here is called LABEL

##################### CRITICAL ###################
## Make sure that your labels are
## type FACTOR!!
##################################################
str(NCorpusDF$LABEL)
## It was not!!
## Fix it
NCorpusDF$LABEL <- as.factor(NCorpusDF$LABEL)
## check it
str(NCorpusDF$LABEL)


########################################################
##
##       Decision Trees, etc..
##
#########################################################
##
## At this point, we have created a dataframe - 
## from text data - 
## it has labels. 
## Now we can perform DT.......................
########################################################
## In R - unlike Python - you only need to remove the label
## from the TESTING data but not from the TRAINING data

## There are many ways to create Training and Testing data

##################### CREATE Train/Test for Text data
##--------------------------------------------------------

(every7_indexes<-seq(1,nrow(NCorpusDF),7))
##  [1]   1   8  15  22  29  36  43  50  ...
NCorpusDF_Test<-NCorpusDF[every7_indexes, ]
NCorpusDF_Train<-NCorpusDF[-every7_indexes, ]
NCorpusDF_Train[1:5, (NumCol-5):NumCol] 
NCorpusDF_Test[1:5, (NumCol-5):NumCol] 

#############
## REMOVE the labels from the test data
############

## AFTER
## VERY IMPORTANT - this works when the LABEL is AT THE END...
## Otherwise it must be updated.....
## -------------------------------
(NCorpusDF_TestLabels<-NCorpusDF_Test[NumCol] )
NCorpusDF_Test<-NCorpusDF_Test[-c(NumCol)] ##
(NewNumCol<-ncol(NCorpusDF_Test))
NCorpusDF_Test[1:5, (NewNumCol-5):NewNumCol] 



##############################################################
##
##            Decision Trees 
##
##############################################################
## Remember the NAMES of the dataframes we just created:

## NCorpusDF_Train
## NCorpusDF_Test
## NCorpusDF_TestLabels

###################################################################

## Text DT
fitText <- rpart(NCorpusDF_Train$LABEL ~ . -night - movie -plot - people, data = NCorpusDF_Train, method="class")
summary(fitText)
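
## ------------------------------------------------------------
## SKETCH (an addition, not part of the original tutorial code):
## what "~ . - word" means in a formula. The dot expands to every
## column except the label, and each "- word" then drops that
## column from the predictors. The tiny dataframe is made up.
## ------------------------------------------------------------
SketchTermsDF <- data.frame(
  LABEL = factor(c("neg", "pos", "neg", "pos")),
  good  = c(0, 2, 0, 1),
  bad   = c(3, 0, 1, 0),
  movie = c(1, 1, 2, 1),
  plot  = c(1, 0, 1, 1)
)
SketchFormula <- LABEL ~ . - movie - plot
## List the predictors the model would actually use
SketchPredictors <- attr(terms(SketchFormula, data = SketchTermsDF), "term.labels")
SketchPredictors  ## only good and bad remain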

##########################
## Predict the Test sets
#########################

## Text--------------------------------------
(predictedText= predict(fitText, NCorpusDF_Test, type="class"))
## Confusion Matrix
#str(NCorpusDF_TestLabels)
#str(predictedText)
table(unlist(predictedText) ,unlist(NCorpusDF_TestLabels))
## VIS..................
fancyRpartPlot(fitText)


## Save the Decision Tree as a jpg image
jpeg("DecisionTree.jpg")
fancyRpartPlot(fitText)
dev.off()

########################### Information Gain with Entropy ----------------------------
#library(CORElearn)
##install.packages("CORElearn")

############# Text.............
#Method.CORElearn <- CORElearn::attrEval(NCorpusDF_Train$LABEL ~ ., data=NCorpusDF_Train,  estimator = "InfGain")
#head( sort(Method.CORElearn, decreasing = TRUE), 10)
#Method.CORElearn2 <- CORElearn::attrEval(NCorpusDF_Train$LABEL ~ ., data=NCorpusDF_Train,  estimator = "Gini")
#head( sort(Method.CORElearn2, decreasing = TRUE), 10)
#Method.CORElearn3 <- CORElearn::attrEval(NCorpusDF_Train$LABEL ~ ., data=NCorpusDF_Train,  estimator = "GainRatio")
#head( sort(Method.CORElearn3, decreasing = TRUE), 10)
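
## ------------------------------------------------------------
## SKETCH (an addition, not part of the original tutorial code):
## entropy and information gain computed by hand in base R, to
## show the quantity that split="information" is based on. This
## is NOT how rpart or CORElearn are implemented internally; the
## function names and toy vectors are made up for illustration.
## ------------------------------------------------------------
SketchEntropy <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)  ## class proportions
  p <- p[p > 0]                                    ## treat 0 * log2(0) as 0
  -sum(p * log2(p))
}
SketchInfoGain <- function(feature, labels) {
  GroupWeights <- table(feature) / length(feature) ## share of rows per value
  GroupEntropy <- tapply(labels, feature, SketchEntropy)
  ## Entropy before the split minus the weighted entropy after it
  SketchEntropy(labels) -
    sum(as.numeric(GroupWeights) * as.numeric(GroupEntropy[names(GroupWeights)]))
}
SketchY       <- c("pos", "pos", "neg", "neg")
SketchPerfect <- c("a", "a", "b", "b")  ## separates the labels completely
SketchUseless <- c("a", "b", "a", "b")  ## tells us nothing about the labels
SketchInfoGain(SketchPerfect, SketchY)  ## gain of 1 bit
SketchInfoGain(SketchUseless, SketchY)  ## gain of 0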

Example 3: Decision Trees in R for both record data and text data


################################################
## 
##  Basic R Decision Trees Tutorial for 
##  Mixed data types on a CLEAN record dataset
##  and a numeric text corpus
##
##  Gates
################################################
## The record data used is here:
##  https://drive.google.com/file/d/18dJPOiiO9ogqOibJppc0lsDiQ2-bQs0f/view?usp=sharing
## The text corpus used is here:
##  https://drive.google.com/drive/folders/1NnPZfhg_04lUAdKPNMMqkTcyfF6Z0BP2?usp=sharing
##
################################################

library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(Cairo)
library(network)
##If you install from the source....
#Sys.setenv(NOAWT=TRUE)
## ONCE: install.packages("wordcloud")
library(wordcloud)
## ONCE: install.packages("tm")
library(tm)
# ONCE: install.packages("Snowball")
## NOTE Snowball is not yet available for R v 3.5.x
## So I cannot use it  - yet...
##library("Snowball")
##set working directory
## ONCE: install.packages("slam")
library(slam)
library(quanteda)
## ONCE: install.packages("quanteda")
## Note - this includes SnowballC
library(SnowballC)
library(arules)
##ONCE: install.packages('proxy')
library(proxy)
## ONCE: if needed:  install.packages("stringr")
library(stringr)
## ONCE: install.packages("textmineR")
library(textmineR)
library(igraph)
library(lsa)


## These are paths on MY COMPUTER - not yours :)
RecordDataPath="C:/Users/profa/Documents/Python Scripts/ANLY503/DATA/StudentSummerProgramDataClean_DT.csv"
TextCorpusPath="C:/Users/profa/Documents/Python Scripts/TextMining/DATA/Sent_Corpus"
#######################################
##  Read in and prepare the data.
######################################################
##  The record data is easy - it already has a label and is clean...
RecordDF=read.csv(RecordDataPath, stringsAsFactors=TRUE) ## R >= 4.0 no longer makes strings factors by default
head(RecordDF)
##-------------------------------------------------------
## Read in and prep the text data from the corpus
(NCorpus <- Corpus(DirSource(TextCorpusPath)))
## Here, I am keeping a list of the filenames so I can associate them with the DTM
(FileNames <- list.files(TextCorpusPath, pattern=NULL))
NCorpus_DTM <- DocumentTermMatrix(NCorpus,
                                      control = list(
                                        stopwords = TRUE, ## remove normal stopwords
                                        wordLengths=c(3, 10), ## get rid of words of len 2 or smaller or larger than 10
                                        removePunctuation = TRUE,
                                        removeNumbers = TRUE,
                                        tolower=TRUE
                                        #stemming = TRUE,
                                      ))

#inspect(NCorpus_DTM)
## Convert to DF
NCorpusDF <- as.data.frame(as.matrix(NCorpus_DTM))
head(NCorpusDF[1:12,1:10])
NCorpus_MAT <- as.matrix(NCorpus_DTM)
(NCorpus_MAT[1:12,1:10])

## WORDCLOUD ##_---------------------------------------
## Column sums of the DTM matrix give the total count of each word
word.freq <- sort(colSums(NCorpus_MAT), decreasing = TRUE)
wordcloud(words = names(word.freq), freq = word.freq*2, min.freq = 2,
          random.order = FALSE, max.words=30)

## Place the LABELS onto the DF---------------------------------
## Recall that the filenames must be converted and added as labels
FileNames
(Temp1<-strsplit(FileNames,"_"))
#Temp1[[5]][1]
MyList<-list()

for(i in 1:length(FileNames)){
  MyList[[i]] <- Temp1[[i]][1]
  
}

MyList

## Add MyList to the dataframe
(NCorpusDF$LABEL <- unlist(MyList))
(NumCol<-ncol(NCorpusDF))  ## get the num columns
## Have a look to check for the label - which is at the end
NCorpusDF[1:5, (NumCol-5):NumCol] 

#####################################################
## OK - now we have labeled record data
## and labeled text data
## Both as dataframes..
##
######################################################
NCorpusDF[1:3, ((NumCol/2)-5):(NumCol/2)]   #label here is called LABEL
## AND
head(RecordDF)   ## label here is called "Decision"

##################### CRITICAL ###################
## Make sure that your labels - so LABEL in the text data
## and Decision in the record data are both 
## type FACTOR!!
##################################################
str(NCorpusDF$LABEL)
## It was not!!
## Fix it
NCorpusDF$LABEL <- as.factor(NCorpusDF$LABEL)
## check it
str(NCorpusDF$LABEL)

str(RecordDF$Decision)
RecordDF$Decision <- as.factor(RecordDF$Decision)
str(RecordDF$Decision)
## This one is OK

########################################################
##
##       Decision Trees, etc..
##
#########################################################
##
## At this point, we have created two dataframes - 
## One from record data and one from text data - 
## and both with labels. 
## Now we can perform DT.......................
########################################################
## In R - unlike Python - you only need to remove the label
## from the TESTING data but not from the TRAINING data

## There are many ways to create Training and Testing data
##################### CREATE Train/Test for Record data
##--------------------------------------------------------
(every7_indexes<-seq(1,nrow(RecordDF),7))
##  [1]   1   8  15  22  29  36  43  50  ...
(RecordDF_Test<-RecordDF[every7_indexes, ])
(RecordDF_Train<-RecordDF[-every7_indexes, ])

## There are many ways to create Training and Testing data
##################### CREATE Train/Test for Text data
##--------------------------------------------------------
(every7_indexes<-seq(1,nrow(NCorpusDF),7))
##  [1]   1   8  15  22  29  36  43  50  ...
NCorpusDF_Test<-NCorpusDF[every7_indexes, ]
NCorpusDF_Train<-NCorpusDF[-every7_indexes, ]
NCorpusDF_Train[1:5, (NumCol-5):NumCol] 
NCorpusDF_Test[1:5, (NumCol-5):NumCol] 

#############
## REMOVE the labels from the test data
############

## AFTER
## VERY IMPORTANT - this works when the LABEL is AT THE END...
## Otherwise it must be updated.....
## -------------------------------
(NumCol=ncol(NCorpusDF_Test))
(NCorpusDF_TestLabels<-NCorpusDF_Test[NumCol] ) ##keeping my label
NCorpusDF_Test<-NCorpusDF_Test[-c(NumCol)] ## removing the label from the test set
(NewNumCol<-ncol(NCorpusDF_Test))
NCorpusDF_Test[1:5, (NewNumCol-5):NewNumCol] 

###############--------------------->
## Now remove labels from the Record data test set
###############

(RecordDF_TestLabels<-RecordDF_Test$Decision )
RecordDF_Test<-subset( RecordDF_Test, select = -c(Decision))
head(RecordDF_Test)


##############################################################
##
##            Decision Trees 
##
##############################################################
## Remember the NAMES of the dataframes:
## RecordDF_Train
## RecordDF_Test
## RecordDF_TestLabels
##    *AND*
## NCorpusDF_Train
## NCorpusDF_Test
## NCorpusDF_TestLabels
###################################################################
## Record DT
fitR <- rpart(RecordDF_Train$Decision ~ GPA + Gender + WritingScore, data = RecordDF_Train, method="class")
summary(fitR)

## Text DT
fitText <- rpart(NCorpusDF_Train$LABEL ~ . -night - movie -plot - people, data = NCorpusDF_Train, method="class")
summary(fitText)

##########################
## Predict the Test sets
#########################
## Record--------------------------------
predictedR= predict(fitR,RecordDF_Test, type="class")
## Confusion Matrix
table(predictedR,RecordDF_TestLabels)
## VIS..................
fancyRpartPlot(fitR)

## Text--------------------------------------
(predictedText= predict(fitText, NCorpusDF_Test, type="class"))
## Confusion Matrix
#str(NCorpusDF_TestLabels)
#str(predictedText)
table(unlist(predictedText) ,unlist(NCorpusDF_TestLabels))
## VIS..................
fancyRpartPlot(fitText)


## Save the Decision Tree as a jpg image
jpeg("DecisionTree_Students.jpg")
fancyRpartPlot(fitR)
dev.off()