Text and Record Data

Record Data

The record data above is mixed and needs to be cleaned and prepared before use. You can see that some variables are numeric, like Age, while others are qualitative, like Ticket or ID. Although ID is represented with a number, it is not numeric data. Finally, you can see missing values, incorrect values, and data that you will not want to use when building a model.
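As a concrete illustration, here is a minimal Python sketch (using pandas) of the kinds of fixes such data needs; the values below are invented:

```python
import pandas as pd
import numpy as np

# Invented example with the problems described above: a missing
# value, an ID stored as a number, and an impossible Age
df = pd.DataFrame({
    "ID":     [101, 102, 103, 104],
    "Age":    [34, np.nan, 29, -5],
    "Ticket": ["A12", "B07", None, "C33"],
})

# ID is represented with numbers but is qualitative, so store it as a string
df["ID"] = df["ID"].astype(str)

# Flag impossible ages as missing, then drop incomplete rows
df.loc[df["Age"] < 0, "Age"] = np.nan
df = df.dropna()
print(df)
```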

The record data above is also mixed. Some variables, like GPA, are quantitative; other variables, like State, are qualitative. This data is labeled; the label (or category) of each row is “Decision”, which has three options: Admit, Decline, and Waitlist. The data also contains a geographic variable, State, which gives location, and a temporal variable, AppDate, which gives a relative time.
The record data above was converted from text data to record format. You can see that each row is a document and each column (variable) is a word. The values represent counts; for example, the word “hike” appears three times in the first document.

Record data is a common format for data. Record data contains rows and columns. While there are always exceptions, the rows generally represent an observation and the columns represent the attributes.

Many different names are used to describe rows and columns.

Rows: Observations, instances, objects, people, vectors

Columns: Variables, fields, attributes, features, dimensions

This is important so that if a book, person, article, or site refers to “feature engineering”, for example, you know they are talking about engineering new columns.

If you are told that your data has 4 dimensions, you know that this means it has four columns or variables.

If your data has 100 vectors, it has 100 rows or observations.

Record data can be mixed, meaning that some variables are qualitative (names, categories, descriptions, etc.) while other variables are quantitative (numeric). Variables can also describe times and dates or geographic locations (such as states or latitude/longitude).

Similarly, record data can be all numeric or all qualitative. The type of data that you have will affect which models and methods you can use.

Many formats of data can be put into record format. This is most often done using a programming language such as R or Python. For example, as you will see in the next section, text data can be gathered and then converted to record format for analysis.
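For instance, a minimal pandas sketch of reading a file into record format might look like this (the file name admissions.csv is hypothetical):

```python
import pandas as pd

# "admissions.csv" is a hypothetical file name
df = pd.read_csv("admissions.csv")

# Rows are observations; columns are variables
print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # which variables are numeric and which are qualitative
```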

To learn more about data cleaning, please review the data cleaning tutorial.

Text Data

Text data is any data that contains words. This might be Tweets, chats, webpage HTML, novels, reviews, news articles, social media posts, etc.

Text data is often very messy and disorganized when it is first gathered. Before analytical methods, such as machine learning models, can be applied to text data, it must be cleaned and formatted.

First, think about text data. More specifically, let’s imagine three news articles – two on football and one on politics. All three articles use words (in our case, we will focus on English only), so words like “and”, “the”, and “but” will appear in all of them. However, other words, like “quarterback”, are more likely to occur in the football articles and not so much in the political article.

In other words, the words in any collection of text (whether it’s an article or a Tweet) will give you information about its topic(s), sentiment, etc.

For example, if one document contains the word “hike” 27 times and “gear” 45 times, while another document contains the word “dog” 64 times and the word “cat” 23 times, we might start to surmise that one article is about hiking and the other is about pets.
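A quick Python sketch of this idea, with invented documents:

```python
from collections import Counter

# Two invented documents echoing the example above
doc1 = "hike gear trail hike gear camp"
doc2 = "dog cat vet dog dog cat"

# Count how often each word occurs in each document
print(Counter(doc1.split()))  # Counter({'hike': 2, 'gear': 2, 'trail': 1, 'camp': 1})
print(Counter(doc2.split()))  # Counter({'dog': 3, 'cat': 2, 'vet': 1})
```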

So, you can think of the words in text data as being the variables.

We can tokenize text data (break it into tokens, or words) and then vectorize it (create a dataframe or matrix where the words are the column names and the documents are the rows).

This process of tokenizing and vectorizing converts text into record format. Keep in mind that a dataframe is record format.
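Here is a minimal sketch of tokenizing and vectorizing in Python, using scikit-learn’s CountVectorizer; the three documents are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Three invented documents
docs = [
    "the quarterback threw the ball",
    "the quarterback ran and the crowd cheered",
    "the senate passed the bill",
]

# Tokenize and vectorize: words become columns, documents become rows
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Wrapping the counts in a dataframe gives record format
df = pd.DataFrame(counts.toarray(),
                  columns=vectorizer.get_feature_names_out())
print(df)
```

Each cell holds the count of that word in that document, just like the “hike” example above.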

Text data can also be transformed into transaction data.

The format we choose depends on the models or methods we want to use. For example, if using Association Rule Mining, we will need transaction data; if using SVMs, we will need labeled record data, and so on.
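A minimal sketch of converting tokenized text into transaction data, where each transaction is the set of distinct words in one document (the documents are invented):

```python
# Invented documents
docs = [
    "hike gear trail hike",
    "dog cat dog",
]

# Each transaction is the set of distinct words in one document
transactions = [sorted(set(doc.split())) for doc in docs]
print(transactions)  # [['gear', 'hike', 'trail'], ['cat', 'dog']]
```

This list-of-item-sets format is what Association Rule Mining tools generally expect.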

The above is text data that has been tokenized and vectorized. It is now record data (a dataframe) with words as the column names. Each row is a document. Remember that a “document” can be a Tweet, novel, speech, review, etc. – anything that contains text.
The above is an example of a corpus, which is a folder that contains text files. Here, the text files are named n_1, n_2, etc., and the corpus folder is called NEG. You can also see that a portion of n_1.txt is open in Notepad. It contains text.
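A sketch of reading such a corpus into Python; the folder name NEG matches the example above, and the UTF-8 encoding is an assumption:

```python
from pathlib import Path

# Read every .txt file in the NEG corpus folder into a list of documents
# (utf-8 is an assumption; adjust if the files use another encoding)
corpus = [p.read_text(encoding="utf-8")
          for p in sorted(Path("NEG").glob("*.txt"))]

print(len(corpus), "documents loaded")
```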
This is the result of tokenizing, vectorizing, and normalizing text data. You can see that the column names are words. You can also see that the data is numeric and between 0 and 1 (in decimal form) because it was normalized. This is now record data and is a dataframe in R or Python.
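One common way to get values between 0 and 1 is to divide each word count by its row total (term frequency). This sketch assumes that method; the data above may have been normalized differently:

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = ["hike gear trail hike", "dog cat dog"]  # invented documents

vec = CountVectorizer()
df = pd.DataFrame(vec.fit_transform(docs).toarray(),
                  columns=vec.get_feature_names_out())

# Divide each row by its total so every value falls between 0 and 1
df = df.div(df.sum(axis=1), axis=0)
print(df)
```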