Data Preparation

Gathering, cleaning, formatting, and preparing data is a critical first step in any data science or analytics project.

Data comes in a wide range of formats: record data (standard rows and columns), text (anything from tweets to novels), sequential data (such as DNA), images, and music.

Methods for cleaning and formatting a data set may vary depending on what models or methods you plan to use the data for.

For example, data that will be used for Association Rule Mining (ARM) will likely be in “transaction” format, where each row is a transaction and each transaction contains a finite number of elements.
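As a rough illustration of transaction format, the sketch below builds a few made-up baskets and converts them to a one-hot table, a layout that many ARM tools accept. The item names and the variables transactions, items, encoded, and support are hypothetical and chosen only for this example.

```python
# Hypothetical transactions: each row is one basket with a variable number of items.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer"],
]

# Collect every distinct item, sorted so the column order is reproducible.
items = sorted({item for basket in transactions for item in basket})

# One-hot encode: one row per transaction, one 0/1 column per item.
encoded = [[1 if item in basket else 0 for item in items] for basket in transactions]

# Support of an item is the fraction of transactions that contain it.
support = {
    item: sum(row[i] for row in encoded) / len(encoded)
    for i, item in enumerate(items)
}

print(items)
for row in encoded:
    print(row)
print(support)
```

Note that the rows have different lengths in transaction form but the same length once encoded, which is the main reason for the conversion.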

Data that will be used for supervised learning must be “labeled data” and, for some models (such as support vector machines), may also need to be entirely quantitative.
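One way to picture this requirement is to split a small labeled dataset into a numeric feature matrix and a label list. The sketch below is a minimal example under assumed data: the field names height_cm, color, and label, and the helper to_numeric, are hypothetical. It one-hot encodes the single qualitative column so that every feature is numeric.

```python
# Hypothetical labeled records: "label" is the value a model would learn to predict.
records = [
    {"height_cm": 170.0, "color": "red", "label": "A"},
    {"height_cm": 155.5, "color": "blue", "label": "B"},
    {"height_cm": 182.3, "color": "red", "label": "A"},
]

# Distinct values of the qualitative column, sorted for a stable column order.
categories = sorted({r["color"] for r in records})


def to_numeric(record):
    """Return a purely numeric feature row: height plus one 0/1 flag per color."""
    one_hot = [1.0 if record["color"] == c else 0.0 for c in categories]
    return [record["height_cm"]] + one_hot


X = [to_numeric(r) for r in records]  # features, all quantitative
y = [r["label"] for r in records]     # labels, kept separate from the features

print(categories)
print(X)
print(y)
```

Keeping the labels in their own list mirrors the X and y convention used by many modeling libraries.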

Data can be temporal (meaning that it contains time-based variables) and can be used in forecasting.
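Temporal data usually needs its time column parsed and ordered before any forecasting is attempted. The sketch below assumes made-up monthly readings stored as strings, parses them with the standard library, sorts them by date, and computes a simple drift-style forecast (the last value plus the average step). The data and variable names are hypothetical, and the forecast is only meant to show why ordering matters.

```python
from datetime import datetime

# Hypothetical raw rows: (date string, value string), deliberately out of order.
raw = [
    ("2023-03-01", "12.5"),
    ("2023-01-01", "10.0"),
    ("2023-02-01", "11.0"),
]

# Parse strings into datetime and float, then sort chronologically.
series = sorted((datetime.strptime(d, "%Y-%m-%d"), float(v)) for d, v in raw)

values = [v for _, v in series]

# Average change between consecutive observations.
steps = [b - a for a, b in zip(values, values[1:])]
average_step = sum(steps) / len(steps)

# Drift-style forecast for the next period: last value plus the average step.
forecast = values[-1] + average_step

print(series[0][0].date(), series[-1][0].date())
print(forecast)
```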

Record data contains rows (also called observations, instances, vectors, etc.) as well as columns (also called variables, attributes, features, fields, etc.).

Each variable has a type, such as quantitative (numeric) or qualitative (descriptive). Its measurement scale can be nominal, ordinal, interval, or ratio. Data can also be categorical, continuous, spatial, and so on.
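To make the distinction between types concrete, the sketch below checks whether a column of strings is quantitative and encodes an ordinal column by rank. The column contents, the assumed ordering small < medium < large, and the helper is_quantitative are all hypothetical.

```python
def is_quantitative(values):
    """Return True if every value in the column can be read as a number."""
    try:
        for v in values:
            float(v)
    except ValueError:
        return False
    return True


# Hypothetical columns as they might arrive from a file: everything is a string.
ages = ["34", "27", "51"]
sizes = ["medium", "small", "large", "small"]

# Ordinal data has a meaningful order, so it can be mapped to ranks.
# The order below is an assumption made for this example.
size_order = ["small", "medium", "large"]
rank = {level: i for i, level in enumerate(size_order)}
encoded_sizes = [rank[s] for s in sizes]

print(is_quantitative(ages))   # numeric strings
print(is_quantitative(sizes))  # descriptive strings
print(encoded_sizes)
```

Rank encoding is only appropriate for ordinal variables; applying it to a nominal variable such as color would impose an order that does not exist.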

In other words, not all data is the same. Each model or method, each goal, and even each programming language package or library requires data in a particular format with specific types.

For this reason, you must know your data types and formats and must reformat as needed. Even a perfectly clean dataset may not be usable with every model or method without reformatting.

Text is another common kind of data. It can be gathered from social media, reviews, webpages, speeches, novels, chats, or any other source that is made up of words.

Tokenized and Vectorized Text Data

Text data generally requires a lot of cleaning, preparation, and formatting to use in analytics.
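As a minimal sketch of what tokenizing and vectorizing can look like, the example below splits two made-up sentences into lowercase word tokens and builds a document-term count matrix using only the standard library. The documents, the tokenize helper, and the regular expression are assumptions for illustration; dedicated text libraries offer many more options.

```python
import re
from collections import Counter

# Hypothetical documents.
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


tokens = [tokenize(d) for d in docs]

# Vocabulary: every distinct token, sorted so the columns have a stable order.
vocab = sorted({t for doc in tokens for t in doc})

# Vectorize: one row per document, one column per vocabulary word, cells are counts.
counts = [Counter(doc) for doc in tokens]
matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Each document is now a numeric vector of the same length, which is the record-style layout most models expect.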

For example, Tweet results (see left) that are read into a .csv file can be very “messy” and in need of a lot of “clean-up” before the data can be used.
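The kind of clean-up this involves can be sketched as follows. The example reads a small in-memory CSV (standing in for a real file) and strips URLs, mentions, hashtags, retweet markers, and non-letter characters. The column name text, the sample tweets, and the clean_tweet helper are hypothetical, and which pieces to remove depends on the analysis you have in mind.

```python
import csv
import io
import re

# Hypothetical CSV content standing in for a file of collected tweets.
csv_text = (
    "id,text\n"
    '1,"RT @user: Loving the new #DataScience course!! https://example.com/abc :)"\n'
    '2,"Messy   data, again... #cleanup"\n'
)


def clean_tweet(text):
    """Return a lowercase, letters-only version of a tweet."""
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)       # drop mentions and hashtags
    text = re.sub(r"\bRT\b", " ", text)        # drop the retweet marker
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # drop digits, punctuation, emoticons
    return " ".join(text.lower().split())      # lowercase and collapse whitespace


rows = list(csv.DictReader(io.StringIO(csv_text)))
cleaned = [clean_tweet(r["text"]) for r in rows]

for line in cleaned:
    print(line)
```

Removing hashtags entirely is a design choice; for some analyses you may prefer to keep the hashtag word and drop only the # symbol.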

Several included tutorials review data cleaning, formats and formatting, and preparation in both R and Python.