Structured & Unstructured Data
Deep Dive on Structured Data
The majority of the raw data available today is unstructured; as the name suggests, this kind of data has no defined structure. Unstructured data comes from many sources, such as text files, images, videos, audio files and clickstreams. A major challenge in data science is turning raw unstructured data into actionable information. For practical applications of machine learning and statistical concepts, processing the raw data into a structured form first makes the later analysis far easier. In this blog we discuss the statistical concepts behind structured data types in detail.
What is a data type?
A data type is simply an attribute of the data that tells the compiler or interpreter how the data is intended to be used. In data science, software such as R and Python uses data types to improve computational performance: based on the data type, the software decides how to handle computations for that variable. Explicitly identifying the data type of a variable offers several computational advantages, which are discussed later.
Terms used to define Data Types:
Continuous : Data that can take any value in an interval
Eg : Temperature on a given day, Wind speed
Discrete : Data that can take only integer values
Eg : Number of items bought from grocery store every week, Number of computers by department
Categorical : Data that can take only a specific set of values, each representing one of the possible categories, also called ‘levels’.
Eg : State name (California, NY, Ohio etc)
Binary : A special case of categorical data with just two values or levels
Eg : True or False, 1 or 0
Ordinal : Categorical data that has explicit ordering
Eg : Numerical rating (1,2,3,4 or 5)
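The terms above can be mapped onto column types in pandas. The following is a minimal sketch, assuming a small made-up table; the column names and values are hypothetical and only serve to show one plausible dtype for each kind of data.

```python
import pandas as pd

# Hypothetical sample data; column names and values are invented for illustration.
df = pd.DataFrame({
    "temperature_c": [21.5, 18.2, 25.1],    # continuous
    "items_bought": [3, 7, 2],              # discrete
    "state": ["California", "NY", "Ohio"],  # categorical
    "is_member": [True, False, True],       # binary
    "rating": [5, 3, 4],                    # ordinal (plain integers until declared)
})

# Declare the categorical and ordinal columns explicitly.
df["state"] = df["state"].astype("category")
df["rating"] = pd.Categorical(df["rating"], categories=[1, 2, 3, 4, 5], ordered=True)

dtypes = df.dtypes
print(dtypes)
```

Without the two explicit declarations, pandas would treat the state names as ordinary text and the ratings as ordinary integers, so the intent has to be stated by the analyst.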
Taking categorical data as an example, let's go back to how defining the data type offers computational advantages.
· Knowing that data is categorical signals to the software how statistical procedures, such as producing a chart or fitting a model, should behave. In particular, ordinal data can be represented as an ordered factor in R, preserving a user-specified ordering in charts, tables and models.
· Storage and indexing can be optimized
· The possible values a categorical variable can take are enforced in the software.
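The three advantages listed above can be seen in pandas as well, where the counterpart of an R factor is the categorical dtype. This is a rough sketch on made-up data (the series names and the "huge" value are hypothetical); exact memory figures depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# Hypothetical data: 30,000 state names stored as plain strings.
states = pd.Series(["California", "NY", "Ohio"] * 10_000)
as_category = states.astype("category")

# Storage: each distinct level is stored once, plus a small integer code per row.
string_bytes = states.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(string_bytes, category_bytes)

# Ordering: an ordered categorical keeps the user-specified order of its levels.
sizes = pd.Series(pd.Categorical(
    ["medium", "small", "large"],
    categories=["small", "medium", "large"],
    ordered=True,
))
sorted_sizes = sizes.sort_values().tolist()
print(sorted_sizes)

# Enforcement: assigning a value outside the declared levels is rejected.
try:
    sizes.iloc[0] = "huge"
    rejected = False
except (TypeError, ValueError):
    rejected = True
print(rejected)
```

Sorting follows the declared order (small, medium, large) rather than alphabetical order, which is the behaviour the first bullet describes for charts, tables and models.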
Rectangular Data
A rectangular data object is the typical frame of reference for an analysis in data science. It is essentially a two-dimensional matrix with rows indicating ‘records’ and columns indicating ‘features’. Rectangular data is also referred to as a ‘data frame’.
Data Frames and Indexes
In Python, with the pandas library, the basic rectangular data structure is a DataFrame object. By default, an automatic integer index is created for a DataFrame based on the order of the rows. It is also possible to set multi-level (hierarchical) indexes with pandas to improve the efficiency of certain operations.
In R, the basic rectangular data structure is a data.frame object. A data.frame also has implicit indexing based on row order.
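As a small illustration of the default and hierarchical indexes in pandas, here is a sketch on made-up records; the table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical records; names and values are invented for illustration.
df = pd.DataFrame({
    "department": ["Sales", "Sales", "IT", "IT"],
    "year": [2022, 2023, 2022, 2023],
    "computers": [12, 15, 30, 34],
})

# With no index specified, pandas numbers the rows 0, 1, 2, ...
print(df.index)

# Build a two-level (hierarchical) index from existing columns and sort it.
indexed = df.set_index(["department", "year"]).sort_index()

# Look up a single record by its (department, year) key.
it_2023 = indexed.loc[("IT", 2023), "computers"]
print(it_2023)
```

Sorting the index after setting it is a common habit, since lookups on a sorted hierarchical index tend to be faster.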
p.s : Rectangular data is R’s bread and butter 😉
Terms used for Rectangular Data
· Feature : A column in the data frame that does not represent the outcome to be predicted. Also known as an independent variable or predictor
· Response Variable : A column in the data frame representing the outcome to be predicted. Also known as the dependent variable or target variable
· Record : A row in the data frame representing a single observation.
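In code, these terms usually show up as a split of the data frame into predictors and outcome. A minimal sketch, assuming a made-up housing table in which "price" is the column to be predicted (all names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data set: each row is one record (a house).
houses = pd.DataFrame({
    "area_sqft": [850, 1200, 1500],
    "bedrooms": [2, 3, 4],
    "price": [150_000, 210_000, 265_000],  # outcome to be predicted
})

target = "price"
X = houses.drop(columns=[target])  # features (predictors)
y = houses[target]                 # response variable (target)

n_records, n_features = X.shape
print(n_records, n_features)
```

The names X and y are only a widespread convention for features and response, not something pandas requires.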
Non-Rectangular Data
Non-rectangular data does not have a defined tabular (row-and-column) structure. The majority of the data available in its raw form is non-rectangular… oh, what a surprise!
JSON, XML, time series records, spatial data and graph data are a few examples. Several techniques exist for processing non-rectangular data and making it ‘model ready’; depending on the problem statement, they range from conversion to a rectangular format to signal processing on the raw data.
In the next blog, we will discuss the techniques used to process non-rectangular data in detail. Do subscribe for weekly articles on Data Science & AI.
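Conversion to a rectangular format can be sketched with the pandas json_normalize function. The nested order records below are made up for illustration, and this is only one of several possible ways to flatten such data.

```python
import pandas as pd

# Hypothetical nested records, as they might arrive from a JSON source.
orders = [
    {"id": 1, "customer": {"name": "Ana", "state": "Ohio"},
     "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
    {"id": 2, "customer": {"name": "Raj", "state": "NY"},
     "items": [{"sku": "A1", "qty": 5}]},
]

# One row per purchased item; order-level fields are repeated on each row.
flat = pd.json_normalize(
    orders,
    record_path="items",
    meta=["id", ["customer", "name"], ["customer", "state"]],
)
print(flat)
```

Each nested item becomes a record and each field becomes a column, so the result can be treated like any other data frame.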
Happy Learning :)