1.3 Data Understanding and Preparation

CRISP-DM

Data Understanding

  • Collect Initial Data
  • Describe Data
  • Explore Data
  • Verify Data Quality

Data Preparation

  • Data Set
  • Select Data
  • Clean Data
  • Construct Data
  • Integrate Data
  • Format Data

Data Understanding

Data Summarization

Motivation

  • with big data sets, it is hard to get an idea of what is going on in the data
  • data summaries provide overviews of key properties of the data
  • they help in selecting the most suitable tool for the analysis
  • their goal is to describe important properties of the distribution of the values

Types of Summaries

  • What is the "most common value"?

  • What is the "variability" in the values?

  • Are there "strange"/unexpected values in the data set?

  • Data set: univariate data or multivariate data

  • Variables: categorical variables or numeric variables

Categorical Variables
  • Mode: the most frequent value
  • Frequency table: frequency of each value (absolute or relative)
  • Contingency table: cross-frequency of values for 2 variables
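
A minimal sketch of these summaries with pandas; the data frame and its columns are made up for illustration:

```python
import pandas as pd

# Illustrative data: eye_color and gender are made-up columns
df = pd.DataFrame({
    "eye_color": ["brown", "blue", "brown", "green", "brown", "blue"],
    "gender":    ["F", "M", "F", "F", "M", "M"],
})

print(df["eye_color"].mode()[0])                     # mode: the most frequent value
print(df["eye_color"].value_counts())                # absolute frequency table
print(df["eye_color"].value_counts(normalize=True))  # relative frequency table
print(pd.crosstab(df["eye_color"], df["gender"]))    # contingency table for 2 variables
```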
Numeric Variables

Statistics of location

  • Mean (or sample mean) - sensitive to extreme values
  • Median - 50th percentile
  • Mode - most common value

Statistics of variability or dispersion

  • Range - $\max_x - \min_x$
  • Variance - $\sigma^2_x$ - sensitive to extreme values
  • Standard Deviation - $\sigma_x = \sqrt{\sigma^2_x}$ - sensitive to extreme values ![[Pasted image 20220208172524.png]]
  • Inter-quartile Range (IQR) - difference between the 3rd ($Q_3$) and 1st ($Q_1$) quartiles
    • $Q_1$ → value below which 25% of the observations fall
    • $Q_3$ → value below which 75% of the observations fall
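
A quick sketch of these statistics in NumPy; the sample values are made up and include one extreme value to show the sensitivity of the mean, variance, and standard deviation:

```python
import numpy as np

x = np.array([2.0, 3.5, 3.5, 4.0, 5.0, 120.0])  # note the extreme value 120

print(x.mean())                      # mean: pulled up by the extreme value
print(np.median(x))                  # median: the 50th percentile, robust
print(x.max() - x.min())             # range
print(x.var(ddof=1))                 # sample variance, sensitive to extremes
print(x.std(ddof=1))                 # standard deviation
q1, q3 = np.percentile(x, [25, 75])  # 1st and 3rd quartiles
print(q3 - q1)                       # inter-quartile range (IQR)
```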

Outliers

  • For a numeric variable, an outlier can be an extreme value
  • In the presence of such values:
    • median or mode are more robust as a central tendency statistic
    • interquartile range is more appropriate as a variability statistic
  • Boxplot Definition - any value outside the interval $[Q_1 - 1.5 * IQR, Q_3 + 1.5 * IQR]$ is considered an outlier
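
A small sketch of the boxplot rule; the helper function and sample values are illustrative, not a library API:

```python
import numpy as np

def boxplot_outliers(x):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

x = np.array([2.0, 3.5, 3.5, 4.0, 5.0, 120.0])
print(boxplot_outliers(x))  # -> [120.]
```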

Multivariate analysis of variability or dispersion

  • Covariance Matrix - covariance between every pair of numeric variables - the values depend on the magnitude of the variables ![[Pasted image 20220208173057.png]]
  • Correlation Matrix - correlation between every pair of numeric variables - the influence of the magnitude is removed $$cor(x,y) = \frac{cov(x,y)}{\sigma_x \sigma_y}$$
  • Pearson Correlation Coefficient ($\rho$) - measures the linear correlation between 2 variables; takes values in $[-1, +1]$
  • Spearman Rank-Order Correlation Coefficient - measures the strength and direction of the monotonic association between 2 variables; a rank-based, non-parametric version of Pearson
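
A sketch of both matrices using pandas built-ins; the `height`/`weight` columns are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "height": [1.60, 1.72, 1.80, 1.65, 1.90],
    "weight": [55, 70, 80, 60, 95],
})

print(df.cov())                    # covariance matrix: depends on the magnitudes
print(df.corr(method="pearson"))   # Pearson correlation matrix, values in [-1, +1]
print(df.corr(method="spearman"))  # Spearman rank-order correlation matrix
```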

Data Visualization

Motivation

  • Humans are outstanding at detecting patterns and structures with their eyes
  • Data visualization methods try to exploit these capabilities
  • They help in detecting patterns and trends, as well as outliers and unusual patterns

Main Types of Graphs

  • Univariate Graphs
  • Bivariate Graphs
  • Multivariate/Conditioned Graphs

Univariate Graphs

  • Categorical Variables: Barplots, Piecharts, ...
  • Numeric Variables: Line plots, Histograms, QQ Plots, Boxplots, ...

Barplots

  • display a set of values as heights of bars
  • display frequency of occurrence of different values ![[Pasted image 20220208174102.png]]

Piecharts

  • same purpose as barplots, with the information displayed in the form of a pie
  • not so good for comparisons ![[Pasted image 20220208174138.png]]

Line Plots

  • analyze the evolution of the values of a continuous variable
  • the x-axis represents a quantitative scale with equal lags between observations
  • typically used to deal with the notion of time ![[Pasted image 20220208174424.png]]

Histograms

  • display how the values of a continuous variable are distributed
  • may be misleading in small data sets
  • shape depends on the number of bins ![[Pasted image 20220208174526.png]]
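
A short sketch of how the number of bins changes the shape of a histogram, using matplotlib on a synthetic sample:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.normal(size=500)  # synthetic sample

# Same data, three different bin counts: the shape changes noticeably
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [5, 20, 100]):
    ax.hist(x, bins=bins)
    ax.set_title(f"{bins} bins")
plt.show()
```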

QQ Plots

  • show how properties such as location, scale, and skewness compare in 2 distributions
  • visually check the hypothesis that the variable under study follows a normal distribution, by comparing the observed distribution against the Normal distribution ![[Pasted image 20220208174646.png]]
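
A minimal sketch using SciPy's `probplot`, which compares sample quantiles against theoretical Normal quantiles; the sample is synthetic:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)  # synthetic sample

# Points close to the reference line support the normality hypothesis
stats.probplot(x, dist="norm", plot=plt)
plt.show()
```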

Boxplots

  • provide a compact summary of a variable's distribution
  • inform us of the interquartile range and of the outliers (if any) ![[Pasted image 20220208174730.png]]

Bivariate Graphs

Scatterplots

  • show the relationship between 2 numeric variables ![[Pasted image 20220208174802.png]]

Multivariate Graphs

  • plot the relationship between every pair of numeric variables and the respective groups ![[Pasted image 20220208174930.png]]

Parallel Coordinates Plot

  • attribute values for each case (one line per case)
  • the order of the axes might be important to help identify groups ![[Pasted image 20220208175043.png]]

Correlogram

  • correlation statistics (e.g., Pearson) for each pair of variables ![[Pasted image 20220208175122.png]]
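
A sketch of a correlogram drawn as a heatmap with seaborn; the small data frame is illustrative:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 5, 8, 9],
    "c": [5, 3, 4, 1, 2],
})

# Heatmap of the pairwise correlation statistics
sns.heatmap(df.corr(method="pearson"), annot=True, vmin=-1, vmax=1, cmap="coolwarm")
plt.show()
```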

Conditioned Graphs

  • allow the simultaneous presentation of graphs for different subgroups, making it easier to spot possible differences between the subgroups

Data Preparation

Set of steps that may be necessary to carry out before any further analysis takes place on the available data

  • we may face the need to "create" new variables to achieve the analysis objectives

  • the data set may be too large for some analyses

  • Feature Extraction: extract features from raw data on which analysis can be performed

  • Data Cleaning: data may be hard to read or require extra parsing efforts

  • Data Transformation: it may be necessary to change some values of the data

  • Feature Engineering: to incorporate some domain knowledge

  • Data and Dimensionality Reduction: to make modeling possible

Feature Extraction

  • Very application specific and a very crucial step
    • Sensor data: large volume of low-level signals associated with date/time attributes
    • Image data: very high-dimensional data that can be represented by pixels, color histograms, etc.
    • Web logs: text in a prespecified format with both categorical and numerical attributes
    • Network traffic: network packets information
    • Document data: raw and unstructured data

Data Cleaning

Handling Missing Values

  • Goal: make data tidy
Strategies
  • Remove all cases in a data set with some unknown value
  • Fill-in:
    • the unknowns with the most common value (mode imputation)
    • with the most common value among the cases that are most "similar" to the one with unknowns
    • with linear interpolation of nearby values in time and/or space
  • Explore possible correlations between variables to estimate the unknowns
  • Do nothing
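
A sketch of the first three strategies with pandas; the series of measurements is made up:

```python
import numpy as np
import pandas as pd

s = pd.Series([36.5, np.nan, 38.0, 36.5, np.nan])  # illustrative measurements

print(s.dropna())               # remove cases with unknown values
print(s.fillna(s.mode()[0]))    # fill in with the most common value
print(s.interpolate("linear"))  # linear interpolation of nearby values
```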

Handling Incorrect Values

  • Inconsistency detection: data integration techniques within the database field
  • Domain knowledge: data auditing that use domain knowledge and constraints
  • Data-centric methods: statistical-based methods to detect outliers

Data Transformation

  • Map the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values

Common Strategies

  • Normalization
  • Binarization/One-Hot Encoding
  • Discretization
Normalization
  • Min-Max Scaling (Range-based Normalization) - $x' = \frac{x - \min_x}{\max_x - \min_x}$ - not robust for scenarios where there are outliers ![[Pasted image 20220208180707.png]]
  • Standardization (z-score Normalization) - $z = \frac{x - \bar{x}}{\sigma_x}$ ![[Pasted image 20220208180733.png]]
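
A sketch of both normalizations in plain pandas (scikit-learn's `MinMaxScaler` and `StandardScaler` do the same); the series, including the outlier, is illustrative:

```python
import pandas as pd

x = pd.Series([10.0, 20.0, 30.0, 40.0, 1000.0])  # note the outlier

minmax = (x - x.min()) / (x.max() - x.min())  # min-max scaling to [0, 1]
zscore = (x - x.mean()) / x.std()             # z-score standardization

print(minmax.round(3).tolist())  # the outlier squashes the other values near 0
print(zscore.round(3).tolist())
```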
Case Dependencies
  • In time series, it is common to use different techniques
  • E.g.: adjust the mean, variance, or range; remove an unwanted, common signal
Binarization/One-Hot Encoding
  • Binarization: if the attribute has only 2 possible nominal values, it can be transformed into 1 binary attribute
    • fever: yes/no → fever: 1/0
  • One-Hot Encoding: if the attribute has k possible nominal values, it can be transformed into k binary attributes
    • eye_color: brown/blue/green → eye_brown: 1/0, eye_blue: 1/0, eye_green 1/0
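
A sketch of both transformations with pandas, reusing the `fever` and `eye_color` examples above; `pd.get_dummies` performs the one-hot step:

```python
import pandas as pd

df = pd.DataFrame({
    "fever": ["yes", "no", "yes"],
    "eye_color": ["brown", "blue", "green"],
})

# Binarization: 2 nominal values -> 1 binary attribute
df["fever"] = (df["fever"] == "yes").astype(int)

# One-hot encoding: k nominal values -> k binary attributes
df = pd.get_dummies(df, columns=["eye_color"], prefix="eye")
print(df)
```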
Discretization
  • process of converting a continuous numeric attribute into an ordinal attribute
  • Unsupervised discretization: find breaks in the data values
    • Equal-width: divides the original values into ranges of equal width; may be affected by outliers
    • Equal-frequency: divides the original values so that the same number of values is assigned to each range; can generate ranges with different amplitudes
  • Supervised discretization: use class labels to find breaks
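
A sketch of both unsupervised strategies with pandas: `pd.cut` for equal-width and `pd.qcut` for equal-frequency; the `age` values are made up:

```python
import pandas as pd

age = pd.Series([22, 25, 31, 35, 41, 58, 60, 93])

# Equal-width: ranges with the same amplitude (stretched by the 93)
print(pd.cut(age, bins=3).value_counts())

# Equal-frequency: roughly the same number of values per range
print(pd.qcut(age, q=3).value_counts())
```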

Feature Engineering

  • Fundamental to the application of machine learning
  • The process of using domain knowledge of the data to create features that might help when solving the problem
  • New features can capture the important information in a data set much more efficiently than the original features
Case Dependencies
  • Case 1: express known relationships between existing variables
    • create ratios and proportions
  • Case 2: overcome limitations of some data mining tools regarding case dependencies
    • create variables that express dependency relationships
  • In time series, it is common to create features that represent relative values instead of absolute values, so as to avoid trend effects ![[Pasted image 20220208181854.png]]