Why Do We Need EDA?

Most modeling failures are caused by poor data understanding, not poor algorithms.

EDA helps us:

Where EDA Fits in the Data Science Pipeline

Raw data → EDA → Feature engineering → Modeling → Evaluation

EDA is:

You perform EDA before trusting any subsequent model output.

EDA is entirely driven by YOU.

What Is Exploratory Data Analysis?

Definition
Exploratory Data Analysis is the use of visualization, summary statistics, and simple transformations to understand the main characteristics of a dataset.

Key ideas:

Personally, I always start with qualitative aspects and visualization and then work on quantitative aspects.

EDA vs Statistical Inference

EDA          | Statistical Inference
Discovery    | Confirmation
Flexible     | Formal
Visual       | Mathematical
No p-values  | p-values

EDA often guides what inference is appropriate later.

First Look: Always Start Simple

Questions to ask immediately:

Typical first plots:

First: look at individual variables

Second: explore the multivariate aspects 

Third: hidden and more complex structures
e.g. dependence on categories (such as regimes), time-varying properties / non-stationarity, …

Univariate EDA: One Variable at a Time

Goal: understand distribution and scale

Tools:

  • Single scatterplot / Time series plot
  • Histogram / density plot
  • Boxplot
  • Mean, median, variance, IQR

Questions:

  • Is the distribution symmetric? Skewed?
  • Is the distribution multi-modal?
  • Are there outliers? Heavy tails?
  • Is a transformation needed?
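
The summary statistics and questions above can be sketched in a few lines. This is a minimal illustration using only Python's standard library; the sample values are made up for the example and are not taken from the course dataset, and the 1.5 × IQR outlier rule (Tukey's rule) is one common convention among several.

```python
import statistics

# Hypothetical right-skewed sample (illustrative values only).
values = [12, 14, 15, 15, 16, 17, 18, 19, 21, 24, 30, 95]

mean = statistics.mean(values)
median = statistics.median(values)
variance = statistics.variance(values)  # sample variance (n - 1 denominator)

# Quartiles, interpolating between the observed data points.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Tukey's rule: flag points more than 1.5 * IQR beyond the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]

# A mean well above the median is a quick hint of right skew.
right_skew_hint = mean > median

print(f"mean={mean:.2f} median={median} variance={variance:.2f} IQR={iqr}")
print(f"outliers={outliers} right_skew_hint={right_skew_hint}")
```

Comparing the mean with the median, and the IQR with the variance, gives a first indication of whether a transformation (for example a log) is worth trying before modeling.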

Example Dataset

Electricity Load + Weather (Hourly)

Variables include:

Properties:

Example: Electricity Load (Demand)

What are your thoughts here?

What we often observe:

Implication:

A simple linear-Gaussian model may be inappropriate.

Bivariate EDA: Relationships Between Variables

Goal: understand dependence between variables

Tools:

Important warning:

Correlation can miss nonlinear relationships.
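
A small sketch of this warning, using synthetic data: `y_quadratic` is fully determined by `x`, yet its Pearson correlation with `x` is essentially zero because the relationship is symmetric rather than linear. The `pearson` helper is written here for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic data: y depends on x exactly, once linearly and once through a U-shaped curve.
x = list(range(-10, 11))
y_linear = [2 * v + 1 for v in x]
y_quadratic = [v ** 2 for v in x]

r_linear = pearson(x, y_linear)
r_quadratic = pearson(x, y_quadratic)

print(f"linear relation:    r = {r_linear:.3f}")
print(f"quadratic relation: r = {r_quadratic:.3f}")
```

This is why a scatterplot should accompany any correlation coefficient: the plot shows the U shape immediately, while the coefficient alone suggests no relationship.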

Example: Load vs Weather Variables

How would you model such relationships?

Example: Load vs Temperature

Clear pattern:

Key lesson:

The relationship between demand and snowfall is even more complex:
it is a case of regime dependence (wet/dry regimes).

Hidden Structure Through Conditioning

Some relationships cannot be revealed by scatterplots alone,
because of their complexity and/or the nature of the data.

Technique:

EDA reveals latent variables.
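
As a minimal sketch of conditioning, the synthetic example below compares the pooled correlation with the correlation inside each group. The regime labels "A" and "B", the data values, and the `pearson` helper are all hypothetical and chosen only to make the effect visible.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic records (regime, x, y): y rises with x inside each regime,
# but the two regimes sit at very different levels.
records = [("A", x, x + 20) for x in range(0, 10)]
records += [("B", x, x - 10) for x in range(10, 20)]

# Pooled view: ignore the regime label.
pooled_r = pearson([x for _, x, _ in records], [y for _, _, y in records])

# Conditioned view: split by regime, then measure the relationship in each group.
groups = defaultdict(list)
for regime, x, y in records:
    groups[regime].append((x, y))
by_regime = {
    regime: pearson([x for x, _ in pairs], [y for _, y in pairs])
    for regime, pairs in groups.items()
}

print(f"pooled r = {pooled_r:.2f}")
for regime, r in sorted(by_regime.items()):
    print(f"regime {regime}: r = {r:.2f}")
```

In this constructed case the pooled correlation is negative even though the relationship is positive inside both regimes, which is the kind of structure that only appears once the data are split by the conditioning variable.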

What would you like to see from the electricity demand?

How would you model such relationships?

These types of structures are typically hard to model.

Adding extra covariates does not solve the issue, because there are non-linearities in the system.

Often we need models that enable conditioning, so that parts of the model differ depending on a factor or a latent variable.

Examples are mixture models, state-space models, and hierarchical models.

Multivariate EDA

Goal: explore structure across many variables

Tools:

Ask:

Time Series–Specific EDA

Key questions:

Tools:

Data Quality Checks (Critical!)

Things to look for:

Rule of thumb:

Never clean data before understanding why it is dirty.

Missing Data: An EDA Perspective

Questions:

Visualization:

Treatment: case by case
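
One possible first look at missingness, sketched with the standard library: how much is missing, and whether the gaps are isolated or come in long runs. The readings below are invented for illustration, and using `None` as the missing-value marker is an assumption of this example.

```python
from itertools import groupby

# Hypothetical hourly readings; None marks a missing value (illustrative only).
load = [31.0, 30.2, None, 29.8, 33.1, None, None, None, 41.7, 44.0, 43.2, None]

is_missing = [v is None for v in load]
missing_fraction = sum(is_missing) / len(load)

# Lengths of consecutive runs of missing values, in order of appearance.
gap_lengths = [len(list(run)) for flag, run in groupby(is_missing) if flag]
longest_gap = max(gap_lengths, default=0)

print(f"missing: {sum(is_missing)}/{len(load)} ({missing_fraction:.0%})")
print(f"gap lengths: {gap_lengths}, longest gap: {longest_gap}")
```

Isolated gaps and long runs may have different causes and may call for different treatments, which is one reason to look at the pattern before choosing how to handle it.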

EDA as Hypothesis Generation

EDA helps us form modeling hypotheses:

EDA Observation    | Modeling Consequence
Seasonality        | Periodic terms as predictors
Nonlinearity       | GAMs, trees
Regimes            | Mixture models
Heteroscedasticity | Transformation, weighted models

From EDA to Modeling Choices

Example pipeline:

  1. EDA shows strong daily seasonality
  2. Hypothesis: periodic structure
  3. Feature engineering: hour-of-day
  4. Model: regression with seasonal terms

EDA reduces the model search space.
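
The first three steps of the example pipeline can be sketched as follows. The load series is synthetic (a pure daily sine cycle around a baseline), and encoding hour-of-day as a sine/cosine pair is one common choice of periodic terms, assumed here for illustration rather than prescribed by the pipeline above.

```python
import math
from collections import defaultdict

# Synthetic hourly load over 3 days: a daily cycle plus a constant baseline.
# (Illustrative values only.)
hours = list(range(72))
load = [50 + 10 * math.sin(2 * math.pi * (h % 24) / 24) for h in hours]

# Step 1 (EDA): average load by hour of day exposes the daily profile.
by_hour = defaultdict(list)
for h, value in zip(hours, load):
    by_hour[h % 24].append(value)
hourly_profile = {hod: sum(v) / len(v) for hod, v in sorted(by_hour.items())}
peak_hour = max(hourly_profile, key=hourly_profile.get)

# Step 3 (feature engineering): encode hour of day as periodic terms,
# so that hour 23 and hour 0 end up close together in feature space.
def hour_features(hour_of_day):
    angle = 2 * math.pi * hour_of_day / 24
    return math.sin(angle), math.cos(angle)

features = [hour_features(h % 24) for h in hours]

print(f"peak hour of day: {peak_hour}")
print(f"features at hour 0: {features[0]}, at hour 23: {features[23]}")
```

The resulting sine and cosine columns could then serve as predictors in a regression with seasonal terms (step 4), instead of searching over model families that ignore the periodic structure.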

Common EDA Pitfalls

Remember: A model cannot fix misunderstood data.

Key Takeaways

Better data understanding beats better algorithms.

EDA is the first step toward trustworthy statistical learning.