Exploratory Data Analysis

Why Do We Need EDA?

Most modeling failures are caused by poor data understanding, not poor algorithms.

EDA helps us:

Understand the data-generating process
Detect structure/dependencies, anomalies, and limitations
Decide what models make sense
Avoid costly mistakes (leakage, wrong assumptions)

Where EDA Fits in the Data Science Pipeline

Raw data → EDA → Feature engineering → Modeling → Evaluation

EDA is:

Iterative
Informal
Question-driven

You perform EDA before trusting any subsequent model output.

EDA is entirely driven by YOU.

What Is Exploratory Data Analysis?

Definition
Exploratory Data Analysis is the use of visualization, summary statistics, and simple transformations to understand the main characteristics of a dataset.

Key ideas:

Let the data speak
Prefer plots over metrics (early on)
Look for surprises, not confirmation

Personally, I always start with qualitative aspects and visualization and then work on quantitative aspects.

EDA vs Statistical Inference

EDA	Statistical Inference
Discovery	Confirmation
Flexible	Formal
Visual	Mathematical
No p-values	p-values

EDA often guides what inference is appropriate later.

First Look: Always Start Simple

Questions to ask immediately:

How big is the dataset? Does it fit in R and/or on my computer?
What is the format?
What are the variables and units?
What is the time resolution?
Are there missing values?
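
The first-look questions above can be answered with a few lines of code before any plotting. The sketch below is only an illustration in Python using the standard library: the column names (timestamp, load_mw, temperature_c) and the four rows are made up, and empty fields are assumed to encode missing values.

```python
import csv
import io

# Hypothetical hourly extract; empty fields stand for missing values.
RAW = """timestamp,load_mw,temperature_c
2024-01-01 00:00,1510,2.1
2024-01-01 01:00,1480,
2024-01-01 02:00,,1.7
2024-01-01 03:00,1465,1.5
"""

def first_look(text):
    """Return row count, column names, and missing-value counts per column."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    missing = {c: sum(1 for r in rows if r[c] in ("", None)) for c in columns}
    return {"n_rows": len(rows), "columns": columns, "missing": missing}

summary = first_look(RAW)
print(summary)
```

On a real file, the same function would be fed the file contents; the point is to know size, variables, and missingness before producing the first plot.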

Typical first plots:

Single scatterplot / Time series plot
Histograms, boxplots
Summary table

First: look at individual variables

Second: explore the multivariate aspects

Third: hidden and more complex structures
e.g. dependence on categories (such as regimes), time-varying properties/non-stationarity, …

Univariate EDA: One Variable at a Time

Goal: understand distribution and scale

Tools:

Single scatterplot / Time series plot

Histogram / density plot

Boxplot

Mean, median, variance, IQR

Questions:

Is the distribution symmetric? Skewed?

Is the distribution multi-modal?

Are there outliers? Heavy tails?

Is a transformation needed?
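
A minimal sketch of the numerical side of these tools, in Python with the standard library. The sample values are invented, and the 1.5 × IQR fences are the usual boxplot convention, used here as one possible way to flag candidate outliers rather than as a definitive rule.

```python
import statistics

def univariate_summary(values):
    """Location, spread, and boxplot-style outlier flags for one variable."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),
        "iqr": iqr,
        "outliers": [v for v in values if v < low or v > high],
    }

# Made-up, right-skewed sample of loads in MW (illustrative only).
load = [1200, 1250, 1300, 1320, 1350, 1400, 1450, 1500, 1600, 3200]
s = univariate_summary(load)
print(s)
```

A mean well above the median, as in this sample, hints at right skew or a heavy upper tail, which is one of the cases where a transformation may be worth trying.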

Example Dataset

Electricity Load + Weather (Hourly)

Variables include:

Electricity demand (MW)
Weather variables (temperature, wind speed, rainfall, …)
Calendar features (hour, weekday/weekend, day of the week, season)

Properties:

Multivariate
Time-dependent
Nonlinear relationships

Example: Electricity Load (Demand)

What are your thoughts here?

What we often observe:

Non-Gaussian distribution
Multiple modes (weekday vs weekend)
Long tails

Implication:

A simple linear-Gaussian model may be inappropriate.

Bivariate EDA: Relationships Between Variables

Goal: understand dependence between variables

Tools:

Scatter plots
Conditional boxplots
Correlation (Pearson vs Spearman)

Important warning:

Correlation can miss nonlinear relationships.

Example: Load vs Weather Variables

How would you model such relationships?

Example: Load vs Temperature

Clear pattern:

U-shaped relationship
Heating and cooling regimes

Key lesson:

Low correlation ≠ no relationship
Visualization reveals structure missed by metrics
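
The key lesson can be checked numerically. The sketch below uses a synthetic, perfectly U-shaped load curve (an assumption made for illustration, not the course dataset) with its minimum at 10 degrees Celsius. Both Pearson and Spearman correlations are essentially zero on the full range, while the cold branch alone is strongly correlated.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Average ranks (tied values share the mean of their positions)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Synthetic data: temperatures symmetric about 10 degrees Celsius,
# load is an exact quadratic (heating below 10, cooling above).
temperature = list(range(-10, 31))
load = [1000 + 5 * (t - 10) ** 2 for t in temperature]

r_pearson = pearson(temperature, load)
r_spearman = spearman(temperature, load)
r_cold = pearson(temperature[:21], load[:21])  # heating regime only
print(r_pearson, r_spearman, r_cold)
```

Load is a deterministic function of temperature here, yet both coefficients miss it; only the plot, or a split by regime, shows the structure.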

The relationship between demand and snowfall is even more complex:
it belongs to the class of regime dependence (wet/dry regimes).

Hidden Structure Through Conditioning

Some relationships cannot always be revealed by scatterplots alone,
due to their complexity and/or the nature of the data.

Condition on variables such as:

Hour of day
Day of the week
Season

Technique:

Color or facet plots by group

EDA reveals latent variables.
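
The numerical counterpart of coloring or faceting by group is a grouped summary. A minimal sketch, assuming hypothetical records of the form (day type, hour, load in MW) with invented values:

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical records: (day_type, hour, load in MW). Values are made up.
records = [
    ("weekday", 8, 1900), ("weekday", 8, 1950),
    ("weekday", 20, 2100), ("weekday", 20, 2050),
    ("weekend", 8, 1400), ("weekend", 8, 1450),
    ("weekend", 20, 1800), ("weekend", 20, 1750),
]

def conditional_means(rows, key):
    """Mean load within each group defined by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {g: fmean(v) for g, v in groups.items()}

by_day_type = conditional_means(records, key=lambda r: r[0])
by_day_and_hour = conditional_means(records, key=lambda r: (r[0], r[1]))
print(by_day_type)
print(by_day_and_hour)
```

Passing a different key function changes the conditioning variable, so the same helper covers hour of day, day of the week, season, or any combination of them.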

What would you like to see from the electricity demand?

How would you model such relationships?

These types of structures are typically hard to model.

Adding extra covariates does not solve the issue, because there are nonlinearities in the system.

We often need models that enable conditioning, so that parts of the model differ depending on a factor or a latent variable.

Examples: mixture models, state-space models, hierarchical models.

Multivariate EDA

Goal: explore structure across many variables

Tools:

Pair plots
Correlation heatmaps
Grouped summaries
PCA (for exploration, not prediction)

Ask:

Redundant variables?
Strong dependencies?
Clusters or regimes?
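
The "redundant variables?" question can be screened with pairwise correlations, which is what a correlation heatmap displays. The sketch below is an assumption-laden illustration: the three columns are invented, and the 0.95 threshold is an arbitrary choice. As noted earlier, correlation only captures linear dependence, so this is a screen and not a proof of redundancy.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up columns: temp_f is temp_c expressed in Fahrenheit, hence redundant.
data = {
    "temp_c": [0.0, 5.0, 10.0, 15.0, 20.0, 25.0],
    "temp_f": [32.0, 41.0, 50.0, 59.0, 68.0, 77.0],
    "wind_ms": [3.0, 7.0, 2.0, 6.0, 4.0, 5.0],
}

def redundant_pairs(columns, threshold=0.95):
    """List variable pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs

flagged = redundant_pairs(data)
print(flagged)
```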

Time Series–Specific EDA

Key questions:

Is the process stationary?
Are there trends or seasonality?
Are there regime changes?

Tools:

Time plots
Seasonal decomposition
Autocorrelation plots
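
An autocorrelation plot displays the sample autocorrelation at each lag. A minimal sketch of that quantity, assuming a synthetic hourly series with a pure 24-hour cycle (ten days of invented data, no noise):

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation at a given lag (biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return num / denom

# Synthetic hourly load with a daily cycle: 240 hours = 10 days.
load = [1500 + 300 * math.sin(2 * math.pi * h / 24) for h in range(240)]

acf_12 = autocorrelation(load, 12)  # half a day apart
acf_24 = autocorrelation(load, 24)  # one full day apart
print(acf_12, acf_24)
```

A strong positive value at lag 24 and a strong negative value at lag 12 are the signature of daily seasonality; on real data the peaks are lower and trends or regime changes distort the pattern.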

Data Quality Checks (Critical!)

Things to look for:

Missing data
Outliers
Duplicates
Inconsistent units
Data leakage (we will talk about that in future classes)

Rule of thumb:

Never clean data before understanding why it is dirty.
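
In line with this rule of thumb, the sketch below only flags problems and does not modify the data. The readings are hypothetical, and the plausible range passed to out_of_range is an assumption that would have to come from domain knowledge.

```python
from collections import Counter

# Hypothetical (timestamp, temperature) readings, nominally in Celsius.
readings = [
    ("2024-01-01 00:00", 2.1),
    ("2024-01-01 01:00", 1.8),
    ("2024-01-01 01:00", 1.8),   # repeated timestamp
    ("2024-01-01 02:00", 35.2),  # suspicious: possibly recorded in Fahrenheit
    ("2024-01-01 03:00", 1.5),
]

def duplicate_timestamps(rows):
    """Timestamps that appear more than once."""
    counts = Counter(ts for ts, _ in rows)
    return sorted(ts for ts, c in counts.items() if c > 1)

def out_of_range(rows, low, high):
    """Readings outside an assumed plausible range [low, high]."""
    return [(ts, v) for ts, v in rows if not low <= v <= high]

dups = duplicate_timestamps(readings)
suspects = out_of_range(readings, low=-20.0, high=15.0)
print(dups)
print(suspects)
```

Each flagged item is a question to investigate (logging glitch, unit change, sensor swap), not a row to delete.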

Missing Data: An EDA Perspective

Questions:

How much data is missing?
Is it random or structured?
Does missingness depend on time or values?

Visualization:

Missingness heatmaps
Time-ordered missing indicators

Treatment: case by case
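
Two of the questions above, how much is missing and whether it is structured in time, can be quantified with a small helper. This is a sketch on invented values, with None assumed to mark a missing reading; a long run of consecutive gaps suggests structured rather than random missingness.

```python
def missing_summary(values):
    """Fraction missing and longest run of consecutive missing values."""
    flags = [v is None for v in values]
    longest = current = 0
    for is_missing in flags:
        current = current + 1 if is_missing else 0
        longest = max(longest, current)
    return {"fraction": sum(flags) / len(flags), "longest_run": longest}

# Made-up hourly loads in MW; None marks a missing reading.
load = [1500, None, 1480, None, None, None, 1510, 1520, None, 1490]
info = missing_summary(load)
print(info)
```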

EDA as Hypothesis Generation

EDA helps us form modeling hypotheses:

EDA Observation	Modeling Consequence
Seasonality	Periodic terms as predictors
Nonlinearity	GAMs, trees
Regimes	Mixture models
Heteroscedasticity	Transformation, Weighted models

From EDA to Modeling Choices

Example pipeline:

EDA shows strong daily seasonality
Hypothesis: periodic structure
Feature engineering: hour-of-day
Model: regression with seasonal terms

EDA reduces the model search space.
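
One common way to turn hour-of-day into periodic terms is a sine/cosine pair. The sketch below is one possible encoding, not necessarily the one used in this course; the function name hour_features is hypothetical.

```python
import math

def hour_features(hour, period=24):
    """Map an hour to a point on the unit circle: (sin, cos)."""
    angle = 2 * math.pi * hour / period
    return (math.sin(angle), math.cos(angle))

# Distances between encoded hours.
d_23_0 = math.dist(hour_features(23), hour_features(0))
d_0_1 = math.dist(hour_features(0), hour_features(1))
d_0_12 = math.dist(hour_features(0), hour_features(12))
print(d_23_0, d_0_1, d_0_12)
```

With this encoding, 23:00 and 00:00 are as close as 00:00 and 01:00, whereas the raw hour value would place them 23 units apart. A regression on these two features can then represent a smooth daily cycle.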

Spatiotemporal Data

ADD space-time data somewhere

Common EDA Pitfalls

Over-interpreting noise
Cleaning too early
Ignoring data provenance
Relying only on summary statistics
Skipping visual checks

Remember: A model cannot fix misunderstood data.

Key Takeaways

EDA is not optional
Visualization is your strongest tool
Always question assumptions
Good EDA saves time and prevents failure

Better data understanding beats better algorithms.

EDA is the first step toward trustworthy statistical learning.

Exploratory Data Analysis

Statistical Learning - Week 2

Julie Bessac

2026
