Why Do We Need EDA?
Most modeling failures are caused by poor data understanding, not
poor algorithms.
EDA helps us:
- Understand the data-generating process
- Detect structure/dependencies, anomalies, and limitations
- Decide what models make sense
- Avoid costly mistakes (leakage, wrong assumptions)
Where EDA Fits in the Data Science Pipeline
Raw data → EDA → Feature engineering → Modeling → Evaluation
EDA is:
- Iterative
- Informal
- Question-driven
You perform EDA before trusting any subsequent model
output.
EDA is entirely driven by YOU.
What Is Exploratory Data Analysis?
Definition
Exploratory Data Analysis is the use of visualization, summary
statistics, and simple transformations to understand the main
characteristics of a dataset.
Key ideas:
- Let the data speak
- Prefer plots over metrics (early on)
- Look for surprises, not confirmation
Personally, I always start with qualitative aspects and visualization
and then work on quantitative aspects.
EDA vs Statistical Inference
| EDA | Statistical inference |
| --- | --- |
| Discovery | Confirmation |
| Flexible | Formal |
| Visual | Mathematical |
| No p-values | p-values |
EDA often guides what inference is appropriate
later.
First Look: Always Start Simple
Questions to ask immediately:
- How big is the dataset? Does it fit in R and/or on my computer?
- What is the format?
- What are the variables and units?
- What is the time resolution?
- Are there missing values?
Typical first plots:
- Single scatterplot / Time series plot
- Histograms, boxplots
- Summary table
First: look at individual variables
Second: explore the multivariate aspects
Third: look for hidden and more complex structures,
e.g. dependence on categories (such as regimes), time-varying
properties/non-stationarity, …
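The first-look questions above can be answered in a few lines. The following is a minimal sketch in Python with NumPy, run on synthetic data; the column names `load_mw` and `temp_c`, the injected gaps, and the helper `first_look` are illustrative assumptions, not part of any real dataset.

```python
import numpy as np

# Synthetic stand-in for an hourly dataset (assumption): two weeks of
# electricity load in MW and temperature in degrees Celsius.
rng = np.random.default_rng(0)
n_hours = 14 * 24
hours = np.arange(n_hours)
load_mw = 1000 + 200 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 30, n_hours)
temp_c = 10 + 5 * np.sin(2 * np.pi * (hours - 6) / 24) + rng.normal(0, 1, n_hours)
load_mw[[5, 50, 51]] = np.nan  # inject a few gaps for illustration

data = {"load_mw": load_mw, "temp_c": temp_c}


def first_look(columns):
    """Return size, missing count, and basic summaries for each column."""
    report = {}
    for name, values in columns.items():
        report[name] = {
            "n": values.size,
            "n_missing": int(np.isnan(values).sum()),
            "min": float(np.nanmin(values)),
            "median": float(np.nanmedian(values)),
            "max": float(np.nanmax(values)),
            "megabytes": values.nbytes / 1e6,  # rough memory footprint
        }
    return report


summary = first_look(data)
for name, stats in summary.items():
    print(name, stats)
```

The summary table is only a starting point; the plots listed above remain the main tool at this stage.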
Univariate EDA: One Variable at a Time
Goal: understand distribution and scale
Tools:
- Single scatterplot / Time series plot
- Histogram / density plot
- Boxplot
- Mean, median, variance, IQR
Questions:
- Is the distribution symmetric? Skewed?
- Is the distribution multi-modal?
- Are there outliers? Heavy tails?
- Is a transformation needed?
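As a rough illustration of the tools and questions above, the sketch below computes the listed summaries on a synthetic right-skewed sample and checks whether a log transformation reduces the skewness. The lognormal sample and the helper `describe` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic right-skewed sample (assumption, not real load data).
x = rng.lognormal(mean=0.0, sigma=0.8, size=5000)


def describe(values):
    """Mean, median, variance, IQR, and a simple moment-based skewness."""
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    mean = values.mean()
    std = values.std(ddof=1)
    skew = np.mean(((values - mean) / std) ** 3)
    return {
        "mean": float(mean),
        "median": float(med),
        "variance": float(std**2),
        "iqr": float(q3 - q1),
        "skewness": float(skew),
    }


raw = describe(x)
logged = describe(np.log(x))  # candidate transformation

# Histogram counts: the numbers behind a histogram plot.
counts, edges = np.histogram(x, bins=20)

print("raw   :", raw)
print("logged:", logged)
```

A mean well above the median and a large positive skewness suggest a right-skewed distribution; if the log-transformed sample looks close to symmetric, the transformation is worth considering.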
Example Dataset
Electricity Load + Weather (Hourly)
Variables include:
- Electricity demand (MW)
- Weather variables (temperature, wind speed, rainfall, …)
- Calendar features (hour, weekday, day of the week, season)
Properties:
- Multivariate
- Time-dependent
- Nonlinear relationships
Example: Electricity Load (Demand)
What are your thoughts
here?
What we often observe:
- Non-Gaussian distribution
- Multiple modes (weekday vs weekend)
- Long tails
Implication:
A simple linear-Gaussian model may be inappropriate.
Bivariate EDA: Relationships Between Variables
Goal: understand dependence between variables
Tools:
- Scatter plots
- Conditional boxplots
- Correlation (Pearson vs Spearman)
Important warning:
Correlation can miss nonlinear relationships.
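The warning can be checked numerically. The sketch below compares Pearson and Spearman correlation on two synthetic relationships: one monotone but nonlinear, one symmetric and non-monotone. The rank-based `spearman` helper is a simplified version that assumes no ties in the data.

```python
import numpy as np


def pearson(a, b):
    """Pearson (linear) correlation coefficient."""
    return float(np.corrcoef(a, b)[0, 1])


def spearman(a, b):
    """Spearman correlation: Pearson on ranks (valid here because there are no ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return pearson(rank_a, rank_b)


rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 2000)
monotone = np.exp(2 * x)                        # nonlinear but monotone
symmetric = x**2 + rng.normal(0, 0.1, x.size)   # nonlinear and non-monotone

print("monotone : pearson", pearson(x, monotone), "spearman", spearman(x, monotone))
print("symmetric: pearson", pearson(x, symmetric), "spearman", spearman(x, symmetric))
```

Spearman recovers the monotone relationship that Pearson understates, but both coefficients stay near zero for the symmetric case even though the dependence is strong.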
Example: Load vs Weather Variables
How would you model such relationships?

Example: Load vs Temperature
Clear pattern:
- U-shaped relationship
- Heating and cooling regimes
Key lesson:
- Low correlation ≠ no relationship
- Visualization reveals structure missed by metrics
The relationship between demand and snowfall is even more complex:
it is a case of regime dependence (wet/dry regimes).
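The key lesson can be reproduced on synthetic data. In the sketch below, the U-shaped load curve, the balance temperature of 15 °C, and the noise level are all assumptions chosen for illustration; the point is that the correlation is close to zero while binned means show the structure clearly.

```python
import numpy as np

rng = np.random.default_rng(3)
temp_c = rng.uniform(-5, 35, 3000)
# Assumed U-shaped response around a 15 degree balance point (illustrative only).
load_mw = 800 + 1.5 * (temp_c - 15.0) ** 2 + rng.normal(0, 40, temp_c.size)

# Pearson correlation between temperature and load.
r = float(np.corrcoef(temp_c, load_mw)[0, 1])

# Conditional means of load within five equal-width temperature bins.
edges = np.linspace(-5, 35, 6)
bin_index = np.digitize(temp_c, edges[1:-1])  # values 0..4
bin_means = np.array(
    [load_mw[bin_index == k].mean() for k in range(edges.size - 1)]
)

print("correlation:", r)
for k, m in enumerate(bin_means):
    print(f"{edges[k]:5.1f} to {edges[k + 1]:5.1f} degC: mean load {m:7.1f} MW")
```

The binned means are high in the cold and hot bins and low in the middle one, which is the heating/cooling pattern a scatterplot would show.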
Hidden Structure Through Conditioning
Some relationships cannot be revealed by scatterplots alone,
because of their complexity and/or the nature of the data.
Condition on:
- Hour of day
- Day of the week
- Season
Technique:
- Color or facet plots by group
EDA reveals latent variables.
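Numerically, conditioning amounts to computing summaries within groups. The sketch below builds a mean daily load profile separately for weekdays and weekends on synthetic data; the evening peak, the weekend scaling factor, and the assumption that the series starts on a Monday are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
hours = np.arange(8 * 7 * 24)           # eight weeks of hourly data
hour_of_day = hours % 24
day_of_week = (hours // 24) % 7         # assumes the series starts on a Monday
is_weekend = day_of_week >= 5

# Assumed evening peak that is half as strong on weekends (illustrative only).
peak = 250 * np.exp(-(((hour_of_day - 18) / 3) ** 2))
load_mw = 900 + peak * np.where(is_weekend, 0.5, 1.0) + rng.normal(0, 20, hours.size)

# Mean load conditioned on (weekend flag, hour of day): shape (2, 24).
profile = np.array(
    [
        [load_mw[(is_weekend == w) & (hour_of_day == h)].mean() for h in range(24)]
        for w in (False, True)
    ]
)

print("weekday profile peak:", profile[0].max())
print("weekend profile peak:", profile[1].max())
```

Plotting the two rows of `profile` as separate lines is the numerical counterpart of coloring or faceting a plot by group.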
What would you like to see from the electricity demand?
How would you model such relationships?

Hidden Structure Through Conditioning
These types of structures are typically hard to model.
Adding extra covariates does not solve the issue,
because there are nonlinearities in the system.
Often we need models that enable conditioning,
so that parts of the model differ depending on a factor or a latent
variable.
Examples are mixture models, state-space models, and hierarchical
models.
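To make the idea of conditioning concrete, the sketch below compares a single pooled linear fit with separate linear fits per regime. It is a deliberately simple stand-in, not a mixture or state-space model: the regime is assumed to be observed (heating vs cooling, split at an assumed balance temperature of 15 °C), and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(6)
temp_c = rng.uniform(-5, 35, 2000)
balance_c = 15.0  # assumed balance temperature separating the two regimes
load_mw = 800 + 20 * np.abs(temp_c - balance_c) + rng.normal(0, 15, temp_c.size)


def rmse(residuals):
    """Root mean squared error of a residual vector."""
    return float(np.sqrt(np.mean(residuals**2)))


# Pooled model: one straight line for all observations.
slope, intercept = np.polyfit(temp_c, load_mw, deg=1)
rmse_pooled = rmse(load_mw - (slope * temp_c + intercept))

# Regime-dependent model: one straight line per regime.
is_heating = temp_c < balance_c
fitted = np.empty_like(load_mw)
for regime in (True, False):
    mask = is_heating == regime
    s, i = np.polyfit(temp_c[mask], load_mw[mask], deg=1)
    fitted[mask] = s * temp_c[mask] + i
rmse_regime = rmse(load_mw - fitted)

print("pooled RMSE:", rmse_pooled)
print("per-regime RMSE:", rmse_regime)
```

When the regime is not observed, the split itself has to be inferred, which is where the mixture and state-space models mentioned above come in.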
Multivariate EDA
Goal: explore structure across many variables
Tools:
- Pair plots
- Correlation heatmaps
- Grouped summaries
- PCA (for exploration, not prediction)
Ask:
- Redundant variables?
- Strong dependencies?
- Clusters or regimes?
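Two of the tools above, the correlation matrix and PCA, can be sketched with NumPy alone. The data are synthetic, and `feels_like_c` is a hypothetical variable added only to create a redundant column; PCA is computed here through an SVD of the standardized data.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
temp_c = rng.normal(10, 8, n)
feels_like_c = temp_c + rng.normal(0, 1, n)        # hypothetical, nearly redundant with temp_c
wind_ms = rng.gamma(shape=2.0, scale=2.0, size=n)  # independent of the other two
X = np.column_stack([temp_c, feels_like_c, wind_ms])

# Correlation matrix: the numbers behind a correlation heatmap.
corr = np.corrcoef(X, rowvar=False)

# PCA for exploration: standardize, then take the SVD.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
_, singular_values, components = np.linalg.svd(Z, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)

print(np.round(corr, 2))
print("explained variance ratio:", np.round(explained, 3))
```

A near-zero share of variance on the last component is a sign that one variable is close to a linear combination of the others, which answers the redundancy question.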
Time Series–Specific EDA
Key questions:
- Is the process stationary?
- Are there trends or seasonality?
- Are there regime changes?
Tools:
- Time plots
- Seasonal decomposition
- Autocorrelation plots
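The values behind an autocorrelation plot can be computed directly. The sketch below uses a common sample estimator on a synthetic hourly series with a daily cycle; the series and the helper `acf` are assumptions made for the example.

```python
import numpy as np


def acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (normalized by the lag-0 sum)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.dot(centered, centered)
    return np.array(
        [np.dot(centered[: x.size - k], centered[k:]) / denom for k in range(max_lag + 1)]
    )


rng = np.random.default_rng(8)
hours = np.arange(30 * 24)
# Synthetic hourly series with a 24-hour cycle plus noise (illustrative only).
series = 200 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 50, hours.size)

rho = acf(series, max_lag=48)
print("lag 12:", rho[12])
print("lag 24:", rho[24])
```

A strong positive value at lag 24 and a strong negative value at lag 12 point to daily seasonality, which in turn argues against treating the raw series as stationary.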
Data Quality Checks (Critical!)
Things to look for:
- Missing data
- Outliers
- Duplicates
- Inconsistent units
- Data leakage (we will talk about that in future classes)
Rule of thumb:
Never clean data before understanding why it is dirty.
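In line with the rule of thumb, the sketch below only flags suspicious records and does not modify them. The tiny dataset is made up: it contains one duplicated timestamp and one value that was (by assumption) recorded in kW instead of MW.

```python
import numpy as np

# Made-up records: hourly timestamps and load in MW.
timestamps = np.array([0, 1, 2, 3, 3, 4, 5, 6, 7, 8])
load_mw = np.array(
    [950.0, 970.0, 990.0, 1010.0, 1010.0, 1005.0, 985000.0, 960.0, 940.0, 955.0]
)  # index 6 is an injected unit error (kW instead of MW)

# Duplicates: timestamps that occur more than once.
unique_ts, counts = np.unique(timestamps, return_counts=True)
duplicated_ts = unique_ts[counts > 1]

# Outliers: the usual 1.5 * IQR rule, used here only to flag values.
q1, q3 = np.percentile(load_mw, [25, 75])
iqr = q3 - q1
outlier_mask = (load_mw < q1 - 1.5 * iqr) | (load_mw > q3 + 1.5 * iqr)
flagged = np.flatnonzero(outlier_mask)

# A ratio close to 1000 relative to the median hints at a kW/MW mix-up.
ratio_to_median = load_mw[flagged] / np.median(load_mw)

print("duplicated timestamps:", duplicated_ts)
print("flagged indices:", flagged)
print("ratio to median:", ratio_to_median)
```

The ratio is the useful part: it suggests why the value is dirty (a unit inconsistency) before any decision about correcting or dropping it.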
Missing Data: An EDA Perspective
Questions:
- How much data is missing?
- Is it random or structured?
- Does missingness depend on time or values?
Visualization:
- Missingness heatmaps
- Time-ordered missing indicators
Treatment: case by case
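The questions above translate into a few simple computations. In the sketch below the data are synthetic, with two injected patterns: the same hour missing every day (structured) and one contiguous outage. The helper `longest_gap` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n_days = 10
hours = np.arange(n_days * 24)
hour_of_day = hours % 24
load_mw = 1000 + rng.normal(0, 50, hours.size)
load_mw[hour_of_day == 3] = np.nan   # structured: hour 3 is missing every day
load_mw[110:115] = np.nan            # one five-hour outage

is_missing = np.isnan(load_mw)       # time-ordered missing indicator

# How much is missing overall, and does it depend on the hour of day?
missing_fraction = float(is_missing.mean())
missing_by_hour = np.array(
    [is_missing[hour_of_day == h].mean() for h in range(24)]
)


def longest_gap(mask):
    """Length of the longest run of consecutive missing values."""
    longest = current = 0
    for m in mask:
        current = current + 1 if m else 0
        longest = max(longest, current)
    return longest


print("missing fraction:", missing_fraction)
print("hours with most missing data:", np.argsort(missing_by_hour)[::-1][:3])
print("longest gap (hours):", longest_gap(is_missing))
```

A missing rate concentrated in one hour of the day indicates structured missingness, which usually calls for a different treatment than isolated random gaps.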
EDA as Hypothesis Generation
EDA helps us form modeling hypotheses:
| EDA finding | Modeling hypothesis |
| --- | --- |
| Seasonality | Periodic terms as predictors |
| Nonlinearity | GAMs, trees |
| Regimes | Mixture models |
| Heteroscedasticity | Transformations, weighted models |
From EDA to Modeling Choices
Example pipeline:
- EDA shows strong daily seasonality
- Hypothesis: periodic structure
- Feature engineering: hour-of-day
- Model: regression with seasonal terms
EDA reduces the model search space.
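The example pipeline can be sketched end to end on synthetic data. Here the hour-of-day feature is encoded as one sine/cosine pair and the model is an ordinary least-squares regression; the single harmonic, the coefficients used to generate the data, and the noise level are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(9)
hours = np.arange(21 * 24)           # three weeks of hourly data
hour_of_day = hours % 24

# Synthetic load with a daily cycle (illustrative only).
angle = 2 * np.pi * hour_of_day / 24
load_mw = 1000 + 150 * np.sin(angle) + 80 * np.cos(angle) + rng.normal(0, 30, hours.size)

# Feature engineering: periodic terms built from hour of day.
X = np.column_stack([np.ones(hours.size), np.sin(angle), np.cos(angle)])

# Model: least-squares regression with seasonal terms.
coef, _, _, _ = np.linalg.lstsq(X, load_mw, rcond=None)
fitted = X @ coef


def rmse(residuals):
    """Root mean squared error of a residual vector."""
    return float(np.sqrt(np.mean(residuals**2)))


rmse_constant = rmse(load_mw - load_mw.mean())   # baseline: mean only
rmse_seasonal = rmse(load_mw - fitted)

print("coefficients:", np.round(coef, 1))
print("RMSE mean-only:", rmse_constant, "RMSE seasonal:", rmse_seasonal)
```

Comparing against the mean-only baseline shows how much of the variation the seasonal hypothesis explains before trying anything more complex.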
Common EDA Pitfalls
- Over-interpreting noise
- Cleaning too early
- Ignoring data provenance
- Relying only on summary statistics
- Skipping visual checks
Remember: A model cannot fix misunderstood data.
Key Takeaways
- EDA is not optional
- Visualization is your strongest tool
- Always question assumptions
- Good EDA saves time and prevents failure
Better data understanding beats better algorithms.
EDA is the first step toward trustworthy statistical
learning.