Objective: Reflect on the process of building a predictive model, connecting your practical work on the data to the statistical concepts and methods covered in class. In doing so, you should demonstrate your understanding of the entire modeling strategy (problem identification and statement, data analysis, model selection, diagnostics, and validation), as well as the challenges and trade-offs involved in predictive modeling.
Problem Identification → EDA → Feature Engineering → Model Fitting → Diagnostics → Validation → Iteration.
In this class, we have only talked about supervised learning. Unsupervised learning will be covered in 5526.
In supervised learning: always start by looking at your target variable: are you solving a regression or a classification problem?
What is the end goal? Prediction or understanding a phenomenon via potential predictors? Both?
Have you been given guidance or constraints on quantities of interest or features of the target response to focus on?
EDA is an essential first step; you should explore the following, in order:
Are you dealing with a large dataset? Lots of data points? Lots of predictors? What is the ratio between the two?
The target variable: if continuous, is it skewed? Does it have heavy tails? Outliers? Does it require a transformation? If categorical, is it balanced?
The predictors: explore their individual distributions and key features, just as for the target variable.
Bivariate interaction: target variable vs. each predictor (scatterplot if the predictor is continuous, boxplot if it is categorical, summary statistics when both target and predictor are categorical) \(\rightarrow\) helps understand the association between each predictor and the target.
Multivariate interaction: scatterplot of the target with multiple predictors \(\rightarrow\) helps detect collinearity in the predictor space.
Hidden structures with conditioning: for instance, the effect of a categorical predictor on a multivariate interaction.
Always use qualitative and quantitative tools.
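The target-variable checks above (skewness, tails, need for a transformation) can be sketched quantitatively. A minimal example on synthetic data, using `scipy.stats.skew` as one such quantitative tool (the data and threshold are illustrative, not from the course):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Synthetic right-skewed positive target (e.g., something count- or income-like)
y = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

print(f"skewness before transform: {skew(y):.2f}")

# A log transform often symmetrizes right-skewed positive data
y_log = np.log(y)
print(f"skewness after log transform: {skew(y_log):.2f}")
```

A histogram or QQ-plot of `y` and `y_log` would be the qualitative counterpart of the same check.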
EDA is critical to:
- Identify non-linear relationships (e.g., bike counts peak at 8 AM and
6 PM, and relate non-monotonically to temperature),
- Detect outliers (e.g., zero counts during extreme weather) and missing
data (e.g., sensor failures),
- Guide feature selection: for example, humidity showed weak correlation
with bike counts in scatterplots, so we deprioritized it.
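The feature-screening idea in the last bullet (deprioritizing humidity because of its weak correlation with counts) can be mimicked on synthetic data; the variable names echo the bike example but the data below is simulated for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
temp = rng.normal(20, 5, n)
humidity = rng.uniform(30, 90, n)
# Simulated counts driven by temperature only; humidity is irrelevant by construction
counts = 10 + 4 * temp + rng.normal(0, 10, n)

df = pd.DataFrame({"temp": temp, "humidity": humidity, "counts": counts})
corr_with_target = df.corr()["counts"]
print(corr_with_target.round(2))
# Predictors with correlation near zero are candidates to deprioritize,
# keeping in mind that correlation only captures linear association.
```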
Do you need a regression or a classification model?
If regression: is your target continuous? Is it a count? Does it follow a specific distribution? You may need a GLM. For classification, you may use a GLM or a tree-based method.
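For a count-valued target, a Poisson GLM with a log link is a natural first choice. A minimal sketch on simulated counts, using scikit-learn's `PoissonRegressor` (one possible implementation; the simulated coefficients are illustrative):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(2)
n = 1000
x = rng.uniform(0, 1, (n, 1))
# Counts with a log-linear mean: E[y | x] = exp(0.5 + 2.0 * x)
y = rng.poisson(np.exp(0.5 + 2.0 * x[:, 0]))

# alpha=0.0 disables regularization; the default link for Poisson is log
model = PoissonRegressor(alpha=0.0).fit(x, y)
print(f"estimated slope: {model.coef_[0]:.2f}, intercept: {model.intercept_:.2f}")
```

The fitted coefficients should be close to the true values (2.0 and 0.5) used to simulate the data.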
Is your problem so non-linear that it requires a tree-based method, or can you afford a linear regression model?
Does your model have high variance (as trees typically do)? Do you need bagging to reduce the variance?
Do you need to boost your model to improve its predictive skill? Remember that boosting is typically performed with trees fitted to pseudo-residuals, but it can be applied to any model with a differentiable loss function.
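The variance-reduction effect of bagging and the skill gain from boosting can both be seen empirically by cross-validating a single tree against a random forest (bagged trees) and gradient boosting on the same data. A sketch using a standard synthetic benchmark (`make_friedman1`; the sample size and settings are illustrative):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=3)

models = {
    "single tree": DecisionTreeRegressor(random_state=3),
    "bagged trees (random forest)": RandomForestRegressor(n_estimators=200, random_state=3),
    "boosted trees": GradientBoostingRegressor(random_state=3),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    results[name] = scores.mean()
    print(f"{name}: mean R^2 = {results[name]:.2f}")
```

On this kind of smooth non-linear problem, the ensembles typically outperform the single high-variance tree.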
When you build and select a model, keep in mind its parsimony and interpretability as well as its computational fitting cost.
Hyperparameter tuning is typically done in a cross-validation setting.
We will use cross-validation to compare several models quantitatively (with metrics \(R^2\), MSE, MAE, correlation, percentage of coverage, accuracy,…).
Create plots to evaluate your model qualitatively on a test set: prediction vs. truth scatterplots, their distributions, correlation, percentage of coverage, etc.
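The quantitative side of this evaluation (metrics such as \(R^2\), MSE, MAE, and the prediction–truth correlation) can be computed on a held-out test set. A sketch on simulated linear data (all settings illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.5, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=4)
pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)

r2 = r2_score(y_te, pred)
corr = np.corrcoef(y_te, pred)[0, 1]
print(f"R^2:  {r2:.2f}")
print(f"MSE:  {mean_squared_error(y_te, pred):.2f}")
print(f"MAE:  {mean_absolute_error(y_te, pred):.2f}")
print(f"corr: {corr:.2f}")
```

A scatterplot of `pred` against `y_te` would be the qualitative counterpart of these numbers.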
Make sure to report as many sources of uncertainty as possible: data, model, and estimation.
1. You can use confidence intervals or bootstrap to quantify data
uncertainties,
2. You can use Hessian information or bootstrapping to report estimation
uncertainty, i.e., the variance-covariance of the estimated parameters.
Remember that when dealing with categorical outputs, it is hard to create
meaningful plots. See examples in the Homework 4 correction using the package
explore.
When something goes wrong, try to understand whether the issue comes from your data, your model, or both. Always change one thing at a time.