Reflections and Roadmap on Predictive Modeling

Objective: Reflect on the process of building a predictive model, connecting your practical work on the data to the statistical concepts and methods covered in class. In practice, you should demonstrate your understanding of the entire modeling strategy (problem identification and statement, data analysis, model selection, diagnostics, and validation), as well as the challenges and trade-offs involved in predictive modeling.

Order of Operations

Problem Identification → EDA → Feature Engineering → Model Fitting → Diagnostics → Validation → Iteration.

1. Problem Identification and Statement

In this class, we have only talked about supervised learning. Unsupervised learning will be covered in 5526.

In supervised learning, always start by looking at your target variable: are you solving a regression or a classification problem?

What is the end goal? Prediction or understanding a phenomenon via potential predictors? Both?

Have you been given guidance/constraints on quantities of interest or features of the target response to focus on?

2. Exploratory Data Analysis (EDA) and Feature Engineering

EDA is an essential first step; you should explore the following, in order:

  1. Are you dealing with a large dataset? Many data points? Many predictors? What is the ratio between the two?

  2. The target variable: if continuous, is it skewed? Does it have heavy tails? Outliers? Does it require a transformation? If categorical, is it balanced?

  3. The predictors: explore their individual distributions and key features, as you did for the target variable.

  4. Bivariate interaction: target variable vs. each predictor (scatterplot for a continuous predictor, boxplot for a categorical predictor, summary statistics when both target and predictor are categorical) \(\rightarrow\) helps understand the association between predictor and target.

  5. Multivariate interaction: scatterplot of the target with multiple predictors \(\rightarrow\) helps detect collinearity in the predictor space.

  6. Hidden structures with conditioning: for instance, the effect of a categorical predictor on a multivariate interaction.

Always use both qualitative and quantitative tools.
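As a concrete starting point, here is a minimal EDA sketch in Python (pandas + matplotlib) walking through the steps above; the file name and the column names (`count`, `temp`, `season`) are hypothetical, not from an assigned dataset.

```python
# A minimal EDA sketch; "bikes.csv" and the column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("bikes.csv")  # hypothetical dataset

# 1. Size: number of data points vs. number of predictors.
n, p = df.shape
print(f"{n} observations, {p - 1} predictors (ratio {n / (p - 1):.1f})")

# 2. Target: skewness, tails, outliers.
print(df["count"].describe())
print("skewness:", df["count"].skew())
df["count"].hist(bins=50)

# 4. Bivariate: scatterplot for a continuous predictor,
#    boxplot for a categorical one.
df.plot.scatter(x="temp", y="count")
df.boxplot(column="count", by="season")

# 5. Multivariate: pairwise views to detect collinearity among predictors.
print(df.corr(numeric_only=True))
pd.plotting.scatter_matrix(df.select_dtypes("number"), figsize=(8, 8))
plt.show()
```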

EDA is critical to:
- Identify non-linear relationships (e.g., bike counts peak at 8 AM and 6 PM over the day, and vary non-linearly with temperature),
- Detect outliers (e.g., zero counts during extreme weather) and missing data (e.g., sensor failures),
- Guide feature selection: for example, humidity showed weak correlation with bike counts in scatterplots, so we deprioritized it.

3. Model Selection

Do you need a regression or a classification model?

If regression: is your target continuous? Is it a count, or does it follow a specific distribution? You may need a GLM. In classification, you may use a GLM (e.g., logistic regression) or a tree-based method.
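For instance, a count target suggests a Poisson GLM. A minimal sketch with statsmodels, assuming the hypothetical DataFrame `df` and columns from the EDA sketch above:

```python
# A hedged sketch: Poisson GLM for a count target via statsmodels.
# `df`, `count`, `temp`, and `humidity` are hypothetical names.
import statsmodels.api as sm
import statsmodels.formula.api as smf

poisson_fit = smf.glm("count ~ temp + humidity", data=df,
                      family=sm.families.Poisson()).fit()
print(poisson_fit.summary())
```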

Is your problem so highly non-linear that it requires a tree-based method, or can you afford a linear regression model?

Does your model have high variance (as trees typically do)? Do you need bagging to reduce the variance?

Do you need to boost your model to improve its predictive skill? Remember that boosting is typically performed with trees fitted to pseudo-residuals, but it can be applied to any model with a differentiable loss function; see the sketch below.
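To make the pseudo-residual idea concrete, here is a minimal sketch of gradient boosting with squared-error loss. For the loss \( \frac{1}{2}(y - f)^2 \), the negative gradient is simply the ordinary residual \( y - f \), so each tree is fitted to the current residuals. The function names are illustrative, not a library API.

```python
# A minimal gradient-boosting sketch for squared-error loss: each tree is
# fitted to the pseudo-residuals, which here are the ordinary residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit(X, y, n_stages=100, learning_rate=0.1, max_depth=2):
    pred = np.full(len(y), y.mean())      # stage 0: constant model
    trees = []
    for _ in range(n_stages):
        residuals = y - pred              # pseudo-residuals (negative gradient)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

def boost_predict(X, f0, trees, learning_rate=0.1):
    # Sum the shrunken contributions of all stages on top of the constant.
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```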

When you build and select a model, keep in mind its parsimony and interpretability as well as its computational fitting cost.

4. Model Hypertuning

Hypertuning is typically done in a cross-validation setting.

4.1 ANOVA, Feature Selection and Regularization for Linear Regression and GLM

  • In multiple linear regression, it is useful to downselect predictors to avoid overfitting and improve interpretability. ANOVA or feature selection can be used.
  • Alternatively, you can penalize the loss function of your model (mainly used for regression and GLM) to force parameters to zero (LASSO), lower their variance (Ridge), or both (ElasticNet); see the sketch below.
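A hedged sketch of penalized regression with scikit-learn, where the penalty strength is chosen by cross-validation; the column names are hypothetical, as in the earlier sketches.

```python
# Penalized regression sketch: predictors are standardized first so the
# penalty treats them comparably. Column names are hypothetical.
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = df[["temp", "humidity", "windspeed"]], df["count"]
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(X, y)
print(lasso.named_steps["lassocv"].coef_)  # LASSO forces some coefficients to zero
# RidgeCV or ElasticNetCV can be swapped in for the Ridge / ElasticNet penalties.
```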

4.2 Hypertuning of tree-based methods

  • Note that feature selection is not performed per se in tree-based methods; the regularized analogue is a pruned tree.
  • Trees: depth, number of leaves, number of observations per leaf or node (these are often set automatically in most software).
  • Bagging: number of trees/members.
  • Random Forests: number of trees in the forest, number of features to consider at each split (see the tuning sketch below).
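A sketch of cross-validated tuning of the main random-forest hyperparameters listed above, reusing the hypothetical `X, y` from the regularization sketch.

```python
# Grid search over the random-forest hyperparameters named in the list above.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rf_grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={
        "n_estimators": [100, 300],      # number of trees in the forest
        "max_features": [1.0, "sqrt"],   # features considered at each split
        "min_samples_leaf": [1, 5, 20],  # observations per leaf
    },
    cv=5,
    scoring="neg_mean_squared_error",
).fit(X, y)
print(rf_grid.best_params_)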

4.3 Hypertuning of tree-boosted methods

  • Boosted models: number of boosting stages (trees) and learning_rate (which shrinks the contribution of each tree); see the sketch below.
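These two hyperparameters interact (smaller learning rates typically need more stages), so it is worth tuning them jointly, as in this sketch (hypothetical `X, y` as before):

```python
# Joint grid search over boosting stages and learning rate.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

gb_grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [100, 500], "learning_rate": [0.01, 0.1]},
    cv=5,
).fit(X, y)
print(gb_grid.best_params_)
```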

5. Cross-Validation and Model Evaluation

We will use cross-validation to compare several models quantitatively (with metrics such as \(R^2\), MSE, MAE, correlation, coverage percentage, accuracy, …).
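For example, a sketch of a cross-validated comparison of two candidate models on \(R^2\) and MSE (hypothetical `X, y` as above):

```python
# Compare candidate models with cross-validation on two metrics.
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

candidates = {"linear": LinearRegression(),
              "forest": RandomForestRegressor(random_state=0)}
for name, model in candidates.items():
    cv = cross_validate(model, X, y, cv=5,
                        scoring=("r2", "neg_mean_squared_error"))
    print(f"{name}: R2={cv['test_r2'].mean():.3f}, "
          f"MSE={-cv['test_neg_mean_squared_error'].mean():.1f}")
```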

Create plots to evaluate your model qualitatively on a testing set: prediction vs. truth scatterplots, their distributions, correlation, coverage percentage, etc.
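A sketch of the prediction-vs-truth scatterplot, assuming `y_test` and `y_pred` come from your fitted model and held-out test set:

```python
# Prediction vs. truth on a held-out test set; y_test and y_pred are assumed.
import matplotlib.pyplot as plt
import numpy as np

plt.scatter(y_test, y_pred, alpha=0.5)
lims = [min(np.min(y_test), np.min(y_pred)),
        max(np.max(y_test), np.max(y_pred))]
plt.plot(lims, lims, "k--")  # 45-degree line: perfect predictions fall on it
plt.xlabel("truth")
plt.ylabel("prediction")
plt.show()
```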

Make sure to report as many sources of uncertainty as possible: data, model, and estimation.
1. You can use confidence intervals or the bootstrap to quantify data uncertainty,
2. You can use Hessian information or bootstrapping to report estimation uncertainty, i.e., the variance-covariance matrix of the estimated parameters (see the bootstrap sketch below).
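A minimal bootstrap sketch for estimation uncertainty: refit the model on resampled data and report the spread of the estimated parameters (hypothetical pandas `X, y` as in the earlier sketches).

```python
# Bootstrap the parameter estimates by refitting on resampled rows.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
coefs = []
for _ in range(500):
    idx = rng.integers(0, len(y), size=len(y))  # resample rows with replacement
    fit = LinearRegression().fit(X.iloc[idx], y.iloc[idx])
    coefs.append(fit.coef_)
print(np.std(coefs, axis=0))  # bootstrap standard errors of the coefficients
```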

Remember that when dealing with categorical outputs, it is hard to create meaningful plots. See examples in the Homework 4 correction using the explore package.

6. Challenges

When something goes wrong, try to understand whether the issue comes from your data, your model, or both. Always try one thing at a time.