The goal of this module is for you to
- develop a vocabulary about data, analytics, statistical and machine learning
- understand the “big data” evolution toward data-driven systems
- define and distinguish types of learning, types of insights, and types of responses
- distinguish the approaches to data analysis, data mining, machine learning, and statistical learning
- understand the principles and goals of supervised, unsupervised, and reinforcement learning
Data: information about the world
Analytics: the process of extracting insight from data
Decisioning: choosing an action after consideration
In recent decades this relationship has changed fundamentally:
The amount of available data has exploded
Our understanding of data has evolved
Our capabilities to perform analytics have evolved
Data and analytics have become the foundation for decisions
Data + Analytics = Decision
Types
Then : Single stream/source of structured data
Now : Rich data types, unstructured data (text, documents, emails, records), audio, video, images, sensor data, ticker data, click streams, geo-spatial data, ...
Approach
Then : Define the problem, then collect the data
Now : Examine what the data can tell you
Motion
Then : Data at rest
Now : Streaming data, data at rest, data in motion (cloud-data center, multi-cloud, distributed data)
Origin
Then : Observational studies or designed experiments
Now : Machine-generated data, large-scale data collections
Size
Then : KB, MB to GB range, up to 10s of features
Now : GB - TB - PB table sizes, 1000s of features
Others?
What is Big Data? The term Big Data appeared in the
2000s to describe a new class of applications and uses of data.
The term is unfortunate in that the size of data sets was only one of the things that had changed.
What had changed? Why a new term?
The Now of data evolution was here
Beginning of data-driven applications
Greater automation of data and model lifecycle
What is Big Data? Big data refers to datasets that are too large, complex, or fast-growing to be effectively captured, stored, managed, or analyzed using traditional data-processing tools.
What makes data “big”?
It cannot fit in memory on a single machine
Processing it takes too long with standard methods
You need distributed storage and computation (e.g., clusters, cloud
computing)
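Out-of-core techniques illustrate the first point: data that cannot fit in memory can still be processed in pieces. Below is a minimal sketch, assuming a hypothetical large CSV file `transactions.csv` with a numeric `amount` column; truly big workloads would typically move to a distributed engine instead.

```python
# A minimal sketch of out-of-core processing: compute a mean over a file
# that is too large to load at once. File name and column are hypothetical.
import pandas as pd

total = 0.0
n_rows = 0
# Read the file in chunks so it never has to fit in memory all at once.
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()
    n_rows += len(chunk)

print(f"mean amount over {n_rows} rows: {total / n_rows:.2f}")
```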
Examples
Climate and weather model ensembles
High-resolution turbulence simulations
Genomics data
Web traffic and user behavior logs
IoT and sensor networks
It’s not just about size—big data is usually described by the “5 Vs”:
Volume: Massive amounts of data (terabytes,
petabytes, or more)
Example: satellite observations, climate model outputs, social media
logs
Velocity: Data generated and collected at high
speed
Example: sensor streams, financial transactions, real-time monitoring
systems
Variety: Different data types and formats
Structured: tables, databases
Semi-structured: JSON, XML
Unstructured: text, images, video, simulation outputs
Veracity: Data quality and uncertainty
Noise, missing values, biases, measurement errors
Value: The potential to extract useful
information or insight
Big data is only valuable if it can be turned into knowledge or
decisions
A data-driven system (process or application) is one where the data defines how the system operates. The system has been trained on data and applies implicitly or explicitly derived algorithms to new data in order to provide intelligence and support actions.
Self-driving vehicles
Robotic process automation (RPA)
Ride-sharing
Social media sites
Streaming media
Fraud (anomaly) detection
Generative AI (ChatGPT)
Recommendation systems
Internet of Things (IoT)
Precision agriculture
e-commerce
Example of data-driven system: Churn model
Losing a customer is known by different names depending on the industry (attrition, churn, turnover, defection, ...). As a company interacts with its customers, it needs to determine whether to offer incentives (loyalty points, discounts, perks, etc.) in order to retain them.
What information would you like to collect, what would you like
to learn from these data, what would be next?
What elements in this process are data,
analytics, and decision?
What would this look like in a data-driven
process?
What role is played by a churn model such as \[\Pr(Y=\mbox{churn}|\mbox{customer data}) = f(\mbox{customer data})\]
A churn model is not just a model; it is a component of a decision system.
Pipeline view
- Data collection (logs, demographics, billing)
- Feature engineering
- Model training & validation
- Prediction at scale
- Decision/action layer
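A minimal code sketch of this pipeline, assuming hypothetical column names (`tenure`, `monthly_charges`, `support_calls`, `churn`) and a logistic-regression churn model; a production system would add many more features, careful validation, and a decision layer tied to business rules.

```python
# A minimal sketch of the churn pipeline, not a production implementation.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("customers.csv")                        # data collection
X = df[["tenure", "monthly_charges", "support_calls"]]   # feature engineering
y = df["churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)                              # training & validation
print("test accuracy:", model.score(X_test, y_test))

p_churn = model.predict_proba(X_test)[:, 1]              # prediction at scale
offer_incentive = p_churn > 0.5                          # decision/action layer
```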
Example actions driven by churn predictions
- Targeted retention offers depending on customer profile
- Personalized messaging
- Pricing incentives
- Customer success outreach
Churn modeling illustrates key principles of data-driven systems:
Feedback loops
- Interventions affect future churn
- Models must adapt over time
Concept drift
- Customer behavior changes
- Products and markets evolve
Cost-sensitive learning
- False negatives (missed churners) are expensive
- False positives waste incentives
Interpretability vs performance
- Business users often require explanations
- Tradeoff between black-box models and trust
Example of data-driven system: Self-driving cars
Discuss the differences in how data are used in a late-model car and in
a (nearly) autonomous vehicle.
Late-model (computerized) vs. autonomous (data-driven system)
The term Data Analysis was introduced by John W.
Tukey in 1962.
In 1977, Tukey published the seminal book “Exploratory Data Analysis”
that defined the field of EDA.
Tukey’s approach was revolutionary. In particular, Tukey contrasted the Data Analysis approach with statistical modeling.
Modeling
identify a probability-based model that serves as an abstraction
of the data-generating mechanism
observed data are seen as one realization of this random
mechanism
find functions of the data that provide good estimates of the
unknown model parameters
calculate the parameter estimates and use them to test hypotheses and/or make predictions
Data Analysis
identify a particular function of the data
make progress by asking what the function might be reasonably estimating
Various sources of data with different characteristics, errors, and uncertainties: measurements, machine-generated data, large-scale data collections, ... Data are a proxy for the underlying true process.
Computer simulations: e.g. physics-based model outputs (physically consistent; regularly available; but subject to modeling choices and approximations)
Observations, measurements: no modeling assumptions but measurement errors, irregularly sampled and sparse; e.g. weather stations, satellite data, floating buoys, citizen science, …
Examples of numerical weather prediction models (low and high resolutions, Weather Research and Forecasting model, NCAR) and satellite data coverage (Natural Resources Canada)
Nature of data: continuous quantities (precipitation), categorical data (weather regimes, product types (phone, laptop, tablet)), discrete data (counts), directional data (wind direction, movement data), ...
Left: Regimes (shaded intervals) in a time series; Right: Windrose of wind speed and direction
Amount of data: very large amounts (computations, analysis, visualization) or too little data (robustness, representativeness), ...
Complex patterns and dependencies: multiple scales, multiple variables, rare events, ...
Propagation of a wind storm
Uncertainty and errors: many sources of uncertainty and error in the data (measurement error, simulation model error, resolutions, ...)
\(\rightarrow\) need to adapt analysis, models and techniques to these challenges
Causal: one event is the result of the
occurrence of the other event, also referred to as cause and
effect.
Example: \(A \Rightarrow B\). Smoking
causes an increase in the risk of lung disease
Correlation
Dependence: variables are related to each other
without causing each other
Example: The weight of mammals is correlated with their height
Example: Smoking correlates with alcoholism but does not cause
it
Spurious: the relationship is due to latent
(confounding) or mediating variables.
Example: \(A \Rightarrow B, A \Rightarrow
C\). \(B\) and \(C\) appear correlated. Warmer temperature
increases ice cream sales and trips to the beach: ice cream sales and
shark attacks are positively correlated.
Example: \(A \rightarrow B \rightarrow
C\). \(A\) is related to \(B\) and \(B\) is related to \(C\). If you observe only \(A\) and \(C\) they appear correlated. Parental
education level (\(B\)) is the mediator
between socioeconomic status (\(A\))
and child reading ability (\(C\))
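A small simulation can make the confounding example concrete: temperature (\(A\)) drives both ice cream sales (\(B\)) and shark attacks (\(C\)), so \(B\) and \(C\) appear correlated. All numbers below are synthetic and chosen only for illustration.

```python
# Confounding in a nutshell: A drives B and C, so B and C correlate.
import numpy as np

rng = np.random.default_rng(0)
temperature = rng.normal(25, 5, size=1000)                       # A
ice_cream_sales = 10 * temperature + rng.normal(0, 20, 1000)     # B = f(A) + noise
shark_attacks = 0.3 * temperature + rng.normal(0, 2, 1000)       # C = g(A) + noise

# Clearly positive, even though neither B nor C causes the other.
print(np.corrcoef(ice_cream_sales, shark_attacks)[0, 1])
```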
Example of spurious correlation
Source: Statology
Check this site for many examples of spurious temporal correlations.
Tukey defined the boxplot as a way to visualize the
distribution of data. These box plots are computed from passenger data
of the Titanic.
Which questions and hypotheses do they answer/suggest?
Titanic data
Which questions or hypotheses do these histograms answer/suggest?
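As a sketch of how such plots can be produced, the snippet below draws boxplots from the copy of the Titanic passenger data that ships with seaborn (which may differ slightly from the data behind the figures above).

```python
# Tukey-style boxplots of passenger age by class and survival status.
import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset("titanic")
sns.boxplot(data=titanic, x="pclass", y="age", hue="survived")
plt.title("Passenger age by class and survival")
plt.show()
```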
Motivations:
- provide & quantify uncertainty (data, prediction, model/process, ...)
- understand the structure of data: correlation, variable importance, extremes, ...
- overcome data limitations: conditional emulation, prediction, fusion, surrogate, ...
Approach:
\(\rightarrow\) Reproduce target
quantities of interest
e.g. probabilistic distribution, time series dynamics, spatial
correlation, ...
\(\rightarrow\) Build parametric
structures to describe relationships, distributions, covariances,
...
Wind speed probability distribution
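As an illustration of the parametric approach, the sketch below fits a Weibull distribution, a common choice for wind speed, to synthetic wind-speed data; a real analysis would start from observed or simulated speeds.

```python
# Fit a parametric (Weibull) distribution to wind speeds; data are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
wind_speed = rng.weibull(2.0, size=5000) * 8.0   # synthetic wind speeds (m/s)

# Fit a Weibull distribution with the location parameter fixed at zero.
shape, loc, scale = stats.weibull_min.fit(wind_speed, floc=0)
print(f"shape={shape:.2f}, scale={scale:.2f}")
```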
In Tukey’s approach we recognize the beginnings of EDA, Data Mining, and Machine Learning.
Our understanding of Data Analytics is broad and encompassing. It includes elements of Statistical Modeling, Statistical Learning, Machine Learning, Data Mining, etc.
Sometimes we will be led more by the data, sometimes we will be led more by a model that captures the salient features of the data.
Classical Statistics: tests hypotheses and makes predictions based on a probability-based model that explains the data
Statistical Learning: introduced by Hastie & Tibshirani.
Very much model-based, but with a focus on prediction rather than hypothesis testing.
More recently, statistical models are also used for simulation and as surrogates.
Data Mining: searches large datasets for trends and patterns using computer automation.
Machine Learning: programs computers (machines) so that they will learn.
Types of Insights
Descriptive: What has happened?
Predictive: What will happen?
Classification: Which category does this belong to?
Clustering: Which groups/categories exist?
Prescriptive: What should I do?
Online vs Batch
Online: the learner responds during the training process, in real time
Batch: data are at rest and analyzed as a set
Supervised:
Data are “labeled”, meaning each piece of input data is provided with the correct output.
Algorithms learn to map input data to a specific output based on example input-output pairs.
The goal of supervised learning is for the trained model to accurately predict the output for new, unseen data. This requires the algorithm to generalize effectively.
Commonly used for tasks like classification (predicting a category, e.g., spam or not spam) and regression (predicting a continuous value, e.g., house prices).
Unsupervised:
Data are not “labeled”; the model identifies patterns or structures in unlabeled data (no input-output pairs).
Examples of unsupervised techniques: clustering algorithms like k-means, dimensionality reduction techniques like principal component analysis (PCA)
Continuous: The number of possible values is not
countable
(tire pressure, temperature, length, weight, ...)
Discrete: The number of possible values is
countable
Statistical Learning is the process of understanding
data through the application of tools that describe structure and
relationships in data.
Models are formulated based on the structure of data to
predict outputs from inputs, to group
data, or to reduce the dimensionality of a problem.
SL combines mathematical statistics and statistical modeling with greater emphasis on predicting rather than testing hypotheses
SL emphasizes models and their interpretability, precision, and uncertainty.
SL distinguishes supervised and unsupervised learning, depending on whether the values (the labels) of the response variable are known.
Supervised Learning
The response variable (output, target, dependent) is observed along with a number of input variables (features, independent variables)
The goal is to predict or classify the response variable and to
quantify the uncertainty in parameter estimates and predictions.
Models (examples)
Linear models
Nonlinear models
Nonparametric models
Generalized linear models (GLM), generalized additive models (GAM)
Discriminant analysis
Mixed Models
Tree-based models
Neural networks
...
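As a minimal illustration of the statistical-learning emphasis on uncertainty, the sketch below fits the simplest of the listed models, a linear model, to synthetic data and reports confidence intervals for the parameter estimates.

```python
# Fit a linear model and quantify uncertainty in the parameter estimates.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, size=200)   # true intercept 1.5, slope 0.8

X = sm.add_constant(x)                 # design matrix: intercept + slope
fit = sm.OLS(y, X).fit()
print(fit.params)                      # point estimates
print(fit.conf_int(alpha=0.05))        # 95% confidence intervals
```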
Data mining is the process of searching large data sets for trends and patterns using computer automation. As data sets grow, the type of exploratory discovery in the sense of Tukey can no longer be done manually.
Data mining relies on compute power and automation to find patterns in data at scale.
DM focuses on patterns discovered from data rather than model theory and prediction.
Data mining is about discovering useful, non-obvious patterns or structures in data, often without a predefined response variable.
Critics of data mining refer to it as “data dredging” or “data fishing”: looking for patterns and relationships that are meaningless and then forming hypotheses about why the patterns exist.
Example 1: Customer segmentation (clustering)
Unsupervised, very common in industry
Data:
Age
Income
Spending score
Number of purchases
Data mining task: group customers into segments with similar characteristics and behavior
Techniques:
K-means
Hierarchical clustering
What is mined: customer segments (groups of similar customers) that were not defined in advance
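A minimal sketch of this segmentation task, assuming a hypothetical `customers.csv` table containing the four features above (here named `age`, `income`, `spending_score`, `n_purchases`); the choice of four clusters is arbitrary and for illustration only.

```python
# K-means customer segmentation sketch; file and column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

customers = pd.read_csv("customers.csv")
features = customers[["age", "income", "spending_score", "n_purchases"]]

X = StandardScaler().fit_transform(features)        # put features on one scale
segments = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)

# Profile the discovered segments by their average feature values.
print(features.assign(segment=segments).groupby("segment").mean())
```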
Example 2: Anomaly (fraud) detection
Pattern discovery through rarity
Data:
Credit card transactions
Electricity consumption
Network traffic
Data mining task: identify observations (e.g., transactions) that deviate strongly from typical patterns
Techniques:
Distance-based outliers
Isolation Forest
Simple z-score rules
Easy example:
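The sketch below applies the simple z-score rule to synthetic transaction amounts with a few injected anomalies; the threshold of |z| > 4 is an arbitrary illustrative choice.

```python
# Z-score anomaly detection on synthetic transaction amounts.
import numpy as np

rng = np.random.default_rng(3)
amounts = rng.normal(50, 10, size=10_000)        # typical transactions
amounts[:5] = [500, 620, 480, 700, 550]          # a few injected anomalies

z = (amounts - amounts.mean()) / amounts.std()
anomalies = np.where(np.abs(z) > 4)[0]           # flag extreme deviations
print(anomalies)                                 # indices of flagged transactions
```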
Learning is the process of converting experience
into knowledge.
Machine Learning is an automated way of learning
through the use of computers.
Rather than directly programming computers to perform a task, ML is used
when the tasks are not easily described and communicated (driving,
reading, image recognition) or when the tasks exceed human capability
(analyzing large and complex data sets).
ML is a branch of artificial intelligence (AI), as it uses computers to turn experience into expertise (knowledge)
Origin of modern ML: Computer Science discovered data as a source for computational considerations
Like Data Mining, ML uses automation to extract insight from large data sets. The focus of automation in developing ML algorithms is to predict, classify, or make recommendations. Data Mining tends to be more descriptive than predictive.
Supervised Learning
The training examples contain significant information (“labels”, i.e. outputs, for the target variable; the “ground truth”) that might not be present in the test examples.
The goal is to predict or classify the missing information in the test data. We want to learn the outputs from the inputs.
We can think of the environment as a teacher that “supervises” the learner by providing information about the truth.
Algorithms (examples)
Linear regression (simple & multiple)
Regularized regression (Ridge & Lasso)
Logistic regression
Support-vector machines
Gradient boosting
Decision trees, Random Forests
Neural networks (CNN, RNN, LSTM, ...)
Transformer (NLP, CV)
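As a minimal sketch of one of the listed algorithms, the snippet below trains a random forest classifier on a small built-in dataset; a real application would add feature engineering, tuning, and more careful validation.

```python
# Random forest classification on a built-in dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```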
Unsupervised Learning
There is no target variable we wish to predict or classify. We still want to learn relationships and structure from the data.
Clustering data into similar groups or reducing the dimension of the data are typical uses for unsupervised learning.
Algorithms (examples)
Clustering
K-nearest neighbors
Hierarchical clustering
K-means clustering
Association rules learning
Dimensionality reduction
Principal Component Analysis (PCA)
Matrix Factorization
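As a minimal sketch of dimensionality reduction, the snippet below uses PCA to project the 64-dimensional digits dataset onto two components.

```python
# PCA: project 64-dimensional digit images onto two principal components.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                        # (1797, 2)
print(pca.explained_variance_ratio_)     # share of variance retained
```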
Reinforcement Learning
An agent (player) is taking actions (moves) in an environment (game)
The agent learns by interacting with the environment and receiving feedback
Actions are judged by a reward function (score), and the system is trained to maximize the sum of all future rewards
Unlike supervised learning, input and output (target) do not need to be present
Commonly used in robotics, game playing, and recommendation systems.
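A minimal sketch of the agent/environment/reward loop: an epsilon-greedy agent learns which of three actions yields the highest average reward. The reward probabilities are made up for illustration, and this toy bandit omits the states and long-term planning of full reinforcement learning.

```python
# Epsilon-greedy bandit: learn action values from reward feedback alone.
import numpy as np

rng = np.random.default_rng(4)
true_reward_prob = [0.2, 0.5, 0.8]   # environment (unknown to the agent)
q = np.zeros(3)                      # estimated value of each action
counts = np.zeros(3)
epsilon = 0.1

for step in range(10_000):
    if rng.random() < epsilon:       # explore occasionally
        action = rng.integers(3)
    else:                            # otherwise exploit the current best estimate
        action = int(np.argmax(q))
    reward = float(rng.random() < true_reward_prob[action])   # feedback
    counts[action] += 1
    q[action] += (reward - q[action]) / counts[action]        # running mean update

print(q)   # estimates approach the true reward probabilities
```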
Common
the input to a learning algorithm is training data, representing “experience”. The raw material for SL and ML is the same: data.
the data are thought of as randomly generated
both disciplines distinguish supervised and unsupervised learning, ML also has reinforcement learning
SL and ML use many of the same models and algorithms for regression, classification, clustering, dimension reduction
Differences
ML uses the observed data to come up with a description of relationships and “causes”. The emphasis is on prediction over explanation (e.g., hypothesis testing)
ML emphasizes large scale applications and prediction
accuracy.
SL emphasizes models, interpretability, precision and
uncertainty.
SL is concerned with the asymptotic behavior of parameter estimates, e.g. as \(n \rightarrow \infty\).
ML focuses on finite sample bounds: what is the degree
of accuracy expected on the basis of available samples.
ML worries more about algorithms than SL. ML develops algorithms for learning tasks and considers their computational efficiency.
SL often works from assumptions of the data model (independence,
equi-variance, linearity of effects).
ML assumes as little as possible about the distribution of the
data.
SL is more model oriented: what models are
available to predict the expected value of a binary random
variable?
ML is more task oriented: given this set of data with a
binary target variable, which methods are available to predict or
classify the target?
Combine the best of both worlds!
Statisticians:
- worry a bit more about algorithms and computational efficiency
- worry a bit more about generalization of models to new data
Machine Learners:
- quantify uncertainty (precision) and accuracy for given samples
Artificial Intelligence (AI) is the effort to build systems that can perform tasks or make decisions a human could make. AI draws on many disciplines, e.g., robotics, process automation, biology, neuroscience, cognitive science, computer science.
The current form of AI–which is much more successful than attempts in
previous decades–differs from past approaches in its reliance on
data.
Data-driven AI uses analytic techniques to develop algorithms that can
perform human tasks.
Its success was enabled by the confluence of three factors:
Advances in neural network technology (CNN, RNN, Generative networks, transformers, ...)
Large data sets made it possible to design deep neural networks (deep learning)
Cloud computing and GPUs made computing resources accessible to train these deep networks