The goal of this module is for you to
- develop a vocabulary about data, analytics, statistical and machine learning
- understand the “big data” evolution toward data-driven systems
- define and distinguish types of learning, types of insights, and types of responses
- distinguish the approaches to data analysis, data mining, machine learning, and statistical learning
- understand the principles and goals of supervised, unsupervised, and reinforcement learning
Data: information about the world
Analytics: the process of extracting insight from data
Decisioning: choosing an action after consideration
In recent decades this relationship has changed fundamentally:
The amount of available data has exploded
Our understanding of data has evolved
Our capabilities to perform analytics have evolved
Data and analytics have become the foundation for decisions
Data + Analytics = Decision
Types
Then : Single stream/source of structured data
Now : Rich data types, unstructured data (text, documents, emails, records), audio, video, images, sensor data, ticker data, click streams, geo-spatial data, ...
Approach
Then : Define the problem, then collect the data
Now : Examine what the data can tell you
Motion
Then : Data at rest
Now : Streaming data, data at rest, data in motion (cloud-data center, multi-cloud, distributed data)
Origin
Then : Observational studies or designed experiments
Now : Machine-generated data, large-scale data collections
Size
Then : KB, MB to GB range, up to 10s of features
Now : GB - TB - PB table sizes, 1000s of features
Others?
What is Big Data? The term Big Data appeared in the
2000s to describe a new class of applications and uses of data.
The term is unfortunate in that the size of data sets was only one of the things that had changed.
What had changed? Why a new term?
The Now of data evolution was here
Beginning of data-driven applications
Greater automation of data and model lifecycle
What is Big Data? Big data refers to datasets that are too large, complex, or fast-growing to be effectively captured, stored, managed, or analyzed using traditional data-processing tools.
What makes data “big”?
It cannot fit in memory on a single machine
Processing it takes too long with standard methods
You need distributed storage and computation (e.g., clusters, cloud
computing)
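Out-of-core techniques illustrate the first point: data that cannot fit in memory can still be processed in pieces. Below is a minimal sketch, assuming a hypothetical large CSV file `transactions.csv` with a numeric `amount` column; truly big workloads would typically move to a distributed engine instead.

```python
# A minimal sketch of out-of-core processing: compute a mean over a file
# that is too large to load at once. File name and column are hypothetical.
import pandas as pd

total = 0.0
n_rows = 0
# Read the file in chunks so it never has to fit in memory all at once.
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()
    n_rows += len(chunk)

print(f"mean amount over {n_rows} rows: {total / n_rows:.2f}")
```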
Examples
Climate and weather model ensembles
High-resolution turbulence simulations
Genomics data
Web traffic and user behavior logs
IoT and sensor networks
It’s not just about size—big data is usually described by the “5 Vs”:
Volume: Massive amounts of data (terabytes,
petabytes, or more)
Example: satellite observations, climate model outputs, social media
logs
Velocity: Data generated and collected at high
speed
Example: sensor streams, financial transactions, real-time monitoring
systems
Variety: Different data types and formats
Structured: tables, databases
Semi-structured: JSON, XML
Unstructured: text, images, video, simulation outputs
Veracity: Data quality and uncertainty
Noise, missing values, biases, measurement errors
Value: The potential to extract useful
information or insight
Big data is only valuable if it can be turned into knowledge or
decisions
A data-driven system (process or application) is one where the data defines how the system operates. The system has been trained on data and applies implicitly or explicitly derived algorithms to new data in order to provide intelligence and support actions.
Self-driving vehicles
Robotic process automation (RPA)
Ride-sharing
Social media sites
Streaming media
Fraud (anomaly) detection
Generative AI (ChatGPT)
Recommendation systems
Internet of Things (IoT)
Precision agriculture
e-commerce
Example of data-driven system: Churn model
Losing a customer is known by different names depending on the industry (attrition, churn, turnover, defection, ...). As a company interacts with its customers, it needs to determine whether to offer incentives (loyalty points, discounts, perks, etc.) in order to retain them.
What information would you like to collect, what would you like
to learn from these data, what would be next?
What elements in this process are data,
analytics, and decision?
What would this look like in a data-driven
process?
What role is played by a churn model such as \[\Pr(Y=\mbox{churn}|\mbox{customer data}) = f(\mbox{customer data})\]
A churn model is not just a model; it is a component of a decision system.
Pipeline view
- Data collection (logs, demographics, billing)
- Feature engineering
- Model training & validation
- Prediction at scale
- Decision/action layer
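A minimal code sketch of this pipeline, assuming hypothetical column names (`tenure`, `monthly_charges`, `support_calls`, `churn`) and a logistic-regression churn model; a production system would add many more features, careful validation, and a decision layer tied to business rules.

```python
# A minimal sketch of the churn pipeline, not a production implementation.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("customers.csv")                        # data collection
X = df[["tenure", "monthly_charges", "support_calls"]]   # feature engineering
y = df["churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)                              # training & validation
print("test accuracy:", model.score(X_test, y_test))

p_churn = model.predict_proba(X_test)[:, 1]              # prediction at scale
offer_incentive = p_churn > 0.5                          # decision/action layer
```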
Example actions driven by churn predictions
- Targeted retention offers depending on customer profile
- Personalized messaging
- Pricing incentives
- Customer success outreach
Churn modeling illustrates key principles of data-driven systems:
Feedback loops
- Interventions affect future churn
- Models must adapt over time
Concept drift
- Customer behavior changes
- Products and markets evolve
Cost-sensitive learning
- False negatives (missed churners) are expensive
- False positives waste incentives
Interpretability vs performance
- Business users often require explanations
- Tradeoff between black-box models and trust
Example of data-driven system: Self-driving cars
Discuss the differences in how data are used in a late-model car and in
a (nearly) autonomous vehicle.
Late-model (computerized) vs. autonomous (data-driven system)
The term Data Analysis was introduced by John W.
Tukey in 1962.
In 1977, Tukey published the seminal book “Exploratory Data Analysis”
that defined the field of EDA.
Tukey’s approach was revolutionary. In particular, Tukey contrasted the Data Analysis approach with statistical modeling.
Modeling
identify a probability-based model that serves as an abstraction
of the data-generating mechanism
observed data are seen as one realization of this random
mechanism
find functions of the data that provide good estimates of the
unknown model parameters
calculate the parameter estimates and use them to test hypotheses and/or make predictions
Data Analysis
identify a particular function of the data
make progress by asking what the function might be reasonably estimating
Various sources of data with different characteristics, errors, and uncertainties: measurements, machine-generated data, large-scale data collections, ... Data are a proxy for the underlying true process.
Computer simulations: e.g. physics-based model outputs (physically consistent; regularly available; but subject to modeling choices and approximations)
Observations, measurements: no modeling assumptions but measurement errors, irregularly sampled and sparse; e.g. weather stations, satellite data, floating buoys, citizen science, …
Examples of numerical weather prediction models (low and high resolutions, Weather Research and Forecasting model, NCAR) and satellite data coverage (Natural Resources Canada)
Nature of data: continuous quantities (precipitation), categorical data (weather regimes, product types (phone, laptop, tablet)), discrete data (counts), directional data (wind direction, movement data), ...
Left: Regimes (shaded intervals) in a time series; Right: Windrose of wind speed and direction
Amount of data: very large amounts (computations, analysis, visualization) or too little data (robustness, representativeness), ...
Complex patterns and dependencies: multiple scales, multiple variables, rare events, ...
Propagation of a wind storm
Uncertainty and errors: many sources of uncertainty and error in the data (measurement error, simulation model error, resolutions, ...)
\(\rightarrow\) need to adapt analysis, models and techniques to these challenges
Causal: one event is the result of the
occurrence of the other event, also referred to as cause and
effect.
Example: \(A \Rightarrow B\). Smoking
causes an increase in the risk of lung disease
Correlation
Dependence: variables are related to each other
without causing each other
Example: The weight of mammals is correlated with their height
Example: Smoking correlates with alcoholism but does not cause
it
Spurious: the relationship is due to latent
(confounding) or mediating variables.
Example: \(A \Rightarrow B, A \Rightarrow
C\). \(B\) and \(C\) appear correlated. Warmer temperature
increases ice cream sales and trips to the beach: ice cream sales and
shark attacks are positively correlated.
Example: \(A \rightarrow B \rightarrow
C\). \(A\) is related to \(B\) and \(B\) is related to \(C\). If you observe only \(A\) and \(C\) they appear correlated. Parental
education level (\(B\)) is the mediator
between socioeconomic status (\(A\))
and child reading ability (\(C\))
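A small simulation can make the confounding example concrete: temperature (\(A\)) drives both ice cream sales (\(B\)) and shark attacks (\(C\)), so \(B\) and \(C\) appear correlated. All numbers below are synthetic and chosen only for illustration.

```python
# Confounding in a nutshell: A drives B and C, so B and C correlate.
import numpy as np

rng = np.random.default_rng(0)
temperature = rng.normal(25, 5, size=1000)                       # A
ice_cream_sales = 10 * temperature + rng.normal(0, 20, 1000)     # B = f(A) + noise
shark_attacks = 0.3 * temperature + rng.normal(0, 2, 1000)       # C = g(A) + noise

# Clearly positive, even though neither B nor C causes the other.
print(np.corrcoef(ice_cream_sales, shark_attacks)[0, 1])
```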
Example of spurious correlation
Source: Statology
Check this site for many examples of spurious temporal correlations.
Tukey defined the boxplot as a way to visualize the
distribution of data. These box plots are computed from passenger data
of the Titanic.
Which questions and hypotheses do they answer/suggest?
Titanic data
Which questions or hypotheses do these histograms answer/suggest?
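As a sketch of how such plots can be produced, the snippet below draws boxplots from the copy of the Titanic passenger data that ships with seaborn (which may differ slightly from the data behind the figures above).

```python
# Tukey-style boxplots of passenger age by class and survival status.
import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset("titanic")
sns.boxplot(data=titanic, x="pclass", y="age", hue="survived")
plt.title("Passenger age by class and survival")
plt.show()
```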
Motivations:
- provide & quantify uncertainty (data, prediction, model/process, ...)
- understand the structure of data: correlation, variable importance, extremes, ...
- overcome data limitations: conditional emulation, prediction, fusion, surrogate, ...
Approach:
\(\rightarrow\) Reproduce target
quantities of interest
e.g. probabilistic distribution, time series dynamics, spatial
correlation, ...
\(\rightarrow\) Build parametric
structures to describe relationships, distributions, covariances,
...
Wind speed probability distribution
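As an illustration of the parametric approach, the sketch below fits a Weibull distribution, a common choice for wind speed, to synthetic wind-speed data; a real analysis would start from observed or simulated speeds.

```python
# Fit a parametric (Weibull) distribution to wind speeds; data are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
wind_speed = rng.weibull(2.0, size=5000) * 8.0   # synthetic wind speeds (m/s)

# Fit a Weibull distribution with the location parameter fixed at zero.
shape, loc, scale = stats.weibull_min.fit(wind_speed, floc=0)
print(f"shape={shape:.2f}, scale={scale:.2f}")
```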
In Tukey’s approach we recognize the beginnings of EDA, Data Mining, and Machine Learning.
Our understanding of Data Analytics is broad and encompassing. It includes elements of Statistical Modeling, Statistical Learning, Machine Learning, Data Mining, etc.
Sometimes we will be led more by the data, sometimes we will be led more by a model that captures the salient features of the data.
Classical Statistics: tests hypotheses and makes predictions based on a probability-based model that explains the data
Statistical Learning: introduced by Hastie & Tibshirani.
Very much model-based, but with a focus on prediction rather than hypothesis testing.
More recently, statistical models are also used for simulation and as surrogates.
Data Mining: searches large datasets for trends and patterns using computer automation.
Machine Learning: programs computers (machines) so that they will learn.
Types of Insights
Descriptive: What has happened?
Predictive: What will happen?
Classification: Which category does this belong to?
Clustering: Which groups/categories exist?
Prescriptive: What should I do?
Online vs Batch
Online: the learner responds during the training process, in real time
Batch: data are at rest and analyzed as a set
Supervised:
Data are “labeled”, meaning each piece of input data is provided with the correct output.
Algorithms learn to map input data to a specific output based on example input-output pairs.
The goal of supervised learning is for the trained model to accurately predict the output for new, unseen data. This requires the algorithm to generalize effectively.
Commonly used for tasks like classification (predicting a category, e.g., spam or not spam) and regression (predicting a continuous value, e.g., house prices).
Unsupervised:
Data are not “labeled”; the model identifies patterns or structures in unlabeled data (no input-output pairs).
Examples of unsupervised techniques: clustering algorithms like k-means, dimensionality reduction techniques like principal component analysis (PCA)
Continuous: The number of possible values is not
countable
(tire pressure, temperature, length, weight, ...)
Discrete: The number of possible values is
countable
Statistical Learning is the process of understanding
data through the application of tools that describe structure and
relationships in data.
Models are formulated based on the structure of data to
predict outputs from inputs, to group
data, or to reduce the dimensionality of a problem.
SL combines mathematical statistics and statistical modeling with greater emphasis on predicting rather than testing hypotheses
SL emphasizes models and their interpretability, precision, and uncertainty.
SL distinguishes supervised and unsupervised learning, depending on whether the values (the labels) of the response variable are known.
Supervised Learning
The response variable (output, target, dependent) is observed along with a number of input variables (features, independent variables)
The goal is to predict or classify the response variable and to
quantify the uncertainty in parameter estimates and predictions.
Models (examples)
Linear models
Nonlinear models
Nonparametric models
Generalized linear models (GLM), generalized additive models (GAM)
Discriminant analysis
Mixed Models
Tree-based models
Neural networks
...
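As a minimal illustration of the statistical-learning emphasis on uncertainty, the sketch below fits the simplest of the listed models, a linear model, to synthetic data and reports confidence intervals for the parameter estimates.

```python
# Fit a linear model and quantify uncertainty in the parameter estimates.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, size=200)   # true intercept 1.5, slope 0.8

X = sm.add_constant(x)                 # design matrix: intercept + slope
fit = sm.OLS(y, X).fit()
print(fit.params)                      # point estimates
print(fit.conf_int(alpha=0.05))        # 95% confidence intervals
```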
Data mining is the process of searching large data sets for trends and patterns using computer automation. As data sets grow, the type of exploratory discovery in the sense of Tukey can no longer be done manually.
Data mining relies on compute power and automation to find patterns in data at scale.
DM focuses on patterns discovered from data rather than model theory and prediction.
Data mining is about discovering useful, non-obvious patterns or structures in data, often without a predefined response variable.
Critics of data mining refer to it as “data dredging” or “data fishing”: looking for patterns and relationships that are meaningless and then forming hypotheses about why the patterns exist.
Example 1: Customer segmentation (clustering)
Unsupervised, very common in industry
Data:
Age
Income
Spending score
Number of purchases
Data mining task: group customers into segments with similar characteristics and behavior
Techniques:
K-means
Hierarchical clustering
What is mined: customer segments (groups of similar customers) that were not defined in advance
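A minimal sketch of this segmentation task, assuming a hypothetical `customers.csv` table containing the four features above (here named `age`, `income`, `spending_score`, `n_purchases`); the choice of four clusters is arbitrary and for illustration only.

```python
# K-means customer segmentation sketch; file and column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

customers = pd.read_csv("customers.csv")
features = customers[["age", "income", "spending_score", "n_purchases"]]

X = StandardScaler().fit_transform(features)        # put features on one scale
segments = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)

# Profile the discovered segments by their average feature values.
print(features.assign(segment=segments).groupby("segment").mean())
```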
Example 2: Anomaly (fraud) detection
Pattern discovery through rarity
Data:
Credit card transactions
Electricity consumption
Network traffic
Data mining task: identify observations (e.g., transactions) that deviate strongly from typical patterns
Techniques:
Distance-based outliers
Isolation Forest
Simple z-score rules
Easy example:
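The sketch below applies the simple z-score rule to synthetic transaction amounts with a few injected anomalies; the threshold of |z| > 4 is an arbitrary illustrative choice.

```python
# Z-score anomaly detection on synthetic transaction amounts.
import numpy as np

rng = np.random.default_rng(3)
amounts = rng.normal(50, 10, size=10_000)        # typical transactions
amounts[:5] = [500, 620, 480, 700, 550]          # a few injected anomalies

z = (amounts - amounts.mean()) / amounts.std()
anomalies = np.where(np.abs(z) > 4)[0]           # flag extreme deviations
print(anomalies)                                 # indices of flagged transactions
```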
Learning is the process of converting experience
into knowledge.
Machine Learning is an automated way of learning
through the use of computers.
Rather than directly programming computers to perform a task, ML is used
when the tasks are not easily described and communicated (driving,
reading, image recognition) or when the tasks exceed human capability
(analyzing large and complex data sets).
ML is a branch of artificial intelligence (AI), as it uses computers to turn experience into expertise (knowledge)
Origin of modern ML: Computer Science discovered data as a source for computational considerations
Like Data Mining, ML uses automation to extract insight from large data sets. The focus of automation in developing ML algorithms is to predict, classify, or make recommendations. Data Mining tends to be more descriptive than predictive.
Supervised Learning
The training examples contain significant information (“labels”, i.e. outputs, for the target variable; the “ground truth”) that might not be present in the test examples.
The goal is to predict or classify the missing information in the test data. We want to learn the outputs from the inputs.
We can think of the environment as a teacher that “supervises” the learner by providing information about the truth.
Algorithms (examples)
Linear regression (simple & multiple)
Regularized regression (Ridge & Lasso)
Logistic regression
Support-vector machines
Gradient boosting
Decision trees, Random Forests
Neural networks (CNN, RNN, LSTM, ...)
Transformer (NLP, CV)
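As a minimal sketch of one of the listed algorithms, the snippet below trains a random forest classifier on a small built-in dataset; a real application would add feature engineering, tuning, and more careful validation.

```python
# Random forest classification on a built-in dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```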
Unsupervised Learning
There is no target variable we wish to predict or classify. We still want to learn relationships and structure from the data.
Clustering data into similar groups or reducing the dimension of the data are typical uses for unsupervised learning.
Algorithms (examples)
Clustering
K-nearest neighbors
Hierarchical clustering
K-means clustering
Association rules learning
Dimensionality reduction
Principal Component Analysis (PCA)
Matrix Factorization
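As a minimal sketch of dimensionality reduction, the snippet below uses PCA to project the 64-dimensional digits dataset onto two components.

```python
# PCA: project 64-dimensional digit images onto two principal components.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                        # (1797, 2)
print(pca.explained_variance_ratio_)     # share of variance retained
```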
Reinforcement Learning
An agent (player) is taking actions (moves) in an environment (game)
The agent learns by interacting with the environment and receiving feedback
Actions are judged by a reward function (score), and the system is trained to maximize the sum of all future rewards
Unlike supervised learning, input and output (target) do not need to be present
Commonly used in robotics, game playing, and recommendation systems.
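A minimal sketch of the agent/environment/reward loop: an epsilon-greedy agent learns which of three actions yields the highest average reward. The reward probabilities are made up for illustration, and this toy bandit omits the states and long-term planning of full reinforcement learning.

```python
# Epsilon-greedy bandit: learn action values from reward feedback alone.
import numpy as np

rng = np.random.default_rng(4)
true_reward_prob = [0.2, 0.5, 0.8]   # environment (unknown to the agent)
q = np.zeros(3)                      # estimated value of each action
counts = np.zeros(3)
epsilon = 0.1

for step in range(10_000):
    if rng.random() < epsilon:       # explore occasionally
        action = rng.integers(3)
    else:                            # otherwise exploit the current best estimate
        action = int(np.argmax(q))
    reward = float(rng.random() < true_reward_prob[action])   # feedback
    counts[action] += 1
    q[action] += (reward - q[action]) / counts[action]        # running mean update

print(q)   # estimates approach the true reward probabilities
```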
Common
the input to a learning algorithm is training data, representing “experience”. The raw material for SL and ML is the same: data.
the data are thought of as randomly generated
both disciplines distinguish supervised and unsupervised learning, ML also has reinforcement learning
SL and ML use many of the same models and algorithms for regression, classification, clustering, dimension reduction
Differences
ML uses the observed data to come up with a description of relationships and “causes”. The emphasis is on prediction over explanation (e.g., hypothesis testing)
ML emphasizes large scale applications and prediction
accuracy.
SL emphasizes models, interpretability, precision and
uncertainty.
SL is concerned with the asymptotic behavior of parameter estimates, e.g. as \(n \rightarrow \infty\).
ML focuses on finite sample bounds: what is the degree
of accuracy expected on the basis of available samples.
ML worries more about algorithms than SL. ML develops algorithms for learning tasks and considers their computational efficiency.
SL often works from assumptions of the data model (independence,
equi-variance, linearity of effects).
ML assumes as little as possible about the distribution of the
data.
SL is more model oriented: what models are
available to predict the expected value of a binary random
variable?
ML is more task oriented: given this set of data with a
binary target variable, which methods are available to predict or
classify the target?
Combine the best of both worlds!
Statisticians:
- worry a bit more about algorithms and computational efficiency
- worry a bit more about generalization of models to new data
Machine Learners:
- quantify uncertainty (precision) and accuracy for given samples
Artificial Intelligence (AI) is the effort to build systems that can perform tasks or make decisions a human could make. AI draws on many disciplines, e.g., robotics, process automation, biology, neuroscience, cognitive science, computer science.
The current form of AI–which is much more successful than attempts in
previous decades–differs from past approaches in its reliance on
data.
Data-driven AI uses analytic techniques to develop algorithms that can
perform human tasks.
Its success was enabled by the confluence of three factors:
Advances in neural network technology (CNN, RNN, Generative networks, transformers, ...)
Large data sets made it possible to design deep neural networks (deep learning)
Cloud computing and GPUs made computing resources accessible to train these deep networks