What is Machine Learning?

Supervised Learning

Supervised techniques require a set of inputs and corresponding outputs to “learn from” in order to build a predictive model. Supervised learning algorithms learn by tuning a set of model parameters that operate on the model’s inputs so that the model’s predictions best fit the known outputs. The goal of supervised machine learning is to train a model of the form y = f(x) that predicts outputs, y, based on inputs, x.

There are two main types of supervised learning techniques. The first type is classification. Classification techniques predict categorical outputs, such as whether a certain object is a cat or not, whether a transaction represents fraud or not, or whether a customer will return or not. The second type is regression. Regression techniques predict continuous values, such as a forecast of sales over the next week.

The inputs to machine learning algorithms are called features. Features can include mathematical transformations of data elements that are relevant to the machine learning task, for example, the total value of financial transactions in the last week, or the minimum transaction value over the last month, or the 12-week moving average of an account balance.
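The three example features above can be sketched in a few lines of Python. This is only an illustration: the transaction records, the weekly balances, the `as_of` date, and the feature names are all hypothetical, and the window convention (the N days up to and including `as_of`) is an assumption.

```python
from datetime import date, timedelta

# Hypothetical raw data for one account: (transaction date, amount).
transactions = [
    (date(2024, 3, 1), 120.0),
    (date(2024, 3, 12), 35.5),
    (date(2024, 3, 27), 60.0),
    (date(2024, 3, 29), 15.0),
]
# Hypothetical end-of-week account balances, oldest first (14 weeks).
weekly_balances = [1000.0 + 10.0 * week for week in range(14)]
as_of = date(2024, 3, 31)  # the date the features are computed for


def amounts_in_window(rows, as_of, days):
    """Amounts of transactions in the `days` days up to and including `as_of`."""
    start = as_of - timedelta(days=days)
    return [amount for day, amount in rows if start < day <= as_of]


def moving_average(values, window):
    """Average of the last `window` values (or of all values if fewer exist)."""
    recent = values[-window:]
    return sum(recent) / len(recent)


features = {
    "total_value_last_7d": sum(amounts_in_window(transactions, as_of, 7)),
    "min_value_last_30d": min(amounts_in_window(transactions, as_of, 30)),
    "balance_ma_12w": moving_average(weekly_balances, 12),
}
```

Each entry in `features` is one input, x, for a single observation. A production pipeline would also need to handle accounts with no transactions in a window, which this sketch does not.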

After features, x, are designed and implemented and observations, y, are identified, the model y = f(x) is ready to be trained. During model training, the ML algorithm “learns” parameters or weights. These parameters or weights are applied to the features to generate a trained model f(x) to best fit the outputs. Examples of model parameters include coefficients in a linear regression or split points in a decision tree. The following figure illustrates the concept.

A Simplified Machine Learning Pipeline


Figure 3 A supervised machine learning pipeline including raw data input, features, outputs, the ML model and model parameters, and prediction outputs. In this example, the machine learning model is trained to classify whether a customer will remain or leave.
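To make the idea of a learned parameter concrete, the sketch below learns a single split point on one feature, which is the kind of parameter a decision tree contains. It is a simplification: real decision tree implementations usually choose splits with an impurity measure such as Gini, whereas this example simply counts misclassifications. The feature (months since last purchase) and the labels are hypothetical.

```python
def fit_split_point(xs, ys):
    """Learn the threshold that misclassifies the fewest training examples.

    The resulting rule predicts 1 when x > threshold and 0 otherwise.
    """
    values = sorted(set(xs))
    # Candidate split points: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_errors = None, len(xs) + 1
    for threshold in candidates:
        errors = sum((x > threshold) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


# Hypothetical training data: months since last purchase, and whether
# the customer left (1) or remained (0).
months_since_purchase = [1, 2, 2, 3, 6, 7, 8, 9]
left = [0, 0, 0, 0, 1, 1, 1, 1]

split_point, training_errors = fit_split_point(months_since_purchase, left)
```

Here the "training" step is the search over candidate thresholds, and the learned parameter is `split_point`.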

After a model is trained, it is often evaluated and tested on a holdout data set to validate model performance. Generating predictions on holdout data indicates how well the model performs with new data on which it was not trained. Training and testing, or validation, are often iterative, time-intensive steps in a machine learning project. These topics are discussed in additional detail in the following sections.
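A minimal sketch of holdout evaluation follows, assuming a toy one-feature data set and a deliberately simple threshold model. The function names, the 25% holdout fraction, and the random seed are illustrative choices, not a prescribed procedure.

```python
import random


def split_holdout(rows, holdout_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and set aside a fraction as holdout data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def fit_threshold(rows):
    """Toy 'training': place a threshold midway between the two classes."""
    highest_negative = max(x for x, y in rows if y == 0)
    lowest_positive = min(x for x, y in rows if y == 1)
    return (highest_negative + lowest_positive) / 2


def accuracy(threshold, rows):
    """Fraction of rows whose label matches the prediction x > threshold."""
    return sum(int(x > threshold) == y for x, y in rows) / len(rows)


# Hypothetical labeled data: one feature x, label 1 when x exceeds 50.
data = [(x, int(x > 50)) for x in range(100)]
train, holdout = split_holdout(data)

threshold = fit_threshold(train)                 # uses training rows only
train_accuracy = accuracy(threshold, train)
holdout_accuracy = accuracy(threshold, holdout)  # rows the model never saw
```

The holdout rows play no part in fitting the threshold, so `holdout_accuracy` is the more honest estimate of how the model would behave on new data.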

Supervised techniques often require non-trivial dataset sizes to learn reliably from ground truth observations. Models may require many thousands of input and output examples to learn from in order to perform effectively. Larger datasets, including greater numbers of historic examples from which to learn, enable the algorithms to incorporate a variety of edge cases and produce models that handle these edge cases elegantly. Depending on the business problem at hand, multiple years of data may be necessary to account for seasonality.

Consider a machine learning model that aims to classify whether something is “true” or “false.” This type of classifier can be used to predict customer attrition: it may aim to predict whether an existing business customer is likely to remain or leave. The following figure shows a graphical representation of a supervised model, where the horizontal and vertical axes display input features, x (e.g., level of digital engagement by customer and number of purchases by customer), and the color-coded dots indicate labeled examples of past customer behavior, y (blue indicating attrition, red indicating retention). The labeled examples teach the model to identify patterns that indicate attrition. Once the model is trained, it can be applied to new data to predict the behavior of future customers. In the figure below, the green dashed line represents a decision boundary that partitions the feature space. On one side of the decision boundary, the model predicts “true” and on the other side “false.”

Supervised Learning: “Ground Truth” Available


Figure 4 Supervised machine learning models are trained on labeled data that are considered “ground truth” for the model to identify patterns that predict those labels on new data.
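A straight-line decision boundary like the one described above can be written as a weighted sum of the two features compared against zero. The sketch below is illustrative only: the weights and bias are hand-picked hypothetical values, not parameters learned from data, and the function name is an assumption.

```python
def predicts_attrition(engagement, purchases, weights=(-1.0, -0.5), bias=4.0):
    """Return True when a customer falls on the 'attrition' side of the line

        weights[0] * engagement + weights[1] * purchases + bias = 0

    The default weights and bias are hypothetical, chosen by hand.
    """
    score = weights[0] * engagement + weights[1] * purchases + bias
    return score > 0


# A customer with low engagement and few purchases lands on the attrition side.
low_activity = predicts_attrition(engagement=1.0, purchases=1.0)
# A highly engaged customer with many purchases lands on the retention side.
high_activity = predicts_attrition(engagement=5.0, purchases=4.0)
```

Training a linear classifier amounts to choosing the weights and bias so that the line separates the labeled examples as well as possible.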

During model training, the supervised machine learning algorithm is fed examples of both model inputs and outputs. Figure 5 demonstrates the design of a feature table with inputs and outputs that can be used to train a machine learning model to predict customer attrition – in this case, retail customers who stop subscribing to a service. Training data are aggregated at a monthly interval, with a single record for each customer on the first of each month. We could just as easily create a similar feature matrix at more frequent, or “rolling,” intervals, but this simple example illustrates the concept.

Features are aggregated over each monthly period, for example total purchases in the last month ($) and the month-to-month change in website traffic (clicks). Outcomes or outputs are also captured monthly. In this case, the output value equals 1 if the customer stopped subscribing at any time over the past month, and 0 otherwise.
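The monthly feature table described above might be assembled as in the following sketch. The customers, months, purchase amounts, click counts, and column names are all hypothetical, and the layout (one row per customer per month) is a simplified reading of the design.

```python
months = ["2024-01", "2024-02", "2024-03"]

# Hypothetical raw records.
purchases = [  # (customer, month, amount in $)
    ("c1", "2024-02", 40.0), ("c1", "2024-02", 10.0), ("c1", "2024-03", 5.0),
    ("c2", "2024-02", 80.0), ("c2", "2024-03", 75.0),
]
clicks = {  # (customer, month) -> website clicks
    ("c1", "2024-01"): 30, ("c1", "2024-02"): 20, ("c1", "2024-03"): 4,
    ("c2", "2024-01"): 25, ("c2", "2024-02"): 28, ("c2", "2024-03"): 27,
}
cancellations = {("c1", "2024-03")}  # (customer, month the subscription ended)

feature_table = []
for customer in ["c1", "c2"]:
    # Pair each month with the previous one to compute month-to-month change.
    for previous, month in zip(months, months[1:]):
        feature_table.append({
            "customer": customer,
            "month": month,
            "total_purchases": sum(
                amount for c, m, amount in purchases
                if c == customer and m == month
            ),
            "click_change": clicks[(customer, month)] - clicks[(customer, previous)],
            # Output: 1 if the customer stopped subscribing during the month.
            "stopped_subscribing": int((customer, month) in cancellations),
        })
```

Each dictionary is one training record: the two numeric columns are inputs and `stopped_subscribing` is the output the model learns to predict.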

This feature matrix is input into the supervised model and the model parameters are adjusted so that the model best “fits” the example outputs. By leveraging historical examples to train the model, the model learns the patterns that are predictive of customer attrition in the past. When new customer data are available, the trained model can be used to predict customers who will unsubscribe in the future.

When developing a new machine learning model, it is just as important to recognize its limitations as it is to understand its potential benefits. In the customer attrition example, the model is not discovering novel ways to retain customers. The model is learning based on historical patterns and then applying those patterns to predict future behavior.

Machine learning models can, however, be updated over time. As new labeled data become available (e.g., examples of new modes of customer churn or attrition), models can be retrained to learn those new patterns.


Figure 5 Examples of input signals and output data are required to train a supervised learning model.

One of the simplest machine learning formulations is described in the following equation:

Y = Xθ

In the above equation:

Model features are represented by a feature matrix, X, in which each column is a feature and each row is a data point. With m features and n data points, X has dimensions n × m.

Labels are represented by a vector, Y, with one row per data point (dimensions n × 1).

Model weights, which indicate the importance of each feature, are represented by a vector, θ (dimensions m × 1).

The training task of the supervised machine learning algorithm involves finding feature weights, θ, that minimize a training loss function.
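As a rough sketch of what "finding weights that minimize a training loss" can look like, the example below fits θ by gradient descent on mean squared error. The choice of loss, the learning rate, the number of steps, and the tiny data set (n = 4 data points, m = 2 features) are all assumptions made for illustration; other losses and optimizers are equally valid.

```python
def predict(X, theta):
    """Compute X·theta: one prediction per row of the feature matrix."""
    return [sum(x_j * t_j for x_j, t_j in zip(row, theta)) for row in X]


def mean_squared_error(X, Y, theta):
    """Training loss: average squared difference between predictions and labels."""
    return sum((p - y) ** 2 for p, y in zip(predict(X, theta), Y)) / len(Y)


def fit(X, Y, learning_rate=0.1, steps=2000):
    """Find theta by gradient descent on the mean squared error."""
    n, m = len(X), len(X[0])
    theta = [0.0] * m
    for _ in range(steps):
        residuals = [p - y for p, y in zip(predict(X, theta), Y)]
        # Gradient of the loss with respect to each weight theta_j.
        gradient = [
            2.0 / n * sum(r * row[j] for r, row in zip(residuals, X))
            for j in range(m)
        ]
        theta = [t - learning_rate * g for t, g in zip(theta, gradient)]
    return theta


# Hypothetical data generated from theta = [2, -1], so the fit should recover it.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]  # n x m feature matrix
Y = [2.0, -1.0, 1.0, 3.0]                              # n x 1 label vector

theta = fit(X, Y)
final_loss = mean_squared_error(X, Y, theta)
```

For this particular loss a closed-form least-squares solution also exists; gradient descent is shown because the same loop structure carries over to models without one.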

The dimensionality of the problem is important to consider. The size of the feature space (m, in the above formulation) should typically be smaller than the number of labeled data points (n, in the above formulation).

In practice, supervised machine learning problems are often limited by the number of labeled examples that are available from which the algorithm can learn. Usually, the more examples available, the higher the likelihood that a supervised technique will be successful.

There are two main categories of supervised learning techniques: classification and regression.

Classification


Classification models predict a class label, such as whether a customer will return or not, whether a certain transaction represents fraud or not, or whether a certain image is a car or not. Classification approaches are useful for business problems that have large amounts of historical data, including labels, that specify if something is in one group or another.

Classification algorithms map inputs (X) to outputs (Y) where Y∈{1, … , C} with C being the number of classes. If C = 2, this is called binary classification, and if C > 2, this is called multiclass classification.
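The mapping from inputs to one of C class labels can be illustrated with a nearest-centroid classifier. This technique is swapped in here purely because it is short; it is not one of the models this article discusses. The data and function names are hypothetical.

```python
def fit_centroids(X, y):
    """Compute the mean feature vector (centroid) of each class label."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        totals = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            totals[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [total / counts[label] for total in totals]
        for label, totals in sums.items()
    }


def predict(centroids, row):
    """Return the label whose centroid is closest to the input row."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical training data with labels in {1, 2, 3}.
X = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0], [0.0, 9.0], [1.0, 9.0]]
y = [1, 1, 2, 2, 3, 3]

centroids = fit_centroids(X, y)
C = len(centroids)  # number of classes; C > 2, so this is multiclass
predictions = [predict(centroids, row) for row in ([0.2, 0.1], [5.0, 6.0], [1.0, 8.0])]
```

With only two distinct labels in `y`, the same code would perform binary classification.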

An example of a classification task is predicting whether a piece of equipment or a machine is likely to fail. This predictive maintenance task is a common problem faced by manufacturing and operations-focused companies. Predictive maintenance can help avoid failure events that may be expensive or potentially dangerous.

If sufficient historical failure examples are available, as well as other relevant input data (e.g., sensor data, technician notes), a supervised machine learning classifier can be trained to predict whether equipment will be in the failed or not-failed class in the future. In supervised classification problems, the known outputs of the training examples are often referred to as labels. The following figure shows an example of failure labels and classifier predictions.


Figure 6 Time-series representation of a classifier label (“failed” or “not failed”) that can be used to train a predictive maintenance machine learning model using classification.
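One possible way to build such a binary label from a failure history is sketched below. The labeling rule is an assumption made for illustration: a time step is labeled 1 ("failed") when a failure occurs within the next `horizon` steps, so the classifier learns to anticipate failures. The function name and the example data are hypothetical.

```python
def failure_labels(n_steps, failure_steps, horizon):
    """Label each time step 1 if a failure occurs within the next `horizon` steps.

    A step t gets label 1 when some failure f satisfies t < f <= t + horizon,
    and label 0 ("not failed") otherwise.
    """
    return [
        int(any(t < f <= t + horizon for f in failure_steps))
        for t in range(n_steps)
    ]


# Hypothetical history: 10 time steps with a single failure at step 6.
labels = failure_labels(n_steps=10, failure_steps=[6], horizon=3)
```

The resulting list has one label per time step and would be paired with the sensor-derived features for the same step.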

Examples of supervised classifier models include support vector machines (SVM), gradient-boosted decision trees (GBDT, for example as implemented in XGBoost), random forests, and neural networks.

Regression


Regression models predict quantities, such as how many customers are likely to churn or the sales forecast over the next week. Regression techniques are useful for business problems that have large historical datasets that correlate to numeric labels, including such things as sales, inventory, or loan value.

Reconsider the predictive maintenance example we explored with a classifier model, but this time we want to predict equipment failure using a regression model. Instead of predicting a categorical label like “failed” or “not failed” (as with the classifier model), a regression model can be trained to predict a continuous value, such as time to failure, as shown in Figure 7.

Training a regression algorithm is similar to training a classifier. The feature matrix is composed of input signals such as sensor data and work orders. The regression model also requires labels, but instead of a binary (1 or 0) indicator of class (“failed” or “not failed”), the label is numeric (time to failure).


Figure 7 Time-series representation of a time-to-failure label that can be used to train a predictive maintenance machine learning model using regression.
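A numeric time-to-failure label could be derived from a failure history as in this sketch. The function name and data are hypothetical, and the treatment of steps with no later recorded failure (labeled `None` here, to be excluded or handled separately before training) is an assumption.

```python
def time_to_failure(n_steps, failure_steps):
    """For each time step, count the steps remaining until the next failure.

    Steps with no recorded failure at or after them get None, because the
    true time to failure is unknown for those steps.
    """
    labels = []
    for t in range(n_steps):
        remaining = [f - t for f in failure_steps if f >= t]
        labels.append(min(remaining) if remaining else None)
    return labels


# Hypothetical history: 8 time steps with failures at steps 3 and 6.
ttf_labels = time_to_failure(n_steps=8, failure_steps=[3, 6])
```

The label counts down to zero at each failure and then resets, giving a regression model a numeric target in place of a class.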

Examples of supervised regression models include linear regressions that model a linear relationship between input features and outputs, ridge regressions that extend linear regression with a penalty on large coefficients, random forests that model nonlinear relationships between inputs and outputs using ensembles of decision trees, and neural networks that model nonlinear relationships between inputs and outputs using layers of interconnected nodes.
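The difference between linear and ridge regression is easiest to see in the simplest case. The sketch below assumes a single feature and no intercept, so that both models have a one-line closed-form solution; the data and the penalty strength `lam` are hypothetical.

```python
def fit_linear(xs, ys):
    """Least squares for y = theta * x: minimizes sum((y - theta*x)^2)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_ridge(xs, ys, lam):
    """Ridge regression for y = theta * x.

    Minimizes sum((y - theta*x)^2) + lam * theta^2, which shrinks theta
    toward zero as the penalty strength lam grows.
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


# Hypothetical data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

theta_linear = fit_linear(xs, ys)           # recovers the slope of 2
theta_ridge = fit_ridge(xs, ys, lam=14.0)   # shrunk toward zero by the penalty
```

Setting `lam` to zero makes the two fits identical. In practice the penalty is tuned on validation data, trading a small amount of bias for more stable coefficients.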

The following figure shows an example of a classification technique and a regression technique. The left-hand side of the figure illustrates the result of a classification algorithm that estimates a decision boundary to separate two classes (the classes are represented with different symbols in the figure). The axes on the chart represent two input features. The right-hand side of the figure illustrates the result of a regression algorithm that predicts a quantity (shown on the y axis) as a function of a feature input (shown on the x axis).


Figure 8 Examples of classification and regression techniques.