
Machine Learning Foundations for Product Managers Wk 5 - Tree and Ensemble Models

  • Writer: Muxin Li
  • Mar 25, 2024
  • 8 min read

Updated: Jun 17, 2024

Technical terms:

  • Decision tree

  • Splits

  • Information Gain (IG)

  • Node

  • Leaf

  • Depth

  • Ensemble Model

  • Aggregation function

  • Bootstrap aggregating (bagging)

  • Clustering

  • Unsupervised learning

  • Basis

  • K-means Clustering


It's time for nonparametric and unsupervised learning models.


Tree Models


Decision tree models work by using a series of questions (or 'splits'), with each question generating new information that helps the model make a prediction (e.g., if this animal has horns, then maybe it's a moose?).

  • If you've ever played Guess Who?, this process feels familiar - each player has to ask questions to narrow down who the other person's character is. The player who is able to guess the other person's character first wins.

  • Stronger players know the best strategy is to start by asking questions that can help them eliminate as many choices as possible.

    • In the 90s version of the game, the best question to start off with may be, "Is your person a man or a woman?" Trans awareness hadn't picked up back then.

Who knew we were being trained as data scientists at a young age?


A good decision tree model uses a similar strategy to efficiently make its prediction, such as classifying what an animal is. Our goal is to build the most efficient tree that uses the fewest possible questions, or splits, to help us separate the data into classes.

  • To find good splits, we want to maximize our information gain (IG) at each split - or, as the course describes it, reduce the data impurity (the unclassified, disorganized data mixing all your different classes together into one big Guess Who? board).

  • You can split by features and the values of those features. So in the case of Guess Who?, a feature could be Gender and the values could be Male, Female, and perhaps Trans in the 2024 version.
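
To make the idea of information gain concrete, here's a rough Python sketch (my own illustration, not from the course) of scoring a single split, using entropy as the impurity measure - the labels and the split are made up.

# Rough sketch (not from the course): computing information gain for one split,
# using entropy as the impurity measure. The labels and the split are made up.
import math
from collections import Counter

def entropy(labels):
    """Impurity of a group: 0 if pure, higher when classes are mixed."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

# A tiny "Guess Who?" board: everyone is either class A or class B.
parent = ["A", "A", "A", "B", "B", "B", "B", "B"]

# Split on some feature, e.g. "has horns?" -> yes / no groups.
yes_group = ["A", "A", "A", "B"]
no_group = ["B", "B", "B", "B"]

# Information gain = parent impurity - weighted impurity of the child groups.
weighted_child_entropy = (
    len(yes_group) / len(parent) * entropy(yes_group)
    + len(no_group) / len(parent) * entropy(no_group)
)
info_gain = entropy(parent) - weighted_child_entropy
print(f"Information gain from this split: {info_gain:.3f}")

The split that produces the highest information gain (the biggest drop in impurity) is the one the tree picks.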


Making Predictions with Tree Models


Once you have a tree model, it's pretty straightforward to follow - like a flowchart, each split takes you down to another level or a node. At the lowest nodes, you've arrived at the 'leaves' of the tree.



In classification problems, decision trees use a majority vote to predict.

  • For example, if most of the samples on Leaf 1 are Class A, then the prediction is Class A whenever the decision tree ends up on Leaf 1.



In regression problems, decision trees use averages.

  • Here, the prediction for each leaf is the average of its sample values.
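
Here's a small scikit-learn sketch showing both cases (the data is made up, just to illustrate): a classification tree whose leaves predict by majority vote, and a regression tree whose leaves predict the average of their samples.

# Sketch with scikit-learn (made-up data): a classifier leaf predicts by majority
# vote, a regressor leaf predicts the average of the samples that land on it.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: predict the animal from [weight_kg, has_horns].
X_cls = [[300, 1], [350, 1], [5, 0], [4, 0], [400, 1]]
y_cls = ["moose", "moose", "cat", "cat", "moose"]
clf = DecisionTreeClassifier(max_depth=2).fit(X_cls, y_cls)
print(clf.predict([[320, 1]]))   # majority class on the leaf it lands in

# Regression: predict a numeric value; the leaf outputs the mean of its samples.
X_reg = [[1], [2], [3], [8], [9], [10]]
y_reg = [50, 55, 60, 85, 88, 90]
reg = DecisionTreeRegressor(max_depth=1).fit(X_reg, y_reg)
print(reg.predict([[9]]))        # average of the training values on that leaf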



Tuning Hyperparameters - Tree Depth


The depth of a tree is how many levels of splits you want your model to have - this is a hyperparameter setting, which can't be derived from the data but is a decision you have to make about your model and how it will behave.

  • Too few splits, or too shallow of a depth, and your decision tree is unlikely to produce good predictions as it will underfit the data.

  • Too many splits, or too deep of a tree model, and you risk overfitting the model to the training data, making it less capable of making good predictions with new data.

    • The bigger risk seems to be overfitting your decision tree model, as it's hard to know when to stop.
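
A quick way to see this tradeoff is to compare trees of different depths on held-out data. Here's a sketch on synthetic data (the dataset and depth values are arbitrary, not from the course) - the unlimited-depth tree will score near-perfectly on training data but usually worse on the test set.

# Sketch (synthetic data): comparing a shallow vs. a very deep tree.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 3, None):   # None = keep splitting until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth,
          "train:", round(tree.score(X_train, y_train), 2),
          "test:", round(tree.score(X_test, y_test), 2))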


Decision trees are easy to interpret, train quickly, and since they're nonparametric they can handle non-linear relationships well:

  • Ex of a non-linear relationship: The amount of time studying vs exam scores. Initially, there's a positive relationship between more study hours and higher scores. But there comes a time when you're getting diminishing returns and your increase in study hours doesn't necessarily increase your exam score - and if you're pulling an all-nighter before the final, chances are your expected exam score may go down instead of up.

  • A decision tree model could find different thresholds for study hours and its impact on exam scores for better predictions.
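
As a toy illustration of that study-hours idea (the numbers are made up), a decision tree regressor will learn thresholds where extra hours stop paying off:

# Sketch (made-up numbers): a tree regressor finds thresholds in study hours.
from sklearn.tree import DecisionTreeRegressor, export_text

hours = [[1], [2], [3], [4], [5], [6], [7], [8], [10], [14]]   # hours studied
score = [55, 65, 75, 82, 86, 88, 89, 89, 88, 70]               # exam score

tree = DecisionTreeRegressor(max_depth=2).fit(hours, score)
print(export_text(tree, feature_names=["hours"]))   # shows the learned thresholds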


Ensemble Models


Each model has its strengths and weaknesses, and models vary in their ability to make predictions. A popular strategy to overcome this is to pull a Frankenstein and create a meta model, or an ensemble model made up of multiple models.

  • Each model should be independent, or close to independent, of the others. This lowers the risk of letting the variance of any one model dominate.

  • Each model can use the same algorithm but be trained with different hyperparameters, or use entirely different algorithms - it's up to you.

  • You can train each model on the entire dataset, or slice up the data set into multiple sets (each set can then be used to train a model).

  • Use each model to generate predictions, then aggregate the predictions to get one single output prediction for the overall ensemble model.

  • The aggregation function is up to you, and it will depend on the problem you're trying to solve - a classification problem may mean using a majority vote, a regression problem may mean using an average or weighted average.
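
Here's a minimal sketch of that recipe in scikit-learn (synthetic data, and the choice of member models is arbitrary): three different algorithms trained on the same problem, with their predictions aggregated by majority vote.

# Sketch: three different algorithms combined into one ensemble prediction.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",   # hard = majority vote across the member models
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))

For a regression problem you'd average instead of vote (scikit-learn's VotingRegressor does exactly that).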



Electric utilities often use ensemble models to predict the amount of load, or demand for electricity, on the grid at any time. They may use data like weather forecasts, dates of special events that might drive a temporary increase in visitors, historic consumption patterns throughout the day or the week, and other scenarios as inputs into individual member models. The outputs of those member models are then aggregated into a single ensemble prediction the utility can use to decide how much power production is needed and when.


Ensemble models can reduce overfitting, which helps to make better predictions with new data. But ensemble models have downsides - remember the Netflix Prize?

  • Netflix held a competition for a movie recommendation algorithm that could outperform their own in-house version by at least 10%. In 2009, it awarded a team $1M for their submission.

  • But it was not to last - upon review, the engineers at Netflix found that the winner's solution was an ensemble model that was too expensive to run, and its use of multiple algorithms and techniques made it a significant challenge to integrate and maintain.


Ensemble models are also more opaque - by design, they're more complex and therefore harder to scrutinize compared to simpler, single models.


Like Frankenstein's monster, ensemble models may be too difficult for us humans to understand.


Random Forest


Actually, first we're going to talk about bagging.


A common way to build ensemble models is by bagging or bootstrap aggregating:

  • Imagine you have a basket of different colored balls (each ball is a sample). To train models, you use a subset (training data set) that you've determined will be made up of 5 balls.

    • To select the balls for your subset, you pick one out of the basket, make a note of it, and then put it back into the basket. You repeat this process until you have made notes of 5 balls in total for your data subset.

    • There's a chance you'll pull out the same ball multiple times, but it's at random. As a whole, the mix of balls in your subset would vary from subset to subset, therefore giving you a slightly different variation of data in each subset to train your models on.


Bootstrapping is described in the course as 'sampling with replacement', since each time you pull out a sample, you put it back into the data (the ball goes back into the basket).

  • Select the size of the bagging subset (e.g., I want 5 samples in a subset) - this can be either a % of the number of rows of data or some fixed number of rows you choose. Create multiple subsets of the data.

  • Train and run each model on its own subset of the data, then combine the predictions of these models (usually by averaging).

  • Why do this? It's another way of preserving your precious data and getting the most use out of it. Bagging lets you carve out different subsets of data for training, so that the predictions you're getting from each model are more independent (you don't want to have a model make predictions using the same data again and again, that'd be cheating).

  • This also reduces the risk of overfitting to the training data, since you're introducing the models to different flavors of the training data.
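
The ball-and-basket version of bootstrapping is only a few lines of code. Here's a sketch with numpy (the colors and bag size are made up):

# Sketch of bootstrapping (sampling with replacement): each "bag" is drawn from
# the same basket, and repeats are allowed.
import numpy as np

rng = np.random.default_rng(seed=0)
basket = np.array(["red", "blue", "green", "yellow", "purple"])  # the balls

for i in range(3):
    # Draw 5 balls, putting each one back before the next draw.
    bag = rng.choice(basket, size=5, replace=True)
    print(f"bag {i + 1}: {list(bag)}")   # each bag is a slightly different mix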


Random Forest for real


According to the course, random forest is a common type of bagging model - we're going to create multiple decision trees, have them trained on data subsets that we carved out using bagging, then we aggregate the outputs of the tree models and make a prediction using either a majority vote (classification problem) or averaging (regression problem).

  • Random forest sounds more like an ensemble model of different tree models, and bagging was primarily used as a way to carve out the data for training the tree models.

  • Like ensemble models, a random forest helps to reduce the risk of overfitting - something that can be hard to avoid with tree models given the difficulty of knowing what depth to set as the hyperparameter.
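
To show the idea (a rough sketch, not a production implementation), here's a hand-rolled random forest: several decision trees, each trained on its own bootstrap sample, combined by majority vote.

# Sketch: the random forest recipe by hand - bagging + trees + majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(seed=0)

trees = []
for _ in range(10):
    # Bootstrap: sample rows with replacement for this tree's training set.
    rows = rng.choice(len(X), size=len(X), replace=True)
    # max_features="sqrt" also limits which features each split can consider,
    # which keeps the trees more independent of each other.
    trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[rows], y[rows]))

# Majority vote across the 10 trees for each point we want to predict.
votes = np.array([t.predict(X[:5]) for t in trees])
forest_prediction = [np.bincount(col).argmax() for col in votes.T.astype(int)]
print(forest_prediction)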


Designing Random Forests and Challenges


Random forests require more design decisions:

  • How many tree models in this forest?

  • What is the sampling strategy for bagging?

    • Using a % of total rows from your training set is a popular choice - but what %?

    • Maximum number of features we want to include in each of the bagging samples (subset)? We don't want a sample of all the features in our subset - to ensure the tree models are independent, we can choose a % of the features.

  • Depth of the trees (yes, this again)

    • Maximum level of depth, or the max number of splits in our tree models?

    • Minimum number of samples per leaf? If you don't specify this, you may end up with a very large, complex tree in which each sample is a leaf - this becomes a risk of overfitting.

      • You can set a minimum number of samples needed on each leaf, so that it limits how big the tree model can grow (e.g. do not have fewer than 4 samples per leaf - this would stop the tree from splitting further once it reaches that level per leaf).
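
If you're using scikit-learn, those design decisions map pretty directly onto RandomForestClassifier's hyperparameters. Here's a sketch (the specific values are arbitrary placeholders, not recommendations):

# Sketch: the design decisions above as scikit-learn hyperparameters.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,       # how many tree models in the forest
    max_samples=0.8,        # % of rows drawn (with replacement) for each tree's bag
    max_features=0.5,       # % of features each split is allowed to consider
    max_depth=10,           # maximum number of split levels per tree
    min_samples_leaf=4,     # a split is only kept if each leaf keeps at least 4 samples
    bootstrap=True,         # use bagging to build each tree's training set
    random_state=0,
)
# forest.fit(X_train, y_train) would then train it on your own data.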


Clustering


Clustering is an unsupervised learning technique - in unsupervised learning, we don't have to set the target output labels to train a model (like how we did with home prices based on features like sq footage). A clustering model will take input data and then determine for itself which data points are similar enough to each other to belong to the same clusters.

  • You can cluster market segments using demographics or geography, or group daily news articles about the same topics.

Clustering does require some setup - you need to determine what to cluster around, or the basis you'll use to calculate similarity or dissimilarity.

  • Is apple juice similar to beer because they're both beverages, or should they be considered different because they don't taste the same?

    • Where does hard apple cider fit into this? These are the tough questions.

  • The course doesn't get into this, but you should probably use the outcomes of your project (what end goal or business performance you're trying to drive) to help you determine what basis you're going to use.


K-means Clustering


The most popular clustering algorithm, K-means clustering, randomly creates cluster centers (after you select the basis for clustering, like whether to evaluate beverages by color or taste).

  • You start off with data points that are mapped out on a feature space - each data point represents a single observation and the axes correspond to the features you have chosen.

    • A simple example is comparing just 2 features on an x-y axis (e.g., comparing the height and weight of each animal, plotting each animal as a data point with weight being the x-axis and height as the y-axis).

  • K-means will start off with a random location for a cluster center but then iterate the location until it reaches the 'real' center of the cluster.

    • Upon each iteration, the algorithm recalculates the center of each cluster by averaging the positions of all the data points closest to that center.

    • It'll repeat these steps until the cluster center locations no longer change significantly.
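
Here's a tiny K-means sketch on a 2-feature space like the animal example (the weights and heights are made up):

# Sketch: K-means on a 2-feature space, [weight_kg, height_cm].
from sklearn.cluster import KMeans

animals = [
    [4, 25], [5, 23], [6, 30],            # small pets
    [300, 180], [350, 190], [400, 200],   # large animals
]

# You have to pick K (the number of clusters) up front - here, 2.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(animals)
print(kmeans.labels_)            # which cluster each animal was assigned to
print(kmeans.cluster_centers_)   # the final, iterated cluster centers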


K-means clustering is easy to run and quick, and is an overall good starting point. But it doesn't work well with very complex data - imagine a feature space where you're trying to map out 10 features, or data with complex patterns that aren't linear.

  • K-means requires you to determine how many clusters you want (the K) - there's a bit of risk here in having to guess.

  • It's important to pick an algorithm that best suits your problem and situation. If you're not sure, K-means is not a bad place to start.



 

Like this post? Let's stay in touch!

Learn with me as I dive into AI and Product Leadership, and how to build and grow impactful products from 0 to 1 and beyond.


Follow or connect with me on LinkedIn: Muxin Li




