Machine Learning Foundations for Product Managers Wk 2 - The Modeling Process
- Muxin Li
- Mar 2, 2024
- 5 min read
Updated: Jun 17, 2024
Technical terms covered:
CRISP-DM process
Features
Parametric and Non-parametric Algorithms
Datasets
Training datasets
Validation datasets
Test datasets
No Free Lunch theorem
Cross-Validation
K-Folds Cross Validation
CRISP-DM and Feature Selection
There's a lot that goes on before we even start building a model. The industry-accepted process is CRISP-DM (Cross-Industry Standard Process for Data Mining).
Just like in developing new products - what is the problem that you're trying to solve?
E.g. how do we predict power outages?
Research to understand what kind of data you'll need to collect, e.g. the features you should focus on that are related to what you're trying to predict
If I want to predict housing prices, the features I'd include would be the size and age of the home, zip code, etc. (there's a quick code sketch of this below)
Talking to domain experts is very valuable - e.g. if I want to reduce outages and I don't know where to start, I should start with anyone who knows anything about power outages (leading me to talk to meteorologists for weather-related data)
Aim to get only the features that you need - you don't want too many irrelevant features (which increases model complexity) or too few features
Usually having too few features is worse than having too many
When in doubt about whether to include a feature, include it
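To make the housing example concrete, here's a minimal sketch of what picking features looks like in code. It assumes pandas, and the column names and values are made up for illustration - this is my sketch, not something from the course:

```python
import pandas as pd

# Hypothetical home-sales data - the column names and values are made up
homes = pd.DataFrame({
    "sqft": [1400, 2100, 1750],
    "age_years": [32, 5, 18],
    "zip_code": ["27701", "27703", "27707"],
    "sellers_favorite_color": ["blue", "green", "red"],  # almost certainly irrelevant
    "sale_price": [285000, 450000, 362000],
})

# Keep only the features we believe are related to the thing we're predicting
features = ["sqft", "age_years", "zip_code"]
X = homes[features]       # model inputs
y = homes["sale_price"]   # the target we want to predict
```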
Algorithm Selection - and where this diverges from AI like GPT
So at this point in the lesson, they're covering terms and techniques used in more traditional ML applications - think Siri and Google Assistant AI experiences, before ChatGPT came along and changed everything.
Traditional ML uses algorithms as templates or frameworks to solve the problem it's tasked with. So given a set of inputs and a set of outputs that I want, what's the pre-existing framework that best describes the relationship between the inputs and the outputs?
Having to select an algorithm (or a set of algorithms that you think are most likely to succeed in the task) is a key design decision you'd have to make upfront - then you cross your fingers and hope that you made the right choice.
A couple of technical terms they covered:
Parametric algorithms assume a mathematical equation with a fixed number of parameters - think linear regression
Non-parametric algorithms are more like frameworks - think decision trees. These aren't fixed mathematical equations, but there is still a sort of predefined order or organizational method for building the model from the data (there's a quick code sketch of both at the end of this section)
'No free lunch theorem' - the idea that there isn't one algorithm to rule them all. Different algorithms are good for different jobs; there's no silver bullet...
...until you get to transformer-based AI, like ChatGPT (maybe).
In a seminal paper I haven't read yet, 'Attention Is All You Need', the transformer architecture behind this kind of AI was introduced. It talks about AI being able to understand things in context, using attention mechanisms to work out which words matter most to the meaning of the text, among various other groundbreaking ideas I'll write about later once I've read it in more detail.
Transformer-based AI is where it's happening right now, and why we're now seeing this explosion of interest in AI. These models can now do things we thought were reserved only for humans - understanding context, being creative, actually sounding like a human being on paper. Who knows if this is going to be the technology that gets us to super-intelligent, human-level AI that can do everything - what's clear is that right now, it is a huge improvement over the traditional ML-based AI we've had before.
Okay, so back to the course - technically a transformer-based AI is still 'parametric' in the textbook sense (it has a fixed, if enormous, set of learned weights), but it doesn't reduce to a simple equation you could write down, and it isn't exactly following a predefined framework either - it uses a stack of techniques to understand and process data: attention mechanisms, positional encodings, and optimization algorithms like gradient descent to train on tons of data (more technical terms I haven't dug into yet).
If a traditional ML model is using a non-parametric algorithm like a decision tree, it almost sounds like it's being given a set of rules to follow to complete its job. But it doesn't make its own rules.
Whereas a transformer-based AI seems to be able to generate its own rules and frameworks for how it approaches problems. It's almost like a scientist in its own right, if I'm understanding this correctly.
More to follow once I start diving into more transformer-based AI content.
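Back to the parametric vs non-parametric distinction for a second - here's a minimal sketch of both, assuming scikit-learn and a tiny made-up dataset (my choice of library and numbers, not the course's). The linear regression learns a fixed set of coefficients (an equation), while the decision tree grows its own split rules from the data:

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Tiny made-up dataset: [square footage, age of home] -> sale price
X = [[1400, 32], [2100, 5], [1750, 18], [2400, 2]]
y = [285000, 450000, 362000, 510000]

# Parametric: the model is an equation, price = w1*sqft + w2*age + b
linear = LinearRegression().fit(X, y)
print(linear.coef_, linear.intercept_)

# Non-parametric: no fixed equation - the tree learns its own if/else splits
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(tree.predict([[1800, 10]]))
```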
Model Complexity
Things that affect a model's complexity - the number of features, the algorithm itself, and the hyperparameters you use to tweak the algorithm.
More features = more complexity
Some algorithms are simpler than others - a straightforward linear regression vs a neural network
Hyperparameters are configurations to your algorithm - e.g. the learning rate (how big of a leap your model takes on its way to a solution), or how deep a decision tree should go. Which hyperparameters you can tweak will depend on the algorithm you use (sketched below).
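Here's a minimal sketch of what 'tweaking hyperparameters' looks like, again assuming scikit-learn (the specific values are arbitrary examples, not recommendations):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor

# Hyperparameter: how deep the decision tree is allowed to grow
shallow_tree = DecisionTreeRegressor(max_depth=3)    # simpler model
deep_tree = DecisionTreeRegressor(max_depth=20)      # more complex model

# Hyperparameter: the learning rate, i.e. how big a leap the model takes
# toward a solution on each training step
careful_learner = SGDRegressor(learning_rate="constant", eta0=0.001)
hasty_learner = SGDRegressor(learning_rate="constant", eta0=0.1)
```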
How complex should your model be? It depends - you want to get it 'just right'. The right level of complexity will get you a model that makes decent but not 100% error-free predictions.
You don't want the model to aim for a 100% accurate fit to your training data - it'll then only be good at the data it has already seen and won't generalize to new information.
A good example of the issues with underfitting vs overfitting: an underfit model is too simple to capture the real pattern, while an overfit model memorizes the noise in its training data.
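Here's a rough sketch of the idea (my own example, assuming numpy and scikit-learn, not the course's charts): the same noisy data fit with a straight line, which is probably too simple, and with a degree-15 polynomial, which is flexible enough to chase the noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy samples from a curved relationship
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)

# Underfitting: a straight line is too simple to capture the curve
too_simple = LinearRegression().fit(X, y)

# Overfitting: a degree-15 polynomial bends to fit the noise
too_flexible = make_pipeline(PolynomialFeatures(degree=15), LinearRegression()).fit(X, y)

# The flexible model looks nearly perfect on the data it trained on,
# which says nothing about how it will handle new data
print(too_simple.score(X, y), too_flexible.score(X, y))
```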

Planning your Data Sets
Have separate data for Training vs Testing
Training data can be reused again and again until your models' performance is satisfactory enough to run testing on new data
Prevent leakage - don't let test data into your training data or vice versa. In testing, you want to make sure your model has never seen the test data before so you can determine how good it is.
Allocate a total of 20-40% of all your data for validation purposes
10-20% of it goes to a validation step between Training and Testing
Validation lets you run smaller tests on different models to find the most promising ones
The other 10-20% of your data would be your actual Testing set (there's a code sketch of this split below)
Conserve your precious test data! Don't let your models test on all of your data at once, or you may have to go out and source new test data.
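A minimal sketch of a 60/20/20 split, assuming scikit-learn's train_test_split and a made-up dataset (the exact percentages are just one common choice within the ranges above):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Made-up dataset purely for illustration
X, y = make_regression(n_samples=1000, n_features=5, random_state=42)

# First carve off 40% of the data for validation + testing
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.4, random_state=42
)

# Split that holdout in half: 20% validation, 20% test (of the original data)
X_val, X_test, y_val, y_test = train_test_split(
    X_holdout, y_holdout, test_size=0.5, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 600, 200, 200
```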

Cross-Validation - Validation on Steroids
Data is precious - especially cleaned up, pre-labeled data used for training models. You don't want to use up all of that training data in one go. Why not do smaller tests while you're training your various models?
Slice thin slivers off your training dataset and hold each one out to check your models' performance, training on the rest. Fine-tune as needed to get better performance.
This should probably be avoided if you don't have a lot of data to begin with - you don't want too small of a sample size. There's no standard number of samples you need, but generally at least a few hundred are needed (and I've heard at least 1,000 samples from data scientists).
Repeat this process with other slices of your training data - evaluating on data they haven't seen before helps ensure your models aren't 'cheating' in their performance.
Find the model that did the best on average across all these slices of data.
Technical terms in cross-validation:
K-Folds Cross Validation - K is the number of 'slices' (folds) you split your data into. Each fold takes a turn as the validation set while the model trains on the remaining folds. So basically, K-Folds means 'number of data subsets' (sketched in code below).
5-10 folds is typical
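A minimal sketch of 5-fold cross-validation, assuming scikit-learn and a made-up dataset: the same model is trained and scored on 5 different train/validation slices, and you compare models by their average score:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Made-up dataset purely for illustration
X, y = make_regression(n_samples=1000, n_features=5, random_state=42)

# K = 5: split the data into 5 folds, train on 4, validate on the held-out
# fold, and rotate until every fold has been the validation set once
scores = cross_val_score(DecisionTreeRegressor(max_depth=5), X, y, cv=5)

# Compare candidate models by their average score across the folds
print(scores, scores.mean())
```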
Like this post? Let's stay in touch!
Learn with me as I dive into AI and Product Leadership, and how to build and grow impactful products from 0 to 1 and beyond.
Follow or connect with me on LinkedIn: Muxin Li