Machine Learning Foundations for Product Managers Wk 4 - Linear Models
- Muxin Li
- Mar 14, 2024
- 9 min read
Updated: Jun 17, 2024
Technical terms:
Parametric algorithms
Non-parametric algorithms
Linear Regression
Simple linear regression
Multiple linear regression
Bias
Coefficient or Weight
Sum of Squared Errors (SSE)
Cost function, loss function
Polynomial Regression
Regularization
Penalty Factor
Lambda (λ)
LASSO Regression
Ridge Regression
Logistic Regression
Logistic or Sigmoid Function
Gradient Descent
Learning Rate
Softmax Regression
A parametric algorithm, like a linear model, makes predictions using a predefined set of parameters and assumes a particular form of relationship between the inputs and outputs (similar to an algebra equation). Non-parametric algorithms don't assume a predefined relationship; there isn't a template that they follow like there is in a parametric algorithm.
Parametric algorithms are quick to train, even with small data, but they can be too simple for complex problems (underfitting the data)
Non-parametric algorithms can make better predictions, but require more data to train and are more prone to overfitting on the training data
We're starting on the easier side of supervised learning algorithms for now:

Linear Regression
If you've ever plotted data points on an X-Y axis chart and drawn a line that best followed the pattern of those dots, you've made a linear regression.
Simple linear regression - given one input feature, how does it affect the output? Example: what's the home price based on the number of bedrooms?
Multiple linear regression - given multiple input features, how do they affect the output? Example: what's the home price based on the number of bedrooms, zip code, square footage, and whether or not your dog deems the backyard worthy?
Benefits of this simple model:
It's easy to see how one (or multiple) input(s) affect an output, making the model easy to interpret and explain
Try simpler models like linear regression to get a benchmark first, before going into more complex models
Breaking down a linear regression equation (remember your algebra):

Bias is a constant (the intercept) - this value is some number that never changes, no matter what the inputs are.
Coefficient assigns some impact or 'weight' to the input feature (it's basically multiplying your feature by some value). In this case, the feature 'x' is weighted at 'W1'.
String together different input features and their coefficients (# of bedrooms, sq footage, school district), and you've got a multiple linear regression equation showing how all the inputs affect the output ('y' or in this case, your house price).
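Written out (with the W's as the coefficients/weights and b as the bias):

```latex
% Simple linear regression: one input feature
y = W_1 x + b

% Multiple linear regression: several input features
y = W_1 x_1 + W_2 x_2 + \dots + W_n x_n + b
```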

Training the Linear Regression Model
Ultimately, the linear regression model's job is to figure out what all the coefficient values should be, so you can calculate your target output ('y'). Steps on how:
Make a guess on what your coefficients are, and run your model
Check your model's predictions and measure its errors against the real data - in this scenario, we use Sum of Squared Errors (SSE)
Sum of Squared Errors (SSE)
For each data point, find the difference between the actual observed value, and your model's predicted value. You square that difference, which is your error.
You then add them all up. See? Sum of squared errors.

May you never have to do this by hand
Why so square? When finding the difference between the observed value and the model's prediction, you can end up with a negative number just because of the order in which you did the subtraction (was the observed value first, or second?).
Squaring eliminates negative values, and ultimately we want to see how 'big' the difference or the error is between our model's predictions vs the real world observations.
Squaring also helps exaggerate larger errors, so it makes it very clear if there's a big gap in our model's predictions.
In modeling terminology, your cost function or loss function is what you're using to evaluate errors in your model.
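A quick sketch of the SSE calculation in Python (the home-price numbers are made up for illustration):

```python
import numpy as np

# Actual observed home prices vs. what the model predicted (made-up values)
observed = np.array([300_000, 450_000, 250_000])
predicted = np.array([320_000, 430_000, 260_000])

errors = observed - predicted     # differences can come out negative
squared_errors = errors ** 2      # squaring removes the sign and exaggerates big misses
sse = squared_errors.sum()        # Sum of Squared Errors

print(sse)  # 900,000,000 in this toy example
```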
Back to Training your Model
Now that you have your SSE, your goal is to minimize that error by figuring out what the best coefficient and bias values are. There are methods far superior to random guessing:
A closed-form solution is essentially doing algebra using your observed data to find the best-fitting line for your data.
The equation below assumes that a straight line best fits your data.


Plug in actual input-output values from your observation data and solve for the unknowns (in practice, the closed-form solution uses all of your data points at once to find the best coefficient and bias):
E.g. my datapoint says a 2,000 sq foot home costs $300K
I put in 300K for y and 2,000 for x - then, together with the rest of my data, solve for the coefficient and bias
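Here's a minimal NumPy sketch of the closed-form (least squares) solution for a straight line, using made-up square-footage and price data:

```python
import numpy as np

# Made-up training data: square footage -> sale price
x = np.array([1500, 2000, 2500, 3000])
y = np.array([250_000, 300_000, 340_000, 400_000])

# Closed-form least squares solution for simple linear regression
w1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
bias = y.mean() - w1 * x.mean()

print(w1, bias)          # coefficient and bias that minimize SSE on this data
print(w1 * 2000 + bias)  # predicted price for a 2,000 sq ft home (~$298K here)
```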
What if a straight line doesn't fit my data?
Get yourself a polynomial regression when you want to model a non-linear relationship - these keep the linear-regression structure but transform the input features with exponents, log functions, and more to create line-bending models.


Transform the linear feature (coefficient * X) into a new feature (x^2, log(x) etc)
Label that new feature 'z' and try it out as an input in your model
You can try variations of a polynomial regression to see what fits:
Quadratic equations (degree of 2), cubic (degree of 3) and so on
Evaluate your Regression model's performance before and after the transformation with Mean Squared Error (from Week 3)
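One way to try this in scikit-learn (a sketch with made-up, curved data; PolynomialFeatures handles the feature transformation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Made-up data with a clearly non-linear (roughly quadratic) relationship
x = np.array([1, 2, 3, 4, 5, 6]).reshape(-1, 1)
y = np.array([1.2, 4.1, 9.3, 15.8, 25.4, 35.9])

# Transform x into polynomial features (degree 2 -> x and x^2), the new 'z'
poly = PolynomialFeatures(degree=2, include_bias=False)
z = poly.fit_transform(x)

# Fit plain linear regressions on the original and the transformed features
linear_model = LinearRegression().fit(x, y)
poly_model = LinearRegression().fit(z, y)

# Compare before vs. after the transformation with Mean Squared Error
print(mean_squared_error(y, linear_model.predict(x)))  # straight line: higher error
print(mean_squared_error(y, poly_model.predict(z)))    # curved fit: much lower error
```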
Using Regularization to Reduce Overfitting
A downside of using SSE as your error metric (and trying to minimize it as much as possible) is you can end up overfitting your model to the training data, making it less useful on new data. After all, SSE only rewards matching your training observations as closely as possible - noise and all.
Rein in your SSE with regularization, which puts a penalty on things that add complexity - the number of features, and the size of their weights. Reducing the complexity helps generalize the model and improve its handling of new data.
Applying regularization adds a penalty based on the size of the coefficients/weights from your linear regression equation to the SSE calculation, making your SSE (error metric) bigger
That sum of coefficient sizes is the penalty factor
The penalty factor is multiplied by its own coefficient, or Lambda (λ)
So in essence, you're using the coefficient Lambda to apply more or less weight to the sum of your coefficients or weights... it's a bit meta
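Roughly, the regularized cost looks like this (shown here with squared weights, as Ridge does below; LASSO swaps in absolute values):

```latex
\text{Cost} = \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{\text{SSE}}
            + \lambda \underbrace{\sum_j w_j^2}_{\text{penalty factor}}
```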

Variations on Penalty Factor
The sum of the coefficients (penalty factor) can be done in 2 ways:
LASSO Regression adds the sum of the absolute values of the coefficients, multiplied by the Lambda value
It tends to force coefficients to 0, effectively removing the feature tied to it
It 'selects' which features get to stay, and removes the ones that aren't really as important to your prediction model
It can be very effective at culling down a complex model with many features
Ridge Regression is the sum of all of the squared coefficients, multiplied by the Lambda value
It shrinks coefficient values close to zero, but doesn't take them all the way to zero
It minimizes the effects of less important features, but doesn't necessarily remove them entirely
It can be effective at maintaining complex relationships between the output and the input features, and at handling collinearity (when input features are correlated with each other)
TLDR; LASSO makes a simpler, more easily interpretable model. Ridge handles complex relationships with many features better. Try them both and see which works best for your model!
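A minimal scikit-learn sketch for trying both (alpha plays the role of Lambda here; the data and alpha values are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Made-up data: 5 input features, but only the first two actually drive the output
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty: sum of absolute coefficient values
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: sum of squared coefficient values

print(lasso.coef_)  # unimportant features tend to be driven to exactly 0
print(ridge.coef_)  # unimportant features shrink toward (but not to) 0
```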
Binary Classification Problems with Logistic Regression
Linear models can be used in classification tasks - but you can't apply a linear regression directly to a classification problem and get usable results.
Classification is a 'yes/no' problem of whether something is or isn't the thing you're looking for, with results being a '0/negative' or a '1/positive'
A linear regression would get you a range of values, from negative numbers to above 1 and everything in between - not the classification outputs we're looking for (negative)
We're starting simple with a binary classification problem, where we're predicting for just one thing.
Instead of getting the model to give us an exact output of 0 or a 1, we can use a logistic regression model to predict the probability that the output equals 1.
This is done by applying the sigmoid/logistic function to the outputs of your linear regression model
The sigmoid function can map any number to a value between 0 and 1, representing 0-100%
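Written out, the sigmoid function is:

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}
```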

When z is large and positive, σ(z) approaches 1 - a high probability this is a positive classification output.
If z is large and negative, σ(z) approaches 0 - a low probability of a positive classification output.
If z is close to 0, σ(z) is around 0.5 - a 50/50 chance of either positive or negative, or random.
It works on negative numbers, and on numbers above 1, to ensure everything stays between 0-1. The sigmoid function ensures that you cannot put in 110%, no matter what your motivational coach or trainer says.
In classification problems, we really want to know whether we've found what we're looking for. In binary, that means being able to get an output of 1, or a positive that yes, this is it.
So mathematically, you can represent that goal as y = 1, or output equals 1. When we're using linear regression models for classification problems, we're really getting the probability that y = 1. To turn "the probability that the output equals 1 (this is the thing we're looking for)" into math, we write it as P(y = 1).
Math, the ultimate shorthand.
Visualizing the process where we want to predict the probability of y = 1 for a binary classification problem (just trying to identify 1 thing):
The output of your linear regression model on the left (X = features, W = weights or coefficients) is all wrapped up and symbolized as z
Plugging z into the sigmoid function gets you a number between 0-1 that represents the probability that this is indeed a positive classification
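Sketched in Python (a toy example with made-up features and already-trained weights, just to show the mechanics):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number to a value between 0 and 1
    return 1 / (1 + np.exp(-z))

# Made-up example: 3 input features with weights and bias from a trained model
features = np.array([2.0, -1.0, 0.5])  # X
weights = np.array([0.8, 0.3, -0.5])   # W
bias = 0.1

z = np.dot(weights, features) + bias   # output of the linear model
p_positive = sigmoid(z)                # P(y = 1)
print(p_positive)                      # ~0.76 -> likely a positive classification
```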

Optimizing Logistic Regression Models for Binary Classification with Gradient Descent
It's time to play train the model. Assume here that we're trying to solve a classification problem, and we have already adjusted our linear regression model into a logistic regression model.
ID your cost function - this is where you'll have to decide what's the best cost function that represents the error in your model.
This isn't detailed in the lesson, but typically a logistic regression uses the binary cross-entropy loss or log loss. It's a common and effective approach for binary classification problems, but the right cost function depends on your requirements and the problem you're dealing with.
Use random weights/coefficient values as assumptions for your regression model, and calculate the output.
Calculate your error with the cost function to establish your baseline.
Your goal now is to minimize the error in your model, by getting your coefficient values to be better than random guesses. A popular way to do this is with gradient descent - which is surprisingly visual: imagine you're trying to descend a slope to reach the bottom of a canyon, which represents the minimum of the cost function (the error in your model).

Continuing with our descent into the bottom of the canyon - unfortunately, it's extremely dark and you have no light. You have little clue which direction you should be going, and you don't feel like you should take a huge leap and hope for the best (who knows where you'll end up instead).
So first, we need to know which direction we're going. Easy enough in a real world canyon, but in math we'll need a few tools to get numbers to move in the right direction.
Calculus time - the derivative (aka the gradient, or the slope) of a function points in the direction of the steepest ascent of that function. We definitely want to go the opposite way.
You're calculating the slope of the cost function
Now we just need to know how big of a step to take, or the learning rate, which gets multiplied by the gradient/slope
Experiment with values of your learning rate - you can either overshoot the 'bottom'/minimum of the cost function by picking too big of a step to take, or you can take forever trying to take tiny tiptoe steps to the bottom
Now we update the coefficients, or weights, in our logistic regression model:
New weight = old weight - learning rate x gradient
Run the cost function (it should give you a smaller number this time, meaning you've lowered your error)
Keep repeating these steps (starting with calculating the gradient) until you've arrived at a good enough minimum for your cost function, or when you've reached the bottom
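A hedged sketch of that loop in NumPy - the lesson doesn't show code, so this assumes binary cross-entropy (log loss) as the cost function, whose gradient works out to the simple expression below, plus made-up training data:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up training data: 100 examples, 3 features, binary labels
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

weights = np.zeros(3)  # starting guess for the coefficients
bias = 0.0
learning_rate = 0.1    # how big a step to take down the slope

for step in range(1000):
    p = sigmoid(X @ weights + bias)      # current predictions, P(y = 1)
    # Gradient (slope) of the log loss with respect to the weights and bias
    grad_w = X.T @ (p - y) / len(y)
    grad_b = (p - y).mean()
    # Step in the opposite direction of the gradient (that's the 'descent')
    weights -= learning_rate * grad_w
    bias -= learning_rate * grad_b

print(weights, bias)  # the first two features should end up with the largest weights
```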
Multi-Class Prediction Problems with Softmax Regression
Need to predict more than 1 thing? Instead of using a logistic regression, we'll need a softmax regression model:
It also starts out with the output of your linear model, but this time we use the softmax function (instead of good old sigmoid)
Instead of calculating the probability of y = 1, it calculates the probability of something belonging to each of the classes the model is trained to identify, or y = k
Each class now has its own weights or coefficients (but all share the same input features)
To calculate z (the output of your model) to feed your softmax function, you use the shared input features with each class's own coefficients - one z per class
Each z you get out gets fed into the softmax function, to get the probabilities of each class
The softmax function also limits each output between 0-1, to ensure a probability score - except it takes it to the next level, ensuring that the probabilities of all the classes in the model sum to exactly 1
Gradient descent methods can also be used to optimize a softmax regression model
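A minimal sketch of the softmax step itself (the per-class z scores are made up; in a real model each comes from that class's own coefficients):

```python
import numpy as np

def softmax(z):
    # Subtracting the max first is a common trick for numerical stability
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

# Made-up linear-model outputs (z) for four classes: dog, cat, rabbit, bear
z = np.array([3.2, 0.4, 0.4, 1.1])
probabilities = softmax(z)

print(probabilities)        # roughly [0.80, 0.05, 0.05, 0.10]
print(probabilities.sum())  # always sums to exactly 1.0
```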

The output you'll get at the end is a probability score for all the classes it's trained to predict:

There's an 80% chance this thing is a dog, a 5% it's a Cat or Rabbit, and a 10% it's actually a bear. But it's 100% a good boy.
Like this post? Let's stay in touch!
Learn with me as I dive into AI and Product Leadership, and how to build and grow impactful products from 0 to 1 and beyond.
Follow or connect with me on LinkedIn: Muxin Li