
Managing Machine Learning Projects Wk 5 - Model Lifecycle Management

  • Writer: Muxin Li
  • Jun 25, 2024
  • 11 min read

ML and AI products aren't done when they're launched - model lifecycle management is critical to maintain ML systems in production.


What's Covered:

  • Model lifecycle management

  • ML system failures and risks

  • Types of ML model issues (e.g., data drift, concept drift)

  • ML system monitoring techniques

  • Model maintenance and retraining strategies

  • Model versioning and its importance

  • Organizational considerations for supporting ML models

  • Additional considerations for Large Language Models (LLMs) and Generative AI


Key Takeaways:

  • ML projects require ongoing management and adaptation to maintain optimal performance in changing environments. The work doesn't stop at deployment.

  • Comprehensive monitoring, regular maintenance, and robust versioning practices are essential for managing the complexity of ML systems and ensuring their continued reliability.

  • LLMs and Generative AI models introduce additional challenges, including unique performance metrics and ethical considerations, that build upon the foundational practices of ML model management.


Technical terms:

  • Training-serving skew

  • Data drift

  • Concept drift

  • Shadow releasing

  • Champion-challenger testing

  • Continuous learning

  • Online learning

  • SHAP (Shapley Additive Explanations)

  • LIME (Local Interpretable Model-Agnostic Explanations)



 


Machine learning projects don't end when models are deployed to production. The environment around the model continues to change, significantly impacting its performance over time.

  • ML projects are not done once they're released into the wild and in production - the environment around the model (the real world) can change, which will affect the performance of the model.

  • Understand the main risks of ML models in production: training-serving skew, data drift, and concept drift.

  • How to mitigate those risks.

  • What to monitor (and how to monitor it) to determine when models need retraining.

  • Model versioning and best practices to manage many versions of models, during development and in production.


Takeaway: Successful ML projects require ongoing management and adaptation to maintain optimal performance in changing environments.



ML System Failures

Machine learning-based products are subject to a wide range of potential failures, including both software-related issues and model-specific risks. Model-related failures can be particularly dangerous because they are difficult to detect.

  • Model performance degrades over time - the rate of decay depends on a number of factors, especially how fast the environment around the model is changing. The data in production may not match the data you used for training, or it may drift away from it slowly, degrading performance gradually. Model performance is typically at its best when the model is first released into production.

  • Decide what level of decay is acceptable for your particular model before you need to take action. Model-related failures are notoriously difficult to identify - predictions can range from slightly off to significantly off, and users often can't tell the difference.


Takeaway: Understanding and monitoring for both software and model-specific failures is crucial for maintaining ML system reliability.


Types of ML Model Issues

Several issues can affect ML models in production, including training-serving skew, excessive latency, and different types of drift.

  • Training-serving skew - the mismatch between the training data (and its quality) and the real-world inputs your model receives in production (e.g. high-resolution photos in training vs. blurry or low-light photos as real-world inputs).

  • Training-serving skew often occurs because of differences in how the data is processed in training vs. in production. It's typically easy to detect since the model degrades significantly once users get their hands on it, so you'll find out fairly quickly. A minimal consistency check is sketched below.
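One way to catch this kind of skew before users do is to run the same raw records through both preprocessing paths and compare the resulting features. Here is a minimal sketch in Python; preprocess_for_training and preprocess_for_serving are hypothetical stand-ins for your own pipeline code, and the check assumes numeric features returned as pandas DataFrames.

```python
import numpy as np
import pandas as pd

def check_training_serving_skew(raw_records, preprocess_for_training,
                                preprocess_for_serving, tolerance=1e-6):
    """Flag features whose values differ between the training-time and serving-time paths."""
    train_feats = preprocess_for_training(raw_records)   # features as built for training
    serve_feats = preprocess_for_serving(raw_records)    # features as built in production
    report = []
    for col in train_feats.columns:
        diff = np.abs(train_feats[col].to_numpy() - serve_feats[col].to_numpy())
        report.append({"feature": col,
                       "max_abs_diff": float(diff.max()),
                       "skew_suspected": bool(diff.max() > tolerance)})
    return pd.DataFrame(report)
```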


Certain use cases require very low latency, like self-driving cars. The speed at which models can generate accurate predictions and outputs can depend on many factors.

  • Latency can vary significantly depending on the volume of input data coming into your system, the extent of your data pipeline (how complex it is and how many steps it takes for raw data to become model-ready inputs), how long data takes to flow through that pipeline, and how quickly your choice of algorithm and model can generate predictions.

  • In systems where the model's outputs impact future inputs, latency creates a lag in the feedback loop. A simple way to measure end-to-end latency is sketched below.
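As a rough illustration, you can time the full path from raw input to prediction and track percentile latencies. This is only a sketch; preprocess and model are placeholders for your own pipeline and model objects.

```python
import time
import numpy as np

def measure_latency_ms(model, preprocess, raw_batches):
    """Return p50/p95/p99 end-to-end latency in milliseconds over a set of raw input batches."""
    timings_ms = []
    for raw in raw_batches:
        start = time.perf_counter()
        features = preprocess(raw)       # data pipeline step
        _ = model.predict(features)      # model inference step
        timings_ms.append((time.perf_counter() - start) * 1000)
    return {f"p{p}": float(np.percentile(timings_ms, p)) for p in (50, 95, 99)}
```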


When the environment around your model changes, it can cause the model to drift.

  • Data drift occurs as changes in the environment around the model affect its predictions - whether those changes are very sudden (e.g. COVID restrictions) or very gradual (shifting demographics in a neighborhood).

  • Compared to training-serving skew, data drift occurs more gradually. It has more to do with changes in the input data itself, rather than how the data was processed in training (as with training-serving skew).


Concept drift arises from changes in the relationship between inputs and outputs - changes in human behavior or preferences tend to drive it. During COVID, one-way ticket purchases for flights home surged as people returned for lockdown. Before COVID, one-way ticket purchases correlated with potential credit card fraud, so this sudden change in behavior caused concept drift in fraud detection models.

  • In concept drift, the real-world relationship between inputs and outputs has changed, as in the COVID fraud detection case with one-way flight tickets. In data drift, it's the input data itself (either its distribution or its properties) that changes, like the size of houses over time.

  • In both concept and data drift, changes in the real world cause the model to predict less accurately than it once did.


Takeaway: Awareness of these issues helps in designing robust monitoring systems and maintenance strategies for ML models in production.



ML System Monitoring

Proper monitoring is critical for detecting and addressing issues with ML models in production before they cause noticeable disruption to users.

  • It's critical to build a monitoring system for your ML model so you can triage and diagnose issues before they cause major disruption. It should sit alongside best practices in traditional software and infrastructure monitoring.

  • Common things to monitor: the model's outputs compared to real-world outcomes (is data or concept drift happening?), the input data coming into your system, and the data as it moves through your pipeline.



Monitor the quality of your input data with basic quality checks - look at the schema (data types, field names, etc.) and the encoding (how human-readable values are converted into a machine-readable representation, e.g. "Red" becomes [1, 0, 0]) to ensure they're correct.

  • Check if the volume of input data matches your expectations, look for signs of missing data.

  • Evaluate whether the distribution of your input data has changed from what's expected - compare statistics like the mean and variance of each input feature's distribution, or create visualizations comparing incoming data against what you've seen in the past. Changes in distribution signal data drift.

  • Conduct correlation analysis between input features and your target values to identify signs of concept drift, in which the relationship between inputs and outputs has changed. A sketch of both checks follows this list.
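To make these two checks concrete, the sketch below runs a two-sample Kolmogorov-Smirnov test per feature to flag data drift and compares feature-target correlations to hint at concept drift. It assumes tabular, numeric data in pandas DataFrames; train_df, prod_df, and the target column name are placeholders.

```python
import pandas as pd
from scipy.stats import ks_2samp

def data_drift_report(train_df, prod_df, alpha=0.01):
    """Two-sample KS test per feature: small p-values suggest the distribution has shifted."""
    rows = []
    for col in train_df.columns:
        stat, p_value = ks_2samp(train_df[col].dropna(), prod_df[col].dropna())
        rows.append({"feature": col, "ks_stat": stat, "p_value": p_value,
                     "drift_suspected": p_value < alpha})
    return pd.DataFrame(rows)

def concept_drift_signal(train_df, prod_df, target):
    """Compare feature-target correlations then vs. now; large changes hint at concept drift."""
    corr_then = train_df.corr(numeric_only=True)[target]
    corr_now = prod_df.corr(numeric_only=True)[target]
    return pd.DataFrame({"corr_training": corr_then,
                         "corr_production": corr_now,
                         "abs_change": (corr_then - corr_now).abs()})
```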


Do NOT rely on automated testing alone. Periodically perform a manual audit on your input data, visualize it, and see if there are other issues that your tests have not yet caught.


To monitor issues related to your data pipeline, check the distribution of your production data after it's gone through your pipeline and compare it against what you expected.

  • Check feature values after they've been processed and before they're fed to the model - for continuous features (e.g. a number), check whether the minimum and maximum ranges and the distributions match your expectations. For categorical features (e.g. a label), check whether the production data includes categories that were not present in the training dataset. A simple version of these checks is sketched below.
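A minimal version of these post-pipeline checks might look like the following; expected is a hypothetical dictionary of per-feature expectations recorded from the training set.

```python
import pandas as pd

def check_processed_features(features, expected):
    """Return warnings for out-of-range numeric features and unseen categorical values."""
    warnings = []
    for col, spec in expected.items():
        if spec["type"] == "continuous":
            lo, hi = features[col].min(), features[col].max()
            if lo < spec["min"] or hi > spec["max"]:
                warnings.append(f"{col}: observed range [{lo}, {hi}] falls outside "
                                f"training range [{spec['min']}, {spec['max']}]")
        elif spec["type"] == "categorical":
            unseen = set(features[col].unique()) - set(spec["categories"])
            if unseen:
                warnings.append(f"{col}: categories not seen in training: {sorted(unseen)}")
    return warnings
```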


Evaluating the performance of the model itself can involve looking at performance metrics (output metrics like Recall) over time and monitoring its decay rate. Set a threshold for performance to know when it's time to update or make changes to the model.

  • Monitor the distribution of your predicted outputs vs. what you're receiving from the real world - that is, the expected statistical spread of outcomes vs. the spread observed in reality. A significant difference between the distribution of predictions and the target values can indicate data or concept drift, model degradation, or bias.

  • Analyze how the model performs across subgroups within the population - if performance differs significantly across demographics, it indicates bias in the model. A small monitoring sketch follows this list.
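The sketch below illustrates both ideas for a binary classifier, assuming ground-truth labels eventually arrive in production: it compares predicted vs. actual positive rates and breaks recall down by a subgroup column. The column names are placeholders.

```python
import pandas as pd
from sklearn.metrics import recall_score

def monitor_outputs(df, pred_col="prediction", label_col="label", group_col="segment"):
    """Per-subgroup recall plus predicted vs. actual positive rates."""
    rows = []
    for group, part in df.groupby(group_col):
        rows.append({
            group_col: group,
            "recall": recall_score(part[label_col], part[pred_col]),
            "predicted_positive_rate": part[pred_col].mean(),
            "actual_positive_rate": part[label_col].mean(),
        })
    return pd.DataFrame(rows)
```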


Monitoring the relationships between input features and output predictions can be challenging for more complex models like neural networks. Techniques like SHAP and LIME help you evaluate these models, understand which features they rely on to generate predictions, and flag whether certain features carry more weight in driving outputs than expected (which can signal bias in the model). A short SHAP example follows the list below.

  • LIME (Local Interpretable Model-Agnostic Explanations) approximates the model locally with a simpler surrogate model that is easier to understand.

  • SHAP (Shapley Additive Explanations) assigns a value to each feature based on its ability to change the prediction from the expected value.

  • Remember that models are not inherently biased - but human beings are generating the data that the model is trained on, and human beings are biased.
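For instance, with the shap library (assuming a fitted model and a sample of the data it was trained on), a global feature-importance view takes only a few lines; LIME offers a similar workflow through lime.lime_tabular.LimeTabularExplainer. This is a sketch, not a full monitoring integration.

```python
import shap

# model, X_train, and X_production_sample are your own objects (placeholders here).
explainer = shap.Explainer(model, X_train)      # build an explainer around the fitted model
shap_values = explainer(X_production_sample)    # explain a sample of recent production rows

# Global view: which features carry the most weight in driving predictions right now.
shap.plots.bar(shap_values)
```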



Takeaway: Comprehensive monitoring across all aspects of the ML system is essential for maintaining model health and performance.



Model Maintenance

Model maintenance is crucial for keeping models fresh and performing within acceptable bounds. It involves a cyclical process of monitoring, action, and evaluation.

  • Monitor the decay rate of your model (all models decay after deployment) - if it's about to cross a threshold, you'll know it's time to retrain or update your model.

  • Deploy the new model version, evaluate its performance, and start the cycle all over again.



Improve model performance by retraining with new data, which updates the weights or coefficients of your model's inputs while keeping the overall structure of the model the same (the hyperparameters and algorithm chosen). This can be done on a fixed schedule or as needed, triggered by the model's performance.

  • Alternatively, improve performance by updating the model entirely - use the new data to redo the modeling, which can include changing the algorithm and hyperparameters before retraining the new version. This can lead to pruning features that are no longer useful for predictions, or finding models with higher performance.


Why retrain? Usually the data in production is more recent, more relevant, and/or more important than the older data the model was trained and tested on (especially true for anything that changes over time, like anything influenced by human behavior).

  • Retraining reduces the impact of data and concept drift, allows the model to reflect changes in its environment, and is a necessary part of maintaining performance as the world the model represents changes.


When to retrain often depends on how automated your data collection and processing pipeline is - if you can automate all of the data collection and processing steps, then triggered retraining is possible. In triggered retraining, your model automatically retrains when performance slips past a designated threshold, using the automatically collected and processed new data to keep your model fresh.

  • Scheduled retraining is done on a recurring schedule (the cadence will depend on your use case). It's a common practice and likely required if parts of your datasets require manual data collection - you'll want to set a schedule to give enough time for the data collection.

  • If you require a scheduled retraining process, it's critical to monitor the decay rate of your model so you can predict when a retraining is necessary, and get ahead of it.

  • In continuous or online learning, the model is retrained each time a new data point (or small batch of data points) comes in. This is helpful for very large datasets, where the data won't fit into memory and batch training is infeasible, so continuous learning is used instead. A sketch of triggered retraining and online learning follows this list.

  • Continuous learning is an advantage for applications that need real-time responsiveness to quick changes in the environment - on social media, trends and topics shift throughout the day, and we want models to keep up with those changes and stay relevant.
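Here is a minimal sketch of both strategies: a triggered retrain that fires when a monitored metric slips past a threshold, and online learning via scikit-learn's partial_fit. The threshold value, data-loading helpers, and batch stream are all hypothetical.

```python
from sklearn.linear_model import SGDClassifier

RECALL_THRESHOLD = 0.80  # assumption: your acceptable performance floor

def maybe_retrain(current_recall, load_fresh_data, train_new_version):
    """Triggered retraining: only retrain once performance crosses the threshold."""
    if current_recall < RECALL_THRESHOLD:
        X, y = load_fresh_data()          # automated collection + processing pipeline
        return train_new_version(X, y)    # returns a new candidate model version
    return None  # still within acceptable bounds

# Online / continuous learning: update the model incrementally as small batches arrive.
online_model = SGDClassifier()
for X_batch, y_batch in stream_of_batches:   # stream_of_batches is a hypothetical data source
    online_model.partial_fit(X_batch, y_batch, classes=[0, 1])
```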


Takeaway: Regular model maintenance is essential for ensuring continued high performance of ML models in production.



Model Versioning

Model versioning is crucial for tracking iterations during development and production, capturing dependencies and performance across versions.

  • Model versioning practices are important during development and in production. Creating effective ML models is an iterative process, so ensuring there's a procedure in place to track versions and dependencies of your model over time will avoid a lot of headaches down the road.



Everything related to your model needs to be tracked. It's important to track versions of the model, the data pipeline, the application, and all of the dependencies between them when you version.

  • There may be a code base for the application that uses a certain version of a model, which uses a certain version of the data pipeline, which uses a certain version of the original data…

  • Versioning a model means capturing details about it - the algorithm used, the input features, the architecture, the hyperparameter choices, the model's dependencies, its performance, and so on. A minimal sketch of such a version record follows this list.
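As a bare-bones illustration, each version's details could be written out as a structured record; tools like MLflow or DVC implement the same idea with much more tooling around it. The file path, feature names, version numbers, and metric values below are made up for the example.

```python
import json
from datetime import datetime, timezone

def save_model_version(path, version, algorithm, features, hyperparameters,
                       pipeline_version, data_version, metrics):
    """Write a JSON record capturing a model version and its dependencies."""
    record = {
        "model_version": version,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "algorithm": algorithm,
        "features": features,
        "hyperparameters": hyperparameters,
        "dependencies": {
            "data_pipeline_version": pipeline_version,
            "training_data_version": data_version,
        },
        "metrics": metrics,
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)

save_model_version("models/fraud_v3.json", version="3.0.0", algorithm="XGBoost",
                   features=["amount", "ticket_type"], hyperparameters={"max_depth": 6},
                   pipeline_version="2.1.0", data_version="2024-06-01",
                   metrics={"recall": 0.87})
```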


Model versioning lets you evaluate the different iterations of a model and choose the best one. It's useful in development so you can determine whether the changes you've made to the model are improving performance or not.

  • Tracking dependencies ensures your model is reproducible - if you have to go back to the previous version of a model, and it had a dependency on a particular version of the data pipeline, you need to know that in order to reproduce your model version. A proper versioning system captures all the information needed to trace your model and understand its dependencies.

  • Versioning ensures everyone on the team is working on the correct version together.

  • Versioning gives you rollback capability if you run into issues with your new model in production - rolling back returns you to the model that was working before the update.


With versioning, you can also test multiple models together, run them in parallel, compare their performance, and move the better performing versions into production.

  • Model versioning lets you run models in parallel to decide which to put into production. A common practice is a shadow release, where the newly retrained model runs offline in parallel with your production model (which is still serving your users).

  • After monitoring the performance of both models and comparing, you can determine if the retrained model has consistently exceeded the performance of the production model - if so, you can move it into production and start monitoring it, initiating the maintenance cycle all over again.

  • A more extreme version of this is champion-challenger testing - you have a reigning champion model in production, and in parallel, challenger models (newer versions that may use different features, data, or algorithms) are evaluated separately.

  • If a challenger model consistently outperforms the champion, the challenger is moved into production as your new champion. You then repeat the process, evaluating new challengers against the reigning champion. A minimal shadow-release sketch follows this list.
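In code, the heart of a shadow release can be as simple as serving the champion's prediction while logging the challenger's for later comparison. This is a sketch; champion and challenger stand in for two versioned models.

```python
import logging

logger = logging.getLogger("shadow_release")

def serve_request(features, champion, challenger):
    """Serve the champion's prediction; score the challenger in the shadows for comparison."""
    champion_pred = champion.predict(features)            # this is what the user receives
    try:
        challenger_pred = challenger.predict(features)    # logged only, never shown to users
        logger.info("champion=%s challenger=%s", champion_pred, challenger_pred)
    except Exception:
        logger.exception("challenger failed; the user is unaffected")
    return champion_pred
```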



Takeaway: Effective model versioning is essential for managing the complexity of ML systems and ensuring reproducibility.



Organizational Considerations

Supporting ML models in production requires ongoing commitment and resources from the team.

  • By now, it should be obvious to business leaders that an ML project is not done once it's in production. Like software, ML projects also need ongoing maintenance and updates.

  • Not only do you need to monitor and maintain the model, you also need to support the users of your end application (as with all products). It's important to be able to explain to users how the model works behind the scenes and how it generates predictions, and to provide users with recourse when there are issues.

  • Examples of recourse include opt-out options, the ability to offer feedback, tiered services with different levels of ML automation, customization of parameters, etc. ChatGPT and Claude are good examples - you can choose which model to use, provide feedback (upvotes and downvotes), customize what ChatGPT should remember about you for convenience, and advanced users can tune temperature settings to fit their needs.


Takeaway: Organizations must recognize that ML projects require continuous support and resources even after initial deployment.



LLM Considerations

Unlike the ML models we've been looking at so far, LLMs have different output metrics and additional needs on top of standard ML products. Instead of Recall or TPR, you're measuring LLM metrics like perplexity or relevance (a perplexity sketch follows below). There are additional ethical considerations with LLMs and Gen AI - you need to manage the risk of inappropriate or offensive content, since users see the model's outputs directly rather than a digested recommendation or classification.
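For reference, perplexity on a text sample can be computed from a causal language model's cross-entropy loss. Below is a minimal sketch using Hugging Face transformers; the gpt2 checkpoint is just an example stand-in for whichever model you're evaluating.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "One-way ticket purchases surged as people returned home for lockdown."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss   # mean token cross-entropy
perplexity = torch.exp(loss).item()   # lower means the model finds the text less surprising
print(f"perplexity: {perplexity:.2f}")
```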


Here's a list of additional Gen AI and LLM performance indicators and considerations:

  • LLM hallucinations

  • Safeguarding against prompt injection and adversarial attacks (e.g. getting an LLM chatbot to agree to sell a car for $1)

  • Updating its knowledge base

  • Context window of the models you're using

  • For multimodal models, track performance on different types of inputs

  • Track and comply with ever-evolving AI regulations

  • Monitor for potential copyright infringement

  • Monitor for privacy issues


Takeaway: LLMs and Gen AI models require unique performance metrics and additional safeguards beyond traditional ML models, including monitoring for hallucinations, adversarial attacks, and evolving ethical and legal considerations.


 

Like this post? Let's stay in touch!

Learn with me as I dive into AI and Product Leadership, and how to build and grow impactful products from 0 to 1 and beyond.


Follow or connect with me on LinkedIn: Muxin Li





© 2024 by Muxin Li
