Managing Machine Learning Projects Wk 3 - Data Considerations
- Muxin Li
- Jun 19, 2024
- 10 min read
Updated: Jun 25, 2024
Getting enough relevant, clean data is crucial for ML models to work (although I've heard that running more training cycles and using synthesized data has been working well). This week covers all the ways data can be tricky and gnarly to work with.
What's Covered:
How to plan around your data needs, manage it as you build your model, and plan for collaboration and access with your team.
Key Takeaways:
Data is crucial but easy to get wrong - assess bias risks, outliers, data volume, and missing data, and only collect the data you need based on which features your model requires.
Document, document, document - details about your model, what data you're using, how it's sourced, where it's stored, the pipeline itself, and any data relationships a new team should be aware of. Leverage data lineage mapping to visualize the journey data takes to end up in your model.
Technical terms:
Flywheel Effect
Cold Start Problem
Data Stewardship
Feature Engineering
Feature Selection
Exploratory Data Analysis (EDA)
Data lineage
The AI Maslow Hierarchy of Data Needs - can't build models on the higher tiers without first going through the lower foundational steps with data:

Objectives of this module: Evaluate data needs, strategies for data collection to support modeling, steps in the data pipeline.
Data Needs
Data is needed for both training and testing:
Need data on inputs, and on the outputs (targets you're predicting for)
Historical and ongoing real-time data for making predictions
Needs features, labels for training

To ID which features to input into the model: Ask domain experts, customers and users who have the problem and understand the problem space:
Understand what contributes to the problem, what can solve for the problem, if there are any geospatial or time related needs
Start with a small set of features to establish a baseline; add more features and test them for impact. It's usually better to try a candidate feature than to leave it out - missing features tend to lead to worse outcomes. Take out features that do not help.
Takeaway: Always try features, talk with domain experts who understand the problem well, and have both historical and real-time data on your input features and output targets.
How Much Data?
Usually, the more data the better. You'll usually need far more observations than labeled input features - often an order of magnitude more observations than features.
Consider the number of features and how complex the relationships between the features and the target are - the more complex the relationships, the more data is needed.
Data quality - if there's missing or noisy data, you may need a larger dataset, as you'll be forced to whittle it down to the usable portion
The higher the bar set for the model’s performance, the more data you’ll need
Simpler relationships between data and simpler problems usually only need smaller datasets
Identifying a flower with a few simple features vs being able to translate languages - translation is very complex, needs a lot more data

Takeaway: Collect sufficient historical data with relevant features and labels for training, as well as real-time inputs for predictions. More data is generally better, but consider the number of features, complexity of relationships, data quality, and desired performance.
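As a rough illustration of the "order of magnitude more observations than features" idea above, here is a minimal sketch - the 10x ratio is an assumption and a rule of thumb only, not a hard rule; complex problems typically need far more:

```python
# A quick sanity check on dataset size, assuming a simple 10x
# observations-per-feature rule of thumb (complex relationships need more).
def enough_data(n_observations: int, n_features: int, ratio: int = 10) -> bool:
    """Return True if the dataset meets the observations-per-feature ratio."""
    return n_observations >= ratio * n_features

print(enough_data(n_observations=5_000, n_features=40))  # True: 5,000 >= 400
print(enough_data(n_observations=300, n_features=40))     # False: 300 < 400
```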
Data Collection
Where to get data:
Internal data - user data, log files, business operations, machinery and ERP systems
Customer data - sensors, hardware, web and app data, user behaviors
External 3rd party data - e.g. weather, demographics, social media
Best practices to collect data intentionally: Only gather the data that you need.
More data means more storage and processing costs
Ethics and privacy concerns - ID these early in order to figure out exactly which data you actually need vs do not need
Bias risks - be careful of how and where you’re collecting data from
Data should be representative of the people you want to model
Data needs to be updated when the environment changes; use new data to retrain the model (frequency depends on how fast things change).
Document your data sources and the metadata about the data (attributes, relationships in the data) to reduce the pain of figuring it out later.
The team should understand where the data is coming from, the attributes it has, the relationships that can be found in it
Takeaway: Intentionally collect necessary, representative data while addressing costs, ethics & privacy concerns, bias risks, and update the data as needed.
Getting User Data
Acquiring user data is a popular method - various options to collect:
Surveys
User behaviors (e.g., Google Analytics)
User actions (voting, rankings, reviews)
Acquire user data in a way that is not obtrusive, and not painful to your user - ideally it should be natural and embedded as part of their workflow (e.g., CAPTCHA, StitchFix):
e.g. CAPTCHA image tests for proving you're not a bot give Google information to train their AI on how to identify target observations from images
StitchFix sends weekly outfits to users, who get to decide which ones to keep. The AI learns the user’s personal preferences from these actions.
Creating a flywheel effect from user-generated data - the data feeds into the AI, which improves the AI (and user experience), and opens opportunities to leverage the AI in other products and services
Amazon’s user data on product searches and purchases can lead to features like reordering the product listings, personalized recommendations, and ‘shoppers also bought’ suggestions (an example of a co-occurrence matrix - identifying multiple items that are commonly bought together; a minimal sketch of this idea appears after this list)
Common challenge to getting user data is the cold start problem - if the AI has never interacted with the user before, it has no preexisting data with which it can bring value to the user. Ways to get around this:
Use a simple heuristic, e.g. top shows other users love
Get some early information on the user - e.g. as a new Netflix user, you may be asked to rate a few shows and the algorithm will decide what to recommend
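A minimal sketch of both ideas - a co-occurrence matrix for "shoppers also bought" and a popularity-based fallback for the cold start problem - using a tiny made-up order table (item names and counts are purely illustrative):

```python
import pandas as pd

# Hypothetical order data: one row per (order, item) pair.
orders = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 2, 3, 3, 4],
    "item":     ["tea", "honey", "tea", "mug", "honey", "tea", "mug", "honey"],
})

# Item-by-item co-occurrence: how often two items appear in the same order.
basket = pd.crosstab(orders["order_id"], orders["item"])  # orders x items
co_occurrence = basket.T.dot(basket)                       # items x items counts

def also_bought(item: str, top_n: int = 2) -> list:
    """'Shoppers also bought': most co-purchased items, excluding the item itself."""
    return (co_occurrence[item].drop(item)
            .sort_values(ascending=False).head(top_n).index.tolist())

# Cold-start fallback for a brand-new user: recommend globally popular items.
most_popular = orders["item"].value_counts().head(2).index.tolist()

print(also_bought("tea"))  # e.g. ['honey', 'mug']
print(most_popular)        # e.g. ['tea', 'honey']
```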
Takeaway: Acquire user data seamlessly to drive a positive flywheel effect while adding value and addressing cold-start issues.
Data Governance and Access
Large organizations are often siloed, making data inaccessible to teams. Each department uses their own systems, storing data in different places with different schemas.
Before hiring a big Data Science team, tackle data silos first. Breaking down data silos often involves three things:
Culture of change - executive sponsor to champion opening data access, create incentives for teams to centralize data
Technology - where to centralize data, how to represent it, how users will query it
Data stewardship and access - how to maintain, organize, clean data; ensure users know data is available and how to get access easily
Case studies: Facebook and Spotify
Facebook’s data growth was exploding - the Data Engineering team migrated to Hadoop for storage, but this was inaccessible for most users as it required being able to write Hadoop programs to access the data.
Developing HIVE on top of Hadoop allowed for simpler SQL queries to access the data. Facebook also trained employees to encourage self-service, ran hackathons to encourage ideas on how to use the data.
Spotify had a similar story - they had migrated to Google Cloud’s platform to centralize their data and market research insights. They then released Lexikon to allow users to search and browse the data (through BigQuery).
But Spotify went further and did many internal studies and experiments to continue improving the platform for their data users. They found that many Data Scientists still faced significant challenges finding the right dataset (they typically use 25-30 datasets a month).
The Spotify team ended up improving Lexikon by exploring ways to serve data based on user intent, encourage knowledge sharing between users and experts, and surface ideas to help users get more value from the data set.
They implemented this in several ways - to encourage discovery, they displayed popular datasets, the user's most recently used datasets, datasets used widely by the user's team, and recommendations the user might find helpful. Through experimentation and feedback, they found that simple heuristics based on usage statistics worked quite well, without requiring a more complicated ML model.
For high-intent use cases where the user needed something specific, Lexikon surfaced the ability to search datasets by topic, name, relevance to a project, or whether the user's team had used them.
Dataset descriptions reduced the need to contact experts just to understand what was in the data, but users still found it valuable to connect with experts and learn what else they should be exploring. The Spotify team encouraged this by creating expertise mapping so users can find an SME for a topic, and incorporated Slack for collaboration.
Exploring a new dataset can be overwhelming - the team encouraged exploration by adding usage statistics for schema fields (such as the number of queries or the number of unique users who queried a field), examples of queries others have run against the dataset, and commonly joined tables.
Takeaway: Data governance and access are crucial for successful machine learning projects. Breaking down data silos requires a multi-faceted approach, including cultural change, technology implementation, and effective data stewardship.
Data Cleaning
Messy data is a significant issue - anomalies, missing data, incorrect mapping of different data sources. The type of missing data (missing completely at random, missing at random, missing not at random) impacts the potential for bias in the model. Dealing with missing data involves removing, flagging, replacing, or inferring missing values. Outliers, which can be extreme or context-dependent, can be detected using statistical tests and visualizations.
Missing Data and Bias
Why the data is missing is a significant factor in bias risk: in completely random events (e.g. power outages), a blip in the data is unlikely to have any pattern that skews your model. But with sampling errors, where the data is not representative of the real population, or with feature-related patterns (like user reviews, where negative reviews are far more common than positive ones), there is often significant bias in the dataset that will need to be managed.


For most time series data, like sensor data, replace missing values with the previous observation in the series ('forward-filling') or with the next observation ('back-filling')
Alternatively, infer a pattern from the rest of the data using a simple regression model and use it to fill in the missing value
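A minimal sketch of these options in pandas, assuming a small hypothetical sensor series (the values and column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly sensor readings with gaps.
readings = pd.Series([21.0, 21.4, np.nan, 22.1, np.nan, 23.0])

forward_filled = readings.ffill()  # carry the previous observation forward
back_filled = readings.bfill()     # or pull the next observation backward

# Flag missingness as its own feature before filling, in case "missing"
# itself carries signal, then infer values from neighbouring points
# (interpolation stands in here for a simple regression-style imputation).
df = readings.to_frame("temp")
df["temp_was_missing"] = df["temp"].isna()
df["temp"] = df["temp"].interpolate()
```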
Handling Outliers
Outliers can overly influence your model because they carry disproportionate weight - however, they're not always obvious, and they shouldn't always be removed.
Is the outlier a real-world observation, or is it an error or an anomaly? This depends on the context - if there's a sudden drop in temperature and it suddenly rises back up again, that's an anomaly.
Outliers that are errors can be treated by removing them or adjusting the value, inferring what it should have been with a simple regression model.
Use statistical tests and especially visualizations to help you assess outliers.
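A minimal sketch of outlier detection using the common IQR rule, assuming a small made-up temperature series (pairing the statistic with a quick plot is still the recommended habit):

```python
import pandas as pd

temps = pd.Series([21.2, 21.5, 21.8, 22.0, 2.1, 21.9, 22.3])  # one suspicious dip

# IQR rule: flag points far outside the middle 50% of the data.
q1, q3 = temps.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = temps[(temps < q1 - 1.5 * iqr) | (temps > q3 + 1.5 * iqr)]
print(outliers)  # flags the 2.1 reading

# Always pair the statistics with a visual check, e.g.:
# temps.plot.box()
```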
Takeaway: Handle missing data based on the nature of its "missingness" to mitigate bias, and treat outliers based on whether they are genuine observations or errors. Use appropriate strategies like removal, replacement, or adjustment.
Preparing for Data Modeling
Preparing data for modeling involves cleaning, exploratory data analysis (EDA), feature engineering, feature selection, and data transformation. EDA helps identify issues and understand trends, while feature engineering involves building or creating features. Feature selection reduces complexity and overfitting. Data transformation ensures features are on the same scale and converts categorical variables into numerical codes.
Get familiar with your data - always go through an Exploratory Data Analysis (EDA):
Statistics and visualizations to understand data distributions, relationships between features and output targets
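A minimal EDA sketch in pandas, assuming a hypothetical training_data.csv file and column names:

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical dataset

df.info()                          # column types, non-null counts (spot missing data)
print(df.describe())               # distributions: mean, spread, min/max (spot outliers)
print(df.corr(numeric_only=True))  # pairwise relationships between numeric features

# Quick visual checks (requires matplotlib):
# df.hist(bins=30)
# df.plot.scatter(x="some_feature", y="target")
```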
Feature Engineering
Building or creating features to use in modeling, and selecting the right ones (using the wrong features will never produce good results, no matter how great the model is).
Feature selection: downsizing feature set to optimal size to reduce complexity, improve training time and interpretability, reduce overfitting risks.

Methods: Run correlations, test different feature sets with models, analyze model to ID top contributing features. Missing a key feature can be significant.
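A minimal sketch of two common feature-selection approaches in scikit-learn, using synthetic stand-in data (make_classification is only a placeholder for your own features and labels):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data; in practice, use your own features and labels.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

# Filter approach: keep the k features most associated with the target.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))

# Model-based approach: rank features by a fitted model's importances.
model = RandomForestClassifier(random_state=0).fit(X, y)
ranked = sorted(enumerate(model.feature_importances_), key=lambda p: p[1], reverse=True)
print("Top features by importance:", [i for i, _ in ranked[:5]])
```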
After feature selection, it’s time to transform the data to be usable for modeling. This usually means scaling all the features to the same order of magnitude, and converting categorical text values into numerical codes.
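A minimal transformation sketch, assuming a small made-up customer table with one categorical column:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "income": [42_000, 95_000, 61_000],
    "age": [23, 54, 37],
    "plan": ["basic", "premium", "basic"],  # categorical column
})

# Put numeric features on the same scale (mean 0, unit variance).
df[["income", "age"]] = StandardScaler().fit_transform(df[["income", "age"]])

# Convert the categorical column into numerical (one-hot) codes.
df = pd.get_dummies(df, columns=["plan"])
print(df)
```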
Takeaway: Proper data preparation is essential for successful machine learning modeling. It involves cleaning, understanding the data through EDA, creating relevant features, selecting the most important ones, and transforming the data into a suitable format for modeling.
Reproducibility and Versioning
Reproducibility and versioning are crucial for successful machine learning projects. Reproducibility ensures that results can be replicated, aiding in debugging, knowledge transfer, and establishing credibility. Versioning involves tracking changes in models, code, data, and pipelines, facilitating debugging, data migrations, and compliance.
Include version control for the model, the codebase, the data, and the pipeline - this allows for faster debugging when you need to pinpoint which version is causing an issue, or which version is performing better.
Document and Visualize for Reproducibility
Document the functionality, dependencies of the model, the data, the code.
A good practice is data lineage - tracking the data from its source to consumption, documenting its transformations and locations throughout the data pipeline.
It enables debugging, simplifies data migrations, inspires trust in data, and helps meet compliance requirements
Data lineage maps visualize data flows at different levels, from high-level overviews to detailed step-by-step diagrams
As part of your data lineage diagram, there may be additional detailed diagrams of each step in the data lineage
It’s a good idea to document and track the things you need for your model and where they sit:
Certain industries like finance require data lineage practices
Extract, Transform, Load (ETL) operations are common - moving data from its raw format into a data warehouse can cause issues for your model if you haven’t been keeping track of your model’s data dependencies
Software tools exist to help with tracking, but simple spreadsheets and diagramming tools can help as well.
Paid tools like Weights & Biases
ML platforms as a service like H2O
Open-source solutions like MLFlow and DVC
Manually logging your metadata as you go
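As an illustration of the "manually logging your metadata" option, here is a minimal sketch that appends one record per training run to a JSON-lines file; the file name, fields, and the use of a git commit hash plus an MD5 hash of the data file are assumptions, not a prescribed format:

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def log_run(data_path: str, params: dict, metrics: dict, log_file: str = "runs.jsonl") -> None:
    """Append one training run's metadata (code, data, params, results) to a log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "git_commit": subprocess.run(                       # code version
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip(),
        "data_path": data_path,
        "data_hash": hashlib.md5(Path(data_path).read_bytes()).hexdigest(),  # data version
        "params": params,
        "metrics": metrics,
    }
    with open(log_file, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example usage (hypothetical paths and values):
# log_run("data/train.csv", {"n_estimators": 200}, {"auc": 0.91})
```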
Takeaway: Reproducibility and versioning are essential for maintaining the integrity and reliability of machine learning projects. They ensure transparency, facilitate collaboration, and enable future debugging and improvements - data lineage diagramming is a valuable tool to visualize the data's journey.
Conclusions
Collecting sufficient amounts of relevant, good quality data with the right features is the most important factor in executing a successful ML project. Ensure data is unbiased, representative, and updated when the environment changes to retrain models. Clean and make data accessible, with good practices for collaboration, version control, and reproducibility (document everything).
Like this post? Let's stay in touch!
Learn with me as I dive into AI and Product Leadership, and how to build and grow impactful products from 0 to 1 and beyond.
Follow or connect with me on LinkedIn: Muxin Li