01 ML system design introduction - bytebytego

Clarifying Requirements
  • Business objective. If we are asked to create a system to recommend vacation rentals, two possible motivations are increasing the number of bookings and increasing revenue.
  • Features the system needs to support. What are some of the features that the system is expected to support which could affect our ML system design? For example, let’s assume we’re asked to design a video recommendation system. We might want to know if users can “like” or “dislike” recommended videos, as those interactions could be used to label training data.
  • Data. What are the data sources? How large is the dataset? Is the data labeled?
  • Constraints. How much computing power is available? Is it a cloud-based system, or should the system work on a device? Is the model expected to improve automatically over time?
  • Scale of the system. How many users do we have? How many items, such as videos, are we dealing with? What’s the rate of growth of these metrics?
  • Performance. How fast must predictions be? Is a real-time solution expected? Which has higher priority: accuracy or latency?

Frame the Problem as an ML Task
  • Defining the ML objective
  • Specifying the system’s input and output
  • Choosing the right ML category
Defining the ML objective

Translate the business objective into a well-defined ML objective. For example, the business goal “improve user engagement” for a video app might translate into the ML objective “maximize the time users spend watching videos.”

Specifying the system’s input and output

State clearly what the system takes as input and what it produces as output. For example, a video recommendation system might take a user as input and output a ranked list of videos. A system may also be built from several ML models, each with its own inputs and outputs.

Choosing the right ML category

Decide which type of ML problem fits best: supervised vs. unsupervised learning and, within supervised learning, classification vs. regression, etc.

Data Preparation

Data sources

Questions to ask about a data source include: Who collected the data? How clean is it? Can the source be trusted? Is the data user-generated or system-generated?

Data storage

It helps to understand, at a high level, how different databases work, so we can choose appropriate storage for the task.

Extract, transform, and load (ETL)

ETL consists of three phases:

  • Extract. This process extracts data from different data sources.
  • Transform. In this phase, data is often cleansed, mapped, and transformed into a specific format to meet operational needs.
  • Load. The transformed data is loaded into the target destination, which can be a file, a database, or a data warehouse.
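As a concrete illustration, here is a minimal ETL sketch in Python. The CSV source, field names, and SQLite target are all hypothetical, chosen only to make the three phases explicit:

```python
import csv
import sqlite3

# Extract: read raw records from a source (here, a hypothetical CSV file).
def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Transform: cleanse and map records into the format the target expects.
def transform(rows):
    cleaned = []
    for row in rows:
        if not row.get("user_id"):  # drop records missing a key field
            continue
        cleaned.append({
            "user_id": int(row["user_id"]),
            "country": row.get("country", "").strip().upper(),  # normalize casing
        })
    return cleaned

# Load: write the transformed records into the target destination.
def load(rows, db_path="warehouse.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS users (user_id INTEGER, country TEXT)")
    conn.executemany(
        "INSERT INTO users (user_id, country) VALUES (:user_id, :country)", rows
    )
    conn.commit()
    conn.close()

load(transform(extract("raw_users.csv")))
```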

Data types

Numerical data

Numerical data is any data represented by numbers. It is divided into continuous numerical data and discrete numerical data.

Categorical data

Categorical data refers to data that is stored and identified by assigned names or labels. It can be divided into two groups: nominal (categories with no intrinsic order, e.g., male/female) and ordinal (categories with a natural order, e.g., small/medium/large).

Feature Engineering

Feature engineering contains two processes:

  • Using domain knowledge to select and extract predictive features from raw data
  • Transforming predictive features into a format usable by the model

Handling missing values

Data in production often has missing values, which can generally be addressed in two ways: deletion or imputation.
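Both approaches are easy to express with pandas. A minimal sketch (the toy DataFrame and fill values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 31], "city": ["NY", "SF", None]})

# Deletion: drop rows (or columns) that contain missing values.
dropped = df.dropna()

# Imputation: fill missing values with a statistic or a default.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())  # mean imputation
imputed["city"] = imputed["city"].fillna("UNKNOWN")            # default value
```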

Feature scaling

Many ML models struggle to learn a task when the dataset’s features span very different ranges.

Normalization (min-max scaling)

Standardization (Z-score normalization)

Log scaling

Discretization (Bucketing)
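A minimal NumPy sketch of these four transformations (the values and bucket boundaries are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 5.0, 10.0, 100.0])

# Normalization (min-max scaling): rescale values to the range [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization (Z-score normalization): zero mean, unit variance.
z_score = (x - x.mean()) / x.std()

# Log scaling: compress long-tailed distributions.
log_scaled = np.log1p(x)  # log(1 + x), which avoids log(0)

# Discretization (bucketing): map continuous values to a small set of bins.
buckets = np.digitize(x, bins=[2.0, 8.0, 50.0])  # bucket indices 0..3
```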

Encoding categorical features

Integer encoding

One-hot encoding

Embedding learning
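A minimal sketch of the first two techniques with pandas, plus the embedding lookup step (the 4-dimensional random table stands in for embeddings that would normally be learned during training):

```python
import numpy as np
import pandas as pd

colors = pd.Series(["red", "green", "blue", "green"])

# Integer encoding: assign each category an integer id.
codes, categories = pd.factorize(colors)  # e.g., red=0, green=1, blue=2

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(colors, prefix="color")

# Embedding learning: map each category id to a dense vector. In practice the
# table is learned during training; a random table stands in for it here.
embedding_table = np.random.randn(len(categories), 4)  # 4-dim embeddings
embedded = embedding_table[codes]                      # lookup by integer id
```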

Talking points

  • Data availability and data collection: What are the data sources? What data is available to us, and how do we collect it? How large is the dataset? How often does new data come in?
  • Data storage: Where is the data currently stored? Is it on the cloud or on user devices? Which data format is appropriate for storing the data? How do we store multimodal data, e.g., a data point that might contain both images and texts?
  • Feature engineering: How do we process raw data into a form that’s useful for the models? What should we do about missing data? Is feature engineering required for this task? Which operations do we use to transform the raw data into a format usable by the ML model? Do we need to normalize the features? Which features should we construct from the raw data? How do we plan to combine data of different types, such as texts, numbers, and images?
  • Privacy: How sensitive are the available data? Are users concerned about the privacy of their data? Is anonymization of user data necessary? Is it possible to store users’ data on our servers, or is it only possible to access their data on their devices?
  • Biases: Are there any biases in the data? If yes, what kinds of biases are present, and how do we correct them?

Model Development

Model selection

Model selection is the process of choosing the ML algorithm and architecture best suited to a predictive modeling problem. In practice, a typical process for selecting a model is to:

  • Establish a simple baseline.
  • Experiment with simple models.
  • Switch to more complex models.
  • Use an ensemble of models if we want more accurate predictions, typically via bagging [3], boosting [4], or stacking [5].

Commonly considered models include:

  • Logistic regression
  • Linear regression
  • Decision trees
  • Gradient-boosted decision trees and random forests
  • Support vector machines
  • Naive Bayes
  • Factorization Machines (FM)
  • Neural networks

Beyond accuracy, other aspects to weigh when choosing a model:

  • The amount of data the model needs to train on
  • Training speed
  • Hyperparameters to choose and hyperparameter tuning techniques
  • Possibility of continual learning
  • Compute requirements. A more complex model might deliver higher accuracy, but might require more computing power, such as a GPU instead of a CPU
  • Model’s interpretability [6]. A more complex model can give better performance, but its results may be less interpretable

Model training then covers:

  • Constructing the dataset
  • Choosing the loss function
  • Training from scratch vs. fine-tuning
  • Distributed training

Takeaway points:

  • Model selection: Which ML models are suitable for the task, and what are their pros and cons? Here’s a list of topics to consider during model selection:
    • The time it takes to train
    • The amount of training data the model expects
    • The computing resources the model may need
    • Latency of the model at inference time
    • Can the model be deployed on a user’s device?
    • Model’s interpretability. Making a model more complex may increase its performance, but the results might be harder to interpret
    • Can we leverage continual training, or should we train from scratch?
    • How many parameters does the model have? How much memory is needed?
    • For neural networks, you might want to discuss typical architectures/blocks, such as ResNet or Transformer-based architectures. You can also discuss the choice of hyperparameters, such as the number of hidden layers, the number of neurons, activation functions, etc.
  • Dataset labels: How should we obtain the labels? Is the data annotated, and if so, how good are the annotations? If natural labels are available, how do we get them? How do we receive user feedback on the system? How long does it take to get natural labels?
  • Model training:
    • What loss function should we choose? (e.g., cross-entropy [15], MSE [16], MAE [17], or Huber loss [18]; see the sketch under Model training below)
    • What regularization should we use? (e.g., L1 [19], L2 [19], entropy regularization [20], K-fold CV [21], or dropout [22])
    • What is backpropagation?
    • You may need to describe common optimization methods [23] such as SGD [24], AdaGrad [25], Momentum [26], and RMSProp [27].
    • What activation functions do we want to use and why? (e.g., ELU [28], ReLU [29], Tanh [30], Sigmoid [31])
    • How do we handle an imbalanced dataset?
    • What is the bias/variance trade-off?
    • What are the possible causes of overfitting and underfitting? How do we address them?
  • Continual learning: Do we want to train the model online with each new data point? Do we need to personalize the model to each user? How often do we retrain the model? Some models need to be retrained daily or weekly, others monthly or yearly.

Model training
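To make several of the choices above concrete (cross-entropy loss, L2 regularization, gradient descent, sigmoid activation), here is a minimal NumPy sketch that trains a logistic-regression model on made-up data; it is an illustration, not a production training loop:

```python
import numpy as np

# Made-up binary classification data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)

w, b = np.zeros(3), 0.0
lr, l2 = 0.1, 1e-3  # learning rate and L2 regularization strength

def sigmoid(z):  # activation function
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(100):
    p = sigmoid(X @ w + b)  # forward pass
    # Cross-entropy loss plus an L2 penalty on the weights.
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9)) \
           + l2 * np.sum(w ** 2)
    # Gradients (backpropagation is trivial for this one-layer model).
    grad_w = X.T @ (p - y) / len(y) + 2 * l2 * w
    grad_b = np.mean(p - y)
    w -= lr * grad_w  # gradient descent update (full-batch for simplicity)
    b -= lr * grad_b
    if epoch % 20 == 0:
        print(f"epoch {epoch}: loss = {loss:.4f}")
```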

Evaluation

Offline evaluation

Evaluating the model during development, typically on held-out test data, before it is deployed.

Online evaluation

Evaluating the model in production, by measuring its impact on live traffic and business metrics.

Talking points

Here are some talking points for the evaluation step:

  • Online metrics: Which metrics are important for measuring the effectiveness of the ML system online? How do these metrics relate to the business objective?
  • Offline metrics: Which offline metrics are good at evaluating the model’s predictions during the development phase? (A minimal sketch of common offline metrics follows this list.)
  • Fairness and bias: Does the model have the potential for bias across different attributes such as age, gender, race, etc.? How would you fix this? What happens if someone with malicious intent gets access to your system?
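For a binary classifier, several common offline metrics can be computed directly from predictions. A minimal sketch with made-up labels:

```python
import numpy as np

# Made-up ground-truth labels and model predictions.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

accuracy = np.mean(y_pred == y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```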

Deployment and Serving
  • Cloud vs. on-device deployment
  • Model compression
  • Testing in production
  • Prediction pipeline

Model compression

Knowledge distillation

Train a small model (the student) to mimic the predictions of a larger model (the teacher).

Pruning

Remove the least useful parameters (e.g., weights close to zero) so the model becomes smaller and sparser.

Quantization

Represent the model’s parameters with fewer bits, e.g., 8-bit integers instead of 32-bit floats.
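A minimal sketch of one possible scheme, post-training 8-bit affine quantization, with NumPy (real deployments would use a framework’s quantization tooling):

```python
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)

# Post-training 8-bit affine quantization: map floats to uint8.
w_min, w_max = weights.min(), weights.max()
scale = (w_max - w_min) / 255.0
zero_point = np.round(-w_min / scale).astype(np.uint8)

quantized = np.clip(np.round(weights / scale) + zero_point, 0, 255).astype(np.uint8)

# Dequantize at inference time; values are approximate but 4x smaller to store.
dequantized = (quantized.astype(np.float32) - zero_point) * scale
```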

Testing in production

Shadow deployment

Deploy the new model in parallel with the existing model; each incoming request is sent to both, but only the existing model’s prediction is served to the user while the new model’s predictions are logged for analysis.

Prediction pipeline

Batch prediction

Predictions are precomputed periodically and stored for later retrieval.

Online prediction

Predictions are generated on demand, as soon as requests arrive.
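A schematic contrast of the two pipelines; ToyModel and its scoring rule are hypothetical placeholders:

```python
class ToyModel:
    """Stand-in for a trained model (hypothetical)."""
    def predict(self, features):
        return sum(features) > 0  # made-up scoring rule

model = ToyModel()
user_features = {"u1": [0.2, 0.5], "u2": [-1.0, 0.3]}

# Batch prediction: precompute periodically, store, and serve via lookup.
precomputed = {u: model.predict(f) for u, f in user_features.items()}

def serve_batch(user_id):
    return precomputed.get(user_id)  # fast and cheap, but can be stale

# Online prediction: compute when the request arrives.
def serve_online(features):
    return model.predict(features)   # always fresh, but latency-sensitive
```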

Talking points

  • Is model compression needed? What are some commonly used compression techniques?
  • Is online prediction or batch prediction more suitable? What are the trade-offs?
  • Is real-time access to features possible? What are the challenges?
  • How should we test the deployed model in production?
  • An ML system consists of various components working together to serve requests. What are the responsibilities of each component in the proposed design?
  • What technologies should we use to ensure that serving is fast and scalable?
Monitoring

Why does a system fail in production?

Data distribution shift

A common cause of failure is data distribution shift: the data the model sees in production drifts away from the distribution it was trained on. Two common mitigations:

  • Train on large datasets. A big enough training dataset lets the model learn a comprehensive distribution, so data points encountered in production are more likely to come from the learned distribution.
  • Regularly retrain the model using labeled data from the new distribution.

What to monitor

Operation-related metrics: These metrics ensure the system is up and running. They include average serving time, throughput, the number of prediction requests, CPU/GPU utilization, etc.

ML-specific metrics:

  • Monitoring inputs/outputs. Models are only as good as the data they consume, so monitoring the model’s inputs and outputs is vital.
  • Drifts. Inputs to the system and the model’s outputs are monitored to detect changes in their underlying distributions (a minimal drift-detection sketch follows this list).
  • Model accuracy. For example, we expect the accuracy to be within a specific range.
  • Model versions. Monitor which version of the model is deployed.
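As one way to monitor input drift, a two-sample Kolmogorov-Smirnov test can compare a feature’s training distribution with its live distribution. A minimal sketch using SciPy (the data and the 0.01 threshold are made up for illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

# Reference: feature values seen during training (hypothetical).
train_feature = np.random.normal(loc=0.0, scale=1.0, size=5000)
# Live: the same feature observed in production, possibly shifted.
live_feature = np.random.normal(loc=0.5, scale=1.0, size=5000)

# Kolmogorov-Smirnov test: a small p-value suggests the distributions differ.
statistic, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"Possible input drift detected (KS statistic = {statistic:.3f})")
```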

Infrastructure