How LinkedIn Generates Data for Course Recommendation


At the beginning, the main goal of Course Recommendations is to acquire new learners by showing them highly relevant courses. There are a few challenges.

Lack of label data: if we had user activities (browse, click) available, we could use these signals as implicit labels to train a supervised model. Since we are just building this LinkedIn Learning system, we don't have any engagement signals yet. This is also called the cold-start problem.

One way to deal with this is to rely on a user survey during the onboarding process, i.e., ask learners which skills they want to learn or improve. In practice, this is usually insufficient.

Let's take a look at one example: learner Khang Pham has the skills BigData, Database, and Data Analysis in his LinkedIn profile. Assume we have two courses, Data Engineering and Accounting 101. Should we recommend the Data Engineering course or the Accounting course? Clearly, Data Engineering is the better recommendation because it's more relevant to this user's skill set. This leads us to one idea: we can use skills as a way to measure relevance. If we can map learners to skills and map courses to skills, we can measure relevance and rank courses accordingly.
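
As a minimal sketch of this idea (assuming both mappings are already available as skill sets; the function name and example data are purely illustrative), we can score a course for a learner by the overlap between the learner's skills and the course's skills:

```python
def skill_overlap_score(learner_skills: set, course_skills: set) -> float:
    """Jaccard similarity between a learner's skills and a course's skills."""
    if not learner_skills or not course_skills:
        return 0.0
    intersection = learner_skills & course_skills
    union = learner_skills | course_skills
    return len(intersection) / len(union)

learner = {"BigData", "Database", "Data Analysis"}
courses = {
    "Data Engineering": {"BigData", "Database", "SQL"},
    "Accounting 101": {"Accounting", "Bookkeeping"},
}

# Rank courses by relevance to the learner's skill set.
ranked = sorted(courses, key=lambda c: skill_overlap_score(learner, courses[c]), reverse=True)
print(ranked)  # ['Data Engineering', 'Accounting 101']
```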

Course to Skill: Cold Start Model
There are various techniques to build the mapping from scratch.

  1. Manual tagging using taxonomy (A). All LinkedIn Learning courses are tagged with categories. We asked taxonomists to map categories to skills. This approach gives us a high-precision, human-generated course-to-skill mapping. On the other hand, it doesn't scale, i.e., coverage is low.
  1. Leverage LinkedIn skill taggers (B): use the LinkedIn Skill Taggers feature to extract skill tags from course data.
  1. Use a supervised model: train a classification model such that for a given (course, skill) pair it returns 1 if the pair is relevant and 0 otherwise (a sketch follows this list).
    1. Label data: collect samples from A and B as positive training data. We then randomly sample from our data to create negative labels. We want our training dataset to be balanced.
    1. Features: course data (title, description, categories, section names, video names). We also leverage skill-to-skill similarity mapping features.
    1. Disadvantages: a) it relies heavily on the quality of the skill taggers; b) a single logistic regression model might not be able to capture per-skill effects.
  1. Use semi-supervised learning.
    1. We learn a different model for each skill, as opposed to one common model for all (course, skill) pairs.
    1. Data augmentation: leverage the skill-correlation graph to add more positive labels. For example, if SQL is highly correlated with the Data Analysis skill, then for a course labeled with SQL we can also add (course, Data Analysis) as a positive label (see the augmentation sketch after this list).
  1. Evaluation: offline metrics
    1. Skill-coverage: measure how many LinkedIn standardized skills are present in the mapping.
    1. Precision and Recall: we treat the human-generated course-to-skill mapping as ground truth and evaluate our classification models using precision and recall.
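
Below is a minimal sketch of the supervised (course, skill) classifier and its offline evaluation. It assumes positive pairs are already collected from A and B; the feature set is simplified to TF-IDF over course text plus the skill name, and all data and names are illustrative, not LinkedIn's actual pipeline.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Positive (course_text, skill) pairs collected from manual tagging (A)
# and the skill tagger (B); contents are made up for illustration.
positive_pairs = [
    ("Intro to SQL for analysts: queries, joins, window functions", "SQL"),
    ("Data engineering with Spark: pipelines, partitions, storage", "BigData"),
    ("Accounting 101: ledgers, balance sheets, bookkeeping basics", "Accounting"),
]
all_skills = sorted({skill for _, skill in positive_pairs})

# Create negatives by pairing each course with a randomly drawn other skill,
# keeping the dataset roughly balanced.
random.seed(0)
negative_pairs = []
for text, skill in positive_pairs:
    wrong_skill = random.choice([s for s in all_skills if s != skill])
    negative_pairs.append((text, wrong_skill))

pairs = [(t, s, 1) for t, s in positive_pairs] + [(t, s, 0) for t, s in negative_pairs]

# Featurize each (course, skill) pair as TF-IDF of "course text + skill name".
texts = [f"{t} {s}" for t, s, _ in pairs]
labels = [y for _, _, y in pairs]
X = TfidfVectorizer().fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=0, stratify=labels
)
clf = LogisticRegression().fit(X_train, y_train)

# Offline evaluation against the human-generated mapping used as ground truth.
y_pred = clf.predict(X_test)
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
```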
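
And a minimal sketch of the data-augmentation step for the semi-supervised approach, assuming a skill-correlation graph is available as pairwise scores (the threshold and data below are made up):

```python
# Skill-correlation scores, e.g., mined from skill co-occurrence; illustrative values.
skill_correlation = {
    ("SQL", "Data Analysis"): 0.9,
    ("SQL", "Accounting"): 0.1,
}

def augment_positives(positive_pairs, skill_correlation, threshold=0.8):
    """Add (course, related_skill) positives when the related skill is highly
    correlated with a skill the course is already labeled with."""
    augmented = set(positive_pairs)
    for course, skill in positive_pairs:
        for (a, b), score in skill_correlation.items():
            if score >= threshold and skill in (a, b):
                related = b if skill == a else a
                augmented.add((course, related))
    return augmented

positives = {("Intro to SQL for analysts", "SQL")}
print(augment_positives(positives, skill_correlation))
# Adds ('Intro to SQL for analysts', 'Data Analysis') as a positive (set order may vary).
```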

Member to Skill

  1. Member to skill via profile: LinkedIn users can add skills to their profile by entering free-form text or choosing existing standardized skills. This mapping is usually noisy and needs to be standardized. In practice, the coverage is not high since not many users provide this data. We also train a supervised model p(user_free_form_skill, standardized_skill) to provide a score for the mapping.
  1. Member to skill using title and industry: in order to increase coverage, we can use a cohort-level mapping. For example, user Khang Pham works in the Ad Tech industry with the title Machine Learning Engineer, and he didn't provide any skills in his profile. We can rely on the cohort of Machine Learning Engineers in Ad Tech to infer this user's skills. We then combine the profile-based mapping with the cohort-based mapping using a weighted combination (see the sketch below).
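
A minimal sketch of that weighted combination, assuming both mappings produce per-skill scores in [0, 1] and that the weight is a tunable hyperparameter (all names and numbers here are illustrative):

```python
def combine_skill_scores(profile_scores: dict, cohort_scores: dict,
                         profile_weight: float = 0.7) -> dict:
    """Weighted combination of profile-based and cohort-based member-to-skill scores."""
    skills = set(profile_scores) | set(cohort_scores)
    return {
        skill: profile_weight * profile_scores.get(skill, 0.0)
               + (1.0 - profile_weight) * cohort_scores.get(skill, 0.0)
        for skill in skills
    }

# The member provided no profile skills, so only the cohort-based scores contribute.
profile = {}
cohort = {"Machine Learning": 0.9, "Python": 0.8, "BigData": 0.6}
print(combine_skill_scores(profile, cohort))
```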

How to Split Train/Test Data

This consideration is often overlooked but is very important in a production environment. In forecasting or any time-dependent use case, it's important to respect chronological order when you split train and test data. For example, it doesn't make sense to use data from the future to "forecast" data in the past. In a sales forecast use case, we want to forecast sales for each store. If we randomly split the data by storeID, the train data might not contain data for some stores, so the model can't forecast for those stores. In practice, we need to split the data so that each storeId appears in the train data as well as the test data.

Sliding Window

First, we select data from day 0 to day 60 as the train set and day 61 to day 90 as the test set.
Then, we select data from day 10 to day 70 as the train set and day 71 to day 100 as the test set.

Expanding Window

First, we select data from day 0 to day 60 as the train set and day 61 to day 90 as the test set.
Then, we select data from day 0 to day 70 as the train set and day 71 to day 100 as the test set.
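
A minimal sketch of both splitting strategies, assuming the data is indexed by day number and using half-open day ranges close to the example above (the function names and default window sizes are made up for illustration):

```python
def sliding_window_splits(num_days, train_size=60, test_size=30, step=10):
    """Yield (train_days, test_days) with both windows moving forward by `step`."""
    start = 0
    while start + train_size + test_size <= num_days:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += step

def expanding_window_splits(num_days, train_size=60, test_size=30, step=10):
    """Yield (train_days, test_days) where the train window always starts at day 0."""
    end = train_size
    while end + test_size <= num_days:
        train = list(range(0, end))
        test = list(range(end, end + test_size))
        yield train, test
        end += step

# Sliding:   train [0, 60), test [60, 90); then train [10, 70), test [70, 100); ...
# Expanding: train [0, 60), test [60, 90); then train [0, 70),  test [70, 100); ...
for train, test in expanding_window_splits(num_days=100):
    print(f"train days {train[0]}-{train[-1]}, test days {test[0]}-{test[-1]}")
```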

Retraining Requirements

Retraining is a requirement in many tech companies. In practice, the data distribution is a nonstationary process, so the model does not perform well without retraining. In AdTech and recommendation/personalization use cases, it’s important to be able to retrain models to capture changes in users’ behavior and trending topics. So the machine learning engineers need to make the training pipeline run fast and scale well with big data. When you design such a system, you need to balance between model complexity and training time. The common design pattern is to have a scheduler retrain the model on a regular basis, usually many times per day.
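
As a minimal sketch of that pattern, here is a periodic retraining loop. The load_recent_data, train_model, and deploy_model helpers are hypothetical placeholders, and a production system would typically use a real scheduler (cron, Airflow, etc.) rather than a sleep loop:

```python
import time
from datetime import datetime, timedelta

RETRAIN_INTERVAL = timedelta(hours=6)  # e.g., retrain several times per day

def retraining_loop(load_recent_data, train_model, deploy_model):
    """Retrain the model on a fixed schedule to track nonstationary data."""
    while True:
        started = datetime.utcnow()
        data = load_recent_data(window=timedelta(days=30))  # recent engagement data
        model = train_model(data)
        deploy_model(model)
        # Sleep until the next scheduled retraining run.
        elapsed = datetime.utcnow() - started
        time.sleep(max(0.0, (RETRAIN_INTERVAL - elapsed).total_seconds()))
```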

Four Levels of Retraining

Level 0: Train and forget. Train the model once and never retrain it again. This is appropriate for a "stationary" problem.
Level 1: Cold-start retraining. Periodically retrain the whole model on a batch dataset.
Level 2: Warm-start retraining. If the model has personalized per-key components, retrain only these in bulk on data specific to each key (e.g., all impressions of an advertiser's ads) once enough data has accumulated.
Level 3: Nearline retraining. Similar to level 2, we retrain the per-key components individually, but asynchronously and nearline on streaming data.