Yandex Releases Yambda Dataset with 5 Billion Music Events to Fuel Next-Gen Recommender Research

In a major move for recommendation research, Yandex has opened up the Yambda dataset, the largest publicly available collection of user interactions for recommender-system development. The archive tallies nearly 5 billion anonymized records drawn from Yandex Music, a flagship streaming service with over 28 million monthly active listeners. With its vast reach and detail, the release aims to narrow the gap between lab-scale studies and systems that run at commercial scale, letting research teams experiment with real traffic data without the legal and technical hurdles that typically limit access. It offers a resource both broad enough for industry tasks and transparent enough for scholarly analysis.

Recommender engines power experiences across online retail, social media and digital entertainment. They sift through vast streams of user signals—clicks, plays, likes and skips—to infer preferences and suggest items that match individual tastes. The quality of recommendations hinges heavily on the volume and diversity of these signals, which feed machine learning pipelines that rank, filter or generate relevant content.

Even with rapid advances in artificial intelligence fields like language and vision, recommendation research has lacked open, large-scale datasets. Language models train on public text from the web, but recommendation methods require private behavioral logs that firms tend to protect as strategic assets. That means most academic work relies on limited public dumps or synthetic sets that fail to reflect true user journeys.

Several public collections exist—Spotify’s Million Playlist Dataset, the Netflix Prize logs and Criteo’s click records—but each has drawbacks. Some offer only coarse time bins, others cover few users or omit metadata. Documentation may be incomplete, limiting use in production-quality experiments. Yambda overcomes these issues by delivering a comprehensive, well-annotated stream, complete with safeguards for user privacy.

Yambda records 4.79 billion anonymized events captured over ten months. The stream traces actions from nearly 1 million listeners engaging with about 9.4 million unique tracks. Every play, like, dislike and the removal of such feedback appears as a discrete event. Implicit feedback comes from listening activity. Explicit signals include “likes,” “dislikes” and undo operations.

The dataset also carries rich auxiliary features. Anonymized audio embeddings, generated by convolutional neural networks, give each track a vector that encodes its sonic profile. A field labeled “is_organic” flags whether a user reached a track by direct navigation or via a system suggestion. Precise timestamps keep events in strict order, a must for any model that learns from user sequences.

All user and item identifiers are replaced with numeric codes, stripping out personal details to meet privacy standards. The data ships in Apache Parquet, a columnar storage format that works smoothly with big data engines such as Apache Spark and Hadoop and loads directly into analytics libraries like Pandas and Polars without extra conversion.

In typical recommendation research, experts use a Leave-One-Out strategy: each user’s final event is hidden for testing while models train on the rest. That breaks the time axis, giving models a peek at future patterns during training. Yandex suggests a Global Temporal Split (GTS) instead. GTS divides records by timestamp, training on earlier events and testing on later ones, mirroring a system that never sees future feedback. Under this scheme, researchers can calibrate hyperparameters on a validation window drawn from mid-range timestamps, then test on the final segment, eliminating overlap that could bias results.
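The split itself reduces to two timestamp cutoffs. Here is a toy sketch in Pandas (column names and cutoff values are illustrative assumptions):

```python
import pandas as pd

# Toy interaction log; timestamps are arbitrary units.
events = pd.DataFrame({
    "uid":       [1, 1, 1, 2, 2, 2],
    "item_id":   [10, 11, 12, 10, 13, 14],
    "timestamp": [100, 400, 900, 200, 500, 950],
})

# Global Temporal Split: everything before t_val trains, the window
# [t_val, t_test) validates, and events from t_test onward test.
t_val, t_test = 400, 900
train = events[events["timestamp"] < t_val]
val   = events[(events["timestamp"] >= t_val) & (events["timestamp"] < t_test)]
test  = events[events["timestamp"] >= t_test]

print(len(train), len(val), len(test))  # 2 2 2
```

Because the cutoffs are global rather than per-user, no training example can postdate any test example, which is exactly the deployment constraint the article describes.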

A chronology-aware split reveals how algorithms behave under real deployment constraints. Teams avoid overly optimistic results and gain a clearer sense of an approach’s practical value before rolling it into live products. Chronological testing helps identify performance drops on seasonality shifts or sudden trend changes, bringing to light weaknesses that random splits might hide. This becomes critical when user interests shift over holidays or after new track releases.

Beyond the split method, Yandex ships baseline models to jump-start experiments. MostPop ranks items by global play count. DecayPop applies an exponential decay to historical play records, downweighting older hits. ItemKNN computes a similarity score between tracks based on co-occurrence in user listening histories, enabling neighborhood-based filtering. Implicit Alternating Least Squares, or iALS, factorizes the user-item matrix under an implicit-feedback assumption to capture latent preferences. Bayesian Personalized Ranking, known as BPR, treats recommendation as a pairwise ranking problem, optimizing a loss that orders relevant items above irrelevant ones. SANSA is a scalable sparse autoencoder that approximates linear item-item models such as EASE while keeping memory use manageable on large catalogs. SASRec brings a transformer-based design, using self-attention over previous interactions to predict the next item.
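The two simplest baselines, MostPop and DecayPop, fit in a few lines. This is a sketch of the general idea, not Yandex’s reference implementation; column names and the half-life value are assumptions:

```python
import pandas as pd

# Toy play log (column names are illustrative assumptions).
plays = pd.DataFrame({
    "item_id":   [10, 10, 11, 12, 12, 12],
    "timestamp": [1, 2, 3, 4, 5, 6],
})

# MostPop: rank items by global play count.
most_pop = plays["item_id"].value_counts().index.tolist()
print(most_pop[:2])  # [12, 10]

# DecayPop: exponentially downweight older plays so recent hits rise.
# Each play's weight halves every `half_life` time units of age.
now, half_life = 6.0, 2.0
weights = 0.5 ** ((now - plays["timestamp"]) / half_life)
decay_pop = weights.groupby(plays["item_id"]).sum().sort_values(ascending=False)
print(decay_pop.index.tolist())  # [12, 10, 11]
```

Trivial as they look, such popularity baselines are notoriously hard to beat on implicit-feedback data, which is why they anchor the benchmark suite.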

Yandex provides standard evaluation routines for NDCG@k, Recall@k and Coverage@k. NDCG@k, or Normalized Discounted Cumulative Gain at rank k, evaluates how well the top-k positions in a recommendation list align with ground truth, giving higher credit to hits in earlier slots. Recall@k checks whether relevant items appear anywhere among the top k suggestions. Coverage@k tracks the portion of the entire item catalog that algorithms are capable of recommending across all users, an indicator of catalog diversity and risk of over-specialization. Teams can run these evaluations at different k values—such as k = 5, 10 or 20—to see how recommendation quality shifts as list length varies. These metrics help teams gauge new methods against established baselines with minimal setup.
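Using the standard binary-relevance definitions of these metrics, a minimal sketch (not Yandex’s evaluation code) looks like this:

```python
import math

def ndcg_at_k(recommended, relevant, k):
    # DCG with binary relevance; a hit at rank i earns 1/log2(i+2).
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    # Ideal DCG: all relevant items packed into the top ranks.
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

def recall_at_k(recommended, relevant, k):
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def coverage_at_k(all_recommendations, catalog_size, k):
    # Share of the catalog that appears in anyone's top-k list.
    seen = {item for recs in all_recommendations for item in recs[:k]}
    return len(seen) / catalog_size

recs, truth = [3, 1, 4, 1, 5], {1, 5}
print(recall_at_k(recs, truth, 3))                        # 0.5
print(round(ndcg_at_k(recs, truth, 3), 3))                # 0.387
print(coverage_at_k([recs, [1, 2, 6]], catalog_size=10, k=3))  # 0.5
```

Note the trade-off the three metrics capture together: NDCG rewards ranking relevant items early, Recall only asks whether they appear at all, and Coverage guards against a system that recommends the same narrow slice of the catalog to everyone.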

Though sourced from a music service, Yambda’s scale and structure make it relevant far beyond streaming. Retailers testing product suggestions, video platforms refining watch feeds and social networks engineering content discovery can all adapt its patterns. The data’s size ensures experiments reflect traffic at a scale that was once possible only within large tech firms. Models tested on Yambda can be adapted across industries, making it a go-to testbed for tasks beyond its music roots.

Academic groups can test hypotheses under conditions matching real systems. Startups and small businesses gain access to data volumes long locked behind corporate doors. Listeners and users at large can expect smarter suggestions, quicker matches and deeper personalization as researchers leverage this open resource.

Within Yandex Music, a recommender called My Wave stitches together layers of deep learning pipelines. It examines a listener’s action sequence along with saved preferences such as mood or language. On the audio side, My Wave analyzes inputs in real time—scrutinizing spectrogram shapes, rhythmic elements, vocal timbre, frequency balances and subgenres—to align new tracks with each individual’s taste. The pipeline ingests thousands of factors drawn from user profiles, context signals like time of day or device type, and content descriptors derived from neural audio analysis. By blending collaborative and content-based signals, My Wave aims to surface tracks that listeners would be unlikely to find through simple search or playlist browsing. Models update dynamically, adapting to fresh inputs and evolving tastes.

Privacy underpins every part of Yambda. All fields that might identify an individual—such as IP addresses, device IDs or profile attributes—are omitted or hashed behind numeric codes. The dataset contains only interaction records and encoded features, guaranteeing researchers never see personal identifiers.

Yandex delivers three dataset sizes to fit diverse computing resources: a small tier of roughly 50 million events, a medium edition of about 500 million, and the full release of nearly 5 billion interactions. All three are hosted on Hugging Face, streamlining download and integration into existing workflows. Each variant arrives pre-sliced by time windows, letting teams apply the Global Temporal Split immediately and use data loaders to stream or ingest subsets on demand.

With Yambda now available, recommender research gains a powerful resource. A massive scale of anonymized interactions comes paired with chronology-aware testing and ready-to-use benchmarks. The offering should accelerate progress on systems that serve personalized content, cutting development time and helping new ideas prove themselves under realistic conditions.
