Meet Yambda: The world's largest event data set to speed up recommendation systems

Yandex has recently made a significant contribution to the recommender systems community by releasing Yambda, the world’s largest publicly available data set for recommendation system research and development. This data set is designed to bridge the gap between academic research and industry applications, and offers nearly 5 billion anonymized user interaction events from Yandex Music, one of the company’s flagship streaming services with over 28 million monthly users.

Why Yambda Matters: Addressing a Critical Data Gap in Recommendation Systems

Recommendation systems power the personalized experiences of many digital services today, from e-commerce and social networks to streaming platforms. These systems depend heavily on massive amounts of behavioral data, such as clicks, likes and listens, to infer user preferences and deliver tailor-made content.

However, recommendation systems have lagged behind other AI domains, such as natural language processing, largely because of a lack of large, openly available data sets. Unlike large language models (LLMs), which learn from publicly available text sources, recommendation systems need sensitive behavioral data that is commercially valuable and difficult to anonymize. As a result, companies have traditionally guarded this data closely and limited researchers’ access to real-world data sets.

Existing data sets such as Spotify’s Million Playlist Dataset, the Netflix Prize data and Criteo’s click logs are either too small, lack temporal detail or are too poorly documented for developing production-grade recommendation models. Yandex’s release of Yambda addresses these challenges by providing a comprehensive, high-quality data set with a rich set of features and anonymization safeguards.

What Yambda contains: scale, richness and privacy

The Yambda data set includes 4.79 billion anonymized user interactions collected over a period of 10 months. These events come from approximately 1 million users interacting with almost 9.4 million tracks on Yandex Music. The data set includes:

User Interactions: Both implicit feedback (listening) and explicit feedback (likes, dislikes and their removal).
Anonymized audio embeddings: Vector representations of tracks derived from convolutional neural networks, allowing models to exploit features of the audio content.
Organic interaction flag: An “is_organic” flag indicates whether users discovered a track independently or through recommendations, which facilitates behavioral analysis.
Precise time stamps: Each event is time stamped to maintain temporal order, crucial for modeling sequential user behavior.

All user and track identifiers are anonymized using numeric IDs to comply with privacy standards, ensuring that no personally identifiable information is exposed.
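To make the structure of these events concrete, here is a minimal Python sketch of what a single interaction record might look like and how the organic flag could be used. Only `is_organic` is named in the description above; the other field names, the event type strings and the helper `organic_listen_share` are assumptions made for illustration, not the data set's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One interaction event. Field names other than is_organic are assumed."""
    user_id: int      # anonymized numeric user ID
    item_id: int      # anonymized numeric track ID
    timestamp: int    # event time, arbitrary integer units in this sketch
    event_type: str   # assumed labels, e.g. "listen", "like", "dislike"
    is_organic: bool  # True if the user found the track without a recommendation


def organic_listen_share(events):
    """Return, per user, the fraction of listen events flagged as organic."""
    listens = defaultdict(int)
    organic = defaultdict(int)
    for e in events:
        if e.event_type != "listen":
            continue  # explicit feedback is ignored in this statistic
        listens[e.user_id] += 1
        organic[e.user_id] += int(e.is_organic)
    return {user: organic[user] / listens[user] for user in listens}


# Synthetic sample data, not taken from Yambda.
sample = [
    Event(1, 10, 100, "listen", True),
    Event(1, 11, 105, "listen", False),
    Event(1, 11, 106, "like", False),
    Event(2, 10, 101, "listen", False),
]
shares = organic_listen_share(sample)
print(shares)
```

A statistic like this is one way to separate behavior driven by the recommender from behavior the user initiated, which is the kind of analysis the flag is meant to support.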

The data set is provided in Apache Parquet format, which is optimized for big data processing frameworks such as Apache Spark and Hadoop and is also compatible with analytical libraries such as Pandas and Polars. This makes Yambda accessible to researchers and developers working in different environments.

Evaluation Method: Global Temporal Split

An important innovation in Yandex’s data set is the adoption of a Global Temporal Split (GTS) evaluation strategy. In typical recommendation system research, the widely used Leave-One-Out method holds out each user’s last interaction for testing. However, this approach disrupts the temporal continuity of user interactions and creates unrealistic training conditions.

GTS, on the other hand, splits the data by timestamp and keeps the entire event sequence intact. This approach mimics real-world recommendation scenarios more closely because it prevents future data from leaking into training and allows models to be tested on truly unseen, chronologically later interactions.

This time-aware evaluation is important for benchmarking algorithms under realistic constraints and for understanding their practical effectiveness.
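The difference between the two splitting strategies can be sketched in a few lines of Python. This is a minimal illustration on synthetic `(user_id, item_id, timestamp)` tuples, not Yandex's evaluation code; the function names, the tuple layout and the `cutoff` parameter are assumptions made for the example.

```python
def leave_one_out_split(events):
    """Hold out each user's latest event; everything else is training data."""
    last_index = {}
    for i, (user, _item, ts) in enumerate(events):
        if user not in last_index or ts >= events[last_index[user]][2]:
            last_index[user] = i
    held_out = set(last_index.values())
    train = [e for i, e in enumerate(events) if i not in held_out]
    test = [e for i, e in enumerate(events) if i in held_out]
    return train, test


def global_temporal_split(events, cutoff):
    """Train on events strictly before the cutoff, test on the rest."""
    train = [e for e in events if e[2] < cutoff]
    test = [e for e in events if e[2] >= cutoff]
    return train, test


# Synthetic (user_id, item_id, timestamp) events.
events = [
    (1, 10, 1), (1, 11, 2), (1, 12, 9),
    (2, 10, 3), (2, 13, 4),
    (3, 14, 8), (3, 15, 10),
]

loo_train, loo_test = leave_one_out_split(events)
gts_train, gts_test = global_temporal_split(events, cutoff=8)

# Leave-one-out: some training events are later than some test events.
loo_leaks = max(e[2] for e in loo_train) > min(e[2] for e in loo_test)
# Global temporal split: every training event precedes every test event.
gts_leaks = max(e[2] for e in gts_train) > min(e[2] for e in gts_test)
print(loo_leaks, gts_leaks)
```

On this toy data, leave-one-out places user 2's held-out event (timestamp 4) in the test set while user 3's later event (timestamp 8) stays in training, which is exactly the kind of leakage a single global cutoff avoids.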

Baseline models and measurements included

To support benchmarking and accelerate innovation, Yandex supplies baseline recommendation models implemented on the data set, including:

MostPop: A popularity-based model that recommends the most popular items.
DecayPop: A time-decayed popularity model.
ItemKNN: A neighborhood-based collaborative filtering method.
iALS: Implicit Alternating Least Squares, a matrix factorization method.
BPR: Bayesian Personalized Ranking, a pairwise ranking method.
SANSA and SASRec: More advanced models; SANSA is a scalable sparse autoencoder for collaborative filtering, and SASRec is a sequence-aware model that uses self-attention.
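The two simplest baselines in this list can be sketched as follows. This is an illustrative Python sketch, not the implementation shipped with Yambda: the exponential decay with a `half_life` parameter is one common way to realize a time-decayed popularity score and is an assumption here, as are the function names and the `(user_id, item_id, timestamp)` tuple layout.

```python
from collections import Counter, defaultdict


def most_pop(events, k):
    """Rank items by their raw interaction count and return the top k."""
    counts = Counter(item for _user, item, _ts in events)
    return [item for item, _count in counts.most_common(k)]


def decay_pop(events, k, now, half_life):
    """Rank items by interaction counts weighted with exponential time decay.

    An event that is `half_life` time units old contributes half as much
    as an event happening at `now` (assumed decay scheme).
    """
    scores = defaultdict(float)
    for _user, item, ts in events:
        scores[item] += 0.5 ** ((now - ts) / half_life)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Synthetic events: item 10 has three old plays, item 20 has two recent plays.
events = [
    (1, 10, 0), (2, 10, 1), (3, 10, 2),
    (1, 20, 99), (2, 20, 100),
]

print(most_pop(events, k=2))
print(decay_pop(events, k=2, now=100, half_life=10))
```

With this data, raw popularity ranks item 10 first because it has more plays overall, while the decayed variant ranks item 20 first because its plays are recent.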

These baselines are evaluated using standard recommendation metrics such as:

NDCG@K (Normalized Discounted Cumulative Gain): Measures ranking quality, emphasizing the position of relevant items.
Recall@K: Assesses the fraction of relevant items retrieved.
Coverage@K: Indicates the diversity of recommendations across the catalog.
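For readers who want to see how these three metrics are typically computed, here is a compact Python sketch using binary relevance. It is a textbook-style illustration rather than the benchmark's own evaluation code; in particular, normalizing recall by the total number of relevant items and defining coverage as the share of the catalog appearing in at least one top-K list are assumptions, since conventions vary between papers.

```python
import math


def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: DCG of the list divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def coverage_at_k(all_recommendations, catalog_size, k):
    """Share of the catalog that shows up in at least one top-k list."""
    seen = set()
    for recs in all_recommendations:
        seen.update(recs[:k])
    return len(seen) / catalog_size


# Synthetic example: one user's ranked list and held-out relevant items.
recommended = [3, 1, 7]
relevant = {1, 7, 9}

recall = recall_at_k(recommended, relevant, k=3)
ndcg = ndcg_at_k(recommended, relevant, k=3)
coverage = coverage_at_k([[3, 1, 7], [1, 2, 3]], catalog_size=10, k=3)
print(recall, ndcg, coverage)
```

In this example two of the three relevant items are retrieved, but they sit at positions 2 and 3 rather than at the top, so NDCG is lower than recall.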

Providing these benchmarks helps researchers quickly measure the performance of new algorithms against established methods.

Broad applicability beyond music streaming

While the data set comes from a music streaming service, its value extends far beyond this domain. Its interaction types, user behavior dynamics and large scale make Yambda a universal benchmark for recommendation systems across sectors such as e-commerce, video platforms and social networks. Algorithms validated on this data set can be generalized or adapted to various recommendation tasks.

Benefits for different stakeholders

Academia: Enables rigorous testing of theories and new algorithms at an industry-relevant scale.
Startups and SMBs: Offers a resource comparable to what tech giants possess, leveling the playing field and accelerating the development of advanced recommendation engines.
End users: Benefit indirectly from smarter recommendation algorithms that improve content discovery, reduce search time and increase engagement.

My Wave: Yandex’s personalized recommendation system

Yandex Music uses a proprietary recommendation system called My Wave, which combines deep neural networks and AI to personalize music suggestions. My Wave analyzes thousands of factors, including:

User interaction sequences and listening history.
Customizable preferences such as mood and language.
Real-time music analysis of spectrograms, rhythm, vocal tone, frequency ranges and genres.

This system adapts dynamically to individual tastes by identifying sound similarities and predicting preferences, demonstrating the kind of complex recommendation pipeline that benefits from large datasets such as Yambda.

Ensuring privacy and ethical use

The release of Yambda emphasizes the importance of privacy in recommendation system research. Yandex anonymizes all data with numeric IDs and omits personally identifiable information. The data set contains only interaction signals and does not reveal actual user identities or sensitive attributes.

This balance between openness and privacy allows for robust research while protecting individual user data, a critical consideration for the ethical progress of AI technologies.

Access and versions

Yandex offers the Yambda data set in three sizes to accommodate different research needs and computational capabilities:

Full version: ~ 5 billion events.
Medium version: ~ 500 million events.
Small version: ~ 50 million events.

All versions are available via Hugging Face, a popular platform for hosting data sets and machine learning models, which enables easy integration into research workflows.

Conclusion

Yandex’s release of the Yambda data set marks a pivotal moment in recommendation system research. By providing an unprecedented scale of anonymized interaction data paired with time-aware evaluation and baselines, it sets a new standard for benchmarking and accelerating innovation. Researchers, startups and businesses can now explore and develop recommendation systems that better reflect real-world use and deliver improved personalization.

As recommendation systems continue to shape countless online experiences, data sets like Yambda play a fundamental role in pushing the boundaries of what AI-driven personalization can achieve.

Check out the Yambda data set on Hugging Face.

Note: Thanks to the Yandex team for the thought leadership and resources for this article. The Yandex team has supported and sponsored this content/article.

Asif Razzaq is the CEO of Marktechpost Media Inc. His latest endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
