Highlights from KDD 2023

#conference #machine-learning

A watercolor painting of a California landscape in warm tones.

California Spring Landscape (circa 1920) by Elmer Wachtel, via the Smithsonian American Art Museum.

In August I attended ACM’s 29th Conference on Knowledge Discovery and Data Mining, better known as KDD, in Long Beach, California. Nearly three months later (whoops), my head is still overflowing with everything I saw there: I watched more than 50 paper talks, plus keynotes, invited talks, and workshops—and I still ended up missing several things I wanted to see. In this post, I’ll give an overview of what I saw and run down some of my favorite papers and talks from the conference.1

Themes and trends

There were a few themes that I came across repeatedly at the conference. The first is no surprise: everyone is talking about large language models. Google’s Ed Chi gave a keynote on “The LLM Revolution,” and there was even an “LLM day” featuring a great talk from OpenAI’s Jason Wei on new paradigms that LLMs have ushered in. That being said, I couldn’t help but notice that (with a few exceptions) the vast majority of work I saw presented was not leveraging LLMs.

Another recurring motif was scale and the issues it raises. Problems that are tractable on a small scale raise new algorithmic and engineering challenges when scaled to millions of users and items.

Lastly, I noticed that the majority of recommender papers formulated the problem as multitask. It’s no longer sufficient to optimize for click rate alone: authors acknowledge that users and businesses have multiple goals and those goals can potentially be mutually satisfied by a single model.

Recommender systems

Because my work focuses on recommender systems and personalization, I spent a disproportionate amount of time attending talks on the topic. These talks typically, though not always, come from an industry setting and are much easier for me to transplant onto concepts and problems I’m already working on.

EvalRS workshop on evaluating recommender systems

I had a blast attending the EvalRS workshop on well-rounded evaluation of recommender systems, which consisted of two keynotes, lightning talks, and a hackathon.

Luca Belli gave a keynote on Practical Considerations for Responsible RecSys, where he outlined the challenges in evaluating recommender systems for fairness (what even is it?), picking metrics, and untangling personalization and discrimination. He mentioned that he has some upcoming work on “nutrition labels” for recommenders with standardized metrics, impacts beyond discrimination, transparency requirements, and more.

I was particularly fond of one of the lightning talks: Metric@Customer N: Evaluating Metrics at a Customer Level in E-Commerce by Mayank Singh et al from Grubhub. The gist is that using a fixed “N” for metrics like precision and recall from the top N recommendations per user is likely overly simplistic, neglecting to account for the fact that users’ browsing behaviors vary: user A might realistically only ever consider the top 5 items while user B might browse 25. They propose adapting these metrics to use a personalized N for each user, based on that user’s median or maximum items consumed in previous sessions.
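To make the idea concrete, here’s a minimal sketch of what a hit rate with a per-user cutoff might look like; the function and argument names are mine, not from the talk.

```python
import numpy as np

def hit_rate_at_personal_n(recommendations, relevant, personal_n):
    """Hit rate where each user gets their own cutoff N (e.g. their median
    items browsed per session) instead of a single global N.

    recommendations: dict user -> ranked list of item ids
    relevant:        dict user -> set of held-out relevant item ids
    personal_n:      dict user -> that user's cutoff
    """
    hits = []
    for user, ranked in recommendations.items():
        top_n = ranked[:personal_n[user]]
        hits.append(any(item in relevant[user] for item in top_n))
    return float(np.mean(hits))
```

The same trick applies to precision, recall, or NDCG: swap the single global N for a cutoff derived from each user’s own history.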

In fact, I liked this idea so much that my colleagues and I implemented it for the EvalRS hackathon, which had the broad task of contributing to the well-rounded evaluation of the target dataset.2 My team consisted of myself, my colleagues Eyan Yeung and Dev Goyal from Hinge, and Alexandra Johnson of Rubber Ducky Labs, and we managed to snag second place!3 First place went to the team from Grubhub, who made several contributions including implementing metrics@N.

Jacopo Tagliabue presenting the introduction to the EvalRS workshop at KDD 2023.

Jacopo holding court at the EvalRS workshop, via Mattia Pavoni.

Joey Robinson wrapped up the workshop with the second keynote, Ask and Answer: A Case Study in ML Evaluation at Snap. I really liked his central thesis, which is that a model evaluation framework has to do more than just tell us “is this model good or bad?”—it has to provide a means for us to ask and answer questions about a model’s behavior.

I want to give a huge thank you to the EvalRS organizers for putting together a great workshop (and afterparty), with an extra special thanks to Jacopo Tagliabue, who’s always been generous with his time and insights on all things RecSys and ML.

Joint Optimization of Ranking and Calibration with Contextualized Hybrid Model

Xiang-Rong Sheng et al, Alibaba. Paper at ACM and arXiv.

Many recommender systems have moved from estimating user ratings or click probabilities to focusing on learning relative rankings between items, since a ranked list is often the user-facing output of the system. However, a well-calibrated estimate of click-through rate is often still desirable (especially in online advertising contexts), and ranking loss functions (which typically pass logits for each item through a softmax to measure relative importance) aren’t well-suited to producing well-calibrated point estimates of probabilities. The authors of this work want to have their ranking performance cake and eat their click rate estimates too.

The go-to solution for tasks with competing loss functions is to just optimize some weighted sum of the two losses, but the authors note that this has been tried and still fails to produce interpretable probabilities. Their solution is a bit more complex, but also clever. Rather than producing one logit per item, they produce two: one representing a click and one representing a non-click. The probability estimate is the sigmoid of the difference between the click and non-click logits (equivalent to the softmax of the two logits), and its calibration is optimized with log loss. Simultaneously, the ranking performance is optimized with a contrastive loss encouraging a high logit for the positive sample and a low logit for the negative samples within a session (and there may be multiple sessions within a batch). The final loss function is a convex sum of these two.
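Here’s a small PyTorch sketch of how I understand the combined objective: two logits per item, log loss on the sigmoid of their difference for calibration, and a within-session ranking term in the spirit described above. This is my paraphrase, not the authors’ code, and the session handling is simplified.

```python
import torch
import torch.nn.functional as F

def joint_ranking_calibration_loss(click_logits, nonclick_logits, labels,
                                   session_ids, alpha=0.5):
    """click_logits, nonclick_logits, labels: 1-D tensors over items in a batch.
    session_ids: 1-D tensor grouping items into sessions. alpha balances the
    calibration and ranking terms (a convex combination)."""
    # Calibration: sigmoid of the (click - non-click) logit gap, trained with log loss
    p_click = torch.sigmoid(click_logits - nonclick_logits)
    calibration_loss = F.binary_cross_entropy(p_click, labels.float())

    # Ranking: within each session, clicked items should out-score the rest
    ranking_losses = []
    for sid in session_ids.unique():
        mask = session_ids == sid
        if labels[mask].sum() == 0:
            continue  # no positive in this session; nothing to rank against
        log_probs = F.log_softmax(click_logits[mask], dim=0)
        ranking_losses.append(-(log_probs * labels[mask].float()).sum())
    ranking_loss = (torch.stack(ranking_losses).mean()
                    if ranking_losses else torch.tensor(0.0))

    return alpha * calibration_loss + (1 - alpha) * ranking_loss
```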

A diagram contrasting different ways of calculating loss for ranking and probability estimation.

A figure from the paper comparing pointwise loss, listwise loss, and the joint ranking and calibration loss.

It’s a nice trick, and separating out negative examples by session within the batch is something I hadn’t seen before (and required some work to ensure that entire sessions would be placed together in a batch during distributed training). The authors evaluated it against pointwise, pair/listwise, and convex sum losses on two open source datasets and their own logged data from production and found it (typically) outperformed the baselines on both ranking and calibration metrics. More importantly, the model was evaluated in a user-level A/B test against a pointwise baseline and brought a 4.4% lift in CTR, 2.4% increase in revenue per thousand impressions, and a 0.27% drop in log loss.

Recommender quick hits

Search

Search is intrinsically tied to recommendation and, increasingly, personalization, so I’ve been taking more and more of an interest in the topic.4 As large language models have blown up, search has seen a surge of interest, with particular focus on semantic search methods (which operate on embeddings rather than on text itself) and hybrid methods that tie together semantic and more “classical” approaches.

Optimizing Airbnb Search Journey with Multi-task Learning

Chun How Tan et al, Airbnb. Paper at ACM and arXiv.

Just like in the Spotify Impatient Bandits paper mentioned above, Airbnb deals with optimizing outcomes with substantial delays between user actions and measured rewards—in this case, search journeys which may take weeks before a final reservation is made. Their method, Journey Ranker, is a multi-task model that leverages intermediate milestones to improve search personalization along this journey.

Airbnb searchers often abandon search sessions before ultimately making reservations later down the line, so clicks on search results are really only the first of many steps that must occur before an actual booking. Journey Ranker considers not only clicks but payment page visits, booking requests, host acceptance of booking request, and whether or not either the host or booker cancels the booking.

A shared representation embeds both the listing and search context, while the “base module” leverages this representation to make several intermediate probability estimates (probability of click given impression, booking given click, etc.) whose product is the final desired probability, as inspired by the Entire Space Multi-Task Model approach. A “Twiddler” module scores listings based on negative milestones like cancellation or booking rejection, using the same shared representation. Lastly, a combination module produces a learned linear combination of the base and Twiddler module outputs.
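The chaining of milestones in the base module is easy to illustrate: the probability of the final outcome is just the product of conditional probabilities along the journey. The milestone names below are illustrative, not Airbnb’s.

```python
def journey_probability(step_probs):
    """Probability of the final outcome as a product of conditional
    milestone probabilities, ESMM-style."""
    p = 1.0
    for prob in step_probs:  # e.g. P(click|impression), P(payment|click), P(booking|payment)
        p *= prob
    return p

# A made-up example: P(booking | impression) from three intermediate estimates
print(journey_probability([0.10, 0.30, 0.60]))  # 0.018
```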

A diagram of the paper's combined architecture.

A figure from the paper visualizing the Journey Ranker combination module.

The paper goes into detail on design decisions taken and not taken, and demonstrates the interpretability of the combination module’s linear task weights. They evaluated the method in offline and online settings, with the online test leading to a 0.61% increase in uncancelled bookers. They also applied the same method to other Airbnb use cases, netting a 9.0% increase in uncancelled bookers for online experiences (a newer and less optimized flow) and a 3.7% reduction in email unsubscribes.

End-to-End Query Term Weighting

Karan Samel et al, Google. Paper at ACM and Google Research.

As sexy as “semantic search” may be, deep learning models are expensive to run at massive scale. As a result, bag-of-words and n-gram methods are often still deployed, with tokens weighted by scoring functions like BM25, which has proven to be a formidable baseline.

The authors of this paper wanted to adapt newer deep learning methods to operate within the bounds of a bag-of-words system by using a language model to learn weights (end-to-end) for query terms that are used as part of the input to the BM25 scoring function.

A masked language modeling task is used for pre-training an open source BERT checkpoint. Then a second pre-training task is used to learn uniform term weights as a starting point for the actual term weighting tasks. The fine-tuning is the end-to-end step which leverages the entire information retrieval pipeline: the query is fed to the BERT model to produce term weights, which are combined with document statistics as input to BM25. A combination of pointwise and ranking loss is then back-propagated all the way back to the BERT model to update the term weights.
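To picture where the learned weights land, here’s a toy sketch of BM25 scoring in which a per-term weight scales each query term’s contribution; the weights would come from the BERT model, and all names here are illustrative rather than from the paper.

```python
import math

def weighted_bm25(query_terms, term_weights, doc_tf, doc_len, avg_doc_len,
                  doc_freq, num_docs, k1=1.2, b=0.75):
    """Standard BM25, except each query term's contribution is scaled by a
    learned weight (defaulting to 1.0, i.e. plain BM25)."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        idf = math.log(1 + (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += term_weights.get(term, 1.0) * idf * tf_part
    return score
```

Because the weights enter the scoring function multiplicatively, the ranking loss can be differentiated with respect to them and back-propagated to the language model, which is what makes the “end-to-end” part work.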

A diagram of the paper's query term weighting system.

A figure from the paper visualizing the architecture of the end-to-end weighting system.

There are other fancy bits in here, like using a T5 model to generate soft labels for negative examples, as well as a query expansion solution which weights both the original query terms and the expanded terms. Another challenge was reconciling the wordpieces from BERT’s tokenizer with the word-level terms used by the existing system, which required additional masking and pooling.

The model performs well on the evaluation datasets, with some tough competition from SPLADE. They don’t mention any online experiments, but note that since their model only incurs a single forward pass on BERT, it’s “tractable to perform during serving.” Nonetheless, I admire the pragmatism underlying the approach. Replacing the existing retrieval system with deep learning would be a massive infrastructural change with integration costs, not to mention monetary and performance costs, but leveraging a deep learning model within the existing system can potentially get the best of both worlds.

Search quick hit

Software and engineering challenges

These papers focused less on novel algorithms (though they do include some) and more on building new solutions to improve performance and resilience. Software engineering is a fairly rare topic at ML conferences, but when it does come up it often provides a unique look behind the curtain at engineering challenges faced by some of the biggest tech companies on the planet.

Yggdrasil Decision Forests

The logo for the Yggdrasil decision forests library.

Mathieu Guillame-Bert et al, Google and Pinecone. Paper at ACM and arXiv, code on GitHub.

Yggdrasil is a new decision forest library in C++ and Python (as TensorFlow Decision Forests), with support for inference in Go and JavaScript.
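For flavor, here’s a minimal sketch of what training through the TensorFlow Decision Forests wrapper looks like, based on my understanding of its Keras-style API; consult the library’s docs for current usage, and note that the CSV file here is hypothetical.

```python
import pandas as pd
import tensorflow_decision_forests as tfdf

# Load a tabular dataset and convert it to a TensorFlow dataset
df = pd.read_csv("train.csv")  # hypothetical file with a "label" column
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(df, label="label")

# Train a gradient boosted trees model with the library's defaults
model = tfdf.keras.GradientBoostedTreesModel()
model.fit(train_ds)

# Inspect the trained model (variable importances, tree structure, etc.)
model.summary()
```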

I have a thing for talks about new libraries, like Transformers4Rec at RecSys 2021 or TorchRec at RecSys 2022.5 But what I really loved about the Yggdrasil talk wasn’t the discussion of features (a bazillion different tree-based models, evaluation methods, distributed training, etc.) but the focus on the design principles that went into making the library:

  1. Simplicity of use: helpful interactions and messages at the appropriate level of abstraction, sensible defaults, clarity and transparency
  2. Safety of use: warnings and errors that make it hard for users to make mistakes, easily-accessible best practices
  3. Modularity and high-level abstraction: sufficiently complex pieces of code should be understandable independently, with well-defined interfaces between them
  4. Integration with other ML libraries: avoid limiting users to methods available within their chosen library, favor composability

It reminds me of Vim’s design principles, and re-affirms my belief (partially inspired by my employer) that having principles from the outset is a good way to guide decision-making during any large project.

A comparison of an opaque and a more verbose error message.

An excerpt from the Yggdrasil paper showing the library's approach to helpful, humane errors.

As you can see in the example from the paper shown above, these principles manifest in a library that attempts to guide you to use it correctly and effectively. The authors note that machine learning systems are often at risk of running without error, while silently doing something totally wrong, so this approach is valuable in preventing some of those potentially insidious errors.

Revisiting Neural Retrieval on Accelerators

Jiaqi Zhai et al, Meta. Paper at ACM and arXiv.

While deep learning has gained popularity in recommendation, older methods like matrix factorization have proven to be a hard baseline to beat.6 But even as matrix factorization has been largely supplanted by neural alternatives, it has survived in the fundamental structure of most dense retrieval models: representing similarity as the dot product of learned vectors for users (or queries) and items. By saving all the vectors to an index, retrieval can be done with approximate nearest neighbor methods which scale to massive datasets while remaining highly performant.

But dot products just aren’t good enough for Zhai et al. They write: “The relationship between users and items in real world, however, demonstrates a significant level of complexity, and may not be approximated well by dot products,” citing the fact that interaction likelihoods are frequently much higher-rank than these low-rank approximations can model, and noting that ranking models have mostly moved on to more complex neural architectures. Nonetheless, there haven’t yet been solid alternatives to the maximum-inner-product search formulation. The authors seek to change this, and their contribution is threefold: a “mixture of logits” similarity function designed to outperform dot products and generalize better to the long tail, a hierarchical retrieval strategy that employs GPUs and TPUs to scale this method to corpora of hundreds of millions of items, and experiments demonstrating improvements on baseline datasets and Meta production traffic.
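Here’s a rough NumPy sketch of the mixture-of-logits idea as I understand it: compute several component-wise dot products between query and item embeddings, then combine them with adaptive gating weights instead of relying on a single dot product. This is a simplification for one query-item pair, not the paper’s exact formulation or its accelerator-friendly implementation.

```python
import numpy as np

def mixture_of_logits(query_embs, item_embs, gate_w=1.0, gate_b=0.0):
    """query_embs, item_embs: (P, d) arrays holding P component embeddings
    each for one query/user and one item. gate_w and gate_b parameterize a
    toy gating function over the component logits."""
    # One dot-product logit per component pair (a plain dot product is P = 1)
    logits = np.einsum("pd,pd->p", query_embs, item_embs)
    # Softmax gating turns the logits into adaptive mixture weights
    gates = np.exp(gate_w * logits + gate_b)
    gates /= gates.sum()
    # Final similarity: weighted mixture of the component logits
    return float(np.dot(gates, logits))
```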

The authors go on to discuss not only their novel similarity function but the various tricks they employ to make it run efficiently on the GPU, bringing to mind a similar blend of algorithmic and engineering wizardry in Meta’s DLRM paper.

Improving Training Stability for Multitask Ranking Models in Recommender Systems

Jiaxi Tang et al, Google and DeepMind. Paper at ACM and arXiv, code on GitHub.

I mentioned that scale and multitask recommenders were major themes at KDD this year, and this paper (which won the Best Paper Award for Applied Data Science) has both. It’s all about techniques to prevent instability (loss divergence) while training a multitask recommender for YouTube.

Training failures had clearly become a major issue for the team, who found them hard to reproduce (models didn’t always diverge, even with the same configuration), hard to detect (divergence might occur before metrics are logged), and hard to measure. Rolling back to earlier training checkpoints didn’t address the heart of the problem and wasn’t even guaranteed to work.

A line plot demonstrating loss divergence of a deep learning model.

A (cropped) figure from the paper demonstrating temporary and permanent loss divergence.

They found that the root cause of loss divergence was “step size being too large when loss curvature is steep” (they even bolded it in their paper). Recommendation models are particularly susceptible because they’re trained on many features in sequential data, which makes distribution shift during training very likely, which in turn requires a consistently large learning rate to keep adapting to these shifts. Furthermore, ranking models diverged more than retrieval models, seemingly due to their larger size and their tendency to be multitask (big gradients for one task can spoil performance on others through shared layers).

The solution ends up being a fancy variant of gradient clipping, affectionately dubbed “Clippy.” What makes Clippy unique is that it controls the size of the model update rather than the size of the gradient itself—these two values are synonymous in SGD, but can be very different with more sophisticated optimizers. They also measure the size of the update using the L-infinity norm, which is sensitive to large updates in even just a few coordinates.
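As an illustration of the difference between clipping a gradient and clipping an update, here’s a toy sketch of scaling a parameter update so that its coordinate-wise relative change stays below a threshold, measured with the L-infinity norm. This is just the flavor of the idea, not the exact Clippy algorithm from the paper.

```python
import numpy as np

def apply_clipped_update(params, update, max_rel_change=0.01, eps=1e-12):
    """Scale the proposed update so that no single coordinate changes its
    parameter by more than max_rel_change (relative, L-infinity sense)."""
    rel_change = np.abs(update) / (np.abs(params) + eps)  # per-coordinate ratio
    scale = min(1.0, max_rel_change / (rel_change.max() + eps))
    return params - scale * update
```

With plain SGD the update is just the learning rate times the gradient, so this reduces to ordinary gradient clipping; with adaptive optimizers the two can differ substantially, which is the gap Clippy targets.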

Engineering quick hit

Grab bag

Not every paper I liked fits neatly into the themes above. Below are a few more papers of note that I wanted to highlight:

Wrapping up

There are around 20 papers in this write-up, and I left plenty of cool talks on the cutting room floor too. I’m thankful to my employer for sending me to conferences like KDD and RecSys where I’m able to hear so many ideas and take away so much. If you think I missed a particularly cool paper or talk, or just want to share your thoughts, don’t hesitate to reach out, and thanks for reading ‘til the end! If you liked this post and want to hear more from me in the future, you can follow my RSS feed.


  1. I’ll note that this selection will be pretty biased by what I chose to attend. For instance, KDD features a ton of content on graph learning, the vast majority of which I didn’t attend because I don’t do much work with graph data. On the other hand, you’ll find recommender systems and search over-represented here, as those are areas of particular interest to me. I also missed some workshops that I would’ve loved to have seen, like Decision Intelligence and Analytics for Online Marketplaces and Online and Adaptive Recommender Systems. 

  2. It turns out that personalizing the metric can be quite harsh: the baseline model had a hit rate at 100 of 4.8% but a hit rate at median listens/day of only 0.03%! 

  3. Second place…though I’m not sure if more than two teams submitted 😅. 

  4. For others interested in search basics, I’d highly recommend the Search Fundamentals course from Uplimit, taught by Grant Ingersoll and Daniel Tunkelang. It’s a great intro into key search concepts and a whirlwind tour of basic Elasticsearch/OpenSearch functionality. I took the course in February and it opened my eyes to a lot of the parallels between search and recommendation. 

  5. I never did get around to writing that highlights of RecSys ‘22 post… 

  6. For example, see the work of Steffen Rendle, who has contributed factorization machines and Bayesian personalized ranking to the pre-neural era of recsys. He has continued to demonstrate the performance of non-neural recommendation models several times. See also Are We Really Making Much Progress? and Reenvisioning the comparison between Neural Collaborative Filtering and Matrix Factorization. 