Highlights from Git 2.43
6 days, 8 hours ago
For Good First Issue: Introducing a new way to contribute
1 week ago

For Good First Issue is a curated list of open source projects that are also digital public goods and need the help of developers.

The post For Good First Issue: Introducing a new way to contribute appeared first on The GitHub Blog.

Security best practices for authors of GitHub Actions
1 week, 5 days ago

Improve your GitHub Action’s security posture by securing your source repository, protecting your maintainers, and making it easy to report security incidents.

Universe’s key takeaway: Innovate better with AI-powered workflows on a single, unified platform
1 week, 6 days ago

Discover new AI-powered features and tools to help developers stay in the flow and organizations innovate at scale.

GitHub Availability Report: October 2023
2 weeks, 1 day ago

In October, we experienced two incidents that resulted in degraded performance across GitHub services.

Celebrating the GitHub Awards 2023 recipients 🎉
2 weeks, 5 days ago

The GitHub Awards recognizes and celebrates the outstanding contributions and achievements in the developer community, honoring individuals, projects, and organizations for their impactful work, innovation, thought leadership, and creating an outsized positive impact on the community.

Universe 2023: Copilot transforms GitHub into the AI-powered developer platform
2 weeks, 6 days ago

GitHub is announcing general availability of GitHub Copilot Chat and previews of the new GitHub Copilot Enterprise offering, new AI-powered security features, and the GitHub Copilot Partner Program.

Octoverse: The state of open source and rise of AI in 2023
2 weeks, 6 days ago

In this year’s Octoverse report, we study how open source activity around AI, the cloud, and Git is changing the developer experience.

How to tackle unreliability of coding assistants
14 hours, 22 minutes ago

Over the last year, lots of developers have incorporated LLM coding assistants into their work, finding them a useful tool. But one of the problems of these tools is that they are unreliable, often coming up with poor or outright wrong-headed suggestions. Birgitta Böckeler continues her exploration of GenAI for developers by passing on what she's learned about how to think about this unreliability, and why it may be good to call your LLM tool “Dusty”.


Patterns of Distributed Systems is published by Pearson
4 days, 9 hours ago

During the last four years, my colleague Unmesh Joshi has been developing a collection of patterns to help us all better understand how modern distributed systems work. We've been publishing drafts of these patterns on this site. Now these have turned into a book, published by Addison-Wesley in my signature series. As such, we've now removed the work-in-progress drafts from this site, and have replaced them with a catalog of pattern summaries. For those with a subscription to oreilly.com, we have deep links from the summaries to the relevant chapter of the online book.


Three reasons a liberal arts degree helped me succeed in tech
2 weeks, 5 days ago

My colleague Sannie Lee has met many students who are looking into getting into technology, taking narrow professionally-oriented majors. Sannie, however, has found that a traditional liberal-arts degree has given her skills that are highly relevant to her work as a product manager.


Enhancing the Headless Component
3 weeks ago

In the second (and final) part of his explanation of React Headless Components, Juntao Qiu explores how a headless component allows us to create a visually different component that has the same base behavior, and how it encourages better factoring as we extend base behavior further.


Current thoughts on social media
3 weeks, 5 days ago

It's now been a year since The Muskover. What does my use of social media look like now, both as a reader and a writer?


Headless Component: a pattern for composing React UIs
3 weeks, 6 days ago

As React UI controls become more sophisticated, complex logic can get intertwined with the visual representation. This makes it hard to reason about the behavior of the component, hard to test it, and necessary to build similar components that need a different look. Juntao Qiu tackles this by using a Headless Component, which extracts all non-visual logic and state management, separating the brain of a component from its looks.


How is GenAI different from other code generators?
2 months, 1 week ago

How is code generation with GenAI different from more "traditional" code generators? The newest memo in Birgitta Böckeler's explorations of GenAI talks about abstraction levels in software engineering, and on which levels GenAI sits in the translation of our thoughts into zeros and ones.


Technology Strategy for Emerging Technologies and Markets
3 months ago

Sarah Taraporewalla completes her study of building a technology strategy that's integrated with strategic business interests. This final strategic direction considers the ever-changing future, suggesting lines of inquiry to consider the impact of new technologies, market trends, and broader social-political changes.


Bottlenecks of Scaleups #05: Resilience and Observability
3 months ago

Here is a new article in the bottlenecks of scaleups series, looking at resilience and observability. Startups tend to only address resilience when their systems are already down, often taking a very reactive approach. For a scaleup, excessive system downtime represents a significant bottleneck to the organization, both from the effort expended on restoring function and also from the impact of customer dissatisfaction. Punit Lad and Carl Nygard explain that to move past this, resilience needs to be built into the business objectives, which will influence the architecture, design, product management, and even governance of business systems.


Strategic Directions supporting the people
3 months ago

Having a robust digital talent strategy is a competitive advantage in today’s fiercely competitive market. This enables businesses to have the right talent and have the right competencies to meet current and future demand to meet business goals or to stay on track for digital transformation aspirations. Sarah Taraporewalla continues her article on how to create an integrated business and technology strategy by looking at questions raised by two strategic directions that support people: culture and internal systems.


Netflix Original Research: MIT CODE 2023
1 day, 11 hours ago

Netflix was thrilled to be the premier sponsor for the 2nd year in a row at the 2023 Conference on Digital Experimentation (CODE@MIT) in Cambridge, MA. The conference features a balanced blend of academic and industry research from some wicked smart folks, and we’re proud to have contributed a number of talks and posters along with a plenary session.

Our contributions kicked off with a concept that is crucial to our understanding of A/B tests: surrogates!

Our first talk was given by Aurelien Bibaut (with co-authors Nathan Kallus, Simon Ejdemyr and Michael Zhao) in which we discussed how to confidently measure long-term outcomes using short-term surrogates in the presence of bias. For example, how do we estimate the effects of innovations on retention a year later without running all our experiments for a year? We proposed an estimation method using cross-fold procedures, and constructed valid confidence intervals for long-term effects before those effects are fully observed.

Later on, Michael Zhao (with Vickie Zhang, Anh Le and Nathan Kallus) spoke about the evaluation of surrogate index models for product decision making. Using 200 real A/B tests performed at Netflix, we showed that surrogate-index models, constructed using only 2 weeks of data, lead to the same product ship decisions ~95% of the time when compared to making a call based on 2 months of data. This means we can reliably run shorter tests with confidence without needing to wait months for results!

Our next topic focused on how to understand and balance competing engagement metrics; for example, should 1 hour of gaming equal 1 hour of streaming? Michael Zhao and Jordan Schafer shared a poster on how they built an Overall Evaluation Criterion (OEC) metric that provides holistic evaluation for A/B tests, appropriately weighting different engagement metrics to serve a single overall objective. This new framework has enabled fast and confident decision making in tests, and is being actively adapted as our business continues to expand into new areas.

In the second plenary session of the day, Martin Tingley took us on a compelling and fun journey of complexity, exploring key challenges in digital experimentation and how they differ from the challenges faced by agricultural researchers a century ago. He highlighted different areas of complexity and provided perspectives on how to tackle the right challenges based on business objectives.

Our final talk was given by Apoorva Lal (with co-authors Samir Khan and Johan Ugander) in which we show how partial identification of the dose-response function (DRF) under non-parametric assumptions can be used to provide more insightful analyses of experimental data than the standard ATE analysis does. We revisited a study that reduced like-minded content algorithmically, and showed how we could extend the binary ATE learning to answer how the amount of like-minded content a user sees affects their political attitudes.

We had a blast connecting with the CODE@MIT community and bonding over our shared enthusiasm for not only rigorous measurement in experimentation, but also stats-themed stickers and swag!

One of our stickers this year, can you guess what this is showing?!

We look forward to next year’s iteration of the conference and hope to see you there!

Psst! We’re hiring Data Scientists across a variety of domains at Netflix — check out our open roles.

Netflix Original Research: MIT CODE 2023 was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Causal Machine Learning for Creative Insights
4 days, 6 hours ago

A framework to identify the causal impact of successful visual components.

By Billur Engin, Yinghong Lan, Grace Tang, Cristina Segalin, Kelli Griggs, Vi Iyengar


At Netflix, we want our viewers to easily find TV shows and movies that resonate and engage. Our creative team helps make this happen by designing promotional artwork that best represents each title featured on our platform. What if we could use machine learning and computer vision to support our creative team in this process? Through identifying the components that contribute to a successful artwork — one that leads a member to choose and watch it — we can give our creative team data-driven insights to incorporate into their creative strategy, and help in their selection of which artwork to feature.

We are going to make an assumption that the presence of a specific component will lead to an artwork’s success. We will discuss a causal framework that will help us find and summarize the successful components as creative insights, and hypothesize and estimate their impact.


The Challenge

Given Netflix’s vast and increasingly diverse catalog, it is a challenge to design experiments that both work within an A/B test framework and are representative of all genres, plots, artists, and more. In the past, we have attempted to design A/B tests where we investigate one aspect of artwork at a time, often within one particular genre. However, this approach has a major drawback: it is not scalable because we either have to label images manually or create new asset variants differing only in the feature under investigation. The manual nature of these tasks means that we cannot test many titles at a time. Furthermore, given the multidimensional nature of artwork, we might be missing many other possible factors that might explain an artwork’s success, such as figure orientation, the color of the background, facial expressions, etc. Since we want to ensure that our testing framework allows for maximum creative freedom, and avoid any interruption to the design process, we decided to try an alternative approach.

Figure. Given the multidimensional nature of artwork, it is challenging to design an A/B test to investigate one aspect of artwork at a given time. We could be missing many other possible factors that might explain an artwork’s success, such as figure orientation, the color of the background, facial expressions, etc.

The Causal Framework

Thanks to our Artwork Personalization System and vision algorithms (some of which are exemplified here), we have a rich dataset of promotional artwork components and user engagement data to build a causal framework. Utilizing this dataset, we have developed the framework to test creative insights and estimate their causal impact on an artwork’s performance via the dataset generated through our recommendation system. In other words, we can learn which attributes led to a title’s successful selection based on its artwork.

Let’s first explore the workflow of the causal framework, as well as the data and success metrics that power it.

We represent the success of an artwork with the take rate: the probability that an average user watches the promoted title after seeing its promotional artwork, adjusted for the popularity of the title. Every show on our platform has multiple promotional artwork assets. Using Netflix’s Artwork Personalization, we serve these assets to hundreds of millions of members every day. To power this recommendation system, we look at user engagement patterns and see whether or not these engagements with artworks resulted in a successful title selection.

To annotate a given image (an artwork asset in this case), we use a series of computer vision algorithms, some of which are mentioned in an earlier post, to gather objective image metadata, a latent representation of the image, and some of the contextual metadata that the image contains. As a result, our dataset consists of both image features and user data, which lets us study which image components lead to successful user engagement. We also utilize machine learning algorithms, consumer insights¹, and correlational analysis to discover high-level associations between image features and an artwork’s success. These statistically significant associations become our hypotheses for the next phase.

Once we have a specific hypothesis, we can test it by deploying causal machine learning algorithms. This framework reduces our experimental effort to uncover causal relationships, while taking into account confounding among the high-level variables (i.e. the variables that may influence both the treatment / intervention and outcome).

The Hypothesis and Assumptions

We will use the following hypothesis in the rest of the script: presence of a face in an artwork causally improves the asset performance. (We know that faces work well in artwork, especially images with an expressive facial emotion that’s in line with the tone of the title.)

Here are two promotional artwork assets from Unbreakable Kimmy Schmidt. We know that the image on the left performed better than the image on the right. However, the difference between them is not only the presence of a face. There are many other variances, like the difference in background, text placement, font size, face size, etc. Causal Machine Learning makes it possible for us to understand an artwork’s performance based on the causal impact of its treatment.

To make sure our hypothesis is fit for the causal framework, it’s important we go over the identification assumptions.

  • Consistency: The treatment component is sufficiently well-defined.

We use machine learning algorithms to predict whether or not the artwork contains a face. That’s why the first assumption we make is that our face detection algorithm is mostly accurate (~92% average precision).

  • Positivity / Probabilistic Assignment: Every unit (an artwork) has some chance of getting treated.

We calculate the propensity score (the probability of receiving the treatment based on certain baseline characteristics) of having a face for samples with different covariates. If a certain subset of artwork (such as artwork from a certain genre) has close to a 0 or 1 propensity score for having a face, then we discard these samples from our analysis.

  • Individualistic Assignment / SUTVA (stable unit treatment value assumption): The potential outcomes of a unit do not depend on the treatments assigned to others.

Creatives make the decision to create artwork with or without faces based on considerations limited to the title of interest itself. This decision is not dependent on whether other assets have a face in them or not.

  • Conditional exchangeability (Unconfoundedness): There are no unmeasured confounders.

This assumption is by definition not testable. Given a dataset, we can’t know if there has been an unobserved confounder. However, we can test the sensitivity of our conclusions toward the violation of this assumption in various different ways.
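
The positivity check described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it estimates the propensity as the plain share of artworks with a face in each genre (a stand-in for a fitted propensity model), and the 0.05 / 0.95 cut-offs, the function name, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict


def trim_by_propensity(samples, low=0.05, high=0.95):
    """Keep only samples whose group-level propensity lies inside (low, high).

    samples: list of (genre, has_face) pairs, where has_face is 0 or 1.
    The propensity is the share of artworks with a face per genre, a simple
    stand-in for a fitted propensity model. The thresholds are assumptions.
    """
    counts = defaultdict(lambda: [0, 0])  # genre -> [n_with_face, n_total]
    for genre, has_face in samples:
        counts[genre][0] += has_face
        counts[genre][1] += 1
    propensity = {g: treated / total for g, (treated, total) in counts.items()}
    kept = [(g, t) for g, t in samples if low < propensity[g] < high]
    return kept, propensity


if __name__ == "__main__":
    # Hypothetical data: comedies often show faces, nature documentaries almost never.
    data = (
        [("comedy", 1)] * 60 + [("comedy", 0)] * 40
        + [("nature_doc", 1)] * 1 + [("nature_doc", 0)] * 99
    )
    kept, propensity = trim_by_propensity(data)
    print(propensity)
    print(len(kept), "of", len(data), "samples kept")
```

In this toy data the nature documentaries have a propensity close to 0, so there is almost no treated artwork to compare against and the whole group is dropped from the analysis.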

The Models

Now that we have established our hypothesis to be a causal inference problem, we can focus on the Causal Machine Learning application. Predictive Machine Learning (ML) models are great at finding patterns and associations in order to predict outcomes. However, they are not great at explaining cause-effect relationships, as their model structure does not reflect causality (the relationship between cause and effect). As an example, let’s say we looked at the price of Broadway theater tickets and the number of tickets sold. An ML algorithm may find a positive correlation between ticket prices and ticket sales. If we used this algorithm for decision making, we could falsely conclude that increasing the ticket price leads to higher ticket sales, because we did not consider the confounder of show popularity, which clearly impacts both ticket prices and sales. It is understandable that a Broadway musical ticket may be more expensive if the show is a hit; however, simply increasing ticket prices to gain more customers is counter-intuitive.
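
The Broadway example can be made concrete with a small simulation. The sketch below is purely illustrative: the coefficients (popularity raises price by 2 and sales by 5, while price itself lowers sales by 1) are invented, and the adjustment step simply removes the part of price and sales that popularity explains before comparing them.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x (intercept included)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """What is left of y after a least-squares fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


def simulate(n=5000, seed=0):
    rng = random.Random(seed)
    popularity = [rng.gauss(0, 1) for _ in range(n)]
    # Popular shows charge more ...
    price = [2.0 * p + rng.gauss(0, 1) for p in popularity]
    # ... and sell more, while a higher price by itself lowers sales (true effect: -1).
    sales = [5.0 * p - 1.0 * pr + rng.gauss(0, 1) for p, pr in zip(popularity, price)]
    return popularity, price, sales


def naive_and_adjusted(n=5000, seed=0):
    popularity, price, sales = simulate(n, seed)
    naive = slope(price, sales)
    # Partial out the confounder from both variables, then relate what is left.
    adjusted = slope(residuals(popularity, price), residuals(popularity, sales))
    return naive, adjusted


if __name__ == "__main__":
    naive, adjusted = naive_and_adjusted()
    print(f"naive slope of sales on price:       {naive:+.2f}")
    print(f"slope after adjusting for popularity: {adjusted:+.2f}")
```

The naive slope comes out positive even though the simulated effect of price on sales is negative, which is the false conclusion the paragraph warns about. Once popularity is accounted for, the sign flips back.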

Causal ML helps us estimate treatment effects from observational data, where it is challenging to conduct clean randomizations. Back-to-back publications on Causal ML, such as Double ML, Causal Forests, Causal Neural Networks, and many more, showcased a toolset for investigating treatment effects, via combining domain knowledge with ML in the learning system. Unlike predictive ML models, Causal ML explicitly controls for confounders, by modeling both treatment of interest as a function of confounders (i.e., propensity scores) as well as the impact of confounders on the outcome of interest. In doing so, Causal ML isolates out the causal impact of treatment on outcome. Moreover, the estimation steps of Causal ML are carefully set up to achieve better error bounds for the estimated treatment effects, another consideration often overlooked in predictive ML. Compared to more traditional Causal Inference methods anchored on linear models, Causal ML leverages the latest ML techniques to not only better control for confounders (when propensity or outcome models are hard to capture by linear models) but also more flexibly estimate treatment effects (when treatment effect heterogeneity is nonlinear). In short, by utilizing machine learning algorithms, Causal ML provides researchers with a framework for understanding causal relationships with flexible ML methods.

Y : outcome variable (take rate)
T : binary treatment variable (presence of a face or not)
W: a vector of covariates (features of the title and artwork)
X ⊆ W: a vector of covariates (a subset of W) along which treatment effect heterogeneity is evaluated

Let’s dive more into the causal ML (Double ML to be specific) application steps for creative insights.

  1. Build a propensity model to predict treatment probability (T) given the W covariates.

  2. Build a potential outcome model to predict Y given the W covariates.

  3. Residualization of

  • The treatment (observed T - predicted T via propensity model)
  • The outcome (observed Y - predicted Y via potential outcome model)

  4. Fit a third model on the residuals to predict the average treatment effect (ATE) or conditional average treatment effect (CATE).
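
These steps match the partially linear model commonly used for Double ML. Written with the symbols defined above, and assuming the names θ for the treatment effect, g for the outcome nuisance function, and f for the propensity function (these names are not given in the text), the model reads:

```latex
\begin{aligned}
Y &= \theta(X)\,T + g(W) + \epsilon \\
T &= f(W) + \eta
\end{aligned}
```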

Where 𝜖 and η are stochastic errors, and we assume that E[𝜖 | T, W] = 0 and E[η | W] = 0.

For the estimation of the nuisance functions (i.e., the propensity score model and the outcome model), we have implemented the propensity model as a classifier (as we have a binary treatment variable: the presence of a face) and the potential outcome model as a regressor (as we have a continuous outcome variable: adjusted take rate). We have used grid search to tune the hyperparameters of the XGBoost classifier and regressor. We have also used k-fold cross-validation to avoid overfitting. Finally, we have used a causal forest on the residuals of the treatment and outcome variables to capture the ATE, as well as the CATE for different genres and countries.
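
The residualization idea can be sketched end to end on simulated data. This is a simplified stand-in for the setup described above, not the authors' pipeline. It uses a single covariate and plain least-squares fits in place of the XGBoost classifier and regressor, and a residual-on-residual regression in place of the causal forest, so it recovers only a constant ATE. The data-generating numbers (a true effect of 0.5, a propensity of 0.2 + 0.6·w) are invented for illustration.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope of y on a single covariate x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sxx
    return my - b * mx, b


def cross_fit_residuals(w, target, k=5):
    """Out-of-fold residuals: each fold is predicted by a model fit on the other folds."""
    n = len(w)
    res = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        a, b = fit_line([w[i] for i in train], [target[i] for i in train])
        for i in range(fold, n, k):
            res[i] = target[i] - (a + b * w[i])
    return res


def double_ml_ate(w, t, y, k=5):
    t_res = cross_fit_residuals(w, t, k)  # observed T - predicted T
    y_res = cross_fit_residuals(w, y, k)  # observed Y - predicted Y
    # Final stage: regress outcome residuals on treatment residuals (no intercept).
    return sum(a * b for a, b in zip(t_res, y_res)) / sum(a * a for a in t_res)


def naive_difference(t, y):
    """Difference in mean outcome between treated and untreated units."""
    treated = [yi for ti, yi in zip(t, y) if ti == 1]
    control = [yi for ti, yi in zip(t, y) if ti == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def simulate(n=4000, true_effect=0.5, seed=1):
    rng = random.Random(seed)
    w = [rng.uniform(0, 1) for _ in range(n)]
    # Treatment is more likely for larger w, so w confounds the comparison.
    t = [1 if rng.random() < 0.2 + 0.6 * wi else 0 for wi in w]
    y = [true_effect * ti + 2.0 * wi + rng.gauss(0, 0.5) for ti, wi in zip(t, w)]
    return w, t, y


if __name__ == "__main__":
    w, t, y = simulate()
    print(f"naive difference in means: {naive_difference(t, y):.3f}")
    print(f"double ML estimate:        {double_ml_ate(w, t, y):.3f}")
```

Because each residual is computed from a model that never saw that sample, the final-stage estimate is not contaminated by overfitting in the nuisance models. The naive difference in means overstates the effect here because treated units also tend to have larger w.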

Mediation and Moderation

ATE will reveal the impact of the treatment — in this case, having a face in the artwork — across the board. The result will answer the question of whether it is worth applying this approach for all of our titles across our catalog, regardless of potential conditioning variables e.g. genre, country, etc. Another advantage of our multi-feature dataset is that we get to deep dive into the relationships between attributes. To do this, we can employ two methods: mediation and moderation.

In their classic paper, Baron & Kenny define a moderator as “a qualitative (e.g., sex, race, class) or quantitative (e.g., level of reward) variable that affects the direction and/or strength of the relation between an independent or predictor variable and a dependent or criterion variable.” We can investigate suspected moderators to uncover Conditional Average Treatment Effects (CATE). For example, we might suspect that the effect of the presence of a face in artwork varies across genres (e.g. certain genres, like nature documentaries, probably benefit less from the presence of a human face since titles in those genres tend to focus more on non-human subject matter). We can investigate these relationships by including an interaction term between the suspected moderator and the independent variable. If the interaction term is significant, we can conclude that the third variable is a moderator of the relationship between the independent and dependent variables.
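
To illustrate the interaction term, consider the simplest case where both the treatment (face present) and the suspected moderator (nature documentary or not) are binary. In the regression of the outcome on treatment, moderator, and their product, the interaction coefficient then equals a difference of differences in cell means, which the sketch below computes directly. The effect sizes are hypothetical, and a real analysis would also test the term for significance, which is omitted here.

```python
import random
from statistics import mean


def interaction_effect(rows):
    """Interaction coefficient for a binary treatment and a binary moderator.

    rows: iterable of (treated, moderator, outcome).
    Returns (interaction, effect_when_moderator_0, effect_when_moderator_1).
    """
    cell = {(t, m): [] for t in (0, 1) for m in (0, 1)}
    for t, m, y in rows:
        cell[(t, m)].append(y)
    avg = {key: mean(values) for key, values in cell.items()}
    effect_when_m1 = avg[(1, 1)] - avg[(0, 1)]
    effect_when_m0 = avg[(1, 0)] - avg[(0, 0)]
    return effect_when_m1 - effect_when_m0, effect_when_m0, effect_when_m1


def simulate(n=4000, seed=2):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        has_face = rng.randint(0, 1)
        nature_doc = rng.randint(0, 1)
        lift = 0.02 if nature_doc else 0.10  # hypothetical effect sizes
        outcome = 0.30 + lift * has_face + rng.gauss(0, 0.05)
        rows.append((has_face, nature_doc, outcome))
    return rows


if __name__ == "__main__":
    interaction, other_genres, nature_docs = interaction_effect(simulate())
    print(f"face effect, other genres:        {other_genres:.3f}")
    print(f"face effect, nature documentaries: {nature_docs:.3f}")
    print(f"interaction term:                  {interaction:.3f}")
```

An interaction term that is clearly different from zero means the face effect depends on the genre, so genre acts as a moderator in this simulated data.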

Mediation, on the other hand, occurs when a third variable explains the relationship between an independent and dependent variable. To quote Baron & Kenny once more, “whereas moderator variables specify when certain effects will hold, mediators speak to how or why such effects occur.”

For example, we observed that the presence of more than 3 people tends to negatively impact performance. It could be that higher numbers of faces make it harder for a user to focus on any one face in the asset. However, since face count and face size tend to be negatively correlated (since we fit more information in an image of fixed size, each individual piece of information tends to be smaller), one could also hypothesize that the negative correlation with face count is not driven so much from the number of people featured in the artwork, but rather the size of each individual person’s face, which may affect how visible each person is. To test this, we can run a mediation analysis to see if face size is mediating the effect of face count on the asset’s performance.

The steps of the mediation analysis are as follows. We have already detected a correlation between the independent variable (number of faces) and the outcome variable (user engagement): a higher number of faces is associated with lower user engagement. We also observe that the number of faces is negatively correlated with average face size, since faces tend to be smaller when more faces are fit into the same fixed-size canvas. To find out the degree to which face size mediates the effect of face count, we regress user engagement on both average face size and the number of faces. If 1) face size is a significant predictor of engagement, and 2) the significance of the predictive contribution of the number of faces drops, we can conclude that face size mediates the effect of the number of faces on user engagement. If the coefficient for the number of faces is no longer significant, it shows that face size fully mediates the effect of the number of faces on engagement.

In this dataset, we found that face size only partially mediates the effect of face count on asset effectiveness. This implies that both factors have an impact on asset effectiveness — fewer faces tend to be more effective even if we control for the effect of face size.
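
The two regressions behind this mediation analysis can be sketched on simulated data. Everything numeric below is invented: face count has a direct effect of -0.02 on engagement, and it also shrinks face size, which in turn has an effect of +0.30. The sketch compares coefficients only and leaves out the significance tests described above.

```python
import random


def center(values):
    m = sum(values) / len(values)
    return [v - m for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def simple_slope(x, y):
    """Coefficient of x when y is regressed on x alone."""
    xc, yc = center(x), center(y)
    return dot(xc, yc) / dot(xc, xc)


def two_predictor_slopes(x1, x2, y):
    """Coefficients of x1 and x2 when y is regressed on both (intercept included)."""
    a, b, c = center(x1), center(x2), center(y)
    s11, s22, s12 = dot(a, a), dot(b, b), dot(a, b)
    s1y, s2y = dot(a, c), dot(b, c)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def simulate(n=3000, seed=3):
    rng = random.Random(seed)
    count = [rng.randint(1, 6) for _ in range(n)]
    # More faces on a fixed canvas means smaller faces.
    size = [1.0 / c + rng.gauss(0, 0.05) for c in count]
    engagement = [-0.02 * c + 0.30 * s + rng.gauss(0, 0.02) for c, s in zip(count, size)]
    return count, size, engagement


if __name__ == "__main__":
    count, size, engagement = simulate()
    total = simple_slope(count, engagement)
    direct, via_size = two_predictor_slopes(count, size, engagement)
    print(f"face count alone:              {total:.3f}")
    print(f"face count, size held fixed:   {direct:.3f}")
    print(f"face size, count held fixed:   {via_size:.3f}")
```

The face-count coefficient shrinks toward zero once face size enters the regression but does not vanish, which is the pattern of partial mediation.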

Sensitivity Analysis

As alluded to above, the conditional exchangeability assumption (unconfoundedness) is not testable by definition. It is thus crucial to evaluate how sensitive our findings and insights are to the violation of this assumption. Inspired by prior work, we conducted a suite of sensitivity analyses that stress-tested this assumption from multiple different angles. In addition, we leveraged ideas from academic research (most notably the E-value) and concluded that our estimates are robust even when the unconfoundedness assumption is violated. We are actively working on designing and implementing a standardized framework for sensitivity analysis and will share the various applications in an upcoming blog post — stay tuned for a more detailed discussion!

Finally, we also compared our estimated treatment effects with known effects for specific genres that were derived with other methods, validating our estimates through their consistency across methods.
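
For readers unfamiliar with the E-value mentioned above, the textbook formula for a point estimate expressed as a risk ratio RR is RR + sqrt(RR × (RR − 1)), with estimates below 1 inverted first. The sketch below implements only that formula. How the authors converted their take-rate estimates into a comparable scale is not described in the post, so the input value is hypothetical.

```python
import math


def e_value(risk_ratio):
    """E-value for a point estimate on the risk-ratio scale.

    It is the minimum strength of association, on the risk-ratio scale, that an
    unmeasured confounder would need with both treatment and outcome to fully
    explain away the observed estimate.
    """
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    rr = risk_ratio if risk_ratio >= 1 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))


if __name__ == "__main__":
    # Hypothetical estimate: treated artworks are selected 1.5 times as often.
    print(round(e_value(1.5), 3))
```

A larger E-value means a stronger unmeasured confounder would be required to overturn the conclusion. A risk ratio of exactly 1 gives an E-value of 1, meaning there is no effect to explain away.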


Using the causal machine learning framework, we can potentially test and identify the various components of promotional artwork and gain invaluable creative insights. With this post, we just started to scratch the surface of this interesting challenge. In the upcoming posts in this series, we will share alternative machine learning and computer vision approaches that can provide insights from a causal perspective. These insights will guide and assist our team of talented strategists and creatives to select and generate the most attractive artwork, leveraging the attributes that these models selected, down to a specific genre. Ultimately this will give Netflix members a better and more personalized experience.

If these types of challenges interest you, please let us know! We are always looking for great people who are inspired by causal inference, machine learning, and computer vision to join our team.


The authors contributed to the post as follows.

Billur Engin was the main driver of this blog post; she worked on the causal machine learning theory and its application in the artwork space. Yinghong Lan contributed equally to the causal machine learning theory. Grace Tang worked on the mediation analysis. Cristina Segalin engineered and extracted the visual features at scale from artworks used in the analysis. Grace Tang and Cristina Segalin initiated and conceptualized the problem space that is being used as the illustrative example in this post (studying factors affecting user engagement with a broad multivariate analysis of artwork features), curated the data, and performed initial statistical analysis and construction of predictive models supporting this work.


We would like to thank Shiva Chaitanya for reviewing this work, and a special thanks to Shaun Wright, Luca Aldag, Sarah Soquel Morhaim, and Anna Pulido who helped make this possible.


¹The Consumer Insights team at Netflix seeks to understand members and non-members through a wide range of quantitative and qualitative research methods.

Incremental Processing using Netflix Maestro and Apache Iceberg
1 week, 1 day ago

by Jun He, Yingyi Zhang, and Pawan Dixit

Incremental processing is an approach to process new or changed data in workflows. The key advantage is that it only processes data that has been newly added or updated in a dataset, instead of re-processing the complete dataset. This not only reduces the cost of compute resources but also significantly reduces the execution time. When workflow execution has a shorter duration, the chances of failure and manual intervention are reduced. It also improves engineering productivity by simplifying existing pipelines and unlocking new patterns.
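
As a generic illustration of the idea, independent of Maestro or Iceberg, the sketch below keeps a watermark of the last change it has applied and touches only newer entries on each run. The class name, the change-log layout of (change_id, key, amount) tuples, and the running-count aggregate are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IncrementalAggregator:
    """Keeps a running total per key and a watermark of the last change applied."""

    last_seen_id: int = 0
    counts: dict = field(default_factory=dict)

    def process(self, change_log):
        """Apply only entries newer than the watermark; return how many were applied."""
        new_entries = [entry for entry in change_log if entry[0] > self.last_seen_id]
        for change_id, key, amount in new_entries:
            self.counts[key] = self.counts.get(key, 0) + amount
            self.last_seen_id = max(self.last_seen_id, change_id)
        return len(new_entries)


if __name__ == "__main__":
    aggregator = IncrementalAggregator()
    log = [(1, "a", 2), (2, "b", 1)]
    print(aggregator.process(log))  # first run applies both entries
    log.append((3, "a", 5))
    print(aggregator.process(log))  # second run applies only the new entry
    print(aggregator.counts)
```

The second run does work proportional to the new data rather than to the size of the whole log, which is where the savings in compute and execution time come from.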

In this blog post, we talk about the landscape and the challenges in workflows at Netflix. We will show how we are building a clean and efficient incremental processing solution (IPS) by using Netflix Maestro and Apache Iceberg. IPS provides the incremental processing support with data accuracy, data freshness, and backfill for users and addresses many of the challenges in workflows. IPS enables users to continue to use the data processing patterns with minimal changes.


Netflix relies on data to power its business in all phases. Whether in analyzing A/B tests, optimizing studio production, training algorithms, investing in content acquisition, detecting security breaches, or optimizing payments, well structured and accurate data is foundational. As our business scales globally, the demand for data is growing and the need for scalable, low-latency incremental processing has begun to emerge. There are three common issues that dataset owners usually face.

  • Data Freshness: Large datasets from Iceberg tables need to be processed quickly and accurately to generate insights that enable faster product decisions. The hourly processing semantics, along with the valid-through-timestamp watermark or data signals provided by the Data Platform toolset today, satisfy many use cases but are not the best fit for low-latency batch processing. Before IPS, the Data Platform did not have a solution for tracking the state and progression of data sets as a single easy-to-use offering. This has led to a few internal solutions such as Psyberg. These internal libraries process data by capturing the changed partitions, which works only for specific use cases. Additionally, the libraries are tightly coupled to the user business logic, which often incurs higher migration and maintenance costs and requires heavy coordination with the Data Platform team.
  • Data Accuracy: Late arriving data causes datasets processed in the past to become incomplete and as a result inaccurate. To compensate for that, ETL workflows often use a lookback window, based on which they reprocess the data in that certain time window. For example, a job would reprocess aggregates for the past 3 days because it assumes that there would be late arriving data, but data prior to 3 days isn’t worth the cost of reprocessing.
  • Backfill: Backfilling datasets is a common operation in big data processing. This requires repopulating data for a historical time period that precedes the scheduled processing. The need for backfilling could be due to a variety of factors, e.g. (1) upstream data sets were repopulated due to changes in the business logic of their data pipeline, (2) business logic was changed in a data pipeline, (3) a new metric was created that needs to be populated for historical time ranges, (4) historical data was found missing, etc.

These challenges are currently addressed by individual teams in suboptimal and less cost-efficient ways, such as:

  • Lookback: This is a generic and simple approach that data engineers use to solve the data accuracy problem. Users configure the workflow to read the data in a window (e.g. past 3 hours or 10 days). The window is set based on users’ domain knowledge so that users have a high confidence that the late arriving data will be included or will not matter (i.e. data arrives too late to be useful). It ensures the correctness with a high cost in terms of time and compute resources.
  • Foreach pattern: Users build backfill workflows using Maestro foreach support. It works well to backfill data produced by a single workflow. If the pipeline has multiple stages or many downstream workflows, users have to manually create backfill workflows for each of them and that requires significant manual work.

The incremental processing solution (IPS) described here has been designed to address the above problems. The design goal is to provide a clean and easy to adopt solution for the Incremental processing to ensure data freshness, data accuracy, and to provide easy backfill support.

  • Data Freshness: provide the support for scheduling workflows in a micro batch fashion (e.g. 15 min interval) with state tracking functionality
  • Data Accuracy: provide the support to process all late arriving data to achieve the data accuracy needed by the business, with a multifold improvement in time and cost efficiency
  • Backfill: provide managed backfill support to build, monitor, and validate the backfill, including automatically propagating changes from upstream to downstream workflows, to greatly improve engineering productivity (i.e. a few days or weeks of engineering work to build backfill workflows vs one click for managed backfill)

Approach Overview

General Concept

Incremental processing is an approach to process data in batch, but only on new or changed data. To support incremental processing, we need an approach for not only capturing incremental data changes but also tracking their states (i.e. whether a change has been processed by a workflow or not). The solution must be aware of changes, capture them from the source table(s), and then keep tracking them. Here, changes mean more than just the new data itself. For example, a row in an aggregation target table needs all the rows from the source table associated with the aggregation row. Also, if there are multiple source tables, usually the union of the changed data ranges from all input tables gives the full change data set. Thus, the captured change information must include all related data, including unchanged rows in the source table. Due to these complexities, change tracking cannot be achieved simply by using a single watermark. IPS has to track the captured changes at a finer granularity.

The changes from the source tables might affect the transformed result in the target table in various ways.

  • If one row in the target table is derived from one row in the source table, newly captured data change will be the complete input dataset for the workflow pipeline.
  • If one row in the target table is derived from multiple rows in the source table, capturing new data will only tell us the rows have to be re-processed. But the dataset needed for ETL is beyond the change data itself. For example, an aggregation based on account id requires all rows from the source table about an account id. The change dataset will tell us which account ids are changed and then the user business logic needs to load all data associated with those account ids found in the change data.
  • If one row in the target table is derived based on the data beyond the changed data set, e.g. joining source table with other tables, newly captured data is still useful and can indicate a range of data to be affected. Then the workflow will re-process the data based on the range. For example, assuming we have a table that keeps the accumulated view time for a given account partitioned by the day. If the view time 3-days ago is updated right now due to late arriving data, then the view time for the following two days has to be re-calculated for this account. In this case, the captured late arriving data will tell us the start of the re-calculation, which is much more accurate than recomputing everything for the past X days by guesstimate, where X is a cutoff lookback window decided by business domain knowledge.
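The accumulated view time example in the last case can be sketched in a few lines of Python. This is only an illustration of the idea, not the IPS implementation: the function `recompute_accumulated`, its arguments, and the sample numbers are all hypothetical. The captured change tells us the earliest affected day, and the running total is recomputed from that day onward instead of for a fixed lookback window.

```python
from datetime import date

def recompute_accumulated(daily_minutes, accumulated, changed_days):
    """Recompute a running total starting at the earliest changed day.

    daily_minutes: {day: minutes viewed that day} for one account
    accumulated:   {day: previously computed running total}
    changed_days:  days touched by newly captured (possibly late) data
    """
    if not changed_days:
        return dict(accumulated)  # nothing changed, nothing to redo
    start = min(changed_days)
    # Totals before the first changed day are still valid and are kept.
    result = {d: v for d, v in accumulated.items() if d < start}
    running = result[max(result)] if result else 0
    for day in sorted(d for d in daily_minutes if d >= start):
        running += daily_minutes[day]
        result[day] = running
    return result

days = [date(2023, 11, n) for n in range(1, 6)]
daily = dict(zip(days, [10, 20, 30, 40, 50]))
accumulated = dict(zip(days, [10, 30, 60, 100, 150]))

# Late data raises the view time of the third day from 30 to 35 minutes.
daily[days[2]] = 35
updated = recompute_accumulated(daily, accumulated, {days[2]})
```

In this sketch only the third day and the two days after it are recomputed; the first two days are carried over untouched.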

Once the change information (data or range) is captured, a workflow has to write the data to the target table in a slightly more complicated way because the simple INSERT OVERWRITE mechanism won’t work well. There are two alternatives:

  • Merge pattern: Some compute frameworks, e.g. Spark 3, support MERGE INTO, which allows new data to be merged into the existing data set. That solves the write problem for incremental processing. Note that the workflow/step can be safely restarted without worrying about duplicate data being inserted when using MERGE INTO.
  • Append pattern: Users can also use append only write (e.g. INSERT INTO) to add the new data to the existing data set. Once the processing is completed, the append data is committed to the table. If users want to re-run or re-build the data set, they will run a backfill workflow to completely overwrite the target data set (e.g. INSERT OVERWRITE).
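The difference between the two write patterns on a restart can be illustrated with plain Python lists standing in for tables. This is a simplified sketch and does not use Spark: `merge_into` and `append_into` are hypothetical helpers that only loosely mimic MERGE INTO (upsert on a key) and INSERT INTO (append), and the rows are made up.

```python
def merge_into(target, changes, key):
    """Upsert rows by key, loosely mimicking MERGE INTO semantics."""
    merged = {row[key]: row for row in target}
    for row in changes:
        merged[row[key]] = row  # matched: update, not matched: insert
    return list(merged.values())

def append_into(target, changes):
    """Append-only write, loosely mimicking INSERT INTO."""
    return target + changes

target = [{"id": 1, "minutes": 10}]
changes = [{"id": 1, "minutes": 12}, {"id": 2, "minutes": 5}]

once = merge_into(target, changes, "id")
twice = merge_into(once, changes, "id")  # simulated step restart
appended_twice = append_into(append_into(target, changes), changes)
```

Applying the same change set twice through the merge helper leaves the table unchanged, while appending twice duplicates rows, which is why a re-run under the append pattern falls back to a full overwrite.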

Additionally, IPS will naturally support backfill in many cases. Downstream workflows (if there is no business logic change) will be triggered by the data change due to backfill. This enables auto propagation of backfill data in multi-stage pipelines. Note that backfill support is not covered in this blog post. We will talk about IPS backfill support in a follow-up blog post.

Netflix Maestro

Maestro is the Netflix data workflow orchestration platform built to meet the current and future needs of Netflix. It is a general-purpose workflow orchestrator that provides a fully managed workflow-as-a-service (WAAS) to the data platform users at Netflix. It serves thousands of users, including data scientists, data engineers, machine learning engineers, software engineers, content producers, and business analysts, in various use cases. Maestro is highly scalable and extensible to support existing and new use cases and offers enhanced usability to end users.

Since the last blog on Maestro, we have migrated all the workflows to it on behalf of users with minimal interruption. Maestro has been fully deployed in production with 100% workload running on it.

IPS is built upon Maestro as an extension by adding two building blocks, i.e. a new trigger mechanism and step job type, to enable incremental processing for all workflows. It is seamlessly integrated into the whole Maestro ecosystem with minimal onboarding cost.

Apache Iceberg

Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time. It supports expressive SQL, full schema evolution, hidden partitioning, data compaction, and time travel & rollback. In the IPS, we leverage the rich features provided by Apache Iceberg to develop a lightweight approach to capture the table changes.

Incremental Change Capture Design

Using Netflix Maestro and Apache Iceberg, we created a novel solution for incremental processing, which provides incremental change (data and range) capture in a very lightweight way without copying any data. During our exploration, we saw a huge opportunity to improve cost efficiency and engineering productivity using incremental processing.

Here is our solution to achieve incremental change capture built upon Apache Iceberg features. As we know, an Iceberg table contains a list of snapshots with a set of metadata. Snapshots include references to the actual immutable data files. A snapshot can contain data files from different partitions.

The graph above shows that s0 contains data for Partition P0 and P1 at T1. Then at T2, a new snapshot s1 is committed to the table with a list of new data files, which includes late arriving data for partition P0 and P1 and data for P2.

We implemented a lightweight approach to create an iceberg table (called ICDC table), which has its own snapshot but only includes the new data file references from the original table without copying the data files. It is highly efficient with a low cost. Then workflow pipelines can just load the ICDC table to process only the change data from partition P0, P1, P2 without reprocessing the unchanged data in P0 and P1. Meanwhile, the change range is also captured for the specified data field as the Iceberg table metadata contains the upper and lower bound information of each data field for each data file. Moreover, IPS will track the changes in data file granularity for each workflow.
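Conceptually, the change capture amounts to a set difference over data file references plus a min/max over the per-file bounds. The Python sketch below models that idea with in-memory objects. It does not call any Iceberg API, and the `DataFile` class, the `capture_changes` function, and the sample files are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataFile:
    path: str
    partition: str
    lower: int  # lower bound of the tracked field in this file
    upper: int  # upper bound of the tracked field in this file

def capture_changes(previous_files, current_files):
    """Return the file references added since the previous snapshot and
    the value range they cover. Only references are handled, no data."""
    seen = {f.path for f in previous_files}
    new_files = [f for f in current_files if f.path not in seen]
    if not new_files:
        return [], None
    change_range = (min(f.lower for f in new_files),
                    max(f.upper for f in new_files))
    return new_files, change_range

# s0 at T1 holds data for P0 and P1; s1 at T2 adds late data for P0 and P1
# plus new data for P2.
s0 = [DataFile("f0", "P0", 0, 23), DataFile("f1", "P1", 24, 47)]
s1 = s0 + [DataFile("f2", "P0", 5, 9),
           DataFile("f3", "P1", 30, 31),
           DataFile("f4", "P2", 48, 60)]

new_files, change_range = capture_changes(s0, s1)
```

Here the change set references only the three new files, even though two of them belong to partitions that already existed, and the change range comes from the file-level bounds rather than from scanning rows.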

This lightweight approach is seamlessly integrated with Maestro to allow all (thousands) scheduler users to use this new building block (i.e. incremental processing) in their tens of thousands of workflows. Each workflow using IPS will be injected with a table parameter, which is the table name of the lightweight ICDC table. The ICDC table contains only the change data. Additionally, if the workflow needs the change range, a list of parameters will be injected into the user workflow to include the change range information. The incremental processing can be enabled by a new step job type (ICDC) and/or a new incremental trigger mechanism. Users can use them together with all existing Maestro features, e.g. foreach patterns, step dependencies based on valid-through-timestamp watermark, write-audit-publish templatized pattern, etc.

Main Advantages

With this design, user workflows can adopt incremental processing with very little effort. The user business logic is also decoupled from the IPS implementation. Multi-stage pipelines can also mix the incremental processing workflows with existing normal workflows. We also found that user workflows can be simplified after using IPS by removing additional steps to handle the complexity of the lookback window or calling some internal libraries.

Adding incremental processing features into Netflix Maestro as new features/building blocks for users will enable users to build their workflows in a much more efficient way and bridge the gaps to solve many challenging problems (e.g. dealing with late arriving data) in a much simpler way.

Emerging Incremental Processing Patterns

While onboarding user pipelines to IPS, we have discovered a few incremental processing patterns:

Incrementally process the captured incremental change data and directly append them to the target table

This is the straightforward incremental processing use case, where the change data carries all the information needed for the data processing. Upstream changes (usually from a single source table) are propagated downstream (usually to another target table), and the workflow pipeline only needs to process the change data (possibly joining with other dimension tables) and then merge (usually append) the result into the target table. This pattern will replace lookback window patterns to take care of late arriving data. Instead of overwriting the past X days of data completely by using a lookback window pattern, user workflows just need to MERGE the change data (including late arriving data) into the target table by processing the ICDC table.

Use captured incremental change data as the row level filter list to remove unnecessary transformation

ETL jobs usually need to aggregate data based on certain group-by keys. Change data will disclose all the group-by keys that require a re-aggregation due to the new landing data from the source table(s). Then ETL jobs can join the original source table with the ICDC table on those group-by keys by using ICDC as a filter to speed up the processing to enable calculations of a much smaller set of data. There is no change to business transform logic and no re-design of ETL workflow. ETL pipelines keep all the benefits of batch workflows.
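The filter pattern can be illustrated with a small Python sketch in which lists of dictionaries stand in for the source and ICDC tables. The function `reaggregate_changed_keys`, the column names, and the rows are hypothetical, and a real pipeline would express the same idea as a join in its SQL engine.

```python
from collections import defaultdict

def reaggregate_changed_keys(source_rows, icdc_rows, key, value):
    """Re-aggregate only the group-by keys that appear in the change data."""
    changed_keys = {row[key] for row in icdc_rows}
    totals = defaultdict(int)
    for row in source_rows:           # the source already contains the new rows
        if row[key] in changed_keys:  # semi-join against the ICDC keys
            totals[row[key]] += row[value]
    return dict(totals)

source = [
    {"account_id": "a", "minutes": 10},
    {"account_id": "b", "minutes": 20},
    {"account_id": "b", "minutes": 5},   # newly landed row
    {"account_id": "c", "minutes": 7},
]
icdc = [{"account_id": "b", "minutes": 5}]

refreshed = reaggregate_changed_keys(source, icdc, "account_id", "minutes")
```

Only account `b` is re-aggregated, using all of its rows from the source table. Accounts `a` and `c` are skipped because nothing changed for them.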

Use the captured range parameters in the business logic

This pattern is usually used in complicated use cases, such as joining multiple tables and doing complex processing. In this case, the change data does not give the full picture of the input needed by the ETL workflow. Instead, the change data indicates a range of changed data sets for a specific set of fields (which might be partition keys) in a given input table or, more commonly, multiple input tables. Then, the union of the change ranges from all input tables gives the full change data set needed by the workflow. Additionally, the whole range of data usually has to be overwritten because the transformation is not stateless and depends on the result from the previous ranges. Another example is that the aggregated record in the target table or a window function in the query has to be updated based on the whole data set in the partition (e.g. calculating a median across the whole partition). Basically, the range derived from the change data indicates the dataset to be re-processed.
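As a rough illustration of combining change ranges from several input tables, the Python sketch below takes one (low, high) range per table and returns the single enclosing range to reprocess. This is a simplification under stated assumptions: the function `union_change_range` is hypothetical, ranges are ISO date strings on one shared field, and the union is approximated by the enclosing range rather than kept as a set of disjoint intervals.

```python
def union_change_range(ranges):
    """Combine per-table (low, high) change ranges into one range.

    Tables without any captured change are passed as None and ignored.
    """
    present = [r for r in ranges if r is not None]
    if not present:
        return None
    return (min(low for low, _ in present),
            max(high for _, high in present))

reprocess = union_change_range([
    ("2023-11-01", "2023-11-03"),  # changes captured from the first input table
    None,                          # second input table had no changes
    ("2023-11-02", "2023-11-05"),  # changes captured from the third input table
])
```

The workflow would then overwrite the target for the whole resulting range, since the transformation depends on the data in earlier parts of the range.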

Use cases

Data workflows at Netflix usually have to deal with late arriving data, which is commonly solved by using a lookback window pattern due to its simplicity and ease of implementation. In the lookback pattern, the ETL pipeline will always consume the past X partitions of data from the source table and then overwrite the target table in every run. Here, X is a number decided by the pipeline owners based on their domain expertise. The drawback is the cost of computation and execution time. It usually costs almost X times more than the pipeline without considering late arriving data. Given the fact that the late arriving data is sparse, the majority of the processing is done on data that has already been processed, which is unnecessary. Also, note that this approach is based on domain knowledge and sometimes is subject to changes of the business environment or the domain expertise of data engineers. In certain cases, it is challenging to come up with a good constant number.

Below, we will use a two-stage data pipeline to illustrate how to rebuild it using IPS to improve the cost efficiency. We will observe a significant cost reduction (> 80%) with few changes to the business logic. In this use case, we set the lookback window size X to 14 days; the value varies across real pipelines.

Original Data Pipeline with Lookback Window

  • playback_table: an Iceberg table holding playback events from user devices, ingested by streaming pipelines. Late arriving data is sparse: only a few percent of the data arrives late.
  • playback_daily_workflow: a daily scheduled workflow to process the past X days of playback_table data and write the transformed data to the target table for the past X days
  • playback_daily_table: the target table of the playback_daily_workflow, which gets overwritten every day for the past X days
  • playback_daily_agg_workflow: a daily scheduled workflow to process the past X days of playback_daily_table data and write the aggregated data to the target table for the past X days
  • playback_daily_agg_table: the target table of the playback_daily_agg_workflow, which gets overwritten every day for the past 14 days.

We ran this pipeline on a sample dataset using the real business logic. Here are the average execution results of the sample runs:

  • The first stage workflow takes about 7 hours to process playback_table data
  • The second stage workflow takes about 3.5 hours to process playback_daily_table data

New Data Pipeline with Incremental Processing

Using IPS, we rewrite the pipeline to avoid re-processing data as much as possible. The new pipeline is shown below.

Stage 1:

  • ips_playback_daily_workflow: the updated version of playback_daily_workflow.
    • The workflow's Spark SQL job reads an incremental change data capture (ICDC) Iceberg table (i.e. playback_icdc_table), which only includes the new data added to the playback_table. It includes the late arriving data but does not include any unchanged data from playback_table.
    • The business logic replaces INSERT OVERWRITE with a MERGE INTO SQL query, so the new data is merged into the playback_daily_table.

Stage 2:

  • IPS captures the changed data of playback_daily_table and also keeps the change data in an ICDC source table (playback_daily_icdc_table). So we don’t need to hard code the lookback window in the business logic. If only Y days have changed data in playback_daily_table, then the workflow only needs to load data for Y days.
  • In ips_playback_daily_agg_workflow, the business logic will be the same for the current day’s partition. We then need to update the business logic to take care of late arriving data as follows:
    • JOIN the playback_daily_table with playback_daily_icdc_table on the aggregation group-by keys for the past 2 to X days, excluding the current day (i.e. day 1).
    • Because late arriving data is sparse, the JOIN narrows down the playback_daily_table data set so that only a very small portion of it is processed.
    • The business logic uses a MERGE INTO SQL query, so the change is propagated to the downstream target table.
  • For the current day, the business logic will be the same: it consumes the data from playback_daily_table and then writes the outcome to the target table playback_daily_agg_table using INSERT OVERWRITE, because there is no need to join with the ICDC table.

With these small changes, the data pipeline efficiency is greatly improved. In our sample run:

  • The first stage workflow takes just about 30 minutes to process X days of change data from playback_table.
  • The second stage workflow takes about 15 minutes to process change data from day 2 to day X from playback_daily_table by joining with playback_daily_icdc_table data and takes another 15 minutes to process the current day (i.e. day 1) playback_daily_table change data.

Here the Spark job settings are the same in the original and new pipelines. So in total, the new IPS based pipeline needs around 10% of the resources (measured by the execution time) to finish.
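The figure of roughly 10% follows directly from the execution times reported above. The short Python calculation below only restates that arithmetic; the variable names are ours.

```python
# Average execution times from the sample runs, in hours.
original_hours = 7 + 3.5       # stage 1 + stage 2 with the lookback window
ips_hours = 0.5 + 0.25 + 0.25  # stage 1, stage 2 (days 2 to X), stage 2 (day 1)

ratio = ips_hours / original_hours
reduction = 1 - ratio

print(f"IPS pipeline time as a share of the original: {ratio:.1%}")
print(f"Reduction in execution time: {reduction:.1%}")
```

With these inputs the IPS pipeline takes 1 hour against 10.5 hours, i.e. about 9.5% of the original execution time.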

Looking Forward

We will improve IPS to support more complicated cases beyond append-only cases. IPS will be able to keep track of the progress of the table changes and support multiple Iceberg table change types (e.g. append, overwrite, etc.). We will also add managed backfill support into IPS to help users build, monitor, and validate backfills.

We are taking Big Data Orchestration to the next level and constantly solving new problems and challenges. Please stay tuned. If you are motivated to solve large scale orchestration problems, please join us.


Thanks to our Product Manager Ashim Pokharel for driving the strategy and requirements. We’d also like to thank Andy Chu, Kyoko Shimada, Abhinaya Shetty, Bharath Mummadisetty, John Zhuge, Rakesh Veeramacheneni, and other stunning colleagues at Netflix for their suggestions and feedback while developing IPS. We’d also like to thank Prashanth Ramdas, Eva Tse, Charles Smith, and other leaders of Netflix engineering organizations for their constructive feedback and suggestions on the IPS architecture and design.

Incremental Processing using Netflix Maestro and Apache Iceberg was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Psyberg: Automated end to end catch up
2 weeks ago

By Abhinaya Shetty, Bharath Mummadisetty

This blog post will cover how Psyberg helps automate the end-to-end catchup of different pipelines, including dimension tables.

In the previous installments of this series, we introduced Psyberg and delved into its core operational modes: Stateless and Stateful Data Processing. Now, let’s explore the state of our pipelines after incorporating Psyberg.

Pipelines After Psyberg

Let’s explore how different modes of Psyberg could help with a multistep data pipeline. We’ll return to the sample customer lifecycle:

Processing Requirement:
Keep track of the end-of-hour state of accounts, e.g., Active/Upgraded/Downgraded/Canceled.

One potential approach here would be as follows:

  1. Create two stateless fact tables:
    a. Signups
    b. Account Plans
  2. Create one stateful fact table:
    a. Cancels
  3. Create a stateful dimension that reads the above fact tables every hour and derives the latest account state.

Let’s look at how this can be integrated with Psyberg to auto-handle late-arriving data and corresponding end-to-end data catchup.

Navigating the Workflow: How Psyberg Handles Late-Arriving Data

We follow a generic workflow structure for both stateful and stateless processing with Psyberg; this helps maintain consistency and makes debugging and understanding these pipelines easier. The following is a concise overview of the various stages involved; for a more detailed exploration of the workflow specifics, please turn to the second installment of this series.

1. Psyberg Initialization

The workflow starts with the Psyberg initialization (init) step.

  • Input: List of source tables and required processing mode
  • Output: Psyberg identifies new events that have occurred since the last high watermark (HWM) and records them in the session metadata table.

The session metadata table can then be read to determine the pipeline input.

2. Write-Audit-Publish (WAP) Process

This is the general pattern we use in our ETL pipelines.

a. Write
Apply the ETL business logic to the input data identified in Step 1 and write to an unpublished Iceberg snapshot based on the Psyberg mode.

b. Audit
Run various quality checks on the staged data. Psyberg’s metadata session table is used to identify the partitions included in a batch run. Several audits, such as verifying source and target counts, are performed on this batch of data.

c. Publish
If the audits are successful, cherry-pick the staging snapshot to publish the data to production.

3. Psyberg Commit

Now that the data pipeline has been executed successfully, the new high watermark identified in the initialization step is committed to Psyberg’s high watermark metadata table. This ensures that the next instance of the workflow will pick up newer updates.
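The three stages above can be summarized in a small Python sketch. It is a simplified, in-memory model and not Psyberg's code: `run_session`, the tuple-based events, and the uppercase "business logic" are hypothetical stand-ins. The point it illustrates is that the high watermark only advances when the audited data has been published.

```python
def run_session(events, hwm, audit):
    """One workflow instance: init, write, audit, publish, commit.

    events: list of (landing_ts, payload) tuples in the source table
    hwm:    last committed high watermark
    audit:  callable(batch, staged) -> bool
    Returns (published_rows, new_hwm).
    """
    # 1. Init: pick up everything that landed after the last high watermark.
    batch = [(ts, payload) for ts, payload in events if ts > hwm]
    if not batch:
        return [], hwm
    candidate_hwm = max(ts for ts, _ in batch)
    # 2a. Write: apply the business logic into an unpublished staging area.
    staged = [payload.upper() for _, payload in batch]
    # 2b. Audit: e.g. source and target counts must match.
    if not audit(batch, staged):
        return [], hwm  # nothing is published and the watermark stays put
    # 2c. Publish the staged rows, then 3. commit the new watermark.
    return staged, candidate_hwm

def counts_match(batch, staged):
    return len(batch) == len(staged)

events = [(1, "a"), (2, "b"), (3, "c")]
published, hwm = run_session(events, 1, counts_match)
rerun, hwm_after_rerun = run_session(events, hwm, counts_match)
failed, hwm_after_failure = run_session(events, 1, lambda batch, staged: False)
```

Because the watermark is committed last, a failed audit leaves it unchanged and the next instance of the workflow picks up the same events again.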


  • Having the Psyberg step isolated from the core data pipeline allows us to maintain a consistent pattern that can be applied across stateless and stateful processing pipelines with varying requirements.
  • This also enables us to update the Psyberg layer without touching the workflows.
  • This is compatible with both Python and Scala Spark.
  • Debugging/figuring out what was loaded in every run is made easy with the help of workflow parameters and Psyberg Metadata.

The Setup: Automated end-to-end catchup

Let’s go back to our customer lifecycle example. Once we integrate all four components with Psyberg, here’s how we would set it up for automated catchup.

The three fact tables, comprising the signup and plan facts encapsulated in Psyberg’s stateless mode, along with the cancel fact in stateful mode, serve as inputs for the stateful sequential load ETL pipeline. This data pipeline monitors the various stages in the customer lifecycle.

In the sequential load ETL, we have the following features:

  • Catchup Threshold: This defines the lookback period for the data being read. For instance, only consider the last 12 hours of data.
  • Data Load Type: The ETL can either load the missed/new data specifically or reload the entire specified range.
  • Metadata Recording: Metadata is persisted for traceability.

Here is a walkthrough on how this system would automatically catch up in the event of late-arriving data:

Premise: All the tables were last loaded up to hour 5, meaning that any data from hour 6 onwards is considered new, and anything before that is classified as late data (as indicated in red above)

Fact level catchup:

  1. During the Psyberg initialization phase, the signup and plan facts identify the late data from hours 2 and 3, as well as the most recent data from hour 6. The ETL then appends this data to the corresponding partitions within the fact tables.
  2. The Psyberg initialization for the cancel fact identifies late data from hour 5 and additional data from hours 6 and 7. Since this ETL operates in stateful mode, the data in the target table from hours 5 to 7 will be overwritten with the new data.
  3. By focusing solely on updates and avoiding reprocessing of data based on a fixed lookback window, both Stateless and Stateful Data Processing maintain a minimal change footprint. This approach ensures data processing is both efficient and accurate.

Dimension level catchup:

  1. The Psyberg wrapper for this stateful ETL looks at the updates to the upstream Psyberg powered fact tables to determine the date-hour range to reprocess. Here’s how it would calculate the above range:
    MinHr = least(min processing hour from each source table)
    This ensures that we don’t miss out on any data, including late-arriving data. In this case, the minimum hour to process the data is hour 2.
    MaxHr = least(max processing hour from each source table)
    This ensures we do not process partial data, i.e., hours for which data has not been loaded into all source tables. In this case, the maximum hour to process the data is hour 6.
  2. The ETL process uses this time range to compute the state in the changed partitions and overwrite them in the target table. This helps overwrite data only when required and minimizes unnecessary reprocessing.
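The range calculation can be checked with a few lines of Python using the hours from the walkthrough. The function `reprocess_window` and the table names are hypothetical; the hours per table are the ones each fact table picked up in the fact level catchup.

```python
def reprocess_window(source_hours):
    """source_hours maps each source table to the hours its latest load touched."""
    # MinHr: the earliest hour touched in any source table.
    min_hr = min(min(hours) for hours in source_hours.values())
    # MaxHr: the latest hour for which every source table has data.
    max_hr = min(max(hours) for hours in source_hours.values())
    return min_hr, max_hr

window = reprocess_window({
    "signup_fact": [2, 3, 6],
    "plan_fact": [2, 3, 6],
    "cancel_fact": [5, 6, 7],
})
```

Hour 7 is excluded because only the cancel fact has data for it so far. It will be picked up once the other fact tables have loaded that hour.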

As seen above, by chaining these Psyberg workflows, we could automate the catchup for late-arriving data across hours 2 through 6. The Data Engineer does not need to perform any manual intervention in this case and can thus focus on more important things!

The Impact: How Psyberg Transformed Our Workflows

The introduction of Psyberg into our workflows has served as a valuable tool in enhancing accuracy and performance. The following are key areas that have seen improvements from using Psyberg:

  • Computational Resources Used:
    In certain instances, we’ve noticed a significant reduction in resource utilization, with the number of Spark cores used dropping by 90% following the implementation of Psyberg, compared to using fixed lookback windows
  • Workflow and Table Onboarding:
    We have onboarded 30 tables and 13 workflows into incremental processing since implementing Psyberg
  • Reliability and Accuracy:
    Since onboarding workflows to Psyberg, we have experienced zero manual catchups or missing data incidents
  • Bootstrap template:
    The process of integrating new tables into incremental processing has been made more accessible and now requires minimal effort using Psyberg

These performance metrics suggest that adopting Psyberg has been beneficial to the efficiency of our data processing workflows.

Next Steps and Conclusion

Integrating Psyberg into our operations has improved our data workflows and opened up exciting possibilities for the future. As we continue to innovate, Netflix’s data platform team is focused on creating a comprehensive solution for incremental processing use cases. This platform-level solution is intended to enhance our data processing capabilities across the organization. Stay tuned for a new post on this!

In conclusion, Psyberg has proven to be a reliable and effective solution for our data processing needs. As we look to the future, we’re excited about the potential for further advancements in our data platform capabilities.

Psyberg: Automated end to end catch up was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Diving Deeper into Psyberg: Stateless vs Stateful Data Processing
2 weeks ago

By Abhinaya Shetty, Bharath Mummadisetty

In the inaugural blog post of this series, we introduced you to the state of our pipelines before Psyberg and the challenges with incremental processing that led us to create the Psyberg framework within Netflix’s Membership and Finance data engineering team. In this post, we will delve into a more detailed exploration of Psyberg’s two primary operational modes: stateless and stateful.

Modes of Operation of Psyberg

Psyberg has two main modes of operation or patterns, as we call them. Understanding the nature of the late-arriving data and processing requirements will help decide which pattern is most appropriate for a use case.

  1. Stateless Data Processing: As the name suggests, one should use this pattern in scenarios where the columns in the target table solely depend on the content of the incoming events, irrespective of their order of occurrence. For instance, consider a scenario where we need to keep track of all the customer signups over time. In this case, the order of signups wouldn’t matter, and individual signup records are independent of each other. This information has only one source, and we can append new/late records to the fact table as and when the events are received.
  2. Stateful Data Processing: This pattern is useful when the output depends on a sequence of events across one or more input streams. For example, the customer account lifecycle in a business might involve multiple stages, such as account creation, plan upgrades, downgrades, and cancellation. To derive attributes like the lifetime of an account or the latest plan the account is on, we need to track the sequence of these events across different input streams. A missed event in such a scenario would result in incorrect analysis due to a wrong derived state. Late-arriving data in such cases requires overwriting data that was previously processed to ensure all events are accounted for.
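To make the stateful case concrete, here is a small Python sketch of an end-of-hour account state derived from a sequence of events. The function `end_of_hour_state` and the sample events are hypothetical and much simpler than a real pipeline. The sketch shows that a late event changes a previously computed hour, which therefore has to be overwritten.

```python
def end_of_hour_state(events, hours):
    """Return {hour: state}, the latest status at or before each hour.

    events: list of (event_hour, status) tuples for a single account
    """
    ordered = sorted(events)
    result = {}
    for hour in hours:
        current = None
        for event_hour, status in ordered:
            if event_hour <= hour:
                current = status
        result[hour] = current
    return result

hours = [1, 2, 3]
on_time = [(1, "Active"), (3, "Upgraded")]
before = end_of_hour_state(on_time, hours)

# A downgrade that happened in hour 2 only arrives after hour 3 was processed.
late = [(2, "Downgraded")]
after = end_of_hour_state(on_time + late, hours)

changed_hours = [h for h in hours if before[h] != after[h]]
```

In the stateless case, by contrast, the late record would simply be appended to its partition without touching rows that were already written.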

Let’s visualize how these two modes work within our data processing pipeline using a general workflow for loading a fact table. If you would like to learn more about how the workflows are orchestrated in Netflix Maestro scheduler, please check out this blog post from our data platform team.

With this illustration as our guide, let’s explore each mode in more detail.

The Psyberg Initialization Phase

This step invokes Psyberg with the required parameters. Based on these parameters, Psyberg then computes the correct data range for the pipeline processing needs.

Input parameters in this step include the following:

Initialization for Stateless Data Processing

Let’s use the signup fact table as an example here. This table’s workflow runs hourly, with the main input source being an Iceberg table storing all raw signup events partitioned by landing date, hour, and batch id.

Here’s a YAML snippet outlining the configuration for this during the Psyberg initialization step:

- job:
id: psyberg_session_init
type: Spark
- --process_name=signup_fact_load
- --src_tables=raw_signups
- --psyberg_session_id=20230914061001
- --psyberg_hwm_table=high_water_mark_table
- --psyberg_session_table=psyberg_session_metadata
- --etl_pattern_id=1

Behind the scenes, Psyberg identifies that this pipeline is configured for a stateless pattern since etl_pattern_id=1.

Psyberg also uses the provided inputs to detect the Iceberg snapshots that persisted after the latest high watermark available in the watermark table. Using the summary column in snapshot metadata [see the Iceberg Metadata section in post 1 for more details], we parse out the partition information for each Iceberg snapshot of the source table.

Psyberg then retains these processing URIs (an array of JSON strings containing combinations of landing date, hour, and batch IDs) as determined by the snapshot changes. This information and other calculated metadata are stored in the psyberg_session_f table. This stored data is then available for the subsequent LOAD.FACT_TABLE job in the workflow to utilize and for analysis and debugging purposes.
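The snapshot-to-URI step described above can be sketched in a few lines of Python. This is a minimal illustration and not Psyberg's actual code: the snapshot record layout (`committed_at`, `summary["partitions"]`) and the function name `processing_uris` are assumptions made for the example.

```python
import json


def processing_uris(snapshots, high_watermark):
    """Collect de-duplicated processing URIs from snapshots newer than the watermark.

    Hypothetical record layout: each snapshot is a dict with a commit timestamp
    (epoch seconds) and a summary listing the partitions it touched.
    """
    uris = set()
    for snap in snapshots:
        if snap["committed_at"] <= high_watermark:
            continue  # already picked up by an earlier pipeline run
        for part in snap["summary"]["partitions"]:
            # One JSON string per (landing date, hour, batch id) combination.
            uris.add(json.dumps(part, sort_keys=True))
    return sorted(uris)


snapshots = [
    {"committed_at": 1694670000, "summary": {"partitions": [
        {"landing_date": 20230914, "hour": 5, "batch_id": 1}]}},
    {"committed_at": 1694673600, "summary": {"partitions": [
        {"landing_date": 20230914, "hour": 6, "batch_id": 1},
        {"landing_date": 20230914, "hour": 3, "batch_id": 2}]}},  # late data for hour 3
]
uris = processing_uris(snapshots, high_watermark=1694670000)
```

Only the second snapshot is newer than the watermark, so the result contains the hour 6 partition and the late-arriving hour 3 partition, and nothing else is reprocessed.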

Initialization for Stateful Data Processing

Stateful Data Processing is used when the output depends on a sequence of events across one or more input streams.

Let’s consider the example of creating a cancel fact table, which takes the following as input:

  1. Raw cancellation events indicating when the customer account was canceled
  2. A fact table that stores incoming customer requests to cancel their subscription at the end of the billing period

These inputs help derive additional stateful analytical attributes, such as the type of churn (voluntary or involuntary).

The initialization step for Stateful Data Processing differs slightly from Stateless. Psyberg offers additional configurations according to the pipeline needs. Here’s a YAML snippet outlining the configuration for the cancel fact table during the Psyberg initialization step:

- job:
id: psyberg_session_init
type: Spark
- --process_name=cancel_fact_load
- --src_tables=raw_cancels|processing_ts,cancel_request_fact
- --psyberg_session_id=20230914061501
- --psyberg_hwm_table=high_water_mark_table
- --psyberg_session_table=psyberg_session_metadata
- --etl_pattern_id=2

Behind the scenes, Psyberg identifies that this pipeline is configured for a stateful pattern since etl_pattern_id is 2.

Notice the additional detail in the src_tables list corresponding to raw_cancels above. The processing_ts here represents the event processing timestamp, which is different from the regular Iceberg snapshot commit timestamp (event_landing_ts) described in part 1 of this series.

It is important to capture the range of a consolidated batch of events from all the sources (both raw_cancels and cancel_request_fact) while factoring in late-arriving events. Changes to the source table snapshots can be tracked using different timestamp fields. Knowing which timestamp field to use (event_landing_ts or something like processing_ts) helps avoid missing events.

Similar to the approach in stateless data processing, Psyberg uses the provided inputs to parse out the partition information for each Iceberg snapshot of the source table.

Sample parsed input for target snapshot_date 20230914 and snapshot_hour 9

This is then used to query the partitions metadata table, which has the min and max range for each column in the source table. In this case, we look at the min and max range of the processing_ts column to determine the actual partitions for any late-arriving events. The minimum value here helps determine the lower limit of the data to be processed, i.e., the derived minimum date and hour based on the input epoch timestamp.

Lower Limit to be processed = least ( “min” event_processing_ts)

It also tracks the VTTS (Valid To TimeStamp) of all the input streams and determines the minimum VTTS of all the streams together. This helps determine the upper limit of data to be processed, thus restricting the data load based on data completeness of all the streams combined.

Upper Limit to be processed = least (vtts date-hour)

Using this metadata from different streams, Psyberg calculates several parameters like minimum/maximum processing date and hour and event landing date hour. These parameters, along with other metadata, discussed in the previous post, are persisted in the psyberg_session_f table for analysis and debugging purposes.
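The two limits above can be illustrated with a small Python sketch. It assumes epoch-second timestamps in UTC, and the field name `min_processing_ts` and the helper names are hypothetical, chosen only for this example.

```python
from datetime import datetime, timezone


def date_hour(epoch_seconds):
    """Convert an epoch timestamp (seconds, UTC) to a (date, hour) pair."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return dt.strftime("%Y%m%d"), dt.hour


def processing_range(partition_stats, vtts_by_stream):
    """Derive the lower and upper date-hour limits for a stateful load.

    Lower limit: the least "min" processing timestamp across changed partitions.
    Upper limit: the least VTTS across all input streams.
    """
    lower = min(stat["min_processing_ts"] for stat in partition_stats)
    upper = min(vtts_by_stream.values())
    return date_hour(lower), date_hour(upper)


stats = [
    {"min_processing_ts": 1694671320},  # 2023-09-14 06:02 UTC
    {"min_processing_ts": 1694661300},  # 2023-09-14 03:15 UTC, a late-arriving event
]
vtts = {
    "raw_cancels": 1694682000,          # complete through hour 9
    "cancel_request_fact": 1694678400,  # complete through hour 8
}
lower, upper = processing_range(stats, vtts)
```

Here the late event pulls the lower limit back to hour 3, while the slower stream caps the upper limit at hour 8, so the load covers only the range for which all inputs are complete.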

Write Audit Publish (WAP) process

The Write Audit Publish (WAP) process is a general pattern we use in our ETLs to validate writes to the uncommitted Iceberg snapshot before publishing to the target table. The LOAD.FACT_TABLE step takes psyberg_session_id and process_name as input arguments.

For the stateless pattern, the processing URIs to be processed as part of the load step are identified by reading the psyberg_session_f table. This information is then used to filter the source table and apply the business logic to create the signup fact table. Any late-arriving signup event data is appended to the target table partitions as part of this. All these writes go into the uncommitted Iceberg snapshot managed by the WAP pattern.

Similarly, in the stateful pattern, the ETL step reads the psyberg_session_f table to identify the derived minimum and maximum date hour range to be processed, which acts as a filter for different input tables involved in the ETL. After applying the corresponding business logic for cancellation events, we create the cancel fact table along with columns like cancellation type (i.e., voluntary vs involuntary churn) representing the state of the canceled account. If there are any late-arriving events, Psyberg handles them automatically by providing the correct range to the data process to derive the state changes correctly.


We run different audits on the uncommitted Iceberg snapshot created as part of the job run. Leveraging Psyberg metadata, we can identify the cohort of data involved as part of the job run. This helps in pinpointing changes and applying blocking audits efficiently. Audits like source-to-target count comparison and checking for no missing events in the target Iceberg snapshot ensure data integrity and completeness. Once the audits pass successfully, the data is published to the target table.
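As a rough illustration of the write, audit, and publish sequence, the sketch below stages rows in memory, runs a count comparison and a missing-event check, and publishes only if both pass. The in-memory lists stand in for the uncommitted Iceberg snapshot and the target table, and all names (`write_audit_publish`, `event_id`) are assumptions for the example.

```python
def write_audit_publish(source_events, transform, target_table):
    """Stage transformed rows, run blocking audits, then publish.

    A plain list plays the role of the uncommitted snapshot; nothing reaches
    target_table unless every audit passes.
    """
    staged = transform(source_events)  # write: staged, not yet visible

    # Audit 1: source-to-target count comparison.
    if len(staged) != len(source_events):
        raise ValueError("count audit failed")

    # Audit 2: no source event may be missing from the staged rows.
    missing = {e["event_id"] for e in source_events} - {r["event_id"] for r in staged}
    if missing:
        raise ValueError(f"missing events audit failed: {sorted(missing)}")

    target_table.extend(staged)  # publish
    return len(staged)


events = [{"event_id": 1, "type": "signup"}, {"event_id": 2, "type": "signup"}]
target = []
published = write_audit_publish(
    events, lambda rows: [dict(r, loaded=True) for r in rows], target
)
```

Because the audits run before the publish step, a transform that drops rows raises an error and leaves the target table untouched.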

HWM Commit

Leveraging Psyberg metadata tables, we determine the latest timestamp associated with the Iceberg snapshot seen as part of the job run. This timestamp is used to update the high watermark table with the new high watermark so that the subsequent pipeline instance can pick up the next set of changes.
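A minimal sketch of the commit step, assuming the high watermark table can be modeled as a dictionary keyed by process name that keeps the latest and previous timestamps. The structure and the function name are hypothetical.

```python
def commit_high_watermark(hwm_table, process_name, snapshot_timestamps):
    """Advance the watermark to the newest snapshot seen in this run.

    hwm_table is a hypothetical in-memory stand-in for the high watermark table.
    """
    entry = hwm_table.get(process_name, {"latest": None, "previous": None})
    if not snapshot_timestamps:
        return entry  # nothing new was processed, so the watermark is unchanged
    hwm_table[process_name] = {
        "latest": max(snapshot_timestamps),
        "previous": entry["latest"],
    }
    return hwm_table[process_name]


hwm = {"signup_fact_load": {"latest": 1694670000, "previous": 1694666400}}
result = commit_high_watermark(hwm, "signup_fact_load", [1694671200, 1694673600])
```

The next pipeline instance would then look only at snapshots committed after the new latest value.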


This exploration shows how Psyberg brings efficiency, accuracy, and timeliness to Stateless and Stateful Data Processing within the Membership and Finance data engineering team. Join us in the next part of our blog series, where we’ll discuss how it also helps automate the end-to-end catchup of different pipelines.

Diving Deeper into Psyberg: Stateless vs Stateful Data Processing was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

1. Streamlining Membership Data Engineering at Netflix with Psyberg
2 weeks ago

By Abhinaya Shetty, Bharath Mummadisetty

At Netflix, our Membership and Finance Data Engineering team harnesses diverse data related to plans, pricing, membership life cycle, and revenue to fuel analytics, power various dashboards, and make data-informed decisions. Many metrics in Netflix’s financial reports are powered and reconciled with efforts from our team! Given our role on this critical path, accuracy is paramount. In this context, managing the data, especially when it arrives late, can present a substantial challenge!

In this three-part blog post series, we introduce you to Psyberg, our incremental data processing framework designed to tackle such challenges! We’ll discuss batch data processing, the limitations we faced, and how Psyberg emerged as a solution. Furthermore, we’ll delve into the inner workings of Psyberg, its unique features, and how it integrates into our data pipelining workflows. By the end of this series, we hope you will gain an understanding of how Psyberg transformed our data processing, making our pipelines more efficient, accurate, and timely. Let’s dive in!

The Challenge: Incremental Data Processing with Late Arriving Data

Our team’s data processing model mainly comprises batch pipelines, which run at different intervals ranging from hourly to multiple times a day (also known as intraday) and even daily. We expect complete and accurate data at the end of each run. To meet such expectations, we generally run our pipelines with a lag of a few hours to leave room for late-arriving data.

What is late-arriving data?

Late-arriving data is essentially delayed data due to system retries, network delays, batch processing schedules, system outages, delayed upstream workflows, or reconciliation in source systems.

How does late-arriving data impact us?

You could think of our data as a puzzle. With each new piece of data, we must fit it into the larger picture and ensure it’s accurate and complete. Thus, we must reprocess the missed data to ensure data completeness and accuracy.

Types of late-arriving data

Based on the structure of our upstream systems, we’ve classified late-arriving data into two categories, each named after the timestamps of the updated partition:

Ways to process such data

Our team previously employed some strategies to manage these scenarios, which often led to unnecessarily reprocessing unchanged data. Some techniques we used were:

1. Using fixed lookback windows to always reprocess data, assuming that most late-arriving events will occur within that window. However, this approach usually leads to redundant data reprocessing, thereby increasing ETL processing time and compute costs. It also becomes inefficient as the data scale increases. Imagine reprocessing the past 6 hours of data every hour!

2. Adding alerts to flag when late-arriving data appears, blocking the pipelines, and performing a manual intervention in which we triggered backfill pipelines to handle the missed events. This approach was a simple solution with minimal extra processing for the most part and, hence, was our preferred solution. However, when the late events occurred, the pain of reprocessing data and catching up on all the dependent pipelines was not worth it! We will talk about this shortly.
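The cost of the fixed lookback window in the first approach can be made concrete with some simple arithmetic. The sketch assumes an hourly pipeline and equally sized hourly partitions, which is a simplification.

```python
def partition_hours_processed(runs, lookback_hours):
    """Total hourly partitions read when every run reprocesses a fixed window."""
    return runs * lookback_hours


daily_runs = 24
with_lookback = partition_hours_processed(daily_runs, lookback_hours=6)
incremental = partition_hours_processed(daily_runs, lookback_hours=1)
overhead = with_lookback / incremental
```

Under these assumptions, a 6-hour lookback reads 144 partition-hours per day instead of 24, a sixfold increase that is paid whether or not any data actually arrived late.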

At a high level, both these approaches were inefficient for intraday pipelines and impacted cost, performance, accuracy, and time. We developed Psyberg, an incremental processing framework using Iceberg to handle these challenges more effectively.

The state of our pipelines before Psyberg

Before diving into the world of Psyberg, it’s crucial to take a step back and reflect on the state of the data pipelines in our team before its implementation. The complexities involved in these processes and the difficulties they posed led to the development of Psyberg.

At Netflix, our backend microservices continuously generate real-time event data that gets streamed into Kafka. These raw events are the source of various data processing workflows within our team. We ingest this diverse event data and transform it into standardized fact tables. The fact tables then feed downstream intraday pipelines that process the data hourly. The sequential load ETL shown in the diagram below depicts one such pipeline that calculates an account's state every hour.

Raw data for hours 3 and 6 arrive. Hour 6 data flows through the various workflows, while hour 3 triggers a late data audit alert.

Let’s walk through an example to understand the complexity of this pre-Psyberg world.

Consider a simplified version of our pipelines where we process three events: signups, plan changes, and cancels. Now imagine that some signup events from hour 3 were delayed and sent in at hour 6 instead. Our audits would detect this and alert the on-call data engineer (DE). The on-call DE would then face the daunting task of making things right!

Step 1: Dive into the audit logs to identify the late-arriving data and the impacted workflows. In this case, they would discover that the late-arriving data for hour 3 must be included in the signup facts.

Step 2: Stop all impacted workflows and downstream jobs (such as the sequential load ETL) and patch the missed data in the fact tables. Now, the data in the signup fact is patched.

Step 3: Identify the number of partitions to be rerun for the sequential stateful load jobs to account for the delayed data and rerun them from the impacted date-hour. The DE would note that the data for hours 3–6 needs to be reprocessed and will retrigger four instances to be run sequentially. This step is crucial because missing signup events from hour 3 would result in us missing subsequent events for those affected accounts (e.g., a cancel event for a missed signup would have had no effect). As we capture the state of an account based on the sequence of different types of events, rerunning the sequential load ETL from hours 3 to 6 ensures the accurate representation of account states.

Step 4: Now that we’ve spent significant time triaging and resolving the alert, the sequential ETL workflow likely experienced a delay. As a result, we need to catch up to schedule. To compensate for the lost time, the DE must trigger a few additional instances until the latest hour that would have run if the data hadn’t arrived late.

This entire process was challenging and required significant manual intervention from the on-call DE. Note that these are hourly jobs, so the alert could be triggered at any time of the day (or night!). Yes, such alerts were infrequent, but they were a big pain point when they occurred! Also, the on-call DE was usually not the SME for these pipelines, as the late data could have arrived in any of our upstream pipelines. To solve these problems, we came up with Psyberg!

Psyberg: The Game Changer!

Psyberg automates our data loads, making it suitable for various data processing needs, including intraday pipeline use cases. It leverages Iceberg metadata to facilitate processing incremental and batch-based data pipelines.

One of the critical features of Psyberg is its ability to detect and manage late-arriving data, no matter the partition it lands in. This feature allows data pipelines to handle late-arriving data effectively without manual intervention, ensuring higher data accuracy in our systems. Iceberg metadata and Psyberg’s own metadata form the backbone of its efficient data processing capabilities.

ETL Process High Watermark

This is the last recorded update timestamp for any data pipeline process. This is mainly used to identify new changes since the last update.

Iceberg Metadata

Psyberg primarily harnesses two key Iceberg metadata tables, snapshots and partitions, to manage the workload. All Iceberg tables have associated metadata that provide insight into changes or updates within the data tables.

The snapshots metadata table records essential metadata such as:

  • The creation time of a snapshot
  • The type of operation performed (append, overwrite, etc.)
  • A summary of partitions created/updated during the generation of the Iceberg snapshot

These details enable Psyberg to track different operations and identify changes made to a source table since the previous high watermark. For example:

The partitions metadata table is particularly interesting as it stores:

  • Information about partition keys used in the data table
  • Column names and the range of values for each column within a specific partition

One unique aspect of Netflix’s internal implementation is that it provides the range of values for each column within a partition in a deserialized format. This information helps Psyberg comprehend the timestamp ranges for both types of late-arriving data (event and processing time) without querying the actual data.

Psyberg Metadata

In addition to Iceberg metadata, Psyberg maintains its own metadata tables — the session table and the high watermark table. Both these tables are partitioned by the pipeline process name to maintain information related to each data pipeline independently.

The session table captures metadata specific to each pipeline run, including:

  • Process name partition to track all the runs associated with the data pipeline process
  • Session ID to track unique runs within the process
  • Processing URIs to identify the input partitions involved in the load
  • “from date”, “from hour”, “to date” and “to hour” for both event and processing times

The high watermark table stores relevant values from the session table at the end of each pipeline run:

  • Latest and previous high watermark timestamps
  • Metadata related to the latest run

This information is vital for each pipeline run instance as it helps determine the data to be loaded, updates the high watermark after processing, and finally generates output signals to inform downstream workflows about the date-hour up to which data is complete and available. It also serves as an essential resource for debugging and creating audits on the pipeline jobs.


In this post, we described our data architecture at a high level, along with the pain points that led to the development of Psyberg. We also went into details related to the metadata that powers Psyberg. If you understand the challenges faced by the on-call DE and would like to learn more about our solution, please check out the next iteration of this three-part series, where we delve deeper into different modes of Psyberg.

1. Streamlining Membership Data Engineering at Netflix with Psyberg was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Detecting Speech and Music in Audio Content
2 weeks, 1 day ago

Iroro Orife, Chih-Wei Wu and Yun-Ning (Amy) Hung


When you enjoy the latest season of Stranger Things or Casa de Papel (Money Heist), have you ever wondered about the secrets to fantastic story-telling, besides the stunning visual presentation? From the violin melody accompanying a pivotal scene to the soaring orchestral arrangement and thunderous sound-effects propelling an edge-of-your-seat action sequence, the various components of the audio soundtrack combine to evoke the very essence of story-telling. To uncover the magic of audio soundtracks and further improve the sonic experience, we need a way to systematically examine the interaction of these components, typically categorized as dialogue, music and effects.

In this blog post, we will introduce speech and music detection as an enabling technology for a variety of audio applications in Film & TV, as well as introduce our speech and music activity detection (SMAD) system which we recently published as a journal article in EURASIP Journal on Audio, Speech, and Music Processing.

Like semantic segmentation for audio, SMAD separately tracks the amount of speech and music in each frame in an audio file and is useful in content understanding tasks during the audio production and delivery lifecycle. The detailed temporal metadata SMAD provides about speech and music regions in a polyphonic audio mixture are a first step for structural audio segmentation, indexing and pre-processing audio for the following downstream tasks. Let’s have a look at a few applications.

Practical use cases for speech & music activity

Audio dataset preparation

Speech & music activity is an important preprocessing step to prepare corpora for training. SMAD classifies & segments long-form audio for use in large corpora, such as

From “Audio Signal Classification” by David Gerhard

Dialogue analysis & processing

  • During encoding at Netflix, speech-gated loudness is computed for every audio master track and used for loudness normalization. Speech-activity metadata is thus a central part of accurate catalog-wide loudness management and improved audio volume experience for Netflix members.
  • Similarly, algorithms for dialogue intelligibility, spoken-language-identification and speech-transcription are only applied to audio regions where there is measured speech.

Music information retrieval

  • There are a few studio use cases where music activity metadata is important, including quality-control (QC) and at-scale multimedia content analysis and tagging.
  • There are also inter-domain tasks like singer-identification and song lyrics transcription, which do not fit neatly into either speech or classical MIR tasks, but are useful for annotating musical passages with lyrics in closed captions and subtitles.
  • Conversely, where neither speech nor music activity is present, such audio regions are estimated to have content classified as noisy, environmental or sound-effects.

Localization & Dubbing

Finally, there are post-production tasks that take advantage of accurate speech segmentation at the spoken utterance or sentence level, ahead of translation and dub-script generation. Likewise, authoring accessibility features like Audio Description (AD) involves music and speech segmentation. The AD narration is typically mixed in so that it does not overlap with the primary dialogue, while music lyrics strongly tied to the plot of the story are sometimes referenced by AD creators, especially for translated AD.

A voice actor in the studio

Our Approach to Speech and Music Activity Detection

Although the application of deep learning methods has improved audio classification systems in recent years, this data-driven approach for SMAD requires large amounts of audio source material with audio-frame-level speech and music activity labels. The collection of such fine-resolution labels is costly and labor-intensive, and audio content often cannot be publicly shared due to copyright limitations. We address the challenge from a different angle.

Content, genre and languages

Instead of augmenting or synthesizing training data, we sample the large-scale data available in the Netflix catalog with noisy labels. In contrast to clean labels, which indicate precise start and end times for each speech/music region, noisy labels only provide approximate timing, which may impact SMAD classification performance. Nevertheless, noisy labels allow us to increase the scale of the dataset with minimal manual effort and potentially generalize better across different types of content.

Our dataset, which we introduced as TVSM (TV Speech and Music) in our publication, totals 1608 hours of professionally recorded and produced audio. TVSM is significantly larger than other SMAD datasets and contains both speech and music labels at the frame level. TVSM also contains overlapping music and speech labels, and both classes have a similar total duration.

Training examples were produced between 2016 and 2019, in 13 countries, with 60% of the titles originating in the USA. Content duration ranged from 10 minutes to over 1 hour, across the various genres listed below.

The dataset contains audio tracks in three different languages, namely English, Spanish, and Japanese. The language distribution is shown in the figure below. The name of the episode/TV show for each sample remains unpublished. However, each sample has both a show-ID and a season-ID to help identify the connection between the samples. For instance, two samples from different seasons of the same show would share the same show ID and have different season IDs.

What constitutes music or speech?

To evaluate and benchmark our dataset, we manually labeled 20 audio tracks from various TV shows that do not overlap with our training data. One of the fundamental issues encountered during the annotation of our manually labeled TVSM-test set was the definition of music and speech. The heavy usage of ambient sounds and sound effects blurs the boundaries between active music regions and non-music. Similarly, switches between conversational speech and singing voices in certain TV genres obscure where speech stops and music starts. Furthermore, must these two classes be mutually exclusive? To ensure label quality and consistency, and to avoid ambiguity, we converged on the following guidelines for differentiating music and speech:

  • Any music that is perceivable by the annotator at a comfortable playback volume should be annotated.
  • Since sung lyrics are often included in closed-captions or subtitles, human singing voices should all be annotated as both speech and music.
  • Ambient sound or sound effects without apparent melodic contours should not be annotated as music. Traditional phone bell, ringing, or buzzing without apparent melodic contours should not be annotated as music.
  • Filled pauses (uh, um, ah, er), backchannels (mhm, uh-huh), sighing, and screaming should not be annotated as speech.

Audio format and preprocessing

All audio files were originally delivered from the post-production studios in the standard 5.1 surround format at 48 kHz sampling rate. We first normalize all files to an average loudness of −27 LKFS ± 2 LU dialog-gated, then downsample to 16 kHz before creating an ITU downmix.

Model Architecture

Our modeling choices take advantage of both convolutional and recurrent architectures, which are known to work well on audio sequence classification tasks and are well supported by previous investigations. We adapted the state-of-the-art (SOTA) convolutional recurrent neural network (CRNN) architecture to accommodate our requirements for input/output dimensionality and model complexity. The best model was a CRNN with three convolutional layers, followed by two bi-directional recurrent layers and one fully connected layer. The model has 832k trainable parameters and emits frame-level predictions for both speech and music with a temporal resolution of 5 frames per second.

For training, we leveraged our large and diverse catalog dataset with noisy labels, introduced above. Applying a random sampling strategy, each training sample is a 20 second segment obtained by randomly selecting an audio file and corresponding starting timecode offset on the fly. All models in our experiments were trained by minimizing binary cross-entropy (BCE) loss.
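The sampling strategy and loss described above can be sketched in plain Python. This is an illustration only, not the published training code: the file names and durations are made up, and the helper names `sample_segment` and `binary_cross_entropy` are assumptions for the example.

```python
import math
import random


def sample_segment(durations, segment_seconds=20.0, rng=random):
    """Pick a random file and a random start offset for one training segment.

    durations maps a (hypothetical) file name to its length in seconds.
    """
    eligible = sorted(name for name, d in durations.items() if d >= segment_seconds)
    name = rng.choice(eligible)
    start = rng.uniform(0.0, durations[name] - segment_seconds)
    return name, start


def binary_cross_entropy(targets, probabilities, eps=1e-7):
    """Mean BCE over frames for one class (speech or music)."""
    total = 0.0
    for y, p in zip(targets, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(targets)


durations = {"track_a.wav": 1800.0, "track_b.wav": 600.0}
name, start = sample_segment(durations, rng=random.Random(0))
loss = binary_cross_entropy([1, 0], [0.5, 0.5])
```

Because speech and music can overlap, each class gets its own binary target per frame, which is why a binary loss per class fits better than a single multi-class loss.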


In order to understand the influence of different variables in our experimental setup, e.g. model architecture, training data or input representation variants like log-Mel Spectrogram versus per-channel energy normalization (PCEN), we set up a detailed ablation study, which we encourage the reader to explore fully in our EURASIP journal article.

For each experiment, we reported the class-wise F-score and error rate with a segment size of 10 ms. The error rate is the sum of the deletion rate (false negatives) and the insertion rate (false positives). Since a binary decision must be attained for music and speech to calculate the F-score, a threshold of 0.5 was used to quantize the continuous output of speech and music activity functions.
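The metrics above can be sketched for a single class as follows. This is a simplified illustration rather than the evaluation code used in the article: it assumes that both rates are normalized by the number of reference-active segments (a common convention in sound event detection) and that activations at or above the threshold count as active.

```python
def classwise_scores(reference, activations, threshold=0.5):
    """Compute F-score and error rate for one class over fixed-size segments.

    reference: 0/1 ground-truth labels per segment.
    activations: continuous model outputs in [0, 1] per segment.
    """
    predicted = [a >= threshold for a in activations]
    tp = sum(1 for r, p in zip(reference, predicted) if r and p)
    fp = sum(1 for r, p in zip(reference, predicted) if not r and p)  # insertions
    fn = sum(1 for r, p in zip(reference, predicted) if r and not p)  # deletions

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)

    n_ref = tp + fn  # segments where the class is active in the reference
    error_rate = (fn + fp) / n_ref if n_ref else 0.0
    return f_score, error_rate


reference = [1, 1, 1, 0, 0, 1]
activations = [0.9, 0.4, 0.7, 0.6, 0.1, 0.8]
f_score, error_rate = classwise_scores(reference, activations)
```

In this toy example there is one deletion and one insertion against four reference-active segments, giving an F-score of 0.75 and an error rate of 0.5.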


We evaluated our models on four open datasets comprising audio data from TV programs, YouTube clips and various content such as concerts, radio broadcasts, and low-fidelity folk music. The excellent performance of our models demonstrates the importance of building a robust system that detects overlapping speech and music and supports our assumption that a large but noisy-labeled real-world dataset can serve as a viable solution for SMAD.


At Netflix, tasks throughout the content production and delivery lifecycle are most often concerned with just one part of the soundtrack. Tasks that operate on just dialogue, music or effects are performed hundreds of times a day, by teams around the globe, in dozens of different audio languages. Investments in algorithmically assisted tools for automatic audio content understanding, like SMAD, can therefore yield substantial productivity returns at scale while minimizing tedium.

Additional Resources

We have made audio features and labels available via Zenodo. There is also a GitHub repository with the following audio tools:

  • Python code for data pre-processing, including scripts for 5.1 downmixing, Mel spectrogram generation, MFCCs generation, VGGish features generation, and the PCEN implementation.
  • Python code for reproducing all experiments, including scripts of data loaders, model implementations, training and evaluation pipelines.
  • Pre-trained models for each conducted experiment.
  • Prediction outputs for all audio in the evaluation datasets.

Special thanks to the entire Audio Algorithms team, as well as Amir Ziai, Anna Pulido, and Angie Pollema.

Detecting Speech and Music in Audio Content was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Next Step in Personalization: Dynamic Sizzles
2 weeks, 6 days ago

Authors: Bruce Wobbe, Leticia Kwok

Additional Credits: Sanford Holsapple, Eugene Lok, Jeremy Kelly


At Netflix, we strive to give our members an excellent personalized experience, helping them make the most successful and satisfying selections from our thousands of titles. We already personalize artwork and trailers, but we hadn’t yet personalized sizzle reels — until now.

A sizzle reel is a montage of video clips from different titles strung together into a seamless A/V asset that gets members excited about upcoming launches (for example, our Emmys nominations or holiday collections). Now Netflix can create a personalized sizzle reel dynamically in real time and on demand. The order of the clips and included titles are personalized per member, giving each a unique and effective experience. These new personalized reels are called Dynamic Sizzles.

In this post, we will dive into the exciting details of how we create Dynamic Sizzles with minimal human intervention, including the challenges we faced and the solutions we developed.

An example of a Dynamic Sizzle created for Chuseok, the Korean mid-autumn harvest festival collection.


In the past, each sizzle reel was created manually. The time and cost of doing this prevent scaling and miss the invaluable benefit of personalization, which is a bedrock principle at Netflix. We wanted to figure out how to efficiently scale sizzle reel production while also incorporating personalization, all in an effort to yield greater engagement and enjoyment for our members.

Enter the creation of Dynamic Sizzles. We developed a systems-based approach that uses our interactive and creative technology to programmatically stitch together multiple video clips alongside a synced audio track. The process involves compiling personalized multi-title/multi-talent promotional A/V assets on the fly into a Mega Asset. A Mega Asset is a large A/V asset made up of video clips from various titles, acting as a library from which the Dynamic Sizzle pulls media. These clips are then used to construct a personalized Dynamic Sizzle according to a predefined cadence.

With Dynamic Sizzles, we can utilize more focused creative work from editors and generate a multitude of personalized sizzle reels efficiently and effectively, saving up to 70% in time and cost compared with a manually created reel. This gives us the ability to create thousands, if not millions, of combinations of video clips and assets that result in optimized and personalized sizzle reel experiences for Netflix members.

Creating the Mega Asset

Where To Begin

Our first challenge was figuring out how to create the Mega Asset, as each video clip needs to be precise in its selection and positioning. A Mega Asset can contain any number of clips, and millions of unique Dynamic Sizzles can be produced from a single Mega Asset.

We accomplished this by using human editors to select the clips — ensuring that they are well-defined from both a creative and technical standpoint — then laying them out in a specific known order in a timeline. We also need each clip marked with an index to its location — an extremely tedious and time consuming process for an editor. To solve this, we created an Adobe Premiere plug-in to automate the process. Further verifications can also be done programmatically via ingestion of the timecode data, as we can validate the structure of the Mega Asset by looking at the timecodes.

An example of a title’s video clips layout.

The above layout shows how a single title’s clips are ordered in a Mega Asset in 3 different lengths: 160, 80 and 40 frames. Each clip should be unique per title; however, clips from different titles may share the same length. This gives us more variety to choose from while maintaining a structured order in the layout.


The cadence is a predetermined collection of clip lengths that indicates when, where, and for how long a title shows within a Dynamic Sizzle. The cadence ensures that when a Dynamic Sizzle is played, it will show a balanced view of any titles chosen, while still giving more time to a member’s higher ranked titles. Cadence is something we can personalize or randomize, and will continue to evolve as needed.

Sample Cadence

In the above sample cadence, Title A refers to the highest ranked title in a member’s personalized sort, Title B the second highest, and so on. The cadence is made up of 3 distinct segments with 5 chosen titles (A-E) played in sequence using various clip lengths. Each clip in the cadence refers to a different clip in the Mega Asset. For example, the 80 frame clip for title A in the first (red) segment is different from the 80 frame clip for title A in the third (purple) segment.
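One way to picture a cadence is as an ordered list of (rank, clip length) slots. The sketch below is only an illustration: the cadence values, the title identifiers, and the function name are all made up, since the real cadence appears only in the figure. It maps rank letters to a member's ranked titles and gives each slot its own clip index, so the same title and length never reuse a clip.

```python
from collections import defaultdict

# Hypothetical cadence: each slot is (rank letter, clip length in frames).
# These values are invented for illustration, not taken from the figure.
SAMPLE_CADENCE = [
    ("A", 80), ("B", 80), ("C", 40), ("D", 40), ("E", 40),   # segment 1
    ("A", 160), ("B", 80), ("C", 40),                        # segment 2
    ("A", 80), ("B", 40), ("C", 40), ("D", 40), ("E", 40),   # segment 3
]


def expand_cadence(cadence, ranked_titles):
    """Map rank letters to ranked titles and pick a distinct clip per slot."""
    rank_to_title = {chr(ord("A") + i): t for i, t in enumerate(ranked_titles)}
    used = defaultdict(int)  # (title, length) -> number of clips already used
    slots = []
    for rank, length in cadence:
        title = rank_to_title[rank]
        clip_index = used[(title, length)]
        used[(title, length)] += 1
        slots.append((title, length, clip_index))
    return slots


slots = expand_cadence(
    SAMPLE_CADENCE, ["title-17", "title-4", "title-9", "title-2", "title-30"]
)
```

Here the 80 frame slot for Title A in the first segment resolves to clip index 0, while the 80 frame slot for Title A in the third segment resolves to clip index 1.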

Composing the Dynamic Sizzle


When a request comes in for a sizzle reel, our system determines what titles are in the Mega Asset and, based on the request, creates and sorts a personalized list of titles. The top titles for a member are then used to construct the Dynamic Sizzle by leveraging the clips in the Mega Asset. Higher ranked titles get more weight in placement and allotted time.

Finding Timecodes

For the Dynamic Sizzle process, we have to quickly and dynamically determine the timecodes for each clip in the Mega Asset and make sure they are easily accessed at runtime. We accomplish this by utilizing Netflix’s Hollow technology. Hollow allows us to store timecodes for quick searches and use timecodes as a map — a key can be used to find the timecodes needed as defined by the cadence. The key can be as simple as titleId-clip-1.

Building The Reel

The ordering of the clips is set by the predefined cadence, which dictates the final layout and makes it easy to build the Dynamic Sizzle. For example, if the system knows to use title 17 within the Mega Asset, we can easily calculate the time offset for all the clips because of the known ordering of the titles and clips within the Mega Asset. This all comes together in the following way:
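As a rough illustration of the offset arithmetic (not the production implementation), suppose every title occupies a block of the same size in the Mega Asset and the clips inside each block follow one fixed layout. The layout values and function names below are assumptions made for the sketch.

```python
# Hypothetical per-title layout: clip lengths (in frames) in timeline order.
# Every title is assumed to use the same layout.
TITLE_LAYOUT = [160, 160, 80, 80, 80, 40, 40, 40]
FRAMES_PER_TITLE = sum(TITLE_LAYOUT)


def clip_timecode(title_position, length, clip_index):
    """Return (start_frame, end_frame) for one clip; end_frame is exclusive.

    title_position is the zero-based position of the title in the Mega Asset,
    and clip_index selects among that title's clips of the given length.
    """
    offset = title_position * FRAMES_PER_TITLE
    seen = 0
    for clip_length in TITLE_LAYOUT:
        if clip_length == length:
            if seen == clip_index:
                return offset, offset + clip_length
            seen += 1
        offset += clip_length
    raise ValueError(f"no clip {clip_index} of length {length} in layout")


def build_reel(slots):
    """Turn (title_position, length, clip_index) slots into ordered timecodes."""
    return [clip_timecode(*slot) for slot in slots]


reel = build_reel([(17, 80, 0), (3, 160, 1), (17, 40, 2)])
```

Because the layout is fixed, locating any clip needs only a multiplication and a short scan, with no search over the asset itself.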

The result is a series of timecodes indicating the start and stop times for each clip. These codes appear in the order they should be played and the player uses them to construct a seamless video experience as seen in the examples below:

The Beautiful Game Dynamic Sizzle

With Dynamic Sizzles, each member experiences a personalized sizzle reel.

Example of what 2 different profiles might see for the same sizzle

Playing the Dynamic Sizzle

Delivering To The Player

The player leverages the Mega Asset by using timecodes to know where to start and stop each clip, and then seamlessly plays each one right after the other. This required a change in the API that devices normally use to get trailers. The API change was twofold. First, on the request we need the device to indicate that it can support Dynamic Sizzles. Second, on the response the timecode list needs to be sent. (Changing the API and rolling it out took time, so this all had to be implemented before Dynamic Sizzles could actually be used, tested, and productized.)

Challenges With The Player

There were two main challenges with the player. First, in order to support features like background music across multiple unique video segments, we needed to support asymmetrical segment streaming from discontiguous locations in the Mega Asset. This involved modifying existing schemas and adding corresponding support to the player to allow for the stitching of the video and audio together separately while still keeping the timecodes in sync. Second, we needed to optimize our streaming algorithms to account for these much shorter segments, as some of our previous assumptions were incorrect when dealing with dozens of discontiguous tiny segments in the asset.

Building Great Things Together

We are just getting started on this journey to build truly great experiences. While the challenges may seem endless, the work is incredibly fulfilling. The core to bringing these great engineering solutions to life is the direct collaboration we have with our colleagues and innovating together to solve these challenges.

If you are interested in working on great technology like Dynamic Sizzles, we’d love to talk to you! We are hiring: jobs.netflix.com

The Next Step in Personalization: Dynamic Sizzles was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building In-Video Search
3 weeks, 1 day ago

Boris Chen, Ben Klein, Jason Ge, Avneesh Saluja, Guru Tahasildar, Abhishek Soni, Juan Vimberg, Elliot Chow, Amir Ziai, Varun Sekhri, Santiago Castro, Keila Fong, Kelli Griggs, Mallia Sherzai, Robert Mayer, Andy Yao, Vi Iyengar, Jonathan Solorzano-Hamilton, Hossein Taghavi, Ritwik Kumar


Today we’re going to take a look at the behind-the-scenes technology Netflix uses to create great trailers, Instagram reels, video shorts and other promotional videos.


Suppose you’re trying to create the trailer for the action thriller The Gray Man, and you know you want to use a shot of a car exploding. You don’t know if that shot exists or where it is in the film, and you have to look for it by scrubbing through the whole film.

Exploding cars — The Gray Man (2022)

Or suppose it’s Christmas, and you want to create a great Instagram piece out of all the best scenes across Netflix films of people shouting “Merry Christmas”! Or suppose it’s Anya Taylor-Joy’s birthday, and you want to create a highlight reel of all her most iconic and dramatic shots.

Creating these involves sifting through hundreds of thousands of movies and TV shows to find the right line of dialogue or the appropriate visual elements (objects, scenes, emotions, actions, etc.). We have built an internal system that allows someone to perform in-video search across the entire Netflix video catalog, and we’d like to share our experience in building this system.

Building in-video search

To build such a visual search engine, we needed a machine learning system that can understand visual elements. Our early attempts included object detection, but we found that general labels were both too limiting and too specific, yet not specific enough. Every show has special objects that are important (e.g. Demogorgon in Stranger Things) that don’t translate to other shows. The same was true for action recognition, and other common image and video tasks.

The Approach

We discovered that contrastive learning works well for our objectives when applied to image and text pairs, as these models can effectively learn joint embedding spaces between the two modalities. This approach is also able to learn about objects, scenes, emotions, actions, and more in a single model. We also found that extending contrastive learning to videos and text provided a substantial improvement over frame-level models.

In order to train the model on internal training data (video clips with aligned text descriptions), we implemented a scalable version on Ray Train and switched to a more performant video decoding library. Lastly, the embeddings from the video encoder exhibit strong zero-shot or few-shot performance on multiple video and content understanding tasks at Netflix and are used as a starting point in those applications.

The recent success of large-scale models that jointly train image and text embeddings has enabled new use cases around multimodal retrieval. These models are trained on large amounts of image-caption pairs via in-batch contrastive learning. For a (large) batch of N examples, we wish to maximize the embedding (cosine) similarity of the N correct image-text pairs, while minimizing the similarity of the other N²-N paired embeddings. This is done by treating the similarities as logits and minimizing the symmetric cross-entropy loss, which gives equal weighting to the two settings (treating the captions as labels to the images and vice versa).
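The loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code: the function names are made up, and the temperature value is an assumed scaling factor that the post does not specify.

```python
import math


def _normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropy losses."""
    images = [_normalize(v) for v in image_emb]
    texts = [_normalize(v) for v in text_emb]
    # logits[i][j]: cosine similarity of image i and text j, scaled (assumed value).
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(transposed))
```

The diagonal of the N×N logits matrix holds the N correct pairs, and the N²-N off-diagonal entries act as negatives. Averaging the row-wise and column-wise losses is what makes the loss symmetric.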

Consider the following two images and captions:

Images are from Glass Onion: A Knives Out Mystery (2022)

Once properly trained, the embeddings for the corresponding images and text (i.e. captions) will be close to each other and farther away from unrelated pairs.

Typically, embedding spaces have hundreds or thousands of dimensions.

At query time, the input text query can be mapped into this embedding space, and we can return the closest matching images.

The query may not have existed in the training set. Cosine similarity can be used as a similarity measure.

While these models are trained on image-text pairs, we have found that they are an excellent starting point to learning representations of video units like shots and scenes. As videos are a sequence of images (frames), additional parameters may need to be introduced to compute embeddings for these video units, although we have found that for shorter units like shots, an unparameterized aggregation like averaging (mean-pooling) can be more effective. To train these parameters as well as fine-tune the pretrained image-text model weights, we leverage in-house datasets that pair shots of varying durations with rich textual descriptions of their content. This additional adaptation step improves performance by 15–25% on video retrieval tasks (given a text prompt), depending on the starting model used and metric evaluated.
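Mean-pooling needs no trained parameters, which is why it works as a drop-in aggregation for short units. The sketch below assumes per-frame embeddings are already available as lists of floats. The final L2 normalization is an added assumption (convenient for cosine similarity), not something the post states.

```python
import math


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into one shot embedding, then L2-normalize."""
    if not frame_embeddings:
        raise ValueError("a shot needs at least one frame embedding")
    dim = len(frame_embeddings[0])
    count = len(frame_embeddings)
    mean = [sum(frame[d] for frame in frame_embeddings) / count for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Normalization is an assumption of this sketch; skip it for a zero vector.
    return [x / norm for x in mean] if norm > 0 else mean
```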

On top of video retrieval, there are a wide variety of video clip classifiers within Netflix that are trained specifically to find a particular attribute (e.g. closeup shots, caution elements). Instead of training from scratch, we have found that using the shot-level embeddings can give us a significant head start, even beyond the baseline image-text models that they were built on top of.

Lastly, shot embeddings can also be used for video-to-video search, a particularly useful application in the context of trailer and promotional asset creation.

Engineering and Infrastructure

Our trained model gives us a text encoder and a video encoder. Video embeddings are precomputed on the shot level, stored in our media feature store, and replicated to an Elasticsearch cluster for real-time nearest neighbor queries. Our media feature management system automatically triggers the video embedding computation whenever new video assets are added, ensuring that we can search through the latest video assets.

The embedding computation is based on a large neural network model and has to be run on GPUs for optimal throughput. However, shot segmentation from a full-length movie is CPU-intensive. To fully utilize the GPUs in the cloud environment, we first run shot segmentation in parallel on multi-core CPU machines and store the resulting shots in S3 object storage, encoded in video formats such as mp4. During GPU computation, we stream mp4 video shots from S3 directly to the GPUs using a data loader that performs prefetching and preprocessing. This approach ensures that the GPUs are efficiently utilized during inference, thereby increasing the overall throughput and cost-efficiency of our system.
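The prefetching idea can be illustrated with a bounded queue and a background thread: while the consumer (the GPU step) works on one shot, the loader is already fetching and preprocessing the next ones. This is a generic sketch using only the Python standard library, with a hypothetical `load` callable standing in for the S3 download and decoding.

```python
import queue
import threading


def prefetch(items, load, buffer_size=4):
    """Yield load(item) for each item, loading ahead in a background thread."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def worker():
        try:
            for item in items:
                buffer.put(load(item))  # blocks while the buffer is full
        finally:
            # In this sketch a failure in load() simply ends the stream early;
            # a real loader would surface the error to the consumer.
            buffer.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        loaded = buffer.get()
        if loaded is done:
            return
        yield loaded
```

A caller would iterate over `prefetch(shot_paths, load=download_and_decode)`, where both names are placeholders. The bounded buffer keeps memory use in check when loading is faster than inference.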

At query time, a user submits a text string representing what they want to search for. For visual search queries, we use the text encoder from the trained model to extract a text embedding, which is then used to perform the appropriate nearest neighbor search. Users can either select a subset of shows to search over or perform a catalog-wide search.
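The query path can be approximated with a brute-force search. The sketch below stands in for the real nearest neighbor index, and the index rows, show names, and query vector are invented for illustration. The optional `shows` argument mirrors the choice between searching a subset of shows and searching the whole catalog.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search_shots(query_embedding, index, top_k=3, shows=None):
    """Brute-force nearest neighbor search over (show, shot_id, embedding) rows."""
    candidates = (row for row in index if shows is None or row[0] in shows)
    scored = (
        (cosine(query_embedding, emb), show, shot_id)
        for show, shot_id, emb in candidates
    )
    return heapq.nlargest(top_k, scored)  # highest similarity first


# Hypothetical precomputed shot embeddings.
INDEX = [
    ("show-a", "shot-1", [0.9, 0.1, 0.0]),
    ("show-a", "shot-2", [0.0, 1.0, 0.0]),
    ("show-b", "shot-7", [1.0, 0.0, 0.1]),
]
query = [1.0, 0.0, 0.0]  # stands in for the text encoder's output
results = search_shots(query, INDEX, top_k=2)
```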

If you’re interested in more details, see our other post covering the Media Understanding Platform.


Finding a needle in a haystack is hard. We learned from talking to video creatives who make trailers and social media videos that being able to find needles was key, and a big pain point. The solution we described has been fruitful, works well in practice, and is relatively simple to maintain. Our search system allows our creatives to iterate faster, try more ideas, and make more engaging videos for our viewers to enjoy.

We hope this post has been interesting to you. If you are interested in working on problems like this, Netflix is always hiring great researchers, engineers and creators.

Building In-Video Search was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Streaming SQL in Data Mesh
3 weeks, 4 days ago

Democratizing Stream Processing @ Netflix

By Guil Pires, Mark Cho, Mingliang Liu, Sujay Jain

Data powers much of what we do at Netflix. On the Data Platform team, we build the infrastructure used across the company to process data at scale.

In our last blog post, we introduced “Data Mesh” — A Data Movement and Processing Platform. When a user wants to leverage Data Mesh to move and transform data, they start by creating a new Data Mesh pipeline. The pipeline is composed of individual “Processors” that are connected by Kafka topics. The Processors themselves are implemented as Flink jobs that use the DataStream API.

Since then, we have seen many use cases (including Netflix Graph Search) adopt Data Mesh for stream processing. We were able to onboard many of these use cases by offering some commonly used Processors out of the box, such as Projection, Filtering, Unioning, and Field Renaming.

An example of a Data Mesh pipeline which moves and transforms data using Union, GraphQL Enrichment, and Column Rename Processor before writing to an Iceberg table.

Keeping the logic of individual Processors simple made them reusable, so we could centrally manage and operate them at scale. It also made them composable, so users could combine the different Processors to express the logic they needed.

However, this design decision led to a different set of challenges.

Some teams found the provided building blocks were not expressive enough. For use cases which were not solvable using existing Processors, users had to express their business logic by building a custom Processor. To do this, they had to use the low-level DataStream API from Flink and the Data Mesh SDK, which came with a steep learning curve. After it was built, they also had to operate the custom Processors themselves.

Furthermore, many pipelines needed to be composed of multiple Processors. Since each Processor was implemented as a Flink job connected by Kafka topics, many pipelines incurred a relatively high runtime overhead.

We explored various options to solve these challenges, and eventually landed on building the Data Mesh SQL Processor that would provide additional flexibility for expressing users’ business logic.

The existing Data Mesh Processors have a lot of overlap with SQL. For example, filtering and projection can be expressed in SQL through SELECT and WHERE clauses. Additionally, instead of implementing business logic by composing multiple individual Processors together, users could express their logic in a single SQL query, avoiding the additional resource and latency overhead that came from multiple Flink jobs and Kafka topics. Furthermore, SQL can support User Defined Functions (UDFs) and custom connectors for lookup joins, which can be used to extend expressiveness.
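To make the overlap concrete, the sketch below runs one query that filters, projects, and renames a field in a single step. It uses Python's built-in SQLite purely as a stand-in: SQLite runs a one-off batch query, whereas Flink SQL runs continuously over streams, and the table and column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE playback_events "
    "(member_id INTEGER, title_id INTEGER, country TEXT, duration_sec INTEGER)"
)
conn.executemany(
    "INSERT INTO playback_events VALUES (?, ?, ?, ?)",
    [(1, 17, "US", 120), (2, 17, "BR", 45), (3, 42, "US", 300)],
)

# One query covers what would otherwise be a Filter, a Projection, and a
# Field Renaming step chained together.
QUERY = """
    SELECT member_id, title_id AS show_id
    FROM playback_events
    WHERE country = 'US' AND duration_sec >= 60
    ORDER BY member_id
"""
rows = conn.execute(QUERY).fetchall()
```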

Data Mesh SQL Processor

Since Data Mesh Processors are built on top of Flink, it made sense to consider using Flink SQL instead of continuing to build additional Processors for every transform operation we needed to support.

The Data Mesh SQL Processor is a platform-managed, parameterized Flink Job that takes schematized sources and a Flink SQL query that will be executed against those sources. By leveraging Flink SQL within a Data Mesh Processor, we were able to support the streaming SQL functionality without changing the architecture of Data Mesh.

Under the hood, the Data Mesh SQL Processor is implemented using Flink’s Table API, which provides a powerful abstraction to convert between DataStreams and Dynamic Tables. Based on the sources that the processor is connected to, the SQL Processor will automatically convert the upstream sources into tables within Flink’s SQL engine. The user’s query is then registered with the SQL engine and translated into a Flink job graph consisting of physical operators that can be executed on a Flink cluster. Unlike the low-level DataStream API, users do not have to manually build a job graph using low-level operators, as this is all managed by Flink’s SQL engine.

SQL Experience on Data Mesh

The SQL Processor enables users to fully leverage the capabilities of the Data Mesh platform. This includes features such as autoscaling, the ability to manage pipelines declaratively via Infrastructure as Code, and a rich connector ecosystem.

In order to ensure a seamless user experience, we’ve enhanced the Data Mesh platform with SQL-centric features. These enhancements include an Interactive Query Mode, real-time query validation, and automated schema inference.

To understand how these features help the users be more productive, let’s take a look at a typical user workflow when using the Data Mesh SQL Processor.

  • Users start their journey by live sampling their upstream data sources using the Interactive Query Mode.
  • As the user iterates on their SQL query, the query validation service provides real-time feedback about the query.
  • With a valid query, users can leverage the Interactive Query Mode again to execute the query and get the live results streamed back to the UI within seconds.
  • For more efficient schema management and evolution, the platform will automatically infer the output schema based on the fields selected by the SQL query.
  • Once the user is done editing their query, it is saved to the Data Mesh Pipeline, which will then be deployed as a long running, streaming SQL job.
Overview of the SQL Processor workflow.

Users typically iterate on their SQL query multiple times before deploying it. Validating and analyzing queries at runtime after deployment would not only slow down their iteration, but also make it difficult to automate schema evolution in Data Mesh.

To address this challenge, we have implemented a query validation service that can verify a Flink SQL query and provide a meaningful error message for violations in real time. This enables users to have prompt validation feedback while they are editing the query. We leverage Apache Flink’s internal Planner classes to parse and transform SQL queries without creating a fully-fledged streaming table environment. This makes the query service lightweight, scalable, and execution agnostic.

To effectively operate thousands of use cases at the platform layer, we built opinionated guardrails to limit some functionalities of Flink SQL. We plan on gradually expanding the supported capabilities over time. We implemented the guardrails by recursively inspecting the Calcite tree constructed from the user’s query. If the tree contains nodes that we currently don’t support, the query will be rejected from being deployed. Additionally, we translate Flink’s internal exceptions containing cryptic error messages into more meaningful error messages for our users. We plan on continuing our investments into improving the guardrails, as having proper guardrails helps to improve the user experience. Some ideas for the future include rules to reject expensive and suboptimal queries.
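The recursive inspection can be pictured as a walk over the query tree with an allowlist of node kinds. The sketch below uses a made-up `PlanNode` class and a made-up allowlist. It does not use Calcite's actual classes, and the post does not list which operations are supported.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    """Hypothetical stand-in for a node in the parsed query tree."""
    kind: str
    children: list = field(default_factory=list)


# Hypothetical allowlist of supported node kinds.
SUPPORTED_KINDS = {"Scan", "Filter", "Project", "Union"}


def find_unsupported(node):
    """Recursively collect the kinds of all nodes that are not on the allowlist."""
    found = [] if node.kind in SUPPORTED_KINDS else [node.kind]
    for child in node.children:
        found.extend(find_unsupported(child))
    return found


def validate(plan):
    """Raise a readable error if the plan contains unsupported nodes."""
    unsupported = find_unsupported(plan)
    if unsupported:
        kinds = ", ".join(sorted(set(unsupported)))
        raise ValueError(f"Query uses operations that are not supported yet: {kinds}")
```

An allowlist errs on the side of rejecting anything new, which fits the stated plan of expanding supported capabilities gradually.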

To help Data Mesh users iterate quickly on their business logic, we have built the Interactive Query Mode as part of the platform. Users can start live sampling their streaming data by executing a simple `SELECT * FROM <table>` query. Using the Interactive Query Mode, the Data Mesh platform will execute the Flink SQL query and display the results in the UI in seconds. Since this is a Flink SQL query on streaming data, new results will continue to be delivered to the user in real time.

Users can continue to iterate and modify their Flink SQL query and once they’re satisfied with their query output, they can save the query as part of their stream processing pipeline.

To provide this interactive experience, we maintain an always-running Flink Session Cluster that can run concurrent parameterized queries. These queries will output their data to a Mantis sink in order to stream the results back to the user’s browser.

Interactive Query mode in action

Learnings from our journey

In hindsight, we wish we had invested in enabling Flink SQL on the Data Mesh platform much earlier. If we had had the Data Mesh SQL Processor earlier, we would’ve been able to avoid spending engineering resources to build smaller building blocks such as the Union, Column Rename, Projection, and Filtering Processors.

Since we’ve productionized Data Mesh SQL Processor, we’ve seen excitement and quick adoption from our Data Mesh users. Thanks to the flexibility of Flink SQL, users have a new way to express their streaming transformation logic other than writing a custom processor using the low-level DataStream API.

While Flink SQL is a powerful tool, we view the Data Mesh SQL Processor as a complementary addition to our platform. It is not meant to be a replacement for custom processors and Flink jobs using the low-level DataStream API. Since SQL is a higher-level abstraction, users no longer have control over low-level Flink operators and state. This means that if state evolution is critical to the user’s business logic, then having complete control over the state can only be done through low-level abstractions like the DataStream API. Even with this limitation, we have seen that there are many new use cases that are unlocked through the Data Mesh SQL Processor.

Our early investment in guardrails has helped set clear expectations with our users and keep the operational burden manageable. It has allowed us to productionize queries and patterns that we are confident about supporting, while providing a framework to introduce new capabilities gradually.

Future of SQL on Data Mesh

While introducing the SQL Processor to the Data Mesh platform was a great step forward, we still have much more work to do in order to unlock the power of stream processing at Netflix. We’ve been working with our partner teams to prioritize and build the next set of features to extend the SQL Processor. These include stream enrichment using Slowly-Changing-Dimension (SCD) tables, temporal joins, and windowed aggregations.

Stay tuned for more updates!

Streaming SQL in Data Mesh was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building Mature Content Detection for Mod Tools
1 day, 13 hours ago

Written by Nandika Donthi and Jerry Chu.


Reddit is a platform serving diverse content to over 57 million users every day. One mission of the Safety org is protecting users (including our mods) from potentially harmful content. In September 2023, Reddit Safety introduced Mature Content filters (MCFs) for mods to enable on their subreddits. This feature allows mods to automatically filter NSFW content (e.g. sexual and graphic images/videos) into a community’s modqueue for further review.

While allowed on Reddit within the confines of our content policy, sexual and violent content is not necessarily welcome in every community. In the past, to detect such content, mods often relied on keyword matching or monitoring their communities in real time. The launch of this filter helped mods decrease the time and effort of managing such content within their communities, while also increasing the amount of content coverage.

In this blog post, we’ll delve into how we built a real-time detection system that leverages in-house Machine Learning models to classify mature content for this filter.


Over the past couple years, the Safety org established a development framework to build Machine Learning models and data products. This was also the framework we used to build models for the mature content filters:

The ML Data Product Lifecycle: Understanding the product problem, data curation, modeling, and productionization.

Product Problem:

The first step we took in building this detection was to thoroughly understand the problem we’re trying to solve. This seems pretty straightforward but how and where the model is used determines what goals we focus on; this affects how we decide to create a dataset, build a model, and what to optimize for, etc. Learning about what content classification already exists and what we can leverage is also important in this stage.

While the sitewide “NSFW” tag could have been a way to classify content as sexually explicit or violent, we wanted to allow mods to have more granular control over the content they could filter. This product use case necessitated a new kind of content classification, prompting our decision to develop new models that classify images and videos, according to the definitions of sexually explicit and violent. We also worked with the Community and Policy teams to understand in what cases images/videos should be considered explicit/violent and the nuances between different subreddits.

Data Curation:

Once we had an understanding of the product problem, we began the data curation phase. The main goal of this phase was to have a balanced annotated dataset of images/videos that were labeled as explicit/violent and figure out what features (or inputs) we could use to build the model.

We started by conducting exploratory data analysis (EDA), specifically focusing on the sensitive content areas that we were building classification models for. Initially, the analysis was open-ended, aimed at understanding general questions like: What is the prevalence of the content on the platform? What is the volume of images/videos on Reddit? What types of images/videos are in each content category? Conducting EDA was a critical step for us in developing an intuition for the data. It also helped us identify potential pitfalls in model development, as well as in building the system that processes media and applies model classifications.

Throughout this analysis, we also explored signals that were already available, either developed by other teams at Reddit or open source tools. Given that Reddit is inherently organized into communities centered around specific content areas, we were able to utilize this structure to create heuristics and sampling techniques for our model training dataset.

Data Annotation:

Having a large dataset of high-quality ground truth labels was essential in building an accurate, effective Machine Learning model. To form an annotated dataset, we created detailed classification guidelines according to content policy, and had a production dataset labeled with the classification. We went through several iterations of annotation, verifying the labeling quality and adjusting the annotation job to address any “gray areas” or common patterns of mislabeling. We also implemented various quality assurance controls on the labeler side such as establishing a standardized labeler assessment, creating test questions inserted throughout the annotation job, analyzing time spent on each task, etc.


Modeling:

The next phase of this lifecycle is to build the model itself. The goal is to have a viable model that we can use in production to classify content using the datasets we created in the previous annotation phase. This phase also involved exploratory data analysis to figure out what features to use, which ones are viable in a production setting, and experimenting with different model architectures. After iterating and experimenting through multiple sets of features, we found that a mix of visual, post-level, and subreddit-level signals as inputs produced the best image and video classification models.

Before we decided on a final model, we did some offline model impact analysis to estimate what effect it would have in production. While seeing how the model performs on a held-out test set is usually the standard way to measure its efficacy, we also wanted a more detailed and comprehensive way to measure each model’s potential impact. We gathered a dataset of historical posts and comments and produced model inferences for each associated image or video and each model. With this dataset and corresponding model predictions, we analyzed how each model performed on different subreddits, and roughly predicted the number of posts/comments that would be filtered in each community. This analysis helped us ensure that the detection that we’d be putting into production was aligned with the original content policy and product goals.
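A simplified version of this per-community estimate might look like the sketch below. The subreddit names, scores, and decision threshold are placeholders, and the real analysis covered more dimensions than a single filter rate.

```python
from collections import defaultdict


def estimated_filter_rates(predictions, threshold=0.5):
    """Return subreddit -> share of items the model would filter.

    predictions is an iterable of (subreddit, model_score) pairs; the
    threshold is an assumed decision cutoff.
    """
    totals = defaultdict(int)
    filtered = defaultdict(int)
    for subreddit, score in predictions:
        totals[subreddit] += 1
        if score >= threshold:
            filtered[subreddit] += 1
    return {sub: filtered[sub] / totals[sub] for sub in totals}


# Hypothetical historical predictions.
historical = [
    ("r/example_a", 0.91), ("r/example_a", 0.12),
    ("r/example_b", 0.05), ("r/example_b", 0.30),
]
rates = estimated_filter_rates(historical, threshold=0.5)
```

Comparing these rates across candidate models and thresholds gives a rough view of how much each community's modqueue would grow.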

This model development and evaluation process (i.e. exploratory data analysis, training a model, performing offline analysis, etc.) was iterative and repeated several times until we were satisfied with the model results on all types of offline evaluation.


Productionization:

The last stage is productionizing the model. The goal of this phase is to create a system to process each image/video, gather the relevant features and inputs to the models, integrate the models into a hosting service, and relay the corresponding model predictions to downstream consumers like the MCF system. We used an existing Safety service, Content Classification Service, to implement the aforementioned system and added two specialized queues for our processing and various service integrations. To use the model for online, synchronous inference, we added it to Gazette, Reddit’s internal ML inference service. Once all the components were up and running, our final step was to run A/B tests on Reddit to understand the live impact on areas like user engagement before finalizing the entire detection system.

The ML model serving architecture in production

The above architecture graph describes the ML model serving workflow. During user media upload, Reddit’s Media-service notifies Content Classification Service (CCS). CCS, a main backend service owned by Safety for content classification, collects different levels of signals of images/videos in real-time, and sends the assembled feature vector to our safety moderation models hosted by Gazette to conduct online inference. If the ML models detect X (for sexual) and/or V (for violent) content in the media, the service relays this information to the downstream MCF system via a messaging service.
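The flow above can be summarized as three steps: collect signals, run inference, and relay positive labels. The sketch below passes each step in as a plain function. All names, the message shape, and the threshold are hypothetical and do not reflect the actual CCS or Gazette interfaces.

```python
def classify_media(media, collect_signals, predict, publish, threshold=0.5):
    """Collect signals, run inference, and relay any positive labels downstream."""
    features = collect_signals(media)
    scores = predict(features)  # e.g. {"X": 0.93, "V": 0.02}
    labels = sorted(label for label, score in scores.items() if score >= threshold)
    if labels:
        # Only positive detections are relayed to the downstream consumer.
        publish({"media_id": media["id"], "labels": labels})
    return labels


published = []
labels = classify_media(
    {"id": "m1"},
    collect_signals=lambda media: [0.2, 0.7],         # stub feature vector
    predict=lambda features: {"X": 0.93, "V": 0.02},  # stub model scores
    publish=published.append,                         # stub messaging service
)
```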

Throughout this project, we often went back and forth between these steps, so it’s not necessarily a linear process. We also went through this lifecycle more than once: first building a simple v0 heuristic model, then building a v1 model to improve each model’s accuracy and precision, and finally building more advanced deep learning models to productionize in the future.

Integration with MCF

Creation of test content

To ensure the Mature Content Filtering system was integrated with the ML detection, we needed to generate test images and videos that, while not inherently explicit or violent, would deliberately yield positive model classifications when processed by our system. This testing approach was crucial in assessing the effectiveness and accuracy of our filtering mechanisms, and allowed us to identify bugs and fine-tune our systems for optimal performance upfront.

Reduce latency

Efforts to reduce latency have been a top priority in our service enhancements, especially since our SLA is to guarantee near real-time content detection. We've implemented multiple measures to ensure that our services can automatically and effectively scale during upstream incidents and periods of high volume. We've also introduced various caching mechanisms for frequently posted images, videos, and features, optimizing data retrieval and enhancing load times. Furthermore, we've initiated work on separating image and video processing, a strategic step towards more efficient media handling and improved overall system performance.
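One simple form of caching for frequently posted media is to key model results by a hash of the media bytes, so that an exact re-upload skips inference. The post does not describe how its caches are keyed, so this is an assumption. The sketch also leaves out eviction, which a real cache would need.

```python
import hashlib


class PredictionCache:
    """Cache model scores keyed by a SHA-256 hash of the media bytes."""

    def __init__(self, predict):
        self._predict = predict
        self._scores = {}  # unbounded here; a real cache would evict entries
        self.hits = 0

    def scores_for(self, media_bytes):
        key = hashlib.sha256(media_bytes).hexdigest()
        if key in self._scores:
            self.hits += 1
        else:
            self._scores[key] = self._predict(media_bytes)
        return self._scores[key]
```

Exact hashing only catches byte-identical uploads. Re-encoded or resized copies would miss the cache and go through inference as usual.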

Future Work

Though we are satisfied with the current system, we are constantly striving to improve it, especially the ML model performance.

One of our future projects includes building an automated model quality monitoring framework. We have millions of Reddit posts & comments created daily that require us to keep the model up-to-date to avoid performance drift. Currently, we conduct routine model assessments to understand if there is any drift, with the help of manual scripting. This automatic monitoring framework will have features including:

  • Sampling production data, having it annotated by our third-party annotation platform, and automatically generating model metrics to gauge model performance over time
  • Connecting these annotated datasets and the feedback from Mod ML models to our automated model re-training pipelines to create a true active learning framework
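To make the monitoring idea concrete, here is a rough sketch of the metric step: given sampled predictions paired with annotator labels, compute precision and recall and compare against a baseline. The function names and the 0.05 tolerance are hypothetical, and a real framework would track far more than one number.

```python
from typing import Iterable, Tuple


def precision_recall(pairs: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """pairs holds (model_prediction, annotator_label) for each sampled item."""
    tp = fp = fn = 0
    for predicted, actual in pairs:
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual and not predicted:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def has_drifted(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    # Flag drift when the metric falls more than `tolerance` below the baseline.
    return (baseline - current) > tolerance
```

Running this on each fresh annotated sample gives a time series of metrics, and a drift flag could be the trigger for the re-training pipeline.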

Additionally, we plan to productionize more advanced models to replace our current model. In particular, we’re actively working with Reddit’s central ML org to support large model serving via GPU, which paves the path for online inference of more complex Deep Learning models within our latency requirements. We’ll also continuously incorporate other newer signals for better classification.

Within Safety, we’re committed to building great products to improve the quality of Reddit’s communities. If ensuring the safety of users on one of the most popular websites in the US excites you, please check out our careers page for a list of open positions.

submitted by /u/sassyshalimar
Happy Thanksgiving to the r/RedditEng Community
1 week, 1 day ago

Thankful and Grateful

It is Thanksgiving this week in the United States. We would like to take this opportunity to express our thanks and gratitude to the entire r/RedditEng community for your continued support over the past 2.5 years. We'll be back next week (after we finish stuffing ourselves with delicious food) with our usual content. For now, Happy Thanksgiving!

submitted by /u/nhandlerOfThings
The Definitive Guide for Asking for Technical Help, and Other Pro Tips I wish I Knew Earlier in My Eng Career.
2 weeks, 1 day ago

Written by Becca Rosenthal, u/singshredcode.

I was a Middle East Studies major who worked in the Jewish Non-Profit world for a few years after college before attending a coding bootcamp and pestering u/spez into an engineering job at Reddit with the help of a fictional comedy song about matching with a professional mentor on Tinder (true story – AMA here).

Five years later, I’m a senior engineer on our security team who is good at my job. How did I do this? I got really good at asking questions, demonstrating consistent growth, and managing interpersonal relationships.

Sure, my engineering skills have obviously helped me get and stay where I am, but I think of myself as the world’s okayest engineer. My soft skills have been the differentiating factor in my career, and since I hate gatekeeping, this post is going to be filled with phrases, framings, tips, and tricks that I’ve picked up over the years. Also, if you read something in this post and strongly disagree or think it doesn’t work for you, that’s fine! Trust your gut for what you need.

World's Okayest Engineer Mug!

This advice will be geared toward early career folks, but I think there’s something here for everyone.

The guide to asking technical questions:

You’re stuck. You’ve spent an appropriate amount of time working on the problem yourself, trying to get yourself unstuck, and things aren’t working. You’re throwing shit against the wall to see what sticks, confident that there’s some piece of information you’re missing that will make this whole thing make sense. How do you get the right help from the right person? Sure, you can post in your team’s Slack channel and say, “Does anyone know something about {name of system}?”, but that’s unlikely to get you the result you want.

Instead, frame your question in the following way:

I’m trying to __________. I’m looking at {link to documentation/code}, and based on that, I think that the solution should be {description of what you’re doing, maybe even a link to a draft PR}.

However, when I do that, instead of getting {expected outcome}, I see {error message}. Halp?

There are a few reasons why this is good:

  1. The process of writing out the question and explaining your assumptions may help you solve it yourself. What a win!
  2. If you can’t solve it yourself, you’ve provided enough context for your colleagues to easily jump in, ask questions, and guide you toward a solution.
  3. This effort demonstrates to your colleagues that you have put in an appropriate amount of effort and aren’t asking them to do your work for you.

How to get bonus points:

  • Once you get the answer, write documentation that would have helped you solve the problem in the first place.
  • Put the question in a public channel. Other people will likely run into the same error message as you, and when they search Slack for the error, your public question will speed up their debugging.

What about small clarification questions?

Just ask them. Every team/company has random acronyms. Ask what they stand for. I guarantee you’re not the only person in that meeting who has no idea what the acronym stands for. If you still don’t understand what that acronym means, ask for clarification again. You are not in the wrong for wanting to understand what people are talking about in your presence. Chances are you aren’t the only person who doesn’t know what LFGUSWNT stands for in an engineering context (the answer is nothing, but it’s my rallying cry in life).

What if someone’s explanation doesn’t make sense to you?

The words “will you say that differently, please” are your friend. Keep saying those words and listening to their answers until you understand what they’re saying. It is the responsibility of the teacher to make sure the student understands the content. But it is the responsibility of the student to teach up and let the teacher know there’s more work to be done.

Don’t let your fear of annoying someone prevent you from getting the help you need.

Steve Huffman spoke at my bootcamp and talked about the importance of being a “noisy engineer”. He assured us that it’s the senior person’s job to tell you that you’re annoying them, not your job to protect that person from potential annoyance. This is profoundly true, and as I’ve gotten more senior, I believe in it even more than I did then.

Part of the job of senior people is to mentor and grow junior folks. When someone reaches out to me looking for help/advice/to vent, they are not a burden to me. Quite the opposite–they are giving me an opportunity to demonstrate my ability to do my job. Plus, I’m going to learn a ton from you. It’s mutually beneficial.

Navigating Imposter Syndrome:

Particularly as a Junior dev, you are probably not getting hired because you're the best engineer who applied for the role. You are getting hired because the team has decided that you have a strong foundation and a ton of potential to grow with time and investment. That’s not an insult. You will likely take longer than someone else on your team to accomplish a task. That’s OK! That’s expected.

You’re not dumb. You’re not incapable. You’re just new!

Stop comparing yourself to other people, and compare yourself to yourself, three months ago. Are you more self-sufficient? Are you taking on bigger tasks? Are you asking better questions? Do tasks that used to take you two weeks now take you two days? If so, great. You’re doing your job. You are good enough. Full stop.

Important note: making mistakes is a part of the job. You will break systems. You will ship buggy code. All of that is normal (see r/shittychangelog for evidence). None of this makes you a bad or unworthy engineer. It makes you human. Just make sure to make new mistakes as you evolve.

How to make the most of your 1:1s

Your manager can be your biggest advocate, and they can’t help you if they don’t know what’s going on. They can only know what’s going on if you tell them. Here are some tips/tricks for 1:1s that I’ve found useful:

  • Frame your accomplishments in terms of growth: “Three months ago, it took me [timeframe] to do [task]. This sprint, I noticed that a ticket asking me to do [that task] only took me [shorter timeframe].” Even if the task seems small and insignificant in the grand scheme of things, growth is growth and deserves to be acknowledged.
    • When you’re having conversations with your manager asking for more money/a bigger title, you need to convince them that you are contributing more to the business than you were when your salary was set. This framing is an incredibly tangible way to show that you are more valuable to the business (and should be compensated accordingly).
  • If something is not on track, don’t pretend like it is on track. Give updates early and often, especially if you’re blocked waiting on someone else. If your manager can help unblock you, tell them how (ex: I submitted a ticket with [other team]. Can you please help escalate it?)

Demonstrate growth and independence by asking people their advice on your proposed solution instead of asking them to give a proposal.

You’ve been tasked with some technical problem–build some system. Maybe you have some high level ideas for how to approach the problem, but there are significant tradeoffs. You may assume by default that your idea isn’t a good one. Thus, the obvious thing to do is to reach out to someone more senior than you and say, “I’m trying to solve this problem. What should I do?”.

You could do that, but that’s not the best option.

Instead, try, “I’m trying to solve this problem. Here are two options I can think of to solve it. I think we should do [option] because [justification].” In the ensuing conversation, your tech lead may agree with you. Great! Take that as a confidence boost that your gut aligns with other people. They may disagree (or even have an entire alternative you hadn’t considered). This is also good! It can lead to a fruitful conversation where you can really hash out the idea and make sure the best decision gets made. You took the mental load off of your teammates’ plate and helped the team! Go you!

To conclude:

Ask lots of questions, be proactive, advocate for yourself, keep growing, and be a good teammate. You’ll do just fine.

submitted by /u/sassyshalimar
Building Reddit Ep. 13: Growing Healthy International Communities
3 weeks ago

Hello Reddit!

I’m happy to announce the thirteenth episode of the Building Reddit podcast. In this episode I spoke with several Country Growth Leads about the unique approaches they take to grow the user base outside of the US. Hope you enjoy it! Let us know in the comments.

You can listen on all major podcast platforms: Apple Podcasts, Spotify, Google Podcasts, and more!

Building Reddit Ep. 13: Growing Healthy International Communities

Watch on Youtube

Communities form the backbone of Reddit. From r/football to r/AskReddit, people come from all over the world to take part in conversations. While Reddit is a US-based company, the platform has a growing international user base that has unique interests and needs.

In this episode, you’ll hear from Country Growth Leads for France, Germany, The United Kingdom, and India. They’ll dive into what makes their markets unique, how they’ve facilitated growth in those markets, and the memes that keep those users coming back to Reddit.

Check out all the open positions at Reddit on our careers site: https://www.redditinc.com/careers

submitted by /u/unavailable4coffee
How to Decide…Fast
3 weeks, 1 day ago

Written by Mirela Spasova, Eng Manager, Collectible Avatars

Congratulations! You are a decision-maker for a major technical project. You get to decide which features get prioritized on the roadmap - an exciting but challenging responsibility. What you decide to build can make or break the project’s success. So how would you navigate this responsibility?

The Basics

Decision making is the process of committing to a single option from many possibilities.

For your weekend trip, you might consider dozens of destinations, but you get to fly to one. For your roadmap planning, you might collect hundreds of product ideas, but you get to build one.

In theory, you can streamline any type of decision making with a simple process:

  1. Define your goal.
  2. Gather relevant options to pick from.
  3. Evaluate each option for impact, costs, risks, feasibility and other considerations.
  4. Decide on the option that maximizes the outcome towards your goal.

In practice, decision-making is filled with uncertainties. Incomplete information, cognitive biases, or inaccurate predictions can lead to suboptimal decisions and risk your team’s goals. Hence, critical decisions often require thorough analysis and careful consideration.

Often, we have to decide from a multitude of ambiguous options

For example, my team meticulously planned how to introduce Collectible Avatars to millions of Redditors. With only one chance at a first impression, we aimed for the Avatar artwork to resonate with the largest number of users. We invested time to analyze users’ historical preferences, and prototyped a number of options with our creative team.

Collectible Avatars Initial Claim Screen

What happens when time isn't on your side? What if you have to decide in days, hours or even minutes?

Why the Rush?

Productivity Improvements
Any planning involves multiple decisions, which are also interdependent. You cannot book a hotel before choosing your trip destination. You cannot pick a specific feature before deciding which product to build. Even with plenty of lead time, it is crucial to maintain a steady decision making pace. One delayed decision can block your project’s progress.

Imagine each decision is a car on the road. You might have hundreds of them and limited resources (e.g. meetin

For our "Collectible Avatars" storefront, we had to make hundreds of decisions around the shop experience, purchase methods, and scale limits before jumping into technical designs. Often, we had to timebox important decisions to avoid blocking the engineering team.
Non-blocking decisions can still consume resources such as meeting time, data science hours, or your team’s async attention. Ever been in a lengthy meeting with numerous stakeholders that ends with "let's discuss this as a follow up"? If this becomes a routine, speeding up decision making can save your team dozens of hours per month.

Unexpected Challenges

Often, project progress is not linear. You might have to address an unforeseen challenge or pivot based on new experiment data. Quick decision making can help you get back on track ASAP.
Late last year, our project was behind on one of its annual goals. An opportunity arose to build a “Reddit Recap” (personalized yearly review) integration with “Collectible Avatars”. With just three weeks to ship, we quickly assessed the impact, chose a design solution, and picked other features to cut. Decisions had to be made within days to capture the opportunity.
Our fastest decisions were during an unexpected bot attack at one of our launches. The traffic surged 100x, causing widespread failures. We had to make a split-second call to stop the launch, followed by a series of both careful and rapid decisions to relaunch within hours.

How to Speed up?

The secret to fast decision-making is preparation. Not every decision has to start from scratch. On your third weekend trip, you already know how to pick a hotel and what to pack. For your roadmap planning, you are faced with a series of decisions which share the same goal, information context, and stakeholders. Can you foster a repeatable process that optimizes your decision making?

I encourage you to review your current process and identify areas of improvement. Below are several insights based on my team’s experience:


Simply imagine roadmap planning as a tree of decisions with your goal serving as the root from which branches out a network of paths representing progressively more detailed decisions. Starting from the goal, sequence decisions layer by layer to avoid backtracking.
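For readers who think in code, "layer by layer" is a breadth-first walk of the decision tree. The sketch below is only an illustration: the tree contents and the function name are made up, and a real roadmap would prune each layer to the chosen options before descending.

```python
from collections import deque
from typing import Dict, List


def decision_order(tree: Dict[str, List[str]], root: str) -> List[str]:
    """Return decisions in breadth-first order: goal, then strategies, then details."""
    order: List[str] = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        # Children are only queued after their parent, so no decision is
        # visited before the broader decision it depends on.
        queue.extend(tree.get(node, []))
    return order


# Hypothetical roadmap: one goal, two strategies, three features.
roadmap = {
    "goal": ["strategy A", "strategy B"],
    "strategy A": ["feature 1", "feature 2"],
    "strategy B": ["feature 3"],
}
print(decision_order(roadmap, "goal"))
```

Settling every strategy-level decision before any feature-level one is what avoids the backtracking described above.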

On occasion, our team starts planning a project with a brainstorming session, where we generate a lot of feature ideas. Deciding between them can be difficult without committing to a strategic direction first. We often find ourselves in disagreement as each team member is prioritizing based on their individual idea of the strategy.

Chosen options are in red


Understand the guardrails of your options before you start the planning process. If certain options are infeasible or off-limits, there is no reason to consider them. As our team works on monetization projects, we often incorporate legal and financial limitations upfront.


Similarly, quickly decide on inconsequential or obvious decisions. It’s easy to spend precious meeting time prioritizing nice-to-have copy changes or triaging a P2 bug. Instead, make a quick call and leave extra time for critical decisions.

Balance Delegation and Input

As a decision maker, you are accountable for decisions without having to make them all. Delegate and parallelize sets of decisions into sub-teams. For efficient delegation, ensure each sub-team can make decisions relatively independently from each other.

You decide to build both strategy 2 and 3. Sub-team 1 decides the details for strategy 2, and sub-team 2 decides the details for strategy 3.

As a caveat, delegation runs the risk of information silos, where sub-teams can overlook important considerations from the rest of the group. In such cases, decisions might be inadequate or have to be redone.

While our team distributes decisions in sub-groups, we also give an opportunity for async feedback from a larger group (teammates, partners, stakeholders). Then, major questions and disagreements are discussed in meetings. Although this approach may initially decelerate decisions, it eventually helps sub-teams develop broader awareness and make more informed decisions aligned with the larger group. Balancing autonomy with collective inputs has often helped us anticipate critical considerations from our legal, finance, and community support partners.

Anticipate Risks

It’s rare for a project to go all according to plan. To make good decisions on the fly, our team conducts pre-mortems for potential risks that can cause the project to fail. Those can be anything from undercosting a feature, to being blocked by a dependency, to facing a fraud case. We decide on the mitigation step for each probable failure risk upfront, similar to a runbook in case of an incident.

Trust Your Gut

No matter how much you prepare, real-life chaos will ensue and demand fast, intuition-based decisions with limited information. You can explore ways to strengthen your intuitive decision-making if you feel unprepared.


Effective decision-making is critical for any project's success. Invest in a robust decision-making process to speed up decisions without significantly compromising quality. Choose a framework that suits your needs and refine it over time. Feel free to share your thoughts in the comments.

submitted by /u/SussexPondPudding
From Chaos to Cohesion: Reddit's Design System Story
4 weeks ago

Written By Mike Price, Engineering Manager, UI Platform

When I joined Reddit as an engineering manager three years ago, I had never heard of a design system. Today, RPL (Reddit Product Language), our design system, is live across all platforms and drives Reddit's most important and complicated surfaces.

This article will explore how we got from point A to point B.

Chapter 1: The Catalyst - Igniting Reddit's Design System Journey

The UI Platform team didn't start its journey as a team focused on design systems; we began with a high-level mission to "Improve the quality of the app." We initiated various projects toward this goal and shipped several features, with varying degrees of success. However, one thing remained consistent across all our work:

It was challenging to make UI changes at Reddit. To illustrate this, let's focus on a simple project we embarked on: changing our buttons from rounded rectangles to fully rounded ones.


In a perfect world this would be a simple code change. However, at Reddit in 2020, it meant repeating the same code change 50 times, weeks of manual testing, auditing, refactoring, and frustration. We lacked consistency in how we built UI, and we had no single source of truth. As a result, even seemingly straightforward changes like this one turned into weeks of work and low-confidence releases.

It was at this point that we decided to pivot toward design systems. We realized that for Reddit to have a best-in-class UI/UX, every team at Reddit needed to build best-in-class UI/UX. We could be the team to enable that transformation.

Chapter 2: The Sell - Gaining Support for Reddit's Design System Initiative

While design systems are gaining popularity, they have yet to attain the same level of industry-wide standardization as automated testing, version control, and code reviews. In 2020, Reddit's engineering and design teams experienced rapid growth, presenting a challenge in maintaining consistency across user interfaces and user experiences.

Recognizing that a design system represents a long-term investment with a significant upfront cost before realizing its benefits, we observed distinct responses based on individuals' prior experiences. Those who had worked in established companies with sophisticated design systems required little persuasion, having firsthand experience of the impact such systems can deliver. They readily supported our initiative. However, individuals from smaller or less design-driven companies initially harbored skepticism and required additional persuasion. There is no shortage of articles extolling the value of design systems. Our challenge was to tailor our message to the right audience at the right time.

For engineering leaders, we emphasized the value of reusable components and the importance of investing in robust automated testing for a select set of UI components. We highlighted the added confidence in making significant changes and the efficiency of resolving issues in one central location, with those changes automatically propagating across the entire application.

For design leaders, we underscored the value of achieving a cohesive design experience and the opportunity to elevate the entire design organization. We presented the design system as a means to align the design team around a unified vision, ultimately expediting future design iterations while reinforcing our branding.

For product leaders, we pitched the potential reduction in cycle time for feature development. With the design system in place, designers and engineers could redirect their efforts towards crafting more extensive user experiences, without the need to invest significant time in fine-tuning individual UI elements.

Ultimately, our efforts garnered the support and resources required to build the MVP of the design system, which we affectionately named RPL 1.0.

Chapter 3: Design System Life Cycle


The development process of a design system can be likened to a product life cycle. At each stage of the life cycle, a different strategy and set of success criteria are required. Additionally, RPL encompasses iOS, Android, and Web, each presenting its unique set of challenges.

The iOS app was well-established but had several different ways to build UI: UIKit, Texture, SwiftUI, React Native, and more. The Android app had a unified framework but lacked consistent architecture and struggled to create responsive UI without reinventing the wheel and writing overly complex code. Finally, the web space was at the beginning of a ground-up rebuild.

We first spent time investigating the technical side and answering the question, “What framework do we use to build UI components?” A deep dive into each platform can be found below:

Building Reddit’s Design System on iOS

Building Reddit’s design system for Android with Jetpack Compose

Web: Coming Soon!

In addition to rolling out a brand new set of UI components, we also signed up to unify the UI framework and architecture across Reddit, which was necessary but certainly complicated our problem space.




How many components should a design system have before its release? Certainly more than five, maybe more than ten? Is fifteen too many?

At the outset of development, we didn't know either. We conducted an audit of Reddit's core user flows and recorded which components were used to build those experiences. We found that there was a core set of around fifteen components that could be used to construct 90% of the experiences across the apps. This included low-level components like Buttons, Tabs, Text Fields, Anchors, and a couple of higher-order components like dialogs and bottom sheets.
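An audit like this boils down to counting component usage across flows and checking how many flows a candidate core set can build. The sketch below assumes a simple mapping of flow names to component names; the data and function names are hypothetical, not the tooling we actually used.

```python
from collections import Counter
from typing import Dict, Iterable, List


def most_used(flows: Dict[str, List[str]], n: int) -> List[str]:
    """Return the n components that appear in the most flows."""
    counts = Counter(c for components in flows.values() for c in set(components))
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:n]]


def coverage(flows: Dict[str, List[str]], core: Iterable[str]) -> float:
    """Fraction of flows that can be built entirely from the core components."""
    core_set = set(core)
    if not flows:
        return 0.0
    covered = sum(1 for components in flows.values() if set(components) <= core_set)
    return covered / len(flows)
```

Increasing `n` until `coverage` crosses a target such as 0.9 gives a data-backed answer to the "how many components" question.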

One of the most challenging problems to solve initially was deciding what these new components should look like. Should they mirror the existing UI and be streamlined for incremental adoption, or should they evolve the UI and potentially create seams between new and legacy flows?

There is no one-size-fits-all solution. On the web side, we had no constraints from legacy UI, so we could evolve as aggressively as we wanted. On iOS and Android, engineering teams were rightly hesitant to merge new technologies with vastly different designs. However, the goal of the design system was to deliver a consistent UI experience, so we also aimed to keep web from diverging too much from mobile. This meant attacking this problem component by component and finding the right balance, although we didn't always get it right on the first attempt.

So, we had our technologies selected, a solid roadmap of components, and two quarters of dedicated development. We built the initial set of 15 components on each platform and were ready to introduce them to the company.



Before announcing the 1.0 launch, we knew we needed to partner with a feature team to gain early adoption of the system and work out any kinks. Our first partnership was with the moderation team on a feature with the right level of complexity. It was complex enough to stress the breadth of the system but not so complex that being the first adopter of RPL would introduce unnecessary risk.

We were careful and explicit about selecting that first feature to partner with. What really worked in our favor was that the engineers working on those features were eager to embrace new technologies, patient, and incredibly collaborative. They became the early adopters and evangelists of RPL, playing a critical role in the early success of the design system.

Once we had a couple of successful partnerships under our belt, we announced to the company that the design system was ready for adoption.




We found early success partnering with teams to build small to medium complexity features using RPL. However, the real challenge was to power the most complex and critical surface at Reddit: the Feed. Rebuilding the Feed would be a complex and risky endeavor, requiring alignment and coordination between several orgs at Reddit. Around this time, conversations among engineering leaders began about packaging a series of technical decisions into a single concept we'd call: Core Stack. This major investment in Reddit's foundation unified RPL, SliceKit, Compose, MVVM, and several other technologies and decisions into a single vision that everyone could align on. Check out this blog post on Core Stack to learn more. With this unification came the investment to fund a team to rebuild our aging Feed code on this new tech stack.

As RPL gained traction, the number of customers we were serving across Reddit also grew. Providing the same level of support to every team building features with RPL that we had given to the first early adopters became impossible. We scaled in two ways: headcount and processes. The design system team started with 5 people (1 engineering manager, 3 engineers, 1 designer) and now has grown to 18 (1 engineering manager, 10 engineers, 5 designers, 1 product manager, 1 technical program manager). During this time, the company also grew 2-3 times, and we kept up with this growth by investing heavily in scalable processes and systems. We needed to serve approximately 25 teams at Reddit across 3 platforms and deliver component updates before their engineers started writing code. To achieve this, we needed our internal processes to be bulletproof. In addition to working with these teams to enhance processes across engineering and design, we continually learn from our mistakes and identify weak links for improvement.

The areas we have invested in to enable this scaling have been:

  • Documentation
  • Educational meetings
  • Snapshot and unit testing
  • Code and Figma Linting
  • Jira automations
  • Gallery apps
  • UX review process



Today, we are approaching the tail end of the growth stage and entering the beginning of the maturity stage. We are building far fewer new components and spending much more time iterating on existing ones. We no longer need to explain what RPL is; instead, we're asking how we can make RPL better. We're expanding the scope of our focus to include accessibility and larger, more complex pieces of horizontal UI. Design systems at Reddit are in a great place, but there is plenty more work to do, and I believe we are just scratching the surface of the value it can provide. The true goal of our team is to achieve the best-in-class UI/UX across all platforms at Reddit, and RPL is a tool we can use to get there.

Chapter 4: Today I Learned

This project has been a constant learning experience. Here are the top three lessons I found most impactful.

  1. Everything is your fault

It is easy to get frustrated working on design systems. Picture this: your team has spent weeks building a button component. You have investigated all the best practices, you have provided countless configuration options, it has a gauntlet of automated tests backing it, and it is consistent across all platforms. By all accounts, it's a masterpiece.

Then you see the pull request “I needed a button in this specific shade of red so I built my own version”.

  • Why didn’t THEY read the documentation?
  • Why didn't THEY reach out and ask if we could add support for what they needed?
  • Why didn’t THEY do it right?

This is a pretty natural response, but it only leads to more frustration. We have tried to establish a culture and habit of looking inwards when problems arise: we never blame the consumer of the design system; we blame ourselves.

  • What could we do to make the documentation more discoverable?
  • How can we communicate more clearly that teams can request iterations from us?
  • What could we have done to prevent this?

  2. A Good Plan, Violently Executed Now, Is Better Than a Perfect Plan Next Week

This applies to building UI components but also building processes. In the early stages, rather than building the component that can satisfy all of today's cases and all of tomorrow's cases, build the component that works for today that can easily evolve for tomorrow.

This also applies to processes. The development cycle of how a component flows from design to engineering will be complicated. The approach we have found the most success with is to start simple and iterate aggressively: add new processes when we find new problems, but also take a critical look at existing processes and delete them when they become stale or no longer serve a purpose.

  3. Building Bridges, Not Walls: Collaboration is Key

Introducing a design system marks a significant shift in the way we approach feature development. In the pre-design system era, each team could optimize for their specific vertical slice of the product. However, a design system compels every team to adopt a holistic perspective on the user experience. This shift often necessitates compromises, as we trade some individual flexibility for a more consistent product experience. Adjusting to this change in thinking can bring about friction.

As the design system team continues to grow alongside Reddit, we actively seek opportunities each quarter to foster close partnerships with teams, allowing us to take a more hands-on approach and demonstrate the true potential of the design system. When a team has a successful experience collaborating with RPL, they often become enthusiastic evangelists, keeping design systems at the forefront of their minds for future projects. This transformation from skepticism to advocacy underscores the importance of building bridges and converting potential adversaries into allies within the organization.

Chapter 5: Go build a design system

To the uninitiated, a design system is a component library with good documentation. Three years into my journey at Reddit, it’s obvious they are much more than that. Design systems are transformative tools capable of aligning entire companies around a common vision. Design systems raise the minimum bar of quality and serve as repositories of best practices.

In essence, they're not just tools; they're catalysts for excellence. So, my parting advice is simple: if you haven't already, consider building one at your company. You won't be disappointed; design systems truly kick ass.

submitted by /u/beautifulboy11
Mobile Tech Talk Slides from Droidcon and Mobile DevOps Summit
1 month ago

In September, Drew Heavner, Geoff Hackett, Fano Yong and Laurie Darcey presented several Android tech talks at Droidcon NYC. These talks covered a variety of techniques we’ve used to modernize the Reddit apps and improve the Android developer experience, adopting Compose and building better dependency injection patterns with Anvil. We also shared our Compose adoption story on the Android Developers blog and YouTube channel!

In October, Vlad Zhluktsionak and Laurie Darcey presented on mobile release engineering at Mobile DevOps Summit. This talk focused on how we’ve improved mobile app stability through better release processes, from adopting trunk-based development patterns to having staged deployments.

We did four talks and an Android Developer story in total - check them out below!

Reddit Developer Story on the Android Developers Blog: Adopting Compose

Android Developer Story: Adopting Compose @ Reddit

ABSTRACT: It's important for the Reddit engineering team to have a modern tech stack because it enables them to move faster and have fewer bugs. Laurie Darcey, Senior Engineering Manager and Eric Kuck, Principal Engineer share the story of how Reddit adopted Jetpack Compose for their design system and across many features. Jetpack Compose provided the team with additional flexibility, reduced code duplication, and allowed them to seamlessly implement their brand across the app. The Reddit team also utilized Compose to create animations, and they found it more fun and easier to use than other solutions.

Video Link / Android Developers Blog

Dive deeper into Reddit’s Compose Adoption in related RedditEng posts, including:


Plugging into Anvil and Powering Up Your Dependency Injection Presentation


ABSTRACT: Writing Dagger code can produce cumbersome boilerplate and Anvil helps to reduce some of it, but isn’t a magic solution.

Video Link / Slide Link

Dive deeper into Reddit’s Anvil adoption in related RedditEng posts, including:


How We Learned to Stop Worrying and Embrace DevX Presentation


ABSTRACT: Successful platform teams are often caretakers of the developer experience and productivity. Explore some of the ways that the Reddit platform team has evolved its tooling and processes over time, and how we turned a platform with multi-hour build times into a hive of modest efficiency.

Video Link / Slide Link

Dive deeper into Reddit’s Mobile Developer Experience Improvements in related RedditEng posts, including:


Adopting Jetpack Compose @ Scale Presentation


ABSTRACT: Over the last couple years, thousands of apps have embraced Jetpack Compose for building their Android apps. While every company is using the same library, the approach they've taken in adopting it is really different on each team.

Video Link

Dive deeper into Reddit’s Compose Adoption in related RedditEng posts, including:


Case Study: Mobile Release Engineering @ Reddit Presentation


ABSTRACT: Reddit releases their Android and iOS apps weekly, one of the fastest deployment cadences in mobile. In the past year, we've harnessed this power to improve app quality and crash rates, iterate quickly to improve release stability and observability, and introduced increasingly automated processes to keep our releases and our engineering teams efficient, effective, and on-time (most of the time). In this talk, you'll hear about what has worked, what challenges we've faced, and learn how you can help your organization evolve its release processes successfully over time, as you scale.

Video Link / Slide Link


Dive deeper into these topics in related RedditEng posts, including:

Compose Adoption

Core Stack, Modularization & Anvil

submitted by /u/nhandlerOfThings
Principals Onsite
1 month ago

Written by Amaya Booker

Chicago River | Source: Original photo

We recently held our first Principal Engineering onsite in Chicago in September: an internal conference for senior folks in Product and Tech to come together and talk about strategy and community. Our primary focus was on building connectivity across team functions and connecting our most senior engineers with each other.

We wanted to build better connections within this virtual team of similar individuals in different verticals, share context, and generate actionable next steps to continue to elevate the responsibilities and outputs of our Principal Engineering community.

As a new hire at Reddit, I found the Principals Onsite an amazing opportunity to meet senior ICs and VPs all at the same time and get my finger on the pulse of their thinking. It was also nice to get back into the swing of work travel.

Field trip to Mindworks | Source: Original photo

What is a Principal Engineer?

For a long time the tech industry rewarded highly technical people with success paths that only included management. At Reddit we believe that not all senior technical staff are bound for managerial roles and we provide a parallel career path for engineers wishing to stay as individual contributors.

Principal Engineers carry expert levels of technical skill (coding, debugging, architecture), but they also carry organisational skills like long term planning, team leadership and career coaching, along with scope and strategic thinking equivalent to a Director. A high performing Principal can anticipate the needs of a (sometimes large) organisation and navigate ambiguity with ease. They translate “what the business needs” into technical outcomes and solutions, setting technical direction across the organisation and helping align the company to be ready for challenges into the future (e.g. 2+ years).

Our onsite was focused on bringing together all the Principal Engineers across a range of products and disciplines to explore their role and influence in Reddit engineering culture.

What did we do?

When senior people get together we want the time to be as productive as possible so we requested pre-work be done by attendees to think about:

  • Their role expectations and engineering ladder definition
  • Cross functional and leadership relationships
  • How to optimise engineering culture at Reddit

Key sessions at the summit dived deep into these topics, each facilitated by a Principal Engineer as a round table leadership conversation. Of course, we also took some time to socialise together with a group dinner and a trip to MindWorks, the behavioural science research space.

Organisational Topics

Building a highly productive engineering team requires thinking about how we work together. Since this was our first event coming together as a group we spent time thinking about how to empower our Principal Engineers to work together and build community. We want Principal Engineers to come together and solve highly complex and cross functional problems, building towards high levels of productivity, creativity, and innovation. This allows us to get exponential impact as subject matter experts solve problems together and fan out work to their teams.

Reddit as an engineering community has grown significantly in recent years, adding complexity to communications. This made it a good time to think about how the role of Principal needs to evolve for our current employee population. This meant discussions on role expectations between Principal Engineers and their Director equivalents in people management and finding ways to tightly couple Principal ICs into Technical Leadership discussions.

We also want to think about the future and how to build the engineering team we want to work for:

  • Building a ramp of promising and diverse candidates who could be invested in to make the next principals
  • Retaining and motivating senior ICs and aligning incentives with the needs of the company. Ensuring high-performers always have enough opportunity to excel and grow
  • Inspired by Will Larson and Meta engineering culture, the team discussed role archetypes common to principal engineers and how technical leadership shows up for different people

Technical topics

How we work together is important, but as technical people we want to spend more of our time thinking about how to build the best Reddit we can, meeting the needs of our varied users with an efficient development experience and cycle time.

  • Identifying ‘best of breed’ engineering workflows across the company, especially standardising on Requirements and Design analysis
  • Developer experience topics such as optimized development infrastructure, code freeze and deployment policies
  • Identifying gaps in our production support staffing, building teams out of virtual teams and focusing on areas that have typically lagged
  • Building reusable engineering assets like code libraries and templates to improve developer velocity

The team has self-organised into working groups to focus on improvement opportunities, and these groups are working independently on activities to build a stronger Reddit engineering team.

What’s next?

Reddit is a remote-first organisation, so events like this are critical to our ability to build community and focus on engineering culture. We’ll be concentrating on biannual summits of this nature to build strategic vision and community.

We identified a number of next steps to ensure that the time we took together was valuable, including creating working groups focussed on improving engineering processes such as technical design, technical documentation and programming language style guides.

Interested in hearing more? Check out careers at Reddit.

Want to know more about the senior IC career path? I recommend the following books: The Staff Engineer’s Path and Staff Engineer: Leadership Beyond the Management Track.

submitted by /u/Pr00fPuddin
Reddit Conversion Lift and Lift Study Framework
1 month, 1 week ago

Written by Yimin Wu and Ellis Miranda.

At the end of May 2023, Reddit launched the Reddit Conversion Lift (RCL) product to General Availability. RCL is the Reddit first-party measurement solution that enables marketers to evaluate the incremental impact of Reddit ads on driving conversions (a conversion is an action that our advertisers define as valuable to their business, such as an online purchase or a subscription to their service). It measures the number of conversions that were caused by exposure to ads on Reddit.

Along with the development of the RCL product, we also developed a generic Lift Study Framework that supports both Reddit Conversion Lift and Reddit Brand Lift. Reddit Brand Lift (RBL) is the Reddit first-party measurement solution that helps advertisers understand the effectiveness of their ads in influencing brand awareness, perception, and action intent. By analyzing the results of a Reddit Brand Lift study across different demographic groups, advertisers can identify which groups are most likely to engage with their brand and tailor marketing efforts accordingly. Reddit Brand Lift uses experimental design and statistical testing to scientifically prove Reddit’s impact.

We will focus on introducing the engineering details about RCL and the Lift Study Framework in this blog post. Please read this RBL Blog Post to learn more about RBL. We will cover the analytics jobs that measure conversion lift in a future blog.

How RCL Works

The following picture depicts how RCL works:

How RCL works

RCL leverages Reddit’s Experimentation platform to create A/B testing experiments and manage bucketing users into Test and Control groups. Each RCL study targets specific pieces of ad content, which are tied to the experiment. Additional metadata about the participating lift study ads are specified in each RCL experiment. We extended the ads auction logic of Reddit’s in-house Ad Server to handle lift studies as follows:

  1. For each ad request, we check whether the user’s bucket is Test or Control for each active lift study.
  2. For an ad winning an auction, we check if it belongs to any active lift studies.
  3. If the ad does belong to an active lift study, we check the bucketing of that lift study for the current user:
    1. If the user belongs to the Test group, the lift ad will be returned as usual.
    2. If the user belongs to the Control group, the lift ad will be replaced by its runner-up. We then log the event of replacing the lift ad with its runner-up; this is called a Counterfactual Event.
  4. Last but not least, our RCL Analytics jobs will measure the conversion lift of our logged data, comparing the conversion performance of the Test group to that of the Control group (via the Counterfactual Event Log).
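
As a rough illustration of steps 1 to 3 above (not Reddit's actual Ad Server code), the replacement rule can be sketched in a few lines of Python. The `LiftStudy`, `AuctionResult` and `serve_ad` names, the set-based bucketing, and the shape of the logged event are all hypothetical; the real system gets bucketing from the Experimentation platform.

```python
from dataclasses import dataclass


@dataclass
class LiftStudy:
    study_id: str
    ad_ids: set          # ads participating in this lift study
    control_users: set   # hypothetical bucketing: listed users are Control, all others Test

    def bucket(self, user_id):
        return "control" if user_id in self.control_users else "test"


@dataclass
class AuctionResult:
    winner: str      # ad id that won the auction
    runner_up: str   # next-best ad id


def serve_ad(user_id, auction, studies, counterfactual_log):
    """Return the ad id to serve, applying the lift-study replacement rule."""
    for study in studies:
        if auction.winner not in study.ad_ids:
            continue  # the winning ad is not part of this study
        if study.bucket(user_id) == "control":
            # Control group: withhold the lift ad, serve the runner-up,
            # and record a Counterfactual Event.
            counterfactual_log.append({
                "study": study.study_id,
                "user": user_id,
                "withheld_ad": auction.winner,
                "served_ad": auction.runner_up,
            })
            return auction.runner_up
        return auction.winner  # Test group: the lift ad is returned as usual
    return auction.winner      # ad belongs to no active lift study
```

The sketch ignores cases a real ad server would need to handle, such as a runner-up that itself belongs to a lift study.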

Lift Study UI

Feasibility Calculator

Feasibility Calculator

Calculation names and key event labels have been removed for advertisers’ privacy.

The Feasibility Calculator is a tool designed to assist Admins (i.e., ad account administrators) in determining whether advertisers are “feasible” for a Lift study. Based on a variety of factors about an advertiser’s spend and performance history, Admins can quickly determine whether an advertiser would have sufficient volumes of data to achieve a statistically powered study or whether a study would be unpowered even with increased advertising reach.

There were two goals for building this tool:

  • First, reduce the number of surfaces that an Admin had to touch to observe this result by containing it within one designated space.
  • Second, improve the speed and consistency of the calculations, by normalizing their execution within Reddit’s stack.


We centralized all the management in a single service - the Lift Study Management Service - built on our in-house open-source Go service framework called baseplate.go. Requests coming from the UI are validated, verified, and stored in the service’s local database before the corresponding action is taken. For feasibility calculations, the request is translated into a request to GCP to execute a feasibility calculation and store the results in BigQuery.

Admins are able to define the parameters of the feasibility calculation, submit it for calculation, check on the status of the computation, and retrieve the results, all from the UI.

Experiment Setup UI


The Experiment Setup tool was also built with a specific goal in mind: reduce errors during experiment setup. Reddit supports a wide set of options for running experiments, but the majority are not relevant to Conversion Lift experiments. By reducing the number of options to those seen above, we reduce potential mistakes.

This tool also reduces the number of surfaces that Admins have to touch to execute Conversion Lift experiments: the Experiment Setup tool is built right alongside the Feasibility Calculator. Admins can create experiments directly from the results of a feasibility calculation, tying together the intention and context that led to the study’s creation. This information is displayed on the right-hand side modal.

A Generic Lift Study Framework


While we’ve discussed the flow of RCL, the Lift Study Framework was developed to be generic to support both RCL and RBL in the following areas:

  • The experimentation platform can be used by any existing lift study product and is extensible to new types of lift studies. The core functionality of this type of experimentation is not restricted to any single type of study.
  • The Admin UI tools for managing lift studies are generic. This UI helps Admins reduce toil in managing studies and streamlines the experiment creation experience. It should continue to do so as we add additional types of studies.

Next Steps

After the responses are collected, they are fed into the Analysis pipeline. For now I’ll just say that the numbers are crunched, and the lift metrics are calculated. But keep an eye out for a follow-up post that dives deeper into that process!

If this work sounds interesting and you’d like to work on the systems that power Reddit Ads, please take a look at our open roles.

submitted by /u/sassyshalimar
Come see our CorpTech team present some cool stuff at SuiteWorld
1 month, 2 weeks ago

We’re excited to announce that some of our Corporate Technology (CorpTech) team members will be attending NetSuite’s conference, SuiteWorld, at Caesars Forum in Las Vegas during the week of October 16th! We’ll be presenting on two topics at SuiteWorld to share our perspectives and best practices:

  • “Leverage the SuiteCloud Developer Framework (SDF) to Enable Flexible, Scalable, and Efficient Software Lifecycle Management Processes” on Tuesday, Oct 17, 2023 at 1:30 PM - 2:30 PM PDT
    • Learn more about the process of incorporating DevOps principles into your NetSuite development lifecycle by building a scalable Node.js-based NetSuite project. In this presentation, we explore the tools that can help make auditable change management, defined code standards, and automated deployments a reality.
  • “What They Learned: Building a World-Class Controls Environment” on Tuesday, Oct 17, 2023 at 3:00 PM - 4:00 PM PDT
    • Learn more about implementing controls designed to improve the close, segregation of duties, and reporting within NetSuite. In this presentation, we explore bringing automation to controls and share tips on identifying key controls, coordinating with auditors, avoiding bottlenecks, and evaluating what needs to be customized in NetSuite versus out of the box capabilities.

If you are attending SuiteWorld, please join us at these sessions!

submitted by /u/SussexPondPudding
Improving Efficiency Of Goku Time Series Database at Pinterest (Part — 1)
6 days, 12 hours ago

Monil Mukesh Sanghavi, Kapil Bajaj, Ming-May Hu, Xiao Li and Zhenxiao Luo


At Pinterest, one of the pillars of the observability stack provides internal engineering teams (our users) the opportunity to monitor their services using metrics data and set up alerting on it. Goku is our in-house time series database providing cost efficient and low latency storage for metrics data. Underneath, Goku is not a single cluster but a collection of sub-service components including:

  • Goku Short Term (in-memory storage for the last 24 hours of data and referred to as GokuS)
  • Goku Long Term (SSD and HDD based storage for older data and referred to as GokuL)
  • Goku Compactor (time series data aggregation and conversion engine)
  • Goku Root (smart query routing)

You can read more about these components in the blog posts on GokuS Storage, GokuL (long term) storage, and Cost Savings on Goku, but a lot has changed in Goku since those were written. We have implemented multiple features that increased the efficiency of Goku and improved the user experience. In this 3 part blog post series, we will cover the efficiency improvements in 3 major aspects:

  1. Improving recovery time of both GokuS and GokuL (this is the total time a single host or cluster in Goku takes to come up and start serving time series queries)
  2. Improving query experience in Goku by lowering latencies of expensive and high cardinality queries
  3. Reducing the overall cost of Goku in Pinterest

We’ll also share some learnings and takeaways from using Goku for storing metrics at Pinterest.

In the first blog, we will share a short summary on the GokuS and GokuL architecture, data format for Goku Long Term, and how we improved the bootstrap time for our storage and serving components.

Initial Architecture For Goku Short Term Ingestion

Figure 1: Old push based ingestion pipeline into GokuS

At Pinterest, we have a sidecar metrics agent running on every host that logs the application system metrics time series data points (metric name, tag value pairs, timestamp and value) into dedicated kafka topics. To learn more about data points, you can look at the Time Series Data Model in this post.

From these kafka topics, an ingestion service would consume the data points and push them into the GokuS cluster(s) with a retry mechanism (via a separate kafka + small ingestion service) to handle failure.

Figure 2: Initial write path in GokuS with EFS as deep persistent store.

GokuS (storage architecture found in this link), which serves the last 1 day of metrics, stores the data in memory. Once the data becomes immutable (i.e. data older than the last 2 hours, since GokuS allows backfilling only 2 hours of old data in most cases), it stores a copy of the finalized data on AWS EFS (deep persistent storage). It also asynchronously logs the latest data points onto AWS EFS.

During recovery, it reads the finalized data from EFS into memory, then replays the logs from EFS and relies on the retry ingestion service to bring it up to speed with the latest data.

Scaling Issues

This architecture worked for us, but over time as our scale increased, we started seeing a few problems, especially during host replacements or deployments.

Longer Recovery Time

Figure 3: Initial recovery path with EFS

During recovery, a GokuS host reading from EFS hits its throughput limits very easily. We observed a throughput limit of 1024 Megabytes per second (network I/O combined) for the whole cluster/replica.

To give context, we have around 150–200 hosts on each cluster with close to 80 GB of time series data stored in memory. With a throughput of 1024 Megabytes per second, a read only workload is able to read 60 GB per minute. If multiple hosts were recovering at the same time, we observed that the phase of recovering the finalized data and logs from EFS would sometimes take almost 90 minutes to complete. This would then be followed by the ingestor service pushing the recent data points to the host. The total bootstrap/recovery time for the whole cluster would cross 2 hours. To add to the latency, the retry ingestor service reads the data points from retry kafka, calculates the target shard again, and tries to push data points.
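
The throughput arithmetic above can be checked with a small back-of-the-envelope sketch. It assumes the roughly 80 GB figure is per host and that the on-disk size equals the in-memory size, neither of which the text states explicitly; the function names are hypothetical. The result is only an ideal lower bound for a pure read workload sharing the cluster-wide limit, not a prediction of the observed recovery times.

```python
EFS_LIMIT_MB_PER_S = 1024   # cluster-wide throughput limit quoted in the text
DATA_PER_HOST_GB = 80       # assumption: the ~80 GB figure is per host


def gb_per_minute(mb_per_s):
    """Convert a MB/s throughput into GB per minute (1 GB = 1024 MB)."""
    return mb_per_s * 60 / 1024


def ideal_read_minutes(hosts_recovering,
                       data_per_host_gb=DATA_PER_HOST_GB,
                       mb_per_s=EFS_LIMIT_MB_PER_S):
    """Lower bound on the time for several hosts to read their data concurrently,
    given that they all share one cluster-wide throughput limit."""
    total_gb = hosts_recovering * data_per_host_gb
    return total_gb / gb_per_minute(mb_per_s)


print(gb_per_minute(EFS_LIMIT_MB_PER_S))   # 60.0 GB per minute
print(ideal_read_minutes(30))              # 40.0 minutes for 30 hosts at 80 GB each
```

Because the limit is shared, the read phase grows linearly with the number of hosts recovering at once, which is consistent with recovery slowing down sharply during deployments.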

Figure 4: GokuS clusters health determination and query routing the old way

Single Point Of Failure Due To Health Inference at Cluster Level

At Pinterest, we use an internal tool called Statsboard to visualize metrics stored on Goku. Statsboard continuously pushes synthetic metrics to Goku and then reads them to determine its liveness. The synthetic metrics follow the same write/ingestion path mentioned above and are queried from GokuS clusters. They are used to determine the health of the cluster and for query routing. If the statsboard client is able to read its own latest generated synthetic data point from a GokuS cluster, that means the cluster or replica is up to date with the latest data points and is fit for serving queries.

In case of a host replacement or a host running the recovery routine, the statsboard client is not able to read the latest synthetic metrics that were stored on the host. This is because the synthetic data points would be present in the retry kafka waiting to be pushed into the recovering host by the retry ingestor. Thus, the whole cluster would be inferred as unfit for reading until the host completes its recovery, because statsboard does not know which host/shard is recovering. For a cluster to be fit for querying, the client needs to successfully read all the synthetic metrics from it. This would make the other replica cluster a single point of failure in such situations.

Ingestion Model Change to Pull Based Shard Aware Ingestion

Figure 5: New pull based ingestion pipeline.

Before we go into the solution, let’s understand what a shard is in GokuS terms. As defined generally, a shard is a logical subset of data used for distributing data across distributed systems. In GokuS terms, a shard is a subset of time series data. How to know which shard a data point belongs to is explained in detail in the “Sharding and Routing” section here. To summarize, there are 2 levels of sharding in GokuS. We 1) hash the metric name (not the tag value pairs) to determine the shard group id, and 2) hash the metric name + tag value pairs to determine the shard id within the group.

A shard group is a logical collection of shards. The concept of shard groups was introduced to confine the queries for a particular metric name to a select set of hosts rather than all hosts in the cluster. For example, if we did not have the concept of a shard group, the query for a metric name (assuming no filters are provided and we have to scan all time series for the metric name) would have to look at all shards in the cluster, and the fan out would be very high. So, we changed the push model, in which the ingestor service pushed the data points into the GokuS service, to a shard aware pull model and introduced another kafka topic between the ingestor service and the GokuS storage (calling it Goku side Kafka). The ingestor determines the kafka partition of the data point after appropriate hashing and produces to that partition. A very important thing to note here is that the partition id in kafka is the same as shard group id * number of shards per shard group + shard id. Hence, for GokuS, a host which hosts shard x will consume from partition x.
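
To make the two-level sharding and the partition formula concrete, here is a minimal sketch. The hash function, the way tags are serialized, and the cluster sizes are all assumptions chosen for illustration; the text does not say which hash GokuS uses.

```python
import hashlib

NUM_SHARD_GROUPS = 4    # hypothetical cluster layout
SHARDS_PER_GROUP = 8    # hypothetical cluster layout


def _stable_hash(text):
    # Any stable hash works for the illustration; md5 is an assumption here.
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")


def shard_group_id(metric_name):
    """Level 1: hash only the metric name."""
    return _stable_hash(metric_name) % NUM_SHARD_GROUPS


def shard_id(metric_name, tags):
    """Level 2: hash the metric name plus the (sorted) tag value pairs."""
    series_key = metric_name + "|" + ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return _stable_hash(series_key) % SHARDS_PER_GROUP


def kafka_partition(metric_name, tags):
    """partition id = shard group id * shards per shard group + shard id"""
    return shard_group_id(metric_name) * SHARDS_PER_GROUP + shard_id(metric_name, tags)


p1 = kafka_partition("cpu.usage", {"host": "a", "az": "us-east-1a"})
p2 = kafka_partition("cpu.usage", {"host": "b", "az": "us-east-1a"})
# Both series share a metric name, so they land in the same shard group.
print(p1 // SHARDS_PER_GROUP == p2 // SHARDS_PER_GROUP)  # True
```

Since all series for one metric name fall inside a single shard group, a query for that metric only needs to touch the partitions (and hosts) of that group.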

Figure 6: New logging mechanism for fast recovery

We replaced EFS with a local disk (local instance storage) for persistent data and S3 as backup. Since finalized data is written every 2 hours, it was easy to add an additional call to upload to s3 after writing to local instance storage. However, since logs are written frequently, we cannot treat them the way we did with EFS, where we appended to the log file directly: that would mean writing to s3 on every append. Hence, we decided to async log into local storage and then move the log file into s3 every 20 minutes so that, at most, 20 minutes of logging would be lost per shard if the host was terminated/replaced.

In GokuS, we also recorded the kafka offset of the logged data points in the log and would commit the offset to Goku side kafka every 20 minutes after the log file was backed up in s3.
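
A simplified sketch of this logging scheme is shown below. The `ShardLog` class and its injected `upload` and `commit_offset` callables are hypothetical stand-ins for the S3 upload and the Goku side kafka offset commit; timestamps are passed in explicitly so the behaviour is easy to follow. Committing the last logged offset plus one follows the usual Kafka convention of committing the next offset to read, which is an assumption about how GokuS does it.

```python
class ShardLog:
    """Per-shard local log that is backed up periodically, after which the offset is committed."""

    BACKUP_INTERVAL_S = 20 * 60   # 20 minutes, as described in the text

    def __init__(self, upload, commit_offset, now=0.0):
        self.upload = upload                # hypothetical: stores one log file in S3
        self.commit_offset = commit_offset  # hypothetical: commits an offset to kafka
        self.entries = []                   # the current local log file
        self.last_backup = now

    def append(self, offset, data_point, now):
        # Every logged data point carries the kafka offset it was read from.
        self.entries.append((offset, data_point))
        if now - self.last_backup >= self.BACKUP_INTERVAL_S:
            self.backup(now)

    def backup(self, now):
        if self.entries:
            self.upload(list(self.entries))
            # Commit only after the upload, so a crash replays data instead of losing it.
            self.commit_offset(self.entries[-1][0] + 1)
            self.entries.clear()            # start a fresh local log file
        self.last_backup = now


uploaded, committed = [], []
log = ShardLog(uploaded.append, committed.append, now=0.0)
log.append(100, "p1", now=10.0)
log.append(101, "p2", now=1200.0)   # 20 minutes have passed: triggers a backup
print(committed)                    # [102]
```

Ordering the upload before the commit means that, after a host replacement, consumption resumes from an offset whose preceding data is already safely in S3.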

Figure 7: Recovery using local disk instance store and S3

During recovery, the host checks whether local storage has the finalized files and the log files for the shard(s). If not, it downloads the files from s3. It then recreates the finalized data in memory and replays the logs in memory.
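
The recovery path can be sketched as follows. Plain dictionaries stand in for local instance storage and S3, and the `recover_shard` name and the per-file-kind fallback are illustrative assumptions rather than the actual GokuS implementation.

```python
def recover_shard(shard, local_store, s3_store):
    """Rebuild a shard's in-memory data, falling back to S3 for files missing locally.

    local_store and s3_store are hypothetical dict stand-ins that map
    (shard, kind) -> list of records, where kind is "finalized" or "log".
    Returns the rebuilt in-memory data and the kinds that had to be downloaded.
    """
    in_memory = []
    downloaded = []
    for kind in ("finalized", "log"):   # finalized data first, then replay the logs
        key = (shard, kind)
        if key not in local_store:
            local_store[key] = list(s3_store.get(key, []))
            downloaded.append(kind)
        in_memory.extend(local_store[key])
    return in_memory, downloaded


s3 = {(7, "finalized"): ["f1", "f2"], (7, "log"): ["l1"]}

# Restart on the same host: everything is still on local disk, nothing is downloaded.
print(recover_shard(7, {(7, "finalized"): ["f1", "f2"], (7, "log"): ["l1"]}, s3))
# Host replacement: local disk is empty, so both kinds come from S3.
print(recover_shard(7, {}, s3))
```

The first case is the fast path that avoids any network reads during ordinary deployments.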


The recovery time of a single GokuS cluster (replica) was reduced from 90–120 minutes to under 40 minutes with these changes. This can be attributed to recovering from local disk instead of EFS in non host replacement scenarios. Another source of this reduction was that the retry ingestor no longer performs additional computation on the data points to calculate the shard id. With the additional Goku side kafka and GokuS hosts pulling directly from their respective kafka partitions, this repeated logic was not needed anymore.

Figure 8: New shard health aware query routing

The above changes also provided an added benefit of being able to determine the data lag (kafka consumer lag) at a shard/partition level. We exported this data lag per partition per cluster into shared files, and the routing service (Goku Root, more information here) made use of this information to efficiently route the queries. Previously, statsboard query clients would route the queries based on the generated synthetic metrics, which did not have per shard information. Hence, the cluster would only receive queries once all shards had recovered. But now, Goku Root detects if a shard is ready for queries by simply looking at its lag (shard health based query routing). Thus, the dependency on the statsboard generated synthetic metrics for query routing is removed. Queries start hitting a host as soon as a shard has less than some threshold of kafka lag, even if the host is still actively recovering data for other shards. This also reduced query load on the other replicas during deployment.
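
A minimal sketch of shard health based routing is shown below. The threshold value, the `ready_replicas` name, and the nested-dictionary layout of the exported lag data are assumptions for illustration; the text only says that a shard becomes eligible once its kafka lag drops below some threshold.

```python
LAG_THRESHOLD = 1000   # hypothetical threshold


def ready_replicas(shard, lag_by_cluster, threshold=LAG_THRESHOLD):
    """Return the clusters (replicas) whose copy of the shard is caught up enough to query.

    lag_by_cluster maps cluster name -> {shard id -> kafka consumer lag}.
    A shard with no reported lag is treated as not ready.
    """
    return sorted(
        cluster
        for cluster, lags in lag_by_cluster.items()
        if lags.get(shard, float("inf")) < threshold
    )


lags = {
    "replica_a": {0: 5, 1: 50_000},   # shard 1 is still recovering on replica_a
    "replica_b": {0: 12, 1: 3},
}
print(ready_replicas(0, lags))   # ['replica_a', 'replica_b']
print(ready_replicas(1, lags))   # ['replica_b']
```

Note that replica_a keeps serving shard 0 while shard 1 recovers, which is exactly what cluster-level health inference could not express.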

Goku Long Term Storage Architecture Summary and Challenges

Figure 9: Flow of data from GokuS to GokuL.

GokuL leverages RocksDB for time series data storage, and the data is tiered into buckets based on its age. In short, RocksDB is a key value store that uses a log-structured DB engine for storage and retrieval. It writes the key-value pairs in memory and async logs them. Periodically, it flushes the in memory data onto persistent storage, i.e. sorts the key value pairs based on key and stores them into a Static Sorted Table (SST) file. The sorting is for efficient retrieval. It has tiered storage, i.e. keeps compacting smaller recent SST files in lower tiers into larger and older SST files in higher tiers. More information about the architecture can be found in the GokuL blog and the cost reduction blog.

In summary, the Goku compactor service prepares SST files (there can be multiple SST files per bucket) for ingestion into GokuL RocksDB instances. It reads GokuS finalized data from S3, processes it (doing roll ups etc.), reshards it as per the GokuL sharding policy, and writes it to S3. More information about how SST files can be created and ingested can be found here.

GokuL Cluster stores and serves the metrics data that is older than a day. The metrics stored on this cluster have a TTL of 1 year. The below table shows the data tiering strategy and storage clusters. Note that each tier holds a finite number of fixed sized buckets. The bucket in itself is actually nothing but a collection of SST files holding all the time series data and metadata for the corresponding bucket size. The bucket id is unix time divided by bucket size.
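
As a small worked example of the bucket id rule, the sketch below uses the 6 hour and 16 day bucket sizes that this post gives for tier 1 and tier 4. Integer division is an assumption (the text only says "divided by"), and the function name is hypothetical.

```python
HOUR = 3600
DAY = 24 * HOUR


def bucket_id(unix_seconds, bucket_size_seconds):
    """Bucket id = unix time divided by bucket size (integer division assumed)."""
    return unix_seconds // bucket_size_seconds


ts = 1_700_000_000                    # an arbitrary timestamp in November 2023
print(bucket_id(ts, 6 * HOUR))        # 78703 (tier 1, 6 hour buckets)
print(bucket_id(ts, 16 * DAY))        # 1229  (tier 4, 16 day buckets)
```

All data points whose timestamps fall in the same fixed-size window share a bucket id, so a bucket can be created, ingested, and expired as one unit.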

When a new host is added to the existing GokuL cluster, it downloads all the relevant SST files from S3 (for all the buckets in different tiers) and ingests them into the RocksDB instance, which we create per shard in GokuL. The ingestion is tier by tier and bucket by bucket. In the GokuL SSD storage cluster, which stores tier 1 to 4 metrics data (every tier has 5–6 buckets), i.e. data up to 80 days old, any new host would take around 6 to 12 hours to fully bootstrap and be ready for queries. This is due to the fact that each host would hold approximately 100–120 shards. Hence, the total number of buckets ingested would be around 100 shards * 4 tiers * 6 buckets, that is 2400 buckets.

Also, each bucket would store different lengths of data. For example, a bucket in tier 1 would hold 6 hours of data, and a bucket in tier 4 would hold 16 days of data. More on the tiering strategy can be found here. Also, the bigger the bucket size, the more SST files it has. In general, each host would ingest around a couple of thousand GB worth of data when it bootstrapped.

Another cause for the slow bootstrap time is that the ingestion competes for CPU resources with rocksdb SST compaction. Compaction is a process that consolidates 2 higher level files into a larger lower level file and also deletes any unwanted keys during this process. Generally, when SSTs are ingested (size-based SST generation), rocksdb decides the best level for storing the SST files. However, we observed compaction starting as soon as SSTs were getting ingested. See the graph below, which shows the compaction read and write bytes on a cluster when it is bootstrapping for the first time.

Figure 10: Compaction read and write bytes showing non-zero values as soon as the host starts up.

This slow bootstrap time was a definite hindrance to our move to less compute-heavy instances for cost savings. In fact, as we noticed in our tests, the bootstrap time degraded further (to 12 hours).

RocksDB Bulk Ingestion

We tried the RocksDB bulk ingestion API to ingest all tiers and all buckets at once for a shard, and it failed with an overlapping ranges error. We then tried ingesting all buckets for a single tier, and that failed with the same error. We could infer that the SST files could not be sorted in any order during bulk ingestion because they had overlapping keys (i.e. for files 1 and 2 with key ranges (s1, l1) and (s2, l2), where s is the start key and l is the end key, we have s2 < l1 and l2 > l1). The overlapping keys were also the reason compaction started as soon as ingestion began, because RocksDB would not keep SST files with overlapping keys separate in the same level. However, looking at the GokuL data, there was a logical assumption of keys being sorted (e.g., a bucket).
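The overlap condition can be checked with a few lines of code. The following is a minimal sketch of the idea, not RocksDB's actual implementation: each file is reduced to a hypothetical (start key, end key) pair, and a set of files counts as cleanly sortable only if no two ranges overlap.

```python
from typing import List, Tuple

KeyRange = Tuple[bytes, bytes]  # (start key, end key), both inclusive


def ranges_overlap(a: KeyRange, b: KeyRange) -> bool:
    """True if the two inclusive key ranges share at least one possible key."""
    return a[0] <= b[1] and b[0] <= a[1]


def can_sort_without_overlap(files: List[KeyRange]) -> bool:
    """True if the files can be ordered so that each one ends before the next begins."""
    ordered = sorted(files)  # sort by start key, then by end key
    return all(prev[1] < cur[0] for prev, cur in zip(ordered, ordered[1:]))


# File 2 starts before file 1 ends, so the pair cannot be placed side by side.
file1 = (b"a", b"m")
file2 = (b"k", b"z")
print(ranges_overlap(file1, file2))
print(can_sort_without_overlap([file1, file2]))
```

Sorting by start key and comparing only neighbours is enough here: if every file ends before its successor starts, no pair of files can overlap.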

Goku Long Term Data Format

The GokuL blog post explains in detail the RocksDB key format followed in GokuL. In summary, because our data are tiered and bucketed, when generating these keys, we prepend [magic(1 byte)][tier(1 byte)][bucket(4 bytes)] to every RocksDB key. The magic number is a byte that identifies the type of key, such as data key, index key, dictionary key, reverse dictionary key, tier marker key, and so on. More details about these key types and what they represent can be found in the blog post link above.
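The prefix layout can be sketched with Python's struct module. This is an illustration under stated assumptions: the source gives only the field widths, so the big-endian byte order (chosen so that byte-wise comparison matches numeric order) and the helper names are assumptions.

```python
import struct

# [magic(1 byte)][tier(1 byte)][bucket(4 bytes)]; big-endian is an assumption.
PREFIX = struct.Struct(">BBI")


def make_key(magic: bytes, tier: int, bucket: int, rest: bytes) -> bytes:
    """Prepend the 6-byte prefix to the remainder of the key."""
    return PREFIX.pack(magic[0], tier, bucket) + rest


def parse_prefix(key: bytes):
    """Split a key into (magic, tier, bucket, rest)."""
    magic, tier, bucket = PREFIX.unpack_from(key)
    return bytes([magic]), tier, bucket, key[PREFIX.size:]


key = make_key(b"A", 1, 23, b"series-42")
print(parse_prefix(key))
```

Because the magic byte comes first, all keys of one type sort together across tiers and buckets, which is the property the next section relies on.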

Now assume for simplicity’s sake that the magic characters are ‘A’, ‘B’, ‘C’, ‘D’, ‘E’, and ‘F’. An SST file in a single bucket in a single tier can contain multiple key types. We also limit the size of each SST file created, so we can have multiple SST files per bucket. The SST files for a single bucket would then look like the following:

And the second bucket may look like:

If you try to bulk ingest them all together, RocksDB will try to sort the SST files based on the first and last key in each file. However, in the above case, it may never be able to do so because of the magic character and the varying number of keys that can be stored in a single SST file. For example, when it tries to sort all the SSTs lexicographically into a list, SST 1 of bucket 24 cannot find the right position in the sorted list of SSTs of buckets 23 and 24. <A,1,24> needs to come after <A,1,23>, but it cannot be inserted in the middle of SST 1 of bucket 23. Similarly, <C,1,24> cannot be inserted after <D,1,23>. Ideally, the sorted pairs should look like this: <A,1,23>…<A,1,24>, <B,1,23>…<B,1,24>, <C,1,23>…<C,1,24>, and so on. Thus, bulk ingestion errors out with overlapping ranges errors, and compaction triggers when data is ingested bucket by bucket.

Change in Data Format and Improvement in Bootstrap Times

There were 2 solutions to tackle this:

  • Have a separate SST file for each key type
  • Store tier and bucket information before the key type

In both cases, we would have to store version information in the name of the SST file, which would be useful while bootstrapping and reading the files. We chose the former because the code change was minimal and the path to production would be fast. The below tables show the SST format changes. V0 is version 0, the old SST format, while V1 is version 1 (a separate SST for each key type). Note how the V1 SSTs in both buckets can easily be sorted and ingested together.

The above changes to the following:

Similarly, this:

Changes to this:

The bootstrap time was reduced to 2 hours (from 6–12 hours) after the change was made to ingest from newly created SST files for each key type. We also observed no compaction during ingestion. For example, the graph below shows the compaction read and write bytes for a prod cluster versus a dev cluster that has the SST-based optimizations running.

Figure 11: Compaction read and write bytes showing zero values as soon as the dev host starts up with version 1 SST files.
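To see why one SST file per key type sorts cleanly while the old format does not, consider the small simulation below. The file layouts are hypothetical stand-ins (the exact file boundaries are invented for illustration), with keys modelled as (magic, tier, bucket, series id) tuples.

```python
from typing import List, Tuple

Key = Tuple[str, int, int, int]   # (magic, tier, bucket, series id)
SstRange = Tuple[Key, Key]        # (first key, last key) of one SST file


def sortable_without_overlap(files: List[SstRange]) -> bool:
    """True if the files can be ordered so that each ends before the next begins."""
    ordered = sorted(files)
    return all(prev[1] < cur[0] for prev, cur in zip(ordered, ordered[1:]))


def v0_files(bucket: int) -> List[SstRange]:
    # Hypothetical V0 layout: size-limited files, each spanning several key types.
    return [
        (("A", 1, bucket, 0), ("C", 1, bucket, 999)),
        (("D", 1, bucket, 0), ("F", 1, bucket, 999)),
    ]


def v1_files(bucket: int) -> List[SstRange]:
    # V1 layout: one file per key type, so a file never crosses a magic byte.
    return [((m, 1, bucket, 0), (m, 1, bucket, 999)) for m in "ABCDEF"]


print(sortable_without_overlap(v0_files(23) + v0_files(24)))  # False
print(sortable_without_overlap(v1_files(23) + v1_files(24)))  # True
```

In the V0 case, the first file of bucket 23 already covers magic values A through C, so the A keys of bucket 24 land inside its range. In the V1 case, every file stays within one (magic, tier, bucket) prefix, so the files of both buckets interleave without overlap.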

Future Work

  • For GokuS, we want to explore snapshotting the data to disk every hour to reduce the replaying of logs and make the bootstrap even faster (possibly reducing it from 40 minutes to 10–15 minutes).
  • For GokuL, we want to explore using tier and bucket as the prefix for the keys rather than the key type, and using DeleteRange to delete the keys for a whole bucket at once rather than relying on manual compaction for deletion. This would keep the bootstrap time intact, with the additional benefit of data for the same bucket being close together on SSD.
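The appeal of a tier-and-bucket prefix for DeleteRange is that a whole bucket becomes one contiguous key range. The sketch below computes the bounds of that range for a hypothetical [tier(1 byte)][bucket(4 bytes)] prefix; the layout, byte order, and function name are assumptions, since this format is only being explored. RocksDB's DeleteRange removes keys in the half-open range [begin, end).

```python
import struct


def bucket_delete_range(tier: int, bucket: int):
    """Return (begin, end) bounds covering every key of one bucket.

    Assumes keys start with a big-endian [tier(1 byte)][bucket(4 bytes)] prefix.
    The end bound is exclusive, matching DeleteRange's [begin, end) semantics.
    Note: the last representable bucket (2**32 - 1) would need special handling.
    """
    begin = struct.pack(">BI", tier, bucket)
    end = struct.pack(">BI", tier, bucket + 1)
    return begin, end


begin, end = bucket_delete_range(1, 23)
key_in_bucket = struct.pack(">BI", 1, 23) + b"A" + b"series-42"
key_next_bucket = struct.pack(">BI", 1, 24) + b"A" + b"series-42"
print(begin <= key_in_bucket < end)
print(begin <= key_next_bucket < end)
```

With the current magic-first prefix, the keys of one bucket are spread across every key type, so no single range covers them.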

Next Blog

In part 2, we will discuss how we improved the query experience in Goku using different techniques.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.

Improving Efficiency Of Goku Time Series Database at Pinterest (Part — 1) was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Running Unified PubSub Client in Production at Pinterest
3 weeks ago

Jeff Xiang | Software Engineer, Logging Platform

Vahid Hashemian | Software Engineer, Logging Platform

Jesus Zuniga | Software Engineer, Logging Platform

At Pinterest, data is ingested and transported at petabyte scale every day, bringing inspiration for our users to create a life they love. A central component of data ingestion infrastructure at Pinterest is our PubSub stack, and the Logging Platform team currently runs deployments of Apache Kafka and MemQ. Over the years, operational experience has taught us that our customers and business would greatly benefit from a unified PubSub interface that the platform team owns and maintains, so that application developers can focus on application logic instead of spending precious hours debugging client-server connectivity issues. Value-add features on top of the native clients can also help us achieve more ambitious goals for dev velocity, scalability, and stability. For these reasons, and others detailed in our original PubSub Client blog post, our team has decided to invest in building, productionalizing, and most recently open-sourcing PubSub Client (PSC).

In the 1.5 years since our previous blog post, PSC has been battle-tested at large scale in Pinterest with notably positive feedback and results. From dev velocity and service stability improvements to seamless migrations from native client to PSC, we would like to share some of our findings from running a unified PubSub client library in production.

Dev Velocity Improvements

In a distributed PubSub environment, complexities related to client-server communication can often be hard blockers for application developers, and solving them often requires a joint investigation between the application and platform teams. One of the core motivations driving our development of PSC was to hide these complexities from application developers so that precious time spent on debugging such issues can instead be used to focus on the application logic itself.


  • Full automation in PubSub service endpoint discovery
  • Estimated 80% reduction in time spent for setting up new PubSub producers and consumers
  • Optimized client configurations managed by platform team

How We Did It

Automated Service Discovery

PSC offers a simple and familiar solution to automate PubSub service endpoint discovery, which hides those complexities away from application developers. Through the introduction of Resource Names (RNs), PubSub resources (e.g. topics) are now uniquely identified with an RN string that contains all the information PSC needs in order to establish a connection with the servers that contain the resource in question. This is a similar concept to Internet URIs and Amazon ARNs. For example,


is an RN that tells PSC exactly which topic, cluster, region, and PubSub backend the client needs to connect to. Furthermore, the protocol in front of the RN creates a complete Uniform Resource Identifier (URI), letting PSC know exactly how the connection should be established.
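To make the idea concrete, here is a small sketch of how such a URI could be split into its parts. The exact RN syntax is not reproduced in this text, so the layout used below (protocol:/rn:backend:environment:region:cluster:topic), the class, and the example values are all assumptions for illustration; consult the PSC documentation for the real format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceUri:
    protocol: str
    backend: str
    environment: str
    region: str
    cluster: str
    topic: str


def parse_resource_uri(uri: str) -> ResourceUri:
    """Parse a URI of the assumed form protocol:/rn:backend:env:region:cluster:topic."""
    protocol, sep, rn = uri.partition(":/")
    if not sep or not protocol:
        raise ValueError(f"missing protocol in {uri!r}")
    parts = rn.split(":")
    if len(parts) != 6 or parts[0] != "rn":
        raise ValueError(f"unexpected resource name in {uri!r}")
    _, backend, environment, region, cluster, topic = parts
    return ResourceUri(protocol, backend, environment, region, cluster, topic)


# Hypothetical example values, not a real Pinterest resource.
print(parse_resource_uri("secure:/rn:kafka:dev:region-1:cluster-a:topic-a"))
```

The point of the design is that one string carries everything needed for discovery, so the application never handles hostnames, ports, or SSL settings directly.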

This simplification stands in stark contrast to some of the common pitfalls of using the native client, such as hardcoding potentially invalid hostname/port combinations, scattering SSL passwords across client configurations, and mistakenly connecting to a topic in the wrong region. With endpoint discovery fully automated and consolidated, client teams rarely, if ever, report these issues, which used to require time-consuming investigations from our platform team.

Optimized Configurations and Tracking

Prior to productionalizing PSC, application developers were required to specify their own client configurations. With this liberty came issues, notably:

  1. Some client-specified configurations may cause performance degradation for both client and server
  2. Application developers may have a limited understanding of each configuration and their implications
  3. Platform team had no visibility into what client configurations were being used

At Pinterest, PSC comes out-of-the-box with client configurations that are optimized and standardized by the platform team, so application developers no longer need to research and specify every individual configuration during configuration and application tuning. Instead, application developers now tune only the configurations that matter to them, and our platform team spends significantly less time investigating the performance and connectivity issues that came with client misconfigurations.

PSC takes it one step further with config logging. With psc.config.logging.enabled=true set, our platform team now has real-time insight into the client configurations used across the PubSub environment.

These features amount to not only significant dev velocity improvements but also gains in stability and reliability of our PubSub services.

Stability & Scalability Improvements


  • >80% reduction in Flink application restarts caused by remediable client exceptions
  • Estimated 275+ FTE hours / year saved in KTLO work by application and platform teams

How We Did It

Prior to PSC, client applications often encountered PubSub-related exceptions that resulted in application failure or restarts, severely impacting the stability of business-critical data jobs and increasing KTLO burden for both platform and application teams. Furthermore, many of these exceptions were resolvable via a client reset or even just a simple retry, meaning that the KTLO burden caused by these issues was unnecessarily large.

For instance, we noticed that out-of-sync metadata between client and server can happen during regular Kafka cluster maintenance and scaling operations such as broker replacements and rolling restarts. When the client and server metadata go out-of-sync, the client begins to throw related exceptions and becomes unstable, and does not self-recover until the client is reconstructed or reset. These types of auto-remediable issues threatened our ability to scale PubSub clusters efficiently to meet business needs, and caused significant KTLO overhead for all teams involved.

Automated Error Handling

To combat these risks, we implemented automated error handling within PSC. Positioned between the native client and application layers, PSC has the unique advantage of being able to catch and remediate known exceptions thrown by the backend client, all without causing disruption to the application layer.
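The general pattern can be sketched as a thin wrapper that sits between the application and the backend client. This is an illustrative sketch only, not PSC's implementation: the exception types, class name, and reset-by-reconstruction strategy are assumptions based on the description above.

```python
class RemediableError(Exception):
    """Stand-in for a backend exception that is known to be fixable by a client reset."""


class AutoResolvingClient:
    """Wraps a backend client and resets it when a known remediable error occurs."""

    def __init__(self, client_factory, max_attempts: int = 3):
        self._factory = client_factory
        self._client = client_factory()
        self._max_attempts = max_attempts
        self.resets = 0

    def call(self, method_name: str, *args, **kwargs):
        for attempt in range(1, self._max_attempts + 1):
            try:
                return getattr(self._client, method_name)(*args, **kwargs)
            except RemediableError:
                if attempt == self._max_attempts:
                    raise  # give up and surface the error to the application
                # Remediate by reconstructing the backend client, then retry.
                self._client = self._factory()
                self.resets += 1


class FlakyBackend:
    """Toy backend whose first poll fails, simulating out-of-sync metadata."""
    polls = 0

    def poll(self):
        FlakyBackend.polls += 1
        if FlakyBackend.polls == 1:
            raise RemediableError("metadata out of sync")
        return ["record"]


client = AutoResolvingClient(FlakyBackend)
print(client.call("poll"), client.resets)
```

Only exceptions on the known-remediable list are caught; anything else propagates unchanged, so the application still sees genuine failures.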

With automated error handling logic implemented, we also ship PSC with psc.auto.resolution.enabled=true turned on by default, allowing all PubSub clients to run out-of-the-box with automated error handling logic managed by our platform team. Taking Flink-Kafka clients as an example, we have observed more than 80% reduction in job failures caused by remediable client exceptions after migrating them to PSC, all without any changes to our regular Kafka broker environment and scaling / maintenance activities:

Figure 1: More than 80% of remediable Flink job restarts were prevented via PSC

As a result of automated error handling in PSC, we have been able to save an estimated 275+ FTE hours per year in KTLO work for both application and platform teams, driving significant improvements in the stability of client applications and the scalability of our PubSub environment. We are also actively adding to PSC’s catalog of known exceptions and remediation strategies as we expand our understanding of these issues, and we are exploring proactive rather than reactive measures to prevent such issues from happening in the first place.

Seamless Migrations from Native Client


  • >90% of Java applications migrated to PSC (100% for Flink)
  • 0 incidents caused by migration
  • Full integration test suite and CICD pipeline

How We Did It

Feature and API Parity

Built with ease-of-adoption in mind, PSC comes with 100% feature and API parity to the native backend client version it supports. With PSC being currently available for Kafka clients using Java, we have been able to migrate >90% of Pinterest’s Java applications to PSC with minimal changes to their code and logic. In most cases, the only changes required on the application were:

  1. Replace the native client imports and references with the corresponding PSC ones
  2. Update the client configuration keys to match PSC’s
  3. Remove all previous configurations related to service discovery / SSL and replace them with just the Resource Name (RN) string

Simple, low-effort migrations enabled by feature and API parity have been a strong selling point for application teams to quickly and efficiently migrate their clients to PSC. We have observed 0 incidents so far with migrations and do not expect this number to increase.

Apache Flink Integration

To support the ever-growing share of clients using the Apache Flink data streaming framework, we have developed a Flink-PSC connector that allows Flink jobs to leverage the benefits of PSC. Given that around 50% of Java clients at Pinterest are on Flink, PSC integration with Flink was key to achieving our platform goal of fully migrating Java clients to PSC.

With Flink jobs, we had to ensure that migrations from Flink-Kafka to Flink-PSC were seamless, in that the newly migrated Flink-PSC jobs must be able to recover from checkpoints generated by the pre-migration Flink-Kafka jobs. This is critical to Flink migrations because Flink jobs store offsets and other state-related information within the checkpoint files. This presented a technical challenge that required opening up the Flink-Kafka checkpoint files, understanding their contents, and understanding the way the contents are processed by Flink source and sink operators. Ultimately, we were able to achieve 100% adoption of Flink-PSC at Pinterest with the following efforts:

  1. We implemented Kafka to PSC checkpoint migration logic within FlinkPscProducer and FlinkPscConsumer to ensure that state and offset information from the pre-migration Flink-Kafka checkpoint is recoverable via a Flink-PSC job
  2. We added a small amount of custom code in our internal release of Flink-Kafka connector to ensure Flink-Kafka and Flink-PSC checkpoints are deemed compatible from the perspective of Flink’s internal logic

Robust Integration Tests and CICD

With PSC being in the active path for data ingestion and processing at Pinterest, we have taken extra care to ensure that it is robustly tested on all levels prior to releases, notably in integration testing and dev / staging environment testing. For this reason, PSC comes out-of-the-box with a full integration test suite that covers many common scenarios that we have observed in our PubSub operational experience. Furthermore, we have cataloged the public APIs within both PscConsumer and PscProducer to create a CICD pipeline that launches a PSC client application which processes production-level traffic and touches all of the public APIs. Robust integration testing and CICD, alongside expansive unit test coverage, have been instrumental in building our confidence in PSC’s ability to take on business-critical data workloads from day one.

Future Work

Having been battle-tested at scale for over one year, PSC is now a core piece of the puzzle within Pinterest’s data infrastructure. There is more work planned for the future, aiming to increase its technical capability and value to our business.

Error Handling Improvements

As PSC is onboarded to more client applications, we have begun to notice and catalog the variety of remediable errors that PSC cannot yet resolve automatically, and we are actively adding these capabilities to PSC with each new release. One example is detecting expiring SSL certificates so that a proactive client reset can be done as certificate expiration approaches, loading a fresh certificate into the client’s memory and preventing any interruption to a client using the SSL protocol.
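The expiry check itself is simple to sketch. The following is a hypothetical illustration of the decision logic only: it assumes the certificate's expiry time has already been extracted as a timezone-aware datetime, and the function name and 7-day threshold are invented for the example.

```python
from datetime import datetime, timedelta, timezone


def needs_proactive_reset(
    not_after: datetime,
    now: datetime,
    threshold: timedelta = timedelta(days=7),
) -> bool:
    """True if the certificate expires within the threshold (or has already expired)."""
    return not_after - now <= threshold


now = datetime(2023, 11, 1, tzinfo=timezone.utc)
expiring_soon = datetime(2023, 11, 5, tzinfo=timezone.utc)
expiring_later = datetime(2024, 3, 1, tzinfo=timezone.utc)
print(needs_proactive_reset(expiring_soon, now))
print(needs_proactive_reset(expiring_later, now))
```

A client library could run such a check periodically and schedule the reset at a quiet moment, instead of waiting for a handshake failure.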

Cost Attribution and Chargeback

PSC offers us the ability to track our clients, providing valuable information such as their attributed projects, hostnames, configurations, and more. One potential use case for this newfound visibility is to set up a chargeback framework for PubSub clients so that platform teams are able to break down how their PubSub cost can be attributed to various client projects and teams.

C++ and Python

PSC is currently available in Java. To expand the scope of PSC, C++ support is being actively developed while Python support is on the horizon.

Check it out!

PSC-Java is now open-sourced on GitHub with Apache License 2.0. Check it out here! Feedback and contributions are welcome and encouraged.


The current state of PSC would not have been possible without significant contributions and support provided by Shardul Jewalikar and Ambud Sharma. Ping-Min Lin has also contributed substantially to design and implementation of the project. Special thanks to Logging Platform and Xenon Platform Teams, Chunyan Wang and Dave Burgess for their continuous guidance, feedback, and support.


Apache®️, Apache Kafka, Kafka, Apache Flink, and Flink are trademarks of the Apache Software Foundation.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.

Running Unified PubSub Client in Production at Pinterest was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

PinCompute: A Kubernetes Backed General Purpose Compute Platform for Pinterest
4 weeks ago

Harry Zhang, Jiajun Wang, Yi Li, Shunyao Li, Ming Zong, Haniel Martino, Cathy Lu, Quentin Miao, Hao Jiang, James Wen, David Westbrook | Cloud Runtime Team

Image Source: https://unsplash.com/photos/ZfVyuV8l7WU


Modern compute platforms are foundational to accelerating innovation and running applications more efficiently. At Pinterest, we are evolving our compute platform to provide an application-centric and fully managed compute API for the 90th percentile of use cases. This will accelerate innovation through platform agility, scalability, and a reduced cost of keeping systems up to date, and will improve efficiency by running our users’ applications on Kubernetes-based compute. We refer to this next generation compute platform as PinCompute, and our multi-year vision is for PinCompute to run the most mission critical applications and services at Pinterest.

PinCompute aligns with the Platform as a Service (PaaS) cloud computing model, in that it abstracts away the undifferentiated heavy lifting of managing infrastructure and Kubernetes and enables users to focus on the unique aspects of their applications. PinCompute evolves Pinterest architecture with cloud-native principles, including containers, microservices, and service mesh. It reduces the cost of keeping systems up to date by providing and managing immutable infrastructure, operating system upgrades, and Graviton instances. It also delivers cost savings by applying enhanced scheduling capabilities to large multi-tenant Kubernetes clusters, including oversubscription, bin packing, resource tiering, and trough usage.

In this article, we discuss the PinCompute primitives, architecture, control plane and data plane capabilities, and showcase the value that PinCompute has delivered for innovation and efficiency at Pinterest.


PinCompute is a regional Platform-as-a-Service (PaaS) that builds on top of Kubernetes. PinCompute’s architecture consists of a host Kubernetes cluster (host cluster) and multiple member Kubernetes clusters (member clusters). The host cluster runs the regional federation control plane and keeps track of workloads in that region. The member clusters are zonal and are used for the actual workload execution. Each zone can have multiple member clusters; this strictly aligns with the failure domain defined by the cloud provider and clearly defines fault isolation and operation boundaries for the platform to ensure availability and control blast radius. All member clusters share a standard Kubernetes setup across control plane and data plane capabilities, and they support heterogeneous capabilities such as different workload types and hardware selections. PinCompute is multi-tenant: many types of workloads from different teams and organizations share the same platform. The platform provides the necessary isolation to ensure it can be shared across tenants securely and efficiently.

Figure 1: High Level Architecture of PinCompute

Users access the platform via Compute APIs to perform operations on their workloads. We leverage Custom Resources (CR) to define the kinds of workloads supported by the platform, and the platform offers a range of workload orchestration capabilities that support both batch jobs and long running services in various forms. When a workload is submitted to the platform, it first gets persisted with the host cluster’s Kubernetes API. The federation control plane then kicks in to perform the workload management tasks needed at the regional level, including quota enforcement, workload sharding, and member cluster selection. Then, the workload shards get propagated to member clusters for execution. The member cluster control plane consists of a combination of in-house and open source operators that are responsible for orchestrating workloads of different kinds. The federation control plane also collects execution statuses of workloads from their corresponding member clusters and aggregates them to be consumable via PinCompute APIs.

Figure 2: Workflow for Execution and Status Aggregation of PinCompute

PinCompute Primitives

Figure 3: Workload architecture on PinCompute

PinCompute primitives serve heterogeneous workloads across Pinterest, including long-running, run-to-finish, ML training, and scheduled-run workloads, among others. These use cases fall into three categories: (1) general purpose compute and service deployment, (2) run-to-finish jobs, and (3) infrastructure services. Pinterest run-to-finish jobs and infrastructure services are supported by existing Kubernetes native and Pinterest-specific resources. Building on our latest thinking about how to define simple, intuitive, and extensible compute primitives, PinCompute introduces a new set of primitives for general purpose compute and service deployment. These primitives include PinPod, PinApp, and PinScaler.

PinPod is the basic building block for general purpose compute at Pinterest. PinPod inherits the native Kubernetes Pod’s essence of being a foundational building block while providing additional Pinterest-specific capabilities. These include features like per container updates, managed sidecars, data persistence, failovers, and more, which allow PinPod to be easily leveraged as a building block under various production scenarios at Pinterest. PinPod is designed to create a clear divide between application and infrastructure teams, while still retaining the lightweight nature of running containers. It solves many existing pain points: for example, the per container update can speed up application rolling updates, reduce resource consumption, and eliminate disturbance to user containers during infra sidecar upgrades.

PinApp is an abstraction that provides the best way to run and manage long running applications at Pinterest. By leveraging PinPod as an application replica, PinApp inherits all the integrations and best practices about software delivery from PinPod. Thanks to the federation control plane, PinApp offers a set of built-in orchestration capabilities to fulfill common distributed application management requirements, which includes zone-based rollouts and balancing zonal capacity. PinApp supports the functionality offered by Kubernetes native primitives such as Deployments and ReplicaSets but also includes extensions like deployment semantics to meet business needs and enhance manageability.

PinScaler is an abstraction that supports application auto scaling at Pinterest. It is integrated with Statsboard, Pinterest’s native metrics dashboard, allowing users to configure application-level metrics with desired thresholds to trigger scaling along with scaling safeguards, such as a cool down window and replica min/max limitations. PinScaler supports simple scaling with CPU and memory metrics, as well as scheduled scaling and custom metrics to support various production scenarios.

Figure 4: PinCompute Primitives: PinPod, PinApp, and PinScaler. PinPod operates as an independent workload, and also a reusable building block for the higher-order primitive PinApp. PinScaler automatically scales PinApp.
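To illustrate how a metric threshold, a cool down window, and replica min/max limits can interact, here is a minimal scaling-decision sketch. It is not PinScaler's algorithm: the proportional formula is borrowed from the Kubernetes Horizontal Pod Autoscaler, and all names and parameters are hypothetical.

```python
import math


def desired_replicas(
    current: int,
    metric_value: float,
    target_value: float,
    min_replicas: int,
    max_replicas: int,
    seconds_since_last_scale: float,
    cooldown_seconds: float,
) -> int:
    """Return the replica count to scale to, honoring cool down and min/max limits."""
    if seconds_since_last_scale < cooldown_seconds:
        return current  # still inside the cool down window: do nothing
    # Proportional rule (as in the Kubernetes HPA): scale by the metric/target ratio.
    raw = math.ceil(current * metric_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas at 90% CPU with a 60% target, outside the cool down window.
print(desired_replicas(4, 90, 60, 2, 10, 600, 300))  # 6
```

The safeguards matter as much as the formula: the cool down window prevents flapping, and the min/max clamp bounds both cost and the blast radius of a bad metric.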

Returning to the bigger picture, PinCompute leverages the next generation primitives (PinPod, PinApp, PinScaler), building blocks from native Kubernetes and open source communities, along with deep integrations with federation architecture to provide the following categories of use cases:

(1) General purpose compute and service deployment: This is handled by PinCompute’s new primitive types. PinApp and PinScaler help long-running stateless services deploy and scale quickly. PinPod functions as a general purpose compute unit and is currently serving Jupyter Notebook for Pinterest developers.

(2) Run-to-finish jobs: PinterestJobSet leverages Jobs to provide users a mechanism to execute run-to-finish, framework-less parallel processing; PinterestTrainingJob leverages TFJob and PyTorchJob from the Kubeflow community for distributed training; PinterestCronJob leverages CronJob to execute scheduled jobs based on cron expressions.

(3) Infrastructure services: We have PinterestDaemon leveraging DaemonSet, and a proprietary PinterestSideCar to support different deploy modes of infrastructure services. Components that are able to be shared by multiple tenants (e.g. logging agent, metrics agent, configuration deployment agent) are deployed as PinterestDaemons, which ensures one copy per node, shared by all Pods on that node. Those that cannot be shared will leverage PinterestSideCar and will be deployed as sidecar containers within user Pods.

The PinCompute primitives enable Pinterest developers to delegate infrastructure management and the associated concerns of troubleshooting and operations, allowing them to concentrate on evolving business logic to better serve Pinners.

Accessing PinCompute

Users access PinCompute primitives via PinCompute’s Platform Interfaces, which consist of an API layer, a client layer for the APIs, and the underlying services and storage that support those APIs.

Figure 5: High level architecture of PinCompute Platform Interface layer

PinCompute API

PinCompute API is a gateway for users to access the platform. It provides three groups of APIs: workload APIs, operation APIs, and insight APIs. Workload APIs contain methods to perform CRUD actions on compute workloads; operation APIs provide debugging mechanisms, such as streaming logs or opening container shells, to troubleshoot live workloads; and insight APIs provide runtime information, such as application state changes and system internal events, to help users understand the state of their existing and past workloads.

Why PinCompute API

Introducing PinCompute API on top of raw Kubernetes APIs has many benefits. First, as PinCompute federates many Kubernetes clusters, PinCompute API integrates user requests with federation and aggregates cross-cluster information to form a holistic user-side view of the compute platform. Second, PinCompute API accesses the Kubernetes API efficiently. For example, it contains a caching layer to serve read APIs, which offloads expensive list and query API calls from the Kubernetes API server. Finally, as a gateway service, PinCompute API ensures a uniform user experience when accessing different PinCompute backend services such as Kubernetes, node service, insights service, project governance services, etc.

Figure 6: PinCompute API data flow

Integrating With Pinterest Infrastructure

This layer incorporates Pinterest’s infrastructure capabilities like rate limiting and security practices to simplify the Kubernetes API usage and provide a stable interface for our API consumers and developers. The PinCompute API implements rate limiting mechanisms to ensure fair resource usage leveraging our Traffic team’s rate limiting sidecar, benefiting from reusable Pinterest components. PinCompute API is also fully integrated with Pinterest’s proprietary security primitives to ensure authentication, authorization, and auditing to follow paved paths. Such integration enables us to provide Pinterest developers with unified access control experience with granularity at API call and API resource level. These integrations are critical for PinCompute APIs to be reliable, secure, and compliant.

Enhanced API Semantics

PinCompute API provides enhanced API semantics on top of the Kubernetes API to improve the user experience. One important enhancement is that PinCompute API presents the raw Kubernetes data model in a simplified way, with only the information relevant to building software at Pinterest. This not only reduces the infrastructure learning curve for developers who focus on building high-level application logic, but also improves data efficiency for API serving. For example, removing managed fields reduces data size by up to 50% for PinCompute API calls. We also designed the APIs to be more descriptive for use cases such as pause, stop, restart-container, etc., which makes them intuitive and easy to use in many scenarios. PinCompute provides OpenAPI documentation and auto-generated clients, documentation, and SDKs to help users self-serve building applications on PinCompute.
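As a small illustration of the managed fields example, the sketch below drops metadata.managedFields (the Kubernetes field that records server-side apply ownership) from an object before serializing it. The function name and sample object are invented for the example; how PinCompute API actually trims its responses is not described here.

```python
import copy
import json


def simplify(obj: dict) -> dict:
    """Return a copy of a Kubernetes-style object without metadata.managedFields."""
    slim = copy.deepcopy(obj)
    slim.get("metadata", {}).pop("managedFields", None)
    return slim


pod = {
    "kind": "Pod",
    "metadata": {
        "name": "example",
        "managedFields": [
            {"manager": "kubectl", "operation": "Update", "fieldsV1": {"f:spec": {}}}
        ],
    },
    "spec": {"containers": [{"name": "app", "image": "example:latest"}]},
}

slim = simplify(pod)
print(len(json.dumps(pod)), len(json.dumps(slim)))
```

The copy leaves the original object untouched, so the full representation stays available to internal components that still need it.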

PinCompute SDK

We strategically invest in building an SDK for clients to standardize access to PinCompute. With the SDK, we are able to encapsulate best practices such as error handling, retry with backoff, logging, and metrics as reusable building blocks, and ensure these best practices are always applied to a client. We also publish and manage versioned SDKs with clear guidance on how to develop on top of the SDK. We closely work with our users to ensure the adoption of the latest and greatest versions of the SDK for optimized interactions with PinCompute.
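Retry with backoff is one of the practices that is easy to get subtly wrong when every client reimplements it. A minimal sketch of the pattern, assuming exponential backoff with full jitter, might look like the following; the function name, defaults, and retryable exception types are illustrative and are not taken from the PinCompute SDK.

```python
import random
import time


def call_with_backoff(
    fn,
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    retryable=(ConnectionError, TimeoutError),
    sleep=time.sleep,
):
    """Call fn(), retrying retryable errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter spreads out retries


calls = []


def flaky_request():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(call_with_backoff(flaky_request, sleep=lambda s: None))
```

Passing sleep as a parameter keeps the helper testable without real waiting, and limiting retries to an explicit list of exception types avoids retrying errors that will never succeed.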

Managing Resources in PinCompute

Resource Model

PinCompute supports three resource tiers: Reserved, OnDemand, and Preemptible. Users define the resource quota of their projects for each tier. Reserved tier quotas are backed by a fixed-size resource pool and a dedicated workload scheduling queue, which ensures scheduling throughput and capacity availability. OnDemand tier quotas leverage a globally shared and dynamically sized resource pool, serving workloads in a first-come, first-served manner. The Preemptible tier is being developed to make opportunistic use of unused Reserved tier and OnDemand tier capacity, which would get reclaimed when needed by the corresponding tier. PinCompute clusters are also provisioned with a buffer space consisting of active but unused resources to accommodate workload bursts. The following diagram illustrates the resource model of PinCompute.

Figure 7: PinCompute resource model
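The per-tier quota idea can be sketched as a small admission check. Everything here is an assumption made for illustration: the `Tier` and `ProjectQuota` names, measuring quota in CPU cores, and rejecting (rather than queueing) requests that exceed the quota. The post does not describe how PinCompute accounts for quota internally.

```python
from enum import Enum
from typing import Dict


class Tier(Enum):
    RESERVED = "reserved"
    ON_DEMAND = "on_demand"
    PREEMPTIBLE = "preemptible"


class ProjectQuota:
    """Tracks a project's usage against its per-tier quota (hypothetical model)."""

    def __init__(self, limits: Dict[Tier, float]):
        self.limits = limits
        self.used = {tier: 0.0 for tier in Tier}

    def try_admit(self, tier: Tier, cpu: float) -> bool:
        """Admit the request only if it fits in the remaining quota for the tier."""
        if self.used[tier] + cpu > self.limits.get(tier, 0.0):
            return False
        self.used[tier] += cpu
        return True


# Usage: a project with 100 reserved cores and 50 on-demand cores.
quota = ProjectQuota({Tier.RESERVED: 100, Tier.ON_DEMAND: 50})
decisions = [
    quota.try_admit(Tier.RESERVED, 60),    # fits
    quota.try_admit(Tier.RESERVED, 60),    # would exceed the reserved quota
    quota.try_admit(Tier.ON_DEMAND, 50),   # exactly fills the on-demand quota
    quota.try_admit(Tier.PREEMPTIBLE, 1),  # no quota defined for this tier
]
```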

Scheduling Architecture

PinCompute consists of two layers of scheduling mechanisms to ensure effective workload placements. Cluster level scheduling is performed in PinCompute’s regional federation control plane. Cluster level scheduling takes a workload and picks one or more member clusters for execution. During cluster level scheduling, the workload first passes through a group of filters that exclude clusters that cannot fit it, and then through a group of score calculators that rank the candidate clusters. Cluster level scheduling ensures that the high level placement strategy and resource requirements are satisfied, and also takes factors such as load distribution and cluster health into consideration to perform regional optimizations. Node level scheduling happens inside member clusters, where workloads are converted to Pods by the corresponding operators. After Pods are created, a Pod scheduler places them onto nodes for execution. PinCompute’s Pod scheduler leverages Kubernetes’s scheduler framework, with a combination of upstream and proprietary plugins, so that the scheduler supports all features available in open source Kubernetes while being optimized for PinCompute’s specific requirements.

Figure 8: PinCompute scheduling architecture
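The filter-then-score flow of cluster level scheduling can be sketched in a few lines. The `Cluster` and `Workload` fields, the two filters, and the single load-based scorer below are hypothetical stand-ins; the real federation control plane considers more factors and may pick more than one cluster.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Cluster:
    name: str
    free_cpu: float
    healthy: bool
    load: float  # 0.0 (idle) to 1.0 (full)


@dataclass
class Workload:
    name: str
    cpu: float


Filter = Callable[[Workload, Cluster], bool]
Scorer = Callable[[Workload, Cluster], float]


def pick_cluster(
    workload: Workload,
    clusters: List[Cluster],
    filters: List[Filter],
    scorers: List[Scorer],
) -> Optional[Cluster]:
    """Drop clusters rejected by any filter, then return the highest-scoring one."""
    candidates = [c for c in clusters if all(f(workload, c) for f in filters)]
    if not candidates:
        return None
    return max(candidates, key=lambda c: sum(s(workload, c) for s in scorers))


# Hypothetical filters and scorer.
filters: List[Filter] = [
    lambda w, c: c.free_cpu >= w.cpu,  # capacity fit
    lambda w, c: c.healthy,            # cluster health
]
scorers: List[Scorer] = [lambda w, c: 1.0 - c.load]  # prefer less loaded clusters

clusters = [
    Cluster("a", free_cpu=10, healthy=True, load=0.9),
    Cluster("b", free_cpu=50, healthy=True, load=0.3),
    Cluster("c", free_cpu=100, healthy=False, load=0.1),
]
chosen = pick_cluster(Workload("job", cpu=20), clusters, filters, scorers)
too_big = pick_cluster(Workload("huge", cpu=1000), clusters, filters, scorers)
```

Here cluster `a` is filtered out for lack of capacity and `c` for being unhealthy, leaving `b` as the only candidate.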

PinCompute Cost Efficiency

Cost efficiency is critical to PinCompute. We have enacted various methods to successfully drive down PinCompute infrastructure cost without compromising on the user experience.

We promote multi-tenancy usage by eliminating unnecessary resource reservation and migrating user workloads to the on-demand resource pool that is shared across the federated environment. We collaborated with major platform users to smoothen their workload submission pattern to avoid oversubscription in resources. We also started a platform-level initiative to switch GPU usage from P4 family instances to the cost-performant alternatives (i.e. G5 family). The following diagram demonstrates the trend of PinCompute GPU cost vs. capacity, where we successfully reduced cost while supporting the growing business.

Figure 9: PinCompute GPU cost vs. capacity

Moving forward, there are several ongoing projects in PinCompute to further enhance cost efficiency. 1) We will introduce preemptible workloads to encourage more flexible resource sharing. 2) We will enhance the platform resource tiering and workload queueing mechanisms to make smarter decisions that balance fairness and efficiency when scheduling user workloads.

PinCompute Node Runtime

Node architecture is a critical space where we invested heavily to ensure applications are able to run on a containerized, multi-tenanted environment securely, reliably, and efficiently.

Figure 10: High level architecture of PinCompute Node and infrastructure integrations

Pod in PinCompute

Pod is designed to isolate tenants on the node. When a Pod is launched, it is granted its own network identity, security principal, and resource isolation boundary atomically, which are immutable during a Pod’s lifecycle.

When defining containers inside a Pod, users can specify one of two lifecycle options: main container or sidecar container. Main containers honor the Pod level restart policy, while sidecar containers are guaranteed to be available for as long as main containers need to run. In addition, users can enable startup and termination ordering between sidecar and main containers. Pod in PinCompute also supports per container update, with which containers can be restarted with a new spec without requiring the Pod to be terminated and launched again. Sidecar container lifecycle and per container update are critical features for batch job execution reliability and service deployment efficiency.

PinCompute has a proprietary networking plugin to support a variety of container networking requirements. Host network is reserved for system applications only. “Bridge Port” assigns a node-local, non-routable IP to Pods that do not need to serve traffic. For Pods that need to serve traffic, we provide a “Routable IP” allocated from a shared network interface, or a Pod can request a “Dedicated ENI” for full network segmentation. Network resources such as ENI and IP allocations are holistically managed through the cloud resource control plane, which ensures efficient management.

PinCompute supports a variety of volumes including EmptyDir, EBS, and EFS. Specifically, we have a proprietary volume plugin for logging, which integrates with in-house logging pipelines to ensure efficient and reliable log collections.

Integrating With Pinterest Infrastructure

PinCompute node contains critical integration points between user containers and Pinterest’s infrastructure ecosystem, namely, security, traffic, configuration, logging and observability. These capabilities have independent control planes that are orthogonal to PinCompute, and therefore are not limited to any “Kubernetes cluster” boundary.

Infrastructure capabilities are deployed in three manners: host-level daemon, sidecar container, or with a dual mode. Daemons are shared by all Pods running on the node. Logging, metrics, and configuration propagation are deployed as daemons, as they do not need to leverage Pod’s tenancy or stay in the critical data paths of the applications running in the Pod. Sidecar containers operate within Pod’s tenancy and are leveraged by capabilities that rely on Pod’s tenancy or need performance guarantees such as traffic and security.

User containers interact with infrastructure capabilities such as logging, configuration, and service discovery through file system sharing, and with capabilities such as traffic and metrics through networking (local host or Unix domain socket). Pod, along with the tenancy definition we have, ensures various infrastructure capabilities can be integrated in a secure and effective manner.

Enhanced Operability

PinCompute node has a proprietary node management system that enhances visibility and operability of nodes. It contains node level probing mechanisms to deliver supplementary signals for node health which covers areas such as container runtime, DNS, devices, various daemons, etc. These signals serve as a node readiness gate to ensure new nodes are schedulable only after all capabilities are ready, and are also used during application runtime to assist automation and debugging. As part of node quality of service (QoS), when a node is marked for reserved tier workloads, it can provide enhanced QoS management such as configuration pre-downloading or container image cache refresh. Node also exposes runtime APIs such as container shells and live log streaming to help users troubleshoot their workloads.

Figure 11: PinCompute’s proprietary node management system

Managing PinCompute Infrastructure

Prioritizing Automation

Automation has a large return on investment when it comes to minimizing human error and boosting productivity. PinCompute integrates a range of proprietary services aimed at streamlining daily operations.

Automatic Remediation

Operators are often troubled with trivial node health issues. PinCompute is equipped to self-remediate these issues with an automatic remediation service. Health probes operating on the Node Manager detect node complications and mark them via specific signal annotations. These signals are monitored and interpreted into actions, and the remediation service then executes actions such as cordoning or terminating the node. The components for detection, monitoring, and remediation are decoupled and extensible. Furthermore, deliberate rate limiting and circuit-breaking mechanisms are in place, providing a systematic approach to node health management.

Figure 12: PinCompute Automatic Remediation Architecture
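The detect, interpret, and act loop described above can be sketched as follows. The signal names, the `SIGNAL_TO_ACTION` table, and the `Remediator` class are hypothetical, and the fixed action budget is a deliberately simplified stand-in for real rate limiting and circuit breaking (which would typically reset over a time window).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical mapping from health-signal annotation values to actions.
SIGNAL_TO_ACTION = {
    "container-runtime-unhealthy": "terminate",
    "dns-degraded": "cordon",
}


@dataclass
class Remediator:
    max_actions: int  # simplified budget; a real limiter would reset per time window
    actions_taken: List[Tuple[str, str]] = field(default_factory=list)

    def remediate(self, node_signals: Dict[str, str]) -> List[Tuple[str, str]]:
        """Map each node's signal to an action, stopping once the budget is spent."""
        for node, signal in node_signals.items():
            action = SIGNAL_TO_ACTION.get(signal)
            if action is None:
                continue  # unknown or healthy signal: nothing to do
            if len(self.actions_taken) >= self.max_actions:
                break  # budget exhausted: leave remaining nodes for operators
            self.actions_taken.append((node, action))
        return self.actions_taken


# Usage: four nodes, three of them unhealthy, with a budget of two actions.
signals = {
    "node-1": "dns-degraded",
    "node-2": "healthy",
    "node-3": "container-runtime-unhealthy",
    "node-4": "dns-degraded",
}
actions = Remediator(max_actions=2).remediate(signals)
```

Capping the number of automated actions limits the damage if a faulty probe suddenly marks many nodes as unhealthy.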

Application Aware Cluster Rotation

The primary function of the PinCompute Upgrade service is to facilitate the rotation of Kubernetes clusters in a secure, fully automated manner while adhering to both PinCompute platform SLOs and user agreements concerning rotation protocol and graceful termination. When processing a cluster rotation, concerns include the order in which different types of nodes are rotated, whether nodes are rotated in parallel or individually, and the specific timing of node rotations. Such concerns arise from the diverse nature of user workloads running on the PinCompute platform. Through the PinCompute Upgrade service, platform operators can explicitly dictate how they would like cluster rotations to be conducted. This configuration allows for a carefully managed automatic progression.

Release PinCompute

Platform Verification

The PinCompute release pipeline consists of four stages, each of which is an individual federated environment. Changes are deployed stage by stage and verified before being promoted. An end-to-end test framework runs continuously on PinCompute to verify platform correctness. This framework emulates a genuine user and functions as a constant canary.

Figure 13: PinCompute Release Procedure

Machine Image (AMI) Management

PinCompute selectively offers a finite set of node types, taking into account user needs of hardware families, manageability and cost-effectiveness. The AMIs responsible for bootstrapping these nodes fall into three categories: general-purpose AMIs, machine learning focused AMI, and a customizable AMI. The concept of inheriting from a parent AMI and configuration simplifies their management considerably. Each AMI is tagged according to type and version, and they utilize the Upgrade service to initiate automatic deployments.

Operation and User Facing Tools

In PinCompute, we provide a set of tools for platform users and administrators to easily operate the platform and the workloads running on it. We built a live-debugging system to provide end users with UI-based container shells to debug inside their Pods, as well as to stream console logs and file-based logs to understand the progress of their running applications. This tool leverages proprietary node level APIs to decouple user debugging from critical control paths such as the Kubernetes API and Kubelet, which ensures failure isolation and scalability. Self-service project management, along with step-by-step tutorials, also reduces the overhead for users to onboard new projects or adjust properties of existing projects, such as resource quota. PinCompute’s cluster management system provides an interactive mechanism for editing cluster attributes, which makes it handy to iterate on new hardware or adjust capacity settings. These easy-to-use tool chains ensure efficient and scalable operations and, over time, have greatly improved the user experience of the platform.

Scalability and SLOs

PinCompute is designed to support the compute requirements at Pinterest scale. Scalability is a complex goal to achieve, and for us, each of PinCompute’s Kubernetes clusters is optimized towards a sweet spot of 3000 nodes, 120k pods, and 1000 mutating pod operations per minute, with a 25-second P99 end-to-end workload launch latency. These scaling targets are defined by the requirements of most applications at Pinterest, and are the result of balancing multiple factors including cluster size, workload agility, operability, blast radius, and efficiency. These scaling targets make each Kubernetes cluster a solid building block for overall compute, and PinCompute’s architecture can scale horizontally by adding more member clusters to keep up with the continuous growth of the PinCompute footprint.

PinCompute defines its SLOs in two forms: API availability and platform responsiveness. PinCompute ensures 99.9% availability of its critical workload orchestration related APIs. PinCompute offers an SLO on control plane reconcile latency, which focuses on the latency for the system to take action. Such latency varies from seconds to tens of seconds based on workload complexity and corresponding business requirements. For reserved tier quality of service, PinCompute provides an SLO for end-to-end workload launch speed, which focuses not only on how quickly the platform takes action, but also on how quickly such actions take effect. These SLOs are important signals for platform level performance and availability, and they also set high standards for platform developers to iterate on platform capabilities with high quality.
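To put the 99.9% availability target in perspective, the short calculation below converts an availability percentage into an allowed-downtime budget. The 30-day measurement window is an assumption for illustration; the post does not state the window over which the SLO is evaluated.

```python
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of allowed unavailability over a window of `days` days."""
    return (1.0 - availability) * days * 24 * 60


# 30 days is 43,200 minutes, so 0.1% of that is about 43.2 minutes.
budget = downtime_budget_minutes(0.999)
print(f"{budget:.1f} minutes per 30 days")
```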

Learnings and Future Work

Over the past few years, we have matured the platform both in its architecture and in the set of capabilities Pinterest requires. Introducing compute as Platform as a Service (PaaS) has been seen as the biggest win for Pinterest developers. Internal research showed that > 90% of use cases, representing > 60% of the infrastructure footprint, can benefit from leveraging a PaaS to iterate on their software. For platform users, PaaS abstracts away the undifferentiated heavy lifting of owning and managing infrastructure and Kubernetes, and enables them to focus on the unique aspects of their applications. For platform operators, PaaS enables holistic infrastructure management through standardization, which provides opportunities to enhance efficiency and reduce the cost of keeping infrastructure up-to-date. PinCompute embraces “API First” which defines a crisp support contract and makes the platform programmable and extendable. Moreover, a solid definition of “tenancy” in the platform establishes clear boundaries across use cases and their interactions with infrastructure capabilities, which is critical to the success of a multi-tenanted platform. Last but not least, by doubling down on automation, we were able to improve support response time and reduce team KTLO and on-call overhead.

There are a lot of exciting opportunities as PinCompute keeps growing its footprint in Pinterest. Resource management and efficiency is a big area we are working on; projects such as multi-tenant cost attribution, efficient bin packing, autoscaling and capacity forecast are critical to support an efficient and accountable infrastructure in Pinterest. Orchestrating stateful applications is both technically challenging and important to Pinterest business, and while PinPod and PinApp are providing solid foundations to orchestrate applications, we are actively working with stakeholders of stateful systems on shareable solutions to improve operational efficiency and reduce maintenance costs. We also recognize the importance of use cases being able to access Kubernetes API. As Kubernetes and its communities are actively evolving, it is a big benefit to follow industry trends and adopt industry standard practices, and therefore we are actively working with partner teams and vendors to enable more Pinterest developers to do so. Meanwhile, we are working on contributing back to the community, as we believe a widely trusted community is the best platform to build a shared understanding, contribute features and improvements, and share and absorb wins and learnings in production for the good of all. Finally, we are evaluating opportunities to leverage managed services to further offload infrastructure management to our cloud provider.


It has been a multi-year effort to evolve PinCompute to enable multiple use cases across Pinterest. We’d like to acknowledge the following teams and individuals who closely worked with us in building, iterating, productizing, and improving PinCompute:

  • ML Platform: Karthik Anantha Padmanabhan, Chia-Wei Chen
  • Workflow Platform: Evan Li, Dinghang Yu
  • Online Systems: Ping Jin, Zhihuang Chen
  • App Foundation: Yen-Wei Liu, Alice Yang
  • Ads Delivery Infra: Huiqing Zhou
  • Traffic Engineering: Scott Beardsley, James Fish, Tian Zhao
  • Observability: Nomy Abbas, Brian Overstreet, Wei Zhu, Kayla Lin
  • Continuous Delivery Platform: Naga Bharath Kumar Mayakuntla, Trent Robbins, Mitch Goodman
  • Platform Security: Cedric Staub, Jeremy Krach
  • TPM — Governance and Platforms: Anthony Suarez, Svetlana Vaz Menezes Pereira

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.

PinCompute: A Kubernetes Backed General Purpose Compute Platform for Pinterest was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Makeathon 2023
1 month ago

Each year, we host Makeathon, our annual internal version of a hackathon, where employees from across the business collaborate for three days to bring their dream passion projects to life. The ideas they pitch have a goal to improve our product, culture, internal processes or a combination of the three. This year, Makeathon was hosted from August 7–August 11. Groups connected from Monday through Wednesday, presentations were shared on Thursday in our Science Fair and we wrapped up the week with our Grand Awards Ceremony on Friday.

Today, we’re going behind the curtains and interviewing two employees who will share more about Makeathon 2023. First up, we’ll interview Chief Hack Doctor, Anirudh Koul (ML Data Science Manager), who will share some insight into the outcomes of Makeathon. Then, we’ll interview Juan Pablo Ramos (Software Engineer) to learn about his first Makeathon experience. Finally, we’ll give a big shout out to the Makeathon winners!

Interview with Anirudh Koul

Hi Chief Hack Doctor, thanks for joining us today! Can you introduce yourself and, most importantly, can you please tell us where you got your nickname?

Thank you for having me. I’m Anirudh Koul, and in my day job, I get to work on AI and Machine Learning at Pinterest. During Makeathon, I have the opportunity to be part of a team of experienced experts called Hack Doctors who are all running around in white lab coats. These experts are volunteers on standby all week to ‘diagnose’ issues and ‘prescribe’ solutions, with a goal to help projects shine. Whether it’s building a pitch, design, video or solving hurdles in data analysis, engineering and AI, Hack Doctors are around to support Pinterest employees during Makeathon. Of course, we use a lot of medical puns!

The team name was inspired by the “Icefall Doctors,” a team of elite sherpas charged with securing a route to Everest every year, allowing mountaineers to pass safely through the maze of deep crevasses and frozen cliffs. This year, 94% of the participating teams utilized the Hack Doctors. Luckily, as I was the first Hack Doctor, Leo Nagata (Principal Technical Program Manager) promoted me to “Chief Hack Doctor” status this year.

How did you get involved in Makeathon planning?

I have been going to hackathons for the last 17 years. I would build something personally intriguing and often turn those projects into my day job, eventually bringing them to life. Every event was a learning experience, and it taught me so many new skills beyond engineering. When I immigrated to North America, I lacked self-confidence and was always the quietest kid. Hackathons immersed me into a new world, pushing me outside of my comfort zone. For example, they compelled me to convey my ideas concisely with clarity to large audiences. I quickly found a love for the art of storytelling. Eventually, I found myself doing speaking engagements such as TEDx and the United Nations that I would never have dreamt of in my wildest dreams. And for these projects, my teammates have even been recognized with international awards, met heads of state, been featured on TV and even received job offers. Hackathons truly changed my career!

The organizing team and I want to help create an environment where others can experience the same. People often have ideas, but they might only have a subset of skills to execute with. Makeathon allows them to discover people, often outside their usual circles, with the skill sets to solve the remaining pieces of the puzzle, turning their brainchild into reality. Discovering this joy of solving tough problems fast and the ability to learn skills rapidly ultimately develops the confidence that carries over into their daily work. Through the Makeathon, we hope people can realize their maximum potential.

Can you share more about the Innovation Cycle at Pinterest?

In most hackathons, while you can have a great experience, unfortunately most of the ideas never progress further after the event ends. At Pinterest, we want to ensure all ideas are heard and given a fair shot at being implemented within our product or culture.

Our Innovation Cycle has three steps. They are:

  1. Idea Factory
  2. Makeathon
  3. Planning

First, we built a company-wide portal called the Idea Factory, where anyone can go to share their ideas throughout the year. Then, other employees across the company vote on ideas. If the idea gets enough votes, it will be expedited for review with our senior leaders.

Next, ideas often require derisking, prototyping and deeper analysis, a perfect task for the Makeathon. Often, participants can be hesitant to take the leap. To solve that, we started a number of initiatives:

  1. Showcase tour: We share success stories from people across the business, not just engineering, who have built projects and brought them to life. Relatable stories make first time participants wonder, “If they could do it, what’s stopping me?” This increases participation.
  2. Skill-building classes: Open to all, we hosted three sessions to teach a range of skills to help rapidly turn ideas into action. Keeping them inclusive, the classes showed that you don’t need to be an engineer to build an app. We even had classes on Generative AI and GPT. We made them comical and personable to drive engagement. These classes broke records with over 1,500 attendees in a single class. Plus, individuals learned skills that they could continue to use in their day jobs.
  3. Hack Doctors: When it was time to start collaborating, we brought in the Hack Doctors who served as the guides during the journey. Even if people were hesitant to participate, all they needed was a little guidance and an ecosystem of support.

Finally, the day after projects are showcased, we have a team that takes these ideas to leaders during the planning season. In 2021, 66% of Makeathon projects got launched or influenced our product roadmap plans. If they weren’t accepted, leaders provided transparent feedback which helps project leads learn key takeaways for future projects.

This was Pinterest’s 11th Makeathon, what is the importance of Makeathon?

Makeathon is the living definition of synthesizing ideation and innovation. It gives employees the freedom to go beyond their day job, focusing their energy on bringing their ideas to life. With people from other skill sets and organizations joining in, it’s amazing to witness how seemingly impossible problems get solved so quickly.

Can you give us some insight into Makeathon 2023? How many projects and participants did we have?

Here are some quick stats:

  • 326 participants from eight countries
  • 198 ideas
  • 85 projects
  • 5728 votes from Pinployees across the business
  • 1500 live attendees in a single skill-building class
  • 2500 total attendees over all classes

Now, let’s hear from JP Ramos on how his first Makeathon went!

Interview with Juan Pablo Ramos

Hi JP, welcome back to the hot seat! We last spoke in your Life at Pinterest interview. Let’s start with another quick intro.

Thanks, happy to be here! For those who didn’t catch the past interview, I’ve been an iOS Engineer in the Client Excellence team, which is part of App Foundations at Pinterest, for over a year now. My day-to-day work consists of making the iOS Pinterest app best in class. This year I participated for the first time in Makeathon with my very own project initiative. Read along to find out how it went.

How was your first Makeathon?

My Makeathon week started with a team meeting to share ideas on how to actually implement the feature we wanted to build. During Makeathon, all Pinployees are encouraged to work on anything they wish. We had to take some time to read more about the part of the app we were planning to work on. After that, it was coding time! We developed a plan and got to work on it. It was really fun getting out of the routine and challenging ourselves with something different. We got three full days to develop our minimum viable product (MVP), for which we also created a promotional video. The video was really good, Spielberg watch out! I wish you all could see what we came up with, who knows, maybe one day you’ll open Pinterest and see something… new!

What were you most surprised about?

What surprised me the most during Makeathon was the creative energy that prevailed across the company. Everyone was really focused on creating amazing tools and features! Another great thing about this week was seeing the collaboration between engineering, sales, IT and basically teams from all disciplines here at Pinterest. By Thursday, teams start sharing their demos. You get to try some projects, watch creative and fun videos and chat with folks on how they came up with their ideas and what steps it took to make them a reality. It was a really fun week!

Will you be participating next year?

Of course, I can’t wait for Makeathon 2024. Next time, I’d like to be part of a cross-discipline team. I’ll definitely be on the lookout for cool opportunities we can build together, get to meet new people and continue making Pinterest an inspiring place!

Let’s celebrate our Makeathon 2023 Winners!

We have seven awards that were granted to project teams, including five awards that mirror our core company values. Check out the individuals who won!

Judges’ Choice Award: David Rojas Camaggi, Travis Chan, Stacy Kelsey, Frances Lee

Crowd Favorite Award: Tom Spratt, Marty Nikolova, Brodie Gullic, Daniel Richardson, Raudha Ahmad, Jon Chen, Beverlyn Law, Peony Chuen, Doruk Korkmaz, Elise Wright, Frances Lee, Mutesi Ntazinda, Anirudh Koul

Put Pinners First Award: Edgar Chaparro, Justin Mangue

Aim for Extraordinary Award: Kritarth Anand, AJ Oxendine, Sarah Tao, Diana Wong, Kirsten Browne, Jonatan Luna, Armando Leal, Matt Beattie, J.J Hu

Create Belonging Award: Madelyn Reyes, Giovanni Propersi, Florian Marcu

Act as One Award: Sarah Pervaiz, Anirudh Koul, Nimalan Bala, Doug Rangel, Rivy Obinomen, Hannah Hester, Leon Arnold, Charlie Gu, Nick Erickson, Akshat Amritkar, J.C. Zhong, Swati Kumar, Feras Alazzeh, Matthew Lawhon

Win or Learn Award: Faye Zhang

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.

Makeathon 2023 was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Bring Your Own Algorithm to Anomaly Detection
1 month, 1 week ago

Charles Wu | Software Engineer; Isabel Tallam | Software Engineer; Kapil Bajaj | Engineering Manager


In this blog, we present a pragmatic way of integrating analytics, written in Python, with our distributed anomaly detection platform, written in Java. The approach here could be generalized to integrate processing done in one language/paradigm into a platform in another language/paradigm.


Warden is the distributed anomaly detection platform at Pinterest. It aims to be fast, scalable, and end-to-end: starting from fetching the data from various data sources to be analyzed, and ending with pushing result notifications to tools like Slack.

Warden started off as a Java Thrift service built around the EGADs open-source library, which contains Java implementations of various time-series anomaly detection algorithms.

The execution flow of one anomaly detection job, defined by one JSON job spec. Each job is load-balanced to a node in the Warden cluster.

Warden has played an important role at Pinterest; for example, it was used to catch spammers. Over time, we have built more features and optimizations into the Warden platform, such as interactive data visualizations, query pagination, and sending customized notification messages. We have also found it useful to have Warden as a separate Thrift service as it gives us more flexibility to scale it by adding or removing nodes in its clusters, to call it via a Thrift client from a variety of places, and to add instrumentations for better monitoring.

What’s the Problem?

Despite the many useful features of the Warden platform, a requirement emerged. As we expanded the use cases of Warden throughout Pinterest, we started to collaborate more and more with data scientists who would like to use Warden to analyze their data. They found the existing selection of anomaly detection algorithms in EGADs to be limiting. While Warden could be extended with more customized algorithms, they would have to be developed in Java. Many data scientists preferred to bring to Warden their own anomaly detection algorithms in Python instead, which has at its disposal a rich set of ML and data analysis libraries.

What’s the Goal?

Functionally, we want to expand Warden such that it can retain the Java algorithms in the EGADs library used by the existing use-cases like spam detection, but it can also support new algorithms developed in Python. The Python algorithms, like the EGADs Java algorithms, would be part of the end-to-end Warden platform, integrated with all of the existing Warden features.

With that in mind, we want to develop a framework to achieve two things:

  1. For our users (mainly Pinterest data scientists) to develop or migrate their own Python algorithms to the Warden platform
  2. For the Warden platform to deploy the Python algorithms and execute them as part of its workflow

In particular, this framework should satisfy all of the following:

  • Easy to get started: users can start implementing their algorithms very quickly
  • Easy to test: users can test-deploy the Python algorithms under development against the Warden platform, with no knowledge of Java, the inner workings of Warden, or any deployment pipelines
  • Easy and safe to deploy the algorithms to all the Warden nodes in a production cluster
  • To optimize for the usability in production cases, as well as to minimize the feedback time for testing, the Python algorithms should be executed synchronously on the input data and ideally with minimum latency overhead

Options We Explored

We thought of experimenting with Jython. However, at the time of development, Jython did not have a stable release that supported Python 3+, and at the moment, all Python programs at Pinterest should generally conform to at least Python 3.8.

We also considered building a RESTful API endpoint in Python. However, having intensive data processing done through API endpoints is not a good use of the API infrastructure at Pinterest, which is generally designed around low-CPU, I/O-bound use-cases.

Additionally, we considered having a Python Thrift service that the Warden Java Thrift service could call, but Thrift services in Python are not fully supported at Pinterest (compared to Java or C++) and have very few precedents. Setting up a separate Thrift service would also require us to address additional complexities (e.g. setting up additional load-balancers) that are not required by the approach we ended up going with.

High-Level Design

The main idea is to move the computation as close to the data as possible. In this case, we package all the Python algorithms into one binary executable (we use PyInstaller to do this), and then distribute that executable to each Warden node, where the data will reside in memory after Warden has fetched it from the databases. (Note: instead of producing a single executable using PyInstaller, you can also experiment with producing a folder instead in order to further optimize latency.)

Each Warden node, after fetching the data, will serialize the data using an agreed-upon protocol (like JSON or Thrift), and pass it to the executable along with the name of the Python algorithm being used. The executable contains the logic to deserialize the data and run it through the specified algorithm; it will then pass the algorithm output in a serialized format back to Warden, which will deserialize the result and continue processing it as usual.
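
The round trip described above can be sketched as follows. This is a minimal illustration that assumes JSON as the agreed-upon protocol and standard input/output as the transport; the registry, the algorithm name `zscore`, and the request fields (`algorithm`, `values`, `params`) are hypothetical and are not Warden's actual contract.

```python
import io
import json
import statistics

# Hypothetical registry mapping algorithm names to callables.
ALGORITHMS = {}


def register(name):
    """Decorator that adds an algorithm to the registry under `name`."""
    def wrap(fn):
        ALGORITHMS[name] = fn
        return fn
    return wrap


@register("zscore")
def zscore(values, threshold=3.0):
    """Toy algorithm: flag indices more than `threshold` std devs from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


def run(stdin, stdout):
    """Read one JSON request, run the named algorithm, write one JSON response."""
    request = json.load(stdin)
    algorithm = ALGORITHMS.get(request["algorithm"])
    if algorithm is None:
        response = {"ok": False, "error": "unknown algorithm: " + request["algorithm"]}
    else:
        result = algorithm(request["values"], **request.get("params", {}))
        response = {"ok": True, "result": result}
    json.dump(response, stdout)


def call(payload):
    """Helper for illustration: run the handler against an in-memory payload."""
    out = io.StringIO()
    run(io.StringIO(json.dumps(payload)), out)
    return json.loads(out.getvalue())
```

In a packaged executable, `run(sys.stdin, sys.stdout)` would serve as the entry point, and the Java side would write the request to the process's standard input and parse the response from its standard output.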

This approach has the benefits of being efficient and reliable. Since all the Python algorithms are packaged and distributed to each node, each node can execute these algorithms locally instead of via a network call each time. This enables us to avoid network latency and network failures.

While the executable being distributed to each node contains all the Python algorithms, each node can apply an algorithm to only a subset of the data, if processing the entire data exceeds the memory or CPU resources of that node. Of course, there would then need to be additional logic that distributes the data processing to each node and assembles the results from each node.

Production Deployment

Warden production cluster

To deploy to production, we build an executable with all of the Python algorithms and put that executable into an access area within the company, like a Warden-specific S3 bucket. The Warden service instance on each node will contain the logic to pull the executable from S3 if it’s not found at a pre-specified local file path. (Note: instead of programming this, the build system for your service could also support something like this natively, e.g. Bazel’s http_file functionality.)
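
The "pull the executable if it is missing" step might look like the sketch below. It is an assumption-laden illustration: `download` is a hypothetical callable that writes the executable to the path it is given (for example, a thin wrapper around an S3 client), and the temporary-file-then-rename step is a design choice of this sketch, not something the original describes.

```python
import os
import stat
import tempfile


def ensure_executable(local_path, download):
    """Return `local_path`, calling `download(tmp_path)` first if the file is absent.

    `download` is a hypothetical callable supplied by the caller.
    """
    if os.path.exists(local_path):
        return local_path
    directory = os.path.dirname(local_path) or "."
    os.makedirs(directory, exist_ok=True)
    # Download to a temporary file in the same directory, then rename it into
    # place, so a crash mid-download never leaves a partial file at `local_path`.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        download(tmp_path)
        # Mark the file as executable for its owner.
        os.chmod(tmp_path, os.stat(tmp_path).st_mode | stat.S_IXUSR)
        os.replace(tmp_path, local_path)
    finally:
        # Clean up only if the rename did not happen.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return local_path
```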

To make a new deployment to production, the operator will build and push the executable to S3, and then do a rolling-restart of all the Warden nodes in the production cluster. We have ideas to further automate this, so that the executables are continuously built and deployed as new algorithms are added.

Test Deployment

When users want to test their algorithm, they would run a script that would build their algorithm into an executable and copy that executable into the running service container on each node of the Warden test cluster. Afterwards, from places like Jupyter notebook, users could send a job to the Warden test cluster (via a Thrift call) to use the test algorithm that they have just copied over.

We have invested time to make this process as simple as possible, and have made calling the script an essentially one-stop process for the user to deploy their algorithms to the test Warden cluster. No knowledge of Java, the inner workings of Warden, or any deployment pipelines is required.


Interfaces

On the note of simplicity, another way that we have tried to make adding algorithms easy for our users is by organizing algorithms through clearly defined and documented interfaces.

Each Python algorithm will implement an interface (or, more accurately in Python, extend an abstract base class) that defines a specific set of inputs and outputs for the algorithm. All the users have to do is to implement the interface, and the Warden platform will have the logic to connect this algorithm with the rest of the platform.

Below is a very simple example of an interface for anomaly detection:
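
The original listing is not reproduced in this copy. As a stand-in, here is a minimal sketch of what such an interface could look like. The class and method names (`AnomalyDetector`, `detect`, `Anomaly`) and the example implementation are assumptions for illustration, not Warden's actual definitions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Anomaly:
    """Hypothetical output record: index of the anomalous point and its value."""
    index: int
    value: float


class AnomalyDetector(ABC):
    """Hypothetical interface: a series of values in, a list of anomalies out."""

    @abstractmethod
    def detect(self, values: List[float]) -> List[Anomaly]:
        ...


class ThresholdDetector(AnomalyDetector):
    """Example implementation: flag every point above a fixed upper bound."""

    def __init__(self, upper: float):
        self.upper = upper

    def detect(self, values: List[float]) -> List[Anomaly]:
        return [Anomaly(i, v) for i, v in enumerate(values) if v > self.upper]
```

Because the base class is abstract, instantiating `AnomalyDetector` directly raises a `TypeError`, which nudges users toward implementing the full contract before their algorithm can be registered.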

The typical workflow for the users to create an algorithm is to:

  1. Select and implement an interface
  2. Test deploy their algorithm through the one-stop process as described in Test Deployment
  3. Submit a PR for their algorithm code

Once the PR has been approved and merged, the algorithms will be deployed to production.

In practice, we try to define interfaces broadly enough that users who wish to develop or migrate their algorithms to Warden can usually find an interface that their algorithm fits under; however, if none fit, then users would have to request to have a new interface supported by the Warden team.

Interfaces give us a way of organizing the algorithms as well as the serialization logic in the Warden platform. For each interface, we can implement the serialization logic in the Warden platform just once (to support the passing of data between the Java platform and the executable), and it would apply to all the algorithms under that interface.

Additionally, and perhaps more importantly, interfaces provide us a way of designing solutions: when we start thinking about what new functionalities the platform should support via its Python algorithms, we can start by specifying the set of inputs and outputs we need. From there, we can work backwards and see how we get those inputs and where we pass those outputs.

For example, when we want to have Python algorithms for root-cause analysis in the Warden platform, we can start by defining an interface similar to the following:

Where TimeSeries could be defined as:
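
Neither listing is reproduced in this copy. The sketch below shows one way the pair could be written, working backwards from inputs and outputs as the text suggests. Every name here (`TimeSeries` fields, `RootCauseAnalyzer`, `rank_segments`, `LargestDropAnalyzer`) is hypothetical and chosen only for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TimeSeries:
    """Hypothetical container: parallel lists of epoch timestamps and values."""
    timestamps: List[int]
    values: List[float]


class RootCauseAnalyzer(ABC):
    """Hypothetical interface: one metric plus its per-segment breakdowns in,
    a ranked list of segment names out."""

    @abstractmethod
    def rank_segments(self, metric: TimeSeries, segments: Dict[str, TimeSeries]) -> List[str]:
        ...


class LargestDropAnalyzer(RootCauseAnalyzer):
    """Toy implementation: rank segments by first-to-last change, most negative first."""

    def rank_segments(self, metric, segments):
        def change(ts):
            return ts.values[-1] - ts.values[0]
        return sorted(segments, key=lambda name: change(segments[name]))
```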

For you, the reader, it would be a fun and useful exercise to think about whether the analytic problems you are working on could be abstracted down to broad categories of interfaces.


We are currently expanding Bring Your Own Algorithm throughout Pinterest.

We are migrating the algorithms used in several existing Jupyter reports (used in metrics reviews) to the Warden platform through the Bring Your Own Algorithm framework. This enables better, more standardized code review and version control, since the algorithms will actually be checked into a Python repo instead of residing in the Jupyter notebooks. This also leads to easier collaboration on future enhancements, as once the users migrate their use-case to the Warden platform, they can easily switch within a library of Warden algorithms and take advantage of various Warden features (e.g. pagination, and customized notifications/alerts).

Bring Your Own Algorithm has also enabled Warden to support algorithms based on a variety of Python ML and data science libraries. For instance, we have added an algorithm using Prophet, an open-source, time-series forecasting library from Meta. This has enabled us to perform anomaly detection with more sophisticated analytics, including tunable uncertainty intervals, and take into account seasonalities and holiday effects. We are using this algorithm to capture meaningful anomalies in Pinner metrics that went unnoticed with simpler statistical methods.

Additionally, as alluded to in the Interfaces section above, Bring Your Own Algorithm is serving as the foundation for adding root-cause analysis capabilities to Warden, as we set up the workflow and Python interface that would enable data scientists to plug in their root-cause analysis algorithms. This separation of expertise — us focusing on developing the platform, and the data scientists focusing on the algorithms and statistics — will undoubtedly facilitate more collaborations on exciting problems into the future.


In summary, we have presented here an approach to embedding analytics done in one language within a platform done in another, as well as an interface-driven approach to algorithm and functionality development. We hope you can take the approach outlined here and tailor it to your own analytic needs.


We would like to extend our sincere gratitude to our data scientist partners, who have always been enthusiastic in using Warden to solve their problems, and who have always been eager to contribute their statistical expertise to Warden.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.

Bring Your Own Algorithm to Anomaly Detection was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Pinternship Wrap-Up: Summer 2023
1 month, 3 weeks ago

Each summer, Pinterest welcomes Software Engineering Pinterns who spend 12 weeks with us creating impact within our product and teams. While Pinterns are fully immersed in their teams throughout the summer, they also get to attend exciting activities and events hosted by the University Recruiting team and within the company.

Here’s a quick recap from this summer:

  • Social events were a hit with boba tea making, creating your own vision board, chocolate making and a virtual escape room.
  • Our University Recruiting Team hosted professional development workshops to drive skill-building and discuss topics like imposter syndrome, time management and productivity.
  • Pinterns attended executive coffee chats to learn more about different functions of the business from those leading our teams. They had time with Andréa Mallard (Chief Marketing Officer), Jeremy King (Chief Technology Officer), Nichole Barnes Marshall (Chief Diversity, Equity and Inclusion Officer) and Wanji Walcott (Chief Legal Officer).
  • We celebrated National Intern Day on July 27th and introduced six of our Pinterns in this video.
  • Pinterns participated in Makeathon, our internal version of a hackathon, where employees collaborate and pitch passion projects.

We sat down with Sierra Lee, Software Engineering Pintern, to learn more about her second summer interning with the Pinterest team.

Hi there! Thanks for joining us, Sierra. Let’s start by getting to know you.

Hi! I’m Sierra Lee, and I go to the University of Washington in Seattle. This summer, I am interning on the Ads Measurement Interfaces team under the Monetization org. I was also a Pintern last summer, and I interned on the Ads Measurement Ingestion team, who partners closely with my current team. Both summers, my work has revolved around increasing conversion visibility for the businesses that advertise on our platform. In essence, my goal is to show them that they are missing out if they are not advertising on Pinterest! The year before last, I started my Pinterest journey participating in the Engage Program, a summer professional development program.

What was your favorite part about working for the Pinterest Engineering team?

From an engineering perspective, I have really enjoyed that Pinterns are given real projects that often make it into production in some shape or form. I feel like I have been treated like a full-time Software Engineer at Pinterest, both in the work that I’ve been able to do and in how others have approached me. On my team, I have been able to work in many different codebases and in tons of different languages. I have always found more joy in doing full stack work, so that is a huge plus for me. In being exposed to many different parts of Pinterest code, I feel I have learned so much more than I expected to. Things move pretty fast at Pinterest, which makes each day exciting and new.

In addition to the work I do, Pinterest’s culture has been inspiring to me. I’ve reached out to many different teams, and every time, they are more than willing to go the extra mile to assist me despite being incredibly busy themselves. The size of Pinterest as a company is also a plus. It feels big enough that I can reap the benefits of a big company and learn best engineering practices, but still small enough that I feel I can create an immense impact and approach anyone across the business. In fact, many people in the Seattle office remembered me from last summer and welcomed me back, which was very sweet. Both summers at Pinterest, I have grown professionally through working with my mentors and managers and truly appreciated their mentorship and guidance. I feel that I’ve gotten the perfect amount of support and independence as a Pintern.

As you mentioned above, all Pinterns have a manager and a mentor. Can you give us some insight into what it was like working with them?

My mentor and manager have been amazing! My mentor has been the person I go to for any onboarding and project related questions I have. It’s my job to communicate to him everything that I’ve been up to, and he reviews code and documents I create as well as helps me out if any unexpected issues come up. While I feel that I have ownership of my project, I also feel like he and I are partnering to make sure everything with my project and internship are successful, which strikes a great balance. I also get to work closely with my manager. I sync regularly with her and have approached her for questions related to a greater scope and for things that I work on that may affect a larger amount of people in our org. Something that has stood out to me is how my mentor and manager are advocating for me and my success. They make sure to help me increase visibility of my work and find other ways to contribute to the team, increasing my impact and scope. Even now, my mentor from last year still checks in with me, supporting me throughout my new internship and rooting for my success. In addition to growing professionally, I’ve had the opportunity to get to know my mentor and manager as people, which I think goes back to how welcoming Pinterest’s culture is.

What was your biggest accomplishment during your internship? What was one challenge you faced?

I think my biggest accomplishment during this internship is largely thanks to my mentor and manager. This year, I was able to work on a couple different projects: one of them a completely frontend project and the other is a backend project. Everything from the coding language to the deployment process for these two projects was vastly different. Being able to successfully complete two distinct projects was my biggest accomplishment because I learned so much. I did encounter some challenges. There was a period of time where I was working on both projects simultaneously, and at first, I felt really overwhelmed juggling them. However, I feel that I really grew from that experience and have grown more comfortable handling many different tasks at once, which I think is important to becoming a really great developer.

At Pinterest, we have five core values: Put Pinners First, Aim for Extraordinary, Create Belonging, Act as One and Win or Learn. Which one resonates with you the most?

I would say that the core value that resonates with me the most is “Create Belonging.” Outside of work, I am part of a few different student organizations that are very people-oriented. I also recently studied abroad in South Korea, where I met people from all over the world and gained an appreciation for learning about other people’s journeys and perspectives. I really enjoy being around others, and I feel very grateful that, for the majority of my life, I have been surrounded by people who have both given me the space to grow and let me feel like I could be myself. I hope to create that sort of environment for others as well. I’ve learned that, for myself, this kind of sentiment extends itself to the workplace: when it comes to a job, the people and culture really matter to me. For me, I’ve found Pinterest to be a great fit culture-wise, and I definitely feel like I belong here!

What’s one thing you’d want others to know about working at Pinterest?

Working at Pinterest is both exactly what you’d expect and also not. For one, the culture is great and what I think most people expect: people are friendly and excited to work on the product. On the other hand, I think a lot of people may get the impression that there’s not that much engineering work to do at Pinterest, but there’s actually a ton of interesting problems to solve, including many explorations into new parts of ML and AI.

Applications are now open for our 2024 Software Engineering Internship Program. Discover open roles and apply on our Careers page. To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit Pinterest Labs.

Pinternship Wrap-Up: Summer 2023 was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Lessons from debugging a tricky direct memory leak
1 month, 4 weeks ago

Sanchay Javeria | Software Engineer, Ads Data Infrastructure

To support metrics reporting for ads from external advertisers and real-time ad budget calculations at Pinterest, we run streaming pipelines using Apache Flink. These jobs have guaranteed an overall 99th percentile availability to our users; however, every once in a while some tasks get hit with nasty direct out-of-memory (OOM) errors on multiple operators that look something like this:

As is the case with most failures in a distributed system, this often leads to cascading failures elsewhere leaving a trail of red herrings in its wake. Pinterest’s Flink platform supports automatic job retries when task failures exceed a configurable threshold, so due to the infrequency of these exceptions we generally let automatic restarts from the most recent checkpoint handle fault tolerance. However, towards the end of last year, we initiated the consolidation of our clusters and readjusted memory allocation across all our jobs to achieve better resource efficiency. As an unintended consequence, we began receiving pages for these direct memory OOMs early this year, leading to outages and downstream service impact. It became increasingly evident that this issue needed to be looked at. In this post, we will outline our diagnostic process and share insights that can be generalized to debugging any large-scale distributed system where judiciously placed print statements alone are not sufficient to debug issues.

The first piece of the puzzle was to separate the symptoms from the cause. During this incident, we noticed high back pressure on several operators along with task failures accompanied by the stack trace above. At first glance, it also looked like the container ran out of memory while allocating direct memory for network buffers used for channel I/O. This led to our first set of action items — simulating task failures and high back pressure on a dev instance of our job while monitoring its effects on direct memory consumption to establish a causal relationship between the two events.

But first, we needed a band-aid solution to prevent frequent pages to the on-call engineer while we addressed the underlying root cause of the issue. To do this, it was helpful to revisit Flink’s memory model.

Figure 1: Flink’s memory model

As seen in Figure 1, Flink’s direct memory configuration can be split into three parts: framework off-heap, task off-heap and network memory. Framework off-heap memory is reserved for Flink’s internal operations and data structures. Unsure whether the OOM was caused by an application-level memory leak, we increased both task off-heap and network memory from 2G to 5G. This was intentionally generous to buy us enough time to fix the issue.

Simulating back pressure

Since our Flink job has just one output sink operator, simulating back pressure was as easy as adding a long pause in the main thread using a Thread.sleep(). Since the sink operator won’t process any input records, the output buffers of all upstream operators will fill up quickly causing significant back pressure.
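
The effect of a stalled sink on upstream buffers can be shown with a toy model. This is a plain Python simulation, not Flink code: the single bounded buffer, the tick loop, and the function name `simulate` are all assumptions made for illustration.

```python
from collections import deque


def simulate(steps, buffer_size, sink_stalled):
    """Simulate source -> bounded buffer -> sink for `steps` ticks.

    Each tick the sink drains one record unless it is stalled, and then the
    source tries to emit one record into the buffer. Returns the number of
    ticks on which the source was blocked because the buffer was full.
    """
    buffer = deque()
    blocked_ticks = 0
    for _ in range(steps):
        if not sink_stalled and buffer:
            buffer.popleft()
        if len(buffer) < buffer_size:
            buffer.append(object())
        else:
            blocked_ticks += 1
    return blocked_ticks
```

With a healthy sink the source is never blocked. With a stalled sink it is blocked on every tick once the buffer fills, which is the back pressure that propagates to upstream operators.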

Figure 2: Back pressure on various operators in the application, red markers signal back pressure at a given time.

Figure 2 captures the back pressured state of the application after some time across the various operators. This invariably led to direct memory OOMs again on the back pressured nodes which led to repeated task failures.

Simulating task retries

At Pinterest, Flink applications are submitted to YARN’s ResourceManager, which allocates job tasks to containers on machines managed by YARN NodeManagers. To simulate task retries, we shut down a random sample of containers using yarn container -signal [container-id] GRACEFUL_SHUTDOWN while monitoring the application’s direct memory consumption.

Figure 3: Direct memory consumption pattern upon task failures

The graph in Figure 3 illustrates the impact of simulating task failures on direct memory consumption. It shows a noticeable increase in memory consumption precisely at the time of issuing shutdowns. This eventually led to OOM errors, and when a quorum of containers on the same operator was shut down, it caused back pressure on the upstream nodes. The staircase pattern in the graph is particularly intriguing because it is a telltale sign of a memory leak and suggests that somewhere in the code allocated direct memory was not released properly.
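
A staircase like this can also be checked mechanically. The sketch below is a simple heuristic, not the method used in the investigation: it assumes you have a list of memory samples and the sample indices at which restarts were triggered, and the function name and `min_step` parameter are hypothetical.

```python
def is_staircase(samples, restart_indices, min_step):
    """Return True if the mean memory level of each plateau (the stretch between
    consecutive restarts) is at least `min_step` above the previous plateau.

    Assumes `restart_indices` is sorted, has no duplicates, and lies strictly
    inside the sample range, so that every plateau is non-empty.
    """
    if not restart_indices:
        return False  # no restarts means no steps to compare
    boundaries = [0] + list(restart_indices) + [len(samples)]
    levels = []
    for start, end in zip(boundaries, boundaries[1:]):
        window = samples[start:end]
        levels.append(sum(window) / len(window))
    return all(b - a >= min_step for a, b in zip(levels, levels[1:]))
```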

To narrow down the scope of the problem, we needed to determine whether it was caused by a platform bug or an application logic issue. To do this, we repeated the task retry simulation on a separate application not running our job logic. We aimed to observe if a similar pattern in direct memory consumption would emerge, indicating the possibility of a platform bug.

Figure 4: Direct memory consumption pattern upon task failures in a separate job

As seen in Figure 4, we didn’t observe any noticeable spikes in direct memory consumption in a different Flink application. This served as compelling evidence indicating that the memory leak originated from a bug in our application code.

Debugging application code

Our Flink application consists of several thousand lines of code. A helpful approach for debugging such a large codebase is to use the “peel the onion” approach, wherein we break down the code into smaller components with the goal of reproducing the issue. A very simplified view of our application looks something like this:

Figure 5: High level operator overview of our Flink application

The first layer reads different topics from Kafka, deserializes them into internal objects and feeds into the second layer which joins the output and performs some transformations. This layer also makes some RPC calls to an external KVStore for downstream services and finally feeds into the third layer, which performs more transformations and outputs the event to Druid. The three layers fence the group of operators that use direct memory, and we can now individually remove certain operators and try to reproduce the issue by manually simulating task retries. This way we can isolate the culprit operator and apply a fix.

Removing Layer 2 and Layer 3 operators

In Figure 5, certain operators in Layer 2 make RPC calls to an external KVStore with a very large payload. We were suspicious that these large objects could result in an OOM error if the Thrift’s DirectByteBuffer failed to reserve sufficient direct memory for network I/O.

Layer 3 also utilizes off-heap memory to store currency exchange rates for various countries by downloading this information from an external datastore. This computation was previously done on-heap but was putting tremendous pressure on the heap memory. The file storing the exchange rates was periodically downloaded from the datastore, parsed to extract useful information, transformed into a hashmap and finally swapped with the older (immutable) hashmap. The older hashmap was then moved to the heap’s old gen, which meant that it wasn’t freed until the next full GC trigger. Due to the data size and full-GC infrequency of the online application, we instead moved to an off-heap solution using ChronicleMap. However, a bug in freeing up this memory could very well result in OOMs over time. Hence, we started by removing both sets of operators and simulating arbitrary task retries in the remaining operators while monitoring the impact on direct memory consumption.

Figure 6: Direct memory consumption pattern upon task failures on remaining operators

As expected, we did not notice any blips in direct memory consumption. We can now narrow down our hunt for the source of the memory leak to the remaining operators.

Removing Layer 3 operators

Next, we removed the Layer 3 operators that utilize ChronicleMap for application logic and repeated the same exercise of simulating task retries.

Figure 7: Direct memory consumption pattern upon task failures on remaining operators

As shown in Figure 7, we noticed a minor blip but no definitive staircase pattern to conclude a memory leak in the remaining operators. This was an interesting finding because as opposed to our initial hunch, we could not find evidence of a memory leak in the operators that talk to the external KVStore via RPC calls.

Removing Layer 2 operators

Next, we isolated the Layer 3 operators by removing the Layer 2 operators that also utilize direct memory.

Figure 8: Direct memory consumption pattern upon task failures on remaining operators

Aha! We’ve managed to reproduce the issue in our leaner application code which looks eerily similar to the pattern observed in Figure 3. This was conclusive evidence that the direct memory leak originated from the application code in Layer 3 utilizing direct memory.

The fix

Upon investigating the problematic operator, we found that the reference to the ChronicleMap was being removed but the associated memory wasn’t freed, resulting in a memory leak. This memory was not released until the next full GC, which is especially problematic in online services like ours that are tuned to make full garbage collections rare.

To understand this better, it helps to look at Flink’s task lifecycle and its internal restart mechanism for terminations caused by task failures. In this case the JVM does not crash; rather, Flink execution jumps to the close() method on the affected operators. After the restart, Flink then calls the open() method defined in the operator code. If the logic references an object (like the ChronicleMap) that lives outside of this lifecycle, the code might be leaking memory inadvertently.
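
The shape of this bug can be reproduced in a few lines. The sketch below is a toy in Python, not Flink or ChronicleMap code: `FakeOffHeap` stands in for an off-heap allocator, and the two operator classes and the `restart` helper are hypothetical. It only mirrors the open()/close() lifecycle described above.

```python
class FakeOffHeap:
    """Stand-in for an off-heap allocator that tracks outstanding bytes."""

    def __init__(self):
        self.allocated = 0

    def allocate(self, size):
        self.allocated += size
        return size  # the "handle" is simply the size that was reserved

    def free(self, handle):
        self.allocated -= handle


class LeakyOperator:
    """close() drops the reference but never releases the memory."""

    def __init__(self, heap):
        self.heap = heap
        self.handle = None

    def open(self):
        self.handle = self.heap.allocate(100)

    def close(self):
        self.handle = None  # reference removed, memory still reserved


class FixedOperator(LeakyOperator):
    """close() releases the memory before dropping the reference."""

    def close(self):
        if self.handle is not None:
            self.heap.free(self.handle)
            self.handle = None


def restart(operator, times):
    """Open the operator, then simulate `times` task restarts (close, then open)."""
    operator.open()
    for _ in range(times):
        operator.close()
        operator.open()
```

After three simulated restarts the leaky operator holds four allocations while the fixed one holds a single allocation, which is the staircase versus the flat line.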

After fixing the leak, we simulated task retries once again and monitored its impact on direct memory consumption.

Figure 9: Direct memory consumption pattern upon task failures after patching the leak

As observed in Figure 9, we noticed a smooth memory consumption pattern as opposed to the elevating staircase pattern observed in Figure 3.

While the fix was specific to our application logic, the overarching takeaway is the procedure for finding the root cause. With the example of this war story, we just went over the nine debugging tenets as described by David J. Agans in his book “Debugging: The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems”:

  1. Understand the System
  2. Make it Fail (and make it fail fast)
  3. Quit Thinking and Look
  4. Divide and Conquer
  5. Change One Thing at a Time
  6. Keep an Audit Trail
  7. Check the Plug
  8. Get a Fresh View
  9. If You Didn’t Fix it, It Ain’t Fixed


We would like to thank Divye Kapoor from Pinterest’s Stream Processing Platform for helping with all the platform related questions and issues, and Naehee Kim, Filip Jaros, Insu Lee, Kuang He and Weihong Wang for supporting this effort and putting up with the numerous alerts while we resolved this issue.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.

Lessons from debugging a tricky direct memory leak was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Training Foundation Improvements for Closeup Recommendation Ranker
2 months ago

Fan Jiang | Software Engineer, Closeup Candidate Retrieval; Liyao Lu | Software Engineer, Closeup Ranking & Blending; Laksh Bhasin | Software Engineer, Core ML Foundations; Chen Yang | Software Engineer, Core ML Foundations; Shivin Thukral | Software Engineer, Closeup Ranking & Blending; Travis Ebesu | Software Engineer, Closeup Ranking & Blending; Kent Jiang | Software Engineer, Core Serving Infra; Yan Sun | Engineering Manager, Closeup Ranking & Blending; Huizhong Duan | Engineering Manager, Closeup Relevance


Pinterest’s mission is to bring everyone the inspiration to create a life they love. The closeup team helps with this mission by providing a feed of relevant and context-and-user-aware recommendations when a Pinner closes up on any Pin.

The recommendations are powered by innovative and cutting-edge machine learning technologies. We have published a detailed blog post of its modeling architecture. While adopting the newest architectures improves a model’s capabilities, building a solid training foundation stabilizes the model and further up-levels the model’s potential.

Training foundations cover a lot of aspects, from training preparation (training data logging, feature freshness, sampling strategies, hyperparameter tuning, etc), to training efficiency optimization (distributed training, model refreshes, GPU training, etc), to post training validation (offline replay, etc).

In this post, we are going to take a deeper look into three areas for the closeup ranking model, specifically:

  • Training data logging and generation
  • Various sampling configurations and learnings
  • Periodic and automatic model refreshes with our in-house auto-retraining framework

Logging Foundation and Improvements

Hybrid logging

The closeup surface handles a large number of Pin impressions and engagements. While it is blessed with an abundance of data for training, it is also crucial to maintain a high data storage efficiency. Therefore, we adopted a hybrid data logging approach, with which the data is logged through both the backend service and the frontend clients. The process is captured in Figure 1.

The frontend logging system tracks the Pins that have been impressed by the Pinner and keeps a low percentage of the impressions and all positive engagements. For the sampled Pins, it reads the context and candidate cache, which are populated by the backend service, and calls the deduping service for further pruning. Then the frontend logging service calls the inference service to log the Pins with the full set of features. At the end of this pipeline, the data with training features are ingested in the database.
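
The "keep all positive engagements, keep a low percentage of impressions" rule can be sketched as below. The record shape (a dict with an `engaged` flag), the function name, and the rate parameter are assumptions for illustration; the actual frontend logging logic is not shown in the original.

```python
import random


def sample_for_logging(pins, impression_rate, rng):
    """Keep every engaged Pin and a random fraction of impression-only Pins.

    `pins` is a list of dicts with a boolean "engaged" field (hypothetical shape).
    `rng` is a `random.Random` instance so that runs can be made reproducible.
    """
    kept = []
    for pin in pins:
        if pin["engaged"] or rng.random() < impression_rate:
            kept.append(pin)
    return kept
```

Passing the random generator in explicitly keeps the function deterministic under a fixed seed, which makes the sampling rate easy to verify offline.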

Figure 1: hybrid logging for features

On a daily basis, the features are joined with the labels to produce the final training dataset. Last year, we migrated the dataset from Thrift format to tabular format, which largely reduced the data size and improved development velocity due to its better data inspection capability.

By leveraging the hybrid logging approach, the pipeline avoids logging data without impressions, which drastically reduces the logging volume needed to achieve the same level of training data efficiency.

Randomized Traffic

We keep a small amount of traffic to have completely randomly ordered candidates. For this stream of traffic, we log all the candidates that have been served to Pinners instead of just the impressed candidates. The randomized training data has proven to be helpful for multiple purposes, including offline replay experimentation, calibration, and model evaluations.

Sampling Foundation and Improvements

Undoubtedly, training data is one of the most important components of model training. What the model learns largely depends on what data the model has seen. The bias of the model can be caused by the biases from the training data. And the training data can be measured by different segmentations, including positive and negative label ratios, content type distribution, user / context distributions, etc.

As covered in the data logging pipeline section, initially we only had a simple sampling strategy, which was to downsample the impressed candidates and keep all candidates with positive actions. Essentially, we were under-utilizing the opportunities contained in the data. Therefore we constructed a sampling job as part of the training data generation pipeline.

Figure 2: Current training pipeline. Joiner is a PySpark job that joins features and labels to form a complete machine learning instance in the full training data. Tabularizer is another PySpark job that converts the full training data to TabularML format. Sampler reads in the full training data and outputs the sampled training data for the downstream PyTorch trainer job to consume.

The sampler is a PySpark job that reads petabytes of full training data, applies sampling logic, and outputs hundreds of terabytes of sampled training data. Users pass their customized sampling logic to the sampler via sampling configs, and a downstream trainer then consumes the sampled training data. Thanks to Pinterest’s Ezflow framework, datasets generated in the workflow are managed and cached by Ezflow’s lineage tracking mechanism, so multiple training jobs that adopt the same sampling logic can reuse the sampled training data without rerunning the sampler.
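To make the idea of config-driven sampling concrete, here is a minimal sketch in plain Python rather than PySpark. The `SamplingConfig` fields and the `sample_rows` helper are hypothetical names chosen for illustration; the actual config schema is not described in this post.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    """Hypothetical sampling config: one keep rate per label value."""
    positive_keep_rate: float = 1.0
    negative_keep_rate: float = 0.1
    seed: int = 0


def sample_rows(rows, config):
    """Yield the rows kept under the config; each row is a dict with a 'label' key."""
    rng = random.Random(config.seed)  # seeded so the same config gives the same sample
    for row in rows:
        if row["label"] == 1:
            rate = config.positive_keep_rate
        else:
            rate = config.negative_keep_rate
        if rng.random() < rate:
            yield row
```

Because the config is an immutable, hashable value, it could also serve as a cache key, loosely mirroring how sampled datasets are reused when two training jobs share the same sampling logic.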

The overall goal for sampling foundation is to:

  • Increase topline engagement
  • Enhance content safety

At the current stage we have experimented with several sampling configurations, and the results shown below come from our online A/B experiment on the Closeup surface. Even though users are triggered in the experiment only when they visit the Closeup surface, the engagement impact is not limited to that surface. At the site-wide level, we see broad-based engagement metric gains across different actions and content types.

Table 1: Site-wide Engagement Gain in A/B experiments

Future Work on Sampling

In addition to exploring more sophisticated sampling logic, we have a big opportunity in training efficiency. The current sampling logic is implemented as a standalone PySpark batch processing job, and we have an ongoing effort to integrate sampling logic into the Ray dataloader. We believe this will significantly speed up the training workflow runtime, as sampling and training can run in parallel. In addition, for two different but similar sampling configurations, we will no longer have to generate two different sampled training datasets, saving hundreds of terabytes of storage cost for each training workflow.

Figure 3: Future Training Pipeline

Auto-Retraining Framework


A deep neural network model’s performance can degrade as time goes on. For example, a model may be trained to give accurate outputs for specific input feature distributions, but these distributions can drift over time. More broadly, seasonal factors and user trends can change what users find useful and inspiring on Pinterest.

To keep our models fresh and avoid degradation, teams across Pinterest make use of the Auto-Retraining Framework (ARF), which allows for the automated training and re-training of models on a specified cadence.

ARF includes two main components:

  1. An offline Airflow workflow that trains, validates, and registers models for use in serving. Models must pass both an absolute validation check, where their evaluation metrics must exceed a threshold, and a relative validation check, where they must not regress on the previous production model’s metrics.
  2. An online model deploy Spinnaker pipeline that releases new model versions in serving, with validation on online metrics such as model latencies, resource usages, and predicted scores.
Figure 4: Components involved in ARF. Offline components (left) are run within an Airflow DAG. Model artifacts are registered in an MLflow run (center), which a Spinnaker deploy pipeline (right) then reads from and deploys to all users.

With ARF, teams across Pinterest have a validated infrastructure to train on Pinners’ latest interactions, thereby continually improving our ranking models.
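The two offline validation checks can be sketched as a single gate. This is an illustrative sketch only: the function name, the metric dictionaries, the `tolerance` parameter, and the assumption that higher metric values are better are all hypothetical rather than taken from ARF itself.

```python
def passes_validation(candidate, production, thresholds, tolerance=0.0):
    """Return True only if both the absolute and relative checks pass.

    candidate / production: dicts of metric name -> value (higher is better).
    thresholds: dict of metric name -> minimum the candidate must exceed.
    tolerance: allowed slack before a drop counts as a regression.
    """
    # Absolute check: every metric must exceed its threshold.
    absolute_ok = all(
        candidate[name] > minimum for name, minimum in thresholds.items()
    )
    # Relative check: no metric may regress against the production model.
    relative_ok = all(
        candidate[name] >= production[name] - tolerance for name in production
    )
    return absolute_ok and relative_ok
```

A candidate that fails either check would simply not be registered for serving, so the previous production model stays in place.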

Extending ARF for Closeup Ranking Model

Hypothesis on Model Refreshes

We conducted learning experiments and refreshed the closeup ranking model on daily, tri-daily, weekly, and bi-weekly cadences. We consistently found that model refreshes bring better performance across all refresh cadences. Though a higher refresh cadence yields better results, the maintenance overhead is not trivial. Since the closeup model utilizes knowledge distillation, a bad teacher model can propagate faster and make investigation harder. Therefore, weekly retraining strikes a good balance between model freshness and maintenance.

It is not trivial to set up auto-retrain experiments without the support of ARF. We would need to separately maintain the data generation pipeline and model training flow, manually deploy the model, and manually update the holdout experiment. ARF provides a configuration interface as the single entrypoint and contract between the client and the auto-retrain process. The data, model, deployment, and maintenance are handled with minimal human intervention. Onboarding to ARF greatly improved velocity, from 3+ hours to 30 minutes.

Customized Components

The closeup ranking model was one of the earliest adopters of ARF, and its use case requires support for customized components.

The closeup ranking model leverages knowledge distillation, where the previous production model acts as the teacher model and the model scores are used in the loss function. As part of the data processing, we utilize batch inference to get model scores from the previous version of the production model and enrich the training dataset.
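The post does not spell out the exact loss, so the following is only a sketch of one common formulation: a weighted sum of a cross-entropy term against the observed label and a cross-entropy term against the teacher's score. The `alpha` weight and the function names are assumptions.

```python
import math


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for one prediction; target may be a soft value in [0, 1]."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def distillation_loss(student_prob, label, teacher_prob, alpha=0.5):
    """Blend the hard-label loss with a soft loss against the teacher's score."""
    hard = bce(student_prob, label)          # fit the observed engagement label
    soft = bce(student_prob, teacher_prob)   # stay close to the teacher's prediction
    return (1.0 - alpha) * hard + alpha * soft
```

In this framing, the batch inference step is what supplies `teacher_prob` for every training example before training starts.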

We also calibrate the scores of our ranking model, and the training pipeline produces both an uncalibrated and calibrated model, wherein we use the calibrated model as the serving production model, and the uncalibrated model as the teacher model for knowledge distillation. Whenever the ranking model is retrained, it is important that both these models are updated simultaneously. To allow this, ARF infrastructure extends support for the multi-model case so that both the calibrated and uncalibrated models are trained and deployed in sync.

Performance Validation

We validate the auto-retraining quality at two places throughout the pipeline. The first place is the data validation, where we examine the features and labels and make sure there is no large shift in the distribution. Once we make sure the training data is valid, we also check the offline model evaluation metrics to make sure that the model performance is not degrading.
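One common way to flag a large distribution shift is the Population Stability Index (PSI) between a reference histogram and the latest one. The sketch below is an assumption about how such a check could look, not the check ARF actually runs, and the 0.2 threshold is only a widely used rule of thumb.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two histograms over the same bins."""
    total_expected = sum(expected_counts)
    total_actual = sum(actual_counts)
    score = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        p = max(expected / total_expected, eps)  # reference share of this bin
        q = max(actual / total_actual, eps)      # current share of this bin
        score += (q - p) * math.log(q / p)
    return score


def has_large_shift(expected_counts, actual_counts, threshold=0.2):
    """Flag the feature or label when its PSI exceeds the threshold."""
    return psi(expected_counts, actual_counts) > threshold
```

The same function can be applied per feature and to the label distribution, with the retraining run halted when any of them is flagged.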

It is important to keep track of model performance, since the framework updates the model in production. In addition to the real-time engagement alerts we have in place, we also set up a holdout experiment to track performance. Every time the workflow successfully retrains a new model and publishes it to production, the workflow automatically reversions the experiment so that it always compares the up-to-date production model with its previous version. From the holdout experiment, we can conclude that over a relatively long term (usually a month), auto-retraining brings consistent gains in the core metrics.


In this blog post, we shed some light on a few training foundations that power the machine learning technical stack for the closeup recommendation system. Throughout the work, we found that:

  • By leveraging a hybrid data logging approach, we are able to achieve very high data storage efficiency.
  • By providing a configuration-based sampling mechanism, we can easily experiment with various strategies. Sampling can be a powerful lever to mitigate system biases and improve the Pinner experience.
  • By adopting the auto-retraining framework, we are able to refresh the production model with confidence and adapt to trends and shifts with high efficiency.

Machine learning training foundations can be as powerful as machine learning techniques in driving Pinner experiences. We are always looking for opportunities to improve the experience throughout the whole tech stack.


The above work could not have been accomplished without the help of Olafur Gudmundsson, Pong Eksombatchai, Abhishek Tayal, Serena Rao, Chen Chen, Andrew Zhai, Bo Fu, and Mingda Li. We would like to thank them for their support and contributions along the way.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.

Training Foundation Improvements for Closeup Recommendation Ranker was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building for Inclusivity: The Technical Blueprint of Pinterest’s Multidimensional Diversification
2 months, 1 week ago

Pedro Silva | Sr. ML Engineer & Inclusive AI Tech Lead; Bhawna Juneja | Sr. Machine Learning Engineer; Rohan Mahadev | Machine Learning Engineer II; Sujay Khandagale | Machine Learning Engineer II; Abhay Varmaraja | Machine Learning Engineer II

Pinterest’s mission as a company is to bring everyone the inspiration to create a life they love. “Everyone” has been the north star for our Inclusive AI and Inclusive Product teams. These teams work together to ensure algorithmic fairness, inclusive design, and representation are an integral part of our platform and product experience.

Our commitment is evidenced by our history of building products that champion inclusivity. In 2018, Pinterest announced the skin tone signal and skin tone ranges. In 2020, we announced the integration of skin tone ranges into Try on for Beauty. In 2021, we announced hair pattern search. In early 2023, we announced how we have been using our skin tone signal to shape our recommendations to increase skin tone representation across several surfaces. Now, we are expanding the latter to also include body type representation in fashion related results across search and closeup recommendations (AKA related feeds).

Body image and representation in the media, online and offline, has been a part of the cultural dialogue for decades. For Pinterest, a visually inspiring platform with a mission to give everybody ideas fit for them, we saw an opportunity to start tackling this issue head-on. We know from experience that building for marginalized communities helps make the product work better for everyone. As a first step, we took on the challenge of building a visual body type signal which will help us surface diverse content and also help ensure our recommendations are more representative of various body types.

Signal Development and Indexing

The process of developing our visual body type signal essentially begins with data collection. In this case, thousands of fashion Pins¹ publicly available on Pinterest are gathered to serve as the raw dataset. The aim is to identify unique patterns and characteristics within these images that may provide a basis for meaningful groupings. Bias-aware guidelines are established in order to determine uniformity in terms of how these images should be grouped. Additionally, we partnered with external organizations, such as the National Association to Advance Fat Acceptance (NAAFA) and Pinterest Creators, to help us understand the nuances of size representation. These external partnerships along with our internal fashion specialists and labellers were fundamental in helping us design the experience from both a technical and human-centric perspective. The resulting structured dataset becomes the foundation to train and evaluate the machine learning model known as the body type signal.

To ensure an unbiased approach, we also leveraged our skin tone and hair pattern signals when building this dataset. This inclusion helps us create a model that is uniquely representative of diverse human attributes, giving us a more precise way to gauge and mitigate biases, if needed, across disparate segments in order to improve fairness and accuracy. With high quality labeled data, the next critical phase in the ML development cycle is training the model. Again, building on top of previous work, we use our in-house state of the art transformer-based unified visual embedding as the basis for this model (as seen in Figure 1).

Fig 1. The multi-task Unified Visual Embedding model which powers the body type signal

After initial training, we continue to have sessions with internal and external experts for feedback and further human validation. Their inputs are incredibly valuable in fine-tuning the ML model to improve its accuracy. This approach, alongside fairness evaluations, is fundamental to uncovering areas where the model may be underperforming. This iterative process facilitates the evolution of the model, enhancing its capability to make increasingly accurate predictions over time. The development cycle is recurrent, with constant iterations gradually improving the model's performance. This process will continue indefinitely to ensure we improve data coverage and quality and account for possible domain shifts.

Lastly, we index the signal at the content side as a discrete feature, associating all women’s fashion Pins with the prevalent body type present in them. This helps us fetch data at serving time for our recommendations and use it to diversify various Pinterest surfaces.

Diversifying Search Results and Recommendations

Building on top of our previous work on multi-stage diversification in search and recommender systems, we leveraged the existing Determinantal Point Process (DPP) algorithm to enable diversification at the ranking stage, but this time using both skin tone and body type signals.

Since DPP takes into account both the utility scores from ranking models and similarity scores with respect to the diversification dimensions, we are able to balance their trade-off and tune it appropriately for different surfaces and use cases. In our scenario with multiple diversity dimensions, DPP can be operationalized with a joint similarity matrix to account for the intersectionality between different dimensions. A simpler option, which also offers more flexibility in terms of how similarity between items is defined, is to add a new diversity term per dimension in the weighted sum between the utility term and the, now several, diversity terms used to solve the DPP optimization. Given this flexibility, we used the latter approach on search and closeup recommendations.
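The weighted-sum idea can be illustrated with a simple greedy re-ranker. Note that this is a deliberate simplification and not a DPP solver: it only shows how one diversity term per dimension can be traded off against utility. The similarity definition (share of already selected items with the same value), the field names, and the weights are all assumptions.

```python
def rerank(items, k, weights):
    """Greedily pick k items by utility minus weighted per-dimension similarity.

    items: list of dicts with 'id', 'utility', and one categorical value
           per diversity dimension (for example 'skin_tone', 'body_type').
    weights: dict of dimension name -> diversity weight.
    """
    selected = []
    remaining = list(items)
    while remaining and len(selected) < k:

        def score(item):
            penalty = 0.0
            if selected:
                for dim, weight in weights.items():
                    # Similarity on a dimension: share of selected items
                    # that carry the same value as this candidate.
                    same = sum(1 for s in selected if s[dim] == item[dim])
                    penalty += weight * same / len(selected)
            return item["utility"] - penalty

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [item["id"] for item in selected]
```

Setting every weight to zero recovers the pure utility ranking, and raising the weight for one dimension pushes the result toward more variety on that dimension, which is the kind of per-surface tuning described above.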

On search, we introduced this technique in women’s fashion and wedding related results, adding a new body type objective to our existing DPP Blender Node, which re-ranks the top search results to optimize for diversity objectives. Through an A/B experiment that we ran for users in the US who searched for fashion related queries, we saw a 454% improvement in the representation of all body types and a statistically significant impact on some engagement metrics on search, such as click throughs. To further enhance the effectiveness of body type diversification efforts in search, we also improved retrieval diversity. We leveraged the Strong-OR logic, which we had previously added for skin tone diversification, in order to surface content with more diverse body types from our candidate generation phase. Improving Strong-OR for body diversity also means we are surfacing more Pins with all visible skin tones. Given this, we also observed a statistically significant increase in the representation of all skin tones in the top recommendations².

Likewise in closeup recommendations, we added an additional diversification objective to the existing DPP Node as the final step in our blending pipeline prior to returning ranked results. Body type diversification in closeup recommendations takes place when the query Pin is in the women’s fashion or wedding interests categories. In this experiment we observed a 772% increase in all body types represented in the top recommendations. Furthermore, for the countries where we launched this approach, we observed a positive statistically significant impact in some engagement metrics³.

Body type diversification has been rolled out on search and closeup recommendations within the United States, New Zealand, United Kingdom, Ireland, Canada, and Australia. This shift towards inclusive and saveable content leads to increases in relevance, engagement, and user value as people come back to act on the ideas that represent them.


Through so many iterations with different inclusive signals like skin tone, hair pattern, and now body type, we continue to recognize the significance of building ML systems that prioritize inclusion and respect user privacy in our technical choices. With this multi-disciplinary collaboration between engineering and teams spanning many organizations, we will continue to build on our foundation adding more diversity signals, integrating them to diversify search results and recommendations, and expanding the inclusive product experience to more content and domains globally.

This work is the result of a cross-functional collaboration between many teams. Many thanks to Shloka Desai, Huizhong Duan, Travis Ebesu, Katie Elfering, Nadia Fawaz, Jean Garcia-Gathright, Kurchi Subhra Hazra, Kevin Bannerman Hutchful, Dmitry Kislyuk, Helene Labriet-Gross, Sudeep Paul, Chuck Rosenberg, Ivan Shpuntov, Ashudeep Singh, Yan Sun, Annie Ta, Catie Marques Teles, Yuting Wang, Jiajing Xu, David Xue.

¹Pinterest internal data, global, Q3 2023
²Pinterest internal data, US, Q2 2023, comparing pre-launch to post launch.
³Pinterest internal data, US, IE, NZ, UK, CA, AU, Q3 2023.

Building for Inclusivity: The Technical Blueprint of Pinterest’s Multidimensional Diversification was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Last Mile Data Processing with Ray
2 months, 2 weeks ago

Raymond Lee | Software Engineer II; Qingxian Lai | Sr. Software Engineer; Karthik Anantha Padmanabhan | Manager II, Engineering; Se Won Jang | Manager II, Engineering

Photo by Claudio Schwarz on Unsplash

Our mission at Pinterest is to bring everyone the inspiration to create the life they love. Machine Learning plays a crucial role in this mission. It allows us to continuously deliver high-quality inspiration to our 460 million monthly active users, curated from billions of pins on our platform. Behind the scenes, hundreds of ML engineers iteratively improve a wide range of recommendation engines that power Pinterest, processing petabytes of data and training thousands of models using hundreds of GPUs.

Recently, we started to notice an interesting trend in the Pinterest ML community. As model architecture building blocks (e.g. transformers) became standardized, ML engineers started to show a growing appetite to iterate on datasets. This includes sampling strategies, labeling, weighting, as well as batch inference for transfer learning and distillation.

While such dataset iterations can yield significant gains, we observed that only a handful of such experiments were conducted and productionized in the last six months. This motivated us to look deeper into the development process of our ML engineers, identify bottlenecks, and invest in ways to improve the dataset iteration speed in the ML lifecycle.

In this blog post, we will share our assessment of the ML developer velocity bottlenecks and delve deeper into how we adopted Ray, the open source framework to scale AI and machine learning workloads, into our ML Platform to improve dataset iteration speed from days to hours, while improving our GPU utilization to over 90%. We will go even deeper into this topic and our learnings at Ray Summit 2023. Please join us at our session there to learn more!

What Slows Down ML Dataset Iteration

At Pinterest, ML datasets used for recommender models are highly standardized. Features are shared, represented in ML-friendly types, and stored in parquet tables that enable both analytical queries and large scale training.

However, even with a high level of standardization, it is not easy to iterate quickly with web-scale data produced by hundreds of millions of users. Tables have thousands of features and span several months of user engagement history. In some cases, petabytes of data are streamed into training jobs to train a model. To try a new downsampling strategy, an ML engineer needs not only to figure out a way to process extremely large volumes of data, but also to pay the wall-clock time required to generate new dataset variations.

Pattern 1: Apache Spark Jobs Orchestrated through Workflow Templates

Figure 1: Dataset iteration by chaining Spark jobs and Torch jobs using Airflow (Workflow based ML Training Inner loop)

One of the most common technologies that ML engineers use to process petabyte-scale data is Apache Spark. ML engineers chain a sequence of Spark and PyTorch jobs using Airflow, and package them as “workflow templates” that can be reused to produce new model training DAGs quickly.

However, as ML is rapidly evolving, not all dataset iteration needs can be supported quickly by workflow templates. It often requires a long process that touches many languages and frameworks. ML engineers have to write new jobs in Scala / PySpark and test them. They have to integrate these jobs with workflow systems, test them at scale, tune them, and release them into production. This is not an interactive process, and bugs are often not found until late in the cycle.

We found out that in some cases, it takes several weeks for an ML engineer to train a model with a new dataset variation using workflows! This is what we call the “scale first, learn last” problem.

Pattern 2: Last Mile Processing in Training Jobs

Figure 2: Last Mile processing on the rigid training resources.

Since it takes so long to iterate on workflows, some ML engineers started to perform data processing directly inside training jobs. This is what we commonly refer to as Last Mile Data Processing. Last Mile processing can boost ML engineers’ velocity as they can write code in Python, directly using PyTorch.

However, this approach has its own challenges. As ML engineers move more data processing workloads to the training job, the training throughput slows down. To address this, they add more data loader workers that require more CPU and memory. Once the CPU / memory limit is reached, ML engineers continue to scale the machines vertically by provisioning expensive GPU machines that have more CPU and memory. The GPU resources in these machines are not adequately utilized, as the training job is bottlenecked on CPU.

Figure 3: Training with the same resources and model architecture, but with progressively more complex in-trainer data processing, shows a significant throughput decrease.

Even if we horizontally scale the training workload through distributed training, it is very challenging to find the right balance between training throughput and cost. These problems become more prominent as the datasets get larger and the data processing logic gets more complicated. In order to make optimal usage of both CPU and GPU resources, we need the ability to manage heterogeneous types of instances and distribute the workload in a resource-aware manner.

Solution: Using Ray for Last Mile Processing

Why we chose Ray

Having visited the above two patterns, we believe that horizontally scalable Last Mile Data Processing is the direction to achieve fast and efficient dataset iteration. The ideal solution should have three key capabilities:

  • Distributed Processing: Able to efficiently parallelize large scale data processing across multiple nodes
  • Heterogeneous Resource Management: Capable of managing diverse resources, like GPU and CPU, ensuring workloads are scheduled on the most efficient hardware
  • High Dev Velocity: Everything should be in a single framework, so that users don’t have to context switch between multiple systems when authoring dataset experiments

After evaluating various open-source tools, we decided to go with Ray. We were very excited to see that Ray not only fulfills all of our requirements but also presents a unique opportunity to provide our engineers with a unified AI Runtime for all the MLOps components: not just data processing but also distributed training, hyperparameter tuning, serving, etc., with first-class support for scalability.

Figure 4: Ray based ML Training inner loop

Utilizing Ray to speed up ML dataset experiments

Figure 5: Ray managing CPU and GPU workload within one cluster

With Ray, ML engineers start their development process by spinning up a dedicated, heterogeneous Ray Cluster that manages both CPU and GPU resources. This process is automated through the unified training job launcher tool, which also bootstraps the Ray driver that manages both data processing and training compute in the Cluster. In the driver, users can also invoke a programmable launcher API to orchestrate distributed training with the PyTorch training scripts that ML engineers author across multiple GPU nodes.

Figure 6: Ray Data’s streaming execution [reference]

Scalable Last Mile Data Processing is enabled by adopting Ray Data in this driver. Ray Data is a distributed data processing library built on top of Ray that supports a wide variety of data sources and common data processing operators. One of the key breakthrough functionalities we saw in Ray Data is its streaming execution capability, which allows us to transform data and train concurrently. This means that (1) we do not need to load the entire dataset in order to process it, and (2) we do not need the data computation to finish completely before training can progress. ML engineers can receive feedback on their new dataset experimentation logic in a matter of minutes.
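The two properties of streaming execution can be illustrated with plain Python generators. This sketch is a single-process stand-in and does not use the Ray Data API: all function names are hypothetical, and the `produced` list exists only to show which blocks were actually read.

```python
def read_blocks(num_blocks, rows_per_block, produced=None):
    """Lazily yield blocks of rows; nothing is read until a consumer asks."""
    for b in range(num_blocks):
        if produced is not None:
            produced.append(b)  # record that this block was materialized
        yield [
            {"row_id": b * rows_per_block + i, "label": i % 2}
            for i in range(rows_per_block)
        ]


def transform(blocks):
    """Last mile processing applied block by block (here: keep positives only)."""
    for block in blocks:
        yield [row for row in block if row["label"] == 1]


def train(blocks, max_steps=None):
    """Consume blocks as they arrive; return (steps taken, rows seen)."""
    steps = 0
    seen = 0
    for block in blocks:
        seen += len(block)
        steps += 1
        if max_steps is not None and steps >= max_steps:
            break
    return steps, seen
```

Calling `train(transform(read_blocks(1000, 10, produced)), max_steps=1)` reads only the first block, which is the single-machine analogue of getting feedback as soon as the first data block is ingested. A real streaming engine additionally runs the stages in parallel across nodes, which generators alone do not do.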

With streaming execution, we can significantly lower the resource requirements for petabyte-scale data ingestion, speed up the computation, and give ML engineers immediate, end-to-end feedback as soon as the first data block is ingested. Furthermore, to improve the data processing throughput, the ML engineer simply needs to elastically scale the CPU resources managed by the heterogeneous Ray cluster.

The following code snippet demonstrates how our ML engineers try out a training dataset iteration with Ray, interactively inside a Jupyter notebook.

Benchmark & Improvements

To assess the benefits of using Ray for Last Mile Data Processing, we conducted a set of benchmarks by training models on the same model architecture while progressively increasing the Last Mile Data Processing workloads.

To our surprise, the Ray dataloader showed a 20% improvement in the training throughput even without any Last Mile Data Processing. Ray dataloader handled extremely large features like user-sequence features much better than torch dataloader.

The improvement became more prominent as we started to incorporate more complex data processing and downsampling logic into the data loader. After adding spam-user filtering (map-side join) and dynamic negative downsampling, the Ray dataloader was up to 45% faster than our torch-based implementation. This means that an ML engineer can now gain 2x the learnings from training experimental models within the same time as before. While we had to horizontally scale the data loaders by adding more CPU nodes, the decrease in training time ultimately allowed us to reduce cost by 25% for this application as well.

When ML engineers conducted the same experiment by writing Spark jobs and workflows, it took them 90 hours to train a new model. With Ray, the ML engineers were able to reduce this to 15 hours, a whopping 6x improvement in developer velocity!

Figure 7: Training runtime comparison
Figure 8: Cost per training job comparison

Closing Remarks

This post only touches on a small portion of our journey in Pinterest with Ray and marks the beginning of the “Ray @ Pinterest” blog post series. Spanning multiple parts, this series will cover the different facets of utilizing Ray at Pinterest: infrastructure setup and advanced usage patterns including feature importance and transfer learning. Stay tuned for our upcoming posts!

Furthermore, we’re excited to announce that we’ll be attending this year’s Ray Summit on September 18th. During the Summit, we’ll delve deeper into the topics in this post and provide sneak peeks into the rest of the series. We invite you to join us during the Ray Summit to gain a deeper understanding of how Ray has transformed the landscape of ML training at Pinterest. We look forward to seeing you there!


Related Pins: Liyao Lu, Travis Ebesu

M10n: Haoyu He, Kartik Kapur

ML Platform: Chia-wei Chen, Saurabh Vishwas Joshi

Anyscale: Amog Kamsetty, Cheng Su, Hao Chen, Eric Liang, Jian Xiao, Jiao Dong, Zhe Zhang

Last Mile Data Processing with Ray was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Writing and linting Python at scale
1 week ago

Python plays a big part at Meta. It powers Instagram’s backend and plays an important role in our configuration systems, as well as much of our AI work. Meta even made contributions to Python 3.12, the latest version of Python. On this episode of the Meta Tech Podcast, Meta engineer Pascal Hartig (@passy) is joined by Amethyst [...]


The post Writing and linting Python at scale appeared first on Engineering at Meta.

Watch: Meta’s engineers on building network infrastructure for AI
1 week, 6 days ago

Meta is building for the future of AI at every level – from hardware like MTIA v1, Meta’s first-generation AI inference accelerator to publicly released models like Llama 2, Meta’s next-generation large language model, as well as new generative AI (GenAI) tools like Code Llama. Delivering next-generation AI products and services at Meta’s scale also [...]


The post Watch: Meta’s engineers on building network infrastructure for AI appeared first on Engineering at Meta.

Enhancing the security of WhatsApp calls
2 weeks, 6 days ago

New optional features in WhatsApp have helped make calling on WhatsApp more secure. “Silence Unknown Callers” is a new setting on WhatsApp that not only quiets annoying calls but also blocks sophisticated cyber attacks. “Protect IP Address in Calls” is a new setting on WhatsApp that helps hide your location from other parties on the [...]


The post Enhancing the security of WhatsApp calls appeared first on Engineering at Meta.

How Meta built Threads in 5 months
3 weeks, 1 day ago

In about five short months, a small team of engineers at Meta took Threads, the new text-based conversations app, from an idea to the most successful app launch of all time, pulling in over 100M users in its first five days. But this achievement wouldn’t have been possible without Meta’s existing systems and infrastructure. [...]


The post How Meta built Threads in 5 months appeared first on Engineering at Meta.

Automating data removal
4 weeks ago

Meta’s Systematic Code and Asset Removal Framework (SCARF) has a subsystem for identifying and removing unused data types. SCARF scans production data systems to identify tables or assets that are unused and safely removes them. SCARF avoids tedious manual work and ensures that product data is correctly removed when a product is shut down. This [...]


The post Automating data removal appeared first on Engineering at Meta.

5 Things you didn’t know about Buck2
1 month ago

Meta has a very large monorepo, with many different programming languages. To optimize build and performance, we developed our own build system called Buck, which was first open-sourced in 2013. Buck2 is the recently open-sourced successor. In our internal tests at Meta, we observed that Buck2 completed builds approximately 2x as fast as Buck1. Below [...]


The post 5 Things you didn’t know about Buck2 appeared first on Engineering at Meta.

How Meta is creating custom silicon for AI
1 month, 1 week ago

Olivia Wu, Meta’s Technical Lead for Infra Silicon, discusses the design and development of Meta’s first-generation AI inference accelerator. [...]


The post How Meta is creating custom silicon for AI appeared first on Engineering at Meta.

Automating product deprecation
1 month, 1 week ago

Systematic Code and Asset Removal Framework (SCARF) is Meta’s unused code and data deletion framework. SCARF guides engineers through deprecating a product safely and efficiently via an internal tool. SCARF combines this tooling with automation to reduce load on engineers. At Meta, we are constantly innovating and experimenting by building and shipping many different products, [...]


The post Automating product deprecation appeared first on Engineering at Meta.

How Sonar built a unified API on AWS
1 week, 1 day ago
SonarCloud, a software-as-a-service (SaaS) product developed by Sonar, seamlessly integrates into developers’ CI/CD workflows to increase code quality and identify vulnerabilities. Over the last few months, Sonar’s cloud engineers have worked on modernizing SonarCloud to shorten the lead time to production. Following Domain Driven Design principles, Sonar split the application into multiple business domains, each […]
Converting stateful application to stateless using AWS services
1 week, 4 days ago
Designing a system to be either stateful or stateless is an important choice with tradeoffs regarding its performance and scalability. In a stateful system, data from one session is carried over to the next. A stateless system doesn’t preserve data between sessions and depends on external entities such as databases or cache to manage state. […]
Let’s Architect! Tools for developers
2 weeks, 6 days ago
In the software development process, adopting developer tools makes it easier for developers to write code, build applications, and test more efficiently. As a developer, you can use various AWS developer tools for code editing, code quality, code completion, and so on. These tools include Amazon CodeGuru for code analysis, and Amazon CodeWhisperer for getting coding recommendations powered by machine learning algorithms. In this edition of Let’s Architect!, we’ll show you some tools that every developer should consider including in their toolkit.
Journey to Cloud-Native Architecture Series #7: Using Containers and Cell-based design for higher resiliency and efficiency
4 weeks, 1 day ago
In our previous Journey to Cloud-Native blogposts, we talked about evolving our architecture to become more scalable, secure, and cost effective to handle hyperscale requirements. In this post, we take these next steps: 1/ containerizing our applications to improve resource efficiency, and, 2/ using cell-based design to improve resiliency and time to production. Containerize applications […]
Let’s Architect! Designing systems for stream data processing
1 month ago
Harnessing the potential of streaming data processing offers the opportunity to stay at the forefront of industries, make data-informed decisions with agility, and gain invaluable insights into customer behavior and operational efficiency.
Let’s Architect! Designing systems for batch data processing
1 month, 2 weeks ago
With this edition of Let's Architect!, we'll cover important things to keep in mind while working in the area of data engineering. Most of these concepts come directly from the principles of system design and software engineering. We'll show you how to extend beyond the basics to ensure you can handle datasets of any size — including for training AI models.
ITS adopts microservices architecture for improved air travel search engine
1 month, 2 weeks ago
Internet Travel Solutions, LLC (ITS) is a travel management company that develops and maintains smart products and services for the corporate, commercial, and cargo sectors. ITS streamlines travel bookings for companies of any size around the world. It provides an intuitive consumer site with an integrated view of your travel and expenses. ITS had been […]
Announcing updates to the AWS Well-Architected Framework guidance
1 month, 3 weeks ago
We are excited to announce the availability of improved AWS Well-Architected Framework guidance. In this update, we have made changes across all six pillars of the framework: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability. In this release, we have made the implementation guidance for the new and updated best practices more prescriptive, including enhanced recommendations and steps on reusable […]
Let’s Architect! Leveraging SQL databases on AWS
2 months ago
SQL databases in Amazon Web Services (AWS), using services like Amazon Relational Database Service (Amazon RDS) and Amazon Aurora, offer software architects scalability, automated management, robust security, and cost-efficiency. This combination simplifies database management, improves performance, enhances security, and allows architects to create efficient and scalable software systems. In this post, we introduce caching strategies […]
Automating multi-AZ high availability for WebLogic administration server
2 months ago
Oracle WebLogic Server is used by enterprises to power production workloads, including Oracle E-Business Suite (EBS) and Oracle Fusion Middleware applications. Customer applications are deployed to WebLogic Server instances (managed servers) and managed using an administration server (admin server) within a logical organization unit, called a domain. Clusters of managed servers provide application availability and horizontal […]
What Is Retrieval-Augmented Generation?
1 week, 6 days ago
To understand the latest advance in generative AI, imagine a courtroom. Judges hear and decide cases based on their general understanding of the law. Sometimes a case — like a malpractice suit or a labor dispute — requires special expertise, so judges send court clerks to a law library, looking for precedents and specific cases [...]
Igniting the Future: TensorRT-LLM Release Accelerates AI Inference Performance, Adds Support for New Models Running on RTX-Powered Windows 11 PCs
1 week, 6 days ago
Artificial intelligence on Windows 11 PCs marks a pivotal moment in tech history, revolutionizing experiences for gamers, creators, streamers, office workers, students and even casual PC users. It offers unprecedented opportunities to enhance productivity for users of the more than 100 million Windows PCs and workstations that are powered by RTX GPUs. And NVIDIA RTX [...]
New Class of Accelerated, Efficient AI Systems Mark the Next Era of Supercomputing
2 weeks, 1 day ago
NVIDIA today unveiled at SC23 the next wave of technologies that will lift scientific and industrial research centers worldwide to new levels of performance and energy efficiency. “NVIDIA hardware and software innovations are creating a new class of AI supercomputers,” said Ian Buck, vice president of the company’s high performance computing and hyperscale data center [...]
Gen AI for the Genome: LLM Predicts Characteristics of COVID Variants
2 weeks, 1 day ago
A widely acclaimed large language model for genomic data has demonstrated its ability to generate gene sequences that closely resemble real-world variants of SARS-CoV-2, the virus behind COVID-19. Called GenSLMs, the model, which last year won the Gordon Bell special prize for high performance computing-based COVID-19 research, was trained on a dataset of nucleotide sequences [...]
Acing the Test: NVIDIA Turbocharges Generative AI Training in MLPerf Benchmarks
2 weeks, 6 days ago
NVIDIA’s AI platform raised the bar for AI training and high performance computing in the latest MLPerf industry benchmarks. Among many new records and milestones, one in generative AI stands out: NVIDIA Eos — an AI supercomputer powered by a whopping 10,752 NVIDIA H100 Tensor Core GPUs and NVIDIA Quantum-2 InfiniBand networking — completed a [...]
How AI-Based Cybersecurity Strengthens Business Resilience
3 weeks, 4 days ago
The world’s 5 billion internet users and nearly 54 billion devices generate 3.4 petabytes of data per second, according to IDC. As digitalization accelerates, enterprise IT teams are under greater pressure to identify and block incoming cyber threats to ensure business operations and services are not interrupted — and AI-based cybersecurity provides a reliable way [...]
Turing’s Mill: AI Supercomputer Revs UK’s Economic Engine
3 weeks, 6 days ago
The home of the first industrial revolution just made a massive investment in the next one. The U.K. government has announced it will spend £225 million ($273 million) to build one of the world’s fastest AI supercomputers. Called Isambard-AI, it’s the latest in a series of systems named after a legendary 19th century British engineer [...]
Unlocking the Power of Language: NVIDIA’s Annamalai Chockalingam on the Rise of LLMs
3 weeks, 6 days ago
Generative AI and large language models are stirring change across industries — but according to NVIDIA Senior Product Manager of Developer Marketing Annamalai Chockalingam, “we’re still in the early innings.” In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Chockalingam about LLMs: what they are, their current state and their future [...]
Silicon Volley: Designers Tap Generative AI for a Chip Assist
4 weeks, 1 day ago
A research paper released today describes ways generative AI can assist one of the most complex engineering efforts: designing semiconductors. The work demonstrates how companies in highly specialized fields can train large language models (LLMs) on their internal data to build assistants that increase productivity. Few pursuits are as challenging as semiconductor design. Under a [...]
Next-Gen Neural Networks: NVIDIA Research Announces Array of AI Advancements at NeurIPS
1 month ago
NVIDIA researchers are collaborating with academic centers worldwide to advance generative AI, robotics and the natural sciences — and more than a dozen of these projects will be shared at NeurIPS, one of the world’s top AI conferences. Set for Dec. 10-16 in New Orleans, NeurIPS brings together experts in generative AI, machine learning, computer [...]
The What, Why, and How of Mastering App Size
2 weeks ago

Sometimes a shiny new feature brings more harm than good. The reason is simple — application size. Any addition to the application — be it code for a new feature, an image resource for a new button or even support for a new localization — contributes to the increase of the application’s size.

The post The What, Why, and How of Mastering App Size appeared first on Spotify Engineering.

Spotify Wins CNCF Top End User Award for the Second Time!
2 weeks, 6 days ago

This week at KubeCon + CloudNativeCon in Chicago, the Cloud Native Computing Foundation announced that Spotify won their Top End User Award.

The post Spotify Wins CNCF Top End User Award for the Second Time! appeared first on Spotify Engineering.

How We Automated Content Marketing to Acquire Users at Scale
3 weeks, 1 day ago

Spotify runs paid marketing campaigns across the globe on various digital ad platforms. Being efficient with our marketing budget is critical for maximizing the return on ad spend.

The post How We Automated Content Marketing to Acquire Users at Scale appeared first on Spotify Engineering.

Introducing Voyager: Spotify’s New Nearest-Neighbor Search Library
1 month ago

For the past decade, Spotify has used approximate nearest-neighbor search technology to power our personalization, recommendation, and search systems. These technologies allow engineers and researchers to build systems that recommend similar items (like similar tracks, artists, or albums) without needing to run slow and expensive machine learning algorithms in real time. Spotify led the pack [...]

The post Introducing Voyager: Spotify’s New Nearest-Neighbor Search Library appeared first on Spotify Engineering.

Announcing the Recipients of the 2023 Spotify FOSS Fund
1 month ago

TL;DR It’s back! Last year, we created the Spotify FOSS Fund to help support the free and open source software projects we use at Spotify. We’re excited to announce that the fund has returned for 2023, and the recipients have been selected. This year, the fund’s 100,000 EUR are going to the following four projects: [...]

The post Announcing the Recipients of the 2023 Spotify FOSS Fund appeared first on Spotify Engineering.

Exclude from Your Taste Profile
1 month, 1 week ago

What is “Exclude from your taste profile”? Are you a parent forced to put the Bluey theme song on repeat? Do you work from home and play lofi beats or ambient piano music? Do you fall asleep to peaceful ambient noises? Are you bummed out when these songs come up as your most listened to [...]

The post Exclude from Your Taste Profile appeared first on Spotify Engineering.

Switching Build Systems, Seamlessly
1 month, 1 week ago

At Spotify, we have experimented with the Bazel build system since 2017. Over the years, the project has matured, and support for more languages and ecosystems has been added, thanks to the open source community and its maintainers at Google. In 2020, it became clear that the future of our client development required a unified [...]

The post Switching Build Systems, Seamlessly appeared first on Spotify Engineering.

Managing Software at Scale: Kelsey Hightower Talks with Niklas Gustavsson about Fleet Management
1 month, 3 weeks ago

How does Spotify manage a sprawling tech ecosystem made up of 500+ squads managing over 10,000 software components in production? Last November, Google Cloud distinguished engineer Kelsey Hightower met with Spotify chief architect Niklas Gustavsson at Spotify’s office in Gothenburg, Sweden, to talk about just that. Watch the video below to hear the two go [...]

The post Managing Software at Scale: Kelsey Hightower Talks with Niklas Gustavsson about Fleet Management appeared first on Spotify Engineering.

How to Accurately Test Significance with Difference in Difference Models
2 months ago

When we want to determine the causal effect of a product or business change at Spotify, A/B testing is the gold standard. However, in some cases, it’s not possible to run A/B tests. For example, when the intervention is an exogenous shock we can’t control, such as the COVID pandemic. Or when using experimental control [...]

The post How to Accurately Test Significance with Difference in Difference Models appeared first on Spotify Engineering.

Encouragement Designs and Instrumental Variables for A/B Testing
3 months ago

At Spotify, we run a lot of A/B tests. Most of these tests follow a standard design, where we assign users randomly to control and treatment groups, and then observe the difference in outcomes between these two groups. Usually, the control group, also known as the “holdout” group, retains the current experience, while the treatment [...]

The post Encouragement Designs and Instrumental Variables for A/B Testing appeared first on Spotify Engineering.

Introducing Ruvy
1 month, 1 week ago
We’ve recently open sourced a project called Ruvy! Ruvy is a toolchain that takes Ruby code as input and creates a WebAssembly module that will execute that Ruby code. There are other options for creating Wasm modules from Ruby code. The most common one is ruby.wasm. Ruvy is built on top of ruby.wasm to provide some specific benefits. We created Ruvy to take advantage of performance improvements from pre-initializing the Ruby virtual machine and Ruby files included by the Ruby script as well...
Building a ShopifyQL Code Editor
2 months, 2 weeks ago
In October 2022, Shopify released ShopifyQL Notebooks, a first-party app that lets merchants analyze their shop data to make better decisions. It puts the power of ShopifyQL into merchants’ hands with a guided code editing experience. In order to provide a first-class editing experience, we turned to CodeMirror, a code editor framework built for the web. Out of the box, CodeMirror didn’t have support for ShopifyQL–here’s how we built it. ShopifyQL Everywhere ShopifyQL is an accessible,...
Sidekick’s Improved Streaming Experience
3 months, 4 weeks ago
In this post, learn how Shopify's Sidekick solves markdown rendering jank and response delay in LLM chatbots with a buffering parser and async content resolution.
Shopify’s platform is the Web platform
4 months ago
Remix is now the recommended way to build Admin apps on Shopify. With Remix, you get a best-in-class developer experience while ensuring exceptional out-of-the-box performance for your app. Remix embraces the web platform and web standards, allowing web developers to use more of their existing knowledge and skills.
Creating a Flexible Order Routing System with Shopify Functions
7 months, 2 weeks ago
In this article, Ebun covers how we added flexibility to our previous one-size-fits-all order routing system with the introduction of “routing rules”, and how we dogfooded our own Shopify Functions feature to give merchants the ability to create their own routing rules.
How Migrating from Vanilla Redux to Redux Toolkit Improved State Management in Shopify POS
7 months, 4 weeks ago
A look at Shopify’s experience improving state management in the Shopify POS app by migrating from a Vanilla Redux codebase to a Redux Toolkit one.
What Being a Staff Developer Means at Shopify
8 months ago
A staff developer is an individual contributor who can have the same scope of impact and seniority as an engineering manager.
How to communicate like a GitHub engineer
1 month, 3 weeks ago
Learn more about how we use GitHub to build GitHub, how we turned our guiding communications principles into prescriptive practices to manage our internal communications signal-to-noise ratio, and how you can contribute to the ongoing conversation.
Transparent collaboration is the andon of your knowledge production system
2 months, 4 weeks ago
How the andon principle from lean manufacturing can help you spot and solve critical issues early on and foster a culture of transparency and collaboration within the software development process by encouraging anyone to "stop the line" when necessary.
Remote work requires communicating more, less frequently
3 months, 3 weeks ago
Remote work requires communicating more, less frequently, because asynchronous communication involves less frequent, but richer communication, meaning there is less time talking *about* the work and more time *doing* it, allowing the system to optimize for throughput and flow.
Practice inclusive scheduling
6 months, 1 week ago
When working as a distributed team, be mindful of cultural differences and time zones, encourage breaks between meetings, and connect as humans.
Pull requests are a form of documentation
6 months, 1 week ago
When authoring a pull request, use the body as an opportunity to document the proposed change, especially the "why", and cross link any related issues or other PRs to create a trail of breadcrumbs for future contributors.
Meetings are a point of escalation, not the starting point of a conversation
7 months, 1 week ago
Default to transferring context asynchronously. Hold colleagues accountable for being async first. If you receive a meeting invite without context, an agenda, or a read-ahead doc, consider politely declining.
Intro to GitHub for non-technical roles
8 months, 3 weeks ago
GitHub isn't just for software developers. If you're in a non-technical role, you can use GitHub to follow along, collaborate with your team, track your work, and share information. This brief guide includes everything you need to know to get started confidently with GitHub.
How to write a great extended leave document
10 months, 2 weeks ago
This is the template I use to prepare for an extended leave (or to hand off responsibilities as I transition to a new role).
Manage like an engineer
10 months, 2 weeks ago
If issues, pull requests, and project boards are the best way to develop software, should they not also be the best way to manage software development?
Helpful 404s for Jekyll (and GitHub Pages)
1 year, 4 months ago
How to implement `404 - not found` pages for Jekyll and GitHub pages that automatically suggest similar URLs to the one requested based on your site's `sitemap.xml`.
Responsible AI at Google Research: Perception Fairness
3 months ago
Posted by Susanna Ricco and Utsav Prabhu, co-leads, Perception Fairness Team, Google Research

Google’s Responsible AI research is built on a foundation of collaboration — between teams with diverse backgrounds and expertise, between researchers and product developers, and ultimately with the community at large. The Perception Fairness team drives progress by combining deep subject-matter expertise in both computer vision and machine learning (ML) fairness with direct connections to the researchers building the perception systems that power products across Google and beyond. Together, we are working to intentionally design our systems to be inclusive from the ground up, guided by Google’s AI Principles.

Perception Fairness research spans the design, development, and deployment of advanced multimodal models including the latest foundation and generative models powering Google's products.

Our team's mission is to advance the frontiers of fairness and inclusion in multimodal ML systems, especially related to foundation models and generative AI. This encompasses core technology components including classification, localization, captioning, retrieval, visual question answering, text-to-image or text-to-video generation, and generative image and video editing. We believe that fairness and inclusion can and should be top-line performance goals for these applications. Our research is focused on unlocking novel analyses and mitigations that enable us to proactively design for these objectives throughout the development cycle. We answer core questions, such as: How can we use ML to responsibly and faithfully model human perception of demographic, cultural, and social identities in order to promote fairness and inclusion? What kinds of system biases (e.g., underperforming on images of people with certain skin tones) can we measure and how can we use these metrics to design better algorithms? How can we build more inclusive algorithms and systems and react quickly when failures occur?

Measuring representation of people in media

ML systems that can edit, curate or create images or videos can affect anyone exposed to their outputs, shaping or reinforcing the beliefs of viewers around the world. Research to reduce representational harms, such as reinforcing stereotypes or denigrating or erasing groups of people, requires a deep understanding of both the content and the societal context. It hinges on how different observers perceive themselves, their communities, or how others are represented. There's considerable debate in the field regarding which social categories should be studied with computational tools and how to do so responsibly. Our research focuses on working toward scalable solutions that are informed by sociology and social psychology, are aligned with human perception, embrace the subjective nature of the problem, and enable nuanced measurement and mitigation. One example is our research on differences in human perception and annotation of skin tone in images using the Monk Skin Tone scale.

Our tools are also used to study representation in large-scale content collections. Through our Media Understanding for Social Exploration (MUSE) project, we've partnered with academic researchers, nonprofit organizations, and major consumer brands to understand patterns in mainstream media and advertising content. We first published this work in 2017, with a co-authored study analyzing gender equity in Hollywood movies. Since then, we've increased the scale and depth of our analyses. In 2019, we released findings based on over 2.7 million YouTube advertisements. In the latest study, we examine representation across intersections of perceived gender presentation, perceived age, and skin tone in over twelve years of popular U.S. television shows. These studies provide insights for content creators and advertisers and further inform our own research.

An illustration (not actual data) of computational signals that can be analyzed at scale to reveal representational patterns in media collections. [Video Collection / Getty Images]

Moving forward, we're expanding the ML fairness concepts on which we focus and the domains in which they are responsibly applied. Looking beyond photorealistic images of people, we are working to develop tools that model the representation of communities and cultures in illustrations, abstract depictions of humanoid characters, and even images with no people in them at all. Finally, we need to reason about not just who is depicted, but how they are portrayed — what narrative is communicated through the surrounding image content, the accompanying text, and the broader cultural context.

Analyzing bias properties of perceptual systems

Building advanced ML systems is complex, with multiple stakeholders informing various criteria that decide product behavior. Overall quality has historically been defined and measured using summary statistics (like overall accuracy) over a test dataset as a proxy for user experience. But not all users experience products in the same way.

Perception Fairness enables practical measurement of nuanced system behavior beyond summary statistics, and makes these metrics core to the system quality that directly informs product behaviors and launch decisions. This is often much harder than it seems. Distilling complex bias issues (e.g., disparities in performance across intersectional subgroups or instances of stereotype reinforcement) to a small number of metrics without losing important nuance is extremely challenging. Another challenge is balancing the interplay between fairness metrics and other product metrics (e.g., user satisfaction, accuracy, latency), which are often phrased as conflicting despite being compatible. It is common for researchers to describe their work as optimizing an "accuracy-fairness" tradeoff when in reality widespread user satisfaction is aligned with meeting fairness and inclusion objectives.

We built and released the MIAP dataset as part of Open Images, leveraging our research on perception of socially relevant concepts and detection of biased behavior in complex systems to create a resource that furthers ML fairness research in computer vision. Original photo credits — left: Boston Public Library; middle: jen robinson; right: Garin Fons; all used with permission under the CC BY 2.0 license.

To these ends, our team focuses on two broad research directions. First, democratizing access to well-understood and widely applicable fairness analysis tooling, engaging partner organizations in adopting it into product workflows, and informing leadership across the company in interpreting results. This work includes developing broad benchmarks, curating widely useful, high-quality test datasets and tooling centered around techniques such as sliced analysis and counterfactual testing — often building on the core representation signals work described earlier. Second, advancing novel approaches toward fairness analytics — including partnering with product efforts that may result in breakthrough findings or inform launch strategy.
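
As a rough illustration of what sliced analysis means in practice, the sketch below computes accuracy separately for each subgroup ("slice") and reports the gap between the best and worst slice, next to the single overall number. It is a minimal example under stated assumptions, not the team's tooling: the function names, the record layout, and the toy data are all hypothetical.

```python
from collections import defaultdict


def sliced_accuracy(records):
    """Return (overall_accuracy, {slice_name: accuracy}).

    Each record is a (slice_name, label, prediction) tuple. The slice names
    are hypothetical placeholders for whatever subgroups are being analyzed.
    """
    if not records:
        raise ValueError("records must not be empty")
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, label, prediction in records:
        totals[slice_name] += 1
        correct[slice_name] += int(label == prediction)
    overall = sum(correct.values()) / sum(totals.values())
    per_slice = {name: correct[name] / totals[name] for name in totals}
    return overall, per_slice


def largest_gap(per_slice):
    """Difference between the best- and worst-performing slices."""
    return max(per_slice.values()) - min(per_slice.values())


# Toy data, invented for illustration only (not real measurements).
records = (
    [("slice_a", 1, 1)] * 9 + [("slice_a", 1, 0)] * 1
    + [("slice_b", 1, 1)] * 6 + [("slice_b", 1, 0)] * 4
)

overall, per_slice = sliced_accuracy(records)
print("overall accuracy:", overall)
print("per-slice accuracy:", per_slice)
print("largest gap:", largest_gap(per_slice))
```

In this toy data the overall accuracy looks acceptable while one slice performs noticeably worse than the other, which is the kind of disparity a summary statistic alone would hide.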

Advancing AI responsibly

Our work does not stop with analyzing model behavior. Rather, we use this as a jumping-off point for identifying algorithmic improvements in collaboration with other researchers and engineers on product teams. Over the past year we've launched upgraded components that power Search and Memories features in Google Photos, leading to more consistent performance and drastically improving robustness through added layers that keep mistakes from cascading through the system. We are working on improving ranking algorithms in Google Images to diversify representation. We updated algorithms that may reinforce historical stereotypes, using additional signals responsibly, such that it’s more likely for everyone to see themselves reflected in Search results and find what they're looking for.

This work naturally carries over to the world of generative AI, where models can create collections of images or videos seeded from image and text prompts and can answer questions about images and videos. We're excited about the potential of these technologies to deliver new experiences to users and as tools to further our own research. To enable this, we're collaborating across the research and responsible AI communities to develop guardrails that mitigate failure modes. We’re leveraging our tools for understanding representation to power scalable benchmarks that can be combined with human feedback, and investing in research from pre-training through deployment to steer the models to generate higher quality, more inclusive, and more controllable output. We want these models to inspire people, producing diverse outputs, translating concepts without relying on tropes or stereotypes, and providing consistent behaviors and responses across counterfactual variations of prompts.

Opportunities and ongoing work

Despite over a decade of focused work, the field of perception fairness technologies still seems like a nascent and fast-growing space, rife with opportunities for breakthrough techniques. We continue to see opportunities to contribute technical advances backed by interdisciplinary scholarship. The gap between what we can measure in images versus the underlying aspects of human identity and expression is large — closing this gap will require increasingly complex media analytics solutions. Developing data metrics that indicate true representation, situated in the appropriate context and heeding a diversity of viewpoints, remains an open challenge for us. Can we reach a point where we can reliably identify depictions of nuanced stereotypes, continually update them to reflect an ever-changing society, and discern situations in which they could be offensive? Algorithmic advances driven by human feedback point to a promising path forward.

Recent focus on AI safety and ethics in the context of modern large model development has spurred new ways of thinking about measuring systemic biases. We are exploring multiple avenues to use these models — along with recent developments in concept-based explainability methods, causal inference methods, and cutting-edge UX research — to quantify and minimize undesired biased behaviors. We look forward to tackling the challenges ahead and developing technology that is built for everybody.


We would like to thank every member of the Perception Fairness team, and all of our collaborators.

How to compare a noisy quantum processor to a classical computer
3 months ago
Posted by Sergio Boixo and Vadim Smelyanskiy, Principal Scientists, Google Quantum AI Team

A full-scale error-corrected quantum computer will be able to solve some problems that are impossible for classical computers, but building such a device is a huge endeavor. We are proud of the milestones that we have achieved toward a fully error-corrected quantum computer, but that large-scale computer is still some number of years away. Meanwhile, we are using our current noisy quantum processors as flexible platforms for quantum experiments.

In contrast to an error-corrected quantum computer, experiments in noisy quantum processors are currently limited to a few thousand quantum operations or gates, before noise degrades the quantum state. In 2019 we implemented a specific computational task called random circuit sampling on our quantum processor and showed for the first time that it outperformed state-of-the-art classical supercomputing.

Although they have not yet reached beyond-classical capabilities, we have also used our processors to observe novel physical phenomena, such as time crystals and Majorana edge modes, and have made new experimental discoveries, such as robust bound states of interacting photons and the noise-resilience of Majorana edge modes of Floquet evolutions.

We expect that even in this intermediate, noisy regime, we will find applications for the quantum processors in which useful quantum experiments can be performed much faster than can be calculated on classical supercomputers — we call these "computational applications" of the quantum processors. No one has yet demonstrated such a beyond-classical computational application. So as we aim to achieve this milestone, the question is: What is the best way to compare a quantum experiment run on such a quantum processor to the computational cost of a classical application?

We already know how to compare an error-corrected quantum algorithm to a classical algorithm. In that case, the field of computational complexity tells us that we can compare their respective computational costs — that is, the number of operations required to accomplish the task. But with our current experimental quantum processors, the situation is not so well defined.

In “Effective quantum volume, fidelity and computational cost of noisy quantum processing experiments”, we provide a framework for measuring the computational cost of a quantum experiment, introducing the experiment’s “effective quantum volume”, which is the number of quantum operations or gates that contribute to a measurement outcome. We apply this framework to evaluate the computational cost of three recent experiments: our random circuit sampling experiment, our experiment measuring quantities known as “out of time order correlators” (OTOCs), and a recent experiment on a Floquet evolution related to the Ising model. We are particularly excited about OTOCs because they provide a direct way to experimentally measure the effective quantum volume of a circuit (a sequence of quantum gates or operations), which is itself a computationally difficult task for a classical computer to estimate precisely. OTOCs are also important in nuclear magnetic resonance and electron spin resonance spectroscopy. Therefore, we believe that OTOC experiments are a promising candidate for a first-ever computational application of quantum processors.

Plot of computational cost and impact of some recent quantum experiments. While some (e.g., QC-QMC 2022) have had high impact and others (e.g., RCS 2023) have had high computational cost, none have yet been both useful and hard enough to be considered a “computational application.” We hypothesize that our future OTOC experiment could be the first to pass this threshold. Other experiments plotted are referenced in the text.

Random circuit sampling: Evaluating the computational cost of a noisy circuit

When it comes to running a quantum circuit on a noisy quantum processor, there are two competing considerations. On one hand, we aim to do something that is difficult to achieve classically. The computational cost — the number of operations required to accomplish the task on a classical computer — depends on the quantum circuit’s effective quantum volume: the larger the volume, the higher the computational cost, and the more a quantum processor can outperform a classical one.

But on the other hand, on a noisy processor, each quantum gate can introduce an error to the calculation. The more operations, the higher the error, and the lower the fidelity of the quantum circuit in measuring a quantity of interest. Under this consideration, we might prefer simpler circuits with a smaller effective volume, but these are easily simulated by classical computers. The balance of these competing considerations, which we want to maximize, is called the "computational resource", shown below.

Graph of the tradeoff between quantum volume and noise in a quantum circuit, captured in a quantity called the “computational resource.” For a noisy quantum circuit, this will initially increase with the computational cost, but eventually, noise will overrun the circuit and cause it to decrease.

We can see how these competing considerations play out in a simple “hello world” program for quantum processors, known as random circuit sampling (RCS), which was the first demonstration of a quantum processor outperforming a classical computer. Any error in any gate is likely to make this experiment fail. Inevitably, this is a hard experiment to achieve with significant fidelity, and thus it also serves as a benchmark of system fidelity. But it also corresponds to the highest known computational cost achievable by a quantum processor. We recently reported the most powerful RCS experiment performed to date, with a low measured experimental fidelity of 1.7×10⁻³, and a high theoretical computational cost of ~10²³. These quantum circuits had 700 two-qubit gates. We estimate that this experiment would take ~47 years to simulate in the world's largest supercomputer. While this checks one of the two boxes needed for a computational application (it outperforms a classical supercomputer), it is not a particularly useful application per se.

OTOCs and Floquet evolution: The effective quantum volume of a local observable

There are many open questions in quantum many-body physics that are classically intractable, so running some of these experiments on our quantum processor has great potential. We typically think of these experiments a bit differently than we do the RCS experiment. Rather than measuring the quantum state of all qubits at the end of the experiment, we are usually concerned with more specific, local physical observables. Because not every operation in the circuit necessarily impacts the observable, a local observable’s effective quantum volume might be smaller than that of the full circuit needed to run the experiment.

We can understand this by applying the concept of a light cone from relativity, which determines which events in space-time can be causally connected: some events cannot possibly influence one another because information takes time to propagate between them. We say that two such events are outside their respective light cones. In a quantum experiment, we replace the light cone with something called a “butterfly cone,” where the growth of the cone is determined by the butterfly speed — the speed with which information spreads throughout the system. (This speed is characterized by measuring OTOCs, discussed later.) The effective quantum volume of a local observable is essentially the volume of the butterfly cone, including only the quantum operations that are causally connected to the observable. So, the faster information spreads in a system, the larger the effective volume and therefore the harder it is to simulate classically.

A depiction of the effective volume V_eff of the gates contributing to the local observable B. A related quantity called the effective area A_eff is represented by the cross-section of the plane and the cone. The perimeter of the base corresponds to the front of information travel that moves with the butterfly velocity v_B.

We apply this framework to a recent experiment implementing a so-called Floquet Ising model, a physical model related to the time crystal and Majorana experiments. From the data of this experiment, one can directly estimate an effective fidelity of 0.37 for the largest circuits. With the measured gate error rate of ~1%, this gives an estimated effective volume of ~100. This is much smaller than the light cone, which included two thousand gates on 127 qubits. So, the butterfly velocity of this experiment is quite small. Indeed, we argue that the effective volume covers only ~28 qubits, not 127, using numerical simulations that obtain a larger precision than the experiment. This small effective volume has also been corroborated with the OTOC technique. Although this was a deep circuit, the estimated computational cost is 5×10¹¹, almost one trillion times less than the recent RCS experiment. Correspondingly, this experiment can be simulated in less than a second per data point on a single A100 GPU. So, while this is certainly a useful application, it does not fulfill the second requirement of a computational application: substantially outperforming a classical simulation.
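The step from "fidelity 0.37 at ~1% error per gate" to "effective volume ~100" can be reproduced with a short calculation. The sketch below assumes the simplest possible noise model, in which every contributing gate fails independently so that fidelity is (1 - error)^V; this model and the function names are illustrative assumptions, not necessarily the estimator used in the paper.

```python
import math


def effective_volume(fidelity: float, error_per_gate: float) -> float:
    """Return V such that (1 - error_per_gate) ** V == fidelity.

    Assumes independent gate errors (an illustrative simplification).
    """
    if not 0.0 < fidelity <= 1.0:
        raise ValueError("fidelity must be in (0, 1]")
    if not 0.0 < error_per_gate < 1.0:
        raise ValueError("error_per_gate must be in (0, 1)")
    return math.log(fidelity) / math.log(1.0 - error_per_gate)


def expected_fidelity(volume: float, error_per_gate: float) -> float:
    """Inverse relation: fidelity left after `volume` noisy gates."""
    return (1.0 - error_per_gate) ** volume


if __name__ == "__main__":
    # Numbers quoted in the text for the Floquet Ising experiment.
    volume = effective_volume(fidelity=0.37, error_per_gate=0.01)
    print(f"estimated effective volume: {volume:.0f} gates")  # roughly 100
```

Under this model the fidelity falls off exponentially with the number of contributing gates, which is why an observable that depends on only ~100 gates can retain a usable signal even when the full circuit contains thousands.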

Information scrambling experiments with OTOCs are a promising avenue for a computational application. OTOCs can tell us important physical information about a system, such as the butterfly velocity, which is critical for precisely measuring the effective quantum volume of a circuit. OTOC experiments with fast entangling gates offer a potential path for a first beyond-classical demonstration of a computational application with a quantum processor. Indeed, in our experiment from 2021 we achieved an effective fidelity of F_eff ~ 0.06 with an experimental signal-to-noise ratio of ~1, corresponding to an effective volume of ~250 gates and a computational cost of 2×10¹².

While these early OTOC experiments are not sufficiently complex to outperform classical simulations, there is a deep physical reason why OTOC experiments are good candidates for the first demonstration of a computational application. Most of the interesting quantum phenomena accessible to near-term quantum processors that are hard to simulate classically correspond to a quantum circuit exploring many, many quantum energy levels. Such evolutions are typically chaotic and standard time-order correlators (TOC) decay very quickly to a purely random average in this regime. There is no experimental signal left. This does not happen for OTOC measurements, which allows us to grow complexity at will, only limited by the error per gate. We anticipate that a reduction of the error rate by half would double the computational cost, pushing this experiment to the beyond-classical regime.


Using the effective quantum volume framework we have developed, we have determined the computational cost of our RCS and OTOC experiments, as well as a recent Floquet evolution experiment. While none of these meet the requirements yet for a computational application, we expect that with improved error rates, an OTOC experiment will be the first beyond-classical, useful application of a quantum processor.

Teaching language models to reason algorithmically
3 months ago
Posted by Hattie Zhou, Graduate Student at MILA, Hanie Sedghi, Research Scientist, Google

Large language models (LLMs), such as GPT-3 and PaLM, have shown impressive progress in recent years, driven by scaling up models and training data sizes. Nonetheless, a long-standing debate has been whether LLMs can reason symbolically (i.e., manipulate symbols based on logical rules). For example, LLMs are able to perform simple arithmetic operations when numbers are small, but struggle with large numbers. This suggests that LLMs have not learned the underlying rules needed to perform these arithmetic operations.

While neural networks have powerful pattern matching capabilities, they are prone to overfitting to spurious statistical patterns in the data. This does not hinder good performance when the training data is large and diverse and the evaluation is in-distribution. However, for tasks that require rule-based reasoning (such as addition), LLMs struggle with out-of-distribution generalization as spurious correlations in the training data are often much easier to exploit than the true rule-based solution. As a result, despite significant progress in a variety of natural language processing tasks, performance on simple arithmetic tasks like addition has remained a challenge. Even with modest improvement of GPT-4 on the MATH dataset, errors are still largely due to arithmetic and calculation mistakes. Thus, an important question is whether LLMs are capable of algorithmic reasoning, which involves solving a task by applying a set of abstract rules that define the algorithm.

In “Teaching Algorithmic Reasoning via In-Context Learning”, we describe an approach that leverages in-context learning to enable algorithmic reasoning capabilities in LLMs. In-context learning refers to a model’s ability to perform a task after seeing a few examples of it within the context of the model. The task is specified to the model using a prompt, without the need for weight updates. We also present a novel algorithmic prompting technique that enables general purpose language models to achieve strong generalization on arithmetic problems that are more difficult than those seen in the prompt. Finally, we demonstrate that a model can reliably execute algorithms on out-of-distribution examples with an appropriate choice of prompting strategy.

By providing algorithmic prompts, we can teach a model the rules of arithmetic via in-context learning. In this example, the LLM (word predictor) outputs the correct answer when prompted with an easy addition question (e.g., 267+197), but fails when asked a similar addition question with longer digits. However, when the more difficult question is appended with an algorithmic prompt for addition (blue box with white + shown below the word predictor), the model is able to answer correctly. Moreover, the model is capable of simulating the multiplication algorithm (X) by composing a series of addition calculations.

Teaching an algorithm as a skill

In order to teach a model an algorithm as a skill, we develop algorithmic prompting, which builds upon other rationale-augmented approaches (e.g., scratchpad and chain-of-thought). Algorithmic prompting extracts algorithmic reasoning abilities from LLMs, and has two notable distinctions compared to other prompting approaches: (1) it solves tasks by outputting the steps needed for an algorithmic solution, and (2) it explains each algorithmic step with sufficient detail so there is no room for misinterpretation by the LLM.

To gain intuition for algorithmic prompting, let’s consider the task of two-number addition. In a scratchpad-style prompt, we process each digit from right to left and keep track of the carry value (i.e., we add a 1 to the next digit if the sum at the current digit is greater than 9) at each step. However, the rule of carry is ambiguous after seeing only a few examples of carry values. We find that including explicit equations to describe the rule of carry helps the model focus on the relevant details and interpret the prompt more accurately. We use this insight to develop an algorithmic prompt for two-number addition, where we provide explicit equations for each step of computation and describe various indexing operations in non-ambiguous formats.
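To make the idea of "explicit equations for each step" concrete, here is a minimal sketch that produces that kind of trace for two-number addition. The wording of each step line is a hypothetical format chosen for illustration; it is not the prompt text used in the paper.

```python
from typing import List, Tuple


def add_with_trace(a: str, b: str) -> Tuple[str, List[str]]:
    """Add two non-negative decimal strings, recording every carry step."""
    if not (a.isdigit() and b.isdigit()):
        raise ValueError("inputs must be non-negative decimal strings")
    width = max(len(a), len(b))
    x, y = a.zfill(width), b.zfill(width)  # pad so the digits line up

    carry = 0
    digits: List[str] = []
    steps: List[str] = []
    # Process digits from right to left, as in a scratchpad-style prompt.
    for i in range(width - 1, -1, -1):
        total = int(x[i]) + int(y[i]) + carry
        digit, new_carry = total % 10, total // 10
        # Spell out the carry rule as an equation instead of leaving it implicit.
        steps.append(
            f"{x[i]}+{y[i]}+{carry}={total}, digit={digit}, carry={new_carry}"
        )
        digits.append(str(digit))
        carry = new_carry
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits)), steps


if __name__ == "__main__":
    result, trace = add_with_trace("267", "197")
    for line in trace:
        print(line)
    print("answer:", result)  # answer: 464
```

Because every step states the full sum, the digit written, and the carry passed on, the same procedure applies unchanged to inputs much longer than any example shown, which is the property the prompt is meant to convey to the model.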

Illustration of various prompt strategies for addition.

Using only three prompt examples of addition with answer length up to five digits, we evaluate performance on additions of up to 19 digits. Accuracy is measured over 2,000 total examples sampled uniformly over the length of the answer. As shown below, the use of algorithmic prompts maintains high accuracy for questions significantly longer than what’s seen in the prompt, which demonstrates that the model is indeed solving the task by executing an input-agnostic algorithm.

Test accuracy on addition questions of increasing length for different prompting methods.

Leveraging algorithmic skills as tool use

To evaluate if the model can leverage algorithmic reasoning in a broader reasoning process, we evaluate performance using grade school math word problems (GSM8k). We specifically attempt to replace addition calculations from GSM8k with an algorithmic solution.

Motivated by context length limitations and possible interference between different algorithms, we explore a strategy where differently-prompted models interact with one another to solve complex tasks. In the context of GSM8k, we have one model that specializes in informal mathematical reasoning using chain-of-thought prompting, and a second model that specializes in addition using algorithmic prompting. The informal mathematical reasoning model is prompted to output specialized tokens in order to call on the addition-prompted model to perform the arithmetic steps. We extract the queries between tokens, send them to the addition model, and return the answer to the first model, after which the first model continues its output. We evaluate our approach using a harder variant of GSM8k (GSM8k-Hard), for which we randomly select 50 addition-only questions and increase the numerical values in the questions.
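The hand-off between the two models can be sketched as a small text-processing step. In the sketch below, the `<add>...</add>` marker and the `addition_model` function are assumptions made for illustration: the marker stands in for the specialized tokens, and the function stands in for the addition-prompted LLM. For simplicity, the sketch resolves calls in an already generated piece of reasoning, whereas the described system pauses the first model and lets it continue after the answer is returned.

```python
import re
from typing import Callable

# Hypothetical call format: <add>123+456</add>
CALL_PATTERN = re.compile(r"<add>(\d+)\+(\d+)</add>")


def addition_model(a: str, b: str) -> str:
    """Stand-in for the addition-prompted model (here: exact integer addition)."""
    return str(int(a) + int(b))


def resolve_calls(
    reasoning: str, adder: Callable[[str, str], str] = addition_model
) -> str:
    """Replace every marked addition query with the specialist's answer."""

    def _answer(match: "re.Match[str]") -> str:
        a, b = match.group(1), match.group(2)
        return f"{a}+{b}={adder(a, b)}"

    return CALL_PATTERN.sub(_answer, reasoning)


if __name__ == "__main__":
    draft = "She has <add>123456+654321</add> apples in total."
    print(resolve_calls(draft))
    # She has 123456+654321=777777 apples in total.
```

Keeping the specialist behind a narrow textual interface means each model sees only its own prompt, which is one way to avoid the context-length and interference issues mentioned above.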

An example from the GSM8k-Hard dataset. The chain-of-thought prompt is augmented with brackets to indicate when an algorithmic call should be performed.

We find that using separate contexts and models with specialized prompts is an effective way to tackle GSM8k-Hard. Below, we observe that the performance of the model with algorithmic call for addition is 2.3x the chain-of-thought baseline. Finally, this strategy presents an example of solving complex tasks by facilitating interactions between LLMs specialized to different skills via in-context learning.

Chain-of-thought (CoT) performance on GSM8k-Hard with or without algorithmic call.


We present an approach that leverages in-context learning and a novel algorithmic prompting technique to unlock algorithmic reasoning abilities in LLMs. Our results suggest that it may be possible to transform longer context into better reasoning performance by providing more detailed explanations. Thus, these findings point to using, or otherwise simulating, long contexts and generating more informative rationales as promising research directions.


We thank our co-authors Behnam Neyshabur, Azade Nova, Hugo Larochelle and Aaron Courville for their valuable contributions to the paper and great feedback on the blog. We thank Tom Small for creating the animations in this post. This work was done during Hattie Zhou’s internship at Google Research.

Language to rewards for robotic skill synthesis
3 months ago
Posted by Wenhao Yu and Fei Xia, Research Scientists, Google

Empowering end-users to interactively teach robots to perform novel tasks is a crucial capability for their successful integration into real-world applications. For example, a user may want to teach a robot dog to perform a new trick, or teach a manipulator robot how to organize a lunch box based on user preferences. The recent advancements in large language models (LLMs) pre-trained on extensive internet data have shown a promising path towards achieving this goal. Indeed, researchers have explored diverse ways of leveraging LLMs for robotics, from step-by-step planning and goal-oriented dialogue to robot-code-writing agents.

While these methods impart new modes of compositional generalization, they focus on using language to link together new behaviors from an existing library of control primitives that are either manually engineered or learned a priori. Despite having internal knowledge about robot motions, LLMs struggle to directly output low-level robot commands due to the limited availability of relevant training data. As a result, the expressiveness of these methods is bottlenecked by the breadth of the available primitives, the design of which often requires extensive expert knowledge or massive data collection.

In “Language to Rewards for Robotic Skill Synthesis”, we propose an approach to enable users to teach robots novel actions through natural language input. To do so, we leverage reward functions as an interface that bridges the gap between language and low-level robot actions. We posit that reward functions provide an ideal interface for such tasks given their richness in semantics, modularity, and interpretability. They also provide a direct connection to low-level policies through black-box optimization or reinforcement learning (RL). We developed a language-to-reward system that leverages LLMs to translate natural language user instructions into reward-specifying code and then applies MuJoCo MPC to find optimal low-level robot actions that maximize the generated reward function. We demonstrate our language-to-reward system on a variety of robotic control tasks in simulation using a quadruped robot and a dexterous manipulator robot. We further validate our method on a physical robot manipulator.

The language-to-reward system consists of two core components: (1) a Reward Translator, and (2) a Motion Controller. The Reward Translator maps natural language instructions from users to reward functions represented as Python code. The Motion Controller optimizes the given reward function using receding horizon optimization to find the optimal low-level robot actions, such as the amount of torque that should be applied to each robot motor.

LLMs cannot directly generate low-level robotic actions due to the lack of such data in their pre-training datasets. We propose to use reward functions to bridge the gap between language and low-level robot actions, and enable novel complex robot motions from natural language instructions.

Reward Translator: Translating user instructions to reward functions

The Reward Translator module was built with the goal of mapping natural language user instructions to reward functions. Reward tuning is highly domain-specific and requires expert knowledge, so it was not surprising to us when we found that LLMs trained on generic language datasets are unable to directly generate a reward function for specific hardware. To address this, we apply the in-context learning ability of LLMs. Furthermore, we split the Reward Translator into two sub-modules: Motion Descriptor and Reward Coder.

Motion Descriptor

First, we design a Motion Descriptor that interprets input from a user and expands it into a natural language description of the desired robot motion following a predefined template. This Motion Descriptor turns potentially ambiguous or vague user instructions into more specific and descriptive robot motions, making the reward coding task more stable. Moreover, users interact with the system through the motion description field, so this also provides a more interpretable interface for users compared to directly showing the reward function.

To create the Motion Descriptor, we use an LLM to translate the user input into a detailed description of the desired robot motion. We design prompts that guide the LLM to output the motion description with the right amount of detail and in the right format. By translating a vague user instruction into a more detailed description, we are able to more reliably generate the reward function with our system. This idea can also be potentially applied more generally beyond robotics tasks, and is relevant to Inner-Monologue and chain-of-thought prompting.

Reward Coder

In the second stage, we use the same LLM as in the Motion Descriptor for the Reward Coder, which translates the generated motion description into the reward function. Reward functions are represented using Python code to benefit from the LLM’s knowledge of reward, coding, and code structure.

Ideally, we would like to use an LLM to directly generate a reward function R(s, t) that maps the robot state s and time t into a scalar reward value. However, generating the correct reward function from scratch is still a challenging problem for LLMs, and correcting the errors requires the user to understand the generated code to provide the right feedback. As such, we pre-define a set of reward terms that are commonly used for the robot of interest and allow LLMs to compose different reward terms to formulate the final reward function. To achieve this, we design a prompt that specifies the reward terms and guides the LLM to generate the correct reward function for the task.
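As an illustration of composing pre-defined reward terms into a single R(s, t), here is a minimal sketch. The term names, the state keys (`torso_height`, `torso_pitch`, `forward_velocity`), and the weights are all hypothetical; the actual reward terms exposed to the LLM are not listed in this text.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
RewardTerm = Callable[[State, float], float]


def quadratic_term(key: str, target: float) -> RewardTerm:
    """Generic term: penalize squared deviation of one state entry from a target."""

    def term(state: State, t: float) -> float:
        return -((state[key] - target) ** 2)

    return term


# Hypothetical library of pre-defined terms for a quadruped.
def torso_height_term(target: float) -> RewardTerm:
    return quadratic_term("torso_height", target)


def torso_pitch_term(target: float) -> RewardTerm:
    return quadratic_term("torso_pitch", target)


def forward_velocity_term(target: float) -> RewardTerm:
    return quadratic_term("forward_velocity", target)


def compose_reward(weighted_terms: List[Tuple[float, RewardTerm]]) -> RewardTerm:
    """Build R(s, t) as a weighted sum of the selected terms."""

    def reward(state: State, t: float) -> float:
        return sum(weight * term(state, t) for weight, term in weighted_terms)

    return reward


# The kind of composition a Reward Coder might emit for "stand up straight".
stand_reward = compose_reward(
    [(1.0, torso_height_term(0.3)), (0.5, torso_pitch_term(0.0))]
)

if __name__ == "__main__":
    state = {"torso_height": 0.2, "torso_pitch": 0.0, "forward_velocity": 0.0}
    print(stand_reward(state, 0.0))
```

Restricting the LLM to selecting terms, targets, and weights keeps the generated code short and easy for a user to inspect, at the cost of only being able to express behaviors covered by the term library.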

The internal structure of the Reward Translator, which is tasked to map user inputs to reward functions.

Motion Controller: Translating reward functions to robot actions

The Motion Controller takes the reward function generated by the Reward Translator and synthesizes a controller that maps robot observation to low-level robot actions. To do this, we formulate the controller synthesis problem as a Markov decision process (MDP), which can be solved using different strategies, including RL, offline trajectory optimization, or model predictive control (MPC). Specifically, we use an open-source implementation based on the MuJoCo MPC (MJPC).

MJPC has demonstrated the interactive creation of diverse behaviors, such as legged locomotion, grasping, and finger-gaiting, while supporting multiple planning algorithms, such as iterative linear–quadratic–Gaussian (iLQG) and predictive sampling. More importantly, the frequent re-planning in MJPC makes it robust to uncertainties in the system and enables an interactive motion synthesis and correction system when combined with LLMs.
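The receding-horizon idea, planning over a short horizon, applying only the first action, and then re-planning, can be sketched on a toy problem. The code below is a simplified random-shooting planner in the spirit of sampling-based MPC; it is not MJPC's implementation and does not use the MuJoCo API. The 1-D point dynamics, the horizon, and the sample count are assumptions chosen for illustration.

```python
import random
from typing import Callable, List

DT = 0.1  # time step of the toy system (assumed)


def step(x: float, action: float) -> float:
    """Toy dynamics: a 1-D point moved by a velocity command clipped to [-1, 1]."""
    clipped = max(-1.0, min(1.0, action))
    return x + clipped * DT


def rollout_return(
    x: float, actions: List[float], reward: Callable[[float], float]
) -> float:
    """Sum of rewards collected along one candidate action sequence."""
    total = 0.0
    for action in actions:
        x = step(x, action)
        total += reward(x)
    return total


def plan(
    x: float,
    reward: Callable[[float], float],
    rng: random.Random,
    horizon: int = 3,
    num_samples: int = 64,
) -> List[float]:
    """Sample random action sequences and keep the one with the highest return."""
    best_seq: List[float] = []
    best_ret = float("-inf")
    for _ in range(num_samples):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        ret = rollout_return(x, seq, reward)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq


def run_mpc(
    x0: float, reward: Callable[[float], float], num_steps: int = 100, seed: int = 0
) -> float:
    """Receding-horizon loop: re-plan at every step and apply only the first action."""
    rng = random.Random(seed)
    x = x0
    for _ in range(num_steps):
        x = step(x, plan(x, reward, rng)[0])
    return x


if __name__ == "__main__":
    target = 1.0
    final_x = run_mpc(0.0, lambda x: -((x - target) ** 2))
    print(f"final position: {final_x:.2f} (target {target})")
```

Because the plan is recomputed from the current state at every step, a disturbance or a changed reward function takes effect at the next planning step, which is what makes this style of controller suitable for interactive correction.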


Robot dog

In the first example, we apply the language-to-reward system to a simulated quadruped robot and teach it to perform various skills. For each skill, the user will provide a concise instruction to the system, which will then synthesize the robot motion by using reward functions as an intermediate interface.

Dexterous manipulator

We then apply the language-to-reward system to a dexterous manipulator robot to perform a variety of manipulation tasks. The dexterous manipulator has 27 degrees of freedom, which makes it very challenging to control. Many of these tasks require manipulation skills beyond grasping, making it difficult for pre-designed primitives to work. We also include an example where the user can interactively instruct the robot to place an apple inside a drawer.

Validation on real robots

We also validate the language-to-reward method using a real-world manipulation robot to perform tasks such as picking up objects and opening a drawer. To perform the optimization in Motion Controller, we use AprilTag, a fiducial marker system, and F-VLM, an open-vocabulary object detection tool, to identify the position of the table and objects being manipulated.


In this work, we describe a new paradigm for interfacing an LLM with a robot through reward functions, powered by a low-level model predictive control tool, MuJoCo MPC. Using reward functions as the interface enables LLMs to work in a semantic-rich space that plays to the strengths of LLMs, while ensuring the expressiveness of the resulting controller. To further improve the performance of the system, we propose to use a structured motion description template to better extract internal knowledge about robot motions from LLMs. We demonstrate our proposed system on two simulated robot platforms and one real robot for both locomotion and manipulation tasks.


We would like to thank our co-authors Nimrod Gileadi, Chuyuan Fu, Sean Kirmani, Kuang-Huei Lee, Montse Gonzalez Arenas, Hao-Tien Lewis Chiang, Tom Erez, Leonard Hasenclever, Brian Ichter, Ted Xiao, Peng Xu, Andy Zeng, Tingnan Zhang, Nicolas Heess, Dorsa Sadigh, Jie Tan, and Yuval Tassa for their help and support in various aspects of the project. We would also like to acknowledge Ken Caluwaerts, Kristian Hartikainen, Steven Bohez, Carolina Parada, Marc Toussaint, and the greater teams at Google DeepMind for their feedback and contributions.

Google at Interspeech 2023
3 months, 1 week ago
Posted by Catherine Armato, Program Manager, Google

This week, the 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023) is being held in Dublin, Ireland, representing one of the world’s most extensive conferences on research and technology of spoken language understanding and processing. Experts in speech-related research fields gather to take part in oral presentations and poster sessions and to build collaborations across the globe.

We are excited to be a Platinum Sponsor of INTERSPEECH 2023, where we will be showcasing more than 20 research publications and supporting a number of workshops and special sessions. We welcome in-person attendees to drop by the Google Research booth to meet our researchers and participate in Q&As and demonstrations of some of our latest speech technologies, which help to improve accessibility and provide convenience in communication for billions of users. In addition, online attendees are encouraged to visit our virtual booth in Topia where you can get up-to-date information on research and opportunities at Google. Visit the @GoogleAI Twitter account to find out about Google booth activities (e.g., demos and Q&A sessions). You can also learn more about the Google research being presented at INTERSPEECH 2023 below (Google affiliations in bold).

Board and Organizing Committee

ISCA Board, Technical Committee Chair: Bhuvana Ramabhadran

Area Chairs include:
    Analysis of Speech and Audio Signals: Richard Rose
    Speech Synthesis and Spoken Language Generation: Rob Clark
    Special Areas: Tara Sainath

Satellite events

VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23)
Organizers include: Arsha Nagrani

ISCA Speech Synthesis Workshop (SSW12)
Speakers include: Rob Clark

Keynote talk – ISCA Medalist

Survey Talk

Speech Compression in the AI Era
Speaker: Jan Skoglund

Special session papers

Cascaded Encoders for Fine-Tuning ASR Models on Overlapped Speech
Richard Rose, Oscar Chang, Olivier Siohan

TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition
Hakan Erdogan, Scott Wisdom, Xuankai Chang*, Zalán Borsos, Marco Tagliasacchi, Neil Zeghidour, John R. Hershey


DeePMOS: Deep Posterior Mean-Opinion-Score of Speech
Xinyu Liang, Fredrik Cumlin, Christian Schüldt, Saikat Chatterjee

O-1: Self-Training with Oracle and 1-Best Hypothesis
Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Kartik Audhkhasi

Re-investigating the Efficient Transfer Learning of Speech Foundation Model Using Feature Fusion Methods
Zhouyuan Huo, Khe Chai Sim, Dongseong Hwang, Tsendsuren Munkhdalai, Tara N. Sainath, Pedro Moreno

MOS vs. AB: Evaluating Text-to-Speech Systems Reliably Using Clustered Standard Errors
Joshua Camp, Tom Kenter, Lev Finkelstein, Rob Clark

LanSER: Language-Model Supported Speech Emotion Recognition
Taesik Gong, Josh Belanich, Krishna Somandepalli, Arsha Nagrani, Brian Eoff, Brendan Jou

Modular Domain Adaptation for Conformer-Based Streaming ASR
Qiujia Li, Bo Li, Dongseong Hwang, Tara N. Sainath, Pedro M. Mengibar

On Training a Neural Residual Acoustic Echo Suppressor for Improved ASR
Sankaran Panchapagesan, Turaj Zakizadeh Shabestary, Arun Narayanan

MD3: The Multi-dialect Dataset of Dialogues
Jacob Eisenstein, Vinodkumar Prabhakaran, Clara Rivera, Dorottya Demszky, Devyani Sharma

Dual-Mode NAM: Effective Top-K Context Injection for End-to-End ASR
Zelin Wu, Tsendsuren Munkhdalai, Pat Rondon, Golan Pundak, Khe Chai Sim, Christopher Li

Using Text Injection to Improve Recognition of Personal Identifiers in Speech
Yochai Blau, Rohan Agrawal, Lior Madmony, Gary Wang, Andrew Rosenberg, Zhehuai Chen, Zorik Gekhman, Genady Beryozkin, Parisa Haghani, Bhuvana Ramabhadran

How to Estimate Model Transferability of Pre-trained Speech Models?
Zih-Ching Chen, Chao-Han Huck Yang*, Bo Li, Yu Zhang, Nanxin Chen, Shuo-yiin Chang, Rohit Prabhavalkar, Hung-yi Lee, Tara N. Sainath

Improving Joint Speech-Text Representations Without Alignment
Cal Peyser, Zhong Meng, Ke Hu, Rohit Prabhavalkar, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho

Text Injection for Capitalization and Turn-Taking Prediction in Speech Models
Shaan Bijwadia, Shuo-yiin Chang, Weiran Wang, Zhong Meng, Hao Zhang, Tara N. Sainath

Streaming Parrotron for On-Device Speech-to-Speech Conversion
Oleg Rybakov, Fadi Biadsy, Xia Zhang, Liyang Jiang, Phoenix Meadowlark, Shivani Agrawal

Semantic Segmentation with Bidirectional Language Models Improves Long-Form ASR
W. Ronny Huang, Hao Zhang, Shankar Kumar, Shuo-yiin Chang, Tara N. Sainath

Universal Automatic Phonetic Transcription into the International Phonetic Alphabet
Chihiro Taguchi, Yusuke Sakai, Parisa Haghani, David Chiang

Mixture-of-Expert Conformer for Streaming Multilingual ASR
Ke Hu, Bo Li, Tara N. Sainath, Yu Zhang, Francoise Beaufays

Real Time Spectrogram Inversion on Mobile Phone
Oleg Rybakov, Marco Tagliasacchi, Yunpeng Li, Liyang Jiang, Xia Zhang, Fadi Biadsy

2-Bit Conformer Quantization for Automatic Speech Recognition
Oleg Rybakov, Phoenix Meadowlark, Shaojin Ding, David Qiu, Jian Li, David Rim, Yanzhang He

LibriTTS-R: A Restored Multi-speaker Text-to-Speech Corpus
Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, Ankur Bapna

PronScribe: Highly Accurate Multimodal Phonemic Transcription from Speech and Text
Yang Yu, Matthew Perez*, Ankur Bapna, Fadi Haik, Siamak Tazari, Yu Zhang

Label Aware Speech Representation Learning for Language Identification
Shikhar Vashishth, Shikhar Bharadwaj, Sriram Ganapathy, Ankur Bapna, Min Ma, Wei Han, Vera Axelrod, Partha Talukdar

* Work done while at Google

Autonomous visual information seeking with large language models
3 months, 1 week ago
Posted by Ziniu Hu, Student Researcher, and Alireza Fathi, Research Scientist, Google Research, Perception Team

There has been great progress towards adapting large language models (LLMs) to accommodate multimodal inputs for tasks including image captioning, visual question answering (VQA), and open vocabulary recognition. Despite such achievements, current state-of-the-art visual language models (VLMs) perform inadequately on visual information seeking datasets, such as Infoseek and OK-VQA, where external knowledge is required to answer the questions.

Examples of visual information seeking queries where external knowledge is required to answer the question. Images are taken from the OK-VQA dataset.

In “AVIS: Autonomous Visual Information Seeking with Large Language Models”, we introduce a novel method that achieves state-of-the-art results on visual information seeking tasks. Our method integrates LLMs with three types of tools: (i) computer vision tools for extracting visual information from images, (ii) a web search tool for retrieving open world knowledge and facts, and (iii) an image search tool to glean relevant information from metadata associated with visually similar images. AVIS employs an LLM-powered planner to choose tools and queries at each step. It also uses an LLM-powered reasoner to analyze tool outputs and extract key information. A working memory component retains information throughout the process.

An example of AVIS’s generated workflow for answering a challenging visual information seeking question. The input image is taken from the Infoseek dataset.

Comparison to previous work

Recent studies (e.g., Chameleon, ViperGPT and MM-ReAct) explored adding tools to LLMs for multimodal inputs. These systems follow a two-stage process: planning (breaking down questions into structured programs or instructions) and execution (using tools to gather information). Despite success in basic tasks, this approach often falters in complex real-world scenarios.

There has also been a surge of interest in applying LLMs as autonomous agents (e.g., WebGPT and ReAct). These agents interact with their environment, adapt based on real-time feedback, and achieve goals. However, these methods do not restrict the tools that can be invoked at each stage, leading to an immense search space. Consequently, even the most advanced LLMs today can fall into infinite loops or propagate errors. AVIS tackles this via guided LLM use, influenced by human decisions from a user study.

Informing LLM decision making with a user study

Many of the visual questions in datasets such as Infoseek and OK-VQA pose a challenge even for humans, often requiring the assistance of various tools and APIs. An example question from the OK-VQA dataset is shown below. We conducted a user study to understand human decision-making when using external tools.

We conducted a user study to understand human decision-making when using external tools. Image is taken from the OK-VQA dataset.

The users were equipped with the same set of tools as our method, including PaLI, PaLM, and web search. They received input images, questions, detected object crops, and buttons linked to image search results. These buttons offered diverse information about the detected object crops, such as knowledge graph entities, similar image captions, related product titles, and identical image captions.

We record user actions and outputs and use them as a guide for our system in two key ways. First, we construct a transition graph (shown below) by analyzing the sequence of decisions made by users. This graph defines distinct states and restricts the available set of actions at each state. For example, at the start state, the system can take only one of these three actions: PaLI caption, PaLI VQA, or object detection. Second, we use the examples of human decision-making to guide our planner and reasoner with relevant contextual instances to enhance the performance and effectiveness of our system.

AVIS transition graph.

General framework

Our approach employs a dynamic decision-making strategy designed to respond to visual information-seeking queries. Our system has three primary components. First, we have a planner to determine the subsequent action, including the appropriate API call and the query it needs to process. Second, we have a working memory that retains information about the results obtained from API executions. Last, we have a reasoner, whose role is to process the outputs from the API calls. It determines whether the obtained information is sufficient to produce the final response, or if additional data retrieval is required.

The planner undertakes a series of steps each time a decision is required regarding which tool to employ and what query to send to it. Based on the present state, the planner provides a range of potential subsequent actions. The potential action space may be so large that it makes the search space intractable. To address this issue, the planner refers to the transition graph to eliminate irrelevant actions. The planner also excludes the actions that have already been taken before and are stored in the working memory.
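The filtering step described above can be sketched as a lookup in a small table. This is only an illustration: the three start-state actions come from the post, while the second state, its actions, and all identifier names are hypothetical.

```python
# Hypothetical transition graph. Only the three start-state actions are taken
# from the post; the "object_selected" state and its actions are assumptions.
TRANSITION_GRAPH = {
    "start": ["pali_caption", "pali_vqa", "object_detection"],
    "object_selected": ["image_search", "web_search", "pali_vqa"],
}


def allowed_actions(state, already_taken):
    """Return the actions the graph permits in `state`, minus those already tried."""
    return [a for a in TRANSITION_GRAPH.get(state, []) if a not in already_taken]


# Actions already recorded in working memory are excluded from the candidates.
print(allowed_actions("start", {"pali_caption"}))
```

Keeping the graph as plain data means the planner prompt only ever lists a handful of candidate actions, which is what keeps the search space small.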

Next, the planner collects a set of relevant in-context examples that are assembled from the decisions previously made by humans during the user study. With these examples and the working memory that holds data collected from past tool interactions, the planner formulates a prompt. The prompt is then sent to the LLM, which returns a structured answer, determining the next tool to be activated and the query to be dispatched to it. This design allows the planner to be invoked multiple times throughout the process, thereby facilitating dynamic decision-making that gradually leads to answering the input query.

We employ a reasoner to analyze the output of the tool execution, extract the useful information and decide into which category the tool output falls: informative, uninformative, or final answer. Our method utilizes the LLM with appropriate prompting and in-context examples to perform the reasoning. If the reasoner concludes that it’s ready to provide an answer, it will output the final response, thus concluding the task. If it determines that the tool output is uninformative, it will return control to the planner to select another action based on the current state. If it finds the tool output to be useful, it will modify the state and transfer control back to the planner to make a new decision at the new state.

AVIS employs a dynamic decision-making strategy to respond to visual information-seeking queries.
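The interplay between planner, working memory, and reasoner can be sketched as a short control loop. This is a minimal sketch, not the AVIS implementation: the three callables stand in for LLM and tool calls, the state-update rule is an assumption, and the toy tool outputs are made up for illustration.

```python
def run_loop(question, plan, call_tool, reason, max_steps=10):
    """Alternate planner and reasoner until the reasoner returns a final answer.

    plan(question, state, memory) -> (tool, query)
    call_tool(tool, query)        -> raw tool output
    reason(question, output)      -> (label, info), with label one of
        "informative", "uninformative", "final_answer"
    All three callables are hypothetical stand-ins.
    """
    state, memory = "start", []
    for _ in range(max_steps):
        tool, query = plan(question, state, memory)
        output = call_tool(tool, query)
        label, info = reason(question, output)
        memory.append((tool, query, label, info))  # working memory
        if label == "final_answer":
            return info, memory
        if label == "informative":
            state = tool  # assumption: the new state is named after the tool used
        # "uninformative": keep the state and let the planner pick another action
    return None, memory


# Toy stand-ins with made-up outputs, for illustration only.
def toy_plan(question, state, memory):
    tried = {tool for tool, *_ in memory}
    for tool in ("object_detection", "image_search", "web_search"):
        if tool not in tried:
            return tool, question
    return "web_search", question


def toy_call_tool(tool, query):
    outputs = {
        "object_detection": "",
        "image_search": "entity: example landmark",
        "web_search": "answer: 1889",
    }
    return outputs[tool]


def toy_reason(question, output):
    if not output:
        return "uninformative", None
    if output.startswith("answer: "):
        return "final_answer", output[len("answer: "):]
    return "informative", output


answer, memory = run_loop("When was this built?", toy_plan, toy_call_tool, toy_reason)
```

The `max_steps` cap is an addition for safety in the sketch; the post itself does not describe a step limit.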


We evaluate AVIS on Infoseek and OK-VQA datasets. As shown below, even robust visual-language models, such as OFA and PaLI, fail to yield high accuracy when fine-tuned on Infoseek. Our approach (AVIS), without fine-tuning, achieves 50.7% accuracy on the unseen entity split of this dataset.

AVIS visual question answering results on Infoseek dataset. AVIS achieves higher accuracy in comparison to previous baselines based on PaLI, PaLM and OFA.

Our results on the OK-VQA dataset are shown below. AVIS with few-shot in-context examples achieves an accuracy of 60.2%, higher than most of the previous works. AVIS achieves lower but comparable accuracy in comparison to the PaLI model fine-tuned on OK-VQA. This difference, compared to Infoseek where AVIS outperforms fine-tuned PaLI, is due to the fact that most question-answer examples in OK-VQA rely on common sense knowledge rather than on fine-grained knowledge. Therefore, PaLI is able to encode such generic knowledge in the model parameters and doesn’t require external knowledge.

Visual question answering results on OK-VQA. AVIS achieves higher accuracy in comparison to previous works that use few-shot or zero-shot learning, including Flamingo, PaLI and ViperGPT. AVIS also achieves higher accuracy than most of the previous works that are fine-tuned on the OK-VQA dataset, including REVEAL, ReVIVE, KAT and KRISP, and achieves results that are close to the fine-tuned PaLI model.


We present a novel approach that equips LLMs with the ability to use a variety of tools for answering knowledge-intensive visual questions. Our methodology, anchored in human decision-making data collected from a user study, employs a structured framework that uses an LLM-powered planner to dynamically decide on tool selection and query formation. An LLM-powered reasoner is tasked with processing and extracting key information from the output of the selected tool. Our method iteratively employs the planner and reasoner to leverage different tools until all necessary information required to answer the visual question is amassed.


This research was conducted by Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David A. Ross, Cordelia Schmid and Alireza Fathi.

Neural network pruning with combinatorial optimization
3 months, 1 week ago
Posted by Hussein Hazimeh, Research Scientist, Athena Team, and Riade Benbaki, Graduate Student at MIT

Modern neural networks have achieved impressive performance across a variety of applications, such as language, mathematical reasoning, and vision. However, these networks often use large architectures that require lots of computational resources. This can make it impractical to serve such models to users, especially in resource-constrained environments like wearables and smartphones. A widely used approach to mitigate the inference costs of pre-trained networks is to prune them by removing some of their weights, in a way that doesn’t significantly affect utility. In standard neural networks, each weight defines a connection between two neurons. So after weights are pruned, the input will propagate through a smaller set of connections and thus require fewer computational resources.

Original network vs. a pruned network.

Pruning methods can be applied at different stages of the network’s training process: post, during, or before training (i.e., immediately after weight initialization). In this post, we focus on the post-training setting: given a pre-trained network, how can we determine which weights should be pruned? One popular method is magnitude pruning, which removes weights with the smallest magnitude. While efficient, this method doesn’t directly consider the effect of removing weights on the network’s performance. Another popular paradigm is optimization-based pruning, which removes weights based on how much their removal impacts the loss function. Although conceptually appealing, most existing optimization-based approaches seem to face a serious tradeoff between performance and computational requirements. Methods that make crude approximations (e.g., assuming a diagonal Hessian matrix) can scale well, but have relatively low performance. On the other hand, while methods that make fewer approximations tend to perform better, they appear to be much less scalable.
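For reference, magnitude pruning as described above fits in a few lines of NumPy. This is a minimal sketch on a single weight array; the function name and the toy values are illustrative.

```python
import numpy as np


def magnitude_prune(weights, k):
    """Keep the k largest-magnitude entries of `weights` and zero out the rest."""
    flat = weights.ravel()
    if k >= flat.size:
        return weights.copy()
    pruned = np.zeros_like(flat)
    if k > 0:
        # Indices of the k entries with the largest absolute value.
        keep = np.argpartition(np.abs(flat), flat.size - k)[flat.size - k:]
        pruned[keep] = flat[keep]
    return pruned.reshape(weights.shape)


w = np.array([[0.5, -0.01, 2.0], [-1.5, 0.03, 0.2]])
print(magnitude_prune(w, k=3))
```

The loss function never appears in this procedure, which is the limitation the paragraph above points out.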

In “Fast as CHITA: Neural Network Pruning with Combinatorial Optimization”, presented at ICML 2023, we describe how we developed an optimization-based approach for pruning pre-trained neural networks at scale. CHITA (which stands for “Combinatorial Hessian-free Iterative Thresholding Algorithm”) outperforms existing pruning methods in terms of scalability and performance tradeoffs, and it does so by leveraging advances from several fields, including high-dimensional statistics, combinatorial optimization, and neural network pruning. For example, CHITA can be 20x to 1000x faster than state-of-the-art methods for pruning ResNet and improves accuracy by over 10% in many settings.

Overview of contributions

CHITA has two notable technical improvements over popular methods:

  • Efficient use of second-order information: Pruning methods that use second-order information (i.e., relating to second derivatives) achieve the state of the art in many settings. In the literature, this information is typically used by computing the Hessian matrix or its inverse, an operation that is very difficult to scale because the Hessian size is quadratic with respect to the number of weights. Through careful reformulation, CHITA uses second-order information without having to compute or store the Hessian matrix explicitly, thus allowing for more scalability.
  • Combinatorial optimization: Popular optimization-based methods use a simple optimization technique that prunes weights in isolation, i.e., when deciding to prune a certain weight they don’t take into account whether other weights have been pruned. This could lead to pruning important weights because weights deemed unimportant in isolation may become important when other weights are pruned. CHITA avoids this issue by using a more advanced, combinatorial optimization algorithm that takes into account how pruning one weight impacts others.

In the sections below, we discuss CHITA’s pruning formulation and algorithms.

A computation-friendly pruning formulation

There are many possible pruning candidates, which are obtained by retaining only a subset of the weights from the original network. Let k be a user-specified parameter that denotes the number of weights to retain. Pruning can be naturally formulated as a best-subset selection (BSS) problem: among all possible pruning candidates (i.e., subsets of weights) with only k weights retained, the candidate that has the smallest loss is selected.

Pruning as a BSS problem: among all possible pruning candidates with the same total number of weights, the best candidate is defined as the one with the least loss. This illustration shows four candidates, but this number is generally much larger.
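To make the BSS formulation concrete, here is a brute-force sketch that enumerates every candidate with exactly k retained weights and keeps the one with the smallest loss. It is only feasible for a handful of weights. The quadratic toy loss and its matrix H are made-up assumptions, and retained weights simply keep their original values.

```python
from itertools import combinations
from math import comb

import numpy as np


def best_subset(weights, k, loss_fn):
    """Exhaustively search all ways to retain exactly k weights (tiny p only)."""
    best_loss, best_candidate = float("inf"), None
    for keep in combinations(range(weights.size), k):
        candidate = np.zeros_like(weights)
        candidate[list(keep)] = weights[list(keep)]
        loss = loss_fn(candidate)
        if loss < best_loss:
            best_loss, best_candidate = loss, candidate
    return best_candidate, best_loss


# Toy quadratic loss around the original weights (H is an arbitrary example).
w = np.array([1.0, -1.0, 0.6])
H = np.array([[1.0, 0.9, 0.0], [0.9, 2.0, 0.0], [0.0, 0.0, 4.0]])


def toy_loss(candidate):
    d = candidate - w
    return 0.5 * d @ H @ d


candidate, loss = best_subset(w, k=2, loss_fn=toy_loss)
print(candidate, loss, "candidates searched:", comb(w.size, 2))
```

In this toy case the best candidate drops a weight of magnitude 1.0 and keeps the one of magnitude 0.6, which magnitude pruning would not do. The number of candidates grows as "p choose k", which is why exhaustive search is out of reach for real networks.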

Solving the pruning BSS problem on the original loss function is generally computationally intractable. Thus, similar to previous work, such as OBD and OBS, we approximate the loss with a quadratic function by using a second-order Taylor series, where the Hessian is estimated with the empirical Fisher information matrix. While gradients can be typically computed efficiently, computing and storing the Hessian matrix is prohibitively expensive due to its sheer size. In the literature, it is common to deal with this challenge by making restrictive assumptions on the Hessian (e.g., diagonal matrix) and also on the algorithm (e.g., pruning weights in isolation).

CHITA uses an efficient reformulation of the pruning problem (BSS using the quadratic loss) that avoids explicitly computing the Hessian matrix, while still using all the information from this matrix. This is made possible by exploiting the low-rank structure of the empirical Fisher information matrix. This reformulation can be viewed as a sparse linear regression problem, where each regression coefficient corresponds to a certain weight in the neural network. After obtaining a solution to this regression problem, coefficients set to zero will correspond to weights that should be pruned. Our regression data matrix is (n x p), where n is the batch (sub-sample) size and p is the number of weights in the original network. Typically n << p, so storing and operating with this data matrix is much more scalable than common pruning approaches that operate with the (p x p) Hessian.

CHITA reformulates the quadratic loss approximation, which requires an expensive Hessian matrix, as a linear regression (LR) problem. The LR’s data matrix is linear in p, which makes the reformulation more scalable than the original quadratic approximation.
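The low-rank structure can be illustrated numerically. The sketch below assumes the empirical Fisher matrix is the average of outer products of per-sample gradients, so that H = AᵀA / n for an (n x p) matrix A with one gradient per row. It shows only the identity that makes a Hessian-free evaluation possible, not CHITA's full reformulation; the random data is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 200                    # n << p, as in the post
A = rng.normal(size=(n, p))      # placeholder per-sample gradients, one row each
delta = rng.normal(size=p)       # change applied to the weights (w - w0)

# Explicit route: build the (p x p) matrix, then evaluate the quadratic term.
H = A.T @ A / n
quad_explicit = 0.5 * delta @ H @ delta

# Hessian-free route: the same value from (n x p) products only.
quad_implicit = 0.5 * np.sum((A @ delta) ** 2) / n

print(H.shape, A.shape, np.isclose(quad_explicit, quad_implicit))
```

The second route is a least-squares expression in A, which is why the problem can be read as a regression with an (n x p) data matrix.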

Scalable optimization algorithms

CHITA reduces pruning to a linear regression problem under the following sparsity constraint: at most k regression coefficients can be nonzero. To obtain a solution to this problem, we consider a modification of the well-known iterative hard thresholding (IHT) algorithm. IHT performs gradient descent where after each update the following post-processing step is performed: all regression coefficients outside the Top-k (i.e., the k coefficients with the largest magnitude) are set to zero. IHT typically delivers a good solution to the problem, and it does so by iteratively exploring different pruning candidates and jointly optimizing over the weights.
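A minimal sketch of standard IHT for sparse least squares follows. It is the textbook algorithm with a constant step size, not CHITA's modified version; the synthetic data and all names are illustrative.

```python
import numpy as np


def hard_threshold(beta, k):
    """Zero every coefficient outside the k largest in magnitude."""
    out = np.zeros_like(beta)
    if k > 0:
        top = np.argsort(np.abs(beta))[-k:]
        out[top] = beta[top]
    return out


def iht(A, y, k, n_iters=500):
    """Plain IHT for min 0.5 * ||y - A b||^2 subject to at most k nonzeros in b."""
    # Constant step: 1 / (largest eigenvalue of A^T A).
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    beta = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ beta - y)                    # gradient of the squared loss
        beta = hard_threshold(beta - step * grad, k)   # gradient step, then Top-k
    return beta


# Synthetic problem with three truly nonzero coefficients.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[[2, 7, 11]] = [3.0, -2.0, 1.5]
y = A @ beta_true

beta_hat = iht(A, y, k=3)
print(np.nonzero(beta_hat)[0])
```

Because the Top-k set is recomputed after every gradient step, the support can change between iterations, which is how the method explores different pruning candidates while updating the retained coefficients.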

Due to the scale of the problem, standard IHT with constant learning rate can suffer from very slow convergence. For faster convergence, we developed a new line-search method that exploits the problem structure to find a suitable learning rate, i.e., one that leads to a sufficiently large decrease in the loss. We also employed several computational schemes to improve CHITA’s efficiency and the quality of the second-order approximation, leading to an improved version that we call CHITA++.
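The idea of choosing the learning rate by line search can be illustrated with a generic backtracking rule. The post does not spell out CHITA's own line-search method, so the sketch below swaps in a standard sufficient-decrease test; the constants and names are assumptions.

```python
import numpy as np


def hard_threshold(beta, k):
    """Zero every coefficient outside the k largest in magnitude."""
    out = np.zeros_like(beta)
    if k > 0:
        top = np.argsort(np.abs(beta))[-k:]
        out[top] = beta[top]
    return out


def loss(A, y, beta):
    return 0.5 * np.sum((y - A @ beta) ** 2)


def iht_step_backtracking(A, y, beta, k, step=1.0, shrink=0.5, c=1e-4, max_tries=60):
    """One IHT update that shrinks the step until the loss decreases enough."""
    grad = A.T @ (A @ beta - y)
    current = loss(A, y, beta)
    for _ in range(max_tries):
        candidate = hard_threshold(beta - step * grad, k)
        diff = candidate - beta
        # Accept once the decrease is at least proportional to the squared move.
        if loss(A, y, candidate) <= current - (c / step) * (diff @ diff):
            return candidate, step
        step *= shrink
    return beta, step  # no acceptable step found: leave the coefficients unchanged


rng = np.random.default_rng(1)
A = rng.normal(size=(50, 10))
y = rng.normal(size=50)
beta = np.zeros(10)
for _ in range(20):
    beta, used_step = iht_step_backtracking(A, y, beta, k=3)
print(loss(A, y, np.zeros(10)), loss(A, y, beta))
```

Each accepted step cannot increase the loss, so the iterates improve monotonically without a hand-tuned constant learning rate.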


We compare CHITA’s run time and accuracy with several state-of-the-art pruning methods using different architectures, including ResNet and MobileNet.

Run time: CHITA is much more scalable than comparable methods that perform joint optimization (as opposed to pruning weights in isolation). For example, CHITA’s speed-up can reach over 1000x when pruning ResNet.

Post-pruning accuracy: Below, we compare the performance of CHITA and CHITA++ with magnitude pruning (MP), WoodFisher (WF), and Combinatorial Brain Surgeon (CBS), for pruning 70% of the model weights. Overall, we see good improvements from CHITA and CHITA++.

Post-pruning accuracy of various methods on ResNet20. Results are reported for pruning 70% of the model weights.
Post-pruning accuracy of various methods on MobileNet. Results are reported for pruning 70% of the model weights.

Next, we report results for pruning a larger network: ResNet50 (on this network, some of the methods listed in the ResNet20 figure couldn’t scale). Here we compare with magnitude pruning and M-FAC. The figure below shows that CHITA achieves better test accuracy for a wide range of sparsity levels.

Test accuracy of pruned networks, obtained using different methods.

Conclusion, limitations, and future work

We presented CHITA, an optimization-based approach for pruning pre-trained neural networks. CHITA offers scalability and competitive performance by efficiently using second-order information and drawing on ideas from combinatorial optimization and high-dimensional statistics.

CHITA is designed for unstructured pruning in which any weight can be removed. In theory, unstructured pruning can significantly reduce computational requirements. However, realizing these reductions in practice requires special software (and possibly hardware) that support sparse computations. In contrast, structured pruning, which removes whole structures like neurons, may offer improvements that are easier to attain on general-purpose software and hardware. It would be interesting to extend CHITA to structured pruning.


This work is part of a research collaboration between Google and MIT. Thanks to Rahul Mazumder, Natalia Ponomareva, Wenyu Chen, Xiang Meng, Zhe Zhao, and Sergei Vassilvitskii for their help in preparing this post and the paper. Also thanks to John Guilyard for creating the graphics in this post.

STUDY: Socially aware temporally causal decoder recommender systems
3 months, 1 week ago
Posted by Eltayeb Ahmed, Research Engineer, and Subhrajit Roy, Senior Research Scientist, Google Research

Reading has many benefits for young students, such as better linguistic and life skills, and reading for pleasure has been shown to correlate with academic success. Furthermore, students have reported improved emotional wellbeing from reading, as well as better general knowledge and better understanding of other cultures. With the vast amount of reading material both online and off, finding age-appropriate, relevant and engaging content can be a challenging task, but helping students do so is a necessary step to engage them in reading. Effective recommendations that present students with relevant reading material help keep students reading, and this is where machine learning (ML) can help.

ML has been widely used in building recommender systems for various types of digital content, ranging from videos to books to e-commerce items. Recommender systems are used across a range of digital platforms to help surface relevant and engaging content to users. In these systems, ML models are trained to suggest items to each user individually based on user preferences, user engagement, and the items under recommendation. These data provide a strong learning signal for models to be able to recommend items that are likely to be of interest, thereby improving user experience.

In “STUDY: Socially Aware Temporally Causal Decoder Recommender Systems”, we present a content recommender system for audiobooks in an educational setting taking into account the social nature of reading. We developed the STUDY algorithm in partnership with Learning Ally, an educational nonprofit, aimed at promoting reading in dyslexic students, that provides audiobooks to students through a school-wide subscription program. Leveraging the wide range of audiobooks in the Learning Ally library, our goal is to help students find the right content to help boost their reading experience and engagement. Motivated by the fact that what a person’s peers are currently reading has significant effects on what they would find interesting to read, we jointly process the reading engagement history of students who are in the same classroom. This allows our model to benefit from live information about what is currently trending within the student’s localized social group, in this case, their classroom.


Learning Ally has a large digital library of curated audiobooks targeted at students, making it well-suited for building a social recommendation model to help improve student learning outcomes. We received two years of anonymized audiobook consumption data. All students, schools and groupings in the data were anonymized, only identified by a randomly generated ID not traceable back to real entities by Google. Furthermore, all potentially identifiable metadata was only shared in an aggregated form, to protect students and institutions from being re-identified. The data consisted of time-stamped records of students’ interactions with audiobooks. For each interaction we have an anonymized student ID (which includes the student’s grade level and anonymized school ID), an audiobook identifier and a date. While many schools distribute students in a single grade across several classrooms, we leverage this metadata to make the simplifying assumption that all students in the same school and in the same grade level are in the same classroom. While this provides the foundation needed to build a better social recommender model, it’s important to note that this does not enable us to re-identify individuals, class groups or schools.

The STUDY algorithm

We framed the recommendation problem as a click-through rate prediction problem, where we model the conditional probability of a user interacting with each specific item conditioned on both 1) user and item characteristics and 2) the item interaction history sequence for the user at hand. Previous work suggests Transformer-based models, a widely used model class developed by Google Research, are well suited for modeling this problem. When each user is processed individually this becomes an autoregressive sequence modeling problem. We use this conceptual framework to model our data and then extend this framework to create the STUDY approach.

While this approach for click-through rate prediction can model dependencies between past and future item preferences for an individual user and can learn patterns of similarity across users at train time, it cannot model dependencies across different users at inference time. To recognize the social nature of reading and remedy this shortcoming we developed the STUDY model, which concatenates multiple sequences of books read by each student into a single sequence that collects data from multiple students in a single classroom.

However, this data representation requires care if it is to be modeled by transformers. In transformers, the attention mask is the matrix that controls which inputs can be used to inform the predictions of which outputs. The pattern of using all prior tokens in a sequence to inform the prediction of an output leads to the upper triangular attention matrix traditionally found in causal decoders. However, since the sequence fed into the STUDY model is not temporally ordered, even though each of its constituent subsequences is, a standard causal decoder is no longer a good fit for this sequence. When trying to predict each token, the model must not be allowed to attend to every token that precedes it in the sequence; some of these tokens might have timestamps that are later and contain information that would not be available at deployment time.

In this figure we show the attention mask typically used in causal decoders. Each row represents an input and each column represents an output. A value of 1 (shown as blue) for a matrix entry at a particular position denotes that the model can observe the input of that row when predicting the output of the corresponding column, whereas a value of 0 (shown as white) denotes the opposite.

The STUDY model builds on causal transformers by replacing the triangular matrix attention mask with a flexible attention mask with values based on timestamps to allow attention across different subsequences. Unlike a regular transformer, which would not allow attention across different subsequences and would have a triangular mask within each sequence, STUDY maintains a causal triangular attention matrix within a sequence and uses flexible, timestamp-dependent values across sequences. Hence, predictions at any output point in the sequence are informed by all input points that occurred in the past relative to the current time point, regardless of whether they appear before or after the current input in the sequence. This causal constraint is important because if it is not enforced at train time, the model could potentially learn to make predictions using information from the future, which would not be available for a real world deployment.

In (a) we show a sequential autoregressive transformer with causal attention that processes each user individually; in (b) we show an equivalent joint forward pass that results in the same computation as (a); and finally, in (c) we show that by introducing new nonzero values (shown in purple) to the attention mask we allow information to flow across users. We do this by allowing a prediction to condition on all interactions with an earlier timestamp, irrespective of whether the interaction came from the same user or not.
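The timestamp-based mask can be sketched with NumPy, using the same convention as the figure description above: rows are inputs, columns are outputs, and 1 means the input may inform the output. This is an illustrative construction. The function name is hypothetical, and treating ties across students as "not earlier" is an assumption.

```python
import numpy as np


def timestamp_attention_mask(user_ids, timestamps):
    """mask[i, j] == 1 means input i may inform the prediction at output j."""
    user_ids = np.asarray(user_ids)
    timestamps = np.asarray(timestamps)
    pos = np.arange(len(user_ids))
    same_user = user_ids[:, None] == user_ids[None, :]
    # Within one student's subsequence: the usual causal (upper triangular) rule.
    within = same_user & (pos[:, None] <= pos[None, :])
    # Across students: the input must have a strictly earlier timestamp.
    across = ~same_user & (timestamps[:, None] < timestamps[None, :])
    return (within | across).astype(int)


# Two students concatenated into one sequence: student 0 at t=1, 4; student 1 at t=2, 3.
mask = timestamp_attention_mask(user_ids=[0, 0, 1, 1], timestamps=[1, 4, 2, 3])
print(mask)
```

In the toy sequence, the interaction at t=2 sits after the one at t=4 in the concatenated order but is still allowed to inform it (mask[2, 1] is 1), while the reverse direction is blocked (mask[1, 2] is 0).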


We used the Learning Ally dataset to train the STUDY model along with multiple baselines for comparison. We implemented an autoregressive click-through rate transformer decoder, which we refer to as “Individual”, a k-nearest neighbor baseline (KNN), and a comparable social baseline, social attention memory network (SAMN). We used the data from the first school year for training and we used the data from the second school year for validation and testing.

We evaluated these models by measuring the percentage of the time the next item the user actually interacted with was in the model’s top n recommendations, i.e., hits@n, for different values of n. In addition to evaluating the models on the entire test set we also report the models’ scores on two subsets of the test set that are more challenging than the whole data set. We observed that students will typically interact with an audiobook over multiple sessions, so simply recommending the last book read by the user would be a strong trivial recommendation. Hence, the first test subset, which we refer to as “non-continuation”, is where we only look at each model’s performance on recommendations when the students interact with books that are different from the previous interaction. We also observe that students revisit books they have read in the past, so strong performance on the test set can be achieved by restricting the recommendations made for each student to only the books they have read in the past. Although there might be value in recommending old favorites to students, much value from recommender systems comes from surfacing content that is new and unknown to the user. To measure this we evaluate the models on the subset of the test set where the students interact with a title for the first time. We name this evaluation subset “novel”.
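The hits@n metric described above can be written down directly. This is a small sketch with made-up item identifiers; the function name and argument layout are assumptions.

```python
def hits_at_n(ranked_recommendations, next_items, n):
    """Fraction of cases where the true next item is in the model's top n."""
    hits = sum(
        next_item in ranked[:n]
        for ranked, next_item in zip(ranked_recommendations, next_items)
    )
    return hits / len(next_items)


# One ranked list per test interaction, best recommendation first.
recs = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["a", "c", "b"]]
actual_next = ["b", "f", "z", "a"]

for n in (1, 2, 3):
    print(n, hits_at_n(recs, actual_next, n))
```

The "non-continuation" and "novel" subsets would be obtained by filtering the test interactions before calling the same function.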

We find that STUDY outperforms all other tested models across almost every single slice we evaluated against.

In this figure we compare the performance of four models: STUDY, Individual, KNN and SAMN. We measure the performance with hits@5, i.e., how likely the model is to suggest the next title the user read within the model’s top 5 recommendations. We evaluate the model on the entire test set (all) as well as the novel and non-continuation splits. We see STUDY consistently outperforms the other three models presented across all splits.

Importance of appropriate grouping

At the heart of the STUDY algorithm is organizing users into groups and doing joint inference over multiple users who are in the same group in a single forward pass of the model. We conducted an ablation study where we looked at the importance of the actual groupings used on the performance of the model. In our presented model we group together all students who are in the same grade level and school. We then experiment with groups defined by all students in the same grade level and district and also place all students in a single group with a random subset used for each forward pass. We also compare these models against the Individual model for reference.

We found that using groups that were more localized was more effective, with the school and grade level grouping outperforming the district and grade level grouping. This supports the hypothesis that the STUDY model is successful because of the social nature of activities such as reading — people’s reading choices are likely to correlate with the reading choices of those around them. Both of these models outperformed the other two models (single group and Individual) where grade level is not used to group students. This suggests that data from users with similar reading levels and interests is beneficial for performance.

Future work

This work is limited to modeling recommendations for user populations where the social connections are assumed to be homogeneous. In the future it would be beneficial to model a user population where relationships are not homogeneous, i.e., where categorically different types of relationships exist or where the relative strength or influence of different relationships is known.


This work involved collaborative efforts from a multidisciplinary team of researchers, software engineers and educational subject matter experts. We thank our co-authors: Diana Mincu, Lauren Harrell, and Katherine Heller from Google. We also thank our colleagues at Learning Ally, Jeff Ho, Akshat Shah, Erin Walker, and Tyler Bastian, and our collaborators at Google, Marc Repnyek, Aki Estrella, Fernando Diaz, Scott Sanner, Emily Salkey and Lev Proleev.

Advances in document understanding
3 months, 2 weeks ago
Posted by Sandeep Tata, Software Engineer, Google Research, Athena Team

The last few years have seen rapid progress in systems that can automatically process complex business documents and turn them into structured objects. A system that can automatically extract data from documents, e.g., receipts, insurance quotes, and financial statements, has the potential to dramatically improve the efficiency of business workflows by avoiding error-prone, manual work. Recent models, based on the Transformer architecture, have shown impressive gains in accuracy. Larger models, such as PaLM 2, are also being leveraged to further streamline these business workflows. However, the datasets used in academic literature fail to capture the challenges seen in real-world use cases. Consequently, academic benchmarks report strong model accuracy, but these same models do poorly when used for complex real-world applications.

In “VRDU: A Benchmark for Visually-rich Document Understanding”, presented at KDD 2023, we announce the release of the new Visually Rich Document Understanding (VRDU) dataset that aims to bridge this gap and help researchers better track progress on document understanding tasks. We list five requirements for a good document understanding benchmark, based on the kinds of real-world documents for which document understanding models are frequently used. Then, we describe how most datasets currently used by the research community fail to meet one or more of these requirements, while VRDU meets all of them. We are excited to announce the public release of the VRDU dataset and evaluation code under a Creative Commons license.

Benchmark requirements

First, we compared state-of-the-art model accuracy (e.g., with FormNet and LayoutLMv2) on real-world use cases to academic benchmarks (e.g., FUNSD, CORD, SROIE). We observed that state-of-the-art models did not match academic benchmark results and delivered much lower accuracy in the real world. Next, we compared typical datasets for which document understanding models are frequently used with academic benchmarks and identified five dataset requirements that allow a dataset to better capture the complexity of real-world applications:

  • Rich Schema: In practice, we see a wide variety of rich schemas for structured extraction. Entities have different data types (numeric, strings, dates, etc.) that may be required, optional, or repeated in a single document or may even be nested. Extraction tasks over simple flat schemas like (header, question, answer) do not reflect typical problems encountered in practice.
  • Layout-Rich Documents: The documents should have complex layout elements. Challenges in practical settings come from the fact that documents may contain tables, key-value pairs, switch between single-column and double-column layout, have varying font-sizes for different sections, include pictures with captions and even footnotes. Contrast this with datasets where most documents are organized in sentences, paragraphs, and chapters with section headers — the kinds of documents that are typically the focus of classic natural language processing literature on long inputs.
  • Diverse Templates: A benchmark should include different structural layouts or templates. It is trivial for a high-capacity model to extract from a particular template by memorizing the structure. However, in practice, one needs to be able to generalize to new templates/layouts, an ability that the train-test split in a benchmark should measure.
  • High-Quality OCR: Documents should have high-quality Optical Character Recognition (OCR) results. Our aim with this benchmark is to focus on the VRDU task itself and to exclude the variability brought on by the choice of OCR engine.
  • Token-Level Annotation: Documents should contain ground-truth annotations that can be mapped back to corresponding input text, so that each token can be annotated as part of the corresponding entity. This is in contrast with simply providing the text of the value to be extracted for the entity. This is key to generating clean training data where we do not have to worry about incidental matches to the given value. For instance, in some receipts, the ‘total-before-tax’ field may have the same value as the ‘total’ field if the tax amount is zero. Having token level annotations prevents us from generating training data where both instances of the matching value are marked as ground-truth for the ‘total’ field, thus producing noisy examples.
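The receipt example in the last requirement can be made concrete with a small sketch. It contrasts labels derived by matching the value text against labels derived from token-level annotations. The token list, field names, and helper functions are hypothetical and only illustrate the idea.

```python
# OCR tokens from a hypothetical receipt where the tax amount is zero.
tokens = ["Subtotal", "42.00", "Tax", "0.00", "Total", "42.00"]


def labels_from_value(tokens, field, value):
    """Weak labeling: tag every token whose text equals the field's value."""
    return [field if token == value else "O" for token in tokens]


def labels_from_spans(tokens, spans):
    """Token-level labeling: tag only the annotated token positions."""
    labels = ["O"] * len(tokens)
    for field, indices in spans.items():
        for index in indices:
            labels[index] = field
    return labels


# Value matching marks both "42.00" tokens as 'total' (noisy).
noisy = labels_from_value(tokens, "total", "42.00")

# Token-level annotation keeps the two fields apart (clean).
clean = labels_from_spans(tokens, {"total-before-tax": [1], "total": [5]})
```

With value matching, the token at position 1 is wrongly tagged as `total`; with token-level spans, it keeps its `total-before-tax` label.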

VRDU datasets and tasks

The VRDU dataset is a combination of two publicly available datasets, Registration Forms and Ad-buy Forms. These datasets provide examples that are representative of real-world use cases, and satisfy the five benchmark requirements described above.

The Ad-buy Forms dataset consists of 641 documents with political advertisement details. Each document is either an invoice or receipt signed by a TV station and a campaign group. The documents use tables, multi-columns, and key-value pairs to record the advertisement information, such as the product name, broadcast dates, total price, and release date and time.

The Registration Forms dataset consists of 1,915 documents with information about foreign agents registering with the US government. Each document records essential information about foreign agents involved in activities that require public disclosure. Contents include the name of the registrant, the address of related bureaus, the purpose of activities, and other details.

We gathered a random sample of documents from the public Federal Communications Commission (FCC) and Foreign Agents Registration Act (FARA) sites, and converted the images to text using Google Cloud's OCR. We discarded a small number of documents that were several pages long and for which processing did not complete in under two minutes. This also allowed us to avoid sending very long documents for manual annotation, a task that can take over an hour for a single document. Then, we defined the schema and corresponding labeling instructions for a team of annotators experienced with document-labeling tasks.

The annotators were also provided with a few sample documents that we labeled ourselves. The task required annotators to examine each document, draw a bounding box around every occurrence of an entity from the schema, and associate that bounding box with the target entity. After the first round of labeling, a pool of experts was assigned to review the results. The corrected results are included in the published VRDU dataset. Please see the paper for more details on the labeling protocol and the schema for each dataset.

Existing academic benchmarks (FUNSD, CORD, SROIE, Kleister-NDA, Kleister-Charity, DeepForm) fall short on one or more of the five requirements we identified for a good document understanding benchmark. VRDU satisfies all of them. See our paper for background on each of these datasets and a discussion of how they fail to meet one or more of the requirements.

We built four different model training sets with 10, 50, 100, and 200 samples respectively. Then, we evaluated the VRDU datasets using three tasks (described below): (1) Single Template Learning, (2) Mixed Template Learning, and (3) Unseen Template Learning. For each of these tasks, we included 300 documents in the testing set. We evaluate models using the F1 score on the testing set.

  • Single Template Learning (STL): This is the simplest scenario where the training, testing, and validation sets only contain a single template. This simple task is designed to evaluate a model’s ability to deal with a fixed template. Naturally, we expect very high F1 scores (0.90+) for this task.
  • Mixed Template Learning (MTL): This task is similar to the task that most related papers use: the training, testing, and validation sets all contain documents belonging to the same set of templates. We randomly sample documents from the datasets and construct the splits to make sure the distribution of each template is not changed during sampling.
  • Unseen Template Learning (UTL): This is the most challenging setting, where we evaluate if the model can generalize to unseen templates. For example, in the Registration Forms dataset, we train the model with two of the three templates and test the model with the remaining one. The documents in the training, testing, and validation sets are drawn from disjoint sets of templates. To our knowledge, previous benchmarks and datasets do not explicitly provide such a task designed to evaluate the model’s ability to generalize to templates not seen during training.
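The difference between the mixed-template and unseen-template settings comes down to how documents are split. The sketch below assumes each document carries a `template` identifier. That field, the helper names, and the toy corpus are illustrative assumptions rather than the released VRDU tooling, and the validation split is omitted for brevity.

```python
import random


def unseen_template_split(docs, held_out_templates):
    """UTL-style split: train and test draw from disjoint template sets."""
    train = [d for d in docs if d["template"] not in held_out_templates]
    test = [d for d in docs if d["template"] in held_out_templates]
    return train, test


def mixed_template_split(docs, test_fraction, seed=0):
    """MTL-style split: sample per template so proportions are preserved."""
    rng = random.Random(seed)
    by_template = {}
    for doc in docs:
        by_template.setdefault(doc["template"], []).append(doc)
    train, test = [], []
    for template_docs in by_template.values():
        shuffled = template_docs[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Hypothetical corpus: 30 documents spread evenly over three templates.
docs = [{"id": i, "template": "T%d" % (i % 3)} for i in range(30)]

utl_train, utl_test = unseen_template_split(docs, {"T2"})
mtl_train, mtl_test = mixed_template_split(docs, test_fraction=0.2)
```

In the unseen-template split no template appears on both sides, while the mixed-template split keeps every template represented in both train and test.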

The objective is to be able to evaluate models on their data efficiency. In our paper, we compared two recent models using the STL, MTL, and UTL tasks and made three observations. First, unlike other benchmarks, VRDU is challenging and shows that models have plenty of room for improvement. Second, we show that few-shot performance for even state-of-the-art models is surprisingly low, with even the best models achieving an F1 score below 0.60. Third, we show that models struggle to deal with structured repeated fields and perform particularly poorly on them.
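For readers unfamiliar with the metric, the F1 scores quoted above combine precision and recall over extracted entities. The sketch below assumes exact matching on (field, value) pairs, which is a simplification made for illustration. The released VRDU evaluation code may define matches differently, and the entity values are invented.

```python
def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over sets of entities."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold annotations and model output for one document.
gold = {("total", "42.00"), ("date", "2023-01-05"), ("station", "WXYZ")}
predicted = {("total", "42.00"), ("date", "2023-01-06")}

# One of two predictions is correct (precision 1/2) and one of three
# gold entities is found (recall 1/3).
score = f1_score(predicted, gold)
```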


We release the new Visually Rich Document Understanding (VRDU) dataset that helps researchers better track progress on document understanding tasks. We describe why VRDU better reflects practical challenges in this domain. We also present experiments showing that VRDU tasks are challenging and that recent models have substantial headroom for improvement, in contrast to the datasets typically used in the literature, where F1 scores of 0.90+ are typical. We hope the release of the VRDU dataset and evaluation code helps research teams advance the state of the art in document understanding.


Many thanks to Zilong Wang, Yichao Zhou, Wei Wei, and Chen-Yu Lee, who co-authored the paper along with Sandeep Tata. Thanks to Marc Najork, Riham Mansour and numerous partners across Google Research and the Cloud AI team for providing valuable insights. Thanks to John Guilyard for creating the animations in this post.

AdaTape: Foundation model with adaptive computation and dynamic read-and-write
3 months, 2 weeks ago
Posted by Fuzhao Xue, Research Intern, and Mostafa Dehghani, Research Scientist, Google

Adaptive computation refers to the ability of a machine learning system to adjust its behavior in response to changes in the environment. While conventional neural networks have a fixed function and computation capacity, i.e., they spend the same number of FLOPs for processing different inputs, a model with adaptive and dynamic computation modulates the computational budget it dedicates to processing each input, depending on the complexity of the input.

Adaptive computation in neural networks is appealing for two key reasons. First, the mechanism that introduces adaptivity provides an inductive bias that can play a key role in solving some challenging tasks. For instance, enabling different numbers of computational steps for different inputs can be crucial in solving arithmetic problems that require modeling hierarchies of different depths. Second, it gives practitioners the ability to tune the cost of inference through greater flexibility offered by dynamic computation, as these models can be adjusted to spend more FLOPs processing a new input.

Neural networks can be made adaptive by using different functions or computation budgets for various inputs. A deep neural network can be thought of as a function that outputs a result based on both the input and its parameters. To implement adaptive function types, a subset of parameters is selectively activated based on the input, a process referred to as conditional computation. Adaptivity based on the function type has been explored in studies on mixture-of-experts, where the sparsely activated parameters for each input sample are determined through routing.

Another area of research in adaptive computation involves dynamic computation budgets. Unlike in standard neural networks, such as T5, GPT-3, PaLM, and ViT, whose computation budget is fixed for different samples, recent research has demonstrated that adaptive computation budgets can improve performance on tasks where transformers fall short. Many of these works achieve adaptivity by using dynamic depth to allocate the computation budget. For example, the Adaptive Computation Time (ACT) algorithm was proposed to provide an adaptive computational budget for recurrent neural networks. The Universal Transformer extends the ACT algorithm to transformers by making the computation budget dependent on the number of transformer layers used for each input example or token. Recent studies, like PonderNet, follow a similar approach while improving the dynamic halting mechanisms.

In the paper “Adaptive Computation with Elastic Input Sequence”, we introduce a new model that utilizes adaptive computation, called AdaTape. This model is a Transformer-based architecture that uses a dynamic set of tokens to create elastic input sequences, providing a unique perspective on adaptivity in comparison to previous works. AdaTape uses an adaptive tape reading mechanism to determine a varying number of tape tokens that are added to each input based on the input’s complexity. AdaTape is very simple to implement and provides an effective knob to increase accuracy when needed, but is also much more efficient than other adaptive baselines because it directly injects adaptivity into the input sequence instead of the model depth. Finally, AdaTape offers better performance on standard tasks, like image classification, as well as algorithmic tasks, while maintaining a favorable quality and cost tradeoff.

Adaptive computation transformer with elastic input sequence

AdaTape uses both the adaptive function types and a dynamic computation budget. Specifically, for a batch of input sequences after tokenization (e.g., a linear projection of non-overlapping patches from an image in the vision transformer), AdaTape uses a vector representing each input to dynamically select a variable-sized sequence of tape tokens.

AdaTape uses a bank of tokens, called a “tape bank”, to store all the candidate tape tokens that interact with the model through the adaptive tape reading mechanism. We explore two different methods for creating the tape bank: an input-driven bank and a learnable bank.

The general idea of the input-driven bank is to extract a bank of tokens from the input while employing a different approach than the original model tokenizer for mapping the raw input to a sequence of input tokens. This enables dynamic, on-demand access to information from the input that is obtained using a different point of view, e.g., a different image resolution or a different level of abstraction.

In some cases, tokenization at a different level of abstraction is not possible, so an input-driven tape bank is not feasible, such as when it's difficult to further split each node in a graph transformer. To address this issue, AdaTape offers a more general approach for generating the tape bank by using a set of trainable vectors as tape tokens. This approach is referred to as the learnable bank and can be viewed as an embedding layer where the model can dynamically retrieve tokens based on the complexity of the input example. The learnable bank enables AdaTape to generate a more flexible tape bank, providing it with the ability to dynamically adjust its computation budget based on the complexity of each input example. For instance, more complex examples retrieve more tokens from the bank, which lets the model not only use the knowledge stored in the bank but also spend more FLOPs processing it, since the input is now larger.

Finally, the selected tape tokens are appended to the original input and fed to the following transformer layers. For each transformer layer, the same multi-head attention is used across all input and tape tokens. However, two different feed-forward networks (FFN) are used: one for all tokens from the original input and the other for all tape tokens. We observed slightly better quality by using separate feed-forward networks for input and tape tokens.
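The routing of tokens to two feed-forward networks described above can be sketched in a few lines. This is a toy illustration under strong assumptions: tokens are plain lists of floats, each "FFN" is a single linear layer with ReLU, and the shared attention step is left out. None of the names or weights come from the AdaTape implementation.

```python
def linear_relu(weights, bias):
    """Build a tiny one-layer stand-in for an FFN (matrix-vector + ReLU)."""
    def apply(vector):
        return [
            max(0.0, sum(w * x for w, x in zip(row, vector)) + b)
            for row, b in zip(weights, bias)
        ]
    return apply


def route_ffn(input_tokens, tape_tokens, input_ffn, tape_ffn):
    """Append tape tokens to the input, then apply the FFN for each kind."""
    sequence = input_tokens + tape_tokens
    is_tape = [False] * len(input_tokens) + [True] * len(tape_tokens)
    return [
        tape_ffn(token) if tape else input_ffn(token)
        for token, tape in zip(sequence, is_tape)
    ]


# Hypothetical weights: the input FFN is identity + ReLU, the tape FFN doubles.
input_ffn = linear_relu([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
tape_ffn = linear_relu([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0])

out = route_ffn(
    input_tokens=[[1.0, -1.0], [0.5, 2.0]],
    tape_tokens=[[1.0, 3.0]],
    input_ffn=input_ffn,
    tape_ffn=tape_ffn,
)
```

The output sequence is one token longer than the original input, which is where the extra computation for harder inputs comes from.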

An overview of AdaTape. For different samples, we pick a variable number of different tokens from the tape bank. The tape bank can be driven from input, e.g., by extracting some extra fine-grained information or it can be a set of trainable vectors. Adaptive tape reading is used to recursively select different sequences of tape tokens, with variable lengths, for different inputs. These tokens are then simply appended to inputs and fed to the transformer encoder.

AdaTape provides helpful inductive bias

We evaluate AdaTape on parity, a very challenging task for the standard Transformer, to study the effect of inductive biases in AdaTape. With the parity task, given a sequence of 1s, 0s, and -1s, the model has to predict the evenness or oddness of the number of 1s in the sequence. Parity is the simplest non-counter-free or periodic regular language, but perhaps surprisingly, the task is unsolvable by the standard Transformer.
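To make the task concrete, the sketch below generates parity examples and computes the label in two ways: by counting, and by a single bit of state that is toggled on every 1. The helper names and the sequence generator are assumptions for illustration and are not the experimental setup from the paper.

```python
import random


def parity_label(sequence):
    """Return 1 if the number of 1s is odd, 0 if it is even."""
    return sum(1 for value in sequence if value == 1) % 2


def running_parity(sequence):
    """Same label, computed with one bit of state updated per element."""
    state = 0
    for value in sequence:
        if value == 1:
            state ^= 1  # toggle on each 1; 0s and -1s leave it unchanged
    return state


def make_example(length, rng):
    """Sample a sequence over {-1, 0, 1} together with its parity label."""
    sequence = [rng.choice((-1, 0, 1)) for _ in range(length)]
    return sequence, parity_label(sequence)


rng = random.Random(0)
examples = [make_example(10, rng) for _ in range(100)]
```

The single-bit version shows how little state the task needs once a model can carry information from one step to the next.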

Evaluation on the parity task. The standard Transformer and Universal Transformer were unable to perform this task, both showing performance at the level of a random guessing baseline.

Despite being evaluated on short, simple sequences, both the standard Transformer and Universal Transformers were unable to perform the parity task as they are unable to maintain a counter within the model. However, AdaTape outperforms all baselines, as it incorporates a lightweight recurrence within its input selection mechanism, providing an inductive bias that enables the implicit maintenance of a counter, which is not possible in standard Transformers.

Evaluation on image classification

We also evaluate AdaTape on the image classification task. To do so, we trained AdaTape on ImageNet-1K from scratch. The figure below shows the accuracy of AdaTape and the baseline methods, including A-ViT and the Universal Transformer ViT (UViT and U2T), versus their speed (measured as the number of images processed by each model per second). In terms of quality and cost tradeoff, AdaTape performs much better than the alternative adaptive transformer baselines. In terms of efficiency, larger AdaTape models (in terms of parameter count) are faster than smaller baselines. Such results are consistent with findings from previous work showing that adaptive model depth architectures are not well suited for many accelerators, like the TPU.

We evaluate AdaTape by training on ImageNet from scratch. For A-ViT, we not only report their results from the paper but also re-implement A-ViT by training from scratch, i.e., A-ViT(Ours).

A study of AdaTape’s behavior

In addition to its performance on the parity task and ImageNet-1K, we also evaluated the token selection behavior of AdaTape with an input-driven bank on the JFT-300M validation set. To better understand the model's behavior, we visualized the token selection results on the input-driven bank as heatmaps, where lighter colors mean that position is more frequently selected. The heatmaps reveal that AdaTape more frequently picks the central patches. This aligns with our prior knowledge, as central patches are typically more informative — especially in the context of datasets with natural images, where the main object is in the middle of the image. This result highlights the intelligence of AdaTape, as it can effectively identify and prioritize more informative patches to improve its performance.
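A selection heatmap of this kind can be built by counting how often each patch position is picked across images. The sketch below assumes that the selected patch indices per image are already available as flat indices into a square grid. The function name and the toy selections are hypothetical.

```python
def selection_heatmap(selected_per_image, grid_size):
    """Per-position selection frequency over a grid_size x grid_size grid.

    Assumes each patch index appears at most once per image.
    """
    counts = [[0] * grid_size for _ in range(grid_size)]
    for selected in selected_per_image:
        for index in selected:
            row, col = divmod(index, grid_size)
            counts[row][col] += 1
    num_images = len(selected_per_image)
    return [[count / num_images for count in row] for row in counts]


# Hypothetical selections for four images on a 3x3 patch grid.
# Index 4 is the central patch.
picks = [[4, 1], [4], [4, 8], [0, 4]]
heat = selection_heatmap(picks, grid_size=3)
```

In this toy data the central patch is selected for every image, so it receives the highest frequency in the grid.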

We visualize the tape token selection heatmap of AdaTape-B/32 (left) and AdaTape-B/16 (right). The hotter / lighter color means the patch at this position is more frequently selected.


AdaTape is characterized by elastic sequence lengths generated by the adaptive tape reading mechanism. This also introduces a new inductive bias that gives AdaTape the potential to solve tasks that are challenging for both standard transformers and existing adaptive transformers. By conducting comprehensive experiments on image recognition benchmarks, we demonstrate that AdaTape outperforms standard transformers and adaptive architecture transformers when computation is held constant.


One of the authors of this post, Mostafa Dehghani, is now at Google DeepMind.