Designing the Right Infrastructure for Long-running Agentic Queries

A user would launch a case study. Arbiter would fan out data collection across the time period and the selected social media platforms. The batches that came back fed into a downstream pipeline: normalization, NLP enrichment, embedding, and indexing for search. The whole pipeline ran asynchronously, with each stage scaling out independently to keep up. Once all of that had finished, a preliminary analysis would start running for the user to see.

But each stage had to communicate with the others to know when to start, and the analysis stage in particular needed to know when collection and enrichment were done. The system was designed to be event-driven and serverless, so each Lambda only knew its own piece. None of them knew the whole. We managed the coordination by threading bookkeeping keys through every batch, so we could track which stage each one belonged to. As the time period or the number of search topics for a case study grew, the pile of keys grew faster than our systems could keep track of.

Collection fans out by platform and time window; batches then stream through queue-separated stages to an analysis gated on collection and enrichment.
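To make the fan-out concrete, here is a minimal sketch of how a case study's time period might be split into windows and paired with each platform. It is an illustration, not Arbiter's actual code: the names `CollectionBatch` and `plan_batches` are hypothetical, and the seven-day default window is an assumption chosen only for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from itertools import product


@dataclass(frozen=True)
class CollectionBatch:
    """One unit of collection work (hypothetical shape)."""
    platform: str
    window_start: date
    window_end: date  # exclusive


def plan_batches(platforms, start, end, window=timedelta(days=7)):
    """Split [start, end) into windows and pair every window with every platform.

    The window size is an assumed parameter, not a value from the real system.
    """
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + window, end)  # clip the last window
        windows.append((cursor, window_end))
        cursor = window_end
    return [
        CollectionBatch(platform, w_start, w_end)
        for platform, (w_start, w_end) in product(platforms, windows)
    ]


if __name__ == "__main__":
    batches = plan_batches(["platform_a", "platform_b"],
                           date(2024, 1, 1), date(2024, 1, 10))
    for batch in batches:
        print(batch)
```

The batch count is the number of platforms times the number of windows, so a longer time period or a wider platform selection multiplies the work that later has to be tracked.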

The question most users wanted answered was simple: is this case study done? To answer it, every collector wrote its status to a DynamoDB table when it finished, and we polled that table. A check-completion Lambda ran every fifteen minutes to see whether the upstream tasks were done. Fifteen minutes was the longest a single Lambda could run, so we used it as the gap between checks: anything in flight at one check had at least had the chance to finish before we asked again. If the upstream tasks were done, analysis kicked off. If not, the check scheduled itself to run again in another fifteen minutes. We capped this at three snoozes. After the third, we stopped waiting and moved on to analysis with whatever the collectors had managed to report in, on the assumption that by then we had the full picture.
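The check-and-snooze logic described above can be sketched roughly as follows. This is a simplified, in-memory illustration rather than the production Lambda. The names `check_completion`, `Decision`, and `simulate` are hypothetical, the DynamoDB table is replaced by a plain dictionary of finish times, and counting the cap as three checks at minutes 15, 30, and 45 is an assumption. The `complete` field exists only to make the failure visible in the example; the system described here did not record it.

```python
from dataclasses import dataclass

CHECK_INTERVAL_MINUTES = 15  # gap between checks, per the text
MAX_CHECKS = 3               # assumed reading of "three snoozes"


@dataclass(frozen=True)
class Decision:
    action: str     # "start_analysis" or "snooze"
    complete: bool  # illustration only: did every collector report?


def check_completion(expected, reported, check_number):
    """Decide what to do at one check, given which collectors have reported."""
    missing = set(expected) - set(reported)
    if not missing:
        return Decision("start_analysis", complete=True)
    if check_number >= MAX_CHECKS:
        # Cap reached: proceed with whatever has arrived.
        return Decision("start_analysis", complete=False)
    return Decision("snooze", complete=False)


def simulate(finish_minutes):
    """Return (minute analysis starts, whether the data was complete).

    finish_minutes maps a collector name to the minute it reports done.
    """
    for check_number in range(1, MAX_CHECKS + 1):
        now = check_number * CHECK_INTERVAL_MINUTES
        reported = {name for name, t in finish_minutes.items() if t <= now}
        decision = check_completion(finish_minutes.keys(), reported, check_number)
        if decision.action == "start_analysis":
            return now, decision.complete
    raise AssertionError("unreachable: the final check always starts analysis")


if __name__ == "__main__":
    print(simulate({"fast": 10, "medium": 20}))  # both done by the second check
    print(simulate({"fast": 10, "slow": 50}))    # slow collector misses the cap
```

In the second call, analysis starts at minute 45 even though one collector does not report until minute 50, and nothing downstream would know unless the incompleteness was recorded somewhere.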

The technical mechanism worked, but the assumption underlying it was invalid. Some collectors took longer than forty-five minutes to return their data, depending on the volume of posts, platform rate limits, traffic spikes, or cloud provider outages (sadly a real thing) on a given day. The analysis would run anyway, on the data we had, and nothing in the system told us part of the data was still missing.

After three 15-minute checks, analysis starts at 45 minutes, while a slow collector is still returning data.

When we sketched Arbiter on a whiteboard early on, four problems stood out: collection, enrichment, search, and storage. We built around each of them in our first production version of the system. Collection had to pull from many social media platforms, each with its own quirks and its own limits on how much you could pull at once. Enrichment had to turn millions of raw posts into the structured signal a user could actually work with: entities, sentiment, embeddings, and the other features they wanted to search and filter by. Search had to stay fast for a user even as the corpus kept growing. Storage came down to early choices about how to organize data so the system would still work as it scaled. The system we built reflected those constraints: collection ran separately from the shorter-lived processing stages, queues sat between every step so the stages could move at their own pace, each stage scaled on its own, and search ran on Elasticsearch with data organized by week. Each is a different subsystem, and we will go deeper into each in coming posts.

As a result, Arbiter became very good at moving content through a pipeline. But the stages did not communicate with each other in time to flag where a case study was in the pipeline or to point at failures when they happened. The system was therefore not good at reasoning about a case study as a single unit of work. At each stage we could see individual posts, queue messages, and enrichment jobs, but not the case study they belonged to.

That gap looks small in theory. But in practice, it produced three problems we kept trying to engineer around and could not.

The snooze chain was the most obvious of them: each stage scaled independently and no stage had a view of the whole, so the only way to know whether a case study had finished was to poll, and polling cost fifteen minutes a try.

The second was observability. We had reasonable visibility into the pipeline itself, into logs and metrics and queue depths and container health. What we had very little visibility into was the thing a user actually cared about, which was their case study. Debugging a slow or broken run meant reconstructing it by hand from fragments scattered across services.

The third class of problems kept arriving as feature requests. Users wanted to know how far along their case study was, whether something had failed, what was running now, whether they could close the tab. Those sound like product questions. Underneath, they were structural ones: the system had no native concept of workflow state, so there was very little meaningful state to expose, and every new piece of progress reporting became a new piece of bespoke plumbing.

These three problems looked unrelated when we first hit them. They were the same missing concept showing up in three different layers of the system: the layer that did the work, the layer that watched it run, and the layer that talked to the user about it.

The old coordination: stage Lambdas wrote status to DynamoDB, and a separate Lambda polled that table every 15 minutes to guess whether the run was done. No orchestrator owned the run.

When it came time to redesign, we deliberately avoided the framing engineering teams default to in moments like this: what is the best architecture? That framing tends to produce vendor comparisons and component swaps. It does not usually produce the realization that is actually needed. The question we asked instead was narrower: what architecture would let the system understand a case study as a first-class object? A small reframe, and a large change. The sentence that fell out of it, the one that ended up on a whiteboard for the rest of the redesign, was simpler: a social intelligence system needs to understand case studies, not just data.

What we were hearing from users pointed in the same direction. The more we talked to them once they started running case studies on Arbiter, the more we kept hearing the same thing: they wanted transparency into how their case study was actually built. They did not just want to look at the final outcome. They wanted to see when a stage had finished, to understand how their investigation was being conducted, and to adjust the scope mid-run based on what they were seeing along the way. None of that was possible in a system that could not even tell us internally where a case study was, let alone show that to the user.

That realization led to one of the largest architectural redesigns in Arbiter's admittedly nascent history. We will explore that redesign in the next post: what we chose, what changed for the team, and what changed for Arbiter's users. Stay tuned to our newsletter to learn more and drop us a line to chat about this if you're also building in the space and would like to compare notes!

Contributors — Utkarsh Verma, Atmik Shetty, Dhara Mungra, Delisha Naik

