Nobody likes lag: How to make low-latency dev sandboxes
Posted by mnazzaro 1 day ago
Comments
Comment by tuhgdetzhh 22 hours ago
In contrast, AWS Lambdas, which run on Firecracker, have sub-second startup latency, often just a few hundred milliseconds.
Is there anything comparable on GCP that achieves similar low latency cold starts?
Comment by mnazzaro 21 hours ago
Comment by nicolaslecomte 19 hours ago
We had a similar bottleneck building out our sandbox routing layer, where we were doing a lookup to a centralized db to route the query. We found that even with a fast KV store, that lookup still added too much overhead. We moved to encoding the routing logic (like region, cluster ID, etc.) directly into the subdomain/hostname. This allowed us to drop the db read entirely on the hot path and rely on Anycast + latency-based DNS to route the user to exactly the right regional gateway instantly. Also, if you ever find yourselves outgrowing standard HTTP proxies for those long-lived agent sessions, I highly recommend looking at Pingora. It gave us way more control over connection lifecycles than NGINX.
For the compute aspect, sandbox pooling is cool but might kill your unit economics, especially if at some point each tenant has a different image. Have you looked into memory snapshots (that way you only pay for storage, not for full VMs)?
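A minimal sketch of the hostname-encoded routing idea described in the comment above: the region and cluster live in the subdomain, so the gateway can pick an upstream with pure string parsing and no db lookup on the hot path. The hostname shape, labels, and upstream naming scheme here are all hypothetical, not anything from the article.

```go
package main

import (
	"fmt"
	"strings"
)

// route holds everything the gateway needs to pick an upstream.
type route struct {
	SandboxID string
	Region    string
	Cluster   string
}

// parseHost extracts routing info from a hostname shaped like
// "<sandbox-id>--<region>--<cluster>.sandboxes.example.dev".
func parseHost(host string) (route, error) {
	// Strip the port, if any, then the shared parent domain.
	host = strings.Split(host, ":")[0]
	sub, ok := strings.CutSuffix(host, ".sandboxes.example.dev")
	if !ok {
		return route{}, fmt.Errorf("unexpected host %q", host)
	}
	parts := strings.Split(sub, "--")
	if len(parts) != 3 {
		return route{}, fmt.Errorf("malformed subdomain %q", sub)
	}
	return route{SandboxID: parts[0], Region: parts[1], Cluster: parts[2]}, nil
}

// upstream maps a parsed route to an internal gateway address; no network
// or database round trip is involved.
func upstream(r route) string {
	return fmt.Sprintf("http://%s.%s.internal:8080/%s", r.Cluster, r.Region, r.SandboxID)
}

func main() {
	r, err := parseHost("sbx-7f3a--use1--c07.sandboxes.example.dev:443")
	if err != nil {
		panic(err)
	}
	fmt.Println(upstream(r)) // http://c07.use1.internal:8080/sbx-7f3a
}
```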
Comment by metadat 12 hours ago
Comment by jasonjmcghee 11 hours ago
Comment by rbbydotdev 6 hours ago
Comment by iterateoften 19 hours ago
Comment by simonw 19 hours ago
Coding agents are still a relatively new category to most people. Claude Code dates back to February last year, and it took a while for the general engineering public to understand why that format - coding LLMs that can execute and iterate on the code they are writing - was such a big deal.
As a result the demand for good sandboxing options is skyrocketing.
It also takes a while for new solutions to spin up - if someone realized sandboxes were a good commercial idea back in September last year, the products they built may only just be ready for people to start trying out today.
Comment by ambicapter 19 hours ago
Comment by simonw 17 hours ago
Comment by phainopepla2 19 hours ago
Comment by esperent 18 hours ago
My level of involvement decreases from step to step. I'm totally in control of the initial plan. I give strong oversight of the technical plan. But by the time it comes to executing, I'm happy to let it completely take over, and I'll review either at the end or, for long plans, break it down into 2 - 4 phases and review after each phase.
For this final step, which might be 30 minutes, I'll step out and do something else. I want to be sure nothing bad will happen on my machine if I do that, so sandboxing is important.
Comment by skinner927 3 hours ago
Comment by theblazehen 48 minutes ago
https://awesomeclaude.ai/ralph-wiggum has some tips and examples of it
Comment by cedws 19 hours ago
Comment by reactordev 19 hours ago
Comment by subscribed 5 hours ago
Comment by jpalepu33 22 hours ago
The lesson about "delete code to improve performance" resonates. I've been down similar paths where adding middleware/routing layers seemed like good abstractions, but they ended up being the performance bottleneck.
A few thoughts on this approach:
1. Warm pools are brilliant but expensive - how are you handling the economics? With multi-region pools, you're essentially paying for idle capacity across multiple data centers. I'm curious how you balance pool size vs. cold start probability.
2. Fly's replay mechanism is clever, but that initial bounce still adds latency. Have you considered using GeoDNS to route users to the correct regional endpoint from the start? Though I imagine the caching makes this a non-issue after the first request.
3. For the JWT approach - are you rotating these tokens per-session? Just thinking about the security implications if someone intercepts the token.
The 79ms → 14ms improvement is night and day for developer experience. Latency under 20ms feels instant to humans, so you've hit that sweet spot.
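A back-of-the-envelope sketch of the pool-size vs. cold-start trade-off raised in point 1 above: if sandbox requests arriving during one pool-replenishment window are roughly Poisson, the chance of draining a warm pool of size n is the Poisson tail P(X > n). The arrival rate and refill time below are made-up numbers, not figures from the article.

```go
package main

import (
	"fmt"
	"math"
)

// poissonTail returns P(X > n) for X ~ Poisson(lambda).
func poissonTail(lambda float64, n int) float64 {
	term := math.Exp(-lambda) // P(X = 0)
	cdf := term
	for k := 1; k <= n; k++ {
		term *= lambda / float64(k) // P(X = k) from P(X = k-1)
		cdf += term
	}
	return 1 - cdf
}

func main() {
	const (
		requestsPerMinute = 20.0 // assumed regional demand
		refillMinutes     = 0.5  // assumed time to replenish the pool
	)
	lambda := requestsPerMinute * refillMinutes // expected arrivals per window
	for _, poolSize := range []int{10, 15, 20, 25} {
		fmt.Printf("pool=%2d  P(cold start in a window) ≈ %.4f\n",
			poolSize, poissonTail(lambda, poolSize))
	}
}
```

With these assumed numbers, going from a pool of 10 to 20 takes the per-window exhaustion probability from roughly 40% down to a fraction of a percent, which is the kind of curve you end up pricing against idle capacity.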
Comment by mnazzaro 21 hours ago
Comment by hinkley 21 hours ago
Splunk was a particular problem that way, but I also started seeing it with Grafana, at least in extremis, once we migrated to self-hosted on AWS from a vendor. Most times it was fine, but if we had a bug that none of the teams could quickly disavow as being theirs, we had a lot of chefs in the kitchen and things would start to hiccup.
There can be thundering herds in dev. And a bunch of people trying a repro case in a thirty second window can be one of them. The question is if anyone has the spare bandwidth to notice that it’s happening or if everyone trudges along making the same mistakes every time.
Comment by barishnamazov 22 hours ago
Comment by mnazzaro 21 hours ago
Comment by mlhpdx 20 hours ago
Comment by mnazzaro 19 hours ago
Comment by mlhpdx 19 hours ago
Comment by mnazzaro 19 hours ago
Comment by formerly_proven 19 hours ago
Comment by jgtrosh 19 hours ago
Comment by mlhpdx 19 hours ago
Comment by nickandbro 16 hours ago
Comment by sam_lowry_ 19 hours ago
Comment by mnazzaro 18 hours ago
Comment by williamstein 17 hours ago
Comment by globular-toast 9 hours ago
Comment by hinkley 21 hours ago
I also switched a head-of-line service call that was, for reasons I never sorted out, costing us 30ms TTFB per request for basically fifty bytes of data, over to a long poll in Consul, because the data was only meant to change at most once every half hour and in practice changed twice a week. So that latency was hidden in the dev sandbox except at startup, where we had several Consul keys being fetched in parallel and applied in order, so one more was hardly noticeable.
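A minimal sketch of the Consul long-poll pattern described above, using Consul's blocking-query API on the KV store: instead of paying a synchronous read on every request, watch a key and only re-fetch when Consul reports a change. The key name and agent address are hypothetical, and error handling is trimmed for brevity.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strconv"
	"time"
)

func watch(key string) {
	var index uint64 // X-Consul-Index from the previous response
	client := &http.Client{Timeout: 6 * time.Minute}
	for {
		// The request blocks on the Consul agent until the key changes
		// or the wait window (5m here) expires.
		url := fmt.Sprintf("http://127.0.0.1:8500/v1/kv/%s?index=%d&wait=5m", key, index)
		resp, err := client.Get(url)
		if err != nil {
			time.Sleep(time.Second) // back off and retry on transient errors
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		newIndex, _ := strconv.ParseUint(resp.Header.Get("X-Consul-Index"), 10, 64)
		if newIndex != index {
			index = newIndex
			fmt.Printf("config changed (index %d): %s\n", index, body)
			// ...apply the ~fifty bytes of config here...
		}
	}
}

func main() {
	watch("service/frontend/config") // hypothetical key
}
```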
The nasty one, though, was that Artifactory didn’t compress its REST responses, and when you have a CI/CD pipeline that’s been running for six years with half a hundred devs, that response is huge because npm is teh dumb. So our poor UI lead kept having npm install time out, and the UI team’s answer for “my environment isn’t working” started with clearing your downloaded deps and starting over.
They finally fixed it after we (and presumably half of the rest of their customers) complained, but I was on the back 9 of migrating our entire deployment pipeline to Docker, so I had nginx config fairly fresh in my brain and set them up a forward proxy to do compression termination. It still blew up once a week, but that was better than him spending half his day praying to the gods of chaos.
Comment by PaulHoule 21 hours ago
Comment by hinkley 17 hours ago
They say doctors make the worst patients. I wonder if programmers are the least useful of users.
Comment by gsf_emergency_6 16 hours ago
Wait on tasks that are urgent,
Act on ideas that are important.
Is that even more dangerous?
Comment by alooPotato 22 hours ago
Comment by mnazzaro 21 hours ago
If we were starting from 0, I would definitely try it. My favorite thing about it is the progressive checkpointing - you can snapshot file system deltas and store them at S3 prices. Cool stuff!
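A rough sketch of that progressive-checkpointing idea: capture only the filesystem delta (e.g. an overlayfs upper dir) as a tarball and push it to object storage, so a checkpoint costs S3-style storage rather than a running VM. The paths, bucket, and presigned URL below are assumptions for illustration, not the vendor's actual API.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// tarDelta packs every regular file under dir (the writable layer) into a
// gzipped tar held in memory.
func tarDelta(dir string) (*bytes.Buffer, error) {
	buf := &bytes.Buffer{}
	gz := gzip.NewWriter(buf)
	tw := tar.NewWriter(gz)

	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil || !info.Mode().IsRegular() {
			return err // skip dirs, symlinks, devices in this sketch
		}
		rel, _ := filepath.Rel(dir, path)
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		hdr.Name = rel
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(tw, f)
		return err
	})
	if err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf, gz.Close()
}

func main() {
	delta, err := tarDelta("/var/lib/sandbox/overlay/upper") // hypothetical delta dir
	if err != nil {
		panic(err)
	}
	// Upload to a presigned object-storage URL (hypothetical placeholder).
	req, _ := http.NewRequest(http.MethodPut,
		"https://example-bucket.s3.amazonaws.com/checkpoints/sbx-7f3a/0001.tar.gz?X-Amz-Signature=placeholder", delta)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	fmt.Println("checkpoint uploaded:", resp.Status)
}
```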
Comment by imiric 18 hours ago
This article reads like a thinly veiled ad. Certainly not the best way to start a technical blog. If you didn't have the technical insight to know that physics is a factor in latency, why should I trust you with the problems your product actually solves?
Comment by hackomorespacko 18 hours ago
Comment by yellow_lead 12 hours ago
Valuable insight /s