The Token Tax You Didn't Know You Were Paying
Posted by singularly 3 hours ago
I watched it happen again. I asked the AI agent a simple question about my cloud infrastructure, and within minutes, it hit the wall: “Context limit reached.” Or worse, it started hallucinating because it was buried under 15,000 tokens of raw JSON output it didn’t actually need.
When we use tools like Claude Code or other autonomous agents, we’re essentially paying them to read. The problem is, most of what we feed them is noise.
Think about a standard aws ec2 describe-instances call. It returns a mountain of data: null fields for features you don’t use, empty arrays, and 800-token-long base64 certificate blobs. The agent was spending half its “brainpower” just navigating that noise to find the one EC2 instance I actually asked about.
I realized the agent didn’t need a better prompt. It needed a sieve.
The Proof: Seeing the Noise Disappear
To understand why this matters, you have to see what your agent is actually struggling with. Here is an example of how TokenSieve processes a standard cloud response.
Example 1: Killing the “Base64 Bloat”
In a typical EKS cluster description, the API returns a massive PEM certificate. This single field can eat up ~800 tokens.
The Raw JSON:
{
  "cluster": {
    "name": "prod-cluster",
    "certificateAuthority": { "data": "ASDASDASDASDDDA0tCk1JSUR..." },
    "tags": {},
    "status": "ACTIVE"
  }
}
The TokenSieve Output:
cluster.name: prod-cluster
cluster.certificateAuthority.data: <base64 1476 chars>
cluster.status: ACTIVE
Note: The empty tags object is pruned entirely, and the certificate is replaced by a 4-token placeholder.
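The core of this transform fits in a few lines. Here is a minimal Python sketch of the idea (not TokenSieve’s actual Rust code; the 200-character threshold and the placeholder format are my own assumptions for illustration): recursively drop nulls and empty containers, and stub out long opaque strings.

```python
import json

PLACEHOLDER_THRESHOLD = 200  # assumed cutoff; the real tool's heuristic may differ

def sieve(value):
    """Recursively prune nulls/empty containers and stub out long opaque strings."""
    if isinstance(value, dict):
        pruned = {k: sieve(v) for k, v in value.items()}
        # Drop keys whose values carry no information.
        return {k: v for k, v in pruned.items() if v not in (None, {}, [])}
    if isinstance(value, list):
        return [sieve(v) for v in value if v is not None]
    if isinstance(value, str) and len(value) > PLACEHOLDER_THRESHOLD:
        # Replace the blob with a short placeholder that still records its size.
        return f"<opaque {len(value)} chars>"
    return value

# Stand-in for the EKS response above, with a 1476-char fake certificate.
raw = {
    "cluster": {
        "name": "prod-cluster",
        "certificateAuthority": {"data": "A" * 1476},
        "tags": {},
        "status": "ACTIVE",
    }
}

print(json.dumps(sieve(raw), indent=2))
```

Running this prunes the empty tags object and collapses the certificate to a placeholder, mirroring the output shown above.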
Example 2: Compressing Redundant Lists
When you list VPCs or Subnets, the API repeats the same keys (VpcId, State, OwnerId) over and over for every single item.
The Raw JSON:
[
  { "VpcId": "vpc-001", "CidrBlock": "x.x.x.x/16", "State": "available" },
  { "VpcId": "vpc-002", "CidrBlock": "y.y.y.y/16", "State": "available" }
]
The TokenSieve Output (SchemaYAML):
schema: [VpcId, CidrBlock, State]
data:
  - [vpc-001, x.x.x.x/16, available]
  - [vpc-002, y.y.y.y/16, available]
By emitting the keys only once, we stop charging the agent to read the same words dozens of times.
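The same idea as a short Python sketch (again, an illustration of the technique, not TokenSieve’s actual code): hoist the shared keys into a single schema line, then emit each record as a bare row of values.

```python
def to_schema_rows(items):
    """Compress a list of same-shaped dicts into one schema line plus value rows."""
    if not items:
        return "data: []"
    keys = list(items[0])  # every item is assumed to share this key order
    lines = [f"schema: [{', '.join(keys)}]", "data:"]
    for item in items:
        lines.append("  - [" + ", ".join(str(item[k]) for k in keys) + "]")
    return "\n".join(lines)

vpcs = [
    {"VpcId": "vpc-001", "CidrBlock": "x.x.x.x/16", "State": "available"},
    {"VpcId": "vpc-002", "CidrBlock": "y.y.y.y/16", "State": "available"},
]
print(to_schema_rows(vpcs))
```

With two items the saving is modest; across a hundred subnets, paying for each key name once instead of a hundred times is where the compression comes from.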
Why It Works: The Invisible Filter
I built TokenSieve to sit silently between my shell and my tools. It doesn’t change how I work, and it doesn’t change how the agent works. It just intercepts the “trash” before the agent ever sees it.
On average, I’m seeing 46.9% token savings across real-world AWS tasks. In some cases, like EKS cluster descriptions, it cuts the noise by over 66%. My agent is faster, it’s cheaper, and it’s significantly more accurate because its “focus” isn’t being pulled in a thousand wrong directions.
Why I Chose Rust
This tool lives in the “hot path” of my workflow. It has to be invisible. If it adds even a second of lag to my commands, the magic is gone.
By using Rust, I created a tool that starts up in less than 5 milliseconds. It’s a single, static file that I can drop onto any machine without worrying about dependencies or environment conflicts. It just works, every single time, with the kind of reliability that disappears into the background.
Try it Yourself
I’ve open-sourced TokenSieve because I know I’m not the only one who has experienced “Token Exhaustion.” It’s built for the way we actually work: transparent, fast, and effective.
If you’re tired of your agents drowning in data, I’d love for you to give it a spin. It takes five commands to install, and you’ll never have to think about it again... until you see your next token bill.
Check it out on GitHub: https://github.com/ankit481/tokensieve
Comments
Comment by hkonte 2 hours ago
Prose prompts pad tokens with hedging, transitions, and repeated emphasis. The model has to figure out which parts are constraints vs. context vs. objectives. That parsing overhead is its own tax.
Typed blocks strip that ambiguity. I built github.com/Nyrok/flompt for this: it decomposes prompts into 12 semantic blocks (role, objective, constraints, output_format, etc.) and compiles them to Claude-optimized XML. Each block tells the model exactly what kind of content it's reading, no guessing.