Claude Code Token Usage: How The Caveman Skill Cuts 75 Percent Of The Chatter
Most Claude Code users focus on the model. Bigger model. Newer model. Better benchmark. But a lot of the waste is not in the model choice. It is in the way the model talks back. A skill called Caveman is getting attention because it attacks that exact problem. Instead of long polite explanations, it pushes Claude Code to answer in short fragments that keep the fix and drop the fluff. The result, according to the repo and the original thread, is a dramatic cut in token usage without stripping out the real technical value.
If you use Claude Code every day, this matters for three reasons. It can reduce direct output cost. It can keep your context window cleaner for longer sessions. And it can make your whole stack feel faster because the agent has less text to generate and less text to reread on later turns.
- Main takeaway: lower Claude Code token usage is not just a billing win, it is a session quality win.
- Why it works: every extra explanation becomes future context the model has to read again.
- What changed: the Caveman repo claims about 75 percent less output, with a benchmark table averaging 65 percent savings across prompts and a range of 22 percent to 87 percent.
- Who should care: solo builders, agencies, and anyone running multiple AI agents all day.
What the Caveman skill actually changes
The Caveman skill does not change the underlying model weights. It changes the response style. That sounds small until you remember how most coding sessions actually unfold. The model does the work, but it also wraps the work in soft intros, narration, transition phrases, and explanation you did not ask for. One response might not look expensive. Fifty turns later, it adds up.
The repo example shows a normal React explanation at 69 tokens versus a Caveman version at 19 tokens. Same diagnosis. Same fix. Less chatter. The README also says the project now includes several intensity levels, a one-line review mode, terse commit helpers, and an input compression tool that claims roughly 46 percent less input token load per session. That last piece matters because a token strategy is stronger when it attacks both output and input.
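Skills in Claude Code are plain markdown instruction files, so a minimal sketch shows how little machinery is involved. The file below is an invented illustration, not the actual Caveman skill text; the `.claude/skills/terse-mode/SKILL.md` path follows Claude Code's standard skills convention, and every rule in it is made up for the example.

```markdown
---
name: terse-mode
description: Respond in short, information-dense fragments for implementation, debugging, and review tasks.
---

Rules:
- No intros, no transitions, no apologies.
- Diagnosis first, then the fix. Nothing else.
- Prefer fragments over sentences when meaning survives.
- Keep code, paths, and commands. Drop narration.
```

Because the style rules live in a file rather than the model, you can keep several intensity levels side by side and pick one per task, which is what the README describes.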
This is why the idea spread so fast. It is simple enough to understand in ten seconds, but useful enough to save real money for people living inside Claude Code.
Why lower Claude Code token usage compounds over time
Here is the part most people miss. Saving output tokens once is nice. Saving output tokens that then never come back into future turns is better. In a long coding session, every reply becomes part of the session history. So the model has to read its own earlier verbosity over and over until the session compacts or ends.
That means a smaller answer today becomes a smaller input burden tomorrow. Less output now. Less rereading later. Less context bloat across the whole day. That is why a style constraint can feel bigger than its headline number. You are not only shrinking the current turn. You are shrinking every turn that follows. The table below summarizes the effect, and the sketch after it puts rough numbers on it.
| Session behavior | Verbose mode | Terse mode | Why it matters |
|---|---|---|---|
| Reply length | Long walkthroughs and filler | Short direct answer | Lower immediate output cost |
| Future turns | Model rereads more old text | Model rereads less old text | Compounding savings across the session |
| Context window pressure | Bloats faster | Stays cleaner longer | Fewer slowdowns and fewer limit hits |
| Operator experience | More scrolling, more noise | Faster scan, faster decisions | Better flow when you are shipping all day |
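Here is a back-of-envelope version of that compounding, a sketch rather than a benchmark. It assumes every reply gets appended to history and reread on each later turn, with made-up reply sizes of 300 tokens verbose versus 75 terse (the same 75 percent cut the README claims). Real sessions vary with compaction, caching, and tool output.

```python
# Back-of-envelope: cumulative tokens over a session where every reply
# becomes history that later turns must read again. Reply sizes are
# illustrative assumptions, not measured Claude Code numbers.

def session_tokens(reply_size: int, turns: int) -> tuple[int, int]:
    """Return (direct_output_tokens, reread_input_tokens) for a session."""
    output = reply_size * turns
    # The reply from turn i is reread on each of the remaining (turns - i) turns.
    reread = sum(reply_size * (turns - i) for i in range(1, turns + 1))
    return output, reread

for label, size in [("verbose", 300), ("terse", 75)]:
    out, reread = session_tokens(size, turns=50)
    print(f"{label:8s} output={out:,}  reread as later input={reread:,}")

# verbose  output=15,000  reread as later input=367,500
# terse    output=3,750   reread as later input=91,875
```

The style cut is the same 75 percent in both columns, but the reread column dwarfs the direct output column, which is exactly the compounding the table describes. Prompt caching softens the reread cost in practice, so treat this as a shape, not a price.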
The numbers builders are reacting to
The README for Caveman makes a big headline claim of about 75 percent less output token usage. It also shows a benchmark table with an average reduction of 65 percent across several tasks, including debugging a PostgreSQL race condition, reviewing a PR for security issues, and implementing a React error boundary. The published range in that table runs from 22 percent to 87 percent.
Those numbers are big enough to matter even if your own results come in lower. If you shave a third of the waste from a daily coding routine, that is still meaningful. If you shave half, you feel it. And if your session patterns line up with the benchmark ceiling, the savings are hard to ignore.
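As a sanity check on the headline, the repo's own React example from earlier already lands in that range:

```python
# Percent reduction for the repo's published React example (69 -> 19 tokens).
before, after = 69, 19
reduction = (before - after) / before
print(f"{reduction:.1%}")  # 72.5%, close to the ~75% headline claim
```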
The README also cites a March 2026 paper, Brevity Constraints Reverse Performance Hierarchies in Language Models, and points to cases where forcing brevity improved accuracy instead of harming it. That is a useful reminder. More words do not always mean more intelligence. Sometimes more words just mean more ceremony.
Where the headline can mislead people
A 75 percent reduction does not mean every Claude Code bill magically falls by 75 percent. Tool calls, input context, and reasoning behavior still matter. But the claim is still important because output verbosity is one of the easiest leaks to fix. It does not require a new provider, a new stack, or a new workflow. It just requires tighter response rules.
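To see why, model the bill as two streams, input and output, and cut only one of them hard. Everything in the sketch below is an assumption for illustration: the per-million-token prices, the daily token mix, and the guess that input shrinks about 30 percent because terse replies get reread less.

```python
# Why a 75% output cut does not cut the bill by 75%. Prices and the
# input/output split below are illustrative assumptions, not real rates.

input_tokens = 800_000      # assumed: context, files, tool results read per day
output_tokens = 200_000     # assumed: model replies per day
price_in, price_out = 3.0, 15.0  # assumed $/million tokens

def daily_cost(inp: int, out: int) -> float:
    return (inp * price_in + out * price_out) / 1_000_000

before = daily_cost(input_tokens, output_tokens)
# Cut output by 75%; assume input also shrinks ~30% from less rereading.
after = daily_cost(int(input_tokens * 0.7), int(output_tokens * 0.25))
print(f"before=${before:.2f} after=${after:.2f} saved={(1 - after / before):.0%}")
# before=$5.40 after=$2.43 saved=55%
```

A 75 percent output cut shows up here as roughly a 55 percent bill cut under these made-up numbers. That is the honest version of the headline, and it is still a big number.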
Who should use a Claude Code token optimization skill
If you touch Claude Code once in a while, this is nice to have. If you live in it every day, this is operational. Solo builders benefit because the session lasts longer and the response stream gets easier to scan. Agencies benefit because lower AI cost improves margin without changing what the client receives. Multi-agent operators benefit because every saved token creates more room for concurrent work.
This also matters for builders running content systems, support workflows, automation stacks, or repo maintenance agents. When the agent is looping through repetitive tasks, you do not need elegance. You need throughput. A shorter answer is often the better answer.
- Solo builder: more productive hours before rate limits or context drag slow you down.
- Freelancer or agency: same client outcome, better margin on AI spend.
- Multi-agent operator: more room for parallel runs without bloating cost.
- Beginner: cleaner responses can actually make the tool easier to follow.
The bigger lesson is not Caveman, it is instrumentation
The real story here is not that someone made a funny skill. The real story is that someone bothered to measure where the leak was. Most builders never revisit their prompts once the thing works. They install an agent, see that it completes the task, and move on. Then months later they are surprised by higher usage, slower sessions, and more rate limit friction.
The builders who win this next stretch are probably not the ones who chase every model release. They are the ones who inspect the stack they already have. They log token usage. They watch session behavior. They tighten the prompts. They shorten the boilerplate. They cut the polite filler. Then they compound the savings across every agent they run.
That is a much better habit than waiting for a provider to rescue you with pricing or a new model.
When a terse response style is the wrong move
There are real cases where you do want more explanation. If you are onboarding a team member, documenting a decision, or asking the model to teach instead of execute, the shorter answer is not always the better one. Brevity helps most when you already understand the task and want the model to move fast.
So do not apply this blindly. The best setup is usually selective. Use terse mode for implementation, debugging, reviews, and repeated workflows. Use fuller explanations when you are learning, handing work to someone else, or creating documentation that needs context.
Should you try Caveman on your own stack?
Yes, if you are running Claude Code regularly, it is worth testing on a real task this week. Not on a toy example. Use it on the kind of work that normally burns through your day: bug fixing, feature work, refactors, or repetitive agent loops. Watch what happens to reply length, speed, and context quality.
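If you want a crude number from that test without extra tooling, even a character-count heuristic will show the trend. A minimal sketch, assuming the common four-characters-per-token approximation; the lists are placeholders you fill with pasted replies.

```python
# Crude reply-size tracker for a before/after test. The 4 chars-per-token
# heuristic is an approximation; swap in a real tokenizer for exact counts.

def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)

replies_verbose: list[str] = []  # paste replies from a normal session here
replies_terse: list[str] = []    # paste replies from a Caveman session here

for label, replies in [("verbose", replies_verbose), ("terse", replies_terse)]:
    total = sum(rough_tokens(r) for r in replies)
    print(f"{label}: ~{total} output tokens across {len(replies)} replies")
```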
The key is not to worship one skill. The key is to notice what the skill reveals. A lot of AI stacks are leaking budget through language habits, not just model selection. Once you see that, you start looking at every system prompt, every agent instruction file, and every long-winded response a little differently.
If you want help building an agent stack that actually ships, join Shipping Skool. That is where Beau breaks down the real workflows, the prompt strategy, and the systems behind shipping with AI every day.
FAQ
What is the Caveman skill for Claude Code?
The Caveman skill is a response formatting layer that pushes Claude Code to answer in short, information-dense fragments instead of long polite explanations. The goal is not to make the model dumber. The goal is to stop paying for filler.
Does shorter output really save more than just output tokens?
Yes. Shorter output reduces the current reply cost, then reduces the amount of text the model has to read back on later turns. That means the savings compound across the whole session, especially on long coding runs.
Who benefits most from lower Claude Code token usage?
Solo builders, agency operators, and anyone running multiple AI agents all benefit. If you hit rate limits, pay for heavy daily usage, or keep long sessions open, shorter outputs can buy you more runtime and cleaner context.
Will a terse style hurt coding accuracy?
It does not have to. The whole point of the Caveman approach is to remove the politeness layer while keeping the fix, the decision, or the next step intact. The repo README says the benchmark average was 65 percent savings across prompts while keeping the technical substance.
Want the full build in public playbook? Join Shipping Skool and see how Beau is tightening prompts, routing agents, and turning these small optimizations into faster shipping every week.
Ready to start building with AI?
Join Shipping Skool and ship your first product in weeks.