Cost Tracking

The cost layer keeps your AI spending under control. It routes tasks to the cheapest model that can handle them, enforces budget limits, caches responses, and provides detailed cost analytics.

```typescript
const agent = await ReactiveAgents.create()
  .withProvider("anthropic")
  .withCostTracking() // Enable cost controls
  .build();
```

The cost layer analyzes each task and routes it to the optimal model tier:

| Tier | When Used | Examples |
| --- | --- | --- |
| Haiku | Simple tasks: < 50 words, no code, no analysis | "What's 2+2?", "Hello!", greetings |
| Sonnet | Medium complexity: code OR analysis keywords | "Explain recursion", "Review this function" |
| Opus | High complexity: code + multi-step + analysis | "Architect a microservices system with code examples" |

To classify complexity, the router uses simple heuristics: word count, presence of code blocks, multi-step instructions, and analysis keywords.

```typescript
// The execution engine calls routeToModel() during Phase 3 (Cost Route).
// You don't need to call this manually — it happens automatically.
```
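The heuristics above can be sketched as a small classifier. This is an illustrative reconstruction, not the library's actual `routeToModel()`: apart from the 50-word cutoff and the tier rules from the table, the keyword lists here are assumptions.

```typescript
// Illustrative sketch of the routing heuristics; keyword lists are assumptions.
type Tier = "haiku" | "sonnet" | "opus";

function classifyTier(task: string): Tier {
  const words = task.trim().split(/\s+/).length;
  const hasCode = /\bfunction\b|\bclass\b|\bconst\b|`{3}/.test(task);
  const hasAnalysis = /\b(explain|review|analyze|compare|architect)\b/i.test(task);
  const multiStep = /\b(first|then|step|finally)\b/i.test(task);

  if (hasCode && multiStep && hasAnalysis) return "opus"; // high complexity
  if (hasCode || hasAnalysis) return "sonnet";            // medium complexity
  if (words < 50) return "haiku";                         // simple task
  return "sonnet";                                        // long prose defaults to mid tier
}

console.log(classifyTier("What's 2+2?"));          // haiku
console.log(classifyTier("Review this function")); // sonnet
```

Note that "Review this function" stays on Sonnet even though it contains both a code keyword and an analysis keyword: Opus requires the multi-step signal as well.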

Set spending limits at multiple levels:

```typescript
import { createCostLayer } from "@reactive-agents/cost";

const costLayer = createCostLayer({
  budgetLimits: {
    perRequest: 1.00, // Max $1 per individual request
    perSession: 5.00, // Max $5 per session
    daily: 25.00,     // Max $25 per day
    monthly: 200.00,  // Max $200 per month
  },
});
```

When a budget limit is exceeded, the agent fails with a BudgetExceededError rather than silently overspending.
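The failure mode can be sketched as a simple guard. Only the error name comes from the docs; the constructor, fields, and `checkBudget` helper below are assumptions for illustration.

```typescript
// Illustrative sketch of the budget guard; the real check happens inside
// the cost layer, and BudgetExceededError's exact fields are assumptions.
class BudgetExceededError extends Error {
  constructor(readonly limit: number, readonly spent: number) {
    super(`Budget exceeded: $${spent.toFixed(2)} > $${limit.toFixed(2)} limit`);
    this.name = "BudgetExceededError";
  }
}

function checkBudget(spent: number, limit: number): void {
  if (spent > limit) throw new BudgetExceededError(limit, spent);
}

try {
  checkBudget(1.37, 1.0); // per-request spend vs. the perRequest limit
} catch (err) {
  if (err instanceof BudgetExceededError) {
    console.log(err.message); // Budget exceeded: $1.37 > $1.00 limit
  }
}
```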

Cache responses to avoid paying for identical queries:

```typescript
// Automatically checked during execution:
// if a semantically similar query was recently answered, the cached
// response is used. Cache entries have a configurable TTL.
await costService.cacheResponse(query, response, model, 3600_000); // 1 hour TTL
```
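The TTL behavior can be pictured with a plain `Map` and lazy eviction on read. This is a minimal sketch, not the library's implementation; the real cache stores richer metadata alongside each entry.

```typescript
// Minimal sketch of TTL-based cache expiry (illustrative only).
type Cached = { response: string; expiresAt: number };
const store = new Map<string, Cached>();

function put(key: string, response: string, ttlMs: number): void {
  store.set(key, { response, expiresAt: Date.now() + ttlMs });
}

function get(key: string): string | undefined {
  const hit = store.get(key);
  if (!hit) return undefined;
  if (Date.now() > hit.expiresAt) {
    store.delete(key); // expired: evict lazily on read
    return undefined;
  }
  return hit.response;
}

put("q1", "cached answer", 3600_000); // 1 hour TTL, as in the example above
console.log(get("q1"));               // "cached answer"
```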

The cost layer uses makeSemanticCache() internally to provide cosine similarity-based prompt deduplication:

```typescript
import { makeSemanticCache } from "@reactive-agents/cost";

// Without embedFn — falls back to exact hash matching only
const exactCache = makeSemanticCache();

// With embedFn — enables semantic similarity matching (>0.92 threshold)
const semanticCache = makeSemanticCache(myEmbedFn);
```
| Behavior | Without embedFn | With embedFn |
| --- | --- | --- |
| Exact match | Yes (hash) | Yes (hash, fast path) |
| Semantic match | No | Yes (cosine similarity > 0.92) |

When an embedFn is provided, queries that are semantically equivalent (e.g., “What is the capital of France?” and “Which city is France’s capital?”) hit the cache without requiring an exact string match.
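The semantic path can be pictured as a cosine-similarity scan over stored embeddings. A self-contained sketch: the 0.92 threshold comes from the docs, while the entry shape, `cosine`, and `lookup` are illustrative.

```typescript
// Minimal sketch of cosine-similarity cache lookup; not the real
// makeSemanticCache(), which also hashes queries for exact matches.
type Entry = { embedding: number[]; response: string };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function lookup(entries: Entry[], queryEmbedding: number[], threshold = 0.92): string | undefined {
  for (const e of entries) {
    if (cosine(e.embedding, queryEmbedding) >= threshold) return e.response;
  }
  return undefined; // cache miss: pay for a fresh LLM call
}

const entries: Entry[] = [{ embedding: [0.9, 0.1, 0.4], response: "Paris" }];
console.log(lookup(entries, [0.88, 0.12, 0.41])); // near-identical vector → "Paris"
console.log(lookup(entries, [0.0, 1.0, 0.0]));    // dissimilar vector → undefined
```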

Get detailed reports on spending:

```typescript
import { CostService } from "@reactive-agents/cost";
import { Effect } from "effect";

const program = Effect.gen(function* () {
  const cost = yield* CostService;

  // Current budget status
  const status = yield* cost.getBudgetStatus("my-agent");
  console.log(`Daily spend: $${status.currentDaily} (${status.percentUsedDaily}%)`);
  console.log(`Monthly spend: $${status.currentMonthly} (${status.percentUsedMonthly}%)`);

  // Detailed report
  const report = yield* cost.getReport("daily", "my-agent");
  console.log(`Total cost: $${report.totalCost}`);
  console.log(`Cache hit rate: ${(report.cacheHitRate * 100).toFixed(1)}%`);
  console.log(`Savings from cache: $${report.savings}`);
  console.log(`Avg cost/request: $${report.avgCostPerRequest}`);
  console.log(`Cost by tier:`, report.costByTier);
});
```
| Field | Description |
| --- | --- |
| `totalCost` | Total spend for the period |
| `totalRequests` | Number of LLM calls |
| `cacheHits` / `cacheMisses` | Semantic cache performance |
| `cacheHitRate` | Hit rate (0-1) |
| `savings` | Estimated savings from caching |
| `costByTier` | Breakdown by model tier (haiku/sonnet/opus) |
| `costByAgent` | Breakdown by agent ID |
| `avgCostPerRequest` | Average cost per LLM call |
| `avgLatencyMs` | Average response latency |
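Two of these fields are simple derived quantities. The arithmetic below is a plausible reconstruction, not the documented formula: in particular, treating savings as hits times the average request cost is an assumption.

```typescript
// Illustrative arithmetic behind cacheHitRate and savings; the real
// CostService computes these internally and may weight savings differently.
function summarize(hits: number, misses: number, avgCostPerRequest: number) {
  const total = hits + misses;
  const cacheHitRate = total === 0 ? 0 : hits / total; // 0-1, as in the report
  const savings = hits * avgCostPerRequest;            // each hit skips one paid call
  return { cacheHitRate, savings };
}

const { cacheHitRate, savings } = summarize(25, 75, 0.04);
console.log(cacheHitRate); // 0.25
console.log(savings);      // 1
```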

Cost tracking integrates with three phases of the execution lifecycle:

  1. Phase 3 (Cost Route) — Selects optimal model tier based on task complexity
  2. Phase 8 (Cost Track) — Records actual cost after LLM calls complete
  3. Phase 9 (Audit) — Includes cost data in the audit log

Reduce token usage by compressing prompts before sending to the LLM:

```typescript
const { compressed, savedTokens } = yield* cost.compressPrompt(longPrompt, 2000);
console.log(`Saved ${savedTokens} tokens`);
```

makePromptCompressor() uses a two-pass approach to reduce token count:

```typescript
import { makePromptCompressor } from "@reactive-agents/cost";

// Heuristic-only compression (always runs — no LLM required)
const heuristicCompressor = makePromptCompressor();

// Heuristic + optional LLM second pass
const twoPassCompressor = makePromptCompressor(myLlmService);
```

Two-pass strategy:

  1. Heuristic pass (always runs): Removes redundant whitespace, collapses repeated content, strips boilerplate. Fast and free.
  2. LLM second pass (optional): If the heuristic result still exceeds maxTokens, an LLM call intelligently summarizes or abbreviates the prompt further.

Without an llm parameter, only the heuristic pass runs. The LLM second pass is recommended for very long prompts (>4,000 tokens) where heuristic compression alone may not be sufficient.
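The heuristic pass can be pictured as plain text normalization under a rough 4-characters-per-token estimate. This is a hedged sketch of the idea, not the library's actual implementation.

```typescript
// Illustrative heuristic pass: collapse whitespace, squeeze blank-line runs,
// and drop consecutive duplicate lines. The real makePromptCompressor()
// also strips boilerplate and can delegate to an LLM second pass.
function heuristicCompress(prompt: string): { compressed: string; savedTokens: number } {
  const collapsed = prompt
    .split("\n")
    .map((line) => line.replace(/[ \t]+/g, " ").trim())
    .filter((line, i, lines) => line !== "" || lines[i - 1] !== ""); // squeeze blank runs
  const deduped = collapsed.filter((line, i) => i === 0 || line !== collapsed[i - 1]);
  const compressed = deduped.join("\n");
  const estimate = (s: string) => Math.ceil(s.length / 4); // crude token estimate
  return { compressed, savedTokens: estimate(prompt) - estimate(compressed) };
}

const { compressed, savedTokens } = heuristicCompress("Hello    world\n\n\n\nHello    world");
console.log(JSON.stringify(compressed));
console.log(savedTokens);
```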

The execution engine automatically accumulates token usage across all LLM calls within a task. The final AgentResult includes accurate tokensUsed and cost metadata:

```typescript
const result = await agent.run("Complex multi-step task");
console.log(`Tokens used: ${result.metadata.tokensUsed}`);
console.log(`Cost: $${result.metadata.cost}`);
```