ArvexiBuilders Blog

How We Built Context Management for a Financial AI Agent

Enterprise lease accounting is unforgiving. A single misplaced decimal in a present value calculation cascades through amortization schedules, journal entries, and financial statements. When we built Arvexi's AI Workspace (an autonomous agent that handles ASC 842, IFRS 16, and GASB 87/96), we knew the context window would eventually become our biggest constraint.


The problem: context windows have a hard ceiling

Claude's context window is 200K tokens. That sounds like a lot until you account for everything that goes into every API call.

Segment                 Tokens   Share
System prompt             4K       2%
Tool definitions          8K       4%
Conversation history     60K      30%
Tool results            120K      60%
Available                 8K       4%

Typical 30-lease audit session.

A user asks "prepare the KPMG audit for all leases" and the agent starts working. It queries the portfolio, pulls each lease's schedule, validates classifications, generates journal entries. By lease 15, the conversation history alone is pushing 120K tokens. By lease 30, we're hitting the wall.
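Back-of-envelope, the growth rate implies the wall arrives fast. A hedged sketch; the per-lease cost is inferred from the figures above, not measured:

```python
# Figures from the session described above.
WINDOW = 200_000
SYSTEM_PROMPT = 4_000
TOOL_DEFINITIONS = 8_000

# ~120K tokens of history by lease 15 implies roughly 8K tokens per lease
# (an inference from the numbers above, not a measurement).
tokens_per_lease = 120_000 // 15

usable = WINDOW - SYSTEM_PROMPT - TOOL_DEFINITIONS
leases_before_wall = usable // tokens_per_lease
print(tokens_per_lease, leases_before_wall)   # 8000 23
```

Twenty-odd leases of runway against portfolios that run into the hundreds.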

The naive solution is to warn the user and ask them to start a new chat. But that defeats the purpose. We're selling an outcome, a complete audit package, not a tool that makes you restart every 15 minutes.


What we tried first

Our first approach used Claude's context editing to silently drop old tool results once usage crossed a threshold. We shipped it. Then we thought harder.
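In sketch form, what we shipped looked roughly like this. A simplified reconstruction, not our production code; `count_tokens` is a hypothetical helper:

```python
def drop_old_tool_results(messages, count_tokens, threshold=150_000):
    """Naive context editing: once usage crosses the threshold,
    silently blank out the oldest tool results until we're back under."""
    total = sum(count_tokens(m) for m in messages)
    for msg in messages:                      # oldest first
        if total <= threshold:
            break
        if msg.get("role") == "tool":
            total -= count_tokens(msg)
            msg["content"] = "[cleared]"      # the data is simply gone
            total += count_tokens(msg)
    return messages
```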

In a financial product, "silently dropping data" is a terrifying phrase. If the agent references a liability figure from a tool result that was cleared, it's working from memory of a number it can no longer verify. For a consumer chatbot, that's fine. For a platform where auditors rely on the output, it's not.

We ripped it out.


The three phases of context

[Diagram: the three-phase context strategy. Full context from 0 to 150K tokens; compaction at 150K produces a structured summary preserving all financial data; then repeat with fresh runway. Roughly 60% of the 200K window is freed per compaction.]
Phase 1: Full context (0 to 150K tokens). Everything stays. Every tool result, every conversation turn, every financial figure. No dropping, no summarizing. This is where most conversations live. A typical session uses 30-50K tokens.

Phase 2: Compaction (at 150K tokens). Claude generates a structured summary of older conversation turns while preserving recent ones verbatim. The critical difference from our first approach: we control exactly what gets preserved through domain-specific instructions.

Phase 3: Repeat. After compaction frees 60-80% of the window, the agent has fresh runway. If it fills up again, compaction fires again. Each cycle only summarizes turns added since the last compaction; prior summaries are preserved verbatim. The agent can run indefinitely.
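The three phases reduce to a small control loop. A minimal sketch, assuming hypothetical `count_tokens` and `summarize` helpers, where `summarize` applies the domain-specific compaction instructions:

```python
COMPACTION_THRESHOLD = 150_000   # phase boundary from the strategy above
KEEP_RECENT = 10                 # recent turns kept verbatim (assumed value)

def maybe_compact(messages, count_tokens, summarize):
    """Phase controller: full context until 150K, then compact, then repeat."""
    if sum(count_tokens(m) for m in messages) < COMPACTION_THRESHOLD:
        return messages                          # Phase 1: everything stays

    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    # Prior summaries are preserved verbatim; only turns added since the
    # last compaction get summarized.
    prior_summaries = [m for m in old if m.get("compacted")]
    new_turns = [m for m in old if not m.get("compacted")]
    summary = {"role": "user", "compacted": True,
               "content": summarize(new_turns)}
    return prior_summaries + [summary] + recent  # Phase 3: fresh runway
```

Each call either passes the conversation through untouched or folds the oldest turns into one summary message, so the loop can fire as many times as a long session needs.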


What compaction preserves

The compaction instructions are opinionated. We tell the model exactly what matters in lease accounting and what's safe to discard.

Preserve (exact precision required):
- Financial data: all dollar amounts, interest rates, discount rates, payment amounts, liability balances, ROU asset values. $1,234,567.89, not ~$1.2M.
- Lease identifiers: every lease ID, lease number, lessor name, classification.
- Processing state: which items completed vs. pending, with counts ("Processed 47 of 200 leases").
- Calculation results: NPV, PV, WARLT, amortization totals, journal entry sums.
- User requests: the original task and any modifications.
- Next steps: what the agent was about to do next.

Discard (safe to summarize):
- Raw metadata: API response headers, tool execution timestamps, request IDs.
- Period-by-period data: 360-period schedules replaced with totals plus a period count.
- Retry attempts: failed tool calls, debugging traces, intermediate errors.

Domain-specific instructions ensure every dollar survives compaction.

The key insight: $1,234,567.89 and ~$1.2M are not the same number in accounting. Our instructions encode that domain knowledge directly into the summarization process.
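The instructions themselves are just text handed to the summarizer. A condensed sketch of what ours encode (paraphrased for illustration, not the production prompt):

```python
# Paraphrased compaction instructions; the production prompt is longer.
COMPACTION_INSTRUCTIONS = """\
Summarize the older conversation turns. You MUST preserve exactly:
- Every dollar amount, rate, balance, and ROU asset value to the cent
  ($1,234,567.89, never ~$1.2M)
- Every lease ID, lease number, lessor name, and classification
- Processing state with counts (e.g. "Processed 47 of 200 leases")
- Calculation results: NPV, PV, WARLT, amortization totals, JE sums
- The user's original request and any modifications
- The next step the agent was about to take
You MAY discard: API response headers, timestamps, request IDs,
period-by-period schedule rows (keep totals plus period count),
failed tool calls and retry traces.
"""
```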


Smart truncation

Even with compaction, individual tool results can be enormous. A 30-year monthly lease has 360 schedule periods, totaling 150K+ characters of JSON for a single tool call.

Our original truncation replaced anything over 30KB with { _truncated: true }, leaving the agent with zero data from that call. We replaced it with a binary search that preserves real data:

For a 360-period schedule with a 25KB target, the search converges in a handful of steps to the largest slice that fits.
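A minimal sketch of the idea, assuming schedule periods serialize to JSON (hypothetical shapes; the production version differs):

```python
import json

def truncate_schedule(periods, max_bytes=25_000):
    """Binary-search the largest prefix of periods whose JSON fits
    under max_bytes, instead of dropping the whole payload."""
    lo, hi, best = 0, len(periods), 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if len(json.dumps(periods[:mid])) <= max_bytes:
            best, lo = mid, mid + 1
        else:
            hi = mid - 1
    kept = periods[:best]
    return {
        "periods": kept,
        "_truncated": best < len(periods),
        "total_periods": len(periods),   # agent still sees the full count
    }
```

Because serialized size grows monotonically with the prefix length, the search always lands on the largest slice that fits, and the result carries enough metadata for the agent to know data was cut.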

The agent always has something real to work with. We also added pagination to high-volume tools. get_lease_schedule accepts offset/limit (default 60, max 120) so the agent can page through large datasets deliberately.
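Pagination keeps any single result bounded before truncation ever has to fire. A sketch of how the `get_lease_schedule` paging might look (hypothetical data access; the defaults match the text):

```python
DEFAULT_LIMIT, MAX_LIMIT = 60, 120   # defaults from the tool contract

def get_lease_schedule(lease_id, schedules, offset=0, limit=DEFAULT_LIMIT):
    """Return one bounded page of schedule periods so the agent can
    walk a 360-period schedule deliberately instead of in one blob."""
    limit = min(limit, MAX_LIMIT)
    periods = schedules[lease_id]
    page = periods[offset:offset + limit]
    return {
        "lease_id": lease_id,
        "periods": page,
        "offset": offset,
        "total_periods": len(periods),
        "has_more": offset + limit < len(periods),
    }
```

The `total_periods` and `has_more` fields let the agent decide whether another page is worth the tokens.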


What this enables

Before this work, our agent hit a wall after 15 leases in a complex analysis. Now a "prepare audit for all leases" command processes the entire portfolio in a single session. The agent can run 15-20 minutes uninterrupted, making 25+ tool calls, with multiple compaction cycles and zero degradation.

The goal hasn't changed: the user says what they want, and the agent delivers the outcome. Context management, pagination, compaction: all of it should be invisible.


See Intelligence to learn more about these capabilities.