ArvexiBuilders Blog

How We Extract Lease Data from 500 PDFs Without Losing a Dollar

Every enterprise has leases. A mid-market company might have 200. A Fortune 500 retailer might have 5,000. All of them buried in PDFs (scanned agreements, amendments, subleases, estoppels) sitting in SharePoint folders nobody wants to open.

When ASC 842 forced operating leases onto the balance sheet, accounting teams had two options: hire temps to key in every field by hand, or pay a consulting firm $500K+ to do it. Both are slow, expensive, and riddled with transcription errors. A single misread commencement date cascades into a wrong lease term, wrong amortization schedule, wrong journal entries, and an audit finding.

We built a pipeline that takes 500 lease PDFs and returns structured, validated, QA-routed data ready for import into any lease accounting system.

Architecture

[Diagram: End-to-End Extraction Pipeline. 500 PDFs in, structured data out; a learning context loop feeds QA corrections back into extraction prompts for the same lessor.]

The extraction prompt

We do not ask Claude to "read this lease." We give it a field-by-field specification (60+ fields across seven categories) with explicit value types, confidence calibration rules, and source provenance requirements:

const system = `You are a lease accounting specialist with deep expertise
in ASC 842, IFRS 16, and GASB 87.
 
EXTRACTION RULES:
1. Extract EVERY identifiable field from the document
2. For each field, provide a confidence score from 0.0 to 1.0:
   - 1.0 = explicitly stated verbatim in the document
   - 0.9-0.99 = clearly stated with minor interpretation
   - 0.8-0.89 = requires reasonable inference
   - 0.5-0.79 = uncertain, requires human review
   - Below 0.5 = very uncertain, just a guess
3. Include the exact source page number and quote`;

Without explicit calibration guidance, models cluster confidence around 0.9 for everything. By defining what each band means, we get a signal we can route on downstream. Every recurring payment also gets an ASC 842 classification (lease_component, nonlease_not_separated, nonlease_separated, or variable) because this determines what enters the present value calculation.
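In code, one extracted field maps onto a small typed record. A minimal sketch (field and type names are illustrative, not the production schema):

```typescript
// Hypothetical shape of one extracted field; names are illustrative,
// not the production schema.
type PaymentClass =
  | "lease_component"
  | "nonlease_not_separated"
  | "nonlease_separated"
  | "variable";

interface ExtractedField {
  name: string;
  value: string;
  confidence: number;          // 0.0-1.0, calibrated per the prompt's bands
  sourcePage: number;          // provenance: page the value came from
  sourceQuote: string;         // exact supporting text
  paymentClass?: PaymentClass; // set only for recurring payments
}

// Map a calibrated score back to the band the prompt defines.
function confidenceBand(c: number): string {
  if (c === 1.0) return "verbatim";
  if (c >= 0.9) return "minor interpretation";
  if (c >= 0.8) return "reasonable inference";
  if (c >= 0.5) return "uncertain";
  return "guess";
}
```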


Tiered model strategy

Running Sonnet on 500 PDFs costs real money. Most leases are straightforward. We only need the heavy model when the cheap one is not confident enough.

Cost Optimization: Tiered Model Decision Tree

| Path | Model cost | Per-document cost | Share of docs |
| --- | --- | --- | --- |
| Haiku only (avg conf ≥ 0.85) | $0.25 / 1M tokens | $0.01–0.05 | ~70% |
| Sonnet fallback (avg conf < 0.85) | $3 / 1M tokens | $0.10–0.50 | ~30% |
| Manual abstraction | n/a | $75–150 | baseline |

Haiku extracts first at $0.25/M tokens. If average confidence across all fields drops below 0.85, Sonnet re-extracts at $3/M tokens. Both get the same prompt, the same 16K token budget, and temperature 0. In practice, 70% complete with Haiku alone. The threshold and both models are env-configurable, so no deploy is needed to tune.
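The fallback logic itself is small. A sketch under the post's assumptions, where callModel() is a hypothetical wrapper around the Messages API that applies the shared prompt, 16K token budget, and temperature 0:

```typescript
// A sketch of the tiered fallback. In production the threshold and model
// IDs come from env vars; hard-coded here to stay self-contained.
const THRESHOLD = 0.85;

interface Field { confidence: number }
type ModelCall = (model: string, pdf: Uint8Array) => Promise<Field[]>;

// Average confidence across all extracted fields drives the tier decision.
function needsSonnetFallback(fields: Field[]): boolean {
  if (fields.length === 0) return true; // nothing extracted: escalate
  const avg = fields.reduce((s, f) => s + f.confidence, 0) / fields.length;
  return avg < THRESHOLD;
}

async function extractTiered(pdf: Uint8Array, callModel: ModelCall): Promise<Field[]> {
  const fields = await callModel("haiku", pdf);    // $0.25/M tokens
  if (!needsSonnetFallback(fields)) return fields; // ~70% stop here
  return callModel("sonnet", pdf);                 // ~30% re-extract at $3/M
}
```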


The 497-line validation engine

Claude gives us confidence scores. We do not trust them blindly.

After extraction, every field passes through a three-layer validation engine. Each failed check penalizes the field's confidence by 0.1, which can push it into a different QA bucket. A field starting at 0.92 that fails two checks drops to 0.72, below the 0.80 threshold, straight into full review.

Quality Gate: The 3-Layer Validation Engine

1. Format validation: does the value parse as its declared type? (−0.1 per failure)
2. Range validation: is the value within domain-specific bounds? (−0.1)
3. Cross-field consistency: do related fields agree with each other? (−0.1)

The validation engine does not fix data. It creates a reliable signal for the QA routing layer. Cross-field consistency is where we catch the errors that matter most: commencement + term ≈ expiration (within 60 days), monthly × 12 ≈ annual rent (within 5%), payment streams matching the base rent field.
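Those consistency rules are cheap to express. A sketch of two of them as a pure function, with hypothetical field names; the real engine runs many more checks, each worth −0.1:

```typescript
// Two of the cross-field rules. Field names are hypothetical; every
// failed check costs the field 0.1 of confidence.
interface LeaseFields {
  commencementDate: Date;
  expirationDate: Date;
  termMonths: number;
  monthlyRent: number;
  annualRent: number;
}

const PENALTY = 0.1;

function crossFieldPenalty(f: LeaseFields): number {
  let failures = 0;

  // commencement + term should land within 60 days of expiration
  const implied = new Date(f.commencementDate);
  implied.setMonth(implied.getMonth() + f.termMonths);
  const driftDays =
    Math.abs(implied.getTime() - f.expirationDate.getTime()) / 86_400_000;
  if (driftDays > 60) failures++;

  // monthly x 12 should be within 5% of the annual rent field
  if (Math.abs(f.monthlyRent * 12 - f.annualRent) / f.annualRent > 0.05) {
    failures++;
  }

  return failures * PENALTY; // subtracted from the field's confidence
}
```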


Confidence-bucketed QA routing

After validation adjusts the scores, every field gets routed into one of three QA buckets. This replaces the old approach of "review everything" or "review nothing."

Confidence-Bucketed QA Router

| Bucket | Confidence | Status | Share of fields |
| --- | --- | --- | --- |
| spot_check | ≥ 0.95 | AUTO_ACCEPTED | ~60% |
| validate | 0.80–0.95 | PENDING (quick) | ~25% |
| full_review | < 0.80 | PENDING (full) | ~15% |

Mid-confidence fields land in validate: a QA reviewer confirms or corrects each value in a quick pass.

The QA queue sorts fields by confidence ascending, lowest first, so reviewers spend time where it matters. Each review action creates an immutable audit log record with one of three clean outcomes: CONFIRMED, CORRECTED (with the new value), or ESCALATED.
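The routing itself reduces to a few lines. A sketch with the thresholds above (function names are illustrative):

```typescript
type Bucket = "spot_check" | "validate" | "full_review";

// Route one post-validation confidence score into its QA bucket.
function routeField(confidence: number): Bucket {
  if (confidence >= 0.95) return "spot_check"; // AUTO_ACCEPTED, ~60%
  if (confidence >= 0.8) return "validate";    // quick confirm, ~25%
  return "full_review";                        // field-by-field, ~15%
}

// Reviewers work the queue lowest-confidence first.
function sortQueue<T extends { confidence: number }>(fields: T[]): T[] {
  return [...fields].sort((a, b) => a.confidence - b.confidence);
}
```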


The batch queue and learning loop

Processing 500 documents concurrently would hammer the API and starve other tenants. We use pgBoss with two layers of concurrency: 15 jobs globally, 5 per organization. When an org hits its limit, excess jobs are deferred with a 5-second delay: not rejected, not lost, just re-enqueued.
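The gate is a pure decision; the deferral is a re-enqueue using pg-boss's startAfter option (seconds). A sketch where activeJobs()/activeJobsForOrg() are hypothetical counters (e.g. a COUNT over the job table):

```typescript
// Sketch of the two-layer concurrency gate. Limits mirror the post:
// 15 jobs globally, 5 per organization.
const GLOBAL_LIMIT = 15;
const ORG_LIMIT = 5;

function shouldDefer(globalActive: number, orgActive: number): boolean {
  return globalActive >= GLOBAL_LIMIT || orgActive >= ORG_LIMIT;
}

// Hypothetical worker body showing the deferral:
// async function handler(job: { data: { orgId: string } }) {
//   if (shouldDefer(await activeJobs(), await activeJobsForOrg(job.data.orgId))) {
//     await boss.send("extract-lease", job.data, { startAfter: 5 }); // retry in 5s
//     return; // deferred, not rejected, not lost
//   }
//   await runExtraction(job.data);
// }
```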

The pipeline gets smarter over time. When a QA reviewer corrects a field, we store the correction pattern keyed by lessor and document type. Next time we extract from the same lessor, those patterns inject directly into the prompt. Only patterns seen at least twice make the cut; one-off corrections might be noise.

// Only patterns confirmed by multiple corrections get injected
const lines = patterns
  .filter((p) => p.occurrences >= 2)
  .slice(0, 15)
  .map((p) =>
    `- Field "${p.fieldName}": AI extracted "${p.originalValue}"
       but was corrected to "${p.correctedValue}" (${p.occurrences}x)`
  );

Over the first 100 documents from a recurring lessor, the full_review bucket typically shrinks from ~30% of fields to under 10%.


Output adapters

Extracting data is half the problem. The other half is getting it into the client's existing system. We built 8 output adapters behind a factory pattern with lazy loading:

| Adapter | Format | Date Format | Key Quirk |
| --- | --- | --- | --- |
| Generic Excel | .xlsx | ISO 8601 | Dark headers, freeze panes |
| Generic CSV | .csv | ISO 8601 | Formula injection prevention |
| CoStar | .xlsx (3 tabs) | ISO 8601 | Lease Data + Financial + Options |
| LeaseQuery | .txt | ISO 8601 | Pipe-delimited, not CSV |
| Oracle FBDI | .csv | YYYY/MM/DD | Non-ISO date format |
| SAP RE-FX | .json | YYYYMMDD | BAPI_RE_CN_CREATE structure |
| Visual Lease | .xlsx | ISO 8601 | Red headers = required fields |
| Nakisa | .xlsx | ISO 8601 | 7 required columns, red/grey |

Every adapter implements the same contract (fields in, file out), but the output formats are wildly different. CoStar wants a three-sheet Excel workbook. LeaseQuery uses pipe-delimited text. SAP needs structured JSON with undelimited dates (YYYYMMDD). Lazy loading via async () => import() means we only pull in ExcelJS when generating Excel formats.
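The date quirks alone justify the shared contract. A sketch of how an adapter's declared style could drive formatting (keys and names are illustrative, not the production registry):

```typescript
// Date quirks per adapter, matching the table above. The real adapters
// also handle headers, tabs, styling, and escaping.
type DateStyle = "iso" | "oracle" | "sap";

const adapterDateStyle: Record<string, DateStyle> = {
  "generic-csv": "iso",
  "costar": "iso",
  "oracle-fbdi": "oracle", // YYYY/MM/DD
  "sap-refx": "sap",       // YYYYMMDD
};

function formatDate(d: Date, style: DateStyle): string {
  const iso = d.toISOString().slice(0, 10); // YYYY-MM-DD
  if (style === "oracle") return iso.replace(/-/g, "/");
  if (style === "sap") return iso.replace(/-/g, "");
  return iso;
}
```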


See Document Intelligence to learn more about these capabilities.

Next: the modification engine, what happens when lease terms change mid-stream, and why it's the #1 audit finding in ASC 842.