ArvexiBuilders Blog

How We Extract Lease Data from 500 PDFs Without Losing a Dollar

Every enterprise has leases. A mid-market company might have 200. A Fortune 500 retailer might have 5,000. All of them buried in PDFs (scanned agreements, amendments, subleases, estoppels) sitting in SharePoint folders nobody wants to open.

When ASC 842 forced operating leases onto the balance sheet, accounting teams had two options: hire temps to key in every field by hand, or pay a consulting firm $500K+ to do it. Both are slow, expensive, and riddled with transcription errors. A single misread commencement date cascades into a wrong lease term, wrong amortization schedule, wrong journal entries, and an audit finding.

We built a pipeline that takes 500 lease PDFs and returns structured, validated, QA-routed data ready for import into any lease accounting system.

Architecture

[Diagram: End-to-End Extraction Pipeline. 500 PDFs in, structured data out; a learning context loop feeds QA corrections back into extraction prompts for the same lessor.]

The extraction prompt

We do not ask Claude to "read this lease." We give it a field-by-field specification (60+ fields across seven categories) with explicit value types, confidence calibration rules, and source provenance requirements:

const system = `You are a lease accounting specialist with deep expertise
in ASC 842, IFRS 16, and GASB 87.
 
EXTRACTION RULES:
1. Extract EVERY identifiable field from the document
2. For each field, provide a confidence score from 0.0 to 1.0:
   - 1.0 = explicitly stated verbatim in the document
   - 0.9-0.99 = clearly stated with minor interpretation
   - 0.8-0.89 = requires reasonable inference
   - 0.5-0.79 = uncertain, requires human review
   - Below 0.5 = very uncertain, just a guess
3. Include the exact source page number and quote`;

Without explicit calibration guidance, models cluster confidence around 0.9 for everything. By defining what each band means, we get a signal we can route on downstream. Every recurring payment also gets an ASC 842 classification (lease_component, nonlease_not_separated, nonlease_separated, or variable) because this determines what enters the present value calculation.
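In code, one extracted field maps onto a small typed record. A minimal sketch (field and type names are illustrative, not the production schema):

```typescript
// Hypothetical shape of one extracted field; names are illustrative,
// not the production schema.
type PaymentClass =
  | "lease_component"
  | "nonlease_not_separated"
  | "nonlease_separated"
  | "variable";

interface ExtractedField {
  name: string;
  value: string;
  confidence: number;          // 0.0-1.0, calibrated per the prompt's bands
  sourcePage: number;          // provenance: page the value came from
  sourceQuote: string;         // exact supporting text
  paymentClass?: PaymentClass; // set only for recurring payments
}

// Map a calibrated score back to the band the prompt defines.
function confidenceBand(c: number): string {
  if (c === 1.0) return "verbatim";
  if (c >= 0.9) return "minor interpretation";
  if (c >= 0.8) return "reasonable inference";
  if (c >= 0.5) return "uncertain";
  return "guess";
}
```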


Tiered model strategy

Running Sonnet on 500 PDFs costs real money. Most leases are straightforward. We only need the heavy model when the cheap one is not confident enough.

Cost Optimization: Tiered Model Decision Tree

| Path | Model cost | Per-document cost | Share of docs |
| --- | --- | --- | --- |
| Haiku only (avg conf ≥ 0.85) | $0.25 / 1M tokens | $0.01–0.05 | ~70% |
| Sonnet fallback (avg conf < 0.85) | $3 / 1M tokens | $0.10–0.50 | ~30% |
| Manual abstraction | n/a | $75–150 | baseline |

Haiku extracts first at $0.25/M tokens. If average confidence across all fields drops below 0.85, Sonnet re-extracts at $3/M tokens. Both get the same prompt, the same 16K token budget, and temperature 0. In practice, 70% complete with Haiku alone. The threshold and both models are env-configurable, so no deploy is needed to tune.
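The fallback logic itself is small. A sketch under the post's assumptions, where callModel() is a hypothetical wrapper around the Messages API that applies the shared prompt, 16K token budget, and temperature 0:

```typescript
// A sketch of the tiered fallback. In production the threshold and model
// IDs come from env vars; hard-coded here to stay self-contained.
const THRESHOLD = 0.85;

interface Field { confidence: number }
type ModelCall = (model: string, pdf: Uint8Array) => Promise<Field[]>;

// Average confidence across all extracted fields drives the tier decision.
function needsSonnetFallback(fields: Field[]): boolean {
  if (fields.length === 0) return true; // nothing extracted: escalate
  const avg = fields.reduce((s, f) => s + f.confidence, 0) / fields.length;
  return avg < THRESHOLD;
}

async function extractTiered(pdf: Uint8Array, callModel: ModelCall): Promise<Field[]> {
  const fields = await callModel("haiku", pdf);    // $0.25/M tokens
  if (!needsSonnetFallback(fields)) return fields; // ~70% stop here
  return callModel("sonnet", pdf);                 // ~30% re-extract at $3/M
}
```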


The 497-line validation engine

Claude gives us confidence scores. We do not trust them blindly.

After extraction, every field passes through a three-layer validation engine. Each failed check penalizes the field's confidence by 0.1, which can push it into a different QA bucket. A field starting at 0.92 that fails two checks drops to 0.72, below the 0.80 threshold, straight into full review.

Quality Gate: The 3-Layer Validation Engine

1. Format validation: does the value parse as its declared type? (−0.1 per failure)
2. Range validation: is the value within domain-specific bounds? (−0.1)
3. Cross-field consistency: do related fields agree with each other? (−0.1)

The validation engine does not fix data. It creates a reliable signal for the QA routing layer. Cross-field consistency is where we catch the errors that matter most: commencement + term ≈ expiration (within 60 days), monthly × 12 ≈ annual rent (within 5%), payment streams matching the base rent field.
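Those consistency rules are cheap to express. A sketch of two of them as a pure function, with hypothetical field names; the real engine runs many more checks, each worth −0.1:

```typescript
// Two of the cross-field rules. Field names are hypothetical; every
// failed check costs the field 0.1 of confidence.
interface LeaseFields {
  commencementDate: Date;
  expirationDate: Date;
  termMonths: number;
  monthlyRent: number;
  annualRent: number;
}

const PENALTY = 0.1;

function crossFieldPenalty(f: LeaseFields): number {
  let failures = 0;

  // commencement + term should land within 60 days of expiration
  const implied = new Date(f.commencementDate);
  implied.setMonth(implied.getMonth() + f.termMonths);
  const driftDays =
    Math.abs(implied.getTime() - f.expirationDate.getTime()) / 86_400_000;
  if (driftDays > 60) failures++;

  // monthly x 12 should be within 5% of the annual rent field
  if (Math.abs(f.monthlyRent * 12 - f.annualRent) / f.annualRent > 0.05) {
    failures++;
  }

  return failures * PENALTY; // subtracted from the field's confidence
}
```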


Confidence-bucketed QA routing

After validation adjusts the scores, every field gets routed into one of three QA buckets. This replaces the old approach of "review everything" or "review nothing."

Confidence-Bucketed QA Router

| Bucket | Confidence | Status | Share of fields |
| --- | --- | --- | --- |
| spot_check | ≥ 0.95 | AUTO_ACCEPTED | ~60% |
| validate | 0.80–0.95 | PENDING (quick) | ~25% |
| full_review | < 0.80 | PENDING (full) | ~15% |

Mid-confidence fields land in validate: a QA reviewer confirms or corrects each value in a quick pass.

The QA queue sorts fields by confidence ascending, lowest first, so reviewers spend time where it matters. Each review action creates an immutable audit log record with one of three clean outcomes: CONFIRMED, CORRECTED (with the new value), or ESCALATED.
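The routing itself reduces to a few lines. A sketch with the thresholds above (function names are illustrative):

```typescript
type Bucket = "spot_check" | "validate" | "full_review";

// Route one post-validation confidence score into its QA bucket.
function routeField(confidence: number): Bucket {
  if (confidence >= 0.95) return "spot_check"; // AUTO_ACCEPTED, ~60%
  if (confidence >= 0.8) return "validate";    // quick confirm, ~25%
  return "full_review";                        // field-by-field, ~15%
}

// Reviewers work the queue lowest-confidence first.
function sortQueue<T extends { confidence: number }>(fields: T[]): T[] {
  return [...fields].sort((a, b) => a.confidence - b.confidence);
}
```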


The batch queue and learning loop

Processing 500 documents concurrently would hammer the API and starve other tenants. We use pgBoss with two layers of concurrency: 15 jobs globally, 5 per organization. When an org hits its limit, excess jobs are deferred with a 5-second delay: not rejected, not lost, just re-enqueued.
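The gate is a pure decision; the deferral is a re-enqueue using pg-boss's startAfter option (seconds). A sketch where activeJobs()/activeJobsForOrg() are hypothetical counters (e.g. a COUNT over the job table):

```typescript
// Sketch of the two-layer concurrency gate. Limits mirror the post:
// 15 jobs globally, 5 per organization.
const GLOBAL_LIMIT = 15;
const ORG_LIMIT = 5;

function shouldDefer(globalActive: number, orgActive: number): boolean {
  return globalActive >= GLOBAL_LIMIT || orgActive >= ORG_LIMIT;
}

// Hypothetical worker body showing the deferral:
// async function handler(job: { data: { orgId: string } }) {
//   if (shouldDefer(await activeJobs(), await activeJobsForOrg(job.data.orgId))) {
//     await boss.send("extract-lease", job.data, { startAfter: 5 }); // retry in 5s
//     return; // deferred, not rejected, not lost
//   }
//   await runExtraction(job.data);
// }
```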

The pipeline gets smarter over time. When a QA reviewer corrects a field, we store the correction pattern keyed by lessor and document type. Next time we extract from the same lessor, those patterns inject directly into the prompt. Only patterns seen at least twice make the cut; one-off corrections might be noise.

// Only patterns confirmed by multiple corrections get injected
const lines = patterns
  .filter((p) => p.occurrences >= 2)
  .slice(0, 15)
  .map((p) =>
    `- Field "${p.fieldName}": AI extracted "${p.originalValue}"
       but was corrected to "${p.correctedValue}" (${p.occurrences}x)`
  );

Over the first 100 documents from a recurring lessor, the full_review bucket typically shrinks from ~30% of fields to under 10%.


Output adapters

Extracting data is half the problem. The other half is getting it into the client's existing system. We built 8 output adapters behind a factory pattern with lazy loading:

| Adapter | Format | Date Format | Key Quirk |
| --- | --- | --- | --- |
| Generic Excel | .xlsx | ISO 8601 | Dark headers, freeze panes |
| Generic CSV | .csv | ISO 8601 | Formula injection prevention |
| CoStar | .xlsx (3 tabs) | ISO 8601 | Lease Data + Financial + Options |
| LeaseQuery | .txt | ISO 8601 | Pipe-delimited, not CSV |
| Oracle FBDI | .csv | YYYY/MM/DD | Non-ISO date format |
| SAP RE-FX | .json | YYYYMMDD | BAPI_RE_CN_CREATE structure |
| Visual Lease | .xlsx | ISO 8601 | Red headers = required fields |
| Nakisa | .xlsx | ISO 8601 | 7 required columns, red/grey |

Every adapter implements the same contract (fields in, file out), but the output formats are wildly different. CoStar wants a three-sheet Excel workbook. LeaseQuery uses pipe-delimited text. SAP needs structured JSON with undelimited dates (YYYYMMDD). Lazy loading via async () => import() means we only pull in ExcelJS when generating Excel formats.
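The date quirks alone justify the shared contract. A sketch of how an adapter's declared style could drive formatting (keys and names are illustrative, not the production registry):

```typescript
// Date quirks per adapter, matching the table above. The real adapters
// also handle headers, tabs, styling, and escaping.
type DateStyle = "iso" | "oracle" | "sap";

const adapterDateStyle: Record<string, DateStyle> = {
  "generic-csv": "iso",
  "costar": "iso",
  "oracle-fbdi": "oracle", // YYYY/MM/DD
  "sap-refx": "sap",       // YYYYMMDD
};

function formatDate(d: Date, style: DateStyle): string {
  const iso = d.toISOString().slice(0, 10); // YYYY-MM-DD
  if (style === "oracle") return iso.replace(/-/g, "/");
  if (style === "sap") return iso.replace(/-/g, "");
  return iso;
}
```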


See Document Intelligence to learn more about these capabilities.

Next: the modification engine, what happens when lease terms change mid-stream, and why it's the #1 audit finding in ASC 842.