How We Extract Lease Data from 500 PDFs Without Losing a Dollar
Every enterprise has leases. A mid-market company might have 200. A Fortune 500 retailer might have 5,000. All of them buried in PDFs (scanned agreements, amendments, subleases, estoppels) sitting in SharePoint folders nobody wants to open.
When ASC 842 forced operating leases onto the balance sheet, accounting teams had two options: hire temps to key in every field by hand, or pay a consulting firm $500K+ to do it. Both are slow, expensive, and riddled with transcription errors. A single misread commencement date cascades into a wrong lease term, wrong amortization schedule, wrong journal entries, and an audit finding.
We built a pipeline that takes 500 lease PDFs and returns structured, validated, QA-routed data ready for import into any lease accounting system.
The extraction prompt
We do not ask Claude to "read this lease." We give it a field-by-field specification (60+ fields across seven categories) with explicit value types, confidence calibration rules, and source provenance requirements:
```typescript
const system = `You are a lease accounting specialist with deep expertise
in ASC 842, IFRS 16, and GASB 87.

EXTRACTION RULES:
1. Extract EVERY identifiable field from the document
2. For each field, provide a confidence score from 0.0 to 1.0:
   - 1.0 = explicitly stated verbatim in the document
   - 0.9-0.99 = clearly stated with minor interpretation
   - 0.8-0.89 = requires reasonable inference
   - 0.5-0.79 = uncertain, requires human review
   - Below 0.5 = very uncertain, just a guess
3. Include the exact source page number and quote`;
```

Without explicit calibration guidance, models cluster confidence around 0.9 for everything. By defining what each band means, we get a signal we can route on downstream. Every recurring payment also gets an ASC 842 classification (lease_component, nonlease_not_separated, nonlease_separated, or variable) because this determines what enters the present value calculation.
Tiered model strategy
Running Sonnet on 500 PDFs costs real money. Most leases are straightforward. We only need the heavy model when the cheap one is not confident enough.
Haiku extracts first at $0.25/M tokens. If average confidence across all fields drops below 0.85, Sonnet re-extracts at $3/M tokens. Both get the same prompt, the same 16K token budget, and temperature 0. In practice, 70% complete with Haiku alone. The threshold and both models are env-configurable, so no deploy is needed to tune.
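The escalation logic is simple enough to sketch in a few lines. Here `extractWith` and the model identifiers are assumptions for illustration; in production the models and threshold come from environment variables:

```typescript
// Hedged sketch of the Haiku-first, Sonnet-fallback strategy.
// extractWith() and the model ids are illustrative assumptions.
const CHEAP_MODEL = "claude-haiku";    // env-configurable in production
const STRONG_MODEL = "claude-sonnet";  // env-configurable in production
const ESCALATE_BELOW = 0.85;           // env-configurable in production

interface Field {
  confidence: number;
}

function averageConfidence(fields: Field[]): number {
  if (fields.length === 0) return 0;
  return fields.reduce((sum, f) => sum + f.confidence, 0) / fields.length;
}

async function extractTiered(
  pdf: Uint8Array,
  extractWith: (model: string, pdf: Uint8Array) => Promise<Field[]>,
): Promise<Field[]> {
  const cheap = await extractWith(CHEAP_MODEL, pdf);
  // Most documents stop here; only low-confidence ones escalate.
  if (averageConfidence(cheap) >= ESCALATE_BELOW) return cheap;
  return extractWith(STRONG_MODEL, pdf);
}
```

Because both tiers share the same prompt, token budget, and temperature, the only variable in the comparison is the model itself.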
The 497-line validation engine
Claude gives us confidence scores. We do not trust them blindly.
After extraction, every field passes through a three-layer validation engine. Each failed check penalizes the field's confidence by 0.1, which can push it into a different QA bucket. A field starting at 0.92 that fails two checks drops to 0.72, below the 0.80 threshold, straight into full review.
The validation engine does not fix data. It creates a reliable signal for the QA routing layer. Cross-field consistency is where we catch the errors that matter most: commencement + term ≈ expiration (within 60 days), monthly × 12 ≈ annual rent (within 5%), payment streams matching the base rent field.
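The penalty mechanics and two of the cross-field checks can be sketched as follows. The tolerance constants mirror the prose above; the helper names are assumptions:

```typescript
// Sketch of the confidence penalty and two cross-field checks.
// Tolerances come from the text; function names are assumptions.
const PENALTY = 0.1;

interface FieldCheck {
  name: string;
  passed: boolean;
}

// Each failed check knocks 0.1 off the field's confidence.
function applyPenalties(confidence: number, checks: FieldCheck[]): number {
  const failures = checks.filter((c) => !c.passed).length;
  return Math.max(0, confidence - failures * PENALTY);
}

// Monthly rent x 12 should match the annual rent field within 5%.
function monthlyMatchesAnnual(monthly: number, annual: number): boolean {
  if (annual === 0) return false;
  return Math.abs(monthly * 12 - annual) / annual <= 0.05;
}

// Commencement + term should land within 60 days of the stated expiration.
function termMatchesExpiration(
  commencement: Date,
  termMonths: number,
  expiration: Date,
): boolean {
  const projected = new Date(commencement);
  projected.setMonth(projected.getMonth() + termMonths);
  const diffDays = Math.abs(+expiration - +projected) / 86_400_000;
  return diffDays <= 60;
}
```

A field at 0.92 failing two of these checks lands at 0.72, which is exactly the full-review scenario described above.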
Confidence-bucketed QA routing
After validation adjusts the scores, every field gets routed into one of three QA buckets. This replaces the old approach of "review everything" or "review nothing."
The QA queue sorts fields by ascending confidence, lowest first, so reviewers spend time where it matters. Each review action creates an immutable audit-log record with one of three clean outcomes: CONFIRMED, CORRECTED with the new value, or ESCALATED.
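The routing itself reduces to two threshold comparisons. The 0.80 full-review cutoff comes from the validation discussion above; the 0.95 auto-approve cutoff and the bucket names other than full_review are assumed values for illustration:

```typescript
// Sketch of confidence-bucketed routing. The 0.80 cutoff is from the
// text; 0.95 and the non-full_review bucket names are assumptions.
type QaBucket = "auto_approve" | "spot_check" | "full_review";

function routeField(confidence: number): QaBucket {
  if (confidence >= 0.95) return "auto_approve";
  if (confidence >= 0.8) return "spot_check";
  return "full_review";
}

// Queue sorted ascending so reviewers see the shakiest fields first.
function buildQueue<T extends { confidence: number }>(fields: T[]): T[] {
  return [...fields].sort((a, b) => a.confidence - b.confidence);
}
```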
The batch queue and learning loop
Processing 500 documents concurrently would hammer the API and starve other tenants. We use pgBoss with two layers of concurrency: 15 jobs globally, 5 per organization. When an org hits its limit, excess jobs are deferred with a 5-second delay: not rejected, not lost, just re-enqueued.
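The admission decision behind that deferral is a pure function of the two counters. pg-boss enforces the global cap for us; this sketch shows the per-org check and the 5-second deferral (function and type names are assumptions):

```typescript
// Sketch of the two-layer concurrency gate described above.
// pg-boss enforces the global cap; the per-org check is ours.
const GLOBAL_LIMIT = 15;
const PER_ORG_LIMIT = 5;
const DEFER_SECONDS = 5;

type Decision =
  | { action: "run" }
  | { action: "defer"; startAfter: number }; // seconds until re-enqueue

function admit(activeGlobal: number, activeForOrg: number): Decision {
  if (activeGlobal >= GLOBAL_LIMIT || activeForOrg >= PER_ORG_LIMIT) {
    // Not rejected, not lost: re-enqueue with a short delay.
    return { action: "defer", startAfter: DEFER_SECONDS };
  }
  return { action: "run" };
}
```

The deferred branch maps naturally onto a delayed re-send in the job queue, so a busy org's backlog drains as its in-flight jobs complete.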
The pipeline gets smarter over time. When a QA reviewer corrects a field, we store the correction pattern keyed by lessor and document type. Next time we extract from the same lessor, those patterns inject directly into the prompt. Only patterns seen at least twice make the cut; one-off corrections might be noise.
```typescript
// Only patterns confirmed by multiple corrections get injected
const lines = patterns
  .filter((p) => p.occurrences >= 2)
  .slice(0, 15)
  .map((p) =>
    `- Field "${p.fieldName}": AI extracted "${p.originalValue}"
but was corrected to "${p.correctedValue}" (${p.occurrences}x)`
  );
```

Over the first 100 documents from a recurring lessor, the full_review bucket typically shrinks from ~30% of fields to under 10%.
Output adapters
Extracting data is half the problem. The other half is getting it into the client's existing system. We built 8 output adapters behind a factory pattern with lazy loading:
| Adapter | Format | Date Format | Key Quirk |
|---|---|---|---|
| Generic Excel | .xlsx | ISO 8601 | Dark headers, freeze panes |
| Generic CSV | .csv | ISO 8601 | Formula injection prevention |
| CoStar | .xlsx (3 tabs) | ISO 8601 | Lease Data + Financial + Options |
| LeaseQuery | .txt | ISO 8601 | Pipe-delimited, not CSV |
| Oracle FBDI | .csv | YYYY/MM/DD | Non-ISO date format |
| SAP RE-FX | .json | YYYYMMDD | BAPI_RE_CN_CREATE structure |
| Visual Lease | .xlsx | ISO 8601 | Red headers = required fields |
| Nakisa | .xlsx | ISO 8601 | 7 required columns, red/grey |
Every adapter implements the same contract (fields in, file out), but the output formats are wildly different. CoStar wants a three-sheet Excel workbook. LeaseQuery uses pipe-delimited text. SAP needs structured JSON with undelimited YYYYMMDD dates. Lazy loading via `async () => import()` means ExcelJS is only pulled in when generating Excel formats.
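The factory reduces to a registry of lazy loaders behind one lookup function. The interface, registry keys, and the inlined CSV stub below are assumptions made so the sketch is self-contained; the production registry points each key at a dynamically imported module:

```typescript
// Sketch of the adapter factory. Interface and keys are assumptions.
// In production each loader is `async () => import("./adapters/...")`,
// so heavy deps like ExcelJS load only when that format is requested.
interface OutputAdapter {
  generate(fields: Record<string, unknown>[]): Promise<Uint8Array>;
}

type AdapterLoader = () => Promise<OutputAdapter>;

const adapters: Record<string, AdapterLoader> = {
  // Inlined stub so this sketch runs standalone; the real loader
  // dynamically imports the CSV adapter module instead.
  generic_csv: async () => ({
    async generate(fields) {
      const header = Object.keys(fields[0] ?? {}).join(",");
      const rows = fields.map((f) => Object.values(f).join(","));
      return new TextEncoder().encode([header, ...rows].join("\n"));
    },
  }),
  // ...one entry per adapter
};

function getAdapter(key: string): Promise<OutputAdapter> {
  const load = adapters[key];
  if (!load) throw new Error(`Unknown adapter: ${key}`); // fail fast, before any import
  return load();
}
```

Keeping the unknown-key check synchronous means a bad adapter name fails immediately rather than surfacing later as a rejected promise mid-export.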
See Document Intelligence to learn more about these capabilities.
Next: the modification engine, what happens when lease terms change mid-stream, and why it's the #1 audit finding in ASC 842.