Document Summarization

Problem

Automatically generate concise, accurate summaries of long-form enterprise documents — policies, contracts, research reports, meeting transcripts, incident reports, regulatory filings — so that employees can extract key information without reading the full document. The challenge is faithfulness (no hallucinated facts), relevance (summary matches reader intent), and scalability (thousands of documents processed daily).

Users / Stakeholders

Role                         Use case
Legal / compliance officer   Summarise contract clauses, regulatory guidance
Claims adjuster              Extract key facts from medical records, police reports
Analyst                      Summarise research reports, earnings calls
Customer support agent       Summarise customer ticket history
Executive                    Board-ready executive summaries of technical reports

Domain Context

  • Document types vary widely: PDFs, Word documents, scanned images (requiring OCR), HTML pages, structured forms. Each has different extraction complexity.
  • Context window limits: Long documents (100+ pages) exceed LLM context windows. Chunking + map-reduce summarisation or retrieval-based selection is necessary.
  • Faithfulness is critical: In legal and medical contexts, a hallucinated fact in a summary can have serious consequences. Faithfulness evaluation (NLI-based) is mandatory.
  • Regulatory: Processing personal data in documents triggers GDPR. Legal professional privilege may apply to legal document summaries — data handling must comply with attorney-client privilege rules.
  • Language diversity: Enterprise documents may be multilingual. Model must handle target languages (GPT-4 handles 95+ languages; smaller models may not).

Inputs and Outputs

Input:

document:          Raw text (extracted from PDF/DOCX/HTML) or markdown
document_type:     CONTRACT / POLICY / REPORT / TRANSCRIPT / MEDICAL_RECORD
user_intent:       FREE_TEXT or STRUCTURED_QUERY ("what are the termination clauses?")
length_target:     SHORT (3 sentences) / MEDIUM (1 paragraph) / DETAILED (1 page)
output_format:     PROSE / BULLET_POINTS / STRUCTURED_JSON

Output:

summary:           Generated text summary matching length and format target
key_points:        Extracted bullet points of most important facts
entities:          Named entity list (people, organisations, dates, amounts)
citations:         Source text spans that support key summary claims
confidence:        Faithfulness score (NLI-based) if available

Decision or Workflow Role

Document arrives (upload / email / API push)
  ↓
Ingestion: format detection → text extraction → OCR if needed
  ↓
Chunking: recursive character split (512–2048 tokens with overlap)
  ↓
For short docs (<16K tokens): direct summarisation prompt
For long docs: 
  → Map: summarise each chunk independently
  → Reduce: summarise the chunk summaries into final summary
  ↓
Post-processing: faithfulness check, entity extraction, formatting
  ↓
Delivery: push to CRM / SharePoint / email / API response
  ↓
User feedback (thumbs up/down) → log → fine-tuning dataset
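The chunking and map-reduce steps above can be sketched as follows. The word-window splitter stands in for a token-based recursive character splitter, and `summarise` is any callable wrapping an LLM call; both are assumptions for illustration, not a prescribed implementation.

```python
def chunk_words(text, chunk_size=400, overlap=50):
    """Split text into overlapping word windows (a crude stand-in for
    token-based recursive character splitting with overlap)."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
        start += chunk_size - overlap  # slide window, keep overlap for context
    return chunks

def summarise_document(text, summarise, direct_limit=400):
    """Direct prompt for short documents; map-reduce for long ones.
    `summarise` is any callable text -> summary (e.g. an LLM call)."""
    if len(text.split()) <= direct_limit:
        return summarise(text)                        # short path: one pass
    partials = [summarise(c) for c in chunk_words(text, chunk_size=direct_limit)]
    return summarise("\n".join(partials))             # reduce step
```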

Modeling / System Options

  • Prompt + LLM (GPT-4o, Claude). Strengths: flexible, no training, high quality. Weaknesses: cost, latency, no fine-tuning. Use for: general enterprise work at low volume.
  • Fine-tuned smaller model (Llama, Mistral). Strengths: lower cost, on-premises, domain adaptation. Weaknesses: training cost, quality ceiling. Use for: high volume, data privacy, domain-specific needs.
  • Extractive summarisation (BERT-based). Strengths: faithful (no hallucination risk), fast. Weaknesses: not abstractive, robotic output. Use for: faithfulness-critical and compliance contexts.
  • Map-reduce chain (LangChain). Strength: handles arbitrary document length. Weakness: coherence loss in the reduce step. Use for: long documents (>50K tokens).
  • RAG-based (retrieval + generation). Strengths: query-specific summaries, better for targeted questions. Weakness: infrastructure cost. Use for: cases where the user specifies what to summarise.

Recommended: GPT-4o-mini or Claude Haiku for cost-effective production. Fine-tune Mistral-7B for high-volume, domain-specific workloads (e.g., medical records). Always include faithfulness evaluation.
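The routing implied by the options and recommendation above could look like the sketch below; the thresholds and backend names are illustrative, not fixed choices.

```python
def choose_backend(doc_tokens: int, contains_pii: bool, targeted_query: bool) -> str:
    """Route a summarisation request to a backend, roughly following the
    trade-offs listed above. Thresholds and names are placeholders."""
    if contains_pii:
        return "finetuned_on_prem"   # e.g. fine-tuned Mistral-7B behind vLLM
    if targeted_query:
        return "rag"                 # retrieve only the sections the query needs
    if doc_tokens > 50_000:
        return "map_reduce"          # beyond comfortable single-pass length
    return "direct_llm"              # e.g. GPT-4o-mini / Claude Haiku
```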

Deployment Constraints

  • Latency: Async acceptable for most use cases — process in background, deliver via notification. Interactive use (user waiting) requires <10s for short documents.
  • Cost: roughly $0.01 per document with GPT-4o; at 100K docs/day that is $1K/day. The cost model must be validated before scaling.
  • Privacy: Documents often contain PII. On-premises models (vLLM + Llama) or data processing agreements with API providers required. Avoid sending raw documents to external APIs without legal review.
  • Faithfulness testing: Deploy NLI faithfulness scorer (TRUE, SummaC) on a sample of outputs. Alert if faithfulness drops below threshold.
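A sketch of the faithfulness gate, assuming a SummaC-style aggregation: each summary sentence is scored by its best entailment probability against any source chunk, and the scores are averaged. `entail_prob` is a placeholder for a real NLI model call, not an existing API.

```python
def faithfulness_score(source_chunks, summary_sentences, entail_prob):
    """SummaC-style zero-shot check. `entail_prob(premise, hypothesis)`
    returns the entailment probability from an NLI model (assumed)."""
    per_sentence = [
        max(entail_prob(chunk, sent) for chunk in source_chunks)
        for sent in summary_sentences
    ]
    return sum(per_sentence) / len(per_sentence)

def should_alert(score, threshold=0.85):
    """Flag outputs whose sampled faithfulness falls below the threshold."""
    return score < threshold
```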

Risks and Failure Modes

  • Hallucination: the LLM generates facts not in the source document. Mitigation: faithfulness evaluation, extractive baseline, citation requirement.
  • Information loss: key sections not captured in the summary. Mitigation: length-adaptive prompting, user feedback loop.
  • Context window overflow: long documents truncated without warning. Mitigation: chunking strategy, document length metadata.
  • PII leakage: summaries include sensitive data sent externally. Mitigation: on-prem model, data classification before routing.
  • Language quality degradation: non-English documents summarised with errors. Mitigation: per-language evaluation, language-specific model routing.
  • Format drift: the model changes output format across versions. Mitigation: schema validation, versioned prompts pinned to model version.
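For the format-drift mitigation, a minimal schema-validation gate; the required keys mirror the output contract above, while the per-field checks are illustrative.

```python
import json

# Keys the output contract requires; reject payloads missing any of them.
REQUIRED_KEYS = {"summary", "key_points", "entities", "citations"}

def validate_output(raw: str):
    """Parse and validate a model's JSON output. Returns the parsed dict,
    or None if the payload drifted from the schema and should be retried."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj):
        return None
    if not isinstance(obj["summary"], str) or not isinstance(obj["key_points"], list):
        return None
    return obj
```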

Success Metrics

Metric                            Target              Notes
Faithfulness score (NLI)          > 0.85              No hallucinated facts; measured on a human-labelled test set
ROUGE-L                           > 0.35              Overlap with human reference summary (when available)
User satisfaction                 > 4.2/5             Survey-based; primary quality signal
Time to insight (user-reported)   60–80% reduction    Business impact metric
Processing throughput             100+ docs/minute    Operational requirement
Cost per document                 < $0.05             Financial viability
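The cost-per-document target can be checked with simple token arithmetic; the prices in the example are placeholders, not quoted provider rates.

```python
def cost_per_document(input_tokens: int, output_tokens: int,
                      input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Per-document API cost in dollars. Prices are per million tokens and
    must be taken from the provider's current price sheet."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000
```

For example, 20K input tokens and 500 output tokens at placeholder rates of $0.15 / $0.60 per million tokens come to about $0.0033 per document, comfortably under the $0.05 target.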

References

  • Goyal, T. et al. (2022). News Summarization and Evaluation in the Era of GPT-3. EMNLP.
  • Laban, P. et al. (2022). SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization. TACL.
