Document Summarization

Problem

Automatically generate concise, accurate summaries of long-form enterprise documents — policies, contracts, research reports, meeting transcripts, incident reports, regulatory filings — so that employees can extract key information without reading the full document. The challenge is faithfulness (no hallucinated facts), relevance (summary matches reader intent), and scalability (thousands of documents processed daily).

Users / Stakeholders

Role                         Use case
Legal / compliance officer   Summarise contract clauses, regulatory guidance
Claims adjuster              Extract key facts from medical records, police reports
Analyst                      Summarise research reports, earnings calls
Customer support agent       Summarise customer ticket history
Executive                    Board-ready executive summaries of technical reports

Domain Context

  • Document types vary widely: PDFs, Word documents, scanned images (requiring OCR), HTML pages, structured forms. Each has different extraction complexity.
  • Context window limits: Long documents (100+ pages) exceed LLM context windows. Chunking + map-reduce summarisation or retrieval-based selection is necessary.
  • Faithfulness is critical: In legal and medical contexts, a hallucinated fact in a summary can have serious consequences. Faithfulness evaluation (NLI-based) is mandatory.
  • Regulatory: Processing personal data in documents triggers GDPR. Legal professional privilege may apply to legal document summaries — data handling must comply with attorney-client privilege rules.
  • Language diversity: Enterprise documents may be multilingual. Model must handle target languages (GPT-4 handles 95+ languages; smaller models may not).

Inputs and Outputs

Input:

document:          Raw text (extracted from PDF/DOCX/HTML) or markdown
document_type:     CONTRACT / POLICY / REPORT / TRANSCRIPT / MEDICAL_RECORD
user_intent:       FREE_TEXT or STRUCTURED_QUERY ("what are the termination clauses?")
length_target:     SHORT (3 sentences) / MEDIUM (1 paragraph) / DETAILED (1 page)
output_format:     PROSE / BULLET_POINTS / STRUCTURED_JSON

Output:

summary:           Generated text summary matching length and format target
key_points:        Extracted bullet points of most important facts
entities:          Named entity list (people, organisations, dates, amounts)
citations:         Source text spans that support key summary claims
confidence:        Faithfulness score (NLI-based) if available

Decision or Workflow Role

Document arrives (upload / email / API push)
  ↓
Ingestion: format detection → text extraction → OCR if needed
  ↓
Chunking: recursive character split (512–2048 tokens with overlap)
  ↓
For short docs (<16K tokens): direct summarisation prompt
For long docs: 
  → Map: summarise each chunk independently
  → Reduce: summarise the chunk summaries into final summary
  ↓
Post-processing: faithfulness check, entity extraction, formatting
  ↓
Delivery: push to CRM / SharePoint / email / API response
  ↓
User feedback (thumbs up/down) → log → fine-tuning dataset
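The chunking and map-reduce steps above can be sketched as follows. The word-window splitter stands in for a token-based recursive character splitter, and `summarise` is any callable wrapping an LLM call; both are assumptions for illustration, not a prescribed implementation.

```python
def chunk_words(text, chunk_size=400, overlap=50):
    """Split text into overlapping word windows (a crude stand-in for
    token-based recursive character splitting with overlap)."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
        start += chunk_size - overlap  # slide window, keep overlap for context
    return chunks

def summarise_document(text, summarise, direct_limit=400):
    """Direct prompt for short documents; map-reduce for long ones.
    `summarise` is any callable text -> summary (e.g. an LLM call)."""
    if len(text.split()) <= direct_limit:
        return summarise(text)                        # short path: one pass
    partials = [summarise(c) for c in chunk_words(text, chunk_size=direct_limit)]
    return summarise("\n".join(partials))             # reduce step
```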

Modeling / System Options

  • Prompt + LLM (GPT-4o, Claude). Strengths: flexible, no training, high quality. Weaknesses: cost, latency, no fine-tuning. Use for: general enterprise work at low volume.
  • Fine-tuned smaller model (Llama, Mistral). Strengths: lower cost, on-premises, domain adaptation. Weaknesses: training cost, quality ceiling. Use for: high volume, data privacy, domain-specific needs.
  • Extractive summarisation (BERT-based). Strengths: faithful (no hallucination risk), fast. Weaknesses: not abstractive, robotic output. Use for: faithfulness-critical and compliance contexts.
  • Map-reduce chain (LangChain). Strength: handles arbitrary document length. Weakness: coherence loss in the reduce step. Use for: long documents (>50K tokens).
  • RAG-based (retrieval + generation). Strengths: query-specific summaries, better for targeted questions. Weakness: infrastructure cost. Use for: cases where the user specifies what to summarise.

Recommended: GPT-4o-mini or Claude Haiku for cost-effective production. Fine-tune Mistral-7B for high-volume, domain-specific workloads (e.g., medical records). Always include faithfulness evaluation.
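The routing implied by the options and recommendation above could look like the sketch below; the thresholds and backend names are illustrative, not fixed choices.

```python
def choose_backend(doc_tokens: int, contains_pii: bool, targeted_query: bool) -> str:
    """Route a summarisation request to a backend, roughly following the
    trade-offs listed above. Thresholds and names are placeholders."""
    if contains_pii:
        return "finetuned_on_prem"   # e.g. fine-tuned Mistral-7B behind vLLM
    if targeted_query:
        return "rag"                 # retrieve only the sections the query needs
    if doc_tokens > 50_000:
        return "map_reduce"          # beyond comfortable single-pass length
    return "direct_llm"              # e.g. GPT-4o-mini / Claude Haiku
```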

Deployment Constraints

  • Latency: Async acceptable for most use cases — process in background, deliver via notification. Interactive use (user waiting) requires <10s for short documents.
  • Cost: roughly $0.01 per document with GPT-4o; at 100K docs/day that is $1K/day. The cost model must be validated before scaling.
  • Privacy: Documents often contain PII. On-premises models (vLLM + Llama) or data processing agreements with API providers required. Avoid sending raw documents to external APIs without legal review.
  • Faithfulness testing: Deploy NLI faithfulness scorer (TRUE, SummaC) on a sample of outputs. Alert if faithfulness drops below threshold.
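A sketch of the faithfulness gate, assuming a SummaC-style aggregation: each summary sentence is scored by its best entailment probability against any source chunk, and the scores are averaged. `entail_prob` is a placeholder for a real NLI model call, not an existing API.

```python
def faithfulness_score(source_chunks, summary_sentences, entail_prob):
    """SummaC-style zero-shot check. `entail_prob(premise, hypothesis)`
    returns the entailment probability from an NLI model (assumed)."""
    per_sentence = [
        max(entail_prob(chunk, sent) for chunk in source_chunks)
        for sent in summary_sentences
    ]
    return sum(per_sentence) / len(per_sentence)

def should_alert(score, threshold=0.85):
    """Flag outputs whose sampled faithfulness falls below the threshold."""
    return score < threshold
```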

Risks and Failure Modes

  • Hallucination: the LLM generates facts not in the source document. Mitigation: faithfulness evaluation, extractive baseline, citation requirement.
  • Information loss: key sections not captured in the summary. Mitigation: length-adaptive prompting, user feedback loop.
  • Context window overflow: long documents truncated without warning. Mitigation: chunking strategy, document length metadata.
  • PII leakage: summaries include sensitive data sent externally. Mitigation: on-prem model, data classification before routing.
  • Language quality degradation: non-English documents summarised with errors. Mitigation: per-language evaluation, language-specific model routing.
  • Format drift: the model changes output format across versions. Mitigation: schema validation, versioned prompts pinned to model version.
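For the format-drift mitigation, a minimal schema-validation gate; the required keys mirror the output contract above, while the per-field checks are illustrative.

```python
import json

# Keys the output contract requires; reject payloads missing any of them.
REQUIRED_KEYS = {"summary", "key_points", "entities", "citations"}

def validate_output(raw: str):
    """Parse and validate a model's JSON output. Returns the parsed dict,
    or None if the payload drifted from the schema and should be retried."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj):
        return None
    if not isinstance(obj["summary"], str) or not isinstance(obj["key_points"], list):
        return None
    return obj
```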

Success Metrics

Metric                            Target              Notes
Faithfulness score (NLI)          > 0.85              No hallucinated facts; measured on a human-labelled test set
ROUGE-L                           > 0.35              Overlap with human reference summary (when available)
User satisfaction                 > 4.2/5             Survey-based; primary quality signal
Time to insight (user-reported)   60–80% reduction    Business impact metric
Processing throughput             100+ docs/minute    Operational requirement
Cost per document                 < $0.05             Financial viability
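The cost-per-document target can be checked with simple token arithmetic; the prices in the example are placeholders, not quoted provider rates.

```python
def cost_per_document(input_tokens: int, output_tokens: int,
                      input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Per-document API cost in dollars. Prices are per million tokens and
    must be taken from the provider's current price sheet."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000
```

For example, 20K input tokens and 500 output tokens at placeholder rates of $0.15 / $0.60 per million tokens come to about $0.0033 per document, comfortably under the $0.05 target.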

References

  • Goyal, T. et al. (2022). News Summarization and Evaluation in the Era of GPT-3. EMNLP.
  • Laban, P. et al. (2022). SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization. TACL.
