RAG Knowledge Assistant

Problem

Provide employees with a conversational assistant that answers questions using the organisation’s internal knowledge base — policies, procedures, technical documentation, past project outputs, HR information — in natural language, with citations. Traditional intranets and wikis are hard to navigate; employees waste significant time searching for answers or asking colleagues. A RAG assistant reduces resolution time and improves the consistency of answers.

Users / Stakeholders

| Role | Use case |
|---|---|
| New employee | Onboarding questions about policies and procedures |
| HR employee | Answer employee benefits and leave policy questions at scale |
| Engineer | Technical documentation lookup, API reference |
| Claims adjuster | Coverage interpretation, underwriting guidelines |
| Sales | Product pricing, eligibility rules, competitive positioning |

Domain Context

  • Knowledge currency: Documents in the knowledge base may be outdated. The system must surface document dates and must not present stale answers as current fact. Freshness filtering is critical.
  • Confidentiality tiers: Some documents (board minutes, personnel files, M&A documents) should not be accessible to all employees. ACL enforcement is non-negotiable.
  • Ambiguous queries: “What is our holiday policy?” is ambiguous — which country? Which employment type? The system must clarify or provide multiple answers.
  • Hallucination risk: The LLM must answer only from retrieved context. Without proper grounding, it will fabricate policy details — potentially creating legal exposure.
  • Integration: Knowledge is spread across SharePoint, Confluence, Google Drive, email archives, PDF repositories. Connectors for each system are required.

Inputs and Outputs

Input:

user_query:          Natural language question
conversation_history: Prior turns in the chat session
user_context:        Department, location, role, language

Output:

answer:      Natural language response grounded in retrieved documents
citations:   List of source documents with title, URL, date, relevant excerpt
confidence:  CONFIDENT / UNCERTAIN (based on retrieval quality score)
followup:    Suggested clarifying questions (optional)
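
The output contract above can be sketched as a typed response schema. This is a minimal illustration — field and class names are assumptions, not a production schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Confidence(Enum):
    CONFIDENT = "CONFIDENT"
    UNCERTAIN = "UNCERTAIN"

@dataclass
class Citation:
    title: str
    url: str
    date: str      # ISO 8601 date of the source document
    excerpt: str   # the passage the answer is grounded in

@dataclass
class AssistantResponse:
    answer: str                            # grounded natural-language response
    citations: list[Citation]
    confidence: Confidence                 # derived from retrieval quality score
    followup: Optional[list[str]] = None   # suggested clarifying questions
```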

Decision or Workflow Role

[Knowledge ingestion pipeline] — nightly or event-driven
Connectors (SharePoint / Confluence / Google Drive / PDF upload)
  → Text extraction + OCR → Chunking → Embedding → Upsert to vector store
  → Metadata: source, date, author, ACL group, document_type
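
The ingestion steps above can be sketched as follows, assuming documents have already been extracted to plain text. The hash-based `embed()` is a deterministic stand-in for a real embedding model (e.g. text-embedding-3-small), and the record layout is illustrative:

```python
import hashlib

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def embed(text: str, dim: int = 8) -> list[float]:
    """Stand-in embedder: deterministic pseudo-vector from a hash.
    A real pipeline would call an embedding model instead."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

def ingest(doc: dict) -> list[dict]:
    """Turn one extracted document into vector-store upsert records,
    carrying the metadata the query pipeline filters on."""
    return [
        {
            "id": f"{doc['source_id']}#{i}",
            "vector": embed(chunk),
            "text": chunk,
            "metadata": {
                "source": doc["source"],
                "date": doc["date"],
                "author": doc["author"],
                "acl_group": doc["acl_group"],
                "document_type": doc["document_type"],
            },
        }
        for i, chunk in enumerate(chunk_text(doc["text"]))
    ]
```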

[Query pipeline] — real-time
User query → ACL context attached
  → Semantic search: embed query → ANN retrieval (top 20)
  → Metadata filter: ACL check, freshness filter
  → Reranking: cross-encoder top 20 → top 5
  → Prompt assembly: [system prompt] + [top 5 chunks] + [chat history] + [query]
  → LLM generation (GPT-4o-mini / Claude Haiku)
  → Citation extraction + response formatting
  → Deliver to user via Slack / Teams / web UI
  → Log interaction → evaluation pipeline → fine-tuning data
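
The retrieval stage with server-side ACL and freshness filtering might look like this sketch. An exact cosine scan stands in for ANN search, and reranking and generation are omitted:

```python
import math
from datetime import date

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, records, user_groups, min_date, top_k=5):
    """ANN retrieval stand-in: exact cosine scan, with the ACL and
    freshness filters applied server-side before top-k selection."""
    allowed = [
        r for r in records
        if r["metadata"]["acl_group"] in user_groups                 # ACL check
        and date.fromisoformat(r["metadata"]["date"]) >= min_date    # freshness
    ]
    ranked = sorted(allowed, key=lambda r: cosine(query_vec, r["vector"]),
                    reverse=True)
    return ranked[:top_k]
```

Applying the filters before ranking (rather than post-filtering the top 20) is what makes the ACL enforcement server-side: an unauthorised chunk can never reach the prompt.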

Modeling / System Options

| Decision | Options | Notes |
|---|---|---|
| Embedding model | text-embedding-3-small (cost), bge-large-en-v1.5 (accuracy) | On-prem vs API cost |
| LLM | GPT-4o-mini (cost/quality), Claude Haiku (fast), Llama-3-8B (on-prem) | Privacy requirements drive choice |
| Vector store | pgvector (SQL-native), Qdrant (performance), Chroma (dev) | Scale + ops maturity |
| RAG framework | LangChain (broad connectors), LlamaIndex (retrieval focus) | Connector ecosystem |
| Grounding enforcement | Prompt engineering + “do not hallucinate” instruction; NLI faithfulness check; citation-forced generation | Layered defence |

Recommended: LlamaIndex for the retrieval pipeline (better chunking strategies). GPT-4o-mini or Claude Haiku for generation. pgvector if the team has PostgreSQL expertise.
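
The NLI faithfulness check in the grounding-enforcement row can be approximated, for illustration only, by a crude lexical-overlap heuristic; a real deployment would use an NLI model or an LLM judge rather than this sketch:

```python
def is_faithful(answer: str, chunks: list[str], threshold: float = 0.5) -> bool:
    """Crude lexical proxy for an NLI faithfulness check: flag answers
    whose content words are largely absent from the retrieved context."""
    context_words = set(" ".join(chunks).lower().split())
    answer_words = [w for w in answer.lower().split() if len(w) > 3]
    if not answer_words:
        return True
    covered = sum(w in context_words for w in answer_words)
    return covered / len(answer_words) >= threshold
```

Answers failing the check would be suppressed or regenerated rather than shown to the user.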

Deployment Constraints

  • Privacy and data residency: Employee knowledge bases contain personal data. Data must stay in the relevant jurisdiction. On-premises LLM or EU-hosted API endpoint required for GDPR compliance.
  • Latency: Conversational response expected in <5 seconds. Streaming output (token-by-token) improves perceived latency.
  • Access control: System must call the organisation’s identity provider (Okta, Azure AD) to resolve user permissions before retrieval.
  • Confidence signalling: When retrieved chunks have low relevance scores, the system should say “I couldn’t find a confident answer” rather than hallucinate.
  • Observability: All queries, retrieved chunks, and generated answers must be logged (LangSmith / custom) for quality auditing and compliance.

Risks and Failure Modes

| Risk | Description | Mitigation |
|---|---|---|
| Policy hallucination | LLM invents policy details not in the documents | Grounding constraint in system prompt; faithfulness check |
| ACL bypass | Confidential document retrieved for unauthorised user | Server-side ACL filter at retrieval stage |
| Stale knowledge | Answer based on outdated document version | Freshness filter; document date in answer context |
| Overconfident wrong answer | System gives definitive answer when documents are ambiguous | Uncertainty signalling; “consult HR for confirmation” framing |
| Index coverage gaps | Question answered with partial context because relevant doc not indexed | Document coverage reporting; connector completeness monitoring |
| Prompt injection | Malicious user crafts query to exfiltrate other users’ data | Input sanitisation; strict system prompt; output filtering |
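
A first line of defence against prompt injection is a pattern screen on the incoming query. The patterns below are illustrative only and would be maintained alongside the strict system prompt and output filtering, not relied on alone:

```python
import re
from typing import Optional

# Illustrative deny-list; a real screen would be broader and regularly updated.
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"system prompt",
    r"reveal .* (credentials|documents)",
]

def sanitise(query: str) -> Optional[str]:
    """Return the cleaned query, or None if it matches an injection pattern."""
    lowered = query.lower()
    if any(re.search(p, lowered) for p in INJECTION_PATTERNS):
        return None  # reject and log for review
    return query.strip()
```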

Success Metrics

| Metric | Target | Notes |
|---|---|---|
| Answer accuracy (human eval) | > 85% correct | Spot check by domain expert |
| Citation accuracy | > 90% of answers have valid citation | Verifiable grounding |
| User satisfaction | > 4.0/5 | Monthly survey |
| Deflection rate | > 40% of HR / helpdesk queries resolved without human involvement | Business efficiency metric |
| Hallucination rate | < 2% of answers | Human adversarial evaluation |
| Time to answer | < 5 seconds | UX SLA |
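
Several of these metrics can be rolled up directly from the interaction logs fed to the evaluation pipeline. The log field names below are assumptions about the logging schema:

```python
def rollup_metrics(logs: list[dict]) -> dict:
    """Aggregate interaction logs into the success metrics above.
    Assumed log fields: citations (list), escalated (bool), latency_s (float)."""
    n = len(logs)
    return {
        "citation_rate": sum(bool(l["citations"]) for l in logs) / n,
        "deflection_rate": sum(not l["escalated"] for l in logs) / n,
        "avg_latency_s": sum(l["latency_s"] for l in logs) / n,
    }
```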
