Once your documents are processed (text chunked, embedded, and stored), as described in "Core techniques in an Enterprise Knowledge Assistant", you’re ready to answer user queries in real time. This stage involves transforming the user’s query into an embedding, retrieving relevant chunks, and then feeding both the query and the retrieved context to the LLM.
*** This article is part of a blog series. If you haven’t read the previous articles yet, be sure to check them out.
1. User Prompt
The process starts when the user asks a question or provides a prompt to your AI Knowledge Assistant.
2. Vectorization
The user’s query is converted into an embedding—ideally using the same model (or a compatible one) used to embed your documents. This ensures semantic compatibility between the question and the stored content.
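As a minimal sketch (assuming your documents were embedded with a sentence-transformers model; the model name below is only an example), the query-side vectorization can look like this:

```python
from sentence_transformers import SentenceTransformer

# Assumption: the same model was used at indexing time; swap in your own.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_query(query: str) -> list[float]:
    # encode() returns a numpy array; convert it to a plain list for the database driver.
    return embedding_model.encode(query, normalize_embeddings=True).tolist()

query_vector = embed_query("How do I rotate the API keys for the billing service?")
```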
3. Hybrid Search
This process typically takes place in the database where you’ve stored embeddings and other relevant information. For a true hybrid approach—combining semantic (vector) and lexical (full-text) queries—you need a multi-model database capable of handling both vector search and traditional full-text indexing.
For more details, see Hybrid Search Explained. A hybrid query lets you blend the power of embeddings with keyword-based matching.
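Because the exact query syntax depends on your database, the sketch below stays database-agnostic: it assumes you already have two ranked lists of chunk IDs (one from a vector query, one from a full-text query) and merges them with reciprocal rank fusion; the input lists here are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(vector_hits: list[str], keyword_hits: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk IDs into a single hybrid ranking.

    vector_hits / keyword_hits: chunk IDs ordered best-first, as returned by
    your database's vector and full-text queries (hypothetical inputs here).
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in (vector_hits, keyword_hits):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: blend semantic and lexical results into one ordered list of chunk IDs.
hybrid_ranking = reciprocal_rank_fusion(
    vector_hits=["chunk_12", "chunk_07", "chunk_33"],
    keyword_hits=["chunk_07", "chunk_45", "chunk_12"],
)
```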
4. Prompt Construction
After retrieving the most relevant chunks, you attach this context to the user’s query in a prompt template that tells the LLM how to use the information. This might include instructions on tone, level of detail, and how to handle references, as in the example below.
Example Prompt for Technical Documentation:
"You are a skilled technical assistant. Use the following document context to answer the question concisely and clearly. Focus on the most relevant information. Avoid redundancy, but provide a full explanation. Include references to figures or images if mentioned:
Context:
{Context from Hybrid-Search}
Question:
{Original user input}
"
Note: The model used for embeddings might differ from the one used for answer generation. For instance, you could use a smaller, specialized embedding model to handle vector creation and retrieval, but then switch to a more advanced, instruction-tuned LLM for producing clear, user-friendly answers. This approach lets you optimize each task with the most suitable model, often yielding better performance across the entire AI Knowledge Assistant pipeline.
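A minimal sketch of that split might be a small configuration object pairing the two models; both names below are examples, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Small, fast model used only for vectorizing queries and documents.
    embedding_model: str = "all-MiniLM-L6-v2"
    # Larger instruction-tuned model used only for producing the final answer.
    generation_model: str = "gpt-4o-mini"

config = ModelConfig()
```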
5. LLM Selection
Your choice of Large Language Model (LLM) greatly impacts the final answer’s quality and capabilities:
Cloud-Based vs. On-Premise: If your documents are highly confidential or regulated, you may prefer hosting the LLM locally to keep data on-premise. Otherwise, an online LLM might provide faster access to model updates or specialized features.
Resource Requirements: Running local LLMs can be GPU-intensive. You’ll need to factor in hardware budgets and maintenance. If you’re already using GPUs for OCR or image embeddings, consider the additional load for real-time inference.
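As one illustration of the on-premise route, many self-hosted model servers (for example vLLM or Ollama) expose an OpenAI-compatible endpoint, so switching between cloud and local inference can be as small as changing the client’s base URL; the URL below is a placeholder:

```python
from openai import OpenAI

# Cloud-hosted: credentials and endpoint come from the provider.
cloud_client = OpenAI()

# On-premise: point the same client at a local OpenAI-compatible server
# (placeholder URL; adjust to wherever your model server is running).
local_client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")
```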
By integrating query vectorization, retrieval, and an LLM effectively, your AI Knowledge Assistant can deliver context-aware, relevant answers to user queries while maintaining stringent security controls. The result is a dynamic, enterprise-ready chatbot that references—and stays true to—your organization’s own data.
*** Continue reading: Step by Step Guide to Building a PDF Knowledge Assistant