A knowledge base is a managed collection of documents and data sources that an agent can search during conversations. Knowledge bases in the Agent Platform serve as the primary retrieval layer for enterprise content, enabling agents to query information across heterogeneous data sources. Content can be ingested through file uploads, web crawling, and external connectors, and is indexed to support vector, keyword, and structured retrieval. At query time, the platform applies enterprise search and RAG pipelines to contextualize agent requests automatically: resolving intent, enriching queries, and retrieving relevant chunks. The retrieved context is then made available to agents and tools, allowing them to perform grounded reasoning, execute workflows, and generate accurate, context-aware responses.

How Knowledge Bases Work

A knowledge base processes content through an ingestion pipeline (index-time) and serves it via a retrieval pipeline (query-time).

Ingestion Process

The ingestion pipeline transforms raw content into retrieval-ready chunks with embeddings and metadata.
  1. Ingestion: Registers and tracks content for processing. For uploaded files, the platform checks for duplicates using content hashing; files that have already been indexed and whose content is unchanged are skipped. For connected data sources, the platform discovers and tracks all documents within the source.
  2. Extraction: The extraction stage pulls content from your source files. Depending on the content format, different extraction strategies are used.
| Format | Extraction approach |
| --- | --- |
| Plain text, Markdown | Used as-is |
| HTML | Tags stripped, structure preserved where meaningful |
| PDF | Text extracted from pages (layout-aware extraction for complex PDFs) |
| DOCX / Office formats | Document content extracted from the file structure |
| JSON | Values extracted and concatenated with structural context |
The extraction stage also captures document metadata (titles, headings, page numbers) that is used later for enrichment and search result context.
  3. Chunking: The chunking stage splits text into smaller segments that can be individually embedded and retrieved. The chunking strategy you choose directly affects search quality because it determines how coherent and complete each retrievable unit of content is. The platform supports three chunking strategies:
| Strategy | How it splits | Best for |
| --- | --- | --- |
| Fixed-size | Token windows of a configurable size with overlap between chunks | Predictable, uniform content where consistency matters |
| Semantic | Natural boundaries such as paragraphs, sections, and topic shifts | Narrative or long-form content where coherence is important |
| Sliding window | Overlapping windows that slide across the full text | Content where context continuity across boundaries is critical |
Chunk overlap ensures that context at chunk boundaries is not lost. A concept that spans two paragraphs will appear in both chunks when overlap is configured. Always configure overlap when your content spans section boundaries.
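As a minimal sketch of fixed-size chunking with overlap, the function below splits a token list into windows that share their boundary tokens. The function name, the parameters, and the use of whitespace splitting in place of a real tokenizer are assumptions made for illustration; they are not the platform's implementation.

```python
def chunk_fixed_size(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Whitespace splitting stands in for a real tokenizer (an assumption for this sketch).
tokens = "refunds are issued within ten days of an approved return request".split()
chunks = chunk_fixed_size(tokens, size=5, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

With an overlap of 2, the last two tokens of each chunk reappear as the first two tokens of the next one, so a phrase that straddles a boundary is still retrievable as a whole from at least one chunk.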
  4. Enrichment: The enrichment stage adds metadata to each chunk, improving search quality and enabling filtering. The following enrichments are applied:
| Enrichment | What it adds |
| --- | --- |
| Entity detection | Identifies emails, URLs, dates, and monetary values within the chunk |
| Summary generation | Creates a brief summary of the chunk’s content |
| Language detection | Identifies the language of the content |
| Knowledge graph extraction | Extracts entities and relationships to build a knowledge graph |
Enrichment metadata is stored alongside each chunk and can be used to filter and rank search results at query time.
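To make entity detection concrete, here is a rough sketch that tags a chunk with the four entity types listed above using regular expressions. The patterns and the `detect_entities` helper are hypothetical and deliberately simple; the platform's actual detectors are not documented here and are likely to handle many more formats.

```python
import re

# Simplified patterns for illustration only (assumptions, not the platform's rules).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://[^\s)]+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),          # ISO dates only
    "money": re.compile(r"[$€£]\d+(?:,\d{3})*(?:\.\d{2})?"),
}


def detect_entities(text: str) -> dict[str, list[str]]:
    """Return every match per entity type, ready to store as chunk metadata."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


chunk_text = (
    "Refunds over $1,250.00 requested after 2024-03-01 go to "
    "billing@example.com (see https://example.com/refunds)."
)
entities = detect_entities(chunk_text)
print(entities)
```

Metadata in this shape could then back a filter such as "only chunks that mention a monetary value".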
  5. Embedding Generation: The embedding stage converts each text chunk into a vector representation that captures its semantic meaning. These vectors enable semantic search, which finds content that is conceptually related to a query even when the exact words differ.
  6. Store: The final stage writes the processed chunks to the vector database for fast retrieval. Each stored chunk includes:
    • The original text content
    • The vector embedding
    • Enrichment metadata
    • Source information — document ID, source ID, and page number
    • Permission metadata for access control
Once a chunk is stored, it is immediately available for search.
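The duplicate check in the Ingestion step can be pictured as comparing a hash of each uploaded file against the hash recorded when that file was last indexed. The sketch below assumes SHA-256 and a simple name-to-hash lookup; the hash algorithm, the function names, and the data layout are illustrative assumptions rather than documented platform behavior.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash the raw file bytes; SHA-256 is an assumption for this sketch."""
    return hashlib.sha256(data).hexdigest()


def plan_ingestion(uploads: dict[str, bytes], indexed_hashes: dict[str, str]) -> dict[str, str]:
    """Decide per file whether to skip it (unchanged) or index it (new or changed)."""
    plan = {}
    for name, data in uploads.items():
        unchanged = indexed_hashes.get(name) == content_hash(data)
        plan[name] = "skip" if unchanged else "index"
    return plan


# Hypothetical state: faq.md was indexed earlier with this exact content.
indexed = {"faq.md": content_hash(b"How to request a refund")}
plan = plan_ingestion(
    {"faq.md": b"How to request a refund", "policy.md": b"Returns within 30 days"},
    indexed,
)
print(plan)
```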

Retrieval Process

When an agent needs information from a knowledge base, the platform executes a search query and returns the most relevant chunks as context for the agent to compose a response. The search pipeline runs in the following sequence:
  1. The agent sends the query to the runtime.
  2. Runtime embeds the query text into a vector representation.
  3. Hybrid search runs - semantic and keyword search execute simultaneously.
  4. Results from both strategies are merged.
  5. Merged results are re-ranked by relevance.
  6. Permission filters are applied to the re-ranked results.
  7. Filtered, ranked chunks are returned to the agent as context.
  8. The agent uses the context to compose a response.

Key Concepts

  1. Hybrid Search
The platform uses hybrid search by default, combining two complementary strategies to maximize retrieval quality.
    • Semantic Search: Semantic search finds chunks whose vector embeddings are most similar to the query embedding. This helps retrieve conceptually relevant content even when the user’s phrasing does not directly match the source text. Example: A query for “How do I get my money back?” can match a chunk about “Refund policy and procedures” even though none of the query words appear in the chunk heading. Use semantic search when users are likely to phrase queries in natural language that differs from the terminology used in the source content.
    • Keyword Search: Keyword search finds chunks containing the exact terms in the query. This helps retrieve content with specific names, product codes, technical terms, or identifiers that semantic search might miss due to low vector similarity. Use keyword search when the content contains precise terms, such as model numbers, error codes, or proper nouns, that must match exactly.
    • Hybrid Search: Hybrid search merges and re-ranks results from semantic and keyword search to produce a single ranked list of the most relevant chunks.
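The interplay of the semantic, keyword, and hybrid strategies can be sketched with toy data. The example below ranks chunks by cosine similarity and by term overlap, then fuses the two rankings with reciprocal rank fusion (RRF). RRF is one common fusion method chosen here for illustration; the source does not state which merging method the platform uses. The vectors are hand-written stand-ins for real embeddings, and all names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_rank(query_vec: list[float], chunk_vecs: dict[str, list[float]]) -> list[str]:
    """Order chunk IDs by similarity between the query vector and each chunk vector."""
    return sorted(chunk_vecs, key=lambda cid: cosine(query_vec, chunk_vecs[cid]), reverse=True)


def keyword_rank(query: str, chunk_texts: dict[str, str]) -> list[str]:
    """Order chunk IDs by the number of exact query terms they contain; drop non-matches."""
    terms = set(query.lower().split())
    scores = {cid: len(terms & set(text.lower().split())) for cid, text in chunk_texts.items()}
    return [cid for cid in sorted(scores, key=scores.get, reverse=True) if scores[cid] > 0]


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per chunk."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            fused[cid] = fused.get(cid, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


chunk_texts = {
    "refund": "Refund policy and procedures",
    "error": "Error code E42 means the printer is offline",
    "shipping": "Shipping times and carriers",
}
# Made-up 2-D vectors standing in for real embeddings.
chunk_vecs = {"refund": [0.9, 0.1], "error": [0.1, 0.9], "shipping": [0.6, 0.4]}

# Natural-language query: no shared words, so only the semantic ranking contributes.
natural = rrf_merge([
    semantic_rank([1.0, 0.0], chunk_vecs),
    keyword_rank("How do I get my money back?", chunk_texts),
])
# Identifier query with a vague vector: the exact keyword match lifts the right chunk.
identifier = rrf_merge([
    semantic_rank([0.5, 0.5], chunk_vecs),
    keyword_rank("E42", chunk_texts),
])
print(natural[0], identifier[0])
```

In the first query the keyword ranking is empty and the semantic ranking carries the result; in the second, the exact match on `E42` outweighs an unhelpful semantic ordering.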
  2. Re-Ranking
After the initial search, results are re-ranked to improve relevance ordering. Re-ranking evaluates the merged result set and adjusts the order based on relevance signals beyond initial similarity scores.
  3. Permission-Aware Search
Search results are filtered based on the requesting user’s permissions before they are returned to the agent. If your knowledge base is connected to a data source with access controls, such as SharePoint or other enterprise connectors, only documents the user has permission to see are included in the results.
Permission filtering happens after re-ranking. The agent only receives chunks the requesting user is authorized to access, regardless of their relevance score. This means result counts may vary between users querying the same knowledge base with the same query.
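A minimal sketch of post-ranking permission filtering follows. It assumes each chunk carries a set of groups allowed to read it and that a user is described by group membership; the `RankedChunk` type, the field names, and the group model are hypothetical and only illustrate why two users can receive different result counts for the same query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankedChunk:
    chunk_id: str
    score: float
    allowed_groups: frozenset  # hypothetical permission metadata stored with the chunk


def filter_by_permission(ranked: list[RankedChunk], user_groups: set[str]) -> list[RankedChunk]:
    """Keep the re-ranked order, dropping chunks the user is not allowed to see."""
    return [chunk for chunk in ranked if chunk.allowed_groups & user_groups]


# Already re-ranked results for one query (scores are made up).
ranked = [
    RankedChunk("salary-bands", 0.94, frozenset({"hr"})),
    RankedChunk("leave-policy", 0.81, frozenset({"hr", "all-staff"})),
    RankedChunk("holiday-calendar", 0.62, frozenset({"all-staff"})),
]

hr_view = filter_by_permission(ranked, {"hr", "all-staff"})
staff_view = filter_by_permission(ranked, {"all-staff"})
print(len(hr_view), len(staff_view))
```

The highest-scoring chunk is dropped for the second user even though it ranks first, which matches the rule that authorization takes precedence over relevance.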

Adding a Knowledge Base

Knowledge bases are created and managed at the project level and are available to all agents within that project. Every knowledge base in the project is automatically exposed as a tool that any agent can attach to and use. To create a knowledge base:
  • Navigate to Knowledge Bases in the left sidebar.
  • Click + New Knowledge Base.
  • Enter a name and description.
    • Name: used to identify the knowledge base across the project.
    • Description: helps agents identify when to use this knowledge base as a tool. Write this clearly and specifically.
  • Click Create.
The knowledge base is created with default configurations applied. You can review and update these configurations at any time from the knowledge base dashboard.

Adding Data Sources

A knowledge base can ingest content from multiple source types simultaneously. Each source is managed independently, and you can add, sync, and remove sources without affecting others. When a data source is added to the knowledge base, the platform immediately begins processing it in the background. No manual steps are required to trigger ingestion or indexing. To add a data source:
  • Go to the Data tab and click + Add Data Source under the Sources tab.
  • Select the source connector and click Connect.
  • Provide configuration details for the source.

File Uploads

Upload files directly from your local system. Uploaded files are ingested immediately and processed through the intelligence pipeline. If the source file changes, you need to re-upload it.
  • Supported file types: PDF, DOCX, TXT, CSV, MD, XLSX, JSON, HTML.
  • Max file size: 100 MB per file

View Ingested Content

To view the ingested content:
  • Go to the Documents page under the Data tab.
  • The ingested documents are listed. Click on any document to view the extracted content and the related metadata.

View Extracted Chunks

  • Click on a document to see the chunks extracted from the document.
  • See token distribution across chunks from the document.

Apply Intelligence Rules

Configure Content Pipeline

By default, the platform automatically creates a pipeline that processes ingested content and makes it searchable.
Default Pipeline
The default pipeline consists of the following stages. You can reconfigure any stage as required. Alternatively, you can add a new flow to the pipeline for custom configurations.
  1. Extraction: Extracts raw text and structure from ingested files and connected sources. The extraction strategy varies by file type; for example, tables in spreadsheets are extracted differently from paragraphs in a PDF. Configure extraction settings to control how the platform interprets your content structure.
  2. Chunking: Splits extracted content into smaller segments for indexing. Chunk size and overlap affect search quality; smaller chunks return more precise results while larger chunks retain more surrounding context. Configure the chunking strategy based on the nature of your content and how users are likely to query it.
  3. Content Intelligence: Enriches chunks with additional metadata, such as topic tags, summaries, or classifications, to improve search relevance. Configure what enrichment is applied to your content at this stage.
  4. Visual Analysis: Processes visual content in ingested files, such as images, diagrams, and charts, and extracts text or descriptions for inclusion in the index. Enable this stage when your content contains visual elements that carry meaningful information.
  5. Embedding Fields: Defines which fields from the extracted and enriched content are included in the embedding. Controlling embedding fields lets you tune what the vector search considers when matching a query to a chunk.
  6. Embedding: Converts the selected fields into vector representations using the configured embedding model. The choice of embedding model affects the quality of semantic search results. Configure the model here based on your language and domain requirements.
  7. Opensearch: At this stage, the processed and embedded chunks are written to the index using their embeddings. You can run test queries to see what is currently indexed and verify search quality as content is added.
The pipeline runs in sequence. A change to an upstream stage, such as chunking, affects all downstream stages. Re-indexing is required when pipeline configuration changes are saved.
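Because the stages run in order, the effect of a configuration change can be reasoned about as "the changed stage plus everything after it". The short sketch below encodes that rule using the stage names of the default pipeline; the `stages_to_rerun` helper is hypothetical and exists only to illustrate the dependency, not to describe a platform API.

```python
# Stage names of the default pipeline, in execution order.
PIPELINE = [
    "Extraction",
    "Chunking",
    "Content Intelligence",
    "Visual Analysis",
    "Embedding Fields",
    "Embedding",
    "Opensearch",
]


def stages_to_rerun(changed_stage: str) -> list[str]:
    """Return the changed stage and every downstream stage that consumes its output."""
    return PIPELINE[PIPELINE.index(changed_stage):]


print(stages_to_rerun("Chunking"))
```

A change to chunking therefore invalidates enrichment, embeddings, and the index, which is why saved pipeline changes require re-indexing.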

Configure Fields

Configure Vocabulary

Vocabulary lets you define aliases for terms that appear in your knowledge base content. When a user queries using an alias, the platform expands the query at runtime to include both the alias and the canonical term — improving search coverage without requiring the user to know the exact terminology used in the source documents. Configure vocabulary when:
  • Your source documents use formal or technical terminology that users are unlikely to search for exactly
  • Your content uses abbreviations or acronyms that users may spell out — or vice versa
  • Different teams or user groups use different terms for the same concept
  • Search queries consistently miss relevant content despite the content existing in the knowledge base
Add Vocabulary
To add a vocabulary entry, click Add Entry on the Vocabulary page and provide the following details.
| Field | Description |
| --- | --- |
| Term | The canonical term as it appears in the document. This is the authoritative form the platform uses for matching. Example: priority, status, assignee |
| Aliases | Alternative terms a user might search for, entered as comma-separated values. Maximum 10 aliases per term. Example: pri, urgency, importance |
| Description | A brief explanation of what the term represents. Used for internal reference. |
| Field Reference | The schema field this term is associated with. Select from the fields detected during schema extraction. Scopes the vocabulary mapping to a specific field in the index. |
| Capabilities | Define how the term and its aliases are used in the results. |
| Display With | Specify additional fields to show alongside this term in the results. |
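Query expansion with vocabulary can be sketched as a lookup from alias to canonical term. The example below reuses the `priority` term and its aliases from the table; the `expand_query` function, the dictionary layout, and the word-by-word matching are assumptions for illustration, not the platform's expansion logic.

```python
# Canonical term -> aliases, using the example values from the table above.
VOCABULARY = {"priority": ["pri", "urgency", "importance"]}


def expand_query(query: str, vocabulary: dict[str, list[str]]) -> list[str]:
    """Return the query terms, adding the canonical term next to any alias found."""
    alias_to_term = {
        alias.lower(): term
        for term, aliases in vocabulary.items()
        for alias in aliases
    }
    expanded: list[str] = []
    for word in query.lower().split():
        expanded.append(word)
        term = alias_to_term.get(word)
        if term and term not in expanded:
            expanded.append(term)  # keep the alias and add the canonical term
    return expanded


expanded_terms = expand_query("high urgency tickets", VOCABULARY)
print(expanded_terms)
```

The expanded query keeps the alias and adds the canonical term, so chunks indexed under `priority` can match even though the user typed `urgency`.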

Configure LLMs

By default, a knowledge base automatically selects the most suitable LLM configured in the platform for each feature. For embedding generation, it uses the BGE-M3 model. Use this section to configure the models behind AI-powered document processing and enrichment features: the embedding model, the models used for knowledge graph extraction or vocabulary generation, and the models used for advanced features such as vision analysis.

Search and Test

The Search and Test tab lets you validate the knowledge base configuration by running queries against indexed content before connecting it to an agent. It shows exactly how the knowledge base will respond to a query, including which chunks are returned, how they are ranked, and how each stage of the search pipeline performed.
Query Settings
Enter a query in the Query field and configure the following settings before running a search:
Query Type controls how the search matches the query to indexed content. Three options are available:
  • Hybrid - combines semantic search and keyword search. Returns results that are both conceptually relevant and lexically matched. Recommended for most use cases.
  • Semantic - finds conceptually relevant content based on meaning, even when the exact words do not appear in the source.
  • Keyword - finds exact or near-exact matches. Best for structured content where precise term matching is important.
Top K defines the maximum number of chunks returned per query. Set this based on how much context the agent needs to compose an accurate response. A higher value returns more context but increases token consumption when the agent processes the results.
Resolve Mode controls how results are resolved and ranked when multiple chunks from the same or different sources are returned.
Preprocessing, when enabled, applies transformations to the query before execution, including spell correction, synonym expansion, and entity recognition. Enable this when your users are likely to use natural language queries with varied phrasing or common misspellings.
Running a Test Query
Click Search to execute the query. The platform returns:
  • Search retrieval time - total time taken and the time per search engine.
  • Results - ranked list of matching chunks showing relevance score, source file, content preview, and metadata fields.
  • Resolution Chain Debug - a step-by-step breakdown of how the query was processed.
The Diagnostics panel on the right provides a real-time view of the knowledge base’s health across three areas:
  • Data and Indexing - Shows the current state of indexed content, document count, chunk count, and the last indexed timestamp.
  • Enrichment - Shows the configuration status of vocabulary and field mappings.
  • Pipeline Health - Shows the embedding model in use, the current pipeline status, and whether any index errors have occurred.
The Query History section shows recent queries run against the knowledge base, along with the number of results returned, query type, and time taken.

Use Knowledge Base as a Tool in Agents

Every knowledge base in the project automatically has a corresponding knowledge tool. Agents access the knowledge base content by attaching this tool. Once attached, the agent can invoke it at runtime to search the knowledge base and retrieve relevant content. To attach a knowledge tool from the agent configuration:
  1. Navigate to the agent and open the Tools section.
  2. Click Attach Tool.
  3. From the list of available tools, locate the knowledge tool for your knowledge base — knowledge tools are listed alongside all other project tools.
  4. Select the tool to attach it to the agent.
Attaching a Knowledge Tool via DSL
To attach a knowledge tool using the DSL editor:
  1. Navigate to Knowledge Tools at the project level.
  2. Open the knowledge tool that corresponds to your knowledge base.
  3. Copy the DSL definition shown on the tool page.
  4. Open the agent’s DSL editor by clicking DSL at the top of the agent page.
  5. Paste the copied DSL into the Tools section of the agent definition.
  6. Compile and save the changes.