System Design Overview
Our offline RAG system is a desktop application built with Python, with Tkinter providing the user interface. It stores document embeddings in a local vector database (Chroma), computes those embeddings with a Hugging Face transformer (via a custom embedding module), and generates responses with a Llama-based language model. Documents (PDF, Markdown, Word) are ingested via a dedicated ingestion script.
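To make the UI layer concrete, here is a minimal sketch of a Tkinter shell of the kind described above. The `answer_query` stub, widget layout, and window title are illustrative assumptions, not the actual main.py implementation.

```python
import tkinter as tk
from tkinter import scrolledtext


def answer_query(question: str) -> str:
    # Placeholder for the RAG pipeline (retrieval + LLM generation).
    return f"(stub) You asked: {question}"


def on_submit() -> None:
    question = entry.get().strip()
    if not question:
        return
    output.insert(tk.END, f"Q: {question}\nA: {answer_query(question)}\n\n")
    entry.delete(0, tk.END)


root = tk.Tk()
root.title("Offline RAG Assistant")

entry = tk.Entry(root, width=80)
entry.pack(padx=8, pady=4)

tk.Button(root, text="Ask", command=on_submit).pack(pady=4)

output = scrolledtext.ScrolledText(root, width=100, height=30)
output.pack(padx=8, pady=8)

root.mainloop()
```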
High-Level Architecture Diagram
The diagram below outlines the key components and their interactions:
"] EMB["Embedding Model
multilingual-e5-small"] VDB["Vector Database
Chroma"] end subgraph Data["Document Processing"] LOAD["Document Loaders
(PDF, Markdown, Word)"] SPLIT["Text Splitter"] end User -->|"Query"| LLM LLM -->|"Response"| User User -->|"Load Documents"| LOAD LOAD -->|"Raw Documents"| SPLIT SPLIT -->|"Text Chunks"| VDB VDB <-->|"Embeddings"| EMB LLM -->|"Similarity Search"| VDB
Explanation: The user interface sends queries to the LLM component, which runs a similarity search against the Chroma vector database to retrieve relevant chunks; the embedding model supplies the vectors used for both indexing and search. On the ingestion side, the document loaders and text splitter populate the database with text chunks. The LLM then generates a response from the query and the retrieved context.
Sequence Diagram: Query Processing Flow
The following sequence diagram illustrates the processing of a user query:
Class Design
The following class diagram shows the design of our system:
Component Details
- User Interface (Tkinter): Provides the primary interaction point with features like dynamic font resizing, file/folder browsing, and conversation export. (main.py)
- Embedding Module: Computes embeddings using Hugging Face’s transformer model (multilingual-e5-small). (embedding.py)
- Vector Database (Chroma): Stores and retrieves document embeddings. (populate_database.py)
- Document Ingestion: Processes PDF, Markdown, and Word documents and splits them into chunks for indexing (see the ingestion sketch after this list). (populate_database.py)
- LLM Generation: Generates natural language responses with the Qwen2.5-1.5b model, run locally through a Llama-based inference backend. (main.py)
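As a rough illustration of how the embedding, vector database, and ingestion pieces fit together, the sketch below indexes text chunks into Chroma using multilingual-e5-small. The library choices (chromadb, sentence-transformers), the naive splitter, and all paths and names are assumptions; populate_database.py and embedding.py may do this differently.

```python
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("intfloat/multilingual-e5-small")
client = chromadb.PersistentClient(path="./chroma_db")  # hypothetical path
collection = client.get_or_create_collection("documents")


def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size splitter standing in for the real text splitter."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


def ingest(doc_id: str, text: str) -> None:
    chunks = split_text(text)
    # e5-family models expect a "passage: " prefix on documents being indexed.
    embeddings = embedder.encode([f"passage: {c}" for c in chunks]).tolist()
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
    )


ingest("example.md", "Some Markdown content extracted by a document loader...")
```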
Integration & Data Flow
All components are designed to operate offline. The data flow starts with document ingestion and embedding creation, followed by storage in the Chroma vector database. When a query is processed, the system retrieves relevant documents using semantic search, and the LLM generates a response based on the retrieved context.
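A minimal sketch of that query path follows, assuming the collection and embedding model from the ingestion example and a GGUF build of the model served with llama-cpp-python. The model path, prompt format, and parameter values are hypothetical.

```python
import chromadb
from sentence_transformers import SentenceTransformer
from llama_cpp import Llama

embedder = SentenceTransformer("intfloat/multilingual-e5-small")
collection = chromadb.PersistentClient(path="./chroma_db").get_or_create_collection("documents")
llm = Llama(model_path="models/qwen2.5-1.5b-instruct.gguf", n_ctx=4096)  # hypothetical path


def answer(question: str, k: int = 4) -> str:
    # e5-family models expect a "query: " prefix on search queries.
    query_embedding = embedder.encode(f"query: {question}").tolist()
    hits = collection.query(query_embeddings=[query_embedding], n_results=k)
    context = "\n\n".join(hits["documents"][0])
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    result = llm(prompt, max_tokens=256)
    return result["choices"][0]["text"].strip()


print(answer("What does the ingestion script do?"))
```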
Final Notes
The system is modular, so each component can be updated or replaced independently. Ingestion and embedding run in batches to keep larger document sets manageable, and resource management adapts to GPU/MPS environments. The user interface provides interactive feedback throughout.
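The GPU/MPS handling mentioned above could look roughly like the following device-selection helper, assuming a PyTorch-backed embedding model; the actual checks in embedding.py may differ.

```python
import torch


def pick_device() -> str:
    """Prefer CUDA, then Apple MPS, then CPU (assumed policy)."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


# Embedding in batches keeps memory bounded on small GPUs and Apple silicon, e.g.:
# embedder = SentenceTransformer("intfloat/multilingual-e5-small", device=pick_device())
# embeddings = embedder.encode(chunks, batch_size=32)
print(f"Embedding model will run on: {pick_device()}")
```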