Implementation: Phase 1
This page describes the implementation of key features in the project. The application integrates document processing, a vector database for semantic search, a chat-based user interface, and robust resource management. Each section below explains one of the core features, along with code snippets where appropriate.
1. Document Processing and Database Connection
Our system supports processing various file formats (PDF, Markdown, Word) by leveraging the following libraries:
- langchain_community.document_loaders: Loads documents in multiple formats.
- langchain_text_splitters: Splits documents into manageable chunks.
- langchain_community.vectorstores.Chroma: Acts as the vector database for storing document embeddings.
The code (in populate_database.py and embedding.py) performs the following:
- Loads a file or directory of files.
- Splits loaded documents into text chunks with an overlap to maintain context.
- Calculates unique chunk IDs based on source, page, and chunk index.
- Adds only new document chunks to the Chroma database, ensuring duplicates are avoided.
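As a concrete illustration of the loading and splitting steps, here is a minimal sketch assuming the libraries listed above; the directory path, chunk size, and overlap values are placeholders rather than the project's actual settings:

from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def load_and_split(data_path="data"):
    # Load every PDF in the directory (other formats use their own loaders)
    documents = PyPDFDirectoryLoader(data_path).load()
    # Split into overlapping chunks so context carries across boundaries
    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=80)
    return splitter.split_documents(documents)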
The following snippet shows how chunk IDs are calculated:
def calculate_chunk_ids(chunks):
    last_page_id = None
    current_chunk_index = 0
    for chunk in chunks:
        # Build a page-level ID from the source file and page number
        source = chunk.metadata.get("source")
        page = chunk.metadata.get("page")
        current_page_id = f"{source}:{page}"
        # Increment the index for consecutive chunks from the same page
        if current_page_id == last_page_id:
            current_chunk_index += 1
        else:
            current_chunk_index = 0
        # Final ID has the form source:page:chunk_index
        chunk.metadata["id"] = f"{current_page_id}:{current_chunk_index}"
        last_page_id = current_page_id
    return chunks
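Using these IDs, only unseen chunks are written to the store. Below is a sketch of the deduplication step, assuming the Chroma wrapper's get() and add_documents() methods; the function and variable names are illustrative:

def add_new_chunks(db, chunks):
    # IDs already present in the Chroma collection
    existing_ids = set(db.get(include=[])["ids"])
    # Keep only chunks whose calculated ID is not stored yet
    new_chunks = [c for c in calculate_chunk_ids(chunks)
                  if c.metadata["id"] not in existing_ids]
    if new_chunks:
        db.add_documents(new_chunks, ids=[c.metadata["id"] for c in new_chunks])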
2. Chat UI and Interaction
The user interface is built with Tkinter, Python's standard GUI toolkit. The main file (main.py) is responsible for:
- Displaying a chat window where users can interact with the AI assistant.
- Allowing users to input queries and view conversation history.
- Providing buttons for actions like loading files, exporting conversations, and clearing the database.
- Adjusting font sizes dynamically to enhance readability and accessibility.
The UI uses a grid layout to ensure responsiveness. For example, the chat display area is initialized as follows:
import tkinter as tk
from tkinter import scrolledtext

# Initialize main application window
root = tk.Tk()
root.title("AI RAG Assistant")
root.geometry("800x600")
# Let the chat row and both columns expand with the window
root.grid_rowconfigure(1, weight=1)
for col in (0, 1):
    root.grid_columnconfigure(col, weight=1)
# Chat display area
output_box = scrolledtext.ScrolledText(root, wrap=tk.WORD)
output_box.grid(row=1, column=0, columnspan=2, padx=10, pady=10, sticky="nsew")
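The dynamic font sizing mentioned above can be implemented with a shared font object that the chat widget references; a minimal sketch (the helper name, step size, and lower bound are illustrative):

import tkinter.font as tkfont

chat_font = tkfont.Font(family="TkDefaultFont", size=11)
output_box.configure(font=chat_font)

def adjust_font_size(delta):
    # Widgets using chat_font update automatically when it is reconfigured
    chat_font.configure(size=max(8, chat_font.cget("size") + delta))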
3. Semantic Search and Response Generation
Our application implements semantic search by integrating a vector database (Chroma) with an embedding function. Key steps include:
- Embedding Generation: Utilizes a pre-trained model (e.g., multilingual-e5-small) to convert text into high-dimensional vectors.
- Similarity Search: Searches for similar document chunks in the vector database to provide context for user queries.
- Content Filtering: Applies exclusion rules to filter out unwanted content based on user-specified terms.
- Response Generation: Uses a language model (via the Transformers library) to generate responses, incorporating the retrieved context.
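The embedding function itself is not shown in the retrieval snippet below; here is a minimal sketch of what embedding_function() might return, assuming the HuggingFace embeddings wrapper and the multilingual-e5-small model named above:

from langchain_community.embeddings import HuggingFaceEmbeddings

def embedding_function():
    # Wrap the pre-trained sentence-embedding model; the exact model id
    # is an assumption based on the model name mentioned above
    return HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-small")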
Below is a code snippet showing how similar documents are retrieved from the database:
def retrieve_similar_documents(query, top_k=5):
    # Create vector database connection
    db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function())
    # Retrieve extra documents for filtering
    results = db.similarity_search(query, k=top_k * 2)
    # Filter results based on exclusion criteria
    filtered_results = []
    for result in results:
        if all(ex.lower() not in result.page_content.lower() for ex in do_not_include_items):
            filtered_results.append(result)
        if len(filtered_results) >= top_k:
            break
    return [result.page_content for result in filtered_results[:top_k]]
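Response generation then assembles the retrieved chunks into a prompt for the language model. Below is a minimal sketch using the Transformers pipeline API; the model name, prompt format, and generation settings are placeholders, not the project's actual choices:

from transformers import pipeline

# Placeholder model; the project loads its own model and tokenizer elsewhere
generator = pipeline("text-generation", model="gpt2")

def generate_response(query, top_k=5):
    # Assemble the retrieved context into a single prompt
    context = "\n\n".join(retrieve_similar_documents(query, top_k=top_k))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    output = generator(prompt, max_new_tokens=200, do_sample=False)
    return output[0]["generated_text"]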
4. Resource Management and Cleanup
Effective resource management is essential for stability and performance. The implementation ensures:
- Proper termination of database connections.
- Releasing resources allocated to the language model and embedding functions.
- Clearing GPU or system memory using tools like torch.cuda.empty_cache() or garbage collection.
The following function demonstrates the cleanup process executed when the application is closed:
def on_closing():
    try:
        # Close the vector database connection
        # (note: this relies on Chroma's private client API)
        db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function())
        db._client._system.stop()
        # Release LLM resources so they can be garbage-collected
        global model, tokenizer, generator
        model = None
        tokenizer = None
        generator = None
        # Clear GPU memory if available
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except Exception as e:
        print(f"Cleanup error: {e}")
    root.destroy()
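For this cleanup to run, the handler has to be registered as the window-close callback, presumably along these lines:

# Invoke on_closing when the user closes the window
root.protocol("WM_DELETE_WINDOW", on_closing)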
5. Additional Features
Other important features include:
- File and Folder Loading: Users can index individual files or entire directories. The system supports multiple formats and integrates the loading functionality into the vector database.
- Conversation Export: Allows users to export the chat history to a text file for future reference (see the sketch after this list).
- Content Exclusion Management: Provides an interface for users to add or remove exclusion terms, thus refining search results.
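A minimal sketch of the conversation export, assuming the transcript lives in the ScrolledText widget shown earlier; the function name and file path are illustrative:

def export_conversation(path="conversation.txt"):
    # Write the full transcript from the chat display to a text file
    with open(path, "w", encoding="utf-8") as f:
        f.write(output_box.get("1.0", tk.END))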
These features enhance the usability of the application by offering a flexible and interactive interface.
Conclusion
Our project integrates various modern tools and libraries to create a robust AI-assisted retrieval system. With document processing, semantic search capabilities, and an interactive chat UI, the system offers a comprehensive solution for managing and querying large sets of documents. The detailed implementation provided here should serve as a useful reference for understanding and extending the system.