Implementation: Phase 1
This page describes the implementation of key features in the project. The application integrates document processing, a vector database for semantic search, a chat-based user interface, and robust resource management. Each section below explains one of the core features, along with code snippets where appropriate.
1. Document Processing and Database Connection
Our system supports processing various file formats (PDF, Markdown, Word) by leveraging the following libraries:
- langchain_community.document_loaders: Loads documents in multiple formats.
- langchain_text_splitters: Splits documents into manageable chunks.
- langchain_community.vectorstores.Chroma: Acts as the vector database for storing document embeddings.
The code (in populate_database.py and embedding.py) performs the following:
- Loads a file or directory of files.
- Splits loaded documents into text chunks with an overlap to maintain context.
- Calculates unique chunk IDs based on source, page, and chunk index.
- Adds only new document chunks to the Chroma database, ensuring duplicates are avoided.
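As a concrete illustration of the loading and splitting steps, here is a minimal sketch assuming the libraries listed above; the directory path, chunk size, and overlap values are placeholders rather than the project's actual settings:

from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def load_and_split(data_path="data"):
    # Load every PDF in the directory (other formats use their own loaders)
    documents = PyPDFDirectoryLoader(data_path).load()
    # Split into overlapping chunks so context carries across boundaries
    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=80)
    return splitter.split_documents(documents)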
The following snippet shows how chunk IDs are calculated:
def calculate_chunk_ids(chunks):
    last_page_id = None
    current_chunk_index = 0
    for chunk in chunks:
        # Build a page-level ID from the source file and page number
        source = chunk.metadata.get("source")
        page = chunk.metadata.get("page")
        current_page_id = f"{source}:{page}"
        # Increment the index for consecutive chunks from the same page
        if current_page_id == last_page_id:
            current_chunk_index += 1
        else:
            current_chunk_index = 0
        # Final ID has the form source:page:chunk_index
        chunk.metadata["id"] = f"{current_page_id}:{current_chunk_index}"
        last_page_id = current_page_id
    return chunks
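Using these IDs, only unseen chunks are written to the store. Below is a sketch of the deduplication step, assuming the Chroma wrapper's get() and add_documents() methods; the function and variable names are illustrative:

def add_new_chunks(db, chunks):
    # IDs already present in the Chroma collection
    existing_ids = set(db.get(include=[])["ids"])
    # Keep only chunks whose calculated ID is not stored yet
    new_chunks = [c for c in calculate_chunk_ids(chunks)
                  if c.metadata["id"] not in existing_ids]
    if new_chunks:
        db.add_documents(new_chunks, ids=[c.metadata["id"] for c in new_chunks])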
2. Chat UI and Interaction
The user interface is built with Tkinter, Python's standard GUI toolkit. The main file (main.py) is responsible for:
- Displaying a chat window where users can interact with the AI assistant.
- Allowing users to input queries and view conversation history.
- Providing buttons for actions like loading files, exporting conversations, and clearing the database.
- Adjusting font sizes dynamically to enhance readability and accessibility.
The UI uses a grid layout to ensure responsiveness. For example, the chat display area is initialized as follows:
import tkinter as tk
from tkinter import scrolledtext

# Initialize main application window
root = tk.Tk()
root.title("AI RAG Assistant")
root.geometry("800x600")
# Let the chat row and both columns expand with the window
root.grid_rowconfigure(1, weight=1)
for col in (0, 1):
    root.grid_columnconfigure(col, weight=1)
# Chat display area
output_box = scrolledtext.ScrolledText(root, wrap=tk.WORD)
output_box.grid(row=1, column=0, columnspan=2, padx=10, pady=10, sticky="nsew")
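The dynamic font sizing mentioned above can be implemented with a shared font object that the chat widget references; a minimal sketch (the helper name, step size, and lower bound are illustrative):

import tkinter.font as tkfont

chat_font = tkfont.Font(family="TkDefaultFont", size=11)
output_box.configure(font=chat_font)

def adjust_font_size(delta):
    # Widgets using chat_font update automatically when it is reconfigured
    chat_font.configure(size=max(8, chat_font.cget("size") + delta))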
3. Semantic Search and Response Generation
Our application implements semantic search by integrating a vector database (Chroma) with an embedding function. Key steps include:
- Embedding Generation: Utilizes a pre-trained model (e.g., multilingual-e5-small) to convert text into high-dimensional vectors.
- Similarity Search: Searches for similar document chunks in the vector database to provide context for user queries.
- Content Filtering: Applies exclusion rules to filter out unwanted content based on user-specified terms.
- Response Generation: Uses a language model (via the Transformers library) to generate responses, incorporating the retrieved context.
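The embedding function itself is not shown in the retrieval snippet below; here is a minimal sketch of what embedding_function() might return, assuming the HuggingFace embeddings wrapper and the multilingual-e5-small model named above:

from langchain_community.embeddings import HuggingFaceEmbeddings

def embedding_function():
    # Wrap the pre-trained sentence-embedding model; the exact model id
    # is an assumption based on the model name mentioned above
    return HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-small")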
Below is a code snippet showing how similar documents are retrieved from the database:
def retrieve_similar_documents(query, top_k=5):
    # Create vector database connection
    db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function())
    # Retrieve extra documents for filtering
    results = db.similarity_search(query, k=top_k * 2)
    # Filter results based on exclusion criteria
    filtered_results = []
    for result in results:
        if all(ex.lower() not in result.page_content.lower() for ex in do_not_include_items):
            filtered_results.append(result)
        if len(filtered_results) >= top_k:
            break
    return [result.page_content for result in filtered_results[:top_k]]
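Response generation then assembles the retrieved chunks into a prompt for the language model. Below is a minimal sketch using the Transformers pipeline API; the model name, prompt format, and generation settings are placeholders, not the project's actual choices:

from transformers import pipeline

# Placeholder model; the project loads its own model and tokenizer elsewhere
generator = pipeline("text-generation", model="gpt2")

def generate_response(query, top_k=5):
    # Assemble the retrieved context into a single prompt
    context = "\n\n".join(retrieve_similar_documents(query, top_k=top_k))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    output = generator(prompt, max_new_tokens=200, do_sample=False)
    return output[0]["generated_text"]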
4. Resource Management and Cleanup
Effective resource management is essential for stability and performance. The implementation ensures:
- Proper termination of database connections.
- Releasing resources allocated to the language model and embedding functions.
- Clearing GPU or system memory using tools like torch.cuda.empty_cache() or garbage collection.
The following function demonstrates the cleanup process executed when the application is closed:
def on_closing():
    try:
        # Close the vector database connection
        # (note: this relies on Chroma's private client API)
        db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function())
        db._client._system.stop()
        # Release LLM resources so they can be garbage-collected
        global model, tokenizer, generator
        model = None
        tokenizer = None
        generator = None
        # Clear GPU memory if available
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except Exception as e:
        print(f"Cleanup error: {e}")
    root.destroy()
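For this cleanup to run, the handler has to be registered as the window-close callback, presumably along these lines:

# Invoke on_closing when the user closes the window
root.protocol("WM_DELETE_WINDOW", on_closing)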
5. Additional Features
Other important features include:
- File and Folder Loading: Users can index individual files or entire directories. The system supports multiple formats and integrates the loading functionality into the vector database.
- Conversation Export: Allows users to export the chat history to a text file for future reference (see the sketch after this list).
- Content Exclusion Management: Provides an interface for users to add or remove exclusion terms, thus refining search results.
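A minimal sketch of the conversation export, assuming the transcript lives in the ScrolledText widget shown earlier; the function name and file path are illustrative:

def export_conversation(path="conversation.txt"):
    # Write the full transcript from the chat display to a text file
    with open(path, "w", encoding="utf-8") as f:
        f.write(output_box.get("1.0", tk.END))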
These features enhance the usability of the application by offering a flexible and interactive interface.
Conclusion
Our project integrates various modern tools and libraries to create a robust AI-assisted retrieval system. With document processing, semantic search capabilities, and an interactive chat UI, the system offers a comprehensive solution for managing and querying large sets of documents. The detailed implementation provided here should serve as a useful reference for understanding and extending the system.