Research

For our project, our clients first instructed us to build a basic RAG application and then tailor it to three use cases: RFQs, Documentation Changes, and Telemetry Data. These use cases are distinct applications that share a single interface, allowing a user to switch between them; accordingly, we consulted multiple prior projects and used different technologies for each. Reviews of each area follow.

RFQ

Technology Review

  1. Handling Table Extraction from PDFs

    Extracting tables accurately is one of the biggest challenges in implementing RAG for RFQs. We observed a significant drop in retrieval accuracy when RFQ documents contained tables, as incorrect extraction led to fragmented or missing data. Through our research, we evaluated several table extraction techniques:

    • Initial Approach: We first attempted using pdfplumber, a robust Python library for extracting text and tables. However, it failed to preserve the table structure correctly, often producing null values and misaligned data.
    • Alternative Approach: We then tested Camelot, a PDF processing library specifically designed for table extraction. Unlike pdfplumber, Camelot preserved table formatting more accurately, reducing errors.
    • Final Decision: Based on our testing, we used:
      • Camelot for extracting tabular data.
      • pdfplumber for extracting textual content.

    This dual-extraction approach ensured that both structured (tables) and unstructured (text) information from RFQs were properly indexed and retrievable.
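
    A minimal sketch of this dual-extraction step follows; the file name and the Camelot flavor ("lattice", for tables with ruled lines) are illustrative rather than our exact settings:

      import camelot
      import pdfplumber

      PDF_PATH = "rfq.pdf"  # hypothetical input file

      # Tables: Camelot returns one pandas DataFrame per detected table.
      tables = camelot.read_pdf(PDF_PATH, pages="all", flavor="lattice")
      table_chunks = [t.df.to_csv(index=False) for t in tables]

      # Text: pdfplumber extracts the running prose page by page.
      with pdfplumber.open(PDF_PATH) as pdf:
          text_chunks = [page.extract_text() or "" for page in pdf.pages]

      documents = table_chunks + text_chunks  # both streams feed the same index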

  2. Vector Database Selection

    To store and retrieve embeddings efficiently, we evaluated different vector databases and selected ChromaDB due to its:

    • Ease of use: Simple integration with Python and LangChain.
    • Persistence capability: Enables storing embeddings for long-term use.
    • Scalability: Efficient retrieval even as the dataset grows.

    Other options, such as FAISS, were considered, but ChromaDB's built-in document management and flexibility made it the preferred choice.
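
    As a small sketch of this choice in practice (the path and collection names are illustrative), ChromaDB's persistent client writes the index to disk so embeddings survive restarts:

      import chromadb

      client = chromadb.PersistentClient(path="chroma_db")  # on-disk persistence
      collection = client.get_or_create_collection("rfq_chunks")

      # The documents are the extracted table and text chunks; Chroma embeds
      # them with a built-in default model unless an embedding function is given.
      collection.add(ids=["chunk-0", "chunk-1"], documents=["...", "..."])

      results = collection.query(query_texts=["payment terms"], n_results=3)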

  3. Text Splitting Strategy

    Splitting the extracted data into meaningful chunks is critical for accurate retrieval. We tested multiple text splitters and chose LangChain's RecursiveCharacterTextSplitter (configured as sketched after this list) due to:

    • Hierarchical splitting approach: It breaks text into chunks while maintaining logical structure.
    • Flexibility: Works well with both short and long documents.
    • Better retrieval performance: Compared to fixed-size splitters, it ensures semantically meaningful chunks are stored in the vector database.
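
    A minimal configuration sketch; the chunk size, overlap, and separator list are illustrative values rather than our tuned settings:

      from langchain.text_splitter import RecursiveCharacterTextSplitter

      splitter = RecursiveCharacterTextSplitter(
          chunk_size=1000,   # illustrative; tuned to the document set
          chunk_overlap=150,
          separators=["\n\n", "\n", ". ", " ", ""],  # paragraph breaks first
      )
      chunks = splitter.split_text(raw_text)  # raw_text from the extraction step
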
  4. Embedding Model Selection

    For generating vector embeddings, we chose all-MiniLM-L6-v2 (loaded as shown after this list) because:

    • Lightweight and efficient: Faster computation while maintaining high retrieval accuracy.
    • Optimized for sentence embeddings: Ensures better contextual understanding.
    • Well-integrated with LangChain and ChromaDB.
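
    A minimal loading sketch; the import path varies across LangChain versions, so treat this as one workable option:

      from langchain_community.embeddings import HuggingFaceEmbeddings

      embedder = HuggingFaceEmbeddings(
          model_name="sentence-transformers/all-MiniLM-L6-v2"
      )
      vector = embedder.embed_query("What are the delivery terms?")
      # all-MiniLM-L6-v2 produces compact 384-dimensional sentence embeddings.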

Technical Choices:

  • Table Extraction: Used Camelot for extracting tables and pdfplumber for extracting text.
  • Vector Database: ChromaDB (for its persistence, ease of use, and scalability).
  • Text Splitting: RecursiveCharacterTextSplitter (for improved chunking logic).
  • Embedding Model: all-MiniLM-L6-v2 (for efficiency and accuracy).

These decisions ensured that our RAG pipeline effectively handled both structured and unstructured data, improving retrieval accuracy for RFQs.

Documentation Changes

Technology Review

Database

We evaluated various database technologies for our project, considering factors such as scalability, efficiency, and compatibility with our requirements. Ultimately, we selected ChromaDB due to several key advantages:

  • Local Hosting: Ability to be hosted locally, maintaining full control over data storage and retrieval without relying on external cloud services.
  • Data Storage Flexibility: Supports storing data without persistence, which is useful for temporary or in-memory operations, reducing storage overhead.
  • Native Support for Vector Embeddings: Well-suited for our retrieval-augmented generation (RAG) applications. Since our system relies on embedding-based search, using a database optimized for handling and querying high-dimensional vector representations was essential.
  • Efficient Retrieval: Offers efficient retrieval using cosine similarity, ensuring that queries return the most relevant document chunks by measuring their similarity in the vector space.

This combination of flexibility, efficiency, and native integration with our AI-driven approach made ChromaDB the optimal choice for our project.
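
A brief sketch of the non-persistent, cosine-similarity configuration described above (the collection name is illustrative):

  import chromadb

  client = chromadb.EphemeralClient()  # in-memory; nothing is written to disk
  collection = client.create_collection(
      name="doc_changes",
      metadata={"hnsw:space": "cosine"},  # rank results by cosine similarity
  )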

Text Splitting

We chose to use a recursive character text splitter with an overlap to ensure that our document chunks maintain coherence and context while being processed by the RAG system. Traditional text splitting methods may cut off sentences or separate related information, making it harder for the model to understand the full meaning. By using a recursive approach, we can intelligently split the text into smaller segments while preserving logical breaks, such as paragraphs or sections. The overlap further enhances continuity by allowing important contextual information to persist across chunks, preventing the loss of key details when querying or analyzing document changes. This method improves retrieval accuracy and ensures that relevant information remains intact across different sections of a document.
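
To make the overlap concrete, here is a small sketch (the sizes are illustrative, not our tuned values):

  from langchain.text_splitter import RecursiveCharacterTextSplitter

  splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=80)
  chunks = splitter.split_text(page_text)  # page_text from PDF extraction

  # With an 80-character overlap, the tail of one chunk reappears at the head
  # of the next, so a sentence that straddles a boundary is never lost.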

Technical Choices

Initially, PyMuPDF was planned for PDF text extraction due to its speed and efficiency in handling standard text content. As project requirements evolved, pdfplumber was used instead due to its robust table extraction capabilities, which PyMuPDF does not natively support, while still effectively handling standard text. This allowed for a more comprehensive and structured extraction process, ensuring that key data embedded in tabular formats was accurately retrieved.
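
A minimal sketch of that combined extraction (the file name is hypothetical):

  import pdfplumber

  texts, tables = [], []
  with pdfplumber.open("manual_v2.pdf") as pdf:
      for page in pdf.pages:
          texts.append(page.extract_text() or "")  # running prose
          tables.extend(page.extract_tables())     # each table: rows of cells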

Telemetry Data

Technology Review

We came across two possible solutions in our research: a RAG-based approach [4] and an AI SQL agent approach [2]. The SQL agent approach is the one we implemented in our final project.

RAG Based Approach

This approach would have used a standard RAG pipeline, similar to the other use cases, but with a structured chunking strategy in place of fixed-size chunking:

  • Row-based chunking: For insights into individual records within the Excel sheet.
  • Column-based chunking: For general insights across the entire dataset.

Additionally, this approach would pre-calculate and store statistics (e.g., average, minimum, maximum, standard deviation) for general insights into the data.
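
A rough sketch of what that structured chunking could look like; since we did not implement this approach, the file name and formatting are assumptions:

  import pandas as pd

  df = pd.read_excel("telemetry.xlsx")  # hypothetical file name

  # Row-based chunks: one document per record, for single-reading questions.
  row_chunks = [row.to_json() for _, row in df.iterrows()]

  # Column-based chunks with pre-computed statistics for dataset-wide insight.
  col_chunks = [
      f"{col}: mean={df[col].mean():.2f}, min={df[col].min()}, "
      f"max={df[col].max()}, std={df[col].std():.2f}"
      for col in df.select_dtypes("number").columns
  ]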

SQL Agent Approach (implemented)

This approach is quite different from RAG. Instead of a vector store, retriever, and chunking operations, it uses an SQLite in-memory database.

The Excel file is first converted to a pandas DataFrame. The sheet names and column headers are made SQL-friendly, and the data is cast to the correct data types. These steps make the data easy to query.
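
A minimal sketch of this preprocessing; the file name and the sanitizing rule are illustrative:

  import re
  import sqlite3

  import pandas as pd

  conn = sqlite3.connect(":memory:")  # in-memory SQLite database

  def to_sql_name(name: str) -> str:
      """Make a sheet or column name SQL-friendly."""
      return re.sub(r"\W+", "_", str(name)).strip("_").lower()

  sheets = pd.read_excel("telemetry.xlsx", sheet_name=None)  # dict of sheets
  for sheet_name, frame in sheets.items():
      frame.columns = [to_sql_name(c) for c in frame.columns]
      frame.to_sql(to_sql_name(sheet_name), conn, index=False, if_exists="replace")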

An agent is created using a Llama model and Ollama's tool-calling API. The agent is provided with a tool (a Python function) and the schema of the database, which it uses to query the database.

The data returned by the agent's query is then given back to the LLM so that it can generate a natural-language response. In short, the SQLite database replaces the vector store, the agent replaces the retriever, and there is no chunking, only preprocessing.
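
The sketch below shows one way to wire this together with the Ollama Python library; the model name is illustrative, and the exact response shapes can vary between library versions:

  import json
  import sqlite3

  import ollama  # assumes a local Ollama server with a Llama model pulled

  conn = sqlite3.connect(":memory:")  # populated as in the previous sketch

  def run_sql(query: str) -> str:
      """The tool exposed to the agent: query the in-memory database."""
      rows = conn.execute(query).fetchall()
      return json.dumps(rows, default=str)

  # Schema text handed to the model, read from SQLite's own metadata.
  schema = "\n".join(
      row[0] or "" for row in
      conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'")
  )

  tools = [{
      "type": "function",
      "function": {
          "name": "run_sql",
          "description": "Run a read-only SQL query on the telemetry database.",
          "parameters": {
              "type": "object",
              "properties": {"query": {"type": "string"}},
              "required": ["query"],
          },
      },
  }]

  messages = [
      {"role": "system", "content": "Database schema:\n" + schema},
      {"role": "user", "content": "What was the average CPU temperature?"},
  ]
  response = ollama.chat(model="llama3.1", messages=messages, tools=tools)

  # If the model requested the tool, run it and ask for the final answer.
  tool_calls = response["message"].get("tool_calls") or []
  if tool_calls:
      messages.append(response["message"])
      for call in tool_calls:
          messages.append({
              "role": "tool",
              "content": run_sql(call["function"]["arguments"]["query"]),
          })
      response = ollama.chat(model="llama3.1", messages=messages)

  print(response["message"]["content"])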

Technical Choices

The SQL agent approach was selected for this use case due to several factors:

  • Speed: It delivered faster responses than the RAG-based approach, given the nature of the data.
  • Scalability: It is more sustainable in the long run, as telemetry data can grow quite large, which a RAG-based approach might not handle well, leading to inaccurate results.
  • Accuracy: The SQL approach is inherently more accurate because it eliminates the possibility of the LLM hallucinating data; answers are computed by the database rather than generated from retrieved text.

Languages, Frameworks, Libraries, and APIs

Languages

We opted to use Python due to several factors:

  • It offers extensive documentation compatible with the frameworks we employed.
  • Our team possessed prior experience with Python, which expedited development.
  • Python offers a rich ecosystem of data libraries, such as Pandas, which proved indispensable for our data manipulation tasks.
  • Furthermore, Python enjoyed robust support within the client's software environment, HP's AI Studio. This environment featured integrated Jupyter Notebooks, making Python the most seamless choice.
  • In contrast, JavaScript, while a potential alternative, lacked the same level of compatibility with the client's infrastructure.

Frameworks

For our project, we leveraged LangChain and Ollama. These frameworks were selected because:

  • Their documentation was readily available and comprehensive, facilitating our learning curve.
  • They provided support for all the technologies we required, streamlining integration.
  • In comparison, LlamaIndex, which we initially explored, proved sluggish and its documentation was less accessible. Additionally, it lacked support for some of the libraries we needed.
  • As an alternative to Ollama, we experimented with Hugging Face Transformers. However, Ollama emerged as the superior option due to its ease of use, setup, and integration with Docker containers. This compatibility was crucial given the containerized work environment within HP's AI Studio.

Libraries

Our project utilized several key libraries:

  • Pandas: This widely-used library was instrumental for data preprocessing and handling, offering a comprehensive suite of built-in operations and functions.
  • pdfplumber: We employed pdfplumber to extract text from PDF documents for the RFQ and Documentation Changes use cases.
  • Camelot: This library was chosen for extracting tabular data from RFQs, replacing pdfplumber for table extraction. Camelot's output format proved more conducive to embedding within our data model.
  • SQLite: For the telemetry data use case, we opted for SQLite as the database technology. This choice was motivated by the expectation that LLMs would excel at generating SQL queries, given the abundance of available resources and established best practices.

Summary of Technical Decisions

Our selection of technologies, languages, and frameworks was guided by the project requirements and the constraints imposed by software compatibility. A key consideration was the project's future development beyond our involvement. We aimed to ensure a smooth handover by choosing technologies that were widely familiar and easy to understand for subsequent development teams. This approach prioritizes long-term maintainability and minimizes the learning curve for future contributors.

References

  1. V. Tiwari, "Implementing a RAG Model for PDF Content Extraction and Query Answering," dzone.com, Nov. 22, 2024. https://dzone.com/articles/rag-model-for-pdf-content-extraction-and-query-answering (accessed Apr. 02, 2025).
  2. A. Angulo, "How to create accurate, fast RAG with Excel files," ajac-zero.com, Sep. 21, 2024. https://ajac-zero.com/posts/how-to-create-accurate-fast-rag-with-excel-files/ (accessed Apr. 02, 2025).
  3. R. Ong, "RAG With Llama 3.1 8B, Ollama, and Langchain: Tutorial," datacamp.com, Sep. 05, 2024. https://www.datacamp.com/tutorial/llama-3-1-rag
  4. D. Richards, "Mastering Document Chunking for Non-Standard Excel Files: A Software Engineer's Guide," News from Generation RAG, Jul. 10, 2024. https://ragaboutit.com/mastering-document-chunking-for-non-standard-excel-files-a-software-engineers-guide/#aioseo-sheet-based-and-row-based-chunking (accessed Apr. 02, 2025).