Algorithms

Chunking

When processing large documents for retrieval-augmented generation (RAG), it is essential to split them into smaller, manageable chunks before embedding. This ensures that relevant context can be retrieved efficiently without overwhelming the model with excessively large inputs. We chose to use a recursive character text splitter because it preserves meaningful structure when dividing text, ensuring that sentences or logical sections remain intact. Additionally, we introduced an overlap between chunks to prevent loss of contextual information at chunk boundaries, and to ensure that key details spanning multiple chunks are not missed during retrieval.

Chunking Process

  1. Preprocess the text – Clean the extracted text by removing trailing whitespace, converting tabs to spaces, and standardising line endings to ensure consistency.
  2. Compute the document differences for the documentation use case – Compare versions of the document and generate a structured output that highlights additions and deletions.
  3. Extract text and table data for the RFQ use case – Extract relevant text and table data from each RFQ, including key metadata. Identify the table type for each extracted table and combine the text and table data into a single string for chunking.
  4. Split the text into smaller chunks using a recursive character text splitter:
    • For the documentation use case, divide the structured differences into manageable segments.
    • For the RFQ use case, break the combined text and table data into smaller chunks while preserving context.
  5. Set chunk size and overlap – Configure the chunk size to 500 tokens (the splitter measures length with a tiktoken encoder) to maintain a balance between detail and context. Apply a 20-token overlap to prevent loss of meaning at chunk boundaries in both the documentation and RFQ use cases.
  6. Annotate chunks with metadata:
    • For the documentation use case: Include metadata such as document name, old version, new version, and chunk number to enhance retrieval.
    • For the RFQ use case: Annotate each chunk with relevant metadata, including table number, table type, page number, customer name, contract duration, and monthly fee.
  7. Enhance retrieval with keywords – Embed version-related keywords within each chunk to improve relevance in similarity searches in the documentation use case.
  8. Convert chunks into structured data – Transform each chunk into a Document object, making it compatible with vector storage and retrieval systems.

Evaluation of Chunking Algorithm

To evaluate the effectiveness of the chunking algorithm, we conducted a series of tests to ensure it maintained contextual integrity while optimising retrieval performance. For each of the use cases, we ran the algorithm on sample documents:

  • For the documentation use case: We processed documents with known version differences and manually verified that the extracted changes were accurately segmented and labelled.
  • For the RFQ use case: We tested the code using known RFQs, printing the generated chunks to verify how the data was chunked and stored in the vector database. Additionally, we examined how the chunks were retrieved by reviewing the retriever's output.

We then tested different chunk sizes and overlap values to find a balance between chunk granularity and context retention, ensuring that no critical information was lost at boundaries. Additionally, we performed retrieval tests using queries designed to match specific document changes, verifying that relevant chunks were correctly returned. Finally, we assessed the algorithm's performance on large documents to confirm that it could efficiently process and store version differences for the documentation use case and text and table data for the RFQ use case without excessive memory usage or retrieval delays. These tests demonstrated that the chunking algorithm effectively preserved meaning across splits while allowing accurate and efficient document retrieval.
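
As a simplified illustration of the boundary check, the sketch below runs the same splitter configuration over synthetic text and verifies that consecutive chunks share boundary text. The synthetic input and the 40-character window are illustrative; the real tests used the sample documents described above.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Synthetic stand-in for a real diff output
sample_text = "\n".join(
    f"+ Added clause {i}: the supplier shall report usage monthly." for i in range(300)
)

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=500, chunk_overlap=20)
chunks = splitter.split_text(sample_text)

# With a 20-token overlap, each chunk should open with text that also closes
# the previous chunk, so nothing is lost at the boundary
for previous, current in zip(chunks, chunks[1:]):
    if current[:40] not in previous:
        print("Possible context loss at a chunk boundary")

print(f"Produced {len(chunks)} chunks")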

Preprocessing

To ensure telemetry data from Excel files is usable within the system, two preprocessing algorithms are applied:

Column Name Sanitisation

  • Uses regular expressions and string formatting to clean column names.
  • Removes or replaces special characters and spaces, and ensures SQL compatibility.
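
A minimal sketch of this step, assuming pandas-style column names (the helper name and exact replacement rules are illustrative):

import re

def sanitise_column_name(col: str) -> str:
    # Collapse whitespace runs to underscores, then drop any character
    # that is not alphanumeric or an underscore
    name = re.sub(r"\s+", "_", str(col).strip())
    name = re.sub(r"[^a-zA-Z0-9_]", "", name).strip("_")
    # SQL identifiers must not begin with a digit
    if name and name[0].isdigit():
        name = "_" + name
    return name.lower()

print(sanitise_column_name("CPU Usage (%)"))  # -> cpu_usage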

Date Extraction from Excel Metadata

  • Extracts date information from Excel file metadata.
  • Parses metadata fields to identify and standardise date formats.
  • Enables comparison across multiple analysis windows.
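
A minimal sketch, assuming the dates live in the workbook's core properties and are read with openpyxl (the source does not name the library, so this is an assumption):

from openpyxl import load_workbook

def extract_file_date(path: str) -> str:
    # Core properties carry the workbook's created/modified timestamps
    props = load_workbook(path, read_only=True).properties
    stamp = props.created or props.modified
    # Standardise to ISO format so analysis windows can be compared directly
    return stamp.strftime("%Y-%m-%d") if stamp else ""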

Code Implementation
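
The snippets below assume imports along these lines (the LangChain paths vary by version; newer releases expose the splitter via langchain_text_splitters and Document via langchain_core.documents):

import difflib
import re
from typing import Dict, List, Tuple

import camelot        # camelot-py, used for PDF table extraction
import pandas as pd
import pdfplumber
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document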

Code for Chunking:

def split_documents(self, docs_list: List[Document]) -> List[Document]:
    """Split text documents into chunks"""
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=500,
        chunk_overlap=20
    )
    return text_splitter.split_documents(docs_list)
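
Because the splitter is created with from_tiktoken_encoder, chunk_size and chunk_overlap are measured in tokens rather than raw characters, which tracks the embedding model's input length more faithfully than a character count would.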
    

For the documentation use case:

def process_versions(self, doc_versions: Dict[str, List[Tuple[str, str]]]) -> List[Document]:
    """Process and compare different document versions."""

    documents = []

    for doc_name, versions in doc_versions.items():
        versions.sort()  # Ensure versions are in order
        for i in range(len(versions) - 1):
            # Compute differences between versions
            old_version, old_text = versions[i]
            new_version, new_text = versions[i + 1]
            differences = self.compare_versions(old_text, new_text)

            # Split the differences into smaller chunks
            text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
                chunk_size=self.chunking_size, chunk_overlap=self.overlap_size  # Configure chunking size and overlap size
            )
            split_chunks = text_splitter.split_text(differences)

            # Annotate chunks with relevant information and metadata
            for idx, chunk in enumerate(split_chunks):
                # Implicit string concatenation keeps the keyword line free of stray indentation
                enhanced_chunk = (
                    f"Document: {doc_name}\nVersion Change: {old_version} → {new_version}\nChunk {idx+1}\n{chunk}\n"
                    f"Version-related keywords: {old_version}, {new_version}, {doc_name}, "
                    f"{doc_name} {old_version}, {doc_name} {new_version}"
                )

                # Store chunks as Document objects for compatibility with the vector database
                doc = Document(
                    page_content=enhanced_chunk,
                    metadata={"source": doc_name, "old_version": old_version, "new_version": new_version}
                )
                documents.append(doc)
    
    return documents

def extract_doc_name_and_version(self, filename: str) -> Tuple[str, str]:
    """Extract the base name and version from a filename (e.g., doc_v1.pdf -> doc, v1)."""

    stem = filename.rsplit(".", 1)[0]  # Drop the file extension so the version is returned without it
    parts = stem.rsplit("_", 1)
    return (parts[0], parts[1] if len(parts) > 1 else "v1")
    

def normalise_text(self, text: str) -> str:
    """Normalise text by removing trailing spaces, converting tabs, and standardising line endings."""

    return "\n".join(line.rstrip().replace("\t", " ") for line in text.splitlines())


def compare_versions(self, old_text: str, new_text: str) -> str:
    """Compare two versions of a document and return the differences."""

    diff = difflib.ndiff(self.normalise_text(old_text).splitlines(), self.normalise_text(new_text).splitlines())

    # Filter out lines that start with "?" (ndiff's intraline hint markers)
    filtered_diff = [line for line in diff if not line.startswith("?")]

    # Keep the full new text, but mark additions and deletions
    marked_text = "\n".join(filtered_diff)

    return marked_text
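
For illustration, ndiff prefixes deleted lines with "- " and added lines with "+ ", while unchanged lines keep a two-space indent, so a one-line fee change is stored as follows (the `processor` instance name and sample text are illustrative):

old = "The monthly fee is £10.\nService level: Gold"
new = "The monthly fee is £12.\nService level: Gold"
print(processor.compare_versions(old, new))
# - The monthly fee is £10.
# + The monthly fee is £12.
#   Service level: Gold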
    

For the RFQ use case:

def _identify_table_type(self, df: pd.DataFrame) -> str:
    """
    Identify the type of table based on column names and content.
    
    Args:
        df (pd.DataFrame): DataFrame containing table data
        
    Returns:
        str: Table type
    """
    # Convert column names to string and join for analysis
    cols = [str(col).lower() for col in df.columns]
    cols_text = " ".join(cols)
    
    # Check for devices and storage table
    if any(term in cols_text for term in ["tier", "device", "storage", "nas", "hdd", "ssd"]):
        return "devices_and_storage"
    # Check for software and services table
    elif any(term in cols_text for term in ["software", "service", "professional", "anyware"]):
        return "additional_services"
    # Check for pricing/fees table
    elif any(term in cols_text for term in ["cost", "price", "fee", "qty", "total", "£", "$"]):
        return "pricing"
    else:
        return "general_info"

def extract_content_from_pdf(self, pdf_path: str) -> str:
    """
    Extract text and tables from a PDF file using pdfplumber for text and camelot for tables.
    
    Args:
        pdf_path (str): Path to PDF file
        
    Returns:
        str: Combined extracted text and table content from PDF
    """
    try:
        text_content = ""
        tables_data = []

        # Extract text outside tables with pdfplumber
        with pdfplumber.open(pdf_path) as pdf:
            total_pages = len(pdf.pages)  # Reused for the table pass below
            for page in pdf.pages:
                # Get table boundaries
                table_bboxes = [table.bbox for table in page.find_tables()]

                def outside_tables(obj, bboxes=table_bboxes):
                    # Keep only objects whose midpoint falls outside every table bbox
                    v_mid = (obj["top"] + obj["bottom"]) / 2
                    h_mid = (obj["x0"] + obj["x1"]) / 2
                    return not any(
                        x0 <= h_mid < x1 and top <= v_mid < bottom
                        for (x0, top, x1, bottom) in bboxes
                    )

                # pdfplumber's extract_text has no table-exclusion argument, so
                # filter out the characters that sit inside table regions first
                text = page.filter(outside_tables).extract_text(x_tolerance=2, y_tolerance=2)
                if text:
                    text_content += text + "\n"
        
        # Find the cutoff point in the text content
        cutoff_marker = "Services are comprised of the following:"
        if cutoff_marker in text_content:
            # Split at the marker and keep only the content before it (plus the marker itself)
            parts = text_content.split(cutoff_marker, 1)
            text_content = parts[0] + cutoff_marker
        
        
            
        # Process each page for tables
        for page_num in range(1, total_pages + 1):
            try:
                # Read tables from the current page
                tables = camelot.read_pdf(pdf_path, pages=str(page_num))
                
                if tables and tables.n > 0:
                    for table_num, table in enumerate(tables):
                        # Convert the table to DataFrame
                        df = table.df
                        
                        # Handle header row if present
                        if not df.empty:
                            # Use first row as header and drop it
                            df = (
                                df.rename(columns=df.iloc[0])
                                .drop(df.index[0])
                                .reset_index(drop=True)
                            )
                            
                            # Clean up newlines in data
                            df = df.apply(lambda x: x.str.replace('\n', ' ') if x.dtype == 'object' else x)
                            
                            # Clean up column names to be valid
                            df.columns = [
                                re.sub(r'[^a-zA-Z0-9_]', '', col.replace('\n', ' ').replace(' ', '_'))
                                for col in df.columns
                            ]
                            
                            # Remove empty columns
                            df = df.dropna(axis=1, how='all')
                            
                            # Identify table type
                            table_type = self._identify_table_type(df)
                            
                            # Convert DataFrame to text representation for vector storage
                            table_text = f"--- TABLE NUMBER {table_num+1} ({table_type}) ---\n"
                            
                            # Add headers
                            headers = " | ".join([str(col) for col in df.columns])
                            table_text += headers + "\n"
                            
                            # Add separator
                            table_text += "-" * len(headers) + "\n"
                            
                            # Add rows
                            for _, row in df.iterrows():
                                row_text = " | ".join([str(val) if pd.notna(val) else "" for val in row])
                                table_text += row_text + "\n"
                            
                            table_text += "--- END TABLE ---\n"
                            
                            # Store table data as text
                            tables_data.append(table_text)
            except Exception as table_err:
                print(f"Error extracting tables from page {page_num} of {pdf_path}: {str(table_err)}")
        
        # Combine text content and table data
        combined_content = text_content + "\n" + "\n".join(tables_data)
       
        return combined_content
        
    except Exception as e:
        print(f"Error processing {pdf_path}: {str(e)}")
        return ""

def _extract_key_metadata(self, text: str) -> Dict[str, str]:
    """
    Extract key metadata from RFQ text.
    
    Args:
        text (str): Text content
        
    Returns:
        Dict[str, str]: Extracted metadata
    """
    metadata = {}
    
    # Extract customer name
    customer_match = re.search(r"(?i)Customer Name[:\s]+([^\n]+)", text)
    if customer_match:
        metadata["customer_name"] = customer_match.group(1).strip()
    
    # Extract contract duration
    duration_match = re.search(r"Contract duration:\s*(\d+)\s*\(Months\)", text)
    if duration_match:
        metadata["contract_duration"] = duration_match.group(1)
    
    # Extract monthly fee
    fee_match = re.search(r"pay HP per month in GBP at £([\d,]+\.\d+)", text)
    if fee_match:
        metadata["monthly_fee"] = fee_match.group(1)
    
    return metadata
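
Run against a text fragment shaped like the RFQs above (values illustrative), the three patterns yield:

sample = (
    "Customer Name: Example Ltd\n"
    "Contract duration: 36 (Months)\n"
    "The customer will pay HP per month in GBP at £1,234.56 for the service."
)
print(extractor._extract_key_metadata(sample))
# -> {'customer_name': 'Example Ltd', 'contract_duration': '36', 'monthly_fee': '1,234.56'}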