# Algorithms

## Semantic Search Algorithm

### Introduction
Semantic search is an information retrieval technique that aims to understand the intent and contextual meaning of a user query. Unlike traditional keyword-based search, which relies on exact word matches, it considers the meaning behind the words to deliver more accurate and contextually relevant results.
### Initial Approach: TF-IDF and Synonym Expansion
Our first approach combined TF-IDF-weighted embeddings with synonym expansion to improve both recall and precision.
**Key Components:**

- **TF-IDF Weighted Embeddings:**
  - TF-IDF scores highlight important terms while filtering out common stopwords.
  - Word embeddings were scaled by their TF-IDF scores to emphasize more meaningful terms.
- **Synonym Expansion:**
  - Used WordNet to find synonyms for each query term.
  - Expanded the query with related words, improving recall.
- **Cosine Similarity:**
  - Measured the similarity between the query vector and document vectors.
  - Applied a similarity threshold to filter out irrelevant results.
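As a rough sketch, the weighting and similarity steps above look like the following. The toy two-dimensional embeddings and TF-IDF scores here are made up purely for illustration; the actual implementation used full word embeddings and WordNet-based expansion.

```python
import math

def weighted_query_vector(terms, embeddings, tfidf):
    """Sum each term's embedding, scaled by its TF-IDF score."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for t in terms:
        weight = tfidf.get(t, 0.0)
        for i, x in enumerate(embeddings.get(t, [0.0] * dim)):
            vec[i] += weight * x
    return vec

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Illustrative 2-D embeddings and TF-IDF weights (invented values).
embeddings = {"creature": [1.0, 0.2], "floor": [0.1, 1.0], "fly": [0.9, 0.1]}
tfidf = {"creature": 0.8, "floor": 0.6, "fly": 0.7}

q = weighted_query_vector(["creature", "floor"], embeddings, tfidf)
d = weighted_query_vector(["creature", "fly"], embeddings, tfidf)
print(round(cosine_similarity(q, d), 3))
```

Documents scoring above the similarity threshold are returned, ranked by score.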
**Problem with this approach:**

While this method worked well in some cases, it failed in situations where meaning depended on sentence structure rather than isolated keywords. For example:

**Test Case:**

Query: "creature on the floor"

- Matched best with: "Birds are small creatures that fly."
- Incorrect: the correct match should have been "Small animals like bunnies are cute."
This issue arose because TF-IDF and synonym expansion rely heavily on individual words rather than understanding full sentence meaning.
### SentenceTransformers
Recognizing the limitations of TF-IDF-based methods, we explored SentenceTransformers, specifically the `paraphrase-MiniLM-L6-v2` model. This model was chosen because:
- It captures sentence-level semantics rather than just keyword relevance.
- It is lightweight and computationally efficient, making it practical for real-time search applications.
SentenceTransformers is a better approach for semantic search for the following reasons:

- **Context Awareness:**
  - The model captures nuances like synonyms and contextual meaning without requiring explicit synonym expansion.
  - Example: the query "creature that hops on land and cannot fly" correctly matched "Small animals like bunnies are cute," instead of an irrelevant phrase about birds.
- **Improved Similarity Scoring:**
  - Unlike TF-IDF, which ranks importance based on word frequency, SentenceTransformers computes semantic similarity directly, leading to more accurate rankings.
- **Scalability:**
  - The model runs efficiently without needing extensive preprocessing such as TF-IDF computations.
### Performance Comparison

To illustrate the performance of each method, let's consider the following example data:

```python
data = [
    "Small animals like bunnies are cute.",
    "Cars and vehicles are speeding on the highway.",
    "Birds are small creatures that fly."
]
```
Based on this data, we can test different queries and observe the expected results:
| Query | Expected Result | TF-IDF + Synonym Expansion | SentenceTransformers (MiniLM) |
|---|---|---|---|
| "creature on the floor" | "Small animals like bunnies are cute." | ✗ | ✓ |
| "wheels on the road" | "Cars and vehicles are speeding on the highway." | ✓ | ✓ |
| "creature that flies" | "Birds are small creatures that fly." | ✓ | ✓ |
| "animal that hops on land" | "Small animals like bunnies are cute." | ✗ | ✓ |
This table highlights how each method performs with simple queries based on the example data. SentenceTransformers (MiniLM) shows better contextual understanding, especially in cases where TF-IDF + Synonym Expansion struggles.
After a series of tests on a larger dataset, the following results were obtained:
| Method | Accuracy (%) | Avg. Query Time (s) |
|---|---|---|
| TF-IDF + Synonym Expansion | 60 | 0.1 |
| SentenceTransformers (MiniLM) | 90 | 1.5 |
We concluded that SentenceTransformers (MiniLM) was the superior choice due to its ability to understand the contextual meaning of queries, leading to more accurate results. While TF-IDF was faster, it lacked accuracy; SentenceTransformers significantly improved semantic relevance, though at the cost of increased execution time.
### Conclusion
Given the superior accuracy of SentenceTransformers, we adopted it as the final algorithm.
By refining the algorithm iteratively, we were able to achieve a balance between accuracy and efficiency, making it suitable for real-world applications.
## PDF Extraction

### Introduction
For this feature, we implemented a hybrid approach combining rule-based methods with a pre-trained NLP model (spaCy’s `en_core_web_sm`) to handle three types of PDFs:

- **Structured PDFs:** contain explicit headings for all fields.
- **Semi-Structured PDFs:** include headings for some fields while the rest is plain text.
- **Unstructured PDFs:** have no headings at all.
The extraction functions first try to extract all fields using regular expressions and then run a second extraction pass using spaCy’s entity recognition. This two-step approach captures fields reliably regardless of the PDF’s structure. The core idea is that the functions are oblivious to the type of PDF (structured, semi-structured, or unstructured); they work with any type. The regular expressions leverage explicit field keywords (like “Title”, “Date”, “Time”) and common phrases to extract values, while spaCy’s NLP capabilities scan the full text to pick out candidates for fields such as location and author when clear headings are missing.
### Regular Expressions
Regular expressions (regex) play a central role in this algorithm by leveraging common words and phrases that describe each field. For example:
- **Date and Time Extraction:** Patterns are designed to capture a variety of date formats such as “15th March 2025” or “15/03/2025”, and time formats like “3:00 PM” or “15:00”. By using alternation and optional groups, the regex can handle differences in spacing and the presence or absence of ordinal indicators (st, nd, rd, th).
- **Location Extraction:** The regex for location extraction utilises a list of common phrases and keywords that might precede the location information. For instance, phrases like “meet us at”, “located at”, “venue:”, and “address:” are included. This helps in capturing the location even when the document does not follow a strict structure. After detecting one of these phrases, the regex captures the subsequent text until it hits a newline or another field keyword, which is particularly useful in unstructured PDFs.
This reliance on common phrases and words makes the algorithm robust, as these expressions are typically used across documents to denote the same type of information. The regular expressions were fine-tuned through trial and error on a diverse dataset, which allowed us to iteratively improve their accuracy.
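As an illustration of the kind of patterns involved, a simplified sketch is shown below. These are stand-ins, not the project's actual expressions (for instance, this location pattern only stops at a newline, not at other field keywords):

```python
import re

# Illustrative patterns only; the real patterns were tuned on a larger dataset.
DATE_RE = re.compile(
    r"\b\d{1,2}(?:st|nd|rd|th)?\s+"
    r"(?:January|February|March|April|May|June|July"
    r"|August|September|October|November|December)\s+\d{4}\b"
    r"|\b\d{1,2}/\d{1,2}/\d{4}\b",          # alternation: worded or slash dates
    re.IGNORECASE,
)
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?:\s?[AP]M)?\b", re.IGNORECASE)
LOCATION_RE = re.compile(
    r"(?:meet us at|located at|venue:|address:)\s*([^\n]+)",  # common lead-in phrases
    re.IGNORECASE,
)

text = "Join us on 15th March 2025 at 3:00 PM. Venue: Town Hall, Main Street"
print(DATE_RE.search(text).group(0))      # 15th March 2025
print(TIME_RE.search(text).group(0))      # 3:00 PM
print(LOCATION_RE.search(text).group(1))  # Town Hall, Main Street
```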
### NLP and spaCy
When dealing with semi-structured and unstructured PDFs, spaCy plays a crucial role:
- **Entity Recognition:** spaCy’s `en_core_web_sm` model is pre-trained on a large corpus of diverse texts and uses a combination of statistical models and neural networks to perform Named Entity Recognition (NER). When processing a text, spaCy first tokenises the content—splitting it into words, punctuation, and other elements—and then uses its pre-trained NER component to identify spans of text that represent entities, such as dates, times, locations, and person names. In practice, after extracting the full text from a PDF, the code passes it through spaCy’s NLP pipeline with a command like `doc = nlp(text)`. During this process, the model applies its learned patterns to each token and considers the context in which the token appears. For example, even if a location is mentioned without a clear "Location:" heading, the model can detect it as a GPE (geo-political entity) based on how it was used in similar contexts during training.
- **Combining Rule-Based and NLP Approaches:** The system first attempts to extract fields using regular expressions. If the regex does not yield a result (for instance, if a field is missing a clear heading), the algorithm falls back on spaCy’s entity recognition. For example, if an author is not detected by the regex because it lacks an explicit author heading, or is described using a phrase not accounted for in the regex, spaCy might still identify the author based on its contextual understanding. This layered approach increases overall extraction accuracy.
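The regex-first, NER-fallback flow can be sketched as follows. The `ner_locations` helper here is a hypothetical stand-in for spaCy's NER pass (in the real pipeline it would iterate over `nlp(text).ents`), kept as a stub so the example is self-contained:

```python
import re

LOCATION_RE = re.compile(r"(?:venue:|located at)\s*([^\n]+)", re.IGNORECASE)

def ner_locations(text):
    # Hypothetical stand-in for spaCy's NER pass. The real pipeline would do:
    #   doc = nlp(text)
    #   return [ent.text for ent in doc.ents if ent.label_ == "GPE"]
    known_places = ["London", "Manchester"]  # illustrative only
    return [p for p in known_places if p in text]

def extract_location(text):
    # First pass: rule-based extraction via explicit phrases.
    match = LOCATION_RE.search(text)
    if match:
        return match.group(1).strip()
    # Second pass: fall back on entity recognition when no heading is present.
    candidates = ner_locations(text)
    return candidates[0] if candidates else None

print(extract_location("Venue: Town Hall"))         # regex pass succeeds
print(extract_location("The event is in London."))  # NER fallback succeeds
```

The same two-pass shape applies to the other fields (date, time, author), each with its own patterns and entity labels.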
### Data and Experimentation

We worked with a varied set of PDFs that included structured, semi-structured, and unstructured documents. Throughout the development process, we iteratively refined the extraction rules and methods by continually testing the algorithm on these different PDFs. We compared the extracted text against the expected text, using the discrepancies to fine-tune the regular expressions and improve the performance of spaCy’s entity recognition. This iterative testing and adjustment helped ensure that the final extraction approach was robust and effective across a wide range of document formats.
### Observations
During the development and testing, we observed that the algorithm sometimes over-extracts information for certain fields, such as location and author. For example, spaCy’s entity recognition may capture too much surrounding text when trying to extract a location, especially if the text contains multiple location-like phrases or if the context is ambiguous. Similarly, for author fields, spaCy might mistakenly include extra words or even adjacent content if the boundaries are not clear. These issues can occur due to the inherent challenges of processing unstructured text and the model’s attempt to interpret context without explicit delimiters. Future improvements could involve refining the extraction boundaries and applying post-processing steps to better isolate the intended information.
### Conclusion
The PDF and ICS extraction feature demonstrates an effective hybrid approach that leverages both rule-based and NLP techniques. By employing a two-pass extraction process – first with regular expressions and then with spaCy’s entity recognition – the system captures fields regardless of the PDF’s structure. While the regular expressions capture explicit markers using common phrases and headings, spaCy’s entity recognition recovers fields embedded in less structured content.
## Agentic Decisions

### Introduction
The Ask AI section on the home page supports two user flows:

- **Perform a search**, e.g. "I want to go sightseeing"
- **Generate a report**, e.g. "I want to report a bin overflow in my area"

The decision between the two falls to the large language model (LLM).
### Pre-Process
The `determineUserQuery` function in `SearchBar.jsx` prompts the LLM to decipher what the user query is asking for. It is called in the `handleSubmit` function when a user submits a query.
Care had to be taken in the prompt because the model (`Qwen2.5-1.5B-Instruct`) would lose accuracy if the prompt was not detailed enough, but also if it was too lengthy.
To evaluate performance, different prompts were tried against different user queries to find the configuration that produced the highest number of correct routes.
**Prompt 1:**

```
You are an AI assistant chatbot for a company. Determine whether the user's input is a request to create a new report, which may include details such as location, title, and description, or a search query looking for events or articles. Respond with 1 if the input is a new report request and 0 if it's a search query.
```
**Prompt 2:**

```
You are an AI assistant chatbot for a company. The user has asked something; it may be a report with a location, title, and description, or it may be a search query asking for events or articles. Determine whether the following is a new report request or a search query. Respond with 1 to create a new report and 0 for a search query.
```
**Prompt 3:**

```
Determine whether the following is a new report request or a search query.
- Respond with 1 (new report request) if the input:
  - Asks to create a new report.
  - Describes an issue or problem that needs to be reported.
  - Uses phrases like "report," "create a report," or "make a new report."
- Respond with 0 (search query) if the input:
  - Asks for information or looks up existing data.
  - Uses phrases like "find," "search," or "information about."
Examples:
- "Can you make a new report? I noticed some overflowing bins on my road." → 1
- "Find information about overflowing bins in my area." → 0
- "I want to report a pothole on Main Street." → 1
- "What are the rules for waste disposal?" → 0
Respond with 1 for new report requests and 0 for search queries.
```
**Table of routing results**

A tick means the prompt successfully determined that the user wanted to report an issue (returned 1).

| Query | Prompt 1 | Prompt 2 | Prompt 3 |
|---|---|---|---|
| "I'd like to report overflowing bins in my area" | ✓ | ✓ | ✓ |
| "There is a blocked pipe on Piccadilly Circus, I'd like to report it" | ✗ | ✓ | ✓ |
| "There's a streetlight out on Main Street near the park!" | ✗ | ✗ | ✓ |
Prompt 3 performed best, successfully routing all of the example cases.
```jsx
const determineUserQuery = async (userQuery) => {
  if (userQuery === "" || isStreaming) {
    return;
  }

  const systemPrompt = `
    Determine whether the following is a new report request or a search query.
    - Respond with 1 (new report request) if the input:
      * Asks to create a new report.
      * Describes an issue or problem that needs to be reported.
      * Uses phrases like "report," "create a report," or "make a new report."
    - Respond with 0 (search query) if the input:
      * Asks for information or looks up existing data.
      * Uses phrases like "find," "search," or "information about."
    Examples:
    - "Can you make a new report? I noticed some overflowing bins on my road." → 1
    - "Find information about overflowing bins in my area." → 0
    - "I want to report a pothole on Main Street." → 1
    - "What are the rules for waste disposal?" → 0
    Respond with 1 for new report requests and 0 for search queries.
  `;

  const modelReply = await getReply(userQuery, systemPrompt, () => {}, setIsStreaming);

  // If the engine has not finished loading, fall back to a generic reply.
  if (!engine) {
    console.log("Model is still loading...");
    setModelReply("Here is what I found:");
    return;
  }

  return modelReply;
};
```
### Process
The `handleSubmit` function takes the output of the LLM (either 0 or 1) and routes it to the corresponding action: a search reply for 0, a report reply for 1.
In the rare case that the LLM does not follow instructions, or if it has not loaded yet, we default to a search.
```jsx
const handleSubmit = async (e) => {
  e.preventDefault();

  // LLM decides whether to initiate a new report or a search query
  setMessages([...messages, { text: userQuery, sender: "user" }]);
  setFullUserQuery(userQuery);
  setUserQuery("");

  const choice = await determineUserQuery(userQuery);
  setSearchResult([]);
  setGeneratedReport(null);

  switch (choice) {
    case "0":
      getSearchReply(userQuery);
      break;
    case "1":
      getReportReply(userQuery);
      break;
    default:
      // Fall back to search if the LLM returned anything unexpected
      getSearchReply(userQuery);
      break;
  }
};
```
### Post-Process
The program now knows what the user has asked for, and wipes the chat history to prompt a fresh reply, whether for searching or reporting. See Ask AI for more details on this implementation.