Research

Related Literature

Our project aims to create a chatbot that can answer queries using the latest information. It must be able to extract information from websites and PDF files. We need to design a user-friendly interface that allows users to easily enter queries and provide sources for the chatbot to utilise.

We identified a few key areas of research that are relevant to our project:
Large Language Models

The rapid advancement of large language models (LLMs) [1] has produced models such as GPT-4 and Claude 3 that demonstrate strong capabilities in understanding and generating human-like text.

These models have the potential to form the foundations of our chatbot, providing the necessary language understanding and generation capabilities to engage in natural conversations.


Retrieval Augmented Generation

Retrieval-augmented models, like the Retrieval Augmented Generation (RAG) system [2], combine language models with the ability to retrieve and incorporate relevant information from external sources.

This approach can help our chatbot provide more informed and contextually relevant responses by drawing upon an updated and broader range of knowledge, which is crucial for answering user queries efficiently.


Open Domain Question Answering

Open-domain Q&A systems [3] aim to provide accurate and informative responses to a wide range of queries, without being limited to a specific domain. These systems leverage techniques like passage retrieval and knowledge-grounded generation to access relevant information from various sources. While open-domain refers to general knowledge, the principles and approaches can be adapted to a specified domain such as social-behavioural information.

For our project, we can leverage the foundations of open-domain Q&A to develop a system tailored to social-behavioural topics. By curating a knowledge base that includes relevant and up-to-date information, we can create a chatbot that can effectively answer user questions in this specialized area.


Methodology

To address the requirements of our social-behavioural chatbot, we have developed a comprehensive approach that leverages the latest advancements in natural language processing and information retrieval. Our methodology consists of several key components that work together to provide users with accurate and up-to-date responses to their queries.

At the core of our system is the ability to gather the most relevant and up-to-date information from a variety of sources. To achieve this, we plan to implement a web scraping mechanism that continuously monitors and extracts content from trusted websites, online publications, and other relevant data sources. This allows our chatbot to stay informed about the latest developments, trends, and insights in the social-behavioural domain.

Once the data is collected, we preprocess it to ensure optimal performance and relevance. This preprocessing stage divides the text content into smaller, semantically coherent chunks using text segmentation techniques. By breaking the information down into these manageable units, we can improve the precision and efficiency of the subsequent retrieval and response generation processes.
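To make the collection step above concrete, the sketch below shows a minimal version of such a scraping routine, using the requests and BeautifulSoup libraries. The URL and the tag-stripping choices are illustrative assumptions, not our final configuration.

```python
import requests
from bs4 import BeautifulSoup

def fetch_page_text(url: str) -> str:
    """Download a page and return its visible text content."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Remove script and style elements, which carry no readable content.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

# Hypothetical source URL; the real list of trusted sites is project-specific.
page_text = fetch_page_text("https://example.org/latest-report")
```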

To enable effective information retrieval, we convert the text chunks into numerical vector representations using state-of-the-art text embedding models [4]. These embeddings capture the semantic and contextual relationships between texts, allowing us to measure the similarity between a user's query and the available information. When a user submits a query, our system leverages these text embeddings to quickly identify the most relevant chunks of information from our database. We employ standard similarity measures [5], such as cosine similarity and Euclidean distance, to rank and retrieve the most pertinent content based on the user's input.

With the relevant information retrieved, we then leverage powerful large language models (LLMs) to generate the final response to the user's query. LLMs, such as OpenAI's GPT-3.5 and GPT-4, have demonstrated exceptional capabilities in understanding and generating human-like text, making them well-suited for our chatbot application. By integrating the LLM with the retrieved information, our system can provide coherent, contextually relevant, and informative responses that address the user's specific needs. The LLM's language understanding and generation capabilities, combined with the curated knowledge base, enable our chatbot to engage in natural and meaningful conversations.
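As a minimal sketch of this generation step (using the current openai Python SDK; the system prompt wording and model choice here are illustrative assumptions), the retrieved chunks are concatenated into a context block and passed to the model alongside the user's query:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(query: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved context with the user's query and ask the LLM."""
    context = "\n\n".join(retrieved_chunks)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "If the context is insufficient, say so.\n\n"
                        f"Context:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
```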

Technology Review

2.1 Embedding Models

At the core of our approach is the idea of word embeddings, which are vector (list-of-numbers) representations of text data. The idea of these vectors is to capture the semantic similarity of the text they represent. There are many ways such vectors are generated, such as using a Siamese network with triplet loss [6] or fine-tuning a pre-trained LLM [7]. This representation allows us to transform qualitative input data into quantitative data, unlocking the world of statistical analysis on this medium.

2.1.1 all-MiniLM-L6-v2
The all-MiniLM-L6-v2 model [8] is a compact and efficient sentence embedding model built on Microsoft's MiniLM architecture, which is particularly well-suited for tasks that require fast and lightweight text encoding.
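Usage is straightforward via the sentence-transformers library; the example sentences below are hypothetical:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# encode() returns one 384-dimensional vector per input sentence.
embeddings = model.encode([
    "How can communities prepare for heatwaves?",
    "Guidance on community heatwave preparedness.",
])
print(embeddings.shape)  # (2, 384)
```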

2.1.2 Instructor
Instructor [9] is an embedding model developed by HKUNLP (the natural language processing group at The University of Hong Kong). Unlike traditional models that focus on open-ended generation, Instructor was specifically trained to interpret and execute a wide range of instructions: each input text is paired with a task instruction, so the same text can be embedded differently for different tasks. This versatility makes Instructor an ideal candidate for producing embeddings catered to our project's details.

2.1.3 OpenAI's ada & text-embedding-3-large
OpenAI has developed a range of powerful text embedding models [10], namely text-embedding-ada-002 (commonly referred to as ada) and text-embedding-3-large. These models are widely used in natural language processing applications due to their strong performance on a variety of tasks. While the specific details of how these OpenAI embedding models were trained are not publicly documented, they are known to be versatile and capable of capturing rich semantic representations.

2.2 Preprocessing

To enhance the performance and capabilities of our chatbot, we build upon the RAG model by incorporating additional techniques. Specifically, we implemented a document chunking mechanism that divides the source materials (e.g. websites, PDF files) into smaller, semantically coherent units. This chunking process helps to concentrate the relevant information and improve the precision of the retrieval process, as the chatbot can focus on the most pertinent sections of the sources.

In particular, we explored the LangChain framework [11] due to its built-in chunking functions, which saved development time and thus allowed us to experiment more iteratively.

2.2.1 CharacterTextSplitter
The CharacterTextSplitter is a simple yet effective method that divides text into chunks on a single chosen separator (by default, paragraph breaks), merging the resulting pieces up to a target chunk size. It is best suited for content that is relatively uniform, such as news articles.

2.2.2 RecursiveCharacterTextSplitter
The RecursiveCharacterTextSplitter is a more sophisticated method that recursively splits text on a prioritized list of separators (for example paragraph breaks, then line breaks, then sentences, then words). This makes it more suitable for complex documents with multiple sections and sub-sections, as it can extract coherent chunks at different levels of granularity. In particular, it is highly versatile, as it allows you to dictate which characters to preferentially split on.
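A minimal sketch of how this splitter is used; the chunk size, overlap, and separator list shown here are illustrative values, not our final tuned configuration:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

document_text = "..."  # raw text extracted from a website or PDF

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # maximum characters per chunk
    chunk_overlap=50,    # overlap to preserve context across boundaries
    separators=["\n\n", "\n", ". ", " "],  # preferred split points, in order
)
chunks = splitter.split_text(document_text)
```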

2.2.3 NLTKTextSplitter
LangChain also integrates with the Natural Language Toolkit (NLTK) through the NLTKTextSplitter, which leverages NLTK's sentence tokenization capabilities. This text splitter can be useful for sources where the logical breaks between content units are clearly defined, such as academic papers or technical manuals.

2.2.4 spaCy Text Splitting
In addition to the LangChain-based text splitting, we also explored the use of spaCy's text splitting capabilities [12]. spaCy provides advanced NLP features like tokenization, part-of-speech (POS) tagging, and named entity recognition (NER), which can be leveraged to identify breaks and transitions within source documents.
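As a sketch, sentence-level splitting with spaCy needs only the rule-based sentencizer; a full pretrained pipeline (e.g. en_core_web_sm) would additionally provide the POS tags and entities mentioned above. The sample text is hypothetical:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence boundary detection

document_text = "First finding of the report. A second finding follows here."
doc = nlp(document_text)
sentences = [sent.text for sent in doc.sents]
```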

2.2.5 Content based Chunking
Something we did not thoroughly explore was the idea of content-based chunking, where the chunks follow the structure of the text itself. This is useful for markup formats like LaTeX or Markdown; however, our input sources were always plain text, making it redundant for our case. It would nonetheless be interesting future work for format-specific applications.

2.3 Find Relevancy

With the sourcing and chunking mechanisms in place, we then needed a way to effectively retrieve the chunks relevant to the user's query. For this, we explored different types of distance metrics [13], namely cosine similarity, Euclidean distance, and Jaccard similarity, to determine the most suitable approach for our chatbot. The embedding-based measures extend beyond simple keyword/syntax matching and instead aim to capture the semantic and contextual relationship between queries and candidate chunks. This enables the system to identify the most informative and appropriate chunks to include in the chatbot's response.

2.3.1 Cosine Similarity
Cosine similarity [14] is a measure of the cosine of the angle between two non-zero vectors. It is a widely used metric in information retrieval and text mining tasks, as it can effectively capture the semantic similarity between two text representations, even if they do not share any common words. The key idea behind cosine similarity is to measure the orientation of the two vectors, rather than their magnitude. This means that two vectors can be considered similar if they point in a similar direction, regardless of their absolute lengths. Mathematically, the cosine similarity between two vectors A and B is calculated as:

\( Cosine\:Similarity(A,\:B)=\frac{A \cdot B}{\|A\|\|B\|} \)

Where A and B are the two vectors being compared, and \(\|A\|\) and \(\|B\|\) are the magnitudes of the vectors. The resulting value ranges from -1 to 1, with 1 indicating perfect similarity, 0 indicating orthogonality, and -1 indicating complete dissimilarity.
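A minimal NumPy implementation of this formula:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two non-zero embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```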

2.3.2 Euclidean Distance
Euclidean distance is a straightforward metric that measures the straight-line distance between two points in a multi-dimensional space. In the context of text embeddings, Euclidean distance can be used to quantify the proximity between the vector representations of the user's query and the candidate chunks. The key idea behind Euclidean distance is to measure the absolute difference between the corresponding elements of the two vectors. Mathematically, the Euclidean distance between two vectors x and y is calculated as:

\( Euclidean\:Distance(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \)

Where x and y are the two vectors being compared, and n is the number of dimensions in the vector space.
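The corresponding NumPy one-liner; note that, unlike cosine similarity, a smaller value indicates a closer match:

```python
import numpy as np

def euclidean_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Straight-line distance between two embedding vectors (lower = closer)."""
    return float(np.linalg.norm(x - y))
```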

2.3.3 Jaccard Similarity
Jaccard similarity [15] is a measure of the overlap between two sets, defined as the size of the intersection divided by the size of the union of the sets. In the context of text, Jaccard similarity can be used to compare the overlap between the words or n-grams present in the user's query and the candidate chunks. The key idea behind Jaccard similarity is to focus on the shared and unique elements between the two sets, rather than their absolute magnitudes. Mathematically, the Jaccard similarity between two sets A and B is calculated as:

\( Jaccard\:Similarity(A, B) = \frac{|A \cap B|}{|A \cup B|} \)

Jaccard similarity can be useful for identifying chunks that share a high degree of lexical overlap with the user's query, which may indicate a strong topical relevance. This approach is particularly effective when the user's query and the relevant chunks can be well-represented by the presence or absence of specific words or n-grams.
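A minimal word-level implementation; simple whitespace tokenisation is assumed here, and n-gram variants follow the same pattern:

```python
def jaccard_similarity(query: str, chunk: str) -> float:
    """Word-level Jaccard similarity: |intersection| / |union| of token sets."""
    a, b = set(query.lower().split()), set(chunk.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```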

2.4 Prompting

A crucial aspect of our chatbot is the way in which we leverage the LLM through prompting. Prompt engineering [16] refers to the process of providing input to an LLM in a specific format to elicit the desired output or behaviour. In the context of our project, prompting plays a vital role in ensuring the generated responses are well-aligned with the user's query and the overarching goals of our project. One of the core challenges we hope to address is prompt alignment: the process of designing prompts that elicit responses from the LLM that are coherent, informative, and tailored to the specific context of the user's query. This goes beyond simply providing the LLM with just the user's query; instead, we have explored various prompt engineering approaches to better align the model's generation capabilities with our desired output characteristics.

2.4.1 Zero-Shot Prompting
In a zero-shot setting, the LLM is provided with a prompt that doesn't include any task-specific examples or demonstrations. Instead, the prompt relies on the model's general language understanding and reasoning abilities to generate the desired output. This approach can be cost-effective, as it does not require extensive fine-tuning or the collection of labelled training data. However, the accuracy of zero-shot prompting may be lower than other methods, as it depends on the model's ability to generalize and infer the task requirements solely from the prompt.

2.4.2 Few-Shot Prompting
In contrast to zero-shot, as the name suggests, few-shot prompting involves providing the LLM with a small number of task-specific examples within the prompt. This allows the model to better understand the desired output format and the type of information that should be included in the response. Few-shot prompting strikes a balance between cost and accuracy, as it requires fewer resources than full fine-tuning while still providing task-specific guidance to the model.
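As a sketch of the few-shot format in the chat-message style used by OpenAI's API (the worked example pair and the placeholder variables here are hypothetical, not our production prompt):

```python
context = "..."      # chunks retrieved for the current query
user_query = "..."   # the user's question

few_shot_messages = [
    {"role": "system", "content": "Answer questions using the provided context "
                                  "and cite sources in square brackets."},
    # One worked example showing the desired format:
    {"role": "user", "content": "Context: ...\n\nQuestion: What drives volunteer retention?"},
    {"role": "assistant", "content": "Volunteer retention is mainly driven by ... [1]"},
    # The real query follows the same shape:
    {"role": "user", "content": f"Context: {context}\n\nQuestion: {user_query}"},
]
```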

2.4.3 Chain of Thought Prompting
This approach [17] encourages the LLM to engage in step-by-step reasoning and articulate its thought process explicitly, with prompts building on one another iteratively. This can be useful for complex queries that require multiple steps of reasoning or for generating detailed explanations. Past research has shown that chaining prompts can lead to more coherent and informative responses from the LLM.
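In its simplest form, this amounts to adding an explicit reasoning instruction to the prompt; the wording below is an illustrative example rather than our exact prompt:

```python
cot_instruction = (
    "Answer the question using the context below. "
    "First, list the relevant facts from the context step by step; "
    "then give your final answer on a new line prefixed with 'Answer:'."
)
```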

2.4.4 LLM based Prompting
Something new our team explored was the idea of using the LLM itself to generate the prompts. This is done by providing the LLM with a seed prompt and allowing it to generate subsequent prompts based on the context and user query. We iterate this process, each time testing and tuning the prompt by evaluating the generated responses and using the LLM to adjust the prompt.
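A sketch of one refinement round, assuming the current openai Python SDK; in our actual workflow the evaluation step also involved manual review rather than being fully automated:

```python
from openai import OpenAI

client = OpenAI()

def refine_prompt(seed_prompt: str, sample_query: str, rounds: int = 3) -> str:
    """Iteratively ask the LLM to critique and rewrite its own system prompt."""
    prompt = seed_prompt
    for _ in range(rounds):
        # Generate a trial answer under the current prompt.
        trial = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "system", "content": prompt},
                      {"role": "user", "content": sample_query}],
        ).choices[0].message.content
        # Ask the LLM to rewrite the prompt in light of that answer.
        prompt = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content":
                       f"System prompt:\n{prompt}\n\n"
                       f"Answer it produced:\n{trial}\n\n"
                       "Rewrite the system prompt so future answers are more "
                       "accurate and consistently formatted. Return only the "
                       "new prompt."}],
        ).choices[0].message.content
    return prompt
```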

2.5 Large Language Models

Large language models (LLMs) are at the core of our chatbot's capabilities, providing the foundation for natural language understanding and generation. We explored several prominent LLM options to determine the most suitable approach for our project.

2.5.1 OpenAI's GPT-3.5 / GPT-4
OpenAI's GPT-3.5 and GPT-4 are among the most advanced and widely-used large language models. These models have demonstrated impressive performance across a wide range of natural language tasks, including text generation, question answering, and language understanding.

2.5.2 Gemini
Gemini is Google's newly released LLM, succeeding its earlier PaLM family of models; it is a powerful language model that has been optimized for a wide range of natural language processing tasks.

2.5.3 Offline LLMs
While the cloud-hosted, API-accessible LLMs like GPT-3.5 and GPT-4 offer convenience and scalability, we also explored the potential of using offline LLM solutions. Offline LLMs can provide increased privacy, lower latency, and the ability to run on resource-constrained devices, which may be beneficial for certain deployment scenarios.

Promising offline LLM options include Llama 2 [18] and Mistral 7B [19], both of which provide pre-trained weights that can be fine-tuned and deployed locally. Additionally, the growing ecosystem of openly released models may offer optimized performance and integration for our chatbot use case.

Findings

3.1 Embedding Models

Initially, we experimented with the Instructor model, which was specifically designed for instruction interpretation. It was interesting to see how the model could be adapted to generate embeddings for our chatbot, as it was able to capture the instructional nature of the text. However, a major drawback was that it took ~20sec (averaged across 10 runs) to generate the embeddings, which was not ideal for our real-time chatbot.

We then explored the all-MiniLM-L6-v2 model, which was much faster, taking only ~0.5sec (averaged across 10 runs) to generate the embeddings. The embeddings generated by this model were also more compact and efficient, making it a better fit for our chatbot's requirements.

Lastly, we tested OpenAI's ada model, which was extremely fast, matching the all-MiniLM-L6-v2 model in terms of speed. (At the time of testing, the text-embedding-3-large model had not yet been released.) The embeddings generated by the ada model were also of high quality, capturing the semantic similarity of the text effectively.

However, one of the core concerns of our client was the monetary impact of the project; hence, we decided to go with the all-MiniLM-L6-v2 model, as it is free to use and has low computing requirements.

3.2 Preprocessing

Testing out the different text splitters, we found that the CharacterTextSplitter was the fastest, taking only ~0.1sec (averaged across 10 runs) to split the text. However, the quality of the chunks it generated was not as high as that of the other splitters, as its approach is more simplistic.

The RecursiveCharacterTextSplitter surprisingly took approximately the same time, at ~0.15sec (averaged across 10 runs). Additionally, it was able to generate more coherent and meaningful chunks, making it a better choice for our chatbot. The ability to recursively split the text on a prioritized list of separators allowed it to extract relevant information at different levels of granularity.

The NLTKTextSplitter was slightly slower, taking ~0.2sec (averaged across 10 runs) to split the text. However, the quality of the chunks it generated was high, as it leveraged NLTK's sentence tokenization capabilities.

spaCy text splitting was the slowest of the splitters, taking ~0.3sec (averaged across 10 runs) to split the text. However, it provided advanced NLP features like tokenization, part-of-speech (POS) tagging, and named entity recognition (NER), which could be useful for identifying breaks and transitions between source documents.

In the end, we decided to go with the RecursiveCharacterTextSplitter, as it provided the best balance between speed and quality of the chunks generated.

3.3 Similarity Functions

We tested the different similarity functions on a small dataset of 100 queries and 100 chunks. This evaluation was rather empirical, as we wanted to see how the different similarity functions performed in a real-world scenario. We first chunked the data using the RecursiveCharacterTextSplitter and computed embeddings with the all-MiniLM-L6-v2 model. The top-K (K = 10) chunks were then evaluated by our team to determine the quality of the retrieval.

The Cosine Similarity function performed the best, with an average precision of 0.85 across the 100 queries. We hypothesise that this is because cosine similarity is insensitive to the magnitude of the vectors, allowing it to capture the semantic similarity between the user's query and the candidate chunks effectively. Additionally, some of this success might be attributed to the fact that all-MiniLM-L6-v2 was trained with a cosine-similarity objective, so its embeddings are naturally suited to ranking by cosine similarity.

The Euclidean Distance function was slightly worse, with an average precision of 0.75 across the 100 queries. This might be because Euclidean distance is sensitive to the magnitude of the vectors, which could have affected the quality of the retrieval. As the all-MiniLM-L6-v2 model returns 384-dimensional vectors, one could also attribute this result to the curse of dimensionality [20].

The Jaccard Similarity function performed the worst, with an average precision of 0.65 across the 100 queries. This could be due to the fact that the Jaccard Similarity function is more sensitive to the presence or absence of specific words or n-grams, which might not be ideal for capturing the semantic similarity between the user's query and the candidate chunks.

Thus, we decided to go with the Cosine Similarity function, as it provided the best performance in terms of retrieval quality.

3.4 Prompting

The final system prompt was designed as a combination of the user's query and a few task-specific examples. We found the few-shot prompting approach to be the most effective, as it gave the model enough task-specific examples to generate coherent and informative responses. In contrast, the zero-shot prompting approach produced responses that were highly varied in formatting, which might throw the user off; one example was the layout of the references, which sometimes appeared as a footer and other times inline with the text.

The chain of thought prompting approach was also effective, as it allowed the model to engage in step-by-step reasoning and articulate its thought process explicitly. However, by its nature it required more model calls to answer a single query, which increased the monetary cost of the system; thus, it was abandoned.

Lastly, we also experimented with LLM-based prompt engineering, where the model generated prompts that were tailored to the user's query. We took the initial prompt generated by the few-shot prompting approach and fed it back into the model to generate a refined prompt. This approach was effective in generating prompts that were more aligned with the model's internal representations and reasoning processes, leading to more coherent and informative responses.

3.5 LLM Selection

OpenAI's GPT-3.5 and the more recently released GPT-4 emerged as the frontrunners for our chatbot project. These models have demonstrated exceptional performance across a wide range of natural language tasks, making them highly capable of powering the language understanding and generation capabilities required for our chatbot. The availability of well-documented APIs and tooling for integrating GPT-3.5 and GPT-4 was a significant advantage, as it allowed us to leverage these powerful models without the need for extensive custom development. Additionally, the continuous advancements in these models, with GPT-4 offering even more impressive capabilities, made them an attractive long-term choice for our project.

When Gemini was announced, we also evaluated it; however, during our testing we discovered that Gemini was not approved for use in the UK. Hence, we did not continue any further.

The exploration of offline LLM solutions, such as Llama 2 and Mistral 7B, was driven by the potential benefits of increased privacy, lower latency, and the ability to run on resource-constrained devices. However, we found that the high computing resources required for these offline models posed a significant challenge for the IFRC, which had limited access to powerful hardware infrastructure.

Given these findings, we opted to integrate OpenAI's models. While the cost associated with these cloud-hosted models was a consideration, we determined that the benefits of their exceptional performance and accessibility outweighed the financial implications.

References

[1] L. Mearian, “What are LLMs, and how are they used in generative AI?,” Computerworld, May 30, 2023. https://www.computerworld.com/article/3697649/what-are-large-language-models-and-how-are-they-used-in-generative-ai.html

[2] P. Lewis, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Available: https://arxiv.org/pdf/2005.11401.pdf

[3] G. Izacard, “Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering,” 2021. Accessed: Mar. 23, 2024. [Online]. Available: https://aclanthology.org/2021.eacl-main.74.pdf

[4] “Word embeddings in NLP: A Complete Guide,” www.turing.com. https://www.turing.com/kb/guide-on-word-embeddings-in-nlp#:~:text=Word%20embedding%20in%20NLP%20is (accessed Mar. 24, 2024).

[5] “Vector Similarity Explained,” Pinecone. https://www.pinecone.io/learn/vector-similarity/

[6] N. Reimers, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” 2019. Available: https://arxiv.org/pdf/1908.10084.pdf

[7] L. Wang, “Improving Text Embeddings with Large Language Models.” Accessed: Mar. 24, 2024. [Online]. Available: https://arxiv.org/pdf/2401.00368.pdf

[8] “sentence-transformers/all-MiniLM-L6-v2 · Hugging Face,” huggingface.co. https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

[9] “hkunlp/instructor-large · Hugging Face,” huggingface.co. https://huggingface.co/hkunlp/instructor-large (accessed Mar. 24, 2024)

[10] “OpenAI Platform,” platform.openai.com. https://platform.openai.com/docs/guides/embeddings/

[11] “Text Splitters | 🦜️🔗 Langchain,” python.langchain.com. https://python.langchain.com/docs/modules/data_connection/document_transformers/ (accessed Mar. 24, 2024).

[12] “Sentencizer · spaCy API Documentation,” Sentencizer. https://spacy.io/api/sentencizer (accessed Mar. 24, 2024).

[13] “Measuring Similarity from Embeddings | Machine Learning,” Google for Developers. https://developers.google.com/machine-learning/clustering/similarity/measuring-similarity

[14] “Cosine Similarity - an overview | ScienceDirect Topics,” www.sciencedirect.com. https://www.sciencedirect.com/topics/computer-science/cosine-similarity#:~:text=Cosine%20similarity%20measures%20the%20similarity

[15] “Jaccard Similarity – Text Similarity Metric in NLP – Study Machine Learning.” https://studymachinelearning.com/jaccard-similarity-text-similarity-metric-in-nlp/

[16] “What is prompt engineering? | McKinsey,” www.mckinsey.com. https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-is-prompt-engineering#:~:text=Prompt%20engineering%20is%20the%20practice (accessed Mar. 24, 2024).

[17] J. Wei, “Chain of Thought Prompting Elicits Reasoning in Large Language Models,” arXiv:2201.11903 [cs], Oct. 2022, Available: https://arxiv.org/abs/2201.11903

[18] “Llama,” Llama. https://llama.meta.com

[19] A. Jiang, “Mistral 7B.” Available: https://arxiv.org/pdf/2310.06825.pdf

[20] B. Shetty, “Curse of Dimensionality,” Built In, Aug. 19, 2022. https://builtin.com/data-science/curse-dimensionality