SYSTEM DESIGN

System Architecture

The system consists of three components: a frontend built with the Next.js React framework, a backend built with the Python web framework Flask, and a database component containing the SQL database and the vector database.

Next.js receives requests from the user, such as PDF uploads. The PDFs are stored in the frontend component so that they can be displayed to the end user easily, which also removes the need for a proxy when the backend retrieves those files. The frontend also sends API requests to the backend and displays the responses on the web page.

The backend uses the Flask micro web framework, which handles the API requests sent by the frontend. It uploads text chunks to the Pinecone vector database, retrieves the relevant ones, and reads and writes data in the SQL database. The backend also calls OpenAI’s large language models to generate the systematic review.

The Pinecone vector database stores text chunks converted to vectors. Each vector carries metadata: the text’s origin, its section category and its plaintext. Querying the database returns the most closely related text chunk to the backend. The SQL database stores user details as well as previously generated systematic reviews, each linked to a user. SQL statements add, delete and retrieve data from the tables, and the results are returned to the backend in JSON format.
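The storage and query behaviour described above can be illustrated with a small, self-contained sketch. It is not Pinecone's API and not the project's code: the field names (source, section, text), the fixed vocabulary and the toy_embed function are assumptions made for illustration, standing in for the real metadata keys and the OpenAI embedding model.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class ChunkRecord:
    """One stored entry: a vector plus its metadata (field names are hypothetical)."""
    vector: list
    source: str   # origin of the text, e.g. the PDF file name
    section: str  # section category
    text: str     # plaintext of the chunk


# Toy vocabulary; a real embedding model produces high-dimensional dense vectors.
VOCAB = ["retrieval", "generation", "review", "patients", "trial", "model"]


def toy_embed(text):
    """Stand-in for an embedding model: counts of a few fixed words."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def make_record(text, source, section):
    """Embed a chunk and attach its metadata."""
    return ChunkRecord(toy_embed(text), source, section, text)


def query(records, question, top_k=1):
    """Return the top_k records most similar to the question."""
    q = toy_embed(question)
    ranked = sorted(records, key=lambda r: cosine(q, r.vector), reverse=True)
    return ranked[:top_k]


records = [
    make_record("the trial enrolled patients across three sites", "paper_a.pdf", "methods"),
    make_record("the model combines retrieval and generation", "paper_b.pdf", "introduction"),
]
best = query(records, "which patients joined the trial")[0]
print(best.source, best.section, best.text)
```

Keeping the plaintext in the metadata means the backend can pass the matched chunk straight to the language model without a second lookup.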

System Architecture Diagram

Figure 1: System architecture diagram of the RAG backend

Sequence Diagrams

The sequence diagram below shows, in sequential order, the interactions between the system components (frontend, backend and databases) when the user generates a systematic review.

Sequence diagram

Figure 2: Sequence diagram of the systematic review generation process

Data Storage

ER Diagram

This is an entity-relationship (ER) diagram of the SQL database we use to store the user’s login information and the history of systematic reviews. Each user can have zero or many systematic reviews (history entries) linked to them. The link is made through the user_id field. The ‘timestamp’ datatype stores the date and time, and MySQL generates its value when the table entry is created.

Figure 3: Entity-relationship (ER) diagram of the SQL database
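The one-to-many relationship and the database-generated timestamp can be sketched as follows. This is a minimal illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and every table and column name other than user_id is hypothetical rather than taken from the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Hypothetical schema: only user_id is named in the report.
conn.executescript("""
CREATE TABLE users (
    user_id       INTEGER PRIMARY KEY,
    username      TEXT NOT NULL UNIQUE,
    password_hash TEXT NOT NULL
);
CREATE TABLE history (
    history_id  INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL REFERENCES users(user_id),
    review_text TEXT NOT NULL,
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP  -- filled in by the database
);
""")

insert_user = "INSERT INTO users (username, password_hash) VALUES (?, ?)"
alice_id = conn.execute(insert_user, ("alice", "<hash>")).lastrowid
conn.execute(insert_user, ("bob", "<hash>"))

# Alice has two reviews, Bob has none: zero or many per user.
insert_review = "INSERT INTO history (user_id, review_text) VALUES (?, ?)"
conn.execute(insert_review, (alice_id, "First systematic review"))
conn.execute(insert_review, (alice_id, "Second systematic review"))
conn.commit()

# LEFT JOIN keeps users that have no history rows.
rows = conn.execute("""
    SELECT u.username, COUNT(h.history_id)
    FROM users u
    LEFT JOIN history h ON h.user_id = u.user_id
    GROUP BY u.user_id
    ORDER BY u.user_id
""").fetchall()
timestamps = [r[0] for r in conn.execute("SELECT created_at FROM history")]
print(rows, timestamps)
```

The insert statements never supply created_at, so its value comes from the database itself, which mirrors the behaviour described for the ‘timestamp’ datatype.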

Packages and APIs

PyMuPDF

Converts PDF documents into raw text for easier processing by the system.

LangChain

Handles text processing including text splitting and human message management.

OpenAI

Converts text chunks into high-dimensional vectors using an embedding model.

Pandas

Used for data manipulation and analysis, creating data frames for quality check visualizations.

Matplotlib

Generates visualized graphs for data analysis and quality checks.

Flask

Web framework handling backend operations, HTTP requests, and API endpoints.

Bcrypt

Secures user passwords through hashing before database storage.

MySQL

Relational database system storing user authentication details and usage history.

Next.js

React framework that provides server-side rendering and static page generation, with built-in JSX support.

TailwindCSS

Utility-first CSS framework used to style the website efficiently.

python-dotenv

Manages environment variables for application configuration.

Pinecone

Vector database for efficient similarity search and storage of embeddings.

OpenAI

Provides access to large language models (LLMs) and related AI capabilities.

Together

Alternative platform for accessing various open-source language models.

langchain-community

Community-contributed components and integrations for LangChain.

lxml

Library for processing XML and HTML documents.

transformers

State-of-the-art natural language processing models and tools.

tf-keras

Deep learning framework for building and training neural networks.

pypdf

PDF toolkit for splitting, merging, cropping, and transforming PDF pages.

markdown-pdf

Converts Markdown documents to PDF format.

mysql-connector-python

Python driver for communicating with MySQL databases.

seaborn

Statistical data visualization library based on matplotlib.

pdfplumber

Extracts text and analyzes PDF documents with detailed page information.

chardet

Character encoding auto-detection for text files.

wordcloud

Generates word cloud visualizations from text.

scikit-learn

Machine learning library for classification, regression, and clustering.

nltk

Natural Language Toolkit for text processing and analysis.

pytest

Framework for writing and running tests in Python.

tqdm

Provides progress bars for loops and iterables.

Authentication API

  • POST /api/register - User registration
  • POST /api/login - User authentication
  • POST /api/query_user - Get user details

Review Generation API

  • POST /api/upsert - Initialize Pinecone
  • POST /api/generate - Generate review
  • POST /api/quality_check - Quality analysis

History Management API

  • POST /api/save - Save review
  • POST /api/query - Get review
  • POST /api/query_user_history - User history
  • POST /api/delete_user_history - Delete review