EVALUATION

Achievement Table

ID | Requirement | Priority | Status | Contributors

Key Functionalities (Must + Should Have)
1 | User can input PDFs | Must | Completed | All
2 | User can enter a prompt to define the systematic review | Must | Completed | All
3 | Systematic review is displayed to the user | Must | Completed | Wing Ho
4 | Vector database integration | Must | Completed | All
5 | LLM-powered section generation | Must | Completed | Wing Ho, Yiwei
6 | User authentication system | Should | Completed | Wing Ho, Kevin
7 | Review history tracking for users | Should | Completed | Kevin, Wing Ho
8 | Quality assessment visualisation of the review through graphs | Should | Completed | Kevin, Jennifer

Optional Functionalities (Could Have)
9 | Users can view the source files of each review | Could | Completed | Wing Ho
10 | PDFs of the systematic review can be exported | Could | Completed | Wing Ho
11 | Medical paper database integration | Could | Not Completed | n/a

Known Bugs

ID | Bug Description | Priority
1 | Entering a review ID directly into the URL triggers the generation process without any input files or prompt | Medium
2 | Occasional markdown rendering issues in the HTML of the generated review | Low
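
A straightforward fix for the first bug would be to validate the request before generation starts. The sketch below is a minimal illustration, assuming a Flask-style backend; the route, field names, and error message are hypothetical placeholders rather than our actual implementation.

    # Hypothetical guard against starting generation with no inputs (bug 1).
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.route("/review/<review_id>/generate", methods=["POST"])
    def generate_review(review_id):
        files = request.files.getlist("pdfs")           # uploaded PDFs, if any
        prompt = request.form.get("prompt", "").strip()
        # Reject the request instead of silently starting an empty run.
        if not files or not prompt:
            return jsonify(error="Both PDFs and a prompt are required."), 400
        # ... hand off to the normal generation pipeline ...
        return jsonify(status="started")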

Individual Contribution - System Artefact

Work Packages (%) | Wing Ho Yeung | Jennifer Zhang | Yiwei Wang | Kevin Jin
Research and Experiments | 20 | 40 | 20 | 20
UI Design | 80 | 0 | 0 | 20
Coding | 25 | 25 | 25 | 25
Testing | 0 | 20 | 40 | 40
Overall Contribution (%) | 31.25 | 21.25 | 21.25 | 26.25

Individual Contribution - Website Report

Work Packages (%) | Wing Ho Yeung | Jennifer Zhang | Yiwei Wang | Kevin Jin
Website Template and Setup | 0 | 0 | 100 | 0
Home | 10 | 55 | 35 | 0
Video | 0 | 0 | 0 | 100
Requirement | 0 | 80 | 0 | 20
Research | 0 | 70 | 0 | 30
Algorithm | 0 | 50 | 20 | 30
UI Design | 30 | 40 | 0 | 30
System Design | 60 | 20 | 0 | 20
Implementation | 60 | 20 | 0 | 20
Testing | 0 | 0 | 100 | 0
Evaluation and Future Work | 0 | 80 | 20 | 0
User and Deployment Manuals | 70 | 0 | 0 | 30
Legal Issues | 0 | 30 | 0 | 70
Blog and Monthly Video | 100 | 0 | 0 | 0
Overall Contribution (%) | 23.6 | 31.8 | 19.6 | 25

Critical Evaluation

User Interface and Experience: We aimed to create a simple, clean, and effective user interface that is easy to navigate even for users who are not familiar with computers or AI. Key features include history tracking of generated systematic reviews, clickable input PDFs, and download options for reviews. These enhance the user experience, and general feedback was positive.

Functionality: All must-have and should-have features, and all but one of the optional features, were successfully implemented. No major issues were reported regarding core functionality.

Stability: The application performs reliably, although improvements could be made to enhance long-term maintainability of the codebase.

Efficiency: The application's efficiency is inadequate due to the time required to upsert vectors into the database and to generate text with the LLM. This criterion will be prioritised in future work.
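
One likely optimisation is batching the vector upserts so that embeddings are written in a few large requests instead of one network round trip per chunk. The sketch below is illustrative only, assuming a Pinecone-style client; the function name and batch size are our own placeholders, not the current codebase.

    # Sketch: batch vector upserts to reduce per-request network overhead.
    # Assumes a Pinecone-style index object exposing upsert(vectors=...).
    from itertools import islice

    def batched_upsert(index, vectors, batch_size=100):
        """Upsert (id, embedding, metadata) tuples in fixed-size batches."""
        it = iter(vectors)
        while batch := list(islice(it, batch_size)):
            index.upsert(vectors=batch)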

Compatibility: The system currently runs only locally. We plan to deploy it to the cloud in the future to improve accessibility and portability.

Maintainability: Project files are well organised in a modular structure, which helped us identify and fix bugs more efficiently.

Project Management: We maintained good communication via WhatsApp and GitHub. GitHub was used extensively for issue tracking, pull requests, and branching workflows. Regular weekly meetings with and without the client kept progress on track. These practices improved teamwork and version control.

While overall collaboration was smooth, we believe that starting to code earlier would have allowed us to implement more features and run more experiments.

Quality Evaluation

We often faced the question: “Why use RAG-n-Bones instead of ChatGPT directly?” To address this, we conducted a user survey comparing outputs of both systems using the same prompt. Ratings ranged from 1 (ChatGPT better) to 10 (RAG-n-Bones better):

Evaluation Criteria | Average Score | Interpretation
Relevance and Accuracy | 7 | Better than expected: despite using the same model, users found our results more relevant.
Response Speed | 1 | Significantly slower than ChatGPT due to embedding and document-processing overhead.
Response Structure | 10 | RAG-n-Bones generates better-structured reviews, aiding comprehension.
Reduction of Hallucinations | 5 | No significant difference: both use GPT-3.5 and are similarly prone to hallucination.
Overall Preference for Systematic Review Generation | 7 | Despite the latency, users preferred our system for its accuracy and structure.

Comparison Between Outputs

Here is a comparison between the output of our application and that of ChatGPT for the same query: "Write me a systematic review for efficacy of COVID-19 vaccines by the attached research papers."

RAG App Generated Review:
Click to view RAG-app-generate-review.txt
ChatGPT Generated Review:
Click to view RAG-ChatGPT-review.txt


The comparison shows that the review generated by our application focuses on the efficacy of RBD-based COVID-19 vaccines: it provides an in-depth systematic review of studies on this vaccine type, discussing efficacy rates and public health implications. The ChatGPT-generated review, on the other hand, has a broader scope: it compares multiple COVID-19 vaccines and discusses their general efficacy.

Moreover, the review generated by our application is more structured and detailed: it follows PRISMA guidelines and includes a detailed discussion of the studies and their results. The review generated by ChatGPT is less structured and focuses only on extracting information from the three input research papers.

Overall, the review generated by our application better captures the context and format required of a systematic review, whereas ChatGPT only provides a general overview built from extracted information.
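
This difference comes from the retrieval step, which grounds the prompt in the uploaded papers before the LLM writes anything. The sketch below illustrates the general retrieval-augmented generation pattern rather than our exact code; the embedding model name, top_k value, and prompt wording are assumptions.

    # Illustrative RAG flow: retrieve the most relevant chunks from the
    # vector database, then ask the LLM to write a section grounded in them.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def generate_grounded_section(index, query, top_k=5):
        # 1. Embed the query with the same model used to embed the PDFs.
        emb = client.embeddings.create(
            model="text-embedding-3-small", input=query
        ).data[0].embedding
        # 2. Retrieve the top-k most similar paper chunks (Pinecone-style query).
        hits = index.query(vector=emb, top_k=top_k, include_metadata=True)
        context = "\n\n".join(m["metadata"]["text"] for m in hits["matches"])
        # 3. The LLM answers only from the retrieved context, which keeps the
        #    review focused on the uploaded papers, not general knowledge.
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "Write a PRISMA-style systematic review section "
                            "using only the provided context."},
                {"role": "user",
                 "content": f"Context:\n{context}\n\nTask: {query}"},
            ],
        )
        return resp.choices[0].message.content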