EVALUATION

Achievement Table

ID | Requirement | Priority | Status | Contributors

Key Functionalities (Must + Should Have)
1 | User can input PDFs | Must | Completed | All
2 | User can enter a prompt to define the systematic review | Must | Completed | All
3 | Systematic review is displayed to the user | Must | Completed | Wing Ho
4 | Vector database integration | Must | Completed | All
5 | LLM-powered section generation | Must | Completed | Wing Ho, Yiwei
6 | User authentication system | Should | Completed | Wing Ho, Kevin
7 | Review history tracking for users | Should | Completed | Kevin, Wing Ho
8 | Quality assessment visualisation of the review through graphs | Should | Completed | Kevin, Jennifer

Optional Functionalities (Could Have)
9 | Users can view the source files of each review | Could | Completed | Wing Ho
10 | PDFs of the systematic review can be exported | Could | Completed | Wing Ho
11 | Medical paper database integration | Could | Not Completed | n/a

Known Bugs

ID | Bug Description | Priority
1 | Entering a review ID directly into the URL triggers the generation process without any input files or prompt | Medium
2 | Occasional markdown rendering issues in the HTML of the generated review | Low
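
A straightforward fix for the first bug would be to validate the request before generation starts. The sketch below is a minimal illustration, assuming a Flask-style backend; the route, field names, and error message are hypothetical placeholders rather than our actual implementation.

    # Hypothetical guard against starting generation with no inputs (bug 1).
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.route("/review/<review_id>/generate", methods=["POST"])
    def generate_review(review_id):
        files = request.files.getlist("pdfs")           # uploaded PDFs, if any
        prompt = request.form.get("prompt", "").strip()
        # Reject the request instead of silently starting an empty run.
        if not files or not prompt:
            return jsonify(error="Both PDFs and a prompt are required."), 400
        # ... hand off to the normal generation pipeline ...
        return jsonify(status="started")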

Individual Contribution - System Artefact

Work Packages (%) | Wing Ho Yeung | Jennifer Zhang | Yiwei Wang | Kevin Jin
Research and Experiments | 20 | 40 | 20 | 20
UI Design | 80 | 0 | 0 | 20
Coding | 25 | 25 | 25 | 25
Testing | 0 | 20 | 40 | 40
Overall Contribution (%) | 31.25 | 21.25 | 21.25 | 26.25

Individual Contribution - Website Report

Work Packages (%) | Wing Ho Yeung | Jennifer Zhang | Yiwei Wang | Kevin Jin
Website Template and Setup | 0 | 0 | 100 | 0
Home | 10 | 55 | 35 | 0
Video | 0 | 0 | 0 | 100
Requirement | 0 | 80 | 0 | 20
Research | 0 | 70 | 0 | 30
Algorithm | 0 | 50 | 20 | 30
UI Design | 30 | 40 | 0 | 30
System Design | 60 | 20 | 0 | 20
Implementation | 60 | 20 | 0 | 20
Testing | 0 | 0 | 100 | 0
Evaluation and Future Work | 0 | 80 | 20 | 0
User and Deployment Manuals | 70 | 0 | 0 | 30
Legal Issues | 0 | 30 | 0 | 70
Blog and Monthly Video | 100 | 0 | 0 | 0
Overall Contribution (%) | 23.6 | 31.8 | 19.6 | 25

Critical Evaluation

User Interface and Experience: We aimed to create a simple, clean, and effective user interface that is easy to navigate even for users who are not familiar with computers or AI. Key features include history tracking of generated systematic reviews, clickable input PDFs, and download options for reviews. These enhance the user experience, and general feedback was positive.

Functionality: All must-have and should-have features, and all but one of the optional features, were successfully implemented. No major issues were reported regarding core functionality.

Stability: The application performs reliably, although improvements could be made to enhance long-term maintainability of the codebase.

Efficiency: The application's efficiency is inadequate due to the time required to upsert vectors into the database and to generate text with the LLM. This criterion will be prioritised in future work.
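
One likely optimisation is batching the vector upserts so that embeddings are written in a few large requests instead of one network round trip per chunk. The sketch below is illustrative only, assuming a Pinecone-style client; the function name and batch size are our own placeholders, not the current codebase.

    # Sketch: batch vector upserts to reduce per-request network overhead.
    # Assumes a Pinecone-style index object exposing upsert(vectors=...).
    from itertools import islice

    def batched_upsert(index, vectors, batch_size=100):
        """Upsert (id, embedding, metadata) tuples in fixed-size batches."""
        it = iter(vectors)
        while batch := list(islice(it, batch_size)):
            index.upsert(vectors=batch)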

Compatibility: The system currently runs only locally. We plan to deploy it to the cloud in the future to improve accessibility and portability.

Maintainability: Project files are well organised in a modular structure, which helped us identify and fix bugs more efficiently.

Project Management: We maintained good communication via WhatsApp and GitHub. GitHub was used extensively for issue tracking, pull requests, and branching workflows. Regular weekly meetings with and without the client kept progress on track. These practices improved teamwork and version control.

While overall collaboration was smooth, we believe that starting to code earlier would have allowed us to implement more features and run more experiments.

Quality Evaluation

We often faced the question: “Why use RAG-n-Bones instead of ChatGPT directly?” To address this, we conducted a user survey comparing outputs of both systems using the same prompt. Ratings ranged from 1 (ChatGPT better) to 10 (RAG-n-Bones better):

Evaluation Criteria | Average Score | Interpretation
Relevance and Accuracy | 7 | Better than expected: despite using the same model, users found our results more relevant.
Response Speed | 1 | Significantly slower than ChatGPT due to embedding and document-processing overhead.
Response Structure | 10 | RAG-n-Bones generates better-structured reviews, aiding comprehension.
Reduction of Hallucinations | 5 | No significant difference: both use GPT-3.5 and are similarly prone to hallucination.
Overall Preference for Systematic Review Generation | 7 | Despite the latency, users preferred our system for its accuracy and structure.

Comparison Between Outputs

Here is a comparison between the output of our application and that of ChatGPT for the same query: "Write me a systematic review for efficacy of COVID-19 vaccines by the attached research papers."

RAG App Generated Review:
Click to view RAG-app-generate-review.txt
ChatGPT Generated Review:
Click to view RAG-ChatGPT-review.txt


The comparison shows that the review generated by our application focuses on the efficacy of RBD-based COVID-19 vaccines: it provides an in-depth systematic review of studies on this vaccine type, discussing efficacy rates and public health implications. The ChatGPT-generated review, on the other hand, has a broader scope: it compares multiple COVID-19 vaccines and discusses their general efficacy.

Moreover, the review generated by our application is more structured and detailed: it follows PRISMA guidelines and includes a detailed discussion of the studies and their results. The review generated by ChatGPT is less structured and focuses only on extracting information from the three input research papers.

Overall, the review generated by our application better captures the context and format required of a systematic review, whereas ChatGPT only provides a general overview built from extracted information.
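
This difference comes from the retrieval step, which grounds the prompt in the uploaded papers before the LLM writes anything. The sketch below illustrates the general retrieval-augmented generation pattern rather than our exact code; the embedding model name, top_k value, and prompt wording are assumptions.

    # Illustrative RAG flow: retrieve the most relevant chunks from the
    # vector database, then ask the LLM to write a section grounded in them.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def generate_grounded_section(index, query, top_k=5):
        # 1. Embed the query with the same model used to embed the PDFs.
        emb = client.embeddings.create(
            model="text-embedding-3-small", input=query
        ).data[0].embedding
        # 2. Retrieve the top-k most similar paper chunks (Pinecone-style query).
        hits = index.query(vector=emb, top_k=top_k, include_metadata=True)
        context = "\n\n".join(m["metadata"]["text"] for m in hits["matches"])
        # 3. The LLM answers only from the retrieved context, which keeps the
        #    review focused on the uploaded papers, not general knowledge.
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "Write a PRISMA-style systematic review section "
                            "using only the provided context."},
                {"role": "user",
                 "content": f"Context:\n{context}\n\nTask: {query}"},
            ],
        )
        return resp.choices[0].message.content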