Evaluation

Final MoSCoW Achievements

We are pleased to report that we completed all of our functional and non-functional requirements.

Functional Requirements


ID Requirement Priority State Contributors
1 Compute and display the overall test pass/fail rates. Must Have Completed Aaditya, Gughan
2 Display overall line and branch coverage. Must Have Completed Aaditya, Gughan
3 Identify and list tests with the highest memory usage. Must Have Completed Aaditya, Gughan
4 Identify and list the slowest tests. Must Have Completed Aaditya, Gughan
5 Resolve issues with test case code coverage. Must Have Completed Asmita, Swasti
6 Fix failing test cases. Must Have Completed Asmita, Swasti
7 Optimise the slowest tests. Must Have Completed Asmita, Swasti
8 Optimise tests with the highest memory usage. Must Have Completed Asmita, Swasti
9 Display specific metrics per test case. Should Have Completed Aaditya
10 Visualise trends over time through graphs. Should Have Completed Swasti
11 Provide insights into test interconnectedness and robustness. Should Have Completed Asmita
12 Continuously execute tests in the background. Should Have Completed Gughan
13 Enable users to accept or reject suggested code snippets. Should Have Completed Asmita, Swasti
14 Store logs about the metrics. Should Have Completed Gughan
15 Suggest PyDoc documentation. Could Have Completed Asmita
16 Educate developers on best practices for testing. Could Have Completed Swasti
17 Include a settings page for customisation. Could Have Completed Gughan
18 Allow users to specify certain tests for execution. Could Have Completed Gughan


Non-Functional Requirements

ID Requirement Priority State Contributors
19 User-Friendly Interface: The extension must be intuitive to navigate and easy to use. Must Have Completed Aaditya
20 GitHub Copilot Integration: The extension must extend GitHub Copilot for AI-powered enhancements. Must Have Completed All
21 Performance: The extension should be highly responsive and efficient, ensuring minimal lag. Must Have Completed All
22 Availability: The extension must be publicly available on the VS Code Marketplace. Must Have Completed All
23 Compatibility: The extension must be compatible with Windows, macOS, and Linux to support all users. Must Have Completed All
24 Usability: The extension should include clear documentation and a user manual for guidance. Should Have Completed All
25 Extensibility: The architecture should support the addition of new features with minimal disruption. Should Have Completed All
26 Maintainability: The codebase should be well-documented, modular, and easy to update for long-term support. Should Have Completed All
27 Security: The extension could include data privacy measures, such as secure API calls. Could Have Completed All
28 Third-Party LLM integration: The extension could function with a custom, third-party LLM as specified by the user. Could Have Completed All
29 Custom IDE Support: The extension is exclusive to VS Code and will not be developed for other IDEs (e.g., PyCharm, IntelliJ). Won’t Have N/A N/A

Bug List

To track our bugs during the development of our extension, we utilised GitHub Issues. This allowed us to keep all discussions within the context of our code repository, and made it easier to collaborate on bug fixes as a team.

Below is a list of the bugs we encountered during the development of our extension:

Bug List


We were able to issue fixes to all of the known bugs at the point of our project's completion. This was achieved through a combination of pair programming, code reviews, and extensive testing.

Individual Contribution

Throughout the development of our extension, each team member contributed to different aspects of the project. Below is a summary of the individual contributions made by each team member:



Work Package Aaditya Kumar Asmita Anand Gughan Ramakrishnan Swasti Jain
Project Partners Liaison 0% 90% 0% 10%
Requirement Analysis 25% 25% 25% 25%
Research and Experiments 15% 30% 15% 40%
UI Design 80% 0% 20% 0%
Coding 15% 20% 40% 25%
Testing 0% 0% 90% 10%
Report Website: Development 0% 50% 0% 50%
Report Website: Content 25% 25% 25% 25%
Presentation Planning 35% 10% 30% 25%
Videos & Scripts 60% 0% 20% 20%
Development Blog 20% 25% 10% 45%
Overall Contribution 25% 25% 25% 25%
Main Roles UI Designer, Video Editor, Developer Client Liaison, Researcher, Developer Tester, Researcher, Developer Researcher, Report Editor, Developer

Critical Evaluation

We evaluated our extension against the following criteria: user interface, functionality, stability, efficiency, compatibility, maintainability, and project management.


User Interface

We designed UnitPylot’s user interface with the goal of making testing as intuitive and seamless as possible within the VS Code environment. From the beginning, we aimed to reduce friction by integrating key visual elements (the dashboard, test status indicators, and inline coverage highlights) directly into the editor. Our intention was to ensure that developers wouldn’t have to context-switch or leave their workspace to understand their test suite’s health. The interface prioritises clarity: we chose minimalist graphs, concise tooltips, and clean layouts to deliver relevant data at a glance.

The UI went through multiple iterative improvements driven by user testing. For instance, early users noted that the dashboard lacked explanatory legends and that certain coverage indicators weren’t fully intuitive. In response, we incorporated hover tooltips, added labels to graphs, and made the inline suggestions more contextually descriptive. This iterative refinement process demonstrated a strong feedback loop, where the team worked collaboratively to ensure every visual element served a functional purpose. We also improved interaction design by refining command placement in the context menu and palette to reduce friction.

Future improvements could include adaptive UI elements based on user preferences or test suite complexity to further personalise the experience.


Functionality

We built UnitPylot with a clear focus on solving real pain points in Python testing. Features like test execution, code coverage visualisation, AI suggestions, and snapshot tracking were all grounded in feedback from developers we spoke to during our research phase. Rather than being a passive test viewer, UnitPylot actively helps developers understand and improve their testing through both automation and insight.

Each major feature went through rounds of refinement. For instance, with the Fix Failing Tests command, we slowly expanded the logic to handle real failure output. Our Improve Coverage and Optimise Test Speed & Memory commands were developed using hand-crafted examples and edge cases, which we used to evaluate the accuracy of the AI’s suggestions. Throughout, we tried to balance intelligent automation with developer control: suggestions are non-destructive, clearly explained, and easy to accept or reject.
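To give a sense of how such a command is wired up, the minimal sketch below (not the extension's actual implementation) shows one way to collect pytest's failure summary so it can be fed into a prompt; the exact flags, parsing, and prompt construction in UnitPylot may differ.

```typescript
import { execFile } from "child_process";

// Illustrative helper: run pytest quietly and collect the IDs of failing tests.
// The command registration and LLM prompt construction are omitted here.
export function collectFailingTests(cwd: string): Promise<string[]> {
  return new Promise((resolve) => {
    // "-rf" forces the short summary of failures; "--tb=no" keeps output compact.
    // pytest exits non-zero when tests fail, so we read stdout regardless of the error.
    execFile("pytest", ["-q", "-rf", "--tb=no"], { cwd }, (_err, stdout) => {
      const failures = stdout
        .split("\n")
        .filter((line) => line.startsWith("FAILED "))
        // Summary lines look like "FAILED tests/test_x.py::test_y - AssertionError: ...".
        .map((line) => line.slice("FAILED ".length).split(" - ")[0]);
      resolve(failures);
    });
  });
}
```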

However, we also encountered functional limitations, especially when dealing with less conventional Python project structures. Some files weren’t detected properly if they didn’t follow common naming conventions, and AI suggestions occasionally struggled with complex code contexts. While these issues didn’t block core functionality, they highlighted the importance of refining our parsing logic and improving the prompts we send to the LLM.

Going forward, we plan to improve our static analysis and introduce more advanced filters to make the AI’s output smarter and more context-aware.


Stability

Ensuring stability was a major focus throughout the project. Because testing tools are only useful when reliable, we aimed to build a system that handled errors gracefully and provided meaningful feedback when something went wrong. We wrote unit tests for core modules like the TestRunner, history tracker, and report generator. These gave us confidence that the underlying logic was robust even as we introduced new features.
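As an illustration of the kind of unit test we mean, the sketch below exercises a small summary-parsing helper in the Mocha "tdd" style used by the standard VS Code test runner; the helper itself is a stand-in, not the TestRunner's real API.

```typescript
import * as assert from "assert";

// Stand-in for the kind of pure helper the TestRunner exposes; the real
// module and its exact API may differ.
function parsePytestSummary(line: string): { passed: number; failed: number } {
  const passed = /(\d+) passed/.exec(line);
  const failed = /(\d+) failed/.exec(line);
  return {
    passed: passed ? Number(passed[1]) : 0,
    failed: failed ? Number(failed[1]) : 0,
  };
}

suite("TestRunner output parsing", () => {
  test("counts passed and failed tests from a summary line", () => {
    assert.deepStrictEqual(parsePytestSummary("3 failed, 12 passed in 0.42s"), {
      passed: 12,
      failed: 3,
    });
  });

  test("treats a missing 'failed' entry as zero failures", () => {
    assert.deepStrictEqual(parsePytestSummary("12 passed in 0.10s"), {
      passed: 12,
      failed: 0,
    });
  });
});
```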

We manually tested the extension across various environments and scenarios—intentionally breaking tests and running the extension on both macOS and Windows to confirm that it could handle errors without crashing.

Future improvements could include adding environment checks or fallbacks and better logging to help users self-diagnose issues.

Overall, the extension is stable in expected conditions, but we want to make it more resilient to variability in user setups.


Efficiency

Efficiency was a driving factor behind many of our design choices. Developers often hesitate to write or run tests because of time constraints, so we wanted UnitPylot to remove friction from that process. The most significant efficiency gain came from implementing function-level hashing. Instead of running the entire test suite every time, we can now identify which functions changed and rerun only the relevant tests. In our testing, this drastically reduced test execution times, especially in medium-to-large projects.
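The sketch below outlines the idea behind function-level hashing, assuming a deliberately naive split on top-level `def` statements (the real implementation presumably uses proper parsing): hash each function body, compare against the previous snapshot, and rerun only the tests that touch the changed functions.

```typescript
import { createHash } from "crypto";

type HashMap = Record<string, string>; // function name -> content hash

// Hash each top-level function in a Python source file. The regex split is a
// simplification for illustration; it ignores nested and indented definitions.
export function hashFunctions(source: string): HashMap {
  const hashes: HashMap = {};
  for (const chunk of source.split(/^(?=def\s)/m)) {
    const match = /^def\s+(\w+)/.exec(chunk);
    if (!match) continue;
    hashes[match[1]] = createHash("sha256").update(chunk).digest("hex");
  }
  return hashes;
}

// Names of functions whose bodies changed since the last snapshot; only the
// tests covering these need to be rerun.
export function changedFunctions(previous: HashMap, current: HashMap): string[] {
  return Object.keys(current).filter((name) => previous[name] !== current[name]);
}
```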

We also focused on UI responsiveness. Since much of our logic (like hashing and parsing coverage) runs in the background, we used VS Code’s asynchronous APIs to avoid blocking the editor. Most operations complete quickly, and even the more expensive tasks like snapshot creation or graph rendering happen without interrupting the coding flow. We also allow users to configure how often the extension updates test data, giving them more control over performance vs. freshness.
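As a rough sketch of how the background work stays off the UI thread, the example below polls on a user-configurable interval and wraps each refresh in `withProgress`; the setting name `unitpylot.refreshIntervalSeconds` is an assumption for illustration.

```typescript
import * as vscode from "vscode";

// Schedule periodic, non-blocking refreshes of test data. The refresh callback
// is awaited inside withProgress, so the editor stays responsive throughout.
export function scheduleBackgroundRefresh(refresh: () => Promise<void>): NodeJS.Timeout {
  const seconds = vscode.workspace
    .getConfiguration("unitpylot")
    .get<number>("refreshIntervalSeconds", 60); // illustrative setting key

  return setInterval(() => {
    void vscode.window.withProgress(
      { location: vscode.ProgressLocation.Window, title: "UnitPylot: refreshing test data" },
      () => refresh()
    );
  }, seconds * 1000);
}
```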


Compatibility

From the outset, we wanted UnitPylot to be compatible with any developer using pytest, regardless of operating system or environment. We tested the extension on Windows, macOS, and Linux, and across several versions of Python to ensure core features like test execution, coverage parsing, and dashboard rendering worked reliably everywhere.

UnitPylot is tailored specifically for brownfield Python projects, acknowledging the real-world complexity of legacy codebases. It is designed to integrate smoothly into existing workflows without requiring project restructuring. By focusing on brownfield compatibility and a leading test framework, the extension offers dependable integration in common development environments.

To give users more flexibility, we also allowed integration with custom LLM endpoints. This means users can configure a locally hosted or third-party model in place of the default Language Model API. While this feature was useful in principle, it introduced new challenges. For instance, not all LLMs return output in the strict JSON format our extension expects, which occasionally caused parsing errors or failed suggestions.
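One defensive way to handle that mismatch is sketched below: strip any markdown fences the model adds, parse the JSON, and validate the fields before using the suggestion. The `Suggestion` shape is illustrative rather than the extension's exact schema.

```typescript
// Expected (illustrative) shape of a model response.
interface Suggestion {
  file: string;
  explanation: string;
  code: string;
}

// Return a validated suggestion, or null if the response cannot be used.
export function parseSuggestion(raw: string): Suggestion | null {
  const cleaned = raw.replace(/`{3}(?:json)?/g, "").trim(); // drop markdown code fences
  try {
    const parsed = JSON.parse(cleaned);
    if (
      typeof parsed.file === "string" &&
      typeof parsed.explanation === "string" &&
      typeof parsed.code === "string"
    ) {
      return parsed as Suggestion;
    }
  } catch {
    // Malformed JSON is treated the same as a bad schema.
  }
  return null; // Caller can surface a "could not parse suggestion" message.
}
```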

Future plans include broadening compatibility with additional test frameworks, to further meet the needs of diverse teams and organisations.


Maintainability

We kept our code modular by splitting responsibilities into separate files: test logic, file hashing, coverage parsing, history management, and AI request handling were all clearly separated. This made it easier for us to work in parallel and avoid stepping on each other’s changes.

We followed consistent naming conventions and used TypeScript and Python typing wherever possible to catch bugs before runtime. Our GitHub workflow included linting, and every major feature was reviewed before merging. Unit tests for the TestRunner and history classes gave us confidence that small changes wouldn’t break critical functionality.

Going forward, maintaining detailed changelogs, developer onboarding guides, and API docs will be essential to scaling the project and onboarding contributors.


Project Management

Managing the project as a student team required discipline and adaptability. We met regularly to plan sprints, divide tasks, and reassess priorities based on deadlines and feedback. We adopted Agile with a Kanban framework to prioritise tasks effectively, utilising GitHub's built-in "Projects" functionality. Our GitHub project board tracked tasks across categories like core features, UI updates, bug fixes, and testing. This helped us visualise progress and stay aligned, even when team members had different schedules or workloads.

Client check-ins helped keep us focused. Feedback from Microsoft and peer developers guided many decisions. We also used GitHub Issues to track bugs and incorporate user feedback from demo sessions, which allowed us to iterate quickly on what mattered most.

That said, we did encounter challenges with scope creep, particularly around AI integration. Some features took longer than expected, which compressed time for testing and polish. In hindsight, we would benefit from setting clearer boundaries on experimental features and reserving buffer time at the end of each sprint.

Going forward, we want to adopt a more structured sprint retrospective process to reflect on what worked and apply those lessons proactively.

Future Work

While we have successfully met our initial requirements, there are still other ways that UnitPylot could be extended to improve the developer experience. A few examples of such enhancements include:

Smarter Inline Fixes

Currently, the accepted AI-generated code suggestions are appended to the end of the relevant file, requiring developers to manually review and integrate them into the appropriate sections. A more sophisticated approach would involve implementing precise, line-by-line replacements that seamlessly modify the existing codebase while preserving formatting and readability. This enhancement could involve syntax-aware placement mechanisms, conflict resolution strategies, and context-aware refactoring to ensure smoother integration into existing workflows with minimal manual intervention.
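A hedged sketch of what line-level replacement could look like with VS Code's workspace edit API is shown below; the caller is assumed to have already located the range of the code being fixed, which is the harder part this section describes.

```typescript
import * as vscode from "vscode";

// Replace an exact range with the accepted suggestion instead of appending to
// the end of the file. Using WorkspaceEdit keeps the change undoable and
// immediately visible in the editor.
export async function applyInlineFix(
  uri: vscode.Uri,
  range: vscode.Range,
  newCode: string
): Promise<boolean> {
  const edit = new vscode.WorkspaceEdit();
  edit.replace(uri, range, newCode);
  return vscode.workspace.applyEdit(edit);
}
```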



Contributor Insights

At present, UnitPylot primarily focuses on analysing test quality metrics. However, it could be extended to provide deeper insights into developer activity when using the extension. By tracking and analysing data points, such as the frequency and nature of code modifications, the number of AI-assisted corrections accepted/rejected, and the most commonly used commands, developers and team leads could gain valuable visibility into development patterns. These insights could be added to the generated report, helping teams assess productivity, identify areas for improvement, and refine their testing and development strategies based on real-world usage data.
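For illustration, the data points mentioned above could be captured with a record along these lines; the field and command names are assumptions, not an implemented schema.

```typescript
// Illustrative shape for per-user or per-team usage data.
interface ContributorInsights {
  suggestionsAccepted: number;
  suggestionsRejected: number;
  commandCounts: Record<string, number>; // e.g. "unitpylot.fixFailingTests" -> 12
}

// Counters like these could be persisted in the extension's storage and merged
// into the generated report.
export function recordCommand(insights: ContributorInsights, command: string): void {
  insights.commandCounts[command] = (insights.commandCounts[command] ?? 0) + 1;
}
```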



Enhanced Tree View

Navigating and interpreting test quality metrics efficiently is crucial for large-scale projects. While the current implementation provides a structured view, further enhancements could include advanced sorting, filtering, and categorisation capabilities based on key quality indicators such as the different types of metrics. By allowing users to customise their view based on specific criteria, UnitPylot could streamline test result analysis, making it easier to identify areas that require attention and prioritise fixes accordingly.
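A minimal sketch of metric-based sorting and filtering is shown below; `TestMetrics` and its fields are hypothetical stand-ins for the data the tree items already carry.

```typescript
interface TestMetrics {
  name: string;
  durationMs: number;
  peakMemoryMb: number;
  passed: boolean;
}

type SortKey = "durationMs" | "peakMemoryMb";

// Order tests by the chosen metric (slowest or heaviest first), optionally
// restricting the view to failing tests only.
export function sortAndFilter(
  tests: TestMetrics[],
  sortBy: SortKey,
  failingOnly: boolean
): TestMetrics[] {
  return tests
    .filter((t) => !failingOnly || !t.passed)
    .sort((a, b) => b[sortBy] - a[sortBy]);
}
```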



Integration with CI/CD Pipelines

UnitPylot currently focuses on local testing and analysis within the VS Code environment. However, extending its capabilities to integrate with continuous integration/continuous deployment (CI/CD) pipelines would be a valuable addition. By enabling developers to seamlessly incorporate UnitPylot’s testing insights into their automated build and deployment processes, teams could ensure consistent code quality and test coverage across all stages of development. This integration could involve generating custom reports, triggering specific actions based on test results, and providing detailed feedback within the CI/CD workflow to facilitate rapid iteration and deployment cycles.
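One plausible shape for such an integration is a small quality-gate script that a pipeline runs after the test step; the report filename, fields, and threshold below are assumptions for illustration only.

```typescript
import { readFileSync } from "fs";

// Hypothetical report exported by the extension for CI consumption.
interface Report {
  lineCoverage: number; // percentage, 0-100
  failedTests: number;
}

const report: Report = JSON.parse(readFileSync("unitpylot-report.json", "utf8"));
const MIN_COVERAGE = 80;

if (report.failedTests > 0 || report.lineCoverage < MIN_COVERAGE) {
  console.error(
    `Quality gate failed: ${report.failedTests} failing tests, ` +
      `${report.lineCoverage}% line coverage (minimum ${MIN_COVERAGE}%).`
  );
  process.exit(1); // fail the build
}
console.log("Quality gate passed.");
```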



Framework Agnosticism

UnitPylot is currently designed to work with pytest, for the reasons mentioned in our Research Section. However, expanding its compatibility to support additional frameworks—such as unittest, nose2, or testing frameworks from other ecosystems—would broaden its applicability. By introducing an abstraction layer that allows UnitPylot to adapt its functionality to different testing paradigms, developers working across diverse environments could benefit from its features without being restricted to a single framework.
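The abstraction layer could be as small as an adapter interface that each framework integration implements; the names below are illustrative rather than a committed design.

```typescript
// Common contract the rest of the extension would depend on, instead of pytest directly.
export interface TestFrameworkAdapter {
  /** e.g. "pytest", "unittest", "nose2" */
  readonly id: string;
  /** Build the command used to execute a subset of tests (or all tests if empty). */
  buildRunCommand(testIds: string[]): { command: string; args: string[] };
  /** Convert raw runner output into the common result shape the UI consumes. */
  parseResults(stdout: string): { id: string; passed: boolean; durationMs?: number }[];
}
```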



By incorporating these enhancements, UnitPylot could evolve into a more intelligent, adaptable, and developer-friendly tool, fostering improved testing efficiency and code quality across a wider range of use cases.

Client Feedback

We received feedback from our client, Microsoft, on our extension. The feedback was positive, and the full statement is included below:


Client Feedback Statement:

"The team has delivered an impressive solution within the academic project timeline, showcasing a well-developed extension for enhancing unit testing in brownfield code bases. The visualisation, particularly the tree view with icons for quick issue identification, is very neat and user-friendly. The autocorrect feature is also highly appreciated. The solution not only automates the testing process but also educates developers on why certain tests might be failing, targeting both experienced and beginner developers. There is significant potential for the project to be expanded and maintained as an open-source solution, with opportunities for community engagement and further development. Additionally, the project will be showcased as part of Microsoft Educator Developer Blog and ideally we would like to partner on a London Python community event and an ACM paper, highlighting its innovative approach to unit testing."