Testing

Testing Strategy

UnitPylot is a VS Code extension that enhances Python testing through execution, coverage insights, a visual dashboard, and AI suggestions. Our testing strategy aimed to ensure that each of these core features functions correctly and delivers a seamless and intuitive user experience. This involved validating both the internal logic and the user interface, ensuring that data is accurately processed and clearly presented, and that the extension responds appropriately to a range of real-world usage scenarios and edge cases. We prioritised early testing for core logic and maintained a continuous feedback loop via GitHub Issues.

Testing Scope

We focused on validating:

  • Test Execution & Coverage: confirmed consistent results and accurate data parsing.
  • History Tracking: tested creation and filtering of test data.
  • UI Behaviour: since the extension is user-facing, we ensured that UI components such as the sidebar and in-editor views were visually clear and responded appropriately to user actions.
  • AI Suggestions: verified that generated suggestions were accurate, well-placed, and could be accepted or rejected without errors.

Testing Methodology

  • Unit Testing: written for key backend modules including
    • TestRunner and its submodules (coverage, file-hash, parser, helper-functions)
    • Snapshot and history modules (history-manager, history-processor)
    • Report generation logic (report-generator)
  • Manual Testing: used to validate UI behaviour and AI generated insights, particularly where VS Code API interactions made automation impractical.
  • User Acceptance Testing: development builds were shared with users to gather feedback on usability, performance, and real-world edge cases.

Unit Testing

Unit testing is a crucial part of UnitPylot’s development process. It ensures that individual components function correctly, improves code reliability, and helps catch bugs early.

Testing Frameworks & Tools

UnitPylot’s unit tests are written using:

  • Mocha: A JavaScript test framework for running asynchronous tests.
  • Chai: An assertion library for writing test expectations (expect, should, assert).
  • Sinon: A library for mocking, stubbing, and spying on function calls and dependencies.
  • VS Code Test API: Provides utilities to simulate user interactions within VS Code.

For example, the unit test suite is run with Mocha using npm test.
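
To illustrate how these tools fit together, the sketch below shows the general shape of a unit test in this setup: a Sinon sandbox stubs external dependencies, Chai provides the assertions, and Mocha’s suite/test structure drives execution. The parseFirstOutcome helper is purely illustrative and is not part of UnitPylot’s codebase.

import * as fs from 'fs';
import { expect } from 'chai';
import * as sinon from 'sinon';

// Illustrative helper (not part of UnitPylot): reads a pytest JSON report
// and returns the outcome of the first recorded test.
function parseFirstOutcome(reportPath: string): string {
    const report = JSON.parse(fs.readFileSync(reportPath, 'utf8'));
    return report.tests[0].outcome;
}

suite('parser helpers (illustrative)', () => {
    let sandbox: sinon.SinonSandbox;

    setup(() => { sandbox = sinon.createSandbox(); });
    teardown(() => { sandbox.restore(); });

    test('reads the outcome from a mocked pytest report', () => {
        // Stub the filesystem so the test does not depend on a real pytest run.
        sandbox.stub(fs, 'readFileSync').returns(JSON.stringify({ tests: [{ outcome: 'passed' }] }));

        expect(parseFirstOutcome('report.json')).to.equal('passed');
    });
});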

Testing the Parser

To verify that results are gathered correctly from pytest, we have unit tests that simulate example test runs. One such test is shown below, where the output files of pytest are mocked:

test('should get pytest result', async () => {
    const workspaceFolders = [{ uri: { fsPath: 'workspace' } }];

    // Mocked contents of the pytest JSON report file.
    const pytestOutput = {
        tests: [
            {
                nodeid: 'src/file1.test.ts::testFunction1',
                lineno: 10,
                outcome: 'passed',
                setup: { duration: 0.1, outcome: 'passed' },
                call: { duration: 0.2, outcome: 'passed' },
                teardown: { duration: 0.1, outcome: 'passed' }
            }
        ]
    };

    // Expected parsed result.
    const pytestResult: TestResult = {
        'src/file1.test.ts': {
            'testFunction1': {
                passed: true,
                time: 0.2,
                lineNo: '10',
                filePath: 'src/file1.test.ts',
                testName: 'testFunction1',
                errorMessage: undefined
            }
        }
    };

    // Stub the workspace, filesystem, and database so no real pytest run is required.
    sandbox.stub(vscode.workspace, 'workspaceFolders').value(workspaceFolders);
    sandbox.stub(fs, 'readFileSync').returns(JSON.stringify(pytestOutput));
    sandbox.stub(fs, 'unlinkSync').returns(undefined);
    sandbox.stub(sqlite3, 'Database').returns({
        all: (stmt: string, params: any[], callback: (err: Error | null, rows: any[]) => void) => {
            callback(null, []);
        },
        close: (callback: (err: Error | null) => void) => {
            callback(null);
        }
    } as any);

    const result = await getPytestResult();

    expect(result).to.deep.equal(pytestResult);
});


Testing File Hashing

To run only the minimum number of tests needed, we use several functions to compare which files have changed between test runs. Since this is critical functionality, we have tests to ensure its reliability. The example below checks that the differences for each file are identified correctly.

oldHash and newHash are mock Hash objects representing the hashes from the previous test run and the current one.

test('should get modified files', () => {
    const modifiedFiles: FilesDiff = getModifiedFiles(oldHash, newHash);

    expect(modifiedFiles.added).to.have.property('src/file1.ts');
    expect(modifiedFiles.added['src/file1.ts'].functions).to.have.property('function1');
    expect(modifiedFiles.added['src/file1.ts'].functions).to.have.property('function4');
    expect(modifiedFiles.added).to.have.property('src/file3.ts');
    expect(modifiedFiles.deleted).to.have.property('src/file2.ts');
});
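
The exact contents of these fixtures are project-specific; purely for illustration (and not the actual fixture data), a shape consistent with the assertions above might map each file to a content hash plus per-function hashes:

// Illustrative fixture shapes only; the real Hash objects come from the file-hash module.
const oldHash = {
    'src/file1.ts': { hash: 'a1', functions: { function1: 'f1-old' } },
    'src/file2.ts': { hash: 'b2', functions: { function2: 'f2-old' } }    // missing from newHash, so reported as deleted
};

const newHash = {
    'src/file1.ts': { hash: 'a9', functions: { function1: 'f1-new', function4: 'f4-new' } },    // changed and extended
    'src/file3.ts': { hash: 'c3', functions: { function3: 'f3-new' } }    // new file, reported as added
};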


Code Coverage

Our tests cover the parts of the codebase that contain the logic for managing tests, report generation, and history tracking. All public methods and functions in the files related to these features are tested. The rest of the codebase consists of UI and prompting code, which cannot easily be unit tested in isolation; these areas are instead covered by manual testing.

Manual Testing

Manual testing played a crucial role in validating UnitPylot’s UI and AI components, which could not be fully covered through automated tests due to their reliance on the VS Code API and dynamic user workflows. To support this, we created a repository of example codebases that mimics real-world Python projects with varying test suites. This allowed us to simulate realistic development environments and thoroughly test how UnitPylot performs across different application logic and levels of code coverage. In addition to mirroring typical usage scenarios, we intentionally crafted edge cases to stress-test the system, particularly the AI-powered features. Using this repository, we manually tested UI functionality such as:

  • Coverage highlights rendered correctly within the editor
  • Accurate, real-time updates in the visual dashboard (e.g., pass/fail charts and coverage pie graphs)
  • Command visibility and responsiveness in the command palette and context menus
  • Inline annotations (e.g., Accept/Reject suggestions) appearing correctly and updating without interfering with the file content

Additionally, we extensively tested the AI responses produced via the Fix Failing Tests, Fix Coverage, Optimise Slowest Tests, and Optimise Memory-intensive Tests commands. These tests focused on:

  • The accuracy and relevance of suggestions returned by the LLM
  • Correct file targeting and line placement for inline annotations
  • The smooth application or rejection of suggestions by the user

This manual process was central to the iterative refinement of our prompt (as described in our Implementation section). Based on observed LLM behaviour, we modified the phrasing and clarified context to improve consistency and reliability of AI responses.

User Acceptance Testing

To evaluate the real-world usability of UnitPylot, we conducted user acceptance testing with individuals who represent our target user group: software developers and technical stakeholders. This included our client, demo users, and peers in the software development community, all of whom regularly work with Python test suites and development environments. We designed targeted test scenarios covering core workflows and AI-powered features, and all testers were provided access to the shared set of example codebases we designed. Feedback was collected on a Likert scale along with optional qualitative comments to guide further refinement.



Testers

The following anonymised testers were involved in the process:

  • Dev A – 22, Computer Science student working on Python backend services
  • Dev B – 26, Junior software engineer on a testing-focused team
  • Dev C – 32, Developer Experience lead and stakeholder from the client team
  • Dev D – 19, Peer tester and UCL CS student experienced with pytest
  • Dev E – 45, Python developer and demo session participant

These users were selected for their relevance to the extension’s intended audience. All testers used VS Code as their primary editor and were familiar with Python testing workflows, allowing us to gather meaningful, context-aware feedback.



Test Cases

  • TC1: Run tests and view results using the Run Tests command and dashboard
  • TC2: Interact with test coverage highlights in the editor
  • TC3: Use the Fix Failing Tests AI command on a deliberately broken test
  • TC4: Use the Improve Test Coverage command to identify and fix missing test areas
  • TC5: Use the Optimise Test Speed & Memory commands for deliberately inefficient tests
  • TC6: Use the Generate Test Insights command to get general recommendations
  • TC7: Accept and reject AI-generated suggestions in-line
  • TC8: Navigate and interpret the UnitPylot TreeView breakdowns
  • TC9: Export test history and snapshots using the Generate Report command



Feedback

Responses were recorded against each acceptance requirement on a five-point Likert scale (SD = Strongly Disagree, D = Disagree, N = Neutral, A = Agree, SA = Strongly Agree), alongside optional comments (+ positive, - negative).

  • The extension is easy to set up and start using – SD 0, D 0, N 0, A 2, SA 3
    + “Setup was intuitive and the user manual is very comprehensive”
  • Test results are displayed clearly and understandably – SD 0, D 0, N 0, A 1, SA 4
    + “Pass/Fail summary is clean”
    + “Minimalistic and clutter free design”
  • The AI suggestions were relevant and helpful – SD 0, D 0, N 1, A 2, SA 2
    + “Fix Failing Tests worked really well”
    - “Responses for optimise memory intensive tests were a bit generic”
  • Inline annotations were easy to read and interact with – SD 0, D 0, N 0, A 1, SA 4
    + “Hover tooltips were helpful”
    + “Loved the clear suggestion and code snippet fields.”
  • Accept/Reject functionality worked as expected – SD 0, D 0, N 1, A 1, SA 3
    + “Very smooth, changes were applied instantly.”
    - “Could edit existing code.”
  • The dashboard provided useful insights – SD 0, D 0, N 0, A 1, SA 4
    + “The granular Tree View breakdown is very comprehensive.”
  • The graphs were visually pleasing and insightful – SD 0, D 0, N 0, A 0, SA 5
    + “Loved the interactive features.”
    + “Clear and easy to interpret.”
  • Exporting reports is simple and exported reports are clear – SD 0, D 0, N 1, A 2, SA 2
    + “Appreciated the performance summary”
    - “Could include a legend for icons and flag performance trends”


We also received feedback on our tool from industry experts during the UCL Computer Science AI for Good Student Showcase 2025.



Conclusion

Feedback from our user acceptance testing was overwhelmingly positive, confirming that UnitPylot is intuitive, functional, and well-aligned with the needs of Python developers working within VS Code. Testers reported smooth setup, clear visual feedback through the dashboard and coverage highlights, and valuable insights provided by the AI-powered commands. Importantly, the feedback provided actionable guidance:

  • The request to “include a legend for icons and flag performance trends” in the Markdown report has already been implemented, improving readability and clarity for exported test data.
  • The suggestion to “edit existing code” pointed to the need to insert changes by replacing incorrect code in place rather than only appending fixes. This has been noted as a possible future improvement, which would enable cleaner, in-place code corrections when applying AI-generated suggestions.

Overall, the user testing validated both the stability and usability of the extension, while also highlighting specific areas for enhancement that we’ve already begun addressing.