Technical Research
Brownfield Python Projects
Python has become one of the most widely used programming languages globally. As of 2024, it
surpassed JavaScript as the most popular language on GitHub, reflecting its extensive adoption across
domains including data science, web development, and machine learning [4]. This widespread usage
underscores Python’s versatility, as a vast number of developers rely on it for their projects.
Furthermore, in the software industry, projects involving the enhancement or integration of existing
systems, i.e. brownfield projects, are more common than ones which are built from scratch (greenfield)
[5]. Many organisations want to improve or expand their current systems to preserve prior investments.
This approach often presents challenges, such as dealing with legacy code and ensuring compatibility
with existing infrastructure [5]. Addressing these challenges is crucial, as the majority of
development work in the industry involves brownfield projects.
Therefore, we decided that by building a tool for Python and brownfield codebases, we could
provide solutions that resonate with a substantial segment of the development community.
Extension Development and Language Choice
To fulfil our brief of developing a VS Code extension, we researched and utilised the VS Code
Extension API [6], which offers a set of tools and interfaces that enable us to customise the
functionality of the VS Code editor and register commands.
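As a brief illustration, the sketch below shows how an extension command might be registered through this API; the command identifier and message are hypothetical examples rather than UnitPylot's actual commands.

```typescript
import * as vscode from 'vscode';

// Minimal sketch of command registration via the VS Code Extension API.
// The command id "unitpylot.runTests" is a hypothetical example.
export function activate(context: vscode.ExtensionContext) {
    const disposable = vscode.commands.registerCommand('unitpylot.runTests', () => {
        vscode.window.showInformationMessage('UnitPylot: running the test suite...');
    });
    context.subscriptions.push(disposable);
}

export function deactivate() {}
```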
VS Code Extensions are primarily developed using either TypeScript or JavaScript. After researching
the best-suited language for extension development, we determined that TypeScript was the optimal
choice. The official VS Code documentation recommends TypeScript for extension development [7], due to
its seamless integration and compatibility with VS Code’s internal APIs. Additionally, TypeScript
offers type safety, better tooling and improved maintainability compared to JavaScript.
Since our tool is designed for Python codebases, we developed the automatic testing framework in
Python using pytest. Given that all team members have extensive prior experience with Python, this
choice allowed us to leverage our expertise for efficient development.
Choosing the Testing Framework
When choosing the testing framework, we considered options such as unittest, nose2 and pytest. The table below
summarises the key comparisons between each of the frameworks:
| Feature | unittest | nose2 | pytest |
| --- | --- | --- | --- |
| Pass/Fail Reporting | ✔️ | ✔️ | ✔️ |
| Assertions & Detailed Tracebacks | ✔️ | ✔️ | ✔️ (Best Formatting) |
| Test Execution Time | ✖️ (Manual) | ✔️ (Plugin) | ✔️ (Built-in) |
| Code Coverage | ✖️ (Needs coverage.py) | ✖️ (Needs coverage.py) | ✔️ (pytest-cov) |
| Test Discovery | ✔️ | ✔️ | ✔️ (Most Flexible) |
| Plugin Support | ✖️ | ✔️ | ✔️ (Best) |
As illustrated above, when compared to the other frameworks, pytest offers richer built-in capabilities and
extensibility via plugins. It was for this reason that pytest stood out to us as the most comprehensive framework for
collecting key metrics about test cases, such as test duration, memory usage, coverage, and pass/fail status.
In particular, pytest offers several advantages that align with our project goals:
- It provides detailed pass/fail reporting, including the exact line of failure and a clear display of which
tests have passed or failed.
- For test duration, pytest offers built-in features that easily measure how long each test takes to execute, with
additional plugins available.
- For code coverage, the pytest-cov plugin integrates seamlessly with pytest, offering detailed coverage reports that
show how much of the code is tested by the test suite.
- To enhance the analysis further, the pytest-json-report plugin generates a JSON file with the test results, which
can be parsed for integration with other tools or for further analysis. This enables a deeper understanding of test
effectiveness.
- For test duration and resource monitoring, the pytest-monitor plugin tracks metrics like execution time and memory
usage, giving insights into the performance of tests and helping identify bottlenecks or resource-heavy tests. A
sketch of how these plugins can be combined in a single pytest invocation is shown below.
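To make the interplay of these plugins concrete, the following is a minimal sketch of how the extension could invoke pytest from TypeScript and read the generated JSON report. The flag names come from pytest, pytest-cov and pytest-json-report, but the report path and the summary fields accessed are assumptions for illustration rather than our exact implementation.

```typescript
import { execFile } from 'child_process';
import { promises as fs } from 'fs';
import * as path from 'path';

// Sketch: run pytest with the coverage, JSON-report and duration options discussed
// above, then read the generated report. The report location and the shape of its
// "summary" object are assumptions to be checked against pytest-json-report's docs.
async function runPytest(workspaceRoot: string): Promise<void> {
    const reportPath = path.join(workspaceRoot, 'report.json'); // illustrative path

    await new Promise<void>((resolve) => {
        execFile(
            'python',
            [
                '-m', 'pytest',
                '--durations=10',              // built-in: report the 10 slowest tests
                '--cov',                       // pytest-cov: measure coverage
                '--json-report',               // pytest-json-report: emit JSON results
                `--json-report-file=${reportPath}`,
            ],
            { cwd: workspaceRoot },
            () => resolve()                    // a non-zero exit simply means failing tests
        );
    });

    const report = JSON.parse(await fs.readFile(reportPath, 'utf8'));
    console.log(`Passed: ${report.summary?.passed ?? 0}, failed: ${report.summary?.failed ?? 0}`);
}
```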
Technical Experimentation
Our extension provides intelligent insights into test quality and suggests optimised test cases. To
power these features, we evaluated different approaches.
As part of our initial development, we experimented with local LLMs, specifically Meta’s LLaMA [8], to
assess their feasibility for generating and improving test cases. Running LLaMA locally offered
advantages such as enhanced privacy and independence from external APIs. However, several drawbacks
quickly became apparent. Computational costs were a major concern, as running a local LLM required
high-end GPU resources or cloud-based hosting, making it impractical for an IDE extension aimed at
widespread adoption. Additionally, inference times were much slower compared to cloud-based
alternatives, negatively affecting the real-time responsiveness essential for effective code
assistance.
Beyond performance concerns, maintaining and updating a local LLM also introduced extra overhead,
requiring frequent retraining and adjustment to keep pace with evolving testing frameworks and best
practices and to sustain optimal performance [9] [10]. Risks such as hallucinations or
misinterpretations would also require ongoing mitigation strategies, further increasing this overhead.
These factors made local LLMs unsuitable for a lightweight, scalable, and developer-friendly tool.
An important factor in our decision to move away from local LLMs was our client, Microsoft, given that
our tool is designed to extend GitHub Copilot as a VS Code extension. Since Copilot itself operates as
a cloud-based service, integrating a local LLM would be fundamentally misaligned with Copilot’s
architecture. Unlike standalone AI-powered tools, Copilot does not support local model customisation,
and forcing a local LLM into the workflow would introduce unnecessary complexity, performance
bottlenecks, and incompatibility with Copilot’s existing infrastructure.
Moreover, Microsoft recommended that we integrate with the Language Model API [11] for extending
Copilot’s functionality. This leverages the same model behind Copilot while being more cost-efficient
than integrating directly with GitHub Enterprise APIs. This approach ensures compatibility, allowing
us to build a seamless, native experience within VS Code while avoiding redundant infrastructure costs.
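The snippet below is a minimal sketch of how a prompt could be sent to the Copilot-backed model through the Language Model API; the chosen model family and the error handling are illustrative assumptions rather than our final configuration.

```typescript
import * as vscode from 'vscode';

// Minimal sketch: select a Copilot-backed chat model via the Language Model API
// and stream its response. The model family ('gpt-4o') is an illustrative choice.
async function askCopilot(prompt: string, token: vscode.CancellationToken): Promise<string> {
    const [model] = await vscode.lm.selectChatModels({ vendor: 'copilot', family: 'gpt-4o' });
    if (!model) {
        throw new Error('No Copilot chat model available.');
    }

    const messages = [vscode.LanguageModelChatMessage.User(prompt)];
    const response = await model.sendRequest(messages, {}, token);

    let result = '';
    for await (const fragment of response.text) {
        result += fragment;
    }
    return result;
}
```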
Given our priorities – ease of use, scalability, and minimal maintenance – leveraging the Language
Model API was the most logical choice. It allowed us to stay within Copilot’s native ecosystem,
ensuring long-term maintainability and integration with existing development workflows. Furthermore,
Copilot is enterprise-ready, backed by GitHub’s security policies and trusted in production
environments, making it the ideal foundation for a robust and reliable testing assistant.
By aligning with Microsoft's guidance, we were able to focus on developing meaningful features rather
than optimising and maintaining a local model, which was not the primary scope of our project. This
decision not only improved the developer experience but also allowed us to maximise efficiency,
ensuring UnitPylot is both powerful and practical for real-world use.
Despite choosing Copilot as the default model, we ensured flexibility by allowing users to enter their
own local LLM endpoint as an extension feature, making our tool compatible with self-hosted or
third-party models if needed. This approach accommodates users who prefer local deployment for privacy
or customisation while maintaining the efficiency of Copilot integration.
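A sketch of how such a user-supplied endpoint might be read and queried is shown below; the setting name unitpylot.localModelEndpoint and the OpenAI-compatible /v1/chat/completions request shape are assumptions for illustration only.

```typescript
import * as vscode from 'vscode';

// Sketch: if the user has configured a local LLM endpoint, send the prompt there;
// otherwise the caller falls back to the Copilot-backed Language Model API.
// The setting name and the OpenAI-compatible request body are illustrative
// assumptions, and a runtime with global fetch (Node 18+) is assumed.
async function queryLocalModel(prompt: string): Promise<string | undefined> {
    const endpoint = vscode.workspace
        .getConfiguration('unitpylot')
        .get<string>('localModelEndpoint');
    if (!endpoint) {
        return undefined; // no local endpoint configured
    }

    const res = await fetch(`${endpoint}/v1/chat/completions`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            model: 'local-model',
            messages: [{ role: 'user', content: prompt }],
        }),
    });
    const data: any = await res.json();
    return data.choices?.[0]?.message?.content;
}
```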
Model Comparison
| Model Type | Performance (Code Accuracy & Efficiency) | Computation | Ease of Integration |
| --- | --- | --- | --- |
| GitHub Copilot (GPT-4 API-based) | High – Optimised for real-time code completion, contextual suggestions, and multi-language support. Excels in live coding scenarios. | Cloud-based, no local hardware needed. | Seamless integration with VS Code. |
| Llama 2 (7B, 13B, 70B) | Moderate – Good for text-based AI tasks, but inferior for live coding. Less specialised for software development. | Requires high-end GPUs (7B: 8GB+ VRAM, 70B: 64GB+ VRAM). | Custom setup required for integration into VS Code. |
| Code Llama (7B, 13B, 34B, 70B) | Strong for coding, particularly Python, but not as real-time as Copilot. Better for batch-generated suggestions. | Moderate to high GPU needs (7B: 8GB+ VRAM, 34B: 32GB+ VRAM). | Requires manual integration into VS Code. |
| DeepSeek R1 (67B MoE) | Good for logic-heavy tasks, stronger than Llama 2, but slower inference compared to Copilot. | Needs high-end GPUs (80GB+ VRAM recommended). | Not optimised for VS Code, requires API setup. |
| SmolLM2 (135M, 360M, 1.7B) | Lightweight, suitable for quick tasks, but not strong for complex code generation. | Runs on CPUs, minimal hardware required. | Works with Ollama but lacks full IDE support. |
| GPT-2 (1.5B) | Outdated, not competitive for modern code generation. | Runs on low-end hardware, but very poor results. | Not practical for real-time coding. |
Prompting Techniques
After conducting experiments, we found that Copilot did not consistently produce the desired output.
To improve response quality, we researched prompt engineering techniques and refined our approach by
incorporating structured prompts with explicit instructions [12].
Our prompts clearly defined Copilot’s role, aligning it with a specific domain of expertise. For
example, by assigning it the role of a "code coverage analysis assistant," we ensured that its
responses remained focused on optimising relevant metrics, such as identifying untested conditions and
improving test coverage. This role-based guidance, known as Persona-based prompting, helped generate
more targeted and contextually appropriate suggestions.
To ensure clarity and consistency, we specified a strict JSON response format, outlining required
fields such as line number, category, and suggestion. By using Constraint-based prompting and
specifying a structured output, we minimised ambiguity and formatting inconsistencies while guiding
Copilot towards generating actionable insights.
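To make this concrete, the fragment below sketches the shape of such a prompt, combining the persona line, explicit instructions, and the constrained JSON output format; the wording and field names are illustrative rather than our exact production prompt.

```typescript
// Illustrative prompt combining Persona-based and Constraint-based prompting.
// The wording and JSON fields are examples, not the exact production prompt.
const coveragePrompt = (codeSnippet: string, coverageGaps: string) => `
You are a code coverage analysis assistant for Python test suites.

Analyse the code and the uncovered lines below, then suggest improvements.

Respond ONLY with a JSON array, where each element has the fields:
  "line" (number), "category" (string), "suggestion" (string).

Code:
${codeSnippet}

Uncovered lines:
${coverageGaps}
`;
```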
We also explored other prompting techniques, including Chain-of-Thought prompting [13], which
encourages step-by-step reasoning, and Few-Shot prompting [14], where example responses demonstrated
the expected format. We implemented "Soft" Chain-of-Thought prompting by specifying a step-by-step
breakdown that encourages intermediate reasoning steps, rather than explicitly requiring a full chain
of thought. This approach enhanced the quality of our responses by guiding the model’s reasoning
process. Additionally, we optimised our prompts for clarity and specificity, explicitly instructing
Copilot to highlight affected test cases and explain the impact of code modifications.
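As an illustration of how the "Soft" Chain-of-Thought and Few-Shot elements can be appended to the prompt above, consider the following fragment; both the reasoning steps and the example response (including the function name it mentions) are hypothetical.

```typescript
// Illustrative "Soft" Chain-of-Thought steps and a Few-Shot example appended to
// the prompt above. The steps and the sample response are hypothetical.
const reasoningAndExample = `
Work through the analysis in this order before producing the JSON:
1. Identify which test cases exercise the modified code.
2. Determine which branches or conditions remain untested.
3. Explain the impact of the code modification on those tests.

Example response:
[
  { "line": 42, "category": "untested-branch",
    "suggestion": "Add a test for the empty-input case of parse_config()." }
]
`;
```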