Technical Research
Brownfield Python Projects
Python has become one of the most widely used programming languages globally. As of 2024, it
surpassed JavaScript as the most popular language on GitHub, reflecting its extensive adoption across
domains including data science, web development, and machine learning [4]. This widespread usage
underscores Python’s versatility, as a vast number of developers rely on it for their projects.
Furthermore, in the software industry, projects involving the enhancement or integration of existing
systems, i.e. brownfield projects, are more common than ones which are built from scratch (greenfield)
[5]. Many organisations want to improve or expand their current systems to preserve prior investments.
This approach often presents challenges, such as dealing with legacy code and ensuring compatibility
with existing infrastructure [5]. Addressing these challenges is crucial, as the majority of
development work in the industry involves brownfield projects.
Therefore, we decided that by building a tool for Python and brownfield codebases, we could
provide solutions that resonate with a substantial segment of the development community.
Extension Development and Language Choice
To fulfil our brief of developing a VS Code extension, we researched and utilised the VS Code
Extension API [6], which offers a set of tools and interfaces that enable us to customise the
functionality of the VS Code editor and register commands.
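As a brief illustration, the sketch below shows how an extension command might be registered through this API; the command identifier and message are hypothetical examples rather than UnitPylot's actual commands.

```typescript
import * as vscode from 'vscode';

// Minimal sketch of command registration via the VS Code Extension API.
// The command id "unitpylot.runTests" is a hypothetical example.
export function activate(context: vscode.ExtensionContext) {
    const disposable = vscode.commands.registerCommand('unitpylot.runTests', () => {
        vscode.window.showInformationMessage('UnitPylot: running the test suite...');
    });
    context.subscriptions.push(disposable);
}

export function deactivate() {}
```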
VS Code Extensions are primarily developed using either TypeScript or JavaScript. After researching
the best-suited language for extension development, we determined that TypeScript was the optimal
choice. The official VS Code documentation recommends TypeScript for extension development [7], due to
its seamless integration and compatibility with VS Code’s internal APIs. Additionally, TypeScript
offers type safety, better tooling and improved maintainability compared to JavaScript.
Since our tool is designed for Python codebases, we developed the automatic testing framework in
Python using pytest. Given that all team members have extensive prior experience with Python, this
choice allowed us to leverage our expertise for efficient development.
Choosing the Testing Framework
When choosing the testing framework, we considered options such as unittest, nose2 and pytest. The table below
summarises the key comparisons between each of the frameworks:
| Feature | unittest | nose2 | pytest |
| --- | --- | --- | --- |
| Pass/Fail Reporting | ✔️ | ✔️ | ✔️ |
| Assertions & Detailed Tracebacks | ✔️ | ✔️ | ✔️ (Best Formatting) |
| Test Execution Time | ✖️ (Manual) | ✔️ (Plugin) | ✔️ (Built-in) |
| Code Coverage | ✖️ (Needs coverage.py) | ✖️ (Needs coverage.py) | ✔️ (pytest-cov) |
| Test Discovery | ✔️ | ✔️ | ✔️ (Most Flexible) |
| Plugin Support | ✖️ | ✔️ | ✔️ (Best) |
As illustrated above, when compared to the other frameworks, pytest offers richer built-in capabilities and
extensibility via plugins. It was for this reason that pytest stood out to us as the most comprehensive framework for
collecting key metrics about test cases, such as test duration, memory usage, coverage, and pass/fail status.
In particular, pytest offers several advantages that align with our project goals:
- It provides detailed pass/fail reporting, including the exact line of failure and a clear display of which
tests have passed or failed.
- For test duration, pytest offers built-in features that easily measure how long each test takes to execute, with
additional plugins available.
- For code coverage, the pytest-cov plugin integrates seamlessly with pytest, offering detailed coverage reports that
show how much of the code is tested by the test suite.
- To enhance the analysis further, the pytest-json-report plugin generates a JSON file with the test results, which
can be parsed for integration with other tools or for further analysis. This enables a deeper understanding of test
effectiveness.
- For test duration and resource monitoring, the pytest-monitor plugin tracks metrics like execution time and memory
usage, giving insights into the performance of tests and helping identify bottlenecks or resource-heavy tests. A
sketch of how these plugins can be combined in a single pytest invocation is shown below.
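To make the interplay of these plugins concrete, the following is a minimal sketch of how the extension could invoke pytest from TypeScript and read the generated JSON report. The flag names come from pytest, pytest-cov and pytest-json-report, but the report path and the summary fields accessed are assumptions for illustration rather than our exact implementation.

```typescript
import { execFile } from 'child_process';
import { promises as fs } from 'fs';
import * as path from 'path';

// Sketch: run pytest with the coverage, JSON-report and duration options discussed
// above, then read the generated report. The report location and the shape of its
// "summary" object are assumptions to be checked against pytest-json-report's docs.
async function runPytest(workspaceRoot: string): Promise<void> {
    const reportPath = path.join(workspaceRoot, 'report.json'); // illustrative path

    await new Promise<void>((resolve) => {
        execFile(
            'python',
            [
                '-m', 'pytest',
                '--durations=10',              // built-in: report the 10 slowest tests
                '--cov',                       // pytest-cov: measure coverage
                '--json-report',               // pytest-json-report: emit JSON results
                `--json-report-file=${reportPath}`,
            ],
            { cwd: workspaceRoot },
            () => resolve()                    // a non-zero exit simply means failing tests
        );
    });

    const report = JSON.parse(await fs.readFile(reportPath, 'utf8'));
    console.log(`Passed: ${report.summary?.passed ?? 0}, failed: ${report.summary?.failed ?? 0}`);
}
```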
Technical Experimentation
Our extension provides intelligent insights into test quality and suggests optimised test cases. To
power these features, we evaluated different approaches.
As part of our initial development, we experimented with local LLMs, specifically Meta’s LLaMA [8], to
assess their feasibility for generating and improving test cases. Running LLaMA locally offered
advantages such as enhanced privacy and independence from external APIs. However, several drawbacks
quickly became apparent. Computational costs were a major concern, as running a local LLM required
high-end GPU resources or cloud-based hosting, making it impractical for an IDE extension aimed at
widespread adoption. Additionally, inference times were much slower compared to cloud-based
alternatives, negatively affecting the real-time responsiveness essential for effective code
assistance.
Beyond performance concerns, maintaining and updating a local LLM also introduced extra overhead,
requiring frequent retraining and adjustment to keep pace with evolving testing frameworks and best
practices and to sustain optimal performance [9] [10]. Risks such as hallucinations or
misinterpretations would also require ongoing mitigation strategies, further increasing this overhead.
These factors made local LLMs unsuitable for a lightweight, scalable, and developer-friendly tool.
An important factor in our decision to move away from local LLMs was our client, Microsoft, given that
our tool is designed to extend GitHub Copilot as a VS Code extension. Since Copilot itself operates as
a cloud-based service, integrating a local LLM would be fundamentally misaligned with Copilot’s
architecture. Unlike standalone AI-powered tools, Copilot does not support local model customisation,
and forcing a local LLM into the workflow would introduce unnecessary complexity, performance
bottlenecks, and incompatibility with Copilot’s existing infrastructure.
Moreover, Microsoft recommended that we integrate with the Language Model API [11] for extending
Copilot’s functionality. This leverages the same model behind Copilot while being more cost-efficient
than integrating directly with GitHub Enterprise APIs. This approach ensures compatibility, allowing
us to build a seamless, native experience within VS Code while avoiding redundant infrastructure costs.
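The snippet below is a minimal sketch of how a prompt could be sent to the Copilot-backed model through the Language Model API; the chosen model family and the error handling are illustrative assumptions rather than our final configuration.

```typescript
import * as vscode from 'vscode';

// Minimal sketch: select a Copilot-backed chat model via the Language Model API
// and stream its response. The model family ('gpt-4o') is an illustrative choice.
async function askCopilot(prompt: string, token: vscode.CancellationToken): Promise<string> {
    const [model] = await vscode.lm.selectChatModels({ vendor: 'copilot', family: 'gpt-4o' });
    if (!model) {
        throw new Error('No Copilot chat model available.');
    }

    const messages = [vscode.LanguageModelChatMessage.User(prompt)];
    const response = await model.sendRequest(messages, {}, token);

    let result = '';
    for await (const fragment of response.text) {
        result += fragment;
    }
    return result;
}
```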
Given our priorities – ease of use, scalability, and minimal maintenance – leveraging the Language
Model API was the most logical choice. It allowed us to stay within Copilot’s native ecosystem,
ensuring long-term maintainability and integration with existing development workflows. Furthermore,
Copilot is enterprise-ready, backed by GitHub’s security policies and trusted in production
environments, making it the ideal foundation for a robust and reliable testing assistant.
By aligning with Microsoft's guidance, we were able to focus on developing meaningful features rather
than optimising and maintaining a local model, which was not the primary scope of our project. This
decision not only improved the developer experience but also allowed us to maximise efficiency,
ensuring UnitPylot is both powerful and practical for real-world use.
Despite choosing Copilot as the default model, we ensured flexibility by allowing users to enter their
own local LLM endpoint as an extension feature, making our tool compatible with self-hosted or
third-party models if needed. This approach accommodates users who prefer local deployment for privacy
or customisation while maintaining the efficiency of Copilot integration.
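A sketch of how such a user-supplied endpoint might be read and queried is shown below; the setting name unitpylot.localModelEndpoint and the OpenAI-compatible /v1/chat/completions request shape are assumptions for illustration only.

```typescript
import * as vscode from 'vscode';

// Sketch: if the user has configured a local LLM endpoint, send the prompt there;
// otherwise the caller falls back to the Copilot-backed Language Model API.
// The setting name and the OpenAI-compatible request body are illustrative
// assumptions, and a runtime with global fetch (Node 18+) is assumed.
async function queryLocalModel(prompt: string): Promise<string | undefined> {
    const endpoint = vscode.workspace
        .getConfiguration('unitpylot')
        .get<string>('localModelEndpoint');
    if (!endpoint) {
        return undefined; // no local endpoint configured
    }

    const res = await fetch(`${endpoint}/v1/chat/completions`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            model: 'local-model',
            messages: [{ role: 'user', content: prompt }],
        }),
    });
    const data: any = await res.json();
    return data.choices?.[0]?.message?.content;
}
```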
Model Comparison
| Model Type | Performance (Code Accuracy & Efficiency) | Computation | Ease of Integration |
| --- | --- | --- | --- |
| GitHub Copilot (GPT-4 API-based) | High – Optimised for real-time code completion, contextual suggestions, and multi-language support. Excels in live coding scenarios. | Cloud-based, no local hardware needed. | Seamless integration with VS Code. |
| Llama 2 (7B, 13B, 70B) | Moderate – Good for text-based AI tasks, but inferior for live coding. Less specialised for software development. | Requires high-end GPUs (7B: 8GB+ VRAM, 70B: 64GB+ VRAM). | Custom setup required for integration into VS Code. |
| Code Llama (7B, 13B, 34B, 70B) | Strong for coding, particularly Python, but not as real-time as Copilot. Better for batch-generated suggestions. | Moderate to high GPU needs (7B: 8GB+ VRAM, 34B: 32GB+ VRAM). | Requires manual integration into VS Code. |
| DeepSeek R1 (67B MoE) | Good for logic-heavy tasks, stronger than Llama 2, but slower inference compared to Copilot. | Needs high-end GPUs (80GB+ VRAM recommended). | Not optimised for VS Code, requires API setup. |
| SmolLM2 (135M, 360M, 1.7B) | Lightweight, suitable for quick tasks, but not strong for complex code generation. | Runs on CPUs, minimal hardware required. | Works with Ollama but lacks full IDE support. |
| GPT-2 (1.5B) | Outdated, not competitive for modern code generation. | Runs on low-end hardware, but very poor results. | Not practical for real-time coding. |
Prompting Techniques
After conducting experiments, we found that Copilot did not consistently produce the desired output.
To improve response quality, we researched prompt engineering techniques and refined our approach by
incorporating structured prompts with explicit instructions [12].
Our prompts clearly defined Copilot’s role, aligning it with a specific domain of expertise. For
example, by assigning it the role of a "code coverage analysis assistant," we ensured that its
responses remained focused on optimising relevant metrics, such as identifying untested conditions and
improving test coverage. This role-based guidance, known as Persona-based prompting, helped generate
more targeted and contextually appropriate suggestions.
To ensure clarity and consistency, we specified a strict JSON response format, outlining required
fields such as line number, category, and suggestion. By using Constraint-based prompting and
specifying a structured output, we minimised ambiguity and formatting inconsistencies while guiding
Copilot towards generating actionable insights.
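To make this concrete, the fragment below sketches the shape of such a prompt, combining the persona line, explicit instructions, and the constrained JSON output format; the wording and field names are illustrative rather than our exact production prompt.

```typescript
// Illustrative prompt combining Persona-based and Constraint-based prompting.
// The wording and JSON fields are examples, not the exact production prompt.
const coveragePrompt = (codeSnippet: string, coverageGaps: string) => `
You are a code coverage analysis assistant for Python test suites.

Analyse the code and the uncovered lines below, then suggest improvements.

Respond ONLY with a JSON array, where each element has the fields:
  "line" (number), "category" (string), "suggestion" (string).

Code:
${codeSnippet}

Uncovered lines:
${coverageGaps}
`;
```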
We also explored other prompting techniques, including Chain-of-Thought prompting [13], which
encourages step-by-step reasoning, and Few-Shot prompting [14], where example responses demonstrated
the expected format. We implemented "Soft" Chain-of-Thought prompting by specifying a step-by-step
breakdown that encourages intermediate reasoning steps, rather than explicitly requiring a full chain
of thought. This approach enhanced the quality of our responses by guiding the model’s reasoning
process. Additionally, we optimised our prompts for clarity and specificity, explicitly instructing
Copilot to highlight affected test cases and explain the impact of code modifications.
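As an illustration of how the "Soft" Chain-of-Thought and Few-Shot elements can be appended to the prompt above, consider the following fragment; both the reasoning steps and the example response (including the function name it mentions) are hypothetical.

```typescript
// Illustrative "Soft" Chain-of-Thought steps and a Few-Shot example appended to
// the prompt above. The steps and the sample response are hypothetical.
const reasoningAndExample = `
Work through the analysis in this order before producing the JSON:
1. Identify which test cases exercise the modified code.
2. Determine which branches or conditions remain untested.
3. Explain the impact of the code modification on those tests.

Example response:
[
  { "line": 42, "category": "untested-branch",
    "suggestion": "Add a test for the empty-input case of parse_config()." }
]
`;
```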