Testing Strategy

Since a significant part of our solution relies on AI models, our testing strategy involves feeding prompts to the model and checking whether the output is useful to a user. We tested the system method by method, to confirm that each method works. The exception to this rule is the pair of methods generate() and generate1(): they work in tandem to produce the code for a game, so the quality of one method’s output is inherently dependent on the other. For this reason, we tested them together by making a game. Once all the methods had been tested individually, we tested whether the system works as a whole.

Setup Testing ( setup() )

To test the setup() method, we ran the function on different systems. This ensures that the model is downloaded correctly regardless of the operating system, and that it works whether or not the user has an Intel processor. If the test raises no errors, we know that the method works.
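As an illustration, the check we ran on each machine looked roughly like the sketch below; the pixelpilot package name and the import path for setup() are assumptions made for this example, not the project’s actual layout.

```python
# Rough sketch of the per-platform check; the "pixelpilot" package name and
# the setup() import path are assumed here for illustration only.
import platform

from pixelpilot import setup  # hypothetical import of the setup() method


if __name__ == "__main__":
    print(f"Testing setup() on {platform.system()} / {platform.machine()}")
    try:
        setup()  # should download and initialise the offline model
    except Exception as exc:
        print(f"setup() raised an error: {exc}")
    else:
        print("setup() completed with no errors")
```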

We first ran the method within Windows Subsystem for Linux (WSL) to test whether it works on Linux with an Intel processor. The method ran without raising any errors.

We then ran the method on a Windows device outside WSL to test whether it works on Windows with an Intel processor. Again, the method ran without raising any errors.

Offline Code Generation Testing ( generate() and generate1() )

To test the offline model, we used generate() and generate1() to produce games in pygame. We started by giving a base prompt to the first assistant, then sent the prompts it generated, one by one, to the second assistant. After each prompt, the model was expected to generate code based on that input. We then combined all of this code into a single file and ran it to verify that the game worked as intended. We also wrote prompts asking for changes to the game, to assess the model's adaptability. We conducted these tests across the 0.5B, 1.5B, and 3B versions of the model to determine their suitability for the intended offline use.

We first inputted the following prompt:

“Make a snake game where a snake moves across a field eating apples, and the apples are randomly scattered across the field. If the snake eats an apple, it becomes longer, and if the snake collides with itself, it loses. Make this game in pygame”

We then inputted the following prompt:

“I want to create a Pong game in pygame. The game should have two paddles, one on the left controlled by W and S keys and one on the right controlled by the Up and Down arrows. A ball should move on its own, bouncing off the paddles and the top and bottom edges. When the ball passes a paddle, the other player scores a point, and the ball resets to the center. The score should be displayed at the top, and the movement of both the paddles and ball should be smooth.”

As expected, for both inputs the first assistant generated a list of prompts, and once these prompts were fed to the second assistant, it generated all the code needed. Putting this code together and running it produced a working Snake/Pong game.
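In code terms, the manual workflow described above corresponds roughly to the sketch below; the exact signatures and return types of generate() and generate1(), and the way the chunks are stitched into one file, are simplifying assumptions for illustration.

```python
# Simplified sketch of the two-assistant test workflow; generate() and
# generate1() are the project's methods, but their signatures and return
# types are assumed here for illustration.
from pixelpilot import generate, generate1  # hypothetical import path

base_prompt = (
    "Make a snake game where a snake moves across a field eating apples..."
)

# First assistant: break the game idea down into a list of smaller prompts.
sub_prompts = generate(base_prompt)  # assumed to return list[str]

# Second assistant: generate code for each sub-prompt, one at a time.
code_chunks = [generate1(prompt) for prompt in sub_prompts]

# Combine the chunks into a single file, then run it to check the game works.
with open("generated_game.py", "w") as game_file:
    game_file.write("\n\n".join(code_chunks))
```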

We then tested the model’s ability to amend the game’s code by writing the prompt “make the ball start at a random y position each turn”. The model returned an amended version of the code, which ran successfully with the change applied.

AI Model Testing

To make sure the AI behind PixelPilot works smoothly and consistently—especially for kids—a series of AI models were evaluated for their performance and reliability. The models tested included GPT-4o, GPT-3.5 Turbo (o3 mini), Phi-4, DeepSeek V3, and LLaMA 3.3. The main aim was to check how reliable their answers were and whether they could give structured, easy-to-handle responses.

One of the key things we looked at during testing was the temperature setting, which controls how random the AI’s responses are. After playing around with it, we set it to 0.01. This is because unpredictable behaviour from the AI isn’t ideal—particularly when the users are children. They need consistent, clear replies they can follow, and randomness only makes things more confusing.
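As an example, this is roughly how a low temperature is passed to a chat-completion style API; the snippet uses the OpenAI Python client with a placeholder model name and prompts, and other providers take an equivalent parameter.

```python
# Illustrative only: setting a very low temperature for near-deterministic
# replies. Model name, prompts and client choice are placeholders.
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.01,  # keeps answers consistent and predictable for kids
    messages=[
        {"role": "system", "content": "You are PixelPilot, a friendly game-building assistant."},
        {"role": "user", "content": "Help me make a snake game in pygame."},
    ],
)
print(response.choices[0].message.content)
```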

We also needed the AIs to return answers in JSON format, so they could slot neatly into the chat workflow. Some models handled this well, while others needed extra help. We added detailed prompts to guide the AIs on how to format their answers properly, but even then, they didn’t always get it right. To fix this, we built some custom parsing functions that scan the AI’s reply, extract the correct JSON, and check if it matches what we expect.
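A cut-down version of that kind of parsing helper is sketched below; the expected keys are placeholders rather than PixelPilot’s actual response schema.

```python
# Simplified sketch of a parsing helper: find the JSON object in the reply,
# parse it, and confirm the keys we expect are present. The key names are
# placeholders for illustration.
import json
import re

EXPECTED_KEYS = {"message", "code"}  # assumed response schema


def extract_json(reply: str):
    """Return the parsed JSON object from a model reply, or None if invalid."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)  # ignore surrounding prose
    if not match:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    return data if EXPECTED_KEYS <= data.keys() else None
```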

On top of that, we added a fallback system that checks whether anything important is missing from the AI’s reply. If it is, the AI asks the user for more information, helping to fill in the gaps and keep things moving. This made the whole system more reliable and easier to use—even if the AI messes up a bit.
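The fallback can then sit on top of that parsing step, along these lines; the required field names and the wording of the follow-up question are assumptions made for the sake of the example.

```python
# Hedged sketch of the fallback: if important fields are missing from the
# parsed reply, ask the user a follow-up question instead of failing.
REQUIRED_FIELDS = {"game_type", "controls"}  # placeholder field names


def follow_up_question(parsed: dict):
    """Return a question for the user if anything important is missing."""
    missing = REQUIRED_FIELDS - parsed.keys()
    if missing:
        return "Could you tell me a bit more about: " + ", ".join(sorted(missing)) + "?"
    return None  # nothing is missing, so the chat can keep moving
```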

Overall, this testing helped us figure out which models were most suitable, and set up solid safety nets to deal with unpredictable output—making PixelPilot feel a lot more stable, especially for younger users.

User Acceptance Testing

To evaluate how well PixelPilot meets the needs of its intended users, we carried out user acceptance testing with a small group of real-world testers, who worked through a set of test cases and then gave feedback on their experience.

Testers

The testers include:

  1. Amir: 12-year-old student interested in game design.
  2. Lily: 15-year-old student with minimal coding experience.
  3. Tom: 35-year-old headteacher.
  4. Sandra: 42-year-old parent with no technical background.
Note: These were all real-world testers; however, their identities have been anonymised.

Test Cases

We created a set of test scenarios designed to evaluate the usability and accessibility of the PixelPilot VSCode extension. Each tester was asked to perform the following tasks:

  1. Test Case 1: Launch the extension and navigate through the walkthrough.
  2. Test Case 2: Generate a game using a text prompt.
  3. Test Case 3: Upload a graphics folder and prompt the assistant to use custom assets.
  4. Test Case 4: Switch to offline mode and generate an image locally.

Feedback from Users

| Acceptance Requirement | Strongly Disagree | Disagree | Agree | Strongly Agree | Comments |
| --- | --- | --- | --- | --- | --- |
| Was the UI easy to navigate? | 0 | 0 | 1 | 3 | + Very simple layout, clear icons |
| Was it easy to generate a game? | 0 | 0 | 2 | 2 | + Prompting felt natural and fun |
| Did the image generation work offline? | 0 | 1 | 2 | 1 | - Slight lag but worked eventually |
| Could you upload and use graphics? | 0 | 0 | 2 | 2 | + Loved using my own character sprites |
| Did you enjoy using PixelPilot? | 0 | 0 | 1 | 3 | + It’s fun and I want to keep building stuff! |
| Did you understand what each tab does? | 0 | 1 | 3 | 0 | - Labels could be more descriptive |
| Was the extension kid-friendly? | 0 | 0 | 1 | 3 | + Very beginner friendly |
| Would you recommend it to others? | 0 | 0 | 1 | 3 | + Yes, especially for schools |

NEXT_LEVEL.EXE

Ready to See Our Evaluation Results?

Discover how we evaluated the overall success and impact of PixelPilot.