Implementation

Database Design with SQLAlchemy

Modeling the ER Diagram:

  • DOCTOR Table: We created a Doctor model with fields such as doctor_id (primary key), firstname, lastname, username, and password.
  • PATIENT Table: The Patient model includes patient_id (primary key), mrn, firstname, lastname, and dob (date of birth).
  • DOCTOR_PATIENT Table: This table acts as an association table linking doctors and patients. Our DoctorPatient model contains doctor_patient_id as the primary key along with foreign keys (doctor_id and patient_id) referencing the Doctor and Patient models, respectively.
  • MEETING Table: The Meeting model includes meeting_id (primary key), a foreign key doctor_patient_id (linking back to the association table), and fields for summary, meeting_date, transcript, and report.

SQLAlchemy Implementation Details:

  • Declarative Base: We used SQLAlchemy’s declarative_base to define our models.
  • Relationships: By defining relationships between our models (for example, linking a doctor to many patients through the association table), we ensured that queries could easily join related data.
  • SQLite Connection: We configured a SQLite connection string (e.g., sqlite:///./database.db) and used SQLAlchemy’s engine to manage connections.
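As a rough sketch (table names, column types, and the relationship wiring shown here are illustrative; the real models follow the field lists above), the SQLAlchemy layer looks like this:

        from sqlalchemy import create_engine, Column, Integer, String, Date, Text, ForeignKey
        from sqlalchemy.orm import declarative_base, relationship, sessionmaker

        Base = declarative_base()

        class Doctor(Base):
            __tablename__ = "doctor"                 # table names here are illustrative
            doctor_id = Column(Integer, primary_key=True)
            firstname = Column(String)
            lastname = Column(String)
            username = Column(String, unique=True)
            password = Column(String)
            patients = relationship("DoctorPatient", back_populates="doctor")

        class Patient(Base):
            __tablename__ = "patient"
            patient_id = Column(Integer, primary_key=True)
            mrn = Column(String)
            firstname = Column(String)
            lastname = Column(String)
            dob = Column(Date)
            doctors = relationship("DoctorPatient", back_populates="patient")

        class DoctorPatient(Base):
            __tablename__ = "doctor_patient"         # association table between doctors and patients
            doctor_patient_id = Column(Integer, primary_key=True)
            doctor_id = Column(Integer, ForeignKey("doctor.doctor_id"))
            patient_id = Column(Integer, ForeignKey("patient.patient_id"))
            doctor = relationship("Doctor", back_populates="patients")
            patient = relationship("Patient", back_populates="doctors")
            meetings = relationship("Meeting", back_populates="doctor_patient")

        class Meeting(Base):
            __tablename__ = "meeting"
            meeting_id = Column(Integer, primary_key=True)
            doctor_patient_id = Column(Integer, ForeignKey("doctor_patient.doctor_patient_id"))
            summary = Column(Text)
            meeting_date = Column(Date)
            transcript = Column(Text)
            report = Column(Text)
            doctor_patient = relationship("DoctorPatient", back_populates="meetings")

        # SQLite connection and session factory
        engine = create_engine("sqlite:///./database.db", connect_args={"check_same_thread": False})
        SessionLocal = sessionmaker(bind=engine, autoflush=False)
        Base.metadata.create_all(bind=engine)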

FastAPI Integration

API Endpoints

  • CRUD Operations: With FastAPI, we built endpoints to create, read, update, and delete entries for doctors, patients, associations, and meetings. For example, we might have endpoints like POST /doctors, GET /patients, and so on.

Dependency Injection

FastAPI’s dependency injection system was used to pass a database session (from SQLAlchemy) into each endpoint, ensuring that each request had access to the database in a clean and thread-safe manner.
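A minimal sketch of the dependency and two such endpoints, assuming the models and SessionLocal from the previous section and hypothetical DoctorRead/PatientRead response schemas alongside the DoctorCreate schema described under Data Validation:

        from fastapi import FastAPI, Depends
        from sqlalchemy.orm import Session

        from .models import Doctor, Patient, SessionLocal              # assumed module layout
        from .schemas import DoctorCreate, DoctorRead, PatientRead     # Pydantic schemas (see below)

        app = FastAPI()

        def get_db():
            # One SQLAlchemy session per request; closed even if the endpoint raises.
            db = SessionLocal()
            try:
                yield db
            finally:
                db.close()

        @app.post("/doctors", response_model=DoctorRead)
        def create_doctor(doctor: DoctorCreate, db: Session = Depends(get_db)):
            db_doctor = Doctor(**doctor.dict())      # Pydantic v2 would use doctor.model_dump()
            db.add(db_doctor)
            db.commit()
            db.refresh(db_doctor)
            return db_doctor

        @app.get("/patients", response_model=list[PatientRead])
        def list_patients(db: Session = Depends(get_db)):
            return db.query(Patient).all()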

Routing and Asynchronous Handling

By integrating SQLAlchemy with FastAPI, we made use of asynchronous endpoints where appropriate, which helped to improve performance, especially under concurrent access.

Data Validation with Pydantic

Request and Response Models

  • Pydantic Schemas: We defined Pydantic models (e.g., DoctorCreate, PatientCreate, MeetingCreate, etc.) to validate incoming data. These schemas ensured that only correctly formatted and complete data was processed.
  • Serialization: Pydantic was also used for serializing SQLAlchemy model instances to JSON when returning responses. This separation between the ORM models and the API schemas helped maintain a clean architecture.
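For example, the patient schemas might look like the following (PatientRead is an illustrative name for the response schema; shown in Pydantic v1 style, where orm_mode enables serialization straight from SQLAlchemy objects):

        from datetime import date
        from pydantic import BaseModel

        class PatientCreate(BaseModel):
            mrn: str
            firstname: str
            lastname: str
            dob: date

        class PatientRead(PatientCreate):
            patient_id: int

            class Config:
                orm_mode = True          # Pydantic v2: model_config = {"from_attributes": True}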

Benefits of Using SQLite

SQLite is serverless and stores the entire database in a single local file, which kept setup simple and made the application easy to run and test on any machine, while SQLAlchemy still leaves a path open to a larger database later if needed.

Putting It All Together

Our final implementation can be seen as a layered architecture: SQLite (accessed through SQLAlchemy) as the persistence layer, FastAPI as the web and routing layer, and Pydantic as the validation layer at the API boundary.

This cohesive setup not only aligns with best practices but also results in a modular, maintainable, and scalable system. Each component (database, web framework, validation) plays a specific role that supports both development speed and application reliability.

In summary, our implementation stands as a strong example of how to combine these modern Python tools to create an efficient API-driven application, all while adhering to the relationships and constraints defined in our ER diagram.

Front End Implementation

Front End Implementation with Electron

Electron as the Framework:

We chose Electron to build a cross-platform desktop application, which allowed us to leverage web technologies while delivering a native-like experience. Electron's architecture enabled us to create a main process that manages application windows and a renderer process where our front end lives.

Main Process:

We set up the Electron main process to initialize our application, create the main browser window, and load our HTML file. This process also handles system-level interactions such as file access and application updates.

Renderer Process:

The renderer process runs our user interface built with HTML, CSS, and JavaScript. It behaves much like a regular web page but with the enhanced capabilities provided by Electron.

Building the User Interface with HTML, CSS, and JavaScript

HTML Structure:

Our HTML provided the structure of the application, defining elements such as navigation menus, forms, and data display areas. This clear semantic markup helped ensure that the content was organized and accessible.

CSS Styling:

We used CSS to style the application, creating a clean and responsive design. The CSS rules were crafted to adapt to different screen sizes and ensure a consistent look and feel across various desktop environments.

JavaScript Interactivity:

JavaScript powered the dynamic behavior of the interface. We implemented event listeners to handle user actions like form submissions and button clicks. Additionally, asynchronous JavaScript functions (using fetch or other AJAX methods) were used to communicate with our backend API, ensuring that data could be retrieved and updated without requiring a full page reload.

Monitor_Audio.py

This code contains the business logic for the observers and the processing of audio into JSON and TXT formats.

Transcript_Processor

This class ensures that the necessary folders for processing exist, such as the raw_json and processed_text directories.

AudioProcessor Class

  • GLOBAL_PROCESSING_LOCK: A global variable that acts as a lock.
  • IS_AUDIO_PROCESSING: A boolean global variable that tracks whether audio processing is currently in progress.

The AudioProcessor class handles the logic for processing audio files. When initialized, it ensures that necessary folders exist using the Transcript_Processor and sets up a cache environment for the diarization model.

Device Selection:

Hardware plays a critical role in running AI models effectively. The program dynamically selects the appropriate device:
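The selection is roughly along these lines (a simplified sketch; the real code also decides whether to set the use_openvino flag for Intel hardware):

        import torch

        def select_device():
            # Prefer a dedicated accelerator when one is present.
            if torch.cuda.is_available():
                return torch.device("cuda")      # NVIDIA GPUs
            if torch.backends.mps.is_available():
                return torch.device("mps")       # Apple Silicon
            return torch.device("cpu")           # Intel machines run on CPU/NPU via OpenVINO instead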

This dynamic selection ensures the program runs effectively across various devices, accommodating the diverse hardware used by the team.

Diarization Model:

The program loads the Pyannote diarization model based on the selected device:
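Loading the pipeline follows Pyannote's documented pattern, roughly as below (the checkpoint name and token handling are illustrative; the OpenVINO/NPU variant wraps the underlying models as described next):

        from pyannote.audio import Pipeline

        def load_diarization_pipeline(hf_token: str, device):
            # The gated Pyannote checkpoint requires a (free) Hugging Face access token.
            pipeline = Pipeline.from_pretrained(
                "pyannote/speaker-diarization",
                use_auth_token=hf_token,
            )
            pipeline.to(device)      # run inference on the device selected above
            return pipeline

        # Later, diarization is run with at most two speakers, e.g.:
        # diarization = pipeline(audio_path, max_speakers=2)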

For Intel devices, the program applies OpenVINO acceleration with the StaticBatchWrapper class and loads the Pyannote diarization model so it can utilize the Intel NPU. This setup ensures efficient diarization processing.

StaticBatchWrapper Class

This class is designed to meet the requirements of static input shapes expected by hardware acceleration devices such as the Intel NPU. The wrapper ensures the batch size remains fixed by padding the input data if necessary, allowing the model inference to be consistently executed without shape mismatch errors.

By integrating this wrapper into the inference pipeline, compatibility with Intel NPU hardware is ensured by enforcing consistent batch sizes during inference, preventing runtime errors caused by dynamic batch dimensions.
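In outline, the wrapper pads whatever batch it receives up to a fixed size before calling the wrapped model and then discards the padded rows again; the sketch below illustrates the idea rather than the exact class:

        import torch
        import torch.nn as nn

        class StaticBatchWrapper(nn.Module):
            def __init__(self, model, batch_size):
                super().__init__()
                self.model = model
                self.batch_size = batch_size      # the static batch size expected by the NPU graph

            def forward(self, x):
                n = x.shape[0]
                if n < self.batch_size:
                    # Pad with zero rows so the input always matches the static shape.
                    pad = torch.zeros(self.batch_size - n, *x.shape[1:],
                                      dtype=x.dtype, device=x.device)
                    x = torch.cat([x, pad], dim=0)
                out = self.model(x[: self.batch_size])
                return out[:n]                    # drop outputs that correspond to padding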

process_audio Function

An important function inside the AudioProcessor class is process_audio, which requires the path to the audio file (received as a string) as its parameter. This function is responsible for turning an audio file into a transcription TXT file.

To ensure that audio files are processed one at a time, the function uses a boolean flag IS_AUDIO_PROCESSING and a lock GLOBAL_PROCESSING_LOCK. These ensure that checking and changing the state of the flag is atomic.
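In outline, the guard looks like this (simplified; the real function also handles logging and errors):

        import threading

        GLOBAL_PROCESSING_LOCK = threading.Lock()
        IS_AUDIO_PROCESSING = False

        def process_audio(audio_path: str):
            global IS_AUDIO_PROCESSING
            with GLOBAL_PROCESSING_LOCK:           # check-and-set is atomic under the lock
                if IS_AUDIO_PROCESSING:
                    return                         # another file is already being processed
                IS_AUDIO_PROCESSING = True
            try:
                ...                                # transcription + diarization steps described below
            finally:
                with GLOBAL_PROCESSING_LOCK:
                    IS_AUDIO_PROCESSING = False    # always clear the flag, even on errors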

Steps in process_audio:

OpenVINO Processing

First, the code checks whether OpenVINO should be used via the use_openvino flag; if so, it tries to import openvino_genai and sets up all the directories needed for OpenVINO Whisper.

The main processing function is process_audio_with_openvino. It first checks whether a model in OpenVINO format already exists; if not, it downloads OpenAI Whisper small and converts it to the OpenVINO format. It then attempts processing with optimum.intel first, since that is the path through which we actually obtained a speech-to-text transcript.

It loads the audio file with the librosa audio library (see the license section) and makes sure the audio really is at a 16 kHz sample rate. Technically the file should already be at 16 kHz; we mainly wanted to try librosa because the OpenVINO documentation [1] [2] uses it. It then generates the transcription from the audio.
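The optimum.intel path, in simplified form (the model name and generation arguments are based on the published examples and may differ from our final code):

        import librosa
        from optimum.intel import OVModelForSpeechSeq2Seq
        from transformers import AutoProcessor

        def transcribe_with_optimum_intel(audio_path: str) -> str:
            processor = AutoProcessor.from_pretrained("openai/whisper-small")
            # export=True converts the PyTorch checkpoint to OpenVINO IR on first use.
            model = OVModelForSpeechSeq2Seq.from_pretrained("openai/whisper-small", export=True)

            audio, sr = librosa.load(audio_path, sr=16000)      # resample to 16 kHz just in case
            inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
            generated_ids = model.generate(inputs.input_features)
            return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]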

However, we found that this transcription has no timestamps; we were unable to produce a timestamped transcription with optimum.intel, and there appears to be an open OpenVINO issue about this [3]. If the optimum.intel transcription comes back empty, we fall back to openvino_genai, which we implemented following the documentation with CPU-device processing [4], but we never managed to run it successfully (we are unsure whether this is a configuration problem on our machines or an issue with the library itself).

Since we could not generate timestamps, we wrote code to estimate the time taken by each sentence manually and then formatted the result into the specific JSON layout required by the rest of the existing file-processing code, so it can still be turned into a transcript with speaker tags.

If the OpenVINO path fails or an exception occurs, processing falls back to the normal path. The OpenVINO code was added much later as a last-minute addition, so the rest of the pipeline was written with plenty of redundancy to keep the program running even when OpenVINO does not work.

With more time, we would study the OpenVINO library and its documentation further. The good news is that we did manage to successfully integrate OpenVINO with Pyannote speaker diarization on top of our existing code, after consulting the OpenVINO documentation to understand how to configure our existing pipeline to OpenVINO's specifications [1].

[1] “Speaker diarization — OpenVINO™ documentation,” openvino.ai, 2023.

[2] “Speech to Text with OpenVINO™ — OpenVINO™ documentation,” openvino.ai, 2023.

[3] “[Bug]: OVModelForSpeechSeq2Seq fails to extract_token_timestamps · Issue #22794 · openvinotoolkit/openvino,” GitHub, 2025.

[4] OpenVINO, “Accelerate Generative AI,” 2024. Accessed: Mar. 28, 2025.

Normal Processing

If OpenVINO cannot be run successfully or the machine is not an Intel device, the program runs OpenAI Whisper without OpenVINO. It automatically pip-installs the right module if it cannot import it or detects an incorrect version. The OpenAI Whisper model is stored in the cache directory, so it only needs to be downloaded once and can then be used offline.

The OpenAI model's output text is then manually formatted into a JSON file by format_whisper_results. The format is based on the JSON files produced by Whisper.cpp: initially the program did not use the OpenAI model but Whisper.cpp, which was run as a command-line tool and output a JSON file automatically, so the OpenAI model's output is formatted to keep working with the existing code.

The fields we need from the OpenAI model output are ["segments"] and, within each segment, ["start"], ["end"], and ["text"]. The start and end times are converted to an hours:minutes:seconds:milliseconds format. The JSON layout itself is largely arbitrary; what matters is that there is a JSON object with a "transcription" key holding an array of segments.
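Each segment carries its timestamps, offsets, and text, roughly like this (modeled on Whisper.cpp's output; the exact field names in our files may differ):

        {
          "transcription": [
            {
              "timestamps": { "from": "00:00:00:000", "to": "00:00:04:520" },
              "offsets":    { "from": 0, "to": 4520 },
              "text": "Hello, how are you feeling today?"
            }
          ]
        }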

This JSON output is saved in the raw_json directory inside the transcripts folder.

Merging Transcription and Diarization

Diarization is then performed with the Pyannote model, with the maximum number of speakers set to two. The diarization output is a set of timestamped segments labelled SPEAKER_00 or SPEAKER_01. Code then merges the speech-to-text and diarization results, which is why the timestamps are so important to compare and match.

The diarization segments are stored in an array with each segment in this format:
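Each entry records the segment boundaries in seconds together with the speaker label from Pyannote, along these lines (field names illustrative):

        { "start": 12.3, "end": 15.8, "speaker": "SPEAKER_00" }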

The function merge_results takes the location of the saved JSON file and the diarization segments and combines them to create a transcript in the format:

[ SPEAKER_00 : Speech here ]

The first step in merge_results is to call the process_json function, which reads the saved JSON, takes the offsets (start and end), converts them to seconds, and returns a list with each item like this:
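Each item pairs the converted times with the segment text, for example (values illustrative):

        { "start": 12.0, "end": 16.4, "text": "Hello, how are you feeling today?" }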

Then merge_results takes both lists (speech-to-text and diarization), sorts them by the start time, and merges them. The merging process works as follows:

Step 1: Check Overlap

For each text segment, determine which timestamps it overlaps with:

        if max(text_start, speaker_start) < min(text_end, speaker_end):
            overlap_start = max(text_start, speaker_start)
            overlap_end = min(text_end, speaker_end)
            overlap_duration = overlap_end - overlap_start
            text_duration = text_end - text_start
            overlap_percentage = (overlap_duration / text_duration) * 100
      

Store all calculations in overlapping_speakers, where each item contains:

  • { speaker, overlap_start, overlap_end, overlap_duration, overlap_percentage }

Step 2: Determine Best Speaker

Sort overlapping_speakers by overlap_duration in descending order; the speaker with the longest overlap is taken as the best speaker.

Alternative Step 2: No Overlap

If there is no overlap, calculate the midpoint of the text duration:

        text_midpoint = (text_start + text_end) / 2
      

For each diarization segment, calculate its midpoint:

        speaker_midpoint = (speaker_start + speaker_end) / 2
      

Find the closest speaker by comparing the absolute difference between the text midpoint and the speaker midpoint. The closest speaker is assigned to the text.
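Putting the two cases together, the speaker-assignment step of merge_results boils down to something like this condensed sketch:

        def assign_speaker(text_seg, speaker_segs):
            # Step 1: collect every diarization segment that overlaps this text segment.
            overlapping_speakers = []
            for seg in speaker_segs:
                overlap_start = max(text_seg["start"], seg["start"])
                overlap_end = min(text_seg["end"], seg["end"])
                if overlap_start < overlap_end:
                    overlapping_speakers.append({"speaker": seg["speaker"],
                                                 "overlap_duration": overlap_end - overlap_start})
            if overlapping_speakers:
                # Step 2: the speaker with the longest overlap wins.
                return max(overlapping_speakers, key=lambda s: s["overlap_duration"])["speaker"]
            # Alternative step 2: no overlap, so fall back to the closest midpoint.
            text_midpoint = (text_seg["start"] + text_seg["end"]) / 2
            closest = min(speaker_segs,
                          key=lambda seg: abs(text_midpoint - (seg["start"] + seg["end"]) / 2))
            return closest["speaker"]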

Each processed result is appended to a list called merged_entry:

  • { 'start': , 'end': , 'speaker': , 'text': }

A transcript file is created from merged_entry as a TXT file and saved in the processed_text directory (the directory watched by TranscriptWatcher) using the create_transcript_file function. The file format is:

        SPEAKER_00: Speech here
      

Once the process is completed and the file is written, the lock is released, and the IS_AUDIO_PROCESSING flag is set to false.

Design Changes

Main.py: POST “/start-audio” and “/stop-audio”

Initially, the POST endpoints in main.py were designed to start and stop the observers whenever recording started and stopped. However, this led to many timing bugs and unexpected issues: the observers take time to initialize and shut down, which introduced delays, and an observer could shut down before its files had been processed. Hence, the design was changed to a permanent observer pattern.

  • Permanent Observer with Async Context Manager:
    • Simpler to implement.
    • Observer runs as a background task.
    • Doesn’t interrupt other operations.
    • Automatically cleans up once the server shuts down.
  • Start and Stop Observer Dynamically:
    • Timing issues.
    • Misses files between stopping and starting.
    • Risks clean-up problems.

The program also initially relied on global variables, which caused many issues because it was not thread-safe. This was later fixed by consolidating shared state into an AppState class.
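In outline, the permanent observer is wired into FastAPI's lifespan hook, roughly as below (the AppState attribute names and exact wiring in main.py are illustrative; start_transcript_watcher is described later in this section):

        from contextlib import asynccontextmanager
        from fastapi import FastAPI

        @asynccontextmanager
        async def lifespan(app: FastAPI):
            state = AppState()                                        # shared, thread-safe application state
            observer = start_transcript_watcher(state.processing_state,
                                                state.processing_lock,
                                                state.queue)          # runs for the whole server lifetime
            app.state.app_state = state
            try:
                yield                                                 # the server handles requests here
            finally:
                observer.stop()                                       # clean shutdown when the server exits
                observer.join()

        app = FastAPI(lifespan=lifespan)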

AudioHandler Class and TranscriptWatcher

To detect new files that go into the audio folder or new transcript files, the program uses the watchdog library to monitor specific directories.

TranscriptWatcher Class in monitorAudio.py

TranscriptWatcher watches over the processed_text directory and triggers when a TXT file is created or modified.

on_created and on_modified

These are event-handler methods from the watchdog library, triggered when a change is detected in the watched directories, such as a file being created or modified.

  • last_modified_times: A dictionary with the last modified timestamps.
  • debounce_interval: An arbitrary value that acts as a cooldown timer to prevent events from occurring too frequently, which could result in errors.

When the handler is triggered, it records current_time and looks up the last modified time for the file path, defaulting to 0 if the path has not been seen before. If the difference between current_time and that last modified time is larger than debounce_interval, processing is allowed to continue and process_event is called, as sketched below.
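A simplified sketch of the handler and its debounce check (the debounce_interval value and constructor arguments shown are illustrative):

        import time
        from watchdog.events import FileSystemEventHandler

        class TranscriptWatcher(FileSystemEventHandler):
            def __init__(self, processing_state, processing_lock, processing_queue):
                self.processing_state = processing_state
                self.processing_lock = processing_lock
                self.processing_queue = processing_queue
                self.last_modified_times = {}          # file path -> last event timestamp
                self.debounce_interval = 1.0           # seconds; an arbitrary cooldown

            def on_modified(self, event):
                if event.is_directory or not event.src_path.endswith(".txt"):
                    return
                current_time = time.time()
                last_modified_time = self.last_modified_times.get(event.src_path, 0)
                if current_time - last_modified_time > self.debounce_interval:
                    self.last_modified_times[event.src_path] = current_time
                    self.process_event(event.src_path)

            def on_created(self, event):
                self.on_modified(event)                # both events share the same debounce path

            def process_event(self, path):
                ...                                    # described in the next subsection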

process_event()

This function builds a unique key, path_key, from the file path combined with the file size.

  • processed_paths: A set of all the paths already processed (no duplicates allowed).
  • If the path_key already exists in this set, it will not be reprocessed.
  • If not, it adds the new path_key to the set and calls update_processing_state.
  • It keeps a cached set of files, but once it exceeds a certain limit, it cleans up after itself.

update_processing_state

All transcripts must join a queue and wait their turn to be processed.

  • processing_state: A class in main.py with attributes such as:
    • current_path: A string representing the current file being processed.
    • is_processed: A boolean indicating whether the file has already been processed.
  • processing_lock: A threading.Lock ensuring that processing is done one at a time, maintaining thread safety.

The function checks if the current_path from processing_state matches the file currently being processed and whether is_processed is true. If so, it avoids duplicate processing. If current_path is None or is_processed is false, it starts a new path, sets the file path as current_path, and sets is_processed to false. Otherwise, it adds the file path to the queue if it’s not already there.

start_transcript_watcher

This function is called in main.py. It requires the processing_state class, the lock, and the queue as inputs.

  • It ensures the correct transcripts/processed_text directory exists.
  • It sets up an instance of the observer watching this directory.
  • It starts the TranscriptWatcher handler and begins monitoring the folder.
  • Returns the observer instance so it can be controlled in main.py.
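A sketch of that setup with watchdog's Observer (paths and constructor arguments as described above, simplified):

        import os
        from watchdog.observers import Observer

        def start_transcript_watcher(processing_state, processing_lock, processing_queue):
            watch_dir = os.path.join("transcripts", "processed_text")
            os.makedirs(watch_dir, exist_ok=True)                     # make sure the directory exists

            handler = TranscriptWatcher(processing_state, processing_lock, processing_queue)
            observer = Observer()
            observer.schedule(handler, watch_dir, recursive=False)
            observer.start()                                          # monitoring runs in a background thread
            return observer                                           # main.py keeps this handle to stop it later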

AudioHandler Class in monitorAudio.py

Initialization Phase

Creates an instance of the AudioProcessor class with the Pyannote authentication key (the Pyannote model is open source and free to use, but you must register your information to obtain a token). Audio files are processed one at a time, so they wait in a queue for their turn.

  • Sets up self.processed_files and self.pending_files, which is a list used as a queue (FIFO) to track processing and pending files.
  • self.handler_lock: A lock (mutex) for thread safety for operations on pending_files.
  • Stores watch_folder: The directory to observe where the audio files are coming from.
  • Upon initialization, it calls start_processing_thread.

start_processing_thread

The process_files worker (the main loop) is defined inside start_processing_thread and runs in an infinite loop. It checks for new files to process and handles each one in turn.

  • The lock handler_lock ensures only one thread can access pending_files at a time.
  • Error handling ensures that if there is a problem with processing an audio file, an error message is logged.
  • There is a short delay at the end to avoid excessive CPU usage.
  • start_processing_thread creates a daemon thread that runs process_files in the background, so audio processing does not block the main program and the thread cannot prevent a clean shutdown.
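The worker loop, condensed (the AudioProcessor attribute name and error handling here are illustrative):

        import threading
        import time

        def start_processing_thread(self):
            def process_files():
                while True:                                          # runs for the lifetime of the application
                    next_file = None
                    with self.handler_lock:                          # only one thread touches pending_files
                        if self.pending_files:
                            next_file = self.pending_files.pop(0)    # FIFO: oldest file first
                    if next_file:
                        try:
                            self.processor.process_audio(next_file)
                        except Exception as exc:
                            print(f"Error processing {next_file}: {exc}")
                    time.sleep(0.5)                                  # short delay to avoid busy-waiting

            thread = threading.Thread(target=process_files, daemon=True)
            thread.start()                                           # daemon thread: never blocks program exit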

add_file_to_pending

Uses a thread lock when accessing the list of pending files and adds a new file to the pending queue.

on_created

Triggered when a new file is created; it checks whether the file has already been processed and, if not, adds it to the processed_files set and queues it for processing.

on_moved

Handles when files are moved into the watch_folder and updates them for processing.

Fact_Check.py

The purpose of Fact_Check is to compare the transcript with the generated report. It uses a RAG (Retrieval-Augmented Generation) approach: a similarity search retrieves the most relevant transcript passages for each claim, and an LLM then compares the claim with that evidence and reasons to a conclusion about whether the claim in the report is supported by the transcript.

SessionManager Class

Manages the session state for the fact-checking process, which main.py (server) requires. It includes:

  • A function to create a session UUID and store various information inside the sessions.
  • A function to clean up the session.

generateEmbeddingsForTranscript

This function takes the transcript as a string, calls a function that splits it into chunks on sentence boundaries while keeping speaker tags, and then calls a function to generate embeddings. Each embedding is an np.ndarray appended to a list. The function returns the chunks: the chunked text, their embeddings, and their chunk indexes.

generateEmbeddingsForReport

Similar to generateEmbeddingsForTranscript, but it chunks differently without speaker tags and processes sentence by sentence.

process_chunks

Takes the chunks and returns an np.ndarray of embeddings and text.

index_embeddings

Takes the numpy array embeddings and uses FAISS (see more details about the library in the technology review section) to index them, returning the texts and index.

query_faiss_index

Takes the claim, the FAISS index, the texts accompanying the index, the query vector (embedding), and the number of closest results. It performs a search to match the embeddings and returns results, including the claim and the search results.
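The indexing and querying steps, in condensed form (a flat L2 index is shown; the distance metric and return layout in our code may differ slightly):

        import faiss
        import numpy as np

        def index_embeddings(embeddings: np.ndarray, texts: list[str]):
            index = faiss.IndexFlatL2(embeddings.shape[1])        # exact search over L2 distance
            index.add(embeddings.astype(np.float32))
            return texts, index

        def query_faiss_index(claim: str, index, texts, query_vector: np.ndarray, k: int = 3):
            distances, ids = index.search(query_vector.astype(np.float32).reshape(1, -1), k)
            evidence = [texts[i] for i in ids[0] if i != -1]      # the k closest transcript chunks
            return {"claim": claim, "evidence": evidence}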

similarity_search

Takes the transcript string and report string. It:

  • Generates embeddings for the transcript and report.
  • Processes the chunks (process_chunks).
  • Creates a FAISS index from the transcript embeddings and texts.
  • Matches each claim in the report to the most relevant evidence in the transcript using query_faiss_index.

verification_prompt

Contains the prompt used by the LLM to generate a response. It takes the claim and the results of the evidence from the similarity search as parameters.

Fact-Check Function in main.py

  • @app.post("/create-session"): Creates a session ID using an instance of the SessionManager class.
  • @app.post("/process-documents"): Gets the session ID, retrieves the transcript and report from the frontend, and uses the similarity_search function. Transmits the session ID, claims, and evidences.
  • @app.get("/stream-fact-check"): Gets the session ID from the event source in the frontend (passed as a query parameter in the URL). Compares it to the session ID in the SessionManager instance, processes each claim, and streams responses using the DeepSeek model and the prompt.

Splitting_chunks.py

chunk_report_claims

This function chunks reports by:

  • Removing markers, splitting the report into sections, and stripping the report's headers.
  • Splitting into sentences and adding them to the list of claims for the fact-checking pipeline.

sentence_split_with_speaker_tag

Processes a transcript of dialogue between speakers and splits it into smaller chunks while preserving the speaker tag.

Report_Generation.py

Contains generate_medical, which includes the prompt the program uses to generate a report.

Run_granite.py

Classes

Includes classes such as GraniteModel and DeepSeekModel. Each class loads one LLM and uses a singleton pattern so that large models are not loaded more than once.

  • A lock ensures thread safety, preventing multiple threads from creating an instance simultaneously.
  • There is an option to use optimum.intel or llama.cpp libraries to load models.
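The singleton pattern, in outline (loader details omitted):

        import threading

        class DeepSeekModel:
            _instance = None
            _lock = threading.Lock()

            @classmethod
            def get_instance(cls):
                if cls._instance is None:
                    with cls._lock:                  # double-checked locking: only one thread loads the model
                        if cls._instance is None:
                            cls._instance = cls()
                return cls._instance

            def __init__(self):
                self.model = self._load_model()      # expensive, so done exactly once per process

            def _load_model(self):
                ...                                  # optimum.intel or llama.cpp loading goes here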

Design Changes

Initially, the program was designed to reload the AI model each time it was used, which made things slow. This was changed to load only a single instance, allowing the AI model to be loaded once and reused.

run_granite and run_deepseek

These functions run the LLM model by getting an instance of the model and producing a streaming response.

Recorder.py

This code simulates the Frame glasses in case the physical device is not available. PyAudio is used because it provides configurable parameters such as a 16 kHz sample rate, 1024-byte chunk size, and mono-channel PCM-16 format. This configuration balances audio quality with system responsiveness, which is particularly important for wearable device simulations.

The recording lifecycle follows a sequence beginning with initialization through a simulated tap gesture (triggered by keyboard input). Once activated, the system enters a continuous recording state where audio data gets buffered in memory. A dual-locking mechanism ensures thread safety during this process - using threading.Lock for synchronous operations and asyncio.Lock for asynchronous coordination. This prevents race conditions while maintaining system responsiveness.

Audio processing occurs through a dedicated pipeline that handles the conversion of raw audio data into WAV format files. The implementation includes atomic file operations to prevent corruption, first writing to temporary files before finalizing the save operation. For backend integration, the system communicates with a FastAPI service using asynchronous HTTP requests through the httpx library, allowing for non-blocking API calls to start and stop recording sessions.

The tap gesture simulation provides an intuitive user interface, mapping keyboard input to device interaction patterns. This abstraction allows for realistic testing of the recording functionality without requiring physical hardware. The system maintains state consistency throughout the recording lifecycle, properly handling concurrent start/stop requests and ensuring all audio data gets preserved.

Performance considerations guided several design choices, including the use of ThreadPoolExecutor for blocking I/O operations and memory-efficient bytearray buffers for audio storage. The architecture supports future enhancements such as audio streaming, real-time processing, and additional gesture controls while maintaining the current system's reliability and responsiveness.

recorder.py also helped us diagnose why events were sometimes not triggered or detected: watchdog is not always sensitive to every file operation, and occasionally an event needs to be triggered explicitly.

Frame.py

This code implements an audio recording application for Brilliant Labs Frame glasses. The Audio Recording and Saving feature is a core functionality of the Frame smart glasses application, enabling users to record audio and save it as .wav files. This feature is implemented using asynchronous programming with Python's asyncio library. The frame_sdk library is utilized to interact with the glasses' hardware, specifically the microphone, motion, and display. The microphone captures audio data in chunks, which are stored in a byte array. When recording stops, the wave module is used to save the audio data as a .wav file, ensuring the correct format (mono, 16-bit, 16 kHz sample rate).
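Saving the buffered audio uses the standard wave module; the save step looks roughly like this (function name illustrative):

        import wave

        def save_wav(path: str, pcm_bytes: bytes):
            with wave.open(path, "wb") as wav_file:
                wav_file.setnchannels(1)        # mono
                wav_file.setsampwidth(2)        # 16-bit PCM
                wav_file.setframerate(16000)    # 16 kHz sample rate
                wav_file.writeframes(pcm_bytes)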

User interaction is handled through tap gestures on the glasses. When tapped, the application either starts or stops recording by calling the start_recording() or stop_recording() methods, respectively. These methods manage the recording state and initiate or terminate the record_continuously() task, which captures audio data in the background. Additionally, the application communicates with a FastAPI backend using the httpx library to control recording via API endpoints. If the API call fails, the application falls back to direct control, ensuring robustness.

Visual feedback is provided on the glasses' display, showing "Recording" or "Stopped" messages to inform the user of the current state. The recorded files are saved in an audio directory within the project root, with filenames incremented for each new recording. This feature combines hardware interaction, asynchronous programming, and user input handling to deliver a seamless audio recording experience on Frame smart glasses.