Implementation
Database Design with SQLAlchemy
Modeling the ER Diagram:
- DOCTOR Table: We created a Doctor model with fields such as doctor_id (primary key), firstname, lastname, username, and password.
- PATIENT Table: The Patient model includes patient_id (primary key), mrn, firstname, lastname, and dob (date of birth).
- DOCTOR_PATIENT Table: This table acts as an association table linking doctors and patients. Our DoctorPatient model contains doctor_patient_id as the primary key along with foreign keys (doctor_id and patient_id) referencing the Doctor and Patient models, respectively.
- MEETING Table: The Meeting model includes meeting_id (primary key), a foreign key doctor_patient_id (linking back to the association table), and fields for summary, meeting_date, transcript, and report.
SQLAlchemy Implementation Details:
- Declarative Base: We used SQLAlchemy’s declarative_base to define our models.
- Relationships: By defining relationships between our models (for example, linking a doctor to many patients through the association table), we ensured that queries could easily join related data.
- SQLite Connection: We configured a SQLite connection string (e.g., sqlite:///./database.db) and used SQLAlchemy’s engine to manage connections.
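As an illustration, here is a minimal sketch of how these models and the engine might be defined. Field names follow the ER diagram described above; the exact column types, constraints, and options in our code may differ.

    from sqlalchemy import create_engine, Column, Integer, String, Date, Text, ForeignKey
    from sqlalchemy.orm import declarative_base, relationship, sessionmaker

    Base = declarative_base()

    class Doctor(Base):
        __tablename__ = "doctor"
        doctor_id = Column(Integer, primary_key=True)
        firstname = Column(String)
        lastname = Column(String)
        username = Column(String, unique=True)
        password = Column(String)
        # One doctor is linked to many patients through the association table
        patients = relationship("DoctorPatient", back_populates="doctor")

    class Patient(Base):
        __tablename__ = "patient"
        patient_id = Column(Integer, primary_key=True)
        mrn = Column(String)
        firstname = Column(String)
        lastname = Column(String)
        dob = Column(Date)
        doctors = relationship("DoctorPatient", back_populates="patient")

    class DoctorPatient(Base):
        __tablename__ = "doctor_patient"
        doctor_patient_id = Column(Integer, primary_key=True)
        doctor_id = Column(Integer, ForeignKey("doctor.doctor_id"))
        patient_id = Column(Integer, ForeignKey("patient.patient_id"))
        doctor = relationship("Doctor", back_populates="patients")
        patient = relationship("Patient", back_populates="doctors")
        meetings = relationship("Meeting", back_populates="doctor_patient")

    class Meeting(Base):
        __tablename__ = "meeting"
        meeting_id = Column(Integer, primary_key=True)
        doctor_patient_id = Column(Integer, ForeignKey("doctor_patient.doctor_patient_id"))
        summary = Column(Text)
        meeting_date = Column(Date)
        transcript = Column(Text)
        report = Column(Text)
        doctor_patient = relationship("DoctorPatient", back_populates="meetings")

    # SQLite engine and session factory
    engine = create_engine("sqlite:///./database.db", connect_args={"check_same_thread": False})
    SessionLocal = sessionmaker(bind=engine, autocommit=False, autoflush=False)
    Base.metadata.create_all(bind=engine)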
FastAPI Integration
API Endpoints
- CRUD Operations: With FastAPI, we built endpoints to create, read, update, and delete entries for doctors, patients, associations, and meetings. For example, we might have endpoints like POST /doctors, GET /patients, and so on.
Dependency Injection
FastAPI’s dependency injection system was used to pass a database session (from SQLAlchemy) into each endpoint, ensuring that each request had access to the database in a clean and thread-safe manner.
Routing and Asynchronous Handling
By integrating SQLAlchemy with FastAPI, we made use of asynchronous endpoints where appropriate, which helped to improve performance, especially under concurrent access.
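The following is a simplified, synchronous sketch of the dependency-injection pattern and two such endpoints. It assumes the SessionLocal factory and ORM models from the sketch above, and the DoctorCreate schema sketched in the next subsection; our actual routes contain more validation and error handling.

    from fastapi import FastAPI, Depends
    from sqlalchemy.orm import Session

    app = FastAPI()

    def get_db():
        # Yield one database session per request and close it afterwards
        db = SessionLocal()
        try:
            yield db
        finally:
            db.close()

    @app.post("/doctors")
    def create_doctor(doctor: DoctorCreate, db: Session = Depends(get_db)):
        # Map the validated Pydantic data onto the ORM model and persist it
        db_doctor = Doctor(**doctor.dict())
        db.add(db_doctor)
        db.commit()
        db.refresh(db_doctor)
        return db_doctor

    @app.get("/patients")
    def list_patients(db: Session = Depends(get_db)):
        return db.query(Patient).all()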
Data Validation with Pydantic
Request and Response Models
- Pydantic Schemas: We defined Pydantic models (e.g., DoctorCreate, PatientCreate, MeetingCreate, etc.) to validate incoming data. These schemas ensured that only correctly formatted and complete data was processed.
- Serialization: Pydantic was also used for serializing SQLAlchemy model instances to JSON when returning responses. This separation between the ORM models and the API schemas helped maintain a clean architecture.
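For example, the DoctorCreate schema and a corresponding response schema might look roughly like this (Pydantic v1 style with orm_mode; the real schemas contain whatever fields our endpoints actually accept):

    from datetime import date
    from pydantic import BaseModel

    class DoctorCreate(BaseModel):
        firstname: str
        lastname: str
        username: str
        password: str

    class DoctorRead(BaseModel):
        doctor_id: int
        firstname: str
        lastname: str
        username: str

        class Config:
            orm_mode = True  # allows returning SQLAlchemy objects directly

    class PatientCreate(BaseModel):
        mrn: str
        firstname: str
        lastname: str
        dob: date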
Benefits of Using SQLite
- Lightweight & Simple: SQLite was an excellent choice for a small-to-medium sized project due to its ease of setup and minimal configuration requirements.
- Local Development: It allowed us to rapidly prototype and test the application without the overhead of setting up a separate database server.
Putting It All Together
Our final implementation can be seen as a layered architecture:
- Data Layer (SQLAlchemy + SQLite): The ORM models accurately represent the ER diagram, ensuring relational integrity and straightforward querying.
- Service Layer (FastAPI): FastAPI handles the routing, business logic, and ties the database interactions together with dependency injection.
- Validation Layer (Pydantic): Pydantic models guard the data entering and leaving our API, making the application robust against invalid input.
This cohesive setup not only aligns with best practices but also results in a modular, maintainable, and scalable system. Each component (database, web framework, validation) plays a specific role that supports both development speed and application reliability.
In summary, our implementation stands as a strong example of how to combine these modern Python tools to create an efficient API-driven application, all while adhering to the relationships and constraints defined in our ER diagram.
Front End Implementation
Front End Implementation with Electron
Electron as the Framework:
We chose Electron to build a cross-platform desktop application, which allowed us to leverage web technologies while delivering a native-like experience. Electron's architecture enabled us to create a main process that manages application windows and a renderer process where our front end lives.
Main Process:
We set up the Electron main process to initialize our application, create the main browser window, and load our HTML file. This process also handles system-level interactions such as file access and application updates.
Renderer Process:
The renderer process runs our user interface built with HTML, CSS, and JavaScript. It behaves much like a regular web page but with the enhanced capabilities provided by Electron.
Building the User Interface with HTML, CSS, and JavaScript
HTML Structure:
Our HTML provided the structure of the application, defining elements such as navigation menus, forms, and data display areas. This clear semantic markup helped ensure that the content was organized and accessible.
CSS Styling:
We used CSS to style the application, creating a clean and responsive design. The CSS rules were crafted to adapt to different screen sizes and ensure a consistent look and feel across various desktop environments.
JavaScript Interactivity:
JavaScript powered the dynamic behavior of the interface. We implemented event listeners to handle user actions like form submissions and button clicks. Additionally, asynchronous JavaScript functions (using fetch or other AJAX methods) were used to communicate with our backend API, ensuring that data could be retrieved and updated without requiring a full page reload.
Monitor_Audio.py
This code contains the business logic for the observers and the processing of audio into JSON and TXT formats.
Transcript_Processor
This class ensures that the necessary folders for processing exist, such as the raw_json
and processed_text
directories.
AudioProcessor Class
- GLOBAL_PROCESSING_LOCK: A global lock used to make checks and updates of the processing state atomic.
- IS_AUDIO_PROCESSING: A boolean global variable that tracks whether audio processing is currently in progress.
The AudioProcessor class handles the logic for processing audio files. When initialized, it ensures that necessary folders exist using the Transcript_Processor and sets up a cache environment for the diarization model.
Device Selection:
Hardware plays a critical role in running AI models effectively. The program dynamically selects the appropriate device:
- Mac Computers: Uses mps for GPU acceleration.
- CUDA-Compatible Devices: Utilizes CUDA for GPU acceleration.
- Intel Devices: Checks compatibility with OpenVINO using check_openvino_compatibility and applies OpenVINO acceleration if supported.
- Fallback: Defaults to CPU processing if no GPU is available.
This dynamic selection ensures the program runs effectively across various devices, accommodating the diverse hardware used by the team.
Diarization Model:
The program loads the Pyannote diarization model based on the selected device:
- If the model is cached, it uses the cached version for offline use.
- If not cached, it downloads the model.
For Intel devices, the program applies OpenVINO acceleration with the StaticBatchWrapperClass and loads the Pyannote diarization model to utilize the Intel NPU. This setup ensures efficient diarization processing.
StaticBatchWrapper Class
This class is designed to meet the requirements of static input shapes expected by hardware acceleration devices such as the Intel NPU. The wrapper ensures the batch size remains fixed by padding the input data if necessary, allowing the model inference to be consistently executed without shape mismatch errors.
By integrating this wrapper into the inference pipeline, compatibility with Intel NPU hardware is ensured by enforcing consistent batch sizes during inference, preventing runtime errors caused by dynamic batch dimensions.
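A simplified sketch of the idea behind such a wrapper is shown below. The real StaticBatchWrapper in our code targets the compiled OpenVINO model and its exact input format, so the details differ; this only illustrates the padding-to-fixed-batch technique.

    import torch

    class StaticBatchWrapper:
        """Pads inputs to a fixed batch size so a statically-shaped model never sees a shape mismatch."""

        def __init__(self, model, batch_size):
            self.model = model
            self.batch_size = batch_size

        def __call__(self, inputs):
            actual = inputs.shape[0]
            if actual < self.batch_size:
                # Pad the batch dimension with zeros up to the fixed size
                pad = torch.zeros((self.batch_size - actual, *inputs.shape[1:]), dtype=inputs.dtype)
                inputs = torch.cat([inputs, pad], dim=0)
            outputs = self.model(inputs)
            # Drop the padded rows so callers only see results for real inputs
            return outputs[:actual]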
process_audio Function
An important function inside the AudioProcessor class is process_audio, which requires the path to the audio file (received as a string) as its parameter. This function is responsible for turning an audio file into a transcription TXT file.
To ensure that audio files are processed one at a time, the function uses a boolean flag IS_AUDIO_PROCESSING and a lock GLOBAL_PROCESSING_LOCK. These ensure that checking and changing the state of the flag is atomic.
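Conceptually, the guard at the top of process_audio looks something like the sketch below. Variable names match the globals described above, but the real function does more bookkeeping, and transcribe_and_diarize is a hypothetical stand-in for the actual pipeline steps.

    import threading

    GLOBAL_PROCESSING_LOCK = threading.Lock()
    IS_AUDIO_PROCESSING = False

    def process_audio(audio_path: str):
        global IS_AUDIO_PROCESSING
        # Check-and-set the flag atomically so only one file is processed at a time
        with GLOBAL_PROCESSING_LOCK:
            if IS_AUDIO_PROCESSING:
                return  # another file is already being processed
            IS_AUDIO_PROCESSING = True
        try:
            transcribe_and_diarize(audio_path)  # hypothetical helper for the actual pipeline
        finally:
            with GLOBAL_PROCESSING_LOCK:
                IS_AUDIO_PROCESSING = False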
Steps in process_audio:
- The function first checks if audio is already being processed. If not, it sets IS_AUDIO_PROCESSING to true.
- It converts the audio file to ensure compatibility by using ensure_sample_rate, which adjusts the audio to 16 kHz and ensures all audio files are mono (1 channel).
- If the computer is detected as an Intel device, the program attempts to process the audio with OpenVINO acceleration using OpenAI Whisper. If this fails, it falls back to standard processing.
OpenVINO Processing
First, the code checks whether OpenVINO should be used via the use_openvino flag. If so, it tries to import openvino_genai and sets up all the directories needed for OpenVINO Whisper.
The main processing function is process_audio_with_openvino. It first checks whether a model in OpenVINO format already exists; if not, it downloads and converts OpenAI Whisper small to the OpenVINO format. It then attempts transcription with optimum.intel first, since we did manage to produce a speech-to-text transcript with optimum.intel.
The audio file is loaded with the librosa audio library (see the license section) and checked to confirm a 16 kHz sample rate. Technically the file should already be at 16 kHz; we mainly wanted to try librosa because the OpenVINO documentation [1] [2] also uses it. The transcription is then generated from the audio.
However, we found that this output has no timestamps: we were unable to produce a timestamped transcription with optimum.intel, and there appears to be an open OpenVINO issue about this [3]. If the optimum.intel transcription comes back empty, we fall back to openvino_genai, which we implemented following the documentation with CPU device processing [4], but we never managed to get it running (we are unsure whether this is a configuration problem on our machines or an issue with the library itself).
Since we could not generate timestamps, we wrote code to manually estimate the time taken for each sentence and formatted the result into the specific JSON layout required by the rest of the existing file-processing code, so it can still be turned into a transcript with speaker tags.
If OpenVINO processing fails or raises an exception, the program falls back to normal processing. The OpenVINO code was a late, last-minute addition, so the surrounding code was written with many redundancies to keep the pipeline running even if OpenVINO does not work.
If we had more time, we would study the OpenVINO library and its documentation further. The good news is that we did successfully integrate OpenVINO with Pyannote speaker diarization on top of our existing code, after consulting the OpenVINO documentation to adapt our existing pipeline to OpenVINO's specifications [1].
[1] “Speaker diarization — OpenVINO documentation,” Openvino.ai, 2023.
[2] “Speech to Text with OpenVINO — OpenVINO documentation,” Openvino.ai, 2023.
[3] “[Bug]: OVModelForSpeechSeq2Seq fails to extract_token_timestamps · Issue #22794 · openvinotoolkit/openvino,” GitHub, 2025.
[4] OpenVINO, “Accelerate Generative AI,” 2024. Accessed: Mar. 28, 2025.
Normal Processing
If it can’t successfully run OpenVINO or it isn’t an Intel device, it will run the OpenAI Whisper without OpenVINO. It will automatically do a pip install of the right module if it can’t import the module or sees it’s the incorrect version. The OpenAI Whisper is stored in the cached directory so it only needs to be downloaded once and then can be used offline.
The OpenAI model output text is then manually formatted into a JSON with format_whisper_results. The format is based on the JSON files produced by Whisper.cpp. Initially, the program didn’t use the OpenAI model but Whisper.cpp, which ran a command line that output a JSON file automatically. So, the output of the OpenAI model was formatted to work with the existing code.
The important outputs of the OpenAI model are ["segments"] and, within each segment, ["start"], ["end"], and ["text"]. The start and end times are converted to the hours:minutes:seconds:milliseconds format. The exact JSON layout is arbitrary, but the most important part is that there is a JSON object with an array under a key called "transcription". Each entry in this array contains:
- timestamps: from and to as strings (hours:minutes:seconds:milliseconds format).
- offsets: from and to in milliseconds (integer values).
- text: the actual transcribed text for the segment.
This JSON output is saved in the raw_json directory inside the transcripts folder.
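A sketch of how format_whisper_results might build that structure from Whisper's segments is shown below. The timestamp helper is shown inline; our actual implementation may differ in naming and details.

    import json

    def ms_to_timestamp(ms: int) -> str:
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d}:{ms:03d}"

    def format_whisper_results(result: dict, output_path: str):
        transcription = []
        for seg in result["segments"]:
            start_ms = int(seg["start"] * 1000)
            end_ms = int(seg["end"] * 1000)
            transcription.append({
                "timestamps": {"from": ms_to_timestamp(start_ms), "to": ms_to_timestamp(end_ms)},
                "offsets": {"from": start_ms, "to": end_ms},
                "text": seg["text"],
            })
        # Wrap everything under the "transcription" key expected by the downstream code
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump({"transcription": transcription}, f, indent=2)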
Merging Transcription and Diarization
Diarization is then performed using the Pyannote model, with the maximum number of speakers set to two. The diarization model returns timestamps and whether the speaker is SPEAKER_00 or SPEAKER_01. Code is written to merge the results of the speech-to-text and diarization, which is why the timestamps are so important to compare and match.
The diarization segments are stored in an array, with each segment in this format (timestamps in seconds):
{ start: , end: , speaker: }
The function merge_results takes the location of the saved JSON file and the diarization segments and combines them to create a transcript in the format:
[ SPEAKER_00 : Speech here ]
The first step in merge_results is to call the process_json function, which processes the JSON. It takes the offsets (start and end), converts them to seconds, and returns a list with each item like this:
{ start: , end: , speaker: }
Then merge_results takes both lists (speech-to-text and diarization), sorts them by start time, and merges them. The merging process works as follows:
Step 1: Check Overlap
For each text segment, determine which timestamps it overlaps with:
If max(text_start, speaker_start) < min(text_end, speaker_end):
overlap_start = max(text_start, speaker_start)
overlap_end = min(text_end, speaker_end)
overlap_duration = overlap_end - overlap_start
text_duration = text_end - text_start
overlap_percentage = (overlap_duration / text_duration) * 100
Store all calculations in overlapping_speakers, where each item contains:
{ speaker, overlap_start, overlap_end, overlap_duration, overlap_percentage }
Step 2: Determine Best Speaker
Sort overlapping_speakers by overlap_duration in descending order. The first result is the best speaker.
Alternative Step 2: No Overlap
If there is no overlap, calculate the midpoint of the text duration:
text_midpoint = (text_start + text_end) / 2
For each diarization segment, calculate its midpoint:
speaker_midpoint = (speaker_start + speaker_end) / 2
Find the closest speaker by comparing the absolute difference between the text midpoint and the speaker midpoint. The closest speaker is assigned to the text.
Each processed result is appended to a list called merged_entry:
{ 'start': , 'end': , 'speaker': , 'text': }
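Putting the two steps together, the speaker-assignment logic can be sketched as follows. It assumes the text segments carry start, end, and text, and the diarization segments carry start, end, and speaker, all in seconds; the real merge_results also tracks overlap percentages and other bookkeeping.

    def assign_speakers(text_segments, speaker_segments):
        merged_entry = []
        for text in text_segments:
            overlapping_speakers = []
            for spk in speaker_segments:
                # Step 1: compute the overlap between this text segment and each speaker turn
                overlap_start = max(text["start"], spk["start"])
                overlap_end = min(text["end"], spk["end"])
                if overlap_start < overlap_end:
                    overlapping_speakers.append((overlap_end - overlap_start, spk["speaker"]))
            if overlapping_speakers:
                # Step 2: the speaker with the largest overlap duration wins
                best_speaker = max(overlapping_speakers)[1]
            else:
                # Alternative Step 2: fall back to the speaker whose midpoint is closest
                text_midpoint = (text["start"] + text["end"]) / 2
                best_speaker = min(
                    speaker_segments,
                    key=lambda s: abs((s["start"] + s["end"]) / 2 - text_midpoint),
                )["speaker"]
            merged_entry.append({
                "start": text["start"], "end": text["end"],
                "speaker": best_speaker, "text": text["text"],
            })
        return merged_entry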
A transcript file is created from merged_entry as a TXT file and saved in the processed_transcript directory using the create_transcript_file function. The file format is:
SPEAKER_00: Speech here
Once the process is completed and the file is written, the lock is released, and the IS_AUDIO_PROCESSING flag is set to false.
Design Changes
Main.py: POST “/start-audio” and “/stop-audio”
Initially, for the POST endpoints in main.py, the design was to start and stop observers whenever recording started or stopped. However, this led to many timing bugs and unexpected issues: the observers take some time to initialize and stop, which introduces delays, and the observer could shut down before the files were processed. Hence, the design was changed to a permanent observer pattern.
- Permanent Observer with Async Context Manager:
- Simpler to implement.
- Observer runs as a background task.
- Doesn’t interrupt other operations.
- Automatically cleans up once the server shuts down.
- Start and Stop Observer Dynamically:
- Timing issues.
- Misses files between stopping and starting.
- Risks clean-up problems.
Also, the program was initially designed with global variables, which led to many issues as it wasn’t thread-safe. Later, it was made thread-safe by creating AppState.
AudioHandler Class and TranscriptWatcher
To detect new files that arrive in the audio folder or new transcript files, the program uses the watchdog library to monitor specific directories.
TranscriptWatcher Class in monitorAudio.py
TranscriptWatcher watches over the processed_text directory and triggers when a TXT file is created or modified.
on_created and on_modified
These are file system event handlers (from the watchdog library) triggered when changes are detected in the watched directories, such as when a file is created or modified.
- last_modified_times: A dictionary with the last modified timestamps.
- debounce_interval: An arbitrary value that acts as a cooldown timer to prevent events from occurring too frequently, which could result in errors.
When the handler is triggered, it sets current_time to the current time and looks up the last_modified_time for the file path, defaulting to 0 if the path has not been seen before. If the difference between current_time and last_modified_time is larger than the debounce_interval, it allows processing to continue and calls process_event.
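A sketch of this debounce check is shown below; attribute names follow the description above, and whether the timestamp is updated at this point is an implementation detail that may differ in our code.

    import time

    def _should_process(self, file_path) -> bool:
        current_time = time.time()
        last_modified_time = self.last_modified_times.get(file_path, 0)
        if current_time - last_modified_time > self.debounce_interval:
            # Enough time has passed since the last event for this file
            self.last_modified_times[file_path] = current_time
            return True
        return False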
process_event()
This function creates a unique key, path_key, from the file path combined with the file size.
- processed_paths: A set of all the paths already processed (no duplicates allowed).
- If the path_key already exists in this set, it will not be reprocessed.
- If not, it adds the new path_key to the set and calls update_processing_state.
- It keeps a cached set of files, but once it exceeds a certain limit, it cleans up after itself.
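The deduplication idea in process_event can be sketched like this; the cleanup threshold and the decision to simply clear the set are illustrative, not the exact behaviour of our code.

    import os

    def process_event(self, file_path):
        # Key the file by path and size so a re-written file counts as a new event
        path_key = (file_path, os.path.getsize(file_path))
        if path_key in self.processed_paths:
            return  # already handled, skip duplicate events
        self.processed_paths.add(path_key)
        # Keep the cached set bounded
        if len(self.processed_paths) > 1000:
            self.processed_paths.clear()
        self.update_processing_state(file_path)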
update_processing_state
All transcripts must join a queue and wait their turn to be processed.
- processing_state: A class in main.py with attributes such as:
  - current_path: A string representing the current file being processed.
  - is_processed: A boolean indicating whether the file has already been processed.
- processing_lock: A threading.Lock ensuring that processing is done one at a time, maintaining thread safety.
The function checks whether the current_path from processing_state matches the file currently being processed and whether is_processed is true. If so, it avoids duplicate processing. If current_path is None or is_processed is false, it starts a new path, sets the file path as current_path, and sets is_processed to false. Otherwise, it adds the file path to the queue if it’s not already there.
start_transcript_watcher
This function is called in main.py. It requires the processing_state class, the lock, and the queue as inputs.
- It ensures the correct transcripts/processed_text directory exists.
- It sets up an instance of the observer watching this directory.
- It starts the TranscriptWatcher handler and begins monitoring the folder.
- Returns the observer instance so it can be controlled in main.py.
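A sketch of how start_transcript_watcher wires this up with watchdog is shown below; the TranscriptWatcher constructor arguments are assumptions based on the description above.

    import os
    from watchdog.observers import Observer

    def start_transcript_watcher(processing_state, processing_lock, transcript_queue):
        watch_dir = os.path.join("transcripts", "processed_text")
        os.makedirs(watch_dir, exist_ok=True)

        handler = TranscriptWatcher(processing_state, processing_lock, transcript_queue)
        observer = Observer()
        observer.schedule(handler, watch_dir, recursive=False)
        observer.start()
        # Returned so main.py can stop and join it on shutdown
        return observer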
AudioHandler Class in monitorAudio.py
Initialization Phase
Creates an instance of the AudioProcessor class with the Pyannote authentication key (note that the Pyannote model is open source and free to use, but you need to provide your information). Audio files are processed one at a time, so they wait in a queue for their turn.
- Sets up self.processed_files and self.pending_files, which is a list used as a queue (FIFO) to track processed and pending files.
- self.handler_lock: A lock (mutex) for thread safety for operations on pending_files.
- Stores watch_folder: The directory to observe, where the audio files arrive.
- Upon initialization, it calls start_processing_thread.
start_processing_thread
The process_files function (the main loop) is contained within start_processing_thread and runs in an infinite loop. It checks for new files to process and handles each of them independently.
- The lock handler_lock ensures only one thread can access pending_files at a time.
- Error handling ensures that if there is a problem with processing an audio file, an error message is logged.
- There is a short delay at the end of each iteration to avoid excessive CPU usage.
start_processing_thread creates a daemon thread that runs process_files in the background, so audio processing doesn’t block the main program; the daemon flag also ensures the program can shut down cleanly.
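In outline, the processing thread looks like the sketch below. The attribute name self.processor for the AudioProcessor instance and the sleep interval are assumptions for illustration.

    import threading
    import time

    def start_processing_thread(self):
        def process_files():
            while True:
                next_file = None
                with self.handler_lock:
                    if self.pending_files:
                        next_file = self.pending_files.pop(0)  # FIFO queue
                if next_file:
                    try:
                        self.processor.process_audio(next_file)
                    except Exception as exc:
                        print(f"Error processing {next_file}: {exc}")
                time.sleep(0.5)  # short delay to avoid busy-waiting

        thread = threading.Thread(target=process_files, daemon=True)
        thread.start()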
add_file_to_pending
Uses a thread lock when accessing the list of pending files and adds a new file to the pending queue.
on_created
Triggered when a new file is created; checks whether the file has already been processed, then adds it to the processed_files set and queues it for processing.
on_moved
Handles files that are moved into the watch_folder and queues them for processing.
Fact_Check.py
The purpose of Fact_Check is to compare the transcript with the generated report. It uses a RAG (Retrieval-Augmented Generation) approach: a similarity search retrieves relevant evidence, and an LLM then compares each claim with that evidence and reasons about whether the claim in the report is accurate to the transcript.
SessionManager Class
Manages the session state for the fact-checking process, which main.py (the server) requires. It includes:
- A function to create a session UUID and store various information inside the sessions.
- A function to clean up the session.
generateEmbeddingsForTranscript
This function takes the transcript as a string, calls a function to split sentences with speaker tags (to split it into chunks), and then calls a function to generate embeddings. Each result is an np.ndarray, which is appended to a list. It returns the chunked texts, their embeddings, and their chunk indexes.
generateEmbeddingsForReport
Similar to generateEmbeddingsForTranscript, but it chunks differently: without speaker tags, processing sentence by sentence.
process_chunks
Takes the chunks and returns an np.ndarray of embeddings together with the texts.
index_embeddings
Takes the numpy array embeddings and uses FAISS (see more details about the library in the technology review section) to index them, returning the texts and index.
query_faiss_index
Takes the claim, the FAISS index, the texts accompanying the index, the query vector (embedding), and the number of closest results. It performs a search to match the embeddings and returns results, including the claim and the search results.
similarity_search
Takes the transcript string and report string. It:
- Generates embeddings for the transcript and report.
- Processes the chunks (process_chunks).
- Creates a FAISS index from the transcript embeddings and texts.
- Matches each claim in the report to the most relevant evidence in the transcript using query_faiss_index.
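A condensed sketch of the indexing and lookup steps with FAISS is shown below; the embedding dimensionality comes from whatever embedding model is used, and the return shapes are simplified relative to our actual functions.

    import numpy as np
    import faiss

    def index_embeddings(embeddings: np.ndarray, texts):
        # Build a flat L2 index over the transcript chunk embeddings
        index = faiss.IndexFlatL2(embeddings.shape[1])
        index.add(embeddings.astype(np.float32))
        return texts, index

    def query_faiss_index(claim, index, texts, query_vector: np.ndarray, k: int = 3):
        # Find the k transcript chunks closest to the claim's embedding
        distances, ids = index.search(query_vector.astype(np.float32).reshape(1, -1), k)
        evidence = [texts[i] for i in ids[0] if i != -1]
        return {"claim": claim, "evidence": evidence}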
verification_prompt
Contains the prompt used by the LLM to generate a response. It takes the claim and the results of the evidence from the similarity search as parameters.
Fact-Check Function in main.py
@app.post("/create-session"):
Creates a session ID using an instance of the SessionManager
class.
@app.post("/process-documents"):
Gets the session ID, retrieves the transcript and report from the frontend, and uses the similarity_search
function. Transmits the session ID, claims, and evidences.
@app.get("/stream-fact-check"):
Gets the session ID from the event source in the frontend (passed as a query parameter in the URL). Compares it to the session ID in the SessionManager
instance, processes each claim, and streams responses using the DeepSeek model and the prompt.
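The streaming endpoint follows FastAPI's standard server-sent-events pattern; roughly as below, where session_manager.get_session and fact_check_claim are hypothetical stand-ins for our session lookup and the DeepSeek call with the verification prompt.

    from fastapi.responses import StreamingResponse

    @app.get("/stream-fact-check")
    async def stream_fact_check(session_id: str):
        session = session_manager.get_session(session_id)  # hypothetical session lookup

        async def event_stream():
            for claim, evidence in zip(session["claims"], session["evidences"]):
                verdict = fact_check_claim(claim, evidence)  # hypothetical LLM call
                # Server-sent events: each message is prefixed with "data:" and ends with a blank line
                yield f"data: {verdict}\n\n"

        return StreamingResponse(event_stream(), media_type="text/event-stream")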
Splitting_chunks.py
chunk_report_claims
This function chunks reports by:
- Removing markers and splitting into sections, removing the headers of the report.
- Splitting into sentences and adding them to the list of claims for the fact-checking pipeline.
sentence_split_with_speaker_tag
Processes a transcript of dialogue between speakers and splits it into smaller chunks while preserving the speaker tag.
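A simplified version of this splitting logic is sketched below, assuming transcript lines follow the SPEAKER_XX: text format produced earlier; the real function may use a different sentence splitter.

    import re

    def sentence_split_with_speaker_tag(transcript: str):
        chunks = []
        for line in transcript.splitlines():
            if ":" not in line:
                continue
            speaker, text = line.split(":", 1)
            # Split the speaker's turn into sentences, keeping the tag on every chunk
            for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
                if sentence:
                    chunks.append(f"{speaker.strip()}: {sentence}")
        return chunks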
Report_Generation.py
Contains generate_medical, which includes the prompt the program uses to generate a report.
Run_granite.py
Classes
Includes classes such as GraniteModel and DeepSeekModel. Each class handles the loading of one LLM using a singleton pattern, to avoid loading large LLM models multiple times.
- A lock ensures thread safety, preventing multiple threads from creating an instance simultaneously.
- There is an option to use the optimum.intel or llama.cpp libraries to load models.
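The singleton idea can be sketched as follows; the _load_model method is a placeholder for the actual optimum.intel or llama.cpp loading code.

    import threading

    class DeepSeekModel:
        _instance = None
        _lock = threading.Lock()

        @classmethod
        def get_instance(cls):
            # Double-checked locking: only one thread ever loads the model
            if cls._instance is None:
                with cls._lock:
                    if cls._instance is None:
                        cls._instance = cls()
            return cls._instance

        def __init__(self):
            self.model = self._load_model()

        def _load_model(self):
            # Placeholder for the optimum.intel / llama.cpp loading logic
            ...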
Design Changes
Initially, the program was designed to reload the AI model each time it was used, which made things slow. This was changed to load only a single instance, allowing the AI model to be loaded once and reused.
run_granite and run_deepseek
These functions run the LLM model by getting an instance of the model and producing a streaming response.
Recorder.py
This code is designed to simulate the frame glasses, in case the frame is not available. PyAudio is used as it provides configurable parameters such as 16kHz sample rate, 1024-byte chunk size, and mono-channel PCM-16 format. This configuration balances audio quality with system responsiveness, particularly important for wearable device simulations.
The recording lifecycle follows a sequence beginning with initialization through a simulated tap gesture (triggered by keyboard input). Once activated, the system enters a continuous recording state where audio data gets buffered in memory. A dual-locking mechanism ensures thread safety during this process - using threading.Lock for synchronous operations and asyncio.Lock for asynchronous coordination. This prevents race conditions while maintaining system responsiveness.
Audio processing occurs through a dedicated pipeline that handles the conversion of raw audio data into WAV format files. The implementation includes atomic file operations to prevent corruption, first writing to temporary files before finalizing the save operation. For backend integration, the system communicates with a FastAPI service using asynchronous HTTP requests through the httpx library, allowing for non-blocking API calls to start and stop recording sessions.
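The atomic save described above can be sketched like this; the filename handling and parameters are illustrative rather than our exact implementation.

    import os
    import wave

    def save_wav_atomically(frames: bytearray, path: str, sample_rate: int = 16000):
        tmp_path = path + ".tmp"
        with wave.open(tmp_path, "wb") as wf:
            wf.setnchannels(1)        # mono
            wf.setsampwidth(2)        # 16-bit PCM
            wf.setframerate(sample_rate)
            wf.writeframes(bytes(frames))
        # Rename only after the file is fully written, so readers never see a partial WAV
        os.replace(tmp_path, path)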
The tap gesture simulation provides an intuitive user interface, mapping keyboard input to device interaction patterns. This abstraction allows for realistic testing of the recording functionality without requiring physical hardware. The system maintains state consistency throughout the recording lifecycle, properly handling concurrent start/stop requests and ensuring all audio data gets preserved.
Performance considerations guided several design choices, including the use of ThreadPoolExecutor for blocking I/O operations and memory-efficient bytearray buffers for audio storage. The architecture supports future enhancements such as audio streaming, real-time processing, and additional gesture controls while maintaining the current system's reliability and responsiveness.
The recorder.py script also helped us diagnose why events sometimes weren’t triggered or detected: the watchdog observer is not always sensitive, and you may need to trigger an event explicitly.
Frame.py
This code implements an audio recording application for Brilliant Labs Frame glasses. The Audio Recording and Saving feature is a core functionality of the Frame smart glasses application, enabling users to record audio and save it as .wav files. This feature is implemented using asynchronous programming with Python's asyncio library. The frame_sdk library is utilized to interact with the glasses' hardware, specifically the microphone, motion, and display. The microphone captures audio data in chunks, which are stored in a byte array. When recording stops, the wave module is used to save the audio data as a .wav file, ensuring the correct format (mono, 16-bit, 16 kHz sample rate).
User interaction is handled through tap gestures on the glasses. When tapped, the application either starts or stops recording by calling the start_recording() or stop_recording() methods, respectively. These methods manage the recording state and initiate or terminate the record_continuously() task, which captures audio data in the background. Additionally, the application communicates with a FastAPI backend using the httpx library to control recording via API endpoints. If the API call fails, the application falls back to direct control, ensuring robustness.
Visual feedback is provided on the glasses' display, showing "Recording" or "Stopped" messages to inform the user of the current state. The recorded files are saved in an audio directory within the project root, with filenames incremented for each new recording. This feature combines hardware interaction, asynchronous programming, and user input handling to deliver a seamless audio recording experience on Frame smart glasses.