Libraries and Tools
Throughout the project we used Python as the backend language, together with the libraries and modules listed below to code the backend part of the project.
- Python
We used Python to build our backend, including Script Generation, Audio Generation and Video Generation.
- moviepy
Python library for video and audio processing.
We used moviepy to process audio files, for example concatenating multiple mp3 files and naming the resulting mp3 file. We also used it for video processing, such as working out the duration of a video and combining video and audio clips together.
- ibm watson
Python library for text-to-speech conversion.
We used IBM Watson for the pre-made voices that we added to our voices dictionary. The library also handled the calls to IBM Watson's API for us.
- elevenlabs
Python library for text-to-speech conversion.
We used the elevenlabs library to generate an audio file from text with a chosen voice via the ElevenLabs API.
- textblob
Python library for processing textual data.
We used textblob to analyse the sentiment of the script and determine whether it was positive, negative or neutral.
- pydub
Python library for audio manipulation.
We used pydub for adding music to our episodes.
- os
Python module for using operating system dependent functionality.
We used os to remove the files used in generation that were no longer needed.
- OpenAI
Python library for communicating with OpenAI's server-side LLMs.
We used it to generate the podcast script when the user provides an OpenAI API key.
- LangChain
LangChain is a large language model integration framework.
We used it to integrate both local and server-side LLMs into our application.
- Transformers
Transformers provides APIs to quickly download and use pretrained models from Hugging Face for a wide variety of tasks.
We used it to create embeddings that transform text into vectors.
- nltk
Python library for natural language processing tasks.
We used nltk's functions to identify the keywords of the script, which we later used in the image generation prompt.
- dotenv
Python module used for loading environment variables.
We used dotenv to hide sensitive information while still being able to use it. We created a .env file, stored the API keys in it and used dotenv to load them back in.
- collections
Python module used for providing specialized data structures.
We used the Counter class from collections to count the frequency of words, which helped us determine the keywords used for image generation.
- requests
Python library for making HTTP requests.
We used requests to send a POST request to a specified URL in order to generate an image based on given text prompts and additional parameters.
- numpy
Python library used for numerical computing.
We used numpy to convert an image into a numpy array, which let us obtain the image's dimensions that were later used in video generation.
- random
Python module used for generating random numbers.
We used random to pick a picture for video generation from a folder of 20+ pictures.
Script Generation
The script generation works as follows:
Single Agent Script
Firstly, there are two classes the user can choose from, depending on the type of podcast they want: single or duo. The classes are split into two files, singleAgentScript.py and multiAgentScript.py. Let's have a look at singleAgentScript.py first.
Initialising of the SingleAgentScript class
Initialising of the OpenAISingleAgentScript class
The script chooses the class according to whether the user wants a local LLM or a server LLM. Both classes follow the same procedure to generate the script: first, the LLM brainstorms ideas on the topic, then generates the Introduction, Body and Conclusion scripts. It then expands on each premise from the brainstorming stage and produces the podcast script, written as a single-person podcast, from those expanded premises.
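A minimal sketch of this brainstorm-then-expand flow is shown below; the class and method names are illustrative rather than the exact ones in singleAgentScript.py, and llm stands in for whichever LangChain model was chosen.

```python
# Illustrative sketch only: the real class in singleAgentScript.py may use
# different names and prompts, but follows the same brainstorm-then-expand flow.
class SingleAgentScriptSketch:
    def __init__(self, llm):
        self.llm = llm  # any callable mapping a prompt string to generated text

    def brainstorm(self, topic):
        # Ask the LLM for a handful of premises about the topic.
        raw = self.llm(f"List 5 short premises for a podcast about {topic}, one per line.")
        return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]

    def generate(self, topic):
        premises = self.brainstorm(topic)
        intro = self.llm(f"Write a podcast introduction about {topic}.")
        # Expand each premise from the brainstorming stage into a body segment.
        body = "\n\n".join(
            self.llm(f"As a solo podcast host, expand on this premise: {p}")
            for p in premises
        )
        conclusion = self.llm(f"Write a podcast conclusion summarising: {', '.join(premises)}.")
        return "\n\n".join([intro, body, conclusion])
```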
Multi Agent Script
For the multi-agent podcast, we model the code to simulate two AI agents talking together like a duo podcast (for example, The Joe Rogan Experience). The __init__ method accepts an OpenAI API key if one is given; otherwise it initialises a local LLM.
This is how the class initialises the LLM.
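A hedged sketch of that initialisation, assuming LangChain wrappers on both paths; the model names, parameters and local backend used in multiAgentScript.py may differ.

```python
# Sketch only: the real __init__ in multiAgentScript.py may pick different
# models, wrappers or parameters.
import os

def initialise_llm(openai_api_key=None):
    if openai_api_key:
        # Server-side LLM through the OpenAI API (LangChain wrapper).
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model="gpt-3.5-turbo", api_key=openai_api_key, temperature=0.7)
    # Otherwise fall back to a local model; LlamaCpp is one possible backend
    # and the model path here is purely illustrative.
    from langchain_community.llms import LlamaCpp
    return LlamaCpp(model_path=os.getenv("LOCAL_MODEL_PATH", "models/llama-2-7b.gguf"))
```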
We tried different prompt engineering methods to get the desired result and found that one particular template gives the best results for duo podcast generation.
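The sketch below only approximates the shape of that template; the verbatim prompt lives in multiAgentScript.py.

```python
# Approximate shape of the duo-podcast prompt template, not the exact wording
# used in multiAgentScript.py.
from langchain.prompts import PromptTemplate

DUO_TEMPLATE = PromptTemplate(
    input_variables=["topic", "history"],
    template=(
        "You are simulating a two-person podcast between a Host and a Guest.\n"
        "Topic: {topic}\n"
        "Conversation so far:\n{history}\n"
        "Continue the conversation. Prefix every line with 'Host:' or 'Guest:' "
        "and keep each turn to a few sentences."
    ),
)

prompt = DUO_TEMPLATE.format(topic="space exploration", history="")
```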
If you want a closer look at our podcast script generation logic, please see the code in multiAgentScript.py and singleAgentScript.py.
RAG Implementation
Our RAG implementation lives in the documentParser directory, which contains the rag_utils.py file. First of all, the code creates embeddings by loading the "sentence-transformers/all-MiniLM-l6-v2" model from the Hugging Face Hub. It then extracts information from each file in the document directory (the files can be pdf or txt). The embeddings transform this information into a vector database, on which we can run similarity checks between the given podcast generation prompt and the known information in the document directory.
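A condensed sketch of this flow, assuming LangChain document loaders, a FAISS vector store and the embedding model named above; the directory name, store choice and chunking parameters are illustrative.

```python
# Sketch of the pipeline in documentParser/rag_utils.py; the real code may use
# a different vector store or loader, but the flow is the same.
import os
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-l6-v2")

docs = []
for name in os.listdir("document"):  # each file can be pdf or txt
    path = os.path.join("document", name)
    loader = PyPDFLoader(path) if name.endswith(".pdf") else TextLoader(path)
    docs.extend(loader.load())

chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)
store = FAISS.from_documents(chunks, embeddings)  # the vector database

# Similarity check between the podcast prompt and the stored documents.
relevant = store.similarity_search("history of space travel", k=3)
```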
Initialising of the Hugging Face embedding
Main implementation of saving vector data from the document directory
Audio Generation
The audio generation works as follows:
Firstly, the text script is passed to TextToSpeech.py, which then asks the user to choose voices for the host and the guest. If the user does not want to change the voices, they are set to default values.
Voices Dictionary
Note: if the chosen voice starts with "en-", IBM Watson Text-to-Speech is used for the voices. If it does not, ElevenLabs Voice Cloning is used to generate the corresponding audio, as we used two different text-to-speech tools. The script is then read speech by speech, with turns identified by "Host" or "Guest" indicators, and the code generates an mp3 file for each speech segment using the selected voices.
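A simplified sketch of this dispatch and turn-splitting logic; the synth_* helpers and the default voice names below stand in for the real IBM Watson and ElevenLabs calls made in TextToSpeech.py.

```python
# Sketch only: the real TextToSpeech.py wires these helpers to the actual
# IBM Watson and ElevenLabs APIs.
def synth_with_watson(text, voice, out_path):
    ...  # IBM Watson Text-to-Speech call goes here

def synth_with_elevenlabs(text, voice, out_path):
    ...  # ElevenLabs voice-cloning call goes here

def script_to_segments(script, host_voice="en-GB_JamesV3Voice", guest_voice="Bella"):
    segments = []
    for i, line in enumerate(script.splitlines()):
        if not line.startswith(("Host:", "Guest:")):
            continue
        speaker, text = line.split(":", 1)
        voice = host_voice if speaker == "Host" else guest_voice
        out_path = f"segment_{i}.mp3"
        # Voices whose names start with "en-" belong to IBM Watson;
        # everything else is routed to ElevenLabs.
        if voice.startswith("en-"):
            synth_with_watson(text.strip(), voice, out_path)
        else:
            synth_with_elevenlabs(text.strip(), voice, out_path)
        segments.append(out_path)
    return segments
```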
Choosing Tool
Identifying Turns
Then, AudioMerger.py combines all of the individual mp3 files into a single one, called final_speech.mp3, and deletes the individual files as they are no longer needed.
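A sketch of that merging step using moviepy, which the project already uses for concatenating mp3 files; the file names are illustrative.

```python
# Sketch of what AudioMerger.py does with moviepy.
import os
from moviepy.editor import AudioFileClip, concatenate_audioclips

def merge_segments(segment_paths, out_path="final_speech.mp3"):
    clips = [AudioFileClip(p) for p in segment_paths]
    concatenate_audioclips(clips).write_audiofile(out_path, fps=44100)
    for clip in clips:
        clip.close()
    # The individual segment files are no longer needed once merged.
    for p in segment_paths:
        os.remove(p)
```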
Merging MP3 Files
After that, the analyze_sentiment function evaluates the script, marking it as positive, negative or neutral by determining its polarity (polarity scores help quickly identify the overall mood of a text). The text is quantified with a number between -1 and 1, hence we only have three options for the polarity: a score closer to -1 represents a negative sentiment, a score closer to 1 is positive, and a score around 0 is neutral. Based on the sentiment of the script, the appropriate music is then applied to final_speech.mp3 to produce the final audio file.
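A sketch of the sentiment check and the music overlay, assuming TextBlob for polarity and pydub for mixing; the polarity thresholds and music file names are illustrative.

```python
# Sketch of analyze_sentiment and the music step; thresholds and paths are
# illustrative, not the project's exact values.
from textblob import TextBlob
from pydub import AudioSegment

def analyze_sentiment(script):
    polarity = TextBlob(script).sentiment.polarity  # value in [-1, 1]
    if polarity > 0.1:
        return "positive"
    if polarity < -0.1:
        return "negative"
    return "neutral"

def apply_music(sentiment, speech_path="final_speech.mp3", out_path="final_audio.mp3"):
    speech = AudioSegment.from_mp3(speech_path)
    music = AudioSegment.from_mp3(f"music/{sentiment}.mp3") - 15  # quieter background
    speech.overlay(music, loop=True).export(out_path, format="mp3")
```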
Determining Polarity Score
Applying Music
Video Generation
The video generation works as follows:
To begin with, an image needs to be generated to be displayed throughout the video. There are two ways it can be generated. In the first, the generate_image function is executed with the text variable (which contains the generated script) as a parameter. It then calls another function, extract_keywords, to retrieve 7 keywords (we found that 7 keywords works well, but the number can be changed). This function tokenises the text (splits it into individual words), removes the stopwords (such as 'the', 'a' and 'and'; words that do not contribute to the meaning of the text) and counts how many times each word appears to determine the 7 most common words in the text; these become the keywords.
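A sketch of this keyword extraction with nltk and collections.Counter, following the steps just described.

```python
# Sketch of extract_keywords; the nltk downloads are only needed once.
import nltk
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def extract_keywords(text, n=7):
    words = word_tokenize(text.lower())                           # tokenise
    stops = set(stopwords.words("english"))
    words = [w for w in words if w.isalpha() and w not in stops]  # drop stopwords
    return [word for word, _ in Counter(words).most_common(n)]    # n most common
```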
Identifying Keywords of Script
Then, the generate_image function uses the keywords inside the prompt, providing StabilityAI with the details of the desired image. Once it receives the response, it checks whether the HTTP response is successful (status code 200), extracts the image data from the response JSON (json converts HTTP responses into Python dictionaries that can be accessed within Python), and saves it as a PNG file named "text_to_image.png" in the current directory. However, if the response is not successful, it raises an exception.
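A hedged sketch of such a request; the Stability AI endpoint, engine name and response fields shown here are assumptions rather than the project's verbatim values.

```python
# Assumed endpoint and response shape; the real generate_image may differ.
import base64
import os
import requests

def generate_image(keywords, out_path="text_to_image.png"):
    prompt = "A podcast cover illustration about " + ", ".join(keywords)
    response = requests.post(
        "https://api.stability.ai/v1/generation/stable-diffusion-v1-6/text-to-image",
        headers={
            "Authorization": f"Bearer {os.environ['STABILITY_API_KEY']}",
            "Accept": "application/json",
        },
        json={"text_prompts": [{"text": prompt}], "samples": 1},
    )
    if response.status_code != 200:
        raise Exception(f"Image generation failed: {response.text}")
    image_b64 = response.json()["artifacts"][0]["base64"]  # base64-encoded PNG
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(image_b64))
```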
Sending a prompt
Checking for errors and naming the image
A different, simpler approach is to randomly select an image from a folder of pre-made pictures. It starts by determining the file path of the image that is going to be used via a very intuitive function called choose_image, which randomly selects a number between 1 and 50; the image with this number is then chosen.
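A sketch of choose_image; the folder name and numbering scheme of the pre-made pictures are assumptions.

```python
# Sketch only: the real folder layout and file names may differ.
import random

def choose_image(folder="premade_images"):
    number = random.randint(1, 50)  # pick one of the numbered pictures
    return f"{folder}/image_{number}.png"
```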
Choosing a random image
Pre-made images
Once we have an image, the process of creating a video from it begins. Firstly, the code loads the image from the specified path using the PIL library. After that, it converts the image into a NumPy array for more efficient manipulation and processing of the image data. Then, the code determines the length of the audio clip and creates a video clip with the static image, using the image array as the frame generator function and setting the duration of the video to match the audio duration. Finally, it combines the video clip with the audio clip to create the final video and writes it to an MP4 file with the specified codecs, frame rate and threading options.
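A sketch of this assembly step with PIL, numpy and moviepy; the codec, frame rate and thread count are illustrative.

```python
# Sketch of the video assembly; every frame is the same image and the clip
# length matches the audio.
import numpy as np
from PIL import Image
from moviepy.editor import AudioFileClip, VideoClip

def make_video(image_path, audio_path="final_audio.mp3", out_path="final_video.mp4"):
    frame = np.array(Image.open(image_path).convert("RGB"))  # static frame
    audio = AudioFileClip(audio_path)
    video = VideoClip(lambda t: frame, duration=audio.duration).set_audio(audio)
    video.write_videofile(out_path, codec="libx264", audio_codec="aac",
                          fps=24, threads=4)
```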
Creating video clip