Research

Ask Bob voice assistant framework

Ask Bob is most directly comparable to cloud-based voice assistants such as Amazon Alexa [1] and Google Assistant [2]. Both typically come pre-installed on mobile and Internet-of-Things (IoT) devices and provide voice commands aimed primarily at a general audience, largely relating to services those technology companies already provide. For example, Google Assistant is integrated with Google Calendar, allowing users to manage their calendars by voice.

Just like Amazon Alexa and Google Assistant, Ask Bob has to perform three main steps to interpret voice commands: speech transcription, query understanding and response, and speech synthesis. However, whereas those assistants can rely on cloud access to perform these tasks, giving them far greater computational resources over the Internet, Ask Bob must perform all three steps locally.

Both Amazon Alexa and Google Assistant also give third-party developers the ability to extend the voice assistants through software development kits (SDKs): Amazon Alexa through Alexa Skills Kit (ASK) SDKs [3] for Node.js, Java and Python; and Google Assistant through the Google Assistant SDK [4] for Go, Java, C#, Node.js and Ruby.

We found the way these base voice assistants can be modularly extended with third-party add-ons built using SDKs particularly interesting, and it inspired us to design our own plugin system as part of a fully modular Ask Bob voice assistant framework.

Technology review

Supported devices

One of our principal aims was to deploy Ask Bob onto a physical low-power hardware smart speaker-style device as a proof-of-concept prototype.

We initially targeted the Raspberry Pi 4 Model B [5]: a relatively inexpensive (starting from $35), low-power, ARM-based device that runs Linux and into which we could plug a USB microphone and speaker – potentially both on a single add-on board – to assemble such a prototype. With its quad-core processor, it ought to be able to run the machine learning libraries used for speech transcription and query response. Moreover, given the memory requirements of machine learning, we were specifically looking at a Raspberry Pi 4 Model B with at least 4GB, and preferably 8GB, of RAM.

We also considered Intel Next Unit of Computing (NUC)-based devices [6] on which either Windows or Linux can be installed. These devices typically have central processing units (CPUs) designed for mobile devices, rather than desktop devices, and therefore have lower power consumption. They come in several variants corresponding to the different CPU brands in Intel’s line-up: Atom, Celeron, Core i3, Core i5, Core i7, etc.

The default Python wheels on the Python Package Index (PyPI) for TensorFlow [7] are compiled with support for Advanced Vector Extensions (AVX and AVX2) to the x86-64 instruction set architecture, which the Atom- and Celeron-based Intel NUCs lack. Thus, to ensure adequate computational headroom and to benefit from the simpler pre-built wheel installation process rather than having to recompile TensorFlow on installation, we limited the scope of our considerations to Intel Core i3-based NUC devices with at least 4GB, and preferably 8GB, of RAM.

Although many of the libraries we used are theoretically compatible with both platforms, installation would be significantly simpler on the Intel NUC devices because their 64-bit Intel CPUs use the x86-64 instruction set architecture (ISA); the ARMv8-A ISA used by the Raspberry Pi 4 Model B would require a more convoluted installation procedure involving recompiling certain dependencies.

Additionally, an Intel NUC device running Windows could be more familiar to system administrators deploying Ask Bob and could offer greater driver support for microphone and speaker devices than the Raspberry Pi 4 Model B. These factors, together with the integration between the different FISE teams, increasing resource requirements and the aim of creating an Amazon Echo Show-style device, persuaded us to assemble a prototype based on an Intel Core i3-based NUC device.

The specific device we settled on, after balancing cost with performance requirements, was the Jumper EZbox i3 from Jumper Computer Technology Co., Ltd. It has an Intel Core i3-5005U (Broadwell) CPU, 8GB RAM, a 128GB SSD and Wi-Fi connectivity in a compact 4.39in x 4.88in x 1.67in form factor.

Programming language

We initially considered writing Ask Bob in JavaScript to match the configuration generator web app and therefore use a shared language across the different applications of the project. We also briefly considered Java as a more mature ecosystem with cross-platform support, which would make Ask Bob more easily portable, especially if future developers wanted to build on the Ask Bob codebase.

Despite these factors, we were instead drawn to Python for several key reasons. First and foremost was the availability of libraries matching our requirements – particularly those related to machine learning, such as Rasa and SpaCy. We had also started considering Mozilla’s DeepSpeech library in parallel with choosing our language; DeepSpeech has a Python binding, whereas its JavaScript binding is intended for Node.js and its Java binding is for Android only.

Python is compatible with both Windows and Linux (on x86-64 and ARM), so it would give us the greatest flexibility when it came to development and deploying Ask Bob on a physical device. Given it is also a common first language of choice for new developers, it would also make the creation of new Ask Bob plugins more accessible and help to incentivise the formation of a plugin ecosystem.

Libraries

The development of the voice assistant framework had three principal areas of focus for research: speech transcription, speech synthesis and query response.

Speech transcription

When investigating speech-to-text libraries, we came across automatic speech recognition solutions such as IBM Watson Speech to Text [8] and Google Cloud Speech-to-Text [9]; however, while these services accurately transcribe speech, they process it remotely in the cloud and are closed source, so they unfortunately did not meet our requirements.

We also came across the Uberi speech recognition library [10], which wraps several speech-to-text services; however, most of the services it wraps also require cloud access, and while it does support CMU Sphinx, which can transcribe speech locally, the library itself had not been updated in over a year and had 190+ issues outstanding on GitHub.

Finally, we researched Mozilla’s DeepSpeech machine learning-based speech recognition library [11]. It was ideal for our needs for the following reasons:

Internally, DeepSpeech has been developed using an end-to-end deep learning approach, allowing it to perform better in noisier environments [12], such as those where Ask Bob may eventually be deployed. Its approach is based on an optimised recurrent neural network (RNN) trained on a large volume of data augmented with synthetic noise to generate larger training sets. The DeepSpeech project also provides a pretrained English acoustic model and external scorer that we would be able to make use of in our own project.
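To illustrate how the pretrained acoustic model and external scorer can be used from Python, the following is a minimal sketch; the file names are illustrative, and the recording is assumed to be 16 kHz, 16-bit, mono audio as DeepSpeech expects.

    # Minimal DeepSpeech transcription sketch; model, scorer and WAV file names are
    # illustrative, and the recording is assumed to be 16 kHz, 16-bit, mono audio.
    import wave
    import numpy as np
    from deepspeech import Model

    model = Model("deepspeech-0.9.3-models.pbmm")
    model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

    with wave.open("query.wav", "rb") as recording:
        frames = recording.readframes(recording.getnframes())
        audio = np.frombuffer(frames, dtype=np.int16)

    print(model.stt(audio))  # e.g. "what is the weather like tomorrow"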

Furthermore, UCL mathematical computation masters student Jiaxing Huang has been working on a web-based system to produce enhanced DeepSpeech models and external scorers in soon-to-be-published research that could then be used by Ask Bob voice assistant builds.

Speech synthesis

Having settled on DeepSpeech for speech transcription, we were enticed by Mozilla’s counterpart offering for text-to-speech – mozilla/TTS [13] – which aimed to provide high quality local speech synthesis using machine learning. Unfortunately, this project was only available as an alpha pre-release at the time of research, so we continued exploring other options.

The Python library pyttsx3 [14], on the other hand, does work offline, internally tapping into offline speech engines such as sapi5, nsss and espeak. On Windows, it uses the offline text-to-speech voices [15] bundled with the operating system by Microsoft. This satisfied our requirements, so we decided to use pyttsx3 for speech synthesis.
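As a brief illustration, speech synthesis with pyttsx3 amounts to initialising an engine and queueing phrases to be spoken; the phrase and speaking rate below are only examples.

    # Minimal pyttsx3 sketch: init() selects an offline backend automatically
    # (sapi5 on Windows, nsss on macOS, espeak on Linux).
    import pyttsx3

    engine = pyttsx3.init()
    engine.setProperty("rate", 160)   # speaking rate in words per minute
    engine.say("Hello, I am Ask Bob. How can I help?")
    engine.runAndWait()               # blocks until the queued speech has been spoken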

Query response

Our strategy for performing the important task of responding to users’ queries evolved throughout the course of our research as we considered alternative approaches.

Initially, we considered a naïve approach using an offline search engine library, such as the pure-Python document search library Whoosh [16], which is inspired by Apache Lucene [17]: user intents and descriptions for each supported voice assistant skill would be stored in a series of documents searched whenever the user makes a query.
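Purely for illustration, a sketch of this rejected approach might have looked like the following, where each skill is represented by a document of descriptive keywords; the skills, descriptions and index directory are hypothetical.

    # Hypothetical sketch of the rejected Whoosh-based approach: each skill is a
    # document of descriptive keywords, and the transcribed query is run against
    # the index to pick the best-matching skill.
    import os
    from whoosh.fields import Schema, TEXT, ID
    from whoosh.index import create_in
    from whoosh.qparser import QueryParser, OrGroup

    schema = Schema(skill=ID(stored=True), description=TEXT)
    os.makedirs("skill_index", exist_ok=True)
    ix = create_in("skill_index", schema)

    writer = ix.writer()
    writer.add_document(skill="weather", description="weather forecast temperature rain sunny today tomorrow")
    writer.add_document(skill="timer", description="set start stop timer alarm reminder countdown")
    writer.commit()

    with ix.searcher() as searcher:
        query = QueryParser("description", ix.schema, group=OrGroup).parse("what is the weather like in London")
        results = searcher.search(query)
        best_skill = results[0]["skill"] if results else None
        # The right skill can be found, but "London" still has to be extracted separately.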

This approach was unsatisfactory for a number of reasons, chief among them that it would struggle to extract named entities, such as place names, from users’ transcribed speech. For example, once a weather skill had been identified, its action code would still have to perform additional input parsing to identify the geographical location for which the user was requesting the forecast.

These problems of identifying the user’s intent, extracting entities and then executing the correct registered action for that voice assistant skill led us to explore an approach based on natural language processing. Our solution combines the conversational natural language understanding framework Rasa [18] with the natural language processing library SpaCy [19].

SpaCy supports a wide range of languages, providing 46 statistical models for 16 languages, which would allow Ask Bob, if used in conjunction with an appropriate DeepSpeech model and text-to-speech voice, to support languages other than English.

Moreover, the SpaCy project provides three pretrained English models of varying sizes (small, medium and large), which suits our needs with respect to deployment on low-power devices that could face storage constraints. These models come with support for named entity recognition and extraction using a transition-based algorithm [20], covering, for example, geographical locations, person names and organisation names.
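As a short example of the kind of entity extraction these models provide, the snippet below runs a pretrained English model over an illustrative utterance; the medium model is assumed to have been downloaded beforehand (python -m spacy download en_core_web_md).

    # Named entity extraction with a pretrained SpaCy English model; the model is
    # assumed to have been downloaded beforehand and the utterance is illustrative.
    import spacy

    nlp = spacy.load("en_core_web_md")
    doc = nlp("What's the weather like in Paris next Tuesday?")

    for ent in doc.ents:
        print(ent.text, ent.label_)
    # Expected output along the lines of:
    #   Paris GPE
    #   next Tuesday DATE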

Furthermore, Rasa has a flexible natural language understanding component based on its multitask Dual Intent and Entity Transformer (DIET) architecture [21] that can be used to train models for conversational text-based assistants to classify users’ intents, even on a limited sample of training data. This is ideal for Ask Bob builds where users may only be using a small sample of plugins at a time. Rasa also supports using SpaCy components within its NLU pipelines.

We have been using SpaCy components for tokenisation, featurisation and entity extraction within our Rasa NLU task pipeline.
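As a rough sketch of how such a pipeline is used once a model has been trained, the snippet below loads a trained Rasa (2.x) NLU model and parses an utterance; the model path, utterance, intent name and entity values are illustrative.

    # Hedged sketch of querying a trained Rasa 2.x NLU model from Python; the
    # model directory, intent name and entity values are illustrative.
    from rasa.nlu.model import Interpreter

    interpreter = Interpreter.load("models/nlu")   # path to the unpacked NLU model
    result = interpreter.parse("what's the weather like in London?")

    print(result["intent"]["name"])                # e.g. "ask_weather"
    print([(e["entity"], e["value"]) for e in result["entities"]])  # e.g. [("GPE", "London")]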

As an alternative to SpaCy, we also investigated MITIE – an information extraction tool written in C++ with a Python binding [22] – within our Rasa pipeline; however, it has a more complex installation process requiring compilation from source, appears to be larger than SpaCy and may be slower to train with, so we decided not to use it. The flexibility is nevertheless there for any future developer wishing to use it within a project based on Ask Bob.

Unlike IBM Watson and Microsoft LUIS [23], this Rasa and SpaCy setup can perform language understanding offline without requiring cloud access, thereby meeting our project requirements and giving us flexibility when it comes to designing our plugin system.

Web server

Finally, when it came to integrating with the other FISE teams, we had to develop a RESTful HTTP interface for the Ask Bob voice assistant framework. Of the popular Python web microframework libraries, such as Flask [24], we decided to use Sanic [25], as it is built on Python’s asynchronous (async/await) programming model, which allows the Ask Bob server code to be non-blocking – something Flask does not offer by default.
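To illustrate the non-blocking style this enables, a minimal Sanic endpoint of the kind used to accept queries over HTTP might look as follows; the route, port and payload shape are illustrative rather than Ask Bob’s actual API.

    # Minimal Sanic sketch of an asynchronous query endpoint; the route, port and
    # payload shown are illustrative, not Ask Bob's actual API.
    from sanic import Sanic
    from sanic.response import json

    app = Sanic("askbob")

    @app.post("/query")
    async def query(request):
        text = (request.json or {}).get("message", "")
        # ... pass the text through the query-response pipeline and await the reply ...
        return json({"response": f"You said: {text}"})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8000)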

Summary of technical decisions

Technical problem | Solution summary
Supported device | Intel Core i3-based NUC
Programming language | Python
Speech transcription library | Mozilla DeepSpeech paired with the WebRTC voice activity detector
Speech synthesis library | pyttsx3
Query response libraries | Rasa and SpaCy natural language processing pipeline
Web server microframework library | Sanic

References

[1] Amazon.com, Inc., “Amazon Alexa,” March 2021. [Online]. Available: https://developer.amazon.com/en-GB/alexa. [Accessed 27 March 2021].

[2] Google LLC, “Google Assistant,” March 2021. [Online]. Available: https://assistant.google.com/. [Accessed 27 March 2021].

[3] Amazon.com, Inc., “Alexa Skills Kit SDKs,” March 2021. [Online]. Available: https://developer.amazon.com/en-GB/docs/alexa/sdk/alexa-skills-kit-sdks.html. [Accessed 27 March 2021].

[4] Google LLC, “Google Assistant SDK,” March 2021. [Online]. Available: https://developers.google.com/assistant/sdk. [Accessed 27 March 2021].

[5] Raspberry Pi Foundation, “Raspberry Pi 4 Model B Specifications,” January 2021. [Online]. Available: https://www.raspberrypi.org/products/raspberry-pi-4-model-b/specifications/. [Accessed 28 March 2021].

[6] Intel, “Intel NUC Kits,” January 2021. [Online]. Available: https://www.intel.co.uk/content/www/uk/en/products/boards-kits/nuc/kits.html. [Accessed 28 March 2021].

[7] Google Inc., “tensorflow - PyPI,” 21 January 2021. [Online]. Available: https://pypi.org/project/tensorflow/#files. [Accessed 27 March 2021].

[8] IBM, “IBM Watson Speech to Text,” [Online]. Available: https://www.ibm.com/cloud/watson-speech-to-text. [Accessed 29 November 2020].

[9] Google, “Google Cloud: Speech-to-Text,” [Online]. Available: https://cloud.google.com/speech-to-text. [Accessed 29 November 2020].

[10] “Github - Uberi / speech_recognition,” 2 July 2019. [Online]. Available: https://github.com/Uberi/speech_recognition. [Accessed 29 November 2020].

[11] Mozilla, “GitHub - Mozilla/DeepSpeech,” 10 December 2020. [Online]. Available: https://github.com/mozilla/DeepSpeech/. [Accessed 13 December 2020].

[12] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates and A. Y. Ng, “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.

[13] Mozilla, “GitHub - mozilla/TTS,” November 2020. [Online]. Available: https://github.com/mozilla/TTS. [Accessed 29 November 2020].

[14] N. M. Bhat, “GitHub - nateshmbhat/pyttsx3,” 6 July 2020. [Online]. Available: https://pypi.org/project/pyttsx3/. [Accessed 29 November 2020].

[15] Microsoft, “Appendix A: Supported languages and voices,” November 2020. [Online]. Available: https://support.microsoft.com/en-us/windows/appendix-a-supported-languages-and-voices-4486e345-7730-53da-fcfe-55cc64300f01. [Accessed 29 November 2020].

[16] M. Chaput, “GitHub - mchaput/whoosh,” 9 October 2019. [Online]. Available: https://github.com/mchaput/whoosh. [Accessed 13 December 2020].

[17] The Apache Software Foundation, “Apache Lucene,” 22 February 2021. [Online]. Available: https://lucene.apache.org/. [Accessed 27 March 2021].

[18] Rasa, “Introduction to Rasa Open Source,” 17 December 2020. [Online]. Available: https://rasa.com/docs/rasa/. [Accessed 20 December 2020].

[19] SpaCy, “SpaCy Usage Documentation,” 11 December 2020. [Online]. Available: https://spacy.io/usage. [Accessed 20 December 2020].

[20] SpaCy, “SpaCy API - EntityRecognizer,” 11 December 2020. [Online]. Available: https://spacy.io/api/entityrecognizer. [Accessed 20 December 2020].

[21] M. Mantha, “Introducing DIET: state-of-the-art architecture that outperforms fine-tuning BERT and is 6X faster to train,” 9 March 2020. [Online]. Available: https://blog.rasa.com/introducing-dual-intent-and-entity-transformer-diet-state-of-the-art-performance-on-a-lightweight-architecture/. [Accessed 20 December 2020].

[22] MIT, “GitHub - mit-nlp/MITIE,” 10 February 2019. [Online]. Available: https://github.com/mit-nlp/MITIE. [Accessed 20 December 2020].

[23] Microsoft, “Microsoft LUIS,” 2021. [Online]. Available: https://www.luis.ai/. [Accessed 27 March 2021].

[24] The Pallets Projects, “Flask Documentation,” 24 June 2019. [Online]. Available: https://flask.palletsprojects.com/en/1.1.x/. [Accessed 27 March 2020].

[25] Sanic Community Organization, “Sanic Documentation,” 25 October 2020. [Online]. Available: https://sanic.readthedocs.io/. [Accessed 31 January 2021].


Configuration generator and skills viewer web apps

Front-end design frameworks

We researched front-end design frameworks that could be used to develop offline progressive web apps (PWAs), comparing React [1], Angular and Vue [2]. Ultimately, we decided to use React, as it integrated well with the other libraries we were considering (such as Redux for state management), was a popular choice for developing simple single-page web apps and had quality documentation [3]. React is also a flexible and scalable framework that allows us to create modular components with JSX, and the developer-friendly PWA templates published for it, together with its support for local caching, could further facilitate development.

User interface

When researching user interface design for the configuration generator web app to ensure a decent user experience, we looked into Voiceflow: a drag-and-drop voice assistant skill designer with support for exporting Voiceflow .vf files. Unfortunately, at present, there is no clearly defined, published, open specification for .vf files, so we were unable to consider it for integration into our project. Furthermore, Voiceflow supports many features specific to its own use cases which either go beyond the scope of our project or do not align well with our requirements.

Also as part of our UI research, we investigated Formik: a library that simplifies the creation of web forms. We decided to use it, as it abstracts away form-handling logic and would greatly reduce the complexity of our form components. As an added bonus, Formik is also compatible with the styling library we chose, Material-UI.

Drag-and-drop functionality

To implement drag-and-drop functionality, we considered using react-beautiful-dnd; however, while it was feature-rich, its use seemed beyond the scope of our project and we found it to have a complex API. We eventually settled on Sortable.js: a library that adds drag-and-drop capabilities to lists of components. We used it to allow users to sort input fields in the add story form by dragging and dropping them.

Styling

Material-UI is a library that provides prestyled components developers can use in their projects. We decided to use it for both the configuration generator app and the skills viewer, as its components are easy to use and customise. The components follow Google’s Material Design guidelines, are responsive, and have clear hover and focus animations to provide feedback to users. We also used some icons and components from material-ui/icons and material-ui/lab to create a more unified design.

We chose Material-UI over other options such as Bootstrap, as Material-UI is easier to customise and follows more modern styling conventions.

State management

We researched several options for managing state in a React web application, such as Redux and the React Context API. Redux provided a useful structure for building, maintaining and managing the complex state of the configuration generator web application, so we decided to use it. There are also Redux debugging developer tools that could be of great use during the development process.

Redux is also useful in that it would allow us to use a reducer factory to produce reducers of a similar structure, thereby reducing code duplication: code is not repeated and the functions for changing state are simpler [4]. Overall, Redux was the best option considering the size of the project and the need for a specific structure.

Summary of technical decisions

Technical problem | Solution summary
Programming language | JavaScript
Front-end framework | React
Form logic handler | Formik
Styling framework | Material-UI
State management | Redux
Drag-and-drop functionality | Sortable.js

References

[1] Facebook Inc., “React – A JavaScript library for building user interfaces,” Facebook Open Source, 2021. [Online]. Available: https://reactjs.org/. [Accessed 29 March 2021].

[2] E. You, “Vue.js,” 2014. [Online]. Available: https://vuejs.org/. [Accessed 29 March 2021].

[3] Timothy, “Making a Progressive Web App,” Facebook Open Source, 9 September 2020. [Online]. Available: https://create-react-app.dev/docs/making-a-progressive-web-app/. [Accessed 27 March 2021].

[4] D. Abramov, “Reusing Reducer Logic,” Redux.org, 2015. [Online]. Available: https://redux.js.org/recipes/structuring-reducers/reusing-reducer-logic. [Accessed 29 March 2021].