Research

Infrastructure

The main functions of the SOTA storytelling robot project are:

Speech recognition
Speech production
Dictionary API
Story-following
Logging user replies in a database

As SOTA runs on Java, we wrote the project in Java using the Eclipse IDE, which is also the IDE used at NTT Data. In this section, we describe the APIs we researched for each area and justify the ones we chose.

Speech-to-Text APIs

Before we chose an API, our client asked us to create a spreadsheet comparing the features of the services. Our findings are as follows:

Google Speech

Can return recognition results while the user is still speaking - only available in C#, Python, Node.js and Go
Supported encodings: FLAC, AMR, PCMU and Linear-16
Handles noisy audio from many environments
Filters inappropriate content in text results for some languages
Pricing (monthly): 0 - 60 minutes: Free; 61 - 1,000,000 minutes: $0.006
Many examples on GitHub for C#, Go, Java, Node.js, PHP, Python and Ruby
Overall better documentation
Allows adding custom words

IBM Watson

Not clear whether it can return recognition results while the user is still speaking
Supported encodings: OGG, WEBM, MP3, MPEG, WAV, FLAC, PCM, mu-law audio, "basic audio"
Filters inappropriate content in text results for some languages
$0.02 (USD) per minute of the total audio files sent throughout a month added together
Keyword spotting - phrase recognition
Converts dates, times, numbers, etc. in US English into more readable, conventional forms (beta functionality)
Few examples, only for Node.js and Java; some cognitive computing SDKs on GitHub in 6 languages, but they are poorly documented and not specifically aimed at speech-to-text
Allows adding custom words
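
To make the two pricing models easier to compare, the following is a minimal sketch of a monthly cost estimate. The 60 free minutes, the $0.006 figure and the $0.02 per minute figure come from the lists above. The class and method names are hypothetical, and because the billing unit for the Google figure is not stated above, it is left as a caller-supplied parameter rather than assumed.

```java
final class SttCostEstimate {
    // Figures taken from the comparison above.
    static final double GOOGLE_FREE_MINUTES = 60.0;
    static final double GOOGLE_PRICE_PER_UNIT = 0.006;
    static final double IBM_PRICE_PER_MINUTE = 0.02;

    /**
     * Estimated monthly Google Speech cost in USD.
     * unitsPerMinute is an assumption supplied by the caller: the number of
     * billing units contained in one minute of audio.
     */
    static double googleMonthlyCost(double minutes, double unitsPerMinute) {
        // Only the audio beyond the free allowance is billed.
        double billableMinutes = Math.max(0.0, minutes - GOOGLE_FREE_MINUTES);
        return billableMinutes * unitsPerMinute * GOOGLE_PRICE_PER_UNIT;
    }

    /** Estimated monthly IBM Watson cost in USD: every minute is billed. */
    static double ibmMonthlyCost(double minutes) {
        return minutes * IBM_PRICE_PER_MINUTE;
    }

    public static void main(String[] args) {
        double minutes = 500.0; // hypothetical monthly usage
        System.out.printf("Google (1 unit per minute assumed): $%.2f%n",
                googleMonthlyCost(minutes, 1.0));
        System.out.printf("IBM Watson: $%.2f%n", ibmMonthlyCost(minutes));
    }
}
```

The estimate is only as good as the unit passed in, so the billing unit should be checked against the provider's current price list before relying on the result.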

Conclusion

For the speech-to-text part of the processing, we used Google Speech, as it has better documentation and better support, while for the text-to-speech step we used IBM Watson. This latter choice is motivated by the fact that, in order to keep the storytelling engaging, voice inflexions and different voice tonalities were necessary. IBM Watson has a great API for producing speech, taking into account “mood” and other useful aspects.

Text-to-Speech APIs

Free TTS

FreeTTS is an open-source speech synthesis system written entirely in the Java programming language. We tried it because it is a free service.
The quality of the synthesised voice was very poor. The voice component is crucial to our project, all the more so because the users are children. The robot should have a friendly, cheerful voice, and FreeTTS was far from meeting our expectations.

IBM Watson

IBM Watson is far superior to FreeTTS and this is the service we chose for the Text-to-Speech integration. Some features of this service are:

Multiple English accents are available: British English in a single female voice and American English in two female voices
IBM has developed an XML-like language used to request the specific voice desired
The produced voice is clear and full of emotion. Some examples of voice tones available for querying are: Uncertainty, GoodNews, Apology
There are many other features which we did not use but would be great for creating new characters in the children stories: glottal tension, breathiness, pitch, pitch range, type (e.g. "Young"), strength (e.g. "80%"), timbre (e.g. "Breeze")
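
As an illustration of the XML-like markup described above, here is a minimal sketch that builds such a request string in Java. The element and attribute names (`express-as`, `voice-transformation`, `type`, `strength`) are written as we recall them from the IBM documentation of the time and should be treated as assumptions to verify; the class and method names are hypothetical.

```java
final class ExpressiveMarkup {
    /** Escapes characters that are special in XML text and attribute values. */
    static String escapeXml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    /** Wraps a sentence in an element requesting a tone such as "GoodNews". */
    static String expressAs(String tone, String sentence) {
        return "<express-as type=\"" + escapeXml(tone) + "\">"
                + escapeXml(sentence) + "</express-as>";
    }

    /**
     * Wraps already-built markup in a voice transformation, for example
     * type "Young" with strength "80%". innerMarkup is inserted as-is.
     */
    static String transformVoice(String type, String strength, String innerMarkup) {
        return "<voice-transformation type=\"" + escapeXml(type)
                + "\" strength=\"" + escapeXml(strength) + "\">"
                + innerMarkup + "</voice-transformation>";
    }

    public static void main(String[] args) {
        String line = expressAs("GoodNews", "You found the treasure!");
        System.out.println(transformVoice("Young", "80%", line));
    }
}
```

Escaping the story text before embedding it matters here, since a stray `&` or `<` in a sentence would otherwise produce malformed markup.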

Dictionary API

Oxford Dictionaries API

Since all dictionaries offer much the same functionality, we opted for the Oxford Dictionaries API because of its well-written documentation. The need for such an API comes from the possibility that children do not know what certain words mean and therefore ask SOTA questions like “what does ‘sheer’ mean?” during the story.

Although the dictionary offers lots of information regarding words, we only make use of the first definition returned by the API.
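
The flow described above (spot a "what does X mean?" question, then keep only the first definition) might look roughly like the following sketch. It is not the project's actual code: the class name, the regular expression and the idea that the definitions arrive as a plain list of strings are all assumptions made for illustration, and the call to the Oxford Dictionaries API itself is left out.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DefinitionQuestions {
    // Matches "what does <word> mean", with optional straight or curly quotes
    // around the word. The pattern is an illustrative assumption.
    private static final Pattern QUESTION = Pattern.compile(
            "what\\s+does\\s+['\"\u2018\u2019\u201C\u201D]?([\\p{L}-]+)"
                    + "['\"\u2018\u2019\u201C\u201D]?\\s+mean",
            Pattern.CASE_INSENSITIVE);

    /** Returns the word being asked about, lower-cased, if the text is such a question. */
    static Optional<String> extractWord(String utterance) {
        Matcher m = QUESTION.matcher(utterance);
        return m.find()
                ? Optional.of(m.group(1).toLowerCase(Locale.ROOT))
                : Optional.empty();
    }

    /** Keeps only the first definition, mirroring the choice described above. */
    static Optional<String> firstDefinition(List<String> definitions) {
        return definitions.stream().findFirst();
    }
}
```

Returning an `Optional` keeps the two failure cases explicit: the utterance was not a definition question, or the dictionary returned no definitions.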

Story-following

IBM Watson Conversation

This API is unique in what it offers and, as it turned out, it fits the aim of our project very well.

These are some of the upsides of this API:

Includes Natural Language Processing
Remembers the “checkpoint” of the story the user is currently at
Eliminates possible errors in recognising speech by only considering the labels of the story branches
Has a great, easy-to-use UI for customising the stories
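
The checkpoint and branch-label ideas in the list above can be illustrated with a small local sketch. This is not how IBM Watson Conversation works internally and it does not use its API: plain substring matching stands in for the service's Natural Language Processing, and all class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class StoryFollower {
    /** One point in the story: the text to speak and the labelled branches onward. */
    static final class Checkpoint {
        final String narration;
        final Map<String, Checkpoint> branches = new LinkedHashMap<>();

        Checkpoint(String narration) {
            this.narration = narration;
        }

        /** Adds a branch; labels are stored lower-cased for case-insensitive matching. */
        Checkpoint branch(String label, Checkpoint next) {
            branches.put(label.toLowerCase(Locale.ROOT), next);
            return this;
        }
    }

    private Checkpoint current;

    StoryFollower(Checkpoint start) {
        this.current = start;
    }

    Checkpoint current() {
        return current;
    }

    /**
     * Moves to the first branch whose label appears in the recognised text.
     * Anything else in the text is ignored, so misrecognised filler words
     * do not matter. Returns false and stays put if no label is mentioned.
     */
    boolean reply(String recognisedText) {
        String text = recognisedText.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, Checkpoint> e : current.branches.entrySet()) {
            if (text.contains(e.getKey())) {
                current = e.getValue();
                return true;
            }
        }
        return false;
    }
}
```

When `reply` returns false, the caller can repeat the available options instead of guessing, which keeps the child at the same checkpoint.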

Conclusion

As we found no other service of this kind on the market, the IBM Watson Conversation service was of great help: it handled all the advanced Natural Language Processing, allowing us to focus on getting the stories and the user experience right.

User reply logging

MySQL

In order to make it easier for scientists to use the robot while conducting research, we log every reply coming from the user to a database. As there were no specific requirements regarding this aspect of the project, we chose MySQL because of the familiarity our team had using it.

Some features of a MySQL database include:

It is a Relational Database System
As its name suggests, MySQL uses SQL (Structured Query Language) as its database language. SQL is a standardized language for querying and updating data and for the administration of a database.
ODBC: MySQL supports the ODBC interface Connector/ODBC. This allows MySQL to be addressed by all the usual programming languages that run under Microsoft Windows (Delphi, Visual Basic, etc.). The ODBC interface can also be implemented under Unix, though that is seldom necessary. Windows programmers who have migrated to Microsoft's new .NET platform can, if they wish, use the ODBC provider or the .NET interface Connector/NET.
Platform independence: It is not only client applications that run under a variety of operating systems; MySQL itself (that is, the server) can be executed under a number of operating systems. The most important are Apple Macintosh OS X, Linux, Microsoft Windows, and the countless Unix variants, such as AIX, BSDI, FreeBSD, HP-UX, OpenBSD, NetBSD, SGI IRIX, and Sun Solaris.
MySQL is widely used in popular web applications such as WordPress, Drupal, Joomla, Facebook and Twitter, and has a reputation for security and reliability. The security features and transaction support in recent versions of MySQL help keep user data safe.

Firebase

Before ultimately choosing a database technology, we also considered Firebase, because of the ease of use.

Firebase is a NoSQL database, so pushing data from the robot would have been easier (right now SOTA calls a PHP page on a hosting service, which then inserts new rows into the MySQL database - a pipeline which we developed ourselves and which is certainly less tested than the Firebase API). The company provides client libraries that enable integration with Java, the language used to code SOTA.
Another advantage of this technology is the simplicity of browsing through the data using the browser UI of Firebase.
If customisation is desired, the user does not need to know SQL - they just need to make calls to the Firebase API, which has very well-written documentation.
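
For comparison, the current MySQL pipeline mentioned above (SOTA calls a PHP page, which inserts the row) could be sketched on the Java side as follows. This is a minimal sketch and not the project's actual code: the endpoint URL is supplied by the caller, and the form field names `session` and `reply`, like the class and method names, are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class ReplyLogger {
    /** Builds a form-encoded body; the field names are illustrative assumptions. */
    static String formBody(String sessionId, String reply) {
        try {
            return "session=" + URLEncoder.encode(sessionId, "UTF-8")
                    + "&reply=" + URLEncoder.encode(reply, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Posts one user reply to the logging page and returns the HTTP status code. */
    static int postReply(String endpoint, String sessionId, String reply) throws IOException {
        byte[] body = formBody(sessionId, reply).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type",
                    "application/x-www-form-urlencoded; charset=UTF-8");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

Encoding the reply before sending it is important because children's replies are free text and may contain characters such as `&` that would otherwise split the form fields.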

Voice UI

Definition

A voice-user interface (VUI) lets people interact with a computer through a voice or speech platform in order to initiate an automated service or process. The design guidelines for graphical user interfaces cannot be applied directly to VUIs. A VUI has no visual affordances, so users get no clear indication of what the interface can do or what their options are. When designing VUI actions, it is therefore important that the system clearly states the possible interaction options, tells the user what functionality they are using, and limits the information it gives out to an amount that users can remember.

Siri and the Amazon Echo are both examples of popular VUIs. The Echo has recently received a lot of praise for its interface. Given that the two systems can do many similar things, why is the Echo often a better user experience? One reason is that the Echo was designed with voice in mind from the beginning: that is its sole purpose. Siri, by comparison, is just one more way to interact with your iPhone.
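
Two of the guidelines above, stating the available options and limiting how much is said at once, can be illustrated with a small prompt builder. This is only a sketch: the class name, the wording of the prompt and the cap on the number of options are assumptions for illustration.

```java
import java.util.List;

final class VoicePrompt {
    /**
     * Builds a spoken prompt that names where the user is and lists at most
     * maxOptions choices, so the list stays short enough to remember.
     */
    static String optionsPrompt(String context, List<String> options, int maxOptions) {
        int count = Math.max(0, Math.min(maxOptions, options.size()));
        List<String> spoken = options.subList(0, count);
        if (spoken.isEmpty()) {
            return context;
        }
        StringBuilder sb = new StringBuilder(context).append(" You can say ");
        for (int i = 0; i < spoken.size(); i++) {
            if (i > 0) {
                // Use "or" before the last option and commas elsewhere.
                sb.append(i == spoken.size() - 1 ? " or " : ", ");
            }
            sb.append(spoken.get(i));
        }
        return sb.append('.').toString();
    }
}
```

For instance, calling `optionsPrompt("We are at the crossroads.", options, 3)` with four options would speak only the first three, leaving the fourth for a follow-up prompt.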

Advantages

Using the interface does not require any prior training
Physically disabled people can use it much more easily than a traditional interface
Customer satisfaction is increased

Disadvantages

The fact that the interface cannot respond to commands it was not programmed to handle is a notable reliability issue
Such an interface is very complex to implement and thus very costly, so it should only be used when other kinds of interfaces are not suitable
Because voice interfaces use Machine Learning to extract meaning from speech, the "brain" of the interface will require heavy training and handling of big data in order to be accurate and recognise what the user is saying

Conclusion and Decisions

After careful consideration of all the available technologies, correlated with our prior knowledge, we chose the following technologies to build our project:

Google Speech API - Speech-to-Text
IBM Watson API - Text-to-Speech
Oxford Dictionaries API - Dictionary
IBM Watson Conversation - Story-following and Story-building
MySQL - Database

References

[1] Whitenton, K. (2016) The most important design principles of voice UX. Available at: https://www.fastcodesign.com/3056701/the-most-important-design-principles-of-voice-ux (Accessed: 13 March 2018).

[2] Benefit from a good user manual (no date) Available at: http://technicalwriting.eu/benefit-from-a-good-user-manual/ (Accessed: 13 March 2018).

[3] Pearl, C. (2016) Cathy Pearl. Available at: https://www.oreilly.com/ideas/basic-principles-for-designing-voice-user-interfaces (Accessed: 14 March 2018).

[4] Kuperman, V., Stadthagen-Gonzalez, H. & Brysbaert, M. Behav Res (2012) Age-of-acquisition ratings for 30,000 English words. Available at: https://doi.org/10.3758/s13428-012-0210-4 (Accessed: 14 March 2018).

[5] Weisberg, Deena & Ilgaz, Hande & Hirsh-Pasek, Kathy & Golinkoff, Roberta & Nicolopoulou, Ageliki & Dickinson, David. (2015). Shovels and swords: How realistic and fantastical themes affect children's word learning. Available at: https://www.researchgate.net/publication/