View from the Labs: Voice User Interfaces – A Short History and a Bright Future
Voice User Interfaces (VUIs) enable communication between humans and machines through voice-based interactions. If you’ve used voice-based assistants like Siri (Apple), Cortana (Microsoft), Google Now or Amazon Echo, you will be familiar with what they can do. But how do these voice user interfaces actually work, and what future do they have? Let’s take a closer look.
VUIs are made possible by natural language processing, which enables machines to understand and interpret human speech. Advances in artificial intelligence and natural language processing have made speech recognition good enough to understand verbal commands, and virtual assistants are rapidly gaining in popularity as a result: Amazon Echo has sold over 5 million units since it launched. These assistants are used for tasks such as performing searches, answering questions, making recommendations and integrating third-party services. Standalone devices like Amazon Echo and Google Home can look up information, play songs and integrate with other smart devices. All of these voice-based assistants let third-party services plug in for tasks like booking tickets or getting the news, and they could gradually replace smartphone apps. You heard right: voice interaction is likely to replace the way we interact with smartphone apps.
Voice interfaces have a lot going for them. For one, they make technology more inclusive. Users with disabilities who until now found it difficult to access conventional display-based devices will find that voice interfaces offer improved opportunities to experience technology. Machines have so far relied heavily on visual information, but voice interfaces point to a future of more inclusive technology that is responsive to the user’s needs. Voice interfaces will also come to be relied on in situations where the user is otherwise occupied. For example, when the user is driving a car, a voice user interface helps the driver interact with the machine and get tasks done. And in the case of wearables, a fast-expanding market, devices often offer little or no display; here the VUI will be the preferred mode of interaction.
In fact, VUIs operating wearables are just the tip of the iceberg. As we move towards a world connected by millions of smart devices, we will find that the existing way of controlling them, i.e. through apps, is highly inefficient. Voice interfaces can provide a common way to interact with all the smart devices in the ecosystem. So much so that we see the voice interface becoming the de facto medium of interaction.
The Evolution of Voice User Interfaces
- 1952: “Audrey” by Bell Labs, which could recognize the digits 0-9
- 1962: “Shoebox” by IBM, which could understand 16 English words
- 1970s: the Hidden Markov Model (HMM), which moved speech recognition away from literal pattern matching towards statistical prediction
- 1971: the US Department of Defense (DARPA) started its “Speech Understanding Research” (SUR) program; one outcome was “Harpy”, a speech-understanding system that could understand more than 1,000 words
- 1994: “SpeechWorks”, automated speech recognition over IVR (Interactive Voice Response) on the telephone
- 1996: BellSouth launched VAL, the first voice portal, based on a dial-in interactive voice recognition system
- 1997: “Dragon NaturallySpeaking”, the first software to recognize continuous speech
- 2007: “Siri” was founded to enable natural human-machine interaction through a voice interface
- 2008: Google launched its Voice Search app, which let users make queries by voice
- 2011: “Siri” launched with the iPhone 4S, integrated into iOS
- 2014: Amazon Echo launched
- 2016: Google Home launched
How does it work?
Voice User Interfaces go through the following steps every time you use them (a minimal code sketch of the full pipeline follows the list).
- Activation: VUIs are typically activated through specific keywords or “wake words” such as “OK Google”, “Alexa” and so on.
- Automatic Speech Recognition: The machine converts the user’s spoken input, i.e. queries, commands, etc., into text that the system can process.
- Natural Language Understanding: The system processes the transcribed text and understands the user’s intent.
- Action: Based on the user’s intent, the system performs the appropriate task. This can involve performing a search, connecting with third-party services (such as making a hotel reservation), etc.
- Natural Language Generation and Text-to-Speech (TTS): Once the information is found, it is translated back into language the user understands and communicated to the user via a speech synthesizer.
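To make these steps concrete, here is a minimal sketch of the pipeline in Python. It is purely illustrative: the “audio” is faked as plain text so the example stays self-contained, and every name in it (the wake word, the skill registry, the helper functions) is a hypothetical stand-in rather than any vendor’s actual API.

```python
# Illustrative VUI pipeline sketch. "Audio" is faked as a text string so the
# example stays self-contained; every name here is a made-up stand-in.

WAKE_WORD = "alexa"

def detect_wake_word(utterance: str) -> bool:
    """Activation: wake up only when the keyword is heard."""
    return utterance.lower().startswith(WAKE_WORD)

def transcribe(utterance: str) -> str:
    """ASR stand-in: a real system would turn audio samples into text."""
    return utterance

def parse_intent(text: str) -> dict:
    """NLU stand-in: map the transcript to an intent and its slots."""
    if text.lower().startswith("play "):
        return {"name": "play_music", "slots": {"song": text[5:].strip()}}
    return {"name": "web_search", "slots": {"query": text}}

# Action: route each intent to a handler (search, music, third-party booking, ...).
SKILLS = {
    "play_music": lambda slots: f"Playing {slots['song']}.",
    "web_search": lambda slots: f"Here is what I found for '{slots['query']}'.",
}

def speak(reply: str) -> None:
    """NLG / TTS stand-in: a real system hands this text to a synthesizer."""
    print(reply)

def handle_turn(utterance: str) -> None:
    """One full turn: activation -> ASR -> NLU -> action -> TTS."""
    if not detect_wake_word(utterance):
        return                                 # stay dormant without the wake word
    text = transcribe(utterance)[len(WAKE_WORD):].strip()
    intent = parse_intent(text)
    speak(SKILLS[intent["name"]](intent["slots"]))

handle_turn("Alexa play Bohemian Rhapsody")    # -> Playing Bohemian Rhapsody.
```

In a real assistant, each of these stubs would be a dedicated component: an on-device wake-word detector, a cloud ASR model, a trained NLU model and a speech synthesizer.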
The Speech Recognition module is the most critical part, as it has to filter out noise and capture the user’s commands correctly; every subsequent action of the device depends on how the user’s input is interpreted. It consists of three key steps: eliminating background noise and echo to reduce interference, separating the user’s voice from all other sounds in the room, and finally adjusting for the user’s distance from the device.
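As a rough illustration of the first and last of these steps, the sketch below applies a naive energy-based noise gate and level normalisation to a mono audio buffer using NumPy. This is a toy approximation only; real far-field devices use dedicated echo cancellation and multi-microphone beamforming, and the thresholds and frame sizes here are arbitrary assumptions.

```python
import numpy as np

def gate_frame(frame: np.ndarray, noise_floor: float) -> np.ndarray:
    """Toy noise gate: drop frames whose energy is close to the noise floor,
    keeping only the louder (presumably speech) frames."""
    energy = np.sqrt(np.mean(frame ** 2))
    return frame if energy > 2.0 * noise_floor else np.zeros_like(frame)

def normalise(samples: np.ndarray, target_rms: float = 0.1) -> np.ndarray:
    """Toy distance compensation: scale to a fixed RMS level so a user far
    from the microphone sounds roughly as loud as one standing next to it."""
    rms = np.sqrt(np.mean(samples ** 2))
    return samples if rms == 0 else samples * (target_rms / rms)

def preprocess(audio: np.ndarray, frame_len: int = 512) -> np.ndarray:
    """Estimate the noise floor from the first few frames (assumed to be
    silence), gate the rest of the signal, then normalise the overall level.
    Expects mono float audio in the range [-1, 1]."""
    frames = [audio[i:i + frame_len] for i in range(0, len(audio), frame_len)]
    noise_floor = np.mean([np.sqrt(np.mean(f ** 2)) for f in frames[:5]])
    gated = np.concatenate([gate_frame(f, noise_floor) for f in frames])
    return normalise(gated)
```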
Natural Language Processing enables the machine to interact with humans in a natural manner. Human-machine interaction is no longer confined to a few set phrases and can now genuinely mimic a natural conversation with the user. Today machines can process a wide range of conversational input and have become intelligent enough to tolerate errors in that input.
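As a very small illustration of that tolerance, the sketch below maps a possibly mis-transcribed command to the nearest known phrasing using Python’s standard difflib. Production NLU systems rely on statistical or neural models rather than string similarity, and the phrase table here is invented for the example, but the underlying idea of resolving noisy input to the closest known intent is the same.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical command vocabulary: known phrasings mapped to intents.
PHRASES = {
    "turn on the lights": "lights_on",
    "turn off the lights": "lights_off",
    "what is the weather": "weather",
    "set a timer": "timer",
}

def robust_intent(transcript: str) -> Optional[str]:
    """Map a possibly garbled transcript to the closest known phrasing,
    tolerating small recognition errors; return None if nothing is close."""
    match = get_close_matches(transcript.lower(), PHRASES.keys(), n=1, cutoff=0.6)
    return PHRASES[match[0]] if match else None

print(robust_intent("turn of the light"))   # -> lights_off
print(robust_intent("whats the weather"))   # -> weather
```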
The applications for VUI are numerous and the potential is huge. The automobile industry, wearable devices and mobile devices all stand to be transformed by this technology. Although older forms of voice interfaces have been around since the 1970s, the market conditions for mass adoption are only now ready. If you’re considering introducing voice interfaces into your mobility initiative and want to know where to start, get in touch.