Now Web App Understands Speech! (HTML5 Web Speech API)

Vimal Patel

Sr. Engineering Manager of DevOps & Microsoft Practice

Jul 15, 2013

Posted in Development

Introduction

The Web Speech API specification defines a JavaScript API that lets web developers incorporate speech recognition and synthesis into their web pages. It aims to provide, in a web browser, speech-input and text-to-speech output features that are typically not available when using standard speech-recognition or screen-reader software. The API is agnostic of the underlying speech recognition and synthesis implementation and can support both server-based and client-based (embedded) recognition and synthesis. It is designed to handle both brief (one-shot) speech input and continuous speech input. Recognition results are delivered to the web page as a list of hypotheses, along with other relevant information for each hypothesis.

Use Cases

This specification supports the following use cases:

Voice Web Search
Speech Command Interface
Domain Specific Grammars Contingent on Earlier Inputs
Continuous Recognition of Open Dialog
Domain Specific Grammars Filling Multiple Input Fields
Speech UI present when no visible UI need be present
Voice Activity Detection
Temporal Structure of Synthesis to Provide Visual Feedback
Hello World
Speech Translation
Speech Enabled Email Client
Dialog Systems
Multimodal Interaction
Speech Driving Directions
Multimodal Video Game
Multimodal Search

How to Use It

The new JavaScript Web Speech API makes it easy to add speech recognition to your web pages. This API allows fine control and flexibility over the speech recognition capabilities in Chrome version 25 and later. Here’s an example with the recognized text appearing almost immediately while speaking.

DEMO

Let’s take a look under the hood. First, we check whether the browser supports the Web Speech API by testing if the webkitSpeechRecognition object exists. If it does not, we suggest that the user upgrade their browser. (Since the API is still experimental, it is currently vendor prefixed.) Then we create the webkitSpeechRecognition object, which provides the speech interface, and set some of its attributes and event handlers.
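The check described above can be sketched roughly as follows. This is a minimal illustration rather than the demo's actual source: the helper names getRecognitionConstructor and createRecognizer are hypothetical, and the fallback to an unprefixed SpeechRecognition name is an added assumption for browsers that expose the API without the vendor prefix.

```javascript
// Return the speech recognition constructor exposed by the given global
// object, or null when the Web Speech API is not available.
// (Hypothetical helper; the unprefixed name is checked as an assumption.)
function getRecognitionConstructor(globalObject) {
  return globalObject.SpeechRecognition ||
         globalObject.webkitSpeechRecognition ||
         null;
}

// Create a recognizer, or return null so the caller can show an upgrade hint.
function createRecognizer(globalObject) {
  const Recognition = getRecognitionConstructor(globalObject);
  return Recognition ? new Recognition() : null;
}

// Browser usage, guarded so the snippet also loads outside a browser.
if (typeof window !== 'undefined') {
  const recognition = createRecognizer(window);
  if (!recognition) {
    console.log('Web Speech API is not supported; please upgrade your browser.');
  }
}
```

Passing the global object in as a parameter keeps the check easy to exercise with a stand-in object when no browser is available.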

continuous: The default value for continuous is false, meaning that when the user stops talking, speech recognition ends. This mode is great for simple input such as short text fields. In the demo provided by Chrome, it is set to true, so that recognition continues even if the user pauses while speaking.

interimResults: The default value for interimResults is false, meaning that the only results returned by the recognizer are final and will not change. The demo sets it to true so we get early, interim results that may change. Watch the demo carefully: the grey text is interim and does sometimes change, whereas the black text is made up of responses from the recognizer that are marked final and will not change.
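As a small sketch of how these two attributes might be set, the helper below applies them to a recognizer object. The name configureRecognizer and its options argument are hypothetical; only the continuous and interimResults attribute names come from the API.

```javascript
// Apply the continuous and interimResults attributes to a recognizer.
// Anything not supplied falls back to the API defaults (both false).
// (configureRecognizer is a hypothetical helper name.)
function configureRecognizer(recognition, options = {}) {
  const settings = Object.assign(
    { continuous: false, interimResults: false },
    options
  );
  recognition.continuous = settings.continuous;
  recognition.interimResults = settings.interimResults;
  return recognition;
}

// Example: dictation-style settings similar to those the demo is described
// as using. A plain object stands in for a real recognizer here.
const dictationSettings = configureRecognizer(
  {},
  { continuous: true, interimResults: true }
);
console.log(dictationSettings.continuous, dictationSettings.interimResults);
```

For a short input field, calling the helper with no options would leave both attributes at false, so recognition stops when the user stops talking and only final results arrive.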

To get started, the user clicks on the microphone button, which triggers code that sets the language and starts the recognizer:

lang: Chrome speech recognition supports numerous languages (see the “langs” table in the demo source), as well as some right-to-left languages that are not included in this demo, such as he-IL and ar-EG. If lang is not set, it defaults to the lang of the HTML document root element and hierarchy.

start(): After setting the language, we call recognition.start() to activate the speech recognizer. Once it begins capturing audio, it calls the onstart event handler, and then for each new set of results, it calls the onresult event handler.
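A microphone-button handler along these lines might look like the sketch below. It is an illustration under assumptions, not the demo's code: createMicToggle is a hypothetical name, and the recognizing flag (kept in sync through onstart and onend) is added so that a second click stops recognition instead of calling start() on a recognizer that is already running.

```javascript
// Build a click handler that starts the recognizer, or stops it when it is
// already running. (createMicToggle is a hypothetical helper name.)
function createMicToggle(recognition) {
  let recognizing = false;

  // onstart fires once audio capture begins; onend fires when it finishes.
  recognition.onstart = function () { recognizing = true; };
  recognition.onend = function () { recognizing = false; };

  return function toggle(languageTag) {
    if (recognizing) {
      recognition.stop();
      return;
    }
    if (languageTag) {
      // BCP 47 language tag such as 'en-US'. When left unset, the
      // recognizer falls back to the document's language.
      recognition.lang = languageTag;
    }
    recognition.start();
  };
}

// Browser usage (assumes a recognizer and a button element already exist):
//   const toggle = createMicToggle(recognition);
//   micButton.addEventListener('click', function () { toggle('en-US'); });
```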

The onresult handler concatenates all the results received so far into two strings: final_transcript and interim_transcript. The resulting strings may include “\n”, such as when the user speaks “new paragraph”.

interim_transcript is a local variable, and it is completely rebuilt each time this event is called because it’s possible that all interim results have changed since the last onresult event. We could do the same for final_transcript simply by starting the for loop at 0. However, because final text never changes, we’ve made the code here a bit more efficient by making final_transcript a global, so that this event can start the for loop at event.resultIndex and only append any new final text.
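The logic described here can be sketched as follows. The variable names final_transcript and interim_transcript and the use of event.resultIndex follow the text; the function name handleResult and the returned object are assumptions made so the sketch can be read and tried on its own.

```javascript
// Accumulates finalized text across onresult events.
let final_transcript = '';

// Process one onresult event. Only results from event.resultIndex onward
// can have changed, so the loop starts there.
// (handleResult is a hypothetical name for the handler body.)
function handleResult(event) {
  let interim_transcript = ''; // rebuilt from scratch on every event
  for (let i = event.resultIndex; i < event.results.length; ++i) {
    // Each result is a list of alternatives; index 0 is the top hypothesis.
    const text = event.results[i][0].transcript;
    if (event.results[i].isFinal) {
      final_transcript += text;
    } else {
      interim_transcript += text;
    }
  }
  return { final_transcript, interim_transcript };
}

// Browser usage (assumes a recognizer and two display elements exist):
//   recognition.onresult = function (event) {
//     const text = handleResult(event);
//     finalSpan.textContent = text.final_transcript;
//     interimSpan.textContent = text.interim_transcript;
//   };
```

Starting the loop at event.resultIndex relies on final results never being revised, which is the assumption the paragraph above describes.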

So make your web pages come alive by enabling them to listen to your users!
