Automatic Speech Recognition (ASR) is the process by which Artificial Intelligence (AI) technology converts human speech into text. The ultimate goal is to provide a transcription of the audio by correctly converting sound waves into strings of letters, words, and sentences. This requires the ASR system to learn language to a degree, recognizing speech and accounting for the context of the conversation to provide the most accurate transcription. Over the years ASR systems have come a long way, and in recent years they have become more widespread, being integrated into popular applications like Instagram and TikTok. The progress ASR has made continues to open the door to making audio and video data more accessible and affordable for those who can benefit. In this article, we explore the development of ASR, modern applications of the technology, and how ASR improves accessibility.
The origin of ASR as we know it today can be traced back to 1952 and the invention of a digit recognition system named “Audrey.” Created by Bell Laboratories, Audrey could initially transcribe only spoken numbers into readable text, but after improvements it was eventually able to transcribe basic words as well. Later, during the 1960s, IBM developed a system named “Shoebox” that could recognize digits as well as understand mathematical commands and calculate the answer. However, it was not until about a decade later that ASR technology was researched more seriously. This eventually led to more accurate commercial ASR, with ASR technology and APIs being sold at high cost during the 1990s. Automatic Speech Recognition really gained momentum in the technology boom of the 2000s, and today ASR is reaching near-human accuracy. As ASR systems become more affordable and accessible, forms of the technology can now be found in many popular mobile applications, becoming more commonplace and widespread.
In order for Automatic Speech Recognition to accurately convert a sequence of sound waves into written text, the ASR system has to learn the language. Much like a person learning a new language, the ASR system learns in steps and builds upon those skills to correctly interpret what is being said. The first step in this process is understanding phonemes, the smallest units of sound in a language. This step is what allows the system to recognize the sound each letter or letter combination makes. Once phonemes can be recognized, this foundational skill allows the system to combine sounds into words.
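To make the idea concrete, here is a minimal Python sketch of the phoneme-to-word step, assuming a tiny hand-written pronunciation lexicon. The phoneme symbols and entries below are illustrative only; real systems rely on lexicons or learned mappings covering tens of thousands of words.

```python
# Toy phoneme-to-word lookup. The lexicon below is a hypothetical,
# hand-written example using ARPAbet-style phoneme symbols.
LEXICON = {
    ("K", "AE", "T"): "cat",
    ("K", "AA", "T"): "cot",
    ("HH", "AH", "L", "OW"): "hello",
}

def phonemes_to_word(phonemes):
    """Map a recognized phoneme sequence to a written word, if known."""
    return LEXICON.get(tuple(phonemes), "<unknown>")

print(phonemes_to_word(["K", "AE", "T"]))         # -> cat
print(phonemes_to_word(["HH", "AH", "L", "OW"]))  # -> hello
```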
From there the Automatic Speech Recognition system is able to build sentences from the words that have been strung together. However, the ASR system’s learning and foundational understanding does not end there. To ensure accuracy, the ASR system must also distinguish between similar-sounding words and phrases and choose the right interpretation. While converting sound into written text, it is also important that the ASR system can determine which words are important and which are not. For example, the system must be able to account for disfluencies and filler words. Disfluencies are verbal utterances that occur in natural speech, like pauses, hesitations, and stuttering. Filler words are words like “um” that fill space but carry no meaning in the context of the conversation.
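As a simple illustration of the filler-word idea, the sketch below strips a few common English fillers from a finished transcript. The filler list and the cleanup rule are assumptions made for this example; production systems typically handle disfluencies inside the recognition models themselves rather than with a word list.

```python
# Hypothetical post-processing step: drop common filler words from a transcript.
FILLERS = {"um", "uh", "er", "hmm"}

def remove_fillers(transcript):
    """Return the transcript with standalone filler words removed."""
    kept = [w for w in transcript.split() if w.lower().strip(",.!?") not in FILLERS]
    return " ".join(kept)

print(remove_fillers("So, um, I was, uh, thinking we could meet tomorrow"))
# -> "So, I was, thinking we could meet tomorrow"
```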
There are multiple approaches and methods for training Automatic Speech Recognition systems. Today, the two main approaches are the traditional hybrid method and the end-to-end deep learning method. Each approach incorporates multiple models within the system.
The traditional hybrid method is the legacy approach to automatic speech recognition, and it is still used by many companies today. Although there are now more accurate training methods, the traditional hybrid approach is still relied on because there is greater collective know-how for building a strong model with it. Because the traditional hybrid approach was the principal method for roughly 15 years, there is more data and research available, making systems easier to build. Traditional hybrid methods work by using Hidden Markov Models (HMM) and Gaussian Mixture Models (GMM), both of which require forced alignment of the training data. Forced alignment is the process in which the speech recognition system is given an exact transcript of what is being said and must then determine where in time each word of the transcript occurs in the speech segment. Within the traditional HMM and GMM approach, three models play an important role in the recognition process.
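The snippet below sketches what force-aligned data might look like once the aligner has done its job: each word of the known transcript is paired with the time span in the audio where it occurs. The words and timestamps here are invented for illustration.

```python
# Illustrative (made-up) forced-alignment result: the transcript is known in
# advance, and the aligner decides when each word occurs in the audio.
alignment = [
    {"word": "speech",      "start_s": 0.42, "end_s": 0.81},
    {"word": "recognition", "start_s": 0.81, "end_s": 1.55},
    {"word": "works",       "start_s": 1.55, "end_s": 1.98},
]

for seg in alignment:
    duration = seg["end_s"] - seg["start_s"]
    print(f'{seg["word"]:>12}: {seg["start_s"]:.2f}s-{seg["end_s"]:.2f}s ({duration:.2f}s)')
```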
Three models play a part in the traditional hybrid method for speech recognition: the acoustic model, the lexicon model, and the language model. The acoustic model, usually an HMM or GMM variant, models the acoustic patterns of speech, allowing the system to predict which sound is occurring at which time based on the force-aligned data it receives. The lexicon model tells the automatic speech recognition system how words are pronounced. The language model helps determine the correct order of words in a sentence; using language statistics as a guide, it predicts which words are likely to follow one another. Finally comes decoding, which combines the outputs of these models to produce a transcript of what is being said.
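A heavily simplified sketch of that decoding step is shown below: for each candidate word, an acoustic score (how well the word matches the audio) is combined with a language-model score (how likely the word is given the preceding words), and the best-scoring candidate wins. All scores are made-up log probabilities for the homophones “write,” “right,” and “rite” after the phrase “I will.”

```python
# Toy decoding step with invented log-probability scores.
acoustic_log_probs = {"write": -1.2, "right": -1.1, "rite": -2.5}   # fit to the audio
language_log_probs = {"write": -0.7, "right": -2.0, "rite": -4.0}   # fit after "I will"

def decode(candidates, lm_weight=1.0):
    """Pick the candidate word with the best combined acoustic + language score."""
    return max(
        candidates,
        key=lambda w: acoustic_log_probs[w] + lm_weight * language_log_probs[w],
    )

print(decode(["write", "right", "rite"]))  # -> "write"
```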
Despite its longstanding use, the traditional hybrid method is not without limitations. One of the biggest drawbacks of this approach is lower accuracy compared to other methods. The traditional hybrid method is also less efficient because each system must be trained individually, making it much more labor intensive and time consuming than other approaches. Accuracy is also less reliable because each system relies on a custom phonetic set for transcription, which varies depending on who engineers or programs it.
A more modern method of providing automatic speech recognition is the end-to-end deep learning approach. An end-to-end system maps a received acoustic signal directly into a sequence of words without relying on force-aligned data. The end-to-end approach offers a more accurate transcription than the traditional hybrid method, and it can create transcripts without a separate lexicon model or language model. Three prominent end-to-end architectures are CTC (Connectionist Temporal Classification), LAS (Listen, Attend and Spell), and RNN-Transducers (RNN-T). All of these deep learning architectures can produce highly accurate transcriptions without force-aligned data, language models, or lexicon models, although adding a language model can further improve accuracy. Not only does the end-to-end approach require far less human labor than the traditional hybrid approach, but it is also easier to train and program.
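As one concrete example of how an end-to-end system turns frame-by-frame predictions into text, the sketch below applies CTC’s output rule: collapse consecutive repeated symbols, then drop the special blank token. The frame sequence is invented for the example; a real model would produce these symbols from the audio.

```python
# Minimal sketch of CTC's collapse rule (not a full model).
def ctc_collapse(frame_symbols, blank="-"):
    """Collapse consecutive repeats, then remove blank tokens."""
    collapsed = []
    previous = None
    for symbol in frame_symbols:
        if symbol != previous:
            collapsed.append(symbol)
        previous = symbol
    return "".join(s for s in collapsed if s != blank)

# Frame-level outputs "h h - e e - l - l - o o" become the word "hello".
print(ctc_collapse(["h", "h", "-", "e", "e", "-", "l", "-", "l", "-", "o", "o"]))
```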
Automatic speech recognition is more accurate today than ever, and is even reaching near-human levels of accuracy. It is also constantly improving as AI systems continue to learn and new training methods develop. The accuracy of automatic speech recognition can be affected by different variables, such as which approach or method was used to build the system. One common metric for measuring accuracy is Word Error Rate (WER). Word Error Rate is calculated by dividing the number of errors (substituted, deleted, and inserted words) by the number of words in a segment of transcribed speech. While accuracy is influenced by the training method used, WER is also affected by outside factors regardless of the approach. Variables like audio quality, accents, crosstalk, and homophones all affect the accuracy of ASR. Although ASR is not without its limitations and is still improving, current ASR systems come close to the accuracy of transcription done by a human. To illustrate the comparison, Microsoft’s ASR system boasts a WER of 5.1%, while Google’s holds a Word Error Rate of 4.9%. The average Word Error Rate of a human transcriptionist is 4%: still more accurate than ASR and better able to account for context. Despite ASR’s ever-improving accuracy, automatic speech recognition systems alone are not perfect, and there is still a need for human transcriptionists for the most reliable transcription or captions.
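For readers who want to see the metric in action, here is a small Python sketch that computes WER as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and an ASR hypothesis, divided by the number of reference words.

```python
# Word Error Rate: minimum word edits needed to turn the hypothesis into the
# reference, divided by the number of reference words.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ≈ 0.17
```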
Applications of automatic speech recognition are all around us in today’s modern world. Although the first association most people make with automatic speech recognition may be captioning of videos and television or other forms of transcription, it reaches far beyond this. Commonplace applications of automatic speech recognition are everywhere, from cell phones to the digital and virtual assistants many have in their homes. ASR is a greater part of everyday life than many realize. The applications of ASR today simplify tasks for most people in some shape or form, whether it is a smartphone transcribing and sending a text message, a virtual assistant following a command, or beyond.
An everyday application of ASR that can be found in a majority of households and many workplaces lies with virtual and digital assistants. Perhaps the most well known of these are Amazon’s Alexa, Google’s Google Assistant, Apple’s Siri, and Microsoft’s Cortana. These digital assistants are designed to carry out basic tasks and respond to and answer questions. Such AI systems can access a wide database of information and knowledge, allowing them to find answers to a variety of questions, compute calculations, and perform commands like turning appliances on and off. In business and the workplace, these digital assistants can expedite office tasks and alleviate the workload by scheduling and starting video conferences and meetings, searching for documents, and even creating graphs and inputting data into reports. Chatbots are another common use, assisting customer service personnel with commonly asked questions and other basic customer needs.
Beyond digital assistants like Siri, smartphones also make use of automatic speech recognition in various applications and speech-to-text capabilities. Popular applications like Instagram incorporate ASR by allowing users to change or activate filters by voice command. Automatic speech recognition is an integral part of every use of speech-to-text on smartphones, whether it is dictating a text message or telling a browser or app what to search for. Captioning features on social media and content platforms like Instagram and YouTube also use automatic speech recognition to provide auto-generated captions for videos.
Automatic speech recognition can help make technology and the world more accessible for the deaf and hard of hearing, as well as for those with low vision or mobility needs. One of the most notable ways ASR improves accessibility is through captioning of television and movies as well as social media content. Through captioning, ASR makes digital content more accessible and inclusive, as those with hearing loss can follow dialogue, pick up on context and background sounds, and more fully understand and experience visual content. ASR also plays a key role in helping those with accessibility needs communicate better, whether via phone calls, text messages, or emails.
Speech-to-text capabilities allow those who have mobility difficulties or low vision to dictate what they would like to include in an email or text message, and the ASR system then types it for them. This technology helps such individuals avoid the fatigue or frustration that can come from having to physically type these forms of communication on a keyboard. Individuals with hearing loss often struggle to hear phone conversations and may avoid making phone calls altogether because of this difficulty and the anxiety that can accompany it. However, automatic speech recognition helps provide accurate captions through services like InnoCaption so that the hard of hearing can regain the confidence to make phone calls independently.
InnoCaption empowers the hard of hearing community to make phone calls by using both stenographers and automatic speech recognition to provide real-time captions. Stenographers are trained professionals who use steno machines to transcribe the conversation in shorthand and provide captioning. By offering both ASR and live stenographers, InnoCaption puts the choice in the user’s hands, allowing them to switch between captioning methods as their accessibility needs change. Automatic speech recognition can provide fast, accurate captions without a stenographer or other live human being on the line. To provide best-in-class captioning through ASR, InnoCaption uses multiple engines and is consistently engineering proprietary solutions to best serve users. Using automatic speech recognition also allows InnoCaption to caption calls in both English and Spanish, and combining ASR with live stenographers lets InnoCaption serve a wider and more diverse community.
As automatic speech recognition continues to improve and grow, so does the future use and implementation of the technology. Data collection and processing have benefited accuracy and continue to enable ASR systems to better process accents and unique speech patterns. The continued learning of ASR systems points to only further use of this AI technology, and experts conjecture that it will take on larger roles in more industries as well. One expectation for ASR’s future lies in the healthcare field: many expect that chatbots and voice-tech systems will be further integrated into health screenings and administrative tasks, taking the place of humans to a greater degree. Search behaviors are also expected to change and become more reliant on voice, following the lead of digital assistants, and it is likely that many touchpoints on devices and search engines will become listening points.
“InnoCaption is what keeps me alive by being able to communicate with family and daily needs being deaf and living alone.”
— InnoCaption User
InnoCaption provides real-time captioning technology that makes phone calls easy and accessible for the deaf and hard of hearing community. The service is offered at no cost to individuals with hearing loss because we are certified by the FCC. InnoCaption is the only mobile app that offers real-time captioning of phone calls through live stenographers and automated speech recognition software. The choice is yours.