September 7, 2022
5 min read
Written by
InnoCaption App
Technology

Transcription vs Captioning vs Speech to Text

Online video has become the most popular form of media content in recent years, which is why accurate, high-quality transcription matters more than ever. US adults spend 11 hours per day connected to media, and nearly 6 of those hours are spent watching video. Perhaps you’ve found yourself eager to watch a video or listen to audio when it’s not the time or place to do so. Fortunately, there’s another way to consume that content: transcription. Source content can also be transcribed and translated into more than 200 languages, opening the door to audiences around the world.

What is Transcription

Transcription is the process in which speech or audio is converted into a written document. Closed captions are time-coded to the video, while a transcript is just the text with no time information. Transcription is a great option to make audio-only programs, such as podcasts and radio shows, more accessible to Deaf and Hard of Hearing individuals. When it comes to video, transcription is a great complement to closed-captioning; however, it is not considered a substitute based on accessibility laws and standards.
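To make the difference between captions and a transcript concrete, here is a minimal Python sketch. The `Cue` type and the `to_transcript` function are hypothetical names invented for this illustration, not part of any captioning tool: each caption cue carries a start and end time, and the transcript is what remains once the timing is dropped.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One caption cue: text plus the time span it is shown for (hypothetical type)."""
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


def to_transcript(cues):
    """Drop the timing information and join the cue text into plain text."""
    return " ".join(cue.text.strip() for cue in cues)


cues = [
    Cue(0.0, 2.5, "Welcome to the show."),
    Cue(2.5, 5.0, "Today we talk about captions."),
]

print(to_transcript(cues))
```

The captions keep the `start` and `end` values so the text can be synchronized with the video; the transcript keeps only the words.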

Hand typing on keyboard at desk

Benefits of Transcription

A permanent written record of audio files is an invaluable resource. It allows meetings and events to be searched for key terms. Reading through a document is much quicker than listening to the entire recording. Poor-quality audio can be cumbersome and challenging to listen to, but reading a well-written document is a breeze. Transcripts never depreciate in value or wear down, and digital copies are easy to back up so they are not lost. If your business has audio or video content online, transcription helps search engines find it. While search engine AI is impressive, it can’t crawl the content of a video. Having a transcript means that a search engine can ‘understand’ your content and rank it correctly. When your customers type in a relevant search term, a transcript increases the chances that they find your content, which is essential for effective marketing and for staying on your audience’s radar.

Not only can search engines crawl your content and drive traffic to your site, but viewers on your site can also find the videos they’re looking for. An interactive transcript allows users to search for keywords within the transcript and see everywhere a keyword appears. If there is a particular spot in the video a user wants to jump to, all they have to do is click the word, and the video will start playing at that spot. In a study by MIT OpenCourseWare, 97% of students said interactive transcripts enhanced their learning experience. In addition to searching for a given word within one video, you can even scan your whole video library for that keyword using playlist search. That seamless experience boosts overall customer satisfaction.
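The keyword search behind an interactive transcript can be sketched in a few lines of Python. This is only an illustration under assumed data: the `timed_words` list and the `find_keyword` function are hypothetical, and a real player would supply its own word timings and seek the video to the returned time.

```python
import string


def find_keyword(timed_words, keyword):
    """Return the start time (in seconds) of every occurrence of `keyword`.

    `timed_words` is a list of (start_seconds, word) pairs. Matching ignores
    case and any punctuation attached to the word.
    """
    target = keyword.lower()
    return [
        start
        for start, word in timed_words
        if word.strip(string.punctuation).lower() == target
    ]


# Hypothetical word-level timings for a short clip.
timed_words = [
    (0.0, "Captions"), (0.6, "help"), (1.1, "everyone."),
    (2.0, "Good"), (2.4, "captions"), (3.0, "are"), (3.3, "accurate."),
]

# Each returned time is a spot the player could jump to when the word is clicked.
print(find_keyword(timed_words, "captions"))
```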

Disadvantages of Transcription

Today’s advances in technology make many tasks and activities easier. Transcription work that used to be difficult is now far simpler when left to technology. Audio-to-text solutions are everywhere, but keep in mind that they have limitations as well. Weigh the cons against the pros when you use automated transcripts.

  • Accents and Fast Talkers: It is difficult for transcription software to produce accurate text when speakers have regional accents or talk too fast. In these kinds of recordings, software-generated transcripts are prone to mishearings that produce words and phrases that do not make sense, leaving you with a useless transcript.
  • Audio Challenges: Background noise is the most common audio difficulty. Severe noise can drown out the speakers, making it impossible for the software to produce an accurate transcript. A file also has audio challenges when the speakers whisper, mumble, or stutter, or when the recording contains technical issues such as interference and feedback. Severe distortion and echoes are negative factors too. To generate useful transcripts, provide as clean a digital recording as you can.
  • Limited Vocabulary: Machine-generated transcripts often have a limited vocabulary for proper nouns and specialized terms. Speech-to-text software typically has difficulty recognizing and correctly spelling unique or local names, and the same goes for establishments and brand names.
  • Technical Difficulties: Machines do not always work. If you depend fully on an audio-to-text converter, technical glitches will be a big problem. If they are not resolved quickly, they can delay your project deadlines.
  • Customization: Transcription software cannot automatically produce custom transcripts. If you want a specific format or other details in your transcript (e.g., speaker labels and particular punctuation), you’ll have to add them yourself.

What is Captioning

Captioning is the process of converting the audio content of a television broadcast, webcast, film, video, CD-ROM, DVD, live event, or other productions into text and displaying the text on a screen, monitor, or other visual display system.  Captions not only display words as the textual equivalent of spoken dialogue or narration, but they also include speaker identification, sound effects, and music description.  It is important that the captions are (1) synchronized and appear at approximately the same time as the audio is delivered; (2) equivalent and equal in content to that of the audio, including speaker identification and sound effects; and (3) accessible and readily available to those who need or want them. Captions must have sufficient size and contrast to ensure readability, and be timely, accurate, complete, and efficient.  When displayed, captions must be in the same line of sight as any corresponding visual information, such as a video, speaker, field of play, activity, or exhibition.
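As an illustration of synchronized captions that also carry speaker identification and sound effects, the following Python sketch writes a few cues in WebVTT, one common caption file format. The cue data and the `format_timestamp` and `to_webvtt` helpers are hypothetical examples, and the sketch covers only the basic layout of the format: a `WEBVTT` header followed by timed cues.

```python
def format_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp (hh:mm:ss.mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(cues):
    """Build a minimal WebVTT document from (start, end, text) tuples."""
    blocks = [
        f"{format_timestamp(start)} --> {format_timestamp(end)}\n{text}"
        for start, end, text in cues
    ]
    return "WEBVTT\n\n" + "\n\n".join(blocks) + "\n"


# Hypothetical cues: a music description, then dialogue with speaker identification.
cues = [
    (0.0, 2.0, "[upbeat music]"),
    (2.0, 4.5, "ANNA: Thanks for joining us today."),
]

print(to_webvtt(cues))
```

Each cue's start and end times are what keep the text on screen at approximately the same time as the audio it represents.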

Captioning makes audio and audiovisual material accessible and provides a critical link to communication, information, education, news, and entertainment for more than 36 million Americans who are deaf or hard of hearing.  For individuals with limited English proficiency and for English-language learners, English-language captions improve comprehension and fluency.  Captions can also help to improve the literacy skills of children and adults alike.

When captions are visible only when selected and activated, such as when they are visible on a television screen, they are called “closed captions.”  When captions cannot be selected or activated, such as when they are permanently embedded in the audiovisual material, they are called “open captions.”  Captions may also be presented selectively to individuals with specialized caption display equipment.

Captions are commonly produced in advance for pre-recorded material.  When captions are provided for live presentations, they are called “real-time” captions.  Communication Access Realtime Translation (CART) is a form of captioning that can be provided on-site or remotely, usually for live presentations such as meetings, classes, or conferences.

Gray background with black microphone graphic and black paper to show text to speech

Benefits of Captioning


Accessibility for Deaf or Hard of Hearing Viewers:

Given that closed captions were originally developed as an accommodation to provide an equivalent entertainment experience for d/Deaf and hard of hearing people, it makes sense that content accessibility is arguably the most important benefit of captioning. Captions are time-synchronized text that accompanies video content, and transcripts are the complete plain-text version of all captions generated.

In combination, transcription and captioning provide a critical alternative for the 48 million Americans with hearing loss and the 360 million people worldwide who experience disabling hearing loss. Quite simply, closed captions allow these viewers to consume your video content, granting them access and simultaneously increasing your audience.

Improved Audience Comprehension:

Students in online learning environments regularly reap the benefits of video captioning. In a national research study conducted with Oregon State University, 52% of students reported that captions helped them as a learning aid by improving comprehension.

Closed captions can greatly enhance the experience for viewers whose native language is not English. In the same study with Oregon State University, 66% of those students who are learning English as a second language reported that they find captions “very” or “extremely” helpful, as captions allow them to read along while they listen. Watching videos with captions can also help children improve their literacy. A study by Michigan State University concluded that “captions are beneficial because they result in greater depth of processing by focusing attention, reinforce the acquisition of vocabulary through multiple modalities, and allow learners to determine meaning through the unpacking of language chunks.”

Woman in white sweater typing on stenotype machine at desk.

Disadvantages of Captioning

Open captions are incorporated directly into the video stream, making it difficult for viewers to deactivate them if they have no use for them. The quality of open captions is also associated with the quality of the video or stream. If the video or stream is blurry or of low-quality, the captions can also be unclear and could be challenging to read.

Closed captions are not compatible with some media players and streaming platforms. They only function if the platform supports closed caption files. They also require the viewer to know how to switch the captions on and off, so they are not an ideal option if your audience has difficulty with technology.

What is Speech to Text

Speech to text is speech recognition software that recognizes spoken language and converts it into text through computational linguistics. It is also known as speech recognition or computer speech recognition. Specific applications, tools, and devices can transcribe audio streams in real time to display text and act on it. Converting speech to text relies on a complex machine learning model and involves several steps:

  1. When sound comes out of someone’s mouth to create words, it also makes a series of vibrations. Speech to text technology works by picking up on these vibrations and translating them into a digital language through an analog to digital converter.
  2. The analog-to-digital converter takes sounds from an audio file, measures the waves in great detail, and filters them to distinguish the relevant sounds.
  3. The sounds are then segmented into hundredths or thousandths of seconds and matched to phonemes. A phoneme is a unit of sound that distinguishes one word from another in any given language. For example, there are approximately 44 phonemes in the English language.
  4. The phonemes are then run through a network via a mathematical model that compares them to well-known sentences, words, and phrases.
  5. The result is then output as text or as a computer command, based on the most likely interpretation of the audio.
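The first three steps above can be sketched with a toy example in Python. Everything here is an assumption made for illustration: the 16 kHz sample rate, the 10-millisecond frame length, and the function names are hypothetical, and a pure tone stands in for recorded speech. Matching frames to phonemes and words (steps 4 and 5) requires a trained model and is not shown.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed value for this sketch)
FRAME_MS = 10         # frame length in milliseconds (hundredths of a second)


def sample_tone(freq_hz, duration_ms):
    """Stand-in for an analog-to-digital converter: sample a sine wave
    and quantize each sample to a 16-bit integer."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [
        round(32767 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n)
    ]


def split_frames(samples):
    """Segment the samples into consecutive, non-overlapping frames."""
    size = SAMPLE_RATE * FRAME_MS // 1000
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def frame_energy(frame):
    """Average squared amplitude: a crude cue for separating sound from silence."""
    return sum(s * s for s in frame) / len(frame)


samples = sample_tone(440, 100)   # 100 ms of a 440 Hz tone
frames = split_frames(samples)
print(len(samples), "samples in", len(frames), "frames of", len(frames[0]))
```

A real recognizer would extract richer features from each frame than a single energy value, but the framing step itself looks much like this.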

Benefits of Speech to Text

Like all forms of technology, speech to text has many benefits that help us improve daily processes. These are some of the main advantages of using speech to text:

  • Save time: Automatic speech recognition technology saves time by delivering transcripts in real time.
  • Cost-efficient: Most speech to text software has a subscription fee, and a few services are free. A subscription typically costs far less than hiring human transcription services.
  • Enhance audio and video content: Speech to text capabilities mean that audio and video data can be converted in real time for subtitling and fast video transcription.
  • Streamline the customer experience: By drawing on natural language processing, speech to text makes customer interactions easier, more accessible, and more seamless.

Gray background with black graphic of speech to text

Disadvantages of Speech to Text

New technologies like speech to text are not without imperfections. These are some of the main limitations of speech to text:

  • It isn’t perfect: While dictation technology is a powerful tool, it is still in its early days, which means there are some gaps in its overall performance. Because it produces verbatim text only, you can end up with an inaccurate or awkward transcript or missing specific quotations.
  • Requires human input: Because speech to text lacks complete accuracy, some human edits to the speech data are required for optimal usage.
  • Requires clean recordings: To get a quality transcript from voice recognition software, you need to ensure the recorded audio is clear and intelligible. This means minimal background noise, clear pronunciation, accents the software recognizes well, and one person speaking at a time. With some dictation tools, you also need to speak voice commands for punctuation.
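One common way to put a number on how much human editing a machine transcript needs is word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the machine output into a trusted reference transcript, divided by the number of words in the reference. The Python sketch below is a minimal illustration of that metric; the function name and the sample sentences are made up for the example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two texts, divided by the
    number of words in the reference transcript."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "please call the clinic on main street"
hypothesis = "please call the clinic on maine street"  # one misheard word

print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

A lower value means fewer corrections for the human editor; a value of 0 means the machine transcript matches the reference word for word.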


Transcription, Captioning, Speech to Text, and Accessibility

Accessibility Laws

It’s estimated that as many as 60% of Americans with hearing loss are in the workforce or in an educational setting. To protect the rights of disabled people and ensure their access to the same resources as the rest of the population, several anti-discrimination laws have been enacted in the United States. Some of those laws require that publicly published videos include closed captions so that they are fully accessible, while standards for broadcast television and media are strictly regulated by the FCC.

The Americans with Disabilities Act (ADA) is a broad anti-discrimination law for people with disabilities. Titles II and III of the ADA affect web accessibility and closed captioning.

Title II prohibits disability discrimination by all public entities at the local and state level. Governmental organizations must ensure “effective communication” with citizens, including providing assistive technology or services as needed.

Title III prohibits disability discrimination by “places of public accommodation.” A place of public accommodation covers shared or public entities like libraries, universities, hotels, museums, theaters, transportation services, etc., that are privately owned. Video displayed within or distributed by such places must be captioned.

Both Title II and Title III offer a disclaimer about instances where such accommodation would create an “undue hardship” for the organization. This is often the crux of arguments in ADA lawsuits about whether or not an organization must provide closed captioning. Another point of contention is whether or not a purely online business can be considered a “place of public accommodation.”

Closed captioning requirements are written directly into Section 508 of the Rehabilitation Act of 1973 and are often extended to apply to Section 504. Many states have “mini 508” laws as well. The Section 508 refresh was released in January 2017 and now references the WCAG 2.0 guidelines as the accessibility standards to meet, which include both captioning and audio description requirements.

Section 504 of the Rehabilitation Act protects the civil rights of people with disabilities by requiring all federal entities — and organizations that receive federal funding — to make accommodations for equal access. This means that closed captioning must be provided for users who are deaf or hard of hearing.

Section 508 of the Rehabilitation Act requires electronic communications and information technologies, such as websites, email, or web documents, be accessible. For video content, closed captions are a specific requirement.

Over the last decade, many organizations have been sued for failing to provide comprehensive captioning for online video and audio content. Generally speaking, the best way to avoid being part of this legal battle is to proactively transcribe and caption your videos.


Make calls with confidence

InnoCaption provides real-time captioning technology that makes phone calls easy and accessible for the deaf and hard of hearing community. It is offered at no cost to individuals with hearing loss because InnoCaption is certified by the FCC. InnoCaption is the only mobile app that offers real-time captioning of phone calls through live stenographers and automated speech recognition software. The choice is yours.
