YouTube Captioning

Yesterday, Google announced that it would be deploying several new options for increasing the number and quality of closed-captioned videos on the site. The New York Times reported on this as a first step to making videos available to deaf and hearing-impaired audiences, but it seems clear that there are a lot of potential beneficiaries – foreign language audiences (captions can be translated to 51 languages), those of us who can’t turn on the speakers at work, and anyone who wants to search the verbal content of a video.

So, how are they doing it? First, speech-to-text technology currently used by Google Voice is being applied to a small number of videos on the site (largely educational content) to produce captions automatically.

“Because the tools are not perfect, we want to make sure that we get feedback from the video owners and the viewers before we roll it out for the whole world,” Mr. Harrenstien said. “Sometimes the auto-captions are good. Sometimes they are not great, but they are better than nothing if you are hearing-impaired or don’t know the language.”

Presumably, if this works, speech-to-text will be rolled out more broadly. For now, you can take a look at how this works below. To see the captions, Google/YouTube explains: “Click on the menu button at the bottom right of the video player, then click CC and the arrow to its left, then click the new ‘Transcribe Audio’ button.” I’ve picked a clip from PBS’s upcoming series This Emotional Life, focusing on Asperger’s.

Obviously, it’s not perfect – “Asperger’s syndrome” is transcribed as “Mister Gerson” – so I hope that speech-to-text improves before this initial stage is extended to other videos. This, however, leads to the second option that Google/YouTube have made available, which is to provide your own captions for videos you upload.

Now, after you upload a video, you can also upload a text file – YouTube will combine the video and the text to create captions. Through “auto-timing,” YouTube will match a transcript (a file with only verbal content) to the video using speech recognition, or will match a caption file (which includes time codes for the text to appear) to the video. The help file on this seems fairly clear, and also includes tips such as adding bracketed information about non-verbal sounds [whistling], or using >> to indicate changing speakers in the captions.
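To make the difference between a plain transcript and a timed caption file concrete, here is a minimal Python sketch that writes out timed captions. The layout (a "start,end" line followed by the caption text, with a blank line between cues) is my assumption about a SubViewer-style file, so check YouTube's help page for the exact format it accepts. The function names, the speaker label and the sample cues are all made up for illustration.

```python
def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as H:MM:SS.mmm (assumed time code layout)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours}:{minutes:02d}:{secs:02d}.{ms:03d}"


def build_caption_file(cues) -> str:
    """Turn (start_seconds, end_seconds, text) tuples into caption text.

    Each cue becomes a "start,end" line followed by its text, and cues
    are separated by a blank line. This layout is an assumption, not a
    confirmed YouTube specification.
    """
    blocks = []
    for start, end, text in cues:
        if end <= start:
            raise ValueError(f"cue ends before it starts: {text!r}")
        times = f"{format_timestamp(start)},{format_timestamp(end)}"
        blocks.append(f"{times}\n{text}")
    return "\n\n".join(blocks) + "\n"


# Hypothetical cues: ">>" marks a change of speaker and square
# brackets describe a non-verbal sound, as the help file suggests.
cues = [
    (3.49, 7.43, ">> HOST: Welcome to the show."),
    (7.43, 10.0, "[whistling]"),
]

caption_text = build_caption_file(cues)
print(caption_text)
```

A transcript for auto-timing would be just the text lines with no time codes at all, which is why it is so much less work to prepare when the speech recognition can cope with the audio.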

I gave it a try – not the easiest experience. They weren’t kidding when they said that clear speech works best, as my transcript file (no time code) could not be matched and displayed as captions. People singing to cats didn’t translate well. Thus, to get a captioned video, I had to try the old-fashioned way, creating a .sub file with time codes. This quickly got me in a bit over my head – while I could do it, given the time, there’s a reason most people don’t caption their YouTube videos. It’s time-intensive, there’s a learning curve involved, and the results may not seem important enough to justify the work.

This, of course, is exactly why forays into speech-to-text and auto-timing are so exciting. If captions could be created automatically, or from a simple text file, captions on user-created video would certainly become more common and make the world more accessible. While the tools as they are today aren’t anywhere near perfect, they’re certainly a first step in creating automatic accessibility features for participatory media.

As someone who studies accessibility and internet media, I’m constantly torn between getting excited about social/participatory media and being disappointed in their access options. This WordPress blog I’m using is notoriously terrible in its implementation of image alt text, for instance. Blogging has given so many people an outlet to write and connect, but if they want to make a blog accessible, it takes additional research and effort. Attempts to build accessibility features in automatically are, in my opinion, game-changers when they’re done well. I’ll withhold judgment on this YouTube move for now – it has potential – but I’ll be watching to see whether it develops.


  1. Very interesting! I’m curious to see how far this speech-to-text technology can be improved; definitely an interesting application of it. But watching the video with captions that are nonsensical, it makes me wonder whether accessibility can be automated.

    • Glenda, I totally agree. There’s a long way to go before speech-to-text technology works reliably, and I think there are some concerns about voice quality, tone, and accent that will remain problematic – will only male native English speakers be understood by the machine? Google Voice can never manage to translate the voicemails I leave into text.

      Plus, there are always judgement calls in this kind of work, always a stage at which human interpretation seems to be the only way to go.

  2. liz, have you seen Greg Downey’s book Closed Captioning: Subtitling, Stenography, and the Digital Convergence of Text with Television? i saw him give a presentation on campus a couple years ago and it was pretty interesting. he tracks the history of speech-to-text technologies and labor practices, pointing out that deaf and hard-of-hearing educators and activists had been pushing for this technology for decades, but it was only when tv industries realized the value of captioning in an era of digital convergence (ie, it supplied metadata for indexing and retrieving digital assets) that D/HOH audiences witnessed a wider embrace of captioning.
