What are captions?
Captions are the transcription of spoken dialog into written text that is displayed during video playback. Captions can also describe actions that occur in the video. Subtitles are similar, but are often a translation of the captions into a different language, and do not contain the descriptions. While initially introduced as an accessibility feature for those who are deaf or hard of hearing, captions are increasingly popular for video playback. Amongst the reasons:
- Autoplaying video on the web generally must be muted, because most browsers block autoplay with sound. When captions are added, users tend to linger on the autoplaying video.
- Videos can be watched at a lower volume - allowing parents to watch videos while children sleep (or the other way around).
- Accents. Sometimes, the audio can be hard to understand, and the captions help resolve this issue.
Types of captions
You might think that captions are all the same, but there are multiple types. Let's go over some of the most common ones:
- Closed Captions (CC) - CCs are a transcription or translation of the dialogue and other audio information when sound is unavailable or not clearly audible.
- Open Captions - These are like closed captions, except you can't toggle them on and off as you can with CCs. They're there the whole time.
- Subtitles - While these are basically the same thing as CCs, HTML5 gives them a slightly different definition. Subtitles are transcriptions or translations of dialogue when sound is available but not understood by the viewer, for example when something is spoken in a foreign language.
- Subtitles for the Deaf and Hard of Hearing (SDH) - These subtitles can contain extra information about audio that deaf or hard-of-hearing viewers would otherwise miss, for example that a song is playing or that there are special sound effects in the scene.
How are captions stored?
In the old days, for analog TV, captions were hidden on line 21 of the vertical blanking interval (VBI) of a video signal. (The VBI is the time between the end of the final visible line of one frame of video and the start of the first visible line of the next. It's specific to analog TV.) To see the hidden captions, you'd turn on caption decoding using the TV remote.
Today, digital video displays are still designed to handle the analog style of captions, which is called CEA-608 (CEA stands for Consumer Electronics Association). However, the newer style of captions, called CEA-708 or sometimes EIA-708 (Electronic Industries Alliance), is now preferred. This is because CEA-708 captions let you have multiple text sizes, 64 text colors, 64 background colors and 8 different font options. You can also change the opacity of the background and add a drop shadow to your text (although I'm not sure why captions need drop shadows). Additionally, CEA-708 captions support special characters, symbols and multiple languages, and they don't have to be positioned on line 21. Instead, they're embedded in the video stream.
Caption delivery: the VTT file
The Video Text Track (VTT) format, formally Web Video Text Tracks (WebVTT), is a standardized file type for delivering captions.
For each caption, the VTT file identifies a start time, an end time and the string of text to display. For example, if you were Rick Rolled, you might see this caption:
5
00:00:44.030 --> 00:00:47.260
Never going to let you down.
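To make the structure of a cue concrete, here is a minimal Python sketch that reads one cue (an optional identifier, a timing line, then the text) and converts its timestamps to seconds. The names `Cue`, `parse_cue` and `format_timestamp` are made up for this illustration and are not part of any library, and the sketch ignores optional WebVTT features such as cue settings and the `WEBVTT` header line that starts a complete file.

```python
import re
from dataclasses import dataclass

# A timestamp looks like HH:MM:SS.mmm; the hours field is optional in WebVTT.
_TIMESTAMP = r"(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})"
_TIMING_LINE = re.compile(rf"^{_TIMESTAMP}\s+-->\s+{_TIMESTAMP}")


@dataclass
class Cue:
    """One caption cue (hypothetical helper type for this sketch)."""
    identifier: str
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


def _to_seconds(hours, minutes, seconds, millis):
    """Convert the captured timestamp fields to seconds."""
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_cue(block: str) -> Cue:
    """Parse a single cue block: optional identifier, timing line, text."""
    lines = block.strip().splitlines()
    identifier = ""
    # If the first line has no arrow, treat it as the cue identifier.
    if "-->" not in lines[0]:
        identifier = lines.pop(0).strip()
    match = _TIMING_LINE.match(lines[0].strip())
    if match is None:
        raise ValueError(f"not a valid timing line: {lines[0]!r}")
    groups = match.groups()
    start = _to_seconds(*groups[:4])
    end = _to_seconds(*groups[4:])
    return Cue(identifier, start, end, "\n".join(lines[1:]))


def format_timestamp(seconds: float) -> str:
    """Format a number of seconds as an HH:MM:SS.mmm timestamp."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


if __name__ == "__main__":
    cue = parse_cue("5\n00:00:44.030 --> 00:00:47.260\nNever going to let you down.")
    print(cue.identifier, format_timestamp(cue.start), format_timestamp(cue.end), cue.text)
```

Storing the times as seconds makes it easy to shift or compare cues, and `format_timestamp` turns them back into the form a VTT file expects.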
Captions with api.video
Captions are added to a video using the Upload caption endpoint.
You can read more in our tutorial.