A failed app idea and some new learnings

Last week, while browsing Hacker News, I came across a discussion on Spotify and podcasts. There I saw a few features that users had requested in podcast apps, and thought of creating a new app to address some of them. The feature I liked most and wanted to implement was transcripts. I really liked the idea of showing captions/subtitles the way the Spotify app shows music lyrics.

At first I thought of using the Web Speech API to convert audio to text, and wondered why nobody was doing that. Well, I know the reason now: we cannot feed an arbitrary audio stream to the recognition service. It works only with an input device controlled by the user agent, such as the microphone.
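For context, here is a minimal sketch of how the Web Speech API is typically wired up. The names `createRecognizer`, `globalObj` and `onTranscript` are made up for this illustration, and `globalObj` is a parameter only so the function can be tried outside a browser. Notice that nothing in this usage lets us hand over an audio file; the browser chooses the input device when `start()` is called.

```javascript
// Sketch: set up speech recognition if the environment offers it.
// `globalObj` is normally `window` (hypothetical parameter for this example).
function createRecognizer(globalObj, onTranscript) {
  const Recognition =
    globalObj.SpeechRecognition || globalObj.webkitSpeechRecognition;
  if (!Recognition) {
    return null; // API not available in this environment
  }

  const recognition = new Recognition();
  recognition.lang = 'en-US';
  recognition.continuous = true;      // keep listening after each phrase
  recognition.interimResults = false; // only report final results

  recognition.onresult = (event) => {
    // Each result holds one or more alternatives; take the first one.
    for (let i = event.resultIndex; i < event.results.length; i++) {
      onTranscript(event.results[i][0].transcript);
    }
  };
  return recognition;
}

// In a browser:
//   const recognition = createRecognizer(window, console.log);
//   if (recognition) recognition.start(); // listens to the microphone
```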


I was devastated for some time. Then I thought: if we could generate transcript files offline, how could we integrate them with audio files? I was thinking of something similar to the external subtitle files used in video players. I came to know that this is possible with the help of text tracks (the TextTrack API) and the WebVTT format. I did a small POC and it went well.
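As a rough idea of what "generating transcript files offline" could look like, here is a small sketch that turns timed segments into the text of a WebVTT file. The segment shape (`start`, `end` in seconds, plus `text`) and the function names are assumptions for this example, not something a particular transcription tool produces.

```javascript
// Format a number of seconds as a WebVTT timestamp (hh:mm:ss.ttt).
function toVttTimestamp(totalSeconds) {
  const totalMs = Math.round(totalSeconds * 1000);
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  const pad = (n, width) => String(n).padStart(width, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(ms, 3)}`;
}

// Build the text of a .vtt file from segments like
// { start: 1, end: 4, text: '- Hello' } (hypothetical shape).
// Cue text is assumed to contain no blank lines and no "-->".
function toWebVtt(segments) {
  const blocks = segments.map(
    (seg) =>
      `${toVttTimestamp(seg.start)} --> ${toVttTimestamp(seg.end)}\n${seg.text}`
  );
  // Blocks in a WebVTT file are separated by a blank line.
  return ['WEBVTT', ...blocks].join('\n\n') + '\n';
}

console.log(toWebVtt([{ start: 1, end: 4, text: '- Never drink liquid nitrogen.' }]));
```

The resulting string can be saved next to the audio file and referenced from a track element.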


Here is the HTML audio tag with multiple text tracks.

    <audio id="podcast_audio" controls class="px-2 w-full" src="sounds/podcast_sample.mp3">
        <track id="caption_en" default kind="captions" srclang="en" src="sounds/podcast_sample.vtt">
        <track id="subtitle_de" kind="subtitles" srclang="de" src="sounds/podcast_sample_02.vtt">
        <a href="sounds/podcast_sample.mp3">Download</a>
    </audio>
    <span id="audio_caption"></span>

We need to add some JavaScript to render the captions, because the audio element has no rendering area of its own. This is not required for video elements, which render text tracks themselves.

window.onload = function() {
    const textTracks = document.getElementById('podcast_audio').textTracks;
    // Track currently in 'showing' mode, if any
    var activeTextTrack = getActiveTextTrack(textTracks);
    // Re-resolve the active track whenever a track's mode changes
    textTracks.onchange = (event) => {
        activeTextTrack = getActiveTextTrack(textTracks);
    };
    const enTrack = textTracks.getTrackById('caption_en');
};

Method to get the active text track. Which track is active depends on the user's selection.
function getActiveTextTrack(textTracks) {
    var activeTextTrack;
    for (var track of textTracks) {
        if (track.mode === 'showing') {
            activeTextTrack = track;
        }
    }
    if (activeTextTrack) {
        // Update the caption whenever the set of active cues changes
        activeTextTrack.oncuechange = onCueChange;
    }
    return activeTextTrack;
}

function onCueChange(event) {
    const captionSpan = document.getElementById('audio_caption');
    var cueText = "";
    // `this` is the TextTrack that fired the event; overlapping cues are all in activeCues
    for (var activeCue of this.activeCues) {
        cueText += "\n" + activeCue.text;
    }
    captionSpan.innerText = cueText;
}

Text file for caption.

WEBVTT - podcast_sample.vtt

00:01.000 --> 00:04.000
- Never drink liquid nitrogen.

00:05.000 --> 00:09.000
- It will perforate your stomach.
- You could die.

00:07.000 --> 00:13.000
- This is an overlapping caption
- Will it work?
- This is an overlapping caption
- Will it work?
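To reason about the overlapping cue outside the browser, here is a plain JavaScript sketch that mimics what `activeCues` would contain at a given playback time and joins the text the same way `onCueChange` does. The helpers `activeCuesAt` and `captionTextAt` are hypothetical names for this example, and the cue list is the caption file above typed in by hand, with times in seconds.

```javascript
// Cues from the sample caption file, times in seconds.
const cues = [
  { start: 1, end: 4, text: '- Never drink liquid nitrogen.' },
  { start: 5, end: 9, text: '- It will perforate your stomach.\n- You could die.' },
  { start: 7, end: 13, text: '- This is an overlapping caption\n- Will it work?' },
];

// Cues whose time range contains `time` (start inclusive, end exclusive).
function activeCuesAt(cues, time) {
  return cues.filter((cue) => cue.start <= time && time < cue.end);
}

// Same joining logic as onCueChange: one line break before each cue's text.
function captionTextAt(cues, time) {
  let cueText = '';
  for (const cue of activeCuesAt(cues, time)) {
    cueText += '\n' + cue.text;
  }
  return cueText;
}

// Between 7s and 9s the second and third cues are both active.
console.log(captionTextAt(cues, 8));
```

So in the overlap window both cues are joined into the caption, one after the other.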

Text file for subtitle.

WEBVTT - podcast_sample_02.vtt

NOTE This is a sample note/comment

00:01.000 --> 00:04.000
- Ta en kopp varmt te.
- Det är inte varmt.

00:05.000 --> 00:09.000
- Har en kopp te.
- Det smakar som te.

NOTE This last line may not translate well.

00:10.000 --> 00:15.000
- Ta en kopp
- Ta en kopp
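To make the structure of these files concrete, here is a minimal parsing sketch. It is not a full WebVTT parser: it only handles the simple layout used in the samples above (blocks separated by blank lines, a timing line, then text) and skips NOTE comment blocks. The function names are made up for this example.

```javascript
// Convert "mm:ss.ttt" or "hh:mm:ss.ttt" to seconds.
function parseVttTimestamp(stamp) {
  return stamp
    .split(':')
    .map(Number)
    .reduce((total, part) => total * 60 + part, 0);
}

// Parse WebVTT text into [{ start, end, text }], ignoring comments.
function parseWebVtt(text) {
  const blocks = text.replace(/\r\n?/g, '\n').split(/\n{2,}/);
  const cues = [];
  // The first block is the "WEBVTT" header, so skip it.
  for (const block of blocks.slice(1)) {
    const lines = block.split('\n').filter((line) => line !== '');
    if (lines.length === 0 || /^NOTE(\s|$)/.test(lines[0])) {
      continue; // empty block or comment
    }
    const timingIndex = lines.findIndex((line) => line.includes('-->'));
    if (timingIndex === -1) {
      continue; // no timing line, so not a cue
    }
    const [startPart, endPart] = lines[timingIndex].split('-->');
    const endStamp = endPart.trim().split(/\s+/)[0]; // drop any cue settings
    cues.push({
      start: parseVttTimestamp(startPart.trim()),
      end: parseVttTimestamp(endStamp),
      text: lines.slice(timingIndex + 1).join('\n'),
    });
  }
  return cues;
}
```

In the browser this step is not needed, since the track element does the parsing; a sketch like this is mainly useful for checking generated files offline.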

What's Next

Next, I need to work out how to generate transcripts from audio files. As of now, my understanding is that we need to use a cloud service for this. Also I want to learn more about