candideu/Open Source AI Scribe, Auto-Transcriber, Speech-to-text Transcriptions, Captions & Subtitles Exporter, Interactive Transcripts, Alternative to Otter.ai, Descript, Sonix.ai.md

## Open Source AI Scribe, Auto-Transcriber, Speech-to-text Transcriptions, Captions & Subtitles Exporter, Interactive Transcripts, Alternative to Otter.ai, Descript, Sonix.ai.md

      
Hello world!
As a video editor, researcher, digital media enthusiast, and lover of all things FLOSS, I've been on the hunt for an open source alternative to proprietary services like Otter.ai, Sonix, and Descript. I've pitched my idea on open-source-ideas, but I wanted to create a dedicated post for it so that it can reach as many people as possible.
Project description

The idea

A simple, easy-to-use application where users can dictate or upload audio or video files, and an automated transcript is generated. This transcript is synced to the audio track, clickable, and editable, so that users can skip to certain passages and refine the transcript accordingly.
The revised transcript can then be exported as plain text, .srt caption file (and other subtitle formats), .pdf, shareable web page, etc. for further processing.
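To make the export step a bit more concrete, here is a minimal Python sketch of writing a revised transcript out as an .srt file. The `Segment` class and the function names are made up for illustration; the only thing taken as given is the SubRip layout itself (a running number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, then the caption text).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One caption cue: start/end in seconds plus the text shown on screen."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SubRip (.srt)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render cues as numbered SRT blocks separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        blocks.append(f"{index}\n{timing}\n{seg.text}\n")
    return "\n".join(blocks)
```

Other subtitle formats would mostly differ in the timestamp and header layout, so the same list of segments could feed several exporters.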
Users can also provide their own language models, so that the number of possible languages that can be transcribed grows over time, as people create new models.
This application could be something you access from a browser that uses local storage, or a downloadable app (using something like Electron).
Inspiration, and the "Why"

As someone who works a lot with video and audio, and aims to make my work accessible, I'm a big fan of Otter.ai and Sonix.ai. They're very easy to use and provide pretty accurate transcriptions.


Issues, and what's missing in existing tools

That said, Otter and Sonix are not open-source, and their free tiers can be limiting. Both Otter and Sonix offer three lifetime uploads max, and Otter allows 40 minutes of live transcriptions per recording, with a max of 600 minutes a month (no rollover).
Otter only does transcriptions in English. Sonix does offer 37+ languages, but it doesn't look like you can provide your own language models. Other options like YouTube's automated transcriptions offer a wider range of languages, but that involves having to upload the media to YouTube, and there's no clickable transcript option.
Another issue is that some folks use automated transcriptions in their line of work, but cannot use cloud-based, proprietary software for legal reasons (see this Reddit thread).
Relevant Technology

I am in no way an expert, but it seems like Python would be relevant. That said, I'm open to any ideas, and open to having this be an application that's downloaded on your computer (with cross-platform support), or a web application that uses local storage, etc.
Speech-to-text

Vosk Browser


VOSK Browser is a speech recognition library running in the browser thanks to a WebAssembly build of Vosk. This implementation is probably the one I'm the most excited about because it's very close to what I had in mind. The demo they've created allows you to use your microphone or to upload an audio file to create the transcription. The cool thing about this approach is that you don't need to set up any loopback methods if you are using pre-recorded audio, because the demo seems to do it on its own.
According to the dev, "This project aims just to be a library that wraps a wasm build of vosk and the demo is just a demo of what can be done so I won't be adding such functionalities to the library itself. I have thought of integrating transcription with vosk-browser to oTranscribe which I guess would achieve what you want. I currently have no time for that but maybe someone can pick this up, would be really cool."
Potential ways to build upon this project:

Adding punctuation: I've found a number of punctuation restoration projects on here that could help with that such as punctuator2 and its many forks such as PunkProse. Punctuator2 even has a nifty demo which you can try out here. I also found an implementation of PunkProse + VOSK here.
Making the transcript editable
Adding timings that are synced to the audio (I assume that the live dictation would have to be recorded)
The ability to export the work as a subtitle/caption file
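For the timing and caption items above, one possible approach is to group per-word timings into caption-sized cues. The sketch below is only an illustration: it assumes each word arrives as a dict with `word`, `start`, and `end` keys (times in seconds), and the `max_chars` and `max_gap` thresholds are arbitrary values I picked, not anything defined by VOSK or vosk-browser.

```python
def _merge(words):
    """Collapse a run of word dicts into one cue dict."""
    return {
        "start": words[0]["start"],
        "end": words[-1]["end"],
        "text": " ".join(w["word"] for w in words),
    }


def group_words(words, max_chars=42, max_gap=0.8):
    """Group word timings into caption cues.

    A new cue starts when adding the next word would make the line longer
    than max_chars, or when the pause before it exceeds max_gap seconds.
    """
    cues = []
    current = []
    for w in words:
        if current:
            length = len(" ".join(x["word"] for x in current)) + 1 + len(w["word"])
            gap = w["start"] - current[-1]["end"]
            if length > max_chars or gap > max_gap:
                cues.append(_merge(current))
                current = []
        current.append(w)
    if current:
        cues.append(_merge(current))
    return cues
```

Each resulting cue keeps its start and end time, so it could be handed to a subtitle exporter or used to sync the transcript to the audio.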

Check out the Demo: https://ccoreilly.github.io/vosk-browser/
View GitHub Repo: https://github.com/ccoreilly/vosk-browser
ideasman42/nerd-dictation

Uses the VOSK API, but is meant for Linux and has to be installed from the command line. It also doesn't have a clickable transcript.

  
Video demo
Source code can be viewed here
saharmor/realtime-transcription-playground

Very similar to what I'm proposing, but uses Google's Speech API, which involves creating a service account and knowing how to use their Cloud Console.

  
Source code can be viewed here
STTWebApp


A web application that uses VOSK to transcribe audio to text in Portuguese.
It would be great if users could supply the language model of their choice.
Source code can be viewed here

Clickable, Interactive Transcript

AblePlayer

Able Player is a fully accessible, open-source cross-browser HTML5 media player. It's not a speech-to-text API, but the player has a really neat clickable transcript feature that can be seen in the following example:

Demo #6


The source code can be viewed here.
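As a rough illustration of the logic behind a clickable, synced transcript (not Able Player's actual implementation), the player needs to know which cue is active at the current playback position so it can highlight it; clicking a cue simply seeks to that cue's start time. Assuming the cue start times are sorted, a binary search does the lookup. The function name and the parallel-list layout are my own choices for this sketch.

```python
from bisect import bisect_right


def active_cue_index(start_times, end_times, position):
    """Return the index of the cue playing at `position` seconds, or None.

    start_times must be sorted ascending; end_times[i] belongs to the same
    cue as start_times[i]. Gaps between cues (silence) return None.
    """
    i = bisect_right(start_times, position) - 1
    if i >= 0 and position < end_times[i]:
        return i
    return None
```

For example, with cues starting at 0, 2, and 5 seconds, a playback position of 2.5 seconds maps to the second cue (index 1), which is the one the interface would highlight.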

Subtitle + Transcript Editors + Previewers

oTranscribe

oTranscribe is one of the more well-known options in this space. It's a tool for manually transcribing audio interviews that allows you to import a video or audio file, and manually type the transcript. You can also add timestamps which can be clicked on to jump to that point in the audio/video. oTranscribe also features great keyboard shortcuts and playback tools to ease the transcription process.

There's even an oTranscribe for Electron fork that could be interesting to look into.
Drawbacks:

No speech-to-text
Cannot export to .srt (although an .otr to .srt conversion is possible with this external tool)
Cannot edit timestamps as text

View the website here: https://otranscribe.com/
View the repo here: https://github.com/oTranscribe
Hyperaudio

Hyperaudio seems to be working on an exciting suite of open interactive transcript tools that allow people to navigate, search, and edit transcripts!
In particular, I want to highlight the following tools:
Hyperaudio Lite Editor: A lightweight transcript editor for editing and correcting STT-generated timed transcripts

  
Repo: https://github.com/hyperaudio/hyperaudio-lite-editor

Hyperaudio Lite: a Super-lightweight Interactive Transcript Player


Repo: https://github.com/hyperaudio/hyperaudio-lite

Hyperaudio Converter: converts from JSON/SRT to HTML Based Interactive Transcript


Site: https://hyperaud.io/converter/converter.html
Repo: https://github.com/hyperaudio/ha-converter
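To give a feel for what an SRT-to-interactive-transcript conversion involves, here is a small Python sketch. It is not Hyperaudio's code or markup: the `data-start` and `data-end` attribute names are placeholders I chose, and the parser only handles plain, well-formed SRT cues.

```python
import html
import re

# Matches the "00:00:01,500 --> 00:00:03,000" timing line of an SRT cue.
CUE_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2})[,.](\d{3})"
)


def _to_ms(h, m, s, ms):
    """Convert timestamp parts (as strings) to integer milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(srt_text):
    """Yield (start_ms, end_ms, text) for each cue in an SRT string."""
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = CUE_RE.search(line)
            if match:
                parts = match.groups()
                text = " ".join(part.strip() for part in lines[i + 1:])
                yield _to_ms(*parts[:4]), _to_ms(*parts[4:]), text
                break


def srt_to_html(srt_text):
    """Wrap each cue in a span carrying its timing as data attributes."""
    spans = [
        f'<span data-start="{start}" data-end="{end}">{html.escape(text)}</span>'
        for start, end, text in parse_srt(srt_text)
    ]
    return "<p>" + " ".join(spans) + "</p>"
```

A small script on the page could then read those attributes to seek the media element when a span is clicked.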

Hyperaudio Website for now: https://lab.hyperaud.io/
Official Website: https://hyper.audio/

All-rounders

Kdenlive

The open-source video editor introduced a speech-to-text module in version 21.04 using VOSK, an offline speech-recognition API. That said, the feature is still pretty new and kind of buggy. It also involves having to download Python and knowing how to use Kdenlive. I like the idea of using VOSK's API, but I think having a simple, dedicated application that works out of the box for automated transcriptions would be best, especially for people who aren't tech-savvy.

View their source code here: https://invent.kde.org/multimedia/kdenlive/-/tree/master/data/scripts
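For anyone curious what offline transcription with VOSK's Python bindings might look like in a dedicated tool, here is a minimal sketch. It assumes the `vosk` package is installed, that the input is a mono 16-bit PCM WAV file, and that `model_dir` points to a model folder the user has downloaded; the function names are my own, and this is not how Kdenlive's scripts are written.

```python
import json
import wave


def words_from_result(result_json):
    """Extract word timings from one recognizer result (a JSON string).

    Assumes results look like {"result": [{"word", "start", "end", ...}],
    "text": "..."} when word timings are enabled; chunks without speech
    have no "result" key, so an empty list is returned for them.
    """
    return json.loads(result_json).get("result", [])


def transcribe_wav(wav_path, model_dir):
    """Transcribe a WAV file offline using a user-supplied VOSK model."""
    # Imported here so the helper above can be used without vosk installed.
    from vosk import KaldiRecognizer, Model

    words = []
    with wave.open(wav_path, "rb") as wav:
        recognizer = KaldiRecognizer(Model(model_dir), wav.getframerate())
        recognizer.SetWords(True)  # ask for per-word start/end times
        while True:
            data = wav.readframes(4000)
            if not data:
                break
            if recognizer.AcceptWaveform(data):
                words.extend(words_from_result(recognizer.Result()))
        words.extend(words_from_result(recognizer.FinalResult()))
    return words
```

Because the model is just a folder path, swapping in a different language would be a matter of pointing `model_dir` somewhere else, which fits the idea of letting users provide their own models.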
Video Transcriber

Video Transcriber is a computer-assisted video/audio transcription tool which, from what I can gather, seems to be close to what I have in mind. It's a prototype made with journalists and media professionals in mind.
Unfortunately, the demo link I found seems to be broken, so I haven't been able to test this one out. Testing this project otherwise would involve installing dependencies and creating an IBM Bluemix account (which has monthly limits). The implementation I had in mind would be easy for non-technical users to use out of the box.

View the repo: https://github.com/glitchdigital/video-transcriber

Complexity and required time

I'm not the most knowledgeable on these frameworks, so please let me know if I should tick other options for the complexity. That said, I'm open to helping with the design of the user interface.
Complexity


 Beginner - This project requires no or little prior knowledge of the technolog(y|ies) specified to contribute to the project
 Intermediate - The user should have some prior knowledge of the technolog(y|ies) to the point where they know how to use it, but not necessarily all the nooks and crannies of the technology
 Advanced - The project requires the user to have a good understanding of all components of the project to contribute

Required time (ETA)


 Little work - A couple of days
 Medium work - A week or two
 Much work - The project will take more than a couple of weeks and serious planning is required

Categories


 Mobile app
 IoT
 Web app
 Frontend/UI
 AI/ML
 APIs/Backend
 Voice Assistant
 Developer Tooling
 Extension/Plugin/Add-On
 Design/UX
 AR/VR
 Bots
 Security
 Blockchain
 Futuristic Tech/Something Unique

My own programming (?) skills are limited to HTML, basic CSS, and the tiniest bit of JavaScript. As such, I'm sharing my findings and proposed idea here in the hopes that more competent coders can bring this to life.