Automated Audio Transcription (AAT) / Automated Music Transcription (AMT) (aka: converting audio to midi)

Some notes on Automated Audio Transcription (AAT) / Automated Music Transcription (AMT) (aka: converting audio to midi)

Table of Contents

Tools

AI Midi

NeuralNote

  • https://github.com/DamRsn/NeuralNote
    • Audio Plugin for Audio to MIDI transcription using deep learning.

    • NeuralNote

    • NeuralNote is the audio plugin that brings state-of-the-art Audio to MIDI conversion into your favorite Digital Audio Workstation.

      • Works with any tonal instrument (voice included)
      • Supports polyphonic transcription
      • Supports pitch bend detection
      • Lightweight and very fast transcription
      • Allows adjusting the parameters while listening to the transcription
      • Allows scaling and time-quantizing transcribed MIDI directly in the plugin
    • Internally, NeuralNote uses the model from Spotify's basic-pitch. See their blogpost and paper for more information. In NeuralNote, basic-pitch is run using RTNeural for the CNN part and ONNXRuntime for the feature part (Constant-Q transform calculation + Harmonic Stacking). As part of this project, we contributed to RTNeural to add 2D convolution support. (A rough sketch of the CQT + harmonic-stacking idea follows below.)

    • https://www.youtube.com/watch?v=6_MC0_aG_DQ
      • NeuralNote - Plugin Presentation
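
To make the "feature part" mentioned above (Constant-Q transform + harmonic stacking) a little more concrete, here's a minimal Python sketch of the general idea. This is an illustration only, not NeuralNote's or basic-pitch's actual code; the hop length, bin counts, and harmonic set are assumed example values.

```python
# Illustrative CQT + harmonic stacking sketch (example parameters, not NeuralNote's).
import librosa
import numpy as np

def harmonic_stack(audio_path, harmonics=(0.5, 1, 2, 3, 4, 5),
                   bins_per_octave=36, n_bins=252, hop_length=256):
    # Load audio and compute a Constant-Q transform: a log-frequency
    # spectrogram of shape (n_bins, n_frames).
    y, sr = librosa.load(audio_path, sr=22050, mono=True)
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length,
                             bins_per_octave=bins_per_octave, n_bins=n_bins))

    # Harmonic stacking: shift the CQT along the frequency axis by the number
    # of bins corresponding to each harmonic, then stack the shifted copies as
    # channels, so energy at f, 2f, 3f, ... lines up across the channel axis.
    stacked = []
    for h in harmonics:
        shift = int(round(bins_per_octave * np.log2(h)))
        shifted = np.roll(cqt, -shift, axis=0)
        if shift > 0:
            shifted[-shift:, :] = 0   # zero bins that wrapped around the top
        elif shift < 0:
            shifted[:-shift, :] = 0   # zero bins that wrapped around the bottom
        stacked.append(shifted)
    return np.stack(stacked, axis=0)  # (n_harmonics, n_bins, n_frames)
```

A stacked representation like this is what a small 2D CNN (the part NeuralNote runs through RTNeural) can then consume to predict onsets, pitches, and note activations.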

MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage (2024)

  • https://arxiv.org/abs/2403.10024
    • MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage
      Hao Hao Tan, Kin Wai Cheuk, Taemin Cho, Wei-Hsiang Liao, Yuki Mitsufuji
      This paper presents enhancements to the MT3 model, a state-of-the-art (SOTA) token-based multi-instrument automatic music transcription (AMT) model. Despite SOTA performance, MT3 has the issue of instrument leakage, where transcriptions are fragmented across different instruments. To mitigate this, we propose MR-MT3, with enhancements including a memory retention mechanism, prior token sampling, and token shuffling. These methods are evaluated on the Slakh2100 dataset, demonstrating improved onset F1 scores and reduced instrument leakage. In addition to the conventional multi-instrument transcription F1 score, new metrics such as the instrument leakage ratio and the instrument detection F1 score are introduced for a more comprehensive assessment of transcription quality. The study also explores the issue of domain overfitting by evaluating MT3 on single-instrument monophonic datasets such as ComMU and NSynth. The findings, along with the source code, are shared to facilitate future work aimed at refining token-based multi-instrument AMT models.

  • https://github.com/gudgud96/MR-MT3
    • MR-MT3 Code accompanying paper: MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage.

MT3: Multi-Task Multitrack Music Transcription (2021)

  • https://arxiv.org/abs/2111.03017
    • MT3: Multi-Task Multitrack Music Transcription
      Josh Gardner, Ian Simon, Ethan Manilow, Curtis Hawthorne, Jesse Engel
      Automatic Music Transcription (AMT), inferring musical notes from raw audio, is a challenging task at the core of music understanding. Unlike Automatic Speech Recognition (ASR), which typically focuses on the words of a single speaker, AMT often requires transcribing multiple instruments simultaneously, all while preserving fine-scale pitch and timing information. Further, many AMT datasets are "low-resource", as even expert musicians find music transcription difficult and time-consuming. Thus, prior work has focused on task-specific architectures, tailored to the individual instruments of each task. In this work, motivated by the promising results of sequence-to-sequence transfer learning for low-resource Natural Language Processing (NLP), we demonstrate that a general-purpose Transformer model can perform multi-task AMT, jointly transcribing arbitrary combinations of musical instruments across several transcription datasets. We show this unified training framework achieves high-quality transcription results across a range of datasets, dramatically improving performance for low-resource instruments (such as guitar), while preserving strong performance for abundant instruments (such as piano). Finally, by expanding the scope of AMT, we expose the need for more consistent evaluation metrics and better dataset alignment, and provide a strong baseline for this new direction of multi-task AMT.

  • https://github.com/magenta/mt3
    • MT3: Multi-Task Multitrack Music Transcription MT3 is a multi-instrument automatic music transcription model that uses the T5X framework. (A simplified sketch of MIDI-like event tokenization follows at the end of this section.)

      • https://github.com/google-research/t5x
        • T5X T5X is a modular, composable, research-friendly framework for high-performance, configurable, self-service training, evaluation, and inference of sequence models (starting with language) at many scales.

        • https://arxiv.org/abs/2203.17189
          • Scaling Up Models and Data with t5x and seqio
            Recent neural network-based language models have benefited greatly from scaling up the size of training datasets and the number of parameters in the models themselves. Scaling can be complicated due to various factors including the need to distribute computation on supercomputer clusters (e.g., TPUs), prevent bottlenecks when infeeding data, and ensure reproducible results. In this work, we present two software libraries that ease these issues: t5x simplifies the process of building and training large language models at scale while maintaining ease of use, and seqio provides a task-based API for simple creation of fast and reproducible training data and evaluation pipelines. These open-source libraries have been used to train models with hundreds of billions of parameters on datasets with multiple terabytes of training data.

    • https://colab.research.google.com/github/magenta/mt3/blob/main/mt3/colab/music_transcription_with_transformers.ipynb
      • Music Transcription with Transformers This notebook is an interactive demo of a few music transcription models created by Google's Magenta team. You can upload audio and have one of our models automatically transcribe it.

      • The notebook supports two pre-trained models:

        • the piano transcription model from our ISMIR 2021 paper
        • the multi-instrument transcription model from our ICLR 2022 paper

        Caveat: neither model is trained on singing. If you upload audio with vocals, you will likely get weird results. Multi-instrument transcription is still not a completely-solved problem and so you may get weird results regardless.

  • https://github.com/kunato/mt3-pytorch
    • MT3: Multi-Task Multitrack Music Transcription - Pytorch This is an unofficial implementation of MT3: Multi-Task Multitrack Music Transcription in PyTorch.
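
To make the "token-based" framing of MT3 / MR-MT3 above more concrete, here's a simplified sketch of how note events might be serialized into a flat token sequence for a seq2seq Transformer. The token scheme below (time / program / on-off / pitch tokens) is an assumed illustration of the general idea, not MT3's actual vocabulary or code.

```python
# Simplified MIDI-like event tokenization sketch (made-up vocabulary, not MT3's).
from dataclasses import dataclass

@dataclass
class NoteEvent:
    time: float      # event time in seconds (onset or offset)
    program: int     # General MIDI program number (instrument)
    pitch: int       # MIDI pitch 0-127
    is_onset: bool   # True = note-on, False = note-off

def tokenize(events, time_step=0.01):
    """Serialize note events into string tokens, ordered by time."""
    tokens = []
    last_time_idx = None
    for ev in sorted(events, key=lambda e: (e.time, not e.is_onset, e.program, e.pitch)):
        time_idx = int(round(ev.time / time_step))
        if time_idx != last_time_idx:
            tokens.append(f"time_{time_idx}")        # quantized time within the segment
            last_time_idx = time_idx
        tokens.append(f"program_{ev.program}")       # which instrument the note belongs to
        tokens.append("on" if ev.is_onset else "off")
        tokens.append(f"pitch_{ev.pitch}")
    tokens.append("eos")
    return tokens

# Example: a piano C4 and a bass E2 starting together, with the piano note ending at 0.5s.
events = [
    NoteEvent(0.00, program=0,  pitch=60, is_onset=True),
    NoteEvent(0.00, program=32, pitch=40, is_onset=True),
    NoteEvent(0.50, program=0,  pitch=60, is_onset=False),
]
print(tokenize(events))
# ['time_0', 'program_0', 'on', 'pitch_60', 'program_32', 'on', 'pitch_40',
#  'time_50', 'program_0', 'off', 'pitch_60', 'eos']
```

In this kind of representation, the "instrument leakage" that MR-MT3 targets shows up as notes from a single source being emitted under several different program_* tokens, fragmenting one instrument's transcription across multiple tracks.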

Spotify Basic Pitch (2022)

  • https://basicpitch.spotify.com/
    • Basic Pitch

    • https://engineering.atspotify.com/2022/06/meet-basic-pitch/
      • Meet Basic Pitch: Spotify’s Open Source Audio-to-MIDI Converter

      • Basic Pitch uses machine learning to transcribe the musical notes in a recording. Drop a recording of almost any instrument, including your voice, then get back a MIDI version, just like that. Unlike similar ML models, Basic Pitch is not only versatile and accurate, but also fast and computationally lightweight. It was built for artists and producers who want an easy way to turn their recorded ideas into MIDI, a standard for representing notes used in digital music production.

  • https://github.com/spotify/basic-pitch
    • A lightweight yet powerful audio-to-MIDI converter with pitch bend detection

    • Basic Pitch is a Python library for Automatic Music Transcription (AMT), using a lightweight neural network developed by Spotify's Audio Intelligence Lab.

    • Basic Pitch may be simple, but it's far from "basic"! basic-pitch is efficient and easy to use, and its multipitch support, its ability to generalize across instruments, and its note accuracy compete with much larger and more resource-hungry AMT systems. Provide a compatible audio file and basic-pitch will generate a MIDI file, complete with pitch bends. Basic Pitch is instrument-agnostic and supports polyphonic instruments, so you can freely enjoy transcription of all your favorite music, no matter what instrument is used. Basic Pitch works best on one instrument at a time. (A minimal Python usage sketch follows after the research paper link below.)

    • https://github.com/spotify/basic-pitch#research-paper
      • Research Paper This library was released in conjunction with Spotify's publication at ICASSP 2022. You can read more about this research in the paper, A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation.

        • https://arxiv.org/abs/2203.09893
          • A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation
            Rachel M. Bittner, Juan José Bosch, David Rubinstein, Gabriel Meseguer-Brocal, Sebastian Ewert
            Automatic Music Transcription (AMT) has been recognized as a key enabling technology with a wide range of applications. Given the task's complexity, best results have typically been reported for systems focusing on specific settings, e.g. instrument-specific systems tend to yield improved results over instrument-agnostic methods. Similarly, higher accuracy can be obtained when only estimating frame-wise f0 values and neglecting the harder note event detection. Despite their high accuracy, such specialized systems often cannot be deployed in the real-world. Storage and network constraints prohibit the use of multiple specialized models, while memory and run-time constraints limit their complexity. In this paper, we propose a lightweight neural network for musical instrument transcription, which supports polyphonic outputs and generalizes to a wide variety of instruments (including vocals). Our model is trained to jointly predict frame-wise onsets, multipitch and note activations, and we experimentally show that this multi-output structure improves the resulting frame-level note accuracy. Despite its simplicity, benchmark results show our system's note estimation to be substantially better than a comparable baseline, and its frame-level accuracy to be only marginally below those of specialized state-of-the-art AMT systems. With this work we hope to encourage the community to further investigate low-resource, instrument-agnostic AMT systems.
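
For reference, here's a minimal usage sketch of the Python library, based on the entry points documented in the basic-pitch README; the file paths are placeholders, and the exact return types and note-event tuple layout should be double-checked against the version you install.

```python
# Minimal basic-pitch usage sketch (paths are placeholders).
# Rough CLI equivalent: basic-pitch <output-directory> <input-audio-path>
from basic_pitch.inference import predict

# predict() returns the raw model posteriors, a pretty_midi.PrettyMIDI object,
# and a list of note events.
model_output, midi_data, note_events = predict("vocal-take.wav")

# Write the transcription out as a standard MIDI file.
midi_data.write("vocal-take.mid")

# Inspect the first few note events (assumed layout: start, end, pitch,
# amplitude, pitch-bend data; check your installed version).
for event in note_events[:5]:
    print(event)
```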

  • https://github.com/spotify/basic-pitch-ts
    • Basic Pitch is a TypeScript and Python library for Automatic Music Transcription (AMT), using a lightweight neural network developed by Spotify's Audio Intelligence Lab. It's small, easy to use, and npm install-able.

  • https://github.com/gudgud96/basic-pitch-torch
    • basic-pitch-torch PyTorch version of Spotify's Basic Pitch, a lightweight audio-to-MIDI converter.

Ableton Live

  • https://www.ableton.com/en/manual/converting-audio-to-midi/
    • Ableton Live

      12. Converting Audio to MIDI
    • 12.1 Slice to New MIDI Track This command divides the audio into chunks which are assigned to single MIDI notes. Slicing differs from the Convert commands below, in that it doesn’t analyze the musical context of your original audio. Instead, it simply splits the original audio into portions of time, regardless of the content.

    • 12.2 Convert Harmony to New MIDI Track This command identifies the pitches in a polyphonic audio recording and places them into a clip on a new MIDI track.

    • 12.3 Convert Melody to New MIDI Track This command identifies the pitches in monophonic audio and places them into a clip on a new MIDI track.

    • 12.4 Convert Drums to New MIDI Track This command extracts the rhythms from unpitched, percussive audio and places them into a clip on a new MIDI track. The command also attempts to identify kick, snare and hihat sounds and places them into the new clip so that they play the appropriate sounds in the preloaded Drum Rack.

    • 12.5 Optimizing for Better Conversion Quality

      • Live uses the transient markers (see ‘Transients and Pseudo Warp Markers’) in the original audio clip to determine the divisions between notes in the converted MIDI clip. This means that you can “tune” the results of the conversion by adding, moving, or deleting transient markers in the audio clip before running any of the Convert commands.

Unsorted

  • https://www.reddit.com/r/edmproduction/comments/15etj94/audio_to_midi_which_one_is_the_best/
    • Audio to MIDI: which one is the best? I'm trying to transcribe vocals to MIDI, so what's the best option out there? In terms of price I don't mind it, I can pay! I'm currently using Melodyne and looking for more accurate software.

    • I’m not really sure there’s anything more accurate than Melodyne; it’s basically the industry standard.

    • I mean there’s Synchro Arts RePitch but Melodyne should be getting the curvature of the melody just fine.

    • I use WavesTune because they often have deals on, so you can get it for cheaper than Melodyne, but the UI is much worse.

    • Melodyne is amazing because it can use AI to separate each track within a song and give you MIDI adjustment or cut-and-paste ability for each element. But most DAWs already have audio-to-MIDI built in; it's in Ableton and Cubase, I know, etc. And if you have a clean track with just one instrument or voice it works pretty well. You could even pre-split a song into stems using LALALA or some other AI beforehand so you have a cleaner track to extract MIDI from, which is essentially what Melodyne is doing in one step; you just have to do it in multiple steps.

    • I use this. It’s a bit hit or miss, but it’s a nice free option. https://basicpitch.spotify.com/

      • Seconded. It’s infinitely better than Melodyne about 90% of the time. Even Ableton is better than Melodyne, most of the time, in my opinion. Melodyne is good 20% of the time and total dogshit 80% of the time. This is for polyphonic music, not vocal, so vocal might be totally different.

    • etc

See Also

My Other Related Deepdive Gists and Projects
