Automated Audio Transcription (AAT) / Automated Music Transcription (AMT) (aka: converting audio to midi)

Some notes on Automated Audio Transcription (AAT) / Automated Music Transcription (AMT) (aka: converting audio to midi)

Table of Contents

Tools

AI Midi

NeuralNote

  • https://github.com/DamRsn/NeuralNote
    • Audio Plugin for Audio to MIDI transcription using deep learning.

    • NeuralNote

    • NeuralNote is the audio plugin that brings state-of-the-art Audio to MIDI conversion into your favorite Digital Audio Workstation.

      • Works with any tonal instrument (voice included)
      • Supports polyphonic transcription
      • Supports pitch bend detection
      • Lightweight and very fast transcription
      • Allows adjusting the parameters while listening to the transcription
      • Allows scaling and time-quantizing transcribed MIDI directly in the plugin
    • Internally, NeuralNote uses the model from Spotify's basic-pitch. See their blogpost and paper for more information. In NeuralNote, basic-pitch is run using RTNeural for the CNN part and ONNXRuntime for the feature part (Constant-Q transform calculation + Harmonic Stacking). As part of this project, we contributed to RTNeural to add 2D convolution support. (A rough sketch of the CQT + harmonic-stacking idea follows below.)

    • https://www.youtube.com/watch?v=6_MC0_aG_DQ
      • NeuralNote - Plugin Presentation
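
To make the "feature part" mentioned above (Constant-Q transform + harmonic stacking) a little more concrete, here's a minimal Python sketch of the general idea. This is an illustration only, not NeuralNote's or basic-pitch's actual code; the hop length, bin counts, and harmonic set are assumed example values.

```python
# Illustrative CQT + harmonic stacking sketch (example parameters, not NeuralNote's).
import librosa
import numpy as np

def harmonic_stack(audio_path, harmonics=(0.5, 1, 2, 3, 4, 5),
                   bins_per_octave=36, n_bins=252, hop_length=256):
    # Load audio and compute a Constant-Q transform: a log-frequency
    # spectrogram of shape (n_bins, n_frames).
    y, sr = librosa.load(audio_path, sr=22050, mono=True)
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length,
                             bins_per_octave=bins_per_octave, n_bins=n_bins))

    # Harmonic stacking: shift the CQT along the frequency axis by the number
    # of bins corresponding to each harmonic, then stack the shifted copies as
    # channels, so energy at f, 2f, 3f, ... lines up across the channel axis.
    stacked = []
    for h in harmonics:
        shift = int(round(bins_per_octave * np.log2(h)))
        shifted = np.roll(cqt, -shift, axis=0)
        if shift > 0:
            shifted[-shift:, :] = 0   # zero bins that wrapped around the top
        elif shift < 0:
            shifted[:-shift, :] = 0   # zero bins that wrapped around the bottom
        stacked.append(shifted)
    return np.stack(stacked, axis=0)  # (n_harmonics, n_bins, n_frames)
```

A stacked representation like this is what a small 2D CNN (the part NeuralNote runs through RTNeural) can then consume to predict onsets, pitches, and note activations.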

MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage (2024)

  • https://arxiv.org/abs/2403.10024
    • MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage
      Hao Hao Tan, Kin Wai Cheuk, Taemin Cho, Wei-Hsiang Liao, Yuki Mitsufuji
      This paper presents enhancements to the MT3 model, a state-of-the-art (SOTA) token-based multi-instrument automatic music transcription (AMT) model. Despite SOTA performance, MT3 has the issue of instrument leakage, where transcriptions are fragmented across different instruments. To mitigate this, we propose MR-MT3, with enhancements including a memory retention mechanism, prior token sampling, and token shuffling. These methods are evaluated on the Slakh2100 dataset, demonstrating improved onset F1 scores and reduced instrument leakage. In addition to the conventional multi-instrument transcription F1 score, new metrics such as the instrument leakage ratio and the instrument detection F1 score are introduced for a more comprehensive assessment of transcription quality. The study also explores the issue of domain overfitting by evaluating MT3 on single-instrument monophonic datasets such as ComMU and NSynth. The findings, along with the source code, are shared to facilitate future work aimed at refining token-based multi-instrument AMT models.

  • https://github.com/gudgud96/MR-MT3
    • MR-MT3 Code accompanying paper: MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage.

MT3: Multi-Task Multitrack Music Transcription (2021)

  • https://arxiv.org/abs/2111.03017
    • MT3: Multi-Task Multitrack Music Transcription
      Josh Gardner, Ian Simon, Ethan Manilow, Curtis Hawthorne, Jesse Engel
      Automatic Music Transcription (AMT), inferring musical notes from raw audio, is a challenging task at the core of music understanding. Unlike Automatic Speech Recognition (ASR), which typically focuses on the words of a single speaker, AMT often requires transcribing multiple instruments simultaneously, all while preserving fine-scale pitch and timing information. Further, many AMT datasets are "low-resource", as even expert musicians find music transcription difficult and time-consuming. Thus, prior work has focused on task-specific architectures, tailored to the individual instruments of each task. In this work, motivated by the promising results of sequence-to-sequence transfer learning for low-resource Natural Language Processing (NLP), we demonstrate that a general-purpose Transformer model can perform multi-task AMT, jointly transcribing arbitrary combinations of musical instruments across several transcription datasets. We show this unified training framework achieves high-quality transcription results across a range of datasets, dramatically improving performance for low-resource instruments (such as guitar), while preserving strong performance for abundant instruments (such as piano). Finally, by expanding the scope of AMT, we expose the need for more consistent evaluation metrics and better dataset alignment, and provide a strong baseline for this new direction of multi-task AMT.

  • https://github.com/magenta/mt3
    • MT3: Multi-Task Multitrack Music Transcription MT3 is a multi-instrument automatic music transcription model that uses the T5X framework. (A simplified sketch of MIDI-like event tokenization follows at the end of this section.)

      • https://github.com/google-research/t5x
        • T5X T5X is a modular, composable, research-friendly framework for high-performance, configurable, self-service training, evaluation, and inference of sequence models (starting with language) at many scales.

        • https://arxiv.org/abs/2203.17189
          • Scaling Up Models and Data with t5x and seqio
            Recent neural network-based language models have benefited greatly from scaling up the size of training datasets and the number of parameters in the models themselves. Scaling can be complicated due to various factors including the need to distribute computation on supercomputer clusters (e.g., TPUs), prevent bottlenecks when infeeding data, and ensure reproducible results. In this work, we present two software libraries that ease these issues: t5x simplifies the process of building and training large language models at scale while maintaining ease of use, and seqio provides a task-based API for simple creation of fast and reproducible training data and evaluation pipelines. These open-source libraries have been used to train models with hundreds of billions of parameters on datasets with multiple terabytes of training data.

    • https://colab.research.google.com/github/magenta/mt3/blob/main/mt3/colab/music_transcription_with_transformers.ipynb
      • Music Transcription with Transformers This notebook is an interactive demo of a few music transcription models created by Google's Magenta team. You can upload audio and have one of our models automatically transcribe it.

      • The notebook supports two pre-trained models:

        • the piano transcription model from our ISMIR 2021 paper
        • the multi-instrument transcription model from our ICLR 2022 paper

        Caveat: neither model is trained on singing. If you upload audio with vocals, you will likely get weird results. Multi-instrument transcription is still not a completely-solved problem and so you may get weird results regardless.

  • https://github.com/kunato/mt3-pytorch
    • MT3: Multi-Task Multitrack Music Transcription - Pytorch This is an unofficial implementation of MT3: Multi-Task Multitrack Music Transcription in PyTorch.
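
To make the "token-based" framing of MT3 / MR-MT3 above more concrete, here's a simplified sketch of how note events might be serialized into a flat token sequence for a seq2seq Transformer. The token scheme below (time / program / on-off / pitch tokens) is an assumed illustration of the general idea, not MT3's actual vocabulary or code.

```python
# Simplified MIDI-like event tokenization sketch (made-up vocabulary, not MT3's).
from dataclasses import dataclass

@dataclass
class NoteEvent:
    time: float      # event time in seconds (onset or offset)
    program: int     # General MIDI program number (instrument)
    pitch: int       # MIDI pitch 0-127
    is_onset: bool   # True = note-on, False = note-off

def tokenize(events, time_step=0.01):
    """Serialize note events into string tokens, ordered by time."""
    tokens = []
    last_time_idx = None
    for ev in sorted(events, key=lambda e: (e.time, not e.is_onset, e.program, e.pitch)):
        time_idx = int(round(ev.time / time_step))
        if time_idx != last_time_idx:
            tokens.append(f"time_{time_idx}")        # quantized time within the segment
            last_time_idx = time_idx
        tokens.append(f"program_{ev.program}")       # which instrument the note belongs to
        tokens.append("on" if ev.is_onset else "off")
        tokens.append(f"pitch_{ev.pitch}")
    tokens.append("eos")
    return tokens

# Example: a piano C4 and a bass E2 starting together, with the piano note ending at 0.5s.
events = [
    NoteEvent(0.00, program=0,  pitch=60, is_onset=True),
    NoteEvent(0.00, program=32, pitch=40, is_onset=True),
    NoteEvent(0.50, program=0,  pitch=60, is_onset=False),
]
print(tokenize(events))
# ['time_0', 'program_0', 'on', 'pitch_60', 'program_32', 'on', 'pitch_40',
#  'time_50', 'program_0', 'off', 'pitch_60', 'eos']
```

In this kind of representation, the "instrument leakage" that MR-MT3 targets shows up as notes from a single source being emitted under several different program_* tokens, fragmenting one instrument's transcription across multiple tracks.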

Spotify Basic Pitch (2022)

  • https://basicpitch.spotify.com/
    • Basic Pitch

    • https://engineering.atspotify.com/2022/06/meet-basic-pitch/
      • Meet Basic Pitch: Spotify’s Open Source Audio-to-MIDI Converter

      • Basic Pitch uses machine learning to transcribe the musical notes in a recording. Drop a recording of almost any instrument, including your voice, then get back a MIDI version, just like that. Unlike similar ML models, Basic Pitch is not only versatile and accurate, but also fast and computationally lightweight. It was built for artists and producers who want an easy way to turn their recorded ideas into MIDI, a standard for representing notes used in digital music production.

  • https://github.com/spotify/basic-pitch
    • A lightweight yet powerful audio-to-MIDI converter with pitch bend detection

    • Basic Pitch is a Python library for Automatic Music Transcription (AMT), using a lightweight neural network developed by Spotify's Audio Intelligence Lab.

    • Basic Pitch may be simple, but it's far from "basic"! basic-pitch is efficient and easy to use, and its multipitch support, its ability to generalize across instruments, and its note accuracy compete with much larger and more resource-hungry AMT systems. Provide a compatible audio file and basic-pitch will generate a MIDI file, complete with pitch bends. Basic Pitch is instrument-agnostic and supports polyphonic instruments, so you can freely enjoy transcription of all your favorite music, no matter what instrument is used. Basic Pitch works best on one instrument at a time. (A minimal Python usage sketch follows after the research paper link below.)

    • https://github.com/spotify/basic-pitch#research-paper
      • Research Paper This library was released in conjunction with Spotify's publication at ICASSP 2022. You can read more about this research in the paper, A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation.

        • https://arxiv.org/abs/2203.09893
          • A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation
            Rachel M. Bittner, Juan José Bosch, David Rubinstein, Gabriel Meseguer-Brocal, Sebastian Ewert
            Automatic Music Transcription (AMT) has been recognized as a key enabling technology with a wide range of applications. Given the task's complexity, best results have typically been reported for systems focusing on specific settings, e.g. instrument-specific systems tend to yield improved results over instrument-agnostic methods. Similarly, higher accuracy can be obtained when only estimating frame-wise f0 values and neglecting the harder note event detection. Despite their high accuracy, such specialized systems often cannot be deployed in the real-world. Storage and network constraints prohibit the use of multiple specialized models, while memory and run-time constraints limit their complexity. In this paper, we propose a lightweight neural network for musical instrument transcription, which supports polyphonic outputs and generalizes to a wide variety of instruments (including vocals). Our model is trained to jointly predict frame-wise onsets, multipitch and note activations, and we experimentally show that this multi-output structure improves the resulting frame-level note accuracy. Despite its simplicity, benchmark results show our system's note estimation to be substantially better than a comparable baseline, and its frame-level accuracy to be only marginally below those of specialized state-of-the-art AMT systems. With this work we hope to encourage the community to further investigate low-resource, instrument-agnostic AMT systems.
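
For reference, here's a minimal usage sketch of the Python library, based on the entry points documented in the basic-pitch README; the file paths are placeholders, and the exact return types and note-event tuple layout should be double-checked against the version you install.

```python
# Minimal basic-pitch usage sketch (paths are placeholders).
# Rough CLI equivalent: basic-pitch <output-directory> <input-audio-path>
from basic_pitch.inference import predict

# predict() returns the raw model posteriors, a pretty_midi.PrettyMIDI object,
# and a list of note events.
model_output, midi_data, note_events = predict("vocal-take.wav")

# Write the transcription out as a standard MIDI file.
midi_data.write("vocal-take.mid")

# Inspect the first few note events (assumed layout: start, end, pitch,
# amplitude, pitch-bend data; check your installed version).
for event in note_events[:5]:
    print(event)
```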

  • https://github.com/spotify/basic-pitch-ts
    • Basic Pitch is a TypeScript and Python library for Automatic Music Transcription (AMT), using a lightweight neural network developed by Spotify's Audio Intelligence Lab. It's small, easy to use, and npm install-able.

  • https://github.com/gudgud96/basic-pitch-torch
    • basic-pitch-torch PyTorch version of Spotify's Basic Pitch, a lightweight audio-to-MIDI converter.

Ableton Live

  • https://www.ableton.com/en/manual/converting-audio-to-midi/
    • Ableton Live

      12. Converting Audio to MIDI
    • 12.1 Slice to New MIDI Track This command divides the audio into chunks which are assigned to single MIDI notes. Slicing differs from the Convert commands below, in that it doesn’t analyze the musical context of your original audio. Instead, it simply splits the original audio into portions of time, regardless of the content.

    • 12.2 Convert Harmony to New MIDI Track This command identifies the pitches in a polyphonic audio recording and places them into a clip on a new MIDI track.

    • 12.3 Convert Melody to New MIDI Track This command identifies the pitches in monophonic audio and places them into a clip on a new MIDI track.

    • 12.4 Convert Drums to New MIDI Track This command extracts the rhythms from unpitched, percussive audio and places them into a clip on a new MIDI track. The command also attempts to identify kick, snare and hihat sounds and places them into the new clip so that they play the appropriate sounds in the preloaded Drum Rack.

    • 12.5 Optimizing for Better Conversion Quality

      • Live uses the transient markers (see ‘Transients and Pseudo Warp Markers’) in the original audio clip to determine the divisions between notes in the converted MIDI clip. This means that you can “tune” the results of the conversion by adding, moving, or deleting transient markers in the audio clip before running any of the Convert commands.

Unsorted

  • https://www.reddit.com/r/edmproduction/comments/15etj94/audio_to_midi_which_one_is_the_best/
    • Audio to MIDI: which one is the best? I'm trying to transcribe vocals to MIDI, so what's the best option out there? In terms of price I don't mind it, I can pay! I'm currently using Melodyne and looking for more accurate software.

    • I’m not really sure there’s anything more accurate than Melodyne; it’s basically the industry standard.

    • I mean there’s Synchro Arts RePitch but Melodyne should be getting the curvature of the melody just fine.

    • I use WavesTune because they often have deals on, so you can get it for cheaper than Melodyne, but the UI is much worse.

    • Melodyne is amazing because it can use AI to separate each track within a song and give you MIDI adjustment or cut-and-paste ability for each element. But most DAWs already have audio-to-MIDI built in; it's in Ableton and Cubase, I know, etc. And if you have a clean track with just one instrument or voice it works pretty well. You could even pre-split a song into stems using LALALA or some other AI beforehand so you have a cleaner track to extract MIDI from, which is essentially what Melodyne is doing in one step; you just have to do it in multiple steps.

    • I use this. It’s a bit hit or miss, but it’s a nice free option. https://basicpitch.spotify.com/

      • Seconded. It’s infinitely better than Melodyne about 90% of the time. Even Ableton is better than Melodyne, most of the time, in my opinion. Melodyne is good 20% of the time and total dogshit 80% of the time. This is for polyphonic music, not vocal, so vocal might be totally different.

    • etc

See Also

My Other Related Deepdive Gists and Projects
