Last active May 26, 2024 02:30
Music APIs and DBs

A collection of music APIs, databases, and related tools.

Table of Contents

Spotify for Developers

Audio Identification

    • Discover Music through Samples, Cover Songs and Remixes Dig deeper into music by discovering direct connections among over 1,025,000 songs and 316,000 artists, from Hip-Hop, Rap and R&B via Electronic / Dance through to Rock, Pop, Soul, Funk, Reggae, Jazz, Classical and beyond. WhoSampled's verified content is built by a community of over 33,000 contributors. Make contributions to earn Cred - our very own points system.

      • WhoSampled Alternatives: Music Recognition Apps like WhoSampled

    • is a blog that lets you discover the music behind the hits of today and yesterday.

      In the form of samples (a portion of one piece of music reused to create another) or covers, various musical styles are discussed on these pages. Fans of electronic music, hip-hop, RnB, pop, rock, variety, disco, funk, rhythm 'n' blues, new wave and more, take note!

    • Welcome to AcoustID! AcoustID is a project providing a complete audio identification service, based entirely on open source software.

      It consists of a client library for generating compact fingerprints from audio files, a large crowd-sourced database of audio fingerprints, many of which are linked to the MusicBrainz metadata database using their unique identifiers, and a web service that enables applications to quickly search the fingerprint database.

    • Acoustid: Audio identification services. Automatic music file tag correction. Music catalog reconciliation and cross-referencing. 100% open source.

    • At the core of AcoustID is an efficient algorithm for extracting audio fingerprints, called Chromaprint. The algorithm is optimized specifically for matching near-identical audio streams, which allows the audio fingerprints to be very compact and the extraction process to be fast. For example, it takes less than 100ms to process a two minute long audio file and the extracted audio fingerprint is just 2.5 KB of binary data.

      AcoustID contains a large crowd-sourced database of such audio fingerprints together with additional information about them, such as the song title, artist or links to the MusicBrainz database. You can send an audio fingerprint to the AcoustID service and it will search the database and return you information about the song. We use a custom database for indexing the audio fingerprints to make the search very fast.

      All of this is 100% open source and the database is available for download.

    • Pricing The AcoustID service is free to use in non-commercial applications. If you want to use the service in a commercial product, please subscribe to one of the plans below. All plans come with a free trial. You are not charged for the first 10k searches. If you don't need more than that, you can use the service for free!

    • Also, if you are a single developer and the plans are too expensive for you, feel free to get in touch, explain your situation and I'm sure we can figure something out.

      • Web Service The AcoustID web service currently supports only two operations, searching in the fingerprint database and submitting new fingerprints into the database.
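As a rough sketch of the search operation: a /v2/lookup request takes a registered client key, a Chromaprint fingerprint, and the track duration. The endpoint and parameter names below follow the public AcoustID web service documentation, but treat this as a starting point rather than a definitive client (the key and fingerprint are placeholders):

```python
# Minimal AcoustID /v2/lookup sketch. A real fingerprint string comes
# from Chromaprint/fpcalc; the API key is a placeholder.
import requests

LOOKUP_URL = "https://api.acoustid.org/v2/lookup"

def build_lookup_params(api_key: str, fingerprint: str, duration: float,
                        meta: str = "recordings") -> dict:
    """Assemble the query parameters for a fingerprint lookup."""
    return {
        "client": api_key,          # registered application key
        "fingerprint": fingerprint, # compact fingerprint string
        "duration": int(duration),  # track length in whole seconds
        "meta": meta,               # metadata to include in the response
    }

def lookup(api_key: str, fingerprint: str, duration: float) -> dict:
    resp = requests.get(LOOKUP_URL,
                        params=build_lookup_params(api_key, fingerprint, duration))
    resp.raise_for_status()
    return resp.json()  # {"status": "ok", "results": [...]} on success
```

Submitting new fingerprints is the separate /v2/submit operation, using the same client key.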

      • Database The AcoustID database includes user-submitted audio fingerprints, their mapping to MusicBrainz IDs and some supporting tables. It follows the structure of the PostgreSQL database used by the AcoustID server. Each table is exported in a separate file with the tab-separated text format used by the COPY command. At the moment there are no tools for importing the database dump; it has to be done manually.

      • Monthly database dumps can be downloaded here

      • Chromaprint Chromaprint is the core component of the AcoustID project. It's a client-side library that implements a custom algorithm for extracting fingerprints from any audio source. Overview of the fingerprint extraction process can be found in the blog post "How does Chromaprint work?".

          • How does Chromaprint work?

          • Since the algorithm is primarily based on the Computer Vision for Music Identification paper, images play an important role in it.

          • A more useful representation is the spectrogram, which shows how the intensity at specific frequencies changes over time.

          • You can get this kind of image by splitting the original audio into many overlapping frames and applying the Fourier transform on them ("Short-time Fourier transform"). In the case of Chromaprint, the input audio is converted to a sampling rate of 11025 Hz and the frame size is 4096 samples (0.371 s) with 2/3 overlap.
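The framing step described above can be sketched in a few lines of NumPy. Note the Hanning window here is an illustrative assumption; the post does not name the exact window Chromaprint uses:

```python
# Sketch of the short-time Fourier transform step: 11025 Hz input,
# 4096-sample frames (~0.371 s) with 2/3 overlap between frames.
import numpy as np

SAMPLE_RATE = 11025
FRAME_SIZE = 4096
HOP = FRAME_SIZE // 3  # advance by 1/3 of a frame -> 2/3 overlap

def spectrogram(samples: np.ndarray) -> np.ndarray:
    """Return one magnitude spectrum per overlapping frame."""
    window = np.hanning(FRAME_SIZE)  # illustrative choice of window
    n_frames = 1 + (len(samples) - FRAME_SIZE) // HOP
    frames = np.stack([
        samples[i * HOP : i * HOP + FRAME_SIZE] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))
```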

          • Many fingerprinting algorithms work with this kind of audio representation. Some are comparing differences across time and frequency, some are looking for peaks in the image, etc.

          • Chromaprint processes the information further by transforming frequencies into musical notes. We are only interested in notes, not octaves, so the result has 12 bins, one for each note. This information is called "chroma features". (I believe they were mentioned in the paper Audio Thumbnailing of Popular Music Using Chroma-Based Representations for the first time.)
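Folding frequencies onto 12 pitch classes can be illustrated like this (a simplified equal-temperament mapping for intuition, not Chromaprint's exact binning):

```python
import math

def chroma_bin(freq_hz: float, ref_hz: float = 440.0) -> int:
    """Map a frequency to one of 12 pitch classes (0 = the reference note A)."""
    semitones = 12 * math.log2(freq_hz / ref_hz)  # signed distance in semitones
    return round(semitones) % 12                  # discard the octave
```

Notes an octave apart land in the same bin, which is exactly the octave-invariance the chroma representation is after.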

          • Now we have a representation of the audio that is fairly robust to changes caused by lossy codecs and similar factors, and it isn't very hard to compare such images to check how "similar" they are. But if we want to search for them in a database, we need a more compact form. The idea for how to do this again comes from the Computer Vision for Music Identification paper, with some modifications based on the Pairwise Boosted Audio Fingerprint paper. You can imagine having a 16x12 pixel window and moving it over the image from left to right, one pixel at a time. This generates a lot of small subimages. To each of them we apply a pre-defined set of 16 filters that capture intensity differences across musical notes and time. Each filter calculates the sums of specific areas of the grayscale subimage and then compares the two sums. There are six possible ways to arrange the areas.

          • You can basically take any of the six filter images, place it anywhere on the subimage and also make it as large as you want (as long as it fits the 16x12 pixel subimage). Then you calculate the sum of the black and white areas and subtract them. The result is a single real number. Every filter has three coefficients associated with it, that say how to quantize the real number, so that the final result is an integer between 0 and 3. These filters and coefficients were selected by a machine learning algorithm on a training data set of audio files during the development of the library.

          • There are 16 filters and each can produce an integer that can be encoded into 2 bits (using the Gray code), so if you combine all the results, you get a 32-bit integer. If you do this for every subimage generated by the sliding window, you get the full audio fingerprint.
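The final packing step can be sketched as follows. The filter responses themselves are stand-ins here, but the 2-bit Gray coding and the 16×2-bit packing match the description above:

```python
# Pack 16 quantized filter responses (each an integer 0-3) into one
# 32-bit subfingerprint using the 2-bit Gray code.
GRAY2 = [0b00, 0b01, 0b11, 0b10]  # Gray code for the values 0..3

def pack_subfingerprint(filter_values: list) -> int:
    """Combine 16 two-bit Gray-coded values into a single 32-bit integer."""
    assert len(filter_values) == 16
    word = 0
    for value in filter_values:
        word = (word << 2) | GRAY2[value]
    return word
```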

      • You can use pyacoustid to interact with the library from Python. It provides a direct wrapper around the library, but also higher-level functions for generating fingerprints from audio files.

          • Python bindings for Chromaprint acoustic fingerprinting and the Acoustid Web service

          • Chromaprint and its associated Acoustid Web service make up a high-quality, open-source acoustic fingerprinting system. This package provides Python bindings for both the fingerprinting algorithm library, which is written in C but portable, and the Web service, which provides fingerprint lookups.

      • You can also use the fpcalc utility programmatically. It can produce JSON output, which should be easy to parse in any language. This is the recommended way to use Chromaprint if all you need is to generate fingerprints for AcoustID.
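A minimal sketch of that approach, assuming fpcalc is on the PATH (the -json flag ships with the Chromaprint distribution):

```python
# Generate a fingerprint by shelling out to fpcalc and parsing its JSON output.
import json
import subprocess

def parse_fpcalc_output(raw: str) -> tuple:
    """Extract (duration, fingerprint) from fpcalc -json output."""
    data = json.loads(raw)
    return data["duration"], data["fingerprint"]

def fingerprint_file(path: str) -> tuple:
    result = subprocess.run(
        ["fpcalc", "-json", path],
        capture_output=True, text=True, check=True,
    )
    return parse_fpcalc_output(result.stdout)
```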

        • AcoustID Index Acoustid Index is a "number search engine". It's similar to text search engines, but instead of searching in documents that consist of words, it searches in documents that consist of 32-bit integers.

          It's a simple inverted index data structure that doesn't do any kind of processing on the indexed documents. This is useful for searching in Chromaprint audio fingerprints, which are nothing more than 32-bit integer arrays.
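The core idea can be shown with a toy in-memory version (a deliberately simplified sketch, not the actual Acoustid Index implementation):

```python
# Toy inverted index over documents that are arrays of 32-bit integers.
from collections import defaultdict

class IntIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # integer term -> ids of docs containing it

    def add(self, doc_id, terms):
        for term in terms:
            self.postings[term].add(doc_id)

    def search(self, query_terms):
        """Rank documents by how many query terms they share with the query."""
        scores = defaultdict(int)
        for term in query_terms:
            for doc_id in self.postings.get(term, ()):
                scores[doc_id] += 1
        return sorted(scores.items(), key=lambda kv: -kv[1])
```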

        • Minimalistic search engine searching in audio fingerprints from Chromaprint

    • Between 2015 and 2022, AcousticBrainz helped to crowd source acoustic information from music recordings. This acoustic information describes the acoustic characteristics of music and includes low-level spectral information and information for genres, moods, keys, scales and much more.

      AcousticBrainz was a joint effort between Music Technology Group at Universitat Pompeu Fabra in Barcelona and the MusicBrainz project. At the heart of this project lies the Essentia toolkit from the MTG -- this open source toolkit enables the automatic analysis of music. The output from Essentia is collected by the AcousticBrainz project and made available to the public.

      In 2022, the decision was made to stop collecting data. For now, the website and its API will continue to be available.

      AcousticBrainz organizes the data on a recording basis, indexed by the MusicBrainz ID for recordings. If you know the MBID for a recording, you can easily fetch its data from AcousticBrainz. For details on how to do this, visit our API documentation.

      All of the data contained in AcousticBrainz is licensed under the CC0 license (public domain).
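A small sketch of such a fetch, using the documented /api/v1/&lt;mbid&gt;/low-level and /high-level endpoints (the MBID in any real call would be a MusicBrainz recording ID):

```python
# Fetch AcousticBrainz data for a MusicBrainz recording MBID.
import requests

API_ROOT = "https://acousticbrainz.org/api/v1"

def build_url(mbid: str, level: str = "low-level") -> str:
    """URL for a recording's acoustic data ('low-level' or 'high-level')."""
    return f"{API_ROOT}/{mbid}/{level}"

def fetch_acoustic_data(mbid: str, level: str = "low-level") -> dict:
    resp = requests.get(build_url(mbid, level))
    resp.raise_for_status()
    return resp.json()
```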

        • AcousticBrainz: Making a hard decision to end the project

        • We’ve written a blog post outlining some of our reasons for shutting down the project, the final steps that we’re taking, and a few ideas about our future plans for recommendations and other things in the MetaBrainz world.

          • AcousticBrainz: Making a hard decision to end the project We created AcousticBrainz 7 years ago and started to collect data with the goal of using that data down the road once we had collected enough. We finally got around to doing this recently, and realised that the data simply isn’t of high enough quality to be useful for much at all.

            We spent quite a bit of time trying to brainstorm on how to remedy this, but all of the solutions we found require a significant amount of money for both new developers and new hardware. We lack the resources to commit to properly rebooting AcousticBrainz, so we’ve taken the hard decision to end the project.

            Read on for an explanation of why we decided to do this, how we will do it, and what we’re planning to do in the future.

      • If you are interested in computing acoustic features on your own music, you can still download the command-line essentia extractor and run it yourself.

      • 2022-07-06: We provide downloadable archives of all submissions made to AcousticBrainz (29,460,584 submissions)

      • AcousticBrainz data

        • API Reference
        • Highlevel data and datasets
        • Sample data
      • Public datasets

      • The acousticbrainz plugin gets acoustic-analysis information from the AcousticBrainz project.

      • For all tracks with a MusicBrainz recording ID, the plugin currently sets these fields: average_loudness, bpm, chords_changes_rate, chords_key, chords_number_rate, chords_scale, danceable, gender, genre_rosamerica, initial_key, key_strength, mood_acoustic, mood_aggressive, mood_electronic, mood_happy, mood_party, mood_relaxed, mood_sad, moods_mirex, rhythm, timbre, tonal, voice_instrumental

      • MetaBrainz Derived Dumps On this page we describe several datasets with the term “canonical”. Since MusicBrainz aims to catalog all released music, the database contains a lot of different versions of releases or different versions of recordings. We find it important to collect all of these different versions, but in the end it is too much data for most of our users. Fortunately, it is easy to combine multiple pieces of well structured data into something that fits a user’s desired end-use.

        However, sometimes it can be challenging to work out which of the many releases/recordings is the one that “most people will think of the most representative version”. Even defining what this means is incredibly difficult, but we’ve attempted to do just that and we’re using the results of this work in our production systems on ListenBrainz to map incoming listens to MusicBrainz entries.

        When looking at the descriptions of our datasets, please consider that “canonical” implies the most representative version. Each of our canonical datasets has a more detailed description of what “canonical” means in that given dataset.

        • GSoC ’23: Artist similarity graph

        • Discovering new pieces to add to your personal collection and play on repeat: this very idea is at the heart of the artist similarity graph project. The project helps users uncover connections between artists with similar genres and styles. It does so by providing a search interface where users can find their favourite artist and then generate a graph of similar artists. An artist panel featuring information about the artist is also presented; it showcases the artist's name, type, birth, area, wiki, top track and album. Users can also play the tracks right on the page itself using BrainzPlayer.

        • A network graph is displayed with the selected artist as the central node and links to the related artists. The artists are arranged based on their similarity score, with higher-scoring artists placed closer and lower-scoring ones further away. To convey the strength of relationships between the artists, a divergent colour scheme is used. The user can also travel across the graph by clicking through the artists (nodes).

        • Technologies used:

          • nivo: For artist graph generation
          • React with Typescript: For web pages
          • Figma: Building mock ups and prototypes
          • Docker: To containerize applications
            • nivo provides a rich set of dataviz components, built on top of the awesome d3 and React libraries

            • nivo provides supercharged React components to easily build dataviz apps, it's built on top of d3.

              Several libraries already exist for React d3 integration, but just a few provide server side rendering ability and fully declarative charts.

        • The first challenge was to normalize the data before it could be used to generate a graph. Given the non-linear nature of the data, a square-root transformation was used to transform it. The result is a linear set of data which can be appropriately used in a graph.

            • The most useful transformations in introductory data analysis are the reciprocal, logarithm, cube root, square root, and square.

            • The square root, x to x^(1/2) = sqrt(x), is a transformation with a moderate effect on distribution shape: it is weaker than the logarithm and the cube root. It is also used for reducing right skewness, and also has the advantage that it can be applied to zero values. Note that the square root of an area has the units of a length. It is commonly applied to counted data, especially if the values are mostly rather small.
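For instance, the transform compresses large counts far more than small ones while remaining defined at zero:

```python
import math

def sqrt_transform(counts):
    """Square-root transform: reduces right skew and is defined at zero."""
    return [math.sqrt(c) for c in counts]
```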

        • How to build your own music tagger, with MusicBrainz Canonical Metadata

        • In the blog post where we introduced the new Canonical Metadata dataset, we suggested that a user could now build their own custom music tagging application, without a lot of effort! In this blog post we will walk you through the process of doing just that, using Python.

        • Here at MetaBrainz, we’re die-hard Postgres fans. But the best tool that we’ve found for metadata matching is the Typesense search engine, which supports typo-resistant search. This example will use the Typesense datastore, but you may use whatever datastore you prefer.

            • Lightning-fast Open Source Search

            • The Open Source Alternative to Algolia + Pinecone. The Easier To Use Alternative to Elasticsearch

              • Open Source alternative to Algolia + Pinecone and an Easier-to-Use alternative to ElasticSearch ⚡ 🔍 ✨ Fast, typo tolerant, in-memory fuzzy Search Engine for building delightful search experiences

          • MusicBrainz Canonical Data Examples This simple example shows how to lookup music metadata using the MusicBrainz canonical dataset.
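A hypothetical sketch of the matching step against Typesense. The collection name, field name, and num_typos setting below are illustrative assumptions, not the blog post's actual schema:

```python
# Typo-tolerant artist + recording lookup against a Typesense collection.
# `client` is assumed to be an already-configured typesense.Client.

def build_query(artist: str, recording: str) -> dict:
    """Search parameters for a combined artist/recording match."""
    return {
        "q": f"{artist} {recording}",
        "query_by": "combined_lookup",  # hypothetical field: "artist recording"
        "num_typos": 2,                 # tolerate small misspellings
    }

def match(client, artist: str, recording: str) -> dict:
    return client.collections["canonical_metadata"].documents.search(
        build_query(artist, recording)
    )
```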

        • New dataset: MusicBrainz Canonical Metadata

        • The MusicBrainz project is proud to announce the release of our latest dataset: MusicBrainz Canonical Metadata. This geeky sounding dataset packs an intense punch! It solves a number of problems involving how to match a piece of music metadata to the correct entry in the massive MusicBrainz database.

          The MusicBrainz database aims to collect metadata for all releases (albums) that have ever been published. For popular albums, there can be many different releases, which raises the question “which one is the main (canonical) release?”. If you want to identify a piece of metadata, and you only have an artist and recording (track) name, how do you choose the correct database release?

          This same problem exists on the recording level – many recordings (songs) exist on many releases – which one should be used?

          The MusicBrainz Canonical Metadata dataset now solves this problem by allowing users to lookup canonical releases and canonical recordings. Given any release MBID, Canonical Release Mapping (canonical_release_redirect.csv) allows you to find the release that we consider “canonical”. The same is now true for recording MBIDs, which allows you to look up canonical recordings using the Canonical Recording Mapping (canonical_recording_redirect.csv). Given any recording MBID, you can now find the correct canonical recording MBID.
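Applying the mapping can be sketched like this; the two-column CSV layout assumed below should be checked against the actual dump's header:

```python
# Resolve any recording MBID to its canonical recording MBID using the
# canonical_recording_redirect.csv mapping (assumed layout: source, canonical).
import csv

def load_redirects(path: str) -> dict:
    redirects = {}
    with open(path, newline="") as fh:
        for row in csv.reader(fh):
            redirects[row[0]] = row[1]
    return redirects

def canonical_recording(mbid: str, redirects: dict) -> str:
    """MBIDs absent from the mapping are treated as already canonical."""
    return redirects.get(mbid, mbid)
```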

        • AIBrainz Playlist Generator (beta)

        • MetaBrainz as an organisation has never much dabbled in (artificial) intelligence, but a number of recent factors have led to the team doing some exciting behind-the-scenes work over the last few months.

          Lately more and more potential contributors have come to MeB interested in working on AI projects, and with ListenBrainz we have an excellent dataset. With a current focus on playtesting and finetuning our playlist features we also have the perfect use-case.

          So, without further ado, we invite you to test the beta version of our new AI-powered playlist generator

        • Fresh Releases – My (G)SoC journey with MetaBrainz

        • MusicBrainz is the largest structured online database of music metadata. Today, a myriad of developers leverage this data to build their client applications and projects. According to MusicBrainz Database statistics, 2022 alone saw a whopping 366,680 releases from 275,749 release groups, and 91.5% of these releases have cover art. Given that it has a plethora of useful data about music releases available, but has no useful means to visually present it to general users, the idea of building the Fresh Releases page was born.

      • MetaBrainz Foundation

        • MusicBrainz mirror server with search and replication

        • Docker Compose project for the MusicBrainz Server with replication, search, and development setup

        • Server for the MusicBrainz project (website, API, database tools)

        • MusicBrainz Server is the web frontend to the MusicBrainz Database and is accessible at

        • Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.

        • ListenBrainz keeps track of music you listen to and provides you with insights into your listening habits. We're completely open-source and publish our data as open data.

        • MusicBrainz Picard audio file tagger

          • MusicBrainz Picard Picard is a cross-platform music tagger powered by the MusicBrainz database.

        • BookBrainz website, written in node.js.

          • The Open Book Database BookBrainz is a project to create an online database of information about every single book, magazine, journal and other publication ever written. We make all the data that we collect available to the whole world to consume and use as they see fit. Anyone can contribute to BookBrainz, whether through editing our information, helping out with development, or just spreading the word about our project.

        • A recommendation engine playground that should hopefully make playing with music recommendations easy.

        • The Troi Playlisting Engine combines all of ListenBrainz' playlist efforts:

          • Playlist generation: Music recommendations and algorithmic playlist generation using a pipeline architecture that allows easy construction of custom pipelines that output playlists. You can see this part in action on ListenBrainz's Created for You pages, where we show off Weekly Jams and Weekly Discovery playlists. The playlist generation tools use an API-first approach where users don't need to download massive amounts of data, but instead fetch the data via APIs as needed.
          • Local content database: Using these tools a user can scan their music collection on disk or via a Subsonic API (e.g. Navidrome, Funkwhale, Gonic), download metadata for it and then resolve global playlists (playlists with only MBIDs) to files available in a local collection. We also have support for duplicate file detection, top tags in your collection and other insights.
          • Playlist exchange: We're in the process of building this toolkit out to support saving/loading playlists in a number of formats, to hopefully break playlists free from the music silos (Spotify, Apple, etc.)
        • MusicBrainz Picard Plugins This repository hosts plugins for MusicBrainz Picard.

    • Listen together with ListenBrainz: Track, explore, visualise and share the music you listen to. Follow your favourites and discover great new music.

      • Fresh Releases: Listen to recent releases, and browse what's dropping soon.

      • MetaBrainz Dataset Hoster: You can use this data set hoster to explore the various data sets that are being exposed through this interface. The goal of this interface is to make the discovery of hosted data quick and intuitive - ideally the interface should give you all of the information necessary in order to start using one of these APIs in your project quickly.

        The following data sets are available from here:

        • artist-country-code-from-artist-mbid: MusicBrainz Artist Country From Artist MBID
        • artist-credit-from-artist-mbid: MusicBrainz Artist Credit From Artist MBID
        • recording-mbid-lookup: MusicBrainz Recording by MBID Lookup
        • mbid-mapping: MusicBrainz ID Mapping lookup
        • mbid-mapping-release: MusicBrainz ID Mapping Release lookup
        • explain-mbid-mapping: Explain MusicBrainz ID Mapping lookup
        • recording-search: MusicBrainz Recording search
        • acr-lookup: MusicBrainz Artist Credit Recording lookup
        • acrr-lookup: MusicBrainz Artist Credit Recording Release lookup
        • spotify-id-from-metadata: Spotify Track ID Lookup using metadata
        • spotify-id-from-mbid: Spotify Track ID Lookup using recording mbid
        • sessions-viewer: ListenBrainz Session Viewer
        • similar-recordings: Similar Recordings Viewer
        • similar-artists: Similar Artists Viewer
        • tag-similarity: ListenBrainz Tag Similarity
        • bulk-tag-lookup: Bulk MusicBrainz Tag/Popularity by recording MBID Lookup

        Use the web interface for each of these endpoints to discover what parameters to send and what results to expect. Then take the JSON GET or POST example data to integrate these calls into your projects.
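As a hedged example, a bulk lookup might look like the following. The base URL and the list-of-objects payload shape are assumptions about how the ListenBrainz Labs endpoints are called, so confirm both against the endpoint's own web interface:

```python
# POST a bulk lookup to one dataset hoster endpoint and return the JSON rows.
import requests

BASE_URL = "https://labs.api.listenbrainz.org"  # assumed dataset hoster location

def build_request(endpoint: str, rows: list) -> tuple:
    """URL and JSON body for a bulk lookup against one dataset."""
    return f"{BASE_URL}/{endpoint}/json", rows

def lookup(endpoint: str, rows: list) -> list:
    url, body = build_request(endpoint, rows)
    resp = requests.post(url, json=body)
    resp.raise_for_status()
    return resp.json()
```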

    • CritiqueBrainz is a repository for Creative Commons licensed music and book reviews. Here you can read what other people have written about an album or event and write your own review!

      CritiqueBrainz is based on data from MusicBrainz - open music encyclopedia and BookBrainz - open book encyclopedia.

    • Essentia: Open-source library and tools for audio and music analysis, description and synthesis

    • Essentia is an open-source C++ library for audio analysis and audio-based music information retrieval. It contains an extensive collection of algorithms, including audio input/output functionality, standard digital signal processing blocks, statistical characterization of data, a large variety of spectral, temporal, tonal, and high-level music descriptors, and tools for inference with deep learning models. Essentia is cross-platform and designed with a focus on optimization in terms of robustness, computational speed, and low memory usage, which makes it efficient for many industrial applications. The library includes Python and JavaScript bindings as well as various command-line tools and third-party extensions, which facilitate its use for fast prototyping and allow setting up research experiments very rapidly.

    • Audio feature extraction for JavaScript

      • Meyda is a JavaScript audio feature extraction library. It works with the Web Audio API (or plain old JavaScript arrays) to expose information about the timbre and perceived qualities of sound. Meyda supports both offline feature extraction as well as real-time feature extraction using the Web Audio API. We wrote a paper about it, which is available here.

        • Often, observing and analysing an audio signal as a waveform doesn’t provide us a lot of information about its contents. An audio feature is a measurement of a particular characteristic of an audio signal, and it gives us insight into what the signal contains. Audio features can be measured by running an algorithm on an audio signal that will return a number, or a set of numbers that quantify the characteristic that the specific algorithm is intended to measure. Meyda implements a selection of standardized audio features that are used widely across a variety of music computing scenarios.

        • Bear in mind that by default, Meyda.extract applies a Hanning windowing function to the incoming signal. If you compare the results of Meyda’s feature extraction to those of another library for the same signal, make sure that the same windowing is being applied, or the features will likely differ. To disable windowing in Meyda.extract, set Meyda.windowingFunction to ‘rect’.

      • JavaScript library for detecting synthesized sounds

    • AudD offers a Music Recognition API. We recognize music with our own audio fingerprinting technology based on neural networks. According to ProgrammableWeb, AudD is #1 among 13 Top Recognition APIs.

    • Pricing:

      • 0+ requests per month - $5 per 1000 requests;
      • 100 000 requests per month - $450;
      • 200 000 requests per month - $800;
      • 500 000 requests per month - $1800. Contact us if you're interested in larger amounts of requests.

      Live audio streams recognition - $45 per stream per month with our music DB, $25 with the music you upload.

      • AudD Music Recognition API Docs
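A minimal sketch of a recognition call, following the request shape shown in the AudD docs (the token and audio URL are placeholders):

```python
# Recognize a track from an audio URL via the AudD API.
import requests

AUDD_URL = "https://api.audd.io/"

def build_payload(api_token: str, audio_url: str, extra: str = "") -> dict:
    """Form fields for a recognition request; `extra` lists optional metadata."""
    payload = {"api_token": api_token, "url": audio_url}
    if extra:
        payload["return"] = extra  # e.g. "spotify,apple_music"
    return payload

def recognize(api_token: str, audio_url: str, extra: str = "") -> dict:
    resp = requests.post(AUDD_URL, data=build_payload(api_token, audio_url, extra))
    resp.raise_for_status()
    return resp.json()
```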

    • DJ Trainspotting: How To Find Out What A DJ Is Playing

    • Top 6 Shazam Alternatives for Android and iOS

      • SoundHound – Music Discovery & Hands-Free Player
      • Genius – Song Lyrics & More
      • Musixmatch – Lyrics for your music
      • MusicID
      • Soly – Song and Lyrics Finder
      • Google Assistant & Siri
    • SoundHound Music: Discover, Search, and Play Any Song by Using Just Your Voice

    • 1001Tracklists - The World's Leading DJ Tracklist/Playlist Database

    • The setlist wiki

    • Find setlists for your favorite artists


Conferences, Journals, Research Papers, etc

    • The International Society for Music Information Retrieval (ISMIR) is a non-profit organisation seeking to advance research in the field of music information retrieval (MIR)—a field that aims at developing computational tools for processing, searching, organizing, and accessing music-related data. Among other things, the ISMIR society fosters the exchange of ideas and activities among its members, stimulates research and education in MIR, supports and encourages diversity in membership and disciplines, and oversees the organisation of the annual ISMIR conference.

      • Each year, the ISMIR conference is held in a different corner of the world to motivate the presentation and exchange of ideas and innovations related to the intentionally broad topic of music information. Historically, the call for papers (CFP) is announced in the beginning of the year (February-May) via the community mailing list, and conferences are held several months later (August-November).

      • The Transactions of the International Society for Music Information Retrieval publishes novel scientific research in the field of music information retrieval (MIR), an interdisciplinary research area concerned with processing, analysing, organising and accessing music information. We welcome submissions from a wide range of disciplines, including computer science, musicology, cognitive science, library & information science and electrical engineering.

      • TISMIR was established to complement the widely cited ISMIR conference proceedings and provide a vehicle for the dissemination of the highest quality and most substantial scientific research in MIR. TISMIR retains the Open Access model of the ISMIR Conference proceedings, providing rapid access, free of charge, to all journal content. In order to encourage reproducibility of the published research papers, we provide facilities for archiving the software and data used in the research.

        • The Sound Demixing Challenge 2023 – Music Demixing Track

        • This paper summarizes the music demixing (MDX) track of the Sound Demixing Challenge (SDX’23). We provide a summary of the challenge setup and introduce the task of robust music source separation (MSS), i.e., training MSS models in the presence of errors in the training data. We propose a formalization of the errors that can occur in the design of a training dataset for MSS systems and introduce two new datasets that simulate such errors: SDXDB23_LabelNoise and SDXDB23_Bleeding. We describe the methods that achieved the highest scores in the competition. Moreover, we present a direct comparison with the previous edition of the challenge (the Music Demixing Challenge 2021): the best performing system achieved an improvement of over 1.6 dB in signal-to-distortion ratio over the winner of the previous competition, when evaluated on MDXDB21. Besides relying on the signal-to-distortion ratio as objective metric, we also performed a listening test with renowned producers and musicians to study the perceptual quality of the systems and report here the results. Finally, we provide our insights into the organization of the competition and our prospects for future editions.

      • International Society for Music Information Retrieval Conference (ISMIR)

      • The dblp computer science bibliography provides open bibliographic information on major computer science journals and proceedings.

    • International Conference on Acoustics, Speech, and Signal Processing

    • ICASSP, the International Conference on Acoustics, Speech, and Signal Processing, is an annual flagship conference organized by IEEE Signal Processing Society. Ei Compendex has indexed all papers included in its proceedings.

    • As ranked by Google Scholar's h-index metric in 2016, ICASSP has the highest h-index of any conference in the Signal Processing field. The Brazilian ministry of education gave the conference an 'A1' rating based on its h-index.

    • International Conference on Digital Audio Effects

    • The annual International Conference on Digital Audio Effects or DAFx Conference is a meeting of enthusiasts working in research areas on audio signal processing, acoustics, and music related disciplines, who come together to present and discuss their findings.

    • New Interfaces for Musical Expression

    • New Interfaces for Musical Expression, also known as NIME, is an international conference dedicated to scientific research on the development of new technologies and their role in musical expression and artistic performance.

    • Sound and Music Computing Conference

    • The Sound and Music Computing (SMC) Conference is the forum for international exchanges around the core interdisciplinary topics of Sound and Music Computing. The conference is held annually to facilitate the exchange of ideas in this field.

    • Sound and Music Computing (SMC) is a research field that studies the whole sound and music communication chain from a multidisciplinary point of view. The current SMC research field can be grouped into a number of subfields that focus on specific aspects of the sound and music communication chain.

      • Processing of sound and music signals: This subfield focuses on audio signal processing techniques for the analysis, transformation and resynthesis of sound and music signals.
      • Understanding and modeling sound and music: This subfield focuses on understanding and modeling sound and music using computational approaches. Here we can include Computational musicology, Music information retrieval, and the more computational approaches of Music cognition.
      • Interfaces for sound and music: This subfield focuses on the design and implementation of computer interfaces for sound and music. This is basically related to Human Computer Interaction.
      • Assisted sound and music creation: This subfield focuses on the development of computer tools for assisting Sound design and Music composition. Here we can include traditional fields like Algorithmic composition.
    • Computer Music Journal

    • Computer Music Journal is a peer-reviewed academic journal that covers a wide range of topics related to digital audio signal processing and electroacoustic music. It is published on-line and in hard copy by MIT Press. The journal is accompanied by an annual CD/DVD that collects audio and video work by various electronic artists.


    • LLMs <3 MIR A tutorial on Large Language Models for Music Information Retrieval

    • This is a web book I wrote because it felt fun when I thought about it -- a tutorial on Large Language Models for Music Information Retrieval.

    • This book is written from the perspective of music AI.

      • Chapter I, “Large Language Models”, would be general and succinct. I’ll outsource a lot by simply sharing links so that you decide the depth and breadth of your study.
      • Chapter II, “LLM as a Tool with Common Sense” is where I introduce some existing works and my suggestions on how to use LLMs for MIR research.
      • Chapter III, “Multimodal LLMs”, provides a summary about how we can incorporate multimodal data into LLMs.
      • Chapter IV, “Weakness of LLMs for MIR”, presents some limitations the current LLMs have in the context of MIR research.
      • Chapter V, “Finale”, is just a single page of my parting words.
        1. Music Audio LLMs So, how can we feed audio signals to an LLM? It’s really the same as we did with images. We need to find a way to represent the audio signal as a vector sequence Ha, and perhaps feed it with some text representation Hq.
      • SALMONN

          • SALMONN: Speech Audio Language Music Open Neural Network SALMONN is a large language model (LLM) enabling speech, audio events, and music inputs, which is developed by the Department of Electronic Engineering at Tsinghua University and ByteDance. Instead of speech-only input or audio-event-only input, SALMONN can perceive and understand all kinds of audio inputs and therefore obtain emerging capabilities such as multilingual speech recognition and translation and audio-speech co-reasoning. This can be regarded as giving the LLM "ears" and cognitive hearing abilities, which makes SALMONN a step towards hearing-enabled artificial general intelligence.

            • SALMONN: Towards Generic Hearing Abilities for Large Language Models

      • LLark

          • LLark: A Multimodal Foundation Model for Music

          • LLark is a research exploration into the question: how can we build a flexible multimodal language model for music understanding?

          • LLark is designed to produce a text response, given a 25-second music clip and a text query (a question or short instruction).

          • We built our training dataset from a set of open-source academic music datasets (MusicCaps, YouTube8M-MusicTextClips, MusicNet, FMA, MTG-Jamendo, MagnaTagATune). We did this by using variants of ChatGPT to build query-response pairs from the following inputs: (1) the metadata available from a dataset, as pure JSON; (2) the outputs of existing single-task music understanding models; (3) a short prompt describing the fields in the metadata and the type of query-response pairs to generate. Training a model using this type of data is known as “instruction tuning.” An instruction-tuning approach has the additional benefit of allowing us to use a diverse collection of open-source music datasets that contain different underlying metadata, since all datasets are eventually transformed into a common (Music + Query + Response) format. From our initial set of 164,000 unique tracks, this process resulted in approximately 1.2M query-response pairs.

          • LLark is trained to use raw audio and a text prompt (the query) as input, and produces a text response as output. LLark is initialized from a set of pretrained open-source modules that are either frozen or fine-tuned, plus only a small number of parameters (less than 1%!) that are trained from scratch.

          • The raw audio is passed through a frozen audio encoder, specifically the open-source Jukebox-5B model. The Jukebox outputs are downsampled to 25 frames per second (which reduces the size of the Jukebox embeddings by nearly 40x while preserving high-level timing information), and then passed through a projection layer that is trained from scratch to produce audio embeddings. The query text is passed through the tokenizer and embedding layer of the language model (LLama2-7B-chat) to produce text embeddings. The audio and text embeddings are then concatenated and passed through the rest of the language model stack. We fine-tune the weights of the language model and projection layer using a standard training procedure for multimodal large language models (LLMs).

          • In one set of experiments, we asked people to listen to a music recording and rate which of two (anonymized) captions was better. We did this across three different datasets with different styles of music, and for 4 different music captioning systems in addition to LLark. We found that people on average preferred LLark’s captions to all four of the other music captioning systems.

          • We conducted an additional set of experiments to measure LLark’s musical understanding capabilities. In these evaluations, LLark outperformed all baselines tested on evaluations of key, tempo, and instrument identification in zero-shot datasets (datasets not used for training). In zero-shot genre classification, LLark ranked second, but genre estimation is a difficult and subjective task; we show in the paper that LLark’s predictions on this task tend to fall within genres that most musicians would still consider correct (e.g., labeling “metal” songs as “rock”).

            • LLark: A Multimodal Instruction-Following Language Model for Music

            • Music has a unique and complex structure which is challenging for both expert humans and existing AI systems to understand, and presents unique challenges relative to other forms of audio. We present LLark, an instruction-tuned multimodal model for \emph{music} understanding. We detail our process for dataset creation, which involves augmenting the annotations of diverse open-source music datasets and converting them to a unified instruction-tuning format. We propose a multimodal architecture for LLark, integrating a pretrained generative model for music with a pretrained language model. In evaluations on three types of tasks (music understanding, captioning, reasoning), we show that LLark matches or outperforms existing baselines in music understanding, and that humans show a high degree of agreement with its responses in captioning and reasoning tasks. LLark is trained entirely from open-source music data and models, and we make our training code available along with the release of this paper.

            • LLark: A Multimodal Foundation Model for Music This repository contains the code used to build the training dataset, preprocess existing open-source music datasets, train the model, and run inference. Note that this paper is not accompanied with any trained models.
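The multimodal wiring described above (frozen audio encoder, trainable projection, concatenation with the query's text embeddings) can be sketched schematically. All dimensions and variable names below are illustrative stand-ins, not LLark's actual sizes or code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only -- not LLark's real dimensions.
audio_frames, audio_dim = 25, 64   # e.g. 25 encoder frames per second
text_tokens, model_dim = 8, 32     # embedded query tokens

# 1. Frozen audio encoder output for one clip (stand-in for Jukebox features).
audio_features = rng.standard_normal((audio_frames, audio_dim))

# 2. Trainable projection mapping audio features into the LM embedding space.
projection = rng.standard_normal((audio_dim, model_dim)) * 0.02
audio_embeddings = audio_features @ projection        # (25, 32)

# 3. Query text embedded by the language model's own embedding layer.
text_embeddings = rng.standard_normal((text_tokens, model_dim))

# 4. Concatenate along the sequence axis; the LM stack consumes the result.
lm_input = np.concatenate([audio_embeddings, text_embeddings], axis=0)
print(lm_input.shape)  # (33, 32)
```

The only freshly trained parameters in this scheme are in `projection`, which is why such architectures can get away with training less than 1% of their weights from scratch.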

    • Discogs - Music Database and Marketplace

    • Last.fm | Play music, find songs, and discover artists The world's largest online music service. Listen online, find out more about your favourite artists, and get music recommendations, only at Last.fm

    • SongRec SongRec is an open-source Shazam client for Linux, written in Rust.

    • How it works

      For useful information about how audio fingerprinting works, you may want to read this article. Put simply, Shazam generates a spectrogram (a time/frequency 2D graph of the sound, with amplitude at each intersection) of the sound, and maps out the frequency peaks from it (which should match key points of the harmonics of the voice or of certain instruments).

      Shazam also downsamples the sound to 16 kHz before processing, and cuts it into four bands of 250-520 Hz, 520-1450 Hz, 1450-3500 Hz, and 3500-5500 Hz (so that if one band is too scrambled by noise, recognition can still work from the other bands). The frequency peaks are then sent to the servers, which look up the strongest peaks in a database, checking for the simultaneous presence of neighboring peaks in both the associated reference fingerprints and the fingerprint we sent.

      Hence, the Shazam fingerprinting algorithm, as implemented by the client, is fairly simple, as much of the processing is done server-side. The general workings of Shazam have been documented in public research papers and patents.
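The peak-picking step described above can be illustrated with a toy sketch. The frame size, hop, and the single-peak-per-frame choice are simplifications for illustration, not Shazam's actual parameters (real fingerprinters keep several peaks per band and hash pairs of peaks):

```python
import numpy as np

def spectrogram_peaks(signal, sample_rate, frame_size=1024, hop=512):
    """Toy peak picker: STFT the signal and keep the strongest
    frequency bin per frame. Only illustrates the core idea of
    reducing audio to a sparse set of spectral peaks."""
    window = np.hanning(frame_size)
    peaks = []
    for start in range(0, len(signal) - frame_size, hop):
        frame = signal[start:start + frame_size] * window
        magnitude = np.abs(np.fft.rfft(frame))
        bin_index = int(np.argmax(magnitude))
        peaks.append(bin_index * sample_rate / frame_size)  # bin -> Hz
    return peaks

# A pure 1000 Hz tone sampled at 16 kHz (inside Shazam's 520-1450 Hz band):
sr = 16_000
t = np.arange(sr) / sr  # one second of samples
peaks = spectrogram_peaks(np.sin(2 * np.pi * 1000 * t), sr)
# every frame's strongest peak should sit at roughly 1000 Hz
```

A real system would then hash constellations of such peaks (e.g. pairs of peaks with their time offset) so the server can match them against reference fingerprints despite noise.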

      • music library manager and MusicBrainz tagger

      • Beets is the media library management system for obsessive music geeks.

        The purpose of beets is to get your music collection right once and for all. It catalogs your collection, automatically improving its metadata as it goes. It then provides a bouquet of tools for manipulating and accessing your music.

      • Because beets is designed as a library, it can do almost anything you can imagine for your music collection. Via plugins, beets becomes a panacea.

        If beets doesn't do what you want yet, writing your own plugin is shockingly simple if you know a little Python.

    • MediaFile: read and write audio files' tags in Python MediaFile is a simple interface to the metadata tags for many audio file formats. It wraps Mutagen, a high-quality library for low-level tag manipulation, with a high-level, format-independent interface for a common set of tags.

    • Python module for handling audio metadata

    • Mutagen is a Python module to handle audio metadata. It supports ASF, FLAC, MP4, Monkey's Audio, MP3, Musepack, Ogg Opus, Ogg FLAC, Ogg Speex, Ogg Theora, Ogg Vorbis, True Audio, WavPack, OptimFROG, and AIFF audio files. All versions of ID3v2 are supported, and all standard ID3v2.4 frames are parsed. It can read Xing headers to accurately calculate the bitrate and length of MP3s. ID3 and APEv2 tags can be edited regardless of audio format. It can also manipulate Ogg streams on an individual packet/page level.



  • Could we use data from the Spotify Audio Analysis endpoints (Ref: 1, 2, 3, 4) to generate a Shazam fingerprint or similar (Ref)?

See Also

My Other Related Deepdive Gists and Projects
