Meta has launched AudioCraft, a new suite of AI models that generate music and audio from text prompts, the company announced on Wednesday (Aug. 2). The technology consists of three models: MusicGen (music), AudioGen (sound effects) and EnCodec (a neural audio codec that enables higher-quality music generation). It competes with Google’s MusicLM, a text-to-music generator that launched in May.
Using prompts like “soulful music for a dinner party” or “movie scene in a desert with percussion,” users can generate music at the click of a button. According to the company’s announcement, it sees the technology as a “new type of instrument — just like synthesizers when they first appeared.”
MusicGen — the model from the AudioCraft suite that produces music — was trained on 20,000 hours of Meta-owned and specifically licensed music. The announcement is unclear about whether EnCodec was trained on any copyrighted material or if it follows the same guidelines as MusicGen. Meta did not immediately return Billboard’s request for comment.
Training is one of the most contentious areas of the nascent AI industry. To produce human-quality outputs, AI models train on millions or billions of data points to learn the attributes of what they’re replicating — and many of the world’s biggest AI companies train their models on copyrighted material without the authorization, compensation or even knowledge of copyright owners.
MusicGen, AudioGen and EnCodec will all be available as open-source models. This will let researchers and practitioners train their own models on their own datasets, advancing the AudioCraft tools beyond Meta’s initial launch and helping to address the company’s concerns about bias, including the models’ proclivity for Western-style music, which makes up the largest portion of the training set.
“Music is arguably the most challenging type of audio to generate as it’s composed of local and long-range patterns, from a suite of notes to a global musical structure with multiple instruments,” Meta said in a blog post, noting that its family of models is “capable of producing high quality audio” with consistency and ease of use.