Sony AI announces SFX generation model: Woosh


Japan – Recognizing that professional sound design requires fundamentally different data and controls than general audio AI systems, Sony AI has released a foundation model built specifically for sound effect generation.

Sony AI’s team simultaneously trained two versions. A private model, optimized for studio-grade output, was trained on licensed professional sound effect libraries such as Pro Sound Effects and BOOM. A public model with the same architecture was trained on publicly available datasets and released to the research community. The team named the model Woosh after one of the most common sound effects used in gaming and film.

Woosh, built for workflows used in gaming, film and interactive media, supports two generation tasks: text-to-audio, generating a sound effect from a written description, and video-to-audio, generating sound directly from a video sequence, with an optional text prompt to guide the output. Game and film sound designers working from visual content rather than abstract descriptions should find the video-to-audio capability particularly relevant.

There are significant differences between public audio datasets and pro SFX libraries, the team points out. Public audio datasets can be rife with ambient sound and overlapping noises and may be loosely labeled. In contrast, pro SFX libraries offer purpose-recorded sounds, carefully edited and with precise labeling and tagging that match how professionals search for and describe audio.

Woosh’s private model “significantly outperforms public alternatives on professional sound effect data,” Sony AI’s team reports. “The public model outperforms comparable open-source models on public benchmarks.” Tested against FoleyBench, the first large-scale benchmark designed for evaluating Foley-style video-to-audio generation, “Woosh’s video-to-audio model outperforms the comparable baseline across audio quality and semantic alignment metrics, while using fewer parameters.”

The team has also been developing a plug-in for digital audio workstations (DAWs), with planned support for variation generation, inpainting (the ability to complete a region of audio so that it stitches smoothly with an existing sound) and personalization. “With this plug-in we can integrate seamlessly into those pipelines and workflows and tools in a way that sound designers can use more intuitively,” explains Hakim Missoum, strategy and partnerships manager at Sony AI.

Additional controls are planned as the ecosystem develops. The roadmap includes precise time controls, morphing (transforming one sound into another using a semantic description of the target), generation of perfect loops, and personalization from one or a small number of audio samples. All of these capabilities reflect the kind of granular creative control professionals have told the team they need.

Fully cognizant of the controversy swirling around AI and its potential impact on jobs, the team reportedly aims “to understand where AI can work as a tool to support the human creative process. The controls being built into the plug-in, and the decision to train on licensed professionally curated libraries rather than scraped public data, are both expressions of that commitment.”

Sony AI says that the licensing reflects a deliberate strategy. The public release is non-commercial and is designed to demonstrate what the technology can do, with inference code and model weights available to the community for research and experimentation. Conversely, the private model, trained on licensed studio-quality data, points toward commercial application. As Missoum puts it, the public release “prepares the ground for the professional model we’re developing. The performance is not the same; and that’s the point.”

https://ai.sony/
