Google DeepMind Introduces Video-to-Audio V2A Technology: Synchronizing Audiovisual Generation
Sound is indispensable for enriching human experiences, enhancing communication, and adding emotional depth to media. While AI has made significant progress in many domains, incorporating sound into video generation models with the sophistication and nuance of human-created content remains challenging. Producing soundtracks for the silent videos these models generate is a significant next step toward fully generated films.
Google DeepMind introduces video-to-audio (V2A) technology that enables synchronized audiovisual creation. Using a combination of video pixels and natural-language text instructions, V2A creates immersive audio for the on-screen action. The team experimented with autoregressive and diffusion approaches to find the most scalable AI architecture; the diffusion-based approach produced the most convincing and realistic results for synchronizing audio with visuals.
The first step of the video-to-audio pipeline is compressing the input video into a compact representation. The diffusion model then iteratively refines the audio from random noise. Visual input and natural-language prompts steer this process, which generates realistic, synchronized audio that closely follows the instructions. Decoding the audio, generating the waveform, and merging the audio and visual data constitute the final step of producing the output.
In other words, V2A encodes the video input and the prompt before running them iteratively through the diffusion model, and the compressed audio it produces is then decoded into a waveform. To improve the model's ability to produce high-quality audio and to train it to make specific sounds, the researchers supplemented the training data with additional information, such as transcripts of spoken dialogue and AI-generated annotations containing detailed descriptions of sound.
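To make this flow concrete, here is a minimal, illustrative sketch in PyTorch of the encode, iteratively denoise, and decode loop described above. The module names, tensor shapes, denoising rule, step count, and sample rate are assumptions chosen for readability; DeepMind has not released V2A's implementation.

```python
# Minimal, illustrative sketch of the encode -> iteratively denoise -> decode flow.
# Everything here (modules, shapes, step count) is an assumption, not DeepMind's V2A code.
import torch
import torch.nn as nn

class TinyV2A(nn.Module):
    def __init__(self, latent_dim=64, cond_dim=32):
        super().__init__()
        self.latent_dim = latent_dim
        self.video_encoder = nn.Linear(3 * 16 * 16, cond_dim)  # compress toy video frames
        self.text_encoder = nn.Embedding(1000, cond_dim)       # embed prompt tokens
        self.denoiser = nn.Sequential(                         # refines the audio latent
            nn.Linear(latent_dim + 2 * cond_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.audio_decoder = nn.Linear(latent_dim, 16000)      # latent -> 1 s of 16 kHz audio

    def forward(self, frames, prompt_tokens, steps=10):
        # 1) Encode the video frames and the natural-language prompt into conditioning vectors.
        v = self.video_encoder(frames.flatten(1)).mean(0, keepdim=True)
        t = self.text_encoder(prompt_tokens).mean(1)
        cond = torch.cat([v, t], dim=-1)
        # 2) Start from random noise and iteratively refine the audio latent,
        #    steering every step with the video/text conditioning.
        latent = torch.randn(1, self.latent_dim)
        for _ in range(steps):
            latent = latent + self.denoiser(torch.cat([latent, cond], dim=-1))
        # 3) Decode the refined latent into a waveform to pair with the video.
        return self.audio_decoder(latent)

model = TinyV2A()
frames = torch.rand(8, 3 * 16 * 16)       # eight toy 16x16 RGB frames, flattened
prompt = torch.randint(0, 1000, (1, 5))   # five toy prompt token ids
waveform = model(frames, prompt)
print(waveform.shape)                     # torch.Size([1, 16000])
```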
By training on video, audio, and these added annotations, the technology learns to associate specific audio events with different visual scenes and to respond to the information in the transcripts or annotations. V2A technology can be paired with video generation models like Veo to make shots with a dramatic score, realistic sound effects, or dialogue that complements the characters and tone of a video.
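This kind of training can be pictured as a standard denoising-style objective in which the paired video and its annotation or transcript act as conditioning. The sketch below is a hedged illustration of that idea only; the encoders, shapes, and loss are assumptions, not the published method.

```python
# Hedged sketch of a denoising-style training step conditioned on a video clip and its
# sound annotation; shapes, encoders, and loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def training_step(denoiser, video_feats, annotation_feats, audio_latent):
    # Corrupt the clean audio latent with noise, then train the model to recover it
    # given the video features and the text annotation describing the sound.
    noisy = audio_latent + torch.randn_like(audio_latent)
    cond = torch.cat([video_feats, annotation_feats], dim=-1)
    pred = denoiser(torch.cat([noisy, cond], dim=-1))
    return F.mse_loss(pred, audio_latent)

denoiser = nn.Sequential(nn.Linear(64 + 128, 128), nn.ReLU(), nn.Linear(128, 64))
loss = training_step(
    denoiser,
    video_feats=torch.randn(1, 64),        # stand-in features for a video clip
    annotation_feats=torch.randn(1, 64),   # e.g. embedding of "footsteps on gravel"
    audio_latent=torch.randn(1, 64),       # stand-in latent of the paired audio
)
loss.backward()
```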
With its ability to create scores for a wide range of classic videos, such as silent films and archival footage, V2A technology opens up a world of creative possibilities. The most exciting aspect is that it can generate as many soundtracks as users desire for any video input. Users can define a “positive prompt” to guide the output towards desired sounds or a “negative prompt” to steer it away from unwanted noises. This flexibility gives users unprecedented control over V2A’s audio output, fostering a spirit of experimentation and enabling them to quickly find the perfect match for their creative vision.
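One common way to realize positive and negative prompting in diffusion models is a guidance step that pushes each denoising update toward the positive-prompt prediction and away from the negative one. The sketch below illustrates that general idea; the toy denoiser, embeddings, and guidance scale are assumptions, and DeepMind has not specified how V2A implements its prompts.

```python
# Illustrative positive/negative prompt guidance for a denoising update
# (classifier-free-guidance style). All names and values are assumptions.
import torch

def toy_denoiser(latent, video_cond, text_cond):
    # Stand-in denoiser: blends the current latent with the conditioning signals.
    return 0.9 * latent + 0.05 * video_cond + 0.05 * text_cond

def guided_step(latent, video_cond, pos_cond, neg_cond, scale=3.0):
    # Predict with the desired ("positive") prompt and the unwanted ("negative") prompt,
    # then move along the direction that separates the two predictions.
    pred_pos = toy_denoiser(latent, video_cond, pos_cond)
    pred_neg = toy_denoiser(latent, video_cond, neg_cond)
    return pred_neg + scale * (pred_pos - pred_neg)

latent = torch.randn(1, 64)        # audio latent being refined
video_cond = torch.randn(1, 64)    # stand-in video features
pos_cond = torch.randn(1, 64)      # e.g. embedding of "dramatic orchestral score"
neg_cond = torch.randn(1, 64)      # e.g. embedding of "crowd noise"
for _ in range(10):
    latent = guided_step(latent, video_cond, pos_cond, neg_cond)
```

Swapping in different positive and negative prompt embeddings yields a different soundtrack for the same video input, matching the experimentation the article describes.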
The team is dedicated to ongoing research and development to address a range of remaining issues. The quality of the audio output depends on the video input: distortions or artifacts that fall outside the model's training distribution can lead to noticeable audio degradation. The team is also working on improving lip-syncing for videos with speech. By analyzing the input transcripts, V2A aims to create speech that is synchronized with the characters' mouth movements; however, when the video generation model does not produce mouth movements that correspond to the transcript, the result can be uncanny lip-syncing. They are actively working to resolve these issues, demonstrating their commitment to maintaining high standards and continuously improving the technology.
The team is actively seeking input from prominent creators and filmmakers, recognizing their invaluable insights and contributions to the development of V2A technology. This collaborative approach helps ensure that V2A technology positively influences the creative community, meeting its needs and enhancing its work. To further protect AI-generated content from abuse, they have incorporated the SynthID toolkit into their V2A research and watermark all AI-generated outputs, demonstrating their commitment to the ethical use of the technology.