Google's Veo 3 AI video model is a league above its competitors for one major reason: sound. You can prompt not only what you see on screen, but also what you hear.
Built by Google's DeepMind lab, the first Veo model debuted in May 2024, and each new generation has added more functionality. It has always excelled in speed, accuracy, and physics understanding compared to its competitors, but the addition of sound was a game-changer.
You can use it to make a short commercial, a scene from a film, or even a music video. But there is one use I have seen more than any other: ASMR (autonomous sensory meridian response) videos, which feature gentle tapping, whispers, and ambient sounds that trigger a tingling sensation for some people.
To see how far it can go, I created a series of ASMR food prompts, each designed to generate a matching video and sound built around a particular food.

Veo 3 in the Gemini app
Veo 3 is now available in the Gemini app. When starting a new prompt, just select the Video option, type what you want, and it generates an 8-second clip.
While Gemini is not necessarily the best way to access Veo 3 (I would recommend Freepik, Fal, Higgsfield, or Google Flow), it is easy to use and gets the job done.
An important advantage of using Gemini directly is that it automatically interprets and enhances your prompts. So if you ask for “a cool ASMR video featuring lasagna,” that is what you will get.
You can be more specific using something called structured prompting, which labels each moment with a timestamp and visual details. But unless you need precise control, a simple paragraph (aka a narrative prompt) is usually more effective.
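To make the idea of a structured prompt concrete, here is a minimal sketch that assembles timestamped segments into a single block of text you could paste into Gemini. The bracketed timestamp layout, the `Audio:` line, and the function name `build_structured_prompt` are all assumptions made for illustration; Google does not publish this as a required format, and the example segments are invented to fit the 8-second clip length.

```python
def build_structured_prompt(segments, audio_note=""):
    """Join (start_s, end_s, description) tuples into one timestamped prompt.

    The "[00s-03s]" layout is a hypothetical convention, not an official format.
    """
    lines = []
    for start, end, description in segments:
        lines.append(f"[{start:02d}s-{end:02d}s] {description}")
    if audio_note:
        # Optional closing line describing the sound you want (and do not want).
        lines.append(f"Audio: {audio_note}")
    return "\n".join(lines)


# Illustrative segments covering an 8-second clip of a fizzy drink with ice.
segments = [
    (0, 3, "Close-up of ice cubes dropping into a tall glass."),
    (3, 6, "A fizzy drink is poured slowly over the ice."),
    (6, 8, "Bubbles rise and the ice crackles."),
]
prompt = build_structured_prompt(segments, audio_note="clean ASMR soundscape, no music")
print(prompt)
```

Keeping each moment as its own tuple makes it easy to adjust the timing or swap a single shot without rewriting the whole prompt.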
The prompts
The first task in any AI project is thinking about your prompt. Models are getting better at interpreting intent, but if you want a particular result, it still pays to be specific.
I knew that I wanted an ASMR food video, so I started with a test: “ASMR food video with sound.”
The result? Decent. It essentially gave me what I had in mind. Then I refined it: specifying food types, adding sound details, and even trying a structured prompt for a fizzy drink with ice.
Most of the time, a narrative prompt works best. Just describe what you want to see, the flow of the video, and how the sound should come through.
1. Lasagna sizzling from the pan
The first prompt, “ASMR food video with sound,” produced a stunning clip of a person sliding a fork into a slice of lasagna. As the fork goes in, you hear the squish, then a clink as it hits the plate. This is a case where I wish Veo 3 had an “Extend clip” button.
The prompt did not include any other detail, so I had no way to determine what the food would be, how the sound would come out, or whether the sound would work at all. This is why it is important to be specific when prompting an AI model, even in chatbots like Gemini.
2. Cooking and eating
Next, I got more specific: a long, narrative-style prompt asked Veo 3 to show a satisfying meal being prepared in a well-lit kitchen and to generate close-ups of a chef eating.
I asked for slow shots of ingredients being sliced, the sounds of cooking in a pan, and a crunch as the chef takes a bite.
I also added this line: “Emphasize audio quality: clean, clear ASMR soundscape without music.” It directs not only the sound, but also the style of sound and what I do not want to hear.
3. Popcorn popping
For the last prompt, I started with an image. I used Midjourney V7 to make a picture of a woman looking at rainbow popcorn, then added it to Gemini with the prompt “ASMR food.”
Visually, the result was amazing, but for some reason the woman says in a voiceover, “It is delicious, it is a rainbow popcorn.” This is on me: I did not specify whether she should speak, or what she should say.
A simple fix: put any speech you want in quotation marks. For example, I could have prompted her to say, “I like to watch popcorn pop,” with emphasis on the word “pop.” I could also have specified that she was speaking on camera, and Veo 3 would have synced the lip movement to match.
Conclusion
Overall, Veo 3 delivers impressive results, especially when it comes to producing high-quality sound that accurately reflects the scene. There are some quirks to navigate, such as an unexpected voiceover or slightly off-looking lasagna, but these are easily addressed with more specific prompts.

