In ten hours I went from “I have never made an AI video” to this:
I had seen hundreds of excited people shouting about how amazing AI video is, how easy it is, and how it is taking the world by storm.
My experience poured a lot of cold water on those illusions. AI video was actually kind of a pain in the butt, and riddled with limitations.
But. Once you figure it out, there are lots of impressive - if selective - things you can make today. Let me give you an unusually honest take on what is possible, what it cost me, and what you might actually use it for.
Making video out of nothing at all
The promise of AI video is that you can type in what you want, and get a video with sound, in around 30 seconds.
That much is true:
I used Veo 3. All I typed in was this:
Modern private office interior, drab and corporate. Hand held camera, documentary. A middle aged, fast-talking but clueless corporate boss looks outside of a window as rain comes down. They sip from a “best boss” mug. He says "Some people see rain. I see ... opportunity."
That gave me pretty decent video and audio, in about 30 seconds.
Sounds great - but here’s Veo giving another video for the exact same prompt, a few seconds later:
Note how this time he speaks with the mug covering his mouth, there’s some weird “opportunity” distortion that comes out of nowhere, and he ad-libs an extra line at the end.
Scenes with dramatic motion, like explosions or battles, fare much worse:
By far the most annoying hallucinations are captions. Maybe 25% of the time I used it, Veo added gibberish captions, like this:
So the first thing you learn with AI video is that you will be making the same video multiple times. I did most clips at least twice, and many six times or more:
This takes time. You need roughly 30s to produce each clip, which is actually very fast, but it won’t feel like it when you’re waiting. You soon learn to queue up a lot of requests at once.
It took me 160 video clips to make my 2.5 minute video.
Consistency is an illusion
Each Veo video is exactly 8 seconds long.
That’s just long enough to make a joke, or a short action. But as soon as you want to use multiple scenes, you’re going to have a problem.
There’s no consistency between shots. So the exact same prompt for a ‘robot’ will look different every time:
You can get clever, and ask for a very precise description of a robot. In my experience, it won’t help much. The AI doesn’t always do what it’s asked, and you can’t describe a robot or a person reliably in words.
This is the single biggest problem with AI video today, and it’s going to drive everything else you do with it.
There are basically three ways to get around this, and once you realize what they are, you will understand why current AI videos have some funny limitations.
Option 1: Design to avoid consistency
Notice how my video doesn’t require consistency.
An emerging AI trope is a series of interview clips. This is why. Each scene can be from a different person, and it doesn’t matter.
Most ‘normal’ video doesn’t have this problem, so you have to think of ideas that still make sense under this limitation. But if you can, the result can be compelling.
Option 2: Use a known subject
If you use a subject that AI knows how to draw - like a stormtrooper from Star Wars - then you get a consistent output every time for free! This is why stormtrooper vlogs are blowing up on social media:
Prompt:
Selfie footage of two stormtroopers. One stormtrooper says in front of his parked spacecraft, "Well, we can't leave because Greg's lost the keys." We then pan to Greg, who says, "I didn't lose them, I just don't know where I left them."
I’m personally surprised Veo let me make videos of pretty blatantly trademarked characters, and it definitely refuses some other requests (e.g. for celebrities), so I wouldn’t be surprised if this changes.
But if you can ask for something generic and consistent - like Bigfoot - then you’re probably going to get some consistency for free.
Option 3: No dialog
I used Veo 3 to make video, because it has the amazing and currently unique ability to make audio for each video clip.
If I use anything else - say Veo 2 - I lose that audio, but I gain a lot more control over what my video looks like.
I can extend video clips beyond 8 seconds. I can upload reference material - like a photo of a building, or someone’s face. I was able to turn this photo of myself:
Into this video of me walking through a futuristic city:
With just this prompt:
Man walks around futuristic Austin with robo-taxis looking like something out of Blade Runner.
Sometimes the results are great, like this. Sometimes… less so (note the vanishing ice cream):
Veo 2 is still a huge increase in visual control, and the only way to really enforce consistency of people between shots, but it comes at the price of audio, and especially dialog.
To add audio, you either need to record your own, or use another tool (I used MM Audio to generate background noise for the above two clips). Any audio you add won’t impact your generated video, so no lip-sync.
This is how you get videos like this:
Note how their lack of dialog is covered brilliantly with a stirring soundtrack. Note how each frame is just a generated AI image, with simple animation of that image added by AI afterwards.
This style has become popular because it works within the limitations of last-gen video generators.
What I did
Given this understanding of what AI video could do, I came up with one basic idea: “The Office but with AI robots”. My plan was to not require consistency between shots: humans and robots could look different each time.
I used ChatGPT 4.5 to help me think of lines:
I didn’t use any of these, but they did help jump-start my own writing.
After a lot of experiments in Veo, I started to settle on some complete prompts. For most scenes, the start of each prompt was copied and pasted. I found keeping my prompts short and simple gave the best results:
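For anyone who would rather script that copy and paste step, here is a minimal Python sketch of the idea. It is not the exact prompt set used for this video: the shared opening is simply borrowed from the example prompt earlier in this post, the second scene is made up for illustration, and `SCENE_PREFIX` and `build_prompt` are hypothetical names.

```python
# Assumption: the shared opening below is borrowed from the example prompt
# earlier in the post. The opening actually reused for the video is not shown.
SCENE_PREFIX = (
    "Modern private office interior, drab and corporate. "
    "Hand held camera, documentary."
)


def build_prompt(scene, prefix=SCENE_PREFIX):
    """Prepend the shared opening to a short, scene-specific description."""
    return f"{prefix} {scene.strip()}"


scenes = [
    # Taken from the example prompt earlier in the post.
    "A middle aged, fast-talking but clueless corporate boss looks outside "
    'of a window as rain comes down. He says "Some people see rain. '
    'I see ... opportunity."',
    # Hypothetical scene, invented purely for illustration.
    "An office worker stares at a screen and clicks approve without reading.",
]

prompts = [build_prompt(scene) for scene in scenes]

for prompt in prompts:
    print(prompt)
    print()
```

Each printed prompt can then be pasted into Veo by hand. Keeping the opening in one place makes it easy to tweak the look of every scene at once.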
At first, this was me seeing what was possible. I was aiming to get lots of footage which I could edit later. I settled on a few key themes - e.g. people approving the work of AI - but didn’t have a coherent story in mind.
This was my first video clip (which I ended up not using):
I suspect the process is a lot like making a documentary. You record as much interesting stuff as you can. Sometimes you intuit some things are worth more material and capture more.
Next I imported my video into Final Cut Pro, and I started picking out the best clips, and reordering them to make sense:
This took some time, not least because I hadn’t used FCP in over ten years.
I did almost nothing except cut up the clips and order them. For most scenes, I wanted the cuts tight around the voice; I sometimes deviated from this for comedic effect.
There was only one edit that was slightly more advanced: I broke out and combined the audio from several of the explosion clips, to make the sound continuous:
Editing made a huge difference, of course. I took 21 minutes of footage and edited it down to 2.5 minutes, so literally 88% was scrapped. That’s not unusual when working with live video, but it might sound weird if you think AI will just do it all for you.
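Those numbers line up with the clip counts from earlier in this post. Here is a quick Python sketch of the arithmetic, assuming every one of the 160 clips ran the full 8 seconds and all of them were imported. The variable and function names are just for illustration.

```python
# Figures quoted in the post.
CLIPS_GENERATED = 160   # clips generated in total
CLIP_LENGTH_S = 8       # each Veo clip is 8 seconds long
FINAL_LENGTH_S = 150    # finished video: 2 minutes 30 seconds


def footage_stats(clips, clip_length_s, final_length_s):
    """Return (raw footage in minutes, fraction of raw footage scrapped)."""
    raw_s = clips * clip_length_s
    scrapped = 1 - final_length_s / raw_s
    return raw_s / 60, scrapped


raw_minutes, scrapped = footage_stats(
    CLIPS_GENERATED, CLIP_LENGTH_S, FINAL_LENGTH_S
)

print(f"Raw footage: {raw_minutes:.1f} min")  # Raw footage: 21.3 min
print(f"Scrapped: {scrapped:.0%}")            # Scrapped: 88%
```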
During my editing process, some story beats became apparent, e.g.
Humans admit they’re just approving everything
One human making a bad approval causes the office to blow up
A person secretly tries to shut down the AI
I had rough ideas for these when I made the video, but it took editing and some extra video generations to get each one right.
I dropped a few ideas, like having an AI deploy neurotoxins on healthy workers:
While Veo has a built-in video editor, it’s total garbage. I would have been lost without Final Cut Pro. But iMovie, Adobe Premiere, InShot, or anything else to cover slicing + reordering would all have been equally fine.
Time and cost
4 hours 27 minutes from idea to upload on YouTube
Approx. $64 in Veo credits
Finished video: 2 minutes 30 seconds
Normally Veo costs $250 / month for 12,500 credits. I used 3,200 credits to make this video, so approx. $64 (it was actually half that, as the first month of Veo is half price).
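As a rough check on that figure, here is the credit arithmetic as a small Python sketch. The plan price and credit counts are the ones quoted above. The per-clip and per-second numbers are derived averages, assuming all 160 clips were paid for out of those 3,200 credits, and are not prices Veo publishes. The names are illustrative.

```python
# Figures quoted in the post.
PLAN_PRICE_USD = 250.0   # full monthly price
PLAN_CREDITS = 12_500    # credits included per month
CREDITS_USED = 3_200     # credits spent on this video
CLIPS_GENERATED = 160    # clips generated in total
FINAL_LENGTH_S = 150     # finished video: 2 minutes 30 seconds


def credit_cost(credits, plan_price=PLAN_PRICE_USD, plan_credits=PLAN_CREDITS):
    """Dollar value of a number of credits at the full plan price."""
    return credits * plan_price / plan_credits


total = credit_cost(CREDITS_USED)
per_clip = total / CLIPS_GENERATED           # average, not an official price
per_final_second = total / FINAL_LENGTH_S    # cost per second of finished video

print(f"Total: ${total:.2f}")                               # Total: $64.00
print(f"Average per clip: ${per_clip:.2f}")                 # Average per clip: $0.40
print(f"Per finished second: ${per_final_second:.2f}")      # Per finished second: $0.43
print(f"With first-month discount: ${total / 2:.2f}")       # With first-month discount: $32.00
```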
It took me 4.5 hours to create it all. Now I was rusty here, and learning both Veo and Final Cut Pro. So maybe that could have been ~3 hours with some practice.
I could easily have made the video longer, but I was trying instead to make it good.
But AI is evil etc
Before AI, I would have had to write a script, hire actors, film, and edit to make a short like this.
I don’t know what that would have cost, but I know I wouldn’t have done it.
Some will say AI video is taking real people’s jobs, but at least in this case it did the opposite. It employed me to create something new, and fun, and it helped me produce content I was otherwise never going to create.
Hopefully, it brought some joy to people who watched it.
It’s clear to me that creating some 8 second clips is a long way away from creating an entertaining 2.5 minute video. Taste, creativity, and discretion are required to achieve that. If anything, AI just gives us new means by which to develop and apply those skills.
A whole bunch of footnotes
I used “Veo 3 Fast”, which costs 20% of “Veo 3 Quality”, and which was good enough for everything I needed.
You can’t ask AI to control the voices it makes. It just makes voices up for you, and no amount of prompting makes a difference.
Veo can’t do portrait video, just landscape, and only in the exact size it chooses.
Veo does all video in 720p resolution, which is not great by modern standards. I upscaled to 1080 in FCP.
The Veo app is pretty buggy, not designed for large scale production, and it can just forget video clips that used to work earlier. I used it for creating video, and did everything else somewhere else.