On this page
- TL;DR
- The new baseline
- Stage 1: Script (30-45 minutes)
- What the script contains
- Use AI to draft, then rewrite by hand
- What a good 60-second script looks like
- Stage 2: Storyboard (45-60 minutes)
- Build a shot list
- Generate keyframes with image models
- Lock the visual style now
- Stage 3: Generate clips (2-4 hours)
- Match each shot to its best model
- Use image-to-video, not text-to-video
- Generate 3 takes per shot
- Keep clips short
- Tag and organize as you go
- Stage 4: Edit (1-2 hours)
- Build the rough cut first
- Trim ruthlessly
- Cut on motion
- Use J-cuts and L-cuts
- Add subtle camera moves to over-static shots
- Color grade everything to one LUT
- Stage 5: Sound (1-2 hours)
- Voice (if needed)
- Music
- SFX
- Mix in the NLE
- Stage 6: Polish (30-45 minutes)
- Add film grain
- Add titles and lower-thirds in the NLE
- Add subtle camera shake to overly clinical shots
- Watch the whole thing twice
- Stage 7: Publish (30 minutes)
- Export presets
- Captions
- Thumbnails
- A real one-day timeline
- What this workflow doesn't replace
- Common workflow mistakes
- Mistake 1: Skipping the script
- Mistake 2: Generating before storyboarding
- Mistake 3: Using one model for everything
- Mistake 4: Editing in the AI tool instead of an NLE
- Mistake 5: Skipping audio
- The mental model
- The summary
TL;DR
- A finished AI-generated video in 2026 takes one creator, one day, and around $30-100 in tool credits. That's the new baseline.
- The workflow has seven stages: script, storyboard, generate, edit, sound, polish, publish. Skipping any one of them is the most common failure mode.
- Use the right model per shot — Sora, Runway Gen-4, Veo 2, Kling, Luma Dream Machine, Pika 2 — instead of trying to make one model do everything.
- Audio is a separate stack: ElevenLabs for voice, Suno or Udio for music. Mix in your NLE.
- The real skill in 2026 isn't generation; it's directing the pipeline — knowing what to ask each tool for and how to assemble the pieces.
The new baseline
A year ago, "AI video creator" meant short, glitchy social clips. Today, a single person can ship a 60-90 second polished piece in a day — story, voice, music, edits, color grade.
This guide is the actual pipeline. Not "look what's possible" — the step-by-step workflow that gets a finished video out the door.
For the model-by-model deep dive on each generator, see the complete guide to AI video generation. For the still-image side, see AI image generation tutorial and the APIs comparison.
Stage 1: Script (30-45 minutes)
Every video starts with words on a page, even videos with no dialogue.
What the script contains
- One-sentence premise. "A coffee brand's origin story told through the journey of a single bean."
- Goal. What should the viewer feel or do at the end?
- Length. 30, 60, 90 seconds. Pick before writing.
- VO or no VO. This decides everything about pace and shot length.
- Beat sheet. 4-8 beats for a 60-second piece.
Use AI to draft, then rewrite by hand
Claude Sonnet 4.6 or GPT-5 will draft a tight script in two minutes. Don't ship the draft. Rewrite it in your voice. AI scripts have a rhythmic sameness that audiences are starting to recognize.
Prompt the model with: the premise, target audience, length, tone, and 2-3 reference videos you like. Generate three options. Pick the strongest, rewrite, repeat.
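If you want to script this step itself, here's a minimal sketch using the Anthropic Python SDK. The model ID, brief, and prompt wording are placeholders to swap for your own project; any current frontier model works the same way.

```python
# A minimal sketch using the Anthropic Python SDK. The model ID and brief
# are placeholders; swap in your own project details and current model.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

brief = """Premise: a coffee brand's origin story told through the journey of a single bean.
Audience: specialty-coffee drinkers, 25-45.
Length: 60 seconds, with voiceover, under 130 words of VO.
Tone: warm, unhurried, a little wry.
References: (paste 2-3 links or short descriptions of videos you like)"""

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model ID
    max_tokens=1500,
    messages=[{
        "role": "user",
        "content": f"{brief}\n\nWrite three distinct 60-second script options, "
                   "each as a beat sheet with VO lines and on-screen action.",
    }],
)

print(response.content[0].text)  # pick the strongest option, then rewrite by hand
```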
What a good 60-second script looks like
- Hook in the first 3 seconds.
- Build through 3-5 beats.
- Payoff in the last 5-10 seconds.
- Total VO under 130 words; a comfortable VO read runs 2-3 words per second (quick arithmetic below).
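The budget is just runtime times reading pace. A quick sketch, using 2.2 words per second as a midpoint of that range:

```python
def vo_word_budget(length_seconds: int, words_per_second: float = 2.2) -> int:
    """Rough VO word budget for a runtime at a comfortable read pace."""
    return int(length_seconds * words_per_second)

print(vo_word_budget(60))  # 132, hence the "under 130 words" rule of thumb
```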
Stage 2: Storyboard (45-60 minutes)
This is the highest-leverage stage. Skipping it costs you 5x more in regeneration credits later.
Build a shot list
For a 60-second piece, plan 8-12 shots. Each shot:
- One sentence describing what's in frame.
- The camera move (static, dolly, pan, crane, handheld).
- Approximate duration (most usable AI clips are 3-6 seconds).
- The transition into the next shot.
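It helps to keep the shot list structured from the start, because prompts and filenames can be generated from it later. A minimal sketch; the fields are one possible convention, not a required schema:

```python
from dataclasses import dataclass

@dataclass
class Shot:
    number: int        # 1-12 for a 60-second piece
    description: str   # one sentence: what's in frame
    camera: str        # static, dolly, pan, crane, handheld
    duration_s: float  # target seconds in the final cut, usually 3-6
    transition: str    # cut, match cut, dissolve, whip

shots = [
    Shot(1, "Dawn over a misty coffee farm, one ripe cherry in focus", "dolly", 4.0, "cut"),
    Shot(2, "Close-up: a hand picks the cherry", "handheld", 3.0, "match cut"),
    # ...6-10 more shots
]
```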
Generate keyframes with image models
For each shot, generate a single keyframe with an image model:
- Flux 1.1 Pro for cinematic, photoreal looks.
- Midjourney v7 for stylized, art-directed shots.
- Imagen 3 for product or commercial cleanliness.
These keyframes do double duty — they validate that the look works, and they become the starting frame for image-to-video generation in the next stage.
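One way to batch the keyframes is through a hosted inference API such as Replicate. This is a sketch, not the only route; the model slug and the aspect-ratio field are assumptions to verify against the model page you actually use:

```python
# Keyframe batch via Replicate's Python client. The model slug and the
# "aspect_ratio" field are assumptions; check the model page before running.
import os
import replicate

os.makedirs("keyframes", exist_ok=True)

for shot in shots:  # `shots` from the shot-list sketch above
    output = replicate.run(
        "black-forest-labs/flux-1.1-pro",  # or the stylized/product model you picked
        input={
            "prompt": shot.description,
            "aspect_ratio": "16:9",
        },
    )
    with open(f"keyframes/{shot.number:02d}.png", "wb") as f:
        # Newer replicate clients return a file-like object; older ones return a URL.
        f.write(output.read())
```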
Lock the visual style now
Pick palette, mood, lens character, and grade. Write a one-paragraph "visual bible" you'll paste into every prompt. This is what keeps 12 different shots from looking like 12 different videos.
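In code terms, the visual bible is a constant you prepend to every prompt. If you used the keyframe sketch above, swap the raw shot description for a composed prompt like this (the bible text is illustrative):

```python
VISUAL_BIBLE = (
    "Warm 35mm film look, golden-hour palette, shallow depth of field, "
    "soft halation, muted greens and ambers, gentle contrast, no on-screen text."
)

def shot_prompt(shot) -> str:
    # The same paragraph pasted into every prompt keeps 12 shots looking like one video.
    return f"{VISUAL_BIBLE} {shot.description}. Camera: {shot.camera}."
```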
Stage 3: Generate clips (2-4 hours)
Now generate the actual moving footage, one shot at a time.
Match each shot to its best model
- Establishing wide / drone-style → Luma Dream Machine.
- Character close-up with motion → Kling.
- Photoreal product / nature → Veo 2.
- Stylized hero shot / surreal → Sora.
- Controlled commercial setup with reference → Runway Gen-4.
- Quick stylized effect / transition → Pika 2.
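If the shot list lives in code, the routing can sit next to it as plain data. A sketch; the keys are informal shot types and the values are labels for the mapping above, not API identifiers:

```python
MODEL_FOR_SHOT_TYPE = {
    "establishing_wide":     "Luma Dream Machine",
    "character_closeup":     "Kling",
    "photoreal_product":     "Veo 2",
    "stylized_hero":         "Sora",
    "controlled_commercial": "Runway Gen-4",
    "effect_transition":     "Pika 2",
}
```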
Use image-to-video, not text-to-video
Whenever the model supports it, start from your keyframe. Image-to-video gives dramatically more consistent results than text-to-video, because the look is locked from frame one.
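A sketch of an image-to-video call, again via Replicate as one possible route. The input field names (start image, duration) differ between models, so treat them as assumptions and check the docs for whichever model you picked for the shot:

```python
# Image-to-video sketch via Replicate. The input field names vary by model
# (start_image vs. image vs. first_frame), so treat them as assumptions.
import replicate

def generate_clip(shot, keyframe_path: str, model_slug: str):
    with open(keyframe_path, "rb") as img:
        return replicate.run(
            model_slug,  # the hosted Kling, Luma, etc. model you chose for this shot
            input={
                "start_image": img,           # assumed field name; check the model page
                "prompt": shot_prompt(shot),  # visual bible + shot description
                "duration": 10,               # generate long, keep the best 3-6 seconds
            },
        )
```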
Generate 3 takes per shot
Always 3, sometimes 4. Pick the best. The best take usually has the right ending, not the right beginning — because the beginning is locked by your keyframe and the model invents the back half.
Keep clips short
Generate 8-12 seconds. Use 3-6 seconds. Most AI video drifts in the last third — cut around it.
Tag and organize as you go
Filename convention: 01_wide_establishing_take2.mp4. Future-you in the editor will thank present-you.
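Tying the last few habits together: route each shot, generate three takes, and name the files consistently. `shots`, `shot_prompt`, and `generate_clip` come from the sketches above; `slug_for` is a hypothetical helper that maps a shot type to the hosted model you chose:

```python
import os

os.makedirs("clips", exist_ok=True)

for shot in shots:
    slug = slug_for(shot)  # hypothetical helper: shot type -> hosted model slug
    for take in range(1, 4):  # three takes per shot
        clip = generate_clip(shot, f"keyframes/{shot.number:02d}.png", slug)
        label = shot.description.lower().split(",")[0].replace(" ", "_")[:24]
        path = f"clips/{shot.number:02d}_{label}_take{take}.mp4"
        with open(path, "wb") as f:
            # Or download the returned URL, depending on your client version.
            f.write(clip.read())
```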
Stage 4: Edit (1-2 hours)
Drop everything into a real NLE. DaVinci Resolve (free), Premiere Pro, Final Cut, or CapCut.
Build the rough cut first
Drop clips in order. No transitions yet. No audio yet. Just see the shape of the piece.
Trim ruthlessly
Cut every clip to its best 2-5 seconds. The temptation to keep "almost good" frames is the difference between amateur and pro pacing.
Cut on motion
When two shots both have motion, cut at the moment of strongest motion. The viewer's eye doesn't notice the cut. Static-to-static cuts feel choppy.
Use J-cuts and L-cuts
In a J-cut, the audio from the next shot starts before its picture; in an L-cut, the audio from the previous shot carries over the new picture. This is how professional edits hide cuts. Worth learning if you don't already use them.
Add subtle camera moves to over-static shots
Most NLEs have a "Ken Burns" or "transform" effect. A 2-3% slow zoom or pan over a static AI shot makes it feel like real footage.
Color grade everything to one LUT
This is the single biggest "looks pro" upgrade. Apply the same grade to every clip so the piece feels coherent. Free LUT packs are everywhere; pick one cinematic grade and stick with it.
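If you would rather bake the grade before the clips hit the timeline, ffmpeg's lut3d filter can apply one .cube file to every clip. A minimal sketch, assuming ffmpeg is installed and `grade.cube` is your chosen LUT:

```python
import glob
import subprocess

for clip in glob.glob("clips/*_take*.mp4"):
    subprocess.run([
        "ffmpeg", "-y", "-i", clip,
        "-vf", "lut3d=grade.cube",  # the same LUT for every clip
        "-c:v", "libx264", "-crf", "16", "-c:a", "copy",
        clip.replace(".mp4", "_graded.mp4"),
    ], check=True)
```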
Stage 5: Sound (1-2 hours)
Audio is half of video. People will forgive bad video with great audio. They will not forgive great video with bad audio.
Voice (if needed)
- ElevenLabs for VO. The 2026 voices are convincing. Use the "stability" and "similarity boost" controls to dial in a less synthetic read.
- Record VO in chunks matched to your beats. Easier to swap one beat than re-render the whole thing.
- For lip sync (talking head): HeyGen, Synclabs, or Runway Lip Sync. Acceptable for direct-to-camera; still uncanny in profile.
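A sketch of chunked VO generation against the ElevenLabs HTTP text-to-speech endpoint. The endpoint shape and field names follow their v1 API as publicly documented, but verify against current docs; the voice ID, model ID, and setting values are placeholders to dial in by ear:

```python
import os
import requests

API_KEY = "..."   # your ElevenLabs key
VOICE_ID = "..."  # the voice you picked in the ElevenLabs dashboard

beats = {
    "01_hook":   "Every cup starts the same way: one bean, one decision.",
    "02_origin": "High on a hillside, that decision is made by hand.",
    # ...one entry per beat, so a single beat can be regenerated later
}

os.makedirs("vo", exist_ok=True)
for name, text in beats.items():
    r = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": API_KEY},
        json={
            "text": text,
            "model_id": "eleven_multilingual_v2",  # placeholder model ID
            "voice_settings": {"stability": 0.45, "similarity_boost": 0.75},
        },
    )
    r.raise_for_status()
    with open(f"vo/{name}.mp3", "wb") as f:
        f.write(r.content)
```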
Music
- Suno or Udio for original tracks. Generate 5-10 candidates with a prompt that names genre, instrumentation, mood, and tempo.
- Ask for instrumental versions. Lyrics under a narration track get muddy.
- For licensed: Epidemic Sound, Artlist, Musicbed.
SFX
- ElevenLabs SFX for specific sounds.
- Freesound for free library SFX.
- Soundly / Pro Sound Effects for professional libraries.
- A few well-placed SFX (footstep, door close, button click, room tone) make AI footage feel grounded.
Mix in the NLE
- VO: peaking around -12 to -6 dBFS
- Music: -18 to -12 dBFS under VO, raised to around -6 dBFS during instrumental-only sections
- SFX: balanced to feel natural in the scene
- Add a limiter on the master bus to prevent clipping
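The mix itself happens in the NLE, but an objective check on the exported master catches a too-hot or too-quiet mix before upload. A sketch using ffmpeg's loudnorm filter in analysis-only mode; roughly -14 LUFS integrated with true peak under -1 dBTP is a common platform target, not a hard rule:

```python
import subprocess

# Prints integrated loudness (I), true peak (TP), and loudness range (LRA)
# without writing an output file.
subprocess.run([
    "ffmpeg", "-i", "final_mix.mp4",
    "-af", "loudnorm=print_format=summary",
    "-f", "null", "-",
])
```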
Stage 6: Polish (30-45 minutes)
The details that separate "AI video" from "video that happens to use AI."
Add film grain
Subtle grain over the whole timeline ties disparate AI sources together. Most NLEs have grain plugins; FilmConvert is the gold standard.
Add titles and lower-thirds in the NLE
Don't ask the AI to render text. Add it in post. Cleaner, more controllable, no failed text-rendering attempts.
Add subtle camera shake to overly clinical shots
Some AI footage looks too smooth. A 1-2% handheld shake effect makes it feel observed.
Watch the whole thing twice
Once with audio, once muted. Watching muted catches edit pacing problems your ears smooth over.
Stage 7: Publish (30 minutes)
Different platforms want different things.
Export presets
- YouTube / web hero: H.264, 1080p or 4K, 16:9, 12-20 Mbps.
- Instagram / TikTok: H.264, 9:16, 1080x1920, 8-12 Mbps.
- LinkedIn: H.264, 1080p, 16:9 or 1:1, 8-12 Mbps.
- Archive / further edit: ProRes 4444 or DNxHR, original aspect.
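If you export one high-quality master from the NLE, the deliverables above map directly onto ffmpeg flags. A sketch; filenames, resolutions, and bitrates are illustrative and sit inside the ranges listed above:

```python
import subprocess

MASTER = "master_prores.mov"  # the high-quality export from your NLE

presets = {
    "youtube_16x9.mp4":  ["-vf", "scale=1920:1080", "-b:v", "16M"],
    "vertical_9x16.mp4": ["-vf", "scale=1080:1920:force_original_aspect_ratio=increase,"
                                 "crop=1080:1920", "-b:v", "10M"],
    "linkedin_16x9.mp4": ["-vf", "scale=1920:1080", "-b:v", "10M"],
}

for out, args in presets.items():
    subprocess.run([
        "ffmpeg", "-y", "-i", MASTER,
        "-c:v", "libx264", "-pix_fmt", "yuv420p", *args,
        "-c:a", "aac", "-b:a", "192k", "-movflags", "+faststart",
        out,
    ], check=True)
```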
Captions
Generate captions with Whisper, ElevenLabs, or your NLE's built-in tool. Edit by hand: auto-captions are roughly 90% accurate, and the remaining 10% reads as careless.
Burn captions in for short-form social; use an SRT file for YouTube and LinkedIn (better accessibility, better SEO, and the platforms read captions for content matching).
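A sketch of the Whisper route using the open-source openai-whisper package, writing a bare-bones SRT. The output still needs the hand edit described above:

```python
import whisper

def srt_time(t: float) -> str:
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{int((t % 1) * 1000):03d}"

model = whisper.load_model("small")
result = model.transcribe("final_mix.mp4")

with open("captions.srt", "w") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
        f.write(seg["text"].strip() + "\n\n")
```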
Thumbnails
For YouTube, generate 2-3 thumbnail candidates with an image model. Test them informally with friends; the difference between a 4% and a 9% click-through rate often comes down to the thumbnail.
A real one-day timeline
Here's how a 60-second piece breaks down for an experienced creator:
- 09:00 - 09:45: Script and rewrite
- 09:45 - 10:45: Storyboard and keyframes
- 10:45 - 13:30: Generate clips (with breaks while clips render)
- 13:30 - 14:30: Lunch / let render queue finish
- 14:30 - 16:00: Rough cut and trim
- 16:00 - 17:30: Sound design and mix
- 17:30 - 18:00: Polish, grain, titles
- 18:00 - 18:30: Export, captions, publish
Eight to nine working hours. One person. A finished piece.
What this workflow doesn't replace
- Actual cinematography for the highest-end work. AI video at 2026 quality is somewhere around "good stock footage" or "competent indie shoot." For a flagship film, you still want a real crew.
- Real human performances. Lip sync, real emotion, real eye contact — AI is close but not there.
- Documentary subjects. You can't AI-generate a real person's actual story.
- Brand-critical accuracy. Composite real product / real packaging / real logos in post; don't trust the model.
The workflow above is the right answer for ads, explainers, social shorts, music videos, conceptual pieces, and any project where speed and flexibility matter more than absolute realism.
Common workflow mistakes
Mistake 1: Skipping the script
Symptom: you generate clips that don't add up to a story.
Fix: 30 minutes of writing saves 4 hours of regeneration.
Mistake 2: Generating before storyboarding
Symptom: 30 disconnected clips, no flow.
Fix: keyframe every shot first. Verify the look before committing video credits.
Mistake 3: Using one model for everything
Symptom: fighting Sora to do something Luma would do in one click.
Fix: shot-by-shot model selection. The right tool per shot.
Mistake 4: Editing in the AI tool instead of an NLE
Symptom: limited editing controls, no real audio mixing, no color grade.
Fix: AI tools generate; NLEs finish.
Mistake 5: Skipping audio
Symptom: technically impressive video that feels lifeless.
Fix: dedicate as much time to sound as you do to picture.
The mental model
You're not "an AI video creator." You're a director-editor running an automated production crew. Your job is to make the right calls — script, shot, model, take, cut, sound. The crew (the models) does the labor.
The creators who win in 2026 are the ones who internalize this. The tools change every quarter; the craft of directing a pipeline is durable.
The summary
- Seven stages: script, storyboard, generate, edit, sound, polish, publish. Don't skip any.
- Right model per shot. Image-to-video over text-to-video.
- Generate 3 takes per shot, use 3-6 seconds of each.
- Edit in a real NLE. Cut on motion. Color grade everything.
- Audio is half the job. ElevenLabs, Suno, SFX library, mix in the NLE.
- One person, one day, $30-100 in credits = one finished video. That's the new baseline.
Direct the pipeline. Ship the work. Repeat tomorrow.
Plan, prompt, and orchestrate your video pipeline from one BYOK workspace — NovaKit supports every major image and video model, with chains for repeatable workflows and per-asset cost tracking.