How to Prompt PixAI v4.0 Preview for Anime Video — From Image to 15 Seconds
How to prompt PixAI v4.0 preview for anime image-to-video: reference setup, iteration loop, voice line format, and the Mio.2 pre-production workflow we use to ship 15-second anime shorts.
Most prompt guides for video models read like the writer never ran a generation. They hand you the syntax — subject, action, camera, audio — and call it done. Then you try to animate the character art you’ve been saving for weeks. The video comes back static. The eyepatch slides off mid-shot. The camera refuses to move no matter how many times you type “cinematic.”
This guide is the field-tested version. Three things matter, in this order: how you stack your references, what you put in the text prompt, and how you iterate when the first try fails. The text prompt is the easy part. It’s also the part most guides obsess over.
New to v4.0 Preview and want the capability overview first — what it can actually do, how it compares to PixAI’s older models, what audio and reference video unlock? Start with our companion piece: Meet PixAI v4.0 Preview →
If you want the broader model lineup and i2v panel walkthrough — every video model PixAI offers, when to use which, panel settings explained — read our PixAI Image-to-Video Tutorial: Model Guide + Prompt Writing → Otherwise, read on.
How to prompt PixAI v4.0 for anime video, the short version
Working with references: images, video, and audio
v4.0 Preview takes three kinds of reference: images, video, and audio. You call each by tag in your prompt, for example @image1, @video1, or @audio1.
Most prompts don’t fail on bad text. They fail on bad reference setup. Spend your effort here.
A short walkthrough of the reference upload flow.
Reference images are not first frames
A semantic anchor, not a starting position.
Older video models — including PixAI’s own v3.x line — treat the reference image as the first frame. Whatever pose your character is in, whatever’s behind her, all of it gets locked into the opening shot. The model then has to animate forward from there. Reference shows her sitting, prompt says “she leaps into the air” — good luck.
v4.0 Preview doesn’t work that way. It reads the image as a semantic reference for who the character is, not as the starting frame. The model separates identity (hair, eyes, outfit, eyepatch placement, the cat ears, the heart-shaped ahoge) from everything else. Then your prompt drives the action.
What this means in practice:
- One clean character portrait is enough. You don’t need multiple angles of the same character. We ran side-by-sides with 1-image and 6-image setups of the same OC, and the 1-image version was consistently sharper. The model wasn’t trying to reconcile six slightly different jaw shapes.
Sharper character consistency.
More references, more averaging — output softens.
- The pose in your reference doesn’t constrain the action. If your reference shows her sitting on a swing, you can prompt “she runs across a stage” and v4.0 Preview will execute the run without dragging the swing along. We tested this exact case on v3.0 High Consistency. It kept her seated.
v4.0 Preview. Reads the reference as identity, not pose — the character leaps as prompted.
v3.0 HC. Treats the reference as a starting frame — character stays seated.
- For scene-specific work, give it two images and tell it how to combine them. Character on @image1, empty scene on @image2, then: “She stands in the scene from @image2, wearing the outfit from @image1.” v4.0 Preview composites cleanly without confusing whose face goes where.
Same character, multiple scenes, outfit swap
▸ VIEW PROMPT
@image1 character is the subject throughout this video, maintaining her multicolored white-pink gradient hair, heart ahoge, purple eyes, white eyepatch, fang, cat ears, and cat tail at all times. [0-4s] She sits in front of the vintage vanity mirror from @image2 in the warmly lit dressing room, her back partially to the camera so her reflection is visible in the mirror. She looks contemplatively at her reflection, slowly reaching up to touch a strand of hair. Soft pink light bathes the intimate space, the curtains gently moving in the background. [4-6s] The reflection in the mirror begins to shimmer with dreamlike light, pastel ribbons and cherry petals starting to drift through the air, the dressing room slowly dissolving into a glowing pastel haze. [6-10s] She now stands at center stage from @image3, under the bright spotlight beam, holding a vintage keytar with both hands, looking forward with quiet confidence. Pink and red ribbons float around her, cherry petals drift through the spotlight beams, soft mist rises gently from the wooden stage floor. She lifts her right hand slightly as if ready to play. Continuous seamless dreamlike transition through the mirror as a portal between her private self and her stage self.
You can attach up to 6 images. You usually shouldn’t. Two or three well-chosen references beat five or six competing for the model’s attention — every extra image is another vote on what the character should look like, and the model averages them.
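If you assemble prompts in a script, a small check can catch tag mistakes before you spend a generation. The sketch below is our own helper, not part of PixAI: `check_image_tags` is a hypothetical name, and the only facts it encodes are the 6-image limit and the @imageN tag pattern used in this guide. Treating an attached-but-unused image as a problem is an assumption you may want to relax.

```python
import re

MAX_IMAGES = 6  # limit stated in this guide


def check_image_tags(prompt: str, attached_images: int) -> list[str]:
    """Return a list of problems with @imageN tags in a prompt."""
    problems = []
    if attached_images > MAX_IMAGES:
        problems.append(
            f"{attached_images} images attached; the limit is {MAX_IMAGES}"
        )

    # Collect every image number the prompt mentions (case-insensitive).
    used = {
        int(n)
        for n in re.findall(r"@image\s?(\d+)", prompt, flags=re.IGNORECASE)
    }

    for n in sorted(used):
        if n < 1 or n > attached_images:
            problems.append(f"@image{n} is referenced but not attached")
    for n in range(1, attached_images + 1):
        if n not in used:
            problems.append(f"@image{n} is attached but never referenced")
    return problems
```

An empty list means every attached image is named in the prompt and every tag points at a real upload.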
Reference video transfers feel, not content
Borrow the camera, not the character.
You’re not telling v4.0 Preview to copy a video. You’re telling it to borrow the camera motion, pacing, and overall framing from that video and apply them to your scene.
Say you’ve saved a gacha-card animation you love — slow push-in, soft parallax, a dreamy reveal at the end. Upload it as @video1 and write:
Animate this character following the slow camera motion and pacing of @video1.
v4.0 Preview studies how the camera moved, how each beat held, how the particles flowed. It then applies that pacing to your character. Your character doesn’t get rendered in the reference video’s art style. The reference video’s character doesn’t appear in yours. What transfers is the feel.
▸ VIEW PROMPT (used for both examples)
Transform this static character art from @image1 into a gacha-style animated character card, following the same camera motion and pacing style as @video1.
Reference video requirements: mp4, under 50MB, under 15 seconds each. If you use multiple video references, their combined length still has to fit under 15s.
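These limits are easy to check before uploading. A minimal sketch, assuming you already know each clip's size and duration: `VideoRef` and `check_video_refs` are hypothetical names, and reading "under" as strictly less than is our assumption about the boundary.

```python
from dataclasses import dataclass

MAX_SIZE_MB = 50        # "under 50MB"
MAX_TOTAL_SECONDS = 15  # per clip, and for all clips combined


@dataclass
class VideoRef:
    filename: str
    size_mb: float
    duration_s: float


def check_video_refs(refs: list[VideoRef]) -> list[str]:
    """Return a list of limit violations for a set of reference videos."""
    problems = []
    for r in refs:
        if not r.filename.lower().endswith(".mp4"):
            problems.append(f"{r.filename}: not an mp4")
        if r.size_mb >= MAX_SIZE_MB:
            problems.append(f"{r.filename}: not under {MAX_SIZE_MB}MB")
        if r.duration_s >= MAX_TOTAL_SECONDS:
            problems.append(f"{r.filename}: not under {MAX_TOTAL_SECONDS}s")

    # With several clips, the combined length has to fit the same ceiling.
    total = sum(r.duration_s for r in refs)
    if len(refs) > 1 and total >= MAX_TOTAL_SECONDS:
        problems.append(f"combined length {total}s is not under {MAX_TOTAL_SECONDS}s")
    return problems
```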
Reference audio — for voice character or musical mood
For the sonic qualities text can’t capture.
You can attach an audio file when you want v4.0 Preview to anchor to a specific sound quality — a voice timbre the character should speak in, or a piece of music whose mood and instrumentation you want the soundtrack to match.
In your prompt, name what you’re borrowing — “use the voice character of @audio1” or “match the mood and instrumentation of @audio1.” Same logic as video references.
For dialogue, ambient sound, and SFX, you usually don’t need an audio reference at all. Those go directly into the prompt itself (covered in the next section). Audio references are for the harder-to-describe sonic qualities.
Same character, same prompt, same motion. Only the audio reference changes.
Sorrow. Melancholy strings, slow tempo, somber atmosphere.
Dreamy. Ethereal pads, floating texture, soft reverb.
Cheerful. Bright tempo, upbeat instrumentation, warm mood.
▸ VIEW PROMPT
@Image1 — hair strands drift and flutter gently in the breeze. The flowers behind her sway softly with the wind. The camera slowly moves in to a close-up on her face as she reaches out her hand to touch a bubble floating in the air. Reference the rhythm and atmosphere of @Audio1.
Writing the text prompt for anime video
Most prompt guides give you a formula and call it done. Real prompt-writing splits into two situations, and the technique differs.
You know what you want
You have a clear shot in your head — camera pushes in toward her face, petals drift left to right, she turns her head on the second beat. Just write it down.
The trap here is vagueness disguised as cinematography. “Zoom in” tells v4.0 Preview less than “the camera slowly pushes in during the turn.” Use real camera verbs — push in, pull back, tilt up, drift, orbit, pan, dolly — instead of “the camera moves.” Use connectives like “then” and “and” to sequence actions: “She looks down, then slowly lifts her eyes to the camera and gives a small smile.” Without those connectives, the model often collapses everything into a single static moment.
This is the easy mode. If you already know your shot, skip to the writing-habits section below.
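The habits in this section are mechanical enough to lint. The sketch below is a rough heuristic of our own, not a PixAI feature: `lint_camera_language` is a hypothetical name, the verb list is the one given above, and the sentence splitting is deliberately naive.

```python
import re

# Phrases this section calls out as too vague to direct the camera.
VAGUE_PHRASES = ("zoom in", "the camera moves")

# The camera verbs listed above, with simple verb endings allowed.
CAMERA_VERBS = (
    r"\bpush(?:es|ing)? in\b",
    r"\bpull(?:s|ing)? back\b",
    r"\btilt(?:s|ing)? up\b",
    r"\bdrift(?:s|ing)?\b",
    r"\borbit(?:s|ing)?\b",
    r"\bpan(?:s|ning)?\b",
    r"\bdoll(?:y|ies|ying)\b",
)


def lint_camera_language(prompt: str) -> list[str]:
    """Return warnings about vague camera wording in a prompt."""
    text = prompt.lower()
    warnings = [f'vague phrase: "{p}"' for p in VAGUE_PHRASES if p in text]

    # A sentence counts as camera direction if it names the camera
    # and uses one of the concrete verbs.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    directed = any(
        "camera" in s and any(re.search(v, s) for v in CAMERA_VERBS)
        for s in sentences
    )
    if not directed:
        warnings.append("no sentence pairs the camera with a concrete verb")

    if not re.search(r"\b(?:then|and)\b", text):
        warnings.append("no then/and connective to sequence actions")
    return warnings
```

It will not tell you whether the shot is good. It only flags the wording patterns that tend to come back static.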
You have an image but no clear vision
This is where most generations actually start. You find art you love, you want it to move, you have no idea what should happen. Most guides fail you here. They assume you’ve already decided.
Here’s the iteration loop that actually works:
Draft on an LLM.
Drop your image into Claude or ChatGPT and ask for a short video prompt. Treat the result as raw material, not a final answer.
Test on a cheap model.
Run the draft on v4.0 Lite Preview or v2.7 High Dynamics — fast, cheap, structurally similar to Preview. See our Meet PixAI v4.0 Preview post for the full model comparison.
Read failure, not prompt.
Watch what the video did wrong. The fix lives in the gap between what you wrote and what the model actually produced.
Run the final on v4.0 Preview.
Same prompt, more capacity behind it. Smoother motion, sharper character consistency, more nuanced lighting.
Draft, cheap test, read the failure, run the final. That loop is what separates videos made by someone who knows the tool from videos that look like a first try.
Common AI video prompt mistakes to avoid
Five failure patterns. Five fixes.
The loop above only works if you can spot what went wrong. Here are the failure patterns we hit most often, paired with the fixes that worked.
Letting the LLM hallucinate features that aren’t in your reference image
LLMs read “cat girl” and assume she has a tail. They read “sailor uniform” and add a tie. Your reference might not show either. The video then awkwardly generates the invented feature mid-sequence. We hit this four times across our test runs. (This part still annoys us.)
Read the draft prompt next to your image. Delete every feature it added that isn’t visible.
▸ PROMPT (hallucinated detail: “her tail flicking gently”)
A cute young anime cat girl in close-up, white short hair, cat ears, heart-shaped ahoge, large purple eyes with small pink hearts in the pupils, wearing a navy sailor uniform with a blue ribbon scarf and a small black beret on her head, soft pink blush on her cheeks. She is resting her chin on both hands cupped under her face, gazing softly at the camera with a shy smile. Small pink heart particles and sparkle stars float gently around her in the air. Soft cyan and pink pastel light surrounds her like a dreamy glow. The video starts with her looking down shyly, then she slowly lifts her eyes to meet the camera and gives a tiny knowing smile, her tail flicking gently behind her, one heart particle drifting up past her face. The camera slowly pushes in toward her face during this moment. Anime illustration style, soft painterly textures, kawaii aesthetic, dreamy pastel lighting, intimate close-up framing, gentle slow motion energy.
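The read-the-draft-next-to-your-image step can be partly automated if you write down what is actually visible. A minimal sketch, assuming you maintain two lists by hand: `visible` (features you can see in the reference) and `watchlist` (features LLMs tend to invent). `unverified_features` is a hypothetical helper name.

```python
import re


def unverified_features(prompt: str, visible: set[str], watchlist: set[str]) -> list[str]:
    """List watchlist features the prompt mentions that you did not mark as visible."""
    text = prompt.lower()
    seen = {v.lower() for v in visible}
    flagged = []
    for feature in sorted(watchlist):
        name = feature.lower()
        # Whole-word match so "tie" does not fire on "tiny".
        mentioned = re.search(rf"\b{re.escape(name)}\b", text) is not None
        if mentioned and name not in seen:
            flagged.append(feature)
    return flagged


draft = "She slowly lifts her eyes to meet the camera, her tail flicking gently behind her."
visible = {"cat ears", "beret", "sailor uniform"}
watchlist = {"tail", "tie", "cat ears", "beret"}
print(unverified_features(draft, visible, watchlist))  # ['tail']
```

The check only knows what you list, so it complements the side-by-side read rather than replacing it.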
Letting the LLM describe instead of direct
The first-draft prompt is often a beautiful description of what’s already in the static art — every ribbon, every accessory, every fold of fabric — with no actual motion. The video comes out static.
Cut the descriptive sentences entirely. Keep only verbs that describe what changes.
▸ PROMPT
Two cat girls appear together in this video, facing each other as opposites — the angel and the devil. Character from @image1 and character from @image2 face each other in an empty void of soft pastel light, no environment, no background — only the two characters and a dreamy gradient atmosphere of soft pink and lavender mist around them. They stand close together, the angel on one side and the devil on the other, eyes meeting in mutual recognition. The angel slowly raises her hands to form a heart shape near her chest with a gentle smile. The devil mirrors the gesture but with a playful smirk, sticking out her tongue. They lean slightly toward each other, sharing a quiet moment of acknowledgment as if they understand they are two halves of the same person. Soft sparkle particles drift between them in the void. The camera holds steady at medium shot, framing both characters in the same frame throughout.
Stacking too many elements in one frame
Ask for a Valentine’s scene and the LLM gives you “she stands in the rose garden, holding roses, while petals fall and balloons float and chocolates drift past” — fifteen elements stacked in one shot. The result is visual noise.
Limit each beat to one subject + one motion + one ambient element.
Letting the model invent physics to justify your prompt
We once wrote “goldfish drift around her” and v4.0 generated a water pool out of nowhere to make the goldfish make sense. (Yes — we laughed.) The model defaults to physical realism unless you tell it not to.
Name the surreal element directly — “goldfish drift through the air, this is a dream, no physical realism needed.”
▸ ADJUSTED PROMPT (key fix: explicit dream framing + timed beats)
Transform this static character art from @image1 into a gacha-style animated character card, following the same camera motion and pacing style as @video1. CAMERA MOTION (mirroring the path of @video1): [0-1.2s] Open on a close-up of her left hand holding the small red-stringed charm pouch (omamori) at her waist. Soft warm light on the red fabric, goldfish drifting slowly past in the soft background blur. [1.2-2.5s] The camera slowly drifts upward along her body — past her green striped yukata, settling onto her face. More goldfish become visible, gently swimming through the air around her, pink petals drifting across the frame. [2.5-3.5s] At her face, the camera gently lingers — her purple eye meeting the viewer softly, the goldfish continuing their slow parallax behind her. The camera makes a subtle floating motion as if caught in the same dream. [3.5-4.2s] The camera slowly pulls back to reveal the full composition — her standing beneath the temple roof eaves, goldfish suspended in the air around her, petals drifting. [4.2-5s] CARD REVEAL CLIMAX — multiple goldfish swim into view from the edges joining the dream, soft bloom intensifies around the frame.
Calling for a transition without giving the model a bridge
“Dissolve into haze and she appears on stage” gets interpreted as a hard cut every time.
Give the transition a physical object the model can animate — a mirror that becomes a portal, a curtain that opens, a falling petal that crosses the frame.
Writing habits that produce clean output
After enough generations, patterns emerge.
Write like a director, not a novelist
Short declarative sentences. One fact per sentence. “Hair flutters slightly. Fixed shot. Eyes blink slowly.” v4.0 Preview reads this as instructions, not prose.
Know which timestamps you’re writing
[Scene 1: 0-3s] framing tells v4.0 Preview to cut between shots. [0-1.2s] beat structure tells it to flow within one continuous take. They look almost identical on the page, and mixing them up is the #1 reason multi-shot prompts produce slideshows.
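For continuous-take prompts, it helps to confirm the beats line up end to end before generating. The sketch below parses only the `[0-1.2s]` beat form and deliberately ignores `[Scene 1: 0-3s]` markers. The function names are hypothetical, and the 15-second default ceiling is the per-generation maximum mentioned elsewhere in this guide.

```python
import re

# Matches within-take beats like [0-1.2s]; scene markers do not match.
BEAT = re.compile(r"\[(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)s\]")


def parse_beats(prompt: str) -> list[tuple[float, float]]:
    """Extract (start, end) pairs for every [a-bs] beat in the prompt."""
    return [(float(a), float(b)) for a, b in BEAT.findall(prompt)]


def check_beats(beats: list[tuple[float, float]], max_seconds: float = 15.0) -> list[str]:
    """Return problems: empty beats, gaps or overlaps, or running too long."""
    problems = []
    for i, (start, end) in enumerate(beats):
        if end <= start:
            problems.append(f"beat {i + 1} has no positive duration")
        if i > 0 and start != beats[i - 1][1]:
            problems.append(f"beat {i + 1} does not start where beat {i} ends")
    if beats and beats[-1][1] > max_seconds:
        problems.append("beats run past the maximum length")
    return problems
```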
Voice lines have a format
Three parts: voice description — emotional tone — line in native language.
Voice line — shy prince vocal type, warm and slightly trembling: 「あの……お昼、一緒に行かない?」
Never romanize. v4.0 Preview handles native characters far better than transliteration.
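If you generate voice lines from a script file, a tiny formatter keeps the three parts in order. This is a sketch under our own assumptions: `format_voice_line` is a hypothetical helper, the output layout copies the example above, and the ASCII test is only a crude stand-in for "is this romanized", suitable for Japanese or Korean lines and switchable off for English dialogue.

```python
def format_voice_line(
    voice: str,
    tone: str,
    line: str,
    open_quote: str = "「",
    close_quote: str = "」",
    require_native: bool = True,
) -> str:
    """Build 'Voice line — <voice>, <tone>: 「<line>」' from its three parts."""
    # A pure-ASCII line in a Japanese or Korean slot is almost certainly romanized.
    if require_native and line.isascii():
        raise ValueError("line looks romanized; write it in native characters")
    return f"Voice line — {voice}, {tone}: {open_quote}{line}{close_quote}"
```

For Korean, pass `open_quote="“"` and `close_quote="”"`.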
SFX cues read like film sound design
Em dashes between sound elements:
SFX: heavy rain — thunder rumbling — silence — a single heartbeat
A paragraph describing the soundscape works far worse.
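The same cue format is trivial to produce from a list of sounds, which is handy when an LLM hands you a paragraph and you want to reduce it to elements. `format_sfx` is a hypothetical helper name. The separator is the em dash this section prescribes.

```python
def format_sfx(*sounds: str) -> str:
    """Join sound elements into a single 'SFX: a — b — c' cue."""
    cleaned = [s.strip() for s in sounds if s.strip()]
    if not cleaned:
        raise ValueError("at least one sound element is required")
    return "SFX: " + " — ".join(cleaned)


print(format_sfx("heavy rain", "thunder rumbling", "silence", "a single heartbeat"))
# SFX: heavy rain — thunder rumbling — silence — a single heartbeat
```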
Comic-strip mode is a real output mode
Ask for it explicitly:
From left to right, top to bottom, present this as a comic strip. Add special sound effects for scene transitions.
This switches v4.0 Preview into multi-panel manga output — animated panels with transitions, used for animated manga pages and doujin promo clips.
Putting it all together
A complete v4.0 Preview generation might look like this:
- @image1: character reference (clean portrait, any pose)
- @image2: empty stage scene
- @video1: animation reference for camera style
“@image1 character stands at center stage on @image2, looking forward with quiet confidence. The camera slowly pushes in toward her face, then a few cherry petals drift up around her. Follow the slow dreamy camera pacing of @video1.”
Three references, three sentences of prompt, one cinematic video.
The references do most of the heavy lifting. The prompt directs the motion. The iteration loop catches the failures.
That’s the v4.0 Preview workflow when it clicks.
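When you iterate, it helps to keep the references and the prompt together in one record so a rerun uses exactly the same setup. A minimal sketch of such a record: `GenerationSheet` is a hypothetical name, and numbering tags from 1 in upload order is our assumption about how tags map to uploads.

```python
from dataclasses import dataclass, field


@dataclass
class GenerationSheet:
    """References (in upload order) plus the prompt that uses them."""
    images: list[str] = field(default_factory=list)
    videos: list[str] = field(default_factory=list)
    prompt: str = ""

    def tags(self) -> dict[str, str]:
        """Map each tag to its description, numbering from 1 in upload order."""
        tagged = {f"@image{i}": d for i, d in enumerate(self.images, start=1)}
        tagged.update({f"@video{i}": d for i, d in enumerate(self.videos, start=1)})
        return tagged

    def render(self) -> str:
        """One line per reference, a blank line, then the prompt."""
        lines = [f"{tag}: {desc}" for tag, desc in self.tags().items()]
        return "\n".join(lines + ["", self.prompt])
```

Saving the rendered sheet next to each output video makes it much easier to read a failure against what you actually asked for.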
Production workflows: Mio.2 pre-production + Edit Pro manga
When to use Mio.2 vs write the prompt yourself
Use Mio.2 when you have a character and a vague idea. The Mio.2 AI agent handles the 6-stage pre-production work (pitches, script, shot list, reference image generation) and outputs a finished v4.0 Preview prompt.
Write the prompt yourself when you already have a clear shot in your head (Situation 1 above). Skip pre-production and go straight to a clean v4.0 Preview prompt with 1–2 reference images.
Most production work falls into the first bucket. That’s why Workflow 1 exists — Mio.2 is the difference between “I have a character I love” and “I have a 15-second anime video to publish.”
Workflow 1: AI video pre-production with Mio.2 (6 stages)
From character + vague idea to a 15-second short.
Use this when you have a character, an idea, and 15 seconds to fill. Six stages, all in one Mio.2 conversation so the agent keeps context across them.
The running example below is real — a yuri-tragedy short called “The Heart She Can’t Spend.” Character: a silver-haired catgirl gambler with an eyepatch.
“The Heart She Can’t Spend” — the 15-second short the 6-stage workflow produces.
Stage 1: Story pitches
Upload your character image and run:
Based on the character in this image, suggest 5 short-form anime story pitches in the [genre/vibe] direction. For each pitch, include a one-line logline, the opening shot or first line, and what emotional reaction it's going for.
Why this works: Replace [genre/vibe] with what you want — “jirai-yuri tragedy”, “isekai villainess comedy”, “casino gambling thriller”, “slice-of-life with one twist”. Leaving it blank produces generic pitches. Naming a direction produces ones you can actually use.
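The stage prompts in this workflow all use square-bracket placeholders, so a small filler that refuses blanks keeps you from sending a template with `[genre/vibe]` still in it. A sketch under our own assumptions: `fill_placeholders` is a hypothetical helper, and it treats every `[...]` span in the template as a placeholder.

```python
import re

PITCH_TEMPLATE = (
    "Based on the character in this image, suggest 5 short-form anime story "
    "pitches in the [genre/vibe] direction. For each pitch, include a one-line "
    "logline, the opening shot or first line, and what emotional reaction "
    "it's going for."
)


def fill_placeholders(template: str, values: dict[str, str]) -> str:
    """Replace each [name] with values[name]; raise if any is missing or blank."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        value = values.get(key, "").strip()
        if not value:
            raise ValueError(f"placeholder [{key}] was left blank")
        return value

    return re.sub(r"\[([^\[\]]+)\]", substitute, template)
```

Calling `fill_placeholders(PITCH_TEMPLATE, {"genre/vibe": "casino gambling thriller"})` returns the prompt ready to paste.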
Stage 2: Tighten the pitch
Pick one, then push back specifically:
Develop pitch #[N]. Change: [what to adjust]. Keep: [what to preserve]. Rewrite the pitch with these changes, keeping the same emotional core.
Why this works: The change/keep structure stops the AI from rewriting everything. Without it, the AI sometimes silently overwrites the previous pitch in your conversation memory.
Stage 3: Outline, then cut
First get the full version:
Outline this pitch as a [15-second] anime short with a hook (first 2s), build (story beats with rough timing), and final image. Include the actual dialogue lines. The premise should be conveyed indirectly through visuals and subtext — never stated out loud.
Why this works: Replace [15-second] with your target length. The “never stated out loud” line is the most important constraint. AI defaults to over-explaining when compressing, and this reinforces that dialogue carries subtext, not exposition.
Stage 4: Build the shot list
Break this script into a shot list with a maximum of [number] shots. For each shot, specify timing, frame composition, what's in the frame, camera movement, key visual detail, and dialogue or on-screen text. If you produce more than [number] shots, compress to exactly [number] — each shot must do irreplaceable work.
Why this works: Replace [number] with your reference image budget. v4.0 Preview accepts up to 6 reference images, but 3–4 produces stronger results than 6 competing for attention. Giving the AI room to over-produce and then cutting yields better selection than asking for 4 upfront.
Stage 5: Generate reference images
Send this to Mio.2 (or your image agent):
Based on the shot list below, generate [N] storyboard images for video generation. Use the attached character reference image(s) to keep the character consistent across all shots. Do not include any text in the images. Match the art style of the reference. [Paste your Stage 4 shot list here.]
Why this works: Attach your character reference image(s) — Mio.2 uses them as identity anchors. Match-art-style and no-text-in-images are the constraints that keep storyboard images usable as v4.0 reference inputs.
The first pass will rarely be perfect. Common issues: wrong hair length, drifted eye color, an extra unwanted character in a solo shot, signature accessories on the wrong character. Each is fixable in one turn — name the problem precisely:
“Regenerate Shot 3 with only the gambler in frame, no second character. Confirm her eyes are vivid purple. Also fix Shot 2 — the sailor catgirl should not have an eyepatch.”
Stage 6: Have Mio.2 write the v4.0 prompt
This is the payoff:
Based on the script and shot list from earlier in this conversation, write a v4.0 Preview video prompt for the [N] reference images you just generated. For each timestamped beat, include: time range, which @image to reference, action and camera description, dialogue (with the exact lines from the script), and SFX cues. Use em dashes for layered SFX. Keep total prompt length under 2000 characters.
Example output format:
0s–Xs: @Image 1 [Action description with camera movement]. [Character]: "[Exact line from script.]" SFX: [sound 1] — [sound 2] — [sound 3]
Xs–Ys: @Image 2 [Action description]. [Character]: "[Exact line.]" SFX: [layered sounds with em dashes]
[Continue for each beat. End with fade-to-black instruction if applicable.]
Why this works: The “exact lines from the script” instruction matters. AI sometimes paraphrases dialogue when generating prompts, which drops the carefully-tuned subtext from Stage 3.
Mio.2 outputs the prompt. Copy it, paste into v4.0 Preview with your 4 reference images attached, generate. The final prompt for our test piece:
▸ THE FINAL V4.0 PREVIEW PROMPT
0s–2s: The white-haired catgirl gambler spins a gold coin between her fingers, smirking confidently. Gold coins float around her in dramatic lighting. SFX: metallic spin — tense atmosphere
2s–4s: Cut to sailor catgirl across the table, her hand trembling, heart pendant catching light. She whispers: "You said you'd find me again. Even if it took everything." SFX: soft tearful voice
4s–6s: Back to gambler, leaning in with delighted grin: "Cute line. Did you rehearse it?" Cards slap down. SFX: playful tone — card impact
6s–10s: Match cut — the coin falls through memory space. Two catgirls on a cliff, heart pendant being clasped, promise whispered. Everything dissolves into golden dust as coin lands. SFX: warm wind — ethereal chime — distant promise
10s–13s: Gambler blinks, something behind her eyes going quiet. She looks up, studying the crying girl with genuine curious smile: "...sorry, kitten. Do I know you?" SFX: silence — single heartbeat
13s–15s: Sailor catgirl's hand closes around her heart pendant. She smiles through tears, soft: "Not yet. Deal again?" Fade to black. Coins fall. SFX: emotional catch — coin clatter — silence
Four images, six beats, four lines of dialogue, layered SFX. 15 seconds of actual story, built across six stages rather than in one prompt.
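Two of the Stage 6 constraints are easy to verify before you paste the prompt: the length target and the exact dialogue lines. A minimal sketch, assuming you keep the script's dialogue as a list of strings. `check_final_prompt` is a hypothetical helper, and the 2000-character figure is the target from the Stage 6 prompt, not a documented platform limit.

```python
MAX_PROMPT_CHARS = 2000  # target used in the Stage 6 prompt


def check_final_prompt(prompt: str, script_lines: list[str]) -> list[str]:
    """Flag an over-long prompt and any script line that is not present verbatim."""
    problems = []
    if len(prompt) >= MAX_PROMPT_CHARS:
        problems.append(
            f"prompt is {len(prompt)} characters; target is under {MAX_PROMPT_CHARS}"
        )
    for line in script_lines:
        # Exact substring match: a paraphrased line counts as missing.
        if line not in prompt:
            problems.append(f"dialogue missing or paraphrased: {line!r}")
    return problems
```

If a line is reported as paraphrased, ask Mio.2 to restore it rather than editing by hand, so the conversation context stays in sync with the prompt.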
Workflow 2: Animated manga with Edit Pro + v4.0 Preview
For doujin promo clips and animated manga pages.
Use this when you want a multi-panel manga page that comes to life — animated reactions, comedic timing, the doujin-promo-clip feel. Edit Pro handles the manga layout, v4.0 Preview brings the panels to life.
Edit Pro handles the layout: arrange panels, frame each beat, control reading order. Export the manga page as a single image. Hand it to v4.0 Preview with this prompt:
Present the comic story from top to bottom in a [tone] style, with smooth storytelling and expressive character reactions. Add adorable anime-style sound effects throughout the scenes, such as "boing," "bam," "wah," and "sparkle," to enhance the atmosphere and make the comic feel lively and dynamic.
Swap [tone] for “cute and humorous”, “elegant”, or “dramatic” depending on your story. We’ve tested it from 2-panel gag strips up to 6-panel sequences. The same template scales.
Common questions
How many reference images can I attach to a single v4.0 Preview prompt?
Up to 6 images. In our testing, 2–3 well-chosen references consistently outperformed 6 references competing for the model’s attention. Every extra image adds another vote on what the character should look like, and the model averages them.
What’s the difference between reference image and start frame in v4.0 Preview?
Reference images are semantic — v4.0 uses them to learn who the character is, then lets your prompt drive the pose and action. Start frame (in Keyframe mode) locks the first frame of the video. Keyframe mode and Multi-Reference mode are mutually exclusive — pick the one that fits your workflow. For the full breakdown of every model and panel setting, see our PixAI Image-to-Video Tutorial.
Can v4.0 Preview generate dialogue in Japanese or Korean?
Yes. Write the dialogue in the actual target language using native characters — Japanese in 「」, Korean in “” — never romanize. Romanized lines like “Konnichiwa” produce unintelligible audio in most generations.
How long can a v4.0 Preview video be?
Up to 15 seconds per generation. Reference video uploads have the same 15-second total ceiling — if you attach multiple video references, their combined length still has to fit under 15s.
Should I use v4.0 Preview or v4.0 Lite Preview for prompt iteration?
Use v4.0 Lite Preview or v2.7 High Dynamics for drafts and structural testing. They’re fast and good enough to expose problems in your prompt — vague action, bad transitions, missing references. Save v4.0 Preview for the final run once your prompt is working.
Related PixAI guides
- PixAI Image-to-Video Tutorial: Model Guide + Prompt Writing — the i2v panel walkthrough plus comparisons across every PixAI video model.
- Meet PixAI v4.0 Preview — the capability tour: audio, reference video, what’s new versus v3.x.
- Mio.2 Getting Started: Imagine It. Mio Draws It. — onboarding for the agent we use throughout Workflow 1.
- PixAI Edit Pro: The Advanced AI Image Editor for Complex Edits — the manga layout side of Workflow 2.
- PixAI Reference Pro Guide: Multi-Image Editing with Natural Language — for building the storyboard references Stage 5 needs.
- Introducing Tsubaki: PixAI’s Flagship Anime Model — for generating the character art that becomes your reference.
