PixAI v4.0 Preview: The Anime Image-to-Video Model with References, Voice, and Cinematic Camera

Meet PixAI v4.0 Preview — the anime image-to-video AI with native voice generation, cinematic camera control, and consistent character references. See how it compares to 5 older PixAI video models in 3 same-prompt rounds.

PIXAI ▸ FILM LAB
VOL. 04 ▸ NO. 0
CAPABILITIES OVERVIEW
FIRST EDITION

— THE V4.0 PREVIEW FAMILY —

Showcase ▸ PixAI v4.0 Preview Reel

PixAI v4.0 Preview is our latest anime image-to-video AI model — the first in our lineup with native voice generation, cinematic camera control, and a reference system that finally lets characters do what your prompt asks.

Upload a portrait of your OC sitting on a bench. Ask her to leap into a spinning magic-card finish. With every other PixAI video model, you fight her for five seconds — she half-stands, the cards drift sideways, the eyepatch slides to the wrong eye by second three.

v4.0 Preview is the first model in our lineup where she actually leaps. She lands. The eyepatch stays put. The voice line you wrote, in the language of your choice, comes out of her mouth, lip-synced. The cherry petals fall from exactly where you said they would.

This is a tour of what v4.0 Preview can do, with side-by-side comparisons against every PixAI video model still in service. If you’d rather skip ahead to how to actually write prompts for it, that lives in our companion guide: How to Prompt PixAI v4.0 Preview →

If you’re new to PixAI’s image-to-video panel altogether — what the controls do, which legacy models still exist, how prompts plug into the workflow — start with our broader tutorial: PixAI Image-to-Video Tutorial: Model Guide & Prompt Writing →

Otherwise — let’s start with the family.

▸ 01

Two models, one feature set

Preview and Lite.

The v4.0 Preview family ships in two flavors.

v4.0 Preview is the top-tier model. Premium visual detail, the most nuanced motion, the most natural-sounding audio. Reach for it when you’re shipping something that matters: a portfolio hero shot, a polished character card, a scene where the camera move is the point. It takes longer and costs more credits, but the result shows.

v4.0 Lite Preview (or just Lite) is the everyday model. Same architecture, same reference inputs, same audio generation. Faster, cheaper, and good enough for most of what creators actually make day to day — atmospheric loops, simple character animations, voiced dialogue, gacha-style cards. For most jobs, Lite is the right answer.

— SAME PROMPT, BOTH MODELS —

Gacha-style animated character card

▸ BASE IMAGE
Base character art

▸ V4.0 PREVIEW

▸ V4.0 LITE PREVIEW

▸ VIEW FULL PROMPT
Transform this static character art from @image1 into a gacha-style animated character card, following the same camera motion and pacing style as @video1.

CAMERA MOTION (mirroring the path of @video1):
[0-1.2s] Open on a close-up of her left hand holding the small red-stringed charm pouch (omamori) at her waist. The camera holds briefly on this intimate detail, soft warm light on the red fabric, goldfish drifting slowly past in the soft background blur.
[1.2-2.5s] The camera slowly drifts upward along her body — moving from the charm pouch, past her green striped yukata, past her chest, settling onto her face. As the camera rises, more goldfish become visible, gently swimming through the air around her, pink petals drifting across the frame.
[2.5-3.5s] At her face, the camera gently lingers — her purple eye meeting the viewer softly, her hair swaying just slightly in the dream's breeze, the goldfish continuing their slow parallax behind her. The camera makes a subtle floating motion as if caught in the same dream.
[3.5-4.2s] The camera slowly pulls back to reveal the full composition — her standing beneath the temple roof eaves, goldfish suspended in the air around her, petals drifting.
[4.2-5s] CARD REVEAL CLIMAX — multiple goldfish swim into view from the edges joining the dream, soft bloom intensifies around the frame.

Match the camera pacing, parallax depth, and final ornate card-reveal style of @video1, scaled up to a more dramatic gacha reveal climax.
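
The timed beats in the prompt above follow a regular pattern: a time window in seconds, then the direction for that window. If you write many prompts like this, a small helper can keep the windows in order and inside the clip length. The following is a minimal Python sketch, not a PixAI tool: `Beat` and `render_beats` are hypothetical names, the `[start-ends]` format simply mirrors the prompt above, and the 15-second cap is the per-generation limit quoted in this article's FAQ.

```python
from dataclasses import dataclass

MAX_CLIP_SECONDS = 15.0  # per-generation limit quoted in this article's FAQ


@dataclass(frozen=True)
class Beat:
    """One timed instruction: a window in seconds plus its direction."""
    start: float
    end: float
    direction: str


def render_beats(beats: list[Beat]) -> str:
    """Format beats as '[start-ends] direction' lines after validating them."""
    previous_end = 0.0
    lines = []
    for beat in beats:
        if beat.end <= beat.start:
            raise ValueError(f"beat must end after it starts: {beat}")
        if beat.start < previous_end:
            raise ValueError(f"beat overlaps the previous one: {beat}")
        if beat.end > MAX_CLIP_SECONDS:
            raise ValueError(f"beat runs past {MAX_CLIP_SECONDS:g}s: {beat}")
        lines.append(f"[{beat.start:g}-{beat.end:g}s] {beat.direction}")
        previous_end = beat.end
    return "\n".join(lines)
```

Calling `render_beats([Beat(0, 1.2, "Open on the charm pouch."), Beat(1.2, 2.5, "Drift upward to her face.")])` returns two lines in the same shape as the camera-motion section above, ready to paste into a prompt.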

Both models support the full v4.0 feature set: up to 6 reference images, 3 reference videos, 3 reference audio clips, multi-language voice, and cinematic camera control. The difference is in finish, not in capability.
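
Those reference caps are easy to check before you start uploading. Here is a minimal Python sketch of such a pre-flight check. It is an illustration only: the `ReferenceSet` class and its field names are hypothetical, and the only values taken from this article are the three caps (6 images, 3 videos, 3 audio clips).

```python
from dataclasses import dataclass, field

# Caps quoted in this article for both v4.0 models.
MAX_IMAGES = 6
MAX_VIDEOS = 3
MAX_AUDIO = 3


@dataclass
class ReferenceSet:
    """Hypothetical container for the files attached to one generation."""
    images: list[str] = field(default_factory=list)
    videos: list[str] = field(default_factory=list)
    audio: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return one message for each cap that is exceeded."""
        checks = [
            ("images", len(self.images), MAX_IMAGES),
            ("videos", len(self.videos), MAX_VIDEOS),
            ("audio clips", len(self.audio), MAX_AUDIO),
        ]
        return [
            f"too many reference {name}: {count} > {cap}"
            for name, count, cap in checks
            if count > cap
        ]
```

An empty list from `problems()` means the set fits within the caps. Returning messages instead of raising lets you report every exceeded cap at once.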

— HOW TO ATTACH REFERENCES —

A short walkthrough of the reference upload flow.

— OUR WORKING RULE —

Sketch on Lite, iterate on Lite, finish on Preview.

Most of what runs through our team starts on Lite and ends with one Preview pass once the prompt is dialed in. We’ll show you below why that rule exists — there’s a round in the showdown where Lite genuinely outperforms Preview, and it changes how you should think about picking a model.

For the rest of this article, the showcase work is mostly Preview. Not because Lite can’t do it — both models can — but because the easiest way to demonstrate what the v4.0 family is capable of is to use the version that shows it most clearly.

PART TWO

What v4.0 Preview can actually do

Four capabilities. Each with side-by-side examples and showcase reels.

▸ 01

Consistent Characters

Your OC, every shot, every scene.

Older video models treat your reference image as a literal first frame. Whatever pose your character is in, whatever background is behind her, all of it is locked in from the first second. If your portrait shows her sitting and you want her sprinting, you fight the model the whole way.

v4.0 Preview reads images differently. It treats them as semantic references — pulling who she is (hair, eyes, fang, eyepatch, outfit silhouette) and separating that from the specific moment the reference happened to capture. One image, endless variations. She stays recognizably herself across dramatic scene changes, costume swaps, multi-shot sequences, and even when sharing the frame with another character of yours. The kind of consistency that used to take dozens of regenerations now happens on the first try.

— EXAMPLE 01 —

One character, three worlds — open-world game-style exploration

▸ CHARACTER + SCENE 01
Character with scene 01 reference
▸ SCENE 02
Scene 02 reference
▸ SCENE 03
Scene 03 reference
▸ OUTPUT ▸ V4.0 PREVIEW
▸ VIEW PROMPT
Using @Image1 as the main character, create an open-world exploration game style animation. The screen displays a game UI with semi-transparent pink buttons — directional pad, attack button, and jump button — while the character walks and explores through the world.

0–5s: The game world scene references the background of @Image1. A gentle breeze passes through, carrying cherry blossom petals across the screen. The character walks slowly through the environment. When the character touches a floating ribbon, an item reward notification pop-up appears.

6–10s: After the character continues walking, the game scene transitions from @Image1 to @Image2. As the character walks, a strong gust of wind blows through the scene, scattering photographs across the screen one by one. The character is able to pick up the photographs as they pass.

11–15s: After the character continues walking, the game scene transitions from @Image1 to @Image3. The environment now features enormous yellow tulips filling the frame. The bubbles from @Image3 float gently through the air across the scene.

— EXAMPLE 02 —

Two characters, one frame, both holding identity

▸ CHARACTER A
Character A reference

▸ CHARACTER B
Character B reference

▸ OUTPUT ▸ V4.0 PREVIEW

▸ VIEW PROMPT
The scene opens on darkness. A deep violet light pulses once from the left side of the frame.

The devil steps into frame from the left, emerging slowly through swirling dark purple mist and drifting violet embers. Small black horns catch the indigo glow. Her eyes open with a mischievous gleam, a slow smirk forming at the corner of her lips. She tilts her head slightly, tongue peeking out, one hand lazily raised. Dark flames curl upward behind her silhouette.

A beat of silence. Then a soft warm light blooms from the right.

The angel descends slowly into frame from the right, carried on a gentle cascade of rose-gold light and drifting white feathers. A faint halo materializes above her head. Her eyes open softly, warm and unhurried. She smiles quietly, raising both hands and forming a heart shape near her chest.

Now both are in frame. Their eyes find each other across the void. The space between them — dark violet on her left, soft pink on her right — begins to shimmer and dissolve into a shared pink-lavender gradient mist.

Dreamy cinematic atmosphere, soft volumetric lighting, smooth animation, anime style, stable camera, no hard cuts.

— SHOWCASE —

Multi-character, multi-shot sequence with identity held across cuts.

▸ 02

Cinematic Camera & Multi-Panel Storytelling

Direct a scene, not just a clip.

v4.0 Preview reads cinematic vocabulary as direction, not decoration. Push in, pull back, orbit, tilt, rack focus, slow drift — each one is a precise instruction the model can act on. Multi-shot sequences carry character identity across cuts, so a 15-second clip can hold an opening close-up, a reaction medium, and a final wide without the character drifting into someone else halfway through.

The model also supports comic-strip mode — animated manga pages with 4 to 6 panels arranged on screen, each panel alive with its own motion, transitions cued by SFX. This is the first model in our lineup where the output stops looking like an “AI video” and starts looking like a scene from a comic adapted to motion.

— SHOWCASE —

What v4.0 Preview’s camera direction looks like in practice.

▸ SHOWCASE A

▸ SHOWCASE B ▸ MULTI-SHOT SEQUENCE

Dense multi-shot sequence with rapid scene transitions.

▸ SHOWCASE C ▸ CHIBI BAND ROAD TRIP

Chibi Mio’s band on a spring road trip — promotional spot.

▸ 03

Style Transfer from Animation References

Reference an animation. Get its feel, with your character.

Hand v4.0 Preview a video you want to reference — a short film opening, a music video sequence, a cinematic clip with a mood you love. The model studies its shot structure, camera motion, pacing, and atmosphere, then applies that framing language to your scene with your characters.

What transfers is the structural and atmospheric feel — the way the camera moves between shots, how each beat holds, the color and lighting mood, the energy of the cuts. What stays yours is the art style. Your anime character won’t get rendered in your reference video’s live-action look. Your soft pastel illustration won’t suddenly turn cinematic-noir. v4.0 reads the reference as a director’s reference board, not a style filter.

— EXAMPLE 01 ▸ GACHA CARD REVEAL —
▸ BASE IMAGE
Base character art

▸ REFERENCE VIDEO

▸ OUTPUT ▸ V4.0 PREVIEW

▸ VIEW PROMPT
Transform this static character art from @image1 into a gacha-style animated character card, following the same camera motion and pacing style as @video1.

— EXAMPLE 02 ▸ STYLE + AUDIO RHYTHM TRANSFER —
▸ BASE IMAGE
Base character art

▸ REFERENCE VIDEO

▸ OUTPUT ▸ V4.0 PREVIEW

▸ VIEW PROMPT
Using @Image1 as the main character, reference the shot composition, movements, and overall style of @Video1, but replace the actions with the following:

Making a finger heart gesture while winking, eating cake with a blissful and contented expression, wearing headphones and gently swaying her head side to side while listening to music.

Reference the rhythm and atmosphere of @Audio1.

— SHOWCASE —

Style transfer across genres: fighting-game POV and magazine spread.

▸ SHOWCASE A ▸ FIRST-PERSON FIGHTER

▸ SHOWCASE B ▸ MAGAZINE SPREAD

▸ 04

Native Voice & Sound

A scene, not a silent video.

This is the cleanest dividing line between v4.0 and everything older. Pre-v4.0 models hand you a silent clip. v4.0 hands you a scene with sound in it.

Voice generation runs natively in English, Japanese, and beyond. Write the line in the actual language and v4.0 Preview generates synced audio with mouth movements that actually track — including stylistic touches like a cat-girl にゃ inflection, a galgame breathy whisper, or a cheerful vlog-style giggle. Music, ambient sound, and SFX run on the same track: a heartbeat under a confession, rain and thunder under a grief scene, warm BGM that swells exactly when the joy hits. The result is something AI video has been missing for two years — a 5- to 15-second clip that actually feels finished.

— EXAMPLE 01 ▸ ENGLISH VOICE —
▸ BASE IMAGE
English voice example base image

▸ V4.0 PREVIEW

▸ V4.0 LITE PREVIEW

▸ V3.2 (LEGACY)

— EXAMPLE 02 ▸ JAPANESE VOICE —
▸ BASE IMAGE
Japanese voice example base image

▸ V4.0 PREVIEW

▸ V4.0 LITE PREVIEW

▸ V3.2 (LEGACY)

— SHOWCASE —

Voice acting that holds up across romance scenes and idol moments.

▸ SHOWCASE A ▸ OTOME-GAME DIALOGUE

▸ SHOWCASE B ▸ IDOL STAGE

PART THREE

Same Prompt Showdown
3 rounds. 6 models. Same starting image.

It’s easy to claim a new model is better. Harder to prove it. So we did the simplest honest thing: for each round, one prompt and one starting image, run through every PixAI video model we still ship.

What we expected to find was a clean ranking with v4.0 Preview on top. What we actually found was more useful than that.

— OUR EVALUATION CRITERIA —

What we paid attention to

Across all three scenarios, our team looked at five things when comparing outputs:

01
Prompt adherence. Whether the model did what we asked — followed the timing, the action sequence, the camera direction we specified.

02
Character consistency. Whether the character stayed the same person — same fang, same eyepatch on the same eye, same accessories from frame one to the final frame.

03
Visual fidelity. How polished the final frame looks — the catchlight in the eyes, the texture on the hair, the lighting nuance that separates “AI video” from “frame of a film.”

04
Motion realism. Whether the motion is physically believable: limbs articulate correctly, the camera moves smoothly, weight and momentum feel right.

05
Style preservation. Whether the output holds onto the source illustration’s style — line quality, color treatment, the dreamy or sharp character of the original.

You might weigh these criteria differently: visual fidelity might matter more to you than motion realism, or vice versa. Read each commentary with your own priorities in mind.
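
One way to make your own priorities explicit is to weight the five criteria and reduce each model's output to a single number. The sketch below is a hypothetical Python helper, not the method our team used: the criterion keys are just the five names above in snake_case, and any scores or weights you feed it are your own judgments.

```python
CRITERIA = (
    "prompt_adherence",
    "character_consistency",
    "visual_fidelity",
    "motion_realism",
    "style_preservation",
)


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-criterion scores; weights need not sum to 1."""
    missing = [c for c in CRITERIA if c not in scores or c not in weights]
    if missing:
        raise KeyError(f"missing criteria: {missing}")
    total_weight = sum(weights[c] for c in CRITERIA)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[c] * weights[c] for c in CRITERIA) / total_weight
```

If you care twice as much about prompt adherence as anything else, give it a weight of 2 and the rest a weight of 1, then compare the resulting scores across the clips you generated.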

ROUND ▸ 01

A simple moment
— where the gap is smallest

The kind of generation any creator runs on a typical Tuesday. Character on a windowsill, slow head turn toward camera, a small smile, a leaf drifting in the background. Five seconds. Same starting image across all six models.

▸ BASE IMAGE
Round 1 base image

▸ THE PROMPT

The character in this image gently leans on the window frame, looking softly to the side, then slowly turns her head toward the camera and gives a small playful smile, her fang catching the light. A leaf drifts down past the window behind her. The camera slowly pushes in toward her face. Maintain the character’s appearance and the warm natural window lighting from the source image throughout the video. Anime illustration style, gentle slow motion.

▸ V4.0 PREVIEW

v4.0 Preview. Visibly the most polished. Catchlight in the eyes, painterly hair texture, real window light. Added an unrequested hand gesture, though.

▸ V4.0 LITE PREVIEW

v4.0 Lite Preview. Did exactly what the brief asked. No added gestures, leaf drifts naturally. Polish is a half-step behind Preview, but at less than half the credit cost. Our most interesting finding: Lite outperformed Preview on prompt adherence.

▸ V3.2

v3.2. Awkward middle. Sharpened her lines in a way that broke the dreamy source illustration. Across all three rounds, we don’t have a scenario where we’d recommend it over Lite.

▸ V3.0 HIGH CONSISTENCY

v3.0 High Consistency. Most reference-faithful character of any model. Trade-off: she barely moved, mostly mouthed silently.

▸ V3.0 FLASH

v3.0 Flash. Held up better than expected at the same credit tier as Lite. Reasonable everyday-budget option for prompts this simple.

▸ V2.7 HIGH DYNAMICS

v2.7 High Dynamics. Fastest by a wide margin, but the eyepatch came off mid-clip, uninstructed.

— ROUND 1 TAKEAWAY —

For everyday simple animation, you don’t need v4.0 Preview. Lite is the right answer.

ROUND ▸ 02

A magic-trick performance shot

Round 1 was easy mode. Round 2 is where “looks okay” starts falling apart.

▸ BASE IMAGE
Round 2 base image

▸ THE PROMPT (continuous 5-second magic stage act)
The anime jester cat girl from @image1 performs a continuous 5-second magic stage act in one unbroken shot. The camera follows her as she moves through the performance.

BEGINNING: she stands centered on the stage in her opening pose, eyes closed, gathering quiet focus. After a beat she slowly opens her purple eye and looks directly at the camera with a mischievous fang-baring smile.

BUILD: she raises both arms gracefully, the small teddy bear charm visible on her outfit, then in one swift motion throws her arms wide open. A swirl of playing cards bursts out from her hands, gold coins scatter through the air, red and blue balloons float upward.

PEAK: she leaps into a mid-air spin, her checkered skirt and red cape flowing dramatically, cards continuing to scatter outward in slow motion, the camera tracking around her in a slight arc.

RESOLUTION: she lands lightly, the cards and coins still suspended around her, and turns her head toward the camera with one finger raised playfully to her lips, fang visible in a final mysterious smile.

Throughout: character identity preserved exactly — white hair with pink streaks, white heart ahoge, purple eye on one side, BLACK leather eyepatch with red heart charm on the other side (always same eye), fang, cat ears, cat tail, red and black jester outfit with white lace ruff collar, harlequin stockings. Anime illustration style matching the source image, dramatic theatrical stage lighting, deep crimson and gold palette, continuous one-take camera energy in the style of Satoshi Kon's flowing sequences.

▸ V4.0 PREVIEW

v4.0 Preview. Every beat delivered. Eye reveal, cards bursting from her hands, the leap, the spin, the final shhh. Eyepatch stays on the same eye through the spin. Identity holds through dramatic motion. Slight drift from the painterly source on the most extreme frames — a fair trade.

▸ V4.0 LITE PREVIEW

v4.0 Lite Preview. Held together but cut corners. The hardest beats (leap, mid-air spin) got softened or skipped. Watch the spin — she rotates partway and the camera covers the rest.

▸ V3.2

v3.2. Chose the opposite trade. Executed the action sequence, but the character drifted into a different person halfway through. First and last frame look like two different characters.

▸ V3.0 HIGH CONSISTENCY

v3.0 High Consistency. Chose the wrong fight. Art style perfectly preserved; she mostly stayed seated. Refused to perform the requested action.

▸ V3.0 FLASH

v3.0 Flash. Collapsed. Eyepatch on the wrong eye, outfit changed, body parts contorted into impossible poses.

▸ V2.7 HIGH DYNAMICS

v2.7 High Dynamics. Motion fine, consistency broke. Eyepatch disappeared, teddy charm got tossed, the card array looped three times like a stuck animation.

— ROUND 2 TAKEAWAY —

The gap between models isn’t constant. It grows with the ambition of the prompt.

A model that’s “good enough” for a head turn becomes catastrophic for a stage performance. Each older model breaks here in a way that tells you something specific about how AI video models behave under pressure.

ROUND ▸ 03

A two-character interaction scene

The hardest assumption in video generation is that identity will hold. With one character, you’re hoping the model keeps her hair, her outfit, her eyepatch over five seconds. With two characters in one frame, you’re hoping the model holds two distinct identities at once — without bleeding one character’s features into the other.

▸ BASE IMAGE
Round 3 base image

▸ THE PROMPT

The camera slowly tilts upward from the characters’ waists. Flower petals drift through the air as a gentle breeze softly sways their hair and clothes. The character on the left gently caresses the other’s cheek, followed by a close-up shot. In the next scene, both characters slowly close their eyes slightly.

▸ V4.0 PREVIEW

v4.0 Preview. Faithfully executed prompt with cinematic tilt and clean close-up. Single visible flaw — a tiny strip of bandage appeared under one character’s brow mid-shot, a small misreading of the reference.

▸ V4.0 LITE PREVIEW

v4.0 Lite Preview. Solid execution but the camera tilt felt rushed. The transition to close-up wasn’t as polished as Preview.

▸ V3.2

v3.2. No real transition between shots, awkward camera motion, art style drifted heavily.

▸ V3.0 HIGH CONSISTENCY

v3.0 High Consistency. The synchronized eye-close at the end looked deeply unnatural. Almost no camera motion at all. The one upside: art style fidelity to the reference.

▸ V3.0 FLASH

v3.0 Flash. Maintained the original art style well but characters stayed strangely still — the eye contact felt frozen rather than alive. The camera just pushed forward instead of executing the tilt-then-close-up.

▸ V2.7 HIGH DYNAMICS

v2.7 High Dynamics. Motion was decent but the eyepatch fell off one character mid-clip, and the art style shifted noticeably between before and after — almost like two different versions of the same character.

— ROUND 3 TAKEAWAY —

Multi-character scenes weren’t the hardest task — they just exposed the same baseline weaknesses we saw in Round 1.

We expected two characters in one frame to break older models more dramatically than complex single-character action did in Round 2. It didn’t. Where Preview pulled ahead was in the polish of the camera tilt and the close-up transition: the camera tilted upward the way it does in a film.

— THE TAKEAWAY —

What this tells you about picking a model

The gap between models is a function of how ambitious the prompt is, not a constant.

▸ IF YOU REMEMBER ONE THING

Iterate cheap. Deliver expensive.

Sketch on Lite or v2.7. Finish on Preview only when the scene’s ambition warrants it.

▸ FAQ

Common questions

Which model should I use — v4.0 Preview or v4.0 Lite Preview?

For final deliverables where visual polish matters, use Preview. For everyday work, prompt iteration, and simple motion, Lite is faster, cheaper, and often just as good. Our rule of thumb: sketch on Lite, finish on Preview.

How many reference images can I upload?

Up to 6 per generation. In practice, 2 to 3 well-chosen references produce better results than 6 references competing for the model’s attention. More isn’t always better.

Can I use reference images, reference video, and reference audio at the same time?

Yes — that’s actually the most powerful workflow. Reference images anchor the character, reference video sets the camera style and pacing, reference audio shapes the voice or mood. Combined, you can apply a specific animation style to your own character with your own voice direction.

How long can a v4.0 Preview video be?

Up to 15 seconds per generation. Reference video uploads also have to fit within 15 seconds total — if you’re using multiple video references, their combined length counts toward this limit.
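
Both limits can be checked with simple arithmetic before you generate. The following Python sketch assumes you already know the duration of each clip in seconds. The function name `check_durations` is hypothetical, and the only facts taken from the answer above are the two 15-second limits.

```python
MAX_OUTPUT_SECONDS = 15.0           # per-generation limit from the answer above
MAX_REFERENCE_VIDEO_SECONDS = 15.0  # combined limit across all reference videos


def check_durations(
    output_seconds: float, reference_video_seconds: list[float]
) -> list[str]:
    """Return one message per duration limit that the request would exceed."""
    problems = []
    if output_seconds <= 0 or output_seconds > MAX_OUTPUT_SECONDS:
        problems.append(
            f"output length {output_seconds:g}s is outside 0-{MAX_OUTPUT_SECONDS:g}s"
        )
    total = sum(reference_video_seconds)
    if total > MAX_REFERENCE_VIDEO_SECONDS:
        problems.append(
            f"reference videos total {total:g}s, over {MAX_REFERENCE_VIDEO_SECONDS:g}s"
        )
    return problems
```

For example, three 5-second reference clips add up to exactly 15 seconds and pass, while two 8-second clips add up to 16 seconds and are flagged.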

Are Multi-Reference and Keyframe modes compatible?

No. They’re mutually exclusive. Multi-Reference is for attaching multiple semantic references (character, scene, style). Keyframe is for specifying a start frame and end frame. Use whichever fits your workflow.
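
If you script your own generation settings, the either-or rule can be enforced up front. This is a minimal Python sketch under stated assumptions: `Mode` and `resolve_mode` are hypothetical names for illustration, not part of any PixAI interface, and the only behavior taken from the answer above is that the two modes cannot be combined.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    MULTI_REFERENCE = "multi_reference"
    KEYFRAME = "keyframe"


def resolve_mode(
    references: list[str],
    start_frame: Optional[str] = None,
    end_frame: Optional[str] = None,
) -> Mode:
    """Pick the mode implied by the inputs, rejecting any mix of the two."""
    wants_keyframe = start_frame is not None or end_frame is not None
    if references and wants_keyframe:
        raise ValueError("Multi-Reference and Keyframe are mutually exclusive")
    if wants_keyframe:
        return Mode.KEYFRAME
    if references:
        return Mode.MULTI_REFERENCE
    raise ValueError("attach references or keyframes before generating")
```

Failing early like this is cheaper than discovering the conflict after you have already prepared both sets of inputs.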

Why does my character drift halfway through a complex action?

Long descriptions of the character in the text prompt can fight the reference image’s anchoring. Move character details into the reference image and keep the prompt focused on what should happen. We cover this and a lot more in our companion guide: How to Prompt PixAI v4.0 Preview →

Where can I learn the full PixAI image-to-video workflow, including legacy models?

Our full tutorial covers the entire i2v panel — controls, every model in the lineup with their strengths, prompt writing across model generations: PixAI Image-to-Video Tutorial →

Can I use v4.0 Preview for commercial work?

Yes — content generated on PixAI follows our standard usage terms. Check our terms of service for current details.

▸ FINAL CUT

— READY? —

Make your first v4.0 Preview video

Pick a character image. Write a sentence about what should happen. Generate.
