DiT Model Prompt Writing Guide

Editor’s note (PixAI)
This guide was written by one of our community’s standout creators, 阿童 (ATone), and is republished here with credit to the original author.

PixAI’s DiT family — Tsubaki / Serin / Tsubaki Flash — has a very different prompt style from the SDXL line. This chapter is for users who already know SDXL prompting and are picking up DiT.

Core Principle

DiT models only accept English prompts and strongly favor natural English descriptions — the closer it sounds to telling a story to a professional illustrator, the better.

Why not use Danbooru tags?

SDXL-family models (Illustrious, NoobAI, etc.) use CLIP text encoders and were fine-tuned on Danbooru / e621 tag-style captions, so they expect tag-style input.
DiT models use a text encoder closer to a modern LLM. They understand natural-language descriptions much better and adapt less well to a flat tag list.
The upshot: SDXL’s tag-existence rule (young man is invalid; you must write 1boy) does not apply in DiT. Just write normal English.

Empirical comparison: Model × Prompt style

The same prompt sent to different models gives very different results. The 2×2 below uses the PixAI mascot Mio LoRA (DiT + SDXL versions of the same character, Spring Echoes / Emerald Melody variant) under a strict controlled comparison — same scene, only the model / prompt style swapped:

Model	Natural-language prompt	Tag-stack prompt
Tsubaki.2 (DiT)	A: Model strength + matching prompt style ✓	B: DiT adapts poorly to tag-style prompts
Illustrious-XL (SDXL)	C: SDXL adapts poorly to natural language	D: Model strength + matching prompt style ✓

Results A and D (the diagonal) pair each model with its preferred prompt style; results B and C (the off-diagonal) come out off-target. Even with the same LoRA and the same scene, a mismatched model / prompt-style pairing can throw the result off.

SDXL → DiT: Common Migration Pitfalls

When jumping from SDXL to DiT, drop these habits:

❌ SDXL habit	Why it fails in DiT / What to do
`1boy, solo, masterpiece, best quality`	DiT doesn’t lean on count or quality tags. Rewrite as a natural sentence: “A young man standing alone in a cinematic scene.”
Heavy quality stacks (`8k, ultra-detailed, extremely detailed`)	DiT image quality is already strong; piling on quality tags can produce results that don’t match what you intended (sometimes diluted, sometimes overshot). Keep at most one style word.
Underscore tokens (`black_hair`, `looking_at_viewer`)	DiT reads natural English. Drop the underscores.
Bracket weighting `(black hair:1.2)`	DiT doesn’t recognize this syntax. To emphasize an element, rewrite the sentence and put it earlier.
`right: ... left: ...` blocks or `BREAK` for multi-character isolation	These still work on DiT, but the effect isn’t pronounced. Switching to described relationships and interactions usually gives a livelier composition (see the multi-character section below).
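
The mechanical parts of the table above (bracket weights, underscores, quality stacks) can be stripped automatically before you rewrite the rest by hand. The sketch below is only an illustration: `clean_sdxl_prompt` and the `QUALITY_TAGS` list are hypothetical names, not part of any PixAI tool, and the function does not turn tags like `1boy` into a sentence for you. It also does not handle nested brackets such as `((tag))`.

```python
import re

# Assumed list of quality tags to drop; extend it to match your own prompts.
QUALITY_TAGS = {"masterpiece", "best quality", "8k", "ultra-detailed", "extremely detailed"}

def clean_sdxl_prompt(prompt: str) -> list[str]:
    """Return the phrases left after stripping SDXL-only syntax."""
    # Unwrap bracket weighting: "(black hair:1.2)" becomes "black hair".
    prompt = re.sub(r"\(([^():]+):\s*[\d.]+\)", r"\1", prompt)
    phrases = []
    for raw in prompt.split(","):
        # Underscore tokens become plain English words.
        phrase = raw.strip().replace("_", " ")
        if not phrase or phrase.lower() in QUALITY_TAGS:
            continue  # skip empty entries and quality tags
        phrases.append(phrase)
    return phrases

print(clean_sdxl_prompt("1boy, masterpiece, (black hair:1.2), looking_at_viewer, 8k"))
# → ['1boy', 'black hair', 'looking at viewer']
```

The remaining phrases are raw material: you still need to write them up as a natural sentence, for example "A young man with black hair looking at the viewer."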

Generation Parameters: What’s Different

Beyond the prompt itself, Tsubaki.2 also exposes a different parameter panel from SDXL:

No CFG Scale and no step count. The two knobs you tune most on SDXL simply aren’t on the Tsubaki.2 panel.
Use the “Mode” selector instead to balance quality vs. speed. The options are Lite / Standard / Pro / Ultra (Chinese: 輕量 / 標準 / 專業 / 極致). The underlying mechanism is close to step count — higher tiers give finer detail at higher credit cost.
“Standard” is already a strong default; reserve “Pro” for cases that genuinely need extreme detail.

Scenario 1: Single Character

Recommended writing order:

Order	Content	Why this order
1	Style / overall mood / camera language	Sets the global tone first; everything below aligns with it
2	Subject + action / pose	Establishes the focal point
3	Outfit & accessories	Detail the subject after positioning
4	Foreground props	Round out the focal area
5	Background environment	From near to far
6	Lighting & effects	Final pass that locks in atmosphere

Example:

A cinematic medium shot of a young Taiwanese girl with long silver hair and purple eyes, gently smiling, wearing an elegant white lolita dress with intricate lace, standing in a cherry blossom garden, soft pink petals floating in the air, warm golden hour sunlight filtering through the trees, highly detailed, beautiful anime style

💡 Notice the phrase young Taiwanese girl — that’s an invalid Danbooru tag in SDXL and CLIP would mishandle it, but it’s perfectly fine natural English in DiT. DiT does not require tag database lookups.
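
If you assemble prompts in a script, the recommended order can be enforced by filling named slots. This is a minimal sketch, assuming the first slot is a shot phrase that reads naturally before "of"; the function name and parameter names are made up for illustration.

```python
def build_single_character_prompt(shot, subject, outfit="", props="",
                                  background="", lighting=""):
    """Join the slots in the recommended order, skipping empty ones."""
    # Steps 1 and 2: style / camera, then subject + action.
    opening = f"{shot.strip()} of {subject.strip()}"
    # Steps 3 to 6: outfit, foreground props, background, lighting.
    details = [outfit, props, background, lighting]
    return ", ".join([opening] + [d.strip() for d in details if d.strip()])

prompt = build_single_character_prompt(
    shot="A cinematic medium shot",
    subject="a young Taiwanese girl with long silver hair and purple eyes, gently smiling",
    outfit="wearing an elegant white lolita dress with intricate lace",
    props="soft pink petals floating in the air",
    background="standing in a cherry blossom garden",
    lighting="warm golden hour sunlight filtering through the trees",
)
print(prompt)
```

The output is still one continuous English description; the slots only make sure nothing lands out of order.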

Scenario 2: Multiple Characters

The biggest change in DiT for multi-character scenes: describe relationships instead of isolating characters with tags.

Recommended writing order:

Order	Content	Why this order
1	Overall composition / camera / mood	Same as single character — set the tone
2	Relationships and interactions between characters (most important!)	This is how DiT figures out who is who and who is doing what to whom
3	Each character’s appearance, action, expression (primary → secondary)	Introduce them one by one in priority order
4	Outfits and details	After the cast is clear
5	Background, lighting, effects	Final pass, same as above

Example:

A romantic wide shot under cherry blossoms at sunset, a silver-haired catgirl with purple eyes is tiptoeing to kiss a tall black-haired boy, the boy gently holding her waist, they are looking at each other affectionately, detailed intricate clothing, soft pink petals floating around them, warm golden sunlight, cinematic lighting, emotional atmosphere, beautiful detailed anime style

⚠️ SDXL multi-character tricks are not necessary. A relationship description such as “she is tiptoeing to kiss him while he holds her waist” usually works better.
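
The multi-character order differs from the single-character one mainly in two ways: the relationship sentence comes right after the composition, and the cast is introduced primary first. A rough sketch of that ordering follows; the `Character` class, its `priority` field, and the function name are hypothetical helpers, not anything the models require.

```python
from dataclasses import dataclass

@dataclass
class Character:
    description: str  # appearance, action, expression
    priority: int     # 1 = primary character, 2 = secondary, ...

def build_multi_character_prompt(composition, relationship, characters,
                                 outfits="", scene=""):
    """Composition, then relationship, then cast by priority, then the rest."""
    cast = [c.description for c in sorted(characters, key=lambda c: c.priority)]
    ordered = [composition, relationship, *cast, outfits, scene]
    return ", ".join(p.strip() for p in ordered if p.strip())

prompt = build_multi_character_prompt(
    composition="A romantic wide shot under cherry blossoms at sunset",
    relationship="a catgirl is tiptoeing to kiss a tall boy while he holds her waist",
    characters=[
        Character("the boy has black hair", priority=2),
        Character("the catgirl has silver hair and purple eyes", priority=1),
    ],
    scene="warm golden sunlight",
)
print(prompt)
```

Passing the characters in any order is fine, since the sort puts the primary character first.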

General Tips

Embedding LoRA triggers (suggestion, not yet fully validated)

A common community conjecture: writing the LoRA trigger as part of the natural-language description may be more stable than a tag-style prefix, because it makes the relationship between the trigger and the described subject more explicit to the model. This isn’t fully validated, and behavior can vary by LoRA / scenario — try both styles and see which works better in your case.

Worth noting: some PixAI official DiT LoRAs (such as the mascot Mio LoRA) ship a trigger that is itself a full descriptive sentence, designed to be folded directly into your prompt. For example, the [PixAI Mio/ミオ] Spring Echoes LoRA trigger:

A girl with white-to-pink gradient hair, heart ahoge, purple eyes, eyepatch, cat ears, fang, jirai kei style. Open dark grey glossy leather hoodie over a black bandeau, slight cleavage, cinched waist, pink drawstrings. Black distressed low-rise denim short

Letting it flow directly into the scene action reads more naturally than dropping it as a prefix and starting a separate sentence:

Style	Example
Whole trigger as a prefix, then a separate scene sentence	`<full trigger>. She is walking through neon-lit Shibuya at night.`
Naturally combined (recommended)	`A girl with white-to-pink gradient hair, heart ahoge, purple eyes, eyepatch, cat ears, fang, jirai kei style, walking through neon-lit Shibuya at night, ...`

If you can’t fit it naturally, drop it as a single sentence at the start or end.
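
The two styles in the table can be compared side by side with a pair of small helpers. This is only a sketch: both function names are made up, and the trigger used here is shortened for illustration rather than copied in full from the LoRA.

```python
def fold_trigger(trigger: str, scene_action: str) -> str:
    """Let the trigger flow into the scene action as one sentence."""
    # Drop the trailing period so the description can continue with a comma.
    return f"{trigger.strip().rstrip('.')}, {scene_action.strip()}"

def prefix_trigger(trigger: str, scene_sentence: str) -> str:
    """Keep the trigger as its own sentence, followed by a separate one."""
    trigger = trigger.strip()
    if not trigger.endswith("."):
        trigger += "."
    return f"{trigger} {scene_sentence.strip()}"

# Shortened, illustrative trigger (assumption: not the full official text).
trigger = "A girl with cat ears, jirai kei style."

print(fold_trigger(trigger, "walking through neon-lit Shibuya at night"))
print(prefix_trigger(trigger, "She is walking through neon-lit Shibuya at night."))
```

Since the claim that the folded form is more stable is not fully validated, generating both variants and comparing the results is a reasonable way to test it for a given LoRA.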

Negative prompt (shared baseline)

blurry, low quality, deformed hands, extra fingers, bad anatomy, watermark, text, logo, ugly, deformed, mutated

DiT honors negative prompts the same way SDXL does. This baseline list works for both.

Put style descriptions in Customize Style

⚠️ Customize Style is Tsubaki.2-only. Other DiT models (Tsubaki v1, Serin, Tsubaki Flash) don’t have this field. On Tsubaki.2, peeling style words out into Customize Style keeps your main prompt clean. On other DiT models, fold the style words into the tail of the main prompt.

Customize Style examples

Scene	Customize Style content
Single-character portrait	`delicate anime style, soft lighting, studio ghibli influence`
Romantic multi-character	`romantic anime style, cinematic, soft bokeh`
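
If you prepare prompts for several DiT models at once, the rule above (separate field on Tsubaki.2, tail of the main prompt elsewhere) can be written as a small routing function. The dictionary keys `prompt` and `customize_style` are hypothetical labels for the two input boxes, not a documented PixAI API.

```python
# Assumption: models are identified by these display names.
MODELS_WITH_CUSTOMIZE_STYLE = {"Tsubaki.2"}

def place_style(model: str, main_prompt: str, style: str) -> dict:
    """Decide where the style words go for a given model."""
    if model in MODELS_WITH_CUSTOMIZE_STYLE:
        # Tsubaki.2: keep the main prompt clean, style goes in its own field.
        return {"prompt": main_prompt, "customize_style": style}
    # Other DiT models: fold the style words into the tail of the main prompt.
    return {"prompt": f"{main_prompt}, {style}", "customize_style": ""}

style = "delicate anime style, soft lighting, studio ghibli influence"
print(place_style("Tsubaki.2", "A girl reading by a window", style))
print(place_style("Serin", "A girl reading by a window", style))
```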