LongCat Multi Avatar

Create realistic lip-synced videos of two people having conversations.

10,000+ generations this month

📄 About LongCat Multi Avatar
Key Features
Generates highly realistic, lip-synced videos of two people from a single image and dual audio inputs.
Supports both parallel (simultaneous speaking) and sequential (one after another) audio modes for flexible conversations.
Customizable prompts and negative prompts allow for guided video generation and exclusion of unwanted elements.
Adjustable video resolution (480p or 720p) and segment count to fit various project requirements.
Bounding box controls enable precise positioning and cropping of each speaker within the frame.
Tunable parameters such as inference steps and guidance scales help balance output quality and motion realism.
Built-in safety checker and robust error handling for reliable and appropriate outputs.
💡 Use Cases
Creating virtual interviews or two-person dialogue videos for podcasts and YouTube channels.
Producing AI-driven explainer or educational videos featuring conversational scenarios.
Generating realistic avatars for marketing campaigns, product demos, or customer service bots.
Powering interactive storytelling or role-play content with dynamic character interactions.
Building demo videos for voice AI, speech synthesis, or multilingual applications.
Developing social media content with engaging, talking avatar duets or conversations.
Enabling remote team presentations or announcements with personalized, animated avatars.
🎯 Best For
Content creators, educators, marketers, and AI enthusiasts seeking realistic two-person video generation from images and audio.
👍 Pros
Delivers ultra-realistic, synchronized lip movements and natural facial dynamics.
Supports flexible audio arrangements for authentic conversations or duets.
Highly customizable with advanced prompt and bounding box controls.
Easy to use with simple file uploads or URLs—no technical expertise required.
Multiple output resolutions and segment options to fit diverse needs.
Integrated safety features help maintain output quality and appropriateness.
⚠️ Considerations
Requires high-quality input images for best results.
Primarily designed for two-person scenarios; not suited for group conversations.
Generation times may vary depending on video length and resolution.
Advanced settings may require experimentation for optimal output.
📚 How to Use LongCat Multi Avatar
1. Upload or provide a URL for an image containing two speakers.
2. Upload or link audio files for each person (left and right), or use the default examples.
3. Optionally, enter a prompt to guide the video’s expressions and movements, and a negative prompt to exclude unwanted elements.
4. Select the audio mode (parallel or sequential), desired video resolution, and the number of segments.
5. Adjust advanced settings such as inference steps, guidance scales, and bounding boxes as needed.
6. Submit your inputs and wait for the model to generate and deliver your lip-synced, conversational video.
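For readers who script their generations, the steps above can be sketched as a small helper that gathers the inputs and checks them before submission. This is only an illustration: audio_type and num_segments are named on this page, but the other key names (image, audio_person1, audio_person2, resolution, prompt, negative_prompt) and the "parallel" value are placeholders, not a documented request schema.

```python
ALLOWED_RESOLUTIONS = {"480p", "720p"}      # resolutions listed on this page
ALLOWED_AUDIO_TYPES = {"parallel", "add"}   # "add" is the sequential value; "parallel" is a placeholder
MAX_SEGMENTS = 10                           # upper limit quoted in the FAQ


def build_request(image, audio_person1, audio_person2,
                  audio_type="add", resolution="480p", num_segments=1,
                  prompt="", negative_prompt=""):
    """Collect the inputs from steps 1-4 into one dictionary, with basic checks."""
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution}")
    if audio_type not in ALLOWED_AUDIO_TYPES:
        raise ValueError(f"unsupported audio_type: {audio_type}")
    if not 1 <= num_segments <= MAX_SEGMENTS:
        raise ValueError(f"num_segments must be between 1 and {MAX_SEGMENTS}")

    request = {
        "image": image,                  # hypothetical key name
        "audio_person1": audio_person1,  # hypothetical key name
        "audio_person2": audio_person2,  # hypothetical key name
        "audio_type": audio_type,
        "resolution": resolution,        # hypothetical key name
        "num_segments": num_segments,
    }
    # Optional fields are included only when the caller supplies them.
    if prompt:
        request["prompt"] = prompt
    if negative_prompt:
        request["negative_prompt"] = negative_prompt
    return request


example = build_request(
    "https://example.com/two_speakers.jpg",
    "https://example.com/person1.wav",
    "https://example.com/person2.wav",
    num_segments=2,
)
```

Validating locally before submitting avoids spending credits on a request that would be rejected for a simple typo in the resolution or segment count.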
💡 Pro Tips for LongCat Multi Avatar
Position Faces Side by Side for Best Results
LongCat Multi Avatar works best when both speakers are positioned side by side in the source image with clear facial visibility. Avoid overlapping faces or extreme angles. If you need single-speaker lip sync instead, try LongCat Single Avatar (Image + Audio) for simpler setups. Ensure both faces are well-lit and occupy at least 20-30% of the frame width each for optimal mouth movement detection and natural expression rendering.
Match Audio Quality Between Both Speakers
For the most realistic conversation videos, ensure both audio files have similar recording quality, volume levels, and background noise profiles. Mismatched audio can create jarring transitions or unnatural lip sync on one speaker. Use clear voice recordings with minimal echo or background noise. If you're working with professional voiceovers or need more control over avatar motion, consider HeyGen Digital Twin Avatar V4 for enterprise-grade voice cloning and lip sync precision.
Use Sequential Mode for Turn-Based Dialogue
When creating interview-style content or Q&A videos where speakers take turns, set audio_type to sequential (add) rather than parallel. This ensures person 1 speaks first, followed by person 2, creating a natural back-and-forth conversation flow. Parallel mode works better for simultaneous speech scenarios like debates or duets. Adjust num_segments to extend video length: the first segment runs about 5.8 seconds and each additional segment adds roughly 5 seconds, so 10 segments yield about 50 seconds.
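As a rough planning aid, the sketch below estimates how many segments a conversation needs. It uses the segment lengths quoted in the FAQ (about 5.8 seconds for the first segment, 5 seconds for each additional one) and assumes, without confirmation from this page, that sequential mode needs room for both clips back to back while parallel mode only needs the longer of the two. The function names are made up for illustration.

```python
import math

FIRST_SEGMENT_SECONDS = 5.8   # figure quoted in the FAQ
EXTRA_SEGMENT_SECONDS = 5.0   # each additional segment
MAX_SEGMENTS = 10             # upper limit quoted in the FAQ


def speech_seconds(person1_seconds, person2_seconds, audio_type):
    """Assumed total speech time: clips play back to back for "add", overlap otherwise."""
    if audio_type == "add":
        return person1_seconds + person2_seconds
    return max(person1_seconds, person2_seconds)


def segments_needed(duration_seconds):
    """Smallest num_segments whose estimated length covers duration_seconds."""
    if duration_seconds <= FIRST_SEGMENT_SECONDS:
        return 1
    remaining = duration_seconds - FIRST_SEGMENT_SECONDS
    # The small epsilon guards against floating-point noise at exact boundaries.
    extra = math.ceil(remaining / EXTRA_SEGMENT_SECONDS - 1e-9)
    needed = 1 + extra
    if needed > MAX_SEGMENTS:
        raise ValueError("audio is longer than a single generation can cover")
    return needed


# Two clips of 12 s and 10 s:
sequential = segments_needed(speech_seconds(12, 10, "add"))       # 22 s of speech
parallel = segments_needed(speech_seconds(12, 10, "parallel"))    # 12 s of speech
```

Under these assumptions the same pair of clips needs more segments in sequential mode than in parallel mode, which is worth factoring into credit estimates.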
Leverage Negative Prompts to Eliminate Artifacts
The default negative prompt already excludes common issues like blur, extra fingers, and poor quality, but you can customize it further for your specific content. Add terms like 'watermark', 'text overlay', 'subtitles', or 'logo' if you're seeing unwanted elements. For cleaner outputs with less post-processing, combine strong negative prompts with higher inference steps (40-50) and moderate audio guidance scales (3-5) to balance realism with controllability.
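If you maintain negative prompts in a script, a helper like the one below can append extra terms without duplicating ones already present. It assumes the negative prompt is a comma-separated list of terms; the base string here is a stand-in, not the model's actual default.

```python
def extend_negative_prompt(base, extra_terms):
    """Append extra terms to a comma-separated prompt, skipping case-insensitive duplicates."""
    terms = [t.strip() for t in base.split(",") if t.strip()]
    seen = {t.lower() for t in terms}
    for term in extra_terms:
        cleaned = term.strip()
        if cleaned and cleaned.lower() not in seen:
            terms.append(cleaned)
            seen.add(cleaned.lower())
    return ", ".join(terms)


# Stand-in base prompt for illustration only.
base_prompt = "blur, extra fingers, low quality"
combined = extend_negative_prompt(base_prompt, ["watermark", "Blur", "logo"])
```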
Adjust Bounding Boxes for Non-Standard Layouts
If your input image has speakers positioned vertically, at different scales, or off-center, manually define bbox_person1 and bbox_person2 using JSON coordinates. The model defaults to left-right splits, but custom bounding boxes let you specify exact regions for each face. This is critical for group photos where you want to isolate two specific people, or when working with portrait-oriented images where speakers are stacked rather than side by side.
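To make the stacked-layout case concrete, here is a minimal sketch that produces JSON strings for a side-by-side or a top-and-bottom split. It assumes the percentage-based x, y, width, height format described in the FAQ; the function name and the orientation labels are illustrative only.

```python
import json


def split_boxes(orientation):
    """Return (bbox_person1, bbox_person2) as compact JSON strings covering half the image each."""
    if orientation == "side_by_side":
        first = {"x": 0, "y": 0, "width": 50, "height": 100}     # left half
        second = {"x": 50, "y": 0, "width": 50, "height": 100}   # right half
    elif orientation == "stacked":
        first = {"x": 0, "y": 0, "width": 100, "height": 50}     # top half
        second = {"x": 0, "y": 50, "width": 100, "height": 50}   # bottom half
    else:
        raise ValueError(f"unknown orientation: {orientation}")
    compact = (",", ":")  # no spaces, matching the example format in the FAQ
    return (json.dumps(first, separators=compact),
            json.dumps(second, separators=compact))


bbox_person1, bbox_person2 = split_boxes("stacked")
```

Half-image splits are only a starting point; tighter boxes around each face may work better when other people appear in the frame.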
Start with 480p for Faster Iteration
While 720p delivers sharper output, it costs 4x the credits per second compared to 480p. During initial testing and prompt refinement, use 480p with 1-2 segments to quickly validate your image, audio pairing, and prompt effectiveness. Once you're satisfied with the composition and lip sync quality, generate the final version in 720p. For even faster audio-driven video generation without image inputs, explore LTX 2.3 Audio to Video for pure audio-to-video synthesis.
Frequently Asked Questions
What inputs does LongCat Multi Avatar need?
You need to provide an image featuring two people and one or two audio files, one for each speaker. The model also allows optional prompts for more control.
Can I control how the two speakers take turns?
Yes, you can choose between parallel (simultaneous speaking) or sequential (one after another) audio modes. You can also use prompts to guide the conversation and behaviors.
Which output resolutions are supported?
LongCat Multi Avatar supports both 480p (standard) and 720p (HD) resolutions, allowing you to select the quality that best fits your needs.
How is pricing structured?
Pricing varies by model and is based on a pay-as-you-go credit system, enabling you to pay only for the resources you use.
Can I exclude unwanted elements from the video?
Yes. By using the negative prompt feature, you can specify elements to avoid, such as blur, low quality, or distracting backgrounds, for cleaner results.
How much does a video cost at each resolution?
LongCat Multi Avatar charges 1 credit unit per second of video at 480p resolution, while 720p costs 4 credit units per second. For a 10-second video, you'll spend 10 credits at 480p versus 40 credits at 720p. The first segment generates approximately 5.8 seconds, with each additional segment adding 5 seconds. If you're producing high-volume content or testing multiple iterations, starting with 480p significantly reduces costs while still delivering professional-quality results suitable for social media and web use. Upgrade to 720p only for final deliverables requiring maximum sharpness or large-screen display.
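The per-second rates above translate into a one-line calculation. The sketch below reproduces the 10-second example; it assumes credits scale linearly with duration and ignores any rounding the billing system may apply, which this page does not describe.

```python
CREDITS_PER_SECOND = {"480p": 1, "720p": 4}  # rates quoted above


def estimate_credits(seconds, resolution):
    """Estimated credits for a video of the given length (no billing rounding applied)."""
    if resolution not in CREDITS_PER_SECOND:
        raise ValueError(f"unsupported resolution: {resolution}")
    return seconds * CREDITS_PER_SECOND[resolution]


draft_cost = estimate_credits(10, "480p")   # 10 credits
final_cost = estimate_credits(10, "720p")   # 40 credits
```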
Can I use the generated videos commercially?
Yes, all videos generated with paid credits on JAI Portal include full commercial-use rights. You can use LongCat Multi Avatar outputs in advertisements, client projects, YouTube monetized content, product demos, training materials, and any other commercial application without additional licensing fees. The pay-as-you-go model ensures you only pay for what you create, with no recurring subscription costs. This makes it cost-effective for agencies, marketers, and content creators who need flexible, on-demand video production. Always ensure your input images and audio have proper usage rights before generating derivative works.
What if the speakers are detected or positioned incorrectly?
If automatic face detection fails or positions speakers incorrectly, you can manually override the detection by providing custom bounding boxes in JSON format for bbox_person1 and bbox_person2. Specify x, y coordinates (top-left corner) plus width and height as percentages of the image dimensions. For example, {"x":0,"y":0,"width":50,"height":100} defines the left half of the image. This is essential when working with non-standard layouts, overlapping subjects, or images where faces occupy unusual positions. Test your bounding boxes with a single segment first to validate positioning before generating longer videos.
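Image editors usually report regions in pixels, so a conversion step is often needed. The following sketch converts a pixel rectangle into the percentage format described above and rejects boxes that fall outside the image. The function name and the two-decimal rounding are choices made for this example, not requirements stated on this page.

```python
def pixel_box_to_percent(left, top, box_width, box_height, image_width, image_height):
    """Convert a pixel rectangle to x/y/width/height percentages of the image size."""
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    box = {
        "x": round(100 * left / image_width, 2),
        "y": round(100 * top / image_height, 2),
        "width": round(100 * box_width / image_width, 2),
        "height": round(100 * box_height / image_height, 2),
    }
    tolerance = 1e-6  # absorbs rounding noise at the image edge
    inside = (
        box["x"] >= 0 and box["y"] >= 0
        and box["width"] > 0 and box["height"] > 0
        and box["x"] + box["width"] <= 100 + tolerance
        and box["y"] + box["height"] <= 100 + tolerance
    )
    if not inside:
        raise ValueError("bounding box must lie within the image")
    return box


# Left half of a 1920x1080 image, equivalent to the example above.
left_half = pixel_box_to_percent(0, 0, 960, 1080, 1920, 1080)
```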
How long can a generated video be?
LongCat Multi Avatar supports up to 10 segments, with the first segment producing roughly 5.8 seconds and each additional segment adding 5 seconds. This allows a maximum video length of approximately 50.8 seconds (5.8 + 9 × 5) in a single generation. For longer conversations, you can generate multiple outputs and stitch them together using video editing software. Alternatively, if you need extended single-speaker content, LongCat Single Avatar (Image + Audio) may offer different segment configurations. Keep in mind that longer videos consume proportionally more credits: at 4 credits per second, a 10-segment video at 720p costs roughly 203 credits, depending on final duration.
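The segment arithmetic can be checked with a short calculation. This sketch uses the approximate figures quoted above (5.8 seconds, then 5 seconds per extra segment) and the 720p rate of 4 credits per second; actual output lengths may differ slightly.

```python
FIRST_SEGMENT_SECONDS = 5.8
EXTRA_SEGMENT_SECONDS = 5.0
MAX_SEGMENTS = 10


def video_seconds(num_segments):
    """Estimated video length for a given segment count."""
    if not 1 <= num_segments <= MAX_SEGMENTS:
        raise ValueError(f"num_segments must be between 1 and {MAX_SEGMENTS}")
    return FIRST_SEGMENT_SECONDS + EXTRA_SEGMENT_SECONDS * (num_segments - 1)


longest = video_seconds(MAX_SEGMENTS)   # about 50.8 seconds
cost_720p = longest * 4                 # about 203 credits at 4 credits per second
```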
What does audio_guidance_scale do?
The audio_guidance_scale parameter controls how strongly the model responds to audio input when generating mouth movements. Lower values (2-3) produce subtle, natural lip sync suitable for calm conversations, while higher values (6-8) create more exaggerated mouth movements that can appear overly animated or cartoonish. The default setting of 4 balances realism with clear articulation. If your audio has soft or mumbled speech, increasing the guidance scale helps ensure visible mouth movements. Conversely, if you're seeing unnatural jaw motions or overly dramatic expressions, reduce the scale to 2-3 for more subdued realism. Experiment across 2-3 test generations to find your ideal setting.
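One way to organize those test generations is to hold every other setting fixed and vary only the guidance scale. The sketch below builds such a set of settings; audio_guidance_scale and num_segments are named on this page, while the resolution key and the idea of passing settings as a dictionary are assumptions made for illustration.

```python
def guidance_sweep(base_settings, scales=(3, 4, 5)):
    """Return one settings dict per scale, leaving base_settings unchanged."""
    return [{**base_settings, "audio_guidance_scale": scale} for scale in scales]


# Cheap test configuration: one segment at 480p ("resolution" is a hypothetical key).
base = {"num_segments": 1, "resolution": "480p"}
test_runs = guidance_sweep(base)
```

Keeping the sweep at a single 480p segment limits each test to the lowest per-second rate while you compare mouth movement across scales.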
⚖️ How LongCat Multi Avatar Compares
LongCat Multi Avatar stands out on JAI Portal as the only model purpose-built for realistic two-person conversational videos from a single image and dual audio inputs. While LongCat Single Avatar (Image + Audio) handles solo speaker scenarios and LongCat Single Avatar (Audio Only) generates avatars without reference images, Multi Avatar uniquely enables authentic dialogue between two characters with synchronized lip movements and natural turn-taking.
For users seeking enterprise-grade avatar creation with voice cloning, HeyGen Digital Twin Avatar V4 offers higher production polish but at significantly higher credit costs and longer generation times. If you need pure audio-to-video synthesis without avatar constraints, LTX 2.3 Audio to Video provides faster turnaround for abstract or non-human visuals.
Choose LongCat Multi Avatar when you need cost-effective, flexible two-person video generation for interviews, debates, explainer content, or social media dialogues, especially when working with existing photos and recorded audio. Its pay-per-use pricing and granular controls make it ideal for creators who need professional results without subscription commitments.
