LongCat Multi Avatar

Create realistic lip-synced videos of two people having conversations.

Inputs

Input Image


Audio (Person 1)

Audio (Person 2)

Output

~30-60 seconds

Upload your image and audio, and sync lips in seconds

10,000+ generations this month

📄 About LongCat Multi Avatar
Key Features
Generates highly realistic, lip-synced videos of two people from a single image and dual audio inputs.
Supports both parallel (simultaneous speaking) and sequential (one after another) audio modes for flexible conversations.
Customizable prompts and negative prompts allow for guided video generation and exclusion of unwanted elements.
Adjustable video resolution options (480p and 720p) and segment lengths to fit various project requirements.
Bounding box controls enable precise positioning and cropping of each speaker within the frame.
Fine-tuning parameters such as inference steps and guidance scales for optimal quality and motion realism.
Built-in safety checker and robust error handling for reliable and appropriate outputs.
💡 Use Cases
Creating virtual interviews or two-person dialogue videos for podcasts and YouTube channels.
Producing AI-driven explainer or educational videos featuring conversational scenarios.
Generating realistic avatars for marketing campaigns, product demos, or customer service bots.
Powering interactive storytelling or role-play content with dynamic character interactions.
Building demo videos for voice AI, speech synthesis, or multilingual applications.
Developing social media content with engaging, talking avatar duets or conversations.
Enabling remote team presentations or announcements with personalized, animated avatars.
🎯 Best For
Content creators, educators, marketers, and AI enthusiasts seeking realistic two-person video generation from images and audio.
👍 Pros
Delivers ultra-realistic, synchronized lip movements and natural facial dynamics.
Supports flexible audio arrangements for authentic conversations or duets.
Highly customizable with advanced prompt and bounding box controls.
Easy to use with simple file uploads or URLs—no technical expertise required.
Multiple output resolutions and segment options to fit diverse needs.
Integrated safety features help maintain output quality and appropriateness.
⚠️ Considerations
Requires high-quality input images for best results.
Primarily designed for two-person scenarios; not suited for group conversations.
Generation times may vary depending on video length and resolution.
Advanced settings may require experimentation for optimal output.
📚 How to Use LongCat Multi Avatar
1. Upload or provide a URL for an image containing two speakers.
2. Upload or link audio files for each person (left and right), or use the default examples.
3. Optionally, enter a prompt to guide the video’s expressions and movements, and a negative prompt to exclude unwanted elements.
4. Select the audio mode (parallel or sequential), the desired video resolution, and the number of segments.
5. Adjust advanced settings such as inference steps, guidance scales, and bounding boxes as needed.
6. Submit your inputs and wait for the model to generate and deliver your lip-synced, conversational video.
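The steps above can be sketched as a request-building workflow. Note that the endpoint URL and all parameter names below (`image_url`, `audio_url_1`, `audio_mode`, etc.) are assumptions for illustration, not the official API of LongCat Multi Avatar; consult the actual model documentation for the real schema.

```python
# Hypothetical sketch of the workflow described above.
# The endpoint and parameter names are illustrative assumptions.
import json
from urllib import request

API_URL = "https://example.com/longcat-multi-avatar"  # placeholder endpoint


def build_payload(
    image_url: str,
    audio_url_1: str,
    audio_url_2: str,
    prompt: str = "",
    negative_prompt: str = "",
    audio_mode: str = "sequential",  # "parallel" or "sequential"
    resolution: str = "480p",        # "480p" or "720p"
    num_segments: int = 1,
) -> dict:
    """Assemble a generation request from the inputs listed in the steps above."""
    if audio_mode not in ("parallel", "sequential"):
        raise ValueError("audio_mode must be 'parallel' or 'sequential'")
    if resolution not in ("480p", "720p"):
        raise ValueError("resolution must be '480p' or '720p'")
    return {
        "image_url": image_url,
        "audio_url_1": audio_url_1,
        "audio_url_2": audio_url_2,
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "audio_mode": audio_mode,
        "resolution": resolution,
        "num_segments": num_segments,
    }


def prepare_request(payload: dict) -> request.Request:
    """Prepare (but do not send) an HTTP POST carrying the JSON payload."""
    return request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_payload(
    image_url="https://example.com/two-speakers.jpg",
    audio_url_1="https://example.com/person1.wav",
    audio_url_2="https://example.com/person2.wav",
    prompt="two people chatting naturally, subtle head movement",
    negative_prompt="blur, low quality",
    audio_mode="parallel",
    resolution="720p",
)
req = prepare_request(payload)
```

In practice you would send the request and poll for the finished video, since generation is asynchronous on most hosted model platforms.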
Frequently Asked Questions
Q: What inputs do I need to provide?
A: An image featuring two people and one or two audio files—one for each speaker. The model also allows optional prompts for more control.

Q: Can I control how the conversation plays out?
A: Yes, you can choose between parallel (simultaneous speaking) and sequential (one after another) audio modes. You can also use prompts to guide the conversation and behaviors.

Q: What output resolutions are supported?
A: LongCat Multi Avatar supports both 480p (standard) and 720p (HD) resolutions, allowing you to select the quality that best fits your needs.

Q: How is the model priced?
A: Pricing varies by model and is based on a pay-as-you-go credit system, enabling you to pay only for the resources you use.

Q: Can I avoid unwanted artifacts in the output?
A: Absolutely. By using the negative prompt feature, you can specify elements to avoid—such as blur, low quality, or distracting backgrounds—for cleaner results.
