Speech-to-Creative Pipeline: speech recognition + language swap + lip-sync
A single creative is scaled to 5+ languages and GEOs automatically. Whisper transcribes the speech → GPT-4o translates and rewrites it for the market and persona → ElevenLabs clones the voice with the correct accent → wav2lip syncs lip movement to the new audio. What used to take a recording studio and a week of work now happens in an evening.
Spy data exists but doesn't scale by hand
In performance marketing, spy services (AdHeart, AdSpy, AdsLibrary, Anstrex) surface thousands of competitors' working creatives, but you can only borrow the parts that won't trigger a Copyright Strike. Everything covered by the DMCA has to be recreated: the narrator's voice, the background, the actor's face, the lip-sync.
The workflow used to look like this: watch spy video → write transcript by hand → hand to copywriter to rewrite for your own offer → hire voice actor via Fiverr → wait 1-3 days → edit in Final Cut → iterate. 4-6 hours of work per variation, and that's without lip-sync.
With budgets of $2-5k/day on A/B tests, production-cycle speed is a direct competitive advantage. With 8-10 funnels running simultaneously, each needing 20-30 fresh creatives per week, the manual workflow stops working.
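To make the scale concrete, here is a rough calculation built only from the ranges quoted above. It assumes the 4-6 hours per variation applies to every creative, and the 40-hour work week is an added assumption.

```python
# Back-of-the-envelope load of the manual workflow, using the ranges above.
HOURS_PER_WEEK_FTE = 40  # assumption: one full-time editor works 40 h/week


def weekly_manual_hours(funnels: int, creatives_per_funnel: int, hours_per_variation: int) -> int:
    """Total production hours per week if every creative is made by hand."""
    return funnels * creatives_per_funnel * hours_per_variation


low = weekly_manual_hours(funnels=8, creatives_per_funnel=20, hours_per_variation=4)
high = weekly_manual_hours(funnels=10, creatives_per_funnel=30, hours_per_variation=6)

print(f"{low}-{high} hours/week")  # 640-1800 hours/week
print(f"{low / HOURS_PER_WEEK_FTE:.0f}-{high / HOURS_PER_WEEK_FTE:.0f} full-time people")  # 16-45 full-time people
```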
4-stage pipeline: STT → LLM → TTS → lip-sync
Whisper transcription of spy videos
Either self-hosted whisper-large-v3 or OpenAI whisper-1 via API. Both handle Russian, English, Spanish, and Balkan and Turkic languages. The output is SRT with timecodes, so we know exactly which phrase falls on which second.
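A minimal sketch of the SRT parsing step, assuming the transcript is already available as SRT text (for example, from the OpenAI transcription endpoint called with `response_format="srt"`). The `Segment` class, `parse_srt` helper, and sample text are illustrative names, not part of any library.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    index: int
    start: float  # seconds from the start of the video
    end: float
    text: str

    @property
    def duration(self) -> float:
        return self.end - self.start


_TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:00:02,800 to seconds."""
    match = _TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(srt: str) -> list[Segment]:
    """Split SRT text into segments; blocks without index, timing and text are skipped."""
    segments = []
    for block in re.split(r"\n\s*\n", srt.replace("\r\n", "\n").strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) < 3:
            continue
        start, end = lines[1].split("-->")
        segments.append(Segment(int(lines[0]), _to_seconds(start), _to_seconds(end), " ".join(lines[2:])))
    return segments


SAMPLE_SRT = """1
00:00:00,000 --> 00:00:02,800
Tired of waking up exhausted?

2
00:00:02,800 --> 00:00:06,500
I was too, until I tried this.
"""

segments = parse_srt(SAMPLE_SRT)
for seg in segments:
    print(f"{seg.start:.1f}-{seg.end:.1f}s: {seg.text}")
```

The per-segment `duration` is what later stages need: it is the time budget each rewritten phrase has to fit.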
GPT — translation, localization, persona rewrite
GPT-4o does three tasks in one pass: (a) breaks the transcript into compositional blocks — hook (first 3 sec), pain, proof, CTA; (b) translates into the target language (Russian / English / Serbian / Polish / Turkish) with cultural nuance — not literal but as "a native speaker would say it"; (c) rewrites for the target offer, persona, and GEO, preserving the rhythm and emotional triggers of the original.
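The three tasks above map naturally onto one structured response. The sketch below shows one possible JSON schema for it plus a local sanity check. The field names, the `validate_rewrite` helper, and the sample response are all hypothetical, and the model call itself is left out.

```python
import json

BLOCK_ROLES = ("hook", "pain", "proof", "cta")

# Hypothetical schema that could be handed to the model's structured-output feature.
REWRITE_SCHEMA = {
    "type": "object",
    "properties": {
        "language": {"type": "string"},
        "blocks": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "role": {"type": "string", "enum": list(BLOCK_ROLES)},
                    "start": {"type": "number"},
                    "end": {"type": "number"},
                    "source_text": {"type": "string"},
                    "rewritten_text": {"type": "string"},
                },
                "required": ["role", "start", "end", "source_text", "rewritten_text"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["language", "blocks"],
    "additionalProperties": False,
}


def validate_rewrite(raw: str, hook_seconds: float = 3.0) -> dict:
    """Parse the model's JSON and check block roles, ordering and the hook window."""
    data = json.loads(raw)
    required = REWRITE_SCHEMA["properties"]["blocks"]["items"]["required"]
    blocks = data["blocks"]
    if not blocks:
        raise ValueError("no blocks returned")
    for block in blocks:
        missing = [key for key in required if key not in block]
        if missing:
            raise ValueError(f"block is missing {missing}")
        if block["role"] not in BLOCK_ROLES:
            raise ValueError(f"unknown role {block['role']!r}")
        if block["end"] <= block["start"]:
            raise ValueError("block has non-positive duration")
    hook = blocks[0]
    if hook["role"] != "hook" or hook["end"] > hook_seconds:
        raise ValueError("first block must be a hook inside the opening seconds")
    for prev, cur in zip(blocks, blocks[1:]):
        if cur["start"] < prev["end"]:
            raise ValueError("blocks overlap in time")
    return data


# Hand-written stand-in for a model response.
raw = json.dumps({
    "language": "en",
    "blocks": [
        {"role": "hook", "start": 0.0, "end": 2.8, "source_text": "Tired of waking up exhausted?", "rewritten_text": "Still dragging yourself out of bed?"},
        {"role": "pain", "start": 2.8, "end": 6.5, "source_text": "I was too, until I tried this.", "rewritten_text": "So was I, before I found this."},
        {"role": "proof", "start": 6.5, "end": 11.0, "source_text": "Two weeks in, I sleep through the night.", "rewritten_text": "Two weeks later I sleep all night."},
        {"role": "cta", "start": 11.0, "end": 13.5, "source_text": "Tap the link to try it.", "rewritten_text": "Tap below and try it tonight."},
    ],
})
plan = validate_rewrite(raw)
print([block["role"] for block in plan["blocks"]])
```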
The prompts are few-shot, with ready "before → after" examples per language and niche. The main constraint is preserving timecodes: each phrase's duration must match the original or lip-sync will break. The output is 10-30 variants × 5+ languages, ranked by predicted engagement.
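One cheap way to enforce the duration constraint before spending TTS credits is to estimate how long each rewritten phrase will take to speak. The sketch below uses a characters-per-second heuristic. The rates and the 10% tolerance are hypothetical placeholders that would need calibrating against real generated audio.

```python
# Hypothetical speaking rates in characters per second; calibrate per voice and language.
CHARS_PER_SECOND = {"ru": 14.0, "en": 15.0, "sr": 14.0, "pl": 14.0, "tr": 14.0}


def estimated_seconds(text: str, lang: str) -> float:
    """Rough spoken duration of a phrase in the given language."""
    return len(text) / CHARS_PER_SECOND[lang]


def fits_slot(text: str, lang: str, slot_seconds: float, tolerance: float = 0.10) -> bool:
    """True if the estimate is within +/- tolerance of the original phrase duration."""
    return abs(estimated_seconds(text, lang) - slot_seconds) <= tolerance * slot_seconds


def filter_variants(variants: list[str], lang: str, slot_seconds: float) -> list[str]:
    """Keep only the rewrites that should fit the original timecode slot."""
    return [text for text in variants if fits_slot(text, lang, slot_seconds)]


hook_slot = 2.8  # seconds, taken from the original SRT segment
candidates = [
    "Tired?",
    "Tired of waking up exhausted every day?",
    "Do you wake up every single morning feeling completely drained and worn out?",
]
kept = filter_variants(candidates, "en", hook_slot)
print(kept)
```

Variants that fail the check can be sent back to the model with an explicit length target instead of being discarded.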
ElevenLabs — voice cloning with the right accent
ElevenLabs Multilingual v2 supports 29 languages in a single model, and the cloned voice speaks each of them with the correct accent. This matters: if the original speaker is an American with a Southern accent, her Serbian clone won't sound robotic; it will sound like a natural native Serbian speaker.
Two strategies: (a) voice cloning, where a 30-second sample is enough for a high-quality clone; (b) stock voices for fast A/B tests by timbre (male/female, age, emotional tone). Stability / similarity / style settings are tuned per niche. The output is audio.wav matching the original's duration, which matters for the next step.
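As an illustration of per-niche tuning, the sketch below assembles a text-to-speech request for the ElevenLabs HTTP API without sending it. The preset values, niche names, and `build_tts_request` helper are hypothetical. The "similarity" setting mentioned above is passed as `similarity_boost` in the request body.

```python
ELEVENLABS_TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

# Hypothetical per-niche presets; real values come from listening tests.
NICHE_PRESETS = {
    "nutra": {"stability": 0.45, "similarity_boost": 0.80, "style": 0.30},
    "ecom": {"stability": 0.60, "similarity_boost": 0.75, "style": 0.15},
}


def build_tts_request(text: str, voice_id: str, niche: str, api_key: str) -> dict:
    """Return URL, headers and JSON body for one TTS call (nothing is sent here)."""
    return {
        "url": ELEVENLABS_TTS_URL.format(voice_id=voice_id),
        "headers": {"xi-api-key": api_key, "Content-Type": "application/json"},
        "json": {
            "text": text,
            "model_id": "eleven_multilingual_v2",
            "voice_settings": dict(NICHE_PRESETS[niche]),
        },
    }


request = build_tts_request(
    text="Still dragging yourself out of bed?",
    voice_id="VOICE_ID_PLACEHOLDER",  # placeholder, not a real voice ID
    niche="nutra",
    api_key="YOUR_API_KEY",
)
print(request["url"])
print(request["json"]["voice_settings"])
```

Sending it is one call with an HTTP client, for example `requests.post(request["url"], headers=request["headers"], json=request["json"])`. The response body is audio bytes, and converting them to audio.wav and checking the duration are separate steps not shown here.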
wav2lip — lip movement synchronization
This is the stage where the "magic" appears: the actress in the video starts speaking the new language as if the video had been re-shot. wav2lip takes the source video plus the new audio.wav and redraws the mouth region frame by frame so the lip movement matches the new speech. It needs a GPU, but that means hours of compute, not days at a studio.
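A sketch of how this step might be launched from Python, assuming a local checkout of the open-source Wav2Lip repository and a downloaded checkpoint. The directory name `Wav2Lip`, the checkpoint path, and the file names are assumptions about the local setup.

```python
import subprocess
import sys


def build_wav2lip_cmd(face_video: str, audio: str, outfile: str,
                      checkpoint: str = "checkpoints/wav2lip_gan.pth") -> list[str]:
    """Command line for Wav2Lip's inference.py script."""
    return [
        sys.executable, "inference.py",
        "--checkpoint_path", checkpoint,
        "--face", face_video,   # source video with the talking head
        "--audio", audio,       # new voice track from the TTS stage
        "--outfile", outfile,
    ]


def run_wav2lip(cmd: list[str], repo_dir: str = "Wav2Lip") -> None:
    """Run inference inside the repository checkout; raises if the process fails."""
    subprocess.run(cmd, cwd=repo_dir, check=True)


cmd = build_wav2lip_cmd("source.mp4", "audio.wav", "results/localized.mp4")
print(" ".join(cmd[1:]))
# run_wav2lip(cmd)  # uncomment on a machine with a GPU and the repo set up
```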
Simple case (voice-over / off-screen narrator): ffmpeg just replaces the audio track. Complex case (talking head): wav2lip or SadTalker handles face sync. The output is a finished mp4 for ad platforms (FB Ads / TikTok / VK Ads / Yandex Direct).
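For the simple case, the audio swap is a single ffmpeg call. The sketch below builds that command and picks a route per video. The `choose_route` helper and the file names are illustrative, and the AAC audio codec is an assumption about what the target platforms accept.

```python
import subprocess


def build_audio_replace_cmd(video: str, audio: str, outfile: str) -> list[str]:
    """ffmpeg command: keep the original video stream, swap in the new audio track."""
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-i", audio,
        "-map", "0:v:0",   # video from the first input
        "-map", "1:a:0",   # audio from the second input
        "-c:v", "copy",    # no video re-encode
        "-c:a", "aac",
        "-shortest",       # stop at the shorter of the two streams
        outfile,
    ]


def choose_route(talking_head: bool) -> str:
    """Voice-over videos only need an audio swap; talking heads need lip-sync."""
    return "lipsync" if talking_head else "audio_replace"


cmd = build_audio_replace_cmd("source.mp4", "audio.wav", "localized.mp4")
print(choose_route(talking_head=False), "->", " ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment where ffmpeg is installed
```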
Technologies and infrastructure
- OpenAI Whisper API (whisper-1)
- whisper-large-v3 (self-hosted, at large volumes)
- SRT parsing for timecodes
- GPT-4o / Claude Sonnet 4.5
- few-shot prompts with hook → rewrite examples
- structured outputs (JSON schema)
- ElevenLabs Multilingual v2
- Voice cloning (30-second sample is enough)
- Voice style settings: stability / similarity / style
- ffmpeg for audio-replace (voice-over)
- wav2lip / SadTalker for talking head
- Python orchestration + task queue
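How the orchestration layer might fan one master video out into variants × languages is sketched below. A thread pool stands in for a real task queue, and the stage functions are stubs that only tag a string, so every name here is illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product
from typing import Callable


@dataclass(frozen=True)
class Job:
    source_video: str
    language: str
    variant: int
    talking_head: bool


Stage = Callable[[Job, str], str]


def plan_jobs(source_video: str, languages: list[str], variants: int, talking_head: bool) -> list[Job]:
    """One job per (language, variant) pair."""
    return [Job(source_video, lang, n, talking_head)
            for lang, n in product(languages, range(1, variants + 1))]


def stage_order(job: Job) -> list[str]:
    """STT -> LLM -> TTS, then lip-sync for talking heads or a plain audio swap."""
    return ["stt", "llm", "tts", "lipsync" if job.talking_head else "audio_replace"]


def run_job(job: Job, stages: dict[str, Stage]) -> str:
    artifact = job.source_video
    for name in stage_order(job):
        artifact = stages[name](job, artifact)
    return artifact


def run_all(jobs: list[Job], stages: dict[str, Stage], workers: int = 4) -> list[str]:
    """Run jobs concurrently; results come back in the same order as the jobs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda job: run_job(job, stages), jobs))


def stub(name: str) -> Stage:
    """Placeholder stage that records its name instead of doing real work."""
    def stage(job: Job, artifact: str) -> str:
        return f"{artifact}>{name}[{job.language}#{job.variant}]"
    return stage


STAGES = {name: stub(name) for name in ("stt", "llm", "tts", "lipsync", "audio_replace")}
jobs = plan_jobs("master.mp4", ["ru", "en", "sr", "pl", "tr"], variants=2, talking_head=True)
results = run_all(jobs, STAGES)
print(len(results), results[0])
```

In a real setup the transcription would run once per source video and be cached, since only the later stages differ by language and variant.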
What changes in the funnel
per creative variation (audio-replace, no lip-sync)
variants × languages (1 actress → 5+ GEOs without re-shoot)
Whisper + GPT + ElevenLabs (at API rates, no lip-sync)
The main effect is iteration speed. An A/B test of 20 hooks, instead of 2, launches in an evening rather than a week. Winning combinations are identified in the first 24-48 hours, and losers are turned off before they burn the budget. CPI drops 15-30% from a better hook-to-persona match.
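The "turn off losers early" rule can be expressed as a simple CPI comparison. The sketch below is one possible rule, not a description of any ad platform's logic. The minimum spend, the 1.5× cutoff, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CreativeStats:
    name: str
    spend: float   # in account currency
    installs: int

    @property
    def cpi(self) -> float:
        """Cost per install; infinite when there are no installs yet."""
        return self.spend / self.installs if self.installs else float("inf")


def losers(stats: list[CreativeStats], min_spend: float = 50.0, max_ratio: float = 1.5) -> list[str]:
    """Creatives with enough spend to judge whose CPI exceeds max_ratio x the best CPI."""
    eligible = [s for s in stats if s.spend >= min_spend]
    if not eligible:
        return []
    best = min(s.cpi for s in eligible)
    return [s.name for s in eligible if s.cpi > best * max_ratio]


day_one = [
    CreativeStats("hook_a", spend=100.0, installs=50),  # CPI 2.00
    CreativeStats("hook_b", spend=100.0, installs=20),  # CPI 5.00
    CreativeStats("hook_c", spend=30.0, installs=0),    # too little spend to judge
    CreativeStats("hook_d", spend=100.0, installs=40),  # CPI 2.50
]
to_pause = losers(day_one)
print(to_pause)  # ['hook_b']
```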
Where it works, where it doesn't
Works:
- Performance agencies with 5+ creative staff
- Affiliate teams (nutra, e-com, sweepstakes, white GEOs)
- In-house e-com marketing with regular UGC reels
- Launches in 5+ countries simultaneously (multi-language from one master script)
- Podcasters / infobusiness for short-form content cuts
Doesn't work:
- Direct copying of other people's creatives (DMCA / Copyright Strike)
- Regulated niches (medicine, pharma, finance): these need media lawyers for compliance
- Voice cloning without the voice owner's consent (banned by EU AI Act)
- Long-form (10+ minute videos): TTS cost and editing time are comparable to an expensive human actor
Ethical note: the pipeline is intended for scaling your own ideas and original scripts. Using others' videos and voices without permission violates Copyright and the AI Act. I transcribe others' creatives as research to understand the market, then create my own script, my own audio, and my own video.
Audit for 5,000 ₽, with a concrete report and cost estimate
I'll tell you what to implement in your business first, what the payback will be, and whether your task needs AI at all (sometimes it doesn't).
Or just send your question: I'll reply within 2 hours