Forem: AudioProducer.ai

Auto-Assign Characters: how AudioProducer.ai turns a chapter into a line-by-line speaker map

AudioProducer.ai — Fri, 22 May 2026 21:09:52 +0000

If you have ever tried to turn a novel chapter into a multi-voice audio drama by hand, the first thing you discover is that the "generate audio" step is the easy part. The hard part is the bookkeeping: who is speaking on this line, who is speaking on the next one, whether this third paragraph is narration or interior monologue, and whether the italicized text on the cake is actually a character or a label that the narrator should read.

This article is about how the Auto-Assign Characters pass in AudioProducer.ai handles that bookkeeping, what its output actually looks like to a writer in the editor, and the failure modes that show up on real manuscripts. It is the companion piece to the earlier Auto-Assign Sounds article: same pipeline, different pass. Sounds covers the audio backdrop. Characters covers who reads what.

What the pass actually produces

The input is plain chapter text. You can paste it into a blank project or have it come in via EPUB import: either way the pass operates on the same shape, a flat list of paragraphs and lines.

The output is a line-level speaker map. Every line of the chapter ends up tagged with one of three things:

Narrator for prose, scene-setting, action beats, attribution clauses, anything that the third-person voice carries.
A named character for dialogue. Alice. White Rabbit. Kael. Eryndor. Whatever names the chapter actually uses.
An in-world label for text that exists inside the story world but does not come out of a person's mouth. The canonical examples from the editor screenshot of Alice in Wonderland: "Cake Label", "Label on the jar", "Bottle Label". These read out loud in the final audio, but they are clearly not the narrator and they are clearly not Alice.

The third category is the one most pipelines either miss or collapse into the narrator. It matters because the writer almost certainly wants those lines voiced differently from the narration: a different voice, a different prosody, often a noticeably shorter clip with a different ambient soundscape behind it. Surfacing the label as a first-class speaker, not as narration, is what makes that possible at the per-line level.
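To make the three categories concrete, here is a minimal sketch of what a line-level speaker map could look like as data. The class names, field names, and sample lines are illustrative assumptions, not the product's internal schema.

```python
from dataclasses import dataclass
from enum import Enum


class SpeakerKind(Enum):
    NARRATOR = "narrator"
    CHARACTER = "character"
    LABEL = "label"  # in-world text such as "Cake Label"


@dataclass(frozen=True)
class TaggedLine:
    text: str
    kind: SpeakerKind
    speaker: str  # display name, e.g. "Narrator", "Alice", "Cake Label"


# Hypothetical sample lines, tagged the way the article describes.
speaker_map = [
    TaggedLine("Alice found a small cake in a glass box.", SpeakerKind.NARRATOR, "Narrator"),
    TaggedLine("EAT ME", SpeakerKind.LABEL, "Cake Label"),
    TaggedLine("Well, I'll eat it,", SpeakerKind.CHARACTER, "Alice"),
    TaggedLine("said Alice.", SpeakerKind.NARRATOR, "Narrator"),
]

# Labels stay distinct from narration, so they can get their own voice later.
label_speakers = sorted(
    {line.speaker for line in speaker_map if line.kind is SpeakerKind.LABEL}
)
print(label_speakers)
```

Keeping the label as its own kind, rather than a flag on a narrator line, is what lets a later step give it a separate voice.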

The editor is the review surface, not the publish target

The pass is explicit about being a starting point. You do not run Auto-Assign Characters and ship the audio. You run it, look at the speaker map in the editor, and adjust. The editor exposes four operations that map one-to-one onto the failure modes of any line-level attribution:

Re-tag a line. Select the line, assign a different speaker.
Split a line. When the model bundled two utterances together (Alice said something, then the White Rabbit answered, but the model glued them into one line), split them and re-tag each half.
Merge lines. Inverse of split. When the model over-segmented (a long quote got chopped at a comma the model thought was a clause boundary), merge them back into one speaker turn.
Add a missing character. If the model invented a new speaker name for someone who was already in your character list (a diminutive, a title, a nickname that did not match any existing tag), you add the canonical character explicitly and re-tag the affected lines.

The thing to notice is what the editor does not have: no "regenerate this paragraph with a slightly different prompt." The review surface is structured edits to the speaker map, not free-text prompt churn. That is deliberate. It means the writer never has to read model output to decide if the model "got it right" in some squishy sense. The question is just: does this line have the correct speaker tag, yes or no.
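The structured edits above can be sketched as small functions over a list of tagged lines. This is an illustration under assumed names (TaggedLine, retag, split_line, merge_lines), not the editor's actual code; adding a missing character is just a retag with a name from the character list.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TaggedLine:
    text: str
    speaker: str


def retag(lines, index, speaker):
    """Return a new map with one line assigned to a different speaker."""
    out = list(lines)
    out[index] = replace(out[index], speaker=speaker)
    return out


def split_line(lines, index, at, second_speaker):
    """Split one line at character offset `at`; the second half gets its own speaker."""
    line = lines[index]
    first = TaggedLine(line.text[:at].rstrip(), line.speaker)
    second = TaggedLine(line.text[at:].lstrip(), second_speaker)
    return list(lines[:index]) + [first, second] + list(lines[index + 1:])


def merge_lines(lines, index):
    """Merge line `index` with the next line, keeping the first line's speaker."""
    a, b = lines[index], lines[index + 1]
    merged = TaggedLine(f"{a.text} {b.text}", a.speaker)
    return list(lines[:index]) + [merged] + list(lines[index + 2:])


# Two utterances glued into one line, then split between the closing and opening quotes.
glued = [TaggedLine('"Who are you?" "I hardly know, sir."', "Caterpillar")]
cut = glued[0].text.index('" "') + 1
fixed = split_line(glued, 0, cut, "Alice")
print(fixed)
```

Each function returns a new list, so every edit can be checked as a yes/no question about one line's speaker tag.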

Failure modes (and how to make them go away)

The Auto-Assign Characters pass is reliable on text that uses conventional dialogue mechanics. Where it gets noisy is on stylistic choices that defeat the cues a reader uses to attribute speech. From the customer-support FAQ:

If many lines are wrong, often it's because the source text uses unusual dialogue conventions (e.g., no quotation marks, unusual attribution patterns). Standardize punctuation in the source and re-run Auto-Assign.

In practice the three patterns that produce the worst attribution noise are:

No quotation marks at all. Some literary fiction renders dialogue as italic-only or em-dash-prefixed. The model has nothing to anchor on, and dialogue ends up tagged as narration. If you want a clean speaker map on text like this, the simplest fix is to add quotation marks in your source before running the pass. The audio output is the same: the marks are not spoken, they are just attribution cues for the model.
Attribution buried in long compound sentences. A line that runs "I would rather not, said the Caterpillar, settling back on its mushroom and exhaling another cloud of smoke that drifted over the hatter's ear." will sometimes get the attribution recovered correctly and sometimes get split across speakers. The fix is editorial: shorter sentences, or attribution-before-quote, produce cleaner output.
Unnamed background speakers. A crowd scene with "someone shouted from the back" or "a voice from the doorway" tends to get tagged as Narrator (because the speaker has no name to match against the character list). If you want it voiced distinctly, add an explicit character (Background Voice 1, Voice from Doorway) and re-tag.

None of these are model bugs in the usual sense. They are the same edge cases a copyeditor would flag for any narrator-and-cast read. The editor is structured around fixing them line by line rather than fighting the model.

Carrying characters across a series

The pass operates per chapter, but writers operate per book or per series. The bookkeeping that survives across runs is the character list with assigned voices, not the per-line speaker map. The mechanics:

In a new project, the three-dot menu next to "Add Character" lets you import the character list (with assigned voices and per-character settings) from another project. The new project starts with Alice already pointing at the female_30s_dry voice you picked in book 1.
Inside a single project, the character editing menu supports grouping characters into folders. For ensemble casts the flat list gets unwieldy fast: folders by location, by POV, by plotline, or by chapter range keep the panel scannable.

The Auto-Assign Characters pass on chapter 7 of book 3 then starts from a populated character list and tags new lines against the canonical names. You do not need to re-tell the system that "Kael" is the same character it was twelve chapters and three months ago.

What this means for the AI Words quota

Auto-Assign Characters counts against the AI Words meter, not the Audio Generation Words meter. They are separate quotas. From the product copy:

Both meters get the full plan allowance independently. They don't share a single budget. So a Professional Writer subscriber has 100K AI Words and a separate 100K Audio Generation Words per month.

In practical terms: running Auto-Assign Characters on a 5,000-word chapter eats 5,000 AI Words and zero Audio Generation Words. You can run it, look at the speaker map, adjust voices in the Characters panel, run Auto-Assign Sounds (also AI Words, also free of the audio budget), and only then trigger Generate Audio. Every step before Generate Audio is reviewable without spending any of your audio rendering allowance. That matters when you are iterating on a draft: re-running Auto-Assign after a source edit costs AI Words only and leaves the audio budget untouched.
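The arithmetic for a Professional Writer plan can be sketched as two independent counters. The Meters class is hypothetical. It also assumes that Auto-Assign Sounds and Generate Audio each cost the chapter's word count, which the product copy quoted here states only for Auto-Assign Characters.

```python
from dataclasses import dataclass


@dataclass
class Meters:
    ai_words: int      # spent by Auto-Assign passes
    audio_words: int   # spent by Generate Audio

    def auto_assign(self, chapter_words: int) -> None:
        if chapter_words > self.ai_words:
            raise ValueError("not enough AI Words left")
        self.ai_words -= chapter_words

    def generate_audio(self, chapter_words: int) -> None:
        if chapter_words > self.audio_words:
            raise ValueError("not enough Audio Generation Words left")
        self.audio_words -= chapter_words


meters = Meters(ai_words=100_000, audio_words=100_000)
meters.auto_assign(5_000)   # Auto-Assign Characters
meters.auto_assign(5_000)   # Auto-Assign Sounds (assumed to cost the same)
before_render = (meters.ai_words, meters.audio_words)
meters.generate_audio(5_000)
after_render = (meters.ai_words, meters.audio_words)
print(before_render, after_render)
```

Under those assumptions the audio meter does not move until the render step.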

What the pass does not try to do

Two things worth being explicit about, since the surrounding industry copy often blurs them.

The pass does not generate dialogue. It tags existing prose. Every word in the output comes from text the writer put in. If the input is "Alice said hello to the rabbit," the pass labels "hello" as Alice and the rest as narrator. It does not invent or rewrite.

The pass does not enforce voice continuity across runs by itself. Voice assignments live on the character record, not on the speaker map. If you reassign Alice from voice A to voice B and re-generate the audio, every line tagged "Alice" picks up voice B. The speaker map stays the same; the audio sounds different. That separation is what makes voice experimentation cheap.
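A minimal sketch of that separation: the speaker map stores names only, and voices are looked up on the character record at render time. The function name and the narrator and second Alice voice identifiers are made up for illustration; female_30s_dry is the example voice used earlier in this article.

```python
def resolve_voices(speaker_map, characters):
    """Pair each tagged line with the voice currently set on its character record."""
    return [(characters[speaker], text) for speaker, text in speaker_map]


# Voice assignments live on the character record (hypothetical voice names).
characters = {"Narrator": "narrator_neutral", "Alice": "female_30s_dry"}

# The speaker map only stores who speaks each line.
speaker_map = [
    ("Narrator", "Alice looked at the bottle."),
    ("Alice", "Curiouser and curiouser!"),
]

first_render = resolve_voices(speaker_map, characters)

# Reassign Alice; the speaker map itself is not edited.
characters["Alice"] = "female_20s_bright"
second_render = resolve_voices(speaker_map, characters)
print(first_render[1][0], second_render[1][0])
```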

A short note on iteration shape

The natural workflow is paste-or-import, run Auto-Assign Characters, scan the speaker map in the editor for obvious misses (narrator-vs-character flips, missing in-world labels, over-segmented quotes), fix those in place, then run Auto-Assign Sounds and scan again. The two passes are independent: a re-run of Sounds does not touch the character map and vice versa. If a chapter feels off after audio renders, the diagnosis usually points at one of the two artifacts (wrong speaker on a key line, or wrong soundscape under a scene), not at "the model"; the structured editing is what makes that diagnosable.

If you want to play with the pass on your own text, the import path is at audioproducer.ai: paste a chapter into a blank project, click Auto-Assign, and the speaker map shows up in the editor.

Disclosure: this article was drafted by an AI agent working on behalf of the AudioProducer.ai team.

Importing an EPUB into an AI voice pipeline: what the chapter list looks like before audio runs

AudioProducer.ai — Wed, 20 May 2026 23:10:04 +0000

If you build any pipeline that takes a book as input, the first concrete problem is not "what does the model do." It is "how do I get the book into a shape the model can consume, chapter by chapter, without typing it twice."

In AudioProducer.ai that input shape is one of two things: a blank project where you paste text directly, or an EPUB upload that populates the project's chapter list automatically. We support EPUB and paste. We do not support .docx, .txt, .pdf, or .mobi today; for those, the route is to convert the source to EPUB first (Calibre is the usual answer) and then import.

This article is about the EPUB side: what comes across when you upload one, the predictable places where the chapter structure does not match what the writer expected, and how the review surface in front of audio generation handles those cases before any compute is spent.

What an EPUB actually is, briefly

An .epub file is a ZIP archive with a known directory layout. Inside it there is a package document (conventionally named content.opf, and located through META-INF/container.xml) whose manifest lists every document in the package; a navigation document (typically nav.xhtml for EPUB 3, or toc.ncx for the older EPUB 2 path) that defines the table of contents; and one or more XHTML files that hold the actual prose.

The interesting fact for any chapter-splitting code: there is no single canonical way an EPUB encodes "where chapter 4 begins." Some books have one XHTML file per chapter, with the nav document pointing at each file's root. Some have a smaller number of XHTML files, each containing several chapters, with the nav document pointing at specific anchors inside them. Some have one big file with chapter titles as <h1> or <h2> headings and no nav-doc detail beyond a single top-level entry. All three are valid EPUB, and all three are real in the wild.

If you have ever tried writing your own EPUB-to-chapters parser, you already know the implication: there is no single rule that covers every file. Whatever you do is going to be wrong on some books, which means you need a surface in front of the writer that lets them see what you got and adjust before the rest of the pipeline runs.
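For readers who want to look inside a file themselves, here is a minimal sketch that lists an EPUB's content documents in reading order using only the Python standard library. It follows the container and package layout described above, is not AudioProducer.ai's importer, and the function name is our own. It does not read the nav document, so it cannot see chapters that start at an anchor inside a file.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_documents(epub_file):
    """Return the paths of an EPUB's content documents in reading order."""
    with zipfile.ZipFile(epub_file) as zf:
        # container.xml points at the package document, wherever it lives.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).attrib["full-path"]
        package = ET.fromstring(zf.read(opf_path))

    # The manifest maps ids to files; the spine lists ids in reading order.
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    base = posixpath.dirname(opf_path)
    return [
        posixpath.join(base, manifest[ref.attrib["idref"]])
        for ref in package.findall("opf:spine/opf:itemref", NS)
    ]
```

Calling spine_documents("book.epub") on a one-file-per-chapter book returns roughly one path per chapter. On a single-file book it returns one path, which is exactly the case where a review surface is needed.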

What we populate when you upload one

When an EPUB lands in AudioProducer.ai as a new project, three things show up in the project: the chapter list (each item gets a row), the chapter titles as they were in the source, and the body text of each chapter ready to be marked up. From the writer's point of view, the slow part of bootstrapping a project is now done. They can open chapter 1, run Auto-Assign Characters and Auto-Assign Sounds, pick voices, and generate audio.

The chapter list is the review surface. It is intentionally separate from the audio pipeline. Auto-Assign runs and audio generation both work per-chapter, and they both cost compute. The chapter list is where you fix structure before that cost is paid. If the import produced ten chapters but your book has eight, this is the place to see and reconcile that.

Where the chapter list does not match expectations

The most common reasons the auto-populated chapter list does not match what the writer pictured, in roughly the order we see them:

Front matter shows up as chapters. Many EPUBs include a copyright page, dedication, epigraph, acknowledgments, or a publisher logo page as separate XHTML files with their own nav entries. From the EPUB's point of view they are chapters. From the writer's point of view they are not. They appear in the imported list, often before chapter 1.

Back matter shows up as chapters. Same pattern in reverse. About-the-author, "also by this author" lists, sample chapters of a different book, ad pages. They were structurally chapters in the source; they probably should not be narrated.

Section grouping reads as a chapter. Books that have Parts (Part I, Part II) sometimes encode each Part's title page as its own XHTML file with a nav entry. That title page is then one row in the chapter list, with the actual numbered chapters following.

Chapter titles are not the titles the writer would have picked. EPUB metadata is sometimes optimized for an e-reader's navigation pane, not for being read aloud. Titles like Chapter_01_v3_final are real. Titles that are just numbers (1, 2, 3) are common. Titles in mixed case where the writer's manuscript was all-caps, or vice versa, happen routinely.

Chapter boundaries do not match the writer's mental boundaries. Some books bundle several short numbered chapters into one XHTML file, and the EPUB's nav document does not have anchor-level granularity into that file. The imported chapter list ends up with one row that contains "Chapter 17, Chapter 18, Chapter 19" inside it as the body text.

None of these are bugs in the EPUB or in any specific parser; they are just consequences of the format being structurally permissive. The takeaway in the pipeline is that the chapter list is where you discover them.
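If you want a quick pre-check of your own before import, the patterns above can be approximated with a few title heuristics. This is a rough sketch with a hypothetical keyword list and flag wording. It only suggests rows to look at, and it is not how the product decides anything.

```python
import re

# Hypothetical keyword list; extend it for your own books.
NON_STORY = ("copyright", "dedication", "epigraph", "acknowledgments", "about the author")


def review_flags(title):
    """Return human-readable hints for a chapter-list row, based on its title only."""
    flags = []
    lowered = title.strip().lower()
    if any(word in lowered for word in NON_STORY):
        flags.append("front/back matter?")
    if re.fullmatch(r"(part|book)\s+[ivxlcdm\d]+", lowered):
        flags.append("part title page?")
    if "_" in title or re.fullmatch(r"\d+", lowered):
        flags.append("rename before audio?")
    return flags


for row in ["Copyright", "Part II", "Chapter_01_v3_final", "3", "Down the Rabbit-Hole"]:
    print(row, review_flags(row))
```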

The review surface, what it does

The chapter list shows, for each row, the chapter title and the body that landed in that row. Two things you can do before running Auto-Assign on anything:

Remove a chapter. Front matter, back matter, and section title pages you do not want narrated come out here. If you decide later that you do want the copyright page read out (some audiobooks do open with one), you can rebuild the project; in practice the more common move is to keep them out.

Rename a chapter. The chapter intro feature reads the chapter name out loud at the start of every chapter, optionally with a custom template like Now beginning ${name}. That means the chapter title is not just an organizational label; it is content that listeners hear. Chapter_01_v3_final becomes a real audio artifact unless you rename it. The chapter list is where that rename happens, before any audio generation runs.
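To see why the raw title matters, here is a small sketch of template expansion. Python's string.Template happens to use the same ${name} placeholder syntax, but that is a coincidence of notation: the product's own template engine is not documented here, and chapter_intro is a hypothetical helper.

```python
from string import Template


def chapter_intro(chapter_title, template="Now beginning ${name}"):
    """Expand an intro template with the chapter title exactly as stored."""
    return Template(template).substitute(name=chapter_title)


print(chapter_intro("Chapter_01_v3_final"))
print(chapter_intro("Chapter One: Down the Rabbit-Hole"))
```

Whatever string sits in the chapter list is what the listener hears, so the first call shows what an unrenamed row would read out.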

Once the chapter list reads the way the writer expects, the rest of the workflow is per-chapter and reversible: open a chapter, click Auto-Assign Characters, click Auto-Assign Sounds, pick voices, listen to the generated audio, edit lines or markup that did not land right, re-generate.

What this does not handle

A few honest limits worth naming.

Other manuscript formats. As above, we support EPUB and blank/paste. .docx, .txt, .pdf, and .mobi are not currently importable. Pasting chapter-by-chapter into blank projects is the workaround. Converting to EPUB first is the better workaround for anything over a few chapters.

EPUBs whose internal structure encodes "chapter" in a way the writer disagrees with. The chapter list reflects what the EPUB says, not what the writer meant. The fix is in the review step, not in the import step. If the same mis-split recurs on every import (e.g., every Part page comes through as a chapter), it is often easier to re-export the EPUB from your source with cleaner structure than to fix it in the project every time.

Re-import. If you change the EPUB and want to re-import it into the same project, the pattern is to start a new project rather than overwrite. Project-level customizations (voices, sound design, edits in the markup) live with the project, so a re-import on top would not be a clean merge.

What this changes about the workflow

The shortest read on what EPUB import gives you, before any model touches the manuscript: a chapter list to look at, on a page that does not cost compute, with the option to remove and rename rows until the structure matches the book the writer thinks they are working on. The audio pipeline runs after that, per-chapter and on your signal. The cost-bearing steps (Auto-Assign passes, audio generation) sit on the other side of a review the writer controls.

If you want to see what your own manuscript looks like through this path, the free tier is the easiest way in: 1,200 words per month, no credit card. Upload an EPUB, look at the chapter list, fix anything that landed wrong, run Auto-Assign on chapter 1, generate the audio, and listen. Most of the questions a writer has about whether the rest of the workflow fits how they work are answerable inside that one chapter.

Start a project at audioproducer.ai.

Disclosure: this article was drafted by an AI agent working on behalf of the AudioProducer.ai team.

Running a non-English audiobook through an AI voice pipeline: what's involved

AudioProducer.ai — Tue, 19 May 2026 23:14:09 +0000

Most TTS-based audiobook pipelines are built around English. The voice library is English voices, the dialogue heuristics assume English punctuation, the auto-assignment models are trained on English-language conventions. When a writer wants to run a French, German, Spanish, or Mandarin manuscript through the same pipeline, what actually changes? Some pieces port over cleanly. Others don't. This is a walk through what we've learned building multilingual support into AudioProducer.ai - what the pipeline does when the source isn't English, and where the rough edges still are.

Voice selection across languages

The voice library has 132 voices at the time of writing, and about 64 of them are tagged for the multilingual model. "Multilingual" here is a model capability, not a guarantee that the voice sounds equally good in every language. The underlying speech model handles phonetic mapping across the languages it was trained on, so a voice that ships as "American English neutral" can produce intelligible French or Spanish output. But cadence, intonation, and the small prosodic choices that make narration sound native are language-specific learned patterns. Some voices carry their non-English performance further than others.

For a writer starting a non-English project, the practical advice is to evaluate a few of the multilingual-tagged voices on a paragraph of the target language before committing to a narrator. Voice library previews give you a feel for each voice's range, but for a specific non-English book the honest test is generating a short sample paragraph in the actual target language inside the editor. That's the cheapest way to know whether the voice carries the language well enough for your purpose.

Per-character voice routing when the prose has multiple speakers

Auto-Assign Characters tags every line in a chapter by speaker. Narrator, named characters, in-world labels - the AI walks the prose and attaches a speaker tag to each line. The mechanism is language-agnostic in shape: the model identifies dialogue boundaries and attribution patterns, then ties each tagged line to a character.

In practice, non-English prose introduces two adjustments.

First, dialogue punctuation conventions vary by language. French dialogue uses em-dashes and guillemets rather than the curly-quoted attribution typical in English prose; Spanish often uses em-dashes too; German uses both guillemets and quotation marks depending on house style. Auto-Assign reads these conventions, but the cleaner the source punctuation, the cleaner the first pass. Standardizing dialogue punctuation in the source manuscript - picking one convention and applying it consistently - saves several rounds of hand-correction on the auto-assigned output.
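As one example of standardizing a convention in the source, the sketch below rewrites guillemet-quoted spans as straight double quotes. The function name is our own, and it is deliberately narrow: it handles only the «...» form and leaves dash-led dialogue and other quote styles untouched. Treat it as a starting point to adapt, not a complete normalizer.

```python
import re

# Matches an opening guillemet, optional spacing (including non-breaking spaces),
# the quoted text, optional spacing, and the closing guillemet.
GUILLEMETS = re.compile(r"«\s*(.*?)\s*»", re.DOTALL)


def normalize_guillemets(text):
    """Rewrite «...» spans as straight double-quoted spans."""
    return GUILLEMETS.sub(r'"\1"', text)


sample = "« Bonjour », dit-elle."
print(normalize_guillemets(sample))
```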

Second, voice-per-character routing surfaces in the character panel after Auto-Assign completes. If a chosen voice doesn't carry a particular language well for a specific character, the panel is where you swap it out. Same workflow as English, with the cross-language constraint that the candidate voices need to come from the multilingual-tagged subset.

Sound design across languages

The Auto-Assign Sounds pass - music beds, ambient soundscapes, one-shot sound effects - is genuinely language-agnostic. Sound effects don't know what language the chapter was written in. A thunderclap is a thunderclap; rain over a city is rain over a city. The model that selects sounds reads the scene's content - storm, fight, quiet interior, scene transition - not its lexicon.

This is the part of the pipeline that ports across languages with no adjustment. A Spanish-language historical thriller and an English-language historical thriller of the same scene shape end up with broadly similar Auto-Assign Sounds output. Music selection rules (genre, energy, mood) operate at the same layer. Soundscapes earn their place by what they signal narratively, which is upstream of language.

The practical implication: when planning a non-English audiobook through the platform, the voice layer is where the language-specific work happens. The sound design layer is the same pipeline you'd run on an English book.

UI language vs. content language

The editor UI ships in 8 languages: English, French, German, Spanish, Portuguese, Chinese, Hindi, and Arabic. The UI language is independent of the content language. A French-speaking writer can drive an English audiobook project with the editor in French. A Hindi-speaking writer can drive a Spanish audiobook project with the editor in Hindi.

The two layers are decoupled by design. UI locale is picked from the Accept-Language header and an optional locale cookie at SSR time, with translation strings living in their own per-locale files. The content language is determined separately, per chapter, when audio is generated - the model auto-detects the source language of the prose.
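A minimal sketch of that kind of locale negotiation, assuming the cookie wins over the header and that the eight UI languages map to the two-letter codes below. Both are assumptions for illustration; the function is not the platform's actual code.

```python
# Assumed codes for the eight UI languages listed above.
SUPPORTED = ("en", "fr", "de", "es", "pt", "zh", "hi", "ar")


def pick_locale(accept_language, cookie_locale=None, default="en"):
    """Pick a UI locale: an explicit cookie first, then the best Accept-Language match."""
    if cookie_locale in SUPPORTED:
        return cookie_locale

    ranked = []
    for position, part in enumerate(accept_language.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:
            primary = tag.strip().lower().split("-")[0]
            # Sort by descending quality, then by original header order.
            ranked.append((-quality, position, primary))

    for _, _, primary in sorted(ranked):
        if primary in SUPPORTED:
            return primary
    return default


print(pick_locale("fr-CA,fr;q=0.9,en;q=0.8"))
print(pick_locale("en-US", cookie_locale="hi"))
```

Nothing in this function looks at the manuscript, which is the decoupling described above: content language is decided elsewhere, per chapter.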

The reason this matters: writers shouldn't have to use English buttons and English menus to produce a French or Spanish book just because that's how the platform happened to start. The two locales live in different parts of the stack and shouldn't be welded together in the user's workflow either.

What's still hard

The honest part. Multilingual TTS at the quality level audiobook listeners expect has rough edges, and it's worth being explicit about which ones.

Accent within a language is still hard. A "multilingual French" voice may render Parisian French well and Quebec French unevenly; a "multilingual English" voice may handle American narration cleanly and an Indian-English or Scottish-English character less convincingly. Audiobook listeners are sensitive to accent authenticity, and the available voices don't yet span every regional variant cleanly.

Code-switching in dialogue is also rough. A character who speaks two languages within one paragraph - common in immigrant fiction, regional literary fiction, and many real human conversations - pushes the model into edge cases. Sometimes the switch lands gracefully; sometimes the model forces one language across the boundary.

Idiomatic prosody is the third rough edge. Languages carry expectations about where a sentence's emphasis lands, how a question rises, how a punchline pauses. These are learned per-language and can drift on voices whose training data was thinner in the target language than in English.

What this means operationally: if you're producing in a language you're a native or near-native speaker of, you'll catch what's off and route around it. If you're producing in a language you don't speak, route the audio past a native-speaker reviewer before treating the production as final. The Auto-Assigns are starting points, not final answers - true in English, and more emphatically true outside it.

Wrapping up

Multilingual audiobook production through an AI pipeline is realistic for many language pairs and not yet realistic for all of them. The voice layer carries language-specific quality, the character routing layer carries language-specific punctuation conventions, and the sound design layer carries across languages without changes. Knowing which layer is sensitive to language and which isn't is most of the work in planning a non-English project on the platform.

If you want to try a non-English manuscript through the pipeline, the free tier supports 1,200 words per month - enough for a short chapter sample to evaluate voice quality in your target language before committing to a paid plan. The voice library at audioproducer.ai is where the multilingual-tagged voices live; preview is the cheapest way to see whether the pipeline handles your specific language well enough for your specific book.

Disclosure: this article was drafted by an AI agent working on behalf of the AudioProducer.ai team.

Auto-Assign Sounds: how AudioProducer.ai turns chapter text into music beds, ambience, and SFX

AudioProducer.ai — Mon, 18 May 2026 23:09:40 +0000

If you read our Auto-Assign pipeline post from last week, you already know the shape: chapter text in, two AI passes (Characters, then Sounds), tweak in the editor, click Generate. That walkthrough was deliberately wide. This one is narrow.

This post zooms in on the second pass: Auto-Assign Sounds. What the AI is actually looking for in a chapter, how it picks which sounds to place where, what the editor lets you do with the result, and the edge cases that consistently trip it up. The goal is to make the pass legible enough that when you click the button on your own chapter, you can predict what it will do and where you will need to step in.

Why a separate Sounds pass

In our pipeline, Characters and Sounds are two distinct AI passes, not one. They could in principle be folded together. They are not, for two reasons that are worth knowing up front because they shape how the editor behaves.

First, the failure modes are different. Character attribution is mostly a parsing problem (who said this line?). Sound placement is mostly a scene-comprehension problem (what is happening here, and what would you hear if you were there?). Splitting the passes means you can re-run one without re-running the other when a chapter's character markup is fine but the sound placement needs another go.

Second, the user-facing controls are different. After Characters runs you typically tune voice assignments. After Sounds runs you typically tune which atmospheric layers play under which sections and where individual SFX land. Keeping the panels separate keeps the cognitive load lower for each tuning task.

What the Sounds pass detects

Input: a chapter, with characters already assigned. Output: three categories of audio, placed at specific positions in the text.

Music beds. Long-form atmospheric tracks that play under stretches of text. These are mood pieces, not tied to a single line. The AI looks at the emotional contour of a scene and picks a bed that fits the dominant register: tension, calm, dread, wonder, melancholy. A chase sequence gets a percussive driving bed; a quiet character moment gets something sparser.

Ambient soundscapes. Environmental layers tied to the place a scene happens in, not the mood of it. Wind under an outdoor scene. Distant traffic under a city scene. A crackling fire under a hearth scene. Surf under a beach scene. The AI infers location cues from descriptive prose (and from named locations when the text gives them) and lays in a soundscape that grounds the listener in the geography of the moment.

One-shot sound effects. Discrete events tied to a single moment in the text. A door slamming. Stones launching from a sling. A bottle shattering. Thunder cracking on a beat. These are the ones that show up as inline chips in the editor view with their duration in parentheses, sitting on the exact line they fire on. From real chapters in our examples: "Distant Thunder (4s)", "Wind Howl (6s)", "Wind Gust (5s)", "Stones Launching (3s)".

The three categories layer. A storm scene typically gets a tense music bed running underneath the whole thing, a wind-howl ambient soundscape over the storm description, and one-shot thunder cracks placed on the lines that describe the lightning. The editor shows all three layers stacked at the moments they overlap.
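The layering can be pictured as ranges over line numbers. The sketch below is an illustration with assumed field names and a made-up bed name ("Tense Bed"); "Wind Howl" and "Distant Thunder" are sound names from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    layer: str   # "music", "ambient", or "sfx"
    name: str
    start: int   # first line covered (inclusive)
    end: int     # last line covered (inclusive); same as start for a one-shot


def layers_at(placements, line):
    """Return the names of every placement that is active on a given line."""
    return [p.name for p in placements if p.start <= line <= p.end]


storm = [
    Placement("music", "Tense Bed", 0, 9),
    Placement("ambient", "Wind Howl", 0, 5),
    Placement("sfx", "Distant Thunder", 3, 3),
]

print(layers_at(storm, 3))
print(layers_at(storm, 7))
```

On line 3 all three layers overlap; by line 7 only the music bed is still running.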

How the editor surfaces the result

After the Sounds pass completes, the chapter view shows you what the AI placed and where. Three surfaces matter:

Inline SFX chips on the text. One-shot sounds appear as small chips inside the body of the chapter, on the line they fire on, labeled with the sound name and duration. You can see at a glance how dense the placement is and whether the moments feel right.
The Sounds panel. Lists every music bed and ambient soundscape the AI placed, along with which range of text they cover. This is the panel you use to swap a bed for a different track, change where a soundscape starts or ends, or remove a placement entirely.
The library browser. Same place you would use to add a sound by hand. You can preview anything in the library before assigning it, so swapping a bed is a low-risk operation.

Two behaviors that come up often:

Dragging chips. SFX chips can be moved to a different line in the text if the AI placed one a beat early or late.
Removing placements. Anything the AI put down can be deleted with no penalty. The point of the Sounds pass is to seed the chapter with reasonable choices, not to be irreversibly bolted in.

Whatever you change in the panel takes effect on the next Generate Audio click. You do not have to re-run Auto-Assign Sounds when you swap a bed or move a chip. You only re-run the pass if you have rewritten enough of the source text that the original placement is now stale.

A concrete example

To make this less abstract, here is roughly what happens on a few paragraphs of a typical fantasy chapter.

The chapter opens with two characters approaching a cave during a storm. The first paragraph describes wind tearing at their cloaks, rain coming sideways, distant thunder rolling. The second paragraph has them ducking into the cave mouth. The third paragraph has one character striking flint to start a fire.

After the Sounds pass:

A tense, low-register music bed lays in under the first two paragraphs. The AI reads "storm" plus the urgency in the action verbs and picks something that primes the listener for trouble.
A wind-howl ambient soundscape sits over the outdoor portion and tapers as the characters enter the cave.
One-shot SFX chips appear: a distant-thunder chip on the line about thunder rolling and a flint-strike chip on the line about striking sparks. A fire-crackle ambient layer starts after the fire catches.
The music bed shifts down (or out, depending on the AI's read) once the characters are safely inside, leaving the fire-crackle ambient and quiet dialogue.

That is the AI doing the placement work for you. Whether each piece is right is a different question, which is why the editor is built around fast review rather than starting from scratch.

Edge cases and where the AI misfires

The Sounds pass is good and uses the same underlying scene comprehension that powers the Characters pass. It also misfires in predictable ways.

Internal scenes get treated as external. When a paragraph reads as a character's internal monologue with vivid imagery (a memory of a battlefield, a dream of an ocean), the AI sometimes places real-world ambience as if the events were happening live. The fix is usually to remove the ambient layer for that range and let the music bed alone carry the mood. The rule of thumb: ambient soundscapes anchor place; if the place is imagined rather than present, the soundscape can land as too literal.

Mood mismatch on tonally ambiguous scenes. A funeral that opens with a joke. A reconciliation that ends with a fight. Scenes that pivot in tone partway through can get a single dominant bed that fits one half and clashes with the other. The fix in the Sounds panel is usually to split the placement: let the original bed cover the first range, swap a different bed for the second.

SFX over-eagerness on dialogue-heavy chapters. When a chapter is mostly two characters talking in a static setting, the AI sometimes places an SFX on every motion verb (a cup setting down, a chair scraping), and the result adds up to noise rather than texture. The fix is to delete the chips that do not need to be there. As a starting heuristic: keep SFX on moments that turn the scene, drop them on moments that just keep it moving.

Repetition across long stretches. On long chapters, the AI can re-use the same ambient track across multiple scenes that happen in different places. The fix is to swap one of them; the variety reads as more produced. The Sounds panel makes this a one-click operation.

Genre register drift. Cozy mysteries and grimdark fantasies do not want the same musical palette. The AI gets the gross-level register right most of the time, but on chapter one of a new project, give the bed selections a closer look. By chapter three the pattern of swaps usually stabilizes into a register that fits your book and the AI's later passes start landing closer to where you would have placed things.
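
The repetition case in particular is easy to check for by hand if you jot down which ambient track each scene ended up with. The helper below is a sketch of that check, not a feature of the product: the function name and the scene and track names are made up for illustration.

```python
from collections import defaultdict


def reused_ambients(scene_ambients: dict[str, str]) -> dict[str, list[str]]:
    """Group scenes by ambient track and keep only tracks used more than once.

    `scene_ambients` maps a scene name to the ambient track placed on it.
    """
    by_track: dict[str, list[str]] = defaultdict(list)
    for scene, track in scene_ambients.items():
        by_track[track].append(scene)
    return {track: scenes for track, scenes in by_track.items() if len(scenes) > 1}


# Hypothetical notes from reviewing one long chapter.
chapter_notes = {
    "harbor at night": "steady rain",
    "forest road": "wind howl",
    "rooftop chase": "steady rain",
}
candidates = reused_ambients(chapter_notes)
```

Any track that shows up in `candidates` is a place where swapping one of the scenes to a different ambient is worth a listen.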

What composes well, what does not

Two practical patterns we have seen across user chapters:

Auto-Assign Sounds composes well for action, exterior scenes, and high-mood passages. The denser the sensory description in the source text, the more cues the AI has to anchor placements to, and the more the result feels intentional. Storm scenes, chase sequences, battle scenes, ritual scenes all get strong starting points.

Hand-curation is faster than auto-assign for sparse, intimate, or stylized scenes. A two-character conversation in a quiet room with no environmental cues does not give the AI much to work with. You will usually end up wanting one ambient soundscape (the room, the rain outside, whatever the character notices) and maybe a single SFX on a key moment. In that case, the faster move is to drop those in yourself from the Sounds panel rather than running the pass and then deleting most of what it placed.

For most book-length manuscripts, the answer is to run Auto-Assign Sounds on every chapter and then keep the hand-curation muscle warm for the chapters where the AI is fighting the material rather than helping it.

Your own audio in the same library

One detail that matters for projects with a specific sonic identity: you are not limited to the built-in library. You can upload your own music and sound effects into your personal sound library and they sit alongside the built-in tracks for use in any of your projects. The Auto-Assign pass draws from the built-in catalogue, but anything you have uploaded shows up in the library browser and can be swapped in. For series with a recurring musical motif or a specific narrator-signature sting, that is the path.

Standard caveat: only upload audio you are authorized to use.

What this pass does not do

To keep the picture honest:

It does not write music or generate SFX from scratch. It picks from a library of existing tracks (built-in plus anything you have uploaded). When customers ask whether they can generate a custom score, the answer today is no.
It does not mix levels for you across tracks. The Sounds panel surfaces placements; final balance is what comes out of Generate Audio with the library's default mix.
It does not handle publishing or distribution. The output is export-ready and compatible with major audiobook platforms, but uploading to them is your step.

Try it

The free tier (1,200 words per month, no credit card) is enough to run Auto-Assign Sounds on a real chapter and develop a feel for what it places and where you push back. Pick a chapter that already has some action in it, click both Auto-Assigns, scan the inline SFX chips and the Sounds panel, swap one bed, delete one chip, click Generate. Forty-five minutes from a cold start to a finished audio drama of your own chapter.

You can do that at audioproducer.ai. If you have already read the Auto-Assign pipeline post, this is the natural next click; if not, that one sets up the broader picture this post drills into.

Disclosure: this article was drafted by an AI agent working on behalf of the AudioProducer.ai team.

Voice cloning inside the audiobook pipeline: integration notes and trade-offs

AudioProducer.ai — Thu, 14 May 2026 17:08:46 +0000

When we shipped voice cloning in AudioProducer.ai, the easiest way to talk about it externally was the consumer pitch: bring your own voice, narrate your own book. That framing is true, but it leaves out what we have found interesting about the feature on the production side. Voice cloning is a system component, and the most useful way to describe it for a developer audience is by what it integrates with, what it leaves untouched, and where the engineering quirks land.

This is a walkthrough of the cloning step as it sits inside the rest of our audiobook pipeline: what the abstraction looks like from the editor's point of view, where it sits in the per-chapter generation flow, the trade-offs we have watched accumulate across long-form jobs, and one operational rule that is structurally a constraint rather than a footnote.

The cloning step is a "library voice" shape, by design

The single design call that mattered most for keeping the pipeline tractable: a cloned voice is the same object type as a library voice. Both live on the same Voices page on the user's account home, in the same list, surfaced through the same selection affordance. After a clone is created, it is a row in the user's voice library with the same slots, the same per-line attachments, and the same selection dropdowns as the 132 built-in voices.

This sounds like a small thing. It is the difference between voice cloning being a tractable feature and being a parallel system that has to be specially handled at every assignment point.

When the user runs Auto-Assign Characters on a chapter, the AI does not need to know whether the user has cloned voices in their library. It does its job: tag every line by speaker, populate the Characters panel with one slot per voice that the chapter needs. The user then opens the panel and assigns a voice to each slot, picking from a flat dropdown that mixes library voices and clones interchangeably.

When the user runs Auto-Assign Sounds, music beds and one-shot effects get placed independent of voice choices. When the user clicks Generate Audio, the renderer asks each character slot for its assigned voice, gets back a voice descriptor that may resolve to a library asset or a cloned asset, and proceeds without branching on origin.

The boundary between "library" and "cloned" is therefore very thin. It lives inside the voice-resolution layer and almost nowhere else in the user-facing pipeline.
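
To make the "same object type" point concrete, here is a minimal sketch of what rendering without branching on origin can look like. Every name in it (`Voice`, `CharacterSlot`, `render_plan`, the voice ids) is hypothetical and chosen for illustration; it does not describe our actual renderer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    """A row in the voice library. `origin` is metadata only."""
    voice_id: str
    display_name: str
    origin: str  # "library" or "cloned"; render_plan never reads this


@dataclass
class CharacterSlot:
    name: str
    voice: Voice


def render_plan(
    lines: list[tuple[str, str]], slots: dict[str, CharacterSlot]
) -> list[tuple[str, str]]:
    """Pair each line's text with the voice id currently assigned to its speaker.

    There is no branch on voice.origin: both kinds of voice are the same
    object type, so the lookup is identical for clones and library voices.
    """
    return [(slots[speaker].voice.voice_id, text) for speaker, text in lines]


library_voice = Voice("lib-047", "British young male", "library")
my_clone = Voice("clone-001", "My own voice", "cloned")

slots = {
    "Narrator": CharacterSlot("Narrator", my_clone),
    "Alice": CharacterSlot("Alice", library_voice),
}
lines = [("Narrator", "Alice looked at the bottle."), ("Alice", "Curious.")]
plan = render_plan(lines, slots)
```

Because the slot is read at render time, reassigning `slots["Alice"].voice` and calling `render_plan` again is all a voice swap needs in this model; nothing about the line tags has to be recomputed.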

Where the clone slots into the per-chapter flow

For a fresh project, the pipeline phases (familiar from our earlier post on Auto-Assign) are:

Source text in, paste a chapter or import an EPUB.
Auto-Assign Characters, AI tags every line by speaker.
Auto-Assign Sounds, AI places music, soundscapes, and SFX.
Generate Audio, renders the finished file.

A cloned voice does not change any of these phases. It changes exactly one assignment: in the Characters panel after step 2, the user can pick their clone (or any mix of clones and library voices) for any character or for the narrator. The clone itself is created out of band, on the Voices page, by uploading a reference clip; once it exists, it appears in the assignment dropdowns the same way the library voices do.

The implication for the pipeline is that cloning is a one-time setup step, not a per-chapter step. A writer who clones their own voice once can then use that clone across every project, every chapter, and every regeneration without re-uploading. It is a row in their voice library, parallel to "British young-male voice 47."

For a writer who is iterating on a chapter, this also means a voice swap from a library voice to a clone (or vice versa) does not require re-running Auto-Assign Characters or Auto-Assign Sounds. The next Generate Audio asks the slot for its current voice and renders.

Trade-offs we have watched in production

A few practical notes that surface after running the feature in production for a while. None of these invalidate the use case, but they are worth being explicit about for anyone planning to use cloning in a long-form workflow.

Cloned voices behave slightly differently from library voices on long-form audio drama. Across multi-chapter generations, cloned voices can drift in consistency in ways that highly tuned library voices do not. The reference clip is a finite sample; the model interpolates from that sample to whatever the manuscript asks. Library voices were trained on much more material per voice, so their behavior under unusual prompts is more predictable. Practical implication: for projects where consistency across a 100,000-word manuscript matters more than the specific timbre, a well-chosen library voice may be the better call. For projects where the specific timbre is the point (narrator in their own voice, podcast host signature, a character voice that no library option captures), a clone is the right call and the consistency is good enough.

Per-line emotion control still applies. This matters in practice: when the user tags a dialogue line with an emotion (anger, fear, calm), the renderer applies that emotion to whichever voice is assigned to the line, library or clone. Cloned voices are not a flat-affect path. The same cloned character can read one line angry and the next calm, the way a library voice would.

The reference clip matters more than people expect. A two-minute clean read at the pace the user actually wants to narrate produces a noticeably different clone than a thirty-second clip recorded in a noisy room. We see this enough in support to be worth saying explicitly: the reference clip is the source of truth for the clone's character. Time spent on the clip is more leveraged than time spent re-cloning later.

Regeneration is budgeted the same way for clones and library voices. A regeneration of a chapter consumes word allowance whether the character is a clone or a library voice. There is no separate cloning quota. This is the kind of detail that matters in production planning: when the user iterates on a chapter, the cost model does not branch.

Cloning is available on the free tier. A deliberate call on the pricing side: the cloning feature is most useful when the user can verify its output against their own ear, and that verification needs to be cheap. The free tier (1,200 words per month, no credit card) is enough to upload a reference clip, generate a clone, render an opening chapter using that clone, and listen. Putting the feature behind a higher tier would push users to decide whether to clone before they have evidence the result will be what they want, which is the wrong place for that decision.
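
For planning purposes, the budgeting note and the free-tier figure above combine into simple arithmetic. The sketch below assumes that every Generate Audio run consumes the chapter's full word count; that is an assumption made for illustration, since this post only states that regeneration consumes allowance and that voice origin does not change the cost.

```python
FREE_TIER_WORDS_PER_MONTH = 1_200  # figure quoted in this article


def words_consumed(chapter_words: int, generations: int) -> int:
    """Assumed cost model: every render bills the full chapter.

    There is deliberately no parameter for voice origin, because clones
    and library voices draw on the same allowance.
    """
    return chapter_words * generations


def generations_affordable(
    chapter_words: int, allowance: int = FREE_TIER_WORDS_PER_MONTH
) -> int:
    """How many full renders of one chapter fit in the allowance."""
    return allowance // chapter_words
```

Under that assumption, a 400-word opening scene could be rendered three times within the free-tier allowance, which leaves room to try a clone, adjust, and listen again.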

The authorization rule is a structural constraint

The hard rule we surface in the product is the same one we surface in our customer-facing docs: only clone voices the user is authorized to use.

This is not a soft warning. The authorization for any voice the user clones sits with the user, and the responsibility for that authorization carries through into the produced audio. From a system-design perspective, this means the cloning endpoint is a place where user input determines a legal posture, and that posture cannot be relitigated downstream. We surface the rule at the upload step and leave the responsibility where it belongs.

The operating principle is simple. The user's own voice is fine. A voice the user has explicit permission to clone is fine. A public-domain recording cleared for this purpose is fine. A voice the user does not have permission to use is not what the feature is for.

This is the kind of constraint that is easier to design around once than to retrofit. It is also the kind of constraint that is worth being honest about in the docs rather than burying.

End to end

A writer who wants to narrate their own book in their own voice, from a cold start, runs through this sequence:

Sign up for the free tier. Open the Voices page on the account home.
Upload a clean reference clip (a couple of minutes of speech at the pace the writer wants to narrate).
Create a fresh project. Paste a chapter or import an EPUB.
Run Auto-Assign Characters. Open the Characters panel.
Assign the cloned voice to the narrator slot. Leave any character voices on library defaults, or swap them.
Run Auto-Assign Sounds. Click Generate Audio.
Listen to the chapter.

The flow is on the order of a few minutes once the reference clip is in hand, and the result is a chapter rendered in the writer's own voice. If the read is right, the same clone carries forward across every chapter, every regeneration, every future project, without re-uploading.

Try it

If you want to develop intuition for what cloning produces in your own pipeline, the cleanest way in is to clone a voice you already have a clean reference clip for, render an opening chapter, and listen. The free tier handles this end to end: audioproducer.ai.

Disclosure: this article was drafted by an AI agent working on behalf of the AudioProducer.ai team.

How AudioProducer.ai's Auto-Assign pipeline turns a chapter into a multi-voice audio drama

AudioProducer.ai — Tue, 12 May 2026 02:34:36 +0000

Making a multi-voice audiobook used to mean booking a studio, casting voice actors, and waiting weeks for production. We built AudioProducer.ai to compress that pipeline into something a writer can run from a browser in an afternoon — plain chapter text in, finished audio drama out, with the AI doing the bulk of the markup work.

This post is a walkthrough of the two passes that do the heavy lifting: Auto-Assign Characters and Auto-Assign Sounds. We'll cover what each pass takes as input, what it produces, and how the editor surfaces the result for you to tune. The goal is to make the pipeline legible — so when you sit down with a chapter and click the buttons, you know what the system is actually doing on your behalf.

The shape of the pipeline

The product treats every project as a sequence of chapters. For each chapter, the pipeline runs in four phases:

Source text in. Paste a chapter into the editor, or upload an .epub and the project gets pre-populated with chapter structure, titles, and body text.
Auto-Assign Characters. One-click AI pass that reads the chapter and tags every line by speaker — narrator, named characters, even in-world labels.
Auto-Assign Sounds. Second one-click AI pass that analyzes the scene and places music beds, ambient soundscapes, and one-shot sound effects from the built-in library.
Generate Audio. A single button renders the chapter into a finished audio file using the assigned voices and placed sounds.

Auto-Assign is a starting point, not a final answer. The editor is built around the idea that you'll keep what the AI got right and correct what it got wrong, in seconds per line — not by hand-tagging the whole chapter.

Auto-Assign Characters — what it actually does

Input: raw chapter text. Output: every line attributed to a speaker, with each distinct speaker given its own voice slot on the project.

Three kinds of speakers come out of this pass:

Narrator. Anything that isn't a character speaking goes to the narrator track. Description, scene-setting, action beats.
Named characters. "Alice", "the White Rabbit", "Eryndor" — every distinct named voice in the chapter gets its own slot. The AI handles attribution heuristics (who's speaking based on dialogue tags, conversational context, scene cues) so you don't have to walk every line manually.
In-world labels. This is the part that surprises new users. The AI catches text that isn't spoken by a character but should still be voiced distinctly — labels on jars, signs in a scene, captions a narrator reads aloud. In our editor's Alice in Wonderland example, you'll see entries like "Cake Label", "Label on the jar", and "Bottle Label" as distinct voice slots alongside Alice and the White Rabbit. Give those labels their own narrator voice (or even their own character voice if you want), and they read differently than the surrounding prose.

After the pass, the Characters panel shows you every speaker the AI extracted with a voice already provisionally assigned. You can swap any of those voices from the 132-voice library on your Voices page — or replace them with a voice you've cloned yourself.
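
As a mental model, the output of this pass can be sketched as a list of tagged lines from which the voice slots are derived. The shape below is an assumption for illustration (the class names, the sample lines, and the way they are tagged are ours to make the idea concrete), not the editor's internal representation.

```python
from dataclasses import dataclass
from enum import Enum


class SpeakerKind(Enum):
    NARRATOR = "narrator"
    CHARACTER = "named character"
    LABEL = "in-world label"


@dataclass
class TaggedLine:
    text: str
    speaker: str
    kind: SpeakerKind


# Illustrative sample lines, tagged by hand for this sketch.
chapter = [
    TaggedLine("Alice opened the little glass box.", "Narrator", SpeakerKind.NARRATOR),
    TaggedLine("EAT ME", "Cake Label", SpeakerKind.LABEL),
    TaggedLine("Well, I'll eat it,", "Alice", SpeakerKind.CHARACTER),
    TaggedLine("said Alice.", "Narrator", SpeakerKind.NARRATOR),
]


def voice_slots(lines: list[TaggedLine]) -> list[str]:
    """One slot per distinct speaker, in order of first appearance."""
    seen: dict[str, None] = {}
    for line in lines:
        seen.setdefault(line.speaker, None)
    return list(seen)
```

The point of the sketch is that "Cake Label" becomes a slot of its own, on equal footing with Alice and the narrator, so it can be given a different voice.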

Correcting what the AI gets wrong

Auto-Assign is good but not perfect. Common cases where it gets a line wrong:

The source uses unusual dialogue conventions (no quotation marks, character-speech embedded in narration paragraphs, attribution patterns the AI hasn't seen often).
A scene has two characters with similar names and the attribution-by-context heuristic picks the wrong one.
A line of free indirect speech ("Alice thought it odd that the rabbit was wearing a waistcoat...") could be the narrator or could be inside Alice's head — judgment call.

The editor handles the fix with a two-click pattern: select the line, pick the right character from the dropdown, done. You don't re-run the whole pass; the rest of the chapter's tags are preserved. For source texts with widespread attribution problems, the more efficient move is usually to standardize the punctuation in the source first and re-run the pass — the AI's accuracy is much higher on well-marked-up source.
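
The reason a single-line fix does not need a re-run can be sketched in a few lines: retagging one entry leaves every other tag untouched. The function name and the sample lines here are hypothetical, and the (speaker, text) pair shape is an assumption for illustration.

```python
SpeakerMap = list[tuple[str, str]]  # (speaker, line text) pairs; hypothetical shape


def reassign(speaker_map: SpeakerMap, index: int, new_speaker: str) -> SpeakerMap:
    """Return a copy of the map with exactly one line retagged."""
    updated = list(speaker_map)
    _old_speaker, text = updated[index]
    updated[index] = (new_speaker, text)
    return updated


# Made-up sample: the second line is free indirect speech tagged as narration.
chapter = [
    ("Narrator", "The rabbit hurried past."),
    ("Narrator", "How odd that it should wear a waistcoat."),
    ("White Rabbit", "Oh dear! I shall be late!"),
]
fixed = reassign(chapter, 1, "Alice")
```

Only the entry at index 1 differs between `chapter` and `fixed`, which is the property the two-click correction relies on.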

Auto-Assign Sounds — what it actually does

Input: the same chapter text, now with characters assigned. Output: music beds, ambient soundscapes, and one-shot sound effects placed at the right moments.

The pass distinguishes three audio types:

Music beds — long-form atmospheric tracks that play under sections of text. Use to set tone for a scene or sequence.
Ambient soundscapes — environmental layers (wind, rain, crowd noise, ocean). These set place rather than mood.
One-shot SFX — discrete events tied to a specific moment in the text. The chips show up inline in the editor at the moment they play, with the sound name and duration: "Distant Thunder (4s)", "Wind Howl (6s)", "Stones Launching (3s)".

In practice, what this looks like for an action scene: the chapter opens, the AI places a tense music bed under the first few paragraphs, layers a wind-howl soundscape over the storm description, and drops a "Stones Launching" SFX exactly on the line where the slingshot fires. All in one click; the chips are visible in the editor view, so you can see exactly what was placed where.
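
The inline chips can be thought of as small records attached to a line of text, each with a name and a duration. Here is a rough sketch of that idea, including the label format quoted above; the class, the bracket rendering, and the sample lines are assumptions for illustration rather than how the editor is implemented.

```python
from dataclasses import dataclass


@dataclass
class SfxChip:
    name: str
    seconds: int
    line_index: int  # which line of chapter text the chip is attached to

    def label(self) -> str:
        """Format the chip the way it reads in the editor, e.g. name plus duration."""
        return f"{self.name} ({self.seconds}s)"


def editor_view(lines: list[str], chips: list[SfxChip]) -> list[str]:
    """Append each chip's label to the line it is attached to."""
    view = list(lines)
    for chip in chips:
        view[chip.line_index] += f" [{chip.label()}]"
    return view


lines = ["Thunder rolled far off.", "The slingshot fired."]
chips = [SfxChip("Distant Thunder", 4, 0), SfxChip("Stones Launching", 3, 1)]
view = editor_view(lines, chips)
```

In this model, dragging a chip to a different moment amounts to changing its `line_index`, and removing it amounts to dropping it from the list.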

The tune-it pattern is the same as for characters: keep what fits, replace what doesn't. The Sounds panel of the editor lets you swap any placed track for a different one from the library, drag SFX to different moments, or remove placements that read as noise.

The voice library angle

The Auto-Assign Characters pass gets you to "every speaker has a voice." The Voices page is where you decide which voice each speaker gets.

A few notes that matter for picking voices:

132 voices in the library as of this writing, across a mix of male / female / unlabeled, middle-aged / young / older, plus dedicated child-male and child-female voices for kids' content. Accent coverage is mostly American with British, US-Southern, Irish, Australian, Indian, and Spanish-accented English in the mix.
The library is actively growing. New voices land regularly; the canonical source is your in-app Voices page.
Per-line emotion control. Same voice, different inflection per line. You attach an emotion tag (anger, fear, calm, etc.) to specific dialogue lines in the editor.
Voice cloning. You can clone a voice (your own, or any voice you're authorized to use) and use it like any library voice. Useful for narrating in your own voice without a recording rig, for distinct character voices that aren't in the library, or for brand consistency on a podcast.

Voice changes don't require re-running Auto-Assign — they take effect on the next Generate Audio.

Putting it together

The flow, end to end, from a fresh project:

Create a project; either paste a chapter into the editor or import an .epub.
Click Auto-Assign Characters. Review the Characters panel; correct any obvious miscasts.
Click Auto-Assign Sounds. Review the placed music / SFX in the editor; swap or remove what doesn't fit.
Open the Voices page; swap library voices into the character slots that need them, or assign a cloned voice if you've made one.
Click Generate Audio. The chapter renders into a downloadable audio file.

No external audio software in the loop. No separate DAW for mixing. The editor is the place where the audio production happens and the audio file is the output.

Pauses and pacing (briefly)

One detail that comes up frequently for writers used to manually mixing audiobooks: pause control. Pauses are configurable at four levels:

Inline pauses for dramatic effect inside a paragraph.
Project-wide default pause between paragraph breaks (set once per project).
Per-paragraph override when a specific transition needs more breath.
Intro pauses for the project intro (title / author / narrator) and chapter intros.

Multiple consecutive blank lines collapse to a single pause, which is usually what you want.
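
Two of these behaviors are easy to sketch: blank-line collapsing and the per-paragraph override taking precedence over the project default. The code below is a minimal illustration under our own assumptions (function names, pause lengths in seconds, override keyed by paragraph index); it does not model inline or intro pauses and is not the editor's implementation.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines; any run of blank lines counts as one break."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def pause_after(
    index: int, default_seconds: float, overrides: dict[int, float]
) -> float:
    """A per-paragraph override wins; otherwise use the project-wide default."""
    return overrides.get(index, default_seconds)


source = "One.\n\n\n\nTwo.\n\nThree."
paragraphs = split_paragraphs(source)

# Hypothetical settings: 0.8 s by default, a longer breath after paragraph 1.
pauses = [pause_after(i, 0.8, {1: 2.0}) for i in range(len(paragraphs))]
```

The three blank lines between the first two paragraphs produce one break, not three, and only the paragraph with an override departs from the default.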

A note on what the pipeline doesn't do

Auto-Assign covers character attribution and sound placement. A few things sit outside the pipeline:

Publishing to Audible, Spotify, or Apple Podcasts — the output is export-ready, but you upload to those platforms yourself.
Royalty or sales tracking for audiobooks sold elsewhere.
Non-EPUB import — .docx, .pdf, .mobi, and .txt aren't supported import formats today; for those, paste chapter-by-chapter into a blank project or convert your source to .epub first.

We mention these so the pipeline picture is accurate — the audio production is end-to-end inside AudioProducer.ai; distribution is your last step outside it.
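
If you are preparing a folder of manuscripts, the import rule above reduces to a check on the file extension. This helper is a sketch for your own tooling, with a made-up function name and return strings; it is not part of the product.

```python
from pathlib import Path


def import_route(filename: str) -> str:
    """Suggest how to get a manuscript file into a project."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".epub":
        return "import directly"
    # .docx, .pdf, .mobi, .txt and anything else take the manual route.
    return "paste chapter by chapter, or convert to .epub first"
```

The check is case-insensitive, so a file named `novel.EPUB` is treated the same as `novel.epub`.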

Try it

There's a free tier (1,200 words per month, no credit card) on audioproducer.ai. Pick a chapter, run both Auto-Assigns, swap a couple of voices, click Generate. The fastest way to develop intuition for what the pipeline does well — and where you'll spend the most editing time — is to feed it ten pages of your own writing and see what comes back.

Disclosure: this article was drafted by an AI agent working on behalf of the AudioProducer.ai team.