The Future of Voice: Bland’s New Breakthrough TTS Engine

We just unveiled a transformative text-to-speech system powered by large language models. Learn how our LLM-based architecture generates lifelike, emotionally intelligent speech with unmatched precision and style control.

Introduction

At Bland, we've been quietly working on a fundamental reimagining of text-to-speech technology. Our engineering team has developed an approach that doesn't just incrementally improve existing TTS pipelines but completely transforms how synthetic speech is generated. This post explores the technical architecture, data challenges, and breakthrough capabilities of our LLM-based speech prediction system.

Beyond Traditional TTS Architectures

Traditional text-to-speech systems follow a sequential pipeline: text normalization, phonetic conversion, prosody modeling, and waveform generation. Each step introduces its own complexities and potential errors. More importantly, this architecture creates an inherent disconnect between understanding what to say and deciding how to say it.

Our engineering team recognized that this architectural limitation was holding back truly expressive speech synthesis. The problem isn't just technical implementation—it's conceptual. Human speech isn't a conversion process; it's a generative one where meaning and expression are deeply intertwined.

We've addressed this by leveraging the predictive power of large language models. Instead of treating TTS as a series of conversion steps, we've trained our models to directly predict audio representations from text input.
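
To make the contrast concrete, here is a minimal interface sketch. The class and method names are illustrative placeholders rather than our production code: the point is that the pipeline fixes the text before any acoustic decision is made, while the generative approach makes both decisions in a single prediction.

class TraditionalPipeline:
    """Each stage is a separate component; what to say is fixed before how to say it."""

    def normalize(self, text: str) -> str: ...
    def to_phonemes(self, text: str) -> list[str]: ...
    def predict_prosody(self, phonemes: list[str]) -> list[float]: ...
    def vocode(self, phonemes: list[str], prosody: list[float]) -> bytes: ...

    def synthesize(self, text: str) -> bytes:
        # Errors introduced at any stage propagate to every stage after it.
        phonemes = self.to_phonemes(self.normalize(text))
        return self.vocode(phonemes, self.predict_prosody(phonemes))


class GenerativeTTS:
    """One model maps text directly to audio tokens; meaning and delivery are decided together."""

    def predict_audio_tokens(self, text: str) -> list[int]: ...
    def decode_tokens(self, tokens: list[int]) -> bytes: ...

    def synthesize(self, text: str) -> bytes:
        return self.decode_tokens(self.predict_audio_tokens(text))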

The Critical Data Advantage

The foundation of any machine learning system is its training data, and voice AI presents unique challenges in this domain. While our research team initially explored publicly available datasets, we quickly discovered their limitations for building truly conversational AI.

To put this in technical terms: high-quality training for speech models requires two-channel audio with separate tracks for each speaker, precise transcription alignment, speaker role labeling, and comprehensive metadata. This enables models to learn crucial conversational dynamics like turn-taking patterns, interruption handling, and speaker transitions.

Through careful licensing and processing, we've assembled approximately [REDACTED] million hours of two-channel conversational audio with corresponding transcripts—orders of magnitude beyond the current state of the art. For context, most available speech datasets contain at most 2 million hours, and even those rarely offer clean speaker separation or accurate transcriptions.

Our dataset includes:

  • Two-channel separation of speakers
  • Time-aligned transcriptions at the utterance level
  • Speaker role metadata
  • Conversational context markers
  • Industry-specific terminology across diverse domains

This data foundation gives our models an unprecedented ability to understand and reproduce the subtleties of natural conversation.
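
To make those requirements concrete, a single training record might be structured like the sketch below. The field names are hypothetical, chosen to mirror the list above rather than our internal schema.

from dataclasses import dataclass, field

@dataclass
class Utterance:
    speaker_role: str      # e.g. "agent" or "customer"
    channel: int           # which of the two audio channels this speaker occupies
    text: str              # transcript of this utterance
    start_time: float      # seconds from the start of the recording
    end_time: float

@dataclass
class ConversationExample:
    audio_path: str                        # two-channel recording, one speaker per channel
    domain: str                            # e.g. "healthcare", "finance"
    context_markers: list[str] = field(default_factory=list)   # e.g. ["greeting", "hold"]
    utterances: list[Utterance] = field(default_factory=list)  # time-aligned, per speaker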

Technical Implementation: From Text LLMs to Audio Generation

Our approach builds upon the transformer architecture that powers modern language models, but with several crucial modifications for audio prediction.

In a standard LLM, the model pipeline looks something like this:

  1. Text is tokenized into subword units
  2. Tokens are converted to embedding vectors
  3. The transformer processes these embeddings to predict subsequent token probabilities
  4. Output tokens are detokenized into text

For our audio prediction system, we've extended this architecture:

  1. Text input is tokenized conventionally
  2. The model predicts sequences of audio tokens rather than text tokens
  3. Audio tokens are converted back to waveform representations

The key technical innovation is our audio tokenizer, which converts continuous audio signals into discrete, learnable tokens while preserving essential acoustic properties. We use a specialized SNAC (Multi-Scale Neural Audio Codec) tokenizer that encodes features from coarse to fine-grained resolution, enabling the model to capture both broad prosodic patterns and subtle phonetic details.
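
For readers who want to experiment, the open-source snac package exposes this coarse-to-fine behavior directly. The sketch below assumes the interface shown in that project's published examples and uses random audio as a stand-in for real speech.

import torch
from snac import SNAC

# Load a published multi-scale codec checkpoint (interface as documented by the snac project).
codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

# Placeholder input: one second of 24 kHz mono audio, shape (batch, channels, samples).
audio = torch.randn(1, 1, 24000)

with torch.inference_mode():
    codes = codec.encode(audio)          # list of token tensors, coarse -> fine resolution
    reconstructed = codec.decode(codes)  # waveform rebuilt from the discrete tokens

The coarse levels carry broad prosodic structure while the finer levels add phonetic and timbral detail, which is the property our model exploits.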

During training, we carefully align these audio tokens with their corresponding text, creating paired examples that teach the model to associate textual patterns with appropriate acoustic realizations. This alignment process is computationally intensive but critical for the model to learn meaningful relationships between text and speech.

The Architecture of Audio Prediction

Our model architecture expands on the standard decoder-only transformer by incorporating specialized attention mechanisms that help manage the higher dimensionality of audio token sequences.

The training objective is similar to next-token prediction in text LLMs, but with audio tokens as targets. Given a sequence of text tokens as input, the model learns to predict the most likely sequence of audio tokens that would correspond to that text when spoken.

Importantly, this prediction happens holistically rather than in separate stages. The model doesn't first predict words and then separately predict their pronunciation; it directly predicts the full acoustic realization, capturing prosody, emphasis, timing, and emotional qualities simultaneously.
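
In code, the objective looks like ordinary causal language modeling with the loss restricted to the audio positions. The sketch below is illustrative: the model call and tensor shapes are assumptions, not our exact training recipe.

import torch
import torch.nn.functional as F

def next_token_loss(model, text_ids, audio_ids, ignore_index=-100):
    """Standard next-token cross-entropy, with audio tokens as the prediction targets.

    text_ids:  (B, T_text)  token ids for the input transcript
    audio_ids: (B, T_audio) token ids produced by the audio codec
    """
    # Concatenate text and audio into one sequence; the model sees text as a prefix.
    input_ids = torch.cat([text_ids, audio_ids], dim=1)

    # Only the audio positions contribute to the loss; text positions are masked out.
    labels = input_ids.clone()
    labels[:, : text_ids.size(1)] = ignore_index

    # Shift so that position t predicts token t+1, as in any causal LM.
    logits = model(input_ids)  # (B, T, vocab) -- hypothetical model interface
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=ignore_index,
    )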

From an implementation perspective, this is structured as a chat template:

[
  {
    "user": "transcript of training example 1",
    "assistant": [text output] [audio_token_sequence_1]
  },
  {
    "user": "transcript of training example 2",
    "assistant": [text output] [audio_token_sequence_2]
  },
  {
    "user": "text to be synthesized",
    "assistant": [text output] [predicted_audio_token_sequence]
  }
]

This format allows us to leverage the few-shot learning capabilities of LLMs, creating a system that can adapt to new voices or styles with minimal examples.
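
A hypothetical helper for assembling that prompt might look like this, assuming (as in the template above) that each assistant turn echoes the transcript followed by its audio tokens.

def build_fewshot_prompt(examples, new_text):
    """Assemble the chat template shown above (hypothetical helper).

    examples: list of (transcript, audio_token_sequence) pairs in the target voice or style
    new_text: the text to be synthesized
    """
    prompt = []
    for transcript, audio_tokens in examples:
        # Each conditioning turn pairs a transcript with its acoustic realization.
        prompt.append({"user": transcript, "assistant": f"{transcript} {audio_tokens}"})
    # The final turn has no assistant content; the model completes it at inference time.
    prompt.append({"user": new_text, "assistant": None})
    return prompt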

Style Transfer: Technical Achievements and Implementation

Style transfer in speech synthesis has been a persistent challenge in the field. Traditional approaches typically rely on explicit style embeddings or attribute vectors that must be learned separately for each style.

Our LLM-based approach solves this more elegantly. By framing speech generation as a prediction problem, the model naturally learns to associate contextual and stylistic cues in the input with appropriate acoustic patterns in the output.

Technically, we implement style control through:

  1. In-context learning: By providing examples of the target style in the prompt, we guide the model to adopt similar stylistic characteristics.
  2. Explicit style markers: We can include special tokens like <excited> or <calm> in the input text, which the model learns to associate with specific acoustic patterns.
  3. Transcript alignment: For specific effects or sounds, we align example audio with descriptive text markers (e.g., aligning <barking> with actual bark sounds in training).

The system doesn't require exhaustive labeling of every possible emotion or style. Instead, it can generalize from a few examples to understand the underlying acoustic patterns associated with different speaking styles. This capability emerges naturally from the LLM's general pattern-matching abilities.
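
As a small illustration, here is how an explicit marker and an in-context example might be combined in a single request. The marker names come from the list above; the audio-token placeholder is hypothetical.

styled_prompt = [
    # In-context example: the marker plus matching audio establishes the style.
    {"user": "<calm> Thanks for calling, how can I help you today?",
     "assistant": "<calm> Thanks for calling, how can I help you today? [calm_audio_tokens]"},
    # New request: the same marker steers the predicted delivery.
    {"user": "<excited> You are going to love what we just launched!",
     "assistant": None},
]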

In our testing, we've found that 3-6 examples of a particular voice or style provide sufficient context for high-quality synthesis, though results continue to improve with additional examples up to a point.

This does place one practical requirement on the serving stack: the model's multimodal output must be parsed reliably, separating the text portion from the audio token sequence before the audio can be decoded into a waveform.

Sound Effect Integration and Voice Blending

One of the most technically interesting capabilities of our system is its ability to learn and reproduce non-speech sounds alongside speech. This isn't a separate feature—it emerges naturally from the model's general ability to associate textual descriptions with acoustic patterns.

In implementation terms, we achieve this by:

  1. Including examples of the desired sound effect in the conditioning context
  2. Labeling these examples with consistent textual markers (e.g., <barking>)
  3. Using these same markers in the generation prompt to trigger similar sounds

This approach works because the model isn't restricted to generating speech sounds—it simply predicts the most likely audio tokens given the textual context. If that context includes examples associating <barking> with dog bark sounds, the model will learn this association just as it learns the sound of words.
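
A hypothetical conditioning pair for the <barking> example might look like the following, with the marker placed where the bark actually occurs in the example audio.

effect_example = {
    "user": "Sorry about the noise <barking> my dog gets excited when the phone rings.",
    "assistant": "Sorry about the noise <barking> my dog gets excited when the phone rings. "
                 "[audio_tokens_with_real_bark]",
}

# At generation time, the same marker in the input text triggers the learned sound.
request = {"user": "Welcome home! <barking> Somebody missed you today.", "assistant": None}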

We've found some practical limitations:

  • Sound effects should constitute no more than 20% of any audio example
  • High-quality, clean audio is essential for clear reproduction
  • Complex sounds often benefit from multiple examples
  • Examples should generally be under one minute to avoid overwhelming the context window

Voice blending—combining characteristics of different speakers or styles—works through a similar mechanism. By including examples of multiple voice styles in the context, the model naturally learns to produce outputs that blend characteristics of those styles, weighted by their prominence in the context.

Technical Challenges and Ongoing Development

Our engineering team continues to address several technical challenges with this approach:

Token Repetition: We occasionally observe pathological cases where the model produces repeating sequences of audio tokens, creating looping artifacts in the output. This appears to be related to the attention patterns in the transformer architecture when processing certain input patterns, and we're investigating improved sampling strategies to mitigate this issue.
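
One simple heuristic for catching these cases, offered as an illustrative sketch rather than a description of our exact mitigation, is to check whether the tail of the generated token sequence is a short window repeated several times.

def has_token_loop(tokens, window=8, repeats=3):
    """Return True if the last `window * repeats` tokens are the same window repeated.

    A cheap guardrail: if the tail of the audio-token sequence is a short pattern
    repeated back to back, generation is likely stuck in a loop.
    """
    span = window * repeats
    if len(tokens) < span:
        return False
    tail = tokens[-span:]
    pattern = tail[:window]
    return all(tail[i * window : (i + 1) * window] == pattern for i in range(repeats))

A check like this can trigger re-sampling during generation or serve as a lightweight production monitor that falls back to an alternative generation path.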

Audio Quality Sensitivity: The model is highly sensitive to the quality of audio examples provided in the context. Low-quality or noisy examples lead to similar artifacts in the output, as the model faithfully reproduces acoustic characteristics including unwanted noise. We're working on preprocessing techniques to improve robustness to variable audio quality.

Voice Gender Imbalance: Our current model shows better performance on female voices, requiring fewer examples to achieve high-quality synthesis compared to male voices. This likely stems from imbalances in our training data or biases in the tokenization process. We're addressing this through targeted data augmentation and model adjustments.

Computational Efficiency: Generating high-quality audio at useful latencies remains computationally intensive. We're exploring model distillation and specialized inference optimizations to improve real-time performance without sacrificing quality.

Engineering Considerations for Deployment

Deploying this technology in production environments presents several engineering challenges beyond model architecture:

Latency Management: Speech generation needs to happen in near-real-time for many applications. We've implemented streaming generation techniques that begin outputting audio before the entire sequence is generated, significantly reducing perceived latency.
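
The pattern is straightforward to express as a generator, sketched below with hypothetical model and codec interfaces. Real systems typically overlap chunks at the boundaries to avoid audible seams, which is omitted here for clarity.

def stream_speech(model, codec, prompt, chunk_tokens=64):
    """Yield playable audio chunks as tokens are generated, instead of waiting for the full sequence.

    `model.generate_tokens` and `codec.decode` are hypothetical interfaces standing in
    for the real components; the point is the chunking pattern, not the exact API.
    """
    buffer = []
    for token in model.generate_tokens(prompt):  # incremental audio-token generation
        buffer.append(token)
        if len(buffer) >= chunk_tokens:
            yield codec.decode(buffer)           # emit a chunk as soon as it is ready
            buffer = []
    if buffer:                                   # flush whatever remains at the end
        yield codec.decode(buffer)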

Context Window Optimization: The few-shot learning approach requires keeping voice examples in the context window, which can consume significant portions of the available context. We've developed techniques to compress voice characteristics into more compact representations while maintaining quality.

Memory Footprint: The model's memory requirements during inference can be substantial, particularly when handling multiple voice styles. Our engineering team has implemented efficient attention caching and quantization techniques to reduce memory needs without compromising quality.

Reliability Safeguards: For production deployments, we've built monitoring systems that detect potential issues like token looping, allowing fallback to alternative generation methods when necessary.

Real-World Applications and Technical Implications

The technical capabilities of this system enable several advanced use cases:

Cross-Speaker Style Transfer: By separating voice identity from speaking style, we can transfer the speaking characteristics of one speaker to another. This has powerful applications in creating consistent brand voices while incorporating the persuasive techniques of top performers.

Domain-Specific Pronunciation: For industries with specialized terminology (healthcare, finance, technology), the model can learn correct pronunciations of technical terms from just a few examples, eliminating the need for extensive pronunciation dictionaries.

Emotional Intelligence: The system can adjust its speaking style based on conversational context, adopting appropriate emotional tones for different types of information—explaining technical details clearly while delivering personal information with warmth.

Multilingual Adaptation: The model architecture transfers well across languages, allowing rapid adaptation to new languages with relatively small amounts of target-language data while preserving natural prosody.

Technical Guidelines for Optimal Results

Based on extensive testing, our engineering team has developed the following guidelines for optimal performance:

Voice Cloning:

  • Provide 3-6 examples of the target speaker
  • Use high-fidelity recordings (16kHz+ sample rate)
  • Include examples with varied prosody and emotional range
  • Ensure accurate transcription alignment

Effect Integration:

  • Limit effects to less than 20% of total audio
  • Use consistent text markers for each effect type
  • Provide multiple examples of complex sounds
  • Keep examples under one minute in length

Style Control:

  • Use <style> notation consistently for explicit control
  • Place the most important examples last, since later examples have a stronger influence on the output
  • For multilingual applications, include examples in each target language
  • Test with representative input text to ensure consistent quality

Future Technical Directions

Our research and engineering teams are exploring several promising directions for future development:

Hierarchical Tokenization: We're investigating multi-level audio tokenizers that more efficiently represent acoustic information at different granularities, potentially reducing context window requirements while improving quality.

Cross-Modal Conditioning: Incorporating visual or environmental context into speech generation could further enhance naturalness and appropriateness of synthetic speech in multimodal environments.

Continuous Learning: We're developing systems for ongoing improvement from deployment feedback, allowing models to refine their performance based on real-world usage patterns while maintaining data privacy.

Specialization vs. Generalization: We're exploring the optimal balance between general-purpose speech models and domain-specific adaptation, with promising results from hybrid approaches that combine a strong general foundation with lightweight domain tuning.

Conclusion

At Bland, we believe that the future of text-to-speech technology lies not in incremental improvements to traditional pipelines, but in fundamentally reimagining how computers generate speech. By leveraging the predictive power of large language models and applying them to audio generation, we've created a system that captures the nuance, expressiveness, and contextual awareness that make human speech so natural.

Our approach transforms TTS from a mechanical conversion process into a genuinely generative one, capable of understanding not just what to say but how to say it in a way that feels authentic and communicatively effective. As we continue to refine these technologies, we envision a world where human-computer voice interaction becomes as natural and expressive as human-human conversation.

This isn't just an improvement in how computers sound—it's a fundamental shift in how machines express information, with profound implications for how we'll interact with technology in the coming years.