Introduction

Now that transcription, language, and text-to-speech models have matured, it’s possible to closely replicate the experience of talking to another human being on the phone. In this post, we break down how we built our AI calling API.

If you want to send yourself an AI call right now, visit our dev portal. Also, here’s an early demo.

Model stack

Our model stack includes Claude Instant as the LLM, Whisper for transcription, and ElevenLabs for TTS (state-of-the-art voice quality).
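To make the flow through the stack concrete, here is a minimal sketch of a single conversational turn: audio in, transcript, LLM reply, audio out. The function name handle_turn and the three callables are hypothetical placeholders for whatever client code wraps each model; none of this is the actual interface of Whisper, Claude Instant, or ElevenLabs.

```python
from typing import Callable


def handle_turn(
    caller_audio: bytes,
    transcribe: Callable[[bytes], str],   # hypothetical wrapper around the transcription model
    complete: Callable[[str], str],       # hypothetical wrapper around the LLM
    synthesize: Callable[[str], bytes],   # hypothetical wrapper around the TTS model
) -> bytes:
    """Run one turn of the call: speech -> text -> reply text -> speech."""
    caller_text = transcribe(caller_audio)
    reply_text = complete(caller_text)
    return synthesize(reply_text)
```

Passing the three stages in as callables keeps the sketch independent of any vendor SDK, and each stage can be swapped or timed separately.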

Problems we solved

The biggest challenges we solved include: a) detecting ends of sentences and interruptions, where long and short periods of silence are the indicators; b) detecting when we’ve hit an automated customer support system, using ML trained on audio wave patterns; and c) navigating customer support phone trees autonomously.
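As a rough illustration of the silence heuristic in (a), the sketch below measures how long the caller has been quiet and classifies the pause by its length. Everything specific here is an assumption made for illustration: the 20 ms frame size, the energy threshold, the 300 ms and 700 ms cutoffs, and the function names are hypothetical and are not taken from our production system.

```python
from typing import List, Sequence

FRAME_MS = 20          # assumed length of one audio frame
SILENCE_RMS = 500.0    # assumed energy threshold for 16-bit PCM samples
SHORT_PAUSE_MS = 300   # assumed: a short silence, caller is likely mid-sentence
END_OF_TURN_MS = 700   # assumed: a long silence, caller is likely done


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square energy of one frame of PCM samples."""
    if not frame:
        return 0.0
    return (sum(s * s for s in frame) / len(frame)) ** 0.5


def trailing_silence_ms(frames: List[Sequence[int]]) -> int:
    """Milliseconds of continuous silence at the end of the buffered audio."""
    silent_frames = 0
    for frame in reversed(frames):
        if rms(frame) >= SILENCE_RMS:
            break
        silent_frames += 1
    return silent_frames * FRAME_MS


def classify_pause(frames: List[Sequence[int]]) -> str:
    """Label the caller's state based on how long they have been silent."""
    silence = trailing_silence_ms(frames)
    if silence >= END_OF_TURN_MS:
        return "end_of_turn"
    if silence >= SHORT_PAUSE_MS:
        return "pause"
    return "speaking"
```

The same energy check can flag an interruption: if the agent is mid-utterance and an incoming frame is above the threshold, the caller has started talking over it.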

Biggest challenges

We still have two massive challenges: 1) latency; and 2) imbuing our agent with conversational intelligence.

Latency

On latency: we need to drive it down from 1.6s (our current median) to 0.6s, because humans expect a first verbal acknowledgement quickly, even if the rest of the sentence comes later. We think the most effective approach will be a combination of bringing core infrastructure in-house to serve faster responses and pre-loading acknowledgements like “right…” and “mmhmm”. Within our team, the second approach is controversial: we’re worried it’s a hacky solution that will ruin the magic of “real conversation”. What do you all think, especially about the tradeoff with latency?
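For anyone weighing in on the pre-loaded acknowledgement idea, here is one way it could work, sketched with asyncio: start generating the real reply, and only if it misses the 0.6s budget, speak a filler first. The names respond_with_filler, generate, and speak are hypothetical, and this is an illustration of the idea rather than our implementation.

```python
import asyncio
from typing import Awaitable, Callable

ACK_BUDGET_S = 0.6                 # target time to first acknowledgement, from the text above
FILLERS = ["right...", "mmhmm"]    # acknowledgements that would be synthesized ahead of time


async def respond_with_filler(
    generate: Callable[[], Awaitable[str]],      # hypothetical: produces the full reply
    speak: Callable[[str], Awaitable[None]],     # hypothetical: plays text to the caller
    budget_s: float = ACK_BUDGET_S,
    filler: str = FILLERS[0],
) -> None:
    """Speak the reply, preceded by a filler only when the reply is slow."""
    reply_task = asyncio.ensure_future(generate())
    done, _pending = await asyncio.wait({reply_task}, timeout=budget_s)
    if not done:
        # The reply missed the budget, so acknowledge the caller while it finishes.
        await speak(filler)
    await speak(await reply_task)
```

Because the filler is only played when the real reply is late, fast responses are unaffected, which may soften the "hacky" feel compared to always prefixing an acknowledgement.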

Conversational Intelligence

On teaching conversational intelligence: this is incredibly hard. LLMs don’t know how normal people have conversations. They love to pontificate and speak at length, whereas humans periodically ask questions to keep you engaged. To make our agent actually work, we have to teach it when to jump in and how to handle interruptions, and give it guidelines for how to approach conversations. Prompting plays a massive role in this; we found that guiding our LLM to refer to “phone call transcripts” was surprisingly impactful. We have also tweaked, and will continue to tweak, instructions around how long to speak for, how and when to ask questions, etc. For humans, language is such a primal skill that, to make people feel they’re engaged in real conversation, each micro-interaction has to be perfect. But that’s what makes this problem interesting :)
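To show what the “phone call transcript” framing might look like in practice, here is a small sketch that builds a prompt as a transcript for the LLM to continue. The wording of the guidelines, the speaker labels, and the build_call_prompt helper are all hypothetical examples, not the prompt we actually use.

```python
from typing import List, Tuple


def build_call_prompt(
    goal: str,
    turns: List[Tuple[str, str]],   # (speaker, text) pairs so far
    max_sentences: int = 2,         # assumed cap on how long the agent speaks
) -> str:
    """Frame the conversation as a phone call transcript for the LLM to continue."""
    guidelines = (
        "The following is a phone call transcript between an Agent and a Caller.\n"
        f"The Agent's goal: {goal}\n"
        f"The Agent speaks in at most {max_sentences} short sentences per turn, "
        "asks a question when it needs information, "
        "and stops talking when interrupted.\n"
    )
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in turns)
    # Ending on "Agent:" cues the model to write only the agent's next line.
    return guidelines + "\n" + transcript + "\nAgent:"
```

Keeping the length and question-asking rules as parameters makes it easy to keep tweaking them, which is where most of the iteration happens.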

Wrapping it up

We just opened our API to beta users. To send yourself (and your friends) AI phone calls, visit our developer portal. Note that you can also handle inbound calls (transcripts are posted to an endpoint via webhook).

Thanks for reading - we’d love to hear your feedback.