More

kwindla · 2025-12-25T02:29:47 1766629787

One easy way to build voice agents and connect them to Twilio is the Pipecat open source framework. Pipecat supports a wide variety of network transports, including the Twilio MediaStream WebSocket protocol so you don't have to bounce through a SIP server. Here's a getting started doc.[1]

(If you do need SIP, this Asterisk project looks really great.)

Pipecat has 90 or so integrations with all the models/services people use for voice AI these days. NVIDIA, AWS, all the foundation labs, all the voice AI labs, most of the video AI labs, and lots of other people use/contribute to Pipecat. And there's lots of interesting stuff in the ecosystem, like the open source, open data, open training code Smart Turn audio turn detection model [2], and the Pipecat Flows state machine library [3].

[1] - https://docs.pipecat.ai/guides/telephony/twilio-websockets [2] - https://github.com/pipecat-ai/pipecat-flows/ [3] - https://github.com/pipecat-ai/smart-turn

Disclaimer: I spend a lot of my time working on Pipecat. Also writing about both voice AI in general and Pipecat in particular. For example: https://voiceaiandvoiceagents.com/

ldenoue · 2025-12-25T03:57:03 1766635023

The problem with PipeCat and LiveKit (the 2 major stacks for building voice ai) is the deployment at scale.

That’s why I created a stack entirely in Cloudflare workers and durable objects in JavaScript.

Providers like AssemblyAI and Deepgram now integrate VAD in their realtime API so our voice AI only need networking (no CPU anymore).

nextworddev · 2025-12-25T04:16:23 1766636183

let me get this straight, you are storing convo threads / context in DOs?

e.g. Deepgram (STT) via websocket -> DO -> LLM API -> TTS?

ldenoue · 2025-12-27T01:26:03 1766798763

Yes DO let you handle long lived websocket connections. I think this is unique to Cloudflare. AWS or Google Cloud don't seem to offer these things (statefulness basically).

Same with TTS: some like Deepgram and ElevenLabs let you stream the LLM text (or chunks per sentence) over their websocket API, making your Voice AI bot really really low latency.

nextworddev · 2025-12-25T03:17:43 1766632663

This is good stuff.

In your opinion, how close is Pipecat + OSS to replacing proprietary infra from Vapi, Retell, Sierra, etc?

kwindla · 2025-12-25T15:11:22 1766675482

It depends on what you mean by replacing.

The integrated developer experience is much better on Vapi, etc.

The goal of the Pipecat project is to provide state of the art building blocks if you want to control every part of the multimodal, realtime agent processing flow and tech stack. There are thousands of companies with Pipecat voice agents deployed at scale in production, including some of the world's largest e-commerce, financial services, and healthtech companies. The Smart Turn model benchmarks better than any of the proprietary turn detection models. Companies like Modal have great info about how to build agents with sub-second voice-to-voice latency.[1] Most of the next-generation video avatar companies are building on Pipecat.[2] NVIDIA built the ACE Controller robot operating system on Pipecat.[3]

[1] https://modal.com/blog/low-latency-voice-bot - [2] https://lemonslice.com/ = [3] https://github.com/NVIDIA/ace-controller/

nextworddev · 2025-12-25T16:14:31 1766679271

Is there a simple, serverless version of deploying Pipecat stack, without: - me having to self host on my infra

I just want to provide: - business logic - tools - configuration metadata (e.g. which voice to use)

I don't like Vapi due to 1) extensive GUI driven experience, 2) cost

ldenoue · 2025-12-27T01:27:39 1766798859

Check out something like LayerCode (Cloudflare based).

Or PipeCat Cloud / LiveKit cloud (I think they charge 1 cent per minute?)

kwindla · 2025-10-14T23:26:12 1760484372

> Sounds like it. Dude you can be honest here.

I'm going to politely weigh in here and say things Sean won't say about himself.

You're talking to someone who has spent the last ten years building open source WebRTC software that many, many, many people use and that he's never tried to commercialize. He works tirelessly to make the Pion community welcoming to everyone, from engineers with a ton of networking/video experience to brand new contributors. He wrote the guide that should be everyone's first read about WebRTC.[] All of it as a labor of love.

He's being honest.

https://webrtcforthecurious.com/

rs186 · 2025-10-15T12:46:25 1760532385

Thanks for the context but I don't see how it's related to what I was asking. You could be Thomas Edison and I'd still ask the same question.

kwindla · 2025-10-14T23:16:07 1760483767

I honestly can't tell if this is trolling. LEGO bricks are pretty new technology, in the scheme of things. The original LEGO company "binding brick" was created in the late 1940s.

Of course you don't "need" an LLM to have a great toy. You also don't "need" injection-molded plastic. But if you have access to one or both, that can be pretty great!

Source: I wrote the spec for the first version of the LEGO Mindstorms programming language. These days I build a lot of voice+LLM stuff, some of it for big companies, some of it for myself and my kid.

kwindla · 2025-10-14T23:09:31 1760483371

I've done a fair amount of fine-tuning for conversational voice use cases. Smaller models can do a really good job on a few things: routing to bigger models, constrained scenarios (think ordering food items from a specific and known menu), and focused tool use.

But medium-sized and small models never hit that sweet spot between open-ended conversation and reasonably on-the-rails responsiveness to what the user has just said. We don't know yet know how to build models <100B parameters that do that, yet. Seems pretty clear that we'll get there, given the pace of improvement. But we're not there yet.

Now maybe you could argue that a kid is going to be happy with a model that you train to be relatively limited and predictable. And given that kids will talk for hours to a stuffie that doesn't talk back at all, on some level this is a fair point! But you can also argue the other side: kids are the very best open-ended conversationalists in the world. They'll take a conversation anywhere! So giving them an 8B parameter, 4-bit quantized Santa would be a shame.

kwindla · 2025-10-14T18:26:20 1760466380

I 100% agree with Sean that the computer is an exploration machine. There are lots of net positive things for kids (and non-kids) that LLMs make possible. Just like there were lots of net positive things that an Internet connection makes possible.

Of course there are things technologies can do that are bad. For kids. For adults. For societies. But I build this kind of voice+LLM stuff, too, and have a kid, and the exploration, play, and learning opportunities here are really, really amazing.

For example, we are within reach of giving every child in the world a personalized, infinitely patient tutor that can cover any subject at the right level for that child. This doesn't replace classroom teachers. It augments what you can do in school, and what kids will be able to do outside of school hours.

kwindla · 2025-10-14T18:13:48 1760465628

This repo is one possible starting point for tinkering with local agents on macOS. I've got versions of this for NVIDIA platforms but I tend to gravitate to using LLMs that are too big to fit on most NVIDIA consumer cards.

https://github.com/kwindla/macos-local-voice-agents

kwindla · 2025-08-25T19:31:24 1756150284

As someone who spends a lot of time looking at timestamped log lines to debug Pipecat pipelines, I'm a big fan of this work from Aleix.

In general, I have three pain points with debugging realtime, multi-model, multi-modal AI stuff. 1. where's the latency creeping in? 2. What context actually got passed to the models. 3. Did the model/processor get data in the format it expected.

For 1 and 3, Whisker is a big step forward. For 2, something like LangFuse (Open Telementry) is very helpful.

kwindla · 2025-04-25T16:00:09 1745596809

https://www.slate.auto/en

The configurator is fun:

https://www.slate.auto/en/personalization

rawgabbit · 2025-04-25T16:55:12 1745600112

Thanks for the link. I see they sell portable bluetooth speakers we can mount under the dash. I like the idea of DIY wrapping both the interior and exterior; I can imagine anime fan boys like my son coming up with very wild art for these wraps. I had also forgotten cars used to have hand cranks to roll up the windows.

kwindla · 2025-03-07T17:50:31 1741369831

In general, for realtime voice AI you don't want this model to support multiple speakers because you have a separate voice input stream for each participant in a session.

We're not doing "speaker diarization" from a single audio track, here. We're streaming the input from each participant.

If there are multiple participants in a session, we still process each stream separately either as it comes in from that user's microphone (locally) or as it arrives over the network (server-side).

kwindla · 2025-03-07T17:48:16 1741369696

I've talked about this a lot with friends.

Endpoint detection (and phrase endpointing, and end of utterance) are terms from the academic literature about this, and related, problems.

Very few people who are doing "AI Engineering" or even "Machine Learning" today know these terms. In the past, I argued that we should use the existing academic language rather than invent new terms.

But then OpenAI released the Realtime API and called this "turn detection" in their docs. And that was that. It no longer made sense to use any other verbiage.

mncharity · 2025-03-07T23:43:37 1741391017

Re SEO, I note "utterance" only occurs once, in a perhaps-ephemeral "Things to do" description.

To help with "what is?" and SEO, perhaps something like "Turn detection (aka [...], end of utterance)"... ?

lelag · 2025-03-07T18:33:04 1741372384

Thank for the explanation. I guess it makes some sense, considering many people with no nlp background are using those models now…