
A simple way is to split the model’s output stream before TTS. Reasoning/structured tokens go into one bucket, actual user-facing text into another. Only the second bucket is synthesized. Most "thinking out loud" issues come from feeding the whole stream directly into audio.
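
A minimal sketch of that split, assuming the reasoning span is delimited by <think>/</think> markers and that the markers arrive as whole chunks (both assumptions, not Qwen's confirmed behaviour):

    def user_facing_chunks(token_stream):
        # drop everything between <think> and </think>; pass the rest through
        in_think = False
        for tok in token_stream:
            if tok == "<think>":
                in_think = True
            elif tok == "</think>":
                in_think = False
            elif not in_think:
                yield tok

    # toy stream; only the second bucket would reach the speech stage
    stream = ["<think>", "user asked about X", "</think>", "Here's ", "the answer."]
    print("".join(user_facing_chunks(stream)))  # -> "Here's the answer."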




There is no TTS here. It's a native audio output model which outputs audio tokens directly. (At least, that's how the other real-time models work. Maybe I've misunderstood the Qwen-Omni architecture.)

True, but even with native audio-token models you still need to split the model’s output channels. Reasoning/internal tokens shouldn't go into the audio stream; only user-facing content should be emitted as audio. The principle is the same whether the last step is TTS or audio-token generation.
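
Same gate at the token-id level, with made-up ids standing in for whatever delimiters the model actually uses; only ids emitted outside the reasoning span would be handed to the audio decoder:

    THINK_START, THINK_END = 1001, 1002  # hypothetical ids, not the real ones

    def decodable_audio_ids(id_stream):
        in_think = False
        for tid in id_stream:
            if tid == THINK_START:
                in_think = True
            elif tid == THINK_END:
                in_think = False
            elif not in_think:
                yield tid

    print(list(decodable_audio_ids([1001, 5, 6, 1002, 7, 8])))  # -> [7, 8]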

There's an assumption there that the audio stream contains an equivalent of the <think>/</think> tokens. Every reason to think it should, but without seeing the tokeniser config it's a bit of a guess.
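
One way to check that guess, assuming the model ships a standard Hugging Face tokenizer and that the markers are literally named <think>/</think> (the model id below is an example, not taken from this release):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Omni-7B")
    vocab = tok.get_vocab()
    for marker in ("<think>", "</think>"):
        print(marker, "->", vocab.get(marker, "not in vocab"))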



