
A simple way is to split the model’s output stream before TTS. Reasoning/structured tokens go into one bucket, actual user-facing text into another. Only the second bucket is synthesized. Most "thinking out loud" issues come from feeding the whole stream directly into audio.
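
A minimal sketch of that split, assuming the reasoning span is delimited by <think>/</think> markers and that the markers arrive as whole chunks (both assumptions, not Qwen's confirmed behaviour):

    def user_facing_chunks(token_stream):
        # drop everything between <think> and </think>; pass the rest through
        in_think = False
        for tok in token_stream:
            if tok == "<think>":
                in_think = True
            elif tok == "</think>":
                in_think = False
            elif not in_think:
                yield tok

    # toy stream; only the second bucket would reach the speech stage
    stream = ["<think>", "user asked about X", "</think>", "Here's ", "the answer."]
    print("".join(user_facing_chunks(stream)))  # -> "Here's the answer."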




There is no TTS here. It's a native audio output model which outputs audio tokens directly. (At least, that's how the other real-time models work. Maybe I've misunderstood the Qwen-Omni architecture.)

True, but even with native audio-token models you still need to split the model’s output channels. Reasoning/internal tokens shouldn't go into the audio stream; only user-facing content should be emitted as audio. The principle is the same whether the last step is TTS or audio-token generation.
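
Same gate at the token-id level, with made-up ids standing in for whatever delimiters the model actually uses; only ids emitted outside the reasoning span would be handed to the audio decoder:

    THINK_START, THINK_END = 1001, 1002  # hypothetical ids, not the real ones

    def decodable_audio_ids(id_stream):
        in_think = False
        for tid in id_stream:
            if tid == THINK_START:
                in_think = True
            elif tid == THINK_END:
                in_think = False
            elif not in_think:
                yield tid

    print(list(decodable_audio_ids([1001, 5, 6, 1002, 7, 8])))  # -> [7, 8]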

There's an assumption there that the audio stream contains an equivalent of the <think>/</think> tokens. Every reason to think it should, but without seeing the tokeniser config it's a bit of a guess.
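
One way to check that guess, assuming the model ships a standard Hugging Face tokenizer and that the markers are literally named <think>/</think> (the model id below is an example, not taken from this release):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Omni-7B")
    vocab = tok.get_vocab()
    for marker in ("<think>", "</think>"):
        print(marker, "->", vocab.get(marker, "not in vocab"))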



