
Multimodal can refer to a lot of different types of models, but feeding LLM-generated text into Stable Diffusion definitely doesn’t count.
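For concreteness, that chained setup is just two separate models passing a plain string across the boundary. A rough sketch with Hugging Face transformers and diffusers (the model names are only illustrative):

  # LLM writes a caption, Stable Diffusion renders it.
  # The only thing the two models share is a text string.
  import torch
  from transformers import pipeline
  from diffusers import StableDiffusionPipeline

  llm = pipeline("text-generation", model="gpt2")
  caption = llm("Describe a cozy reading nook:", max_new_tokens=40)[0]["generated_text"]

  sd = StableDiffusionPipeline.from_pretrained(
      "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
  ).to("cuda")
  image = sd(caption).images[0]
  image.save("nook.png")

Neither model’s weights ever see the other modality, which is why it reads more like piping two tools together than a multimodal model.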

LLaVA is the first one that comes to mind; it takes images and text as input and outputs text.
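A rough sketch of calling it through Hugging Face transformers (the model id and prompt template are assumptions and may differ across versions):

  # Image + text go in together; text comes out.
  import requests
  from PIL import Image
  from transformers import AutoProcessor, LlavaForConditionalGeneration

  model_id = "llava-hf/llava-1.5-7b-hf"
  model = LlavaForConditionalGeneration.from_pretrained(model_id)
  processor = AutoProcessor.from_pretrained(model_id)

  image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
  prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"

  inputs = processor(text=prompt, images=image, return_tensors="pt")
  output = model.generate(**inputs, max_new_tokens=50)
  print(processor.decode(output[0], skip_special_tokens=True))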

There’s an unreleased version of GPT-4 that can do the same thing.



Sure, it’s technically not the same, but won’t it have the same effect?

How do our brains work? Isn't there a separation between image and text processing?


Surely there needs to be some amount of training with both models in the loop before it can be considered a multimodal system.




