LLaVA is the first one that comes to mind; it takes images and text as input and outputs text.
There’s an unreleased version of GPT-4 that can do the same thing.
How do our brains work? Isn't there a separation between image and text processing?