Inference on very large LLMs, where the model plus its working memory exceeds 48GB, is already far faster on a 128GB MacBook than on NVidia hardware, unless you have one of those monstrous Hx00s with lots of RAM, which most devs don't.
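For scale, a rough back-of-envelope on weight memory alone (parameter count times bytes per weight; activations and KV cache add more on top, and all figures here are approximations, not measurements):

    # Rough weight-memory footprint of an LLM at different precisions.
    # Weights only; activations / KV cache are extra.
    def weight_gb(params_billion: float, bytes_per_weight: float) -> float:
        return params_billion * bytes_per_weight  # 1e9 params * bytes / 1e9 = GB

    for params in (30, 70):
        for name, bpw in (("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)):
            print(f"{params}B @ {name}: ~{weight_gb(params, bpw):.0f} GB")
    # e.g. 30B @ fp16 is ~60 GB of weights: well past a 24 GB consumer GPU,
    # but it fits comfortably in a 128 GB MacBook's unified memory.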
No one is running LLMs on consumer NVidia GPUs or Apple MacBooks.
A dev who wants to run local models probably runs something that just fits on a proper GPU. For everything else, everyone uses an API key from whatever provider, because it's fundamentally faster.
Whether an affordable Intel GPU would be meaningfully faster for inference is not clear at all.
A 4090 is at least double the speed of Apple's GPU.
A 4090 is 5x faster than an M3 Max 128GB according to my tests, but it can't even run inference on LLaMA-30B. The moment you hit that memory limit, inference is suddenly 30x slower than on the M3 Max. So a basic GPU with 128GB RAM would trash a 4090 on those larger LLMs.
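A rough explanation for that cliff: single-stream token generation is roughly memory-bandwidth bound, so speed tracks wherever the weights live. The bandwidth numbers below are approximate spec-sheet figures, not benchmarks:

    # Rough tokens/s ceiling for single-stream decoding: each generated token
    # reads (roughly) every weight once, so tokens/s <= bandwidth / model size.
    # Bandwidth figures are approximate published numbers, not measurements.
    model_gb = 60  # LLaMA-30B at fp16, weights only
    devices = (
        ("RTX 4090 VRAM", 1000),
        ("M3 Max unified memory", 400),
        ("DDR4 system RAM", 50),
    )
    for device, bw_gb_s in devices:
        print(f"{device}: ~{bw_gb_s / model_gb:.0f} tokens/s ceiling")
    # A 30B fp16 model doesn't fit in 24 GB of VRAM, so the spilled layers run
    # at system-RAM/PCIe speed, which is where a ~30x slowdown can come from.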
Quantized 30B models should run in 24GB VRAM. A quick search found people doing that with good speed: [1]
I have a 4090, PCIe 3x16, DDR4 RAM.
oobabooga/text-generation-webui
using exllama
I can load 30B 4bit GPTQ models and use full 2048 context
I get 30-40 tokens/s
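For anyone wanting to reproduce something similar without the webui, here is a minimal sketch using Hugging Face transformers with a 4-bit GPTQ checkpoint. The model ID is a placeholder, and this goes through the transformers GPTQ path rather than exllama, so speeds will differ from the numbers above:

    # Minimal sketch: run a 4-bit GPTQ model on a single 24 GB GPU.
    # Requires transformers, optimum and auto-gptq; the repo name below is
    # an example placeholder, not the exact model from the setup above.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "TheBloke/LLaMA-30B-GPTQ"  # placeholder GPTQ checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    prompt = "Explain GPTQ quantization in one sentence."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(output[0], skip_special_tokens=True))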
Quantized, sure, but there is some loss of output variability that one notices quickly with 30B models. If you want to use the fp16 version, you are out of luck.
I ran some variation of llama.cpp that could handle large models by running a portion of them on the GPU and, if they were too large, the rest on the CPU, and those were the results. Maybe I can dig it up from some computer at home, but it was almost a year ago when I got the M3 Max with 128GB RAM.
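That hybrid setup is what llama.cpp's layer offloading does. A minimal sketch via the llama-cpp-python bindings, with the model path and layer count as placeholders (n_gpu_layers controls how many layers go to the GPU; the rest stay on the CPU):

    # Minimal sketch: split a GGUF model between GPU and CPU with llama.cpp's
    # Python bindings. Layers that don't fit in VRAM are evaluated on the CPU,
    # which is the slow path described above.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./llama-30b.Q4_K_M.gguf",  # placeholder quantized model
        n_gpu_layers=40,  # offload as many layers as VRAM allows; -1 = all
        n_ctx=2048,       # context window
    )

    result = llm("Write a haiku about memory bandwidth.", max_tokens=64)
    print(result["choices"][0]["text"])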
My comment was about Intel having a starter project, getting an enthusiastic response from devs, building network effects, and iterating from there. They need a way to threaten Nvidia, and just focusing on what they can't do won't get them there. There is one route by which they can disrupt Nvidia's high end over time, and that's a cheap basic GPU with lots of RAM. Like 1st-gen Ryzen, whose single-core performance was two generations behind Intel's, yet which trashed Intel by providing 2x as many cores for cheap.
That's a question the M3 Max with its integrated GPU has already answered. And it's not as if I haven't done any HPC or CUDA work in the past; I'm not completely clueless about how GPUs work, though I haven't written those libraries myself.