Quantized 30B models should fit in 24 GB of VRAM. A quick search turned up people running them at good speed: [1]
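For a rough sense of why it fits, here's a back-of-envelope estimate (assuming ~32.5B parameters and LLaMA-30B-ish shapes; these are assumptions, not measurements):

```python
# Back-of-envelope VRAM estimate for a 4-bit quantized ~30B model.
# Assumed figures: 32.5e9 parameters, 4 bits per weight, 60 layers,
# hidden size 6656, 2048-token context with an fp16 KV cache.
params = 32.5e9
weights_gb = params * 4 / 8 / 1e9                  # ~16.3 GB of 4-bit weights
layers, ctx, hidden = 60, 2048, 6656               # LLaMA-30B-ish shapes (assumed)
kv_cache_gb = 2 * layers * ctx * hidden * 2 / 1e9  # K and V tensors in fp16 -> ~3.3 GB
total_gb = weights_gb + kv_cache_gb
print(f"weights ~{weights_gb:.1f} GB + KV cache ~{kv_cache_gb:.1f} GB = ~{total_gb:.1f} GB")
# ~19.5 GB, which leaves headroom for activations and CUDA overhead on a 24 GB card.
```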
I have a 4090 (PCIe 3.0 x16, DDR4 RAM). With oobabooga/text-generation-webui using the exllama loader, I can load 30B 4-bit GPTQ models with the full 2048-token context and get 30-40 tokens/s.
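The webui takes care of loading, but if you want to poke at a 4-bit GPTQ checkpoint from plain Python, something along these lines works with AutoGPTQ (a different backend than exllama, shown purely as an illustration; the repo name is a placeholder and exact arguments vary by version):

```python
# Minimal sketch: load and run a 4-bit GPTQ checkpoint with AutoGPTQ.
# Placeholder repo name; swap in whatever GPTQ checkpoint you actually use.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "TheBloke/SomeModel-30B-GPTQ"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,  # most GPTQ repos ship safetensors weights
)

prompt = "Explain GPTQ quantization in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```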
Quantized, sure, but there is some loss of output variability that you notice quickly with 30B models. If you want to run the fp16 version, you're out of luck.
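For scale: 32.5e9 parameters x 2 bytes is roughly 65 GB of weights in fp16, nearly three times a 4090's 24 GB before you even count the KV cache, so the fp16 version means multiple GPUs or heavy CPU offloading.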