Quantized 30B models should fit in 24 GB of VRAM. A quick search turned up people running them at good speed: [1]
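For a rough sense of why it fits, here's a back-of-envelope estimate (assuming ~32.5B parameters and LLaMA-30B-ish shapes; these are assumptions, not measurements):

```python
# Back-of-envelope VRAM estimate for a 4-bit quantized ~30B model.
# Assumed figures: 32.5e9 parameters, 4 bits per weight, 60 layers,
# hidden size 6656, 2048-token context with an fp16 KV cache.
params = 32.5e9
weights_gb = params * 4 / 8 / 1e9                  # ~16.3 GB of 4-bit weights
layers, ctx, hidden = 60, 2048, 6656               # LLaMA-30B-ish shapes (assumed)
kv_cache_gb = 2 * layers * ctx * hidden * 2 / 1e9  # K and V tensors in fp16 -> ~3.3 GB
total_gb = weights_gb + kv_cache_gb
print(f"weights ~{weights_gb:.1f} GB + KV cache ~{kv_cache_gb:.1f} GB = ~{total_gb:.1f} GB")
# ~19.5 GB, which leaves headroom for activations and CUDA overhead on a 24 GB card.
```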
I have a 4090 (PCIe 3.0 x16, DDR4 RAM). With oobabooga/text-generation-webui using the exllama loader, I can load 30B 4-bit GPTQ models with the full 2048-token context and get 30-40 tokens/s.
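The webui takes care of loading, but if you want to poke at a 4-bit GPTQ checkpoint from plain Python, something along these lines works with AutoGPTQ (a different backend than exllama, shown purely as an illustration; the repo name is a placeholder and exact arguments vary by version):

```python
# Minimal sketch: load and run a 4-bit GPTQ checkpoint with AutoGPTQ.
# Placeholder repo name; swap in whatever GPTQ checkpoint you actually use.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "TheBloke/SomeModel-30B-GPTQ"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,  # most GPTQ repos ship safetensors weights
)

prompt = "Explain GPTQ quantization in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```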
Quantized, sure, but there is some loss of output variability that you notice quickly with 30B models. If you want to run the fp16 version, you're out of luck.
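For scale: 32.5e9 parameters x 2 bytes is roughly 65 GB of weights in fp16, nearly three times a 4090's 24 GB before you even count the KV cache, so the fp16 version means multiple GPUs or heavy CPU offloading.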