Hacker News

Wouldn’t the softmax typically be “fused” with the matmul though?


Yes, but as far as I understand this is only really feasible with FlashAttention-style kernels. (The main idea is that softmax needs the max-subtraction / log-sum-exp trick for numerical stability, but the global max isn't known while streaming through the blocks, so you keep a running max and rescale the previously accumulated sums whenever it increases.)
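The "online softmax" idea described above can be sketched in a few lines of NumPy. This is illustrative, not FlashAttention's actual kernel: it shows only the running-max bookkeeping, where earlier partial sums are rescaled by exp(m_old - m_new) each time a new block raises the max.

```python
import numpy as np

def online_softmax(x, block_size=4):
    """Numerically stable softmax computed in one streaming pass over blocks."""
    m = -np.inf  # running max of all elements seen so far
    l = 0.0      # running sum of exp(x_i - m), relative to the current max
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        m_new = max(m, block.max())
        # Rescale the previously accumulated sum to the new max, then add
        # this block's contribution. exp(m - m_new) <= 1, so no overflow.
        l = l * np.exp(m - m_new) + np.exp(block - m_new).sum()
        m = m_new
    # Normalize using the final (global) statistics.
    return np.exp(x - m) / l

x = np.array([1.0, 3.0, 2.0, 5.0, 0.5, 4.0, 2.5, 1.5, 3.5])
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), reference)
```

FlashAttention fuses this rescaling into the attention matmul itself, carrying (m, l) per query row so the score matrix never has to be materialized in full.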




