Hacker News

Wouldn’t the softmax typically be “fused” with the matmul though?


Yes, but as far as I understand this is only really feasible with FlashAttention-style kernels. (The main idea is that softmax needs the max-subtraction / log-sum-exp trick for numerical stability, but the global max isn't known while streaming through the blocks, so you keep a running max and rescale the previously accumulated sums whenever it increases.)
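The "online softmax" idea described above can be sketched in a few lines of NumPy. This is illustrative, not FlashAttention's actual kernel: it shows only the running-max bookkeeping, where earlier partial sums are rescaled by exp(m_old - m_new) each time a new block raises the max.

```python
import numpy as np

def online_softmax(x, block_size=4):
    """Numerically stable softmax computed in one streaming pass over blocks."""
    m = -np.inf  # running max of all elements seen so far
    l = 0.0      # running sum of exp(x_i - m), relative to the current max
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        m_new = max(m, block.max())
        # Rescale the previously accumulated sum to the new max, then add
        # this block's contribution. exp(m - m_new) <= 1, so no overflow.
        l = l * np.exp(m - m_new) + np.exp(block - m_new).sum()
        m = m_new
    # Normalize using the final (global) statistics.
    return np.exp(x - m) / l

x = np.array([1.0, 3.0, 2.0, 5.0, 0.5, 4.0, 2.5, 1.5, 3.5])
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), reference)
```

FlashAttention fuses this rescaling into the attention matmul itself, carrying (m, l) per query row so the score matrix never has to be materialized in full.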




