
To clarify, I don't work professionally with video; I've hacked on some projects and read some books about it. My professional experience with ML models is in writing backends that integrate with them; the models I've designed and trained were for my own education (so far, at least). So the answer to your question is probably, "I'm a dilettante who doesn't know better; you may well know more than me."

My impression is that much of the time, color doesn't provide much signal and just gives your model things to overfit on, so you collapse it down to grayscale. (Which is to say, most of the time you care about shape, not color.) But I bet there are problem spaces where your intuition holds; I'm sure there's performance to be wrung out of a model by experimenting with different color spaces whose geometry might separate samples more nicely.
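Concretely, I mean something like this (a rough sketch with OpenCV; the filename is a made-up stand-in for a frame pulled off cv2.VideoCapture):

    import cv2

    frame = cv2.imread("frame_0001.png")  # stand-in for a captured video frame (BGR)

    # Collapse to grayscale: keep shape information, drop color, and hand
    # the model one channel instead of three to overfit on.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Or move to a color space whose geometry might separate samples better;
    # LAB splits luminance (L) from the two chrominance axes (A, B).
    lab = cv2.cvtColor(frame, cv2.COLOR_BGR2LAB)
    L, A, B = cv2.split(lab)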

I did something similar a few months ago, where I used LDA [1] to build a boutique grayscale representation in which the intensity correlated with the classification problem at hand rather than with the luminosity of the subject. It worked better than I'd have guessed, just on its own (though I suspect it wouldn't work very well for most problems). The idea was to preprocess the frames of the video this way and then feed them into a CNN [2]; a rough sketch follows below. (Why not a transformer? Because I was still wrapping my mind around simpler architectures.)
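The gist of the LDA preprocessing looked roughly like this (scikit-learn; the labeled pixels here are random stand-ins for the real training data):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Stand-in training data: rows of (R, G, B) pixel values with one class
    # label per pixel. In the real project these came from labeled frames.
    pixels = np.random.randint(0, 256, size=(1000, 3)).astype(float)
    labels = np.random.randint(0, 2, size=1000)

    # With two classes, LDA yields a single discriminant axis.
    lda = LinearDiscriminantAnalysis(n_components=1)
    lda.fit(pixels, labels)

    def frame_to_lda_gray(frame):
        # Project every pixel onto the discriminant axis and rescale to
        # 0..255, so intensity tracks class separation, not luminosity.
        h, w, _ = frame.shape
        proj = lda.transform(frame.reshape(-1, 3).astype(float)).reshape(h, w)
        lo, hi = proj.min(), proj.max()
        return ((proj - lo) / (hi - lo + 1e-9) * 255).astype(np.uint8)

The resulting single-channel frames are what got fed to the CNN.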

[1] https://en.wikipedia.org/wiki/Linear_discriminant_analysis

[2] https://en.wikipedia.org/wiki/Convolutional_neural_network


