we've actually found that multimodal models are surprisingly good at maintaining temporal context as well
that being said, there's also a bunch of additional processing using more traditional CV / audio analysis we do to extract this information out as well (both frame-level and temporal) in your video understanding
for example, with the mean-motion analysis — you can see how subjects move over a period of time, which can help determine where important things are happening in the video, which ultimately can lead to better placements of edits.
we've actually found that multimodal models are surprisingly good at maintaining temporal context as well
that being said, there's also a bunch of additional processing using more traditional CV / audio analysis we do to extract this information out as well (both frame-level and temporal) in your video understanding
for example, with the mean-motion analysis — you can see how subjects move over a period of time, which can help determine where important things are happening in the video, which ultimately can lead to better placements of edits.