hey, thanks for the comment! we've actually found that multimodal models are sur...

hey, thanks for the comment!

we've actually found that multimodal models are surprisingly good at maintaining temporal context as well

that being said, there's also a bunch of additional processing using more traditional CV / audio analysis we do to extract this information out as well (both frame-level and temporal) in your video understanding

for example, with the mean-motion analysis — you can see how subjects move over a period of time, which can help determine where important things are happening in the video, which ultimately can lead to better placements of edits.