Moreover, data-driven control isn't a new concept. It's not my field, so I can't comment on what's new here, but I've heard about learning dynamics and rewards in a control theory context plenty of times.
In a control theory context, we're likely talking about inferring a handful of parameters whose relationships to one another are well known. In this paper, they're inferring the entire dynamics of an environment from thousands of raw pixel values. This is not something that admits a tractable exact optimization.