“Throw an LM at it” is the only approach that has ever produced human-level machine translation.

For theory on how a strong target-language-side LM can improve translation, even in the extreme scenario where no parallel texts are available, see https://proceedings.neurips.cc/paper_files/paper/2023/file/7...



You're mixing up cause and effect. The transformer architecture was invented for machine translation – and it's pretty good at it! (Very far from human-level, but still mostly comprehensible, and a significant improvement over the state of the art at the time of first publication.) But we shouldn't treat this as anything more than "special-purpose ML architecture achieves decent results".

The GPT architecture, using transformers to do iterated predictive text, is a modern version of the Markov bot. It's truly awful at translation when "prompted" to do so. (Perhaps surprisingly so, until you step back, look at the training data, and look at the information flow: the conditional probability of the next token isn't mostly coming from the source text.)
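
To make the "iterated predictive text" framing concrete, prompted translation looks roughly like this (Hugging Face API; gpt2 here is just a placeholder model that will mostly babble, not a claim about any particular system). The source sentence is only one slice of the conditioning context:

    # Minimal sketch of "prompted" translation with a decoder-only LM.
    # The model name is a placeholder; a base LM like this translates poorly.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "German: Das Haus ist alt.\nEnglish:"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, max_new_tokens=20)
    # everything after the prompt is just next-token prediction given the prompt
    print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))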

I haven't read that paper yet, but it looks interesting. From the abstract, it seems to be one of those perfectly valid papers that laypeople read as making a stronger claim than it does. This paragraph supports that:

> Note that these models are not intended to accurately capture natural language. Rather, they illustrate how our theory can be used to study the effect of language similarity and complexity on data requirements for UMT.


It’s true that the Transformer architecture was developed for seq2seq MT, but you can get similar performance with Mamba or RWKV or other new non-Transformer architectures. It seems that what is important is having a strong general sequence-learning architecture plus tons of data.

> The GPT architecture, using transformers to do iterated predictive text, is a modern version of the Markov bot.

The Markov nature only matters if the text falls outside the context window.
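
Just to pin down what that means, a one-liner (window size is purely illustrative):

    # The model conditions on at most the last CONTEXT_WINDOW tokens when
    # predicting the next one; for inputs that fit, nothing is forgotten.
    CONTEXT_WINDOW = 2048  # illustrative size

    def visible_context(tokens: list) -> list:
        return tokens[-CONTEXT_WINDOW:]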

> Perhaps surprisingly so, until you step back, look at the training data, and look at the information flow: the conditional probability of the next token isn't mostly coming from the source text.

I’m not sure what you’re getting at here. If it’s that you can often predict the next token without looking at the source language, then that’s also true for traditional encoder-decoder architectures, so it’s not a problem unique to prompting. Or are you getting at problems arising from teacher forcing?
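
In both setups the decoder models p(y_t | y_<t, x); they only differ in how x gets in (cross-attention vs. being part of the prompt). A toy sketch of that shared factorization, with a dummy step function that ignores the source entirely, to make the point that "predictable without the source" is a property of the data, not of prompting:

    # Shared factorization: log p(y | x) = sum_t log p(y_t | y_<t, x).
    # `step_logprob` stands in for either an encoder-decoder step (x via
    # cross-attention) or a decoder-only step (x prepended to the prompt).
    from typing import Callable, Sequence

    def seq_logprob(step_logprob: Callable[[Sequence[str], Sequence[str], str], float],
                    x: Sequence[str], y: Sequence[str]) -> float:
        return sum(step_logprob(x, y[:t], y[t]) for t in range(len(y)))

    # A dummy step that ignores x entirely still "works" in both framings.
    uniform = lambda x, y_prefix, y_t: -2.3  # ~log(1/10) over a toy vocabulary
    print(seq_logprob(uniform, ["Das", "Haus"], ["The", "house"]))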

Basically the question was how an LM could possibly help translation, and the answer is that it gives you a strong prior for the decoder. That’s also the basic idea in the theoretical UMT paper: you are trying to find a function from source to target language that produces a sensible distribution as defined by an LM.
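
Concretely, the LM-as-prior idea is something like the following: among candidate translation functions f, prefer the ones whose outputs a target-side LM scores as plausible. (Names here are illustrative; the paper's actual objective has more structure and constraints.)

    # Sketch of the LM-as-prior term: higher is better.
    from typing import Callable, Iterable

    def lm_prior_score(f: Callable[[str], str],
                       sources: Iterable[str],
                       lm_logprob: Callable[[str], float]) -> float:
        # summed target-side LM log-probability of f's outputs
        return sum(lm_logprob(f(x)) for x in sources)

    # A search over f would pick the f with the highest score, subject to
    # constraints that keep f faithful to the source text.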



