I'm not going to defend the YouTube captions as good, but even so, I find them incredibly helpful. My hearing is fine, but my processing is rubbish, and having a visual aid to help contextualize the sound is a big help, even when the captions are a bit wrong.
Your point about the caption language is probably right though. It's worse with jargon or proper names, and worse with non-American English speakers. If they can't even get all the common accents of English right, I have little hope for other languages.
Automatic translation famously fails catastrophically with Japanese, because it's a language that heavily depends on implied rather than explicit context.
The minimal grammatically correct sentence is simply a verb, and it's left as an exercise for the reader to work out what the subject and object are expected to be. (Essentially, the more formal/polite you get, the more things are added. You could say "kore wa atsui desu" to mean "this is hot." But you could also just say "atsui," which could also be interpreted as a question instead of a statement.)
Chinese seems to have similar issues, but I know less about how it's structured.
Anyway, it's really nice when Japanese music on YouTube includes a human-provided translation as captions. The automated ones are useless, when they don't give up entirely.
I assume people are talking about transcription, not translation. Translation on YouTube is, in my experience, indeed horrible in all the languages I have tried, but transcription in English is good enough to be useful. However, the more technical jargon a video uses, the worse the transcription gets (translation is totally useless for anything technical).
Automatic transcription in English depends heavily on accent, sound quality, and how well the speaker articulates. It will often confuse words that sound alike and produce nonsensical sentences, randomly skip words, or just insert random words for no clear reason.
It does seem to do a few clever things. For lyrics, it seems to look for existing transcribed lyrics first before making its own guesses (the timing, however, can be quite bad when it does this). Outside of that, AI transcription is like an alien who has read a book on a dead language and is transcribing based on what the book says each word should sound like phonetically. At times that can be good enough.
(A note on sound quality: it is not the perceived quality that matters. Many low-res videos have perfectly acceptable, if somewhat lossy, sound, yet the transcriber goes insane. It seems to prefer 1080p videos, which I assume have a much higher bit rate for the audio.)
At the times I have noticed the transcription being bad, my own speech comprehension is even worse, so I still find it useful. It is not a substitute for human-created (or at least human-curated) subtitles by any means, but it is better than nothing.