By checking the hash of the extracted files. The hash of the archive depends on the order in which the files were compressed, the compression settings, some metadata, and so on.
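A minimal sketch of that approach in Python (the function name and layout are mine, not any particular package manager's scheme): walk the extracted tree in a fixed order and hash the relative paths plus file contents, so the digest is independent of how the archive itself was produced.

```python
import hashlib
from pathlib import Path

def tree_hash(root: str) -> str:
    """Content hash of an extracted source tree.

    Visits files in sorted order and feeds each relative path and
    file body into one SHA-256, so the result doesn't change when
    the archive is regenerated with different file ordering,
    compression settings, or metadata.
    """
    h = hashlib.sha256()
    base = Path(root)
    for path in sorted(base.rglob("*")):
        if path.is_file():
            h.update(path.relative_to(base).as_posix().encode())
            h.update(b"\0")  # delimiter between path and contents
            h.update(path.read_bytes())
            h.update(b"\0")
    return h.hexdigest()

print(tree_hash("extracted-src"))  # "extracted-src" is a placeholder
```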
That’s expensive, complicated, exposes a greater attack surface, and requires new tooling to maintain considerably more complex metadata covering the full contents of source archives.
For the entire multi-decade history of open source, the norm has been — for very good reason — that source archives are immutable and will not change.
The solution here isn’t to change the entire open source ecosystem.
> For literally the entire multi-decade history of open source, the norm has been — for very good reason — that source archives are immutable and will not change.
Well, the norm has been that maintainers generated and distributed a source archive, and that that archive was immutable. That workflow still works perfectly fine on GitHub and is not impacted by this change.
The problem is that a bunch of maintainers stopped generating and distributing archives, and instead started relying on GitHub to automatically do that for them.
> That workflow is still perfectly fine with GitHub
It would be perfectly fine if you could prevent GitHub from linking the autogenerated archives on the releases page, or at least distinguish them in a way that makes it clear they are not immutable, maintainer-generated archives.
The problem was people assuming GitHub works like that: saving an archive of every commit, which is obviously silly if you think about it (why store it when you can regenerate it on a whim from any commit you want?).
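That regeneration is cheap because git can produce an archive for any commit straight from the object store; roughly, assuming a local checkout and a placeholder commit hash:

```python
import subprocess

# Produce a zip for an arbitrary commit on demand, which is roughly
# what GitHub does server-side instead of storing one archive per
# commit. "deadbeef" is a placeholder for a real commit hash.
subprocess.run(
    ["git", "archive", "--format=zip", "-o", "src.zip", "deadbeef"],
    check=True,
)
```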
You are speaking about release archives. GitHub's "Download as zip" feature is not the same thing as the multi-decade open source tradition you are talking about.
I always thought zip archives from this feature were generated on the fly, maybe cached, because I wouldn't expect GitHub to store a zip archive for every commit of every repository.
I'm actually surprised that so many important projects were relying on this feature producing stable output, and that the output actually was stable.
Indeed. I remember when Canonical was heavily pushing bzr and others were big fans of Mercurial. Glad my package manager maintainers didn’t waste time writing infrastructure to handle those projects at the repository level. Nobody had to, because providing source tarballs was the norm.
Huh? What I do fully believe is that downloading a source tarball over HTTPS, verifying its checksum, and extracting it will take less time than cloning the repository with Git and then verifying the checksums of all the files, which you said would take 29 seconds plus 0.4s.
My point is that whether you spend 0.08s computing the md5 of the zip (I just measured) or 0.3s computing the hash of the repo does not matter in the slightest if you are managing software repos, since just extracting the source and preparing to build it will be an order of magnitude slower.
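If you want to reproduce the archive-side number, here is a rough sketch (the filename is a placeholder, and the exact figure depends on archive size and disk speed). It times only the checksum step, which the extract-and-build work then dwarfs:

```python
import hashlib
import time

def md5_of_file(path: str) -> str:
    """MD5 of the raw archive bytes, read in chunks to bound memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

start = time.perf_counter()
digest = md5_of_file("project-1.2.3.zip")  # placeholder archive name
print(f"md5 {digest} in {time.perf_counter() - start:.3f}s")
```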