> Was just verifying your tweet's hash, and then...omg!!! I couldn't believe what I realised. The SHA256 of THIS tweet starts with exactly the same 7 characters as your tweet's hash. What are the chances of that?
It's not too difficult. All you need is to generate many variations of potential words and check whether the sha256 hash matches the wanted leading characters. For example, check this text.
$ echo -n $'It\'s not too difficult. All you need is to generate many variations of potential words and check whether the sha256 hash matches the wanted leading characters. For example, check this text.' | sha256sum
182a7c9d2e99162688aaaf3f97638edd7d06f8d295e456c7bb1f16abf3a8f70c -
$ echo -n $'Was just verifying your tweet\'s hash, and then...omg!!! I couldn\'t believe what I realised. The SHA256 of THIS tweet starts with exactly the same 7 characters as your tweet\'s hash. What are the chances of that?' | sha256sum
182a7c9c08b2f0f9333bf23828c5fbf47addf74e815b6a22ca10825450bc2ee1 -
I'm confused as to how they've done this. The original message, sure, you can brute force the digits and hope you get a collision, and try a new, plausible preamble if not. But I don't see how they've found this collision without it looking like anything is brute forced.
Is it using substitute Unicode characters or something?
Indeed, this is how I did it. 2^28 is around 270 million. With little more than a handful options, it should be possible. Although I have to say, it turned out to be more difficult than I'd initially thought. Maybe it's better to think of it as 28 different boolean choices.
echo -n "Indeed. This is how I managed to do it. 2^28 is around 300 million. With only a handful options, it's possible. Although, I must say that it turned out to be more difficult than I'd initially thought. Perhaps it's better to think of it as 28 different alternatives." | sha256sum
Throwing a loop of numbers on the end seems far more feasible. I put together a hasty Python implementation which can immediately find three hits at 5 characters. TQDM is reporting ~450k tries per second, so depending on how lucky you are, would probably want to redo in a faster language to solve for 7+
import hashlib
import itertools
import tqdm
BLOCKS_SIZE = 5
sentence_prefix = "The SHA256 for this sentence begins with:"
itos = {0x00:"zero", 0x01:"one", 0x02:"two", 0x03:"three", 0x04:"four", 0x05:"five", 0x06:"six", 0x07:"seven", 0x08:"eight", 0x09:"nine",
0x0a:"a", 0x0b:"b", 0x0c:"c", 0x0d:"d", 0x0e:"e", 0x0f:"f"}
for nums in tqdm.tqdm(itertools.product(itos.keys(), repeat=BLOCKS_SIZE)):
sentence = f"{sentence_prefix} {', '.join(itos[num] for num in nums[:-1])}, and {itos[nums[-1]]}."
hash_true = hashlib.sha256(bytes(sentence, "utf8")).hexdigest()
guessed_prefix = "".join(f"{n:x}" for n in nums)
true_prefix = hash_true[:BLOCKS_SIZE]
if guessed_prefix == true_prefix:
print("collision")
print(sentence)
print(hash_true)
All we are demonstrating here is why Sha256 is 256 bits and not 32 bits. We have trivially identified collisions for the first 28 bits of the output, which is only 11% of the entire hash size.
Difficulty of collisions roughly doubles for each additional bit. Imagine we had a SHA32, that would be 16 times harder to achieve a collision. SHA256 is 43 with 67 zeroes behind it more difficult than the examples here.
7 digits of the hash makes for 16*7 possible hashes. I spot 4 potential "filler" lines in that tweet, so if you find log4(16*7)=14 candidates for each of those filler lines, then one combination would be expected to yield that hash.
Basically I had several substitutions around words, case, punctuation, etc. and just ran it until it found some hits. Quite easy with just four characters though but was only a proof of concept.
You can generate sentences of the form "This sentence begins with: " followed by seven comma-separated english words denoting hex digits. Then search that space of digits, until you get a hit.
For each digit combination, you can try it with multiple variations of the sentence like "The SHA256 of this sentence begins with", "The SHA256 hash of this text starts with" and many more. That increases the search space without increasing the number of digits that have to match, making it more likely that a hit is found.
If we don't vary the sentence ("begins with, starts with, is prefixed with" etc...) and only consider the final part, then the number of english sentences is (unsurprisingly) equal to the number of possible hashes.
So for each distinct sentence, we pick randomly from the n different hashes, and we attempt to brute force this n times. So the probability of not getting this is:
(1-1/n)^n
As n increases, this will yield e^-1, so we have about a 36.7% chance of this not happening for any given length. So this has a probability of happening of 63.3%.
So there is a decent chance that there exists a sentence "The sha256 of this sentence is ..." for even the full sha256. If you are allowed to modify the sentence to be something like "'Begins with', 'starts with', 'OMG guys, check this out':" you can get this up to almost 1. Finding it would be mildly hard though barring some novel discovery about sha256.
As always, the real WTF is in the comments.