Collision attack against widely used MD5 algorithm took 10 hours, cost just 65 cents.
by Dan Goodin
Underscoring just how broken the widely used MD5 hashing algorithm is, a software engineer racked up just 65 cents in computing fees to replicate the type of attack a powerful nation-state used in 2012 to hijack Microsoft’s Windows Update mechanism.
Nathaniel McHugh ran open source software known as HashClash to modify two separate images (one depicting funk legend James Brown, the other R&B singer/songwriter Barry White) so that they generate precisely the same MD5 hash, e06723d4961a0a3f950e7786f3766338. The exercise, known in cryptographic circles as a hash collision, took just 10 hours and cost only 65 cents plus tax to complete using a GPU instance on Amazon Web Services. In 2007, cryptography expert and HashClash creator Marc Stevens estimated it would require about one day to complete an MD5 collision using a cluster of PlayStation 3 consoles.
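To make the term concrete, here is a minimal Python sketch of what it means to check for a collision. It uses only the standard hashlib module; the helper names (md5_hex, is_md5_collision, describe) are invented for this illustration and are not part of HashClash or McHugh's tooling.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


def is_md5_collision(a: bytes, b: bytes) -> bool:
    """True only if the two inputs differ but their MD5 digests match."""
    return a != b and md5_hex(a) == md5_hex(b)


def describe(a: bytes, b: bytes) -> dict:
    """Compare two inputs under MD5 and, as a cross-check, SHA-256."""
    return {
        "md5_equal": md5_hex(a) == md5_hex(b),
        "sha256_equal": hashlib.sha256(a).digest() == hashlib.sha256(b).digest(),
        "is_md5_collision": is_md5_collision(a, b),
    }


# Ordinary inputs do not collide; a crafted pair such as the two images would.
print(describe(b"James Brown", b"Barry White"))
```

Run against the bytes of the two modified images (for example, read with pathlib.Path(...).read_bytes() from wherever the files are saved), a check like this would be expected to report matching MD5 digests but different SHA-256 digests.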
The practical ability to create two separate inputs that generate the same hash is a fundamental flaw that makes MD5 unsuitable for most purposes. (The exception is password hashing. Single iteration MD5 hashing is horrible for passwords but for an entirely different reason that is outside the scope of this post.) The susceptibility to collisions can have disastrous consequences, potentially for huge swaths of the Internet.
The Flame espionage malware, for instance, exploited the MD5 collision weakness to counterfeit the sensitive digital certificate Windows machines rely on to determine when system updates are trustworthy. The exploit allowed Flame, which infected computers in Iran and other countries in the Middle East, to easily spread from one computer to another inside a local network. By presenting a counterfeit digital certificate based on the same MD5 hash as the legitimate credential, infected machines were able to hoodwink uninfected machines into accepting malicious code as if it came from Microsoft. Microsoft has since retired the use of MD5 and is in the process of phasing out SHA1, a separate hashing algorithm that is expected to become susceptible to practical collision attacks soon.
“So I guess the message to take away here is that MD5 is well and truly broken,” McHugh wrote in a blog post headlined How I created two images with the same MD5 hash. “Whilst the two images have not shown a break in the pre-image resistance, I cannot think of a single case where the use of a broken cryptographic hash function is an appropriate choice.”
Pre-image resistance refers to the ability to withstand attempts to determine the message or input that generated a given hash. A closely related property, second pre-image resistance, is the ability to withstand attempts to find a second input that generates the same hash as a given first input. While breaching these two requirements remains infeasible for MD5, researchers have had little trouble violating the third principle, collision resistance. In his blog post, McHugh explains how he did it:
The chosen prefix collision attack works by repeatedly adding ‘near collision’ blocks which gradually work to eliminate the differences in the internal MD5 state until they are the same. Before this can be done the files must be of equal length and the bit differences must be of a particular form. This requires a brute force ‘birthday’ attack which tries random values until two are found that work; it does, however, have a much lower complexity than a complete brute force attack.
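The reason a ‘birthday’ search is so much cheaper than exhaustive search can be seen in a toy Python sketch. It looks for two different inputs whose MD5 digests agree in only their first three bytes (24 bits), which typically takes a few thousand tries rather than millions. This is an illustration under simplified assumptions, not HashClash's method: the real birthday step searches for internal-state differences of a particular form, and the function names and the "msg-N" input scheme here are invented for the example.

```python
import hashlib
from itertools import count


def truncated_md5(data: bytes, n_bytes: int) -> bytes:
    """Return only the first n_bytes bytes of the MD5 digest."""
    return hashlib.md5(data).digest()[:n_bytes]


def birthday_search(n_bytes: int = 3):
    """Find two distinct inputs whose truncated MD5 digests are equal.

    Every truncated digest seen so far is remembered, so each new
    candidate is compared against all earlier ones at once. That is why
    roughly 2**(bits/2) tries tend to suffice instead of 2**bits.
    """
    seen = {}
    for i in count():
        msg = b"msg-%d" % i              # hypothetical candidate inputs
        tag = truncated_md5(msg, n_bytes)
        if tag in seen:                  # partial collision found
            return seen[tag], msg, i + 1
        seen[tag] = msg


first, second, tries = birthday_search(3)
print(first, second, tries)
```

The same trade-off (memory for time) is what makes the full-scale birthday step feasible on rented GPU hardware, although at that scale the bookkeeping is far more involved than a single dictionary.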
If the attack sounds complicated to do in practice, fortunately Marc Stevens has created a framework for automatically finding differential paths and using them to create chosen-prefix collisions. It is available at https://code.google.com/p/hashclash/ . It is intended mainly as a research tool but there is a GUI and pre-built binaries for Windows available. I chose to run it on Linux, however, using a bash script which can automate the repetitive steps needed.
Here are the MD5 states following each successive block (these are unpadded versions of the MD5 algorithm).
601034f03377d68d68a74f71b0d76bf4 c924b00ad433ccc979b8e79e6925f28e
0e5453c5c7deabc5e23331c415780ecf 0e5453c521968d3df17653c224bb30cd
8d43bcaea7f738cbbbae37b1a5ec4f3c 8d43bcae018f1a43cad159afb40f723a
ea12c8b8645e161d589c69dcf845dd60 ea12c8b8bef6f79467c08bda076aff5e
8ec47a1d73c2267dfb7686b83a3554fb 8ec47a1dce5608750a97a8b6495576f9
daa7508daa16fe67f09ede251abd83a2 daa7508dc5aadd5fffbefe2329dda3a0
c570b0b7b22e3a2545432f2a96baf83a c570b0b7cec2591d55634f28a6da1839
00d3111f51f505e81d5f5db537ed64a4 00d3111f6d0926e22d7f7db5470d85a4
73e0e4d069fdac380cfe2cf0e7ba47fb 73e0e4d0851dad321c1e2df0f7da47fb
583a08b75fa22174307a24df6109b9ec 583a08b76bc22174309a24df6129b9ec
e2bdd99bb0bcc66557192eec1f36ff44 e2bdd99bb0bcc66557192eec1f36ff44
The first line is the state after processing the original images where the two hashes are unrelated. The second line shows the state after padding to equal length and the addition of the ‘birthday’ bits. As you can see, the first four bytes of the hash are the same. Each of the next nine lines shows the hashes gradually converging until there are no differences. The last hash is different from the one calculated on the whole image as that one includes padding.
The image below shows the bit differences between the above hashes: a ‘.’ indicates they are the same and a ‘1’ indicates a difference.
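The same per-bit comparison can be recomputed from the 11 pairs of state values McHugh lists. The following Python sketch XORs each pair and prints ‘.’ for matching bits and ‘1’ for differing ones. It is a reconstruction for illustration only: the bit order simply follows the hex strings as printed, which may not match the ordering used in McHugh's original image, and the function names are invented here.

```python
# State pairs copied from the listing in McHugh's post (one pair per line).
STATES = """\
601034f03377d68d68a74f71b0d76bf4 c924b00ad433ccc979b8e79e6925f28e
0e5453c5c7deabc5e23331c415780ecf 0e5453c521968d3df17653c224bb30cd
8d43bcaea7f738cbbbae37b1a5ec4f3c 8d43bcae018f1a43cad159afb40f723a
ea12c8b8645e161d589c69dcf845dd60 ea12c8b8bef6f79467c08bda076aff5e
8ec47a1d73c2267dfb7686b83a3554fb 8ec47a1dce5608750a97a8b6495576f9
daa7508daa16fe67f09ede251abd83a2 daa7508dc5aadd5fffbefe2329dda3a0
c570b0b7b22e3a2545432f2a96baf83a c570b0b7cec2591d55634f28a6da1839
00d3111f51f505e81d5f5db537ed64a4 00d3111f6d0926e22d7f7db5470d85a4
73e0e4d069fdac380cfe2cf0e7ba47fb 73e0e4d0851dad321c1e2df0f7da47fb
583a08b75fa22174307a24df6109b9ec 583a08b76bc22174309a24df6129b9ec
e2bdd99bb0bcc66557192eec1f36ff44 e2bdd99bb0bcc66557192eec1f36ff44
"""


def bit_diff(hex_a: str, hex_b: str) -> str:
    """XOR two equal-length hex strings; '.' = same bit, '1' = different bit."""
    width = len(hex_a) * 4                      # 4 bits per hex digit
    xored = int(hex_a, 16) ^ int(hex_b, 16)
    return format(xored, "0{}b".format(width)).replace("0", ".")


def diff_rows(states: str = STATES) -> list:
    """Return one difference string per line of state pairs."""
    return [bit_diff(*line.split()) for line in states.strip().splitlines()]


for row in diff_rows():
    print(row, row.count("1"))                  # diff pattern and differing-bit count
```

Each printed row is 128 characters wide, one per bit of the MD5 state, followed by the number of bits in which the two states still differ.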
I ran HashClash on an AWS GPU instance. I cannot say with any certainty that this is the most efficient or cheapest option but it seemed to work reasonably quickly. In particular the ‘birthdaying’ step took much less time than I had expected. It finished in roughly 1hr; originally this step had an estimated complexity of 2^49 compression function calls. In his article Marc Stevens gives an estimate of 2 days for creating a complete collision on a PS3 in 2007. I found that I was able to run the algorithm in about 10 hours on an AWS large GPU instance, bringing it in at about $0.65 plus tax.
I faced a few problems with the code as published and had to make some changes to the bash script and some of the C++ code related to saving the collisions. I don’t know if the Windows binaries that are published work any better as this seems to be where effort has most recently been expended. If anyone wants to know the changes I made let me know.
The Flame malware used a completely different algorithm than that devised by Stevens, one that Stevens later determined was devised by world-class cryptographers. Still, it relied on the same underlying weakness in MD5. Researchers are recommending use of SHA2, SHA3, or SHA2-512/256.