Tuesday, November 11, 2014

Three way MD5 collision

Previously I explained how I created two images one of James Brown the other of Barry White with the same MD5 hash. At the end of the post I said I was going to try and create a three way collision where three images have the same MD5 hash. Neil K made a suggestion about the image

So I set to work.

After a couple of false starts where I started with the wrong image file I managed to achieve a three way collision. Here are the images.

If you want to check
$ curl -s http://www.fishtrap.co.uk/black.jpg.coll | md5
$ curl -s http://www.fishtrap.co.uk/brown.jpg.coll | md5
$ curl -s http://www.fishtrap.co.uk/white.jpg.coll | md5

A new hash value

This isn't the same hash as before instead the 3 images now collide with a new hash value b69dd1fd1254868b6e0bb8ed9fe7ecad . This is because I had to add near collision blocks to all three images. In the case of the first two the blocks added are the same. This is probably best illustrated with a diagram.


Again I created the files with HashClash. As inputs I used white.jpg and black.jpg images. To make brown.jpg.coll I just had to append the extra collision blocks to brown.jpg which was already a collision with white.jpg. 

I could go on adding more and more files in a tree structure to get many documents to collide. The number of collisions needed is n-1 where n is the number of files. It was this tree of collisions that allowed Marc Stevens to predict the 2008 US presidential election.

A word about file sizes

The files started out different sizes to each other, however, before each collision was generated between two files padding had to be added to one of the files to make it the same as the other. Without this step it would be impossible to extend a collision in the unpadded version to the full MD5 algorithm. This is because the padding includes the size of data processed. 

Wednesday, November 5, 2014

Colliding MD5 Images go crazy

I nearly chocked on my cornflakes while looking at blog stats yesterday morning to see several thousand page views in the morning. Looking at the referring URLs it seemed a link had been posted on Hacker News. I browse that site most days. It was on the front page at number five, anything on the front page is BIG. Later on it got posted to a couple of reddit.com subreddits. I was getting a lot of traffic, all quite unexpected.

Originally I wrote the blog post for a couple of reasons, firstly to aid my own memory about what I had done and how to repeat it and secondly to explain to a few of my friends how I made the images which I had already put on twitter. I didn't really see it as Hacker News material, anyone who has an interest in hash functions will have done this I thought. 

Here are my answers to a couple of things people have said

This isn't new

Agreed, it certainly is not. All the papers and software I referenced and used are at least four years old. The original block buster attack by Wang is ten years old. Even before that in 1996 just four years after its release MD5 was known to be weak at not recommended. 

My guess at what caught peoples attention were two things:
  1. Images are much easier to comprehend as collisions than random byte strings.
  2. The fact I was clearly a rank amateur with nothing more than an AWS account and basic knowledge.

Can you do the same thing with SHA1 and SHA2?

No not really, in the case of SHA1 there are theoretical attacks which work the same way that mean it should not be used either. Marc Stevens has published an attack which he estimates to have a complexity of 261 compression functions for same prefix and 277 for chosen prefix collisions. In fact HashClash contains some code for finding differential paths and near collision blocks for SHA1. However to come up with a collision would take vast computing effort and (unless you control a botnet) expense. I'm sceptical of the exactness of the numbers in here but certainly that order of magnitude. As if to prove this fact there are no published collisions in the full SHA1. 

So with those sort of numbers I'm not got to stick it on an AWS instance backed by my credit card. 

SHA1 seems to be surprisingly resistant to differential analysis attacks this is one reason pretty much nobody has moved to using SHA3 yet. I like SHA3 or Keccak as it is also known.

SHA2 looks like even more of a challenge than SHA1 for differential analysis.

Is this practical?

Hopefully not, no one should be using MD5 for anything. However, old habits die hard and once upon a time MD5 seemed like a fast and secure hash function. You can still find it in use in package managers, download pages and is probably used server side in many applications to verify file uniqueness. I haven't found anywhere still using it in SSL certs following flame.

Can you do a third image collision?

Currently a work in progress, watch this space.

Friday, October 31, 2014

How I created two images with the same MD5 hash

I posted the following images the other day which although looking totally different have exactly the same MD5 hash (e06723d4961a0a3f950e7786f3766338) .

The images were just two I lifted from the web in fact I could have chosen any image or indeed any arbitrary data and created a collision with it.

Why is this surprising?

MD5 was designed as a cryptographic hash function. These are supposed to possess the following 3 properties as per wikipedia (http://en.wikipedia.org/wiki/Cryptographic_hash_function#Properties)

  • Pre-image resistance
    Given a hash h it should be difficult to find any message m such that h = hash(m).
  • Second pre-image resistance
    Given an input m1 it should be difficult to find another input m2 such that m1 ≠ m2 and hash(m1) = hash(m2).
  • Collision resistance
    It should be difficult to find two different messages m1 and m2 such that hash(m1) = hash(m2). Such a pair is called a cryptographic hash collision

The two images above clearly demonstrate that MD5 lacks the final property. MD5 is broken as a cryptographic hash function.

To search though all possible MD5 values is 2128 operations which is massive. To be in with a good chance of finding a collision would take ~  264 operations which is again far too big for normal computing.

How did I create the images?

This type of collision is has been termed a chosen prefix collision. In this case the image data is the prefix or to be more exact the internal state of the MD5 algorithm after processing the image is. You can't see the added binary data at the end of jpeg images as it is preceded with an End Of Image JPEG marker.

Chosen prefix collisions for MD5 were first successfully shown in 2007 in this paper http://www.win.tue.nl/hashclash/ChosenPrefixCollisions/The attack uses iterations of differential analysis of MD5. The first successful differential analysis was demonstrated by Xiaoyun Wang in her 2005 paper How to Break MD5 and Other Hash Functions.

In a previous post I used an improved version of this differential path to create two PHP files with the same MD5 hash. Where those files differed by only a few bits and the data before the collision had to be identical. In a chosen prefix collision the data preceding the specially crafted collision blocks can be completely different.

The chosen prefix collision attack works by repeatedly adding 'near collision' blocks which gradually work to  eliminate the differences in the internal MD5 state until they are the same. Before this can be done the files must be of equal length and the bit differences must be of a particular form. This requires a brute force 'birthday' attack which tries random values until two are found that work, it does however have a much lower complexity than a complete brute force attack.

If the attack sounds complicated to do in practice fortunately Marc Stevens has created framework for automated finding of differential paths and using them to create chosen pre-fix collisions. It is available at https://code.google.com/p/hashclash/ . It is intended mainly as a research tool but there is a GUI and pre-built binaries for windows available. I chose to run it on linux, however, using a bash script which can automate the repetitive steps needed. 

Here are the MD5 states following each successive block (these are unpadded versions of MD5 algorithm). 

The first line is the state after processing the original images where the two hashes are unrelated. The second line shows the state after padding to equal length and the addition of the 'birthday' bits. As you can see the first four bytes of the hash are the same. Each of the next nine lines shows the hashes gradually converging until they are there are no differences. The last hash is different from the one calculated on the whole image as that one includes padding.

The image below shows the bit differences between the above hashes a '.' indicates they are the same a '1' indicates a difference.

I ran HashClash on an AWS GPU instance. I cannot say with any certainty that this is the most efficient or cheapest option but it seemed to work reasonably quickly. In particular the 'birthdaying' step took much less time than I had expected. It finished in roughly 1hr originally this step had an estimated complexity of 249 compression function calls. In his article Marc Stevens gives an estimate of 2 days for creating a complete collision on a PS3 in 2007. I found that I was able to run the algorithm in about 10 hours on an AWS large GPU instance bringing it is at about $0.65 plus tax.

I faced a few problems with the code as published and had to make some changes to the bash script and some of the C++ code related to saving the collisions. I don't know if the windows binaries that are published work any better as this seems to be where effort has most recently been expended. If anyone wants too know the changes I made let me know.

So I guess the message to take away here is that MD5 is well and truly broken. Whilst the two images have not shown a break in the pre-image resistance or second pre-image resistance, I cannot think of a single case where the use of a broken cryptographic hash function is an appropriate choice.

It was a chosen prefix collision attack similar to this that was used to produce a counterfeit ssl certificate used to sign the Flame malware as Microsoft and pass itself off as a Windows update.

What Next?

Well once you have two colliding files you can easily create a third. You just take the output of the first files and a third with any data you want file and use it as the starting point for another chosen prefix collision. I may do this and would welcome image suggestions.

Marc Stevens has also published a tool for detecting files which have been processed in this way. Running this tool over either of the files can detect that collision blocks have been added.


I successfully created a third collision.

Tuesday, October 28, 2014

Images with colliding MD5 hash

These images of James Brown and Barry White have the same MD5 hash e06723d4961a0a3f950e7786f3766338.

Don't believe me. You can easily check it.

$ curl -s http://www.fishtrap.co.uk/brown.jpg | md5
$ curl -s http://www.fishtrap.co.uk/white.jpg | md5
You might need to use md5sum instead of md5 on the command line depending on your system.

Wednesday, October 15, 2014

How I made two PHP files with the same MD5 hash

I recently posted a link on twitter to two PHP scripts which have different behaviours but the same MD5 hash. To verify this download the files:

$ wget https://s3-eu-west-1.amazonaws.com/md5collisions/a.php
$ wget https://s3-eu-west-1.amazonaws.com/md5collisions/b.php

Then check their MD5 hash:
$ md5 a.php b.php
MD5 (a.php) = 62e1d0d1620581693435aa75f5b6c964
MD5 (b.php) = 62e1d0d1620581693435aa75f5b6c964

Now run each file:
$ php a.php
Behaviour A
$ php b.php
Behaviour B

The files behave differently and yet hash to the same hex string using MD5. A quick check with another hash algorithm (SHA1 in this case) confirms the two files are in fact different:

$ shasum a.php b.php
6ef81f904b84d6ea53048004a26157816a85cfa3  a.php
d191a5eb7557231f461c09cb4dc565a23336778a  b.php

So how were these files made? Well if you look at the source of file you will find there is very little difference between the two files.

cat a.php


if ('?G.???-)?????+?B?8????A???P?;"??P?ZU?5?$q4F?1c?-ŏC?S
?e?MIM???~?2?:?]b??\??/5???^?gn~3D7' == '?G.???-)?????+?B?8????A???P?;"??P?ZU?5?$q4F?1c?-ŏC?S
?e?MIM???~?2?:?]b??\??/5???^?gn~3D7' ) {
    echo 'Behaviour A',PHP_EOL;
} else {
    echo 'Behaviour B',PHP_EOL;

$ cat b.php


if ('?G.???-)??????+?B?8????A???P?;"??Py[U?5?$q4F?1??-ŏC?S
?e?MIM???~?2?:?]b?p?\??/5???^?g?~3D7' == '?G.???-)?????+?B?8????A???P?;"??P?ZU?5?$q4F?1c?-ŏC?S
?e?MIM???~?2?:?]b??\??/5???^?gn~3D7' ) {
    echo 'Behaviour A',PHP_EOL;
} else {
    echo 'Behaviour B',PHP_EOL;

Spot any difference? It may be hard to spot but there are actually subtle differences in the first of the chunks of binary data. 

So whats with all that binary data? The binary data represents a collision in the MD5 algorithm. Collisions are surprisingly easy to create for MD5 following an attack on MD5 first published by Wang in 2004. What’s more you can choose the starting value for the collision you want to create so it does not have to be at the start of the file.  This is because MD5 in common with most hash algorithms operates on one block of data (64 bytes in MD5s case) at a time and then uses this value as the starting point for the next block. The next two blocks differ between the two files, however, they have been specially created to collide when inserted at this location. All subsequent blocks are the same in the two files. This is the method I used to generate the files. 

1) First I created a file which is exactly 64 bytes long and contains all of the content up to the first single quote. The A’s are just there to make it exactly 64 bytes long

if (‘

2) Run this content though the MD5 algorithm but a slightly modified version which does not add the length or any padding to the message, I used my own implementation of MD5 in PHP and commented out the parts where padding is appended. You can get the code for this here https://gist.github.com/natmchugh/fbea8efeced195a2acf2.This produces the hash 32f6df28e4a110970fb7f9938bee1686.

3) Use this hash as the initial value to find an MD5 collision. The fastest collision script I have found is Marc Stevens fastcoll See https://marc-stevens.nl/research/ . If creating collisions in MD5 sounds like it will take a long time - it doesn’t. It varies how it takes but in this case it took about 15 seconds to produce 2 messages that collide with the initial value of 32f6df28e4a110970fb7f9938bee1686.

$ time ./fastcoll -i 32f6df28e4a110970fb7f9938bee1686 -o a b
MD5 collision generator v1.5
by Marc Stevens (http://www.win.tue.nl/hashclash/)

Using output filenames: 'a' and 'b'
Using initial value: 32f6df28e4a110970fb7f9938bee1686

Generating first block: .....................................
Generating second block: S10.............................................
Running time: 14.9727 s

real     0m14.981s
user     0m14.969s
sys     0m0.006s

4) I then appended the two messages produced to file created in 1 to give two files.

5). The two files should now be of equal length and have the same MD5 however they will not run as PHP. So I added the statement between the two lots of binary data. 
‘ == ‘

6) Next I appended the collision a to both files. This produces a statement that will be true in file a (since the 2 binary blobs are indeed the same)  but false in file b.

7) I appended the same PHP to both messages to complete the if statement and produce the output you see. So effectively we have two files where the first 64 bytes are identical. The next 128 bytes are a collision in MD5 and then all other text following that is identical. Not surprisingly this produces the same MD5 hash when passed though the whole MD5 algorithm including padding.

While this attack may appear useless since the files are filled with gobbledygook and rely on a fairly obvious if statement there are methods to create files with the same MD5 hash which do not rely on a switch and where the rest of the file contents can be completely different