Friday, October 31, 2014

How I created two images with the same MD5 hash

I posted the following images the other day which although looking totally different have exactly the same MD5 hash (e06723d4961a0a3f950e7786f3766338) .

The images were just two I lifted from the web in fact I could have chosen any image or indeed any arbitrary data and created a collision with it.

Why is this surprising?

MD5 was designed as a cryptographic hash function. These are supposed to possess the following 3 properties as per wikipedia (

  • Pre-image resistance
    Given a hash h it should be difficult to find any message m such that h = hash(m).
  • Second pre-image resistance
    Given an input m1 it should be difficult to find another input m2 such that m1 ≠ m2 and hash(m1) = hash(m2).
  • Collision resistance
    It should be difficult to find two different messages m1 and m2 such that hash(m1) = hash(m2). Such a pair is called a cryptographic hash collision

The two images above clearly demonstrate that MD5 lacks the final property. MD5 is broken as a cryptographic hash function.

To search though all possible MD5 values is 2128 operations which is massive. To be in with a good chance of finding a collision would take ~  264 operations which is again far too big for normal computing.

How did I create the images?

This type of collision is has been termed a chosen prefix collision. In this case the image data is the prefix or to be more exact the internal state of the MD5 algorithm after processing the image is. You can't see the added binary data at the end of jpeg images as it is preceded with an End Of Image JPEG marker.

Chosen prefix collisions for MD5 were first successfully shown in 2007 in this paper attack uses iterations of differential analysis of MD5. The first successful differential analysis was demonstrated by Xiaoyun Wang in her 2005 paper How to Break MD5 and Other Hash Functions.

In a previous post I used an improved version of this differential path to create two PHP files with the same MD5 hash. Where those files differed by only a few bits and the data before the collision had to be identical. In a chosen prefix collision the data preceding the specially crafted collision blocks can be completely different.

The chosen prefix collision attack works by repeatedly adding 'near collision' blocks which gradually work to  eliminate the differences in the internal MD5 state until they are the same. Before this can be done the files must be of equal length and the bit differences must be of a particular form. This requires a brute force 'birthday' attack which tries random values until two are found that work, it does however have a much lower complexity than a complete brute force attack.

If the attack sounds complicated to do in practice fortunately Marc Stevens has created framework for automated finding of differential paths and using them to create chosen pre-fix collisions. It is available at . It is intended mainly as a research tool but there is a GUI and pre-built binaries for windows available. I chose to run it on linux, however, using a bash script which can automate the repetitive steps needed. 

Here are the MD5 states following each successive block (these are unpadded versions of MD5 algorithm). 

The first line is the state after processing the original images where the two hashes are unrelated. The second line shows the state after padding to equal length and the addition of the 'birthday' bits. As you can see the first four bytes of the hash are the same. Each of the next nine lines shows the hashes gradually converging until they are there are no differences. The last hash is different from the one calculated on the whole image as that one includes padding.

The image below shows the bit differences between the above hashes a '.' indicates they are the same a '1' indicates a difference.

I ran HashClash on an AWS GPU instance. I cannot say with any certainty that this is the most efficient or cheapest option but it seemed to work reasonably quickly. In particular the 'birthdaying' step took much less time than I had expected. It finished in roughly 1hr originally this step had an estimated complexity of 249 compression function calls. In his article Marc Stevens gives an estimate of 2 days for creating a complete collision on a PS3 in 2007. I found that I was able to run the algorithm in about 10 hours on an AWS large GPU instance bringing it is at about $0.65 plus tax.

I faced a few problems with the code as published and had to make some changes to the bash script and some of the C++ code related to saving the collisions. I don't know if the windows binaries that are published work any better as this seems to be where effort has most recently been expended. If anyone wants too know the changes I made let me know.

So I guess the message to take away here is that MD5 is well and truly broken. Whilst the two images have not shown a break in the pre-image resistance or second pre-image resistance, I cannot think of a single case where the use of a broken cryptographic hash function is an appropriate choice.

It was a chosen prefix collision attack similar to this that was used to produce a counterfeit ssl certificate used to sign the Flame malware as Microsoft and pass itself off as a Windows update.

What Next?

Well once you have two colliding files you can easily create a third. You just take the output of the first files and a third with any data you want file and use it as the starting point for another chosen prefix collision. I may do this and would welcome image suggestions.

Marc Stevens has also published a tool for detecting files which have been processed in this way. Running this tool over either of the files can detect that collision blocks have been added.


I successfully created a third collision.


  1. Are image md5 hashes actually used for encryption? I had thought their utility was for de-duplication and the function seems reasonably powerful for that purpose. If one had a library of 500,000 images or even a few million, the hash key should reveal if a new image picked at random form some external source was already in the image store.

    I am currently planning to upgrade a fairly large image store using this technique to reduce the redundancy in the images which have been collected for the past 15 years and may have quite a few duplicates in them.

  2. This comment has been removed by a blog administrator.

  3. India eData Solutions provides Ecommerce Image Processing Services, Ecommerce Products Updating Services, 3Ecommerce Image Processing Services for E-commerce Industry at lowest rates in all over world. we have expert team for Outsource 3dCart product listing Project.

  4. Get the Best Bulk Product Upload Services From Us With the drastic facelift that the retail sector has seen in the recent past, the internet gas become the hub for most of the small, medium as well as large scale businesses that are involved in the same.The ecommerce has seen a monumental growth prospective over the last couple of years and the trend is expected to remain the same for the foreseeable future as well. In this regard, what concerns the businesses and the online retail outlet is the data entry aspect. Tech data entry India provides a number of products and services to cater to the Amazon Bulk Product Listing Services.

  5. High profile elite escorts in Chennai available for escort services in Chennai.

  6. Chennai Escorts is the leading escorts service provider in all over Chennai. We are the best in the service of Chennai Escort Service in Independent Chennai Call Girls. if you to more about us. visit here
    Chennai Escorts Agency

  7. The Best Escort woman independent Chennai Escorts Service. It provides best services from others. If you went contact us visit here.
    Chennai Escorts

  8. this problem is related to the graphics device driver you decides to use (pdf to png, etc.).

  9. would it be possible to make this type of thing automatic if so what would i attempt first