Deep Dive into Hash Values & Hashing in Keeping Data Integrity

Role of Hash Values in Digital Forensics and Legal Evidence
Written By Aswin Vijayan    
Approved By Anuraag Singh  
Published On Feb 13th, 2024

The use of digital media as evidence in a court of law is not new. The real challenge has always been proving that a digital document is genuine and not a forgery. Because an electronic file is far easier to alter than a paper one, this became a serious issue. Hash values were adopted to address it. These short codes give courts a way to check the authenticity of a document.

If you want an in-depth understanding, we suggest that you read through this write-up. There is nothing better than starting with the basics, so that is where we begin.

What are Hash Values?

Technically speaking, a hash value is a fixed-length string of bits generated by a hashing algorithm. It is usually displayed as a hexadecimal string (digits 0-9 and letters a-f, where letter case does not matter), although other encodings such as Base64 use case-sensitive letters and a few symbols. The length of the hash value depends on the specific algorithm. Common lengths are 128, 160, 256, and 512 bits, which translate to different character lengths depending on the encoding. More on that later.
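To make the relationship between bit length and character length concrete, here is a minimal sketch using Python's standard `hashlib` and `base64` modules. The helper name `digest_lengths` and the sample input are our own choices for illustration, not part of any forensic tool.

```python
import base64
import hashlib


def digest_lengths(data: bytes) -> dict:
    """Return the bit length and encoded character lengths per algorithm."""
    result = {}
    for name in ("md5", "sha1", "sha256", "sha512"):
        raw = hashlib.new(name, data).digest()  # raw bytes of the hash value
        result[name] = {
            "bits": len(raw) * 8,
            "hex_chars": len(raw.hex()),  # two hex characters per byte
            "base64_chars": len(base64.b64encode(raw)),
        }
    return result


for name, info in digest_lengths(b"digital evidence").items():
    print(name, info)
```

The same 128-bit MD5 value takes 32 characters in hexadecimal but only 24 in Base64, which is why two tools can display the same hash value at different lengths.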

Key properties that make hash values useful:

  • Uniqueness: In practice, every distinct file produces a distinct hash value. Two different files sharing one value is extremely unlikely with a sound algorithm.
  • Irreversibility: Hash values cannot be used to recreate the original data.
  • Quickness: Hash calculation is fast. The time grows only in proportion to the file size, so even large files can be hashed quickly.
  • Verifiability: The value stays the same for an unaltered file, so anyone can recompute and verify it.

In short, hash values are among the best tools in a forensic investigator’s arsenal for fighting digital deception. Now that we have a basic understanding of what a hash value is, let’s take a look at the process that produces one: hashing.

What is Hashing, and What is it Not?

First, let us be clear: hashing is in no way related to the hashtag (#) you see in social media posts.

Hashing is the process of generating a practically unique identifier for digital data. Think of it as taking a fingerprint, or more accurately, the DNA of the digital file in question. It is done by a computer program that implements a hashing algorithm.

A hashing algorithm is a mathematical function that takes an input of any size (usually a file) and gives out a fixed-length output (a string of characters).
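As a rough illustration of a file going in and a string coming out, the sketch below hashes a file with Python's standard `hashlib`. The function name `hash_file`, the default algorithm, and the chunk size are assumptions made for this example. Reading in chunks keeps memory use flat even for very large evidence files.

```python
import hashlib


def hash_file(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal hash value of the file at `path`."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Feed the file to the algorithm one chunk at a time.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Calling `hash_file("evidence.eml")` on a hypothetical file returns a 64-character hexadecimal string, and calling it again on the unchanged file returns the same string.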

Readers should avoid confusing hashing with encryption, so let us distinguish between the two. Encryption’s primary role is to keep data confidential, whether in transit or in storage. Hashing, on the other hand, produces a one-of-a-kind code that identifies the data.

Another difference is that encryption can be undone with the right key. This is because all the data is still inside an encrypted file, just jumbled. In contrast, there is no way to reconstruct the source data from its hash value, as the fixed-length hash does not retain the original data. It is not wrong to say that encryption is two-way and hashing is one-way.

Furthermore, the result of encryption depends on who encrypts the data. Consider an example. If you store the same image in Google Drive, OneDrive, or iCloud, it gets encrypted differently in each. If an external agent tries to break in, they get a different jumbled mess from each one, because every provider encrypts with its own keys and settings. Whereas if you create the hash value of the image with a given algorithm, you get the same result irrespective of the tool you use.

Note: Discrepancies in hashing may arise in exceptional situations.

Now that we know what hashing is, let’s see how it’s done.

Breakdown of Hashing Algorithms Used to Generate Hash Values

Two of the best-known families of hashing algorithms used for checking the authenticity of digital evidence are:

MD: Short for Message Digest. Its most popular version is MD5. Although collisions can now be produced for it and stronger algorithms exist, the three-decade-old MD5 is still widely used for integrity checks.

We can thank Ronald Rivest for providing us with this algorithm.

He designed it as a strengthened successor to MD4, with better collision resistance and a stronger avalanche effect, where a small change in the input changes a large part of the output.

The output hash is always 128 bits, written as 32 hexadecimal characters. MD4 produces a digest of the same length, so the improvement lies in the strengthened internal design rather than in the output size.

SHA: Stands for Secure Hash Algorithm. It also has multiple variations, ranging from SHA1 with a 160-bit output to the aptly named SHA512 with a 512-bit output.

The SHA family is published by the National Institute of Standards and Technology (NIST), with SHA-1 and SHA-2 designed by the NSA. SHA was brought in to standardize the hashing process.

The SHA-2 variants, such as SHA256 and SHA512, are more complex than their MD counterparts and, as a result, have greater resistance to collisions and other vulnerabilities. SHA1, like MD5, has since been broken in practice.

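NIST ran a public competition from 2007 to 2012, and the winning design, Keccak, was standardized in 2015 as SHA-3, the newest member of the SHA family. The main motive was to have an alternative built on a different internal design, in case flaws like those found in MD5 and SHA1 were ever discovered in SHA2.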
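For readers who want to see the two generations side by side, this short sketch hashes the same input with SHA-256 (SHA-2 family) and SHA3-256 (SHA-3 family) using Python's standard `hashlib`. The sample message is arbitrary.

```python
import hashlib

message = b"chain of custody"

sha2 = hashlib.sha256(message).hexdigest()    # SHA-2 family
sha3 = hashlib.sha3_256(message).hexdigest()  # SHA-3 family (Keccak-based)

print("SHA-256 :", sha2)
print("SHA3-256:", sha3)
print("same length:", len(sha2) == len(sha3))
print("same value :", sha2 == sha3)
```

Both outputs are 256 bits long, yet the values differ because the algorithms are unrelated internally. This is why a report should always state which algorithm produced a hash value.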

There has been quite a strong push to adopt newer algorithms, especially in courts of law. Let’s find out why.

Problems that Plague Old Hashing Algorithms

Earlier, we mentioned that hashing may result in discrepancies. These can become points of dispute in e-discovery and digital forensics. The reason is that hashing algorithms are not perfect. Ever since their introduction, researchers have tried to find cracks in them, which brought us the following problems.

Hash Collision: This is when two different files produce the same hash value under a given algorithm. It is the bane of hashing algorithms, as it strips away their core feature of uniqueness. Dr. Marc Stevens is one of the researchers best known for demonstrating this vulnerability in practice.

  • He is the principal author of HashClash, an open-source toolkit for constructing MD5 and SHA-1 collisions.
  • In 2017, researchers from CWI Amsterdam and Google, with Marc Stevens as lead author, demonstrated the first practical SHA-1 collision, named “SHAttered.”
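Reproducing a real MD5 or SHA-1 collision takes specialized tooling, so the sketch below illustrates the idea on a deliberately weakened hash instead. It keeps only the first 2 bytes of a SHA-256 digest, an assumption made purely for demonstration, and searches for two different inputs that share that shortened value. The function names and the `file-N` inputs are hypothetical.

```python
import hashlib
from itertools import count


def truncated_hash(data: bytes, n_bytes: int = 2) -> bytes:
    """A deliberately weak hash: only the first n_bytes of SHA-256."""
    return hashlib.sha256(data).digest()[:n_bytes]


def find_collision(n_bytes: int = 2):
    """Return two different inputs that share the same truncated hash."""
    seen = {}
    for i in count():
        msg = f"file-{i}".encode()
        tag = truncated_hash(msg, n_bytes)
        if tag in seen:
            return seen[tag], msg, tag
        seen[tag] = msg


first, second, tag = find_collision()
print(first, second, tag.hex())
```

With only 65,536 possible values, a collision turns up after a few hundred attempts. Full-length digests make such a brute-force search infeasible, which is why practical attacks on MD5 and SHA-1 exploit weaknesses in the algorithms rather than trying inputs one by one.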

Hash Value Mismatch: This is when the hash value for the same file differs between two calculators. It is a more common problem than a collision. If you see a hash value that doesn’t match, check the following:

  • You are using the same file and hashing algorithm for the calculation.
  • The file is not altered in any way between two instances of hash calculation.

If both conditions hold and you still get different hash outputs, then the explanation lies below.

Most of the time, the hash value you get takes all the data in the file into account, as it should. However, it is also possible that a tool computes the hash value solely from the file’s meta properties.

So, you get two different hash values for the same file and the same algorithm. The problem gets worse when the tool fails to report which type of hash value it produced. To deal with this issue, compare the hash values from three or more independent tools.
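The difference between a content-based and a metadata-based hash can be sketched as follows. Both functions are hypothetical stand-ins for how different tools might behave, and the choice of file name and size as the "meta properties" is our assumption. Only Python's standard library is used.

```python
import hashlib
import os
import shutil
import tempfile


def content_hash(path: str) -> str:
    """Hash the bytes stored in the file (the usual meaning of a file hash)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def metadata_hash(path: str) -> str:
    """Hypothetical tool behavior: hash only the file name and size."""
    meta = f"{os.path.basename(path)}|{os.path.getsize(path)}"
    return hashlib.sha256(meta.encode()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "mail.eml")
    copy = os.path.join(tmp, "mail_copy.eml")
    with open(original, "wb") as f:
        f.write(b"Subject: test\r\n\r\nbody")
    shutil.copyfile(original, copy)  # identical bytes, different name

    same_content = content_hash(original) == content_hash(copy)
    same_metadata = metadata_hash(original) == metadata_hash(copy)

print("content hashes match :", same_content)
print("metadata hashes match:", same_metadata)
```

The copy has identical content, so its content hash matches the original. The metadata hash changes simply because the file was renamed, even though SHA-256 is used in both cases.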

Hashing Digital Evidence Inside a Court of Law

Take a look at this basic overview of the digital evidence lifecycle within a legal framework.

  • During evidence collection, for example after the discovery of an email spoofing network, the SOPs clearly state that all digital evidence must be assigned a hash value using an algorithm like SHA256.
  • A copy of the evidence is made, and the original is put in a tamper-proof location.
  • The chain of custody is maintained via timestamps or other means to record who had access to the evidence and when.
  • All analysis, reporting, sharing, and other potentially intrusive actions (such as decryption) are conducted on the copy, not the original.
  • Upon submission to the court, the judge may ask for a live calculation of the hash value. Only if it matches the original (SHA256) value is the evidence accepted as genuine.
  • Once the case is closed, the evidence is either destroyed or put into the archives for future reference based on the protocol.
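The hashing steps in this lifecycle can be sketched in a few lines of Python. Everything here is illustrative: the file names, the log format, and the helper names `sha256_file`, `record`, and `matches` are assumptions, and a real chain of custody follows the agency's own procedures and tooling.

```python
import hashlib
import hmac
import os
import shutil
import tempfile
from datetime import datetime, timezone


def sha256_file(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def record(log: list, actor: str, action: str, digest: str) -> None:
    """Append a timestamped chain-of-custody entry (hypothetical format)."""
    log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "sha256": digest,
    })


def matches(path: str, recorded: str) -> bool:
    """Recompute the hash and compare it with the recorded value."""
    return hmac.compare_digest(sha256_file(path), recorded)


custody_log = []
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "evidence.eml")
    with open(original, "wb") as f:
        f.write(b"From: spoofed@example.com\r\n\r\nmessage body")
    collected = sha256_file(original)  # hash assigned at collection
    record(custody_log, "collector", "collected", collected)

    working_copy = os.path.join(tmp, "working_copy.eml")
    shutil.copyfile(original, working_copy)  # analysis happens on the copy
    record(custody_log, "analyst", "copied", sha256_file(working_copy))
    copy_ok = matches(working_copy, collected)

    with open(working_copy, "ab") as f:  # simulate tampering
        f.write(b" (edited)")
    tampered_ok = matches(working_copy, collected)

print(copy_ok, tampered_ok)  # prints: True False
```

The untouched copy reproduces the hash recorded at collection, while the edited copy does not, which is the check a live calculation in court relies on.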

Here is how hashing makes evidence handling easier.

Digital evidence often runs into terabytes of data. Sifting through all of it by hand to tell unique items from duplicates is not humanly possible. In such a case, hashing makes the job much quicker by automating most evidence tagging. So in a way, it contributes to faster evidence examination and, thus, faster delivery of justice.
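One way such automated tagging can work is to group items by their hash value, so that identical files are flagged without anyone opening them. The sketch below uses in-memory sample data and hypothetical names (`group_by_hash`, `duplicates`, `mailbox`). A real tool would read the files from disk.

```python
import hashlib
from collections import defaultdict


def group_by_hash(items: dict) -> dict:
    """Map each SHA-256 value to the names of all items with that content."""
    groups = defaultdict(list)
    for name, content in items.items():
        groups[hashlib.sha256(content).hexdigest()].append(name)
    return dict(groups)


def duplicates(items: dict) -> list:
    """Return the groups of names whose content is identical."""
    return [sorted(names) for names in group_by_hash(items).values() if len(names) > 1]


mailbox = {
    "inbox/1.eml": b"hello",
    "sent/7.eml": b"hello",
    "inbox/2.eml": b"other",
}
print(duplicates(mailbox))  # [['inbox/1.eml', 'sent/7.eml']]
```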

A hash value stays the same as long as the data is not tampered with. In contrast, even a minor manipulation changes the value drastically. This property, in conjunction with a chain of custody, makes tampering easy to detect. And even if data is tampered with, the custody record helps law enforcement agencies narrow down when it happened and who had access.
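How pronounced the change is can be measured by counting the bits that differ between two hash values. In this minimal sketch, the two sample messages differ by a single character. The helper name `differing_bits` and the messages are our own.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count how many of the 256 SHA-256 output bits differ between a and b."""
    da = int.from_bytes(hashlib.sha256(a).digest(), "big")
    db = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(da ^ db).count("1")


original = b"Transfer $100 to account 12345"
tampered = b"Transfer $900 to account 12345"  # one character changed

print(differing_bits(original, original), "bits differ for identical data")
print(differing_bits(original, tampered), "of 256 bits differ after a one-character edit")
```

For a well-designed algorithm, roughly half of the output bits are expected to flip, so the two hash values look completely unrelated.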

What Should Law Agencies Prefer as Their Hash Calculator?

If the opposing counsel finds a mismatch between the hash values, then the entire case may be in jeopardy. That is why it’s important to choose a reliable hash value generator.

However, in the realm of digital forensics, hashing is just a small part. What if you could get a tool that gives you not only the hash value but also assists in other layers of forensics? That is exactly what MailXaminer is.

The all-in-one tool accepts 80+ digital file types as input and gives investigators complete freedom to select the hashing algorithm: MD5, SHA1, SHA256, etc. Moreover, as the calculation is done on the data level of the file, the resulting hash value reflects the actual content of the file every time.

Conclusion

With this discussion, readers now have a clear understanding of how useful hash values are and how this small code ensures the authenticity of digital documents worldwide. We covered the definition of hash values, how they are created, and how they are used, all in simple words. We also explained that hashing not only simplifies legal proceedings but also safeguards against potential manipulation. Finally, we introduced a tool that helps you get the most insight out of your digital evidence.