Beware when comparing Niftis // My digital scribbles

This post is mostly a warning to my future self: when comparing niftis, be aware that nii.gz files can differ even if the data they compress does not. That is because gzip stores the input filename and a timestamp in the .gz file, alongside the compressed data.
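A minimal sketch of the effect in Python, using only the standard library. The payload, the filenames and the timestamps are made up for illustration. The byte offsets assume the basic gzip header layout from RFC 1952 without an optional extra field, which is what Python's gzip module writes.

```python
import gzip
import io


def gzip_bytes(payload: bytes, filename: str, mtime: int) -> bytes:
    """Compress payload in memory, recording filename and mtime in the gzip header."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=filename, mode="wb", fileobj=buf, mtime=mtime) as gz:
        gz.write(payload)
    return buf.getvalue()


# Stand-in for an uncompressed nifti file (hypothetical content).
payload = b"pretend this is an uncompressed nifti file" * 100

# Same payload, different (hypothetical) input filename and timestamp.
a = gzip_bytes(payload, "sub-01_bold.nii", 1_600_000_000)
b = gzip_bytes(payload, "converted.nii", 1_700_000_000)

print(a == b)                                    # False: the archives differ
print(gzip.decompress(a) == gzip.decompress(b))  # True: the content does not

# Bytes 4-7 hold the timestamp (little endian); the stored filename
# starts at byte 10 and is terminated by a zero byte.
stored_mtime = int.from_bytes(a[4:8], "little")
stored_name = a[10:a.index(b"\x00", 10)]
print(stored_mtime, stored_name)
```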

I stumbled upon this while trying to make sure that different tools are creating the same niftis out of the raw DICOMs. Actually, it was even simpler. I was comparing whether the manual conversion with dcm2niix and using datalad-hirni - which internally uses heudiconv as a wrapper around dcm2niix - gave the same output.

Hirni does a lot of additional work to ensure that only anonymized data end up in the final BIDS data repository and its git history. However, in the current case I want to import fMRI data from mice, where such concerns are irrelevant. The additional time needed during the import via hirni didn’t seem to be worth it. So, as long as the final files are the same, I could rename & move them into the correct BIDS structure “by hand”.

I simply wanted to compare the nii.gz files. The simplest comparison I could think of was to calculate the md5sum. The checksums differed. So was this difference due to different data or to a different header portion of the niftis?

This is where my little Alice in Wonderland trip down the rabbit hole started. I first looked at the header information - this is where many DICOM-to-Nifti conversion tools might somewhat disagree. DICOM files can apparently contain more metadata than Niftis. In addition, scanner manufacturers apparently also disagree about which exact field of the DICOM header should hold which information (this is a common warning on many conversion tools' homepages).

But no tool I tried suggested that my headers were different. And none suggested that the data array was different either!

Given that the Nifti header spec I found suggested that some portions of the header are unused (but present for compatibility with a previous file format, Analyze 7.5), I still thought that some of these chunks might differ and simply be ignored by the tools I used to check the headers. This would be, after all, a reasonable approach, given that those bytes are strictly speaking unused in the nifti file format definition. Maybe one of the conversion tools used those portions to store additional info. Also, a binary comparison of the nii.gz files suggested that it was the very first few bytes that differed.
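Such a binary comparison can be sketched in a few lines of Python. This is not the tool I used back then, just one way to find out where two files start to differ. The function name and the filenames in the usage line are made up.

```python
from typing import Optional


def first_difference(path_a: str, path_b: str, chunk_size: int = 1 << 16) -> Optional[int]:
    """Return the offset of the first differing byte, or None if the files are identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                # Locate the exact byte inside the differing chunk.
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        return offset + i
                # No differing byte found: one file is a prefix of the other.
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:
                # Both files ended at the same point without any difference.
                return None
            offset += len(chunk_a)


# Usage with hypothetical filenames:
# print(first_difference("manual/bold.nii.gz", "hirni/bold.nii.gz"))
```

A small offset for two nii.gz files means the difference sits at the very start of the compressed file, not necessarily in the nifti header stored inside it.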

I even went as far as altering the niftiinfo function and the underlying niftiImage class in MATLAB to include those chunks of the header in the parts shown to the user. During that process, I felt somewhat vindicated. My hunch had been confirmed: the code simply ignored those chunks of the header and continued reading the file after skipping those bytes. But even after I added them, those bytes were still the same. How could that be?

Well, because all the tools handled the nii.gz files by first unzipping them and then reporting the content (metadata & data). Simply because I wanted an easier time typing the filenames, I copied the files into a dedicated folder - I had selected the most important nifti file of that project for my trip through Alice’s Wonderland. I also thought of unzipping them for convenience. And for some reason, I decided to make sure that there was still a problem and calculated the checksums of the unzipped files. To my surprise, they were the same!

Did I copy the wrong files? After restoring the original nii.gz files from the git history - I had committed the manually created files too, in addition to what hirni would put into the annex anyway - I recopied and unzipped the files once more, with the same result. But calculating the checksums of the gzipped files still gave different results - even though the file size was the same!

A short Google search later, I had learnt that gzip adds a timestamp and the input filename to the compressed file, in addition to the content itself. This can be turned off (gzip's -n option), but it’s on by default… Well, I have learnt something new: do not compare gz files unless you really want the files themselves to be identical. If you only care about the content being identical, make sure to compare the content, not the compressed archive.
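For future me, a small sketch of how to compare the content instead of the archive, in Python with only the standard library. The function names are my own, and the filenames in the usage lines are hypothetical. It uses md5 only because that is what I used above; any other hashlib algorithm would work the same way.

```python
import gzip
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """md5 of the file exactly as stored on disk (gzip header included)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def content_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """md5 of the decompressed content of a .gz file, read in chunks."""
    digest = hashlib.md5()
    with gzip.open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage with hypothetical filenames:
# print(file_md5("manual/bold.nii.gz") == file_md5("hirni/bold.nii.gz"))
# print(content_md5("manual/bold.nii.gz") == content_md5("hirni/bold.nii.gz"))
```

If the second comparison is true while the first is false, the two conversions produced the same nifti and only the gzip wrapper differs.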