Hashing 1000 Image Files as Quickly as Possible (2000x2000+ Resolution) in Python
I have a folder with several thousand RGB 8-bit-per-channel image files on my computer that are anywhere between 2000x2000 and 8000x8000 in resolution (so most of them are extremely large).
I would like to store some small value, such as a hash, for each image so that I have a value to easily compare to in the future to see if any image files have changed. There are three primary requirements in the calculation of this value:
- The calculation of this value needs to be fast
- The result needs to be different if ANY part of the image file changes, even in the slightest amount, even if just one pixel changes. (The hash should not take filename into account).
- Collisions should basically never happen.
There are a lot of ways I could go about this (SHA-1, MD5, etc.), but the real goal here is speed: any extremely quick way to identify whether ANY change at all has been made to an image.
How would you achieve this in Python? Is there a particular hash algorithm you recommend for speed? Or can you devise a different way to achieve my three goals altogether?
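For reference, a minimal sketch of the plain hashlib approach mentioned above, reading each file in chunks so memory stays flat; the folder path and glob pattern are placeholders:

```python
import hashlib
from pathlib import Path

def file_digest(path, algo="sha1", chunk=1 << 20):
    """Hash a file's raw bytes in 1 MiB chunks (content only, never the name)."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Hypothetical folder and extension:
digests = {p.name: file_digest(p) for p in Path("images").glob("*.png")}
```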
python database python-3.x image hash
asked Jan 2 at 1:44
jippyjoe4
I'm curious, given your requirements, how you expect to be able to use anything other than a normal hash algorithm. This really has nothing to do with images.
– Jonathon Reinhart
Jan 2 at 1:46
What file system are you on? Most OSes have a file system journal API that will alert you if a file changes. And on speed: you might actually block on reading the disk
– Kat
Jan 2 at 1:56
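A sketch of that idea using the third-party watchdog package (which wraps the per-OS change-notification APIs: inotify, FSEvents, ReadDirectoryChangesW); the watched path is a placeholder:

```python
# pip install watchdog
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class ImageChangeHandler(FileSystemEventHandler):
    def on_modified(self, event):
        if not event.is_directory:
            print("changed:", event.src_path)

observer = Observer()
observer.schedule(ImageChangeHandler(), path="images", recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```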
What did you try out? Why did it not suit your needs? Do you know about hash collisions?
– Patrick Artner
Jan 2 at 2:02
Benchmark with sha256 and then again with the "identity" hash function which just reads the entire file but returns 0 as a hash result. I think you'll find that I/O is always the bottleneck.
– James K Polk
Jan 2 at 3:00
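A rough version of the benchmark this comment describes; the file name is a placeholder, and note that the OS page cache makes a second read of the same file artificially fast, so drop the cache or alternate the order between runs:

```python
import hashlib
import time

def timed_read(path, hasher=None, chunk=1 << 20):
    """Read the whole file in chunks; optionally feed a hash object."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            if hasher is not None:
                hasher.update(block)
    return time.perf_counter() - start

t_sha = timed_read("big.png", hashlib.sha256())
t_raw = timed_read("big.png")  # "identity hash": read everything, hash nothing
print(f"sha256: {t_sha:.3f}s  read-only: {t_raw:.3f}s")
```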
Just look at the file's modification date - that will tell you when it was last changed.
– Mark Setchell
Jan 2 at 12:42
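A sketch of that modification-date check, comparing against a snapshot taken earlier; the folder name is a placeholder, and note that mtime can be reset or forged, unlike a content hash:

```python
from pathlib import Path

def snapshot(folder):
    """Map file name -> (size, mtime_ns). stat() only; no contents are read."""
    return {p.name: (p.stat().st_size, p.stat().st_mtime_ns)
            for p in Path(folder).iterdir() if p.is_file()}

before = snapshot("images")
# ... some time later ...
after = snapshot("images")
changed = [name for name, meta in after.items() if before.get(name) != meta]
```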
1 Answer
- The calculation of this value needs to be fast
- The result needs to be different if ANY part of the image file changes, even in the slightest amount, even if just one pixel changes. (The hash should not take filename into account).
- Collisions should basically never happen.
- Hashing large files takes time (how much depends on the algorithm), so if it needs to be fast, choose an efficient hashing algorithm for your task; published benchmarks compare the common ones. But before computing a hash at all, you can optimize your algorithm by checking cheaper properties first.
- If you use hashing at all, this requirement is met automatically: the hash value changes even if only a small part of the image has changed.
- Collisions can happen (very rarely, but never "never"). That is the nature of hash algorithms.
An example of the first point (optimizing the algorithm); a sketch of this staged check follows the list:
- Check the file size.
- If the sizes are equal, check the CRC.
- If the CRCs are equal, calculate and compare the hash. (The last two steps each require a full pass over the file.)
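A minimal sketch of that staged comparison, assuming a stored (size, CRC-32, SHA-256) record per file; zlib.crc32 stands in for the CRC step:

```python
import hashlib
import os
import zlib

CHUNK = 1 << 20  # 1 MiB reads

def crc32_of(path):
    crc = 0
    with open(path, "rb") as f:
        while block := f.read(CHUNK):
            crc = zlib.crc32(block, crc)
    return crc

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(CHUNK):
            h.update(block)
    return h.hexdigest()

def changed(path, stored):
    """stored = (size, crc32, sha256_hex) recorded for this file earlier."""
    old_size, old_crc, old_sha = stored
    if os.path.getsize(path) != old_size:  # stat only, no content I/O
        return True
    if crc32_of(path) != old_crc:          # first full pass
        return True
    return sha256_of(path) != old_sha      # second full pass (see the comment
                                           # below: I/O may dominate here)
```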
Optionally, before checking full hashes, you can calculate and compare partial hashes computed over only part of each file.
If most of your files are likely to differ, checking the cheaper properties before hashing will probably be faster. But if most of your files are identical, the earlier steps just add time, because you will end up computing the full hash for most files anyway. So pick the ordering that is most efficient for your context.
edited Jan 2 at 3:15
answered Jan 2 at 2:26
user3790180
"If CRCs are equal, then calculate and check hash." That will likely require two passes through the file, and I/O is likely to be a bottleneck even with a cryptographic hash.
– James K Polk
Jan 2 at 2:58
Yes, you're right. Partial checking may be preferable before a pass over the whole file: for example, comparing the first X bytes, the last X bytes, and Y randomly chosen bytes.
– user3790180
Jan 2 at 3:07
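A sketch of that partial check, with the random offsets seeded deterministically so the same file produces the same fingerprint on every run; all block sizes and counts here are arbitrary choices:

```python
import hashlib
import os
import random

def partial_fingerprint(path, edge=1 << 16, samples=16, sample_size=4096):
    """Hash the first/last `edge` bytes plus `samples` seeded-random blocks.
    Much cheaper than a full read, but it can miss changes outside the
    sampled regions, so it only works as a pre-filter before a full hash."""
    size = os.path.getsize(path)
    h = hashlib.sha256(str(size).encode())
    rng = random.Random(size)  # deterministic seed per file size
    with open(path, "rb") as f:
        h.update(f.read(edge))              # first X bytes
        if size > edge:
            f.seek(max(size - edge, 0))
            h.update(f.read(edge))          # last X bytes
        for _ in range(samples):            # Y sampled blocks
            f.seek(rng.randrange(max(size, 1)))
            h.update(f.read(sample_size))
    return h.hexdigest()
```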
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54000349%2fhashing-1000-image-files-quick-as-possible-2000x2000-plus-resolution-python%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown