Hashing 1000 Image Files as Quickly as Possible (2000x2000+ Resolution) in Python
I have a folder with several thousand RGB 8-bit-per-channel image files on my computer that are anywhere between 2000x2000 and 8000x8000 in resolution (so most of them are extremely large).
I would like to store some small value, such as a hash, for each image so that I have a value to easily compare to in the future to see if any image files have changed. There are three primary requirements in the calculation of this value:
- The calculation of this value needs to be fast
- The result needs to be different if ANY part of the image file changes, even in the slightest amount, even if just one pixel changes. (The hash should not take filename into account).
- Collisions should basically never happen.
There are a lot of ways I could go about this (SHA-1, MD5, etc.), but the real goal here is speed: any extremely quick way to identify whether ANY change at all has been made to an image.
How would you achieve this in Python? Is there a particular hash algorithm you recommend for speed? Or can you devise a different way to achieve my three goals altogether?
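For reference, a minimal sketch of the plain hashlib approach mentioned above, reading each file in chunks so memory stays flat; the folder path and glob pattern are placeholders:

```python
import hashlib
from pathlib import Path

def file_digest(path, algo="sha1", chunk=1 << 20):
    """Hash a file's raw bytes in 1 MiB chunks (content only, never the name)."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Hypothetical folder and extension:
digests = {p.name: file_digest(p) for p in Path("images").glob("*.png")}
```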
python database python-3.x image hash
asked Jan 2 at 1:44
jippyjoe4
I'm curious, given your requirements, how you expect to be able to use anything other than a normal hash algorithm. This really has nothing to do with images.
– Jonathon Reinhart
Jan 2 at 1:46
What file system are you on? Most OSes have a file system journal API that will alert you if a file changes. And on speed: you might actually block on reading the disk
– Kat
Jan 2 at 1:56
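A sketch of that idea using the third-party watchdog package (which wraps the per-OS change-notification APIs: inotify, FSEvents, ReadDirectoryChangesW); the watched path is a placeholder:

```python
# pip install watchdog
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class ImageChangeHandler(FileSystemEventHandler):
    def on_modified(self, event):
        if not event.is_directory:
            print("changed:", event.src_path)

observer = Observer()
observer.schedule(ImageChangeHandler(), path="images", recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```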
What did you try out? Why did it not suit your needs? Do you know about hash collisions?
– Patrick Artner
Jan 2 at 2:02
Benchmark with sha256 and then again with the "identity" hash function which just reads the entire file but returns 0 as a hash result. I think you'll find that I/O is always the bottleneck.
– James K Polk
Jan 2 at 3:00
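A rough version of the benchmark this comment describes; the file name is a placeholder, and note that the OS page cache makes a second read of the same file artificially fast, so drop the cache or alternate the order between runs:

```python
import hashlib
import time

def timed_read(path, hasher=None, chunk=1 << 20):
    """Read the whole file in chunks; optionally feed a hash object."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            if hasher is not None:
                hasher.update(block)
    return time.perf_counter() - start

t_sha = timed_read("big.png", hashlib.sha256())
t_raw = timed_read("big.png")  # "identity hash": read everything, hash nothing
print(f"sha256: {t_sha:.3f}s  read-only: {t_raw:.3f}s")
```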
Just look at the file's modification date - that will tell you when it was last changed.
– Mark Setchell
Jan 2 at 12:42
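A sketch of that modification-date check, comparing against a snapshot taken earlier; the folder name is a placeholder, and note that mtime can be reset or forged, unlike a content hash:

```python
from pathlib import Path

def snapshot(folder):
    """Map file name -> (size, mtime_ns). stat() only; no contents are read."""
    return {p.name: (p.stat().st_size, p.stat().st_mtime_ns)
            for p in Path(folder).iterdir() if p.is_file()}

before = snapshot("images")
# ... some time later ...
after = snapshot("images")
changed = [name for name, meta in after.items() if before.get(name) != meta]
```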
1 Answer
- The calculation of this value needs to be fast
- The result needs to be different if ANY part of the image file changes, even in the slightest amount, even if just one pixel changes. (The hash should not take filename into account).
- Collisions should basically never happen.
- Hashing large files takes time (how much depends on the algorithm), so if it needs to be fast, choose an efficient hashing algorithm for your task; published benchmarks compare the common ones. But before computing a hash at all, you can optimize your algorithm by checking cheaper properties first.
- If you use hashing at all, this requirement is met automatically: the hash value changes even if only a small part of the image has changed.
- Collisions can happen (very rarely, but never "never"). That is the nature of hash algorithms.
An example of the first point (optimizing the algorithm); a sketch of this staged check follows the list:
- Check the file size.
- If the sizes are equal, check the CRC.
- If the CRCs are equal, calculate and compare the hash. (The last two steps each require a full pass over the file.)
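A minimal sketch of that staged comparison, assuming a stored (size, CRC-32, SHA-256) record per file; zlib.crc32 stands in for the CRC step:

```python
import hashlib
import os
import zlib

CHUNK = 1 << 20  # 1 MiB reads

def crc32_of(path):
    crc = 0
    with open(path, "rb") as f:
        while block := f.read(CHUNK):
            crc = zlib.crc32(block, crc)
    return crc

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(CHUNK):
            h.update(block)
    return h.hexdigest()

def changed(path, stored):
    """stored = (size, crc32, sha256_hex) recorded for this file earlier."""
    old_size, old_crc, old_sha = stored
    if os.path.getsize(path) != old_size:  # stat only, no content I/O
        return True
    if crc32_of(path) != old_crc:          # first full pass
        return True
    return sha256_of(path) != old_sha      # second full pass (see the comment
                                           # below: I/O may dominate here)
```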
Optionally, before checking full hashes, you can calculate and compare partial hashes computed over only part of each file.
If most of your files are likely to differ, checking the cheaper properties before hashing will probably be faster. But if most of your files are identical, the earlier steps just add time, because you will end up computing the full hash for most files anyway. So pick the ordering that is most efficient for your context.
edited Jan 2 at 3:15
answered Jan 2 at 2:26
user3790180
"If CRCs are equal, then calculate and check hash." That will likely require two passes through the file, and I/O is likely to be a bottleneck even with a cryptographic hash.
– James K Polk
Jan 2 at 2:58
Yes, you're right. Partial checking may be preferable before a pass over the whole file: for example, comparing the first X bytes, the last X bytes, and Y randomly chosen bytes.
– user3790180
Jan 2 at 3:07
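A sketch of that partial check, with the random offsets seeded deterministically so the same file produces the same fingerprint on every run; all block sizes and counts here are arbitrary choices:

```python
import hashlib
import os
import random

def partial_fingerprint(path, edge=1 << 16, samples=16, sample_size=4096):
    """Hash the first/last `edge` bytes plus `samples` seeded-random blocks.
    Much cheaper than a full read, but it can miss changes outside the
    sampled regions, so it only works as a pre-filter before a full hash."""
    size = os.path.getsize(path)
    h = hashlib.sha256(str(size).encode())
    rng = random.Random(size)  # deterministic seed per file size
    with open(path, "rb") as f:
        h.update(f.read(edge))              # first X bytes
        if size > edge:
            f.seek(max(size - edge, 0))
            h.update(f.read(edge))          # last X bytes
        for _ in range(samples):            # Y sampled blocks
            f.seek(rng.randrange(max(size, 1)))
            h.update(f.read(sample_size))
    return h.hexdigest()
```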
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54000349%2fhashing-1000-image-files-quick-as-possible-2000x2000-plus-resolution-python%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown