Is it better to use the collapse clause?
I am never sure which approach to choose when parallelizing nested for loops. For example, I have the following two snippets:

#pragma omp parallel for schedule(static)
for(int b=0; b<bSize; b++)
    for(int n=0; n<N; n++) o[n + b*N] = a[n];

#pragma omp parallel for collapse(2) schedule(static)
for(int b=0; b<bSize; b++)
    for(int n=0; n<N; n++) o[n + b*N] = a[n];

In the first snippet I use a plain parallel for (with schedule(static) because of the first-touch policy). In some codes I see the collapse clause used for almost every nested loop; in others it is never used, and nested loops are parallelized with a simple parallel for on the outer loop. Is this more a habit, or is there a difference between the two versions? Is there a reason some people never use collapse(n)?
c++11 openmp
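To make the first-touch point in the question concrete: on a NUMA machine, a memory page is placed on the node of the thread that first writes to it, so initializing and computing with the same schedule(static) keeps each thread working on node-local memory. A minimal self-contained sketch, with hypothetical sizes standing in for bSize and N:

#include <cstdio>

int main() {
    const int bSize = 1024;   // hypothetical stand-in for the question's bSize
    const int N     = 4096;   // hypothetical stand-in for the question's N

    // Plain new[] leaves the doubles uninitialized, so the bulk of the
    // allocation is not touched yet and NUMA placement is still undecided.
    double* o = new double[bSize * N];
    double* a = new double[N];
    for (int n = 0; n < N; n++) a[n] = n;

    // First touch: zero o with the same schedule(static) the compute loop
    // will use, so each page lands on the node of the thread that will
    // later work on it.
    #pragma omp parallel for schedule(static)
    for (int b = 0; b < bSize; b++)
        for (int n = 0; n < N; n++) o[n + b * N] = 0.0;

    // Compute loop: identical schedule, so the thread-to-data mapping matches.
    #pragma omp parallel for schedule(static)
    for (int b = 0; b < bSize; b++)
        for (int n = 0; n < N; n++) o[n + b * N] = a[n];

    std::printf("o[0] = %f\n", o[0]);
    delete[] o;
    delete[] a;
    return 0;
}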
asked Jan 3 at 10:53 by Suslik, edited Jan 3 at 12:42 by Z boson
As the ideal situation is to achieve SIMD optimization in the inner loop, it's usually better to avoid including the inner loop in a collapse. Some compilers even look for opportunities to include more than the innermost loop in SIMD optimization. Parallel collapse can be expected to help when the outer loop count leaves a remainder when divided by the number of threads, which would otherwise produce work imbalance. I'm sure various aspects of this were discussed previously here.
– tim18, Jan 3 at 12:21
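A minimal sketch of what this comment recommends, assuming the same o, a, bSize, and N as in the question: keep the thread parallelism on the outer loop and ask for SIMD on the inner loop, instead of folding the inner loop into a collapse.

#pragma omp parallel for schedule(static)
for (int b = 0; b < bSize; b++) {
    // The inner loop stays intact, so the compiler can vectorize it.
    #pragma omp simd
    for (int n = 0; n < N; n++)
        o[n + b * N] = a[n];
}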
1 Answer
As with everything in HPC, the answer is "It depends...". Here it will depend on:
- how big your machine is, and how big bSize and N are;
- what the body of the inner loop is.
For static scheduling of iterations that all run in the same amount of time, unless you can guarantee that the number of work-shared iterations divides evenly by the number of threads, you need the number of available iterations to be roughly 10x the number of threads to guarantee 90% efficiency, because of the potential imbalance. Therefore, on a 16-core machine you want more than 160 iterations. If bSize is small, then using collapse to generate more available parallelism will help performance. (In the worst case, imagine that bSize is smaller than the number of threads!)
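The arithmetic behind that rule of thumb is easy to check: with schedule(static), the busiest thread gets ceil(iterations/threads) iterations, so parallel efficiency is roughly iterations / (threads * ceil(iterations/threads)). A small self-contained sketch of the calculation (the iteration counts are illustrative):

#include <cstdio>

int main() {
    const int threads = 16;
    const int cases[] = {16, 17, 160, 161};   // illustrative iteration counts
    for (int iters : cases) {
        // With schedule(static), the busiest thread executes
        // ceil(iters / threads) iterations.
        int max_per_thread = (iters + threads - 1) / threads;
        double efficiency = static_cast<double>(iters) / (threads * max_per_thread);
        std::printf("%4d iterations on %d threads -> efficiency %.0f%%\n",
                    iters, threads, 100.0 * efficiency);
    }
    return 0;
}

This reproduces the numbers above: 17 iterations on 16 threads run at about 53% efficiency, while 161 iterations reach about 91%.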
On the other hand, as @tim18 points out, if you can vectorize the inner loop while still maintaining enough parallelism, that may be the better thing to do.
On the third hand, there is nothing to stop you from doing both:
#pragma omp parallel for simd collapse(2)
for(int b=0; b<bSize; b++)
    for(int n=0; n<N; n++) o[n + b*N] = a[n];
If your inner loop really is this small (and vectorizable), then you certainly want to vectorize it, since, unlike parallelism, vectorization can reduce the total CPU time you use rather than just moving it between cores.
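One note on the combined construct: the simd construct was introduced in OpenMP 4.0, so this form needs a reasonably recent compiler. Put together with the loop from the question, a self-contained version might look like the following sketch (the sizes and the a array are illustrative stand-ins):

#include <cstdio>

int main() {
    const int bSize = 64;     // illustrative; small enough that collapse pays off
    const int N     = 1000;   // illustrative
    double* o = new double[bSize * N];
    double* a = new double[N];
    for (int n = 0; n < N; n++) a[n] = n;

    // Collapse the two loops into one iteration space of bSize*N, share it
    // statically across the threads, and vectorize each thread's chunk.
    #pragma omp parallel for simd collapse(2) schedule(static)
    for (int b = 0; b < bSize; b++)
        for (int n = 0; n < N; n++) o[n + b * N] = a[n];

    std::printf("o[last] = %f\n", o[bSize * N - 1]);
    delete[] o;
    delete[] a;
    return 0;
}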
answered Jan 4 at 10:27 by Jim Cownie
I don't understand why you would be overly concerned about OpenMP prohibiting vectorization; there doesn't seem to be an inherent issue with that. A quick check confirms that OpenMP loops can be vectorized without any other help. Of course, "always measure and confirm" still applies, but preemptively trying to help the compiler instead of giving it the best possible information seems like a bad approach, especially for a beginner.
– Zulan, Jan 4 at 16:42