Is it better to use the collapse clause





I am never sure which possibility I should choose to parallelize nested for loops.

For example, I have the following code snippets:

#pragma omp parallel for schedule(static)
for(int b=0; b<bSize; b++)
for(int n=0; n<N; n++) o[n + b*N] = in[n];

#pragma omp parallel for collapse(2) schedule(static)
for(int b=0; b<bSize; b++)
for(int n=0; n<N; n++) o[n + b*N] = in[n];

In the first snippet I use a plain parallel for (with schedule(static) because of the first-touch policy). In some codes I have seen the collapse clause used for almost every nested loop; in other codes it is never used, and nested loops are instead parallelized with a simple parallel for on the outer loop. Is this more a matter of habit, or is there a real difference between the two versions? Is there a reason some people never use collapse(n)?










  • As the ideal situation is to achieve SIMD optimization in the inner loop, it's usually better to avoid including the inner loop in a collapse. Certain compilers even look for opportunities to include more than the innermost loop in SIMD optimization. A parallel collapse can be expected to be useful when dividing the outer loop count by the number of threads leaves a remainder, which would produce work imbalance. I'm sure various aspects of this were discussed previously here.

    – tim18
    Jan 3 at 12:21


















c++11 openmp






edited Jan 3 at 12:42 by Z boson
asked Jan 3 at 10:53 by Suslik








1 Answer

















As with everything in HPC, the answer is "It depends..."

Here it will depend on

  1. how big your machine is, and how big bSize and N are;
  2. what the content of the inner loop is.

For static scheduling of iterations that all run in the same amount of time, unless you can guarantee that the number of iterations being work-shared divides evenly by the number of threads, you need the number of available iterations to be ~10x the number of threads to guarantee 90% efficiency, because of the potential imbalance. Therefore on a 16-core machine you want >160 iterations. If bSize is small, then using collapse to generate more available parallelism will help performance. (In the worst case, imagine that bSize is smaller than the number of threads!)



On the other hand, as @tim18 points out, if you can vectorize the inner loop while still maintaining enough parallelism, that may be the better choice.



On the third hand, there is nothing to stop you from doing both:

#pragma omp parallel for simd collapse(2)
for(int b=0; b<bSize; b++)
for(int n=0; n<N; n++) o[n + b*N] = in[n];


If your inner loop really is this small (and vectorizable) then you certainly want to vectorize it, since, unlike parallelism, vectorization can reduce the total CPU time you use, rather than just moving it between cores.






  • I don't understand why you would be overly concerned about OpenMP prohibiting vectorization. There doesn't seem to be an inherent issue with that. A quick check confirms that OpenMP loops can be vectorized without any other help. Of course, always measure & confirm still applies but preemptively trying to help the compiler here instead of giving the compiler the best possible information seems like a bad approach, especially for a beginner.

    – Zulan
    Jan 4 at 16:42












answered Jan 4 at 10:27 by Jim Cownie












