Is it better to use the collapse clause





I am never sure which possibility I should choose to parallelize nested for loops.

For example, I have the following code snippets:

#pragma omp parallel for schedule(static)
for(int b=0; b<bSize; b++)
for(int n=0; n<N; n++) o[n + b*N] = in[n];

#pragma omp parallel for collapse(2) schedule(static)
for(int b=0; b<bSize; b++)
for(int n=0; n<N; n++) o[n + b*N] = in[n];

In the first snippet I use a plain parallel for (with schedule(static) because of the first-touch policy). In some codes I have seen the collapse clause used for almost every nested loop; in other codes it is never used, and nested loops are instead parallelized with a simple parallel for on the outer loop. Is this more a matter of habit, or is there a real difference between the two versions? Is there a reason some people never use collapse(n)?










  • As the ideal situation is to achieve SIMD optimization in the inner loop, it's usually better to avoid including the inner loop in a collapse. Certain compilers even look for opportunities to include more than the innermost loop in SIMD optimization. A parallel collapse can be expected to be useful when dividing the outer loop count by the number of threads leaves a remainder, which would produce work imbalance. I'm sure various aspects of this were discussed previously here.

    – tim18
    Jan 3 at 12:21


















c++11 openmp






edited Jan 3 at 12:42 by Z boson
asked Jan 3 at 10:53 by Suslik








1 Answer

















As with everything in HPC, the answer is "It depends..."

Here it will depend on

  1. how big your machine is, and how big bSize and N are;
  2. what the content of the inner loop is.

For static scheduling of iterations that all run in the same amount of time, unless you can guarantee that the number of iterations being work-shared divides evenly by the number of threads, you need the number of available iterations to be ~10x the number of threads to guarantee 90% efficiency, because of the potential imbalance. Therefore on a 16-core machine you want >160 iterations. If bSize is small, then using collapse to generate more available parallelism will help performance. (In the worst case, imagine that bSize is smaller than the number of threads!)



On the other hand, as @tim18 points out, if you can vectorize the inner loop while still maintaining enough parallelism, that may be the better choice.



On the third hand, there is nothing to stop you from doing both:

#pragma omp parallel for simd collapse(2)
for(int b=0; b<bSize; b++)
for(int n=0; n<N; n++) o[n + b*N] = in[n];


If your inner loop really is this small (and vectorizable) then you certainly want to vectorize it, since, unlike parallelism, vectorization can reduce the total CPU time you use, rather than just moving it between cores.






  • I don't understand why you would be overly concerned about OpenMP prohibiting vectorization. There doesn't seem to be an inherent issue with that. A quick check confirms that OpenMP loops can be vectorized without any other help. Of course, always measure & confirm still applies but preemptively trying to help the compiler here instead of giving the compiler the best possible information seems like a bad approach, especially for a beginner.

    – Zulan
    Jan 4 at 16:42












answered Jan 4 at 10:27 by Jim Cownie












