Efficient SSE FP floor/ceil/round rounding functions without SSE4.1?

How can I round a __m128 vector of floats up/down or to the nearest integer, like these functions?

Round - roundf()

Ceil - ceilf() or SSE4.1 _mm_ceil_ps.

Floor - floorf() or SSE4.1 _mm_floor_ps.

I need to do this without SSE4.1 roundps (_mm_floor_ps / _mm_ceil_ps / _mm_round_ps(x, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC)). roundps can also truncate toward zero, but I don't need that for this application.

I can use SSE3 and earlier. (No SSSE3 or SSE4)

So the function declaration would be something like:

__m128 RoundSse( __m128 x ), __m128 CeilSse( __m128 x ) and __m128 FloorSse( __m128 x ).

asked Jan 3 at 12:39 by Royi
edited Jan 3 at 13:01 by Peter Cordes

A starting point would be reddit.com/r/programming/comments/1p2yys/…, though it passes its arguments by reference instead of by value. The code there also has issues with VS 2017.

– Royi
Jan 3 at 12:43

Are you sure you actually need round, and that the IEEE default rounding mode wouldn't work for you? rintf or nearbyintf give you that, while round uses a special rounding mode that x86 doesn't have in hardware; see "round() for float in C++". (It can be emulated with a few instructions on top of roundps if you have SSE4.1, so if you can emulate roundps to nearest you can still emulate round(), but it's probably going to be slower.)

– Peter Cordes
Jan 3 at 12:53
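
To make the point about the default rounding mode concrete: SSE2's _mm_cvtps_epi32 converts using the current MXCSR rounding mode, which is round-to-nearest-even unless the program has changed it. Converting to integer and back therefore behaves like rintf for inputs that fit in a 32-bit integer. The following is only a minimal sketch under that assumption; RintSse and lane0 are hypothetical names, and large values, NaN and the sign of zero are not handled.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: rintf-like rounding, valid only while x fits in int32.
   Assumes MXCSR is still in its default round-to-nearest-even mode. */
static inline __m128 RintSse(__m128 x) {
    __m128i i = _mm_cvtps_epi32(x);   /* rounds using the current rounding mode */
    return _mm_cvtepi32_ps(i);        /* convert the integers back to float */
}

/* Hypothetical helper for inspecting results: returns lane 0. */
static inline float lane0(__m128 v) {
    return _mm_cvtss_f32(v);
}
```

With this sketch, 2.5f becomes 2.0f and 3.5f becomes 4.0f (ties go to the even neighbour), whereas roundf would return 3.0f and 4.0f because it rounds ties away from zero.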

Google for "sse_mathfun" - it's a useful library which includes the above functions and many others.

– Paul R
Jan 3 at 13:12

@PaulR, I'd rather have an implementation here so I can understand how it works. It will also be a great reference for others who search for it.

– Royi
Jan 3 at 17:02

See DirectXMath, specifically the implementation of XMVectorRound, XMVectorFloor, and XMVectorCeil in this source file

– Chuck Walbourn
Jan 3 at 17:45

c optimization vectorization sse simd

1 Answer

I'm posting the code from http://dss.stephanierct.com/DevBlog/?p=8:

It is adapted to by-value form (I just removed the & from the code; I'm not sure it is OK):

#include <emmintrin.h> // SSE2 intrinsics

static inline __m128 FloorSse(const __m128 x) {
    __m128i v0 = _mm_setzero_si128();
    __m128i v1 = _mm_cmpeq_epi32(v0, v0);      // all-ones bit pattern
    __m128i ji = _mm_srli_epi32(v1, 25);       // 0x7F in each lane
    __m128i tmp = _mm_slli_epi32(ji, 23);      // 0x3F800000 (I added tmp, not sure about it)
    __m128 j = _mm_castsi128_ps(tmp);          // create vector 1.0f (I edited this, not sure about it)
    __m128i i = _mm_cvttps_epi32(x);           // truncate toward zero
    __m128 fi = _mm_cvtepi32_ps(i);
    __m128 igx = _mm_cmpgt_ps(fi, x);          // lanes where truncation went above x
    j = _mm_and_ps(igx, j);                    // 1.0f in those lanes, 0.0f elsewhere
    return _mm_sub_ps(fi, j);
}

static inline __m128 CeilSse(const __m128 x) {
    __m128i v0 = _mm_setzero_si128();
    __m128i v1 = _mm_cmpeq_epi32(v0, v0);      // all-ones bit pattern
    __m128i ji = _mm_srli_epi32(v1, 25);       // 0x7F in each lane
    __m128i tmp = _mm_slli_epi32(ji, 23);      // 0x3F800000 (I added tmp, not sure about it)
    __m128 j = _mm_castsi128_ps(tmp);          // create vector 1.0f (I edited this, not sure about it)
    __m128i i = _mm_cvttps_epi32(x);           // truncate toward zero
    __m128 fi = _mm_cvtepi32_ps(i);
    __m128 igx = _mm_cmplt_ps(fi, x);          // lanes where truncation went below x
    j = _mm_and_ps(igx, j);                    // 1.0f in those lanes, 0.0f elsewhere
    return _mm_add_ps(fi, j);
}

static inline __m128 RoundSse(const __m128 a) {
    __m128 v0 = _mm_setzero_ps();              // generate the highest value < 2
    __m128 v1 = _mm_cmpeq_ps(v0, v0);          // all-ones bit pattern
    __m128i tmp = _mm_castps_si128(v1);        // I added tmp, not sure about it
    tmp = _mm_srli_epi32(tmp, 2);              // 0x3FFFFFFF
    __m128 vNearest2 = _mm_castsi128_ps(tmp);  // I edited this, not sure about it
    __m128i i = _mm_cvttps_epi32(a);
    __m128 aTrunc = _mm_cvtepi32_ps(i);        // truncate a
    __m128 rmd = _mm_sub_ps(a, aTrunc);        // get remainder
    __m128 rmd2 = _mm_mul_ps(rmd, vNearest2);  // mul remainder by near 2 will yield the needed offset
    __m128i rmd2i = _mm_cvttps_epi32(rmd2);    // after being truncated of course
    __m128 rmd2Trunc = _mm_cvtepi32_ps(rmd2i);
    __m128 r = _mm_add_ps(aTrunc, rmd2Trunc);
    return r;
}

static inline __m128 ModSee(const __m128 a, const __m128 aDiv) {
    __m128 c = _mm_div_ps(a, aDiv);
    __m128i i = _mm_cvttps_epi32(c);           // truncate the quotient toward zero
    __m128 cTrunc = _mm_cvtepi32_ps(i);
    __m128 base = _mm_mul_ps(cTrunc, aDiv);
    __m128 r = _mm_sub_ps(a, base);            // remainder has the sign of a
    return r;
}

edited Jan 3 at 19:55

community wiki

3 revs
Royi

You left out the extra info from the blog that these aren't totally safe outside the range where converting to an integer and back works. The blog has safe wrappers for those. According to the reddit discussion (reddit.com/r/programming/comments/1p2yys/…) this updated version of the code does actually work correctly for the whole range of inputs that Bruce Dawson tested.

– Peter Cordes
Jan 3 at 18:22
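
As an illustration of the range problem, here is one possible guard. It is not the blog's wrapper, only a sketch built on the fact that a float with magnitude 2^23 or more has no fractional bits, so such lanes (and NaN lanes) can be passed through unchanged. FloorSseFast and FloorSseSafe are hypothetical names, and the sign of zero is not preserved for inputs such as -0.0f.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Truncate-and-fix-up floor; only meaningful when x fits in int32. */
static inline __m128 FloorSseFast(__m128 x) {
    const __m128 one = _mm_set1_ps(1.0f);
    __m128 fi  = _mm_cvtepi32_ps(_mm_cvttps_epi32(x)); /* truncate toward zero */
    __m128 igx = _mm_cmpgt_ps(fi, x);                  /* truncation went above x */
    return _mm_sub_ps(fi, _mm_and_ps(igx, one));
}

/* Hypothetical guarded version: lanes with |x| >= 2^23 (already integral)
   and NaN lanes are returned unchanged. */
static inline __m128 FloorSseSafe(__m128 x) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);       /* only the sign bit set */
    const __m128 limit     = _mm_set1_ps(8388608.0f);  /* 2^23 */
    __m128 absx  = _mm_andnot_ps(sign_mask, x);        /* clear the sign bit */
    __m128 small = _mm_cmplt_ps(absx, limit);          /* false for NaN and big values */
    __m128 r     = FloorSseFast(x);
    /* Blend: take r where small is set, otherwise keep x. */
    return _mm_or_ps(_mm_and_ps(small, r), _mm_andnot_ps(small, x));
}
```

For an input such as 3.0e9f the guarded version returns the input unchanged, while the unguarded one goes through the out-of-range integer conversion and returns a wrong value.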

I also tried to switch it into by-value form. I'm not sure I did it correctly. Also, VS 2017 has issues with a few things in the code from the blog. I tried fixing those as well, but again I'm not sure I did it correctly. This is why I made it a community wiki, so people can edit and improve it. If you want, post it as an answer of your own. I will mark it and copy it into the wiki.

– Royi
Jan 3 at 18:27

Changing it to by-value is literally as easy as removing the &. There's no need to do that, though. What you should change is that *(__m128*)(&tmp) should be _mm_castsi128_ps(tmp), and similar. Those pointer casts are a bad idea, even though __m128 is a may-alias type.

– Peter Cordes
Jan 3 at 18:52
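
A small sketch of what the cast intrinsics do in this code: _mm_castsi128_ps and _mm_castps_si128 only reinterpret the 128 bits as a different vector type, so an integer bit pattern built with shifts can be used directly as a float vector. The helper names below (OneFromBits, NearTwoFromBits, lane0_bits) are hypothetical and exist only for this illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* 1.0f in every lane, built from the bit pattern 0x3F800000. */
static inline __m128 OneFromBits(void) {
    __m128i all  = _mm_set1_epi32(-1);                           /* 0xFFFFFFFF */
    __m128i bits = _mm_slli_epi32(_mm_srli_epi32(all, 25), 23);  /* 0x7F << 23 */
    return _mm_castsi128_ps(bits);   /* reinterpret the bits, no conversion */
}

/* Largest float below 2.0f: 0xFFFFFFFF >> 2 = 0x3FFFFFFF. */
static inline __m128 NearTwoFromBits(void) {
    __m128i all = _mm_set1_epi32(-1);
    return _mm_castsi128_ps(_mm_srli_epi32(all, 2));
}

/* Hypothetical helper: bit pattern of lane 0, read without pointer casts. */
static inline uint32_t lane0_bits(__m128 v) {
    return (uint32_t)_mm_cvtsi128_si32(_mm_castps_si128(v));
}
```

The same values could be written as plain constants with _mm_set1_ps; the shifts are shown only to mirror what the answer's code computes.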

@PeterCordes, I did remove the &. I just wasn't familiar with its effect in that context, hence I didn't know whether it works. I also used the cast functions you recommended. What do you think now?

– Royi
Jan 3 at 19:56

Oh, I forgot this was a C question. Are you not familiar with C++ references? It's no different from any other object, as long as your calling convention supports __m128 args by value. Anyway, you should probably simplify the constants to replace the silly cmpeq and integer stuff with _mm_set1_ps(1.0) or 1.999999whatever. The author of this code went overboard trying to outsmart the compiler and prevent it from loading constants from memory, but used cmpps instead of pcmpeqd to create an all-ones bit-pattern. It's questionable if it's worth generating these constants on the fly.

– Peter Cordes
Jan 3 at 20:54
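
Following that suggestion, the functions might look like this with the constants written out via _mm_set1_ps and the compiler left to decide how to materialize them. This is a sketch of the same truncate-and-adjust approach, not a tested replacement: it still assumes every lane fits in a 32-bit integer, and 1.99999988f is my decimal spelling of the bit pattern 0x3FFFFFFF.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

static inline __m128 FloorSse(__m128 x) {
    const __m128 one = _mm_set1_ps(1.0f);
    __m128 fi  = _mm_cvtepi32_ps(_mm_cvttps_epi32(x)); /* truncate toward zero */
    __m128 igx = _mm_cmpgt_ps(fi, x);                  /* truncation went above x */
    return _mm_sub_ps(fi, _mm_and_ps(igx, one));       /* subtract 1 in those lanes */
}

static inline __m128 CeilSse(__m128 x) {
    const __m128 one = _mm_set1_ps(1.0f);
    __m128 fi  = _mm_cvtepi32_ps(_mm_cvttps_epi32(x)); /* truncate toward zero */
    __m128 ilx = _mm_cmplt_ps(fi, x);                  /* truncation went below x */
    return _mm_add_ps(fi, _mm_and_ps(ilx, one));       /* add 1 in those lanes */
}

static inline __m128 RoundSse(__m128 a) {
    const __m128 nearest2 = _mm_set1_ps(1.99999988f);  /* largest float below 2 */
    __m128 aTrunc = _mm_cvtepi32_ps(_mm_cvttps_epi32(a));
    __m128 rmd    = _mm_sub_ps(a, aTrunc);             /* remainder in (-1, 1) */
    __m128 rmd2   = _mm_mul_ps(rmd, nearest2);
    __m128 offset = _mm_cvtepi32_ps(_mm_cvttps_epi32(rmd2)); /* -1, 0 or +1 */
    return _mm_add_ps(aTrunc, offset);
}
```

One thing worth checking against your requirements: because the multiplier is slightly below 2, an exact half such as 2.5f yields an offset of 0, so this RoundSse returns 2.0f where roundf would return 3.0f. Non-tie inputs such as 2.4f and 2.6f round to 2.0f and 3.0f as expected.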
