Efficient SSE FP floor/ceil/round rounding functions without SSE4.1?




















How can I round a __m128 vector of floats up/down or to the nearest integer, like these functions?




  • Round - roundf()


  • Ceil - ceilf() or SSE4.1 _mm_ceil_ps.


  • Floor - floorf() or SSE4.1 _mm_floor_ps.


I need to do this without SSE4.1 roundps (_mm_floor_ps / _mm_ceil_ps / _mm_round_ps(x, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC)). roundps can also truncate toward zero, but I don't need that for this application.



I can use SSE3 and earlier. (No SSSE3 or SSE4)



So the function declarations would be something like:



__m128 RoundSse( __m128 x ), __m128 CeilSse( __m128 x ) and __m128 FloorSse( __m128 x ).



































  • A starting point would be reddit.com/r/programming/comments/1p2yys/…, though it passes arguments by reference instead of by value. The code there also has issues with VS 2017.

    – Royi
    Jan 3 at 12:43











  • Are you sure you actually need round, and that the IEEE default rounding mode wouldn't work for you? rintf or nearbyintf give you that, while round uses a special rounding mode that x86 doesn't have in hardware. See "round() for float in C++". (It can be emulated with a few instructions on top of roundps if you have SSE4.1, so if you can emulate roundps to nearest you can still emulate round(), but it's probably going to be slower.)

    – Peter Cordes
    Jan 3 at 12:53






  • Google for "sse_mathfun" - it's a useful library which includes the above functions and many others.

    – Paul R
    Jan 3 at 13:12













  • @PaulR, I'd rather have an implementation here so I can understand how it works. It will also be a great reference for others who search for it.

    – Royi
    Jan 3 at 17:02








  • See DirectXMath, specifically the implementation of XMVectorRound, XMVectorFloor, and XMVectorCeil in this source file

    – Chuck Walbourn
    Jan 3 at 17:45




















c optimization vectorization sse simd














asked Jan 3 at 12:39 by Royi
edited Jan 3 at 13:01 by Peter Cordes



























1 Answer














I'm posting the code from http://dss.stephanierct.com/DevBlog/?p=8:



It should be adapted to pass arguments by value (I just removed the & from the code; I'm not sure that is OK):



#include <emmintrin.h> // SSE2 intrinsics (also pulls in the SSE1 header)

// All of these go through a float -> int32 -> float round trip, so they are
// only meaningful for inputs whose truncated value fits in a 32-bit integer.

static inline __m128 FloorSse(const __m128 x) {
    // Build 1.0f (0x3F800000) without a memory load: all-ones >> 25 = 0x7F, then << 23.
    __m128i v0 = _mm_setzero_si128();
    __m128i v1 = _mm_cmpeq_epi32(v0, v0);
    __m128i ji = _mm_srli_epi32(v1, 25);
    __m128i tmp = _mm_slli_epi32(ji, 23);
    __m128 j = _mm_castsi128_ps(tmp);  // vector of 1.0f
    __m128i i = _mm_cvttps_epi32(x);   // truncate toward zero
    __m128 fi = _mm_cvtepi32_ps(i);
    __m128 igx = _mm_cmpgt_ps(fi, x);  // lanes where truncation went up (negative non-integers)
    j = _mm_and_ps(igx, j);            // 1.0f in those lanes, 0.0f elsewhere
    return _mm_sub_ps(fi, j);
}

static inline __m128 CeilSse(const __m128 x) {
    // Same 1.0f constant as in FloorSse.
    __m128i v0 = _mm_setzero_si128();
    __m128i v1 = _mm_cmpeq_epi32(v0, v0);
    __m128i ji = _mm_srli_epi32(v1, 25);
    __m128i tmp = _mm_slli_epi32(ji, 23);
    __m128 j = _mm_castsi128_ps(tmp);  // vector of 1.0f
    __m128i i = _mm_cvttps_epi32(x);   // truncate toward zero
    __m128 fi = _mm_cvtepi32_ps(i);
    __m128 igx = _mm_cmplt_ps(fi, x);  // lanes where truncation went down (positive non-integers)
    j = _mm_and_ps(igx, j);            // 1.0f in those lanes, 0.0f elsewhere
    return _mm_add_ps(fi, j);
}

static inline __m128 RoundSse(const __m128 a) {
    // Generate the highest value < 2 (0x3FFFFFFF): all-ones >> 2.
    __m128 v0 = _mm_setzero_ps();
    __m128 v1 = _mm_cmpeq_ps(v0, v0);  // 0.0 == 0.0 in every lane, so all bits set
    __m128i tmp = _mm_castps_si128(v1);
    tmp = _mm_srli_epi32(tmp, 2);
    __m128 vNearest2 = _mm_castsi128_ps(tmp);
    __m128i i = _mm_cvttps_epi32(a);
    __m128 aTrunc = _mm_cvtepi32_ps(i);        // truncate a
    __m128 rmd = _mm_sub_ps(a, aTrunc);        // get remainder
    __m128 rmd2 = _mm_mul_ps(rmd, vNearest2);  // remainder times "almost 2" yields the needed offset
    __m128i rmd2i = _mm_cvttps_epi32(rmd2);    // after being truncated, of course
    __m128 rmd2Trunc = _mm_cvtepi32_ps(rmd2i);
    // Note: an exact .5 remainder times "almost 2" is just under 1 and truncates
    // to 0, so ties round toward zero here, unlike roundf().
    __m128 r = _mm_add_ps(aTrunc, rmd2Trunc);
    return r;
}


// static added: a plain C99 `inline` definition may fail to link without an external definition.
static inline __m128 ModSee(const __m128 a, const __m128 aDiv) {
    __m128 c = _mm_div_ps(a, aDiv);
    __m128i i = _mm_cvttps_epi32(c);
    __m128 cTrunc = _mm_cvtepi32_ps(i);        // truncated quotient
    __m128 base = _mm_mul_ps(cTrunc, aDiv);
    __m128 r = _mm_sub_ps(a, base);            // a - trunc(a / aDiv) * aDiv
    return r;
}
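
The two on-the-fly constants in the code above can be checked in scalar C. This is only a sketch, assuming float is IEEE-754 binary32; the helper names bits_to_float, one_from_all_ones and just_under_two are hypothetical and not part of the blog's code:

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float, which is what
   _mm_castsi128_ps does for each lane. */
static float bits_to_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* 0xFFFFFFFF >> 25 = 0x7F; 0x7F << 23 = 0x3F800000, the encoding of 1.0f. */
static float one_from_all_ones(void) {
    return bits_to_float((UINT32_C(0xFFFFFFFF) >> 25) << 23);
}

/* 0xFFFFFFFF >> 2 = 0x3FFFFFFF, the largest float below 2.0f. */
static float just_under_two(void) {
    return bits_to_float(UINT32_C(0xFFFFFFFF) >> 2);
}
```

Since 0.5f * just_under_two() is slightly below 1.0f, a remainder of exactly one half truncates to zero in RoundSse, which is where its ties-toward-zero behavior comes from.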































  • You left out the extra info from the blog that these aren't totally safe outside the range where converting to an integer and back works. The blog has safe wrappers for those. According to the reddit discussions (reddit.com/r/programming/comments/1p2yys/…) this updated version of the code does actually work correctly for the whole range of inputs that Bruce Dawson tested.

    – Peter Cordes
    Jan 3 at 18:22











  • I also tried to switch it to by-value form. I'm not sure I did it correctly. VS 2017 also has issues with a few things in the code from the blog. I tried fixing those as well, but again I'm not sure I did it correctly. This is why I made it a community wiki, so people can edit and improve it. If you want, post it as an answer of your own. I will mark it and copy it into the wiki.

    – Royi
    Jan 3 at 18:27













  • Changing it to by-value is literally as easy as removing the &. There's no need to do that, though. What you should change is the pointer casts: *(__m128*)(&tmp) should be _mm_castsi128_ps(tmp), and similar. Those pointer casts are a bad idea, even though __m128 is a may-alias type.

    – Peter Cordes
    Jan 3 at 18:52











  • @PeterCordes, I did remove the &. I just wasn't familiar with its effect in that context, hence I didn't know if it works. I also used the cast functions you recommended. What do you think now?

    – Royi
    Jan 3 at 19:56











  • Oh, I forgot this was a C question. Are you not familiar with C++ references? It's no different from any other object, as long as your calling convention supports __m128 args by value. Anyway, you should probably simplify the constants to replace the silly cmpeq and integer stuff with _mm_set1_ps(1.0) or 1.999999whatever. The author of this code went overboard trying to outsmart the compiler and prevent it from loading constants from memory, but used cmpps instead of pcmpeqd to create an all-ones bit-pattern. It's questionable if it's worth generating these constants on the fly.

    – Peter Cordes
    Jan 3 at 20:54
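
Putting the two suggestions from these comments together (plain _mm_set1_ps constants, plus a guard for inputs outside the range where the integer round trip works) might look like the sketch below. It is not the blog's safe wrapper: the names small_enough_mask, select_ps, floor_sse2_guarded and ceil_sse2_guarded are hypothetical, and the 2^23 threshold is an assumption based on the fact that every float of that magnitude or larger is already an integer. Lanes that fail the guard (large values, infinities, NaN) are returned unchanged. The sign of zero is not preserved (for example, -0.25 ceils to +0.0 rather than -0.0).

```c
#include <emmintrin.h>

/* Per-lane mask: all-ones where |x| < 2^23, so x may have a fractional
   part and safely fits in int32. NaN compares false and is excluded. */
static inline __m128 small_enough_mask(__m128 x) {
    const __m128 sign = _mm_set1_ps(-0.0f);           /* only the sign bit set */
    const __m128 absx = _mm_andnot_ps(sign, x);       /* clear the sign bit */
    return _mm_cmplt_ps(absx, _mm_set1_ps(8388608.0f)); /* 2^23 */
}

/* mask ? a : b per lane, without SSE4.1 blendvps. */
static inline __m128 select_ps(__m128 mask, __m128 a, __m128 b) {
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

static inline __m128 floor_sse2_guarded(__m128 x) {
    const __m128 one = _mm_set1_ps(1.0f);
    __m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));  /* truncate toward zero */
    /* Subtract 1 where truncation went above x (negative non-integers). */
    __m128 r = _mm_sub_ps(t, _mm_and_ps(_mm_cmpgt_ps(t, x), one));
    return select_ps(small_enough_mask(x), r, x);
}

static inline __m128 ceil_sse2_guarded(__m128 x) {
    const __m128 one = _mm_set1_ps(1.0f);
    __m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));  /* truncate toward zero */
    /* Add 1 where truncation went below x (positive non-integers). */
    __m128 r = _mm_add_ps(t, _mm_and_ps(_mm_cmplt_ps(t, x), one));
    return select_ps(small_enough_mask(x), r, x);
}
```

Whether the compiler loads 1.0f from memory or materializes it some other way is left to the compiler here, in line with the comment that generating constants on the fly is of questionable value.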













































edited Jan 3 at 19:55, community wiki (3 revs, Royi)














  • You left out the extra info from the blog that these aren't totally safe outside the range where converting to an integer and back works. The block has safe wrappers for those. According to the reddit discussions (reddit.com/r/programming/comments/1p2yys/…) this updated version of the code does actually work correctly for the whole range of inputs that Bruce Dawson tested.

    – Peter Cordes
    Jan 3 at 18:22











  • I also tried to switch it into By Value form. I'm not sure I did it correctly. Also VS 2017 has some issues with few things in the code form the blog. I tried fixing it as well but again I'm not sure I did it correctly. This is why I put it as WIKI so people can edit it and improve it. If you want, post as an answer of yours. I will mark it and copy it into the Wiki.

    – Royi
    Jan 3 at 18:27













  • Changing it to by-value is literally as easy as removing the &. There's no need to do that, though. What you should change is *(__m128*)(&tmp) should be _mm_castsi128_ps(tmp), and similar. Those are a bad idea, even though __m128 is a may-alias type.

    – Peter Cordes
    Jan 3 at 18:52











  • @PeterCordes, I did removed the &. I just wasn't familiar with the effect of it in that context hence didn't know if it works. I also used the cast functions you recommended. What do you think now?

    – Royi
    Jan 3 at 19:56











  • Oh, I forgot this was a C question. Are you not familiar with C++ references? It's no different from any other object, as long as your calling convention supports __m128 args by value. Anyway, you should probably simplify the constants to replace the silly cmpeq and integer stuff with _mm_set1_ps(1.0) or 1.999999whatever. The author of this code went overboard trying to outsmart the compiler and prevent it from loading constants from memory, but used cmpps instead of pcmpeqd to create an all-ones bit-pattern. It's questionable if it's worth generating these constants on the fly.

    – Peter Cordes
    Jan 3 at 20:54
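
For comparison, here is a sketch of the two ways to get a vector of 1.0f that the comment above contrasts. Both function names are made up for this example, and the on-the-fly version is only one possible shift sequence, not necessarily the one used in the blog's code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Simple version: let the compiler decide how to materialise the constant
   (typically a load from read-only memory). */
static inline __m128 one_from_set1(void)
{
    return _mm_set1_ps(1.0f);
}

/* Hypothetical on-the-fly version: build the bit pattern 0x3F800000 (1.0f)
   without a memory constant. */
static inline __m128 one_on_the_fly(void)
{
    __m128i z    = _mm_setzero_si128();
    __m128i ones = _mm_cmpeq_epi32(z, z);       /* pcmpeqd: all bits set   */
    __m128i low7 = _mm_srli_epi32(ones, 25);    /* 0x0000007F in each lane */
    __m128i bits = _mm_slli_epi32(low7, 23);    /* 0x3F800000 in each lane */
    return _mm_castsi128_ps(bits);
}
```

An optimizing compiler may constant-fold the second version into the same load as the first, which is one more reason to start with _mm_set1_ps and only measure the alternative if constant loads show up in a profile.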























































