Efficient SSE FP floor/ceil/round rounding functions without SSE4.1?
How can I round a __m128 vector of floats up/down or to the nearest integer, like these functions?
- Round - roundf()
- Ceil - ceilf() or SSE4.1 _mm_ceil_ps
- Floor - floorf() or SSE4.1 _mm_floor_ps
I need to do this without SSE4.1 roundps (_mm_floor_ps / _mm_ceil_ps / _mm_round_ps(x, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC)). roundps can also truncate toward zero, but I don't need that for this application.
I can use SSE3 and earlier. (No SSSE3 or SSE4)
So the function declarations would be something like: __m128 RoundSse( __m128 x ), __m128 CeilSse( __m128 x ) and __m128 FloorSse( __m128 x ).
c optimization vectorization sse simd
A starting point would be reddit.com/r/programming/comments/1p2yys/…, though it passes arguments by reference instead of by value. The code there also has issues with VS 2017.
– Royi
Jan 3 at 12:43
Are you sure you actually need round, and the IEEE default rounding mode wouldn't work for you? rintf or nearbyintf give you that, while round uses a special rounding mode that x86 doesn't have in hardware. See "round() for float in C++". (It can be emulated with a few instructions on top of roundps if you have SSE4.1, so if you can emulate roundps to nearest you can still emulate round(), but it's probably going to be slower.)
– Peter Cordes
Jan 3 at 12:53
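As a rough illustration of the "IEEE default rounding mode" option mentioned in the comment above: SSE2 already has a conversion that rounds according to the current MXCSR rounding mode (round-to-nearest-even unless it has been changed), so an rintf-style rounding can be sketched without SSE4.1. The helper name RintSse2Sketch is hypothetical, and the sketch assumes every lane fits in a 32-bit signed integer; larger values, infinities, and NaNs are not handled.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: round each lane to the nearest integer using the
   current MXCSR rounding mode (round-to-nearest-even by default), similar
   to rintf(). Assumes each lane fits in a 32-bit signed integer. */
static inline __m128 RintSse2Sketch(__m128 x) {
    __m128i i = _mm_cvtps_epi32(x);  /* cvtps2dq: rounds instead of truncating */
    return _mm_cvtepi32_ps(i);       /* convert the integers back to float */
}
```

With the default rounding mode, ties go to the even neighbor, so 2.5 becomes 2.0 and 3.5 becomes 4.0, whereas roundf() would return 3.0 and 4.0.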
Google for "sse_mathfun" - it's a useful library which includes the above functions and many others.
– Paul R
Jan 3 at 13:12
@PaulR, I'd rather have an implementation here so I can understand how it works. It will also be a great reference for others who search for it.
– Royi
Jan 3 at 17:02
See DirectXMath, specifically the implementation of XMVectorRound, XMVectorFloor, and XMVectorCeil in this source file
– Chuck Walbourn
Jan 3 at 17:45
edited Jan 3 at 13:01 by Peter Cordes
asked Jan 3 at 12:39 by Royi
1 Answer
I'm posting the code from http://dss.stephanierct.com/DevBlog/?p=8, adapted to pass arguments by value (I just removed the & from the code; I'm not sure that is OK):
#include <emmintrin.h>  // SSE2 intrinsics

static inline __m128 FloorSse(const __m128 x) {
    __m128i v0 = _mm_setzero_si128();
    __m128i v1 = _mm_cmpeq_epi32(v0, v0);   // all-ones in every lane
    __m128i ji = _mm_srli_epi32(v1, 25);    // 0x0000007F
    __m128i tmp = _mm_slli_epi32(ji, 23);   // 0x3F800000 (I added tmp, not sure about it)
    __m128 j = _mm_castsi128_ps(tmp);       // vector of 1.0f (I edited this, not sure about it)
    __m128i i = _mm_cvttps_epi32(x);        // truncate toward zero
    __m128 fi = _mm_cvtepi32_ps(i);
    __m128 igx = _mm_cmpgt_ps(fi, x);       // lanes where truncation went above x
    j = _mm_and_ps(igx, j);                 // 1.0f in those lanes, 0.0f elsewhere
    return _mm_sub_ps(fi, j);
}
static inline __m128 CeilSse(const __m128 x) {
    __m128i v0 = _mm_setzero_si128();
    __m128i v1 = _mm_cmpeq_epi32(v0, v0);   // all-ones in every lane
    __m128i ji = _mm_srli_epi32(v1, 25);    // 0x0000007F
    __m128i tmp = _mm_slli_epi32(ji, 23);   // 0x3F800000 (I added tmp, not sure about it)
    __m128 j = _mm_castsi128_ps(tmp);       // vector of 1.0f (I edited this, not sure about it)
    __m128i i = _mm_cvttps_epi32(x);        // truncate toward zero
    __m128 fi = _mm_cvtepi32_ps(i);
    __m128 igx = _mm_cmplt_ps(fi, x);       // lanes where truncation went below x
    j = _mm_and_ps(igx, j);                 // 1.0f in those lanes, 0.0f elsewhere
    return _mm_add_ps(fi, j);
}
static inline __m128 RoundSse(const __m128 a) {
    // Generate the highest float value < 2 (bit pattern 0x3FFFFFFF).
    __m128 v0 = _mm_setzero_ps();
    __m128 v1 = _mm_cmpeq_ps(v0, v0);               // all-ones in every lane
    __m128i tmp = _mm_castps_si128(v1);             // I added tmp, not sure about it
    tmp = _mm_srli_epi32(tmp, 2);                   // 0x3FFFFFFF
    __m128 vNearest2 = _mm_castsi128_ps(tmp);
    __m128i i = _mm_cvttps_epi32(a);
    __m128 aTrunc = _mm_cvtepi32_ps(i);             // truncate a
    __m128 rmd = _mm_sub_ps(a, aTrunc);             // get remainder
    __m128 rmd2 = _mm_mul_ps(rmd, vNearest2);       // remainder times nearly 2 yields the needed offset
    __m128i rmd2i = _mm_cvttps_epi32(rmd2);         // after being truncated, of course
    __m128 rmd2Trunc = _mm_cvtepi32_ps(rmd2i);
    __m128 r = _mm_add_ps(aTrunc, rmd2Trunc);
    return r;
}
static inline __m128 ModSee(const __m128 a, const __m128 aDiv) {
    __m128 c = _mm_div_ps(a, aDiv);
    __m128i i = _mm_cvttps_epi32(c);        // truncate the quotient toward zero
    __m128 cTrunc = _mm_cvtepi32_ps(i);
    __m128 base = _mm_mul_ps(cTrunc, aDiv);
    __m128 r = _mm_sub_ps(a, base);         // remainder takes the sign of a
    return r;
}
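To make the "multiply the remainder by nearly 2" step in RoundSse easier to follow, here is a small sketch of the same arithmetic with the constant written directly as a bit pattern. The names RoundSseSketch and RoundSseLane0 are hypothetical, and the sketch shares the original's restriction to inputs that survive the float-to-int-to-float round trip.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Same arithmetic as RoundSse, with the "nearly 2" constant spelled as a
   bit pattern: 0x3FFFFFFF is the largest float below 2.0f (2 - 2^-23). */
static inline __m128 RoundSseSketch(__m128 a) {
    const __m128 nearly2 = _mm_castsi128_ps(_mm_set1_epi32(0x3FFFFFFF));
    __m128 aTrunc = _mm_cvtepi32_ps(_mm_cvttps_epi32(a));  /* integer part */
    __m128 rmd = _mm_sub_ps(a, aTrunc);                    /* fraction in (-1, 1) */
    __m128 rmd2 = _mm_mul_ps(rmd, nearly2);                /* |fraction| > 0.5 maps to >= 1 */
    __m128 offset = _mm_cvtepi32_ps(_mm_cvttps_epi32(rmd2)); /* -1, 0 or +1 */
    return _mm_add_ps(aTrunc, offset);
}

/* Convenience wrapper for checking a single value (lane 0 only). */
static inline float RoundSseLane0(float v) {
    return _mm_cvtss_f32(RoundSseSketch(_mm_set1_ps(v)));
}
```

Tracing a tie shows how this differs from roundf(): for 2.5 the remainder is exactly 0.5, and 0.5 times (2 - 2^-23) is just below 1, so the offset truncates to 0 and the result is 2.0, whereas roundf(2.5f) returns 3.0. Exact halves therefore appear to round toward zero with this constant, while values such as 2.6 and -2.6 round to 3.0 and -3.0 as expected.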
You left out the extra info from the blog that these aren't totally safe outside the range where converting to an integer and back works. The blog has safe wrappers for those. According to the reddit discussion (reddit.com/r/programming/comments/1p2yys/…), this updated version of the code does actually work correctly for the whole range of inputs that Bruce Dawson tested.
– Peter Cordes
Jan 3 at 18:22
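The blog's safe wrappers are not reproduced in this thread. As a hedged sketch of one common way to get the same effect (not the blog's code): every float with magnitude of at least 2^23 has no fractional part, so such lanes can simply keep their input instead of going through the integer conversion. The helper name SelectInRangeSketch is hypothetical; it takes the input x and the result r of one of the unsafe routines and picks between them per lane.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical wrapper step: for each lane, return r when |x| < 2^23 and
   x itself otherwise. Floats at or above 2^23 in magnitude are already
   integers; NaN lanes fail the comparison and are passed through as well. */
static inline __m128 SelectInRangeSketch(__m128 x, __m128 r) {
    const __m128 limit = _mm_set1_ps(8388608.0f);  /* 2^23 */
    const __m128 absMask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
    __m128 absX = _mm_and_ps(x, absMask);          /* clear the sign bit */
    __m128 inRange = _mm_cmplt_ps(absX, limit);    /* all-ones where |x| < 2^23 */
    /* SSE2 has no blend instruction, so select with and / andnot / or. */
    return _mm_or_ps(_mm_and_ps(inRange, r), _mm_andnot_ps(inRange, x));
}
```

A call such as SelectInRangeSketch(x, FloorSse(x)) would then leave large values, infinities, and NaNs untouched. One remaining difference from floorf() is that a negative zero input comes back as positive zero from FloorSse.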
I also tried to switch it to by-value form, but I'm not sure I did it correctly. VS 2017 also has issues with a few things in the code from the blog; I tried fixing those as well, but again I'm not sure I did it correctly. This is why I made it a community wiki, so people can edit and improve it. If you want, post it as an answer of your own. I will accept it and copy it into the wiki.
– Royi
Jan 3 at 18:27
Changing it to by-value is literally as easy as removing the &. There's no need to do that, though. What you should change is that *(__m128*)(&tmp) should be _mm_castsi128_ps(tmp), and similar. Those pointer casts are a bad idea, even though __m128 is a may-alias type.
– Peter Cordes
Jan 3 at 18:52
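A minimal sketch of the cast-intrinsic approach recommended in the comment above: _mm_castsi128_ps reinterprets the 128 bits of an integer vector as four floats without converting values, so no pointer cast is needed. The function names are hypothetical; the second one reproduces the shift sequence from the answer so its bit pattern can be compared against the literal.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: 1.0f in every lane, built from its IEEE-754 bit
   pattern and reinterpreted with the cast intrinsic (a pure type change). */
static inline __m128 OneFromBitsSketch(void) {
    __m128i bits = _mm_set1_epi32(0x3F800000);  /* bit pattern of 1.0f */
    return _mm_castsi128_ps(bits);
}

/* Hypothetical helper: the same bit pattern generated on the fly, as in
   FloorSse and CeilSse: all-ones, shifted right by 25, then left by 23. */
static inline __m128i OneOnTheFlyBitsSketch(void) {
    __m128i zero = _mm_setzero_si128();
    __m128i allOnes = _mm_cmpeq_epi32(zero, zero);  /* 0xFFFFFFFF per lane */
    __m128i low7 = _mm_srli_epi32(allOnes, 25);     /* 0x0000007F */
    return _mm_slli_epi32(low7, 23);                /* 0x3F800000 */
}
```

Both routes end at the same bits, which is why the tmp variables plus _mm_castsi128_ps in the answer are a faithful replacement for the original pointer casts.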
@PeterCordes, I did remove the &. I just wasn't familiar with its effect in that context, hence I didn't know whether it works. I also used the cast functions you recommended. What do you think now?
– Royi
Jan 3 at 19:56
Oh, I forgot this was a C question. Are you not familiar with C++ references? It's no different from any other object, as long as your calling convention supports __m128 args by value. Anyway, you should probably simplify the constants to replace the silly cmpeq and integer stuff with _mm_set1_ps(1.0) or 1.999999whatever. The author of this code went overboard trying to outsmart the compiler and prevent it from loading constants from memory, but used cmpps instead of pcmpeqd to create an all-ones bit-pattern. It's questionable if it's worth generating these constants on the fly.
– Peter Cordes
Jan 3 at 20:54
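Following the suggestion in the comment above, a sketch of FloorSse and CeilSse with the on-the-fly constant replaced by _mm_set1_ps(1.0f) might look like this. The names FloorSseSimple and CeilSseSimple are hypothetical, and the same range restriction applies: inputs must fit in a 32-bit signed integer for the truncating conversion to be meaningful.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical simplified floor: truncate, then subtract 1.0f in lanes
   where truncation landed above the input (negative non-integers). */
static inline __m128 FloorSseSimple(__m128 x) {
    const __m128 one = _mm_set1_ps(1.0f);
    __m128 fi = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));   /* truncate toward zero */
    __m128 adj = _mm_and_ps(_mm_cmpgt_ps(fi, x), one);  /* 1.0f where fi > x */
    return _mm_sub_ps(fi, adj);
}

/* Hypothetical simplified ceil: truncate, then add 1.0f in lanes where
   truncation landed below the input (positive non-integers). */
static inline __m128 CeilSseSimple(__m128 x) {
    const __m128 one = _mm_set1_ps(1.0f);
    __m128 fi = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));   /* truncate toward zero */
    __m128 adj = _mm_and_ps(_mm_cmplt_ps(fi, x), one);  /* 1.0f where fi < x */
    return _mm_add_ps(fi, adj);
}
```

Whether the compiler loads the constant from memory or materializes it some other way is left to the compiler here, which is the trade-off the comment describes.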
edited Jan 3 at 19:55 (community wiki, 3 revs, Royi)
You left out the extra info from the blog that these aren't totally safe outside the range where converting to an integer and back works. The block has safe wrappers for those. According to the reddit discussions (reddit.com/r/programming/comments/1p2yys/…) this updated version of the code does actually work correctly for the whole range of inputs that Bruce Dawson tested.
– Peter Cordes
Jan 3 at 18:22
I also tried to switch it into By Value form. I'm not sure I did it correctly. Also VS 2017 has some issues with few things in the code form the blog. I tried fixing it as well but again I'm not sure I did it correctly. This is why I put it as WIKI so people can edit it and improve it. If you want, post as an answer of yours. I will mark it and copy it into the Wiki.
– Royi
Jan 3 at 18:27
Changing it to by-value is literally as easy as removing the&
. There's no need to do that, though. What you should change is*(__m128*)(&tmp)
should be_mm_castsi128_ps(tmp)
, and similar. Those are a bad idea, even though__m128
is a may-alias type.
– Peter Cordes
Jan 3 at 18:52
@PeterCordes, I did removed the&
. I just wasn't familiar with the effect of it in that context hence didn't know if it works. I also used the cast functions you recommended. What do you think now?
– Royi
Jan 3 at 19:56
Oh, I forgot this was a C question. Are you not familiar with C++ references? It's no different from any other object, as long as your calling convention supports__m128
args by value. Anyway, you should probably simplify the constants to replace the sillycmpeq
and integer stuff with_mm_set1_ps(1.0)
or1.999999whatever
. The author of this code went overboard trying to outsmart the compiler and prevent it from loading constants from memory, but usedcmpps
instead ofpcmpeqd
to create an all-ones bit-pattern. It's questionable if it's worth generating these constants on the fly.
– Peter Cordes
Jan 3 at 20:54
|
show 3 more comments
You left out the extra info from the blog that these aren't totally safe outside the range where converting to an integer and back works. The block has safe wrappers for those. According to the reddit discussions (reddit.com/r/programming/comments/1p2yys/…) this updated version of the code does actually work correctly for the whole range of inputs that Bruce Dawson tested.
– Peter Cordes
Jan 3 at 18:22
I also tried to switch it into By Value form. I'm not sure I did it correctly. Also VS 2017 has some issues with few things in the code form the blog. I tried fixing it as well but again I'm not sure I did it correctly. This is why I put it as WIKI so people can edit it and improve it. If you want, post as an answer of yours. I will mark it and copy it into the Wiki.
– Royi
Jan 3 at 18:27
Changing it to by-value is literally as easy as removing the&
. There's no need to do that, though. What you should change is*(__m128*)(&tmp)
should be_mm_castsi128_ps(tmp)
, and similar. Those are a bad idea, even though__m128
is a may-alias type.
– Peter Cordes
Jan 3 at 18:52
@PeterCordes, I did removed the&
. I just wasn't familiar with the effect of it in that context hence didn't know if it works. I also used the cast functions you recommended. What do you think now?
– Royi
Jan 3 at 19:56
Oh, I forgot this was a C question. Are you not familiar with C++ references? It's no different from any other object, as long as your calling convention supports__m128
args by value. Anyway, you should probably simplify the constants to replace the sillycmpeq
and integer stuff with_mm_set1_ps(1.0)
or1.999999whatever
. The author of this code went overboard trying to outsmart the compiler and prevent it from loading constants from memory, but usedcmpps
instead ofpcmpeqd
to create an all-ones bit-pattern. It's questionable if it's worth generating these constants on the fly.
– Peter Cordes
Jan 3 at 20:54
You left out the extra info from the blog that these aren't totally safe outside the range where converting to an integer and back works. The block has safe wrappers for those. According to the reddit discussions (reddit.com/r/programming/comments/1p2yys/…) this updated version of the code does actually work correctly for the whole range of inputs that Bruce Dawson tested.
– Peter Cordes
Jan 3 at 18:22
You left out the extra info from the blog that these aren't totally safe outside the range where converting to an integer and back works. The block has safe wrappers for those. According to the reddit discussions (reddit.com/r/programming/comments/1p2yys/…) this updated version of the code does actually work correctly for the whole range of inputs that Bruce Dawson tested.
– Peter Cordes
Jan 3 at 18:22
I also tried to switch it into By Value form. I'm not sure I did it correctly. Also VS 2017 has some issues with few things in the code form the blog. I tried fixing it as well but again I'm not sure I did it correctly. This is why I put it as WIKI so people can edit it and improve it. If you want, post as an answer of yours. I will mark it and copy it into the Wiki.
– Royi
Jan 3 at 18:27
Changing it to by-value is literally as easy as removing the &. There's no need to do that, though. What you should change is the pointer casts: *(__m128*)(&tmp) should be _mm_castsi128_ps(tmp), and similar. Those pointer casts are a bad idea, even though __m128 is a may-alias type.
– Peter Cordes
Jan 3 at 18:52
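A small sketch of the cast-intrinsic style recommended in that comment. The helper name abs_ps and the variable tmp are illustrative only; the point is the reinterpretation step, which uses _mm_castsi128_ps rather than a pointer cast.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: clear the sign bit of each float lane. */
static inline __m128 abs_ps(__m128 x)
{
    /* Integer bit pattern with every bit set except the sign bit. */
    __m128i tmp = _mm_set1_epi32(0x7fffffff);

    /* Reinterpret the same 128 bits as four floats; this generates no
       instruction. It replaces the pointer cast *(__m128*)(&tmp). */
    __m128 mask = _mm_castsi128_ps(tmp);

    return _mm_and_ps(x, mask);
}
```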
A starting point would be reddit.com/r/programming/comments/1p2yys/…, though it passes by reference instead of by value. The code there also has issues with VS 2017.
– Royi
Jan 3 at 12:43
Are you sure you actually need round, and the IEEE default rounding mode wouldn't work for you? rintf or nearbyintf give you that, while round uses a special rounding mode that x86 doesn't have in hardware. See "round() for float in C++". (It can be emulated with a few instructions on top of roundps if you have SSE4.1, so if you can emulate roundps to nearest you can still emulate round(), but it's probably going to be slower.)
– Peter Cordes
Jan 3 at 12:53
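For reference, roundf() rounds halfway cases away from zero, whereas the hardware default rounds them to even. The sketch below is not the roundps-based emulation mentioned in that comment. It swaps in a different, commonly seen trick: add the largest float below 0.5 with the sign of x, then truncate. The name round_half_away_ps is hypothetical, and the code assumes SSE2, the default MXCSR rounding mode (nearest-even), and inputs that fit in a 32-bit integer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: round to nearest, ties away from zero, for inputs
   that fit in int32. Zero results come back as +0.0f in this sketch. */
static inline __m128 round_half_away_ps(__m128 x)
{
    const __m128 sign_mask   = _mm_set1_ps(-0.0f);        /* only the sign bit set */
    const __m128 almost_half = _mm_set1_ps(0.49999997f);  /* largest float < 0.5 */

    /* copysign(almost_half, x): take the sign bit from x. */
    __m128 bias = _mm_or_ps(almost_half, _mm_and_ps(x, sign_mask));

    /* Push values at or beyond the halfway point past the next integer. */
    __m128 biased = _mm_add_ps(x, bias);

    /* Truncate toward zero and convert back to float. */
    return _mm_cvtepi32_ps(_mm_cvttps_epi32(biased));
}
```

Using a bias slightly below 0.5 rather than 0.5 itself matters for inputs just under one half: 0.49999997f + 0.5f would round up to 1.0f in the addition and truncate to 1, which is the wrong answer.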
Google for "sse_mathfun" - it's a useful library which includes the above functions and many others.
– Paul R
Jan 3 at 13:12
@PaulR, I'd rather have an implementation here so I can understand how it works. It will also be a great reference for others who search for it.
– Royi
Jan 3 at 17:02
See DirectXMath, specifically the implementation of XMVectorRound, XMVectorFloor, and XMVectorCeil in this source file
– Chuck Walbourn
Jan 3 at 17:45