Salsa20 as a PRNG with streams

Can I use Salsa20 as a good non-cryptographic PRNG with different streams if I reduce the number of rounds to 8 and omit the addition step at the end? I want to omit the final step because I don't want to get all zero outputs.







random-number-generator salsa20

asked Jan 21 at 10:52
Thorham

  • Related: crypto.stackexchange.com/q/57670/54184
    – forest
    Jan 21 at 11:07

  • Consider ChaCha8. ChaCha is faster and has more diffusion per round.
    – Future Security
    Jan 21 at 16:15

1 Answer

Reducing the rounds to 8 gives you Salsa20/8, which is not only a fast PRNG (1.88 cycles per byte on a Core 2 Duo) but also still quite secure cryptographically: the best known attack requires approximately $2^{244}$ operations. Removing the final addition step is a bad idea, though. Without it, the rounds form a plain invertible permutation, so it would be trivial to run the function backwards and recover the key and counter from a single block of known plaintext. Keeping the addition will not give you all-zero outputs, so you should keep it.

You could cut the algorithm down to four rounds to roughly double the speed, at the cost of completely sacrificing cryptographic security. Fewer than four rounds results in incomplete diffusion, leading to biased, non-uniform output. Even at four rounds, it will still be roughly twice as slow as the fastest dedicated non-cryptographic PRNG, XorShift128+ (an LFSR-based PRNG at 0.48 cycles per byte on Kaby Lake).







edited Jan 21 at 11:08

answered Jan 21 at 11:03
forest
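
To make the point about the final addition concrete, here is a minimal Python sketch (not taken from the answer itself). It assumes the 256-bit-key state layout from the Salsa20 specification and, purely as an illustrative choice, uses the 64-bit nonce as the stream number. The names `initial_state`, `permute`, `unpermute`, and `block` are hypothetical. It shows that the rounds alone can be run backwards, which is why dropping the addition would expose the key.

```python
import struct

MASK = 0xFFFFFFFF
# "expand 32-byte k" constants from the Salsa20 specification
SIGMA = (0x61707865, 0x3320646E, 0x79622D32, 0x6B206574)

# Word indices touched by each quarter-round (column round, then row round)
COLUMNS = ((0, 4, 8, 12), (5, 9, 13, 1), (10, 14, 2, 6), (15, 3, 7, 11))
ROWS = ((0, 1, 2, 3), (5, 6, 7, 4), (10, 11, 8, 9), (15, 12, 13, 14))


def rotl(v, n):
    """Rotate a 32-bit word left by n bits."""
    return ((v << n) | (v >> (32 - n))) & MASK


def quarter(x, a, b, c, d):
    """Salsa20 quarter-round, applied in place."""
    x[b] ^= rotl((x[a] + x[d]) & MASK, 7)
    x[c] ^= rotl((x[b] + x[a]) & MASK, 9)
    x[d] ^= rotl((x[c] + x[b]) & MASK, 13)
    x[a] ^= rotl((x[d] + x[c]) & MASK, 18)


def quarter_inv(x, a, b, c, d):
    """Exact inverse of quarter(): undo the four steps in reverse order."""
    x[a] ^= rotl((x[d] + x[c]) & MASK, 18)
    x[d] ^= rotl((x[c] + x[b]) & MASK, 13)
    x[c] ^= rotl((x[b] + x[a]) & MASK, 9)
    x[b] ^= rotl((x[a] + x[d]) & MASK, 7)


def permute(words, rounds=8):
    """The rounds only, WITHOUT the final addition."""
    x = list(words)
    for _ in range(rounds // 2):
        for q in COLUMNS:
            quarter(x, *q)
        for q in ROWS:
            quarter(x, *q)
    return x


def unpermute(words, rounds=8):
    """Run the rounds backwards; needs no secret information."""
    x = list(words)
    for _ in range(rounds // 2):
        for q in ROWS:
            quarter_inv(x, *q)
        for q in COLUMNS:
            quarter_inv(x, *q)
    return x


def initial_state(key, stream, counter):
    """Build the 16-word input from a 32-byte key, a stream number, and a block counter.

    Using the nonce words as a stream number is an assumption of this sketch.
    """
    k = struct.unpack("<8I", key)
    n = struct.unpack("<2I", struct.pack("<Q", stream))
    c = struct.unpack("<2I", struct.pack("<Q", counter))
    return [SIGMA[0], *k[:4], SIGMA[1], *n, *c, SIGMA[2], *k[4:], SIGMA[3]]


def block(key, stream, counter, rounds=8):
    """One 64-byte output block WITH the final addition (Salsa20/8 by default)."""
    s = initial_state(key, stream, counter)
    out = [(a + b) & MASK for a, b in zip(permute(s, rounds), s)]
    return struct.pack("<16I", *out)


if __name__ == "__main__":
    key = bytes(range(32))
    state = initial_state(key, stream=7, counter=0)
    leaked = permute(state)          # what the no-addition variant would output
    recovered = unpermute(leaked)    # attacker simply inverts the rounds
    key_back = struct.pack("<8I", *recovered[1:5], *recovered[11:15])
    print("key recovered without the addition:", key_back == key)
```

With the addition in place, `block(key, stream, counter)` yields 64 bytes per call, where `counter` indexes blocks within a stream. The sketch is unoptimized and meant only to illustrate the structure.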












  • Some other non-cryptographic algorithms are certainly faster, but they have a smaller state (I need room for a SHA256 hash) and I need streams. There doesn't seem to be much choice other than crypto algorithms if you have these requirements.
    – Thorham
    Jan 21 at 12:00

  • Is 0.48 c/b for a scalar implementation? You can run four XorShift128+ PRNGs in parallel in elements of an AVX2 vector. See "AVX/SSE version of xorshift128+" for __m256i xorshift128plus_avx2(struct rngstate256 *sp). 8 SIMD ALU uops per 32 bytes of results => about 12 bytes per cycle, or 0.0833 c/b on SKL / KBL. (I used it in my answer on "What's the fastest way to generate a 1 GB text file containing random digits?", which does > 8 bytes per cycle of space-separated ASCII decimal digits on SKL.)
    – Peter Cordes
    Jan 21 at 14:52

  • @Thorham: would it work to use a SHA256 hash as the seed for two XorShift128+ PRNGs operating in parallel? If so, 2x 128-bit SIMD vectors will work, and let you generate 2x 64-bit random numbers in parallel. Or use 256-bit vectors to run 4 generators in parallel, requiring twice as much seed data. See my previous comment for C++ and C intrinsics implementations.
    – Peter Cordes
    Jan 21 at 14:58

  • @PeterCordes, Thorham, xoshiro256**/+ are available too. (They use rotate, not just xor-shifts. May not be as suitable for vector implementations.) Using two instances of a 128-bit algorithm seeded like that isn't too different from just truncating SHA-256 output to 128 bits.
    – Future Security
    Jan 21 at 16:09

  • @FutureSecurity: SSE2 / AVX2 xoshiro256** looks very possible. AVX512 even has SIMD rotates, making it even better. Other SIMD ISAs can emulate it with shift+shift+OR. SIMD integer multiply is not bad for 32-bit integers on Intel CPUs with SSE4.1, but requires extended-precision techniques for 64-bit integer elements (until AVX512), which is why I used xorshift+ instead of xorshift*. But xoshiro256** only multiplies by the constants 5 and 9, which are both power-of-2 + 1, so they are just left-shift+add. (In a scalar implementation, x86 can do that in one cycle with lea rax, [rbx + rbx*8].)
    – Peter Cordes
    Jan 21 at 16:24
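
The comments above bring up xoshiro256**, whose 256-bit state happens to match the size of a SHA-256 digest. The following is a minimal Python sketch under stated assumptions, not code from the thread: the class name `Xoshiro256StarStar` and the helpers `times5` and `times9` are illustrative, the state is taken directly from the digest, and deriving a stream by hashing a seed together with a stream label is an assumption of this sketch (it does not guarantee non-overlapping streams). The multiplications by 5 and 9 are written as shift-and-add, as described in the last comment.

```python
import hashlib
import struct

MASK64 = (1 << 64) - 1


def rotl64(v, n):
    """Rotate a 64-bit word left by n bits."""
    return ((v << n) | (v >> (64 - n))) & MASK64


def times5(v):
    """v * 5 mod 2**64, written as shift-and-add: (v << 2) + v."""
    return ((v << 2) + v) & MASK64


def times9(v):
    """v * 9 mod 2**64, written as shift-and-add: (v << 3) + v."""
    return ((v << 3) + v) & MASK64


class Xoshiro256StarStar:
    """xoshiro256** with its four 64-bit state words taken from a SHA-256 digest."""

    def __init__(self, seed_material: bytes):
        digest = hashlib.sha256(seed_material).digest()  # 32 bytes = 4 x 64 bits
        self.s = list(struct.unpack("<4Q", digest))
        if not any(self.s):
            # The all-zero state is the one state the generator cannot leave.
            raise ValueError("all-zero state is not allowed")

    def next(self) -> int:
        """Return the next 64-bit output and advance the state."""
        s = self.s
        result = times9(rotl64(times5(s[1]), 7))
        t = (s[1] << 17) & MASK64
        s[2] ^= s[0]
        s[3] ^= s[1]
        s[1] ^= s[2]
        s[0] ^= s[3]
        s[2] ^= t
        s[3] = rotl64(s[3], 45)
        return result


if __name__ == "__main__":
    # Hypothetical stream derivation: hash a base seed together with a stream label.
    stream0 = Xoshiro256StarStar(b"base seed|stream 0")
    stream1 = Xoshiro256StarStar(b"base seed|stream 1")
    print([hex(stream0.next()) for _ in range(2)])
    print([hex(stream1.next()) for _ in range(2)])
```

Whether hash-derived seeds are acceptable as "streams" depends on the application; this sketch only shows that the state size fits and that the scrambler needs no real multiplication.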

















