Salsa20 as a PRNG with streams

Can I use Salsa20 as a good non-cryptographic PRNG with different streams if I reduce the number of rounds to 8 and omit the addition step at the end? I want to omit the final step because I don't want to get all-zero outputs.

asked Jan 21 at 10:52

Thorham

588

Related: crypto.stackexchange.com/q/57670/54184
– forest
Jan 21 at 11:07

Consider ChaCha8. ChaCha is faster and has more diffusion per round.
– Future Security
Jan 21 at 16:15


random-number-generator salsa20


1 Answer

Reducing the number of rounds to 8 gives you Salsa20/8, which is not only a fast PRNG (1.88 cycles per byte on a Core 2 Duo) but is still quite secure cryptographically: the best known attack requires approximately 2²⁴⁴ operations. Removing the final addition step is a bad idea, though. The rounds on their own are an invertible permutation, so without the addition it is trivial to run the function backwards and recover the key and counter from a single block of known output (known plaintext, when it is used as a cipher). Keeping the addition will not give you all-zero outputs, so you should keep it.
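A minimal sketch of why the final addition matters, assuming the standard Salsa20 quarter-round and the 256-bit-key state layout. The helper names (`initial_state`, `rounds_inv`, `stream_id`) are illustrative and not from the answer, and using the 64-bit nonce as a stream identifier is likewise an assumption. The sketch shows that the rounds alone can be undone step by step, while the full block function adds the input state back in.

```python
MASK32 = 0xFFFFFFFF
SIGMA = (0x61707865, 0x3320646E, 0x79622D32, 0x6B206574)  # "expand 32-byte k"
COLUMNS = ((0, 4, 8, 12), (5, 9, 13, 1), (10, 14, 2, 6), (15, 3, 7, 11))
ROWS = ((0, 1, 2, 3), (5, 6, 7, 4), (10, 11, 8, 9), (15, 12, 13, 14))


def rotl32(v, n):
    """Rotate a 32-bit word left by n bits."""
    return ((v << n) | (v >> (32 - n))) & MASK32


def quarter(x, a, b, c, d):
    """Salsa20 quarter-round, applied in place to words a, b, c, d."""
    x[b] ^= rotl32((x[a] + x[d]) & MASK32, 7)
    x[c] ^= rotl32((x[b] + x[a]) & MASK32, 9)
    x[d] ^= rotl32((x[c] + x[b]) & MASK32, 13)
    x[a] ^= rotl32((x[d] + x[c]) & MASK32, 18)


def quarter_inv(x, a, b, c, d):
    """Exact inverse of quarter(): the same XORs in reverse order."""
    x[a] ^= rotl32((x[d] + x[c]) & MASK32, 18)
    x[d] ^= rotl32((x[c] + x[b]) & MASK32, 13)
    x[c] ^= rotl32((x[b] + x[a]) & MASK32, 9)
    x[b] ^= rotl32((x[a] + x[d]) & MASK32, 7)


def rounds(state, n_rounds=8):
    """Apply n_rounds rounds (column round, then row round) WITHOUT the final addition."""
    x = list(state)
    for _ in range(n_rounds // 2):
        for q in COLUMNS:
            quarter(x, *q)
        for q in ROWS:
            quarter(x, *q)
    return x


def rounds_inv(output, n_rounds=8):
    """Undo rounds(): row rounds first, then column rounds."""
    x = list(output)
    for _ in range(n_rounds // 2):
        for q in ROWS:
            quarter_inv(x, *q)
        for q in COLUMNS:
            quarter_inv(x, *q)
    return x


def initial_state(key_words, stream_id, counter):
    """Build the 16-word state; the nonce slot carries the (hypothetical) stream id."""
    k = key_words
    return [SIGMA[0], k[0], k[1], k[2], k[3], SIGMA[1],
            stream_id & MASK32, (stream_id >> 32) & MASK32,
            counter & MASK32, (counter >> 32) & MASK32,
            SIGMA[2], k[4], k[5], k[6], k[7], SIGMA[3]]


def salsa20_block(key_words, stream_id, counter, n_rounds=8):
    """Full block function: rounds followed by word-wise addition of the input state."""
    x0 = initial_state(key_words, stream_id, counter)
    x = rounds(x0, n_rounds)
    return [(a + b) & MASK32 for a, b in zip(x, x0)]


demo_key = list(range(1, 9))                  # arbitrary demo key words
demo_state = initial_state(demo_key, stream_id=7, counter=0)
leaked = rounds(demo_state)                   # what you would output without the addition
recovered = rounds_inv(leaked)                # the whole state, key included
print(recovered == demo_state)
```

Note that the rounds map an all-zero state to an all-zero state; it is the fixed constants in the state that keep that input from ever occurring.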

You could cut the algorithm down to four rounds to roughly double the speed, at the cost of completely sacrificing cryptographic security. Fewer than four rounds results in incomplete diffusion, which leads to biased, non-uniform output. Even at four rounds, however, it will still be roughly twice as slow as the fastest dedicated non-cryptographic PRNG, XorShift128+ (an LFSR-based PRNG that runs at 0.48 cycles per byte on Kaby Lake).
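As a rough check of the "twice as slow" estimate, using only the figures quoted in this answer and assuming the cost scales linearly with the number of rounds (the two measurements are on different CPUs, so this is only indicative):

$$\frac{1.88}{2} \approx 0.94 \text{ cycles per byte at four rounds}, \qquad \frac{0.94}{0.48} \approx 1.96$$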

edited Jan 21 at 11:08

answered Jan 21 at 11:03

forest

4,350

Some other non-cryptographic algorithms are certainly faster, but they have a smaller state (I need room for a SHA256 hash) and I need streams. There doesn't seem to be much choice other than crypto algorithms if you have these requirements.
– Thorham
Jan 21 at 12:00

Is 0.48 c/b for a scalar implementation? You can run four XorShift128+ PRNGs in parallel in elements of an AVX2 vector. See AVX/SSE version of xorshift128+ for __m256i xorshift128plus_avx2(struct rngstate256 *sp). 8 SIMD ALU uops per 32 bytes of results => about 12 bytes per cycle, or 0.0833 c/b on SKL / KBL. (I used it in my answer on What's the fastest way to generate a 1 GB text file containing random digits? which does > 8 bytes per cycle of space-separated ASCII decimal digits on SKL.)
– Peter Cordes
Jan 21 at 14:52
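The arithmetic behind that estimate can be reproduced if one assumes three SIMD ALU uops can execute per cycle (an assumption not stated in the comment itself):

$$\frac{32 \text{ bytes}}{8 \text{ uops} \div 3 \text{ uops per cycle}} = 12 \text{ bytes per cycle}, \qquad \frac{1}{12} \approx 0.0833 \text{ cycles per byte}$$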

@Thorham: would it work to use a SHA256 hash as the seed for two XorShift128+ PRNGs operating in parallel? If so, 2x 128-bit SIMD vectors will work, and let you generate 2x 64-bit random numbers in parallel. Or use 256-bit vectors to run 4 generators in parallel, requiring twice as much seed data. See my previous comment for C++ and C intrinsics implementations.
– Peter Cordes
Jan 21 at 14:58
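A scalar sketch of the seeding idea in the comment above, for illustration only: a SHA-256 digest is split into four 64-bit words, and each pair seeds one xorshift128+ generator. The shift constants 23, 18 and 5 are one published parameter set and are an assumption here, as are the big-endian word split and the function names; a real SIMD version would keep the two states in vector lanes instead.

```python
import hashlib

MASK64 = (1 << 64) - 1


def seed_words(data):
    """Split SHA-256(data) into four 64-bit words (big-endian split is an arbitrary choice)."""
    digest = hashlib.sha256(data).digest()
    return [int.from_bytes(digest[i:i + 8], "big") for i in range(0, 32, 8)]


def xorshift128plus_step(state):
    """One xorshift128+ step; returns (output, new_state). Constants 23/18/5 are assumed."""
    s1, s0 = state
    result = (s0 + s1) & MASK64
    s1 ^= (s1 << 23) & MASK64
    s1 ^= s0 ^ (s1 >> 18) ^ (s0 >> 5)
    return result, (s0, s1)


def paired_generators(data):
    """Yield pairs of 64-bit outputs from two generators seeded by one digest."""
    w = seed_words(data)
    a, b = (w[0], w[1]), (w[2], w[3])
    if a == (0, 0) or b == (0, 0):
        # An all-zero xorshift state never leaves zero, so reject it.
        raise ValueError("xorshift128+ state must not be all zero")
    while True:
        out_a, a = xorshift128plus_step(a)
        out_b, b = xorshift128plus_step(b)
        yield out_a, out_b


gen = paired_generators(b"example seed material")
first_pair = next(gen)
print([hex(v) for v in first_pair])
```

Each generator only ever sees 128 bits of the digest, which is the limitation raised in the next comment.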

@PeterCordes, Thorham, xoshiro256**/+ are available too. (They use rotate, not just xor-shifts. May not be as suitable for vector implementations.) Two instances of a 128-bit algorithm seeded like that isn't too different from just truncating SHA-256 output to 128 bits.
– Future Security
Jan 21 at 16:09

@FutureSecurity: SSE2 / AVX2 xoshiro256** looks very possible. AVX512 even has SIMD rotates, making it even better. Other SIMD ISAs can emulate it with shift+shift+OR. SIMD integer multiply is not bad for 32-bit integers on Intel CPUs with SSE4.1, but requires extended-precision techniques for 64-bit integer elements (until AVX512), which is why I used xorshift+ instead of *. But xoshiro256** only multiplies by the constants *5 and *9, which are both power-of-2 + 1 so are just left-shift+add. (In a scalar implementation, x86 can do that in one cycle with lea rax, [rbx + rbx*8].)
– Peter Cordes
Jan 21 at 16:24
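A scalar sketch of the two tricks described in the comment above, assuming the usual xoshiro256** update (rotations by 7 and 45, shift by 17). The rotate is built from shift+shift+OR, and the multiplications by 5 and 9 are written as left-shift plus add. The helper names are illustrative; a vector implementation would apply the same operations per 64-bit lane.

```python
MASK64 = (1 << 64) - 1


def rotl64(x, k):
    """Rotate left emulated with shift + shift + OR."""
    return ((x << k) | (x >> (64 - k))) & MASK64


def times5(x):
    """x * 5 as (x << 2) + x, reduced to 64 bits."""
    return ((x << 2) + x) & MASK64


def times9(x):
    """x * 9 as (x << 3) + x, reduced to 64 bits (what lea rax, [rbx + rbx*8] computes)."""
    return ((x << 3) + x) & MASK64


def xoshiro256ss_next(s):
    """Advance the four-word state list s in place and return one 64-bit output."""
    result = times9(rotl64(times5(s[1]), 7))
    t = (s[1] << 17) & MASK64
    s[2] ^= s[0]
    s[3] ^= s[1]
    s[1] ^= s[2]
    s[0] ^= s[3]
    s[2] ^= t
    s[3] = rotl64(s[3], 45)
    return result


state = [1, 2, 3, 4]          # arbitrary non-zero demo state
outputs = [xoshiro256ss_next(state) for _ in range(3)]
print(outputs)
```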

