Is this lock-free dlist insertion safe?



























I need to implement lock-free insertion of a sub-list at the head of a doubly-linked list. The list has a dummy head node, so each thread tries to insert its sub-list right after the head node. This design seems OK to me; however, I don't have enough expertise to prove it.



struct Node {
    std::atomic<Node*> next;
    std::atomic<Node*> prev;
};
Node head;

// concurrently insert `first`..`last` sublist after `head`
void insertSublist(Node* first, Node* last) {
    first->prev = &head;
    Node* current_next = head.next;

    while (true) {
        last->next = current_next;
        if (head.next.compare_exchange_weak(current_next, first)) {
            current_next->prev = last;
            return;
        }
    }
}


I need to validate this design under these circumstances:



Variant 1



No list removal is performed, all threads just do insertion in a loop.



Variant 2



There is one thread that removes nodes from the list in random order, but it never removes a node that is right after the head node.



for (auto* node : nodes_to_be_removed) {
    if (node->prev == &head)
        continue;
    // perform removal
}
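To make Variant 2 concrete, here is a hedged single-threaded sketch of what the remover might do; `removeNode` and `removeAll` are illustrative names, not part of the question, and the unlink is only meant to run for fully-inserted nodes:

```cpp
#include <atomic>
#include <cassert>
#include <vector>

struct Node {
    std::atomic<Node*> next{nullptr};
    std::atomic<Node*> prev{nullptr};
};

// Hypothetical unlink of one fully-inserted node (single remover thread).
inline void removeNode(Node* node) {
    Node* p = node->prev.load();
    Node* n = node->next.load();
    p->next.store(n);   // bypass `node` in the forward list
    n->prev.store(p);   // bypass `node` in the backward list
}

// The remover's worklist loop: skip anything still adjacent to the dummy head.
inline void removeAll(const std::vector<Node*>& nodes_to_be_removed, Node& head) {
    for (auto* node : nodes_to_be_removed) {
        if (node->prev.load() == &head)
            continue;   // never remove the node right after the head
        removeNode(node);
    }
}
```

The skip condition doubles as the "is insertion finished?" test discussed below, since a not-yet-back-linked node still has its prev pointing at head.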


When insertion is done, node->prev is the last link to be changed. So after it is changed, no thread other than the remover can access the node or its predecessor's next link.
Is this reasoning valid, or am I missing something?





Some clarifications after @peter-cordes' answer.




  • The list is not linearly traversable, so inconsistent list state is not an issue from this point of view.



  • If you remove a node that an inserter was going to modify (to add the backward link) but hasn't yet




    I expect that the check node->prev == &head prevents this case. Is it true?



  • Are removals safe under these conditions?


    • There is only 1 remover thread

    • Remover has a separate worklist of nodes for deletion

    • A node can be added to the worklist only after its insertion stage is completely finished












  • "but it will never remove a node that is right after the head node" - and if it is the only node in the list? – Aconcagua, Jan 2 at 10:53

  • There's also a dummy tail node, so this is not an issue. – Viacheslav Kroilov, Jan 2 at 11:23

  • "I expect that the check node->prev == &head prevents this case. Is it true?" Oh, yeah, I think so. The not-yet-modified elements all have their prev pointers pointing at head. That means remove fails for partially-inserted elements, and we're definitely limited to only ever inserting at the head. (Otherwise we lose that way of avoiding pending modifications.) – Peter Cordes, Jan 2 at 14:56
















c++ concurrency linked-list atomic lock-free






asked Jan 2 at 9:56, edited Jan 2 at 14:51 – Viacheslav Kroilov
1 Answer
































TL:DR: inserts alone are sort of ok depending on what readers do (no long-term corruption), but removals are probably impossible without locking or much more sophistication, and definitely a showstopper for this simple insert algorithm.





It's a doubly-linked list, so an insert unavoidably requires modifying two memory locations that other threads can already see: head.next, and the .prev pointer in the old first node. This can't be done atomically and locklessly unless the hardware has DCAS (double-CAS: CAS on two separate, non-contiguous locations at once). As the Wikipedia article says, DCAS makes lockless doubly-linked lists easy.



m68k had DCAS at one point, but no current mainstream CPU architecture has it. ISO C++11 doesn't expose a DCAS operation through std::atomic because it can't be emulated on hardware that lacks it without making all atomic<T> non-lock-free. The exception is hardware with transactional memory, available on some recent x86 CPUs from Intel (e.g. Broadwell and later), but not AMD. There has been some work on adding syntax for TM to C++; see https://en.cppreference.com/w/cpp/language/transactional_memory





Of course, it's also impossible for an observer to read two locations at once, atomically, without transactional memory or something like DCAS. So any thread that reads the list needs to expect it to be changing out from under it, especially if the list is also supposed to support removals.



Setting up the pointers within the new nodes (not yet published to other threads) before publishing is obviously good, and you're doing that. first->prev and last->next are both set correctly before a CAS attempt to publish these new nodes. The CAS has sequential-consistency memory ordering, so it makes sure those previous stores are visible to other threads before it itself is. (So those "private" stores might as well be std::memory_order_relaxed for efficiency).



Your choice of modifying the old-first's .prev pointer after modifying head makes a lot of sense. You're essentially publishing first in the forward direction, then in the reverse direction. But remember that it's possible for a thread to sleep for a long time at any point, so it's not 100% safe to assume this will always be a momentary inconsistency. Imagine stopping one thread in the debugger at any point inside this function, and even single-stepping, while other threads run. In this case there are only 2 interesting operations, the CAS and the unconditional store to the old first non-dummy node.



If a thread was traversing forward and depending on being able to go back by following a .prev (instead of remembering its own previous in a local variable), this might look to it like the new nodes had been deleted again. It can find a .prev pointing to head. This is a contrived example because it would normally be more efficient to simply remember your previous node if you want to be able to find it again, especially in a lockless list. But perhaps there are non-contrived cases, like maybe one thread traversing forward and another traversing backward, and maybe interacting directly or indirectly, where an inconsistency would be visible.





As long as all threads agree on which order to modify things, I think insertion itself is safe. Doing it only at the head makes it easier to verify, but I think allowing arbitrary insertion points is still safe.



Your current code looks safe for simultaneous insertions (assuming no removals). The forward list can be longer than the backward list (with potentially multiple insertions into the backward list outstanding), but once they all complete the list will be consistent.



Without removal, each pending write to a .prev has a valid destination, and that destination is a node that no other thread wants to write to. Lockless singly-linked list insertion is easy, and the forward list (the .next links) is always up to date.
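For contrast, single-direction lockless insertion needs only one CAS on one shared location. A minimal lock-free stack push (a sketch, not code from the question) shows why the forward chain stays consistent at every instant:

```cpp
#include <atomic>
#include <cassert>

struct SNode {
    SNode* next{nullptr};   // only written before publication, so plain is fine
};

std::atomic<SNode*> top{nullptr};

// Classic lock-free push: the single CAS on `top` both publishes the node
// and links it into the chain in one atomic step.
void push(SNode* n) {
    SNode* old_top = top.load(std::memory_order_relaxed);
    do {
        n->next = old_top;   // private to this thread until the CAS succeeds
    } while (!top.compare_exchange_weak(old_top, n,
                  std::memory_order_release, std::memory_order_relaxed));
}
```

The doubly-linked case is hard precisely because there is no single location whose CAS can publish both directions at once.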



So each insert operation "claims" the node that it uses as the insertion point into the reverse list, when its store to current_next->prev becomes visible.





A do{}while(!CAS()) loop is a nice idiom, usually simplifying the code. I weakened the memory ordering of the other operations, especially the private ones on first and last, because requiring the compiler to use slow barriers after stores to elements other threads can't see yet is inefficient. On x86, release stores are "free" (no extra barriers), while seq_cst stores are a lot more expensive. (A seq_cst store on x86 costs about the same as an atomic read-modify-write, in the uncontended case.)



// no change in logic from yours, just minimized memory ordering
// and a simplified loop structure
void insertSublist(Node* first, Node* last)
{
    first->prev.store(&head, std::memory_order_relaxed);
    Node* current_next = head.next.load(std::memory_order_relaxed);

    do {
        // current_next is set ahead of the first iteration, and updated on CAS failure
        last->next.store(current_next, std::memory_order_relaxed);
    } while (!head.next.compare_exchange_weak(current_next, first));
    // acq_rel CAS should work, but leave it as seq_cst just to be sure.  No slower on x86.

    current_next->prev.store(last, std::memory_order_release);  // not visible before the CAS
}


This compiles for x86 with zero mfence instructions instead of 3 for yours, on the Godbolt compiler explorer. (The rest of the asm is literally identical, including the one lock cmpxchg.) So in the uncontended no-RFO case (e.g. repeated insertions from the same core), it might be something like 4x faster, or better, because mfence is actually even slower than a lock prefix on Intel CPUs.



Plus, the do{}while(!CAS) with the final store fully outside the loop is arguably easier for humans to read and see the logic right away.





Removals are a massive complication



I don't see how you could safely remove while you have pending insertions. If you remove a node that an inserter was going to modify (to add the backward link) but hasn't yet, that range of nodes will be forever missing from the backward list.



(Plus if you recycle the memory for that Node, the store by the inserter then steps on something.)



It will make the forward and backward lists out of sync. I don't see a way to solve this without DCAS (or transactional memory, which is a superset of DCAS). I'm not a lockless dlist expert, though, so maybe there's a trick.



Even multiple simultaneous removers are probably a problem, because you can end up with pending modifications to a node that another thread is going to remove (or already has removed), or multiple pending modifications to one node with no way to make sure the right one finishes last.



If you had an inserters/removers lock (multiple inserters / single remover, exactly like a readers/writer lock), you could make sure there were no pending inserts when doing a removal, while still allowing lockless inserts. Maybe put the lock in the same cache line as head, because inserting threads will always need to modify both it and head. Or maybe not, because you might just end up with more contention for that line if cores sometimes lose ownership of it after taking the lock but before committing their modification to head.
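One hedged sketch of that inserters/removers lock using C++17 std::shared_mutex: inserters take the shared side, so they still run concurrently and keep the lockless CAS, while the single remover takes the exclusive side, guaranteeing no .prev store is in flight while it unlinks. Names like `list_guard` and `guardedRemove` are assumptions for illustration, not from the question:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <shared_mutex>

struct Node {
    std::atomic<Node*> next{nullptr};
    std::atomic<Node*> prev{nullptr};
};

Node head;
std::shared_mutex list_guard;   // shared = inserters, exclusive = remover

void insertSublist(Node* first, Node* last) {
    std::shared_lock<std::shared_mutex> lock(list_guard);  // many inserters at once
    first->prev.store(&head, std::memory_order_relaxed);
    Node* current_next = head.next.load(std::memory_order_relaxed);
    do {
        last->next.store(current_next, std::memory_order_relaxed);
    } while (!head.next.compare_exchange_weak(current_next, first));
    current_next->prev.store(last, std::memory_order_release);
}

void guardedRemove(Node* node) {
    std::unique_lock<std::shared_mutex> lock(list_guard);  // waits out in-flight inserts
    Node* p = node->prev.load(std::memory_order_relaxed);
    Node* n = node->next.load(std::memory_order_relaxed);
    p->next.store(n, std::memory_order_relaxed);
    n->prev.store(p, std::memory_order_relaxed);
}
```

The insert path is unchanged except for the cheap shared-lock acquisition; only the rare remove pays for exclusion.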






  • First of all, thank you for taking the time to write such a great answer. I've clarified some points regarding removal. Can you assert that with these rules the removal is harmless? – Viacheslav Kroilov, Jan 2 at 14:54











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54004268%2fis-this-lock-free-dlist-insertion-is-safe%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









4














TL:DR: inserts alone are sort of ok depending on what readers do (no long-term corruption), but removals are probably impossible without locking or much more sophistication, and definitely a showstopper for this simple insert algorithm.





It's a double-linked list, so an insert unavoidably requires modifying two memory locations that other threads can already see: head.next, and the .prev pointer in the old first node. This can't be done atomically+locklessly unless the hardware has a DCAS (double-CAS, two separate non-contiguous locations at once). As the wikipedia article says, it makes lockless double-linked lists easy.



m68k had DCAS at one point, but no current mainstream CPU architecture has it. ISO C++11 doesn't expose a DCAS operation through std::atomic because you can't emulate it on HW that doesn't have it without making all atomic<T> non-lock-free. Except on hardware with transactional memory, available on some recent x86 CPUs from Intel (e.g. Broadwell and later), but not AMD. There has been some work on adding syntax for TM to C++, see https://en.cppreference.com/w/cpp/language/transactional_memory





Of course, it's also impossible for an observe to observe two locations at once, atomically, without transactional memory or something like DCAS. So any threads that read the list need to expect it to be changing out from under them, especially if the list is also supposed to support removals.



Setting up the pointers within the new nodes (not yet published to other threads) before publishing is obviously good, and you're doing that. first->prev and last->next are both set correctly before a CAS attempt to publish these new nodes. The CAS has sequential-consistency memory ordering, so it makes sure those previous stores are visible to other threads before it itself is. (So those "private" stores might as well be std::memory_order_relaxed for efficiency).



Your choice of modifying the old-first's .prev pointer after modifying head makes a lot of sense. You're essentially publishing first in the forward direction, then in the reverse direction. But remember that it's possible for a thread to sleep for a long time at any point, so it's not 100% safe to assume this will always be a momentary inconsistency. Imagine stopping one thread in the debugger at any point inside this function, and even single-stepping, while other threads run. In this case there are only 2 interesting operations, the CAS and the unconditional store to the old first non-dummy node.



If a thread was traversing forward and depending on being able to go back by following a .prev (instead of remembering its own previous in a local variable), this might look to it like the new nodes had been deleted again. It can find a .prev pointing to head. This is a contrived example because it would normally be more efficient to simply remember your previous node if you want to be able to find it again, especially in a lockless list. But perhaps there are non-contrived cases, like maybe one thread traversing forward and another traversing backward, and maybe interacting directly or indirectly, where an inconsistency would be visible.





As long as all threads agree on which order to modify things, I think insertion itself is safe. Doing it only at the head makes it easier to verify, but I think allowing arbitrary insertion points is still safe.



Your current code looks safe for simultaneous insertions (assuming no removals). The forward list can be longer than the backward list (with potentially multiple insertions into the backward list outstanding), but once they all complete the list will be consistent.



Without removal, each pending write to a .prev has a valid destination, and that destination is a node that no other thread wants to write to. lockless singly-linked list insertion is easy, and the forward list (the .next links) are always up to date.



So each insert operation "claims" a node that it use as the insertion point into the reverse list, when its store to current_next->prev becomes visible.





A do{}while(!CAS()) loop is a nice idiom, usually simplifying the code. I weakened the memory-ordering of the other operations, especially the private ones on first and last, because requiring the compiler to use slow barriers after stores to elements other threads can't see yet is inefficient. On x86, release-stores are "free" (no extra barriers), while seq-cst stores are a lost more expensive. (A seq-cst store on x86 has about the same cost as an atomic read-modify-write, in the uncontended case).



// no change in logic to yours, just minimize memory ordering
// and simplify the loop structure.
void insertSublist(Node* first, Node* last)
{
first->prev.store(&head, std::memory_order_relaxed);
Node* current_next = head.next.load(std::memory_order_relaxed);

do {
// current_next set ahead of first iter, and updated on CAS failure
last->next.store(current_next, std::memory_order_relaxed);
}while (!head.next.compare_exchange_weak(current_next, first));
// acq_rel CAS should work, but leave it as seq_cst just to be sure. No slower on x86

current_next->prev.store(last, std::memory_order_release); // not visible before CAS
}


This compiles for x86 with zero mfence instructions instead of 3 for yours, on the Godbolt compiler explorer. (The rest of the asm is literally identical, including the one lock cmpxchg.) So in the uncontended no-RFO case (e.g. repeated insertions from the same core), it might be something like 4x faster. Or better because mfence is actually even slower than a lock prefix on Intel CPUs.



Plus, the do{}while(!CAS) with the final store fully outside the loop is arguably easier for humans to read and see the logic right away.





Removals are a massive complication



I don't see how you could safely remove while you have pending insertions. If you remove a node that an inserter was going to modify (to add the backward link) but hasn't yet, that range of nodes will be forever missing from the backward list.



(Plus if you recycle the memory for that Node, the store by the inserter then steps on something.)



It will make the forward and backward lists out of sync. I don't see a way to solve this without DCAS (or transactional memory, which is a superset of DCAS). I'm not a lockless dlist expert, though, so maybe there's a trick.



Probably even multiple simultaneous removers is a problem, because you can end up with pending modifications to a node that another thread is going to (or already has) removed. Or multiple pending modifications to one node, with no way to make sure that the right one finishes last.



If you had an inserters/removers lock (multiple inserters / single remover, exactly like a readers/writer lock), you could make sure there were no pending inserts when doing a removal. But still allow lockless inserts. Maybe put the lock in the same cache line as head, because inserting threads will always need to modify both it and head. Or maybe not because you might just end up with more contention for that line if cores sometimes lose ownership of the line after taking the lock but before committing their modification to head.






share|improve this answer
























  • First of all, thank you for your valuable time to write such a great answer. I've clarified some points regarding the removal. Can you assert that with these rules the removal is harmless?

    – Viacheslav Kroilov
    Jan 2 at 14:54
















4














TL:DR: inserts alone are sort of ok depending on what readers do (no long-term corruption), but removals are probably impossible without locking or much more sophistication, and definitely a showstopper for this simple insert algorithm.





It's a double-linked list, so an insert unavoidably requires modifying two memory locations that other threads can already see: head.next, and the .prev pointer in the old first node. This can't be done atomically+locklessly unless the hardware has a DCAS (double-CAS, two separate non-contiguous locations at once). As the wikipedia article says, it makes lockless double-linked lists easy.



m68k had DCAS at one point, but no current mainstream CPU architecture has it. ISO C++11 doesn't expose a DCAS operation through std::atomic because you can't emulate it on HW that doesn't have it without making all atomic<T> non-lock-free. Except on hardware with transactional memory, available on some recent x86 CPUs from Intel (e.g. Broadwell and later), but not AMD. There has been some work on adding syntax for TM to C++, see https://en.cppreference.com/w/cpp/language/transactional_memory





Of course, it's also impossible for an observe to observe two locations at once, atomically, without transactional memory or something like DCAS. So any threads that read the list need to expect it to be changing out from under them, especially if the list is also supposed to support removals.



Setting up the pointers within the new nodes (not yet published to other threads) before publishing is obviously good, and you're doing that. first->prev and last->next are both set correctly before a CAS attempt to publish these new nodes. The CAS has sequential-consistency memory ordering, so it makes sure those previous stores are visible to other threads before it itself is. (So those "private" stores might as well be std::memory_order_relaxed for efficiency).



Your choice of modifying the old-first's .prev pointer after modifying head makes a lot of sense. You're essentially publishing first in the forward direction, then in the reverse direction. But remember that it's possible for a thread to sleep for a long time at any point, so it's not 100% safe to assume this will always be a momentary inconsistency. Imagine stopping one thread in the debugger at any point inside this function, and even single-stepping, while other threads run. In this case there are only 2 interesting operations, the CAS and the unconditional store to the old first non-dummy node.



If a thread was traversing forward and depending on being able to go back by following a .prev (instead of remembering its own previous in a local variable), this might look to it like the new nodes had been deleted again. It can find a .prev pointing to head. This is a contrived example because it would normally be more efficient to simply remember your previous node if you want to be able to find it again, especially in a lockless list. But perhaps there are non-contrived cases, like maybe one thread traversing forward and another traversing backward, and maybe interacting directly or indirectly, where an inconsistency would be visible.





As long as all threads agree on which order to modify things, I think insertion itself is safe. Doing it only at the head makes it easier to verify, but I think allowing arbitrary insertion points is still safe.
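To make the "arbitrary insertion points" claim concrete, here is a hypothetical generalization of the question's function: the same publish order (forward link via CAS, then backward link), but CASing `pos->next` instead of `head.next`. It assumes `pos` itself is never removed concurrently; `insertSublistAfter` is an illustrative name.

```cpp
#include <atomic>
#include <cassert>

struct Node {
    std::atomic<Node*> next{nullptr};
    std::atomic<Node*> prev{nullptr};
};

// Insert the first..last sublist right after an arbitrary node `pos`,
// using the same forward-then-backward publish order as insertion at head.
void insertSublistAfter(Node* pos, Node* first, Node* last) {
    first->prev.store(pos, std::memory_order_relaxed);
    Node* current_next = pos->next.load(std::memory_order_relaxed);
    do {
        last->next.store(current_next, std::memory_order_relaxed);
    } while (!pos->next.compare_exchange_weak(current_next, first));
    if (current_next)   // pos may have been the tail of the list
        current_next->prev.store(last, std::memory_order_release);
}
```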



Your current code looks safe for simultaneous insertions (assuming no removals). The forward list can be longer than the backward list (with potentially multiple insertions into the backward list outstanding), but once they all complete the list will be consistent.



Without removal, each pending write to a .prev has a valid destination, and that destination is a node that no other thread wants to write to. Lockless singly-linked list insertion is easy, and the forward list (the .next links) is always up to date.
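For contrast, here is a minimal single-CAS push onto a singly-linked list (a classic lock-free stack-style push; names are illustrative). Only one shared location is ever CASed, so there is never a second pending store that could leave the structure inconsistent:

```cpp
#include <atomic>
#include <cassert>

struct SNode {
    std::atomic<SNode*> next{nullptr};
    int value{0};
};

// Lock-free push: the only shared write is the CAS on head_next itself,
// which is why the forward direction alone needs no extra machinery.
void push(std::atomic<SNode*>& head_next, SNode* node) {
    SNode* old = head_next.load(std::memory_order_relaxed);
    do {
        node->next.store(old, std::memory_order_relaxed);  // private until CAS
    } while (!head_next.compare_exchange_weak(
                 old, node,
                 std::memory_order_release, std::memory_order_relaxed));
}
```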



So each insert operation "claims" a node that it uses as the insertion point into the reverse list, when its store to current_next->prev becomes visible.





A do{}while(!CAS()) loop is a nice idiom, usually simplifying the code. I weakened the memory-ordering of the other operations, especially the private ones on first and last, because requiring the compiler to use slow barriers after stores to elements other threads can't see yet is inefficient. On x86, release-stores are "free" (no extra barriers), while seq-cst stores are a lot more expensive. (A seq-cst store on x86 has about the same cost as an atomic read-modify-write, in the uncontended case).



// no change in logic to yours, just minimize memory ordering
// and simplify the loop structure.
void insertSublist(Node* first, Node* last)
{
    first->prev.store(&head, std::memory_order_relaxed);
    Node* current_next = head.next.load(std::memory_order_relaxed);

    do {
        // current_next set ahead of first iter, and updated on CAS failure
        last->next.store(current_next, std::memory_order_relaxed);
    } while (!head.next.compare_exchange_weak(current_next, first));
    // acq_rel CAS should work, but leave it as seq_cst just to be sure. No slower on x86

    current_next->prev.store(last, std::memory_order_release); // not visible before CAS
}


This compiles for x86 with zero mfence instructions instead of 3 for yours, on the Godbolt compiler explorer. (The rest of the asm is literally identical, including the one lock cmpxchg.) So in the uncontended no-RFO case (e.g. repeated insertions from the same core), it might be something like 4x faster. Or better because mfence is actually even slower than a lock prefix on Intel CPUs.



Plus, the do{}while(!CAS) with the final store fully outside the loop is arguably easier for humans to read and see the logic right away.





Removals are a massive complication



I don't see how you could safely remove while you have pending insertions. If you remove a node that an inserter was going to modify (to add the backward link) but hasn't yet, that range of nodes will be forever missing from the backward list.



(Plus if you recycle the memory for that Node, the store by the inserter then steps on something.)



It will make the forward and backward lists out of sync. I don't see a way to solve this without DCAS (or transactional memory, which is a superset of DCAS). I'm not a lockless dlist expert, though, so maybe there's a trick.



Probably even multiple simultaneous removers is a problem, because you can end up with pending modifications to a node that another thread is going to (or already has) removed. Or multiple pending modifications to one node, with no way to make sure that the right one finishes last.



If you had an inserters/removers lock (multiple inserters / single remover, exactly like a readers/writer lock), you could make sure there were no pending inserts when doing a removal. But still allow lockless inserts. Maybe put the lock in the same cache line as head, because inserting threads will always need to modify both it and head. Or maybe not because you might just end up with more contention for that line if cores sometimes lose ownership of the line after taking the lock but before committing their modification to head.
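One way to sketch that inserters/removers lock in standard C++ is std::shared_mutex used in the reverse of its usual role: inserters take the shared side, so they stay lockless with respect to each other, while the single remover takes the exclusive side, guaranteeing no insert's .prev store is still pending mid-removal. This is an illustrative sketch under those assumptions, not a tuned implementation; the null check before the .prev store is an addition to handle an initially empty list.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <shared_mutex>

struct Node {
    std::atomic<Node*> next{nullptr};
    std::atomic<Node*> prev{nullptr};
};

struct DList {
    Node head;
    std::shared_mutex insert_gate;  // shared = inserting, exclusive = removing

    void insertSublist(Node* first, Node* last) {
        std::shared_lock<std::shared_mutex> g(insert_gate);  // many inserters at once
        first->prev.store(&head, std::memory_order_relaxed);
        Node* current_next = head.next.load(std::memory_order_relaxed);
        do {
            last->next.store(current_next, std::memory_order_relaxed);
        } while (!head.next.compare_exchange_weak(current_next, first));
        if (current_next)  // guard: the list may have been empty
            current_next->prev.store(last, std::memory_order_release);
    }

    void remove(Node* node) {
        std::unique_lock<std::shared_mutex> g(insert_gate);  // excludes all inserters
        Node* p = node->prev.load(std::memory_order_relaxed);
        Node* n = node->next.load(std::memory_order_relaxed);
        p->next.store(n, std::memory_order_relaxed);
        if (n)
            n->prev.store(p, std::memory_order_relaxed);
    }
};
```

With the exclusive lock held, the remover knows both lists are fully consistent, so plain unlinking is enough; inserts still contend only on the CAS.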






























  • First of all, thank you for your valuable time to write such a great answer. I've clarified some points regarding the removal. Can you assert that with these rules the removal is harmless?

    – Viacheslav Kroilov
    Jan 2 at 14:54















answered Jan 2 at 12:14

Peter Cordes
132k18201338












