Huffman Coding Using Doubly Linked List












2















I am developing an implementation of Huffman Compression in x64 MASM. My main function in C syntax would be:



void* huffCompress(void* lpDataStream, unsigned long long qwLength);


I am using Microsoft's Fastcall convention where the paramters are passing to the called function using RCX, RDX, R8 & R9 (anything else is on the stack).



I have written code which generates an array of size 256*3. Each 3 byte consists of:



as STRUCT
word wFreq
byte bSymbol
as ENDS


The code then iterates over the lpDataStream twice.




  1. Initialising bSymbol with the index in the array


  2. Incrementing wFreq for each byte



In the code below RBX is a pointer to the base of the array, RSI is a pointer to the lpDataStream and RCX is qwLength. RAX needs to be 0 before entering the below code.



_populateArray:
lodsb
lea rdx, qword ptr[rbx+rax*04h]
sub rdx, rax
inc word ptr[rdx]
loop _populateArray


The function then iterates over the array and for each element which has a wFreq which is not zero it subtracts 13h (19 bytes) from the stack. These 19 bytes are then populated as per the following structure.



nl STRUCT
qword pFLink
qword pBLink
dword dwFreq
byte bSymbol
nl ENDS


The first two qwords are forward and backward links to the next and previous elements in the linked list.



xor rax, rax
mov r10, rsp
mov rcx, rax ;rcx = pFLink
mov rdx, rax ;rdx = pBLink
_generateLinkedList:
cmp word ptr[rbx+rax], 0000h
je _notInitialised
sub rsp, 13h
;write and setup pBLink for next entry
mov qword ptr[rsp+pBLink], rdx
mov rdx, rsp
mov r8w, word ptr[rbx+rax]
mov r9b, byte ptr[rbx+rax+02h]
mov word ptr[rsp+wFreq], r8w ;write dFreq to linked list
mov byte ptr[rsp+bSymbol], r9b ;write bSymbol to linked list
cmp rcx, 00h ;if pFLink is not initialised it means this is the first entry
je _firstEntry
mov qword ptr[rsp+13h], rsp ;when rcx != 00h it means that there is an entry before this one "before" on stack remember
_firstEntry:
mov rcx, rsp
_notInitialised:
add rax, 03h
cmp rax, 300h
jna _generateLinkedList


The above code works as desired, I have added it for completeness. RBX is still pointing to the array.



The function then runs bubble sort on the resulting linked list, sorting in ascending order. This means the element with no pBLink (no backward link) has the lowest wFreq. It is not a circular doubly linked list.



The next step of implementing Huffman Compression is to create a node with a wFreq equal to the sum of the two lowest frequencies.



My plan for this was to do the following:




  1. Add the two lowest wFreq values together


  2. Make pFLink in the lowest element NULL


  3. Make both pFLink and pBLink is the second lowest (pHead->pFLink) NULL


  4. Make more stack space (19 bytes) and add a node to the end (this involves finding the element with a NULL pFLink and changing it to the new node. Additionally the new node has the wFreq of the two lowest added together.



This is where the problem is. I need to make the pFLink and pBLink of the new node point to the two lowest elements (The nodes which had their pointers NULLED in step 2 & 3. However if I do this then the new node cannot be connect to the linked list.



EDIT: I think the above option would work, I would have to rework the sorting algorithm so that when a new node is added and needs to be sorted it is inserted inbetween elements. The code would realise it has found a "new node" because the pBLink wouldn't be correct. It would then know the next element is directly "above" (19 bytes above the "new node"). I think this could be done during the new node creation as the new node doesn't need to be placed in the new space on the stack it can be swapped with the element which is at it's desired place. It's a dirty workaround, if there is a better way I would love to hear it.



I had the idea that when the sorting function finds a node where the pBLink doesn't point to the previous element it should just move up the next element by subtracting 13h from it's pointer. This didn't work because the new nodes (those created by adding the two lowest frequencies together) could be anywhere after sorting.



Is there any way to overcome this without adding two more qwords to the nl structure?










share|improve this question

























  • Why you need to make the pFLink and pBLink of the new node point to the two lowest elements? A node in a linked list doesn't even need a pBLink unless you want to efficiently move backwards. Are you sure a linked list is what you need? a node cannot be in between two elements and at the end of the list at the same time. Maybe you need two lists?

    – Margaret Bloom
    Jan 1 at 21:14











  • @MargaretBloom I am reusing pFLink and pBlink to point to leaves of a node in the Huffman Tree. The code converts a linked list to a Huffman Tree. I included a pBLink because I need two pointers for the tree anyway so I may aswell use the second as a pBLink. I may traverse the linked list backwards when checking where a new node should be placed due to the fact that it's more likely the new node will need to be somewhere at the top (due to having a higher frequency). A nodes memory address and position in the linked list are independent.

    – Will
    Jan 1 at 21:23











  • Are you optimizing for code-size at the expense of speed? If not, avoid the slow loop instruction unless you're only tuning for AMD CPUs. Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?. Also, lodsb is 3 uops, vs. 2 for movzx / inc rsi which avoids a false dependency on the old RAX. You can also do *3 as an index like lea edx, [rax+rax*2] / add [rbx + rdx], 1. (But an indexed add [mem], 1 will unlaminate into more uops than add [rdx],1 on Sandybridge-family.)

    – Peter Cordes
    Jan 1 at 22:20











  • @PeterCordes Neither at the moment. I will probably optimise for code size later. I never though of using "lea edx, [rax+rax*2]" that's really nice, thanks!

    – Will
    Jan 1 at 22:23











  • If you haven't seen that multiply-by-3 idiom before, there's probably a lot you can learn from looking at optimized compiler output. Write some of your functions in C (with function args and return values for inputs/outputs), then look at gcc and clang -O3 (pure speed including autovectorization) or -Os (code size and speed) output on godbolt.org. How to remove "noise" from GCC/clang assembly output?. If you really want to optimize for code-size, see Tips for golfing in x86/x64 machine code

    – Peter Cordes
    Jan 1 at 22:27


















2















I am developing an implementation of Huffman Compression in x64 MASM. My main function in C syntax would be:



void* huffCompress(void* lpDataStream, unsigned long long qwLength);


I am using Microsoft's Fastcall convention where the paramters are passing to the called function using RCX, RDX, R8 & R9 (anything else is on the stack).



I have written code which generates an array of size 256*3. Each 3 byte consists of:



as STRUCT
word wFreq
byte bSymbol
as ENDS


The code then iterates over the lpDataStream twice.




  1. Initialising bSymbol with the index in the array


  2. Incrementing wFreq for each byte



In the code below RBX is a pointer to the base of the array, RSI is a pointer to the lpDataStream and RCX is qwLength. RAX needs to be 0 before entering the below code.



_populateArray:
lodsb
lea rdx, qword ptr[rbx+rax*04h]
sub rdx, rax
inc word ptr[rdx]
loop _populateArray


The function then iterates over the array and for each element which has a wFreq which is not zero it subtracts 13h (19 bytes) from the stack. These 19 bytes are then populated as per the following structure.



nl STRUCT
qword pFLink
qword pBLink
dword dwFreq
byte bSymbol
nl ENDS


The first two qwords are forward and backward links to the next and previous elements in the linked list.



xor rax, rax
mov r10, rsp
mov rcx, rax ;rcx = pFLink
mov rdx, rax ;rdx = pBLink
_generateLinkedList:
cmp word ptr[rbx+rax], 0000h
je _notInitialised
sub rsp, 13h
;write and setup pBLink for next entry
mov qword ptr[rsp+pBLink], rdx
mov rdx, rsp
mov r8w, word ptr[rbx+rax]
mov r9b, byte ptr[rbx+rax+02h]
mov word ptr[rsp+wFreq], r8w ;write dFreq to linked list
mov byte ptr[rsp+bSymbol], r9b ;write bSymbol to linked list
cmp rcx, 00h ;if pFLink is not initialised it means this is the first entry
je _firstEntry
mov qword ptr[rsp+13h], rsp ;when rcx != 00h it means that there is an entry before this one "before" on stack remember
_firstEntry:
mov rcx, rsp
_notInitialised:
add rax, 03h
cmp rax, 300h
jna _generateLinkedList


The above code works as desired, I have added it for completeness. RBX is still pointing to the array.



The function then runs bubble sort on the resulting linked list, sorting in ascending order. This means the element with no pBLink (no backward link) has the lowest wFreq. It is not a circular doubly linked list.



The next step of implementing Huffman Compression is to create a node with a wFreq equal to the sum of the two lowest frequencies.



My plan for this was to do the following:




  1. Add the two lowest wFreq values together


  2. Make pFLink in the lowest element NULL


  3. Make both pFLink and pBLink is the second lowest (pHead->pFLink) NULL


  4. Make more stack space (19 bytes) and add a node to the end (this involves finding the element with a NULL pFLink and changing it to the new node. Additionally the new node has the wFreq of the two lowest added together.



This is where the problem is. I need to make the pFLink and pBLink of the new node point to the two lowest elements (The nodes which had their pointers NULLED in step 2 & 3. However if I do this then the new node cannot be connect to the linked list.



EDIT: I think the above option would work, I would have to rework the sorting algorithm so that when a new node is added and needs to be sorted it is inserted inbetween elements. The code would realise it has found a "new node" because the pBLink wouldn't be correct. It would then know the next element is directly "above" (19 bytes above the "new node"). I think this could be done during the new node creation as the new node doesn't need to be placed in the new space on the stack it can be swapped with the element which is at it's desired place. It's a dirty workaround, if there is a better way I would love to hear it.



I had the idea that when the sorting function finds a node where the pBLink doesn't point to the previous element it should just move up the next element by subtracting 13h from it's pointer. This didn't work because the new nodes (those created by adding the two lowest frequencies together) could be anywhere after sorting.



Is there any way to overcome this without adding two more qwords to the nl structure?










share|improve this question

























  • Why you need to make the pFLink and pBLink of the new node point to the two lowest elements? A node in a linked list doesn't even need a pBLink unless you want to efficiently move backwards. Are you sure a linked list is what you need? a node cannot be in between two elements and at the end of the list at the same time. Maybe you need two lists?

    – Margaret Bloom
    Jan 1 at 21:14











  • @MargaretBloom I am reusing pFLink and pBlink to point to leaves of a node in the Huffman Tree. The code converts a linked list to a Huffman Tree. I included a pBLink because I need two pointers for the tree anyway so I may aswell use the second as a pBLink. I may traverse the linked list backwards when checking where a new node should be placed due to the fact that it's more likely the new node will need to be somewhere at the top (due to having a higher frequency). A nodes memory address and position in the linked list are independent.

    – Will
    Jan 1 at 21:23











  • Are you optimizing for code-size at the expense of speed? If not, avoid the slow loop instruction unless you're only tuning for AMD CPUs. Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?. Also, lodsb is 3 uops, vs. 2 for movzx / inc rsi which avoids a false dependency on the old RAX. You can also do *3 as an index like lea edx, [rax+rax*2] / add [rbx + rdx], 1. (But an indexed add [mem], 1 will unlaminate into more uops than add [rdx],1 on Sandybridge-family.)

    – Peter Cordes
    Jan 1 at 22:20











  • @PeterCordes Neither at the moment. I will probably optimise for code size later. I never though of using "lea edx, [rax+rax*2]" that's really nice, thanks!

    – Will
    Jan 1 at 22:23











  • If you haven't seen that multiply-by-3 idiom before, there's probably a lot you can learn from looking at optimized compiler output. Write some of your functions in C (with function args and return values for inputs/outputs), then look at gcc and clang -O3 (pure speed including autovectorization) or -Os (code size and speed) output on godbolt.org. How to remove "noise" from GCC/clang assembly output?. If you really want to optimize for code-size, see Tips for golfing in x86/x64 machine code

    – Peter Cordes
    Jan 1 at 22:27
















2












2








2








I am developing an implementation of Huffman Compression in x64 MASM. My main function in C syntax would be:



void* huffCompress(void* lpDataStream, unsigned long long qwLength);


I am using Microsoft's Fastcall convention where the paramters are passing to the called function using RCX, RDX, R8 & R9 (anything else is on the stack).



I have written code which generates an array of size 256*3. Each 3 byte consists of:



as STRUCT
word wFreq
byte bSymbol
as ENDS


The code then iterates over the lpDataStream twice.




  1. Initialising bSymbol with the index in the array


  2. Incrementing wFreq for each byte



In the code below RBX is a pointer to the base of the array, RSI is a pointer to the lpDataStream and RCX is qwLength. RAX needs to be 0 before entering the below code.



_populateArray:
lodsb
lea rdx, qword ptr[rbx+rax*04h]
sub rdx, rax
inc word ptr[rdx]
loop _populateArray


The function then iterates over the array and for each element which has a wFreq which is not zero it subtracts 13h (19 bytes) from the stack. These 19 bytes are then populated as per the following structure.



nl STRUCT
qword pFLink
qword pBLink
dword dwFreq
byte bSymbol
nl ENDS


The first two qwords are forward and backward links to the next and previous elements in the linked list.



xor rax, rax
mov r10, rsp
mov rcx, rax ;rcx = pFLink
mov rdx, rax ;rdx = pBLink
_generateLinkedList:
cmp word ptr[rbx+rax], 0000h
je _notInitialised
sub rsp, 13h
;write and setup pBLink for next entry
mov qword ptr[rsp+pBLink], rdx
mov rdx, rsp
mov r8w, word ptr[rbx+rax]
mov r9b, byte ptr[rbx+rax+02h]
mov word ptr[rsp+wFreq], r8w ;write dFreq to linked list
mov byte ptr[rsp+bSymbol], r9b ;write bSymbol to linked list
cmp rcx, 00h ;if pFLink is not initialised it means this is the first entry
je _firstEntry
mov qword ptr[rsp+13h], rsp ;when rcx != 00h it means that there is an entry before this one "before" on stack remember
_firstEntry:
mov rcx, rsp
_notInitialised:
add rax, 03h
cmp rax, 300h
jna _generateLinkedList


The above code works as desired, I have added it for completeness. RBX is still pointing to the array.



The function then runs bubble sort on the resulting linked list, sorting in ascending order. This means the element with no pBLink (no backward link) has the lowest wFreq. It is not a circular doubly linked list.



The next step of implementing Huffman Compression is to create a node with a wFreq equal to the sum of the two lowest frequencies.



My plan for this was to do the following:




  1. Add the two lowest wFreq values together


  2. Make pFLink in the lowest element NULL


  3. Make both pFLink and pBLink is the second lowest (pHead->pFLink) NULL


  4. Make more stack space (19 bytes) and add a node to the end (this involves finding the element with a NULL pFLink and changing it to the new node. Additionally the new node has the wFreq of the two lowest added together.



This is where the problem is. I need to make the pFLink and pBLink of the new node point to the two lowest elements (The nodes which had their pointers NULLED in step 2 & 3. However if I do this then the new node cannot be connect to the linked list.



EDIT: I think the above option would work, I would have to rework the sorting algorithm so that when a new node is added and needs to be sorted it is inserted inbetween elements. The code would realise it has found a "new node" because the pBLink wouldn't be correct. It would then know the next element is directly "above" (19 bytes above the "new node"). I think this could be done during the new node creation as the new node doesn't need to be placed in the new space on the stack it can be swapped with the element which is at it's desired place. It's a dirty workaround, if there is a better way I would love to hear it.



I had the idea that when the sorting function finds a node where the pBLink doesn't point to the previous element it should just move up the next element by subtracting 13h from it's pointer. This didn't work because the new nodes (those created by adding the two lowest frequencies together) could be anywhere after sorting.



Is there any way to overcome this without adding two more qwords to the nl structure?










share|improve this question
















I am developing an implementation of Huffman Compression in x64 MASM. My main function in C syntax would be:



void* huffCompress(void* lpDataStream, unsigned long long qwLength);


I am using Microsoft's Fastcall convention where the paramters are passing to the called function using RCX, RDX, R8 & R9 (anything else is on the stack).



I have written code which generates an array of size 256*3. Each 3 byte consists of:



as STRUCT
word wFreq
byte bSymbol
as ENDS


The code then iterates over the lpDataStream twice.




  1. Initialising bSymbol with the index in the array


  2. Incrementing wFreq for each byte



In the code below RBX is a pointer to the base of the array, RSI is a pointer to the lpDataStream and RCX is qwLength. RAX needs to be 0 before entering the below code.



_populateArray:
lodsb
lea rdx, qword ptr[rbx+rax*04h]
sub rdx, rax
inc word ptr[rdx]
loop _populateArray


The function then iterates over the array and for each element which has a wFreq which is not zero it subtracts 13h (19 bytes) from the stack. These 19 bytes are then populated as per the following structure.



nl STRUCT
qword pFLink
qword pBLink
dword dwFreq
byte bSymbol
nl ENDS


The first two qwords are forward and backward links to the next and previous elements in the linked list.



xor rax, rax
mov r10, rsp
mov rcx, rax ;rcx = pFLink
mov rdx, rax ;rdx = pBLink
_generateLinkedList:
cmp word ptr[rbx+rax], 0000h
je _notInitialised
sub rsp, 13h
;write and setup pBLink for next entry
mov qword ptr[rsp+pBLink], rdx
mov rdx, rsp
mov r8w, word ptr[rbx+rax]
mov r9b, byte ptr[rbx+rax+02h]
mov word ptr[rsp+wFreq], r8w ;write dFreq to linked list
mov byte ptr[rsp+bSymbol], r9b ;write bSymbol to linked list
cmp rcx, 00h ;if pFLink is not initialised it means this is the first entry
je _firstEntry
mov qword ptr[rsp+13h], rsp ;when rcx != 00h it means that there is an entry before this one "before" on stack remember
_firstEntry:
mov rcx, rsp
_notInitialised:
add rax, 03h
cmp rax, 300h
jna _generateLinkedList


The above code works as desired, I have added it for completeness. RBX is still pointing to the array.



The function then runs bubble sort on the resulting linked list, sorting in ascending order. This means the element with no pBLink (no backward link) has the lowest wFreq. It is not a circular doubly linked list.



The next step of implementing Huffman Compression is to create a node with a wFreq equal to the sum of the two lowest frequencies.



My plan for this was to do the following:




  1. Add the two lowest wFreq values together


  2. Make pFLink in the lowest element NULL


  3. Make both pFLink and pBLink is the second lowest (pHead->pFLink) NULL


  4. Make more stack space (19 bytes) and add a node to the end (this involves finding the element with a NULL pFLink and changing it to the new node. Additionally the new node has the wFreq of the two lowest added together.



This is where the problem is. I need to make the pFLink and pBLink of the new node point to the two lowest elements (The nodes which had their pointers NULLED in step 2 & 3. However if I do this then the new node cannot be connect to the linked list.



EDIT: I think the above option would work, I would have to rework the sorting algorithm so that when a new node is added and needs to be sorted it is inserted inbetween elements. The code would realise it has found a "new node" because the pBLink wouldn't be correct. It would then know the next element is directly "above" (19 bytes above the "new node"). I think this could be done during the new node creation as the new node doesn't need to be placed in the new space on the stack it can be swapped with the element which is at it's desired place. It's a dirty workaround, if there is a better way I would love to hear it.



I had the idea that when the sorting function finds a node where the pBLink doesn't point to the previous element it should just move up the next element by subtracting 13h from it's pointer. This didn't work because the new nodes (those created by adding the two lowest frequencies together) could be anywhere after sorting.



Is there any way to overcome this without adding two more qwords to the nl structure?







arrays assembly masm huffman-code






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Jan 1 at 20:36







Will

















asked Jan 1 at 15:05









WillWill

1256




1256













  • Why you need to make the pFLink and pBLink of the new node point to the two lowest elements? A node in a linked list doesn't even need a pBLink unless you want to efficiently move backwards. Are you sure a linked list is what you need? a node cannot be in between two elements and at the end of the list at the same time. Maybe you need two lists?

    – Margaret Bloom
    Jan 1 at 21:14











  • @MargaretBloom I am reusing pFLink and pBlink to point to leaves of a node in the Huffman Tree. The code converts a linked list to a Huffman Tree. I included a pBLink because I need two pointers for the tree anyway so I may aswell use the second as a pBLink. I may traverse the linked list backwards when checking where a new node should be placed due to the fact that it's more likely the new node will need to be somewhere at the top (due to having a higher frequency). A nodes memory address and position in the linked list are independent.

    – Will
    Jan 1 at 21:23











  • Are you optimizing for code-size at the expense of speed? If not, avoid the slow loop instruction unless you're only tuning for AMD CPUs. Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?. Also, lodsb is 3 uops, vs. 2 for movzx / inc rsi which avoids a false dependency on the old RAX. You can also do *3 as an index like lea edx, [rax+rax*2] / add [rbx + rdx], 1. (But an indexed add [mem], 1 will unlaminate into more uops than add [rdx],1 on Sandybridge-family.)

    – Peter Cordes
    Jan 1 at 22:20











  • @PeterCordes Neither at the moment. I will probably optimise for code size later. I never though of using "lea edx, [rax+rax*2]" that's really nice, thanks!

    – Will
    Jan 1 at 22:23











  • If you haven't seen that multiply-by-3 idiom before, there's probably a lot you can learn from looking at optimized compiler output. Write some of your functions in C (with function args and return values for inputs/outputs), then look at gcc and clang -O3 (pure speed including autovectorization) or -Os (code size and speed) output on godbolt.org. How to remove "noise" from GCC/clang assembly output?. If you really want to optimize for code-size, see Tips for golfing in x86/x64 machine code

    – Peter Cordes
    Jan 1 at 22:27





















  • Why you need to make the pFLink and pBLink of the new node point to the two lowest elements? A node in a linked list doesn't even need a pBLink unless you want to efficiently move backwards. Are you sure a linked list is what you need? a node cannot be in between two elements and at the end of the list at the same time. Maybe you need two lists?

    – Margaret Bloom
    Jan 1 at 21:14











  • @MargaretBloom I am reusing pFLink and pBlink to point to leaves of a node in the Huffman Tree. The code converts a linked list to a Huffman Tree. I included a pBLink because I need two pointers for the tree anyway so I may aswell use the second as a pBLink. I may traverse the linked list backwards when checking where a new node should be placed due to the fact that it's more likely the new node will need to be somewhere at the top (due to having a higher frequency). A nodes memory address and position in the linked list are independent.

    – Will
    Jan 1 at 21:23











  • Are you optimizing for code-size at the expense of speed? If not, avoid the slow loop instruction unless you're only tuning for AMD CPUs. Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?. Also, lodsb is 3 uops, vs. 2 for movzx / inc rsi which avoids a false dependency on the old RAX. You can also do *3 as an index like lea edx, [rax+rax*2] / add [rbx + rdx], 1. (But an indexed add [mem], 1 will unlaminate into more uops than add [rdx],1 on Sandybridge-family.)

    – Peter Cordes
    Jan 1 at 22:20











  • @PeterCordes Neither at the moment. I will probably optimise for code size later. I never though of using "lea edx, [rax+rax*2]" that's really nice, thanks!

    – Will
    Jan 1 at 22:23











  • If you haven't seen that multiply-by-3 idiom before, there's probably a lot you can learn from looking at optimized compiler output. Write some of your functions in C (with function args and return values for inputs/outputs), then look at gcc and clang -O3 (pure speed including autovectorization) or -Os (code size and speed) output on godbolt.org. How to remove "noise" from GCC/clang assembly output?. If you really want to optimize for code-size, see Tips for golfing in x86/x64 machine code

    – Peter Cordes
    Jan 1 at 22:27



















Why you need to make the pFLink and pBLink of the new node point to the two lowest elements? A node in a linked list doesn't even need a pBLink unless you want to efficiently move backwards. Are you sure a linked list is what you need? a node cannot be in between two elements and at the end of the list at the same time. Maybe you need two lists?

– Margaret Bloom
Jan 1 at 21:14





Why you need to make the pFLink and pBLink of the new node point to the two lowest elements? A node in a linked list doesn't even need a pBLink unless you want to efficiently move backwards. Are you sure a linked list is what you need? a node cannot be in between two elements and at the end of the list at the same time. Maybe you need two lists?

– Margaret Bloom
Jan 1 at 21:14













@MargaretBloom I am reusing pFLink and pBlink to point to leaves of a node in the Huffman Tree. The code converts a linked list to a Huffman Tree. I included a pBLink because I need two pointers for the tree anyway so I may aswell use the second as a pBLink. I may traverse the linked list backwards when checking where a new node should be placed due to the fact that it's more likely the new node will need to be somewhere at the top (due to having a higher frequency). A nodes memory address and position in the linked list are independent.

– Will
Jan 1 at 21:23





@MargaretBloom I am reusing pFLink and pBlink to point to leaves of a node in the Huffman Tree. The code converts a linked list to a Huffman Tree. I included a pBLink because I need two pointers for the tree anyway so I may aswell use the second as a pBLink. I may traverse the linked list backwards when checking where a new node should be placed due to the fact that it's more likely the new node will need to be somewhere at the top (due to having a higher frequency). A nodes memory address and position in the linked list are independent.

– Will
Jan 1 at 21:23













Are you optimizing for code-size at the expense of speed? If not, avoid the slow loop instruction unless you're only tuning for AMD CPUs. Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?. Also, lodsb is 3 uops, vs. 2 for movzx / inc rsi which avoids a false dependency on the old RAX. You can also do *3 as an index like lea edx, [rax+rax*2] / add [rbx + rdx], 1. (But an indexed add [mem], 1 will unlaminate into more uops than add [rdx],1 on Sandybridge-family.)

– Peter Cordes
Jan 1 at 22:20





Are you optimizing for code-size at the expense of speed? If not, avoid the slow loop instruction unless you're only tuning for AMD CPUs. Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?. Also, lodsb is 3 uops, vs. 2 for movzx / inc rsi which avoids a false dependency on the old RAX. You can also do *3 as an index like lea edx, [rax+rax*2] / add [rbx + rdx], 1. (But an indexed add [mem], 1 will unlaminate into more uops than add [rdx],1 on Sandybridge-family.)

– Peter Cordes
Jan 1 at 22:20













@PeterCordes Neither at the moment. I will probably optimise for code size later. I never though of using "lea edx, [rax+rax*2]" that's really nice, thanks!

– Will
Jan 1 at 22:23





@PeterCordes Neither at the moment. I will probably optimise for code size later. I never though of using "lea edx, [rax+rax*2]" that's really nice, thanks!

– Will
Jan 1 at 22:23













If you haven't seen that multiply-by-3 idiom before, there's probably a lot you can learn from looking at optimized compiler output. Write some of your functions in C (with function args and return values for inputs/outputs), then look at gcc and clang -O3 (pure speed including autovectorization) or -Os (code size and speed) output on godbolt.org. How to remove "noise" from GCC/clang assembly output?. If you really want to optimize for code-size, see Tips for golfing in x86/x64 machine code

– Peter Cordes
Jan 1 at 22:27







If you haven't seen that multiply-by-3 idiom before, there's probably a lot you can learn from looking at optimized compiler output. Write some of your functions in C (with function args and return values for inputs/outputs), then look at gcc and clang -O3 (pure speed including autovectorization) or -Os (code size and speed) output on godbolt.org. How to remove "noise" from GCC/clang assembly output?. If you really want to optimize for code-size, see Tips for golfing in x86/x64 machine code

– Peter Cordes
Jan 1 at 22:27














1 Answer
1






active

oldest

votes


















1














If you are interested in speed, then you need to start with carefully considered algorithm choices before even remotely considering writing it in assembler. The correct choice of algorithms will make order of magnitude improvements. Then writing it in assembler may give you small factor improvements.



In particular you need a much better choice for sorting, likely quicksort in this case, and you can do the Huffman coding much more simply, in place, without re-sorting linked lists.






share|improve this answer
























  • Moreover, writing in asm will only help if you're better at it than a compiler (unless the goal is to learn asm, rather than to create a fast and/or small implementation right away.) Copying around 3-byte objects with separate word + byte stores doesn't look like a good start. Like I suggested in comments, storing the symbol+count in a 4-byte object lets you compare the whole thing as an integer (x86 is little endian, so put the symbol in the low byte, count at the top). Sorting on (count << 16) | symbol means you don't need to extract the count into a separate register to compare it.

    – Peter Cordes
    Jan 4 at 20:41













  • I am writing this in assembler because I enjoy working with it. It is useful for me to learn as I enjoy CTFs and reversing. Speed is only a minor concern to me - learning assembly and how algorithms and software are implemented is my goal. I understand there are many improvements that could be made. I am using bubble sort in this instance because my understanding of Quick Sort is limited. As far as I know it requires recursion? This would probably end up impacting my programs stack pointer therefore messing up my linked list/array which are stored there.

    – Will
    Jan 4 at 21:05











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53996515%2fhuffman-coding-using-doubly-linked-list%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









1














If you are interested in speed, then you need to start with carefully considered algorithm choices before even remotely considering writing it in assembler. The correct choice of algorithms will make order of magnitude improvements. Then writing it in assembler may give you small factor improvements.



In particular you need a much better choice for sorting, likely quicksort in this case, and you can do the Huffman coding much more simply, in place, without re-sorting linked lists.






share|improve this answer
























  • Moreover, writing in asm will only help if you're better at it than a compiler (unless the goal is to learn asm, rather than to create a fast and/or small implementation right away.) Copying around 3-byte objects with separate word + byte stores doesn't look like a good start. Like I suggested in comments, storing the symbol+count in a 4-byte object lets you compare the whole thing as an integer (x86 is little endian, so put the symbol in the low byte, count at the top). Sorting on (count << 16) | symbol means you don't need to extract the count into a separate register to compare it.

    – Peter Cordes
    Jan 4 at 20:41













  • I am writing this in assembler because I enjoy working with it. It is useful for me to learn as I enjoy CTFs and reversing. Speed is only a minor concern to me - learning assembly and how algorithms and software are implemented is my goal. I understand there are many improvements that could be made. I am using bubble sort in this instance because my understanding of Quick Sort is limited. As far as I know it requires recursion? This would probably end up impacting my programs stack pointer therefore messing up my linked list/array which are stored there.

    – Will
    Jan 4 at 21:05
















1














If you are interested in speed, then you need to start with carefully considered algorithm choices before even remotely considering writing it in assembler. The correct choice of algorithms will make order of magnitude improvements. Then writing it in assembler may give you small factor improvements.



In particular you need a much better choice for sorting, likely quicksort in this case, and you can do the Huffman coding much more simply, in place, without re-sorting linked lists.






share|improve this answer
























  • Moreover, writing in asm will only help if you're better at it than a compiler (unless the goal is to learn asm, rather than to create a fast and/or small implementation right away.) Copying around 3-byte objects with separate word + byte stores doesn't look like a good start. Like I suggested in comments, storing the symbol+count in a 4-byte object lets you compare the whole thing as an integer (x86 is little endian, so put the symbol in the low byte, count at the top). Sorting on (count << 16) | symbol means you don't need to extract the count into a separate register to compare it.

    – Peter Cordes
    Jan 4 at 20:41













  • I am writing this in assembler because I enjoy working with it. It is useful for me to learn as I enjoy CTFs and reversing. Speed is only a minor concern to me - learning assembly and how algorithms and software are implemented is my goal. I understand there are many improvements that could be made. I am using bubble sort in this instance because my understanding of Quick Sort is limited. As far as I know it requires recursion? This would probably end up impacting my programs stack pointer therefore messing up my linked list/array which are stored there.

    – Will
    Jan 4 at 21:05














1












1








1







If you are interested in speed, then you need to start with carefully considered algorithm choices before even remotely considering writing it in assembler. The correct choice of algorithms will make order of magnitude improvements. Then writing it in assembler may give you small factor improvements.



In particular you need a much better choice for sorting, likely quicksort in this case, and you can do the Huffman coding much more simply, in place, without re-sorting linked lists.






share|improve this answer













If you are interested in speed, then you need to start with carefully considered algorithm choices before even remotely considering writing it in assembler. The correct choice of algorithms will make order of magnitude improvements. Then writing it in assembler may give you small factor improvements.



In particular you need a much better choice for sorting, likely quicksort in this case, and you can do the Huffman coding much more simply, in place, without re-sorting linked lists.







share|improve this answer












share|improve this answer



share|improve this answer










answered Jan 4 at 19:28









Mark AdlerMark Adler

59k865112




59k865112













  • Moreover, writing in asm will only help if you're better at it than a compiler (unless the goal is to learn asm, rather than to create a fast and/or small implementation right away.) Copying around 3-byte objects with separate word + byte stores doesn't look like a good start. Like I suggested in comments, storing the symbol+count in a 4-byte object lets you compare the whole thing as an integer (x86 is little endian, so put the symbol in the low byte, count at the top). Sorting on (count << 16) | symbol means you don't need to extract the count into a separate register to compare it.

    – Peter Cordes
    Jan 4 at 20:41













  • I am writing this in assembler because I enjoy working with it. It is useful for me to learn as I enjoy CTFs and reversing. Speed is only a minor concern to me - learning assembly and how algorithms and software are implemented is my goal. I understand there are many improvements that could be made. I am using bubble sort in this instance because my understanding of Quick Sort is limited. As far as I know it requires recursion? This would probably end up impacting my programs stack pointer therefore messing up my linked list/array which are stored there.

    – Will
    Jan 4 at 21:05



















  • Moreover, writing in asm will only help if you're better at it than a compiler (unless the goal is to learn asm, rather than to create a fast and/or small implementation right away.) Copying around 3-byte objects with separate word + byte stores doesn't look like a good start. Like I suggested in comments, storing the symbol+count in a 4-byte object lets you compare the whole thing as an integer (x86 is little endian, so put the symbol in the low byte, count at the top). Sorting on (count << 16) | symbol means you don't need to extract the count into a separate register to compare it.

    – Peter Cordes
    Jan 4 at 20:41













  • I am writing this in assembler because I enjoy working with it. It is useful for me to learn as I enjoy CTFs and reversing. Speed is only a minor concern to me - learning assembly and how algorithms and software are implemented is my goal. I understand there are many improvements that could be made. I am using bubble sort in this instance because my understanding of Quick Sort is limited. As far as I know it requires recursion? This would probably end up impacting my programs stack pointer therefore messing up my linked list/array which are stored there.

    – Will
    Jan 4 at 21:05

















Moreover, writing in asm will only help if you're better at it than a compiler (unless the goal is to learn asm, rather than to create a fast and/or small implementation right away.) Copying around 3-byte objects with separate word + byte stores doesn't look like a good start. Like I suggested in comments, storing the symbol+count in a 4-byte object lets you compare the whole thing as an integer (x86 is little endian, so put the symbol in the low byte, count at the top). Sorting on (count << 16) | symbol means you don't need to extract the count into a separate register to compare it.

– Peter Cordes
Jan 4 at 20:41







Moreover, writing in asm will only help if you're better at it than a compiler (unless the goal is to learn asm, rather than to create a fast and/or small implementation right away.) Copying around 3-byte objects with separate word + byte stores doesn't look like a good start. Like I suggested in comments, storing the symbol+count in a 4-byte object lets you compare the whole thing as an integer (x86 is little endian, so put the symbol in the low byte, count at the top). Sorting on (count << 16) | symbol means you don't need to extract the count into a separate register to compare it.

– Peter Cordes
Jan 4 at 20:41















I am writing this in assembler because I enjoy working with it. It is useful for me to learn as I enjoy CTFs and reversing. Speed is only a minor concern to me - learning assembly and how algorithms and software are implemented is my goal. I understand there are many improvements that could be made. I am using bubble sort in this instance because my understanding of Quick Sort is limited. As far as I know it requires recursion? This would probably end up impacting my programs stack pointer therefore messing up my linked list/array which are stored there.

– Will
Jan 4 at 21:05





I am writing this in assembler because I enjoy working with it. It is useful for me to learn as I enjoy CTFs and reversing. Speed is only a minor concern to me - learning assembly and how algorithms and software are implemented is my goal. I understand there are many improvements that could be made. I am using bubble sort in this instance because my understanding of Quick Sort is limited. As far as I know it requires recursion? This would probably end up impacting my programs stack pointer therefore messing up my linked list/array which are stored there.

– Will
Jan 4 at 21:05




















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53996515%2fhuffman-coding-using-doubly-linked-list%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

android studio warns about leanback feature tag usage required on manifest while using Unity exported app?

SQL update select statement

'app-layout' is not a known element: how to share Component with different Modules