Huffman tree compressing/decompressing in C












12














In a past course one of the assignments was to write a program that can compress files using Huffman Tree algorithm, and uncompress the files that the program generates.



My design is to count the byte occurrences first, then construct a HT based on the counted byte frequency.



My compressed file format is 256*4 bytes of "header" that stores the counted frequency, so it can be used to construct the tree when decompressing the file. Then there's a 4-byte integer that indicates how many bits of the last byte is real data. The rest is the real (compressed) data.



Here is this specific version* of code that I want some feedback. Later versions introduced many messy changes (like GUI and buffered I/O) that is not necessary.



Specifically, I'm looking for feedback on my algorithm and data structure implementation, including but not limited to code style, best practices, potential flaws and defects (see below).




  • An exception is the last two functions print_help and main. They're intended to be as simple as possible, so they contain the bare minimum amount of code to work in a reasonable way. Data validation and error checking etc. are omitted on purpose.


In order to simplify the idea, during designing and coding, I have assumed that




  • the program will not be told to uncompress an invalid file, so there's no file validity check in the code

  • file availability is ensured by the environment. It will always be a regular file, with no chance of generating a read error mid-way

  • C library functions does not fail for environmental reasons (e.g. host is short of RAM for malloc(3), target disk out of space for fwrite(3) and consequently write(2), or fread(3) as said above)

  • reading/writing byte-by-byte is fine, because a later version of this code introduced chunk I/O and got a bit messier (I think). Suggestions on making the code run faster without implementing chunk I/O is welcome


so I'm also not looking for feedbacks regarding the above things that I have assumed / intentionally ignored.



I have ensured that the code is working properly, with no warnings when compiled with this command (taken from make output)



gcc -O3 -std=c11 -Wall -Wno-unused-result -o huffman huffman.c


The last option is to suppress the warning about unused result from fread(3).



During my coding process, I run clang-format occasionally and diff the output and my written code to check for potentially bad indentation / styling issues. I am not confident if it can solve everything.



* The link points to my GitHub repo. The code on that page is identical to the code submitted below verbatim.



// File: huffman.c
// Author: iBug

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

typedef unsigned char byte;

typedef struct _HuffNode {
unsigned data;
struct _HuffNode *left, *right, *parent;
} HuffNode;

void count_frequency(FILE* fp, unsigned* freq) {
size_t orig_pos = ftell(fp);
int ch;
while (1) {
ch = fgetc(fp);
if (ch < 0)
break;
freq[ch]++;
}
fseek(fp, orig_pos, SEEK_SET);
}

void construct_huffman(unsigned* freq_in, HuffNode* tree) {
int count = 256;
unsigned freq[256];
HuffNode *node[256];

// Initialize data
for (int i = 0; i < 256; i++) {
freq[i] = freq_in[i];
tree[i].data = i;
tree[i].left = tree[i].right = NULL;
node[i] = &tree[i];
}

// Sort by frequency, decreasing order
/* WARNING: Although this Quick Sort is an unstable sort,
* it should at least give the same result for the same input frequency table,
* therefore I'm leaving this code here
*/
{
unsigned lower[256], upper[256], top = 1;
lower[0] = 0, upper[0] = 256;
while (top > 0) {
top--;
int left = lower[top], right = upper[top];
int i = left, j = right - 1, flag = 0;
if (i >= j) // Nothing to sort
continue;
while (i < j) {
if (freq[i] < freq[j]) {
unsigned t = freq[i]; freq[i] = freq[j]; freq[j] = t;
HuffNode *p = node[i]; node[i] = node[j]; node[j] = p;
flag = !flag;
}
flag ? i++ : j--;
}
lower[top] = left, upper[top] = i;
lower[top + 1] = i + 1, upper[top + 1] = right;
top += 2;
}
}

// Construct tree
while (count > 1) {
int pos = 512 - count;
HuffNode *parent = &tree[pos];
// Select lowest 2 by freq
int i = count - 2, j = count - 1;
// Create tree, lower freq left
parent->left = node[j]; parent->right = node[i];
node[j]->parent = node[i]->parent = parent;
node[i] = parent;
freq[i] += freq[j];
// Insert
for (; i > 0 && freq[i] > freq[i - 1]; i--) {
unsigned t = freq[i]; freq[i] = freq[i - 1]; freq[i - 1] = t;
HuffNode *p = node[i]; node[i] = node[i - 1]; node[i - 1] = p;
}
count--;
}
// Now HEAD = node[0] = tree[511]
node[0]->parent = NULL;
}

void encode_stream(FILE* fin, FILE* fout, HuffNode* tree, unsigned* padding) {
int n;
byte ch;
byte buf = 0, nbuf = 0;
HuffNode *p;
byte code[256];
while (1) {
n = fgetc(fin);
if (n < 0)
break;
ch = n;

// Encode
p = &tree[ch];
n = 0;
while (p->parent) {
if (p == p->parent->left) {
// Left is 0
code[n] = 0;
} else if (p == p->parent->right) {
code[n] = 1;
}
p = p->parent;
n++;
}

// Write
for (int i = n - 1; i >= 0; i--) {
buf |= code[i] << nbuf;
nbuf++;
if (nbuf == 8) {
fputc(buf, fout);
nbuf = buf = 0;
}
}
}
fputc(buf, fout);
*padding = 8 - nbuf;
}

void decode_stream(FILE* fin, FILE* fout, HuffNode* tree, unsigned padding) {
size_t startpos = ftell(fin); // should be 1028
fseek(fin, 0L, SEEK_END);
size_t endpos = ftell(fin); // last byte handling
fseek(fin, startpos, SEEK_SET);
int count = endpos - startpos;

byte buf = 0, nbuf = 0, bit;
HuffNode *p;
while (count > 0 || nbuf > 0) {
// Start from tree top
p = tree + 510;
while (p->left || p->right) {
// Prepare next bit if needed
if (nbuf == 0) {
if (count <= 0)
return;

buf = fgetc(fin);
if (count == 1) {
// Last bit
nbuf = 8 - padding;
if (nbuf == 0) {
return;
}
} else {
nbuf = 8;
}
count--;
}
// p has child
bit = buf & 1;
buf >>= 1;
nbuf--;
if (bit == 0)
p = p->left;
else
p = p->right;
}
fputc(p->data, fout);
}
}

void compress_file(const char* filename, const char* newname) {
FILE *fin = fopen(filename, "rb"),
*fout = fopen(newname, "wb");

unsigned freq[256], padding;
HuffNode tree[512];
size_t padding_pos;
count_frequency(fin, freq);
construct_huffman(freq, tree);
rewind(fin);
for (int i = 0; i < 256; i++)
fwrite(freq + i, 4, 1, fout);
// Write a placeholder for the padding
padding_pos = ftell(fout);
fwrite(&padding, 4, 1, fout);
encode_stream(fin, fout, tree, &padding);
// Write the padding to the placeholder
fseek(fout, padding_pos, SEEK_SET);
fwrite(&padding, 4, 1, fout);
fclose(fin);
fclose(fout);
}

void uncompress_file(const char* filename, const char* newname) {
FILE *fin = fopen(filename, "rb"),
*fout = fopen(newname, "wb");

unsigned freq[256], padding;
HuffNode tree[512];
for (int i = 0; i < 256; i++) {
fread(&padding, 4, 1, fin);
freq[i] = padding;
}
fread(&padding, 4, 1, fin);
construct_huffman(freq, tree);
decode_stream(fin, fout, tree, padding);
fclose(fin);
fclose(fout);
}

void print_help(void) {
puts("Usage: huffman (-c|-d) input output");
puts(" -c Compress file from input to output");
puts(" -d Uncompress file from input to output");
puts("nCreated by iBug");
}

int main(int argc, char** argv) {
if (argc != 4) {
print_help();
return 1;
}
if (!strcmp(argv[1], "-c")) {
compress_file(argv[2], argv[3]);
} else if (!strcmp(argv[1], "-d")) {
uncompress_file(argv[2], argv[3]);
} else {
print_help();
return 1;
}
return 0;
}


In addition to the mandatory CC BY-SA 3.0 license by posting on Stack Exchange, the code itself also has a MIT license.



On a side note: Although the course has ended and this code is not maintained anymore, it's still one of the programs that I have written with maximum attention and carefulness, so I believe that any feedback to this code is highly valuable and I will remember them in my future C-coding times.










share|improve this question





























    12














    In a past course one of the assignments was to write a program that can compress files using Huffman Tree algorithm, and uncompress the files that the program generates.



    My design is to count the byte occurrences first, then construct a HT based on the counted byte frequency.



    My compressed file format is 256*4 bytes of "header" that stores the counted frequency, so it can be used to construct the tree when decompressing the file. Then there's a 4-byte integer that indicates how many bits of the last byte is real data. The rest is the real (compressed) data.



    Here is this specific version* of code that I want some feedback. Later versions introduced many messy changes (like GUI and buffered I/O) that is not necessary.



    Specifically, I'm looking for feedback on my algorithm and data structure implementation, including but not limited to code style, best practices, potential flaws and defects (see below).




    • An exception is the last two functions print_help and main. They're intended to be as simple as possible, so they contain the bare minimum amount of code to work in a reasonable way. Data validation and error checking etc. are omitted on purpose.


    In order to simplify the idea, during designing and coding, I have assumed that




    • the program will not be told to uncompress an invalid file, so there's no file validity check in the code

    • file availability is ensured by the environment. It will always be a regular file, with no chance of generating a read error mid-way

    • C library functions does not fail for environmental reasons (e.g. host is short of RAM for malloc(3), target disk out of space for fwrite(3) and consequently write(2), or fread(3) as said above)

    • reading/writing byte-by-byte is fine, because a later version of this code introduced chunk I/O and got a bit messier (I think). Suggestions on making the code run faster without implementing chunk I/O is welcome


    so I'm also not looking for feedbacks regarding the above things that I have assumed / intentionally ignored.



    I have ensured that the code is working properly, with no warnings when compiled with this command (taken from make output)



    gcc -O3 -std=c11 -Wall -Wno-unused-result -o huffman huffman.c


    The last option is to suppress the warning about unused result from fread(3).



    During my coding process, I run clang-format occasionally and diff the output and my written code to check for potentially bad indentation / styling issues. I am not confident if it can solve everything.



    * The link points to my GitHub repo. The code on that page is identical to the code submitted below verbatim.



    // File: huffman.c
    // Author: iBug

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    typedef unsigned char byte;

    typedef struct _HuffNode {
    unsigned data;
    struct _HuffNode *left, *right, *parent;
    } HuffNode;

    void count_frequency(FILE* fp, unsigned* freq) {
    size_t orig_pos = ftell(fp);
    int ch;
    while (1) {
    ch = fgetc(fp);
    if (ch < 0)
    break;
    freq[ch]++;
    }
    fseek(fp, orig_pos, SEEK_SET);
    }

    void construct_huffman(unsigned* freq_in, HuffNode* tree) {
    int count = 256;
    unsigned freq[256];
    HuffNode *node[256];

    // Initialize data
    for (int i = 0; i < 256; i++) {
    freq[i] = freq_in[i];
    tree[i].data = i;
    tree[i].left = tree[i].right = NULL;
    node[i] = &tree[i];
    }

    // Sort by frequency, decreasing order
    /* WARNING: Although this Quick Sort is an unstable sort,
    * it should at least give the same result for the same input frequency table,
    * therefore I'm leaving this code here
    */
    {
    unsigned lower[256], upper[256], top = 1;
    lower[0] = 0, upper[0] = 256;
    while (top > 0) {
    top--;
    int left = lower[top], right = upper[top];
    int i = left, j = right - 1, flag = 0;
    if (i >= j) // Nothing to sort
    continue;
    while (i < j) {
    if (freq[i] < freq[j]) {
    unsigned t = freq[i]; freq[i] = freq[j]; freq[j] = t;
    HuffNode *p = node[i]; node[i] = node[j]; node[j] = p;
    flag = !flag;
    }
    flag ? i++ : j--;
    }
    lower[top] = left, upper[top] = i;
    lower[top + 1] = i + 1, upper[top + 1] = right;
    top += 2;
    }
    }

    // Construct tree
    while (count > 1) {
    int pos = 512 - count;
    HuffNode *parent = &tree[pos];
    // Select lowest 2 by freq
    int i = count - 2, j = count - 1;
    // Create tree, lower freq left
    parent->left = node[j]; parent->right = node[i];
    node[j]->parent = node[i]->parent = parent;
    node[i] = parent;
    freq[i] += freq[j];
    // Insert
    for (; i > 0 && freq[i] > freq[i - 1]; i--) {
    unsigned t = freq[i]; freq[i] = freq[i - 1]; freq[i - 1] = t;
    HuffNode *p = node[i]; node[i] = node[i - 1]; node[i - 1] = p;
    }
    count--;
    }
    // Now HEAD = node[0] = tree[511]
    node[0]->parent = NULL;
    }

    void encode_stream(FILE* fin, FILE* fout, HuffNode* tree, unsigned* padding) {
    int n;
    byte ch;
    byte buf = 0, nbuf = 0;
    HuffNode *p;
    byte code[256];
    while (1) {
    n = fgetc(fin);
    if (n < 0)
    break;
    ch = n;

    // Encode
    p = &tree[ch];
    n = 0;
    while (p->parent) {
    if (p == p->parent->left) {
    // Left is 0
    code[n] = 0;
    } else if (p == p->parent->right) {
    code[n] = 1;
    }
    p = p->parent;
    n++;
    }

    // Write
    for (int i = n - 1; i >= 0; i--) {
    buf |= code[i] << nbuf;
    nbuf++;
    if (nbuf == 8) {
    fputc(buf, fout);
    nbuf = buf = 0;
    }
    }
    }
    fputc(buf, fout);
    *padding = 8 - nbuf;
    }

    void decode_stream(FILE* fin, FILE* fout, HuffNode* tree, unsigned padding) {
    size_t startpos = ftell(fin); // should be 1028
    fseek(fin, 0L, SEEK_END);
    size_t endpos = ftell(fin); // last byte handling
    fseek(fin, startpos, SEEK_SET);
    int count = endpos - startpos;

    byte buf = 0, nbuf = 0, bit;
    HuffNode *p;
    while (count > 0 || nbuf > 0) {
    // Start from tree top
    p = tree + 510;
    while (p->left || p->right) {
    // Prepare next bit if needed
    if (nbuf == 0) {
    if (count <= 0)
    return;

    buf = fgetc(fin);
    if (count == 1) {
    // Last bit
    nbuf = 8 - padding;
    if (nbuf == 0) {
    return;
    }
    } else {
    nbuf = 8;
    }
    count--;
    }
    // p has child
    bit = buf & 1;
    buf >>= 1;
    nbuf--;
    if (bit == 0)
    p = p->left;
    else
    p = p->right;
    }
    fputc(p->data, fout);
    }
    }

    void compress_file(const char* filename, const char* newname) {
    FILE *fin = fopen(filename, "rb"),
    *fout = fopen(newname, "wb");

    unsigned freq[256], padding;
    HuffNode tree[512];
    size_t padding_pos;
    count_frequency(fin, freq);
    construct_huffman(freq, tree);
    rewind(fin);
    for (int i = 0; i < 256; i++)
    fwrite(freq + i, 4, 1, fout);
    // Write a placeholder for the padding
    padding_pos = ftell(fout);
    fwrite(&padding, 4, 1, fout);
    encode_stream(fin, fout, tree, &padding);
    // Write the padding to the placeholder
    fseek(fout, padding_pos, SEEK_SET);
    fwrite(&padding, 4, 1, fout);
    fclose(fin);
    fclose(fout);
    }

    void uncompress_file(const char* filename, const char* newname) {
    FILE *fin = fopen(filename, "rb"),
    *fout = fopen(newname, "wb");

    unsigned freq[256], padding;
    HuffNode tree[512];
    for (int i = 0; i < 256; i++) {
    fread(&padding, 4, 1, fin);
    freq[i] = padding;
    }
    fread(&padding, 4, 1, fin);
    construct_huffman(freq, tree);
    decode_stream(fin, fout, tree, padding);
    fclose(fin);
    fclose(fout);
    }

    void print_help(void) {
    puts("Usage: huffman (-c|-d) input output");
    puts(" -c Compress file from input to output");
    puts(" -d Uncompress file from input to output");
    puts("nCreated by iBug");
    }

    int main(int argc, char** argv) {
    if (argc != 4) {
    print_help();
    return 1;
    }
    if (!strcmp(argv[1], "-c")) {
    compress_file(argv[2], argv[3]);
    } else if (!strcmp(argv[1], "-d")) {
    uncompress_file(argv[2], argv[3]);
    } else {
    print_help();
    return 1;
    }
    return 0;
    }


    In addition to the mandatory CC BY-SA 3.0 license by posting on Stack Exchange, the code itself also has a MIT license.



    On a side note: Although the course has ended and this code is not maintained anymore, it's still one of the programs that I have written with maximum attention and carefulness, so I believe that any feedback to this code is highly valuable and I will remember them in my future C-coding times.










    share|improve this question



























      12












      12








      12


      2





      In a past course one of the assignments was to write a program that can compress files using Huffman Tree algorithm, and uncompress the files that the program generates.



      My design is to count the byte occurrences first, then construct a HT based on the counted byte frequency.



      My compressed file format is 256*4 bytes of "header" that stores the counted frequency, so it can be used to construct the tree when decompressing the file. Then there's a 4-byte integer that indicates how many bits of the last byte is real data. The rest is the real (compressed) data.



      Here is this specific version* of code that I want some feedback. Later versions introduced many messy changes (like GUI and buffered I/O) that is not necessary.



      Specifically, I'm looking for feedback on my algorithm and data structure implementation, including but not limited to code style, best practices, potential flaws and defects (see below).




      • An exception is the last two functions print_help and main. They're intended to be as simple as possible, so they contain the bare minimum amount of code to work in a reasonable way. Data validation and error checking etc. are omitted on purpose.


      In order to simplify the idea, during designing and coding, I have assumed that




      • the program will not be told to uncompress an invalid file, so there's no file validity check in the code

      • file availability is ensured by the environment. It will always be a regular file, with no chance of generating a read error mid-way

      • C library functions does not fail for environmental reasons (e.g. host is short of RAM for malloc(3), target disk out of space for fwrite(3) and consequently write(2), or fread(3) as said above)

      • reading/writing byte-by-byte is fine, because a later version of this code introduced chunk I/O and got a bit messier (I think). Suggestions on making the code run faster without implementing chunk I/O is welcome


      so I'm also not looking for feedbacks regarding the above things that I have assumed / intentionally ignored.



      I have ensured that the code is working properly, with no warnings when compiled with this command (taken from make output)



      gcc -O3 -std=c11 -Wall -Wno-unused-result -o huffman huffman.c


      The last option is to suppress the warning about unused result from fread(3).



      During my coding process, I run clang-format occasionally and diff the output and my written code to check for potentially bad indentation / styling issues. I am not confident if it can solve everything.



      * The link points to my GitHub repo. The code on that page is identical to the code submitted below verbatim.



      // File: huffman.c
      // Author: iBug

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <stdint.h>

      typedef unsigned char byte;

      typedef struct _HuffNode {
      unsigned data;
      struct _HuffNode *left, *right, *parent;
      } HuffNode;

      void count_frequency(FILE* fp, unsigned* freq) {
      size_t orig_pos = ftell(fp);
      int ch;
      while (1) {
      ch = fgetc(fp);
      if (ch < 0)
      break;
      freq[ch]++;
      }
      fseek(fp, orig_pos, SEEK_SET);
      }

      void construct_huffman(unsigned* freq_in, HuffNode* tree) {
      int count = 256;
      unsigned freq[256];
      HuffNode *node[256];

      // Initialize data
      for (int i = 0; i < 256; i++) {
      freq[i] = freq_in[i];
      tree[i].data = i;
      tree[i].left = tree[i].right = NULL;
      node[i] = &tree[i];
      }

      // Sort by frequency, decreasing order
      /* WARNING: Although this Quick Sort is an unstable sort,
      * it should at least give the same result for the same input frequency table,
      * therefore I'm leaving this code here
      */
      {
      unsigned lower[256], upper[256], top = 1;
      lower[0] = 0, upper[0] = 256;
      while (top > 0) {
      top--;
      int left = lower[top], right = upper[top];
      int i = left, j = right - 1, flag = 0;
      if (i >= j) // Nothing to sort
      continue;
      while (i < j) {
      if (freq[i] < freq[j]) {
      unsigned t = freq[i]; freq[i] = freq[j]; freq[j] = t;
      HuffNode *p = node[i]; node[i] = node[j]; node[j] = p;
      flag = !flag;
      }
      flag ? i++ : j--;
      }
      lower[top] = left, upper[top] = i;
      lower[top + 1] = i + 1, upper[top + 1] = right;
      top += 2;
      }
      }

      // Construct tree
      while (count > 1) {
      int pos = 512 - count;
      HuffNode *parent = &tree[pos];
      // Select lowest 2 by freq
      int i = count - 2, j = count - 1;
      // Create tree, lower freq left
      parent->left = node[j]; parent->right = node[i];
      node[j]->parent = node[i]->parent = parent;
      node[i] = parent;
      freq[i] += freq[j];
      // Insert
      for (; i > 0 && freq[i] > freq[i - 1]; i--) {
      unsigned t = freq[i]; freq[i] = freq[i - 1]; freq[i - 1] = t;
      HuffNode *p = node[i]; node[i] = node[i - 1]; node[i - 1] = p;
      }
      count--;
      }
      // Now HEAD = node[0] = tree[511]
      node[0]->parent = NULL;
      }

      void encode_stream(FILE* fin, FILE* fout, HuffNode* tree, unsigned* padding) {
      int n;
      byte ch;
      byte buf = 0, nbuf = 0;
      HuffNode *p;
      byte code[256];
      while (1) {
      n = fgetc(fin);
      if (n < 0)
      break;
      ch = n;

      // Encode
      p = &tree[ch];
      n = 0;
      while (p->parent) {
      if (p == p->parent->left) {
      // Left is 0
      code[n] = 0;
      } else if (p == p->parent->right) {
      code[n] = 1;
      }
      p = p->parent;
      n++;
      }

      // Write
      for (int i = n - 1; i >= 0; i--) {
      buf |= code[i] << nbuf;
      nbuf++;
      if (nbuf == 8) {
      fputc(buf, fout);
      nbuf = buf = 0;
      }
      }
      }
      fputc(buf, fout);
      *padding = 8 - nbuf;
      }

      void decode_stream(FILE* fin, FILE* fout, HuffNode* tree, unsigned padding) {
      size_t startpos = ftell(fin); // should be 1028
      fseek(fin, 0L, SEEK_END);
      size_t endpos = ftell(fin); // last byte handling
      fseek(fin, startpos, SEEK_SET);
      int count = endpos - startpos;

      byte buf = 0, nbuf = 0, bit;
      HuffNode *p;
      while (count > 0 || nbuf > 0) {
      // Start from tree top
      p = tree + 510;
      while (p->left || p->right) {
      // Prepare next bit if needed
      if (nbuf == 0) {
      if (count <= 0)
      return;

      buf = fgetc(fin);
      if (count == 1) {
      // Last bit
      nbuf = 8 - padding;
      if (nbuf == 0) {
      return;
      }
      } else {
      nbuf = 8;
      }
      count--;
      }
      // p has child
      bit = buf & 1;
      buf >>= 1;
      nbuf--;
      if (bit == 0)
      p = p->left;
      else
      p = p->right;
      }
      fputc(p->data, fout);
      }
      }

      void compress_file(const char* filename, const char* newname) {
      FILE *fin = fopen(filename, "rb"),
      *fout = fopen(newname, "wb");

      unsigned freq[256], padding;
      HuffNode tree[512];
      size_t padding_pos;
      count_frequency(fin, freq);
      construct_huffman(freq, tree);
      rewind(fin);
      for (int i = 0; i < 256; i++)
      fwrite(freq + i, 4, 1, fout);
      // Write a placeholder for the padding
      padding_pos = ftell(fout);
      fwrite(&padding, 4, 1, fout);
      encode_stream(fin, fout, tree, &padding);
      // Write the padding to the placeholder
      fseek(fout, padding_pos, SEEK_SET);
      fwrite(&padding, 4, 1, fout);
      fclose(fin);
      fclose(fout);
      }

      void uncompress_file(const char* filename, const char* newname) {
      FILE *fin = fopen(filename, "rb"),
      *fout = fopen(newname, "wb");

      unsigned freq[256], padding;
      HuffNode tree[512];
      for (int i = 0; i < 256; i++) {
      fread(&padding, 4, 1, fin);
      freq[i] = padding;
      }
      fread(&padding, 4, 1, fin);
      construct_huffman(freq, tree);
      decode_stream(fin, fout, tree, padding);
      fclose(fin);
      fclose(fout);
      }

      void print_help(void) {
      puts("Usage: huffman (-c|-d) input output");
      puts(" -c Compress file from input to output");
      puts(" -d Uncompress file from input to output");
      puts("nCreated by iBug");
      }

      int main(int argc, char** argv) {
      if (argc != 4) {
      print_help();
      return 1;
      }
      if (!strcmp(argv[1], "-c")) {
      compress_file(argv[2], argv[3]);
      } else if (!strcmp(argv[1], "-d")) {
      uncompress_file(argv[2], argv[3]);
      } else {
      print_help();
      return 1;
      }
      return 0;
      }


      In addition to the mandatory CC BY-SA 3.0 license by posting on Stack Exchange, the code itself also has a MIT license.



      On a side note: Although the course has ended and this code is not maintained anymore, it's still one of the programs that I have written with maximum attention and carefulness, so I believe that any feedback to this code is highly valuable and I will remember them in my future C-coding times.










      share|improve this question















      In a past course one of the assignments was to write a program that can compress files using Huffman Tree algorithm, and uncompress the files that the program generates.



      My design is to count the byte occurrences first, then construct a HT based on the counted byte frequency.



      My compressed file format is 256*4 bytes of "header" that stores the counted frequency, so it can be used to construct the tree when decompressing the file. Then there's a 4-byte integer that indicates how many bits of the last byte is real data. The rest is the real (compressed) data.



      Here is this specific version* of code that I want some feedback. Later versions introduced many messy changes (like GUI and buffered I/O) that is not necessary.



      Specifically, I'm looking for feedback on my algorithm and data structure implementation, including but not limited to code style, best practices, potential flaws and defects (see below).




      • An exception is the last two functions print_help and main. They're intended to be as simple as possible, so they contain the bare minimum amount of code to work in a reasonable way. Data validation and error checking etc. are omitted on purpose.


      In order to simplify the idea, during designing and coding, I have assumed that




      • the program will not be told to uncompress an invalid file, so there's no file validity check in the code

      • file availability is ensured by the environment. It will always be a regular file, with no chance of generating a read error mid-way

      • C library functions does not fail for environmental reasons (e.g. host is short of RAM for malloc(3), target disk out of space for fwrite(3) and consequently write(2), or fread(3) as said above)

      • reading/writing byte-by-byte is fine, because a later version of this code introduced chunk I/O and got a bit messier (I think). Suggestions on making the code run faster without implementing chunk I/O is welcome


      so I'm also not looking for feedbacks regarding the above things that I have assumed / intentionally ignored.



      I have ensured that the code is working properly, with no warnings when compiled with this command (taken from make output)



      gcc -O3 -std=c11 -Wall -Wno-unused-result -o huffman huffman.c


      The last option is to suppress the warning about unused result from fread(3).



      During my coding process, I run clang-format occasionally and diff the output and my written code to check for potentially bad indentation / styling issues. I am not confident if it can solve everything.



      * The link points to my GitHub repo. The code on that page is identical to the code submitted below verbatim.



      // File: huffman.c
      // Author: iBug

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <stdint.h>

      typedef unsigned char byte;

      typedef struct _HuffNode {
      unsigned data;
      struct _HuffNode *left, *right, *parent;
      } HuffNode;

      void count_frequency(FILE* fp, unsigned* freq) {
      size_t orig_pos = ftell(fp);
      int ch;
      while (1) {
      ch = fgetc(fp);
      if (ch < 0)
      break;
      freq[ch]++;
      }
      fseek(fp, orig_pos, SEEK_SET);
      }

      void construct_huffman(unsigned* freq_in, HuffNode* tree) {
      int count = 256;
      unsigned freq[256];
      HuffNode *node[256];

      // Initialize data
      for (int i = 0; i < 256; i++) {
      freq[i] = freq_in[i];
      tree[i].data = i;
      tree[i].left = tree[i].right = NULL;
      node[i] = &tree[i];
      }

      // Sort by frequency, decreasing order
      /* WARNING: Although this Quick Sort is an unstable sort,
      * it should at least give the same result for the same input frequency table,
      * therefore I'm leaving this code here
      */
      {
      unsigned lower[256], upper[256], top = 1;
      lower[0] = 0, upper[0] = 256;
      while (top > 0) {
      top--;
      int left = lower[top], right = upper[top];
      int i = left, j = right - 1, flag = 0;
      if (i >= j) // Nothing to sort
      continue;
      while (i < j) {
      if (freq[i] < freq[j]) {
      unsigned t = freq[i]; freq[i] = freq[j]; freq[j] = t;
      HuffNode *p = node[i]; node[i] = node[j]; node[j] = p;
      flag = !flag;
      }
      flag ? i++ : j--;
      }
      lower[top] = left, upper[top] = i;
      lower[top + 1] = i + 1, upper[top + 1] = right;
      top += 2;
      }
      }

      // Construct tree
      while (count > 1) {
      int pos = 512 - count;
      HuffNode *parent = &tree[pos];
      // Select lowest 2 by freq
      int i = count - 2, j = count - 1;
      // Create tree, lower freq left
      parent->left = node[j]; parent->right = node[i];
      node[j]->parent = node[i]->parent = parent;
      node[i] = parent;
      freq[i] += freq[j];
      // Insert
      for (; i > 0 && freq[i] > freq[i - 1]; i--) {
      unsigned t = freq[i]; freq[i] = freq[i - 1]; freq[i - 1] = t;
      HuffNode *p = node[i]; node[i] = node[i - 1]; node[i - 1] = p;
      }
      count--;
      }
      // Now HEAD = node[0] = tree[511]
      node[0]->parent = NULL;
      }

      void encode_stream(FILE* fin, FILE* fout, HuffNode* tree, unsigned* padding) {
      int n;
      byte ch;
      byte buf = 0, nbuf = 0;
      HuffNode *p;
      byte code[256];
      while (1) {
      n = fgetc(fin);
      if (n < 0)
      break;
      ch = n;

      // Encode
      p = &tree[ch];
      n = 0;
      while (p->parent) {
      if (p == p->parent->left) {
      // Left is 0
      code[n] = 0;
      } else if (p == p->parent->right) {
      code[n] = 1;
      }
      p = p->parent;
      n++;
      }

      // Write
      for (int i = n - 1; i >= 0; i--) {
      buf |= code[i] << nbuf;
      nbuf++;
      if (nbuf == 8) {
      fputc(buf, fout);
      nbuf = buf = 0;
      }
      }
      }
      fputc(buf, fout);
      *padding = 8 - nbuf;
      }

      void decode_stream(FILE* fin, FILE* fout, HuffNode* tree, unsigned padding) {
      size_t startpos = ftell(fin); // should be 1028
      fseek(fin, 0L, SEEK_END);
      size_t endpos = ftell(fin); // last byte handling
      fseek(fin, startpos, SEEK_SET);
      int count = endpos - startpos;

      byte buf = 0, nbuf = 0, bit;
      HuffNode *p;
      while (count > 0 || nbuf > 0) {
      // Start from tree top
      p = tree + 510;
      while (p->left || p->right) {
      // Prepare next bit if needed
      if (nbuf == 0) {
      if (count <= 0)
      return;

      buf = fgetc(fin);
      if (count == 1) {
      // Last bit
      nbuf = 8 - padding;
      if (nbuf == 0) {
      return;
      }
      } else {
      nbuf = 8;
      }
      count--;
      }
      // p has child
      bit = buf & 1;
      buf >>= 1;
      nbuf--;
      if (bit == 0)
      p = p->left;
      else
      p = p->right;
      }
      fputc(p->data, fout);
      }
      }

      void compress_file(const char* filename, const char* newname) {
      FILE *fin = fopen(filename, "rb"),
      *fout = fopen(newname, "wb");

      unsigned freq[256], padding;
      HuffNode tree[512];
      size_t padding_pos;
      count_frequency(fin, freq);
      construct_huffman(freq, tree);
      rewind(fin);
      for (int i = 0; i < 256; i++)
      fwrite(freq + i, 4, 1, fout);
      // Write a placeholder for the padding
      padding_pos = ftell(fout);
      fwrite(&padding, 4, 1, fout);
      encode_stream(fin, fout, tree, &padding);
      // Write the padding to the placeholder
      fseek(fout, padding_pos, SEEK_SET);
      fwrite(&padding, 4, 1, fout);
      fclose(fin);
      fclose(fout);
      }

      void uncompress_file(const char* filename, const char* newname) {
      FILE *fin = fopen(filename, "rb"),
      *fout = fopen(newname, "wb");

      unsigned freq[256], padding;
      HuffNode tree[512];
      for (int i = 0; i < 256; i++) {
      fread(&padding, 4, 1, fin);
      freq[i] = padding;
      }
      fread(&padding, 4, 1, fin);
      construct_huffman(freq, tree);
      decode_stream(fin, fout, tree, padding);
      fclose(fin);
      fclose(fout);
      }

      void print_help(void) {
      puts("Usage: huffman (-c|-d) input output");
      puts(" -c Compress file from input to output");
      puts(" -d Uncompress file from input to output");
      puts("nCreated by iBug");
      }

      int main(int argc, char** argv) {
      if (argc != 4) {
      print_help();
      return 1;
      }
      if (!strcmp(argv[1], "-c")) {
      compress_file(argv[2], argv[3]);
      } else if (!strcmp(argv[1], "-d")) {
      uncompress_file(argv[2], argv[3]);
      } else {
      print_help();
      return 1;
      }
      return 0;
      }


      In addition to the mandatory CC BY-SA 3.0 license by posting on Stack Exchange, the code itself also has a MIT license.



      On a side note: Although the course has ended and this code is not maintained anymore, it's still one of the programs that I have written with maximum attention and carefulness, so I believe that any feedback to this code is highly valuable and I will remember them in my future C-coding times.







      algorithm c tree compression






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Dec 31 '18 at 20:09









      200_success

      129k15152414




      129k15152414










      asked Dec 31 '18 at 18:00









      iBugiBug

      2039




      2039






















          3 Answers
          3






          active

          oldest

          votes


















          6














          Use standard types



          You define this:



          typedef unsigned char byte;


          But you also already #include <stdint.h>. As such, you have access to uint8_t, which is more explicit about its size than char.



          Identify local functions



          If all of your functions are in the same translation unit (which seems to be the case), make them static.



          Forever loops



          This is quite minor, but while (1) isn't my favourite way of defining a forever loop. You can either do for (;;) or while (true), after having included stdbool.h. 1 is less expressive than true.



          Unite declaration and initialization



          int ch;
          while (1) {
          ch = fgetc(fp);


          can be



          while (true) {
          int ch = fgetc(fp);


          Define magic numbers



          Make a #define or a global const int for 256. It's used all over your code, and it'd be good to replace that with a symbol for legibility and maintainability.



          Do one thing at a time



          lower[0] = 0, upper[0] = 256;
          // ...
          unsigned t = freq[i]; freq[i] = freq[j]; freq[j] = t;


          It's rarely an improvement to the legibility of code to do multi-statement lines. Just put these on separate lines.



          Sanitize loop counters



          You have:



          unsigned top = 1;
          // ...
          while (top > 0) {
          top--;
          // ...`continue`s under certain conditions...
          top += 2;
          }


          This is a little wacky. Just do:



          for (int top = 0; top >= 0; top--) {
          // ... continue under certain conditions...
          top += 2;
          }


          I think those are equivalent.



          Sanitize logic



                  int flag = 0;
          while (...) {
          if (freq[i] < freq[j]) {
          //...
          flag = !flag;
          }
          flag ? i++ : j--;
          }


          This is ternary abuse. The flag should be a bool (from stdbool.h). Rewrite this as:



          bool flag = false;
          while (...) {
          if (freq[i] < freq[j]) {
          //...
          flag = !flag;
          }
          if (flag) i++;
          else j--;
          }


          Choose a pointer spacing standard



          You do:



          void compress_file(const char* filename, const char* newname) {


          but you also do:



          FILE *fin = fopen(filename, "rb"),
          *fout = fopen(newname, "wb");


          Personally I like the latter; either way, you should pick a standard and apply it everywhere.



          Let fwrite do the iteration



          This:



          for (int i = 0; i < 256; i++)
          fwrite(freq + i, 4, 1, fout);


          is equivalent to:



          fwrite(freq, 4, 256, fout);





          share|improve this answer





























            6














            Header size



            256*4 bytes is very big for a header. The size could be reduced substantially by using one or several of these common techniques:




            • Store the code length instead of symbol frequency. These definitely won't need 32 bits each, 8 would already be a lot. You can pack them in 4 bits each if you set a length limit of 15. Storing lengths is not ambiguous because you can use canonical Huffman codes (there is an easy algorithm to generate them from your table of code lengths, discarding the code itself).

            • Compress the header with delta encoding: storing the length difference between subsequent codes, using a variable-length encoding. Small differences tend to be more common. (seen in eg DEFLATE)

            • Remove most zero-lengths from the header, by first storing a sparse bitmap that indicates which symbols occur in the file. (seen in eg bzip2)


            Encoding process



            Walking up the tree for every byte of the file is needlessly inefficient. You could precompute an array of codes and lengths once in advance and then use the array during encoding. The code could be represented as a single unsigned integer, no array necessary (it won't be that long, and you will want to limit the code lengths anyway for decoding and header reasons). It can be prepended to buf in a couple of simple bitwise operations, similar to how you currently add a single bit, but nbuf++ turns into nbuf += codelength. Together this lets the encoding process take a constant number of operations instead of scaling linearly with the code length.



            Decoding process



            Currently your code implements bit-by-bit tree walking, which is (as one source puts it) dead slow. The alternative is decoding several bits at the same time by using an array lookup. There are a lot of subtly different ways to do that, but the basis of all of them is that part of the buffer is used to index into a table. Limiting the maximum length of the codes is very useful, because with a limited length you can guarantee that decoding is a single-step process, resolving one symbol from the buffer in a constant number of operations, with no looping.



            Some possible relevant sources for these techniques are the one in the previous paragraph and:




            • Introduction to table based Huffman decoding

            • An efficient algorithm of Huffman decoder with nearly constant decoding time

            • Huffman revisited - Part 2 : the Decoder


            • A Fast and Space - Economical Algorithm for Length - Limited Coding (for a way to generate the code lengths with a length limit)






            share|improve this answer





























              0














              Harold has given a pretty comprehensive answer. I thought your code for building the Huffman tree was a little obscure, the usual approach is to use a "heap" which efficiently maintains the smallest element of a array in position zero.



              I have been working on an implementation of RFC 1951 in C#, so far I have found that it's fairly unusual for the bit length limits to be exceeded in practice ( I am using it to write PDF files ), so my approach was simply to try a smaller block size if the limits were exceeded.



              Here's my C# "deflate" code for comparison ( hope it's alright to post code as part of a reply, I am new here ):



              using G = System.Collections.Generic; 
              using U = System.UInt16; // Could also be UInt32 or UInt64, not sure what is best.

              class Encoder : G.List<byte> // Compression per RFC1950, RFC1951.
              {
              public G.List<byte> Deflate( byte inp )
              {
              Clear();
              Add( 0x78); Add( 0x9C ); // RFC 1950 bytes
              ReadInp( inp );
              while ( DoOutput( 1 ) == 0 );
              FlushBitBuf();
              Put32( Adler32( inp ) ); // RFC 1950 checksum
              // System.Console.WriteLine( "Encoder.Deflate, input size=" + inp.Length + " output size=" + this.Count );
              return this;
              }

              class OffsetList{ public uint Offset; public OffsetList Next; } // List of 3-byte match offsets.

              void ReadInp( byte inp ) // LZ77 compression, per RFC1951
              {
              G.Dictionary <uint,OffsetList> lookup = new G.Dictionary<uint,OffsetList>(); // Note: could reduce storage requirement by removing entries outside 32k window
              uint n = (uint)inp.Length;
              SetIBufSize( n );

              uint w = 0; // holds last 3 bytes of input
              int todo = 0; // number of bytes in w that have not yet been output to IBuf, can be negative when a match is found
              uint pendingMatchLen=0, pendingDist=0;

              for ( uint i = 0; i <= 1 && i < n; i += 1 ) { w = ( ( w << 8 ) | inp[i] ); todo += 1; }

              for ( uint i = 2; i < n; i += 1 )
              {
              w = ( ( w << 8 ) | inp[i] ) & 0xffffffu; todo += 1;
              OffsetList e, x = new OffsetList(); x.Offset = i;
              uint bestMatchLen = 0, bestDist = 0;
              if ( lookup.TryGetValue( w, out e ) )
              {
              x.Next = e;
              OffsetList p = x;
              if ( todo >= 3 ) while ( e != null )
              {
              uint dist = i - e.Offset; if ( dist > 32768 ) { p.Next = null; break; }
              uint matchLen = MatchLen( inp, dist, i );
              if ( matchLen > bestMatchLen ) { bestMatchLen = matchLen; bestDist = dist; }
              p = e; e = e.Next;
              }
              }
              lookup[ w ] = x; ISpace();

              // "Lazy matching" RFC 1951 p.15 : if there are overlapping matches, there is a choice over which of the match to use.
              // Example: abc012bc345.... abc345 : abc345 can be encoded as either [abc][345] or as a[bc345].
              // Since a range needs more bits to encode than a literal the latter is better.

              if ( pendingMatchLen > 0 )
              {
              if ( bestMatchLen > pendingMatchLen || bestMatchLen == pendingMatchLen && bestDist < pendingDist )
              { IPut( inp[i-3] ); todo -= 1; }
              else // Save the pending match, suppress bestMatch if any.
              {
              IPut( (ushort)(257 + pendingMatchLen) );
              IPut( (ushort) pendingDist );
              todo -= (int)pendingMatchLen;
              bestMatchLen = 0;
              }
              pendingMatchLen = 0;
              }
              if ( bestMatchLen > 0 ) { pendingMatchLen = bestMatchLen; pendingDist = bestDist; }
              else if ( todo == 3 ) { IPut( (byte)(w >> 16) ); todo = 2; }
              } // end for loop
              if ( pendingMatchLen > 0 )
              {
              IPut( (ushort)(257 + pendingMatchLen) );
              IPut( (ushort) pendingDist );
              todo -= (int)pendingMatchLen;
              }
              while ( todo > 0 ){ todo -= 1; IPut( (byte)( w >> (todo*8) ) ); }
              } // end ReadInp

              uint MatchLen( byte inp, uint dist, uint i )
              {
              // From lookup, we already have a match of 3 bytes, this function computes how many more bytes match.
              uint x = i+1;
              ulong end = (ulong)inp.Length;
              if ( end - i > 256 ) end = i + 256; // Maximum match is 258
              while ( x < end && inp[x] == inp[x-dist] ) x += 1;
              return x - i + 2;
              }

              ushort IBuf; // Intermediate buffer, holds output from LZ99 algorithm.
              const uint IBufSizeMax = 0x40000;
              uint IBufSize, IBufI, IBufJ;
              void IPut( ushort x ) { IBuf[IBufI++] = x; if ( IBufI == IBufSize ) IBufI = 0; }
              ushort IGet(){ ushort result = IBuf[IBufJ++]; if ( IBufJ == IBufSize ) IBufJ = 0; return result; }
              uint ICount(){ if ( IBufI >= IBufJ ) return IBufI - IBufJ; else return IBufI + IBufSize - IBufJ; } // Number of values in IBuf
              void ISpace(){ while ( ICount() > IBufSize - 10 ) DoOutput( 0 ); } // Ensure IBuf has space for at least 10 values.
              void SetIBufSize( uint x ) { x += 20; if ( x > IBufSizeMax ) x = IBufSizeMax; if ( IBufSize < x ) { IBufSize = x; IBuf = new ushort[x]; } }

              U DoOutput( U last ) // while DoBlock fails, retry with a smaller amount of input
              {
              uint n = ICount();
              while ( !DoBlock( n, last ) ) { last = 0; n -= n / 20; }
              return last;
              }

              ///////////////////////////////////////////////////////////////////////////////
              // RFC 1951 encoding constants.

              static readonly byte ClenAlphabet = { 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 }; // size = 19
              static readonly byte MatchExtra = { 0,0,0,0, 0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4, 5,5,5,5, 0 }; // size = 29
              static readonly ushort MatchOff = { 3,4,5,6, 7,8,9,10, 11,13,15,17, 19,23,27,31, 35,43,51,59, 67,83,99,115, 131,163,195,227, 258, 0xffff };
              static readonly byte DistExtra = { 0,0,0,0, 1,1,2,2, 3,3,4,4, 5,5,6,6, 7,7,8,8, 9,9,10,10, 11,11,12,12, 13,13 }; // size = 30
              static readonly ushort DistOff = { 1,2,3,4, 5,7,9,13, 17,25,33,49, 65,97,129,193, 257,385,513,769, 1025,1537,2049,3073,
              4097,6145,8193,12289, 16385,24577, 0xffff };

              readonly U LitFreq = new U[288], LitLen = new U[288], LitCode = new U[288];
              readonly U DistFreq = new U[32], DistLen = new U[32], DistCode = new U[32];
              readonly U LenFreq = new U[19], LenLen = new U[19], LenCode = new U[19];

              bool DoBlock( uint n, U last )
              {
              Clear( LitFreq ); Clear( DistFreq ); Clear( LenFreq );
              uint saveI = IBufI, saveJ = IBufJ;
              int got = 0; while ( got < n )
              {
              ushort x = IGet(); got += 1;
              if ( x < 256 ) LitFreq[ x ] += 1;
              else
              {
              x -= 257;
              uint dist = IGet(); got += 1;
              uint mc=0; while ( x >= MatchOff[mc] ) mc += 1; mc -= 1;
              uint dc=0; while ( dist >= DistOff[dc] ) dc += 1; dc -= 1;
              LitFreq[ 257+mc ] += 1;
              DistFreq[ dc ] += 1;
              }
              }
              LitFreq[256] += 1; // end of block code
              IBufI = saveI; IBufJ = saveJ;

              int nLitCode = HE.ComputeCode( 15, LitFreq, LitLen, LitCode ); if ( nLitCode < 0 ) return false;
              int nDistCode = HE.ComputeCode( 15, DistFreq, DistLen, DistCode ); if ( nDistCode < 0 ) return false;

              if ( nDistCode == 0 ) nDistCode = 1;
              LenPass = 1; DoLengths( nLitCode, LitLen, true ); DoLengths( nDistCode, DistLen, false );

              if ( HE.ComputeCode( 7, LenFreq, LenLen, LenCode ) < 0 ) return false;

              int nLenCode = 19;
              while ( nLenCode > 0 && LenLen[ ClenAlphabet[nLenCode-1] ] == 0 ) nLenCode -= 1;

              PutBit( last ); // Last block flag
              PutBits( 2, 2 ); // Dynamic Huffman ( for small blocks fixed coding may work better, not implemented )

              PutBits( 5, (U)( nLitCode - 257 ) ); PutBits( 5, (U)( nDistCode - 1 ) ); PutBits( 4, (U)( nLenCode - 4 ) );
              for ( int i=0; i < nLenCode; i += 1 ) PutBits( 3, LenLen[ ClenAlphabet[i] ] );
              LenPass = 2; DoLengths( nLitCode, LitLen, true ); DoLengths( nDistCode, DistLen, false );

              got = 0; while ( got < n )
              {
              U x = IGet(); got += 1;
              if ( x < 256 ) PutBits( LitLen[x], LitCode[x] );
              else
              {
              x -= 257;
              ushort dist = IGet(); got += 1;
              uint mc=0; while ( x >= MatchOff[mc] ) mc += 1; mc -= 1;
              uint dc=0; while ( dist >= DistOff[dc] ) dc += 1; dc -= 1;
              PutBits( LitLen[257+mc], LitCode[257+mc] );
              PutBits( MatchExtra[mc], (U)(x-MatchOff[mc]) );
              PutBits( DistLen[dc], DistCode[dc] );
              PutBits( DistExtra[dc], (U)(dist-DistOff[dc]) );
              }
              }
              PutBits( LitLen[256], LitCode[256] ); // block end code
              return true;
              } // end DoBlock

              // Encoding of code lengths ( RFC 1951, page 13 ).

              U LenPass, Plen, ZeroRun, Repeat;

              void PutLenCode( U code ) { if ( LenPass == 1 ) LenFreq[code] += 1; else PutBits( LenLen[code], LenCode[code] ); }

              void DoLengths( int n, U a, bool isLit )
              {
              if ( isLit ) { Plen = 0; ZeroRun = 0; Repeat = 0; }
              for ( int i=0; i<n; i += 1 )
              {
              U len = a[i];
              if ( len == 0 ){ EncRepeat(); ZeroRun += 1; Plen = 0; }
              else if ( len == Plen ) { Repeat += 1; }
              else { EncZeroRun(); EncRepeat(); PutLenCode( len ); Plen = len; }
              }
              if ( !isLit ) { EncZeroRun(); EncRepeat(); }
              }

              void EncRepeat()
              {
              while ( Repeat > 0 )
              {
              if ( Repeat < 3 ) { PutLenCode( Plen ); Repeat -= 1; }
              else { U x = Repeat; if ( x > 6 ) x = 6; PutLenCode( 16 ); if ( LenPass == 2 ) PutBits( 2,(U)(x-3) ); Repeat -= x; }
              }
              }

              void EncZeroRun()
              {
              while ( ZeroRun > 0 )
              {
              if ( ZeroRun < 3 ) { PutLenCode( 0 ); ZeroRun -= 1; }
              else if ( ZeroRun < 11 ) { PutLenCode( 17 ); if ( LenPass == 2 ) PutBits( 3, (U)(ZeroRun-3) ); ZeroRun = 0; }
              else { U x = ZeroRun; if ( x > 138 ) x = 138; PutLenCode( 18 ); if ( LenPass == 2 ) PutBits( 7,(U)(x-11) ); ZeroRun -= x; }
              }
              }

              static void Clear( U a ){ System.Array.Clear( a, 0, a.Length ); /*for ( int i=0; i < a.Length; i += 1 ) a[i] = 0;*/ }

              public static uint Adler32( byte b ) // per RFC1950
              {
              uint s1=1, s2=0;
              for ( int i=0; i<b.Length; i+= 1 )
              {
              s1 = ( s1 + b[i] ) % 65521;
              s2 = ( s2 + s1 ) % 65521;
              }
              return s2*65536 + s1;
              }

              // Output bitstream
              byte Buf = 0, M = 1;
              public void PutBit( U b ) { if ( b != 0 ) Buf |= M; M <<= 1; if ( M == 0 ) { Add(Buf); Buf = 0; M = 1; } }
              public void PutBits( U n, U x ) { for ( int i = 0; i < n; i += 1 ) { PutBit( (U)(x & 1u) ); x >>= 1; } }
              public void FlushBitBuf(){ while ( M != 1 ) PutBit( 0 ); }
              public void Put32( uint x ) { Add( (byte)( x >> 24 ) ); Add( (byte)( x >> 16 ) ); Add( (byte)( x >> 8 ) ); Add( (byte) x ); }
              } // end class Encoder

              ////////////////////////////////////////////////////////////////////////////////////////////////////

              class HE // Given a list of frequencies (freq), compute Huffman code lengths (nbits) and codes (tree_code).
              {
              public static int ComputeCode( int limit, U freq, U nbits, U tree_code )
              {
              int ncode = freq.Length;
              Node heap = new Node[ncode];
              int hn = 0;
              for ( int i = 0; i < ncode; i += 1 )
              {
              U f = freq[i];
              if ( f > 0 )
              {
              Node n = new Node();
              n.Freq = f;
              n.Code = (U)i;
              HeapInsert( heap, hn, n );
              hn += 1;
              }
              }

              for ( int i = 0; i < nbits.Length; i += 1 ) nbits[i] = 0;
              if ( hn <= 1 ) // Special case
              { if ( hn == 1 ) nbits[ heap[0].Code ] = 1; }
              else
              {
              while ( hn > 1 )
              {
              Node n = new Node();
              hn -= 1; n.Left = HeapRemove( heap, hn );
              hn -= 1; n.Right = HeapRemove( heap, hn );
              n.Freq = (U) ( n.Left.Freq + n.Right.Freq );
              n.Depth = (U) ( 1 + ( n.Left.Depth > n.Right.Depth ? n.Left.Depth : n.Right.Depth ) );
              HeapInsert( heap, hn, n );
              hn += 1;
              }
              Walk( nbits, heap[0], 0 ); // Walk the tree to find the code lengths (nbits).
              }

              for ( int i = 0; i < ncode; i += 1 ) if ( nbits[i] > limit ) return -1;

              // Now compute codes, code below is from rfc1951 page 7

              uint maxBits = 0;
              for ( int i = 0; i < ncode; i += 1 ) if ( nbits[i] > maxBits ) maxBits = nbits[i];

              U bl_count = new U[maxBits+1];
              for ( int i=0; i < ncode; i += 1 ) bl_count[ nbits[i] ] += 1;

              U next_code = new U[maxBits+1];
              U code = 0; bl_count[0] = 0;
              for ( int i = 0; i < maxBits; i += 1 )
              {
              code = (U)( ( code + bl_count[i] ) << 1 );
              next_code[i+1] = code;
              }

              for ( int i = 0; i < ncode; i += 1 )
              {
              uint len = nbits[i];
              if (len != 0)
              {
              tree_code[i] = Reverse( next_code[len], len );
              next_code[len] += 1;
              }
              }

              //System.Console.WriteLine( "Huff Code" );
              // for ( uint i=0; i < ncode; i += 1 ) if ( nbits[i] > 0 )
              // System.Console.WriteLine( "i=" + i + " len=" + nbits[i] + " tc=" + tree_code[i].ToString("X") + " freq=" + freq[i] );

              while ( ncode > 0 && nbits[ ncode-1 ] == 0 ) ncode -= 1;
              return ncode;
              }

              class Node{ public U Freq; public U Code, Depth; public Node Left, Right; }

              static U Reverse( U x, uint bits )
              { U result = 0; for ( int i = 0; i < bits; i += 1 ) { result <<= 1; result |= (U)(x & 1u); x >>= 1; } return result; }

              static void Walk( U a, Node n,U len )
              { if ( n.Left == null ) a[n.Code] = len; else { Walk( a, n.Left, (U)(len+1) ); Walk( a, n.Right, (U)(len+1) ); } }

              static bool LessThan( Node a, Node b )
              { return a.Freq < b.Freq || a.Freq == b.Freq && a.Depth < b.Depth; }

              static void HeapInsert( Node heap, int h, Node n ) // h is size of heap before insertion
              {
              int j = h;
              while ( j > 0 )
              {
              int p = ( j - 1 ) / 2; // parent
              Node pn = heap[p];
              if ( !LessThan(n,pn) ) break;
              heap[j] = pn; // move parent down
              j = p;
              }
              heap[j] = n;
              }

              static Node HeapRemove( Node heap, int h ) // h is size of heap after removal
              {
              Node result = heap[0];
              Node n = heap[h];
              int j = 0;
              while ( true )
              {
              int c = j * 2 + 1; if ( c >= h ) break;
              Node cn = heap[c];
              if ( c+1 < h )
              {
              Node cn2 = heap[c+1];
              if ( LessThan(cn2,cn) ) { c += 1; cn = cn2; }
              }
              if ( !LessThan(cn,n) ) break;
              heap[j] = cn; j = c;
              }
              heap[j] = n;
              return result;
              }
              } // end class HE





              share|improve this answer





















              • (Welcome to Code Review!) (hope it's alright to […], I am new here there you go
                – greybeard
                Jan 1 at 18:40











              Your Answer





              StackExchange.ifUsing("editor", function () {
              return StackExchange.using("mathjaxEditing", function () {
              StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
              StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
              });
              });
              }, "mathjax-editing");

              StackExchange.ifUsing("editor", function () {
              StackExchange.using("externalEditor", function () {
              StackExchange.using("snippets", function () {
              StackExchange.snippets.init();
              });
              });
              }, "code-snippets");

              StackExchange.ready(function() {
              var channelOptions = {
              tags: "".split(" "),
              id: "196"
              };
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function() {
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled) {
              StackExchange.using("snippets", function() {
              createEditor();
              });
              }
              else {
              createEditor();
              }
              });

              function createEditor() {
              StackExchange.prepareEditor({
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: false,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: null,
              bindNavPrevention: true,
              postfix: "",
              imageUploader: {
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              },
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              });


              }
              });














              draft saved

              draft discarded


















              StackExchange.ready(
              function () {
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f210654%2fhuffman-tree-compressing-decompressing-in-c%23new-answer', 'question_page');
              }
              );

              Post as a guest















              Required, but never shown

























              3 Answers
              3






              active

              oldest

              votes








              3 Answers
              3






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              6














              Use standard types



              You define this:



              typedef unsigned char byte;


              But you also already #include <stdint.h>. As such, you have access to uint8_t, which is more explicit about its size than char.



              Identify local functions



              If all of your functions are in the same translation unit (which seems to be the case), make them static.



              Forever loops



              This is quite minor, but while (1) isn't my favourite way of defining a forever loop. You can either do for (;;) or while (true), after having included stdbool.h. 1 is less expressive than true.



              Unite declaration and initialization



              int ch;
              while (1) {
              ch = fgetc(fp);


              can be



              while (true) {
              int ch = fgetc(fp);


              Define magic numbers



              Make a #define or a global const int for 256. It's used all over your code, and it'd be good to replace that with a symbol for legibility and maintainability.



              Do one thing at a time



              lower[0] = 0, upper[0] = 256;
              // ...
              unsigned t = freq[i]; freq[i] = freq[j]; freq[j] = t;


              It's rarely an improvement to the legibility of code to do multi-statement lines. Just put these on separate lines.



              Sanitize loop counters



              You have:



              unsigned top = 1;
              // ...
              while (top > 0) {
              top--;
              // ...`continue`s under certain conditions...
              top += 2;
              }


              This is a little wacky. Just do:



              for (int top = 0; top >= 0; top--) {
              // ... continue under certain conditions...
              top += 2;
              }


              I think those are equivalent.



              Sanitize logic



                      int flag = 0;
              while (...) {
              if (freq[i] < freq[j]) {
              //...
              flag = !flag;
              }
              flag ? i++ : j--;
              }


              This is ternary abuse. The flag should be a bool (from stdbool.h). Rewrite this as:



              bool flag = false;
              while (...) {
              if (freq[i] < freq[j]) {
              //...
              flag = !flag;
              }
              if (flag) i++;
              else j--;
              }


              Choose a pointer spacing standard



              You do:



              void compress_file(const char* filename, const char* newname) {


              but you also do:



              FILE *fin = fopen(filename, "rb"),
              *fout = fopen(newname, "wb");


              Personally I like the latter; either way, you should pick a standard and apply it everywhere.



              Let fwrite do the iteration



              This:



              for (int i = 0; i < 256; i++)
              fwrite(freq + i, 4, 1, fout);


              is equivalent to:



              fwrite(freq, 4, 256, fout);





              share|improve this answer


























                6














                Use standard types



                You define this:



                typedef unsigned char byte;


                But you also already #include <stdint.h>. As such, you have access to uint8_t, which is more explicit about its size than char.



                Identify local functions



                If all of your functions are in the same translation unit (which seems to be the case), make them static.



                Forever loops



                This is quite minor, but while (1) isn't my favourite way of defining a forever loop. You can either do for (;;) or while (true), after having included stdbool.h. 1 is less expressive than true.



                Unite declaration and initialization



                int ch;
                while (1) {
                ch = fgetc(fp);


                can be



                while (true) {
                int ch = fgetc(fp);


                Define magic numbers



                Make a #define or a global const int for 256. It's used all over your code, and it'd be good to replace that with a symbol for legibility and maintainability.



                Do one thing at a time



                lower[0] = 0, upper[0] = 256;
                // ...
                unsigned t = freq[i]; freq[i] = freq[j]; freq[j] = t;


                It's rarely an improvement to the legibility of code to do multi-statement lines. Just put these on separate lines.



                Sanitize loop counters



                You have:



                unsigned top = 1;
                // ...
                while (top > 0) {
                top--;
                // ...`continue`s under certain conditions...
                top += 2;
                }


                This is a little wacky. Just do:



                for (int top = 0; top >= 0; top--) {
                // ... continue under certain conditions...
                top += 2;
                }


                I think those are equivalent.



                Sanitize logic



                        int flag = 0;
                while (...) {
                if (freq[i] < freq[j]) {
                //...
                flag = !flag;
                }
                flag ? i++ : j--;
                }


                This is ternary abuse. The flag should be a bool (from stdbool.h). Rewrite this as:



                bool flag = false;
                while (...) {
                if (freq[i] < freq[j]) {
                //...
                flag = !flag;
                }
                if (flag) i++;
                else j--;
                }


                Choose a pointer spacing standard



                You do:



                void compress_file(const char* filename, const char* newname) {


                but you also do:



                FILE *fin = fopen(filename, "rb"),
                *fout = fopen(newname, "wb");


                Personally I like the latter; either way, you should pick a standard and apply it everywhere.



                Let fwrite do the iteration



                This:



                for (int i = 0; i < 256; i++)
                fwrite(freq + i, 4, 1, fout);


                is equivalent to:



                fwrite(freq, 4, 256, fout);





                share|improve this answer
























                  6












                  6








                  6






                  Use standard types



                  You define this:



                  typedef unsigned char byte;


                  But you also already #include <stdint.h>. As such, you have access to uint8_t, which is more explicit about its size than char.



                  Identify local functions



                  If all of your functions are in the same translation unit (which seems to be the case), make them static.



                  Forever loops



                  This is quite minor, but while (1) isn't my favourite way of defining a forever loop. You can either do for (;;) or while (true), after having included stdbool.h. 1 is less expressive than true.



                  Unite declaration and initialization



                  int ch;
                  while (1) {
                  ch = fgetc(fp);


                  can be



                  while (true) {
                  int ch = fgetc(fp);


                  Define magic numbers



                  Make a #define or a global const int for 256. It's used all over your code, and it'd be good to replace that with a symbol for legibility and maintainability.



                  Do one thing at a time



                  lower[0] = 0, upper[0] = 256;
                  // ...
                  unsigned t = freq[i]; freq[i] = freq[j]; freq[j] = t;


                  It's rarely an improvement to the legibility of code to do multi-statement lines. Just put these on separate lines.



                  Sanitize loop counters



                  You have:



                  unsigned top = 1;
                  // ...
                  while (top > 0) {
                  top--;
                  // ...`continue`s under certain conditions...
                  top += 2;
                  }


                  This is a little wacky. Just do:



                  for (int top = 0; top >= 0; top--) {
                  // ... continue under certain conditions...
                  top += 2;
                  }


                  I think those are equivalent.



                  Sanitize logic



                          int flag = 0;
                  while (...) {
                  if (freq[i] < freq[j]) {
                  //...
                  flag = !flag;
                  }
                  flag ? i++ : j--;
                  }


                  This is ternary abuse. The flag should be a bool (from stdbool.h). Rewrite this as:



                  bool flag = false;
                  while (...) {
                  if (freq[i] < freq[j]) {
                  //...
                  flag = !flag;
                  }
                  if (flag) i++;
                  else j--;
                  }


                  Choose a pointer spacing standard



                  You do:



                  void compress_file(const char* filename, const char* newname) {


                  but you also do:



                  FILE *fin = fopen(filename, "rb"),
                  *fout = fopen(newname, "wb");


                  Personally I like the latter; either way, you should pick a standard and apply it everywhere.



                  Let fwrite do the iteration



                  This:



                  for (int i = 0; i < 256; i++)
                  fwrite(freq + i, 4, 1, fout);


                  is equivalent to:



                  fwrite(freq, 4, 256, fout);





                  share|improve this answer












                  Use standard types



                  You define this:



                  typedef unsigned char byte;


                  But you also already #include <stdint.h>. As such, you have access to uint8_t, which is more explicit about its size than char.



                  Identify local functions



                  If all of your functions are in the same translation unit (which seems to be the case), make them static.



                  Forever loops



                  This is quite minor, but while (1) isn't my favourite way of defining a forever loop. You can either do for (;;) or while (true), after having included stdbool.h. 1 is less expressive than true.



                  Unite declaration and initialization



                  int ch;
                  while (1) {
                  ch = fgetc(fp);


                  can be



                  while (true) {
                  int ch = fgetc(fp);


                  Define magic numbers



                  Make a #define or a global const int for 256. It's used all over your code, and it'd be good to replace that with a symbol for legibility and maintainability.



                  Do one thing at a time



                  lower[0] = 0, upper[0] = 256;
                  // ...
                  unsigned t = freq[i]; freq[i] = freq[j]; freq[j] = t;


                  It's rarely an improvement to the legibility of code to do multi-statement lines. Just put these on separate lines.



                  Sanitize loop counters



                  You have:



                  unsigned top = 1;
                  // ...
                  while (top > 0) {
                  top--;
                  // ...`continue`s under certain conditions...
                  top += 2;
                  }


                  This is a little wacky. Just do:



                  for (int top = 0; top >= 0; top--) {
                  // ... continue under certain conditions...
                  top += 2;
                  }


                  I think those are equivalent.



                  Sanitize logic



                          int flag = 0;
                  while (...) {
                  if (freq[i] < freq[j]) {
                  //...
                  flag = !flag;
                  }
                  flag ? i++ : j--;
                  }


                  This is ternary abuse. The flag should be a bool (from stdbool.h). Rewrite this as:



                  bool flag = false;
                  while (...) {
                  if (freq[i] < freq[j]) {
                  //...
                  flag = !flag;
                  }
                  if (flag) i++;
                  else j--;
                  }


                  Choose a pointer spacing standard



                  You do:



                  void compress_file(const char* filename, const char* newname) {


                  but you also do:



                  FILE *fin = fopen(filename, "rb"),
                  *fout = fopen(newname, "wb");


                  Personally I like the latter; either way, you should pick a standard and apply it everywhere.



                  Let fwrite do the iteration



                  This:



                  for (int i = 0; i < 256; i++)
                  fwrite(freq + i, 4, 1, fout);


                  is equivalent to:



                  fwrite(freq, 4, 256, fout);






                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Jan 1 at 0:34









                  ReinderienReinderien

                  4,171822




                  4,171822

























                      6














                      Header size



                      256*4 bytes is very big for a header. The size could be reduced substantially by using one or several of these common techniques:




                      • Store the code length instead of symbol frequency. These definitely won't need 32 bits each, 8 would already be a lot. You can pack them in 4 bits each if you set a length limit of 15. Storing lengths is not ambiguous because you can use canonical Huffman codes (there is an easy algorithm to generate them from your table of code lengths, discarding the code itself).

                      • Compress the header with delta encoding: storing the length difference between subsequent codes, using a variable-length encoding. Small differences tend to be more common. (seen in eg DEFLATE)

                      • Remove most zero-lengths from the header, by first storing a sparse bitmap that indicates which symbols occur in the file. (seen in eg bzip2)


                      Encoding process



                      Walking up the tree for every byte of the file is needlessly inefficient. You could precompute an array of codes and lengths once in advance and then use the array during encoding. The code could be represented as a single unsigned integer, no array necessary (it won't be that long, and you will want to limit the code lengths anyway for decoding and header reasons). It can be prepended to buf in a couple of simple bitwise operations, similar to how you currently add a single bit, but nbuf++ turns into nbuf += codelength. Together this lets the encoding process take a constant number of operations instead of scaling linearly with the code length.



                      Decoding process



                      Currently your code implements bit-by-bit tree walking, which is (as one source puts it) dead slow. The alternative is decoding several bits at the same time by using an array lookup. There are a lot of subtly different ways to do that, but the basis of all of them is that part of the buffer is used to index into a table. Limiting the maximum length of the codes is very useful, because with a limited length you can guarantee that decoding is a single-step process, resolving one symbol from the buffer in a constant number of operations, with no looping.



                      Some possible relevant sources for these techniques are the one in the previous paragraph and:




                      • Introduction to table based Huffman decoding

                      • An efficient algorithm of Huffman decoder with nearly constant decoding time

                      • Huffman revisited - Part 2 : the Decoder


                      • A Fast and Space - Economical Algorithm for Length - Limited Coding (for a way to generate the code lengths with a length limit)






                      share|improve this answer


























                        6














                        Header size



                        256*4 bytes is very big for a header. The size could be reduced substantially by using one or several of these common techniques:




                        • Store the code length instead of symbol frequency. These definitely won't need 32 bits each, 8 would already be a lot. You can pack them in 4 bits each if you set a length limit of 15. Storing lengths is not ambiguous because you can use canonical Huffman codes (there is an easy algorithm to generate them from your table of code lengths, discarding the code itself).

                        • Compress the header with delta encoding: storing the length difference between subsequent codes, using a variable-length encoding. Small differences tend to be more common. (seen in eg DEFLATE)

                        • Remove most zero-lengths from the header, by first storing a sparse bitmap that indicates which symbols occur in the file. (seen in eg bzip2)


                        Encoding process



                        Walking up the tree for every byte of the file is needlessly inefficient. You could precompute an array of codes and lengths once in advance and then use the array during encoding. The code could be represented as a single unsigned integer, no array necessary (it won't be that long, and you will want to limit the code lengths anyway for decoding and header reasons). It can be prepended to buf in a couple of simple bitwise operations, similar to how you currently add a single bit, but nbuf++ turns into nbuf += codelength. Together this lets the encoding process take a constant number of operations instead of scaling linearly with the code length.



                        Decoding process



                        Currently your code implements bit-by-bit tree walking, which is (as one source puts it) dead slow. The alternative is decoding several bits at the same time by using an array lookup. There are a lot of subtly different ways to do that, but the basis of all of them is that part of the buffer is used to index into a table. Limiting the maximum length of the codes is very useful, because with a limited length you can guarantee that decoding is a single-step process, resolving one symbol from the buffer in a constant number of operations, with no looping.



                        Some possible relevant sources for these techniques are the one in the previous paragraph and:




                        • Introduction to table based Huffman decoding

                        • An efficient algorithm of Huffman decoder with nearly constant decoding time

                        • Huffman revisited - Part 2 : the Decoder


                        • A Fast and Space - Economical Algorithm for Length - Limited Coding (for a way to generate the code lengths with a length limit)






                        share|improve this answer
























                          6












                          6








                          6






                          Header size



                          256*4 bytes is very big for a header. The size could be reduced substantially by using one or several of these common techniques:




                          • Store the code length instead of symbol frequency. These definitely won't need 32 bits each, 8 would already be a lot. You can pack them in 4 bits each if you set a length limit of 15. Storing lengths is not ambiguous because you can use canonical Huffman codes (there is an easy algorithm to generate them from your table of code lengths, discarding the code itself).

                          • Compress the header with delta encoding: storing the length difference between subsequent codes, using a variable-length encoding. Small differences tend to be more common. (seen in eg DEFLATE)

                          • Remove most zero-lengths from the header, by first storing a sparse bitmap that indicates which symbols occur in the file. (seen in eg bzip2)


                          Encoding process



                          Walking up the tree for every byte of the file is needlessly inefficient. You could precompute an array of codes and lengths once in advance and then use the array during encoding. The code could be represented as a single unsigned integer, no array necessary (it won't be that long, and you will want to limit the code lengths anyway for decoding and header reasons). It can be prepended to buf in a couple of simple bitwise operations, similar to how you currently add a single bit, but nbuf++ turns into nbuf += codelength. Together this lets the encoding process take a constant number of operations instead of scaling linearly with the code length.



                          Decoding process



                          Currently your code implements bit-by-bit tree walking, which is (as one source puts it) dead slow. The alternative is decoding several bits at the same time by using an array lookup. There are a lot of subtly different ways to do that, but the basis of all of them is that part of the buffer is used to index into a table. Limiting the maximum length of the codes is very useful, because with a limited length you can guarantee that decoding is a single-step process, resolving one symbol from the buffer in a constant number of operations, with no looping.



                          Some possible relevant sources for these techniques are the one in the previous paragraph and:




                          • Introduction to table based Huffman decoding

                          • An efficient algorithm of Huffman decoder with nearly constant decoding time

                          • Huffman revisited - Part 2 : the Decoder


                          • A Fast and Space - Economical Algorithm for Length - Limited Coding (for a way to generate the code lengths with a length limit)






                          share|improve this answer












                          Header size



                          256*4 bytes is very big for a header. The size could be reduced substantially by using one or several of these common techniques:




                          • Store the code length instead of symbol frequency. These definitely won't need 32 bits each, 8 would already be a lot. You can pack them in 4 bits each if you set a length limit of 15. Storing lengths is not ambiguous because you can use canonical Huffman codes (there is an easy algorithm to generate them from your table of code lengths, discarding the code itself).

                          • Compress the header with delta encoding: storing the length difference between subsequent codes, using a variable-length encoding. Small differences tend to be more common. (seen in eg DEFLATE)

                          • Remove most zero-lengths from the header, by first storing a sparse bitmap that indicates which symbols occur in the file. (seen in eg bzip2)


                          Encoding process



                          Walking up the tree for every byte of the file is needlessly inefficient. You could precompute an array of codes and lengths once in advance and then use the array during encoding. The code could be represented as a single unsigned integer, no array necessary (it won't be that long, and you will want to limit the code lengths anyway for decoding and header reasons). It can be prepended to buf in a couple of simple bitwise operations, similar to how you currently add a single bit, but nbuf++ turns into nbuf += codelength. Together this lets the encoding process take a constant number of operations instead of scaling linearly with the code length.



                          Decoding process



                          Currently your code implements bit-by-bit tree walking, which is (as one source puts it) dead slow. The alternative is decoding several bits at the same time by using an array lookup. There are a lot of subtly different ways to do that, but the basis of all of them is that part of the buffer is used to index into a table. Limiting the maximum length of the codes is very useful, because with a limited length you can guarantee that decoding is a single-step process, resolving one symbol from the buffer in a constant number of operations, with no looping.



                          Some possible relevant sources for these techniques are the one in the previous paragraph and:




                          • Introduction to table based Huffman decoding

                          • An efficient algorithm of Huffman decoder with nearly constant decoding time

                          • Huffman revisited - Part 2 : the Decoder


                          • A Fast and Space - Economical Algorithm for Length - Limited Coding (for a way to generate the code lengths with a length limit)







                          share|improve this answer












                          share|improve this answer



                          share|improve this answer










                          answered Dec 31 '18 at 19:05









                          haroldharold

                          1,09857




                          1,09857























                              0














                              Harold has given a pretty comprehensive answer. I thought your code for building the Huffman tree was a little obscure, the usual approach is to use a "heap" which efficiently maintains the smallest element of a array in position zero.



                              I have been working on an implementation of RFC 1951 in C#, so far I have found that it's fairly unusual for the bit length limits to be exceeded in practice ( I am using it to write PDF files ), so my approach was simply to try a smaller block size if the limits were exceeded.



                              Here's my C# "deflate" code for comparison ( hope it's alright to post code as part of a reply, I am new here ):



                              using G = System.Collections.Generic; 
                              using U = System.UInt16; // Could also be UInt32 or UInt64, not sure what is best.

                              class Encoder : G.List<byte> // Compression per RFC1950, RFC1951.
                              {
                              public G.List<byte> Deflate( byte inp )
                              {
                              Clear();
                              Add( 0x78); Add( 0x9C ); // RFC 1950 bytes
                              ReadInp( inp );
                              while ( DoOutput( 1 ) == 0 );
                              FlushBitBuf();
                              Put32( Adler32( inp ) ); // RFC 1950 checksum
                              // System.Console.WriteLine( "Encoder.Deflate, input size=" + inp.Length + " output size=" + this.Count );
                              return this;
                              }

                              class OffsetList{ public uint Offset; public OffsetList Next; } // List of 3-byte match offsets.

                              void ReadInp( byte inp ) // LZ77 compression, per RFC1951
                              {
                              G.Dictionary <uint,OffsetList> lookup = new G.Dictionary<uint,OffsetList>(); // Note: could reduce storage requirement by removing entries outside 32k window
                              uint n = (uint)inp.Length;
                              SetIBufSize( n );

                              uint w = 0; // holds last 3 bytes of input
                              int todo = 0; // number of bytes in w that have not yet been output to IBuf, can be negative when a match is found
                              uint pendingMatchLen=0, pendingDist=0;

                              for ( uint i = 0; i <= 1 && i < n; i += 1 ) { w = ( ( w << 8 ) | inp[i] ); todo += 1; }

                              for ( uint i = 2; i < n; i += 1 )
                              {
                              w = ( ( w << 8 ) | inp[i] ) & 0xffffffu; todo += 1;
                              OffsetList e, x = new OffsetList(); x.Offset = i;
                              uint bestMatchLen = 0, bestDist = 0;
                              if ( lookup.TryGetValue( w, out e ) )
                              {
                              x.Next = e;
                              OffsetList p = x;
                              if ( todo >= 3 ) while ( e != null )
                              {
                              uint dist = i - e.Offset; if ( dist > 32768 ) { p.Next = null; break; }
                              uint matchLen = MatchLen( inp, dist, i );
                              if ( matchLen > bestMatchLen ) { bestMatchLen = matchLen; bestDist = dist; }
                              p = e; e = e.Next;
                              }
                              }
                              lookup[ w ] = x; ISpace();

                              // "Lazy matching" RFC 1951 p.15 : if there are overlapping matches, there is a choice over which of the match to use.
                              // Example: abc012bc345.... abc345 : abc345 can be encoded as either [abc][345] or as a[bc345].
                              // Since a range needs more bits to encode than a literal the latter is better.

                              if ( pendingMatchLen > 0 )
                              {
                              if ( bestMatchLen > pendingMatchLen || bestMatchLen == pendingMatchLen && bestDist < pendingDist )
                              { IPut( inp[i-3] ); todo -= 1; }
                              else // Save the pending match, suppress bestMatch if any.
                              {
                              IPut( (ushort)(257 + pendingMatchLen) );
                              IPut( (ushort) pendingDist );
                              todo -= (int)pendingMatchLen;
                              bestMatchLen = 0;
                              }
                              pendingMatchLen = 0;
                              }
                              if ( bestMatchLen > 0 ) { pendingMatchLen = bestMatchLen; pendingDist = bestDist; }
                              else if ( todo == 3 ) { IPut( (byte)(w >> 16) ); todo = 2; }
                              } // end for loop
                              if ( pendingMatchLen > 0 )
                              {
                              IPut( (ushort)(257 + pendingMatchLen) );
                              IPut( (ushort) pendingDist );
                              todo -= (int)pendingMatchLen;
                              }
                              while ( todo > 0 ){ todo -= 1; IPut( (byte)( w >> (todo*8) ) ); }
                              } // end ReadInp

                              uint MatchLen( byte inp, uint dist, uint i )
                              {
                              // From lookup, we already have a match of 3 bytes, this function computes how many more bytes match.
                              uint x = i+1;
                              ulong end = (ulong)inp.Length;
                              if ( end - i > 256 ) end = i + 256; // Maximum match is 258
                              while ( x < end && inp[x] == inp[x-dist] ) x += 1;
                              return x - i + 2;
                              }

                              ushort IBuf; // Intermediate buffer, holds output from LZ99 algorithm.
                              const uint IBufSizeMax = 0x40000;
                              uint IBufSize, IBufI, IBufJ;
                              void IPut( ushort x ) { IBuf[IBufI++] = x; if ( IBufI == IBufSize ) IBufI = 0; }
                              ushort IGet(){ ushort result = IBuf[IBufJ++]; if ( IBufJ == IBufSize ) IBufJ = 0; return result; }
                              uint ICount(){ if ( IBufI >= IBufJ ) return IBufI - IBufJ; else return IBufI + IBufSize - IBufJ; } // Number of values in IBuf
                              void ISpace(){ while ( ICount() > IBufSize - 10 ) DoOutput( 0 ); } // Ensure IBuf has space for at least 10 values.
                              void SetIBufSize( uint x ) { x += 20; if ( x > IBufSizeMax ) x = IBufSizeMax; if ( IBufSize < x ) { IBufSize = x; IBuf = new ushort[x]; } }

                              U DoOutput( U last ) // while DoBlock fails, retry with a smaller amount of input
                              {
                              uint n = ICount();
                              while ( !DoBlock( n, last ) ) { last = 0; n -= n / 20; }
                              return last;
                              }

                              ///////////////////////////////////////////////////////////////////////////////
                              // RFC 1951 encoding constants.

                              static readonly byte ClenAlphabet = { 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 }; // size = 19
                              static readonly byte MatchExtra = { 0,0,0,0, 0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4, 5,5,5,5, 0 }; // size = 29
                              static readonly ushort MatchOff = { 3,4,5,6, 7,8,9,10, 11,13,15,17, 19,23,27,31, 35,43,51,59, 67,83,99,115, 131,163,195,227, 258, 0xffff };
                              static readonly byte DistExtra = { 0,0,0,0, 1,1,2,2, 3,3,4,4, 5,5,6,6, 7,7,8,8, 9,9,10,10, 11,11,12,12, 13,13 }; // size = 30
                              static readonly ushort DistOff = { 1,2,3,4, 5,7,9,13, 17,25,33,49, 65,97,129,193, 257,385,513,769, 1025,1537,2049,3073,
                              4097,6145,8193,12289, 16385,24577, 0xffff };

                              readonly U LitFreq = new U[288], LitLen = new U[288], LitCode = new U[288];
                              readonly U DistFreq = new U[32], DistLen = new U[32], DistCode = new U[32];
                              readonly U LenFreq = new U[19], LenLen = new U[19], LenCode = new U[19];

                              bool DoBlock( uint n, U last )
                              {
                              Clear( LitFreq ); Clear( DistFreq ); Clear( LenFreq );
                              uint saveI = IBufI, saveJ = IBufJ;
                              int got = 0; while ( got < n )
                              {
                              ushort x = IGet(); got += 1;
                              if ( x < 256 ) LitFreq[ x ] += 1;
                              else
                              {
                              x -= 257;
                              uint dist = IGet(); got += 1;
                              uint mc=0; while ( x >= MatchOff[mc] ) mc += 1; mc -= 1;
                              uint dc=0; while ( dist >= DistOff[dc] ) dc += 1; dc -= 1;
                              LitFreq[ 257+mc ] += 1;
                              DistFreq[ dc ] += 1;
                              }
                              }
                              LitFreq[256] += 1; // end of block code
                              IBufI = saveI; IBufJ = saveJ;

                              int nLitCode = HE.ComputeCode( 15, LitFreq, LitLen, LitCode ); if ( nLitCode < 0 ) return false;
                              int nDistCode = HE.ComputeCode( 15, DistFreq, DistLen, DistCode ); if ( nDistCode < 0 ) return false;

                              if ( nDistCode == 0 ) nDistCode = 1;
                              LenPass = 1; DoLengths( nLitCode, LitLen, true ); DoLengths( nDistCode, DistLen, false );

                              if ( HE.ComputeCode( 7, LenFreq, LenLen, LenCode ) < 0 ) return false;

                              int nLenCode = 19;
                              while ( nLenCode > 0 && LenLen[ ClenAlphabet[nLenCode-1] ] == 0 ) nLenCode -= 1;

                              PutBit( last ); // Last block flag
                              PutBits( 2, 2 ); // Dynamic Huffman ( for small blocks fixed coding may work better, not implemented )

                              PutBits( 5, (U)( nLitCode - 257 ) ); PutBits( 5, (U)( nDistCode - 1 ) ); PutBits( 4, (U)( nLenCode - 4 ) );
                              for ( int i=0; i < nLenCode; i += 1 ) PutBits( 3, LenLen[ ClenAlphabet[i] ] );
                              LenPass = 2; DoLengths( nLitCode, LitLen, true ); DoLengths( nDistCode, DistLen, false );

                              got = 0; while ( got < n )
                              {
                              U x = IGet(); got += 1;
                              if ( x < 256 ) PutBits( LitLen[x], LitCode[x] );
                              else
                              {
                              x -= 257;
                              ushort dist = IGet(); got += 1;
                              uint mc=0; while ( x >= MatchOff[mc] ) mc += 1; mc -= 1;
                              uint dc=0; while ( dist >= DistOff[dc] ) dc += 1; dc -= 1;
                              PutBits( LitLen[257+mc], LitCode[257+mc] );
                              PutBits( MatchExtra[mc], (U)(x-MatchOff[mc]) );
                              PutBits( DistLen[dc], DistCode[dc] );
                              PutBits( DistExtra[dc], (U)(dist-DistOff[dc]) );
                              }
                              }
                              PutBits( LitLen[256], LitCode[256] ); // block end code
                              return true;
                              } // end DoBlock

                              // Encoding of code lengths ( RFC 1951, page 13 ).

                              U LenPass, Plen, ZeroRun, Repeat;

                              void PutLenCode( U code ) { if ( LenPass == 1 ) LenFreq[code] += 1; else PutBits( LenLen[code], LenCode[code] ); }

                              void DoLengths( int n, U a, bool isLit )
                              {
                              if ( isLit ) { Plen = 0; ZeroRun = 0; Repeat = 0; }
                              for ( int i=0; i<n; i += 1 )
                              {
                              U len = a[i];
                              if ( len == 0 ){ EncRepeat(); ZeroRun += 1; Plen = 0; }
                              else if ( len == Plen ) { Repeat += 1; }
                              else { EncZeroRun(); EncRepeat(); PutLenCode( len ); Plen = len; }
                              }
                              if ( !isLit ) { EncZeroRun(); EncRepeat(); }
                              }

                              void EncRepeat()
                              {
                              while ( Repeat > 0 )
                              {
                              if ( Repeat < 3 ) { PutLenCode( Plen ); Repeat -= 1; }
                              else { U x = Repeat; if ( x > 6 ) x = 6; PutLenCode( 16 ); if ( LenPass == 2 ) PutBits( 2,(U)(x-3) ); Repeat -= x; }
                              }
                              }

                              void EncZeroRun()
                              {
                              while ( ZeroRun > 0 )
                              {
                              if ( ZeroRun < 3 ) { PutLenCode( 0 ); ZeroRun -= 1; }
                              else if ( ZeroRun < 11 ) { PutLenCode( 17 ); if ( LenPass == 2 ) PutBits( 3, (U)(ZeroRun-3) ); ZeroRun = 0; }
                              else { U x = ZeroRun; if ( x > 138 ) x = 138; PutLenCode( 18 ); if ( LenPass == 2 ) PutBits( 7,(U)(x-11) ); ZeroRun -= x; }
                              }
                              }

                              static void Clear( U a ){ System.Array.Clear( a, 0, a.Length ); /*for ( int i=0; i < a.Length; i += 1 ) a[i] = 0;*/ }

                              public static uint Adler32( byte b ) // per RFC1950
                              {
                              uint s1=1, s2=0;
                              for ( int i=0; i<b.Length; i+= 1 )
                              {
                              s1 = ( s1 + b[i] ) % 65521;
                              s2 = ( s2 + s1 ) % 65521;
                              }
                              return s2*65536 + s1;
                              }

                              // Output bitstream
                              byte Buf = 0, M = 1;
                              public void PutBit( U b ) { if ( b != 0 ) Buf |= M; M <<= 1; if ( M == 0 ) { Add(Buf); Buf = 0; M = 1; } }
                              public void PutBits( U n, U x ) { for ( int i = 0; i < n; i += 1 ) { PutBit( (U)(x & 1u) ); x >>= 1; } }
                              public void FlushBitBuf(){ while ( M != 1 ) PutBit( 0 ); }
                              public void Put32( uint x ) { Add( (byte)( x >> 24 ) ); Add( (byte)( x >> 16 ) ); Add( (byte)( x >> 8 ) ); Add( (byte) x ); }
                              } // end class Encoder

                              ////////////////////////////////////////////////////////////////////////////////////////////////////

                              class HE // Given a list of frequencies (freq), compute Huffman code lengths (nbits) and codes (tree_code).
                              {
                              public static int ComputeCode( int limit, U freq, U nbits, U tree_code )
                              {
                              int ncode = freq.Length;
                              Node heap = new Node[ncode];
                              int hn = 0;
                              for ( int i = 0; i < ncode; i += 1 )
                              {
                              U f = freq[i];
                              if ( f > 0 )
                              {
                              Node n = new Node();
                              n.Freq = f;
                              n.Code = (U)i;
                              HeapInsert( heap, hn, n );
                              hn += 1;
                              }
                              }

                              for ( int i = 0; i < nbits.Length; i += 1 ) nbits[i] = 0;
                              if ( hn <= 1 ) // Special case
                              { if ( hn == 1 ) nbits[ heap[0].Code ] = 1; }
                              else
                              {
                              while ( hn > 1 )
                              {
                              Node n = new Node();
                              hn -= 1; n.Left = HeapRemove( heap, hn );
                              hn -= 1; n.Right = HeapRemove( heap, hn );
                              n.Freq = (U) ( n.Left.Freq + n.Right.Freq );
                              n.Depth = (U) ( 1 + ( n.Left.Depth > n.Right.Depth ? n.Left.Depth : n.Right.Depth ) );
                              HeapInsert( heap, hn, n );
                              hn += 1;
                              }
                              Walk( nbits, heap[0], 0 ); // Walk the tree to find the code lengths (nbits).
                              }

                              for ( int i = 0; i < ncode; i += 1 ) if ( nbits[i] > limit ) return -1;

                              // Now compute codes, code below is from rfc1951 page 7

                              uint maxBits = 0;
                              for ( int i = 0; i < ncode; i += 1 ) if ( nbits[i] > maxBits ) maxBits = nbits[i];

                              U bl_count = new U[maxBits+1];
                              for ( int i=0; i < ncode; i += 1 ) bl_count[ nbits[i] ] += 1;

                              U next_code = new U[maxBits+1];
                              U code = 0; bl_count[0] = 0;
                              for ( int i = 0; i < maxBits; i += 1 )
                              {
                              code = (U)( ( code + bl_count[i] ) << 1 );
                              next_code[i+1] = code;
                              }

                              for ( int i = 0; i < ncode; i += 1 )
                              {
                              uint len = nbits[i];
                              if (len != 0)
                              {
                              tree_code[i] = Reverse( next_code[len], len );
                              next_code[len] += 1;
                              }
                              }

                              //System.Console.WriteLine( "Huff Code" );
                              // for ( uint i=0; i < ncode; i += 1 ) if ( nbits[i] > 0 )
                              // System.Console.WriteLine( "i=" + i + " len=" + nbits[i] + " tc=" + tree_code[i].ToString("X") + " freq=" + freq[i] );

                              while ( ncode > 0 && nbits[ ncode-1 ] == 0 ) ncode -= 1;
                              return ncode;
                              }

                              class Node{ public U Freq; public U Code, Depth; public Node Left, Right; }

                              static U Reverse( U x, uint bits )
                              { U result = 0; for ( int i = 0; i < bits; i += 1 ) { result <<= 1; result |= (U)(x & 1u); x >>= 1; } return result; }

                              static void Walk( U a, Node n,U len )
                              { if ( n.Left == null ) a[n.Code] = len; else { Walk( a, n.Left, (U)(len+1) ); Walk( a, n.Right, (U)(len+1) ); } }

                              static bool LessThan( Node a, Node b )
                              { return a.Freq < b.Freq || a.Freq == b.Freq && a.Depth < b.Depth; }

                              static void HeapInsert( Node heap, int h, Node n ) // h is size of heap before insertion
                              {
                              int j = h;
                              while ( j > 0 )
                              {
                              int p = ( j - 1 ) / 2; // parent
                              Node pn = heap[p];
                              if ( !LessThan(n,pn) ) break;
                              heap[j] = pn; // move parent down
                              j = p;
                              }
                              heap[j] = n;
                              }

                              static Node HeapRemove( Node heap, int h ) // h is size of heap after removal
                              {
                              Node result = heap[0];
                              Node n = heap[h];
                              int j = 0;
                              while ( true )
                              {
                              int c = j * 2 + 1; if ( c >= h ) break;
                              Node cn = heap[c];
                              if ( c+1 < h )
                              {
                              Node cn2 = heap[c+1];
                              if ( LessThan(cn2,cn) ) { c += 1; cn = cn2; }
                              }
                              if ( !LessThan(cn,n) ) break;
                              heap[j] = cn; j = c;
                              }
                              heap[j] = n;
                              return result;
                              }
                              } // end class HE





                              share|improve this answer





















                              • (Welcome to Code Review!) (hope it's alright to […], I am new here there you go
                                – greybeard
                                Jan 1 at 18:40
















                              0














                              Harold has given a pretty comprehensive answer. I thought your code for building the Huffman tree was a little obscure, the usual approach is to use a "heap" which efficiently maintains the smallest element of a array in position zero.



                              I have been working on an implementation of RFC 1951 in C#, so far I have found that it's fairly unusual for the bit length limits to be exceeded in practice ( I am using it to write PDF files ), so my approach was simply to try a smaller block size if the limits were exceeded.



                              Here's my C# "deflate" code for comparison ( hope it's alright to post code as part of a reply, I am new here ):



                              using G = System.Collections.Generic; 
                              using U = System.UInt16; // Could also be UInt32 or UInt64, not sure what is best.

                              class Encoder : G.List<byte> // Compression per RFC1950, RFC1951.
                              {
                              public G.List<byte> Deflate( byte inp )
                              {
                              Clear();
                              Add( 0x78); Add( 0x9C ); // RFC 1950 bytes
                              ReadInp( inp );
                              while ( DoOutput( 1 ) == 0 );
                              FlushBitBuf();
                              Put32( Adler32( inp ) ); // RFC 1950 checksum
                              // System.Console.WriteLine( "Encoder.Deflate, input size=" + inp.Length + " output size=" + this.Count );
                              return this;
                              }

                              class OffsetList{ public uint Offset; public OffsetList Next; } // List of 3-byte match offsets.

                              void ReadInp( byte inp ) // LZ77 compression, per RFC1951
                              {
                              G.Dictionary <uint,OffsetList> lookup = new G.Dictionary<uint,OffsetList>(); // Note: could reduce storage requirement by removing entries outside 32k window
                              uint n = (uint)inp.Length;
                              SetIBufSize( n );

                              uint w = 0; // holds last 3 bytes of input
                              int todo = 0; // number of bytes in w that have not yet been output to IBuf, can be negative when a match is found
                              uint pendingMatchLen=0, pendingDist=0;

                              for ( uint i = 0; i <= 1 && i < n; i += 1 ) { w = ( ( w << 8 ) | inp[i] ); todo += 1; }

                              for ( uint i = 2; i < n; i += 1 )
                              {
                              w = ( ( w << 8 ) | inp[i] ) & 0xffffffu; todo += 1;
                              OffsetList e, x = new OffsetList(); x.Offset = i;
                              uint bestMatchLen = 0, bestDist = 0;
                              if ( lookup.TryGetValue( w, out e ) )
                              {
                              x.Next = e;
                              OffsetList p = x;
                              if ( todo >= 3 ) while ( e != null )
                              {
                              uint dist = i - e.Offset; if ( dist > 32768 ) { p.Next = null; break; }
                              uint matchLen = MatchLen( inp, dist, i );
                              if ( matchLen > bestMatchLen ) { bestMatchLen = matchLen; bestDist = dist; }
                              p = e; e = e.Next;
                              }
                              }
                              lookup[ w ] = x; ISpace();

                              // "Lazy matching" RFC 1951 p.15 : if there are overlapping matches, there is a choice over which of the match to use.
                              // Example: abc012bc345.... abc345 : abc345 can be encoded as either [abc][345] or as a[bc345].
                              // Since a range needs more bits to encode than a literal the latter is better.

                              if ( pendingMatchLen > 0 )
                              {
                              if ( bestMatchLen > pendingMatchLen || bestMatchLen == pendingMatchLen && bestDist < pendingDist )
                              { IPut( inp[i-3] ); todo -= 1; }
                              else // Save the pending match, suppress bestMatch if any.
                              {
                              IPut( (ushort)(257 + pendingMatchLen) );
                              IPut( (ushort) pendingDist );
                              todo -= (int)pendingMatchLen;
                              bestMatchLen = 0;
                              }
                              pendingMatchLen = 0;
                              }
                              if ( bestMatchLen > 0 ) { pendingMatchLen = bestMatchLen; pendingDist = bestDist; }
                              else if ( todo == 3 ) { IPut( (byte)(w >> 16) ); todo = 2; }
                              } // end for loop
                              if ( pendingMatchLen > 0 )
                              {
                              IPut( (ushort)(257 + pendingMatchLen) );
                              IPut( (ushort) pendingDist );
                              todo -= (int)pendingMatchLen;
                              }
                              while ( todo > 0 ){ todo -= 1; IPut( (byte)( w >> (todo*8) ) ); }
                              } // end ReadInp

                              uint MatchLen( byte inp, uint dist, uint i )
                              {
                              // From lookup, we already have a match of 3 bytes, this function computes how many more bytes match.
                              uint x = i+1;
                              ulong end = (ulong)inp.Length;
                              if ( end - i > 256 ) end = i + 256; // Maximum match is 258
                              while ( x < end && inp[x] == inp[x-dist] ) x += 1;
                              return x - i + 2;
                              }

                              ushort IBuf; // Intermediate buffer, holds output from LZ99 algorithm.
                              const uint IBufSizeMax = 0x40000;
                              uint IBufSize, IBufI, IBufJ;
                              void IPut( ushort x ) { IBuf[IBufI++] = x; if ( IBufI == IBufSize ) IBufI = 0; }
                              ushort IGet(){ ushort result = IBuf[IBufJ++]; if ( IBufJ == IBufSize ) IBufJ = 0; return result; }
                              uint ICount(){ if ( IBufI >= IBufJ ) return IBufI - IBufJ; else return IBufI + IBufSize - IBufJ; } // Number of values in IBuf
                              void ISpace(){ while ( ICount() > IBufSize - 10 ) DoOutput( 0 ); } // Ensure IBuf has space for at least 10 values.
                              void SetIBufSize( uint x ) { x += 20; if ( x > IBufSizeMax ) x = IBufSizeMax; if ( IBufSize < x ) { IBufSize = x; IBuf = new ushort[x]; } }

                              U DoOutput( U last ) // while DoBlock fails, retry with a smaller amount of input
                              {
                              uint n = ICount();
                              while ( !DoBlock( n, last ) ) { last = 0; n -= n / 20; }
                              return last;
                              }

                              ///////////////////////////////////////////////////////////////////////////////
                              // RFC 1951 encoding constants.

                              static readonly byte ClenAlphabet = { 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 }; // size = 19
                              static readonly byte MatchExtra = { 0,0,0,0, 0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4, 5,5,5,5, 0 }; // size = 29
                              static readonly ushort MatchOff = { 3,4,5,6, 7,8,9,10, 11,13,15,17, 19,23,27,31, 35,43,51,59, 67,83,99,115, 131,163,195,227, 258, 0xffff };
                              static readonly byte DistExtra = { 0,0,0,0, 1,1,2,2, 3,3,4,4, 5,5,6,6, 7,7,8,8, 9,9,10,10, 11,11,12,12, 13,13 }; // size = 30
                              static readonly ushort DistOff = { 1,2,3,4, 5,7,9,13, 17,25,33,49, 65,97,129,193, 257,385,513,769, 1025,1537,2049,3073,
                              4097,6145,8193,12289, 16385,24577, 0xffff };

                              readonly U LitFreq = new U[288], LitLen = new U[288], LitCode = new U[288];
                              readonly U DistFreq = new U[32], DistLen = new U[32], DistCode = new U[32];
                              readonly U LenFreq = new U[19], LenLen = new U[19], LenCode = new U[19];

                              bool DoBlock( uint n, U last )
                              {
                              Clear( LitFreq ); Clear( DistFreq ); Clear( LenFreq );
                              uint saveI = IBufI, saveJ = IBufJ;
                              int got = 0; while ( got < n )
                              {
                              ushort x = IGet(); got += 1;
                              if ( x < 256 ) LitFreq[ x ] += 1;
                              else
                              {
                              x -= 257;
                              uint dist = IGet(); got += 1;
                              uint mc=0; while ( x >= MatchOff[mc] ) mc += 1; mc -= 1;
                              uint dc=0; while ( dist >= DistOff[dc] ) dc += 1; dc -= 1;
                              LitFreq[ 257+mc ] += 1;
                              DistFreq[ dc ] += 1;
                              }
                              }
                              LitFreq[256] += 1; // end of block code
                              IBufI = saveI; IBufJ = saveJ;

                              int nLitCode = HE.ComputeCode( 15, LitFreq, LitLen, LitCode ); if ( nLitCode < 0 ) return false;
                              int nDistCode = HE.ComputeCode( 15, DistFreq, DistLen, DistCode ); if ( nDistCode < 0 ) return false;

                              if ( nDistCode == 0 ) nDistCode = 1;
                              LenPass = 1; DoLengths( nLitCode, LitLen, true ); DoLengths( nDistCode, DistLen, false );

                              if ( HE.ComputeCode( 7, LenFreq, LenLen, LenCode ) < 0 ) return false;

                              int nLenCode = 19;
                              while ( nLenCode > 0 && LenLen[ ClenAlphabet[nLenCode-1] ] == 0 ) nLenCode -= 1;

                              PutBit( last ); // Last block flag
                              PutBits( 2, 2 ); // Dynamic Huffman ( for small blocks fixed coding may work better, not implemented )

                              PutBits( 5, (U)( nLitCode - 257 ) ); PutBits( 5, (U)( nDistCode - 1 ) ); PutBits( 4, (U)( nLenCode - 4 ) );
                              for ( int i=0; i < nLenCode; i += 1 ) PutBits( 3, LenLen[ ClenAlphabet[i] ] );
                              LenPass = 2; DoLengths( nLitCode, LitLen, true ); DoLengths( nDistCode, DistLen, false );

                              got = 0; while ( got < n )
                              {
                              U x = IGet(); got += 1;
                              if ( x < 256 ) PutBits( LitLen[x], LitCode[x] );
                              else
                              {
                              x -= 257;
                              ushort dist = IGet(); got += 1;
                              uint mc=0; while ( x >= MatchOff[mc] ) mc += 1; mc -= 1;
                              uint dc=0; while ( dist >= DistOff[dc] ) dc += 1; dc -= 1;
                              PutBits( LitLen[257+mc], LitCode[257+mc] );
                              PutBits( MatchExtra[mc], (U)(x-MatchOff[mc]) );
                              PutBits( DistLen[dc], DistCode[dc] );
                              PutBits( DistExtra[dc], (U)(dist-DistOff[dc]) );
                              }
                              }
                              PutBits( LitLen[256], LitCode[256] ); // block end code
                              return true;
                              } // end DoBlock

                              // Encoding of code lengths ( RFC 1951, page 13 ).

                              U LenPass, Plen, ZeroRun, Repeat;

                              void PutLenCode( U code ) { if ( LenPass == 1 ) LenFreq[code] += 1; else PutBits( LenLen[code], LenCode[code] ); }

                              void DoLengths( int n, U a, bool isLit )
                              {
                              if ( isLit ) { Plen = 0; ZeroRun = 0; Repeat = 0; }
                              for ( int i=0; i<n; i += 1 )
                              {
                              U len = a[i];
                              if ( len == 0 ){ EncRepeat(); ZeroRun += 1; Plen = 0; }
                              else if ( len == Plen ) { Repeat += 1; }
                              else { EncZeroRun(); EncRepeat(); PutLenCode( len ); Plen = len; }
                              }
                              if ( !isLit ) { EncZeroRun(); EncRepeat(); }
                              }

                              void EncRepeat()
                              {
                              while ( Repeat > 0 )
                              {
                              if ( Repeat < 3 ) { PutLenCode( Plen ); Repeat -= 1; }
                              else { U x = Repeat; if ( x > 6 ) x = 6; PutLenCode( 16 ); if ( LenPass == 2 ) PutBits( 2,(U)(x-3) ); Repeat -= x; }
                              }
                              }

                              void EncZeroRun()
                              {
                              while ( ZeroRun > 0 )
                              {
                              if ( ZeroRun < 3 ) { PutLenCode( 0 ); ZeroRun -= 1; }
                              else if ( ZeroRun < 11 ) { PutLenCode( 17 ); if ( LenPass == 2 ) PutBits( 3, (U)(ZeroRun-3) ); ZeroRun = 0; }
                              else { U x = ZeroRun; if ( x > 138 ) x = 138; PutLenCode( 18 ); if ( LenPass == 2 ) PutBits( 7,(U)(x-11) ); ZeroRun -= x; }
                              }
                              }

                              static void Clear( U a ){ System.Array.Clear( a, 0, a.Length ); /*for ( int i=0; i < a.Length; i += 1 ) a[i] = 0;*/ }

                              public static uint Adler32( byte b ) // per RFC1950
                              {
                              uint s1=1, s2=0;
                              for ( int i=0; i<b.Length; i+= 1 )
                              {
                              s1 = ( s1 + b[i] ) % 65521;
                              s2 = ( s2 + s1 ) % 65521;
                              }
                              return s2*65536 + s1;
                              }

                              // Output bitstream
                              byte Buf = 0, M = 1;
                              public void PutBit( U b ) { if ( b != 0 ) Buf |= M; M <<= 1; if ( M == 0 ) { Add(Buf); Buf = 0; M = 1; } }
                              public void PutBits( U n, U x ) { for ( int i = 0; i < n; i += 1 ) { PutBit( (U)(x & 1u) ); x >>= 1; } }
                              public void FlushBitBuf(){ while ( M != 1 ) PutBit( 0 ); }
                              public void Put32( uint x ) { Add( (byte)( x >> 24 ) ); Add( (byte)( x >> 16 ) ); Add( (byte)( x >> 8 ) ); Add( (byte) x ); }
                              } // end class Encoder

                              ////////////////////////////////////////////////////////////////////////////////////////////////////

                              class HE // Given a list of frequencies (freq), compute Huffman code lengths (nbits) and codes (tree_code).
                              {
                              public static int ComputeCode( int limit, U freq, U nbits, U tree_code )
                              {
                              int ncode = freq.Length;
                              Node heap = new Node[ncode];
                              int hn = 0;
                              for ( int i = 0; i < ncode; i += 1 )
                              {
                              U f = freq[i];
                              if ( f > 0 )
                              {
                              Node n = new Node();
                              n.Freq = f;
                              n.Code = (U)i;
                              HeapInsert( heap, hn, n );
                              hn += 1;
                              }
                              }

                              for ( int i = 0; i < nbits.Length; i += 1 ) nbits[i] = 0;
                              if ( hn <= 1 ) // Special case
                              { if ( hn == 1 ) nbits[ heap[0].Code ] = 1; }
                              else
                              {
                              while ( hn > 1 )
                              {
                              Node n = new Node();
                              hn -= 1; n.Left = HeapRemove( heap, hn );
                              hn -= 1; n.Right = HeapRemove( heap, hn );
                              n.Freq = (U) ( n.Left.Freq + n.Right.Freq );
                              n.Depth = (U) ( 1 + ( n.Left.Depth > n.Right.Depth ? n.Left.Depth : n.Right.Depth ) );
                              HeapInsert( heap, hn, n );
                              hn += 1;
                              }
                              Walk( nbits, heap[0], 0 ); // Walk the tree to find the code lengths (nbits).
                              }

                              for ( int i = 0; i < ncode; i += 1 ) if ( nbits[i] > limit ) return -1;

                              // Now compute codes, code below is from rfc1951 page 7

                              uint maxBits = 0;
                              for ( int i = 0; i < ncode; i += 1 ) if ( nbits[i] > maxBits ) maxBits = nbits[i];

                              U bl_count = new U[maxBits+1];
                              for ( int i=0; i < ncode; i += 1 ) bl_count[ nbits[i] ] += 1;

                              U next_code = new U[maxBits+1];
                              U code = 0; bl_count[0] = 0;
                              for ( int i = 0; i < maxBits; i += 1 )
                              {
                              code = (U)( ( code + bl_count[i] ) << 1 );
                              next_code[i+1] = code;
                              }

                              for ( int i = 0; i < ncode; i += 1 )
                              {
                              uint len = nbits[i];
                              if (len != 0)
                              {
                              tree_code[i] = Reverse( next_code[len], len );
                              next_code[len] += 1;
                              }
                              }

                              //System.Console.WriteLine( "Huff Code" );
                              // for ( uint i=0; i < ncode; i += 1 ) if ( nbits[i] > 0 )
                              // System.Console.WriteLine( "i=" + i + " len=" + nbits[i] + " tc=" + tree_code[i].ToString("X") + " freq=" + freq[i] );

                              while ( ncode > 0 && nbits[ ncode-1 ] == 0 ) ncode -= 1;
                              return ncode;
                              }

                              class Node{ public U Freq; public U Code, Depth; public Node Left, Right; }

                              static U Reverse( U x, uint bits )
                              { U result = 0; for ( int i = 0; i < bits; i += 1 ) { result <<= 1; result |= (U)(x & 1u); x >>= 1; } return result; }

                              static void Walk( U a, Node n,U len )
                              { if ( n.Left == null ) a[n.Code] = len; else { Walk( a, n.Left, (U)(len+1) ); Walk( a, n.Right, (U)(len+1) ); } }

                              static bool LessThan( Node a, Node b )
                              { return a.Freq < b.Freq || a.Freq == b.Freq && a.Depth < b.Depth; }

                              static void HeapInsert( Node heap, int h, Node n ) // h is size of heap before insertion
                              {
                              int j = h;
                              while ( j > 0 )
                              {
                              int p = ( j - 1 ) / 2; // parent
                              Node pn = heap[p];
                              if ( !LessThan(n,pn) ) break;
                              heap[j] = pn; // move parent down
                              j = p;
                              }
                              heap[j] = n;
                              }

                              static Node HeapRemove( Node heap, int h ) // h is size of heap after removal
                              {
                              Node result = heap[0];
                              Node n = heap[h];
                              int j = 0;
                              while ( true )
                              {
                              int c = j * 2 + 1; if ( c >= h ) break;
                              Node cn = heap[c];
                              if ( c+1 < h )
                              {
                              Node cn2 = heap[c+1];
                              if ( LessThan(cn2,cn) ) { c += 1; cn = cn2; }
                              }
                              if ( !LessThan(cn,n) ) break;
                              heap[j] = cn; j = c;
                              }
                              heap[j] = n;
                              return result;
                              }
                              } // end class HE





                              share|improve this answer





















                              • (Welcome to Code Review!) (hope it's alright to […], I am new here there you go
                                – greybeard
                                Jan 1 at 18:40














                              0












                              0








                              0






                              Harold has given a pretty comprehensive answer. I thought your code for building the Huffman tree was a little obscure, the usual approach is to use a "heap" which efficiently maintains the smallest element of a array in position zero.



                              I have been working on an implementation of RFC 1951 in C#, so far I have found that it's fairly unusual for the bit length limits to be exceeded in practice ( I am using it to write PDF files ), so my approach was simply to try a smaller block size if the limits were exceeded.



                              Here's my C# "deflate" code for comparison ( hope it's alright to post code as part of a reply, I am new here ):



                              using G = System.Collections.Generic; 
                              using U = System.UInt16; // Could also be UInt32 or UInt64, not sure what is best.

                              class Encoder : G.List<byte> // Compression per RFC1950, RFC1951.
                              {
                              public G.List<byte> Deflate( byte inp )
                              {
                              Clear();
                              Add( 0x78); Add( 0x9C ); // RFC 1950 bytes
                              ReadInp( inp );
                              while ( DoOutput( 1 ) == 0 );
                              FlushBitBuf();
                              Put32( Adler32( inp ) ); // RFC 1950 checksum
                              // System.Console.WriteLine( "Encoder.Deflate, input size=" + inp.Length + " output size=" + this.Count );
                              return this;
                              }

                              class OffsetList{ public uint Offset; public OffsetList Next; } // List of 3-byte match offsets.

                              void ReadInp( byte inp ) // LZ77 compression, per RFC1951
                              {
                              G.Dictionary <uint,OffsetList> lookup = new G.Dictionary<uint,OffsetList>(); // Note: could reduce storage requirement by removing entries outside 32k window
                              uint n = (uint)inp.Length;
                              SetIBufSize( n );

                              uint w = 0; // holds last 3 bytes of input
                              int todo = 0; // number of bytes in w that have not yet been output to IBuf, can be negative when a match is found
                              uint pendingMatchLen=0, pendingDist=0;

                              for ( uint i = 0; i <= 1 && i < n; i += 1 ) { w = ( ( w << 8 ) | inp[i] ); todo += 1; }

                              for ( uint i = 2; i < n; i += 1 )
                              {
                              w = ( ( w << 8 ) | inp[i] ) & 0xffffffu; todo += 1;
                              OffsetList e, x = new OffsetList(); x.Offset = i;
                              uint bestMatchLen = 0, bestDist = 0;
                              if ( lookup.TryGetValue( w, out e ) )
                              {
                              x.Next = e;
                              OffsetList p = x;
                              if ( todo >= 3 ) while ( e != null )
                              {
                              uint dist = i - e.Offset; if ( dist > 32768 ) { p.Next = null; break; }
                              uint matchLen = MatchLen( inp, dist, i );
                              if ( matchLen > bestMatchLen ) { bestMatchLen = matchLen; bestDist = dist; }
                              p = e; e = e.Next;
                              }
                              }
                              lookup[ w ] = x; ISpace();

                              // "Lazy matching" RFC 1951 p.15 : if there are overlapping matches, there is a choice over which of the match to use.
                              // Example: abc012bc345.... abc345 : abc345 can be encoded as either [abc][345] or as a[bc345].
                              // Since a range needs more bits to encode than a literal the latter is better.

                              if ( pendingMatchLen > 0 )
                              {
                              if ( bestMatchLen > pendingMatchLen || bestMatchLen == pendingMatchLen && bestDist < pendingDist )
                              { IPut( inp[i-3] ); todo -= 1; }
                              else // Save the pending match, suppress bestMatch if any.
                              {
                              IPut( (ushort)(257 + pendingMatchLen) );
                              IPut( (ushort) pendingDist );
                              todo -= (int)pendingMatchLen;
                              bestMatchLen = 0;
                              }
                              pendingMatchLen = 0;
                              }
                              if ( bestMatchLen > 0 ) { pendingMatchLen = bestMatchLen; pendingDist = bestDist; }
                              else if ( todo == 3 ) { IPut( (byte)(w >> 16) ); todo = 2; }
                              } // end for loop
                              if ( pendingMatchLen > 0 )
                              {
                              IPut( (ushort)(257 + pendingMatchLen) );
                              IPut( (ushort) pendingDist );
                              todo -= (int)pendingMatchLen;
                              }
                              while ( todo > 0 ){ todo -= 1; IPut( (byte)( w >> (todo*8) ) ); }
                              } // end ReadInp

                              uint MatchLen( byte inp, uint dist, uint i )
                              {
                              // From lookup, we already have a match of 3 bytes, this function computes how many more bytes match.
                              uint x = i+1;
                              ulong end = (ulong)inp.Length;
                              if ( end - i > 256 ) end = i + 256; // Maximum match is 258
                              while ( x < end && inp[x] == inp[x-dist] ) x += 1;
                              return x - i + 2;
                              }

                              ushort IBuf; // Intermediate buffer, holds output from LZ99 algorithm.
                              const uint IBufSizeMax = 0x40000;
                              uint IBufSize, IBufI, IBufJ;
                              void IPut( ushort x ) { IBuf[IBufI++] = x; if ( IBufI == IBufSize ) IBufI = 0; }
                              ushort IGet(){ ushort result = IBuf[IBufJ++]; if ( IBufJ == IBufSize ) IBufJ = 0; return result; }
                              uint ICount(){ if ( IBufI >= IBufJ ) return IBufI - IBufJ; else return IBufI + IBufSize - IBufJ; } // Number of values in IBuf
                              void ISpace(){ while ( ICount() > IBufSize - 10 ) DoOutput( 0 ); } // Ensure IBuf has space for at least 10 values.
                              void SetIBufSize( uint x ) { x += 20; if ( x > IBufSizeMax ) x = IBufSizeMax; if ( IBufSize < x ) { IBufSize = x; IBuf = new ushort[x]; } }

                              U DoOutput( U last ) // while DoBlock fails, retry with a smaller amount of input
                              {
                              uint n = ICount();
                              while ( !DoBlock( n, last ) ) { last = 0; n -= n / 20; }
                              return last;
                              }

                              ///////////////////////////////////////////////////////////////////////////////
                              // RFC 1951 encoding constants.

                              static readonly byte ClenAlphabet = { 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 }; // size = 19
                              static readonly byte MatchExtra = { 0,0,0,0, 0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4, 5,5,5,5, 0 }; // size = 29
                              static readonly ushort MatchOff = { 3,4,5,6, 7,8,9,10, 11,13,15,17, 19,23,27,31, 35,43,51,59, 67,83,99,115, 131,163,195,227, 258, 0xffff };
                              static readonly byte DistExtra = { 0,0,0,0, 1,1,2,2, 3,3,4,4, 5,5,6,6, 7,7,8,8, 9,9,10,10, 11,11,12,12, 13,13 }; // size = 30
                              static readonly ushort DistOff = { 1,2,3,4, 5,7,9,13, 17,25,33,49, 65,97,129,193, 257,385,513,769, 1025,1537,2049,3073,
                              4097,6145,8193,12289, 16385,24577, 0xffff };

                              readonly U LitFreq = new U[288], LitLen = new U[288], LitCode = new U[288];
                              readonly U DistFreq = new U[32], DistLen = new U[32], DistCode = new U[32];
                              readonly U LenFreq = new U[19], LenLen = new U[19], LenCode = new U[19];

                              bool DoBlock( uint n, U last )
                              {
                              Clear( LitFreq ); Clear( DistFreq ); Clear( LenFreq );
                              uint saveI = IBufI, saveJ = IBufJ;
                              int got = 0; while ( got < n )
                              {
                              ushort x = IGet(); got += 1;
                              if ( x < 256 ) LitFreq[ x ] += 1;
                              else
                              {
                              x -= 257;
                              uint dist = IGet(); got += 1;
                              uint mc=0; while ( x >= MatchOff[mc] ) mc += 1; mc -= 1;
                              uint dc=0; while ( dist >= DistOff[dc] ) dc += 1; dc -= 1;
                              LitFreq[ 257+mc ] += 1;
                              DistFreq[ dc ] += 1;
                              }
                              }
                              LitFreq[256] += 1; // end of block code
                              IBufI = saveI; IBufJ = saveJ;

                              int nLitCode = HE.ComputeCode( 15, LitFreq, LitLen, LitCode ); if ( nLitCode < 0 ) return false;
                              int nDistCode = HE.ComputeCode( 15, DistFreq, DistLen, DistCode ); if ( nDistCode < 0 ) return false;

                              if ( nDistCode == 0 ) nDistCode = 1;
                              LenPass = 1; DoLengths( nLitCode, LitLen, true ); DoLengths( nDistCode, DistLen, false );

                              if ( HE.ComputeCode( 7, LenFreq, LenLen, LenCode ) < 0 ) return false;

                              int nLenCode = 19;
                              while ( nLenCode > 0 && LenLen[ ClenAlphabet[nLenCode-1] ] == 0 ) nLenCode -= 1;

                              PutBit( last ); // Last block flag
                              PutBits( 2, 2 ); // Dynamic Huffman ( for small blocks fixed coding may work better, not implemented )

                              PutBits( 5, (U)( nLitCode - 257 ) ); PutBits( 5, (U)( nDistCode - 1 ) ); PutBits( 4, (U)( nLenCode - 4 ) );
                              for ( int i=0; i < nLenCode; i += 1 ) PutBits( 3, LenLen[ ClenAlphabet[i] ] );
                              LenPass = 2; DoLengths( nLitCode, LitLen, true ); DoLengths( nDistCode, DistLen, false );

                              got = 0; while ( got < n )
                              {
                              U x = IGet(); got += 1;
                              if ( x < 256 ) PutBits( LitLen[x], LitCode[x] );
                              else
                              {
                              x -= 257;
                              ushort dist = IGet(); got += 1;
                              uint mc=0; while ( x >= MatchOff[mc] ) mc += 1; mc -= 1;
                              uint dc=0; while ( dist >= DistOff[dc] ) dc += 1; dc -= 1;
                              PutBits( LitLen[257+mc], LitCode[257+mc] );
                              PutBits( MatchExtra[mc], (U)(x-MatchOff[mc]) );
                              PutBits( DistLen[dc], DistCode[dc] );
                              PutBits( DistExtra[dc], (U)(dist-DistOff[dc]) );
                              }
                              }
                              PutBits( LitLen[256], LitCode[256] ); // block end code
                              return true;
                              } // end DoBlock

                              // Encoding of code lengths ( RFC 1951, page 13 ).

                              U LenPass, Plen, ZeroRun, Repeat;

                              void PutLenCode( U code ) { if ( LenPass == 1 ) LenFreq[code] += 1; else PutBits( LenLen[code], LenCode[code] ); }

                              void DoLengths( int n, U a, bool isLit )
                              {
                              if ( isLit ) { Plen = 0; ZeroRun = 0; Repeat = 0; }
                              for ( int i=0; i<n; i += 1 )
                              {
                              U len = a[i];
                              if ( len == 0 ){ EncRepeat(); ZeroRun += 1; Plen = 0; }
                              else if ( len == Plen ) { Repeat += 1; }
                              else { EncZeroRun(); EncRepeat(); PutLenCode( len ); Plen = len; }
                              }
                              if ( !isLit ) { EncZeroRun(); EncRepeat(); }
                              }

                              void EncRepeat()
                              {
                              while ( Repeat > 0 )
                              {
                              if ( Repeat < 3 ) { PutLenCode( Plen ); Repeat -= 1; }
                              else { U x = Repeat; if ( x > 6 ) x = 6; PutLenCode( 16 ); if ( LenPass == 2 ) PutBits( 2,(U)(x-3) ); Repeat -= x; }
                              }
                              }

                              void EncZeroRun()
                              {
                              while ( ZeroRun > 0 )
                              {
                              if ( ZeroRun < 3 ) { PutLenCode( 0 ); ZeroRun -= 1; }
                              else if ( ZeroRun < 11 ) { PutLenCode( 17 ); if ( LenPass == 2 ) PutBits( 3, (U)(ZeroRun-3) ); ZeroRun = 0; }
                              else { U x = ZeroRun; if ( x > 138 ) x = 138; PutLenCode( 18 ); if ( LenPass == 2 ) PutBits( 7,(U)(x-11) ); ZeroRun -= x; }
                              }
                              }

                              static void Clear( U a ){ System.Array.Clear( a, 0, a.Length ); /*for ( int i=0; i < a.Length; i += 1 ) a[i] = 0;*/ }

                              public static uint Adler32( byte b ) // per RFC1950
                              {
                              uint s1=1, s2=0;
                              for ( int i=0; i<b.Length; i+= 1 )
                              {
                              s1 = ( s1 + b[i] ) % 65521;
                              s2 = ( s2 + s1 ) % 65521;
                              }
                              return s2*65536 + s1;
                              }

                              // Output bitstream
                              byte Buf = 0, M = 1;
                              public void PutBit( U b ) { if ( b != 0 ) Buf |= M; M <<= 1; if ( M == 0 ) { Add(Buf); Buf = 0; M = 1; } }
                              public void PutBits( U n, U x ) { for ( int i = 0; i < n; i += 1 ) { PutBit( (U)(x & 1u) ); x >>= 1; } }
                              public void FlushBitBuf(){ while ( M != 1 ) PutBit( 0 ); }
                              public void Put32( uint x ) { Add( (byte)( x >> 24 ) ); Add( (byte)( x >> 16 ) ); Add( (byte)( x >> 8 ) ); Add( (byte) x ); }
                              } // end class Encoder

                              ////////////////////////////////////////////////////////////////////////////////////////////////////

                              class HE // Given a list of frequencies (freq), compute Huffman code lengths (nbits) and codes (tree_code).
                              {
                              public static int ComputeCode( int limit, U freq, U nbits, U tree_code )
                              {
                              int ncode = freq.Length;
                              Node heap = new Node[ncode];
                              int hn = 0;
                              for ( int i = 0; i < ncode; i += 1 )
                              {
                              U f = freq[i];
                              if ( f > 0 )
                              {
                              Node n = new Node();
                              n.Freq = f;
                              n.Code = (U)i;
                              HeapInsert( heap, hn, n );
                              hn += 1;
                              }
                              }

                              for ( int i = 0; i < nbits.Length; i += 1 ) nbits[i] = 0;
                              if ( hn <= 1 ) // Special case
                              { if ( hn == 1 ) nbits[ heap[0].Code ] = 1; }
                              else
                              {
                              while ( hn > 1 )
                              {
                              Node n = new Node();
                              hn -= 1; n.Left = HeapRemove( heap, hn );
                              hn -= 1; n.Right = HeapRemove( heap, hn );
                              n.Freq = (U) ( n.Left.Freq + n.Right.Freq );
                              n.Depth = (U) ( 1 + ( n.Left.Depth > n.Right.Depth ? n.Left.Depth : n.Right.Depth ) );
                              HeapInsert( heap, hn, n );
                              hn += 1;
                              }
                              Walk( nbits, heap[0], 0 ); // Walk the tree to find the code lengths (nbits).
                              }

                              for ( int i = 0; i < ncode; i += 1 ) if ( nbits[i] > limit ) return -1;

                              // Now compute codes, code below is from rfc1951 page 7

                              uint maxBits = 0;
                              for ( int i = 0; i < ncode; i += 1 ) if ( nbits[i] > maxBits ) maxBits = nbits[i];

                              U bl_count = new U[maxBits+1];
                              for ( int i=0; i < ncode; i += 1 ) bl_count[ nbits[i] ] += 1;

                              U next_code = new U[maxBits+1];
                              U code = 0; bl_count[0] = 0;
                              for ( int i = 0; i < maxBits; i += 1 )
                              {
                              code = (U)( ( code + bl_count[i] ) << 1 );
                              next_code[i+1] = code;
                              }

                              for ( int i = 0; i < ncode; i += 1 )
                              {
                              uint len = nbits[i];
                              if (len != 0)
                              {
                              tree_code[i] = Reverse( next_code[len], len );
                              next_code[len] += 1;
                              }
                              }

                              //System.Console.WriteLine( "Huff Code" );
                              // for ( uint i=0; i < ncode; i += 1 ) if ( nbits[i] > 0 )
                              // System.Console.WriteLine( "i=" + i + " len=" + nbits[i] + " tc=" + tree_code[i].ToString("X") + " freq=" + freq[i] );

                              while ( ncode > 0 && nbits[ ncode-1 ] == 0 ) ncode -= 1;
                              return ncode;
                              }

                              class Node{ public U Freq; public U Code, Depth; public Node Left, Right; }

                              static U Reverse( U x, uint bits )
                              { U result = 0; for ( int i = 0; i < bits; i += 1 ) { result <<= 1; result |= (U)(x & 1u); x >>= 1; } return result; }

                              static void Walk( U a, Node n,U len )
                              { if ( n.Left == null ) a[n.Code] = len; else { Walk( a, n.Left, (U)(len+1) ); Walk( a, n.Right, (U)(len+1) ); } }

                              static bool LessThan( Node a, Node b )
                              { return a.Freq < b.Freq || a.Freq == b.Freq && a.Depth < b.Depth; }

                              static void HeapInsert( Node heap, int h, Node n ) // h is size of heap before insertion
                              {
                              int j = h;
                              while ( j > 0 )
                              {
                              int p = ( j - 1 ) / 2; // parent
                              Node pn = heap[p];
                              if ( !LessThan(n,pn) ) break;
                              heap[j] = pn; // move parent down
                              j = p;
                              }
                              heap[j] = n;
                              }

                              static Node HeapRemove( Node heap, int h ) // h is size of heap after removal
                              {
                              Node result = heap[0];
                              Node n = heap[h];
                              int j = 0;
                              while ( true )
                              {
                              int c = j * 2 + 1; if ( c >= h ) break;
                              Node cn = heap[c];
                              if ( c+1 < h )
                              {
                              Node cn2 = heap[c+1];
                              if ( LessThan(cn2,cn) ) { c += 1; cn = cn2; }
                              }
                              if ( !LessThan(cn,n) ) break;
                              heap[j] = cn; j = c;
                              }
                              heap[j] = n;
                              return result;
                              }
                              } // end class HE





                              share|improve this answer












                              Harold has given a pretty comprehensive answer. I thought your code for building the Huffman tree was a little obscure, the usual approach is to use a "heap" which efficiently maintains the smallest element of a array in position zero.



                              I have been working on an implementation of RFC 1951 in C#, so far I have found that it's fairly unusual for the bit length limits to be exceeded in practice ( I am using it to write PDF files ), so my approach was simply to try a smaller block size if the limits were exceeded.



                              Here's my C# "deflate" code for comparison ( hope it's alright to post code as part of a reply, I am new here ):



                              using G = System.Collections.Generic; 
                              using U = System.UInt16; // Could also be UInt32 or UInt64, not sure what is best.

                              class Encoder : G.List<byte> // Compression per RFC1950, RFC1951.
                              {
                              public G.List<byte> Deflate( byte inp )
                              {
                              Clear();
                              Add( 0x78); Add( 0x9C ); // RFC 1950 bytes
                              ReadInp( inp );
                              while ( DoOutput( 1 ) == 0 );
                              FlushBitBuf();
                              Put32( Adler32( inp ) ); // RFC 1950 checksum
                              // System.Console.WriteLine( "Encoder.Deflate, input size=" + inp.Length + " output size=" + this.Count );
                              return this;
                              }

                              class OffsetList{ public uint Offset; public OffsetList Next; } // List of 3-byte match offsets.

                              void ReadInp( byte inp ) // LZ77 compression, per RFC1951
                              {
                              G.Dictionary <uint,OffsetList> lookup = new G.Dictionary<uint,OffsetList>(); // Note: could reduce storage requirement by removing entries outside 32k window
                              uint n = (uint)inp.Length;
                              SetIBufSize( n );

                              uint w = 0; // holds last 3 bytes of input
                              int todo = 0; // number of bytes in w that have not yet been output to IBuf, can be negative when a match is found
                              uint pendingMatchLen=0, pendingDist=0;

                              for ( uint i = 0; i <= 1 && i < n; i += 1 ) { w = ( ( w << 8 ) | inp[i] ); todo += 1; }

                              for ( uint i = 2; i < n; i += 1 )
                              {
                              w = ( ( w << 8 ) | inp[i] ) & 0xffffffu; todo += 1;
                              OffsetList e, x = new OffsetList(); x.Offset = i;
                              uint bestMatchLen = 0, bestDist = 0;
                              if ( lookup.TryGetValue( w, out e ) )
                              {
                              x.Next = e;
                              OffsetList p = x;
                              if ( todo >= 3 ) while ( e != null )
                              {
                              uint dist = i - e.Offset; if ( dist > 32768 ) { p.Next = null; break; }
                              uint matchLen = MatchLen( inp, dist, i );
                              if ( matchLen > bestMatchLen ) { bestMatchLen = matchLen; bestDist = dist; }
                              p = e; e = e.Next;
                              }
                              }
                              lookup[ w ] = x; ISpace();

                              // "Lazy matching" RFC 1951 p.15 : if there are overlapping matches, there is a choice over which of the match to use.
                              // Example: abc012bc345.... abc345 : abc345 can be encoded as either [abc][345] or as a[bc345].
                              // Since a range needs more bits to encode than a literal the latter is better.

                              if ( pendingMatchLen > 0 )
                              {
                              if ( bestMatchLen > pendingMatchLen || bestMatchLen == pendingMatchLen && bestDist < pendingDist )
                              { IPut( inp[i-3] ); todo -= 1; }
                              else // Save the pending match, suppress bestMatch if any.
                              {
                              IPut( (ushort)(257 + pendingMatchLen) );
                              IPut( (ushort) pendingDist );
                              todo -= (int)pendingMatchLen;
                              bestMatchLen = 0;
                              }
                              pendingMatchLen = 0;
                              }
                              if ( bestMatchLen > 0 ) { pendingMatchLen = bestMatchLen; pendingDist = bestDist; }
                              else if ( todo == 3 ) { IPut( (byte)(w >> 16) ); todo = 2; }
                              } // end for loop
                              if ( pendingMatchLen > 0 )
                              {
                              IPut( (ushort)(257 + pendingMatchLen) );
                              IPut( (ushort) pendingDist );
                              todo -= (int)pendingMatchLen;
                              }
                              while ( todo > 0 ){ todo -= 1; IPut( (byte)( w >> (todo*8) ) ); }
                              } // end ReadInp

                              uint MatchLen( byte inp, uint dist, uint i )
                              {
                              // From lookup, we already have a match of 3 bytes, this function computes how many more bytes match.
                              uint x = i+1;
                              ulong end = (ulong)inp.Length;
                              if ( end - i > 256 ) end = i + 256; // Maximum match is 258
                              while ( x < end && inp[x] == inp[x-dist] ) x += 1;
                              return x - i + 2;
                              }

                              ushort IBuf; // Intermediate buffer, holds output from LZ99 algorithm.
                              const uint IBufSizeMax = 0x40000;
                              uint IBufSize, IBufI, IBufJ;
                              void IPut( ushort x ) { IBuf[IBufI++] = x; if ( IBufI == IBufSize ) IBufI = 0; }
                              ushort IGet(){ ushort result = IBuf[IBufJ++]; if ( IBufJ == IBufSize ) IBufJ = 0; return result; }
                              uint ICount(){ if ( IBufI >= IBufJ ) return IBufI - IBufJ; else return IBufI + IBufSize - IBufJ; } // Number of values in IBuf
                              void ISpace(){ while ( ICount() > IBufSize - 10 ) DoOutput( 0 ); } // Ensure IBuf has space for at least 10 values.
                              void SetIBufSize( uint x ) { x += 20; if ( x > IBufSizeMax ) x = IBufSizeMax; if ( IBufSize < x ) { IBufSize = x; IBuf = new ushort[x]; } }

                              U DoOutput( U last ) // while DoBlock fails, retry with a smaller amount of input
                              {
                              uint n = ICount();
                              while ( !DoBlock( n, last ) ) { last = 0; n -= n / 20; }
                              return last;
                              }

                              ///////////////////////////////////////////////////////////////////////////////
                              // RFC 1951 encoding constants.

                              static readonly byte ClenAlphabet = { 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 }; // size = 19
                              static readonly byte MatchExtra = { 0,0,0,0, 0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4, 5,5,5,5, 0 }; // size = 29
                              static readonly ushort MatchOff = { 3,4,5,6, 7,8,9,10, 11,13,15,17, 19,23,27,31, 35,43,51,59, 67,83,99,115, 131,163,195,227, 258, 0xffff };
                              static readonly byte DistExtra = { 0,0,0,0, 1,1,2,2, 3,3,4,4, 5,5,6,6, 7,7,8,8, 9,9,10,10, 11,11,12,12, 13,13 }; // size = 30
                              static readonly ushort DistOff = { 1,2,3,4, 5,7,9,13, 17,25,33,49, 65,97,129,193, 257,385,513,769, 1025,1537,2049,3073,
                              4097,6145,8193,12289, 16385,24577, 0xffff };

                              readonly U LitFreq = new U[288], LitLen = new U[288], LitCode = new U[288];
                              readonly U DistFreq = new U[32], DistLen = new U[32], DistCode = new U[32];
                              readonly U LenFreq = new U[19], LenLen = new U[19], LenCode = new U[19];

                              bool DoBlock( uint n, U last )
                              {
                              Clear( LitFreq ); Clear( DistFreq ); Clear( LenFreq );
                              uint saveI = IBufI, saveJ = IBufJ;
                              int got = 0; while ( got < n )
                              {
                              ushort x = IGet(); got += 1;
                              if ( x < 256 ) LitFreq[ x ] += 1;
                              else
                              {
                              x -= 257;
                              uint dist = IGet(); got += 1;
                              uint mc=0; while ( x >= MatchOff[mc] ) mc += 1; mc -= 1;
                              uint dc=0; while ( dist >= DistOff[dc] ) dc += 1; dc -= 1;
                              LitFreq[ 257+mc ] += 1;
                              DistFreq[ dc ] += 1;
                              }
                              }
                              LitFreq[256] += 1; // end of block code
                              IBufI = saveI; IBufJ = saveJ;

                              int nLitCode = HE.ComputeCode( 15, LitFreq, LitLen, LitCode ); if ( nLitCode < 0 ) return false;
                              int nDistCode = HE.ComputeCode( 15, DistFreq, DistLen, DistCode ); if ( nDistCode < 0 ) return false;

                              if ( nDistCode == 0 ) nDistCode = 1;
                              LenPass = 1; DoLengths( nLitCode, LitLen, true ); DoLengths( nDistCode, DistLen, false );

                              if ( HE.ComputeCode( 7, LenFreq, LenLen, LenCode ) < 0 ) return false;

                              int nLenCode = 19;
                              while ( nLenCode > 0 && LenLen[ ClenAlphabet[nLenCode-1] ] == 0 ) nLenCode -= 1;

                              PutBit( last ); // Last block flag
                              PutBits( 2, 2 ); // Dynamic Huffman ( for small blocks fixed coding may work better, not implemented )

                              PutBits( 5, (U)( nLitCode - 257 ) ); PutBits( 5, (U)( nDistCode - 1 ) ); PutBits( 4, (U)( nLenCode - 4 ) );
                              for ( int i=0; i < nLenCode; i += 1 ) PutBits( 3, LenLen[ ClenAlphabet[i] ] );
                              LenPass = 2; DoLengths( nLitCode, LitLen, true ); DoLengths( nDistCode, DistLen, false );

                              got = 0; while ( got < n )
                              {
                              U x = IGet(); got += 1;
                              if ( x < 256 ) PutBits( LitLen[x], LitCode[x] );
                              else
                              {
                              x -= 257;
                              ushort dist = IGet(); got += 1;
                              uint mc=0; while ( x >= MatchOff[mc] ) mc += 1; mc -= 1;
                              uint dc=0; while ( dist >= DistOff[dc] ) dc += 1; dc -= 1;
                              PutBits( LitLen[257+mc], LitCode[257+mc] );
                              PutBits( MatchExtra[mc], (U)(x-MatchOff[mc]) );
                              PutBits( DistLen[dc], DistCode[dc] );
                              PutBits( DistExtra[dc], (U)(dist-DistOff[dc]) );
                              }
                              }
                              PutBits( LitLen[256], LitCode[256] ); // block end code
                              return true;
                              } // end DoBlock

                              // Encoding of code lengths ( RFC 1951, page 13 ).

                              U LenPass, Plen, ZeroRun, Repeat;

                              void PutLenCode( U code ) { if ( LenPass == 1 ) LenFreq[code] += 1; else PutBits( LenLen[code], LenCode[code] ); }

                              void DoLengths( int n, U a, bool isLit )
                              {
                              if ( isLit ) { Plen = 0; ZeroRun = 0; Repeat = 0; }
                              for ( int i=0; i<n; i += 1 )
                              {
                              U len = a[i];
                              if ( len == 0 ){ EncRepeat(); ZeroRun += 1; Plen = 0; }
                              else if ( len == Plen ) { Repeat += 1; }
                              else { EncZeroRun(); EncRepeat(); PutLenCode( len ); Plen = len; }
                              }
                              if ( !isLit ) { EncZeroRun(); EncRepeat(); }
                              }

                              void EncRepeat()
                              {
                              while ( Repeat > 0 )
                              {
                              if ( Repeat < 3 ) { PutLenCode( Plen ); Repeat -= 1; }
                              else { U x = Repeat; if ( x > 6 ) x = 6; PutLenCode( 16 ); if ( LenPass == 2 ) PutBits( 2,(U)(x-3) ); Repeat -= x; }
                              }
                              }

                              void EncZeroRun()
                              {
                              while ( ZeroRun > 0 )
                              {
                              if ( ZeroRun < 3 ) { PutLenCode( 0 ); ZeroRun -= 1; }
                              else if ( ZeroRun < 11 ) { PutLenCode( 17 ); if ( LenPass == 2 ) PutBits( 3, (U)(ZeroRun-3) ); ZeroRun = 0; }
                              else { U x = ZeroRun; if ( x > 138 ) x = 138; PutLenCode( 18 ); if ( LenPass == 2 ) PutBits( 7,(U)(x-11) ); ZeroRun -= x; }
                              }
                              }

                              static void Clear( U a ){ System.Array.Clear( a, 0, a.Length ); /*for ( int i=0; i < a.Length; i += 1 ) a[i] = 0;*/ }

                              public static uint Adler32( byte b ) // per RFC1950
                              {
                              uint s1=1, s2=0;
                              for ( int i=0; i<b.Length; i+= 1 )
                              {
                              s1 = ( s1 + b[i] ) % 65521;
                              s2 = ( s2 + s1 ) % 65521;
                              }
                              return s2*65536 + s1;
                              }

                              // Output bitstream
                              byte Buf = 0, M = 1;
                              public void PutBit( U b ) { if ( b != 0 ) Buf |= M; M <<= 1; if ( M == 0 ) { Add(Buf); Buf = 0; M = 1; } }
                              public void PutBits( U n, U x ) { for ( int i = 0; i < n; i += 1 ) { PutBit( (U)(x & 1u) ); x >>= 1; } }
                              public void FlushBitBuf(){ while ( M != 1 ) PutBit( 0 ); }
                              public void Put32( uint x ) { Add( (byte)( x >> 24 ) ); Add( (byte)( x >> 16 ) ); Add( (byte)( x >> 8 ) ); Add( (byte) x ); }
                              } // end class Encoder

                              ////////////////////////////////////////////////////////////////////////////////////////////////////

                              class HE // Given a list of frequencies (freq), compute Huffman code lengths (nbits) and codes (tree_code).
                              {
                              public static int ComputeCode( int limit, U freq, U nbits, U tree_code )
                              {
                              int ncode = freq.Length;
                              Node heap = new Node[ncode];
                              int hn = 0;
                              for ( int i = 0; i < ncode; i += 1 )
                              {
                              U f = freq[i];
                              if ( f > 0 )
                              {
                              Node n = new Node();
                              n.Freq = f;
                              n.Code = (U)i;
                              HeapInsert( heap, hn, n );
                              hn += 1;
                              }
                              }

                              for ( int i = 0; i < nbits.Length; i += 1 ) nbits[i] = 0;
                              if ( hn <= 1 ) // Special case
                              { if ( hn == 1 ) nbits[ heap[0].Code ] = 1; }
                              else
                              {
                              while ( hn > 1 )
                              {
                              Node n = new Node();
                              hn -= 1; n.Left = HeapRemove( heap, hn );
                              hn -= 1; n.Right = HeapRemove( heap, hn );
                              n.Freq = (U) ( n.Left.Freq + n.Right.Freq );
                              n.Depth = (U) ( 1 + ( n.Left.Depth > n.Right.Depth ? n.Left.Depth : n.Right.Depth ) );
                              HeapInsert( heap, hn, n );
                              hn += 1;
                              }
                              Walk( nbits, heap[0], 0 ); // Walk the tree to find the code lengths (nbits).
                              }

                              for ( int i = 0; i < ncode; i += 1 ) if ( nbits[i] > limit ) return -1;

                              // Now compute codes, code below is from rfc1951 page 7

                              uint maxBits = 0;
                              for ( int i = 0; i < ncode; i += 1 ) if ( nbits[i] > maxBits ) maxBits = nbits[i];

                              U bl_count = new U[maxBits+1];
                              for ( int i=0; i < ncode; i += 1 ) bl_count[ nbits[i] ] += 1;

                              U next_code = new U[maxBits+1];
                              U code = 0; bl_count[0] = 0;
                              for ( int i = 0; i < maxBits; i += 1 )
                              {
                              code = (U)( ( code + bl_count[i] ) << 1 );
                              next_code[i+1] = code;
                              }

                              for ( int i = 0; i < ncode; i += 1 )
                              {
                              uint len = nbits[i];
                              if (len != 0)
                              {
                              tree_code[i] = Reverse( next_code[len], len );
                              next_code[len] += 1;
                              }
                              }

                              //System.Console.WriteLine( "Huff Code" );
                              // for ( uint i=0; i < ncode; i += 1 ) if ( nbits[i] > 0 )
                              // System.Console.WriteLine( "i=" + i + " len=" + nbits[i] + " tc=" + tree_code[i].ToString("X") + " freq=" + freq[i] );

                              while ( ncode > 0 && nbits[ ncode-1 ] == 0 ) ncode -= 1;
                              return ncode;
                              }

                              class Node{ public U Freq; public U Code, Depth; public Node Left, Right; }

                              static U Reverse( U x, uint bits )
                              { U result = 0; for ( int i = 0; i < bits; i += 1 ) { result <<= 1; result |= (U)(x & 1u); x >>= 1; } return result; }

                              static void Walk( U a, Node n,U len )
                              { if ( n.Left == null ) a[n.Code] = len; else { Walk( a, n.Left, (U)(len+1) ); Walk( a, n.Right, (U)(len+1) ); } }

                              static bool LessThan( Node a, Node b )
                              { return a.Freq < b.Freq || a.Freq == b.Freq && a.Depth < b.Depth; }

                              static void HeapInsert( Node heap, int h, Node n ) // h is size of heap before insertion
                              {
                              int j = h;
                              while ( j > 0 )
                              {
                              int p = ( j - 1 ) / 2; // parent
                              Node pn = heap[p];
                              if ( !LessThan(n,pn) ) break;
                              heap[j] = pn; // move parent down
                              j = p;
                              }
                              heap[j] = n;
                              }

                              static Node HeapRemove( Node heap, int h ) // h is size of heap after removal
                              {
                              Node result = heap[0];
                              Node n = heap[h];
                              int j = 0;
                              while ( true )
                              {
                              int c = j * 2 + 1; if ( c >= h ) break;
                              Node cn = heap[c];
                              if ( c+1 < h )
                              {
                              Node cn2 = heap[c+1];
                              if ( LessThan(cn2,cn) ) { c += 1; cn = cn2; }
                              }
                              if ( !LessThan(cn,n) ) break;
                              heap[j] = cn; j = c;
                              }
                              heap[j] = n;
                              return result;
                              }
                              } // end class HE






                              share|improve this answer












                              share|improve this answer



                              share|improve this answer










                              answered Jan 1 at 18:09









                              George BarwoodGeorge Barwood

                              596




                              596












                              • (Welcome to Code Review!) (hope it's alright to […], I am new here there you go
                                – greybeard
                                Jan 1 at 18:40


















                              • (Welcome to Code Review!) (hope it's alright to […], I am new here there you go
                                – greybeard
                                Jan 1 at 18:40
















                              (Welcome to Code Review!) (hope it's alright to […], I am new here there you go
                              – greybeard
                              Jan 1 at 18:40




                              (Welcome to Code Review!) (hope it's alright to […], I am new here there you go
                              – greybeard
                              Jan 1 at 18:40


















                              draft saved

                              draft discarded




















































                              Thanks for contributing an answer to Code Review Stack Exchange!


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid



                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.


                              Use MathJax to format equations. MathJax reference.


                              To learn more, see our tips on writing great answers.




                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function () {
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f210654%2fhuffman-tree-compressing-decompressing-in-c%23new-answer', 'question_page');
                              }
                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              Popular posts from this blog

                              Can a sorcerer learn a 5th-level spell early by creating spell slots using the Font of Magic feature?

                              Does disintegrating a polymorphed enemy still kill it after the 2018 errata?

                              A Topological Invariant for $pi_3(U(n))$