Understanding encoding schemes












I cannot understand some key elements of encoding:

  1. Is ASCII only a character set, or does it also have its own encoding scheme/algorithm?

  2. Do other Windows code pages, such as Latin1, have their own encoding algorithms?

  3. Are UTF-7, 8, 16 and 32 the only encoding algorithms?

  4. Are the UTF algorithms used only with the Unicode character set?

Given the ASCII text "Hello World", if I want to convert it into Latin1 or Big5, which encoding algorithms are used in this process? More specifically, do Latin1/Big5 use their own encoding algorithms, or do I have to use a UTF algorithm?










encoding

asked Nov 22 '18 at 8:51 (edited Nov 22 '18 at 9:15) – David

  • I don't quite understand what you mean with 3. or why you specifically pick UTF-7 and 32…?

    – deceze
    Nov 22 '18 at 8:54











  • Hi, I updated my question. I was wondering whether the UTF algorithms are the only ones used to encode Unicode characters.

    – David
    Nov 22 '18 at 9:06











  • #4. The U in UTF stands for Unicode. Algorithms can be applied anywhere you like but, please, let names have a declared or agreed upon context.

    – Tom Blodget
    Dec 2 '18 at 15:21
















3 Answers






1: Ascii is just an encoding — a really simple encoding. It's literally just the positive end of a signed byte (0...127) mapped to characters and control codes.



Refer to https://www.ascii.codes/ to see the full set and inspect the characters.



There are definitely encoding algorithms to convert ascii strings to and from strings in other encodings, but there is no compression/decompression algorithm required to write or read ascii strings like there is for utf8 or utf16, if that's what you're implying.



2: LATIN-1 is also not a compressed (usually called 'variable width') encoding, so there's no algorithm needed to get in and out of it.



See https://kb.iu.edu/d/aepu for a nice description of LATIN-1 conceptually and of each character in the set. Like a lot of encodings, its first 128 slots are just ascii. Like ascii, it's 1 byte in size, but it's an unsigned byte, so after the last ascii character (DEL/127), LATIN1 adds another 128 characters.
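As a quick illustration (a minimal Python sketch, assuming only the standard built-in codecs), you can see that LATIN-1 is a plain one-byte-per-character table whose lower half is ascii and whose upper half covers bytes 0x80 to 0xFF:

    # Every LATIN-1 character is exactly one byte; there are no multi-byte sequences.
    assert "é".encode("latin-1") == b"\xe9"        # U+00E9 -> single byte 0xE9
    assert bytes([0xE9]).decode("latin-1") == "é"  # and back again

    # The lower half (0...127) is byte-for-byte identical to ascii.
    assert "A".encode("latin-1") == "A".encode("ascii") == b"A"

    # LATIN-1 assigns a character to all 256 byte values, so any byte decodes.
    assert len(bytes(range(256)).decode("latin-1")) == 256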



As with any conversion from one string encoding to another, there is an algorithm specifically tailored to that conversion.



3: Again, unicode encodings are just that — encodings. But they're all compressed except for utf32. So unless you're working with utf32 there is always a compression/decompression step required to write and read them.
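To see the difference in practice, here is a tiny Python sketch (just an illustration with a few arbitrary characters): the byte count per character varies in utf8 and utf16 but stays constant in utf32.

    for ch in ("A", "é", "€", "😀"):
        print(ch,
              len(ch.encode("utf-8")),      # 1, 2, 3, 4 bytes   -> variable width
              len(ch.encode("utf-16-le")),  # 2, 2, 2, 4 bytes   -> variable width
              len(ch.encode("utf-32-le")))  # 4 bytes every time -> fixed width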



Note: When working with utf32 strings there is one nonlinear oddity that has to be accounted for... combining characters. Technically that is yet another type of compression since they save space by not giving a codepoint to every possible combination of uncombined character and combining character. They "precombine" a few, but they would run out of slots very quickly if they did them all.
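For example (a small sketch using Python's standard unicodedata module), a base letter plus a combining accent stays two code points even in utf32, while normalization folds it into the single precombined code point where one exists:

    import unicodedata

    decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT: two code points
    precombined = unicodedata.normalize("NFC", decomposed)

    assert len(decomposed) == 2
    assert precombined == "\u00e9" and len(precombined) == 1   # the single code point 'é'
    # Both sequences render as the same user-perceived character.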



4: Yes. The compression/decompression algorithms for the compressed unicode encodings are just for those encodings. They would not work for any other encoding.



Think of it like zip/unzip. Unzipping anything other than a zipped file or folder would of course not work. That goes for things that are not compressed in the first place and also things that are compressed but using another compression algorithm (e.g.: rar).



I recently wrote the utf8 and utf16 compression/decompression code for a new cross-platform library being developed, and I can tell you quite confidently if you feed a Big5-encoded string into my method written specifically for decompressing utf8... not only would it not work, it might very well crash.
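In a memory-safe language you get an exception instead of a crash, but the point is the same. A minimal Python sketch (assuming the bundled big5 codec):

    big5_bytes = "中文".encode("big5")   # two CJK characters, two bytes each in Big5

    try:
        big5_bytes.decode("utf-8")       # the Big5 lead byte is not a valid utf8 start byte
    except UnicodeDecodeError as err:
        print("utf8 decoder rejects Big5 input:", err)

    print(big5_bytes.decode("big5"))     # the matching decoder recovers '中文'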



Re: your "Hello World" question... Refer to my answer to your second question about LATIN-1. No conversion is required to go from ascii to LATIN-1 because the first 128 characters (0...127) of LATIN-1 are ascii. If you're converting from LATIN-1 to ascii, the same is true for the lower half of LATIN-1, but if any of the characters beyond 127 are in the string, it would be what's called a "lossy"/partial conversion or an outright failure, depending on your tolerance level for lossiness. In your example, however, all of the characters in "Hello World" have the exact same values in both encodings, so it would convert perfectly, without loss, in either direction.



I know practically nothing about Big5, but regardless, don't use utf-x algos for other encodings. Each one of those is written very specifically for 1 particular encoding (or in the case of conversion: pair of encodings).



If you're curious about the utf8/16 compression/decompression algorithms, the unicode website is where you should start (watch out, though: they don't use the compression/decompression metaphor in their documentation):



http://unicode.org



You probably won't need anything else.



... except maybe a decent codepoint lookup tool: https://www.unicode.codes/



You can roll your own code based on the unicode documentation, or use the official unicode library:



http://site.icu-project.org/home



Hope this helps.






answered Dec 2 '18 at 8:15 (edited Dec 3 '18 at 22:11) – Craig















































In general, most encoding schemes like ASCII or Latin-1 are simply big tables mapping characters to specific byte sequences. There may or may not be a specific algorithm behind how the creators came up with those character⟷byte associations, but there's generally not much more to it than that.



One of the innovations of Unicode specifically is the indirection of assigning each character a unique number first and foremost, and worrying about how to encode that number into bytes only secondarily. There are a number of encoding schemes for doing this, from the UCS and GB 18030 encodings to the most commonly used UTF-8/UTF-16 encodings. Some, like UCS-2, are largely defunct by now. Each one has its pros and cons in terms of space tradeoffs, ease of processing and transportability (e.g. UTF-7 for safe transport over 7-bit systems like email). Unless otherwise noted, they can all encode the full set of current Unicode characters.



To convert from one encoding to another, you pretty much need to map bytes from one table to another. Meaning, if you look at the EBCDIC table and the Windows 1250 table, the characters 0xC1 and 0x41 respectively both seem to represent the same character "A", so when converting between the two encodings, you'd map those bytes as equivalent. Yes, that means there needs to be one such mapping between each possible encoding pair.
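You can check that particular pair with a short Python sketch (assuming the bundled cp500 codec is a fair stand-in for the EBCDIC table):

    # In EBCDIC (code page 500) the letter 'A' is the byte 0xC1 ...
    assert bytes([0xC1]).decode("cp500") == "A"
    # ... while in Windows-1250 it is the byte 0x41.
    assert "A".encode("cp1250") == b"\x41"

    # Converting EBCDIC text to Windows-1250 is exactly that table-to-table mapping.
    assert bytes([0xC1]).decode("cp500").encode("cp1250") == b"\x41"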



Since that is obviously rather laborious, modern converters virtually always go through Unicode as a middleman. This way each encoding only needs to be mapped to the Unicode table, and the conversion can be done with encoding A → Unicode code point → encoding B. In the end you just want to identify which characters look the same/mean the same, and change the byte representation accordingly.
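For instance, a minimal Python sketch of that two-hop route (the specific encodings here are arbitrary): the bytes are first decoded to Unicode code points, then re-encoded into the target encoding.

    latin1_bytes = b"caf\xe9"              # 'café' in Latin-1

    text = latin1_bytes.decode("latin-1")  # encoding A -> Unicode code points
    utf8_bytes = text.encode("utf-8")      # Unicode code points -> encoding B

    assert text == "café"
    assert utf8_bytes == b"caf\xc3\xa9"    # 'é' becomes a two-byte UTF-8 sequence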






answered Nov 22 '18 at 9:28 – deceze













































A character encoding is a mapping from a sequence of characters to a sequence of bytes (in the past there were also encodings to a sequence of bits, but those have fallen out of fashion). Usually this mapping is one-to-one but not necessarily onto, meaning there may be byte sequences that don't correspond to any character sequence in this encoding.



The domain of the mapping defines which characters can be encoded.



Now to your questions:




  1. ASCII is both: it defines 128 characters (some of them control codes) and how they are mapped to the byte values 0 to 127.

  2. Each encoding may define its own set of characters and how they are mapped to bytes.

  3. No, there are others as well: ASCII, ISO-8859-1, ...

  4. Unicode uses a two-step mapping: first the characters are mapped to (relatively) small integers called "code points", then these integers are mapped to a byte sequence. The first part is the same for all UTF encodings; the second step differs. Unicode has the ambition to contain all characters, which means most characters are in the "Unicode set". (See the short sketch below.)
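
A short Python sketch of that two-step mapping (just an illustration with the euro sign): step one yields the same code point for every UTF encoding, step two yields different byte sequences.

    ch = "€"

    # Step 1: character -> code point (identical for every UTF encoding).
    assert ord(ch) == 0x20AC

    # Step 2: code point -> byte sequence (this is where the UTF variants differ).
    assert ch.encode("utf-8") == b"\xe2\x82\xac"
    assert ch.encode("utf-16-le") == b"\xac\x20"
    assert ch.encode("utf-32-le") == b"\xac\x20\x00\x00"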






answered Nov 22 '18 at 9:31 (edited Nov 22 '18 at 9:39) – Henry


























  • In point 4: the term is code units. Character sets are a set of codepoints: a mapping between a conceptual character and an integer. Character encodings have code units. They are a map between a codepoint and one or more code unit sequences. (And then there is serialization: a map between a code unit integer and a sequence of bytes with a given endianness.)

    – Tom Blodget
    Nov 22 '18 at 15:30










