How can (La)TeX read UTF-8?












As described in The TeXbook, TeX reads files byte by byte, regardless of the particular format -- as I understand it, this is just how IniTeX is set up.

I also understand that LaTeX is just a collection of macros built on top of IniTeX, described in most distributions of TeX by the file latex.ltx.

The above two things are at odds with my understanding of LaTeX's ability to read UTF-8. I was under the impression that reading the input byte by byte (and thus, for instance, only being able to access numbers from 0 to 255 using \char or something) was baked into TeX, and thus would exist in all variants built on top of it.

So how is LaTeX able to do this?










asked Jan 21 at 1:34 by extremeaxe5; edited Jan 22 at 20:49 by Joseph Wright




















  • The TeXbook describes the so-called Knuth-TeX engine, as well as a collection of macros frequently called "PlainTeX". Are you aware of newer engines called pdfTeX, XeTeX, and LuaTeX? – Mico, Jan 21 at 1:54

  • You could be asking one of (at least) two questions here. Are you wondering how e.g. XeTeX (natively UTF-8) can be derived from Knuth's TeX (8-bit)? Or are you wondering how 8-bit TeX engines deal with UTF-8 input (conversion of 'raw' bytes to codepoints to output)? – Joseph Wright, Jan 21 at 7:42

  • @JosephWright The former. – extremeaxe5, Jan 22 at 18:40

  • @Mico I am aware they exist. Are you saying that the newer TeX engines read input UTF-8 char by UTF-8 char? How would this work? I feel like a lot of things would break -- as Knuth describes, catcodes, mathcodes, and delcodes are only defined for 256 distinct values (all the possible bytes), no? – extremeaxe5, Jan 22 at 18:43

  • Related: tex.stackexchange.com/questions/222286/…. There, I look in detail at the differences between engines, some of which are linked to Unicode support. (The questions are distinct, it's not a duplicate.) – Joseph Wright, Jan 22 at 20:52
















2 Answers






If you want to know how the 8-bit engines handle UTF-8 input, you can use \tracingmacros:

\documentclass{article}

\begin{document}
{\tracingmacros=1 ä }
\end{document}

which gives

Ã->\UTFviii@two@octets Ã

\UTFviii@two@octets #1#2->\expandafter \UTFviii@defined \csname u8:#1\string #2\endcsname
#1<-Ã
#2<-¤

\UTFviii@defined #1->\ifx #1\relax \if \relax \expandafter \UTFviii@checkseq \string #1\relax \relax \UTFviii@undefined@err {#1}\else \PackageError {inputenc}{Invalid UTF-8 byte sequence}\UTFviii@invalid@help \fi \else \expandafter #1\fi

#1<-\u8:ä

\u8:ä ->\IeC {\"a}

That means that the first byte of the ä (the Ã) is an active character: a command which picks up the next byte and then calls \u8:ä, which in turn calls \"a. In this way (pdf)latex can handle quite a lot of UTF-8 input, but it has problems with e.g. "char + combining accent", as there is no sensible code for the combining accent to go back and add an accent to the preceding char.
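The dispatch traced above can be boiled down to a few lines. The following is a simplified sketch in the spirit of utf8.def, not the actual inputenc code; it assumes an 8-bit engine (e.g. pdftex) with no UTF-8 support already loaded, and the name u8:... and the single handler are illustrative only.

```latex
% Simplified sketch of the utf8.def idea (not the real implementation).
% Assumes an 8-bit engine such as pdftex with no UTF-8 set-up loaded.
\catcode"C3=13 % make the UTF-8 lead byte 0xC3 an active character
% The active byte grabs the continuation byte and dispatches to a
% handler named after the two-byte sequence; \string keeps the raw
% bytes inert inside \csname.
\def^^c3#1{\csname u8:\string^^c3\string#1\endcsname}
% Handler for the sequence C3 A4, i.e. U+00E4 (ä):
\expandafter\def\csname u8:\string^^c3\string^^a4\endcsname{\"a}
```

With this in place, a literal two-byte ä in the input expands through the active 0xC3 to the handler, which produces \"a -- one handler per supported UTF-8 sequence, which is why inputenc ships large tables of such definitions.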






answered Jan 21 at 8:49 by Ulrike Fischer



















  • Just a comment to point out that the log file containing the trace was viewed by the editor in some 8-bit encoding, presumably ISO Latin-1 (as Emacs does for me), not in UTF-8... so Ã is only one byte. This is tacit in your answer... – user4686, Jan 21 at 9:01





















In order to move from 8-bit TeX90 to Unicode XeTeX or LuaTeX, there is work to do in extending/modifying internal structures. However, that is largely a question of effort rather than any major conceptual limitation. Knuth after all extended TeX from 7-bit to 8-bit between TeX82 (TeX 2) and TeX90 (TeX 3).



Both XeTeX and LuaTeX read files in UTF-8 rather than on a per-byte basis. This happens well before any TeX-related processes are involved, and as such at the macro level there are only UTF-8 characters. (One can alter this in LuaTeX: see luainputenc for example.) Both engines then use tables which are extended to cover the full Unicode range.
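As an illustration of how early this decoding happens in LuaTeX: the process_input_buffer callback hands over each input line as a Lua string before the TeX tokenizer sees it, so input in another encoding can be converted there -- essentially the luainputenc approach. This is a hedged sketch, assuming LuaLaTeX (where luatexbase is preloaded) and input that is actually Latin-1:

```latex
% LuaLaTeX sketch (illustrative only; luainputenc does this properly):
% re-encode Latin-1 input lines to UTF-8 before tokenization.
\directlua{
  luatexbase.add_to_callback("process_input_buffer",
    function(line)
      local hi = "[" .. string.char(128) .. "-" .. string.char(255) .. "]"
      return (line:gsub(hi,
        function(b) return unicode.utf8.char(string.byte(b)) end))
    end,
    "latin1 to utf8")
}
```

The point is that by the time macros run, the bytes-versus-characters question has already been settled at this buffer level.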



The change in accepted input can be used to test for Unicode-aware engines, as shown in an example from https://www.contextgarden.net/Encodings_and_Regimes



\def\test#1#2!{\def\secondarg{#2}}
\test χ!\relax % That's Chi, a 2-byte UTF-8 sequence
\ifx\secondarg\empty \message{newstuff}\else \message{tex82}\fi


In the main, macro code does not need altering to accept Unicode: the engines deal with the byte aspect, so from a macro viewpoint everything is 'as expected'. Of course, there is a little set-up to do, for example setting \catcode, \uccode, etc. for the full Unicode range. Today, this is handled using the unicode-data files, so is built in to both plain TeX-derived and LaTeX formats.
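As a small illustration of that set-up (the real formats do this systematically from the Unicode Consortium data files rather than one character at a time): in a Unicode engine, code assignments above 255 are legal, so one can write, for example:

```latex
% Meaningful only in a Unicode engine (xetex/luatex); pdftex would
% reject these assignments because the character code exceeds 255.
\catcode`χ=11  % U+03C7 GREEK SMALL LETTER CHI becomes a letter
\lccode`χ=`χ   % lowercase code, needed e.g. for hyphenation
\uccode`χ=`Χ   % uppercase mapping to U+03A7
```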



There are a few places where LaTeX has to be aware of which engine is in use, but the direct impact of Unicode is largely limited to:


  • Setting up the data for \catcode, etc.

  • Setting up hyphenation patterns (which today are all stored as UTF-8 and
    require more work to use with pdfTeX than with the Unicode TeX engines)

  • (Not) setting up UTF-8 support based on active 8-bit characters
    (see Ulrike's answer for details
    of how that works in 8-bit engines)


Other aspects at the macro layer are related to other functionality, for example the ability of both XeTeX and LuaTeX to load system fonts: that requires a Unicode font encoding (TU), but that is distinct from handling input.






answered Jan 22 at 19:17 by Joseph Wright; edited Jan 22 at 20:47
























