How can (La)TeX read UTF-8?
As described in The TeXbook, TeX reads files byte by byte, regardless of the particular format -- as I understand it, this is simply how IniTeX is set up.
I also understand that LaTeX is just a collection of macros built on top of IniTeX, described in most distributions of TeX by the file latex.ltx.
These two things are at odds with my understanding of LaTeX's ability to read UTF-8. I was under the impression that reading the input byte by byte (and thus, for instance, only being able to access character codes from 0 to 255 using \char or the like) was baked into TeX, and would therefore exist in all variants built on top of it.
How, then, is LaTeX able to do this?
unicode
The TeXbook describes the so-called Knuth-TeX engine, as well as a collection of macros frequently called "plain TeX". Are you aware of the newer engines called pdfTeX, XeTeX, and LuaTeX?
– Mico
Jan 21 at 1:54
You could be asking one of (at least) two questions here. Are you wondering how e.g. XeTeX (natively UTF-8) can be derived from Knuth's TeX (8-bit)? Or are you wondering how 8-bit TeX engines deal with UTF-8 input (conversion of 'raw' bytes to codepoints to output)?
– Joseph Wright♦
Jan 21 at 7:42
@JosephWright The former.
– extremeaxe5
Jan 22 at 18:40
@Mico I am aware they exist. Are you saying that the newer TeX engines read input UTF-8 character by UTF-8 character? How would this work? I feel like a lot of things would break -- as Knuth describes, catcodes, mathcodes, and delcodes are only defined for 256 distinct values (all the possible bytes), no?
– extremeaxe5
Jan 22 at 18:43
Related: tex.stackexchange.com/questions/222286/…. There, I look in detail at the differences between engines, some of which are linked to Unicode support. (The questions are distinct; it's not a duplicate.)
– Joseph Wright♦
Jan 22 at 20:52
edited Jan 22 at 20:49
Joseph Wright♦
asked Jan 21 at 1:34
extremeaxe5
2 Answers
If you want to know how the 8-bit engines handle UTF-8 input, you can use \tracingmacros:
\documentclass{article}
\begin{document}
{\tracingmacros=1 ä }
\end{document}
which gives
Ã ->\UTFviii@two@octets Ã
\UTFviii@two@octets #1#2->\expandafter \UTFviii@defined \csname u8:#1\string #2\endcsname
#1<-Ã
#2<-¤
\UTFviii@defined #1->\ifx #1\relax \if \relax \expandafter \UTFviii@checkseq \string #1\relax \relax \UTFviii@undefined@err {#1}\else \PackageError {inputenc}{Invalid UTF-8 byte sequence}\UTFviii@invalid@help \fi \else \expandafter #1\fi
#1<-\u8:ä
\u8:ä ->\IeC {\"a}
That means that the first byte of the ä (the Ã) is an active character: a command which picks up the next byte and then calls \u8:ä, which in turn calls \"a. In this way (pdf)LaTeX can handle quite a lot of UTF-8 input, but it has problems with e.g. "character + combining accent", as there is no sensible way for the combining accent to go back and put an accent on the preceding character.
answered Jan 21 at 8:49
Ulrike Fischer
Just a comment to point out that the log file containing the trace was viewed by an editor in some 8-bit encoding, presumably ISO Latin-1 (as Emacs does for me), not in UTF-8... so Ã is only one byte. This is tacit in your answer...
– user4686
Jan 21 at 9:01
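The active-character mechanism described in this answer can be sketched in a few lines. This is a simplified illustration only, not the actual inputenc code (which also handles 3- and 4-byte sequences and error reporting); the ^^c3 notation and the u8: naming follow the trace above:

```latex
% Simplified sketch of the inputenc approach in an 8-bit engine.
\catcode`\^^c3=\active   % 0xC3: lead byte of many 2-byte UTF-8 sequences
% The active lead byte grabs the continuation byte and calls a macro
% named after the full byte pair (\string keeps both bytes inert):
\def^^c3#1{\csname u8:\string^^c3\string#1\endcsname}
% Map the byte pair C3 A4 (UTF-8 for ä) to the accent command \"a:
\expandafter\def\csname u8:\string^^c3\string^^a4\endcsname{\"a}
```

With such definitions in place, typing ä in the source expands, byte by byte, to \"a at the macro level, which is exactly what the \tracingmacros output above shows.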
In order to move from 8-bit TeX90 to Unicode XeTeX or LuaTeX, there is work to do in extending/modifying internal structures. However, that is largely a question of effort rather than any major conceptual limitation. Knuth after all extended TeX from 7-bit to 8-bit between TeX82 (TeX 2) and TeX90 (TeX 3).
Both XeTeX and LuaTeX read files as UTF-8 rather than on a per-byte basis. This happens well before any TeX-related processes are involved, and as such at the macro level there are only UTF-8 characters. (One can alter this in LuaTeX: see luainputenc, for example.) Both engines then use tables which are extended to cover the full Unicode range.
The change in accepted input can be used to test for Unicode-aware engines, as shown in an example from https://www.contextgarden.net/Encodings_and_Regimes:
\def\test#1#2!{\def\secondarg{#2}}
\test χ!\relax % That's chi, a 2-byte UTF-8 sequence
\ifx\secondarg\empty \message{newstuff}\else \message{tex82}\fi
In the main, macro code does not need altering to accept Unicode: the engines deal with the byte aspect, so from a macro viewpoint everything is 'as expected'. Of course, there is a little setting up to do, for example setting \catcode, \uccode, etc. for the full Unicode range. Today, this is handled using the unicode-data files, so it is built in to both plain TeX-derived and LaTeX formats.
There are a few places where LaTeX has to be aware of which engine is in use, but the direct impact of Unicode is largely limited to:
- Setting up the data for \catcode, etc.
- Setting up hyphenation patterns (which today are all stored as UTF-8 and require more work to use with pdfTeX than with Unicode TeX engines)
- (Not) setting up support for Unicode input based on active 8-bit characters (see Ulrike's answer for details of how that works in 8-bit engines)
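The kind of set-up involved can be illustrated with a short loop. This is a sketch only: the real formats read this data from the Unicode Character Database files rather than hard-coding ranges, and the chosen range here is just an example:

```latex
% Sketch: give a block of Greek letters catcode 11 ("letter") in a
% Unicode engine, so they can appear in words and be hyphenated.
\count255="0391          % GREEK CAPITAL LETTER ALPHA
\loop
  \catcode\count255=11
  \advance\count255 by 1
\ifnum\count255<"03CA \repeat
```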
Other aspects at the macro layer are related to other functionality, for example the ability of both XeTeX and LuaTeX to load system fonts: that requires a Unicode font encoding (TU), but that is distinct from handling input.
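As an aside, a related way to distinguish engines without relying on input decoding is to test for engine-specific primitives, much as the iftex package does (a sketch; note that \ifdefined is an e-TeX primitive, which all current engines provide but Knuth's TeX does not):

```latex
% Sketch: detect a Unicode engine by its version primitive.
\ifdefined\XeTeXversion
  \message{XeTeX}%
\else\ifdefined\luatexversion
  \message{LuaTeX}%
\else
  \message{8-bit engine (e.g. pdfTeX)}%
\fi\fi
```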
edited Jan 22 at 20:47
answered Jan 22 at 19:17
Joseph Wright♦