How can (La)TeX read UTF-8?
As described in The TeXbook, TeX reads files byte by byte, regardless of the particular format -- as I understand it, this is simply how IniTeX is set up.
I also understand that LaTeX is just a collection of macros built on top of IniTeX, described in most distributions of TeX by the file latex.ltx.
These two things are at odds with my understanding of LaTeX's ability to read UTF-8. I was under the impression that reading the input byte by byte (and thus, for instance, only being able to access character codes from 0 to 255 using \char or the like) was baked into TeX, and would therefore exist in all variants built on top of it.
How, then, is LaTeX able to do this?
unicode
The TeXbook describes the so-called Knuth-TeX engine, as well as a collection of macros frequently called "plain TeX". Are you aware of the newer engines called pdfTeX, XeTeX, and LuaTeX?
– Mico
Jan 21 at 1:54
You could be asking one of (at least) two questions here. Are you wondering how e.g. XeTeX (natively UTF-8) can be derived from Knuth's TeX (8-bit)? Or are you wondering how 8-bit TeX engines deal with UTF-8 input (conversion of 'raw' bytes to codepoints to output)?
– Joseph Wright♦
Jan 21 at 7:42
@JosephWright The former.
– extremeaxe5
Jan 22 at 18:40
@Mico I am aware they exist. Are you saying that the newer TeX engines read input UTF-8 character by UTF-8 character? How would this work? I feel like a lot of things would break -- as Knuth describes, catcodes, mathcodes, and delcodes are only defined for 256 distinct values (all the possible bytes), no?
– extremeaxe5
Jan 22 at 18:43
Related: tex.stackexchange.com/questions/222286/…. There, I look in detail at the differences between engines, some of which are linked to Unicode support. (The questions are distinct; it's not a duplicate.)
– Joseph Wright♦
Jan 22 at 20:52
edited Jan 22 at 20:49
Joseph Wright♦
asked Jan 21 at 1:34
extremeaxe5
2 Answers
If you want to know how the 8-bit engines handle UTF-8 input, you can use \tracingmacros:
\documentclass{article}
\begin{document}
{\tracingmacros=1 ä }
\end{document}
which gives
Ã ->\UTFviii@two@octets Ã
\UTFviii@two@octets #1#2->\expandafter \UTFviii@defined \csname u8:#1\string #2\endcsname
#1<-Ã
#2<-¤
\UTFviii@defined #1->\ifx #1\relax \if \relax \expandafter \UTFviii@checkseq \string #1\relax \relax \UTFviii@undefined@err {#1}\else \PackageError {inputenc}{Invalid UTF-8 byte sequence}\UTFviii@invalid@help \fi \else \expandafter #1\fi
#1<-\u8:ä
\u8:ä ->\IeC {\"a}
That means that the first byte of the ä (the Ã) is an active character: a command which picks up the next byte and then calls \u8:ä, which in turn calls \"a. In this way (pdf)LaTeX can handle quite a lot of UTF-8 input, but it has problems with e.g. "character + combining accent", as there is no sensible way for the combining accent to go back and put an accent on the preceding character.
answered Jan 21 at 8:49
Ulrike Fischer
Just a comment to point out that the log file containing the trace was viewed by an editor in some 8-bit encoding, presumably ISO Latin-1 (as Emacs does for me), not in UTF-8... so Ã is only one byte. This is tacit in your answer...
– user4686
Jan 21 at 9:01
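The active-character mechanism described in this answer can be sketched in a few lines. This is a simplified illustration only, not the actual inputenc code (which also handles 3- and 4-byte sequences and error reporting); the ^^c3 notation and the u8: naming follow the trace above:

```latex
% Simplified sketch of the inputenc approach in an 8-bit engine.
\catcode`\^^c3=\active   % 0xC3: lead byte of many 2-byte UTF-8 sequences
% The active lead byte grabs the continuation byte and calls a macro
% named after the full byte pair (\string keeps both bytes inert):
\def^^c3#1{\csname u8:\string^^c3\string#1\endcsname}
% Map the byte pair C3 A4 (UTF-8 for ä) to the accent command \"a:
\expandafter\def\csname u8:\string^^c3\string^^a4\endcsname{\"a}
```

With such definitions in place, typing ä in the source expands, byte by byte, to \"a at the macro level, which is exactly what the \tracingmacros output above shows.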
In order to move from 8-bit TeX90 to Unicode XeTeX or LuaTeX, there is work to do in extending/modifying internal structures. However, that is largely a question of effort rather than any major conceptual limitation. Knuth after all extended TeX from 7-bit to 8-bit between TeX82 (TeX 2) and TeX90 (TeX 3).
Both XeTeX and LuaTeX read files as UTF-8 rather than on a per-byte basis. This happens well before any TeX-related processes are involved, and as such at the macro level there are only UTF-8 characters. (One can alter this in LuaTeX: see luainputenc, for example.) Both engines then use tables which are extended to cover the full Unicode range.
The change in accepted input can be used to test for Unicode-aware engines, as shown in an example from https://www.contextgarden.net/Encodings_and_Regimes:
\def\test#1#2!{\def\secondarg{#2}}
\test χ!\relax % That's chi, a 2-byte UTF-8 sequence
\ifx\secondarg\empty \message{newstuff}\else \message{tex82}\fi
In the main, macro code does not need altering to accept Unicode: the engines deal with the byte aspect, so from a macro viewpoint everything is 'as expected'. Of course, there is a little setting up to do, for example setting \catcode, \uccode, etc. for the full Unicode range. Today, this is handled using the unicode-data files, so it is built in to both plain TeX-derived and LaTeX formats.
There are a few places where LaTeX has to be aware of which engine is in use, but the direct impact of Unicode is largely limited to:
- Setting up the data for \catcode, etc.
- Setting up hyphenation patterns (which today are all stored as UTF-8 and require more work to use with pdfTeX than with Unicode TeX engines)
- (Not) setting up support for Unicode input based on active 8-bit characters (see Ulrike's answer for details of how that works in 8-bit engines)
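The kind of set-up involved can be illustrated with a short loop. This is a sketch only: the real formats read this data from the Unicode Character Database files rather than hard-coding ranges, and the chosen range here is just an example:

```latex
% Sketch: give a block of Greek letters catcode 11 ("letter") in a
% Unicode engine, so they can appear in words and be hyphenated.
\count255="0391          % GREEK CAPITAL LETTER ALPHA
\loop
  \catcode\count255=11
  \advance\count255 by 1
\ifnum\count255<"03CA \repeat
```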
Other aspects at the macro layer are related to other functionality, for example the ability of both XeTeX and LuaTeX to load system fonts: that requires a Unicode font encoding (TU), but that is distinct from handling input.
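As an aside, a related way to distinguish engines without relying on input decoding is to test for engine-specific primitives, much as the iftex package does (a sketch; note that \ifdefined is an e-TeX primitive, which all current engines provide but Knuth's TeX does not):

```latex
% Sketch: detect a Unicode engine by its version primitive.
\ifdefined\XeTeXversion
  \message{XeTeX}%
\else\ifdefined\luatexversion
  \message{LuaTeX}%
\else
  \message{8-bit engine (e.g. pdfTeX)}%
\fi\fi
```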
edited Jan 22 at 20:47
answered Jan 22 at 19:17
Joseph Wright♦