Why is this Java encoding UTF-8 --> Latin1 wrong?
I want to download this UTF-8 file and convert it to Latin1 in Java (Android). At line 443, Frango-d’água-menor is translated to Frango-d?água-menor instead of Frango-d'água-menor. The same happens at line 465, where Descrição física… is translated to Descrição física?, with that pesky ? at the end.
It seems this file is not valid UTF-8? But iconv -f utf-8 -t iso-8859-1//TRANSLIT on this file works just fine.
This is the code I use for the conversion (the downloaded file is in infofile):
fos = new FileOutputStream(infotxt);
out = new OutputStreamWriter(fos, "Latin1"); // encode the output as ISO-8859-1
fis = new FileInputStream(infofile);
// No charset given: the reader uses the platform default (UTF-8 on Android)
br = new BufferedReader(new InputStreamReader(fis));
while ((line = br.readLine()) != null) {
    out.write("\n" + line.trim());
}
br.close();
out.close();
fis.close();
fos.close();
java

? usually means that either the program you are writing with does not know how to convert a certain character, or the program you use to display it does not know what to print. You probably have to do some special-case handling or try another converter. iconv probably has a better mapping table between the two encodings.
– leonardkraemer
Jan 2 at 16:56
The ? appears in the (correctly written Latin1) downloaded file.
– Luis A. Florit
Jan 2 at 17:00
Then the first part of my comment is the answer. OutputStreamWriter has no mapping for the specific character from UTF-8 to Latin1. See stackoverflow.com/questions/652161/…
– leonardkraemer
Jan 2 at 17:04
Exactly. But then why did the TRANSLIT in iconv do a perfect job, and how can I simulate that in Java? Maybe something like this: stackoverflow.com/a/5807419/1483390
– Luis A. Florit
Jan 2 at 17:16
I guess you will have to fumble with CharsetEncoder or some other solution. You could run iconv on Android with the NDK, but that will give you even more problems.
– leonardkraemer
Jan 2 at 17:23
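As a rough illustration of the CharsetEncoder idea mentioned above, the following sketch lists the code points in a string that ISO-8859-1 cannot represent. The class and method names (UnmappableFinder, findUnmappable) are made up for this example, and it assumes java.nio.charset.StandardCharsets is available (Java 7+, Android API 19+). It only detects the problem characters; it does not transliterate them.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class UnmappableFinder {
    // Hypothetical helper: returns the code points of 'text' that
    // ISO-8859-1 (Latin-1) cannot encode, formatted as "U+XXXX" and
    // separated by single spaces.
    static String findUnmappable(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder report = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String s = new String(Character.toChars(cp));
            if (!encoder.canEncode(s)) {
                report.append(String.format("U+%04X ", cp));
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return report.toString().trim();
    }

    public static void main(String[] args) {
        // The string from line 443 of the file, with U+2019 as the apostrophe
        System.out.println(findUnmappable("Frango-d\u2019\u00e1gua-menor"));
    }
}
```

Running a check like this over the input once is one way to find out which characters need a replacement rule at all.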
asked Jan 2 at 16:51
Luis A. Florit
1 Answer
The file you linked is a UTF-8 encoded HTML file, and it uses characters outside of the Latin-1 character set. E.g. instead of the Latin-1 apostrophe that you expect (Frango-d'água-menor, using code U+0027) it uses the similar-looking Right Single Quotation Mark U+2019 (Frango-d’água-menor). This isn't part of the Latin-1 set, so you get a replacement question mark.
As Latin-1 can't encode the whole Unicode character set, you have to accept things like that.
Your best chance is to identify the problem characters and do a string replacement before writing to the limited Latin-1 set.
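A minimal sketch of that replacement step, assuming only the two characters from the question (U+2019 and the ellipsis U+2026) plus the left single quote U+2018 need handling. The class and method names (Latin1Replace, toLatin1Friendly) are invented for illustration. The main method also shows the effect described above: encoding the untouched string to ISO-8859-1 substitutes a question mark for the unmappable character.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Replace {
    // Hypothetical helper: swap a few typographic characters for
    // look-alikes that exist in Latin-1 before the text is encoded.
    static String toLatin1Friendly(String line) {
        return line
                .replace('\u2018', '\'')    // left single quotation mark
                .replace('\u2019', '\'')    // right single quotation mark
                .replace("\u2026", "...");  // horizontal ellipsis
    }

    public static void main(String[] args) {
        String line = "Frango-d\u2019\u00e1gua-menor";

        // Encoding directly: U+2019 has no Latin-1 byte, so the encoder
        // emits its replacement byte instead.
        byte[] raw = line.getBytes(StandardCharsets.ISO_8859_1);

        // Encoding after the replacement: every character is mappable.
        byte[] fixed = toLatin1Friendly(line).getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(new String(raw, StandardCharsets.ISO_8859_1));
        System.out.println(new String(fixed, StandardCharsets.ISO_8859_1));
    }
}
```

In the code from the question, this would amount to passing toLatin1Friendly(line.trim()) to out.write instead of line.trim().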
Do you know how to make a "reasonably" safe replacement, like the iconv TRANSLIT does, on the line variable in my code?
– Luis A. Florit
Jan 2 at 18:36
I meant, line.replaceAll("[\u2018\u2019]", "'") works for the quotation mark. Maybe there is some 'safe', relatively small list like this? I don't need to implement everything here: git.savannah.gnu.org/cgit/libiconv.git/tree/lib/translit.def
– Luis A. Florit
Jan 2 at 19:13
Only you can decide which replacements you need (which special Unicode characters you expect in the input). You have a good (?) reference source of replacements. If you need just a handful, go for multiple line.replaceAll() calls. If you need more, I'd build a replacement array indexed by the character code, containing the replacement string, maybe initialized from translit.def or similar.
– Ralf Kleberhoff
Jan 2 at 19:27
Yes, that is what I did: multiple line.replaceAll() calls after looking at translit.def. I guess only a handful may appear. Thanks!!
– Luis A. Florit
Jan 2 at 19:30
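For the larger, table-driven variant suggested in the comments above, a sketch could look like the following. It uses a map keyed by code point rather than an array, which is a simplification chosen for this example. The entries are a small hand-picked set for illustration, not copied from translit.def, and the names (Latin1Translit, transliterate) are invented. Anything that is neither encodable in ISO-8859-1 nor in the table still becomes a question mark.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class Latin1Translit {
    // Illustrative replacement table keyed by Unicode code point.
    // A real table could be generated from a source such as translit.def.
    private static final Map<Integer, String> TABLE = new HashMap<>();
    static {
        TABLE.put(0x2018, "'");    // left single quotation mark
        TABLE.put(0x2019, "'");    // right single quotation mark
        TABLE.put(0x201C, "\"");   // left double quotation mark
        TABLE.put(0x201D, "\"");   // right double quotation mark
        TABLE.put(0x2013, "-");    // en dash
        TABLE.put(0x2014, "-");    // em dash
        TABLE.put(0x2026, "...");  // horizontal ellipsis
    }

    // Hypothetical helper: keep what Latin-1 can encode, look up the rest
    // in TABLE, and fall back to "?" when there is no entry.
    static String transliterate(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);
            } else {
                String replacement = TABLE.get(cp);
                out.append(replacement != null ? replacement : "?");
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("Descri\u00e7\u00e3o f\u00edsica\u2026"));
    }
}
```

Compared with a chain of replaceAll() calls, this walks the string once and keeps all rules in one place, which may be easier to maintain if the list grows.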
answered Jan 2 at 18:00
Ralf Kleberhoff