Why is this Java encoding UTF-8 --> Latin1 wrong?

I want to download this UTF-8 file and convert it to Latin1 in Java (Android). At line 443, Frango-d’água-menor is translated to Frango-d?água-menor instead of Frango-d'água-menor. The same happens at line 465, where Descrição física… is translated to Descrição física?, with that pesky ? at the end.

It seems this file is not a valid UTF-8? But iconv -f utf-8 -t iso-8859-1//TRANSLIT on this file works just fine.

This is the code I use to download (downloaded file is in infofile):

                fos = new FileOutputStream(infotxt);
                out = new OutputStreamWriter(fos, "Latin1");
                fis = new FileInputStream(infofile);
                // No charset is given here, so the reader uses the platform default (UTF-8 on Android)
                br = new BufferedReader(new InputStreamReader(fis));
                while ((line = br.readLine()) != null) {
                    out.write("\n"+line.trim());
                }
                br.close();
                out.close();
                fis.close();
                fos.close();

asked Jan 2 at 16:51

Luis A. Florit

8021639

? usually means that either the program you are writing with does not know how to convert a certain character, or the program you use to display it does not know what to print. You probably have to do some special-case handling or try another converter. iconv probably has a better mapping table between the two encodings.

– leonardkraemer
Jan 2 at 16:56

The ? appears in the (correctly written Latin1) downloaded file.

– Luis A. Florit
Jan 2 at 17:00

1

Then the first part of my comment is the answer. OutputStreamWriter has no mapping for the specific character from UTF-8 to Latin1. see stackoverflow.com/questions/652161/…

– leonardkraemer
Jan 2 at 17:04

Exactly. But then why did TRANSLIT in iconv do a perfect job, and how can I simulate that in Java? Maybe something like this: stackoverflow.com/a/5807419/1483390

– Luis A. Florit
Jan 2 at 17:16

I guess you will have to fumble with CharsetEncoder or some other solution. You could run iconv on Android via the NDK, but that will give you even more problems.

– leonardkraemer
Jan 2 at 17:23
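
Since CharsetEncoder comes up here, a small sketch may help show where the ? comes from: OutputStreamWriter silently substitutes the charset's replacement byte for anything its encoder cannot map, and a Latin-1 encoder can be asked up front which characters those are. The class name UnmappableFinder and the method findUnmappable are made up for this illustration, and the sketch assumes the text has already been decoded into a Java String.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of the original question or answer.
class UnmappableFinder {

    // Returns the code points in text that ISO-8859-1 cannot represent,
    // formatted as U+XXXX, in order of first appearance.
    static Set<String> findUnmappable(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        Set<String> found = new LinkedHashSet<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String single = new String(Character.toChars(cp));
            if (!encoder.canEncode(single)) {
                found.add(String.format("U+%04X", cp));
            }
            i += Character.charCount(cp);
        }
        return found;
    }

    public static void main(String[] args) {
        // The two strings from the question, written with Unicode escapes.
        System.out.println(findUnmappable("Frango-d\u2019\u00e1gua-menor"));    // prints [U+2019]
        System.out.println(findUnmappable("Descri\u00e7\u00e3o f\u00edsica\u2026")); // prints [U+2026]
    }
}
```

Running something like this over the input once is a cheap way to learn which characters actually need special handling before deciding on a replacement strategy.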

java android encoding

1 Answer

The file you linked is a UTF-8 encoded HTML file, and it uses characters outside of the Latin-1 character set. E.g. instead of the Latin-1 quotation mark that you expect (Frango-d'água-menor, using code U+0027) it uses the similar-looking Right Single Quotation Mark U+2019 (Frango-d’água-menor). This isn't part of the Latin-1 set, so you get a replacement question mark.

As Latin-1 can't encode the whole Unicode character set, you have to accept things like that.

Your best chance is to identify the problem characters and do a string replacement before writing to the limited Latin-1 set.
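
As a minimal sketch of that string replacement, assuming only the three characters below matter (the two single quotation marks U+2018/U+2019 and the horizontal ellipsis U+2026, which is the character behind the trailing ? in "Descrição física…"): the class QuoteFix and the method toLatin1Safe are hypothetical names, and the chosen ASCII substitutes are a judgment call rather than anything mandated by Java or iconv.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating "replace before writing Latin-1".
class QuoteFix {

    // Maps a few typographic characters to text that ISO-8859-1 can encode.
    static String toLatin1Safe(String line) {
        return line
                .replace('\u2018', '\'')    // left single quotation mark
                .replace('\u2019', '\'')    // right single quotation mark
                .replace("\u2026", "...");  // horizontal ellipsis
    }

    public static void main(String[] args) {
        String line = "Frango-d\u2019\u00e1gua-menor\u2026";

        // Without the replacement, unmappable characters become '?' bytes.
        byte[] raw = line.getBytes(StandardCharsets.ISO_8859_1);
        // With the replacement, every character fits into Latin-1.
        byte[] fixed = toLatin1Safe(line).getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(new String(raw, StandardCharsets.ISO_8859_1));
        System.out.println(new String(fixed, StandardCharsets.ISO_8859_1));
    }
}
```

In the loop from the question, the same call would wrap the line before it is written, for example out.write("\n" + toLatin1Safe(line.trim())).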

answered Jan 2 at 18:00

Ralf Kleberhoff

3,860156

Do you know how to make a "reasonably" safe replacement, like the iconv TRANSLIT does, at the line variable in my code?

– Luis A. Florit
Jan 2 at 18:36

I meant, line.replaceAll("[\u2018\u2019]", "'") works for the quotation mark. Maybe there is some 'safe', relatively small list like this? I don't need to implement everything here: git.savannah.gnu.org/cgit/libiconv.git/tree/lib/translit.def

– Luis A. Florit
Jan 2 at 19:13

1

Only you can decide which replacements you need (which special Unicode characters you expect in the input). You have a good (?) reference source of replacements. If you need just a handful, go for multiple line.replaceAll(). If you need more, I'd build a replacement array indexed by the character code, containing the replacement string, maybe initialized from translit.def or similar.

– Ralf Kleberhoff
Jan 2 at 19:27

Yes, that is what I did, multiple line.replaceAll() after looking at the translit.def. I guess only a handful may appear. Thanks!!

– Luis A. Florit
Jan 2 at 19:30
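
For the case where more than a handful of replacements is needed, the table idea from the comments could look roughly like the sketch below. It uses a map keyed by code point instead of an array, relies on the fact that ISO-8859-1 covers exactly U+0000 to U+00FF, and falls back to ? for anything not in the table. The class name Latin1Translit and the table entries are illustrative assumptions; they are not copied from translit.def.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical table-driven transliterator for Latin-1 output.
class Latin1Translit {

    // Illustrative entries only; extend with whatever the input really contains.
    private static final Map<Integer, String> TABLE = new HashMap<>();
    static {
        TABLE.put(0x2018, "'");    // left single quotation mark
        TABLE.put(0x2019, "'");    // right single quotation mark
        TABLE.put(0x201C, "\"");   // left double quotation mark
        TABLE.put(0x201D, "\"");   // right double quotation mark
        TABLE.put(0x2013, "-");    // en dash
        TABLE.put(0x2014, "-");    // em dash
        TABLE.put(0x2026, "...");  // horizontal ellipsis
    }

    static String transliterate(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp <= 0xFF) {
                // Already representable in ISO-8859-1: copy unchanged.
                sb.append((char) cp);
            } else {
                // Look up a substitute; unknown characters become '?'.
                String replacement = TABLE.get(cp);
                sb.append(replacement != null ? replacement : "?");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("Frango-d\u2019\u00e1gua-menor"));
        System.out.println(transliterate("Descri\u00e7\u00e3o f\u00edsica\u2026"));
    }
}
```

Compared with chained replaceAll() calls, this makes a single pass over each line and keeps the list of substitutions in one place, which is easier to grow if new problem characters turn up.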
