Strange character in a file












7















I have a UTF-8 file that contains a strange character -- visible to me just as



<96>


This is how it appears on vi



vi screenshot



and how it appears on gedit



gedit screenshot



and how it appears under LibreOffice



LibreOffice screenshot



and that makes a series of basic Unix tools misbehave, including:





  1. cat file makes the character disappear, and more text with it

  2. I cannot copy and paste it inside vi/vim -- it will not even find itself

  3. grep fails to display anything as well, as if the character did not exist.


The file program works fine and recognizes it as a UTF-8 file. I also know that, because of the nature of the file, it most likely came from a copy and paste from the web, and the character originally represented an em dash.



My basic questions are:




  1. Is there anything wrong with this file?

  2. How can I search for other occurrences of it inside the same file?

  3. How can I grep for other files that may contain the same problem/character?


The file can be found here: file.txt






























  • 2





    First step is to hexdump -C filename and look at the encoding of what is "visible" to you as <96>. Context should help to pinpoint it.

    – dirkt
    Jan 20 at 19:35











  • @dirkt, the context points to the character being an EMDASH and hexdump -C shows c2 96. How can I search for other occurrences of the same thing?

    – Paulo Ney
    Jan 20 at 19:49











  • @G-Man, you can download the file, the character shows like that in vi/vim, for example, and I am using stock "grep" on Ubuntu 18.04.

    – Paulo Ney
    Jan 20 at 19:51











  • Is there any tool that can manage this character in a good way? I'm thinking of a word processor like LibreOffice Writer or a simple text editor like gedit, once you have set it to handle your language and UTF-8. In that case you could remove the character.

    – sudodus
    Jan 20 at 19:59













  • @sudodus I added the views from vi, gedit and LibreOffice -- none of them seems to produce anything useful.

    – Paulo Ney
    Jan 20 at 20:07
















unicode character-encoding














edited Jan 21 at 15:11









wjandrea











asked Jan 20 at 19:30









Paulo Ney


















3 Answers


















27














This file contains the bytes C2 96, which are the UTF-8 encoding of code point U+0096. That code point is one of the C1 control characters, commonly called SPA, "Start of Guarded Area" (or "Start of Protected Area"). It isn't a useful character on any modern system, but its presence is unlikely to be harmful.



The original source for this was likely a byte 0x96 in some single-byte 8-bit encoding that was transcoded incorrectly somewhere along the way. Probably this was originally a Windows CP1252 en dash "–", which has byte value 0x96 in that encoding (most other plausible candidates have the C1 control set at positions 80-9F). It was then translated to UTF-8 as though it were Latin-1 (ISO/IEC 8859-1), which is not an uncommon mistake. That leads to the byte being interpreted as the control character and translated accordingly, as you've seen.
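The mistranslation described above can be reproduced on a single byte. This is only a sketch, assuming an iconv that knows the encoding names ISO-8859-1 and CP1252 (glibc's does). The helper name to_utf8_hex is made up for the illustration, and the byte is written in octal (\226 is 0x96) so that any POSIX printf accepts it.

```shell
# Hypothetical helper: decode the single byte 0x96 from the encoding named
# in $1, re-encode it as UTF-8, and print the resulting bytes as hex.
to_utf8_hex() {
    printf '\226' | iconv -f "$1" -t UTF-8 | od -An -tx1 | tr -d ' \n'
    echo
}

to_utf8_hex ISO-8859-1   # the wrong guess: C1 control character U+0096
to_utf8_hex CP1252       # the right guess: en dash U+2013
```

With glibc's iconv the first call prints c296, the pair found in the file, and the second prints e28093, the en dash that was presumably intended.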





You can fix this file with the iconv tool, which is part of glibc.



iconv -f utf-8 -t iso-8859-1 < mwe.txt | iconv -f cp1252 -t utf-8


produces a correct version of your minimal example for me. That works by first converting the UTF-8 to latin-1 (inverting the earlier mistranslation), and then reinterpreting that as cp1252 to convert it back to UTF-8 correctly.



It does depend on what else is in the real file, however. If you have characters outside Latin-1 elsewhere it will fail because it can't encode those correctly at the first step.
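Because a plain pipeline hides a failure of the first iconv, it may help to run the two steps separately and stop when the first one fails. The following is a sketch under that assumption; the function name fix_mistranscoded and the sample line are invented for the example.

```shell
# Hypothetical helper: undo a CP1252-read-as-Latin-1 mistranslation in the
# file named by $1 and write the repaired text to standard output.
# Returns non-zero, writing nothing to standard output, when the first
# step (UTF-8 to Latin-1) fails.
fix_mistranscoded() (
    tmp=$(mktemp) || exit 1
    if iconv -f UTF-8 -t ISO-8859-1 "$1" > "$tmp"; then
        iconv -f CP1252 -t UTF-8 "$tmp"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    exit "$status"
)

# Demo on a made-up sample line containing the C2 96 pair (octal 302 226).
sample=$(mktemp)
printf 'pages = {1\302\226113},\n' > "$sample"
fix_mistranscoded "$sample"
rm -f "$sample"
```

The function body runs in a subshell, so its temporary variables do not leak into the calling shell.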



If you don't have iconv, or it doesn't work for the real file, you can replace the bytes directly using sed:



LC_ALL=C sed -e $'s/\xc2\x96/\xe2\x80\x93/g' < mwe.txt


This replaces C2 96 with the UTF-8 en dash encoding E2 80 93. You could also replace it with e.g. a hyphen or two by changing \xe2\x80\x93 into --.
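One way to check a byte replacement like this is to count the pair before and after. The sketch below assumes a grep that supports -a and -o (GNU grep does). The names bad, good, count_bad and replace_bad are made up for the example, and the bytes are produced with octal printf escapes so the snippet does not depend on the $'...' quoting of bash.

```shell
bad=$(printf '\302\226')        # bytes C2 96 (U+0096)
good=$(printf '\342\200\223')   # bytes E2 80 93 (U+2013, en dash)

# Count occurrences of the C2 96 pair in the file named by $1.
count_bad() {
    LC_ALL=C grep -a -o "$bad" "$1" | wc -l | tr -d ' '
}

# Write a repaired copy of the file named by $1 to standard output.
replace_bad() {
    LC_ALL=C sed "s/$bad/$good/g" "$1"
}

# Demo on a made-up sample file.
sample=$(mktemp)
fixed=$(mktemp)
printf 'pages = {1\302\22695},\npages = {7\302\2269},\n' > "$sample"
count_bad "$sample"
replace_bad "$sample" > "$fixed"
count_bad "$fixed"
rm -f "$sample" "$fixed"
```

Writing the result to a second file, rather than editing in place, keeps the original around until the counts look right.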





You can grep in a similar fashion. We're using LC_ALL=C to make sure we're reading the actual bytes, and not having grep interpret things:



LC_ALL=C grep -R $'\xc2\x96' .


lists every place under this directory where those bytes appear. You may want to limit it to just text files if you have mixed content around, since binary files will include any given pair of bytes fairly often.
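To keep binary files out of the results, GNU grep's -I option treats them as non-matching, and -l prints only file names. A minimal sketch, assuming GNU grep; the function name list_affected is invented here:

```shell
# Hypothetical helper: list the files under directory $1 (default: the
# current directory) that contain the byte pair C2 96, skipping any file
# grep considers binary.
list_affected() {
    LC_ALL=C grep -rlI "$(printf '\302\226')" "${1:-.}"
}
```

Calling list_affected with no argument searches the current directory; grep decides what counts as binary by its own heuristic (for example, the presence of NUL bytes), so this is a filter of convenience rather than a guarantee.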
































  • Is there a way I can search other files for the same occurrence, with something like grep?

    – Paulo Ney
    Jan 20 at 20:11






  • 1





    Yes, you can use grep $'\xc2\x96' (last section).

    – Michael Homer
    Jan 20 at 20:13











  • Is the file a "valid" UTF-8 file?

    – Paulo Ney
    Jan 20 at 20:24






  • 2





    Yes, it's a perfectly correct encoding of a not-very-useful character.

    – Michael Homer
    Jan 20 at 20:26






  • 1





    saying that windows-1252 mislabeled as iso-8859-1 is a common problem is quite an understatement ;-); iso-8859-1 should always be treated as a synonym of windows-1252 unless you want to trash your data (that's also mandated by the HTML5 standard -- and I've never seen a genuine iso-8859-1 file that was not the result of some encoding error).

    – Uncle Billy
    Jan 21 at 4:07



















2














0x96 is an en dash in Windows code page 1252. The c2 byte preceding it is a UTF-8 lead byte: code points U+0080 through U+00BF are encoded as c2 followed by the low byte itself, so c2 96 is the UTF-8 encoding of U+0096.



To search for other occurrences, put your cursor over it in command mode, hit yl (yank one character), then type /<Ctrl>+r". (ctrl+r lets you insert the contents of a register into the command, and the " register is whatever has last been yanked).



Just replace it with two hyphens if you want it to render in your terminal. If what you have is a BibTeX file, then two hyphens are the appropriate way to key it in.



To find occurrences of the character, you can pipe the file through a hex dump tool like xxd.



$ cat tmp | xxd | grep c296
00000000: 7061 6765 733d 7b31 c296 3935 7d2c 0a70 pages={1..95},.p
00000020: 6765 733d 7b31 c296 3935 7d2c 0a70 6167 ges={1..95},.pag
00000040: 733d 7b31 c296 3935 7d2c 0a70 6167 6573 s={1..95},.pages
00000060: 7b31 c296 3935 7d2c 0a70 6167 6573 3d7b {1..95},.pages={
00000080: c296 3935 7d2c 0a70 6167 6573 3d7b 31c2 ..95},.pages={1.
00000090: 9639 357d 2c0a 7061 6765 733d 7b31 c296 .95},.pages={1..
000000b0: 357d 2c0a 7061 6765 733d 7b31 c296 3935 5},.pages={1..95
000000d0: 2c0a 7061 6765 733d 7b31 c296 3935 7d2c ,.pages={1..95},
000000f0: 7061 6765 733d 7b31 c296 3935 7d2c 0a70 pages={1..95},.p
00000110: 6765 733d 7b31 c296 3935 7d2c 0a70 6167 ges={1..95},.pag
00000130: 733d 7b31 c296 3935 7d2c 0a70 6167 6573 s={1..95},.pages
00000150: 7b31 c296 3935 7d2c 0a70 6167 6573 3d7b {1..95},.pages={
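Searching the xxd text for c296 can miss a pair that is split across two of xxd's four-digit groups or across two rows, as in the "31c2" / "9639" split visible in the output above. A sketch of an alternative that searches the raw bytes and reports offsets, assuming GNU grep's -a, -o and -b options; the function name offsets_of_c296 is made up:

```shell
# Hypothetical helper: print the byte offset (counted from 0) of every
# C2 96 pair in the file named by $1, one offset per line.
offsets_of_c296() {
    LC_ALL=C grep -a -o -b "$(printf '\302\226')" "$1" | cut -d: -f1
}
```

Each printed number is the position of the c2 byte, which can be compared against the offsets in the left column of a hex dump.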































  • Nice. Is there a way I can search for occurrences in others files with something like grep?

    – Paulo Ney
    Jan 20 at 20:09






  • 7





    There is no 0x96 value in ASCII - presumably it was in an 8-bit encoding originally (I've speculated cp1252, but there are other options).

    – Michael Homer
    Jan 20 at 20:13











  • @MichaelHomer thanks for the correction.

    – jlovegren
    Jan 20 at 20:14











  • @PauloNey you can pass it through a hex dump util like xxd. See my updated answer.

    – jlovegren
    Jan 20 at 20:14



















1














The text in your file is pages = {1113}, and yes, it looks like the number 1113, but there is actually an invisible character after the first 1. You can also copy and paste the string from the edit link for this web page to get the encoded character.



We can look inside the string with some tools:



$ a='pages = {1113},'


Or, to make it explicitly clear and allow an easy copy-paste without using the edit page:



$ a=$(printf 'pages = {1\xc2\x96113},')

$ echo "$a" | od -An -tx1c
  70  61  67  65  73  20  3d  20  7b  31  c2  96  31  31  33  7d
   p   a   g   e   s       =       {   1 302 226   1   1   3   }
  2c  0a
   ,  \n

$ echo "$a" | sed -n l
pages = {1\302\226113},$

$ echo "$a" | xxd
00000000: 7061 6765 7320 3d20 7b31 c296 3131 337d  pages = {1..113}
00000010: 2c0a                                     ,.


So the character consists of the two byte values c2 96 (in hex), or 302 226 (in octal).

It is the UTF-8 encoding of the value 0x96 or, expressed as a Unicode code point, U+0096.
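The step from the two bytes to the code point is plain bit arithmetic: in a two-byte UTF-8 sequence the lead byte (110xxxxx) carries the top five bits and the continuation byte (10xxxxxx) the low six. A small sketch in shell arithmetic; the function name decode2 is invented for the illustration and it does no validation of its input:

```shell
# Hypothetical helper: decode a two-byte UTF-8 sequence given as two
# numbers (lead byte, continuation byte) and print the code point in decimal.
decode2() {
    echo $(( (($1 & 0x1f) << 6) | ($2 & 0x3f) ))
}

# 0xc2 & 0x1f = 2, shifted left by 6 = 128; 0x96 & 0x3f = 22; 128 | 22 = 150.
printf 'U+%04X\n' "$(decode2 0xc2 0x96)"   # prints U+0096
```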



That value, whether in present-day Unicode (and hence UTF-8) or in ISO-8859-1, is a control character in the C1 region of control characters (Wikipedia page) (Unicode PDF), which runs from 128 to 159 in decimal. Specifically, U+0096 is called "START OF GUARDED AREA", or SPA.



That value (150 in decimal) is beyond the ASCII range (0-127) and was, in older times, used to represent different characters depending on the code page in use. It seems reasonable to assume that it was previously a dash (marking the range 1-113) as encoded in Windows-1252 (Microsoft page) (Wikipedia 1252), where it is called an en dash, the shorter of the two dashes, en and em (Wikipedia en dash), or simply, in layman's terms, a dash (-).





Q1: Is there anything wrong with this file?



Not really. Control characters are valid characters: rarely used, but valid nonetheless.



But you may replace them with a dash to make editing easier.



<file.txt sed 's/\xc2\x96/-/'


Q2 - How can I search for other occurrences of it inside the same file?



sed -n '/\xc2\x96/p'             # will print lines that contain that character.


Or, grep could search for the character (the color highlight will not be visible as the character is non-printable) and print the line.



c="$(printf "\U96")" ; grep "$c" file.txt


Or, more broadly, find all characters in that control character range and list the files that contain such characters:



grep -rlP "[\x80-\x9f]"


Q3 - How can I grep for other files that may contain the same problem/character?



This will list (-l) the files that match the character.



grep -rlP "\x96"























































    edited Jan 21 at 19:08

























    answered Jan 20 at 20:05









    Michael Homer













    • Is there a way I can search other files for the same occurrence, with something like grep?

      – Paulo Ney
      Jan 20 at 20:11






    • 1





      Yes, you can use grep $'xc2x96' (last section).

      – Michael Homer
      Jan 20 at 20:13











    • Is the file a "valid" UTF-8 file?

      – Paulo Ney
      Jan 20 at 20:24






    • 2





      Yes, it's a perfectly correct encoding of a not-very-useful character.

      – Michael Homer
      Jan 20 at 20:26






    • 1





      saying that windows-1252 mislabeled as iso-8859-1 is a common problem is quite an understatement ;-); iso-8859-1 should always be treated as a synonym of windows-1252 unless you want to trash your data (that's also mandated by the HTML5 standard -- and I've never seen a genuine iso-8859-1 file that was not the result of some encoding error).

      – Uncle Billy
      Jan 21 at 4:07
































    2














    0x96 is an en dash in Windows code page 1252. The c2 byte preceding it is the lead byte of a two-byte UTF-8 sequence: UTF-8 encodes the code points U+0080 through U+00BF as c2 followed by the byte value itself, so c2 96 is the code point U+0096.
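    Where the c2 comes from can be worked out with shell arithmetic. The sketch below applies the standard two-byte UTF-8 layout (110xxxxx 10xxxxxx, used for code points U+0080 to U+07FF) to the value 0x96; the variable names are only illustrative.

```shell
cp=$((0x96))                      # code point U+0096
lead=$(( 0xC0 | (cp >> 6) ))      # 110xxxxx: bits above the low six
cont=$(( 0x80 | (cp & 0x3F) ))    # 10xxxxxx: the low six bits
encoded=$(printf '%02x %02x' "$lead" "$cont")
echo "$encoded"
```

    This prints c2 96, the same two bytes that appear in the file.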



    To search for other occurrences in vim, put your cursor over it in normal mode, hit yl (yank one character), then type / followed by Ctrl+r and ". (Ctrl+r lets you insert the contents of a register into the command line, and the " register holds whatever was last yanked.)



    Just replace it with two hyphens if you want it to render in your terminal. If that is a bibtex file that you have, then two hyphens are the appropriate way to key it in.



    To find occurrences of the character, you can pipe the file through a hex dump tool like xxd:



    $ cat tmp | xxd | grep c296
    00000000: 7061 6765 733d 7b31 c296 3935 7d2c 0a70 pages={1..95},.p
    00000020: 6765 733d 7b31 c296 3935 7d2c 0a70 6167 ges={1..95},.pag
    00000040: 733d 7b31 c296 3935 7d2c 0a70 6167 6573 s={1..95},.pages
    00000060: 7b31 c296 3935 7d2c 0a70 6167 6573 3d7b {1..95},.pages={
    00000080: c296 3935 7d2c 0a70 6167 6573 3d7b 31c2 ..95},.pages={1.
    00000090: 9639 357d 2c0a 7061 6765 733d 7b31 c296 .95},.pages={1..
    000000b0: 357d 2c0a 7061 6765 733d 7b31 c296 3935 5},.pages={1..95
    000000d0: 2c0a 7061 6765 733d 7b31 c296 3935 7d2c ,.pages={1..95},
    000000f0: 7061 6765 733d 7b31 c296 3935 7d2c 0a70 pages={1..95},.p
    00000110: 6765 733d 7b31 c296 3935 7d2c 0a70 6167 ges={1..95},.pag
    00000130: 733d 7b31 c296 3935 7d2c 0a70 6167 6573 s={1..95},.pages
    00000150: 7b31 c296 3935 7d2c 0a70 6167 6573 3d7b {1..95},.pages={
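    Note that xxd prints bytes in groups of two, so a c2 at an odd offset ends one group and its 96 starts the next (as in "31c2 9639"), and a plain grep c296 on the dump can miss it. As an alternative sketch, assuming GNU grep, the byte offset of every occurrence can be listed directly; byte_offsets is a hypothetical helper name.

```shell
# byte_offsets FILE  (hypothetical helper name)
# Print the 0-based byte offset of each C2 96 pair in FILE.
# -a: treat the file as text, -o: print each match, -b: prefix its offset.
byte_offsets() {
    LC_ALL=C grep -aob -- "$(printf '\302\226')" "$1" | cut -d: -f1
}

# Usage: byte_offsets tmp
```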































    • Nice. Is there a way I can search for occurrences in others files with something like grep?

      – Paulo Ney
      Jan 20 at 20:09






    • 7





      There is no 0x96 value in ASCII - presumably it was in an 8-bit encoding originally (I've speculated cp1252, but there are other options).

      – Michael Homer
      Jan 20 at 20:13











    • @MichaelHomer thanks for the correction.

      – jlovegren
      Jan 20 at 20:14











    • @PauloNey you can pass it through a hex dump util like xxd. See my updated answer.

      – jlovegren
      Jan 20 at 20:14
























    edited Jan 21 at 13:53









    glglgl











    answered Jan 20 at 20:04









    jlovegren

























    1














    The text in your file is pages = {1113},. Yes, it looks like the number 1113, but there is actually a different, invisible character after the first 1. You can copy and paste the string from the edit link for this web page to get the encoded character.



    We can look inside the string with some tools:



    $ a='pages = {1113},'


    Or, to make it explicitly clear and allow an easy copy-paste without using the edit page:



    $ a=$(printf 'pages = {1\xc2\x96113},')

    $ echo "$a" | od -An -tx1c
      70  61  67  65  73  20  3d  20  7b  31  c2  96  31  31  33  7d
       p   a   g   e   s       =       {   1 302 226   1   1   3   }
      2c  0a
       ,  \n

    $ echo "$a" | sed -n l
    pages = {1\302\226113},$

    $ echo "$a" | xxd
    00000000: 7061 6765 7320 3d20 7b31 c296 3131 337d  pages = {1..113}
    00000010: 2c0a                                     ,.


    So the character consists of two bytes with the values c2 96 (in hex) or 302 226 (in octal).

    That is the UTF-8 encoding of the value 0x96, or, expressed as a Unicode character, U+0096.



    That value, in present-day UTF-8 (or, better yet, in ISO-8859-1), is a control character in the C1 region of control characters (Wikipedia page) (Unicode PDF), which runs from 128 to 159 in decimal. Specifically, U+0096 is called "START OF GUARDED AREA", or SPA.



    That value (decimal 150) is beyond the ASCII range (0-127) and was, in older times, used to represent several different characters depending on the code page in use. It seems reasonable to assume that it was previously a dash (marking the range 1-113) as encoded in Windows-1252 (Microsoft page) (Wikipedia 1252), where it is called an en dash (the shorter of the two dashes, en and em) (Wikipedia en dash) or simply, in layman's terms, a dash (-).





    Q1: Is there anything wrong with this file?



    Not really. Control characters are valid characters: rarely used, but valid nonetheless.



    But you may replace them with a dash to make editing easier.



    <file.txt sed 's/\xc2\x96/-/'
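    The claim that the file is valid UTF-8 can also be checked mechanically. One common idiom, sketched here under the assumption that iconv exits non-zero on malformed input (glibc's does), is to convert from UTF-8 to UTF-8 and discard the output; is_valid_utf8 is a hypothetical helper name.

```shell
# is_valid_utf8 FILE  (hypothetical helper name)
# Succeeds if FILE decodes cleanly as UTF-8, fails otherwise.
is_valid_utf8() {
    iconv -f utf-8 -t utf-8 < "$1" > /dev/null 2>&1
}

# Usage: is_valid_utf8 file.txt && echo "valid UTF-8"
```

    A file containing c2 96 passes this check, while a file containing a bare 0x96 byte does not, which matches the distinction between "valid" and "useful" drawn above.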


    Q2: How can I search for other occurrences of it inside the same file?



    sed -n '/\xc2\x96/p' file.txt  # will print lines that contain that character.


    Or, grep could search for the character (the color highlight will not be visible as the character is non-printable) and print the line.



    c="$(printf "\U96")" ; grep "$c" file.txt


    Or, more broadly, find all characters in that control character range and list the files that contain them:



    grep -rlP "[\x80-\x9f]"


    Q3: How can I grep for other files that may contain the same problem/character?



    This will list (-l) the files that match the character.



    grep -rlP "\x96"













        edited Jan 27 at 2:46

























        answered Jan 26 at 7:49









        Isaac






























