Strange characters in string gotten from API can't decode












I'm creating a program that grabs data from an API and stores it in my own database. The problem is that some of the strings contain some sort of character code where quotation marks should be. On closer inspection, it appears to be a hex code for the quotation mark, but it seems to be double-escaped, which confuses both me and every decoder I've tried. I believe the string comes in as ASCII, and I don't have issues with any other characters.



I know I can simply replace the specific character code with the actual character, but I need to catch stuff like this in the future. If it is hex, I need to comb strings for hex codes and replace them procedurally.



I've tried



clean_val = unicodedata.normalize('NFKD', val).encode('latin1').decode('utf8')


I've gotten myself quite confused about the whole thing. This is how I fetch the data, followed by the raw response:



response = session.get(url)
if response.status_code == requests.codes.ok:
    print(response.content)

b'{"Description":"American Assets Trust, Inc. (the \u0093company\u0094) is a full service, vertically ..."}'


I think the string is stored in their database as something like \" to satisfy some SQL escaping protocol. When I get it, the escaping backslash gets mixed in with the character code, thereby messing up the encoding.










  • Some variation of Ascii2Uni might help. Perhaps there is a Python binding or port.

    – Tom Blodget
    Jan 3 at 23:48
















python python-3.x character-encoding






edited Mar 2 at 18:10 by snakecharmerb
asked Jan 2 at 23:29 by Jonathan Duncan













1 Answer














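It looks like these characters come from text that was encoded as cp1252: \u0093 and \u0094 match the cp1252 bytes 0x93 and 0x94, which are the curly quotation marks “ and ”. It's possible to decode them: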



>>> import json
>>> bs = rb'{"Description":"American Assets Trust, Inc. (the \u0093company\u0094) is a full service, vertically ..."}'
>>> d = json.loads(bs)
>>> s = d['Description']
>>> decoded = s.encode('latin-1').decode('cp1252')
>>> decoded
'American Assets Trust, Inc. (the “company”) is a full service, vertically ...'


If you need plain ASCII quotation marks instead, you will have to replace the curly ones manually, using str.replace or str.translate:



>>> table = str.maketrans('“”', '""')
>>> decoded = s.encode('latin-1').decode('cp1252')
>>> decoded.translate(table)
'American Assets Trust, Inc. (the "company") is a full service, vertically ...'
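
Since the question asks for a way to catch this kind of thing procedurally, here is a minimal sketch of one possible approach. It assumes the stray characters are always C1 control code points (U+0080 to U+009F) that were meant to be cp1252 bytes, and it fixes them one character at a time, so strings that also contain characters outside Latin-1 do not make the whole-string encode('latin-1') call fail. The names C1_RANGE, fix_c1_controls and fix_json_value, as well as the sample payload, are made up for illustration.

```python
import json

# Code points U+0080..U+009F are C1 control characters. They almost never
# appear in real text, so finding one suggests a mis-decoded cp1252 byte.
C1_RANGE = range(0x80, 0xA0)


def fix_c1_controls(text):
    """Replace C1 control characters with their cp1252 interpretation."""
    fixed = []
    for ch in text:
        if ord(ch) in C1_RANGE:
            try:
                ch = bytes([ord(ch)]).decode('cp1252')
            except UnicodeDecodeError:
                # A few byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) are
                # undefined in cp1252; leave those characters untouched.
                pass
        fixed.append(ch)
    return ''.join(fixed)


def fix_json_value(value):
    """Apply fix_c1_controls to every string inside a decoded JSON value."""
    if isinstance(value, str):
        return fix_c1_controls(value)
    if isinstance(value, list):
        return [fix_json_value(item) for item in value]
    if isinstance(value, dict):
        return {key: fix_json_value(item) for key, item in value.items()}
    return value


# Hypothetical payload: curly quotes sent as \u0093 and \u0094, plus
# characters that are already correct and must not be altered.
raw = rb'{"Description":"(the \u0093company\u0094) caf\u00e9 \u20ac"}'
cleaned = fix_json_value(json.loads(raw))
print(cleaned['Description'])  # (the “company”) café €
```

In a requests-based client, the same function could be applied to the result of response.json() before the values are written to the database. Whether every C1 character in the data really is a cp1252 leftover is an assumption that is worth checking against a sample of the API's output.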





answered Mar 2 at 18:08 by snakecharmerb































