Strange characters in string gotten from API can't decode
I'm creating a program that grabs data from an API and stores it in my own database. The problem is that some of the strings have some sort of character code where quotation marks should be. Upon further inspection, it appears to be the hex code for the quotation mark, but it's fantastically double escaped, confusing me along with all my decoders. I believe the string comes in as ASCII, and I don't have any issues with the other characters.
I know I can simply replace the specific character code with the actual character, but I need to catch stuff like this in the future. If it is hex, I need to comb strings for hex codes and replace them procedurally.
I've tried
clean_val = unicodedata.normalize('NFKD', val).encode('latin1').decode('utf8')
but I've gotten myself quite confused about the whole thing. This is how I fetch the data, followed by what gets printed:
response = session.get(url)
if response.status_code == requests.codes.ok:
    print(response.content)
b'{"Description":"American Assets Trust, Inc. (the \\u0093company\\u0094) is a full service, vertically ..."}'
I think the string is stored in their database as \" to satisfy some SQL escaping protocol. When I get it, the escaping backslash gets mixed in with the character code, thereby messing up the encoding.
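A quick way to test that hypothesis, sketched on a shortened copy of the payload (the variable names here are illustrative, not from the real program): the doubled backslash is only how Python prints a bytes object. The payload itself holds an ordinary JSON \u0093 escape, which json.loads turns into a single control character rather than a quotation mark.

```python
import json
import unicodedata

# Shortened copy of the payload. "\\u0093" in this literal is one backslash
# followed by "u0093" in the actual bytes, i.e. a standard JSON escape.
raw = b'{"Description":"(the \\u0093company\\u0094) is a full service"}'

text = json.loads(raw)['Description']
suspect = text[5]  # the character right after "(the "

print(hex(ord(suspect)))              # 0x93
print(unicodedata.category(suspect))  # Cc (a control character)
```

So nothing is double escaped in the JSON sense; the parsed string simply contains code points U+0093 and U+0094 where quotation marks were meant to be.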
python python-3.x character-encoding
Some variation of Ascii2Uni might help. Perhaps there is a Python binding or port.
– Tom Blodget
Jan 3 at 23:48
edited Mar 2 at 18:10 by snakecharmerb
asked Jan 2 at 23:29 by Jonathan Duncan
1 Answer
These look like characters from text that was originally encoded as cp1252 (Windows-1252): in that encoding, bytes 0x93 and 0x94 are the curly double quotation marks. After parsing the JSON, you can recover them by encoding the string as latin-1 and decoding it as cp1252:
>>> import json
>>> bs = b'{"Description":"American Assets Trust, Inc. (the \\u0093company\\u0094) is a full service, vertically ..."}'
>>> d = json.loads(bs)
>>> s = d['Description']
>>> decoded = s.encode('latin-1').decode('cp1252')
>>> decoded
'American Assets Trust, Inc. (the “company”) is a full service, vertically ...'
If you want plain ASCII quotation marks instead, you will have to replace the curly quotes yourself, using str.replace or str.translate:
>>> table = str.maketrans('“”', '""')
>>> decoded = s.encode('latin-1').decode('cp1252')
>>> decoded.translate(table)
'American Assets Trust, Inc. (the "company") is a full service, vertically ...'
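To catch this kind of thing in future responses rather than one field at a time, the same idea can be applied to every string in the parsed JSON. The following is only a sketch under some assumptions: the helper name repair_cp1252 is made up for illustration, and it assumes that any character in the range U+0080 to U+009F is a misread cp1252 byte, which is usually but not always true. It only touches runs of those characters, so strings that already contain characters outside latin-1 do not raise an error.

```python
import json
import re

# C1 control characters (U+0080..U+009F) are rarely intended in text; their
# presence usually means cp1252 bytes were interpreted as latin-1.
_C1_RUN = re.compile(r'[\x80-\x9f]+')

def _fix_run(match):
    # Re-interpret the code points as cp1252 bytes. A few byte values are
    # undefined in cp1252; errors='replace' turns those into U+FFFD.
    return match.group(0).encode('latin-1').decode('cp1252', errors='replace')

def repair_cp1252(value):
    """Recursively repair strings in parsed JSON (hypothetical helper)."""
    if isinstance(value, str):
        return _C1_RUN.sub(_fix_run, value)
    if isinstance(value, list):
        return [repair_cp1252(v) for v in value]
    if isinstance(value, dict):
        return {k: repair_cp1252(v) for k, v in value.items()}
    return value  # numbers, booleans and None pass through unchanged

raw = b'{"Description":"(the \\u0093company\\u0094) is a full service"}'
clean = repair_cp1252(json.loads(raw))
print(clean['Description'])  # (the “company”) is a full service
```

With requests, the parsed payload would come from response.json() instead of json.loads(raw). Whether to keep the curly quotes or translate them to ASCII afterwards is a separate choice.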
answered Mar 2 at 18:08 by snakecharmerb