scrapy pipeline to JSON with Chinese characters
I'm trying to scrape some web content that contains Chinese characters. The scraped item looks like this:
2018-11-20 12:42:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://cn.bing.com/dict/search?q=tool&FORM=BDVSP6&mkt=zh-cn>
{'defBing': '工具;方法;受人利用的人',
'defWeb': '工具;方法;受人利用的人',
'pClass': 'n.',
'prUK': 'UK\xa0[tuːl]',
'prUS': 'US\xa0[tul]',
'word': 'tool'}
But after the pipeline processes it, the content looks like this:
{
    "word": "tool",
    "prUS": "US\u00a0[tul]",
    "prUK": "UK\u00a0[tu\u02d0l]",
    "pClass": "n.",
    "defBing": "\u5de5\u5177\uff1b\u65b9\u6cd5\uff1b\u53d7\u4eba\u5229\u7528\u7684\u4eba",
    "defWeb": "\u5de5\u5177\uff1b\u65b9\u6cd5\uff1b\u53d7\u4eba\u5229\u7528\u7684\u4eba"
}
The pipeline looks like this:

import json
import time


class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # One timestamped output file per crawl
        self.file = open('log/DICT.%s.json' % time.strftime('%Y%m%d-%H%M%S', time.localtime()), 'tw')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        try:
            line = json.dumps(dict(item), indent=4) + "\n"
            self.file.write(line)
        except Exception as e:
            print(e)
        return item
My question is: how can I keep the Chinese characters as-is in the *.json file? I really don't want those escaped Unicode sequences :)
python json scrapy
edited Nov 20 '18 at 16:57 by Ami Hollander
asked Nov 20 '18 at 14:41 by Solaris_9
1 Answer
It seems the json library is escaping those characters: by default, json.dumps() escapes every non-ASCII character. Try passing ensure_ascii=False to json.dumps(), as follows:
import json
import time


class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('log/DICT.%s.json' % time.strftime('%Y%m%d-%H%M%S', time.localtime()), 'tw')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        try:
            # ensure_ascii=False writes non-ASCII characters as-is instead of \uXXXX escapes
            line = json.dumps(dict(item), indent=4, ensure_ascii=False) + "\n"
            self.file.write(line)
        except Exception as e:
            print(e)
        return item
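
To see the effect in isolation, here is a minimal sketch assuming Python 3. The sample item and the temporary file path are illustrative and not taken from the original pipeline. It also opens the file with encoding="utf-8" explicitly, which is an addition to the code above: with ensure_ascii=False the raw characters reach the file, so the result depends on the file's encoding, and the platform default may not be UTF-8.

```python
import json
import os
import tempfile

# Illustrative item; only the field names mirror the question.
item = {"word": "tool", "defBing": "工具"}

# Default behaviour: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(item, indent=4)

# With ensure_ascii=False the characters are kept as-is.
readable = json.dumps(item, indent=4, ensure_ascii=False)

print(escaped)  # pure ASCII, safe on any console

# Both strings decode to the same data; only the text form differs.
assert json.loads(escaped) == json.loads(readable) == item

# Write the readable form with an explicit encoding (assumption: UTF-8 is wanted).
path = os.path.join(tempfile.mkdtemp(), "DICT.sample.json")
with open(path, "w", encoding="utf-8") as f:
    f.write(readable + "\n")

# Read it back to confirm the round trip.
with open(path, encoding="utf-8") as f:
    round_tripped = json.load(f)

print(round_tripped == item)  # True
```

Either form is valid JSON and loads back to the same values, so the choice only matters for how the file reads to a human or to tools that do not decode escapes.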
answered Nov 20 '18 at 14:46 by Ami Hollander
Thank you @Ami, you solved my issue! – Solaris_9, Nov 20 '18 at 15:06