Web scraping certain web page cannot finish
So I'm learning web scraping with Node 8, and followed this:
    npm install --save request-promise cheerio puppeteer
The code is simple:
    const rp = require('request-promise');
    const url = 'https://www.examples.com'; // good
    rp(url).then((html) => {
        console.log(html);
    }).catch((e) => {
        console.log(e);
    });
Now if the URL is examples.com, I can see the plain HTML output, great.
Q1: If the URL is yahoo.com, it outputs binary data, e.g.
�i��,a��g�Z.~�Ż�ڔ+�<ٵ�A�y�+�c�n1O>Vr�K�#,bc���8�����|����U>��p4U>mś0��Z�M�Xg"6�lS�2B�+�Y�Ɣ���? ��*
Why is this?
Q2: Then with nasdaq.com,
    const url = 'https://www.nasdaq.com/earnings/report/msft';
the above code just won't finish; it seems to hang there.
Why is this, please?
node.js puppeteer
Can you share some of the "binary data" that is output for yahoo.com?
– nareddyt
Jan 2 at 3:05
I have tried another HTTP client package called "Axios", and the result is the same. Maybe it's just how Yahoo returns their data?
– Felix Fong
Jan 2 at 3:10
@FelixFong maybe, I don't know much about this stuff, but if you load the page in a browser, everything is fine. The second question is even more confusing: it just returns nothing and hangs there.
– user3552178
Jan 2 at 3:17
asked Jan 2 at 2:56 by user3552178 (edited Jan 2 at 3:07)
1 Answer
I'm not sure about Q2, but I can answer Q1.
It seems like Yahoo is detecting you as a bot and preventing you from scraping the page! The most common method sites use to detect bots is via the User-Agent header. When you make a request using request-promise (which uses the request library internally), it does not set this header at all. This means websites can infer your request came from a program (instead of a web browser) because there is no User-Agent header. They will then treat you like a bot and send you back gibberish or never serve you content.
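To see what the "gibberish" actually is, it can help to look at the raw bytes rather than the decoded string. The sketch below is only a diagnostic, and it assumes the body might be gzip-compressed, which this thread does not confirm. `looksGzipped` and `decodeBody` are hypothetical helper names, and only Node's built-in `zlib` module is used.

```javascript
const zlib = require('zlib');

// Hypothetical helper: gzip streams start with the magic bytes 0x1f 0x8b.
function looksGzipped(buf) {
  return buf.length >= 2 && buf[0] === 0x1f && buf[1] === 0x8b;
}

// Hypothetical helper: gunzips the body if it looks like gzip,
// otherwise decodes the bytes as UTF-8 unchanged.
function decodeBody(buf) {
  return looksGzipped(buf)
    ? zlib.gunzipSync(buf).toString('utf8')
    : buf.toString('utf8');
}

// Local demonstration with a body compressed in-process (no network involved).
const sample = zlib.gzipSync(Buffer.from('<html>hello</html>'));
console.log(looksGzipped(sample)); // true
console.log(decodeBody(sample));   // <html>hello</html>
```

With request-promise, the raw bytes should be available by passing `encoding: null` in the options, which makes the underlying request library return the body as a Buffer instead of a string.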
You can work around this by manually setting a User-Agent header to mimic a browser. Note this seems to work for Yahoo, but might not work for all websites. Other websites might use more advanced techniques to detect bots.
    const rp = require('request-promise');
    const url = 'https://www.yahoo.com'; // good
    const options = {
        url,
        headers: {
            'User-Agent': 'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0'
        }
    };
    rp(options).then((html) => {
        console.log(html);
    }).catch((e) => {
        console.log(e);
    });
Q2 might be related to this, but the above code does not solve it. Nasdaq might be running more sophisticated bot detection, such as checking for various other headers.
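For a request that never returns, as in Q2, one option is to set a timeout so the promise rejects instead of hanging, and to send a few more browser-like headers. This is a minimal sketch, not a confirmed fix for Nasdaq: `buildOptions` is a hypothetical helper, the extra header values are illustrative assumptions, and `timeout` and `gzip` are options of the request library that request-promise passes through.

```javascript
// Hypothetical helper: builds an options object for request-promise.
function buildOptions(url, timeoutMs = 10000) {
  return {
    url,
    timeout: timeoutMs, // fail with an error instead of waiting indefinitely
    gzip: true,         // let `request` decode gzip/deflate-compressed responses
    headers: {
      // Illustrative browser-like values (assumptions); a site may check others.
      'User-Agent': 'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0',
      'Accept': 'text/html,application/xhtml+xml',
      'Accept-Language': 'en-US,en;q=0.9'
    }
  };
}

const options = buildOptions('https://www.nasdaq.com/earnings/report/msft', 5000);
console.log(options.timeout);                          // 5000
console.log(Object.keys(options.headers).join(', '));  // User-Agent, Accept, Accept-Language
```

Calling `rp(buildOptions(url))` in place of `rp(options)` should then surface a timeout error in the `.catch` handler if the server never answers, which at least makes the hang visible.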
Good one, let me read a little bit more, many thanks !
– user3552178
Jan 2 at 3:44
No problem! In general, websites want you to use their APIs instead of web-scraping because it is easier to monetize APIs. Nasdaq has a real-time quote API that costs money, which is probably why they block bots from web-scraping. I would suggest looking for other APIs to solve your problem instead of web-scraping. This might be a good place to start.
– nareddyt
Jan 2 at 3:48
thx again, you rock !
– user3552178
Jan 2 at 4:16
answered Jan 2 at 3:39 by nareddyt