Web scraping certain web page cannot finish

So I'm learning web scraping with Node 8. I followed this and ran:
npm install --save request-promise cheerio puppeteer

The code is simple:

const rp = require('request-promise');
const url = 'https://www.examples.com'; //good

rp(url).then( (html) => {
    console.log(html);
}).catch( (e) => {
    console.log(e);
});

If the URL is examples.com, I can see the plain HTML output, great.

Q1: If the URL is yahoo.com, it outputs binary data, e.g.
�i��,a��g�Z.~�Ż�ڔ+�<ٵ�A�y�+�c�n1O>Vr�K�#,bc��8��|��U>��p4U>mś0��Z�M�Xg"6�lS�2B�+�Y�Ɣ��? ��*
Why is this?

Q2: Then with nasdaq.com,
const url = 'https://www.nasdaq.com/earnings/report/msft';
the above code just won't finish; it seems to hang there.

Why is this, please?

edited Jan 2 at 3:07

asked Jan 2 at 2:56

user3552178

4561817

Can you share some of the "binary data" that is output for yahoo.com?

– nareddyt
Jan 2 at 3:05

I have tried another HTTP client package called "Axios", and the result is the same. Maybe it's just how Yahoo returns their data?

– Felix Fong
Jan 2 at 3:10

@FelixFong Maybe, I don't know much about this stuff, but if you run it in a browser, everything is fine. The second question is even more confusing: it just returns nothing and hangs there.

– user3552178
Jan 2 at 3:17


node.js puppeteer


1 Answer
1


I'm not sure about Q2, but I can answer Q1.

It seems like Yahoo is detecting you as a bot and preventing you from scraping the page. The most common way sites detect bots is the User-Agent header. When you make a request using request-promise (which uses the request library internally), it does not set this header at all. Websites can therefore infer that your request came from a program rather than a web browser, because there is no User-Agent header. They will then treat you as a bot and send back gibberish or never serve you content.

You can work around this by manually setting a User-Agent header to mimic a browser. Note this seems to work for Yahoo, but might not work for all websites. Other websites might use more advanced techniques to detect bots.

const rp = require('request-promise');
const url = 'https://www.yahoo.com'; //good

const options = {
  url,
  headers: {
    'User-Agent': 'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0'
  }
};

rp(options).then( (html) => {
    console.log(html);
}).catch( (e) => {
    console.log(e);
});
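A separate possibility worth ruling out for Q1, offered as an assumption rather than something confirmed for Yahoo: output that looks like binary is often just a compressed (gzip) response body printed as text. The sketch below uses only Node's built-in zlib module; looksGzipped and bodyToText are hypothetical helper names, and the compressed buffer is simulated locally instead of being fetched from a real site.

```javascript
const zlib = require('zlib');

// gzip streams always begin with the two magic bytes 0x1f 0x8b.
function looksGzipped(buffer) {
  return buffer.length >= 2 && buffer[0] === 0x1f && buffer[1] === 0x8b;
}

// Hypothetical helper: turn a raw response body into text,
// decompressing first when it looks gzip-encoded.
function bodyToText(buffer) {
  const raw = looksGzipped(buffer) ? zlib.gunzipSync(buffer) : buffer;
  return raw.toString('utf8');
}

// Simulate a server that gzips its HTML (no network access needed).
const compressed = zlib.gzipSync(Buffer.from('<html>hello</html>'));

console.log(looksGzipped(compressed)); // true
console.log(bodyToText(compressed));   // <html>hello</html>
```

With the request library, the gzip: true option asks for compressed content and decodes it for you, and encoding: null returns the body as a Buffer if you want to inspect the raw bytes yourself.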

Q2 might be related, but setting a User-Agent header as above does not solve it. Nasdaq might be running more sophisticated bot detection, such as checking various other headers.
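For the hang in Q2, a minimal sketch under stated assumptions: the request library accepts a timeout option in milliseconds, so a stalled request rejects into .catch instead of waiting forever. buildOptions is a hypothetical helper, and the extra Accept and Accept-Language headers are only guesses at what a stricter site might look for; none of this is confirmed to get past Nasdaq.

```javascript
// Hypothetical helper: build an options object in the shape the `request`
// library accepts (url, headers, timeout in milliseconds).
function buildOptions(url, timeoutMs = 10000) {
  return {
    url,
    timeout: timeoutMs, // give up instead of waiting indefinitely
    headers: {
      'User-Agent': 'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0',
      // Assumed extra headers; a browser normally sends these as well.
      'Accept': 'text/html,application/xhtml+xml',
      'Accept-Language': 'en-US,en;q=0.9'
    }
  };
}

const options = buildOptions('https://www.nasdaq.com/earnings/report/msft', 5000);

console.log(options.timeout);
console.log(Object.keys(options.headers));
```

Pass the result to rp(options) exactly as in the Yahoo example. A timeout does not fetch the page, but it turns a silent hang into an error you can log.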

answered Jan 2 at 3:39

nareddyt

495410

1

Good one, let me read a little bit more, many thanks !

– user3552178
Jan 2 at 3:44

No problem! In general, websites want you to use their APIs instead of web-scraping because it is easier to monetize APIs. Nasdaq has a real-time quote API that costs money, which is probably why they block bots from web-scraping. I would suggest looking for other APIs to solve your problem instead of web-scraping. This might be a good place to start.

– nareddyt
Jan 2 at 3:48

thx again, you rock !

– user3552178
Jan 2 at 4:16


