Scrapy: DEBUG Crawled (200) but nothing returned
I am working on a crawling project and am trying to get the endorsement link for each brand.
My code is as follows:
It returns nothing. However, if I put each brand's URL directly into start_urls, it works well. But it would be hard for me to add all the URLs I want to start_urls manually, since I am not even sure how many there are...
The log is shown:
Can anyone help? Thanks in advance!
python web-scraping scrapy web-crawler scrapy-spider
Next time, it would be great if you could put your code directly in the question instead of in an image; that's much more helpful. Same for the logs.
– Guillaume, Nov 21 '18 at 18:00
edited Nov 21 '18 at 19:05 by Guillaume
asked Nov 20 '18 at 19:20 by Emily
1 Answer
Your restrict_xpaths expression looks wrong. You could use the allow parameter instead, which is much easier:
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MySpider(CrawlSpider):
        name = 'celebrityendorsers.com'
        start_urls = ['https://celebrityendorsers.com/endorsement/']
        rules = (
            # The first positional argument of LinkExtractor is `allow`:
            # a regex that a discovered URL must match to be followed.
            Rule(LinkExtractor('/endorsements/'), callback='parse_url_contents'),
        )

        def parse_url_contents(self, response):
            pass  # extract data from each endorsement page here
This is the output log:
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/playtex-wipes/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/plenish-cleanse/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/play-date-by-sarah-beckham/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/playstation-3/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/playmg/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/playsight/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/play-cloths/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/platinum-league-trading-cards/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/playstation/> (referer: https://celebrityendorsers.com/endorsement/)
2018-11-22 02:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://celebrityendorsers.com/endorsements/platinum-group/> (referer: https://celebrityendorsers.com/endorsement/)
If you really want to use XPath, then try removing the [*] predicate.
The XPath that you commented out looks correct, but the callback is wrong: you cannot use the built-in parse method as the callback with a CrawlSpider, because CrawlSpider uses parse internally to apply its rules.
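As a side note, the allow pattern in the rule above is an ordinary regular expression that each discovered URL must match before the link is followed. A minimal standard-library sketch of that filtering (the URLs are taken from the log above; everything besides the `allow` name and pattern is illustrative):

```python
import re

# LinkExtractor's `allow` argument is compiled as a regex and searched
# against each extracted URL; non-matching URLs are dropped.
allow = re.compile(r'/endorsements/')

urls = [
    'https://celebrityendorsers.com/endorsements/playtex-wipes/',
    'https://celebrityendorsers.com/endorsement/',   # the listing page itself
    'https://celebrityendorsers.com/about/',         # an unrelated page
]

followed = [u for u in urls if allow.search(u)]
print(followed)
# Only the /endorsements/ detail page passes the filter; note that the
# listing page (/endorsement/, singular) does not match the pattern.
```

This also shows why the pattern must say `/endorsements/` (plural): the start page's path `/endorsement/` does not contain that substring, so the spider follows only the detail pages.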
answered Nov 21 '18 at 18:09 by Guillaume