How to deal with Pagination when scraping
A website I'm scraping for educational purposes has pagination.
My code is scraping the first page perfectly fine...
But how would I scrape
?page=2
?page=3
?page=4
?page=5
...and beyond?
I have looked for solutions, but can't find anything that definitively answers what I need to know.
Current code:
// @nuget: HtmlAgilityPack
using System;
using System.Data;
using System.Data.SqlClient;
using System.Net;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        ServicePointManager.Expect100Continue = true;
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls
            | SecurityProtocolType.Tls11
            | SecurityProtocolType.Tls12
            | SecurityProtocolType.Ssl3;
        HtmlWeb web = new HtmlWeb();
        HtmlDocument html = web.Load("https://www.g2crowd.com/products/google-analytics/reviews");
        // var divNodes = html.DocumentNode.SelectNodes("//div[@class='mb-2 border-bottom']");
        var divNodes = html.DocumentNode.SelectNodes(@"//div[@itemprop='reviewBody']");
        if (divNodes != null)
        {
            foreach (var tag in divNodes)
            {
                string review = tag.InnerText;
                review = review.Replace("What do you like best?", "What do you like best?\n");
                review = review.Replace("What do you dislike?", "\nWhat do you dislike?\n");
                review = review.Replace("Recommendations to others considering the product", "\n\nRecommendations to others considering the product\n");
                review = review.Replace("What business problems are you solving with the product? What benefits have you realized?", "\n\nWhat business problems are you solving with the product? What benefits have you realized?\n");
                Console.WriteLine(review);
                Console.WriteLine("\n------------------------------- Review found. Adding to Database -------------------------------\n");
                review = review.Replace("'", "");
                review = review.Replace("\n", "<br />");
            }
        }
    }
}
c# .net web-scraping pagination
How do you instinctively think you would deal with it? You probably have your answer... There is no magic bullet here: either try for the next page, or search the page for clues to see if you can
– Michael Randall
Nov 21 '18 at 3:48
My guess is either following the link to the next page, or somehow coding "when done with page=1, move to page=2"? I'm pretty new to C#, so it's very hard to put my thoughts into code. A nudge from SO has helped me learn a lot in the past! Bit stumped is all!
– Duke Dodson
Nov 21 '18 at 3:50
Depending on whether you are making a crawler or not, the link should be followable if it is there. If you are just trying to get the set, then once again, just follow the link. Not really much more I can add; maybe someone else can chime in.
– Michael Randall
Nov 21 '18 at 3:55
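The "when done with page=1, move to page=2" idea from the comments could be sketched roughly as below. This is only an illustration under some assumptions: the class name PageNumberCrawler, the load delegate and the maxPages cap are made up for the example, page 1 is assumed to be the bare URL, and a page past the end is assumed to come back without any reviewBody nodes (a real site might instead return an error or redirect, which would need its own handling).

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PageNumberCrawler
{
    // Requests baseUrl, then baseUrl?page=2, ?page=3, ... and collects the
    // text of every review node. "load" is injected so the loop can be tried
    // against canned HTML; for real requests pass url => web.Load(url).
    public static List<string> CollectReviews(
        string baseUrl, Func<string, HtmlDocument> load, int maxPages = 50)
    {
        var reviews = new List<string>();

        for (int page = 1; page <= maxPages; page++)
        {
            // Assumption: the first page has no query string.
            string url = page == 1 ? baseUrl : baseUrl + "?page=" + page;
            HtmlDocument html = load(url);

            var nodes = html.DocumentNode.SelectNodes("//div[@itemprop='reviewBody']");
            if (nodes == null)
            {
                // SelectNodes returns null when nothing matches; this sketch
                // treats that as "no more pages".
                break;
            }

            foreach (HtmlNode node in nodes)
            {
                reviews.Add(node.InnerText.Trim());
            }
        }

        return reviews;
    }
}
```

With an HtmlWeb instance named web, the call would look like PageNumberCrawler.CollectReviews("https://www.g2crowd.com/products/google-analytics/reviews", url => web.Load(url)). The maxPages cap is there so a site that keeps returning the last page for any page number cannot cause an endless loop.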
asked Nov 21 '18 at 3:46 by Duke Dodson
1 Answer
The next link looks like this:
//link[@rel='next']
Just keep following it until it's not there anymore.
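In C# with HtmlAgilityPack, following that link might look something like the sketch below. It is not a definitive implementation: the class name NextLinkCrawler, the load delegate and the maxPages cap are invented for the example, and it assumes every page exposes the next page as a link element with rel="next" and an href attribute.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class NextLinkCrawler
{
    // Starts at startUrl and keeps following <link rel="next"> until a page
    // has no such link. "load" is injected so the loop can be tried against
    // canned HTML; for real requests pass url => web.Load(url).
    public static List<string> CollectReviews(
        string startUrl, Func<string, HtmlDocument> load, int maxPages = 50)
    {
        var reviews = new List<string>();
        var visited = new HashSet<string>(); // guards against link cycles
        string url = startUrl;

        while (url != null && visited.Count < maxPages && visited.Add(url))
        {
            HtmlDocument html = load(url);

            var nodes = html.DocumentNode.SelectNodes("//div[@itemprop='reviewBody']");
            if (nodes != null)
            {
                foreach (HtmlNode node in nodes)
                {
                    reviews.Add(node.InnerText.Trim());
                }
            }

            // The XPath runs on the parsed document, not on the raw response.
            HtmlNode next = html.DocumentNode.SelectSingleNode("//link[@rel='next']");
            string href = next?.GetAttributeValue("href", null);

            // Resolve a relative href against the current page; stop when
            // there is no next link.
            url = string.IsNullOrEmpty(href)
                ? null
                : new Uri(new Uri(url), HtmlEntity.DeEntitize(href)).AbsoluteUri;
        }

        return reviews;
    }
}
```

With an HtmlWeb instance named web, the call would be NextLinkCrawler.CollectReviews("https://www.g2crowd.com/products/google-analytics/reviews", url => web.Load(url)). Compared with counting page numbers, this stops exactly where the site says the pagination ends, at the cost of depending on the rel="next" markup being present.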
next_page = response.xpath('//link[@rel="next"]/@href').extract_first(); if (next_page yield response.follow(next_page));
This is what I have so far. Doesn't seem to work as of yet.
– Duke Dodson
Nov 21 '18 at 9:08
Well, I'm not sure what response is there; it should be the HTML parser object, not the raw response, if that makes sense.
– pguardiario
Nov 21 '18 at 9:22
answered Nov 21 '18 at 4:47 by pguardiario