How to deal with Pagination when scraping












A website I'm scraping for educational purposes has pagination.



My code is scraping the first page perfectly fine...



But how would I scrape



?page=2
?page=3
?page=4
?page=5


and beyond?



It should be noted that I have looked for solutions, but I can't seem to find anything that definitively answers what I need to know.



Current code:



// @nuget: HtmlAgilityPack
using System;
using System.Data;
using System.Data.SqlClient;
using System.Net;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        // Enable the protocol versions the site may negotiate.
        ServicePointManager.Expect100Continue = true;
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls
            | SecurityProtocolType.Tls11
            | SecurityProtocolType.Tls12
            | SecurityProtocolType.Ssl3;

        // Download and parse the first page of reviews.
        HtmlWeb web = new HtmlWeb();
        HtmlDocument html = web.Load("https://www.g2crowd.com/products/google-analytics/reviews");
        // var divNodes = html.DocumentNode.SelectNodes("//div[@class='mb-2 border-bottom']");

        // SelectNodes returns null when nothing matches, hence the null check.
        var divNodes = html.DocumentNode.SelectNodes(@"//div[@itemprop='reviewBody']");

        if (divNodes != null)
        {
            foreach (var tag in divNodes)
            {
                string review = tag.InnerText;
                // Put each question heading on its own line.
                review = review.Replace("What do you like best?", "What do you like best?\n");
                review = review.Replace("What do you dislike?", "\nWhat do you dislike?\n");
                review = review.Replace("Recommendations to others considering the product", "\n\nRecommendations to others considering the product\n");
                review = review.Replace("What business problems are you solving with the product? What benefits have you realized?", "\n\nWhat business problems are you solving with the product? What benefits have you realized?\n");
                Console.WriteLine(review);
                Console.WriteLine("\n------------------------------- Review found. Adding to Database -------------------------------\n");
                // Prepare the text for storage: drop apostrophes, use HTML line breaks.
                review = review.Replace("'", "");
                review = review.Replace("\n", "<br />");
            }
        }
    }
}









  • How do you instinctively think you would deal with it? You probably have your answer... There is no magic bullet here: either try for the next page, or search the page for clues to see if you can

    – Michael Randall
    Nov 21 '18 at 3:48













  • My guess is either to follow the link to the next page, or to somehow code "when done with page=1, move to page=2"? I'm pretty new to C#, and it's very hard to put my thoughts into code. A nudge from SO has helped me learn a lot in the past! I'm a bit stumped, is all!

    – Duke Dodson
    Nov 21 '18 at 3:50











  • Whether or not you are making a crawler, the link should be followable if it is there. If you are just trying to get the set, then once again, just follow the link. There's not really much more I can add. Maybe someone else can chime in.

    – Michael Randall
    Nov 21 '18 at 3:55
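
The "when done with page=1, move to page=2" idea from the comments could be sketched as a loop that increments the page number and stops at the first page with no reviews. This is only a rough sketch: it assumes the site accepts ?page=N on the reviews URL (as the question suggests) and that a page past the end contains no reviewBody nodes. BuildPageUrl, the maxPages cap and the one-second pause are illustrative choices, not anything taken from the site.

```csharp
// @nuget: HtmlAgilityPack
using System;
using System.Threading;
using HtmlAgilityPack;

public static class PageCounterScraper
{
    // Hypothetical helper: appends the page number as a query parameter.
    public static string BuildPageUrl(string baseUrl, int page)
    {
        return baseUrl + "?page=" + page;
    }

    public static void Main()
    {
        const string baseUrl = "https://www.g2crowd.com/products/google-analytics/reviews";
        const int maxPages = 50; // assumed safety cap so the loop always terminates

        var web = new HtmlWeb();
        for (int page = 1; page <= maxPages; page++)
        {
            HtmlDocument html = web.Load(BuildPageUrl(baseUrl, page));
            var divNodes = html.DocumentNode.SelectNodes("//div[@itemprop='reviewBody']");

            // SelectNodes returns null when nothing matches;
            // this sketch treats that as "past the last page".
            if (divNodes == null)
            {
                break;
            }

            foreach (var tag in divNodes)
            {
                Console.WriteLine(tag.InnerText);
            }

            Thread.Sleep(1000); // pause between requests to go easy on the server
        }
    }
}
```

The weak point of this approach is the stop condition: if the site serves the last page again for out-of-range page numbers, only the maxPages cap ends the loop.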






















c# .net web-scraping pagination
















asked Nov 21 '18 at 3:46









Duke Dodson


























1 Answer














The link to the next page can be selected with this XPath expression:

//link[@rel='next']


Just keep following it until it's not there anymore.
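
With HtmlAgilityPack, following that link might look roughly like the sketch below. It assumes each page carries a link element with rel="next" and an href attribute, as this answer describes. FindNextPageUrl and the maxPages cap are hypothetical names introduced here for illustration.

```csharp
// @nuget: HtmlAgilityPack
using System;
using HtmlAgilityPack;

public static class NextLinkScraper
{
    // Hypothetical helper: returns the absolute URL of the next page,
    // or null when the page has no rel="next" link (i.e. the last page).
    public static Uri FindNextPageUrl(HtmlDocument html, Uri currentUrl)
    {
        HtmlNode link = html.DocumentNode.SelectSingleNode("//link[@rel='next']");
        if (link == null)
        {
            return null;
        }

        string href = link.GetAttributeValue("href", "");
        if (href.Length == 0)
        {
            return null;
        }

        // Resolve a possibly relative href against the current page URL.
        return new Uri(currentUrl, href);
    }

    public static void Main()
    {
        var web = new HtmlWeb();
        var url = new Uri("https://www.g2crowd.com/products/google-analytics/reviews");
        const int maxPages = 50; // assumed safety cap against endless link chains

        for (int page = 0; url != null && page < maxPages; page++)
        {
            HtmlDocument html = web.Load(url.AbsoluteUri);

            var divNodes = html.DocumentNode.SelectNodes("//div[@itemprop='reviewBody']");
            if (divNodes != null)
            {
                foreach (var tag in divNodes)
                {
                    Console.WriteLine(tag.InnerText);
                }
            }

            // Move on to the next page; a null result ends the loop.
            url = FindNextPageUrl(html, url);
        }
    }
}
```

The XPath query runs against the parsed HtmlDocument, not the raw HTTP response, and the loop ends as soon as a page no longer contains the link.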






























  • next_page = response.xpath('//link[@rel="next"]/@href').extract_first(); if (next_page yield response.follow(next_page)); This is what I have so far. Doesn't seem to work as of yet.

    – Duke Dodson
    Nov 21 '18 at 9:08











  • Well, I'm not sure what response is there. It should be the HTML parser object, not the raw response, if that makes sense.

    – pguardiario
    Nov 21 '18 at 9:22





















answered Nov 21 '18 at 4:47









pguardiario













