Complete HTML doesn't render when scraping using bs4 python
I am trying to scrape data from GeeksforGeeks for my own simple scraping and analysis project.
I am using bs4 and requests with Python 2.
I need to scrape all the questions on this URL, so I do:
import requests
from bs4 import BeautifulSoup

ques_page = requests.get('https://practice.geeksforgeeks.org/explore/?page=1')
ques_soup = BeautifulSoup(ques_page.text, 'lxml')
get_ques = ques_soup.find('div', class_="panel problem-block")
The class panel problem-block contains the question data.
But when I view the scraped HTML with print(ques_page.text), it doesn't contain that div at all!
In the page source (Ctrl-F for problemFeed; this is the div where all the questions should be) I see:
<div id="problemFeed" class="row" data-masonry-options='{"itemSelector": ".item" }'></div>
This div is EMPTY, so I am not able to scrape any data out of it. How is this possible, since I can see everything inside this div in the browser console, but not in the page source or during scraping?
python html web-scraping beautifulsoup
asked Jan 2 at 11:16
Gagan Ganapathy
It is possible that this part is rendered by JavaScript after the page has loaded, so it's not part of the original HTML.
– Ron Serruya
Jan 2 at 11:20
If you open this page in a browser like Chrome and select "View page source", you will see that the class "panel problem-block" doesn't exist there either.
– Chris Doyle
Jan 2 at 11:31
Yes, @ChrisDoyle, that's because this class is inside the problemFeed div itself.
– Gagan Ganapathy
Jan 2 at 11:32
@RonSerruya So such things are not scrapable at all?
– Gagan Ganapathy
Jan 2 at 11:32
You can scrape the rendered HTML using Selenium.
– Sreyas
Jan 2 at 11:35
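The Selenium suggestion in the last comment could look roughly like the sketch below. It is only an illustration, not code from the thread: the helper names css_for and fetch_rendered_blocks are made up, and it assumes Chrome with a matching driver is installed and that the problem blocks appear in the DOM within the timeout.

```python
def css_for(tag, class_attr):
    """Turn a tag and a space-separated class attribute into a CSS selector."""
    return tag + ''.join('.' + name for name in class_attr.split())


def fetch_rendered_blocks(url, timeout=15):
    """Load url in a real browser, wait for the JavaScript-filled blocks, parse them."""
    # Third-party imports are kept local so css_for has no dependencies.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    selector = css_for('div', 'panel problem-block')
    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        driver.get(url)
        # Block until at least one problem block has been added to the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        # page_source now reflects the DOM after JavaScript has run.
        soup = BeautifulSoup(driver.page_source, 'lxml')
    finally:
        driver.quit()
    return soup.select(selector)
```

Calling fetch_rendered_blocks('https://practice.geeksforgeeks.org/explore/?page=1') would then return the rendered div elements, at the cost of starting a full browser for every page.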
1 Answer
You can get it from the Ajax endpoint with a POST request:
import requests
from bs4 import BeautifulSoup

# Form data for page 1; for later pages use 2 and 'page2', and so on.
data = {'page': 1, 'query': 'page1'}
ques_page = requests.post('https://practice.geeksforgeeks.org/ajax/practicePageAjax.php', data=data)
ques_soup = BeautifulSoup(ques_page.text, 'lxml')
get_ques = ques_soup.find('div', class_="panel problem-block")
print(get_ques)
answered Jan 2 at 11:51
ewwink
How did you find the Ajax endpoint?
– Gagan Ganapathy
Jan 2 at 13:13
You can see it in the browser console.
– ewwink
Jan 3 at 2:12
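Since the question asks for all the questions rather than the first one, the answer's request could be extended along these lines. This is a hedged sketch, not the answerer's code: the helper names build_payload and scrape_pages are hypothetical, and the payload for later pages is an assumption taken from the answer's "# 2, page2..." comment. Whether the endpoint still accepts this form data is not confirmed here.

```python
def build_payload(page):
    """Form data for one page, following the answer's '# 2, page2...' pattern."""
    return {'page': page, 'query': 'page%d' % page}


def scrape_pages(first, last):
    """Collect every problem block from pages first..last (inclusive)."""
    # Third-party imports are kept local so build_payload has no dependencies.
    import requests
    from bs4 import BeautifulSoup

    url = 'https://practice.geeksforgeeks.org/ajax/practicePageAjax.php'
    blocks = []
    for page in range(first, last + 1):
        response = requests.post(url, data=build_payload(page), timeout=10)
        response.raise_for_status()  # stop early on an HTTP error
        soup = BeautifulSoup(response.text, 'lxml')
        # find_all returns every match; find would return only the first one.
        blocks.extend(soup.find_all('div', class_='panel problem-block'))
    return blocks
```

For example, scrape_pages(1, 3) would request the first three pages and return their problem blocks as one list. Using find_all instead of find is the main change from the answer's snippet.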