Cannot pull data from pantip.com
I have been trying to pull data from pantip.com, including the title, post story, and all comments, using BeautifulSoup.
However, I could only pull the title and post story; I could not get the comments.
Here is the code for the title and post story:
import requests
import re
from bs4 import BeautifulSoup

# Specify the URL
url = 'https://pantip.com/topic/38372443'

# Extract the topic number from the URL
topic_number = re.split('https://pantip.com/topic/', url)[1]

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Capture the title
elementTag_title = soup.find(id='topic-' + topic_number)
title = str(elementTag_title.find_all(class_='display-post-title')[0].string)

# Capture the post story
resultSet_post = elementTag_title.find_all(class_='display-post-story')[0]
post = resultSet_post.contents[1].text.strip()
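As a side note, the topic number can also be taken from the URL path with the standard library, which may be a bit more robust than splitting on the full prefix (a sketch; the helper name is my own):

```python
from urllib.parse import urlparse

def topic_number_from_url(url):
    # The topic id is the last segment of the path, e.g. /topic/38372443
    return urlparse(url).path.rstrip('/').split('/')[-1]

print(topic_number_from_url('https://pantip.com/topic/38372443'))  # 38372443
```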
I tried to find the comments by id:

elementTag_comment = soup.find(id="comments-jsrender")

but I only got the result below:
elementTag_comment =
<div id="comments-jsrender">
  <div class="loadmore-bar loadmore-bar-paging">
    <a href="javascript:void(0)">
      <span class="icon-expand-left"><small>▼</small></span>
      <span class="focus-txt"><span class="loading-txt">กำลังโหลดข้อมูล...</span></span>
      <span class="icon-expand-right"><small>▼</small></span>
    </a>
  </div>
</div>
(The Thai text "กำลังโหลดข้อมูล..." means "Loading data...", i.e. only the loading placeholder is present, not the comments.)
The question is: how can I get all the comments? Please suggest how to fix this.
python selenium web-scraping beautifulsoup nlp
At a glance, it seems there's lazy loading for the posts (comments are loaded asynchronously after the page loads). If you look at the network tab, you can see a 'render_comments' resource call being made. Please refer to this answer: stackoverflow.com/a/47851306/6283258
– Paul Rdt
Jan 2 at 10:18
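Building on that comment, one could try to call the same XHR endpoint directly with requests. Note that the endpoint path, query parameter, and header below are assumptions inferred from the comment, not a documented API; inspect the actual request in the browser's network tab before relying on them:

```python
from urllib.parse import urlparse

def render_comments_url(topic_url):
    # Hypothetical endpoint derived from the 'render_comments' call
    # visible in the network tab; verify the real path and parameters
    # there before using this.
    tid = urlparse(topic_url).path.rstrip('/').split('/')[-1]
    return 'https://pantip.com/forum/topic/render_comments?tid=' + tid

url = render_comments_url('https://pantip.com/topic/38372443')
print(url)
# The XHR likely needs AJAX-style headers (assumption), e.g.:
# requests.get(url, headers={'X-Requested-With': 'XMLHttpRequest'}).json()
```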
soup.find() will only return the first element it finds. If you want all tags with id="comments-jsrender", you need to use soup.find_all(). Then, depending on what you want to do, you may need to iterate through each element.
– chitown88
Jan 2 at 20:31
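To illustrate the difference between find() and find_all() on a small made-up snippet (the HTML here is illustrative, not from pantip.com):

```python
from bs4 import BeautifulSoup

html = '<div class="c">one</div><div class="c">two</div>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find(class_='c')      # only the first match
every = soup.find_all(class_='c')  # a list of all matches

print(first.text)                  # one
print([d.text for d in every])     # ['one', 'two']
```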
edited Jan 3 at 12:58 by SIM
asked Jan 2 at 10:10 by t.Wandee Wandee
1 Answer
The reason you're having trouble locating the rest of these posts is that the site is populated with dynamic JavaScript. To get around this you can implement a solution with Selenium; see here how to get the correct driver and add it to your system variables: https://github.com/mozilla/geckodriver/releases . Selenium will load the page and you will have full access to all the attributes you can see in the browser; with just Beautiful Soup that data is not being parsed.
Once you do that you can use the following to return each of the posts data:
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://pantip.com/topic/38372443'

driver = webdriver.Firefox()
driver.get(url)
content = driver.page_source
soup = BeautifulSoup(content, 'lxml')

# Comment divs have ids of the form "comment-<n>"
for div in soup.find_all("div", id=lambda value: value and value.startswith("comment-")):
    if len(str(div.text).strip()) > 1:
        print(str(div.text).strip())

driver.quit()
Thank you very much. I wonder whether I can do that without a web browser window popping up; I think it may help increase speed.
– t.Wandee Wandee
Jan 4 at 3:10
answered Jan 3 at 18:47 by Ian-Fogelman