Cannot pull data from pantip.com



























I have been trying to pull data from pantip.com, including the title, the post story, and all comments, using Beautiful Soup.
However, I could pull only the title and the post story. I could not get the comments.
Here is my code for the title and post story:



import requests
import re
from bs4 import BeautifulSoup


# Specify the URL
url = 'https://pantip.com/topic/38372443'

# Extract the topic number from the URL
topic_number = re.split('https://pantip.com/topic/', url)[1]


page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Capture the title
elementTag_title = soup.find(id='topic-' + topic_number)
title = str(elementTag_title.find_all(class_='display-post-title')[0].string)

# Capture the post story
resultSet_post = elementTag_title.find_all(class_='display-post-story')[0]
post = resultSet_post.contents[1].text.strip()
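As an aside, the regex split above works but depends on the exact prefix string. A sketch of a slightly more robust way to extract the topic number, using the standard library's urlparse (the helper name here is mine, not part of the original code):

```python
from urllib.parse import urlparse

def topic_number_from_url(url):
    """Return the numeric topic id from a pantip.com topic URL."""
    # A topic URL has a path like /topic/38372443; take the last path segment.
    path = urlparse(url).path
    return path.rstrip('/').split('/')[-1]

print(topic_number_from_url('https://pantip.com/topic/38372443'))  # -> 38372443
```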


I tried to find the comments by id:



elementTag_comment = soup.find(id = "comments-jsrender")


based on the element I saw in the browser's developer tools (screenshot omitted).



I got the result below:



elementTag_comment =



<div id="comments-jsrender">
<div class="loadmore-bar loadmore-bar-paging"> <a href="javascript:void(0)">
<span class="icon-expand-left"><small>▼</small></span> <span class="focus-
txt"><span class="loading-txt">กำลังโหลดข้อมูล...</span></span> <span
class="icon-expand-right"><small>▼</small></span> </a> </div>
</div>


(The Thai placeholder text กำลังโหลดข้อมูล... means "Loading data...", which suggests the comments are filled in by JavaScript after the page loads.)
The question is: how can I get all the comments? Please suggest how to fix this.










  • At a glance, it seems there's lazy loading for the posts (comments are loaded asynchronously after the page loads). If you look at the network tab, you can see a 'render_comments' resource call being made. Please refer to this answer: stackoverflow.com/a/47851306/6283258

    – Paul Rdt
    Jan 2 at 10:18
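Building on Paul Rdt's comment: if the comments arrive via an XHR, one option is to call that endpoint directly with requests instead of parsing the rendered page. The sketch below only builds the request; the endpoint path, the tid/param parameter names, and the header are assumptions inferred from the 'render_comments' call and must be verified against the actual request in your own network tab:

```python
from urllib.parse import urlencode

def render_comments_url(topic_number, page=1):
    # Assumed endpoint path and parameter names; confirm them by inspecting
    # the real 'render_comments' request in the browser's network tab.
    base = 'https://pantip.com/forum/topic/render_comments'
    query = urlencode({'tid': topic_number, 'param': f'page{page}'})
    return f'{base}?{query}'

# AJAX endpoints often also require a marker header such as:
headers = {'x-requested-with': 'XMLHttpRequest'}

print(render_comments_url('38372443'))
```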













  • soup.find() will only return the first element it finds. If you want all tags with id = "comments-jsrender", you need to use soup.find_all(). Then, depending on what you want to do, you may need to iterate through each element.

    – chitown88
    Jan 2 at 20:31
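chitown88's point about find() versus find_all() can be seen on a minimal snippet (note, though, that in this question the real issue is that the div is empty until JavaScript fills it in, not that a second match is being missed):

```python
from bs4 import BeautifulSoup

html = """
<div class="display-post-story">first</div>
<div class="display-post-story">second</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() stops at the first match...
print(soup.find(class_='display-post-story').text)  # -> first

# ...while find_all() returns every match as a list
print([d.text for d in soup.find_all(class_='display-post-story')])  # -> ['first', 'second']
```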
















Tags: python, selenium, web-scraping, beautifulsoup, nlp






edited Jan 3 at 12:58 by SIM
asked Jan 2 at 10:10 by t.Wandee Wandee

1 Answer
































The reason you're having trouble locating the rest of these posts is that the site is populated by dynamic JavaScript. To get around this you can implement a solution with Selenium; see https://github.com/mozilla/geckodriver/releases for how to get the correct driver and add it to your system variables. Selenium will load the page and give you full access to all the attributes you see in your screenshot; with Beautiful Soup alone, that data is never parsed because it is never rendered.



Once you do that, you can use the following to print each comment's text:



from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://pantip.com/topic/38372443'
driver = webdriver.Firefox()
driver.get(url)
content = driver.page_source
soup = BeautifulSoup(content, 'lxml')

# Each comment lives in a div whose id starts with "comment-"
for div in soup.find_all("div", id=lambda value: value and value.startswith("comment-")):
    if len(str(div.text).strip()) > 1:
        print(str(div.text).strip())

driver.quit()
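If you want structured records rather than printed text, the same id filter can feed a list of dicts. This sketch runs against a static snippet that mimics the comment-<n> id pattern used above; the inner markup of real pantip.com comments may differ:

```python
from bs4 import BeautifulSoup

# Static snippet mimicking the "comment-<n>" id pattern; in practice you
# would pass driver.page_source here instead.
html = """
<div id="comment-1">First comment</div>
<div id="comment-2">Second comment</div>
<div id="comments-jsrender"></div>
"""
soup = BeautifulSoup(html, 'html.parser')

comments = []
for div in soup.find_all('div', id=lambda v: v and v.startswith('comment-')):
    text = div.get_text(strip=True)
    if text:  # skip empty containers
        comments.append({'id': div['id'], 'text': text})

print(comments)
```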





  • Thank you very much. I wonder whether I can do that without a web browser window popping up. I think it may help increase speed.

    – t.Wandee Wandee
    Jan 4 at 3:10











answered Jan 3 at 18:47 by Ian-Fogelman