Cannot pull data from pantip.com
I have been trying to pull data from pantip.com, including the title, post story, and all comments, using BeautifulSoup.
However, I could only pull the title and post story; I could not get the comments.
Here is the code for the title and post story:
import requests
import re
from bs4 import BeautifulSoup

# Specify the URL
url = 'https://pantip.com/topic/38372443'

# Extract the topic number from the URL
topic_number = re.split('https://pantip.com/topic/', url)[1]

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Capture the title
elementTag_title = soup.find(id='topic-' + topic_number)
title = str(elementTag_title.find_all(class_='display-post-title')[0].string)

# Capture the post story
resultSet_post = elementTag_title.find_all(class_='display-post-story')[0]
post = resultSet_post.contents[1].text.strip()
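As a side note, the topic number can also be taken from the URL path with the standard library, which may be a bit more robust than splitting on the full prefix (a sketch; the helper name is my own):

```python
from urllib.parse import urlparse

def topic_number_from_url(url):
    # The topic id is the last segment of the path, e.g. /topic/38372443
    return urlparse(url).path.rstrip('/').split('/')[-1]

print(topic_number_from_url('https://pantip.com/topic/38372443'))  # 38372443
```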
I tried to find the comments by id:

elementTag_comment = soup.find(id="comments-jsrender")

but I only got the result below:
elementTag_comment =
<div id="comments-jsrender">
  <div class="loadmore-bar loadmore-bar-paging">
    <a href="javascript:void(0)">
      <span class="icon-expand-left"><small>▼</small></span>
      <span class="focus-txt"><span class="loading-txt">กำลังโหลดข้อมูล...</span></span>
      <span class="icon-expand-right"><small>▼</small></span>
    </a>
  </div>
</div>
(The Thai text "กำลังโหลดข้อมูล..." means "Loading data...", i.e. only the loading placeholder is present, not the comments.)
The question is: how can I get all the comments? Please suggest how to fix this.
python selenium web-scraping beautifulsoup nlp
At a glance, it seems there's lazy loading for the posts (comments are loaded asynchronously after the page loads). If you look at the network tab, you can see a 'render_comments' resource call being made. Please refer to this answer: stackoverflow.com/a/47851306/6283258
– Paul Rdt
Jan 2 at 10:18
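Building on that comment, one could try to call the same XHR endpoint directly with requests. Note that the endpoint path, query parameter, and header below are assumptions inferred from the comment, not a documented API; inspect the actual request in the browser's network tab before relying on them:

```python
from urllib.parse import urlparse

def render_comments_url(topic_url):
    # Hypothetical endpoint derived from the 'render_comments' call
    # visible in the network tab; verify the real path and parameters
    # there before using this.
    tid = urlparse(topic_url).path.rstrip('/').split('/')[-1]
    return 'https://pantip.com/forum/topic/render_comments?tid=' + tid

url = render_comments_url('https://pantip.com/topic/38372443')
print(url)
# The XHR likely needs AJAX-style headers (assumption), e.g.:
# requests.get(url, headers={'X-Requested-With': 'XMLHttpRequest'}).json()
```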
soup.find() will only return the first element it finds. If you want all tags with id="comments-jsrender", you need to use soup.find_all(). Then, depending on what you want to do, you may need to iterate through each element.
– chitown88
Jan 2 at 20:31
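To illustrate the difference between find() and find_all() on a small made-up snippet (the HTML here is illustrative, not from pantip.com):

```python
from bs4 import BeautifulSoup

html = '<div class="c">one</div><div class="c">two</div>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find(class_='c')      # only the first match
every = soup.find_all(class_='c')  # a list of all matches

print(first.text)                  # one
print([d.text for d in every])     # ['one', 'two']
```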
edited Jan 3 at 12:58 by SIM
asked Jan 2 at 10:10 by t.Wandee Wandee
1 Answer
The reason you're having trouble locating the rest of these posts is that the site is populated with dynamic JavaScript. To get around this you can implement a solution with Selenium; see here how to get the correct driver and add it to your system variables: https://github.com/mozilla/geckodriver/releases . Selenium will load the page and you will have full access to all the attributes you can see in the browser; with just Beautiful Soup that data is not being parsed.
Once you do that you can use the following to return each of the posts data:
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://pantip.com/topic/38372443'

driver = webdriver.Firefox()
driver.get(url)
content = driver.page_source
soup = BeautifulSoup(content, 'lxml')

# Comment divs have ids of the form "comment-<n>"
for div in soup.find_all("div", id=lambda value: value and value.startswith("comment-")):
    if len(str(div.text).strip()) > 1:
        print(str(div.text).strip())

driver.quit()
Thank you very much. I wonder whether I can do that without a web browser window popping up; I think it may help increase speed.
– t.Wandee Wandee
Jan 4 at 3:10
answered Jan 3 at 18:47 by Ian-Fogelman