BeautifulSoup - TypeError: sequence item 0: expected str instance

I made a web crawler in Python, and everything runs fine until it gets to this section of the code:

    # Use BeautifulSoup to turn the web page into text that can
    # be parsed and indexed
    soup = bs4.BeautifulSoup(response, "html.parser")
    tok = "".join(soup.findAll("p", text=re.compile(".")))
    # pass the text extracted from the web page to the parsetoken routine for indexing
    parsetoken(db, tok)
    documents += 1

The error I get is TypeError: sequence item 0: expected str instance, Tag found, raised on the line that assigns tok.

I think my syntax could be the issue, but I am not sure. How can I fix this?

python python-3.x beautifulsoup

asked Jan 2 at 18:09 by xhenier
edited Jan 2 at 20:11 by phunctional

  • What you are passing to ''.join is not an iterable of strings, which it must be; soup.findAll returns a sequence of Tag objects, not strings.

    – juanpa.arrivillaga
    Jan 2 at 18:13

  • You probably need tok = "".join([x.text for x in soup.findAll("p", text=re.compile("."))])

    – C.Nivs
    Jan 2 at 18:17
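
Putting the comments together, a minimal sketch of the suggested fix (the HTML string here is just a stand-in for whatever response holds in the crawler):

    import re
    import bs4

    # Stand-in for the HTML the crawler has already downloaded.
    response = "<html><body><p>first paragraph</p><p>second paragraph</p></body></html>"

    soup = bs4.BeautifulSoup(response, "html.parser")

    # findAll returns Tag objects, not strings; str.join needs strings,
    # so pull .text out of each <p> before joining (a space keeps paragraphs apart).
    tok = " ".join(p.text for p in soup.findAll("p", text=re.compile(".")))

    print(tok)  # -> first paragraph second paragraph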

1 Answer

There are a few issues here:




  • First, I'm not sure where you're getting response from, but it should be a string of actual HTML. Make sure you're not just capturing a response status code from the site that tells you whether the request was successful.

  • More importantly, note that findAll returns a list of Tag objects, not a list of strings, so the join doesn't know what to do with them. It looks at the first item in the list, sees that it isn't a string, and errors out with the complaint that it "expected str instance". The good news is you can use .text to extract the actual text from a given <p> element.

  • Though even if you do use .text to extract the actual text from every <p> object, your join() may still fail if your list is a mix of unicode and str formats. So you may have to do some encoding tricks to get everything as the same type before you join.


Here's an example I did using another Stack Overflow page:



>>> import bs4, re
>>> import urllib2
>>> url = "https://stackoverflow.com/questions/3925614/how-do-you-read-a-file-into-a-list-in-python"
>>> html = urllib2.urlopen(url).read()
>>> soup = bs4.BeautifulSoup(html, "html.parser")
>>> L = soup.findAll("p", text=re.compile("."))
>>> M = [t.text.encode('utf-8') for t in L]
>>> print(" ".join(M))


This prints the combined text of everything found in a <p> tag.



EDIT: This example was on Python 2.7.x. For 3.x, drop the ".encode('utf-8')".
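
For reference, a rough Python 3 equivalent of the example above (an untested sketch: urllib2 becomes urllib.request, and the .encode('utf-8') step goes away because .text is already a str):

    import re
    import urllib.request

    import bs4

    url = "https://stackoverflow.com/questions/3925614/how-do-you-read-a-file-into-a-list-in-python"
    html = urllib.request.urlopen(url).read()

    soup = bs4.BeautifulSoup(html, "html.parser")
    # .text already gives str in Python 3, so the paragraphs can be joined directly.
    paragraphs = soup.findAll("p", text=re.compile("."))
    print(" ".join(p.text for p in paragraphs))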
answered Jan 2 at 18:38, edited Jan 2 at 18:54 – Bill M.
  • This is Python 3, no need for text.encode('utf-8')

    – juanpa.arrivillaga
    Jan 2 at 18:46

  • I provided a solution that works on both.

    – Bill M.
    Jan 2 at 18:50

  • This will not work on Python 3: .encode returns bytes objects, and you are trying to join them with a str object, i.e. " ".join, which will throw a type error. You could do b" ".join(...), but then why would you want a bytes object in Python 3? Look, if Python 2 and 3 code could easily be written to handle the issue of unicode strings vs bytes strings, there would have been no Python 2 and 3. But otherwise, this is correct.

    – juanpa.arrivillaga
    Jan 2 at 18:51

  • OK, I've updated it. Now go back to "pulling out your hair", Juanpa.

    – Bill M.
    Jan 2 at 18:55
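
To illustrate the str vs bytes point above, a small sketch of what Python 3 does with a mixed join:

    # In Python 3, str.join() refuses bytes items, which is exactly what
    # the .encode('utf-8') step produces.
    parts = [s.encode("utf-8") for s in ("alpha", "beta")]

    try:
        " ".join(parts)      # str separator, bytes items
    except TypeError as err:
        print(err)           # sequence item 0: expected str instance, bytes found

    print(b" ".join(parts))  # a bytes separator works, but the result is bytes, not str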