BeautifulSoup - TypeError: sequence item 0: expected str instance
I made a web crawler using Python and everything runs fine until it gets to this section of the code:
# Use BeautifulSoup modules to format web page as text that can
# be parsed and indexed
#
soup = bs4.BeautifulSoup(response, "html.parser")
tok = "".join(soup.findAll("p", text=re.compile(".")))
# pass the text extracted from the web page to the parsetoken routine for indexing
parsetoken(db, tok)
documents += 1
The error I get is TypeError: sequence item 0: expected str instance, Tag found, raised around the tok line in the code.
I think my syntax could be the issue but I am not sure. How can I fix this?
python python-3.x beautifulsoup
what you are passing to ''.join is not an iterable of strings, which it must be. soup.findall returns a sequence of some type of custom objects I can only assume
– juanpa.arrivillaga
Jan 2 at 18:13
You probably need tok = "".join([x.text for x in soup.findAll("p", text=re.compile("."))])
– C.Nivs
Jan 2 at 18:17
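A minimal, self-contained sketch of the fix suggested in the comments above: pull the text out of each Tag before joining, since findAll returns Tag objects rather than strings. The inline HTML string here is just a hypothetical stand-in for the page the crawler would have fetched.
import re
import bs4

html = "<html><body><p>first paragraph</p><p>second paragraph</p><p></p></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

# findAll returns Tag objects; the text filter keeps only non-empty <p> tags
tags = soup.findAll("p", text=re.compile("."))

# join() needs strings, so take .text from each Tag first
# (a space separator is used here so the paragraphs stay readable)
tok = " ".join(tag.text for tag in tags)
print(tok)  # first paragraph second paragraph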
1 Answer
There are a few issues here:
- First, I'm not sure where you're getting response from, but that should be a string of actual HTML. Make sure you're not just capturing a "response" code from scraping a site that tells you whether it was successful.
- More importantly though, when you do the findAll, note that this returns a list of BeautifulSoup objects, not a list of strings. So the join command doesn't know what to do with these. It looks at the first object in the list, sees that it's not a string, and this is why it errors out with a complaint that it "expected str instance". The good news is you can use .text to extract the actual text from a given <p> element.
- Though even if you do use .text to extract the actual text from every <p> object, your join() may still fail if your list is a mix of unicode and str formats. So you may have to do some encoding tricks to get everything as the same type before you join.
Here's an example I did using this very page:
>>> import bs4, re
>>> import urllib2
>>> url = "https://stackoverflow.com/questions/3925614/how-do-you-read-a-file-into-a-list-in-python"
>>> html = urllib2.urlopen(url).read()
>>> soup = bs4.BeautifulSoup(html, "html.parser")
>>> L = soup.findAll("p", text=re.compile("."))
>>> M = [t.text.encode('utf-8') for t in L]
>>> print(" ".join(M))
This prints the combined text of everything found in a "P" tag.
EDIT: This example was on Python 2.7.x. For 3.x, drop the ".encode('utf-8')".
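One possible Python 3 adaptation of the snippet above (a sketch, not part of the original answer): urllib2 becomes urllib.request, and the .encode('utf-8') is dropped because .text already returns str in Python 3.
import re
import bs4
from urllib.request import urlopen

url = "https://stackoverflow.com/questions/3925614/how-do-you-read-a-file-into-a-list-in-python"
html = urlopen(url).read()
soup = bs4.BeautifulSoup(html, "html.parser")

# .text gives a plain str in Python 3, so the join works without any encoding step
paragraphs = soup.findAll("p", text=re.compile("."))
print(" ".join(p.text for p in paragraphs))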
This is Python 3, no need for text.encode('utf-8')
– juanpa.arrivillaga
Jan 2 at 18:46
I provided a solution that works on both.
– Bill M.
Jan 2 at 18:50
This will not work on Python 3: .encode returns bytes objects, and you are trying to join using a str object, i.e. " ".join, so this will throw a type error. You could do b" ".join(...), but then, why would you want a bytes object in Python 3? Look, if Python 2 and 3 could easily be written to handle the issue of unicode strings vs bytes strings, then there would have been no Python 2 and 3. But otherwise, this is correct.
– juanpa.arrivillaga
Jan 2 at 18:51
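A quick illustration of the point in the comment above: in Python 3, .encode() produces bytes, and a str separator cannot join bytes items, though a bytes separator can.
parts = ["foo".encode("utf-8"), "bar".encode("utf-8")]

try:
    print(" ".join(parts))   # str separator, bytes items
except TypeError as exc:
    print(exc)               # sequence item 0: expected str instance, bytes found

print(b" ".join(parts))      # b'foo bar' -- joins, but the result is bytes, not str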
OK, I've updated it. Now go back to "pulling out your hair", Juanpa.
– Bill M.
Jan 2 at 18:55