BeautifulSoup - TypeError: sequence item 0: expected str instance
I made a web crawler using Python and everything runs fine until it gets to this section of the code:
# Use BeautifulSoup modules to format web page as text that can
# be parsed and indexed
#
soup = bs4.BeautifulSoup(response, "html.parser")
tok = "".join(soup.findAll("p", text=re.compile(".")))
# pass the text extracted from the web page to the parsetoken routine for indexing
parsetoken(db, tok)
documents += 1
The error I get is TypeError: sequence item 0: expected str instance, Tag found, raised around the tok line in the code.
I think my syntax could be the issue but I am not sure. How can I fix this?
python python-3.x beautifulsoup
what you are passing to ''.join is not an iterable of strings, which it must be. soup.findall returns a sequence of some type of custom objects I can only assume
– juanpa.arrivillaga
Jan 2 at 18:13
You probably need tok = "".join([x.text for x in soup.findAll("p", text=re.compile("."))])
– C.Nivs
Jan 2 at 18:17
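A minimal, self-contained sketch of the fix suggested in the comments above: pull the text out of each Tag before joining, since findAll returns Tag objects rather than strings. The inline HTML string here is just a hypothetical stand-in for the page the crawler would have fetched.
import re
import bs4

html = "<html><body><p>first paragraph</p><p>second paragraph</p><p></p></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

# findAll returns Tag objects; the text filter keeps only non-empty <p> tags
tags = soup.findAll("p", text=re.compile("."))

# join() needs strings, so take .text from each Tag first
# (a space separator is used here so the paragraphs stay readable)
tok = " ".join(tag.text for tag in tags)
print(tok)  # first paragraph second paragraph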
1 Answer
There are a few issues here:
- First, I'm not sure where you're getting response from, but that should be a string of actual HTML. Make sure you're not just capturing a "response" code from scraping a site that tells you whether it was successful.
- More importantly though, when you do the findAll, note that this returns a list of BeautifulSoup objects, not a list of strings. So the join command doesn't know what to do with these. It looks at the first object in the list, sees that it's not a string, and this is why it errors out with a complaint that it "expected str instance". The good news is you can use .text to extract the actual text from a given <p> element.
- Though even if you do use .text to extract the actual text from every <p> object, your join() may still fail if your list is a mix of unicode and str formats. So you may have to do some encoding tricks to get everything as the same type before you join.
Here's an example I did using this very page:
>>> import bs4, re
>>> import urllib2
>>> url = "https://stackoverflow.com/questions/3925614/how-do-you-read-a-file-into-a-list-in-python"
>>> html = urllib2.urlopen(url).read()
>>> soup = bs4.BeautifulSoup(html, "html.parser")
>>> L = soup.findAll("p", text=re.compile("."))
>>> M = [t.text.encode('utf-8') for t in L]
>>> print(" ".join(M))
This prints the combined text of everything found in a "P" tag.
EDIT: This example was on Python 2.7.x. For 3.x, drop the ".encode('utf-8')".
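One possible Python 3 adaptation of the snippet above (a sketch, not part of the original answer): urllib2 becomes urllib.request, and the .encode('utf-8') is dropped because .text already returns str in Python 3.
import re
import bs4
from urllib.request import urlopen

url = "https://stackoverflow.com/questions/3925614/how-do-you-read-a-file-into-a-list-in-python"
html = urlopen(url).read()
soup = bs4.BeautifulSoup(html, "html.parser")

# .text gives a plain str in Python 3, so the join works without any encoding step
paragraphs = soup.findAll("p", text=re.compile("."))
print(" ".join(p.text for p in paragraphs))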
This is Python 3, no need for text.encode('utf-8')
– juanpa.arrivillaga
Jan 2 at 18:46
I provided a solution that works on both.
– Bill M.
Jan 2 at 18:50
This will not work on Python 3: .encode returns bytes objects, and you are trying to join using a str object, i.e. " ".join, so this will throw a type error. You could do b" ".join(...), but then, why would you want a bytes object in Python 3? Look, if Python 2 and 3 could easily be written to handle the issue of unicode strings vs bytes strings, then there would have been no Python 2 and 3. But otherwise, this is correct.
– juanpa.arrivillaga
Jan 2 at 18:51
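A quick illustration of the point in the comment above: in Python 3, .encode() produces bytes, and a str separator cannot join bytes items, though a bytes separator can.
parts = ["foo".encode("utf-8"), "bar".encode("utf-8")]

try:
    print(" ".join(parts))   # str separator, bytes items
except TypeError as exc:
    print(exc)               # sequence item 0: expected str instance, bytes found

print(b" ".join(parts))      # b'foo bar' -- joins, but the result is bytes, not str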
OK, I've updated it. Now go back to "pulling out your hair", Juanpa.
– Bill M.
Jan 2 at 18:55