Improve speed/performance of web-scraping with lots of exceptions
I've written some web-scraping code that currently works, but it is quite slow. Some background: I am using Selenium, because the site requires several stages of clicks and text entry, along with BeautifulSoup. My code looks at a list of materials within subcategories on a website (image below) and scrapes them. If a material scraped from the website is one of the 30 I am interested in (lst below), it writes the number 1 to a dataframe which I later convert to an Excel sheet.
I believe it is so slow because a lot of exceptions are raised, but I am not sure how to handle these other than with try/except. The main bits of code are below, as the entire piece of code is quite lengthy. I have also attached an image of the website in question for reference.
lst = ["Household cleaner and detergent bottles", "Plastic milk bottles", "Toiletries and shampoo bottles", "Plastic drinks bottles",
"Drinks cans", "Food tins", "Metal lids from glass jars", "Aerosols",
"Food pots and tubs", "Margarine tubs", "Plastic trays","Yoghurt pots", "Carrier bags",
"Aluminium foil", "Foil trays",
"Cardboard sleeves", "Cardboard egg boxes", "Cardboard fruit and veg punnets", "Cereal boxes", "Corrugated cardboard", "Toilet roll tubes", "Food and drink cartons",
"Newspapers", "Window envelopes", "Magazines", "Junk mail", "Brown envelopes", "Shredded paper", "Yellow Pages" , "Telephone directories",
"Glass bottles and jars"]
def site_scraper(site):
    page_loc = '//*[@id="wrap-rlw"]/div/div[2]/div/div/div/div[2]/div/ol/li[{}]/div'.format(site)
    page = driver.find_element_by_xpath(page_loc)
    page.click()
    driver.execute_script("arguments[0].scrollIntoView(true);", page)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for i in x:
        for j in y:
            try:
                material = soup.find_all("div", class_="rlw-accordion-content")[i].find_all('li')[j].get_text(strip=True).encode('utf-8')
                if material in lst:
                    df.at[code_no, material] = 1
                else:
                    continue
                continue
            except IndexError:
                continue

x = xrange(0, 8)
y = xrange(0, 9)
p = xrange(1, 31)

for site in p:
    site_scraper(site)
Specifically, the i's and j's rarely reach 6, 7 or 8, but when they do it is important that I capture that information too. For context, the i's correspond to the different categories in the image below (Automotive, Building materials, etc.) whilst the j's represent the sub-lists (car batteries, engine oil, etc.). Because these two loops are repeated for all 30 sites for each code, and I have 1500 codes, this is extremely slow. Currently it is taking 6.5 minutes for 10 codes (at that rate, all 1500 codes would take roughly 16 hours).
Is there a way I could improve this process? I tried a list comprehension, but it was difficult to handle the errors that way and my results were no longer accurate. Could an "if" check be a better choice, and if so, how would I incorporate it? I would also be happy to attach the full code if required. Thank you!
EDIT:
by changing

except IndexError:
    continue

to

except IndexError:
    break
it is now running almost twice as fast! Obviously it is best to exit the loop after the first failure, as the later iterations will also fail. However, any other pythonic tips are still welcome :)
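For later readers, one way to sidestep the IndexError entirely is to loop over whatever elements BeautifulSoup actually finds rather than over fixed index ranges, and to use a set for the membership test. A rough sketch, assuming the same driver, df, code_no and lst globals as in the snippet above (the encode('utf-8') call is dropped here, so it assumes the page text and the entries in lst compare as plain strings):

materials_of_interest = set(lst)  # set membership checks are O(1)

def site_scraper(site):
    page_loc = '//*[@id="wrap-rlw"]/div/div[2]/div/div/div/div[2]/div/ol/li[{}]/div'.format(site)
    page = driver.find_element_by_xpath(page_loc)
    page.click()
    driver.execute_script("arguments[0].scrollIntoView(true);", page)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    # Iterate over the categories and list items that are actually present,
    # so nothing can be indexed out of range and no try/except is needed.
    for category in soup.find_all("div", class_="rlw-accordion-content"):
        for li in category.find_all("li"):
            material = li.get_text(strip=True)
            if material in materials_of_interest:
                df.at[code_no, material] = 1

for site in xrange(1, 31):
    site_scraper(site)

This removes the exception handling rather than speeding things up per se; the bulk of the time will still be the Selenium clicks and page loads.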
python exception web-scraping error-handling try-catch
asked Nov 20 '18 at 16:02 by Ozdanny, edited Nov 20 '18 at 16:57
1 Answer
It sounds like you just need the text of those lis:
lis = driver.execute_script("[...document.querySelectorAll('.rlw-accordion-content li')].map(li => li.innerText.trim())")
Now you can use those for your logic:
for material in lis:
    if material in lst:
        df.at[code_no, material] = 1
Thank you for this. However, when I run this and try to print "lis" it returns None, so I cannot iterate over it in the loop. Could you explain how you've generated the "lis" command so I can try to fix it?
– Ozdanny
Nov 21 '18 at 9:53
Maybe you need to add a time.sleep? Essentially you want to load that page in Chrome and fiddle with the JS in the console until it returns what you want. Without seeing the URL, that's the best help I can provide.
– pguardiario
Nov 21 '18 at 10:12
I'll add a time.sleep and try that now too. As for the URL: recyclenow.com/local-recycling - that's it, but you have to click the RHS button ("Find your nearest recycling station"), then enter a postcode (try "NP11") and click search. Then click on the first site and you'll see the format I've mentioned. As you can see, there are a lot of steps for Selenium to take, so this slows it down too!
– Ozdanny
Nov 21 '18 at 11:37
Ok. Please take my advice as something general. You will need to put in the work to adapt it to your specifics.
– pguardiario
Nov 21 '18 at 11:44
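For later readers: Selenium's execute_script only hands a value back to Python when the JavaScript explicitly returns one, which is the most likely reason lis came back as None here. A sketch of the adjusted call, using an explicit wait instead of a fixed time.sleep (the 20-second timeout is an arbitrary choice, and driver, df, code_no and lst are assumed to exist as in the question):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the accordion list items are present rather than sleeping a fixed time.
WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".rlw-accordion-content li"))
)

# Note the leading "return": without it, execute_script returns None to Python.
lis = driver.execute_script(
    "return [...document.querySelectorAll('.rlw-accordion-content li')]"
    ".map(li => li.innerText.trim());"
)

for material in lis:
    if material in lst:
        df.at[code_no, material] = 1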