Improve speed/performance of web-scraping with lots of exceptions

I've written some web-scraping code that works, but it is quite slow. Some background: I am using Selenium, since the site requires several stages of clicks and entry, along with BeautifulSoup. My code looks at a list of materials within subcategories on a website (image below) and scrapes them. If a material scraped from the website is one of the ~30 materials I am interested in (lst below), it writes the number 1 to a dataframe, which I later convert to an Excel sheet.
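
For context, the dataframe is set up roughly like this (a minimal sketch; the code values and output file name are placeholders rather than my actual ones, and lst is the materials list shown further down):

import pandas as pd

# one row per site code, one column per material of interest, every cell initialised to 0
codes = ["code_001", "code_002"]               # placeholder site codes
df = pd.DataFrame(0, index=codes, columns=lst)

# ...the scraper then sets df.at[code_no, material] = 1 for each match...

df.to_excel("materials.xlsx")                  # placeholder output file name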



The reason it is so slow, I believe, is that a lot of exceptions are being raised; however, I am not sure how to handle them other than with try/except. The main bits of the code can be seen below, as the entire script is quite lengthy. I have also attached an image of the website in question for reference.



lst = ["Household cleaner and detergent bottles", "Plastic milk bottles", "Toiletries and shampoo bottles", "Plastic drinks bottles", 
"Drinks cans", "Food tins", "Metal lids from glass jars", "Aerosols",
"Food pots and tubs", "Margarine tubs", "Plastic trays","Yoghurt pots", "Carrier bags",
"Aluminium foil", "Foil trays",
"Cardboard sleeves", "Cardboard egg boxes", "Cardboard fruit and veg punnets", "Cereal boxes", "Corrugated cardboard", "Toilet roll tubes", "Food and drink cartons",
"Newspapers", "Window envelopes", "Magazines", "Junk mail", "Brown envelopes", "Shredded paper", "Yellow Pages" , "Telephone directories",
"Glass bottles and jars"]

def site_scraper(site):
    # click into the nth site on the results page and scroll it into view
    page_loc = ('//*[@id="wrap-rlw"]/div/div[2]/div/div/div/div[2]/div/ol/li[{}]/div').format(site)
    page = driver.find_element_by_xpath(page_loc)
    page.click()
    driver.execute_script("arguments[0].scrollIntoView(true);", page)

    soup = BeautifulSoup(driver.page_source, 'lxml')
    for i in x:          # i indexes the category accordions
        for j in y:      # j indexes the materials within a category
            try:
                material = soup.find_all("div", class_="rlw-accordion-content")[i].find_all('li')[j].get_text(strip=True).encode('utf-8')
                if material in lst:
                    df.at[code_no, material] = 1
                else:
                    continue
                continue
            except IndexError:
                continue

x = xrange(0, 8)
y = xrange(0, 9)

p = xrange(1, 31)

for site in p:
    site_scraper(site)


Specifically, the i's and j's rarely go up to 6, 7 or 8, but when they do, it is important that I capture that information too. For context, the i's correspond to the different categories in the image below (Automotive, Building materials, etc.), while the j's represent the items in each sub-list (car batteries, engine oil, etc.). Because these two loops are repeated across all 30 sites for each code, and I have 1,500 codes, this is extremely slow: it is currently taking 6.5 minutes for 10 codes.
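At that rate, the full run of 1,500 codes would take roughly (1500 / 10) × 6.5 = 975 minutes, i.e. about 16 hours.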



Is there a way I could improve this process? I tried a list comprehension, but it was difficult to handle errors that way and my results were no longer accurate. Would an "if" check be a better choice here, and if so, how would I incorporate it? I am also happy to attach the full code if required. Thank you!



[Image: screenshot of the recycling website showing the material categories (Automotive, Building materials, etc.) and their sub-lists of materials]



EDIT:
By changing

    except IndexError:
        continue

to

    except IndexError:
        break

it is now running almost twice as fast! Obviously it is better to exit the loop after the first failure, since the later iterations will also fail. Any other pythonic tips are still welcome :)
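
One further restructuring I am considering (a rough sketch, assuming the page source is already loaded in driver; I have dropped the .encode('utf-8') step for brevity) iterates directly over whichever elements are actually present rather than guessing index ranges, so the IndexError handling disappears entirely, and it checks membership against a set instead of a list:

wanted = set(lst)  # set membership tests are O(1) versus O(n) for a list

soup = BeautifulSoup(driver.page_source, 'lxml')
for category in soup.find_all("div", class_="rlw-accordion-content"):
    for item in category.find_all("li"):
        material = item.get_text(strip=True)
        if material in wanted:
            df.at[code_no, material] = 1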










python exception web-scraping error-handling try-catch

asked Nov 20 '18 at 16:02 by Ozdanny, edited Nov 20 '18 at 16:57

1 Answer




















It sounds like you just need the text of those li elements:



lis = driver.execute_script("return [...document.querySelectorAll('.rlw-accordion-content li')].map(li => li.innerText.trim())")


          Now you can use those for your logic:



for material in lis:
    if material in lst:
        df.at[code_no, material] = 1
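
If the list comes back empty or as None, the accordion content has probably not rendered yet when the script runs. One option is an explicit wait before pulling the text (a rough sketch, assuming the items live under .rlw-accordion-content and that you are on a Selenium 3-era driver):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for at least one material <li> to be present
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".rlw-accordion-content li"))
)

lis = driver.execute_script(
    "return [...document.querySelectorAll('.rlw-accordion-content li')]"
    ".map(li => li.innerText.trim())"
)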





• Thank you for this. However, when I run this and try to print "lis" it returns None, so I cannot iterate over it in the loop. Could you explain how you've generated the "lis" command so I can try to fix it?

  – Ozdanny, Nov 21 '18 at 9:53











• Maybe you need to add a time.sleep? Essentially you want to load that page in Chrome and fiddle with the JS in the console until it returns what you want. Without seeing the URL, that's the best help I can provide.

  – pguardiario, Nov 21 '18 at 10:12













• I'll add a time.sleep and try that now too. As for the URL: recyclenow.com/local-recycling. You have to click the RHS button ("Find your nearest recycling station"), then enter a postcode (try "NP11") and click search. Then click on the first site and you'll see the format I've mentioned. As you can see, there are a lot of steps for Selenium to take, which slows it down too!

  – Ozdanny, Nov 21 '18 at 11:37











• OK. Please take my advice as something general. You will need to put in the work to adapt it to your specifics.

  – pguardiario, Nov 21 '18 at 11:44










