Improve speed/performance of web-scraping with lots of exceptions
I've written some web-scraping code that currently works, but it is quite slow. Some background: I am using Selenium, because the site requires several stages of clicks and text entry, along with BeautifulSoup. My code looks at a list of materials within subcategories on a website (image below) and scrapes them. If a material scraped from the website is one of the 30 I am interested in (lst below), it writes the number 1 to a dataframe which I later convert to an Excel sheet.
I believe it is so slow because a lot of exceptions are raised, but I am not sure how to handle these other than with try/except. The main bits of code are below, as the entire piece of code is quite lengthy. I have also attached an image of the website in question for reference.
lst = ["Household cleaner and detergent bottles", "Plastic milk bottles", "Toiletries and shampoo bottles", "Plastic drinks bottles",
"Drinks cans", "Food tins", "Metal lids from glass jars", "Aerosols",
"Food pots and tubs", "Margarine tubs", "Plastic trays","Yoghurt pots", "Carrier bags",
"Aluminium foil", "Foil trays",
"Cardboard sleeves", "Cardboard egg boxes", "Cardboard fruit and veg punnets", "Cereal boxes", "Corrugated cardboard", "Toilet roll tubes", "Food and drink cartons",
"Newspapers", "Window envelopes", "Magazines", "Junk mail", "Brown envelopes", "Shredded paper", "Yellow Pages" , "Telephone directories",
"Glass bottles and jars"]
def site_scraper(site):
    page_loc = '//*[@id="wrap-rlw"]/div/div[2]/div/div/div/div[2]/div/ol/li[{}]/div'.format(site)
    page = driver.find_element_by_xpath(page_loc)
    page.click()
    driver.execute_script("arguments[0].scrollIntoView(true);", page)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for i in x:
        for j in y:
            try:
                material = soup.find_all("div", class_="rlw-accordion-content")[i].find_all('li')[j].get_text(strip=True).encode('utf-8')
                if material in lst:
                    df.at[code_no, material] = 1
                else:
                    continue
                continue
            except IndexError:
                continue

x = xrange(0, 8)
y = xrange(0, 9)
p = xrange(1, 31)

for site in p:
    site_scraper(site)
Specifically, the i's and j's rarely reach 6, 7 or 8, but when they do it is important that I capture that information too. For context, the i's correspond to the different categories in the image below (Automotive, Building materials, etc.) whilst the j's represent the sub-lists (car batteries, engine oil, etc.). Because these two loops are repeated for all 30 sites for each code, and I have 1500 codes, this is extremely slow. Currently it is taking 6.5 minutes for 10 codes (at that rate, all 1500 codes would take roughly 16 hours).
Is there a way I could improve this process? I tried a list comprehension, but it was difficult to handle the errors that way and my results were no longer accurate. Could an "if" check be a better choice, and if so, how would I incorporate it? I would also be happy to attach the full code if required. Thank you!
EDIT:
by changing

except IndexError:
    continue

to

except IndexError:
    break
it is now running almost twice as fast! Obviously it is best to exit the loop after the first failure, as the later iterations will also fail. However, any other pythonic tips are still welcome :)
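For later readers, one way to sidestep the IndexError entirely is to loop over whatever elements BeautifulSoup actually finds rather than over fixed index ranges, and to use a set for the membership test. A rough sketch, assuming the same driver, df, code_no and lst globals as in the snippet above (the encode('utf-8') call is dropped here, so it assumes the page text and the entries in lst compare as plain strings):

materials_of_interest = set(lst)  # set membership checks are O(1)

def site_scraper(site):
    page_loc = '//*[@id="wrap-rlw"]/div/div[2]/div/div/div/div[2]/div/ol/li[{}]/div'.format(site)
    page = driver.find_element_by_xpath(page_loc)
    page.click()
    driver.execute_script("arguments[0].scrollIntoView(true);", page)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    # Iterate over the categories and list items that are actually present,
    # so nothing can be indexed out of range and no try/except is needed.
    for category in soup.find_all("div", class_="rlw-accordion-content"):
        for li in category.find_all("li"):
            material = li.get_text(strip=True)
            if material in materials_of_interest:
                df.at[code_no, material] = 1

for site in xrange(1, 31):
    site_scraper(site)

This removes the exception handling rather than speeding things up per se; the bulk of the time will still be the Selenium clicks and page loads.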
python exception web-scraping error-handling try-catch
asked Nov 20 '18 at 16:02 by Ozdanny, edited Nov 20 '18 at 16:57
1 Answer
It sounds like you just need the text of those lis:
lis = driver.execute_script("[...document.querySelectorAll('.rlw-accordion-content li')].map(li => li.innerText.trim())")
Now you can use those for your logic:
for material in lis:
    if material in lst:
        df.at[code_no, material] = 1
Thank you for this. However, when I run this and try to print "lis" it returns None, so I cannot iterate over it in the loop. Could you explain how you've generated the "lis" command so I can try to fix it?
– Ozdanny
Nov 21 '18 at 9:53
Maybe you need to add a time.sleep? Essentially you want to load that page in Chrome and fiddle with the JS in the console until it returns what you want. Without seeing the URL, that's the best help I can provide.
– pguardiario
Nov 21 '18 at 10:12
I'll add a time.sleep and try that now too. As for the URL: recyclenow.com/local-recycling - that's it, but you have to click the RHS button ("Find your nearest recycling station"), then enter a postcode (try "NP11") and click search. Then click on the first site and you'll see the format I've mentioned. As you can see, there are a lot of steps for Selenium to take, so this slows it down too!
– Ozdanny
Nov 21 '18 at 11:37
Ok. Please take my advice as something general. You will need to put in the work to adapt it to your specifics.
– pguardiario
Nov 21 '18 at 11:44
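For later readers: Selenium's execute_script only hands a value back to Python when the JavaScript explicitly returns one, which is the most likely reason lis came back as None here. A sketch of the adjusted call, using an explicit wait instead of a fixed time.sleep (the 20-second timeout is an arbitrary choice, and driver, df, code_no and lst are assumed to exist as in the question):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the accordion list items are present rather than sleeping a fixed time.
WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".rlw-accordion-content li"))
)

# Note the leading "return": without it, execute_script returns None to Python.
lis = driver.execute_script(
    "return [...document.querySelectorAll('.rlw-accordion-content li')]"
    ".map(li => li.innerText.trim());"
)

for material in lis:
    if material in lst:
        df.at[code_no, material] = 1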