Script throws an error when run with multiprocessing

I've written a script in Python, in combination with BeautifulSoup, to extract the titles of books that get populated when ISBN numbers are entered in Amazon's search box. I'm providing those ISBN numbers from an Excel file named amazon.xlsx. When I run the following script, it parses the titles accordingly and writes them back to the Excel file as intended.



The search page where the ISBN numbers populate the results is the Amazon URL used in the code below.



    import requests
    from bs4 import BeautifulSoup
    from openpyxl import load_workbook

    wb = load_workbook('amazon.xlsx')
    ws = wb['content']

    def get_info(num):
        params = {
            'url': 'search-alias=aps',
            'field-keywords': num
        }
        res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?", params=params)
        soup = BeautifulSoup(res.text, "lxml")
        itemlink = soup.select_one("a.s-access-detail-page")
        if itemlink:
            get_data(itemlink['href'])

    def get_data(link):
        res = requests.get(link)
        soup = BeautifulSoup(res.text, "lxml")
        try:
            itmtitle = soup.select_one("#productTitle").get_text(strip=True)
        except AttributeError:
            itmtitle = "NA"

        print(itmtitle)

        ws.cell(row=row, column=2).value = itmtitle
        wb.save("amazon.xlsx")

    if __name__ == '__main__':
        for row in range(2, ws.max_row + 1):
            if ws.cell(row=row, column=1).value == None:
                break
            val = ws["A" + str(row)].value
            get_info(val)


However, when I try to do the same using multiprocessing, I get the following error:



    ws.cell(row=row, column=2).value = itmtitle
    NameError: name 'row' is not defined


The changes I made to my script for multiprocessing are:



    from multiprocessing import Pool

    if __name__ == '__main__':
        isbnlist = []
        for row in range(2, ws.max_row + 1):
            if ws.cell(row=row, column=1).value == None:
                break
            val = ws["A" + str(row)].value
            isbnlist.append(val)

        with Pool(10) as p:
            p.map(get_info, isbnlist)
            p.terminate()
            p.join()


A few of the ISBNs I've tried:



    9781584806844
    9780917360664
    9780134715308
    9781285858265
    9780986615108
    9780393646399
    9780134612966
    9781285857589
    9781453385982
    9780134683461


How can I get rid of that error and get the desired results using multiprocessing?

1 Answer

It does not make sense to reference the global variable row in get_data(), because:

1. It's a global, but it will not be shared between the workers in the multiprocessing Pool, because they are actually separate Python processes that do not share globals (see the sketch after this list).

2. Even if they did, the value of row would always be ws.max_row + 1 by the time get_info() runs, because you build the entire ISBN list before executing it, so the loop has already completed.

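To see point 1 in action, here is a minimal standalone sketch (mine, not from the original answer) showing that a module-level global mutated inside a Pool worker never changes in the parent process:

    from multiprocessing import Pool

    counter = 0  # module-level global

    def bump(_):
        global counter
        counter += 1           # mutates this worker's own copy only
        return counter

    if __name__ == '__main__':
        with Pool(4) as p:
            # Each worker counts only the tasks it handled itself.
            print(p.map(bump, range(8)))
        print(counter)         # still 0 in the parent process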


So you would need to provide the row values as part of the data passed to the second argument of p.map(). But even if you were to do that, writing to and saving the spreadsheet from multiple processes is a bad idea due to Windows file locking, race conditions, etc. You're better off building the list of titles with multiprocessing and then writing them out once when that's done, as in the following:



    import requests
    from bs4 import BeautifulSoup
    from openpyxl import load_workbook
    from multiprocessing import Pool


    def get_info(isbn):
        # Search Amazon for the ISBN and follow the first result, if any.
        params = {
            'url': 'search-alias=aps',
            'field-keywords': isbn
        }
        res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?", params=params)
        soup = BeautifulSoup(res.text, "lxml")
        itemlink = soup.select_one("a.s-access-detail-page")
        if itemlink:
            return get_data(itemlink['href'])


    def get_data(link):
        # Return the product title instead of writing to the workbook here.
        res = requests.get(link)
        soup = BeautifulSoup(res.text, "lxml")
        try:
            itmtitle = soup.select_one("#productTitle").get_text(strip=True)
        except AttributeError:
            itmtitle = "NA"

        return itmtitle


    def main():
        wb = load_workbook('amazon.xlsx')
        ws = wb['content']

        isbnlist = []
        for row in range(2, ws.max_row + 1):
            if ws.cell(row=row, column=1).value is None:
                break
            val = ws["A" + str(row)].value
            isbnlist.append(val)

        # p.map returns one title per ISBN, in the same order as isbnlist.
        with Pool(10) as p:
            titles = p.map(get_info, isbnlist)
            p.terminate()
            p.join()

        # Write everything back in a single pass from the parent process only
        # (bounded by len(titles) in case the ISBN column ended early).
        for row in range(2, len(titles) + 2):
            ws.cell(row=row, column=2).value = titles[row - 2]

        wb.save("amazon.xlsx")


    if __name__ == '__main__':
        main()
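
Note that Pool.map preserves the order of its input, so the title at titles[row - 2] lines up with the ISBN read from column A of the same row; that is what makes the single write-back pass at the end correct.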

• Thanks for your solution @cody. – robots.txt, Nov 21 '18 at 19:36

• I could not understand the line return get_data(itemlink['href']), @cody. Why did you use return? Usually get_data(itemlink['href']) alone should do the job. – robots.txt, Nov 22 '18 at 5:13

• Notice that the return value of p.map(get_info, isbnlist) is now being captured in titles. The Pool.map function generally follows the semantics of the standard Python map function and takes two arguments: a function (in this case, get_info) and a list of values (isbnlist). It returns a new list, where each value is the result of passing the original value through the given function. So get_info needs to return a value. Check out book.pythontips.com/en/latest/map_filter.html for more info on map(). – cody, Nov 22 '18 at 14:34
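
To illustrate the last comment, here is a minimal sketch (hypothetical values, not part of the original thread) of how Pool.map mirrors the built-in map:

    from multiprocessing import Pool

    def square(n):
        return n * n

    if __name__ == '__main__':
        # Both take a function and an iterable, and return one result per
        # input value, in input order.
        print(list(map(square, [1, 2, 3])))  # [1, 4, 9]
        with Pool(3) as p:
            print(p.map(square, [1, 2, 3]))  # [1, 4, 9]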