Reducing the amount of List in a WebScraper

At the moment, I'm learning and experimenting with scraping content from a variety of web pages. But I've come across a common code smell in several of my applications: I have many repetitive lists that data is appended to.

    from requests import get
    import requests
    import json
    from time import sleep
    import pandas as pd

    url = 'https://shopee.com.my/api/v2/flash_sale/get_items?offset=0&limit=16&filter_soldout=true'
    list_name = []
    list_price = []
    list_discount = []
    list_stock = []

    response = get(url)
    json_data = response.json()


    def getShockingSales():
        index = 0
        if response.status_code is 200:
            print('Response: ' + 'OK')
        else:
            print('Unable to access')
        total_flashsale = len(json_data['data']['items'])
        total_flashsale -= 1
        for i in range(index, total_flashsale):
            print('Getting data from site... please wait a few seconds')
            while i <= total_flashsale:
                flash_name = json_data['data']['items'][i]['name']
                flash_price = json_data['data']['items'][i]['price']
                flash_discount = json_data['data']['items'][i]['discount']
                flash_stock = json_data['data']['items'][i]['stock']
                list_name.append(flash_name)
                list_price.append(flash_price)
                list_discount.append(flash_discount)
                list_stock.append(flash_stock)
                sleep(0.5)
                i += 1
                if i > total_flashsale:
                    print('Task is completed...')
                    return

    getShockingSales()
    new_panda = pd.DataFrame({'Name': list_name, 'Price': list_price,
                              'Discount': list_discount, 'Stock Available': list_stock})

    print('Converting to Panda Frame....')
    sleep(5)
    print(new_panda)

Would one list be more than sufficient? Am I approaching this wrongly?

python python-3.x json

edited Jan 9 at 10:07
Minial

asked Jan 9 at 7:41
Minial

2 Answers

Review

1. Remove unnecessary imports (json is never used, and requests is imported twice).

2. Don't work in the global namespace.

  This makes it harder to track bugs.

3. Constants (url) should be UPPER_SNAKE_CASE.

4. Functions (getShockingSales()) should be lower_snake_case.

5. You don't break or return when an invalid status is encountered.

6. if response.status_code is 200: should use == instead of is.

  There is a function for this, though:

  response.raise_for_status() raises an exception when there is a 4xx or 5xx status.

7. Why use a while inside the for, and return when the while is finished?

  This is really odd!
  Loop with either a for or a while, not both, because the while currently disregards the for loop.

  I suggest sticking with for loops; Python excels at readable for loops

  (see "Loop like a native").
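
Points 5 to 7 can be tried out without touching the network. The following is only a minimal sketch under assumptions: FakeResponse is a hypothetical stand-in for a real requests response, the payload is made-up sample data shaped like the JSON in the question, and the RuntimeError merely imitates the effect of response.raise_for_status(), which is what you would call on a real response.

```python
class FakeResponse:
    """Hypothetical stand-in for a requests response, for offline use only."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


def get_item_names(response):
    # Compare values with == / != (not `is`), and stop early instead of
    # carrying on with a failed response.
    if response.status_code != 200:
        raise RuntimeError(f'Unable to access: HTTP {response.status_code}')
    names = []
    # A single for loop over the items: no index and no inner while loop.
    for item in response.json()['data']['items']:
        names.append(item['name'])
    return names


# Made-up payload shaped like the question's JSON (an assumption).
ok = FakeResponse(200, {'data': {'items': [{'name': 'Kettle'}, {'name': 'Lamp'}]}})
print(get_item_names(ok))

try:
    get_item_names(FakeResponse(503, {}))
except RuntimeError as error:
    print(error)
```

The early raise means the loop never runs on an error page, which is the behaviour the original function was missing.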





"Would one list be more than sufficient? Am I approaching this wrongly?"

Yes.

You don't have to use 4 separate lists; you can instead create one list and add the column names afterwards.

Code

    from requests import get
    import pandas as pd

    URL = 'https://shopee.com.my/api/v2/flash_sale/get_items?offset=0&limit=16&filter_soldout=true'


    def get_stocking_sales():
        response = get(URL)
        # Raises an exception on a 4xx/5xx status instead of continuing
        response.raise_for_status()
        # One tuple per item, in the same order as the column names below
        return [
            (item['name'], item['price'], item['discount'], item['stock'])
            for item in response.json()['data']['items']
        ]


    def create_pd():
        return pd.DataFrame(
            get_stocking_sales(),
            columns=['Name', 'Price', 'Discount', 'Stock']
        )


    if __name__ == '__main__':
        print(create_pd())
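
The idea of one list plus column names can also be checked offline, without pandas or the network. This is only a sketch under assumptions: the rows are made-up sample data, and pairing each row with the column names via zip is a plain-Python illustration of how the positional tuples line up with the names, not a description of what pandas does internally.

```python
COLUMNS = ['Name', 'Price', 'Discount', 'Stock']

# Made-up rows in the same field order as the tuples built in the answer.
rows = [
    ('Kettle', 4900, '59%', 12),
    ('Lamp', 1500, '20%', 3),
]

# Pair every value with its column name; the position in the tuple
# decides which column the value belongs to.
records = [dict(zip(COLUMNS, row)) for row in rows]

for record in records:
    print(record)
```

Because the column is decided by position alone, the order of fields in each tuple has to match the order of the column names.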






answered Jan 9 at 10:34
Ludisposed

• Thank you for showing where and what I did wrong and where I can improve, and also for making the code much cleaner! I followed what you said, and I never knew about the if __name__ == '__main__': concept. Not only did you help, but I've learned more from your insight. Thank you so much~ – Minial, Jan 10 at 2:12

• May I ask, just to really understand: how does this portion work? return [ (item['name'], item['discount'], item['liked_count'], item['stock']) for item in response.json()['data']['items'] ] – Minial, Jan 11 at 4:44

• It is called a list comprehension. – Ludisposed, Jan 11 at 8:38
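
For readers with the same question: a list comprehension can be read as a compact for loop that builds a list. The sketch below uses made-up sample data (the items list is an assumption, not real API output) and shows that the explicit loop and the comprehension produce the same rows.

```python
# Made-up items standing in for response.json()['data']['items'].
items = [
    {'name': 'Kettle', 'price': 4900, 'discount': '59%', 'stock': 12},
    {'name': 'Lamp', 'price': 1500, 'discount': '20%', 'stock': 3},
]

# Explicit loop: start with an empty list and append one tuple per item.
rows_loop = []
for item in items:
    rows_loop.append((item['name'], item['price'], item['discount'], item['stock']))

# List comprehension: the same expression and the same loop, written inline.
rows_comprehension = [
    (item['name'], item['price'], item['discount'], item['stock'])
    for item in items
]

print(rows_loop == rows_comprehension)
```

Reading a comprehension from the for clause first, then the expression in front of it, usually makes the correspondence with the loop obvious.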




















          $begingroup$

          Review




          1. Creating functions that read and modify global variables is not a good idea, for example if someone wants to reuse your function, they won't know about side effects.


          2. index is not useful, and range(0, n) is the same as range(n)


          3. Using == is more appropriate than is in general, hence response.status_code == 200


          4. If response.status_code != 200, I think the function should ~return an empty result~ raise an exception like said by @Ludisposed.


          5. You use json_data["data"]["items"] a lot, you could define items = json_data["data"]["items"] instead, but see below.


          6. Your usage of i is totally messy. Never use both for and while on the same variable. I think you just want to get the information for each item. So just use for item in json_data["data"]["items"]:.


          7. Actually, print("Getting data from site... please wait a few seconds") is wrong as you got the data at response = get(url). Also, sleep(0.5) and sleep(5) don't make any sense.


          8. Speaking from this, requests.get is more explicit.


          9. You can actually create a pandas DataFrame directly from a list of dictionaries.


          10. Actually, if you don't use the response in another place, you can use the url as an argument of the function.


          11. Putting spaces in the column names of a DataFrame is not a good idea. It removes the possibility of accessing the column named stock (for example) with df.stock. If you still want such names, you can use pandas.DataFrame.rename.


          12. You don't need to import json.


          13. The discounts are given as strings like "59%". I think integers are preferable if you want to perform computations on them. I used df.discount = df.discount.apply(lambda s: int(s[:-1])) to convert them.



          14. Optional: you might want to use logging instead of printing everything. Or at least print to stderr with:



            from sys import stderr



            print('Information', file=stderr)
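As a rough illustration of the logging suggestion in point 14, here is a minimal sketch using the standard logging module with a handler that writes to stderr. The logger name "flash_sale" and the helper configure_logging are my own placeholders, not part of the original code.

```python
import logging
import sys

# "flash_sale" is a hypothetical logger name chosen for this sketch.
logger = logging.getLogger("flash_sale")


def configure_logging(level=logging.INFO):
    """Send log records for this logger to stderr with a short format."""
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)


configure_logging()
logger.info("Response: OK")          # replaces print('Response: OK')
logger.info("Task is completed...")  # replaces the completion print
```

With this in place, the progress messages can be silenced by raising the level (for example to logging.WARNING) instead of deleting print calls.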




          Code



          import requests
          import pandas as pd


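          def getShockingSales(url):
              response = requests.get(url)
              columns = ["name", "price", "discount", "stock"]
              # Raise an HTTPError for 4xx/5xx responses instead of continuing.
              response.raise_for_status()
              print("Response: OK")
              json_data = response.json()
              # Build the DataFrame from the list of item dicts, keeping only the wanted columns.
              df = pd.DataFrame(json_data["data"]["items"])[columns]
              # Convert discount strings such as "59%" to integers.
              df.discount = df.discount.apply(lambda s: int(s[:-1]))
              print("Task is completed...")
              return df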


          URL = "https://shopee.com.my/api/v2/flash_sale/get_items?offset=0&limit=16&filter_soldout=true"
          df = getShockingSales(URL)
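Since the endpoint may not always be reachable, here is an offline sketch of points 9, 11 and 13 on made-up data. The `items` list is hypothetical and only mimics the shape of json_data["data"]["items"]; I also use the vectorised str.rstrip plus astype here as an alternative to the apply/lambda above, not as the only way to do it.

```python
import pandas as pd

# Hypothetical sample shaped like json_data["data"]["items"]; values are invented.
items = [
    {"name": "Widget", "price": 1990, "discount": "59%", "stock": 12, "liked_count": 3},
    {"name": "Gadget", "price": 4500, "discount": "10%", "stock": 0, "liked_count": 7},
]

columns = ["name", "price", "discount", "stock"]

# Point 9: a list of dicts converts directly to a DataFrame; select the wanted columns.
df = pd.DataFrame(items)[columns]

# Point 13: strip the trailing "%" and convert the discounts to integers.
df["discount"] = df["discount"].str.rstrip("%").astype(int)

# Point 11: keep simple column names for computation, rename only for display.
display_df = df.rename(columns={"stock": "Stock Available"})
```

Here df.stock still works for computations, while display_df carries the friendlier header for printing.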














          $endgroup$













          • $begingroup$
            Thank you for your insight~ I've learned more than I could have hoped for by reading your review. It even helped me solve and fix a few errors in other areas of my application. I wish I could give you more upvotes v.v
            $endgroup$
            – Minial
            Jan 10 at 2:14











































          answered Jan 9 at 10:41









          LaboLabo

          1664

































