Using multiprocessing to preprocess images

I am trying to get my feet wet with multiprocessing in Python, so I am using it to run an image preprocessing pipeline. All my images are in a directory called image_files, and I have a list of the filenames in that directory. I split the list into two chunks, a and b, and pass each chunk to its own multiprocessing.Process, where a method called preprocess_image does the preprocessing of each image.



Following a tutorial on calculating square roots with multiprocessing, I came up with working code (see below).



This code works. However, speed matters, and I am not sure whether it is appropriate to define two methods that do essentially the same thing, or whether it would be faster to use a single method and simply pass a and b to the same target in multiprocessing.Process(target=work...).



Hence my question: is this the right way to use multiprocessing, or could I speed it up somehow?



import multiprocessing

import cv2
from tqdm import tqdm

def work1(array):
    for i in tqdm(array):
        image_path = "C:/Users/aaron/Desktop/image_files/" + i
        image = preprocess_image(image_path)
        cv2.imwrite("C:/Users/aaron/Desktop/destination/" + i, image)

def work2(array):
    for i in tqdm(array):
        image_path = "C:/Users/aaron/Desktop/image_files/" + i
        image = preprocess_image(image_path)
        cv2.imwrite("C:/Users/aaron/Desktop/destination/" + i, image)

if __name__ == "__main__":
    p1 = multiprocessing.Process(target=work1, args=(a,))
    p2 = multiprocessing.Process(target=work2, args=(b,))

    p1.start()
    p2.start()

    p1.join()
    p2.join()

    print("Done!")









      python






      asked Nov 20 '18 at 18:59









AaronDT

1 Answer

Since all of your process outputs appear to be independent, you should use multiprocessing.Pool:



from multiprocessing import Pool

l = ...  # list of all your image files
f = ...  # function to modify each of them, taking an element of l as input

if __name__ == "__main__":
    p = Pool(10)  # however many worker processes you want to spawn
    p.map(f, l)


That's it: you don't need to define the same function twice or split the list manually. The work is assigned and managed for you automatically.
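
Applied to the question's pipeline, that might look like the following minimal sketch. The worker name process_one is illustrative, preprocess_image is assumed to be the question's own function defined at module level, and the __main__ guard is needed on Windows, where multiprocessing starts fresh interpreter processes:

import os
from multiprocessing import Pool

import cv2

SRC = "C:/Users/aaron/Desktop/image_files"
DST = "C:/Users/aaron/Desktop/destination"

def process_one(name):
    # Read one image, preprocess it, and write the result.
    image = preprocess_image(os.path.join(SRC, name))  # the question's function
    cv2.imwrite(os.path.join(DST, name), image)

if __name__ == "__main__":
    filenames = os.listdir(SRC)
    with Pool() as pool:  # Pool() defaults to os.cpu_count() workers
        pool.map(process_one, filenames)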






answered Nov 20 '18 at 19:03
Rocky Li

• Thank you! This works great! Is the number of processes I can spawn limited by the number of cores I have available? Also, I wonder how to stop the whole run while it's in progress; using cmd + c doesn't seem to do the trick...

            – AaronDT
            Nov 20 '18 at 19:52













• No, it is not limited; however, in this case more processes than CPU cores probably won't improve performance, since the extras will just wait in a queue. If you have an I/O-intensive application, it can make sense to have more processes than cores. As for stopping it: the Pool spawns daemon processes that will eventually be killed, but you may have to wait for each one to finish its current task, so shutdown is not immediate.

            – Rocky Li
            Nov 20 '18 at 20:09
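
One pattern for shutting the pool down promptly on a keyboard interrupt is sketched below; process_one and filenames are the hypothetical names from the sketch above, and this is an illustration rather than code from the thread:

from multiprocessing import Pool

if __name__ == "__main__":
    pool = Pool()
    try:
        pool.map(process_one, filenames)
    except KeyboardInterrupt:
        pool.terminate()  # stop the workers immediately instead of draining the queue
    else:
        pool.close()      # normal completion: no more tasks will be submitted
    finally:
        pool.join()       # wait for the worker processes to exit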












