Parallel calculation on a sparse matrix in Python
I have a large numpy vector X and a comparison function f(x,y). I need to find all pairs of elements of X for which f(X[i], X[j]) < T for some threshold T. This works well:



good_inds = {}
for i in range(len(X)):
    for j in range(i + 1, len(X)):
        score = f(X[i], X[j])
        if score < T:
            good_inds[i, j] = score


This builds a dictionary that is a representation of a sparse matrix. The problem is that it's rather slow, and I wish to parallelise the process.
Please advise.
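(As an aside, the {(i, j): score} dictionary being built here maps directly onto scipy's sparse formats. A minimal sketch, with made-up stand-ins for X, f and T, of converting it into a scipy.sparse matrix:)

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins for the question's X, f and T.
X = np.array([0.0, 0.1, 0.5])
T = 0.2

def f(x, y):
    return abs(x - y)

# Same double loop as in the question.
good_inds = {}
for i in range(len(X)):
    for j in range(i + 1, len(X)):
        score = f(X[i], X[j])
        if score < T:
            good_inds[i, j] = score

# The (i, j) keys and values feed straight into a COO sparse matrix,
# which can then be converted to CSR for fast arithmetic and slicing.
rows, cols = zip(*good_inds)
m = sparse.coo_matrix((list(good_inds.values()), (rows, cols)),
                      shape=(len(X), len(X))).tocsr()
```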
  • x and y are constants within the scope of this snippet, so why use a dictionary? Did you mean x --> X[i] and y --> X[j]?

    – meowgoesthedog
    Jan 1 at 14:53








  • answers to this sort of question will be strongly dependent on what f does, e.g. what sort of constraints can be exploited. Roland's answer is great if there's nothing more known about the problem, but you'd get much more relevant answers if you said that X and Y are both numpy arrays and f a simple algebraic expression

    – Sam Mason
    Feb 4 at 14:53
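(To illustrate the comment's point with a purely hypothetical case: if f(x, y) happened to be plain |x - y|, sorting X would turn the O(n²) scan into a short windowed walk, because only neighbours in sorted order can fall below T. A sketch under that assumption:)

```python
import numpy as np

# Made-up data and threshold; f is assumed to be absolute difference.
X = np.array([5.0, 0.3, 0.1, 9.0, 0.2])
T = 0.15

order = np.argsort(X)   # positions of X in ascending value order
xs = X[order]

good_inds = {}
for a in range(len(xs)):
    b = a + 1
    # advance only while the gap can still be below the threshold
    while b < len(xs) and xs[b] - xs[a] < T:
        i, j = sorted((int(order[a]), int(order[b])))
        good_inds[i, j] = float(xs[b] - xs[a])
        b += 1
```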
















python parallel-processing sparse-matrix
edited Jan 1 at 15:23
mousomer
asked Jan 1 at 14:01
mousomer

2 Answers
1














This is a good fit for multiprocessing.Pool.



Create your numpy array, then make an iterator of 2-tuples of all possible i and j values, for example with itertools.combinations:



In [1]: import itertools

In [7]: list(itertools.combinations(range(4), 2))
Out[7]: [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]


(You should use the length of your vector as the argument to range, of course.)



Write the following function:



def worker(pair):
    i, j = pair
    rv = False
    if f(X[i], X[j]) < T:
        rv = True
    return (i, j, rv)


Create a Pool, and run imap_unordered:



p = multiprocessing.Pool()
for i, j, result in p.imap_unordered(worker, itertools.combinations(range(len(X)), 2)):
    if result:
        print('Good pair:', i, j)
        # do something with the results...


This will run as many workers as your CPU has cores.
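(Putting the pieces of this answer together as one runnable sketch; X, f and T below are hypothetical stand-ins, and the results are collected into the question's {(i, j): score} dictionary. A chunksize is passed to imap_unordered, since batching matters when f is cheap.)

```python
import itertools
import multiprocessing

import numpy as np

# Hypothetical stand-ins for the question's X, f and T.
X = np.array([0.0, 0.1, 0.5, 0.55, 2.0])
T = 0.2

def f(x, y):
    return abs(x - y)

def worker(pair):
    i, j = pair
    return (i, j, f(X[i], X[j]) < T)

def find_pairs():
    good_inds = {}
    with multiprocessing.Pool(processes=2) as p:
        pairs = itertools.combinations(range(len(X)), 2)
        # chunksize batches many pairs per task to amortise IPC overhead
        for i, j, ok in p.imap_unordered(worker, pairs, chunksize=16):
            if ok:
                good_inds[i, j] = float(f(X[i], X[j]))
    return good_inds

if __name__ == '__main__':
    print(sorted(find_pairs()))
```

Note that worker processes need X, f and T importable at module level; the `if __name__ == '__main__'` guard keeps the sketch safe on platforms that start workers with spawn.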

answered Jan 1 at 15:35, edited Jan 1 at 15:41
Roland Smith

  • thanks. This is very cool. At the end I found scipy distance matrices to be already optimized.

    – mousomer
    Jan 2 at 15:36



















0














So. Apparently SciPy is already good enough.



from scipy import spatial

full_dist_mat = spatial.distance.squareform(spatial.distance.pdist(vects2, metric='cosine'))



is already optimised. Computing all pairwise distances for 2000 vectors takes 1.3 seconds in Jupyter Lab on a MacBook Pro.
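(A sketch of this pdist approach end-to-end, thresholding the condensed distance vector to recover the sparse {(i, j): score} dictionary from the question; the data and threshold below are made up for demonstration.)

```python
import numpy as np
from scipy import spatial

rng = np.random.default_rng(0)
vects = rng.random((200, 16))   # stand-in data
T = 0.05                        # stand-in threshold

# condensed vector: one distance per pair (i, j) with i < j, row-major order
d = spatial.distance.pdist(vects, metric='cosine')

# triu_indices with k=1 enumerates pairs in the same order as pdist
n = len(vects)
i_idx, j_idx = np.triu_indices(n, k=1)
good_inds = {(int(i), int(j)): float(s)
             for i, j, s in zip(i_idx, j_idx, d) if s < T}
```

The condensed form avoids materialising the full n×n matrix; squareform is only needed if random access to dist[i, j] is wanted.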

answered Jan 2 at 15:35
mousomer
