Parallel calculation on a sparse matrix in Python
I have a large numpy vector X and a comparison function f(x,y). I need to find all pairs of elements of X for which f(X[i], X[j]) < T for some threshold T. This works well:



good_inds = {}
for i in range(len(X)):
    for j in range(i + 1, len(X)):
        score = f(X[i], X[j])
        if score < T:
            good_inds[i, j] = score


This builds a dictionary that is a representation of a sparse matrix. The problem is that it's rather slow, and I wish to parallelise the process.
Please advise.
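(As an aside, the {(i, j): score} dictionary being built here maps directly onto scipy's sparse formats. A minimal sketch, with made-up stand-ins for X, f and T, of converting it into a scipy.sparse matrix:)

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins for the question's X, f and T.
X = np.array([0.0, 0.1, 0.5])
T = 0.2

def f(x, y):
    return abs(x - y)

# Same double loop as in the question.
good_inds = {}
for i in range(len(X)):
    for j in range(i + 1, len(X)):
        score = f(X[i], X[j])
        if score < T:
            good_inds[i, j] = score

# The (i, j) keys and values feed straight into a COO sparse matrix,
# which can then be converted to CSR for fast arithmetic and slicing.
rows, cols = zip(*good_inds)
m = sparse.coo_matrix((list(good_inds.values()), (rows, cols)),
                      shape=(len(X), len(X))).tocsr()
```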
  • x and y are constants within the scope of this snippet, so why use a dictionary? Did you mean x --> X[i] and y --> X[j]?

    – meowgoesthedog
    Jan 1 at 14:53








  • answers to this sort of question will be strongly dependent on what f does, e.g. what sort of constraints can be exploited. Roland's answer is great if there's nothing more known about the problem, but you'd get much more relevant answers if you said that X and Y are both numpy arrays and f a simple algebraic expression

    – Sam Mason
    Feb 4 at 14:53
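(To illustrate the comment's point with a purely hypothetical case: if f(x, y) happened to be plain |x - y|, sorting X would turn the O(n²) scan into a short windowed walk, because only neighbours in sorted order can fall below T. A sketch under that assumption:)

```python
import numpy as np

# Made-up data and threshold; f is assumed to be absolute difference.
X = np.array([5.0, 0.3, 0.1, 9.0, 0.2])
T = 0.15

order = np.argsort(X)   # positions of X in ascending value order
xs = X[order]

good_inds = {}
for a in range(len(xs)):
    b = a + 1
    # advance only while the gap can still be below the threshold
    while b < len(xs) and xs[b] - xs[a] < T:
        i, j = sorted((int(order[a]), int(order[b])))
        good_inds[i, j] = float(xs[b] - xs[a])
        b += 1
```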
















python parallel-processing sparse-matrix
edited Jan 1 at 15:23
mousomer
asked Jan 1 at 14:01
mousomer

2 Answers
1














This is a good fit for multiprocessing.Pool.



Create your numpy array, then make an iterator of 2-tuples of all possible i and j values, for example with itertools.combinations:



In [1]: import itertools

In [7]: list(itertools.combinations(range(4), 2))
Out[7]: [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]


(You should use the length of your vector as the argument to range, of course.)



Write the following function:



def worker(pair):
    i, j = pair
    rv = False
    if f(X[i], X[j]) < T:
        rv = True
    return (i, j, rv)


Create a Pool, and run imap_unordered:



p = multiprocessing.Pool()
for i, j, result in p.imap_unordered(worker, itertools.combinations(range(len(X)), 2)):
    if result:
        print('Good pair:', i, j)
        # do something with the results...


This will run as many workers as your CPU has cores.
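(Putting the pieces of this answer together as one runnable sketch; X, f and T below are hypothetical stand-ins, and the results are collected into the question's {(i, j): score} dictionary. A chunksize is passed to imap_unordered, since batching matters when f is cheap.)

```python
import itertools
import multiprocessing

import numpy as np

# Hypothetical stand-ins for the question's X, f and T.
X = np.array([0.0, 0.1, 0.5, 0.55, 2.0])
T = 0.2

def f(x, y):
    return abs(x - y)

def worker(pair):
    i, j = pair
    return (i, j, f(X[i], X[j]) < T)

def find_pairs():
    good_inds = {}
    with multiprocessing.Pool(processes=2) as p:
        pairs = itertools.combinations(range(len(X)), 2)
        # chunksize batches many pairs per task to amortise IPC overhead
        for i, j, ok in p.imap_unordered(worker, pairs, chunksize=16):
            if ok:
                good_inds[i, j] = float(f(X[i], X[j]))
    return good_inds

if __name__ == '__main__':
    print(sorted(find_pairs()))
```

Note that worker processes need X, f and T importable at module level; the `if __name__ == '__main__'` guard keeps the sketch safe on platforms that start workers with spawn.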

answered Jan 1 at 15:35, edited Jan 1 at 15:41
Roland Smith

  • thanks. This is very cool. At the end I found scipy distance matrices to be already optimized.

    – mousomer
    Jan 2 at 15:36



















0














So. Apparently SciPy is already good enough.



from scipy import spatial

full_dist_mat = spatial.distance.squareform(spatial.distance.pdist(vects2, metric='cosine'))



is already optimised. Computing all pairwise distances for 2000 vectors takes 1.3 seconds in Jupyter Lab on a MacBook Pro.
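(A sketch of this pdist approach end-to-end, thresholding the condensed distance vector to recover the sparse {(i, j): score} dictionary from the question; the data and threshold below are made up for demonstration.)

```python
import numpy as np
from scipy import spatial

rng = np.random.default_rng(0)
vects = rng.random((200, 16))   # stand-in data
T = 0.05                        # stand-in threshold

# condensed vector: one distance per pair (i, j) with i < j, row-major order
d = spatial.distance.pdist(vects, metric='cosine')

# triu_indices with k=1 enumerates pairs in the same order as pdist
n = len(vects)
i_idx, j_idx = np.triu_indices(n, k=1)
good_inds = {(int(i), int(j)): float(s)
             for i, j, s in zip(i_idx, j_idx, d) if s < T}
```

The condensed form avoids materialising the full n×n matrix; squareform is only needed if random access to dist[i, j] is wanted.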

answered Jan 2 at 15:35
mousomer
