Parallel calculation on a sparse matrix in python
I have a large [numpy] vector X, and a comparison function f(x,y)
. I need to find all the pairs of elements of X for which f(X[I],X[j])<T
for some threshold T. This works well:
good_inds = {}
for i in range(0,len(X)):
for j in range(x+1,len(X)):
score = f(X[i],X[j])
if score<T:
good_inds[x,y] = score
This actually builds a dictionary which is a representation of a sparse matrix. The problem is that it's rather slow, and I wish to parallelise this process.
Please advise.
python parallel-processing sparse-matrix
add a comment |
I have a large [numpy] vector X, and a comparison function f(x,y)
. I need to find all the pairs of elements of X for which f(X[I],X[j])<T
for some threshold T. This works well:
good_inds = {}
for i in range(0,len(X)):
for j in range(x+1,len(X)):
score = f(X[i],X[j])
if score<T:
good_inds[x,y] = score
This actually builds a dictionary which is a representation of a sparse matrix. The problem is that it's rather slow, and I wish to parallelise this process.
Please advise.
python parallel-processing sparse-matrix
x
andy
are constants within the scope of this snippet, so why use a dictionary? Did you meanx --> X[i]
andy --> X[j]
?
– meowgoesthedog
Jan 1 at 14:53
1
answers to this sort of question will be strongly dependant on whatf
does, e.g. what sort of constraints can be exploited. Roland's answer is great if there's nothing more known about the problem, but you'd get much more relevant answers if you said thatX
andY
are bothnumpy
arrays andf
a simple algebraic expression
– Sam Mason
Feb 4 at 14:53
add a comment |
I have a large [numpy] vector X, and a comparison function f(x,y)
. I need to find all the pairs of elements of X for which f(X[I],X[j])<T
for some threshold T. This works well:
good_inds = {}
for i in range(0,len(X)):
for j in range(x+1,len(X)):
score = f(X[i],X[j])
if score<T:
good_inds[x,y] = score
This actually builds a dictionary which is a representation of a sparse matrix. The problem is that it's rather slow, and I wish to parallelise this process.
Please advise.
python parallel-processing sparse-matrix
I have a large [numpy] vector X, and a comparison function f(x,y)
. I need to find all the pairs of elements of X for which f(X[I],X[j])<T
for some threshold T. This works well:
good_inds = {}
for i in range(0,len(X)):
for j in range(x+1,len(X)):
score = f(X[i],X[j])
if score<T:
good_inds[x,y] = score
This actually builds a dictionary which is a representation of a sparse matrix. The problem is that it's rather slow, and I wish to parallelise this process.
Please advise.
python parallel-processing sparse-matrix
python parallel-processing sparse-matrix
edited Jan 1 at 15:23
mousomer
asked Jan 1 at 14:01
mousomermousomer
1,19211116
1,19211116
x
andy
are constants within the scope of this snippet, so why use a dictionary? Did you meanx --> X[i]
andy --> X[j]
?
– meowgoesthedog
Jan 1 at 14:53
1
answers to this sort of question will be strongly dependant on whatf
does, e.g. what sort of constraints can be exploited. Roland's answer is great if there's nothing more known about the problem, but you'd get much more relevant answers if you said thatX
andY
are bothnumpy
arrays andf
a simple algebraic expression
– Sam Mason
Feb 4 at 14:53
add a comment |
x
andy
are constants within the scope of this snippet, so why use a dictionary? Did you meanx --> X[i]
andy --> X[j]
?
– meowgoesthedog
Jan 1 at 14:53
1
answers to this sort of question will be strongly dependant on whatf
does, e.g. what sort of constraints can be exploited. Roland's answer is great if there's nothing more known about the problem, but you'd get much more relevant answers if you said thatX
andY
are bothnumpy
arrays andf
a simple algebraic expression
– Sam Mason
Feb 4 at 14:53
x
and y
are constants within the scope of this snippet, so why use a dictionary? Did you mean x --> X[i]
and y --> X[j]
?– meowgoesthedog
Jan 1 at 14:53
x
and y
are constants within the scope of this snippet, so why use a dictionary? Did you mean x --> X[i]
and y --> X[j]
?– meowgoesthedog
Jan 1 at 14:53
1
1
answers to this sort of question will be strongly dependant on what
f
does, e.g. what sort of constraints can be exploited. Roland's answer is great if there's nothing more known about the problem, but you'd get much more relevant answers if you said that X
and Y
are both numpy
arrays and f
a simple algebraic expression– Sam Mason
Feb 4 at 14:53
answers to this sort of question will be strongly dependant on what
f
does, e.g. what sort of constraints can be exploited. Roland's answer is great if there's nothing more known about the problem, but you'd get much more relevant answers if you said that X
and Y
are both numpy
arrays and f
a simple algebraic expression– Sam Mason
Feb 4 at 14:53
add a comment |
2 Answers
2
active
oldest
votes
This is a good fit for multiprocessing.Pool
.
Create your numpy array, then make an iterator of 2-tuples all possible i
and j
values. For example with itertools.combinations
.
In [1]: import itertools
In [7]: list(itertools.combinations(range(4), 2))
Out[7]: [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
(You should use the length of your vector as the argument to range
, of course.)
Write the following function:
def worker(pair):
i, j = pair
rv = False
if f(X[i],X[j]) < T:
rv = True
return (i, j, rv)
Create a Pool
, and run imap_unordered
:
p = multiprocessing.Pool()
for i, j, result in p.imap_unordered(worker, itertools.combinations(range(len(X)), 2)):
if result:
print('Good pair:', i, j)
# do something with the results...
This will run as many workers as you CPU has cores.
thanks. This is very cool. At the end I found scipy distance matrices to be already optimized.
– mousomer
Jan 2 at 15:36
add a comment |
So. Apparently SciPy is already good enough.
full_dist_mat = spatial.distance.squareform( spatial.distance.pdist(vects2, metric='cosine'))
is already optimised. Running 2000 vectors takes 1.3 seconds in jupyter lab on a Macbook pro.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53996080%2fparallel-calculation-on-a-sparse-matrix-in-python%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
This is a good fit for multiprocessing.Pool
.
Create your numpy array, then make an iterator of 2-tuples all possible i
and j
values. For example with itertools.combinations
.
In [1]: import itertools
In [7]: list(itertools.combinations(range(4), 2))
Out[7]: [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
(You should use the length of your vector as the argument to range
, of course.)
Write the following function:
def worker(pair):
i, j = pair
rv = False
if f(X[i],X[j]) < T:
rv = True
return (i, j, rv)
Create a Pool
, and run imap_unordered
:
p = multiprocessing.Pool()
for i, j, result in p.imap_unordered(worker, itertools.combinations(range(len(X)), 2)):
if result:
print('Good pair:', i, j)
# do something with the results...
This will run as many workers as you CPU has cores.
thanks. This is very cool. At the end I found scipy distance matrices to be already optimized.
– mousomer
Jan 2 at 15:36
add a comment |
This is a good fit for multiprocessing.Pool
.
Create your numpy array, then make an iterator of 2-tuples all possible i
and j
values. For example with itertools.combinations
.
In [1]: import itertools
In [7]: list(itertools.combinations(range(4), 2))
Out[7]: [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
(You should use the length of your vector as the argument to range
, of course.)
Write the following function:
def worker(pair):
i, j = pair
rv = False
if f(X[i],X[j]) < T:
rv = True
return (i, j, rv)
Create a Pool
, and run imap_unordered
:
p = multiprocessing.Pool()
for i, j, result in p.imap_unordered(worker, itertools.combinations(range(len(X)), 2)):
if result:
print('Good pair:', i, j)
# do something with the results...
This will run as many workers as you CPU has cores.
thanks. This is very cool. At the end I found scipy distance matrices to be already optimized.
– mousomer
Jan 2 at 15:36
add a comment |
This is a good fit for multiprocessing.Pool
.
Create your numpy array, then make an iterator of 2-tuples all possible i
and j
values. For example with itertools.combinations
.
In [1]: import itertools
In [7]: list(itertools.combinations(range(4), 2))
Out[7]: [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
(You should use the length of your vector as the argument to range
, of course.)
Write the following function:
def worker(pair):
i, j = pair
rv = False
if f(X[i],X[j]) < T:
rv = True
return (i, j, rv)
Create a Pool
, and run imap_unordered
:
p = multiprocessing.Pool()
for i, j, result in p.imap_unordered(worker, itertools.combinations(range(len(X)), 2)):
if result:
print('Good pair:', i, j)
# do something with the results...
This will run as many workers as you CPU has cores.
This is a good fit for multiprocessing.Pool
.
Create your numpy array, then make an iterator of 2-tuples all possible i
and j
values. For example with itertools.combinations
.
In [1]: import itertools
In [7]: list(itertools.combinations(range(4), 2))
Out[7]: [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
(You should use the length of your vector as the argument to range
, of course.)
Write the following function:
def worker(pair):
i, j = pair
rv = False
if f(X[i],X[j]) < T:
rv = True
return (i, j, rv)
Create a Pool
, and run imap_unordered
:
p = multiprocessing.Pool()
for i, j, result in p.imap_unordered(worker, itertools.combinations(range(len(X)), 2)):
if result:
print('Good pair:', i, j)
# do something with the results...
This will run as many workers as you CPU has cores.
edited Jan 1 at 15:41
answered Jan 1 at 15:35


Roland SmithRoland Smith
26.7k33256
26.7k33256
thanks. This is very cool. At the end I found scipy distance matrices to be already optimized.
– mousomer
Jan 2 at 15:36
add a comment |
thanks. This is very cool. At the end I found scipy distance matrices to be already optimized.
– mousomer
Jan 2 at 15:36
thanks. This is very cool. At the end I found scipy distance matrices to be already optimized.
– mousomer
Jan 2 at 15:36
thanks. This is very cool. At the end I found scipy distance matrices to be already optimized.
– mousomer
Jan 2 at 15:36
add a comment |
So. Apparently SciPy is already good enough.
full_dist_mat = spatial.distance.squareform( spatial.distance.pdist(vects2, metric='cosine'))
is already optimised. Running 2000 vectors takes 1.3 seconds in jupyter lab on a Macbook pro.
add a comment |
So. Apparently SciPy is already good enough.
full_dist_mat = spatial.distance.squareform( spatial.distance.pdist(vects2, metric='cosine'))
is already optimised. Running 2000 vectors takes 1.3 seconds in jupyter lab on a Macbook pro.
add a comment |
So. Apparently SciPy is already good enough.
full_dist_mat = spatial.distance.squareform( spatial.distance.pdist(vects2, metric='cosine'))
is already optimised. Running 2000 vectors takes 1.3 seconds in jupyter lab on a Macbook pro.
So. Apparently SciPy is already good enough.
full_dist_mat = spatial.distance.squareform( spatial.distance.pdist(vects2, metric='cosine'))
is already optimised. Running 2000 vectors takes 1.3 seconds in jupyter lab on a Macbook pro.
answered Jan 2 at 15:35
mousomermousomer
1,19211116
1,19211116
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53996080%2fparallel-calculation-on-a-sparse-matrix-in-python%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
x
andy
are constants within the scope of this snippet, so why use a dictionary? Did you meanx --> X[i]
andy --> X[j]
?– meowgoesthedog
Jan 1 at 14:53
1
answers to this sort of question will be strongly dependant on what
f
does, e.g. what sort of constraints can be exploited. Roland's answer is great if there's nothing more known about the problem, but you'd get much more relevant answers if you said thatX
andY
are bothnumpy
arrays andf
a simple algebraic expression– Sam Mason
Feb 4 at 14:53