Matching Company names in Python through Cosine Similarity, TF-IDF and pyspark
I'm trying to match company names from two lists, in order to check whether a company in list A is also listed in list B. As company names are written in all kinds of different forms, I leaned toward matching them with cosine similarity.
For this, I followed the note on this blog by Ran Tavory: Link Here
Here is the general outline:
- Calculate TF-IDF matrices on the driver.
- Parallelize matrix A; broadcast matrix B.
- Each worker now flatMaps its chunk of work by multiplying its chunk of matrix A with the entire matrix B. So if a worker operates on A[0:99], it would multiply these hundred rows and return the result of, say, A[13] matching a name found in B[21]. Multiplication is done using numpy.
- The driver collects all the results from the different workers and matches the indices (A[13] and B[21]) to the actual names in the original dataset, and we're done!
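The chunked multiplication in the outline can be tried without Spark first. This is only a minimal sketch under stated assumptions: the helper `best_matches` and the toy name lists are hypothetical, and it relies on `TfidfVectorizer` L2-normalising rows by default, so that a plain sparse product of the rows gives the cosine similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def best_matches(a_mat, b_mat, rows_per_chunk=100, threshold=0.8):
    """Yield (a_index, b_index, similarity) for the best match of each row of a_mat."""
    b_t = b_mat.T.tocsr()  # transpose once and reuse it for every chunk
    for start in range(0, a_mat.shape[0], rows_per_chunk):
        chunk = a_mat[start:start + rows_per_chunk]
        # Rows are L2-normalised (TfidfVectorizer default), so the dot
        # product of two rows equals their cosine similarity.
        sims = (chunk @ b_t).toarray()
        for i, row in enumerate(sims):
            j = int(row.argmax())
            if row[j] >= threshold:
                yield (start + i, j, float(row[j]))


# Hypothetical toy data standing in for the two company lists
names_a = ["acme corp", "globex inc", "initech"]
names_b = ["initech", "acme corp", "umbrella"]

# One vectorizer fitted on both lists, so both matrices share the same columns
vectorizer = TfidfVectorizer().fit(names_a + names_b)
a_mat = vectorizer.transform(names_a)
b_mat = vectorizer.transform(names_b)

matches = list(best_matches(a_mat, b_mat, rows_per_chunk=2))
for a_idx, b_idx, sim in matches:
    print(names_a[a_idx], "->", names_b[b_idx], round(sim, 3))
```

In the Spark version, each chunk would be one element of the parallelized RDD and `b_mat` would be the broadcast matrix; the per-chunk arithmetic stays the same.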
I was able to run the exact code described in the note, but one part of it seems odd:
b_mat_dist = broadcast_matrix(a_mat)
When I broadcast a_mat and also parallelize a_mat, I logically get a perfect match for every company name (as we're searching in the same source).
When I try broadcasting b_mat instead, with b_mat_dist = broadcast_matrix(b_mat), I get the following error: Incompatible dimension for X and Y matrices: X.shape[1] == 56710 while Y.shape[1] == 2418
Any help would be greatly appreciated!
Thanks in advance!
Here is my code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from pyspark.sql import SQLContext, SparkSession
from pyspark import SparkContext
from scipy.sparse import csr_matrix

vectorizer = TfidfVectorizer()

if 'sc' in locals():
    sc.stop()
sc = SparkContext("local", "Simple App")
pd.set_option('display.max_colwidth', -1)

RefB = pd.read_excel('Ref.xlsx')
ToMatchB = pd.read_excel('ToMatch.xlsx')
Ref = RefB['CLT_company_name']
ToMatch = ToMatchB['Name1']

a_mat = vectorizer.fit_transform(Ref)
b_mat = vectorizer.fit_transform(ToMatch)

def find_matches_in_submatrix(sources, targets, inputs_start_index,
                              threshold=.8):
    cosimilarities = cosine_similarity(sources, targets)
    for i, cosimilarity in enumerate(cosimilarities):
        cosimilarity = cosimilarity.flatten()
        # Find the best match by using argsort()[-1]
        target_index = cosimilarity.argsort()[-1]
        source_index = inputs_start_index + i
        similarity = cosimilarity[target_index]
        if cosimilarity[target_index] >= threshold:
            yield (source_index, target_index, similarity)

def broadcast_matrix(mat):
    bcast = sc.broadcast((mat.data, mat.indices, mat.indptr))
    (data, indices, indptr) = bcast.value
    bcast_mat = csr_matrix((data, indices, indptr), shape=mat.shape)
    return bcast_mat

def parallelize_matrix(scipy_mat, rows_per_chunk=100):
    [rows, cols] = scipy_mat.shape
    i = 0
    submatrices = []
    while i < rows:
        current_chunk_size = min(rows_per_chunk, rows - i)
        submat = scipy_mat[i:i + current_chunk_size]
        submatrices.append((i, (submat.data, submat.indices,
                                submat.indptr),
                            (current_chunk_size, cols)))
        i += current_chunk_size
    return sc.parallelize(submatrices)

a_mat_para = parallelize_matrix(a_mat, rows_per_chunk=100)
b_mat_dist = broadcast_matrix(b_mat)
results = a_mat_para.flatMap(
    lambda submatrix:
    find_matches_in_submatrix(csr_matrix(submatrix[1],
                                         shape=submatrix[2]),
                              b_mat_dist,
                              submatrix[0]))
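The dimension error can be reproduced without Spark. The sketch below assumes two tiny made-up lists (`ref` and `to_match` are hypothetical stand-ins for Ref and ToMatch). Each call to `fit_transform` relearns the vocabulary from the text it is given, so the two matrices end up with a different number of columns, and `cosine_similarity` rejects them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the Ref and ToMatch columns
ref = ["acme corp", "globex inc", "initech llc"]
to_match = ["acme"]

vectorizer = TfidfVectorizer()
a_mat = vectorizer.fit_transform(ref)       # vocabulary learned from ref only
b_mat = vectorizer.fit_transform(to_match)  # refit: vocabulary replaced by to_match's

# One column per vocabulary token, so the widths differ
shapes = (a_mat.shape, b_mat.shape)
print(shapes)

try:
    cosine_similarity(a_mat, b_mat)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

The two column counts in the reported error (56710 and 2418) would then simply be the vocabulary sizes of the two lists.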
python string-matching cosine-similarity
asked Jan 3 at 12:04
MaxG
1 Answer
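Try equalising the vocabulary for both TfidfVectorizer objects:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# Ref and ToMatch are pandas Series: concatenate them, since + would join strings element-wise
vocabulary = vect.fit(pd.concat([Ref, ToMatch])).vocabulary_
vectorizer = TfidfVectorizer(vocabulary=vocabulary)
Also, based on what you are aiming to do,
a_mat = vectorizer.fit_transform(ToMatch)
b_mat = vectorizer.fit_transform(Ref)
looks like a better option to me.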
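A minimal sketch of the shared-vocabulary idea, assuming small made-up Series in place of the real columns (`ref` and `to_match` are hypothetical). It swaps in a slightly different technique from the snippet above: one `TfidfVectorizer` is fitted once on both lists and then only `transform` is called, so both matrices share the same columns and the same IDF weights. Note that `+` on two pandas Series joins strings element-wise rather than appending one list to the other, which is why `pd.concat` is used here.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the Ref and ToMatch columns
ref = pd.Series(["acme corp", "globex inc", "initech llc"])
to_match = pd.Series(["initech", "acme corporation"])

# Fit once on the union of both lists: shared vocabulary and IDF weights
vectorizer = TfidfVectorizer()
vectorizer.fit(pd.concat([ref, to_match], ignore_index=True))

# transform() reuses the fitted vocabulary; fit_transform() would relearn it
a_mat = vectorizer.transform(ref)
b_mat = vectorizer.transform(to_match)

sims = cosine_similarity(b_mat, a_mat)  # shape: (len(to_match), len(ref))
best = sims.argmax(axis=1)              # index of the closest ref name per to_match row
for i, j in enumerate(best):
    print(to_match[i], "->", ref[j], round(float(sims[i, j]), 3))
```

With matrices built this way, `a_mat.shape[1]` and `b_mat.shape[1]` are equal by construction, so either one can be parallelized and the other broadcast.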
Thanks a lot for your answer. I implemented your comments but I still get the error: Incompatible dimension for X and Y matrices: X.shape[1] == 56710 while Y.shape[1] == 2418
– MaxG
Mar 27 at 15:42
answered Mar 4 at 9:29
nimbous