Matching Company names in Python through Cosine Similarity, TF-IDF and pyspark

I'm trying to match company names from two lists, in order to check whether a company in list A is also listed in list B. As company names are written in all kinds of different forms, I leaned toward matching them with cosine similarity.
For this, I followed the note on this blog by Ran Tavory: Link Here

Here is the general outline:

Calculate TF-IDF matrices on the driver.

Parallelize matrix A; Broadcast matrix B

Each worker now flatMaps its chunk of work by multiplying its chunk of matrix A with the entire matrix B. So if a worker operates on A[0:99] then it would multiply these hundred rows and return the result of, say, A[13] matches a name found in B[21]. Multiplication is done using numpy.

The driver would collect back all the results from the different workers and match the indices (A[13] and B[21]) to the actual names in the original dataset, and we're done!
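
The outline above can be sketched without Spark or scipy. This is only a dependency-free illustration, not the blog's implementation: the name lists are made up, `tfidf_vectors` and `best_matches` are hypothetical helpers, and a plain loop over chunks stands in for the workers. Both lists are weighted with one shared vocabulary, so every vector lives in the same feature space.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary, idf):
    """Turn each name into a sparse {term: weight} dict, L2-normalised."""
    vectors = []
    for doc in docs:
        counts = Counter(t for t in doc.lower().split() if t in vocabulary)
        vec = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def best_matches(chunk, targets, start_index, threshold=0.8):
    """Yield (source_index, target_index, similarity) for one chunk of A."""
    for i, src in enumerate(chunk):
        best_j, best_sim = -1, 0.0
        for j, tgt in enumerate(targets):
            # Dot product of two unit vectors is their cosine similarity.
            sim = sum(w * tgt.get(t, 0.0) for t, w in src.items())
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_sim >= threshold:
            yield (start_index + i, best_j, best_sim)


# Made-up example data (assumption, not from the question).
names_a = ["Acme Corp", "Globex Inc", "Initech"]
names_b = ["Initech", "Acme Corp", "Umbrella"]

# One vocabulary and one IDF table shared by both lists.
all_docs = names_a + names_b
vocabulary = {t for d in all_docs for t in d.lower().split()}
doc_freq = Counter(t for d in all_docs for t in set(d.lower().split()))
idf = {t: math.log((1 + len(all_docs)) / (1 + doc_freq[t])) + 1 for t in vocabulary}

vec_a = tfidf_vectors(names_a, vocabulary, idf)
vec_b = tfidf_vectors(names_b, vocabulary, idf)

# Each chunk of A is compared against all of B, as a worker would do.
rows_per_chunk = 2
matches = []
for start in range(0, len(vec_a), rows_per_chunk):
    chunk = vec_a[start:start + rows_per_chunk]
    matches.extend(best_matches(chunk, vec_b, start))
```

Passing `start` along with each chunk is what lets a chunk-local row number be turned back into a row index of the full matrix A.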

I was able to run the exact code described in the note, but one part of it seems odd:
b_mat_dist = broadcast_matrix(a_mat)

When I both broadcast and parallelize a_mat, I get the expected result of a perfect match for every company name (as we're looking in the same source).

When I try broadcasting b_mat instead, with b_mat_dist = broadcast_matrix(b_mat), I get the following error: Incompatible dimension for X and Y matrices: X.shape[1] == 56710 while Y.shape[1] == 2418

Any help would be greatly appreciated!
Thanks in advance!

Here is my code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from pyspark.sql import SQLContext, SparkSession
from pyspark import SparkContext
from scipy.sparse import csr_matrix

vectorizer = TfidfVectorizer()

# Stop a Spark context left over from a previous run before creating a new one
if 'sc' in locals():
    sc.stop()

sc = SparkContext("local", "Simple App")

# None means "do not truncate column contents" (older pandas versions accepted -1)
pd.set_option('display.max_colwidth', None)

RefB = pd.read_excel('Ref.xlsx')
ToMatchB = pd.read_excel('ToMatch.xlsx')

Ref = RefB['CLT_company_name']
ToMatch = ToMatchB['Name1']

a_mat = vectorizer.fit_transform(Ref)
b_mat = vectorizer.fit_transform(ToMatch)


def find_matches_in_submatrix(sources, targets, inputs_start_index,
                              threshold=.8):
    cosimilarities = cosine_similarity(sources, targets)
    for i, cosimilarity in enumerate(cosimilarities):
        cosimilarity = cosimilarity.flatten()
        # Find the best match by using argsort()[-1]
        target_index = cosimilarity.argsort()[-1]
        source_index = inputs_start_index + i
        similarity = cosimilarity[target_index]
        if cosimilarity[target_index] >= threshold:
            yield (source_index, target_index, similarity)


def broadcast_matrix(mat):
    # Broadcast the three CSR arrays, then rebuild the sparse matrix from them
    bcast = sc.broadcast((mat.data, mat.indices, mat.indptr))
    (data, indices, indptr) = bcast.value
    bcast_mat = csr_matrix((data, indices, indptr), shape=mat.shape)
    return bcast_mat


def parallelize_matrix(scipy_mat, rows_per_chunk=100):
    [rows, cols] = scipy_mat.shape
    i = 0
    submatrices = []
    while i < rows:
        current_chunk_size = min(rows_per_chunk, rows - i)
        submat = scipy_mat[i:i + current_chunk_size]
        # Each element: (start row, CSR arrays of the chunk, shape of the chunk)
        submatrices.append((i, (submat.data, submat.indices,
                                submat.indptr),
                            (current_chunk_size, cols)))
        i += current_chunk_size
    return sc.parallelize(submatrices)


a_mat_para = parallelize_matrix(a_mat, rows_per_chunk=100)
b_mat_dist = broadcast_matrix(b_mat)

results = a_mat_para.flatMap(
        lambda submatrix:
        find_matches_in_submatrix(csr_matrix(submatrix[1],
                                             shape=submatrix[2]),
                                  b_mat_dist,
                                  submatrix[0]))
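
The last step of the outline, mapping the matched indices back to names, is not shown in the code above. A minimal sketch of that step follows, assuming the tuples come from something like `results.collect()` on the driver; `attach_names` and the sample data are hypothetical and only illustrate the index lookup.

```python
def attach_names(matches, source_names, target_names):
    """Replace (source_index, target_index) pairs with the names they point to."""
    return [
        (source_names[src], target_names[tgt], round(float(sim), 3))
        for src, tgt, sim in matches
    ]


# Hypothetical stand-in for results.collect():
# (row index in A, row index in B, cosine similarity)
collected = [(0, 1, 0.93), (2, 0, 1.0)]

# Made-up names; in the question these would be list(Ref) and list(ToMatch).
ref_names = ["Acme Corp", "Globex Inc", "Initech"]
to_match_names = ["Initech", "ACME Corporation", "Umbrella"]

named = attach_names(collected, ref_names, to_match_names)
```

The lookup only works if the name lists are in the same row order as the matrices that were vectorised, so the lists should be taken from the same Series without re-sorting or dropping rows in between.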

asked Jan 3 at 12:04

MaxG


python string-matching cosine-similarity

1 Answer

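Try equalising the vocabulary used for both TfidfVectorizer fits:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vect = CountVectorizer()
# pd.concat stacks the two Series; Ref + ToMatch would join the strings element-wise instead
vocabulary = vect.fit(pd.concat([Ref, ToMatch])).vocabulary_
vectorizer = TfidfVectorizer(vocabulary=vocabulary)

Also, based on what you are aiming to do:

a_mat = vectorizer.fit_transform(ToMatch)
b_mat = vectorizer.fit_transform(Ref)

looks like a better option to me.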
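
A minimal sketch of the shared-vocabulary idea, assuming two small made-up lists in place of the real Excel columns. It is a variation on the suggestion above rather than the same code: a single TfidfVectorizer is fitted once on both lists and each list is then passed through transform, so both matrices have the same number of columns and the same IDF weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up example data (assumption, not from the question).
ref = ["Acme Corp", "Globex Inc", "Initech"]
to_match = ["Initech", "Acme Corp Ltd"]

# Fit once on both lists so the vocabulary and IDF weights are shared.
vectorizer = TfidfVectorizer()
vectorizer.fit(ref + to_match)  # plain Python lists, so + concatenates them

# transform (not fit_transform) keeps the fitted vocabulary unchanged.
a_mat = vectorizer.transform(to_match)
b_mat = vectorizer.transform(ref)

# Rows are names in to_match, columns are names in ref.
similarities = cosine_similarity(a_mat, b_mat)
best = similarities.argmax(axis=1)
```

Checking that `a_mat.shape[1] == b_mat.shape[1]` before broadcasting is a cheap way to confirm that both matrices were built from the same fitted vectorizer.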

answered Mar 4 at 9:29

nimbous

Thanks a lot for your answer. I implemented your comments but I still get the error Incompatible dimension for X and Y matrices: X.shape[1] == 56710 while Y.shape[1] == 2418

– MaxG
Mar 27 at 15:42
