Matching Company names in Python through Cosine Similarity, TF-IDF and pyspark



























I'm trying to match company names from two lists, in order to check whether a company in list A is also listed in list B. As company names are written in all kinds of different forms, I leaned toward matching them with cosine similarity.
For this, I followed a blog post by Ran Tavory: Link Here



Here is the general outline:





  1. Calculate TF-IDF matrices on the driver.

  2. Parallelize matrix A; broadcast matrix B.

  3. Each worker now flatMaps its chunk of work by multiplying its chunk of matrix A with the entire matrix B. So if a worker operates on A[0:99], it multiplies these hundred rows and returns results such as "A[13] matches a name found in B[21]". Multiplication is done using numpy.

  4. The driver collects the results from the different workers and matches the indices (A[13] and B[21]) to the actual names in the original dataset, and we’re done!
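
Step 3 of the outline can be tried locally, without Spark. The following is a minimal sketch and not the code from the blog post: it uses small dense numpy arrays with made-up values instead of TF-IDF matrices, and `best_matches`, `l2_normalise` and the toy data are names invented for illustration. It assumes both matrices have the same columns and L2-normalised rows, so that a dot product of two rows equals their cosine similarity.

```python
import numpy as np


def l2_normalise(m):
    """Scale every row of m to unit length."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def best_matches(a, b, rows_per_chunk=2, threshold=0.8):
    """Yield (a_index, b_index, similarity) for the best match of each row of a.

    Rows of a and b are assumed to be L2-normalised and to share the same
    columns, so chunk @ b.T holds cosine similarities.
    """
    for start in range(0, a.shape[0], rows_per_chunk):
        chunk = a[start:start + rows_per_chunk]  # the piece one worker would get
        sims = chunk @ b.T                       # shape: (rows in chunk, rows in b)
        best = sims.argmax(axis=1)               # best-matching row of b per chunk row
        for i, j in enumerate(best):
            if sims[i, j] >= threshold:
                # start + i maps the chunk-local row back to its index in a
                yield (start + i, int(j), float(sims[i, j]))


# Toy data (made up): 3 rows in A, 2 rows in B, 3 shared feature columns.
a = l2_normalise(np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 1.0],
                           [1.0, 1.0, 0.0]]))
b = l2_normalise(np.array([[0.0, 2.0, 2.0],
                           [3.0, 0.0, 0.0]]))

matches = list(best_matches(a, b))
for source_index, target_index, similarity in matches:
    print(source_index, target_index, round(similarity, 3))
```

Rows 0 and 1 of A find a match above the threshold; row 2 has a best similarity of about 0.71 and is dropped. The offset `start + i` plays the same role as `inputs_start_index` in a chunked Spark version.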




I was able to run the exact code from the post, but one part of it seems odd:
b_mat_dist = broadcast_matrix(a_mat)



When I both broadcast and parallelize a_mat, I get the expected result of a perfect match for every company name (as we're comparing the source with itself).



When I try broadcasting b_mat instead (b_mat_dist = broadcast_matrix(b_mat)), I get the following error: Incompatible dimension for X and Y matrices: X.shape[1] == 56710 while Y.shape[1] == 2418
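
The two numbers in the message are the column counts of the two matrices. A minimal sketch with made-up names (not the data from Ref.xlsx or ToMatch.xlsx) shows how two fit_transform calls on the same TfidfVectorizer can produce different column counts, since each call learns a new vocabulary from the list it is given:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the two name lists.
ref = ["acme corp", "globex corporation", "initech llc"]
to_match = ["acme", "initech"]

vectorizer = TfidfVectorizer()
a_mat = vectorizer.fit_transform(ref)       # vocabulary learned from ref
b_mat = vectorizer.fit_transform(to_match)  # vocabulary re-learned from to_match

print(a_mat.shape)  # (3, 6): one column per distinct token in ref
print(b_mat.shape)  # (2, 2): one column per distinct token in to_match
```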



Any help would be greatly appreciated!
Thanks in advance!



Here is my code:



import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from pyspark.sql import SQLContext, SparkSession
from pyspark import SparkContext
from scipy.sparse import csr_matrix
vectorizer = TfidfVectorizer()

if 'sc' in locals():
    sc.stop()

sc = SparkContext("local", "Simple App")

pd.set_option('display.max_colwidth', -1)
RefB = pd.read_excel('Ref.xlsx')
ToMatchB = pd.read_excel('ToMatch.xlsx')

Ref = RefB['CLT_company_name']
ToMatch = ToMatchB['Name1']

a_mat = vectorizer.fit_transform(Ref)
b_mat = vectorizer.fit_transform(ToMatch)

def find_matches_in_submatrix(sources, targets, inputs_start_index,
                              threshold=.8):
    cosimilarities = cosine_similarity(sources, targets)
    for i, cosimilarity in enumerate(cosimilarities):
        cosimilarity = cosimilarity.flatten()
        # Find the best match by using argsort()[-1]
        target_index = cosimilarity.argsort()[-1]
        source_index = inputs_start_index + i
        similarity = cosimilarity[target_index]
        if cosimilarity[target_index] >= threshold:
            yield (source_index, target_index, similarity)

def broadcast_matrix(mat):
    bcast = sc.broadcast((mat.data, mat.indices, mat.indptr))
    (data, indices, indptr) = bcast.value
    bcast_mat = csr_matrix((data, indices, indptr), shape=mat.shape)
    return bcast_mat

def parallelize_matrix(scipy_mat, rows_per_chunk=100):
    [rows, cols] = scipy_mat.shape
    i = 0
    submatrices = []
    while i < rows:
        current_chunk_size = min(rows_per_chunk, rows - i)
        submat = scipy_mat[i:i + current_chunk_size]
        submatrices.append((i, (submat.data, submat.indices,
                                submat.indptr),
                            (current_chunk_size, cols)))
        i += current_chunk_size
    return sc.parallelize(submatrices)

a_mat_para = parallelize_matrix(a_mat, rows_per_chunk=100)
b_mat_dist = broadcast_matrix(b_mat)
results = a_mat_para.flatMap(
    lambda submatrix:
    find_matches_in_submatrix(csr_matrix(submatrix[1],
                                         shape=submatrix[2]),
                              b_mat_dist,
                              submatrix[0]))






















































      python string-matching cosine-similarity
















      asked Jan 3 at 12:04









      MaxG

























          1 Answer






























































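          Try using a common vocabulary for both TF-IDF matrices by passing it to the TfidfVectorizer:



          from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

          vect = CountVectorizer()
          # Ref and ToMatch are pandas Series: concatenate them, since `+` would add them element-wise
          vocabulary = vect.fit(pd.concat([Ref, ToMatch])).vocabulary_
          vectorizer = TfidfVectorizer(vocabulary=vocabulary)


          Also, based on what you are aiming to do,



          a_mat = vectorizer.fit_transform(ToMatch)
          b_mat = vectorizer.fit_transform(Ref)


          looks like a better option to me.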
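
          A related variant is to fit a single vectorizer once on both lists and then only call transform, so that both matrices share the same columns and the same IDF weights. The following is a minimal sketch under stated assumptions, not a tested fix for the data in the question: the two lists are made-up plain Python lists, and `best` is a name chosen here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up stand-ins for the two name lists (plain Python lists).
ref = ["acme corp", "globex corporation", "initech llc"]
to_match = ["acme corp ltd", "initech"]

vectorizer = TfidfVectorizer()
vectorizer.fit(ref + to_match)           # learn one vocabulary from both lists

a_mat = vectorizer.transform(to_match)   # transform only: no re-fitting
b_mat = vectorizer.transform(ref)

print(a_mat.shape, b_mat.shape)          # same number of columns in both

sims = cosine_similarity(a_mat, b_mat)   # shape: (len(to_match), len(ref))
best = sims.argmax(axis=1)               # index into ref of the best match per name
for name, j in zip(to_match, best):
    print(name, "->", ref[j], round(float(sims[to_match.index(name), j]), 3))
```

          With matching column counts, cosine_similarity no longer has a reason to raise the "Incompatible dimension" error for these two matrices.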

















          answered Mar 4 at 9:29









          nimbous














          • Thanks a lot for your answer. I implemented your suggestions, but I still get the error: Incompatible dimension for X and Y matrices: X.shape[1] == 56710 while Y.shape[1] == 2418

            – MaxG
            Mar 27 at 15:42

















































