Column string conversion based on unique values
Is there a way to replace string values in columns of a 2D array with ordered numbers in Python?



For example, say you have a 2D array:



a = np.array([['A',0,'C'],['A',0.3,'B'],['D',1,'D']])
a
Out[57]:
array([['A', '0', 'C'],
       ['A', '0.3', 'B'],
       ['D', '1', 'D']], dtype='<U3')


If I wanted to replace the string values 'A', 'A', 'D' in the first column with the numbers 0, 0, 1 and 'C', 'B', 'D' with 0, 1, 2, is there an efficient way to do so?



It may be helpful to know:




  • Replacement numbers in different columns are independent of each other, i.e. each column whose strings have been replaced with numbers will start at 0 and increase up to the number of unique values in that column.

  • The above is a test case and the real data is a lot bigger with more columns of strings.
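For reference, the per-column mapping described above can be sketched with plain NumPy using `np.unique`'s `return_inverse`, which labels each value by its index among that column's unique values (note: these labels follow the *sorted* order of the unique values, not their order of first appearance):

```python
import numpy as np

a = np.array([['A', 0, 'C'], ['A', 0.3, 'B'], ['D', 1, 'D']])

# return_inverse gives, for every element, its index within the sorted
# unique values of that column -- i.e. integer labels 0..n_unique-1.
codes = np.empty(a.shape, dtype=int)
for j in range(a.shape[1]):
    _, codes[:, j] = np.unique(a[:, j], return_inverse=True)

print(codes)
# First column 'A','A','D' -> 0,0,1; third column 'C','B','D' -> 1,0,2
```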


Here is an example method I quickly came up with to solve this problem:



for j in range(a.shape[1]):
    b = list(set(a[:, j]))
    length = len(b)
    for i in range(len(b)):
        indices = np.where(a[:, j] == b[i])[0]
        print(indices)
        a[indices, j] = i


However, this seems like an inefficient way to achieve this; it also cannot distinguish between float and string values in columns, and defaults to replacing values with strings of numbers:



a
Out[91]:
array([['1.0', '0.0', '2.0'],
       ['1.0', '1.0', '0.0'],
       ['0.0', '2.0', '1.0']], dtype='<U3')


Any help on this matter would be greatly appreciated!

python arrays string numpy 2d

edited Nov 20 '18 at 12:51 by Mohit Motwani
asked Nov 20 '18 at 12:33 by user8188120

2 Answers
It seems that you are trying to do label encoding.



          I can think of two options: pandas.factorize and sklearn.preprocessing.LabelEncoder.



          Using LabelEncoder



from sklearn.preprocessing import LabelEncoder

b = np.zeros_like(a, dtype=int)
for column in range(a.shape[1]):
    b[:, column] = LabelEncoder().fit_transform(a[:, column])


          Then b will be:



array([[0, 0, 1],
       [0, 1, 0],
       [1, 2, 2]])


          If you want to be able to go back to the original values, you will need to save the encoders. You can do it this way:



from sklearn.preprocessing import LabelEncoder

encoders = {}
b = np.zeros_like(a, dtype=int)
for column in range(a.shape[1]):
    encoders[column] = LabelEncoder()
    b[:, column] = encoders[column].fit_transform(a[:, column])


          Now encoders[0].classes_ will have:



          array(['A', 'D'], dtype='<U3')


          Which means that 'A' was mapped to 0 and 'D' to 1.
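As a sketch of the round trip (rebuilding the `encoders` and `b` from the snippet above), `inverse_transform` maps the integer codes back to the original strings:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

a = np.array([['A', 0, 'C'], ['A', 0.3, 'B'], ['D', 1, 'D']])

encoders = {}
b = np.zeros_like(a, dtype=int)
for column in range(a.shape[1]):
    encoders[column] = LabelEncoder()
    b[:, column] = encoders[column].fit_transform(a[:, column])

# inverse_transform maps the codes back to the original strings.
print(encoders[0].inverse_transform(b[:, 0]))  # ['A' 'A' 'D']
```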



Finally, if you do the encoding overwriting a instead of using a new matrix b, you will obtain the integers as strings ('1' instead of 1); you can solve this with astype(int):



encoders = {}
for column in range(a.shape[1]):
    encoders[column] = LabelEncoder()
    a[:, column] = encoders[column].fit_transform(a[:, column])

# At this point, a will have strings instead of ints because a had type str
# array([['0', '0', '1'],
#        ['0', '1', '0'],
#        ['1', '2', '2']], dtype='<U3')

a = a.astype(int)

# Now `a` is of type int
# array([[0, 0, 1],
#        [0, 1, 0],
#        [1, 2, 2]])


          Using pd.factorize



          factorize returns the encoded column and the encoding mapping, so if you don't care about it you can avoid saving it:



for column in range(a.shape[1]):
    a[:, column], _ = pd.factorize(a[:, column])  # Drop mapping

a = a.astype(int)  # same as above, a is of type str
# a is
# array([[0, 0, 1],
#        [0, 1, 0],
#        [1, 2, 2]])


          If you want to keep the encoding mappings:



mappings = []
for column in range(a.shape[1]):
    a[:, column], mapping = pd.factorize(a[:, column])
    mappings.append(mapping)

a = a.astype(int)


          Now mappings[0] will have the following data:



          array(['A', 'D'], dtype=object)


Which has the same semantics as encoders[0].classes_ in sklearn's LabelEncoder solution.
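As a small sketch, a factorize mapping can be inverted by plain indexing, since `mapping[i]` is the original value for code `i`:

```python
import numpy as np
import pandas as pd

# factorize returns the codes and the uniques, in order of appearance.
codes, mapping = pd.factorize(np.array(['A', 'A', 'D']))
print(codes)           # [0 0 1]
print(mapping[codes])  # indexing recovers the original values
```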






answered Nov 20 '18 at 13:51, edited Nov 20 '18 at 13:57 by Julian Peller

          • Many thanks! I've gone for your label encoder method in the end. Thank you for taking the time to reply

            – user8188120
            Nov 20 '18 at 14:12













          • Glad to help :)

            – Julian Peller
            Nov 20 '18 at 14:22











  • If I could expand on this question: is there a way to use LabelEncoder in conjunction with a predefined array of unique values? I ask because I would like to perform this encoding on several huge datasets which have some string values in common and some not, so the above method would give the same string a different value in different datasets, whereas I need them encoded the same across all datasets I plan to load in.

            – user8188120
            Nov 20 '18 at 14:45






  • Is it possible to have all the possible values for a column together? If so, you can do as follows: first, fit the LabelEncoder on all the values for that column with the method fit: enc = LabelEncoder(); enc.fit(all_possible_values). This will create the unique mapping you want in enc. After that, you can just transform the columns of each specific dataframe with column = enc.transform(column), giving the cross-dataframe consistency you want.

            – Julian Peller
            Nov 20 '18 at 14:55













          • It's certainly possible to get all the unique values so this should work I think. Thanks again, I'll give it a try! Is there a sensible way to save an encoding once it's been fit to data for using later? Just so the process doesn't have to be done in one long script run

            – user8188120
            Nov 20 '18 at 15:01
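A minimal sketch of the workflow from the comments: fit one LabelEncoder on the combined vocabulary, transform each dataset with it, and persist the fitted encoder with the standard-library pickle so a later run can reuse it (all_possible_values here is a hypothetical combined vocabulary):

```python
import pickle
from sklearn.preprocessing import LabelEncoder

all_possible_values = ['A', 'B', 'C', 'D']  # hypothetical combined vocabulary

enc = LabelEncoder()
enc.fit(all_possible_values)

# The same string now gets the same code in every dataset.
print(enc.transform(['D', 'A']))  # [3 0]

# Persist the fitted encoder so a later script run can reuse it.
with open('label_encoder.pkl', 'wb') as f:
    pickle.dump(enc, f)

with open('label_encoder.pkl', 'rb') as f:
    enc_restored = pickle.load(f)

print(enc_restored.transform(['B']))  # [1]
```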



































You can do what you want in an efficient way with just NumPy.



          Basically, you iterate over the values in each column of your input while keeping track of the observed letters in a set or dict. This is similar to what you already had, but slightly more efficient (you avoid the call to np.where for one thing).



          Here's a function charToIx that will do what you want:



import numpy as np
from collections import defaultdict
from string import ascii_letters

class Ix:
    def __init__(self):
        self._val = 0

    def __call__(self):
        val = self._val
        self._val += 1
        return val

def charToIx(arr, dtype=None, out=None):
    if dtype is None:
        dtype = arr.dtype

    if out is None:
        out = np.zeros(arr.shape, dtype=dtype)

    for incol, outcol in zip(arr.T, out.T):
        ix = Ix()
        cixDict = defaultdict(lambda: ix())
        for i, x in enumerate(incol):
            if x in cixDict or x in ascii_letters:
                outcol[i] = cixDict[x]
            else:
                outcol[i] = x

    return out


          You specify the type of the output array when you call the function. So the output of:



          a = np.array([['A',0,'C'],['A',0.3,'B'],['D',1,'D']])
          print(charToIx(a, dtype=float))


          will be a float array:



array([[0. , 0. , 0. ],
       [0. , 0.3, 1. ],
       [1. , 1. , 2. ]])





          share|improve this answer

























            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53393087%2fcolumn-string-conversion-based-on-unique-values%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            2 Answers
            2






            active

            oldest

            votes








            2 Answers
            2






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            2














            It seems that you are trying to do a label encoding.



            I can think of two options: pandas.factorize and sklearn.preprocessing.LabelEncoder.



            Using LabelEncoder



            from sklearn.preprocessing import LabelEncoder

            b = np.zeros_like(a, np.int)
            for column in range(a.shape[1]):
            b[:, column] = LabelEncoder().fit_transform(a[:, column])


            Then b will be:



            array([[0, 0, 1],
            [0, 1, 0],
            [1, 2, 2]])


            If you want to be able to go back to the original values, you will need to save the encoders. You can do it this way:



            from sklearn.preprocessing import LabelEncoder

            encoders = {}
            b = np.zeros_like(a, np.int)
            for column in range(a.shape[1]):
            encoders[column] = LabelEncoder()
            b[:, column] = encoders[column].fit_transform(a[:, column])


            Now encoders[0].classes_ will have:



            array(['A', 'D'], dtype='<U3')


            Which means that 'A' was mapped to 0 and 'D' to 1.



            Finally, if you do the encoding overriding a instead of using a new matrix c, you will obtain integers as strings ("1" instead of 1), you can solve this with astype(int):



            encoders = {}
            for column in range(a.shape[1]):
            encoders[column] = LabelEncoder()
            a[:, column] = encoders[column].fit_transform(a[:, column])

            # At this point, a will have strings instead of ints because a had type str
            # array([['0', '0', '1'],
            # ['0', '1', '0'],
            # ['1', '2', '2']], dtype='<U3')

            a = a.astype(int)

            # Now `a` is of type int
            # array([[0, 0, 1],
            # [0, 1, 0],
            # [1, 2, 2]])


            Using pd.factorize



            factorize returns the encoded column and the encoding mapping, so if you don't care about it you can avoid saving it:



            for column in range(a.shape[1]):
            a[:, column], _ = pd.factorize(a[:, column]) # Drop mapping

            a = a.astype(int) # same as above, it's of type str
            # a is
            # array([[0, 0, 1],
            # [0, 1, 0],
            # [1, 2, 2]])


            If you want to keep the encoding mappings:



            mappings = 
            for column in range(a.shape[1]):
            a[:, column], mapping = pd.factorize(a[:, column])
            mappings.append(mapping)

            a = a.astype(int)


            Now mappings[0] will have the following data:



            array(['A', 'D'], dtype=object)


            Which has the same semantics than encoders[0].classes_ of sklearn's LabelEncoder solution.






            share|improve this answer


























            • Many thanks! I've gone for your label encoder method in the end. Thank you for taking the time to reply

              – user8188120
              Nov 20 '18 at 14:12













            • Glad to help :)

              – Julian Peller
              Nov 20 '18 at 14:22











            • If I could expand on this question, is there a way to use label_encoder in conjunction with a predefined array of unique values? The reason I ask is because I would like to perform this encoding on several huge datasets which have some common string values and some uncommon and therefore the above method would give the same string a different value in different datasets whereas I need them to be encoded the same across all datasets I plan to load in.

              – user8188120
              Nov 20 '18 at 14:45






            • 1





              Is it possible to have all the possible values for a column together? If it is, you can do as follows: First, fit the LabelEncoder on all the values for that column with the method fit: enc = LabelEncoder(); enc.fit(all_possible_values). This will create the unique mapping you want in enc. After that, you can just transform the columns of each specific dataframe with column = enc.transform(column), having the cross-dataframe consistency you want.

              – Julian Peller
              Nov 20 '18 at 14:55













            • It's certainly possible to get all the unique values so this should work I think. Thanks again, I'll give it a try! Is there a sensible way to save an encoding once it's been fit to data for using later? Just so the process doesn't have to be done in one long script run

              – user8188120
              Nov 20 '18 at 15:01


















            2














            It seems that you are trying to do a label encoding.



            I can think of two options: pandas.factorize and sklearn.preprocessing.LabelEncoder.



            Using LabelEncoder



            from sklearn.preprocessing import LabelEncoder

            b = np.zeros_like(a, np.int)
            for column in range(a.shape[1]):
            b[:, column] = LabelEncoder().fit_transform(a[:, column])


            Then b will be:



            array([[0, 0, 1],
            [0, 1, 0],
            [1, 2, 2]])


            If you want to be able to go back to the original values, you will need to save the encoders. You can do it this way:



            from sklearn.preprocessing import LabelEncoder

            encoders = {}
            b = np.zeros_like(a, np.int)
            for column in range(a.shape[1]):
            encoders[column] = LabelEncoder()
            b[:, column] = encoders[column].fit_transform(a[:, column])


            Now encoders[0].classes_ will have:



            array(['A', 'D'], dtype='<U3')


            Which means that 'A' was mapped to 0 and 'D' to 1.



            Finally, if you do the encoding overriding a instead of using a new matrix c, you will obtain integers as strings ("1" instead of 1), you can solve this with astype(int):



            encoders = {}
            for column in range(a.shape[1]):
            encoders[column] = LabelEncoder()
            a[:, column] = encoders[column].fit_transform(a[:, column])

            # At this point, a will have strings instead of ints because a had type str
            # array([['0', '0', '1'],
            # ['0', '1', '0'],
            # ['1', '2', '2']], dtype='<U3')

            a = a.astype(int)

            # Now `a` is of type int
            # array([[0, 0, 1],
            # [0, 1, 0],
            # [1, 2, 2]])


            Using pd.factorize



            factorize returns the encoded column and the encoding mapping, so if you don't care about it you can avoid saving it:



            for column in range(a.shape[1]):
            a[:, column], _ = pd.factorize(a[:, column]) # Drop mapping

            a = a.astype(int) # same as above, it's of type str
            # a is
            # array([[0, 0, 1],
            # [0, 1, 0],
            # [1, 2, 2]])


            If you want to keep the encoding mappings:



            mappings = 
            for column in range(a.shape[1]):
            a[:, column], mapping = pd.factorize(a[:, column])
            mappings.append(mapping)

            a = a.astype(int)


            Now mappings[0] will have the following data:



            array(['A', 'D'], dtype=object)


            Which has the same semantics than encoders[0].classes_ of sklearn's LabelEncoder solution.






            share|improve this answer


























            • Many thanks! I've gone for your label encoder method in the end. Thank you for taking the time to reply

              – user8188120
              Nov 20 '18 at 14:12













            • Glad to help :)

              – Julian Peller
              Nov 20 '18 at 14:22











            • If I could expand on this question, is there a way to use label_encoder in conjunction with a predefined array of unique values? The reason I ask is because I would like to perform this encoding on several huge datasets which have some common string values and some uncommon and therefore the above method would give the same string a different value in different datasets whereas I need them to be encoded the same across all datasets I plan to load in.

              – user8188120
              Nov 20 '18 at 14:45






            • 1





              Is it possible to have all the possible values for a column together? If it is, you can do as follows: First, fit the LabelEncoder on all the values for that column with the method fit: enc = LabelEncoder(); enc.fit(all_possible_values). This will create the unique mapping you want in enc. After that, you can just transform the columns of each specific dataframe with column = enc.transform(column), having the cross-dataframe consistency you want.

              – Julian Peller
              Nov 20 '18 at 14:55













            • It's certainly possible to get all the unique values so this should work I think. Thanks again, I'll give it a try! Is there a sensible way to save an encoding once it's been fit to data for using later? Just so the process doesn't have to be done in one long script run

              – user8188120
              Nov 20 '18 at 15:01
















            2












            2








            2







            It seems that you are trying to do a label encoding.



            I can think of two options: pandas.factorize and sklearn.preprocessing.LabelEncoder.



            Using LabelEncoder



            from sklearn.preprocessing import LabelEncoder

            b = np.zeros_like(a, np.int)
            for column in range(a.shape[1]):
            b[:, column] = LabelEncoder().fit_transform(a[:, column])


            Then b will be:



            array([[0, 0, 1],
            [0, 1, 0],
            [1, 2, 2]])


            If you want to be able to go back to the original values, you will need to save the encoders. You can do it this way:



            from sklearn.preprocessing import LabelEncoder

            encoders = {}
            b = np.zeros_like(a, np.int)
            for column in range(a.shape[1]):
            encoders[column] = LabelEncoder()
            b[:, column] = encoders[column].fit_transform(a[:, column])


            Now encoders[0].classes_ will have:



            array(['A', 'D'], dtype='<U3')


            Which means that 'A' was mapped to 0 and 'D' to 1.



            Finally, if you do the encoding overriding a instead of using a new matrix c, you will obtain integers as strings ("1" instead of 1), you can solve this with astype(int):



            encoders = {}
            for column in range(a.shape[1]):
            encoders[column] = LabelEncoder()
            a[:, column] = encoders[column].fit_transform(a[:, column])

            # At this point, a will have strings instead of ints because a had type str
            # array([['0', '0', '1'],
            # ['0', '1', '0'],
            # ['1', '2', '2']], dtype='<U3')

            a = a.astype(int)

            # Now `a` is of type int
            # array([[0, 0, 1],
            # [0, 1, 0],
            # [1, 2, 2]])


            Using pd.factorize



            factorize returns the encoded column and the encoding mapping, so if you don't care about it you can avoid saving it:



            for column in range(a.shape[1]):
            a[:, column], _ = pd.factorize(a[:, column]) # Drop mapping

            a = a.astype(int) # same as above, it's of type str
            # a is
            # array([[0, 0, 1],
            # [0, 1, 0],
            # [1, 2, 2]])


            If you want to keep the encoding mappings:



            mappings = 
            for column in range(a.shape[1]):
            a[:, column], mapping = pd.factorize(a[:, column])
            mappings.append(mapping)

            a = a.astype(int)


            Now mappings[0] will have the following data:



            array(['A', 'D'], dtype=object)


            Which has the same semantics than encoders[0].classes_ of sklearn's LabelEncoder solution.






            share|improve this answer















            It seems that you are trying to do a label encoding.



            I can think of two options: pandas.factorize and sklearn.preprocessing.LabelEncoder.



            Using LabelEncoder



            from sklearn.preprocessing import LabelEncoder

            b = np.zeros_like(a, np.int)
            for column in range(a.shape[1]):
            b[:, column] = LabelEncoder().fit_transform(a[:, column])


            Then b will be:



            array([[0, 0, 1],
            [0, 1, 0],
            [1, 2, 2]])


            If you want to be able to go back to the original values, you will need to save the encoders. You can do it this way:



            from sklearn.preprocessing import LabelEncoder

            encoders = {}
            b = np.zeros_like(a, np.int)
            for column in range(a.shape[1]):
            encoders[column] = LabelEncoder()
            b[:, column] = encoders[column].fit_transform(a[:, column])


            Now encoders[0].classes_ will have:



            array(['A', 'D'], dtype='<U3')


            Which means that 'A' was mapped to 0 and 'D' to 1.



            Finally, if you do the encoding overriding a instead of using a new matrix c, you will obtain integers as strings ("1" instead of 1), you can solve this with astype(int):



            encoders = {}
            for column in range(a.shape[1]):
            encoders[column] = LabelEncoder()
            a[:, column] = encoders[column].fit_transform(a[:, column])

            # At this point, a will have strings instead of ints because a had type str
            # array([['0', '0', '1'],
            # ['0', '1', '0'],
            # ['1', '2', '2']], dtype='<U3')

            a = a.astype(int)

            # Now `a` is of type int
            # array([[0, 0, 1],
            # [0, 1, 0],
            # [1, 2, 2]])


            Using pd.factorize



            factorize returns the encoded column and the encoding mapping, so if you don't care about it you can avoid saving it:



            for column in range(a.shape[1]):
            a[:, column], _ = pd.factorize(a[:, column]) # Drop mapping

            a = a.astype(int) # same as above, it's of type str
            # a is
            # array([[0, 0, 1],
            # [0, 1, 0],
            # [1, 2, 2]])


            If you want to keep the encoding mappings:



            mappings = 
            for column in range(a.shape[1]):
            a[:, column], mapping = pd.factorize(a[:, column])
            mappings.append(mapping)

            a = a.astype(int)


            Now mappings[0] will have the following data:



            array(['A', 'D'], dtype=object)


            Which has the same semantics than encoders[0].classes_ of sklearn's LabelEncoder solution.







            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Nov 20 '18 at 13:57

























            answered Nov 20 '18 at 13:51









            Julian PellerJulian Peller

            8941511




            8941511













            • Many thanks! I've gone for your label encoder method in the end. Thank you for taking the time to reply

              – user8188120
              Nov 20 '18 at 14:12













            • Glad to help :)

              – Julian Peller
              Nov 20 '18 at 14:22











            • If I could expand on this question, is there a way to use label_encoder in conjunction with a predefined array of unique values? The reason I ask is because I would like to perform this encoding on several huge datasets which have some common string values and some uncommon and therefore the above method would give the same string a different value in different datasets whereas I need them to be encoded the same across all datasets I plan to load in.

              – user8188120
              Nov 20 '18 at 14:45






            • 1





              Is it possible to have all the possible values for a column together? If it is, you can do as follows: First, fit the LabelEncoder on all the values for that column with the method fit: enc = LabelEncoder(); enc.fit(all_possible_values). This will create the unique mapping you want in enc. After that, you can just transform the columns of each specific dataframe with column = enc.transform(column), having the cross-dataframe consistency you want.

              – Julian Peller
              Nov 20 '18 at 14:55













            • It's certainly possible to get all the unique values so this should work I think. Thanks again, I'll give it a try! Is there a sensible way to save an encoding once it's been fit to data for using later? Just so the process doesn't have to be done in one long script run

              – user8188120
              Nov 20 '18 at 15:01
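The fit-once / transform-everywhere approach from the comments, plus persisting the fitted encoder for a later run, might look like the following sketch (the value list here is illustrative; in practice it would be gathered from all datasets):

```python
import pickle
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit once on the union of values from all datasets, so every
# dataset maps the same string to the same integer.
all_possible_values = ['A', 'B', 'C', 'D']  # illustrative
enc = LabelEncoder()
enc.fit(all_possible_values)

# Encode a column from one particular dataset.
column = np.array(['A', 'A', 'D'])
codes = enc.transform(column)
print(codes)  # [0 0 3]  (classes_ are sorted, so 'D' -> 3)

# Save the fitted encoder for a later script run, e.g. with pickle
# (joblib.dump works the same way for sklearn objects).
blob = pickle.dumps(enc)
restored = pickle.loads(blob)
print(restored.transform(column))  # [0 0 3]
```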



























            1














            You can do what you want in an efficient way with just Numpy.



            Basically, you iterate over the values in each column of your input while keeping track of the observed letters in a set or dict. This is similar to what you already had, but slightly more efficient (you avoid the call to np.where for one thing).



            Here's a function charToIx that will do what you want:



from collections import defaultdict
from string import ascii_letters

class Ix:
    def __init__(self):
        self._val = 0

    def __call__(self):
        val = self._val
        self._val += 1
        return val

def charToIx(arr, dtype=None, out=None):
    if dtype is None:
        dtype = arr.dtype

    if out is None:
        out = np.zeros(arr.shape, dtype=dtype)

    for incol, outcol in zip(arr.T, out.T):
        ix = Ix()
        cixDict = defaultdict(lambda: ix())
        for i, x in enumerate(incol):
            if x in cixDict or x in ascii_letters:
                outcol[i] = cixDict[x]
            else:
                outcol[i] = x

    return out


            You specify the type of the output array when you call the function. So the output of:



            a = np.array([['A',0,'C'],['A',0.3,'B'],['D',1,'D']])
            print(charToIx(a, dtype=float))


            will be a float array:



array([[0. , 0. , 0. ],
       [0. , 0.3, 1. ],
       [1. , 1. , 2. ]])
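As an aside (not part of this answer's code): if codes ordered by sorted value, rather than by first appearance, are acceptable, np.unique with return_inverse=True does the per-column encoding without an inner Python loop — though unlike charToIx it also recodes the numeric column:

```python
import numpy as np

a = np.array([['A', 0, 'C'], ['A', 0.3, 'B'], ['D', 1, 'D']])

out = np.empty(a.shape, dtype=int)
for j in range(a.shape[1]):
    # return_inverse maps each element to the index of its value
    # in the sorted unique values of that column
    _, out[:, j] = np.unique(a[:, j], return_inverse=True)

print(out)  # [[0 0 1]
            #  [0 1 0]
            #  [1 2 2]]
```

Here column 2 sorts to ['B', 'C', 'D'], so 'C' becomes 1 and 'B' becomes 0, unlike the first-appearance ordering in the question.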





edited Nov 20 '18 at 14:05

answered Nov 20 '18 at 13:51

tel



















































