Column string conversion based on unique values
Is there a way to replace string values in columns of a 2D array with ordered numbers in Python?



For example, say you have a 2D array:



a = np.array([['A',0,'C'],['A',0.3,'B'],['D',1,'D']])
a
Out[57]:
array([['A', '0', 'C'],
       ['A', '0.3', 'B'],
       ['D', '1', 'D']], dtype='<U3')


If I wanted to replace the string values 'A', 'A', 'D' in the first column with the numbers 0, 0, 1 and 'C', 'B', 'D' with 0, 1, 2, is there an efficient way to do so?



It may be helpful to know:




  • Replacement numbers in different columns are independent of each other, i.e. each column whose strings have been replaced with numbers will start at 0 and increase up to the number of unique values in that column.

  • The above is a test case and the real data is a lot bigger with more columns of strings.
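For reference, the per-column mapping described above can be sketched with plain NumPy using `np.unique`'s `return_inverse`, which labels each value by its index among that column's unique values (note: these labels follow the *sorted* order of the unique values, not their order of first appearance):

```python
import numpy as np

a = np.array([['A', 0, 'C'], ['A', 0.3, 'B'], ['D', 1, 'D']])

# return_inverse gives, for every element, its index within the sorted
# unique values of that column -- i.e. integer labels 0..n_unique-1.
codes = np.empty(a.shape, dtype=int)
for j in range(a.shape[1]):
    _, codes[:, j] = np.unique(a[:, j], return_inverse=True)

print(codes)
# First column 'A','A','D' -> 0,0,1; third column 'C','B','D' -> 1,0,2
```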


Here is an example method I quickly came up with to solve this problem:



for j in range(a.shape[1]):
    b = list(set(a[:, j]))
    length = len(b)
    for i in range(len(b)):
        indices = np.where(a[:, j] == b[i])[0]
        print(indices)
        a[indices, j] = i


However, this seems like an inefficient way to achieve this; it also cannot distinguish between float and string values in columns, and defaults to replacing values with strings of numbers:



a
Out[91]:
array([['1.0', '0.0', '2.0'],
       ['1.0', '1.0', '0.0'],
       ['0.0', '2.0', '1.0']], dtype='<U3')


Any help on this matter would be greatly appreciated!

python arrays string numpy 2d

edited Nov 20 '18 at 12:51 by Mohit Motwani
asked Nov 20 '18 at 12:33 by user8188120

2 Answers
It seems that you are trying to do label encoding.



          I can think of two options: pandas.factorize and sklearn.preprocessing.LabelEncoder.



          Using LabelEncoder



from sklearn.preprocessing import LabelEncoder

b = np.zeros_like(a, dtype=int)
for column in range(a.shape[1]):
    b[:, column] = LabelEncoder().fit_transform(a[:, column])


          Then b will be:



array([[0, 0, 1],
       [0, 1, 0],
       [1, 2, 2]])


          If you want to be able to go back to the original values, you will need to save the encoders. You can do it this way:



from sklearn.preprocessing import LabelEncoder

encoders = {}
b = np.zeros_like(a, dtype=int)
for column in range(a.shape[1]):
    encoders[column] = LabelEncoder()
    b[:, column] = encoders[column].fit_transform(a[:, column])


          Now encoders[0].classes_ will have:



          array(['A', 'D'], dtype='<U3')


          Which means that 'A' was mapped to 0 and 'D' to 1.
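As a sketch of the round trip (rebuilding the `encoders` and `b` from the snippet above), `inverse_transform` maps the integer codes back to the original strings:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

a = np.array([['A', 0, 'C'], ['A', 0.3, 'B'], ['D', 1, 'D']])

encoders = {}
b = np.zeros_like(a, dtype=int)
for column in range(a.shape[1]):
    encoders[column] = LabelEncoder()
    b[:, column] = encoders[column].fit_transform(a[:, column])

# inverse_transform maps the codes back to the original strings.
print(encoders[0].inverse_transform(b[:, 0]))  # ['A' 'A' 'D']
```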



Finally, if you do the encoding overwriting a instead of using a new matrix b, you will obtain the integers as strings ('1' instead of 1); you can solve this with astype(int):



encoders = {}
for column in range(a.shape[1]):
    encoders[column] = LabelEncoder()
    a[:, column] = encoders[column].fit_transform(a[:, column])

# At this point, a will have strings instead of ints because a had type str
# array([['0', '0', '1'],
#        ['0', '1', '0'],
#        ['1', '2', '2']], dtype='<U3')

a = a.astype(int)

# Now `a` is of type int
# array([[0, 0, 1],
#        [0, 1, 0],
#        [1, 2, 2]])


          Using pd.factorize



          factorize returns the encoded column and the encoding mapping, so if you don't care about it you can avoid saving it:



for column in range(a.shape[1]):
    a[:, column], _ = pd.factorize(a[:, column])  # Drop mapping

a = a.astype(int)  # same as above, a is of type str
# a is
# array([[0, 0, 1],
#        [0, 1, 0],
#        [1, 2, 2]])


          If you want to keep the encoding mappings:



mappings = []
for column in range(a.shape[1]):
    a[:, column], mapping = pd.factorize(a[:, column])
    mappings.append(mapping)

a = a.astype(int)


          Now mappings[0] will have the following data:



          array(['A', 'D'], dtype=object)


Which has the same semantics as encoders[0].classes_ in sklearn's LabelEncoder solution.
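As a small sketch, a factorize mapping can be inverted by plain indexing, since `mapping[i]` is the original value for code `i`:

```python
import numpy as np
import pandas as pd

# factorize returns the codes and the uniques, in order of appearance.
codes, mapping = pd.factorize(np.array(['A', 'A', 'D']))
print(codes)           # [0 0 1]
print(mapping[codes])  # indexing recovers the original values
```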






answered Nov 20 '18 at 13:51, edited Nov 20 '18 at 13:57 by Julian Peller

          • Many thanks! I've gone for your label encoder method in the end. Thank you for taking the time to reply

            – user8188120
            Nov 20 '18 at 14:12













          • Glad to help :)

            – Julian Peller
            Nov 20 '18 at 14:22











  • If I could expand on this question: is there a way to use LabelEncoder in conjunction with a predefined array of unique values? I ask because I would like to perform this encoding on several huge datasets which have some string values in common and some not, so the above method would give the same string a different value in different datasets, whereas I need them encoded the same across all datasets I plan to load in.

            – user8188120
            Nov 20 '18 at 14:45






  • Is it possible to have all the possible values for a column together? If so, you can do as follows: first, fit the LabelEncoder on all the values for that column with the method fit: enc = LabelEncoder(); enc.fit(all_possible_values). This will create the unique mapping you want in enc. After that, you can just transform the columns of each specific dataframe with column = enc.transform(column), giving the cross-dataframe consistency you want.

            – Julian Peller
            Nov 20 '18 at 14:55













          • It's certainly possible to get all the unique values so this should work I think. Thanks again, I'll give it a try! Is there a sensible way to save an encoding once it's been fit to data for using later? Just so the process doesn't have to be done in one long script run

            – user8188120
            Nov 20 '18 at 15:01
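A minimal sketch of the workflow from the comments: fit one LabelEncoder on the combined vocabulary, transform each dataset with it, and persist the fitted encoder with the standard-library pickle so a later run can reuse it (all_possible_values here is a hypothetical combined vocabulary):

```python
import pickle
from sklearn.preprocessing import LabelEncoder

all_possible_values = ['A', 'B', 'C', 'D']  # hypothetical combined vocabulary

enc = LabelEncoder()
enc.fit(all_possible_values)

# The same string now gets the same code in every dataset.
print(enc.transform(['D', 'A']))  # [3 0]

# Persist the fitted encoder so a later script run can reuse it.
with open('label_encoder.pkl', 'wb') as f:
    pickle.dump(enc, f)

with open('label_encoder.pkl', 'rb') as f:
    enc_restored = pickle.load(f)

print(enc_restored.transform(['B']))  # [1]
```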



































You can do what you want in an efficient way with just NumPy.



          Basically, you iterate over the values in each column of your input while keeping track of the observed letters in a set or dict. This is similar to what you already had, but slightly more efficient (you avoid the call to np.where for one thing).



          Here's a function charToIx that will do what you want:



import numpy as np
from collections import defaultdict
from string import ascii_letters

class Ix:
    def __init__(self):
        self._val = 0

    def __call__(self):
        val = self._val
        self._val += 1
        return val

def charToIx(arr, dtype=None, out=None):
    if dtype is None:
        dtype = arr.dtype

    if out is None:
        out = np.zeros(arr.shape, dtype=dtype)

    for incol, outcol in zip(arr.T, out.T):
        ix = Ix()
        cixDict = defaultdict(lambda: ix())
        for i, x in enumerate(incol):
            if x in cixDict or x in ascii_letters:
                outcol[i] = cixDict[x]
            else:
                outcol[i] = x

    return out


          You specify the type of the output array when you call the function. So the output of:



          a = np.array([['A',0,'C'],['A',0.3,'B'],['D',1,'D']])
          print(charToIx(a, dtype=float))


          will be a float array:



array([[0. , 0. , 0. ],
       [0. , 0.3, 1. ],
       [1. , 1. , 2. ]])





          share|improve this answer

























            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53393087%2fcolumn-string-conversion-based-on-unique-values%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            2 Answers
            2






            active

            oldest

            votes








            2 Answers
            2






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            2














            It seems that you are trying to do a label encoding.



            I can think of two options: pandas.factorize and sklearn.preprocessing.LabelEncoder.



            Using LabelEncoder



            from sklearn.preprocessing import LabelEncoder

            b = np.zeros_like(a, np.int)
            for column in range(a.shape[1]):
            b[:, column] = LabelEncoder().fit_transform(a[:, column])


            Then b will be:



            array([[0, 0, 1],
            [0, 1, 0],
            [1, 2, 2]])


            If you want to be able to go back to the original values, you will need to save the encoders. You can do it this way:



            from sklearn.preprocessing import LabelEncoder

            encoders = {}
            b = np.zeros_like(a, np.int)
            for column in range(a.shape[1]):
            encoders[column] = LabelEncoder()
            b[:, column] = encoders[column].fit_transform(a[:, column])


            Now encoders[0].classes_ will have:



            array(['A', 'D'], dtype='<U3')


            Which means that 'A' was mapped to 0 and 'D' to 1.



            Finally, if you do the encoding overriding a instead of using a new matrix c, you will obtain integers as strings ("1" instead of 1), you can solve this with astype(int):



            encoders = {}
            for column in range(a.shape[1]):
            encoders[column] = LabelEncoder()
            a[:, column] = encoders[column].fit_transform(a[:, column])

            # At this point, a will have strings instead of ints because a had type str
            # array([['0', '0', '1'],
            # ['0', '1', '0'],
            # ['1', '2', '2']], dtype='<U3')

            a = a.astype(int)

            # Now `a` is of type int
            # array([[0, 0, 1],
            # [0, 1, 0],
            # [1, 2, 2]])


            Using pd.factorize



            factorize returns the encoded column and the encoding mapping, so if you don't care about it you can avoid saving it:



            for column in range(a.shape[1]):
            a[:, column], _ = pd.factorize(a[:, column]) # Drop mapping

            a = a.astype(int) # same as above, it's of type str
            # a is
            # array([[0, 0, 1],
            # [0, 1, 0],
            # [1, 2, 2]])


            If you want to keep the encoding mappings:



            mappings = 
            for column in range(a.shape[1]):
            a[:, column], mapping = pd.factorize(a[:, column])
            mappings.append(mapping)

            a = a.astype(int)


            Now mappings[0] will have the following data:



            array(['A', 'D'], dtype=object)


            Which has the same semantics than encoders[0].classes_ of sklearn's LabelEncoder solution.






            share|improve this answer


























            • Many thanks! I've gone for your label encoder method in the end. Thank you for taking the time to reply

              – user8188120
              Nov 20 '18 at 14:12













            • Glad to help :)

              – Julian Peller
              Nov 20 '18 at 14:22











            • If I could expand on this question, is there a way to use label_encoder in conjunction with a predefined array of unique values? The reason I ask is because I would like to perform this encoding on several huge datasets which have some common string values and some uncommon and therefore the above method would give the same string a different value in different datasets whereas I need them to be encoded the same across all datasets I plan to load in.

              – user8188120
              Nov 20 '18 at 14:45






            • 1





              Is it possible to have all the possible values for a column together? If it is, you can do as follows: First, fit the LabelEncoder on all the values for that column with the method fit: enc = LabelEncoder(); enc.fit(all_possible_values). This will create the unique mapping you want in enc. After that, you can just transform the columns of each specific dataframe with column = enc.transform(column), having the cross-dataframe consistency you want.

              – Julian Peller
              Nov 20 '18 at 14:55













            • It's certainly possible to get all the unique values so this should work I think. Thanks again, I'll give it a try! Is there a sensible way to save an encoding once it's been fit to data for using later? Just so the process doesn't have to be done in one long script run

              – user8188120
              Nov 20 '18 at 15:01


















            2














            It seems that you are trying to do a label encoding.



            I can think of two options: pandas.factorize and sklearn.preprocessing.LabelEncoder.



            Using LabelEncoder



            from sklearn.preprocessing import LabelEncoder

            b = np.zeros_like(a, np.int)
            for column in range(a.shape[1]):
            b[:, column] = LabelEncoder().fit_transform(a[:, column])


            Then b will be:



            array([[0, 0, 1],
            [0, 1, 0],
            [1, 2, 2]])


            If you want to be able to go back to the original values, you will need to save the encoders. You can do it this way:



            from sklearn.preprocessing import LabelEncoder

            encoders = {}
            b = np.zeros_like(a, np.int)
            for column in range(a.shape[1]):
            encoders[column] = LabelEncoder()
            b[:, column] = encoders[column].fit_transform(a[:, column])


            Now encoders[0].classes_ will have:



            array(['A', 'D'], dtype='<U3')


            Which means that 'A' was mapped to 0 and 'D' to 1.



            Finally, if you do the encoding overriding a instead of using a new matrix c, you will obtain integers as strings ("1" instead of 1), you can solve this with astype(int):



            encoders = {}
            for column in range(a.shape[1]):
            encoders[column] = LabelEncoder()
            a[:, column] = encoders[column].fit_transform(a[:, column])

            # At this point, a will have strings instead of ints because a had type str
            # array([['0', '0', '1'],
            # ['0', '1', '0'],
            # ['1', '2', '2']], dtype='<U3')

            a = a.astype(int)

            # Now `a` is of type int
            # array([[0, 0, 1],
            # [0, 1, 0],
            # [1, 2, 2]])


            Using pd.factorize



            factorize returns the encoded column and the encoding mapping, so if you don't care about it you can avoid saving it:



            for column in range(a.shape[1]):
            a[:, column], _ = pd.factorize(a[:, column]) # Drop mapping

            a = a.astype(int) # same as above, it's of type str
            # a is
            # array([[0, 0, 1],
            # [0, 1, 0],
            # [1, 2, 2]])


            If you want to keep the encoding mappings:



            mappings = 
            for column in range(a.shape[1]):
            a[:, column], mapping = pd.factorize(a[:, column])
            mappings.append(mapping)

            a = a.astype(int)


            Now mappings[0] will have the following data:



            array(['A', 'D'], dtype=object)


            Which has the same semantics than encoders[0].classes_ of sklearn's LabelEncoder solution.






            share|improve this answer


























            • Many thanks! I've gone for your label encoder method in the end. Thank you for taking the time to reply

              – user8188120
              Nov 20 '18 at 14:12













            • Glad to help :)

              – Julian Peller
              Nov 20 '18 at 14:22











            • If I could expand on this question, is there a way to use label_encoder in conjunction with a predefined array of unique values? The reason I ask is because I would like to perform this encoding on several huge datasets which have some common string values and some uncommon and therefore the above method would give the same string a different value in different datasets whereas I need them to be encoded the same across all datasets I plan to load in.

              – user8188120
              Nov 20 '18 at 14:45






            • 1





              Is it possible to have all the possible values for a column together? If it is, you can do as follows: First, fit the LabelEncoder on all the values for that column with the method fit: enc = LabelEncoder(); enc.fit(all_possible_values). This will create the unique mapping you want in enc. After that, you can just transform the columns of each specific dataframe with column = enc.transform(column), having the cross-dataframe consistency you want.

              – Julian Peller
              Nov 20 '18 at 14:55













            • It's certainly possible to get all the unique values so this should work I think. Thanks again, I'll give it a try! Is there a sensible way to save an encoding once it's been fit to data for using later? Just so the process doesn't have to be done in one long script run

              – user8188120
              Nov 20 '18 at 15:01
















            2












            2








            2







            It seems that you are trying to do a label encoding.



            I can think of two options: pandas.factorize and sklearn.preprocessing.LabelEncoder.



            Using LabelEncoder



            from sklearn.preprocessing import LabelEncoder

            b = np.zeros_like(a, np.int)
            for column in range(a.shape[1]):
            b[:, column] = LabelEncoder().fit_transform(a[:, column])


            Then b will be:



            array([[0, 0, 1],
            [0, 1, 0],
            [1, 2, 2]])


            If you want to be able to go back to the original values, you will need to save the encoders. You can do it this way:



            from sklearn.preprocessing import LabelEncoder

            encoders = {}
            b = np.zeros_like(a, np.int)
            for column in range(a.shape[1]):
            encoders[column] = LabelEncoder()
            b[:, column] = encoders[column].fit_transform(a[:, column])


            Now encoders[0].classes_ will have:



            array(['A', 'D'], dtype='<U3')


            Which means that 'A' was mapped to 0 and 'D' to 1.



            Finally, if you do the encoding overriding a instead of using a new matrix c, you will obtain integers as strings ("1" instead of 1), you can solve this with astype(int):



            encoders = {}
            for column in range(a.shape[1]):
            encoders[column] = LabelEncoder()
            a[:, column] = encoders[column].fit_transform(a[:, column])

            # At this point, a will have strings instead of ints because a had type str
            # array([['0', '0', '1'],
            # ['0', '1', '0'],
            # ['1', '2', '2']], dtype='<U3')

            a = a.astype(int)

            # Now `a` is of type int
            # array([[0, 0, 1],
            # [0, 1, 0],
            # [1, 2, 2]])


            Using pd.factorize



            factorize returns the encoded column and the encoding mapping, so if you don't care about it you can avoid saving it:



            for column in range(a.shape[1]):
            a[:, column], _ = pd.factorize(a[:, column]) # Drop mapping

            a = a.astype(int) # same as above, it's of type str
            # a is
            # array([[0, 0, 1],
            # [0, 1, 0],
            # [1, 2, 2]])


            If you want to keep the encoding mappings:



            mappings = 
            for column in range(a.shape[1]):
            a[:, column], mapping = pd.factorize(a[:, column])
            mappings.append(mapping)

            a = a.astype(int)


            Now mappings[0] will have the following data:



            array(['A', 'D'], dtype=object)


            Which has the same semantics than encoders[0].classes_ of sklearn's LabelEncoder solution.






            share|improve this answer















            It seems that you are trying to do a label encoding.



            I can think of two options: pandas.factorize and sklearn.preprocessing.LabelEncoder.



            Using LabelEncoder



            from sklearn.preprocessing import LabelEncoder

            b = np.zeros_like(a, np.int)
            for column in range(a.shape[1]):
            b[:, column] = LabelEncoder().fit_transform(a[:, column])


            Then b will be:



            array([[0, 0, 1],
            [0, 1, 0],
            [1, 2, 2]])


            If you want to be able to go back to the original values, you will need to save the encoders. You can do it this way:



            from sklearn.preprocessing import LabelEncoder

            encoders = {}
            b = np.zeros_like(a, np.int)
            for column in range(a.shape[1]):
            encoders[column] = LabelEncoder()
            b[:, column] = encoders[column].fit_transform(a[:, column])


            Now encoders[0].classes_ will have:



            array(['A', 'D'], dtype='<U3')


            Which means that 'A' was mapped to 0 and 'D' to 1.



            Finally, if you do the encoding overriding a instead of using a new matrix c, you will obtain integers as strings ("1" instead of 1), you can solve this with astype(int):



            encoders = {}
            for column in range(a.shape[1]):
            encoders[column] = LabelEncoder()
            a[:, column] = encoders[column].fit_transform(a[:, column])

            # At this point, a will have strings instead of ints because a had type str
            # array([['0', '0', '1'],
            # ['0', '1', '0'],
            # ['1', '2', '2']], dtype='<U3')

            a = a.astype(int)

            # Now `a` is of type int
            # array([[0, 0, 1],
            # [0, 1, 0],
            # [1, 2, 2]])


            Using pd.factorize



            factorize returns the encoded column and the encoding mapping, so if you don't care about it you can avoid saving it:



            for column in range(a.shape[1]):
            a[:, column], _ = pd.factorize(a[:, column]) # Drop mapping

            a = a.astype(int) # same as above, it's of type str
            # a is
            # array([[0, 0, 1],
            # [0, 1, 0],
            # [1, 2, 2]])


            If you want to keep the encoding mappings:



            mappings = 
            for column in range(a.shape[1]):
            a[:, column], mapping = pd.factorize(a[:, column])
            mappings.append(mapping)

            a = a.astype(int)


            Now mappings[0] will have the following data:



            array(['A', 'D'], dtype=object)


            Which has the same semantics than encoders[0].classes_ of sklearn's LabelEncoder solution.







            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Nov 20 '18 at 13:57

























            answered Nov 20 '18 at 13:51









            Julian PellerJulian Peller

            8941511




            8941511













            • Many thanks! I've gone for your label encoder method in the end. Thank you for taking the time to reply

              – user8188120
              Nov 20 '18 at 14:12













            • Glad to help :)

              – Julian Peller
              Nov 20 '18 at 14:22











            • If I could expand on this question, is there a way to use label_encoder in conjunction with a predefined array of unique values? The reason I ask is because I would like to perform this encoding on several huge datasets which have some common string values and some uncommon and therefore the above method would give the same string a different value in different datasets whereas I need them to be encoded the same across all datasets I plan to load in.

              – user8188120
              Nov 20 '18 at 14:45






            • 1





              Is it possible to have all the possible values for a column together? If it is, you can do as follows: First, fit the LabelEncoder on all the values for that column with the method fit: enc = LabelEncoder(); enc.fit(all_possible_values). This will create the unique mapping you want in enc. After that, you can just transform the columns of each specific dataframe with column = enc.transform(column), having the cross-dataframe consistency you want.

              – Julian Peller
              Nov 20 '18 at 14:55













            • It's certainly possible to get all the unique values so this should work I think. Thanks again, I'll give it a try! Is there a sensible way to save an encoding once it's been fit to data for using later? Just so the process doesn't have to be done in one long script run

              – user8188120
              Nov 20 '18 at 15:01
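The fit-once / transform-everywhere approach from the comments, plus persisting the fitted encoder for a later run, might look like the following sketch (the value list here is illustrative; in practice it would be gathered from all datasets):

```python
import pickle
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit once on the union of values from all datasets, so every
# dataset maps the same string to the same integer.
all_possible_values = ['A', 'B', 'C', 'D']  # illustrative
enc = LabelEncoder()
enc.fit(all_possible_values)

# Encode a column from one particular dataset.
column = np.array(['A', 'A', 'D'])
codes = enc.transform(column)
print(codes)  # [0 0 3]  (classes_ are sorted, so 'D' -> 3)

# Save the fitted encoder for a later script run, e.g. with pickle
# (joblib.dump works the same way for sklearn objects).
blob = pickle.dumps(enc)
restored = pickle.loads(blob)
print(restored.transform(column))  # [0 0 3]
```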



























            1














            You can do what you want in an efficient way with just Numpy.



            Basically, you iterate over the values in each column of your input while keeping track of the observed letters in a set or dict. This is similar to what you already had, but slightly more efficient (you avoid the call to np.where for one thing).



            Here's a function charToIx that will do what you want:



from collections import defaultdict
from string import ascii_letters

class Ix:
    def __init__(self):
        self._val = 0

    def __call__(self):
        val = self._val
        self._val += 1
        return val

def charToIx(arr, dtype=None, out=None):
    if dtype is None:
        dtype = arr.dtype

    if out is None:
        out = np.zeros(arr.shape, dtype=dtype)

    for incol, outcol in zip(arr.T, out.T):
        ix = Ix()
        cixDict = defaultdict(lambda: ix())
        for i, x in enumerate(incol):
            if x in cixDict or x in ascii_letters:
                outcol[i] = cixDict[x]
            else:
                outcol[i] = x

    return out


            You specify the type of the output array when you call the function. So the output of:



            a = np.array([['A',0,'C'],['A',0.3,'B'],['D',1,'D']])
            print(charToIx(a, dtype=float))


            will be a float array:



array([[0. , 0. , 0. ],
       [0. , 0.3, 1. ],
       [1. , 1. , 2. ]])
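As an aside (not part of this answer's code): if codes ordered by sorted value, rather than by first appearance, are acceptable, np.unique with return_inverse=True does the per-column encoding without an inner Python loop — though unlike charToIx it also recodes the numeric column:

```python
import numpy as np

a = np.array([['A', 0, 'C'], ['A', 0.3, 'B'], ['D', 1, 'D']])

out = np.empty(a.shape, dtype=int)
for j in range(a.shape[1]):
    # return_inverse maps each element to the index of its value
    # in the sorted unique values of that column
    _, out[:, j] = np.unique(a[:, j], return_inverse=True)

print(out)  # [[0 0 1]
            #  [0 1 0]
            #  [1 2 2]]
```

Here column 2 sorts to ['B', 'C', 'D'], so 'C' becomes 1 and 'B' becomes 0, unlike the first-appearance ordering in the question.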





edited Nov 20 '18 at 14:05

answered Nov 20 '18 at 13:51

tel



















































