pyspark Regexp_Extract - Extract multiple words from a string column


-2

I am trying to extract words from a string column using PySpark's regexp_extract.

My DataFrame:

ID, Code
10, A1005*B1003
12, A1007*D1008*C1004

from pyspark.sql.functions import col, regexp_extract

result = df.withColumn('Code1', regexp_extract(col('Code'), r'\w+', 0))

Output:

ID, Code,              Code1
10, A1005*B1003,       A1005
12, A1007*D1008*C1004, A1007

I want to extract every code from the Code column, so that my DataFrame looks like this:

ID, Code,              Code1,  Code2,  Code3
10, A1005*B1003,       A1005,  B1003,  null
12, A1007*D1008*C1004, A1007,  D1008,  C1004

edited Jan 3 at 15:36

SHR


asked Jan 3 at 15:15

Mayan

Possible duplicate of Split Spark Dataframe string column into multiple columns

– pault
Jan 3 at 15:44

1 Answer

Assuming your ID column is unique for each row, here is one way of doing it with split, posexplode and then pivot:

import pyspark.sql.functions as f

# Split on the literal '*', then emit one row per code together with its position.
(df.select('ID', 'Code', f.posexplode(f.split('Code', r'\*')))
   # Turn the position 0, 1, 2 into the column names code0, code1, code2.
   .withColumn('pos', f.concat(f.lit('code'), f.col('pos')))
   # Pivot the positions back into columns, one row per (ID, Code).
   .groupBy('ID', 'Code').pivot('pos').agg(f.first('col'))
   .show())

+---+-----------------+-----+-----+-----+
| ID|             Code|code0|code1|code2|
+---+-----------------+-----+-----+-----+
| 10|      A1005*B1003|A1005|B1003| null|
| 12|A1007*D1008*C1004|A1007|D1008|C1004|
+---+-----------------+-----+-----+-----+

Another option without pivoting:

# Keep the split codes as an array column.
df1 = df.select('ID', 'Code', f.split('Code', r'\*').alias('Codes'))

# Largest number of codes in any row; this runs a Spark job.
maxCodes = df1.agg(f.max(f.size('Codes'))).first()[0]      # 3

# One output column per array position; shorter rows show null, as below.
df1.select(
  'ID', 'Code',
  *[f.col('Codes').getItem(i).alias(f'Code{i+1}') for i in range(maxCodes)]
).show()

+---+-----------------+-----+-----+-----+
| ID|             Code|Code1|Code2|Code3|
+---+-----------------+-----+-----+-----+
| 10|      A1005*B1003|A1005|B1003| null|
| 12|A1007*D1008*C1004|A1007|D1008|C1004|
+---+-----------------+-----+-----+-----+
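
The padding logic of this second option can be tried without a Spark session. The following is a minimal plain-Python sketch rather than PySpark code: `split_codes` and `to_wide_rows` are hypothetical helper names, and `None` stands in for Spark's null.

```python
import re


def split_codes(code):
    """Split a Code string on the literal '*' separator."""
    return re.split(r'\*', code)


def to_wide_rows(rows):
    """Turn (ID, Code) pairs into tuples padded to the widest code list."""
    split_rows = [(row_id, code, split_codes(code)) for row_id, code in rows]
    # Plays the role of max(size(Codes)) over the whole dataset.
    max_codes = max((len(codes) for _, _, codes in split_rows), default=0)
    wide = []
    for row_id, code, codes in split_rows:
        # Pad short lists with None, the stand-in for Spark's null.
        padded = codes + [None] * (max_codes - len(codes))
        wide.append((row_id, code, *padded))
    return wide


rows = [(10, 'A1005*B1003'), (12, 'A1007*D1008*C1004')]
wide_rows = to_wide_rows(rows)
for row in wide_rows:
    print(row)
```

As in the Spark version, the number of output columns is only known after scanning all rows for the longest list.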

edited Jan 3 at 15:44

answered Jan 3 at 15:34

Psidom


Hi, thank you for the quick reply. The Code column also holds arithmetic operators: it can store values like (A1002*B1002)-C1003+D1005 or A1004/(C1008-D1006), and the number of codes in a string can go up to 7.

– Mayan
Jan 3 at 15:49

If the words you want to extract contain only digits and letters, you can replace f.split(...) in the two options above with f.array_remove(f.split('Code', r'\W+'), ''), and it should give the result you need.

– Psidom
Jan 3 at 16:07
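
To see why the empty strings need to be removed after splitting on `\W+`, here is a small plain-Python sketch using the standard `re` module. It is only an analogy for the Spark expression in the comment above, since Spark evaluates the pattern with Java's regex engine, and `extract_codes` is a hypothetical helper name.

```python
import re


def extract_codes(expression):
    """Return the alphanumeric codes found in an arithmetic expression string."""
    # Splitting on runs of non-word characters leaves an empty string
    # when the expression starts or ends with an operator or parenthesis.
    pieces = re.split(r'\W+', expression)
    # Dropping the empty strings plays the role of array_remove(..., '').
    return [piece for piece in pieces if piece != '']


first = extract_codes('(A1002*B1002)-C1003+D1005')
second = extract_codes('A1004/(C1008-D1006)')
print(first)
print(second)
```

Note that `\w` also matches the underscore, so a code containing `_` would be kept as one piece.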

Hi, could you please help me transpose the same dataset as below?

ID, Code,              Code_T
10, A1005*B1003,       A1005
10, A1005*B1003,       B1003
12, A1007*D1008*C1004, A1007
12, A1007*D1008*C1004, D1008
12, A1007*D1008*C1004, C1004

– Mayan
Jan 8 at 14:35
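
The requested layout has one row per code instead of one column per code. In PySpark the usual tool for this kind of reshaping is `f.explode` applied to the split array; the sketch below only mimics that behaviour in plain Python so it can be run without Spark. `to_long_rows` is a hypothetical helper name, and the `'*'` separator is assumed from the sample data.

```python
def to_long_rows(rows, separator='*'):
    """Emit one (ID, Code, Code_T) tuple per code, keeping the original order."""
    long_rows = []
    for row_id, code in rows:
        # Each piece of the split string becomes its own output row.
        for part in code.split(separator):
            long_rows.append((row_id, code, part))
    return long_rows


rows = [(10, 'A1005*B1003'), (12, 'A1007*D1008*C1004')]
long_rows = to_long_rows(rows)
for row in long_rows:
    print(row)
```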


