pyspark Regexp_Extract - Extract multiple words from a string column
I am trying to extract words from a string column using PySpark's regexp_extract.
My DataFrame is below:
ID, Code
10, A1005*B1003
12, A1007*D1008*C1004
result = df.withColumn('Code1', regexp_extract(col('Code'), r'\w+', 0))
Output:
ID, Code, Code1
10, A1005*B1003, A1005
12, A1007*D1008*C1004, A1007
This returns only the first code. I want to extract all the codes from the Code column and have my DataFrame display as below:
ID, Code, Code1, Code2, Code3
10, A1005*B1003, A1005, B1003, null
12, A1007*D1008*C1004, A1007, D1008, C1004
pyspark
Possible duplicate of Split Spark Dataframe string column into multiple columns
– pault
Jan 3 at 15:44
asked Jan 3 at 15:15 by Mayan; edited Jan 3 at 15:36 by SHR
1 Answer
Assuming your ID column is unique for each row, here is one way of doing it with split, posexplode and then pivot:
import pyspark.sql.functions as f

# Split on the literal '*', emit one row per (position, code), then pivot the positions into columns.
(df.select('ID', 'Code', f.posexplode(f.split('Code', r'\*')))
   .withColumn('pos', f.concat(f.lit('code'), f.col('pos')))
   .groupBy('ID', 'Code').pivot('pos').agg(f.first('col'))
   .show())
+---+-----------------+-----+-----+-----+
| ID|             Code|code0|code1|code2|
+---+-----------------+-----+-----+-----+
| 10|      A1005*B1003|A1005|B1003| null|
| 12|A1007*D1008*C1004|A1007|D1008|C1004|
+---+-----------------+-----+-----+-----+
Another option without pivoting:
# Split once into an array column, then pick items by index.
df1 = df.select('ID', 'Code', f.split('Code', r'\*').alias('Codes'))
maxCodes = df1.agg(f.max(f.size('Codes'))).first()[0]  # 3 for the sample data
df1.select(
    'ID', 'Code',
    *[f.col('Codes').getItem(i).alias(f'Code{i+1}') for i in range(maxCodes)]
).show()
+---+-----------------+-----+-----+-----+
| ID|             Code|Code1|Code2|Code3|
+---+-----------------+-----+-----+-----+
| 10|      A1005*B1003|A1005|B1003| null|
| 12|A1007*D1008*C1004|A1007|D1008|C1004|
+---+-----------------+-----+-----+-----+
Hi, thank you for the quick reply. The Code column also holds arithmetic operators: it can store values like (A1002*B1002)-C1003+D1005 or A1004/(C1008-D1006), and the number of codes in a string can go up to 7.
– Mayan
Jan 3 at 15:49
If the words you want to extract contain only digits and letters, you can replace f.split(...) in the above two options with f.array_remove(f.split('Code', '\W+'), ''), and it should give the result you need.
– Psidom
Jan 3 at 16:07
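To see why the empty-string removal in the comment above matters, here is a minimal sketch in plain Python using the standard `re` module rather than Spark. It assumes Spark's regex split behaves the same way for this pattern on these inputs, which is not verified here, and `extract_codes` is a hypothetical helper name introduced only for illustration.

```python
import re

def extract_codes(expr):
    """Split on runs of non-word characters and drop empty strings."""
    parts = re.split(r'\W+', expr)
    # A leading '(' or trailing ')' leaves an empty string at that end of the list.
    return [p for p in parts if p != '']

samples = ['(A1002*B1002)-C1003+D1005', 'A1004/(C1008-D1006)']
for s in samples:
    # Show the raw split next to the cleaned result.
    print(re.split(r'\W+', s), '->', extract_codes(s))
```

The raw split of the first sample starts with an empty string and the raw split of the second ends with one; filtering them out leaves only the codes.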
Hi, could you please help me transpose the same dataset as below?
ID, Code, Code_T
10, A1005*B1003, A1005
10, A1005*B1003, B1003
12, A1007*D1008*C1004, A1007
12, A1007*D1008*C1004, D1008
12, A1007*D1008*C1004, C1004
– Mayan
Jan 8 at 14:35
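The thread does not show an answer to this follow-up. One possible reading is a long format with one row per code; in PySpark that would likely be explode applied to the same split array used in the answer, which is an assumption here. The plain-Python sketch below only illustrates the expected row shape, and `to_long` is a hypothetical helper name.

```python
def to_long(rows, sep='*'):
    """Return one (ID, Code, Code_T) tuple per code found in each row."""
    return [(id_, code, part) for id_, code in rows for part in code.split(sep)]

rows = [(10, 'A1005*B1003'), (12, 'A1007*D1008*C1004')]
for r in to_long(rows):
    print(r)
```

Each input row is repeated once per code it contains, with ID and Code carried along unchanged.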
answered Jan 3 at 15:34 by Psidom; edited Jan 3 at 15:44