regular expression pyspark dataframe column
My dataframe looks like this.
I have a pyspark dataframe and I want to split column A into A1 and A2 like this using regex but that didn't work.
A | A1 | A2
20-13-2012-monday 20-13-2012 monday
20-14-2012-tues 20-14-2012 tues
20-13-2012-wed 20-13-2012 wed
My code looks like this
import re
from pyspark.sql.functions import regexp_extract
reg = r'^([d]+-[d]+-[d]+)'
df=df.withColumn("A1",re.match(reg, df.select(['A'])).group())
df.show()
pyspark
add a comment |
My dataframe looks like this.
I have a pyspark dataframe and I want to split column A into A1 and A2 like this using regex but that didn't work.
A | A1 | A2
20-13-2012-monday 20-13-2012 monday
20-14-2012-tues 20-14-2012 tues
20-13-2012-wed 20-13-2012 wed
My code looks like this
import re
from pyspark.sql.functions import regexp_extract
reg = r'^([d]+-[d]+-[d]+)'
df=df.withColumn("A1",re.match(reg, df.select(['A'])).group())
df.show()
pyspark
You're mixingre
from the python library with spark.pyspark.sql.functions.split
can split on regex:df=df.withColumn("A1",split(col("A"), reg))
– pault
Nov 20 '18 at 6:05
I am getting output like -- [ , monday]. I want just monday. No comma no
– Emma
Nov 20 '18 at 13:09
it says invalid syntax
– Emma
Nov 20 '18 at 13:10
This is a useful post: Reference: what does this regex mean?.
– pault
Nov 20 '18 at 14:10
add a comment |
My dataframe looks like this.
I have a pyspark dataframe and I want to split column A into A1 and A2 like this using regex but that didn't work.
A | A1 | A2
20-13-2012-monday 20-13-2012 monday
20-14-2012-tues 20-14-2012 tues
20-13-2012-wed 20-13-2012 wed
My code looks like this
import re
from pyspark.sql.functions import regexp_extract
reg = r'^([d]+-[d]+-[d]+)'
df=df.withColumn("A1",re.match(reg, df.select(['A'])).group())
df.show()
pyspark
My dataframe looks like this.
I have a pyspark dataframe and I want to split column A into A1 and A2 like this using regex but that didn't work.
A | A1 | A2
20-13-2012-monday 20-13-2012 monday
20-14-2012-tues 20-14-2012 tues
20-13-2012-wed 20-13-2012 wed
My code looks like this
import re
from pyspark.sql.functions import regexp_extract
reg = r'^([d]+-[d]+-[d]+)'
df=df.withColumn("A1",re.match(reg, df.select(['A'])).group())
df.show()
pyspark
pyspark
asked Nov 19 '18 at 23:37


EmmaEmma
236
236
You're mixingre
from the python library with spark.pyspark.sql.functions.split
can split on regex:df=df.withColumn("A1",split(col("A"), reg))
– pault
Nov 20 '18 at 6:05
I am getting output like -- [ , monday]. I want just monday. No comma no
– Emma
Nov 20 '18 at 13:09
it says invalid syntax
– Emma
Nov 20 '18 at 13:10
This is a useful post: Reference: what does this regex mean?.
– pault
Nov 20 '18 at 14:10
add a comment |
You're mixingre
from the python library with spark.pyspark.sql.functions.split
can split on regex:df=df.withColumn("A1",split(col("A"), reg))
– pault
Nov 20 '18 at 6:05
I am getting output like -- [ , monday]. I want just monday. No comma no
– Emma
Nov 20 '18 at 13:09
it says invalid syntax
– Emma
Nov 20 '18 at 13:10
This is a useful post: Reference: what does this regex mean?.
– pault
Nov 20 '18 at 14:10
You're mixing
re
from the python library with spark. pyspark.sql.functions.split
can split on regex: df=df.withColumn("A1",split(col("A"), reg))
– pault
Nov 20 '18 at 6:05
You're mixing
re
from the python library with spark. pyspark.sql.functions.split
can split on regex: df=df.withColumn("A1",split(col("A"), reg))
– pault
Nov 20 '18 at 6:05
I am getting output like -- [ , monday]. I want just monday. No comma no
– Emma
Nov 20 '18 at 13:09
I am getting output like -- [ , monday]. I want just monday. No comma no
– Emma
Nov 20 '18 at 13:09
it says invalid syntax
– Emma
Nov 20 '18 at 13:10
it says invalid syntax
– Emma
Nov 20 '18 at 13:10
This is a useful post: Reference: what does this regex mean?.
– pault
Nov 20 '18 at 14:10
This is a useful post: Reference: what does this regex mean?.
– pault
Nov 20 '18 at 14:10
add a comment |
1 Answer
1
active
oldest
votes
You can use the regex as an udf
and achieve the required output like this:
>>> import re
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> def get_date_day(a):
... x, y = re.split('^([d]+-[d]+-[d]+)', a)[1:]
... return [x, y[1:]]
>>> get_date_day('20-13-2012-monday')
['20-13-2012', 'monday']
>>> get_date_day('20-13-2012-monday')
['20-13-2012', '-monday']
>>> get_date_udf = udf(get_date_day, ArrayType(StringType()))
>>> df = sc.parallelize([('20-13-2012-monday',), ('20-14-2012-tues',), ('20-13-2012-wed',)]).toDF(['A'])
>>> df.show()
+-----------------+
| A|
+-----------------+
|20-13-2012-monday|
| 20-14-2012-tues|
| 20-13-2012-wed|
+-----------------+
>>> df = df.withColumn("A12", get_date_udf('A'))
>>> df.show(truncate=False)
+-----------------+--------------------+
|A |A12 |
+-----------------+--------------------+
|20-13-2012-monday|[20-13-2012, monday]|
|20-14-2012-tues |[20-14-2012, tues] |
|20-13-2012-wed |[20-13-2012, wed] |
+-----------------+--------------------+
>>> df = df.withColumn("A1", udf(lambda x:x[0])('A12')).withColumn("A2", udf(lambda x:x[1])('A12'))
>>> df = df.drop('A12')
>>> df.show(truncate=False)
+-----------------+----------+------+
|A |A1 |A2 |
+-----------------+----------+------+
|20-13-2012-monday|20-13-2012|monday|
|20-14-2012-tues |20-14-2012|tues |
|20-13-2012-wed |20-13-2012|wed |
+-----------------+----------+------+
Hope this helps!
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53384200%2fregular-expression-pyspark-dataframe-column%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
You can use the regex as an udf
and achieve the required output like this:
>>> import re
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> def get_date_day(a):
... x, y = re.split('^([d]+-[d]+-[d]+)', a)[1:]
... return [x, y[1:]]
>>> get_date_day('20-13-2012-monday')
['20-13-2012', 'monday']
>>> get_date_day('20-13-2012-monday')
['20-13-2012', '-monday']
>>> get_date_udf = udf(get_date_day, ArrayType(StringType()))
>>> df = sc.parallelize([('20-13-2012-monday',), ('20-14-2012-tues',), ('20-13-2012-wed',)]).toDF(['A'])
>>> df.show()
+-----------------+
| A|
+-----------------+
|20-13-2012-monday|
| 20-14-2012-tues|
| 20-13-2012-wed|
+-----------------+
>>> df = df.withColumn("A12", get_date_udf('A'))
>>> df.show(truncate=False)
+-----------------+--------------------+
|A |A12 |
+-----------------+--------------------+
|20-13-2012-monday|[20-13-2012, monday]|
|20-14-2012-tues |[20-14-2012, tues] |
|20-13-2012-wed |[20-13-2012, wed] |
+-----------------+--------------------+
>>> df = df.withColumn("A1", udf(lambda x:x[0])('A12')).withColumn("A2", udf(lambda x:x[1])('A12'))
>>> df = df.drop('A12')
>>> df.show(truncate=False)
+-----------------+----------+------+
|A |A1 |A2 |
+-----------------+----------+------+
|20-13-2012-monday|20-13-2012|monday|
|20-14-2012-tues |20-14-2012|tues |
|20-13-2012-wed |20-13-2012|wed |
+-----------------+----------+------+
Hope this helps!
add a comment |
You can use the regex as an udf
and achieve the required output like this:
>>> import re
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> def get_date_day(a):
... x, y = re.split('^([d]+-[d]+-[d]+)', a)[1:]
... return [x, y[1:]]
>>> get_date_day('20-13-2012-monday')
['20-13-2012', 'monday']
>>> get_date_day('20-13-2012-monday')
['20-13-2012', '-monday']
>>> get_date_udf = udf(get_date_day, ArrayType(StringType()))
>>> df = sc.parallelize([('20-13-2012-monday',), ('20-14-2012-tues',), ('20-13-2012-wed',)]).toDF(['A'])
>>> df.show()
+-----------------+
| A|
+-----------------+
|20-13-2012-monday|
| 20-14-2012-tues|
| 20-13-2012-wed|
+-----------------+
>>> df = df.withColumn("A12", get_date_udf('A'))
>>> df.show(truncate=False)
+-----------------+--------------------+
|A |A12 |
+-----------------+--------------------+
|20-13-2012-monday|[20-13-2012, monday]|
|20-14-2012-tues |[20-14-2012, tues] |
|20-13-2012-wed |[20-13-2012, wed] |
+-----------------+--------------------+
>>> df = df.withColumn("A1", udf(lambda x:x[0])('A12')).withColumn("A2", udf(lambda x:x[1])('A12'))
>>> df = df.drop('A12')
>>> df.show(truncate=False)
+-----------------+----------+------+
|A |A1 |A2 |
+-----------------+----------+------+
|20-13-2012-monday|20-13-2012|monday|
|20-14-2012-tues |20-14-2012|tues |
|20-13-2012-wed |20-13-2012|wed |
+-----------------+----------+------+
Hope this helps!
add a comment |
You can use the regex as an udf
and achieve the required output like this:
>>> import re
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> def get_date_day(a):
... x, y = re.split('^([d]+-[d]+-[d]+)', a)[1:]
... return [x, y[1:]]
>>> get_date_day('20-13-2012-monday')
['20-13-2012', 'monday']
>>> get_date_day('20-13-2012-monday')
['20-13-2012', '-monday']
>>> get_date_udf = udf(get_date_day, ArrayType(StringType()))
>>> df = sc.parallelize([('20-13-2012-monday',), ('20-14-2012-tues',), ('20-13-2012-wed',)]).toDF(['A'])
>>> df.show()
+-----------------+
| A|
+-----------------+
|20-13-2012-monday|
| 20-14-2012-tues|
| 20-13-2012-wed|
+-----------------+
>>> df = df.withColumn("A12", get_date_udf('A'))
>>> df.show(truncate=False)
+-----------------+--------------------+
|A |A12 |
+-----------------+--------------------+
|20-13-2012-monday|[20-13-2012, monday]|
|20-14-2012-tues |[20-14-2012, tues] |
|20-13-2012-wed |[20-13-2012, wed] |
+-----------------+--------------------+
>>> df = df.withColumn("A1", udf(lambda x:x[0])('A12')).withColumn("A2", udf(lambda x:x[1])('A12'))
>>> df = df.drop('A12')
>>> df.show(truncate=False)
+-----------------+----------+------+
|A |A1 |A2 |
+-----------------+----------+------+
|20-13-2012-monday|20-13-2012|monday|
|20-14-2012-tues |20-14-2012|tues |
|20-13-2012-wed |20-13-2012|wed |
+-----------------+----------+------+
Hope this helps!
You can use the regex as an udf
and achieve the required output like this:
>>> import re
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> def get_date_day(a):
... x, y = re.split('^([d]+-[d]+-[d]+)', a)[1:]
... return [x, y[1:]]
>>> get_date_day('20-13-2012-monday')
['20-13-2012', 'monday']
>>> get_date_day('20-13-2012-monday')
['20-13-2012', '-monday']
>>> get_date_udf = udf(get_date_day, ArrayType(StringType()))
>>> df = sc.parallelize([('20-13-2012-monday',), ('20-14-2012-tues',), ('20-13-2012-wed',)]).toDF(['A'])
>>> df.show()
+-----------------+
| A|
+-----------------+
|20-13-2012-monday|
| 20-14-2012-tues|
| 20-13-2012-wed|
+-----------------+
>>> df = df.withColumn("A12", get_date_udf('A'))
>>> df.show(truncate=False)
+-----------------+--------------------+
|A |A12 |
+-----------------+--------------------+
|20-13-2012-monday|[20-13-2012, monday]|
|20-14-2012-tues |[20-14-2012, tues] |
|20-13-2012-wed |[20-13-2012, wed] |
+-----------------+--------------------+
>>> df = df.withColumn("A1", udf(lambda x:x[0])('A12')).withColumn("A2", udf(lambda x:x[1])('A12'))
>>> df = df.drop('A12')
>>> df.show(truncate=False)
+-----------------+----------+------+
|A |A1 |A2 |
+-----------------+----------+------+
|20-13-2012-monday|20-13-2012|monday|
|20-14-2012-tues |20-14-2012|tues |
|20-13-2012-wed |20-13-2012|wed |
+-----------------+----------+------+
Hope this helps!
answered Nov 21 '18 at 2:08
Pavithran RamachandranPavithran Ramachandran
43338
43338
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53384200%2fregular-expression-pyspark-dataframe-column%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
You're mixing
re
from the python library with spark.pyspark.sql.functions.split
can split on regex:df=df.withColumn("A1",split(col("A"), reg))
– pault
Nov 20 '18 at 6:05
I am getting output like -- [ , monday]. I want just monday. No comma no
– Emma
Nov 20 '18 at 13:09
it says invalid syntax
– Emma
Nov 20 '18 at 13:10
This is a useful post: Reference: what does this regex mean?.
– pault
Nov 20 '18 at 14:10