regular expression pyspark dataframe column

I have a PySpark dataframe and I want to split column A into two columns, A1 and A2, using a regex, as shown below. My attempt didn't work.

A                 | A1         | A2
20-13-2012-monday | 20-13-2012 | monday
20-14-2012-tues   | 20-14-2012 | tues
20-13-2012-wed    | 20-13-2012 | wed

My code looks like this:

import re
from pyspark.sql.functions import regexp_extract

reg = r'^([\d]+-[\d]+-[\d]+)'
df = df.withColumn("A1", re.match(reg, df.select(['A'])).group())
df.show()

asked Nov 19 '18 at 23:37

Emma

236

You're mixing Python's re library with Spark. re works on Python strings, not on dataframe columns. pyspark.sql.functions.split can split on a regex: df = df.withColumn("A1", split(col("A"), reg))

– pault
Nov 20 '18 at 6:05

I am getting output like -- [ , monday]. I want just monday. No comma no

– Emma
Nov 20 '18 at 13:09

it says invalid syntax

– Emma
Nov 20 '18 at 13:10

This is a useful post: Reference: what does this regex mean?.

– pault
Nov 20 '18 at 14:10
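The `[ , monday]` output mentioned in the comments can be reproduced with plain Python, which may help explain it. This is only a sketch: `DATE_PREFIX` is a hypothetical variant of the pattern that also consumes the hyphen after the year, and the exact pattern used in the comment thread is not shown. When the match sits at the very start of the string, the split leaves an empty string in front of it, so the day is the second element.

```python
import re

# Hypothetical pattern: the date prefix plus the hyphen that follows it.
DATE_PREFIX = r'^\d+-\d+-\d+-'

# The match starts at position 0, so the first piece is the empty string.
parts = re.split(DATE_PREFIX, '20-13-2012-monday')
print(parts)  # ['', 'monday']

# Take the second element to get only the day name.
day = parts[1]
print(day)  # monday
```

On the Spark side, the corresponding step would presumably be to index into the array that split returns, for example `split(col("A"), DATE_PREFIX).getItem(1)`.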


1 Answer

You can wrap the regex in a UDF to get the required output:

>>> import re
>>> from pyspark.sql.types import ArrayType, StringType
>>> from pyspark.sql.functions import udf

>>> def get_date_day(a):
...   # re.split returns ['', '20-13-2012', '-monday']; [1:] drops the leading empty string
...   x, y = re.split(r'^([\d]+-[\d]+-[\d]+)', a)[1:]
...   # y[1:] strips the hyphen left in front of the day name
...   return [x, y[1:]]
...
>>> get_date_day('20-13-2012-monday')
['20-13-2012', 'monday']

>>> get_date_udf = udf(get_date_day, ArrayType(StringType()))

>>> df = sc.parallelize([('20-13-2012-monday',), ('20-14-2012-tues',), ('20-13-2012-wed',)]).toDF(['A'])
>>> df.show()
+-----------------+
|                A|
+-----------------+
|20-13-2012-monday|
|  20-14-2012-tues|
|   20-13-2012-wed|
+-----------------+

>>> df = df.withColumn("A12", get_date_udf('A'))
>>> df.show(truncate=False)
+-----------------+--------------------+
|A                |A12                 |
+-----------------+--------------------+
|20-13-2012-monday|[20-13-2012, monday]|
|20-14-2012-tues  |[20-14-2012, tues]  |
|20-13-2012-wed   |[20-13-2012, wed]   |
+-----------------+--------------------+

>>> # Pull each array element out into its own column, then drop the helper column
>>> df = df.withColumn("A1", udf(lambda x: x[0])('A12')).withColumn("A2", udf(lambda x: x[1])('A12'))
>>> df = df.drop('A12')
>>> df.show(truncate=False)
+-----------------+----------+------+
|A                |A1        |A2    |
+-----------------+----------+------+
|20-13-2012-monday|20-13-2012|monday|
|20-14-2012-tues  |20-14-2012|tues  |
|20-13-2012-wed   |20-13-2012|wed   |
+-----------------+----------+------+

Hope this helps!

answered Nov 21 '18 at 2:08

Pavithran Ramachandran

43338
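As an alternative to the UDF approach, the same split can probably be done with Spark's built-in `regexp_extract`, which the question already imports. The sketch below is not from the answer above: `PATTERN`, `split_value` and `add_split_columns` are hypothetical names, and it assumes every value has the form digits-digits-digits-text. The pure-Python helper is there only so the pattern can be checked without a Spark session.

```python
import re

# Group 1 is the date part, group 2 is everything after the next hyphen.
# The same pattern is valid for Python's re and for Spark's Java regex engine.
PATTERN = r'^(\d+-\d+-\d+)-(.*)$'


def split_value(value):
    """Check the pattern on a plain string; returns (A1, A2) or (None, None)."""
    m = re.match(PATTERN, value)
    return (m.group(1), m.group(2)) if m else (None, None)


def add_split_columns(df, source="A"):
    """Add columns A1 and A2 with regexp_extract instead of a Python UDF."""
    # Imported here so the pure-Python check above runs without Spark installed.
    from pyspark.sql.functions import col, regexp_extract
    return (df
            .withColumn("A1", regexp_extract(col(source), PATTERN, 1))
            .withColumn("A2", regexp_extract(col(source), PATTERN, 2)))


print(split_value('20-13-2012-monday'))  # ('20-13-2012', 'monday')
```

With a dataframe like the one in the question, `add_split_columns(df).show()` would be the call to try. Built-in column functions run inside Spark rather than in a Python worker, so they generally avoid the serialization overhead of a UDF. Note that `regexp_extract` returns an empty string, not null, for rows that do not match.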
