regular expression pyspark dataframe column

I have a PySpark dataframe and I want to split column A into A1 and A2 like this, using a regex, but it didn't work:

A                 | A1         | A2
20-13-2012-monday | 20-13-2012 | monday
20-14-2012-tues   | 20-14-2012 | tues
20-13-2012-wed    | 20-13-2012 | wed


My code looks like this:

import re
from pyspark.sql.functions import regexp_extract
reg = r'^([\d]+-[\d]+-[\d]+)'
df = df.withColumn("A1", re.match(reg, df.select(['A'])).group())
df.show()









pyspark

asked Nov 19 '18 at 23:37 by Emma (236)

  • You're mixing re from the Python library with Spark. pyspark.sql.functions.split can split on regex: df=df.withColumn("A1",split(col("A"), reg)) (see the sketch after these comments)

    – pault
    Nov 20 '18 at 6:05











  • I am getting output like [ , monday]. I want just monday: no comma, no brackets.

    – Emma
    Nov 20 '18 at 13:09











  • It says invalid syntax.

    – Emma
    Nov 20 '18 at 13:10











  • This is a useful post: Reference - What does this regex mean?

    – pault
    Nov 20 '18 at 14:10
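
As an editor's note, here is a minimal sketch of the approach pault suggests, using pyspark.sql.functions.split together with Column.getItem() so the result is plain strings rather than a bracketed array (the lookahead pattern and the sample data are illustrative assumptions, not from the thread):

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('20-13-2012-monday',), ('20-14-2012-tues',)], ['A'])

# Split on the hyphen that is followed by a letter, so the date keeps its
# internal hyphens; getItem() then returns plain strings, not an array.
parts = split(col('A'), r'-(?=[a-z])')
df = df.withColumn('A1', parts.getItem(0)).withColumn('A2', parts.getItem(1))
df.show(truncate=False)

This also addresses the bracketed output reported in the comments: splitting on the full date regex leaves an empty first element, whereas splitting only on the final hyphen keeps both pieces intact.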
















1 Answer

You can use the regex in a UDF and achieve the required output like this:



>>> import re
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf

>>> def get_date_day(a):
...     x, y = re.split('^([\d]+-[\d]+-[\d]+)', a)[1:]
...     return [x, y[1:]]

>>> get_date_day('20-13-2012-monday')
['20-13-2012', 'monday']

>>> get_date_udf = udf(get_date_day, ArrayType(StringType()))

>>> df = sc.parallelize([('20-13-2012-monday',), ('20-14-2012-tues',), ('20-13-2012-wed',)]).toDF(['A'])
>>> df.show()
+-----------------+
|                A|
+-----------------+
|20-13-2012-monday|
|  20-14-2012-tues|
|   20-13-2012-wed|
+-----------------+

>>> df = df.withColumn("A12", get_date_udf('A'))
>>> df.show(truncate=False)
+-----------------+--------------------+
|A                |A12                 |
+-----------------+--------------------+
|20-13-2012-monday|[20-13-2012, monday]|
|20-14-2012-tues  |[20-14-2012, tues]  |
|20-13-2012-wed   |[20-13-2012, wed]   |
+-----------------+--------------------+

>>> df = df.withColumn("A1", udf(lambda x: x[0])('A12')).withColumn("A2", udf(lambda x: x[1])('A12'))
>>> df = df.drop('A12')
>>> df.show(truncate=False)
+-----------------+----------+------+
|A                |A1        |A2    |
+-----------------+----------+------+
|20-13-2012-monday|20-13-2012|monday|
|20-14-2012-tues  |20-14-2012|tues  |
|20-13-2012-wed   |20-13-2012|wed   |
+-----------------+----------+------+


Hope this helps!






answered Nov 21 '18 at 2:08 by Pavithran Ramachandran
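
As an editorial aside, the same split can be done without a UDF by calling regexp_extract twice (the function the question already imports); a minimal sketch, assuming a dataframe df with a string column A:

from pyspark.sql.functions import regexp_extract

# Group 1 captures the date, group 2 the day name; the third argument
# selects which capture group to return (non-matching rows yield '').
pattern = r'^(\d+-\d+-\d+)-(\w+)$'
df = df.withColumn('A1', regexp_extract('A', pattern, 1)) \
       .withColumn('A2', regexp_extract('A', pattern, 2))
df.show(truncate=False)

Column expressions like this generally outperform Python UDFs, which have to serialize every row out to a Python worker and back.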






















