Using regex to extract based on a recurring pattern excluding newline characters












I have a string as follows:



27223525

West Food Group B.V.9

52608670

Westcon

Group European Operations Netherlands Branch

30221053

Westland Infra Netbeheer B.V.

27176688

Wetransfer 85 B.V.

34380998

WETRAVEL B.V.

70669783


This string contains many newline characters. I want to ignore these, as well as every run of 6 or more digits. I came up with the following regex:



[^\n\d{6,}].+


This almost gets me there, as it returns all the company names. However, when a company name itself contains a newline character, it is returned as two different company names. For instance, Westcon is a match and Group European Operations Netherlands Branch is also a match. I would like to tweak the above expression so that the final match is Westcon Group European Operations Netherlands Branch. What regex concepts should I use to achieve this? Thanks.
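A note on why the character class above behaves differently from what its notation suggests: inside [...], quantifiers have no special meaning, so {6,} is read as the literal characters {, 6, , and }, and \d excludes every single digit rather than only runs of six or more. The following is a minimal sketch using Python's re module; the variable names and the shortened sample string are illustrative assumptions, not part of the original data.

```python
import re

# The character class from the question, on its own.
char_class = re.compile(r'[^\n\d{6,}]')

# Each of these single characters is rejected by the class,
# including the literal '{', ',' and '}'.
rejected = [c for c in '\n7{,}' if char_class.fullmatch(c) is None]
print(rejected)

# The full pattern: one allowed character, then the rest of the line.
pattern = re.compile(r'[^\n\d{6,}].+')

# Shortened, hypothetical sample in the spirit of the question's data.
sample = "27223525\nWestcon\nGroup European Operations Netherlands Branch\n30221053\n"
matches = pattern.findall(sample)
print(matches)
```

Because . does not match a newline by default, each match stops at the end of its line, which is why a name spread over two lines comes back as two matches.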



EDIT
I tried the following based on the comment below but got the wrong result



text = 'West Food Group B.V.9\n \n52608670\n \nWestcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n \nWestland Infra Netbeheer B.V.\n \n27176688\n \nWetransfer 85 B.V.\n \n34380998\n \nWETRAVEL B.V.\n \n70669783\n \nWeWork Companies (International) B.V.\n \n61501220\n \nWeWork Netherlands B.V.\n \n61505439\n \nWexford Finance B.V.\n \n27124941\n \nWFC\n-\nFood Safety B.V.\n \n11069471\n \nWhale Cloud Technology Netherlands B.V.\n \n63774801\n \nWHILL Europe B.V.\n \n72465700\n \nWhirlpool Nederland B.V.\n \n20042061\n \nWhitaker\n-\nTaylor Netherlands B.V.\n \n66255163\n \nWhite Oak B.V.\n'

re.findall(r'[^\n\d{6,}](?:(?:[a-z\s.]+(\n[a-z\s.])*)|.+)', text)









  • I am currently using regex101.com with the Python flavor, with the aim of later extending it to a Python script that uses re.

    – user32882
    Jan 2 at 14:47











  • I came up with this: [^\n\d{6,}](?:(?:[a-z\s.]+(\n[a-z\s.])*)|.+) regex101.com/r/UTFMyk/1

    – jcubic
    Jan 2 at 14:50











  • Try re.findall(r'^[A-Za-z].*(?:\n(?!\d+$).*)*', contents, re.M) where contents is file.read().

    – Wiktor Stribiżew
    Jan 2 at 14:58
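As a rough illustration of the pattern suggested in the last comment, the sketch below applies ^[A-Za-z].*(?:\n(?!\d+$).*)* with re.M and then collapses the whitespace inside each match. The helper name extract_names, the shortened sample, and the whitespace clean-up step are assumptions added for illustration; the comment itself only proposes the findall call. Note that this pattern stops at any digits-only line, not just at runs of six or more digits.

```python
import re

# Shortened sample in the same layout as the question's EDIT:
# names and numbers separated by whitespace-only lines.
text = (
    "27223525\n \n"
    "West Food Group B.V.9\n \n"
    "52608670\n \n"
    "Westcon\n \n"
    "Group European Operations Netherlands Branch\n \n"
    "30221053\n \n"
    "Wetransfer 85 B.V.\n \n"
    "34380998\n"
)

# A line starting with a letter, followed by any further lines
# that do not consist of digits only.
pattern = re.compile(r'^[A-Za-z].*(?:\n(?!\d+$).*)*', re.M)


def extract_names(contents):
    """Return each match with its internal newlines and spaces collapsed."""
    return [' '.join(match.split()) for match in pattern.findall(contents)]


names = extract_names(text)
print(names)
```

The negative lookahead (?!\d+$) is what lets a match continue across a newline: the next line is consumed unless it is made up of digits only.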
















python regex






edited Jan 2 at 14:59
asked Jan 2 at 14:36
user32882

5 Answers
Here is another answer based on your question edits:

import re

text = 'West Food Group B.V.9\n \n52608670\n \nWestcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n \nWestland Infra Netbeheer B.V.\n \n27176688\n \nWetransfer 85 B.V.\n \n34380998\n \nWETRAVEL B.V.\n \n70669783\n \nWeWork Companies (International) B.V.\n \n61501220\n \nWeWork Netherlands B.V.\n \n61505439\n \nWexford Finance B.V.\n \n27124941\n \nWFC\n-\nFood Safety B.V.\n \n11069471\n \nWhale Cloud Technology Netherlands B.V.\n \n63774801\n \nWHILL Europe B.V.\n \n72465700\n \nWhirlpool Nederland B.V.\n \n20042061\n \nWhitaker\n-\nTaylor Netherlands B.V.\n \n66255163\n \nWhite Oak B.V.\n'

# Match from the first letter on a line to the end of that line.
company_name_regex = re.findall(r'[A-Za-z].*|[A-Za-z].*\d{1,5}.*', text)

for i in range(len(company_name_regex)):
    # The last entry has no successor, so it is skipped.
    if i < len(company_name_regex) - 1:
        previous_company_name = company_name_regex[i]
        next_company_name = company_name_regex[i + 1]
        if 'Westcon' in previous_company_name and 'Group European Operations Netherlands Branch' in next_company_name:
            # Join the two halves of the split name.
            company_name = ' '.join([previous_company_name, next_company_name])
            print(company_name)
        else:
            if not 'Group European Operations Netherlands Branch' in previous_company_name:
                company_name = previous_company_name
                print(company_name)

**OUTPUTS**:
West Food Group B.V.9
Westcon Group European Operations Netherlands Branch
Westland Infra Netbeheer B.V.
Wetransfer 85 B.V.
WETRAVEL B.V.
WeWork Companies (International) B.V.
WeWork Netherlands B.V.
Wexford Finance B.V.
WFC
Food Safety B.V.
Whale Cloud Technology Netherlands B.V.
WHILL Europe B.V.
Whirlpool Nederland B.V.
Whitaker
Taylor Netherlands B.V.
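The approach above hardcodes the Westcon name, so other split names such as WFC and Whitaker stay in pieces. One possible generalization, sketched here as an assumption rather than as part of the answer, is to split the text on the number lines and treat everything between two numbers as a single name. The helper name names_between_numbers and the shortened sample are hypothetical.

```python
import re

# Assumption: a separator is a line consisting only of 6 or more digits,
# optionally surrounded by whitespace.
number_line = re.compile(r'^\s*\d{6,}\s*$', re.M)


def names_between_numbers(text):
    """Split on number lines and collapse whitespace in each remaining chunk."""
    chunks = number_line.split(text)
    names = (' '.join(chunk.split()) for chunk in chunks)
    return [name for name in names if name]


# Shortened sample in the layout of the question's EDIT.
sample = (
    "West Food Group B.V.9\n \n52608670\n \n"
    "Westcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n \n"
    "WFC\n-\nFood Safety B.V.\n \n11069471\n \n"
    "White Oak B.V.\n"
)
result = names_between_numbers(sample)
print(result)
```

Splitting on the separators avoids having to describe what a company name looks like; short digit runs inside a name are left alone because they never fill a whole line.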





I think that you only want the company names. If so, this should work.

import re
from pprint import pprint

# Named text rather than input, to avoid shadowing the built-in input().
text = '''27223525

West Food Group B.V.9

52608670

Westcon

Group European Operations Netherlands Branch

30221053

Westland Infra Netbeheer B.V.

27176688

Wetransfer 85 B.V.

34380998

WETRAVEL B.V.

70669783

'''

# Match from the first letter on a line to the end of that line.
# The second alternative is never reached, because the first already matches.
company_name_regex = re.findall(r'[A-Za-z].*|[A-Za-z].*\d{1,5}.*', text)

pprint(company_name_regex)

['West Food Group B.V.9',
 'Westcon',
 'Group European Operations Netherlands Branch',
 'Westland Infra Netbeheer B.V.',
 'Wetransfer 85 B.V.',
 'WETRAVEL B.V.']






edited Jan 2 at 15:10
answered Jan 2 at 14:59
Life is complex

        • Please see the edits above. Companies may have digits anywhere in their name. For this dataset, the number of digits within any one company name is always below 6. The digit runs I want to exclude always have 6 or more digits.

          – user32882
          Jan 2 at 15:01











        • Yes, I noted that the input example changed, so I have updated my answer.

          – Life is complex
          Jan 2 at 15:06






        • Doesn't look correct, as it matches Westcon as one match and the remaining part as a separate match.

          – anubhava
          Jan 2 at 15:20











        • So does Westcon always get linked to Group European Operations Netherlands Branch?

          – Life is complex
          Jan 2 at 15:25



















        This will create one group for lines that don't have numbers.



        regex: /(?!(\d{6,}|\n))[a-zA-Z .\n]+/g



        Demo: https://regex101.com/r/MMLGw6/1







answered Jan 2 at 14:52
Alex G

        • Except that some company names contain numbers with fewer than six digits. I have edited the question to reflect that.

          – user32882
          Jan 2 at 14:56



















Assuming your company names start with a letter, you may use this regex with the re.M modifier:

^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)

RegEx Demo

In Python:

import re

regex = re.compile(r"^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)", re.M)

This matches a line that starts with [a-zA-Z] until the end of the line, and then matches any further lines, separated by \n, that also start with an [a-zA-Z] character.

(?=\n+\d{6,}$) is a lookahead assertion to make sure our company names have a newline and 6+ digits ahead.
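A small sketch of how the compiled pattern above might be used. The sample text, the names list, and the clean-up with re.sub are assumptions added for illustration; the sample uses truly empty separator lines, which is the layout this pattern expects.

```python
import re

# Hypothetical sample: names and numbers separated by empty lines.
text = (
    "27223525\n\n"
    "West Food Group B.V.9\n\n"
    "52608670\n\n"
    "Westcon\n\n"
    "Group European Operations Netherlands Branch\n\n"
    "30221053\n\n"
    "Wetransfer 85 B.V.\n\n"
    "34380998\n"
)

regex = re.compile(r"^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)", re.M)

# Collapse the newlines inside each match into single spaces.
names = [re.sub(r"\s*\n\s*", " ", match) for match in regex.findall(text)]
print(names)

# With whitespace-only separator lines, as in the question's EDIT,
# \n+ can no longer bridge the gap, so nothing matches.
spaced = text.replace("\n\n", "\n \n")
spaced_matches = regex.findall(spaced)
print(spaced_matches)
```

If the real data has separator lines containing a space, stripping trailing whitespace from each line first is one way to make the input fit the pattern.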







answered Jan 2 at 15:25
anubhava

If you can solve this without regex, it should be solved without regex:

useful = []

for line in text.splitlines():
    # Keep non-empty lines that are not made up of digits only.
    if line.strip() and not line.strip().isdigit():
        useful.append(line.strip())

That should work, more or less. Replying from my phone, so I can't test.
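The loop above keeps every name line as a separate entry. A possible extension, sketched here under the assumption that a digits-only line of 6 or more characters always ends a company name, is to collect consecutive name lines and join them when the next number line arrives. The function name company_names and the sample are hypothetical.

```python
def company_names(text):
    """Group consecutive non-number lines into one name per group."""
    names = []
    parts = []  # pieces of the name currently being collected
    for line in text.splitlines():
        line = line.strip()
        if line.isdigit() and len(line) >= 6:
            # A number line closes the current name, if there is one.
            if parts:
                names.append(' '.join(parts))
                parts = []
        elif line:
            parts.append(line)
    # Flush a trailing name that is not followed by a number.
    if parts:
        names.append(' '.join(parts))
    return names


# Shortened sample in the layout of the question's EDIT.
sample = (
    "27223525\n \n"
    "Westcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n \n"
    "Wetransfer 85 B.V.\n \n34380998\n \n"
    "Whitaker\n-\nTaylor Netherlands B.V.\n \n66255163\n \n"
    "White Oak B.V.\n"
)
result = company_names(sample)
print(result)
```

Short digit runs such as the 85 in Wetransfer 85 B.V. are kept, because only lines that consist entirely of digits are treated as separators.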







                    edited Jan 2 at 15:41

























                    answered Jan 2 at 15:23









Hugo























                        0














Here is another answer based on your question edits. The join is hard-coded for the Westcon entry, so names split around a "-" line (WFC, Whitaker) still come out as two entries:

import re

text = 'West Food Group B.V.9\n \n52608670\n \nWestcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n \nWestland Infra Netbeheer B.V.\n \n27176688\n \nWetransfer 85 B.V.\n \n34380998\n \nWETRAVEL B.V.\n \n70669783\n \nWeWork Companies (International) B.V.\n \n61501220\n \nWeWork Netherlands B.V.\n \n61505439\n \nWexford Finance B.V.\n \n27124941\n \nWFC\n-\nFood Safety B.V.\n \n11069471\n \nWhale Cloud Technology Netherlands B.V.\n \n63774801\n \nWHILL Europe B.V.\n \n72465700\n \nWhirlpool Nederland B.V.\n \n20042061\n \nWhitaker\n-\nTaylor Netherlands B.V.\n \n66255163\n \nWhite Oak B.V.\n'

# Each match runs from the first letter on a line to the end of that line,
# so digit-only lines and the "-" lines are skipped. The original second
# alternative ([A-Za-z].*\d{1,5}.*) could never match and is dropped.
company_name_regex = re.findall(r'[A-Za-z].*', text)

continuation = 'Group European Operations Netherlands Branch'

for i, name in enumerate(company_name_regex):
    # look one entry ahead; the last entry has no successor
    next_name = company_name_regex[i + 1] if i + 1 < len(company_name_regex) else ''
    if 'Westcon' in name and continuation in next_name:
        print(' '.join([name, next_name]))
    elif continuation not in name:
        print(name)


**OUTPUTS**:
West Food Group B.V.9
Westcon Group European Operations Netherlands Branch
Westland Infra Netbeheer B.V.
Wetransfer 85 B.V.
WETRAVEL B.V.
WeWork Companies (International) B.V.
WeWork Netherlands B.V.
Wexford Finance B.V.
WFC
Food Safety B.V.
Whale Cloud Technology Netherlands B.V.
WHILL Europe B.V.
Whirlpool Nederland B.V.
Whitaker
Taylor Netherlands B.V.
White Oak B.V.
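If you would rather not hard-code a particular company, one possible regex-based sketch is to split the text on the number lines themselves and tidy up whatever is left in between. The function name `split_on_ids` and the shortened `sample` string are hypothetical, and the sketch assumes that every company is followed by a line consisting only of six or more digits.

```python
import re


def split_on_ids(text):
    """Split text on lines of 6+ digits and tidy each remaining chunk."""
    # With re.MULTILINE, ^ and $ anchor at line boundaries, so this only
    # matches a line made up entirely of six or more digits.
    chunks = re.split(r"^\d{6,}$", text, flags=re.MULTILINE)
    # str.split() with no arguments splits on any whitespace, including
    # newlines, so joining the pieces puts a multi-line name on one line.
    names = [" ".join(chunk.split()) for chunk in chunks]
    return [name for name in names if name]


# Hypothetical shortened sample in the same layout as the question's data.
sample = (
    "West Food Group B.V.9\n \n52608670\n \nWestcon\n \n"
    "Group European Operations Netherlands Branch\n \n30221053\n \n"
    "Wetransfer 85 B.V.\n \n34380998\n \n"
    "Whitaker\n-\nTaylor Netherlands B.V.\n"
)
for name in split_on_ids(sample):
    print(name)
```

Digits inside a name, such as the 9 in "B.V.9" or the 85 in "Wetransfer 85 B.V.", are left alone because the pattern has to cover a whole line.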













                            edited Jan 2 at 21:26

























                            answered Jan 2 at 21:19









Life is complex





























