Using regex to extract based on a recurring pattern excluding newline characters

I have a string as follows:

27223525



West Food Group B.V.9



52608670



Westcon



Group European Operations Netherlands Branch



30221053



Westland Infra Netbeheer B.V.



27176688



Wetransfer  85 B.V.



34380998



WETRAVEL B.V.



70669783

This string contains many newline characters. I want to explicitly ignore these, as well as all runs of 6 or more digits. I came up with the following regular expression:

[^\n\d{6,}].+

This almost gets me there, as it returns all the company names. However, when a company name itself contains a newline character, it is returned as two separate company names. For instance, Westcon is one match and Group European Operations Netherlands Branch is another. I would like to tweak the above expression so that the final match is Westcon Group European Operations Netherlands Branch. What regex concepts should I use to achieve this? Thanks.

EDIT
I tried the following based on the comment below but got the wrong result:

text = 'West Food Group B.V.9\n \n52608670\n \nWestcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n \nWestland Infra Netbeheer B.V.\n \n27176688\n \nWetransfer 85 B.V.\n \n34380998\n \nWETRAVEL B.V.\n \n70669783\n \nWeWork Companies (International) B.V.\n \n61501220\n \nWeWork Netherlands B.V.\n \n61505439\n \nWexford Finance B.V.\n \n27124941\n \nWFC\n-\nFood Safety B.V.\n \n11069471\n \nWhale Cloud Technology Netherlands B.V.\n \n63774801\n \nWHILL Europe B.V.\n \n72465700\n \nWhirlpool Nederland B.V.\n \n20042061\n \nWhitaker\n-\nTaylor Netherlands B.V.\n \n66255163\n \nWhite Oak B.V.\n'

re.findall(r'[^\n\d{6,}](?:(?:[a-z\s.]+(\n[a-z\s.])*)|.+)', text)

edited Jan 2 at 14:59

asked Jan 2 at 14:36

user32882

934729
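A side note on why the original pattern behaves the way it does: inside a character class, `{6,}` is not a quantifier, so `[^\n\d{6,}]` just means "one character that is not a newline, a digit, `{`, `,` or `}`". The sketch below illustrates this; the name "9 Lives B.V." is a made-up example, not part of the data in the question.

```python
import re

pattern = re.compile(r'[^\n\d{6,}].+')

# The class on its own: every one of these single characters is rejected.
first_char = re.compile(r'[^\n\d{6,}]')
rejected = [c for c in ['\n', '7', '{', ',', '}'] if first_char.fullmatch(c) is None]

# "9 Lives B.V." is a hypothetical company name that starts with a digit.
sample = "27223525\n9 Lives B.V.\nWestcon\nGroup European Operations"
matches = pattern.findall(sample)

print(rejected)  # all five characters are excluded by the class
print(matches)   # [' Lives B.V.', 'Westcon', 'Group European Operations']
```

So the id lines are skipped only because every character in them is a digit, and a name that begins with a digit loses its first character.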

I am currently using regex101.com with the Python flavor, with the aim of later extending it to a Python script which uses re.

– user32882
Jan 2 at 14:47

I came up with this: [^\n\d{6,}](?:(?:[a-z\s.]+(\n[a-z\s.])*)|.+) regex101.com/r/UTFMyk/1

– jcubic
Jan 2 at 14:50

Try re.findall(r'^[A-Za-z].*(?:\n(?!\d+$).*)*', contents, re.M) where contents is file.read().

– Wiktor Stribiżew
Jan 2 at 14:58
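For anyone trying the pattern from the comment above, here is a minimal sketch run on a shortened version of the text from the question's EDIT. The matches still contain the embedded newlines and the single-space lines, so they are normalized afterwards; joining the pieces with a single space is my assumption about the desired output, not something the comment specifies.

```python
import re

# Shortened version of the `text` string from the question's EDIT.
text = ('West Food Group B.V.9\n \n52608670\n \nWestcon\n \n'
        'Group European Operations Netherlands Branch\n \n30221053\n \n'
        'WFC\n-\nFood Safety B.V.\n \n11069471\n \nWhite Oak B.V.\n')

# Start at a line beginning with a letter, then keep consuming following
# lines as long as the next line is not made up of digits only.
raw = re.findall(r'^[A-Za-z].*(?:\n(?!\d+$).*)*', text, re.M)

# Collapse newlines and runs of whitespace inside each match.
names = [' '.join(m.split()) for m in raw]
print(names)
```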

python regex

5 Answers

I think that you only want the company names. If so, this should work.

import re
from pprint import pprint

input = '''27223525



West Food Group B.V.9



52608670



Westcon



Group European Operations Netherlands Branch



30221053



Westland Infra Netbeheer B.V.



27176688



Wetransfer 85 B.V.



34380998



WETRAVEL B.V.



70669783



'''



company_name_regex = re.findall(r'[A-Za-z].*|[A-Za-z].*\d{1,5}.*', input)

pprint(company_name_regex)

['West Food Group B.V.9',
 'Westcon',
 'Group European Operations Netherlands Branch',
 'Westland Infra Netbeheer B.V.',
 'Wetransfer 85 B.V.',
 'WETRAVEL B.V.']

edited Jan 2 at 15:10

answered Jan 2 at 14:59

Life is complex

598518

Please see the edits above. Companies may have digits anywhere in their name. For this dataset, the number of digits within any one company name is always below 6. The digit runs I want to exclude always have 6 or more digits.

– user32882
Jan 2 at 15:01

Yes, I noted that the input example changed, so I have updated my answer.

– Life is complex
Jan 2 at 15:06

Doesn't look correct as it matches Westcon as a separate match and remaining part as a separate match.

– anubhava
Jan 2 at 15:20

So does Westcon always get linked to Group European Operations Netherlands Branch?

– Life is complex
Jan 2 at 15:25

This will create one group for lines that don't have numbers.

regex: /(?!(\d{6,}|\n))[a-zA-Z .\n]+/g

Demo: https://regex101.com/r/MMLGw6/1

answered Jan 2 at 14:52

Alex G

1,4172410

except some company names have numbers below six digits. I have edited the question to reflect that

– user32882
Jan 2 at 14:56


Assuming your company names start with a letter, you may use this regex with the re.M modifier:

^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)

RegEx Demo

In Python:

regex = re.compile(r"^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)", re.M)

This matches a line that starts with [a-zA-Z] until the end of the line, and then matches any further lines, separated by \n, that also start with [a-zA-Z] characters.

(?=\n+\d{6,}$) is a lookahead assertion that makes sure the company name is followed by one or more newlines and 6+ digits.

answered Jan 2 at 15:25

anubhava

533k48331408
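A minimal sketch of how the regex from the answer above might be used, assuming the layout of the first sample in the question (names and ids separated by empty lines only). The whitespace normalization at the end is my addition; the answer itself stops at the compiled pattern.

```python
import re

# Shortened copy of the first sample: ids and names separated by empty lines.
sample = """27223525

West Food Group B.V.9

52608670

Westcon

Group European Operations Netherlands Branch

30221053

WETRAVEL B.V.

70669783
"""

regex = re.compile(r"^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)", re.M)

# Each match may span several lines; collapse the embedded newlines.
names = [" ".join(m.split()) for m in regex.findall(sample)]
print(names)
```

Note that in the `text` string from the question's EDIT the separator lines contain a single space, so neither `\n+[a-zA-Z]` nor the lookahead would match there without also allowing for that whitespace.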


If you can solve this without regex it should be solved without regex:

useful = []

for line in text.splitlines():
    # keep non-empty lines that are not purely numeric
    if line.strip() and not line.strip().isdigit():
        useful.append(line.strip())

That should work - more or less. Replying from my phone so can't test.

edited Jan 2 at 15:41

answered Jan 2 at 15:23

Hugo

38929
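The loop in the answer above keeps every name line as its own entry, so Westcon and Group European Operations Netherlands Branch would still be separate. One possible extension, sketched here without regex, is to treat each id line as a delimiter and join whatever name lines sit between two ids. The function name `company_names` and the choice to join with a single space are my assumptions.

```python
def company_names(text):
    """Group the non-id lines between consecutive id lines into one name."""
    names, parts = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip empty or whitespace-only lines
        if line.isdigit() and len(line) >= 6:
            # An id line closes the name collected so far.
            if parts:
                names.append(" ".join(parts))
                parts = []
        else:
            parts.append(line)
    if parts:  # trailing name with no id after it
        names.append(" ".join(parts))
    return names


text = ('West Food Group B.V.9\n \n52608670\n \nWestcon\n \n'
        'Group European Operations Netherlands Branch\n \n30221053\n \n'
        'WFC\n-\nFood Safety B.V.\n \n11069471\n')
print(company_names(text))
```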


Here is another answer based on your question edits:

import re

text = 'West Food Group B.V.9\n \n52608670\n \nWestcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n \nWestland Infra Netbeheer B.V.\n \n27176688\n \nWetransfer 85 B.V.\n \n34380998\n \nWETRAVEL B.V.\n \n70669783\n \nWeWork Companies (International) B.V.\n \n61501220\n \nWeWork Netherlands B.V.\n \n61505439\n \nWexford Finance B.V.\n \n27124941\n \nWFC\n-\nFood Safety B.V.\n \n11069471\n \nWhale Cloud Technology Netherlands B.V.\n \n63774801\n \nWHILL Europe B.V.\n \n72465700\n \nWhirlpool Nederland B.V.\n \n20042061\n \nWhitaker\n-\nTaylor Netherlands B.V.\n \n66255163\n \nWhite Oak B.V.\n'

company_name_regex = re.findall(r'[A-Za-z].*|[A-Za-z].*\d{1,5}.*', text)

# Stop one short of the end so that company_name_regex[i + 1] always exists
# (as a side effect, the last name in the list is never printed).
for i in range(len(company_name_regex) - 1):
    previous_company_name = company_name_regex[i]
    next_company_name = company_name_regex[i + 1]
    if 'Westcon' in previous_company_name and 'Group European Operations Netherlands Branch' in next_company_name:
        # join the two halves of the split company name
        company_name = ' '.join([previous_company_name, next_company_name])
        print(company_name)
    elif 'Group European Operations Netherlands Branch' not in previous_company_name:
        company_name = previous_company_name
        print(company_name)

**OUTPUTS**:

West Food Group B.V.9

Westcon Group European Operations Netherlands Branch

Westland Infra Netbeheer B.V.

Wetransfer 85 B.V.

WETRAVEL B.V.

WeWork Companies (International) B.V.

WeWork Netherlands B.V.

Wexford Finance B.V.

WFC

Food Safety B.V.

Whale Cloud Technology Netherlands B.V.

WHILL Europe B.V.

Whirlpool Nederland B.V.

Whitaker

Taylor Netherlands B.V.

edited Jan 2 at 21:26

answered Jan 2 at 21:19

Life is complex

598518
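The answer above hard-codes the two halves of one company name, which leaves names such as WFC / Food Safety B.V. split. A more general alternative, offered only as a sketch, is to split the text on the id lines with `re.split` and normalize whatever lies between them. The names `ID_LINE` and `split_on_ids` are hypothetical, and treating "a line consisting only of 6 or more digits" as the delimiter is an assumption based on the question.

```python
import re

# A delimiter is a whole line made of 6+ digits, optionally padded with blanks.
ID_LINE = re.compile(r'^[ \t]*\d{6,}[ \t]*$', re.M)


def split_on_ids(text):
    """Return the text between id lines, with inner whitespace collapsed."""
    chunks = ID_LINE.split(text)
    names = [' '.join(chunk.split()) for chunk in chunks]
    return [name for name in names if name]  # drop empty chunks


text = ('Wetransfer 85 B.V.\n \n34380998\n \nWhitaker\n-\n'
        'Taylor Netherlands B.V.\n \n66255163\n \nWhite Oak B.V.\n')
print(split_on_ids(text))
```

Because the digits must fill the whole line, short numbers inside a name (such as the 85 in Wetransfer 85 B.V.) are left alone.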
