Using regex to extract based on a recurring pattern excluding newline characters
I have a string as follows:
27223525
West Food Group B.V.9
52608670
Westcon
Group European Operations Netherlands Branch
30221053
Westland Infra Netbeheer B.V.
27176688
Wetransfer 85 B.V.
34380998
WETRAVEL B.V.
70669783
This string contains many newline characters. I want to explicitly ignore these, as well as all numbers with 6 or more digits. I came up with the following regular expression:
[^\n\d{6,}].+
This almost gets me there, as it returns all the company names. However, when a company name itself contains a newline character, the two parts are returned as two different company names. For instance, Westcon is one match and Group European Operations Netherlands Branch is another. I would like to tweak the above expression so that the final match is Westcon Group European Operations Netherlands Branch. What regex concepts should I use to achieve this? Thanks.
EDIT
I tried the following based on the comment below, but got the wrong result:
import re
text = 'West Food Group B.V.9\n \n52608670\n \nWestcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n \nWestland Infra Netbeheer B.V.\n \n27176688\n \nWetransfer 85 B.V.\n \n34380998\n \nWETRAVEL B.V.\n \n70669783\n \nWeWork Companies (International) B.V.\n \n61501220\n \nWeWork Netherlands B.V.\n \n61505439\n \nWexford Finance B.V.\n \n27124941\n \nWFC\n-\nFood Safety B.V.\n \n11069471\n \nWhale Cloud Technology Netherlands B.V.\n \n63774801\n \nWHILL Europe B.V.\n \n72465700\n \nWhirlpool Nederland B.V.\n \n20042061\n \nWhitaker\n-\nTaylor Netherlands B.V.\n \n66255163\n \nWhite Oak B.V.\n'
re.findall(r'[^\n\d{6,}](?:(?:[a-z\s.]+(\n[a-z\s.])*)|.+)', text)
python regex
edited Jan 2 at 14:59
user32882
asked Jan 2 at 14:36
user32882
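A note on why the pattern in the question behaves the way it does: inside square brackets, \d{6,} is not a quantified run of digits. The class [^\n\d{6,}] is a negated character class that matches any single character other than a newline, a digit, or one of the literal characters {, comma and }. The sketch below illustrates this with Python's re module; the name "3M Nederland B.V." is a made-up example and is not part of the dataset.

```python
import re

pattern = re.compile(r'[^\n\d{6,}].+')

# Inside [...], the braces and the comma are literal characters, not a quantifier.
excluded = re.findall(r'[^\n\d{6,}]', '12{,}ab')

# Each name line matches on its own, so a name split across two lines
# produces two separate matches; the all-digit line produces none.
split_name = pattern.findall('Westcon\nGroup European Operations Netherlands Branch\n30221053')

# Hypothetical name that starts with a digit: the leading digit is dropped.
digit_first = pattern.findall('3M Nederland B.V.')

print(excluded)
print(split_name)
print(digit_first)
```

This suggests that the six-digit condition has to be expressed outside the character class, for example with an anchored pattern or a lookahead.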
I am currently using regex101.com with the Python flavor, with the aim of later extending it to a Python script which uses re.
– user32882
Jan 2 at 14:47
I've come up with this: [^\n\d{6,}](?:(?:[a-z\s.]+(\n[a-z\s.])*)|.+)
regex101.com/r/UTFMyk/1
– jcubic
Jan 2 at 14:50
Try re.findall(r'^[A-Za-z].*(?:\n(?!\d+$).*)*', contents, re.M) where contents is file.read().
– Wiktor Stribiżew
Jan 2 at 14:58
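The pattern suggested in the last comment can be tried on the sample from the question. This is a minimal sketch, assuming the data is already held in a string called contents (a literal here rather than the result of file.read()). The negative lookahead (?!\d+$) lets a match continue onto the next line unless that line consists only of digits.

```python
import re

contents = """27223525
West Food Group B.V.9
52608670
Westcon
Group European Operations Netherlands Branch
30221053
Westland Infra Netbeheer B.V.
27176688
Wetransfer 85 B.V.
34380998
WETRAVEL B.V.
70669783
"""

# re.M makes ^ and $ match at the start and end of every line.
matches = re.findall(r'^[A-Za-z].*(?:\n(?!\d+$).*)*', contents, re.M)
for match in matches:
    print(repr(match))
```

The two-line name comes back as a single match that still contains the newline, so it would need a further replace step if a single-line name is wanted.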
5 Answers
I think that you only want the company names. If so, this should work.
import re
from pprint import pprint

text = '''27223525
West Food Group B.V.9
52608670
Westcon
Group European Operations Netherlands Branch
30221053
Westland Infra Netbeheer B.V.
27176688
Wetransfer 85 B.V.
34380998
WETRAVEL B.V.
70669783
'''
# Match from the first letter on each line to the end of that line;
# all-digit lines produce no match.
company_name_regex = re.findall(r'[A-Za-z].*|[A-Za-z].*\d{1,5}.*', text)
pprint(company_name_regex)
['West Food Group B.V.9',
 'Westcon',
 'Group European Operations Netherlands Branch',
 'Westland Infra Netbeheer B.V.',
 'Wetransfer 85 B.V.',
 'WETRAVEL B.V.']
edited Jan 2 at 15:10
answered Jan 2 at 14:59


Life is complex
Please see the edits above. Companies may have digits anywhere in their name. For this dataset, the number of digits within any one company name is always below 6. The numbers I want to exclude always have more than 6 digits.
– user32882
Jan 2 at 15:01
Yes, I noted that the input example changed, so I have updated my answer.
– Life is complex
Jan 2 at 15:06
Doesn't look correct, as it matches Westcon as a separate match and the remaining part as a separate match.
– anubhava
Jan 2 at 15:20
So does Westcon always get linked to Group European Operations Netherlands Branch?
– Life is complex
Jan 2 at 15:25
This will create one group for lines that don't have numbers.
regex: /(?!(\d{6,}|\n))[a-zA-Z .\n]+/g
Demo: https://regex101.com/r/MMLGw6/1
answered Jan 2 at 14:52
Alex G
Except that some company names contain numbers with fewer than six digits. I have edited the question to reflect that.
– user32882
Jan 2 at 14:56
Assuming your company names start with a letter, you may use this regex with the re.M modifier:
^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)
RegEx Demo
In Python:
regex = re.compile(r"^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)", re.M)
This matches a line that starts with [a-zA-Z] until the end of the line, and then matches more lines, separated by \n, that also start with [a-zA-Z] characters.
(?=\n+\d{6,}$) is a lookahead assertion to make sure our company names have a newline and 6+ digits ahead.
answered Jan 2 at 15:25
anubhava
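To try the lookahead-based pattern above on the sample data, something like the following sketch may help. The helper name extract_names and the step that collapses the embedded newline into a single space are assumptions added for illustration; they are not part of the answer.

```python
import re

NAME_RE = re.compile(r"^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)", re.M)


def extract_names(text):
    """Return company names, joining multi-line names with single spaces."""
    # str.split() with no arguments splits on any whitespace, newlines included,
    # so re-joining with " " turns "Westcon\nGroup ..." into one line.
    return [" ".join(match.split()) for match in NAME_RE.findall(text)]


sample = """27223525
West Food Group B.V.9
52608670
Westcon
Group European Operations Netherlands Branch
30221053
Westland Infra Netbeheer B.V.
27176688
Wetransfer 85 B.V.
34380998
WETRAVEL B.V.
70669783
"""

names = extract_names(sample)
print(names)
```

Because of the lookahead, a name is only returned when a line of six or more digits follows it directly, so a trailing name without a number, or data with whitespace-only lines between entries, would not match with this pattern as written.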
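If you can solve this without regex, it should be solved without regex:
useful = []
for line in text.splitlines():  # split() would also break names apart at every space
    line = line.strip()
    # keep non-blank lines that are not purely numeric
    if line and not line.isdigit():
        useful.append(line)
That should work, more or less. Replying from my phone so can't test.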
edited Jan 2 at 15:41
answered Jan 2 at 15:23


Hugo
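The line-by-line approach can also merge names that are split over several lines. The sketch below is one possible way to do that, assuming that any line consisting solely of six or more digits is an ID and that every run of other non-blank lines between two IDs forms a single name. The function name group_names is made up for illustration.

```python
import re
from itertools import groupby

ID_RE = re.compile(r"\d{6,}")


def group_names(text):
    """Join each run of consecutive non-ID lines into one company name."""
    # Strip surrounding whitespace and drop lines that are blank afterwards.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    names = []
    # groupby yields consecutive runs of lines that share the same key.
    for is_id, run in groupby(lines, key=lambda line: ID_RE.fullmatch(line) is not None):
        if not is_id:
            names.append(" ".join(run))
    return names


text = ('West Food Group B.V.9\n \n52608670\n \nWestcon\n \n'
        'Group European Operations Netherlands Branch\n \n30221053\n \n'
        'WFC\n-\nFood Safety B.V.\n \n11069471\n')
print(group_names(text))
```

Unlike a check with str.isdigit(), the six-digit threshold keeps a short all-digit line as part of a name, which seems closer to the rule stated in the question.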
Here is another answer based on your question edits:
import re

text = 'West Food Group B.V.9\n \n52608670\n \nWestcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n \nWestland Infra Netbeheer B.V.\n \n27176688\n \nWetransfer 85 B.V.\n \n34380998\n \nWETRAVEL B.V.\n \n70669783\n \nWeWork Companies (International) B.V.\n \n61501220\n \nWeWork Netherlands B.V.\n \n61505439\n \nWexford Finance B.V.\n \n27124941\n \nWFC\n-\nFood Safety B.V.\n \n11069471\n \nWhale Cloud Technology Netherlands B.V.\n \n63774801\n \nWHILL Europe B.V.\n \n72465700\n \nWhirlpool Nederland B.V.\n \n20042061\n \nWhitaker\n-\nTaylor Netherlands B.V.\n \n66255163\n \nWhite Oak B.V.\n'
company_name_regex = re.findall(r'[A-Za-z].*|[A-Za-z].*\d{1,5}.*', text)
for i in range(len(company_name_regex)):
    if i < len(company_name_regex) - 1:
        previous_company_name = company_name_regex[i]
        next_company_name = company_name_regex[i + 1]
        # Special case: join the two halves of the one name known to be split.
        if 'Westcon' in previous_company_name and 'Group European Operations Netherlands Branch' in next_company_name:
            company_name = ' '.join([previous_company_name, next_company_name])
            print(company_name)
        else:
            # Skip the second half, which was already printed with the first.
            if 'Group European Operations Netherlands Branch' not in previous_company_name:
                company_name = previous_company_name
                print(company_name)
**OUTPUTS**:
West Food Group B.V.9
Westcon Group European Operations Netherlands Branch
Westland Infra Netbeheer B.V.
Wetransfer 85 B.V.
WETRAVEL B.V.
WeWork Companies (International) B.V.
WeWork Netherlands B.V.
Wexford Finance B.V.
WFC
Food Safety B.V.
Whale Cloud Technology Netherlands B.V.
WHILL Europe B.V.
Whirlpool Nederland B.V.
Whitaker
Taylor Netherlands B.V.
edited Jan 2 at 21:26
answered Jan 2 at 21:19


Life is complex