georgetheroux1
asked on
Regex to get defined term and meaning from contract
Given the following body of text, I'm trying to write a regex that gives 2 parts, the definition, and the meaning.
I've got to:
/(^[A-Z0-9].+)(means|is\sdefined).*(?=\.\n)/gm
but I'm a little stuck as the line breaks within the definitions meanings cause my draft regex to break...
Any help much appreciated.
Approved Counterparties, means any counterparty to a Hedging Agreement with the Borrower that (a) is a Lender or an Affiliate of a Lender or (B)
has a credit rating of Baa1 or higher from Moody€s or BBB+ or higher from S&P.
Approved Engineer, means Netherland, Sewell and Associates, Inc., Miller and Lents, Ltd., Ryder Scott Company, L.P., or any other independent
petroleum engineer satisfactory to the Administrative Agent in its sole and absolute discretion.
Approved Fund, means any Person (other than a natural Person) that (a) is engaged in making, purchasing, holding or otherwise investing in
commercial loans and similar extensions of credit in the ordinary course of its business, and (b) is administered or managed by a Lender, an Affiliate
of a Lender or a Person or an Affiliate of a Person that administers or manages a Lender.
Credit Agreement (First Lien)
4
Authorized Officer, means, relative to any Obligor, those of its officers, general partners or managing members (as applicable) whose signatures
and incumbency shall have been certified to the Administrative Agent, the Lenders and the Issuers pursuant to Section 5.1.1 or pursuant to the other
provisions of this Agreement.
Base Amount, is defined in Section 3.3.3.
Base Rate, means, at any time, the rate of interest then most recently established by the Administrative Agent as its base rate for Dollars loaned in
the United States. The Base Rate is not necessarily intended to be the lowest rate of interest determined by the Administrative Agent in connection
with extensions of credit.
Base Rate Loan, means a Loan bearing interest at a fluctuating rate determined by reference to the Alternate Base Rate.
BNPPSC, is defined in the preamble.
Borrower, is defined in the preamble.
Borrower Pledge and Security Agreement, means the Second Amended and Restated First Lien Pledge and Security Agreement and Irrevocable
Proxy executed and delivered by an Authorized Officer of the Borrower, substantially in the form of Exhibit G- 1 hereto, as amended, supplemented,
amended and restated or otherwise modified from time to time.
Borrowing, means the Loans of the same Type and, in the case of LIBO Rate Loans having the same Interest Period made by all Lenders required to
make such Loans on the same Business Day and pursuant to the same Borrowing Request in accordance with Section 2.3.
Borrowing Base, means at any time an amount equal to the amount determined in accordance with Section 2.8, as the same may be adjusted from
time to time in accordance with the terms hereof.
Borrowing Base Deficiency, is defined in Section 3.1.1(c).
The difficulty I see in your current regex is that it does not use the comma after the term as the separator.
In Perl, while it is possible to pattern match across lines, I often aggregate into a single line when possible and then evaluate. In your data, the presence of "means" or "is defined" can be used as an indication that the prior line/grouping is complete.
Splitting the line on the comma after the term might be simpler than trying to come up with a regex that has to account for all the punctuation (commas, semicolons, etc.).
($term,$restofline)=split(/,/,$line,2);
If there is a match for means/meaning in $restofline, you can assign the matched pattern to $meaning or $definition.
Let me try to think this through and give it a shot.
I'll also compare what your regex gets; possibly it is 95% there.
/(^[A-Z0-9].+?)(means|is\sdefined)(?s:.*?)(?=\.\n)/gm
If you wanted to capture the meaning in addition to the (means|is\sdefined)
/(^[A-Z0-9].+?)((?:means|is\sdefined)(?s:.*?)(?=\.\n))/gm
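For anyone who wants to try the two-group version outside a web tester, here is a minimal sketch in Python (my choice of language, not the thread's; Python 3.6+ accepts the scoped (?s:...) group). The sample text is abbreviated from the question, and the names DEFINITION_RE and extract are made up for illustration.

```python
import re

# The two-group pattern from the post above. The scoped (?s:...) group lets
# only the meaning span newlines; the term itself stays on one line.
DEFINITION_RE = re.compile(
    r"(^[A-Z0-9].+?)((?:means|is\sdefined)(?s:.*?)(?=\.\n))",
    re.MULTILINE,
)

# Abbreviated sample entries (shortened from the text in the question).
SAMPLE = (
    "Approved Engineer, means Netherland, Sewell and Associates, Inc., or any other independent\n"
    "petroleum engineer.\n"
    "Base Amount, is defined in Section 3.3.3.\n"
)


def extract(text):
    """Return (term, meaning) pairs; the trailing ', ' is trimmed from the term."""
    return [
        (m.group(1).rstrip(" ,"), m.group(2))
        for m in DEFINITION_RE.finditer(text)
    ]


for term, meaning in extract(SAMPLE):
    print(repr(term), "->", repr(meaning))
```

Note that the lookahead leaves the final period out of the second group, and that "Inc.," does not end the match because its period is not followed by a newline.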
Here is the script.
At this point it outputs the Term/Explanation. You can modify it if you want to assign the items to variables/arrays/hashes.
The example deals with splitting the two
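#!/usr/bin/perl
use strict;
use warnings;

my $line = '';

# Print the accumulated entry as a term/explanation pair.
sub flush_entry {
    return unless length $line;
    my ($term, $rest) = split /,/, $line, 2;
    $rest = '' unless defined $rest;    # entry without a comma
    print "\n\nTerm $term\n\tExplanation: $rest\n";
}

while (<>) {
    chomp;
    if (/, (means|is defined)/) {
        # A new definition starts here, so the previous entry is complete.
        flush_entry();
        $line = $_;
    }
    else {
        # Continuation line: append it to the current entry.
        $line .= " $_";
    }
}
flush_entry();    # emit the last entry, which nothing else would trigger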
ASKER
@ozo - I just tried your regexes out here: http://regexr.com/3at2a - unfortunately no matches are found :(
@arnold - I'm trying to achieve this with pure regex, as it's part of a wider project.
@wilcoxon - that doesn't work unfortunately
http://regexr.com/3at2a does not seem to understand /s or (?s:) so you can use [\s\S] instead
/(^[A-Z0-9].+?)((?:means|is\sdefined)([\s\S]*?)(?=\.\n))/gm
http://regexr.com/3at2m
Is making sure the entry is one line first an option?
If it is one line
/^([a-zA-Z0-9_\- ]+)\,(.*)$/
$1 term
$2 definition or meaning.
ASKER
@arnold, unfortunately not. I'm working from PDFs so I don't have that luxury. The snippet I provided on regexr is what happens when I grab the text from the PDF.
How do you handle the PDF: do you convert it to PS, to HTML, or to text?
Or is this a select all on a PDF and paste into a new text document?
Or is there some PDF file that perl is opening?
ASKER
@ozo - ok, but as far as I can see, that example still yields no matches :(
@arnold, I'm converting the PDF to plain text, then running the regex. The dump I linked to on regexr.com is an example of such a plain text output.
Did you try it on your "Approved Counterparties, means" text above, or on the "UNITED STATES SECURITIES AND EXCHANGE COMMISSION" text that it seems to default to?
ASKER
@ozo - re: your comment at 05:53:23, that works amazingly, I've tweaked to:
(^[A-Z0-9].+?)((?:means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\n))
It does, however, fall down on the following edge case, where 'of any person' ends up being contained in the definition. Any ideas of how I can prevent this from happening?
Affiliate of any Person means any other Person that, directly or indirectly, controls, is controlled by or is under common control with such
Person. Control of a Person means the power, directly or indirectly,
Where should 'of any person' end up?
ASKER
@ozo - as part of the second group (the definition),
i.e.
Group 1: Affiliate
Group 2: of any Person means any....
ASKER
@ozo, this works, but I wonder if there is a cleaner way to write it (without repeating 'means')
(^[A-Z0-9].+?)((?:of\sany\sPerson\smeans|means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\n))
(^[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\n))
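As a rough check of the optional-group version, here is a small Python sketch (Python is an assumption on my part; the asker is on PHP). The Affiliate entry is shortened and made to end in a period plus newline so that the (?=\.\n) lookahead can fire, which the snippet posted earlier does not do. PATTERN and split_definitions are illustrative names only.

```python
import re

# Same pattern as above: the optional (?:of\sany\sPerson\s)? avoids repeating 'means'.
PATTERN = re.compile(
    r"(^[A-Z0-9].+?)"
    r"((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)"
    r"[\s\S]*?(?=\.\n))",
    re.MULTILINE,
)

# Shortened, adapted sample: the Affiliate entry is edited to end in ".\n".
SAMPLE = (
    "Affiliate of any Person means any other Person that controls such\n"
    "Person.\n"
    "Base Amount, is defined in Section 3.3.3.\n"
)


def split_definitions(text):
    """Return (group 1, group 2) pairs with trailing spaces/commas trimmed from group 1."""
    return [
        (m.group(1).rstrip(" ,"), m.group(2))
        for m in PATTERN.finditer(text)
    ]


for term, rest in split_definitions(SAMPLE):
    print(repr(term), "->", repr(rest))
```

With this input, group 1 for the first entry is just "Affiliate" and "of any Person means ..." lands in group 2, while the ordinary "Base Amount" entry is unaffected.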
Is there any way to have the process that converts the data from PDF to plain text also combine these lines/sections, so that a far simpler regex could handle them?
This regex pattern applied to your sample text
([A-Z].+?),\s*(means|is defined in)\b\s*((?:.|\n)+?\.)(?:\r\n|$)
...parsed the following 13 matches:
Match 0 Start(0) Length(226)
SubMatch 0: Approved Counterparties
SubMatch 1: means
SubMatch 2: any counterparty to a Hedging Agreement with the Borrower that (a) is a Lender or an Affiliate of a Lender or (B) has a credit rating of Baa1 or higher from Moody€s or BBB+ or higher from S&P.
Match 1 Start(226) Length(241)
SubMatch 0: Approved Engineer
SubMatch 1: means
SubMatch 2: Netherland, Sewell and Associates, Inc., Miller and Lents, Ltd., Ryder Scott Company, L.P., or any other independent petroleum engineer satisfactory to the Administrative Agent in its sole and absolute discretion.
Match 2 Start(467) Length(383)
SubMatch 0: Approved Fund
SubMatch 1: means
SubMatch 2: any Person (other than a natural Person) that (a) is engaged in making, purchasing, holding or otherwise investing in commercial loans and similar extensions of credit in the ordinary course of its business, and (b) is administered or managed by a Lender, an Affiliate of a Lender or a Person or an Affiliate of a Person that administers or manages a Lender.
Match 3 Start(886) Length(327)
SubMatch 0: Authorized Officer
SubMatch 1: means
SubMatch 2: , relative to any Obligor, those of its officers, general partners or managing members (as applicable) whose signatures and incumbency shall have been certified to the Administrative Agent, the Lenders and the Issuers pursuant to Section 5.1.1 or pursuant to the other provisions of this Agreement.
Match 4 Start(1213) Length(43)
SubMatch 0: Base Amount
SubMatch 1: is defined in
SubMatch 2: Section 3.3.3.
Match 5 Start(1256) Length(329)
SubMatch 0: Base Rate
SubMatch 1: means
SubMatch 2: , at any time, the rate of interest then most recently established by the Administrative Agent as its base rate for Dollars loaned in the United States. The Base Rate is not necessarily intended to be the lowest rate of interest determined by the Administrative Agent in connection with extensions of credit.
Match 6 Start(1585) Length(121)
SubMatch 0: Base Rate Loan
SubMatch 1: means
SubMatch 2: a Loan bearing interest at a fluctuating rate determined by reference to the Alternate Base Rate.
Match 7 Start(1706) Length(37)
SubMatch 0: BNPPSC
SubMatch 1: is defined in
SubMatch 2: the preamble.
Match 8 Start(1743) Length(39)
SubMatch 0: Borrower
SubMatch 1: is defined in
SubMatch 2: the preamble.
Match 9 Start(1782) Length(347)
SubMatch 0: Borrower Pledge and Security Agreement
SubMatch 1: means
SubMatch 2: the Second Amended and Restated First Lien Pledge and Security Agreement and Irrevocable Proxy executed and delivered by an Authorized Officer of the Borrower, substantially in the form of Exhibit G- 1 hereto, as amended, supplemented, amended and restated or otherwise modified from time to time.
Match 10 Start(2129) Length(262)
SubMatch 0: Borrowing
SubMatch 1: means
SubMatch 2: the Loans of the same Type and, in the case of LIBO Rate Loans having the same Interest Period made by all Lenders required to make such Loans on the same Business Day and pursuant to the same Borrowing Request in accordance with Section 2.3.
Match 11 Start(2391) Length(192)
SubMatch 0: Borrowing Base
SubMatch 1: means
SubMatch 2: at any time an amount equal to the amount determined in accordance with Section 2.8, as the same may be adjusted from time to time in accordance with the terms hereof.
Match 12 Start(2583) Length(58)
SubMatch 0: Borrowing Base Deficiency
SubMatch 1: is defined in
SubMatch 2: Section 3.1.1(c).
ASKER
@aikimark, thanks, any way of making it just 2 groups? I.e. combining submatch 1 & 2 ?
Try this pattern:
([A-Z].+?),\s*((?:means|is defined in)\b\s*(?:.|\n)+?\.)(?:\r\n|$)
Which parses thusly:
Match 0 Start(0) Length(226)
SubMatch 0: Approved Counterparties
SubMatch 1: means any counterparty to a Hedging Agreement with the Borrower that (a) is a Lender or an Affiliate of a Lender or (B) has a credit rating of Baa1 or higher from Moody€s or BBB+ or higher from S&P.
Match 1 Start(226) Length(241)
SubMatch 0: Approved Engineer
SubMatch 1: means Netherland, Sewell and Associates, Inc., Miller and Lents, Ltd., Ryder Scott Company, L.P., or any other independent petroleum engineer satisfactory to the Administrative Agent in its sole and absolute discretion.
Match 2 Start(467) Length(383)
SubMatch 0: Approved Fund
SubMatch 1: means any Person (other than a natural Person) that (a) is engaged in making, purchasing, holding or otherwise investing in commercial loans and similar extensions of credit in the ordinary course of its business, and (b) is administered or managed by a Lender, an Affiliate of a Lender or a Person or an Affiliate of a Person that administers or manages a Lender.
Match 3 Start(886) Length(327)
SubMatch 0: Authorized Officer
SubMatch 1: means, relative to any Obligor, those of its officers, general partners or managing members (as applicable) whose signatures and incumbency shall have been certified to the Administrative Agent, the Lenders and the Issuers pursuant to Section 5.1.1 or pursuant to the other provisions of this Agreement.
Match 4 Start(1213) Length(43)
SubMatch 0: Base Amount
SubMatch 1: is defined in Section 3.3.3.
Match 5 Start(1256) Length(329)
SubMatch 0: Base Rate
SubMatch 1: means, at any time, the rate of interest then most recently established by the Administrative Agent as its base rate for Dollars loaned in the United States. The Base Rate is not necessarily intended to be the lowest rate of interest determined by the Administrative Agent in connection with extensions of credit.
Match 6 Start(1585) Length(121)
SubMatch 0: Base Rate Loan
SubMatch 1: means a Loan bearing interest at a fluctuating rate determined by reference to the Alternate Base Rate.
Match 7 Start(1706) Length(37)
SubMatch 0: BNPPSC
SubMatch 1: is defined in the preamble.
Match 8 Start(1743) Length(39)
SubMatch 0: Borrower
SubMatch 1: is defined in the preamble.
Match 9 Start(1782) Length(347)
SubMatch 0: Borrower Pledge and Security Agreement
SubMatch 1: means the Second Amended and Restated First Lien Pledge and Security Agreement and Irrevocable Proxy executed and delivered by an Authorized Officer of the Borrower, substantially in the form of Exhibit G- 1 hereto, as amended, supplemented, amended and restated or otherwise modified from time to time.
Match 10 Start(2129) Length(262)
SubMatch 0: Borrowing
SubMatch 1: means the Loans of the same Type and, in the case of LIBO Rate Loans having the same Interest Period made by all Lenders required to make such Loans on the same Business Day and pursuant to the same Borrowing Request in accordance with Section 2.3.
Match 11 Start(2391) Length(192)
SubMatch 0: Borrowing Base
SubMatch 1: means at any time an amount equal to the amount determined in accordance with Section 2.8, as the same may be adjusted from time to time in accordance with the terms hereof.
Match 12 Start(2583) Length(58)
SubMatch 0: Borrowing Base Deficiency
SubMatch 1: is defined in Section 3.1.1(c).
Is the punctuation correct in
Affiliate of any Person means any other Person that, directly or indirectly, controls, is controlled by or is under common control with such
Person. Control of a Person means the power, directly or indirectly,
from http:#a40746230 ?
It does not contain \.\n and unlike the previous examples does not have a , before means
ASKER
@ozo, sadly so, these documents aren't totally consistent in their structure.
Is my most recent pattern parsing your text as you need?
ASKER
@aikimark, I tried:
([A-Z].+?),\s*((?:means|is\sdefined\sin)\b\s*(?:.|\n)+?\.)(?:\r\n|$)
but it doesn't work, unfortunately. Do note I'm using PHP; not sure if that's relevant.
If definitions like http:#a40746230 do not always start or end in \n, would it be better to use
(\s[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\s))
instead of
(^[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\n))
Please post a file with sample data.
ASKER
@ozo -
http://www.regexr.com/3at2a
Both of these appear to work quite well:
(^[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\s))
(\s[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\s))
but they break when they come across ". " (a period & then a double space) - see 'Base Rate' as an example.
What do you think ?
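To make the 'Base Rate' breakage concrete, here is a small Python sketch (Python and all names in it are my assumptions, and the entry is shortened from the sample). It compares the (?=\.\n) ending with the (?=\.\s) ending on an entry that has a sentence break in the middle.

```python
import re

KEYWORDS = r"(?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)"

# Ends the meaning at a period followed by a newline.
END_AT_NEWLINE = re.compile(
    r"(^[A-Z0-9].+?)(" + KEYWORDS + r"[\s\S]*?(?=\.\n))", re.MULTILINE
)
# Ends the meaning at a period followed by any whitespace.
END_AT_SPACE = re.compile(
    r"(^[A-Z0-9].+?)(" + KEYWORDS + r"[\s\S]*?(?=\.\s))", re.MULTILINE
)

# Shortened 'Base Rate' entry with a second sentence inside the meaning.
BASE_RATE = (
    "Base Rate, means, at any time, the base rate for Dollars loaned in\n"
    "the United States. The Base Rate is not necessarily the lowest rate\n"
    "of interest.\n"
)


def first_meaning(pattern, text):
    """Return group 2 of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(2) if m else None


print(repr(first_meaning(END_AT_NEWLINE, BASE_RATE)))
print(repr(first_meaning(END_AT_SPACE, BASE_RATE)))
```

The \.\s variant stops at "United States" because the period there is followed by whitespace, whereas the \.\n variant runs to the end of the entry. The trade-off is that \.\n in turn misses entries whose last period is not at the end of a line.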
ASKER
Sample data file attached.
A single regex is hard to get right when each entry can span many lines and the structure varies.
Presumably the information, while it might not always be formatted the same way, will often fall under the same category heading, so preprocessing during the conversion step could be used to "standardize/normalize" the data that will be evaluated later on.
ASKER
It depends on the source you are getting the filings from.
Presumably you have, or are, a service to which documents are delivered in PDF, and you need to extract certain terms/data from them.
The example has a required set of data, but its formatting is flexible.
Trying to cover all possible variations using a single regex might require earlier steps to .........
Part of the problem is that you have non-ASCII characters in the text. I think the simplest solution is going to be
1. read the entire file
2. do two regex replace operations to change the two sequences I've found (so far)
3. parse the cleaned up string
These bytes \xE2\x80\x9A should be replaced with a comma (",")
These bytes \xE2\x82\xAC should be replaced with an apostrophe ("'")
You do not have to use regex replacement. You can use the native PHP string replacement function.
Also, I might be seeing some terms that are not properly capitalized. Such terms would not be found in the regex patterns I've posted so far. Please confirm what I've found.
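The three steps above could be sketched roughly as follows. This is in Python rather than PHP purely for illustration. The sample bytes are made up to contain both sequences, the names clean and parse are hypothetical, and adding re.MULTILINE (so that $ matches at "\n" line ends) is my assumption, not something stated in the thread.

```python
import re


def clean(raw: bytes) -> str:
    """Steps 1-2: take the whole input and replace the two byte sequences."""
    raw = raw.replace(b"\xE2\x80\x9A", b",")  # sequence that stands where a comma belongs
    raw = raw.replace(b"\xE2\x82\xAC", b"'")  # sequence that stands where an apostrophe belongs
    return raw.decode("utf-8", errors="replace")


# Step 3: the two-group pattern posted earlier in the thread.
PATTERN = re.compile(
    r"([A-Z].+?),\s*((?:means|is defined in)\b\s*(?:.|\n)+?\.)(?:\r\n|$)",
    re.MULTILINE,
)

# Made-up sample bytes; a real run would use the full contents of the text file.
RAW = (
    b"Approved Counterparties\xe2\x80\x9a means any counterparty that (a) is a Lender or (B)\n"
    b"has a credit rating of Baa1 or higher from Moody\xe2\x82\xacs.\n"
)


def parse(raw: bytes):
    """Return (term, meaning) pairs from the cleaned text."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(clean(raw))]


print(clean(RAW))
print(parse(RAW))
```

Plain byte replacement is used here instead of a regex replace, in line with the note that a native string replacement function is enough for this step.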
I found I had to remove the page numbers with this regex replace:
Pattern: \n\d{1,2}\n
Replace with: carriage return
After that, I got 233 definitions with this pattern:
(\w[^\xe2\n]+)\xe2.. ((?:means|shall have the meaning|is defined in)\b(?:\S|\s)+?)\.\n
The regex engine I am using had trouble replacing the apostrophe+s unicode characters. I had to use the language's intrinsic character replace function instead of regex replace.
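For reference, the page-number cleanup could look something like this in Python (my choice of language; strip_page_numbers is a hypothetical name, and replacing with "\n" is my reading of "carriage return"). The sample is cut down from the text in the question.

```python
import re


def strip_page_numbers(text: str) -> str:
    """Drop lines that hold only a 1-2 digit number, which are treated as page numbers."""
    return re.sub(r"\n\d{1,2}\n", "\n", text)


# Cut-down sample around the page break in the question's text.
SAMPLE = (
    "of a Lender or a Person or an Affiliate of a Person that administers or manages a Lender.\n"
    "Credit Agreement (First Lien)\n"
    "4\n"
    "Authorized Officer, means, relative to any Obligor, those of its officers\n"
)

print(strip_page_numbers(SAMPLE))
```

This only removes the bare number line; the "Credit Agreement (First Lien)" footer line is left in place and would need its own rule if it gets in the way.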