georgetheroux1


Regex to get defined term and meaning from contract

Given the following body of text, I'm trying to write a regex that captures two parts: the defined term and its meaning.

I've got to:
/(^[A-Z0-9].+)(means|is\sdefined).*(?=\.\n)/gm



but I'm a little stuck, as the line breaks within the definitions' meanings cause my draft regex to break...

Any help much appreciated.

Approved Counterparties, means any counterparty to a Hedging Agreement with the Borrower that (a) is a Lender or an Affiliate of a Lender or (B)
has a credit rating of Baa1 or higher from Moody€s or BBB+ or higher from S&P.
Approved Engineer, means Netherland, Sewell and Associates, Inc., Miller and Lents, Ltd., Ryder Scott Company, L.P., or any other independent
petroleum engineer satisfactory to the Administrative Agent in its sole and absolute discretion.
Approved Fund, means any Person (other than a natural Person) that (a) is engaged in making, purchasing, holding or otherwise investing in
commercial loans and similar extensions of credit in the ordinary course of its business, and (b) is administered or managed by a Lender, an Affiliate
of a Lender or a Person or an Affiliate of a Person that administers or manages a Lender.
Credit Agreement (First Lien)
4

Authorized Officer, means, relative to any Obligor, those of its officers, general partners or managing members (as applicable) whose signatures
and incumbency shall have been certified to the Administrative Agent, the Lenders and the Issuers pursuant to Section 5.1.1 or pursuant to the other
provisions of this Agreement.
Base Amount, is defined in Section 3.3.3.
Base Rate, means, at any time, the rate of interest then most recently established by the Administrative Agent as its base rate for Dollars loaned in
the United States.  The Base Rate is not necessarily intended to be the lowest rate of interest determined by the Administrative Agent in connection
with extensions of credit.
Base Rate Loan, means a Loan bearing interest at a fluctuating rate determined by reference to the Alternate Base Rate.
BNPPSC, is defined in the preamble.
Borrower, is defined in the preamble.
Borrower Pledge and Security Agreement, means the Second Amended and Restated First Lien Pledge and Security Agreement and Irrevocable
Proxy executed and delivered by an Authorized Officer of the Borrower, substantially in the form of Exhibit G- 1 hereto, as amended, supplemented,
amended and restated or otherwise modified from time to time.
Borrowing, means the Loans of the same Type and, in the case of LIBO Rate Loans having the same Interest Period made by all Lenders required to
make such Loans on the same Business Day and pursuant to the same Borrowing Request in accordance with Section 2.3.
Borrowing Base, means at any time an amount equal to the amount determined in accordance with Section 2.8, as the same may be adjusted from
time to time in accordance with the terms hereof.
Borrowing Base Deficiency, is defined in Section 3.1.1(c).


wilcoxon

You're close.  I think this will work (if not, let me know):
/(^[A-Z0-9].+)(means|is\sdefined).*?(?=\.\n)/gms


The difficulty I see in your current regex is that it does not include the comma as the separator of the term.

In Perl, while it is possible to pattern match across lines, I often aggregate into a single line when possible and then evaluate. In your data, the presence of "means" or "is defined" can be used as an indication that the prior line/grouping is complete.

Splitting the line on the comma after the term might be simpler than trying to come up with a regex that has to account for all the punctuation (commas, semicolons, etc.).

($term,$restofline)=split(/,/,$line,2);
If there is a match for means/meaning in $restofline, you can assign the matched pattern to $meaning or $definition.

Let me try to think this through and give it a shot.

I'll also compare what your regex gets; possibly it is 95% there.
/(^[A-Z0-9].+?)(means|is\sdefined)(?s:.*?)(?=\.\n)/gm
If you wanted to capture the meaning in addition to the (means|is\sdefined)
/(^[A-Z0-9].+?)((?:means|is\sdefined)(?s:.*?)(?=\.\n))/gm
Here is the script.
At this point it outputs the Term/Explanation.  You can modify it if you want to assign the items into variables/arrays/hashes....
The example deals with splitting the two
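#!/usr/bin/perl
use strict;
use warnings;

my $line = '';

# Print the buffered entry as Term / Explanation, split on the first comma.
sub print_entry {
    return unless length $line;
    my ($term, $rest) = split /,/, $line, 2;
    $rest = '' unless defined $rest;
    print "\n\nTerm $term\n\tExplanation: $rest\n";
}

while (<>) {
    chomp;
    if (/, (means|is defined)/) {
        # A new definition starts here, so print the previous one first.
        print_entry();
        $line = $_;
    }
    else {
        # Continuation line: append it to the current entry.
        $line .= " $_";
    }
}
print_entry();    # print the last buffered entry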


georgetheroux1

ASKER

@ozo - I just tried your regexes out here: http://regexr.com/3at2a - unfortunately no matches are found :(

@arnold - I'm trying to achieve this with pure regex, as it's part of a wider project.

@wilcoxon - that doesn't work unfortunately
http://regexr.com/3at2a does not seem to understand /s or (?s:) so you can use [\s\S] instead
/(^[A-Z0-9].+?)((?:means|is\sdefined)([\s\S]*?)(?=\.\n))/gm
http://regexr.com/3at2m
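
For anyone who wants to try this pattern outside regexr, here is a minimal sketch in Python (picked only for illustration; the thread itself uses Perl and PHP). The two entries are copied from the sample text in the question. The helper name `extract_definitions` and the whitespace clean-up are assumptions, not part of the posted pattern.

```python
import re

# The pattern from the post above. [\s\S] matches any character, including a
# newline, so the lazy [\s\S]*? can run across wrapped lines until ".\n".
PATTERN = re.compile(
    r"(^[A-Z0-9].+?)((?:means|is\sdefined)([\s\S]*?)(?=\.\n))",
    re.MULTILINE,
)

# Two entries from the question's sample text; the first wraps onto a second line.
SAMPLE = (
    "Borrowing Base, means at any time an amount equal to the amount "
    "determined in accordance with Section 2.8, as the same may be adjusted from\n"
    "time to time in accordance with the terms hereof.\n"
    "Borrowing Base Deficiency, is defined in Section 3.1.1(c).\n"
)


def extract_definitions(text):
    """Return (term, meaning) pairs. Hypothetical helper for illustration."""
    pairs = []
    for match in PATTERN.finditer(text):
        term = match.group(1).rstrip(", ")          # drop the trailing ", "
        meaning = " ".join(match.group(2).split())  # rejoin wrapped lines
        pairs.append((term, meaning))
    return pairs


if __name__ == "__main__":
    for term, meaning in extract_definitions(SAMPLE):
        print(f"{term!r} -> {meaning!r}")
```

Because the lookahead `(?=\.\n)` does not consume the final period, the meaning comes back without it.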
Is making sure the entry is one line first an option?

If it is one line
/^([a-zA-Z0-9_\- ]+),(.*)$/
$1 term
$2 definition or meaning.
@arnold, unfortunately not. I'm working from PDFs so I don't have that luxury. The snippet I provided on regexr is what happens when I grab the text from the PDF.
How do you handle the PDF: do you convert PDF to PS, PDF to HTML, or PDF to text?
Or is this a select all on a PDF and paste into a new text document?
Or is there some PDF file that perl is opening?
@ozo - ok, but as far as I can see, that example still yields no matches :(

@arnold, I'm converting the PDF to plain text. Then running the regex. The dump I linked to on regexr.com is an example of such a plain text output.
ASKER CERTIFIED SOLUTION
ozo
Did you try it on your
Approved Counterparties, means
text above? or the
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
text that it seems to default to?
@ozo - re: your comment at 05:53:23, that works amazingly, I've tweaked to:

(^[A-Z0-9].+?)((?:means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\n))

It does, however, fall down on the following edge case, where 'of any Person' ends up being captured as part of the defined term (group 1). Any ideas of how I can prevent this from happening?

Affiliate of any Person means any other Person that, directly or indirectly, controls, is controlled by or is under common control with such
Person. Control of a Person means the power, directly or indirectly,


Where should 'of any person' end up?
@ozo - as part of the second group (the definition),

i.e.
Group 1: Affiliate
Group 2: of any Person means any....
@ozo, this works, but I wonder if there is a cleaner way to write it (without repeating 'means')

(^[A-Z0-9].+?)((?:of\sany\sPerson\smeans|means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\n))
(^[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\n))
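
To see what the optional `(?:of\sany\sPerson\s)?` prefix changes, here is a small Python sketch (Python is used only for illustration). The input line is a shortened, hypothetical version of the edge case that ends in a period plus newline so that the `(?=\.\n)` lookahead can match; the helper `split_entry` is likewise made up for this example.

```python
import re

# Without the optional prefix, "of any Person" stays in group 1 (the term).
PLAIN = re.compile(
    r"(^[A-Z0-9].+?)((?:means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\n))",
    re.MULTILINE,
)
# With the optional prefix in front of "means", written only once.
PREFIXED = re.compile(
    r"(^[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\n))",
    re.MULTILINE,
)

# Shortened, hypothetical variant of the edge case from the thread.
EDGE_CASE = (
    "Affiliate of any Person means any other Person under common control "
    "with such Person.\n"
)


def split_entry(pattern, text):
    """Return (term, meaning) for the first match, or None if nothing matches."""
    match = pattern.search(text)
    if match is None:
        return None
    return match.group(1).rstrip(", "), match.group(2)


if __name__ == "__main__":
    print(split_entry(PLAIN, EDGE_CASE))
    print(split_entry(PREFIXED, EDGE_CASE))
```

Since group 1 is lazy, it stops at the first position where any alternative matches, which is why the prefixed form hands "of any Person" to group 2.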


Is there any way to have the process that converts the data from PDF to plain text combine these lines/sections, so that a far simpler regex could handle them?
This regex pattern applied to your sample text
([A-Z].+?),\s*(means|is defined in)\b\s*((?:.|\n)+?\.)(?:\r\n|$)


...parsed the following 13 matches:
Match 0 Start(0) Length(226) 
SubMatch 0: Approved Counterparties
SubMatch 1: means
SubMatch 2: any counterparty to a Hedging Agreement with the Borrower that (a) is a Lender or an Affiliate of a Lender or (B) has a credit rating of Baa1 or higher from Moody€s or BBB+ or higher from S&P.

Match 1 Start(226) Length(241) 
SubMatch 0: Approved Engineer
SubMatch 1: means
SubMatch 2: Netherland, Sewell and Associates, Inc., Miller and Lents, Ltd., Ryder Scott Company, L.P., or any other independent petroleum engineer satisfactory to the Administrative Agent in its sole and absolute discretion.

Match 2 Start(467) Length(383) 
SubMatch 0: Approved Fund
SubMatch 1: means
SubMatch 2: any Person (other than a natural Person) that (a) is engaged in making, purchasing, holding or otherwise investing in commercial loans and similar extensions of credit in the ordinary course of its business, and (b) is administered or managed by a Lender, an Affiliate of a Lender or a Person or an Affiliate of a Person that administers or manages a Lender.

Match 3 Start(886) Length(327) 
SubMatch 0: Authorized Officer
SubMatch 1: means
SubMatch 2: , relative to any Obligor, those of its officers, general partners or managing members (as applicable) whose signatures and incumbency shall have been certified to the Administrative Agent, the Lenders and the Issuers pursuant to Section 5.1.1 or pursuant to the other provisions of this Agreement.

Match 4 Start(1213) Length(43) 
SubMatch 0: Base Amount
SubMatch 1: is defined in
SubMatch 2: Section 3.3.3.

Match 5 Start(1256) Length(329) 
SubMatch 0: Base Rate
SubMatch 1: means
SubMatch 2: , at any time, the rate of interest then most recently established by the Administrative Agent as its base rate for Dollars loaned in the United States. The Base Rate is not necessarily intended to be the lowest rate of interest determined by the Administrative Agent in connection with extensions of credit.

Match 6 Start(1585) Length(121) 
SubMatch 0: Base Rate Loan
SubMatch 1: means
SubMatch 2: a Loan bearing interest at a fluctuating rate determined by reference to the Alternate Base Rate.

Match 7 Start(1706) Length(37) 
SubMatch 0: BNPPSC
SubMatch 1: is defined in
SubMatch 2: the preamble.

Match 8 Start(1743) Length(39) 
SubMatch 0: Borrower
SubMatch 1: is defined in
SubMatch 2: the preamble.

Match 9 Start(1782) Length(347) 
SubMatch 0: Borrower Pledge and Security Agreement
SubMatch 1: means
SubMatch 2: the Second Amended and Restated First Lien Pledge and Security Agreement and Irrevocable Proxy executed and delivered by an Authorized Officer of the Borrower, substantially in the form of Exhibit G- 1 hereto, as amended, supplemented, amended and restated or otherwise modified from time to time.

Match 10 Start(2129) Length(262) 
SubMatch 0: Borrowing
SubMatch 1: means
SubMatch 2: the Loans of the same Type and, in the case of LIBO Rate Loans having the same Interest Period made by all Lenders required to make such Loans on the same Business Day and pursuant to the same Borrowing Request in accordance with Section 2.3.

Match 11 Start(2391) Length(192) 
SubMatch 0: Borrowing Base
SubMatch 1: means
SubMatch 2: at any time an amount equal to the amount determined in accordance with Section 2.8, as the same may be adjusted from time to time in accordance with the terms hereof.

Match 12 Start(2583) Length(58) 
SubMatch 0: Borrowing Base Deficiency
SubMatch 1: is defined in
SubMatch 2: Section 3.1.1(c).


@aikimark, thanks, any way of making it just 2 groups? I.e. combining submatch 1 & 2 ?
Try this pattern:
([A-Z].+?),\s*((?:means|is defined in)\b\s*(?:.|\n)+?\.)(?:\r\n|$)


Which parses thusly:
Match 0 Start(0) Length(226) 
SubMatch 0: Approved Counterparties
SubMatch 1: means any counterparty to a Hedging Agreement with the Borrower that (a) is a Lender or an Affiliate of a Lender or (B) has a credit rating of Baa1 or higher from Moody€s or BBB+ or higher from S&P.

Match 1 Start(226) Length(241) 
SubMatch 0: Approved Engineer
SubMatch 1: means Netherland, Sewell and Associates, Inc., Miller and Lents, Ltd., Ryder Scott Company, L.P., or any other independent petroleum engineer satisfactory to the Administrative Agent in its sole and absolute discretion.

Match 2 Start(467) Length(383) 
SubMatch 0: Approved Fund
SubMatch 1: means any Person (other than a natural Person) that (a) is engaged in making, purchasing, holding or otherwise investing in commercial loans and similar extensions of credit in the ordinary course of its business, and (b) is administered or managed by a Lender, an Affiliate of a Lender or a Person or an Affiliate of a Person that administers or manages a Lender.

Match 3 Start(886) Length(327) 
SubMatch 0: Authorized Officer
SubMatch 1: means, relative to any Obligor, those of its officers, general partners or managing members (as applicable) whose signatures and incumbency shall have been certified to the Administrative Agent, the Lenders and the Issuers pursuant to Section 5.1.1 or pursuant to the other provisions of this Agreement.

Match 4 Start(1213) Length(43) 
SubMatch 0: Base Amount
SubMatch 1: is defined in Section 3.3.3.

Match 5 Start(1256) Length(329) 
SubMatch 0: Base Rate
SubMatch 1: means, at any time, the rate of interest then most recently established by the Administrative Agent as its base rate for Dollars loaned in the United States. The Base Rate is not necessarily intended to be the lowest rate of interest determined by the Administrative Agent in connection with extensions of credit.

Match 6 Start(1585) Length(121) 
SubMatch 0: Base Rate Loan
SubMatch 1: means a Loan bearing interest at a fluctuating rate determined by reference to the Alternate Base Rate.

Match 7 Start(1706) Length(37) 
SubMatch 0: BNPPSC
SubMatch 1: is defined in the preamble.

Match 8 Start(1743) Length(39) 
SubMatch 0: Borrower
SubMatch 1: is defined in the preamble.

Match 9 Start(1782) Length(347) 
SubMatch 0: Borrower Pledge and Security Agreement
SubMatch 1: means the Second Amended and Restated First Lien Pledge and Security Agreement and Irrevocable Proxy executed and delivered by an Authorized Officer of the Borrower, substantially in the form of Exhibit G- 1 hereto, as amended, supplemented, amended and restated or otherwise modified from time to time.

Match 10 Start(2129) Length(262) 
SubMatch 0: Borrowing
SubMatch 1: means the Loans of the same Type and, in the case of LIBO Rate Loans having the same Interest Period made by all Lenders required to make such Loans on the same Business Day and pursuant to the same Borrowing Request in accordance with Section 2.3.

Match 11 Start(2391) Length(192) 
SubMatch 0: Borrowing Base
SubMatch 1: means at any time an amount equal to the amount determined in accordance with Section 2.8, as the same may be adjusted from time to time in accordance with the terms hereof.

Match 12 Start(2583) Length(58) 
SubMatch 0: Borrowing Base Deficiency
SubMatch 1: is defined in Section 3.1.1(c).


Is the punctuation correct in
Affiliate of any Person means any other Person that, directly or indirectly, controls, is controlled by or is under common control with such
Person. Control of a Person means the power, directly or indirectly,
from http:#a40746230 ?
It does not contain \.\n and unlike the previous examples does not have a , before means
@ozo, sadly so, these documents aren't totally consistent in their structure.
Is my most recent pattern parsing your text as you need?
@aikimark, I tried:

([A-Z].+?),\s*((?:means|is\sdefined\sin)\b\s*(?:.|\n)+?\.)(?:\r\n|$)

but it doesn't work, unfortunately. Do note I'm using PHP - not sure if that's relevant.
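
One possible explanation, offered purely as an assumption since the thread never confirms it, is line endings: the pattern ends each match with `(?:\r\n|$)`, so text that uses bare `\n` line endings gives it nowhere to stop except the end of the input. The sketch below shows that behaviour with Python's `re` module (not PHP), using two entries from the sample text. The `TOLERANT` variant with an optional `\r` is an illustration, not something posted in the thread.

```python
import re

# Pattern as posted: a match must end with "\r\n" or at the end of the text.
POSTED = re.compile(
    r"([A-Z].+?),\s*((?:means|is\sdefined\sin)\b\s*(?:.|\n)+?\.)(?:\r\n|$)"
)
# Variant (an assumption, not from the thread): the "\r" is optional.
TOLERANT = re.compile(
    r"([A-Z].+?),\s*((?:means|is\sdefined\sin)\b\s*(?:.|\n)+?\.)(?:\r?\n|$)"
)

CRLF_TEXT = (
    "BNPPSC, is defined in the preamble.\r\n"
    "Borrower, is defined in the preamble.\r\n"
)
LF_TEXT = CRLF_TEXT.replace("\r\n", "\n")  # same text with Unix line endings


def terms(pattern, text):
    """Return the captured term (group 1) of every match."""
    return [match.group(1) for match in pattern.finditer(text)]


if __name__ == "__main__":
    for name, text in (("CRLF", CRLF_TEXT), ("LF", LF_TEXT)):
        print(name, "posted:", terms(POSTED, text))
        print(name, "tolerant:", terms(TOLERANT, text))
```

With `\n`-only input the posted pattern swallows both entries into a single match, which from the outside can look like the pattern not working.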
If definitions like http:#a40746230 do not always start or end in \n, would it be better to use
(\s[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\s))
instead of
(^[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\n))
Please post a file with sample data.
@ozo -

http://www.regexr.com/3at2a

Both of these appear to work quite well:
(^[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\s))
(\s[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\s))

but they break when they come across ".  " (a period & then a double space) - see 'Base Rate' as an example.
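
The 'Base Rate' problem can be reproduced with a short Python sketch (Python is used only for illustration). `LOOSE` is the first pattern above; `STRICT` is a hypothetical alternative, not proposed in the thread, that only ends a definition at a period followed by optional spaces or tabs and a newline. The trade-off is that `STRICT` no longer handles definitions that end mid-line.

```python
import re

# First pattern from the post: stops at the first period followed by whitespace.
LOOSE = re.compile(
    r"(^[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\s))",
    re.MULTILINE,
)
# Hypothetical alternative: the period must be the last thing on its line.
STRICT = re.compile(
    r"(^[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.[ \t]*\n))",
    re.MULTILINE,
)

# The 'Base Rate' entry from the sample text, including the ".  " mid-line.
BASE_RATE = (
    "Base Rate, means, at any time, the rate of interest then most recently "
    "established by the Administrative Agent as its base rate for Dollars loaned in\n"
    "the United States.  The Base Rate is not necessarily intended to be the "
    "lowest rate of interest determined by the Administrative Agent in connection\n"
    "with extensions of credit.\n"
)


def first_meaning(pattern, text):
    """Return group 2 of the first match with line breaks collapsed, or None."""
    match = pattern.search(text)
    return " ".join(match.group(2).split()) if match else None


if __name__ == "__main__":
    print(first_meaning(LOOSE, BASE_RATE))
    print(first_meaning(STRICT, BASE_RATE))
```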

What do you think ?
Sample data file attached.
The difficulty with a straight regex is that each entry can span many lines and the structure varies from entry to entry.
Presumably the information, while it might not be formatted the same way, will often fall under the same category heading, so processing during the conversion step might be used to "standardize/normalize" the data that will be evaluated later on,
depending on how you obtain the filings.
Presumably you have, or are, a service to which documents are delivered as PDFs, and you need to extract certain terms/data from them.

The example has a fixed set of required data but flexible formatting.
Trying to cover all possible variations using a single regex might require earlier steps to .........
Part of the problem is that you have non-ASCII characters in the text.  I think the simplest solution is going to be
1. read the entire file
2. do two regex replace operations to change the two sequences I've found (so far)
3. parse the cleaned up string
These bytes \xE2\x80\x9A should be replaced with a comma (",")
These bytes \xE2\x82\xAC should be replaced with an apostrophe ("'")

You do not have to use regex replacement.  You can use the native PHP string replacement function.
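
The clean-up steps above can be sketched with plain byte replacement, shown here in Python for illustration (the asker is on PHP, where a native string replacement function would play the same role). The function name `clean` and the assumption that the rest of the file decodes as UTF-8 are mine; the two byte sequences and their replacements are the ones listed above.

```python
# Byte sequences named in the post, mapped to their intended replacements.
REPLACEMENTS = {
    b"\xe2\x80\x9a": b",",  # shows up where a comma is expected
    b"\xe2\x82\xac": b"'",  # shows up where an apostrophe is expected
}


def clean(raw: bytes) -> str:
    """Replace the odd byte sequences, then decode (UTF-8 is an assumption)."""
    for bad, good in REPLACEMENTS.items():
        raw = raw.replace(bad, good)
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(clean(b"Base Amount\xe2\x80\x9a is defined in Section 3.3.3.\n"))
    print(clean(b"Baa1 or higher from Moody\xe2\x82\xacs"))
```

Doing the replacement on raw bytes before decoding sidesteps any question about how the regex engine treats the multi-byte characters.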

Also, I might be seeing some terms that are not properly capitalized.  Such terms would not be found in the regex patterns I've posted so far.  Please confirm what I've found.
I found I had to remove the page numbers with this regex replace:
Pattern: \n\d{1,2}\n
Replace with: carriage return

After that, I got 233 definitions with this pattern:
(\w[^\xe2\n]+)\xe2.. ((?:means|shall have the meaning|is defined in)\b(?:\S|\s)+?)\.\n

The regex engine I am using had trouble replacing the apostrophe+s unicode characters.  I had to use the language's intrinsic character replace function instead of regex replace.
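
As a rough Python sketch of these two steps (illustration only; the original engine is not named in the thread), the page-number pattern and the definition pattern can be applied to the raw bytes, where `\xe2..` matches the three-byte sequence that stands in for the comma. Replacing with `b"\n"` is my reading of "carriage return", and the function names and the short input are hypothetical.

```python
import re

# Page numbers sitting alone on a line, as described in the post.
PAGE_NUMBER = re.compile(rb"\n\d{1,2}\n")
# Definition pattern from the post, applied to bytes so \xe2.. can match
# the three-byte sequence that appears where the comma should be.
DEFINITION = re.compile(
    rb"(\w[^\xe2\n]+)\xe2.. "
    rb"((?:means|shall have the meaning|is defined in)\b(?:\S|\s)+?)\.\n"
)


def remove_page_numbers(raw: bytes) -> bytes:
    """Collapse a lone page-number line into a single newline (an assumption)."""
    return PAGE_NUMBER.sub(b"\n", raw)


def parse(raw: bytes):
    """Return (term, meaning) byte-string pairs from the cleaned input."""
    cleaned = remove_page_numbers(raw)
    return [(m.group(1), m.group(2)) for m in DEFINITION.finditer(cleaned)]


# Hypothetical input: two entries from the sample with a page number between.
RAW = (
    b"Borrower\xe2\x80\x9a is defined in the preamble.\n"
    b"4\n"
    b"Borrowing Base Deficiency\xe2\x80\x9a is defined in Section 3.1.1(c).\n"
)

if __name__ == "__main__":
    for term, meaning in parse(RAW):
        print(term, b"->", meaning)
```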