Regex to get defined term and meaning from contract

Given the following body of text, I'm trying to write a regex that captures two parts: the defined term and its meaning.

So far I've got:
/(^[A-Z0-9].+)(means|is\sdefined).*(?=\.\n)/gm

but I'm a little stuck, as the line breaks within the definitions' meanings cause my draft regex to break...

Any help much appreciated.

Approved Counterparties, means any counterparty to a Hedging Agreement with the Borrower that (a) is a Lender or an Affiliate of a Lender or (B)
has a credit rating of Baa1 or higher from Moody€s or BBB+ or higher from S&P.
Approved Engineer, means Netherland, Sewell and Associates, Inc., Miller and Lents, Ltd., Ryder Scott Company, L.P., or any other independent
petroleum engineer satisfactory to the Administrative Agent in its sole and absolute discretion.
Approved Fund, means any Person (other than a natural Person) that (a) is engaged in making, purchasing, holding or otherwise investing in
commercial loans and similar extensions of credit in the ordinary course of its business, and (b) is administered or managed by a Lender, an Affiliate
of a Lender or a Person or an Affiliate of a Person that administers or manages a Lender.
Credit Agreement (First Lien)
4

Authorized Officer, means, relative to any Obligor, those of its officers, general partners or managing members (as applicable) whose signatures
and incumbency shall have been certified to the Administrative Agent, the Lenders and the Issuers pursuant to Section 5.1.1 or pursuant to the other
provisions of this Agreement.
Base Amount, is defined in Section 3.3.3.
Base Rate, means, at any time, the rate of interest then most recently established by the Administrative Agent as its base rate for Dollars loaned in
the United States.  The Base Rate is not necessarily intended to be the lowest rate of interest determined by the Administrative Agent in connection
with extensions of credit.
Base Rate Loan, means a Loan bearing interest at a fluctuating rate determined by reference to the Alternate Base Rate.
BNPPSC, is defined in the preamble.
Borrower, is defined in the preamble.
Borrower Pledge and Security Agreement, means the Second Amended and Restated First Lien Pledge and Security Agreement and Irrevocable
Proxy executed and delivered by an Authorized Officer of the Borrower, substantially in the form of Exhibit G- 1 hereto, as amended, supplemented,
amended and restated or otherwise modified from time to time.
Borrowing, means the Loans of the same Type and, in the case of LIBO Rate Loans having the same Interest Period made by all Lenders required to
make such Loans on the same Business Day and pursuant to the same Borrowing Request in accordance with Section 2.3.
Borrowing Base, means at any time an amount equal to the amount determined in accordance with Section 2.8, as the same may be adjusted from
time to time in accordance with the terms hereof.
Borrowing Base Deficiency, is defined in Section 3.1.1(c).

georgetheroux1Asked:
wilcoxonCommented:
You're close.  I think this will work (if not, let me know):
/(^[A-Z0-9].+)(means|is\sdefined).*?(?=\.\n)/gms

arnoldCommented:
The difficulty I see in your current regex is that it does not use the comma as the separator after the term.

In Perl, while it is possible to pattern-match across lines, I often aggregate the text into a single line when possible and then evaluate it. In your data, the presence of "means" or "is defined" can be used as an indication that the prior line/grouping is complete.

Splitting the line at the term might be simpler than trying to come up with a regex that has to account for all the punctuation (commas, semicolons, etc.).

($term,$restofline)=split(/,/,$line,2);
If $restofline matches means/is defined, you can assign the matched pattern to $meaning or $definition.

Let me try to think this through and give it a shot.

I'll also compare against what your regex gets; possibly it is 95% there.
ozoCommented:
/(^[A-Z0-9].+?)(means|is\sdefined)(?s:.*?)(?=\.\n)/gm
If you wanted to capture the meaning in addition to the (means|is\sdefined)
/(^[A-Z0-9].+?)((?:means|is\sdefined)(?s:.*?)(?=\.\n))/gm
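For readers following along in Python, here is a minimal sketch of the second pattern above. It assumes Python 3.6 or later, where `re` accepts the scoped `(?s:...)` group; the sample is abbreviated from the question, and the names `SAMPLE`, `PATTERN` and `pairs` are only illustrative.

```python
import re

# Abbreviated sample from the question; a line break falls inside a meaning.
SAMPLE = (
    "Approved Engineer, means Netherland, Sewell and Associates, Inc., or any other independent\n"
    "petroleum engineer satisfactory to the Administrative Agent.\n"
    "Base Amount, is defined in Section 3.3.3.\n"
)

# Group 1: the term (lazy, and confined to one line because "." excludes "\n" here).
# Group 2: keyword plus meaning; only the (?s:...) part lets "." cross line breaks.
PATTERN = re.compile(
    r"(^[A-Z0-9].+?)((?:means|is\sdefined)(?s:.*?)(?=\.\n))",
    re.MULTILINE,
)

pairs = [(m.group(1), m.group(2)) for m in PATTERN.finditer(SAMPLE)]
for term, meaning in pairs:
    print(repr(term), "->", repr(meaning))
```

Note that group 1 keeps the trailing comma and space ("Approved Engineer, "), and the final period stays outside the match because it sits in the lookahead.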
arnoldCommented:
Here is the script.
At this point it outputs the Term/Explanation. You can modify it if you want to assign the items to variables/arrays/hashes.
The example deals with splitting the two.
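#!/usr/bin/perl
use strict;
use warnings;

my $line = '';
while (<>) {
    chomp;
    if (/, (means|is defined)/) {
        # A new definition starts here, so print the one collected so far.
        print_entry($line) if length $line;
        $line = $_;
    }
    else {
        # Continuation line: append it to the current definition.
        $line .= " $_";
    }
}
print_entry($line) if length $line;    # print the last definition too

# Split at the first comma: term on the left, explanation on the right.
sub print_entry {
    my ($entry) = @_;
    my ($term, $rest) = split /,/, $entry, 2;
    $rest = '' unless defined $rest;
    print "\n\nTerm $term\n\tExplanation: $rest\n";
}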
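The same join-then-split idea can be sketched in Python. This is an illustration rather than a port of the script: the function name `split_definitions` and the sample lines are made up here, and unlike the Perl version it skips any text that appears before the first definition.

```python
import re

def split_definitions(lines):
    """Join wrapped lines into one string per definition, then split term from meaning."""
    entries, current = [], ""
    for raw in lines:
        line = raw.rstrip("\n")
        if re.search(r", (means|is defined)", line):
            # A new definition starts: store the one collected so far.
            if current:
                entries.append(current)
            current = line
        elif current:
            # Continuation of the current definition.
            current += " " + line
    if current:
        entries.append(current)
    # Split each joined entry at its first comma only.
    return [tuple(part.strip() for part in e.split(",", 1)) for e in entries]

sample = [
    "Base Amount, is defined in Section 3.3.3.\n",
    "Base Rate Loan, means a Loan bearing interest at a fluctuating rate\n",
    "determined by reference to the Alternate Base Rate.\n",
]
result = split_definitions(sample)
for term, meaning in result:
    print(term, "=>", meaning)
```

Once each definition sits on a single line, the regex no longer has to cope with line breaks at all, which is the trade-off this approach is aiming for.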

georgetheroux1Author Commented:
@ozo - I just tried your regexes out here: http://regexr.com/3at2a - unfortunately no matches are found :(

@arnold - I'm trying to achieve this with pure regex, as it's part of a wider project.

@wilcoxon - that doesn't work unfortunately
ozoCommented:
http://regexr.com/3at2a does not seem to understand /s or (?s:) so you can use [\s\S] instead
/(^[A-Z0-9].+?)((?:means|is\sdefined)([\s\S]*?)(?=\.\n))/gm
http://regexr.com/3at2m
arnoldCommented:
Is making sure the entry is one line first an option?

If it is one line:
/^([a-zA-Z0-9_\- ]+),(.*)$/
$1 is the term
$2 is the definition or meaning.
georgetheroux1Author Commented:
@arnold, unfortunately not. I'm working from PDFs so I don't have that luxury. The snippet I provided on regexr is what happens when I grab the text from the PDF.
arnoldCommented:
How do you process the PDF: do you convert PDF to PS, PDF to HTML, or PDF to text?
Or is this a select-all on a PDF, pasted into a new text document?
Or is Perl opening the PDF file directly?
georgetheroux1Author Commented:
@ozo - ok, but as far as I can see, that example still yields no matches :(

@arnold, I'm converting the PDF to plain text, then running the regex. The dump I linked to on regexr.com is an example of such plain-text output.
ozoCommented:
Forgot to remove the capture when I removed the ?s: since you said you wanted 2 parts:
/(^[A-Z0-9].+?)((?:means|is\sdefined)[\s\S]*?(?=\.\n))/gm

ozoCommented:
Did you try it on your
Approved Counterparties, means
text above? or the
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
text that it seems to default to?
georgetheroux1Author Commented:
@ozo - re: your comment at 05:53:23, that works amazingly, I've tweaked to:

(^[A-Z0-9].+?)((?:means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\n))

It does, however, fall down on the following edge case, where 'of any Person' ends up in the first group along with the term. Any ideas on how I can prevent this from happening?

Affiliate of any Person means any other Person that, directly or indirectly, controls, is controlled by or is under common control with such
Person. Control of a Person means the power, directly or indirectly,

ozoCommented:
Where should 'of any person' end up?
georgetheroux1Author Commented:
@ozo - as part of the second group (the definition),

i.e.
Group 1: Affiliate
Group 2: of any Person means any....
georgetheroux1Author Commented:
@ozo, this works, but I wonder if there is a cleaner way to write it (without repeating 'means')

(^[A-Z0-9].+?)((?:of\sany\sPerson\smeans|means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\n))
wilcoxonCommented:
(^[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\n))
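A small Python sketch of what the optional `(?:of\sany\sPerson\s)?` prefix changes. The sample text is shortened and ends each definition with a period and newline, and the alternation is trimmed to two keywords to keep it brief; `PLAIN`, `PREFIXED` and `TEXT` are illustrative names.

```python
import re

TEXT = (
    "Affiliate of any Person means any other Person that controls such\n"
    "Person.\n"
    "Borrower, is defined in the preamble.\n"
)

# Shared tail: lazy match across lines, stopping before a period at end of line.
TAIL = r"[\s\S]*?(?=\.\n))"
PLAIN = re.compile(r"(^[A-Z0-9].+?)((?:means|is\sdefined)" + TAIL, re.MULTILINE)
PREFIXED = re.compile(
    r"(^[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined)" + TAIL,
    re.MULTILINE,
)

plain_first = PLAIN.search(TEXT).groups()
prefixed_first = PREFIXED.search(TEXT).groups()
print(plain_first)
print(prefixed_first)
```

Because group 1 is lazy, it stops at the first position where group 2 can start, so the optional prefix pulls "of any Person" out of the term without listing `means` twice.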

arnoldCommented:
Is there any way to have the process that converts the PDF to plain text combine these lines/sections, so that a far simpler regex can handle them?
aikimarkCommented:
This regex pattern applied to your sample text
([A-Z].+?),\s*(means|is defined in)\b\s*((?:.|\n)+?\.)(?:\r\n|$)

...parsed the following 13 matches:
Match 0 Start(0) Length(226) 
SubMatch 0: Approved Counterparties
SubMatch 1: means
SubMatch 2: any counterparty to a Hedging Agreement with the Borrower that (a) is a Lender or an Affiliate of a Lender or (B) has a credit rating of Baa1 or higher from Moody€s or BBB+ or higher from S&P.

Match 1 Start(226) Length(241) 
SubMatch 0: Approved Engineer
SubMatch 1: means
SubMatch 2: Netherland, Sewell and Associates, Inc., Miller and Lents, Ltd., Ryder Scott Company, L.P., or any other independent petroleum engineer satisfactory to the Administrative Agent in its sole and absolute discretion.

Match 2 Start(467) Length(383) 
SubMatch 0: Approved Fund
SubMatch 1: means
SubMatch 2: any Person (other than a natural Person) that (a) is engaged in making, purchasing, holding or otherwise investing in commercial loans and similar extensions of credit in the ordinary course of its business, and (b) is administered or managed by a Lender, an Affiliate of a Lender or a Person or an Affiliate of a Person that administers or manages a Lender.

Match 3 Start(886) Length(327) 
SubMatch 0: Authorized Officer
SubMatch 1: means
SubMatch 2: , relative to any Obligor, those of its officers, general partners or managing members (as applicable) whose signatures and incumbency shall have been certified to the Administrative Agent, the Lenders and the Issuers pursuant to Section 5.1.1 or pursuant to the other provisions of this Agreement.

Match 4 Start(1213) Length(43) 
SubMatch 0: Base Amount
SubMatch 1: is defined in
SubMatch 2: Section 3.3.3.

Match 5 Start(1256) Length(329) 
SubMatch 0: Base Rate
SubMatch 1: means
SubMatch 2: , at any time, the rate of interest then most recently established by the Administrative Agent as its base rate for Dollars loaned in the United States. The Base Rate is not necessarily intended to be the lowest rate of interest determined by the Administrative Agent in connection with extensions of credit.

Match 6 Start(1585) Length(121) 
SubMatch 0: Base Rate Loan
SubMatch 1: means
SubMatch 2: a Loan bearing interest at a fluctuating rate determined by reference to the Alternate Base Rate.

Match 7 Start(1706) Length(37) 
SubMatch 0: BNPPSC
SubMatch 1: is defined in
SubMatch 2: the preamble.

Match 8 Start(1743) Length(39) 
SubMatch 0: Borrower
SubMatch 1: is defined in
SubMatch 2: the preamble.

Match 9 Start(1782) Length(347) 
SubMatch 0: Borrower Pledge and Security Agreement
SubMatch 1: means
SubMatch 2: the Second Amended and Restated First Lien Pledge and Security Agreement and Irrevocable Proxy executed and delivered by an Authorized Officer of the Borrower, substantially in the form of Exhibit G- 1 hereto, as amended, supplemented, amended and restated or otherwise modified from time to time.

Match 10 Start(2129) Length(262) 
SubMatch 0: Borrowing
SubMatch 1: means
SubMatch 2: the Loans of the same Type and, in the case of LIBO Rate Loans having the same Interest Period made by all Lenders required to make such Loans on the same Business Day and pursuant to the same Borrowing Request in accordance with Section 2.3.

Match 11 Start(2391) Length(192) 
SubMatch 0: Borrowing Base
SubMatch 1: means
SubMatch 2: at any time an amount equal to the amount determined in accordance with Section 2.8, as the same may be adjusted from time to time in accordance with the terms hereof.

Match 12 Start(2583) Length(58) 
SubMatch 0: Borrowing Base Deficiency
SubMatch 1: is defined in
SubMatch 2: Section 3.1.1(c).

georgetheroux1Author Commented:
@aikimark, thanks, any way of making it just 2 groups? I.e. combining submatch 1 & 2 ?
aikimarkCommented:
any way of making it just 2 groups? I.e. combining submatch 1 & 2 ?
Try this pattern:
([A-Z].+?),\s*((?:means|is defined in)\b\s*(?:.|\n)+?\.)(?:\r\n|$)

Which parses thusly:
Match 0 Start(0) Length(226) 
SubMatch 0: Approved Counterparties
SubMatch 1: means any counterparty to a Hedging Agreement with the Borrower that (a) is a Lender or an Affiliate of a Lender or (B) has a credit rating of Baa1 or higher from Moody€s or BBB+ or higher from S&P.

Match 1 Start(226) Length(241) 
SubMatch 0: Approved Engineer
SubMatch 1: means Netherland, Sewell and Associates, Inc., Miller and Lents, Ltd., Ryder Scott Company, L.P., or any other independent petroleum engineer satisfactory to the Administrative Agent in its sole and absolute discretion.

Match 2 Start(467) Length(383) 
SubMatch 0: Approved Fund
SubMatch 1: means any Person (other than a natural Person) that (a) is engaged in making, purchasing, holding or otherwise investing in commercial loans and similar extensions of credit in the ordinary course of its business, and (b) is administered or managed by a Lender, an Affiliate of a Lender or a Person or an Affiliate of a Person that administers or manages a Lender.

Match 3 Start(886) Length(327) 
SubMatch 0: Authorized Officer
SubMatch 1: means, relative to any Obligor, those of its officers, general partners or managing members (as applicable) whose signatures and incumbency shall have been certified to the Administrative Agent, the Lenders and the Issuers pursuant to Section 5.1.1 or pursuant to the other provisions of this Agreement.

Match 4 Start(1213) Length(43) 
SubMatch 0: Base Amount
SubMatch 1: is defined in Section 3.3.3.

Match 5 Start(1256) Length(329) 
SubMatch 0: Base Rate
SubMatch 1: means, at any time, the rate of interest then most recently established by the Administrative Agent as its base rate for Dollars loaned in the United States. The Base Rate is not necessarily intended to be the lowest rate of interest determined by the Administrative Agent in connection with extensions of credit.

Match 6 Start(1585) Length(121) 
SubMatch 0: Base Rate Loan
SubMatch 1: means a Loan bearing interest at a fluctuating rate determined by reference to the Alternate Base Rate.

Match 7 Start(1706) Length(37) 
SubMatch 0: BNPPSC
SubMatch 1: is defined in the preamble.

Match 8 Start(1743) Length(39) 
SubMatch 0: Borrower
SubMatch 1: is defined in the preamble.

Match 9 Start(1782) Length(347) 
SubMatch 0: Borrower Pledge and Security Agreement
SubMatch 1: means the Second Amended and Restated First Lien Pledge and Security Agreement and Irrevocable Proxy executed and delivered by an Authorized Officer of the Borrower, substantially in the form of Exhibit G- 1 hereto, as amended, supplemented, amended and restated or otherwise modified from time to time.

Match 10 Start(2129) Length(262) 
SubMatch 0: Borrowing
SubMatch 1: means the Loans of the same Type and, in the case of LIBO Rate Loans having the same Interest Period made by all Lenders required to make such Loans on the same Business Day and pursuant to the same Borrowing Request in accordance with Section 2.3.

Match 11 Start(2391) Length(192) 
SubMatch 0: Borrowing Base
SubMatch 1: means at any time an amount equal to the amount determined in accordance with Section 2.8, as the same may be adjusted from time to time in accordance with the terms hereof.

Match 12 Start(2583) Length(58) 
SubMatch 0: Borrowing Base Deficiency
SubMatch 1: is defined in Section 3.1.1(c).

ozoCommented:
Is the punctuation correct in
Affiliate of any Person means any other Person that, directly or indirectly, controls, is controlled by or is under common control with such
Person. Control of a Person means the power, directly or indirectly,
from http:#a40746230 ?
It does not contain \.\n and unlike the previous examples does not have a , before means
georgetheroux1Author Commented:
@ozo, sadly so, these documents aren't totally consistent in their structure.
aikimarkCommented:
Is my most recent pattern parsing your text as you need?
georgetheroux1Author Commented:
@aikimark, I tried:

([A-Z].+?),\s*((?:means|is\sdefined\sin)\b\s*(?:.|\n)+?\.)(?:\r\n|$)

but unfortunately it doesn't work. Note that I'm using PHP; not sure if that's relevant.
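One possible contributor, offered as an assumption rather than a diagnosis of the PHP run: the pattern ends in `(?:\r\n|$)`, and without the multiline flag `$` only matches at the end of the whole text. The Python sketch below shows that effect on text with plain `\n` line endings; the sample and variable names are illustrative.

```python
import re

PATTERN = r"([A-Z].+?),\s*((?:means|is\sdefined\sin)\b\s*(?:.|\n)+?\.)(?:\r\n|$)"

# Plain-text sample with Unix "\n" line endings (no "\r").
text = (
    "Base Amount, is defined in Section 3.3.3.\n"
    "BNPPSC, is defined in the preamble.\n"
)

# Without the multiline flag, "$" only matches at the very end of the text,
# so the first definition swallows everything after it.
single = re.findall(PATTERN, text)

# With the multiline flag, "$" also matches before each "\n".
multi = re.findall(PATTERN, text, flags=re.MULTILINE)

print(len(single), len(multi))
```

In PCRE-style syntax the corresponding switch is the `m` modifier. Separately, the pattern expects a plain comma after the term, so text that uses a different character there would also fail to match.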
ozoCommented:
If definitions like http:#a40746230 do not always start or end in \n, would it be better to use
(\s[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\s))
instead of
(^[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\n))
aikimarkCommented:
Please post a file with sample data.
georgetheroux1Author Commented:
@ozo -

http://www.regexr.com/3at2a

Both of these appear to work quite well:
(^[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\s))
(\s[A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?(?=\.\s))

but they break when they come across ".  " (a period & then a double space) - see 'Base Rate' as an example.

What do you think ?
georgetheroux1Author Commented:
Sample data file attached.
arnoldCommented:
The difficulty with a single regex is that each entry can span many lines and the structure varies from entry to entry.
Presumably the information, while it might not always be formatted the same way, will usually fall under the same category heading, so processing during the conversion step might be used to "standardize/normalize" the data that will be evaluated later on.
arnoldCommented:
It depends on where you are getting the filings from.
Presumably you have, or are, a service to which documents are delivered as PDFs, and you need to extract certain terms/data from them.

The example has a set of required data, but the formatting is flexible.
Trying to cover all possible variations with a single regex might require earlier steps to .........
aikimarkCommented:
Part of the problem is that you have non-ASCII characters in the text.  I think the simplest solution is going to be
1. read the entire file
2. do two regex replace operations to change the two sequences I've found (so far)
3. parse the cleaned up string
aikimarkCommented:
These bytes \xE2\x80\x9A should be replaced with a comma (",")
These bytes \xE2\x82\xAC should be replaced with an apostrophe ("'")

You do not have to use regex replacement.  You can use the native PHP string replacement function.
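The two replacements can be sketched with plain string replacement. The example is in Python rather than PHP, assumes the text has been decoded as UTF-8, and uses illustrative names (`REPLACEMENTS`, `clean`).

```python
# Map the two mis-encoded characters reported above to plain ASCII.
REPLACEMENTS = {
    b"\xe2\x80\x9a".decode("utf-8"): ",",   # U+201A, single low-9 quotation mark
    b"\xe2\x82\xac".decode("utf-8"): "'",   # U+20AC, euro sign
}

def clean(text):
    """Replace each known stray character with its ASCII stand-in."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    return text

raw = b"Approved Counterparties\xe2\x80\x9a means ... Moody\xe2\x82\xacs or BBB+".decode("utf-8")
cleaned = clean(raw)
print(cleaned)
```

Running the cleanup before the regex means the patterns can keep matching an ordinary comma after the term.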

Also, I might be seeing some terms that are not properly capitalized.  Such terms would not be found in the regex patterns I've posted so far.  Please confirm what I've found.
ozoCommented:
preg_match_all("/\s([A-Z0-9].+?)((?:(?:of\sany\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?)(?=\.\n|\.\s+.*(?:means|is\sdefined|shall\shave\sthe\smeaning))/"
handles both

match[12][1]: Affiliate‚
match[12][2]: of any Person means any other Person that, directly or indirectly, controls, is controlled by or is under common control with such
Person

match[13][1]: Control‚ of a Person
match[13][2]: means the power, directly or indirectly,
(a)  to vote 10% or more of the Capital Securities (on a fully diluted basis) of such Person having ordinary voting power for the election of directors,
managing members or general partners (as applicable); or
(b)  to direct or cause the direction of the management and policies of such Person (whether by contract or otherwise)

and

match[26][1]: Base Rate‚
match[26][2]: means, at any time, the rate of interest then most recently established by the Administrative Agent as its base rate for Dollars loaned in
the United States.  The Base Rate is not necessarily intended to be the lowest rate of interest determined by the Administrative Agent in connection
with extensions of credit

but you may prefer
preg_match_all("/\s([A-Z0-9].+?)((?:(?:of\sa(?:ny)?\sPerson\s)?means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?)(?=\.\n|\.\s+.*(?:means|is\sdefined|shall\shave\sthe\smeaning))/",
or
preg_match_all("/\s([A-Z0-9].+?)(?:,|\342\200\232)(.*?(?:means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?)(?=\.\n|\.\s+.*(?:means|is\sdefined|shall\shave\sthe\smeaning))/",

if you use
preg_match_all("/\s((?:[A-Z0-9]|(?<=\.\n)[a-z]).+?)(?:,|\342\200\232)(.*?(?:means|is\sdefined|shall\shave\sthe\smeaning)[\s\S]*?)(?=\.\n|\.\s+.*(?:means|is\sdefined|shall\shave\sthe\smeaning))/",
then you can also capture
match[111][1]: including
match[111][2]:  and include‚ means including without limiting the generality of any description preceding such term, and, for purposes of each Loan
Document, the parties hereto agree that the rule of ejusdem generis shall not be applicable to limit a general statement, that is followed by or referable
to an enumeration of specific matters, to matters similar to the matters specifically mentioned
and
match[238][1]: wholly owned Subsidiary
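The effect of the longer lookahead can be seen in a small Python comparison. The head of the pattern is simplified here (the "of any Person" prefix and the special comma handling are left out), the sample is shortened, and `HEAD`, `SIMPLE` and `EXTENDED` are illustrative names.

```python
import re

KEYWORDS = r"(?:means|is\sdefined|shall\shave\sthe\smeaning)"
HEAD = r"\s([A-Z0-9].+?)(" + KEYWORDS + r"[\s\S]*?)"

# Stop at any period followed by whitespace.
SIMPLE = HEAD + r"(?=\.\s)"
# Stop at a period that ends a line, or at a period followed by a line
# that itself contains a keyword (the start of the next definition).
EXTENDED = HEAD + r"(?=\.\n|\.\s+.*" + KEYWORDS + r")"

text = (
    "\nBase Rate, means, at any time, the rate for Dollars loaned in\n"
    "the United States.  The Base Rate is not necessarily the lowest rate\n"
    "of interest.\n"
    "Borrower, is defined in the preamble.\n"
)

simple = re.findall(SIMPLE, text)
extended = re.findall(EXTENDED, text)
print(simple[0][1])
print(extended[0][1])
```

With the simple lookahead the Base Rate meaning is cut off at "United States"; with the extended one it runs to "of interest", because the sentence after the double space contains no defining keyword on its line.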
aikimarkCommented:
I found I had to remove the page numbers with this regex replace:
Pattern: \n\d{1,2}\n
Replace with: carriage return

After that, I got 233 definitions with this pattern:
(\w[^\xe2\n]+)\xe2.. ((?:means|shall have the meaning|is defined in)\b(?:\S|\s)+?)\.\n

The regex engine I am using had trouble replacing the apostrophe+s unicode characters.  I had to use the language's intrinsic character replace function instead of regex replace.
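The page-number cleanup can be sketched in Python as follows. It assumes the replacement is a single newline, so the surrounding line structure is kept; the sample is adapted from the question's text and the variable names are illustrative.

```python
import re

text = (
    "of a Lender or a Person that administers or manages a Lender.\n"
    "Credit Agreement (First Lien)\n"
    "4\n"
    "\n"
    "Base Amount, is defined in Section 3.3.3.\n"
)

# A line holding only a one- or two-digit page number collapses to one newline.
without_page_numbers = re.sub(r"\n\d{1,2}\n", "\n", text)
print(without_page_numbers)
```

The running footer line ("Credit Agreement (First Lien)") is left in place by this step, so it would need its own handling if it gets in the way.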
aikimarkCommented:
My hex editor wasn't giving me accurate information.
In addition to the removal of the page numbers (earlier comment), we need one more replacement:
Pattern: "\xe2\u201A\xac"
Regex Replace With: "'"           <--- apostrophe

After those two clean-up operations, the following pattern seems to correctly parse the data (234 matches):
(\w[^\xe2\n]+)\xe2\u20AC\u0161 ((?:means|shall have the meaning|is defined in)\b(?:\S|\s)+?)\.\n
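A plausible explanation for the three-character sequences, stated as an assumption since the thread does not confirm it: they are what the UTF-8 bytes from the earlier comment look like when read as Windows-1252. A short Python check:

```python
# UTF-8 bytes of the two characters named in the earlier comment.
low_quote = "\u201a".encode("utf-8")   # b"\xe2\x80\x9a"
euro = "\u20ac".encode("utf-8")        # b"\xe2\x82\xac"

# Reading those bytes as Windows-1252 yields three characters each.
low_quote_as_cp1252 = low_quote.decode("cp1252")
euro_as_cp1252 = euro.decode("cp1252")

print([hex(ord(c)) for c in low_quote_as_cp1252])
print([hex(ord(c)) for c in euro_as_cp1252])
```

If that is what happened, reading the file as UTF-8 in the first place would leave a single character to replace in each case.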