Russ Suter

Regex Balancing Group

I'm trying to parse an Oracle TNS Names file. It looks something like this:
# Generated by Oracle configuration tools.

NorthWind =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 10.0.0.1)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = NorthWind)
    )
  )

SouthWind =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 10.0.0.1)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = SouthWind)
    )
  )
  
WestWind =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 10.0.1.44)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = WestWind)
    )
  )


What I need to be able to do is come up with a regular expression that will allow me to identify the TNS Name and then capture everything inside the parentheses that follow. I've been looking into using a Regex balancing group but haven't quite got the hang of it. Here's what I have so far:
[\n][\s]*[^\(]SouthWind[\s]*=[\s]*((?<Begin>[(]).*(?<End-Begin>[)]))


This isn't working. The capture group overruns the closing parenthesis. There seems to be little good documentation on balancing groups in Regex. They seem like the bastard child that nobody wants to talk about.

This is what I want to get:
(DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 10.0.0.1)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = SouthWind)
    )
  )



Can anyone help?
Dan Craciun

If you don't mind adding a couple of LF at the end of the file, this should work:
^\w+ =\n(.*?\)\n\s*\))\n\s*\n



HTH,
Dan
Russ Suter

ASKER

I just tried that. I added the LFs at the end as you suggested. It didn't match on anything.
OK. What do you use to test?

I use Expresso
I also tried this online Regex tester: https://regex101.com/
And Visual Studio

None of them worked
You forgot the modifiers:
1. Dot matches line breaks.
  s flag on regex101.com, Singleline on VS
2. ^$ match at line breaks.
  m flag on regex101.com, Multiline on VS

Here is the link: https://regex101.com/r/vT4gC5/1
Didn't forget those. Both are enabled. It still doesn't work.
Did you click on the link? It shows the first match.

Add g (global) to see all matches.
It doesn't work in C#. That's where I need it.
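using System;
using System.Text.RegularExpressions;

// A verbatim string (@"...") is needed for a literal that spans several lines.
string subjectString = @"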
# Generated by Oracle configuration tools.

NorthWind =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 10.0.0.1)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = NorthWind)
    )
  )

SouthWind =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 10.0.0.1)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = SouthWind)
    )
  )
  
WestWind =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 10.0.1.44)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = WestWind)
    )
  )

";

MatchCollection allMatchResults = null;
try {
	Regex regexObj = new Regex(@"^\w+ =\n(.*?\)\n\s*\))\n\s*\n", RegexOptions.Singleline | RegexOptions.Multiline);
	allMatchResults = regexObj.Matches(subjectString);
	if (allMatchResults.Count > 0) {
		// Access individual matches using allMatchResults[i]; Groups[1] holds the parenthesized block
	} else {
		// Match attempt failed
	} 
} catch (ArgumentException ex) {
	// Syntax error in the regular expression
}



If it does not work, make sure the line endings are <LF> (like in the sample you provided), not <CR><LF>.

If the file has Windows line endings, then use this (basically replace \n with \r\n):
^\w+ =\r\n(.*?\)\r\n\s*\))\r\n\s*\r\n


If you want documentation on balancing groups, you can check this: http://www.regular-expressions.info/balancing.html
I've already been there and read through it.

Ultimately I just ended up not using Regex for capturing the text. I decided to just use it to determine my start point then just wrote a quick program that parses the following text character by character and keeps track of the parentheses. Sometimes I guess the brute force approach is the best.
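For anyone landing here later, the approach described above can be sketched roughly as follows. This is not the original code, which was never posted. It is a minimal reconstruction that assumes the identifier sits at the start of a line and that parentheses never appear inside quoted values. The class and method names (TnsBlockReader, ExtractBlock) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TnsBlockReader
{
    // Returns the balanced "(...)" block that follows "<name> =", or null if
    // the name is not found or the parentheses never balance.
    public static string ExtractBlock(string text, string name)
    {
        // Regex only locates the start: the identifier at the beginning of a
        // line, an equals sign, then the first opening parenthesis.
        Match start = Regex.Match(
            text,
            @"^[ \t]*" + Regex.Escape(name) + @"\s*=\s*\(",
            RegexOptions.Multiline);
        if (!start.Success) return null;

        int open = start.Index + start.Length - 1; // index of the first '('
        int depth = 0;
        for (int i = open; i < text.Length; i++)
        {
            if (text[i] == '(') depth++;
            else if (text[i] == ')' && --depth == 0)
                return text.Substring(open, i - open + 1); // block is balanced
        }
        return null; // ran out of text before the block closed
    }
}
```

Because the loop only counts depth, it does not care about line endings or about whatever follows the last entry in the file.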
I've requested that this question be closed as follows:

Accepted answer: 0 points for Russ Suter's comment #a41742550

for the following reason:

None of the above offered solutions actually worked.
The regular expression provided works. Proof: https://regex101.com/r/vT4gC5/1 . Add g to see all matches.

The first time the author mentioned C# is comment ID: 41735968.

After that I provided sample code in C#.

I only tested with (and provided a solution for) the sample data provided by the author in the question. If it's different from the live data... I can't test on something I can't see.
Your protestation that it works doesn't actually make it work.
So... ignore a working solution (you can protest, but the link on regex101.com proves the regular expression works on the sample you provided), then accept a semi-blind link just out of spite.
It doesn't work. It may work on some website but it doesn't work in a real world application. I gave the other solution a C grade because it offered some (but not enough) information. It wasn't personal or spiteful.
You're joking, right?

I provided the actual data (names and IP addresses changed) and the language, admittedly not at first but in a later post I did.

The solution DOES NOT work on what I provided. It may work in your test case but not in my real-world case. I'm really not sure why you fail to understand this.

Furthermore, I'm not the one getting upset. I solved my issue and moved on by using a different solution. You objected so I revisited and gave some (but not full) credit to the only link that offered anything actually useful.

There are plenty of times I answered a question and someone else's answer was accepted even though I thought mine was perfectly valid. I just moved along. I suggest you do the same.

I don't think I'll be able to add anything to this discussion. I'll not reply again. Feel free to get the last word if you wish.
I don't have access to your real world case. I only have access to what you provided on your question.

My regular expression works on your (not my) test data. I only copy/pasted from your post.
Verified in RegexBuddy, regex101.com and Visual Studio.

If it is different from the real world data, how can I (or anyone) provide a solution???

On your next questions, please read up on and try to provide an SSCCE.

Thank you.
Let's set aside for a moment that the proposed solution didn't work. It actually didn't even properly address the question which involved finding a specific block of text following an identifier. An alternate solution was found. However, I'm happy to wait a while longer for a more complete Regex based solution. Allow me to specify the requirements more fully. I've attached the actual TNS names file (appended with a .txt extension which normally isn't there).

Here are the requirements and restrictions:

1. I cannot in any way modify the file. I must read it as-is on the computer.
2. The file may or may not have additional characters following the last entry. These characters should be irrelevant.
3. I need to be able to extract the text within a balanced block of parentheses following an identifier. In the attached file there are 3 entries and the identifiers are:
    NorthWind =
    SouthWind =
    WestWind =
  The parenthesized block following any one of these (specified by user input) must be extracted. The outer parentheses are optional since I know I can add them back in if they are omitted.

If a C# code block produces the desired result on this website: http://rextester.com/ I will consider it a success.
tnsnames.ora.txt
As I said in my answer above if the pattern does not work with \n it simply means that you have Windows line endings (\r\n).

Here is the link to working proof: http://rextester.com/ADXFEV61390

I'm not a C# programmer so you'll need to write the loop to find the rest of the matches yourself.
If you can't do that, please post a new question in the appropriate TA.

Thank you.
Dan's suggestion appears to work in regexhero.net, which is a Silverlight app (hence it uses .NET's regex engine). You could make the carriage returns optional ( \r? ) to account for either style of line ending:

^\w+ =\r?\n(.*?\)\r?\n\s*\))\r?\n\s*\r?\n



Expresso should work as well if you account for the line ending issue that Dan mentioned.
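A rough C# sketch of that pattern in use, including the loop over all matches that was left as an exercise earlier in the thread. The class and method names (TnsEntryLister, ListBlocks) are invented for the example. Note that the pattern keys on layout rather than on real parenthesis balancing: it expects each entry to be followed by a blank line, so it will miss a last entry that ends right at the end of the file.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TnsEntryLister
{
    // The pattern from the post above: every \n is preceded by an optional \r
    // so that LF and CRLF files are handled alike.
    private const string Pattern = @"^\w+ =\r?\n(.*?\)\r?\n\s*\))\r?\n\s*\r?\n";

    // Returns the captured block (group 1) of every entry, trimmed.
    public static List<string> ListBlocks(string text)
    {
        var blocks = new List<string>();
        // Singleline: dot matches line breaks. Multiline: ^ matches at each line start.
        RegexOptions options = RegexOptions.Singleline | RegexOptions.Multiline;
        foreach (Match m in Regex.Matches(text, Pattern, options))
        {
            blocks.Add(m.Groups[1].Value.Trim());
        }
        return blocks;
    }
}
```

Group 1 holds the parenthesized block. The Trim() call only drops the leading indentation that the capture picks up.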
ASKER CERTIFIED SOLUTION
louisfr

This solution is only available to members.
This is almost perfect. By switching out the leading \w+ with the keyword identifier (SouthWind for example) I was able to match exactly the text I needed AND it uses balancing groups as requested. I modified it slightly to use capture groups. The final product looks like this:

(?:SouthWind\s+=\s+)(\([^()]+(?>(?>(?'open'\()[^()]*)+(?>(?'-open'\))[^()]*)+)+(?(open)(?!))\))

I'll programmatically drop in the appropriate identifier in place of SouthWind as needed using a simple string concatenation.

I'm normally pretty good with Regex but this one is a doozy.
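As a sketch of the concatenation step described above (the actual code was not posted, so the names TnsBalancedMatcher and ExtractBlock are hypothetical): the tail of the pattern is kept exactly as quoted, and only the identifier is swapped in. Using Regex.Escape is an addition here, on the assumption that an identifier could contain regex metacharacters such as a dot.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TnsBalancedMatcher
{
    // Everything after the identifier, copied unchanged from the pattern above.
    private const string BalancedTail =
        @"\s+=\s+)(\([^()]+(?>(?>(?'open'\()[^()]*)+(?>(?'-open'\))[^()]*)+)+(?(open)(?!))\))";

    // Returns the balanced block following "<name> =", or null if nothing matches.
    public static string ExtractBlock(string text, string name)
    {
        // Escape the user-supplied identifier so it is matched literally.
        string pattern = "(?:" + Regex.Escape(name) + BalancedTail;
        Match m = Regex.Match(text, pattern);
        // Group 1 is the only unnamed capture: the outer parentheses and their content.
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

One thing to keep in mind when throwing curve balls at it: as written, the pattern expects at least one nested pair inside the outer parentheses, so a flat entry such as (A = 1) would not match.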
"this one is a doozy"
Which means that if you're using this in production code, then it is probably not the best approach. Regex is a very good and powerful tool, but that doesn't always mean it's the best tool. Can you really say that in six months you'll be able to digest that regex and know what it does or means? What about people coming behind you? Will they understand what it does?

If this is for some one-off, potentially throw-away utility, then it's of less consequence.
I think balancing groups are easy in theory but getting the details right is tricky.
I always start from the same pre-made regex from the page I linked to earlier.
Agreed. While I'm generally quite Regex adept I found exactly as you said, getting the details right is tricky. And since balancing groups aren't supported by most Regex flavors it's a bit of a specialized art.

I had a perfectly working piece of code that knew where to start based on a simple Regex and then just read each character until the parentheses balanced out. It's a simple loop operation in C#. Now that I also have a viable Regex sample my next step is to consider performance and try to throw a few curve balls at the solution to see how it behaves.

@käµfm³d 👽
"Can you really say that in six months you'll be able to digest that regex and know what it does or means? What about people coming behind you? Will they understand what it does?"
That's what code comments are for. ;)
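One way to put comments right next to the pattern is .NET's RegexOptions.IgnorePatternWhitespace, which ignores unescaped whitespace in the pattern and treats # as the start of a comment. The sketch below lays out the same pattern that way. The class and method names (CommentedTnsPattern, Extract) are made up, and the comments are one reading of what each part does.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentedTnsPattern
{
    // Same pattern as quoted earlier, spread out for free-spacing mode.
    private const string BlockAfterName = @"
        \s+ = \s+                        # equals sign after the identifier
        (                                # group 1: the whole balanced block
          \(                             # outermost opening parenthesis
          [^()]+                         # text up to the first nested parenthesis
          (?>
            (?> (?'open'  \( ) [^()]* )+ # push one entry per opening parenthesis
            (?> (?'-open' \) ) [^()]* )+ # pop one entry per closing parenthesis
          )+
          (?(open)(?!))                  # fail if anything is left on the stack
          \)                             # outermost closing parenthesis
        )";

    // Returns the balanced block following "<name> =", or null if nothing matches.
    public static string Extract(string text, string name)
    {
        // Regex.Escape also escapes spaces and '#', so the name is safe in this mode.
        var regex = new Regex(
            Regex.Escape(name) + BlockAfterName,
            RegexOptions.IgnorePatternWhitespace);
        Match m = regex.Match(text);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```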
This is a post that I quote increasingly often: stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

Regex is a beautiful tool. Just don't use it for everything, as it becomes clunky very quickly.
Good programmers comment why something is done, not what something is doing

= )