
Regular Expression Problems

GraemeScotland
I am having a few problems with the optional flags that follow a regular expression. I want to match and extract an ID from an HTTP response. The content of the response looks like this:

clusterid=23 blahblahblah>TestCluster
clusterid=24 blahblahblah>TestCluster2
clusterid=25 blahblahblah>TestCluster3
clusterid=26 blahblahblah>TestCluster4
clusterid=27 blahblahblah>TestCluster5

Now, I want to extract the value 27 from the content above (I know the name of the cluster, TestCluster5, so I use that in the regular expression):

$regexp = 'clusterid=(\d+).+TestCluster5';
$content =~ /$regexp/mig

This works OK, and the result is 27.

My problem is that when trying to use the same code to extract some more data from another response, it won't work. This is the new content

<textarea name="blah">bob@yahoo.com
                      jim@lycos.com
                      tam@hotmail.com</textarea>

Here I want to extract the three email addresses from within the textarea tags, using the same match statement

$content =~ /$regexp/mig

with $regexp this time being "<textarea.+>(.+)</textarea>"

the regular expression fails because the email addresses span more than one line. To counteract this, I can add an s flag at the end of the regular expression, i.e.

$content =~ /$regexp/migs

This allows the . wildcard to match newline characters as well. When I try this, it works fine, except that the first regular expression now stops working.

A quick reminder of the first content, where we want to get 27 as the clusterid of TestCluster5:

clusterid=23 blahblahblah>TestCluster
clusterid=24 blahblahblah>TestCluster2
clusterid=25 blahblahblah>TestCluster3
clusterid=26 blahblahblah>TestCluster4
clusterid=27 blahblahblah>TestCluster5

Now, after the addition of the s flag in the regular expression line, it produces the result 23!! This is clearly wrong and I can guess why it is doing it. It is obviously matching clusterid=(\d+) then .+ will match everything including newline chars until it gets to TestCluster5. I don't want this to happen though, how do I make it only get the 27 part??

Thanks in advance

Graeme
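
The behaviour described in the question can be reproduced with a short script. This is only a sketch: the variable names $without_s and $with_s are made up for illustration, the pattern is single-quoted so that \d reaches the regex engine intact, and the /g flag is left out because only the first match is needed.

```perl
use strict;
use warnings;

# Sample response taken from the question.
my $content = join "\n",
    'clusterid=23 blahblahblah>TestCluster',
    'clusterid=24 blahblahblah>TestCluster2',
    'clusterid=25 blahblahblah>TestCluster3',
    'clusterid=26 blahblahblah>TestCluster4',
    'clusterid=27 blahblahblah>TestCluster5';

# Single quotes keep \d as a regex escape rather than a string escape.
my $regexp = 'clusterid=(\d+).+TestCluster5';

# Without /s, "." stops at each newline, so only the last line can match.
my ($without_s) = $content =~ /$regexp/mi;

# With /s, ".+" runs across newlines, so the match starts at the
# first "clusterid=" and stretches all the way to "TestCluster5".
my ($with_s) = $content =~ /$regexp/mis;

print "without /s: $without_s\n";   # prints "without /s: 27"
print "with /s:    $with_s\n";      # prints "with /s:    23"
```

The regex engine always takes the leftmost match it can find, which is why /s turns the answer into 23 rather than 27.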


Commented:
As you mentioned, the /s flag causes . to match newlines as well.  You can undo that effect in specific places by using [^\n] in place of some of the periods.

However, it's usually a good idea to make regular expressions as specific as possible, because slightly different arrangements of the text could also cause [^\n] to break.  You could perhaps try something like [^<>] so you know you won't cross tag boundaries.
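
As a rough illustration of the two suggestions above, the sketch below keeps the /s flag switched on and swaps the period for a negated character class. The variable names are hypothetical, and the second pattern also adds a literal > before the cluster name, which is an assumption about how the real response is laid out.

```perl
use strict;
use warnings;

# Sample response taken from the question.
my $content = join "\n",
    'clusterid=23 blahblahblah>TestCluster',
    'clusterid=24 blahblahblah>TestCluster2',
    'clusterid=25 blahblahblah>TestCluster3',
    'clusterid=26 blahblahblah>TestCluster4',
    'clusterid=27 blahblahblah>TestCluster5';

# /s is still on, but [^\n]+ can never cross a line boundary.
my ($id_no_newline) = $content =~ /clusterid=(\d+)[^\n]+TestCluster5/is;

# Stricter: [^<>]* cannot step over a tag delimiter, so the match
# has to reach ">TestCluster5" without passing another ">".
my ($id_no_tags) = $content =~ /clusterid=(\d+)[^<>]*>TestCluster5/is;

print "with [^\\n]: $id_no_newline\n";   # prints "with [^\n]: 27"
print "with [^<>]: $id_no_tags\n";       # prints "with [^<>]: 27"
```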
ozo

Commented:
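# Single quotes keep \d and \n intact for the regex engine.
$regexp = 'clusterid=(\d+)[^>\n]*>TestCluster5';   # for the cluster list
$regexp = '(?s)<textarea[^>]*>(.+?)</textarea>';   # for the textarea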
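
A minimal sketch of how the two patterns might be used side by side, assuming the sample data from the question. The names $clusters, $form, $cluster_re and $textarea_re are invented for the example, the patterns are compiled with qr// instead of being stored as strings, and the final split on whitespace is an extra step that is not part of the patterns themselves.

```perl
use strict;
use warnings;

# Sample data taken from the question.
my $clusters = join "\n",
    'clusterid=23 blahblahblah>TestCluster',
    'clusterid=24 blahblahblah>TestCluster2',
    'clusterid=25 blahblahblah>TestCluster3',
    'clusterid=26 blahblahblah>TestCluster4',
    'clusterid=27 blahblahblah>TestCluster5';

my $form = join "\n",
    '<textarea name="blah">bob@yahoo.com',
    '                      jim@lycos.com',
    '                      tam@hotmail.com</textarea>';

# No "." in this pattern, so the /s flag cannot affect it.
my $cluster_re = qr/clusterid=(\d+)[^>\n]*>TestCluster5/i;

# (?s) turns on "dot matches newline" inside this pattern only;
# .+? is non-greedy, so it stops at the first closing tag.
my $textarea_re = qr{(?s)<textarea[^>]*>(.+?)</textarea>}i;

my ($id)   = $clusters =~ $cluster_re;
my ($body) = $form     =~ $textarea_re;

# split ' ' drops leading whitespace and splits on runs of whitespace.
my @emails = split ' ', $body;

print "id: $id\n";             # prints "id: 27"
print "emails: @emails\n";     # prints "emails: bob@yahoo.com jim@lycos.com tam@hotmail.com"
```

Putting (?s) inside the pattern means the same match statement and flags can be used for both responses, since the dot-matches-newline behaviour travels with the pattern that needs it.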

Author

Commented:
Sorry I didn't accept this earlier; EE didn't bother emailing me to say that someone had posted something.

