Regular expression

Hi!

How do you write a regular expression to get the value of the name parameter from an HTML string like this:

<input type="text" name="namedata">

I want the output to be: namedata

PHP or Perl, thanks!
Asked by: lorenz

BernhardBrueck commented:
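my $html = '<input type="text" name="namedata">';
if ($html =~ /<input type="text" name="([^"]*)"/) {
    print "$1\n";    # $1 holds whatever the round brackets matched
}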

Hope that helps,
  Bernhard Brueck
webauk commented:
I know this question was answered a long time ago, but in case anyone else has a similar problem, here's a slightly more comprehensive solution.

Bernhard's solution above works fine BUT only if the text you're matching is EXACTLY
<input type="text" name="namedata">

It would not work for
<INPUT type="text" name="namedata">

Or
<input type = "text" name = "namedata">

Or
<  input type='text' name='namedata'  >

Or
<input type=text name=namedata>

Or
<input name="namedata" type="text" >

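All of these are form lines you may meet in real pages, and all but one are valid HTML: strictly speaking, a space directly after the opening < (as in the third example) is not valid, though it does no harm to allow for it. Oh, you can also split a line of HTML over more than one line of a file / web page as well :-)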

To tackle these complications one at a time:-

Upper / lower case is easy - just use the i option (put an i after the closing slash of the reg ex).

Spaces between terms are handled by using \s which represents a whitespace character. However, you do not know how many whitespace characters to match, or indeed if there are any at all. Adding * (giving \s*) tells the reg-ex engine to match zero or more of them. The only downside to this is that it can make the reg ex a little "messy" to look at later ;-)
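
As a rough illustration of these first two points, here is a minimal Perl sketch that only adds the i option and \s* around each = sign to Bernhard's pattern. The helper name simple_name_value and the sample lines are made up for the example, and it still assumes double quotes and type before name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the name value, or undef if the line does not match.
sub simple_name_value {
    my ($line) = @_;
    # /i ignores case; \s* allows zero or more whitespace around each "=".
    return $line =~ /<input\s+type\s*=\s*"text"\s+name\s*=\s*"([^"]*)"/i ? $1 : undef;
}

for my $line ('<INPUT type="text" name="namedata">',
              '<input type = "text" name = "namedata">') {
    my $name = simple_name_value($line);
    print defined($name) ? "matched: $name\n" : "no match: $line\n";
}
```

Both sample lines print "matched: namedata". Single quotes, missing quotes and name before type are still not handled at this stage.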

Values can be enclosed in single or double quotes or have no quotes at all. I prefer to use "?'? which means match zero or one double quote followed by zero or one single quote. This is not perfect because it would match "' (double quote followed by a single quote) as well, but in the work I do this would not occur. An alternative is to define a character class of ["'] which would match either a single quote or a double quote. If there is no quote mark to delimit the content then the data should terminate at the first whitespace. So this gives us a choice of two patterns to match.

One of them is ["'].*?["'] which breaks down into delimiter, a bunch of characters, delimiter. The "bunch of characters" match uses a ? to make it non-greedy; otherwise it will match ALL the characters up to the very last quote on the line.

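The other possibility is where there is no quote delimiter. This means we want to match from the first non-whitespace character to the first whitespace character (you can also use word-boundary matches by the way but I'll let you look that up yourself). This gives .*?\s Bear in mind that this needs whitespace after the value; an unquoted value that runs straight into the closing > (as in name=namedata>) needs something like [^\s>]+ instead.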

In regular expressions, a list of alternatives is written inside round brackets with a pipe character between the alternatives. So the match is:

(["'].*?["']|.*?\s)
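
Note that the alternation above captures the delimiters too (the quotes, or the trailing whitespace). As a hedged sketch of one way around that, the Perl below swaps in a slightly different technique: one capturing group per quoting style, with [^\s>]+ for the unquoted case so the value may also end at the closing >. The helper name name_value is invented for the example, and only one of the three groups is ever defined after a match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the value of name=..., or undef if none is found.
sub name_value {
    my ($tag) = @_;
    # Three alternatives: "double quoted", 'single quoted', or unquoted.
    # The unquoted case stops at whitespace or at the closing ">".
    if ($tag =~ /\bname\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i) {
        my ($value) = grep { defined } ($1, $2, $3);
        return $value;
    }
    return;
}

for my $tag ('<input type="text" name="namedata">',
             "<input type='text' name='namedata'>",
             '<input type=text name=namedata>') {
    print name_value($tag), "\n";
}
```

Each of the three sample tags prints namedata, without quotes or trailing whitespace.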

The last difficulty is how to handle the fact that it can be either "name=" then "type=" or "type=" then "name=". There are different ways to handle this; the method you use will depend on what you want to do with the data after you've matched it.

One possibility is to match alternatives, for example (name|type)= However, this also matches a line with the same attribute twice, such as:
<input type="namedata" type="text" > which may or may not be suitable for your particular needs.

Another possibility is to match anything=  Again, this might or might not be suitable.

If you absolutely have to ensure that you have name= type= (or the other way around!) then you are forced into using more alternatives - this works, but will make for veeery big reg exes! For example:
(name=["'].*?["'] type=["'].*?["']|type=["'].*?["'] name=["'].*?["'])
Note that, in order to simplify this, I have left out the checks for variable amounts of whitespace.

Taking all of the above into account you could end up with the following Reg Ex:

/<\s*input\s*(name\s*=\s*["']?.*?["']?\s*type\s*=\s*["']?.*?["']?|type\s*=\s*["']?.*?["']?\s*name\s*=\s*["']?.*?["']?)\s*>/i

Wow!

That's all very well, and will match a valid <INPUT type= name= > statement BUT if you want to extract the values of name and type then you will need to add more round brackets so that matches are copied to regular expression memory. This serves to further complicate the Reg Ex like so...

/<\s*input\s*(name\s*=\s*["']?(.*?)["']?\s*type\s*=\s*["']?(.*?)["']?|type\s*=\s*["']?(.*?)["']?\s*name\s*=\s*["']?(.*?)["']?)\s*>/i;

When I run this in Perl I get values returned in either $2 & $3 (name, then type) when the line was <INPUT name= type= > OR $4 & $5 (type, then name) when the line was <INPUT type= name= >
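
A minimal sketch of how the two pairs of captures could be folded into one result, assuming the capturing reg ex above with \s* allowed after every = sign. The helper name name_and_type is made up for the example; it relies on the captures of the alternative that did not match being undefined.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (name, type) for a matching line, or an empty list.
sub name_and_type {
    my ($line) = @_;
    if ($line =~ /<\s*input\s*(name\s*=\s*["']?(.*?)["']?\s*type\s*=\s*["']?(.*?)["']?|type\s*=\s*["']?(.*?)["']?\s*name\s*=\s*["']?(.*?)["']?)\s*>/i) {
        # First alternative (name then type) fills $2 and $3;
        # second alternative (type then name) fills $4 and $5.
        return defined($2) ? ($2, $3) : ($5, $4);
    }
    return;
}

for my $line ('<input name="namedata" type="text" >',
              '<INPUT type=text name=namedata>') {
    my ($name, $type) = name_and_type($line);
    print "name=$name type=$type\n";
}
```

Both sample lines print "name=namedata type=text", whichever order the attributes came in.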

In conclusion: if you need it, reg exes can be made very complicated / comprehensive to catch all the valid variations of syntax of HTML lines. If you end up with a reg ex this complicated, document it well (you can add comments directly into reg exes using the x option - go look it up).
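
For what a commented reg ex might look like, here is a sketch using Perl's x option, which ignores whitespace in the pattern and treats # to end of line as a comment. The variable name $input_re is invented for the example, and the pattern assumes \s* is allowed after every = sign. The comments deliberately avoid $, @ and brace characters, since the pattern is still interpolated and delimited like any other.

```perl
use strict;
use warnings;

my $input_re = qr{
    < \s* input \s*                        # tag opener, any spacing
    (                                      # group 1: the whole name and type pair
        name \s* = \s* ["']? (.*?) ["']?   # group 2: name value
        \s*
        type \s* = \s* ["']? (.*?) ["']?   # group 3: type value
      |                                    # or, in the other order
        type \s* = \s* ["']? (.*?) ["']?   # group 4: type value
        \s*
        name \s* = \s* ["']? (.*?) ["']?   # group 5: name value
    )
    \s* >                                  # closing bracket
}xi;

if ('<input type = "text" name = "namedata">' =~ $input_re) {
    my $name = defined($2) ? $2 : $5;
    print "$name\n";
}
```

The sample line has type first, so the name ends up in group 5 and the script prints namedata.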
If you end up with a reg ex this complicated you might instead choose to break the parsing into a number of steps. This could be slower to execute, but quicker to code and easier to maintain.
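
One possible way to split the work into steps, sketched in Perl: first isolate the text between <input and the closing >, then walk the attributes one at a time and store them in a hash, so their order no longer matters. The helper name input_attributes is made up, and the sketch assumes no attribute value contains a > character.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash of attribute => value for the first
# <input ...> tag in the text, or an empty list if there is none.
sub input_attributes {
    my ($html) = @_;

    # Step 1: isolate the attribute text of the tag (may span several lines).
    my ($attr_text) = $html =~ /<\s*input\b([^>]*)>/i or return;

    # Step 2: walk the attributes one at a time, in whatever order they appear.
    my %attr;
    while ($attr_text =~ /([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|(\S+))/g) {
        my ($value) = grep { defined } ($2, $3, $4);
        $attr{lc $1} = $value;
    }
    return %attr;
}

my %attr = input_attributes(qq{<INPUT\n  name='namedata'\n  type='text'>});
print "$attr{name}\n";
```

The sample tag is split over three lines and uses single quotes; the script prints namedata. Each of the two reg exes is short enough to read on its own, which is the point of doing it in steps.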

Oh, and there's always more than one way to do it.

webauk