Solved

Loading an external form with fields and attributes into an array via preg_match_all

Posted on 2011-09-07
Last Modified: 2012-05-12
Hi there. I have the following code, in which the function takes HTML loaded via file_get_contents and parses it for forms. For each form's HTML, it is then supposed to retrieve the form's name and action attributes; input types, names, values and possibly checked attributes; textarea name attributes and values; select name and value attributes (I am still working on catching the selected option); and button name and value attributes. The regex is adapted from a working PCRE pattern used in another part of the site, but PHP reports that | is an unknown modifier. I need to parse the attributes no matter what order they are in, using a single preg_match_all PCRE pattern.
The function is as follows:
function getform($html) {
	/*
	- load form as:
	  array(
	   [type],
	   [name],
	   [value],
	   [checked]
	  );
	*/
	$ret = array();
	$forms = array();
	preg_match_all("/\<form [^\>]+\>.*\<\/form\>/ims",
			$html,
			$out);
	print_r($out);
	foreach ($out[0] as $form) {
		array_push($forms,$form);
	}
	foreach ($forms as $form) {
		$pattern = '/\<(form).*\sname="([^"]+)".*\saction="([^")".*\>'.
			   '|\<(input).*\stype="([^"]+)".*\sname="([^"]+)".*\svalue="([^"]+)".*\schecked="([^"]+)".*\/\>'.
			   '|\<(textarea).*\sname="([^"]+)".*\>(.*)*\<\/textarea\>/ims'.
			   '|<(select).*\sname="([^"]+)".*>'.
			   '|<(button).*\sname="([^"]+)".*\svalue="([^"]+)".*>';
		preg_match_all($pattern,
					$form,
					$out);
		print_r($out);
		/*foreach ($out[1] as $type) {
			switch ($type) {
				case "form":
					array_push($ret, $out[2][i]);
					break;
				case "input":
					
					break;
				case "textarea":
					
					break;
				case "select":
					
					break;
				case "button":
					
					break;
			}
		}*/
	}
}

Please help me complete the second regular expression, on line 21 of the function. The first preg_match_all retrieves all forms on any page loaded via file_get_contents and passes an array of forms to the loop that starts on line 20.
namaste - Greywacke.
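
A side note on the first preg_match_all in the function: with the s modifier, a greedy .* runs to the last closing form tag on the page, so a page with several forms may come back as one match. The sketch below compares greedy and lazy quantifiers on made-up markup (the two sample forms are assumptions, not taken from any real page).

```php
<?php
// Hypothetical page with two forms.
$html = '<form name="a"><input name="x"></form>'
      . '<p>text</p>'
      . '<form name="b"><input name="y"></form>';

// Greedy .* runs to the last </form>, so both forms come back as one match.
$greedyCount = preg_match_all('/<form [^>]+>.*<\/form>/is', $html, $greedy);

// Lazy .*? stops at the first </form> after each opening tag.
$lazyCount = preg_match_all('/<form [^>]+>.*?<\/form>/is', $html, $lazy);

var_dump($greedyCount, $lazyCount);
```

With the lazy version, each element of $lazy[0] holds the HTML of a single form, which is what the second loop in the function expects.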
Question by:intellisource
13 Comments
 

Author Comment

by:intellisource
ID: 36495147
OK, I discovered why the | characters were returning an error: the regular expression was closed before the end of the string! :S
I have replaced the pattern with
		$pattern = '/\<(form).*name="([^"]+)".*action="([^")".*\>'.
			   '|\<(input).*type="([^"]+)".*name="([^"]+)".*value="([^"]+)".*checked="([^"]+)".*\/\>'.
			   '|\<(textarea).*name="([^"]+)".*\>(.*)*\<\/textarea\>'.
			   '|<(select).*name="([^"]+)".*>'.
			   '|<(button).*name="([^"]+)".*value="([^"]+)".*>/ims';

Unfortunately, it still returns empty matches :/ What's wrong here? I know it is supposed to match the attributes in any order within a tag, but I do not see how this can be done, especially when I cannot get any results yet.
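
For anyone hitting the same "unknown modifier" warning, here is a minimal sketch of the delimiter issue described in this comment. The sample markup and the shortened patterns are assumptions made for illustration; only the position of the closing delimiter matters.

```php
<?php
// Hypothetical sample markup, not taken from the page in question.
$html = '<input type="text" name="user" value="bob" />' . "\n"
      . '<button name="go" value="1">';

// Broken: the closing delimiter and modifier sit in the middle of the
// concatenated string, so PCRE reads the following "|" as a modifier.
$broken = '/<(input)[^>]*>/i' . '|<(button)[^>]*>';

// Fixed: one opening delimiter, one closing delimiter, modifiers last.
$fixed = '/<(input)[^>]*>' . '|<(button)[^>]*>/i';

// @ suppresses the "Unknown modifier '|'" warning for this demo.
$brokenResult = @preg_match_all($broken, $html, $m1);
$fixedResult  = preg_match_all($fixed, $html, $m2);

var_dump($brokenResult); // false: the pattern failed to compile
var_dump($fixedResult);  // int(2)
```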
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 36495345
Let's make a few corrections to your pattern to make it more sound:

$pattern = '#<(?<form_tag>form)(?=[^>]*name="(?<form_name>[^"]+))(?=[^>]*action="(?<form_action>[^"]*))' .
           '|<(?<input_tag>input)(?=[^>]*type="(?<input_type>[^"]+))(?=[^>]*name="(?<input_name>[^"]+))(?=[^>]*value="(?<input_value>[^"]+))(?=[^>]*checked="(?<input_checked>[^"]+))' .
           '|<(?<textarea_tag>textarea)(?=[^>]*name="(?<textarea_name>[^"]+))[^>]*>(?<textarea_content>[^<]' .
           '|<(?!/textarea))*|<(?<select>select)(?=[^>]*name="(?<select_name>[^"]+))' .
           '|<(?<button_tag>button)(?=[^>]*name="(?<button_name>[^"]+))(?=[^>]*value="(?<button_value>[^"]+))#i';

Please note:  I've added named groups for capturing your data. The group names are the identifiers inside of angle brackets ( < ... >) and containing underscores (e.g. "form_tag"). You can index your match array using those values instead of indexes (e.g. $match['form_tag']).

As for what I changed: instead of relying on the dot-star, which can be tricky for HTML parsing, I've switched to a zero-or-more search for anything that is not a closing angle bracket ( > ). This means your patterns won't leave the safety of each tag, provided your HTML is properly structured! I've also moved to a positive lookahead approach [ (?= ... ) ] for finding your attribute values. The benefit of this approach is that your attributes can be in any order within the tag. Currently, your attributes will only be found if they occur in the order you have them defined in your pattern. This is entirely fine if you are guaranteed that the values will always occur in the same order. My approach will give you a bit more flexibility.

I've only mildly tested the above pattern, but I believe it to be sound. If it is not working for you, and you care to provide an example of your data, I can do further verification.
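
To illustrate the lookahead idea on its own, here is a small sketch that compares an order-dependent pattern with a lookahead-based one. The two input tags are made-up sample data, and both patterns are cut down to a single element type to keep the comparison readable.

```php
<?php
// Two hypothetical tags with the same attributes in a different order.
$html = '<input type="text" name="a" value="1">'
      . '<input value="2" name="b" type="radio">';

// Order-dependent: type must come before name, and name before value.
$ordered = '#<input[^>]*type="([^"]*)"[^>]*name="([^"]*)"[^>]*value="([^"]*)"#i';

// Order-independent: each lookahead scans the tag from the same position
// (just after "<input"), and [^>]* keeps the scan inside the tag.
$anyOrder = '#<input(?=[^>]*type="([^"]*)")(?=[^>]*name="([^"]*)")(?=[^>]*value="([^"]*)")#i';

$orderedCount  = preg_match_all($ordered, $html, $m1, PREG_SET_ORDER);
$anyOrderCount = preg_match_all($anyOrder, $html, $m2, PREG_SET_ORDER);

var_dump($orderedCount, $anyOrderCount);
print_r($m2[1]); // groups captured from the second, reordered tag
```

The ordered pattern only finds the first tag, while the lookahead version finds both because a lookahead consumes nothing and every attribute search restarts from the tag name.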
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 36495387
This "error" won't affect the operation of the pattern, but to be consistent with your readability convention I need to post a correction (I broke the pattern in the wrong place):

$pattern = '#<(?<form_tag>form)(?=[^>]*name="(?<form_name>[^"]+))(?=[^>]*action="(?<form_action>[^"]*))' .
           '|<(?<input_tag>input)(?=[^>]*type="(?<input_type>[^"]+))(?=[^>]*name="(?<input_name>[^"]+))(?=[^>]*value="(?<input_value>[^"]+))(?=[^>]*checked="(?<input_checked>[^"]+))' .
           '|<(?<textarea_tag>textarea)(?=[^>]*name="(?<textarea_name>[^"]+))[^>]*>(?<textarea_content>[^<]|<(?!/textarea))*' .
           '|<(?<select_tag>select)(?=[^>]*name="(?<select_name>[^"]+))' .
           '|<(?<button_tag>button)(?=[^>]*name="(?<button_name>[^"]+))(?=[^>]*value="(?<button_value>[^"]+))#i';

It seems I also created the "select" tag's group name inconsistently with the others--I forgot the "_tag" part of the name. This is also corrected above.

Author Comment

by:intellisource
ID: 36495858
OK, this seems to work, but I have decided not to go with returning named subpatterns. It tends to get rather confusing, especially with PREG_PATTERN_ORDER.
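
For reference, a short sketch of how the two result layouts differ when named groups are in play. The markup and the group names tag and name are assumptions for this example, not the names used in the pattern above.

```php
<?php
$html = '<input name="user"><input name="pass">'; // hypothetical markup
$pattern = '#<(?<tag>input)(?=[^>]*name="(?<name>[^"]*)")#i';

// PREG_PATTERN_ORDER (the default): one array per group, indexed by match number.
preg_match_all($pattern, $html, $byGroup, PREG_PATTERN_ORDER);
// $byGroup['name'] holds every captured name across all matches.

// PREG_SET_ORDER: one array per match, each holding that match's groups.
preg_match_all($pattern, $html, $byMatch, PREG_SET_ORDER);
// $byMatch[1]['name'] holds the name captured by the second match.

foreach ($byMatch as $match) {
    echo $match['tag'], ': ', $match['name'], "\n";
}
```

In both layouts each named group also appears under its numeric index, which accounts for the doubled entries in a print_r dump.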

Author Comment

by:intellisource
ID: 36495928
But it only gets the form details; the input details are not correct. They need to work in any situation.
Array
(
    [0] => Array
        (
            [0] => <form
        )

    [form_tag] => Array
        (
            [0] => form
        )

    [1] => Array
        (
            [0] => form
        )

    [form_name] => Array
        (
            [0] => form1
        )

    [2] => Array
        (
            [0] => form1
        )

    [form_action] => Array
        (
            [0] => login.aspx
        )

    [3] => Array
        (
            [0] => login.aspx
        )

    [input_tag] => Array
        (
            [0] => 
        )

    [4] => Array
        (
            [0] => 
        )

    [input_type] => Array
        (
            [0] => 
        )

    [5] => Array
        (
            [0] => 
        )

    [input_name] => Array
        (
            [0] => 
        )

    [6] => Array
        (
            [0] => 
        )

    [input_value] => Array
        (
            [0] => 
        )

    [7] => Array
        (
            [0] => 
        )

    [input_checked] => Array
        (
            [0] => 
        )

    [8] => Array
        (
            [0] => 
        )

    [textarea_tag] => Array
        (
            [0] => 
        )

    [9] => Array
        (
            [0] => 
        )

    [textarea_name] => Array
        (
            [0] => 
        )

    [10] => Array
        (
            [0] => 
        )

    [textarea_content] => Array
        (
            [0] => 
        )

    [11] => Array
        (
            [0] => 
        )

    [select_tag] => Array
        (
            [0] => 
        )

    [12] => Array
        (
            [0] => 
        )

    [select_name] => Array
        (
            [0] => 
        )

    [13] => Array
        (
            [0] => 
        )

    [button_tag] => Array
        (
            [0] => 
        )

    [14] => Array
        (
            [0] => 
        )

    [button_name] => Array
        (
            [0] => 
        )

    [15] => Array
        (
            [0] => 
        )

    [button_value] => Array
        (
            [0] => 
        )

    [16] => Array
        (
            [0] => 
        )

)

An example form that is loaded is available at https://lmx.leads360.com/web/login.aspx
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 36496448
OK, try this correction:

$pattern = '#<(form)(?=[^>]*name="([^"]*))(?=[^>]*action="([^"]*))' .
           '|<(input)(?=(?:[^>]*type="([^"]*))?)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)' .
           '|<(textarea)(?=(?:[^>]*name="([^"]*))?)[^>]*>([^<]|<(?!/textarea))*' .
           '|<(select)(?=(?:[^>]*name="([^"]*))?)' .
           '|<(button)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)#i';

The problem was with the "checked" property, which not every INPUT has. I made all the attributes optional so the aforementioned snag should be averted. I also removed the named groups per your previous comment.
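
A minimal sketch of why the optional wrapper helps, using a made-up tag that has no checked attribute. Both patterns are shortened to two attributes for readability.

```php
<?php
$tag = '<input type="text" name="email">'; // hypothetical: no value, no checked

// Required lookahead: the whole alternative fails when "checked" is absent.
$required = '#<(input)(?=[^>]*name="([^"]*))(?=[^>]*checked="([^"]*))#i';

// Optional lookahead: (?=(?:...)?) always succeeds and captures when it can.
$optional = '#<(input)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)#i';

$requiredCount = preg_match($required, $tag, $m1);
$optionalCount = preg_match($optional, $tag, $m2);

var_dump($requiredCount, $optionalCount);
print_r($m2);
```

The trade-off is that an optional lookahead can never reject a tag, so a missing attribute shows up as an empty or absent group rather than as a failed match.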

Author Comment

by:intellisource
ID: 36496451
I believe the following, using PREG_SET_ORDER, would be closer to my requirements:
		$pattern = '#<(form)[^>]*name="([^"]+)[^>]*action="([^"]+)'.
			    '|<(input)[^>]*type="([^"]+)[^>]*name="([^"]+)[^>]*value="([^"]+)[^>]*checked="([^"]+)'.
			    '|<(textarea)[^>]*name="([^"]+)[^>]*>(^<)/textarea>'.
			    '|<(select)[^>]*name="([^"]+)'.
			    '|<(option)[^>]*value="([^"]+)[^>]*selected="([^"]+)'.
			    '|<(button)[^>]*name="([^"]+)"[^>]*value="([^"]+)"#ims';

which returns
Array
(
    [0] => Array
        (
            [0] => <form name="form1" method="post" action="login.aspx
            [1] => form
            [2] => form1
            [3] => login.aspx
        )

)

This is ideal for the parsing, but it does not match any of the form elements on https://lmx.leads360.com/web/login.aspx :/ I need to return multiple matches for any of the form elements in ANY form loaded from the web, regardless of whitespace inside or outside the matched tags. What am I missing in my PCRE pattern?
LVL 75

Accepted Solution

by:käµfm³d 👽 (earned 500 total points)
ID: 36496467
I told a little white lie:  I didn't make the attributes on the FORM tag optional...  they are now:
$pattern = '#<(form)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*action="([^"]*))?)' .
           '|<(input)(?=(?:[^>]*type="([^"]*))?)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)' .
           '|<(textarea)(?=(?:[^>]*name="([^"]*))?)[^>]*>([^<]|<(?!/textarea))*' .
           '|<(select)(?=(?:[^>]*name="([^"]*))?)' .
           '|<(button)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)#i';
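
One possible way to consume this pattern with PREG_SET_ORDER is sketched below. The helper collectFields, its $layout table and the sample form are assumptions made for illustration; the pattern is copied unchanged. The table maps each tag group's index to the indexes of its attribute groups, since every alternative owns its own range of group numbers.

```php
<?php
// The accepted pattern, unchanged.
$pattern = '#<(form)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*action="([^"]*))?)' .
           '|<(input)(?=(?:[^>]*type="([^"]*))?)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)' .
           '|<(textarea)(?=(?:[^>]*name="([^"]*))?)[^>]*>([^<]|<(?!/textarea))*' .
           '|<(select)(?=(?:[^>]*name="([^"]*))?)' .
           '|<(button)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)#i';

// Hypothetical helper: turns each match into a small associative array.
function collectFields(string $pattern, string $formHtml): array
{
    // Index of each tag group => indexes of that tag's attribute groups.
    $layout = array(
        1  => array('name' => 2, 'action' => 3),
        4  => array('type' => 5, 'name' => 6, 'value' => 7, 'checked' => 8),
        9  => array('name' => 10),
        12 => array('name' => 13),
        14 => array('name' => 15, 'value' => 16),
    );
    $fields = array();
    preg_match_all($pattern, $formHtml, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        foreach ($layout as $tagIndex => $attributes) {
            if (!isset($m[$tagIndex]) || $m[$tagIndex] === '') {
                continue; // this alternative did not match
            }
            $field = array('tag' => strtolower($m[$tagIndex]));
            foreach ($attributes as $label => $index) {
                // Trailing unmatched groups are dropped, so guard with isset.
                $field[$label] = isset($m[$index]) ? $m[$index] : '';
            }
            $fields[] = $field;
        }
    }
    return $fields;
}

// Hypothetical sample form.
$sample = '<form name="form1" method="post" action="login.aspx">'
        . '<input value="bob" name="user" type="text">'
        . '<select name="lang"></select></form>';
$fields = collectFields($pattern, $sample);
print_r($fields);
```

The textarea content group (index 11) is left out of the table on purpose: a repeated capturing group such as ([^<]|<(?!/textarea))* keeps only its last iteration, so it holds the final character of the body. Wrapping the repetition in an outer group, ((?:[^<]|<(?!/textarea))*), would capture the whole body without changing the group numbering.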


Author Closing Comment

by:intellisource
ID: 36496697
Hehehe, thanks man ;) You're the winner :D with your definite response to me missing certain PCRE elements, lol :P
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 36497248
NP. Glad it worked for you  = )

Author Comment

by:intellisource
ID: 36497389
Just one more question: if an element is within a div with style visibility: hidden or display: none, how can I detect that (hidden|none|empty string) by adding to the four element patterns' regex?

Author Comment

by:intellisource
ID: 36497404
For example, when loading the login form at the example URL, the emailTextBox element is retrieved, but nothing indicates whether it is displayed or hidden.
<div class="dialog" style="position: absolute; visibility: hidden; z-index: 70013; left: 735px; top: 212px;" id="passwordDialog"><div class="header" id="passwordDialog_HeaderSpan">

            Forgot Password
        
</div><div class="content password" id="passwordDialog_InnerSpan">

            <p>Please enter the email address you signed up with and we will send you a new login link.</p>
            <dl>
                <dt>Email:</dt>
                <dd><input type="text" class="bigtextbox" id="emailTextBox" name="emailTextBox" tabindex="0"></dd>
            </dl>
            <div class="buttons submitcancel">
                <a onclick="RequestPasswordReset();" id="submitPasswordRest" class="submit left" tabindex="0">Submit</a>
                <a onclick="passwordDialog.Close();" class="cancel right" tabindex="0">Close</a>
            </div>
        
</div><div class="footer" id="passwordDialog_FooterSpan">

        
</div></div>
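
This follow-up has no reply in the thread. A tag-level regex has no view of enclosing elements, so one possible approach, swapped in here as an alternative to extending the pattern, is to walk the ancestors with PHP's DOM extension. The sketch below assumes ext-dom is available; the helper name isHiddenInline and the second input (userName) are made up for the example. It only inspects inline style attributes, so anything hidden through a stylesheet or script would not be detected.

```php
<?php
// Hypothetical helper: true when the element or any ancestor carries an
// inline style of visibility:hidden or display:none.
function isHiddenInline(DOMElement $element): bool
{
    for ($node = $element; $node instanceof DOMElement; $node = $node->parentNode) {
        $style = strtolower($node->getAttribute('style'));
        if (preg_match('/visibility\s*:\s*hidden|display\s*:\s*none/', $style)) {
            return true;
        }
    }
    return false;
}

// Cut-down markup modelled on the snippet above, plus one visible input.
$html = '<div style="position: absolute; visibility: hidden;">'
      . '<input type="text" id="emailTextBox" name="emailTextBox"></div>'
      . '<input type="text" name="userName">';

$doc = new DOMDocument();
@$doc->loadHTML($html); // @ hides warnings about imperfect markup

$hidden = array();
foreach ($doc->getElementsByTagName('input') as $input) {
    $hidden[$input->getAttribute('name')] = isHiddenInline($input);
}
var_dump($hidden);
```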


Author Comment

by:intellisource
ID: 36500784