loading an external form with fields and attributes into an array via preg_match_all

Hi there. I have the following code, in which the function takes the HTML loaded via file_get_contents and parses it for forms. For each form's HTML, it is then supposed to retrieve: the form's name and action attributes; each input's type, name, value and (possibly) checked attributes; each textarea's name attribute and value; each select's name and value attributes (I am still working on catching the selected option); and each button's name and value attributes. The regex is adapted from a working PCRE used in another part of the site, but PHP reports that | is an unknown modifier. I need to parse the attributes no matter what order they are in, using a single preg_match_all PCRE pattern.
The function is as follows:
function getform($html) {
	/*
	- load form as:
	  array(
	   [type],
	   [name],
	   [value],
	   [checked]
	  );
	*/
	$ret = array();
	$forms = array();
	preg_match_all("/\<form [^\>]+\>.*\<\/form\>/ims",
			$html,
			$out);
	print_r($out);
	foreach ($out[0] as $form) {
		array_push($forms,$form);
	}
	foreach ($forms as $form) {
		$pattern = '/\<(form).*\sname="([^"]+)".*\saction="([^")".*\>'.
			   '|\<(input).*\stype="([^"]+)".*\sname="([^"]+)".*\svalue="([^"]+)".*\schecked="([^"]+)".*\/\>'.
			   '|\<(textarea).*\sname="([^"]+)".*\>(.*)*\<\/textarea\>/ims'.
			   '|<(select).*\sname="([^"]+)".*>'.
			   '|<(button).*\sname="([^"]+)".*\svalue="([^"]+)".*>';
		preg_match_all($pattern,
					$form,
					$out);
		print_r($out);
		/*foreach ($out[1] as $type) {
			switch ($type) {
				case "form":
					array_push($ret, $out[2][i]);
					break;
				case "input":
					
					break;
				case "textarea":
					
					break;
				case "select":
					
					break;
				case "button":
					
					break;
			}
		}*/
	}
}

Please assist me in completing the second regular expression (the $pattern assignment on line 21 of the listing). The first preg_match_all retrieves all forms on any page that is loaded via file_get_contents, and the resulting array of forms is passed to the next step (the loop that starts on line 20).
namaste - Greywacke.
intellisource (Author) Commented:
OK, I discovered why the | characters were returning an error: the regular expression was closed before the end of the string! :S
I have replaced the pattern with
		$pattern = '/\<(form).*name="([^"]+)".*action="([^")".*\>'.
			   '|\<(input).*type="([^"]+)".*name="([^"]+)".*value="([^"]+)".*checked="([^"]+)".*\/\>'.
			   '|\<(textarea).*name="([^"]+)".*\>(.*)*\<\/textarea\>'.
			   '|<(select).*name="([^"]+)".*>'.
			   '|<(button).*name="([^"]+)".*value="([^"]+)".*>/ims';

But unfortunately it still returns the matches as empty :/ What's wrong here? I know it is supposed to match the attributes in any order within a tag, but I do not see how this can be done, especially since I cannot get any results yet 0o
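
The delimiter mistake described at the top of this comment can be reproduced in miniature. The sketch below is only an illustration: the two short patterns and the sample HTML are made up, not taken from the site. It shows that once the closing delimiter appears, PHP reads everything after it as modifiers, which is where the "unknown modifier" warning comes from.

```php
<?php
// Hypothetical sample input, just enough to exercise both alternatives.
$html = '<input name="a"><select name="b">';

// Broken: "/i" closes the pattern in the middle of the concatenated string,
// so "|<(select)[^>]*>" is parsed as a list of modifiers.
$broken = '/<(input)[^>]*>/i' . '|<(select)[^>]*>';

// Fixed: a single closing delimiter, followed by the flags, at the very end.
$fixed = '/<(input)[^>]*>' . '|<(select)[^>]*>/i';

// preg_match_all() returns false when the pattern cannot be compiled;
// the @ only hides the warning for this demonstration.
$brokenResult = @preg_match_all($broken, $html, $brokenMatches);
$fixedResult  = preg_match_all($fixed, $html, $fixedMatches);

var_dump($brokenResult); // bool(false)
var_dump($fixedResult);  // int(2)
```

Checking the return value for false (or calling preg_last_error_msg() on PHP 8) is a quick way to tell "the pattern is invalid" apart from "the pattern is valid but matched nothing".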
käµfm³d 👽 Commented:
Let's make a few corrections to your pattern to make it more sound:

$pattern = '#<(?<form_tag>form)(?=[^>]*name="(?<form_name>[^"]+))(?=[^>]*action="(?<form_action>[^"]*))' .
           '|<(?<input_tag>input)(?=[^>]*type="(?<input_type>[^"]+))(?=[^>]*name="(?<input_name>[^"]+))(?=[^>]*value="(?<input_value>[^"]+))(?=[^>]*checked="(?<input_checked>[^"]+))' .
           '|<(?<textarea_tag>textarea)(?=[^>]*name="(?<textarea_name>[^"]+))[^>]*>(?<textarea_content>[^<]' .
           '|<(?!/textarea))*|<(?<select>select)(?=[^>]*name="(?<select_name>[^"]+))' .
           '|<(?<button_tag>button)(?=[^>]*name="(?<button_name>[^"]+))(?=[^>]*value="(?<button_value>[^"]+))#i';

Please note:  I've added named groups for capturing your data. The group names are the identifiers inside of angle brackets ( < ... >) and containing underscores (e.g. "form_tag"). You can index your match array using those values instead of indexes (e.g. $match['form_tag']).
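
As a small illustration of the named-group indexing described above, here is a sketch that uses a cut-down, form-only pattern and a made-up form tag (both are assumptions for the example, not the full pattern from this comment):

```php
<?php
// Hypothetical sample tag.
$html = '<form name="form1" method="post" action="login.aspx">';

// Cut-down pattern: only the form tag and its name, with named groups.
$pattern = '#<(?<form_tag>form)[^>]*name="(?<form_name>[^"]+)"#i';

preg_match_all($pattern, $html, $match);

// Every named group is stored twice: once under its name, once under its number.
echo $match['form_tag'][0], "\n";  // form
echo $match['form_name'][0], "\n"; // form1
echo $match[2][0], "\n";           // form1 (same data as 'form_name')
```

The duplicated numeric and named keys are why print_r output of a named-group match looks twice as long as expected.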

As for what I changed, instead of relying on the dot-star, which can be tricky for HTML parsing, I've changed instead to a zero-or-more search for anything not a closing angle bracket ( > ). This means your patterns won't leave the safety of each tag, provided your HTML is properly structured! I've also moved to a positive lookahead approach [ (?= ... ) ] for finding your attribute values. The benefit of this approach is that your attributes can be in any order within the pattern. Currently, your attributes will only be found if they occur in the order you have them defined in your pattern. This is entirely fine if you are guaranteed that the values will always occur in the same order. My approach will give you a bit more flexibility.

I've only mildly tested the above pattern, but I believe it to be sound. If it is not working for you, and you care to provide an example of your data, I can do further verification.
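
To make the order-independence point concrete, the following sketch compares an order-dependent pattern with a lookahead-based one on a form tag whose attributes are deliberately swapped. Both patterns are reduced to the form tag only, and the sample tag is invented for the example.

```php
<?php
// Order-dependent: name must appear before action inside the tag.
$ordered = '#<(form)[^>]*name="([^"]*)"[^>]*action="([^"]*)"#i';

// Order-independent: each attribute is located by its own lookahead,
// and every lookahead starts scanning from the same position after "<form".
$lookahead = '#<(form)(?=[^>]*name="([^"]*))(?=[^>]*action="([^"]*))#i';

// Hypothetical tag with action before name.
$swapped = '<form action="login.aspx" method="post" name="form1">';

$orderedHit   = preg_match($ordered, $swapped, $m1);
$lookaheadHit = preg_match($lookahead, $swapped, $m2);

var_dump($orderedHit);   // int(0)
var_dump($lookaheadHit); // int(1)
print_r($m2);            // [0] => <form, [1] => form, [2] => form1, [3] => login.aspx
```

Because lookaheads consume nothing, the overall match is just `<form`; the attribute values are only available through the capture groups.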
käµfm³d 👽 Commented:
This "error" won't affect the operation of the pattern, but to be consistent with your readability convention I need to post a correction (I broke the pattern in the wrong place):

$pattern = '#<(?<form_tag>form)(?=[^>]*name="(?<form_name>[^"]+))(?=[^>]*action="(?<form_action>[^"]*))' .
           '|<(?<input_tag>input)(?=[^>]*type="(?<input_type>[^"]+))(?=[^>]*name="(?<input_name>[^"]+))(?=[^>]*value="(?<input_value>[^"]+))(?=[^>]*checked="(?<input_checked>[^"]+))' .
           '|<(?<textarea_tag>textarea)(?=[^>]*name="(?<textarea_name>[^"]+))[^>]*>(?<textarea_content>[^<]|<(?!/textarea))*' .
           '|<(?<select_tag>select)(?=[^>]*name="(?<select_name>[^"]+))' .
           '|<(?<button_tag>button)(?=[^>]*name="(?<button_name>[^"]+))(?=[^>]*value="(?<button_value>[^"]+))#i';

It seems I also created the "select" tag's group name inconsistently with the others--I forgot the "_tag" part of the name. This is also corrected above.
intellisource (Author) Commented:
OK, this seems to work, but I have decided not to go with returning named subpatterns. They tend to get rather confusing, especially with PREG_PATTERN_ORDER.
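
For readers comparing the two result layouts, here is a minimal sketch of how the same matches are arranged under PREG_PATTERN_ORDER and PREG_SET_ORDER. The two-input sample and the short pattern are assumptions made up for the illustration.

```php
<?php
// Hypothetical sample: two inputs, attributes in a fixed order.
$html = '<input type="text" name="user"><input type="password" name="pass">';
$pattern = '#<(input)[^>]*type="([^"]*)"[^>]*name="([^"]*)"#i';

// PREG_PATTERN_ORDER (the default): one sub-array per capture group.
preg_match_all($pattern, $html, $byGroup, PREG_PATTERN_ORDER);

// PREG_SET_ORDER: one sub-array per match, holding all of its groups.
preg_match_all($pattern, $html, $byMatch, PREG_SET_ORDER);

print_r($byGroup[3]); // [0] => user, [1] => pass   (every "name" capture)
print_r($byMatch[1]); // [0] => <input type="password" name="pass", [1] => input, [2] => password, [3] => pass
```

With PREG_SET_ORDER each element describes one tag, which tends to be easier to loop over when every match should become one row in a result array.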
intellisource (Author) Commented:
But it only gets the form details; the input details are not correct. They need to work in any situation.
Array
(
    [0] => Array
        (
            [0] => <form
        )

    [form_tag] => Array
        (
            [0] => form
        )

    [1] => Array
        (
            [0] => form
        )

    [form_name] => Array
        (
            [0] => form1
        )

    [2] => Array
        (
            [0] => form1
        )

    [form_action] => Array
        (
            [0] => login.aspx
        )

    [3] => Array
        (
            [0] => login.aspx
        )

    [input_tag] => Array
        (
            [0] => 
        )

    [4] => Array
        (
            [0] => 
        )

    [input_type] => Array
        (
            [0] => 
        )

    [5] => Array
        (
            [0] => 
        )

    [input_name] => Array
        (
            [0] => 
        )

    [6] => Array
        (
            [0] => 
        )

    [input_value] => Array
        (
            [0] => 
        )

    [7] => Array
        (
            [0] => 
        )

    [input_checked] => Array
        (
            [0] => 
        )

    [8] => Array
        (
            [0] => 
        )

    [textarea_tag] => Array
        (
            [0] => 
        )

    [9] => Array
        (
            [0] => 
        )

    [textarea_name] => Array
        (
            [0] => 
        )

    [10] => Array
        (
            [0] => 
        )

    [textarea_content] => Array
        (
            [0] => 
        )

    [11] => Array
        (
            [0] => 
        )

    [select_tag] => Array
        (
            [0] => 
        )

    [12] => Array
        (
            [0] => 
        )

    [select_name] => Array
        (
            [0] => 
        )

    [13] => Array
        (
            [0] => 
        )

    [button_tag] => Array
        (
            [0] => 
        )

    [14] => Array
        (
            [0] => 
        )

    [button_name] => Array
        (
            [0] => 
        )

    [15] => Array
        (
            [0] => 
        )

    [button_value] => Array
        (
            [0] => 
        )

    [16] => Array
        (
            [0] => 
        )

)

An example form that is loaded is available at https://lmx.leads360.com/web/login.aspx
käµfm³d 👽 Commented:
OK, try this correction:

$pattern = '#<(form)(?=[^>]*name="([^"]*))(?=[^>]*action="([^"]*))' .
           '|<(input)(?=(?:[^>]*type="([^"]*))?)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)' .
           '|<(textarea)(?=(?:[^>]*name="([^"]*))?)[^>]*>([^<]|<(?!/textarea))*' .
           '|<(select)(?=(?:[^>]*name="([^"]*))?)' .
           '|<(button)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)#i';

The problem was with the "checked" property, which not every INPUT has. I made all the attributes optional so the aforementioned snag should be averted. I also removed the named groups per your previous comment.
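
The effect of making a lookahead optional can be seen on a single tag. In this sketch both patterns are cut down to two attributes, and the sample tag is modelled on the emailTextBox input from the example page; treat it as an illustration under those assumptions rather than the full pattern.

```php
<?php
// Hypothetical input without a "checked" attribute.
$tag = '<input type="text" name="emailTextBox">';

// Required lookaheads: the whole alternative fails if any attribute is missing.
$required = '#<(input)(?=[^>]*name="([^"]*))(?=[^>]*checked="([^"]*))#i';

// Optional lookaheads: the inner group is wrapped in (?: ... )? so each
// lookahead always succeeds, capturing the value only when it is present.
$optional = '#<(input)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)#i';

$requiredHit = preg_match($required, $tag, $m1);
$optionalHit = preg_match($optional, $tag, $m2);

var_dump($requiredHit); // int(0)
var_dump($optionalHit); // int(1)
echo $m2[2], "\n";      // emailTextBox
// The "checked" group did not take part in the match, so $m2[3] is absent;
// read it defensively.
$checked = $m2[3] ?? '';
```

Note that preg_match drops trailing groups that did not participate, hence the `?? ''` when reading the last capture.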
intellisource (Author) Commented:
I believe the following, used with PREG_SET_ORDER, would be closer to my requirements:
		$pattern = '#<(form)[^>]*name="([^"]+)[^>]*action="([^"]+)'.
			    '|<(input)[^>]*type="([^"]+)[^>]*name="([^"]+)[^>]*value="([^"]+)[^>]*checked="([^"]+)'.
			    '|<(textarea)[^>]*name="([^"]+)[^>]*>(^<)/textarea>'.
			    '|<(select)[^>]*name="([^"]+)'.
			    '|<(option)[^>]*value="([^"]+)[^>]*selected="([^"]+)'.
			    '|<(button)[^>]*name="([^"]+)"[^>]*value="([^"]+)"#ims';

which returns
Array
(
    [0] => Array
        (
            [0] => <form name="form1" method="post" action="login.aspx
            [1] => form
            [2] => form1
            [3] => login.aspx
        )

)

which is ideal for the parsing, but it does not match any of the form elements on https://lmx.leads360.com/web/login.aspx :/ I need to return multiple matches for any of the form elements in ANY form loaded from the web, regardless of whitespace inside or outside the matched tags. What am I missing in my PCRE pattern? 0o
käµfm³d 👽 Commented:
I told a little white lie: I didn't make the attributes on the FORM tag optional... they are now:
$pattern = '#<(form)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*action="([^"]*))?)' .
           '|<(input)(?=(?:[^>]*type="([^"]*))?)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)' .
           '|<(textarea)(?=(?:[^>]*name="([^"]*))?)[^>]*>([^<]|<(?!/textarea))*' .
           '|<(select)(?=(?:[^>]*name="([^"]*))?)' .
           '|<(button)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)#i';
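
One possible way to turn the matches from this pattern into the array layout sketched in the original question is shown below. It is a sketch under stated assumptions, not the code the author ended up with: `collectFields` and `$layout` are hypothetical names, the sample form is invented, and the textarea branch differs from the pattern above in one place. The textarea body is wrapped in an extra group, because a repeated capture group such as `([^<]|<(?!/textarea))*` keeps only its last iteration (a single character).

```php
<?php
// Pattern from the thread, with the textarea body wrapped in one capturing
// group (assumed change) so that group 11 holds the full text.
$pattern = '#<(form)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*action="([^"]*))?)' .
           '|<(input)(?=(?:[^>]*type="([^"]*))?)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)' .
           '|<(textarea)(?=(?:[^>]*name="([^"]*))?)[^>]*>((?:[^<]|<(?!/textarea))*)' .
           '|<(select)(?=(?:[^>]*name="([^"]*))?)' .
           '|<(button)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)#i';

// Group number of each tag capture => names of the groups that follow it.
$layout = [
    1  => ['name', 'action'],
    4  => ['type', 'name', 'value', 'checked'],
    9  => ['name', 'value'],
    12 => ['name'],
    14 => ['name', 'value'],
];

// Hypothetical helper: one associative row per matched tag.
function collectFields(string $formHtml, string $pattern, array $layout): array
{
    $ret = [];
    preg_match_all($pattern, $formHtml, $sets, PREG_SET_ORDER);
    foreach ($sets as $set) {
        foreach ($layout as $start => $fields) {
            if (($set[$start] ?? '') === '') {
                continue; // this alternative did not produce the match
            }
            $row = ['tag' => strtolower($set[$start])];
            foreach ($fields as $i => $field) {
                // Trailing groups that did not participate are absent.
                $row[$field] = $set[$start + 1 + $i] ?? '';
            }
            $ret[] = $row;
            break;
        }
    }
    return $ret;
}

// Invented sample form for the demonstration.
$html = '<form name="form1" method="post" action="login.aspx">' .
        '<input type="text" name="user" value="">' .
        '<textarea name="note">hi there</textarea>' .
        '<button value="go" name="send">Send</button></form>';

$fields = collectFields($html, $pattern, $layout);
print_r($fields);
```

Keeping the group offsets in a single table avoids a long switch statement and makes it harder for the numbers to drift out of step if another alternative (for example option) is added later.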

intellisource (Author) Commented:
Hehehe, thanks man ;) You're the winner :D with your definite response to me missing certain PCRE elements, lol :P
käµfm³d 👽 Commented:
NP. Glad it worked for you  = )
intellisource (Author) Commented:
Just one more question: if an element is within a div styled with visibility: hidden or display: none, how can I detect that (hidden|none|empty string) by adding to the regex for the four element patterns?
intellisource (Author) Commented:
For example, when loading the login form from the example URL, the emailTextBox element is retrieved, but nothing indicates whether it is displayed or hidden.
<div class="dialog" style="position: absolute; visibility: hidden; z-index: 70013; left: 735px; top: 212px;" id="passwordDialog"><div class="header" id="passwordDialog_HeaderSpan">

            Forgot Password
        
</div><div class="content password" id="passwordDialog_InnerSpan">

            <p>Please enter the email address you signed up with and we will send you a new login link.</p>
            <dl>
                <dt>Email:</dt>
                <dd><input type="text" class="bigtextbox" id="emailTextBox" name="emailTextBox" tabindex="0"></dd>
            </dl>
            <div class="buttons submitcancel">
                <a onclick="RequestPasswordReset();" id="submitPasswordRest" class="submit left" tabindex="0">Submit</a>
                <a onclick="passwordDialog.Close();" class="cancel right" tabindex="0">Close</a>
            </div>
        
</div><div class="footer" id="passwordDialog_FooterSpan">

        
</div></div>
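
This last question has no reply in the thread. A regex that looks only at the input tag cannot see the enclosing div, so the sketch below swaps in a different technique: PHP's DOM extension, walking up from each input and inspecting inline style attributes. It is an illustration under assumptions: `hiddenByAncestor` is a hypothetical helper, the HTML is a shortened version of the snippet above plus an invented visible input, and only inline styles are checked (rules from stylesheets, classes or scripts are not detected).

```php
<?php
// Shortened sample: one input inside a hidden dialog, one visible input.
$html = '<div class="dialog" style="position: absolute; visibility: hidden;" id="passwordDialog">' .
        '<dl><dd><input type="text" id="emailTextBox" name="emailTextBox"></dd></dl></div>' .
        '<div><input type="text" name="userName"></div>';

// Hypothetical helper: returns 'hidden', 'none', or '' for an element,
// based on the inline style of the element itself and its ancestors.
function hiddenByAncestor(DOMElement $el): string
{
    for ($node = $el; $node instanceof DOMElement; $node = $node->parentNode) {
        $style = $node->getAttribute('style');
        if (preg_match('/\bvisibility\s*:\s*(hidden)\b|\bdisplay\s*:\s*(none)\b/i', $style, $m)) {
            // Group 1 is empty when the display branch matched.
            return strtolower($m[1] !== '' ? $m[1] : $m[2]);
        }
    }
    return '';
}

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep parser warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

$result = [];
foreach ($doc->getElementsByTagName('input') as $input) {
    $result[$input->getAttribute('name')] = hiddenByAncestor($input);
}

print_r($result); // [emailTextBox] => hidden, [userName] => (empty string)
```

The same walk could be applied to textarea, select and button elements by repeating the getElementsByTagName loop for each tag name.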
