Solved

loading an external form with fields and attributes into an array via preg_match_all

Posted on 2011-09-07
Last Modified: 2012-05-12
Hi there. I have the following code, in which the function takes the HTML loaded via file_get_contents and parses it for forms. For each form's HTML, it is then supposed to retrieve the form name and action attributes; input types, names, values and possibly checked attributes; textarea name attributes and values; select name and value attributes (still working on catching the selected option); and button name and value attributes. But with this regex, adapted from a working PCRE pattern used in another part of the site, PHP says that | is an unknown modifier. I need to parse the attributes no matter what order they are in, but in a single preg_match_all PCRE pattern.
The function is as follows:
function getform($html) {
	/*
	- load form as:
	  array(
	   [type],
	   [name],
	   [value],
	   [checked]
	  );
	*/
	$ret = array();
	$forms = array();
	preg_match_all("/\<form [^\>]+\>.*\<\/form\>/ims",
			$html,
			$out);
	print_r($out);
	foreach ($out[0] as $form) {
		array_push($forms,$form);
	}
	foreach ($forms as $form) {
		$pattern = '/\<(form).*\sname="([^"]+)".*\saction="([^")".*\>'.
			   '|\<(input).*\stype="([^"]+)".*\sname="([^"]+)".*\svalue="([^"]+)".*\schecked="([^"]+)".*\/\>'.
			   '|\<(textarea).*\sname="([^"]+)".*\>(.*)*\<\/textarea\>/ims'.
			   '|<(select).*\sname="([^"]+)".*>'.
			   '|<(button).*\sname="([^"]+)".*\svalue="([^"]+)".*>';
		preg_match_all($pattern,
					$form,
					$out);
		print_r($out);
		/*foreach ($out[1] as $type) {
			switch ($type) {
				case "form":
					array_push($ret, $out[2][i]);
					break;
				case "input":
					
					break;
				case "textarea":
					
					break;
				case "select":
					
					break;
				case "button":
					
					break;
			}
		}*/
	}
}


Please assist me in completing the second regular expression, on line 21 of the function (the $pattern assignment). The first preg_match_all retrieves all forms on any page that is loaded via file_get_contents and passes an array of forms to the next step, the loop on line 20.
namaste - Greywacke.
Question by:intellisource
 

Author Comment

by:intellisource
ID: 36495147
OK, I discovered why the | was returning an error: the regular expression was closed before the end of the string (the closing delimiter and the ims modifiers sat at the end of the textarea line instead of the last line)! :S
I have replaced the pattern with:
		$pattern = '/\<(form).*name="([^"]+)".*action="([^")".*\>'.
			   '|\<(input).*type="([^"]+)".*name="([^"]+)".*value="([^"]+)".*checked="([^"]+)".*\/\>'.
			   '|\<(textarea).*name="([^"]+)".*\>(.*)*\<\/textarea\>'.
			   '|<(select).*name="([^"]+)".*>'.
			   '|<(button).*name="([^"]+)".*value="([^"]+)".*>/ims';


But unfortunately it still returns empty matches :/ What's wrong here? I know it is supposed to match the attributes in any order within a tag, but I do not see how this can be done, especially if I cannot get any results yet 0o
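The delimiter problem described above can be reproduced in isolation. The following is a minimal sketch with made-up sample markup (not the page from the question): when a pattern is built by concatenating strings, the closing delimiter and its modifiers may appear only once, after the last alternative. If they appear earlier, PHP treats whatever follows the delimiter as modifiers, which is where the "unknown modifier" warning comes from.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<input type="text" name="q"><select name="lang">';

// Broken: the pattern is closed with /i on the first piece, so the "|"
// that follows is read as a modifier. The @ hides the resulting warning.
$broken = @preg_match_all('/<(input)[^>]*>/i' . '|<(select)[^>]*>', $html, $ignored);
var_dump($broken); // false, because the pattern failed to compile

// Fixed: delimiter and modifiers come once, at the very end.
$pattern = '/<(input)[^>]*>'
         . '|<(select)[^>]*>/i';

$count = preg_match_all($pattern, $html, $out, PREG_SET_ORDER);
echo $count, "\n";     // number of tags matched
echo $out[0][1], "\n"; // tag name captured by the first alternative
echo $out[1][2], "\n"; // tag name captured by the second alternative
```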
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 36495345
Let's make a few corrections to your pattern to make it more sound:

$pattern = '#<(?<form_tag>form)(?=[^>]*name="(?<form_name>[^"]+))(?=[^>]*action="(?<form_action>[^"]*))' .
           '|<(?<input_tag>input)(?=[^>]*type="(?<input_type>[^"]+))(?=[^>]*name="(?<input_name>[^"]+))(?=[^>]*value="(?<input_value>[^"]+))(?=[^>]*checked="(?<input_checked>[^"]+))' .
           '|<(?<textarea_tag>textarea)(?=[^>]*name="(?<textarea_name>[^"]+))[^>]*>(?<textarea_content>[^<]' .
           '|<(?!/textarea))*|<(?<select>select)(?=[^>]*name="(?<select_name>[^"]+))' .
           '|<(?<button_tag>button)(?=[^>]*name="(?<button_name>[^"]+))(?=[^>]*value="(?<button_value>[^"]+))#i';



Please note:  I've added named groups for capturing your data. The group names are the identifiers inside of angle brackets ( < ... >) and containing underscores (e.g. "form_tag"). You can index your match array using those values instead of indexes (e.g. $match['form_tag']).
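As a small illustration of the named groups, here is a sketch that uses only the form part of the pattern and a made-up form tag. With preg_match_all, each named group shows up in the match array both under its name and under its numeric index.

```php
<?php
// Hypothetical form tag, used only for illustration.
$html = '<form name="form1" method="post" action="login.aspx">';

// Only the form alternative of the larger pattern, with two named groups.
$pattern = '#<(?<form_tag>form)(?=[^>]*name="(?<form_name>[^"]+))#i';

preg_match_all($pattern, $html, $match);

// The same captured text is reachable by name and by position.
echo $match['form_name'][0], "\n";
echo $match[2][0], "\n";
```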

As for what I changed: instead of relying on the dot-star, which can be tricky for HTML parsing, I've switched to a zero-or-more match of anything that is not a closing angle bracket ( > ). This means your patterns won't leave the safety of each tag, provided your HTML is properly structured! I've also moved to a positive lookahead approach [ (?= ... ) ] for finding your attribute values. The benefit of this approach is that your attributes can be in any order within the tag. Currently, your attributes will only be found if they occur in the order you have them defined in your pattern. That is entirely fine if you are guaranteed that the values will always occur in the same order; my approach will give you a bit more flexibility.

I've only mildly tested the above pattern, but I believe it to be sound. If it is not working for you, and you care to provide an example of your data, I can do further verification.
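To show the any-order behaviour of the lookaheads on its own, here is a sketch with two made-up input tags that carry the same attributes in a different order. Each lookahead starts scanning from the same position (right after the tag name), so the order of the attributes in the tag does not matter.

```php
<?php
// Two hypothetical tags: same attributes, different order.
$tags = array(
    '<input type="text" name="user" value="bob">',
    '<input value="bob" name="user" type="text">',
);

// One lookahead per attribute; [^>]* keeps each search inside the tag.
$pattern = '#<(input)(?=[^>]*type="([^"]*))(?=[^>]*name="([^"]*))(?=[^>]*value="([^"]*))#i';

$found = array();
foreach ($tags as $tag) {
    preg_match($pattern, $tag, $m);
    // Groups: 2 = type, 3 = name, 4 = value.
    $found[] = $m[2] . '/' . $m[3] . '/' . $m[4];
}
print_r($found);
```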
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 36495387
This "error" won't affect the operation of the pattern, but to be consistent with your readability convention I need to post a correction (I broke the pattern in the wrong place):

$pattern = '#<(?<form_tag>form)(?=[^>]*name="(?<form_name>[^"]+))(?=[^>]*action="(?<form_action>[^"]*))' .
           '|<(?<input_tag>input)(?=[^>]*type="(?<input_type>[^"]+))(?=[^>]*name="(?<input_name>[^"]+))(?=[^>]*value="(?<input_value>[^"]+))(?=[^>]*checked="(?<input_checked>[^"]+))' .
           '|<(?<textarea_tag>textarea)(?=[^>]*name="(?<textarea_name>[^"]+))[^>]*>(?<textarea_content>[^<]|<(?!/textarea))*' .
           '|<(?<select_tag>select)(?=[^>]*name="(?<select_name>[^"]+))' .
           '|<(?<button_tag>button)(?=[^>]*name="(?<button_name>[^"]+))(?=[^>]*value="(?<button_value>[^"]+))#i';



It seems I also created the "select" tag's group name inconsistently with the others--I forgot the "_tag" part of the name. This is also corrected above.

 

Author Comment

by:intellisource
ID: 36495858
OK, this seems to work, but I have decided not to go with returning named subpatterns. It tends to get rather confusing, especially with PREG_PATTERN_ORDER.
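For readers comparing the two result layouts of preg_match_all, here is a minimal sketch. The markup is made up, and the simplified pattern assumes name comes before value. PREG_PATTERN_ORDER (the default) groups results by capture group, while PREG_SET_ORDER groups them by match, which is usually easier to loop over one tag at a time.

```php
<?php
// Hypothetical markup, used only for illustration.
$html = '<input name="a" value="1"><input name="b" value="2">';

// Simplified pattern: assumes name appears before value in each tag.
$pattern = '#<input[^>]*name="([^"]*)"[^>]*value="([^"]*)"#i';

// Grouped by capture group: $byGroup[1] is every name, $byGroup[2] every value.
preg_match_all($pattern, $html, $byGroup, PREG_PATTERN_ORDER);

// Grouped by match: $byMatch[0] is everything about the first tag.
preg_match_all($pattern, $html, $byMatch, PREG_SET_ORDER);

print_r($byGroup[1]); // all names
print_r($byMatch[1]); // full text, name and value of the second tag
```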
 

Author Comment

by:intellisource
ID: 36495928
But it just gets the form details; the input details are not correct. They need to work in any situation.
Array
(
    [0] => Array
        (
            [0] => <form
        )

    [form_tag] => Array
        (
            [0] => form
        )

    [1] => Array
        (
            [0] => form
        )

    [form_name] => Array
        (
            [0] => form1
        )

    [2] => Array
        (
            [0] => form1
        )

    [form_action] => Array
        (
            [0] => login.aspx
        )

    [3] => Array
        (
            [0] => login.aspx
        )

    [input_tag] => Array
        (
            [0] => 
        )

    [4] => Array
        (
            [0] => 
        )

    [input_type] => Array
        (
            [0] => 
        )

    [5] => Array
        (
            [0] => 
        )

    [input_name] => Array
        (
            [0] => 
        )

    [6] => Array
        (
            [0] => 
        )

    [input_value] => Array
        (
            [0] => 
        )

    [7] => Array
        (
            [0] => 
        )

    [input_checked] => Array
        (
            [0] => 
        )

    [8] => Array
        (
            [0] => 
        )

    [textarea_tag] => Array
        (
            [0] => 
        )

    [9] => Array
        (
            [0] => 
        )

    [textarea_name] => Array
        (
            [0] => 
        )

    [10] => Array
        (
            [0] => 
        )

    [textarea_content] => Array
        (
            [0] => 
        )

    [11] => Array
        (
            [0] => 
        )

    [select_tag] => Array
        (
            [0] => 
        )

    [12] => Array
        (
            [0] => 
        )

    [select_name] => Array
        (
            [0] => 
        )

    [13] => Array
        (
            [0] => 
        )

    [button_tag] => Array
        (
            [0] => 
        )

    [14] => Array
        (
            [0] => 
        )

    [button_name] => Array
        (
            [0] => 
        )

    [15] => Array
        (
            [0] => 
        )

    [button_value] => Array
        (
            [0] => 
        )

    [16] => Array
        (
            [0] => 
        )

)


an example form that is loaded is available on https://lmx.leads360.com/web/login.aspx
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 36496448
OK, try this correction:

$pattern = '#<(form)(?=[^>]*name="([^"]*))(?=[^>]*action="([^"]*))' .
           '|<(input)(?=(?:[^>]*type="([^"]*))?)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)' .
           '|<(textarea)(?=(?:[^>]*name="([^"]*))?)[^>]*>([^<]|<(?!/textarea))*' .
           '|<(select)(?=(?:[^>]*name="([^"]*))?)' .
           '|<(button)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)#i';



The problem was with the "checked" property, which not every INPUT has. I made all the attributes optional so the aforementioned snag should be averted. I also removed the named groups per your previous comment.
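The effect of making a lookahead optional can be seen on a single made-up tag that has no checked attribute. This is only a sketch with two shortened patterns: a required lookahead for a missing attribute makes the whole alternative fail, while the (?=(?:...)?) form always succeeds and simply leaves its group empty.

```php
<?php
// Hypothetical tag without a checked attribute.
$tag = '<input type="text" name="user">';

// Every lookahead is required: a missing attribute rejects the tag.
$required = '#<(input)(?=[^>]*name="([^"]*))(?=[^>]*checked="([^"]*))#i';

// Every lookahead wraps an optional group: a missing attribute is tolerated.
$optional = '#<(input)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)#i';

$strictHit  = preg_match($required, $tag, $m1);
$lenientHit = preg_match($optional, $tag, $m2);

var_dump($strictHit);  // no match
var_dump($lenientHit); // match
print_r($m2);          // tag name and name attribute; no entry for checked
```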
 

Author Comment

by:intellisource
ID: 36496451
i believe closer to my requirements would be the following using PREG_SET_ORDER
		$pattern = '#<(form)[^>]*name="([^"]+)[^>]*action="([^"]+)'.
			    '|<(input)[^>]*type="([^"]+)[^>]*name="([^"]+)[^>]*value="([^"]+)[^>]*checked="([^"]+)'.
			    '|<(textarea)[^>]*name="([^"]+)[^>]*>(^<)/textarea>'.
			    '|<(select)[^>]*name="([^"]+)'.
			    '|<(option)[^>]*value="([^"]+)[^>]*selected="([^"]+)'.
			    '|<(button)[^>]*name="([^"]+)"[^>]*value="([^"]+)"#ims';


which returns
Array
(
    [0] => Array
        (
            [0] => <form name="form1" method="post" action="login.aspx
            [1] => form
            [2] => form1
            [3] => login.aspx
        )

)


which is ideal for the parsing, but it does not match any of the form elements on https://lmx.leads360.com/web/login.aspx :/ I need to return multiple matches for any of the form elements in ANY form loaded from the web, regardless of whitespace in or outside the matched tags. What am I missing in my PCRE pattern? 0o
 
LVL 75

Accepted Solution

by:
käµfm³d   👽 earned 500 total points
ID: 36496467
I told a little white lie:  I didn't make the attributes on the FORM tag optional...  they are now:
$pattern = '#<(form)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*action="([^"]*))?)' .
           '|<(input)(?=(?:[^>]*type="([^"]*))?)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)' .
           '|<(textarea)(?=(?:[^>]*name="([^"]*))?)[^>]*>([^<]|<(?!/textarea))*' .
           '|<(select)(?=(?:[^>]*name="([^"]*))?)' .
           '|<(button)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)#i';

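One way the accepted pattern might be used to fill the type/name/value/checked structure from the question is sketched below. The helper names grp and listFormElements, the sample form, and the choice to store the form's action under 'value' are all assumptions made for illustration; they are not from the thread. Note that the textarea content group sits under a * quantifier, so it keeps only the last repeated character; the sketch therefore leaves the textarea value empty.

```php
<?php
// Pattern copied from the accepted answer.
$pattern = '#<(form)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*action="([^"]*))?)' .
           '|<(input)(?=(?:[^>]*type="([^"]*))?)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)' .
           '|<(textarea)(?=(?:[^>]*name="([^"]*))?)[^>]*>([^<]|<(?!/textarea))*' .
           '|<(select)(?=(?:[^>]*name="([^"]*))?)' .
           '|<(button)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)#i';

// Hypothetical helper: group $i of a match, or '' when it did not take part.
function grp(array $m, $i) {
    return isset($m[$i]) ? $m[$i] : '';
}

// Hypothetical helper: one row per form/input/textarea/select/button tag.
function listFormElements($pattern, $form) {
    $rows = array();
    preg_match_all($pattern, $form, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        if (grp($m, 1) !== '') {         // form: name and action
            $row = array('form', grp($m, 2), grp($m, 3), '');
        } elseif (grp($m, 4) !== '') {   // input: type, name, value, checked
            $row = array(grp($m, 5), grp($m, 6), grp($m, 7), grp($m, 8));
        } elseif (grp($m, 9) !== '') {   // textarea: name only (see note above)
            $row = array('textarea', grp($m, 10), '', '');
        } elseif (grp($m, 12) !== '') {  // select: name only
            $row = array('select', grp($m, 13), '', '');
        } else {                         // button: name and value
            $row = array('button', grp($m, 15), grp($m, 16), '');
        }
        $rows[] = array_combine(array('type', 'name', 'value', 'checked'), $row);
    }
    return $rows;
}

// Hypothetical form markup, with attributes deliberately out of order.
$form = '<form name="form1" method="post" action="login.aspx">'
      . '<input value="" name="user" type="text">'
      . '<input type="checkbox" name="remember" checked="checked">'
      . '<textarea name="note">hi</textarea>'
      . '<button name="go" value="1">Go</button></form>';

$rows = listFormElements($pattern, $form);
print_r($rows);
```

Looping over PREG_SET_ORDER results keeps all groups of one tag together, so the tag-name group that is non-empty tells which alternative matched.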

 

Author Closing Comment

by:intellisource
ID: 36496697
hehehe thanks man ;) ur the winner :D with ur definite response to me missing certain PCRE elements lol :P
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 36497248
NP. Glad it worked for you  = )
 

Author Comment

by:intellisource
ID: 36497389
Just one more question: if an element is within a div with style visibility: hidden or display: none, how can I detect that (hidden|none|empty string) by adding to the regex of the four element patterns?
 

Author Comment

by:intellisource
ID: 36497404
For example, when loading the login form at the example URL, the emailTextBox element is retrieved, but nothing indicates whether it is displayed or hidden.
<div class="dialog" style="position: absolute; visibility: hidden; z-index: 70013; left: 735px; top: 212px;" id="passwordDialog"><div class="header" id="passwordDialog_HeaderSpan">

            Forgot Password
        
</div><div class="content password" id="passwordDialog_InnerSpan">

            <p>Please enter the email address you signed up with and we will send you a new login link.</p>
            <dl>
                <dt>Email:</dt>
                <dd><input type="text" class="bigtextbox" id="emailTextBox" name="emailTextBox" tabindex="0"></dd>
            </dl>
            <div class="buttons submitcancel">
                <a onclick="RequestPasswordReset();" id="submitPasswordRest" class="submit left" tabindex="0">Submit</a>
                <a onclick="passwordDialog.Close();" class="cancel right" tabindex="0">Close</a>
            </div>
        
</div><div class="footer" id="passwordDialog_FooterSpan">

        
</div></div>

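This follow-up was not answered in the thread. Because the hiding style sits on an ancestor div rather than on the input tag itself, a per-tag regex has nothing to look at. One possible approach, sketched here with PHP's DOM extension instead of a regex (a swapped-in technique, not the thread's method), is to walk up from each input and check inline style attributes. The helper name isInlineHidden and the trimmed sample markup are assumptions for illustration; styles coming from CSS classes or scripts would not be detected.

```php
<?php
// Hypothetical helper: true when the node or any ancestor element has an
// inline style containing visibility:hidden or display:none.
function isInlineHidden(DOMNode $node) {
    for ($n = $node; $n instanceof DOMElement; $n = $n->parentNode) {
        $style = $n->getAttribute('style'); // '' when the attribute is absent
        if (preg_match('/visibility\s*:\s*hidden|display\s*:\s*none/i', $style)) {
            return true;
        }
    }
    return false;
}

// Trimmed-down version of the markup quoted above, plus one visible input.
$html = '<div class="dialog" style="position: absolute; visibility: hidden;" id="passwordDialog">'
      . '<dl><dd><input type="text" id="emailTextBox" name="emailTextBox"></dd></dl></div>'
      . '<input type="text" name="user">';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real pages are rarely valid HTML
$doc->loadHTML($html);
libxml_clear_errors();

$hidden = array();
foreach ($doc->getElementsByTagName('input') as $input) {
    $hidden[$input->getAttribute('name')] = isInlineHidden($input);
}
print_r($hidden);
```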

 

Author Comment

by:intellisource
ID: 36500784