Solved

loading an external form with fields and attributes into an array via preg_match_all

Posted on 2011-09-07
Last Modified: 2012-05-12
Hi there. I have the following code, in which the function takes HTML loaded via file_get_contents and parses it for forms. For each form's HTML, it is then supposed to retrieve the form's name and action attributes; each input's type, name, value and (possibly) checked attributes; each textarea's name attribute and content; each select's name and value (I am still working on catching the selected option); and each button's name and value attributes. The regex is adapted from a working PCRE pattern used in another part of the site, but here PHP reports that | is an unknown modifier. I need to parse the attributes regardless of the order they appear in, using a single preg_match_all PCRE pattern.
The function is as follows:
function getform($html) {
	/*
	- load form as:
	  array(
	   [type],
	   [name],
	   [value],
	   [checked]
	  );
	*/
	$ret = array();
	$forms = array();
	preg_match_all("/\<form [^\>]+\>.*\<\/form\>/ims",
			$html,
			$out);
	print_r($out);
	foreach ($out[0] as $form) {
		array_push($forms,$form);
	}
	foreach ($forms as $form) {
		$pattern = '/\<(form).*\sname="([^"]+)".*\saction="([^")".*\>'.
			   '|\<(input).*\stype="([^"]+)".*\sname="([^"]+)".*\svalue="([^"]+)".*\schecked="([^"]+)".*\/\>'.
			   '|\<(textarea).*\sname="([^"]+)".*\>(.*)*\<\/textarea\>/ims'.
			   '|<(select).*\sname="([^"]+)".*>'.
			   '|<(button).*\sname="([^"]+)".*\svalue="([^"]+)".*>';
		preg_match_all($pattern,
					$form,
					$out);
		print_r($out);
		/*foreach ($out[1] as $type) {
			switch ($type) {
				case "form":
					array_push($ret, $out[2][i]);
					break;
				case "input":
					
					break;
				case "textarea":
					
					break;
				case "select":
					
					break;
				case "button":
					
					break;
			}
		}*/
	}
}


Please help me complete the second regular expression on line 21. The first preg_match_all retrieves every form on a page loaded via file_get_contents and passes the resulting array of forms to the loop that starts on line 20.
namaste - Greywacke.
Question by:intellisource
 

Author Comment

by:intellisource
OK, I discovered why the | was raising an error: the regular expression was closed (closing delimiter and flags) before the end of the string! :S
I have replaced the pattern with:
		$pattern = '/\<(form).*name="([^"]+)".*action="([^")".*\>'.
			   '|\<(input).*type="([^"]+)".*name="([^"]+)".*value="([^"]+)".*checked="([^"]+)".*\/\>'.
			   '|\<(textarea).*name="([^"]+)".*\>(.*)*\<\/textarea\>'.
			   '|<(select).*name="([^"]+)".*>'.
			   '|<(button).*name="([^"]+)".*value="([^"]+)".*>/ims';


Unfortunately it still returns empty matches. :/ What is wrong here? I know it is supposed to match the attributes in any order within a tag, but I do not see how that can be done, especially since I cannot get any results yet.
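
The delimiter mistake described above can be reproduced in isolation. The following is a minimal sketch using a shortened, hypothetical two-tag pattern (not the full pattern from the question); the variable names and sample HTML are assumptions made for illustration. It shows that placing `/ims` in the middle of the concatenated string makes PHP read everything after it as modifiers, while moving it to the very end lets the alternation compile.

```php
<?php
// Flags placed mid-pattern: the second "/" closes the regex early, so PHP
// treats everything after it ("ims|<(select)...") as modifiers and rejects "|".
$broken = '/<(textarea)[^>]*>/ims' . '|<(select)[^>]*>';

// The same alternation with the closing delimiter and flags at the very end.
$fixed = '/<(textarea)[^>]*>' . '|<(select)[^>]*>/ims';

// Hypothetical sample markup for the demonstration.
$html = '<select name="a"></select><textarea name="b"></textarea>';

// "@" only silences the "Unknown modifier" warning for this demonstration.
$brokenResult = @preg_match_all($broken, $html, $m1);
$fixedResult  = preg_match_all($fixed, $html, $m2);

var_dump($brokenResult); // false, because the pattern did not compile
var_dump($fixedResult);  // the number of opening tags that matched
```

Checking the return value against `false` is a cheap way to tell a pattern that failed to compile apart from a pattern that compiled but found nothing.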

Expert Comment

by:käµfm³d 👽
Let's make a few corrections to your pattern to make it more sound:

$pattern = '#<(?<form_tag>form)(?=[^>]*name="(?<form_name>[^"]+))(?=[^>]*action="(?<form_action>[^"]*))' .
           '|<(?<input_tag>input)(?=[^>]*type="(?<input_type>[^"]+))(?=[^>]*name="(?<input_name>[^"]+))(?=[^>]*value="(?<input_value>[^"]+))(?=[^>]*checked="(?<input_checked>[^"]+))' .
           '|<(?<textarea_tag>textarea)(?=[^>]*name="(?<textarea_name>[^"]+))[^>]*>(?<textarea_content>[^<]' .
           '|<(?!/textarea))*|<(?<select>select)(?=[^>]*name="(?<select_name>[^"]+))' .
           '|<(?<button_tag>button)(?=[^>]*name="(?<button_name>[^"]+))(?=[^>]*value="(?<button_value>[^"]+))#i';


Please note:  I've added named groups for capturing your data. The group names are the identifiers inside of angle brackets ( < ... >) and containing underscores (e.g. "form_tag"). You can index your match array using those values instead of indexes (e.g. $match['form_tag']).

As for what I changed: instead of relying on the dot-star, which can be tricky for HTML parsing, I've switched to a zero-or-more search for anything that is not a closing angle bracket ( > ). This means your patterns won't leave the safety of each tag, provided your HTML is properly structured! I've also moved to a positive lookahead approach [ (?= ... ) ] for finding your attribute values. A lookahead does not consume any input, so every attribute search starts from the same position just after the tag name, and the attributes can therefore appear in any order within the tag. With your current pattern, the attributes will only be found if they occur in the order in which you defined them. That is entirely fine if you are guaranteed that the values will always occur in that order; my approach gives you a bit more flexibility.

I've only mildly tested the above pattern, but I believe it to be sound. If it is not working for you, and you care to provide an example of your data, I can do further verification.
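
To illustrate the order-independence of the lookahead approach, here is a minimal sketch reduced to a single tag. The pattern is a simplified, hypothetical cut-down of the one above (only `type` and `name`, with a `\b` added before each attribute name), and the sample tags are made up for the demonstration.

```php
<?php
// One lookahead per attribute. Each lookahead starts scanning right after
// "<input", so the attributes may appear in any order inside the tag.
$pattern = '#<input'
         . '(?=[^>]*\btype="(?<type>[^"]*)")'
         . '(?=[^>]*\bname="(?<name>[^"]*)")#i';

// The same two attributes, written in both possible orders.
$samples = [
    '<input type="text" name="user">',
    '<input name="user" type="text">',
];

$results = [];
foreach ($samples as $tag) {
    if (preg_match($pattern, $tag, $m)) {
        $results[] = $m['name'] . ' / ' . $m['type'];
    }
}

print_r($results); // both entries should read "user / text"
```

Because both lookaheads here are mandatory, a tag missing either attribute would not match at all.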

Expert Comment

by:käµfm³d 👽
This "error" won't affect the operation of the pattern, but to be consistent with your readability convention I need to post a correction (I broke the pattern in the wrong place):

$pattern = '#<(?<form_tag>form)(?=[^>]*name="(?<form_name>[^"]+))(?=[^>]*action="(?<form_action>[^"]*))' .
           '|<(?<input_tag>input)(?=[^>]*type="(?<input_type>[^"]+))(?=[^>]*name="(?<input_name>[^"]+))(?=[^>]*value="(?<input_value>[^"]+))(?=[^>]*checked="(?<input_checked>[^"]+))' .
           '|<(?<textarea_tag>textarea)(?=[^>]*name="(?<textarea_name>[^"]+))[^>]*>(?<textarea_content>[^<]|<(?!/textarea))*' .
           '|<(?<select_tag>select)(?=[^>]*name="(?<select_name>[^"]+))' .
           '|<(?<button_tag>button)(?=[^>]*name="(?<button_name>[^"]+))(?=[^>]*value="(?<button_value>[^"]+))#i';


It seems I also created the "select" tag's group name inconsistently with the others--I forgot the "_tag" part of the name. This is also corrected above.

Author Comment

by:intellisource
OK, this seems to work, but I have decided not to use named subpatterns. They tend to get rather confusing, especially with PREG_PATTERN_ORDER.
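
For readers comparing the two result layouts, here is a small sketch of how `preg_match_all` arranges its output under `PREG_PATTERN_ORDER` and under `PREG_SET_ORDER`. The pattern and the sample HTML are simplified assumptions, not the ones used in this thread.

```php
<?php
// Hypothetical sample markup with two inputs.
$html = '<input type="text" name="user"><input type="password" name="pass">';

$pattern = '#<input'
         . '(?=[^>]*\btype="(?<type>[^"]*)")'
         . '(?=[^>]*\bname="(?<name>[^"]*)")#i';

// PREG_PATTERN_ORDER (the default): one array per group, indexed by match.
preg_match_all($pattern, $html, $byGroup, PREG_PATTERN_ORDER);
echo $byGroup['name'][1], PHP_EOL; // name attribute of the second input

// PREG_SET_ORDER: one array per match, holding that match's own groups.
preg_match_all($pattern, $html, $byMatch, PREG_SET_ORDER);
foreach ($byMatch as $match) {
    echo $match['type'], ' => ', $match['name'], PHP_EOL;
}
```

With `PREG_SET_ORDER`, each tag's values arrive together in one array, which tends to be easier to loop over whether the groups are named or numbered.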

Author Comment

by:intellisource
But it only gets the form details; the input details are not captured. They need to work in any situation.
Array
(
    [0] => Array
        (
            [0] => <form
        )

    [form_tag] => Array
        (
            [0] => form
        )

    [1] => Array
        (
            [0] => form
        )

    [form_name] => Array
        (
            [0] => form1
        )

    [2] => Array
        (
            [0] => form1
        )

    [form_action] => Array
        (
            [0] => login.aspx
        )

    [3] => Array
        (
            [0] => login.aspx
        )

    [input_tag] => Array
        (
            [0] => 
        )

    [4] => Array
        (
            [0] => 
        )

    [input_type] => Array
        (
            [0] => 
        )

    [5] => Array
        (
            [0] => 
        )

    [input_name] => Array
        (
            [0] => 
        )

    [6] => Array
        (
            [0] => 
        )

    [input_value] => Array
        (
            [0] => 
        )

    [7] => Array
        (
            [0] => 
        )

    [input_checked] => Array
        (
            [0] => 
        )

    [8] => Array
        (
            [0] => 
        )

    [textarea_tag] => Array
        (
            [0] => 
        )

    [9] => Array
        (
            [0] => 
        )

    [textarea_name] => Array
        (
            [0] => 
        )

    [10] => Array
        (
            [0] => 
        )

    [textarea_content] => Array
        (
            [0] => 
        )

    [11] => Array
        (
            [0] => 
        )

    [select_tag] => Array
        (
            [0] => 
        )

    [12] => Array
        (
            [0] => 
        )

    [select_name] => Array
        (
            [0] => 
        )

    [13] => Array
        (
            [0] => 
        )

    [button_tag] => Array
        (
            [0] => 
        )

    [14] => Array
        (
            [0] => 
        )

    [button_name] => Array
        (
            [0] => 
        )

    [15] => Array
        (
            [0] => 
        )

    [button_value] => Array
        (
            [0] => 
        )

    [16] => Array
        (
            [0] => 
        )

)

An example of a form that I load is available at https://lmx.leads360.com/web/login.aspx

Expert Comment

by:käµfm³d 👽
OK, try this correction:

$pattern = '#<(form)(?=[^>]*name="([^"]*))(?=[^>]*action="([^"]*))' .
           '|<(input)(?=(?:[^>]*type="([^"]*))?)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)' .
           '|<(textarea)(?=(?:[^>]*name="([^"]*))?)[^>]*>([^<]|<(?!/textarea))*' .
           '|<(select)(?=(?:[^>]*name="([^"]*))?)' .
           '|<(button)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)#i';


The problem was with the "checked" property, which not every INPUT has. I made all the attributes optional so the aforementioned snag should be averted. I also removed the named groups per your previous comment.
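
The effect of making an attribute optional can be seen on a reduced example. This is a sketch under assumptions: the pattern keeps only the `name` and `checked` lookaheads (with a `\b` added before each attribute name), and the sample HTML is invented. Wrapping each lookahead body in `(?: ... )?` lets the lookahead succeed even when the attribute is absent, so an input without `checked` no longer causes the whole alternative to fail.

```php
<?php
// Each attribute lookahead wraps its body in (?:...)?, so it succeeds even
// when the attribute is missing; the capture group is then simply unset.
$pattern = '#<(input)'
         . '(?=(?:[^>]*\bname="([^"]*))?)'
         . '(?=(?:[^>]*\bchecked="([^"]*))?)#i';

// Hypothetical markup: one input with "checked", one without.
$html = '<input type="checkbox" name="remember" checked="checked">'
      . '<input type="text" name="user">';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // Trailing unset groups may be absent from the array, hence "??".
    $name    = $m[2] ?? '';
    $checked = $m[3] ?? '';
    echo $name, ': ', $checked === '' ? 'no checked attribute' : $checked, PHP_EOL;
}
```

The `??` fallback matters because a group that did not take part in the match may be missing from the result array rather than present as an empty string.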
 

Author Comment

by:intellisource
I believe the following, used with PREG_SET_ORDER, would be closer to my requirements:
		$pattern = '#<(form)[^>]*name="([^"]+)[^>]*action="([^"]+)'.
			    '|<(input)[^>]*type="([^"]+)[^>]*name="([^"]+)[^>]*value="([^"]+)[^>]*checked="([^"]+)'.
			    '|<(textarea)[^>]*name="([^"]+)[^>]*>(^<)/textarea>'.
			    '|<(select)[^>]*name="([^"]+)'.
			    '|<(option)[^>]*value="([^"]+)[^>]*selected="([^"]+)'.
			    '|<(button)[^>]*name="([^"]+)"[^>]*value="([^"]+)"#ims';

which returns
Array
(
    [0] => Array
        (
            [0] => <form name="form1" method="post" action="login.aspx
            [1] => form
            [2] => form1
            [3] => login.aspx
        )

)

This layout is ideal for parsing, but it does not match any of the form elements on https://lmx.leads360.com/web/login.aspx :/ I need to return multiple matches for any of the form elements in ANY form loaded from the web, regardless of whitespace inside or outside the matched tags. What am I missing in my PCRE pattern?

Accepted Solution

by:käµfm³d 👽 (earned 500 total points)
I told a little white lie:  I didn't make the attributes on the FORM tag optional...  they are now:
$pattern = '#<(form)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*action="([^"]*))?)' .
           '|<(input)(?=(?:[^>]*type="([^"]*))?)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)' .
           '|<(textarea)(?=(?:[^>]*name="([^"]*))?)[^>]*>([^<]|<(?!/textarea))*' .
           '|<(select)(?=(?:[^>]*name="([^"]*))?)' .
           '|<(button)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)#i';
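
As a rough illustration of how this pattern could feed the array layout sketched in the original function, here is one possible way to consume it with PREG_SET_ORDER. Everything beyond the pattern itself is an assumption: the function name `getFormElements`, the array keys, and the sample HTML are hypothetical. One deliberate change is labeled in the code: the textarea content is wrapped as `((?:...)*)`, because a repeated capturing group such as `(...)*` keeps only its last iteration, that is, the final character of the content.

```php
<?php
// Hypothetical helper (not from the thread): turns one form's HTML into a
// flat list of element descriptions using the optional-lookahead pattern.
function getFormElements(string $formHtml): array
{
    $pattern = '#<(form)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*action="([^"]*))?)'
             . '|<(input)(?=(?:[^>]*type="([^"]*))?)(?=(?:[^>]*name="([^"]*))?)'
             .   '(?=(?:[^>]*value="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)'
             // Assumption/adjustment: the content group wraps the whole repetition.
             . '|<(textarea)(?=(?:[^>]*name="([^"]*))?)[^>]*>((?:[^<]|<(?!/textarea))*)'
             . '|<(select)(?=(?:[^>]*name="([^"]*))?)'
             . '|<(button)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)#i';

    preg_match_all($pattern, $formHtml, $matches, PREG_SET_ORDER);

    $elements = [];
    foreach ($matches as $m) {
        // Unset trailing groups are missing from $m, so default them to ''.
        $g = fn (int $i): string => $m[$i] ?? '';

        // Groups 1, 4, 9, 12 and 14 hold the tag name of each alternative.
        if ($g(1) !== '') {
            $elements[] = ['tag' => 'form', 'name' => $g(2), 'action' => $g(3)];
        } elseif ($g(4) !== '') {
            $elements[] = ['tag' => 'input', 'type' => $g(5), 'name' => $g(6),
                           'value' => $g(7), 'checked' => $g(8)];
        } elseif ($g(9) !== '') {
            $elements[] = ['tag' => 'textarea', 'name' => $g(10), 'value' => $g(11)];
        } elseif ($g(12) !== '') {
            $elements[] = ['tag' => 'select', 'name' => $g(13)];
        } elseif ($g(14) !== '') {
            $elements[] = ['tag' => 'button', 'name' => $g(15), 'value' => $g(16)];
        }
    }
    return $elements;
}

// Hypothetical sample form, with attributes in varying order.
$html = '<form name="form1" method="post" action="login.aspx">'
      . '<input name="user" type="text" value="bob">'
      . '<textarea name="note">hi <b>there</b></textarea>'
      . '<select name="lang"></select>'
      . '<button value="go" name="send">Go</button></form>';

$elements = getFormElements($html);
print_r($elements);
```

The sketch inherits the pattern's limits: it expects double-quoted attribute values, and `name="` would also match the tail of a longer attribute name.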


Author Closing Comment

by:intellisource
Hehehe, thanks man ;) You're the winner :D with your definite response to me missing certain PCRE elements lol :P

Expert Comment

by:käµfm³d 👽
NP. Glad it worked for you  = )

Author Comment

by:intellisource
Just one more question: if an element is inside a div styled with visibility: hidden or display: none, how can I detect that (hidden|none|empty string) by adding to the regex of the four element patterns?

Author Comment

by:intellisource
For example, when loading the login form at the example URL, the emailTextBox element is retrieved, but nothing indicates whether it is displayed or hidden.
<div class="dialog" style="position: absolute; visibility: hidden; z-index: 70013; left: 735px; top: 212px;" id="passwordDialog"><div class="header" id="passwordDialog_HeaderSpan">

            Forgot Password
        
</div><div class="content password" id="passwordDialog_InnerSpan">

            <p>Please enter the email address you signed up with and we will send you a new login link.</p>
            <dl>
                <dt>Email:</dt>
                <dd><input type="text" class="bigtextbox" id="emailTextBox" name="emailTextBox" tabindex="0"></dd>
            </dl>
            <div class="buttons submitcancel">
                <a onclick="RequestPasswordReset();" id="submitPasswordRest" class="submit left" tabindex="0">Submit</a>
                <a onclick="passwordDialog.Close();" class="cancel right" tabindex="0">Close</a>
            </div>
        
</div><div class="footer" id="passwordDialog_FooterSpan">

        
</div></div>
