?
Solved

loading an external form with fields and attributes into an array via preg_match_all

Posted on 2011-09-07
13
Medium Priority
?
307 Views
Last Modified: 2012-05-12
hi there, i have the following code in which the function takes the html loaded via file_get_contents and parses it for forms. for each set of form html, it is then supposed to retrieve the form name and action attributes; input types, names, values and possibly checked attributes; textarea name attributes and values;  select name and value attributes (still working on catching the selected option) ; as well as button name and value attributes - but in this adapted regex from a working PCRE used in another part of the site, it says that | is an unknown modifier. i need to parse the attributes no matter in what order they are - but in one preg_match_all PCRE pattern.
the function is as follows:
function getform($html) {
	/*
	- load form as:
	  array(
	   [type],
	   [name],
	   [value],
	   [checked]
	  );
	*/
	$ret = array();
	$forms = array();
	preg_match_all("/\<form [^\>]+\>.*\<\/form\>/ims",
			$html,
			$out);
	print_r($out);
	foreach ($out[0] as $form) {
		array_push($forms,$form);
	}
	foreach ($forms as $form) {
		$pattern = '/\<(form).*\sname="([^"]+)".*\saction="([^")".*\>'.
			   '|\<(input).*\stype="([^"]+)".*\sname="([^"]+)".*\svalue="([^"]+)".*\schecked="([^"]+)".*\/\>'.
			   '|\<(textarea).*\sname="([^"]+)".*\>(.*)*\<\/textarea\>/ims'.
			   '|<(select).*\sname="([^"]+)".*>'.
			   '|<(button).*\sname="([^"]+)".*\svalue="([^"]+)".*>';
		preg_match_all($pattern,
					$form,
					$out);
		print_r($out);
		/*foreach ($out[1] as $type) {
			switch ($type) {
				case "form":
					array_push($ret, $out[2][i]);
					break;
				case "input":
					
					break;
				case "textarea":
					
					break;
				case "select":
					
					break;
				case "button":
					
					break;
			}
		}*/
	}
}

Open in new window

please assist me in completing the second regular expression on line 21 - it goes through the first preg_match_all retrieving all forms on any page that is loaded via file_get_contents, passing an array of forms to the next step on line 20.
namaste - Greywacke.
0
Comment
Question by:intellisource
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 8
  • 5
13 Comments
 

Author Comment

by:intellisource
ID: 36495147
ok i discovered why the | were returning an error - it is because the regular expression was closed before the end of the string! :S
i have replaced the pattern with
		$pattern = '/\<(form).*name="([^"]+)".*action="([^")".*\>'.
			   '|\<(input).*type="([^"]+)".*name="([^"]+)".*value="([^"]+)".*checked="([^"]+)".*\/\>'.
			   '|\<(textarea).*name="([^"]+)".*\>(.*)*\<\/textarea\>'.
			   '|<(select).*name="([^"]+)".*>'.
			   '|<(button).*name="([^"]+)".*value="([^"]+)".*>/ims';

Open in new window

but unfortunately it still returns the matches as empty :/ what's wrong here? i know it is supposed to return them in any order of attributes within a tag, but i do not see how this can be done - especially if i cannot get any results yet 0o
0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 36495345
Let's make a few corrections to your pattern to make it more sound:

$pattern = '#<(?<form_tag>form)(?=[^>]*name="(?<form_name>[^"]+))(?=[^>]*action="(?<form_action>[^"]*))' .
           '|<(?<input_tag>input)(?=[^>]*type="(?<input_type>[^"]+))(?=[^>]*name="(?<input_name>[^"]+))(?=[^>]*value="(?<input_value>[^"]+))(?=[^>]*checked="(?<input_checked>[^"]+))' .
           '|<(?<textarea_tag>textarea)(?=[^>]*name="(?<textarea_name>[^"]+))[^>]*>(?<textarea_content>[^<]' .
           '|<(?!/textarea))*|<(?<select>select)(?=[^>]*name="(?<select_name>[^"]+))' .
           '|<(?<button_tag>button)(?=[^>]*name="(?<button_name>[^"]+))(?=[^>]*value="(?<button_value>[^"]+))#i';

Open in new window


Please note:  I've added named groups for capturing your data. The group names are the identifiers inside of angle brackets ( < ... >) and containing underscores (e.g. "form_tag"). You can index your match array using those values instead of indexes (e.g. $match['form_tag']).

As for what I changed, instead of relying on the dot-star, which can be tricky for HTML parsing, I've changed instead to a zero-or-more search for anything not a closing angle bracket ( > ). This means your patterns won't leave the safety of each tag, provided your HTML is properly structured! I've also moved to a positive lookahead approach [ (?= ... ) ] for finding your attribute values. The benefit of this approach is that your attributes can be in any order within the pattern. Currently, your attributes will only be found if they occur in the order you have them defined in your pattern. This is entirely fine if you are guaranteed that the values will always occur in the same order. My approach will give you a bit more flexibility.

I've only mildly tested the above pattern, but I believe it to be sound. If it is not working for you, and you care to provide an example of your data, I can do further verification.
0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 36495387
This "error" won't affect the operation of the pattern, but to be consistent with your readability convention I need to post a correction (I broke the pattern in the wrong place):

$pattern = '#<(?<form_tag>form)(?=[^>]*name="(?<form_name>[^"]+))(?=[^>]*action="(?<form_action>[^"]*))' .
           '|<(?<input_tag>input)(?=[^>]*type="(?<input_type>[^"]+))(?=[^>]*name="(?<input_name>[^"]+))(?=[^>]*value="(?<input_value>[^"]+))(?=[^>]*checked="(?<input_checked>[^"]+))' .
           '|<(?<textarea_tag>textarea)(?=[^>]*name="(?<textarea_name>[^"]+))[^>]*>(?<textarea_content>[^<]|<(?!/textarea))*' .
           '|<(?<select_tag>select)(?=[^>]*name="(?<select_name>[^"]+))' .
           '|<(?<button_tag>button)(?=[^>]*name="(?<button_name>[^"]+))(?=[^>]*value="(?<button_value>[^"]+))#i';

Open in new window


It seems I also created the "select" tag's group name inconsistently with the others--I forgot the "_tag" part of the name. This is also corrected above.
0
VIDEO: THE CONCERTO CLOUD FOR HEALTHCARE

Modern healthcare requires a modern cloud. View this brief video to understand how the Concerto Cloud for Healthcare can help your organization.

 

Author Comment

by:intellisource
ID: 36495858
ok this seems to work but i have decided not to go with returning named subpatterns. it tends to get rather confusing especially with PREG_PATTERN_ORDER
0
 

Author Comment

by:intellisource
ID: 36495928
but it just gets the form details, input details are not correct. they need to work in any situation.
Array
(
    [0] => Array
        (
            [0] => <form
        )

    [form_tag] => Array
        (
            [0] => form
        )

    [1] => Array
        (
            [0] => form
        )

    [form_name] => Array
        (
            [0] => form1
        )

    [2] => Array
        (
            [0] => form1
        )

    [form_action] => Array
        (
            [0] => login.aspx
        )

    [3] => Array
        (
            [0] => login.aspx
        )

    [input_tag] => Array
        (
            [0] => 
        )

    [4] => Array
        (
            [0] => 
        )

    [input_type] => Array
        (
            [0] => 
        )

    [5] => Array
        (
            [0] => 
        )

    [input_name] => Array
        (
            [0] => 
        )

    [6] => Array
        (
            [0] => 
        )

    [input_value] => Array
        (
            [0] => 
        )

    [7] => Array
        (
            [0] => 
        )

    [input_checked] => Array
        (
            [0] => 
        )

    [8] => Array
        (
            [0] => 
        )

    [textarea_tag] => Array
        (
            [0] => 
        )

    [9] => Array
        (
            [0] => 
        )

    [textarea_name] => Array
        (
            [0] => 
        )

    [10] => Array
        (
            [0] => 
        )

    [textarea_content] => Array
        (
            [0] => 
        )

    [11] => Array
        (
            [0] => 
        )

    [select_tag] => Array
        (
            [0] => 
        )

    [12] => Array
        (
            [0] => 
        )

    [select_name] => Array
        (
            [0] => 
        )

    [13] => Array
        (
            [0] => 
        )

    [button_tag] => Array
        (
            [0] => 
        )

    [14] => Array
        (
            [0] => 
        )

    [button_name] => Array
        (
            [0] => 
        )

    [15] => Array
        (
            [0] => 
        )

    [button_value] => Array
        (
            [0] => 
        )

    [16] => Array
        (
            [0] => 
        )

)

Open in new window

an example form that is loaded is available on https://lmx.leads360.com/web/login.aspx
0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 36496448
OK, try this correction:

$pattern = '#<(form)(?=[^>]*name="([^"]*))(?=[^>]*action="([^"]*))' .
           '|<(input)(?=(?:[^>]*type="([^"]*))?)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)' .
           '|<(textarea)(?=(?:[^>]*name="([^"]*))?)[^>]*>([^<]|<(?!/textarea))*' .
           '|<(select)(?=(?:[^>]*name="([^"]*))?)' .
           '|<(button)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)#i';

Open in new window


The problem was with the "checked" property, which not every INPUT has. I made all the attributes optional so the aforementioned snag should be averted. I also removed the named groups per your previous comment.
0
 

Author Comment

by:intellisource
ID: 36496451
i believe closer to my requirements would be the following using PREG_SET_ORDER
		$pattern = '#<(form)[^>]*name="([^"]+)[^>]*action="([^"]+)'.
			    '|<(input)[^>]*type="([^"]+)[^>]*name="([^"]+)[^>]*value="([^"]+)[^>]*checked="([^"]+)'.
			    '|<(textarea)[^>]*name="([^"]+)[^>]*>(^<)/textarea>'.
			    '|<(select)[^>]*name="([^"]+)'.
			    '|<(option)[^>]*value="([^"]+)[^>]*selected="([^"]+)'.
			    '|<(button)[^>]*name="([^"]+)"[^>]*value="([^"]+)"#ims';

Open in new window

which returns
Array
(
    [0] => Array
        (
            [0] => <form name="form1" method="post" action="login.aspx
            [1] => form
            [2] => form1
            [3] => login.aspx
        )

)

Open in new window

which is ideal for the parsing - but it does not match any of the form elements on https://lmx.leads360.com/web/login.aspx :/ i need to return multiple matches for any of the form elements in ANY form loaded from the web, regardless of whitespaces in or outside the tags matched. what am i missing in my pcre pattern? 0o
0
 
LVL 75

Accepted Solution

by:
käµfm³d   👽 earned 2000 total points
ID: 36496467
I told a little white lie:  I didn't make the attributes on the FORM tag optional...  they are now:
$pattern = '#<(form)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*action="([^"]*))?)' .
           '|<(input)(?=(?:[^>]*type="([^"]*))?)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)' .
           '|<(textarea)(?=(?:[^>]*name="([^"]*))?)[^>]*>([^<]|<(?!/textarea))*' .
           '|<(select)(?=(?:[^>]*name="([^"]*))?)' .
           '|<(button)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)#i';

Open in new window

0
 

Author Closing Comment

by:intellisource
ID: 36496697
hehehe thanks man ;) ur the winner :D with ur definate response to me missing certain pcre elements lol :P
0
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 36497248
NP. Glad it worked for you  = )
0
 

Author Comment

by:intellisource
ID: 36497389
just one more question - if an element is within a div with style visibility: hidden or display: none how can i detect that (hidden|none|empty string) by adding to the four element pattern's regex?
0
 

Author Comment

by:intellisource
ID: 36497404
for example loading the login form in the example url, the emailTexbox element is retrieved but does not give any clue as to wether it is displayed/hidden or not.
<div class="dialog" style="position: absolute; visibility: hidden; z-index: 70013; left: 735px; top: 212px;" id="passwordDialog"><div class="header" id="passwordDialog_HeaderSpan">

            Forgot Password
        
</div><div class="content password" id="passwordDialog_InnerSpan">

            <p>Please enter the email address you signed up with and we will send you a new login link.</p>
            <dl>
                <dt>Email:</dt>
                <dd><input type="text" class="bigtextbox" id="emailTextBox" name="emailTextBox" tabindex="0"></dd>
            </dl>
            <div class="buttons submitcancel">
                <a onclick="RequestPasswordReset();" id="submitPasswordRest" class="submit left" tabindex="0">Submit</a>
                <a onclick="passwordDialog.Close();" class="cancel right" tabindex="0">Close</a>
            </div>
        
</div><div class="footer" id="passwordDialog_FooterSpan">

        
</div></div>

Open in new window

0

Featured Post

WordPress Tutorial 1: Installation & Setup

WordPress is a very popular option for running your web site and can be used to get your content online quickly for the world to see. This guide will walk you through installing the WordPress server software and the initial setup process.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Things That Drive Us Nuts Have you noticed the use of the reCaptcha feature at EE and other web sites?  It wants you to read and retype something that looks like this. Insanity!  It's not EE's fault - that's just the way reCaptcha works.  But it i…
This article discusses how to create an extensible mechanism for linked drop downs.
Learn how to match and substitute tagged data using PHP regular expressions. Demonstrated on Windows 7, but also applies to other operating systems. Demonstrated technique applies to PHP (all versions) and Firefox, but very similar techniques will w…
The viewer will learn how to create and use a small PHP class to apply a watermark to an image. This video shows the viewer the setup for the PHP watermark as well as important coding language. Continue to Part 2 to learn the core code used in creat…
Suggested Courses

770 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question