Solved

How to check whether a DIV encloses a form element, within a conditional PCRE?

Posted on 2011-09-07
Last Modified: 2012-05-12
hi, i've just completed a PCRE as follows to retrieve form elements from a form:
$pattern = '#<(form)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*action="([^"]*))?)' .
           '|<(input)(?=(?:[^>]*type="([^"]*))?)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)(?=(?:[^>]*checked="([^"]*))?)' .
           '|<(textarea)(?=(?:[^>]*name="([^"]*))?)[^>]*>([^<]|<(?!/textarea))*' .
           '|<(select)(?=(?:[^>]*name="([^"]*))?)' .
           '|<(button)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)#i';


i wish to post this question as a follow-up to the question available at http://www.experts-exchange.com/Programming/Languages/Regular_Expressions/Q_27295097.html#a36497404
now, if an element is inside a div (its parent or any other ancestor tag) whose inline style includes "visibility: hidden" or "display: none", how can i detect that (capturing hidden|none|empty string) by adding to the four element patterns in the regex?
for example, when loading the login form on https://lmx.leads360.com/web/Login.aspx, the emailTextBox element is retrieved, but nothing indicates whether it is displayed or hidden.
Array
(
    [0] => Array
        (
            [0] => <form
            [1] => form
            [2] => form1
            [3] => login.aspx
        )

    [1] => Array
        (
            [0] => <input
            [1] => 
            [2] => 
            [3] => 
            [4] => input
            [5] => hidden
            [6] => __VIEWSTATE
            [7] => /wEPDwULLTEwMTAwMDM0ODIPZBYCAgMPZBYCAg0PFgIeC18hSXRlbUNvdW50AgEWAmYPZBYCAgEPFgIeBFRleHQF3gU8ZGl2Pg0KPHA+T24gOC8xOS8wOSB3ZSByZWxlYXNlZCBtaW5vciB1cGRhdGVzIHRvIEV4cHJlc3MuIFBsZWFzZSByZXZpZXcgdGhlIDxhIGhyZWY9Imh0dHA6Ly9sZWFkczM2MC56ZW5kZXNrLmNvbS9mb3J1bXMvMTY0MzYvZW50cmllcy80OTk5OCIgdGFyZ2V0PSJfYmxhbmsiPnJlbGVhc2Ugbm90ZXM8L2E+IGZvciBkZXRhaWxzLiBXZSBhcmUgdmVyeSBpbnRlcmVzdGVkIGluIHlvdXIgZmVlZGJhY2ssIGlmIHlvdSBoYXZlIGFueSBmZWF0dXJlIHJlY29tbWVuZGF0aW9ucywgcGxlYXNlIHBvc3QgdGhlbSBpbiBvdXIgPGEgaHJlZj0iaHR0cDovL2xlYWRzMzYwLnplbmRlc2suY29tL2ZvcnVtcy8xNjQzOS9lbnRyaWVzIiB0YXJnZXQ9Il9ibGFuayI+dXNlciBmb3J1bTwvYT4uPC9wPg0KPHVsPg0KPGxpPldlIG5vdyBvZmZlciA1IHRlbXBsYXRlczogTW9ydGdhZ2UsIERlYnQvTG9hbk1vZCwgSW5zdXJhbmNlIChIb21lL0F1dG8pLCBJbnN1cmFuY2UgKEhlYWx0aC9MaWZlKSwgR2VuZXJpYzwvbGk+DQo8bGk+TmV3IDxhIGhyZWY9Imh0dHBzOi8vbG14LmxlYWRzMzYwLmNvbS9oZWxwIiB0YXJnZXQ9Il9ibGFuayI+aGVscCBmb3J1bXM8L2E+IGFuZCB0aWNrZXRpbmcgc3lzdGVtPC9saT4NCjxsaT5OZXcgZGVkaWNhdGVkIEV4cHJlc3Mgc3VwcG9ydCBwZXJzb24gYW5kIGNoYXQgbm93IGF2YWlsYWJsZSBmb3IgdHJpYWwgYW5kIHBheWluZyBjbGllbnRzPC9saT4NCjwvdWw+DQo8L2Rpdj4NCmRkSh+rfjq3ECYgva0xikqoCQO0DWY=
        )

    [2] => Array
        (
            [0] => <input
            [1] => 
            [2] => 
            [3] => 
            [4] => input
            [5] => hidden
            [6] => __EVENTVALIDATION
            [7] => /wEWBQL7tbDVBgLw0JndDgLi/qahAwKSuuDUCwKCkfPgDOt1y12mZhJ/qB81miBJ4pLiwjLK
        )

    [3] => Array
        (
            [0] => <input
            [1] => 
            [2] => 
            [3] => 
            [4] => input
            [5] => text
            [6] => usernameTextBox
        )

    [4] => Array
        (
            [0] => <input
            [1] => 
            [2] => 
            [3] => 
            [4] => input
            [5] => password
            [6] => passwordTextBox
        )

    [5] => Array
        (
            [0] => <input
            [1] => 
            [2] => 
            [3] => 
            [4] => input
            [5] => text
            [6] => emailTextBox
        )

)
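
For reference, output of this shape is what preg_match_all() produces with the PREG_SET_ORDER flag. The following is only a minimal sketch under assumptions: the pattern is cut down to the form and input branches, and the sample markup is hypothetical rather than taken from the real login page.

```php
<?php
// Shortened version of the pattern above: only the <form> and <input> branches.
$pattern = '#<(form)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*action="([^"]*))?)' .
           '|<(input)(?=(?:[^>]*type="([^"]*))?)(?=(?:[^>]*name="([^"]*))?)(?=(?:[^>]*value="([^"]*))?)#i';

// Hypothetical sample markup, not the real page.
$html = '<form name="form1" action="login.aspx">'
      . '<input type="hidden" name="token" value="abc">'
      . '<input type="text" name="usernameTextBox">'
      . '</form>';

// PREG_SET_ORDER yields one sub-array of captures per matched tag.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // Groups 1-3 belong to <form>, groups 4-7 to <input>.
    if (!empty($m[1])) {
        echo "form name={$m[2]} action={$m[3]}\n";
    } else {
        // Trailing groups that did not take part in the match are left out.
        $value = isset($m[7]) ? $m[7] : '(none)';
        echo "input type={$m[5]} name={$m[6]} value={$value}\n";
    }
}
```

Because every attribute is captured through an optional lookahead, an absent attribute simply leaves its group empty or missing, which is why the sub-arrays above differ in length.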


i also need to check for a disabled attribute, or a parent style hiding the field, with the result appended to the same element subarray. how would this be done? here is an example of an element being hidden via a parent's inline css:
<div class="dialog" style="position: absolute; visibility: hidden; z-index: 70013; left: 735px; top: 212px;" id="passwordDialog"><div class="header" id="passwordDialog_HeaderSpan">

            Forgot Password
        
</div><div class="content password" id="passwordDialog_InnerSpan">

            <p>Please enter the email address you signed up with and we will send you a new login link.</p>
            <dl>
                <dt>Email:</dt>
                <dd><input type="text" class="bigtextbox" id="emailTextBox" name="emailTextBox" tabindex="0"></dd>
            </dl>
            <div class="buttons submitcancel">
                <a onclick="RequestPasswordReset();" id="submitPasswordRest" class="submit left" tabindex="0">Submit</a>
                <a onclick="passwordDialog.Close();" class="cancel right" tabindex="0">Close</a>
            </div>
        
</div><div class="footer" id="passwordDialog_FooterSpan">

        
</div></div>
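
Inline styles like the one on passwordDialog usually carry whitespace after the colon, so a test for hiding rules has to tolerate it. A rough sketch of one way to check a style string follows; styleHidesElement is a hypothetical helper name, and splitting on semicolons is an assumption that will not cope with values such as data: URLs.

```php
<?php
/**
 * Returns true when an inline style string sets display:none
 * or visibility:hidden. Hypothetical helper for illustration.
 */
function styleHidesElement($style) {
    foreach (explode(';', $style) as $declaration) {
        $parts = explode(':', $declaration, 2);
        if (count($parts) !== 2) {
            continue; // empty or malformed declaration
        }
        $property = strtolower(trim($parts[0]));
        // Drop an optional "!important" suffix before comparing.
        $value = strtolower(trim(str_ireplace('!important', '', $parts[1])));
        if (($property === 'display' && $value === 'none') ||
            ($property === 'visibility' && $value === 'hidden')) {
            return true;
        }
    }
    return false;
}

// The style attribute of the dialog div shown above.
$style = 'position: absolute; visibility: hidden; z-index: 70013; left: 735px; top: 212px;';
var_dump(styleHidesElement($style));
var_dump(styleHidesElement('display:block'));
```

Comparing whole property names also avoids mistaking a property such as backface-visibility for visibility.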


Question by:intellisource
 
LVL 74

Expert Comment

by:käµfm³d 👽
You won't be able to add this to the patterns, because PCRE (the engine behind PHP's preg functions) does not support unbounded lookbehind. Even if it did, trying to fully parse HTML with regex is inadvisable: it gets extremely messy and is usually unreliable. At this point I would suggest extracting your tags via the regex, then using one of the built-in HTML parsing functions to find your <div> tags, and then checking whether the regex-captured text is a substring of the <div>'s text. I don't know the function names for the HTML parsers offhand, but I'll try to find them.
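
The built-in parser in question is most likely PHP's DOM extension (DOMDocument together with DOMXPath). A minimal sketch of listing form controls with it, assuming hypothetical sample markup in place of the fetched page:

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<form name="form1" action="login.aspx">'
      . '<input type="text" name="usernameTextBox">'
      . '<input type="text" name="locked" disabled>'
      . '<select name="country"><option>ZA</option></select>'
      . '</form>';

$dom = new DOMDocument();
// Collect libxml warnings about sloppy markup instead of printing them.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$controls = array();
// Select every form control anywhere in the document, in document order.
foreach ($xpath->query('//input | //textarea | //select | //button') as $node) {
    $controls[] = array(
        'tag'      => $node->nodeName,
        'name'     => $node->getAttribute('name'),
        'type'     => $node->getAttribute('type'),
        'disabled' => $node->hasAttribute('disabled'),
    );
}
print_r($controls);
```

Each result is a DOMElement, so attributes such as disabled can be read directly rather than captured by a pattern.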
 

Author Comment

by:intellisource
thanks kaufmed - would appreciate that - should perhaps have done this in the first place :P lol
but - they say there are many ways to skin a cat - i am only aware of the xml parsing functions not html dom - might this be what you meant?
 

Author Comment

by:intellisource
browsed php.net and came across the DOM objects. i think this is what you meant. i tried using the loadHTML method as follows, inside the regex-driven loop over the forms:
$doc = new DOMDocument();
$doc->loadHTML($form);


but unfortunately i can't see how to check whether a parent div is hidden or not.
this is the function i have for detecting this at the moment. it receives the entire form element and its contents, as well as the name of the element to check (names rather than ids, since ids do not get submitted to the server).
function ishidden($name,$html) {
	$dom = new DOMDocument();
	@$dom->loadHTML($html);
	$divs = $dom->getElementsByTagName("div");
	foreach ($divs as $div) {
		$style = $div->getAttribute("style");
		if (preg_match("/display:none|visibility:hidden/ims",$style)) {
			foreach ($div->childNodes as $child) {
				if ($child->getAttribute("name")==$name) {
					return true;
				}
			}
		}
	}
	return false;
}


 
LVL 108

Expert Comment

by:Ray Paseur
This is an interesting question.  Can you tell us why you are doing this?  What is the input and the expected work product?  If we know that, there may be easier ways to skin this cat.
 

Author Comment

by:intellisource
it is to automate laborious work. a whole string of copy and paste processes can be eliminated by automating the form submissions into our quicktextpro integrations.
 

Author Comment

by:intellisource
oh, this function is to be used in reading forms, to emulate form submissions - i've given you the purpose.

 

Author Comment

by:intellisource
sorry, but i am really not familiar with using these php dom objects... -_- quite different from javascript, where i could have loaded the element and then walked up through its parents until reaching the top level. don't see how to do this here in php....
 
LVL 108

Expert Comment

by:Ray Paseur
I can show you how to find the form elements, if that is any help.

<?php // RAY_temp_intellisource.php
error_reporting(E_ALL);


// SEE http://www.experts-exchange.com/Programming/Languages/Regular_Expressions/Q_27296137.html?cid=1572


// A URL TO TEST WITH
$url = 'https://lmx.leads360.com/web/Login.aspx';

// READ THE GENERATED HTML STRING
$htm = my_curl($url);

// REMOVE THE END-OF-LINE CHARACTERS
$htm = str_replace(PHP_EOL, "", $htm);

// ISOLATE THE FORM
$form   = explode("<form",$htm);
$form   = explode("</form>",$form[1]);
$inputs = explode("<input",$form[0]);

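// INITIALIZE THE RESULTS SO THEY EXIST EVEN IF NOTHING IS FOUND
$posturl  = NULL;
$postdata = array();

// ISOLATE THE INPUTS TO THE REQUEST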
foreach($inputs as $key => $val)
{
    // IDENTIFY THE ACTION SCRIPT
    $action = strpos($val, "action");
    if($action !== false)
    {
        // EXTRACT THE ACTION SCRIPT NAME FROM THE FORM INPUT
        $actstart = strpos($val, "\"", $action+1);
        $actend   = strpos($val, "\"", $actstart+1);
        $posturl  = substr($val, $actstart+1, ($actend-$actstart-1));
        continue;
    }

    // IDENTIFY THE INPUT FIELDS BY NAME AND VALUE PAIRS
    $name = strpos($val, "name");
    if($name !== false)
    {
        // EXTRACT THE NAME FROM THE FORM INPUT
        $namestart = strpos($val, "\"", $name+1);
        $nameend   = strpos($val, "\"", $namestart+1);
        $strname   = substr($val, $namestart+1, ($nameend-$namestart-1));

        // EXTRACT THE VALUE
        $value = strpos($val, "value");
        if($value !== false)
        {
            $valuestart = strpos($val, "\"", $value+1);
            $valueend   = strpos($val, "\"", $valuestart+1);
            $strvalue   = substr($val, $valuestart+1, ($valueend-$valuestart-1));
        }

        // IF NO VALUE
        else
        {
            $strvalue   = NULL;
        }
        // RECORD THE PAIR ONLY WHEN THIS INPUT HAS A NAME
        $postdata[$strname] = $strvalue;
    }
}

// SHOW THE WORK PRODUCT
echo "<pre>";
echo PHP_EOL . "THE ACTION SCRIPT URL IS: $posturl";
echo PHP_EOL . "THE REQUEST ARGUMENTS ARE: ";
var_dump($postdata);



// A FUNCTION TO RUN A CURL-GET CLIENT CALL TO A FOREIGN SERVER
function my_curl
( $url
, $timeout=3
, $error_report=TRUE
)
{
    $curl = curl_init();

    // HEADERS AND OPTIONS APPEAR TO BE A FIREFOX BROWSER REFERRED BY GOOGLE
    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // BROWSERS USUALLY LEAVE BLANK

    // SET THE CURL OPTIONS - SEE http://php.net/manual/en/function.curl-setopt.php
    curl_setopt( $curl, CURLOPT_URL,            $url  );
    curl_setopt( $curl, CURLOPT_USERAGENT,      'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'  );
    curl_setopt( $curl, CURLOPT_HTTPHEADER,     $header  );
    curl_setopt( $curl, CURLOPT_REFERER,        'http://www.google.com'  );
    curl_setopt( $curl, CURLOPT_ENCODING,       'gzip,deflate'  );
    curl_setopt( $curl, CURLOPT_AUTOREFERER,    TRUE  );
    curl_setopt( $curl, CURLOPT_RETURNTRANSFER, TRUE  );
    curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, TRUE  );
    curl_setopt( $curl, CURLOPT_TIMEOUT,        $timeout  );

    // RUN THE CURL REQUEST AND GET THE RESULTS
    $htm = curl_exec($curl);

    // ON FAILURE HANDLE ERROR MESSAGE
    if ($htm === FALSE)
    {
        if ($error_report)
        {
            $err = curl_errno($curl);
            $inf = curl_getinfo($curl);
            echo "CURL FAIL: $url TIMEOUT=$timeout, CURL_ERRNO=$err";
            var_dump($inf);
        }
        curl_close($curl);
        return FALSE;
    }

    // ON SUCCESS RETURN XML / HTML STRING
    curl_close($curl);
    return $htm;
}
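
To emulate the submission itself, the collected name/value pairs have to be turned into a request body. The sketch below uses http_build_query() for that. The field values are hypothetical, and mapping NULL to an empty string is an assumption about how valueless fields should be sent.

```php
<?php
// Hypothetical values of the shape the script above collects.
$posturl  = 'login.aspx';
$postdata = array(
    'usernameTextBox' => 'someone@example.com',
    'passwordTextBox' => 's3cret pass',
    'emailTextBox'    => NULL, // a field with no value attribute
);

// http_build_query() skips NULL entries entirely, so map them to
// empty strings first if the field should still be submitted.
$fields = array_map(function ($value) {
    return $value === NULL ? '' : $value;
}, $postdata);

// URL-encode the pairs into an application/x-www-form-urlencoded body.
$body = http_build_query($fields, '', '&');
echo $body . PHP_EOL;
```

The resulting string can be handed to cURL through CURLOPT_POSTFIELDS when posting to the action URL.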


 

Author Comment

by:intellisource
well, getting the form elements via pcre was not too big a problem - just the formatting, which kaufmed helped me with. the issue i am facing now is determining whether a form element is within a div that has the inline style display: none or visibility: hidden. not so sure how to work that out when parsing with the php DOM objects (specifically DOMDocument::loadHTML). there does not seem to be an ancestry/parent property like javascript has when parsing the DOM, so my mind is rather stuck on this. :(
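
For what it's worth, PHP's DOM extension does expose a parentNode property on every DOMNode, much like JavaScript. A minimal sketch of climbing the ancestors with it follows. isHiddenByAncestor is a hypothetical name, the sample markup is a trimmed stand-in for the real form, and only inline styles are considered (stylesheets and scripts are ignored).

```php
<?php
/**
 * Returns true if the control named $name sits inside any ancestor whose
 * inline style sets display:none or visibility:hidden.
 */
function isHiddenByAncestor($name, $html) {
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Tolerates whitespace around the colon; the lookbehind keeps
    // properties such as backface-visibility from matching.
    $hiding = '/(?<![-\w])(?:display\s*:\s*none|visibility\s*:\s*hidden)\b/i';

    foreach (array('input', 'textarea', 'select', 'button') as $tag) {
        foreach ($dom->getElementsByTagName($tag) as $element) {
            if ($element->getAttribute('name') !== $name) {
                continue;
            }
            // Climb from the element's parent up to the root element.
            for ($node = $element->parentNode; $node instanceof DOMElement; $node = $node->parentNode) {
                if (preg_match($hiding, $node->getAttribute('style'))) {
                    return true;
                }
            }
        }
    }
    return false;
}

// Hypothetical, trimmed-down version of the login form.
$html = '<form name="form1" action="login.aspx">'
      . '<input type="text" name="usernameTextBox">'
      . '<div class="dialog" style="position: absolute; visibility: hidden;" id="passwordDialog">'
      . '<div class="content password"><dl><dd>'
      . '<input type="text" name="emailTextBox">'
      . '</dd></dl></div></div>'
      . '</form>';

var_dump(isHiddenByAncestor('emailTextBox', $html));
var_dump(isHiddenByAncestor('usernameTextBox', $html));
```

The loop stops at the root element because its parentNode is the DOMDocument itself, which is not a DOMElement.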
 

Author Comment

by:intellisource
the issue is merely within this function, which is passed the element name and the form html tree:
function ishidden($name,$html) {
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $divs = $dom->getElementsByTagName("div");
        foreach ($divs as $div) {
                $style = $div->getAttribute("style");
                if (preg_match("/display:none|visibility:hidden/ims",$style)) {
                        foreach ($div->childNodes as $child) {
                                if ($child->getAttribute("name")==$name) {
                                        return true;
                                }
                        }
                }
        }
        return false;
}
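
One likely reason a function of this shape misses the field is that childNodes holds only direct children (text nodes included), while the input sits several levels down. An alternative sketch, assuming DOMXPath is available, asks for the styled ancestors of the named element directly. hidingAncestors is a hypothetical name and the markup is a trimmed stand-in.

```php
<?php
/**
 * Returns the ids (or tag names, when there is no id) of every ancestor
 * whose inline style hides the control named $name.
 */
function hidingAncestors($name, $html) {
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($dom);
    $hiding = '/(?<![-\w])(?:display\s*:\s*none|visibility\s*:\s*hidden)\b/i';
    $found  = array();

    // Compare names in PHP to avoid quoting $name inside the XPath string.
    foreach ($xpath->query('//*[@name]') as $element) {
        if ($element->getAttribute('name') !== $name) {
            continue;
        }
        // The ancestor axis covers the parent, grandparent and so on.
        foreach ($xpath->query('ancestor::*[@style]', $element) as $ancestor) {
            if (preg_match($hiding, $ancestor->getAttribute('style'))) {
                $id = $ancestor->getAttribute('id');
                $found[] = $id !== '' ? $id : $ancestor->nodeName;
            }
        }
    }
    return $found;
}

// Hypothetical, trimmed-down version of the login form.
$html = '<form name="form1" action="login.aspx">'
      . '<input type="text" name="usernameTextBox">'
      . '<div class="dialog" style="position: absolute; visibility: hidden;" id="passwordDialog">'
      . '<div class="content password"><dl><dd>'
      . '<input type="text" name="emailTextBox">'
      . '</dd></dl></div></div>'
      . '</form>';

print_r(hidingAncestors('emailTextBox', $html));
```

Returning the matching ancestors rather than a plain boolean makes it easier to see which wrapper is doing the hiding.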


 

Accepted Solution

by:intellisource
okay.
after a business breakfast with the client, and an inspired discussion towards resolving this issue as in yesterday - i've located the PHP Simple HTML DOM Parser, which does in fact include a parent property to each DOM element! ;)
just figuring how to include and use this API though... then it will be about 30 minutes to complete this function! :D thanks for the help guys...
 

Author Closing Comment

by:intellisource
have decided to go with the PHP Simple HTML DOM Parser, linked in this post to a resolution of the actual problem. ;)