Solved

PHP Regex help.

Posted on 2012-03-22
Last Modified: 2012-08-13
Hi there.

Data example:

							<table class="table-gradient">
								<thead>
									<tr>
										<th scope="col"><a href="" style="text-decoration: none; color: #FFF;">NICKNAME</a></th>
										<th scope="col"><a href="gm_decisions.php?sort=count&dir=DESC" style="text-decoration: none; color: #FFF;">GM</a></th>
										<th scope="col"><a href="gm_decisions.php?sort=date_decided&dir=DESC" style="text-decoration: none; color: #FFF;">Date Decided</a></th>
										<th scope="col"><a href="gm_decisions.php?sort=decision_id&dir=DESC" style="text-decoration: none; color: #FFF;">Decision ID</a></th>
										<th scope="col">Category</th>
										<th scope="col">Decision</th>
									</tr>
								</thead>

								<tbody>
									
																<tr>
									<td><a href="playerview.php?account_id=5640930">lHadesl</a></td>
									<td><a href="gm_decisions.php?searchType=gm&search=Rejanu">Rejanu</a></td>
									<td>03-23-12 17:42</td>
									<td>146152</td>
									<td>Excessive Verbal Abuse</td>
									<td><a href="javascript:void(0);" >Guilty</a></td>
								</tr>
									
																<tr>
									<td><a href="playerview.php?account_id=3012910">Mezmerise</a></td>
									<td><a href="gm_decisions.php?searchType=gm&search=Rejanu">Rejanu</a></td>
									<td>03-24-12 11:50</td>
									<td>145933</td>
									<td>Excessive Verbal Abuse</td>
									<td><a href="javascript:void(0);" >Innocent</a></td>
								</tr>

										
								</tbody>
							</table>




I have an array of dates. I need a regex that goes through the HTML source and checks whether any <tr> </tr> row in the table has a date that is in the dates array. For each row that matches, I need to put these values into an array: Date Decided, Category, Decision.

Thank you all for the help.
Question by:mropenmind
21 Comments
 

Author Comment

by:mropenmind
ID: 37755416
Date Array Example:

Array
(
    [0] => 2012-03-23
    [1] => 2012-03-24
    [2] => 2012-03-25
    [3] => 2012-03-26
    [4] => 2012-03-27
    [5] => 2012-03-28
    [6] => 2012-03-29
    [7] => 2012-03-30
    [8] => 2012-03-31
    [9] => 2012-04-01
    [10] => 2012-04-02
    [11] => 2012-04-03
    [12] => 2012-04-04
    [13] => 2012-04-05
)
 

Author Comment

by:mropenmind
ID: 37755429
There are two matches in the data example, so I need to add both of them to the array.

As you can see, table contains both date and time (<td>03-23-12 17:42</td>) but I don't need time, just date.

Data1: 03-23-12,Excessive Verbal Abuse,Guilty
Data2: 03-24-12,Excessive Verbal Abuse,Innocent
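
Note that the table writes dates as m-d-y with a time, while the date array uses Y-m-d, so one of the two has to be converted before they can be compared. A minimal sketch of that conversion, assuming PHP 5.3+ for DateTime::createFromFormat(); normalize_table_date() is a hypothetical helper name, not something from the original page:

```php
<?php
date_default_timezone_set('UTC');

// Hypothetical helper: turn a table cell such as "03-23-12 17:42"
// into the "2012-03-23" form used by the date array.
function normalize_table_date($cell)
{
    // 'm-d-y H:i' = two-digit month, day and year, then a 24-hour time
    $dt = DateTime::createFromFormat('m-d-y H:i', trim($cell));

    // createFromFormat() returns false when the text does not fit the format
    return ($dt === false) ? null : $dt->format('Y-m-d');
}

$dates = array('2012-03-23', '2012-03-24', '2012-03-25');

$day = normalize_table_date('03-23-12 17:42');
var_dump(in_array($day, $dates, true));
```

Comparing in Y-m-d form keeps the date array unchanged; the time part is simply dropped by the output format.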
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 37755435
The test data has no intersection, if I understand the question correctly.  It looks like the HTML dates are March 21 and the date array example starts with March 23.

I'll try changing the data a little bit and see if that can produce a reasonable test case.
 

Author Comment

by:mropenmind
ID: 37755440
Thanks. I've noticed my error before, and therefore corrected it.
 
LVL 35

Expert Comment

by:Terry Woods
ID: 37755468
A regex isn't really the best tool for parsing HTML, but when I tried using a DOM parser I just got an error. The following extracts the data in a useful manner, but it's not exactly tidy - depending on how tidy the code needs to be, it may be enough:

$tableBody = preg_replace("#^.*<tbody>(.*?)</tbody>.*$#is", "$1", $text);
print "Table body: $tableBody\n";
preg_match_all("#<tr>(?:(?!</tr>).)*?<td>(.*?)</td>(?:(?!</tr>).)*?<td>(.*?)</td>(?:(?!</tr>).)
*?<td>(.*?)</td>(?:(?!</tr>).)*?<td>(.*?)</td>(?:(?!</tr>).)*?<td>(.*?)</td>(?:(?!</tr>).)*?<td
>(.*?)</td>#si", $tableBody, $matches);
print_r($matches);
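
For comparison, here is a rough sketch of the DOM-parser route mentioned above. It assumes the error came from libxml warnings about the unescaped & characters in the links, which is only a guess since the error itself is not shown. table_rows() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: return the text of every <td>, row by row.
function table_rows($html)
{
    // Collect libxml parse warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows  = array();
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//tbody/tr') as $tr) {
        $cells = array();
        foreach ($xpath->query('td', $tr) as $td) {
            // textContent gives the cell text without the nested <a> tags
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

// One row copied from the data example in the question
$html = '<table><tbody>
<tr>
<td><a href="playerview.php?account_id=5640930">lHadesl</a></td>
<td><a href="gm_decisions.php?searchType=gm&search=Rejanu">Rejanu</a></td>
<td>03-23-12 17:42</td>
<td>146152</td>
<td>Excessive Verbal Abuse</td>
<td><a href="javascript:void(0);" >Guilty</a></td>
</tr>
</tbody></table>';

print_r(table_rows($html));
```

Each row comes back as a plain list of six strings, so the date is at index 2, the category at index 4 and the decision at index 5.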

 
LVL 108

Accepted Solution

by:
Ray Paseur earned 300 total points
ID: 37755475
You can see the script in action here.
http://www.laprbass.com/RAY_temp_mropenmind.php

But I would like to suggest that you take a moment (perhaps post another question here at EE) to start a conversation about data design patterns.  Parsing HTML is a really brittle approach to data gathering.  It may work once when it is first written and tested, but if the publisher of the HTML makes any changes, you're screwed.  For this reason many publishers expose an API and render either XML or JSON strings.  If you could get this data from a formal interface (API interfaces are almost always version-numbered and are not published until they are stable) you would be better off.

One other note -- be sure that your method of access to the web site does not violate the terms of service or the copyright notice.  Some sites explicitly disallow automated access to their web pages.  If you violate their terms of service you can be sued successfully and you may wind up with a huge legal bill.  It's not worth this risk, so be careful to check and ensure that you're in squeaky-clean compliance with the TOS.
<?php // RAY_temp_mropenmind.php
error_reporting(E_ALL);
echo "<pre>";

// REQUIRED SINCE PHP 5.1+
date_default_timezone_set('America/New_York');


// TEST DATA FROM THE POST AT EE
$htm = <<<HTM
							<table class="table-gradient">
								<thead>
									<tr>
										<th scope="col"><a href="" style="text-decoration: none; color: #FFF;">NICKNAME</a></th>
										<th scope="col"><a href="gm_decisions.php?sort=count&dir=DESC" style="text-decoration: none; color: #FFF;">GM</a></th>
										<th scope="col"><a href="gm_decisions.php?sort=date_decided&dir=DESC" style="text-decoration: none; color: #FFF;">Date Decided</a></th>
										<th scope="col"><a href="gm_decisions.php?sort=decision_id&dir=DESC" style="text-decoration: none; color: #FFF;">Decision ID</a></th>
										<th scope="col">Category</th>
										<th scope="col">Decision</th>
									</tr>
								</thead>

								<tbody>

																<tr>
									<td><a href="playerview.php?account_id=5640930">lHadesl</a></td>
									<td><a href="gm_decisions.php?searchType=gm&search=Rejanu">Rejanu</a></td>
									<td>03-21-12 17:42</td>
									<td>146152</td>
									<td>Excessive Verbal Abuse</td>
									<td><a href="javascript:void(0);" >Guilty</a></td>
								</tr>

																<tr>
									<td><a href="playerview.php?account_id=3012910">Mezmerise</a></td>
									<td><a href="gm_decisions.php?searchType=gm&search=Rejanu">Rejanu</a></td>
	<!-- CHANGE HERE -->			<td>04-01-12 11:50</td>
									<td>145933</td>
									<td>Excessive Verbal Abuse</td>
									<td><a href="javascript:void(0);" >Innocent</a></td>
								</tr>


								</tbody>
							</table>
HTM;

// FUNCTION TO RETURN AN ARRAY OF DATES
function array_of_dates($alpha='Today', $omega='Today')
{
    // MIGHT WANT TO ADD SOME SANITY CHECKS HERE
    $out = array();
    $alpha = date('Y-m-d', strtotime($alpha));
    $omega = date('Y-m-d', strtotime($omega));
    while($alpha <= $omega)
    {
        $out[] = $alpha;
        $alpha = date('Y-m-d', strtotime($alpha . ' + 1 DAY'));
    }
    return $out;
}


// GET SOMETHING TO TEST WITH
// THE YEAR IS GIVEN EXPLICITLY; WITHOUT IT strtotime() ASSUMES THE CURRENT YEAR
$dts = array_of_dates('March 23, 2012', 'April 5, 2012');

// BREAK THE HTML INTO TABLE-ROWS
$trs = explode('<tr>', $htm);

// TEST EACH TABLE ROW
foreach ($trs as $tr)
{
    // TEST AGAINST EACH DATE
    foreach ($dts as $dt)
    {
        // IF THIS DATE IS PRESENT
        $test_date = date('m-d-y', strtotime($dt));
        // strpos() RETURNS 0 FOR A MATCH AT THE START, SO COMPARE WITH false
        if (strpos($tr, $test_date) !== false)
        {
            // ISOLATE THE DATA ELEMENTS
            // var_dump($tr);
            $tds = explode('<td>', $tr);

            // SHOW THE INFORMATION WE FOUND
            foreach ($tds as $td)
            {
                $td = trim($td);
                echo PHP_EOL . strip_tags($td);
            }
        }
        else continue;
    }
}


Best of luck with your project, ~Ray
 

Author Comment

by:mropenmind
ID: 37755485
Mezmerise
Rejanu
      
04-01-12 11:50
145933
Excessive Verbal Abuse
Innocent

Why is there a blank line in the results, and how do I add the results to an array and then print them out?
 

Author Comment

by:mropenmind
ID: 37755487
I only need
date (without time) = 04-01-12
Category = Excessive Verbal Abuse
Decision = Innocent
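
One way to keep only those three values, sketched under the assumption that the cell texts of a row are already in an array in column order (as in the output posted above). pick_fields() is a hypothetical helper name.

```php
<?php
// Cell texts of one row, in column order
$cells = array('Mezmerise', 'Rejanu', '04-01-12 11:50', '145933',
               'Excessive Verbal Abuse', 'Innocent');

// Hypothetical helper: keep the date (without time), category and decision.
function pick_fields(array $cells)
{
    // Split once at the first space: "04-01-12 11:50" -> "04-01-12"
    $parts = explode(' ', $cells[2], 2);

    return array(
        'date'     => $parts[0],
        'category' => $cells[4],
        'decision' => $cells[5],
    );
}

// Add the result to an array, then print everything collected so far
$results   = array();
$results[] = pick_fields($cells);

foreach ($results as $r) {
    echo implode(',', $r) . PHP_EOL;   // prints "04-01-12,Excessive Verbal Abuse,Innocent"
}
```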
 

Author Comment

by:mropenmind
ID: 37755489
The empty line was there because of the <!-- CHANGE HERE --> comment.
 
LVL 35

Expert Comment

by:Terry Woods
ID: 37755502
$tableBody = preg_replace("#^.*<tbody>(.*?)</tbody>.*$#is", "$1", $text);
preg_match_all("#<tr>(?:(?!</tr>).)*?<td>(.*?)</td>(?:(?!</tr>).)*?<td>(.*?)</td>(?:(?!</tr>).)
*?<td>(.*?)</td>(?:(?!</tr>).)*?<td>(.*?)</td>(?:(?!</tr>).)*?<td>(.*?)</td>(?:(?!</tr>).)*?<td
>(.*?)</td>#si", $tableBody, $matches);
foreach($matches[1] as $num=>$value) {
  print "Date: ".preg_replace("/ .*/","",$matches[3][$num])."\n";
  print "Category: ".$matches[5][$num]."\n";
  print "Decision: ".strip_tags($matches[6][$num])."\n";
}

Output:
Date: 03-21-12
Category: Excessive Verbal Abuse
Decision: Guilty
Date: 03-21-12
Category: Excessive Verbal Abuse
Decision: Innocent
 

Author Comment

by:mropenmind
ID: 37755515
Where did you put that code?
 
LVL 35

Assisted Solution

by:Terry Woods
Terry Woods earned 200 total points
ID: 37755518
I tested it on a Linux server. Actually, the line breaks in the pattern might cause trouble. Here is a corrected version:

$tableBody = preg_replace("#^.*<tbody>(.*?)</tbody>.*$#is", "$1", $text);
preg_match_all("#<tr>(?:(?!</tr>).)*?<td>(.*?)</td>(?:(?!</tr>).)*?<td>(.*?)</td>(?:(?!</tr>).)*?<td>(.*?)</td>(?:(?!</tr>).)*?<td>(.*?)</td>(?:(?!</tr>).)*?<td>(.*?)</td>(?:(?!</tr>).)*?<td>(.*?)</td>#si", $tableBody, $matches);
foreach($matches[1] as $num=>$value) {
  print "Date: ".preg_replace("/ .*/","",$matches[3][$num])."\n";
  print "Category: ".$matches[5][$num]."\n";
  print "Decision: ".strip_tags($matches[6][$num])."\n";
}

 

Author Comment

by:mropenmind
ID: 37755532
I can't seem to find the correct place where to place your latest code.
 
LVL 35

Expert Comment

by:Terry Woods
ID: 37755538
Just put the HTML source into $text first, and it should work.

Oh, and Ray, data design patterns would make a great subject for an article...
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 37755542
@mropenmind: Have you ever taken a class in PHP programming?  If not, you might want to consider it.  Many community colleges offer PHP classes, and there are user groups (that offer code reviews) in the major cities.  This will give you some structured learning about PHP and it will make your learning process faster and much, much easier.

If you cannot find those kinds of learning resources, run (don't walk) to buy this book and give yourself a month to read, absorb, and work through the examples.  It will not make you a pro, but it will put you light years ahead in the quest to do things with PHP.
http://www.sitepoint.com/books/phpmysql4/

Once you have completed the SitePoint book you will never again feel like you brought a spork to a knife fight!
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 37755543
@TerryAtOpus:

;-)

Thanks, ~Ray
 

Author Comment

by:mropenmind
ID: 37755546
I used your code in the way TerryAtOpus said, but it's just that I didn't find a way to make it work with the date array.
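
A possible way to tie the regex approach to the date array, as a sketch only. The pattern is the same six-cell pattern from the regex answer, built with str_repeat(). The second table row (name, category, date) is made-up test data so that one row falls outside the date array, and $found is a hypothetical variable name.

```php
<?php
date_default_timezone_set('UTC');

// Dates to look for, in the Y-m-d form of the date array
$dates = array('2012-03-23', '2012-03-24');

// Shortened stand-in for the page source; the second row is invented test data
$text = '<tbody>
<tr><td><a href="#">lHadesl</a></td><td>Rejanu</td><td>03-23-12 17:42</td><td>146152</td><td>Excessive Verbal Abuse</td><td><a href="#">Guilty</a></td></tr>
<tr><td><a href="#">Other</a></td><td>Rejanu</td><td>03-21-12 09:10</td><td>146000</td><td>Spam</td><td><a href="#">Innocent</a></td></tr>
</tbody>';

// "<tr>" followed by six <td> cells, each captured as its own group
$pattern = '#<tr>' . str_repeat('(?:(?!</tr>).)*?<td>(.*?)</td>', 6) . '#si';
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

$wanted = array_flip($dates);   // date => index, for quick isset() lookups
$found  = array();
foreach ($matches as $m) {
    // $m[3] is the "Date Decided" cell, e.g. "03-23-12 17:42"
    $dt = DateTime::createFromFormat('m-d-y H:i', trim($m[3]));
    if ($dt === false || !isset($wanted[$dt->format('Y-m-d')])) {
        continue;   // unparsable date, or a date that is not in the array
    }
    $found[] = array(
        'date'     => $dt->format('m-d-y'),
        'category' => trim($m[5]),
        'decision' => trim(strip_tags($m[6])),
    );
}

print_r($found);
```

Only the first row survives the filter here, because 03-21-12 is not in $dates.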
 

Author Comment

by:mropenmind
ID: 37755548
that's why I posted: "I can't seem to find the correct place where to place your latest code."
 
LVL 35

Expert Comment

by:Terry Woods
ID: 37755576
Are you confusing my code with Ray's? We both posted completely independent solutions. My code should give the output I posted with just the HTML source.
 

Author Comment

by:mropenmind
ID: 37755577
Oh yeah, I actually am...
 
LVL 10

Expert Comment

by:pfrancois
ID: 37756246