Solved

Regular Expression to collect data from html page and place in an array

Posted on 2010-11-15
Last Modified: 2012-05-10
Hi
My regex skills are not the best, and I am looking to collect some data from this page to add to a database: http://www.iso.org/iso/support/currency_codes_list-1.htm

I am hoping to get hold of the entity, currency, alphabetic code and numeric code for each row and place them in an array.

Looking at the page source, the markup for the table begins around line 200 and is formatted like so:
                  <tr >       
                  <td valign="top">
                        AFGHANISTAN
                  </td>
                  <td valign="top">
                        Afghani
                  </td>
                  <td valign="top">
                        AFN
                  </td>
                  <td valign="top">
                        971
                  </td>
                  </tr>
                  <tr class="zebra">                         
                  <td valign="top">
                        ÅLAND ISLANDS
                  </td>
                  <td valign="top">
                        Euro
                  </td>
                  <td valign="top">
                        EUR
                  </td>             
                  <td valign="top">
                        978
                  </td>             
                  </tr>             
                  <tr >                         
                  <td valign="top">
                        ALBANIA
                  </td>             
                  <td valign="top">
                        Lek
                  </td>             
                  <td valign="top">
                        ALL
                  </td>             
                  <td valign="top">
                        008
                  </td>             
                  </tr>             
                  <tr class="zebra">
                  <td valign="top">
                        ALGERIA
                  </td>
                  <td valign="top">
                        Algerian Dinar
                  </td>             
                  <td valign="top">
                        DZD
                  </td>             
                  <td valign="top">
                        012
                  </td>             
                  </tr>  

I have the following code, written by kambiz for another problem, which works the way I would like:

// fetch data  
$exchange_data = file_get_contents('http://www.bloomberg.com/javascripts/currdata.js');
// apply regular expression (the decimal point is escaped so it only matches a literal '.')
preg_match_all("/\['(\w{3}):CUR'\]\s*=\s*(\d+(?:\.\d+)?);/", $exchange_data, $matches);
// create array of CURRENCY => RATE
$exchange_rates = array_combine($matches[1], $matches[2]);

I am hoping someone can help with a regex that collects the data I want from http://www.iso.org/iso/support/currency_codes_list-1.htm

Thanks
Question by:dchid
9 Comments
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34140807
How about the following? The values will be stored in capture group 1 (Entity), 2 (Currency), 3 (Alphabetic code), and 4 (Numeric code).
<td valign="top">\s*([\S ]*)\s*</td>\s*<td valign="top">\s*([\S ]*)(?:<a[^>]*>[^<]*</a>)?\s*</td>\s*<td valign="top">\s*([A-Z]{3})\s*</td>\s*<td valign="top">\s*(\d{3}|Nil)\s*

 
LVL 108

Accepted Solution

by:
Ray Paseur earned 167 total points
ID: 34141604
Writing a REGEX for this may be more trouble than it is worth. The data is not well-formatted. Here is a script that will make sense of the data. You can install it and run it to see the output, which is an array of associative arrays.
<?php // RAY_iso_currency_codes.php
error_reporting(E_ALL);

// READ THE HTML AND REMOVE THE PARTS WE DO NOT WANT
$url = 'http://www.iso.org/iso/support/currency_codes_list-1.htm';
$htm = file_get_contents($url);
$arr = explode('<strong>Numeric code</strong>', $htm);
$poz = strpos($arr[1], '<tr >');
$htm = substr($arr[1], $poz);
$arr = explode('</tbody>', $htm);
$htm = $arr[0];

// REMOVE THE ZEBRA CLASS SO EVERY ROW STARTS WITH '<tr >'
$htm = str_replace('class="zebra"', '', $htm);
$arr = explode('<tr >', $htm);
unset($arr[0]);

$out = array();
foreach ($arr as $tr)
{
    $a = array();
    $b = array();
    $c = array();

    // PROCESS BY ROWS
    $tr  = str_replace('<br /><br />', '||', $tr);
    $tra = explode('<td valign="top">', $tr);
    $a['e'] = trim(strip_tags($tra[1]));
    $a['c'] = trim(strip_tags($tra[2]));
    $a['a'] = trim(strip_tags($tra[3]));
    $a['n'] = trim(strip_tags($tra[4]));

    // SOME ROWS HAVE TWO OR THREE ELEMENTS
    // (COMPARE WITH FALSE: STRPOS CAN RETURN 0 FOR A MATCH AT THE START)
    if (strpos($tr, '||') !== FALSE)
    {
        $cc = explode('||', $a['c']);
        $aa = explode('||', $a['a']);
        $nn = explode('||', $a['n']);
        $a['c'] = $cc[0];
        $a['a'] = $aa[0];
        $a['n'] = $nn[0];

        $b['e'] = $a['e'];
        $b['c'] = $cc[1];
        $b['a'] = $aa[1];
        $b['n'] = $nn[1];

        if (isset($cc[2]))
        {
            $c['e'] = $a['e'];
            $c['c'] = $cc[2];
            $c['a'] = $aa[2];
            $c['n'] = $nn[2];
        }
    }

    // ADD TO THE ARRAY OF ARRAYS
    $out[]
    = array
    ( 'Entity'     => $a['e']
    , 'Currency'   => $a['c']
    , 'AlphaCode'  => $a['a']
    , 'NumberCode' => $a['n']
    )
    ;

    // IF THERE IS A SECOND ELEMENT
    if (!empty($b))
    {
        $out[]
        = array
        ( 'Entity'     => $b['e']
        , 'Currency'   => $b['c']
        , 'AlphaCode'  => $b['a']
        , 'NumberCode' => $b['n']
        )
        ;
    }

    // IF THERE IS A THIRD ELEMENT
    if (!empty($c))
    {
        $out[]
        = array
        ( 'Entity'     => $c['e']
        , 'Currency'   => $c['c']
        , 'AlphaCode'  => $c['a']
        , 'NumberCode' => $c['n']
        )
        ;
    }
}

// SHOW THE WORK PRODUCT
echo "<pre>";
print_r($out);
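
For comparison, the same kind of table can probably be read with PHP's bundled DOM extension instead of string splitting. This is only a minimal sketch and not the method used in the script above: it assumes the rows look like the sample in the question, it runs on an inline copy of two rows rather than the live URL, and the function name iso_rows_from_html is made up for illustration.

```php
<?php
// Hypothetical helper: turn every four-cell <tr> into an associative array.
function iso_rows_from_html($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid; collect warnings quietly
    $doc->loadHTML($html);
    libxml_clear_errors();

    $out = array();
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = array();
        foreach ($tr->getElementsByTagName('td') as $td) {
            $cells[] = trim($td->textContent);
        }
        if (count($cells) !== 4) {
            continue;                   // skip header rows and anything unexpected
        }
        $out[] = array(
            'Entity'     => $cells[0],
            'Currency'   => $cells[1],
            'AlphaCode'  => $cells[2],
            'NumberCode' => $cells[3],
        );
    }
    return $out;
}

// Inline sample modelled on the question; swap in file_get_contents($url) for the real page.
$sample = '<table>
<tr ><td valign="top"> AFGHANISTAN </td><td valign="top"> Afghani </td><td valign="top"> AFN </td><td valign="top"> 971 </td></tr>
<tr class="zebra"><td valign="top"> ALBANIA </td><td valign="top"> Lek </td><td valign="top"> ALL </td><td valign="top"> 008 </td></tr>
</table>';

$rows = iso_rows_from_html($sample);
print_r($rows);
```

The numeric code stays a string, so a leading zero such as 008 is preserved. Cells that hold several values separated by <br /><br /> would come back fused together by textContent, so those rows would still need separate handling.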

 

Author Comment

by:dchid
ID: 34141785
Thank you both for the replies.
Kaufmed, what you supplied seems to pull what I need from the HTML, but I am having trouble looping through the matches to add them to the database fields (entity, currency, alphabetic code, numeric code) and have not yet managed to get it to work as required.

Ray, I will look at what you have supplied and see if I can get it to work, then post back with how it goes.

Thanks again
 
LVL 2

Assisted Solution

by:kambiz
kambiz earned 167 total points
ID: 34141859
The currency details in that table are not uniform. In some rows, there is more than one currency, alpha code, and numeric code for one entity (for example, UNITED STATES). There are also some rows with only an entity field (for example, ANTARCTICA). You may want to process the resulting associative array to normalize its entries.

// fetch data
$data = file_get_contents('http://www.iso.org/iso/support/currency_codes_list-1.htm');
// apply regex
preg_match_all('@<tr[^>]*>\s*<td[^>]*>\s*([\S ]+)\s*</td>\s*<td[^>]*>\s*([\S ]+)\s*</td>\s*<td[^>]*>\s*([\S ]+)\s*</td>\s*<td[^>]*>\s*([\S ]+)\s*</td>\s*</tr>@s', $data, $matches);
// merge parts into a single associative array
$currencies = array_map('merge_currency_details', $matches[1], $matches[2], $matches[3], $matches[4]);

// callback for array_map: builds one associative array per table row
function merge_currency_details($entity, $currency, $alpha_code, $numeric_code) {
  return array(
    'entity'    => htmlspecialchars_decode($entity),
    'currency'  => htmlspecialchars_decode($currency),
    'alpha'     => htmlspecialchars_decode($alpha_code),
    'num'       => htmlspecialchars_decode($numeric_code),
  );
}
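
One possible way to do that normalization is sketched below. It is only an illustration under a few assumptions: the input has the same keys as the array built above, the sample entries are made up rather than taken from the ISO page, and the helper names clean_currency_entry and is_complete_currency_entry are hypothetical. An entry counts as complete when it has a three-letter alpha code and a numeric code that is three digits or Nil.

```php
<?php
// Hypothetical helpers, not part of the code above.

// Trim stray whitespace from every field of one entry (string keys are preserved).
function clean_currency_entry(array $entry)
{
    return array_map('trim', $entry);
}

// True when the entry carries exactly one alpha code and one numeric code.
function is_complete_currency_entry(array $entry)
{
    return preg_match('/^[A-Z]{3}$/', $entry['alpha']) === 1
        && preg_match('/^(\d{3}|Nil)$/', $entry['num']) === 1;
}

// Made-up input in the same shape as $currencies.
$currencies = array(
    array('entity' => ' AFGHANISTAN ', 'currency' => 'Afghani', 'alpha' => 'AFN', 'num' => '971'),
    array('entity' => 'ANTARCTICA',    'currency' => '',        'alpha' => '',    'num' => ''),
    array('entity' => 'ALBANIA',       'currency' => 'Lek ',    'alpha' => 'ALL', 'num' => '008'),
);

$clean    = array_map('clean_currency_entry', $currencies);
$complete = array_values(array_filter($clean, 'is_complete_currency_entry'));
print_r($complete);
```

Rows that pack several currencies into one cell fail this check and are dropped, so they would have to be split into single-currency entries before filtering.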


 
LVL 75

Assisted Solution

by:käµfm³d 👽
käµfm³d   👽 earned 166 total points
ID: 34141902
Here is code I was playing with to view the extracted data (Note: I had to copy the file locally because I couldn't download it directly from the URL for some reason):
$exchange_data = file_get_contents('test.html');
preg_match_all('#<td valign="top">\s*([\S ]*)\s*</td>\s*<td valign="top">\s*([\S ]*?)\s*(?:<a[^>]*>[^<]*</a>)?\s*</td>\s*<td valign="top">\s*([A-Z]{3})\s*</td>\s*<td valign="top">\s*(\d{3}|Nil)#', $exchange_data, $matches, PREG_SET_ORDER);

print count($matches) . "<br /><br />";

for ($i = 0; $i < count($matches); $i++)
{
	print "<b>Entity:</b> " . $matches[$i][1] . " | <b>Currency:</b> " . $matches[$i][2] . " | <b>Alpha Code:</b> " . $matches[$i][3] . " | <b>Num. Code:</b> " . $matches[$i][4] . "<br />\n";
}

 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34141908
Here is a screenshot of what the results of the above look like.
[Attached screenshot: untitled.PNG]
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34141931
It seems I missed the detail about the data that kambiz mentions when I originally looked through it. The following accounts for this, but rows that hold multiple values still have to be handled as a special case: each submatch captures all of the values, and you would have to split it into separate rows yourself. You may be better off using an approach like Ray_Paseur's.
$exchange_data = file_get_contents('test.html');
preg_match_all('#<td valign="top">\s*([\S ]*)\s*</td>\s*<td valign="top">\s*([\S ]*?(?:<br /><br />[\S ]*?)*)\s*(?:<a[^>]*>[^<]*</a>)?\s*</td>\s*<td valign="top">\s*([A-Z]{3}(?:<br /><br />[A-Z]{3})*)\s*</td>\s*<td valign="top">\s*((?:\d{3}|Nil)(?:<br /><br />(?:\d{3}|Nil))*)#', $exchange_data, $matches, PREG_SET_ORDER);

print count($matches) . "<br /><br />";

for ($i = 0; $i < count($matches); $i++)
{
	print "<b>Entity:</b> " . $matches[$i][1] . " | <b>Currency:</b> " . $matches[$i][2] . " | <b>Alpha Code:</b> " . $matches[$i][3] . " | <b>Num. Code:</b> " . $matches[$i][4] . "<br />\n";
}
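
The splitting step mentioned above could look roughly like this. It is a sketch only: it assumes the values inside a submatch are separated by exactly <br /><br />, the helper name split_currency_row is hypothetical, and the sample match is hand-written for illustration rather than copied from the page.

```php
<?php
// Hypothetical helper: expand one captured row into one row per currency.
function split_currency_row($entity, $currency, $alpha, $num)
{
    $sep        = '<br /><br />';
    $currencies = array_map('trim', explode($sep, $currency));
    $alphas     = array_map('trim', explode($sep, $alpha));
    $nums       = array_map('trim', explode($sep, $num));

    $rows = array();
    foreach ($alphas as $i => $code) {
        $rows[] = array(
            'entity'   => trim($entity),
            // fall back to an empty string if the columns are not the same length
            'currency' => isset($currencies[$i]) ? $currencies[$i] : '',
            'alpha'    => $code,
            'num'      => isset($nums[$i]) ? $nums[$i] : '',
        );
    }
    return $rows;
}

// Hand-written match in the PREG_SET_ORDER layout: [0] full match, [1]..[4] capture groups.
$match = array(
    '',
    'UNITED STATES',
    'US Dollar<br /><br />US Dollar (Next day)',
    'USD<br /><br />USN',
    '840<br /><br />997',
);

$rows = split_currency_row($match[1], $match[2], $match[3], $match[4]);
foreach ($rows as $row) {
    print $row['entity'] . ' | ' . $row['currency'] . ' | ' . $row['alpha'] . ' | ' . $row['num'] . "\n";
}
```

Calling the helper inside the for loop and merging the results would give one flat list with a single currency per row, which is easier to insert into database fields.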

 

Author Comment

by:dchid
ID: 34141964
Thank you, you have all been most helpful. I have got all of your suggestions partially working the way I want. I am going to look into inserting the data into the database now and will post back with how it goes.

Kaufmed, the for loop you gave has helped a lot, and with minor alterations I have it working with the other code above.
 

Author Comment

by:dchid
ID: 34142240
Thanks to you all, really appreciate all your help.