  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 115

Parsing out data and adding it to an array

I have an HTML file that I need to parse for information, and I need to put that information into a number of arrays.

Can you put together an example of how I can do this?

I've got something like this:

$string = <<<EOD

<html>
<body>
<div> various text </div>
<div class="container">
<h3 class="main">title here</h3>                         //text in bold would be $main[0]
<div> various text here</div>
<div class="listing">more text here</div>            //text in bold would be $listing[0]
<div> various other text here</div>
<span class "wrap">wrap up text</span>           //text in bold would be $wrap[0]
<div>end of this listing</div>
</div>

<div class="container">
<h3 class="main">title 2 here</h3>                        //text in bold would be $main[1]
<div> various text here</div>
<div class="listing">more text here 2</div>          //text in bold would be $listing[1]
<div> various other text here</div>
<span class "wrap">wrap up text 2</span>           //text in bold would be $wrap[1]
<div>end of this listing</div>
</div>

<div class="container">
<h3 class="main">title 3 here</h3>                                  //text in bold would be $main[2]
<div> various text here</div>
<div class="listing">more text here 3</div>        //text in bold would be $listing[2]
<div> various other text here</div>
<span class "wrap">wrap up text 3</span>        //text in bold would be $wrap[2]
<div>end of this listing</div>
</div>

<div> various text</div>
</body>
</html>
EOD;


So for each <div class="container">  I need to pull out the text between:

<h3 class="main">          </h3>

<div class="listing">      </div>

<span class "wrap">      </span>

and make each into its own array, so I can later say:

<p>
echo $main[0] . '<br />';
echo $listing[0] . '<br />';
echo $wrap[0] . '<br />';
</p><p>
echo $main[1] . '<br />';
echo $listing[1] . '<br />';
echo $wrap[1] . '<br />';
</p><p>
echo $main[2] . '<br />';
echo $listing[2] . '<br />';
echo $wrap[2] . '<br />';
</p>

There will be 25 repetitions of this in all.

I know some PHP basics of course, but I don't know how to put together a while or foreach loop, nor how to pull the data out from between the tags... but I can certainly work off of a good example if you can put one together for me.

Thanks!   Chris
St_Aug_Beach_Bum
1 Solution
 
gheistCommented:
So you want to extract data from a webpage?
 
Kyle HamiltonData ScientistCommented:
Sounds like you're looking for a scraper. If your PHP knowledge is that weak, you might be better off finding a library that does it.
 
Kyle HamiltonData ScientistCommented:
[link to competing web site removed]

 
Ray PaseurCommented:
Is this really the data you're working with?  It's not even valid HTML!
 
Ray PaseurCommented:
This does what you're asking for, but I have a feeling that there is a "backstory" here, and if we understood that we might be able to lead you to a more suitable solution.
http://iconoun.com/demo/temp_staug.php 

Web scraping is fraught with risk and you should expect any web scraping script to fail at any time without notice, so don't depend on this automation to do anything important, or to produce any output that goes directly into another automated process.  The reason this is risky is that publishers can, and do, tinker with their HTML documents all the time.  Any and all assumptions about the structure of an HTML document are at risk.
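One way to act on that warning is to check the scraped arrays before anything downstream uses them. The following is only a minimal sketch: `checkScrape()` is a hypothetical helper (not part of PHP or of the script in this thread), and the sample arrays and the expected count of 2 are made up for illustration. It simply reports any field whose item count differs from what you expected, which is often the first visible symptom when a publisher changes its markup.

```php
<?php
/**
 * Hypothetical helper: compare the number of items scraped for each
 * field against the number we expected, and describe any mismatch.
 */
function checkScrape(array $fields, int $expected): array
{
    $problems = [];
    foreach ($fields as $name => $values) {
        $found = count($values);
        if ($found !== $expected) {
            $problems[] = "$name: expected $expected items, found $found";
        }
    }
    return $problems;
}

// Made-up sample data: "listing" and "wrap" came back short.
$problems = checkScrape(
    [
        'main'    => ['title here', 'title 2 here'],
        'listing' => ['more text here'],
        'wrap'    => [],
    ],
    2
);

// Report the mismatches instead of silently using bad data.
foreach ($problems as $problem) {
    echo $problem, PHP_EOL;
}
```

With the 25 entries mentioned in the question, the second argument would be 25; an empty result means the counts at least line up.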

A safer and better approach to getting external data is to ask the publisher to expose an API.  APIs are typically version-controlled and stable.  API version 1.0 will always produce the same document (probably XML or JSON) and will not vary, so you can depend on the format.  API version 1.1 will represent improvements and additions to version 1.0.  Things won't really change until you get to API version 2+, etc.  

If the publisher wants you to be able to use its data, it should expose an API for you. However, this may come at a cost, since the publisher is the copyright holder and can legally charge for the use of its data.
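For comparison, here is roughly what consuming such an API could look like. This is a sketch under assumptions: the JSON payload, its `version` and `items` keys, and the field names are all hypothetical and inlined as a string so the example runs without a network call. A real publisher's API would define its own format. `JSON_THROW_ON_ERROR` needs PHP 7.3 or later.

```php
<?php
// Hypothetical API response, inlined for illustration only.
$json = <<<'EOD'
{
  "version": "1.0",
  "items": [
    {"main": "title here",   "listing": "more text here",   "wrap": "wrap up text"},
    {"main": "title 2 here", "listing": "more text here 2", "wrap": "wrap up text 2"}
  ]
}
EOD;

// Decode to associative arrays; malformed JSON throws a JsonException.
$data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

// Pull each field into its own array, matching the arrays in the question.
$main    = array_column($data['items'], 'main');
$listing = array_column($data['items'], 'listing');
$wrap    = array_column($data['items'], 'wrap');

echo $main[1], ' / ', $listing[1], ' / ', $wrap[1], PHP_EOL;
```

No pattern matching is involved, so a cosmetic change to the publisher's HTML would not affect this code.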

<?php

/**
 * See http://www.experts-exchange.com/Programming/Languages/Scripting/PHP/Q_28607748.html
 */
error_reporting(E_ALL);

// TEST DATA CREATED USING NOWDOC SYNTAX
$str = <<<'EOD'

<html>
<body>
<div> various text </div>
<div class="container">
<h3 class="main">title here</h3>                         //text in bold would be $main[0]
<div> various text here</div>
<div class="listing">more text here</div>            //text in bold would be $listing[0]
<div> various other text here</div>
<span class "wrap">wrap up text</span>           //text in bold would be $wrap[0]
<div>end of this listing</div>
</div>

<div class="container">
<h3 class="main">title 2 here</h3>                        //text in bold would be $main[1]
<div> various text here</div>
<div class="listing">more text here 2</div>          //text in bold would be $listing[1]
<div> various other text here</div>
<span class "wrap">wrap up text 2</span>           //text in bold would be $wrap[1]
<div>end of this listing</div>
</div>

<div class="container">
<h3 class="main">title 3 here</h3>                                  //text in bold would be $main[2]
<div> various text here</div>
<div class="listing">more text here 3</div>        //text in bold would be $listing[2]
<div> various other text here</div>
<span class "wrap">wrap up text 3</span>        //text in bold would be $wrap[2]
<div>end of this listing</div>
</div>

<div> various text</div>
</body>
</html>
EOD;


// THE SIGNAL INFORMATION
$trap['main']    = [ '<h3 class="main">',      '</h3>' ];
$trap['listing'] = [ '<div class="listing">',  '</div>' ];
$trap['wrap']    = [ '<span class "wrap">',    '</span>' ];


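// EXTRACT THE DATA AND BUILD NEW ARRAYS
foreach ($trap as $var => $arr)
{
    // QUOTE THE SIGNAL STRINGS, INCLUDING THE # DELIMITER, AND CAPTURE THE TEXT BETWEEN THEM
    $rgx = '#' . '(' . preg_quote($arr[0], '#') . ')(.*?)(' . preg_quote($arr[1], '#') . ')#';
    preg_match_all($rgx, $str, $mat);

    // VARIABLE VARIABLE: CREATES $main, $listing AND $wrap FROM THE KEYS OF $trap
    $$var = $mat[2];
}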


// SHOW THE WORK PRODUCTS
foreach (array_keys($main) as $kount)
{
    echo '<p>';
    echo $main[$kount]            . '<br />';
    // GUARD AGAINST A MISSING MATCH SO A SHORT ARRAY DOES NOT RAISE A NOTICE
    echo ($listing[$kount] ?? '') . '<br />';
    echo ($wrap[$kount] ?? '')    . '<br />';
    echo '</p>' . PHP_EOL;
}
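An alternative to matching literal strings is to parse the document with PHP's DOM extension and query each container with XPath. The sketch below is not the script from this answer; it assumes the real pages use well-formed attributes (`class="wrap"` rather than the `class "wrap"` in the test data), and `firstText()` is a hypothetical helper written for this example. The sample markup is a trimmed, corrected copy of the test data.

```php
<?php
/**
 * Hypothetical helper: text of the first node matching $query
 * inside $context, or an empty string when nothing matches.
 */
function firstText(DOMXPath $xpath, string $query, DOMNode $context): string
{
    $nodes = $xpath->query($query, $context);
    if ($nodes === false || $nodes->length === 0) {
        return '';
    }
    return trim($nodes->item(0)->textContent);
}

// Trimmed sample with the class attribute on <span> corrected (assumption).
$html = <<<'EOD'
<html>
<body>
<div> various text </div>
<div class="container">
<h3 class="main">title here</h3>
<div class="listing">more text here</div>
<span class="wrap">wrap up text</span>
</div>
<div class="container">
<h3 class="main">title 2 here</h3>
<div class="listing">more text here 2</div>
<span class="wrap">wrap up text 2</span>
</div>
</body>
</html>
EOD;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$main = $listing = $wrap = [];

// One pass per container keeps the three arrays aligned by index.
foreach ($xpath->query('//div[@class="container"]') as $container) {
    $main[]    = firstText($xpath, './/h3[@class="main"]', $container);
    $listing[] = firstText($xpath, './/div[@class="listing"]', $container);
    $wrap[]    = firstText($xpath, './/span[@class="wrap"]', $container);
}

foreach (array_keys($main) as $i) {
    echo $main[$i], ' | ', $listing[$i], ' | ', $wrap[$i], PHP_EOL;
}
```

Because every container contributes exactly one entry to each array, a container that lacks one of the elements yields an empty string instead of shifting later entries out of step. Note that `@class="container"` matches only when that is the entire attribute value; elements carrying several classes would need a `contains()` test instead.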

 
St_Aug_Beach_BumAuthor Commented:
Yikes, ok all.

Thank you for the help. It's not the actual HTML; I just threw that together as an example so I can work from it. I'm pulling data from a number of sites for a project on trends, not a critical application. I looked at several spidering services, but they didn't do quite what I wanted and were costly for a small project.

Thanks again.
 
Ray PaseurCommented:
OK, good luck with it.  As long as you understand the risks...

best regards, ~Ray
Question has a verified solution.
