Marco Gasi
asked on
Formatting problem saving parsed HTML content to file
Hi all.
I'm using simple_html_dom.php to parse some pages. Everything works fine, but when I need to get the li content I get the whole list as one item instead of getting each list element separately.
I use this function:
function getTextBetweenTags( $string, $tagname )
{
global $tokens;
$html = new simple_html_dom();
$html->load( $string );
foreach ( $html->find( $tagname ) as $element )
{
$tokens[] = $element->plaintext;
}
}
I use this on my localhost, so I'm not worried about the global. Now suppose I have this HTML:
<h1>header1 </h1>
<h3>header3 </h3>
<ul>
<li>item1</li>
<li>item2</li>
<li>item3</li>
</ul>
Using the function above I get the correct result, and if I print the resulting array I get 5 array elements. But I want to put these elements in a JSON file to speed up the use of a jQuery plugin for instant translation (jquery.lang.js). So I'm using this piece of code:
$json = array();
foreach ($tokens as $t)
{
$t = trim($t);
$json[] = "$t" . ":\r\n" . "\"\",\r\n";
}
I would expect to get this:
"header1":
"",
"header3":
"",
"item1":
"",
"item2":
"",
"item3":
"",
But I get this instead:
"header1":
"",
"header3":
"",
"item1 item2 item3":
"",
Any idea?
Thanks in advance
Marco
ASKER
Lol, I had read that post: this is the reason why I moved to a DOM parser script...
Thanks for your reply, Ray: your script is wonderful. But given that I need an output like the one I described above, do I then need to process the JSON produced by your code to format it as I need, or is there some other technique to do it?
Another important point is that I don't need to get the whole document content, just the content of some tags, leaving the rest as it is. As I said, I use this to speed up the creation of the JSON files which will hold the translations of the website text, so I only need to parse the tags that contain text to translate. Since I have a series of pages which are all structured identically (they describe the company's products), I know I need to translate just the h1, h3, h4 and li elements.
ASKER
Well, if I use the native DOM parser directly:
$dom = new DOMDocument;
$dom->loadHTML( $content );
$li = $dom->getElementsByTagName('li');
foreach ( $li as $l )
{
$tokens[] = $l->nodeValue;
}
foreach ($tokens as $t)
{
echo '"' . $t . '"' . ":<br>" . "\"\",<br>";
}
I get this:
"item1 item2 item3":
"",
"item1":
"",
"item2":
"",
"item3":
"",
That is, for each ul tag I first get all li items merged into one array item, and then I get them separately. What does this mean?
Your output looks more like the third element in tokens is your "ul" element instead of three "li" elements.
Can you show the code you're using to call getTextBetweenTags?
Also, you're loading up the DOM every time that getTextBetweenTags is called. It'd be a lot more efficient to load the DOM once and have getTextBetweenTags call that loaded/parsed object each time.
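A minimal sketch of that suggestion, using PHP's built-in DOMDocument rather than simple_html_dom: the markup is parsed once and the same parsed object is then queried for each tag name. The function name `collectText` and the sample markup are illustrative assumptions, not code from this thread.

```php
<?php
// Hypothetical helper: query an already-parsed document for several tag names.
function collectText(DOMDocument $dom, array $tagNames): array
{
    $tokens = [];
    foreach ($tagNames as $tagName) {
        foreach ($dom->getElementsByTagName($tagName) as $node) {
            $text = trim($node->textContent);
            if ($text !== '') {
                $tokens[] = $text;
            }
        }
    }
    return $tokens;
}

// Sample markup modelled on the simplified example in the question.
$html = '<h1>header1 </h1><h3>header3 </h3>'
      . '<ul><li>item1</li><li>item2</li><li>item3</li></ul>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // buffer parser warnings instead of printing them
$dom->loadHTML($html);              // parse once
libxml_clear_errors();

// Reuse the same parsed document for every tag of interest.
$tokens = collectText($dom, ['h1', 'h3', 'li']);
print_r($tokens);
```

Note that the tokens come back grouped by tag name, not in document order, because each tag is queried in turn.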
ASKER
Thanks gr8gonzo for your reply. I'm away right now, but please look at my last comment: even using DOM in the way I have shown gives the same result. I agree with you: it's probably the whole ul element. How can I exclude it?
Anyway, I call that function this way:
getTextBetweenTags($content, 'li');
Please show us a "real world" test case so we can see what the entire document looks like. There may be easier ways to do this, and the most accurate test data set will show the best results.
ASKER
With pleasure, but I can't do it just now. I'll do it later, within two or three hours. I'll post here the full file and the full script I'm using to process it.
OK, great - Just need the input and the expected output. No need to post the script code.
ASKER
OK, also because the script is, after all, already all here :-)
Here's the input:
<section class="content page">
<div id="page-title"><h1>Alberi luminosi</h1></div>
<div class="container page-container">
<div class=" row">
<div class="col-12">
<ul>
<li class="parent">
<a class="fancybox" href="<?php echo img_url( 'arboles/01.jpg' ) ?>">
<img class="imgFLthumb" src="<?php echo img_url( 'arboles/01.jpg' ) ?>" />
</a>
<h4 lang="it">Albero luminoso</h4>
<ul> <li lang="it">Altezza mt 2,10</li>
<li lang="it">Nr 6 rami + tronco</li>
<li lang="it">Alimentazione 230 volts /24 volts, con trasformatore 700 Led, 50 Watt</li>
<li lang="it">Colori: arancio, giallo, rosso, blu, verde</li></ul>
</li>
<li class="parent">
<a class="fancybox" href="<?php echo img_url( 'arboles/02.jpg' ) ?>">
<img class="imgFLthumb" src="<?php echo img_url( 'arboles/02.jpg' ) ?>" />
</a>
<h4 lang="it">Albero luminoso LED mod. "MELO"</h4>
<ul> <li lang="it">Altezza mt 3,00</li>
<li lang="it">Alimentazione 230 volts/ 24 volts, con trasformatore</li>
<li lang="it">Disponibile con Foglie e Mele o Foglie e Fiori</li></ul>
</li>
<li class="parent">
<a class="fancybox" href="<?php echo img_url( 'arboles/03.jpg' ) ?>">
<img class="imgFLthumb" src="<?php echo img_url( 'arboles/03.jpg' ) ?>" />
</a>
<h4 lang="it">Albero luminoso LED</h4>
<ul> <li lang="it">Altezza mt 5,00</li>
<li lang="it">Alimentazione 230 volts/ 24 volts, con trasformatore</li>
<li lang="it">Multicolor con 5200 Led, controllo gioco luci tramite telecomando</li>
<li lang="it">IN OFFERTA SPECIALE FINO AD ESAURIMENTO SCORTE</li></ul>
</li>
<li class="parent">
<a class="fancybox" href="<?php echo img_url( 'arboles/04.jpg' ) ?>">
<img class="imgFLthumb" src="<?php echo img_url( 'arboles/04.jpg' ) ?>" />
</a>
<h4 lang="it">Albero luminoso LED mod. FICUS</h4>
<ul> <li lang="it">Altezza mt 3,00</li>
<li lang="it">Alimentazione 230 volts /24 volts, con trasformatore, Consumo 100 Watt</li>
<li lang="it">Controllo movimento luci con telecomando</li>
<li lang="it">Colori: Rosso, Bianco e Celeste</li> </ul>
</li>
<li class="parent">
<a class="fancybox" href="<?php echo img_url( 'arboles/05.jpg' ) ?>">
<img class="imgFLthumb" src="<?php echo img_url( 'arboles/05.jpg' ) ?>" />
</a>
<h4 lang="it">Albero luminoso LED mod "S"</h4>
<ul> <li lang="it">Altezza mt.2</li>
<li lang="it">Alimentazione 230 volts/24 volts, con trasformatore</li>
<li lang="it">Consumo 80 Watt</li>
<li lang="it">Tronco Nero Foglie verdi e Fiori Azzurri,rossi o viola</li>
<li lang="it">Tronco Bianco Foglie verdi con fiori bianchi</li>
<li lang="it">Tronco bianco foglie bianche con fiori bianchi</li></ul>
</li>
<li class="parent">
<a class="fancybox" href="<?php echo img_url( 'arboles/06.jpg' ) ?>">
<img class="imgFLthumb" src="<?php echo img_url( 'arboles/06.jpg' ) ?>" />
</a>
<h4 lang="it">Albero luminoso LED Mod.DUBAI</h4>
<ul> <li lang="it">Altezza mt 1,30</li>
<li lang="it">Alimentazione 230 volts /24 volts, con trasformatore, Cons. 50W</li>
<li lang="it">Con foglie verdi e fiori rossi, celesti o viola</li></ul>
</li>
<li class="parent">
<a class="fancybox" href="<?php echo img_url( 'arboles/07.jpg' ) ?>">
<img class="imgFLthumb" src="<?php echo img_url( 'arboles/07.jpg' ) ?>" />
</a>
<h4 lang="it">Albero LED Tronco "L" Mod. CBL01/CBL02</h4>
<ul> <li lang="it">Altezza mt 1,50/1,80</li>
<li lang="it">Alimentazione 230 volts /24 volts, con trasformatore, Cons. 80W</li>
<li lang="it">Con foglie verdi e fiori: Rossi, Blu o Viola</li></ul>
</li>
<li class="parent">
<a class="fancybox" href="<?php echo img_url( 'arboles/08.jpg' ) ?>">
<img class="imgFLthumb" src="<?php echo img_url( 'arboles/08.jpg' ) ?>" />
</a>
<h4 lang="it">Albero luminoso LED mod. CBL01</h4>
<ul> <li lang="it">Altezza Totale Mt. 2,50</li>
<li lang="it">Bellissimo con Nr. 1950 Led tra foglie e fiori</li>
<li lang="it">Alimentazione 230 volts/ 24 volts, con trasformatore</li>
<li lang="it">Disponibile con Foglie col. Verde e Fiori: Rossi, Viola, Blu</li></ul>
</li>
</ul>
</div>
</div>
</div>
</section>
About the output, I just need the plain text inside the HTML tags: the best would be to get all tags which have some text inside and put that text in an array. Then I could process the array and get an output like this:
"sometext":
"",
"someothertext":
"",
and so on.
I started working on this today to make a tedious part of my job easier and quicker and... well, you know how it has gone :-)
ASKER CERTIFIED SOLUTION
ASKER
Hi Ray.
I tried your code and it worked fine, except for a strange error I can't fix. Please test it on this file:
<section class="content page">
<div id="page-title"><h1 lang="it">Cinema 10D</h1></div>
<div class="container page-container">
<div class=" row">
<div class="col-md-6 col-sm-6 col-xs-12">
<h4 lang="it">CHI NON HA VISTO UN FILM IN 10D? </h4>
<p lang="it">Punti di forza del cinema 10D: bassi costi di gestione, incassi immediati, con la possibilita' di riscatto dell'investimento in pochi mesi con un target di clienti che va da 3 anni fino oltre 70 anni.</p>
</div>
<div class="col-md-6 col-sm-6 col-xs-12">
<h4 lang="it">CINEMA 8D/10D SPETTACOLO VIAGGIANTE!</h4>
<p lang="it">Personalizziamo camion e rimorchi per spettacoli viaggianti - ideale per gli operatori del carnevale. Costruito con i più alti standard di sicurezza, tutto compatibile con certificati di legge CEE. Contattaci per maggior info.
Oggi anche in noleggio (8 o 12 posti).</p>
</div>
<div class="col-md-12" style="margin: 30px auto;">
<h5 lang="it">EFFECI MOVIE dispone dei migliori film presenti sul mercato, direttamente dalla sede U.S.A., con prezzi decisamente imbattibili. Offriamo una ricca gamma di filmati 3D a vostra scelta con temi sempre diversi per intrattenere piccoli e grandi. Dai un'cocchiata al <a href="<?php echo base_url( 'peliculones' ) ?>">catalogo dei film</a></h5>
</div>
</div>
<div class="row oferta">
<div class="col-md-6 col-sm-6 col-xs-12">
<img src="<?php echo pics_url( '00_cine02.gif' ) ?>" alt="cine" />
<p lang="it">I nostri ingegneri troveranno le migliori soluzioni in base alle vostre esigenze, progettando il vostro cinema tridimensionale.</p>
</div>
<div class="col-md-6 col-sm-6 col-xs-12">
<h3 lang="it">OFFERTA SPECIALE! (low cost)</h3>
<ul>
<li lang="it">CINEMA 6D con pistoni pneumatici</li>
<li lang="it">8 posti- Poltrone ecologiche in pelle</li>
<li lang="it">Sound Gold system- 4 speakers</li>
<li lang="it">DLP proiettori full HD</li>
<li lang="it">3D schermo polarizzato su misura</li>
<li lang="it">100 occhiali 3D </li>
<li lang="it">10 film 3D</li>
<li lang="it">3D software alta definizione</li>
<li lang="it">Compressore silenziato</li>
<li lang="it">Effetti Speciali: solletico alle gambe, getto d'aria, bolle di sapone, strobo e movimento delle sedie. </li>
<li lang="it">Possibilità di aumentare le poltrone fino12/16 posti.</li>
<li lang="it"><h5 lang="it">€ 17.900</h5></li>
</ul>
</div>
</div>
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12" style="margin: 40px auto;">
<h3 lang="it">TUTTI I NOSTRI CINEMA SONO COSTRUITI CON I PIU' ALTI STANDARD DI SICUREZZA
INTERAMENTE CONFORME ALLE NORMATIVE CEE 20155</h3>
</div>
</div>
<div class="row">
<div class="row-same-height row-full-height">
<div class="ol-xs-4 col-xs-height orange col-full-height">
<ul>
<h4 lang="it">CARATTERISTICHE</h4>
<li lang="it">Certificazioni CE per ogni singolo pezzo.</li>
<li lang="it">Ogni poltrona è di materiale ecologico garantito 5 anni</li>
<li lang="it">Sound Gold system (4speakers)</li>
<li lang="it">DLP proiettori full HD</li>
<li lang="it">3D schermo Polarizzato</li>
<li lang="it">occhiali 3D</li>
<li lang="it">movie 3D in dotazione</li>
<li lang="it">3D software alta definizione</li>
<li lang="it">compressore silenziato HP</li>
</ul>
</div>
<div class="col-xs-4 col-xs-height blue col-full-height">
<h4 lang="it">EFFETTI STANDARD</h4>
<ul>
<li lang="it">Video 3D full HD</li>
<li lang="it">Solletico alle gambe</li>
<li lang="it">Soffio al collo</li>
<li lang="it">Strobo</li>
<li lang="it">Carosello di luci</li>
<li lang="it">Getto d'acqua</li>
<li lang="it">Getto d'aria </li>
<li lang="it">Bolle di sapone</li>
<li lang="it"><li lang="it">Laser</li>
<li lang="it">Fumo</li>
<li lang="it">Effetto tornado</li>
</ul>
</div>
<div class="col-xs-4 col-xs-height magenta col-full-height">
<h4 lang="it">EFFETTI EXTRA</h4>
<ul>
<li lang="it">Profumo</li>
<li lang="it">Neve</li>
<li lang="it">solletico alle mani</li>
<li lang="it">tremolio</li>
<li lang="it">Effetto fuoco</li>
<li lang="it">Dolby Sorround Sound 5.1.</li>
<li lang="it">Contapersone</li>
<li lang="it">Telecamera+TV </li>
</ul>
</div>
</div>
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12" style="margin: 40px auto;">
<h3 lang="it">Non hai un locale per installare il tuo cinema multidimensionale?</h3>
<h3 lang="it">Noleggia il nostro box personalizzabile!</h3>
</div>
<div class="col-md-6 col-sm-6 col-xs-12">
<img src="<?php echo pics_url( '00_cine_03.jpg' ) ?>" alt="cine" />
</div>
<div class="col-md-6 col-sm-6 col-xs-12">
<p lang="it">Un opportunita vantaggiosa per guadagnare affittando e con possibilita di acquistare il cinema a rate.
Minimo 3 mesi ad un massimo di 12, pagando mensilmente con rata anticipata ed un piccolo deposito che andra a scalare su prezzo di riscatto.</p>
<p lang="it">Il Cinema tridimensionale e' compresivo di: Nr. 15 Film</p>
<p lang="it">Effetti: aria, soffio sul collo, solletico alle gambe, bolle sapone, effetto tornado, strobo, laser, fumo, carosello luci, subwoofer + casse audio Special Gold</p>
</div>
</div>
</div>
</div>
</section>
I get opening-closing tag mismatch errors I can't understand.
Another problem is that if I try to extend your code to process h1 and h3 elements too, I get empty files:
$htm = str_replace('<?', '&lt;?', $htm);
$htm = str_replace('?>', '?&gt;', $htm);
// SOME SIGNAL STRINGS
$h1 = '<h1 lang="it">';
$end_h1 = '</h1>';
$h3 = '<h3 lang="it">';
$end_h3 = '</h3>';
$h4 = '<h4 lang="it">';
$end_h4 = '</h4>';
$ul = '<ul>';
$end_ul = '</ul>';
$bl = '<data>';
$end_bl = '</data>';
// BREAK THE HTML STRING INTO DATA UNITS ON THE H4 TAGS
$arr = explode($h1, $htm);
unset($arr[0]);
foreach ($arr as $key => $sub)
{
$sub = $h1 . $sub;
$poz = strpos($sub, $end_ul);
$sub = substr($sub,0,$poz);
$sub .= $end_ul;
$arr[$key] = $bl . $sub . $end_bl;
}
// ACTIVATE THIS TO SEE THE ARRAY (USE "VIEW SOURCE")
// var_dump($arr);
// TIDY UP THE HTML STRING
$htm = implode(NULL, $arr);
// BREAK THE HTML STRING INTO DATA UNITS ON THE H4 TAGS
$arr = explode($h3, $htm);
unset($arr[0]);
foreach ($arr as $key => $sub)
{
$sub = $h3 . $sub;
$poz = strpos($sub, $end_ul);
$sub = substr($sub,0,$poz);
$sub .= $end_ul;
$arr[$key] = $bl . $sub . $end_bl;
}
// ACTIVATE THIS TO SEE THE ARRAY (USE "VIEW SOURCE")
// var_dump($arr);
// TIDY UP THE HTML STRING
$htm = implode(NULL, $arr);
// BREAK THE HTML STRING INTO DATA UNITS ON THE H4 TAGS
$arr = explode($h4, $htm);
unset($arr[0]);
foreach ($arr as $key => $sub)
{
$sub = $h4 . $sub;
$poz = strpos($sub, $end_ul);
$sub = substr($sub,0,$poz);
$sub .= $end_ul;
$arr[$key] = $bl . $sub . $end_bl;
}
// ACTIVATE THIS TO SEE THE ARRAY (USE "VIEW SOURCE")
// var_dump($arr);
// TIDY UP THE HTML STRING
$htm = implode(NULL, $arr);
$htm = preg_replace('/\s\s+/', ' ', $htm);
// WRAP THE HTML STRING INTO AN XML DOCUMENT
$doc = '<wrap>' . $htm . '</wrap>';
// ACTIVATE THIS TO SHOW THE XML DOCUMENT
// echo htmlentities($doc);
// TRY TO MAKE AN OBJECT
$obj = SimpleXML_Load_String($doc);
// ACTIVATE THIS TO SEE THE OBJECT
// var_dump($obj);
// PROCESS THE OBJECT TO DISPLAY THE PARTS
foreach ($obj->data as $element)
{
// echo PHP_EOL . $element->h4;
$tokens[] = PHP_EOL . '"' . $element->h1 . '":';
$tokens[] = PHP_EOL . '"",';
// echo PHP_EOL . $element->h4;
foreach($element->ul->li as $item)
{
// echo PHP_EOL . ' ' . $item;
$tokens[] = PHP_EOL . '"' . $item . '":';
$tokens[] = PHP_EOL . '"",';
}
// echo PHP_EOL;
}
file_put_contents( $newname, $tokens );
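One likely source of the tag mismatch errors is that SimpleXML is a strict XML parser, so markup a browser tolerates, such as the unclosed `<li lang="it">` before `Laser` in the file above, makes the whole load fail. A minimal sketch of how the parser's complaints could be listed, using the standard libxml error functions; the fragment below is an illustrative assumption modelled on that line, not the full file.

```php
<?php
// Illustrative fragment: the first <li> is never closed, as in the "Laser" line above.
$xml = '<wrap><ul><li lang="it"><li lang="it">Laser</li>'
     . '<li lang="it">Fumo</li></ul></wrap>';

libxml_use_internal_errors(true);    // keep errors in a buffer instead of emitting warnings
$obj = simplexml_load_string($xml);  // returns false when the XML is not well formed

$messages = [];
if ($obj === false) {
    foreach (libxml_get_errors() as $error) {
        // Each LibXMLError carries a line number and a human-readable message.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
}
libxml_clear_errors();

print_r($messages);
```

Running the same check on the real wrapped document should point at the lines where an opening tag has no matching close. `DOMDocument::loadHTML` is more forgiving here because it uses an HTML parser that repairs such markup.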
It's evident I don't understand the logic of your code: can you explain, please?
Finally, I just don't understand what's wrong with this code:
for ( $i = 0; $i < count( $files ); $i++ )
{
$fn = $files[ $i ];
$fn = str_replace( '\\', '/', $fn );
$parts = pathinfo( $fn );
$fname = $parts[ 'basename' ];
$dirname = $parts[ 'dirname' ];
$el = explode( '.', $fname );
$json_name = $el[ 0 ] . '.json';
echo "<br>Processing file $fn<br>";
$content = file_get_contents( $fn );
$matches = array();
$tokens = array();
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML( $content );
libxml_clear_errors();
$li = $dom->getElementsByTagName('li');
foreach ( $li as $l )
{
$tokens[] = $l->nodeValue;
}
$json = array();
foreach ( $tokens as $t )
{
$t = trim( $t );
$json[] = '"' . $t . '"' . ":" . PHP_EOL;
$json[] = '"",' . PHP_EOL;
}
$newname = $dirname . '/' . $json_name;
file_put_contents( $newname, $tokens );
}
echo "Done!";
I get an extra element which contains the whole list, so I get:
"item1 item2 item3":
"",
"item1":
"",
"item2":
"",
"item3":
"",
Just curious... Where does this HTML come from? I see PHP scripts embedded in the HTML document, so I'm wondering if the information you want to capture is available in another form (perhaps a database or text / template file) without the markup.
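A side note on the loop posted above: it fills `$json` with the formatted lines but passes `$tokens` to `file_put_contents`, so the formatting never reaches the file. A minimal sketch of writing the translation map with `json_encode` instead of hand-built strings, assuming that a plain JSON object of source strings mapped to empty translations is acceptable (the exact format jquery.lang.js expects is not confirmed in this thread); `buildTranslationJson` and the sample tokens are hypothetical.

```php
<?php
// Hypothetical helper: turn a list of source strings into {"text": "", ...} JSON.
function buildTranslationJson(array $tokens): string
{
    $map = [];
    foreach ($tokens as $t) {
        $t = trim($t);
        if ($t !== '') {
            $map[$t] = '';   // duplicate strings collapse into a single key
        }
    }
    // Cast to object so the result is always a JSON object, even when empty.
    // JSON_UNESCAPED_UNICODE keeps accented characters readable in the file.
    return json_encode((object) $map, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE);
}

$tokens = ['item1 ', 'item2', 'Possibilità', 'item2'];
$json   = buildTranslationJson($tokens);
echo $json, PHP_EOL;

// Writing it out is then a single call, for example:
// file_put_contents($newname, $json);
```

Unlike the hand-built lines, this produces a complete JSON document with braces, proper escaping of quotes inside the text, and no trailing comma.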
ASKER
I wrote the HTML myself: it is a view in a CodeIgniter site with no database, even though there should be one. And yes, I know Laravel is great and I'm learning it, but it requires more time than I have just now :-)
I appreciate your problem-solving-oriented approach, but I would really like to understand the unexpected result of getElementsByTagName...
SOLUTION
ASKER
Unfortunately no, Ray.
The output I need, that is, the final JSON file, is something like this:
"Cinema 10D":
"",
"OFFERTA SPECIALE! (low cost)":
"",
"TUTTI I NOSTRI CINEMA SONO COSTRUITI CON I PIU' ALTI STANDARD DI SICUREZZA INTERAMENTE CONFORME ALLE NORMATIVE CEE 20155":
"",
"Non hai un locale per installare il tuo cinema multidimensionale?":
"",
"Noleggia il nostro box personalizzabile!":
"",
Anyway, it seems that at least part of the issue originates from the fact that I have nested lists, and within these lists I have h4 and h3 tags: if I use my original code, which uses the simple_html_dom.php script, without processing h3 and h4 specifically, I get a better result. But I still get this behavior: if my list is like the following one
<li class="parent">
<a class="fancybox" href="<?php echo img_url( 'efectosespeciales/01.jpg' ) ?>">
<img class="imgFLthumb" src="<?php echo img_url( 'efectosespeciales/01.jpg' ) ?>" />
</a>
<h4 lang="it">Macchina Spara Coriandoli funzionante con bombole c02</h4>
<ul><li lang="it">Cod PTC01</li>
<li lang="it">Incredibili! per eventi in stadio e palazzetti</li>
<li lang="it">Due modelli medio e Grande, completi di cassa in alluminio</li>
<li lang="it">richiudibile con maniglie</li></ul>
</li>
I get this:
"Macchina Spara Coriandoli funzionante con bombole c02 Cod PTC01 Incredibili! per eventi in stadio e palazzetti Due modelli medio e Grande, completi di cassa in alluminio richiudibile con maniglie":
"",
"Cod PTC01":
"",
"Incredibili! per eventi in stadio e palazzetti":
"",
"Due modelli medio e Grande, completi di cassa in alluminio":
"",
"richiudibile con maniglie":
"",
That is, the first item contains the whole list, since that is the content of the <li class='parent'>, and then the parser parses the nested list and returns its items.
So the parser works as expected; it is my markup, even though it validates, that causes the problem.
Thank you for your help.
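One way to avoid the merged entry, sketched here with PHP's built-in DOMDocument and DOMXPath rather than simple_html_dom: select only the `li` elements that contain no nested `li`, so the outer `<li class="parent">` is skipped, and pick up the headings separately. The XPath expression and the shortened sample markup are illustrative assumptions.

```php
<?php
// Shortened sample modelled on the nested list above.
$html = '<ul><li class="parent"><h4 lang="it">Macchina</h4>'
      . '<ul><li lang="it">Cod PTC01</li>'
      . '<li lang="it">richiudibile con maniglie</li></ul>'
      . '</li></ul>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // buffer parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath  = new DOMXPath($dom);
$tokens = [];
// Headings, plus only those <li> elements that have no <li> descendant.
// The union is returned in document order.
foreach ($xpath->query('//h1 | //h3 | //h4 | //li[not(.//li)]') as $node) {
    $tokens[] = trim($node->textContent);
}
print_r($tokens);
```

With real pages containing accented characters, note that `loadHTML` treats its input as ISO-8859-1 unless the markup declares a charset, so UTF-8 files would need an encoding declaration before parsing.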
ASKER
Thank you Ray: I always learn something from you.
Thanks, Marco. Sorry I couldn't get an exact solution for you. All the best, ~Ray
Please see: http://iconoun.com/demo/temp_marqusg.php