Formatting problem when saving parsed HTML content to a file

Hi all.
I'm using simple_html_dom.php to parse some pages. Everything works fine, but when I need to get the li content, I get the whole list as one item instead of each list element separately.

I use this function:
function getTextBetweenTags( $string, $tagname )
{
	global $tokens;
	$html = new simple_html_dom();
	$html->load( $string );
	foreach ( $html->find( $tagname ) as $element )
	{
		$tokens[] = $element->plaintext;
	}
}

I run this on my localhost, so I'm not worried about the global. Now suppose I have this HTML:

<h1>header1 </h1>
<h3>header3 </h3>
<ul>
<li>item1</li>
<li>item2</li>
<li>item3</li>
</ul>

Using the function above I get the correct result, and if I print the resulting array I get 5 array elements. But I want to put these elements in a JSON file to speed up the use of a jQuery plugin for instant translation (jquery.lang.js). So I'm using this piece of code:

		$json = array();
		foreach ($tokens as $t)
		{
			$t = trim($t);
			$json[] = '"' . $t . '":' . "\r\n" . '"",' . "\r\n"; // wrap the key in double quotes
		}

I would expect to get this:

"header1":
"",
"header3":
"",
"li1":
"",
"li2":
"",
"li3":
"",

But I get this instead:

"header1":
"",
"header3":
"",
"li1             li2                li3":
"",

Any idea?
Thanks in advance
Marco
Asked by: Marco Gasi (Freelancer)

Ray Paseur commented:
I think I might approach this a little differently.  Also, if you haven't seen it yet, this is one of the best comments ever on the difficulties of parsing complex markup.

Please see: http://iconoun.com/demo/temp_marqusg.php
<?php // demo/temp_marqusg.php

/**
 * http://www.experts-exchange.com/questions/28692697/Fromatting-problem-saving-to-file-parsed-html-content.html
 *
 * http://php.net/manual/en/function.simplexml-load-string.php
 * http://php.net/manual/en/function.json-encode.php
 */
error_reporting(E_ALL);
echo '<pre>';

// SOME TEST DATA
$htm = <<<EOD
<h1>header1 </h1>
<h3>header3 </h3>
<ul>
<li>item1</li>
<li>item2</li>
<li>item3</li>
</ul>
EOD;

// WRAP THE HTML INTO AN XML DOCUMENT
$doc = '<wrap>' . $htm . '</wrap>';

// TRY TO MAKE AN OBJECT
$obj = SimpleXML_Load_String($doc);

// ACTIVATE THIS TO SEE THE OBJECT
// var_dump($obj);

// TRY TO MAKE A JSON STRING
$jso = json_encode($obj, JSON_PRETTY_PRINT);
echo htmlentities($jso);

Marco Gasi (Freelancer, Author) commented:
Lol, I had read that post: that's the reason I moved to a DOM parser script...
Thanks for your reply, Ray: your script is wonderful. But given that I need output like the one I described above, do I need to process the JSON produced by your code to format it the way I need, or is there some other technique to do it?
Another important point is that I don't need the whole document content, just the content of some tags, leaving the rest as it is. As I said, I use this to speed up the creation of the JSON files that will hold the translations of the website text, so I need to parse only the tags that contain text to translate. Since I have a series of pages that all share the same structure (they describe the company's products), I know I need to translate just the h1, h3, h4 and li elements.

Marco Gasi (Freelancer, Author) commented:
Well, if I use the native DOM parser directly:

		$dom = new DOMDocument;
		$dom->loadHTML( $content );
		$li = $dom->getElementsByTagName('li');
		foreach ( $li as $l )
		{
			$tokens[] = $l->nodeValue;
		}
		foreach ($tokens as $t)
		{
			echo '"' . $t . '"' . ":<br>" . "\"\",<br>";
		}

I get this:

"item1 item2 item3":
"",
"item1":
"",
"item2":
"",
"item3":
"",

That is, for each ul tag I first get all the li items merged into one array item, and then I get them separately. What does this mean?

gr8gonzo (Consultant) commented:
Your output looks more like the third element in tokens is your "ul" element instead of three "li" elements.

Can you show the code you're using to call getTextBetweenTags?

Also, you're loading up the DOM every time that getTextBetweenTags is called. It'd be a lot more efficient to load the DOM once and have getTextBetweenTags call that loaded/parsed object each time.
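
The "load once, query many times" suggestion could look something like the sketch below. It uses PHP's built-in DOMDocument instead of simple_html_dom so that it is self-contained, which is a substitution and not what the original function does. The helper name collectTagText and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: receives an already-parsed document and a list of tag
// names, so the HTML is parsed only once however many tags are queried.
function collectTagText(DOMDocument $dom, array $tagNames): array
{
    $tokens = [];
    foreach ($tagNames as $tagName) {
        foreach ($dom->getElementsByTagName($tagName) as $element) {
            $tokens[] = trim($element->textContent);
        }
    }
    return $tokens;
}

// Sample markup taken from the simplified example in the question.
$html = '<h1>header1 </h1><h3>header3 </h3>'
      . '<ul><li>item1</li><li>item2</li><li>item3</li></ul>';

libxml_use_internal_errors(true);   // buffer parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);              // parse once
libxml_clear_errors();

$tokens = collectTagText($dom, ['h1', 'h3', 'li']);
print_r($tokens);
```

Returning the array instead of writing to a global also makes the function easier to reuse for each file in a batch.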

Marco Gasi (Freelancer, Author) commented:
Thanks for your reply, gr8gonzo. I'm away right now, but please look at my last comment: even using DOM the way I showed gives the same result. I agree with you: it's probably the whole ul element. How can I exclude it?
Anyway, I call that function this way:
getTextBetweenTags($content, 'li'); 

Ray Paseur commented:
Please show us a "real world" test case so we can see what the entire document looks like.  There may be easier ways to do this, and the most accurate test data set will show the best results.

Marco Gasi (Freelancer, Author) commented:
With pleasure, but I can't do it just now. I'll do it later, within two or three hours. I'll post here the full file and the full script I'm using to process it.

Ray Paseur commented:
OK, great - Just need the input and the expected output.  No need to post the script code.

Marco Gasi (Freelancer, Author) commented:
OK, also because the script is, after all, already all here :-)
Here's the input:
<section class="content page">
	<div id="page-title"><h1>Alberi luminosi</h1></div>
	<div class="container page-container">
		<div class=" row">
			<div class="col-12">
				<ul>
					<li class="parent">
						<a class="fancybox" href="<?php echo img_url( 'arboles/01.jpg' ) ?>">
							<img class="imgFLthumb" src="<?php echo img_url( 'arboles/01.jpg' ) ?>" />
						</a>
						<h4 lang="it">Albero luminoso</h4>
						<ul> <li lang="it">Altezza mt 2,10</li>
							<li lang="it">Nr 6 rami + tronco</li>
							<li lang="it">Alimentazione 230 volts /24 volts, con trasformatore 700 Led, 50 Watt</li>
							<li lang="it">Colori: arancio, giallo, rosso, blu, verde</li></ul>
					</li>
					<li class="parent">
						<a class="fancybox" href="<?php echo img_url( 'arboles/02.jpg' ) ?>">
							<img class="imgFLthumb" src="<?php echo img_url( 'arboles/02.jpg' ) ?>" />
						</a>
						<h4 lang="it">Albero luminoso LED mod. "MELO"</h4>
						<ul> <li lang="it">Altezza mt 3,00</li>
							<li lang="it">Alimentazione 230 volts/ 24 volts, con trasformatore</li>
							<li lang="it">Disponibile con Foglie e Mele o Foglie e Fiori</li></ul>
					</li>
					<li class="parent">
						<a class="fancybox" href="<?php echo img_url( 'arboles/03.jpg' ) ?>">
							<img class="imgFLthumb" src="<?php echo img_url( 'arboles/03.jpg' ) ?>" />
						</a>
						<h4 lang="it">Albero luminoso LED</h4>
						<ul> <li lang="it">Altezza mt 5,00</li>
							<li lang="it">Alimentazione 230 volts/ 24 volts, con trasformatore</li>
							<li lang="it">Multicolor con 5200 Led, controllo gioco luci tramite telecomando</li>
							<li lang="it">IN OFFERTA SPECIALE FINO AD ESAURIMENTO SCORTE</li></ul>
					</li>
					<li class="parent">
						<a class="fancybox" href="<?php echo img_url( 'arboles/04.jpg' ) ?>">
							<img class="imgFLthumb" src="<?php echo img_url( 'arboles/04.jpg' ) ?>" />
						</a>
						<h4 lang="it">Albero luminoso LED mod. FICUS</h4>
						<ul> <li lang="it">Altezza mt 3,00</li>
							<li lang="it">Alimentazione 230 volts /24 volts, con trasformatore, Consumo 100 Watt</li>
							<li lang="it">Controllo movimento luci con telecomando</li>
							<li lang="it">Colori: Rosso, Bianco e Celeste</li> </ul>
					</li>
					<li class="parent">
						<a class="fancybox" href="<?php echo img_url( 'arboles/05.jpg' ) ?>">
							<img class="imgFLthumb" src="<?php echo img_url( 'arboles/05.jpg' ) ?>" />
						</a>
						<h4 lang="it">Albero luminoso LED mod "S"</h4>
						<ul> <li lang="it">Altezza mt.2</li>
							<li lang="it">Alimentazione 230 volts/24 volts, con trasformatore</li>
							<li lang="it">Consumo 80 Watt</li>
							<li lang="it">Tronco Nero Foglie verdi e Fiori Azzurri,rossi o viola</li>
							<li lang="it">Tronco Bianco Foglie verdi con fiori bianchi</li>
							<li lang="it">Tronco bianco foglie bianche con fiori bianchi</li></ul>
					</li>
					<li class="parent">
						<a class="fancybox" href="<?php echo img_url( 'arboles/06.jpg' ) ?>">
							<img class="imgFLthumb" src="<?php echo img_url( 'arboles/06.jpg' ) ?>" />
						</a>
						<h4 lang="it">Albero luminoso LED Mod.DUBAI</h4>
						<ul> <li lang="it">Altezza mt 1,30</li>
							<li lang="it">Alimentazione 230 volts /24 volts, con trasformatore, Cons. 50W</li>
							<li lang="it">Con foglie verdi e fiori rossi, celesti o viola</li></ul>
					</li>
					<li class="parent">
						<a class="fancybox" href="<?php echo img_url( 'arboles/07.jpg' ) ?>">
							<img class="imgFLthumb" src="<?php echo img_url( 'arboles/07.jpg' ) ?>" />
						</a>
						<h4 lang="it">Albero LED Tronco "L" Mod. CBL01/CBL02</h4>
						<ul> <li lang="it">Altezza mt 1,50/1,80</li>
							<li lang="it">Alimentazione 230 volts /24 volts, con trasformatore, Cons. 80W</li>
							<li lang="it">Con foglie verdi e fiori: Rossi, Blu o Viola</li></ul>
					</li>
					<li class="parent">
						<a class="fancybox" href="<?php echo img_url( 'arboles/08.jpg' ) ?>">
							<img class="imgFLthumb" src="<?php echo img_url( 'arboles/08.jpg' ) ?>" />
						</a>
						<h4 lang="it">Albero luminoso LED mod. CBL01</h4>
						<ul> <li lang="it">Altezza Totale Mt. 2,50</li>
							<li lang="it">Bellissimo con Nr. 1950 Led tra foglie e fiori</li>
							<li lang="it">Alimentazione 230 volts/ 24 volts, con trasformatore</li>
							<li lang="it">Disponibile con Foglie col. Verde e Fiori: Rossi, Viola, Blu</li></ul>
					</li>
				</ul>
			</div>
		</div>
	</div>
</section>

About the output: I just need the plain text inside the HTML tags. Ideally, I would get every tag that contains some text and put that text in an array. Then I could process the array and get output like this:

"sometext":
"",
"someothertext":
"",

and so on.

I started working on this today to make a tedious part of my job easier and quicker and... well, you know how it has gone :-)
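
A minimal sketch of the "every tag that contains some text" idea, assuming the built-in DOMDocument and DOMXPath classes are acceptable here. The shortened markup is a hypothetical stand-in for one of the views. Note that json_encode produces a complete JSON object, so the output is wrapped in braces, unlike the bare listing above.

```php
<?php
// Hypothetical, shortened stand-in for one of the views.
$html = '<div><h4 lang="it">Albero luminoso</h4>'
      . '<ul><li lang="it">Altezza mt 2,10</li>'
      . '<li lang="it">Nr 6 rami + tronco</li></ul></div>';

libxml_use_internal_errors(true);   // buffer parser warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Select every text node that contains something other than whitespace.
$xpath = new DOMXPath($dom);
$pairs = [];
foreach ($xpath->query('//text()[normalize-space()]') as $textNode) {
    // Each source string becomes a key with an empty translation.
    $pairs[trim($textNode->nodeValue)] = '';
}

// json_encode takes care of quoting and escaping the keys.
$json = json_encode($pairs, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE);
echo $json;
```

Because each text node is visited exactly once, nested lists do not produce merged entries. For UTF-8 input with accented characters, loadHTML may need an encoding hint, since it does not assume UTF-8 by default.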

Ray Paseur commented:
Hopefully this will be helpful.  There are comments throughout the code.
Please see http://iconoun.com/demo/temp_marqusg.php

<?php // demo/temp_marqusg.php

/**
 * http://www.experts-exchange.com/questions/28692697/Fromatting-problem-saving-to-file-parsed-html-content.html
 *
 * http://php.net/manual/en/function.simplexml-load-string.php
 * http://php.net/manual/en/function.json-encode.php
 */
error_reporting(E_ALL);
echo '<pre>';

// SOME TEST DATA
$htm = <<<EOD
<section class="content page">
	<div id="page-title"><h1>Alberi luminosi</h1></div>
	<div class="container page-container">
		<div class=" row">
			<div class="col-12">
				<ul>
					<li class="parent">
						<a class="fancybox" href="<?php echo img_url( 'arboles/01.jpg' ) ?>">
							<img class="imgFLthumb" src="<?php echo img_url( 'arboles/01.jpg' ) ?>" />
						</a>
						<h4 lang="it">Albero luminoso</h4>
						<ul> <li lang="it">Altezza mt 2,10</li>
							<li lang="it">Nr 6 rami + tronco</li>
							<li lang="it">Alimentazione 230 volts /24 volts, con trasformatore 700 Led, 50 Watt</li>
							<li lang="it">Colori: arancio, giallo, rosso, blu, verde</li></ul>
					</li>
					<li class="parent">
						<a class="fancybox" href="<?php echo img_url( 'arboles/02.jpg' ) ?>">
							<img class="imgFLthumb" src="<?php echo img_url( 'arboles/02.jpg' ) ?>" />
						</a>
						<h4 lang="it">Albero luminoso LED mod. "MELO"</h4>
						<ul> <li lang="it">Altezza mt 3,00</li>
							<li lang="it">Alimentazione 230 volts/ 24 volts, con trasformatore</li>
							<li lang="it">Disponibile con Foglie e Mele o Foglie e Fiori</li></ul>
					</li>
					<li class="parent">
						<a class="fancybox" href="<?php echo img_url( 'arboles/03.jpg' ) ?>">
							<img class="imgFLthumb" src="<?php echo img_url( 'arboles/03.jpg' ) ?>" />
						</a>
						<h4 lang="it">Albero luminoso LED</h4>
						<ul> <li lang="it">Altezza mt 5,00</li>
							<li lang="it">Alimentazione 230 volts/ 24 volts, con trasformatore</li>
							<li lang="it">Multicolor con 5200 Led, controllo gioco luci tramite telecomando</li>
							<li lang="it">IN OFFERTA SPECIALE FINO AD ESAURIMENTO SCORTE</li></ul>
					</li>
					<li class="parent">
						<a class="fancybox" href="<?php echo img_url( 'arboles/04.jpg' ) ?>">
							<img class="imgFLthumb" src="<?php echo img_url( 'arboles/04.jpg' ) ?>" />
						</a>
						<h4 lang="it">Albero luminoso LED mod. FICUS</h4>
						<ul> <li lang="it">Altezza mt 3,00</li>
							<li lang="it">Alimentazione 230 volts /24 volts, con trasformatore, Consumo 100 Watt</li>
							<li lang="it">Controllo movimento luci con telecomando</li>
							<li lang="it">Colori: Rosso, Bianco e Celeste</li> </ul>
					</li>
					<li class="parent">
						<a class="fancybox" href="<?php echo img_url( 'arboles/05.jpg' ) ?>">
							<img class="imgFLthumb" src="<?php echo img_url( 'arboles/05.jpg' ) ?>" />
						</a>
						<h4 lang="it">Albero luminoso LED mod "S"</h4>
						<ul> <li lang="it">Altezza mt.2</li>
							<li lang="it">Alimentazione 230 volts/24 volts, con trasformatore</li>
							<li lang="it">Consumo 80 Watt</li>
							<li lang="it">Tronco Nero Foglie verdi e Fiori Azzurri,rossi o viola</li>
							<li lang="it">Tronco Bianco Foglie verdi con fiori bianchi</li>
							<li lang="it">Tronco bianco foglie bianche con fiori bianchi</li></ul>
					</li>
					<li class="parent">
						<a class="fancybox" href="<?php echo img_url( 'arboles/06.jpg' ) ?>">
							<img class="imgFLthumb" src="<?php echo img_url( 'arboles/06.jpg' ) ?>" />
						</a>
						<h4 lang="it">Albero luminoso LED Mod.DUBAI</h4>
						<ul> <li lang="it">Altezza mt 1,30</li>
							<li lang="it">Alimentazione 230 volts /24 volts, con trasformatore, Cons. 50W</li>
							<li lang="it">Con foglie verdi e fiori rossi, celesti o viola</li></ul>
					</li>
					<li class="parent">
						<a class="fancybox" href="<?php echo img_url( 'arboles/07.jpg' ) ?>">
							<img class="imgFLthumb" src="<?php echo img_url( 'arboles/07.jpg' ) ?>" />
						</a>
						<h4 lang="it">Albero LED Tronco "L" Mod. CBL01/CBL02</h4>
						<ul> <li lang="it">Altezza mt 1,50/1,80</li>
							<li lang="it">Alimentazione 230 volts /24 volts, con trasformatore, Cons. 80W</li>
							<li lang="it">Con foglie verdi e fiori: Rossi, Blu o Viola</li></ul>
					</li>
					<li class="parent">
						<a class="fancybox" href="<?php echo img_url( 'arboles/08.jpg' ) ?>">
							<img class="imgFLthumb" src="<?php echo img_url( 'arboles/08.jpg' ) ?>" />
						</a>
						<h4 lang="it">Albero luminoso LED mod. CBL01</h4>
						<ul> <li lang="it">Altezza Totale Mt. 2,50</li>
							<li lang="it">Bellissimo con Nr. 1950 Led tra foglie e fiori</li>
							<li lang="it">Alimentazione 230 volts/ 24 volts, con trasformatore</li>
							<li lang="it">Disponibile con Foglie col. Verde e Fiori: Rossi, Viola, Blu</li></ul>
					</li>
				</ul>
			</div>
		</div>
	</div>
</section>
EOD;

// REMOVE UNWANTED PHP TAGS AND ANY OTHER UNDESIRABLE ARTIFACTS HERE
$htm = str_replace('<?', '&lt;?', $htm);
$htm = str_replace('?>', '?&gt;', $htm);

// SOME SIGNAL STRINGS
$h4     = '<h4 lang="it">';
$end_h4 = '</h4>';
$ul     = '<ul>';
$end_ul = '</ul>';
$bl     = '<data>';
$end_bl = '</data>';

// BREAK THE HTML STRING INTO DATA UNITS ON THE H4 TAGS
$arr = explode($h4, $htm);
unset($arr[0]);
foreach ($arr as $key => $sub)
{
    $sub = $h4 . $sub;
    $poz = strpos($sub, $end_ul);
    $sub = substr($sub,0,$poz);
    $sub .= $end_ul;
    $arr[$key] = $bl . $sub . $end_bl;
}

// ACTIVATE THIS TO SEE THE ARRAY (USE "VIEW SOURCE")
// var_dump($arr);

// TIDY UP THE HTML STRING
$htm = implode(NULL, $arr);
$htm = preg_replace('/\s\s+/', ' ', $htm);

// WRAP THE HTML STRING INTO AN XML DOCUMENT
$doc = '<wrap>' . $htm . '</wrap>';

// ACTIVATE THIS TO SHOW THE XML DOCUMENT
// echo htmlentities($doc);

// TRY TO MAKE AN OBJECT
$obj = SimpleXML_Load_String($doc);

// ACTIVATE THIS TO SEE THE OBJECT
// var_dump($obj);

// PROCESS THE OBJECT TO DISPLAY THE PARTS
foreach ($obj->data as $element)
{
    echo PHP_EOL . $element->h4;
    foreach($element->ul->li as $item)
    {
        echo PHP_EOL . '   ' . $item;
    }
    echo PHP_EOL;
}

// TRY TO MAKE A JSON STRING
$jso = json_encode($obj, JSON_PRETTY_PRINT);
echo htmlentities($jso);

Marco Gasi (Freelancer, Author) commented:
Hi Ray.
I tried your code and it worked fine, except for a strange error I can't fix. Please test it on this file:
<section class="content page">
	<div id="page-title"><h1 lang="it">Cinema 10D</h1></div>
	<div class="container page-container">
		<div class=" row">
			<div class="col-md-6 col-sm-6 col-xs-12">
				<h4 lang="it">CHI NON HA VISTO UN FILM IN 10D? </h4>
				<p lang="it">Punti di forza del cinema 10D: bassi costi di gestione, incassi immediati, con la possibilita' di riscatto dell'investimento in pochi mesi con un target di clienti che va da 3 anni fino oltre 70 anni.</p>
			</div>
			<div class="col-md-6 col-sm-6 col-xs-12">
				<h4 lang="it">CINEMA 8D/10D SPETTACOLO VIAGGIANTE!</h4>
				<p lang="it">Personalizziamo camion e rimorchi per spettacoli viaggianti - ideale per gli operatori del carnevale. Costruito con i più alti standard di sicurezza, tutto compatibile con certificati di legge CEE. Contattaci per maggior info.
					Oggi anche in noleggio (8 o 12 posti).</p>
			</div>
			<div class="col-md-12" style="margin: 30px auto;">
				<h5 lang="it">EFFECI MOVIE dispone dei migliori film presenti sul mercato, direttamente dalla sede U.S.A., con prezzi decisamente imbattibili. Offriamo una ricca gamma di filmati 3D a vostra scelta con temi sempre diversi per intrattenere piccoli e grandi. Dai un'cocchiata al <a href="<?php echo base_url( 'peliculones' ) ?>">catalogo dei film</a></h5> 
			</div>
		</div>
		<div class="row oferta">
			<div class="col-md-6 col-sm-6 col-xs-12">
				<img src="<?php echo pics_url( '00_cine02.gif' ) ?>" alt="cine" />
				<p lang="it">I nostri ingegneri troveranno le migliori soluzioni in base alle vostre esigenze, progettando il vostro cinema tridimensionale.</p>
			</div>
			<div class="col-md-6 col-sm-6 col-xs-12">
				<h3 lang="it">OFFERTA SPECIALE! (low cost)</h3>
				<ul>
					<li lang="it">CINEMA 6D con pistoni pneumatici</li>
					<li lang="it">8 posti- Poltrone ecologiche in pelle</li>
					<li lang="it">Sound Gold system- 4 speakers</li>
					<li lang="it">DLP proiettori full HD</li>
					<li lang="it">3D schermo polarizzato su misura</li> 
					<li lang="it">100 occhiali 3D </li>
					<li lang="it">10 film 3D</li>
					<li lang="it">3D software alta definizione</li>
					<li lang="it">Compressore silenziato</li>
					<li lang="it">Effetti Speciali:	solletico alle gambe, getto d'aria, bolle di sapone, strobo e movimento delle sedie. </li>
					<li lang="it">Possibilità di aumentare le poltrone fino12/16 posti.</li>
					<li lang="it"><h5 lang="it">&euro; 17.900</h5></li>
				</ul>
			</div>
		</div>
		<div class="row">
			<div class="col-md-12 col-sm-12 col-xs-12" style="margin: 40px auto;">
				<h3 lang="it">TUTTI I NOSTRI CINEMA SONO COSTRUITI CON I PIU' ALTI STANDARD DI SICUREZZA
					INTERAMENTE CONFORME ALLE NORMATIVE CEE 20155</h3>
			</div>
		</div>
		<div class="row">
			<div class="row-same-height row-full-height">
				<div class="ol-xs-4 col-xs-height orange col-full-height">
					<ul>
						<h4 lang="it">CARATTERISTICHE</h4>
						<li lang="it">Certificazioni CE per ogni singolo pezzo.</li>
						<li lang="it">Ogni poltrona è di materiale ecologico garantito 5 anni</li>
						<li lang="it">Sound Gold system (4speakers)</li>
						<li lang="it">DLP proiettori full HD</li>
						<li lang="it">3D schermo Polarizzato</li>
						<li lang="it">occhiali 3D</li>
						<li lang="it">movie 3D in dotazione</li>
						<li lang="it">3D software alta definizione</li>
						<li lang="it">compressore silenziato HP</li>
					</ul>
				</div>
				<div class="col-xs-4 col-xs-height blue col-full-height">
					<h4 lang="it">EFFETTI STANDARD</h4>
					<ul>
						<li lang="it">Video 3D full HD</li>
						<li lang="it">Solletico alle gambe</li>
						<li lang="it">Soffio al collo</li>
						<li lang="it">Strobo</li>
						<li lang="it">Carosello di luci</li>
						<li lang="it">Getto d'acqua</li>
						<li lang="it">Getto d'aria </li>
						<li lang="it">Bolle di sapone</li>
						<li lang="it"><li lang="it">Laser</li>
						<li lang="it">Fumo</li> 
						<li lang="it">Effetto tornado</li>
					</ul>
				</div>
				<div class="col-xs-4 col-xs-height magenta col-full-height">
					<h4 lang="it">EFFETTI EXTRA</h4>
					<ul>
						<li lang="it">Profumo</li>
						<li lang="it">Neve</li> 
						<li lang="it">solletico alle mani</li>
						<li lang="it">tremolio</li> 
						<li lang="it">Effetto fuoco</li>
						<li lang="it">Dolby Sorround Sound 5.1.</li>
						<li lang="it">Contapersone</li>
						<li lang="it">Telecamera+TV </li>
					</ul>
				</div>
			</div>
			<div class="row">
				<div class="col-md-12 col-sm-12 col-xs-12" style="margin: 40px auto;">
					<h3 lang="it">Non hai un locale per installare il tuo cinema multidimensionale?</h3>
					<h3 lang="it">Noleggia il nostro box personalizzabile!</h3>
				</div>
				<div class="col-md-6 col-sm-6 col-xs-12">
					<img src="<?php echo pics_url( '00_cine_03.jpg' ) ?>" alt="cine" />
				</div>
				<div class="col-md-6 col-sm-6 col-xs-12">
					<p lang="it">Un opportunita vantaggiosa per guadagnare affittando e con possibilita di acquistare il cinema a rate.
						Minimo 3 mesi ad un massimo di 12, pagando mensilmente con rata anticipata ed un piccolo deposito che andra a scalare su prezzo di riscatto.</p>
					<p lang="it">Il Cinema tridimensionale e' compresivo di: Nr. 15 Film</p>
					<p lang="it">Effetti: aria, soffio sul collo, solletico alle gambe, bolle sapone, effetto tornado, strobo, laser, fumo, carosello luci, subwoofer + casse audio Special Gold</p>
				</div>
			</div>
		</div>
	</div>
</section>

I get opening-closing tag mismatch errors I can't understand.
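
One way to see exactly where the XML parser objects is to collect the libxml errors instead of letting them print as warnings. The sketch below uses a deliberately malformed, hypothetical fragment modelled on the `<li lang="it"><li lang="it">Laser</li>` line in the file above. Whether that line is the only cause is an assumption. The `&euro;` entity is another candidate, because it is not predefined in XML.

```php
<?php
// Hypothetical fragment: the first <li> is never closed.
$doc = '<wrap><ul><li lang="it"><li lang="it">Laser</li></ul></wrap>';

libxml_use_internal_errors(true);   // keep errors in a buffer
$obj = simplexml_load_string($doc);

$messages = [];
if ($obj === false) {
    foreach (libxml_get_errors() as $error) {
        // Each LibXMLError carries the line, column and message of one problem.
        $messages[] = sprintf(
            'line %d, column %d: %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }
    libxml_clear_errors();
}
print_r($messages);
```

Running the real document through the same loop should point at the first tag or entity that SimpleXML cannot accept.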
Another problem is that if I try to extend your code to process h1 and h3 elements too I get empty files:

		$htm = str_replace('<?', '&lt;?', $htm);
		$htm = str_replace('?>', '?&gt;', $htm);

		// SOME SIGNAL STRINGS
		$h1     = '<h1 lang="it">';
		$end_h1 = '</h1>';
		$h3     = '<h3 lang="it">';
		$end_h3 = '</h3>';
		$h4     = '<h4 lang="it">';
		$end_h4 = '</h4>';
		$ul     = '<ul>';
		$end_ul = '</ul>';
		$bl     = '<data>';
		$end_bl = '</data>';

		// BREAK THE HTML STRING INTO DATA UNITS ON THE H1 TAGS
		$arr = explode($h1, $htm);
		unset($arr[0]);
		foreach ($arr as $key => $sub)
		{
				$sub = $h1 . $sub;
				$poz = strpos($sub, $end_ul);
				$sub = substr($sub,0,$poz);
				$sub .= $end_ul;
				$arr[$key] = $bl . $sub . $end_bl;
		}

		// ACTIVATE THIS TO SEE THE ARRAY (USE "VIEW SOURCE")
		// var_dump($arr);

		// TIDY UP THE HTML STRING
		$htm = implode(NULL, $arr);
		// BREAK THE HTML STRING INTO DATA UNITS ON THE H3 TAGS
		$arr = explode($h3, $htm);
		unset($arr[0]);
		foreach ($arr as $key => $sub)
		{
				$sub = $h3 . $sub;
				$poz = strpos($sub, $end_ul);
				$sub = substr($sub,0,$poz);
				$sub .= $end_ul;
				$arr[$key] = $bl . $sub . $end_bl;
		}

		// ACTIVATE THIS TO SEE THE ARRAY (USE "VIEW SOURCE")
		// var_dump($arr);

		// TIDY UP THE HTML STRING
		$htm = implode(NULL, $arr);
		
		
		// BREAK THE HTML STRING INTO DATA UNITS ON THE H4 TAGS
		$arr = explode($h4, $htm);
		unset($arr[0]);
		foreach ($arr as $key => $sub)
		{
				$sub = $h4 . $sub;
				$poz = strpos($sub, $end_ul);
				$sub = substr($sub,0,$poz);
				$sub .= $end_ul;
				$arr[$key] = $bl . $sub . $end_bl;
		}

		// ACTIVATE THIS TO SEE THE ARRAY (USE "VIEW SOURCE")
		// var_dump($arr);

		// TIDY UP THE HTML STRING
		$htm = implode(NULL, $arr);
		$htm = preg_replace('/\s\s+/', ' ', $htm);

		// WRAP THE HTML STRING INTO AN XML DOCUMENT
		$doc = '<wrap>' . $htm . '</wrap>';

		// ACTIVATE THIS TO SHOW THE XML DOCUMENT
		// echo htmlentities($doc);

		// TRY TO MAKE AN OBJECT
		$obj = SimpleXML_Load_String($doc);

		// ACTIVATE THIS TO SEE THE OBJECT
		// var_dump($obj);

		// PROCESS THE OBJECT TO DISPLAY THE PARTS
		foreach ($obj->data as $element)
		{
//				echo PHP_EOL . $element->h4;
				$tokens[] = PHP_EOL . '"' . $element->h1 . '":';
				$tokens[] = PHP_EOL . '"",';
//				echo PHP_EOL . $element->h4;
				foreach($element->ul->li as $item)
				{
//						echo PHP_EOL . '   ' . $item;
					$tokens[] = PHP_EOL . '"' . $item . '":';
					$tokens[] = PHP_EOL . '"",';
				}
//				echo PHP_EOL;
		}
		file_put_contents( $newname, $tokens );

It's evident I don't understand the logic of your code: can you explain, please?
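
One reading of the slicing logic, sketched on a tiny hypothetical string: each pass keeps only the text from a start marker up to the first `</ul>` that follows it and throws everything else away. When passes are chained, a later pass sees only what the earlier pass kept, and if its marker is missing the result is an empty string. The helper name sliceUnits is made up for this illustration, and whether this is what empties the files depends on the actual pages.

```php
<?php
// Hypothetical helper mirroring the explode/strpos/substr steps of the script.
function sliceUnits(string $htm, string $start, string $end): string
{
    $arr = explode($start, $htm);
    unset($arr[0]);                      // text before the first marker is discarded
    foreach ($arr as $key => $sub) {
        $sub = $start . $sub;            // put the marker back
        $poz = strpos($sub, $end);       // first end marker after the start marker
        $sub = substr($sub, 0, (int) $poz) . $end;
        $arr[$key] = '<data>' . $sub . '</data>';
    }
    return implode('', $arr);
}

$htm = '<h1 lang="it">Title</h1><h4 lang="it">Sub</h4><ul><li>a</li></ul>';

// Single pass on h4: one unit, from the h4 up to and including the first </ul>.
$afterH4 = sliceUnits($htm, '<h4 lang="it">', '</ul>');
echo $afterH4, PHP_EOL;

// Chained passes: the h1 pass keeps one unit, but that unit contains no h3
// marker, so explode() returns a single piece, which is then discarded.
$afterH1 = sliceUnits($htm, '<h1 lang="it">', '</ul>');
$chained = sliceUnits($afterH1, '<h3 lang="it">', '</ul>');
var_dump($chained);
```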

Finally, I just don't understand what's wrong in this code:

	for ( $i = 0; $i < count( $files ); $i++ )
	{
		$fn = $files[ $i ];
		$fn = str_replace( '\\', '/', $fn );
		$parts = pathinfo( $fn );
		$fname = $parts[ 'basename' ];
		$dirname = $parts[ 'dirname' ];
		$el = explode( '.', $fname );
		$json_name = $el[ 0 ] . '.json';
		echo "<br>Processing file $fn<br>";
		$content = file_get_contents( $fn );
		$matches = array();
		$tokens = array();
		$dom = new DOMDocument;
		libxml_use_internal_errors(true);
		$dom->loadHTML( $content );
		libxml_clear_errors();
		$li = $dom->getElementsByTagName('li');
		foreach ( $li as $l )
		{
			$tokens[] = $l->nodeValue;
		}

		$json = array();
		foreach ( $tokens as $t )
		{
			$t = trim( $t );
			$json[] = '"' . $t . '"' . ":" . PHP_EOL;
			$json[] = '"",' . PHP_EOL;
		}
		$newname = $dirname . '/' . $json_name;
		file_put_contents( $newname, $json ); // write the formatted pairs, not the raw tokens
	}
	echo "Done!";

I get an extra element which contains the whole list, so I get
"item1  item2  item3":
"",
"item1":
"",
"item2":
"",
"item3":
"",
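
A likely explanation, assuming the pages nest a ul inside each outer li as the sample markup posted earlier does: getElementsByTagName('li') also returns the outer li, and its nodeValue is the concatenated text of everything inside it. The sketch below reproduces this on a shortened, hypothetical fragment and then uses an XPath expression to keep only the li elements that contain no other li.

```php
<?php
// Shortened, hypothetical version of the nested markup.
$html = '<ul><li class="parent"><h4>Albero</h4>'
      . '<ul><li>item1</li><li>item2</li><li>item3</li></ul>'
      . '</li></ul>';

libxml_use_internal_errors(true);   // buffer parser warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Every li in document order: the outer one comes first, and its nodeValue
// is the text of all of its descendants joined together.
$all = [];
foreach ($dom->getElementsByTagName('li') as $li) {
    $all[] = $li->nodeValue;
}

// Only li elements that have no li descendant.
$xpath = new DOMXPath($dom);
$leaves = [];
foreach ($xpath->query('//li[not(.//li)]') as $li) {
    $leaves[] = trim($li->nodeValue);
}

print_r($all);
print_r($leaves);
```

The first array has four entries, with the merged text first. The second has only the three inner items.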

Ray Paseur commented:
Just curious... Where does this HTML come from?  I see PHP scripts embedded in the HTML document, so I'm wondering if the information you want to capture is available in another form (perhaps a database or text / template file) without the markup.

Marco Gasi (Freelancer, Author) commented:
I wrote the HTML myself: it is a view in a CodeIgniter site with no database, even though there should be one. And yes, I know Laravel is great and I'm learning it, but it requires more time than I have just now :-)
I appreciate your problem-solving-oriented approach, but I would really like to understand the unexpected result of getElementsByTagName...

Ray Paseur commented:
See if this gets closer to what you need.  As with any parsing project, the devil is in the details, and the quality of a code solution is directly related to the quality of the test data and the corresponding desired output.
http://iconoun.com/demo/temp_marqusg.php
<?php // demo/temp_marqusg.php

/**
 * http://www.experts-exchange.com/questions/28692697/Fromatting-problem-saving-to-file-parsed-html-content.html
 *
 */
error_reporting(E_ALL);
echo '<pre>';

// SOME TEST DATA
$htm = <<<EOD
<section class="content page">
	<div id="page-title"><h1 lang="it">Cinema 10D</h1></div>
	<div class="container page-container">
		<div class=" row">
			<div class="col-md-6 col-sm-6 col-xs-12">
				<h4 lang="it">CHI NON HA VISTO UN FILM IN 10D? </h4>
				<p lang="it">Punti di forza del cinema 10D: bassi costi di gestione, incassi immediati, con la possibilita' di riscatto dell'investimento in pochi mesi con un target di clienti che va da 3 anni fino oltre 70 anni.</p>
			</div>
			<div class="col-md-6 col-sm-6 col-xs-12">
				<h4 lang="it">CINEMA 8D/10D SPETTACOLO VIAGGIANTE!</h4>
				<p lang="it">Personalizziamo camion e rimorchi per spettacoli viaggianti - ideale per gli operatori del carnevale. Costruito con i più alti standard di sicurezza, tutto compatibile con certificati di legge CEE. Contattaci per maggior info.
					Oggi anche in noleggio (8 o 12 posti).</p>
			</div>
			<div class="col-md-12" style="margin: 30px auto;">
				<h5 lang="it">EFFECI MOVIE dispone dei migliori film presenti sul mercato, direttamente dalla sede U.S.A., con prezzi decisamente imbattibili. Offriamo una ricca gamma di filmati 3D a vostra scelta con temi sempre diversi per intrattenere piccoli e grandi. Dai un'cocchiata al <a href="<?php echo base_url( 'peliculones' ) ?>">catalogo dei film</a></h5>
			</div>
		</div>
		<div class="row oferta">
			<div class="col-md-6 col-sm-6 col-xs-12">
				<img src="<?php echo pics_url( '00_cine02.gif' ) ?>" alt="cine" />
				<p lang="it">I nostri ingegneri troveranno le migliori soluzioni in base alle vostre esigenze, progettando il vostro cinema tridimensionale.</p>
			</div>
			<div class="col-md-6 col-sm-6 col-xs-12">
				<h3 lang="it">OFFERTA SPECIALE! (low cost)</h3>
				<ul>
					<li lang="it">CINEMA 6D con pistoni pneumatici</li>
					<li lang="it">8 posti- Poltrone ecologiche in pelle</li>
					<li lang="it">Sound Gold system- 4 speakers</li>
					<li lang="it">DLP proiettori full HD</li>
					<li lang="it">3D schermo polarizzato su misura</li>
					<li lang="it">100 occhiali 3D </li>
					<li lang="it">10 film 3D</li>
					<li lang="it">3D software alta definizione</li>
					<li lang="it">Compressore silenziato</li>
					<li lang="it">Effetti Speciali:	solletico alle gambe, getto d'aria, bolle di sapone, strobo e movimento delle sedie. </li>
					<li lang="it">Possibilità di aumentare le poltrone fino12/16 posti.</li>
					<li lang="it"><h5 lang="it">&euro; 17.900</h5></li>
				</ul>
			</div>
		</div>
		<div class="row">
			<div class="col-md-12 col-sm-12 col-xs-12" style="margin: 40px auto;">
				<h3 lang="it">TUTTI I NOSTRI CINEMA SONO COSTRUITI CON I PIU' ALTI STANDARD DI SICUREZZA
					INTERAMENTE CONFORME ALLE NORMATIVE CEE 20155</h3>
			</div>
		</div>
		<div class="row">
			<div class="row-same-height row-full-height">
				<div class="ol-xs-4 col-xs-height orange col-full-height">
					<ul>
						<h4 lang="it">CARATTERISTICHE</h4>
						<li lang="it">Certificazioni CE per ogni singolo pezzo.</li>
						<li lang="it">Ogni poltrona è di materiale ecologico garantito 5 anni</li>
						<li lang="it">Sound Gold system (4speakers)</li>
						<li lang="it">DLP proiettori full HD</li>
						<li lang="it">3D schermo Polarizzato</li>
						<li lang="it">occhiali 3D</li>
						<li lang="it">movie 3D in dotazione</li>
						<li lang="it">3D software alta definizione</li>
						<li lang="it">compressore silenziato HP</li>
					</ul>
				</div>
				<div class="col-xs-4 col-xs-height blue col-full-height">
					<h4 lang="it">EFFETTI STANDARD</h4>
					<ul>
						<li lang="it">Video 3D full HD</li>
						<li lang="it">Solletico alle gambe</li>
						<li lang="it">Soffio al collo</li>
						<li lang="it">Strobo</li>
						<li lang="it">Carosello di luci</li>
						<li lang="it">Getto d'acqua</li>
						<li lang="it">Getto d'aria </li>
						<li lang="it">Bolle di sapone</li>
						<li lang="it"><li lang="it">Laser</li>
						<li lang="it">Fumo</li>
						<li lang="it">Effetto tornado</li>
					</ul>
				</div>
				<div class="col-xs-4 col-xs-height magenta col-full-height">
					<h4 lang="it">EFFETTI EXTRA</h4>
					<ul>
						<li lang="it">Profumo</li>
						<li lang="it">Neve</li>
						<li lang="it">solletico alle mani</li>
						<li lang="it">tremolio</li>
						<li lang="it">Effetto fuoco</li>
						<li lang="it">Dolby Sorround Sound 5.1.</li>
						<li lang="it">Contapersone</li>
						<li lang="it">Telecamera+TV </li>
					</ul>
				</div>
			</div>
			<div class="row">
				<div class="col-md-12 col-sm-12 col-xs-12" style="margin: 40px auto;">
					<h3 lang="it">Non hai un locale per installare il tuo cinema multidimensionale?</h3>
					<h3 lang="it">Noleggia il nostro box personalizzabile!</h3>
				</div>
				<div class="col-md-6 col-sm-6 col-xs-12">
					<img src="<?php echo pics_url( '00_cine_03.jpg' ) ?>" alt="cine" />
				</div>
				<div class="col-md-6 col-sm-6 col-xs-12">
					<p lang="it">Un opportunita vantaggiosa per guadagnare affittando e con possibilita di acquistare il cinema a rate.
						Minimo 3 mesi ad un massimo di 12, pagando mensilmente con rata anticipata ed un piccolo deposito che andra a scalare su prezzo di riscatto.</p>
					<p lang="it">Il Cinema tridimensionale e' compresivo di: Nr. 15 Film</p>
					<p lang="it">Effetti: aria, soffio sul collo, solletico alle gambe, bolle sapone, effetto tornado, strobo, laser, fumo, carosello luci, subwoofer + casse audio Special Gold</p>
				</div>
			</div>
		</div>
	</div>
</section>
EOD;


// DEFINE A DELIMITER KNOWN TO BE ABSENT FROM THE DOCUMENT
$dlm = '|||';

// REMOVE UNWANTED PHP TAGS AND ANY OTHER UNDESIRABLE ARTIFACTS HERE
$htm = str_replace('<?', '&lt;?', $htm);
$htm = str_replace('?>', '?&gt;', $htm);

// MARK END-OF-LINE CHARACTER POSITIONS
$htm = str_replace(PHP_EOL, $dlm, $htm);

// A REGULAR EXPRESSION THAT WILL ISOLATE ANYTHING INSIDE LT/GT WICKETS
$rgx
= '#'           // REGEX DELIMITER
. '\<'          // ESCAPED LT
. '.*?'         // ANYTHING OR NOTHING
. '\>'          // ESCAPED GT
. '#'           // REGEX DELIMITER
. 's'           // REGEX FLAG: SINGLE LINE
;
// DISCARD THE HTML MARKUP, KEEPING ONLY THE TEXT
$htm = preg_replace($rgx, '', $htm);

// REMOVE EXCESS WHITESPACE
$htm = preg_replace('/\s\s+/', ' ', $htm);

// RETURN THE END-OF-LINE CHARACTERS
$htm = str_replace($dlm, PHP_EOL, $htm);
$htm = trim($htm);

// BREAK THE HTML STRING INTO LINES AND ELIMINATE THE BLANK LINES
$arr = explode(PHP_EOL, $htm);
foreach ($arr as $key => $sub)
{
    // COMPARE AGAINST THE EMPTY STRING SO A LINE CONTAINING ONLY "0" IS KEPT
    if (trim($sub) === '') unset($arr[$key]);
}
// RECONSTRUCT THE TEXT STRING
$htm = implode('', $arr);
var_dump($htm);


Marco Gasi (Freelancer, Author) Commented:
Unfortunately no, Ray.
The output I need, that is, the final JSON file, is something like this:

"Cinema 10D":
"",
"OFFERTA SPECIALE! (low cost)":
"",
"TUTTI I NOSTRI CINEMA SONO COSTRUITI CON I PIU' ALTI STANDARD DI SICUREZZA 	INTERAMENTE CONFORME ALLE NORMATIVE CEE 20155":
"",
"Non hai un locale per installare il tuo cinema multidimensionale?":
"",
"Noleggia il nostro box personalizzabile!":
"",
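
A listing like the one above is close to a JSON object whose keys are the source strings and whose values are empty. A minimal sketch of producing it with json_encode() instead of concatenating quotes by hand, assuming $tokens holds the extracted strings (the sample values here are only illustrative); any wrapper structure the translation plugin expects would still have to be added around it:

```php
<?php
// Sketch: build the translation skeleton with json_encode().
// $tokens is assumed to hold the strings extracted from the page.
$tokens = [
    'Cinema 10D',
    'OFFERTA SPECIALE! (low cost)',
    "  Noleggia il nostro box personalizzabile!\t",
];

$pairs = [];
foreach ($tokens as $token) {
    // Normalise whitespace so tabs and newlines do not end up inside the keys.
    $key = trim(preg_replace('/\s+/', ' ', $token));
    if ($key !== '') {
        $pairs[$key] = '';   // empty value, to be filled in with the translation
    }
}

// json_encode() takes care of quoting and escaping the keys.
$json = json_encode($pairs, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE);
echo $json, PHP_EOL;
```

Because the strings become array keys, duplicates collapse into a single entry, which is usually what a translation table wants.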


Anyway, it seems that at least part of the issue originates from the fact that I have nested lists, and within these lists I have h4 and h3 tags: if I use my original code, which uses the simple_html_dom.php script, without processing h3 and h4 specifically, I get a better result. But I still get this behavior:

If my list looks like the following one:
					<li class="parent">
						<a class="fancybox" href="<?php echo img_url( 'efectosespeciales/01.jpg' ) ?>">
							<img class="imgFLthumb" src="<?php echo img_url( 'efectosespeciales/01.jpg' ) ?>" />
						</a>
						<h4 lang="it">Macchina Spara Coriandoli funzionante con bombole c02</h4>
						<ul><li lang="it">Cod PTC01</li>
							<li lang="it">Incredibili! per eventi in stadio e palazzetti</li>
							<li lang="it">Due modelli medio e Grande, completi di cassa in alluminio</li>
							<li lang="it">richiudibile con maniglie</li></ul>
					</li>



I get this:
"Macchina Spara Coriandoli funzionante con bombole c02  						Cod PTC01  							Incredibili! per eventi in stadio e palazzetti  							Due modelli medio e Grande, completi di cassa in alluminio  							richiudibile con maniglie":
"",
"Cod PTC01":
"",
"Incredibili! per eventi in stadio e palazzetti":
"",
"Due modelli medio e Grande, completi di cassa in alluminio":
"",
"richiudibile con maniglie":
"",


That is, the first item grabs the whole content of the <li class='parent'>, nested list included, and only then does the parser descend into the nested list and return each of its items.
So the parser works as expected: it is my markup, even though it validates, that trips up the extraction.
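
One way to stop a parent item from swallowing its nested list is to collect only each element's own text nodes. A minimal sketch, assuming PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than simple_html_dom; ownText() is a hypothetical helper, and the markup is a shortened, ASCII-only version of the list above:

```php
<?php
// Sketch: keep only the text that belongs directly to each element,
// so a parent <li> does not absorb the text of its nested <ul>.
// ownText() is a hypothetical helper, not part of simple_html_dom.
function ownText(DOMNode $node): string
{
    $text = '';
    foreach ($node->childNodes as $child) {
        // Direct text nodes only; child elements such as <ul> or <h4> are skipped.
        if ($child->nodeType === XML_TEXT_NODE) {
            $text .= $child->nodeValue;
        }
    }
    // Collapse tabs and newlines into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$markup = <<<HTML
<ul>
<li class="parent">
    <h4 lang="it">Macchina Spara Coriandoli</h4>
    <ul><li lang="it">Cod PTC01</li>
        <li lang="it">richiudibile con maniglie</li></ul>
</li>
</ul>
HTML;

libxml_use_internal_errors(true);   // tolerate imperfect real-world markup
$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath  = new DOMXPath($doc);
$tokens = [];
// The union query returns the matching nodes in document order.
foreach ($xpath->query('//h3 | //h4 | //li') as $element) {
    $text = ownText($element);
    if ($text !== '') {
        $tokens[] = $text;   // the parent <li> has no text of its own and is dropped
    }
}
print_r($tokens);
```

With this approach the heading and each nested item come out as separate entries, and the parent item contributes nothing because all of its text lives in child elements. Real pages with accented characters would also need the input encoding declared to the parser, which this sketch leaves out.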
Thank you for your help.

Marco Gasi (Freelancer, Author) Commented:
Thank you, Ray: I always learn something from you.

Ray Paseur Commented:
Thanks, Marco.  Sorry I couldn't get an exact solution for you.  All the best, ~Ray
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.