Theo

How to clean this feed?

Hi,

At www.groenerekenkamer.nl/milieublogs I run a page with the feeds of several blogs. As you can see, both the 'GreenieWatch' feed and the 'FoodHealthSkeptic' feed contain leftover tags. Since the other feeds do not have this problem, I assume it is the feed's fault. I would like to contact the producer, but I doubt he would know a solution. Do you?
iliyas_patel

Use a proper, standard instruction, and make sure your previous data/records are cleared before you run it.
iliyas86

thanks
Please see the code snippet.  You will see things with strings like & l t ;  b r  & g t ;

These are "entitized" HTML tags, and that is usually the right way to put tags into an XML string like an RSS feed.  If you want to scrub this out of the feed, I will be glad to show you how.  Please post a link to the source of the data and I can show you a simple PHP script that will clean it up.  Hopefully you can integrate that into Drupal.
<div class="block block-aggregator" id="block-aggregator-feed-6">
          <h2 class="title">Food and Health Skeptic</h2>
        <div class="content"><div class="item-list"><ul><li class="first"><a href="http://john-ray.blogspot.com/2010/09/autism-drug-has-some-promise-this-is.html">&lt;br&gt;&lt;br /&gt;&lt;b&gt;Autism  drug has some</a>
</li>

<li><a href="http://john-ray.blogspot.com/2010/09/anti-mcdonalds-ad-angers-fast-food.html">&lt;br&gt;&lt;br /&gt;&lt;b&gt;Anti-McDonald&#039;s ad angers</a>
</li>
<li><a href="http://john-ray.blogspot.com/2010/09/study-finds-people-with-lots-of-friends.html">&lt;br&gt;&lt;br /&gt;&lt;b&gt;Study finds people with</a>
</li>
<li><a href="http://john-ray.blogspot.com/2010/09/wcrf-is-at-it-again-extra-inch-on-waist.html">&lt;br&gt;&lt;br /&gt;&lt;b&gt;The WCRF is at it again: </a>

</li>
<li class="last"><a href="http://john-ray.blogspot.com/2010/09/cancer-patients-from-wealthy-areas-of.html">&lt;br&gt;&lt;br /&gt;&lt;b&gt;Cancer patients from</a>
</li>
</ul></div><div class="more-link"><a href="/aggregator/sources/6" title="Het meest recente nieuws van deze feed bekijken.">more</a></div></div>
 </div>
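
To illustrate what "entitized" means in the snippet above, here is a minimal sketch in plain PHP that turns one of those link texts back into readable text. The function name clean_feed_text is made up for this example; it is not part of Drupal or of the feed, and it only assumes the text looks like the samples shown above.

```php
<?php
// clean_feed_text() is a hypothetical helper, for illustration only.
function clean_feed_text($text)
{
    // Turn &lt;br&gt; back into <br>, and &#039; back into an apostrophe.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Put a space in front of every tag so words on either side stay apart.
    $text = str_replace('<', ' <', $text);
    // Remove the tags themselves.
    $text = strip_tags($text);
    // Collapse runs of whitespace and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Sample taken from the block output above.
echo clean_feed_text('&lt;br&gt;&lt;br /&gt;&lt;b&gt;Anti-McDonald&#039;s ad angers'), PHP_EOL;
```

Decoding first and stripping second matters: strip_tags only recognizes real angle brackets, so it would leave the entitized form untouched.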


Theo

ASKER

Sounds promising Ray, thanks,
The datasource is: http://john-ray.blogspot.com/feeds/posts/default?alt=rss

Though I have no idea where to integrate that in Drupal.
Theo

ASKER

FYI: the page on my site that I pointed you to only shows blocks of the feed. If I go to the page of that feed (on my site), I see that the text is interspersed with 'No title provided'.
Interesting.  You may want to try processing the feed through this before using it in your site.  Give it a try and let's see if we have made any progress.
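<?php // RAY_temp_theorichel.php
error_reporting(E_ALL);

// TEST DATA - READ AND MAKE AN OBJECT
$url = 'http://john-ray.blogspot.com/feeds/posts/default?alt=rss';
$xml = file_get_contents($url);
if ($xml === FALSE) die("UNABLE TO READ $url");
$obj = simplexml_load_string($xml);
if ($obj === FALSE) die("UNABLE TO PARSE THE FEED AS XML");

// ITERATE OVER THE OBJECT TO CLEAN UP THE EMBEDDED HTML IN THE DESCRIPTION FIELDS
foreach ($obj->channel->item as $item)
{
    // CAST TO STRING - SIMPLEXML HAS ALREADY DECODED THE ENTITIES INTO REAL TAGS
    $desc = (string)$item->description;
    // REPLACE LINE BREAKS WITH SPACES SO THE WORDS DO NOT RUN TOGETHER
    $desc = str_replace('<br>',   ' ', $desc);
    $desc = str_replace('<br />', ' ', $desc);
    // REMOVE EVERY TAG EXCEPT <b>
    $desc = strip_tags($desc, '<b>');
    $desc = str_replace('</b>', '</b> ', $desc);
    // SIMPLEXML RE-ESCAPES THE REMAINING TAGS WHEN THE VALUE IS ASSIGNED
    $item->description = $desc;
}

// ACTIVATE THIS TO SEE THE NEW OBJECT
// var_dump($obj);

// PRODUCE CLEANED UP XML
echo $obj->asXML();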
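
If you want to try the same cleanup without fetching the live feed, here is a small offline sketch of the loop above. The sample XML and the function name clean_descriptions are invented for this test and are not taken from the real feed; the cleaning steps are the same ones used in the script.

```php
<?php
// Hypothetical one-item sample standing in for the live feed.
$sample = <<<'XML'
<?xml version="1.0"?>
<rss version="2.0"><channel><title>Sample</title>
<item><title>One</title><description>&lt;br&gt;&lt;b&gt;Bold&lt;/b&gt;text &lt;i&gt;here&lt;/i&gt;</description></item>
</channel></rss>
XML;

// clean_descriptions() is an illustrative name, not an existing API.
function clean_descriptions($xml)
{
    $obj = simplexml_load_string($xml);
    if ($obj === false) {
        return false; // The input was not well-formed XML.
    }
    foreach ($obj->channel->item as $item) {
        // SimpleXML hands back the decoded text, so real tags appear here.
        $desc = (string)$item->description;
        $desc = str_replace(array('<br>', '<br />'), ' ', $desc);
        $desc = strip_tags($desc, '<b>');
        $desc = str_replace('</b>', '</b> ', $desc);
        // Assigning the string makes SimpleXML escape it again on output.
        $item->description = $desc;
    }
    return $obj->asXML();
}

$cleaned = clean_descriptions($sample);
```

Echoing $cleaned should show that only the bold tags survive in the description, still in entitized form, which is what a feed reader expects inside an RSS element.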


You should be able to fix this by configuring Drupal's Input Format settings for the input format used by that node type. There is an option to either remove tags or entitize them. You simply need to switch it.
It looks like the current setting is "entitize"

;-)
ASKER CERTIFIED SOLUTION
Thomas4019
This solution is only available to members.
Theo

ASKER

Gentlemen, thank you both, but the feed of the Drupal core aggregator does not produce 'nodes', but something called a 'source'.
And as to the script: thanks very much, but I wouldn't know where to paste that. Does anyone else?
So are you building that page with just the core aggregator module? I have not used that module much but thought it was just for producing RSS feeds, etc. Are you using any contributed modules to make that page like Views, Services, etc?
Theo

ASKER

@Thomas: Yes indeed, the core module. It works all right, though it has problems with Atom feeds. I added a patch, but that didn't improve anything. The original URL was: http://john-ray.blogspot.com/feeds/posts/default but it only works now that I have added '?alt=rss', and then it shows the tags.

BTW: I just discovered that the aggregator does have a setting to strip tags (yes, I should have seen that before, my bad), but it has no effect.
Theo

ASKER

This made it clear that I would never be able to solve my problem with the present software. Switching to the Feeds module did.