Solved

An Intelligent Script

Posted on 2003-12-05
Last Modified: 2008-03-06
Hi Experts!

I'd like to create a PHP script that extracts relevant information from web pages, and I need your help.

The script captures the HTML code of several similar pages from a site (for example, all of its article pages). I define which parts of the body should be extracted (e.g. the article titles), and I'd like to create an algorithm that automatically derives a regular expression for those parts.

I need your suggestions.
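
To make the idea concrete, here is a naive sketch of one way such an algorithm could start: take one sample page and a value known to appear on it, then build a pattern from the literal markup around that value. The function name `buildPatternFromSample`, the sample pages, and the fixed context length are all assumptions made for illustration. It only works when the surrounding markup is identical across pages.

```php
<?php
// Hypothetical helper: derive a regex from one sample page and a value
// known to appear on it, using $context characters of markup on each side.
function buildPatternFromSample(string $html, string $knownValue, int $context = 20): ?string
{
    $pos = strpos($html, $knownValue);
    if ($pos === false) {
        return null; // the sample value does not occur on the page
    }
    // Literal text immediately before and after the known value.
    $before = substr($html, max(0, $pos - $context), min($context, $pos));
    $after  = substr($html, $pos + strlen($knownValue), $context);

    // Quote the context so it matches literally; capture what lies between.
    return '/' . preg_quote($before, '/') . '(.*?)' . preg_quote($after, '/') . '/s';
}

// Two made-up pages that share the same layout.
$page1 = '<div id="main"><h2 class="title">First article</h2><div class="body">Alpha</div></div>';
$page2 = '<div id="main"><h2 class="title">Second article</h2><div class="body">Beta</div></div>';

$pattern = buildPatternFromSample($page1, 'First article', 18);
if ($pattern !== null && preg_match($pattern, $page2, $match)) {
    echo $match[1], "\n";
}
```

A pattern built this way breaks as soon as the markup around the value differs between pages, so it is a starting point for discussion and not a robust extractor.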
Question by:ttiero

Expert Comment

by:aolXFT
Unless you show us a sample of the body you want to extract information from, we can't really help.

From your question, though, I'd consider using the XML functions. You might have to run your document through tidy first to make it XHTML (and therefore XML) compliant. You can install tidy as a PHP/PECL extension ( http://pecl.php.net/package/tidy ).

Then use the XML functions to get what you want.
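
As a rough sketch of that last step, assuming the page is already well-formed XHTML and that the titles sit in `<h2 class="title">` elements (both the element and the class name are made up here, since no sample was posted), PHP's DOM and XPath classes could be used like this. The helper name `extractTitles` is hypothetical.

```php
<?php
// Hypothetical helper: return the text of every <h2 class="title"> element
// in a well-formed XHTML string. Element name and class are assumptions.
function extractTitles(string $xhtml): array
{
    $doc = new DOMDocument();
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadXML($xhtml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return []; // not well-formed XML
    }

    $xpath = new DOMXPath($doc);
    // local-name() matches the element whether or not the document
    // declares the XHTML namespace.
    $nodes = $xpath->query('//*[local-name()="h2"][@class="title"]');

    $titles = [];
    foreach ($nodes as $node) {
        $titles[] = trim($node->textContent);
    }
    return $titles;
}

$sample = '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        . '<h2 class="title">First article</h2><p>Text</p>'
        . '<h2 class="title"> Second article </h2>'
        . '</body></html>';

print_r(extractTitles($sample));
```

The XPath expression is the only part that would need to change once the real page structure is known.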

Author Comment

by:ttiero
Are there PHP classes that transform HTML to XML?

Accepted Solution

by:aolXFT (earned 200 total points)
There is a PHP extension for Tidy. You can check out Tidy at http://www.w3.org/People/Raggett/tidy/. John Coggeshall wrote a PHP extension for libTidy, or tidyLib, or whatever it is, and submitted it to PECL. Check out the above URL, or http://www.coggeshall.org/tidy.php

The command-line tidy client accepts the -asxml switch, so I'm sure you can do the same with the PHP extension.
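
For what it's worth, a minimal sketch of how that might look with the tidy extension's `tidy_repair_string()` function, assuming the extension is installed. The `output-xhtml` option is, as far as I know, the configuration counterpart of the command-line switch. The wrapper name `htmlToXhtml` and the sample markup are made up for illustration.

```php
<?php
// Sketch: convert messy HTML to XHTML with the tidy extension, if loaded.
// Returns null when the extension is missing or the repair fails.
function htmlToXhtml(string $html): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not installed
    }
    $config = [
        'output-xhtml'     => true, // ask tidy for well-formed XHTML
        'numeric-entities' => true, // e.g. &nbsp; becomes &#160;, which XML parsers accept
        'wrap'             => 0,    // do not re-wrap long lines
    ];
    $result = tidy_repair_string($html, $config, 'utf8');
    return $result === false ? null : $result;
}

// Made-up input with an unquoted attribute and unclosed elements.
$xhtml = htmlToXhtml('<h2 class=title>First article</h2><p>Unclosed paragraph');
echo $xhtml ?? "tidy extension is not available\n";
```

The resulting string can then be handed to whichever XML functions you prefer.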

That may, however, be overkill depending on how complex your document is. Basically, I need a sample.