Solved

Getting text from Wikipedia

Posted on 2009-07-13
Last Modified: 2012-05-07
I have a database of products with a column called "wikiurl", which corresponds to the product's Wikipedia URL. What is the best way to extract the intro paragraph from Wikipedia? For example, if "wikiurl" = "iPhone", I would want to get the first paragraph from the page: http://en.wikipedia.org/wiki/Iphone

I'm using PHP and CodeIgniter. What's the best way to scrape this info?
Question by:alex_wareing
Accepted Solution

by:
Roger Baklund earned 500 total points
ID: 24845065
The code below seems to do what you want: it fetches the first paragraph from the page. I am not sure whether it will work with all articles.

Some warnings are generated during parsing, which is why I used error_reporting() to suppress them.
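<?php
// Report all errors except warnings; the HTML parser emits warnings on real-world markup.
error_reporting(E_ALL ^ E_WARNING);

// Fetch and parse the article page.
$d = new DOMDocument();
$d->loadHTMLFile('http://en.wikipedia.org/wiki/Iphone');

// Print the text content of the first <p> element.
$paras = $d->getElementsByTagName('p');
echo $paras->item(0)->nodeValue;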
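
Since the answer is unsure whether the first <p> is always the intro, here is a minimal sketch of one possible safeguard: skip paragraphs that contain only whitespace and return the first one with actual text. The function name firstNonEmptyParagraph and the sample markup are hypothetical, not part of the original answer. It also uses libxml_use_internal_errors() rather than error_reporting() to keep parser warnings quiet, which is a swapped-in technique and not the approach described above.

```php
<?php
// Hypothetical helper (not from the original answer): returns the text of the
// first <p> element that contains non-whitespace text, or null if there is none.
function firstNonEmptyParagraph(string $html): ?string
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    foreach ($doc->getElementsByTagName('p') as $p) {
        $text = trim($p->textContent);
        if ($text !== '') {
            return $text;
        }
    }

    return null;
}

// Made-up sample markup standing in for a fetched article page.
$sample = '<html><body><p class="empty"> </p>'
        . '<p>The <b>iPhone</b> is a smartphone.</p></body></html>';

echo firstNonEmptyParagraph($sample), "\n"; // prints: The iPhone is a smartphone.
```

To use this with a live page, the HTML would be fetched separately (for example with file_get_contents()) and passed in as a string. Whether the first non-empty paragraph is really the intro still depends on the article's markup.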
