Scrape Data From Web Page and Add to Variables in ColdFusion

This is a fairly general question. I'm looking for a way to scrape data from a web page, such as a few numbers or a few words, and store it in a variable with cfset. Does anyone have a reference or a starting point that would help me figure out how to do this? For example, ColdFusion heads over to the Fox News website and grabs the front page, and I can later pull data out of that page if it exists. For instance: Grab every 10 words after the word NASA on the Fox news homepage. Make sense?
Cole3388 Asked:

erikTsomik (System Architect, CF programmer) Commented:
You may need to use a regex with one of the replace functions (e.g. REReplace).
_agx_ Commented:
Assuming you're allowed to scrape these sites, you can use <cfhttp> to grab the HTML of a page (just like a browser does). Then do whatever you want with the result, i.e. cfhttp.fileContent. You can save it to a file, a variable, etc.

http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=Tags_g-h_09.html
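
A minimal sketch of the cfhttp approach, assuming the target site permits scraping. The URL and the variable names pageResponse and pageHtml are placeholders for illustration, not taken from the question.

```cfm
<!--- Fetch the page; the URL is a placeholder assumption --->
<cfhttp url="https://www.example.com/" method="get"
        result="pageResponse" timeout="30" throwOnError="false">

<!--- statusCode looks like "200 OK"; val() keeps only the leading number --->
<cfif val(pageResponse.statusCode) eq 200>
    <!--- fileContent holds the response body (the page HTML) as a string --->
    <cfset pageHtml = pageResponse.fileContent>
<cfelse>
    <!--- Fall back to an empty string so later parsing finds nothing --->
    <cfset pageHtml = "">
</cfif>

<cfoutput>Fetched #len(pageHtml)# characters (status: #pageResponse.statusCode#)</cfoutput>
```

With the result attribute set, the response lands in pageResponse instead of the default cfhttp variable, which avoids clobbering it if the page makes several requests.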

How you parse the HTML depends on the content. You could use a regex (or possibly XmlParse, if the markup is well-formed XML), but the exact expression will vary with the content. Keep in mind that screen scraping is tough. For one thing, some sites add tricks to prevent it. For another, you're dealing with changing content, so the perfect regex today may fail tomorrow when they redesign their page. RSS feeds are easier to consume because they're more standardized:
http://livedocs.adobe.com/coldfusion/8/htmldocs/Tags_f_01.html
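
If the site offers a feed, something along these lines may be simpler than parsing HTML. This is only a sketch using the <cffeed> tag (available since ColdFusion 8). The feed URL is a placeholder assumption, and feedItems and feedInfo are illustrative names.

```cfm
<!--- Read the feed into a query; the source URL is a placeholder assumption --->
<cffeed action="read" source="https://www.example.com/rss.xml"
        query="feedItems" properties="feedInfo">

<!--- List only the items whose title mentions NASA --->
<cfoutput query="feedItems">
    <cfif findNoCase("NASA", feedItems.title)>
        #htmlEditFormat(feedItems.title)# (#htmlEditFormat(feedItems.rsslink)#)<br>
    </cfif>
</cfoutput>
```

Because the feed arrives as a query with named columns such as title and rsslink, a site redesign is less likely to break this than a regex written against the page markup.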

_agx_ Commented:
> Grab every 10 words after the word NASA on the Fox news homepage

  Again, it depends on how the content is formatted. "NASA" could be in the middle of plain text,
  or wrapped in HTML tags, e.g. <span style="....">NASA</span>. One approach would be to remove all
  of the HTML tags first (see this UDF from cflib.org):
  http://www.cflib.org/udf.cfm?id=1598

  Then use a regex (or possibly adapt this function) to grab the next 10 words after "NASA":
  http://www.cflib.org/udf/FullLeft
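
  For illustration, here is a rough sketch of both steps using only built-in functions rather than the cflib UDFs. The sample string stands in for cfhttp.fileContent, all variable names are made up for the example, and the tag-stripping regex is deliberately crude (it will not cope with scripts, comments, or ">" inside attribute values). It handles only the first occurrence of "NASA".

```cfm
<!--- Sample input standing in for cfhttp.fileContent (an assumption) --->
<cfset pageHtml = '<p>Today <span class="x">NASA</span> announced a new mission to study the outer planets and their many moons.</p>'>

<!--- Step 1: crude tag removal, replacing anything tag-like with a space --->
<cfset plainText = reReplace(pageHtml, "<[^>]*>", " ", "all")>
<!--- Collapse runs of whitespace (including newlines) into single spaces --->
<cfset plainText = trim(reReplace(plainText, "\s+", " ", "all"))>

<cfset nextWords = "">
<cfif reFindNoCase("\bNASA\b", plainText) gt 0>
    <!--- Step 2: drop everything up to and including the first "NASA" --->
    <cfset tail = reReplaceNoCase(plainText, "^.*?\bNASA\b", "")>
    <!--- Treat the remainder as a space-delimited list and keep up to 10 items --->
    <cfset wordCount = min(10, listLen(tail, " "))>
    <cfloop from="1" to="#wordCount#" index="i">
        <cfset nextWords = listAppend(nextWords, listGetAt(tail, i, " "), " ")>
    </cfloop>
</cfif>

<!--- For the sample input: announced a new mission to study the outer planets and --->
<cfoutput>#nextWords#</cfoutput>
```

  To cover every occurrence, the same steps could be repeated on the remaining text after each match.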