Scrape Data From Web Page and Add to Variables in ColdFusion

This is a fairly general question. I'm looking for a way to scrape data from a web page, such as a few numbers or a few words, and store it in a cfset variable. Does anyone have a reference or a starting point for figuring out how to do this? For example, ColdFusion requests the Fox News website and grabs the front page, and I can then pull data out of that page if it exists, e.g. grab the 10 words that follow each occurrence of the word NASA on the Fox News homepage. Does that make sense?
Assuming you're allowed to scrape these sites, you can use <cfhttp> to grab the HTML of a page (just like a browser does). Then do whatever you want with the result, i.e. cfhttp.fileContent: save it to a file, a variable, etc.
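
As a rough sketch of that fetch step (the URL and the variable names pageResponse and pageHtml are placeholders, not anything from the question): the result attribute puts the response into a named struct instead of the default cfhttp scope, and the status check simply leaves the variable empty if the request did not succeed.

```cfml
<!--- Fetch the page. The URL is a placeholder; use one you are allowed to scrape. --->
<cfhttp url="https://www.example.com/" method="get" result="pageResponse" timeout="15">

<!--- statusCode looks like "200 OK", so compare only the first three characters --->
<cfif left(pageResponse.statusCode, 3) eq "200">
    <cfset pageHtml = pageResponse.fileContent>
<cfelse>
    <!--- Request failed or was refused: keep an empty string so later code can test for it --->
    <cfset pageHtml = "">
</cfif>
```

From here pageHtml is an ordinary string variable, so it can be parsed, stored, or written out with <cffile> like any other value.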

How you parse the HTML will depend on the page. You could use a regex (or possibly XmlParse, if the markup is well-formed), but the exact expression will obviously vary based on the content. Just keep in mind that screen scraping is tough. For one thing, some sites add tricks to prevent it. For another, you're dealing with changing content, so a regex that is perfect today may fail tomorrow when they redesign their page. RSS feeds are easier to consume because they're more standardized.

erikTsomik (System Architect, CF programmer) commented:
You may need to use the Replace function to create a regex.
> Grab every 10 words after the word NASA on the Fox news homepage

  Again it would depend on how the content is formatted.  "NASA" could be in the middle of plain text,
  or in the middle of html tags <span style="....">NASA</span>. One approach would be to remove all
  of the html tags first  (see udf from

  Then use a regex (or possibly adapt this function) to grab the next 10 words after "NASA"
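
A minimal sketch of that two-step approach, assuming a hypothetical helper named wordsAfter (it is not the UDF referenced above): it strips tags with a simple regex, collapses whitespace, and collects up to 10 words after each occurrence of the keyword. The tag-stripping pattern is deliberately naive, so the contents of script and style blocks would survive as text, and the keyword match ignores surrounding punctuation but would not match forms like "NASA's".

```cfml
<cfscript>
// Hypothetical helper: returns an array of strings, one per occurrence of
// keyword, each holding up to wordCount words that follow it.
function wordsAfter(required string html, required string keyword, numeric wordCount = 10) {
    // Replace every tag with a space so adjacent words do not run together
    var text = reReplace(arguments.html, "<[^>]*>", " ", "all");
    // Collapse runs of whitespace into single spaces
    text = trim(reReplace(text, "\s+", " ", "all"));

    var words = listToArray(text, " ");
    var results = [];

    for (var i = 1; i <= arrayLen(words); i++) {
        // Compare with punctuation removed; eq is case-insensitive for strings
        if (reReplace(words[i], "[^A-Za-z0-9]", "", "all") eq arguments.keyword) {
            var lastIndex = min(i + arguments.wordCount, arrayLen(words));
            var following = [];
            for (var j = i + 1; j <= lastIndex; j++) {
                arrayAppend(following, words[j]);
            }
            // Skip a keyword that is the very last word on the page
            if (arrayLen(following) gt 0) {
                arrayAppend(results, arrayToList(following, " "));
            }
        }
    }
    return results;
}
</cfscript>

<!--- Usage with a made-up snippet of HTML --->
<cfset sample = '<p>Today <span style="color:red">NASA</span> announced a new mission to study the outer planets in detail next year.</p>'>
<cfset matches = wordsAfter(sample, "NASA", 10)>
<cfset firstMatch = "">
<cfif arrayLen(matches) gt 0>
    <cfset firstMatch = matches[1]>
</cfif>
```

With the sample above, firstMatch holds "announced a new mission to study the outer planets in". Against a real page, pass the fetched HTML in place of sample and expect to adjust the patterns to that site's markup.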
Question has a verified solution.