Solved

Cfhttp array elements

Posted on 2005-03-21
Last Modified: 2013-12-24
I have an array full of URLs. The URLs are complete and each can be retrieved with cfhttp manually. Inside the loop, the first URL is retrieved by cfhttp but the connection fails at the next URL. There are several URLs, but only the first can be retrieved. Confused.
<cfset StripLinks = REReplaceNoCase(Mid(cfhttp.FileContent, Match.pos[1], Match.len[1]), "=[""]", "http://www.rspb.org.uk/vacancies/index.asp", "ALL")>
<cfset LinksArray = ListToArray(StripLinks)>
<cfloop from="1" to="#arrayLen(LinksArray)#" index="i">
    <cfhttp method="get" url="#LinksArray[i]#">
    <cfdump var="#cfhttp#">
</cfloop>
Question by:VHSB
25 Comments
 
LVL 5

Expert Comment

by:8riaN
ID: 13597011
While it's likely you have already done this, just to be sure...
have you ruled out a problem with the match, by having it just dump the URLs to the output instead of <cfhttp>-ing them inside the LinksArray loop?
 
LVL 17

Expert Comment

by:Tacobell777
ID: 13597015
After you do your REReplace, is the string still a list?
When you dump LinksArray, is it a valid array?
 

Author Comment

by:VHSB
ID: 13600157
8riaN, I have already checked that, see below thanks.

Tacobell, after I do my replace, the string is still a list.
However, when I take the cfhttp out and use cfdump var="#LinksArray[i]#", I get the URLs I expect, e.g.:
http://www.rspb.org.uk/vacancies/index.asp?id=3570305 
http://www.rspb.org.uk/vacancies/index.asp?id=3510305 
http://www.rspb.org.uk/vacancies/index.asp?id=3440305 
http://www.rspb.org.uk/vacancies/index.asp?id=3430305

But when I use cfdump var="#LinksArray[i]#" with the cfhttp code, I don't get the URLs I expect, e.g.:
http://www.rspb.org.uk/vacancies/index.asp?id=3570305 
;id=3570305

 

Author Comment

by:VHSB
ID: 13604499
The results I'm getting from cfdump var="#LinksArray[i]#" with the cfhttp code are pretty strange. Is this a problem with the loop?
 

Author Comment

by:VHSB
ID: 13605507
I've increased the points, guys. I've made no progress with this for 2 days and I've tried everything I can think of. Cheers
 
LVL 5

Expert Comment

by:8riaN
ID: 13614307
Something is definitely off here.  I suggest the following 2 ways to try to expose what is really happening.  It could still be a problem with the match that is hidden by the browser, so we need to rule that out before considering the messy alternatives.

If you haven't done this yet, do it, and in any case tell us what happens when you do.

1) Write a loop to output the list instead of using cfdump, so you can more easily check the HTML source (without all the extra markup <cfdump> adds), and examine the source to make sure there isn't extra stuff the browser isn't showing you.

2) Change the <cfhttp> to <cflocation> and see what URL you are browsing to, then maybe do one <cfhttp> and do LinksArray[2] as a <cflocation>.
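
Step 1 might look something like the sketch below. It is only an illustration: the sample LinksArray is a stand-in built from two URLs quoted earlier in the thread, with a trailing space deliberately added to the first one, and the bracket-and-length output format is an assumption rather than anything the thread prescribes.

```cfm
<!--- Stand-in data: two URLs from the thread; the first has a trailing
      space on purpose, to show how hidden characters become visible. --->
<cfset LinksArray = ListToArray("http://www.rspb.org.uk/vacancies/index.asp?id=3570305 ,http://www.rspb.org.uk/vacancies/index.asp?id=3510305")>

<!--- Print each element between square brackets together with its length.
      HTMLEditFormat escapes markup so View Source shows the raw value
      (newer engines offer EncodeForHTML for the same purpose). --->
<cfoutput>
<pre>
<cfloop from="1" to="#ArrayLen(LinksArray)#" index="i">[#HTMLEditFormat(LinksArray[i])#] (length #Len(LinksArray[i])#)
</cfloop>
</pre>
</cfoutput>
```

A stray space, line break or leftover fragment inside the brackets, or a length that does not match the visible URL, points at the match or the list delimiters rather than at cfhttp.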

Post the results.  And while you're at it, would you please post the Regular Expression you are using?  And maybe a little of the raw data (if it's not confidential) which it is running on?

Thanks, we'll get this,

8riaN
 

Author Comment

by:VHSB
ID: 13614568
Thanks for the reply 8riaN, I was beginning to get desperate:
1) I output the list without using cfhttp and I got the complete list of links. However, when I output the list with the cfhttp code included I got a strange result. Here is the output:
http://www.rspb.org.uk/vacancies/index.asp?id=3570305 ;id=3570305

That only displays the first link and an additional id?

2) I'm not familiar with <cflocation> as this is my first CF project; give me a few minutes to look up how to use it.

Here is the regex that parses the links from the index page, for rspb.org.uk:
[[:punct:]]+id=(.*?)[[:digit:]]+

The ReReplace is in my first post.

Thanks

Sam
 
LVL 5

Expert Comment

by:8riaN
ID: 13614583
Can I see the first few links' worth of raw data?
 

Author Comment

by:VHSB
ID: 13614585
Sorry I forgot to post the html I am parsing:

<tr class="alt"><td><a href="?id=3490305">Centre Assistant</a></td><td>East Scotland</td><td>1/4/2005</td></tr>
<tr><td><a href="?id=3480305">Centre Assistant</a></td><td>East Scotland</td><td>1/4/2005</td></tr>
<tr class="alt"><td><a href="?id=3400305">Visitor &amp; Publicity Officer</a></td><td>Essex</td><td>1/4/2005</td></tr>
<tr><td><a href="?id=3370305">Visitor &amp; Publicity Officer</a></td><td>Lancashire</td><td>31/3/2005</td></tr>
 
 

Author Comment

by:VHSB
ID: 13614901
I used <cflocation> instead of <cfhttp> and it gave me the web page of the first URL, but nothing else.

Regards

Sam
 
LVL 5

Expert Comment

by:8riaN
ID: 13615114
ok, I think you have to rework the whole regExp search.  First one quick note:

the symbol "?" means "match zero or one of the preceding expression"
"*" means zero or more,
so .*? reads like "zero or one of (zero or more of anything)"; ColdFusion MX and later actually treat *? as a non-greedy "zero or more" that stops at the shortest match,
which is easy to misread and is not what you want here
you need to escape the ? if you want the literal ?, like this: \?
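
To see the escaped question mark in action, here is a small sketch run against one row of the HTML posted earlier in this thread. The variable names sample and m are made up for the illustration.

```cfm
<!--- One table row copied from the raw HTML posted above --->
<cfset sample = '<tr><td><a href="?id=3480305">Centre Assistant</a></td><td>East Scotland</td></tr>'>

<!--- \? matches a literal question mark; ([[:digit:]]+) captures the id.
      With the fourth argument set to true, REFind returns a struct whose
      pos and len arrays describe the whole match [1] and the
      first subexpression [2]. --->
<cfset m = REFind('"\?id=([[:digit:]]+)"', sample, 1, true)>

<cfoutput>
<cfif m.pos[1] GT 0>
    id = #Mid(sample, m.pos[2], m.len[2])#
<cfelse>
    no match
</cfif>
</cfoutput>
```

For this sample the captured value should be 3480305. Because the surrounding quotes and ?id= are part of the pattern but outside the parentheses, only the digits end up in the subexpression.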

Instead of trying to do this all in one go, I suggest writing a loop.  What you want is to recognize a pattern which includes stuff you want and stuff you don't want, so just run through and build a list of what you do want; that is much less fragile and better suited to how RegExps work.
Since you have well-behaved data, I'd suggest taking advantage of it to reduce confusion, like so:

<cfset regExp='"\?id=([[:digit:]]+)"'>

then make sure to use only the subexpression (.pos[2], .len[2]) to build a list, as follows:
<cfset idList="">
<cfset findID=REFind(regExp,cfhttp.FileContent,1,True)>
<cfloop condition="findID.pos[1]">
   <cfset idList=ListAppend(idList,mid(cfhttp.FileContent,findID.pos[2],cfhttp.FileContent,findID.len[2]))>
   <cfset findID=REFind(regExp,cfhttp.FileContent,findID.pos[1]+findID.len[1],True)>
</cfloop>

then
<cfloop from="1" to="#ListLen(idList)#" index="i">
   <cfhttp method="get" url="http://www.rspb.org.uk/vacancies/index.asp?ID=#ListGetAt(idList,i)#">
   <cfdump var="#cfhttp#">
</cfloop>

It's a much cleaner, easier to debug method, even if it takes a few more lines of code.

Tell me if that works for you.
8riaN
 

Author Comment

by:VHSB
ID: 13615555
Much appreciated 8riaN.
I'm getting a problem with <cfset idList=ListAppend(idList,mid(cfhttp.FileContent,findID.pos[2],cfhttp.FileContent,findID.len[2]))> as the function takes three parameters? Any ideas?
 
LVL 5

Accepted Solution

by:
8riaN earned 2000 total points
ID: 13615710
well, ListAppend's 3rd param (the delimiter) is optional, so the likelier culprit is Mid: it takes exactly three arguments (string, start, count) and I passed cfhttp.FileContent twice. Let's make this easier to read/debug:

<cfset thisID=mid(cfhttp.FileContent,findID.pos[2],findID.len[2])>
<cfset idList=ListAppend(idList,thisID)>

If the second line doesn't work, replace it with:
<cfset idList=ListAppend(idList,thisID,",")>
but it's supposed to default to that anyway.
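
Put together, the approach in this thread could be sketched end to end roughly as follows. This is an illustration under assumptions, not a tested drop-in: pageContent is a name introduced here, the index URL is the one from the original question, and the code needs network access to that site to return anything.

```cfm
<!--- Fetch the index page and keep a copy of its body, because every
      later cfhttp call overwrites the cfhttp variable. --->
<cfhttp method="get" url="http://www.rspb.org.uk/vacancies/index.asp">
<cfset pageContent = cfhttp.FileContent>

<cfset regExp = '"\?id=([[:digit:]]+)"'>
<cfset idList = "">

<!--- Walk through the page, collecting the first subexpression of each match --->
<cfset findID = REFind(regExp, pageContent, 1, true)>
<cfloop condition="findID.pos[1] GT 0">
    <!--- Mid(string, start, count) takes exactly three arguments --->
    <cfset thisID = Mid(pageContent, findID.pos[2], findID.len[2])>
    <cfset idList = ListAppend(idList, thisID)>
    <!--- Resume the search just past the end of the current match --->
    <cfset findID = REFind(regExp, pageContent, findID.pos[1] + findID.len[1], true)>
</cfloop>

<!--- Request each vacancy page by id --->
<cfloop list="#idList#" index="thisID">
    <cfhttp method="get" url="http://www.rspb.org.uk/vacancies/index.asp?id=#thisID#">
    <cfdump var="#cfhttp#">
</cfloop>
```

Looping with the list attribute avoids having to index into idList, which is a comma-delimited string rather than an array.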

8riaN
 

Author Comment

by:VHSB
ID: 13615823
8riaN, it works a treat mate, thank you very much for your perseverance with this one.

Thanks again
 
LVL 5

Expert Comment

by:8riaN
ID: 13615836
Hooray!
 

Author Comment

by:VHSB
ID: 13616333
8riaN,
if I needed to add several other different index pages to parse, how flexible would the code be?
 
LVL 5

Expert Comment

by:8riaN
ID: 13616816
totally flexible, the question boils down to whether the RegExp works with the new page, i.e. will you always find the id in a link where href="?id=#####" . If so great, if not, you need a new RegExp to use for that file

say the link was href="/search/search.asp?searchID=#####"

you'd make a new RegExp like this:
regExp2='href="/search/search\.asp\?searchID=([[:digit:]]+)"'
remembering to escape things like . and ? and $ in the URL search string (e.g. search\.asp for search.asp)

then all the code is the same until you actually make the URL:
   <cfhttp method="get" url="http://www.rspb.org.uk/search/search.asp?searchID=#ListGetAt(idList,i)#">
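
One way to reuse the same loop across several index pages is to wrap it in a function that takes the page content and the pattern as arguments. The sketch below is hypothetical: extractIds, sampleA and sampleB are names invented for this illustration, and the id 12345 in sampleB is made-up sample data. It assumes every pattern passed in contains exactly one parenthesised subexpression and never matches an empty string; otherwise pos[2] would not exist or the loop would not advance.

```cfm
<!--- Hypothetical helper: returns a comma-delimited list of the first
      subexpression of every match of regExp in content. --->
<cffunction name="extractIds" returntype="string" output="false">
    <cfargument name="content" type="string" required="true">
    <cfargument name="regExp" type="string" required="true">
    <cfset var idList = "">
    <cfset var found = REFind(arguments.regExp, arguments.content, 1, true)>
    <cfloop condition="found.pos[1] GT 0">
        <cfset idList = ListAppend(idList, Mid(arguments.content, found.pos[2], found.len[2]))>
        <!--- Continue searching after the end of the current match --->
        <cfset found = REFind(arguments.regExp, arguments.content, found.pos[1] + found.len[1], true)>
    </cfloop>
    <cfreturn idList>
</cffunction>

<!--- Sample inputs: the first mirrors the vacancies page, the second the
      example search link discussed above. --->
<cfset sampleA = '<a href="?id=3490305">Centre Assistant</a> <a href="?id=3480305">Centre Assistant</a>'>
<cfset sampleB = '<a href="/search/search.asp?searchID=12345">Result</a>'>

<cfoutput>
#extractIds(sampleA, '"\?id=([[:digit:]]+)"')#<br>
#extractIds(sampleB, 'href="/search/search\.asp\?searchID=([[:digit:]]+)"')#
</cfoutput>
```

With this shape, adding another site means supplying a new pattern and a new URL prefix; the extraction loop itself stays untouched.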

QED
 

Author Comment

by:VHSB
ID: 13630430
8riaN, sorry to flog this to death but:

I've added a second website to scrape: http://www.website2/news/jobs.asp

The raw data for this page: <A HREF='job_brief.asp?ID=491'>

A complete job URL should look like this: http://www.website2/news/jobs_brief.asp?id=495

Therefore I used the same regex but changed the URL to:
http://www.english-nature.org.uk/news/job_brief.asp?ID=#Links2[i]#

Looks right? But wait for it... I only get the first id number. Where am I going wrong?

Regards
 
LVL 5

Expert Comment

by:8riaN
ID: 13634168
This RegExp:
"\?id=([[:digit:]]+)"
expects a url like this:
href="?id=###"
because it has the quotes.

You need to change it to reflect the fact that the ? is not immediately preceded by a ", like this, for example:
job_brief.asp\?id=([[:digit:]]+)

Take some time to read both the Regular Expressions, you should be able to generalize to your next task from that.  Just remember that the more specific you are, the less vulnerable to false matches you are.

Also I ought to mention that scraping web sites can run afoul of intellectual property rights, so know the policies of the sites you are scraping and, if you don't own the data, get permission if possible.

(I had to add that last bit once you actually used the word "scrape")

Cheers,
8riaN
 

Author Comment

by:VHSB
ID: 13636241
8rian,
Thanks for the advice. Three websites have been contacted and have agreed to let us perform the "scrapes", as long as we adhere to a couple of conditions.

With regards to the regex, I have tried job_brief.asp\?id=([[:digit:]]+) yet I still only get the first id.
 
LVL 5

Expert Comment

by:8riaN
ID: 13637674
Please post enough of the data that I can see at least two IDs, and the code that replaces this:
<cfset idList="">
<cfset findID=REFind(regExp,cfhttp.FileContent,1,True)>
<cfloop condition="findID.pos[1]">
   <cfset idList=ListAppend(idList,mid(cfhttp.FileContent,findID.pos[2],findID.len[2]))>
   <cfset findID=REFind(regExp,cfhttp.FileContent,findID.pos[1]+findID.len[1],True)>
</cfloop>

 

Author Comment

by:VHSB
ID: 13639169
8riaN, please ignore my last post, I've identified my problem. I had used REFindNoCase for <cfset findID=REFind(regExp,cfhttp.FileContent,1,True)> and REFind for <cfset findID=REFind(regExp,cfhttp.FileContent,findID.pos[1]+findID.len[1],True)>. Once I changed them both to REFind it worked, but I'm not quite sure why that would make the difference. Any ideas?

Thanks for your advice with this, i appreciate it.

S
 
LVL 5

Expert Comment

by:8riaN
ID: 13653318
That sure does seem backwards...

If it were reversed and you had the id in LCase and it worked when you switched to REFindNoCase, I would get it, but you're describing the opposite.

Hmm.  Is it possible that case is not consistent throughout the data?  In that case, they should both be NoCase.  You should eyeball the data to make sure you're not dropping anything, if you haven't already done it.  This explanation is inconsistent with your description of the symptoms, but eyeballing the data would catch other problems as well, so it should be done.  Are you getting the same number of links that exist in the original?

But if you know you're getting all the links, this goes in the "curious behaviours which I'll look out for next time" file because, as they say, if it ain't broke, don't fix it.

8riaN
 

Author Comment

by:VHSB
ID: 13653439
8riaN, you're right, it does seem backwards, or maybe it's me that's backwards :)
I did switch it to REFindNoCase and it worked, not the other way round.

Thanks again :)
 
LVL 5

Expert Comment

by:8riaN
ID: 13653568
That makes much more sense.

I noticed that I used id and the data used ID - my bad, sorry about that.

But it does mean that your data is likely well behaved.  I'd still count the links anyway, or at least check the first, last and a couple random spot checks if there are too many to count.
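
The case mismatch can be reproduced with a short sketch using the raw link format quoted earlier in the thread. The variable names sample, strict and relaxed are invented for this illustration, and the dot in the pattern is escaped here, which the pattern as originally posted did not do.

```cfm
<!--- Raw link as it appears on the second site: upper-case HREF and ID --->
<cfset sample = "<A HREF='job_brief.asp?ID=491'>">

<!--- Pattern written with a lower-case id --->
<cfset regExp = "job_brief\.asp\?id=([[:digit:]]+)">

<cfset strict = REFind(regExp, sample, 1, true)>
<cfset relaxed = REFindNoCase(regExp, sample, 1, true)>

<cfoutput>
REFind position: #strict.pos[1]#<br>
REFindNoCase position: #relaxed.pos[1]#<br>
<cfif relaxed.pos[1] GT 0>
    id = #Mid(sample, relaxed.pos[2], relaxed.len[2])#
</cfif>
</cfoutput>
```

REFind should report position 0 (no match) while REFindNoCase reports a non-zero position and captures 491. That also fits the symptom described above: if only the first call in the loop is case-insensitive, the first id is found and the next, case-sensitive search comes back empty, ending the loop after one id.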

8riaN
Question has a verified solution.
