I am writing a program that extracts RSS links from a given HTML file.
I have two possible solutions for doing this:
First solution: parse the HTML file, send an HTTP request for each link it contains to fetch the target, and then use the open source "Informa" RSS library to determine whether it is a valid RSS file.
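Both solutions start by collecting the links from the HTML file. Here is a minimal sketch of that step, assuming the file has already been read into a string. The class name `LinkExtractor` and the method `extractLinks` are hypothetical, and the regular expression is a simplification that only handles quoted `href` attributes; a real HTML parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the values of quoted href attributes.
class LinkExtractor {
    // Matches href="..." or href='...' regardless of letter case.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns every href value found in the given HTML text, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }
}
```

In the first solution, every link returned here would then be fetched and checked for being a valid RSS file.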
Second solution: for each link in the HTML file, check whether it has an RSS file extension; if so, mark it as a potential RSS file. If it does not, check whether the link has the form "www.xxxx.com/feed/", where the last directory of the URL is named "feed"; if so, mark it as a potential RSS file as well. For each potential RSS file, send an HTTP request to obtain the file, and then use the "Informa" library to determine whether it is a valid RSS file.
The second solution should be a lot faster, since it does not send an HTTP request for every link. However, RSS file extensions vary greatly, from xml to html to aspx. From the look of it, almost every link would then count as a potential RSS file, since most pages that are not RSS feeds also have an html file extension.
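To make the filtering step of the second solution concrete, here is a rough sketch of what I have in mind. The class name `FeedLinkFilter`, the method `isPotentialFeed`, and the extension list (`rss`, `xml`, `rdf`) are only assumptions for illustration; that fixed list is exactly the part I am unsure about.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

// Hypothetical pre-filter: decides whether a link is worth fetching at all.
class FeedLinkFilter {
    // Assumed, incomplete list of extensions; real feeds also use html, aspx, etc.
    private static final Set<String> FEED_EXTENSIONS = Set.of("rss", "xml", "rdf");

    static boolean isPotentialFeed(String link) {
        String path;
        try {
            path = new URI(link).getPath();
        } catch (URISyntaxException e) {
            return false; // not a parsable link
        }
        if (path == null || path.isEmpty()) {
            return false; // e.g. mailto: links or a bare host name with a scheme
        }
        path = path.toLowerCase(Locale.ROOT);
        // Drop trailing slashes so ".../feed/" and ".../feed" are treated alike.
        while (path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        // Rule 1: the last directory of the URL is named "feed".
        if (lastSegment.equals("feed")) {
            return true;
        }
        // Rule 2: the file extension is in the assumed list.
        int dot = lastSegment.lastIndexOf('.');
        return dot >= 0 && FEED_EXTENSIONS.contains(lastSegment.substring(dot + 1));
    }
}
```

Only links for which `isPotentialFeed` returns true would be fetched and checked for validity. Adding html or aspx to the list would make nearly every link pass, which defeats the purpose of the filter.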
My problem now is: with the second solution, how do I check whether a link has an RSS file extension, given the huge variety of file extensions used for RSS files?
Hopefully my question is clear.