Regex to pull only links to web pages from a page

I have a class that pulls links out of a web page. I'm having trouble with the regex because it often pulls in dirty data. I want it to match only real web pages, not mailto links or links to .pdf, .doc or .ppt files.

Can anyone suggest the correct regex for the way I'm using it below?
using System;
using System.Data;
using System.Configuration;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Web.UI.HtmlControls;
using System.Collections.Generic;
using System.Text.RegularExpressions;



public struct LinkItem
{
    public string Href;
    public string Text;

    public override string ToString()
    {
        return Href + "\n\t" + Text;
    }
}



/// <summary>
/// Summary description for Class1
/// </summary>
public static class LinkFinder
{
    private const string _LINK_REGEX = "href=\"[a-zA-Z./:&\\d_-]+\"";

    public static List<LinkItem> Find(string file)
    {
        List<LinkItem> list = new List<LinkItem>();

        MatchCollection m1 = Regex.Matches(file, @"(?i)(<A.*?>.*?</A>)", RegexOptions.Singleline);
        //MatchCollection m1 = Regex.Matches(file, _LINK_REGEX, RegexOptions.Singleline);
        

        // 2.
        // Loop over each match.
        foreach (Match m in m1)
        {
            string value = m.Groups[1].Value;
            LinkItem i = new LinkItem();

            // 3.
            // Get href attribute.
            Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                RegexOptions.Singleline);
            if (m2.Success)
            {
                i.Href = m2.Groups[1].Value;
                
            }

            // 4.
            // Remove inner tags from text.
            string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                RegexOptions.Singleline);
            i.Text = t;

            list.Add(i);
        }
        return list;
    }
}

Asked by: supergirl2008

Terry Woods (IT Guru) commented:
Try this. Do you need to set an ignore case option too?
MatchCollection m1 = Regex.Matches(file, @"(?i)(<A[^>]*href\s*=\s*['""](?!mailto|[^'""]*\.(?:pdf|doc|ppt))[^>]*>.*?</A>)", RegexOptions.Singleline);

Terry Woods (IT Guru) commented:
Oh, and the double quotes inside the pattern need escaping: in a C# verbatim string (@"...") each one is written as "".
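
To make the quoting point concrete, here is a minimal sketch that wraps the pattern from the comment above in a small helper. The class and method names (AnchorFilterSketch, FindAnchors) are made up for illustration and are not part of the original code; the pattern itself is unchanged apart from the doubled quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class AnchorFilterSketch
{
    // Inside a verbatim string (@"...") a literal double quote is written as "".
    // The negative lookahead rejects hrefs that start with mailto or that
    // contain .pdf, .doc or .ppt.
    private const string Pattern =
        @"(?i)(<A[^>]*href\s*=\s*['""](?!mailto|[^'""]*\.(?:pdf|doc|ppt))[^>]*>.*?</A>)";

    // Returns the full text of every anchor element that passes the filter.
    public static List<string> FindAnchors(string html)
    {
        var anchors = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern, RegexOptions.Singleline))
        {
            anchors.Add(m.Groups[1].Value);
        }
        return anchors;
    }
}
```

Because the lookahead only checks for ".doc" followed by anything, it also rejects longer extensions that begin the same way, such as .docx.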
supergirl2008 (author) commented:
Also, what about full paths vs. relative paths found in href attributes?
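
One possible way to handle both kinds of path, sketched here as an assumption rather than something proposed in the thread, is to resolve every href against the URL of the page it came from using System.Uri. The class and method names (HrefResolverSketch, Resolve) are hypothetical. A side effect is that the URI scheme becomes available, so mailto links can be dropped without any regex.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class HrefResolverSketch
{
    // Resolves href against the URL of the page it was found on.
    // Returns null when the href cannot be parsed or is not http/https
    // (for example mailto: or javascript: links).
    public static Uri Resolve(Uri pageUrl, string href)
    {
        Uri resolved;
        if (!Uri.TryCreate(pageUrl, href, out resolved))
        {
            return null;
        }

        bool isWebLink = resolved.Scheme == Uri.UriSchemeHttp
                      || resolved.Scheme == Uri.UriSchemeHttps;
        return isWebLink ? resolved : null;
    }
}
```

With a page URL of http://example.com/dir/page.html, the relative href "../about.html" resolves to http://example.com/about.html, while an href that is already absolute is returned unchanged.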
supergirl2008 (author) commented:
I want to do this:
http://onaje.com/content/working-regular-expressions-href-url-extractor

But I also want to exclude any links to .doc, .pdf and .gif files.

What regex do I add to this guy's code to exclude the ones I don't want?
<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]
supergirl2008 (author) commented:
That regex should read:
<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]

Terry Woods (IT Guru) commented:

<\s*a\s+[^>]*href\s*=\s*[\"']?((?![^\"' >]*\.(?:doc|pdf|gif)['\" >])[^\"' >]+)[\"' >]
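
As an alternative to packing the exclusions into a lookahead, the sketch below keeps the simpler extractor pattern from the question and filters the captured href in plain C#. This is a different technique from the regex above, shown only as one possible arrangement; the names (HrefExtractorSketch, Extract, BlockedExtensions) and the exact extension list are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class HrefExtractorSketch
{
    // The extractor pattern from the question, written as a verbatim string
    // (literal double quotes are doubled). Group 1 captures the href value.
    private const string HrefPattern =
        @"<\s*a\s+[^>]*href\s*=\s*[""']?([^""' >]+)[""' >]";

    // Assumed block list; extend it as needed.
    private static readonly string[] BlockedExtensions = { ".doc", ".pdf", ".gif" };

    public static List<string> Extract(string html)
    {
        var hrefs = new List<string>();
        foreach (Match m in Regex.Matches(html, HrefPattern, RegexOptions.IgnoreCase))
        {
            string href = m.Groups[1].Value;
            if (!HasBlockedExtension(href))
            {
                hrefs.Add(href);
            }
        }
        return hrefs;
    }

    private static bool HasBlockedExtension(string href)
    {
        // Ignore any query string or fragment before checking the extension.
        int cut = href.IndexOfAny(new[] { '?', '#' });
        string path = cut >= 0 ? href.Substring(0, cut) : href;

        foreach (string ext in BlockedExtensions)
        {
            if (path.EndsWith(ext, StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }
        }
        return false;
    }
}
```

Keeping the filter outside the pattern makes the extension list easy to change, and a path such as /mydoc is not rejected just because it ends in the letters "doc".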
