Regex to extract data in HTML table

Hi,

I have data in HTML Table tags and need to extract it in a structured fashion (maybe in DataTable).  
e.g.  
1)  <td align=center bgcolor=F0F7FF><font size=1 face=verdana>1940</font></td>
In the above statement, I need to get just the number '1040'.

2)  <td align="center" bgcolor="FEFCCF"><font face="verdana" size="1"><a href="http://azoogleads.com/az/getOfferDetail.php?OfferId=523\"><b>*HOT*</b>- Original Smileys!!</a></font></td>

In the above statement, I need to get just the string (not the link) 'Original Smileys' or maybe '*HOT*- Original Smileys!!'.

Could you please provide the proper RegEx to do that?  Please help. Thank you.

I am giving the HTML of the whole table for your reference. I need to get values from just first (Offer)  and third (Leads) columns and only those rows where there is an integer value in the third column (i.e. where third column is not null). But that is the next step.
-----------------
"<table border=1 cellspacing=1 cellpadding=0><tr bgcolor=00458A><th><font face=verdana size=1 color=white>Offer</font></th><th><font size=1 face=verdana color=white>11/30</font></th><th><font size=1 face=verdana color=white>Leads</font></th><th><font face=verdana size=1 color=white>Totals</font></th></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=523\"><b>*HOT*</b> - Original Smileys!!</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>1940</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$1261.00</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>62459</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=542\">Original Free Cursors - US Only</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>495</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$866.25</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>8749</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=662\">Free Screensavers - International</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>3449</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$2069.40</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>226999</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=919\">Free Smileys - UK ONLY!</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>208</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$416.00</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>4370</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=925\">Smileys - International Only</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>559</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$363.35</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>12306</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=1239\">Free Download - Fun Cards - US Only</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>817</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$1838.25</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>17446</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=1240\">Free Download - Fun Cards UK Only</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>32</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$96.00</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>943</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=1241\">Fun Cards - International</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>170</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$85.00</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>5692</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=1622\">Smiley - GEO US - ONLY</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>1035</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$2070.00</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>10441</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=1630\">Free Screensavers - GEO US ONLY</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>157</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$235.50</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>4651</font></td></tr><tr bgcolor=006A00><th><font size=1 color=CCFFE6 face=verdana>Daily Totals</font></th><td align=center><font size=1 color=CCFFE6 face=verdana>$0.00</font></td><td align=center colspan=2><font size=1 color=CCFFE6 face=verdana>Total: $0.00</font></th></tr><tr bgcolor=F07800><th><font size=1 color=FFFFDD face=verdana>Daily Traffic</th><td align=center><font size=1 color=CCFFE6 face=verdana>0</font></td><td align=center colspan=2><font size=1 color=FFFFDD face=verdana>Hits: 0</font></th></tr>\n<tr bgcolor=00458A><th>&nbsp;</th><th><font size=1 face=verdana color=white>11/30</font></th><td align=center colspan=2><font size=1 face=verdana color=white>Leads: 8862</font></td></table>"
--------------

enggAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Fernando SotoRetiredCommented:
Hi engg;

Try something like this out.

      StreamReader sr = new StreamReader(@"C:\Temp\data.html");
      string input = sr.ReadToEnd();
      sr.Close();
      ArrayList al = new ArrayList();
      string pattern = @"<td.*?(?:verdana>(\d+)|(Original.*?)</a>|<b>(.*?)</b>(.*?)</a>)+.*?</td>";
      Regex re = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
      MatchCollection mc;
      mc = re.Matches(input);
      foreach(Match m in mc)
      {
            if( m.Groups[1].Value != "") al.Add(m.Groups[1].Value);
            if( m.Groups[2].Value != "") al.Add(m.Groups[2].Value);
                        
            if( m.Groups[3].Value != "" && m.Groups[4].Value != "")
            {
                   al.Add(m.Groups[3].Value + m.Groups[4].Value);
            }
            else
            {
                  if( m.Groups[3].Value != "") al.Add(m.Groups[3].Value);
                  if( m.Groups[4].Value != "") al.Add(m.Groups[4].Value);
            }                        
      }
      string output = "";
      for( int idx = 0; idx < al.Count; idx++)
      {
            output += ((string) al[idx]) + "\n";
      }
      MessageBox.Show(output);


Fernando
0
wlfsCommented:
I doubt this can be done with a single regex. At least I wouldn't know how to :)
First, capture the contents of 1st and 3rd col of all rows, and then get rid of all the html tags in it.

            MatchCollection mc;
            Dictionary<string,int> hash = new Dictionary<string,int>();

            string html = "...";  // your html string here
            mc = Regex.Matches(html, @"<tr[^>]*><td[^>]*>(.*?)</td><td.+?/td><td[^>]*>(.*?)</td><td.+?/td></tr>");
            foreach (Match m in mc) {
                try {
                    hash[Regex.Replace(m.Groups[1].Value, @"<[^>]+>", "")] =
                        Convert.ToInt16(Regex.Replace(m.Groups[2].Value, @"<[^>]+>", ""));
                }
                catch (FormatException) {}
            }

            foreach (string s in hash.Keys) {
                Console.WriteLine(s + ": " + hash[s].ToString());
            }

prints out:

*HOT* - Original Smileys!!: 1940
Original Free Cursors - US Only: 495
Free Screensavers - International: 3449
Free Smileys - UK ONLY!: 208
Smileys - International Only: 559
Free Download - Fun Cards - US Only: 817
Free Download - Fun Cards UK Only: 32
Fun Cards - International: 170
Smiley - GEO US - ONLY: 1035
Free Screensavers - GEO US ONLY: 157

The catch (FormatException) catches the cases where the third column is not a number (the rows with "Click >>").
I used a Hashtable to store the data. Storing it in a DataTable should be straight forward. If you need further help on this, just ask.
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
wlfsCommented:
Sorry Fernando,
I was typing without reloading and didn't realize you had already posted.
0
Cloud Class® Course: Microsoft Azure 2017

Azure has a changed a lot since it was originally introduce by adding new services and features. Do you know everything you need to about Azure? This course will teach you about the Azure App Service, monitoring and application insights, DevOps, and Team Services.

enggAuthor Commented:
Thanks a lot Fernando, wlfs!
Sorry for taking long to get back with you.   wlfs's solution worked like magic.  I think a couple of changes in the Fernando's solution would make it work.
I was trying to understand these regexs. I know
.  any
+ one or more
* zero or more
? zero or one
and I know what [] and ^ mean.

Could you please tell me what these patterns mean? I spent a lot of time on them.
  .+?
  .*?
[^>]*

Fernando, have you used nested groups (parentheses)? what is 'original' in your regex?
Thank you very much.

0
Fernando SotoRetiredCommented:
Hi engg ;

To your question:
Could you please tell me what these patterns mean? I spent a lot of time on them?

  .+?   = . Means any character the + sign after it means one or more of them and the ? after that is called the lazy
               quantifier.
  .*?   = . Means any character the * after it means zero or more of them and the ? after that is called the lazy
               quantifier.
[^>]* = All characters in the [ and ] are in what is called a character class. What this means is that the next character
              to match can be any one of the characters between the buckets. When the ^ is the first character in the class
              this means that the next character to match must NOT be in the class. And the next symbol the * means to match
              zero or more of the characters in the class.

Now what does I mean by a lazy quantifier. The regex engine will parse the string one character at a time and mark all the captures it finds. So let me give an example here. You stated in your last post the following:

Sorry for taking long to get back with you.   wlfs's solution worked like magic.  I think a couple of changes in the Fernando's solution would make it work.

Lets say that I wanted to capture all the sentences in this statement and I used the following regex pattern, (.*\.\s?)+ to accomplish it. This pattern states to look for any character any number of times, .*, until you find a period, \., followed by a space zero or one time, \s?, Group that all together, ( ) and continue to find more, +. You would expect to find three sentences but in fact you only find the whole string and not three individual sentences. The reason why is the way that the regex engine works and that is to consume all the characters there are, which is the whole string in this case and then start looking for the period which in fact it has found at the last character position. Then it looks for zero or one spaces. Well there are no more characters to consume and so it has found a complete match. Because there are no more characters to check it returns the single match it found. But this is not what we wanted. By the way the * quantifier is known as greedy and this is why. Now lets try a different pattern, (.*?\.\s?)+? , which is basically the same pattern except for the two added ?, called the lazy quantifier. What these two new characters do in the pattern is once the first part of the pattern has consumed all of the characters and then goes to the next character the period it knows that there was a lazy quantifier before it so it marks the position of the current period and starts to back track looking to see if there were any periods before it. In this case it finds one before it and marks that one and continues looking for another one and it finds that and marks it and then continues again and this time does not find any more. At this point it jumps to the last period it had mark and marks it as a capture. Then it starts the whole process over looking for more and in this case finds two more.
0
enggAuthor Commented:
Thank you very much, Fernando.
There could not have been a better explanation.
I wish I could give 500 points to both of you. I will split them between you both.
Thanks.
0
Fernando SotoRetiredCommented:
Glad I was able to help. :=)
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Fonts Typography

From novice to tech pro — start learning today.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.