• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 457
  • Last Modified:

Regex to extract data in HTML table

Hi,

I have data in HTML Table tags and need to extract it in a structured fashion (maybe in DataTable).  
e.g.  
1)  <td align=center bgcolor=F0F7FF><font size=1 face=verdana>1940</font></td>
In the above statement, I need to get just the number '1040'.

2)  <td align="center" bgcolor="FEFCCF"><font face="verdana" size="1"><a href="http://azoogleads.com/az/getOfferDetail.php?OfferId=523\"><b>*HOT*</b>- Original Smileys!!</a></font></td>

In the above statement, I need to get just the string (not the link) 'Original Smileys' or maybe '*HOT*- Original Smileys!!'.

Could you please provide the proper RegEx to do that?  Please help. Thank you.

I am giving the HTML of the whole table for your reference. I need to get values from just first (Offer)  and third (Leads) columns and only those rows where there is an integer value in the third column (i.e. where third column is not null). But that is the next step.
-----------------
"<table border=1 cellspacing=1 cellpadding=0><tr bgcolor=00458A><th><font face=verdana size=1 color=white>Offer</font></th><th><font size=1 face=verdana color=white>11/30</font></th><th><font size=1 face=verdana color=white>Leads</font></th><th><font face=verdana size=1 color=white>Totals</font></th></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=523\"><b>*HOT*</b> - Original Smileys!!</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>1940</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$1261.00</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>62459</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=542\">Original Free Cursors - US Only</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>495</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$866.25</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>8749</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=662\">Free Screensavers - International</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>3449</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$2069.40</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>226999</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=919\">Free Smileys - UK ONLY!</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>208</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$416.00</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>4370</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=925\">Smileys - International Only</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>559</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$363.35</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>12306</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=1239\">Free Download - Fun Cards - US Only</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>817</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$1838.25</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>17446</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=1240\">Free Download - Fun Cards UK Only</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>32</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$96.00</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>943</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=1241\">Fun Cards - International</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>170</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$85.00</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>5692</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=1622\">Smiley - GEO US - ONLY</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>1035</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$2070.00</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>10441</font></td></tr><tr><td align=center bgcolor=FEFCCF><font face=verdana size=1><a href=\"http://az.com/az/getOfferDetail.php?OfferId=1630\">Free Screensavers - GEO US ONLY</a></font></td><td align=center bgcolor=FEFCCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>157</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>$235.50</font></td></tr><tr><td align=center bgcolor=FEEFCF><font size=1 face=verdana>Clicks >></font></td><td align=center bgcolor=FEEFCF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=F0F7FF><font size=1 face=verdana>&nbsp;</font></td><td align=center bgcolor=FFF2F0><font size=1 face=verdana>4651</font></td></tr><tr bgcolor=006A00><th><font size=1 color=CCFFE6 face=verdana>Daily Totals</font></th><td align=center><font size=1 color=CCFFE6 face=verdana>$0.00</font></td><td align=center colspan=2><font size=1 color=CCFFE6 face=verdana>Total: $0.00</font></th></tr><tr bgcolor=F07800><th><font size=1 color=FFFFDD face=verdana>Daily Traffic</th><td align=center><font size=1 color=CCFFE6 face=verdana>0</font></td><td align=center colspan=2><font size=1 color=FFFFDD face=verdana>Hits: 0</font></th></tr>\n<tr bgcolor=00458A><th>&nbsp;</th><th><font size=1 face=verdana color=white>11/30</font></th><td align=center colspan=2><font size=1 face=verdana color=white>Leads: 8862</font></td></table>"
--------------

0
engg
Asked:
engg
  • 3
  • 2
  • 2
2 Solutions
 
Fernando SotoCommented:
Hi engg;

Try something like this out.

      StreamReader sr = new StreamReader(@"C:\Temp\data.html");
      string input = sr.ReadToEnd();
      sr.Close();
      ArrayList al = new ArrayList();
      string pattern = @"<td.*?(?:verdana>(\d+)|(Original.*?)</a>|<b>(.*?)</b>(.*?)</a>)+.*?</td>";
      Regex re = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
      MatchCollection mc;
      mc = re.Matches(input);
      foreach(Match m in mc)
      {
            if( m.Groups[1].Value != "") al.Add(m.Groups[1].Value);
            if( m.Groups[2].Value != "") al.Add(m.Groups[2].Value);
                        
            if( m.Groups[3].Value != "" && m.Groups[4].Value != "")
            {
                   al.Add(m.Groups[3].Value + m.Groups[4].Value);
            }
            else
            {
                  if( m.Groups[3].Value != "") al.Add(m.Groups[3].Value);
                  if( m.Groups[4].Value != "") al.Add(m.Groups[4].Value);
            }                        
      }
      string output = "";
      for( int idx = 0; idx < al.Count; idx++)
      {
            output += ((string) al[idx]) + "\n";
      }
      MessageBox.Show(output);


Fernando
0
 
wlfsCommented:
I doubt this can be done with a single regex. At least I wouldn't know how to :)
First, capture the contents of 1st and 3rd col of all rows, and then get rid of all the html tags in it.

            MatchCollection mc;
            Dictionary<string,int> hash = new Dictionary<string,int>();

            string html = "...";  // your html string here
            mc = Regex.Matches(html, @"<tr[^>]*><td[^>]*>(.*?)</td><td.+?/td><td[^>]*>(.*?)</td><td.+?/td></tr>");
            foreach (Match m in mc) {
                try {
                    hash[Regex.Replace(m.Groups[1].Value, @"<[^>]+>", "")] =
                        Convert.ToInt16(Regex.Replace(m.Groups[2].Value, @"<[^>]+>", ""));
                }
                catch (FormatException) {}
            }

            foreach (string s in hash.Keys) {
                Console.WriteLine(s + ": " + hash[s].ToString());
            }

prints out:

*HOT* - Original Smileys!!: 1940
Original Free Cursors - US Only: 495
Free Screensavers - International: 3449
Free Smileys - UK ONLY!: 208
Smileys - International Only: 559
Free Download - Fun Cards - US Only: 817
Free Download - Fun Cards UK Only: 32
Fun Cards - International: 170
Smiley - GEO US - ONLY: 1035
Free Screensavers - GEO US ONLY: 157

The catch (FormatException) catches the cases where the third column is not a number (the rows with "Click >>").
I used a Hashtable to store the data. Storing it in a DataTable should be straight forward. If you need further help on this, just ask.
0
 
wlfsCommented:
Sorry Fernando,
I was typing without reloading and didn't realize you had already posted.
0
Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
enggAuthor Commented:
Thanks a lot Fernando, wlfs!
Sorry for taking long to get back with you.   wlfs's solution worked like magic.  I think a couple of changes in the Fernando's solution would make it work.
I was trying to understand these regexs. I know
.  any
+ one or more
* zero or more
? zero or one
and I know what [] and ^ mean.

Could you please tell me what these patterns mean? I spent a lot of time on them.
  .+?
  .*?
[^>]*

Fernando, have you used nested groups (parentheses)? what is 'original' in your regex?
Thank you very much.

0
 
Fernando SotoCommented:
Hi engg ;

To your question:
Could you please tell me what these patterns mean? I spent a lot of time on them?

  .+?   = . Means any character the + sign after it means one or more of them and the ? after that is called the lazy
               quantifier.
  .*?   = . Means any character the * after it means zero or more of them and the ? after that is called the lazy
               quantifier.
[^>]* = All characters in the [ and ] are in what is called a character class. What this means is that the next character
              to match can be any one of the characters between the buckets. When the ^ is the first character in the class
              this means that the next character to match must NOT be in the class. And the next symbol the * means to match
              zero or more of the characters in the class.

Now what does I mean by a lazy quantifier. The regex engine will parse the string one character at a time and mark all the captures it finds. So let me give an example here. You stated in your last post the following:

Sorry for taking long to get back with you.   wlfs's solution worked like magic.  I think a couple of changes in the Fernando's solution would make it work.

Lets say that I wanted to capture all the sentences in this statement and I used the following regex pattern, (.*\.\s?)+ to accomplish it. This pattern states to look for any character any number of times, .*, until you find a period, \., followed by a space zero or one time, \s?, Group that all together, ( ) and continue to find more, +. You would expect to find three sentences but in fact you only find the whole string and not three individual sentences. The reason why is the way that the regex engine works and that is to consume all the characters there are, which is the whole string in this case and then start looking for the period which in fact it has found at the last character position. Then it looks for zero or one spaces. Well there are no more characters to consume and so it has found a complete match. Because there are no more characters to check it returns the single match it found. But this is not what we wanted. By the way the * quantifier is known as greedy and this is why. Now lets try a different pattern, (.*?\.\s?)+? , which is basically the same pattern except for the two added ?, called the lazy quantifier. What these two new characters do in the pattern is once the first part of the pattern has consumed all of the characters and then goes to the next character the period it knows that there was a lazy quantifier before it so it marks the position of the current period and starts to back track looking to see if there were any periods before it. In this case it finds one before it and marks that one and continues looking for another one and it finds that and marks it and then continues again and this time does not find any more. At this point it jumps to the last period it had mark and marks it as a capture. Then it starts the whole process over looking for more and in this case finds two more.
0
 
enggAuthor Commented:
Thank you very much, Fernando.
There could not have been a better explanation.
I wish I could give 500 points to both of you. I will split them between you both.
Thanks.
0
 
Fernando SotoCommented:
Glad I was able to help. :=)
0

Featured Post

Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

  • 3
  • 2
  • 2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now