Solved

Regex.Match question C#

Posted on 2009-07-03
12
518 Views
Last Modified: 2012-05-07
Hello,

I have a mission to scrape comments from this website:
http://www.youtube.com/watch_ajax?v=LfLFMf4chBc&action_get_comments=1&p=0&commentthreshold=-5&commentfilter=0&page_size=10

The problem is, comments make me trouble:
What I have is:
if (!String.IsNullOrEmpty(input))
                {
                    MatchCollection collection = Regex.Matches(input, "<div class=\"watch-comment-body\"><div >([^']*?)</div>");
                    if (collection.Count > 1)
                    {
                        for (int i = 0; i < collection.Count; )
                        {
                            currentComment = (Regex.Match(collection[i].Value, "<div >([^']*?)</div>").Groups[1].Value);                          
                            if (currentComment != "Show" || currentComment != "Hide")
                            {
                                cache = cache + "#" + currentComment;
                                counter();
                            }
                            i++;
                        }
//...process data
}

But Regex.Match will always return empty match, since the code of website is:
                <div class="watch-comment-body">
                    <div >

                        the comment text I want to get is here
                    </div>

Its divided in lines, and that is my problem. Any ideas on how I could fix this issue?
0
Comment
Question by:GVNPublic123
  • 7
  • 5
12 Comments
 
LVL 18

Accepted Solution

by:
Gary Davis earned 500 total points
ID: 24774087
There is white space before the 2nd div but your 1st pattern does not account for it so it fails the match. Remove the <div>:
 Regex.Matches(input, "<div class=\"watch-comment-body\">([^']*?)</div>");
 
Gary Davis
0
 

Author Comment

by:GVNPublic123
ID: 24774551
It didnt work. I think its still wrong according to the syntax on website and how Regex operates.
0
 
LVL 18

Expert Comment

by:Gary Davis
ID: 24774894
I often use a tool called Expresso to help debug and test out regular expressions. It is free and at http://www.ultrapico.com/.
I ran this reg exp:
"<div class=\"watch-comment-body\">([^']*?)</div>
 against your sample data and it matched, setting the found group to the data from the 2nd <div> upto but not including the 1st (and only) </div>. Maybe not what you wanted but it did match and being on multiple lines did not matter.
Changing the regex to this:
<div class=\"watch-comment-body\">\s*<div>\s*([^']*?)\s*</div>
Will get just the string "the comment text I want to get is here" and maybe that's what you want.
One thing to point out. Your 2nd <div> in your example has a space after the v which will cause a problem. Code for it if you need to.
Gary
 
0
Free Tool: SSL Checker

Scans your site and returns information about your SSL implementation and certificate. Helpful for debugging and validating your SSL configuration.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

 

Author Comment

by:GVNPublic123
ID: 24776480
<div class=\"watch-comment-body\">([^']*?)</div> = works, but results in many white spaces (empty lines).

<div class=\"watch-comment-body\">\s*<div>\s*([^']*?)\s*</div> = failes to compile with error:
Error      1      Unrecognized escape sequence      (underlines s in \s")
0
 
LVL 18

Expert Comment

by:Gary Davis
ID: 24776783
When you need backslashes in strings, use the @ before the 1st quote:
Regex.Matches(input, @"<div class=\"watch-comment-body\">\s*<div>\s*([^']*?)\s*</div>");
Or double up the backslashes.
 
0
 

Author Comment

by:GVNPublic123
ID: 24776808
I tried @, but than the \ before " doesnt do the trick anymore.  \"watch-comment-body\" How do I fix that?
0
 
LVL 18

Expert Comment

by:Gary Davis
ID: 24776921
OK. Leave the @ off and double up the backslash for the \s. You are then escaping the backslash (\\) just like you are escaping the quote (\").
 
Regex.Matches(input, "<div class=\"watch-comment-body\">\\s*<div>\\s*([^']*?)\\s*</div>");
0
 

Author Comment

by:GVNPublic123
ID: 24776983
Yes, but by doing  MatchCollection collection = Regex.Matches(input, "<div class=\"watch-comment-body\">\\s*<div>\\s*([^']*?)\\s*</div>"); I get no matches
0
 
LVL 18

Expert Comment

by:Gary Davis
ID: 24777200
Did you remove the space after the V in the data in: <div >?
Or add a space or \\s* in the pattern.
0
 

Author Comment

by:GVNPublic123
ID: 24777323
Ah, screw that, it doesnt work either.

Now I have:
MatchCollection collection = Regex.Matches(input, "<div class=\"watch-comment-body\">([^']*?)</div>");

But than on the end, I get the matches, and when I write to file the file is like:
empty space                                                                             comment
empty space                                                                             comment

I tried to do match[i].Trim();, but it didnt work. Please help.
0
 

Author Comment

by:GVNPublic123
ID: 24777355
Ah, implementing the new string and trimming it solved the problem. Thank you very much guys!
0
 

Author Closing Comment

by:GVNPublic123
ID: 31599647
Good answer
0

Featured Post

Free Tool: Postgres Monitoring System

A PHP and Perl based system to collect and display usage statistics from PostgreSQL databases.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Introduction Although it is an old technology, serial ports are still being used by many hardware manufacturers. If you develop applications in C#, Microsoft .NET framework has SerialPort class to communicate with the serial ports.  I needed to…
Performance in games development is paramount: every microsecond counts to be able to do everything in less than 33ms (aiming at 16ms). C# foreach statement is one of the worst performance killers, and here I explain why.
Although Jacob Bernoulli (1654-1705) has been credited as the creator of "Binomial Distribution Table", Gottfried Leibniz (1646-1716) did his dissertation on the subject in 1666; Leibniz you may recall is the co-inventor of "Calculus" and beat Isaac…

840 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question