Solved

Regex.Match question C#

Posted on 2009-07-03
Last Modified: 2012-05-07
Hello,

I need to scrape comments from this page:
http://www.youtube.com/watch_ajax?v=LfLFMf4chBc&action_get_comments=1&p=0&commentthreshold=-5&commentfilter=0&page_size=10

The comments are giving me trouble. What I have is:
if (!String.IsNullOrEmpty(input))
{
    MatchCollection collection = Regex.Matches(input, "<div class=\"watch-comment-body\"><div >([^']*?)</div>");
    if (collection.Count > 1)
    {
        for (int i = 0; i < collection.Count; i++)
        {
            currentComment = Regex.Match(collection[i].Value, "<div >([^']*?)</div>").Groups[1].Value;
            // Skip the "Show"/"Hide" links; this needs &&, because the || form is always true.
            if (currentComment != "Show" && currentComment != "Hide")
            {
                cache = cache + "#" + currentComment;
                counter();
            }
        }
        //...process data
    }
}

But Regex.Matches always returns no matches, because the HTML of the page looks like this:
                <div class="watch-comment-body">
                    <div >

                        the comment text I want to get is here
                    </div>

It's split across several lines, and that is my problem. Any ideas on how I could fix this issue?
Question by:GVNPublic123
12 Comments
 
LVL 18

Accepted Solution

by:
Gary Davis earned 500 total points
ID: 24774087
There is whitespace before the 2nd div, but your 1st pattern does not account for it, so the match fails. Remove the <div > from the pattern:
 Regex.Matches(input, "<div class=\"watch-comment-body\">([^']*?)</div>");
 
Gary Davis
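To illustrate the point about whitespace, here is a minimal sketch that runs both patterns against markup shaped like the sample in the question. The class name WhitespaceDemo and the Sample string are assumptions made up for the illustration; only Regex.Matches and the two patterns come from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Hypothetical sample, shaped like the page source quoted in the question.
    public const string Sample =
        "<div class=\"watch-comment-body\">\n" +
        "    <div >\n" +
        "\n" +
        "        the comment text I want to get is here\n" +
        "    </div>\n";

    // Original pattern: expects "<div >" immediately after the outer tag,
    // with no newline or indentation in between.
    public const string Original =
        "<div class=\"watch-comment-body\"><div >([^']*?)</div>";

    // Suggested pattern: the inner "<div >" is dropped, so the group
    // absorbs the whitespace and the inner tag along with the comment.
    public const string Suggested =
        "<div class=\"watch-comment-body\">([^']*?)</div>";

    // Number of matches a pattern finds in the sample markup.
    public static int CountMatches(string pattern)
    {
        return Regex.Matches(Sample, pattern).Count;
    }
}
```

With this sample, CountMatches(WhitespaceDemo.Original) finds nothing and CountMatches(WhitespaceDemo.Suggested) finds one match. The negated class [^'] matches newlines too, so the line breaks themselves are not the obstacle.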
 

Author Comment

by:GVNPublic123
ID: 24774551
It didn't work. I think the pattern is still wrong given the markup on the website and how Regex operates.
 
LVL 18

Expert Comment

by:Gary Davis
ID: 24774894
I often use a tool called Expresso to help debug and test regular expressions. It is free and available at http://www.ultrapico.com/.
I ran this regex:
<div class=\"watch-comment-body\">([^']*?)</div>
against your sample data and it matched, setting the captured group to everything from the 2nd <div> up to but not including the 1st (and only) </div>. That may not be what you wanted, but it did match, and the text being on multiple lines did not matter.
Changing the regex to this:
<div class=\"watch-comment-body\">\s*<div>\s*([^']*?)\s*</div>
will capture just the string "the comment text I want to get is here", which may be what you want.
One thing to point out: the 2nd <div> in your example has a space after the v, which will stop a literal <div> in the pattern from matching. Account for it in the pattern if you need to.
Gary
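One way to account for the space after the v is to write the inner tag as <div\s*>, which accepts both <div> and <div >. The following is only a sketch under that assumption; the class name TolerantPattern and the method FirstComment are hypothetical, and the test string in the usage note is modelled on the sample from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TolerantPattern
{
    // Backslashes are doubled because this is a regular C# string literal.
    // <div\s*> matches the inner tag with or without a space before '>'.
    public const string Pattern =
        "<div class=\"watch-comment-body\">\\s*<div\\s*>\\s*([^']*?)\\s*</div>";

    // Returns the first captured comment, or an empty string when nothing matches.
    public static string FirstComment(string html)
    {
        Match match = Regex.Match(html, Pattern);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }
}
```

Called on "<div class=\"watch-comment-body\">\n    <div >\n\n        the comment text I want to get is here\n    </div>", FirstComment returns the comment text without the surrounding whitespace. Note that [^'] also excludes apostrophes, so a comment containing one would not be captured by any of the patterns in this thread.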
 
 

Author Comment

by:GVNPublic123
ID: 24776480
<div class=\"watch-comment-body\">([^']*?)</div> = works, but the results contain a lot of whitespace (empty lines).

<div class=\"watch-comment-body\">\s*<div>\s*([^']*?)\s*</div> = fails to compile with this error:
Error      1      Unrecognized escape sequence      (underlines s in \s")
 
LVL 18

Expert Comment

by:Gary Davis
ID: 24776783
When you need backslashes in strings, use the @ before the 1st quote:
Regex.Matches(input, @"<div class=\"watch-comment-body\">\s*<div>\s*([^']*?)\s*</div>");
Or double up the backslashes.
 
 

Author Comment

by:GVNPublic123
ID: 24776808
I tried @, but then the \ before " doesn't do the trick anymore:  \"watch-comment-body\" How do I fix that?
 
LVL 18

Expert Comment

by:Gary Davis
ID: 24776921
OK. Leave the @ off and double up the backslash for the \s. You are then escaping the backslash (\\) just like you are escaping the quote (\").
 
Regex.Matches(input, "<div class=\"watch-comment-body\">\\s*<div>\\s*([^']*?)\\s*</div>");
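For comparison, a verbatim (@) literal can also carry this pattern, provided each quote inside it is written as "" instead of \". The sketch below checks that the two spellings produce the same pattern text; the class name EscapingDemo is an assumption for illustration only.

```csharp
using System;

public static class EscapingDemo
{
    // Regular literal: \" escapes the quote and \\ yields a single backslash.
    public const string Regular =
        "<div class=\"watch-comment-body\">\\s*<div>\\s*([^']*?)\\s*</div>";

    // Verbatim literal: backslashes are taken literally and a quote is written as "".
    public const string Verbatim =
        @"<div class=""watch-comment-body"">\s*<div>\s*([^']*?)\s*</div>";

    // True when both literals denote exactly the same pattern text.
    public static bool SamePattern()
    {
        return string.Equals(Regular, Verbatim, StringComparison.Ordinal);
    }
}
```

Either form can be passed to Regex.Matches; which one to use is mostly a matter of readability.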
 

Author Comment

by:GVNPublic123
ID: 24776983
Yes, but by doing  MatchCollection collection = Regex.Matches(input, "<div class=\"watch-comment-body\">\\s*<div>\\s*([^']*?)\\s*</div>"); I get no matches
 
LVL 18

Expert Comment

by:Gary Davis
ID: 24777200
Did you remove the space after the v in <div > in the data?
Or add a space or \\s* at that point in the pattern.
 

Author Comment

by:GVNPublic123
ID: 24777323
Ah, screw that, it doesn't work either.

Now I have:
MatchCollection collection = Regex.Matches(input, "<div class=\"watch-comment-body\">([^']*?)</div>");

But then at the end I get the matches, and when I write them to a file, the file looks like this:
empty space                                                                             comment
empty space                                                                             comment

I tried to do match[i].Trim();, but it didn't work. Please help.
 

Author Comment

by:GVNPublic123
ID: 24777355
Ah, implementing the new string and trimming it solved the problem. Thank you very much guys!
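The reason a bare call to Trim() appears to do nothing is that .NET strings are immutable: Trim() returns a new string and leaves the original untouched. A minimal sketch of the difference follows; the class and method names are hypothetical and chosen only for this illustration.

```csharp
using System;

public static class TrimDemo
{
    // Mirrors the first attempt in the thread: the result of Trim() is discarded.
    public static string TrimDiscarded(string raw)
    {
        raw.Trim();   // returns a new string that is never stored
        return raw;   // still carries the surrounding whitespace
    }

    // Stores the trimmed copy in a new variable and returns it.
    public static string TrimAssigned(string raw)
    {
        string trimmed = raw.Trim();
        return trimmed;
    }
}
```

Applied to a captured value such as "\n        comment\n    ", TrimDiscarded hands back the value unchanged, while TrimAssigned returns "comment".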
 

Author Closing Comment

by:GVNPublic123
ID: 31599647
Good answer