Solved

Regex.Match question C#

Posted on 2009-07-03
Last Modified: 2012-05-07
Hello,

I need to scrape comments from this page:
http://www.youtube.com/watch_ajax?v=LfLFMf4chBc&action_get_comments=1&p=0&commentthreshold=-5&commentfilter=0&page_size=10

The comments are giving me trouble. This is what I have:
if (!String.IsNullOrEmpty(input))
{
    MatchCollection collection = Regex.Matches(input, "<div class=\"watch-comment-body\"><div >([^']*?)</div>");
    // > 0 so that a page with a single comment is still processed
    if (collection.Count > 0)
    {
        for (int i = 0; i < collection.Count; i++)
        {
            currentComment = Regex.Match(collection[i].Value, "<div >([^']*?)</div>").Groups[1].Value;
            // && so that both "Show" and "Hide" are skipped (|| is true for every string)
            if (currentComment != "Show" && currentComment != "Hide")
            {
                cache = cache + "#" + currentComment;
                counter();
            }
        }
        //...process data
    }
}

But the regex never matches, because the HTML on the page looks like this:
                <div class="watch-comment-body">
                    <div >

                        the comment text I want to get is here
                    </div>

The markup is split across lines, and that is my problem. Any ideas on how I could fix this?
Question by:GVNPublic123
 
LVL 18

Accepted Solution

by:
Gary Davis earned 500 total points
ID: 24774087
There is whitespace before the second <div >, but your first pattern does not account for it, so the match fails. Remove the inner <div > from the pattern:
 Regex.Matches(input, "<div class=\"watch-comment-body\">([^']*?)</div>");
 
Gary Davis
 

Author Comment

by:GVNPublic123
ID: 24774551
It didn't work. I think the pattern still doesn't fit the markup on the website and the way Regex operates.
 
LVL 18

Expert Comment

by:Gary Davis
ID: 24774894
I often use a tool called Expresso to help debug and test regular expressions. It is free and available at http://www.ultrapico.com/.
I ran this regex:
<div class=\"watch-comment-body\">([^']*?)</div>
against your sample data and it matched, setting the captured group to everything from the second <div > up to, but not including, the first (and only) </div>. That may not be what you wanted, but it did match, and the data being on multiple lines did not matter.
Changing the regex to this:
<div class=\"watch-comment-body\">\s*<div>\s*([^']*?)\s*</div>
will get just the string "the comment text I want to get is here", which may be what you want.
One thing to point out: the second <div > in your example has a space after the "v", which the <div> in this pattern will not match. Account for it in the pattern if you need to.
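A minimal sketch of the multi-line point above, not code from this thread: a negated character class such as [^'] also matches line breaks, which is why the first pattern matches the sample without any extra regex options. The class and method names are hypothetical, and the sample string is retyped from the question.

```csharp
using System;
using System.Text.RegularExpressions;

class NegatedClassDemo
{
    // Applies the first pattern from this comment to the given markup.
    public static Match MatchBody(string html)
    {
        // [^'] means "any character except an apostrophe", newlines included.
        return Regex.Match(html, "<div class=\"watch-comment-body\">([^']*?)</div>");
    }

    static void Main()
    {
        // Sample markup from the question, line breaks included.
        string input =
            "<div class=\"watch-comment-body\">\n" +
            "    <div >\n" +
            "\n" +
            "        the comment text I want to get is here\n" +
            "    </div>\n";

        Match m = MatchBody(input);
        Console.WriteLine(m.Success);                            // True
        Console.WriteLine(m.Groups[1].Value.Contains("<div >")); // True: the inner tag is captured too
    }
}
```

Note that [^'] stops at an apostrophe, so with this pattern a comment that contains one would not be matched.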
Gary
 
 

Author Comment

by:GVNPublic123
ID: 24776480
<div class=\"watch-comment-body\">([^']*?)</div> works, but the results contain a lot of whitespace (empty lines).

<div class=\"watch-comment-body\">\s*<div>\s*([^']*?)\s*</div> fails to compile with this error:
Error      1      Unrecognized escape sequence      (the s in \s is underlined)
 
LVL 18

Expert Comment

by:Gary Davis
ID: 24776783
When you need backslashes in strings, use the @ before the 1st quote:
Regex.Matches(input, @"<div class=\"watch-comment-body\">\s*<div>\s*([^']*?)\s*</div>");
Or double up the backslashes.
 
 

Author Comment

by:GVNPublic123
ID: 24776808
I tried @, but then the \ before the " no longer works, as in \"watch-comment-body\". How do I fix that?
 
LVL 18

Expert Comment

by:Gary Davis
ID: 24776921
OK. Leave the @ off and double up the backslash for the \s. You are then escaping the backslash (\\) just like you are escaping the quote (\").
 
Regex.Matches(input, "<div class=\"watch-comment-body\">\\s*<div>\\s*([^']*?)\\s*</div>");
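For comparison, a small sketch of the two C# literal styles that came up here. In a regular string literal a quote is written \" and a backslash \\; in a verbatim (@) literal backslashes are taken literally and a quote is written by doubling it (""). The class and constant names are hypothetical.

```csharp
using System;

class EscapingDemo
{
    // Regular string literal: \" is a quote, \\ is a single backslash.
    public const string Regular =
        "<div class=\"watch-comment-body\">\\s*<div>\\s*([^']*?)\\s*</div>";

    // Verbatim string literal: backslashes are literal, a quote is written as "".
    public const string Verbatim =
        @"<div class=""watch-comment-body"">\s*<div>\s*([^']*?)\s*</div>";

    static void Main()
    {
        // Both literals produce exactly the same pattern text.
        Console.WriteLine(Regular == Verbatim); // True
        Console.WriteLine(Verbatim);
    }
}
```

Either form can be passed to Regex.Matches; the verbatim form tends to be easier to read once a pattern contains several backslashes.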
 

Author Comment

by:GVNPublic123
ID: 24776983
Yes, but with MatchCollection collection = Regex.Matches(input, "<div class=\"watch-comment-body\">\\s*<div>\\s*([^']*?)\\s*</div>"); I get no matches.
 
LVL 18

Expert Comment

by:Gary Davis
ID: 24777200
Did you remove the space after the "v" in <div > in the data?
Or add a space or \\s* at that spot in the pattern.
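As a hedged sketch of that suggestion, the pattern below uses <div\s*> for the inner tag so that both <div> and <div > are accepted. It is run against the sample markup from the question, not the live page, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

class InnerDivDemo
{
    // Returns the comment text without surrounding whitespace, or null if nothing matches.
    public static string ExtractComment(string html)
    {
        // <div\s*> tolerates optional whitespace before the closing angle bracket.
        string pattern = @"<div class=""watch-comment-body"">\s*<div\s*>\s*([^']*?)\s*</div>";
        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        // Sample markup from the question, line breaks included.
        string input =
            "<div class=\"watch-comment-body\">\n" +
            "    <div >\n" +
            "\n" +
            "        the comment text I want to get is here\n" +
            "    </div>\n";

        Console.WriteLine("[" + ExtractComment(input) + "]");
        // prints: [the comment text I want to get is here]
    }
}
```

The \s* on either side of the group keeps the leading and trailing whitespace out of the capture, so no separate Trim() call is needed for this sample.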
 

Author Comment

by:GVNPublic123
ID: 24777323
Ah, screw that, it doesn't work either.

Now I have:
MatchCollection collection = Regex.Matches(input, "<div class=\"watch-comment-body\">([^']*?)</div>");

In the end I do get the matches, but when I write them to a file, the file looks like this:
empty space                                                                             comment
empty space                                                                             comment

I tried match[i].Trim();, but it didn't work. Please help.
 

Author Comment

by:GVNPublic123
ID: 24777355
Ah, assigning the result to a new string and trimming it solved the problem. Thank you very much, guys!
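The final code is not shown in this thread. A minimal sketch of what the fix presumably looks like, assuming the earlier attempt called Trim() without using its return value: .NET strings are immutable, so Trim() returns a new string and leaves the original untouched. The class and method names are hypothetical.

```csharp
using System;

class TrimDemo
{
    // Returns the trimmed copy; the argument itself is never modified.
    public static string Clean(string captured)
    {
        return captured.Trim();
    }

    static void Main()
    {
        // Stand-in for a captured group value with surrounding whitespace.
        string captured = "\n\n        the comment text I want to get is here\n    ";

        captured.Trim();                          // result discarded, captured is unchanged
        Console.WriteLine(captured[0] == '\n');   // True

        string trimmed = Clean(captured);         // keep the returned string instead
        Console.WriteLine("[" + trimmed + "]");   // [the comment text I want to get is here]
    }
}
```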
 

Author Closing Comment

by:GVNPublic123
ID: 31599647
Good answer