How do I extract Google Scholar's HTML tags using regular expressions?

Posted on 2010-01-10
Last Modified: 2013-11-23
Good evening,
I need to extract the authors' names from the HTML of a Google Scholar page, using Java.
For example, I need to extract "R Bañuelos, RG Smits" from the following HTML, using regular expressions.
Could I ask for someone's advice?

Thank you.



<br><span class=gs_a>R Bañuelos, RG Smits - Probability Theory and Related Fields, 1997 - Springer</span><br>Summary. We study the asymptotic behavior of Brownian motion and its conditioned process <br>
in cones using an infinite series representation of its transition density. A concise probabilistic <br>
interpretation of this series in terms of the skew product decomposition of Brownian <b> ...</b> <br><span class=gs_fl><a href="/scholar?cites=726791209970358048&amp;hl=en&amp;as_sdt=2000">Cited by 52</a> - <a href="/scholar?q=related:INcoOMEUFgoJ:scholar.google.com/&amp;hl=en&amp;as_sdt=2000">Related articles</a>
Question by:AirFranz

4 Comments
 

Expert Comment

by:numberkruncher
ID: 26278471
Something like the following should do the trick:
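// Returns the author list from the first gs_a span, or null if there is none.
function extractAuthor(html) {
	// Stop at the " - " separator or the next tag, not at any hyphen,
	// so hyphenated names are not cut short.
	var match = html.match(/<span class=gs_a>([^<]*?)(?: - |<)/);
	return match ? trim(match[1]) : null;
}
// Strips leading and trailing whitespace.
function trim(text) {
    return text.replace(/^\s+/, "").replace(/\s+$/, "");
}
alert(extractAuthor("<br><span class=gs_a>R Bañuelos, RG Smits - Probability Theory and Related Fields, 1997 - Springer</span><br>"));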
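
Since the question asks for Java, here is a rough sketch of the same single-result idea with Pattern and Matcher. It assumes the author list is followed by " - " or by the next tag, as in the sample; the class name FirstAuthorExtractor is made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstAuthorExtractor {
    // Group 1 lazily captures the text after the opening tag, up to the first
    // " - " separator or the next "<", whichever comes first.
    // Assumption: the markup looks like the sample in the question.
    private static final Pattern AUTHORS =
            Pattern.compile("<span class=gs_a>([^<]*?)(?: - |<)");

    /** Returns the author list of the first result, or null if none is found. */
    public static String extractAuthor(String html) {
        Matcher m = AUTHORS.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // \u00f1 is the character "ñ", escaped to avoid source-encoding issues.
        String html = "<br><span class=gs_a>R Ba\u00f1uelos, RG Smits - "
                + "Probability Theory and Related Fields, 1997 - Springer</span><br>";
        System.out.println(extractAuthor(html));
    }
}
```

Returning null when nothing matches keeps the caller from hitting an IllegalStateException on m.group(1).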

 

Author Comment

by:AirFranz
ID: 26278585
That works, but I need to do a bit more.
For each result on a Google Scholar results page I need to extract the authors' names. I was thinking of using Pattern and Matcher, something like the following:

Pattern p = Pattern.compile("(\\S+)");   // line (1)
Matcher m = p.matcher(stringToSearch);

while (m.find()) {
    String codeGroup = m.group(1);
}

where stringToSearch is the HTML of the results page.

I'm looking for what to put between the quotes in line (1).

Thanks
                       

Accepted Solution

by:
numberkruncher earned 1000 total points
ID: 26278824
Ah right, sorry, I didn't realize there were multiple matches. In that case something like the following should do the trick.

AFAIK JavaScript has no direct equivalent of the Java classes Pattern and Matcher, but getting the same result is reasonably straightforward.
function extractAuthors(html) {
	var prefix = "<span class=gs_a>";
	// Lazily match up to the first " - " separator or the next tag.
	var pattern = /<span class=gs_a>[^<]*?(?= - |<)/g;
	// With the g flag, match() returns every full match, or null if there are none.
	var matches = html.match(pattern) || [];
	var results = [];
	for (var i = 0; i < matches.length; ++i)
		results.push(trim(matches[i].substr(prefix.length)));
	return results;
}
// Strips leading and trailing whitespace.
function trim(text) {
    return text.replace(/^\s+/, "").replace(/\s+$/, "");
}

var authors = extractAuthors("<br><span class=gs_a>R Bañuelos, RG Smits - Probability Theory and Related Fields, 1997 - Springer</span><br><br><span class=gs_a>Bob - Springer</span><br>");
alert(authors[0]);
alert(authors[1]);
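
For the Java loop in the question, a sketch along these lines might work. The string passed to Pattern.compile is one candidate for "line (1)". It assumes every result keeps its authors in a gs_a span and separates them from the venue with " - ", as in the sample. The class name AuthorExtractor is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AuthorExtractor {
    // Group 1 is the text between the opening tag and the first " - "
    // separator (or the next "<" if there is no separator).
    // Assumption: the page markup matches the sample in the question.
    private static final Pattern AUTHORS =
            Pattern.compile("<span class=gs_a>([^<]*?)(?: - |<)");

    /** Returns one author string per search result, in page order. */
    public static List<String> extractAuthors(String stringToSearch) {
        List<String> results = new ArrayList<>();
        Matcher m = AUTHORS.matcher(stringToSearch);
        // Each find() call resumes where the previous match ended.
        while (m.find()) {
            results.add(m.group(1).trim());
        }
        return results;
    }

    public static void main(String[] args) {
        // \u00f1 is the character "ñ", escaped to avoid source-encoding issues.
        String html = "<br><span class=gs_a>R Ba\u00f1uelos, RG Smits - Probability Theory "
                + "and Related Fields, 1997 - Springer</span><br>"
                + "<br><span class=gs_a>Bob - Springer</span><br>";
        for (String authors : extractAuthors(html)) {
            System.out.println(authors);
        }
    }
}
```

A regular expression like this is tied to the exact markup, so it will stop matching if the attribute is quoted or the tag gains other attributes.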
