How do I extract google scholar's html tags using regular expressions?

Posted on 2010-01-10
Last Modified: 2013-11-23
Good evening,
I need to extract the authors' names from the HTML code of a Google Scholar page, using Java.
For example, I need to extract "R Bañuelos, RG Smits" from the following HTML code, using regular expressions.
Could I ask for someone's advice?

Thank you.

<br><span class=gs_a>R Bañuelos, RG Smits - Probability Theory and Related Fields, 1997 - Springer</span><br>Summary. We study the asymptotic behavior of Brownian motion and its conditioned process <br>
in cones using an infinite series representation of its transition density. A concise probabilistic <br>
interpretation of this series in terms of the skew product decomposition of Brownian <b> ...</b> <br><span class=gs_fl><a href="/scholar?cites=726791209970358048&amp;hl=en&amp;as_sdt=2000">Cited by 52</a> - <a href="/scholar?q=related:INcoOMEUFgoJ:scholar.google.com/&amp;hl=en&amp;as_sdt=2000">Related articles</a>
Question by: AirFranz

Expert Comment

ID: 26278471
Something like the following should do the trick:
function extractAuthor(html) {
	// Capture everything after the opening tag up to the first "-" or "<".
	return trim(html.match(/<span class=gs_a>([^<\-]*)(<\/span>|\-)/)[1]);
}
function trim(text) {
    return text.replace(/^\s*/, "").replace(/\s*$/, "");
}
alert(extractAuthor("<br><span class=gs_a>R Bañuelos, RG Smits - Probability Theory and Related Fields, 1997 - Springer</span><br>"));


Author Comment

ID: 26278585
That works, but I need to do something more.
For each result on a Google Scholar results page, I need to extract the authors' names. I was thinking about using Pattern and Matcher, something like the following:

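import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("(\\S+)");   // line (1)
Matcher m = p.matcher(stringToSearch);

while (m.find()) {
    String codeGroup = m.group(1);   // text captured by the first group
}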

where stringToSearch is the HTML code of the results page.

What I'm looking for is the pattern to put between the quotes in line (1).

Accepted Solution

numberkruncher earned 1000 total points
ID: 26278824
Ah right, sorry I didn't realize there were multiple matches. In that case something like the following should do the trick.

As far as I know, JavaScript has no direct equivalents of the Java classes Pattern and Matcher, but it is reasonably straightforward: with the g flag, match() returns an array of every match.
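function extractAuthors(html) {
	// With the g flag, match() returns every full match; capture groups are not included.
	var pattern = /<span class=gs_a>([^<\-]*)/g;
	var matches = html.match(pattern) || [];	// match() returns null when nothing matches
	var results = new Array();
	for (var i = 0; i < matches.length; ++i) {
		// Strip the opening tag from each full match, leaving only the author list.
		results.push(trim(matches[i].replace("<span class=gs_a>", "")));
	}
	return results;
}
function trim(text) {
    return text.replace(/^\s*/, "").replace(/\s*$/, "");
}

var authors = extractAuthors("<br><span class=gs_a>R Bañuelos, RG Smits - Probability Theory and Related Fields, 1997 - Springer</span><br><br><span class=gs_a>Bob - Springer</span><br>");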
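
Since the question asked for Java, the same expression should also work as the pattern in line (1) of the question. The following is only a minimal sketch: the class name ScholarAuthors and the method name extractAuthors are made up for illustration, and it assumes the results page still marks each author line with <span class=gs_a> as in the sample above. Like the JavaScript version, it stops at the first hyphen, so a hyphenated author name would be cut short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the thread.
final class ScholarAuthors {

    // Same expression as the JavaScript answer: capture everything after the
    // opening gs_a tag up to the first "-" or "<".
    private static final Pattern AUTHOR_LINE =
            Pattern.compile("<span class=gs_a>([^<\\-]*)");

    static List<String> extractAuthors(String html) {
        List<String> authors = new ArrayList<>();
        Matcher m = AUTHOR_LINE.matcher(html);
        // find() advances to the next match on every call, so the loop
        // visits each result on the page exactly once.
        while (m.find()) {
            authors.add(m.group(1).trim());
        }
        return authors;
    }

    public static void main(String[] args) {
        String html = "<br><span class=gs_a>Bob - Springer</span><br>";
        for (String author : extractAuthors(html)) {
            System.out.println(author);
        }
    }
}
```

Here stringToSearch from the question would be passed as the html argument, and each list entry is one result's author string (still comma-separated if there are several authors).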
