Solved

Perl script to extract citations

Posted on 2003-12-05
Last Modified: 2010-03-04
I am looking for a Perl script to extract citations from text files. It needs to
identify things like:

Maclean, D.J.H. (1996)
Bloggs (1993)
Bloggs et al (2000)
Bloggs et. al. (2000)
Bloggs and Fred (1985)
Bloggs, Fred and George (2001)
...(Bloggs, 2000;Fred, 2001)
...[Bloggs, 2000;Fred, 2001]

Please let me know where to find a script like this.

Thanks
Question by:waleed072098
29 Comments
 
LVL 20

Expert Comment

by:jmcg
ID: 9885794
Unless you can come up with a set of rules for identifying what is and is not a citation, based entirely on the textual presentation, a Perl script is not going to be of much help. Other than seeing four digits followed by a ) or ], I don't see any regularity sufficient to locate a citation, and identifying exactly where the citation starts looks tricky.
 
LVL 51

Expert Comment

by:ahoffmann
ID: 9888346
Could you also please post what you have done so far?
 

Author Comment

by:waleed072098
ID: 9892395
ahoffmann :-

Post what? I have not done anything. I am looking for a script that meets the above requirements.

Thanks

I am doubling points
 
LVL 51

Expert Comment

by:ahoffmann
ID: 9892833
Hmm, and this is not homework, by any chance?
 

Author Comment

by:waleed072098
ID: 9893046
I am a geography lecturer, my friend. I publish papers all the time and I need this tool.
Thanks
 
LVL 19

Expert Comment

by:Kim Ryan
ID: 9893123
I have created a module for identifying people's names:
http://search.cpan.org/~kimryan/Lingua-EN-NameParse-1.18/NameParse.pm
This module has been used for an application very similar to yours.
However, the problem is quite complex, and it takes a fair amount of
tuning to get good results.
 
LVL 51

Expert Comment

by:ahoffmann
ID: 9895262
OK, then we're back to jmcg's question:
  please give some possible lines, and the rules for detecting the number
 

Author Comment

by:waleed072098
ID: 9907954
Somebody came up with this:

(?i)\w+\s*(et\.* al\.*)\s*\([12][4567890]\d\d\)
(?i)\w+\s*\([12][4567890]\d\d\s*\,+\s*\w*\)
(?i)\w+\s*\([12][4567890]\d\d\)
(?i)\w*\s(and*)\s\w+\s*\(*[12][4567890]\d\d\)*
(?i)\w+\s*\,*\s*[12][4567890]\d\d\s*;

 
LVL 51

Expert Comment

by:ahoffmann
ID: 9908226
\w+,?\s+((et\.?\s*al\.?)|and|\w+\s+and)\s+\(([12][4567890]\d\d)\)

# does not match your last 2 lines, because you still gave no rules
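
One way to see which of the posted examples a candidate pattern covers is to run it over all of them. The following is only a rough test harness, assuming the pattern from the comment above is used unchanged; the helper name `matching_samples` is made up for illustration.

```perl
use strict;
use warnings;

# The sample citations from the original question.
my @samples = (
    'Maclean, D.J.H. (1996)',
    'Bloggs (1993)',
    'Bloggs et al (2000)',
    'Bloggs et. al. (2000)',
    'Bloggs and Fred (1985)',
    'Bloggs, Fred and George (2001)',
    '...(Bloggs, 2000;Fred, 2001)',
    '...[Bloggs, 2000;Fred, 2001]',
);

# Candidate pattern, copied from the comment above without changes.
my $pattern = qr/\w+,?\s+((et\.?\s*al\.?)|and|\w+\s+and)\s+\(([12][4567890]\d\d)\)/;

# Hypothetical helper: return the samples that the pattern matches.
sub matching_samples {
    my ($re, @candidates) = @_;
    return grep { $_ =~ $re } @candidates;
}

# Report match / no match for every sample, in the original order.
my %matched = map { $_ => 1 } matching_samples($pattern, @samples);
for my $sample (@samples) {
    printf "%-8s %s\n", ($matched{$sample} ? 'match' : 'no match'), $sample;
}
```

Running a harness like this after every change to a pattern makes it easy to spot which formats still need a rule of their own.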
 

Author Comment

by:waleed072098
ID: 9908312
These are rules, aren't they? What rules? I can't come up with anything better.
 
LVL 51

Expert Comment

by:ahoffmann
ID: 9910995
If you don't have rules for how to identify citations, then every answer is valid.
What are you asking for, then?
 

Author Comment

by:waleed072098
ID: 9932352
I think that we could safely assume that a citation looks like

* (1???)
(*,1???)

which is fairly unlikely to be anything other than a citation (in most
texts). The difficulty is extracting the author names when there may be
one, two, or three of them before the bracket.


Thanks

 

Author Comment

by:waleed072098
ID: 9932356
...........and

* (2???)
(*,2???)
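
The two shapes above can be written down as regular expressions. This is a minimal sketch, assuming "?" stands for a single digit and "*" for any text without brackets or commas; the variable names and the helper `citation_years` are invented for illustration.

```perl
use strict;
use warnings;

# "* (1???)" or "* (2???)": a four-digit year alone in parentheses.
my $year_in_parens = qr/\(\s*([12]\d{3})\s*\)/;

# "(*,1???)" or "(*,2???)": text, a comma and a year inside one pair of parentheses.
my $name_and_year_in_parens = qr/\(([^(),]+),\s*([12]\d{3})\)/;

# Hypothetical helper: return every year that looks like part of a citation.
sub citation_years {
    my ($text) = @_;
    my @years;
    while ($text =~ /$year_in_parens/g)          { push @years, $1 }
    while ($text =~ /$name_and_year_in_parens/g) { push @years, $2 }
    return @years;
}

print join(', ', citation_years('Bloggs (1993) disagrees (Fred, 2001).')), "\n";
```

These two rules only locate the year. They do not yet decide how many of the words before the bracket are author names, and they do not cover groups separated by semicolons such as (Bloggs, 2000;Fred, 2001).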


 
LVL 51

Expert Comment

by:ahoffmann
ID: 9937023
ok, so this is not a citation:
...[Bloggs, 2000;Fred, 2001]

s#[,\s\(]([12]\d\d\d)\).*#$1#;
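
To make the effect of that substitution visible, it can be wrapped in a small function and applied to a sample line. This is just a sketch; the wrapper name `cut_after_year` and the sample sentence are assumptions, not part of the original suggestion.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution from the comment above.
# It replaces the character before the year, the closing ")" and the
# rest of the line with the bare year.
sub cut_after_year {
    my ($line) = @_;
    $line =~ s#[,\s\(]([12]\d\d\d)\).*#$1#;
    return $line;
}

print cut_after_year('as shown by Bloggs et al (2000) in later work'), "\n";
```

What remains is the text leading up to the year plus the year itself, so the author names still have to be separated from the preceding words.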
 
LVL 1

Expert Comment

by:str8dn
ID: 10204257
I find it easier in multiple case scenarios such as this to handle it in pieces.
This way additional cases (formats) can be addressed easily as they arise, and we get less confused than we would with one giant expression.

For example
my (@cites) = (
      "Maclean, D.J.H. (1996) ",
      "Bloggs (1993) ",
      "Bloggs et al (2000) ",
      "Bloggs et. al. (2000) ",
      "Bloggs and Fred (1985) ",
      "Bloggs, Fred and George (2001) ",
      "(Bloggs, 2000;Fred, 2001) ",
      "[Bloggs, 2000;Fred, 2001] ",
);

my (@authors, @initials, @pubyear);
foreach (@cites) {
      if ( /^(\w+),\s*([A-Z]\.[A-Z]\.[A-Z]\.)\s*\(([0-9]{4})\)/ ) {
            # formats like Maclean, D.J.H. (1996)
            push(@authors,  $1);
            push(@initials, $2);
            push(@pubyear,  $3);
      }
      elsif ( /^(\w+)\s*\(([0-9]{4})\)/ ) {
            # formats like Bloggs (1993)
            push(@authors,  $1);
            push(@initials, "");
            push(@pubyear,  $2);
      }
      elsif ( /^(\w+)\s*et\.*\sal\.*\s*\(([0-9]{4})\)/ ) {
            # formats like Bloggs et al (2000)
            #           or Bloggs et. al. (2000)
            push(@authors,  $1);
            push(@initials, "");
            push(@pubyear,  $2);
      }
      elsif ( /^[\[\(](\w+),\s*([0-9]{4});(\w+),\s*([0-9]{4})[\]\)]/ ) {
            # formats like (Bloggs, 2000;Fred, 2001)
            #               or [Bloggs, 2000;Fred, 2001]
            # for first author
            push(@authors,  $1);
            push(@initials, "");
            push(@pubyear,  $2);
      
            # for second author
            push(@authors,  $3);
            push(@initials, "");
            push(@pubyear,  $4);
      }
      else {
            print "Uninstituted format - add code appropriately: $_\n";
      }
}
      
my $i;
for( $i = 0; $i <= $#authors; $i++) {
      print "Author: $authors[$i]";
      if ($initials[$i])  {
            print ", $initials[$i]";
      }
      print "\t";
      print "Published: $pubyear[$i]\n";
      print "--\n";
}

Of course, you would use your own data structure, but this gives the general idea.
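
As one possible data structure, each citation could be kept as a single record instead of spreading it over three parallel arrays. The sketch below assumes an array of hash references; the helper `add_citation` and the field names are invented for illustration.

```perl
use strict;
use warnings;

# One record per citation instead of three parallel arrays.
my @citations;

# Hypothetical helper: store one parsed citation as a hash reference.
sub add_citation {
    my ($authors, $initials, $year) = @_;
    push @citations, { authors => $authors, initials => $initials, year => $year };
    return;
}

add_citation([ 'Maclean' ], 'D.J.H.', 1996);
add_citation([ 'Bloggs', 'Fred', 'George' ], '', 2001);

for my $cite (@citations) {
    my $names = join(', ', @{ $cite->{authors} });
    $names .= ", $cite->{initials}" if $cite->{initials};
    print "Author(s): $names\tPublished: $cite->{year}\n";
}
```

Keeping all authors of one citation in the same record avoids having to repeat the year for every name.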


 
LVL 1

Expert Comment

by:str8dn
ID: 10204351
The code will give you something like the following as output:

Uninstituted format - add code appropriately: Bloggs and Fred (1985)
Uninstituted format - add code appropriately: Bloggs, Fred and George (2001)
Author: Maclean, D.J.H.      Published: 1996
--
Author: Bloggs      Published: 1993
--
Author: Bloggs      Published: 2000
--
Author: Bloggs      Published: 2000
--
Author: Bloggs      Published: 2000
--
Author: Fred      Published: 2001
--
Author: Bloggs      Published: 2000
--
Author: Fred      Published: 2001
--

Of course, the original identification of a "cite" within text is still up to you, although
you would use something similar to the case-based scenario above... basically remove the
^ (start of line requirement) and run it on whole text lines until you get a working set.

This is all of course assuming that the much simpler 'accessing the references page' is not
an option...
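
Dropping the ^ and scanning a whole line could look roughly like this. It is a minimal sketch that only covers the "Name (year)" and "Name et al (year)" formats; the helper name `find_cites` and the sample sentence are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: find "Name (year)" and "Name et al (year)"
# anywhere in a line and return [author, year] pairs.
sub find_cites {
    my ($line) = @_;
    my @found;
    while ($line =~ /(\w+)(?:\s+et\.?\s+al\.?)?\s*\(([0-9]{4})\)/g) {
        push @found, [ $1, $2 ];
    }
    return @found;
}

my $text = 'Earlier work by Bloggs (1993) was extended by Fred et al (2000).';
print "Author: $_->[0]\tPublished: $_->[1]\n" for find_cites($text);
```

The /g modifier lets the loop pick up every citation in the line rather than stopping at the first one.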
 
LVL 51

Expert Comment

by:ahoffmann
ID: 10207569
str8dn, how would your suggestion help if you first need to convert the citations from the text file into a Perl array?
 
LVL 1

Accepted Solution

by:
str8dn earned 400 total points
ID: 10221397
The citations themselves are simply groupings of words and numbers in
a finite set of recognizable formats (the point of the code I offered, and of trying to use regular expressions in the first place).

By handling the text lines in the same manner - creating a finite set of regular expressions (one for each of the recognized citation types) - the citations themselves can be pulled from lines.

The code I wrote for waleed not only handles the post-processing (citation parsing), it also provides the foundation on which the larger regular expressions that recognize a citation within normal text can be created.

Here is more detail - note that I have not debugged this code - it is written to show the methodology of a possible solution...


use strict;
use warnings;

# Lines of the text to scan, read from the files named on the command line.
my @lines_of_text = <>;
chomp @lines_of_text;

my (@authors, @initials, @pubyear);
my $maxNeeded = 5;      # maximum number of words needed from prior to the marker
my @words_before;       # words of the line read before the current one

foreach my $line (@lines_of_text) {
      my @words = split ' ', $line;

      foreach my $wordCount (0 .. $#words) {
            next unless $words[$wordCount] =~ /\([0-9]{4}\)/;

            # There is a year of format "(XXXX)" in this line.
            # This could be from one of the following citations
            # (of those listed initially):
            #
            # Maclean, D.J.H. (1996)
            # Bloggs (1993)
            # Bloggs et al (2000)
            # Bloggs et. al. (2000)
            # Bloggs and Fred (1985)
            # Bloggs, Fred and George (2001)

            # Get a subset of words up to and including the year marker
            # on which to test the regular expressions for the citations
            # - this avoids complications arising from false positives
            # created by matches elsewhere in the same line. If the
            # current line does not supply enough words, the last words
            # of the line before are used as well.
            my @context = (@words_before, @words[0 .. $wordCount]);
            shift @context while @context > $maxNeeded + 1;
            my $words_to_test = join ' ', @context;

            # Each pattern is anchored at the year marker that was just
            # found; the order of the tests is significant.
            if ( $words_to_test =~
                  /(\w+),\s*((?:[A-Z]\.)+)\s*\(([0-9]{4})\)\W*$/ ) {
                  # formats like Maclean, D.J.H. (1996)
                  push(@authors,  $1);
                  push(@initials, $2);
                  push(@pubyear,  $3);
            }
            elsif ( $words_to_test =~
                  /(\w+)\s+et\.?\s+al\.?\s*\(([0-9]{4})\)\W*$/ ) {
                  # formats like Bloggs et al (2000)
                  #           or Bloggs et. al. (2000)
                  push(@authors,  $1);
                  push(@initials, "");
                  push(@pubyear,  $2);
            }
            elsif ( $words_to_test =~
                  /(\w+),\s*(\w+)\s+and\s+(\w+)\s*\(([0-9]{4})\)\W*$/ ) {
                  # formats like Bloggs, Fred and George (2001)
                  push(@authors,  $1, $2, $3);
                  push(@initials, "", "", "");
                  push(@pubyear,  $4, $4, $4);
            }
            elsif ( $words_to_test =~
                  /(\w+)\s+and\s+(\w+)\s*\(([0-9]{4})\)\W*$/ ) {
                  # formats like Bloggs and Fred (1985)
                  push(@authors,  $1, $2);
                  push(@initials, "", "");
                  push(@pubyear,  $3, $3);
            }
            elsif ( $words_to_test =~ /(\w+)\s*\(([0-9]{4})\)\W*$/ ) {
                  # formats like Bloggs (1993)
                  push(@authors,  $1);
                  push(@initials, "");
                  push(@pubyear,  $2);
            }
            else {
                  print "Uninstituted format - add code",
                        " appropriately: $words_to_test\n";
            }
      }
      @words_before = @words;
}

foreach my $i (0 .. $#authors) {
      print "Author: $authors[$i]";
      print ", $initials[$i]" if $initials[$i];
      print "\tPublished: $pubyear[$i]\n";
}

 

Author Comment

by:waleed072098
ID: 10235529
str8dn:-

Thanks for the help, but I can't understand why you hardcoded the citations:

 #      
               # Maclean, D.J.H. (1996)
                    # Bloggs (1993)
                    # Bloggs et al (2000)
                    # Bloggs et. al. (2000)
                    # Bloggs and Fred (1985)
                    # Bloggs, Fred and George (2001)

                         # formats like Maclean, D.J.H. (1996)

My question was:
I am looking for a Perl script to extract citations from text files. Any citations, in any text files.
Thanks

 
LVL 51

Expert Comment

by:ahoffmann
ID: 10236850
waleed, when will you give us the rules for what citations look like?
You already have working solutions for your posted examples.
 

Author Comment

by:waleed072098
ID: 10258395
You are right, Mr. ahoffmann, but please answer your own question:

>str8dn, in which way should your suggestion help, if you first need to convert the citations from text file into a perl array?


Thanks
Points increased to 600
 
LVL 51

Expert Comment

by:ahoffmann
ID: 10273391
grrr, is this a guessing contest?
Again: what are the rules to identify citations?
 
LVL 1

Expert Comment

by:str8dn
ID: 10274367


Hi waleed,

Those "hard coded lines" that start with '#' are actually comments. They let you know that the block of code they are in is designed to discover those types of citations within any line of text. Additional blocks can be added as necessary to recognize additional citation formats as you discover them (since what you gave us originally is by no means a definitive list).

If this is your first Perl program, you've picked a tough one to start with, so be ready for some long hours of reading and trial and error. I would definitely suggest you begin with "Programming Perl" from O'Reilly.

The code I offered is basically a rough design to pull citations directly from any amount of text.

As you take a look you will notice that it assumes you can break the full text up into separate lines (@lines_of_text). If lines are not easily delineated, then you can simply break the text up directly into words and change the initial foreach loop accordingly.

Words are then pulled from the lines (into @words) to further break up the text and offer additional information for the citation recognition. The first set of citations (first if-block) is recognized because all of them have a year in parens like (XXXX) - notice the corresponding if statement.

Each of the first set of citations is then further tested by regular expressions. Either one of them will be recognized or the code will continue on to another set of citations to test against (note that this makes the order of testing significant).

I only gave a single subset of citations that this code will recognize. Additional blocks for additional formats of citations would of course be added. This code should be enough, though, to point you in the right direction.

Hopefully this will help you see a solution for the further specifics of your problem (aka Hope this helps)

str8dn
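
One of the "additional blocks for additional formats" mentioned above could cover the grouped formats from the question, (Bloggs, 2000;Fred, 2001) and [Bloggs, 2000;Fred, 2001]. This is only a sketch of one way to do it; the helper name `parse_group` and the sample sentence are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: split groups such as "(Bloggs, 2000;Fred, 2001)" or
# "[Bloggs, 2000;Fred, 2001]" into [author, year] pairs.
sub parse_group {
    my ($text) = @_;
    my @pairs;
    # Find each bracketed or parenthesized group without nested brackets.
    while ($text =~ /[\[(]([^\[\]()]+)[\])]/g) {
        my $inside = $1;
        # Items inside one group are separated by semicolons.
        for my $item (split /\s*;\s*/, $inside) {
            push @pairs, [ $1, $2 ]
                if $item =~ /^\s*(\w+),\s*([0-9]{4})\s*$/;
        }
    }
    return @pairs;
}

my $line = 'as argued before (Bloggs, 2000;Fred, 2001) and again [Smith, 1999]';
print "Author: $_->[0]\tPublished: $_->[1]\n" for parse_group($line);
```

Items that do not look like "Name, year" are skipped silently, so ordinary parenthetical remarks in the text are ignored.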
 
LVL 51

Expert Comment

by:ahoffmann
ID: 10278866
According to the rules given in http:#9932352 and http:#9932356, I already gave a working solution
(str8dn, no offence to or criticism of your exhaustive example :-)
 

Author Comment

by:waleed072098
ID: 10648557

The only person who deserves the 400 points is str8dn.

What did ahoffmann help with other than asking for the rules?

Thanks

 
 
LVL 51

Expert Comment

by:ahoffmann
ID: 10652355
> What did ahoffman helpped with ..
giving working suggestions (according to the posted rules :)
 
LVL 51

Expert Comment

by:ahoffmann
ID: 10656422
agreed