Perl script to extract citations

I am looking for a Perl script to extract citations from text files. It needs to
identify things like:

Maclean, D.J.H. (1996)
Bloggs (1993)
Bloggs et al (2000)
Bloggs et. al. (2000)
Bloggs and Fred (1985)
Bloggs, Fred and George (2001)
...(Bloggs, 2000;Fred, 2001)
...[Bloggs, 2000;Fred, 2001]

Please let me know where to find a script like this.

Thanks
waleed072098Asked:
jmcgOwnerCommented:
Unless you can come up with a set of rules for identifying what is a citation and what is not, based entirely on the textual presentation, a Perl script is not going to be of much help. Other than four digits followed by a ) or ], I don't see any regularity sufficient to locate a citation, and identifying exactly where the citation starts looks tricky.
0
ahoffmannCommented:
could you also please post what you have done so far
0
waleed072098Author Commented:
ahoffmann :-

Post what? I have not done anything. I am looking for a script that performs the above requirements.

Thanks

I am doubling points
0
ahoffmannCommented:
hmm, and this is not a homework, somehow?
0
waleed072098Author Commented:
I am a geography lecturer, my friend. I publish papers all the time and I need this tool.
Thanks
0
Kim RyanIT ConsultantCommented:
I have created a module for identifying people's names:
http://search.cpan.org/~kimryan/Lingua-EN-NameParse-1.18/NameParse.pm
This module has been used for an application very similar to yours.
However, the problem is quite complex, and it takes a fair amount of
tuning to get good results.
0
ahoffmannCommented:
ok, then we're back to jmcg's question:
  please give possible lines, and the rules how to detect the number
0
waleed072098Author Commented:
Somebody came up with this:

(?i)\w+\s*(et\.* al\.*)\s*\([12][4567890]\d\d\)
(?i)\w+\s*\([12][4567890]\d\d\s*\,+\s*\w*\)
(?i)\w+\s*\([12][4567890]\d\d\)
(?i)\w*\s(and*)\s\w+\s*\(*[12][4567890]\d\d\)*
(?i)\w+\s*\,*\s*[12][4567890]\d\d\s*;
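
One way to try these patterns out is to loop over them and collect every match from a piece of text. The sketch below is only a test harness: the sub name find_candidates and the sample sentence are made up for illustration, and the five patterns are copied unchanged from the list above.

```perl
use strict;
use warnings;

# The five patterns exactly as posted above.
my @patterns = (
    qr/(?i)\w+\s*(et\.* al\.*)\s*\([12][4567890]\d\d\)/,
    qr/(?i)\w+\s*\([12][4567890]\d\d\s*\,+\s*\w*\)/,
    qr/(?i)\w+\s*\([12][4567890]\d\d\)/,
    qr/(?i)\w*\s(and*)\s\w+\s*\(*[12][4567890]\d\d\)*/,
    qr/(?i)\w+\s*\,*\s*[12][4567890]\d\d\s*;/,
);

# Hypothetical helper: return every substring matched by any pattern.
sub find_candidates {
    my ($text) = @_;
    my @found;
    for my $re (@patterns) {
        # The outer parentheses capture the whole match as $1.
        while ( $text =~ /($re)/g ) {
            push @found, $1;
        }
    }
    return @found;
}

print "$_\n" for find_candidates("as shown by Bloggs et al (2000) and others");
```

Because each pattern is run separately, the same citation can be reported more than once. In the sample sentence the third pattern also matches the tail of the "et al" citation, so the results would need de-duplicating or the patterns would need ordering.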

0
ahoffmannCommented:
\w+,?\s+((et\.?\s*al\.?)|and|\w+\s+and)\s+\(([12][4567890]\d\d)\)

# does not match your last 2 lines, because you still gave no rules
0
waleed072098Author Commented:
These are rules, aren't they? What rules? I can't come up with better.
0
ahoffmannCommented:
if you don't have rules how to identify citations, then every answer is valid.
What are you then asking for?
0
waleed072098Author Commented:
I think that we could safely
assume that a citation looks like

* (1???)
(*,1???)

which is fairly unlikely to be anything other than a citation (in most
texts). The difficulty is extracting the author names when there may be
one, two or three of them before the bracket.


Thanks

0
waleed072098Author Commented:
...........and

* (2???)
(*,2???)
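
Read as regular expressions, these two wildcard rules might look like the sketch below. It only locates the years and does not yet attempt the author names. The sub name year_markers is invented for the example, and reading "?" as "one digit" is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the years found by the two rules.
sub year_markers {
    my ($text) = @_;
    my @years;

    # "* (1???)" and "* (2???)": a four-digit year alone in parentheses.
    push @years, $1 while $text =~ /\(([12]\d{3})\)/g;

    # "(*,1???)" and "(*,2???)": anything, a comma, then the year,
    # all inside one pair of parentheses.
    push @years, $1 while $text =~ /\([^()]*,\s*([12]\d{3})\)/g;

    return @years;
}

print join(", ", year_markers("Bloggs (1993) said so (Fred, 2001).")), "\n";
```

Note that for a grouped citation such as (Bloggs, 2000;Fred, 2001) the second pattern only captures the final year, so grouped citations would need to be split on ";" first.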

0
ahoffmannCommented:
ok, so this is not a citation:
...[Bloggs, 2000;Fred, 2001]

s#[,\s\(]([12]\d\d\d)\).*#$1#;
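
To see what this substitution does to a line, it can be wrapped in a small sub and run on a few samples. This is just a sketch: the sub name strip_to_year and the sample strings are made up, and the substitution itself is unchanged.

```perl
use strict;
use warnings;

# Hypothetical wrapper: apply the substitution to a copy of the line.
sub strip_to_year {
    my ($line) = @_;
    # Replace "<comma, space or paren>YEAR)" and everything after it
    # with the bare year.
    ( my $copy = $line ) =~ s#[,\s\(]([12]\d\d\d)\).*#$1#;
    return $copy;
}

for my $sample ( "Bloggs (1993) showed", "(Bloggs, 2000)", "[Bloggs, 2000;Fred, 2001]" ) {
    print strip_to_year($sample), "\n";
}
```

The square-bracket form is left untouched because the pattern requires a closing parenthesis after the year.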
0
str8dnCommented:
I find it easier in multiple case scenarios such as this to handle it in pieces.
This way additional cases (formats) can be addressed easily as they arise, and we get less confused than we would with one giant expression.

For example
my (@cites) = (
      "Maclean, D.J.H. (1996) ",
      "Bloggs (1993) ",
      "Bloggs et al (2000) ",
      "Bloggs et. al. (2000) ",
      "Bloggs and Fred (1985) ",
      "Bloggs, Fred and George (2001) ",
      "(Bloggs, 2000;Fred, 2001) ",
      "[Bloggs, 2000;Fred, 2001] ",
);

my (@authors, @initials, @pubyear);
foreach (@cites) {
      if ( /^(\w+),\s*([A-Z]\.[A-Z]\.[A-Z]\.)\s*\(([0-9]{4})\)/ ) {
            # formats like Maclean, D.J.H. (1996)
            push(@authors,  $1);
            push(@initials, $2);
            push(@pubyear,  $3);
      }
      elsif ( /^(\w+)\s*\(([0-9]{4})\)/ ) {
            # formats like Bloggs (1993)
            push(@authors,  $1);
            push(@initials, "");
            push(@pubyear,  $2);
      }
      elsif ( /^(\w+)\s*et\.*\sal\.*\s*\(([0-9]{4})\)/ ) {
            # formats like Bloggs et al (2000)
            #           or Bloggs et. al. (2000)
            push(@authors,  $1);
            push(@initials, "");
            push(@pubyear,  $2);
      }
      elsif ( /^[\[\(](\w+),\s*([0-9]{4});(\w+),\s*([0-9]{4})[\]\)]/ ) {
            # formats like (Bloggs, 2000;Fred, 2001)
            #               or [Bloggs, 2000;Fred, 2001]
            # for first author
            push(@authors,  $1);
            push(@initials, "");
            push(@pubyear,  $2);
      
            # for second author
            push(@authors,  $3);
            push(@initials, "");
            push(@pubyear,  $4);
      }
      else {
            print "Uninstituted format - add code appropriately: $_\n";
      }
}
      
my $i;
for( $i = 0; $i <= $#authors; $i++) {
      print "Author: $authors[$i]";
      if ($initials[$i])  {
            print ", $initials[$i]";
      }
      print "\t";
      print "Published: $pubyear[$i]\n";
      print "--\n";
}

Of course, you would use your own data structure, but this gives the general idea.
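
The two "and" formats are the ones the code above reports as not yet handled. A possible pair of extra cases is sketched below as a standalone sub. The name parse_and_citation and the returned hash layout are assumptions for the example, not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: parse the two "and" formats.
# Returns a hash reference, or nothing if neither format matches.
sub parse_and_citation {
    my ($cite) = @_;
    if ( $cite =~ /^(\w+),\s*(\w+)\s+and\s+(\w+)\s*\(([0-9]{4})\)/ ) {
        # formats like Bloggs, Fred and George (2001)
        return { authors => [ $1, $2, $3 ], year => $4 };
    }
    elsif ( $cite =~ /^(\w+)\s+and\s+(\w+)\s*\(([0-9]{4})\)/ ) {
        # formats like Bloggs and Fred (1985)
        return { authors => [ $1, $2 ], year => $3 };
    }
    return;
}

for my $cite ( "Bloggs and Fred (1985) ", "Bloggs, Fred and George (2001) " ) {
    my $parsed = parse_and_citation($cite) or next;
    print join( " / ", @{ $parsed->{authors} } ), "\t$parsed->{year}\n";
}
```

In the if/elsif chain above, the same two patterns could be added as further elsif branches, pushing one entry per author.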


0
str8dnCommented:
The code will give you something like the following as output:

Uninstituted format - add code appropriately: Bloggs and Fred (1985)
Uninstituted format - add code appropriately: Bloggs, Fred and George (2001)
Author: Maclean, D.J.H.      Published: 1996
--
Author: Bloggs      Published: 1993
--
Author: Bloggs      Published: 2000
--
Author: Bloggs      Published: 2000
--
Author: Bloggs      Published: 2000
--
Author: Fred      Published: 2001
--
Author: Bloggs      Published: 2000
--
Author: Fred      Published: 2001
--

Of course, the original identification of a "cite" within text is still up to you, although
you would use something similar to the case-based scenario above: basically remove the
^ (start-of-line requirement) and run it on whole text lines until you get a working set.

This is all of course assuming that the much simpler 'accessing the references page' is not
an option...
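
As a rough illustration of dropping the ^ and scanning whole lines, the sketch below folds the "Bloggs (1993)" and "Bloggs et al (2000)" cases into one pattern and applies it repeatedly with /g. The sub name scan_line and the sample sentence are invented for the example, and only these two formats are covered.

```perl
use strict;
use warnings;

# Hypothetical helper: find every "Author (YEAR)" or
# "Author et al (YEAR)" occurrence anywhere in a line.
sub scan_line {
    my ($line) = @_;
    my @found;
    # No ^ anchor, so the match may start anywhere; /g moves on
    # to the next occurrence on each pass through the loop.
    while ( $line =~ /(\w+)(?:\s+et\.*\sal\.*)?\s*\(([0-9]{4})\)/g ) {
        push @found, { author => $1, year => $2 };
    }
    return @found;
}

for my $hit ( scan_line("As Bloggs et al (2000) and Fred (1985) argue") ) {
    print "Author: $hit->{author}\tPublished: $hit->{year}\n";
}
```

Without the anchor, any word followed by a four-digit number in parentheses will be picked up, so some false positives are to be expected.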
0
ahoffmannCommented:
str8dn, in which way should your suggestion help, if you first need to convert the citations from text file into a perl array?
0
str8dnCommented:
The citations themselves are simply groupings of words and numbers in
a finite set of recognizable formats (the point of the code I offered, and of trying to use regular expressions in the first place).

By handling the text lines in the same manner - creating a finite set of regular expressions (one for each of the recognized citation types) - the citations themselves can be pulled from lines.

The code I wrote for waleed not only handles the post-processing (citation parsing), it also provides the foundation on which the larger regular expressions that recognize a citation within normal text can be created.

Here is more detail - note that I have not debugged this code - this code is written to give the methodology of a possible solution...


use strict;
use warnings;

# Assumption: the text comes from files named on the command line (or STDIN).
my @lines_of_text = <>;
chomp @lines_of_text;

my (@authors, @initials, @pubyear);
my $maxNeeded = 5;         # maximum number of words needed from
                           # prior to the marker
my @words_of_line_before;  # words of the previous line, in case a
                           # citation is split across two lines

foreach my $line (@lines_of_text) {
      my @words = split ' ', $line;

      foreach my $wordCount (0 .. $#words) {

            if ( $words[$wordCount] =~ /\([0-9]{4}\)/ ) {

                  # There is a year of format "(XXXX)" in this line.
                  # This could be from one of the following citations
                  # (of those listed initially):
                  #
                  # Maclean, D.J.H. (1996)
                  # Bloggs (1993)
                  # Bloggs et al (2000)
                  # Bloggs et. al. (2000)
                  # Bloggs and Fred (1985)
                  # Bloggs, Fred and George (2001)
                  #

                  # Get a subset of words before the year marker on which
                  # to test the regular expressions for the citations
                  # - this avoids complications arising from false
                  # positives created by matches elsewhere in the
                  # same line.
                  my $first = $wordCount > $maxNeeded
                        ? $wordCount - $maxNeeded : 0;
                  my @words_to_test = @words[$first .. $wordCount];

                  # If not enough words were retrieved, and there was a
                  # line read before this one, add the last words of
                  # the past line for testing.
                  my $missing = $maxNeeded + 1 - @words_to_test;
                  $missing = @words_of_line_before
                        if $missing > @words_of_line_before;
                  unshift @words_to_test,
                        @words_of_line_before[-$missing .. -1]
                        if $missing > 0;

                  # Test the words as one string. Each pattern ends in
                  # \S*$ so it must finish at the year marker just found.
                  my $candidate = join ' ', @words_to_test;

                  if ( $candidate =~
                  /(\w+),\s*([A-Z]\.[A-Z]\.[A-Z]\.)\s*\(([0-9]{4})\)\S*$/ ) {
                        # formats like Maclean, D.J.H. (1996)
                        push(@authors,  $1);
                        push(@initials, $2);
                        push(@pubyear,  $3);
                  }
                  elsif ( $candidate =~
                        /(\w+)\s*et\.*\sal\.*\s*\(([0-9]{4})\)\S*$/ ) {
                        # formats like Bloggs et al (2000)
                        # formats like Bloggs et. al. (2000)
                        push(@authors,  $1);
                        push(@initials, "");
                        push(@pubyear,  $2);
                  }
                  # A branch for formats like Bloggs, Fred and George (2001)
                  # belongs here. Until it is added, such citations fall
                  # through to the next branch and yield only the last author.
                  elsif ( $candidate =~ /(\w+)\s*\(([0-9]{4})\)\S*$/ ) {
                        # formats like Bloggs (1993)
                        push(@authors,  $1);
                        push(@initials, "");
                        push(@pubyear,  $2);
                  }
                  else {
                        print "Uninstituted format - add code",
                              " appropriately: $candidate\n";
                  }
            }
      }
      @words_of_line_before = @words;
}

# Report what was found.
foreach my $i (0 .. $#authors) {
      print "Author: $authors[$i]";
      print ", $initials[$i]" if $initials[$i];
      print "\tPublished: $pubyear[$i]\n";
}

0

waleed072098Author Commented:
str8dn:-

Thanks for the help, but I can't understand why you hardcoded the citations:

 #      
               # Maclean, D.J.H. (1996)
                    # Bloggs (1993)
                    # Bloggs et al (2000)
                    # Bloggs et. al. (2000)
                    # Bloggs and Fred (1985)
                    # Bloggs, Fred and George (2001)

                         # formats like Maclean, D.J.H. (1996)

My question was :
I am looking for a Perl script to extract citations from text files: any citations, in any text files.
Thanks

0
ahoffmannCommented:
waleed, when will you give us the rules for what citations look like?
You already have working solutions for your posted examples.
0
waleed072098Author Commented:
You are right Mr. ahoffmann but please answer your own question:

>str8dn, in which way should your suggestion help, if you first need to convert >the citations from text file into a perl array?


Thanks
Points increased to 600
0
ahoffmannCommented:
grrr, is this a guessing contest?
Again: what are the rules to identify citations?
0
str8dnCommented:


Hi waleed,

Those "hard coded lines" that start with '#' are actually comments. They let you know that the block of code they are in is designed to discover those types of citations within any line of text. Additional blocks can be added as necessary to recognize additional citation formats as you discover them (since what you gave us originally is by no means a definitive list).

If this is your first Perl program, you've picked a tough one to start with, so be ready for some long hours of reading and trial and error. I would definitely suggest you begin with "Programming Perl" from O'Reilly.

The code I offered is basically a rough design to pull citations directly from any amount of text.

As you take a look you will notice that it assumes you can break the full text up into separate lines (@lines_of_text). If lines are not easily delineated, then you can simply break it up directly into words and change the initial foreach loop accordingly.

Words are then pulled from the lines (into @words) to further break up the text and offer additional information for the citation recognition.  The first set of citations (first if-block) are recognized because all have a year in parens like (XXXX) - notice the corresponding if statement.

Each of the first set of citations is then further tested by regular expressions.  Either one of them will be recognized or the code will continue on to another set of citations to test against (note that this makes the order of testing significant.)
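
A small sketch of why the order matters is given below. It keeps the patterns in an ordered list and returns the first one that matches. The names @formats and classify are invented for the illustration, and only two of the formats are included.

```perl
use strict;
use warnings;

# Ordered list of [ name, pattern ]: more specific formats come first.
my @formats = (
    [ 'et al'  => qr/(\w+)\s+et\.*\sal\.*\s*\(([0-9]{4})\)/ ],
    [ 'single' => qr/(\w+)\s*\(([0-9]{4})\)/ ],
);

# Hypothetical helper: return (format name, author, year) for the
# first pattern that matches, or an empty list if none does.
sub classify {
    my ($text) = @_;
    for my $format (@formats) {
        my ( $name, $re ) = @$format;
        return ( $name, $1, $2 ) if $text =~ $re;
    }
    return;
}

print join( " | ", classify("Bloggs et al (2000)") ), "\n";
print join( " | ", classify("Bloggs (1993)") ), "\n";
```

If the two entries were swapped, the 'single' pattern would match first on "Bloggs et al (2000)" and report "al" as the author, which is why the more specific format is tested first.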

I only gave a single subset of citations that this code will recognize.  Additional blocks for additional formats of citations would of course be added.  This code should be enough though to point you in the right direction.

Hopefully this will help you see a solution for the further specifics of your problem (aka Hope this helps)

str8dn
0
ahoffmannCommented:
According to the rules given in http:#9932352 and http:#9932356, I already gave a working solution
(str8dn, no offence or criticism of your exhaustive example :-)
0
waleed072098Author Commented:

The only person who deserves the 400 points is str8dn.

What did ahoffmann help with, other than asking for the rules?

Thanks


0
ahoffmannCommented:
> What did ahoffmann help with ..
giving working suggestions (according to the posted rules :)
0
ahoffmannCommented:
agreed
0