Segregating a Word document using Perl

Posted on 2009-02-10
Last Modified: 2012-05-06
Hello people,
I have attached a sample resume.
The task is to segregate the information in the document and store it in the respective tables in a DB. In the attached sample I have marked each kind of data with a different color for your reference. The catch is that every resume has a different format. Can anyone help me write a Perl script that parses the document and segregates the data?
FYI, I am using Ubuntu Linux.
Question by:gowthamprabakar

    Expert Comment

    by:Bryan Butler
    So, does it need to parse the Word document itself?  It would be easier to parse if it were converted to a text document or some other simpler format.  Next, you will have a tough time parsing things that are on the same line, such as the phone and email in your doc. Email and phone wouldn't be hard to tell apart, but if it were job title and description, it would be much harder.  I would start small and see what works.
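
As a rough illustration of why email and phone on the same line are the easy case, here is a minimal Perl sketch. The helper name `extract_contacts`, both patterns, and the sample line are assumptions made for illustration; the patterns are deliberately loose and are not a full validator.

```perl
use strict;
use warnings;

# Hypothetical helper: pull an email address and a phone number out of one line.
# Both patterns are loose assumptions and will need tuning on real resumes.
sub extract_contacts {
    my ($line) = @_;
    my %found;

    # Something that looks like user@host.tld
    if ($line =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/) {
        $found{email} = $1;
    }

    # An optional leading +, then at least 8 digits, with spaces, dots or dashes allowed
    if ($line =~ /(\+?\d[\d\s.-]{6,}\d)/) {
        $found{phone} = $1;
    }

    return \%found;
}

# Made-up sample line with both values side by side
my $contacts = extract_contacts('Mobile: +91 98765 43210   Email: someone@example.com');
print "$_ => $contacts->{$_}\n" for sort keys %$contacts;
```

Two free-text fields on one line, such as job title and description, have no distinctive shape like this, which is why they are much harder to separate.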

    As far as coding, maybe come up with as many "rules" as possible to start with and see what you can combine.

    Parse the lines and try to determine sections:
     - if there are 3 or 4 short lines in a row = address
     - if a line is at the top and has 2 or 3 words, with no email and no phone/numbers = name
     - if a line equals "WORK PROFILE" or "WORK" or "PREVIOUS EXPERIENCE" etc., then it is the start of the work section
     - if a line equals "Education", "School" or "Degrees", then it is the start of the education section
     - etc.

    So basically, try to find as many assumptions as you can, usually based on location, size, or specific wording, and then go from there.  Nothing will work 100% of the time since the resumes are all different.  This is something that recruiters, job boards, and companies do for their DBs/resumes, so there might be some samples on the web to work with.
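
The heading rules above could be sketched in Perl roughly as follows. This is only one possible arrangement: the rule table `@section_rules`, the function `split_sections`, the `header` bucket, and the sample lines are all hypothetical, and the keyword lists would have to grow as more resume formats turn up.

```perl
use strict;
use warnings;

# Hypothetical heading table: section name => pattern for a line that starts it.
# The keyword lists are assumptions taken from the rules listed above.
my @section_rules = (
    [ work      => qr/^\s*(?:work(?:\s+profile)?|previous\s+experience)\s*:?\s*$/i ],
    [ education => qr/^\s*(?:education|school|degrees)\s*:?\s*$/i ],
);

# Walk the lines and switch section whenever a heading rule matches.
sub split_sections {
    my (@lines) = @_;
    my %sections;
    my $current = 'header';    # everything before the first heading

    LINE: for my $line (@lines) {
        next if $line =~ /^\s*$/;    # skip blank lines
        for my $rule (@section_rules) {
            my ($name, $pattern) = @$rule;
            if ($line =~ $pattern) {
                $current = $name;    # heading line itself is not stored
                next LINE;
            }
        }
        push @{ $sections{$current} }, $line;
    }
    return \%sections;
}

# Made-up input standing in for the text of a resume
my $sections = split_sections(
    'Jane Example',
    'WORK PROFILE',
    'Developer, Example Corp',
    'Education',
    'B.Sc. Computer Science',
);
print "$_: @{ $sections->{$_} }\n" for sort keys %$sections;
```

Keeping the rules in a table rather than in a chain of `if` statements makes it cheap to add a new heading variant each time a resume breaks the parser.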

    Author Comment

    Thank you for the reply
    Can you give me some links that deal with the same kind of scenario :)
    or any scripts ;)

    Expert Comment

    Save the Word resume.doc document as
    * resume.odt
       and then save it again as
    * resume.html
    It is now much easier to parse using regular expressions or LWP.
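
To show what regular expressions can do with the exported HTML, here is a minimal sketch that reduces one line of markup to plain text. The helper name `strip_tags` and the sample markup (including the `COLOR` attribute) are assumptions; a regex like this is only adequate for simple, flat markup and is not a general HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce one line of exported HTML to plain text.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;      # drop anything that looks like a tag
    $html =~ s/&nbsp;/ /g;      # decode a few common entities
    $html =~ s/&lt;/</g;
    $html =~ s/&gt;/>/g;
    $html =~ s/&amp;/&/g;       # decode &amp; last to avoid double decoding
    $html =~ s/\s+/ /g;         # collapse runs of whitespace
    $html =~ s/^\s+|\s+$//g;    # trim both ends
    return $html;
}

# Assumed shape of a line in resume.html
print strip_tags('<P><FONT COLOR="#0000ff">Organization : Mahizham InfoTech - Madurai</FONT></P>'), "\n";
```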

    Author Comment

    Thank you fero
    Should I use a program like catdoc
    for these conversions?
    Can you give me sample code that extracts details from the attached file?

    Accepted Solution

    I do not know what information you want to extract. The following is just an idea. It creates a file into.txt and saves each line that contains the word 'Organization', as follows:

     Organization                  :      Micro village Communications Pvt Ltd. - Bangalore
     Organization                  :      Mahizham InfoTech - Madurai
     Organization                  :      Wandoz Web Services Pvt Ltd. - Bangalore

    What should the final extract look like?
    It's close to midnight. Have a nice day. See you tomorrow.
    use warnings;
    use strict;
    open RESUME, 'resume.html' or die $!;
    open INTO, '>into.txt' or die $!;
    while(<RESUME>)	{
    	chomp;
    	if( $_ =~ m/Organization/ )	{
    		$_ =~ s/<\/FONT><\/P>//g;
    		print INTO $_ . "\n";
    	}
    }


    Expert Comment

    by:Bryan Butler
    If you have to convert a bunch of them, then a converter is the way to go.  There are many of them out there and you will probably need one that keeps the formatting.  

    This is a huge topic.  A search for "resume parsing", with or without 'free' added, turns up existing parsers.  Something like that may be the way to go, as it is a major project to make it handle all the resume formats.

    Author Comment

    Thanks a lot, fero.
    If you don't mind, can you explain what exactly the code is doing?
    A line-by-line explanation would be great.
    Also, how do I use an "or" operator? For example:
     organization or company name
     technical skills or skill sets

    I think I am troubling you a lot ;)

    Expert Comment

    This is probably not the best solution because, like I said, I do not know what info you need. So...

    and other good websites
    1 open RESUME, 'resume.html' or die $!;
    2 open INTO, '>into.txt' or die $!;
    3 while(<RESUME>) {
    4        chomp;
    5        if( $_ =~ m/Organization/ )     {
    6                $_ =~ s/<\/FONT><\/P>//g;
    7                print INTO $_ . "\n";
    8        }
    9 }
    # --------------------- Line:
    1 opens resume.html for reading
    2 creates a new text file into.txt for writing; the leading > in '>into.txt' means "write"
    3 loops through the document resume.html line by line
    4 cuts off the trailing newline (\n)
    5 Regular expressions (regex):
     $_ is Perl's default variable; here it holds the current line
     =~ is the binding operator: it applies the regex on its right to the string on its left
      m - matching, i.e. if the word 'Organization' is in this line ...
    6 </FONT></P> is at the end of the line and we need to get rid of it, so s/// replaces it with nothing. Since / is the regex delimiter, we need to escape it using \. "Escape" means that the slash is treated as a literal character rather than as the delimiter.
    7 prints the line ($_) into the text file into.txt
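
On the "or" question: inside a regex, alternatives are separated with `|`. Here is a minimal sketch, assuming the resume lines have the form `Label : value`. The table `%label_for`, the function `parse_field`, and the exact label spellings are hypothetical and only illustrate the idea.

```perl
use strict;
use warnings;

# Hypothetical label table: canonical field => alternation of labels seen in resumes.
# The | inside each pattern means "or"; the /i flag ignores case.
my %label_for = (
    organization => qr/Organi[sz]ation|Company\s+Name/i,
    skills       => qr/Technical\s+Skills|Skill\s+Sets?/i,
);

# Return (field, value) for a "Label : value" line, or an empty list if no label matches.
sub parse_field {
    my ($line) = @_;
    for my $field (sort keys %label_for) {
        # (?: ... ) groups the alternatives so that ^ and : apply to all of them
        if ($line =~ /^\s*(?:$label_for{$field})\s*:\s*(.+?)\s*$/) {
            return ($field, $1);
        }
    }
    return;
}

my ($field, $value) = parse_field(' Organization                  :      Mahizham InfoTech - Madurai');
print "$field = $value\n";
```

Mapping several labels to one canonical field name means the database code only has to know about `organization` and `skills`, however each resume happens to spell them.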


    Author Comment

    Thank you fero for your guidance
    great work :)

    Author Comment

    Hi fero,
    Thanks for the code, it works :)
    Let me explain the scenario. I am new to Perl and don't know anything about it,
    but I have been given this job to do.
    The job is to extract as much information as possible from resume.doc and display it in the
    form elements of a .php page so that the user can verify the details.
    Can you please help me get this job done?
