gowthamprabakar

Segregating the contents of a Word document using Perl

Hello people,
I have attached a sample resume.
The task: I need to segregate the information in the document and store it in the respective tables in a DB. Each color represents a different kind of data; I have marked them up for your reference. The catch is that every resume has a different format. Can anyone help me write a Perl script that parses the document and segregates the data?
FYI, I am using Ubuntu Linux.
resume.doc
Bryan Butler

So, does it need to parse the Word document itself?  It would be easier if it were converted to a text document or some other format that is easier to parse.  Next, you will have a tough time parsing things that are on the same line, such as the phone and email in your doc. Email and phone wouldn't be hard to tell apart, but if this were a job title and description, it would be much harder.  I would start small and see what works.
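
To illustrate the point about email and phone sharing a line, here is a minimal sketch in Perl. The helper name `extract_contact`, the sample line, and both patterns are assumptions made for illustration; they are deliberately loose and would need tuning against real resumes.

```perl
use strict;
use warnings;

# Hypothetical helper: pull an email address and a phone number out of one line.
sub extract_contact {
    my ($line) = @_;
    my %found;

    # Email: word characters, dots, plus, or hyphen, then "@", then a dotted host.
    if ( $line =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/ ) {
        $found{email} = $1;
    }

    # Phone: optional "+", a digit, at least six digits or separators
    # (space, dot, hyphen), and a final digit.
    if ( $line =~ /(\+?\d[\d .-]{6,}\d)/ ) {
        $found{phone} = $1;
    }

    return \%found;
}

# Made-up sample line with both values on it.
my $contact = extract_contact('Phone: +91 98765 43210  Email: someone@example.com');
print "email: $contact->{email}\n" if $contact->{email};
print "phone: $contact->{phone}\n" if $contact->{phone};
```

Running each pattern separately over the same line is what makes the shared-line case easy: neither match depends on where the other value sits.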

As far as coding, maybe come up with as many "rules" as possible to start with and see what you can combine.

Parse lines and try to determine sections:
 - if 3 or 4 short lines = address
 - if a line at the top has 2 or 3 words, with no email and no phone/numbers = name
 - if line equals "WORK PROFILE" or "WORK" or "PREVIOUS EXPERIENCE" etc., then start of work section
 - if line equals "Education", "School", "Degrees", then start of education
 - etc.
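
The heading rules above could be sketched roughly like this in Perl. The keyword lists, the names `section_for` and `split_sections`, and the sample lines are all assumptions for illustration; real resumes will need many more headings and the address/name rules on top.

```perl
use strict;
use warnings;

# Assumed heading keywords; extend these as new resume formats turn up.
my %section_patterns = (
    work      => qr/^\s*(?:WORK(?:\s+PROFILE)?|PREVIOUS\s+EXPERIENCE|EXPERIENCE)\s*:?\s*$/i,
    education => qr/^\s*(?:EDUCATION|SCHOOL|DEGREES)\s*:?\s*$/i,
);

# Return the section a heading line starts, or undef if the line is not a heading.
sub section_for {
    my ($line) = @_;
    for my $name ( sort keys %section_patterns ) {
        return $name if $line =~ $section_patterns{$name};
    }
    return;
}

# Group non-blank lines under the most recent heading.
# Lines before the first heading are collected under "header".
sub split_sections {
    my (@lines) = @_;
    my %sections;
    my $current = 'header';
    for my $line (@lines) {
        my $name = section_for($line);
        if ( defined $name ) {
            $current = $name;
            next;
        }
        push @{ $sections{$current} }, $line if $line =~ /\S/;
    }
    return \%sections;
}

# Made-up sample input.
my $sections = split_sections(
    'John Smith',
    'WORK PROFILE',
    'Acme Corp, developer',
    'Education',
    'B.Sc. Computer Science',
);
print "$_: @{ $sections->{$_} }\n" for sort keys %$sections;
```

Keeping the rules in one hash means a new heading variant is a one-line change rather than another `if` branch.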

So basically try to find as many assumptions as you can, usually based on location, size, or specific wording, and then go from there.  There's nothing that will work 100% of the time since the resumes are all different.  This is something that recruiters, job boards, and companies use for their DBs/resumes, so there might be some samples on the web to work with.

ASKER

Thank you for the reply.
Can you give me some links that deal with the same kind of scenario :)
or any scripts ;)
cheers
Fero45

Save the Word resume.doc document as
* resume.odt
   and then save it again as
* resume.html
It is now much easier to parse using regular expressions or LWP
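
As a rough idea of what parsing the saved HTML with regular expressions can look like, here is a sketch that reduces markup to plain text lines. The function name `html_to_lines` and the sample markup are assumptions, and this is a crude tag stripper rather than a real HTML parser, so it will trip over unusual markup.

```perl
use strict;
use warnings;

# Hypothetical helper: turn an HTML string into a list of trimmed text lines.
sub html_to_lines {
    my ($html) = @_;

    # Source line breaks inside the markup carry no meaning; flatten them first.
    $html =~ s/\s+/ /g;

    # Treat <br> and the end of common block elements as line breaks.
    $html =~ s{<br\s*/?>|</(?:p|div|li|tr|h[1-6])>}{\n}gi;

    # Drop every remaining tag.
    $html =~ s{<[^>]+>}{}g;

    # Decode a few common entities (&amp; last, so "&amp;lt;" becomes "&lt;").
    $html =~ s/&nbsp;/ /g;
    $html =~ s/&lt;/</g;
    $html =~ s/&gt;/>/g;
    $html =~ s/&amp;/&/g;

    my @lines;
    for my $line ( split /\n/, $html ) {
        $line =~ s/^\s+|\s+$//g;              # trim both ends
        push @lines, $line if length $line;   # skip empty lines
    }
    return @lines;
}

# Made-up sample markup.
my @lines = html_to_lines(
    '<P><FONT COLOR="#0000ff">Organization: Acme Corp</FONT></P>'
  . '<P>Skills: Perl &amp; SQL</P>'
);
print "$_\n" for @lines;
```

Once the document is a list of plain lines, the per-line rules and patterns discussed in this thread can be applied without worrying about tags.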
Thank you fero.
Do I need to use a program like catdoc
for doing these conversions?
Can you give me sample code that extracts details from the attached file?
cheers
ASKER CERTIFIED SOLUTION
Fero45

If you have to convert a bunch of them, then a converter is the way to go.  There are many of them out there and you will probably need one that keeps the formatting.  

This is a huge topic.  A search for "resume parsing" found http://www.akken.com/resume-parsing.php and adding 'free' found http://www.handyarchive.com/free/parsing-tool/.  Something like that may be the way to go, as it is a major project to make a script handle all the resume formats.
Thanks a lot fero.
If you don't mind, can you explain exactly what the code is doing?
A line-by-line explanation would be great.
Also, how do I use an "or" operator?
 ex: organization or company name
      technical skills or skill sets

I think I am troubling you a lot ;)
cheers
This is probably not the best solution because, like I said, I do not know what info you need. So...

http://www.perl.com/doc/manual/html/pod/perlre.html

and other good websites
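# The trailing numbers are the line numbers used in the notes that follow.
use strict;
use warnings;

open my $resume, '<', 'resume.html' or die "Cannot open resume.html: $!";   # 1
open my $into,   '>', 'into.txt'    or die "Cannot open into.txt: $!";      # 2

while (<$resume>) {                      # 3
        chomp;                           # 4
        if( $_ =~ m/Organization/ )     {   # 5
                $_ =~ s/<\/FONT><\/P>//g;   # 6
                print {$into} $_ . "\n";    # 7
        }                                # 8
}                                        # 9

close $resume;
close $into or die "Cannot close into.txt: $!";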
 
# --------------------- Line:
1 opens resume.html for reading
2 opens a new text file into.txt for writing (that is what the '>' means)
3 loops through the document resume.html line by line
4 cuts off the last character = newline \n
5 Regular expressions (regex):
 $_ is a special Perl variable; here it holds the current line
 =~ is the binding operator: it applies the regex on its right to the variable on its left
  m - matching, i.e. if the word 'Organization' is in this line ...
6 </FONT></P> is at the end of the line and we need to get rid of it. Since / is the regex delimiter, we need to escape it using \. "Escape" means that we want the slash treated as a literal character and not as the end of the pattern.
7 prints the line ($_) into the text file into.txt
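
On the "or" question asked earlier in the thread: in a Perl regex, "or" is written as alternation with `|`, usually inside a non-capturing group `(?: ... )`. The sketch below is only an illustration; the label variants, the name `label_kind`, and the sample lines are assumptions.

```perl
use strict;
use warnings;

# Assumed label variants; "|" inside the group means "or".
my $org_label   = qr/\b(?:Organization|Company\s+Name)\b/i;
my $skill_label = qr/\b(?:Technical\s+Skills|Skill\s+Sets?)\b/i;

# Return which kind of label a line carries, or undef if it has neither.
sub label_kind {
    my ($line) = @_;
    return 'organization' if $line =~ $org_label;
    return 'skills'       if $line =~ $skill_label;
    return;
}

# Made-up sample lines.
for my $line ( 'Company Name: Acme Corp', 'Skill sets: Perl, SQL', 'Hobbies: chess' ) {
    my $kind = label_kind($line);
    print defined($kind) ? "$kind <- $line\n" : "no match <- $line\n";
}
```

In the script above, the same idea would replace `m/Organization/` with something like `m/Organization|Company\s+Name/`.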


Thank you fero for your guidance
great work :)
cheers
Hi fero,
thanks for the code, it works :)
Let me explain the scenario: I am new to Perl and don't know anything about it,
but they have given me a job to do.
The job is to extract as much information as possible from resume.doc and display it in the
form elements of a .php file, so that the user can verify the details.
Can you please help me get this job done?
cheers