
  • Status: Solved

Segregating the contents of a Word document using Perl

Hello everyone,
I have attached a sample resume.
The task is to segregate the information in the document and store it in the corresponding tables in a database. Each color in the sample marks a different kind of data; I highlighted them for your reference. The catch is that every resume has a different format. Can anyone help me write a Perl script that parses the document and segregates the data?
FYI, I am using Ubuntu Linux.
Bryan Butler Commented:
So, does it need to parse the Word document itself?  It would be easier to parse if it were converted to a text document or some other simpler format.  Next, you will have a tough time parsing things that are on the same line, such as the phone and email in your doc; email/phone wouldn't be hard to tell apart, but if this were job title and description, it would be much harder.  I would start small and see what works.
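
For the email/phone case, here is a minimal sketch of pulling both out of one line. The sub name contact_details and both patterns are assumptions made for illustration; real resumes will need looser or stricter patterns, especially for phone numbers.

```perl
use strict;
use warnings;

# Pull an email address and a phone number out of a single line.
# Both patterns are rough guesses and will not cover every format.
sub contact_details {
    my ($line) = @_;
    # something@host.tld
    my ($email) = $line =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/;
    # optional +, then at least 9 digits mixed with spaces, dots, dashes, brackets
    my ($phone) = $line =~ /(\+?\d[\d\s().-]{7,}\d)/;
    return { email => $email, phone => $phone };
}
```

Either value is undef when nothing on the line looks like an email or a phone number, so the caller can test each one separately.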

As far as coding, maybe come up with as many "rules" as possible to start with and see what you can combine.

Parse the lines and try to determine sections:
 - if there are 3 or 4 short lines in a row = address
 - if a line at the top has 2 or 3 words, with no email and no phone/numbers = name
 - if a line equals "WORK PROFILE", "WORK", "PREVIOUS EXPERIENCE", etc., then start of work section
 - if a line equals "Education", "School", "Degrees", then start of education
 - etc.

So basically try to find as many assumptions as you can, usually based on location, size, or specific wording, and then go from there.  There's nothing that will work 100% of the time since the resumes are all different.  This is something that recruiters, job boards, and companies use for their DBs/resumes, so there might be some samples on the web to work with.
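
The heading rules above could be sketched roughly like this. It is only an illustration: the names heading_for, section_heading and split_sections are made up here, the heading words are the ones listed in the rules, and the "header" bucket for everything before the first heading is an assumption.

```perl
use strict;
use warnings;

# Heading words taken from the rules above; extend the lists as new
# resume formats turn up.
my %heading_for = (
    work      => qr/^\s*(?:WORK PROFILE|WORK|PREVIOUS EXPERIENCE)\s*:?\s*$/i,
    education => qr/^\s*(?:EDUCATION|SCHOOL|DEGREES)\s*:?\s*$/i,
);

# Return the section name that a heading line starts, or undef otherwise.
sub section_heading {
    my ($line) = @_;
    for my $section (sort keys %heading_for) {
        return $section if $line =~ $heading_for{$section};
    }
    return;
}

# Group lines under the most recent heading. Lines before the first
# heading go into a "header" bucket (name, address, contact details).
sub split_sections {
    my @lines = @_;
    my %sections;
    my $current = 'header';
    for my $line (@lines) {
        next if $line =~ /^\s*$/;            # skip blank lines
        my $heading = section_heading($line);
        if (defined $heading) {
            $current = $heading;             # a new section starts here
            next;
        }
        push @{ $sections{$current} }, $line;
    }
    return %sections;
}
```

Calling split_sections on the lines of a converted resume returns a hash of array references, one per section, which can then be checked by further rules (name, address) or written to the matching DB table.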
gowthamprabakar (Author) Commented:
Thank you for the reply.
Can you give me some links that deal with the same kind of scenario :)
or any scripts ;)
fero Commented:
Save the Word document resume.doc as
* resume.odt
   and then save it again as
* resume.html
It is now much easier to parse using regular expressions or an HTML parsing module such as HTML::Parser (LWP itself only fetches documents).
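
Once the resume is saved as HTML, a first step could be to reduce it to plain text lines. The sketch below does that with regular expressions only; the sub name html_to_lines is made up for illustration, and it assumes the simple markup a word processor exports, not arbitrary web pages. For anything more complex, a proper HTML parsing module is the safer choice.

```perl
use strict;
use warnings;

# Turn an HTML string into a list of non-empty plain-text lines.
# Naive on purpose: it assumes simple exported markup, not arbitrary HTML.
sub html_to_lines {
    my ($html) = @_;
    $html =~ s/<(?:script|style)\b.*?<\/(?:script|style)>//gis;  # drop non-text blocks
    $html =~ s/<\/(?:p|div|tr|li|h[1-6])>|<br\s*\/?>/\n/gi;      # block ends become newlines
    $html =~ s/<[^>]*>//g;                                       # remove the remaining tags
    $html =~ s/&nbsp;/ /g;                                       # two common entities only
    $html =~ s/&amp;/&/g;
    my @lines;
    for my $line (split /\n/, $html) {
        $line =~ s/^\s+|\s+$//g;     # trim both ends
        $line =~ s/\s+/ /g;          # collapse runs of whitespace
        push @lines, $line if length $line;
    }
    return @lines;
}
```

The returned lines can then be tested one by one with whatever rules fit the resumes at hand.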


gowthamprabakar (Author) Commented:
Thank you fero.
Do I need to use a program like catdoc
for these conversions?
Can you give me sample code that extracts details from the attached file?
fero Commented:
I do not know what information you want to extract. The following is just an idea. It creates a file into.txt and saves each line that contains the word 'Organization', as follows:

 Organization                  :      Micro village Communications Pvt Ltd. - Bangalore
 Organization                  :      Mahizham InfoTech - Madurai
 Organization                  :      Wandoz Web Services Pvt Ltd. - Bangalore

What should the final extract look like?
It's close to midnight. Have a nice day. See you tomorrow.
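use strict;
use warnings;

# Copy every line of resume.html that mentions "Organization" into into.txt.
open my $resume, '<', 'resume.html' or die "Cannot open resume.html: $!";
open my $into, '>', 'into.txt' or die "Cannot open into.txt: $!";
while (<$resume>) {
	chomp;                          # drop the trailing newline
	if ( $_ =~ m/Organization/ ) {
		$_ =~ s/<\/FONT><\/P>//g;   # strip the closing HTML tags
		print $into $_ . "\n";
	}
}
close $into or die "Cannot close into.txt: $!";
close $resume;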


Bryan Butler Commented:
If you have to convert a bunch of them, then a converter is the way to go.  There are many of them out there and you will probably need one that keeps the formatting.  

This is a huge topic.  A search for "resume parsing" found "http://www.akken.com/resume-parsing.php" and adding 'free' found http://www.handyarchive.com/free/parsing-tool/.  Something like that may be the way to go, as it is a major project to make it handle all the resume formats.
gowthamprabakar (Author) Commented:
Thanks a lot, fero.
If you don't mind, can you explain what exactly the code is doing?
A line-by-line explanation would be great.
Also, how do I use an "or" operator?
 e.g. organization or company name
      technical skills or skill sets

I think I am troubling you a lot ;)
fero Commented:
This is probably not the best solution because, like I said, I do not know what info you need. So...


and other good websites
1 open my $resume, '<', 'resume.html' or die $!;
2 open my $into, '>', 'into.txt' or die $!;
3 while (<$resume>) {
4        chomp;
5        if ( $_ =~ m/Organization/ )     {
6                $_ =~ s/<\/FONT><\/P>//g;
7                print $into $_ . "\n";
8        }
9 }
# --------------------- Line:
1 opens resume.html for reading ('<'); "or die $!" stops the script and prints the system error message if the file cannot be opened
2 creates a new text file into.txt and opens it for writing ('>')
3 loops through the document resume.html line by line
4 cuts off the trailing newline \n
5 Regular expressions (regex):
 $_ is a Perl variable; inside this loop it holds the current line
 =~ is the binding operator: it applies the regex on its right to the variable on its left
  m - matching, i.e. if the word 'Organization' is in this line ...
6 </FONT></P> is at the end of the line and we need to get rid of it: s/.../.../g replaces every match with nothing. Since / is the regex delimiter, each / in the pattern has to be escaped with \. "Escape" means the slash is taken as a literal character and not as the delimiter.
7 prints the line ($_) into the text file into.txt
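
On the "or" question: inside a regex, | between alternatives in a group matches any one of them. Here is a minimal sketch, assuming the fields are written as "Label : value" like the Organization lines shown earlier. The sub name labelled_field and the list of labels are only examples, not a complete set.

```perl
use strict;
use warnings;

# Return (label, value) for lines such as "Organization : Acme Ltd",
# or an empty list when the line starts with none of the known labels.
sub labelled_field {
    my ($line) = @_;
    # ( A | B | C ) is the "or": any one of the alternatives may match.
    # The trailing /i makes the match case-insensitive.
    if ($line =~ m/^\s*(Organization|Company Name|Technical Skills|Skill Sets)\s*:\s*(.+?)\s*$/i) {
        return ($1, $2);
    }
    return;
}
```

For example, labelled_field('Skill sets: Perl, PHP') gives back the label 'Skill sets' and the value 'Perl, PHP', while a line with an unknown label gives back nothing.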


gowthamprabakar (Author) Commented:
Thank you fero for your guidance
great work :)
gowthamprabakar (Author) Commented:
Hi fero,
thanks for the code, it works :)
Let me explain the scenario: I am new to Perl and don't know anything about it,
but I have been given this job to do.
The job is to extract as much information as possible from resume.doc and display it in the
form elements of a .php file, so that the user can verify the details.
Can you please help me get this job done?
