James Sermbezis


CV data miner

Hi,

Is it possible to extract certain words from multiple documents and then place them into an Excel spreadsheet, sort of like a data miner (I guess)?

Example:

Let's assume I need someone with at least 3 years of experience in Java. I would type in those requirements, and the program would go through all my CVs and put the people it thinks meet those requirements into an Excel spreadsheet.

I'm sure this is possible, but I'm not sure how hard it would be (I'm doing this for a school project).

Thanks,
James
Jim Cakalic

Are you looking to index all words in the document, or is there a defined set of keywords?
What are your source document formats (plain text files, Word docs, PDFs)?
Is there any persistence of the index, or is the intent to recreate it for each search?
Searching a file is simple. Attached is a class that searches Word docx files for one or more keywords. The Apache POI project is used to get the text of the document. The main method expects a comma-separated list of keywords (e.g., Java,Python) followed by one or more arguments, each of which is either a docx file or a folder. Folders are descended recursively, and any docx files found in them are searched.
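
To illustrate just the matching step, here is a minimal sketch that works on text that has already been extracted from a document. It is not the attached SearchResume.java; the class and method names (KeywordMatcher, parseKeywords, findMatches) are made up for this example, and it uses only the standard library. It assumes you want case-insensitive, whole-word matches so that "Java" does not match "JavaScript".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class KeywordMatcher {

    /** Splits a comma-separated argument such as "Java,Python" into trimmed keywords. */
    static List<String> parseKeywords(String arg) {
        List<String> keywords = new ArrayList<>();
        for (String part : arg.split(",")) {
            String keyword = part.trim();
            if (!keyword.isEmpty()) {
                keywords.add(keyword);
            }
        }
        return keywords;
    }

    /** Returns the keywords that occur in the text as whole words, ignoring case. */
    static List<String> findMatches(String text, List<String> keywords) {
        List<String> found = new ArrayList<>();
        for (String keyword : keywords) {
            // Lookarounds are used instead of \b so that keywords ending in
            // punctuation, such as "C++" or "C#", can still be matched.
            Pattern p = Pattern.compile(
                    "(?<![A-Za-z0-9])" + Pattern.quote(keyword) + "(?![A-Za-z0-9])",
                    Pattern.CASE_INSENSITIVE);
            if (p.matcher(text).find()) {
                found.add(keyword);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical resume text; in practice this would come from the document.
        String text = "Five years of JavaScript and Python development.";
        List<String> keywords = parseKeywords("Java, Python");
        System.out.println(findMatches(text, keywords));
    }
}
```

A file would count as a hit whenever findMatches returns a non-empty list for its text.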

It doesn't put the results in a spreadsheet, but I'll leave that to you (hint: you can either write a CSV file, because Excel can read that, or use the Apache POI project to write a true xls). Right now it just prints the names of the files that match one or more of the keywords.
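
For the CSV route, something along these lines might be enough. This is only a sketch using the standard library; the class name CsvResultWriter, the column layout, and the output file name results.csv are assumptions for illustration. The one thing to watch is quoting: a field that contains a comma, a double quote, or a line break has to be wrapped in double quotes, with any embedded quotes doubled.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsvResultWriter {

    /** Quotes a field only when it contains a comma, a quote, or a line break. */
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    /** Joins already-escaped fields into one CSV line. */
    static String toCsvLine(List<String> fields) {
        List<String> escaped = new ArrayList<>();
        for (String field : fields) {
            escaped.add(escape(field));
        }
        return String.join(",", escaped);
    }

    /** Writes one line per row to the given file, replacing any existing content. */
    static void write(Path out, List<List<String>> rows) throws IOException {
        List<String> lines = new ArrayList<>();
        for (List<String> row : rows) {
            lines.add(toCsvLine(row));
        }
        Files.write(out, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical rows: a header, then one row per matching file.
        List<List<String>> rows = new ArrayList<>();
        rows.add(Arrays.asList("File", "Matched keywords"));
        rows.add(Arrays.asList("cv, final.docx", "Java;Python"));
        write(Paths.get("results.csv"), rows);
    }
}
```

Excel should open the resulting file directly, with each field in its own column.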

The difficult part is years of experience, as there is no standard resume format to rely upon to mark the sections for each position and the time the candidate held it.
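
One rough heuristic, offered as an assumption rather than a real solution, is to look for a phrase like "3 years" or "5+ yrs" in the same sentence or line as the keyword. The sketch below does that with a regular expression; the names ExperienceGuesser and yearsNear are hypothetical. It will miss resumes that only list date ranges (e.g., 2014 to 2017), and it can be fooled by unrelated numbers in the same sentence, so treat the result as a hint only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExperienceGuesser {

    // Matches "3 years", "3+ years", "5 yrs", "3.5 years" (captures the whole-number part).
    private static final Pattern YEARS = Pattern.compile(
            "(?<![\\d.])(\\d{1,2})(?:\\.\\d+)?\\s*\\+?\\s*(?:years?|yrs?)\\b",
            Pattern.CASE_INSENSITIVE);

    /**
     * Returns the largest "N years" figure found in any sentence or line that
     * also mentions the keyword as a whole word, or -1 if there is none.
     */
    static int yearsNear(String text, String keyword) {
        Pattern word = Pattern.compile(
                "(?<![A-Za-z0-9])" + Pattern.quote(keyword) + "(?![A-Za-z0-9])",
                Pattern.CASE_INSENSITIVE);
        int best = -1;
        // Split after sentence-ending punctuation followed by whitespace, or at line breaks.
        for (String segment : text.split("(?<=[.!?])\\s+|\\n")) {
            if (!word.matcher(segment).find()) {
                continue; // keyword not mentioned in this segment
            }
            Matcher m = YEARS.matcher(segment);
            while (m.find()) {
                best = Math.max(best, Integer.parseInt(m.group(1)));
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical resume text.
        String text = "I have 3 years of experience in Java.\nAlso 10 years of cooking.";
        int years = yearsNear(text, "Java");
        System.out.println(years >= 3 ? "meets requirement" : "does not meet requirement");
    }
}
```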

I use Maven to build. These are the dependencies you'll need:

            <dependency>
                <groupId>org.apache.commons</groupId>
                <artifactId>commons-lang3</artifactId>
                <version>3.4</version>
            </dependency>

            <dependency>
                <groupId>org.apache.poi</groupId>
                <artifactId>poi</artifactId>
                <version>3.16</version>
            </dependency>

            <dependency>
                <groupId>org.apache.poi</groupId>
                <artifactId>poi-ooxml</artifactId>
                <version>3.16</version>
            </dependency>
SearchResume.java