Solved

Excel function/macro -- extract proper nouns

Posted on 2013-12-04
10
1,207 Views
Last Modified: 2013-12-10
Experts:

Need some help with developing -- if all possible -- a function/macro -- that scans through a single columns (in a spreadsheet) and "detects" all proper nouns.  

Then, in step #2, it would add only the nouns into another column (separated by commas).

For example, I may have the following two (2) sentences in column A:

The agency made the announcement late Tuesday night, hours after a union representative told CNN that the train engineer apparently "was nodding off and caught himself too late" before the accident.

On Monday, one of the nation's most esteemed weeklies, New York magazine, announced that it would soon shift to an every-other-week publishing schedule while beefing up its Web production.

Then, once the function/macro has been applied, I'd like to see the following nouns in column B:
B1: agency, announcement, union, representative, train, engineer, accident
B2: nation, magazine, schedule, production

Note:  In actuality, I have thousands of strings in column A.   Some of them are short... some of them are longer.

Even if I only get the 80% solution, I certainly would be a huge time saver.

Any thoughts on how to accomplish this task?

Thanks,
EEH
0
Comment
Question by:ExpExchHelp
  • 3
  • 3
  • 2
  • +2
10 Comments
 
LVL 4

Expert Comment

by:andrew_man
Comment Utility
Can post your worksheet here?
0
 
LVL 15

Assisted Solution

by:JimFive
JimFive earned 300 total points
Comment Utility
This is a hard problem, even you got it wrong in your example:  the word union is an adjective in the first statement, as is train.

Probably the easiest first approximation would be to find all words that follow an article (a, an, the)
You could build up a list of adjectives and look for non-adjective words that follow an adjective.  This would also let you skip adjectives after an article.  But you still have a problem with multifunctional words.

A thought that just popped into my head is:  I wonder if you can access the MS-Word spellchecker/thesaurus to get part of speech for words?
0
 
LVL 4

Expert Comment

by:andrew_man
Comment Utility
So, we need a table of noun.
0
 
LVL 19

Assisted Solution

by:regmigrant
regmigrant earned 100 total points
Comment Utility
First you will need a list of all the nouns you want to find - there is no way to programmatically identify what's a noun and what's a verb within VBA.

Alternatively you could work with an external service such as this
http://stackoverflow.com/questions/5981547/finding-the-english-definition-of-a-word and I see that Microsoft provide an online tool that -presumably- would allow access to their thesaurus etc. (I haven't checked) http://office.microsoft.com/Research/query.asmx

Once you've established your data base - or written a function to work with an online tool -  you will need to read each piece of text, chop it into words and check against the list.

It would be relatively easy to do this with either a recursive function or an array but I probably wouldn't start with Excel/VBA if this is a production tool

How good are your VBA skills?

Reg
0
 
LVL 4

Expert Comment

by:andrew_man
Comment Utility
Dear Reg,

I agreed.  Do you think we can use the WebService() function in the Excel 2013.  But, we must find out the webservice provider.

Andrew
0
Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

 

Author Comment

by:ExpExchHelp
Comment Utility
All:

Thank you for the comments... I appreciate it.

Unfortunately, I won't be able to post the the worksheet.   The "strings" are based on comments collected during a company survey.    As you can imagine, some of the comments are "sensitive"... posting them online would surely expedite my getting laid off.  ;)

Also, I don't have a ready table of nouns so that it compare it against.    I completely understand this is tricky w/o having a data dictionary to compare it against.

Based on a quick look of the comments, anything in the English language would qualify.   On the other hand, if a general e-dictionary with, e.g., 50,000 nouns would be somewhere posted online, I think it'll catch maybe 75% or more of the content.

EEH
0
 
LVL 45

Accepted Solution

by:
aikimark earned 100 total points
Comment Utility
Please consider the Stanford Parser.  Although there doesn't appear to be a directly consumable interface for VBA, there are several language interfaces.
http://nlp.stanford.edu/software/lex-parser.shtml

There is at least one syntactic parser on sourceforge.net
0
 
LVL 15

Assisted Solution

by:JimFive
JimFive earned 300 total points
Comment Utility
I think that you need to consider your end point.  Since this is for survey comments you presumably want to be able to aggregate what the respondents commented upon.  It might make sense to just count word frequencies and eliminate the useless words from that list.  If you really wanted the meaningful words for each row you could useyour meaningful wwords list as your starting point.
0
 

Author Comment

by:ExpExchHelp
Comment Utility
aikimark -- thx, I'll look into that one.

JimFive -- excellent point.  Frequency count may facilitate findings the important "nuggets".  Any recommendations how to accomplish that in Excel or Access?

EEH
0
 
LVL 15

Assisted Solution

by:JimFive
JimFive earned 300 total points
Comment Utility
In Access I would do this.
Create a table called Words with two columns Word varchar(50), Occurrence longint

Then, in VBA
code up a routine to go through each row of your recordset.  Strip out punctuation.  Split into an array on the spaces.  Loop through the array and if the word is new add it to the Words table with an occurrence of 1, if the word is already in the words table then add 1 to its occurrence.
It would look something like this
Sub CountWords()
Dim rs as Recordset

Set rs = openrecordset(Comments)

While not rs.eof()
    comment = rs("Comment")
    comment = replace(replace(replace(comment,".",""),",",""),"!","")
    commentwords = split(comment, " ")
    for i = lbound(commentwords) to ubound(commentwords)
        in = dlookup("occurrence","Words","Word = """ & commentwords(i) & """")
        if isnull(in) then
            addword
        else
            incrementoccurrence
        end if
    next
    rs.movenext
Wend

Open in new window

0

Featured Post

What Security Threats Are You Missing?

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

Join & Write a Comment

A little background as to how I came to I design this code: Around 5 years ago I designed an add-in that formatted Excel files to a corporate standard, applying different cell colours and font type depending on whether the cells contained inputs,…
This code takes an Excel list of URL’s and adds a header titled “URL List”. It then searches through all URL’s in column “A”, looking for duplicates. When a duplicate is found, it is moved to the top of the list. The duplicate URL’s are then highlig…
The view will learn how to download and install SIMTOOLS and FORMLIST into Excel, how to use SIMTOOLS to generate a Monte Carlo simulation of 30 sales calls, and how to calculate the conditional probability based on the results of the Monte Carlo …
This Micro Tutorial will demonstrate in Microsoft Excel how to add style and sexy appeal to horizontal bar charts.

743 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now