02.27.2008 at 02:34PM PST, ID: 23198749
What is the best way to parse and create a keyword list (using Perl)

Tags: Perl
Sample line in the text file (one of about 60000 lines, each of which will be parsed) which represents a record:
AN 0000001--DT Jnl Article--MT Print^PDF--AU Smith, T.E.--PA JAW--TI The Life and Times of Dr. Water--DE Water Quality^Training^Coliforms^Water Industry--AB Overall it blah blah blah

What I need to do is:
1) Find all records that have PA JAW or PA ST(A|B|C|D|E|F|G)
2) Create an alphabetical list of the terms used in the DE field, e.g. "DE Water Quality^Training^Coliforms^Water Industry"
It should cycle through each line and not repeat any term that is already in the list
3) Note that the -- are actually \x1e, but EE wouldn't display it when I copied and pasted it

I think I could actually write this myself but I know you guys have such unique ways of doing this stuff.  (i.e. you put in 1 line what I usually do in about 10)  :)  Thanks in advance,

Purple.
Question Stats
Zone: Programming
Question Asked By: PurpleSlade
Solution Provided By: ozo
Participating Experts: 2
Solution Grade: A
02.27.2008 at 03:00PM PST, ID: 20999437

Rank: Genius

What are the terms used in the "DE Water Quality^Training^Coliforms^Water Industry"?
Is "Water" a separate term, or is "DE Water Quality" one term?
Are the terms to be found after the 6th \x1e or -- on a line?
If you don't repeat terms, does that mean that the line should be changed to
AN 0000001--DT Jnl Article--MT Print^PDF--AU Smith, T.E.--PA JAW--TI The Life and Times of Dr. Water--DE Water Quality^Training^Coliforms^Industry--AB Overall it blah
Or do you want it alphabetized to
AN 0000001--DT Jnl Article--MT Print^PDF--AU Smith, T.E.--PA JAW--TI The Life and Times of Dr. Water--AB blah ^Coliforms DE ^Industry it Overall Quality Training ^Water
(ignoring case for alphabetization purposes)

 
02.27.2008 at 03:13PM PST, ID: 20999537
Hi ozo -
The terms in that example would be delimited by the ^ so "Water Quality" would be one term, "Training" one term, etc. - so after the first line (or if it was only one line) all I would want is an alpha list of:

Coliforms
Training
Water Industry
Water Quality

By not repeating I mean that if on the next line the text has "DE Cryptosporidium^Water Quality^Legislation^Coliforms" then the keyword list would become:

Coliforms
Cryptosporidium
Legislation
Training
Water Industry
Water Quality

so it would ignore that Water Quality and Coliforms were in there.

Basically I am trying to compile a list of keywords (designated by DE) that have been used for records that have a PA of JAW or ST(A|B|...|G), but I don't want to add any keyword more than once.
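
The list-building behaviour described above can be sketched with a hash, whose keys are unique by construction. This is only a minimal illustration, assuming the DE values have already been pulled out of each record; the sub name add_terms and the variable names are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: add the ^-delimited terms of one DE value to %$seen.
# Hash keys are unique, so a term that appears again is stored only once.
sub add_terms {
    my ($seen, $de_value) = @_;
    $seen->{$_} = 1 for split /\^/, $de_value;
    return $seen;
}

my %seen;
# The two DE values used as examples in this thread.
add_terms(\%seen, 'Water Quality^Training^Coliforms^Water Industry');
add_terms(\%seen, 'Cryptosporidium^Water Quality^Legislation^Coliforms');

# Plain sort compares by character code, so it is case-sensitive.
print "$_\n" for sort keys %seen;
```

Run on those two values, this prints the six-term list given above, with Water Quality and Coliforms appearing once each.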
 
02.27.2008 at 03:40PM PST, ID: 20999749

Rank: Genius

my %keywords;
while(<>) {
	# Split the record into its \x1e-separated fields.
	my @f = split /\x1e/;
	# Assumes fixed positions: PA at index 6 and DE at index 9.
	next unless ($f[6] =~ /PA JAW/ or $f[6] =~ /PA ST[A-G]/);
	# Hash keys are unique, so each keyword is stored only once.
	$keywords{$_} = 1 foreach (split/\^/, $f[9]);
}
 
print "Keywords:\n  " . join("\n  ", sort keys %keywords) . "\n";
 
02.27.2008 at 03:47PM PST, ID: 20999794
Hi Adam,
That splits everything with /\x1e/ not just the Descriptor field (the DE field)
 
02.27.2008 at 03:48PM PST, ID: 20999808

Rank: Genius

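while( <> ){
  # For records whose PA field is JAW or STA..STG, capture the DE value
  # and add its ^-separated terms as hash keys (duplicates collapse).
  @key{split/\^/,$1}=() if /\x1e(PA JAW|PA ST[A-G])\x1e/ && /\x1eDE\s*([^\x1e]*)/;
}
$\=$/;   # append a newline to every print
print for sort keys %key;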
Accepted Solution
 
02.27.2008 at 03:50PM PST, ID: 20999857
Actually, never mind, I think I understand what you did: $f[6] assumes that DE will always be at that position in the array.  This is not always going to be the case, but there will always be a PA and a DE field.
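
Since the field order can vary but every record has a PA and a DE field, one option is to look fields up by their two-letter tag instead of by position. The following is a rough sketch, assuming each field is a two-letter uppercase tag, whitespace, then the value, and that fields are separated by \x1e as described in the question. The names parse_record and wanted_pa are hypothetical, and the sample records are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one record line into a tag => value hash.
# Assumes every \x1e-separated field looks like "XX value".
sub parse_record {
    my ($line) = @_;
    chomp $line;
    my %field;
    for my $f (split /\x1e/, $line) {
        # Skip anything that does not start with a two-letter tag.
        my ($tag, $value) = $f =~ /^([A-Z]{2})\s+(.*)$/s or next;
        $field{$tag} = $value;
    }
    return \%field;
}

# True if the PA value is JAW or STA..STG, as requested in the question.
sub wanted_pa {
    my ($pa) = @_;
    return defined $pa && $pa =~ /^(?:JAW|ST[A-G])$/;
}

# Invented sample records; note that the field order differs between them.
my @lines = (
    join("\x1e", 'AN 0000001', 'PA JAW', 'DE Water Quality^Training'),
    join("\x1e", 'AN 0000002', 'DE Coliforms^Training', 'PA STB'),
    join("\x1e", 'AN 0000003', 'PA XYZ', 'DE Ignored Term'),
);

my %keywords;
for my $line (@lines) {
    my $rec = parse_record($line);
    next unless wanted_pa($rec->{PA}) && defined $rec->{DE};
    $keywords{$_} = 1 for split /\^/, $rec->{DE};
}
print "$_\n" for sort keys %keywords;
```

To read the real file, the loop over @lines could be replaced with while (my $line = <>) { ... }. Looking up $rec->{PA} and $rec->{DE} by name means a record with extra or reordered fields is still handled the same way.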
 
02.27.2008 at 03:50PM PST, ID: 20999858

Rank: Genius

Do  the \x1e fields always appear in the same order?
 
02.27.2008 at 03:53PM PST, ID: 20999906
That appears to work, ozo. One other question: one of the keywords is pH, which always appears at the bottom of the list rather than going in with the capital "P"s. Is there any way to change the output to put the p with the Ps?
 
02.27.2008 at 04:04PM PST, ID: 21000002

Rank: Genius

If you want to capitalize everything, you could use
@key{split/\^/,uc $1}=();

Keeping the case but ignoring it in the sort is a little slower:
print for sort {uc $a cmp uc $b} keys %key;
If you have a lot of keys and you want to preserve case but ignore it in the sort,
then it may be better to keep two copies of each term, one for sorting and one for displaying.
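
The "two copies" idea can be sketched by pairing each term with its uppercase form once, sorting on the uppercase copy, and printing the original (a pattern often called a Schwartzian transform). This is a minimal sketch, not necessarily what was intended above; sort_ignoring_case is a hypothetical name, and the sample terms other than pH are invented.

```perl
use strict;
use warnings;

# Hypothetical helper: sort terms case-insensitively while keeping their
# original spelling. uc() is computed once per term, not once per comparison.
sub sort_ignoring_case {
    my @terms = @_;
    return map  { $_->[1] }                       # keep the display copy
           sort { $a->[0] cmp $b->[0]             # compare the sort copies
                      or $a->[1] cmp $b->[1] }    # tie-break on original spelling
           map  { [ uc $_, $_ ] } @terms;         # [ sort copy, display copy ]
}

my @sorted = sort_ignoring_case('Water Quality', 'pH', 'Training', 'Pipes');
print "$_\n" for @sorted;
```

With these sample terms, pH is listed among the other P entries instead of after Water Quality.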
 
 
02.28.2008 at 08:18AM PST, ID: 21005275
ozo - I noticed that there is something wrong with this.  It is taking compound descriptors and making a term for each of them.  For example it makes
"asbestos cement pipe" into
asbestos
asbestos cement
asbestos cement pipe
 
 
02.28.2008 at 08:39AM PST, ID: 21005547
I am going to add a follow-up question to this since I think I may understand the problem.
 
 
 