OK, here is the question. I have 2 data sources:
DATA SOURCE 1: about 60,000 XML files containing bibliographic information in the following format:
<DE>Compliance^Federal Regulations</DE> <-- DE is the field containing keywords that describe what the book is about
<.... more fields ...>
<bibrec="0060002"> <-- since PA does not have a value of JAW, this record should be ignored
... more files to process
DATA SOURCE 2: An XML Thesaurus that directly corresponds to the <DE></DE> tags above. It is what we use to classify our information.
<BT>Water Legislation and Regulations</BT>
<History>2006/02/28 13:30 Moved from top level by grant</History>
<RT>Best Available Technology</RT>
<RT>Information Collection Rule</RT>
<RT>Safe Drinking Water Act</RT>
<RT>Safe Drinking Water Act Amendments</RT>
<NT>Information Collection Rule</NT>
<NT>Total Coliform Rule</NT>
<History>2006/03/03 10:00 created by grant</History>
WHAT I NEED TO DO:
Cycle through all of the SOURCE 1 data files and process only those that have <PA>JAW</PA>. Compare the terms in their <DE> tags to what is in the SOURCE 2 Thesaurus file, and build a NEW thesaurus file using only the terms that are found in SOURCE 1. So in the above example, it would build a new thesaurus using all the terms from the <DE> tags (delimited by a caret ^).
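To make the first step concrete, here is a rough Python sketch of collecting the <DE> terms from records marked <PA>JAW</PA>. It assumes each file is well-formed XML and that <PA> and <DE> are direct children of the same record element. The function name collect_terms, the <records> wrapper, and the id attribute in the sample are made up for illustration and are not taken from the real files.

```python
import xml.etree.ElementTree as ET

def collect_terms(xml_text, pa_value="JAW"):
    """Return the set of <DE> terms from records whose <PA> equals pa_value."""
    terms = set()
    root = ET.fromstring(xml_text)
    # Visit every element; treat any element with a direct <PA> child as a record.
    for record in root.iter():
        pa = record.find("PA")
        if pa is None or (pa.text or "").strip() != pa_value:
            continue  # no <PA>, or <PA> is not JAW: skip this element
        for de in record.findall("DE"):
            # <DE> holds several terms separated by a caret (^).
            for term in (de.text or "").split("^"):
                term = term.strip()
                if term:
                    terms.add(term)
    return terms

# Hypothetical sample; the real record layout may differ.
sample = """
<records>
  <bibrec id="0060001"><PA>JAW</PA><DE>Compliance^Federal Regulations</DE></bibrec>
  <bibrec id="0060002"><PA>OTHER</PA><DE>Superman</DE></bibrec>
</records>
"""
print(sorted(collect_terms(sample)))
```

For the full set of about 60,000 files, the same function could be called on each file's contents and the resulting sets merged with set.update().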
So it would take all the terms in those DE tags and use them. For example, if the term "Superman" in the Thesaurus was never used in the SOURCE 1 data files, it would not be included in the new thesaurus file. I also need to keep the new file alphabetical by <T> and to include only the <T>, <NT>, and <BT> tags in it.
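The filtering step could look roughly like the Python sketch below. It assumes every thesaurus entry is wrapped in its own element with a single <T> child. The wrapper names <Thesaurus> and <Concept> and the function name build_thesaurus are guesses, since the samples above do not show the enclosing elements.

```python
import xml.etree.ElementTree as ET

KEEP = ("T", "NT", "BT")  # the only tags wanted in the new thesaurus

def build_thesaurus(thesaurus_xml, used_terms):
    """Return a new thesaurus (as a string) limited to entries whose <T> is in used_terms."""
    root = ET.fromstring(thesaurus_xml)
    new_root = ET.Element(root.tag)
    # Keep only entries whose <T> text was seen in the SOURCE 1 data.
    entries = [e for e in root if (e.findtext("T") or "").strip() in used_terms]
    # Sort alphabetically by <T>, ignoring case.
    entries.sort(key=lambda e: e.findtext("T").strip().lower())
    for entry in entries:
        new_entry = ET.SubElement(new_root, entry.tag)
        for child in entry:
            if child.tag in KEEP:  # drops <RT>, <History>, and anything else
                ET.SubElement(new_entry, child.tag).text = child.text
    return ET.tostring(new_root, encoding="unicode")

# Hypothetical thesaurus fragment using the assumed wrapper elements.
thesaurus = """
<Thesaurus>
  <Concept><T>Superman</T></Concept>
  <Concept><T>Federal Regulations</T><BT>Water Legislation and Regulations</BT><RT>Safe Drinking Water Act</RT><History>2006/03/03 10:00 created by grant</History></Concept>
  <Concept><T>Compliance</T><NT>Total Coliform Rule</NT></Concept>
</Thesaurus>
"""
result = build_thesaurus(thesaurus, {"Compliance", "Federal Regulations"})
print(result)
```

This sketch copies <NT> and <BT> values as they are, even when they point to a term that was never used in SOURCE 1. Whether those references should also be pruned is not stated above, so that part is left open.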
Sorry I know this may be confusing. If you have any questions shoot them my way.