Cross referencing 2 different XML file sources to build a new XML file

OK - the question is I have 2 data sources

DATA SOURCE 1: about 60,000 xml files containing bibliographic information in the following format:

File #1
<bibrec="0060000">
  <PA>JAW</PA>
  <TI>Book Title</TI>
  <AU>Author Name</AU>
  <DE>Compliance^Federal Regulations</DE>    <-- DE is the field containing keywords that describe what the book is about
  <.... more fields ...>
</bibrec>

File #2
<bibrec="0060001">
  <PA>JAW</PA>
  <AU>author 2</AU>
  <DE>Compliance</DE>
</bibrec>

File #3
<bibrec="0060002">   <-- since PA does not have a value of JAW, it will ignore this record
   <PA>OPF</PA>
   <TI>Title 3</TI>
   <DE>Cryptosporidium^Water Quality</DE>
</bibrec>

... more files to process

DATA SOURCE 2: An XML Thesaurus that directly corresponds to the <DE></DE> tags above.  It is what we use to classify our information.

<thesaurus>
  <TermInfo>
    <T>Compliance</T>
    <BT>Water Legislation and Regulations</BT>
    <History>2006/02/28 13:30 Moved from top level by grant</History>
    <RT>Best Available Technology</RT>
    <RT>Information Collection Rule</RT>
    <RT>Monitoring</RT>
    <RT>Permits</RT>
    <RT>Primacy</RT>
    <RT>Safe Drinking Water Act</RT>
    <RT>Safe Drinking Water Act Amendments</RT>
    <RT>Standards</RT>
    <RT>Viability</RT>
    <Status>Accepted</Status>
    <ttm>compliance</ttm>
    <ttm>regulat*</ttm>
  </TermInfo>
  <TermInfo>
     <T>Superman</T>
     <RT>Lois Lane</T>
  </TermInfo>
  <TermInfo>
    <T>Federal Regulations</T>
    <BT>Jurisdictional Regulations</BT>
    <NT>Information Collection Rule</NT>
    <NT>Total Coliform Rule</NT>
    <Status>Accepted</Status>
    <History>2006/03/03 10:00 created by grant</History>
    <ttm>regulat*</ttm>
  </TermInfo>
</thesaurus>

WHAT I NEED TO DO:
Cycle through all of the SOURCE 1 data files, look for only files that have <PA>JAW</PA> and compare what is in the <DE> in SOURCE 1 to what is in the SOURCE 2 Thesaurus file and build a NEW thesaurus file using only the terms that are found in SOURCE 1.  So in the above example, it would build a new thesaurus using all the words from the <DE> tags (delimited by a carat ^)

So it would take all the words in those DE tags and use them but for example if Term "Superman" in the Thesaurus was never used in the SOURCE 1 data files, it would not include that in the new thesaurus file.  I need to also keep it so that it's alphbetical by the <T> and to only include <T>,<NT>,<BT> in the new one.

Sorry I know this may be confusing.  If you have any questions shoot them my way.
LVL 2
PurpleSladeAsked:
Who is Participating?
 
wlfsConnect With a Mentor Commented:
Hi PurpleSlade,

The following code assumes that the newly filtered thesaurus can be held in memory. 60.000 files sounds like a lot of data, so if it happens that you run out of memory, we have to take a different approach. Let me know in case.
Also, I didn't test the code. Maybe it needs a little debugging.

After the disclaimers, here it comes:

my @bib_files = ... ;      # don't know how you build the list of all bib files. insert appropriate code.
my $thesaurus_file = "name_of_thesaurus_file";
my $out_file = "name_of_out_file";

my %desc;

foreach my $bib_file (@bib_files) {
  local $\ = "</bibrec>";
  open my($file), '<', $bib_file;
  while (<$file>) {
    if (m|<PA>JAW</P>| && m|<DE>(.+?)</DE>) {
      $desc{$_} = 1 foreach split '^', $1;
    }
  }
}
# now the keys of %desc contain all description items of all bib items with <PA>JAW</PA>

{
  local $\ = "</TermInfo>";
  open my($thesaurus), '<', $thesaurus_file;
  while (<$thesaurus>) {
    m|<T>(.+?)</T>| and
    $desc{$1} &&= join "\n", "  <TermInfo>", grep(m!<(T|NT|BT)>.+?</\1>!, split "\n"), "  </TermInfo>\n";
  }
}
# now the values of %desc contain the filtered Thesaurus entries

open  my($out), '>', $out_file;
print $out "<thesaurus>\n";
print $desc{$_} foreach sort keys %desc;
print $out "</thesaurus>\n";
close $out;
0
 
PurpleSladeAuthor Commented:
Hi wlfs - I am testing it out now.  I will let you know what's up.
0
 
PurpleSladeAuthor Commented:
wlfs - question - what does

local $\ = "</bibrec>";

do?

I realize that local $\ will allow you to slurp the file and you make it local so that it doesn't do it for all < ... > files that get read, but what does adding the "</bibrec> do?  Why would you do that over local $\ = undef; ?
0
Free Tool: SSL Checker

Scans your site and returns information about your SSL implementation and certificate. Helpful for debugging and validating your SSL configuration.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

 
wlfsCommented:
GOOD you are asking. Awful mistake. It must be $/ instead of $\.
Please replace both $\ in my code with $/ !
Sorry for that.

$/ is the input record separator. If it is null, there is no separator and thus the entire file becomes a single record, i.e. it is slurped at once.

In our case, $/ = "</bibrec>" causes perl to read in each bib-entry of the xml file as a record, one after the other. Each scalar call of the <$file> operator reads the file up to (and including) the next "</bibrec>".

The default line-wise behaviour of <$file>, that you are probably familiar with, is caused by the default value of $/="\n".

$\, the one I was erroneously using, is the output record separator. I gets appended to the output stream after each print statement. Its default value is $\="".
0
 
PurpleSladeAuthor Commented:
ok, I think i get it - let me clarify one thing - All of the xml files are zipped in directories that hold 1000 files each, so to do stuff like this, I need to unzip them 1000 at a time and then I usually just traverse through them.  So, my code is looking right now like this (trying to modify my code to yours):

`del *.xml`;
my $jawcount = 0;

#open(OUT, ">thes/journalthesaurus.xml");

while ($z = <../waternet.0000000.zip>) #zipped directory of xml files
{
   print "unzipping $z\n";
   `unzip $z`;
   print "finished unzipping files\n";
   
   while ($f = <waternet.0000001.xml>) #individiual xml file
   {
            print "$f\n";
            local $\ = undef;
                open my($file),'<',$f;
                $waternetrec = <IN>;
                close(IN);
     
               if($waternetrec =~ /<PA>JAW<\/PA>/ && /<DE>(.+?)<\/DE>/)
                {
                  $jawcount++;
                  print "$jawcount: $1: $jawcount";      
                }
   }
   `del *.xml`;
}

#close(OUT);
0
 
PurpleSladeAuthor Commented:
wlfs - this statement isn't executing:

if($waternetrec =~ /<PA>JAW<\/PA>/ && /<DE>(.+?)<\/DE>/)

if I change it to test for them individually
if($waternetrec =~ /<PA>JAW<\/PA>/)
if($waternetrec =~ /<DE>(.+?)<\/DE>/)
they both execute as true, but not the way I have it above.
0
 
wlfsCommented:
regarding your last comment:
  if($waternetrec =~ /<PA>JAW<\/PA>/ && /<DE>(.+?)<\/DE>/)
doesn't work, because the second pattern match is implicitly bound to $_. You need to change it to
  if($waternetrec =~ /<PA>JAW<\/PA>/ && $waternetrec =~ /<DE>(.+?)<\/DE>/)
0
 
PurpleSladeAuthor Commented:
OK, I got that working so I have:

      if($waternetrec =~ /<PA>JAW<\/PA>/ && $waternetrec =~ /<DE>(.+?)<\/DE>/)
      {
                  $desc{$_} = 1 foreach split '^', $1;
                  $jawcount++;
      }


however, the "$desc{$_} = 1 foreach split '^', $1;" line doesn't appear to be working because when I try and see the keys, it is not coming up iwth anything:

   foreach $key (keys(%desc))
   {
            print "$key\n";
      }

prints nothing
0
 
wlfsCommented:
Well, this is due to:
          local $\ = undef;
          open my($file),'<',$f;
          $waternetrec = <IN>;
          close(IN);

$\ is the OUTput record separator. Dont't worry, I mix it up all the time :)
Why do you set it to undef? If you slurp the file as a whole, you will get the description items of the first bib-item (per file) only. All others are ignored. Unless I missed something important, I am very sure that reading it in recordwise (with $/ = "</bibrec>") is the way to go.

Then, you open the file named $f and associate it with filehandle $file.
  open my($file),'<',$f;
But then, you read from the filehandle IN
  $waternetrec = <IN>;

IN has never been opened, it is a completely empty file handle. Just as empty as $waternetrec stays all the time :)
0
 
PurpleSladeAuthor Commented:
There is only one bibrec per file, thus I have no need to do $/ = </bibrec> because I am reading one file at a time.  I did fix the other stuff, just hadn't posted it.  Which brings me back to that one line that doesn't seem to be working correctly.

$desc{$_} = 1 foreach split '^', $1;

# new code
`del *.xml`;
my $jawcount = 0;
my %desc;

#open(OUT, ">thes/journalthesaurus.xml");

while ($z = <../waternet.0000000.zip>)
{
   print "unzipping $z\n";
   `unzip $z`;
   print "finished unzipping files\n";
   
   while ($f = <waternet.0000001.xml>)
   {
      local $/ = undef;
      open IN, "< $f" or die "Couldn't open `$filename': $!";
      $waternetrec = <IN>;
      close(IN);
     
      if($waternetrec =~ /<PA>JAW<\/PA>/ && $waternetrec =~ /<DE>(.+?)<\/DE>/)
      {
                  $desc{$_} = 1 foreach split '^', $1;
                  $jawcount++;
      }
      else
      {
                  print "$f didn't match conditions\n\n$waternetrec\n\n";
      }
   }
   `del *.xml`;
   
   foreach $key (keys(%desc))
   {
            print "$key\n";
   }
}

So, I have it so that the conditions are being met.
0
 
PurpleSladeAuthor Commented:
This is an example of the xml that I am using for testing purposes (I called it bibrec originally for clarity but this is the xml) - so this is contained in the file waternet.0000001.xml, the next file is waternet.0000002.xml, etc.

<waternet n="0000001">
  <LLObjectID />
  <PDF />
  <WN>JAW71001</WN>
  <LK />
  <TI>Emphasis - 1970's</TI>
  <AU>Larson, T.E.</AU>
  <PA>JAW</PA>
  <PN>Journal American Water Works Association</PN>
  <CL />
  <MD />
  <CS />
  <PU>AWWA</PU>
  <AV>AWWA</AV>
  <SN>0003-150X</SN>
  <BN />
  <LC />
  <CO>JAWWA5</CO>
  <OR />
  <PD>January</PD>
  <PY>1971</PY>
  <ED />
  <VO>63</VO>
  <NO>1</NO>
  <PG>1-2</PG>
  <PS>2</PS>
  <LA>English</LA>
  <NT />
  <DT>Jnl Article</DT>
  <MT>Print</MT>
  <AB>Overall it appears that the United States water industry is providing more than adequate service, but a Community Water Supply Study showed that 12.4% of the water systems studied exceeded the coliform limits for drinking water. In order to bring all systems up to standard the AWWA Board of Directors established the "Action Now" committee for this purpose. Another consideration for maintaining quality water systems is financial aid from government bodies. Money is needed for education, training, and research. New technologies must be developed to deal with the deteriorating quality of water.</AB>
  <DE>Water Quality^Training^Coliforms</DE>
  <STATUS>E</STATUS>
- <ADD>
  <D />
  <IN />
  </ADD>
- <CHG>
  <D>20051102</D>
  <IN>grant</IN>
  </CHG>
- <CHG>
  <D>19951002</D>
  <IN>DAW</IN>
  </CHG>
  </waternet>
0
 
wlfsCommented:
I see. Everything makes perfect sense now.
So what's the status? Did you generate the filtered thesaurus yet? Everything working fine, or still having trouble?

> Which brings me back to that one line that doesn't seem to be working correctly.
> So, I have it so that the conditions are being met.

That is, I am a little confused about the status :)
0
 
wlfsCommented:
Just another thing that comes to my mind.
If performance is a matter of interest, it would be faster to unzip the files directly to a pipe instead of a file. That works fine under unix. I don't know about the unzip executable under windows, but I guess that should have a pipe mode too.

Instead of unzipping and then reading the file you would (-p switch works under linux)
  open IN, "unzip -p $z |";

This way, all files in $z are unzipped, concatenated and fed into perls IN file handle without writing anything on disk.
Read from IN as from a normal file. No need to clean up any unzipped *.xml files afterwards, because there aren't any.
0
 
PurpleSladeAuthor Commented:
Lol, sorry, I am not explaining very well.  

I think I have it working - I still couldn't get the line:

               $desc{$_} = 1 foreach split '^', $1;

to work, so I replaced it with a less streamlined, but working:

                  @descriptors = split(/\^/,$1);
                  foreach $val (@descriptors)
                  {                        
                        $desc{$val} = 1;
                  }

I know there's probably a much better way to write it (yours) but I couldn't get your statement to work.   I ran a test on the first 1000 and it seemed to when I tested it against our database.   I am running the program across all the 60,000 xml files right now.  The only part I was kind of lost in your code was:

  $desc{$1} &&= join "\n", "  <TermInfo>", grep(m!<(T|NT|BT)>.+?</\1>!, split "\n"), "  </TermInfo>\n";

I know enough to do a few things in Perl, but don't use it frequently enough to look at *that* and go "ah, yes, I see where you're going with this".  ;)

I do hope it works because this seemingly trivial program is one of the backbones of our e-commerce site.  No pressure though.  :P
0
 
PurpleSladeAuthor Commented:
That's a great suggestion about the pipe.  I looked at the unzip executable commands and there is that "-p" option.  I will try that right now.
0
 
wlfsCommented:
  $desc{$1} &&= join(...);
is the short form for
   if ($desc{$1}) {
      $desc{$1} = join(...);
   }
Only if the descriptor existed in the xml files it will be extracted from the thesaurus.

until tomorrow, woolf
0
 
wlfsCommented:
And be aware that in
  grep(m!<(T|NT|BT)>.+?</\1>!, split "\n")
the split is implicitly performed on $_. If the current record from the thesaurus file is in, e.g., $thes_rec then you should use
  grep(m!<(T|NT|BT)>.+?</\1>!, split "\n", $thes_rec)
instead.
0
 
PurpleSladeAuthor Commented:
wlfs - it appears to work but I get some odd stuff that seems to happen like (note the '1''s that appear):

  <TermInfo>
<T>Thailand</T>
<BT>Geographical locations</BT>
<NT>Bangkok, Thailand</NT>
  </TermInfo>
1  <TermInfo>
<T>Third World Countries</T>
<BT>Geographical locations</BT>
<RT>Appropriate Technology</RT>
<RT>Developing Countries</RT>
<RT>Small Water Systems</RT>
  </TermInfo>
  <TermInfo>
<T>Thornton, Colorado</T>
<BT>Geographical locations</BT>
  </TermInfo>
11  <TermInfo>
<T>Toilets</T>
0
 
PurpleSladeAuthor Commented:
This is the latest code:

`del *.xml`;
my $thesaurus_file = "t:/Water Library/Thesaurus.xml";
my $out_file = "t:/Water Library/JAWThes.xml";

my $jawcount = 0;
my %desc;

#open(OUT, ">thes/journalthesaurus.xml");

while ($z = <../waternet.*.zip>)
{
   print "unzipping $z\n";
   `unzip $z`;
   print "finished unzipping files\n";
   
   while ($f = <waternet.*.xml>)
   {
            local $/ = undef;
      open IN, "< $f" or die "Couldn't open `$filename': $!";
      $waternetrec = <IN>;
      close(IN);
     
      # print "$waternetrec\n";
     
      if($waternetrec =~ /<PA>JAW<\/PA>/ && $waternetrec =~ /<DE>(.+?)<\/DE>/)
      {
                  @descriptors = split(/\^/,$1);
                  foreach $val (@descriptors)
                  {
                        #print("val = $val\n");
                        $desc{$val} = 1;
                  }
                  #$desc{$_} = 1 foreach split '^', $1;
                  $jawcount++;
      }
      else
      {
                  #print "$f didn't match conditions\n\n$waternetrec\n\n";
            }
   }
   `del *.xml`;
}

local $/ = "</TermInfo>";
open my($thesaurus),'<',$thesaurus_file;
while(<$thesaurus>)
{
      m/<T>(.+?)<\/T>/ and $desc{$1} &&= join "\n", "  <TermInfo>", grep(m!<(T|NT|BT|RT)>.+?</\1>!, split "\n"), "  </TermInfo>\n";
}

open  my($out), '>', $out_file;
print $out "<thesaurus>\n";
print $out $desc{$_} foreach sort keys %desc;
print $out "</thesaurus>\n";
close $out;




#close(OUT);
0
 
wlfsCommented:
Sorry for coming back late.

The '1's are caused by description items which are found in the bib-files but NOT found in the thesaurus.
Each description item in the bib-files causes $desc{ text of description } to be set to 1. If this description item is not found in the thesaurus it stays '1' and is printed as '1' in the output.

To make up for that, I suggest to (i) output only exisiting thesaurus items and (ii) generate a second file where all description items without thesaurus entry are listed. This way the filtered thesaurus is not messed up, and you know which thesaurus items are missing.

my $missing_file = "t:/Water Library/JAWThes_missing.xml";
.
.
.
open  my($out), '>', $out_file;
print $out "<thesaurus>\n";
open  my($missing), '>', $missing_file;
print $missing "<thesaurus>\n";

foreach (sort keys %desc) {
  if ($desc{$_} eq '1') {
    print $missing "  <TermInfo>\n", "    <T>$_</T>\n", "  </TermInfo>\n";
  }
  else {
    print $out $desc{$_}
  }

print $missing "</thesaurus>\n";
close $missing;
print $out "</thesaurus>\n";
close $out;
0
 
PurpleSladeAuthor Commented:
Hi wlfs - thanks for all your excellent help.  I am running the new program with the modifications and I expect it will run fine.  One thing - I tried out your suggestion about the piped thing and unfortunately it deleted everything in the first zip file, so I'm not sure what happened.  I will try it again when I have some time.

Thanks again,
Purple
0
 
wlfsCommented:
You are most welcome.

> unfortunately it deleted everything in the first zip file
I hope you had a backup :)
0
 
PurpleSladeAuthor Commented:
Yes, we back up our data.  I need to stop working on live files, but that's normally why I just run the unzip utility and move them all to a different folder and process the data there while the live data remains secure in the original directory.  This is the first time they have ever been deleted as a result of something I tried.  However, I still like the suggestion and will try and implement it, just this time not on the live data.
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.