PurpleSlade asked:
Cross-referencing 2 different XML file sources to build a new XML file

OK - the question is that I have 2 data sources:

DATA SOURCE 1: about 60,000 xml files containing bibliographic information in the following format:

File #1
<bibrec="0060000">
  <PA>JAW</PA>
  <TI>Book Title</TI>
  <AU>Author Name</AU>
  <DE>Compliance^Federal Regulations</DE>    <-- DE is the field containing keywords that describe what the book is about
  <.... more fields ...>
</bibrec>

File #2
<bibrec="0060001">
  <PA>JAW</PA>
  <AU>author 2</AU>
  <DE>Compliance</DE>
</bibrec>

File #3
<bibrec="0060002">   <-- since PA does not have a value of JAW, it will ignore this record
   <PA>OPF</PA>
   <TI>Title 3</TI>
   <DE>Cryptosporidium^Water Quality</DE>
</bibrec>

... more files to process

DATA SOURCE 2: An XML Thesaurus that directly corresponds to the <DE></DE> tags above.  It is what we use to classify our information.

<thesaurus>
  <TermInfo>
    <T>Compliance</T>
    <BT>Water Legislation and Regulations</BT>
    <History>2006/02/28 13:30 Moved from top level by grant</History>
    <RT>Best Available Technology</RT>
    <RT>Information Collection Rule</RT>
    <RT>Monitoring</RT>
    <RT>Permits</RT>
    <RT>Primacy</RT>
    <RT>Safe Drinking Water Act</RT>
    <RT>Safe Drinking Water Act Amendments</RT>
    <RT>Standards</RT>
    <RT>Viability</RT>
    <Status>Accepted</Status>
    <ttm>compliance</ttm>
    <ttm>regulat*</ttm>
  </TermInfo>
  <TermInfo>
     <T>Superman</T>
     <RT>Lois Lane</RT>
  </TermInfo>
  <TermInfo>
    <T>Federal Regulations</T>
    <BT>Jurisdictional Regulations</BT>
    <NT>Information Collection Rule</NT>
    <NT>Total Coliform Rule</NT>
    <Status>Accepted</Status>
    <History>2006/03/03 10:00 created by grant</History>
    <ttm>regulat*</ttm>
  </TermInfo>
</thesaurus>

WHAT I NEED TO DO:
Cycle through all of the SOURCE 1 data files, look only at files that have <PA>JAW</PA>, compare what is in the <DE> tag in SOURCE 1 to what is in the SOURCE 2 Thesaurus file, and build a NEW thesaurus file using only the terms that are found in SOURCE 1.  So in the above example, it would build a new thesaurus using all the terms from the <DE> tags (delimited by a caret ^).

So it would take all the terms in those DE tags and use them, but if, for example, the term "Superman" in the Thesaurus was never used in the SOURCE 1 data files, it would not be included in the new thesaurus file.  I also need the new thesaurus to be alphabetical by <T> and to include only <T>, <NT>, and <BT>.

Sorry I know this may be confusing.  If you have any questions shoot them my way.
ASKER CERTIFIED SOLUTION
wlfs

PurpleSlade (ASKER):
Hi wlfs - I am testing it out now.  I will let you know what's up.
wlfs - question - what does

local $\ = "</bibrec>";

do?

I realize that local $\ will allow you to slurp the file, and you make it local so that it doesn't affect every other <...> read, but what does setting it to "</bibrec>" do?  Why would you do that over local $\ = undef; ?
Good that you are asking. Awful mistake. It must be $/ instead of $\.
Please replace both $\ in my code with $/ !
Sorry for that.

$/ is the input record separator. If it is undef, there is no separator and thus the entire file becomes a single record, i.e. it is slurped at once.

In our case, $/ = "</bibrec>" causes perl to read in each bib-entry of the xml file as a record, one after the other. Each scalar call of the <$file> operator reads the file up to (and including) the next "</bibrec>".

The default line-wise behaviour of <$file>, which you are probably familiar with, is caused by the default value of $/ = "\n".

$\, the one I was erroneously using, is the output record separator. It gets appended to the output stream after each print statement. Its default value is $\ = "".
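For illustration, here is a minimal sketch of record-wise reading with $/ (the file name is just a placeholder, not one of the real data files):

# Sketch: read a multi-record XML file one <bibrec>...</bibrec> chunk at a time.
use strict;
use warnings;

open my $fh, '<', 'bibs.xml' or die "Can't open bibs.xml: $!";
{
    local $/ = "</bibrec>";                      # input record separator: one bib entry per read
    while (my $rec = <$fh>) {
        next unless $rec =~ /<PA>JAW<\/PA>/;     # keep only JAW records
        if ($rec =~ /<DE>(.+?)<\/DE>/) {
            print "descriptors: $1\n";           # e.g. "Compliance^Federal Regulations"
        }
    }
}
close $fh;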
ok, I think I get it - let me clarify one thing - all of the xml files are zipped in directories that hold 1000 files each, so to do stuff like this I need to unzip them 1000 at a time and then traverse through them.  So, my code is looking right now like this (trying to modify my code to yours):

`del *.xml`;
my $jawcount = 0;

#open(OUT, ">thes/journalthesaurus.xml");

while ($z = <../waternet.0000000.zip>) #zipped directory of xml files
{
   print "unzipping $z\n";
   `unzip $z`;
   print "finished unzipping files\n";
   
   while ($f = <waternet.0000001.xml>) #individual xml file
   {
            print "$f\n";
            local $\ = undef;
                open my($file),'<',$f;
                $waternetrec = <IN>;
                close(IN);
     
               if($waternetrec =~ /<PA>JAW<\/PA>/ && /<DE>(.+?)<\/DE>/)
                {
                  $jawcount++;
                  print "$jawcount: $1: $jawcount";      
                }
   }
   `del *.xml`;
}

#close(OUT);
wlfs - this statement isn't executing:

if($waternetrec =~ /<PA>JAW<\/PA>/ && /<DE>(.+?)<\/DE>/)

if I change it to test for them individually
if($waternetrec =~ /<PA>JAW<\/PA>/)
if($waternetrec =~ /<DE>(.+?)<\/DE>/)
they both execute as true, but not the way I have it above.
regarding your last comment:
  if($waternetrec =~ /<PA>JAW<\/PA>/ && /<DE>(.+?)<\/DE>/)
doesn't work, because the second pattern match is implicitly bound to $_. You need to change it to
  if($waternetrec =~ /<PA>JAW<\/PA>/ && $waternetrec =~ /<DE>(.+?)<\/DE>/)
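A tiny standalone illustration of that binding issue (the record content is made up):

use strict;
use warnings;

my $rec = '<PA>JAW</PA><DE>Compliance^Federal Regulations</DE>';

# Second pattern tests $_, not $rec, so this branch is normally never taken
# (and with warnings enabled Perl complains about the uninitialized $_).
if ($rec =~ /<PA>JAW<\/PA>/ && /<DE>(.+?)<\/DE>/) {
    print "wrong form matched\n";
}

# Both patterns bound to $rec explicitly:
if ($rec =~ /<PA>JAW<\/PA>/ && $rec =~ /<DE>(.+?)<\/DE>/) {
    print "descriptors: $1\n";    # prints "Compliance^Federal Regulations"
}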
OK, I got that working so I have:

      if($waternetrec =~ /<PA>JAW<\/PA>/ && $waternetrec =~ /<DE>(.+?)<\/DE>/)
      {
                  $desc{$_} = 1 foreach split '^', $1;
                  $jawcount++;
      }


however, the "$desc{$_} = 1 foreach split '^', $1;" line doesn't appear to be working because when I try to see the keys, it doesn't come up with anything:

   foreach $key (keys(%desc))
   {
            print "$key\n";
      }

prints nothing
Well, this is due to:
          local $\ = undef;
          open my($file),'<',$f;
          $waternetrec = <IN>;
          close(IN);

$\ is the OUTput record separator. Don't worry, I mix it up all the time :)
Why do you set it to undef? If you slurp the file as a whole, you will get the description items of the first bib-item (per file) only. All others are ignored. Unless I missed something important, I am very sure that reading it in record-wise (with $/ = "</bibrec>") is the way to go.

Then, you open the file named $f and associate it with filehandle $file.
  open my($file),'<',$f;
But then, you read from the filehandle IN
  $waternetrec = <IN>;

IN has never been opened, it is a completely empty file handle. Just as empty as $waternetrec stays all the time :)
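To make the filehandle part concrete, here is a minimal sketch of opening and reading through one lexical handle, record-wise as suggested above (the file name is hard-coded only to make the sketch self-contained):

use strict;
use warnings;

my $f = 'waternet.0000001.xml';     # placeholder for the current file from the loop

open my $file, '<', $f or die "Couldn't open `$f': $!";
{
    local $/ = "</bibrec>";                  # or undef to slurp the whole file at once
    while (my $waternetrec = <$file>) {      # read from the handle we actually opened
        print "read a record of ", length($waternetrec), " bytes\n";
    }
}
close $file;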
There is only one bibrec per file, thus I have no need to do $/ = "</bibrec>" because I am reading one file at a time.  I did fix the other stuff, just hadn't posted it.  Which brings me back to that one line that doesn't seem to be working correctly.

$desc{$_} = 1 foreach split '^', $1;

# new code
`del *.xml`;
my $jawcount = 0;
my %desc;

#open(OUT, ">thes/journalthesaurus.xml");

while ($z = <../waternet.0000000.zip>)
{
   print "unzipping $z\n";
   `unzip $z`;
   print "finished unzipping files\n";
   
   while ($f = <waternet.0000001.xml>)
   {
      local $/ = undef;
      open IN, "< $f" or die "Couldn't open `$filename': $!";
      $waternetrec = <IN>;
      close(IN);
     
      if($waternetrec =~ /<PA>JAW<\/PA>/ && $waternetrec =~ /<DE>(.+?)<\/DE>/)
      {
                  $desc{$_} = 1 foreach split '^', $1;
                  $jawcount++;
      }
      else
      {
                  print "$f didn't match conditions\n\n$waternetrec\n\n";
      }
   }
   `del *.xml`;
   
   foreach $key (keys(%desc))
   {
            print "$key\n";
   }
}

So, I have it so that the conditions are being met.
This is an example of the xml that I am using for testing purposes (I called it bibrec originally for clarity, but this is the actual xml) - so this is contained in the file waternet.0000001.xml, the next file is waternet.0000002.xml, etc.

<waternet n="0000001">
  <LLObjectID />
  <PDF />
  <WN>JAW71001</WN>
  <LK />
  <TI>Emphasis - 1970's</TI>
  <AU>Larson, T.E.</AU>
  <PA>JAW</PA>
  <PN>Journal American Water Works Association</PN>
  <CL />
  <MD />
  <CS />
  <PU>AWWA</PU>
  <AV>AWWA</AV>
  <SN>0003-150X</SN>
  <BN />
  <LC />
  <CO>JAWWA5</CO>
  <OR />
  <PD>January</PD>
  <PY>1971</PY>
  <ED />
  <VO>63</VO>
  <NO>1</NO>
  <PG>1-2</PG>
  <PS>2</PS>
  <LA>English</LA>
  <NT />
  <DT>Jnl Article</DT>
  <MT>Print</MT>
  <AB>Overall it appears that the United States water industry is providing more than adequate service, but a Community Water Supply Study showed that 12.4% of the water systems studied exceeded the coliform limits for drinking water. In order to bring all systems up to standard the AWWA Board of Directors established the "Action Now" committee for this purpose. Another consideration for maintaining quality water systems is financial aid from government bodies. Money is needed for education, training, and research. New technologies must be developed to deal with the deteriorating quality of water.</AB>
  <DE>Water Quality^Training^Coliforms</DE>
  <STATUS>E</STATUS>
  <ADD>
    <D />
    <IN />
  </ADD>
  <CHG>
    <D>20051102</D>
    <IN>grant</IN>
  </CHG>
  <CHG>
    <D>19951002</D>
    <IN>DAW</IN>
  </CHG>
</waternet>
I see. Everything makes perfect sense now.
So what's the status? Did you generate the filtered thesaurus yet? Everything working fine, or still having trouble?

> Which brings me back to that one line that doesn't seem to be working correctly.
> So, I have it so that the conditions are being met.

That is, I am a little confused about the status :)
Just another thing that comes to my mind.
If performance is of interest, it would be faster to unzip the files directly to a pipe instead of to a file. That works fine under unix. I don't know about the unzip executable under windows, but I guess it should have a pipe mode too.

Instead of unzipping and then reading the file, you would do (the -p switch works under linux):
  open IN, "unzip -p $z |";

This way, all files in $z are unzipped, concatenated and fed into Perl's IN file handle without writing anything to disk.
Read from IN as from a normal file. No need to clean up any unzipped *.xml files afterwards, because there aren't any.
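A minimal sketch of that pipe approach, combined with the record separator (assuming the Info-ZIP unzip and its -p switch are available; the archive name is a placeholder):

use strict;
use warnings;

my $z = '../waternet.0000000.zip';      # placeholder archive name

# unzip streams every member of the archive to stdout; read that stream record-wise.
open my $in, '-|', 'unzip', '-p', $z or die "Couldn't run unzip on $z: $!";
{
    local $/ = '</waternet>';                   # one bib record per read
    while (my $rec = <$in>) {
        next unless $rec =~ /<PA>JAW<\/PA>/ && $rec =~ /<DE>(.+?)<\/DE>/;
        print "$1\n";                           # caret-delimited descriptor list
    }
}
close $in;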
Lol, sorry, I am not explaining very well.  

I think I have it working - I still couldn't get the line:

               $desc{$_} = 1 foreach split '^', $1;

to work, so I replaced it with a less streamlined, but working:

                  @descriptors = split(/\^/,$1);
                  foreach $val (@descriptors)
                  {                        
                        $desc{$val} = 1;
                  }

I know there's probably a much better way to write it (yours) but I couldn't get your statement to work.  I ran a test on the first 1000 and it seemed to work when I tested it against our database.  I am running the program across all 60,000 xml files right now.  The only part of your code I was kind of lost on was:

  $desc{$1} &&= join "\n", "  <TermInfo>", grep(m!<(T|NT|BT)>.+?</\1>!, split "\n"), "  </TermInfo>\n";

I know enough to do a few things in Perl, but don't use it frequently enough to look at *that* and go "ah, yes, I see where you're going with this".  ;)

I do hope it works because this seemingly trivial program is one of the backbones of our e-commerce site.  No pressure though.  :P
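Incidentally, the compact one-liner most likely fails because split treats its first argument as a regular expression even when it is given as a plain string, so '^' is the anchor rather than a literal caret and nothing gets split at the carets. Escaping it makes the short form work too (sample value only):

use strict;
use warnings;

my %desc;
my $de = 'Water Quality^Training^Coliforms';        # sample <DE> content

$desc{$_} = 1 foreach split '^', $de;               # '^' is a regex anchor: one key, the whole string
%desc = ();
$desc{$_} = 1 foreach split /\^/, $de;              # escaped: splits at the carets
print "$_\n" foreach sort keys %desc;               # Coliforms, Training, Water Quality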
That's a great suggestion about the pipe.  I looked at the unzip executable commands and there is that "-p" option.  I will try that right now.
  $desc{$1} &&= join(...);
is the short form for
   if ($desc{$1}) {
      $desc{$1} = join(...);
   }
Only if the descriptor existed in the xml files will it be extracted from the thesaurus.
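A tiny standalone illustration of that &&= behaviour (made-up values):

use strict;
use warnings;

my %desc = ('Compliance' => 1);     # descriptor that was seen in the bib files

$desc{'Compliance'} &&= "<TermInfo>Compliance entry</TermInfo>\n";   # true, so it gets replaced
$desc{'Superman'}   &&= "<TermInfo>Superman entry</TermInfo>\n";     # false/absent, so nothing is assigned

print $desc{'Compliance'};
print defined $desc{'Superman'} ? "Superman has a value\n" : "Superman still has no value\n";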

until tomorrow, woolf
And be aware that in
  grep(m!<(T|NT|BT)>.+?</\1>!, split "\n")
the split is implicitly performed on $_. If the current record from the thesaurus file is in, e.g., $thes_rec then you should use
  grep(m!<(T|NT|BT)>.+?</\1>!, split "\n", $thes_rec)
instead.
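Written out with an explicit record variable, that thesaurus pass might look roughly like this (the thesaurus path is a placeholder, and only <T>, <NT>, <BT> lines are kept, as in the original requirement):

use strict;
use warnings;

my %desc = ('Compliance' => 1, 'Federal Regulations' => 1);    # as collected from the <DE> fields
my $thesaurus_file = 'Thesaurus.xml';                          # placeholder path

open my $thesaurus, '<', $thesaurus_file or die "Couldn't open `$thesaurus_file': $!";
{
    local $/ = '</TermInfo>';                       # one thesaurus term per read
    while (my $thes_rec = <$thesaurus>) {
        next unless $thes_rec =~ m/<T>(.+?)<\/T>/;
        # Keep only terms that were actually used, and only their T/NT/BT lines.
        $desc{$1} &&= join "\n",
            '  <TermInfo>',
            grep(m!<(T|NT|BT)>.+?</\1>!, split "\n", $thes_rec),
            "  </TermInfo>\n";
    }
}
close $thesaurus;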
wlfs - it appears to work, but I get some odd output, like the stray '1's that appear here:

  <TermInfo>
<T>Thailand</T>
<BT>Geographical locations</BT>
<NT>Bangkok, Thailand</NT>
  </TermInfo>
1  <TermInfo>
<T>Third World Countries</T>
<BT>Geographical locations</BT>
<RT>Appropriate Technology</RT>
<RT>Developing Countries</RT>
<RT>Small Water Systems</RT>
  </TermInfo>
  <TermInfo>
<T>Thornton, Colorado</T>
<BT>Geographical locations</BT>
  </TermInfo>
11  <TermInfo>
<T>Toilets</T>
This is the latest code:

`del *.xml`;
my $thesaurus_file = "t:/Water Library/Thesaurus.xml";
my $out_file = "t:/Water Library/JAWThes.xml";

my $jawcount = 0;
my %desc;

#open(OUT, ">thes/journalthesaurus.xml");

while ($z = <../waternet.*.zip>)
{
   print "unzipping $z\n";
   `unzip $z`;
   print "finished unzipping files\n";
   
   while ($f = <waternet.*.xml>)
   {
            local $/ = undef;
      open IN, "< $f" or die "Couldn't open `$f': $!";
      $waternetrec = <IN>;
      close(IN);
     
      # print "$waternetrec\n";
     
      if($waternetrec =~ /<PA>JAW<\/PA>/ && $waternetrec =~ /<DE>(.+?)<\/DE>/)
      {
                  @descriptors = split(/\^/,$1);
                  foreach $val (@descriptors)
                  {
                        #print("val = $val\n");
                        $desc{$val} = 1;
                  }
                  #$desc{$_} = 1 foreach split '^', $1;
                  $jawcount++;
      }
      else
      {
                  #print "$f didn't match conditions\n\n$waternetrec\n\n";
            }
   }
   `del *.xml`;
}

local $/ = "</TermInfo>";
open my($thesaurus),'<',$thesaurus_file;
while(<$thesaurus>)
{
      m/<T>(.+?)<\/T>/ and $desc{$1} &&= join "\n", "  <TermInfo>", grep(m!<(T|NT|BT|RT)>.+?</\1>!, split "\n"), "  </TermInfo>\n";
}

open  my($out), '>', $out_file;
print $out "<thesaurus>\n";
print $out $desc{$_} foreach sort keys %desc;
print $out "</thesaurus>\n";
close $out;




#close(OUT);
Sorry for coming back late.

The '1's are caused by description items which are found in the bib-files but NOT found in the thesaurus.
Each description item in the bib-files causes $desc{ text of description } to be set to 1. If this description item is not found in the thesaurus it stays '1' and is printed as '1' in the output.

To make up for that, I suggest (i) outputting only existing thesaurus items and (ii) generating a second file where all description items without a thesaurus entry are listed. This way the filtered thesaurus is not messed up, and you know which thesaurus items are missing.

my $missing_file = "t:/Water Library/JAWThes_missing.xml";
.
.
.
open  my($out), '>', $out_file;
print $out "<thesaurus>\n";
open  my($missing), '>', $missing_file;
print $missing "<thesaurus>\n";

foreach (sort keys %desc) {
  if ($desc{$_} eq '1') {
    print $missing "  <TermInfo>\n", "    <T>$_</T>\n", "  </TermInfo>\n";
  }
  else {
    print $out $desc{$_}
  }
}

print $missing "</thesaurus>\n";
close $missing;
print $out "</thesaurus>\n";
close $out;
Hi wlfs - thanks for all your excellent help.  I am running the new program with the modifications and I expect it will run fine.  One thing - I tried out your suggestion about the piped thing and unfortunately it deleted everything in the first zip file, so I'm not sure what happened.  I will try it again when I have some time.

Thanks again,
Purple
You are most welcome.

> unfortunately it deleted everything in the first zip file
I hope you had a backup :)
Yes, we back up our data.  I need to stop working on live files; normally I just run the unzip utility, move everything to a different folder, and process the data there while the live data remains secure in the original directory.  This is the first time they have ever been deleted as a result of something I tried.  However, I still like the suggestion and will try to implement it, just this time not on the live data.