Solved

Cross referencing 2 different XML file sources to build a new XML file

Posted on 2006-05-01
Last Modified: 2010-03-05
OK - the question is I have 2 data sources

DATA SOURCE 1: about 60,000 xml files containing bibliographic information in the following format:

File #1
<bibrec n="0060000">
  <PA>JAW</PA>
  <TI>Book Title</TI>
  <AU>Author Name</AU>
  <DE>Compliance^Federal Regulations</DE>    <-- DE is the field containing keywords that describe what the book is about
  <.... more fields ...>
</bibrec>

File #2
<bibrec n="0060001">
  <PA>JAW</PA>
  <AU>author 2</AU>
  <DE>Compliance</DE>
</bibrec>

File #3
<bibrec n="0060002">   <-- since PA does not have a value of JAW, it will ignore this record
   <PA>OPF</PA>
   <TI>Title 3</TI>
   <DE>Cryptosporidium^Water Quality</DE>
</bibrec>

... more files to process

DATA SOURCE 2: An XML Thesaurus that directly corresponds to the <DE></DE> tags above.  It is what we use to classify our information.

<thesaurus>
  <TermInfo>
    <T>Compliance</T>
    <BT>Water Legislation and Regulations</BT>
    <History>2006/02/28 13:30 Moved from top level by grant</History>
    <RT>Best Available Technology</RT>
    <RT>Information Collection Rule</RT>
    <RT>Monitoring</RT>
    <RT>Permits</RT>
    <RT>Primacy</RT>
    <RT>Safe Drinking Water Act</RT>
    <RT>Safe Drinking Water Act Amendments</RT>
    <RT>Standards</RT>
    <RT>Viability</RT>
    <Status>Accepted</Status>
    <ttm>compliance</ttm>
    <ttm>regulat*</ttm>
  </TermInfo>
  <TermInfo>
     <T>Superman</T>
     <RT>Lois Lane</RT>
  </TermInfo>
  <TermInfo>
    <T>Federal Regulations</T>
    <BT>Jurisdictional Regulations</BT>
    <NT>Information Collection Rule</NT>
    <NT>Total Coliform Rule</NT>
    <Status>Accepted</Status>
    <History>2006/03/03 10:00 created by grant</History>
    <ttm>regulat*</ttm>
  </TermInfo>
</thesaurus>

WHAT I NEED TO DO:
Cycle through all of the SOURCE 1 data files, keep only the files that have <PA>JAW</PA>, compare what is in their <DE> tags against the SOURCE 2 thesaurus file, and build a NEW thesaurus file containing only the terms that are actually used in SOURCE 1.  So in the above example, it would build a new thesaurus from all the words in the <DE> tags (delimited by a caret ^).

So it would take all the words in those DE tags and use them, but if, for example, the term "Superman" in the thesaurus was never used in the SOURCE 1 data files, it would not be included in the new thesaurus file.  I also need the new file to be alphabetical by the <T> and to include only <T>, <NT>, and <BT>.

Sorry I know this may be confusing.  If you have any questions shoot them my way.
Question by: PurpleSlade
23 Comments

Accepted Solution by: wlfs (earned 2000 total points)
ID: 16582429
Hi PurpleSlade,

The following code assumes that the newly filtered thesaurus can be held in memory. 60,000 files sounds like a lot of data, so if it turns out that you run out of memory, we will have to take a different approach. Let me know in that case.
Also, I didn't test the code, so it may need a little debugging.

After the disclaimers, here it comes:

my @bib_files = ... ;      # don't know how you build the list of all bib files. insert appropriate code.
my $thesaurus_file = "name_of_thesaurus_file";
my $out_file = "name_of_out_file";

my %desc;

foreach my $bib_file (@bib_files) {
  local $\ = "</bibrec>";
  open my($file), '<', $bib_file;
  while (<$file>) {
    if (m|<PA>JAW</PA>| && m|<DE>(.+?)</DE>|) {
      $desc{$_} = 1 foreach split '^', $1;
    }
  }
}
# now the keys of %desc contain all description items of all bib items with <PA>JAW</PA>

{
  local $\ = "</TermInfo>";
  open my($thesaurus), '<', $thesaurus_file;
  while (<$thesaurus>) {
    m|<T>(.+?)</T>| and
    $desc{$1} &&= join "\n", "  <TermInfo>", grep(m!<(T|NT|BT)>.+?</\1>!, split "\n"), "  </TermInfo>\n";
  }
}
# now the values of %desc contain the filtered Thesaurus entries

open  my($out), '>', $out_file;
print $out "<thesaurus>\n";
print $out $desc{$_} foreach sort keys %desc;
print $out "</thesaurus>\n";
close $out;

Author Comment by: PurpleSlade
ID: 16588483
Hi wlfs - I am testing it out now.  I will let you know what's up.

Author Comment by: PurpleSlade
ID: 16589206
wlfs - question - what does

local $\ = "</bibrec>";

do?

I realize that local $\ will allow you to slurp the file, and you make it local so that it doesn't apply to all < ... > files that get read, but what does adding the "</bibrec>" do?  Why would you do that over local $\ = undef; ?

Expert Comment by: wlfs
ID: 16589370
Good that you are asking. Awful mistake on my part. It must be $/ instead of $\.
Please replace both occurrences of $\ in my code with $/ !
Sorry for that.

$/ is the input record separator. If it is null, there is no separator and thus the entire file becomes a single record, i.e. it is slurped at once.

In our case, $/ = "</bibrec>" causes perl to read in each bib-entry of the xml file as a record, one after the other. Each scalar call of the <$file> operator reads the file up to (and including) the next "</bibrec>".

The default line-wise behaviour of <$file>, that you are probably familiar with, is caused by the default value of $/="\n".

$\, the one I was erroneously using, is the output record separator. It gets appended to the output stream after each print statement. Its default value is $\="".
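If it helps, here is a tiny self-contained sketch of $/ in action (the file name and contents are made up for the demo):

```perl
use strict;
use warnings;

# A small sketch of record-wise reading with $/, using a throwaway
# two-record file (name and contents are invented for illustration).
my $tmp = "bibrec_demo.xml";
open my $fh, '>', $tmp or die "Can't write $tmp: $!";
print $fh "<bibrec><DE>Compliance</DE></bibrec>\n<bibrec><DE>Monitoring</DE></bibrec>\n";
close $fh;

my @records;
{
    local $/ = "</bibrec>";   # input record separator: one bib-entry per read
    open my $in, '<', $tmp or die "Can't read $tmp: $!";
    while (my $rec = <$in>) {
        # the final read is just the trailing "\n"; skip anything
        # that isn't a real record
        push @records, $rec if $rec =~ /<bibrec>/;
    }
    close $in;
}
unlink $tmp;

print scalar(@records), " records read\n";   # prints "2 records read"
```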

Author Comment by: PurpleSlade
ID: 16589629
OK, I think I get it - let me clarify one thing - all of the xml files are zipped in directories that hold 1000 files each, so to do stuff like this I need to unzip them 1000 at a time and then traverse through them.  So my code right now looks like this (trying to adapt my code to yours):

`del *.xml`;
my $jawcount = 0;

#open(OUT, ">thes/journalthesaurus.xml");

while ($z = <../waternet.0000000.zip>) #zipped directory of xml files
{
   print "unzipping $z\n";
   `unzip $z`;
   print "finished unzipping files\n";
   
   while ($f = <waternet.0000001.xml>) #individual xml file
   {
            print "$f\n";
            local $\ = undef;
                open my($file),'<',$f;
                $waternetrec = <IN>;
                close(IN);
     
               if($waternetrec =~ /<PA>JAW<\/PA>/ && /<DE>(.+?)<\/DE>/)
                {
                  $jawcount++;
                  print "$jawcount: $1: $jawcount";      
                }
   }
   `del *.xml`;
}

#close(OUT);

Author Comment by: PurpleSlade
ID: 16589909
wlfs - this statement isn't executing:

if($waternetrec =~ /<PA>JAW<\/PA>/ && /<DE>(.+?)<\/DE>/)

if I change it to test for them individually
if($waternetrec =~ /<PA>JAW<\/PA>/)
if($waternetrec =~ /<DE>(.+?)<\/DE>/)
they both execute as true, but not the way I have it above.

Expert Comment by: wlfs
ID: 16590289
regarding your last comment:
  if($waternetrec =~ /<PA>JAW<\/PA>/ && /<DE>(.+?)<\/DE>/)
doesn't work, because the second pattern match is implicitly bound to $_. You need to change it to
  if($waternetrec =~ /<PA>JAW<\/PA>/ && $waternetrec =~ /<DE>(.+?)<\/DE>/)
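A minimal sketch of that pitfall, with made-up data, in case it helps:

```perl
use strict;
use warnings;

# Sketch of the implicit-$_ pitfall: a bare pattern match tests $_,
# not the variable bound by the previous =~ . Data is invented.
my $rec = "<PA>JAW</PA><DE>Compliance^Monitoring</DE>";
$_ = "something else entirely";   # whatever happened to be in $_

# The bare second match silently runs against $_, so this is false:
my $broken = ($rec =~ /<PA>JAW<\/PA>/ && /<DE>(.+?)<\/DE>/) ? 1 : 0;

# Binding both matches to $rec explicitly works:
my $fixed = ($rec =~ /<PA>JAW<\/PA>/ && $rec =~ /<DE>(.+?)<\/DE>/) ? 1 : 0;

print "broken=$broken fixed=$fixed\n";   # prints "broken=0 fixed=1"
```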

Author Comment by: PurpleSlade
ID: 16590304
OK, I got that working so I have:

      if($waternetrec =~ /<PA>JAW<\/PA>/ && $waternetrec =~ /<DE>(.+?)<\/DE>/)
      {
                  $desc{$_} = 1 foreach split '^', $1;
                  $jawcount++;
      }


however, the "$desc{$_} = 1 foreach split '^', $1;" line doesn't appear to be working, because when I try to see the keys, it doesn't come up with anything:

   foreach $key (keys(%desc))
   {
      print "$key\n";
   }

prints nothing

Expert Comment by: wlfs
ID: 16590381
Well, this is due to:
          local $\ = undef;
          open my($file),'<',$f;
          $waternetrec = <IN>;
          close(IN);

$\ is the OUTput record separator. Don't worry, I mix them up all the time :)
Why do you set it to undef? If you slurp the file as a whole, you will get the description items of the first bib-item (per file) only. All others are ignored. Unless I missed something important, I am very sure that reading it in record-wise (with $/ = "</bibrec>") is the way to go.

Then, you open the file named $f and associate it with filehandle $file.
  open my($file),'<',$f;
But then, you read from the filehandle IN
  $waternetrec = <IN>;

IN has never been opened, it is a completely empty file handle. Just as empty as $waternetrec stays all the time :)

Author Comment by: PurpleSlade
ID: 16590455
There is only one bibrec per file, thus I have no need to do $/ = "</bibrec>", because I am reading one file at a time.  I did fix the other stuff, just hadn't posted it.  Which brings me back to that one line that doesn't seem to be working correctly.

$desc{$_} = 1 foreach split '^', $1;

# new code
`del *.xml`;
my $jawcount = 0;
my %desc;

#open(OUT, ">thes/journalthesaurus.xml");

while ($z = <../waternet.0000000.zip>)
{
   print "unzipping $z\n";
   `unzip $z`;
   print "finished unzipping files\n";
   
   while ($f = <waternet.0000001.xml>)
   {
      local $/ = undef;
      open IN, "< $f" or die "Couldn't open '$f': $!";
      $waternetrec = <IN>;
      close(IN);
     
      if($waternetrec =~ /<PA>JAW<\/PA>/ && $waternetrec =~ /<DE>(.+?)<\/DE>/)
      {
                  $desc{$_} = 1 foreach split '^', $1;
                  $jawcount++;
      }
      else
      {
                  print "$f didn't match conditions\n\n$waternetrec\n\n";
      }
   }
   `del *.xml`;
   
   foreach $key (keys(%desc))
   {
            print "$key\n";
   }
}

So, I have it so that the conditions are being met.

Author Comment by: PurpleSlade
ID: 16590470
This is an example of the xml that I am using for testing purposes (I called it bibrec originally for clarity but this is the xml) - so this is contained in the file waternet.0000001.xml, the next file is waternet.0000002.xml, etc.

<waternet n="0000001">
  <LLObjectID />
  <PDF />
  <WN>JAW71001</WN>
  <LK />
  <TI>Emphasis - 1970's</TI>
  <AU>Larson, T.E.</AU>
  <PA>JAW</PA>
  <PN>Journal American Water Works Association</PN>
  <CL />
  <MD />
  <CS />
  <PU>AWWA</PU>
  <AV>AWWA</AV>
  <SN>0003-150X</SN>
  <BN />
  <LC />
  <CO>JAWWA5</CO>
  <OR />
  <PD>January</PD>
  <PY>1971</PY>
  <ED />
  <VO>63</VO>
  <NO>1</NO>
  <PG>1-2</PG>
  <PS>2</PS>
  <LA>English</LA>
  <NT />
  <DT>Jnl Article</DT>
  <MT>Print</MT>
  <AB>Overall it appears that the United States water industry is providing more than adequate service, but a Community Water Supply Study showed that 12.4% of the water systems studied exceeded the coliform limits for drinking water. In order to bring all systems up to standard the AWWA Board of Directors established the "Action Now" committee for this purpose. Another consideration for maintaining quality water systems is financial aid from government bodies. Money is needed for education, training, and research. New technologies must be developed to deal with the deteriorating quality of water.</AB>
  <DE>Water Quality^Training^Coliforms</DE>
  <STATUS>E</STATUS>
  <ADD>
    <D />
    <IN />
  </ADD>
  <CHG>
    <D>20051102</D>
    <IN>grant</IN>
  </CHG>
  <CHG>
    <D>19951002</D>
    <IN>DAW</IN>
  </CHG>
</waternet>

Expert Comment by: wlfs
ID: 16591088
I see. Everything makes perfect sense now.
So what's the status? Did you generate the filtered thesaurus yet? Everything working fine, or still having trouble?

> Which brings me back to that one line that doesn't seem to be working correctly.
> So, I have it so that the conditions are being met.

That is, I am a little confused about the status :)

Expert Comment by: wlfs
ID: 16591174
Just another thing that comes to my mind.
If performance is a matter of interest, it would be faster to unzip the files directly to a pipe instead of a file. That works fine under unix. I don't know about the unzip executable under windows, but I guess that should have a pipe mode too.

Instead of unzipping and then reading the file you would (-p switch works under linux)
  open IN, "unzip -p $z |";

This way, all files in $z are unzipped, concatenated, and fed into Perl's IN file handle without writing anything to disk.
Read from IN as from a normal file. No need to clean up any unzipped *.xml files afterwards, because there aren't any.
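To sketch the idea without needing an actual archive here, the same pipe-open works with any command that writes to stdout (this demo uses printf in place of unzip -p, so it assumes a Unix-like system):

```perl
use strict;
use warnings;

# Pipe-open sketch: with a real archive you would write
#   open my $in, '-|', 'unzip', '-p', $zip or die "unzip failed: $!";
# Here printf stands in for unzip -p so the demo runs without a zip file.
my @records;
{
    local $/ = "</bibrec>";   # record-wise reading works on pipes too
    open my $in, '-|', 'printf', '<bibrec>a</bibrec><bibrec>b</bibrec>'
        or die "Can't open pipe: $!";
    while (my $rec = <$in>) {
        push @records, $rec if $rec =~ /<bibrec>/;
    }
    close $in;
}
print scalar(@records), " records from the pipe\n";   # prints "2 records from the pipe"
```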

Author Comment by: PurpleSlade
ID: 16591259
Lol, sorry, I am not explaining very well.  

I think I have it working - I still couldn't get the line:

               $desc{$_} = 1 foreach split '^', $1;

to work, so I replaced it with a less streamlined, but working:

                  @descriptors = split(/\^/,$1);
                  foreach $val (@descriptors)
                  {                        
                        $desc{$val} = 1;
                  }

I know there's probably a much better way to write it (yours) but I couldn't get your statement to work.   I ran a test on the first 1000 and it seemed to work when I tested it against our database.   I am running the program across all 60,000 xml files right now.  The only part of your code I was kind of lost on was:

  $desc{$1} &&= join "\n", "  <TermInfo>", grep(m!<(T|NT|BT)>.+?</\1>!, split "\n"), "  </TermInfo>\n";

I know enough to do a few things in Perl, but don't use it frequently enough to look at *that* and go "ah, yes, I see where you're going with this".  ;)

I do hope it works because this seemingly trivial program is one of the backbones of our e-commerce site.  No pressure though.  :P

Author Comment by: PurpleSlade
ID: 16591434
That's a great suggestion about the pipe.  I looked at the unzip executable commands and there is that "-p" option.  I will try that right now.

Expert Comment by: wlfs
ID: 16591468
  $desc{$1} &&= join(...);
is the short form for
   if ($desc{$1}) {
      $desc{$1} = join(...);
   }
Only if the descriptor existed in the xml files will it be extracted from the thesaurus.
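Incidentally, I think I see why your original "$desc{$_} = 1 foreach split '^', $1;" never populated %desc: split always treats its first argument as a pattern, even when it is written as a string, and a bare ^ is the start-of-string anchor rather than a literal caret. A quick sketch:

```perl
use strict;
use warnings;

# split's first argument is a regex even when given as a string,
# so '^' anchors instead of matching a literal caret.
my $de = "Compliance^Federal Regulations";

my @anchored = split '^', $de;     # /^/ matches only at line starts: 1 piece
my @literal  = split /\^/, $de;    # escaped caret: splits on the delimiter

print scalar(@anchored), " vs ", scalar(@literal), "\n";   # prints "1 vs 2"
```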

until tomorrow, woolf

Expert Comment by: wlfs
ID: 16594928
And be aware that in
  grep(m!<(T|NT|BT)>.+?</\1>!, split "\n")
the split is implicitly performed on $_. If the current record from the thesaurus file is in, e.g., $thes_rec then you should use
  grep(m!<(T|NT|BT)>.+?</\1>!, split "\n", $thes_rec)
instead.
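A small sketch of the difference (variable name invented for the demo):

```perl
use strict;
use warnings;

# split "\n" with no third argument splits $_, not your record variable.
my $thes_rec = "<T>Compliance</T>\n<BT>Regulations</BT>\n<RT>Monitoring</RT>";
$_ = "whatever happened to be in the default variable";

my @wrong = grep { m!<(T|BT)>! } split "\n";              # splits $_: no hits
my @right = grep { m!<(T|BT)>! } split "\n", $thes_rec;   # splits the record

print scalar(@wrong), " vs ", scalar(@right), "\n";   # prints "0 vs 2"
```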

Author Comment by: PurpleSlade
ID: 16598442
wlfs - it appears to work, but I get some odd stuff happening (note the '1's that appear):

  <TermInfo>
<T>Thailand</T>
<BT>Geographical locations</BT>
<NT>Bangkok, Thailand</NT>
  </TermInfo>
1  <TermInfo>
<T>Third World Countries</T>
<BT>Geographical locations</BT>
<RT>Appropriate Technology</RT>
<RT>Developing Countries</RT>
<RT>Small Water Systems</RT>
  </TermInfo>
  <TermInfo>
<T>Thornton, Colorado</T>
<BT>Geographical locations</BT>
  </TermInfo>
11  <TermInfo>
<T>Toilets</T>

Author Comment by: PurpleSlade
ID: 16598452
This is the latest code:

`del *.xml`;
my $thesaurus_file = "t:/Water Library/Thesaurus.xml";
my $out_file = "t:/Water Library/JAWThes.xml";

my $jawcount = 0;
my %desc;

#open(OUT, ">thes/journalthesaurus.xml");

while ($z = <../waternet.*.zip>)
{
   print "unzipping $z\n";
   `unzip $z`;
   print "finished unzipping files\n";
   
   while ($f = <waternet.*.xml>)
   {
      local $/ = undef;
      open IN, "< $f" or die "Couldn't open '$f': $!";
      $waternetrec = <IN>;
      close(IN);
     
      # print "$waternetrec\n";
     
      if ($waternetrec =~ /<PA>JAW<\/PA>/ && $waternetrec =~ /<DE>(.+?)<\/DE>/)
      {
         @descriptors = split(/\^/, $1);
         foreach $val (@descriptors)
         {
            #print("val = $val\n");
            $desc{$val} = 1;
         }
         #$desc{$_} = 1 foreach split '^', $1;
         $jawcount++;
      }
      else
      {
         #print "$f didn't match conditions\n\n$waternetrec\n\n";
      }
   }
   `del *.xml`;
}

local $/ = "</TermInfo>";
open my($thesaurus),'<',$thesaurus_file;
while(<$thesaurus>)
{
      m/<T>(.+?)<\/T>/ and $desc{$1} &&= join "\n", "  <TermInfo>", grep(m!<(T|NT|BT|RT)>.+?</\1>!, split "\n"), "  </TermInfo>\n";
}

open  my($out), '>', $out_file;
print $out "<thesaurus>\n";
print $out $desc{$_} foreach sort keys %desc;
print $out "</thesaurus>\n";
close $out;




#close(OUT);

Expert Comment by: wlfs
ID: 16600917
Sorry for coming back late.

The '1's are caused by description items which are found in the bib-files but NOT found in the thesaurus.
Each description item in the bib-files causes $desc{ text of description } to be set to 1. If this description item is not found in the thesaurus it stays '1' and is printed as '1' in the output.

To make up for that, I suggest (i) outputting only existing thesaurus items and (ii) generating a second file where all description items without a thesaurus entry are listed. This way the filtered thesaurus is not messed up, and you know which thesaurus items are missing.

my $missing_file = "t:/Water Library/JAWThes_missing.xml";
.
.
.
open  my($out), '>', $out_file;
print $out "<thesaurus>\n";
open  my($missing), '>', $missing_file;
print $missing "<thesaurus>\n";

foreach (sort keys %desc) {
  if ($desc{$_} eq '1') {
    print $missing "  <TermInfo>\n", "    <T>$_</T>\n", "  </TermInfo>\n";
  }
  else {
    print $out $desc{$_};
  }
}

print $missing "</thesaurus>\n";
close $missing;
print $out "</thesaurus>\n";
close $out;

Author Comment by: PurpleSlade
ID: 16606686
Hi wlfs - thanks for all your excellent help.  I am running the new program with the modifications and I expect it will run fine.  One thing - I tried out your suggestion about the piped thing and unfortunately it deleted everything in the first zip file, so I'm not sure what happened.  I will try it again when I have some time.

Thanks again,
Purple

Expert Comment by: wlfs
ID: 16606734
You are most welcome.

> unfortunately it deleted everything in the first zip file
I hope you had a backup :)

Author Comment by: PurpleSlade
ID: 16606865
Yes, we back up our data.  I need to stop working on live files, but that's normally why I just run the unzip utility and move them all to a different folder and process the data there while the live data remains secure in the original directory.  This is the first time they have ever been deleted as a result of something I tried.  However, I still like the suggestion and will try and implement it, just this time not on the live data.