hadrons asked:

Looking to extract entire record set from XML file if value of one tag is >= a specific value

I have an XML file that is basically formatted like this:

<product>
....
<a001>product name</a001>
<b002>product date</b002>
<c003>product price</c003>
....
</supplydetail></product>

The record sets start with <product> on its own line, and while </product> is supposed to be on its own line, sometimes it isn't (the schema is flexible enough to allow that). This is just to describe what the file looks like; basically each record set starts at <product> and ends at </product>.

What I need is a Perl script that will pull the entire record set from <product> to </product>, with everything in between, if the value of <b002>product date</b002> is greater than or equal to a specific date. The date info would be in YYYYMMDD form.

So if a record set has a date greater than or equal to 20150122, it would be extracted from the source file. I do have similar scripts, but nothing with a function to evaluate a tag's value like that. Thanks
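For illustration, this is roughly the kind of filtering I'm after, sketched as a naive record-separator loop (just to show the intent, not a real XML parse; the tag names and the 20150122 cutoff come from the description above):

#!/usr/bin/perl
# Rough illustration only: read one <product>...</product> block at a time
# and print it when its <b002> date is on or after the cutoff.
use strict;
use warnings;

my $cutoff = 20150122;        # YYYYMMDD cutoff
local $/   = '</product>';    # read one record set per iteration

while (my $chunk = <>) {
    next unless $chunk =~ /<product[\s>]/;             # skip header/trailer pieces
    my ($date) = $chunk =~ m{<b002>(\d{8})</b002>};    # product date, YYYYMMDD
    next unless defined $date && $date >= $cutoff;
    $chunk =~ s/.*?(?=<product[\s>])//s;               # trim anything before <product>
    print $chunk, "\n";
}

Something like: perl filter_products.pl source.xml > matching_records.xml (the script name is just a placeholder).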
Programming Languages - Other, Perl

wilcoxon

What is the full structure of the XML?  You do not give enough information about it.  You should always use an XML parsing module when dealing with XML data.  Here's a piece of code that will get you most of the way there:
use XML::Simple;
my %opt = (); # may not be needed or may need some of the options set
my $ref = XMLin($filename, %opt) or die "could not parse $filename: $!";
if ($ref->{supplydetail}{b002} >= 20150122) {   # >= the YYYYMMDD cutoff from the question
    print XMLout($ref, %opt);
}

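Since a real file has many <product> records, a slightly fuller sketch of the same XML::Simple idea might look like the following. The 20150122 cutoff comes from the question; the hash path assumes <b002> sits directly under <product> (adjust it if it actually lives under <supplydetail>), and note that XML::Simple will not preserve the original element order or formatting exactly:

use strict;
use warnings;
use XML::Simple qw(XMLin XMLout);

my $filename = shift or die "usage: $0 file.xml\n";
my $cutoff   = 20150122;    # YYYYMMDD cutoff from the question

# ForceArray so a file with a single <product> still comes back as a list;
# KeyAttr => [] keeps XML::Simple from folding elements into keyed hashes.
my $ref = XMLin($filename, ForceArray => ['product'], KeyAttr => []);

for my $product (@{ $ref->{product} || [] }) {
    my $date = $product->{b002};
    next unless defined $date && $date >= $cutoff;
    # NoAttr keeps everything as nested elements when writing the record back out
    print XMLout($product, RootName => 'product', KeyAttr => [], NoAttr => 1);
}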

ASKER CERTIFIED SOLUTION
ozo

hadrons

ASKER
The command line ozo supplied worked great, but I would like to follow up with a script. I'll attach a sample file to this comment to give you an idea of the XML structure - it's just one record, but a file I normally work with can have thousands, with additional tags in them. I do use an XML parser, but it's usually XML::Twig, and I use it to extract specific parts of the record set, whereas in this case I would need the entire record set.

Below is an example of what I would normally use to extract specific tags from all record sets, but I'm not sure how to rewrite it to pull out an entire record set when a tag has a specific value.

my @files = glob("*.xml");      
foreach my $file(@files) {
          system ("echo currently processing file: $file");
                      open FILE, '<:encoding(UTF-8)', $file or warn "Can't open $file: $!";  
                      open PARSED, '>>:encoding(UTF-8)', ($file . "_isbneans.txt") or warn "Cannot open file for write: $!";  

   while (<FILE>) {
        $_=~ s/^\s+//;
        if (/ONIXmessage/) {
        my $t= XML::Twig->new(
                 twig_roots   => {
                                 'product/a001' => \&print_1,
                                 'product/b004' => \&print_1,
                                 'product/b005' => \&print_1,
                                         'product/productidentifier/b244' => \&print_1,
                 }
                            );

        eval {$t->parsefile( $file);};
        print PARSED;
}
}

##  SUB ROUTINES  
               
                sub print_1
                { my( $t, $elt)= @_;
                  eval{  print PARSED "\n" . $elt->text . "\n"; };
                  warn $@ if $@;
                  $t->purge;                                                                                                                                                                              
                }

}
sample.xml
wilcoxon

I've never been a fan of the XML::Twig approach - I mostly use XML::Simple or XML::SAX these days.

One word of warning on ozo's command line: it uses regexes, so it has the same risks and concerns as any other approach that does not use an actual XML parser.  On the other hand, ozo is the king of the one-liner.

I think this should work for a modified script (or at least be close):
my @files = glob("*.xml");      
foreach my $file(@files) {
    system ("echo currently processing file: $file"); 
    open FILE, '<:encoding(UTF-8)', $file or warn "Can't open $file: $!";  
    open PARSED, '>>:encoding(UTF-8)', ($file . "_isbneans.txt") or warn "Cannot open file for write: $!";  

    while (<FILE>) {
        s/^\s+//;
        if (/ONIXmessage/) {
            my $t= XML::Twig->new(
                twig_roots   => {
                    'product' => \&do_product,
                    'product/a001' => \&print_1,
                    'product/b004' => \&print_1,
                    'product/b005' => \&print_1,
                    'product/productidentifier/b244' => \&print_1,
                }
            );

            eval {$t->parsefile( $file);};
            print PARSED;
        }
    }
}

##  SUB ROUTINES  
sub do_product {
    my ($t, $elt) = @_;
    # keep the record set only if its <b002> date is on or after the cutoff
    return unless ($elt->first_child_text("b002") >= 20150122);   # YYYYMMDD cutoff from the question
    # sprint (rather than text) keeps the tags of the whole record set
    eval { print PARSED "\n" . $elt->sprint . "\n" };
    warn $@ if $@;
    $t->purge;
}

sub print_1 {
    my( $t, $elt)= @_;
    eval{  print PARSED "\n" . $elt->text . "\n"; };
    warn $@ if $@;
    $t->purge;                                                                                                                                                                               
} 

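For reference, here is a stripped-down sketch of the same XML::Twig idea with only the whole-record handler. The output file naming and the 20150122 cutoff are assumptions, and sprint is used so each matching record keeps its markup:

use strict;
use warnings;
use XML::Twig;

my $cutoff = 20150122;    # YYYYMMDD cutoff from the question

for my $file (glob "*.xml") {
    open my $out, '>>:encoding(UTF-8)', "${file}_products.xml"
        or warn "Cannot open output for $file: $!" and next;

    my $t = XML::Twig->new(
        twig_roots => {
            product => sub {
                my ($t, $elt) = @_;
                # keep the record set only if its <b002> date is on or after the cutoff
                if (($elt->first_child_text('b002') || 0) >= $cutoff) {
                    print {$out} $elt->sprint, "\n";    # sprint preserves the markup
                }
                $t->purge;                              # free memory as we go
            },
        },
    );
    eval { $t->parsefile($file) } or warn "parse failed for $file: $@";
    close $out;
}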

hadrons

ASKER
Sorry for the delay in grading this solution. I wanted to give the script a try and split the solution difference, but I wasn't able to get it to go. I think it does provide a good template to move forward, though, so I'll tinker with it when I have more free time; thanks again.