Perl and correct XML tags

Jacknumee
I need a Perl program that, given a correct or almost correct XML file, produces a properly indented file consisting of only the tags, without content and without attributes. Here is an example:

Given the following file:

<?xml version ="1.0?>
<?xml -stylesheet type="text/xsl" href="cda.xsl"?>
<cd type ="single">
<title> Revolver, top two </title>
<band> The Beatles </band>
<track>
  <song>Eleanor Rigby </song>
  <time>2:45</time>
  <written year ="1964/>
</track>

Will give the following output:

<cd>
   <title>
   </title>
   <band>
   </band>
   <track>
     <song>
     </song>
     <time>
     </time>
     <written/>
   </track>
rj2

Commented:
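#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Print each start and end tag on its own line, ignoring attributes and text.
my $p = XML::Parser->new(Handlers => {Start => \&start_handler,
                                      End   => \&end_handler});
$p->parsefile('test.xml', ErrorContext => 3);

# Called for every opening tag; the attributes that follow $element in @_ are dropped.
sub start_handler
{
    my ($expat, $element) = @_;
    print "<$element>\n";
}

# Called for every closing tag.
sub end_handler
{
    my ($expat, $element) = @_;
    print "</$element>\n";
}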


Your XML is not well-formed.
To make it work, replace it with well-formed XML as shown below.

<?xml version ="1.0"?>
<?xml-stylesheet type="text/xsl" href="cda.xsl"?>
<cd type ="single">
<title> Revolver, top two </title>
<band> The Beatles </band>
<track>
 <song>Eleanor Rigby </song>
 <time>2:45</time>
 <written year="1964"/>
</track>
</cd>
rj2

Commented:
"almost correct XML" is not good enough.
It must be well-formed, or it will not parse.
That is the rule, and AFAIK there are no exceptions to it.
If you need to handle almost-correct input, you're better off using HTML::Parser, which can parse tags that have no closing tags.

Search cpan.org for HTML::Parser to find the module.

Author

Commented:
Is there a way of achieving my goal without using HTML::Parser or XML::Parser?

I would like to achieve my goal using another method.
Thanks
rj2

Commented:
Why don't you want to use XML::Parser?
Do you want to use another XML parser instead?
Using an XML parser is really the only way to make this work not only for your sample data, but for all well-formed XML.
It might be possible to write a regexp that works just for your sample data; is that what you want?
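
To illustrate why a regexp only goes so far, here is a minimal sketch. The substitution and the helper name strip_attributes are made up for this illustration and are not taken from any module; the point is only that attribute values which are perfectly legal in XML can defeat a pattern that looks reasonable on simple data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: naively drop everything between the tag name and the
# closing ">" (or "/>"). Illustration only, not a real XML parser.
sub strip_attributes {
    my ($xml) = @_;
    $xml =~ s/<([A-Za-z_][\w.-]*)[^>\/]*(\/?)>/<$1$2>/g;
    return $xml;
}

# Simple attributes are handled as hoped.
print strip_attributes('<cd type="single">'), "\n";

# A "/" inside an attribute value makes the pattern fail, so the tag is
# left untouched instead of being reduced to <link/>.
print strip_attributes('<link href="a/b.xsl"/>'), "\n";

# A ">" inside an attribute value is legal XML, but it ends the match too
# soon and leaves part of the attribute behind.
print strip_attributes('<note text="a > b">'), "\n";
```

A parser tracks quoting and nesting for you, which is why it copes with all three cases, provided the input is well-formed.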
rj2

Commented:
Regexp approach below.

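#!/usr/bin/perl
use strict;
use warnings;

my $file="test2.xml";
undef $/;   # slurp mode: read the whole file in one go
open(FILE, '<', $file) || die("Cannot open file: $!");
my $xml=<FILE>;
close(FILE);

#remove processing instructions (<?...?>)
$xml =~ s/<\?[^>]*>//sg;
#remove attributes
$xml =~ s/<([\w:.-]*)[^>\/]*(\/?)/<$1$2/sg;
#remove text between tags
$xml =~ s/>[^<]*</></sg;
#remove newlines and opening brackets
$xml =~ s/\n|<//sg;

#each element is now "name", "/name" or "name/"
my @xml=split('>',$xml);
my $level=0;

foreach(@xml) {
     if(m/^\//) {
          $level--;     #closing tag: step back out
     }
     print ' ' x ($level*2),"<$_>\n";
     if(!m/\//) {
          $level++;     #opening tag: indent its children
     }
}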
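
The same idea can also be sketched as a function that works on a string, which makes it easier to try out. This is a variation rather than the code above: instead of several substitutions followed by a split, it walks the tags with one global match. The helper name tag_outline and the optional indent-width argument are assumptions made for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# tag_outline is a hypothetical helper name. It returns the tags of $xml,
# one per line, indented by $indent spaces per nesting level.
sub tag_outline {
    my ($xml, $indent) = @_;
    $indent = 2 unless defined $indent;
    my ($level, $out) = (0, '');

    # One global match visits every opening, closing and self-closing tag.
    # "<?...?>" and "<!...>" are skipped because a tag name must start
    # with a letter or an underscore.
    while ($xml =~ m{<(/?)([A-Za-z_][\w:.-]*)[^>]*?(/?)>}g) {
        my ($close, $name, $empty) = ($1, $2, $3);
        $level-- if $close && $level > 0;      # closing tag: step back out
        $out .= ' ' x ($level * $indent) . "<$close$name$empty>\n";
        $level++ unless $close || $empty;      # opening tag: indent children
    }
    return $out;
}

# The almost-correct sample from the question, including the missing
# quotes and the missing </cd>.
my $sample = <<'XML';
<?xml version ="1.0?>
<?xml -stylesheet type="text/xsl" href="cda.xsl"?>
<cd type ="single">
<title> Revolver, top two </title>
<band> The Beatles </band>
<track>
  <song>Eleanor Rigby </song>
  <time>2:45</time>
  <written year ="1964/>
</track>
XML

print tag_outline($sample);
```

It indents by two spaces per level, like the loop above. A ">" inside an attribute value, or tags that appear inside a comment, would still confuse it.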

Author

Commented:
Yes this is what I want.
Will this work for perl 5?
I need to print the answer to an outfile!

Author

Commented:
I tried to run it, and it just tells me
 (-h will show valid options).
I don't understand why!
I am running it under Perl 5.
rj2

Commented:
It will work with Perl 5, yes.
The message "(-h will show valid options)" appears when you pass Perl an option that it doesn't recognize.

Paste the code above into your editor.
Change the line
$file="test2.xml";
to the correct path and filename of the XML file on your system.
Save the code as a file, e.g. indent.pl

Start a command prompt.
Go to the directory where you saved indent.pl
Then type the command "perl indent.pl" at your command prompt.

Does it work then?
If there is still a problem, please post the output of "perl -v", and also tell me which OS you're using (Win95/Win98/NT/2000/XP/Redhat Linux/whatever).
rj2

Commented:
To print the answer to an output file, type the command below:
perl indent.pl > outfile.xml

Author

Commented:
Now I get the following:

Read on closed filehandle <FILE> at ./q4.pl line 7.
Use of uninitialized value at ./q4.pl line 10.
Use of uninitialized value at ./q4.pl line 12.
Use of uninitialized value at ./q4.pl line 14.
Use of uninitialized value at ./q4.pl line 16.
Use of uninitialized value at ./q4.pl line 20.

Author

Commented:
Disregard my last comment

Author

Commented:
The program compiles now, but does not seem to do anything. I made a few minor modifications.  

#!/usr/local/perl5/bin/perl -w

open(FILE,"$ARGV[0]");
open(Outfile, ">new.txt");
$xml=<FILE>;

#remove comments
$xml =~ s/<\?[^>]*>//;
#remove attributes
$xml =~ s/<([a-zA-Z_]*)[^>\/]*(\/)?/<$1$2/;
#remove text
$xml =~ s/>[^<]*</></;
#remove newlines
$xml =~ s/\n|<//;



@xml=split('>',$xml);

foreach(@xml) {          
    if(m/^\//) {
         $level--;
    }
   
print ' ' x ($level*2),"<$_>\n";    
    if(!m/\//) {
         $level++;

print Outfile;

    }
}

close (FILE);
close (Outfile);


Maybe you could tell me why???????
rj2

Commented:
Use the code below to pass the filename as a parameter and print the output to the file new.txt instead of stdout.

#!/usr/local/perl5/bin/perl
use strict;
use warnings;

undef $/;   # slurp mode: read the whole file in one go
open(FILE, '<', $ARGV[0]) || die("Cannot open file: $!");
my $xml=<FILE>;
open(OUTFILE, '>', 'new.txt') || die("Cannot open new.txt: $!");

#remove processing instructions (<?...?>)
$xml =~ s/<\?[^>]*>//sg;
#remove attributes
$xml =~ s/<([\w:.-]*)[^>\/]*(\/?)/<$1$2/sg;
#remove text between tags
$xml =~ s/>[^<]*</></sg;
#remove newlines and opening brackets
$xml =~ s/\n|<//sg;

my @xml=split('>',$xml);
my $level=0;

foreach(@xml) {
    if(m/^\//) {
         $level--;
    }
    print OUTFILE ' ' x ($level*2),"<$_>\n";
    if(!m/\//) {
         $level++;
    }
}
close(FILE);
close(OUTFILE);
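
Judging from the posted code, the modified q4.pl differs from this version in three ways: it never changes $/, so <FILE> returns only the first line of the file; its substitutions have lost the g flag, so each one is applied at most once; and "print Outfile;" sits inside the if block and prints $_ without any indentation, while the indented line still goes to stdout. A minimal sketch of the first two effects, using a made-up three-line sample and the hypothetical helper names first_read and strip_text:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $sample = "<a>\n<b>text</b>\n</a>\n";

# Read once from an in-memory filehandle, so the sketch needs no file on disk.
# With $slurp true, $/ is undefined and the read returns the whole "file".
sub first_read {
    my ($slurp) = @_;
    open(my $fh, '<', \$sample) or die "Cannot open in-memory file: $!";
    local $/ = $slurp ? undef : "\n";
    my $data = <$fh>;
    close($fh);
    return $data;
}

# Remove the text between tags, with or without the g flag.
sub strip_text {
    my ($xml, $global) = @_;
    if ($global) { $xml =~ s/>[^<]*</></g }
    else         { $xml =~ s/>[^<]*</></  }
    return $xml;
}

print "line mode read:  ", first_read(0);
print "slurp mode read: ", first_read(1);
print "without g: ", strip_text($sample, 0);
print "with g:    ", strip_text($sample, 1);
```

Without slurp mode only the first line ever reaches the substitutions, and without g only the first gap between two tags is cleaned up, which is why the script appeared to do nothing.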
rj2

Commented:
Jacknumee,
Did this solve your question?
Please close the question by clicking "Accept comment as answer" on the comment that helped you most, if you regard your question as solved.
