Link to home
Start Free TrialLog in
Avatar of nighttress
nighttress

asked on

script to run software

Hi,

Not sure if anyone has used this software before or has a good example but I am trying to use a piece of software called xml2csv to do batch conversion of a large number of very complicated and what seems badly set out xml to csv,  Does anyone use this software and hence have examples of how the script should look before running it I can customise it to use my tags just need somewhere to start, I have tired all sorts with these xml files but all attempts fail at getting data out of a specific tag which is the most important and everything will end up in an sql server 2005 database.

Thanks Phil
Avatar of nighttress
nighttress

ASKER

Here is a link to the software that I am trying to use  

http://www.a7soft.com/


Thanks Phil
I've never used the software, and so far have avoided XML files.

But if you go into the Microsoft Visual Studio 2005 (part of the client components install from SQL2K5), It has the SQL Server Integration Services (formerly DTS). Yopu can do a lot with it, I understand it should be able to read XML files natively.
Avatar of b0lsc0tt
nighttress,

I haven't tried that program either.  I am curious to hear how the info above helps.  You may be out of luck though if the documents are poorly formed.  Any xml parser or program looking at an xml file will depend on its form and validity.  Depending on the "errors" there may be ways to fix them but if it is real bad then your are better off looking for a text to CSV parser or program.

bol
Hi jimpen, b0isc0tt,

Thank you both for the replies, I have already tried using SSIS to import the xml files but due to their very poor structure and invalid characters and only the basic knowledge of xml,  SSIS seemed to have serious problems doing anything with the files that came close, this is why I thought that I would try and convert them to a more usable format to allow me to shred and import them.  I have tried some other programs that should allow you to import the xml files but they have the same sort of problems and I am not able to change the structure of the files as they are coming from a third party.  If it would be of any help I could post one of the xml files but they are large and I eventually will have to import a few thousand and then import more on a week to week basis.

Any other help would be very much appreciated as this is a task that has been on my list for some time.

Thanks PD
It may help to have a small sample that represents the file and content.  You can also let us know what environment you will run this in or what language you want to use for it.  A script or program that will work with text won't have the same issue as an xml program/parser.  However without some type of order or organization even a script/program like I described won't be able to get the info.  It will read the content better than an xml parser but the results will be unreliable or incomplete.

If you are dealing with as much data as you describe though this could still be the best option.  Especially if the bad data is bad in the same way in all files.  You may need to do it in parts but a program can be pretty amazing in what it can do.

I you know Regular Expressions then the program PowerGrep may be helpful to do this.  It is a very powerful program to search files and their content.

bol
Hi b0isc0tt

Thanks again for the reply, I have attached one of the xml files that I will be importing, I have stripped out the personal data but the structure is fine.  The tags that I would like to be able to get the data out of on the first run are.

<InstitutionType>
<InstitutionName>
<InstitutionNumber>
<InspectionNumber>

<TransformedXML><MeetingNeeds><SchoolGrade>

Although I may need further details later.

I am using SQL Server 2005 on a Windows NT server and accessing it through a Windows XP machine, my TSQL is pretty good.

I have used RegExp a long time ago but how much of that I remember I couldn't say and I would like to be able to do this as a batch process.

If there is anything else I can provide please dont hesitate to ask.

Thanks Phil

280643TextCopy.txt
The file contains valid XML, so existing XML parsers should be able to handle it.  There are several in perl that will work.  One in particular will work very well for what you need.  The XML::Simple module can read a file, and make the parts of it accessible very easily.  The following code will get the keys you asked for.  Similar code can be used to get to any of the elements in the file.
#!/usr/bin/perl
use XML::Simple;
 
my $xml = XMLin('280643TextCopy.txt');
 
print "InstitutionType=$xml->{InstitutionType}\n";
print "InstitutionName=$xml->{InstitutionName}\n";
print "InstitutionNumber=$xml->{InstitutionNumber}\n";
print "InspectionNumber=$xml->{InspectionNumber}\n";
print "TransformedXML=$xml->{InspectionReport}->{TransformedXml}\n";

Open in new window

Hi Adam,

Thanks for the reply, you say "One in particular will work very well for what you need" can you tell me which one that is so that I can take a look tonight?

Thanks Phil
It is the XML::Simple module.  The code I posted uses it.  As your requirements appear so far, this module will be the easiest to use.  As you can see from the code, a single line reads in the entire xml file into a variable (named $xml in the code above).  Once read in, you can access each element of the xml file through that variable.  

For example, the <InstitutionType> element has it's content in $xml->{InstitutionType}.
The <InspectionReport><TransformedXml> has it's content in $xml->{InspectionReport}->{TransformedXml}
And so on....

The documentation for this module is here:
http://search.cpan.org/~grantm/XML-Simple-2.18/lib/XML/Simple.pm

If you are not already familar with references in perl, you might want to check this out as well:
http://perldoc.perl.org/perlreftut.html
http://perldoc.perl.org/perldsc.html

If your requirements change to the point where XML::Simple is not sufficient, there are other XML parsers available.  I wouldn't bother looking into those now though.
Hi Adam,

Sorry it has taken me some time to get back to this, I have downloaded XML Simple 2.18 and extracted it and under the help files it says you need to download something called XML-SAX-Expat first, I have also got a copy of this now as well but I am not sure what to do with either of these, there seems to be no sort of installation file or am I looking at this wrong as I have never used perl in any way, shape or form?

Thanks Phil
Normally, you use an auto-installer.  This will install all of the modules dependencies automatically.  As long as the computer has internet access, this usually works.

If you are on windows using the ActiveState distribution:
    From a command prompt:  ppm install XML-Simple
    See here if you are behind a firewall:
    http://aspn.activestate.com/ASPN/Downloads/ActivePerl/PPM/

If you are on *nix:
    As root: perl -MCPAN -e "install XML::Simple"

The first time running on *nix, you will be asked a bunch of questions.  You can generally use the default options.
Hi Adam,

Thanks again for the quick reply, I have had a look at the link that you sent me and from what I can work out I need to download one of these installer packages as you have previously said, looking at the ActiveState web site I am guessing that I need to download something like ActivePerl Pro Studio to get what I need but it says that you have to pay for it though you do get to trial it for 21 days I would be wanting to use this over the long term, is there a free alternative as the company that I work for really dont like spending money.  I will be running this on a windows machine most of the time at home over a standard internet connection.  Am I understanding this correctly or just barking up the wrong tree?

Thanks Phil
Perl itself is free.  Most unix distributions come with it by default.  There are several companies that make it available on windows, with ActiveState being the most popular.

To get just perl, it is free.  The development studio includes some additional development tools - they are nice, but not necessary.  Any plain text editor will work to create perl files.

Here is a link to the download:
From the main page -> Language Distributions -> ActivePerl -> Get ActivePerl (on the right) -> Download (ActivePerl)

Providing your info is optional.

Once installed, you will have something called ppm.  If you start ppm (either type ppm at a command prompt with nothing else, or click the ppm icon), you will get a window listing available packages.  You can click on a package name to install it.  You can do this to install the XML::Simple package.  Or you can just run it from the command prompt using the command above.
>>  Any plain text editor will work to create perl files.

Some of our Network guys use ConTEXT (http://www.contexteditor.org/), which is freeware, which has highlight plugins for a number of languages.
Hi Adam,

Sorry again it has taken so long to get back to this but I have a number of projects on the go at the moment, I followed your instructions and downloaded and installed the installer and have tested it with the sample script that is included and everything works fine, I have also ran the script that you wrote for me and part of it seems to work, it outputs 3 quaters of what I am after I just have a couple of Questions if you dont mind me bothering you.

1.  One of the pieces of data that I was after was the SchoolGrade, this was located in the TransformedXML main tag, I have navigated down to where it is using your example but I keep getting a blank return, is there something that I am doing wrong?

#!/usr/bin/perl
use XML::Simple;
 
my $xml = XMLin('c:\XML2\test2\280643TextCopy.txt');
 
print "InstitutionType=$xml->{InstitutionType}\n";
print "InstitutionName=$xml->{InstitutionName}\n";
print "InstitutionNumber=$xml->{InstitutionNumber}\n";
print "InspectionNumber=$xml->{InspectionNumber}\n";
print "TransformedXML=$xml->{InspectionReport}->{TransformedXml}->{MeetingNeeds}->{SchoolGrader}\n";

2.  If I wanted to exacute this routine on all files in a directory what would I need to do, I tried replacing the $xml variable to "*.xnl" but it tried to read this as a file name.

3.  At the moment the script prints the results to the cmd window, is it at all possible to get the results to print to one text file for all files within the directory?

Sorry for all the questions.

Thanks Phil
I dont suppose anyone else has a solution for the above, for the first time ever I have been given a deadline so getting a little desperate :-|
1. The TransformedXML is not processed into a data structure, it is just a string.  To process that you need:
        my $TransformedXML = XMLin($xml->{InspectionReport}->{TransformedXml});
    Then you can access it's keys just like the other XML:
        $TransformedXML->{ReportBody}->{QualityOfProvision}->{Judgements}->{CareGuidance}->{SchoolGrade}

2 and 3.
    chdir("c:/XML2/test2");
    foreach (glob("*.txt")) {
        unless(open(my $out, ">$_.result")) {
            warn "Could not open $_.result: $!\n";
            next;
        }
        my $xml = XMLin($_);
        print $out "InstitutionType=$xml->{InstitutionType}\n";
        print $out "InstitutionName=$xml->{InstitutionName}\n";
        print $out "InstitutionNumber=$xml->{InstitutionNumber}\n";
        print $out "InspectionNumber=$xml->{InspectionNumber}\n";
        my $TransformedXML = XMLin($xml->{InspectionReport}->{TransformedXml});
        #additional print statements
    }
Hi Adam,

I have implemented the code that you posted and have made a change to the foreach statement to look at all xml files *.xml as the text extension was the only way the site would let me upload the example and I am getting the error

Could not find test.xml in .\ at test.pl line 3

Reading through the code could you explain something in your previous example you declared $xml=XMLin as the file example that I posted but here you declare is as $xml->[InspectionReport}, is this assuming that I have put the previous line from your example in to the script

$xml = XMLin('c:\XML2\test2\280643TextCopy.xml');

I tried to do this just to see if that was what I was meant to do and get the following error

encoding specified in XML declaration is incorrect at line 1, column 30, byte 30 at c:\Perl\lib\XML\Paraser.pm line 187

Is this something to do with the structure of the xml file rather than the perl script.

Again sorry for all the questions.

Thanks


For the XMLin function, you can either give it a string of XML, or a filename.  It will return the xml data structure, so you can access elements.  

At first, I gave the XMLin a filename.  When looping, I gave it $_, which is the filename set by the foreach line.  For the $TransformedXML, I gave it the string of XML  node.


For the filename, if the file is not in the current directory, you will have to give it the full path the the file.  

For the encoding error, do you get this on the file you attached earlier?  I don't get that error.
Hi Adam,

Thanks again for the reply, I think I have understood what you meant to I have just put the code back to the way it was, I have just changed the directory where I want to search for the files, I have downloaded the file that I originaly sent to you and put it in the directory that I want to search so I should be running the same thing as you are but when I run the script now I just get the error

Could not find test.xml in .\ at test.pl line 3

Here is the perl script that I am running

use XML::Simple;

my $TransformedXML = XMLin($xml->{InspectionReport}->{TransformedXml});
$TransformedXML->{ReportBody}->{QualityOfProvision}->{Judgements}->{CareGuidance}->{SchoolGrade};
print "InstitutionType=$xml->{InstitutionType}\n";
print "InstitutionName=$xml->{InstitutionName}\n";
print "InstitutionNumber=$xml->{InstitutionNumber}\n";
print "InspectionNumber=$xml->{InspectionNumber}\n";
print "TransformedXML=$xml->{InspectionReport}->{TransformedXml}\n";
chdir("c:/XML2/test2");
    foreach (glob("*.txt")) {
        unless(open(my $out, ">$_.result")) {
            warn "Could not open $_.result: $!\n";
            next;
        }
        my $xml = XMLin($_);
        print $out "InstitutionType=$xml->{InstitutionType}\n";
        print $out "InstitutionName=$xml->{InstitutionName}\n";
        print $out "InstitutionNumber=$xml->{InstitutionNumber}\n";
        print $out "InspectionNumber=$xml->{InspectionNumber}\n";
        my $TransformedXML = XMLin($xml->{InspectionReport}->{TransformedXml});
        #additional print statements
    }

So from this I am searching through the directory c:/XML/test2 looking at all text files to read through though this will be changed at a late time when I have sorted out the parasing to all xml files.  Am I just being stupid now as I didn't think I would find this hard though I suppose I am trying to get my head around Perl and XML at the same time knowing next to nothing of either :-)

Thanks Phil
Here is the entire script.  Note that there is a "use strict" and "use warnings" line in this version.  The use strict requires that all variables be defined.  This is good practice, as it will help you find errors when using a variable you didn't intend.  The use warnings gives warnings about possible mistakes, if it find any.  This will also help in finding any possible problems.

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use XML::Simple;
 
chdir("c:/XML2/test2");
foreach (glob("*.xml")) {
	my $out;
	unless(open($out, ">$_.result")) {
		warn "Could not open $_.result: $!\n";
		next;
	}
	my $xml = XMLin($_);
	print $out "InstitutionType=$xml->{InstitutionType}\n";
	print $out "InstitutionName=$xml->{InstitutionName}\n";
	print $out "InstitutionNumber=$xml->{InstitutionNumber}\n";
	print $out "InspectionNumber=$xml->{InspectionNumber}\n";
	
	my $TransformedXML = XMLin($xml->{InspectionReport}->{TransformedXml});
	print $out "SchoolGrade=$TransformedXML->{ReportBody}->{QualityOfProvision}->{Judgements}->{CareGuidance}->{SchoolGrade}\n";
	close($out);
}

Open in new window

Hi Adam,

I cant say thanks enough for all the help with this, I have pasted your code into my perl file and it runs, it outputs a new result file with the inspection number, type, inspection name and number, but does not output anything for the Grade, I am back to getting that error

encoding specified in XML Declaration is incorrect at line one

I have had a look at the xml files and when reading up most xml files have something like encoding="UTF-8" or encoding="UTF-16" and mine do not, I have tried adding this to one of the files but still get the same error, is there some invalid characters in the TrasnsformedXML tag that when it gets there without the relevant encoding it cant do any more?

Thanks
ASKER CERTIFIED SOLUTION
Avatar of Adam314
Adam314

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Hi Adam,

Sorry again for the delay in replying to this, I have used the piece of code that you have pasted about and ran it but although I am not getting any errors I am also not getting any output either, it just seems to think about it and thats it, I have had a look at the scrip and there doesn't seem much difference to the other pieces that you have posted with the exception of the "encoding" statement?????

Thanks Phil
Do the files have an .xml extension on them?

Is the script in the same directory as the files?  If not, uncomment the chdir statement (line 7), and make sure it has the correct directory name.

If you comment out the encoding line (line 21), does the script run correctly (except giving warnings)?
Hi Adam,

I cant thank you enough for all the help that you have given me with this and the patience you have shown with me, now that I have this task out of the way I am going to spend more time looking through what is happening in the code and learning some more about the language.

Thanks again, Phil