Link to home
Start Free TrialLog in
Avatar of LetsLearn
LetsLearn

asked on

Help setting up script and query for XML

Hi,

I would like to set up a script to query a publicly available XML file. I have used Microsoft Access in the past with scripts etc. but it's been a while, so I am hoping someone can point me in the right direction.

There is a publicly available XML file which is released from the Australian Business Register every few weeks.

https://data.gov.au/dataset/abn-bulk-extract (available from the links on the page titled 'ABN Bulk Extract Part 1' and 'ABN Bulk Extract Part 2')

It lists all of the Australian Business Numbers (ABNs) for Australian businesses.

What I would like to do is extract the new Australian Private Companies which have been registered in the state of the ACT (Australian Capital Territory) during the last release period.

I envisaged I would be able to do this by querying the following fields in the XML file:

EntityTypeIn = PRV
EntityTypeText = Australian Private Company
State = ACT
ABNStatusFromDate = 20180117

The latest release date is the 31 January 2018 and the previous release date was the 16 January 2018, so I thought I could extract the dates for businesses registered in between those release dates, 20180117, 20180118, 20180119, and so forth.

I have downloaded the XML and imported using the Get External Data function, however it doesn't import the folder structures correctly so I couldn't query the ABNStatusFromDate.

I realize it is possible to work around this by using a script but I am not familiar enough to be able to do this. The files are quite large and I wasn't able to import into Excel because it ran out of memory.

If someone can help with this it would be greatly appreciated.

Many thanks Cathryn
Avatar of John Tsioumpris
John Tsioumpris
Flag of Greece image

I think you need to take a look at the guide from Microsoft..the paragraph "How Do I Traverse a Document?" is what you are looking.
Have you written any XPath or worked with an DOM Document?
State = ACT
Do you mean status=ACT
If the following works for the first XML file in the downloaded zip file, then it can be packaged to loop through all the XML files once they've been extracted from the zip file.  Such functional packaging should include a parameter for the date, unless the date can be determined programmatically.
select-xml -Path "C:\Users\Mark\Downloads\20180131_Public01.xml" `
    -XPath '//ABR[ABN[@status="ACT" and @ABNStatusFromDate>"20180116"] | EntityType[EntityTypeIn="PRV" and EntityTypeText="Australian Private Company"]]' |
% {$_.node.outerxml} > "C:\Users\Mark\Downloads\20180131_Public01_New.xml"

Open in new window

The PS script produced a 1225 line file that looks like this:
<ABR recordLastUpdatedDate="20180129" replaced="N"><ABN status="ACT" ABNStatusFromDate="20180127">11001845059</ABN><EntityType><EntityTypeInd>IND</EntityTypeInd><EntityTypeText>Individual/Sole Trader</EntityTypeText></EntityType><LegalEntity><IndividualName type="LGL"><GivenName>ADVAITA</GivenName><GivenName>JUBILEE</GivenName><FamilyName>FOORD</FamilyName></IndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2093</Postcode></AddressDetails></BusinessAddress></LegalEntity><OtherEntity><NonIndividualName type="TRD"><NonIndividualNameText>EXPRESS DISPENSARY</NonIndividualNameText></NonIndividualName></OtherEntity></ABR>
<ABR recordLastUpdatedDate="20180130" replaced="N"><ABN status="ACT" ABNStatusFromDate="20180130">11008259859</ABN><EntityType><EntityTypeInd>IND</EntityTypeInd><EntityTypeText>Individual/Sole Trader</EntityTypeText></EntityType><LegalEntity><IndividualName type="LGL"><GivenName>ANNE</GivenName><GivenName>VERONICA</GivenName><FamilyName>LONDON</FamilyName></IndividualName><BusinessAddress><AddressDetails><State>WA</State><Postcode>6208</Postcode></AddressDetails></BusinessAddress></LegalEntity><OtherEntity><NonIndividualName type="BN"><NonIndividualNameText>AROMA BODY BALANCE</NonIndividualNameText></NonIndividualName></OtherEntity></ABR>
<ABR recordLastUpdatedDate="20180124" replaced="N"><ABN status="ACT" ABNStatusFromDate="20180123">11030625056</ABN><EntityType><EntityTypeInd>IND</EntityTypeInd><EntityTypeText>Individual/Sole Trader</EntityTypeText></EntityType><LegalEntity><IndividualName type="LGL"><GivenName>LAI MAN</GivenName><FamilyName>LO</FamilyName></IndividualName><BusinessAddress><AddressDetails><State>VIC</State><Postcode>3805</Postcode></AddressDetails></BusinessAddress></LegalEntity><OtherEntity><NonIndividualName type="TRD"><NonIndividualNameText>Asia Pacific Trading</NonIndividualNameText></NonIndividualName></OtherEntity></ABR>

Open in new window

output file produced
20180131_Public01_New.xml
Avatar of LetsLearn
LetsLearn

ASKER

Hi aikimark,

Thanks so much for your help with this.

In answer to your earlier question no I haven't written any XPath or worked with an DOM Document previously. I have done some modules in Access years ago, so it's been a while.

It should be <State>ACT</State> from the Business Address rather than status="ACT".

If I am adding this in as a module what should I have as the opening and closing statements for the module? Sounds basic, but it's a long time.

Thanks once again,
Cathryn
This is Powershell, not VBA
Thank you aikimark, works perfectly, really appreciate your help.
I think there's a problem with the XPath string.  Please test this version.
select-xml -Path "C:\Users\Mark\Downloads\20180131_Public01.xml" `
    -XPath '//ABR[./ABN[@ABNStatusFromDate>"20180116"]][./EntityType[EntityTypeInd="PRV" and EntityTypeText="Australian Private Company"]][.//AddressDetails[State="ACT"]]'
% {$_.node.outerxml} > "C:\Users\Mark\Downloads\20180131_Public01_New.xml"

Open in new window

I didn't have the State="ACT" condition in the accepted comment.

I've been playing with a VBA version of this, trying to tweak the performance.
I was able to revise the string to return the ABN[@ABNStatusFromDate="20180116", however I have struggled to be able to include the other fields.

I tried the above string but it is returning an empty file.

When I revised the string to return the first field I closed the question thinking I would be able to work out the remaining fields but have struggled.
please request that the question be reopened
to return the first field
When it works properly, it will output the ABR node/element text, based on the date and other criteria.  What are you trying to output?
I would like to output the lines where:

State = ACT
ABNStatusFromDate= "20180116" through to "20180131"
EntityTypeInd="PRV"
EntityTypeText="Australian Private Company"
This works for extracting an XML with the date criteria only. Not sure how to add in the extra criteria/fields.

select-xml -Path "C:\Users\cathr\Desktop\extract\20180131_Public01.xml" `
    -XPath '//ABR[ABN[@ABNStatusFromDate="20180116"]]' |
% {$_.node.outerxml} > "C:\Users\cathr\Desktop\extract\20180116_Public01_1.xml"

Open in new window


I'll request to have the question re-opened.
ASKER CERTIFIED SOLUTION
Avatar of aikimark
aikimark
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
That's perfect Mark! Thank you!
When they reopen the question, accept the most recent solution comment.
I sent through a support request to have the question re-opened, and received an email referring me to use a 'help bell' on the page. I can't see a help bell, or am I looking in the wrong place?
I don't know what 'bell' they are referring to, either.  I usually click the Report Question link near your question text.  Maybe they will clarify if you reply to their email.
Thanks Scott
Thanks Mark this works perfectly