Link to home
Start Free TrialLog in
Avatar of LetsLearn
LetsLearn

asked on

Large XML file extraction

I have some very large XML files which are provided from an external source. I need to search through the files to isolate certain lines based on search criteria, eg ABNStatusFromDate="20180101", <State>ACT</State> and <EntityTypeText>Australian Private Company</EntityTypeText>.

I have tried opening the files in Excel and the Windows XML Notepad reader. They crash due to the size. I can open them in the FirstObject reader very quickly and I can search one by one by each date and then double check to see if the state and entity type matches the two above criteria. But this will obviously take forever.

Is there a way I can do this quickly and extract by date and matching criteria into a smaller XML file or Excel?

I won't attach the file because of the size. Below are a couple of lines from the XML file.

Many thanks for any assistance which can be given.

<ABR recordLastUpdatedDate="20171121" replaced="N"><ABN status="ACT" ABNStatusFromDate="20171109">14622745297</ABN><EntityType><EntityTypeInd>PRV</EntityTypeInd><EntityTypeText>Australian Private Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>FUMAO PROPERTY PTY LTD</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>VIC</State><Postcode>3149</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">622745297</ASICNumber><GST status="ACT" GSTStatusFromDate="20171109" /></ABR>
<ABR recordLastUpdatedDate="20000701" replaced="N"><ABN status="CAN" ABNStatusFromDate="20000228">14622746156</ABN><EntityType><EntityTypeInd>CUT</EntityTypeInd><EntityTypeText>Corporate Unit Trust</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>THE JOKA UNIT TRUST</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2320</Postcode></AddressDetails></BusinessAddress></MainEntity></ABR>
<ABR recordLastUpdatedDate="20150202" replaced="N"><ABN status="CAN" ABNStatusFromDate="20150202">14622746237</ABN><EntityType><EntityTypeInd>IND</EntityTypeInd><EntityTypeText>Individual/Sole Trader</EntityTypeText></EntityType><LegalEntity><IndividualName type="LGL"><GivenName>BOUASRI</GivenName><FamilyName>VILLAROSA</FamilyName></IndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2560</Postcode></AddressDetails></BusinessAddress></LegalEntity><OtherEntity><NonIndividualName type="TRD"><NonIndividualNameText>Villarosa Photography</NonIndividualNameText></NonIndividualName></OtherEntity></ABR>
<ABR recordLastUpdatedDate="20171109" replaced="N"><ABN status="ACT" ABNStatusFromDate="20171109">14622749795</ABN><EntityType><EntityTypeInd>PRV</EntityTypeInd><EntityTypeText>Australian Private Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>3 DU YOGA PTY LTD</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2150</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">622749795</ASICNumber></ABR>
<ABR recordLastUpdatedDate="20030601" replaced="N"><ABN status="CAN" ABNStatusFromDate="20030601">14622750886</ABN><EntityType><EntityTypeInd>IND</EntityTypeInd><EntityTypeText>Individual/Sole Trader</EntityTypeText></EntityType><LegalEntity><IndividualName type="LGL"><NameTitle>MS</NameTitle><GivenName>JANETTE</GivenName><GivenName>ROSEMARY</GivenName><FamilyName>CAMUGLIA</FamilyName></IndividualName><BusinessAddress><AddressDetails><State>QLD</State><Postcode>4059</Postcode></AddressDetails></BusinessAddress></LegalEntity></ABR>
Avatar of Ryan Chong
Ryan Chong
Flag of Singapore image

Is there a way I can do this quickly and extract by date and matching criteria into a smaller XML file or Excel?
not sure about loading XML to Excel, but certainly it can be loaded into MS SQL, from there we probably can do a query thing to filter what you want or what you don't want.

hope that make sense?
Avatar of LetsLearn
LetsLearn

ASKER

I didn't think of doing that Ryan. Yes that would be great.
something like this:

declare @Output VARCHAR(MAX)
set @Output = '
<ABR recordLastUpdatedDate="20171121" replaced="N">
	<ABN status="ACT" ABNStatusFromDate="20171109">14622745297</ABN>
	<EntityType>
		<EntityTypeInd>PRV</EntityTypeInd>
		<EntityTypeText>Australian Private Company</EntityTypeText>
	</EntityType>
	<MainEntity>
		<NonIndividualName type="MN">
			<NonIndividualNameText>FUMAO PROPERTY PTY LTD</NonIndividualNameText>
		</NonIndividualName>
		<BusinessAddress>
			<AddressDetails>
				<State>VIC</State>
				<Postcode>3149</Postcode>
			</AddressDetails>
		</BusinessAddress>
	</MainEntity>
	<ASICNumber ASICNumberType="undetermined">622745297</ASICNumber>
	<GST status="ACT" GSTStatusFromDate="20171109" />
</ABR>
<ABR recordLastUpdatedDate="20000701" replaced="N">
	<ABN status="CAN" ABNStatusFromDate="20000228">14622746156</ABN>
	<EntityType>
		<EntityTypeInd>CUT</EntityTypeInd>
		<EntityTypeText>Corporate Unit Trust</EntityTypeText>
	</EntityType>
	<MainEntity>
		<NonIndividualName type="MN">
			<NonIndividualNameText>THE JOKA UNIT TRUST</NonIndividualNameText>
		</NonIndividualName>
		<BusinessAddress>
			<AddressDetails>
				<State>NSW</State>
				<Postcode>2320</Postcode>
			</AddressDetails>
		</BusinessAddress>
	</MainEntity>
</ABR>
<ABR recordLastUpdatedDate="20150202" replaced="N">
	<ABN status="CAN" ABNStatusFromDate="20150202">14622746237</ABN>
	<EntityType>
		<EntityTypeInd>IND</EntityTypeInd>
		<EntityTypeText>Individual/Sole Trader</EntityTypeText>
	</EntityType>
	<LegalEntity>
		<IndividualName type="LGL">
			<GivenName>BOUASRI</GivenName>
			<FamilyName>VILLAROSA</FamilyName>
		</IndividualName>
		<BusinessAddress>
			<AddressDetails>
				<State>NSW</State>
				<Postcode>2560</Postcode>
			</AddressDetails>
		</BusinessAddress>
	</LegalEntity>
	<OtherEntity>
		<NonIndividualName type="TRD">
			<NonIndividualNameText>Villarosa Photography</NonIndividualNameText>
		</NonIndividualName>
	</OtherEntity>
</ABR>
<ABR recordLastUpdatedDate="20171109" replaced="N">
	<ABN status="ACT" ABNStatusFromDate="20171109">14622749795</ABN>
	<EntityType>
		<EntityTypeInd>PRV</EntityTypeInd>
		<EntityTypeText>Australian Private Company</EntityTypeText>
	</EntityType>
	<MainEntity>
		<NonIndividualName type="MN">
			<NonIndividualNameText>3 DU YOGA PTY LTD</NonIndividualNameText>
		</NonIndividualName>
		<BusinessAddress>
			<AddressDetails>
				<State>NSW</State>
				<Postcode>2150</Postcode>
			</AddressDetails>
		</BusinessAddress>
	</MainEntity>
	<ASICNumber ASICNumberType="undetermined">622749795</ASICNumber>
</ABR>
<ABR recordLastUpdatedDate="20030601" replaced="N">
	<ABN status="CAN" ABNStatusFromDate="20030601">14622750886</ABN>
	<EntityType>
		<EntityTypeInd>IND</EntityTypeInd>
		<EntityTypeText>Individual/Sole Trader</EntityTypeText>
	</EntityType>
	<LegalEntity>
		<IndividualName type="LGL">
			<NameTitle>MS</NameTitle>
			<GivenName>JANETTE</GivenName>
			<GivenName>ROSEMARY</GivenName>
			<FamilyName>CAMUGLIA</FamilyName>
		</IndividualName>
		<BusinessAddress>
			<AddressDetails>
				<State>QLD</State>
				<Postcode>4059</Postcode>
			</AddressDetails>
		</BusinessAddress>
	</LegalEntity>
</ABR>
'

DECLARE @XML XML

SET @XML = CONVERT (XML, @Output)
	
SELECT 
	c.value('(./@recordLastUpdatedDate)', 'VARCHAR(MAX)') recordLastUpdatedDate,
	c.value('(./@replaced)', 'VARCHAR(MAX)') replaced,
	c.value('ABN[1]', 'nvarchar(max)') ABN_Text,
	c.value('(ABN/@status)[1]', 'VARCHAR(MAX)') ABN_status,
	c.value('(ABN/@ABNStatusFromDate)[1]', 'VARCHAR(MAX)') ABNStatusFromDate,
	c.value('(EntityType/EntityTypeInd)[1]', 'nvarchar(max)') EntityTypeInd,
	c.value('(EntityType/EntityTypeText)[1]', 'nvarchar(max)') EntityTypeText
	-- other fields ...
FROM @XML.nodes('//ABR') t(c)

Open in new window


User generated image
Many thanks for doing this Ryan, I'll test it and let you know how I get on.
Hi Ryan,

I was thinking of using this in MS Access as the files open reasonably quickly. Could I trouble you for one extra piece of help, how would I run this in Access? It's been a while since I have done anything with modules, classes and queries in Access.

Thanks in advance,
Cathryn
ASKER CERTIFIED SOLUTION
Avatar of Ryan Chong
Ryan Chong
Flag of Singapore image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thanks Ryan, I'm closing this as the best solution.

I might ask a similar question in the Microsoft Access area as well as it is a program I am more familiar with.

Thank you for your help.