Link to home
Start Free TrialLog in
Avatar of Russellbrown
RussellbrownFlag for United States of America

asked on

Excel VBA to Extract XBRL Data

Hi, I have downloaded programmatically from SEC (https://www.sec.gov/Archives/edgar/data/1800/000110465914056950/0001104659-14-056950-index.htm) a set of XBRL documents in zipped format  for a specific entity ( CIK Number is 0000001800    ) to my local drive.
The zip file is 0000001800_0001104659-14-056950-xbrl.zip and contains 6 xml documents. One of these files is named: abt-20140630.xml.

I would like to pull 4 values from the said xml to Excel using VBA ( without the need to unzipped the zip file).
i.e. Sales: 5551000000, CostOfGoodsSold: 2506000000, Start Date: 2014-01-01 and End Date:2014-06-30
The reference sources are:

1. <us-gaap:SalesRevenueNet id="Tag334" decimals="-6" contextRef="D2014Q2" unitRef="USD">5551000000</us-gaap:SalesRevenueNet>
2. <us-gaap:CostOfGoodsSold id="Tag437" decimals="-6" contextRef="D2014Q2" unitRef="USD">2506000000</us-gaap:CostOfGoodsSold>

3. <xbrli:period>
      <xbrli:startDate>2014-01-01</xbrli:startDate>
      <xbrli:endDate>2014-06-30</xbrli:endDate>
    </xbrli:period>


Specifications:
1. MS Excel 2010
2. Windows 8

Would appreciate your help with the VBA code to achieve this.

Thank you.
0000001800-0001104659-14-056950-xbrl.zip
Avatar of aikimark
aikimark
Flag of United States of America image

VBA can be used to extract one or more files from a zip file like this:
Dim oShell As Object

Set oShell = CreateObject("Shell.Application")
oShell.Namespace("C:\users\RussellBrown\Downloads").CopyHere oApp.Namespace("C:\users\RussellBrown\Downloads\0000001800_0001104659-14-056950-xbrl.zip").items.Item("abt-20140630.xml")

Open in new window


Once extracted, you would use a DOM object or a regular expression object to extract the data from the abt-20140630.xml file

This is a regular expression pattern to get that data
<xbrli:startDate>([^<]*)</xbrli:startDate>(?:.|\n)*?<xbrli:endDate>([^<]*)</xbrli:endDate>(?:.|\n)*?<us-gaap:SalesRevenueNet id="Tag334"[^>]*>(\d*)</us-gaap:SalesRevenueNet>(?:.|\n)*?<us-gaap:CostOfGoodsSold id="Tag437"[^>]*>(\d*)</us-gaap:CostOfGoodsSold>

Open in new window

Which produces the following

SubMatch 0: 2014-01-01
SubMatch 1: 2014-06-30
SubMatch 2: 5551000000
SubMatch 3: 2506000000

I'm not sure of some of the details.  For instance,
1. Since you have multiple SalesRevenueNet and CostOfGoodsSold items, do we differentiate them by just their id values or by other/more data?
2. Since there are multiple xbrli:startDate and xbrli:endDate items, which pair should be used?
Avatar of Russellbrown

ASKER

Hi, thank you very much for the reply.
I am able to get the first part working i.e.

Sub ExtractFromZip()
Dim oShell As Object

Set oShell = CreateObject("Shell.Application")
oShell.Namespace("C:\ZTestSEC\TestEE").CopyHere oShell.Namespace("C:\ZTestSEC\0000001800_0001104659-14-056950-xbrl.zip").items.Item("abt-20140630.xml")

End Sub

How would I add the additional code for the second part i.e.
'<xbrli:startDate>([^<]*)</xbrli:startDate>(?:.|\n)*?<xbrli:endDate>([^<]*)</xbrli:endDate>(?:.|\n)*?<us-gaap:SalesRevenueNet id="Tag334"[^>]*>(\d*)</us-gaap:SalesRevenueNet>(?:.|\n)*?<us-gaap:CostOfGoodsSold id="Tag437"[^>]*>(\d*)</us-gaap:CostOfGoodsSold>

Sorry, I am not familiar with the DOM object or a regular expression object. Can you please assist by providing the vba code lines?

Will revert on the 2 questions that you have kindly raised  after getting the code done,  in order to keep it simple.

Thanks.
Please answer my questions relative to the actual XML file.
Hi,
1. for SalesRevenueNet and CostOfGoodsSold  items- we will use the contextRef, and in this case it is ="D2014Q2" representing the 2nd quarter of Year 2014. Later on we can modify it into a variable.
2. For the start and end dates - the first instance of the <xbrli:period></xbrli:period> if it is possible.
ASKER CERTIFIED SOLUTION
Avatar of aikimark
aikimark
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
It works perfectly. Thank you very much.  I noticed the patterns i.e.

1)([^<]*)

2) (?:.|\n)*?

3)  .*?contextRef=""D2014Q2""[^>]*>(\d*)    

I visited http://msdn.microsoft.com/en-us/library/ms974570.aspx to have a look at the syntax and expressions. Hope this is the correct reference site.
Can you please kick-start me on how to read the symbols so I can apply it accordingly to other scenarios?
 
Thanks.
I experimented with the codes by adding 2 more items: Sales and COGS for D2014Q2YTD i.e.
a = "<xbrli:startDate>([^<]*)</xbrli:startDate>(?:.|\n)*?"
b = "<xbrli:endDate>([^<]*)</xbrli:endDate>(?:.|\n)*?"
c = "<us-gaap:SalesRevenueNet .*?contextRef=""D2014Q2""[^>]*>(\d*)</us-gaap:SalesRevenueNet>(?:.|\n)*?"
d = "<us-gaap:CostOfGoodsSold .*?contextRef=""D2014Q2""[^>]*>(\d*)</us-gaap:CostOfGoodsSold>(?:.|\n)*?"
e = "<us-gaap:SalesRevenueNet .*?contextRef=""D2014Q2YTD""[^>]*>(\d*)</us-gaap:SalesRevenueNet>(?:.|\n)*?"
f = "<us-gaap:CostOfGoodsSold .*?contextRef=""D2014Q2YTD""[^>]*>(\d*)</us-gaap:CostOfGoodsSold>"
   
I get the following results:
1)  oRE.Pattern = a & b & c & d works ( 4 items )
2)  oRE.Pattern = a & b & e & f works  ( 4 items )
3)  oRE.Pattern = a & b & c & d & e & f not working ( 4 + 2 more items )

Would appreciate it very much if you could point out where I have do it wrong. Thanks.
Here is a good article by Patrick Matthews that you should read.
http:A_1336-Using-Regular-Expressions-in-Visual-Basic-for-Applications-and-Visual-Basic-6.html

1)([^<]*)
Capture all characters that are not a less than character and place them into a capture group.

2) (?:.|\n)*?
Ignore all characters, but don't be greedy when matching.  Although this is in parentheses, the leading ?: characters indicate that this group should not be captured.

3)  .*?contextRef=""D2014Q2""[^>]*>(\d*)
Ignore all characters, but don't be greedy when matching.  Then match contextRef="D2014Q2" string followed by any characters that aren't a > character, terminating with a > character.  Then capture zero or more digits.  Since this pattern is a VB string, actual quote mark characters are represented by double-quote character sequences.
The pattern should resemble the order of the items in the file.
a & b & c & e & d & f

It is possible to go after the data one or two patterns at a time.
Thank you so much for the excellent replies. I will go through the references and try to work with some examples. Before I sign off and happily accept  the solutions to my question(s), would you be familiar with the SEC XBRL? If so, I will post it as a new question i.e.

Either using Python or Excel VBA:
1. obtain a list of Instance URLs for 10-K filings e.g. http://www.sec.gov/Archives/edgar/data/1800/000110465909048013/abt-20090630.xml
a) use the monthly RSS feed e.g. http://www.sec.gov/Archives/edgar/monthly/xbrlrss-2014-08.xml  to obtain the company name, CIK, Accession Number/Acc-no: and Period of Report. Add a loop to iterate through the years and months.
b) output the list with details to either Excel / Pyhton.

Many thanks and regards.
If you have a follow-up question, then close this thread and include a link to this thread in your new question.  After posting the new question, add a comment in this thread letting the experts (and future readers) know about the new question.
Excellent. The replies were clear and concise and addressed my question(s) completely.