Russellbrown
asked on
Excel VBA to Extract XBRL Data
Hi, I have downloaded programmatically from SEC (https://www.sec.gov/Archives/edgar/data/1800/000110465914056950/0001104659-14-056950-index.htm) a set of XBRL documents in zipped format for a specific entity ( CIK Number is 0000001800 ) to my local drive.
The zip file is 0000001800_0001104659-14-0 56950-xbrl .zip and contains 6 xml documents. One of these files is named: abt-20140630.xml.
I would like to pull 4 values from the said xml to Excel using VBA ( without the need to unzipped the zip file).
i.e. Sales: 5551000000, CostOfGoodsSold: 2506000000, Start Date: 2014-01-01 and End Date:2014-06-30
The reference sources are:
1. <us-gaap:SalesRevenueNet id="Tag334" decimals="-6" contextRef="D2014Q2" unitRef="USD">5551000000</ us-gaap:Sa lesRevenue Net>
2. <us-gaap:CostOfGoodsSold id="Tag437" decimals="-6" contextRef="D2014Q2" unitRef="USD">2506000000</ us-gaap:Co stOfGoodsS old>
3. <xbrli:period>
<xbrli:startDate>2014-01-0 1</xbrli:s tartDate>
<xbrli:endDate>2014-06-30< /xbrli:end Date>
</xbrli:period>
Specifications:
1. MS Excel 2010
2. Windows 8
Would appreciate your help with the VBA code to achieve this.
Thank you.
0000001800-0001104659-14-056950-xbrl.zip
The zip file is 0000001800_0001104659-14-0
I would like to pull 4 values from the said xml to Excel using VBA ( without the need to unzipped the zip file).
i.e. Sales: 5551000000, CostOfGoodsSold: 2506000000, Start Date: 2014-01-01 and End Date:2014-06-30
The reference sources are:
1. <us-gaap:SalesRevenueNet id="Tag334" decimals="-6" contextRef="D2014Q2" unitRef="USD">5551000000</
2. <us-gaap:CostOfGoodsSold id="Tag437" decimals="-6" contextRef="D2014Q2" unitRef="USD">2506000000</
3. <xbrli:period>
<xbrli:startDate>2014-01-0
<xbrli:endDate>2014-06-30<
</xbrli:period>
Specifications:
1. MS Excel 2010
2. Windows 8
Would appreciate your help with the VBA code to achieve this.
Thank you.
0000001800-0001104659-14-056950-xbrl.zip
ASKER
Hi, thank you very much for the reply.
I am able to get the first part working i.e.
Sub ExtractFromZip()
Dim oShell As Object
Set oShell = CreateObject("Shell.Applic ation")
oShell.Namespace("C:\ZTest SEC\TestEE ").CopyHer e oShell.Namespace("C:\ZTest SEC\000000 1800_00011 04659-14-0 56950-xbrl .zip").ite ms.Item("a bt-2014063 0.xml")
End Sub
How would I add the additional code for the second part i.e.
'<xbrli:startDate>([^<]*)< /xbrli:sta rtDate>(?: .|\n)*?<xb rli:endDat e>([^<]*)< /xbrli:end Date>(?:.| \n)*?<us-g aap:SalesR evenueNet id="Tag334"[^>]*>(\d*)</us -gaap:Sale sRevenueNe t>(?:.|\n) *?<us-gaap :CostOfGoo dsSold id="Tag437"[^>]*>(\d*)</us -gaap:Cost OfGoodsSol d>
Sorry, I am not familiar with the DOM object or a regular expression object. Can you please assist by providing the vba code lines?
Will revert on the 2 questions that you have kindly raised after getting the code done, in order to keep it simple.
Thanks.
I am able to get the first part working i.e.
Sub ExtractFromZip()
Dim oShell As Object
Set oShell = CreateObject("Shell.Applic
oShell.Namespace("C:\ZTest
End Sub
How would I add the additional code for the second part i.e.
'<xbrli:startDate>([^<]*)<
Sorry, I am not familiar with the DOM object or a regular expression object. Can you please assist by providing the vba code lines?
Will revert on the 2 questions that you have kindly raised after getting the code done, in order to keep it simple.
Thanks.
Please answer my questions relative to the actual XML file.
ASKER
Hi,
1. for SalesRevenueNet and CostOfGoodsSold items- we will use the contextRef, and in this case it is ="D2014Q2" representing the 2nd quarter of Year 2014. Later on we can modify it into a variable.
2. For the start and end dates - the first instance of the <xbrli:period></xbrli:peri od> if it is possible.
1. for SalesRevenueNet and CostOfGoodsSold items- we will use the contextRef, and in this case it is ="D2014Q2" representing the 2nd quarter of Year 2014. Later on we can modify it into a variable.
2. For the start and end dates - the first instance of the <xbrli:period></xbrli:peri
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
It works perfectly. Thank you very much. I noticed the patterns i.e.
1)([^<]*)
2) (?:.|\n)*?
3) .*?contextRef=""D2014Q2""[ ^>]*>(\d*)
I visited http://msdn.microsoft.com/en-us/library/ms974570.aspx to have a look at the syntax and expressions. Hope this is the correct reference site.
Can you please kick-start me on how to read the symbols so I can apply it accordingly to other scenarios?
Thanks.
1)([^<]*)
2) (?:.|\n)*?
3) .*?contextRef=""D2014Q2""[
I visited http://msdn.microsoft.com/en-us/library/ms974570.aspx to have a look at the syntax and expressions. Hope this is the correct reference site.
Can you please kick-start me on how to read the symbols so I can apply it accordingly to other scenarios?
Thanks.
ASKER
I experimented with the codes by adding 2 more items: Sales and COGS for D2014Q2YTD i.e.
a = "<xbrli:startDate>([^<]*)< /xbrli:sta rtDate>(?: .|\n)*?"
b = "<xbrli:endDate>([^<]*)</x brli:endDa te>(?:.|\n )*?"
c = "<us-gaap:SalesRevenueNet .*?contextRef=""D2014Q2""[ ^>]*>(\d*) </us-gaap: SalesReven ueNet>(?:. |\n)*?"
d = "<us-gaap:CostOfGoodsSold .*?contextRef=""D2014Q2""[ ^>]*>(\d*) </us-gaap: CostOfGood sSold>(?:. |\n)*?"
e = "<us-gaap:SalesRevenueNet .*?contextRef=""D2014Q2YTD ""[^>]*>(\ d*)</us-ga ap:SalesRe venueNet>( ?:.|\n)*?"
f = "<us-gaap:CostOfGoodsSold .*?contextRef=""D2014Q2YTD ""[^>]*>(\ d*)</us-ga ap:CostOfG oodsSold>"
I get the following results:
1) oRE.Pattern = a & b & c & d works ( 4 items )
2) oRE.Pattern = a & b & e & f works ( 4 items )
3) oRE.Pattern = a & b & c & d & e & f not working ( 4 + 2 more items )
Would appreciate it very much if you could point out where I have do it wrong. Thanks.
a = "<xbrli:startDate>([^<]*)<
b = "<xbrli:endDate>([^<]*)</x
c = "<us-gaap:SalesRevenueNet .*?contextRef=""D2014Q2""[
d = "<us-gaap:CostOfGoodsSold .*?contextRef=""D2014Q2""[
e = "<us-gaap:SalesRevenueNet .*?contextRef=""D2014Q2YTD
f = "<us-gaap:CostOfGoodsSold .*?contextRef=""D2014Q2YTD
I get the following results:
1) oRE.Pattern = a & b & c & d works ( 4 items )
2) oRE.Pattern = a & b & e & f works ( 4 items )
3) oRE.Pattern = a & b & c & d & e & f not working ( 4 + 2 more items )
Would appreciate it very much if you could point out where I have do it wrong. Thanks.
Here is a good article by Patrick Matthews that you should read.
http:A_1336-Using-Regular-Expressions-in-Visual-Basic-for-Applications-and-Visual-Basic-6.html
1)([^<]*)
Capture all characters that are not a less than character and place them into a capture group.
2) (?:.|\n)*?
Ignore all characters, but don't be greedy when matching. Although this is in parentheses, the leading ?: characters indicate that this group should not be captured.
3) .*?contextRef=""D2014Q2""[ ^>]*>(\d*)
Ignore all characters, but don't be greedy when matching. Then match contextRef="D2014Q2" string followed by any characters that aren't a > character, terminating with a > character. Then capture zero or more digits. Since this pattern is a VB string, actual quote mark characters are represented by double-quote character sequences.
http:A_1336-Using-Regular-Expressions-in-Visual-Basic-for-Applications-and-Visual-Basic-6.html
1)([^<]*)
Capture all characters that are not a less than character and place them into a capture group.
2) (?:.|\n)*?
Ignore all characters, but don't be greedy when matching. Although this is in parentheses, the leading ?: characters indicate that this group should not be captured.
3) .*?contextRef=""D2014Q2""[
Ignore all characters, but don't be greedy when matching. Then match contextRef="D2014Q2" string followed by any characters that aren't a > character, terminating with a > character. Then capture zero or more digits. Since this pattern is a VB string, actual quote mark characters are represented by double-quote character sequences.
The pattern should resemble the order of the items in the file.
a & b & c & e & d & f
It is possible to go after the data one or two patterns at a time.
a & b & c & e & d & f
It is possible to go after the data one or two patterns at a time.
ASKER
Thank you so much for the excellent replies. I will go through the references and try to work with some examples. Before I sign off and happily accept the solutions to my question(s), would you be familiar with the SEC XBRL? If so, I will post it as a new question i.e.
Either using Python or Excel VBA:
1. obtain a list of Instance URLs for 10-K filings e.g. http://www.sec.gov/Archives/edgar/data/1800/000110465909048013/abt-20090630.xml
a) use the monthly RSS feed e.g. http://www.sec.gov/Archives/edgar/monthly/xbrlrss-2014-08.xml to obtain the company name, CIK, Accession Number/Acc-no: and Period of Report. Add a loop to iterate through the years and months.
b) output the list with details to either Excel / Pyhton.
Many thanks and regards.
Either using Python or Excel VBA:
1. obtain a list of Instance URLs for 10-K filings e.g. http://www.sec.gov/Archives/edgar/data/1800/000110465909048013/abt-20090630.xml
a) use the monthly RSS feed e.g. http://www.sec.gov/Archives/edgar/monthly/xbrlrss-2014-08.xml to obtain the company name, CIK, Accession Number/Acc-no: and Period of Report. Add a loop to iterate through the years and months.
b) output the list with details to either Excel / Pyhton.
Many thanks and regards.
If you have a follow-up question, then close this thread and include a link to this thread in your new question. After posting the new question, add a comment in this thread letting the experts (and future readers) know about the new question.
ASKER
Excellent. The replies were clear and concise and addressed my question(s) completely.
Open in new window
Once extracted, you would use a DOM object or a regular expression object to extract the data from the abt-20140630.xml file
This is a regular expression pattern to get that data
Open in new window
Which produces the followingSubMatch 0: 2014-01-01
SubMatch 1: 2014-06-30
SubMatch 2: 5551000000
SubMatch 3: 2506000000
I'm not sure of some of the details. For instance,
1. Since you have multiple SalesRevenueNet and CostOfGoodsSold items, do we differentiate them by just their id values or by other/more data?
2. Since there are multiple xbrli:startDate and xbrli:endDate items, which pair should be used?