Excel VBA to Extract XBRL Data

Hi, I have downloaded programmatically from SEC (https://www.sec.gov/Archives/edgar/data/1800/000110465914056950/0001104659-14-056950-index.htm) a set of XBRL documents in zipped format  for a specific entity ( CIK Number is 0000001800    ) to my local drive.
The zip file is 0000001800_0001104659-14-056950-xbrl.zip and contains 6 xml documents. One of these files is named: abt-20140630.xml.

I would like to pull 4 values from the said xml to Excel using VBA ( without the need to unzipped the zip file).
i.e. Sales: 5551000000, CostOfGoodsSold: 2506000000, Start Date: 2014-01-01 and End Date:2014-06-30
The reference sources are:

1. <us-gaap:SalesRevenueNet id="Tag334" decimals="-6" contextRef="D2014Q2" unitRef="USD">5551000000</us-gaap:SalesRevenueNet>
2. <us-gaap:CostOfGoodsSold id="Tag437" decimals="-6" contextRef="D2014Q2" unitRef="USD">2506000000</us-gaap:CostOfGoodsSold>

3. <xbrli:period>

1. MS Excel 2010
2. Windows 8

Would appreciate your help with the VBA code to achieve this.

Thank you.
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

VBA can be used to extract one or more files from a zip file like this:
Dim oShell As Object

Set oShell = CreateObject("Shell.Application")
oShell.Namespace("C:\users\RussellBrown\Downloads").CopyHere oApp.Namespace("C:\users\RussellBrown\Downloads\0000001800_0001104659-14-056950-xbrl.zip").items.Item("abt-20140630.xml")

Open in new window

Once extracted, you would use a DOM object or a regular expression object to extract the data from the abt-20140630.xml file

This is a regular expression pattern to get that data
<xbrli:startDate>([^<]*)</xbrli:startDate>(?:.|\n)*?<xbrli:endDate>([^<]*)</xbrli:endDate>(?:.|\n)*?<us-gaap:SalesRevenueNet id="Tag334"[^>]*>(\d*)</us-gaap:SalesRevenueNet>(?:.|\n)*?<us-gaap:CostOfGoodsSold id="Tag437"[^>]*>(\d*)</us-gaap:CostOfGoodsSold>

Open in new window

Which produces the following

SubMatch 0: 2014-01-01
SubMatch 1: 2014-06-30
SubMatch 2: 5551000000
SubMatch 3: 2506000000

I'm not sure of some of the details.  For instance,
1. Since you have multiple SalesRevenueNet and CostOfGoodsSold items, do we differentiate them by just their id values or by other/more data?
2. Since there are multiple xbrli:startDate and xbrli:endDate items, which pair should be used?
RussellbrownAuthor Commented:
Hi, thank you very much for the reply.
I am able to get the first part working i.e.

Sub ExtractFromZip()
Dim oShell As Object

Set oShell = CreateObject("Shell.Application")
oShell.Namespace("C:\ZTestSEC\TestEE").CopyHere oShell.Namespace("C:\ZTestSEC\0000001800_0001104659-14-056950-xbrl.zip").items.Item("abt-20140630.xml")

End Sub

How would I add the additional code for the second part i.e.
'<xbrli:startDate>([^<]*)</xbrli:startDate>(?:.|\n)*?<xbrli:endDate>([^<]*)</xbrli:endDate>(?:.|\n)*?<us-gaap:SalesRevenueNet id="Tag334"[^>]*>(\d*)</us-gaap:SalesRevenueNet>(?:.|\n)*?<us-gaap:CostOfGoodsSold id="Tag437"[^>]*>(\d*)</us-gaap:CostOfGoodsSold>

Sorry, I am not familiar with the DOM object or a regular expression object. Can you please assist by providing the vba code lines?

Will revert on the 2 questions that you have kindly raised  after getting the code done,  in order to keep it simple.

Please answer my questions relative to the actual XML file.
OWASP: Threats Fundamentals

Learn the top ten threats that are present in modern web-application development and how to protect your business from them.

RussellbrownAuthor Commented:
1. for SalesRevenueNet and CostOfGoodsSold  items- we will use the contextRef, and in this case it is ="D2014Q2" representing the 2nd quarter of Year 2014. Later on we can modify it into a variable.
2. For the start and end dates - the first instance of the <xbrli:period></xbrli:period> if it is possible.
Here is a version of VBA code that displays the parsed data in the immediate window
Option Explicit

Sub Q_28497353()
    Dim oRE As Object
    Dim oMatches As Object
    Dim oM As Object
    Dim oSM As Object
    Dim lngSM As Long
    Dim strXML As String
    Dim oShell As Object
    Const cPath As String = "C:\users\RussellBrown\Downloads"
    Const cFile As String = "abt-20140630.xml"
    Const cZipfile As String = "0000001800-0001104659-14-056950-xbrl.zip"
    If Len(Dir(cPath & "\" & cFile)) = 0 Then
        If Len(Dir(cPath & "\" & cZipfile)) = 0 Then
            MsgBox "Zip file does not exist", vbCritical
            Exit Sub
            Set oShell = CreateObject("Shell.Application")
            oShell.Namespace(cPath).CopyHere oShell.Namespace(cPath & "\" & cZipfile).items.Item(cFile)
        End If
    End If
    If Len(Dir(cPath & "\" & cFile)) = 0 Then
        MsgBox "Unable to extract " & cFile & " from zip file", vbCritical
        Exit Sub
    End If
    Open cPath & "\" & cFile For Input As #1
    strXML = Input(LOF(1), #1)
    Close #1
    Set oRE = CreateObject("vbscript.regexp")
    oRE.Global = True
    oRE.Pattern = "<xbrli:startDate>([^<]*)</xbrli:startDate>(?:.|\n)*?<xbrli:endDate>([^<]*)</xbrli:endDate>(?:.|\n)*?<us-gaap:SalesRevenueNet .*?contextRef=""D2014Q2""[^>]*>(\d*)</us-gaap:SalesRevenueNet>(?:.|\n)*?<us-gaap:CostOfGoodsSold .*?contextRef=""D2014Q2""[^>]*>(\d*)</us-gaap:CostOfGoodsSold>"

    If oRE.test(strXML) Then
        Set oMatches = oRE.Execute(strXML)
        For Each oM In oMatches
           With oM
               For lngSM = 0 To .submatches.Count - 1
                   Debug.Print "submatch(" & lngSM & ")", .submatches(lngSM)
           End With
        MsgBox "XML data did not match regular expression pattern", vbExclamation
    End If

End Sub

Open in new window

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
RussellbrownAuthor Commented:
It works perfectly. Thank you very much.  I noticed the patterns i.e.


2) (?:.|\n)*?

3)  .*?contextRef=""D2014Q2""[^>]*>(\d*)    

I visited http://msdn.microsoft.com/en-us/library/ms974570.aspx to have a look at the syntax and expressions. Hope this is the correct reference site.
Can you please kick-start me on how to read the symbols so I can apply it accordingly to other scenarios?
RussellbrownAuthor Commented:
I experimented with the codes by adding 2 more items: Sales and COGS for D2014Q2YTD i.e.
a = "<xbrli:startDate>([^<]*)</xbrli:startDate>(?:.|\n)*?"
b = "<xbrli:endDate>([^<]*)</xbrli:endDate>(?:.|\n)*?"
c = "<us-gaap:SalesRevenueNet .*?contextRef=""D2014Q2""[^>]*>(\d*)</us-gaap:SalesRevenueNet>(?:.|\n)*?"
d = "<us-gaap:CostOfGoodsSold .*?contextRef=""D2014Q2""[^>]*>(\d*)</us-gaap:CostOfGoodsSold>(?:.|\n)*?"
e = "<us-gaap:SalesRevenueNet .*?contextRef=""D2014Q2YTD""[^>]*>(\d*)</us-gaap:SalesRevenueNet>(?:.|\n)*?"
f = "<us-gaap:CostOfGoodsSold .*?contextRef=""D2014Q2YTD""[^>]*>(\d*)</us-gaap:CostOfGoodsSold>"
I get the following results:
1)  oRE.Pattern = a & b & c & d works ( 4 items )
2)  oRE.Pattern = a & b & e & f works  ( 4 items )
3)  oRE.Pattern = a & b & c & d & e & f not working ( 4 + 2 more items )

Would appreciate it very much if you could point out where I have do it wrong. Thanks.
Here is a good article by Patrick Matthews that you should read.

Capture all characters that are not a less than character and place them into a capture group.

2) (?:.|\n)*?
Ignore all characters, but don't be greedy when matching.  Although this is in parentheses, the leading ?: characters indicate that this group should not be captured.

3)  .*?contextRef=""D2014Q2""[^>]*>(\d*)
Ignore all characters, but don't be greedy when matching.  Then match contextRef="D2014Q2" string followed by any characters that aren't a > character, terminating with a > character.  Then capture zero or more digits.  Since this pattern is a VB string, actual quote mark characters are represented by double-quote character sequences.
The pattern should resemble the order of the items in the file.
a & b & c & e & d & f

It is possible to go after the data one or two patterns at a time.
RussellbrownAuthor Commented:
Thank you so much for the excellent replies. I will go through the references and try to work with some examples. Before I sign off and happily accept  the solutions to my question(s), would you be familiar with the SEC XBRL? If so, I will post it as a new question i.e.

Either using Python or Excel VBA:
1. obtain a list of Instance URLs for 10-K filings e.g. http://www.sec.gov/Archives/edgar/data/1800/000110465909048013/abt-20090630.xml
a) use the monthly RSS feed e.g. http://www.sec.gov/Archives/edgar/monthly/xbrlrss-2014-08.xml  to obtain the company name, CIK, Accession Number/Acc-no: and Period of Report. Add a loop to iterate through the years and months.
b) output the list with details to either Excel / Pyhton.

Many thanks and regards.
If you have a follow-up question, then close this thread and include a link to this thread in your new question.  After posting the new question, add a comment in this thread letting the experts (and future readers) know about the new question.
RussellbrownAuthor Commented:
Excellent. The replies were clear and concise and addressed my question(s) completely.
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Microsoft Excel

From novice to tech pro — start learning today.