Link to home
Start Free TrialLog in
Avatar of NVIT
NVITFlag for United States of America

asked on

Extract field headers and values between quotes

Note to admin: I have posted to VBScript topic before (I think). Why doesn't it work now? so, I put VBA, even though I want a VBS or VBScript *sigh*

Here is a sample of a text file with records.

I'd like to extract the fields and their values

<ET FP="Public\ACAD\PROJECTS\1400\In\112315A\Pages from Coconut Island Infrastructure and Bldg Improvements As-Builts.pdf" SZ="392051" LB="2040856" AT="1448320810" CT="1448320810" MT="1448320810" ST="9" AR="0" XA="8192" IN="0" OST="0" IIMG="9406"/>
<ET FP="Public\ACAD\PROJECTS\1400\In\112316A" SZ="11366959" LB="2041241" AT="1479924188" CT="1479924186" MT="1479924188" ST="10" AR="0" XA="8192" IN="0" OST="0" IIMG="9407"/>
<ET FP="Public\ACAD\PROJECTS\1400\In\112316A\SCHEME B.pdf" SZ="11366959" LB="2041242" AT="1479924188" CT="1479924188" MT="1479924188" ST="9" AR="0" XA="8192" IN="0" OST="0" IIMG="9408"/>
<ET FP="Public\ACAD\PROJECTS\1400\In\112316B" SZ="4968231" LB="2052356" AT="1479942325" CT="1479942322" MT="1479942325" ST="10" AR="0" XA="8192" IN="0" OST="0" IIMG="9409"/>
<ET FP="Public\ACAD\PROJECTS\1400\In\112316B\Nohea-PHASE 2 SITE PLAN.PDF" SZ="1539304" LB="2052357" AT="1479942325" CT="1479942325" MT="1479942325" ST="9" AR="0" XA="8192" IN="0" OST="0" IIMG="9410"/>
<ET FP="Public\ACAD\PROJECTS\1400\In\112316B\Project Location.pdf" SZ="3428927" LB="2053863" AT="1479942325" CT="1479942325" MT="1479942325" ST="9" AR="0" XA="8192" IN="0" OST="0" IIMG="9411"/>

Open in new window


So it becomes...
Public\ACAD\PROJECTS\1400\In\112315A\Pages from Coconut Island Infrastructure and Bldg Improvements As-Builts.pdf,392051,2040856,1448320810,1448320810,1448320810,9,0,8192,0,0,9406
Public\ACAD\PROJECTS\1400\In\112316A,11366959,2041241,1479924188,1479924186,1479924188,10,0,8192,0,0,9407
Public\ACAD\PROJECTS\1400\In\112316A\SCHEME B.pdf,11366959,2041242,1479924188,1479924188,1479924188,9,0,8192,0,0,9408
Public\ACAD\PROJECTS\1400\In\112316B,4968231,2052356,1479942325,1479942322,1479942325,10,0,8192,0,0,9409
Public\ACAD\PROJECTS\1400\In\112316B\Nohea-PHASE 2 SITE PLAN.PDF,1539304,2052357,1479942325,1479942325,1479942325,9,0,8192,0,0,9410
Public\ACAD\PROJECTS\1400\In\112316B\Project Location.pdf,3428927,2053863,1479942325,1479942325,1479942325,9,0,8192,0,0,9411

Open in new window

SOLUTION
Avatar of JesterToo
JesterToo
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
SOLUTION
Avatar of Bill Prew
Bill Prew

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Similar to BillPrew's code, but using regex to remove what we don't want from each line
Option Explicit

Const ForReading = 1
Const ForWriting = 2
Const ForAppending = 8
Const TristateTrue = -1
Const TristateFalse = 0
Const TristateUseDefault = -2

Dim strLine, strInFile, strOutFile, objFSO, objInFile, objOutFile, objRegex

strInFile = "B:\ee\EE28990550\in.txt"
strOutFile = "B:\ee\EE28990550\out.txt"

Set objFSO = CreateObject("Scripting.FileSystemObject")

Set objInFile = objFSO.OpenTextFile(strInFile, ForReading, False, TriStateUseDefault)
Set objOutFile = objFSO.OpenTextFile(strOutFile, ForWriting, True, TriStateUseDefault)

Set objRegex = New RegExp
objRegex.Global = True
objRegex.Pattern = "(^[^""]+="")|(""[^""]+$)|(""[^""]+="")"

Do Until objInFile.AtEndOfStream
    strLine = objInFile.ReadLine
    strLine = objRegex.Replace(strLine, ",")
    strLine = Mid(strLine, 2)    'remove initial comma
    strLine = Left(strLine, Len(strLine) - 1)    'remove trailing comma
    objOutFile.WriteLine strLine
Loop

objInFile.Close
objOutFile.Close

Open in new window


Note: It might even be possible to do something like this for the entire file, but my initial regex tests weren't promising.
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of Bill Prew
Bill Prew

Just to clarify, the code I posted earlier produced an output file exactly matching the sample desired output that was posted by NVIT.

~bp
Avatar of NVIT

ASKER

@Aikimark and Bill: Not sure how I should grade this one because I did say I wanted the header, which Bill's kind of provides but Aiki doesn't. I was hoping for a cleaner header line, i.e. just the field names and commas. I can live with that. On the other hand, although I didn't post a sample, Bill's missed the top record, which Aiki's provides. So, I'll divide it equally. To show, here's the top 2 lines:

Bill's:
" SZ=, LB=, AT=, CT=, MT=, ST=, AR=, XA=, IN=, OST=, IIMG=
Public,813139942623,66,1461377486,1217736680,1461377486,10,0,8192,0,0,1

Aiki's:
,813139942623,64,0,0,0,10,16,8192,0,0,0
Public,813139942623,66,1461377486,1217736680,1461377486,10,0,8192,0,0,1

@JesterToo: Yours was close but the 2nd-to last field missed the comma. Thus making the last field 07 instead of 0,7
NVIT,

Curios, where did you request a header record, I don't see it which was why I didn't plan on one.

~bp
In your question text, you wrote:
I'd like to extract the fields and their values
As Bill Prew noted in his most recent comment, you neither asked for a header nor showed any header in your output example.

Since this question is now closed, I recommend that you open a new question and ask about a header line.  It would be helpful to those experts to have a link to this question thread.
Did the input text you posted accurately represent the actual input data?
Avatar of NVIT

ASKER

I was not clear. My apologies...
Is field and header the same? I stated in the title but not in the body.

The fields / headers would be on the left of the values. e.g.
FP="Public\ACAD\PROJECTS\1400\In\112316A" SZ="11366959" LB=

Open in new window


i.e.
FP, SZ, LB, ...

Open in new window

Give this  try:

Const ForReading = 1
Const ForWriting = 2
Const ForAppending = 8
Const TristateTrue = -1
Const TristateFalse = 0
Const TristateUseDefault = -2

strInFile = "B:\ee\EE28990550\in.txt"
strOutFile = "B:\ee\EE28990550\out.txt"

Set objFSO = CreateObject("Scripting.FileSystemObject")

Set objInFile = objFSO.OpenTextFile(strInFile, ForReading, False, TriStateUseDefault)
Set objOutFile = objFSO.OpenTextFile(strOutFile, ForWriting, True, TriStateUseDefault)

Set objRegex = New RegExp
objRegex.Global = True

blnHeader=True
Do Until objInFile.AtEndOfStream
    strLine = objInFile.ReadLine
    If blnHeader = True Then
        blnHeader = False
        strTemp = ""
        objRegex.Pattern = " ([a-zA-Z0-9]+?)="
        Set objMatches = objRegex.Execute(strLine)
        For Each objMatch In objMatches
            If strTemp = "" Then
                strTemp = objMatch.SubMatches(0)
            Else
                strTemp = strTemp & "," & objMatch.SubMatches(0)
            End If
        Next
        objOutFile.WriteLine strTemp
    End If

    strTemp = ""
    objRegex.Pattern = "\""(.+?)\"""
    Set objMatches = objRegex.Execute(strLine)
    For Each objMatch In objMatches
        If strTemp = "" Then
            strTemp = objMatch.SubMatches(0)
        Else
            strTemp = strTemp & "," & objMatch.SubMatches(0)
        End If
    Next
    objOutFile.WriteLine strTemp
Loop

objInFile.Close
objOutFile.Close

Open in new window

~bp