mcdev

Parsing a text file into 2 listboxes based on strings

This may be simpler than I'm making it, but it is really confounding me.

I want to parse this file, a bookmark file, into two listboxes.

Basically, the attributes in the file are thus:
-----------------------------------------

<DT><H3 ADD_DATE="961102203" ID="NC:BookmarksRoot#$b742f58">Developer Information</H3>

<H3 -strings- > - denotes folder name start
'string'</H3> - denotes end of folder name


<DT><A HREF="http://www.faqs.org/rfcs/" ADD_DATE="961104168">RFC Archive</A>

<A href="string"> - denotes URL
'string'</a> - denotes end of URL description

-----------------------------------------

I want to put the URL into listbox 1 and the description into listbox 2.

What's the best way to search for <a href="string"> and </a> and work with both the URL and the description in between the <a> and </a> tags?

My head hurts.  :(
aeklund

This should do it for you:

Private Sub Command1_Click()
  Dim lfnum As Long
  lfnum = FreeFile
 
  Dim sLine As String
  Dim sPage As String
  Open "c:\test.htm" For Input As #lfnum
    Do Until EOF(lfnum)
      Line Input #lfnum, sLine
      sPage = sPage & sLine
    Loop
  Close #lfnum
 
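  Dim lpos1 As Long
  Dim lpos2 As Long
  Dim sUrl As String
 
  Do
    ' Jump to the next anchor tag (case-insensitive search).
    lpos1 = InStr(1, UCase(sPage), "<A HREF")
    If lpos1 = 0 Then Exit Do
    sPage = Mid(sPage, lpos1)
   
    ' The URL sits between the quotes that follow HREF=.
    lpos1 = InStr(1, UCase(sPage), "HREF=" & Chr(34))
    If lpos1 = 0 Then Exit Do
    lpos1 = lpos1 + 6
    lpos2 = InStr(lpos1, sPage, Chr(34))
    If lpos2 = 0 Then Exit Do
    sUrl = Mid(sPage, lpos1, lpos2 - lpos1)
   
    ' The description runs from the end of the opening tag to </A.
    lpos1 = InStr(lpos2, sPage, ">")
    If lpos1 = 0 Then Exit Do
    lpos1 = lpos1 + 1
    lpos2 = InStr(lpos1, UCase(sPage), "</A")
    If lpos2 = 0 Then Exit Do
   
    ' Add both items together so the two lists stay in step.
    List1.AddItem sUrl
    List2.AddItem Mid(sPage, lpos1, lpos2 - lpos1)
   
    ' Drop everything up to and including the closing </A> tag.
    sPage = Mid(sPage, lpos2 + 4)
  Loop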
End Sub
There may be a simpler way, but I would use a two-step approach.

Take the original string and split it with a delimiter of <A HREF=.  You can disregard the first element in the array because you know it precedes the first <A HREF.

Now you can split each of the resulting strings again with a delimiter of </A> to get what's in between the two.

Eg.

blah blah blah <A HREF="https://www.experts-exchange.com"> more
blah </A>

split once for <a HREF="
element 0 - blah blah blah
element 1 - https://www.experts-exchange.com"> more blah </A>

split element 1 for ">
new element 0 - https://www.experts-exchange.com
new element 1 - more blah </A>

you can then remove the </A> with Left(new element 1, InStrRev(new element 1, "</A>") - 1)

There's probably a much simpler function, but this is what I would do if I couldn't find it.  Of course, this is closer to pseudocode.

Cheers,

keenez
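
For reference, the two-step split described above might look roughly like this in VB6. This is only a sketch: ExtractLinks is a made-up name, sPage is assumed to hold the whole file as one string, and List1 and List2 are assumed to be the two listboxes on the form. Because the bookmark lines carry an ADD_DATE attribute after the URL, the sketch takes the URL up to the first quote rather than splitting on "> directly.

```vb
' Sketch only: ExtractLinks is a hypothetical helper name.
Private Sub ExtractLinks(ByVal sPage As String)
  Dim sParts() As String
  Dim sInner() As String
  Dim sRest As String
  Dim lQuote As Long
  Dim lClose As Long
  Dim i As Long

  ' Step 1: split on the opening tag, ignoring case.
  sParts = Split(sPage, "<A HREF=""", -1, vbTextCompare)

  ' Element 0 precedes the first link, so start at 1.
  For i = 1 To UBound(sParts)
    ' Step 2: split on the closing tag and keep what comes before it.
    sInner = Split(sParts(i), "</A>", -1, vbTextCompare)
    If UBound(sInner) >= 0 Then
      sRest = sInner(0)
      lQuote = InStr(sRest, """")             ' quote that ends the URL
      lClose = InStr(lQuote + 1, sRest, ">")  ' end of the opening tag
      If lQuote > 0 And lClose > 0 Then
        List1.AddItem Left$(sRest, lQuote - 1)   ' URL
        List2.AddItem Mid$(sRest, lClose + 1)    ' description
      End If
    End If
  Next i
End Sub
```

Calling ExtractLinks with the text read from the bookmark file should fill both listboxes in step, one row per link. Entries that lack a closing quote or a ">" are skipped rather than raising an error.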
Hi
You know that the URL address comes first, followed by the URL description if one exists.
The URL address is between the double quotes, and the URL description is between ">" and "</A>".

I have two functions, CAR and CDR (adapted from the Lisp language).

This is an example for Visual Basic (3.0 to 6.0).
On the form, draw a command button named Command1.

'CAR function: returns the part of the string before the first occurrence of caracter
Function car (ByVal Lista As String, ByVal caracter As String) As String
Lista = Trim(Lista)
If InStr(1, Lista, caracter) > 0 Then
    car = Trim(Left(Lista, InStr(1, Lista, caracter) - 1))
Else
    car = Trim(Lista)
End If
End Function

'CDR function: returns the part of the string after the first occurrence of caracter
Function cdr (ByVal Lista As String, ByVal caracter As String) As String
Lista = Trim(Lista)
If InStr(1, Lista, caracter) > 0 Then
    'Skip the whole delimiter, which may be longer than one character
    cdr = Trim(Mid(Lista, InStr(1, Lista, caracter) + Len(caracter)))
Else
    cdr = ""
End If
End Function


Private Sub Command1_Click()
Dim MiTexto As String, Aux As String
MiTexto = "<DT><A HREF=""http://www.faqs.org/rfcs/"" ADD_DATE=""961104168"">RFC Archive</A>"

MsgBox "The original text is" & vbCrLf & MiTexto

'Keep everything after A HREF=" so that MiTexto starts with the URL
MiTexto = cdr(MiTexto, "A HREF=""")
Aux = car(MiTexto, """")
MsgBox "URL Address:" & vbCrLf & Aux
'You can add it to the list box with List1.AddItem Aux

Aux = cdr(MiTexto, ">")
Aux = car(Aux, "</A>")
MsgBox "URL Description:" & vbCrLf & Aux
'You can add it to the list box with List2.AddItem Aux

End Sub

Good Luck
Renato
mcdev,

I don't know the best way, but here is one way I know (I think).
Store the line in a string variable, say strTag, that is:
strTag = <DT><A HREF="http://www.faqs.org/rfcs/" ADD_DATE="961104168">RFC Archive</A>

Modify the code to meet your needs.

Private Sub cmdAddtoList_Click()
Dim ArrTag() As String
Dim strHref As String
Dim strDesc As String
Dim strTemp As String
Dim i As Long

'The URL is the quoted string that follows HREF=
ArrTag = Split(strTag, Chr(34))

For i = 0 To UBound(ArrTag) - 1
    strTemp = Replace(ArrTag(i), " ", "")
    If Len(strTemp) >= 5 Then
       If UCase(Right(strTemp, 5)) = "HREF=" Then
          strHref = ArrTag(i + 1)
          Exit For
       End If
    End If
Next

'The description is the text in front of the closing </A> tag
ArrTag = Split(strTag, ">")

For i = 0 To UBound(ArrTag)
    strTemp = Replace(ArrTag(i), " ", "")
    If Len(strTemp) >= 3 Then
       If UCase(Right(strTemp, 3)) = "</A" Then
             strDesc = Left(ArrTag(i), InStr(ArrTag(i), "<") - 1)
             Exit For
       End If
    End If
Next
listbox1.AddItem strHref
listbox2.AddItem strDesc
End Sub
I always do things the hard way, but I'd do something similar to the last post, except using left, right, and mid strings.
' <DT><A HREF="http://www.faqs.org/rfcs/" ADD_DATE="961104168">RFC Archive</A>

listbox1.clear
listbox2.clear
open filename for input as #1
while not eof(1)
line input #1,a$ ' read the whole line; input # would split it at commas
i=instr(a$,"<A HREF=")
if i>0 then ' skip lines without a link
a$=mid$(a$,i+9) ' get start of url, just past <A HREF="
i=instr(a$,chr$(34)) ' closing quote of the url
strHref=left$(a$,i-1)  ' url

i=instr(a$,">") ' end of the opening tag
a$=mid$(a$,i+1)
i=instr(a$,"<") ' start of </A>
strDesc=left$(a$,i-1) ' desc

listbox1.additem strHref
listbox2.additem strDesc
end if
wend
close 1

' didn't test this, but that's how I usually do it ,.. inching my way along..

-- David







I like it simple, with syntax error checking (just in case):

Open pathname For Input As 1
 strA = Input(FileLen(pathname), 1)
Close 1
 
c = InStr(1, UCase(strA), "<A HREF")
While c > 0
 strA = Mid(strA, c + 1)        ' move past the "<" of this tag
 b = InStr(strA, Chr(34))       ' opening quote of the URL
 If b > 0 Then
  strA = Mid(strA, b + 1)
  b = InStr(strA, Chr(34))      ' closing quote of the URL
  If b > 0 Then
   strURL = Mid(strA, 1, b - 1)
   b = InStr(strA, ">")         ' end of the opening tag
   strA = Mid(strA, b + 1)
   b = InStr(strA, "<")         ' start of the closing tag
   If b > 0 Then
    strDESCR = Mid(strA, 1, b - 1)
    List1.AddItem strURL
    List2.AddItem strDESCR
   End If
  End If
 End If
 c = InStr(1, UCase(strA), "<A HREF")
Wend
Okay, I'm going to throw my hat into the ring. To my mind, since this is an HTML document it is just begging to be parsed using the DOM, which does all the hard work for you, and provides a rather more elegant and robust solution than instr and split. McDev: I'm not sure what you need from the 'folder name' part of your question. I think everyone has ignored it, so I will too unless you come back to us.
Put two listboxes on a form (List1 and List2) and add a reference to Microsoft Internet Controls. Paste the following code into the form.

Kindest regards,
Rhaedes


Dim IE As SHDocVw.InternetExplorer

Private Sub Form_Load()
Set IE = New InternetExplorer 'create instance of IExplorer
IE.Navigate2 ("c:\WHEREVER\myFile.htm") 'load file

Do While IE.readyState <> READYSTATE_COMPLETE 'wait until fully loaded
 DoEvents
Loop

With IE.document.All.tags("A") 'get anchor collection
 For n = 0 To .length - 1
  List1.AddItem .Item(n).getAttribute("href") 'extract href
  List2.AddItem .Item(n).innerText 'extract description
 Next n
End With

IE.Quit 'close the hidden browser instance
Set IE = Nothing 'do away with IExplorer
End Sub
dear mcdev,

Everyone here has overlooked some points.

1)  If there are additional spaces between "A" and "HREF", their programs may not work as intended.

Their code will work if all your tags have exactly one space between "A" and "HREF", e.g.:

<A HREF="http://www.faqs.org/rfcs/" ADD_DATE="961104168">RFC Archive</A>

but if there is more than one space, there will be a problem, e.g.:
<A   HREF="http://www.faqs.org/rfcs/" ADD_DATE="961104168">RFC Archive</A>

2) Their conditions are case sensitive; that is, if you type any of the tags in lower case, e.g.:
<a HREF="http://www.faqs.org/rfcs/" ADD_DATE="961104168">RFC Archive</a>

Or

<a hReF="http://www.faqs.org/rfcs/" ADD_DATE="961104168">RFC Archive</a>
then their code will not find the URL at all.
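
To illustrate the two points above, a search that tolerates both mixed case and extra spaces could be sketched like this in VB6. FindUrlStart is a hypothetical helper name. The sketch assumes lStart is at least 1, that the URL is always wrapped in double quotes, and that no attribute value contains a ">" character.

```vb
' Sketch only: FindUrlStart is a hypothetical helper name.
' Returns the position of the first character of the URL in the next
' anchor tag at or after lStart, or 0 if no such tag is found.
Private Function FindUrlStart(ByVal sPage As String, ByVal lStart As Long) As Long
  Dim lTag As Long
  Dim lEnd As Long
  Dim lHref As Long
  Dim lPos As Long
  Dim sWhite As String

  sWhite = " " & vbTab & vbCr & vbLf

  ' vbTextCompare makes the search case-insensitive.
  lTag = InStr(lStart, sPage, "<A", vbTextCompare)
  Do While lTag > 0
    lEnd = InStr(lTag, sPage, ">")
    If lEnd = 0 Then Exit Do

    ' Only accept "<A" followed by white space, so <ADDRESS> is ignored.
    If InStr(sWhite, Mid$(sPage, lTag + 2, 1)) > 0 Then
      lHref = InStr(lTag, sPage, "HREF", vbTextCompare)
      If lHref > 0 And lHref < lEnd Then
        ' Skip any spaces, tabs and the "=" between HREF and the quote.
        lPos = lHref + 4
        Do While lPos < lEnd And InStr(" =" & vbTab, Mid$(sPage, lPos, 1)) > 0
          lPos = lPos + 1
        Loop
        If Mid$(sPage, lPos, 1) = """" Then
          FindUrlStart = lPos + 1
          Exit Function
        End If
      End If
    End If

    lTag = InStr(lTag + 2, sPage, "<A", vbTextCompare)
  Loop
End Function
```

Given the returned position, the URL runs up to the next double quote, and the search for the following link can resume from there.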

*******************************************************


There is one problem in the code I gave:

If there is white space between "H" and "REF", or between any other characters in "HREF" (like "H REF" or "HR ef"), the output will still be the URL. To overcome that you have to do an additional check. My modified code is below. As before,
strTag = <DT><A HREF="http://www.faqs.org/rfcs/" ADD_DATE="961104168">RFC Archive</A>

Modify the code to meet your needs.

Private sub cmdAddtoList_Click()
dim ArrTag() as string
dim strHref as string
dim strDesc as string

ArrTag= split(strTag,chr(34))

for i = 0 to ubound(ArrTag)
   strHref = replace(ArrTag(i)," ","")
   if len (strHref) > 5 then
      if UCASE(right(strHref,5)) = "HREF=" Then
         if instr(ucase(ArrTag(i),"HREF") > 0 Then  
            if i < ubound(ArrTag) Then
               strHref = ArrTag(i + 1)
               Exit for
            end if
         end if
   End If
Next

ArrTag= split(strTag,">")

For i = 0 To UBound(ArrTag)
   strDesc = Replace(ArrTag(i), " ", "")
   If Len(strDesc) > 3 Then
      If UCase(Right(strDesc, 3)) = "</A" Then
            strDesc = Left(ArrTag(i), InStr(ArrTag(i), "<") - 1)
            Exit For
      End If
   End If
Next
listbox1.additem strHref
listbox2.additem strDesc
End sub
Vbbuff: You are NOT correct when you say 'All of them here have overlooked some points'! The solution using the DOM by definition works with all good HTML documents, whether or not they contain tags and elements in uppercase, lower case, with extra whitespace, etc, and an endless number of possibilities that your code does not contemplate (tabs, newline characters, nobreak spaces etc. etc.). Also note (no disrespect) that your code contains syntax errors (you appear not to have closed all brackets properly, for example).
Mcdev: Use the code which works best for you or with which you are most comfortable: since your strings appear to be simple, a solution with 'Instr' or similar will work just fine. But in all honesty, the Document Object Model exists precisely so that you can parse HTML simply and robustly with a few lines of code.

Kindest regards,
Rhaedes
dear rhaedes,
I was not referring to you; I stand corrected. I was just pointing out (with no disrespect either) some points that were overlooked but are important. After all, this site is all about providing and gaining knowledge, ain't it?
Absolutely. And of course you are 100% correct in pointing out the failings of the other methods.
Respect and regards,
Rhaedes
mcdev:
This old question needs to be finalized -- accept an answer, split points, or get a refund.  For information on your options, please click here-> http:/help/closing.jsp#1 
Experts: Post your closing recommendations!  Who deserves points here?
mcdev, an EE Moderator will handle this for you.
Moderator, my recommended disposition is:

    Refund points and save as a 0-pt PAQ.

DanRollins -- EE database cleanup volunteer