mcdev
asked on
Parsing a text file into 2 listboxes based on strings
This may be simpler than I'm making it, but it is really confounding me.
I want to parse this file, a bookmark file, into two listboxes.
Basically, the attributes in the file are thus:
--------------------------
<DT><H3 ADD_DATE="961102203" ID="NC:BookmarksRoot#$b742f58">Developer Information</H3>
<H3 -strings- > - denotes folder name start
'string'</H3> - denotes end of folder name
<DT><A HREF="http://www.faqs.org/rfcs/" ADD_DATE="961104168">RFC Archive</A>
<A href="string"> - denotes URL
'string'</a> - denotes end of URL description
--------------------------
I want to put the URL into listbox one, and the description into listbox2.
What's the best way to search for <a href="string"> and </a> and work with both the URL and the description in between the <a> </a> tags?
My head hurts. :(
There may be a simpler way, but I would use a two-step approach.
Take the original string and split it with a delimiter of <A HREF=". You can disregard the first element of the array, because you know it is whatever precedes the <A HREF.
Now split the remaining element again, this time with a delimiter of ">, to separate the URL from what follows, and then strip the trailing </A> to get the description in between.
Eg.
blah blah blah <A HREF="https://www.experts-exchange.com"> more blah </A>
split once for <A HREF="
element 0 - blah blah blah
element 1 - https://www.experts-exchange.com"> more blah </A>
split element 1 for ">
new element 0 - https://www.experts-exchange.com
new element 1 - more blah </A>
You can then remove the </A> with Left(new element 1, InStrRev(new element 1, "</A>") - 1).
There's probably a much simpler function, but this is what I would do if I couldn't find it. Of course, this is closer to pseudocode.
Cheers,
keenez
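For reference, the pseudocode above might look roughly like this in VB6. This is only a sketch: ExtractLink, sUrl and sDesc are made-up names, it assumes a double-quoted HREF immediately after the <A, and Split needs VB6 (it is not available in VB5 or earlier).

```vb
' Sketch only: ExtractLink is a hypothetical helper, not from the thread.
' Returns True and fills sUrl/sDesc when sLine holds <A HREF="...">...</A>.
Function ExtractLink(ByVal sLine As String, sUrl As String, sDesc As String) As Boolean
    Dim parts() As String
    Dim rest() As String
    Dim p As Long

    ' Step 1: split on the start of the anchor (text compare ignores case).
    parts = Split(sLine, "<A HREF=""", -1, vbTextCompare)
    If UBound(parts) < 1 Then Exit Function   ' no link on this line

    ' parts(1) looks like: http://host/" ADD_DATE="...">Description</A>
    ' Step 2: the URL ends at the next double quote.
    rest = Split(parts(1), """", 2)
    If UBound(rest) < 1 Then Exit Function
    sUrl = rest(0)

    ' The description sits between the end of the opening tag and </A>.
    p = InStr(rest(1), ">")
    If p = 0 Then Exit Function
    sDesc = Mid$(rest(1), p + 1)
    p = InStr(1, sDesc, "</A>", vbTextCompare)
    If p > 0 Then sDesc = Left$(sDesc, p - 1)

    ExtractLink = True
End Function
```

Called once per line read from the file, it would be used as: If ExtractLink(sLine, sUrl, sDesc) Then List1.AddItem sUrl: List2.AddItem sDesc. It only picks up the first link on a line.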
Hi
You know that the URL address comes first, followed by the URL description, if there is one.
The URL address is between the double quotes, and the URL description is between ">" and "</A>".
I have two functions (adapted from the Lisp language), CAR and CDR.
This is an example for Visual Basic (3.0 to 6.0).
On the form, draw a command button named Command1.
'CAR function: retrieves the string up to the first occurrence of caracter
Function car (ByVal Lista As String, ByVal caracter As String) As String
Lista = Trim(Lista)
If InStr(1, Lista, caracter) > 0 Then
car = Trim(Left(Lista, InStr(1, Lista, caracter) - 1))
Else
car = Trim(Lista)
End If
End Function
'CDR function: retrieves the string after the first occurrence of caracter
Function cdr (ByVal Lista As String, ByVal caracter As String) As String
Lista = Trim(Lista)
If InStr(1, Lista, caracter) > 0 Then
'Skip the whole delimiter, not just its first character
cdr = Trim(Mid(Lista, InStr(1, Lista, caracter) + Len(caracter)))
Else
cdr = ""
End If
End Function
Private Sub Command1_Click()
Dim MiTexto As String, Aux As String
MiTexto = "<DT><A HREF=""http://www.faqs.org/rfcs/"" ADD_DATE=""961104168"">RFC Archive</A>"
MsgBox "The original is" & vbCrLf & MiTexto
'Keep everything after A HREF=" so the text starts at the URL
MiTexto = cdr(MiTexto, "A HREF=""")
Aux = car(MiTexto, """")
MsgBox "URL Address:" & vbCrLf & Aux
'You can add it to a list box: List1.AddItem Aux
Aux = cdr(MiTexto, ">")
Aux = car(Aux, "</A>")
MsgBox "URL Description:" & vbCrLf & Aux
'You can add it to a list box: List2.AddItem Aux
End Sub
Good Luck
Renato
mcdev,
I don't know about the best way.
Here is one way I know (I think):
Store the line in a string variable, say strTag,
that is, strTag = <DT><A HREF="http://www.faqs.org/rfcs/" ADD_DATE="961104168">RFC Archive</A>
Modify the code to meet your needs:
Private sub cmdAddtoList_Click()
dim ArrTag() as string
dim strHref as string
dim strDesc as string
ArrTag= split(strTag,chr(34))
for i = 0 to ubound(ArrTag)
strHref = replace(ArrTag(i)," ","")
if len (strHref) > 5 then
if UCASE(right(strHref,5)) = "HREF=" Then
if i < ubound(ArrTag) Then _
strHref = ArrTag(i + 1)
Exit for
end if
End If
Next
ArrTag= split(strTag,">")
For i = 0 To UBound(ArrTag)
strDesc = Replace(ArrTag(i), " ", "")
If Len(strDesc) > 3 Then
If UCase(Right(strDesc, 3)) = "</A" Then
strDesc = Left(ArrTag(i), InStr(ArrTag(i), "<") - 1)
Exit For
End If
End If
Next
listbox1.additem strHref
listbox2.additem strDesc
End sub
I always do things the hard way, but I'd do something similar to the last post, except using Left, Right, and Mid on the strings:
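' <DT><A HREF="http://www.faqs.org/rfcs/" ADD_DATE="961104168">RFC Archive</A>
listbox1.Clear
listbox2.Clear
Open filename For Input As #1
While Not EOF(1)
Line Input #1, a$ ' read the whole line; Input # would stop at commas and quotes
i = InStr(UCase$(a$), "<A HREF=")
If i > 0 Then ' skip lines that have no link
a$ = Mid$(a$, i + 9) ' get start of url, just past <A HREF="
i = InStr(a$, Chr$(34)) ' closing quote of the url
strHref = Left$(a$, i - 1) ' url
a$ = Mid$(a$, i + 1)
i = InStr(a$, ">") ' end of the opening tag
a$ = Mid$(a$, i + 1)
i = InStr(a$, "<") ' start of </A>
strDesc = Left$(a$, i - 1) ' desc
listbox1.AddItem strHref
listbox2.AddItem strDesc
End If
Wend
Close #1
' didn't test this, but that's how I usually do it... inching my way along.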
-- David
I like it simple with syntax error checking (Just in case):
Open pathname For Input As 1
strA = Input(FileLen(pathname), 1)
Close 1
c = InStr(1, UCase(strA), "<A HREF")
While c > 0
strA = Mid(strA, c + 1)
c = InStr(1, UCase(strA), "<A HREF")
b = InStr(strA, Chr(34))
If b > 0 Then
strA = Mid(strA, b + 1)
b = InStr(strA, Chr(34))
strURL = Mid(strA, 1, b - 1)
b = InStr(strA, ">")
strA = Mid(strA, b + 1)
b = InStr(strA, "<")
If b > 0 Then
strDESCR = Mid(strA, 1, b - 1)
List1.AddItem strURL
List2.AddItem strDESCR
End If
End If
c = InStr(1, UCase(strA), "<A HREF")
Wend
Okay, I'm going to throw my hat into the ring. To my mind, since this is an HTML document it is just begging to be parsed using the DOM, which does all the hard work for you, and provides a rather more elegant and robust solution than instr and split. McDev: I'm not sure what you need from the 'folder name' part of your question. I think everyone has ignored it, so I will too unless you come back to us.
Put two listboxes on a form (List1 and List2) and add a reference to Microsoft Internet Controls. Paste the following code into the form.
Kindest regards,
Rhaedes
Dim IE As SHDocVw.InternetExplorer
Private Sub Form_Load()
Set IE = New InternetExplorer 'create instance of IExplorer
IE.Navigate2 ("c:\WHEREVER\myFile.htm") 'load file
Do While IE.readyState <> READYSTATE_COMPLETE 'wait until fully loaded
DoEvents
Loop
With IE.document.All.tags("A") 'get anchor collection
For n = 0 To .length - 1
List1.AddItem .Item(n).getAttribute("href") 'extract href (URL goes in the first listbox)
List2.AddItem .Item(n).innerText 'extract description
Next n
End With
IE.Quit 'close the hidden IExplorer instance
Set IE = Nothing 'do away with IExplorer
End Sub
dear mcdev,
Everyone here has overlooked some points.
1) If there are additional spaces between "A" and "HREF", their programs may not work as intended.
Their code will work if
all your tags have exactly one space between "A" and "HREF", for example:
<A HREF="http://www.faqs.org/rfcs/" ADD_DATE="961104168">RFC Archive</A>
but if there is more than one space, there will be a problem, for example:
<A   HREF="http://www.faqs.org/rfcs/" ADD_DATE="961104168">RFC Archive</A>
2) Their conditions are case sensitive, that is, if you type any of the tags in lower or mixed case, for example:
<a HREF="http://www.faqs.org/rfcs/" ADD_DATE="961104168">RFC Archive</a>
or
<a hReF="http://www.faqs.org/rfcs/" ADD_DATE="961104168">RFC Archive</a>
then their code will not find the URL at all.
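One way to sidestep both points with plain string functions is to search for HREF on its own with a text comparison and then jump to the quotes. The following is a minimal sketch, not code from any of the posts: FindHrefValue is a made-up name, and it assumes the attribute value is wrapped in double quotes and that the word HREF does not appear earlier on the line.

```vb
' Sketch only: FindHrefValue is a hypothetical helper, not from the thread.
' Returns the quoted value of the first HREF attribute in sTag, or "" if none.
Function FindHrefValue(ByVal sTag As String) As String
    Dim p As Long
    Dim q As Long

    ' vbTextCompare makes the search case-insensitive (href, HREF, hReF).
    ' Searching for HREF alone ignores however many spaces follow the A.
    p = InStr(1, sTag, "HREF", vbTextCompare)
    If p = 0 Then Exit Function

    ' Jump to the opening quote, so spaces around the "=" are tolerated too.
    p = InStr(p, sTag, """")
    If p = 0 Then Exit Function

    ' The value ends at the next double quote.
    q = InStr(p + 1, sTag, """")
    If q = 0 Then Exit Function

    FindHrefValue = Mid$(sTag, p + 1, q - p - 1)
End Function
```

Single-quoted or unquoted attribute values would still slip through, which is the kind of case a real HTML parser handles for you.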
**************************
There is one problem in the code I gave:
Because it strips every space before comparing, one or more spaces between "H" and "REF", or between any characters of "HREF" (like "H REF" or "HR ef"), will still be matched, and the output will be the URL. To overcome that you will have to do an additional check. My modified code is:
that is, strTag = <DT><A HREF="http://www.faqs.org/rfcs/" ADD_DATE="961104168">RFC Archive</A>
Modify the code to meet your needs:
Private sub cmdAddtoList_Click()
dim ArrTag() as string
dim strHref as string
dim strDesc as string
ArrTag= split(strTag,chr(34))
for i = 0 to ubound(ArrTag)
strHref = replace(ArrTag(i)," ","")
if len (strHref) > 5 then
if UCASE(right(strHref,5)) = "HREF=" Then
if instr(ucase(ArrTag(i),"HREF") > 0 Then
if i < ubound(ArrTag) Then
strHref = ArrTag(i + 1)
Exit for
end if
end if
End If
Next
ArrTag= split(strTag,">")
For i = 0 To UBound(ArrTag)
strDesc = Replace(ArrTag(i), " ", "")
If Len(strDesc) > 3 Then
If UCase(Right(strDesc, 3)) = "</A" Then
strDesc = Left(ArrTag(i), InStr(ArrTag(i), "<") - 1)
Exit For
End If
End If
Next
listbox1.additem strHref
listbox2.additem strDesc
End sub
Vbbuff: You are NOT correct when you say 'All of them here have overlooked some points'! The solution using the DOM by definition works with all good HTML documents, whether they contain tags and elements in uppercase or lowercase, with extra whitespace, and so on, and it covers an endless number of possibilities that your code does not contemplate (tabs, newline characters, non-breaking spaces, etc.). Also note (no disrespect) that your code contains syntax errors (you appear not to have closed all brackets properly, for example).
Mcdev: Use the code which works best for you or with which you are most comfortable: since your strings appear to be simple, a solution with 'Instr' or similar will work just fine. But in all honesty, the Document Object Model exists precisely so that you can parse HTML simply and robustly with a few lines of code.
Kindest regards,
Rhaedes
dear rhaedes,
I was not referring to you; I stand corrected. I was just pointing out (with no disrespect either) some points that were overlooked but are important. After all, this site is all about providing and gaining knowledge, ain't it?
Absolutely. And of course you are 100% correct in pointing out the failings of the other methods.
Respect and regards,
Rhaedes
mcdev:
This old question needs to be finalized -- accept an answer, split points, or get a refund. For information on your options, please click here-> http:/help/closing.jsp#1
Experts: Post your closing recommendations! Who deserves points here?
mcdev, an EE Moderator will handle this for you.
Moderator, my recommended disposition is:
Refund points and save as a 0-pt PAQ.
DanRollins -- EE database cleanup volunteer
ASKER CERTIFIED SOLUTION
Private Sub Command1_Click()
Dim lfnum As Long
lfnum = FreeFile
Dim sLine As String
Dim sPage As String
Open "c:\test.htm" For Input As #lfnum
Do Until EOF(lfnum)
Line Input #lfnum, sLine
sPage = sPage & sLine
Loop
Close #lfnum
Dim lpos1 As Long
Dim lpos2 As Long
Do
lpos1 = InStr(1, UCase(sPage), "<A HREF")
If lpos1 = 0 Then Exit Do
sPage = Right(sPage, Len(sPage) - lpos1 + 1)
lpos1 = InStr(1, UCase(sPage), "HREF=" & Chr(34))
lpos1 = lpos1 + 6
lpos2 = InStr(lpos1 + 1, sPage, Chr(34))
List1.AddItem Mid(sPage, lpos1, lpos2 - lpos1)
lpos1 = InStr(1, sPage, ">") + 1
lpos2 = InStr(lpos1 + 1, UCase(sPage), "</A")
List2.AddItem Mid(sPage, lpos1, lpos2 - lpos1)
sPage = Right(sPage, Len(sPage) - lpos2 - 3)
Loop
End Sub
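None of the answers deals with the folder-name lines (<H3 ...>name</H3>) described in the question. If they are needed, the same InStr technique could be applied to each line as it is read. This is a minimal sketch under assumptions: FolderName is a made-up name, the tag and its closing </H3> are assumed to sit on one line, and no attribute value is assumed to contain a ">".

```vb
' Sketch only: FolderName is a hypothetical helper, not from the thread.
' Returns the text between <H3 ...> and </H3>, or "" when the line has no folder.
Function FolderName(ByVal sLine As String) As String
    Dim p As Long
    Dim q As Long

    ' Find the opening tag, ignoring case.
    p = InStr(1, sLine, "<H3", vbTextCompare)
    If p = 0 Then Exit Function

    ' Skip the attributes (ADD_DATE, ID, ...) up to the end of the opening tag.
    p = InStr(p, sLine, ">")
    If p = 0 Then Exit Function

    ' The name runs up to the closing tag.
    q = InStr(p + 1, sLine, "</H3", vbTextCompare)
    If q = 0 Then Exit Function

    FolderName = Mid$(sLine, p + 1, q - p - 1)
End Function
```

Since the solution above joins all lines into sPage before searching, this would have to be called on sLine inside the Line Input loop, before the line is appended.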