brian_leighty

asked on

HELP!!! CRITICAL!!! I'm trying to extract a URL from a text file...then write it to a text file with just the URL.

Using System.IO or something similar, I need to read a text file.  This can be done until end of file or with ReadLine, but I need to extract the URL in the file... It's in the form of "rcrast@http://ww3.yahoo.com/main/Prh13b/ps20060325-178782/index.html".

I need to extract this and then write it to another text file line by line...



The file also contains many different characters that should be excluded from the extraction and they look like this:  

À‰e ¿  @F€p Ê  @:Ÿ8 è  À.y^   tœ© ! €åÙí ^   [I Š   ]Á‹   €ßÿ  Ä  ÀULC        À=}  r €¶8» ‘  ¥… Ì  €#&4 ð  Àž9   '˜+ X   ?†\ K  l¡a ` @F1á p @hÛË  €Fm È  @’K‡  €p+·€ @¶J"€í  Ëuó€ €Úù– “  • q   ºì }  €¶çÊ   %ã @ö”« š € , ˆ €dè;€¨ @? T  €û”« ®  €£Î÷ Á   ÁCË ì  @z÷       ý
deighton

Will the URL always have the string http:// in it?  Then what will indicate the end of the URL in the file?  Can you explain in general terms how the URLs can be recognised?  Will the files have multiple URLs in them, or just one?
brian_leighty

ASKER

I just need to pull a bunch of websites out of an index file that Internet Explorer has made... it's to log all the websites that my users go to... the http:// doesn't matter as long as I get all the web pages.
I was thinking you could use the http to spot the URLs when parsing the file. Is this a system file, e.g. index.dat?   I've had problems manipulating those files in the past; they are special files, it seems.
No, it's fine... what about using the "rcrast@" to spot them? But HTTP is fine.


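        ' Work on a copy of the file rather than on the live index.dat.
        ' File.Delete and File.Copy are shared methods, so no IO.File variable is needed.
        Try
            IO.File.Delete("c:\temp.dat")
        Catch
            ' Ignore: the temp copy may not exist yet.
        End Try
        IO.File.Copy("c:\documents and settings\andyd\cookies\index.dat", "c:\temp.dat")

        Dim fs As New IO.StreamReader("c:\temp.dat")
        Dim fw As New IO.StreamWriter("c:\output.txt")

        ' Lower-case everything so the marker search is case-insensitive.
        ' Note that this also lower-cases the extracted URLs.
        Dim s As String = fs.ReadToEnd
        s = s.ToLower

        Dim i As Integer = s.IndexOf("cookie:")

        Do Until i < 0

            ' Skip past the "cookie:user@" prefix.
            s = s.Substring(i)
            i = s.IndexOf("@")
            If i < 0 Then Exit Do   ' no "@" left, so there is nothing more to extract
            s = s.Substring(i + 1)

            ' The entry runs up to the next null character, or to the end of the data.
            Dim j As Integer = s.IndexOf(Chr(0))
            If j < 0 Then j = s.Length

            fw.WriteLine(s.Substring(0, j))

            i = s.IndexOf("cookie:")
        Loop

        fs.Close()
        fw.Close()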



If that doesn't work, then in the two places where I search for "cookie:", search for "rcrast" instead.
That's so perfect, but what about a checkbox or something to extract by the "http://" and something to end the URL?
What do you mean by 'something to end the URL'?
I don't think I mean anything by it, because it will be the same as with "cookie:".

Something like ".html".
You don't have to worry about that. I need to try "http://".
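
For anyone wanting to try the "http://" idea discussed above, here is a minimal sketch. It is not the accepted solution, just one way it might look. The names ExtractUrls and IsUrlChar are made up for this example, and the rule for where a URL ends (the first control character, space, double quote, or non-ASCII character) is an assumption, since the thread never settles on an end marker.

```vbnet
Imports System
Imports System.Collections.Generic

Module UrlExtractor

    ' Hypothetical helper: returns every run of text that starts with "http://"
    ' and continues until the first character that IsUrlChar rejects.
    Function ExtractUrls(ByVal text As String) As List(Of String)
        Dim urls As New List(Of String)
        Dim start As Integer = text.IndexOf("http://", StringComparison.OrdinalIgnoreCase)

        Do While start >= 0
            ' Walk forward until a stop character or the end of the text.
            Dim finish As Integer = start
            Do While finish < text.Length AndAlso IsUrlChar(text(finish))
                finish += 1
            Loop
            urls.Add(text.Substring(start, finish - start))

            ' Resume the search after the URL that was just collected.
            start = text.IndexOf("http://", finish, StringComparison.OrdinalIgnoreCase)
        Loop

        Return urls
    End Function

    ' Assumed stop rule: only printable ASCII other than space and the double quote
    ' counts as part of a URL, so null bytes and binary junk end it.
    Private Function IsUrlChar(ByVal c As Char) As Boolean
        Dim code As Integer = AscW(c)
        Return code > 32 AndAlso code < 127 AndAlso c <> """"c
    End Function

    Sub Main()
        ' In-memory sample shaped like the data in the question: junk, a null
        ' character, the user prefix, the URL, then another null character.
        Dim sample As String = "junk" & Chr(0) & _
            "rcrast@http://ww3.yahoo.com/main/Prh13b/ps20060325-178782/index.html" & _
            Chr(0) & "more junk"

        For Each url As String In ExtractUrls(sample)
            Console.WriteLine(url)
        Next
    End Sub

End Module
```

To run it against the copied file instead of the sample string, the text could be loaded with IO.File.ReadAllText("c:\temp.dat", System.Text.Encoding.GetEncoding("iso-8859-1")) and the result written with IO.File.WriteAllLines. Latin-1 is suggested here only because it maps every byte to exactly one character. Unlike the earlier code, this sketch does not lower-case the text, so the URLs keep their original case.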
ASKER CERTIFIED SOLUTION
deighton