Solved

Writing data to text files from webpages

Posted on 2012-04-01
Last Modified: 2012-04-03
Have written some coding to create text files from webpages created in Dreamweaver. It is attached.
It all works well apart from one thing:

I have the webpages in a folder C:/Web Design/Faith various/new website/mag articles

This contains subfolders:

Jan02
Mar02
May02
Jul02
Sep02
Nov02
Jan03
Mar03
May03
Jul03
Sep03
Nov03
Jan04
and on up to Mar12

The present coding iterates through ALL the folders and writes the data to a file 0.txt. It then does this again (writing all the data to 1.txt) and repeats up to about 60.txt, all files being identical. What I want to do is to write all the 02 files to 02.txt, then the 03 files to 03.txt and so on.

The Sub Button1_Click is called from a button on a form. This calls displayFolder. Within displayFolder's coding the sub WriteToTextFile is called, which does the actual writing (you can ignore this; the coding is rather convoluted but it works!).

LoopThruDirectory does the looping, closes fsoStream and increments Filenumber (which in turn changes the output file name from, e.g., 0.txt to 1.txt), but by that time all the data has been written to 0.txt.

I am sure it is a silly fault in the coding but I cannot find it.  Would be grateful for anyone's help.
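
The symptom described above (every numbered file receiving the full data set) is what happens when a routine that walks the whole tree is called once per subfolder. A minimal sketch of that pattern, using an in-memory list of folder names instead of the real directories. The module and variable names are illustrative assumptions, not the attached code:

```vb
Module RepeatDemo
    Sub Main()
        ' Stand-in for the subfolders of "mag articles" (assumed sample names).
        Dim subFolders() As String = {"Jan02", "Mar02", "May02"}
        Dim fileNumber As Integer = 0

        For Each current As String In subFolders
            ' The inner pass visits every subfolder again, whichever one
            ' "current" happens to be, so each output file gets everything.
            For Each visited As String In subFolders
                Console.WriteLine(visited & " -> " & CStr(fileNumber) & ".txt")
            Next
            fileNumber += 1
        Next
    End Sub
End Module
```

Each of the three output names is paired with all three folders, which matches the "all files being identical" behaviour. Processing only `current` inside the outer loop would give one folder per file instead.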
Export-coding
Question by:bogorman
27 Comments
 
Expert Comment

by:sedgwick
why do u increment the Filenumber after looping all the folders (line 102)?
i didn't quite understand the requirements, each subfolder contains txt files u wish to convert to html?
is there a correlation between the mag issue year and the txt file name? (i.e.  Jan02 -> 02.txt)
can u elaborate please?

cheers

Author Comment

by:bogorman
Hi sedgwick,

Sorry I did not explain properly.

Each folder contains html files. I want to read some of the data from these (e.g. title, body and other fields) and write them to a text file for importing into a drupal site. So the files are like this:

Jan02 folder:    First article.html. Second article.html......
Mar02 folder:  First article.html. Second article.html......

Yes, at present, all the coding does, or should do, is to create a text file for each folder (0.txt, 1.txt, etc) and write the data to each of these. Unfortunately all the data from all the files is written to 0.txt and then repeated for 1.txt, etc., so I end up with about 60 files all containing all the data. (there are six folders for each year).

I would like to write the data from all the html files in the 02 folders to 02.txt, all the html files in the 03 folder to 03.txt, etc.

Does that help you?

Brian

Expert Comment

by:sedgwick
try now:

Option Explicit On
Imports System
Imports System.IO
Imports System.Text.RegularExpressions
Imports System.Text


Public Class Form1
    Public dirFiles As Scripting.Folder
    Public filenumber As Integer
    Public fso
    Public fsoStream
    Public BodyLength


    Private Sub dirFiles_Change()
        Dim fso
        fso = CreateObject("Scripting.FileSystemObject")
        Dim startFolder As Scripting.Folder
        Dim numFolders As Integer
        Dim strStart As String, strSummary As String
        strStart = dirFiles.Path
        ' Avoid root directories as it's likely to run out of memory
        If Len(dirFiles.Path) > 3 Then
            txtDisplay.Text = ""
            startFolder = fso.GetFolder(strStart)
            numFolders = displayFolder(strStart, True)
            strSummary = numFolders & " Folders, "
            strSummary = strSummary & "Total size: " & Format(startFolder.Size, "#,##0")
            txtDisplay.Text = txtDisplay.Text & vbCrLf & strSummary
        Else
            txtDisplay.Text = ""
        End If
        startFolder = Nothing
        fso = Nothing
    End Sub
    Private Function displayFolder(ByVal folderName As String, ByVal firstTime As Boolean) As Integer
        Dim fso
        fso = CreateObject("Scripting.FileSystemObject")

        Dim rootFolder As Scripting.Folder, currentFolder As Scripting.Folder, subFolder As Scripting.Folder
        Static folderCount As Integer
        rootFolder = fso.GetFolder(folderName)
        If firstTime = True Then
            folderCount = 0
        End If
        txtDisplay.Text = txtDisplay.Text & rootFolder.Path & vbCrLf
        folderCount = folderCount + 1
        For Each currentFolder In rootFolder.SubFolders
            txtDisplay.Text = txtDisplay.Text & currentFolder.Path & vbCrLf
            'iterate through files in the folder
            'and write HTML coding in each to a text file
            
            LoopThruDirectory(rootFolder.Path)

            folderCount = folderCount + 1
            For Each subFolder In currentFolder.SubFolders
                folderCount = folderCount + displayFolder(subFolder.Path, True)
            Next subFolder
        Next currentFolder
        displayFolder = folderCount
        rootFolder = Nothing
        fso = Nothing
    End Function

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        displayFolder("C:/Web Design/Faith various/new website/mag articles", False)
        filenumber = 1
    End Sub

    Sub LoopThruDirectory(ByVal foldername As String)

        Dim fso
        Dim filename As String
        Dim dir As String
        Dim files As String()
        Dim file As String
        fso = CreateObject("Scripting.FileSystemObject")
             
        Dim dirs() As String = Directory.GetDirectories(foldername)
        For Each dir In dirs
            files = Directory.GetFiles(dir)
            ' Process the list of files found in the directory.
            For Each file In files
                filenumber = filenumber + 1
                WriteToTextFile(file, filenumber)
            Next
        Next

        fso = Nothing

    End Sub

    Public Function GetPageHTML( _
                ByVal URL As String) As String
        ' Retrieves the HTML from the specified URL
        Dim objWC As New System.Net.WebClient()
        Return New System.Text.UTF8Encoding().GetString( _
           objWC.DownloadData(URL))
    End Function

    Public Sub WriteToTextFile(ByVal filename As String, ByVal filenumber As Integer)
        Dim HTMLcoding As String
        Dim StartTitle As Integer
        Dim EndTitle As Integer
        Dim StartPageTitle As Integer
        Dim EndPageTitle As Integer
        Dim PageTitle As String
        Dim Title As String
        Dim TitleLength As Integer
        Dim StartBody As Integer
        Dim EndBody As Integer
        Dim Body As String
        Dim MagIssue As String
        Dim URL As String
        Dim LengthURL As Integer
        Dim IssueSubstring As String
        Dim IssueMonth As Integer
        Dim IssueMonthStr As String
        Dim IssueYear As Integer
        Dim StartImage As Integer   'holds a position returned by InStr
        Dim EndImage As Integer     'holds a position returned by InStr
        Dim Image As String
        Dim ImageTag As String
        Dim Author As String
        Dim slashPosition As Integer
        Dim isImage As Boolean
        Dim ArticleOrder As String
        Dim URLTitle As String
        Dim IssueStr As String
        Dim IssueStart As Integer
        Dim IssueEnd As Integer
        Dim FaithMagStart As Integer
        Dim YearStart As Integer
        Dim Linebreak As Integer
        Dim Chr As String

        ' Create a text file, and return a reference to a TextStream
        filename = CStr(filenumber) & ".txt"

        fsoStream = fso.CreateTextFile(filename, True)
        
        fsoStream.Write("Name|Body|Author|IssueDate|ArticleOrder<>")     'for article import (without image)
		
        'check it is an html or an htm file:
        If filename.Length > 4 Then
            If filename.Substring(filename.Length - 4) = "html" Or filename.Substring(filename.Length - 3) = "htm" Then
                HTMLcoding = GetPageHTML(filename)
                'remove blank lines, etc:
                HTMLcoding = Regex.Replace(HTMLcoding, "(\r\n\s*?){2,}", Environment.NewLine)
                'extract page title: (between <title> and </title>)
                StartPageTitle = InStr(HTMLcoding, "<title>", CompareMethod.Text)
                'EndPageTitle = InStr(StartPageTitle, HTMLcoding, "</title>", CompareMethod.Text)
                StartPageTitle = StartPageTitle + 7

                EndPageTitle = InStr(StartPageTitle, HTMLcoding, "</title>", CompareMethod.Text)
                PageTitle = Trim(Mid(HTMLcoding, StartPageTitle, EndPageTitle - StartPageTitle))


                Image = ""
                'StartImage = InStr(HTMLcoding, "<!-- InstanceBeginEditable name=""LeftMidTopPanel"" --><img src=", CompareMethod.Text)
                StartImage = InStr(HTMLcoding, "<!-- InstanceBeginEditable name=""LeftMidTopPanel"" -->", CompareMethod.Text)
                If StartImage > 0 Then
                    StartImage = InStr(StartImage, HTMLcoding, "<img src=", CompareMethod.Text) 'there may be spaces between these two tags
                End If

                If StartImage = 0 Then
                    StartImage = 0

                End If

                If StartImage > 0 Then
                    'StartImage = StartImage + 62 'add an additional offset to include closing " after StartImage
                    EndImage = InStr(StartImage, HTMLcoding, "<!-- InstanceEndEditable -->", CompareMethod.Text)
                    Image = HTMLcoding.Substring(StartImage, EndImage - StartImage - 1)
                    slashPosition = Image.LastIndexOf("images/")
                    Image = Image.Substring(slashPosition + 7)
                End If

                'we then have to trim off the end of the image string which usually has the form:
                ' width="127" height="177" >
                'the image is always a jpg or gif
                isImage = False

                EndImage = InStr(Image, "jpg", CompareMethod.Text)
                If EndImage > 0 Then
                    EndImage = EndImage + 3
                    isImage = True
                Else
                    EndImage = InStr(Image, "gif", CompareMethod.Text)
                    If EndImage > 0 Then
                        EndImage = EndImage + 3
                        isImage = True
                    End If
                End If

                If Not (isImage) Then
                    Image = ""
                Else
                    Image = Image.Substring(0, EndImage - 1)
                End If






                'we then have to remove the coding from the beginning of the page down to the end of the
                'title string embedded in the page. This title is contained
                'between the tags <span class="Arial18FFE5B8Cent"> and the subsequent </span>
                'or between the tags <p class="Arial18FFE5B8Cent"> and the subsequent </p>

                Title = ""

                If InStr(HTMLcoding, "<span class=""Arial18FFE5B8Cent"">", CompareMethod.Text) > 0 Then
                    StartTitle = InStr(HTMLcoding, "<span class=""Arial18FFE5B8Cent"">", CompareMethod.Text)
                    EndTitle = InStr(StartTitle, HTMLcoding, "</span>", CompareMethod.Text)
                    TitleLength = EndTitle - StartTitle - 32
                    Title = HTMLcoding.Substring(StartTitle + 31, TitleLength)
                    HTMLcoding = HTMLcoding.Substring(EndTitle + 6)

                ElseIf InStr(HTMLcoding, "<p class=""Arial18FFE5B8Cent"">", CompareMethod.Text) > 0 Then
                    StartTitle = InStr(HTMLcoding, "<p class=""Arial18FFE5B8Cent"">", CompareMethod.Text)
                    EndTitle = InStr(StartTitle, HTMLcoding, "</p>", CompareMethod.Text)
                    TitleLength = EndTitle - StartTitle - 29
                    Title = HTMLcoding.Substring(StartTitle + 28, TitleLength)
                    HTMLcoding = HTMLcoding.Substring(EndTitle + 3)

                ElseIf InStr(HTMLcoding, "<h2 class=""Arial18FFE5B8Cent"">", CompareMethod.Text) > 0 Then
                    StartTitle = InStr(HTMLcoding, "<h2 class=""Arial18FFE5B8Cent"">", CompareMethod.Text)
                    EndTitle = InStr(StartTitle, HTMLcoding, "</h2>", CompareMethod.Text)
                    TitleLength = EndTitle - StartTitle - 30
                    Title = HTMLcoding.Substring(StartTitle + 29, TitleLength)
                    HTMLcoding = HTMLcoding.Substring(EndTitle + 3)

                End If
                ImageTag = ""

                If isImage Then
                    ImageTag = "<img alt=""" & Title & """ src=""/drupal/files/images/feasts/" & Image
                    ImageTag = ImageTag & """ style=""width: 210px; height: 265px; float: left; margin-left: 5px; margin-right: 5px;""/> "
                End If

                'body takes the form: <!-- InstanceBeginEditable name="CentralPanel" --> ..... <!-- InstanceEndEditable --> 
                StartBody = InStr(HTMLcoding, "<!-- InstanceBeginEditable name=""CentralPanel"" -->", CompareMethod.Text) + 50
                EndBody = InStr(HTMLcoding, "<!-- InstanceEndEditable -->", CompareMethod.Text)
                Body = HTMLcoding.Substring(0, EndBody - 1)
                'we now want to remove the <br/> tags at the beginning of the Body to remove spaces at the top of the body text:
                'we will just remove up to two:
                If Body.Substring(0, 5) = "<br/>" Then
                    Body = Body.Substring(5)
                End If
                If Body.Substring(0, 5) = "<br/>" Then
                    Body = Body.Substring(5)
                End If
                Body = ImageTag & Body






                slashPosition = filename.LastIndexOf("\")
                filename = filename.Substring(slashPosition + 1)
                MagIssue = filename.Substring(0, 5) 'reads first five characters of file which, in these folders, are like Nov09
                IssueSubstring = MagIssue.Substring(0, 3)
                IssueMonth = Val(IssueSubstring)
                Select Case IssueSubstring
                    Case "Jan"
                        IssueMonth = 1
                    Case "Mar"
                        IssueMonth = 3
                    Case "May"
                        IssueMonth = 5
                    Case "Jul"
                        IssueMonth = 7
                    Case "Sep"
                        IssueMonth = 9
                    Case "Nov"
                        IssueMonth = 11
                End Select

                IssueSubstring = MagIssue.Substring(3, 2)
                IssueYear = Val(IssueSubstring) + 2000
                If IssueMonth < 10 Then
                    IssueMonthStr = "0" + CStr(IssueMonth)
                Else
                    IssueMonthStr = CStr(IssueMonth)
                End If
                IssueStr = "01/" & IssueMonthStr & "/" & IssueYear & " - 00:00"


                'extract Authors name from coding:
                Author = ""
                If InStr(HTMLcoding, "FAITH Magazine") > 0 Then
                    Author = HTMLcoding.Substring(0, InStr(HTMLcoding, "FAITH Magazine") - 1)
                    'we then have to try to remove tags from the string

                    Author = Trim(Regex.Replace(Author, "<[^<>]+>", ""))


                    If Author = "Editorial" Then Author = "The Editor"
                End If

                If fFirstLetterPosition(Author) = 0 Then
                    Author = ""
                Else
                    Author = Author.Substring(fFirstLetterPosition(Author))
                End If





                Body = Trim(Body)
                FaithMagStart = InStr(Body, "FAITH Magazine", CompareMethod.Text)
                If FaithMagStart > 0 Then
                    YearStart = InStr(Body, "20")  'beginning of year. Always 20 as articles date from 2002
                    If YearStart > 0 Then
                        Body = Trim(Body.Substring(YearStart + 3))
                    End If
                End If
                If (Body.Substring(0, 2) = vbCrLf) Then   'remove 1st 2 chars if vbCrLf
                    Body = Trim(Body.Substring(2))
                End If






                'Dim result As String = System.Text.RegularExpressions.Regex.Replace(Body, "(?s).*?FAITH\s+Magazine\s+\w+\s+&ndash;\s+\w+\s+\d+\s*<br>\s*<br>\s*", String.Empty)
                'Body = System.Text.RegularExpressions.Regex.Replace(Body, "(?s).*?FAITH\s+Magazine\s+\w+\s+&ndash;\s+\w+\s+\d+\s*<br>\s*<br>\s*", String.Empty)


                'we then have to restrict the length of the URL string to 128 (Drupal requirement)
                URLTitle = PageTitle
                'remove non-alphanumeric characters
                URLTitle = OnlyAlphaNumericChars(URLTitle)
                URLTitle = Replace(URLTitle, ",", "") ' OnlyAlphaNumeric seems to miss commas

                URL = "publications/magazine/" & MagIssue.ToLower & "/" & MagIssue.ToLower & "_" & (Replace(Trim(URLTitle), " ", "_")).ToLower

                LengthURL = URL.Length
                If LengthURL > 128 Then
                    URL = URL.Substring(0, 128)
                End If

                ArticleOrder = "B - Special Articles" 'set ArticleOrder to the number after Editorial so the special articles will be
                'listed after it but before the regular articles which have ArticleOrder set to C, D, E...


                If InStr(Author, "Editorial") Then
                    ArticleOrder = "A - Editorial"
                Else

                    Select Case PageTitle

                        Case "The Road from Regensburg"
                            ArticleOrder = "T - The Road from Regensburg"
                        Case "Comment on the Comments"
                            ArticleOrder = "U - Comment on the Comments"
                        Case "Book Reviews"
                            ArticleOrder = "V - Book Reviews"
                        Case "Letters to the Editor"
                            ArticleOrder = "W - Letters to the Editor"
                        Case "Notes from Across the Atlantic"
                            ArticleOrder = "X - Notes from Across the Atlantic"
                        Case "Cutting Edge"
                            ArticleOrder = "Y - Cutting Edge"
                        Case "Sunday by Sunday"
                            ArticleOrder = "Z - Sunday by Sunday"
                    End Select
                End If

                slashPosition = Body.LastIndexOf("/>")
                BodyLength = Body.Length

                Body = Body.Substring(slashPosition + 1, BodyLength - slashPosition - 2)
                If Body.Substring(0, 4) = "<br>" Then  'delete break if at beginning of Body
                    Body = Trim(Body.Substring(4))
                End If
                If Body.Substring(0, 4) = "<br>" Then  'delete second break if at beginning of Body
                    Body = Trim(Body.Substring(4))
                End If
                If Asc(Body.Substring(0, 2)) < 32 Then   'remove 1st char if a control chr
                    Body = Trim(Body.Substring(1))
                End If
                If Asc(Body.Substring(0, 2)) < 32 Then   'and again - remove 1st char if a control chr
                    Body = Trim(Body.Substring(1))
                End If
                If Body.Substring(0, 4) = "<br>" Then  'delete break if at beginning of Body
                    Body = Trim(Body.Substring(4))
                End If
                If Body.Substring(0, 4) = "<br>" Then  'delete second break if at beginning of Body
                    Body = Trim(Body.Substring(4))
                End If

                'fsoStream.Write(URL)
                'fsoStream.Write("|")
                'fsoStream.Write(PageTitle)
                'fsoStream.Write("|")
                fsoStream.Write(Title)
                fsoStream.Write("|")
                fsoStream.Write(Body)
                fsoStream.Write("|")
                'fsoStream.Write(Image)
                'fsoStream.Write("|")
                fsoStream.Write(Author)
                fsoStream.Write("|")
                fsoStream.Write(IssueStr)
                fsoStream.Write("|")
                fsoStream.Write(ArticleOrder)
                fsoStream.Write("<>")

            End If
        End If
		
        fsoStream.Close()
        fsoStream = Nothing
    End Sub

    Public Function OnlyAlphaNumericChars(ByVal OrigString As _
      String) As String
        '***********************************************************
        'INPUT:  Any String
        'OUTPUT: The Input String with all non-alphanumeric characters 
        '        removed
        'EXAMPLE Debug.Print OnlyAlphaNumericChars("Hello World!")
        'output = "HelloWorld")
        'NOTES:  Not optimized for speed and will run slow on long
        '        strings.  If you plan on using long strings, consider 
        '        using alternative method of appending to output string,
        '        such as the method at
        '        http://www.freevbcode.com/ShowCode.Asp?ID=154
        '***********************************************************
        Dim lLen As Long
        Dim sAns As String
        Dim lCtr As Long
        Dim sChar As String

        sAns = ""

        OrigString = Trim(OrigString)
        lLen = Len(OrigString)
        For lCtr = 1 To lLen
            sChar = Mid(OrigString, lCtr, 1)
            If IsAlphaNumeric(Mid(OrigString, lCtr, 1)) Then
                sAns = sAns & sChar
            End If

        Next

        OnlyAlphaNumericChars = sAns

    End Function

    Private Function IsAlphaNumeric(ByVal sChr As String) As Boolean
        IsAlphaNumeric = sChr Like "[A-Z,a-z,0-9 ]"
    End Function

    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load

    End Sub

    Public Function fFirstLetterPosition(ByVal strIN As Object) As Integer
        Dim intReturn As Integer
        Dim iCount As Integer

        For iCount = 1 To Len(strIN & "")
            If Mid(strIN, iCount, 1) Like "[A-z]" Then
                intReturn = iCount
                Exit For
            End If
        Next iCount

        fFirstLetterPosition = intReturn
    End Function
End Class

Author Comment

by:bogorman
I get 'Null Reference Exception was Unhandled' on the line:

fsoStream = fso.CreateTextFile(filename, True)

Tried adding:

Dim fsoStream As Scripting.TextStream
Dim fso

To the Dim's at the top of WriteToTextFile but I still get the Exception error.
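
One plausible cause, assuming WriteToTextFile relies on the class-level fso field: every other routine declares its own local fso with Dim, and a local variable hides the field of the same name, so the field itself is never assigned. A standalone sketch of that effect (StringBuilder stands in for the FileSystemObject; ShadowDemo and Prepare are hypothetical names):

```vb
Module ShadowDemo
    ' Class-level field, like "Public fso" in Form1. Nothing assigns it.
    Public fso As Object

    Sub Prepare()
        ' This local hides the field of the same name, so the field
        ' is still Nothing after Prepare returns.
        Dim fso As New System.Text.StringBuilder()
        fso.Append("local only")
    End Sub

    Sub Main()
        Prepare()
        Console.WriteLine(fso Is Nothing)   ' prints True
    End Sub
End Module
```

Adding another `Dim fso` inside WriteToTextFile only creates one more unassigned local, which would explain why the exception remained.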

Author Comment

by:bogorman
Have corrected the error by inserting:

 fso = CreateObject("Scripting.FileSystemObject")

and the coding runs without error. However, 116 files are created, numbered consecutively 0.txt, 1.txt, ... The first one, 0.txt, contains all the data but the others just contain:

Name|Body|Author|IssueDate|ArticleOrder<>

presumably from the line:

fsoStream.Write("Name|Body|Author|IssueDate|ArticleOrder<>")     'for article import (without image)

That header should appear at the top of each file, but the data that should follow it is missing.
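
A possible explanation, judging from the posted WriteToTextFile: the filename parameter is overwritten with the output name (for example 1.txt) before the html/htm check runs, so the check fails and only the header is written. A sketch of keeping the two paths in separate variables. IsHtmlFile, sourceFile and outputFile are hypothetical names, and the sample path is made up:

```vb
Imports System.IO

Module OutputNameDemo
    ' Hypothetical helper: True when the path has an .htm or .html extension.
    Function IsHtmlFile(ByVal filePath As String) As Boolean
        Dim ext As String = Path.GetExtension(filePath).ToLower()
        Return ext = ".htm" OrElse ext = ".html"
    End Function

    Sub Main()
        Dim sourceFile As String = "C:\mag articles\Jan06\First article.html"
        Dim fileNumber As Integer = 1
        Dim outputFile As String = CStr(fileNumber) & ".txt"

        ' The check has to look at the page being read...
        Console.WriteLine(IsHtmlFile(sourceFile))   ' prints True
        ' ...not at the text file being written.
        Console.WriteLine(IsHtmlFile(outputFile))   ' prints False
    End Sub
End Module
```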

Expert Comment

by:sedgwick
Let me check

Expert Comment

by:sedgwick
if it's .net why not use the System.IO namespace, where u can manipulate folders and files easily, instead of using CreateObject?

can u post screenshot of your winform?

Author Comment

by:bogorman
Is this what you need? I am running the program from within Visual Studio 2005 in Parallels, running on a Macbook Pro (OSX 10.7.3).
I am not sure what you mean by using the System.IO namespace.
I think I should be able to cure this by altering the loops.
I would like to send you some of the folders containing the html files but I don't think I can do this from within EE.
Parallels-Picture-1.png

Expert Comment

by:sedgwick
You are using VS 2005 to write your vb.net application, so you are using the .net framework.
The System.IO namespace contains types for working with files and data streams, and also provides file and directory support.
In general, CreateObject() is used in vb scripts and not in winforms applications.
All the CreateObject("Scripting.FileSystemObject") usage is redundant and should be removed from the code.

for example, instead of:
fso = CreateObject("Scripting.FileSystemObject")
fso.CreateTextFile(filename, True)

use:
File.WriteAllText(filename, "the text u wanna write")

i can walk you through getting your code neat, are you up for it or do u just wanna make the code work?
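
When several records have to go into one file, a StreamWriter from System.IO can play the role of the FileSystemObject text stream. A minimal sketch, assuming a throwaway file in the temp folder (demo-06.txt) and a made-up sample record:

```vb
Imports System.IO

Module StreamWriterDemo
    Sub Main()
        ' Assumed demo location; the real program would write next to the articles.
        Dim outputFile As String = Path.Combine(Path.GetTempPath(), "demo-06.txt")

        ' False = overwrite an existing file rather than append to it.
        ' Using closes the writer even if an exception is thrown.
        Using writer As New StreamWriter(outputFile, False)
            writer.Write("Name|Body|Author|IssueDate|ArticleOrder<>")
            writer.Write("First title|First body|An Author|01/01/2006 - 00:00|B - Special Articles<>")
        End Using

        Console.WriteLine(File.ReadAllText(outputFile))
    End Sub
End Module
```

The writer stays open for as long as records belong to the same output file, which avoids the create/close cycle per article.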

Expert Comment

by:sedgwick
and by the way, who calls dirFiles_Change()?
what is the txtDisplay control used for?

Author Comment

by:bogorman
I don't need to neaten the code.  It will only be used once to create the text files so I can import them into my drupal site.

I just want the code to work and I think the problem is with the two loops.
The button on the form calls displayFolder which calls LoopThruDirectory and passes it C:\Web Design\Faith various\new website\mag articles (the folder in which all the subfolders  are which contain the html files).
dirs() is then loaded with the subdirectories:
C:\Web Design\Faith various\new website\mag articles\Jan06  and
C:\Web Design\Faith various\new website\mag articles\Jan07
(I have only put two folders in here just to test).
files is then loaded with the files in the first subdirectory (all the Jan06 html files - and also some others, pdfs, etc, which will not be read by this program).

What I want to do at this stage is to process a certain number of subfolders (say, all the ones with 06, then 07, etc.) and write each group to a separate text file, e.g. 06.txt, 07.txt, 08.txt. To simplify this it could just read the folders six at a time - doesn't really matter - the files could then be called 0.txt, 1.txt, 2.txt.

There is a bug in the coding at present as the data in all the files in mag articles is written to 0.txt, then the same data to 1.txt, etc.

Author Comment

by:bogorman
Have thought about this again.
To avoid taking up a lot of your time: suppose I keep all the subfolders (Jan06, Mar06, May06, Jul06, Sep06, Nov06, Jan07, Mar07, .......) in one folder, which is where they are at present (in mag articles).
If I could just iterate through these folders and write the data from the html files they contain (at first to 0.txt), then after reading 6 folders close the stream, open a new one to write to the next text file (1.txt), read the next 6 folders, and so on, that should do it.
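
The six-at-a-time idea can be sketched with integer division on the folder's position in the list. This assumes the folder names are already in issue order; Directory.GetDirectories does not promise chronological order, and an alphabetical listing would put Jan06 next to Jan07, so the list would need sorting by year and month first. The sample names below are taken from the list above:

```vb
Module BatchDemo
    Sub Main()
        ' Assumed to be in issue order already.
        Dim folders() As String = {"Jan06", "Mar06", "May06", "Jul06", "Sep06", "Nov06", "Jan07", "Mar07"}

        For i As Integer = 0 To folders.Length - 1
            ' Integer division: positions 0-5 map to 0.txt, 6-11 to 1.txt, and so on.
            Dim outputName As String = CStr(i \ 6) & ".txt"
            Console.WriteLine(folders(i) & " -> " & outputName)
        Next
    End Sub
End Module
```

Here the six 06 folders map to 0.txt and the two 07 folders to 1.txt. Naming the file after the last two characters of the folder name would not depend on ordering at all.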

Expert Comment

by:sedgwick
Ok so let me get it straight, you want to iterate over all the sub folders under the root folder (say "C:/Web Design/Faith various/new website/mag articles").

all folders that share the same year (i.e. Jan06, Feb06 etc) should output a single txt file called 06.txt which contains all the data extracted from all the html files located in them.

did i get it right?

btw, why do u use System.Net.WebClient.DownloadData() to get the html if you are dealing with local files?
you are not accessing a real url, because the html files are on the local machine.
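
For a page that is already on disk, File.ReadAllText from System.IO returns the content directly. A small sketch that writes a throwaway page to the temp folder and reads it back. The file name and the UTF-8 encoding are assumptions; the Dreamweaver pages may use a different encoding:

```vb
Imports System.IO
Imports System.Text

Module ReadLocalHtmlDemo
    Sub Main()
        ' Assumed demo file, created here so the sketch is self-contained.
        Dim page As String = Path.Combine(Path.GetTempPath(), "demo-article.html")
        File.WriteAllText(page, "<html><head><title>Demo</title></head></html>", Encoding.UTF8)

        ' Reads the whole file into a string; no WebClient or URL involved.
        Dim html As String = File.ReadAllText(page, Encoding.UTF8)
        Console.WriteLine(html)
    End Sub
End Module
```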

Author Comment

by:bogorman
Yes, that's what I want to do.
I suppose I need not change System.Net.WebClient.DownloadData(), as it works at present apart from the problem with creating the files.
Thanks.
Brian

Expert Comment

by:sedgwick
can u post an output file? like 01.txt?
what should it look like?

Author Comment

by:bogorman
Here it is.
0.txt

Expert Comment

by:sedgwick
i've added comments in the code to make it clearer.
what you should see after the code ends is one txt file per year, under the root folder you pass as a parameter to the Extract function.
so if you have 5 subfolders called Jan01, Feb01, May02, Dec02, Oct03, you should see 3 txt files called: 01.txt, 02.txt and 03.txt.
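
The naming rule on its own can be checked without touching the disk. A minimal sketch, using the five folder names from the example above (YearFileName is a hypothetical helper, not part of the posted code):

```vb
Imports System.Collections.Generic

Module YearKeyDemo
    ' Hypothetical helper: the last two characters of the folder name are the year.
    Function YearFileName(ByVal folderName As String) As String
        Return folderName.Substring(folderName.Length - 2) & ".txt"
    End Function

    Sub Main()
        Dim folders() As String = {"Jan01", "Feb01", "May02", "Dec02", "Oct03"}
        Dim seen As New List(Of String)

        For Each folder As String In folders
            Dim name As String = YearFileName(folder)
            ' Record each output name once, in first-seen order.
            If Not seen.Contains(name) Then seen.Add(name)
        Next

        Console.WriteLine(String.Join(", ", seen.ToArray()))   ' prints 01.txt, 02.txt, 03.txt
    End Sub
End Module
```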


    
    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        Extract("C:/Web Design/Faith various/new website/mag articles")
    End Sub

    Public Sub Extract(ByVal rootFolder As String)

        'dictionary whose key is the txt filename (i.e. 01.txt) and whose value is the aggregated html data
        'from all the html files belonging to this year (Jan01, Feb01, Oct01 etc)
        Dim map As Dictionary(Of String, String) = New Dictionary(Of String, String)

        'this header will be the 1st line of every text file
        Dim Header As String = "Name|Body|Author|IssueDate|ArticleOrder<>"

        'get all htm/html files under the root folder
        '(explicit types, since VS 2005 has no local type inference)
        Dim files As String() = Directory.GetFiles(rootFolder, "*.htm*", SearchOption.AllDirectories)

        'loop the files
        For Each htmlFile As String In files

            'get the directory name
            Dim dirName As String = New FileInfo(htmlFile).DirectoryName

            'build the txt file name from the directory name (i.e. if the file is located in c:\temp\Jan05,
            'then the txt file name will be c:\temp\05.txt)
            Dim fileName As String = String.Format("{0}\{1}.txt", rootFolder, dirName.Substring(dirName.Length - 2))

            'extract html data from file
            Dim htmlData As String = ExtractData(htmlFile)

            'check if the filename already exists in dictionary
            If map.ContainsKey(fileName) Then
                'append the file data to the content in the dictionary of the designated txt file
                map(fileName) = map(fileName) + Environment.NewLine + htmlData
            Else
                'first entry for this year: start with the header line
                map(fileName) = Header + Environment.NewLine + htmlData
            End If

        Next

        'loop through all file names, create the files and write the html data
        For Each key As String In map.Keys
            File.WriteAllText(key, map(key))
        Next

    End Sub

    Private Function ExtractData(filename As String) As String
        Dim HTMLcoding As String
        Dim StartTitle As Integer
        Dim EndTitle As Integer
        Dim StartPageTitle As Integer
        Dim EndPageTitle As Integer
        Dim PageTitle As String
        Dim Title As String
        Dim TitleLength As Integer
        Dim StartBody As Integer
        Dim EndBody As Integer
        Dim Body As String
        Dim MagIssue As String
        Dim URL As String
        Dim LengthURL As Integer
        Dim IssueSubstring As String
        Dim IssueMonth As Integer
        Dim IssueMonthStr As String
        Dim IssueYear As Integer
        Dim StartImage As Integer   'holds a position returned by InStr
        Dim EndImage As Integer     'holds a position returned by InStr
        Dim Image As String
        Dim ImageTag As String
        Dim Author As String
        Dim slashPosition As Integer
        Dim isImage As Boolean
        Dim ArticleOrder As String
        Dim URLTitle As String
        Dim IssueStr As String
        Dim FaithMagStart As Integer
        Dim YearStart As Integer

        HTMLcoding = File.ReadAllText(filename)

        'remove blank lines, etc:
        HTMLcoding = Regex.Replace(HTMLcoding, "(\r\n\s*?){2,}", Environment.NewLine)
        'extract page title: (between <title> and </title>)
        StartPageTitle = InStr(HTMLcoding, "<title>", CompareMethod.Text)
        'EndPageTitle = InStr(StartPageTitle, HTMLcoding, "</title>", CompareMethod.Text)
        StartPageTitle = StartPageTitle + 7

        EndPageTitle = InStr(StartPageTitle, HTMLcoding, "</title>", CompareMethod.Text)
        PageTitle = Trim(Mid(HTMLcoding, StartPageTitle, EndPageTitle - StartPageTitle))


        Image = ""
        'StartImage = InStr(HTMLcoding, "<!-- InstanceBeginEditable name=""LeftMidTopPanel"" --><img src=", CompareMethod.Text)
        StartImage = InStr(HTMLcoding, "<!-- InstanceBeginEditable name=""LeftMidTopPanel"" -->", CompareMethod.Text)
        If StartImage > 0 Then
            StartImage = InStr(StartImage, HTMLcoding, "<img src=", CompareMethod.Text) 'there may be spaces between these two tags
        End If

        If StartImage = 0 Then
            StartImage = 0

        End If

        If StartImage > 0 Then
            'StartImage = StartImage + 62 'add an additional offset to include closing " after StartImage
            EndImage = InStr(StartImage, HTMLcoding, "<!-- InstanceEndEditable -->", CompareMethod.Text)
            Image = HTMLcoding.Substring(StartImage, EndImage - StartImage - 1)
            slashPosition = Image.LastIndexOf("images/")
            Image = Image.Substring(slashPosition + 7)
        End If

        'we then have to trim off the end of the image string which usually has the form:
        ' width="127" height="177" >
        'the image is always a jpg or gif
        isImage = False

        EndImage = InStr(Image, "jpg", CompareMethod.Text)
        If EndImage > 0 Then
            EndImage = EndImage + 3
            isImage = True
        Else
            EndImage = InStr(Image, "gif", CompareMethod.Text)
            If EndImage > 0 Then
                EndImage = EndImage + 3
                isImage = True
            End If
        End If

        If Not (isImage) Then
            Image = ""
        Else
            Image = Image.Substring(0, EndImage - 1)
        End If






        'we then have to remove the coding from the beginning of the page down to the end of the
        'title string embedded in the page. This title is contained
        'between the tags <span class="Arial18FFE5B8Cent"> and the subsequent </span>
        'or between the tags <p class="Arial18FFE5B8Cent"> and the subsequent </p>

        Title = ""

        If InStr(HTMLcoding, "<span class=""Arial18FFE5B8Cent"">", CompareMethod.Text) > 0 Then
            StartTitle = InStr(HTMLcoding, "<span class=""Arial18FFE5B8Cent"">", CompareMethod.Text)
            EndTitle = InStr(StartTitle, HTMLcoding, "</span>", CompareMethod.Text)
            TitleLength = EndTitle - StartTitle - 32
            Title = HTMLcoding.Substring(StartTitle + 31, TitleLength)
            HTMLcoding = HTMLcoding.Substring(EndTitle + 6)

        ElseIf InStr(HTMLcoding, "<p class=""Arial18FFE5B8Cent"">", CompareMethod.Text) > 0 Then
            StartTitle = InStr(HTMLcoding, "<p class=""Arial18FFE5B8Cent"">", CompareMethod.Text)
            EndTitle = InStr(StartTitle, HTMLcoding, "</p>", CompareMethod.Text)
            TitleLength = EndTitle - StartTitle - 29
            Title = HTMLcoding.Substring(StartTitle + 28, TitleLength)
            HTMLcoding = HTMLcoding.Substring(EndTitle + 3)

        ElseIf InStr(HTMLcoding, "<h2 class=""Arial18FFE5B8Cent"">", CompareMethod.Text) > 0 Then
            StartTitle = InStr(HTMLcoding, "<h2 class=""Arial18FFE5B8Cent"">", CompareMethod.Text)
            EndTitle = InStr(StartTitle, HTMLcoding, "</h2>", CompareMethod.Text)
            TitleLength = EndTitle - StartTitle - 30
            Title = HTMLcoding.Substring(StartTitle + 29, TitleLength)
            HTMLcoding = HTMLcoding.Substring(EndTitle + 4) 'skip past the 5-character </h2> tag

        End If
        ImageTag = ""

        If isImage Then
            ImageTag = "<img alt=""" & Title & """ src=""/drupal/files/images/feasts/" & Image
            ImageTag = ImageTag & """ style=""width: 210px; height: 265px; float: left; margin-left: 5px; margin-right: 5px;""/> "
        End If

        'body takes the form: <!-- InstanceBeginEditable name="CentralPanel" --> ..... <!-- InstanceEndEditable --> 
        StartBody = InStr(HTMLcoding, "<!-- InstanceBeginEditable name=""CentralPanel"" -->", CompareMethod.Text) + 50
        EndBody = InStr(HTMLcoding, "<!-- InstanceEndEditable -->", CompareMethod.Text)
        Body = HTMLcoding.Substring(0, EndBody - 1)
        'we now want to remove the <br/> tags at the beginning of the Body to remove spaces at the top of the body text:
        'we will just remove up to two:
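        'StartsWith avoids an exception when Body is shorter than five characters
        If Body.StartsWith("<br/>", StringComparison.Ordinal) Then
            Body = Body.Substring(5)
        End If
        If Body.StartsWith("<br/>", StringComparison.Ordinal) Then
            Body = Body.Substring(5)
        End If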
        Body = ImageTag & Body






        slashPosition = filename.LastIndexOf("\")
        filename = filename.Substring(slashPosition + 1)
        MagIssue = filename.Substring(0, 5) 'reads first five characters of the file name which, in these folders, are like Nov09
        IssueSubstring = MagIssue.Substring(0, 3)
        IssueMonth = Val(IssueSubstring)
        Select Case IssueSubstring
            Case "Jan"
                IssueMonth = 1
            Case "Mar"
                IssueMonth = 3
            Case "May"
                IssueMonth = 5
            Case "Jul"
                IssueMonth = 7
            Case "Sep"
                IssueMonth = 9
            Case "Nov"
                IssueMonth = 11
        End Select

        IssueSubstring = MagIssue.Substring(3, 2)
        IssueYear = Val(IssueSubstring) + 2000
        If IssueMonth < 10 Then
            IssueMonthStr = "0" + CStr(IssueMonth)
        Else
            IssueMonthStr = CStr(IssueMonth)
        End If
        IssueStr = "01/" & IssueMonthStr & "/" & IssueYear & " - 00:00"


        'extract Authors name from coding:
        Author = ""
        If InStr(HTMLcoding, "FAITH Magazine") > 0 Then
            Author = HTMLcoding.Substring(0, InStr(HTMLcoding, "FAITH Magazine") - 1)
            'we then have to try to remove tags from the string

            Author = Trim(Regex.Replace(Author, "<[^<>]+>", ""))


            If Author = "Editorial" Then Author = "The Editor"
        End If

        If fFirstLetterPosition(Author) = 0 Then
            Author = ""
        Else
            Author = Author.Substring(fFirstLetterPosition(Author))
        End If





        Body = Trim(Body)
        FaithMagStart = InStr(Body, "FAITH Magazine", CompareMethod.Text)
        If FaithMagStart > 0 Then
            YearStart = InStr(Body, "20")  'beginning of year. Always 20 as articles date from 2002
            If YearStart > 0 Then
                Body = Trim(Body.Substring(YearStart + 3))
            End If
        End If
        If (Body.Substring(0, 2) = vbCrLf) Then   'remove 1st 2 chars if vbCrLf
            Body = Trim(Body.Substring(2))
        End If






        'Dim result As String = System.Text.RegularExpressions.Regex.Replace(Body, "(?s).*?FAITH\s+Magazine\s+\w+\s+&ndash;\s+\w+\s+\d+\s*<br>\s*<br>\s*", String.Empty)
        'Body = System.Text.RegularExpressions.Regex.Replace(Body, "(?s).*?FAITH\s+Magazine\s+\w+\s+&ndash;\s+\w+\s+\d+\s*<br>\s*<br>\s*", String.Empty)


        'we then have to restrict the length of the URL string to 128 (Drupal requirement)
        URLTitle = PageTitle
        'remove non-alphanumeric characters
        URLTitle = OnlyAlphaNumericChars(URLTitle)
        URLTitle = Replace(URLTitle, ",", "") ' OnlyAlphaNumeric seems to miss commas

        URL = "publications/magazine/" & MagIssue.ToLower & "/" & MagIssue.ToLower & "_" & (Replace(Trim(URLTitle), " ", "_")).ToLower

        LengthURL = URL.Length
        If LengthURL > 128 Then
            URL = URL.Substring(0, 128)
        End If

        ArticleOrder = "B - Special Articles" 'set ArticleOrder to the number after Editorial so the special articles will be
        'listed after it but before the regular articles which have ArticleOrder set to C, D, E...


        If InStr(Author, "Editorial") Then
            ArticleOrder = "A - Editorial"
        Else

            Select Case PageTitle

                Case "The Road from Regensburg"
                    ArticleOrder = "T - The Road from Regensburg"
                Case "Comment on the Comments"
                    ArticleOrder = "U - Comment on the Comments"
                Case "Book Reviews"
                    ArticleOrder = "V - Book Reviews"
                Case "Letters to the Editor"
                    ArticleOrder = "W - Letters to the Editor"
                Case "Notes from Across the Atlantic"
                    ArticleOrder = "X - Notes from Across the Atlantic"
                Case "Cutting Edge"
                    ArticleOrder = "Y - Cutting Edge"
                Case "Sunday by Sunday"
                    ArticleOrder = "Z - Sunday by Sunday"
            End Select
        End If

        slashPosition = Body.LastIndexOf("/>")
        Dim BodyLength = Body.Length

        Body = Body.Substring(slashPosition + 1, BodyLength - slashPosition - 2)
        If Body.Substring(0, 4) = "<br>" Then  'delete break if at beginning of Body
            Body = Trim(Body.Substring(4))
        End If
        If Body.Substring(0, 4) = "<br>" Then  'delete second break if at beginning of Body
            Body = Trim(Body.Substring(4))
        End If
        If Asc(Body.Substring(0, 2)) < 32 Then   'remove 1st char if a control chr
            Body = Trim(Body.Substring(1))
        End If
        If Asc(Body.Substring(0, 2)) < 32 Then   'and again - remove 1st char if a control chr
            Body = Trim(Body.Substring(1))
        End If
        If Body.Substring(0, 4) = "<br>" Then  'delete break if at beginning of Body
            Body = Trim(Body.Substring(4))
        End If
        If Body.Substring(0, 4) = "<br>" Then  'delete second break if at beginning of Body
            Body = Trim(Body.Substring(4))
        End If

        Return String.Format("{0}|{1}|{2}|{3}|{4}<>", Title, Body, Author, IssueStr, ArticleOrder)

    End Function

    Public Function fFirstLetterPosition(ByVal strIN As Object) As Integer
        Dim intReturn As Integer
        Dim iCount As Integer

        For iCount = 1 To Len(strIN & "")
            If Mid(strIN, iCount, 1) Like "[A-Za-z]" Then 'the range A-z would also match [ \ ] ^ _ and `
                intReturn = iCount
                Exit For
            End If
        Next iCount

        fFirstLetterPosition = intReturn
    End Function

    Public Function OnlyAlphaNumericChars(ByVal OrigString As _
      String) As String
        '***********************************************************
        'INPUT:  Any String
        'OUTPUT: The Input String with all non-alphanumeric characters 
        '        removed
        'EXAMPLE Debug.Print OnlyAlphaNumericChars("Hello World!")
        'output = "Hello World" (spaces are kept)
        'NOTES:  Not optimized for speed and will run slow on long
        '        strings.  If you plan on using long strings, consider 
        '        using alternative method of appending to output string,
        '        such as the method at
        '        http://www.freevbcode.com/ShowCode.Asp?ID=154
        '***********************************************************
        Dim lLen As Long
        Dim sAns As String
        Dim lCtr As Long
        Dim sChar As String

        sAns = ""

        OrigString = Trim(OrigString)
        lLen = Len(OrigString)
        For lCtr = 1 To lLen
            sChar = Mid(OrigString, lCtr, 1)
            If IsAlphaNumeric(Mid(OrigString, lCtr, 1)) Then
                sAns = sAns & sChar
            End If

        Next

        OnlyAlphaNumericChars = sAns

    End Function

    Private Function IsAlphaNumeric(ByVal sChr As String) As Boolean
        'no commas inside the brackets: in a Like character list a comma is itself a character to match
        IsAlphaNumeric = sChr Like "[A-Za-z0-9 ]"
    End Function

LVL 42

Expert Comment

by:sedgwick
By the way, I didn't touch the logic in the ExtractData function. I took it as is from your code and simply got rid of the FileStream and WebClient.DownloadData calls.

I ran the code on the same scenario you described, with a root folder containing subfolders that follow the same pattern:
Jan01, Feb01, Mar01, Jan02, Feb02, May03, and so on.

So basically, for all '01' subfolders (Jan01, Feb01, Mar01) I parse the HTML files and aggregate them into a single text file called 01.txt.
The same goes for the '02' subfolders, and so on.
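
The folder-to-file mapping described here can be sketched roughly as below. This is only an illustration, not the code posted earlier in the thread: the names YearFileName and GroupByYear are hypothetical, and the sketch assumes every issue folder name ends in a two-digit year (Jan02, Mar12 and so on).

```vbnet
Imports System
Imports System.Collections.Generic

Module YearGroupingSketch
    ' Hypothetical helper: maps an issue folder name such as "Jan02" to "02.txt".
    ' Assumes the folder name ends in a two-digit year.
    Public Function YearFileName(folderName As String) As String
        If folderName Is Nothing OrElse folderName.Length < 2 Then
            Throw New ArgumentException("Folder name must end in a two-digit year.", "folderName")
        End If
        Return folderName.Substring(folderName.Length - 2) & ".txt"
    End Function

    ' Groups issue folder names by the output file they belong to.
    Public Function GroupByYear(folderNames As IEnumerable(Of String)) As Dictionary(Of String, List(Of String))
        Dim groups As New Dictionary(Of String, List(Of String))()
        For Each name As String In folderNames
            Dim key As String = YearFileName(name)
            If Not groups.ContainsKey(key) Then
                groups(key) = New List(Of String)()
            End If
            groups(key).Add(name)
        Next
        Return groups
    End Function
End Module
```

Calling GroupByYear on the list of subfolder names gives one entry per year file, which is the same shape as the dictionary keyed by output file name used in the Export sub.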

Author Comment

by:bogorman
Thanks. I had to add:

Option Explicit On
Imports System
Imports System.IO
Imports System.Text.RegularExpressions
Imports System.Text


Public Class Form1
    Public dirFiles As Scripting.Folder
    Public filenumber As Integer
    Public fso
    Public fsoStream
    Public BodyLength


at the top of the code and

End Class

at the end. Then the code runs.

I do, however, get an "ArgumentException was unhandled" error (argument Length must be greater than or equal to zero) on the second line below:

        EndPageTitle = InStr(StartPageTitle, HTMLcoding, "</title>", CompareMethod.Text)
        PageTitle = Trim(Mid(HTMLcoding, StartPageTitle, EndPageTitle - StartPageTitle))

I cannot find a variable called Length in the statements, so I cannot understand the error.

EndPageTitle is zero. The HTML file in question does have a </title> tag. Did you make any changes that could cause this? There were no errors in parsing (is that the correct expression?) the files before.

I will try to trace it. In the meantime, if you have any ideas, please let me know.
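
For reference, Length in that message is not a variable in the code. It is the name of the third parameter of Mid. InStr returns 0 when it does not find </title>, so EndPageTitle - StartPageTitle becomes negative and Mid rejects it. A guarded version of the title extraction might look like the sketch below. ExtractPageTitle is a hypothetical name, and returning an empty string for a missing tag is an assumption about the desired behaviour.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module TitleSketch
    ' Hypothetical helper: returns the text between <title> and </title>,
    ' or an empty string when either tag is missing.
    Public Function ExtractPageTitle(html As String) As String
        Const OpenTag As String = "<title>"
        Dim startPos As Integer = InStr(html, OpenTag, CompareMethod.Text)
        If startPos = 0 Then Return ""
        startPos += OpenTag.Length

        Dim endPos As Integer = InStr(startPos, html, "</title>", CompareMethod.Text)
        ' InStr returns 0 when nothing is found; without this check the length
        ' passed to Mid would be negative and Mid would throw ArgumentException.
        If endPos = 0 Then Return ""

        Return Trim(Mid(html, startPos, endPos - startPos))
    End Function
End Module
```

With a guard like this, a file without a usable title produces an empty title instead of stopping the whole export.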

Author Comment

by:bogorman
I have discovered something.
The function ExtractData has a parameter filename. The file it is reading is:

C:/Web Design/Faith various/new website/mag articles\Jan07\_notes\Jan07AMuslimsJourneyToChrist.html.mno

Note the _notes subfolder! There is a _notes subfolder in Jan07 but no file in it. The code should avoid this subfolder (sorry I did not mention it). I think my code only read files with .htm or .html extensions, and that is why I didn't have this problem before.

Author Comment

by:bogorman
Apparently the .mno files are created by Dreamweaver and we do not need to read them. They are hidden files, so I have only just noticed them.

How can I modify the line:

        Dim files = Directory.GetFiles(rootFodler, "*.htm*", SearchOption.AllDirectories)

to only get files which have .htm or .html extensions (not .html.mno or .htm.mno)?

The 06.txt and the 07.txt files seem to be written well except for the fact that they are missing the first line of field titles:

Name|Body|Author|IssueDate|ArticleOrder<>

In the sub Extract, the Header does not seem to be written at the top of each text file.

Otherwise it seems to be working perfectly.
LVL 42

Expert Comment

by:sedgwick
Change this line:
            Dim dirName = New FileInfo(file).DirectoryName

to these lines:
            Dim fi = New FileInfo(file)
            Dim dirName = fi.DirectoryName

            'check for htm/html files only
            If fi.Extension <> ".htm" AndAlso fi.Extension <> ".html" Then
                Continue For
            End If
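
A slightly more defensive variant of this check is sketched below. The search pattern "*.htm*" also matches names such as page.html.mno, which is why the extension has to be tested separately. The helper name IsHtmlPage is hypothetical. Comparing extensions case-insensitively and skipping Dreamweaver's _notes folders outright are assumptions about what is wanted, not part of the fix above.

```vbnet
Imports System
Imports System.IO

Module HtmlFilterSketch
    ' Hypothetical helper: True only for .htm/.html files that are not inside a _notes folder.
    Public Function IsHtmlPage(filePath As String) As Boolean
        ' GetExtension returns only the last extension, so "a.html.mno" gives ".mno".
        Dim ext As String = Path.GetExtension(filePath)
        Dim isHtml As Boolean = String.Equals(ext, ".htm", StringComparison.OrdinalIgnoreCase) OrElse _
                                String.Equals(ext, ".html", StringComparison.OrdinalIgnoreCase)
        If Not isHtml Then Return False

        ' Name of the folder that directly contains the file.
        Dim folder As String = Path.GetFileName(Path.GetDirectoryName(filePath))
        Return Not String.Equals(folder, "_notes", StringComparison.OrdinalIgnoreCase)
    End Function
End Module
```

Inside the loop over files it would be used as: If Not IsHtmlPage(file) Then Continue For.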

LVL 42

Expert Comment

by:sedgwick
Also change these lines:
 'check if the filename already exists in dictionary
            If map.ContainsKey(fileName) Then
                'append the file data to the content in the dictionary of the designated txt file
                map(fileName) = map(fileName) + Environment.NewLine + htmlData
            Else
                map(fileName) = htmlData
            End If

to these lines:
 'check if the filename already exists in dictionary
            If map.ContainsKey(fileName) Then
                'append the file data to the content in the dictionary of the designated txt file
                map(fileName) = map(fileName) + Environment.NewLine + htmlData
            Else
                map(fileName) = Header + Environment.NewLine + htmlData
            End If
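
The same header-once logic can be tried in isolation with a small self-contained sketch. The AddRecord name is hypothetical, and the Header value is assumed to be the field-title line quoted earlier in the thread.

```vbnet
Imports System
Imports System.Collections.Generic

Module AggregateSketch
    ' Field-title line quoted in the thread; fields are separated by "|" and records end with "<>".
    Public Const Header As String = "Name|Body|Author|IssueDate|ArticleOrder<>"

    ' Hypothetical helper: appends one record to the text collected for fileName,
    ' putting the header first when the file name is seen for the first time.
    Public Sub AddRecord(map As Dictionary(Of String, String), fileName As String, record As String)
        If map.ContainsKey(fileName) Then
            map(fileName) = map(fileName) & Environment.NewLine & record
        Else
            map(fileName) = Header & Environment.NewLine & record
        End If
    End Sub
End Module
```

Because the header is added only in the Else branch, each output file gets it exactly once, however many articles are appended afterwards. For a very large number of articles, a StringBuilder per file would avoid repeated string copying.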


Author Comment

by:bogorman
Absolutely brilliant, sedgwick.
Only ONE other thing: there is no indication when the program has finished running. It would be nice if the form could display "Export finished".
LVL 42

Accepted Solution

by:
sedgwick earned 500 total points
Add this line after the export is done:

MessageBox.Show("Export finished")

Author Closing Comment

by:bogorman
Thanks so much for all the work you have done on this, sedgwick. Works perfectly.
LVL 42

Expert Comment

by:sedgwick
my pleasure :)