Solved

Combining XML files from a folder

Posted on 2004-08-12
Last Modified: 2010-04-23
I have a set of XHTML text files in a folder. I am reading in their text and writing it out into one large file. The problem is that a lot of text is repeated that should not be: the <html><head>....</head> wrapper and the like. There is just one long table in each file, and I need to combine these tables as well.

I am currently using IO.StreamReader, but I think I need to convert to Xml.XmlTextReader, so help with that would be nice too.

Any help would be nice.
Question by:JHalstead
 
LVL 24

Expert Comment

by:Justin_W
Unless you really need to parse the files as XML for some reason, I would probably do it this way:
1. Create a StringBuilder
2. Read the first file into a String variable.
3. Use String.IndexOf("<body"), String.LastIndexOf("</body>"), and String.Substring(...) to extract the content from the String variable.
4. Append the substring/content to your StringBuilder.
5. Repeat steps 2-4 for each additional file.

Also, here's a utility for reading the contents of any text file:
   ' Reads an entire text file and returns its contents as a String.
   Public Shared Function ReadTextFile(ByVal fileName As String) As String
       Dim reader As System.IO.StreamReader = Nothing
       Try
           reader = System.IO.File.OpenText(fileName)
           Return reader.ReadToEnd()
       Finally
           ' Close the reader even if OpenText or ReadToEnd throws.
           If reader IsNot Nothing Then
               reader.Close()
           End If
       End Try
   End Function
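
The numbered steps above can be sketched roughly as follows. This is only an illustration, not the poster's code: the module and function names (CombineSketch, ExtractBodyContent, CombineBodies) are hypothetical, and it assumes each file has a single <body> element whose opening tag contains no ">" inside an attribute value. It also skips past the opening <body ...> tag itself, which goes slightly beyond step 3 as written.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text

Module CombineSketch
    ' Returns the text between the opening <body ...> tag and the last </body>,
    ' or an empty string when either tag is missing.
    Function ExtractBodyContent(ByVal html As String) As String
        Dim openTag As Integer = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase)
        If openTag < 0 Then Return String.Empty

        ' Skip to the end of the opening tag so the tag and its attributes are dropped.
        Dim contentStart As Integer = html.IndexOf(">"c, openTag)
        If contentStart < 0 Then Return String.Empty
        contentStart += 1

        Dim contentEnd As Integer = html.LastIndexOf("</body>", StringComparison.OrdinalIgnoreCase)
        If contentEnd < contentStart Then Return String.Empty

        Return html.Substring(contentStart, contentEnd - contentStart)
    End Function

    ' Steps 1, 2, 4 and 5: one StringBuilder, one read and one append per file.
    Function CombineBodies(ByVal fileNames As IEnumerable(Of String)) As String
        Dim combined As New StringBuilder()
        For Each fileName As String In fileNames
            combined.Append(ExtractBodyContent(File.ReadAllText(fileName)))
            combined.Append(Environment.NewLine)
        Next
        Return combined.ToString()
    End Function
End Module
```

The combined string still needs to be wrapped in a single <html><head>...</head><body>...</body></html> shell before it is written out, and merging the per-file tables into one table is not covered by this sketch.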
 

Author Comment

by:JHalstead
OK, I don't know what I am doing wrong here, but now only the last file in the array is being saved to the output file...


    Private Sub DoCombine()
        Dim myreader As New System.Text.StringBuilder
        Dim myjgs As New ArrayList
        Dim mystrings As New ArrayList

        If ListBox1.Items.Count <= 0 Then
            MsgBox("There are no files to load, Combine aborted!")
            Exit Sub
        End If

        For loff As Integer = 0 To ListBox1.Items.Count - 1
            If ListBox1.Items(loff).lastindexof("jg") <= 0 Then
                MsgBox("There is a non JG file in list, Combine aborted!", MsgBoxStyle.OKOnly, "Docsoft")
                Exit Sub
            End If

            Dim mystring As String = ListBox1.Items(loff).substring(0, ListBox1.Items(loff).lastindexof("jg"))
            If Not myjgs.Contains(mystring) Then myjgs.Add(mystring)
            mystrings.Add(ListBox1.Items(loff))
        Next

        For moff As Integer = 0 To myjgs.Count - 1
            Dim myfstrings As ArrayList = mystrings
            myfstrings = FilterArray(myjgs(moff) & "jg", mystrings.Clone)

            Dim mypath As String = rootpath & "\" & myjgs(moff) & ".toc.all.htm"
            '            Dim mywriter As New Xml.XmlTextWriter(mypath, System.Text.Encoding.UTF8)

            For x As Integer = 0 To myfstrings.Count - 1

                '------------------------------
                Dim sr As New IO.StreamReader(rootpath & "\" & myfstrings(x).ToString)
                Dim s As String = sr.ReadLine
                Dim buf As New System.Text.StringBuilder

                Do Until s Is Nothing
                    If s.StartsWith("<html") Then

                        Do Until s Is Nothing OrElse s.StartsWith("</html>")

                            If s.StartsWith("<EFFECT") Then
                                Dim ei As Integer = s.LastIndexOf(">")
                                s = s.Insert(ei, "/")
                            End If

                            buf.Append(s)

                            buf.Append(vbCrLf)
                            s = sr.ReadLine()
                        Loop

                        buf.Append(s)
                        Dim sw As New IO.StreamWriter(mypath)
                        sw.WriteLine(buf.ToString())

                        sw.Close()

                    End If

                    s = sr.ReadLine()

                    sr.Close()
                Loop


                '------------------------------

            Next
        Next

        MsgBox("DONE")
    End Sub
 
LVL 24

Expert Comment

by:Justin_W
The following lines create a new StreamWriter on the same path for every input file, so each pass overwrites the output of the previous one:
                        Dim sw As New IO.StreamWriter(mypath)
                        sw.WriteLine(buf.ToString())

                        sw.Close()
Move the preceding lines to after the end of the For x loop, and move the following line to before that loop:
                        Dim buf As New System.Text.StringBuilder
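
A minimal sketch of the rearranged structure, under some assumptions: the names CombineLoopSketch, CombineFiles, inputPaths and outputPath are made up for illustration, and the <html>/<EFFECT> line handling from the posted code is left out to keep the shape of the loop visible. It also closes each reader only after its file has been read to the end, which is an adjustment beyond what the comment above describes (the posted code calls sr.Close() inside the read loop).

```vb
Imports System.Collections.Generic
Imports System.IO
Imports System.Text

Module CombineLoopSketch
    ' Appends every input file to one buffer, then writes the output file once.
    Sub CombineFiles(ByVal inputPaths As IEnumerable(Of String), ByVal outputPath As String)
        ' Declared before the loop so it accumulates text across all files.
        Dim buf As New StringBuilder()

        For Each inputPath As String In inputPaths
            Using sr As New StreamReader(inputPath)
                Dim s As String = sr.ReadLine()
                Do Until s Is Nothing
                    buf.AppendLine(s)
                    s = sr.ReadLine()
                Loop
            End Using ' the reader is disposed only after the whole file is read
        Next

        ' Opened after the loop, so the output is created once rather than once per input.
        Using sw As New StreamWriter(outputPath)
            sw.Write(buf.ToString())
        End Using
    End Sub
End Module
```

In the posted DoCombine routine, the equivalent change would keep the buffer and the writer inside the outer For moff loop, since each group of files gets its own output path.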
 

Author Comment

by:JHalstead
OK, it works to some degree. Could you please do two things for me:

1)  Give an example of how I would delete a table (not parse it) when it has a specific ID attribute.
2)  Is there a way to add white space to the output document to make it more readable?
 
LVL 24

Accepted Solution

by:
Justin_W earned 500 total points
Both of those operations would require more complicated string manipulation or XML parsing.  You could use XmlDocument.LoadXml() to parse a String.  Setting XmlDocument.PreserveWhitespace = False should make the XmlDocument format the XML nicely when it is converted back to a string.  Locating a specific node with specific attributes would require XPath expressions or DOM navigation, and would be a separate question.
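
For readers who want a starting point, here is a hedged sketch of the XmlDocument approach mentioned above. It is not part of the accepted answer: the names XmlCleanupSketch and RemoveTableById are hypothetical, the input is assumed to be well-formed XML with no undeclared entities (such as &nbsp; without a DTD), and the id value is assumed to contain no single quote. Indentation here comes from XmlWriterSettings.Indent, which is a choice made for this sketch.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Xml

Module XmlCleanupSketch
    ' Removes every <table> whose id attribute equals tableId and
    ' returns the document re-serialized with indentation.
    Function RemoveTableById(ByVal xhtml As String, ByVal tableId As String) As String
        Dim doc As New XmlDocument()
        doc.XmlResolver = Nothing        ' do not try to fetch an external DTD
        doc.PreserveWhitespace = False   ' drop insignificant whitespace on load
        doc.LoadXml(xhtml)

        ' local-name() matches <table> whether or not the XHTML namespace is declared.
        Dim xpath As String = "//*[local-name()='table' and @id='" & tableId & "']"

        ' Copy the matches first so nodes are not removed while the result is enumerated.
        Dim toRemove As New List(Of XmlNode)()
        For Each node As XmlNode In doc.SelectNodes(xpath)
            toRemove.Add(node)
        Next
        For Each node As XmlNode In toRemove
            node.ParentNode.RemoveChild(node)
        Next

        Dim settings As New XmlWriterSettings()
        settings.Indent = True
        settings.OmitXmlDeclaration = True

        Using sw As New StringWriter()
            Using writer As XmlWriter = XmlWriter.Create(sw, settings)
                doc.Save(writer)
            End Using ' flushes the writer before the string is read
            Return sw.ToString()
        End Using
    End Function
End Module
```

A call such as RemoveTableById(combinedText, "toc") would then return the document without that table, where combinedText and "toc" are placeholder values.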
 

Author Comment

by:JHalstead
Alright then, thanks for the help...
 
LVL 24

Expert Comment

by:Justin_W
You're welcome.