Speed up data parser

Hello,
I currently have a file with 1 million characters (about 1 MB in size). I am trying to parse the data with this old function, which still works but is very slow. The data looks like this:
    start0end
    start1end
    start2end
    start3end
    start4end
    start5end
    start6end


The code takes about 5 painful minutes to process the whole data:
    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        Dim sFinal = ""
        Dim strData = textbox.Text
        Dim strFirst = "start"
        Dim strSec = "end"

        Dim strID As String, Pos1 As Long, Pos2 As Long, strCur As String = ""

        Do While InStr(strData, strFirst) > 0
            Pos1 = InStr(strData, strFirst)
            strID = Mid(strData, Pos1 + Len(strFirst))
            Pos2 = InStr(strID, strSec)

            If Pos2 > 0 Then
                strID = Microsoft.VisualBasic.Left(strID, Pos2 - 1)
            End If

            If strID <> strCur Then
                strCur = strID

                sFinal += strID & ","
            End If

            strData = Mid(strData, Pos1 + Len(strFirst) + 3 + Len(strID))
        Loop
    End Sub


Asked by XK8ER

Jacques Bourgeois (James Burger), President, commented:
Operations are always slow when you perform a lot of operations on strings.

There are a few tricks that you can use.

For instance, every time you have a literal string value, such as the "," in your assignment to sFinal, a string object needs to be created and will later need to be destroyed by the garbage collector. If you were to use a variable such as Dim coma = "," and use coma instead of ",", you would create a single string for the whole operation instead of creating one each time you concatenate.

Then, the concatenation itself does something similar. Every time you assign something to a string variable, the system reserves memory for the new string and then discards the old one before pointing the variable to the new block of memory. This is necessary because strings are immutable in .NET, so a value with a different length cannot be written into the old block. There is a System.Text.StringBuilder object that can speed up concatenations tremendously by using its Append method.

Then, you are using Visual Basic functions instead of using the functions built into the framework. In some cases the compiler does the translation, but not always. Your VB function is first called, and then it calls the corresponding .NET method. Using the .NET methods directly could, in some cases, make things a little faster.

Right, Left and Mid can be replaced by variations of Substring, for example strID = strData.Substring(Pos1 + strFirst.Length). Note that I also use the Length property of the String instead of the Len function of Visual Basic.

InStr can similarly be replaced by the IndexOf method of the String class. Keep in mind that IndexOf and Substring are zero-based and IndexOf returns -1 when nothing is found, while InStr and Mid are one-based and InStr returns 0.

Instead of reassigning strData in the last line, keep the whole data in one string for the entire loop. You will be working with the same object all the time instead of having a new object created in each iteration. Keep the value of Pos1 + Len(strFirst) + 3 + Len(strID) in a variable (Pos3 here) for the next iteration of the loop and use it as the start position: Pos1 = InStr(Pos3, strData, strFirst), or if you want to do it the .NET way: Pos1 = strData.IndexOf(strFirst, Pos3)
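
Putting these suggestions together, a minimal sketch could look like the following. It is not a drop-in replacement: the ExtractBetween name is made up for illustration, it stops at an entry that has no closing marker (the original code would take the rest of the text instead), and it skips the end marker by its actual length rather than the hard-coded 3.

```vb
Imports System
Imports System.Text

Module ParserSketch
    ' Hypothetical helper: collects the text found between each startTag/endTag
    ' pair, skipping a value that repeats the one immediately before it.
    Function ExtractBetween(data As String, startTag As String, endTag As String) As String
        Dim result As New StringBuilder()
        Dim previous As String = ""
        Dim searchFrom As Integer = 0

        Do
            ' IndexOf is zero-based and returns -1 when nothing is found.
            Dim startPos As Integer = data.IndexOf(startTag, searchFrom, StringComparison.Ordinal)
            If startPos < 0 Then Exit Do

            Dim valueStart As Integer = startPos + startTag.Length
            Dim endPos As Integer = data.IndexOf(endTag, valueStart, StringComparison.Ordinal)
            If endPos < 0 Then Exit Do ' Assumption: an unterminated entry ends the scan.

            Dim value As String = data.Substring(valueStart, endPos - valueStart)
            If value <> previous Then
                previous = value
                result.Append(value).Append(","c)
            End If

            ' Continue right after the end marker; the input string is never copied.
            searchFrom = endPos + endTag.Length
        Loop

        Return result.ToString()
    End Function
End Module
```

From the button handler it could be called as sFinal = ExtractBetween(textbox.Text, "start", "end"). Like the original, it leaves a trailing comma. Because the data is scanned once and never re-copied, the running time should grow roughly in proportion to the size of the data, but only a test on the real file can confirm that.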

XK8ER (Author) commented:
JamesBurger, that was a beautiful explanation and I am happy to learn from such great minds. Do you think it's possible to see some code showing how you would get this function running smoothly?

XK8ER (Author) commented:
I've been trying for the past few hours to get this going, but this is rocket science to me! I am just a farmer who only knows how to plant tomatoes.

Miguel Oz (Software Engineer) commented:
JamesBurger's explanation only deals with parsing the IDs. If the data comes from a text file and each string to parse is one line of that file, you can use the File.ReadAllLines method to get the file contents as a string array, as follows:
        // Requires: using System.IO; using System.Text;
        public string ExtractIdAsCsvString(string filename)
        {
            // Initial values
            var strFirst = "start";
            var strSec = "end";
            var comma = ',';
            string strCur = String.Empty;
            int idxFirst = strFirst.Length;
            int lenSec = strSec.Length;
            // Real calculation starts here
            string[] dataLines = File.ReadAllLines(filename);
            StringBuilder sbFinal = new StringBuilder();
            foreach (var line in dataLines)
            {
                // Substring takes (startIndex, length), so subtract both markers
                var strId = line.Substring(idxFirst, line.Length - idxFirst - lenSec);
                if (strId == strCur)
                    continue; // skip an ID that repeats the previous one
                strCur = strId;
                sbFinal.Append(strId);
                sbFinal.Append(comma);
            }
            return sbFinal.ToString(); // Your sFinal result.
        }


Notice that the code above is quick but does no error checking, meaning that it will fail if the file contains lines like the following:
start3     //No end
4end       //No start


If this is the case you need to modify the above code as follows:
        // Requires: using System.IO; using System.Text;
        public string ExtractIdAsCsvStringWithErrorCheck(string filename)
        {
            // Initial values
            var strFirst = "start";
            var strSec = "end";
            var comma = ',';
            int lenSec = strSec.Length;
            string strCur = String.Empty;
            // Real calculation starts here
            string[] dataLines = File.ReadAllLines(filename);
            StringBuilder sbFinal = new StringBuilder();
            foreach (var line in dataLines)
            {
                // Skip the "start" marker if present, else begin at index 0
                int idxFirst = line.StartsWith(strFirst) ? strFirst.Length : 0;
                // Drop the "end" marker only if the line ends with it
                string strId;
                if (line.EndsWith(strSec))
                {
                    // Substring takes (startIndex, length), so subtract both markers
                    strId = line.Substring(idxFirst, line.Length - idxFirst - lenSec);
                }
                else
                {
                    strId = line.Substring(idxFirst);
                }
                if (strId == strCur)
                    continue; // skip an ID that repeats the previous one
                strCur = strId;
                sbFinal.Append(strId);
                sbFinal.Append(comma);
            }
            return sbFinal.ToString(); // Your sFinal result.
        }


Note: The code is C#, but it can easily be translated to VB.NET.
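
For reference, a rough VB.NET translation of the error-checking version might look like this. It is a sketch rather than a line-by-line port: the function names are made up, the parsing is split from the file access so it can be tried on an in-memory list, and it assumes duplicates are only skipped when they directly follow each other.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text

Module LineParserSketch
    ' Hypothetical helper: strips an optional "start" prefix and "end" suffix
    ' from each line and joins the remaining IDs with commas.
    Function ExtractIdsAsCsv(lines As IEnumerable(Of String)) As String
        Const strFirst As String = "start"
        Const strSec As String = "end"
        Dim strCur As String = String.Empty
        Dim sbFinal As New StringBuilder()

        For Each line As String In lines
            ' Use a marker length of 0 when the marker is missing from the line.
            Dim idxFirst As Integer = If(line.StartsWith(strFirst, StringComparison.Ordinal), strFirst.Length, 0)
            Dim lenSec As Integer = If(line.EndsWith(strSec, StringComparison.Ordinal), strSec.Length, 0)

            Dim strId As String = line.Substring(idxFirst, line.Length - idxFirst - lenSec)
            If strId = strCur Then Continue For ' same ID as the previous line
            strCur = strId
            sbFinal.Append(strId).Append(","c)
        Next

        Return sbFinal.ToString()
    End Function

    ' Thin wrapper that reads the lines from a file.
    Function ExtractIdsFromFile(filename As String) As String
        Return ExtractIdsAsCsv(File.ReadAllLines(filename))
    End Function
End Module
```

For example, ExtractIdsAsCsv({"start0end", "start1end", "start1end", "start3", "4end"}) returns "0,1,3,4,".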

käµfm³d 👽 commented:
"If you were to use a variable such as Dim coma = "," and use coma instead of ",", you would create a single string for the whole operation instead of creating one each time you concatenate."
Not quite true, since string literals are interned in .NET.

From http://msdn.microsoft.com/en-us/library/system.string.intern.aspx:

"For example, if you assign the same literal string to several variables, the runtime retrieves the same reference to the literal string from the intern pool and assigns it to each variable."

In this case it's not the literal that's the problem, it's the temporary string that is created as a result of the concatenation (that is, the one that eventually gets assigned to the variable).

"There is a System.Text.StringBuilder object that can speed up concatenations tremendously by using its Append method."
When you have 4 (IIRC) or fewer concatenations, straight concatenation can be faster than creating a new StringBuilder instance. Also, since a StringBuilder is built on top of a character array, you can incur a hit if the underlying data store needs to resize. It would be better to pass a size to the constructor if you know in advance how long the resulting string is going to be. Every time the data store needs to be resized, it is doubled, and all of the existing data needs to be copied from the old array to the new array.

Jacques Bourgeois (James Burger), President, commented:
Thanks Kaufmed. I did not know about string interning. At some point, I read in a Microsoft blog about using String.Empty instead of "" in order to prevent the generation of multiple empty strings, and so assumed that the trick was good for any string whose value needs to be reused.

As for the StringBuilder, it's true that its constructor eats up some processor time, and that it brings nothing if you have few concatenations to make. But with the size of the file we are dealing with here, and the small chunks of data that need to be isolated and then concatenated, this is not an issue.

And as for setting the initial capacity, I have a test that I use for a demo in my class where we build a string of 50000 chars, one char at a time (50000 concatenations) vs doing the same thing with a StringBuilder. The results are something like the following, with variations depending on the computer naturally:

  Concatenation : between 1000 and 1100 milliseconds
  StringBuilder with the default constructor : 5 or 6 milliseconds
  StringBuilder with a Capacity of 50000 : 5 or 6 milliseconds

One would have to test more to get precise information, but these tests suggest that letting the StringBuilder grow from its default capacity (only 16 characters) costs very little, so the default constructor is good enough for most jobs.

As for anything having to do with performance, only testing can really tell you if one method is better than another.
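
A test of that kind can be sketched with a Stopwatch as shown below. This is not the exact demo from my class, just an assumed reconstruction: the module and function names are made up, and the timings it prints will differ from machine to machine and from one .NET version to another.

```vb
Imports System
Imports System.Diagnostics
Imports System.Text

Module ConcatTiming
    Const CharCount As Integer = 50000

    ' Builds the string with plain concatenation: a new string per iteration.
    Function BuildWithConcat() As String
        Dim s As String = ""
        For i As Integer = 1 To CharCount
            s &= "x"
        Next
        Return s
    End Function

    ' Builds the string with a StringBuilder; capacity <= 0 means "use the default".
    Function BuildWithBuilder(capacity As Integer) As String
        Dim sb As StringBuilder = If(capacity > 0, New StringBuilder(capacity), New StringBuilder())
        For i As Integer = 1 To CharCount
            sb.Append("x"c)
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Dim watch As Stopwatch = Stopwatch.StartNew()
        BuildWithConcat()
        Console.WriteLine("Concatenation: {0} ms", watch.ElapsedMilliseconds)

        watch.Restart()
        BuildWithBuilder(0)
        Console.WriteLine("StringBuilder, default constructor: {0} ms", watch.ElapsedMilliseconds)

        watch.Restart()
        BuildWithBuilder(CharCount)
        Console.WriteLine("StringBuilder, capacity {0}: {1} ms", CharCount, watch.ElapsedMilliseconds)
    End Sub
End Module
```

Run it several times in a Release build and ignore the first run, since JIT compilation can distort a single measurement.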

And to come back directly to the original question: after a good night's sleep, one sees more clearly and gets new ideas. If I understand what your code is doing (when you parse data by just looking at code, you sometimes miss a detail), you are trying to transform

    start0end
    start1end
    start2end
    start3end
    start4end
    start5end
    start6end

into

    0,1,2,3,4,5,6


If this is the case, then this should do the trick:
strData = strData.Replace("end" & Environment.NewLine & "start", ",")
sFinal = strData.Substring(5, strData.Length - 8) ' Removes the first "start" and the last "end"


If it is still too slow, try with a StringBuilder to see if it is better:
Dim stbData As New System.Text.StringBuilder(TextBox.Text)
stbData.Replace("end" & Environment.NewLine & "start", ",")
sFinal = stbData.ToString(5, stbData.Length - 8) ' Removes the first "start" and the last "end"


You might need to adjust the code to the specific format of your data. For instance, are there spaces or tabs before each start? My code assumes there aren't. And does the data come from a source whose line breaks are the same as Environment.NewLine?
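
If you are not sure about either point, one possible way to make the Replace approach more tolerant is to normalize the text first. The sketch below is only an illustration under assumptions: the ToCsv name is made up, it treats CRLF, lone CR and lone LF as line breaks, and it expects every remaining line to look like start...end.

```vb
Imports System
Imports System.Linq

Module NormalizeSketch
    ' Hypothetical helper: trims every line, unifies the line breaks, then
    ' applies the same Replace/Substring trick as above.
    Function ToCsv(data As String) As String
        ' Split on any common line break and drop empty lines.
        Dim lines As String() = data.Split(New String() {vbCrLf, vbCr, vbLf}, StringSplitOptions.RemoveEmptyEntries)

        ' Remove leading/trailing spaces and tabs, skip lines that end up empty.
        Dim cleaned = lines.Select(Function(l) l.Trim()).Where(Function(l) l.Length > 0)

        ' Rejoin with a single known separator so the Replace below always matches.
        Dim joined As String = String.Join(vbLf, cleaned)
        joined = joined.Replace("end" & vbLf & "start", ",")

        ' Removes the first "start" (5 characters) and the last "end" (3 characters).
        Return joined.Substring(5, joined.Length - 8)
    End Function
End Module
```

For example, ToCsv("  start0end" & vbCrLf & vbTab & "start1end" & vbLf & "start2end") returns "0,1,2". The extra Split and Join passes cost some time, so measure before using this on large files.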

XK8ER (Author) commented:
Thanks guys for all the information. One thing: don't be misled by the samples.

start0end
start1end
start2end
start3end

I am also using this to parse html files and other documents
<input name"extract this">
<b>name: extract this</b>

The code would have to work with those as well, like the original code does.

Jacques Bourgeois (James Burger), President, commented:
The same technique can be used for any string that you know is not part of the data itself, which is even easier with xml and html because of the tags. And since you will have many replacements to do, then the StringBuilder is probably the best thing:
Dim stbData As New System.Text.StringBuilder(TextBox.Text)
stbData.Replace("end" & Environment.NewLine & "start", ",")
' StringBuilder.Remove takes (startIndex, length), so replace with "" to delete a marker
stbData.Replace("<input name""", "")
stbData.Replace(""">", ",")
stbData.Replace("<b>", "")
stbData.Replace("</b>", ",")


... and so on as needed
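
If you would rather pass the start and end markers as parameters, as your original function does, a regular expression is another option. This swaps in a different technique from the Replace calls above, and the ExtractAll name is made up for the example, so treat it as a sketch only.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module MarkerSketch
    ' Hypothetical helper: returns every value found between startTag and endTag.
    Function ExtractAll(data As String, startTag As String, endTag As String) As List(Of String)
        ' Regex.Escape makes characters such as < or " match literally;
        ' (.*?) lazily captures the shortest text up to the next endTag.
        Dim pattern As String = Regex.Escape(startTag) & "(.*?)" & Regex.Escape(endTag)
        Dim values As New List(Of String)()

        ' Singleline lets a value span several lines if the markers are far apart.
        For Each m As Match In Regex.Matches(data, pattern, RegexOptions.Singleline)
            values.Add(m.Groups(1).Value)
        Next

        Return values
    End Function
End Module
```

With your samples, ExtractAll("<b>name: extract this</b>", "<b>name: ", "</b>") and ExtractAll("<input name""extract this"">", "<input name""", """>") each return a list containing "extract this". For real-world HTML with nested or irregular tags, a proper HTML parser is usually more reliable than string matching.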

XK8ER (Author) commented:
Thanks guys.