Parsing a text file using C#


Hello group,

I have a text file and need to parse it using C#. What is the best way to parse it?


 Name : ABC  DEF                      Applicant ID:

 Date: 6/7/2009                        Test Form: A23

 Applied: Yes                            Code: 000001

 Number:163                            Score: 230

 
akohan asked:
 
_Gerry_ Commented:
There's more than one way to do this.
You can read the file line by line, using something like:

    using System.IO;
    ...
    using (var reader = new StreamReader("myfile.txt"))
    {
        var myline = string.Empty;
        while ((myline = reader.ReadLine()) != null)
        {
            // do something with the line....
        }
    }

or read the whole file into memory and work with it there:

    using System.IO;
    ...
    var thewholefile = File.ReadAllText("myfile.txt");
    // use string.Split() etc. to break up the file afterwards.


I think I read somewhere that the first method is actually faster to execute. It certainly needs less memory on large files, since only one line is held at a time.
What you do to the lines to parse them depends entirely on what you are expecting to find in the file :-)
 
akohan (Author) Commented:

Hi  Gerry,

Thanks, yes, that is what I have done too. However, I had to do some cleanup for lines like:

 Date: 6/7/2009                        Test Form: A23

I had to find the position of "Date:" and then extract the text after the ":", and do the same for "Test Form:" etc., using the IndexOf() and Substring() methods.

Any idea if I'm on the right track?

Thanks.


_Gerry_ Commented:
That will do nicely and should work fine.
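
A minimal sketch of that IndexOf()/Substring() approach, for illustration only. The class name LabelParser and the method ValueAfter are hypothetical, and it assumes you know each label and the label that follows it on the same line:

```csharp
using System;

static class LabelParser
{
    // Returns the trimmed text that follows `label` on `line`, up to
    // `nextLabel` (or to the end of the line when nextLabel is null or
    // not present). Returns null when `label` itself is not found.
    public static string ValueAfter(string line, string label, string nextLabel)
    {
        int start = line.IndexOf(label, StringComparison.Ordinal);
        if (start < 0) return null;
        start += label.Length;              // skip past the label and its ':'

        int end = line.Length;
        if (nextLabel != null)
        {
            int next = line.IndexOf(nextLabel, start, StringComparison.Ordinal);
            if (next >= 0) end = next;      // stop where the next label begins
        }
        return line.Substring(start, end - start).Trim();
    }
}
```

For the sample line, ValueAfter(line, "Date:", "Test Form:") gives "6/7/2009" and ValueAfter(line, "Test Form:", null) gives "A23".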
 
...or you could get clever with .Split() and LINQ   :)

        // Requires: using System; using System.Linq;
        static void Main(string[] args)
        {
            string text = " Date: 6/7/2009                        Test Form: A23";
            Console.WriteLine("Original text: '{0}'", text);

            // Split on spaces, colons and tabs, then drop the empty entries.
            var words = from w in text.Split(' ', ':', '\t') where w != string.Empty select w;

            Console.WriteLine("{0} words in text:", words.Count());

            foreach (string s in words)
            {
                Console.WriteLine(s);
            }
            Console.ReadLine();
        }

 
akohan (Author) Commented:


Thanks for your comment. I don't know anything about LINQ since I'm new to C#, but I will check it out for sure.

I will get back to you.

Thanks!
Tony McCreath (Technical SEO Consultant) Commented:
For LINQ you need .NET Framework 3.5 (C# 3.0).

Another way is to use regular expressions (Regex, System.Text.RegularExpressions).

What's important is that you define the syntax rules. Here's a sequential set of rules that may make sense:

Any character except ":" = first field name
":"
Any character except tab (\t) = first value
"\t"
Any character except ":" = second field name
":"
All other characters

This could be defined in regex like the following. It also strips spaces and names the capture groups  1=name1, 2=value1, 3=name2, 4=value2



^ *(?<name1>[^:]*?) *: *(?<value1>[^\t]*?) *\t *(?<name2>[^:]*?) *: *(?<value2>.*?) *$
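
As a rough sketch of how such a pattern could be used from C#: the class name TwoColumnLine and the method Parse are made up for illustration, and the code assumes the two columns really are separated by a tab. The pattern here uses lazy quantifiers (*?) and the ^ and $ anchors so that the padding spaces stay out of the captured groups.

```csharp
using System.Text.RegularExpressions;

static class TwoColumnLine
{
    // Assumption: one "name: value" pair, a tab, then a second pair.
    static readonly Regex Pattern = new Regex(
        @"^ *(?<name1>[^:]*?) *: *(?<value1>[^\t]*?) *\t *(?<name2>[^:]*?) *: *(?<value2>.*?) *$");

    // Returns { name1, value1, name2, value2 }, or null when the line
    // does not follow the rules.
    public static string[] Parse(string line)
    {
        Match m = Pattern.Match(line);
        if (!m.Success) return null;

        return new[]
        {
            m.Groups["name1"].Value, m.Groups["value1"].Value,
            m.Groups["name2"].Value, m.Groups["value2"].Value
        };
    }
}
```

If the real file pads the columns with spaces instead of a tab, the \t parts of the pattern would need to change, for example to a run of two or more spaces.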

akohan (Author) Commented:

Hello,

Thanks for your comments. I have attached the format I am receiving (after converting a specific file to text). Is the above method still good for it or should I change my approach?

In the following example I need to extract:
William Smith
5/2/2008
55 (which is the exam score)
1025409804

Once again thanks.

Regards,


Header file
Name of visitor: William Smith Signin Date: 5/2/2008 Position Applied For: Driver Number Correct (exam score): 55 Percentile Total (%total): 78 Median Score for Position: 52 Applicant ID: 1025409804 Test Form: ZipCode 1x20558 Job Code: 000001 Age Adjusted Score: 52 Equiv: 117 Suggested Hiring Range: 19 - 44

akohan (Author) Commented:

I just found out (based on the Help dialog) that I'm using .NET 3.5 SP1,

so I guess I can use LINQ, right?

_Gerry_ Commented:
Yup, or regular expressions, or plain old IndexOf/Substring... that's why programming is so fun.
Your attached example is a bit tricky.... trying to separate the "William Smith" from "Signin" etc.
Because each value runs straight into the next label, the code needs to know the full label names. That makes it very tricky for an otherwise excellent regular expression approach, but easy for IndexOf/Substring and perhaps the rather terse LINQ example I posted earlier.

In the interest of the principle of KISS (except I'm sure you're not stupid :-) maybe IndexOf/Substring is the best approach after all !
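
One way the IndexOf/Substring idea might look for the single-line record, as a sketch only. The class name RecordParser and the method Parse are hypothetical, and the label list is copied from the sample record posted above, on the assumption that the labels always appear in that order:

```csharp
using System;
using System.Collections.Generic;

static class RecordParser
{
    // Labels taken from the sample record, in the order they appear.
    static readonly string[] Labels =
    {
        "Name of visitor:", "Signin Date:", "Position Applied For:",
        "Number Correct (exam score):", "Percentile Total (%total):",
        "Median Score for Position:", "Applicant ID:", "Test Form:",
        "Job Code:", "Age Adjusted Score:", "Equiv:", "Suggested Hiring Range:"
    };

    // Maps each label found in `record` to the trimmed text between it
    // and the next label found (or the end of the record).
    public static Dictionary<string, string> Parse(string record)
    {
        var fields = new Dictionary<string, string>();
        string pendingLabel = null;   // label whose value is still open
        int valueStart = 0;           // where that value begins

        foreach (string label in Labels)
        {
            int at = record.IndexOf(label, valueStart, StringComparison.Ordinal);
            if (at < 0) continue;     // label missing: skip it

            if (pendingLabel != null)
                fields[pendingLabel] = record.Substring(valueStart, at - valueStart).Trim();

            pendingLabel = label;
            valueStart = at + label.Length;
        }

        if (pendingLabel != null)
            fields[pendingLabel] = record.Substring(valueStart).Trim();

        return fields;
    }
}
```

With the sample record, fields["Name of visitor:"] is "William Smith" and fields["Number Correct (exam score):"] is "55".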
Tony McCreath (Technical SEO Consultant) Commented:
Will the field headings always be the same?

Name of visitor
Signin Date
Position Applied For
Number Correct (exam score)
Percentile Total (%total)
Median Score for Position
Applicant ID
Test Form
Job Code
Age Adjusted Score
Equiv
Suggested Hiring Range

This would make life easier.

I don't have time now. I'll try and put a Regex script together later.
akohan (Author) Commented:

Yes, it will always be like that.

Thanks.
Tony McCreath (Technical SEO Consultant) Commented:
I've just noticed your latest example is different to the original example.

Will all the fields ALWAYS be present and in the same order?
Will a single entry cover multiple lines?
Are you talking about one entry per file?
Is "Header file" part of the data to parse?

akohan (Author) Commented:

Hi Tiggerito,

Yes, please go by the last example, since that is the format I'm generating.

Tony McCreath (Technical SEO Consultant) Commented:
Here's a simple Regex to gather the data if it is formatted exactly as you stated.

It is basically a copy of the example you provided with some alterations:

Any regex-sensitive characters have been escaped; that is, ( and ) were changed to \( and \).

The values have been replaced by the following capture sequences:

(?<fieldvalue>.*)

In each case 'fieldvalue' is changed to the name of the field. This is saying: capture any number of characters into a group called 'fieldvalue'.

Name of visitor: (?<name>.*) Signin Date: (?<date>.*) Position Applied For: (?<position>.*) Number Correct \(exam score\): (?<score>.*) Percentile Total \(%total\): (?<total>.*) Median Score for Position: (?<median>.*) Applicant ID: (?<id>.*) Test Form: (?<form>.*) Job Code: (?<job>.*) Age Adjusted Score: (?<agescore>.*) Equiv: (?<equiv>.*) Suggested Hiring Range: (?<range>.*)
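
A minimal sketch of applying that pattern from C# with Regex.Match and named groups. The class name HeaderRecord and the method Extract are hypothetical, and the code assumes the whole record sits on one line with single spaces around each label, as in the posted sample:

```csharp
using System.Text.RegularExpressions;

static class HeaderRecord
{
    // The pattern from the post, split across lines only for readability.
    static readonly Regex Pattern = new Regex(
        @"Name of visitor: (?<name>.*) Signin Date: (?<date>.*) " +
        @"Position Applied For: (?<position>.*) " +
        @"Number Correct \(exam score\): (?<score>.*) " +
        @"Percentile Total \(%total\): (?<total>.*) " +
        @"Median Score for Position: (?<median>.*) " +
        @"Applicant ID: (?<id>.*) Test Form: (?<form>.*) " +
        @"Job Code: (?<job>.*) Age Adjusted Score: (?<agescore>.*) " +
        @"Equiv: (?<equiv>.*) Suggested Hiring Range: (?<range>.*)");

    // Returns { name, date, score, id }, or null when the record does
    // not match the expected layout.
    public static string[] Extract(string record)
    {
        Match m = Pattern.Match(record);
        if (!m.Success) return null;

        return new[]
        {
            m.Groups["name"].Value,
            m.Groups["date"].Value,
            m.Groups["score"].Value,
            m.Groups["id"].Value
        };
    }
}
```

The greedy .* groups still land on the right text because each label occurs only once in the record, so the regex engine backtracks until every label lines up.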
