Solved

Parsing a text file using C#

Posted on 2009-07-13
Last Modified: 2013-12-17

Hello group,

I have a text file and need to parse it using C#. What is the best way to parse it?


 Name : ABC  DEF                      Applicant ID:

 Date: 6/7/2009                        Test Form: A23

 Applied: Yes                            Code: 000001

 Number:163                            Score: 230

Question by:akohan
LVL 3

Accepted Solution

by:
_Gerry_ earned 500 total points
ID: 24846016
There's more than one way to do this.
Line by line using something like

    using System.IO;
    ...
    using (var reader= new StreamReader("myfile.txt"))
    {
         var myline=string.Empty;
         while ((myline=reader.ReadLine()) != null)
         {
               // do something with the line....
         }
    }

or read the whole file into memory and work with it there:

    using System.IO;
    ...
    var thewholefile = File.ReadAllText("myfile.txt");

Then use string.Split() etc. to break up the file afterwards.


I think I read somewhere that the first method is actually faster to execute. It also only keeps one line in memory at a time, which matters for large files.
What you do to the lines to parse them depends entirely on what you are expecting to find in the file :-)
 
 

Author Comment

by:akohan
ID: 24846908

Hi  Gerry,

Thanks, yes, that is what I have done too. However, I had to do some cleaning up for lines like:

 Date: 6/7/2009                        Test Form: A23

I had to find the position of "Date:" and then extract what comes after the ":", and do the same for "Test Form:" etc., using the IndexOf() and Substring() methods.

Any idea if I'm on the right track?

Thanks.
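
A minimal sketch of the IndexOf()/Substring() approach described above. The class and method names (FieldExtractor, ValueAfter) are made up for illustration, and it assumes you know the label you want and, optionally, the label that follows it on the same line:

```csharp
using System;

static class FieldExtractor
{
    // Returns the trimmed text between `label` and `nextLabel`.
    // Pass null for nextLabel to read to the end of the line.
    // Returns null when `label` is not found.
    public static string ValueAfter(string line, string label, string nextLabel)
    {
        int start = line.IndexOf(label, StringComparison.Ordinal);
        if (start < 0) return null;
        start += label.Length;               // skip past the label itself

        int end = line.Length;
        if (nextLabel != null)
        {
            int next = line.IndexOf(nextLabel, start, StringComparison.Ordinal);
            if (next >= 0) end = next;       // stop where the next label begins
        }
        return line.Substring(start, end - start).Trim();
    }
}
```

For the sample line, ValueAfter(line, "Date:", "Test Form:") gives the date and ValueAfter(line, "Test Form:", null) gives the form. Trim() takes care of the padding between the columns.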


 
LVL 3

Expert Comment

by:_Gerry_
ID: 24847201
That will do nicely and should work fine.
 
...or you could get clever with .Split() and LINQ   :)

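        using System;
        using System.Linq;   // needed for the query expression and Count()

        class Program
        {
            static void Main(string[] args)
            {
                string text = " Date: 6/7/2009                        Test Form: A23";
                Console.WriteLine("Original text: '{0}'", text);

                // Split on spaces, colons and tabs, then drop the empty
                // entries produced by runs of consecutive separators.
                var words = from w in text.Split(' ', ':', '\t') where w != string.Empty select w;

                Console.WriteLine("{0} words in text:", words.Count());

                foreach (string s in words)
                {
                    Console.WriteLine(s);
                }
                Console.ReadLine();   // keep the console window open
            }
        }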


Author Comment

by:akohan
ID: 24847310


Thanks for your comment. I don't know anything about Linq since I'm new to C# but will check it out for sure.

I will get back to you.

Thanks!
 
LVL 23

Expert Comment

by:Tony McCreath
ID: 24849466
For LINQ you need C# 3.0 (which ships with .NET Framework 3.5).

Another way is to use regular expressions (the Regex class in System.Text.RegularExpressions).

What's important is that you define the syntax rules. Here's a sequential set of rules that may make sense:

Any character except ":" = first field name
":"
Any character except tab (\t) = first value
"\t"
Any character except ":" = second field name
":"
All other characters

This could be defined in regex like the following. It also strips the surrounding spaces and names the four capture groups name1, value1, name2 and value2:

 ^ *(?<name1>[^:]*?) *: *(?<value1>[^\t]*?) *\t *(?<name2>[^:]*?) *: *(?<value2>.*?) *$
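
Here is a rough sketch of how a pattern along those lines could be used from C#. The class name TwoFieldLine is made up, and the pattern here uses lazy quantifiers and ^/$ anchors so that padding does not end up inside the groups. It assumes the two fields really are separated by a tab, as the rules above state; if your file uses runs of spaces instead, the \t parts would need changing.

```csharp
using System;
using System.Text.RegularExpressions;

static class TwoFieldLine
{
    // Lazy quantifiers plus the end anchor keep padding out of the groups.
    static readonly Regex Pattern = new Regex(
        @"^ *(?<name1>[^:]*?) *: *(?<value1>[^\t]*?) *\t *(?<name2>[^:]*?) *: *(?<value2>.*?) *$");

    // Returns { name1, value1, name2, value2 }, or null when the line does not match.
    public static string[] Parse(string line)
    {
        Match m = Pattern.Match(line);
        if (!m.Success) return null;
        return new[]
        {
            m.Groups["name1"].Value, m.Groups["value1"].Value,
            m.Groups["name2"].Value, m.Groups["value2"].Value
        };
    }
}
```

Checking for null before using the result is worthwhile, since blank lines or lines without a tab will not match.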

 

Author Comment

by:akohan
ID: 24854962

Hello,

Thanks for your comments. I have attached the format I am receiving (after converting a specific file to text). Is the above method still good for it or should I change my approach?

In the following example I need to extract:
William Smith
5/2/2008
55 (which is the exam score)
1025409804
1025409804

Once again thanks.

Regards,


Header file
Name of visitor: William Smith Signin Date: 5/2/2008 Position Applied For: Driver Number Correct (exam score): 55 Percentile Total (%total): 78 Median Score for Position: 52 Applicant ID: 1025409804 Test Form: ZipCode 1x20558 Job Code: 000001 Age Adjusted Score: 52 Equiv: 117 Suggested Hiring Range: 19 - 44

 

Author Comment

by:akohan
ID: 24854973

I just found out (based on the Help dialog) that I'm using .NET 3.5 SP1,

so I guess I can use LINQ, right?

 
LVL 3

Expert Comment

by:_Gerry_
ID: 24855770
Yup, or regular expressions, or plain old IndexOf/Substring... that's why programming is so fun.
Your attached example is a bit tricky.... trying to separate the "William Smith" from "Signin" etc.  
Your code needs to know the full label names, which makes it awkward for the otherwise excellent regular expression approach but easy for IndexOf/Substring, and perhaps for the rather terse LINQ example I posted earlier.

In the interest of the principle of KISS (except I'm sure you're not stupid :-) maybe IndexOf/Substring is the best approach after all !
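
A sketch of what the IndexOf/Substring route might look like for the header line, assuming the label names are fixed and appear in the order shown in the attached sample. HeaderParser and its members are illustrative names, not anything from the original code:

```csharp
using System;
using System.Collections.Generic;

static class HeaderParser
{
    // Labels in the order they appear in the sample header line.
    static readonly string[] Labels =
    {
        "Name of visitor:", "Signin Date:", "Position Applied For:",
        "Number Correct (exam score):", "Percentile Total (%total):",
        "Median Score for Position:", "Applicant ID:", "Test Form:",
        "Job Code:", "Age Adjusted Score:", "Equiv:", "Suggested Hiring Range:"
    };

    // Maps each label (without its colon) to the trimmed text that follows it.
    public static Dictionary<string, string> Parse(string line)
    {
        var fields = new Dictionary<string, string>();
        int searchFrom = 0;
        for (int i = 0; i < Labels.Length; i++)
        {
            int labelAt = line.IndexOf(Labels[i], searchFrom, StringComparison.Ordinal);
            if (labelAt < 0) continue;       // label missing: skip it
            int valueStart = labelAt + Labels[i].Length;

            // The value ends where the next label that is present begins.
            int valueEnd = line.Length;
            for (int j = i + 1; j < Labels.Length; j++)
            {
                int nextAt = line.IndexOf(Labels[j], valueStart, StringComparison.Ordinal);
                if (nextAt >= 0) { valueEnd = nextAt; break; }
            }

            fields[Labels[i].TrimEnd(':')] =
                line.Substring(valueStart, valueEnd - valueStart).Trim();
            searchFrom = valueStart;
        }
        return fields;
    }
}
```

Searching for the label including its colon avoids false hits on words like "Position" or "Score" that occur inside other labels.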
 
LVL 23

Expert Comment

by:Tony McCreath
ID: 24857667
Will the field headings always be the same?

Name of visitor
Signin Date
Position Applied For
Number Correct (exam score)
Percentile Total (%total)
Median Score for Position
Applicant ID
Test Form
Job Code
Age Adjusted Score
Equiv
Suggested Hiring Range

This would make life easier.

I don't have time now. I'll try and put a Regex script together later.
 

Author Comment

by:akohan
ID: 24862727

Yes, it will be always like that.

Thanks.
 
LVL 23

Expert Comment

by:Tony McCreath
ID: 24867876
I've just noticed your latest example is different to the original example.

Will all the fields ALWAYS be present and in the same order?
Will a single entry cover multiple lines?
Are you talking about one entry per file?
Is "Header file" part of the data to parse?

 

Author Comment

by:akohan
ID: 24897072

Hi Tiggerito,

Yes, please go by the last example, since that is the format I am generating.

 
LVL 23

Expert Comment

by:Tony McCreath
ID: 24902641
Here's a simple Regex to gather the data, provided it is laid out exactly as in your example.

It is basically a copy of the example you provided, with some alterations:

Any regex-sensitive characters have been escaped. That is, ( and ) were changed to \( and \)

The values have been replaced by capture sequences of the following form:

(?<fieldvalue>.*)

In each case 'fieldvalue' is changed to the name of the field. This says: capture any number of characters into a group called 'fieldvalue'.

Name of visitor: (?<name>.*) Signin Date: (?<date>.*) Position Applied For: (?<position>.*) Number Correct \(exam score\): (?<score>.*) Percentile Total \(%total\): (?<total>.*) Median Score for Position: (?<median>.*) Applicant ID: (?<id>.*) Test Form: (?<form>.*) Job Code: (?<job>.*) Age Adjusted Score: (?<agescore>.*) Equiv: (?<equiv>.*) Suggested Hiring Range: (?<range>.*)
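
As a rough illustration, the pattern above could be wrapped like this in C#. HeaderRegex and Get are made-up names; the pattern string is the one posted, only split across lines for readability. It relies on every label appearing exactly once, in this order, separated by single spaces.

```csharp
using System;
using System.Text.RegularExpressions;

static class HeaderRegex
{
    // The pattern as posted, concatenated from shorter pieces.
    static readonly Regex Pattern = new Regex(
        @"Name of visitor: (?<name>.*) Signin Date: (?<date>.*) " +
        @"Position Applied For: (?<position>.*) " +
        @"Number Correct \(exam score\): (?<score>.*) " +
        @"Percentile Total \(%total\): (?<total>.*) " +
        @"Median Score for Position: (?<median>.*) " +
        @"Applicant ID: (?<id>.*) Test Form: (?<form>.*) " +
        @"Job Code: (?<job>.*) Age Adjusted Score: (?<agescore>.*) " +
        @"Equiv: (?<equiv>.*) Suggested Hiring Range: (?<range>.*)");

    // Returns the text captured for `group`, or null if the line does not match.
    public static string Get(string line, string group)
    {
        Match m = Pattern.Match(line);
        return m.Success ? m.Groups[group].Value : null;
    }
}
```

The four values asked for earlier would then come from the groups "name", "date", "score" and "id". They come back as strings, so the score and date still need converting (for example with int.Parse and DateTime.Parse) if you want typed values.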
