Solved

Regular Expressions - extracting data from files

Posted on 2008-10-25
Last Modified: 2010-04-21
Hello!

I've been learning Regular Expressions in preparation for an upcoming project, and need a little advice on how to proceed to the next step.

I have around 1,100 reports that I need to extract various data from. The reports are in .wpd format (version 12), and I've converted a test group of 150 files to .txt format.

I've created and tested several RegEx strings, and they all seem to work well.

Now, how do I 'use' these regex strings to parse the 1,100 text files, extract the data, and push that data into either-
   -individual text files (1 for each corresponding report) for further refinement
   -a single .txt, .csv or other format

I have a variety of tools at my disposal (vb.net/visual source.net 2003, MS Office 2003, parse-o-matic, textpad, etc.), and am not against purchasing, acquiring, or learning other software to help with this task (should the .net learning curve be too overwhelming).

Should I jump into coding a solution? If so, what platform or tools would be appropriate?

Is there a more complete 'packaged' solution available for this task? What are your experiences and preferences regarding these?

Thanks again for your help!

GreggB.
Question by:montarch
LVL 27

Expert Comment

by:ddrudik
ID: 22803984
VB.NET would be a fine candidate for such a project.  Please provide a text sample, specify what data you want to pull from that text sample, and what you want to do with that data pulled.
 
LVL 27

Expert Comment

by:ddrudik
ID: 22804242
Also, you should explain more about the regex patterns and how they relate to the files.  For example, if you have 1 regex pattern per file, would that same regex pattern properly match any other files in your set?  How are the files named in your exported file set?
 
LVL 5

Expert Comment

by:PaulKeating
ID: 22804611
This really looks to me like a command-line application. Consider doing it in Python (www.python.org). The language is free and has a shallow learning curve. A working prototype of the program you want (reading from standard input and writing to standard output) is short and easy to understand without even reading the tutorial:

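import re
import sys

# Compile the pattern once; match() tests it against the start of each line
myregex = re.compile('^This is a test line$')

for line in sys.stdin:
    if myregex.match(line):
        # print() with parentheses runs under both Python 2 and Python 3
        print(line.rstrip())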
 
LVL 51

Expert Comment

by:ahoffmann
ID: 22804741
> .. looks to me like a command-line application.
I'll Cc that:

egrep 'your-regex here' your-1100-files-here > single.txt

the result is as you requested:
>    -into a single .txt, .csv or other format

 
LVL 16

Expert Comment

by:Bryan Butler
ID: 22805594
I believe egrep, or some kind of stream editor/text manipulation language (sed/awk), would be the quickest.  Perl has always been the best text-handling language in my book, but the tools above would probably be the quickest way to go.  Python is an easy language to pick up, and if you want to do some specific things, such as date manipulation or other text transformation, then a high-level language such as this, or one with a good plugin/library for handling these text changes, would be more appropriate.  I've been starting to use PowerShell for some text handling, and it ties in all of the MS/.NET technologies.  Does any of that help?
 

Author Comment

by:montarch
ID: 22805800
Thanks for the responses, folks.
-DDrudik:
To answer your questions as best I can (though the explanation is a little long):
The report files are Archaeological Survey Project Reports.
1 File = 1 Report = 1 Project.

I want to-
- collect 3 pieces of data from every report
- extract those 3 pieces of data to fields in an output file called "projects.csv"

At a later date, the data in this output file (projects.csv) will be imported into our existing  'Projects Database'.

Examples and Samples of the 3 pieces of data to collect, extract, output-
-I realize that while my RegExps work, they need a lot of refining. I'll worry about that later.-
---------------------------------------------------
Data #1- Permit Number
Report sample:      U-03-MQ-123a
RegEx to match:      U-[0-9][0-9]-MQ-\d{1,4}.?(\w*.?)*
To fld in .CSV:      permitnumber
---------------------------------------------------
Data #2- Project County
Report sample:      LOS GENERIC COUNTY, UTAH
RegEx to match:      \w* ?\w+ ?county,( ?| +)Utah
To fld in .CSV:      projectcounty
---------------------------------------------------
Data #3- Project Number
Report sample:      MYCO Report No. 03-100
RegEx to match:      (Report No.).?\d\d-\d\d\d?
To fld in .CSV:      projectnumber
---------------------------------------------------
I hope this is enough information. Let me know otherwise.
I'm also going to look into the Python approach.
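
Since the Python approach came up, here is a minimal sketch (not from the thread) of checking the three patterns against their sample strings before running them over the real reports. The patterns and samples are copied from the post above; the helper name check and the CASES list are assumptions made for illustration.

```python
import re

# Patterns and sample strings copied from the post above.
CASES = [
    ("permitnumber", r"U-[0-9][0-9]-MQ-\d{1,4}.?(\w*.?)*", "U-03-MQ-123a"),
    ("projectcounty", r"\w* ?\w+ ?county,( ?| +)Utah", "LOS GENERIC COUNTY, UTAH"),
    ("projectnumber", r"(Report No.).?\d\d-\d\d\d?", "MYCO Report No. 03-100"),
]


def check(pattern, sample, flags=0):
    """Return the matched text, or None if the pattern does not match.

    check is a hypothetical helper name used only in this sketch.
    """
    m = re.search(pattern, sample, flags)
    return m.group(0) if m else None


if __name__ == "__main__":
    for field, pattern, sample in CASES:
        # Show the case-sensitive and case-insensitive results side by side
        print(field, check(pattern, sample), check(pattern, sample, re.IGNORECASE))
```

Two things a check like this brings out: the county pattern as written only matches the upper-case sample when case is ignored, and the project number match includes the "Report No." label along with the number.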

Thanks!
GreggB.
 
LVL 16

Expert Comment

by:Bryan Butler
ID: 22805877
Sounds like just a quick solution using whatever regex language you feel comfortable with.  If you're on unix, then go with just a quick 3 line script.  But it sounds like you're on Windows, and if you want to use what you've got instead of installing an IDE/compiler and learning a language, you could use WSH/PowerShell.  These are basically the 'DOS' part of Windows, only they are like DOS on steroids.  Here is a PowerShell script I used in another answer that could be modified to do it.  It was splitting an SQL string by reading one line at a time and sending the output to a file.  You would not want to use the "Replace", but rather the "search".

$aryCDSs = Get-Content ".\test1.txt"
$iter = 0
do {
    [regex]::Replace($aryCDSs[$iter], "Update", "`n`$0") >> test2.txt

    $iter++
} until ($iter -eq $aryCDSs.length)
 
LVL 16

Expert Comment

by:Bryan Butler
ID: 22805891
And have 3 lines as in:

$aryCDSs = Get-Content ".\test1.txt"
$iter = 0
do {
    # Match returns the matched text (an empty string when nothing matches)
    $f1 = [regex]::Match($aryCDSs[$iter], "<regex1>").Value
    $f2 = [regex]::Match($aryCDSs[$iter], "<regex2>").Value
    $f3 = [regex]::Match($aryCDSs[$iter], "<regex3>").Value
    # Join the three fields with commas; >> appends one line per iteration
    "$f1,$f2,$f3" >> test2.txt

    $iter++
} until ($iter -eq $aryCDSs.length)

Does that help?  If you have something like a mainframe/EBCDIC file with packed decimals or something, then that's a whole different story ;)
 
LVL 27

Accepted Solution

by:
ddrudik earned 500 total points
ID: 22806164
It all depends on your source text, but the following is VB.NET command-line project code that worked with the source text you provided.  I named the file projects.csv but I delimited the file with | instead of , since you have , in one of your columns (the county).  You could choose to delimit in , and enclose that column in " if you prefer.
Imports System.Text.RegularExpressions
Imports System.IO

Module Module1
    Sub Main()
        Try
            ' Start from a clean output file on every run
            If File.Exists("projects.csv") Then
                File.Delete("projects.csv")
                Console.WriteLine("'projects.csv' file found, deleted file.")
            End If
            ' Declaring the loop variable's type keeps this valid without type inference
            For Each datafile As String In Directory.GetFiles("c:\datafiles")
                ' Read the whole report into memory
                Dim sr As StreamReader = New StreamReader(datafile)
                Dim filetext As String = sr.ReadToEnd()
                sr.Close()
                Console.WriteLine("Processing: " & datafile)
                Dim repermitnumber As Regex = New Regex("U-\d{2}-MQ-\d{1,4}[a-z]*")
                Dim reprojectcounty As Regex = New Regex("\w* ?\w+ ?county, *Utah", RegexOptions.IgnoreCase)
                ' The lookbehind keeps "Report No." out of the matched value
                Dim rereportnumber As Regex = New Regex("(?<=Report No\. ?)\d\d-\d\d\d?")
                ' Take the first match of each pattern in the report
                Dim mpermitnumber As Match = repermitnumber.Match(filetext)
                Dim mprojectcounty As Match = reprojectcounty.Match(filetext)
                Dim mreportnumber As Match = rereportnumber.Match(filetext)
                ' Append one |-delimited line per report
                Dim sw As StreamWriter = New StreamWriter("projects.csv", True)
                Dim dataline As String = mpermitnumber.Groups(0).Value & "|" & mprojectcounty.Groups(0).Value & "|" & mreportnumber.Groups(0).Value
                Console.WriteLine("   Writing: " & dataline)
                sw.WriteLine(dataline)
                sw.Close()
            Next
        Catch E As Exception
            Console.WriteLine("An error was encountered:")
            Console.WriteLine(E.Message)
        End Try
    End Sub
End Module
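
For readers following the Python suggestion earlier in the thread, the same steps could be sketched roughly as below. This is an illustration under stated assumptions, not the accepted solution: the function names (extract_fields, rows_to_csv, process_folder) and the header row are hypothetical, the patterns are the ones from the VB.NET code, and the report number uses a capture group because Python's re module only supports fixed-width lookbehind. It also takes the "delimit with , and quote the column" route by letting the csv module do the quoting.

```python
import csv
import io
import re
from pathlib import Path

# Patterns taken from the VB.NET code above; REPORT uses a capture group
# in place of the variable-width lookbehind, which Python's re rejects.
PERMIT = re.compile(r"U-\d{2}-MQ-\d{1,4}[a-z]*")
COUNTY = re.compile(r"\w* ?\w+ ?county, *Utah", re.IGNORECASE)
REPORT = re.compile(r"Report No\. ?(\d\d-\d\d\d?)")


def extract_fields(text):
    """Return [permit, county, report number]; '' where nothing matches."""
    permit = PERMIT.search(text)
    county = COUNTY.search(text)
    report = REPORT.search(text)
    return [
        permit.group(0) if permit else "",
        county.group(0) if county else "",
        report.group(1) if report else "",
    ]


def rows_to_csv(rows):
    """Render rows as CSV text; fields containing commas get quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Header names follow the field names given earlier in the thread
    writer.writerow(["permitnumber", "projectcounty", "projectnumber"])
    writer.writerows(rows)
    return buffer.getvalue()


def process_folder(folder, out_path):
    """Write one CSV row per .txt file in folder (both paths are caller-supplied)."""
    files = sorted(Path(folder).glob("*.txt"))
    rows = [extract_fields(p.read_text(errors="replace")) for p in files]
    Path(out_path).write_text(rows_to_csv(rows))


if __name__ == "__main__":
    sample = "MYCO Report No. 03-100\nLOS GENERIC COUNTY, UTAH\nPermit U-03-MQ-123a\n"
    print(rows_to_csv([extract_fields(sample)]))
```

Because the county value contains a comma, the csv writer wraps that field in double quotes, so a comma-delimited file should still import cleanly.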

 

Author Closing Comment

by:montarch
ID: 31512326
Thanks for everyone's help, and sorry for not getting back sooner.
I've spent the last week working over the various solutions provided, based on the tools that I have (vb.net 2003, windows powershell), and the vb.net solution worked out the best for me. I did enjoy diving into powershell, however, and plan on using it in other tasks. Powershell just didn't give me the results that vb.net did. Thanks all, and thanks ddrudik- I have more questions about this project, but I'll put those in another post.
 
LVL 27

Expert Comment

by:ddrudik
ID: 22858907
Thanks for the question and the points.