Solved

Regular Expression beginner search text files for a string

Posted on 2010-11-09
Last Modified: 2012-06-21
I'm a regular expression beginner.
I've got around 250 text files I want to examine.
I want to extract the text between two labels.
The labels will exist in each text file.
The text between the two labels may be a few characters or it may be a few lines of text.
1. What is the simplest application to run the regular expression from? (I initially thought of Excel VBA.)
2. What would the actual regular expression be?

Thanks for any advice.
Question by:mike99c
15 Comments
 
LVL 2

Expert Comment

by:Pacane
ID: 34095082
Answer to 1. Linux shell, grep
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34095128
1) UltraEdit supports searching a list of files and it has regular expression support as well. It is a paid application, but it has a free trial.

2) That depends on what the labels are and which editor you decide to go with. You may be able to use the following:
(?<=beginning label or phrase).*?(?=ending label or phrase)
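
As a rough illustration of how that lookbehind/lookahead pattern behaves, here is a minimal Python sketch. The helper name `between` and the sample string are made up for this example, and it only covers text that sits on a single line (no "dot matches newline" option is set). Python's `re` module requires a fixed-width lookbehind, which a literal label satisfies.

```python
import re

def between(text, start_label, end_label):
    """Return every span of text found between start_label and end_label."""
    # re.escape keeps punctuation in the labels from being read as regex syntax.
    pattern = re.compile(
        "(?<=" + re.escape(start_label) + ").*?(?=" + re.escape(end_label) + ")"
    )
    return pattern.findall(text)

print(between("name: [alpha] size: [beta]", "[", "]"))  # ['alpha', 'beta']
```

Because the labels sit inside lookarounds, they are not part of the match, so only the text between them is returned.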

 
LVL 35

Expert Comment

by:Terry Woods
ID: 34096963
Note that you'll need to enable the option that lets the . character match newlines; otherwise you won't pick up the cases where the text spans multiple lines. Usually there's a checkbox for that option near where you enter the regular expression. In Perl-compatible regexes the option is "s".
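
To make the effect of that option concrete, here is a small Python sketch. The labels START and END and the sample text are invented for the illustration. In Python the option is spelled `re.DOTALL` (or `re.S`).

```python
import re

sample = "START first line\nsecond line END"

# Without the flag, "." stops at the newline, so nothing matches.
without_flag = re.findall(r"START(.*?)END", sample)
# With re.DOTALL, "." also matches "\n", so the match can span lines.
with_flag = re.findall(r"START(.*?)END", sample, re.DOTALL)

print(without_flag)  # []
print(with_flag)     # [' first line\nsecond line ']
```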
 
LVL 13

Accepted Solution

by:
Superdave earned 250 total points
ID: 34097075
I'd use Python because it's easy to learn and handles standard Perl-compatible regular expressions; in that case the attached program will work.  Note that the angle brackets are part of my example labels and have no special meaning.
I don't know what VBA offers for regular expressions, so I can't compare the two.

Pacane: grep won't handle multiple lines, and there's no such thing as a Linux shell.



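#!/usr/bin/env python3
import re

# Read the whole file into one string (text mode)
with open(r'filename.txt', 'r') as f:
    text = f.read()
# re.S lets "." match newlines, so the text between the labels may span lines
pat = re.compile(r'<label1>(.*?)<label2>', re.S)
t = pat.findall(text)
print(repr(t))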

 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34097227
>>  Note that you'll need the option active for the . character to match newlines

Such a beginner's mistake. I'm not worthy  ;)
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34097233
P.S.

Good call Terry!
 

Author Comment

by:mike99c
ID: 34105825
Hi SuperDave

Got your script to work, thanks
All that's left is
~ to search through all files *.txt (ok to have subdirectories?)
~ can I put the individual filename that the text is coming from at the beginning of each line of the outputted text?

Thanks again.
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34107089
Here's a modification to SuperDave's script that adds the two new requirements:
for dirpath, dirnames, filenames in os.walk("C:\\ee"):
	for f in filenames:
		if f.endswith(".txt"):
			fullfile = os.path.join(dirpath, f)
			with open(fullfile, 'r') as fh:
				text = fh.read()
			t = pat.findall(text)  # pat is the compiled pattern from SuperDave's script
			if t:
				print(repr(fullfile) + ":" + repr(t))

 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 34107096
Sorry, full script attached:
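#!/usr/bin/env python3
import re
import os

# Same pattern as SuperDave's script; re.S lets "." match newlines
pat = re.compile(r'<label1>(.*?)<label2>', re.S)

for dirpath, dirnames, filenames in os.walk("C:\\ee"):
	for f in filenames:
		if f.endswith(".txt"):
			fullfile = os.path.join(dirpath, f)
			with open(fullfile, 'r') as fh:
				text = fh.read()
			t = pat.findall(text)
			if t:
				print(repr(fullfile) + ":" + repr(t))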
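
If the goal is one output line per match with the filename at the start, a variation along these lines might help. This is only a sketch: the function name `extract_lines` is made up, `<label1>` and `<label2>` are the same placeholder labels as above, and it assumes the files can be read as text (undecodable bytes are replaced rather than raising an error).

```python
import re
from pathlib import Path

def extract_lines(root, start_label, end_label):
    """Yield one 'path: text' string per match found in *.txt files under root."""
    pattern = re.compile(
        re.escape(start_label) + "(.*?)" + re.escape(end_label), re.DOTALL
    )
    # rglob descends into subdirectories as well as the top-level folder.
    for path in sorted(Path(root).rglob("*.txt")):
        text = path.read_text(errors="replace")
        for match in pattern.findall(text):
            # Collapse line breaks and runs of spaces so each match stays on one line.
            yield f"{path}: {' '.join(match.split())}"
```

Looping over `extract_lines("C:\\ee", "<label1>", "<label2>")` and printing each item would then give lines of the form `C:\ee\sub\file.txt: extracted text`.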

 
LVL 4

Expert Comment

by:BrainB
ID: 34107794
I can do Excel VBA if you are interested. Useful for data in worksheets & text files.
 

Author Comment

by:mike99c
ID: 34109608
Would also be interested in doing this in Excel VBA
 
LVL 3

Expert Comment

by:richard_crist
ID: 34114454
A Windows-based product I have used for many years that strongly supports regular expressions is FileLocator Pro.  You can find it at http://www.mythicsoft.com.  It supports regular expressions in the data content search, filename specification, and path names.  It lets you start with beginner mode and switch to advanced mode in the interface, and includes an expression builder.  I have been using it for about 5 years and it keeps getting better.
There is a free lite version of the product called Agent Ransack, which you can also download there.
 
LVL 4

Assisted Solution

by:BrainB
BrainB earned 250 total points
ID: 34131206
Here is the Excel version. The code goes into a normal code module. The data is put into the currently active worksheet. It handles multi-line tags OK. The process strips away everything but the tag contents, including non-printing characters (e.g. end-of-line breaks). Please note the requirement to add a reference to "Microsoft VBScript Regular Expressions" in the Visual Basic Editor, found in the menu Tools\References.

If you have problems it will be helpful if you upload one of your text files because there are so many variations possible.
'=============================================================================
'- EXCEL : OPEN ALL TEXT FILES IN A FOLDER & EXTRACT REQUIRED DATA
'- Needs VB Editor Tools\Reference "Microsoft VBScript Regular Expressions xx"
'- * Change MyFolder variable to required path
'- * Change MyTag variable to the required tag (just the tag name, no < or /..>)
'- Puts file name and text into the currently active worksheet
'- Brian Baulsom November 2010
'=============================================================================
Dim MyFolder As String
Dim MyTag As String
Dim MyFile As String
Dim FileCount As Integer
Dim FullName As String
Dim MyRegExp As Object
Dim MyPattern As String
Dim MyMatches As Variant    ' Reg Exp extract set
Dim MatchCount As Integer
Dim MatchItem As String
Dim FileString As String    ' whole text file
'------------------------
Dim ws As Worksheet
Dim ToRow As Long
'-------------------------------------------------------------------------
'=============================================================================
'- MAIN ROUTINE
'=============================================================================
Sub TEXTFILE_PROCESS()
    '=========================================================================
    '- ***********  CHANGE VARIABLES ****************************************
    MyFolder = "F:\Test\"       ' nb. final backslash
    MyTag = "script"            ' tag to look for
    '=========================================================================
    Application.Calculation = xlCalculationManual
    Set ws = ActiveSheet
    With ws
        .Cells.ClearContents
        .Range("A1").Value = " Tag = " & MyTag
        .Columns("B:B").WrapText = True
    End With
    ToRow = 2
    Set MyRegExp = CreateObject("VbScript.RegExp")
    MyPattern = "<" & MyTag & "(\n|.)*?/" & MyTag & ">" '
    'MyPattern = "<script(\n|.)*?/script>"
    FileCount = 0
    '-------------------------------------------------------------------------
    '- GET FILES
    MyFile = Dir(MyFolder & "*.txt")   ' text files
    '- LOOP through files in folder
    Do While MyFile <> ""
        FullName = MyFolder & MyFile
        FileCount = FileCount + 1
        Application.StatusBar = FileCount
        '-------------------------------------------------------------------------
        '- READ THE FILE INTO MEMORY AND CLOSE IT
        Open FullName For Input As #1
            FileString = Input(FileLen(FullName), #1)
        Close #1
        '------------------------------------------------------------------------
        GET_DATA        ' SUBROUTINE
        '----------------------------------------------------------------------
        '- NEXT FILE
        MyFile = Dir   ' Get next file
    Loop
    '------------------------------------------------------------------------
    ws.Range("A1:A" & ToRow).EntireRow.AutoFit
    MsgBox ("Processed " & FileCount & " file(s).")
    Application.Calculation = xlCalculationAutomatic
    Application.StatusBar = False
End Sub
'=============================================================================

'=============================================================================
'- EXTRACT DATA AND ADD TO WORKSHEET
'- Removes non-printing characters   eg. end of line etc.
'=============================================================================
Private Sub GET_DATA()
    With MyRegExp
        .Global = True
        .ignorecase = True
        .MultiLine = True
        .Pattern = MyPattern
        Set MyMatches = .Execute(FileString)
    End With
    '-----------------------------------------------------------------------
    '- GET MATCHES
    MatchCount = MyMatches.Count
    If MatchCount > 0 Then
        For m = 0 To MatchCount - 1
            MatchItem = Application.WorksheetFunction.Clean(MyMatches(m))
            MatchItem = Replace(MatchItem, "<" & MyTag, "", 1, -1, vbTextCompare)
            MatchItem = Replace(MatchItem, "</" & MyTag & ">", "", 1, -1, vbTextCompare)
            MatchItem = Replace(MatchItem, ">", "", 1, -1, vbTextCompare)
            ws.Cells(ToRow, 1).Value = MyFile
            ws.Cells(ToRow, 2).Value = MatchItem
            ToRow = ToRow + 1
        Next
    End If
End Sub
'-------------------------------------------------------------------------------

 

Author Comment

by:mike99c
ID: 34131611
Thanks for the response, here is an example

;       Addressed  : Act_5015 159/1, Dyn_156/1S, Dyn_557/1, Dyn_18/1
;                    Act_5025/1, Act_9158/1
;
;       Identity   : File01.TXT

I want to capture the text between Addressed and Identity
 
LVL 4

Expert Comment

by:BrainB
ID: 34131953
That was relatively painless ..................
'=============================================================================
'- VERSION 2
'- EXCEL : OPEN ALL TEXT FILES IN A FOLDER & EXTRACT REQUIRED DATA
'- Needs VB Editor Tools\Reference "Microsoft VBScript Regular Expressions xx"
'- **** Change MyFolder variable below to required path
'- Puts file name and text into the currently active worksheet
'- Brian Baulsom November 2010
'=============================================================================
Dim MyFolder As String
Dim MyFile As String
Dim FileCount As Integer
Dim FullName As String
Dim MyRegExp As Object
Dim MyPattern As String
Dim MyMatches As Variant    ' Reg Exp extract set
Dim MatchCount As Integer
Dim MatchItem As String
Dim FileString As String    ' whole text file
'------------------------
Dim ws As Worksheet
Dim ToRow As Long
'-------------------------------------------------------------------------
'=============================================================================
'- VERSION 2 : MAIN ROUTINE
'=============================================================================
Sub TEXTFILE_PROCESS()
    '=========================================================================
    '- ***********  CHANGE VARIABLES ****************************************
    MyFolder = "F:\Test\"       ' nb. final backslash
    '=========================================================================
    Application.Calculation = xlCalculationManual
    Set ws = ActiveSheet
    With ws
        .Cells.ClearContents
        .Range("A1").Value = "Text between Addressed and Identity"   ' MyTag is not used in version 2
        .Columns("B:B").WrapText = True
    End With
    ToRow = 2
    Set MyRegExp = CreateObject("VbScript.RegExp")
    MyPattern = "Addressed(\n|.)*?Identity"
    FileCount = 0
    '-------------------------------------------------------------------------
    '- GET FILES
    MyFile = Dir(MyFolder & "*.txt")   ' text files
    '- LOOP through files in folder
    Do While MyFile <> ""
        FullName = MyFolder & MyFile
        FileCount = FileCount + 1
        Application.StatusBar = FileCount
        '-------------------------------------------------------------------------
        '- READ THE FILE INTO MEMORY AND CLOSE IT
        Open FullName For Input As #1
            FileString = Input(FileLen(FullName), #1)
        Close #1
        '------------------------------------------------------------------------
        GET_DATA        ' SUBROUTINE
        '----------------------------------------------------------------------
        '- NEXT FILE
        MyFile = Dir   ' Get next file
    Loop
    '------------------------------------------------------------------------
    ws.Range("A1:A" & ToRow).EntireRow.AutoFit
    MsgBox ("Processed " & FileCount & " file(s).")
    Application.Calculation = xlCalculationAutomatic
    Application.StatusBar = False
End Sub
'=============================================================================

'=============================================================================
'- EXTRACT DATA AND ADD TO WORKSHEET
'- Removes non-printing characters   eg. end of line etc.
'=============================================================================
Private Sub GET_DATA()
    With MyRegExp
        .Global = True
        .ignorecase = True
        .MultiLine = True
        .Pattern = MyPattern
        Set MyMatches = .Execute(FileString)
    End With
    '-----------------------------------------------------------------------
    '- GET MATCHES
    MatchCount = MyMatches.Count
    If MatchCount > 0 Then
        For m = 0 To MatchCount - 1
            MatchItem = Application.WorksheetFunction.Clean(MyMatches(m))
            MatchItem = Replace(MatchItem, "Addressed  :", "", 1, -1, vbTextCompare)
            MatchItem = Replace(MatchItem, "Identity", "", 1, -1, vbTextCompare)
            MatchItem = Replace(MatchItem, ";;", "", 1, -1, vbTextCompare)
            MatchItem = Trim(MatchItem)
            ws.Cells(ToRow, 1).Value = MyFile
            ws.Cells(ToRow, 2).Value = MatchItem
            ToRow = ToRow + 1
        Next
    End If
End Sub
'-------------------------------------------------------------------------------
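
For comparison, the same Addressed/Identity extraction can be sketched in Python. This is not part of the VBA solution above. The function name `addressed_block` is made up, and the sample is the excerpt posted earlier in the thread. It assumes every line in the block starts with a `;` comment marker, as in that excerpt.

```python
import re

sample = """\
;       Addressed  : Act_5015 159/1, Dyn_156/1S, Dyn_557/1, Dyn_18/1
;                    Act_5025/1, Act_9158/1
;
;       Identity   : File01.TXT
"""

def addressed_block(text):
    """Return the text between 'Addressed :' and 'Identity', one string per block."""
    results = []
    for block in re.findall(r"Addressed\s*:(.*?)Identity", text, re.DOTALL):
        # Drop the leading ';' marker and padding from each line, then join the pieces.
        parts = [line.strip().lstrip(";").strip() for line in block.splitlines()]
        results.append(" ".join(p for p in parts if p))
    return results

print(addressed_block(sample))
# ['Act_5015 159/1, Dyn_156/1S, Dyn_557/1, Dyn_18/1 Act_5025/1, Act_9158/1']
```

Using a capture group for the text between the labels avoids the separate `Replace` calls that strip the labels afterwards.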
