mike99c

Regular Expression beginner search text files for a string

I'm a regular expression beginner.
I've got around 250 text files I want to examine.
I want to extract the text between two labels.
The labels will exist in each text file.
The text between the two labels may be a few characters or it may be a few lines of text.
1. What is the simplest application to run the regular expression from? (I initially thought of Excel with VBA.)
2. What would the actual regular expression be?

Thanks for any advice.
Pacane

Answer to 1. Linux shell, grep
kaufmed

1) UltraEdit supports searching a list of files, and it has regular expression support as well. It is a paid application, but it has a free trial.

2) That depends on what the labels are and which editor you decide to go with. You may be able to use the following:
(?<=beginning label or phrase).*?(?=ending label or phrase)

Note that you'll need to enable the option that makes the . character match newlines; otherwise you won't pick up the cases where the text spans multiple lines. Usually there's a checkbox for that option near where you enter the regular expression. In Perl-compatible regexes the option is "s".
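
As a rough illustration of the pattern and the newline option described above, here is a minimal Python sketch. The labels START and END and the sample text are made up for the example; in Python the "s" option is spelled re.DOTALL.

```python
import re

# Hypothetical labels and sample text, for illustration only.
BEGIN_LABEL = "START"
END_LABEL = "END"
sample = "header START first line\nsecond line END footer"

# re.escape() protects any regex metacharacters that the labels might contain.
pattern = r"(?<=" + re.escape(BEGIN_LABEL) + r").*?(?=" + re.escape(END_LABEL) + r")"

# Without DOTALL, "." stops at the newline, so the multi-line span is missed.
without_dotall = re.findall(pattern, sample)

# With DOTALL (the "s" option), "." also matches newlines.
with_dotall = re.findall(pattern, sample, re.DOTALL)

print(without_dotall)
print(with_dotall)
```

The first list comes back empty and the second holds the text between the two labels, including the line break.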
ASKER CERTIFIED SOLUTION
Superdave

>>  Note that you'll need the option active for the . character to match newlines

Such a beginner's mistake. I'm not worthy  ;)
P.S.

Good call Terry!
mike99c

ASKER

Hi SuperDave

Got your script to work, thanks
All that's left is
~ to search through all files *.txt (ok to have subdirectories?)
~ can I put the individual filename that the text is coming from at the beginning of each line of the outputted text?

Thanks again.
Here's a modification to SuperDave's script that adds the two new requirements:
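# Walk C:\ee and all of its subdirectories, searching every .txt file.
for dirpath, dirnames, filenames in os.walk("C:\\ee"):
    for f in filenames:
        if f.endswith(".txt"):
            fullfile = os.path.join(dirpath, f)
            with open(fullfile, "r") as fh:
                text = fh.read()
            t = pat.findall(text)  # pat is the compiled regex from SuperDave's script
            if t:
                # Prefix the matches with the file they came from.
                print(repr(fullfile) + ":" + repr(t))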

Sorry, full script attached:
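#!/usr/bin/env python3
import os
import re

# Placeholder pattern: the original pattern is not shown in this thread,
# so substitute your own begin/end labels. re.DOTALL lets "." match newlines.
pat = re.compile(r"(?<=BEGIN LABEL).*?(?=END LABEL)", re.DOTALL)

# Walk C:\ee and all of its subdirectories, searching every .txt file.
for dirpath, dirnames, filenames in os.walk("C:\\ee"):
    for f in filenames:
        if f.endswith(".txt"):
            fullfile = os.path.join(dirpath, f)
            with open(fullfile, "r") as fh:
                text = fh.read()
            t = pat.findall(text)
            if t:
                # Prefix the matches with the file they came from.
                print(repr(fullfile) + ":" + repr(t))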
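
The script above prints one entry per file, with all matches shown as a list. If the filename should instead appear at the start of every output line, a variation along the following lines might do it. This is only a sketch: the function name extract_lines and the BEGIN LABEL / END LABEL placeholders are assumptions for illustration, not part of the original script.

```python
import re
from pathlib import Path

# Placeholder labels; substitute the real begin/end labels.
pat = re.compile(r"(?<=BEGIN LABEL).*?(?=END LABEL)", re.DOTALL)

def extract_lines(root, pattern):
    """Yield 'path: line' for each non-empty line of each match under root."""
    # rglob() descends into subdirectories; sorted() keeps the output stable.
    for path in sorted(Path(root).rglob("*.txt")):
        text = path.read_text(errors="replace")
        for match in pattern.findall(text):
            for line in match.splitlines():
                line = line.strip()
                if line:
                    yield f"{path}: {line}"
```

To use it on the folder from the script above, loop over extract_lines(r"C:\ee", pat) and print each item.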

I can do Excel VBA if you are interested. Useful for data in worksheets & text files.
mike99c

ASKER

Would also be interested in doing this in Excel VBA
A Windows based product I have used for many years which strongly supports regular expressions is FileLocator Pro.  You can find it at http://www.mythicsoft.com.  It supports regular expressions in the data content search, filename specification, and path names.  It lets you start with beginner mode and switch to advanced mode in the interface, and includes an expression builder.  I have been using it for about 5 years and it keeps getting better.
There is a free lite version of the product called Agent Ransack, which you can also download there.
mike99c

ASKER

Thanks for the response, here is an example

;       Addressed  : Act_5015 159/1, Dyn_156/1S, Dyn_557/1, Dyn_18/1
;                    Act_5025/1, Act_9158/1
;
;       Identity   : File01.TXT

I want to capture the text between Addressed and Identity
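
For this particular sample, a minimal Python sketch of the extraction might look like the following. The sample string is retyped from the example above, and the clean() helper is a hypothetical name introduced here to strip the leading ";" and ":" decoration; adjust it if the real files are laid out differently.

```python
import re

# Sample retyped from the example above.
sample = (
    ";       Addressed  : Act_5015 159/1, Dyn_156/1S, Dyn_557/1, Dyn_18/1\n"
    ";                    Act_5025/1, Act_9158/1\n"
    ";\n"
    ";       Identity   : File01.TXT\n"
)

# Everything after "Addressed" and before "Identity", across line breaks.
pat = re.compile(r"(?<=Addressed).*?(?=Identity)", re.DOTALL)

def clean(raw):
    """Drop the leading ';' / ':' decoration and join the lines with spaces."""
    parts = []
    for line in raw.splitlines():
        line = line.strip().lstrip(";:").strip()
        if line:
            parts.append(line)
    return " ".join(parts)

extracted = [clean(m) for m in pat.findall(sample)]
print(extracted)
```

On the sample this yields a single string containing the six Act_/Dyn_ entries, with the comment markers and blank lines removed.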
That was relatively painless ..................
'=============================================================================
'- VERSION 2
'- EXCEL : OPEN ALL TEXT FILES IN A FOLDER & EXTRACT REQUIRED DATA
'- Uses late binding (CreateObject), so no VB Editor Tools\References entry is needed
'- **** Change MyFolder variable below to required path
'- Puts file name and text into the currently active worksheet
'- Brian Baulsom November 2010
'=============================================================================
Dim MyFolder As String
Dim MyFile As String
Dim FileCount As Integer
Dim FullName As String
Dim MyRegExp As Object
Dim MyPattern As String
Dim MyMatches As Variant    ' Reg Exp extract set
Dim MatchCount As Integer
Dim MatchItem As String
Dim FileString As String    ' whole text file
'------------------------
Dim ws As Worksheet
Dim ToRow As Long
'-------------------------------------------------------------------------
'=============================================================================
'- VERSION 2 : MAIN ROUTINE
'=============================================================================
Sub TEXTFILE_PROCESS()
    '=========================================================================
    '- ***********  CHANGE VARIABLES ****************************************
    MyFolder = "F:\Test\"       ' nb. final backslash
    '=========================================================================
    Application.Calculation = xlCalculationManual
    '- Pattern: "Addressed", then any characters including newlines (lazy), up to "Identity"
    MyPattern = "Addressed(\n|.)*?Identity"
    Set ws = ActiveSheet
    With ws
        .Cells.ClearContents
        .Range("A1").Value = "Pattern = " & MyPattern   ' header (MyTag was undefined)
        .Columns("B:B").WrapText = True
    End With
    ToRow = 2
    Set MyRegExp = CreateObject("VBScript.RegExp")
    FileCount = 0
    '-------------------------------------------------------------------------
    '- GET FILES
    MyFile = Dir(MyFolder & "*.txt")   ' text files
    '- LOOP through files in folder
    Do While MyFile <> ""
        FullName = MyFolder & MyFile
        FileCount = FileCount + 1
        Application.StatusBar = FileCount
        '-------------------------------------------------------------------------
        '- READ THE FILE INTO MEMORY AND CLOSE IT
        Open FullName For Input As #1
            FileString = Input(FileLen(FullName), #1)
        Close #1
        '------------------------------------------------------------------------
        GET_DATA        ' SUBROUTINE
        '----------------------------------------------------------------------
        '- NEXT FILE
        MyFile = Dir   ' Get next file
    Loop
    '------------------------------------------------------------------------
    ws.Range("A1:A" & ToRow).EntireRow.AutoFit
    MsgBox ("Processed " & FileCount & " file(s).")
    Application.Calculation = xlCalculationAutomatic
    Application.StatusBar = False
End Sub
'=============================================================================

'=============================================================================
'- EXTRACT DATA AND ADD TO WORKSHEET
'- Removes non-printing characters   eg. end of line etc.
'=============================================================================
Private Sub GET_DATA()
    With MyRegExp
        .Global = True
        .IgnoreCase = True
        .MultiLine = True
        .Pattern = MyPattern
        Set MyMatches = .Execute(FileString)
    End With
    '-----------------------------------------------------------------------
    '- GET MATCHES
    Dim m As Long       ' loop index (previously undeclared)
    MatchCount = MyMatches.Count
    If MatchCount > 0 Then
        For m = 0 To MatchCount - 1
            MatchItem = Application.WorksheetFunction.Clean(MyMatches(m).Value)
            MatchItem = Replace(MatchItem, "Addressed  :", "", 1, -1, vbTextCompare)
            MatchItem = Replace(MatchItem, "Identity", "", 1, -1, vbTextCompare)
            MatchItem = Replace(MatchItem, ";;", "", 1, -1, vbTextCompare)
            MatchItem = Trim(MatchItem)
            ws.Cells(ToRow, 1).Value = MyFile
            ws.Cells(ToRow, 2).Value = MatchItem
            ToRow = ToRow + 1
        Next
    End If
End Sub
'-------------------------------------------------------------------------------
