mike99c
asked on
Regular expression beginner: search text files for a string
I'm a regular expression beginner.
I've got around 250 text files I want to examine.
I want to extract the text between two labels.
The labels will exist in each text file.
The text between the two labels may be a few characters or it may be a few lines of text.
1. What is the simplest application to run the regular expression from? (I initially thought of Excel VBA.)
2. What would the actual regular expression be?
Thanks for any advice.
Answer to 1. Linux shell, grep
1) UltraEdit supports searching a list of files and it has regular expression support as well. It is a paid application, but it has a free trial.
2) That depends on what the labels are and which editor you decide to go with. You may be able to use the following:
(?<=beginning label or phrase).*?(?=ending label or phrase)
Note that you'll need the option that lets the . character match newlines switched on, otherwise you won't pick up the cases where the text spans multiple lines. Usually there's a checkbox for that option near where you enter the regular expression. In Perl-compatible regexes the option is "s".
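To make the point about newlines concrete, here is a minimal Python sketch of the lookbehind/lookahead pattern above. The labels START and END and the sample text are invented for the illustration; Python's re.DOTALL flag plays the role of the "s" option.

```python
import re

# Hypothetical labels and sample text, invented for this illustration.
sample = "START first line\nsecond line END trailing text"

# Lookbehind/lookahead keep the labels themselves out of the match.
pattern = r"(?<=START).*?(?=END)"

# Without DOTALL, '.' stops at the newline, so the multi-line span is missed.
without_dotall = re.findall(pattern, sample)

# With DOTALL (the "s" option), '.' also matches newlines.
with_dotall = re.findall(pattern, sample, re.DOTALL)

print(without_dotall)
print(with_dotall)
```

Note that Python only accepts fixed-width lookbehinds, so the beginning label has to be a literal string (or another fixed-length pattern) for this form to compile.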
>> Note that you'll need the option active for the . character to match newlines
Such a beginner's mistake. I'm not worthy ;)
P.S.
Good call Terry!
ASKER
Hi SuperDave
Got your script to work, thanks
All that's left is
~ to search through all files *.txt (ok to have subdirectories?)
~ can I put the individual filename that the text is coming from at the beginning of each line of the outputted text?
Thanks again.
Here's a modification to SuperDave's script that adds the two new requirements:
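for dirpath, dirnames, filenames in os.walk("C:\\ee"):
    for f in filenames:
        if f.endswith(".txt"):
            fullfile = os.path.join(dirpath, f)
            with open(fullfile, 'r') as fh:
                text = fh.read()
            t = pat.findall(text)
            if t:
                print(repr(fullfile) + ":" + repr(t))
Sorry, full script attached:
#!/usr/bin/env python3
import re
import os

# 'pat' comes from SuperDave's original script, which is not shown in this thread.
# Assumption: a placeholder built from the pattern suggested earlier; substitute your own labels.
pat = re.compile(r"(?<=beginning label or phrase).*?(?=ending label or phrase)", re.DOTALL)

# Walk C:\ee and every subdirectory, looking only at *.txt files.
for dirpath, dirnames, filenames in os.walk("C:\\ee"):
    for f in filenames:
        if f.endswith(".txt"):
            fullfile = os.path.join(dirpath, f)
            with open(fullfile, 'r') as fh:
                text = fh.read()
            t = pat.findall(text)
            if t:
                # File name first, then the list of matches found in that file.
                print(repr(fullfile) + ":" + repr(t))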
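If the goal is one output line per match with the file name in front, something along these lines might work. This is only a sketch under assumptions: the function names extract_between and format_line are made up here, the labels are passed in as plain strings, and the root folder in the usage part simply reuses the C:\ee path from the script above.

```python
import os
import re

def extract_between(root, start_label, end_label, suffix=".txt"):
    """Yield (path, text) for each span found between the two labels."""
    # re.escape lets the labels contain regex metacharacters safely.
    pat = re.compile(
        "(?<=" + re.escape(start_label) + ").*?(?=" + re.escape(end_label) + ")",
        re.DOTALL,
    )
    # os.walk descends into subdirectories on its own.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(suffix):
                path = os.path.join(dirpath, name)
                with open(path, "r", errors="replace") as fh:
                    text = fh.read()
                for match in pat.findall(text):
                    yield path, match

def format_line(path, text):
    """Collapse whitespace so each match fits on one output line."""
    return path + ": " + " ".join(text.split())

if __name__ == "__main__":
    # Placeholder labels; replace them with the real ones.
    for path, text in extract_between("C:\\ee", "beginning label", "ending label"):
        print(format_line(path, text))
```

Collapsing the whitespace is a design choice: it keeps a multi-line match on a single line so the file name prefix stays meaningful when the output is pasted into a spreadsheet.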
I can do Excel VBA if you are interested. Useful for data in worksheets & text files.
ASKER
Would also be interested in doing this in Excel VBA
A Windows based product I have used for many years which strongly supports regular expressions is FileLocator Pro. You can find it at http://www.mythicsoft.com. It supports regular expressions in the data content search, filename specification, and path names. It lets you start with beginner mode and switch to advanced mode in the interface, and includes an expression builder. I have been using it for about 5 years and it keeps getting better.
There is a free lite version of the product called Agent Ransack, which you can also download there.
ASKER
Thanks for the response, here is an example
; Addressed : Act_5015 159/1, Dyn_156/1S, Dyn_557/1, Dyn_18/1
; Act_5025/1, Act_9158/1
;
; Identity : File01.TXT
I want to capture the text between Addressed and Identity
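For this particular sample, a minimal Python sketch might look like the following. It assumes the leading ";" on each line is a comment marker that should be dropped and that the continuation lines should be joined into one string; the helper name clean is made up for the example.

```python
import re

# Sample copied from the post above.
sample = (
    "; Addressed : Act_5015 159/1, Dyn_156/1S, Dyn_557/1, Dyn_18/1\n"
    "; Act_5025/1, Act_9158/1\n"
    ";\n"
    "; Identity : File01.TXT\n"
)

# Capture everything after "Addressed :" up to the "Identity" label.
pat = re.compile(r"Addressed\s*:(.*?)Identity", re.DOTALL)

def clean(block):
    """Drop the leading ';' marker from each line and join what is left."""
    parts = []
    for line in block.splitlines():
        line = line.strip().lstrip(";").strip()
        if line:
            parts.append(line)
    return " ".join(parts)

results = [clean(m) for m in pat.findall(sample)]
print(results)
```

Here a capture group is used instead of lookarounds, which avoids the fixed-width lookbehind restriction and lets the pattern tolerate variable spacing before the colon.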
That was relatively painless ..................
'=============================================================================
'- VERSION 2
'- EXCEL : OPEN ALL TEXT FILES IN A FOLDER & EXTRACT REQUIRED DATA
'- Uses late binding (CreateObject), so no VB Editor Tools\References entry is required.
'- For early binding, tick "Microsoft VBScript Regular Expressions 5.5" instead.
'- **** Change MyFolder variable below to required path
'- Puts file name and text into the currently active worksheet
'- Brian Baulsom November 2010
'=============================================================================
Dim MyFolder As String
Dim MyFile As String
Dim FileCount As Integer
Dim FullName As String
Dim MyRegExp As Object
Dim MyPattern As String
Dim MyMatches As Variant ' Reg Exp extract set
Dim MatchCount As Integer
Dim MatchItem As String
Dim FileString As String ' whole text file
'------------------------
Dim ws As Worksheet
Dim ToRow As Long
'-------------------------------------------------------------------------
'=============================================================================
'- VERSION 2 : MAIN ROUTINE
'=============================================================================
Sub TEXTFILE_PROCESS()
    '=========================================================================
    '- *********** CHANGE VARIABLES ****************************************
    MyFolder = "F:\Test\" ' nb. final backslash
    MyPattern = "Addressed(\n|.)*?Identity" ' text between the two labels
    '=========================================================================
    Application.Calculation = xlCalculationManual
    Set ws = ActiveSheet
    With ws
        .Cells.ClearContents
        .Range("A1").Value = "Pattern : " & MyPattern
        .Columns("B:B").WrapText = True
    End With
    ToRow = 2
    Set MyRegExp = CreateObject("VBScript.RegExp")
    FileCount = 0
    '-------------------------------------------------------------------------
    '- GET FILES
    MyFile = Dir(MyFolder & "*.txt") ' text files
    '- LOOP through files in folder
    Do While MyFile <> ""
        FullName = MyFolder & MyFile
        FileCount = FileCount + 1
        Application.StatusBar = FileCount
        '---------------------------------------------------------------------
        '- READ THE FILE INTO MEMORY AND CLOSE IT
        Open FullName For Input As #1
        FileString = Input(FileLen(FullName), #1)
        Close #1
        '---------------------------------------------------------------------
        GET_DATA ' SUBROUTINE
        '---------------------------------------------------------------------
        '- NEXT FILE
        MyFile = Dir ' Get next file
    Loop
    '-------------------------------------------------------------------------
    ws.Range("A1:A" & ToRow).EntireRow.AutoFit
    MsgBox ("Processed " & FileCount & " file(s).")
    Application.Calculation = xlCalculationAutomatic
    Application.StatusBar = False
End Sub
'=============================================================================
'=============================================================================
'- EXTRACT DATA AND ADD TO WORKSHEET
'- Removes non-printing characters eg. end of line etc.
'=============================================================================
Private Sub GET_DATA()
    Dim m As Long ' loop index over the matches
    With MyRegExp
        .Global = True
        .IgnoreCase = True
        .MultiLine = True
        .Pattern = MyPattern
        Set MyMatches = .Execute(FileString)
    End With
    '-----------------------------------------------------------------------
    '- GET MATCHES
    MatchCount = MyMatches.Count
    If MatchCount > 0 Then
        For m = 0 To MatchCount - 1
            '- Clean strips non-printing characters such as line ends
            MatchItem = Application.WorksheetFunction.Clean(MyMatches(m).Value)
            '- Remove the two labels and the leftover comment markers
            MatchItem = Replace(MatchItem, "Addressed :", "", 1, -1, vbTextCompare)
            MatchItem = Replace(MatchItem, "Identity", "", 1, -1, vbTextCompare)
            MatchItem = Replace(MatchItem, ";;", "", 1, -1, vbTextCompare)
            MatchItem = Trim(MatchItem)
            '- File name in column A, extracted text in column B
            ws.Cells(ToRow, 1).Value = MyFile
            ws.Cells(ToRow, 2).Value = MatchItem
            ToRow = ToRow + 1
        Next
    End If
End Sub
'-------------------------------------------------------------------------------