Regular Expression beginner search text files for a string

I'm a regular expression beginner.
I've got around 250 text files I want to examine.
I want to extract the text between two labels.
The labels will exist in each text file.
The text between the two labels may be a few characters or it may be a few lines of text.
1. What is the simplest application to run the regular expression from? (I initially thought of Excel VBA.)
2. What would the actual regular expression be?

Thanks for any advice.
Asked by: mike99c

Pacane commented:
Answer to 1. Linux shell, grep

käµfm³d 👽 commented:
1) UltraEdit can search across a list of files and supports regular expressions. It is a paid application, but it has a free trial.

2) That depends on what the labels are and which editor you decide to go with. You may be able to use the following:
(?<=beginning label or phrase).*?(?=ending label or phrase)
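
As a minimal sketch of that lookaround pattern in Python: the labels `START:` and `END` and the sample string are placeholders, not taken from the question. Note that Python's `re` module only accepts fixed-width lookbehinds, which a literal label satisfies.

```python
import re

# Hypothetical labels; substitute the real ones from the files.
begin, end = "START:", "END"

# re.escape guards against regex metacharacters inside the labels.
pattern = re.compile(
    "(?<=" + re.escape(begin) + ").*?(?=" + re.escape(end) + ")"
)

sample = "header START: some value END footer"
match = pattern.search(sample)

# The lookarounds keep the labels out of the match itself.
extracted = match.group(0) if match else None
print(repr(extracted))
```

Because the labels sit inside lookarounds, only the text between them is returned, including any surrounding spaces.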

Terry Woods (IT Guru) commented:
Note that you'll need the option active for the . character to match newlines, otherwise you won't pick up the cases where the text spans multiple lines. Usually there's a checkbox for setting that option near to where you enter the regular expression. In Perl compatible regex's the option is "s".
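
To illustrate the difference that option makes, here is a small Python sketch. The labels `BEGIN` and `END` and the sample text are made up for the example; in Python the option is spelled `re.DOTALL` (or `re.S`).

```python
import re

# Made-up sample where the wanted text spans two lines.
text = "BEGIN first line\nsecond line END"
pattern = r"BEGIN(.*?)END"

# Without the option, "." stops at the newline, so nothing matches.
without_flag = re.findall(pattern, text)

# With re.DOTALL, "." also matches "\n", so the match can span lines.
with_flag = re.findall(pattern, text, re.DOTALL)

print(without_flag)
print(with_flag)
```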

Superdave commented:
I'd use Python because it's easy to learn and handles standard Perl-compatible regular expressions; in that case the attached program will work.  Note the angle-brackets are part of my example labels, no special meaning.
But I don't know for comparison what VBA has for regular expressions.

Pacane: grep won't handle multiple lines, and there's no such thing as a Linux shell.



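#!/usr/bin/env python3
import re

# Read the whole file as text so a match can span several lines.
with open(r'filename.txt', 'r') as f:
    text = f.read()
# re.S (DOTALL) lets "." match newlines; (.*?) is non-greedy.
pat = re.compile(r'<label1>(.*?)<label2>', re.S)
t = pat.findall(text)
print(repr(t))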

käµfm³d 👽 commented:
>>  Note that you'll need the option active for the . character to match newlines

Such a beginner's mistake. I'm not worthy  ;)

käµfm³d 👽 commented:
P.S.

Good call Terry!

mike99c (author) commented:
Hi SuperDave

Got your script to work, thanks
All that's left is
~ to search through all files *.txt (ok to have subdirectories?)
~ can I put the individual filename that the text is coming from at the beginning of each line of the outputted text?

Thanks again.

käµfm³d 👽 commented:
Here's a modification to SuperDave's script that adds the two new requirements (it still needs the imports and the pat definition from his script):
for dirpath, dirnames, filenames in os.walk("C:\\ee"):
    for f in filenames:
        if f.endswith(".txt"):
            fullfile = os.path.join(dirpath, f)
            with open(fullfile, 'r') as fh:
                text = fh.read()
            t = pat.findall(text)
            if t:
                print(repr(fullfile) + ":" + repr(t))

käµfm³d 👽 commented:
Sorry, full script attached:
#!/usr/bin/env python3
import re
import os

# Same pattern as in SuperDave's script.
pat = re.compile(r'<label1>(.*?)<label2>', re.S)

# os.walk also descends into subdirectories.
for dirpath, dirnames, filenames in os.walk("C:\\ee"):
    for f in filenames:
        if f.endswith(".txt"):
            fullfile = os.path.join(dirpath, f)
            with open(fullfile, 'r') as fh:
                text = fh.read()
            t = pat.findall(text)
            if t:
                print(repr(fullfile) + ":" + repr(t))

BrainB commented:
I can do Excel VBA if you are interested. Useful for data in worksheets & text files.

mike99c (author) commented:
I would also be interested in doing this in Excel VBA.

richard_crist commented:
A Windows-based product I have used for many years, with strong regular expression support, is FileLocator Pro. You can find it at http://www.mythicsoft.com. It supports regular expressions in the content search, the filename specification, and path names. Its interface lets you start in beginner mode and switch to advanced mode, and it includes an expression builder. I have been using it for about 5 years and it keeps getting better.
There is a free lite version of the product called Agent Ransack, which you can also download there.

BrainB commented:
Here is the Excel version. The code goes into a normal code module. The data is put into the currently active worksheet. It handles multi-line tags. The process strips away everything but the tag contents, including non-printing characters (e.g. end-of-line breaks). Please note the requirement to add a reference to "Microsoft VBScript Regular Expressions" in the Visual Basic Editor, found in the menu Tools\References.

If you have problems it will be helpful if you upload one of your text files because there are so many variations possible.
'=============================================================================
'- EXCEL : OPEN ALL TEXT FILES IN A FOLDER & EXTRACT REQUIRED DATA
'- Needs VB Editor Tools\References "Microsoft VBScript Regular Expressions x.x"
'- * Change MyFolder variable to required path
'- * Change MyTag variable to the required tag (just the tag name, no < or /..>)
'- Puts file name and text into the currently active worksheet
'- Brian Baulsom November 2010
'=============================================================================
Dim MyFolder As String
Dim MyTag As String
Dim MyFile As String
Dim FileCount As Integer
Dim FullName As String
Dim MyRegExp As Object
Dim MyPattern As String
Dim MyMatches As Variant    ' Reg Exp extract set
Dim MatchCount As Integer
Dim MatchItem As String
Dim FileString As String    ' whole text file
'------------------------
Dim ws As Worksheet
Dim ToRow As Long
'-------------------------------------------------------------------------
'=============================================================================
'- MAIN ROUTINE
'=============================================================================
Sub TEXTFILE_PROCESS()
    '=========================================================================
    '- ***********  CHANGE VARIABLES ****************************************
    MyFolder = "F:\Test\"       ' nb. final backslash
    MyTag = "script"            ' tag to look for
    '=========================================================================
    Application.Calculation = xlCalculationManual
    Set ws = ActiveSheet
    With ws
        .Cells.ClearContents
        .Range("A1").Value = " Tag = " & MyTag
        .Columns("B:B").WrapText = True
    End With
    ToRow = 2
    Set MyRegExp = CreateObject("VbScript.RegExp")
    MyPattern = "<" & MyTag & "(\n|.)*?/" & MyTag & ">" '
    'MyPattern = "<script(\n|.)*?/script>"
    FileCount = 0
    '-------------------------------------------------------------------------
    '- GET FILES
    MyFile = Dir(MyFolder & "*.txt")   ' text files
    '- LOOP through files in folder
    Do While MyFile <> ""
        FullName = MyFolder & MyFile
        FileCount = FileCount + 1
        Application.StatusBar = FileCount
        '-------------------------------------------------------------------------
        '- READ THE FILE INTO MEMORY AND CLOSE IT
        Open FullName For Input As #1
            FileString = Input(FileLen(FullName), #1)
        Close #1
        '------------------------------------------------------------------------
        GET_DATA        ' SUBROUTINE
        '----------------------------------------------------------------------
        '- NEXT FILE
        MyFile = Dir   ' Get next file
    Loop
    '------------------------------------------------------------------------
    ws.Range("A1:A" & ToRow).EntireRow.AutoFit
    MsgBox ("Processed " & FileCount & " file(s).")
    Application.Calculation = xlCalculationAutomatic
    Application.StatusBar = False
End Sub
'=============================================================================

'=============================================================================
'- EXTRACT DATA AND ADD TO WORKSHEET
'- Removes non-printing characters   eg. end of line etc.
'=============================================================================
Private Sub GET_DATA()
    Dim m As Long               ' index into the match collection
    With MyRegExp
        .Global = True
        .ignorecase = True
        .MultiLine = True
        .Pattern = MyPattern
        Set MyMatches = .Execute(FileString)
    End With
    '-----------------------------------------------------------------------
    '- GET MATCHES
    MatchCount = MyMatches.Count
    If MatchCount > 0 Then
        For m = 0 To MatchCount - 1
            MatchItem = Application.WorksheetFunction.Clean(MyMatches(m))
            MatchItem = Replace(MatchItem, "<" & MyTag, "", 1, -1, vbTextCompare)
            MatchItem = Replace(MatchItem, "</" & MyTag & ">", "", 1, -1, vbTextCompare)
            MatchItem = Replace(MatchItem, ">", "", 1, -1, vbTextCompare)
            ws.Cells(ToRow, 1).Value = MyFile
            ws.Cells(ToRow, 2).Value = MatchItem
            ToRow = ToRow + 1
        Next
    End If
End Sub
'-------------------------------------------------------------------------------
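
For readers following the Python route instead, a rough equivalent of the worksheet output above can be sketched as a CSV file that Excel opens directly. The function name `extract_to_csv`, the output file, and the exact tag pattern are assumptions for illustration; this is not a line-by-line translation of the macro.

```python
import csv
import re
from pathlib import Path


def extract_to_csv(folder, tag, out_csv):
    """Write one (file name, tag contents) row per match found in folder/*.txt."""
    # Roughly the macro's pattern: opening tag, anything (incl. newlines), closing tag.
    pattern = re.compile(
        r"<%s[^>]*>(.*?)</%s>" % (re.escape(tag), re.escape(tag)),
        re.DOTALL | re.IGNORECASE,
    )
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(errors="replace")
        for contents in pattern.findall(text):
            # Drop control characters (code points 0-31), similar to Excel's CLEAN.
            rows.append((path.name, re.sub(r"[\x00-\x1f]", "", contents)))
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["File", "Text"])
        writer.writerows(rows)
    return rows
```

A call such as `extract_to_csv(r"F:\Test", "script", "matches.csv")` would mirror the folder and tag used in the macro; like `Dir`, the `glob("*.txt")` call does not descend into subfolders.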

mike99c (author) commented:
Thanks for the response. Here is an example:

;       Addressed  : Act_5015 159/1, Dyn_156/1S, Dyn_557/1, Dyn_18/1
;                    Act_5025/1, Act_9158/1
;
;       Identity   : File01.TXT

I want to capture the text between Addressed and Identity
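
For the Python route, a minimal sketch against exactly this sample might look like the following. It assumes the leading `;` on each line is a comment marker that should be dropped, and the helper name `extract_addressed` is made up for the example.

```python
import re

sample = """\
;       Addressed  : Act_5015 159/1, Dyn_156/1S, Dyn_557/1, Dyn_18/1
;                    Act_5025/1, Act_9158/1
;
;       Identity   : File01.TXT
"""

# Capture everything after "Addressed :" up to the next "Identity".
pattern = re.compile(r"Addressed\s*:(.*?)Identity", re.DOTALL)


def extract_addressed(text):
    """Return the text between the two labels as one cleaned-up line, or None."""
    match = pattern.search(text)
    if match is None:
        return None
    pieces = []
    for line in match.group(1).splitlines():
        # Assumption: a leading ";" is only a comment marker, not data.
        cleaned = line.strip().lstrip(";").strip()
        if cleaned:
            pieces.append(cleaned)
    return " ".join(pieces)


result = extract_addressed(sample)
print(result)
```

Joining the continuation lines with a space keeps the last entry of one line from running into the first entry of the next.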

BrainB commented:
That was relatively painless ..................
'=============================================================================
'- VERSION 2
'- EXCEL : OPEN ALL TEXT FILES IN A FOLDER & EXTRACT REQUIRED DATA
'- Needs VB Editor Tools\References "Microsoft VBScript Regular Expressions x.x"
'- **** Change MyFolder variable below to required path
'- Puts file name and text into the currently active worksheet
'- Brian Baulsom November 2010
'=============================================================================
Dim MyFolder As String
Dim MyFile As String
Dim FileCount As Integer
Dim FullName As String
Dim MyRegExp As Object
Dim MyPattern As String
Dim MyMatches As Variant    ' Reg Exp extract set
Dim MatchCount As Integer
Dim MatchItem As String
Dim FileString As String    ' whole text file
'------------------------
Dim ws As Worksheet
Dim ToRow As Long
'-------------------------------------------------------------------------
'=============================================================================
'- VERSION 2 : MAIN ROUTINE
'=============================================================================
Sub TEXTFILE_PROCESS()
    '=========================================================================
    '- ***********  CHANGE VARIABLES ****************************************
    MyFolder = "F:\Test\"       ' nb. final backslash
    '=========================================================================
    Application.Calculation = xlCalculationManual
    Set ws = ActiveSheet
    With ws
        .Cells.ClearContents
        .Range("A1").Value = "Text between Addressed and Identity"
        .Columns("B:B").WrapText = True
    End With
    ToRow = 2
    Set MyRegExp = CreateObject("VbScript.RegExp")
    MyPattern = "Addressed(\n|.)*?Identity"
    FileCount = 0
    '-------------------------------------------------------------------------
    '- GET FILES
    MyFile = Dir(MyFolder & "*.txt")   ' text files
    '- LOOP through files in folder
    Do While MyFile <> ""
        FullName = MyFolder & MyFile
        FileCount = FileCount + 1
        Application.StatusBar = FileCount
        '-------------------------------------------------------------------------
        '- READ THE FILE INTO MEMORY AND CLOSE IT
        Open FullName For Input As #1
            FileString = Input(FileLen(FullName), #1)
        Close #1
        '------------------------------------------------------------------------
        GET_DATA        ' SUBROUTINE
        '----------------------------------------------------------------------
        '- NEXT FILE
        MyFile = Dir   ' Get next file
    Loop
    '------------------------------------------------------------------------
    ws.Range("A1:A" & ToRow).EntireRow.AutoFit
    MsgBox ("Processed " & FileCount & " file(s).")
    Application.Calculation = xlCalculationAutomatic
    Application.StatusBar = False
End Sub
'=============================================================================

'=============================================================================
'- EXTRACT DATA AND ADD TO WORKSHEET
'- Removes non-printing characters   eg. end of line etc.
'=============================================================================
Private Sub GET_DATA()
    Dim m As Long               ' index into the match collection
    With MyRegExp
        .Global = True
        .ignorecase = True
        .MultiLine = True
        .Pattern = MyPattern
        Set MyMatches = .Execute(FileString)
    End With
    '-----------------------------------------------------------------------
    '- GET MATCHES
    MatchCount = MyMatches.Count
    If MatchCount > 0 Then
        For m = 0 To MatchCount - 1
            MatchItem = Application.WorksheetFunction.Clean(MyMatches(m))
            MatchItem = Replace(MatchItem, "Addressed  :", "", 1, -1, vbTextCompare)
            MatchItem = Replace(MatchItem, "Identity", "", 1, -1, vbTextCompare)
            MatchItem = Replace(MatchItem, ";;", "", 1, -1, vbTextCompare)
            MatchItem = Trim(MatchItem)
            ws.Cells(ToRow, 1).Value = MyFile
            ws.Cells(ToRow, 2).Value = MatchItem
            ToRow = ToRow + 1
        Next
    End If
End Sub
'-------------------------------------------------------------------------------
