Solved

Regular expression beginner: search text files for a string

Posted on 2010-11-09
Last Modified: 2012-06-21
I'm a regular expression beginner.
I've got around 250 text files I want to examine.
I want to extract the text between two labels.
The labels will exist in each text file.
The text between the two labels may be a few characters or it may be a few lines of text.
1. What is the simplest application to run the regular expression from? (I initially thought of Excel VBA.)
2. What would the actual regular expression be?

Thanks for any advice.
Question by:mike99c
 
LVL 2

Expert Comment

by:Pacane
ID: 34095082
Answer to 1. Linux shell, grep
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34095128
1) UltraEdit supports searching a list of files and it has regular expression support as well. It is a paid application, but it has a free trial.

2) That depends on what the labels are and which editor you decide to go with. You may be able to use the following:
(?<=beginning label or phrase).*?(?=ending label or phrase)
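To make the lookaround idea concrete, here is a minimal Python sketch. The labels `<label1>` and `<label2>` and the sample text are placeholders for illustration, not taken from the real files. The lookbehind and lookahead are zero-width, so only the text between the labels ends up in the match.

```python
import re

# Hypothetical labels and sample text, for illustration only.
BEGIN = "<label1>"
END = "<label2>"
sample = "junk <label1>wanted text<label2> more junk"

# (?<=...) and (?=...) are zero-width, so the labels are not part of the match.
# re.escape guards against labels that contain regex metacharacters.
pattern = re.compile(
    "(?<=" + re.escape(BEGIN) + ").*?(?=" + re.escape(END) + ")"
)

match = pattern.search(sample)
between = match.group(0) if match else None
print(between)
```

Python only accepts fixed-width lookbehinds, which a literal label satisfies. A capture group between the two labels is a common alternative when the tool does not support lookbehind.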

 
LVL 35

Expert Comment

by:Terry Woods
ID: 34096963
Note that you'll need the option active for the . character to match newlines, otherwise you won't pick up the cases where the text spans multiple lines. Usually there's a checkbox for setting that option near where you enter the regular expression. In Perl-compatible regexes the option is "s".
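A short Python sketch of the difference, assuming placeholder labels `<label1>` and `<label2>` and a made-up two-line sample. In Python the "s" option is spelled `re.S` (or `re.DOTALL`).

```python
import re

# Hypothetical sample where the wanted text spans two lines.
sample = "<label1>first line\nsecond line<label2>"

# Without DOTALL, "." stops at the newline, so nothing matches.
no_dotall = re.findall(r"<label1>(.*?)<label2>", sample)

# With re.S (alias re.DOTALL), "." also matches "\n".
with_dotall = re.findall(r"<label1>(.*?)<label2>", sample, re.S)

print(no_dotall)
print(with_dotall)
```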
 
LVL 13

Accepted Solution

by:Superdave (earned 250 total points)
ID: 34097075
I'd use Python because it's easy to learn and handles standard Perl-compatible regular expressions; in that case the attached program will work.  Note the angle-brackets are part of my example labels, no special meaning.
But I don't know for comparison what VBA has for regular expressions.

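Pacane: grep works line by line by default, so on its own it won't handle text that spans multiple lines. Also, "Linux shell" isn't one specific program; Linux has several shells (bash, for example).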



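#!/usr/bin/env python3
import re

# Read the whole file into memory (text mode).
with open(r'filename.txt', 'r') as f:
    text = f.read()

# re.S lets "." match newlines, so a match can span several lines.
pat = re.compile(r'<label1>(.*?)<label2>', re.S)

# findall returns a list of the captured text between the labels.
t = pat.findall(text)
print(repr(t))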

 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34097227
>>  Note that you'll need the option active for the . character to match newlines

Such a beginner's mistake. I'm not worthy  ;)
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34097233
P.S.

Good call Terry!
 

Author Comment

by:mike99c
ID: 34105825
Hi SuperDave

Got your script to work, thanks
All that's left is
~ to search through all files *.txt (ok to have subdirectories?)
~ can I put the individual filename that the text is coming from at the beginning of each line of the outputted text?

Thanks again.

 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34107089
Here's a modification to SuperDave's script that adds the two new requirements:
for dirpath, dirnames, filenames in os.walk("C:\\ee"):
    for f in filenames:
        if f.endswith(".txt"):
            fullfile = os.path.join(dirpath, f)
            with open(fullfile, 'r') as fh:
                text = fh.read()
            t = pat.findall(text)
            if t:
                print(repr(fullfile) + ":" + repr(t))

 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34107096
Sorry, full script attached:
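#!/usr/bin/env python3
import re
import os

# Same pattern as SuperDave's script; re.S lets "." match newlines.
pat = re.compile(r'<label1>(.*?)<label2>', re.S)

# Walk C:\ee and all of its subdirectories.
for dirpath, dirnames, filenames in os.walk("C:\\ee"):
    for f in filenames:
        if f.endswith(".txt"):
            fullfile = os.path.join(dirpath, f)
            with open(fullfile, 'r') as fh:
                text = fh.read()
            t = pat.findall(text)
            if t:
                # Prefix the matches with the file they came from.
                print(repr(fullfile) + ":" + repr(t))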
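If one output line per match is preferred, with the filename at the start of each line, something along these lines may work. This is only a sketch: the labels are placeholders, `extract_lines` is a hypothetical helper name, and it uses `pathlib` instead of `os.walk` as a matter of taste.

```python
import re
from pathlib import Path

# Hypothetical labels; replace with the real ones.
PATTERN = re.compile(r"<label1>(.*?)<label2>", re.S)

def extract_lines(root):
    """Return one 'path: text' line per match in every *.txt file under root."""
    lines = []
    for path in sorted(Path(root).rglob("*.txt")):
        # errors="replace" keeps going if a file has unexpected bytes.
        text = path.read_text(errors="replace")
        for found in PATTERN.findall(text):
            # Collapse internal whitespace so each match fits on one line.
            lines.append(f"{path}: {' '.join(found.split())}")
    return lines
```

Calling `print("\n".join(extract_lines("C:\\ee")))` would then list every match in the folder and its subdirectories, one per line.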

 
LVL 4

Expert Comment

by:BrainB
ID: 34107794
I can do Excel VBA if you are interested. Useful for data in worksheets & text files.
 

Author Comment

by:mike99c
ID: 34109608
Would also be interested in doing this in Excel VBA
 
LVL 3

Expert Comment

by:richard_crist
ID: 34114454
A Windows based product I have used for many years which strongly supports regular expressions is FileLocator Pro.  You can find it at http://www.mythicsoft.com.  It supports regular expressions in the data content search, filename specification, and path names.  It lets you start with beginner mode and switch to advanced mode in the interface, and includes an expression builder.  I have been using it for about 5 years and it keeps getting better.
There is a free lite version of the product called Agent Ransack, which you can also download there.
 
LVL 4

Assisted Solution

by:BrainB (earned 250 total points)
ID: 34131206
Here is the Excel version. The code goes into a normal code module. The data is put into the currently active worksheet. It handles multi-line tags OK. The process strips away everything but the tag contents, including non-printing characters (e.g. end-of-line breaks). Please note the requirement to add a reference to "Microsoft VBScript Regular Expressions" in the Visual Basic Editor, found in the menu Tools\References.

If you have problems it will be helpful if you upload one of your text files because there are so many variations possible.
'=============================================================================
'- EXCEL : OPEN ALL TEXT FILES IN A FOLDER & EXTRACT REQUIRED DATA
'- Needs VB Editor Tools\References "Microsoft VBScript Regular Expressions x.x"
'- * Change MyFolder variable to required path
'- * Change MyTag variable to the required tag (just the tag name, no < or /..>)
'- Puts file name and text into the currently active worksheet
'- Brian Baulsom November 2010
'=============================================================================
Dim MyFolder As String
Dim MyTag As String
Dim MyFile As String
Dim FileCount As Integer
Dim FullName As String
Dim MyRegExp As Object
Dim MyPattern As String
Dim MyMatches As Variant    ' Reg Exp extract set
Dim MatchCount As Integer
Dim MatchItem As String
Dim FileString As String    ' whole text file
'------------------------
Dim ws As Worksheet
Dim ToRow As Long
'-------------------------------------------------------------------------

'=============================================================================
'- MAIN ROUTINE
'=============================================================================
Sub TEXTFILE_PROCESS()
    '=========================================================================
    '- ***********  CHANGE VARIABLES ****************************************
    MyFolder = "F:\Test\"       ' nb. final backslash
    MyTag = "script"            ' tag to look for
    '=========================================================================
    Application.Calculation = xlCalculationManual
    Set ws = ActiveSheet
    With ws
        .Cells.ClearContents
        .Range("A1").Value = " Tag = " & MyTag
        .Columns("B:B").WrapText = True
    End With
    ToRow = 2
    Set MyRegExp = CreateObject("VbScript.RegExp")
    '- (\n|.) also matches line breaks, so the text can span several lines
    MyPattern = "<" & MyTag & "(\n|.)*?/" & MyTag & ">"
    'MyPattern = "<script(\n|.)*?/script>"
    FileCount = 0
    '-------------------------------------------------------------------------
    '- GET FILES
    MyFile = Dir(MyFolder & "*.txt")   ' text files
    '- LOOP through files in folder
    Do While MyFile <> ""
        FullName = MyFolder & MyFile
        FileCount = FileCount + 1
        Application.StatusBar = FileCount
        '-------------------------------------------------------------------------
        '- READ THE FILE INTO MEMORY AND CLOSE IT
        Open FullName For Input As #1
            FileString = Input(FileLen(FullName), #1)
        Close #1
        '------------------------------------------------------------------------
        GET_DATA        ' SUBROUTINE
        '----------------------------------------------------------------------
        '- NEXT FILE
        MyFile = Dir   ' Get next file
    Loop
    '------------------------------------------------------------------------
    ws.Range("A1:A" & ToRow).EntireRow.AutoFit
    MsgBox ("Processed " & FileCount & " file(s).")
    Application.Calculation = xlCalculationAutomatic
    Application.StatusBar = False
End Sub
'=============================================================================



'=============================================================================
'- EXTRACT DATA AND ADD TO WORKSHEET
'- Removes non-printing characters   eg. end of line etc.
'=============================================================================
Private Sub GET_DATA()
    Dim m As Long               ' loop index for the matches
    With MyRegExp
        .Global = True
        .ignorecase = True
        .MultiLine = True
        .Pattern = MyPattern
        Set MyMatches = .Execute(FileString)
    End With
    '-----------------------------------------------------------------------
    '- GET MATCHES
    MatchCount = MyMatches.Count
    If MatchCount > 0 Then
        For m = 0 To MatchCount - 1
            MatchItem = Application.WorksheetFunction.Clean(MyMatches(m))
            MatchItem = Replace(MatchItem, "<" & MyTag, "", 1, -1, vbTextCompare)
            MatchItem = Replace(MatchItem, "</" & MyTag & ">", "", 1, -1, vbTextCompare)
            MatchItem = Replace(MatchItem, ">", "", 1, -1, vbTextCompare)
            ws.Cells(ToRow, 1).Value = MyFile
            ws.Cells(ToRow, 2).Value = MatchItem
            ToRow = ToRow + 1
        Next
    End If
End Sub
'-------------------------------------------------------------------------------

 

Author Comment

by:mike99c
ID: 34131611
Thanks for the response, here is an example

;       Addressed  : Act_5015 159/1, Dyn_156/1S, Dyn_557/1, Dyn_18/1
;                    Act_5025/1, Act_9158/1
;
;       Identity   : File01.TXT

I want to capture the text between Addressed and Identity
 
LVL 4

Expert Comment

by:BrainB
ID: 34131953
That was relatively painless ..................
'=============================================================================
'- VERSION 2
'- EXCEL : OPEN ALL TEXT FILES IN A FOLDER & EXTRACT REQUIRED DATA
'- Needs VB Editor Tools\References "Microsoft VBScript Regular Expressions x.x"
'- **** Change MyFolder variable below to required path
'- Puts file name and text into the currently active worksheet
'- Brian Baulsom November 2010
'=============================================================================
Dim MyFolder As String
Dim MyFile As String
Dim FileCount As Integer
Dim FullName As String
Dim MyRegExp As Object
Dim MyPattern As String
Dim MyMatches As Variant    ' Reg Exp extract set
Dim MatchCount As Integer
Dim MatchItem As String
Dim FileString As String    ' whole text file
'------------------------
Dim ws As Worksheet
Dim ToRow As Long
'-------------------------------------------------------------------------

'=============================================================================
'- VERSION 2 : MAIN ROUTINE
'=============================================================================
Sub TEXTFILE_PROCESS()
    '=========================================================================
    '- ***********  CHANGE VARIABLES ****************************************
    MyFolder = "F:\Test\"       ' nb. final backslash
    '=========================================================================
    Application.Calculation = xlCalculationManual
    Set ws = ActiveSheet
    With ws
        .Cells.ClearContents
        .Range("A1").Value = "Text between Addressed and Identity"
        .Columns("B:B").WrapText = True
    End With
    ToRow = 2
    Set MyRegExp = CreateObject("VbScript.RegExp")
    '- (\n|.) also matches line breaks, so the text can span several lines
    MyPattern = "Addressed(\n|.)*?Identity"
    FileCount = 0
    '-------------------------------------------------------------------------
    '- GET FILES
    MyFile = Dir(MyFolder & "*.txt")   ' text files
    '- LOOP through files in folder
    Do While MyFile <> ""
        FullName = MyFolder & MyFile
        FileCount = FileCount + 1
        Application.StatusBar = FileCount
        '-------------------------------------------------------------------------
        '- READ THE FILE INTO MEMORY AND CLOSE IT
        Open FullName For Input As #1
            FileString = Input(FileLen(FullName), #1)
        Close #1
        '------------------------------------------------------------------------
        GET_DATA        ' SUBROUTINE
        '----------------------------------------------------------------------
        '- NEXT FILE
        MyFile = Dir   ' Get next file
    Loop
    '------------------------------------------------------------------------
    ws.Range("A1:A" & ToRow).EntireRow.AutoFit
    MsgBox ("Processed " & FileCount & " file(s).")
    Application.Calculation = xlCalculationAutomatic
    Application.StatusBar = False
End Sub
'=============================================================================



'=============================================================================
'- EXTRACT DATA AND ADD TO WORKSHEET
'- Removes non-printing characters   eg. end of line etc.
'=============================================================================
Private Sub GET_DATA()
    Dim m As Long               ' loop index for the matches
    With MyRegExp
        .Global = True
        .ignorecase = True
        .MultiLine = True
        .Pattern = MyPattern
        Set MyMatches = .Execute(FileString)
    End With
    '-----------------------------------------------------------------------
    '- GET MATCHES
    MatchCount = MyMatches.Count
    If MatchCount > 0 Then
        For m = 0 To MatchCount - 1
            MatchItem = Application.WorksheetFunction.Clean(MyMatches(m))
            MatchItem = Replace(MatchItem, "Addressed  :", "", 1, -1, vbTextCompare)
            MatchItem = Replace(MatchItem, "Identity", "", 1, -1, vbTextCompare)
            MatchItem = Replace(MatchItem, ";;", "", 1, -1, vbTextCompare)
            MatchItem = Trim(MatchItem)
            ws.Cells(ToRow, 1).Value = MyFile
            ws.Cells(ToRow, 2).Value = MatchItem
            ToRow = ToRow + 1
        Next
    End If
End Sub
'-------------------------------------------------------------------------------
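For comparison, the same Addressed/Identity extraction could be sketched in Python roughly as follows. The sample is the one posted earlier in the thread; the `clean` helper and the assumption that every line starts with a `;` marker are mine, so adjust them if the real files differ.

```python
import re

# Sample copied from the example posted above.
sample = """;       Addressed  : Act_5015 159/1, Dyn_156/1S, Dyn_557/1, Dyn_18/1
;                    Act_5025/1, Act_9158/1
;
;       Identity   : File01.TXT
"""

# Capture everything after "Addressed :" up to the "Identity" label.
pattern = re.compile(r"Addressed\s*:(.*?)Identity", re.S)

def clean(raw):
    """Strip the leading ';' marker and padding from each line, then join."""
    parts = [line.strip(" ;\t") for line in raw.splitlines()]
    return " ".join(p for p in parts if p)

addressed = [clean(m) for m in pattern.findall(sample)]
print(addressed)
```

Cleaning line by line avoids leaving a stray `;` in the middle of the result when the captured text wraps onto a second line.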
