Solved

Python: find position of string in entire file

Posted on 2012-04-05
Last Modified: 2012-04-23
Hi,

Below is a heavily simplified version of my code. I iterate over the lines of a file and try a regex on each line. When the regex matches, I want the exact start position of the match within the entire file. However, the code below doesn't give the correct position. Any ideas? Please note that my situation doesn't allow me to run a global regex over the entire file; I have to iterate over the lines.

file = open('file.txt').read()
pos = 0
match_pos = 0
for line in file.splitlines():
  match = re.search('function [^{]+?{', line)
  if match:
     match_pos = pos + match.start() #exact pos
     print match_pos
     print file.count('\n', 0, match_pos) #lineno
  
  
  pos = pos + len(line)

Question by:Dennie
7 Comments
 
LVL 41

Expert Comment

by:HonorGod
ID: 37813917
What do you mean by "the exact start position"?

Might your calculations be going wrong because of the newline characters at the end of each line?

Might something like this be what you want?

import re
pat = re.compile( 'function [^{]+?{', re.MULTILINE )
fh = open( 'file.txt', 'rb' )
data = fh.read()
fh.close()
for item in re.finditer( pat, data ) :
  print '%5d..%5d' % ( item.start(), item.end() )

 

Author Comment

by:Dennie
ID: 37813995
By exact position I mean the value I would get from:

m = re.search('searchsomething', file, flags=re.DOTALL)
m.start()

The m.start() here should equal the start position computed in the code of my first post (where I'm iterating over the lines).
Again, I have to iterate over all the lines!
 
LVL 41

Expert Comment

by:HonorGod
ID: 37814046
Are you looking for the offset from the beginning of the file (in bytes)?
Or are you looking for the offset from the beginning of each line?

Sorry for being confused.

The snippet of code that I supplied above displays the "exact" offset of the match, both the starting and ending position.

Author Comment

by:Dennie
ID: 37814076
"The snippet of code that I supplied above displays the "exact" offset of the match, both the starting and ending position."

Yes, but I have to iterate over the lines. Your example gives exactly the offset that I'm looking for, but I can't run a global regex over the file. I can only apply a regex to one line at a time.
 
LVL 41

Assisted Solution

by:HonorGod
HonorGod earned 250 total points
ID: 37814277
OK, so you want to iterate over the individual lines in the file, but obtain the offset from the beginning of the file.
Is that right?

Why do you need to iterate over the lines?

The regular expression can include the use of ^ to match the beginning of each line, so this may be the solution for which you are looking:

import re
pat = re.compile( '^function [^{]+?{', re.MULTILINE )
fh = open( 'file.txt', 'rb' )
data = fh.read()
fh.close()
for item in re.finditer( pat, data ) :
  print '%5d: "%s"' % ( item.start(), data[ item.start() : item.end() ] )
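
If a whole-file search turns out to be acceptable, the offsets it reports can be mapped back to a line number and a column afterwards. The following is only a sketch, written for Python 3 (the snippets in this thread are Python 2); the helper name offset_to_line_col and the sample text are made up for illustration and are not part of the code above.

```python
import re

def offset_to_line_col(text, offset):
    """Map a zero-based offset in text to (one-based line number, zero-based column)."""
    # Count the newlines that occur before the offset to find the line number.
    lineno = text.count('\n', 0, offset) + 1
    # rfind returns -1 when there is no earlier newline, so line_start becomes 0.
    line_start = text.rfind('\n', 0, offset) + 1
    return lineno, offset - line_start

# Hypothetical sample text, used only for this illustration.
sample = "a\nb\nc\nxyz function klm { something here }\nr\n"
pat = re.compile(r'function [^{]+?{')

positions = [(m.start(),) + offset_to_line_col(sample, m.start())
             for m in pat.finditer(sample)]
print(positions)
```

The count call is the same idea as the file.count('\n', 0, match_pos) line in the question, with 1 added so that line numbers start at one.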

 
LVL 28

Accepted Solution

by:pepr
pepr earned 250 total points
ID: 37814547
I guess that the code is meant for something like an editor, where only a "window of lines" is available to be searched. Is that correct?

I can see no problem with your basic idea. Try the following code:

a.py
import re

rexFunctionPattern = r'function [^{]+?{'
rexFunction = re.compile(rexFunctionPattern)

def getStartsByLines(fname, rex=rexFunction):
    with open(fname) as f:
        pos = 0
        match_pos = 0
        for lineno, line in enumerate(f):
            m = rex.search(line)  # use the rex argument, not the module-level pattern
            if m:
                match_pos = pos + m.start() #exact pos
                yield (lineno + 1, m.start(), match_pos)

            pos = pos + len(line)


def getStartsBlockReading(fname, pattern=rexFunctionPattern):
    with open(fname) as f:
        data = f.read()

    for m in re.finditer(pattern, data, flags=re.DOTALL):
        yield m.start()


for t in getStartsByLines('file.txt'):
    print t

print '-' * 30

for pos in getStartsBlockReading('file.txt'):
    print pos



With my sample of file.txt...
a
b
c
xyz function klm { something here }
r
t  function zzz { something here }
z



It prints on my console:

c:\tmp\_Python\Dennie\Q_27664512>a.py
(4, 4, 10)
(6, 3, 47)
------------------------------
10
47



The first generator returns the line number (one-based), the position within the line (zero-based here, but easily adjusted), and the character offset from the beginning of the file (zero-based). The second generator returns the zero-based character offsets found by re.finditer over a single block of data.

If the file is opened in text mode, it does not matter which newline convention is used.
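
As for why the loop in the question drifts: str.splitlines() drops the line terminators, so len(line) comes up short for every line that has already been passed. Iterating over the file object, as above, keeps the trailing '\n' on each line, so the running total stays correct. If the data has already been read into a string, asking splitlines to keep the line ends should have the same effect. A minimal sketch, assuming Python 3 (the thread itself uses Python 2); the names FUNCTION_RE and line_match_offsets are hypothetical:

```python
import re

FUNCTION_RE = re.compile(r'function [^{]+?{')

def line_match_offsets(text, rex=FUNCTION_RE):
    """Yield (lineno, column, file_offset) for the first match on each line."""
    pos = 0
    # keepends=True leaves the terminator on each line, so len(line)
    # also counts the newline characters and pos stays in step with text.
    for lineno, line in enumerate(text.splitlines(keepends=True), start=1):
        m = rex.search(line)
        if m:
            yield lineno, m.start(), pos + m.start()
        pos += len(line)

# The sample file.txt from this answer, as a string.
sample = ("a\nb\nc\nxyz function klm { something here }\n"
          "r\nt  function zzz { something here }\nz\n")

by_lines = [offset for _, _, offset in line_match_offsets(sample)]
whole_text = [m.start() for m in FUNCTION_RE.finditer(sample)]
print(by_lines == whole_text)
```

Because every character of the text is still present in some line, the sum of the line lengths equals the length of the text, and the per-line offsets agree with the ones from a single search over the whole string.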
 
LVL 41

Expert Comment

by:HonorGod
ID: 37880328
Thanks for the assist, and the points.

Good luck & have a great day.