python find position of string in entire file

Hi,

Below is a very simplified version of my code. I'm iterating over the lines of a file and trying a regex on every line. If the regex matches, I want the exact start position within the entire file. However, the code below doesn't produce the correct position. Any ideas? Please note that my situation doesn't allow me to use a global regex on the entire file! I have to iterate over the lines.

import re

file = open('file.txt').read()
pos = 0
match_pos = 0
for line in file.splitlines():
  match = re.search('function [^{]+?{', line)
  if match:
     match_pos = pos + match.start() #exact pos
     print match_pos
     print file.count('\n', 0, match_pos) #lineno
  
  
  pos = pos + len(line)


Asked by Dennie
 
pepr commented:
I guess the code is meant for something like an editor, where only a "window of lines" is available to be searched. Is that correct?

I can see no problem with your basic idea. Try the following code, which iterates over the file object so that each line keeps its trailing newline:

a.py
import re

rexFunctionPattern = r'function [^{]+?{'
rexFunction = re.compile(rexFunctionPattern)

def getStartsByLines(fname, rex=rexFunction):
    with open(fname) as f:
        pos = 0
        match_pos = 0
        for lineno, line in enumerate(f):
            m = rex.search(line)  # use the pattern passed in as the argument
            if m:
                match_pos = pos + m.start() #exact pos
                yield (lineno + 1, m.start(), match_pos)

            pos = pos + len(line)


def getStartsBlockReading(fname, pattern=rexFunctionPattern):
    with open(fname) as f:
        data = f.read()

    for m in re.finditer(pattern, data, flags=re.DOTALL):
        yield m.start()


for t in getStartsByLines('file.txt'):
    print t

print '-' * 30

for pos in getStartsBlockReading('file.txt'):
    print pos



With my sample of file.txt...
a
b
c
xyz function klm { something here }
r
t  function zzz { something here }
z



It prints on my console:

c:\tmp\_Python\Dennie\Q_27664512>a.py
(4, 4, 10)
(6, 3, 47)
------------------------------
10
47



The first generator returns the line number (one-based), the position within the line (zero-based here, but easily adjusted), and the character offset from the beginning of the file (zero-based). The second generator returns the zero-based character offsets by running re.finditer over the whole file as a single block of data.
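As a rough sketch of how the two results relate: an absolute offset, like those yielded by the second generator, can be turned back into a line number and column by counting newlines. The helper name `offset_to_line_col` is made up for illustration, and the sketch assumes the text uses '\n' line endings.

```python
def offset_to_line_col(data, offset):
    """Return (one-based line number, zero-based column) for an offset."""
    # The number of newlines before the offset gives the zero-based line index.
    lineno = data.count('\n', 0, offset) + 1
    # rfind returns -1 when there is no earlier newline, so +1 yields 0.
    line_start = data.rfind('\n', 0, offset) + 1
    return lineno, offset - line_start


# Same content as the sample file.txt shown above.
sample = ("a\nb\nc\n"
          "xyz function klm { something here }\n"
          "r\n"
          "t  function zzz { something here }\n"
          "z\n")
print(offset_to_line_col(sample, 10))  # (4, 4)
print(offset_to_line_col(sample, 47))  # (6, 3)
```

These agree with the first two fields of the tuples in the console output above.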

If the file is opened in text mode, line endings are normally translated to '\n' on reading, so it does not matter which newline convention the file uses.
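A hedged illustration of the newline remark: the sketch below writes a small file with Windows-style '\r\n' endings and compares the offsets seen in text mode and in binary mode. It uses `io.open` so that it should behave the same on Python 2 and 3; the file contents and the helper name `first_offset` are invented for the example.

```python
import io
import os
import re
import tempfile


def first_offset(data, pattern):
    """Offset of the first match of pattern in data, or None."""
    m = re.search(pattern, data)
    return m.start() if m else None


# Write raw bytes so the '\r\n' endings are stored exactly as given.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b"a\r\nb\r\nfunction foo {\r\n")

# Text mode: universal newlines translate '\r\n' to '\n' on read.
with io.open(path, 'r', encoding='ascii') as f:
    text_offset = first_offset(f.read(), u'function [^{]+?{')

# Binary mode: the bytes are returned untouched, '\r' included.
with io.open(path, 'rb') as f:
    byte_offset = first_offset(f.read(), b'function [^{]+?{')

os.remove(path)
print(text_offset)  # 4
print(byte_offset)  # 6
```

The two values differ by one for each preceding '\r\n', so whichever mode is chosen should be used for both the per-line search and the whole-file search.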
 
HonorGod (Software Engineer) commented:
What do you mean by "the exact start position"?

Might you be getting into trouble with your calculations because of the "newline" characters at the end of each line?

Might something like this be what you want?

import re
pat = re.compile( 'function [^{]+?{', re.MULTILINE )
fh = open( 'file.txt', 'rb' )
data = fh.read()
fh.close()
for item in re.finditer( pat, data ) :
  print '%5d..%5d' % ( item.start(), item.end() )
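On the newline question raised above: `str.splitlines()` drops the line terminators by default, so adding `len(line)` to a running position undercounts by one character for every '\n'. A minimal sketch of one possible fix is to pass `True` (keepends) so the terminators stay on each line. The function name `starts_by_lines` and the sample text are made up for illustration.

```python
import re

PATTERN = re.compile(r'function [^{]+?{')


def starts_by_lines(data, rex=PATTERN):
    """Search line by line, returning offsets relative to all of data."""
    starts = []
    pos = 0
    # keepends=True leaves '\n' on each line, so len(line) accounts for it.
    for line in data.splitlines(True):
        m = rex.search(line)  # first match on the line only
        if m:
            starts.append(pos + m.start())
        pos += len(line)
    return starts


sample = "a\nb\nxyz function klm {\nfunction zzz {\n"
whole_file = [m.start() for m in PATTERN.finditer(sample)]
print(starts_by_lines(sample) == whole_file)  # True
```

The comparison only holds for matches that fit on one line, with at most one match per line: `[^{]` also matches a newline, so a whole-file search can find a match spanning several lines that a per-line search cannot see.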


 
Dennie (author) commented:
By exact position I mean the value I would get from:

for m in re.finditer('searchsomething', file, flags=re.DOTALL):
    print m.start()

The m.start() in this code should equal the start position of a match in the code of my first post (where I'm iterating over the lines).
Again, I have to iterate over all the lines!
 
HonorGod (Software Engineer) commented:
Are you looking for the offset from the beginning of the file (in bytes)?
Or are you looking for the offset at the beginning of each line?

Sorry for being confused.

The snippet of code that I supplied above displays the "exact" offset of the match, both the starting and ending position.
 
Dennie (author) commented:
"The snippet of code that I supplied above displays the "exact" offset of the match, both the starting and ending position."

Yes, but I have to iterate over the lines. Your example gives exactly the match position I'm looking for, but I can't run a global regex over the file. I can only apply a regex to each line.
 
HonorGod (Software Engineer) commented:
OK, so you want to iterate over the individual lines in the file, but obtain the offset from the beginning of the file.
Is that right?

Why do you need to iterate over the lines?

The regular expression can include the use of ^ to match the beginning of each line, so this may be the solution for which you are looking:

import re
pat = re.compile( '^function [^{]+?{', re.MULTILINE )
fh = open( 'file.txt', 'rb' )
data = fh.read()
fh.close()
for item in re.finditer( pat, data ) :
  print '%5d: "%s"' % ( item.start(), data[ item.start() : item.end() ] )
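One caveat with the anchored pattern, shown as a small sketch: with `re.MULTILINE`, `^` only matches at the start of a line, so a `function` that appears later in a line is skipped. The sample text below is the file.txt content posted earlier in this thread.

```python
import re

sample = ("a\nb\nc\n"
          "xyz function klm { something here }\n"
          "r\n"
          "t  function zzz { something here }\n"
          "z\n")

anchored = re.compile(r'^function [^{]+?{', re.MULTILINE)
unanchored = re.compile(r'function [^{]+?{')

# Neither sample line starts with "function", so the anchored list is empty.
print([m.start() for m in anchored.finditer(sample)])    # []
print([m.start() for m in unanchored.finditer(sample)])  # [10, 47]
```

Whether the anchor is wanted depends on whether matches are expected only at the beginning of a line.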


 
HonorGod (Software Engineer) commented:
Thanks for the assist, and the points.

Good luck & have a great day.
Question has a verified solution.
