Solved

Find Last Occurrence of a String and Output That and the Line Before It

Posted on 2013-11-09
387 Views
Last Modified: 2013-11-20
I've gone round in circles for hours on this.

I have a Python script that acquires data and updates a MySQL database. The final stage of this script is to search for the last occurrence of a string in a file (which could be up to 300 MB in size) and store that line and the line immediately before it in a string for me to manipulate a little before updating the database.

An example of the text file structure is:

2013-11-09 12:00:06 CTxMemPool::accept() : accepted 00554c4704140c0152fa55c8d99fb340feed8ebb55d2d440b094042465e578fb (poolsz 610)
2013-11-09 12:00:06 ThreadRPCServer method=sendrawtransaction
2013-11-09 12:00:06 ERROR: Non-canonical signature: R value excessively padded
2013-11-09 12:00:06 ERROR: CScriptCheck() : 0067dc2d8de20b8db292468e59481acb03eae93e392f98bc47b7e5fbb29ad08c VerifySignature failed
2013-11-09 12:00:06 ERROR: CTxMemPool::accept() : ConnectInputs failed 0067dc2d8de20b8db292468e59481acb03eae93e392f98bc47b7e5fbb29ad08c
2013-11-09 12:00:06 ThreadRPCServer method=sendrawtransaction
2013-11-09 12:00:06 CTxMemPool::accept() : accepted 023d313238d2876b043a0a332474bb1f0dead85722a655994a5e2a30ce600bb9 (poolsz 611)
2013-11-09 12:00:06 ThreadRPCServer method=sendrawtransaction
2013-11-09 12:00:06 CTxMemPool::accept() : accepted 0292d7d1a6f255d6bf72b8f808af7154bf68ecc53fd87503d2bb3b76976b204e (poolsz 612)

In the above example, I'd be looking for the last instance of "0067dc2d8de20b8db292468e59481acb03eae93e392f98bc47b7e5fbb29ad08c" and the line before it, so the output I am looking for is:

2013-11-09 12:00:06 ERROR: CScriptCheck() : 0067dc2d8de20b8db292468e59481acb03eae93e392f98bc47b7e5fbb29ad08c VerifySignature failed
2013-11-09 12:00:06 ERROR: CTxMemPool::accept() : ConnectInputs failed 0067dc2d8de20b8db292468e59481acb03eae93e392f98bc47b7e5fbb29ad08c

What is the most efficient way of getting this data, especially as the file could grow to 300 MB?

I've played with grep and tac etc., which get me close, but I need to be able to do it in Python as part of this script.

grep attempt was: tac debug.log | grep 0067dc2d8de20b8db292468e59481acb03eae93e392f98bc47b7e5fbb29ad08c -B1 -m2 | tac

Thanks

James
Question by:Delerium1978
22 Comments
 
LVL 28

Expert Comment

by:pepr
For huge files, you must not store everything in memory; you have to do it on the fly. Just open the file for reading, loop through all the lines, and remember only the best candidates for the last matching line (together with the line before it).

To emulate grep, use the standard re module that implements regular expression functionality. Try the following code:
#!python3
import re

fname = 'data.txt'
pattern = '0067dc2d8de20b8db292468e59481acb03eae93e392f98bc47b7e5fbb29ad08c'

rex = re.compile(pattern)
storedPrevious = None
storedLine = None
with open(fname) as f:
    previousLine = None
    for line in f:
        if rex.search(line) is not None:
            # Later matches overwrite earlier ones, so the last occurrence wins.
            storedPrevious = previousLine
            storedLine = line
        previousLine = line

if storedLine is not None:
    if storedPrevious is not None:
        print(storedPrevious, end='')
    print(storedLine, end='')

 

Author Comment

by:Delerium1978
Thanks pepr. The worry with the approach you've suggested is that it could take a very long time, as I would be looping through the file for about 1000 patterns.

I obviously need to test this, which I'll do in the morning, but I fear it might not be slick enough. Is it worth somehow reading the whole file into memory while it gets the results and updates the database, then unloading it somehow?

James
 
LVL 28

Expert Comment

by:pepr
Do you have a database of the patterns (say in a text file)?

There may be several approaches to solving this. It also depends on the future use of the application -- i.e. whether it is to be used regularly or once, etc.

First, you could build a structure of the patterns, compile the many regular expressions, and then compare each line of the text file against all of them in a single pass. That would probably be more efficient than looping through the text file 1000 times.

It may also be more efficient to preprocess the text file into a database with keys equal to the identifiers (and with indexes), and join the records with the table of "patterns" -- the searched identifiers (also indexed). This solution is better if the database is to be used more than once. Python has the sqlite3 standard module, which would probably be good enough for the purpose.
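To make the sqlite3 idea concrete, here is a minimal sketch; the id-extracting regular expression, the table layout, and the shortened ids are all assumptions for illustration:

```python
import re
import sqlite3

# Hypothetical sample lines; in practice these would be read from debug.log.
lines = [
    "2013-11-09 12:00:06 ThreadRPCServer method=sendrawtransaction",
    "2013-11-09 12:00:06 ERROR: CScriptCheck() : 0067dc2d VerifySignature failed",
]

# Assumed shape of an id: a run of lowercase hex digits (64 chars in the
# real log; shortened to 8 in this sample).
id_rex = re.compile(r'\b[0-9a-f]{8,64}\b')

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE log (id TEXT, previous TEXT, line TEXT)')

# Preprocess: keep only the lines carrying an id, together with the
# line just before them.
previous = ''
for line in lines:
    m = id_rex.search(line)
    if m is not None:
        conn.execute('INSERT INTO log VALUES (?, ?, ?)',
                     (m.group(0), previous, line))
    previous = line
conn.execute('CREATE INDEX idx_log_id ON log (id)')

# Last occurrence of a given id: the highest rowid wins.
row = conn.execute('SELECT previous, line FROM log WHERE id = ? '
                   'ORDER BY rowid DESC LIMIT 1', ('0067dc2d',)).fetchone()
```

With an index on the id column, each of the ~1000 lookups becomes a quick B-tree search instead of another pass over the whole file.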
 

Author Comment

by:Delerium1978
Hi pepr, the patterns come from a MySQL DB, and the plan is for me to crontab this script every 20 minutes or so, so it's definitely going to be used regularly.

The patterns in the db change every 20 mins.
 
LVL 28

Expert Comment

by:pepr
Are the patterns (the numeric IDs) also stored as keys, or do you have to search for them in text fields?
 

Author Comment

by:Delerium1978
The patterns are unique, stored in their own field, and are the primary key of the table, although they're alphanumeric.

Edit: I 'could' also add a numeric autonumber key if that helps any?
 
LVL 28

Expert Comment

by:pepr
What ID does the "previous line" have for the record with the searched pattern? How are the lines ordered?

It should be quite easy to get the records with the pattern via a JOIN in the SQL query. If the "previous record" is bound to the matching one only by its position, then an autonumber may be a good way to get the previous record.

I would also recommend asking a moderator to add the MySQL zone to this question. Also, you should show the structure of the tables and a test example so that an SQL query can be designed for your purpose.

I definitely would not recommend a sequential pass through all the records. Any reasonable database engine will be much better at the search.

Have a look at http://sourceforge.net/projects/mysql-python/ for Python interface module to MySQL.
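As a sketch of that JOIN (using the standard sqlite3 module so it is self-contained; the table and column names are invented, and the equivalent MySQL query would have the same shape), the autonumber column lets you reach the immediately preceding line with a self-join on seq - 1:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE log (seq INTEGER PRIMARY KEY, txid TEXT, line TEXT);
    CREATE TABLE patterns (txid TEXT PRIMARY KEY);
''')
# seq is assigned automatically (1, 2, ...), preserving the file order.
conn.executemany('INSERT INTO log (txid, line) VALUES (?, ?)', [
    (None,       'ThreadRPCServer method=sendrawtransaction'),
    ('0067dc2d', 'ERROR: CScriptCheck() : 0067dc2d VerifySignature failed'),
])
conn.execute('INSERT INTO patterns VALUES (?)', ('0067dc2d',))

# Join the pattern table to the log, then self-join on seq - 1 to pull
# the immediately preceding line; the highest seq is the last occurrence.
row = conn.execute('''
    SELECT prev.line, cur.line
    FROM log AS cur
    JOIN patterns AS p ON p.txid = cur.txid
    LEFT JOIN log AS prev ON prev.seq = cur.seq - 1
    ORDER BY cur.seq DESC LIMIT 1
''').fetchone()
```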
 

Author Comment

by:Delerium1978
Sorry, I think I've confused you. The MySQL table contains a list of transaction IDs like:

pattern = '0067dc2d8de20b8db292468e59481acb03eae93e392f98bc47b7e5fbb29ad08c'

then there is the text file which has:

2013-11-09 12:00:06 CTxMemPool::accept() : accepted 00554c4704140c0152fa55c8d99fb340feed8ebb55d2d440b094042465e578fb (poolsz 610)
2013-11-09 12:00:06 ThreadRPCServer method=sendrawtransaction
2013-11-09 12:00:06 ERROR: Non-canonical signature: R value excessively padded
2013-11-09 12:00:06 ERROR: CScriptCheck() : 0067dc2d8de20b8db292468e59481acb03eae93e392f98bc47b7e5fbb29ad08c VerifySignature failed
2013-11-09 12:00:06 ERROR: CTxMemPool::accept() : ConnectInputs failed 0067dc2d8de20b8db292468e59481acb03eae93e392f98bc47b7e5fbb29ad08c
2013-11-09 12:00:06 ThreadRPCServer method=sendrawtransaction
2013-11-09 12:00:06 CTxMemPool::accept() : accepted 023d313238d2876b043a0a332474bb1f0dead85722a655994a5e2a30ce600bb9 (poolsz 611)
2013-11-09 12:00:06 ThreadRPCServer method=sendrawtransaction
2013-11-09 12:00:06 CTxMemPool::accept() : accepted 0292d7d1a6f255d6bf72b8f808af7154bf68ecc53fd87503d2bb3b76976b204e (poolsz 612)

For each transaction ID in the MySQL database table, I need to look in this text file, find the last instance, and return it and the line immediately before that match. Is this a better explanation? (Sorry, I'm not the best at explaining.)

James
 
LVL 28

Expert Comment

by:pepr
Hi James. No problem. How often does the 300+ MB file change? If it is a full history kept for a long time, it would probably be good to transform the text file into a database table. This can also be done from Python.

When things get too difficult to do in one step, it is time to split the problem into two simpler ones, or to put an intermediary in the middle.

Basically, this kind of searching is such a case. A database engine will always be better at it. However, you pay for it with extra disk space for the database. On the other hand, 1 GB is nothing today, unless you want to use a handheld device.

If the text file is transformed into a database, you can get rid of records that you do not actually need for the purpose, i.e. the database can collect only the lines with the ID and the previous line. Can the lines with the ID be detected easily without knowing the ID? For example, do they always contain "ERROR: CScriptCheck()" or "VerifySignature failed" or similar?

What version of Python do you use? (Some things may be simpler in Python 3.1+.)
 

Author Comment

by:Delerium1978
Hi pepr

The text file changes in real time, but I only run my scripts every 20 minutes, so I would only need to refresh any database table every 20 minutes too.

Python 2.7.5+ is what I'm using; unfortunately, due to some old APIs I'm using, I can't upgrade right now.

Disk space shouldn't be an issue, although I'd have to install MySQL locally (I'm currently outputting to an external MySQL DB, which would have potential bandwidth issues if I'm refreshing a 300 MB table every 20 minutes).

Unfortunately, the lines with the ID could be anything (even messages I'm not aware of). I guess it could be possible for me to compile a list of lines that it would never be on.
 
LVL 28

Expert Comment

by:pepr
Python 2.7 is also fine for the purpose. The idea I did not tell you about is to read the text file only from the position just after the last one read. (I thought that tell()/seek() could not be used for text files, but it is possible even in older Python versions.)

The idea is to remember (store persistently) the last processed position in the text file (f.tell() on the file object returns the position). Next time, after the 20 minutes, you can f.seek() to that position and process only the newer records in the file. This way, you need to process only as many lines as were generated during the 20 minutes.

The new lines should be used to update the database table with the log lines. I guess there will not be a lot of them (from the technical point of view).

Does the "numeric" (string) ID have any other special characteristics, like a fixed width, particular words around it, etc.?
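A small sketch of the checkpoint idea, using a throwaway file so it is self-contained; the shrink test as a way of spotting a purged or rotated log is an assumption:

```python
import os
import tempfile

# Build a throwaway "log" to stand in for debug.log.
log_path = os.path.join(tempfile.mkdtemp(), 'debug.log')
with open(log_path, 'w') as f:
    f.write('first run line 1\nfirst run line 2\n')

# First run: read everything and remember where we stopped.
with open(log_path) as f:
    first_pass = f.readlines()
    checkpoint = f.tell()   # persist this value between runs

# The log grows between runs...
with open(log_path, 'a') as f:
    f.write('second run line\n')

# Next run (20 minutes later): if the file is now smaller than the
# checkpoint, assume it was purged and start over; otherwise seek to
# the checkpoint and read only the new lines.
with open(log_path) as f:
    f.seek(0, os.SEEK_END)
    if f.tell() < checkpoint:
        checkpoint = 0
    f.seek(checkpoint)
    new_lines = f.readlines()
```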

Author Comment

by:Delerium1978
Not really. The seek approach might not be a solution either, because occasionally the server needs a restart (every couple of days), so the log file is purged.

I fiddled about last night with subprocess:

import subprocess

# row[0] is the transaction id fetched from the MySQL cursor.
proggie = "tac /home/james/.bitcoin/debug.log | grep -e %s --after-context=1 -m2 | tac" % row[0]
resp = subprocess.check_output(proggie, shell=True)

Apart from it giving me three lines from the grep, and some pipe and tac errors, it did seem to include the information I was looking for (though I do want to understand the errors more). Speed was actually OK too, as it only took 30 s to process ~200 records against the file.

What are your thoughts on this approach?

James
 
LVL 28

Expert Comment

by:pepr
You can always store the position together with the line content to detect whether the log file was drastically changed since the last processing.

The truth is that grep may be faster than Python's regular expressions (see http://swtch.com/~rsc/regexp/regexp1.html). Anyway, you want to search for many patterns; that would mean many passes over the file, and that is the source of the inefficiency.

Did you try to do the same with Python regular expressions?

Do you want to find all the lines for the patterns, or only the last pair for any of the patterns?
 

Author Comment

by:Delerium1978
Admittedly, regular expressions are my nemesis; I just cannot get my head around them.

The search is to find the last entry for a specific pattern and return it and the line before it, then loop to the next pattern, etc.
 
LVL 28

Expert Comment

by:pepr
Try the following script:
#!python2.7

import re

# Get all the patterns into a list from the database (here fixed).
patterns = ['0067dc2d8de20b8db292468e59481acb03eae93e392f98bc47b7e5fbb29ad08c',
            '0292d7d1a6f255d6bf72b8f808af7154bf68ecc53fd87503d2bb3b76976b204e',
           ]

# Compile one regular expression from all of the patterns.
rex = re.compile('|'.join(patterns))

# Initialize the dictionary of last-occurrence pairs (previous, line), where
# the pattern is the key and the pair is the value.
d = {}

# The processed log filename.
fname = 'debug.log'

# Open the log file for reading; always remember the previous line as well.
with open(fname) as f:
    previous = None
    for line in f:
        # Check whether one of the patterns is inside.
        m = rex.search(line)
        if m is not None:
            # One of the patterns was found. We can extract it as group zero.
            k = m.group(0)

            # Remember or overwrite the lines for the pattern. If the match
            # is on the very first line, use an empty string as the previous.
            d[k] = (previous if previous is not None else '', line)

        # This line will be the previous one in the next loop iteration.
        previous = line


# Report the last occurrences for the patterns. This could also be appended
# to a file with some timestamp information, etc.
with open(fname + '.out', 'w') as f:
    for k, (previous, line) in d.iteritems():
        f.write(('-' * 50) + '\n')
        f.write(k + '\n')
        f.write('\t' + previous)
        f.write('\t' + line + '\n')

# The same output to display on the console.
#        print '-' * 50
#        print k
#        print '\t' + previous,
#        print '\t' + line

For the sample from the question, it writes the following to the debug.log.out file:
--------------------------------------------------
0292d7d1a6f255d6bf72b8f808af7154bf68ecc53fd87503d2bb3b76976b204e
	2013-11-09 12:00:06 ThreadRPCServer method=sendrawtransaction
	2013-11-09 12:00:06 CTxMemPool::accept() : accepted 0292d7d1a6f255d6bf72b8f808af7154bf68ecc53fd87503d2bb3b76976b204e (poolsz 612)

--------------------------------------------------
0067dc2d8de20b8db292468e59481acb03eae93e392f98bc47b7e5fbb29ad08c
	2013-11-09 12:00:06 ERROR: CScriptCheck() : 0067dc2d8de20b8db292468e59481acb03eae93e392f98bc47b7e5fbb29ad08c VerifySignature failed
	2013-11-09 12:00:06 ERROR: CTxMemPool::accept() : ConnectInputs failed 0067dc2d8de20b8db292468e59481acb03eae93e392f98bc47b7e5fbb29ad08c

 

Author Comment

by:Delerium1978
Comment Utility
I'll give it a whirl when I get home later :) With the grep solution I proposed, I guess it was quite quick because I used 'tac', whereas the solution you posted starts at the beginning and loops through everything until it gets to the last match. I'll compare and get back to you.
 
LVL 28

Expert Comment

by:pepr
Comment Utility
Well, I do not know the tac implementation. It is an interesting idea. I will try to find a similar Python solution.
 
LVL 28

Accepted Solution

by:
pepr earned 500 total points
Try the following script, which also simulates the tac functionality. The code for reversing the lines is not mine; the author and the reference are mentioned in the comment. (I did not test it heavily.)
#!python2.7

import os
import re

# Get all the patterns as a set from the database (here fixed) -- notice
# the curly braces.
patterns = {'0067dc2d8de20b8db292468e59481acb03eae93e392f98bc47b7e5fbb29ad08c',
            '0292d7d1a6f255d6bf72b8f808af7154bf68ecc53fd87503d2bb3b76976b204e',
           }

# Compile one regular expression from all of the patterns.
rex = re.compile('|'.join(patterns))

# Initialize the dictionary of last-occurrence pairs (previous, line), where
# the pattern is the key and the pair is the value.
d = {}

# The processed log filename.
fname = 'debug.log'


# The following two functions reverse the file lines.
# (by Darius Bacon, http://stackoverflow.com/a/260433/1346705)
def reversed_lines(f):
    """Generate the lines of file in reverse order."""
    part = ''
    for block in reversed_blocks(f):
        for c in reversed(block):
            if c == '\n' and part:
                yield part[::-1]
                part = ''
            part += c
    if part: yield part[::-1]

def reversed_blocks(f, blocksize=4096):
    """Generate blocks of file's contents in reverse order."""
    f.seek(0, os.SEEK_END)
    here = f.tell()
    while 0 < here:
        delta = min(blocksize, here)
        here -= delta
        f.seek(here, os.SEEK_SET)
        yield f.read(delta)


# Read the lines in reverse order. The first time a pattern is seen is its
# last occurrence in the file; the next reversed line is the one just
# before it in file order.
with open(fname) as f:
    pending = None   # pattern whose previous line is still needed
    theLine = None   # the line where the pending pattern was found

    for line in reversed_lines(f):
        if pending is not None:
            # This line immediately precedes the matched one in the file.
            d[pending] = (line, theLine)
            pending = None
            if not patterns:
                # Nothing left to search for.
                break

        if patterns:
            m = rex.search(line)
            if m is not None:
                # One of the patterns was found; extract it as group zero.
                k = m.group(0)
                theLine = line

                # Remove the pattern from the set and rebuild the regular
                # expression from the remaining ones.
                patterns.remove(k)
                if patterns:
                    rex = re.compile('|'.join(patterns))
                pending = k

    # A pattern matched on the very first line of the file has no previous
    # line; use an empty string instead.
    if pending is not None:
        d[pending] = ('', theLine)


# Report the last occurrences for the patterns. This could also be appended
# to a file with some timestamp information, etc.
with open(fname + '.out', 'w') as f:
    for k, (previous, line) in d.iteritems():
        f.write(('-' * 50) + '\n')
        f.write(k + '\n')
        f.write('\t' + previous)
        f.write('\t' + line + '\n')

 

Author Comment

by:Delerium1978
Still not had a chance to try it; hopefully I'll get time tomorrow night. Thanks again.
 
LVL 28

Expert Comment

by:pepr
Did it work? Can you compare the times of the solutions?
 

Author Comment

by:Delerium1978
Comment Utility
Sorry pepr, when I tried it, it took ~30 seconds to run and then returned no results. Then I got very busy with some work for the next two months, so I'm going to have to park this one for now.

Rather than keep you hanging, I gave your script the solution credit, as I'm sure it will be close to what I need when I get time to look at this again.

Many thanks for your efforts.

James
 
LVL 28

Expert Comment

by:pepr
The result should be stored in the file with the .out extension.

OK, feel free to continue here to fix the solution.
