State machine to process a text file in Python

Hi all!

I have a script that takes care of deleting records from a text file.

Each file has this structure:

1st important row
random stuff random stuff T0738023 random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff T7738024 random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff T2738025 random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff



As you can see the T string identifies the beginning and the end of a record.

Just as an example here's the record "T7738024":

random stuff random stuff T7738024  random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff



In this script I want to ask the user which records to delete and then remove all the selected records from the text file, keeping the 1st row and all the other records exactly as they are (including spaces and invisible characters).

The script already asks the user for the data and handles the files, copying and renaming them several times.

Here I paste only the "hot" part:

import shutil, os

def sbloccaRicevuteDoppie(lotto_da_sbloccare, ricevute_doppie):
        percorso_file_lotto = lotto_da_sbloccare
        percorso_file_lotto_lot = lotto_da_sbloccare+".LOT"
        percorso_file_lotto_lox = lotto_da_sbloccare+".LOX"
        output = []
        state = 'KEEP'
        numeri_ricevute = ricevute_doppie
        stop_markers = [ 'T%d' % val for val in xrange(10) ]

        with open(percorso_file_lotto_lox) as fin:
                line1 = fin.next() # keep first line unconditionally
                output.append(line1)
                for line in fin:
                        if state == 'KEEP':
                                for numero_ricevuta in numeri_ricevute:
                                        if numero_ricevuta in line:
                                                state = 'DISCARD'
                                                numeri_ricevute.remove(numero_ricevuta)
                                                print numeri_ricevute
                                                break
                                        else:
                                                state = 'KEEP'
                                                output.append(line)
                                                break
                        elif state == 'DISCARD':
                                for stop_marker in stop_markers:
                                        if stop_marker in line:
                                                for numero_ricevuta in numeri_ricevute:
                                                        if numero_ricevuta in line:
                                                                state = 'DISCARD'
                                                                break
                                                        elif numero_ricevuta not in line:
                                                                state = 'KEEP'
                                                                output.append(line)
                                                                break

        with open(percorso_file_lotto_lot, 'w') as fout:
                for line in output:
                        fout.write(line)


lotto_da_sbloccare = '/home/pitto/scripts/spaccalotti/test'
ricevute_doppie = []
ricevute_doppie.append("T0738023")
ricevute_doppie.append("T7738024")
ricevute_doppie.append("T2738025")
sbloccaRicevuteDoppie (lotto_da_sbloccare, ricevute_doppie)



By commenting out some of the append rows I can simulate different user input.

The trouble is that I can recognize the start of a new record from the T followed by a number, but only complete records whose mark is in the ricevute_doppie list should be deleted.

I can't think of any solution other than the nested for loops you see in the 'DISCARD' state...

But it doesn't work correctly if I enter more than one record to delete :(
ltpittAsked:

pepr commented:
Just a few comments (I will post the code a bit later):

The for loops used to detect the patterns may be inefficient. The more patterns are added, the more it pays to use a regular expression.

My experience is that a finite automaton for processing text should use plain numbered (that is, anonymous) states. This becomes more apparent when you end up with many states. Readable state names just force you to change more things when you need to modify the automaton. It is better to add a comment than to give the state a readable value.
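
As a minimal sketch of the regular-expression idea: the marker list and the sample lines below are made up for illustration, and re.escape is an extra precaution that is not part of the suggestion above.

```python
import re

# Hypothetical marker list; in the question these come from the user.
wanted = ['T0738023', 'T2738025']

# One alternation pattern replaces the inner loop over the markers.
# re.escape protects against markers that contain regex metacharacters.
rex_wanted = re.compile('|'.join(re.escape(w) for w in wanted))

# Made-up sample lines in the shape shown in the question.
lines = [
    'random stuff random stuff T0738023 random stuff\n',
    'random stuff random stuff random stuff random stuff\n',
    'random stuff random stuff T7738024 random stuff\n',
]

# True for every line that contains one of the wanted markers.
matches = [rex_wanted.search(line) is not None for line in lines]
print(matches)
```

Note that an empty marker list would produce an empty pattern, which matches every line, so that case probably needs a separate check.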
ltpitt (author) commented:
You gave me this suggestion in a previous answer (http://www.experts-exchange.com/Programming/Languages/Scripting/Python/Q_28222688.html) and I'm sorry to waste your time again.

It's not that I didn't read it; it looked quite difficult to me, so I decided to stick with something more readable for my basic skills.

I think I have to study it better...
pepr commented:
Try the following code:
import os
import re
import shutil


def sbloccaRicevuteDoppie(name, *wanted):
    lox_fname = name + '.LOX'
    lot_fname = name + '.LOT'

    rexWanted = re.compile('|'.join(wanted))
    rexMarked = re.compile(r'\sT\d+\s')  # T followed by digits, surrounded by whitespace

    with open(lox_fname) as fin, open(lot_fname, "w") as fout:

        status = 0  # initial state of the finite automaton

        for line in fin:
            if status == 0:             # keep the first line
                fout.write(line)
                status = 1

            elif status == 1:           # wait for the wanted record
                m = rexWanted.search(line)
                if m is not None:
                    fout.write(line)    # starting line
                    status = 2

            elif status == 2:           # lines of the wanted record
                # This line may carry another mark, which is either wanted
                # or unwanted. A wanted mark also matches the general Tnnnn
                # pattern, so we must test for wanted first and only then for
                # the general mark. If neither matches, the line belongs to
                # the record already being collected.
                m = rexWanted.search(line)
                if m is not None:
                    fout.write(line)    # starting line of another wanted immediately
                    #status = 2         i.e. keep the same status
                else:
                    m = rexMarked.search(line)
                    if m is not None:
                        status = 1       # started unwanted section, go to "wait" status
                    else:                # still belonging to the collected record
                        fout.write(line) # ... collect it


if __name__ == '__main__':
    sbloccaRicevuteDoppie('./test', 'T0738023', 'T2738025')


If you remove the star before wanted, you can pass a list as an argument.
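
A small sketch of the difference, using two hypothetical stand-in functions (with_star and with_list are not part of the script, they only return what they receive):

```python
def with_star(name, *wanted):
    # With the star, the extra positional arguments arrive as a tuple.
    return list(wanted)


def with_list(name, wanted):
    # Without the star, the caller passes one list (or other sequence).
    return list(wanted)


marks = ['T0738023', 'T2738025']

a = with_star('./test', 'T0738023', 'T2738025')  # marks as separate arguments
b = with_star('./test', *marks)                  # unpack an existing list at the call
c = with_list('./test', marks)                   # pass the list itself

print(a == b == c)
```

So an existing list can also be passed to the starred version by unpacking it with a star at the call site.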
ltpitt (author) commented:
I tried it, but I get:

  File "doppia.py", line 13
    with open(lox_fname) as fin, open(lot_fname, "w") as fout:
                               ^
SyntaxError: invalid syntax
pepr commented:
Can you attach the exact file? I ran the same code and did not see the error. Which Python version are you using?

Please try the attached a.py, which worked for me. Change only the path on the last line.
a.py
ltpitt (author) commented:
I get the same error with your file...

Maybe my Python is too old (it is the one that ships with the latest CrunchBang Linux)?

Python 2.6.6 (r266:84292, Dec 27 2010, 10:20:06)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
pepr commented:
The with statement exists in Python 2.6, but probably not in that form: as far as I know, several context managers in a single with were only added in 2.7. Try the following instead:
import os
import re
import shutil


def sbloccaRicevuteDoppie(name, *wanted):
    lox_fname = name + '.LOX'
    lot_fname = name + '.LOT'

    rexWanted = re.compile('|'.join(wanted))
    rexMarked = re.compile(r'\sT\d+\s')  # T followed by digits, surrounded by whitespace

    fin = open(lox_fname)
    fout = open(lot_fname, "w")

    status = 0  # initial state of the finite automaton

    for line in fin:
        if status == 0:             # keep the first line
            fout.write(line)
            status = 1

        elif status == 1:           # wait for the wanted record
            m = rexWanted.search(line)
            if m is not None:
                fout.write(line)    # starting line
                status = 2

        elif status == 2:           # lines of the wanted record
            # This line may carry another mark, which is either wanted
            # or unwanted. A wanted mark also matches the general Tnnnn
            # pattern, so we must test for wanted first and only then for
            # the general mark. If neither matches, the line belongs to
            # the record already being collected.
            m = rexWanted.search(line)
            if m is not None:
                fout.write(line)    # starting line of another wanted immediately
                #status = 2         i.e. keep the same status
            else:
                m = rexMarked.search(line)
                if m is not None:
                    status = 1       # started unwanted section, go to "wait" status
                else:                # still belonging to the collected record
                    fout.write(line) # ... collect it

    fout.close()
    fin.close()
    

if __name__ == '__main__':
    sbloccaRicevuteDoppie('./test', 'T0738023', 'T2738025')
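
The automaton above writes the wanted records to the output, while the question describes deleting the selected records and keeping the rest. If the opposite behaviour is needed, the same states can be reused with the writes flipped. The following is only a sketch under assumptions: drop_records is a hypothetical name, it works on a list of lines rather than on files, and it assumes every record starts with a line containing a whitespace-delimited Tnnnn mark.

```python
import re


def drop_records(lines, unwanted):
    """Return the lines with the records marked by `unwanted` removed."""
    if not unwanted:
        return list(lines)  # an empty pattern would match every line

    rex_unwanted = re.compile('|'.join(re.escape(u) for u in unwanted))
    rex_mark = re.compile(r'\sT\d+\s')  # any record mark

    out = []
    status = 0  # 0: first line, 1: inside a kept record, 2: inside a dropped record
    for line in lines:
        if status == 0:                  # keep the first line unconditionally
            out.append(line)
            status = 1
        elif rex_unwanted.search(line):  # start of a record to drop
            status = 2
        elif rex_mark.search(line):      # start of a record to keep
            out.append(line)
            status = 1
        elif status == 1:                # body line of a kept record
            out.append(line)
        # otherwise: body line of a dropped record, skip it
    return out


# Made-up sample in the shape shown in the question.
sample = [
    '1st important row\n',
    'stuff stuff T0738023 stuff\n',
    'body a\n',
    'stuff stuff T7738024 stuff\n',
    'body b\n',
    'stuff stuff T2738025 stuff\n',
    'body c\n',
]
kept = drop_records(sample, ['T0738023', 'T2738025'])
print(''.join(kept))
```

As in the original, the unwanted test comes before the general mark test, because an unwanted mark also matches the general pattern.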

ltpitt (author) commented:
You are a volcano...

This time the script runs without errors, but the final file doesn't contain the records I asked for (I didn't change a single line of it).
pepr commented:
I used the sample from the question, but removed one of the marks, the middle one.

The problem is that detection of the line with the mark depends on the content of the other lines. Can you describe more precisely (ideally with an example) what does not work? Which expected records were not extracted, and which wanted ones were thrown away?
ltpitt (author) commented:
Maybe my example was bad...

Here's the real data:

http://pitto.homeip.net/example.lox
pepr commented:
The sample above contains the marks T0738023, T7738024, and T2738025; however, example.lox contains the marks T0738023, T0738024, and T0738025.

Notice the digit right after the T (7 and 2 in the sample, 0 in the real file). Fix the arguments in the function call.
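
To catch this kind of mismatch before running the script, the marks that actually occur in the data can be listed first. A sketch, assuming the same mark pattern as in the script; find_marks is a hypothetical helper and the lines below are made up in the shape of the real file:

```python
import re


def find_marks(lines):
    """Collect every whitespace-delimited Tnnnn mark, in order of appearance."""
    rex_mark = re.compile(r'\sT\d+\s')
    marks = []
    for line in lines:
        m = rex_mark.search(line)
        if m is not None:
            marks.append(m.group().strip())
    return marks


# Made-up lines carrying the marks reported for example.lox.
lines = [
    '1st important row\n',
    'random stuff random stuff T0738023 random stuff\n',
    'random stuff random stuff random stuff random stuff\n',
    'random stuff random stuff T0738024 random stuff\n',
    'random stuff random stuff T0738025 random stuff\n',
]

present = find_marks(lines)

# The arguments taken from the sample in the question.
wanted = ['T0738023', 'T7738024', 'T2738025']
missing = [w for w in wanted if w not in present]
print(present)
print(missing)
```

Any mark reported as missing will never match, so its record is silently skipped.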
ltpitt (author) commented:
Hi, it's me: the idiot.

Thanks a lot for all your kind help.

I'll try to learn from your script and write better.

God bless your time, skill and patience.
ltpitt (author) commented:
A perfect solution for my need, perfectly commented, and you refrained from killing me for being stupid.

What else?
pepr commented:
There is no stupid question ;)
ltpitt (author) commented:
The question, I admit, wasn't so stupid.

My testing of your script wasn't stupid, it was idiotic :)

Thanks again for the valuable, well-commented code.