State machine to process a text file in Python

Hi all!

I have a script that takes care of deleting records from a text file.

Each file has this structure:

1st important row
random stuff random stuff T0738023 random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff T7738024 random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff T2738025 random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff

As you can see, a T string marks the beginning of a record, and the next T string marks its end.

As an example, here's the record "T7738024":

random stuff random stuff T7738024  random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff

In this script I want to ask the user which records to delete and then remove all the selected records from the text file, keeping the 1st row and all the other records exactly as they are (including spaces and invisible characters).

The script already asks the user for input and handles copying and renaming the files.

Here I paste only the "hot" part:

import shutil, os

def sbloccaRicevuteDoppie(lotto_da_sbloccare, ricevute_doppie):
        percorso_file_lotto = lotto_da_sbloccare
        percorso_file_lotto_lot = lotto_da_sbloccare+".LOT"
        percorso_file_lotto_lox = lotto_da_sbloccare+".LOX"
        output = []
        state = 'KEEP'
        numeri_ricevute = ricevute_doppie
        stop_markers = [ 'T%d' % val for val in xrange(10) ]

        with open(percorso_file_lotto_lox) as fin:
                line1 = fin.next() # keep first line unconditionally
                output.append(line1)
                for line in fin:
                        if state == 'KEEP':
                                for numero_ricevuta in numeri_ricevute:
                                        if numero_ricevuta in line:
                                                state = 'DISCARD'
                                                numeri_ricevute.remove(numero_ricevuta)
                                                print numeri_ricevute
                                                break
                                        else:
                                                state = 'KEEP'
                                                output.append(line)
                                                break
                        elif state == 'DISCARD':
                                for stop_marker in stop_markers:
                                        if stop_marker in line:
                                                for numero_ricevuta in numeri_ricevute:
                                                        if numero_ricevuta in line:
                                                                state = 'DISCARD'
                                                                break
                                                        elif numero_ricevuta not in line:
                                                                state = 'KEEP'
                                                                output.append(line)
                                                                break

        with open(percorso_file_lotto_lot, 'w') as fout:
                for line in output:
                        fout.write(line)


lotto_da_sbloccare = '/home/pitto/scripts/spaccalotti/test'
ricevute_doppie = []
ricevute_doppie.append("T0738023")
ricevute_doppie.append("T7738024")
ricevute_doppie.append("T2738025")
sbloccaRicevuteDoppie (lotto_da_sbloccare, ricevute_doppie)

By commenting out the append rows I can simulate different user input.

The trouble is that I can recognize the start of a new record from the T followed by a number, but only complete records matching an entry in the ricevute_doppie list should be deleted.

I can't think of a solution other than the nested for loop you see in the 'DISCARD' state...

But it doesn't work correctly if I enter more than one record to delete :(

pepr

Just a few comments (I will post the code a bit later):

The for loops that detect the patterns may be inefficient. The more patterns you add, the more it pays to use a regular expression.

My experience is that a finite automaton for processing text should use plain numbered (that is, anonymous) states. This becomes more apparent when you end up with many states. Readable state names just force you to change more things when you need to modify the automaton. It is better to add a comment than to give the status a readable value.
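
The regular-expression point above can be sketched as follows. This is only an illustration: the marker list is copied from the sample in the question, and the helper name `find_marker` is hypothetical.

```python
import re

# Hypothetical marker list, taken from the sample in the question.
markers = ["T0738023", "T7738024", "T2738025"]

# One compiled pattern replaces the inner loop over all markers.
# re.escape guards against characters that are special in a regex.
rex_wanted = re.compile("|".join(re.escape(m) for m in markers))


def find_marker(line):
    """Return the first listed marker found in the line, or None."""
    m = rex_wanted.search(line)
    return m.group(0) if m else None
```

Each line is then tested with a single `search` call, however many markers the user enters.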

ltpitt

ASKER

You gave me this suggestion in a previous answer (https://www.experts-exchange.com/questions/28222688/Text-file-manipulation-keep-specific-rows-with-Python.html), and I'm sorry to waste your time again.

It's not that I didn't read it; it looks quite difficult to me, so I decided to stick with something more readable for my basic skills.

I think I have to study it better...

pepr

Try the following code:

import os
import re
import shutil


def sbloccaRicevuteDoppie(name, *wanted):
    lox_fname = name + '.LOX'
    lot_fname = name + '.LOT'

    rexWanted = re.compile('|'.join(wanted))
    rexMarked = re.compile(r'\sT\d+\s')  # T and the numerals, surrounded by whitespace

    with open(lox_fname) as fin, open(lot_fname, "w") as fout:

        status = 0  # initial state of the finite automaton

        for line in fin:
            if status == 0:             # keep the first line
                fout.write(line)
                status = 1

            elif status == 1:           # wait for the wanted record
                m = rexWanted.search(line)
                if m is not None:
                    fout.write(line)    # starting line
                    status = 2

            elif status == 2:           # lines of the wanted record
                # This line could carry another mark, either wanted or unwanted.
                # A wanted mark also matches the general Tnnnn pattern, so we
                # must test for wanted first, and only then for the general
                # mark. If neither applies, the line still belongs to the
                # record being collected.
                m = rexWanted.search(line)
                if m is not None:
                    fout.write(line)    # starting line of another wanted immediately
                    #status = 2         i.e. keep the same status
                else:
                    m = rexMarked.search(line)
                    if m is not None:
                        status = 1       # started unwanted section, go to "wait" status
                    else:                # still belonging to the collected record
                        fout.write(line) # ... collect it


if __name__ == '__main__':
    sbloccaRicevuteDoppie('./test', 'T0738023', 'T2738025')

If you remove the star before wanted, you can pass a list as an argument.
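
Note that the function above writes out the records named in wanted. If the goal is the one stated in the question, deleting the listed records and keeping everything else, the same automaton can be inverted. The following is a minimal sketch under that assumption, not a tested replacement: the name `delete_records` is hypothetical, and it works on a list of lines and takes the marks as a plain list (no star) to keep the example self-contained.

```python
import re


def delete_records(lines, unwanted):
    """Return the lines with the records named in `unwanted` removed."""
    if not unwanted:                # nothing to delete: keep every line
        return list(lines)

    rex_unwanted = re.compile("|".join(re.escape(u) for u in unwanted))
    rex_mark = re.compile(r"\sT\d+\s")   # any record mark: T plus digits

    output = []
    status = 0                      # 0: first line, 1: keeping, 2: discarding
    for line in lines:
        if status == 0:             # the first line is always kept
            output.append(line)
            status = 1
        elif rex_unwanted.search(line):
            status = 2              # start of a record to delete
        elif rex_mark.search(line):
            output.append(line)     # start of a record to keep
            status = 1
        elif status == 1:
            output.append(line)     # body of a kept record
        # status == 2 without a mark: body of a deleted record, skipped
    return output
```

The result can be written back with `fout.writelines(...)`. As in the code above, the unwanted marks are tested before the general mark, because an unwanted mark also matches the general pattern.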

ltpitt

ASKER

I am trying it but I get:

File "doppia.py", line 13
with open(lox_fname) as fin, open(lot_fname, "w") as fout:
^
SyntaxError: invalid syntax

pepr

Can you attach the exact file? I did run the same code, and I do not observe the error. What Python version do you use?

Please, try the attached a.py that worked for me. Fix only the path at the last line.
a.py

ltpitt

ASKER

Same error as with the file I tried...

Maybe my python is too old (the one coming with latest Crunchbang Linux)?

Python 2.6.6 (r266:84292, Dec 27 2010, 10:20:06)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

ASKER CERTIFIED SOLUTION

pepr

This solution is only available to members.
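
The accepted code is not reproduced here. As a hedged sketch of one way around the SyntaxError above: a single `with` statement with two context managers is accepted only from Python 2.7 (and 3.1) onward, so on Python 2.6.6 the two `open` calls can be nested instead. The helper name `copy_lines` is hypothetical and only demonstrates the nesting.

```python
def copy_lines(src_fname, dst_fname):
    """Copy a text file line by line using nested with statements."""
    with open(src_fname) as fin:                # outer context manager
        with open(dst_fname, "w") as fout:      # inner one, nested for Python 2.6
            for line in fin:
                fout.write(line)
```

The body of the automaton would go where the `for` loop is, one indentation level deeper than in the single-line form.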


ltpitt

ASKER

You are a volcano...

This time the script runs flawlessly, but the final file doesn't match the records I asked for (I didn't touch a single row of it).

pepr

I have used the sample from the question, but I have removed one of the marks -- the middle one.

The problem is that detection of the line with the mark depends on the content of the other lines. Can you describe better (with an example) what does not work? Which expected records were not extracted, and which wanted ones were thrown away?

ltpitt

ASKER

Maybe my example was bad...

Here's the real data:

http://pitto.homeip.net/example.lox

pepr

The sample above contains the marks T0738023, T7738024, and T2738025; however, the example.lox contains the marks T0738023, T0738024, T0738025.

Notice the first numeral after T: the sample uses 0, 7, and 2, while example.lox uses 0 in all three marks. Fix the arguments in the function call.
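
A quick way to check which marks a file really contains before calling the function is to list them. This is a small sketch; the helper name `list_marks` is hypothetical, and it assumes each mark is a T followed by digits and surrounded by whitespace, as in the sample.

```python
import re


def list_marks(lines):
    """Return the record marks (T followed by digits) in order of appearance."""
    rex_mark = re.compile(r"\s(T\d+)\s")
    marks = []
    for line in lines:
        m = rex_mark.search(line)
        if m:
            marks.append(m.group(1))    # the mark without the whitespace
    return marks
```

Passing an open file object works too, because iterating over a file yields its lines.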

ltpitt

ASKER

Hi, it's me: the idiot.

Thanks a lot for all your kind help.

I'll try to learn from your script and write better.

God bless your time, skill and patience.

ltpitt

ASKER

Perfect solution for my need, perfectly commented, and he refrained from killing me for being stupid.

What else?

pepr

There is no stupid question ;)

ltpitt

ASKER

The question, I admit it, wasn't so stupid.

My testing of your script was not stupid, it was idiotic :)

Thanks again for the preciously commented code.