ltpitt
asked on
State machine to process a text file in Python
Hi all!
I have a script that takes care of deleting records from a text file.
Each file has this structure:
1st important row
random stuff random stuff T0738023 random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff T7738024 random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff T2738025 random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
As you can see the T string identifies the beginning and the end of a record.
Just as an example here's the record "T7738024":
random stuff random stuff T7738024 random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
random stuff random stuff random stuff random stuff
In this script I want to delete all the selected records from the text file, after asking the user which records to delete, while keeping the 1st row and all the other records exactly as they are (including spaces and invisible characters).
The script already asks the user for the data and manipulates the files, copying and renaming them several times.
Here I paste only the "hot" part:
import shutil, os

def sbloccaRicevuteDoppie(lotto_da_sbloccare, ricevute_doppie):
    percorso_file_lotto = lotto_da_sbloccare
    percorso_file_lotto_lot = lotto_da_sbloccare+".LOT"
    percorso_file_lotto_lox = lotto_da_sbloccare+".LOX"
    output = []
    state = 'KEEP'
    numeri_ricevute = ricevute_doppie
    stop_markers = [ 'T%d' % val for val in xrange(10) ]
    with open(percorso_file_lotto_lox) as fin:
        line1 = fin.next() # keep first line unconditionally
        output.append(line1)
        for line in fin:
            if state == 'KEEP':
                for numero_ricevuta in numeri_ricevute:
                    if numero_ricevuta in line:
                        state = 'DISCARD'
                        numeri_ricevute.remove(numero_ricevuta)
                        print numeri_ricevute
                        break
                    else:
                        state = 'KEEP'
                        output.append(line)
                        break
            elif state == 'DISCARD':
                for stop_marker in stop_markers:
                    if stop_marker in line:
                        for numero_ricevuta in numeri_ricevute:
                            if numero_ricevuta in line:
                                state = 'DISCARD'
                                break
                            elif numero_ricevuta not in line:
                                state = 'KEEP'
                                output.append(line)
                                break
    with open(percorso_file_lotto_lot, 'w') as fout:
        for line in output:
            fout.write(line)

lotto_da_sbloccare = '/home/pitto/scripts/spaccalotti/test'
ricevute_doppie = []
ricevute_doppie.append("T0738023")
ricevute_doppie.append("T7738024")
ricevute_doppie.append("T2738025")
sbloccaRicevuteDoppie (lotto_da_sbloccare, ricevute_doppie)
If I comment out the append rows, I can simulate the user inserting data.
The trouble is that I can recognize a new record from the T followed by a number, but only a complete record matching something in the ricevute_doppie list should be deleted.
I really can't think of a solution other than the nested for you see there, looking at the 'DISCARD' state...
But it's not working correctly if I insert more than one record to be deleted :(
ASKER
I have seen your suggestion in a previous answer (https://www.experts-exchange.com/questions/28222688/Text-file-manipulation-keep-specific-rows-with-Python.html) and I'm sorry to waste your time again.
It's not that I didn't read it; it just looks quite difficult to me, so I decided to stick with something more readable for my basic skills.
I think I have to study it better...
Try the following code:
import os
import re
import shutil

def sbloccaRicevuteDoppie(name, *wanted):
    lox_fname = name + '.LOX'
    lot_fname = name + '.LOT'
    rexWanted = re.compile('|'.join(wanted))
    rexMarked = re.compile(r'\sT\d+\s')  # T and the numerals, surrounded by spaces
    with open(lox_fname) as fin, open(lot_fname, "w") as fout:
        status = 0  # initial state of the finite automaton
        for line in fin:
            if status == 0:  # keep the first line
                fout.write(line)
                status = 1
            elif status == 1:  # wait for the wanted record
                m = rexWanted.search(line)
                if m is not None:
                    fout.write(line)  # starting line
                    status = 2
            elif status == 2:  # lines of the wanted record
                # This could be a line with another mark, and that mark can be
                # wanted or unwanted: a mark is either one of the wanted ones
                # or a general Tnnnn. So we must test for wanted first, and
                # only then for the general mark. If neither applies, just
                # collect another line of the previously started record.
                m = rexWanted.search(line)
                if m is not None:
                    fout.write(line)  # starting line of another wanted record
                    #status = 2 i.e. keep the same status
                else:
                    m = rexMarked.search(line)
                    if m is not None:
                        status = 1  # started unwanted section, go to "wait" status
                    else:  # still belonging to the collected record
                        fout.write(line)  # ... collect it

if __name__ == '__main__':
    sbloccaRicevuteDoppie('./test', 'T0738023', 'T2738025')
If you remove the star before wanted, you can pass a list as an argument.
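To illustrate the remark about the star, here is a minimal sketch. The helper names build_pattern_star and build_pattern_list are hypothetical and only isolate the argument handling; they are not part of the script above.

```python
import re

def build_pattern_star(*wanted):
    # With the star, separate positional arguments are collected into a tuple.
    return re.compile('|'.join(wanted))

def build_pattern_list(wanted):
    # Without the star, the caller passes one list (or tuple) of marks.
    return re.compile('|'.join(wanted))

marks = ['T0738023', 'T2738025']
rex_a = build_pattern_star('T0738023', 'T2738025')
rex_b = build_pattern_list(marks)
rex_c = build_pattern_star(*marks)  # a list can also be unpacked at the call site
```

All three calls should build the same pattern, so either signature works; the choice only affects how the caller hands over the marks.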
ASKER
I am trying it but I get:
File "doppia.py", line 13
with open(lox_fname) as fin, open(lot_fname, "w") as fout:
^
SyntaxError: invalid syntax
Can you attach the exact file? I ran the same code and did not observe the error. Which Python version do you use?
Please try the attached a.py, which worked for me. Fix only the path on the last line.
a.py
ASKER
Same error as with the file I tried...
Maybe my Python is too old (the one that comes with the latest Crunchbang Linux)?
Python 2.6.6 (r266:84292, Dec 27 2010, 10:20:06)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
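A likely explanation for the SyntaxError: a single with statement holding several comma-separated context managers was only added in Python 2.7 (and 3.1), so Python 2.6.6 rejects that line. A minimal sketch of the usual workaround is to nest the two with statements. The function name copy_lines is hypothetical and only shows the nesting, not the record filtering.

```python
def copy_lines(src_fname, dst_fname):
    # One context manager per with statement: this form is valid in Python 2.6.
    with open(src_fname) as fin:
        with open(dst_fname, 'w') as fout:
            for line in fin:
                fout.write(line)
```

The body of the inner with block would hold the same loop as before, indented one level deeper.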
ASKER CERTIFIED SOLUTION
This solution is only available to members.
ASKER
You are a volcano...
This time the script runs flawlessly, but the final file doesn't match the records I asked for (I didn't touch a single row of it).
I have used the sample from the question, but I have removed one of the marks -- the middle one.
The problem is that detection of the line with the mark depends on the content of the other lines. Can you describe more precisely (with an example) what does not work? Which of the expected lines were not extracted, and which of the wanted ones were thrown away?
ASKER
The sample above contains the marks T0738023, T7738024, and T2738025; however, the example.lox contains the marks T0738023, T0738024, T0738025.
Notice the bold numerals after T. Fix the arguments in the function call.
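A quick way to check which marks a file really contains before calling the function is to list them. This is only a sketch: list_marks is a hypothetical helper, and it assumes the mark format used in the thread (T followed by digits, surrounded by whitespace).

```python
import re

def list_marks(lines):
    # Collect the first token per line that looks like a mark.
    rex_mark = re.compile(r'\s(T\d+)\s')
    marks = []
    for line in lines:
        m = rex_mark.search(line)
        if m is not None:
            marks.append(m.group(1))
    return marks

sample = [
    '1st important row\n',
    'random stuff random stuff T0738023 random stuff\n',
    'random stuff random stuff random stuff random stuff\n',
    'random stuff random stuff T0738024 random stuff\n',
]
found = list_marks(sample)
wanted = ['T0738023', 'T7738024']
missing = [w for w in wanted if w not in found]
```

Any entry left in missing is a mark that was requested but never appears in the file, which is the situation described above.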
ASKER
Hi, it's me: the idiot.
Thanks a lot for all your kind help.
I'll try to learn from your script and write better.
God bless your time, skill and patience.
ASKER
Perfect solution for my need, perfectly commented, and you refrained from killing me for being stupid.
What else?
There is no stupid question ;)
ASKER
The question, I admit it, wasn't so stupid.
My testing of your script was not stupid, it was idiotic :)
Thanks again for the preciously commented code.
The for loops used to detect the patterns may be inefficient. The more patterns are appended, the more it pays to use a regular expression.
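As a rough sketch of that idea, the list of marks can be joined into one alternation and compiled once. The name make_matcher is hypothetical, and re.escape is an added precaution (not in the code above) in case a mark ever contains a regex metacharacter.

```python
import re

def make_matcher(marks):
    # One compiled alternation replaces the loop over all marks.
    return re.compile('|'.join(re.escape(m) for m in marks))

marks = ['T0738023', 'T7738024', 'T2738025']
rex = make_matcher(marks)
line = 'random stuff random stuff T7738024 random stuff\n'

hit_loop = any(mark in line for mark in marks)   # loop-based test
hit_regex = rex.search(line) is not None         # single regex search
```

Note that an empty list would produce an empty pattern, which matches every line, so that case needs its own guard.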
In my experience, a finite automaton for processing text should use plain numbered states (read: anonymous ones). This becomes more apparent when you end up with many states. Readable state names just force you to change more things when you need to modify the automaton. It is better to add a comment than to give the status a readable value.
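For the original goal of deleting the selected records, an automaton with numbered, commented states could look like the sketch below. It is not the code from the thread: delete_records is a hypothetical function working on a list of lines, and it assumes that a record runs from one mark line up to the next mark line.

```python
import re

REX_MARK = re.compile(r'\s(T\d+)\s')  # assumed format: T plus digits between whitespace

def delete_records(lines, unwanted):
    unwanted = set(unwanted)
    kept = []
    status = 0                      # 0: first line, 1: keeping, 2: discarding
    for line in lines:
        m = REX_MARK.search(line)
        mark = m.group(1) if m is not None else None
        if status == 0:             # keep the first line unconditionally
            kept.append(line)
            status = 1
        elif status == 1:           # keeping lines
            if mark in unwanted:
                status = 2          # unwanted record starts: drop this line
            else:
                kept.append(line)
        elif status == 2:           # discarding lines of an unwanted record
            if mark is None or mark in unwanted:
                pass                # still inside it, or another unwanted one starts
            else:
                kept.append(line)   # a record to keep starts here
                status = 1
    return kept
```

Comparing the extracted mark against a set, instead of searching for each wanted string, also avoids matching a mark that is merely a prefix of a longer one.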