TommyMac501

asked on

Can't seem to get my head around my loops

In printing, quite often we need to rearrange data so it comes out in a different order than it arrives.  Once it's been printed, it's set up on cutters and needs to come out in stream order (it's called north-south splitting).  I need to write a 4-way north-south split, which divides the file into (4) sections and determines any leftovers (in this case, 3 records:  19/4=4.75, round down to 4.  4x4=16 R3); example below.

The data comes in in this order; 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19

It needs to be rearranged so that when the records are printed and cut into 4 "ribbons" they are presented in this way:

1   6   11   16        Print direction
2   7   12   17                |
3   8   13   18                |
4   9   14   19                |
5  10  15                      V

So, the natural order of the file needs to be rebuilt as an output file in this order; 1,6,11,16,2,7,12,17,3,8,13,18.........

Does Python have a reasonably efficient way of doing this other than seeking to file positions inside loops and counters?  These files are quite large, with record lengths in the hundreds and record counts in the millions.
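For illustration, the arithmetic described above (rows = record count divided by 4, rounded down, with the remainder spread over the first columns) can be sketched as an index mapping. The function name `north_south_order` is hypothetical, and the sketch assumes the leftover records go to the leftmost columns, as in the 19-record example.

```python
def north_south_order(nrec, ncols):
    """Return 0-based input record indices in N-way north-south output order."""
    fullrows, leftovers = divmod(nrec, ncols)
    # The first `leftovers` columns hold one extra record each.
    lengths = [fullrows + (1 if i < leftovers else 0) for i in range(ncols)]
    # Each column starts where the previous columns end.
    starts = [sum(lengths[:i]) for i in range(ncols)]
    order = []
    for row in range(fullrows + (1 if leftovers else 0)):
        for col in range(ncols):
            if row < lengths[col]:  # skip columns that have run out
                order.append(starts[col] + row)
    return order

# 19 records in 4 columns, shown 1-based to match the example above.
print([i + 1 for i in north_south_order(19, 4)])
# → [1, 6, 11, 16, 2, 7, 12, 17, 3, 8, 13, 18, 4, 9, 14, 19, 5, 10, 15]
```

Every input index appears exactly once in the result, so the output always has the same number of records as the input.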



momogentoo

It looks like a matrix calculation.
Try Numeric Python?
You mention seeking, so I assume you have  a fixed record length. You can open the same file multiple times, and position each file handle at the right spot, then read from each filehandle in a loop. An example is provided below.
# splitreader.py
class splitreader:
    """Read a fixed-record-length file as `splitcount` parallel columns."""

    def __init__(self, filename, reclen, splitcount):
        self.filename = filename
        self.reclen = reclen
        self.splitcount = splitcount
        self.fh = []

    def open(self):
        # One file handle per column, all on the same file.
        for i in range(self.splitcount):
            self.fh.append(open(self.filename, 'rb'))
        self.fh[0].seek(0, 2)  # seek to the end to find the file size
        self.filesize = self.fh[0].tell()
        self.recordcount = self.filesize // self.reclen
        self.fullrows = self.recordcount // self.splitcount
        self.chunksize = self.fullrows * self.reclen
        self.leftovers = self.recordcount % self.splitcount
        self.rowsread = 0
        # Position each handle at the start of its column; the first
        # `leftovers` columns are one record longer.
        startpos = 0
        for i in range(self.splitcount):
            self.fh[i].seek(startpos)
            startpos += self.chunksize
            if i < self.leftovers:
                startpos += self.reclen

    def get_row(self):
        # Skip columns that are already exhausted, so a short column
        # never reads into the start of the next one.
        row = []
        for i in range(self.splitcount):
            collen = self.fullrows + (1 if i < self.leftovers else 0)
            if self.rowsread < collen:
                row.append(self.fh[i].read(self.reclen))
        self.rowsread += 1
        return row

    def close(self):
        for fh in self.fh:
            fh.close()

def test():
    sr = splitreader('fixedreclen_3.dat', 3, 4)
    sr.open()
    print('recordcount =', sr.recordcount)
    print('leftovers =', sr.leftovers)
    while True:
        row = sr.get_row()
        if not row:  # an empty row means every column is exhausted
            break
        print(b' '.join(row).decode('ascii'))
    sr.close()

if __name__ == '__main__':
    test()
 
# fixedreclen_3.dat contains this single line:
111222333444555666777888999AAABBBCCCDDDEEEFFFGGGHHHIIIJJJ
 
# output:
recordcount = 19
leftovers = 3
111 666 BBB GGG
222 777 CCC HHH
333 888 DDD III
444 999 EEE JJJ
555 AAA FFF
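
To produce the reorganized output file rather than printed rows, the same multiple-handle idea could be wrapped in one function. This is a minimal sketch, not a tuned implementation: the name `north_south_split` and its parameters are hypothetical, and it assumes `reclen` is the full record length in bytes, including any line terminator such as CRLF.

```python
def north_south_split(src, dst, reclen, ncols):
    """Copy fixed-length records from src to dst in N-way north-south order.

    Returns the number of records found in src.
    """
    # One handle per column, so each column is read sequentially.
    handles = [open(src, 'rb') for _ in range(ncols)]
    try:
        handles[0].seek(0, 2)  # seek to the end to find the file size
        nrec = handles[0].tell() // reclen
        fullrows, leftovers = divmod(nrec, ncols)
        # The first `leftovers` columns hold one extra record each.
        lengths = [fullrows + (1 if i < leftovers else 0) for i in range(ncols)]
        pos = 0
        for fh, length in zip(handles, lengths):
            fh.seek(pos)
            pos += length * reclen
        with open(dst, 'wb') as fout:
            for row in range(fullrows + (1 if leftovers else 0)):
                for fh, length in zip(handles, lengths):
                    if row < length:  # skip columns that have run out
                        fout.write(fh.read(reclen))
    finally:
        for fh in handles:
            fh.close()
    return nrec
```

A call might look like north_south_split('input.dat', 'output.dat', reclen=102, ncols=4) for 100-byte records followed by CRLF; those file names and sizes are placeholders.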


pepr

I do not know the background. Anyway, do you really want to print all of the "million records"? There is probably some selection of them, isn't there? You should also explain why it is important to print in columns instead of rows. It would be understandable if they were, say, pages of a phone list or the like. However, in that case you would want to make sections by pages, say 300 items per page. Perhaps the cutter needs the four columns arranged this way because the printed media is cut into four piles (of, say, price cards for a shop) that are stacked one on top of another to get the correct ordering.

Still, the "millions of printed records" seems unrealistic to me. My guess is that you really want to print a number of items that could easily fit into memory. In Python, you could read them into a list, which can then be accessed through indexing.

Please, write more details.
Some real world data would help a lot.

Are the records fixed format? fixed length?
TommyMac501

ASKER

"I do not know the background. Anyway, do you really want to print all of the "million records"?"

In direct mail, it's very common to print millions of pieces in a run.  Datasets normally run anywhere from 150,000 records to 6 or 7 million in a job.  It's normal.  The output structure is built for the reason you specified. In this particular case, the customer is printing postcards in columns of four across on an 18" wide form.  They are "slit" into columns, then chopped, effectively creating postcards that are "stacked" at the end of the slitter/cutter.  Since you may have dozens of these cutters, the files are run like this so that they can be reassembled in natural order on skids.

The data is almost always drawn from mainframe systems and consists of fixed-length ASCII records with CRLF.  Since I need to "jump around the file", I am using file pointers (seeking) to get to the start of the next output record.  The problem I'm having is the algorithm.

My plan was to create a loop counter equal to the record count / 4, and have an array list with 4 elements; one to hold the seek position for each subsequent record in the output leg.

My loops got out of control and I'm having a tough time keeping track of why I'm either getting more, or, less records than the input file.  
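
One way to keep the seek bookkeeping honest is to compute every column's start offset and record count up front and check that they add up to the whole file. The sketch below is only an illustration: the helper name `column_offsets` is hypothetical, and it assumes `reclen` includes the two CRLF bytes and that leftover records go to the leftmost columns.

```python
def column_offsets(filesize, reclen, ncols):
    """Return a (start_offset, record_count) pair for each column."""
    nrec, remainder = divmod(filesize, reclen)
    if remainder:
        # A partial record usually means reclen is wrong, e.g. CRLF not counted.
        raise ValueError('file size is not a multiple of the record length')
    fullrows, leftovers = divmod(nrec, ncols)
    layout, pos = [], 0
    for i in range(ncols):
        length = fullrows + (1 if i < leftovers else 0)
        layout.append((pos, length))
        pos += length * reclen
    # The columns must cover the file exactly: no gaps, no overlap.
    assert pos == filesize
    return layout

# 19 records of 100 data bytes plus CRLF (102 bytes each), 4 columns.
print(column_offsets(19 * 102, 102, 4))
# → [(0, 5), (510, 5), (1020, 5), (1530, 4)]
```

If the record counts in the returned pairs do not sum to the input record count, the output will gain or lose records, which is the symptom described above.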

As an aside, I'm using "wing" as an editor for the autocomplete and integrated debugger, but it's slow in step-mode debugging.  Does anyone have a better editor suggestion to try?

CXR:  I will give that solution a try (after I read and understand it). :)
Please show us your code. Else we are shooting in the dark.
Then cxr's approach is probably the correct one. Try the alternative snippet below, which simulates the printing by writing to an output file. It can easily be modified to produce the reorganized output file from the input source.

The simulated input reads the records as lines; however, it is also possible to read, say, multiline fixed records.

The simulated output looks like (997 records counted from zero):

rec00000000000000000000000000000000000000000000000000
rec00000000000000000000000000000000000000000000000250
rec00000000000000000000000000000000000000000000000500
rec00000000000000000000000000000000000000000000000750
-------------------------------------------------------
rec00000000000000000000000000000000000000000000000001
rec00000000000000000000000000000000000000000000000251
rec00000000000000000000000000000000000000000000000501
rec00000000000000000000000000000000000000000000000751
-------------------------------------------------------
... and the last ones

-------------------------------------------------------
rec00000000000000000000000000000000000000000000000246
rec00000000000000000000000000000000000000000000000496
rec00000000000000000000000000000000000000000000000746
rec00000000000000000000000000000000000000000000000996
-------------------------------------------------------
rec00000000000000000000000000000000000000000000000247
rec00000000000000000000000000000000000000000000000497
rec00000000000000000000000000000000000000000000000747
-------------------------------------------------------
rec00000000000000000000000000000000000000000000000248
rec00000000000000000000000000000000000000000000000498
rec00000000000000000000000000000000000000000000000748
-------------------------------------------------------
rec00000000000000000000000000000000000000000000000249
rec00000000000000000000000000000000000000000000000499
rec00000000000000000000000000000000000000000000000749
-------------------------------------------------------



import os

# Generate a sample file. Binary mode keeps byte offsets exact on every platform.
fname = 'test.dat'
f = open(fname, 'wb')
for n in range(1000 - 3):  # i.e. simulate 3 missing items for the last case
    s = 'rec%050i\n' % n
    f.write(s.encode('ascii'))
f.close()

# Determine the fixed length (assumed) of the record.
f = open(fname, 'rb')
pos1 = f.tell()
s = f.readline()
pos2 = f.tell()
f.close()

recsize = pos2 - pos1

print('Record size:', recsize)

# Determine the length of the file and compute the number
# of records.
fsize = os.stat(fname).st_size
print('File size:', fsize)
nrec = fsize // recsize
print('No. of records:', nrec)

# Compute the seek offset. We know that we want 4 columns; hence, add 3 to get
# the length of the longest column after the "floor division".
# Note: all short rows land in the last column, which matches the layout
# in the question only when nrec % 4 is 0 or 3.
nrows = (nrec + 3) // 4
print('Num of rows:', nrows)
offset = nrows * recsize
print('Offset:', offset)

# Simulate the printing by output to another file.
fout = open('output.txt', 'wb')

# Open the four input files with different offsets.
f1 = open(fname, 'rb')
f2 = open(fname, 'rb')
f3 = open(fname, 'rb')
f4 = open(fname, 'rb')

# Seek to the right offsets.
f2.seek(offset)
f3.seek(offset * 2)
f4.seek(offset * 3)

# Loop the known number of times through the records.
# The last column may contain empty printings at the end.
rec4 = b'init'
for n in range(nrows):
    # Read the four records...
    rec1 = f1.readline()
    rec2 = f2.readline()
    rec3 = f3.readline()
    if rec4 != b'':          # not reading after EOF
        rec4 = f4.readline()

    # ... and put them to the output.
    fout.write(rec1)
    fout.write(rec2)
    fout.write(rec3)
    fout.write(rec4)

    # Visualize the printed row.
    fout.write(b'-' * recsize + b'\n')


# Close all input file objects and the output file.
f1.close()
f2.close()
f3.close()
f4.close()

fout.close()


Sorry, I've been away.  Thanks everyone for contributing.  I thought pepr's solution was the most unique, entering the file in four different places.  You all were a great help, and I'm happy to pay my monthly dues to belong here.

TM
ASKER CERTIFIED SOLUTION
Roger Baklund
Yes, and I did emphasize that in the now accepted solution: "Then cxr's approach is probably the correct one." ;)
cxr: You are right, I didn't understand that.  It's my fault, not yours.  Help me figure out how to assign some points to you and I'll happily oblige. :)

Ok, it may be me, but I cannot find a "Request Attention" button anyplace...
See the lower right corner of the original question, on the top of this page. Just above the google translate function. :)