Solved

REad file Odd and misterious keyboard interruption

Posted on 2012-04-09
4
244 Views
Last Modified: 2012-04-22
Hi,

I am trying to read a very basic txt (bed) file, tab-delimited, three columns and I get an odd error when I try to read it in linux, and when I try to read it in mac it basically slows my computer down until almost freezing it.

No idea why, it's a huge mystery to me and many other pythonistas have looked at it with no success.

Here I attach the code for reading the file  (read_file.py) and the complete file (lads_galaxy_hg19_mapped.bed).

UNIX get the following error:
[node1390]data> python read_file.py
11401198 11694590
18002539 18392747
Traceback (most recent call last):
  File "read_file.py", line 25, in <module>
    bed_positions[chromosome].extend(range(start, end+1))
KeyboardInterrupt
Terminated

MAC OS X basically my computer freezes after typing:
>python read_file.py

Have not tried reading it in Windows - I have never used windows so I have no idea about it.

Please advise on how to fix this issue and if you were able to reproduce it.
read-file.py
lads-galaxy-hg19-mapped.txt
0
Comment
Question by:dfernan
  • 2
  • 2
4 Comments
 
LVL 28

Expert Comment

by:pepr
ID: 37826645
My guess is that the KeyboardInterrupt did not happen on its own.  It probably also took too much time and you pressed Ctrl+C.

It is not surprising that it takes too much time (and too much memory) if you run the modified script:

a.py
'''
 This takes a bedfile and a snparray data file and outputs the
 ratio between the median intensity for the bed regions divided
 by the median intensity for genome background (i.e., non bed regions)
'''

chrom = ['chr'+str(i) for i in range(1,23)]+['chrX']

bed_positions = {chromosome:0 for chromosome in chrom}
filename= 'lads_galaxy_hg19_mapped.bed'
bedfile = open(filename, 'r')
for line in bedfile:
    temp = line.split('\t')
    start = int(temp[1])
    end = int(temp[2])
    chromosome = temp[0]
    assert chromosome in bed_positions
    bed_positions[chromosome] += end - start + 1

bedfile.close()  # do not forget to close the file explicitly or use the with construct

sum = 0
for chromosome in bed_positions:
    n = bed_positions[chromosome]
    print chromosome, n
    sum += n

print 'sum', sum

Open in new window


The modified script just counts the lenght of the chromosome sequences that you add to the dictionary.  The original script keeps that long lists of numbers for each chromosome.  See the result printed on my console (dictionary was not sorted here):
c:\tmp\___python\dfernan\Q_27668375>python a.py
chrX 78751464
chr13 35025843
chr12 46518464
chr11 55813063
chr10 51281011
chr17 16289483
chr16 32767186
chr15 24089858
chr14 38560293
chr19 7700805
chr18 35824641
chr22 5908395
chr20 23604355
chr21 9100296
chr7 65906044
chr6 67765901
chr5 79127419
chr4 99751548
chr3 73996343
chr2 95936291
chr1 77870888
chr9 43333020
chr8 65489675
sum 1130412286

Open in new window


You can see that all the lists have 1 130 412 286 elements together.  The memory space (needed for storing them) is comparable with all the RAM memory that you probably have in your computer.  Then you want to sort the lists.  It is not surprising that it would take a lot of time.

I do not understand the genetics.  Anyway, the naive approach to get the median definitely cannot work in the case.
0
 
LVL 28

Accepted Solution

by:
pepr earned 500 total points
ID: 37827037
Try the following modification to get the medians for the sequences.  The sequences are not unfolded.  They are captured as tuples of (start, end) numbers.  The median value is calculated from those rather few tuples (about one hundreds vs. the one hundred milions).

b.py
'''
 This takes a bedfile and a snparray data file and outputs the
 ratio between the median intensity for the bed regions divided
 by the median intensity for genome background (i.e., non bed regions)
'''


def loadBedPositions(fname):

    chrom = ['chr'+str(i) for i in range(1,23)]+['chrX']

    bed_positions = {chromosome: [] for chromosome in chrom}
    bedfile = open(fname, 'r')
    for line in bedfile:
        temp = line.split('\t')
        start = int(temp[1])
        end = int(temp[2])
        chromosome = temp[0]
        assert chromosome in bed_positions
        bed_positions[chromosome].append( (start, end) )

    bedfile.close()  # do not forget to close the file explicitly or use the with construct
    return bed_positions


def medianValue(lst):

    def myCmp(a, b):
        return cmp(a[0], b[0])

    # The length of the sequence is the sum of lengths of subsequences (intervals).
    seqlen = 0
    for start, end in lst:
        seqlen += end - start + 1

    # Index of the median element.
    mi = seqlen // 2

    # Skip the subsequences that do not cross the median point, and then get
    # the median out of the next subsequence
    n = 0      # init -- number of elements in previous sequences
    for start, end in sorted(lst, myCmp):
        x = end - start + 1      # length of the subsequence
        if n + x > mi:           # subsequence contains the middle element
            di = mi - n          # index of the middle elem inside the subsequence
            return start + di - 1  # value of the median
        else:
            n += x


if __name__ == '__main__':
    bed_positions = loadBedPositions('lads_galaxy_hg19_mapped.bed')
    for chromosome in bed_positions:
        print chromosome, medianValue(bed_positions[chromosome])

Open in new window


It prints on my console:
c:\tmp\___python\dfernan\Q_27668375>python b.py
chrX 94388829
chr13 70206810
chr12 66997615
chr11 83347289
chr10 83912730
chr17 50310741
chr16 52344135
chr15 60290770
chr14 49523046
chr19 33042562
chr18 42711731
chr22 26715724
chr20 21452517
chr21 25053830
chr7 87779768
chr6 90178045
chr5 94652023
chr4 91913307
chr3 85084133
chr2 127283853
chr1 119451935
chr9 76428582
chr8 67100249

Open in new window


Consider this a first attempt.  I did not check it for errors/validity.
0
 

Author Comment

by:dfernan
ID: 37828376
Pepr,

Thank you very much, you are definitely right, the issue was a mem issue I did not have in mind.  I thought it was possible to create a list of range 1 to 3 billion but that's way too much memory.

The program terminates because the server kicks me out, it's just that the message is quite cryptic, no memory message error, just the termination error.

Well, with respect to the solutions they are all great and now it's working, thanks a lot!
0
 

Author Closing Comment

by:dfernan
ID: 37878411
thx, great answer!
0

Featured Post

What Is Threat Intelligence?

Threat intelligence is often discussed, but rarely understood. Starting with a precise definition, along with clear business goals, is essential.

Join & Write a Comment

A set of related code is known to be a Module, it helps us to organize our code logically which is much easier for us to understand and use it. Module is an object with arbitrarily named attributes which can be used in binding and referencing. …
Sequence is something that used to store data in it in very simple words. Let us just create a list first. To create a list first of all we need to give a name to our list which I have taken as “COURSE” followed by equals sign and finally enclosed …
Learn the basics of lists in Python. Lists, as their name suggests, are a means for ordering and storing values. : Lists are declared using brackets; for example: t = [1, 2, 3]: Lists may contain a mix of data types; for example: t = ['string', 1, T…
Learn the basics of if, else, and elif statements in Python 2.7. Use "if" statements to test a specified condition.: The structure of an if statement is as follows: (CODE) Use "else" statements to allow the execution of an alternative, if the …

760 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

20 Experts available now in Live!

Get 1:1 Help Now