Solve your biggest tech problems alongside global tech experts with 1:1 help.
'''
This takes a bedfile and a snparray data file and outputs the
ratio between the median intensity for the bed regions divided
by the median intensity for genome background (i.e., non bed regions)
'''
def loadBedPositions(fname):
chrom = ['chr'+str(i) for i in range(1,23)]+['chrX']
bed_positions = {chromosome: [] for chromosome in chrom}
bedfile = open(fname, 'r')
for line in bedfile:
temp = line.split('\t')
start = int(temp[1])
end = int(temp[2])
chromosome = temp[0]
assert chromosome in bed_positions
bed_positions[chromosome].append( (start, end) )
bedfile.close() # do not forget to close the file explicitly or use the with construct
return bed_positions
def medianValue(lst):
def myCmp(a, b):
return cmp(a[0], b[0])
# The length of the sequence is the sum of lengths of subsequences (intervals).
seqlen = 0
for start, end in lst:
seqlen += end - start + 1
# Index of the median element.
mi = seqlen // 2
# Skip the subsequences that do not cross the median point, and then get
# the median out of the next subsequence
n = 0 # init -- number of elements in previous sequences
for start, end in sorted(lst, myCmp):
x = end - start + 1 # length of the subsequence
if n + x > mi: # subsequence contains the middle element
di = mi - n # index of the middle elem inside the subsequence
return start + di - 1 # value of the median
else:
n += x
if __name__ == '__main__':
bed_positions = loadBedPositions('lads_galaxy_hg19_mapped.bed')
for chromosome in bed_positions:
print chromosome, medianValue(bed_positions[chromosome])
c:\tmp\___python\dfernan\Q_27668375>python b.py
chrX 94388829
chr13 70206810
chr12 66997615
chr11 83347289
chr10 83912730
chr17 50310741
chr16 52344135
chr15 60290770
chr14 49523046
chr19 33042562
chr18 42711731
chr22 26715724
chr20 21452517
chr21 25053830
chr7 87779768
chr6 90178045
chr5 94652023
chr4 91913307
chr3 85084133
chr2 127283853
chr1 119451935
chr9 76428582
chr8 67100249
It is not surprising that it takes too much time (and too much memory) if you run the modified script:
a.py
Open in new window
The modified script just counts the lenght of the chromosome sequences that you add to the dictionary. The original script keeps that long lists of numbers for each chromosome. See the result printed on my console (dictionary was not sorted here):
Open in new window
You can see that all the lists have 1 130 412 286 elements together. The memory space (needed for storing them) is comparable with all the RAM memory that you probably have in your computer. Then you want to sort the lists. It is not surprising that it would take a lot of time.
I do not understand the genetics. Anyway, the naive approach to get the median definitely cannot work in the case.