Hello all --
I have an array of hashes that stores data about approx. 4,000 genes. A typical entry looks like this (some entries truncated with '...' for posting purposes):
BF2784 = {
'name' => 'BF2784',
'descr' => 'putative EPS related membrane protein',
'start' => '3242515',
'end' => '3244920',
'ori' => 'pos',
'bp' => '2409',
'GC' => '56.3',
'GeneID' => '3287061',
'aa' => '802',
'kDa' => '88.3',
'GI' => '60682255',
'groups' => {
'CDD' => [
'COG0455',
'COG3206',
'cd00550'
],
'COG' => {
'D' => 'Cell cycle control, mitosis and meiosis genes',
'M' => 'Cell wall/membrane biogenesis genes'
}
},
'links' => {
'pep' => 'http://www.ncbi.nlm.nih.gov/entrez/...',
'seq' => 'http://www.ncbi.nlm.nih.gov/entrez/...',
'summary' => 'http://www.ncbi.nlm.nih.gov/entrez/...',
'upstr' => 'http://www.ncbi.nlm.nih.gov/entrez/...'
},
'aka' => {
'lec' => [
'orf3_tsr19'
],
'gb' => [
'sigE'
]
},
'up_gap' => {
'end' => '3242514',
'size' => '13',
'start' => '3242502'
},
'microarray' => {
'SeqID' => 'BFRAG050600002693',
'descr' => '2693|Bacteroides fragilis|0|506|CDS...',
'102805' => {
'0265_ML' => {
'avg' => '959.5032',
'block1' => '820.4571',
'block2' => '1142.2980',
'block3' => '915.7545'
},
'1394_EL' => {
'avg' => '422.2764',
'block1' => '448.5869',
'block2' => '454.1586',
'block3' => '364.0837'
},
'9343_EL' => {
'avg' => '797.4852',
'block1' => '753.0446',
'block2' => '885.5215',
'block3' => '753.8896'
},
'9343_ML' => {
'avg' => '858.0540',
'block1' => '933.8485',
'block2' => '822.4420',
'block3' => '817.8716'
},
'CrrD_ML' => {
'avg' => '952.3332',
'block1' => '1000.5565',
'block2' => '949.4948',
'block3' => '906.9484'
}
},
'121905' => {
'9343_ML' => {
'avg' => '976.8530',
'block1' => '1053.0826',
'block2' => '930.7049',
'block3' => '946.7716'
},
'ddUngD' => {
'avg' => '852.5260',
'block1' => '851.9713',
'block2' => '823.1842',
'block3' => '882.4226'
},
'mpi_mut44' => {
'avg' => '1295.1745',
'block1' => '1367.4020',
'block2' => '1229.0144',
'block3' => '1289.1070'
},
'mpi_mut8' => {
'avg' => '1126.2450',
'block1' => '1115.6544',
'block2' => '1093.3422',
'block3' => '1169.7385'
},
'tsr19_M1' => {
'avg' => '1895.5840',
'block1' => '1916.8111',
'block2' => '1798.7082',
'block3' => '1971.2327'
},
'tsr19_M3' => {
'avg' => '1249.8808',
'block1' => '1215.6576',
'block2' => '1281.3577',
'block3' => '1252.6272'
}
},
},
};
These data are stored on disk in Storable format, and retrieved by:
use Storable qw(store retrieve);
my $data = retrieve("master_9343.db");
The entry above comes from the hash stored in the first element of that array, $data->[0], e.g.:
use Data::Dumper;
open(my $dd, '>', 'BF2784_dump.txt') or die "Cannot open BF2784_dump.txt: $!";
print $dd Dumper($data->[0]{BF2784});
close $dd;
This arrangement works great if I'm looking up a particular attribute of the gene, or a small set of attributes:
print "$gene begins at $data->[0]{$gene}{start} and ends at $data->[0]{$gene}{end}\n";
but what I want to do now is provide a program to display *all* stored data about the gene. Some genes do not have all the entries shown above (for example, there may be only one name for a particular gene, thus neither $data->[0]{$gene}{aka}{lec} nor $data->[0]{$gene}{aka}{gb} will exist for that value of $gene, but either or both might exist for another value of $gene). This is true for several of the structures ($data->[0]{$gene}{groups}, $data->[0]{$gene}{up_gap}, etc.).
So, my goal is to write a command line program that, when provided the name of a gene, prints out a report containing all the data accumulated for that gene. For now, I'm going to output it to a text file using Perl's Report formats, but I might eventually try for a Perl/Tk version.
What is the best way to iterate over such a variable structure to dump all the data, ignoring values that don't exist? I could of course set variables for each possible entry, testing first if it exists:
my ($aka, $cdd);   # declared outside the blocks so they are still visible afterwards
if (exists $data->[0]{$gene}{aka}{lec}) {
    $aka = join(", ", @{ $data->[0]{$gene}{aka}{lec} });
}
if (exists $data->[0]{$gene}{groups}{CDD}) {
    $cdd = join(", ", @{ $data->[0]{$gene}{groups}{CDD} });
}
etc., but this brute-force approach seems tedious and wasteful. Does anyone have any suggestions for an efficient way to do this, such that I wind up with a collection of variables suitable for passing to a report format subroutine?
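To make the question more concrete, here is a rough sketch of the kind of thing I imagine: a recursive walk that visits only the keys that actually exist and collects them into one flat hash keyed by "path". The sub name flatten, the dotted path names, and the cut-down %gene hash are all made up for illustration (they are not part of my existing code), and it assumes that arrays only ever hold plain strings, as in the entry above.

```perl
use strict;
use warnings;

# Hypothetical helper: walk a nested structure and return a flat
# hash of "dotted.path" => "value" pairs. Only keys that are
# actually present are visited, so no exists() tests are needed.
sub flatten {
    my ($node, $prefix, $flat) = @_;
    $prefix = '' unless defined $prefix;
    $flat ||= {};

    if (ref $node eq 'HASH') {
        for my $key (sort keys %$node) {
            my $path = length($prefix) ? "$prefix.$key" : $key;
            flatten($node->{$key}, $path, $flat);
        }
    }
    elsif (ref $node eq 'ARRAY') {
        # Assumption: arrays hold plain strings, never nested refs.
        $flat->{$prefix} = join ', ', @$node;
    }
    else {
        $flat->{$prefix} = $node;
    }
    return $flat;
}

# Cut-down stand-in for $data->[0]{BF2784}; note there is no aka.lec here.
my %gene = (
    name   => 'BF2784',
    start  => '3242515',
    groups => { CDD => [ 'COG0455', 'COG3206', 'cd00550' ] },
    aka    => { gb => ['sigE'] },
);

my $flat = flatten(\%gene);
printf "%-12s %s\n", $_, $flat->{$_} for sort keys %$flat;
```

With something like this, a call such as flatten($data->[0]{$gene}) would hand back a single hash that could be passed to the report code, but I don't know whether this is a sensible way to go about it.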
Thanks --
Mike