05.15.2003 at 08:03PM PDT, ID: 20618114
PERL - extracting the max and min value from sets of data

Tags: perl, value, max, min
Hi, I have a tab-delimited input file that looks like this:
Query      Score      E-Value      Position
At1g01010       775      0.0      4705
At1g01010       775      0.0      4765
At1g01010       775      0.0      4825
At1g01010       775      0.0      4885
At1g01010       775      0.0      4945
At1g01010       775      0.0      5005
At1g01010       775      0.0      5065
At1g01010       557      e-158      3996
At1g01010       557      e-158      4056 .....
What I would like to do is: for the rows where the query and score are the same, get the maximum and minimum value of position. I would like the output to be a tab-delimited file containing the 'query', 'score', 'e-value', 'position min' and 'position max'. I will be dealing with very large files, so ideally the file would not be read in all at once but processed a piece at a time. I don't know how feasible this is, as I have been trying and failing miserably.
Please help!!!
Question Stats
Zone: Programming
Question Asked By: MonkeyMoo
Solution Provided By: teraplane
Participating Experts: 2
Solution Grade: A
Views: 69
05.15.2003 at 09:11PM PDT, ID: 8538087

Rank: Wizard

This should get you started.

The positions were all in ascending order, so I reordered them in case they are not always in sequence.

I'm working from a DATA block for testing but you can easily change this to a file.

Also, the e-value always seems to be the same. I'm using the last one from a set of records with identical Query and Score, but maybe this isn't what you want.

It reads in the file (or block) one line at a time, but the %query_score hash can use quite a lot of memory depending on file size.


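use strict;
use warnings;

my %query_score;
while ( <DATA> )
{
    chomp;
    next unless /\S/;    # skip blank lines
    # split ' ' handles tabs as well as runs of spaces
    my ($Query,$Score,$E_Value,$Position) = split(' ',$_);
    push( @{ $query_score{"$Query:$Score"}{position} },$Position);
    # keeps the last e-value seen for this Query/Score pair
    $query_score{"$Query:$Score"}{E_Value} = $E_Value;
}

foreach my $key ( sort keys %query_score )
{
    my ($Query,$Score) = split(/:/,$key);
    # numeric sort; the default sort compares the positions as strings
    my @positions = sort { $a <=> $b } @{ $query_score{$key}{position} };
    my $E_Value = $query_score{$key}{E_Value};
    # index rather than shift/pop, so a set with one row still has a max
    my $min = $positions[0];
    my $max = $positions[-1];
    print("$Query\t$Score\t$E_Value\t$min\t$max\n");
}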

__DATA__
At1g01010       775      0.0      4705
At1g01010       775      0.0      4765
At1g01010       775      0.0      4825
At1g01010       775      0.0      4945
At1g01010       775      0.0      4885
At1g01010       775      0.0      5005
At1g01010       775      0.0      5065
At1g01010       557      e-158    3996
At1g01010       557      e-158    4056
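
A note on ordering the positions: Perl's default sort compares values as strings, so it only gives the right min and max when all positions have the same number of digits. The sketch below uses made-up positions (not taken from the question's data) to show the difference, and also shows min and max from the core List::Util module, which avoid sorting altogether.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical positions, chosen so that string and numeric order differ.
my @positions = (9000, 10000, 850);

my @as_strings = sort @positions;                  # default: string comparison
my @as_numbers = sort { $a <=> $b } @positions;    # explicit numeric comparison

print "string sort:  @as_strings\n";    # 10000 850 9000
print "numeric sort: @as_numbers\n";    # 850 9000 10000
print "min/max:      ", min(@positions), " ", max(@positions), "\n";    # 850 10000
```

With the string sort, taking the first and last elements would report 10000 as the minimum, which is why a numeric comparator (or min/max) is the safer choice here.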
Accepted Solution
 
05.15.2003 at 09:14PM PDT, ID: 8538099
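#!/usr/bin/perl
use strict;
use warnings;

my $OUT_FILE = "OutMaxMin.txt";
open ( my $OUT, '>', $OUT_FILE ) or die "Cannot open $OUT_FILE: $!";

my %hash;
while (<STDIN>) {
   chomp;                            ### otherwise $Position keeps its trailing newline
   next if /^Query\t/;               ### skip the header row shown in the question
   my ($Query, $Score, $EValue, $Position) = split(/\t/);
   next unless defined $Position;    ### skip blank or short lines

   my $Que_Sc = $Query.",".$Score;
   if ( ! exists ( $hash{$Que_Sc} ) ) {
      $hash{$Que_Sc}[0] = $EValue;
      $hash{$Que_Sc}[1] = $Position; ### Min position
      $hash{$Que_Sc}[2] = $Position; ### Max position
   }
   else {
      if ( $hash{$Que_Sc}[1] > $Position ) {
           $hash{$Que_Sc}[1] = $Position;
      }
      if ( $hash{$Que_Sc}[2] < $Position ) {
           $hash{$Que_Sc}[2] = $Position;
      }
   }
}

foreach my $Que_Sc ( keys (%hash) ) {
   my ($Query, $Score) = split(/,/, $Que_Sc);
   print $OUT "$Query\t$Score\t$hash{$Que_Sc}[0]\t$hash{$Que_Sc}[1]\t$hash{$Que_Sc}[2]\n";
}
close ( $OUT ) or die "Cannot close $OUT_FILE: $!";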
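
For very large files, the hash above still grows with the number of distinct query/score pairs. If rows with the same query and score are always adjacent, as they are in the sample from the question, each group can be written out as soon as the next one starts, so only one group is held in memory. This is a minimal sketch under that assumption, not part of the accepted solution; the sub name summarize_runs and the in-memory sample are made up for illustration.

```perl
use strict;
use warnings;

# Prints one summary row per run of adjacent rows sharing Query and Score.
# ASSUMPTION: rows with the same Query and Score are contiguous in the input.
sub summarize_runs {
    my ($in, $out) = @_;
    my ($key, $query, $score, $evalue, $min, $max);

    # Writes the group collected so far, if there is one.
    my $flush = sub {
        print {$out} join("\t", $query, $score, $evalue, $min, $max), "\n"
            if defined $key;
    };

    while (my $line = <$in>) {
        chomp $line;
        next if $line =~ /^Query\t/ or $line !~ /\S/;    # header or blank line
        my ($q, $s, $e, $pos) = split /\t/, $line;
        my $k = "$q\t$s";
        if (!defined $key or $k ne $key) {
            $flush->();                                   # previous run is complete
            ($key, $query, $score, $evalue, $min, $max) = ($k, $q, $s, $e, $pos, $pos);
        }
        else {
            $min = $pos if $pos < $min;
            $max = $pos if $pos > $max;
        }
    }
    $flush->();                                           # last run
}

# Usage with a small in-memory sample; a real script would pass \*STDIN.
my $sample = join("\n",
    "Query\tScore\tE-Value\tPosition",
    "At1g01010\t775\t0.0\t4705",
    "At1g01010\t775\t0.0\t5065",
    "At1g01010\t775\t0.0\t4825",
    "At1g01010\t557\te-158\t3996",
    "At1g01010\t557\te-158\t4056",
) . "\n";

open my $in, '<', \$sample or die "Cannot open sample: $!";
summarize_runs($in, \*STDOUT);
close $in;
```

If the same query/score pair can reappear later in the file, this version prints it twice, so the hash-based script is the safer choice for unsorted input.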
 
05.15.2003 at 09:44PM PDT, ID: 8538193
Thanks to both of you for your quick answers - you've saved the remaining bits of my sanity to fight another day. I chose teraplane's because I found it easier to understand exactly what was happening. Thanks again xx :D
 
05.15.2003 at 11:16PM PDT, ID: 8538549
Sorry, one more quick question along the same lines, I guess.
The output looks like this:
Query           Score      Start     End
At1g01010       248        4483      4603
At1g01010       305        3760      3880
At1g01010       309        5173      5293
At1g01010       385        5437      5617
At1g01010       557        3996      4236
At1g01010       775        4705      5065
At1g01010       99.6       6188      6248
At1g01020       180        8266      8326
At1g01020       192        8606      8666
At1g01020       212        7775      7835
How do I now produce an output which, for each unique query, has only the start and end position for the highest score? Last bit of annoyance, promise!!
 
05.16.2003 at 12:28AM PDT, ID: 8538764
Your last bit of annoyance is cleared, if you really manage to understand this :)
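#!/usr/bin/perl
use strict;
use warnings;

my %hash;
while (<STDIN>) {
   ### split with no arguments splits on whitespace and drops the trailing newline
   my ($Query, $Score, $Start, $End) = split;
   next unless defined $End;     ### skip blank or short lines
   next if $Query eq 'Query';    ### skip the header row

   if ( ! exists ( $hash{$Query} ) ) {
      $hash{$Query}[0] = $Score;
      $hash{$Query}[1] = $Start;
      $hash{$Query}[2] = $End;
   }
   else {
      ### numeric comparison, so 99.6 ranks below 775
      if ( $hash{$Query}[0] < $Score ) {
           $hash{$Query}[0] = $Score;
           $hash{$Query}[1] = $Start;
           $hash{$Query}[2] = $End;
      }
   }
}

### sorted so the output order is stable
foreach my $Query ( sort keys (%hash) ) {
   print "$Query\t$hash{$Query}[0]\t$hash{$Query}[1]\t$hash{$Query}[2]\n";
}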


Just feed it the output shown above on STDIN and check the result.
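
The two steps can also be folded into a single pass over the original file, keeping one record per query instead of one per query/score pair. The following is only a sketch of that idea, not what was posted in this thread; the sub name best_score_ranges is made up, and the e-values other than 0.0 and e-158 in the sample are invented for illustration.

```perl
use strict;
use warnings;

# Returns a hash ref: query => { score, evalue, min, max } for its highest score.
sub best_score_ranges {
    my ($in) = @_;
    my %best;
    while (my $line = <$in>) {
        chomp $line;
        next if $line =~ /^Query\t/ or $line !~ /\S/;    # header or blank line
        my ($query, $score, $evalue, $pos) = split /\t/, $line;
        my $rec = $best{$query};
        if (!$rec or $score > $rec->{score}) {
            # first row for this query, or a new highest score: start over
            $best{$query} = { score => $score, evalue => $evalue,
                              min   => $pos,   max    => $pos };
        }
        elsif ($score == $rec->{score}) {
            # another row for the current highest score: widen the range
            $rec->{min} = $pos if $pos < $rec->{min};
            $rec->{max} = $pos if $pos > $rec->{max};
        }
        # rows with a lower score are ignored
    }
    return \%best;
}

# Usage with a small in-memory sample; a real script would pass \*STDIN.
my $sample = join("\n",
    "Query\tScore\tE-Value\tPosition",
    "At1g01010\t775\t0.0\t4705",
    "At1g01010\t557\te-158\t3996",
    "At1g01010\t775\t0.0\t5065",
    "At1g01010\t99.6\t1e-20\t6188",
    "At1g01020\t180\t1e-45\t8266",
    "At1g01020\t212\t1e-50\t7775",
    "At1g01020\t212\t1e-50\t7835",
) . "\n";

open my $in, '<', \$sample or die "Cannot open sample: $!";
my $best = best_score_ranges($in);
close $in;

for my $query (sort keys %$best) {
    my $r = $best->{$query};
    print join("\t", $query, @{$r}{qw(score min max)}), "\n";
}
```

Memory use here depends only on the number of distinct queries, and the rows do not need to be grouped or sorted beforehand.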
 
 