[Webinar] Streamline your web hosting managementRegister Today

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 171
  • Last Modified:

Save space, Gain speed

I am doing a research envolving
building huge suffix trees to a very very large DB (up to 10^9 chars (DNA...))

here are two points I would like to have
your views:

* would declaring:
enum DNA {a,t,c,g};
DNA arr[1000..00]

will actualy cost less space then:
char arr[1000..00]


* is there a critical SPACE / TIME
differance between declaring:

char arr[1000..00]
to:
arr=new char[100..00]

In my view all these questions are compiler depended.

Working on SGM 128M, Irix OS.
code compiles using CC compiler.

Yair
0
yairy
Asked:
yairy
  • 4
1 Solution
 
nietodCommented:
>> * would declaring:
>> enum DNA {a,t,c,g};
>> DNA arr[1000..00]
>>
>> will actualy cost less space then:
>> char arr[1000..00]

probably not.  the size used to store an enum is implimentation defined, but it will always be at least 1 byte.

continues
0
 
nietodCommented:
>>  * is there a critical SPACE / TIME
>> differance between declaring:
>>
>> char arr[1000..00]
>> to:
>> arr=new char[100..00]
Time: yes.  A huge difference.  the new operator has to allocate from a heap and that is very time consuming.  and array allcoated locally or globally will be allocated in fraction of the time.  However all that I am talking about is the time to allocate (or free) the array, for that there is a huge difference, but you probalby do that very rarely, so the time doesn't matter very much.  Once allcoated the two arrays will work just as quickly.  So it is very unlekely that you will find that using new slows down your program.  (Unless you use it a lot)

space:  There is a small difference in total space.  when you allocate from a heap (using new) additonal space must be reserved to help manage the heap (mostly for storing pointers used within the heap.)  But this is only a few extra bytes, compared to the memory used to store a huge array its not a significant increase in size.  So you won't see new using much extra space in that sense (If you use new for many small allcoations, that can be wasteful).  However there is another difference other than total space.  The difference is where the memory comes from.  If you declare an array globally, the memory (usually) comes from a global data segment, if you declare the array locally the memory (usually) comes from the program's stack.  if you allocate the memory with new it comes from the heap.  There may be advantages or dissadvantages to drawing from each of these 3 areas.  For example a program might have limited stack space (it depends on the OS, the compiler and other factors) if that is the case, allocating a huge array locally may cause the program to overflow its stack and crash.  The heap is often designed to handle large allocations, but again there are limits to it.  On some compilers/OSs the heap may expand to accomidate almost any allcoation size, but it might not on others, so yoiu may find that allocating the array with new might cause you to run out of heap space.  etc...

continues
0
 
nietodCommented:
If you find that this array is too large to safely work with (that you can't reliabley allocate it) you might consider alternatives.  Most obviously the data can be stored in a file.  This may slow down access, especially if you need frequent access to points spready through out the data.  Depending on your OS another option is to use memmory mapping to map the array in memory to a file on disk.  This allows you to access the array as if it was in memory, but the OS will swap portions of the array out to disk (saving memory)  when they are less frequently used.
0
The new generation of project management tools

With monday.com’s project management tool, you can see what everyone on your team is working in a single glance. Its intuitive dashboards are customizable, so you can create systems that work for you.

 
LucHoltkampCommented:
Huge amount of memory can be saved by packing several DNA sequences in one byte. For each one you need only 2 bits, so you're memory usage goes down a factor 4 compared to unsigned chars.
The way to go is to write a small container class for a DNA sequence, if you do not try to make it generic, it should be fairly simple...
Embed or inherit a vector<unsigned char> into the container, there you store the data, just keep it simple with a get() and set() function...
What the heck, I write some code for you to get you started:

enum DNA { A=0, T, C, G };
class Chromosome
{
   private:
      vector<unsigned char> *data;   // could also be done with
                                                         // private inheritance
   public:
      Chromosome(unsigned n, DNA dna)
      {
         // initialise container with n copies of dna
         unsigned char temp = dna | dna << 2 | dna << 4 | dna << 6;
         data = new vector<unsigned char>( (n>>2) + 1, temp);
      }
      ~Chromosome()
      {
         delete data;
      }
      void set(unsigned index, DNA dna)
      {
           unsigned i = index >> 2;                                 // vector index
           unsigned m = (index & 0x03) << 1;                // place of DNA in byte
           data[i] = data[i] & ~(0x03 << m) | dna << m;
      }  
      DNA get(unsigned index)
      {
           unsigned i = index >> 2;
           unsigned m = (index & 0x03) << 1;
           return (DNA)(data[i] >> m & 0x03);
      }    
};

Of course this could be made much fancier, with an iterator etc...
However that is not straitforward in this case because you cannot return a pointer or reference to a DNA entry, so expressions like container[10] = A will be difficult to implement, for that reason I used a set and get function.

BTW, I didn't test it, so perhaps it is not perfect :-)

Luc
0
 
nietodCommented:
Good point.  I was going to mention that yesterday but for some reason I though that it would take 4 bits to pack the data (must have been thinking there was 4 values...) so it cuts storage in half.  when you consider the extra complexity, probably not worth it.  but as the actuall savings is 3/4--that starts to get more significant.
0
 
yairyAuthor Commented:


0

Featured Post

Free Tool: ZipGrep

ZipGrep is a utility that can list and search zip (.war, .ear, .jar, etc) archives for text patterns, without the need to extract the archive's contents.

One of a set of tools we're offering as a way to say thank you for being a part of the community.

  • 4
Tackle projects and never again get stuck behind a technical roadblock.
Join Now