Save space, Gain speed

I am doing research that involves
building huge suffix trees over a very, very large DB (up to 10^9 chars (DNA...))

Here are two points on which I would like to have
your views:

* Would declaring:
enum DNA {a,t,c,g};
DNA arr[1000..00]

actually cost less space than:
char arr[1000..00]

* Is there a critical SPACE / TIME
difference between declaring:

char arr[1000..00]
and:
arr=new char[100..00]

In my view all these questions are compiler dependent.

Working on SGM 128M, Irix OS.
The code is compiled with the CC compiler.


>> * Would declaring:
>> enum DNA {a,t,c,g};
>> DNA arr[1000..00]
>> actually cost less space than:
>> char arr[1000..00]

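Probably not. The size used to store an enum is implementation defined, but it will always be at least 1 byte, so an enum array can never be smaller than a char array of the same length. Many compilers store an enum in an int, in which case the enum array would actually be larger.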
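A quick way to check this on a particular compiler is to compare the sizes directly. The sketch below is only illustrative: the helper names `array_bytes` and `enum_array_not_smaller` are made up for this example, and the actual value of `sizeof(DNA)` depends on the compiler.

```cpp
#include <cassert>
#include <cstddef>

enum DNA { a, t, c, g };

// Bytes used by an array of n elements of type T (hypothetical helper).
template <typename T>
std::size_t array_bytes(std::size_t n) {
    return n * sizeof(T);
}

// True if a DNA array is never smaller than a char array of equal length.
bool enum_array_not_smaller(std::size_t n) {
    assert(sizeof(char) == 1);   // guaranteed by the language
    return array_bytes<DNA>(n) >= array_bytes<char>(n);
}
```

Printing `sizeof(DNA)` on the target machine tells you how much larger (if at all) the enum array would be.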


>> * Is there a critical SPACE / TIME
>> difference between declaring:
>> char arr[1000..00]
>> to:
>> arr=new char[100..00]
Time: yes, there is a huge difference, but only in allocation. The new operator has to allocate from a heap, and that is very time consuming. An array allocated locally or globally will be allocated in a fraction of the time. However, all I am talking about is the time to allocate (or free) the array. You probably do that very rarely, so the time doesn't matter very much. Once allocated, the two arrays will work just as quickly. So it is very unlikely that you will find that using new slows down your program (unless you use it a lot).

Space: there is a small difference in total space. When you allocate from a heap (using new), additional space must be reserved to help manage the heap (mostly for storing pointers used within the heap). But this is only a few extra bytes; compared to the memory used to store a huge array it is not a significant increase in size. So you won't see new using much extra space in that sense. (If you use new for many small allocations, that can be wasteful.)

However, there is another difference besides total space: where the memory comes from. If you declare an array globally, the memory (usually) comes from a global data segment. If you declare the array locally, the memory (usually) comes from the program's stack. If you allocate the memory with new, it comes from the heap. There may be advantages or disadvantages to drawing from each of these 3 areas. For example, a program might have limited stack space (it depends on the OS, the compiler and other factors). If that is the case, allocating a huge array locally may cause the program to overflow its stack and crash. The heap is often designed to handle large allocations, but again there are limits to it. On some compilers/OSs the heap may expand to accommodate almost any allocation size, but on others it might not, so you may find that allocating the array with new causes you to run out of heap space, etc.
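To illustrate the heap case, here is a minimal sketch that allocates the array on the heap and reports failure instead of crashing. The function name `try_allocate` is hypothetical, and using std::vector rather than a raw new[] is a choice made for this example, not something from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <stdexcept>
#include <vector>

// Try to give buf exactly n heap-allocated chars, all set to 'a'.
// Returns false, rather than letting an exception escape, if the request fails.
bool try_allocate(std::vector<char>& buf, std::size_t n) {
    try {
        buf.assign(n, 'a');
    } catch (const std::bad_alloc&) {
        return false;            // the heap could not supply n bytes
    } catch (const std::length_error&) {
        return false;            // n exceeds what the vector can ever hold
    }
    assert(buf.size() == n);
    return true;
}
```

A local `char arr[N]` with a huge N has no comparable way to report failure: if the stack is too small, the program simply crashes.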

If you find that this array is too large to work with safely (that is, you can't reliably allocate it), you might consider alternatives. Most obviously, the data can be stored in a file. This may slow down access, especially if you need frequent access to points spread throughout the data. Depending on your OS, another option is to use memory mapping to map the array in memory to a file on disk. This allows you to access the array as if it were in memory, but the OS will swap portions of the array out to disk (saving memory) when they are less frequently used.
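On a POSIX system, the memory mapping option might look roughly like the sketch below. It assumes the POSIX calls open, fstat, mmap and munmap are available on your platform. The `MappedFile` struct and the `map_file` / `unmap_file` names are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// A read-only view of a whole file mapped into memory (hypothetical type).
struct MappedFile {
    const char* data;   // 0 if the mapping failed
    std::size_t size;   // number of bytes mapped
};

// Map the file at path read-only; data is 0 on any failure or for an empty file.
MappedFile map_file(const char* path) {
    MappedFile m = { 0, 0 };
    int fd = open(path, O_RDONLY);
    if (fd < 0) return m;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void* p = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            m.data = static_cast<const char*>(p);
            m.size = static_cast<std::size_t>(st.st_size);
        }
    }
    close(fd);          // the mapping stays valid after the descriptor is closed
    return m;
}

// Release a mapping obtained from map_file.
void unmap_file(MappedFile& m) {
    if (m.data) munmap(const_cast<char*>(m.data), m.size);
    m.data = 0;
    m.size = 0;
}
```

After a successful call, `m.data[i]` reads the i-th character of the file as if it were an ordinary array, and the OS pages the contents in and out as needed.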

A huge amount of memory can be saved by packing several DNA bases into one byte. Each one needs only 2 bits, so your memory usage goes down by a factor of 4 compared to unsigned chars.
The way to go is to write a small container class for a DNA sequence. If you do not try to make it generic, it should be fairly simple...
Embed a vector<unsigned char> in the container (or inherit from it) and store the data there. Just keep it simple, with a get() and a set() function...
What the heck, I'll write some code to get you started:

#include <vector>

enum DNA { A=0, T, C, G };

class Chromosome {
      std::vector<unsigned char> data;   // four 2-bit entries per byte; could also
                                         // be done with private inheritance
public:
      // initialise container with n copies of dna
      Chromosome(unsigned n, DNA dna)
         : data((n >> 2) + 1,
                (unsigned char)(dna | dna << 2 | dna << 4 | dna << 6)) {}

      void set(unsigned index, DNA dna) {
           unsigned i = index >> 2;                 // vector index
           unsigned m = (index & 0x03) << 1;        // bit offset of DNA in byte
           data[i] = (unsigned char)((data[i] & ~(0x03 << m)) | (dna << m));
      }

      DNA get(unsigned index) const {
           unsigned i = index >> 2;
           unsigned m = (index & 0x03) << 1;
           return (DNA)((data[i] >> m) & 0x03);
      }
};

Of course this could be made much fancier, with an iterator etc...
However, that is not straightforward in this case, because you cannot return a pointer or reference to a DNA entry (an entry occupies only 2 bits of a byte). Expressions like container[10] = A are therefore difficult to implement, which is why I used a set and a get function.
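One common workaround is to have operator[] return a small proxy object that stands in for a reference; std::vector<bool> uses the same idea for its reference type. The sketch below is only an illustration of that technique: the class names `PackedDNA` and `BaseRef` are invented here and are not part of the code above.

```cpp
#include <cassert>
#include <vector>

enum DNA { A = 0, T, C, G };

class PackedDNA {
    std::vector<unsigned char> data;   // four 2-bit entries per byte
    unsigned count;                    // number of entries stored
public:
    // Proxy that behaves like a reference to one 2-bit entry.
    class BaseRef {
        unsigned char& byte;           // byte holding the entry
        unsigned shift;                // bit offset of the entry in that byte
    public:
        BaseRef(unsigned char& b, unsigned s) : byte(b), shift(s) {}
        // Reading: convert the proxy to the stored value.
        operator DNA() const { return (DNA)((byte >> shift) & 0x03); }
        // Writing: replace only the two bits that belong to this entry.
        BaseRef& operator=(DNA d) {
            byte = (unsigned char)((byte & ~(0x03 << shift)) | (d << shift));
            return *this;
        }
        // Copy the value (not the location) when assigning proxy to proxy.
        BaseRef& operator=(const BaseRef& other) { return *this = (DNA)other; }
    };

    // n entries, all initialised to A (bit pattern 00).
    explicit PackedDNA(unsigned n) : data((n + 3) / 4, 0), count(n) {}
    unsigned size() const { return count; }

    BaseRef operator[](unsigned index) {
        assert(index < count);
        return BaseRef(data[index >> 2], (index & 0x03) << 1);
    }
    DNA operator[](unsigned index) const {
        assert(index < count);
        return (DNA)((data[index >> 2] >> ((index & 0x03) << 1)) & 0x03);
    }
};
```

With this in place, `seq[10] = A;` and `DNA d = seq[10];` both compile. The proxy is still not a real reference, so taking its address or binding it to a `DNA&` will not work.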

BTW, I didn't test it, so perhaps it is not perfect :-)

Good point. I was going to mention that yesterday, but for some reason I thought it would take 4 bits to pack each value (I must have mixed up the 4 values with 4 bits), so that it would only cut storage in half. When you consider the extra complexity, that is probably not worth it. But as the actual saving is 3/4, it starts to get more significant.
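To put numbers on the saving, the arithmetic can be sketched as below. The function names are made up for this example. For the 10^9 characters mentioned in the question, one char per base needs about 954 MiB, while 2 bits per base needs about 238 MiB.

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed for n bases stored one per char.
std::size_t unpacked_bytes(std::size_t n) {
    return n;
}

// Bytes needed for n bases stored at 2 bits each, rounded up to whole bytes.
std::size_t packed_bytes(std::size_t n) {
    assert(n <= static_cast<std::size_t>(-1) - 3);   // guard against overflow in n + 3
    return (n + 3) / 4;
}
```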
yairy (Author) commented:
