Solved

Please critique my String Class

Posted on 2006-05-29
555 Views · Last Modified: 2013-12-14
Here's my idea:

- Data is held internally in a char* initialized to an array of 128 chars.
- Each instance has a BufferLength unsigned int
- Data is assigned with an overloaded = operator. (The argument is a const char*)
       - If the length of the argument is longer than our buffer, shift the buffer size left by 1 bit (doubling it)

A structure:
- char*
- size

Static Methods & Variables
- Whenever a delete is performed, add the buffer-being-deleted to one of the static queues
- Static queues are of type "A structure" from above and come in these flavors according to buffer size:
-   128-size queue
-   256-size queue
-   512-size queue
-  .... so on up to a point

When a new buffer is created, the appropriate queue is checked to see if a block of memory is available (has previously been freed up)
- If a buffer is available, dequeue it and it becomes the memory-block for the current string instance
- If a buffer is not available, simply allocate a new one
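The design above could be sketched roughly like this (all names here are my own invention, and a real class would also need a copy constructor and copy assignment per the Rule of Three, which are omitted):

```cpp
#include <cstring>
#include <deque>
#include <map>

class PooledString {
    char*        data_;
    unsigned int bufferLength_;

    // One static free-list per buffer size (128, 256, 512, ...): the "static queues".
    static std::map<unsigned int, std::deque<char*> >& pools() {
        static std::map<unsigned int, std::deque<char*> > p;
        return p;
    }

    // Reuse a previously freed buffer of this size, or allocate a fresh one.
    static char* acquire(unsigned int size) {
        std::deque<char*>& q = pools()[size];
        if (!q.empty()) {
            char* buf = q.front();
            q.pop_front();
            return buf;
        }
        return new char[size];
    }

    // "Deleting" a buffer only queues it for reuse; it is never actually freed.
    static void release(char* buf, unsigned int size) {
        pools()[size].push_back(buf);
    }

public:
    PooledString() : data_(acquire(128)), bufferLength_(128) { data_[0] = '\0'; }
    ~PooledString() { release(data_, bufferLength_); }

    PooledString& operator=(const char* s) {
        unsigned int needed = static_cast<unsigned int>(std::strlen(s)) + 1;
        if (needed > bufferLength_) {
            unsigned int newSize = bufferLength_;
            while (newSize < needed) newSize <<= 1;   // double by shifting left
            release(data_, bufferLength_);
            data_ = acquire(newSize);
            bufferLength_ = newSize;
        }
        std::memcpy(data_, s, needed);
        return *this;
    }

    const char* c_str() const { return data_; }
};
```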

I'm going to do some stress testing on this design, but at this point in time it sounds pretty good to me.

Can anybody spot some major flaws in this design? Or perhaps I didn't explain it clearly?



Question by: oxygen_728

21 Comments
 
LVL 86

Assisted Solution

by:jkr
jkr earned 100 total points
ID: 16786928
Strings are commonly used in a multitude of instances, so I don't see what the static members are good for. IMO that breaks the design: each individual string object should maintain its own memory, not something that is common to all instances of the class.
 

Author Comment

by:oxygen_728
ID: 16786963
jkr - The goal of having static members is to have a place for 'deleted buffers' to go when they are 'deleted'

It's mainly a memory reuse scheme.

If I remember correctly - wouldn't the static queues be in one place in system memory, regardless of the number of active instances?

 
LVL 39

Accepted Solution

by:
itsmeandnobodyelse earned 800 total points
ID: 16787753
In 1993, in one of my early C++ projects, I used a string class based on a similar concept. It always allocated a minimum of 64 bytes for each string, regardless of the fact that 99 percent of all strings are shorter. At that time my PC had 16 MB of physical memory while the destination PC had 4 MB. A customer record consisted of 25 strings on average, so after reading 2000 records from a customer file my program ran short on memory. I had to implement my own string class that operated on huge memory arrays (64k arrays) and chunks of 4 bytes. Though nowadays we have at least 512 MB, the overhead has grown as well. IMO, a concept that *wastes* 80 percent or more of the available memory is not appropriate.

Actually, any heap manager operates on chunks whose sizes are a power of 2. AFAIK, the minimum chunk size is 32 bytes on a modern OS, but that may have changed. The problem is to avoid fragmentation, where a request for a big piece of contiguous memory cannot be satisfied even though many small free chunks are available. The worst case is a heap where every second 32-byte chunk is free ...

Heap managers are among the fastest pieces of software you'll find on a computer. You can't compete with that by writing your own memory manager. The only thing that is faster is to use fixed-size arrays allocated on the stack. However, that is hardly suitable for dynamic string management ...

Regards, Alex

Author Comment

by:oxygen_728
ID: 16787887
Alex: In my case I'd be willing to trade a little extra space for a little less computation (which is why I'm starting with 128-byte chunks - I may take it down to 64)

When you're talking about heap management - It is my understanding that you're talking about:
- Sticking newly allocated memory blocks in ideal locations
- Swapping / Merging / Splitting various blocks to make memory less fragmented and/or more suitable to predicted allocations

If this is the case, it is my understanding that C++ does none of this, and it is up to the programmer to do such a thing.

Based upon my understanding of the situation, I have decided to implement my technique above - It will largely avoid fragmentation as I understand it, because the blocks stay allocated.

I plan on using more complex memory management for the larger stuff.


P.S. I'm running into an interesting issue about how to dispose of my static queues at program exit.
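Regarding the P.S.: one common idiom is to register a static "janitor" object whose destructor runs during normal program shutdown and frees whatever is still queued. A sketch with invented names (the single deque stands in for the per-size static queues):

```cpp
#include <deque>

// Hypothetical free-list of recycled buffers (stands in for the static queues).
static std::deque<char*>& freeList() {
    static std::deque<char*> q;
    return q;
}

// A static object whose destructor runs during normal program shutdown
// and returns every still-queued buffer to the heap.
struct PoolJanitor {
    ~PoolJanitor() {
        std::deque<char*>& q = freeList();
        while (!q.empty()) {
            delete[] q.front();
            q.pop_front();
        }
    }
};
static PoolJanitor janitor;  // constructed at startup, destroyed at exit
```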
 
LVL 2

Assisted Solution

by:danielsson
danielsson earned 700 total points
ID: 16787991
An additional point I would consider a major flaw today is that you're using char* instead of wchar_t*. New designs should - IMHO - always take Unicode into consideration.
 

Author Comment

by:oxygen_728
ID: 16788171
Daniel - I understand your point of view... however, I believe there are some situations in which Unicode should be ignored (and mine is one of them)

My situation is DirectX in which many functions want char*.

Care to elaborate a bit? Maybe you can convince me =)

 
LVL 2

Assisted Solution

by:danielsson
danielsson earned 700 total points
ID: 16788260
Following points:
- Windows NT is all Unicode,
- Unicode is faster than single character strings
- You will not get into a hassle when you realize that you need Unicode in the end (e.g. for many names, Unicode characters are needed, even if you're in an English-speaking country)

I really can't believe there are functions in DirectX which do not support Unicode. Try compiling with the _UNICODE preprocessor flag set (remove _MBCS) - do they want wchar_t now? Examples? Most functions either have wchar_t variants only, or they have a FunctionA and a FunctionW version to which Function is mapped, depending on which compiler setting is used (I assume you're using VC++ in some version).
 

Author Comment

by:oxygen_728
ID: 16788270

The books I've learned DirectX from have used char* with the DirectX functions.

Perhaps there are variants that use wchar_t, but I've never used them. I will look into it.

Thanks for the information.
 
LVL 39

Assisted Solution

by:itsmeandnobodyelse
itsmeandnobodyelse earned 800 total points
ID: 16788416
>>>> it is my understanding that C++ does none of this

You are wrong. Whenever you use new/delete, the heap management does exactly that for you. In C it is malloc/free, and the goal of both C++ and C heap management is to gain a maximum of contiguous space at maximum speed.

>>>> I'd be willing to trade a little extra space for a little less computation

There is nothing to trade with... memory chunks are usually organized in (binary) trees. Incrementing the depth by one doubles the amount of memory. Using 64-byte chunks, dealing with 1 MB means a depth of 14. Using 32-byte chunks, you could manage the same amount of strings in only 512k, or have a depth of 15 and 1 MB as well. Running a loop from 0 to 14 or from 0 to 15 doesn't make any difference. What counts is that you could manage double the amount of strings.

The only way you can gain speed over a heap manager is if you know exactly the maximum number of strings and the average range of string lengths, e.g. 99 percent are shorter than 32 characters and only 1 percent longer. Then you could allocate memory once and confine yourself to the management of short strings only.

The main purpose of a string class isn't to increase allocation speed but (string) buffer management, such as dynamic sizing, concatenation, replacing, finding, trimming, (wildcard) matching, parsing, copying, ... Modern string classes also have reference counting, which means you can pass strings by value without copying the text buffer (only incrementing the reference counter). As a string class stores its string length in a member (the allocated length too), it has some advantages over char arrays, which always have to calculate their length by looking for the terminating zero character.

>>>> New designs should - IMHO - always take Unicode into consideration.

I don't think that is true. If you don't intend to sell your prog to Asia and deal with ANSI characters only, UNICODE has no advantages, only disadvantages (e.g. you need double the space and can't look at the data with programs/tools that can't handle UNICODE - still the majority). MS made the design decision to be able to sell one program all around the world.

If you handle all text with your own string class, you could easily go to UNICODE later. You would need to change the internal buffers and all interfaces using char*. The latter you can make easier by using TCHAR instead of char, or a similar typedef of your own if you want to keep it portable.
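The TCHAR idea can be sketched portably like this (MY_UNICODE, MY_T and myLen are invented stand-ins for the Windows _UNICODE/_T/TCHAR machinery, not the real headers):

```cpp
#include <cstring>
#include <cwchar>

// Compile with -DMY_UNICODE to switch the whole string layer to wide
// characters; MY_T() wraps literals so they match whichever type is active.
#ifdef MY_UNICODE
typedef wchar_t MyChar;
#define MY_T(s) L##s
#else
typedef char MyChar;
#define MY_T(s) s
#endif

inline std::size_t myLen(const MyChar* s) {
#ifdef MY_UNICODE
    return std::wcslen(s);
#else
    return std::strlen(s);
#endif
}
```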


Regards, Alex
 
LVL 39

Assisted Solution

by:itsmeandnobodyelse
itsmeandnobodyelse earned 800 total points
ID: 16788472
>>>> - Windows NT is all Unicode,

Exactly - all registry strings are UNICODE.

>>>> - Unicode is faster than single character strings

Why should it be faster? It always needs double the space to store the same amount of data. 16-bit arrays are not faster than 8-bit arrays on a 32-bit or 64-bit processor. The only exception is if you always need to convert from ANSI to UNICODE and back, e.g. when using a pure UNICODE library - then you have advantages when using UNICODE strings only.

>>>> - e.g. for many names, Unicode characters are needed, even if you're in an English-speaking country

No, that isn't true. There are single-char character sets for all American and European countries, including Russia. The ANSI character set is valid for all American and European names.

Regards, Alex


 
LVL 2

Assisted Solution

by:danielsson
danielsson earned 700 total points
ID: 16788647
>>- Unicode is faster than single character strings

>Why should it be faster? It always needs the double space to store the same amount of data. 16bit arrays are not faster than 8bit arrays at a 32bit or 64bit processor. The only thing is, if you always need to convert from ANSI to UNICODE and back, e. g. when using a pure UNICODE library, you have advantages when using UNICODE strings only.

On Windows NT (including XP), all function calls using char* are internally converted to Unicode strings. That's why Unicode is faster. Not that it would matter for nearly all projects, but it's like that.

>>- e.g. for many names, Unicode characters are needed, even if you're in an english speaking country

>No, that isn't true. There are single-char charactersets for all American and European countries, including Russia. The ANSI characterset is valid for all American and European names.

I know. But still: have you ever messed around with codepages? It seems so. Then you know that it's a real pain in the neck to convert from one codepage to another and to keep track of which codepage is used where. Additionally, many countries even allow persons to keep special characters in their name when they move to another country ;) Examples include Czech names with "hacek" symbols.

Using Unicode from the start simply removes all sorrows about portability, just as you said above: MS introduced Unicode to be able to port Windows to any language. Do you know whether your product will be sold to China in the future? It's barely any additional effort to use Unicode everywhere if you're designing something from scratch. So, in my opinion(!), do it.

This is some further interesting reading on Unicode:

http://www.joelonsoftware.com/articles/Unicode.html
 
LVL 2

Assisted Solution

by:danielsson
danielsson earned 700 total points
ID: 16788795
On the note that debuggers don't support Unicode for visual browsing: that's true for VS2003 and earlier. Things get better with VS2005.
 
LVL 39

Assisted Solution

by:itsmeandnobodyelse
itsmeandnobodyelse earned 800 total points
ID: 16788799
>>>> all function calls using char* are internally converted to Unicode strings

It's Windows API only.

>>>> Unicode from the start simply removes all sorrows on portability

I have had different experiences. In all projects where UNICODE was involved, we got into a real mess because none of the existing libraries, tools, printers, devices, ... could actually handle it. Did you ever try to open a UNICODE text file using a normal editor, mail client, double clicking, ...? And it isn't portable: the moment you leave the (NT) MS world, you'll find out that the sorrows begin ...

You know the _UNICODE macro used in the Windows API. The idea is to simply switch the macro and turn a UNICODE project to ANSI and back ... Did you ever hear of a project where that worked? IMO, UNICODE is a good concept but badly implemented and poorly supported.

If there is only one tool/library that can't handle UNICODE, you need to convert strings from UNICODE to ANSI, which means you lose all the information that might have been the reason for using UNICODE in the first place. Great!

Note, most (computer) people in India, China or Japan are using the English Windows version, not the localized one. We have some of them here as experts at EE.

Did you really have good experiences when using UNICODE? In big projects?

Maybe I would make a new attempt if things have improved in the last years ...

Regards, Alex



 
LVL 2

Assisted Solution

by:danielsson
danielsson earned 700 total points
ID: 16788831
Well, I have had terrible experiences with a big non-Unicode application which was to support languages other than English (and German). We went for the codepage solution, which was a big mistake. For new modules, we then started using the _T macros so that we could compile them both in Unicode and single-byte mode. In fact, that worked out quite well.

Oh, another note on when Unicode is a really bad idea: Windows 98/ME support. If you want to support the Win9x operating systems, you'd need Unicows to make Unicode work. If you stick to Windows NT (XP, 2003, ...) and you're using Visual Studio 2005, you'll be fine. In all other cases, stick with char* if you don't desperately need Unicode, or use the _T notation to be able to port to Unicode later.

But the question was whether there are any major flaws in his string class, and in my opinion, a general string class should definitely work with Unicode. You're of course right: if you need tools which don't support Unicode, you're in trouble. But in that case, you are anyway - such tools can't even handle file names which aren't in the current codepage.
 
LVL 39

Assisted Solution

by:itsmeandnobodyelse
itsmeandnobodyelse earned 800 total points
ID: 16788856
>>>> That's true for VS2003 and downwards. Things get better with VS2005.

That's exactly what I meant. If you're leaving college in 2006 and starting with .NET, not having to use any old library or old code, then with VS2005 you are finally able to debug UNICODE strings.

Do you understand why I didn't recommend them ...

Regards, Alex
 
LVL 2

Assisted Solution

by:danielsson
danielsson earned 700 total points
ID: 16788872
>>Do you understand why I didn't recommend them ...

Yes, of course. Do you understand why I recommend them ;)
 
LVL 4

Assisted Solution

by:e_tadeu
e_tadeu earned 200 total points
ID: 16789177
With your algorithm, memory is never freed... suppose the user reads a 1 GB document file into your strings and then closes it... you will have 1 GB of memory in the "static queue" that will be mostly unused.

Why don't you just use std::string? It's portable, fast, reusable, well tested and encapsulated :)
It uses standard allocation (new) and deletion... the only problem is that it has a slight overhead if you need to allocate/deallocate *a lot*. In that case, you can use std::string with a custom allocator, such as a memory pool. See:

http://www.boost.org/libs/pool/doc/index.html

Or you can use other variations, such as the STL rope, or other implementations that use COW (copy-on-write). There are plenty of options out there!
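Plugging a custom allocator into std::basic_string looks roughly like this. CountingAllocator is an invented minimal example that just counts calls and forwards to malloc/free - a real pool (e.g. boost::pool_allocator) would recycle fixed-size blocks instead - so it only shows where a pool would plug in. Assumes C++11 minimal-allocator support:

```cpp
#include <cstdlib>
#include <new>
#include <string>

// Invented minimal allocator: counts calls and forwards to malloc/free.
template <class T>
struct CountingAllocator {
    typedef T value_type;
    static std::size_t allocations;   // how many times allocate() ran

    CountingAllocator() {}
    template <class U> CountingAllocator(const CountingAllocator<U>&) {}

    T* allocate(std::size_t n) {
        ++allocations;
        void* p = std::malloc(n * sizeof(T));
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
};
template <class T> std::size_t CountingAllocator<T>::allocations = 0;

template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

// A std::string that routes every heap request through the allocator above.
typedef std::basic_string<char, std::char_traits<char>, CountingAllocator<char> > PoolString;
```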
 

Author Comment

by:oxygen_728
ID: 16792551
Ok Alex and e_tadeu:

I guess one of my main concerns was memory fragmentation due to lots of allocation/deallocation. New/delete is vulnerable to fragmentation after many allocations and deallocations, correct? My design above handles that problem by reusing memory blocks.

However, I suppose std::string using a memory pool would accomplish the same thing?

e_tadeu:
What I implemented above is basically a string wrapper with a memory pool.
The only reason I might be hesitant to use std::string is that I'd be worried about memory fragmentation after the program has allocated/deallocated several tens of thousands of strings while also managing larger chunks of memory (models/textures).

I'm going to post another question about std::string which I will link here.

e_tadeu - You're correct that memory is never freed in my algorithm - and that's on purpose. Given that the string class is designed with games in mind, I can establish some documentation for the class that says it is efficient up to a certain size of string - OR - more likely, I will have it simply deallocate blocks over a certain size (which, in my opinion, are much less likely to cause horrible fragmentation).

 

Author Comment

by:oxygen_728
ID: 16792638
Here's a question I just posted specifically about std::string
http://www.experts-exchange.com/Programming/Programming_Languages/Cplusplus/Q_21868579.html

If it turns out to be nearly as efficient as my own string class, of course I'll use it.

 
LVL 11

Assisted Solution

by:dbkruger
dbkruger earned 200 total points
ID: 16793912
Unicode is, of course, slower than char by virtue of the greater number of bytes. I would write a templated version to support both, with wrappers if you don't want people to have to use the template.

Strings are typically of widely different sizes, so allocating 128 bytes is not optimal, but then again, in my experience it is not very hard to significantly improve on performance by such memory tuning. The default allocators aren't great. This is old information though, and before you go writing something you ought to see how good the C++ memory allocator is on the platforms you are targeting. If you can't beat it, you shouldn't be writing your own allocation scheme. And of course, since it is work, you'd better be able to beat it by a significant percentage. But unless you support the size of the strings you are using, you are extremely unlikely to win. I would hazard a guess that the average string size is somewhere under 10, and probably under 4 (there are a huge number of empty strings for one). Your implementation will create a massive penalty in page faults for such systems, not to mention actually requiring a huge amount of memory. So you need to use size-binned block storage (4, 8, 12, etc) in order to have any hope of being faster than the default.

The default in every system I have encountered uses malloc as the underlying call, but that is old information. If it is true for your system, then of course the fact that it encodes the block size in the memory each time is a huge performance loss, particularly for small blocks of memory. For a string of size <= 4, the size adds 100% overhead, both in space and time.

Again, even if the default memory allocators are pretty good, you should be able to win in some specific circumstances. For example, if you are going to allocate a bunch of strings, do some processing, and deallocate them all at once, you could block allocate a whole chunk of memory, allocate from that pool, and turn the delete into empty inlined code. That's a huge performance win. The same trick has worked wonders for situations when building a dynamic data structure like a linked list, where the entire list is to be deallocated at the same time. Write a node destructor that doesn't bother freeing the memory, and free the whole block at the end. For this type of situation, I have seen performance increases of 100% and more, depending on how light the other processing is.
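The allocate-in-bulk, free-all-at-once trick described above is essentially a bump-pointer arena. A sketch (names invented; note this only works cleanly for node types that don't own other resources, since individual destructors never free anything):

```cpp
#include <cstddef>
#include <new>

// Hypothetical bump-pointer arena: everything is "freed" at once when the
// arena is destroyed, so individual deletes become no-ops.
class Arena {
    char*       block_;
    std::size_t capacity_;
    std::size_t used_;
public:
    explicit Arena(std::size_t capacity)
        : block_(new char[capacity]), capacity_(capacity), used_(0) {}
    ~Arena() { delete[] block_; }   // one delete frees every allocation

    void* allocate(std::size_t n) {
        // round up to pointer size to keep subsequent allocations aligned
        n = (n + sizeof(void*) - 1) & ~(sizeof(void*) - 1);
        if (used_ + n > capacity_) throw std::bad_alloc();
        void* p = block_ + used_;
        used_ += n;
        return p;
    }
};

// Example node type: placement-new into the arena; no per-node delete needed.
struct Node {
    int   value;
    Node* next;
    Node(int v, Node* n) : value(v), next(n) {}
};

inline Node* makeNode(Arena& a, int v, Node* next) {
    return new (a.allocate(sizeof(Node))) Node(v, next);
}
```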

Some would argue that the amount of time involved is small, but I would disagree. Strings being widely used, if you come up with an implementation that is 30% better than the one in the STL, you have something worthwhile.

 

Author Comment

by:oxygen_728
ID: 16801578
This may be of interest to you string gurus: http://www.experts-exchange.com/Programming/Programming_Languages/Cplusplus/Q_21870137.html

It is a question regarding writing to std::string as if it were a char*.

