wevouch

asked on

BerkeleyDB: High Volume database hitting limits

I am using Berkeley DB to store real-time ticks for market data. To give you an idea of the amount of data...
- I get roughly 15000 ticks/sec on average
- by the end of the day the database file grows to about 50 GB.

My system configuration is...
$ uname -a
Linux gritsbox.rsi.com 2.6.18-92.1.17.el5 #1 SMP Wed Oct 22 04:19:38 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
- Quad core 1.6 Ghz
- 4GB RAM
- I have set the Berkeley DB cache to 2 GB and I am using the BTREE access method.
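For reference, a minimal sketch of this kind of setup with the Berkeley DB C++ API; the environment path, flag choices and lack of error handling are illustrative assumptions, not the actual code:

    // Sketch only: env path, flags and missing error handling are assumptions.
    #include <db_cxx.h>

    int main() {
        DbEnv env(0);
        env.set_cachesize(2, 0, 1);               // 2 GB cache in a single region
        env.open("/data/tickenv",                 // hypothetical environment directory
                 DB_CREATE | DB_INIT_MPOOL | DB_THREAD, 0);

        Db db(&env, 0);
        db.open(NULL, "ticks.db", NULL, DB_BTREE, DB_CREATE | DB_THREAD, 0);

        // ... one Db::put() per incoming tick goes here ...

        db.close(0);
        env.close(0);
        return 0;
    }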

I started out using one database to store everything, and it works well for three-quarters of the trading day. However, Db::put calls increasingly start to take longer, and towards the end of the day the delay is so high that I start losing ticks. I figured this could be because BTREE tries to rebalance a heavy tree on every put (is this correct?).

I then decided to open 10000 (yeah!!!) databases in the environment, one for each symbol (ticker) that I am subscribed to. My idea was that balancing a much smaller tree would be faster for Berkeley DB. However, when I run the code, I get a "Db::open: Cannot allocate memory" error after opening just 300 or so databases.
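The per-symbol open is essentially a loop like the following (the symbol list, file naming and the use of a std::map are illustrative assumptions); every handle opened this way takes memory in the shared environment, which is presumably where the allocation failure comes from:

    // Sketch only: names, container choice and missing error handling are assumptions.
    #include <db_cxx.h>
    #include <map>
    #include <string>
    #include <vector>

    std::map<std::string, Db*> open_per_symbol(DbEnv& env,
            const std::vector<std::string>& symbols) {
        std::map<std::string, Db*> handles;
        for (size_t i = 0; i < symbols.size(); ++i) {
            Db* db = new Db(&env, 0);
            // One file per symbol; each open handle consumes resources
            // in the shared environment region.
            db->open(NULL, (symbols[i] + ".db").c_str(), NULL,
                     DB_BTREE, DB_CREATE | DB_THREAD, 0);
            handles[symbols[i]] = db;
        }
        return handles;
    }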

How do I make this work?

thanks
Gns

I don't think your analysis of why the writes slow down is quite right... I think it is _much_ simpler than that. Likely the initial seek just takes a huge time to complete, so each write progressively slows... and with each tick, the write gets just a tad slower...

Now, your second problem is that your excessive use of "DB files" simply locks a lot of memory. So that is a no-no.

I'd do one of a few different things:

- Allocate 3 database files/day. Programmatically switch which one to write to depending on the time... so that you get three equally sized files over the workday (see the sketch after this list).

- Switch to a real RDBMS. It would need some attention, so that a huge volume of writes doesn't make it keel over... My "quant-users" used to do this for Reuters... They only saved about 50 ticks/second to our Oracle DB, but... that was enough to put some hefty pressure on the server. The accumulated data, to be used for some time series or similar, wasn't used much (thankfully), since that would've been the real killer;-).

- Focus on usability. If the saved data cannot be used, don't save it;-):-). Massage it as best you can, then save the results.
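To make the first bullet concrete, here is a rough sketch of the time-based switch, assuming three pre-opened Db handles and made-up session boundaries; it illustrates the idea, it is not tested code:

    // Sketch only: the slot boundaries and the three pre-opened handles
    // (e.g. ticks0.db .. ticks2.db) are assumptions.
    #include <db_cxx.h>
    #include <ctime>

    // Pick one of the three daily files from the local wall-clock time.
    int current_slot() {
        std::time_t now = std::time(NULL);
        std::tm* lt = std::localtime(&now);
        int minutes = lt->tm_hour * 60 + lt->tm_min;
        if (minutes < 11 * 60 + 40) return 0;   // roughly the first third of a 9:30-16:00 session
        if (minutes < 13 * 60 + 50) return 1;   // second third
        return 2;                               // last third
    }

    // Route each write to the file for the current time slot.
    int put_tick(Db* dbs[3], Dbt* key, Dbt* data) {
        return dbs[current_slot()]->put(NULL, key, data, 0);
    }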

Cheers
-- Glenn
wevouch

ASKER

Thanks for your response Glenn.

- Allocating 3 files/day might work; the only problem is blocking all threads while the files are created and the DB pointers are swapped (see the sketch after this list). However, it's worth a try, and if nothing else works I will go with this.
- Real RDBMS: wouldn't that be slower, with the SQL layer and the data-type checking that an RDBMS has to do (among other things)? I actually tried this before I decided to go with Berkeley DB and it was horribly slow. 25000 inserts/sec would be too much for an RDBMS... right?
- Yeah, usability is paramount. A whole bunch of applications/servers are hungry for this real-time tick data. I am storing everything in JSON format, which is small and portable.
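On the first point, one way to avoid blocking the writers might be to pre-open the next file on a maintenance thread and publish it with an atomic pointer swap; the sketch below assumes C++11 atomics and made-up names, and only hints at how the old handle gets retired:

    // Sketch only: C++11 std::atomic, names and the drain/close policy
    // for the old handle are assumptions.
    #include <db_cxx.h>
    #include <atomic>

    std::atomic<Db*> active_db(NULL);      // writers load this on every put

    // Writer threads: no locks, just use whichever handle is current.
    int write_tick(Dbt* key, Dbt* data) {
        return active_db.load()->put(NULL, key, data, 0);
    }

    // Maintenance thread: open the next file up front, then swap it in
    // without stopping the writers.
    void rotate(DbEnv& env, const char* next_file) {
        Db* next = new Db(&env, 0);
        next->open(NULL, next_file, NULL, DB_BTREE, DB_CREATE | DB_THREAD, 0);
        Db* old = active_db.exchange(next);
        // old->close(0) must wait until no thread still holds the old
        // pointer -- e.g. after a short grace period.
        (void)old;
    }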

I have achieved a significant performance improvement with the same single database by optimizing (and then re-optimizing) my code. However, it's still not enough to handle occasional bursts of 40000 ticks/sec.

I think I have hit a limit on code optimizations. I am not going to upgrade to a quad-core 3 GHz... that might help... right?

thanks
Nishant
ASKER CERTIFIED SOLUTION
Gns
