websss asked:

SQL Server to Big data - need advice on architecture change or options

We use SQL Server 2017 as our main data store (RDBMS).

We have a vehicle tracking app with one massive log table (all positions for all vehicles) and a lot of other small tables (car number plates, users, etc.). Pretty much every query involves a join to this main log table.
We are just getting going, and the main table is already over a billion rows and over 1 TB, with a relatively small number of cars compared to our growth plans for the next year.
I know this is small by big-data standards, but we estimate we will grow between 10x and 100x in the next year, which has raised some long-term architecture questions.

Currently we use SQL Server, which runs nicely; we will also be adding Redis to reduce the load on the database.
Currently the database grows by 10 GB every day.
Considering we may soon be growing by at least 100 GB a day, over time this puts us into the big-data category, and I need to start looking at options for scaling.
Every record inserted into the log table gets an ID, and we use this ID when logging events, alerts, etc. for fast performance (this works well at the current size).

Currently my biggest headache is server disk storage: I can find server options up to 10 TB quite easily, but beyond that the options are limited and prices skyrocket.
Pricing is a massive issue; we are not Google/Microsoft and cannot afford huge server costs.
Performance is also a huge issue; for example, people expect to run reports with very short loading times.
We were also planning to move away from expensive SQL Server, and had previously been looking at MariaDB with the InnoDB engine, but now I'm open to options.

I've been looking into Hadoop, as it seems to tick the pricing box, but I'm not sure about its performance for fast queries or how it works with SQL queries and joins.
I need an always-on option with redundancy and the ability to grow the database size indefinitely.
Real-time (latest position) tracking info is also a high priority.

I read an Uber Engineering article which said they moved away from MongoDB to MySQL because Mongo was too slow, so the NoSQL option concerns me, as does the architecture change and the impact it would have.

I'm hoping to hear from some seasoned experts who could recommend areas to investigate and potential solutions for the database, how to source storage when we hit hundreds of TB, and ways to maintain performance.

We use a bit of Java, but the main coding is C# and .NET Core.
MacNuttin:

The tech for big data is all a big headache. Your MS SQL is running well, so just archive the older data; get an Azure account and push the archive there for occasional random access. The only reason to go with Hadoop / NoSQL / data lakes etc. is large unstructured data, and then you need to learn Spark and Python to make sense of it. Sorry for the short answer; it's partly a bookmark to see where the question goes. Cheers
Do you track exact positions, and do you really need exact values? If not, then you can reduce the amount of data, e.g. by latticing or vectorizing it.

By the way, can you post the table DDL? Do you use compression? Did you evaluate using a columnstore? Do you use partitions?
websss (asker):

Thanks.

We do not use compression, columnstore, or partitions.

I looked into Azure pricing, but it's very high compared to, say, a Hetzner dedicated server with a 10 TB NVMe disk.
Post the table DDL, because using the correct data types can also save some space. Also, using transformed coordinates may allow data types that require less space.
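
For illustration only, a rough sketch of what a space-conscious version of such a position log could look like; every table and column name below is an assumption, not your real schema:

    -- Hypothetical, trimmed-down position log using space-conscious types
    CREATE TABLE dbo.PositionLog
    (
        PositionId   bigint IDENTITY(1,1) NOT NULL,
        VehicleId    int          NOT NULL,  -- FK to the small vehicle table
        GPSDateTime  datetime2(0) NOT NULL,  -- 6 bytes instead of 8 for datetime
        LatitudeE6   int          NOT NULL,  -- latitude  * 1,000,000 (4 bytes vs 5+ for NUMERIC)
        LongitudeE6  int          NOT NULL,  -- longitude * 1,000,000
        SpeedKph     smallint     NOT NULL,
        Heading      smallint     NOT NULL,
        CONSTRAINT PK_PositionLog PRIMARY KEY CLUSTERED (PositionId)
    );
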
You should definitely be using data compression, and columnstore, especially for older data. That would cut your data size dramatically.

Some disk systems also have built-in compression. If you go that route, stick to hardware compression rather than software compression; or, if you decide to try the disk vendor's software compression, vet and test it very seriously and thoroughly before deciding to go with it.
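
As a sketch of how that looks in SQL Server, reusing the hypothetical dbo.PositionLog / dbo.PositionLogArchive names from earlier in the thread:

    -- Page compression on the hot rowstore table:
    ALTER TABLE dbo.PositionLog
        REBUILD WITH (DATA_COMPRESSION = PAGE);

    -- A clustered columnstore on the archive table usually gives the biggest reduction
    -- (the table must be a heap, or the existing clustered index must be converted first):
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_PositionLogArchive
        ON dbo.PositionLogArchive;
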
Usually, the design should keep the primary working table to the bare minimum needed for the immediate requirements, i.e. the past week or month.
Presumably there are no updates to this table, just inserts.
An insert trigger could copy the columns needed for historical reference out to another table that is used for historical queries.
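
For example, a rough sketch of that trigger, reusing the hypothetical table and column names from earlier in the thread:

    -- Copy each inserted position into the archive/history table
    CREATE TRIGGER trg_PositionLog_Archive
    ON dbo.PositionLog
    AFTER INSERT
    AS
    BEGIN
        SET NOCOUNT ON;
        INSERT INTO dbo.PositionLogArchive (PositionId, VehicleId, GPSDateTime, LatitudeE6, LongitudeE6)
        SELECT PositionId, VehicleId, GPSDateTime, LatitudeE6, LongitudeE6
        FROM inserted;
    END;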

This archive/large data table could/should be set up with partitioning, possibly on a yearly basis (each year in its own data file).

The records in the main table can then be trimmed on a regular basis, removing rows outside the desired window.
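
A sketch of such a trim, done in batches to keep the transaction log manageable (the 7-day window and batch size are assumptions):

    -- Purge rows older than the working window, in small batches
    WHILE 1 = 1
    BEGIN
        DELETE TOP (50000) FROM dbo.PositionLog
        WHERE GPSDateTime < DATEADD(DAY, -7, SYSUTCDATETIME());

        IF @@ROWCOUNT = 0 BREAK;
    END;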

With such large storage needs, you should be looking at a SAN.
I like what arnold says. Also, websss, Azure is just a seamless connection to more space and big-data capability; you get what you pay for. A SAN also solves the space issue seamlessly, and surely you will want to look into an open-source one. Split your biggest tables so only the newest, most heavily used data is on the local machine. There are tips on how to do this elsewhere:
https://dba.stackexchange.com/questions/48176/splitting-a-large-table-to-improve-performance

To get ready for big data, download the new Azure Data Studio and learn about Kubernetes.
Hi,

You said in earlier comments:

"...the main table is over a billion rows and over 1TB...",
"We do not use compression, columnstore or partitions", and
"Currently the database grows by 10 GB every day".

For these reasons alone, I strongly suggest you look at partitioning and compressing your data.
  • Use compression to save disk space.
  • Use partitioning for performance (partition pruning, where queries only access the partitions in which the required data resides) and for easier maintenance such as reorgs, archiving old data, loading and offloading data, etc.

Maintenance of a HUGE non-partitioned table can take many hours if not days, whereas a partitioned table only requires maintenance of its most active partitions; if the partitioning is done right, that takes anywhere from a few minutes to a few hours, depending on the size of the partitions.
Queries on a HUGE non-partitioned table are also very CPU- and resource-intensive and take much longer to execute than queries on a partitioned table.
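
To illustrate the maintenance point: with a partitioned table you can rebuild and compress one partition at a time instead of touching the whole table. A minimal sketch, assuming a hypothetical index name and partition number:

    -- Rebuild (and page-compress) only partition 6 of the assumed index
    ALTER INDEX IX_PositionLog_GPSDateTime
        ON dbo.PositionLog
        REBUILD PARTITION = 6
        WITH (DATA_COMPRESSION = PAGE);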

Regards,
     Tomas Helgi
websss (asker):

Thanks Tomas

We generally store data for 6 months.
Data usually comes in for right now; in rare cases it might be for yesterday, or possibly for earlier that week or month.
We don't get anything coming in that is older than 31 days (based on a business rule to discard such data).

I've looked at partitioning. As TODAY is queried most frequently (generally the last hour), should I be partitioning by hour for today, then by day (up to a week), then by month?

Or would the overhead of this be too much?

Everything is based on GPSDateTime.
Do new partitions get created automagically for new hours/days, or do I need to create them with a script?
Hi,

"We generally store data for 6 months"

Start with partitioning on a date column (e.g. GPSDateTime), with one month in each partition.
That should be a good starting point. :)
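
A minimal sketch of monthly partitioning on GPSDateTime (boundary dates and filegroup here are assumptions). To answer the earlier question: new partitions are not created automatically; you add the next boundary yourself, typically from a scheduled job:

    -- One partition per month; RANGE RIGHT so each boundary value starts a new month
    CREATE PARTITION FUNCTION pfGpsMonthly (datetime2(0))
    AS RANGE RIGHT FOR VALUES ('2019-01-01', '2019-02-01', '2019-03-01',
                               '2019-04-01', '2019-05-01', '2019-06-01');

    CREATE PARTITION SCHEME psGpsMonthly
    AS PARTITION pfGpsMonthly ALL TO ([PRIMARY]);

    -- Scheduled job: add next month's boundary ahead of time
    ALTER PARTITION SCHEME psGpsMonthly NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION pfGpsMonthly() SPLIT RANGE ('2019-07-01');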

Regards,
    Tomas Helgi
"We generally store data for 6 months"
Get your requirements straight.

Either you store data for up to 6 months or you don't. "Generally" is not a term you can work with.

As you didn't answer:

1) How do you store your coordinates?
INT works. Just shift your latitude/longitude to the left by the necessary precision; 4 to 6 decimal places is sufficient in many cases.

The smallest NUMERIC is 5 bytes, so this alone would already reduce the space needed by 20%.

2) Now some math (the vectoring approach).
Assumption: you track using a constant time delta of 1 second.
Just store the start and end as full coordinates (INT) and store the intermediate values as deltas/offsets from the previous point. Then you can use SMALLINT (-2^15 (-32,768) to 2^15-1 (32,767)); 32,767 cm/s can be enough resolution. This would reduce space by roughly 50%.
When your time delta is larger or you need a larger spatial delta, you can still use SMALLINT; just adjust the precision to 32,767 dm/s or m/s.

3) Now some math (the latticing approach).
Assumption: a precision of 11 m is sufficient (4 decimal places in the coordinates).
Using SMALLINT (-32,768 to 32,767) gives a window of approximately 720 km x 720 km. So store the full start and end coordinates, and store the positions translated into this window's coordinate system. This also reduces space requirements by about 50%, but requires some more math when tracks leave a window.

1) can be used when you don't need much of SQL Server's spatial query functionality.
2-3) can be used when you only archive this data and don't need to work with multiple tracks in SQL Server natively.

P.S. For 1), when your tracked coordinates don't span the whole world but only one country, then even SMALLINT may be an option, with a precision of 110 meters.
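
A minimal sketch of points 1) and 2); the sample values, table and column names are assumptions:

    -- 1) Scaled-integer coordinates: keep 6 decimal places, store as INT
    DECLARE @lat decimal(9,6) = -1.292066, @lon decimal(9,6) = 36.821946;  -- assumed sample position
    SELECT CAST(ROUND(@lat * 1000000, 0) AS int) AS LatitudeE6,
           CAST(ROUND(@lon * 1000000, 0) AS int) AS LongitudeE6;

    -- 2) Full coordinates only for the first point of a track, SMALLINT deltas for the rest
    CREATE TABLE dbo.TrackPoint
    (
        TrackId     bigint   NOT NULL,
        Seq         int      NOT NULL,
        DeltaLatE6  smallint NOT NULL,  -- offset from the previous point, in 1e-6 degrees
        DeltaLonE6  smallint NOT NULL,
        CONSTRAINT PK_TrackPoint PRIMARY KEY CLUSTERED (TrackId, Seq)
    );
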
Hi,

Do you need any more help on this matter?

Regards,
    Tomas Helgi