Cassandra Select Query

I have a 3 node datastax cassandra(Community) cluster with huge data. I have few tables which contain 3-5 billion records in them. I want to delete data that is older than 90 days from those tables.

The problem is how do i run a select query which runs without timeout. I am currently running below query

NOW=$(date -d "-3 month" +"%Y-%m-%d")
select day_ts from table_name where minute_ts < '$NOW' LIMIT 100000 ALLOW FILTERING;


Even if i limit the select query result, it will still parse the whole 3-5 billion records and then filter the data.

Please suggest what can be a efficient way to do this.
Abhinav GroverAsked:
Who is Participating?
 
Tomas Helgi JohannssonCommented:
Hi!

Is your table partitioned ? If not, table this size should be partitioned otherwise you will have performance issues on your queries. Choose a good primary key/clustering key to cluster your data across your nodes for "near even" workload distribution.
Partitioning the data in a "right" way you exclude  data that does not meet your queries filtering in their parsing/search of the data hence shorter execution time.
Also if you don't have index on the table I strongly suggest you put on an index that satisfy your queries where clause columns to speed up your queries.

http://datascale.io/cassandra-partitioning-and-clustering-keys-explained/
http://www.planetcassandra.org/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key/
https://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_compound_keys_c.html
https://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_secondary_index_c.html

Regards,
   Tomas Helgi
1
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.