elasticsearch; document database, algorithms

Hello,

I have a big performance problem in my application.

My application receives packages over the network and inserts into Elasticsearch the number of packages received at each point in time (it generates more than a million lines per 24 hours).

My front end (HMI) shows the total number of packages received, per type, between midnight and now. The HMI refreshes every 10 seconds.

So, every 10 seconds, I run a query that sums the number of packages of each type from midnight until now.

I think the query takes more than 10 seconds, so when I send the next refresh query, the app blocks.

What I am thinking of doing is the following:

Insert one document per package type that holds the running total, so that each time I calculate, I only add the packages received in the last 10 seconds to the previous total.

So I have two questions:

1- Is there a way to insert only one total and update it, rather than inserting a new total every 10 seconds? I know there is an update query, but we would prefer not to issue a lot of update queries. I am looking for a way to have new packages accumulate into the total automatically, or any other approach. What is the best practice here?

2- How can I ensure that the new total is the old total plus the new packages, with no packages lost or counted twice? I may insert several lines per second; imagine that I am retrieving packages and new packages arrive at the same time.

Sorry that my question takes a few minutes to read and understand.

Best regards

Jeffrey Dake

Hi Adam,

I'm trying to make sure I understand your problem. For each package you receive, you create a new document in Elasticsearch that represents that package and is indexed in your search index.

You then have a frontend that queries Elasticsearch every 10 seconds. This query takes a long time because the search index is so large, so every time it runs it blocks the frontend from doing anything else.

So you are looking for a better way to get the data that has come in within the last 24 hours so that your application is not blocked, and you want to make sure that your totals are correct. Does that sound right?

Jeffrey Dake

Assuming I understood your statement correctly, is the frontend application getting blocked because it is running a long query? My initial thought is that you could keep the query you have, but move it into a background process that sums up all the totals and stores them in another store such as Memcached or Redis. Your frontend application could then just read those pre-computed values, so its requests would no longer run the Elasticsearch search that may be very slow. Hopefully this helps. If not, I look forward to your response so I can help with what we could try next (I am worried I am not understanding what is "blocking" your application).
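To make the background-process idea concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: `TotalsCache` and `fake_query` are hypothetical names, the in-process dictionary stands in for Memcached or Redis, and `fake_query` stands in for the real Elasticsearch query from midnight until now.

```python
import threading
from collections import Counter
from typing import Callable, Dict


class TotalsCache:
    """Holds the latest per-type totals so readers never wait on the slow query."""

    def __init__(self, compute_totals: Callable[[], Dict[str, int]]):
        # Hypothetical callable that wraps the slow Elasticsearch query.
        self._compute_totals = compute_totals
        self._lock = threading.Lock()
        self._totals: Dict[str, int] = {}

    def refresh(self) -> None:
        # Run the slow query outside the lock, then swap in the result.
        fresh = dict(self._compute_totals())
        with self._lock:
            self._totals = fresh

    def get(self) -> Dict[str, int]:
        # Cheap read for the frontend: a copy of the last computed totals.
        with self._lock:
            return dict(self._totals)

    def run_forever(self, interval_s: float, stop: threading.Event) -> None:
        # Background loop: only one refresh runs at a time, so slow
        # queries cannot pile up behind each other.
        while not stop.is_set():
            self.refresh()
            stop.wait(interval_s)


def fake_query() -> Dict[str, int]:
    # Stand-in for the real aggregation over today's documents.
    return dict(Counter(["A", "B", "A"]))


cache = TotalsCache(fake_query)
cache.refresh()
print(cache.get())
```

The point of the design is that the frontend only ever calls `get()`, which returns immediately, while the refresh loop runs in a single background thread and so never overlaps with itself.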

Adam AL ABDOU

ASKER

Hello Jeffrey. Thank you for your answer.

What blocks my frontend is that I send refresh queries before the previous query has finished.

I may insert several documents for every package (depending on its content). It is a complicated system, but I tried to simplify the idea.

Yes, you have understood the idea.

I cannot use another database like Redis; I am not the one who makes the technical choices. So, as a developer, I can use only what already exists: Elasticsearch and maybe MongoDB.

The Memcached suggestion is interesting. How can I ensure correctness, that is, that the total at time t2 is the total at t1 plus the number of newly arrived packages?

Right now I recalculate from midnight to now every time. I thought I could reuse the last calculation to compute the new one. Regards

Jeffrey Dake

That is why I was thinking of a Memcached table. The key would be the <Type> and the value would be the <Total>, so you essentially have a mapping from each type to its total. I would then store somewhere else, maybe in Memcached too, the timestamp of the last time you ran the query. Each document in Elasticsearch would also carry a timestamp field recording when the entry was added. That way a background process could refresh your values on whatever schedule you want, and it would only have to query Elasticsearch for documents indexed after the last time you ran the query. I would think this should work for you.
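As a rough sketch of that incremental update (in Python, with everything simplified): the list `index` stands in for the Elasticsearch index, `fetch_since` stands in for a range query on the timestamp field, and the document shape and function names are assumptions made for illustration only.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical document shape: (indexed_at_epoch_seconds, package_type, package_count)
Doc = Tuple[float, str, int]


def fetch_since(index: Iterable[Doc], last_ts: float) -> List[Doc]:
    """Stand-in for an Elasticsearch range query: timestamp > last_ts."""
    return [d for d in index if d[0] > last_ts]


def update_totals(totals: Counter, last_ts: float, index: Iterable[Doc]) -> float:
    """Add documents newer than last_ts to totals and return the new watermark."""
    new_docs = fetch_since(index, last_ts)
    for _ts, pkg_type, count in new_docs:
        totals[pkg_type] += count
    # Advance the watermark to the newest timestamp actually seen, not to
    # "now", so documents stamped after the query ran are not skipped.
    return max((d[0] for d in new_docs), default=last_ts)


index = [(1.0, "A", 3), (2.0, "B", 1)]
totals = Counter()
ts = update_totals(totals, 0.0, index)   # first run counts everything so far
index.append((3.0, "A", 2))
ts = update_totals(totals, ts, index)    # second run only sees the new document
print(dict(totals), ts)
```

Each run only touches documents newer than the stored timestamp, so the cost depends on what arrived since the last run rather than on everything since midnight. A document that becomes visible late while carrying an older timestamp would be missed by this simple version.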

If you wanted to be extra safe and make sure you don't miss any documents, you could also keep another Memcached cache, with entries that expire after a short period of time, in which you store the unique ID of each document you have counted. Then the query that grabs all of the "new" documents from the search index could go a little bit back in time and cross-reference the results with this cache to make sure each document wasn't counted yet. I would probably do this as a safeguard, just in case the actual indexing of a document takes a little while and throws off your times. For example, if putting a document into the index takes a few seconds to a minute, its indexed timestamp is actually behind the current time.

Hopefully you have access to Memcached, but this is how I would probably approach it if I could only use Elasticsearch. Querying Memcached directly for the types and totals should then probably take only a few milliseconds.
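A small Python sketch of that safeguard, again under assumptions: the dictionary `seen` stands in for the expiring Memcached cache of document IDs, the list `index` stands in for the search index, and the 60-second overlap and all names are illustrative. Resetting the totals at midnight is left out.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical document shape: (doc_id, timestamp_epoch_seconds, package_type, package_count)
Doc = Tuple[str, float, str, int]


def update_with_overlap(
    totals: Counter,
    seen: Dict[str, float],  # doc_id -> timestamp; stands in for the expiring cache
    index: Iterable[Doc],    # stands in for the search index
    last_ts: float,
    now: float,
    overlap_s: float = 60.0,
) -> float:
    """Count documents stamped after (last_ts - overlap_s), skipping IDs already counted."""
    window_start = last_ts - overlap_s
    for doc_id, ts, pkg_type, count in index:
        if ts > window_start and doc_id not in seen:
            totals[pkg_type] += count
            seen[doc_id] = ts
    # Expire IDs that can no longer fall inside the next overlap window.
    next_window_start = now - overlap_s
    for doc_id in [i for i, ts in seen.items() if ts <= next_window_start]:
        del seen[doc_id]
    return now


totals: Counter = Counter()
seen: Dict[str, float] = {}
index = [("a", 100.0, "A", 3)]
last = update_with_overlap(totals, seen, index, last_ts=0.0, now=110.0)
# A late document: stamped before the previous run but only visible now.
index.append(("b", 105.0, "A", 2))
index.append(("c", 115.0, "B", 1))
last = update_with_overlap(totals, seen, index, last_ts=last, now=120.0)
print(dict(totals))
```

Document "a" falls inside the overlap window on the second run but is skipped because its ID is already in `seen`, while the late document "b" is still picked up. The overlap only protects against indexing delays shorter than `overlap_s`.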

Adam AL ABDOU

ASKER

Hello Jeffrey, thank you for your answers, which are very helpful. I'll try to put the totals somewhere in a DB and update them when new documents arrive. Best regards

Jeffrey Dake

Glad I could help. Hope it all works out well for you.