Efficient way of deleting docs: Elasticsearch (ASP.NET NEST)

I see we are using Delete by Query to delete docs.

It seems there are better ways to do this, as mentioned here:

https://www.elastic.co/guide/en/elasticsearch/reference/1.6/docs-delete-by-query.html

The main purpose is to reduce index size.

Can you help with how we can decrease the index size?

By NEST I mean:
http://nest.azurewebsites.net/nest/core/delete-by-query.html
Dinesh Kumar asked:
Randy Downs (Owner) commented:
You will have to move to the scroll/scan and bulk APIs if you upgrade to version 2, because Delete by Query is deprecated. It should also be more efficient, since the deletes go out as one bulk request.

Use the scroll/scan API to find all matching ids and then issue a bulk request to delete them.
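The second half of that advice, turning the matching ids into one bulk request, can be sketched roughly as follows. This is a minimal illustration in Python rather than NEST, assuming the hits have already been collected from a scroll/scan search; the index name `logs`, the type name `event`, and the helper name `build_bulk_delete_body` are hypothetical.

```python
import json


def build_bulk_delete_body(hits):
    """Turn search hits into a newline-delimited body for the _bulk endpoint.

    Each hit is expected to carry the _index, _type and _id keys that
    Elasticsearch returns for every entry in "hits.hits".
    """
    lines = []
    for hit in hits:
        action = {
            "delete": {
                "_index": hit["_index"],
                "_type": hit["_type"],
                "_id": hit["_id"],
            }
        }
        lines.append(json.dumps(action, sort_keys=True))
    # The bulk API expects every line, including the last, to end with "\n".
    return "".join(line + "\n" for line in lines)


# Hypothetical hits, shaped like the "hits.hits" array of a scroll response.
sample_hits = [
    {"_index": "logs", "_type": "event", "_id": "1"},
    {"_index": "logs", "_type": "event", "_id": "2"},
]
body = build_bulk_delete_body(sample_hits)
print(body, end="")
```

The resulting string would be sent as the body of a POST to the `_bulk` endpoint, once per page of scroll results. Note that this only marks the documents as deleted; it does not by itself shrink the index on disk.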

Maybe this will help you optimize your index.

Mapping

If your search requirements allow it, there is some room for optimization in the mapping definition of your index:

    By default, Elasticsearch stores the original data in a special _source field. If you do not need it, disable it.
    By default, Elasticsearch analyzes the input data of all fields in a special _all field. If you do not need it, disable it.
    If you are using the _source field, there is no additional value in setting store to true on any other field.
    If you are not using the _source field, set store to true only on those fields that need it. Note, however, that using _source brings certain advantages, such as the ability to use the update API.
    For analyzed fields, do you need norms? If not, disable them by setting norms.enabled to false.
    Do you need to store term frequencies and positions, as is done by default, or can you do with less – maybe only doc numbers? Set index_options to what you really need, as outlined in the string core type description.
    For analyzed fields, use the simplest analyzer that satisfies the requirements for the field. Or maybe you can even go with not_analyzed?
    Do not analyze, store, or even send data to Elasticsearch that you do not need for answering search requests. In particular, double-check the content of mappings that you do not define yourself (e.g., because a tool like Logstash generates them for you).
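As a rough illustration of the mapping tweaks listed above, the sketch below builds an index-creation body in the Elasticsearch 1.x mapping syntax. The type name and the field names `message` and `status` are hypothetical, and whether disabling `_source` is acceptable depends on your application (it rules out the update API, as noted above).

```python
import json


def build_lean_mapping(type_name):
    """Return an index-creation body that applies the space-saving options."""
    return {
        "mappings": {
            type_name: {
                # Do not keep the original JSON document.
                "_source": {"enabled": False},
                # Do not build the catch-all _all field.
                "_all": {"enabled": False},
                "properties": {
                    # Analyzed text without norms, indexing doc numbers only.
                    "message": {
                        "type": "string",
                        "norms": {"enabled": False},
                        "index_options": "docs",
                    },
                    # Exact-match field that skips analysis entirely.
                    "status": {"type": "string", "index": "not_analyzed"},
                },
            }
        }
    }


mapping_body = build_lean_mapping("event")
print(json.dumps(mapping_body, indent=2, sort_keys=True))
```

A body like this would be sent when the index is created; existing indices generally need to be reindexed for such mapping changes to take effect.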
Dinesh Kumar (Author) commented:
For bulk delete: version 2 of the API has not been released yet.

The other part, if I am not wrong, relates to performance and not to saving storage space.

What is happening right now is that we have around 21 indexes, and I see them growing very fast every other day.

For example, one month back one index was 1.8 GB and now it is 8 GB, while the total storage of the server is around 40 GB, so the server could crash soon :)

Also, we keep deleting docs, so it seems the deleted docs are piling up. Permanently deleting them should help, but how? :)
Randy Downs (Owner) commented:
The tweaks above do affect performance, but they probably affect size as well, since those fields would have to be stored somewhere. I am assuming that the fields are discarded if they are disabled, making the index somewhat smaller. The less information you track, the less you need in your index.
Dinesh Kumar (Author) commented:
Implementing experts' suggestions in code to see them in action.
Dinesh Kumar (Author) commented:
I am scratching my head over how I can find out how much space is being taken up by deleted docs.

The deleted docs should exist only in the index, and removing them should help.

As far as I know, a document is something in JSON format that carries the actual data.
Randy Downs (Owner) commented:
Maybe this will help clarify.

When a document is deleted or updated (= delete + add), Apache Lucene simply marks a bit in a per-segment bitset to record that the document is deleted. All subsequent searches simply skip any deleted documents.

It is not until segments are merged that the bytes consumed by deleted documents are reclaimed. Likewise, any terms that occur only in deleted documents (ghost terms) are not removed until merge. This approach is necessary because it would otherwise be far too costly to update Lucene's write-once index data structures and aggregate statistics for every document deletion, but it has some implications:

    Deleted documents tie up disk space in the index.
    In-memory per-document data structures, such as norms or field data, will still consume RAM for deleted documents.
    Search throughput is lower, since each search must check the deleted bitset for every potential hit. More on this below.
    Aggregate term statistics, used for query scoring, will still reflect deleted terms and documents. When a merge completes, the term statistics will suddenly jump closer to their true values, changing hit scores. In practice this impact is minor, unless the deleted documents had divergent statistics from the rest of the index.
    A deleted document ties up a document ID from the maximum 2.1 B documents for a single shard. If your shard is riding close to that limit (not recommended!) this could matter.
    Fuzzy queries can have slightly different results, because they may match ghost terms.
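To get a rough idea of how much of an index consists of deleted documents, one option is to look at the `docs.count` and `docs.deleted` counters that the index stats API (`GET /<indexname>/_stats`) returns. The sketch below only does the arithmetic on such a response; the index name `logs` and the numbers are made up, and the share of documents is only a proxy for the share of disk space, since documents differ in size.

```python
def deleted_doc_ratio(stats, index_name):
    """Return the fraction of documents in the primaries that are marked deleted."""
    docs = stats["indices"][index_name]["primaries"]["docs"]
    live = docs["count"]        # documents still visible to searches
    deleted = docs["deleted"]   # documents marked deleted but not yet merged away
    total = live + deleted
    return deleted / total if total else 0.0


# Hypothetical excerpt of a _stats response for an index named "logs".
sample_stats = {
    "indices": {
        "logs": {
            "primaries": {
                "docs": {"count": 750, "deleted": 250},
            }
        }
    }
}
ratio = deleted_doc_ratio(sample_stats, "logs")
print("deleted fraction:", ratio)
```

A high fraction suggests that a merge which expunges deletes could reclaim a noticeable amount of space.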

Dinesh Kumar (Author) commented:
Sorry, maybe I am asking again, but how can we delete all the documents permanently, let us say using NEST?
Randy Downs (Owner) commented:
The page you linked in the first post has NEST examples. Here's a sample:

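using System;
using Nest;

// Connect to a local single-node cluster.
var node = new Uri("http://localhost:9200");
var settings = new ConnectionSettings(node);
var client = new ElasticClient(settings);

// Delete every document that matches the query; "*" matches all documents.
client.DeleteByQuery<ElasticsearchProject>(del => del
    .Query(q => q.QueryString(qs => qs.Query("*"))));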


Optimize
The optimize API lets you optimize one or more indices. The optimize process basically optimizes the index for faster search operations (and relates to the number of segments a Lucene index holds within each shard). The optimize operation reduces the number of segments by merging them.

This call will block until the optimize is complete. If the HTTP connection is lost, the request will continue in the background, and any new requests will block until the previous optimize is complete.
curl -XPOST 'http://localhost:9200/<indexname>/_optimize'
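Since the goal here is to reclaim space held by deleted documents, the optimize call can also be given the `only_expunge_deletes` or `max_num_segments` parameters from the 1.x optimize API. The sketch below only builds the request URL; the base URL, the index name `logs`, and the helper name `build_optimize_url` are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_optimize_url(base_url, index_name,
                       only_expunge_deletes=True, max_num_segments=None):
    """Build the URL for an optimize request against a single index."""
    params = {}
    if only_expunge_deletes:
        # Merge only the segments that contain deleted documents.
        params["only_expunge_deletes"] = "true"
    if max_num_segments is not None:
        # Merge down to at most this many segments per shard.
        params["max_num_segments"] = str(max_num_segments)
    url = "{}/{}/_optimize".format(base_url.rstrip("/"), index_name)
    return url + ("?" + urlencode(params) if params else "")


expunge_url = build_optimize_url("http://localhost:9200", "logs")
full_merge_url = build_optimize_url(
    "http://localhost:9200", "logs",
    only_expunge_deletes=False, max_num_segments=1,
)
print(expunge_url)
print(full_merge_url)
# Either URL would be sent as an HTTP POST, for example with
# urllib.request.Request(url, method="POST").
```

Merging is I/O heavy and needs temporary extra disk space while it runs, so it is probably best tried on one index at a time on a server that is already short of disk.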


Dinesh Kumar (Author) commented:
It seems this doesn't delete documents permanently, and they result in increased disk usage.
Dinesh Kumar (Author) commented:
When documents are deleted, they go into a deleted index.
Randy Downs (Owner) commented:
I think that's by design. Does optimize help?
Randy Downs (Owner) commented:
The space is not recovered immediately on deletes:

When a document is deleted or updated (= delete + add), Apache Lucene simply marks a bit in a per-segment bitset to record that the document is deleted.
Dinesh Kumar (Author) commented:
Thank you.