asked on

understanding indexing and searching ?

I am exploring aws elastic search for a future project, but would like to understand indexing and searching in general. I would like to get a better understanding of what is the approach of indexing the data e.t.c . like what it means to create a index and once we have index, do we relate that index with some data. I understand similar concept applies when we are searching anything on the web as well. how is indexing done in the web such that we get to our results faster?

noci

High Over:
Search is examing some body of data for occurrences of a specific pattern.... Tools have been created to help here like databases to organize data in a structured way. Searching does require a full search of a complete body of data.

Indexing is a technique where the body of data is scanned once before (or when stuff is added, the new part is scanned, if stuff gets deleted the removed stuff gets scanned).. and a list if indexable items get tallies where that item is in the body of data. (This also applies to databases).

Then when searching FIRST the index list is search and then when the item is found the places where it was found in the body of data can be visited to get all the nearby data. If the index has no known item it eitehr means the data is not there (in organized data) or the data needs to be rescanned for this additional item.

Kent Olsen

Hi Mikha,

That's a really broad subject. Quick answers may give you an idea of areas to explore more deeply, and complete answers aren't easily done in a text based exchange.

Indexing is a requirement for reasonable access through large amounts of data. Equally (and I would argue more) important is database design. There's a reason that OLTP and OLAP systems can look so different, and that is that the two databases are designed to more efficiently accomplish different things.

An index is typically a b-tree structure that contains data values (the data being indexed) and database keys (the location of the data). Data, and indexes, are typically stored in blocks, usually a multiple of 4K (8K, 16K, 32K, etc.) If a million rows are stored 10 rows per block, 100,000 blocks may have to be read to find a desired row, or to determine that it doesn't exist. If the indexes are stored 100 rows per block in a balance b-tree, only 4 blocks have to be read to locate the desired data. The trade off is that changes to the data can also require significant changes to the indexes to keep them in sync. The cost-benefit of that trade off is usually worth it. :)

It's usually a DBA task to add or change indexes to a table. When that happens, it's usually because the database was poorly designed, or the business needs have changed and the original design is inadequate. DBAs typically know the general rules for applying indexes, but they're usually not database architects (designers) and their efforts in modifying indexes can be counter productive.

Indexing data used in web searches has challenges that aren't in typical databases. I suspect that the Google, Yahoo, Bing, etc. databases have been redesigned several times since they were first built. Not enough was known in the early days to predict what searches would look like 20 years into the future or even the raw power available in modern servers.

Hope this gives you a starting point....
Kent

David Favor

As Kent said, this is a vast terrain of information + design considerations.

Design Considerations: How you approach this depends your original data corpus, retrieval speed requirements, number of concurrent users who will be searching at one time, then access pattern (read mostly, read almost always, read/write, write heavy), your budget, your tech expertise or team's tech expertise.

https://mariadb.com/kb/en/full-text-index-overview/ provides a... Google-esque type search syntax facility... and... you'll require someone very smart to work out the design of this.

One of the hidden perks of FTS really kicks in when you'd have to scale... and your data is primarily read only...

With FTS, if you setup a multi-master database system, you can add many instances which will all function independently, then in the rare case when new data must be added, the entire system will slow down slightly.

If you're starting a large project like this, likely good to give consideration to the Design Considerations above.

noci

Addition: i forgot to mention Donald Knuth has written some books on the subject in the 1970's before he got sidetracked because of the poor state of typesetting and book production in the early 1980's. and started on TEX system.

Another Biggie in the industrie is Claude Shannon. although he was in Gaming, he also deviced a way to be able to search scrabble word databases in predictable speeds.'
Where wordt are stored in "letter matrices" goiing right on matches and going down on misses.

(The way google & co can build efficient indices).

mikha

ASKER

thank you everyone for your great insights.

This question needs an answer!

Become an EE member today

7 DAY FREE TRIAL

Members can start a 7-Day Free trial then enjoy unlimited access to the platform.

View membership options

Learn why we charge membership fees

We get it - no one likes a content blocker. Take one extra minute and find out why we block content.