using Hadoop

willie108 asked:

I found this video on installing Hadoop on Windows.
https://www.youtube.com/watch?v=VhxWig96dME

I have about a terabyte of data with approximately 10 billion records. It is currently stored in 30 to 500 mb csv files. To do a keyword search on the data takes about 40 hours. Mostly each record is text.

If I set up hadoop,
1) how difficult would it be to load all this data into hadoop?
2) how long would it take to do a keyword search in hadoop?

Does anyone have any idea?
SOLUTION from aikimark:
What is your hardware profile (memory and CPU count)?

Are you going to use the same disk for Hadoop as you are currently using for the CSV files?
willie108 (asker):
My idea was to get rid of the CSV files or just store them on a hard disk drive. I'm just using a quad-core i7 with 16 GB of RAM.
I may not have enabled Windows indexing, but I thought Hadoop would take much less than 40 hours.
Hadoop, by itself, might only provide an environment for a database/search tool that will fill your searching needs.  Since you aren't in a multi-server environment, I'm not sure that Hadoop would be better than a database capable of exploiting your multi-core hardware.
Maybe you can import it into some SQL database?
Lucene/Solr is another full-text search application.
I'm not a Hadoop expert, but I do know that Hadoop is primarily used to store and process data across a large number of systems.  So unless you plan to take your files and distribute them across a bunch of servers, it may not be the tool for you.  Hadoop, by itself, also does not index your data - and based on your requirements, it sounds to me like what you need is indexing.

Here's a good article that explains what Hadoop is NOT:

http://wiki.apache.org/hadoop/HadoopIsNot

Any tool that indexes your data is going to improve your search times.  Once your data is indexed, query performance depends on a number of factors, but simple queries (even over 10 billion row data sets) that utilize an index can return results in a fraction of a second.  Let's assume that you're running this on a system that stores the data locally (not across a network), and is dedicated to just this purpose.  The main performance factors are:

CPU speed and number of CPUs
Memory
Disk I/O speed
Index size and type
Uniqueness of search terms (if you have one hit on a search term it will come back faster than if you have a million hits)

With non-indexed data, the basic search process is simple - you have to scan every row of every file every time you do a search.  So your search times are pretty much the same every time on a given machine - however long it takes to load each document off disk into memory and scan each row for a match.

But once you've built indexes over your data, the whole game changes.  Now, instead of having to scan every row of every document every time, all your tool needs to do is look up the search term in the index, and then retrieve the specific matching rows from the listed documents.

Think about a textbook or a reference book.  Without an index, you would have to read every word on every page to find all the references to the term "parallel", for example.  If the book is 500 pages long, that is going to take a while.  But if the book has a comprehensive index, all you have to do is flip to the back, find the index entry for "parallel", and now you have a list of all the places in the book where "parallel" is mentioned - in seconds.

That is exactly how text indexing works - you first scan the documents and make an ordered list of all the words used, and where they are used. Then you update that as the underlying data changes.  When it comes time to search, you refer to the index initially - not the billions of rows of underlying text.
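
To make the book-index analogy concrete, here is a minimal toy sketch of an inverted index in Python.  The file names and contents are made up purely for illustration; real tools such as Lucene do this far more efficiently and persist the index to disk.

from collections import defaultdict

def build_index(documents):
    """Map each word to the (file, row) positions where it appears."""
    index = defaultdict(set)
    for doc_id, rows in documents.items():
        for row_no, text in enumerate(rows):
            for word in text.lower().split():
                index[word].add((doc_id, row_no))
    return index

def search(index, term):
    """Look the term up in the index instead of rescanning every row."""
    return sorted(index.get(term.lower(), set()))

# Toy data standing in for the tweet files (hypothetical contents).
docs = {
    "tweets-day1.csv": ["toyota announces recall", "nothing relevant here"],
    "tweets-day2.csv": ["prius owners react to recall"],
}
idx = build_index(docs)
print(search(idx, "recall"))   # [('tweets-day1.csv', 0), ('tweets-day2.csv', 0)]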

Based on your brief requirements, I think a simple indexing/search tool might be more what you want.  If you just need a desktop search tool, here's one open source example called DocFetcher.

http://docfetcher.sourceforge.net/en/index.html

If you're looking for programming tools to help with index and searching tasks, you might want to look at Lucene:

http://www.ibm.com/developerworks/library/os-apache-lucenesearch/
SOLUTION (member-only content not shown)
Hi. I forgot to mention that I am using this as my search tool, FileLocator Pro  http://www.mythicsoft.com/filelocatorpro. Will indexing help if I am using this instead of the default Windows search?

Thank you.
SOLUTION (member-only content not shown)
ASKER CERTIFIED SOLUTION (member-only content not shown)
I don't think that is the optimal tool to use to search CSV files.  If you need to search a particular column in the CSV files for a particular value, the FileLocator Pro software will not restrict its results to that column.  It is a grep alternative.  You need a search tool that understands structured data.
If there were some way to divide the work, you could run four copies of a script to read the CSV files and do the filtering.  That would cut the elapsed time down to 10 hours.  You might be able to increase the number of tasks, but at some point you would be I/O bound.
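
Purely as an illustration of that split-the-work idea (not production code), here is a rough Python sketch that spreads the files across four worker processes.  The folder path and keyword are assumptions; at some point the disk, not the CPU, becomes the bottleneck.

import glob
from multiprocessing import Pool

KEYWORD = "recall"                 # hypothetical search term
DATA_GLOB = r"C:\tweets\*.csv"     # hypothetical location of the delimited files

def scan_file(path):
    """Scan one file line by line and return the lines containing the keyword."""
    hits = []
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if KEYWORD.lower() in line.lower():
                hits.append((path, line.rstrip("\n")))
    return hits

if __name__ == "__main__":
    files = glob.glob(DATA_GLOB)
    with Pool(processes=4) as pool:            # one worker per core
        per_file_hits = pool.map(scan_file, files)
    matches = [hit for hits in per_file_hits for hit in hits]
    print(len(matches), "matching lines")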

You will likely be limited to the speed of your hard drive/SSD and the I/O channel speed.  This is true for loading your CSV data into a database or indexing process.
If this is structured data, and you need to do column-oriented searches, and want the fastest search results, then you should consider loading this data into a database, indexing it, and then using SQL - unless the data is so rapidly changing that maintaining an index is impractical.
At the moment I am stuck with FileLocator Pro because I have a whole lot of code that I did not write which is built around using it. But I need to make a major change because the data is growing so fast (about 1 GB per day).

Would SQL be faster or more appropriate than Hadoop? The learning curve would definitely be shorter.

I will check the cache in FileLocatorPro as well.
What is the source of the CSV data?
The files come from Twitter. Here is a sample.
tweets-07-02-11-00-00-01.csv
OK, so can you explain how you typically need to search this data?

For example, I see that the 6th column is apparently the user name and the 7th is the text of the tweet.  Do you typically search by user name?  By date?  By time?  Or do you typically just search for specific keywords in the tweet text?  Do you do simple single keyword searches?

If I were working with you to design a search system over this, I'd ask you to show me as complete a list as possible of all the searches you do, how often you perform each one, etc.  That would help me understand how to best store the data.

Also, will your database just grow and grow, or will you roll off old data at some point?

If you're adding a gig a day, this will definitely present some ongoing challenges.
1. This isn't a CSV.  Although delimited, the delimiter is not a comma (it looks like a semi-colon).
2. Normally, CSV files include a header.  What kind of keyword search are you trying to describe if there are no headers to describe the fields/columns?
3. This is unparsed data.

==================
If you had a 7200 RPM disk, the fastest you could possibly read 1 TB of data would be a little over 2 hours.  With an SSD, that upper limit drops to about 30 minutes.  This is just read time.  If you need to write the results of each search, you will add to that time as long as you are using the same storage device and I/O channels.
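
The back-of-the-envelope arithmetic behind those figures, using rough (assumed) sequential-read throughputs:

DATA_MB = 1_000_000   # ~1 TB expressed in MB

# Sustained sequential-read throughputs below are assumptions, not measurements.
for device, mb_per_sec in [("7200 RPM HDD", 120), ("SATA SSD", 500)]:
    hours = DATA_MB / mb_per_sec / 3600
    print(f"{device}: ~{hours:.1f} hours just to read 1 TB once")
# 7200 RPM HDD: ~2.3 hours
# SATA SSD: ~0.6 hours (roughly half an hour)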
Why are you downloading tweets instead of letting Twitter do the search for you?
Thanks all.

1) Aikimark, you are right, my mistake. The data is not CSV; it is semicolon-delimited.

2) What do you mean that it is unparsed?

3) Gary, I do boolean searches, each with maybe 6 terms, over the entire set of data. I run about 20 such boolean searches but may want to increase this to 300 or more per month. Then I do a daily search just on the new data (about 1 GB a day).

4) At the moment I am not using the username, but may want to later. I am currently just searching on the text field.

5) A typical boolean search might be: Toyota OR Corrola OR Prius OR LEXUS AND recall

6) The database will just grow and grow. Now I am adding about 500 MB/day, but it will increase to 1 GB/day.
Do you need any of the other fields?
Are you going to limit the searching to non-trivial English words, or allow multi-word phrases and Unicode characters?
I would like to have phrases, but if that is much more difficult I may not. FileLocator Pro allows for phrases. As for Unicode, I am having some trouble finding the encoding that makes tweets originally written in Japanese show up as Japanese.
I am currently not searching the other fields, but would not want to limit myself to that in the future.
Can you explain ONE SCENARIO of what you want to do, so that we understand exactly what you are trying to do with the data.
Is it just checking whether a STRING exists in any of the files, or something else as well?

If it is just a file search, to confirm whether a string exists or not, then you do not need to use Hadoop.

If you can share some more details on WHAT you NEED, I can share a few thoughts. By the way, I worked on a similar freelancing project around data search in a group of files some time back.

Thanks,
I've been playing with the (free) Microsoft Log Parser utility (v2.2).  Without headers, you can still refer to the columns as Field1, Field2, etc., and the SQL is pretty standard and clean.  I'm not sure if the 758 rows in your sample file will give an accurate timing approximation for your large files.  If it is installed on your PC, point the following statement at one of the 30 MB files instead of the sample and see how long it runs.
logparser -i:TSV -iseparator:";" -headerrow:off -o:null "Select Field7 From C:\Users\AikiMark\Downloads\tweets-07-02-11-00-00-01.csv Where (Field7 Like '%Toyota%' OR Field7 Like '%Corrola%' OR Field7 Like '%Prius%' OR Field7 Like '%LEXUS%') AND Field7 Like '%recall%'"

2) What do you mean that it is unparsed?
If you want to limit your search to field7, you will need to parse the records/rows.  As it is, any search tool will see single records with string values.
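
To show what parsing buys you, here is a minimal Python sketch that splits each row on the semicolon so a search can be restricted to the 7th field.  The keyword is just an example; the file name is the sample from this thread.

import csv

def tweet_texts(path):
    """Yield only the 7th field (the tweet text) from a semicolon-delimited file."""
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        for row in csv.reader(f, delimiter=";"):
            if len(row) > 6:          # skip short or malformed rows
                yield row[6]

# Example: count rows whose text field mentions "recall".
hits = sum(1 for text in tweet_texts("tweets-07-02-11-00-00-01.csv")
           if "recall" in text.lower())
print(hits, "matching rows")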
SOLUTION (member-only content not shown)
Thanks. I worked with a tool that used Lucene before, something called OpenSearchServer.
Ultimately I would like to have a web service that would support searching the repository through a web interface. Is Lucene going to be faster than MySQL? And is DocFetcher going to be faster than FLP? After indexing, how long would a full repository search for a single keyword take in the text field of the tweet (with semi-colon delimiters)?

Actually, the system I have is automated in the sense that daily searches are run on the most recent day's data and the results, in the format I uploaded as "output.csv", are deposited into a folder on my PC. Those daily searches are probably getting long now, but I have not checked. The searches I want to run on the entire repository take a ridiculous amount of time.
Aikimark mentioned Log Parser:  it is a great tool, and we use it extensively for system management purposes.  But it doesn't index - which means that every search has to scan every file every time, translating to hours of runtime for every search.  I suggest that you forget about any solution that doesn't index your data.  

Your requirements lend themselves to indexing, for a number of reasons:

1) Once you've taken the time to index a file, that index is available forever, and dramatically speeds up every subsequent search.

2) Your data set is growing, but non-volatile (once added, the existing data doesn't change).  This means that you only need to update indexes once a day, when you receive a new set of data to add to the collection (or you could update them constantly with the right tools if you received data from a stream).

After talking about it with a couple of co-workers, the consensus here is that the Apache Solr search engine (based on Lucene) is an ideal tool for this type of task - especially considering the eventual size your data set is going to reach.  

Solr is specifically designed to create and search full-text indexes over very large data sets.  It is scalable, and can be used in partitioned data sets - meaning that when your data set grows larger, you can scatter your data across multiple systems and conduct fast searches in parallel.  It works well with structured and unstructured data.  It can handle queries in any language.  It can handle very complex queries, and is easy to interface to and retrieve results from.  It is used and proven in environments that index billions of documents.  

It also can do near real-time indexing, so for example if you are using Twitter's streaming API to receive data, you could pump that into Solr and have indexes updated almost immediately.
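
As a rough sketch only - the core name, field names, and local URL below are assumptions, not a tested setup - pushing a batch of new tweets into Solr over HTTP and committing could look something like this:

import requests

SOLR_UPDATE = "http://localhost:8983/solr/tweets/update"   # hypothetical core name

docs = [
    # Field names are illustrative; they must match whatever schema the core uses.
    {"id": "tweet-0001", "user_s": "someuser", "text_txt": "Toyota announces Prius recall"},
]

# Send the documents as JSON and commit so they become searchable right away.
resp = requests.post(SOLR_UPDATE, json=docs, params={"commit": "true"})
resp.raise_for_status()
print(resp.json().get("responseHeader"))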

Take a look at Solr's features, and I think you'll see that it is a great fit.

http://lucene.apache.org/solr/features.html

- Gary
ultimately I would like to have a webservice that would support searching the repository through a web interface.

Solr is accessible through a web service right out of the box.
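
For example, a script or web application could hit Solr's select handler directly over HTTP - again, the core and field names here are assumptions:

import requests

SOLR_SELECT = "http://localhost:8983/solr/tweets/select"   # hypothetical core name

params = {
    "q": "text_txt:((Toyota OR Corrola OR Prius OR LEXUS) AND recall)",
    "rows": 20,
    "wt": "json",
}
resp = requests.get(SOLR_SELECT, params=params)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc.get("id"), doc.get("text_txt"))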

Is lucene going to be faster than MySQL?

For this?  My guess is "yes".  MySQL is great for a lot of things, but Lucene/Solr is optimized for text search.  You could always set up both and benchmark, but I think that you'll find that Solr is the clear winner for this particular task - especially for the size of data set that you'll be working with.

And is DocFetcher going to be faster than FLP?

Orders of magnitude faster.  No comparison.

After indexing how long would a full repository search for a single keyword take in the text field of the tweet (with semi-colon delimiters)?

Depends on the number of result rows.  For a small result set, the search could be sub-second.  For larger result sets, it pretty much just depends on how fast your system can page in the needed disk pages from the underlying documents - for a result set of a million rows, maybe a couple of minutes on fast hardware?  The whole world changes when you go from a raw scan to searching an index.  The index search is rapid - once you find the first match, all the remaining matches are located adjacent to the first term in the index.
I wasn't suggesting Log Parser as the search solution, just as a way to evaluate the time to read/parse the data.  That performance test will inform us of the time required to index all of the data.

I still think that storing all the data on a hard drive is a bad idea.  Let Twitter do the search and just store the results.
The idea is to create separate result sets or indexes for each query.  As you add data, you do the same searches on the new data and then update your result sets.  Don't search the repository more than once for any search criteria.  In your initial queries, read the data once and subject the records to all the query/filtering code (once).
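
A rough sketch of that read-once, apply-every-query approach in Python - the saved queries and the field position are only illustrative:

import csv
from collections import defaultdict

# Each saved search becomes a named predicate over the lowercased tweet text (illustrative).
QUERIES = {
    "toyota_recall": lambda t: any(w in t for w in ("toyota", "corrola", "prius", "lexus")) and "recall" in t,
    "another_search": lambda t: "earthquake" in t and "tokyo" in t,
}

def run_all_queries(path, results):
    """Read the new day's file once and test every saved query against each row."""
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        for row in csv.reader(f, delimiter=";"):
            if len(row) <= 6:
                continue
            text = row[6].lower()
            for name, matches in QUERIES.items():
                if matches(text):
                    results[name].append(row)

results = defaultdict(list)
run_all_queries("tweets-07-02-11-00-00-01.csv", results)   # sample file from this thread
for name, rows in results.items():
    print(name, len(rows), "hits")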

I will ask again...why not let Twitter perform the search and just save the results?
SOLUTION (member-only content not shown)
I don't pretend to understand all of the possible use cases in this particular matter, but I do know that if you plan to repeatedly query a large data set and extract significantly smaller subsets of data, then the optimum plan is to index your data set immediately, and take advantage of that index (or those indexes) over and over on every query.

Now if the use pattern is to search all data as it comes in, one time, and then never again, then a one-time scan of the data is optimal - but if that were the case, Willie wouldn't need to retain the old data, and the data set wouldn't be growing.  This looks to me like part of an analytic tool designed to let them pull out reports of activity that mentions certain companies - possibly over long periods of time - for additional analysis.

If that is the case, then the optimal performance strategy is to build and maintain a permanent index.
I saw a presentation about Solr last night and agree that it, or something equivalent, would be the optimal solution.

If you need help with Solr, contact Lucidworks.