Data Transfer from Hadoop to MongoDB

Hello All,

I have data sitting in HDFS in the form of Hive tables and I need to load that on a daily basis (delta load) to MongoDB.

What languages, setup, jobs, or techniques can I use to achieve this reliably? Any help is highly appreciated.

btan (Exec Consultant) Commented:
It should be some form of what is stated here:
MapReduce jobs are used to extract, transform and load data from one store to another. Hadoop can act as a complex ETL mechanism to migrate data in various forms via one or more MapReduce jobs that pull the data from one store, apply multiple transformations (new data layouts or other aggregation) and load the data into another store. This approach can be used to move data from or to MongoDB, depending on the desired result.
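
To make the extract/transform/load steps above concrete for a daily delta load, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the column names (id, name, updated), the helper names, and the idea of tracking a "last loaded" date are hypothetical and not part of the connector. The upsert is only described as a dict here; with pymongo it would roughly correspond to UpdateOne(filter, update, upsert=True) passed to collection.bulk_write().

```python
from datetime import date

def extract_delta(rows, last_loaded):
    """Extract: keep only rows changed after the last successful load."""
    return [r for r in rows if r["updated"] > last_loaded]

def transform(row):
    """Transform: reshape a flat Hive-style row into a MongoDB-style document."""
    return {
        "_id": row["id"],  # hypothetical key column
        "name": row["name"].strip(),
        "updated": row["updated"].isoformat(),
    }

def to_upsert(doc):
    """Load: describe an idempotent upsert keyed on _id.

    _id goes in the filter only, so re-running the same day's load
    overwrites the other fields instead of creating duplicates.
    """
    fields = {k: v for k, v in doc.items() if k != "_id"}
    return {"filter": {"_id": doc["_id"]}, "update": {"$set": fields}, "upsert": True}

def build_delta_ops(rows, last_loaded):
    """Chain the three steps for one daily run."""
    return [to_upsert(transform(r)) for r in extract_delta(rows, last_loaded)]
```

Keying the upsert on a stable _id is what makes a re-run of the daily job safe, which matters more for reliability than which tool moves the data.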

Here is a use case that goes via Hive queries or Spark:
I started with a simple example of taking 1 minute time series intervals of stock prices with the opening (first) price, high (max), low (min), and closing (last) price of each time interval and turning them into 5 minute intervals (called OHLC bars). The 1-minute data is stored in MongoDB and is then processed in Hive or Spark via the MongoDB Hadoop Connector, which allows MongoDB to be an input or output to/from Hadoop and Spark.
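
The roll-up described in that quote (1-minute bars into 5-minute OHLC bars) can be sketched in plain Python as below. This is only an illustration of the aggregation logic, not the Hive or Spark code from the quoted article; the field names (ts, open, high, low, close) and the function name are assumptions.

```python
from datetime import datetime

def five_minute_bars(bars):
    """Roll 1-minute OHLC bars up into 5-minute OHLC bars."""
    out = {}
    for b in sorted(bars, key=lambda b: b["ts"]):
        ts = b["ts"]
        # Floor the timestamp to the start of its 5-minute bucket.
        bucket = ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)
        agg = out.get(bucket)
        if agg is None:
            # First bar in the bucket supplies the opening price.
            out[bucket] = {"ts": bucket, "open": b["open"], "high": b["high"],
                           "low": b["low"], "close": b["close"]}
        else:
            agg["high"] = max(agg["high"], b["high"])
            agg["low"] = min(agg["low"], b["low"])
            agg["close"] = b["close"]  # last bar seen supplies the close
    return [out[k] for k in sorted(out)]
```

Each resulting dict is already document-shaped, so it could be written to MongoDB as-is.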
Well, MongoDB isn't SQL. It has no tables and no real schema either. So the good news is that you can easily put anything you want into it, because MongoDB doesn't care about such things. Dump all the Hive tables as text and stick in the date as one of the values.
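
A rough sketch of that "dump as text and add the date" idea, assuming the table is stored as a default Hive text file (fields separated by Ctrl-A, NULL written as \N). The column list, the load_date field name and the function name are made up for the example, and all values stay strings unless you cast them yourself.

```python
from datetime import date

HIVE_DELIM = "\x01"   # Hive's default field delimiter for text tables (Ctrl-A)
HIVE_NULL = r"\N"     # Hive's default text representation of NULL

def rows_to_docs(lines, columns, load_date):
    """Turn lines of a Hive text dump into dicts ready to insert into MongoDB."""
    docs = []
    for line in lines:
        values = line.rstrip("\n").split(HIVE_DELIM)
        doc = {c: (None if v == HIVE_NULL else v) for c, v in zip(columns, values)}
        doc["load_date"] = load_date.isoformat()  # tag every document with the load date
        docs.append(doc)
    return docs
```

The resulting list could then be handed to something like pymongo's insert_many, and the load_date value makes it easy to find or remove one day's batch later.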

Or you could just use this MongoDB plugin and let Hadoop do this for you.
Ravinosql (Author) Commented:
Thank you for the response!! So the data can be dumped from Hive tables to MongoDB without staging in between, correct? Do you know if any jobs need to be scheduled for this to happen on a daily basis? Thanks!
Ravinosql (Author) Commented:
Hello All... I proposed the MongoDB connector option, but it was not well received because there are concerns about the overhead the connector would put on MongoDB's performance.

Is anyone familiar with a solution using the Hive Metastore, Spark or Spring Batch for this use case? Please help. Thanks!!
btan (Exec Consultant) Commented:
The connector is the more commonly used option. Some people have shared that they try to fetch the data from Hive without running HiveServer (which exposes a Thrift service), so you can probably save some overhead that way. MongoDB not being a standard relational database does limit the tested alternatives for this kind of transfer. The connector still fares better; Spark can be tried, but it is new and not as often recommended as the first option.

Do see this:
I saw the appeal of Spark from my first introduction. It was pretty easy to use. It is also especially nice in that it has operations that run on all elements in a list or a matrix of data. .....
The downside is that it certainly is new, and I seemed to run into a non-trivial bug (SPARK-5361, now fixed in 1.2.2+) that prevented me from writing from pyspark to a Hadoop file (writing to Hadoop & MongoDB in Java & Scala should work). Also, I found it hard to visualize the data as I was manipulating it. It reminded me of my college days being frustrated debugging matrices.
Probably more important is that, once you analyze data in Hadoop, the work of reporting on and operationalizing the results often still needs to be done. The MongoDB Hadoop Connector makes it easy to process results and put them into MongoDB, for blazing fast reporting and querying with all the benefits of an operational database.....
Overall, the benefit of the MongoDB Hadoop Connector is combining the benefits of highly parallel analysis in Hadoop with low-latency, rich querying for operational purposes from MongoDB, allowing technology teams to focus on data analysis rather than integration.
Question has a verified solution.
