Before you begin, we recommend that you download and explore a few pieces of software and utilities, all of which are discussed in the prerequisites chapter. For example, the Hadoop sandbox provides us with a working cluster. Utilities like PuTTY allow us to interact with the cluster in order to run jobs, perform file system operations, and demonstrate the capabilities of Hadoop. Linux is the operating system on which Hadoop runs.
Once you have all the tools you need to get started, you will learn about the history of Hadoop: how it began as an attempt to create a better open source search engine and how it grew into the powerful data storage and processing engine it is today.
We’ll explore how Hadoop might fit within a large-scale enterprise, evaluating the strengths and weaknesses of its implementation. We’ll also take a tour of the Hadoop Sandbox using the Ambari graphical user interface.
A core component of Hadoop is the Hadoop Distributed File System (HDFS). We’ll talk about how it differs from an ordinary file system and how it supports the Hadoop distributed architecture. We’ll take a look at the various nodes of HDFS and their respective roles. We’ll end with a tour of HDFS within Linux.
We’ll then learn about ETL (extract, transform, load) and MapReduce. ETL is what connects Hadoop to the outside world. Sqoop is an ETL tool in the Hadoop ecosystem for exchanging data between Hadoop and an external database server. We’ll go over how to use Sqoop to pull data from a Postgres database. We’ll demonstrate how to build and run a basic application in the Java language and follow it up with information on a very important component of Hadoop: MapReduce.