I recently read “Hadoop – Beginner’s Guide” by Garry Turkington, published by Packt Publishing. This is my review of the book.
As the title says, this book is intended to provide the basic knowledge required to get started with Hadoop. If you are already using Hadoop and want a more in-depth treatment, you would be better served by other books, such as a definitive guide or a cookbook. In addition to being an introduction to Hadoop, this book also surveys the current Hadoop ecosystem, including Amazon Elastic MapReduce, Hive, Sqoop and Flume. The book is addressed to Java programmers, but a few examples are written in Ruby, so it’s a plus if you know Ruby too.
Let’s see an overview of the chapters:
Chapter 1: The first chapter gives an overview of what Big Data is and why it came about. After a discussion of scaling up vs scaling out and their advantages and disadvantages, the chapter finishes with a short history of Hadoop and Elastic MapReduce (EMR).
Chapter 2: This chapter is a get-up-and-running walkthrough, in which the reader is guided through executing the bundled Word Count program (the Hello World of MapReduce) on a local Hadoop cluster. Then, the same code is executed on Elastic MapReduce.
Chapter 3: After the introductory chapters, the author explains MapReduce (MR) and its relation to Hadoop and HDFS. The reader is then walked through writing the Word Count program herself and, again, executing it both locally and on EMR.
Chapter 4: Here the reader sees slightly more advanced topics, such as chaining MR jobs and using a distributed cache to make look-up data available throughout the Hadoop cluster. There is also an introduction to the Streaming API, which allows easy development of MR jobs in scripting languages (the examples are in Ruby).
Chapter 5: This chapter goes further, covering jobs with multiple inputs, the implementation of an iterative algorithm on MapReduce, and Avro, a data persistence framework with bindings for many programming languages.
Chapter 6: The sixth chapter revolves around failure handling on a local or EMR Hadoop cluster: working out from the log files what went wrong and how to recover from it.
Chapter 7: This chapter describes how to set up a Hadoop cluster, from choosing the hardware and deciding on the number of nodes to storage type, security, and HDFS.
Chapter 8: In this chapter the reader is introduced to Hive, an abstraction layer above Hadoop. The author gives an overview of HiveQL, its SQL-like query language, and explains how the queries are dynamically translated to MR jobs that run transparently underneath.
Chapter 9: This chapter explains how Hadoop often needs to work with the relational world, drawing data from and storing data to relational DBs. It also introduces Sqoop, which facilitates these operations.
Chapter 10: In chapter 10 the reader learns about Flume, a system that aggregates and moves large volumes of log data. This chapter gives set-up instructions, describes different scenarios and discusses the cases where it is an alternative to Sqoop.
Chapter 11: The final chapter of the book essentially points the reader to MR-related technologies that she might be interested in.
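For readers who have never seen it, the Word Count program from chapters 2 and 3 can be sketched in a few lines. The following is a minimal illustration in plain Python, not the book's Java code and not the Hadoop API: the function names map_phase, shuffle and reduce_phase are made up for this illustration, and everything runs in a single process, whereas Hadoop distributes the map and reduce steps across the cluster and performs the grouping step itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key (in Hadoop, the framework does this)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three steps one after the other, in a single process."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
```

Calling word_count on the two lines "the quick brown fox" and "the lazy dog" counts "the" twice and every other word once.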
I think that the key selling point of “Hadoop – Beginner’s Guide” is that it covers several technologies peripheral to Hadoop. A typical reader who wants to dive into Hadoop is most probably also interested in Elastic MapReduce, Hive, Sqoop and Flume. So, this book is a bargain in that it combines two or more introductory books. I also appreciated the fact that the author takes care to run most examples both locally and on EMR, pointing out the differences where there are any. Finally, I found the examples sufficiently explanatory, guiding the reader through set-up instructions and programming without leaving unclear points.
The biggest disadvantage of the book is the number of errors, with most examples containing at least one. For someone who wants to delve into a new and complicated technology such as MapReduce, every bit of concentration that goes into finding out why the code doesn’t run is wasted. I would also like to point out a problem with the listings of tab-separated files (there are more than 10 of them), where the tabs are simply not printed, making the relevant examples unnecessarily difficult to understand.
The conclusion I drew is that this is a well-written beginner’s book on Hadoop, with the errata being its main problem. I would buy it if I needed a good Hadoop book tomorrow, but I would wait for the second edition otherwise. For those who have bought or intend to buy the book’s current edition, you can get the latest version of the example code and see the reported errata at the book’s official page.