Hadoop Distributed File System (HDFS)

Hadoop has a set of command line commands that allows the user to look and manipulate the directories and files in HDFS. It's important that you practice and feel comfortable with these commands.
Open your browser and do a Google search for "Hadoop Command Line Shell". This should point you to the Commands Guide of Apache Hadoop and other relevant documents.
Take a look into this documentation and try to remember the commands covered in class and look at the detailed documentation. Also look at the other commandss that we did not cover and try to understand what they do.
In this class we have three HDFS labs :
  • one for loading data.
  • one for viewing the cluster contents.
  • one for getting data out of the cluster.
A. Loading Data
The process of loading data into the cluster involves some planning related with how the files will be organized in the directory structure. Remember that we use this hierarchy to select what files are included on the particular run. For example in some cases makes sense to structure the files like :


In some other cases it makes more sense to organize your data as :


where date is formatted as YYYYMMDD to facilitate the listing of the files in HDFS.

Lab Instructions :
Your first objective is to create a directory structure in HDFS that allows loading of files daily. For the sake of simplicity we have one file per day.
  1. Write a script/command to create the directory structure.
  2. Write a script to load files daily. This would be run everyday with the current date called from a crontab job. Use the files inside of the "sample_data" directory in the folder for this lab.
Ex : In January 17, 2012 it would recieve 20120117 as a parameter.

B. Viewing Data Contents
The objective of this lab is to understand the commands related to viewing the HDFS files and its contents. Again consult the "Commands Guide of Apache Hadoop" document on the web.

Lab Instructions :
Try to perform these simple steps :
  1. Look at the HDFS files of under your Hadoop user.
  2. Look at contents of some files.
Try to find what commands are available to look inside of HDFS files. Remember that in real situations the files you are looking are big and it's not practical to send all contents of the file to the screen.

C. Getting Data from The Cluster to Local Disk
Getting data out of a cluster is a necessary step in order to complete the data pipeline. The results of processing generally are stored into other applications, databases like MySQL or even into another clusters.
The most common case is to load the results of the processing into a SQL database.

Lab Instructions :
  1. Your objective is to extract the file server_20120509.log under logs/01_hdfs/day_20120509 in HDFS and load it into program.
Note : Managing MySQL server is outside of the scope of the class. We are not covering the details of MySQL dump command. The solution includes commented lines of a sample call of MySQL dump for illustration purposes.

We want to just be able to "pipe" data into another shell script.

Note : There are tools to throttle is which the data is loaded into a database. In our lab you can ignore this consideration.

When loading real data always remember this rule :
"Be careful to create any program that runs into distributed mode and reads or writes data into a non-distributed system(ex: MySQL database). A Hadoop cluster can overload any single server with read.write requests".

Solutions :
Refer to your Training Guide.

3. Map Reduce Programming

Hadoop Map-Reduce framework is written in Java, but the framework allows you to write programs in other languages.

In this class we are going to use python to develop solutions to the proposed problems.

A. Word Count

The word count problem is the most famous map reduce program. The objective is to count the frequency of words of a large text. This algorithm is the basis for building a search engine index.

Lab Instructions :

  1. Develop the word count map-reduce program to count the words on the text of the book "Less Miserable". Before you start, execute the prepare step, to load the data into HDFS. You can use other scripting languages but the solution will be provided in python.
Hint : Break the problem into 3 parts :
  a) mapper.py
  b) reducer.py
  c) driver.sh
B. Most Frequent Words Count
Use the output from the previous program to list the most frequent words with their counts.

Lab Instructions :
  1. Use the same strategy of breaking the programs in three parts. Copy the files from the previous exercise and use them as a starting point.
  2. Load the data from the output by using third filter to load the files : 02_mapreduce/result_wc/part-*'.
Remember that the output of map-reduce jobs produce the part-XXXXXX files in HDFS.

Solutions :
Refer to your Training Guide.

4. Data Access Tool - Hive
Hive command line tool allows you to submit jobs via bash scripts. We will be using a technique that relies on Unix HERE files.
You can see an example below :
hive << EOF
SET mapred.job.name=Job: Your job name;
your query

Identifying properties of a data set :
We have a table 'user_data' that contains the following fields :
data_date : string
   user_id : string
   properties : string
The properties field is formatted as a series of attribute=value pairs.
Ex : Age=21;state=CA;gender=M;

Lab Instructions :
1. Write a program that produces a list of properties with minimum value(min_value), largest value(max_value) and number of unique values. Before you start, execute the prepare step to load the data into HDFS.

2. Generate a count per state.
Now that extracted the properties, calculate the number of records per state.

Lab Instructions :

1. Write a program that lists the states and their count from the data input.

Solutions :
Refer to your Training Guide.

5. Data Access Tool - Pig
Pig command line tool like the Hive allows you to submit jobs via bash scripts. We will be using the same technique that relies on Unix HERE files.

A. Simple Logs
We have a set of log files and need to create a job that runs every hour and perform some calculations. The log files are delimited by a 'tab' character and have the following fields :
  • site
  • hour_of_day
  • page_views
  • data_date
The log files are located on the prepare folder. Load them in HDFS at data/pig/simple_logs folder and use them as the input.
Important : In order to load tab delimited files use pigStorage('\u0001').

Lab Instructions:
Create a program to :
1. Calculate the total views per hour per day.
2. Calculate the total views per day.
3. Calculate the total counts of each hour across all days.

In Pig, you can easily extend the programs. You can starts with item number 1 and extend the same program reusing the variables built on the previous step(s).

B. NASA Logs
In this lab you have a web server log from NASA and your goal is to parse and load the data into pig. Each line of the log look like this :

n1043350.ksc.nasa.gov -- [30/Aug/1995:10:04:07 -0400] "GET /shuttle/missions/sts-34/mission-sts-34.html HTTP/1.0" 200 5630 - - [30/Aug/1995:10:04:11 -0400] "GET /images/USA-logosmall.gif
HTTP/1.0" 200 234

Here is the same line with the part descriptions :

Source Adr :
Dummy1:  -
Dummy2:  -
Date and Time: [30/Aug/1995:10:04:11 -0400]
Http command: "GET /images/USA-logosmall.gif HTTP/1.0"
n1 200
n2 234

Lab Instructions :
The goal of this lab is to create a Pig program to extract the Http command field and perform the following operations:
1. Capture the first "level" of the command.
2. Filter the first level record that are images.
3. Report on the frequency of each level.


Please sign in to leave a comment.
Powered by Zendesk