Step-wise installation guide for a Hadoop single-node cluster:
1. Install Linux (Ubuntu)
2. Open a terminal and perform the following steps/commands:
a. sudo su
b. Type your password
c. apt-get update ------ updates the package lists
d. apt-get install openjdk-6-jdk ------ installs Java (OpenJDK 6)
e. apt-get install openssh-server ------ installs the SSH server
f. apt-get install eclipse ------ installs Eclipse
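Before moving on, it can be worth confirming that Java and the SSH server installed cleanly; a quick check (exact version strings will vary with the packages on your system):
$ java -version        # should report an OpenJDK 1.6.x runtime
$ service ssh status   # should show the ssh daemon as running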
3. Add a dedicated hadoop group and user using the following commands:
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop xxuser
4. Configuring SSH:
We need to configure SSH access to localhost for the user created in the above step.
a. Generate SSH key:
$ su - xxuser
$ ssh-keygen -t rsa -P ""
The -P "" option creates the key with an empty passphrase; you can set a passphrase instead, but an empty one lets Hadoop log in over SSH without prompting.
When you are prompted for the file in which to save the key, enter the location where you want to save it:
eg: /home/xxuser/.ssh/key
b. Use the following command to enable SSH access to your local machine with the newly created key.
$ cat $HOME/.ssh/key.pub >> $HOME/.ssh/authorized_keys
c. Issue the command:
$ ssh localhost
When you receive the prompt: "Are you sure you want to continue connecting (yes/no)?" enter yes.
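If ssh localhost still asks for a password, the most common cause is file permissions on the .ssh directory and the authorized_keys file. A minimal fix, assuming the key was stored under ~/.ssh as above:
$ chmod 700 $HOME/.ssh
$ chmod 600 $HOME/.ssh/authorized_keys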
5. Download a stable version of Hadoop - hadoop-*.*.*.tar.gz - from the Apache mirror sites.
refer to http://www.apache.org/dyn/closer.cgi/hadoop/common/
6. Copy it to a folder --- I copied it into a folder created in Home, /work/ -- pwd inside work gives /home/xxx/work --
and then untar it; this creates a folder like hadoop-*.*.*.
Running ls inside the work folder should show the hadoop-*.*.* folder.
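As a sketch of this step, assuming the tarball was saved into /home/xxx/work (replace the *.*.* placeholders with the release you actually downloaded):
$ cd /home/xxx/work
$ tar -xzf hadoop-*.*.*.tar.gz   # extracts the archive into hadoop-*.*.*
$ ls                             # should now list the hadoop-*.*.* folder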
7. In the ~/.bashrc file add the following settings:
export HADOOP_HOME="/home/xxx/work/hadoop-*.*.*" ---- path of the hadoop installation
export JAVA_HOME="/usr/lib/jvm/java-6-openjdk" ---- java path
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin ---- lets you run the hadoop and java commands from anywhere
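These variables only take effect in a new shell; to apply them to the current session and verify them (hadoop version becomes available once $HADOOP_HOME/bin is on the PATH):
$ source ~/.bashrc
$ echo $HADOOP_HOME   # should print /home/xxx/work/hadoop-*.*.*
$ hadoop version      # should print the installed Hadoop version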
8. Inside the hadoop folder you should see another folder called conf; navigate into it.
pwd here should give - /home/xxx/work/hadoop-*.*.*/conf
Inside this folder there are many files that define configuration settings for Hadoop.
You need to edit: core-site.xml, hdfs-site.xml, mapred-site.xml, masters and slaves.
localhost will be used in these settings. The machine's hostname can be obtained by typing 'hostname' in the terminal.
a. In core-site.xml add the following code:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/xxx/</value>
  </property>
</configuration>
The first property, fs.default.name, is the URI of the NameNode: the hostname, a colon (:), and the port 9000.
The second property, hadoop.tmp.dir, is the base for Hadoop's temporary directories.
b. In hdfs-site.xml add the following code:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/xxx/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/xxx/dfs/data</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
</configuration>
property 1 - dfs.replication - the replication factor. I kept it as 1 for a single node; set it to whatever replication factor you need.
property 2 - dfs.name.dir - path on the local filesystem where the NameNode persistently stores the namespace and transaction logs.
property 3 - dfs.data.dir - comma-separated list of paths on the local filesystem of a DataNode where it stores its blocks.
property 4 - dfs.datanode.max.xcievers - an upper bound on the number of files a DataNode can serve at any one time.
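Depending on the Hadoop version, the name and data directories may or may not be created automatically; creating them up front with the right owner avoids permission errors later. A sketch, assuming the paths above and the xxuser account (run as root, or prefix with sudo):
$ mkdir -p /home/xxx/dfs/name /home/xxx/dfs/data
$ chown -R xxuser:hadoop /home/xxx/dfs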
c. In mapred-site.xml add the following code:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/home/xxx/mapred/local</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/hadoop/mapred/system</value>
  </property>
  <property>
    <name>hadoop.job.history.user.location</name>
    <value>/history</value>
  </property>
</configuration>
property1 - mapred.job.tracker - host (or IP) and port of the JobTracker.
property2 - mapred.local.dir - comma-separated list of paths on the local filesystem where temporary MapReduce data is written.
property3 - mapred.system.dir - path on HDFS where the MapReduce framework stores system files.
property4 - hadoop.job.history.user.location - location where the job history files for user jobs are kept.
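mapred.system.dir is on HDFS, so the JobTracker creates it itself, but mapred.local.dir is on the local disk; as with the DFS directories, it can help to create it in advance (path as assumed above, run as root or with sudo):
$ mkdir -p /home/xxx/mapred/local
$ chown -R xxuser:hadoop /home/xxx/mapred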
d. In the masters file - write the hostname.
e. In the slaves file - write the hostname.
f. In hadoop-env.sh you need to specify the Java implementation, so add "export JAVA_HOME=/usr/lib/jvm/java-6-openjdk" just under the comment about the java implementation to use.
9. Before starting the HDFS daemons for the first time, format the NameNode:
.../hadoop/bin/hadoop namenode -format
10. The start-all.sh command will start all the Hadoop daemons (HDFS and MapReduce). You may be prompted for a password or an SSH confirmation. Once this is done, type jps on the command line. You should see the following (nnnn is the process id):
nnnn JobTracker
nnnn TaskTracker
nnnn DataNode
nnnn NameNode
nnnn SecondaryNameNode
nnnn Jps
If any of these is missing, then there was a problem starting that particular daemon.
The stop-all.sh command will stop all the daemons.
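If a daemon is missing from the jps output, its log file is the place to look. The logs live in the logs directory of the Hadoop installation; the exact file names depend on the user and hostname, for example:
$ ls $HADOOP_HOME/logs/
$ tail -50 $HADOOP_HOME/logs/hadoop-xxuser-datanode-*.log   # example: inspect the DataNode log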
11. start-dfs.sh -- this will start the HDFS daemons only. The MapReduce-related daemons will still be inactive.
jps after this command will show:
nnnn SecondaryNameNode
nnnn DataNode
nnnn NameNode
nnnn Jps
stop-dfs.sh is the command to stop them.
12. start-mapred.sh -- this will start the JobTracker and TaskTracker daemons. Always use this command after the HDFS cluster has been started.
stop-mapred.sh -- stops the JobTracker and TaskTracker daemons.
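With all the daemons running, a quick end-to-end check is to run one of the example jobs shipped with Hadoop (the examples jar name below is a placeholder; match it to the hadoop-*.*.* release you installed):
$ hadoop jar $HADOOP_HOME/hadoop-examples-*.*.*.jar pi 2 10   # small pi-estimation job: 2 maps, 10 samples each
With the default settings you can also browse the NameNode web UI at http://localhost:50070 and the JobTracker UI at http://localhost:50030.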