Problem Description:
This article shows how to write an Oozie distcp action workflow to copy HDFS data between two HDFS HA clusters without updating the HA configuration of either cluster on the server side.
Background:
To run the Hadoop distcp command across clusters with NameNode High Availability (HA) enabled, the following is normally required:
- Adding the nameservice information of both the source and the destination cluster to the server-side configuration
- Restarting the services, because the YARN ResourceManager renews delegation tokens for applications and needs this configuration to do so
Solution:
To avoid server-side configuration changes, a MapReduce job can send the required configuration to the ResourceManager at runtime through the "mapreduce.job.send-token-conf" property, and the ResourceManager then uses that configuration to renew the delegation tokens. We can leverage the same mechanism via the Oozie distcp action.
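For illustration, here is a minimal sketch of how such a property can be passed to the launched MapReduce job from the <configuration> block of an Oozie action; the filter value shown is the same one used as property13/value13 in the job.properties example below.
    <configuration>
        <property>
            <!-- Filter of client-side configuration keys that the ResourceManager
                 should receive and use when renewing delegation tokens -->
            <name>mapreduce.job.send-token-conf</name>
            <value>mapreduce.jobhistory.principal|^dfs.nameservices|^dfs.namenode.rpc-address.*|^dfs.ha.namenodes.*|^dfs.client.failover.proxy.provider.*|dfs.namenode.kerberos.principal</value>
        </property>
    </configuration>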
Below is an Oozie distcp action template that runs a basic distcp action in a Kerberized environment and is parameterized at runtime, so end users can run it on their own schedule. It consists of two files:
- job.properties
- workflow.xml (a minimal sketch is shown after the job.properties walkthrough below)
Steps:
- Update the job.properties file.
- Reflect the NameNode HA and YARN cluster-id information from both clusters. The following files from the two clusters are needed:
/etc/hadoop/conf/core-site.xml
/etc/hadoop/conf/yarn-site.xml
/etc/hadoop/conf/hdfs-site.xml
- The following commands help you pull this information out of those files:
grep -A1 'fs.defaultFS' /etc/hadoop/conf/*
grep -A1 'yarn.resourcemanager.cluster-id' /etc/hadoop/conf/*
grep -A1 'dfs.nameservices' /etc/hadoop/conf/*
grep -A1 'dfs.internal.nameservices' /etc/hadoop/conf/*
grep -A1 'dfs.client.failover.proxy.provider' /etc/hadoop/conf/*
grep -A1 'dfs.ha.namenodes' /etc/hadoop/conf/*
grep -A1 'dfs.namenode.rpc-address' /etc/hadoop/conf/*
Example:
# grep -A1 'fs.defaultFS' /etc/hadoop/conf/*
nameNode=hdfs://shmpsehdpenv1
# Cluster where the job will be initiated
# grep -A1 'yarn.resourcemanager.cluster-id' /etc/hadoop/conf/*
jobTracker=yarn-cluster
# YARN QUEUE
property01=mapred.job.queue.name
value01=default
# 12 cluster-specific properties (property02 through property13)
# Comma-separated list of file system names from both clusters
# grep -A1 'fs.defaultFS' /etc/hadoop/conf/*
property02=mapreduce.job.hdfs-servers
value02=hdfs://shmpsehdpenv1,hdfs://smayanihdp
# Comma separated list of nameservices
# grep -A1 'dfs.nameservices' /etc/hadoop/conf/*
property03=dfs.nameservices
value03=shmpsehdpenv1,smayanihdp
# Only include the cluster where you intend to run the job.
# grep -A1 'dfs.internal.nameservices' /etc/hadoop/conf/*
property04=dfs.internal.nameservices
value04=shmpsehdpenv1
# grep -A1 'dfs.client.failover.proxy.provider' /etc/hadoop/conf/*
property05=dfs.client.failover.proxy.provider.shmpsehdpenv1
value05=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
# grep -A1 'dfs.ha.namenodes' /etc/hadoop/conf/*
property06=dfs.ha.namenodes.shmpsehdpenv1
value06=nn1,nn2
# grep -A1 'dfs.namenode.rpc-address' /etc/hadoop/conf/*
property07=dfs.namenode.rpc-address.shmpsehdpenv1.nn1
value07=shmpsehdpenv1n2.hortonworks.com:8020
property08=dfs.namenode.rpc-address.shmpsehdpenv1.nn2
value08=shmpsehdpenv1n3.hortonworks.com:8020
# grep -A1 'dfs.client.failover.proxy.provider' /etc/hadoop/conf/*
property09=dfs.client.failover.proxy.provider.smayanihdp
value09=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
# grep -A1 'dfs.ha.namenodes' /etc/hadoop/conf/*
property10=dfs.ha.namenodes.smayanihdp
value10=nn1,nn2
# grep -A1 'dfs.namenode.rpc-address' /etc/hadoop/conf/*
property11=dfs.namenode.rpc-address.smayanihdp.nn1
value11=hdpnode2.openstacklocal:8020
property12=dfs.namenode.rpc-address.smayanihdp.nn2
value12=hdpnode3.openstacklocal:8020
# RM uses these configurations to renew tokens
property13=mapreduce.job.send-token-conf
value13=mapreduce.jobhistory.principal|^dfs.nameservices|^dfs.namenode.rpc-address.*|^dfs.ha.namenodes.*|^dfs.client.failover.proxy.provider.*|dfs.namenode.kerberos.principal
- Update the workflow directory settings in job.properties:
examplesRoot=oozie-distcp-workflow
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/workflow.xml
- Update the HDFS source and destination paths:
source=hdfs://smayanihdp/user/user01t01/passwd
destination=hdfs://shmpsehdpenv1/user/user01t01/passwd
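For reference, here is a minimal sketch of the workflow.xml referenced above. It assumes the parameter names used in the job.properties example (nameNode, jobTracker, propertyNN/valueNN, source, destination); your actual template may differ, for example in schema versions or in the number of property/value pairs it passes through.
    <workflow-app xmlns="uri:oozie:workflow:0.4" name="oozie-distcp-workflow">
        <start to="distcp-node"/>
        <action name="distcp-node">
            <distcp xmlns="uri:oozie:distcp-action:0.2">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <!-- Pass the runtime parameters from job.properties to the distcp MapReduce job -->
                    <property>
                        <name>${property01}</name>
                        <value>${value01}</value>
                    </property>
                    <property>
                        <name>${property02}</name>
                        <value>${value02}</value>
                    </property>
                    <!-- repeat for property03/value03 through property13/value13 -->
                </configuration>
                <!-- Source and destination HDFS paths -->
                <arg>${source}</arg>
                <arg>${destination}</arg>
            </distcp>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <fail name="fail">
            <message>DistCp failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
        </fail>
        <end name="end"/>
    </workflow-app>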
- Execute the Oozie workflow:
kinit user01t01
hdfs dfs -rm -r -skipTrash oozie-distcp-workflow
hdfs dfs -put oozie-distcp-workflow
source /usr/hdp/current/oozie-client/conf/oozie-env.sh ;
/usr/hdp/current/oozie-client/bin/oozie job -config job.properties -run
/usr/hdp/current/oozie-client/bin/oozie job -info 0000001-180227164558355-oozie-oozi-W -verbose