failed copying hadoop conf files from S3 #5
Odd, let me take a look. Sriram
@piaozhexiu figured this out. It turns out that if you have *-site.xml's in your HADOOP_HOME/conf/, Hadoop thinks that the output dir is actually an HDFS path (due to possible settings in the *-site.xml's). So this is actually a bug, and I am marking this issue as such. We need to use "file://" paths explicitly for copies. You can wait for a patch, or you can move the *-site.xml's to another location. In general, since Genie can talk to multiple clusters, it's probably not a good idea to have the *-site.xml's inside HADOOP_HOME/conf anyway, as we don't want any conflicts. I will add this to the Wiki. BTW, you also need 3 slashes (not two) in your file paths. I have updated populateSampleConfigs.py to make this more obvious. Sriram
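The three-slash point can be checked with plain shell string handling. This is just an illustrative sketch (the URI below is made up, not from the thread): a local-file URI is the "file:" scheme, an empty authority ("//"), and then an absolute path, so a correct URI has three consecutive slashes.

```shell
# "file://" + "/home/..." => file:///home/... (three slashes).
# With only two slashes, the part after "file://" is parsed as an
# authority/host, not as the start of the path.
uri="file:///home/hadoop/conf/core-site.xml"   # illustrative path
path="${uri#file://}"                          # strip scheme + empty authority
echo "$path"
case "$path" in
  /*) echo "absolute path: OK" ;;
  *)  echo "only two slashes: path would be treated as host-relative" ;;
esac
```

With only two slashes, `file://Library/...` makes "Library" look like a host, which matches the failures in the cmd.log later in this thread.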
I copied the files to a new location, updated my cluster config, and re-ran the test; it still failed. I then removed the *-site.xml files from HADOOP_HOME/conf and re-ran the test. This time it succeeded. Thanks for your help with this.
Cool! And the launcher script has now been fixed to use file://, which will avoid this problem in the future (i.e., you shouldn't have to remove the *-site.xml files). Thanks for bringing this to our notice.
One last thing: I am getting a 404 when I try to view http://localhost:8080/genie-jobs/0c2b15ca-d209-4e84-8aa7-fac377bd1a21. Did I miss something in the config to view the output files? --Jimmy
Did you enable Tomcat directory browsing? Also, check out the Genie properties and search for "dir". I suspect that you may not have a symlink to your working directory from inside Tomcat, as the documentation suggests.
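A minimal sketch of what such a symlink might look like. All paths here are hypothetical stand-ins, not taken from the Genie docs; the idea is simply to expose the Genie jobs directory under Tomcat's document root so job output becomes browsable (with directory listings enabled in Tomcat):

```shell
# Hypothetical paths: substitute your real Tomcat webapps root and
# Genie working directory.
TOMCAT_ROOT=/tmp/demo-tomcat/webapps/ROOT
GENIE_JOBS=/tmp/demo-genie-jobs
mkdir -p "$TOMCAT_ROOT" "$GENIE_JOBS"

# Link so that http://localhost:8080/genie-jobs/<job-id> would serve
# files out of the Genie working directory.
ln -sfn "$GENIE_JOBS" "$TOMCAT_ROOT/genie-jobs"
readlink "$TOMCAT_ROOT/genie-jobs"
```

Note that Tomcat's DefaultServlet must also have directory listings enabled, and following symlinks may need to be allowed in the context configuration, for the 404 to go away.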
I just did a fresh install according to the wiki instructions and I'm getting the same error: "Will retry in 5 seconds to ensure that this is not a transient error".
@bridiver, a couple of questions/comments:
Running the latest from GitHub; last commit Nov 18.
How about the two properties referring to the working directory referenced here: https://github.com/Netflix/genie/wiki/Customization-and-Options#required
Does this directory exist: /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/? You may also want to check whether you can run this from the command line (which is what the Genie launcher runs):
We haven't tried Genie with the MapR distribution, but I can't see why it shouldn't work. If it still doesn't work, please share your cmd.log and the genie.properties.
Those two properties are correct in /home/hadoop/genie/genie-web/src/main/resources/genie.properties.

hadoop@ip-xx-xx-xxx-xxx:~/.versions/1.0.3/conf$ hadoop fs -cp file:///home/hadoop/.versions/1.0.3/conf/core-site.xml file:///home/hadoop/.versions/1.0.3/conf/mapred-site.xml file:///home/hadoop/.versions/1.0.3/conf/hdfs-site.xml file:///mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/
cp: When copying multiple files, destination file:///mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/ should be a directory.
hadoop@ip-xx-xx-xxx-xxx:~/.versions/1.0.3/conf$ ls /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/
capacity-scheduler.xml  core-site.xml       hadoop-default.xml  hadoop-metrics2.properties  hadoop-policy.xml  log4j.properties       mapred-site.xml  slaves                  ssl-server.xml.example
configuration.xsl       fair-scheduler.xml  hadoop-env.sh       hadoop-metrics.properties   hdfs-site.xml      mapred-queue-acls.xml  masters          ssl-client.xml.example  taskcontroller.cfg
hadoop@ip-xx-xx-xxx-xxx:~/.versions/1.0.3/conf$ ls /home/hadoop/.versions/1.0.3/conf/
capacity-scheduler.xml  core-site.xml       hadoop-default.xml  hadoop-metrics2.properties  hadoop-policy.xml  log4j.properties       mapred-site.xml  slaves                  ssl-server.xml.example
configuration.xsl       fair-scheduler.xml  hadoop-env.sh       hadoop-metrics.properties   hdfs-site.xml      mapred-queue-acls.xml  masters          ssl-client.xml.example  taskcontroller.cfg
If I use hadoop fs to copy the files one at a time it works. It's only when I try to copy more than one that it fails. Could there be some reason that it's not recognizing the path as a directory?

$ hadoop fs -cp file:///home/hadoop/.versions/1.0.3/conf/core-site.xml file:///mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/
$ hadoop fs -ls file:///mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/
Found 18 items
-rw-r--r--  1 hadoop hadoop   7457 2013-12-06 19:24 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/capacity-scheduler.xml
-rw-r--r--  1 hadoop hadoop    535 2013-12-06 19:24 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/configuration.xsl
-rwxrwxrwx  1 hadoop hadoop   2063 2013-12-07 21:04 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/core-site.xml
-rw-r--r--  1 hadoop hadoop    327 2013-12-06 19:24 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/fair-scheduler.xml
-rw-r--r--  1 hadoop hadoop  38546 2013-12-06 19:24 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/hadoop-default.xml
-rw-r--r--  1 hadoop hadoop    311 2013-12-06 19:24 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/hadoop-env.sh
-rw-r--r--  1 hadoop hadoop   1654 2013-12-06 19:24 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/hadoop-metrics.properties
-rw-r--r--  1 hadoop hadoop   1512 2013-12-06 19:24 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/hadoop-metrics2.properties
-rw-r--r--  1 hadoop hadoop   4644 2013-12-06 19:24 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/hadoop-policy.xml
-rwxrwxrwx  1 hadoop hadoop   1416 2013-12-07 21:05 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/hdfs-site.xml
-rw-r--r--  1 hadoop hadoop   4649 2013-12-06 19:24 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/log4j.properties
-rw-r--r--  1 hadoop hadoop   2033 2013-12-06 19:24 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/mapred-queue-acls.xml
-rwxrwxrwx  1 hadoop hadoop   2233 2013-12-07 21:04 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/mapred-site.xml
-rw-r--r--  1 hadoop hadoop     10 2013-12-06 19:24 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/masters
-rw-r--r--  1 hadoop hadoop     10 2013-12-06 19:24 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/slaves
-rw-r--r--  1 hadoop hadoop   1243 2013-12-06 19:24 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/ssl-client.xml.example
-rw-r--r--  1 hadoop hadoop   1195 2013-12-06 19:24 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/ssl-server.xml.example
-rw-r--r--  1 hadoop hadoop    382 2013-12-06 19:24 /mnt/tomcat/genie-jobs/58835829-b63f-4761-b7bc-7ffc5120d97f/conf/taskcontroller.cfg
I wonder if this is a problem specific to the MapR distribution? -copyFromLocal and -put both seem to work with multiple files, but only from the command line. If I change joblauncher.sh to use them, it fails with an error casting RawLocal to Local.
If you can verify this independently outside of Genie, it seems like a bug in the MapR distribution. EMR and Apache Hadoop can copy multiple files to a directory using hadoop fs -cp. See http://hadoop.apache.org/docs/r1.0.4/file_system_shell.html#cp.
If you don't mind trying Genie with the default EMR distribution, it may be easiest to use the bootstrap action. Sriram. PS: Neither -put nor -copyFromLocal will work, because both of them fail if anything is being overwritten (which happens in a few places in the launcher script).
Maybe the best thing is just to put the files in S3; that is the preferred method anyway, right? Will we run into this issue anywhere else? I will try it on standard EMR, but the MapR performance is substantially better, so we will need to find some way to work around this.
I am not sure that putting them on S3 will work, since it is still going to use the "hadoop fs -cp" command, but it is worth a shot. If you do put the *-site.xml's on S3 (which we do ourselves at Netflix), be mindful of the security implications: https://github.com/Netflix/genie/wiki/Customization-and-Options#cloud-security. Alternatively, if you are just playing with it right now with local files, you can just replace "hadoop fs -cp" with plain "cp" in jobLauncher.sh. However, once you get further, the "hadoop fs -cp" issue needs to be resolved, because we also use S3 to stage job dependencies, and Genie pulls them down to the local file system using the same command. Sriram. PS: If you can figure out a way to get it to work on MapR, we will gladly accept patches :)
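For purely local experimentation, the suggested swap amounts to something like the following sketch. The paths and files here are made up for illustration; in the real jobLauncher.sh, SOURCE and DESTINATION are set by Genie.

```shell
# Simulate the copy step with plain cp instead of "hadoop fs -cp".
# SOURCE/DESTINATION mirror the launcher's variable names; paths are made up.
SOURCE="/tmp/genie-demo/src/core-site.xml /tmp/genie-demo/src/mapred-site.xml"
DESTINATION="/tmp/genie-demo/conf"
mkdir -p /tmp/genie-demo/src "$DESTINATION"
touch /tmp/genie-demo/src/core-site.xml /tmp/genie-demo/src/mapred-site.xml

# Like the HDFS shell, cp requires the destination to be a directory when
# given multiple sources -- but it spawns no JVM and reads no *-site.xml,
# which sidesteps both problems discussed in this thread.
cp $SOURCE "$DESTINATION"/
ls "$DESTINATION"
```

This only works while everything is on the local file system; as noted above, S3-staged dependencies still go through "hadoop fs -cp".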
Good point. The MapR documentation appears to be inconsistent: "-cp: Copy files that match the file pattern to a destination. When copying multiple files, the destination must be a directory." The command syntax only shows one source, but the description claims multiple files will work, and the command accepts multiple files. It does appear to be a bug with MapR, and the only way I can see around it is to enumerate the files and copy them individually. I can fork the repo and make the changes, but can you give me an idea of where else I will have to deal with this? Thanks.
This works:

diff --git a/genie-web/conf/system/apps/genie/bin/joblauncher.sh b/genie-web/conf/system/apps/genie/bin/joblauncher.sh
index f899ecf..6ba336c 100755
--- a/genie-web/conf/system/apps/genie/bin/joblauncher.sh
+++ b/genie-web/conf/system/apps/genie/bin/joblauncher.sh
@@ -46,28 +46,30 @@ function copyFiles {
     # copy over the files to/from S3 - retry $NUM_RETRIES times
     i=0
     retVal=0
-    echo "Copying files $SOURCE to $DESTINATION"
     while true
     do
-        $TIMEOUT $S3CP ${SOURCE} ${DESTINATION}/
-        retVal="$?"
-        if [ "$retVal" -eq 0 ]; then
-            break
-        else
-            echo "Will retry in 5 seconds to ensure that this is not a transient error"
-            sleep 5
-            i=$(($i+1))
-        fi
-
-        # exit with error if done retrying
-        if [ "$i" -eq "$NUM_RETRIES" ]; then
-            echo "Failed to copy files from $SOURCE to $DESTINATION"
-            break
-        fi
+        # iterate through the files one at a time because MapR doesn't support multiple files in -cp
+        for FILE in $SOURCE
+        do
+            echo "Copying file $FILE to $DESTINATION"
+            $TIMEOUT $S3CP ${FILE} ${DESTINATION}/
+            retVal="$?"
+            if [ "$retVal" -ne 0 ]; then
+                echo "Will retry in 5 seconds to ensure that this is not a transient error"
+                sleep 5
+                i=$(($i+1))
+            fi
+
+            # exit with error if done retrying
+            if [ "$i" -eq "$NUM_RETRIES" ]; then
+                echo "Failed to copy files from $SOURCE to $DESTINATION"
+                break
+            fi
+        done
+
+        # return 0 or error code from s3cp
+        return $retVal
     done
-
-    # return 0 or error code from s3cp
-    return $retVal
 }

Although it's not great, it will retry all the files if one of them fails.
Glad that worked out. Maybe the "right" thing to do would be to file a bug report with MapR so that hadoop fs -cp works as documented.
Ok, this is better. Do I need to check to make sure it's the MapR distribution, or is it OK to iterate through the files for both if I'm going to submit a pull request? I submitted a ticket to MapR support, but who knows how long it will take to get a fix out.

diff --git a/genie-web/conf/system/apps/genie/bin/joblauncher.sh b/genie-web/conf/system/apps/genie/bin/joblauncher.sh
index f899ecf..e03ea62 100755
--- a/genie-web/conf/system/apps/genie/bin/joblauncher.sh
+++ b/genie-web/conf/system/apps/genie/bin/joblauncher.sh
@@ -33,37 +33,49 @@ function copyFiles {
     # use hadoop for s3 copying
     S3CP="$HADOOP_HOME/bin/hadoop fs $HADOOP_S3CP_OPTS -cp"
-
+
     # number of retries for s3cp
     NUM_RETRIES=5
 
     # convert CSV to be space separated
     SOURCE=`echo $SOURCE | sed -e 's/,/ /g'`
-
+
     # run hadoop fs -cp via timeout, so it doesn't hang indefinitely
     TIMEOUT="$XS_SYSTEM_HOME/timeout3 -t $HADOOP_S3CP_TIMEOUT"
-
+
     # copy over the files to/from S3 - retry $NUM_RETRIES times
     i=0
     retVal=0
-    echo "Copying files $SOURCE to $DESTINATION"
-    while true
+
+    # iterate through the files one at a time because MapR doesn't support multiple files in -cp
+    for FILE in $SOURCE
     do
-        $TIMEOUT $S3CP ${SOURCE} ${DESTINATION}/
-        retVal="$?"
-        if [ "$retVal" -eq 0 ]; then
+        echo "Copying file $FILE to $DESTINATION"
+
+        while true
+        do
+            $TIMEOUT $S3CP ${FILE} ${DESTINATION}/
+            retVal="$?"
+            if [ "$retVal" -ne 0 ]; then
+                echo "Will retry in 5 seconds to ensure that this is not a transient error"
+                sleep 5
+                i=$(($i+1))
+            else
+                # got the file, move on to the next one
+                break
+            fi
+
+            # exit with error if done retrying
+            if [ "$i" -eq "$NUM_RETRIES" ]; then
+                echo "Failed to copy files from $SOURCE to $DESTINATION"
+                break
+            fi
+        done
+
+        # couldn't copy one of the files after retrying so exit with error
+        if [ "$retVal" -ne 0 ]; then
             break
-        else
-            echo "Will retry in 5 seconds to ensure that this is not a transient error"
-            sleep 5
-            i=$(($i+1))
         fi
-
-        # exit with error if done retrying
-        if [ "$i" -eq "$NUM_RETRIES" ]; then
-            echo "Failed to copy files from $SOURCE to $DESTINATION"
-            break
-        fi
     done
 
     # return 0 or error code from s3cp
Although this should work in general, copying the files one by one will increase the startup time for jobs, since the Hadoop CLI spawns a JVM for each call. We actually started with copying files one by one, and changed it to the batch/bulk copy as a (minor) optimization. Since this appears to be a MapR bug, I would prefer not to change the current implementation; hopefully you can use this as a workaround until the bug is fixed. Thanks for debugging this!
MapR has filed this as bug #12644. I haven't tested it yet, but I believe it will work as-is with S3.
Just FYI: you get a different issue when running Hadoop 2. Apparently cp has been changed so that it does cause an error if you are overwriting a file:

Copying Hadoop config files...
@bridiver, could you create a new issue for this? Looks like it's worth looking into. Thanks!
cat /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/cmd.log
Job Execution Parameters
ARGS = 4
CMD = hadoop
CMDLINE = fs -ls /
CURRENT_JOB_FILE_DEPENDENCIES =
S3_HADOOP_CONF_FILES = file://Library/hadoop/hadoop-1.1.2/conf/core-site.xml,file://Library/hadoop/hadoop-1.1.2/conf/mapred-site.xml,file://Library/hadoop/hadoop-1.1.2/conf/hdfs-site.xml
S3_HIVE_CONF_FILES =
S3_PIG_CONF_FILES =
CURRENT_JOB_WORKING_DIR = /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc
CURRENT_JOB_CONF_DIR = /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf
S3_ARCHIVE_LOCATION =
HADOOP_USER_NAME = genietest
HADOOP_GROUP_NAME = hadoop
HADOOP_HOME = /Library/hadoop/hadoop-1.1.2
HIVE_HOME = /apps/hive/current
PIG_HOME = /apps/pig/current
HADOOP_S3CP_TIMEOUT = 1800
Creating job conf dir: /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf
Copying job dependency files:
Copying Hadoop config files...
Copying files file://Library/hadoop/hadoop-1.1.2/conf/core-site.xml file://Library/hadoop/hadoop-1.1.2/conf/mapred-site.xml file://Library/hadoop/hadoop-1.1.2/conf/hdfs-site.xml to /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf
Warning: $HADOOP_HOME is deprecated.
2013-06-24 12:42:37.030 java[70743:1c03] Unable to load realm info from SCDynamicStore
cp: When copying multiple files, destination /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf/ should be a directory.
Will retry in 5 seconds to ensure that this is not a transient error
Warning: $HADOOP_HOME is deprecated.
2013-06-24 12:42:43.183 java[70769:1c03] Unable to load realm info from SCDynamicStore
cp: When copying multiple files, destination /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf/ should be a directory.
Will retry in 5 seconds to ensure that this is not a transient error
Warning: $HADOOP_HOME is deprecated.
2013-06-24 12:42:49.333 java[70794:1c03] Unable to load realm info from SCDynamicStore
cp: When copying multiple files, destination /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf/ should be a directory.
Will retry in 5 seconds to ensure that this is not a transient error
Warning: $HADOOP_HOME is deprecated.
2013-06-24 12:42:55.491 java[70825:1c03] Unable to load realm info from SCDynamicStore
cp: When copying multiple files, destination /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf/ should be a directory.
Will retry in 5 seconds to ensure that this is not a transient error
Warning: $HADOOP_HOME is deprecated.
2013-06-24 12:43:01.662 java[70850:1c03] Unable to load realm info from SCDynamicStore
cp: When copying multiple files, destination /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf/ should be a directory.
Will retry in 5 seconds to ensure that this is not a transient error
Failed to copy files from file://Library/hadoop/hadoop-1.1.2/conf/core-site.xml file://Library/hadoop/hadoop-1.1.2/conf/mapred-site.xml file://Library/hadoop/hadoop-1.1.2/conf/hdfs-site.xml to /tmp/tomcat/genie-jobs/09376457-596a-48ac-9e91-1cdf5cdbbffc/conf
Not archiving files in working directory
conf Directory Listing After Job Fails:
tsunami:09376457-596a-48ac-9e91-1cdf5cdbbffc schappetj$ ls -1 conf/
capacity-scheduler.xml
configuration.xsl
core-site.xml
fair-scheduler.xml
hadoop-env.sh
hadoop-metrics2.properties
hadoop-policy.xml
hdfs-site.xml
log4j.properties
mapred-queue-acls.xml
mapred-site.xml
masters
slaves
ssl-client.xml.example
ssl-server.xml.example
taskcontroller.cfg