-
Hadoop command-line tools move deleted items to the trash; programs that use the Hadoop API do not. <remark>verify Pig uses the trash</remark>
-
The tasktracker heap size does not affect the amount of memory available to your tasks. To change that, adjust mapred.map.child.java.opts or mapred.reduce.child.java.opts.
-
Stripe your data across all disks
-
Don’t use the root drive
-
Don’t RAID your disks
-
Don’t hit external services — see [server_logs_ddos].
-
Empty your trash immediately (.Trash is a directory, so the delete must be recursive: -rm -r, or -rmr on older releases) with
hadoop fs -rm -r -skipTrash /users/(you)/.Trash
-
Input format for Zip files: ZIP file input format
-
Calculate file checksums: IvoryMonkey by Josh Patterson (@jpatanooga) calculates a checksum locally that you can compare with the Hadoop filesystem’s internal checksum. I’m not aware of a way to calculate the MD5 or SHA1 checksum of a file on the HDFS short of running a map-side job with a non-splittable input format. (TODO update command for wukong 3.0 format)
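For files small enough to pull through a single machine, one client-side workaround is to stream the file out of the HDFS with hadoop fs -cat and hash the stream locally. The sketch below is an illustration under assumptions, not a technique from the text above: the function names are hypothetical, the paths are supplied by the caller, and md5sum is assumed to be installed on the client. It is not distributed, so every byte of the file crosses the network to the machine running it.

```shell
#!/bin/sh
# Print the MD5 hex digest of whatever arrives on stdin.
# md5sum prints "<digest>  -" for stdin; cut keeps only the digest.
md5_of_stream() {
  md5sum | cut -d' ' -f1
}

# Succeed (exit status 0) when a local file and an HDFS file have the same MD5.
#   $1 = local path   $2 = HDFS path   (both hypothetical, chosen by the caller)
compare_with_hdfs() {
  local_sum=$(md5_of_stream < "$1")
  # hadoop fs -cat streams the whole file through this client machine.
  hdfs_sum=$(hadoop fs -cat "$2" | md5_of_stream)
  [ "$local_sum" = "$hdfs_sum" ]
}
```

A call such as compare_with_hdfs ./part-00000 /data/out/part-00000 && echo match would then report whether the two copies agree. Note that if hadoop fs -cat fails, the comparison is made against the digest of an empty stream, so a mismatch can also mean the read failed. Swapping md5sum for sha1sum gives the SHA1 variant.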