14d-gotchas.asciidoc

Gotchas

  • Hadoop command-line tools move deleted items to the trash; programs that delete through the Hadoop API bypass it. <remark>verify Pig uses the trash</remark>

  • The tasktracker heap size does not affect the amount of memory available to your tasks, because each task runs in its own child JVM. To change task memory, adjust mapred.map.child.java.opts or mapred.reduce.child.java.opts.

  • Stripe your data across all disks

    • Don’t use the root drive

    • Don’t RAID your disks

  • Don’t hit external services — see [server_logs_ddos].
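To illustrate the task-memory gotcha above, here is a minimal sketch of setting the child JVM heap per job rather than touching the tasktracker. It assumes the job is launched through a tool that accepts Hadoop's generic `-D property=value` options (for example, one built on ToolRunner); the helper name `child_heap_args` and the heap sizes are hypothetical.

```python
def child_heap_args(map_mb, reduce_mb):
    """Build generic -D options that set the heap of map and reduce child JVMs.

    The property names come from the gotcha above; passing -Xmx inside them
    is the usual way to size a JVM heap. Sizes are in megabytes.
    """
    return [
        "-D", f"mapred.map.child.java.opts=-Xmx{map_mb}m",
        "-D", f"mapred.reduce.child.java.opts=-Xmx{reduce_mb}m",
    ]


# Hypothetical sizes: 1 GB for each map task, 2 GB for each reduce task.
print(" ".join(child_heap_args(1024, 2048)))
```

The resulting arguments would be placed after the job's main class or jar on the command line; whether they take effect depends on the job actually parsing generic options.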

Tips

  • Remove files from the trash immediately with hadoop fs -rm -r -skipTrash /user/(you)/.Trash (use -rmr in place of -rm -r on older releases)

  • Input format for Zip files: ZIP file input format

  • Calculate file checksums: IvoryMonkey by Josh Patterson (@jpatanooga) calculates a checksum locally that you can compare with the Hadoop filesystem’s internal checksum. I’m not aware of a way to calculate the MD5 or SHA-1 checksum of a file on HDFS short of running a map-side job with a non-splittable input format (TODO update command for wukong 3.0 format)
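For the local half of a checksum comparison, the sketch below computes MD5 and SHA-1 digests of any binary stream in fixed-size chunks. It is not IvoryMonkey and does not reproduce the Hadoop filesystem's internal checksum; the function name `stream_digests` and the chunk size are illustrative choices.

```python
import hashlib
import io


def stream_digests(stream, chunk_size=1 << 20):
    """Return (md5_hex, sha1_hex) for a binary stream, reading it in chunks."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    # Read fixed-size chunks so a large file is never held fully in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        md5.update(chunk)
        sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


# Small in-memory example; a real file would be opened with open(path, "rb").
md5_hex, sha1_hex = stream_digests(io.BytesIO(b"hello hadoop"))
print(md5_hex, sha1_hex)
```

Both digests are updated in a single pass, so the data is read only once, which matters when the input is large.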

Common Error Messages