Install Hadoop.

This step should already be complete if you followed the Hadoop installation guide.

Install javac (version 1.6).

 WARNING: Do not use JAVA 1.7

Type `javac -version` to check whether javac is installed and which version you have.
Amazon EMR uses Java 1.6, and javac 1.7 will not let you compile for a lower target version by default. Therefore, if you have javac 1.7 or higher on your system, install and use javac 1.6.

Find where the javac binary is using `which javac | xargs readlink -f`. On the CCIS Linux cluster, it is located at "/usr/lib/jvm/java-6-openjdk/bin/javac".
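The version check and JAVA_HOME lookup can be sketched as follows (the path shown is the CCIS cluster example; your system will differ):

```shell
# Check that javac is installed and is version 1.6.
javac -version 2>&1            # e.g. "javac 1.6.0_27"

# Resolve the real javac path through any symlinks.
JAVAC_BIN=$(which javac | xargs readlink -f)

# JAVA_HOME is two directories up from the javac binary,
# e.g. /usr/lib/jvm/java-6-openjdk on the CCIS cluster.
JAVA_HOME=$(dirname "$(dirname "$JAVAC_BIN")")
echo "$JAVA_HOME"
```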

Config "config" file - step 01

Change the first 3 lines to match your system configuration.

  • JAVA_HOME is two directories up from the location of your javac binary. On the CCIS Linux cluster, it is "/usr/lib/jvm/java-6-openjdk".
  • HADOOP_VERSION and HADOOP_HOME are the version and the install (or unzip) location of your Hadoop.

Do not worry about the lines below "# run02.bash" for now.
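As a concrete sketch, the first three lines of "config" might look like this (all values are examples, not the actual file; substitute your own paths and version):

```shell
# Example values only - use your own system's paths and Hadoop version.
JAVA_HOME=/usr/lib/jvm/java-6-openjdk          # two directories up from javac
HADOOP_VERSION=0.20.2                          # assumed version; check yours
HADOOP_HOME=$HOME/hadoop-$HADOOP_VERSION       # where you unzipped Hadoop
```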

Run "hw00/scripts/run01.bash" file.

The script compiles the source under hw00/src/edu/neu/cs6240/ta/, builds a jar file under the "hw00/jar/" directory, and runs Hadoop using that jar. It should finish in about 30 seconds.
Run `wc -w hw00/data/input/smalldata` and compare the result to the file "hw00/data/output/part-r-00000". If they are identical, things are running correctly.
The next time you run the script, be careful: it automatically removes the previous result under "hw00/data/output/".
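The verification step can be scripted; this sketch assumes the run01.bash job has already completed and uses the paths from this guide:

```shell
# Word count of the input should equal the number written by the job.
expected=$(wc -w < hw00/data/input/smalldata)
actual=$(cat hw00/data/output/part-r-00000)

if [ "$expected" -eq "$actual" ]; then
    echo "OK: local Hadoop run verified ($actual words)"
else
    echo "Mismatch: input has $expected words, output says $actual"
fi
```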

Install elastic-mapreduce-ruby (EMR)

The Elastic MapReduce Ruby client lets you run EMR jobs from the command line, which speeds up your development process. Download and install ruby1.8 and elastic-mapreduce-ruby. The elastic-mapreduce-ruby archive does not create a folder when unzipped, so place it in an empty folder before unzipping it. Follow EMR's README.txt. You will need an ACCESS_KEY, SECRET_ACCESS_KEY, and KEY_PAIR, which can be created and downloaded from the AWS console. Your "key-pair-file" should be in the form "[PATH-to-folder]/[name-you-provided].pem" and your region should be "us-east-1". Write down the name of the KEY_PAIR.

Config "config" file - step 02

Change the next 4 lines to match your system configuration.

  • S3CMD is the location of the s3cmd binary.
  • S3_DIR and S3N_DIR are the paths to the S3 buckets where you will save your results. Change them to match your BUCKET_NAME; in the current config file they are set to "s3:[BUCKET_NAME]" and "s3n:[BUCKET_NAME]".
  • EMR is the path to your elastic-mapreduce binary.
  • KEY_PAIR is the name of the KEY_PAIR you created while installing elastic-mapreduce-ruby.
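A sketch of those settings (every value below is a placeholder; use your own bucket name, install paths, and key pair name):

```shell
# Placeholder values - substitute your own.
S3CMD=/usr/bin/s3cmd                                # location of the s3cmd binary
S3_DIR=s3:[BUCKET_NAME]                             # per the current config file
S3N_DIR=s3n:[BUCKET_NAME]
EMR=$HOME/elastic-mapreduce-ruby/elastic-mapreduce  # path to the EMR binary
KEY_PAIR=my-emr-keypair                             # hypothetical name chosen on AWS
```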

Run "hw00/scripts/run02.bash"

Run "hw00/scripts/run02.bash" from the command line. It should submit a job to AWS EMR. Check the status of your job either with elastic-mapreduce-ruby commands, such as "[path-to-elastic-mapreduce-ruby]/elastic-mapreduce --list", or via the AWS web console; I found it easier to follow jobs in the browser. After the job finishes, your result in "s3://[BUCKET_NAME]/homeworks/hw00/data/output/run02.out/part-r-00000" should be '275052'.
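Checking the job and verifying the result can be sketched as follows (this assumes the config variables above are set, the job has finished, and your bucket name replaces the placeholder; fetching with s3cmd is one plausible way to get the file):

```shell
# List EMR jobs and their states from the command line
# ($EMR is the elastic-mapreduce binary path from the config file).
"$EMR" --list

# Download the job output and check for the expected count.
"$S3CMD" get --force \
    "s3://[BUCKET_NAME]/homeworks/hw00/data/output/run02.out/part-r-00000" \
    /tmp/part-r-00000
grep -q '275052' /tmp/part-r-00000 && echo "EMR run verified"
```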


For upcoming homeworks, you can create folders hw01, hw02, etc. under the homeworks/ folder and copy and modify the hw00/run02.bash file to compile and test your new homework. Don't forget to create the necessary folders, such as jar/ and bin/, below each hw0*/ folder so your script runs properly.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License