README
This guide describes how to set up your system for homework 01 and the homeworks that follow. It was tested on Ubuntu 12.10 but should work on any Linux system, including the Linux machines in the PC cluster on the first floor of WVH, which all CCIS grad students should have access to.
When working with AWS (Amazon Web Services), I strongly suggest using a Linux or Mac machine and writing your own scripts to automate access to AWS. The AWS web interface is fairly time consuming, so avoid it as much as possible. This guide introduces some tricks, hacks, hints, and scripts I built to automate the programming and testing process.
Good Luck!
Sign up for Amazon Web Services
- Requires a phone number
- Activate your account from your email (the email arrives immediately)
- Write down or copy and paste your ACCESS_KEY and SECRET_ACCESS_KEY from the activation page and keep them somewhere safe
Set up AWS S3 (Simple Storage Service)
S3 stores data in containers called "buckets". For our purposes it plays a role similar to HDFS (Hadoop Distributed File System) as the place where job input and output live, but it is an object storage service, not HDFS itself.
- Go to the AWS Management Console. The link to the console is at the top right corner of the AWS main page.
- Click S3 in the console. This leads to the S3 console page.
- Follow the Getting Started guide and write down your BUCKET_NAME. Name the bucket using only lowercase letters, numbers, periods (.), and dashes (-). AWS does not allow you to run a MapReduce job on a bucket whose name contains any other characters.
- Uploading, moving, and deleting files through the browser is tedious. Let's automate this. Download and set up s3cmd from s3tools. On Ubuntu you can type "sudo apt-get install s3cmd" in your console. On other systems, download the source from http://s3tools.org/s3cmd and set it up. FYI, s3cmd can be run directly from the unpacked source without installation.
- Type "s3cmd --configure" in your console. It asks for an "Access Key" and a "Secret Key", which are the ACCESS_KEY and SECRET_ACCESS_KEY mentioned above. If you are not sure about the other settings, keep the defaults (just press enter until the 'Save settings?' question). At 'Save settings?', enter 'y' to save, since the default is not to save. Settings are saved in "/home/$USER/.s3cfg".
- Download the file http://www.ccs.neu.edu/home/yerihyo/helloworld.txt and save it somewhere.
- Type "s3cmd sync [Path to Downloaded file] s3://[BUCKET_NAME]/helloworld.txt"
- Check in the S3 Console that your file was uploaded properly. "s3cmd sync" can sync a whole folder and its subfolders, so you now have an easy way to upload multiple files to S3.
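The naming rule and the upload step above can be combined into a small helper. This is only a sketch: the function names is_valid_bucket_name and sync_to_bucket are made up for illustration, and the check covers only the character rule stated in this guide (AWS may enforce further rules, such as length limits, that are not checked here). It assumes s3cmd is already configured.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of s3cmd.

# Succeeds only if the name is non-empty and contains nothing but
# lowercase letters, digits, periods, and dashes (the rule from this guide).
# The characters are listed one by one so the check does not depend on locale.
is_valid_bucket_name() {
    case "$1" in
        '' | *[!abcdefghijklmnopqrstuvwxyz0123456789.-]*) false ;;
        *) true ;;
    esac
}

# Usage: sync_to_bucket LOCAL_PATH BUCKET_NAME REMOTE_PATH
# Refuses to call s3cmd when the bucket name breaks the rule above.
sync_to_bucket() {
    if ! is_valid_bucket_name "$2"; then
        echo "invalid bucket name: $2" >&2
        return 1
    fi
    s3cmd sync "$1" "s3://$2/$3"
}
```

For example, after sourcing the script, "sync_to_bucket ./helloworld.txt [BUCKET_NAME] helloworld.txt" runs the same "s3cmd sync" command as the step above, but stops early if the bucket name contains a disallowed character.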
Set up Hadoop & EMR client
- Hadoop is not yet very reliable. DO NOT INSTALL THE PACKAGES PROVIDED BY YOUR OS. NEGATIVE COMMENTS HAVE BEEN REPORTED ON UBUNTU'S HADOOP RELEASE. Instead, download it directly from Apache (http://hadoop.apache.org/). I used Hadoop 1.1.1 / Hadoop 1.0.4 for this guide. Unpack the archive and it is ready to use.
- Download the linked source code. Unzip it and follow the instructions in README.txt.
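The Hadoop download step above can also be scripted. The sketch below rests on several assumptions that do not come from this guide: that old releases are served from archive.apache.org under the path shown, that the tarball unpacks into a directory named hadoop-VERSION, and that curl, tar, and Java are available. The function names are hypothetical. Check the URL in a browser before relying on it.

```shell
#!/bin/sh
# Sketch only: the URL layout and function names are assumptions.

# Print the assumed download URL for a given Hadoop version.
hadoop_tarball_url() {
    printf 'https://archive.apache.org/dist/hadoop/core/hadoop-%s/hadoop-%s.tar.gz\n' "$1" "$1"
}

# Usage: fetch_hadoop VERSION DEST_DIR
# Downloads the tarball into DEST_DIR, unpacks it there, and asks the
# unpacked Hadoop to print its version (this last step needs Java).
fetch_hadoop() {
    version="$1"
    dest="$2"
    mkdir -p "$dest" || return 1
    curl -fL -o "$dest/hadoop-$version.tar.gz" "$(hadoop_tarball_url "$version")" || return 1
    tar -xzf "$dest/hadoop-$version.tar.gz" -C "$dest" || return 1
    "$dest/hadoop-$version/bin/hadoop" version
}
```

For example, "fetch_hadoop 1.0.4 $HOME/hadoop" would leave the unpacked release in $HOME/hadoop/hadoop-1.0.4 if the assumptions hold.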