How to compile a Hadoop Program Posted on Oct 8, 2014


Before compiling your first Hadoop program, please see the instructions on how to run the WordCount example.

You can get the WordCount example code from GitHub (make sure you get a version compatible with your Hadoop installation):

    wget https://github.com/apache/hadoop-common/raw/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/WordCount.java

Optionally, you can change package org.apache.hadoop.examples; to package org.janzhou;.

Set the HADOOP_CLASSPATH:

    export HADOOP_CLASSPATH=$(bin/hadoop classpath)

Compile:

    javac -classpath ${HADOOP_CLASSPATH} -d WordCount/ WordCount.java

Create JAR:

    jar -cvf WordCount.jar -C WordCount/ .

Run (the main class name is case-sensitive and must match the package you chose above):

    bin/hadoop jar WordCount.jar org.janzhou.WordCount /wordcount/input /wordcount/output

Using sun.tools.javac.Main

You normally invoke javac.exe from the command line, but you can also invoke it from within a Java program.


How to run Hadoop WordCount.java Map-Reduce Program Posted on Oct 7, 2014


Hadoop comes with a set of demonstration programs. They are located here. One of them is WordCount.java, which automatically computes the word frequency of all text files found in the HDFS directory you ask it to process. Follow the Hadoop Tutorial to run the example.

Create a working directory for your data:

    bin/hdfs dfs -mkdir /wordcount

Copy data files to HDFS:

    bin/hdfs dfs -copyFromLocal /path/to/your/data /wordcount/input

Run WordCount:

    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar wordcount /wordcount/input /wordcount/output

View the results:

    bin/hdfs dfs -cat /wordcount/output/part-r-00000

Download the results:

    bin/hdfs dfs -copyToLocal /wordcount/output/part-r-00000 .
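To make clear what the job computes, here is a minimal local sketch of the same word-frequency idea in Python. It is not the Hadoop implementation: the function names (map_words, reduce_counts, word_count) are made up for illustration, and it assumes words are split on whitespace.

```python
from collections import Counter


def map_words(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run the map step over every line, then reduce all emitted pairs."""
    pairs = []
    for line in lines:
        pairs.extend(map_words(line))
    return reduce_counts(pairs)


if __name__ == "__main__":
    sample = ["hello hadoop", "hello world"]
    # Print one "word<TAB>count" line per word, sorted by word.
    for word, count in sorted(word_count(sample).items()):
        print(f"{word}\t{count}")
```

On a real cluster the map and reduce steps run on different nodes and the pairs are grouped by key in between; this sketch keeps everything in one process.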


How to Write a Research Paper Posted on Oct 1, 2014
Writing is easy. All you do is stare at a blank sheet of paper until drops of blood form on your forehead. --- Gene Fowler


Research is hard. In doing research, you should start by finding a good research topic that truly interests you. However, finding a good research topic is outside the scope of this paper. In this paper, I mainly focus on writing. Writing skills are essential in producing a good-quality paper. The writing skills used in a paper should depend on the specific topic and solution the paper presents.


Applying KNN To MNIST Dataset Posted on Sep 28, 2014
MNIST is a set of handwritten digit images. The k-Nearest Neighbors algorithm can be used to recognize the handwritten digits.
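As a rough illustration of the k-Nearest Neighbors idea, the following Python sketch classifies a query vector by majority vote among its k closest training vectors. It is only a sketch under assumptions: the function names are hypothetical, squared Euclidean distance is an assumed choice of metric, and the toy 2-D points stand in for flattened 28x28 images.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the most common label among the k training vectors nearest to query."""
    # Sort training indices by distance to the query and keep the k closest.
    nearest = sorted(
        range(len(train_vectors)),
        key=lambda i: squared_distance(train_vectors[i], query),
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data: two clusters labelled 0 and 1.
    vectors = [[0, 0], [0, 1], [9, 9], [10, 10]]
    labels = [0, 0, 1, 1]
    print(knn_predict(vectors, labels, [1, 0], k=3))
    print(knn_predict(vectors, labels, [9, 10], k=3))
```

This brute-force version compares the query against every training vector, which is simple but slow for a dataset of MNIST's size.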


MNIST is a set of handwritten digit images. You can download it by using a Makefile:

    %.gz:
    	wget http://yann.lecun.com/exdb/mnist/$*.gz

    %.idx: %.gz
    	gzip -d $*.gz
    	mv $* $*.idx

    prepare: t10k-images-idx3-ubyte.idx t10k-labels-idx1-ubyte.idx train-images-idx3-ubyte.idx train-labels-idx1-ubyte.idx

The size of each image is 28x28 pixels.

The IDX File Format

The MNIST dataset is stored in the IDX file format. The basic format is:

    magic number
    size in dimension 0
    size in dimension 1
    size in dimension 2
    .....
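The header layout above can be read with a few lines of Python. This is a minimal sketch, assuming the usual IDX conventions: the magic number is four bytes (two zero bytes, a data-type code, and the number of dimensions), each dimension size is a 4-byte big-endian integer, and the data is unsigned bytes (type code 0x08) as in the MNIST files. The function name parse_idx is hypothetical.

```python
import struct


def parse_idx(data):
    """Parse an IDX byte string and return (dims, payload).

    dims is a tuple of dimension sizes; payload is the raw element bytes.
    """
    # Magic number: two zero bytes, one type-code byte, one dimension-count byte.
    zero, dtype_code, ndim = struct.unpack_from(">HBB", data, 0)
    if zero != 0:
        raise ValueError("not an IDX file: first two bytes must be zero")
    if dtype_code != 0x08:
        raise ValueError("this sketch only handles unsigned-byte data (0x08)")

    # One 4-byte big-endian unsigned integer per dimension.
    dims = struct.unpack_from(">" + "I" * ndim, data, 4)
    offset = 4 + 4 * ndim

    # The payload holds the product of all dimension sizes, one byte each.
    count = 1
    for size in dims:
        count *= size
    payload = data[offset:offset + count]
    if len(payload) != count:
        raise ValueError("file is shorter than its header claims")
    return dims, payload


if __name__ == "__main__":
    # A tiny synthetic file: 2 "images" of 2x2 pixels.
    header = bytes([0, 0, 8, 3]) + struct.pack(">III", 2, 2, 2)
    dims, pixels = parse_idx(header + bytes(range(8)))
    print(dims, list(pixels))
```

For a real file, read it in binary mode first, for example open("train-images-idx3-ubyte.idx", "rb").read(), and pass the bytes to parse_idx.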


NNThroughputBenchmark Posted on Sep 18, 2014
How to use NNThroughputBenchmark -- one of the earliest NameNode Benchmarks for Hadoop.


NNThroughputBenchmark is one of the earliest NameNode benchmarks. It was first described in "HDFS Scalability: The Limits to Growth":

In order to measure the name-node performance, I implemented a benchmark called NNThroughputBenchmark, which now is a standard part of the HDFS code base. NNThroughputBenchmark is a single-node benchmark, which starts a name-node and runs a series of client threads on the same node. Each client repetitively performs the same name-node operation by directly calling the name-node method implementing this operation.


On Balance among Energy, Performance and Recovery in Storage Systems Posted on Jun 30, 2014
Junyao Zhan, Jiangling Yin, Jun Wang and Jian Zhou. 2014 IEEE 34th International Conference on Distributed Computing Systems Workshops.


With the increasing size of the clusters as well as the increasing capacity of each storage node, current storage systems are spending more time on recovery. When node failure happens, the system enters degradation mode in which node reconstruction/block recovery is initiated. This very process needs to wake up a number of disks and takes a substantial amount of I/O bandwidth which will not only compromise energy efficiency but also performance.


JSON-RPC over Golang Websocket Posted on Nov 3, 2013
This is an example of using jsonrpc to communicate between Golang and web browsers over WebSocket.


Basic ideas

JSON-RPC is a lightweight remote procedure call protocol. A JSON-RPC request is a single object serialized using JSON, a lightweight data-interchange format most commonly used in web applications to send data from the server to the browser. Typically, JSON data is transferred using Ajax. But WebSocket represents the next evolutionary step in web communication. It supports two-way communication, providing bi-directional, full-duplex communication channels over a single TCP socket.
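To show what "a single object serialized using JSON" looks like on the wire, here is a small Python sketch that builds a request and decodes a response. It assumes the JSON-RPC 1.0 message shape (method, params and id in the request; result, error and id in the response). The helper names and the method name Arith.Multiply are hypothetical, and the WebSocket transport itself is left out.

```python
import json


def make_request(method, params, request_id):
    """Serialize one JSON-RPC 1.0 request object to a JSON string."""
    return json.dumps({"method": method, "params": params, "id": request_id})


def parse_response(text):
    """Decode a JSON-RPC 1.0 response and return (id, result).

    Raises RuntimeError when the response carries a non-null error.
    """
    message = json.loads(text)
    if message.get("error") is not None:
        raise RuntimeError(f"remote error: {message['error']}")
    return message["id"], message.get("result")


if __name__ == "__main__":
    # Hypothetical call; the server-side method name is an assumption.
    print(make_request("Arith.Multiply", [{"A": 7, "B": 8}], 1))
    # A response as a server might send it back over the socket.
    print(parse_response('{"id": 1, "result": 56, "error": null}'))
```

The id field lets the client match each response to its request, which matters on a full-duplex channel where replies may arrive out of order.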


Removing ^M Characters In Vim Posted on Jun 28, 2013


If you edit files in gedit or Notepad, ^M characters may be inserted. You cannot simply remove ^M in Vim with any of the following commands, because the pattern will not be found:

    %s/^M//g
    %s/\^M//g
    %s/^V^M//g
    %s/C-vC-m//g

^M in Vim can be matched as \r, the carriage return character. Replacing \r characters removes the ^M:

    %s/\r//g

Your file may also contain \0, which is the null byte.
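Outside Vim, the same clean-up can be sketched in Python by deleting every carriage return byte. This is only an illustration of what the substitution does; the function name strip_carriage_returns is made up for this example.

```python
def strip_carriage_returns(data: bytes) -> bytes:
    """Remove every carriage return (\\r, shown as ^M in Vim) from data."""
    return data.replace(b"\r", b"")


if __name__ == "__main__":
    windows_text = b"first line\r\nsecond line\r\n"
    # repr() makes the remaining line endings visible.
    print(repr(strip_carriage_returns(windows_text)))
```

Working on bytes rather than text avoids any automatic newline translation when the file is read or written.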


MIND: A Black-Box Energy Consumption Model for Disk Arrays Posted on Jul 25, 2011
Zhuo Liu, Jian Zhou, Weikuan Yu, Fei Wu, Xiao Qin, and Changsheng Xie. 2011 International Green Computing Conference and Workshops (IGCC).


Energy consumption is becoming a growing concern in data centers. Many energy-conservation techniques have been proposed to address this problem. However, an integrated method is still needed to evaluate energy efficiency of storage systems and various power conservation techniques. Extensive measurements of different workloads on storage systems are often very time-consuming and require expensive equipment. We have analyzed changing characteristics such as power and performance of stand-alone disks and RAID arrays, and then defined MIND as a black box power model for RAID arrays.


TRACER: A Trace Replay Tool to Evaluate Energy-Efficiency of Mass Storage Systems Posted on Sep 21, 2010
Zhuo Liu, Fei Wu, Xiao Qin, Changsheng Xie, Jian Zhou, and Jianzong Wang. IEEE Cluster 2010.


Improving energy efficiency of mass storage systems has become an important and pressing research issue in large HPC centers and data centers. New energy conservation techniques in storage systems constantly spring up; however, there is a lack of a systematic and uniform way of accurately evaluating energy-efficient storage systems and objectively comparing a wide range of energy-saving techniques. This research presents a new integrated scheme, called TRACER, for evaluating energy efficiency of mass storage systems and judging energy-saving techniques.

