Revision as of 21:08, 27 February 2011

Spring 2011 (TA)

Hadoop's two important design features, which have great influence over the experiment's performance are examined: data placement policy and task scheduling policy.
It is noted that, for map tasks, the scheduler uses a locality optimization technique. After selecting a job, the scheduler picks the map task in the job with data closest to the slave, on the same node if possible, otherwise on the same rack, or finally on a remote rack. For reduce tasks, the jobtracker just takes the next in the reduce tasks' list and assign it to the tasktracker.

Formalized the proposed design. Analyzing why a specific family of function are used, the speedup attained, the evaluation measurement.

Studied LSH and explored different LSH families according to the distance measurement they use.
Based on how LSH methods work, analysed how these LSH methods can be parallelized in distributed environment.

Examined how the main class clustering algorithms can be parallelized, respectively.
Explained spectral clustering from theoretical aspect, from graph cut viewpoint.
Based on the understanding of main clustering algorithms, proposed optimizing method for spectral clustering to deal with large data set.
The proposed method makes use of LSH to do pre-precessing.

Survey on main clustering algorithms and the distributed map-reduce method of these algorithms.
Mahout experimenting.

Worked on efficient approximation of gram matrix using map-reduce framework, focusing on LSH performance evaluation and network communication measurement.

Worked on Approximation of gram matrices using Locality Sensitive Hashing on Cluster.

Courses:
- CMPT 886: Special Topics in Operating Systems and Computer Architecture

Worked on Band approximation of gram matrices (large high-dimensional dataset) using Hilbert curve on multicore.

Courses:
- CMPT 705: Design and Analysis of Algorithms
- CMPT 726: Machine Learning

Worked on Band approximation of gram matrices (large high-dimensional dataset) using Hilbert curve on multicore.

@@ Line 4: / Line 4: @@
 === Feb 28 ===
-* Hadoop's two important design features, which have great influence over the experiment's performance
+* Hadoop's two important design features, which have great influence over the experiment's performance are examined: data placement policy and task scheduling policy.
-are examined: data placement policy and task scheduling policy.
 * It is noted that, for map tasks, the scheduler uses a locality optimization technique. After selecting a job, the scheduler picks the map task in the job with data closest to the slave, on the same node if possible, otherwise on the same rack, or finally on a remote rack. For reduce tasks, the jobtracker just takes the next in the reduce tasks' list and assign it to the tasktracker.