Private:progress-gao

From NMSL
 
Latest revision as of 00:01, 16 November 2011

Fall 2011 (RA)

  • Courses:
    • None.

Nov 16th

  • Leveraged an Amazon Elastic MapReduce cluster to run the proposed method on the Wikipedia dataset, and obtained a large reduction in computing time compared with the 5-node cluster.
  • Ran the proposed method, Parallel Spectral Clustering, and Spectral Clustering on the Wikipedia datasets. The results show that the proposed method outperforms the other two.

Oct 17th

  • Report: https://cs-nsl-svn.cs.surrey.sfu.ca/cssvn/nsl-members/gao/reports/Report_Oct_17th/reportTemplate.pdf
  • Experimented with tuning the parameter "how many terms to keep".
  • Ran the proposed algorithm on a large dataset with good accuracy (91.38%).
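The accuracy figures quoted in these entries (91.38% here, 96.6% and 62.22% below) are presumably computed by matching predicted clusters against the known Wikipedia category labels. A minimal sketch of one common such metric, purity-style accuracy, is below; the exact metric used in the reports may differ.

```python
from collections import Counter

def clustering_accuracy(pred, truth):
    """Purity-style accuracy: map each predicted cluster to its
    majority ground-truth label, then count correct assignments.
    (Assumed metric; the report may use a different matching.)"""
    clusters = {}
    for p, t in zip(pred, truth):
        clusters.setdefault(p, []).append(t)
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in clusters.values())
    return correct / len(truth)

# Cluster 0 captures 2 of 3 'a's, cluster 1 captures both 'b's:
# 4 of 5 points correct -> 0.8
acc = clustering_accuracy([0, 0, 0, 1, 1], ['a', 'a', 'b', 'b', 'b'])
```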

Oct 4th

  • Report: https://cs-nsl-svn.cs.surrey.sfu.ca/cssvn/nsl-members/gao/reports/Report_Oct_4th/reportTemplate.pdf
  • Finished stop-word removal and stemming.
  • In the similarity matrix computation, used only the first F terms of each document's content to reduce the feature set.
  • Spectral clustering accuracy is 96.6%; k-means accuracy is 62.22%.
  • Will apply the proposed algorithm to the dataset.
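The first-F feature reduction and the similarity matrix step might be sketched as below; F = 100 and cosine similarity over raw term frequencies are illustrative assumptions, not the report's exact choices.

```python
import math
from collections import Counter

def doc_vector(tokens, F=100):
    """Keep only the first F tokens of a document to bound the
    feature set, then count term frequencies (F is illustrative)."""
    return Counter(tokens[:F])

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(docs, F=100):
    """Pairwise similarity matrix over truncated document vectors."""
    vecs = [doc_vector(d, F) for d in docs]
    return [[cosine(a, b) for b in vecs] for a in vecs]
```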

Sep 20th

  • Report: https://cs-nsl-svn.cs.surrey.sfu.ca/cssvn/nsl-members/gao/reports/Report_Sep_20th/reportTemplate.pdf
  • Successfully crawled 3,550,567 web pages from Wikipedia, forming 739,144 categories.
  • Processed the raw HTML files and have them ready for testing.
  • Now working on stop-word removal and stemming.
  • Will have the dataset fully processed and ready in the next 4-5 days.
  • Then one day to compute tf-idf on the dataset (that part of the code is already done).
  • Following that, first test the dataset with k-means, then try spectral clustering.
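The tf-idf step mentioned above can be sketched minimally as follows; raw term frequency and an unsmoothed log idf are assumptions, and the actual code may use a different weighting scheme.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one dict per document
    mapping term -> tf-idf weight (raw tf, log-scaled idf)."""
    N = len(docs)
    df = Counter()                      # document frequency per term
    for d in docs:
        df.update(set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights
```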

Summer 2011 (RA)

  • Courses:
    • None.

May 30th

  • Report: https://cs-nsl-svn.cs.surrey.sfu.ca/cssvn/nsl-members/gao/reports/Report_May_30th/reportTemplate.pdf
  • Added the Nyström method to the accuracy comparison between the proposed method and the PSC method.
  • Continued running the experiment on the large dataset; will continue and collect more data.
  • Analysed the experiment results in the experimental results section.
  • Included more sources to support the legitimacy of the evaluation metrics.
  • Refined the structure of the figures.

May 16th

  • Got the Parallel SC method (implemented with MPICH2) running on the NSL cluster and obtained results for comparison. The hardware platform remains unchanged (NSL cluster).
  • Examined and ran a few experiments with the Nyström Extension Spectral Clustering method (in MATLAB) and the "Fast approximate spectral clustering" method (in R). However, these two methods run on a single machine, which somewhat defeats the purpose of the comparison (the platform changes). Moreover, MATLAB's memory limit causes the experiment to fail on the 2^12 dataset.
  • Areas affected by this update: the experiment setup and evaluation sections.

May 5th

  • Found two recent works (with implementations) for comparison, and set up the cluster to run them.
  • Added other similar papers on the same topic to the related work section.
  • Further clarified the results (as well as the methodology of our proposed method).

Spring 2011 (TA)

  • Courses:
    • None.

Apr 19th

  • Report: https://cs-nsl-svn.cs.surrey.sfu.ca/cssvn/nsl-members/gao/reports/Report_Apr_19th.pdf
  • Applied larger datasets to the proposed scheme and obtained a large reduction in computing time (a ratio of about 1/18) with a guaranteed error bound; more details are in the report.
  • Read through related publications, compared our method with existing ones, and updated the related work section of the paper.
  • TA duties: marking assignment five, final exam supervision and marking.

Apr 11th

  • Corrected the implementation of the Dunn index (used for evaluation).
  • Solved the memory-insufficiency problem in gram matrix generation (needed for comparison). To compare the proposed method against the original method, the raw gram matrix must be computed beforehand; however, for datasets of 2^13 points or more, memory runs out. The problem is solved by writing entries out through a buffer instead of keeping the whole matrix in memory.
  • This fix lets us extend the comparison to larger datasets.
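The buffering fix described above, flushing gram matrix rows to disk in bounded blocks instead of holding all O(n^2) entries in memory, might look like this sketch; the Gaussian kernel, the binary row format, and the block size are illustrative assumptions.

```python
import math
import struct

def write_gram_matrix(points, path, block_rows=256):
    """Compute a Gaussian-kernel gram matrix row by row, buffering
    at most `block_rows` rows in memory before flushing to disk,
    so peak memory is O(block_rows * n) rather than O(n^2)."""
    n = len(points)
    buf = []
    with open(path, "wb") as f:
        for i in range(n):
            row = [math.exp(-sum((a - b) ** 2
                   for a, b in zip(points[i], points[j])))
                   for j in range(n)]
            buf.append(row)
            if len(buf) >= block_rows:
                for r in buf:
                    f.write(struct.pack(f"{n}d", *r))
                buf.clear()
        for r in buf:                   # flush the final partial block
            f.write(struct.pack(f"{n}d", *r))

def read_row(path, n, i):
    """Read back row i without loading the whole matrix (8 bytes
    per double)."""
    with open(path, "rb") as f:
        f.seek(i * n * 8)
        return struct.unpack(f"{n}d", f.read(n * 8))
```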

Apr 3rd

  • Implemented the whole proposed design.
  • Implemented evaluation measurement.
  • Got results for dataset sizes from 2^6 to 2^13; more results will be collected.

Mar 21st

  • Fixed bugs in the original Mahout code. The source code throws "java.lang.RuntimeException: java.lang.ClassNotFoundException" in several places.
  • Implementing the evaluation measurement.

Mar 14th

  • Set up the Mahout (together with Hadoop) environment and ran tests.
  • Implementing the proposed idea on Mahout.
  • Found a nice optimization opportunity, named "time multiplexing". It is based on the observation that, although the spectral clustering step normally executes only after the preprocessing step, the LSH step and the spectral clustering step can overlap in time.
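One way to read the time-multiplexing observation is as a producer/consumer pipeline: the LSH step emits each bucket as soon as it is complete, so per-bucket clustering work can begin before all hashing finishes. All names and the bucket-level overlap below are my illustrative interpretation, not the actual implementation.

```python
import queue
import threading

def lsh_step(data, out_q, n_buckets=4):
    """Producer: hash items into buckets; emit each bucket so the
    consumer can start before the whole preprocessing step ends."""
    buckets = [[] for _ in range(n_buckets)]
    for x in data:
        buckets[hash(x) % n_buckets].append(x)
    for b in buckets:
        out_q.put(b)
    out_q.put(None)                     # end-of-stream sentinel

def clustering_step(in_q, results):
    """Consumer: process buckets as they arrive (a stand-in for the
    per-bucket spectral clustering work)."""
    while (bucket := in_q.get()) is not None:
        results.append(sorted(bucket))

def run_pipeline(data):
    """Run both steps concurrently, overlapping their execution."""
    q, results = queue.Queue(), []
    t = threading.Thread(target=clustering_step, args=(q, results))
    t.start()
    lsh_step(data, q)
    t.join()
    return results
```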

Mar 7th

  • Adding extra code to the Mahout library to implement the proposed idea.
  • Affected classes: AffinityMatrixInputJob (gram matrix calculation), DistributedRowMatrix (constructing the diagonal matrix), DistributedLanczosSolver (eigen-decomposition).
  • Merging the previous reports and making necessary changes.

Feb 28

  • Examined two of Hadoop's design features that strongly influence the experiment's performance: the data placement policy and the task scheduling policy.
  • For map tasks, the scheduler uses a locality optimization: after selecting a job, it picks the map task whose input data is closest to the slave, on the same node if possible, otherwise on the same rack, and finally on a remote rack. For reduce tasks, the jobtracker simply takes the next task in the reduce task list and assigns it to the tasktracker.
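The locality optimization described above (node-local, then rack-local, then remote) can be modelled by a small selection function; the task representation here is illustrative, not Hadoop's actual data structures.

```python
def pick_map_task(tasks, slave_node, slave_rack):
    """Pick the pending map task whose input data is closest to the
    requesting slave: same node first, then same rack, then any.
    Each task is a dict listing the nodes/racks holding its input
    split. (Illustrative model of Hadoop's locality-aware map
    scheduling, not the real JobTracker code.)"""
    for task in tasks:
        if slave_node in task["nodes"]:
            return task, "node-local"
    for task in tasks:
        if slave_rack in task["racks"]:
            return task, "rack-local"
    return (tasks[0], "remote") if tasks else (None, None)
```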

Feb 14

  • Formalized the proposed design: analysed why a specific family of hash functions is used, the speedup attained, and the evaluation measurement.

Feb 7

  • Studied LSH and explored the different LSH families according to the distance measure they use.
  • Based on how these LSH methods work, analysed how they can be parallelized in a distributed environment.
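One widely used family of this kind, random-hyperplane hashing for the cosine distance, can be sketched as follows; the dimension and number of bits are illustrative.

```python
import random

def make_hyperplane_hash(dim, n_bits, seed=0):
    """Random-hyperplane LSH for cosine similarity: each bit records
    which side of a random hyperplane a vector falls on. Vectors
    with a small angle between them agree on most bits with high
    probability, so they tend to collide in the same bucket."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)]
              for _ in range(n_bits)]
    def h(v):
        return tuple(1 if sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0
                     else 0 for p in planes)
    return h

# Vectors pointing in the same direction get identical signatures.
h = make_hyperplane_hash(dim=3, n_bits=16)
```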

Jan 31

  • Examined how each of the main classes of clustering algorithms can be parallelized.
  • Explained spectral clustering from a theoretical perspective, using the graph-cut viewpoint.
  • Based on this understanding of the main clustering algorithms, proposed an optimization of spectral clustering for large datasets.
  • The proposed method uses LSH for pre-processing.
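The spectral clustering pipeline referred to throughout (affinity matrix, normalized graph Laplacian, eigenvector embedding, then k-means) follows the textbook normalized-cut formulation. This NumPy sketch shows the steps; it is not the Mahout implementation.

```python
import numpy as np

def spectral_clustering(W, k, iters=50, seed=0):
    """W: symmetric affinity matrix (n x n). Returns k cluster labels.
    Uses the symmetric normalized Laplacian L = I - D^-1/2 W D^-1/2,
    embeds points in its k smallest eigenvectors, then runs a plain
    k-means on the row-normalized embedding."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)
    X = vecs[:, :k]                               # spectral embedding
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    rng = np.random.default_rng(seed)             # plain k-means
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), 1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels
```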

Jan 17

  • Studied spectral clustering and its distributed implementation.
  • Experimented with Mahout.

Jan 10

  • Surveyed the main clustering algorithms and their distributed map-reduce implementations.
  • Experimented with Mahout.


Fall 2010 (FELLOWSHIP)

  • Courses:
    • CMPT 771: Internet Architecture and Protocols
    • CMPT 741: Data Mining
  • Worked on efficient approximation of the gram matrix using the map-reduce framework, focusing on LSH performance evaluation and network communication measurement.


Summer 2010 (RA)

  • Courses:
    • None
  • Worked on approximation of gram matrices using Locality-Sensitive Hashing on a cluster.


Spring 2010 (TA+RA)

  • Courses:
    • CMPT 886: Special Topics in Operating Systems and Computer Architecture
  • Worked on band approximation of gram matrices (large high-dimensional datasets) using the Hilbert curve on multicore machines.


Fall 2009 (TA)

  • Courses:
    • CMPT 705: Design and Analysis of Algorithms
    • CMPT 726: Machine Learning
  • Worked on band approximation of gram matrices (large high-dimensional datasets) using the Hilbert curve on multicore machines.