Private:progress-dameh

From NMSL
 
= Fall 2011 and Spring 2012 (RA) =

=== Jan 27 ===
* Implementing my method on the cluster; a report will be sent soon.

=== Dec 23 ===
* Finished the study of the different methods, the analysis, and the implementation on a single machine.

=== Nov 9 ===
* Scaled Spectral Hashing.
* Working on the US Census (1990) data set [number of features (dimensions) = 69].
* Ran experiments on a small data set (4k points) using the three methods: LSH Random Projection, LSH Spectral Hashing, and the Hilbert Curve.
* Now running on a large data set (1M points) using LSH only (the H-curve does not work at large scale).
* Experiment results to be added to the report.

=== Oct 26 ===
* I am facing problems working with very high dimensional data (d ~ 100,000) [curse of dimensionality]:
** I cannot use small data sets (1k points) to compare the different methods (the Hilbert curve fails for high-dimensional data, but LSH still works).
** The concept of distance becomes less precise as the number of dimensions grows.
** There are other techniques for clustering high-dimensional data that do not use the Gram matrix, such as Subspace Clustering, Projected Clustering, and Correlation Clustering.
* I have started working on different real data sets with low dimension (~100).
* Working on scaling Spectral Hashing by scaling the Principal Component Analysis (PCA) step in MATLAB (one possible approach is sketched below).
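One possible way to scale the PCA step is to fit it incrementally on mini-batches, so the full N x D matrix never has to sit in memory at once. The notes above use MATLAB's PCA; the sketch below uses Python/scikit-learn instead and is only an illustration of the idea, not the actual implementation (the batch size, component count, and toy data are made-up values).

<syntaxhighlight lang="python">
# Sketch: scaling PCA by fitting it incrementally on mini-batches.
# Illustrative only -- the progress notes use MATLAB's PCA; batch size,
# component count, and the random data below are made-up example values.
import numpy as np
from sklearn.decomposition import IncrementalPCA

def streaming_pca(batches, n_components=16):
    """batches: an iterable of (batch_size x D) numpy arrays."""
    ipca = IncrementalPCA(n_components=n_components)
    for X_batch in batches:
        ipca.partial_fit(X_batch)      # update the PCA model batch by batch
    return ipca

# toy usage with random data standing in for the real data set
rng = np.random.default_rng(0)
batches = (rng.standard_normal((1000, 69)) for _ in range(10))   # e.g. D = 69 as in the US Census data
model = streaming_pca(batches, n_components=16)
print(model.components_.shape)         # (16, 69): principal directions
</syntaxhighlight>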

=== Oct 12 ===
* Processing very high dimensional and very large scale real data sets.
** I use document clustering, where each document i is represented by a term-frequency vector x_i = [x_1i, x_2i, ..., x_Di], where D is the total number of keywords in the document corpus and x_ji is the number of occurrences of keyword j in document i. Keywords are the words that remain after the pre-processing steps of stop-word removal and word stemming. Below, N is the total number of documents. (A small sketch of this representation follows this list.)
* Data sets (source: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words):
** NYTimes news articles:
*** orig source: ldc.upenn.edu
*** N=300000
*** D=102660
** PubMed abstracts (PubMed comprises more than 21 million citations for biomedical literature from MEDLINE, life science journals, and online books):
*** orig source: www.pubmed.gov
*** N=8200000
*** D=141043
* Currently I am processing 1M documents from PubMed with dimension 141043; the total size of this data set on disk is about 250GB.
* To do: compare the performance using two different LSH families.
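For illustration only, a minimal sketch of the term-frequency representation described above, using a toy in-memory corpus and a trivial whitespace tokenizer (the real NYTimes/PubMed data sets come as pre-counted bag-of-words files, and stop-word removal and stemming are omitted here):

<syntaxhighlight lang="python">
# Sketch: building sparse term-frequency vectors x_i = [x_1i, ..., x_Di]
# for a toy corpus. D = vocabulary size, N = number of documents.
from collections import Counter
from scipy.sparse import lil_matrix

docs = [
    "gram matrix approximation for kernel methods",
    "locality sensitive hashing for kernel methods",
    "spectral hashing versus random projection",
]

# vocabulary: keyword j -> column index
vocab = {}
for doc in docs:
    for word in doc.split():
        vocab.setdefault(word, len(vocab))

N, D = len(docs), len(vocab)
X = lil_matrix((N, D))                 # sparse, since D can be ~100,000
for i, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        X[i, vocab[word]] = count      # x_ji = occurrences of keyword j in doc i

X = X.tocsr()                          # compressed format for later processing
print(N, D, X.nnz)
</syntaxhighlight>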

=== Sept 28 ===
* Updated report is here:
** /nsl/students/dameh/reports/doc/main.pdf
* Updated sections:
** IV. PERFORMANCE METRICS
** VI. RESULTS

=== Sept 12 ===
* Comparison results for the H-curve, LSH-Spectral Hashing, and LSH-Random Projection.
* Metrics used for the comparison:
** FN.SpaceReduction product metric: a Gram-matrix-level comparison that takes into account both the FN and the space reduction of the approximated Gram matrix (as FN decreases, space reduction increases).
*** To unify the comparison we set the H-curve window_width = DataSet_size/2^k, where k is the number of LSH hash bits (2^k buckets); in this case we get the same matrix size for both the H-curve and LSH.
*** This metric gives us the optimal value of k (its maximum).
*** LSH-Spectral Hashing outperforms LSH-Random Projection; the H-curve outperforms both.
*** Graph results can be found at these addresses:
**** FN for different values of k: dameh/reports/dameh/fall11/FN.pdf
**** Gram matrix size for different values of k: dameh/reports/dameh/fall11/Size.pdf
**** FN.SpaceReduction: dameh/reports/dameh/fall11/FSprod.pdf
** Average distance between each point and its exemplar, which is a clustering error metric:
*** To unify the comparison we plot the average distance against the number of exemplars (we change a parameter in affinity propagation to control the number of exemplars; for k-means clustering it is an initialization parameter).
*** Again, LSH-Spectral Hashing outperforms LSH-Random Projection, and LSH-Spectral Hashing outperforms the H-curve only for smaller numbers of clusters.
*** The graph can be found at this address: dameh/reports/dameh/fall11/ClusteringError.pdf
** Overall, the advantage of LSH over the H-curve is that LSH is parallelizable: we can get a speedup of up to 2^k (one node per bucket). (A rough sketch of the product metric follows this list.)
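A rough sketch of how an FN.SpaceReduction-style product metric could be computed. The definitions here are my assumptions for illustration, not necessarily the ones used in the report: FN is taken as the relative Frobenius-norm error of the bucket-wise (block) approximation of the Gram matrix, and space reduction as the fraction of Gram-matrix entries that are never computed or stored; the random bucket labels stand in for the 2^k LSH buckets.

<syntaxhighlight lang="python">
# Illustrative sketch of an FN * SpaceReduction style product metric.
# ASSUMPTIONS (not taken from the report): FN = relative Frobenius-norm
# error of the approximated Gram matrix; space reduction = fraction of
# Gram-matrix entries that are not computed/stored.
import numpy as np

def gram(X, gamma=0.5):
    """Full RBF Gram matrix (only feasible for small N)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def bucket_approximation(X, labels, gamma=0.5):
    """Keep kernel values only inside each bucket (block pattern)."""
    K = np.zeros((len(X), len(X)))
    stored = 0
    for b in np.unique(labels):
        idx = np.where(labels == b)[0]
        K[np.ix_(idx, idx)] = gram(X[idx], gamma)
        stored += len(idx) ** 2
    return K, stored

def fn_space_product(X, labels, gamma=0.5):
    K_full = gram(X, gamma)
    K_apx, stored = bucket_approximation(X, labels, gamma)
    fn = np.linalg.norm(K_full - K_apx) / np.linalg.norm(K_full)   # assumed FN
    space_reduction = 1.0 - stored / K_full.size
    return fn * space_reduction   # the product; its curve over k is used to pick k

# toy usage: random buckets standing in for LSH buckets with 2^k = 8
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 10))
labels = rng.integers(0, 8, size=len(X))
print(fn_space_product(X, labels))
</syntaxhighlight>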

= Summer 2011 (TA) =
* Research: Gram matrix approximation

=== Aug 22 ===
* Fixing bugs in the Hilbert curve and affinity propagation code (still working on them).
* Revised the metric to be the FN.SpaceReduction product (inspired by the Energy.Delay product used in energy-efficient computing).
* cmpt300 final grading.

=== Aug 8 ===
* Working on comparing the LSH methods with the space-filling-curve methods.
** For the low-level comparison (matrix level, using the average FN): we get almost the same average FN for both Spectral Hashing and the H-curve.
** Comparison at the high level (clustering error):
*** I pick the optimal value of k for Spectral Hashing and choose a window width for the H-curve that gives the same approximated matrix size, using the formula winWidth = DataSize/2^k.
*** In affinity propagation, a small value of the self-similarity s(i,i) gives a small number of clusters and vice versa, so I sweep the self-similarity between its max and min to get the same number of clusters for both methods; this is the x-axis, and the y-axis is the average distance between each point and its exemplar (see the sketch below).
* PS: I was sick for a week after I came back from China and then busy setting up and grading assignments for cmpt300.
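A small sketch of the self-similarity sweep described above, using scikit-learn's affinity propagation, where the preference parameter plays the role of s(i,i); the toy data and the preference grid are illustrative only:

<syntaxhighlight lang="python">
# Sketch: sweeping the self-similarity s(i,i) ("preference") in affinity
# propagation to control the number of exemplars, then measuring the
# average distance from each point to its exemplar (the clustering error).
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))

# similarity = negative squared Euclidean distance (the usual AP choice)
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
S = -d2

for pref in np.linspace(S.min(), np.median(S), 5):
    ap = AffinityPropagation(affinity="precomputed", preference=pref,
                             random_state=0).fit(S)
    centers = ap.cluster_centers_indices_
    if centers is None or len(centers) == 0:       # AP did not converge
        continue
    # average distance from each point to its exemplar (clustering error)
    err = np.mean([np.sqrt(d2[i, centers[ap.labels_[i]]]) for i in range(len(X))])
    print(f"preference={pref:.1f}  exemplars={len(centers)}  avg dist={err:.3f}")
</syntaxhighlight>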

=== July 10 - July 24 ===
* Attending a workshop on [http://www.nudt.edu.cn/summerschool/eng/lessoninfo.html High Performance Computing] at the National University of Defense Technology, China.

=== July 8 ===
* Comparing our method, which uses Locality Sensitive Hashing, to the method using the Hilbert space-filling curve.
* Initial results can be found at this address: students/dameh/reports/dameh/summer11/H-CurveVSlsh.pdf

=== Jun 17 ===
* Report should be available [https://cs-nsl-svn.cs.surrey.sfu.ca/cssvn/nsl-members/dameh/summer11/doc/main.pdf here].

=== Jun 03 ===
* Proposed a new metric to be used instead of the FN ratio (more powerful for comparing different families and for picking the optimal value of k).
* More analysis of Spectral Hashing.
* Writing up the report; the results will be sent soon.

=== May 19 ===
* Spectral Hashing is data-dependent hashing and more powerful than random projection; see these [https://cs-nsl-svn.cs.surrey.sfu.ca/cssvn/nsl-members/dameh/summer11/doc/RandomVsSpectral.pdf comparison examples] in 2D.
* Random Projection Hashing (both families are sketched in code after this list):
** Method:
*** Take random projections of the data.
*** Quantize the projections into the bits of the hash value.
** Disadvantage:
*** Data-independent hashing.
** Advantages:
*** Strong theoretical guarantees.
*** Simple and fast to compute (linear time).
** Types:
*** Euclidean distance: two parameters to vary that have opposite effects (the number of random vectors k and the quantization width w).
*** L1 distance: uncontrolled; as k increases, the error increases.
*** Cosine distance: more controlled than L1 as k increases.
* Data-Dependent Hashing:
** Advantage:
*** Data-distribution dependent; more powerful than random projection.
** Disadvantage:
*** More complex to compute than random projection.
** Types:
*** Spectral Hashing:
**** Compute PCA.
**** Calculate the k smallest single-dimension analytical eigenfunctions.
**** Threshold the analytical eigenfunctions at zero to obtain the bits of the hash value.
*** KernelLSH:
**** The key idea is to first embed the data into a lower-dimensional space using the kernel G, and then do LSH in that lower-dimensional space.
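To make the two families concrete, here is a minimal sketch of sign random-projection hashing (the cosine-distance family) and of a crude data-dependent stand-in that simply thresholds PCA projections at zero. Real Spectral Hashing thresholds analytical eigenfunctions computed after PCA, which is not reproduced here; all parameters below are illustrative.

<syntaxhighlight lang="python">
# Sketch: two LSH-style hash functions producing k-bit codes.
# 1) sign random projection (data-independent, cosine-distance family)
# 2) thresholded PCA projections -- a simplified, data-dependent stand-in
#    for Spectral Hashing (the analytical-eigenfunction step is omitted).
import numpy as np

def random_projection_hash(X, k, seed=0):
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], k))     # k random directions
    return (X @ R > 0).astype(np.uint8)          # 1 bit per projection

def pca_threshold_hash(X, k):
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # top principal directions
    proj = Xc @ Vt[:k].T
    return (proj > 0).astype(np.uint8)           # threshold at zero

def to_buckets(bits):
    """Pack each k-bit row into an integer bucket id (2^k possible buckets)."""
    weights = 1 << np.arange(bits.shape[1])
    return bits @ weights

# toy usage
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 20))
k = 6
print(np.bincount(to_buckets(random_projection_hash(X, k)), minlength=2**k))
print(np.bincount(to_buckets(pca_threshold_hash(X, k)), minlength=2**k))
</syntaxhighlight>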

=== May 5 ===
* Recap
** Problem: the Gram matrix of kernel methods is O(N^2), which is infeasible in both computation and memory for large-scale data sets.
** Previous work:
*** Low-rank matrix approximation:
**** Using eigendecomposition: the optimal solution.
***** Problem: requires O(N^3) computation to construct the low-rank matrix.
***** Solution: use the Nyström theorem to approximate the eigendecomposition of the matrix, which requires O(m^2 N) where m << N.
***** Problem: despite reducing the computational complexity, these methods still need O(N^2) to store the matrix, so the next methods reduce both computation and memory.
*** Efficient implementations for computing the kernel matrix (both ideas below are sketched in code after this list):
**** Using spatial indexing:
***** Use Z-order, H-order, kd-trees, or other spatial indexing techniques to order the points in space, and then use a sliding window where the kernel function is computed only between points in that window.
***** Problems: requires O(N log N) preprocessing to order the points, fails for very high dimensional data, and is hard to parallelize.
***** Solution: we proposed a method that works well for large-scale and high-dimensional data and overcomes the drawbacks of all previous work; it is easy to parallelize, requires linear-time preprocessing, and reduces both computation and memory usage from quadratic to subquadratic:
**** Using spatial hashing (locality sensitive hashing): the kernel function and the clustering method work on each bucket individually.
***** LSH families:
****** Random projection:
******* Hamming distance, Euclidean distance, Cosine distance.
******* Problem with random projection: it assumes the data are uniformly distributed and does not take into account whether the data lie in a kernel space. Solutions:
****** Spectral Hashing: overcomes the problems of the previous families.
****** Kernelized LSH: problem: more complex to compute.
** In progress: comparing the different LSH families.
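To illustrate the two "efficient implementation" ideas above, a rough sketch of (a) a sliding-window kernel over a 1-D ordering of the points and (b) a bucket-wise kernel where values are computed only inside each hash bucket. This shows the general idea only, not the proposed method itself; the ordering (sorting by the first coordinate), the window width, and the random bucket labels are placeholders for the space-filling-curve order and the LSH buckets.

<syntaxhighlight lang="python">
# Sketch of the two sub-quadratic Gram-matrix strategies described above.
# (a) sliding window over a 1-D ordering of the points (the ordering here is
#     a placeholder standing in for a Z-order / Hilbert-order index);
# (b) bucket-wise computation, where kernel values are only computed
#     between points that share an LSH bucket.
import numpy as np

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def sliding_window_pairs(X, order, window):
    """Yield (i, j, k_ij) only for points within `window` positions in the ordering."""
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + 1 + window]:
            yield i, j, rbf(X[i], X[j])

def bucketwise_pairs(X, buckets):
    """Yield (i, j, k_ij) only for points falling in the same bucket."""
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]
        for a_pos, i in enumerate(idx):
            for j in idx[a_pos + 1:]:
                yield i, j, rbf(X[i], X[j])

# toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
order = np.argsort(X[:, 0])            # placeholder for a space-filling-curve order
buckets = rng.integers(0, 16, 200)     # placeholder for LSH bucket ids (2^k = 16)
print(sum(1 for _ in sliding_window_pairs(X, order, window=10)))
print(sum(1 for _ in bucketwise_pairs(X, buckets)))
</syntaxhighlight>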

= Spring 2011 (TA/RA) =
* Courses:
** CMPT 885: Special Topics in Computer Architecture
** CMPT 886: Special Topics in Operating Systems
* Research:
** Gram matrix approximation
** ...
 
 
=== Jan 24 ===
* Reading and understanding previous work on Gram matrix approximation:
** Nyström Method [http://portal.acm.org/ft_gateway.cfm?id=1194916&type=pdf&CFID=5834693&CFTOKEN=95499878] [http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=D118367961D24D13CCF1ACBC18F9BBDF?doi=10.1.1.18.7519&rep=rep1&type=pdf] (a minimal sketch of the idea follows)
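A minimal sketch of the basic Nyström idea from these papers: sample m << N landmark points, compute the N x m kernel block C and the m x m landmark block W, and approximate K ~ C W^+ C^T. The RBF kernel, uniform sampling, and the pseudo-inverse below are simplifications for illustration.

<syntaxhighlight lang="python">
# Sketch: basic Nystrom approximation of an RBF Gram matrix.
# Sample m << N landmark columns, then K ~= C W^+ C^T, where
# C is N x m and W is the m x m block between landmarks.
import numpy as np

def rbf_gram(A, B, gamma=0.5):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def nystrom(X, m, gamma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    landmarks = rng.choice(len(X), size=m, replace=False)
    C = rbf_gram(X, X[landmarks], gamma)        # N x m  (O(Nm) kernel evaluations)
    W = C[landmarks]                            # m x m block between landmarks
    return C @ np.linalg.pinv(W) @ C.T          # rank-<=m approximation of K

# toy check against the exact Gram matrix (only feasible for small N)
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 10))
K = rbf_gram(X, X)
K_apx = nystrom(X, m=50)
print(np.linalg.norm(K - K_apx) / np.linalg.norm(K))
</syntaxhighlight>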
 
 
  
  

= Fall 2010 (FAS-GF/RA) =
* Courses:
** CMPT 705: Design/Analysis Algorithms
** CMPT 820: Multimedia Systems
* Worked on Gram matrix approximation, High Performance/large scale application case studies


= Summer 2010 (TA) =
* Worked on large Scale Kernel methods


= Spring 2010 (RA) =
* Courses:
** CMPT 768: Sound Synthesis
* Worked on Locality sensitive hashing families and their applications and implementations