Revision as of 02:20, 12 October 2011

Fall 2011 (RA)

Oct 12

Process Very-High dimensional and Very-Large Scale real DataSets
- I chose document clustering, where each document i is commonly represented by a term-frequency vector xi = [x1i, x2i, ..., xDi], where D is the total number of keywords in the given document corpus, and xji is the number of occurrences of keyword j in document i , below N is the total number of documents.
DataSets (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words):
- NYTimes news articles:
  - orig source: ldc.upenn.edu
  - N=300000
  - D=102660

- PubMed abstracts:
  - orig source: www.pubmed.gov
  - N=8200000
  - D=141043

currently I process 1M documents from PubMEd with the dimesion 141043, total size of this DataSet on HD about 250GB.
To Do: compare the performance using two different LSH families.

Spet 28

Updated report is here:
- /nsl/students/dameh/reports/doc/main.pdf
updated sections :
- IV. PERFORMANCE METRICS
- VI. RESULTS

Sept 12

Comparison Results for H-curve, LSH-Spectral hashing and LSH-Random Projection:
Metrics used for comparison:
- FN.SpaceReduction product Metric : Gram matrix level comparison which takes in account both the FN and the space reduction for the approximated gram matrix (as FN decreases space reduction increases)
  - to unify the comparison we set H-curve window_width = DataSets_size/2^k. where k is the number of the LSH hash bits (number of buckets 2^k), in this case we get same matrix size for both H-curve and LSH.
  - this metric gives us the optimal value of k (maximum).
  - LSH-Spectral hashing over performs LSH-Random projection. H-curve over performs both.
  - Graph results can be found at these addresses :
    - FN for different values of k: dameh/reports/dameh/fall11/FN.pdf
    - gram matrix size for different values of k: dameh/reports/dameh/fall11/Size.pdf
    - FN.SpaceReduction : dameh/reports/dameh/fall11/FSprod.pdf
- Average distance between each point and its Exemplar: which is clustering error metric
  - to unify the comparison we are comparing Avg distance against the number of Exemplars, (we change a parameter in the affinity propagation to control number of Exemplars, or its an initialization parameter for k-means clustering)
  - again, LSH-Spectral hashing over performs LSH-Random projection, and LSH-Spectral hashing over performs H-curve only for smaller number of clusters
  - Graph result can be found at this address: dameh/reports/dameh/fall11/ClusteringError.pdf
- over all, the advantage of LSH over H-cuve is the parallel property of LSH, where we can get a speedup up to 2^k for the LSH (each node takes care of each bucket)

Summer 2011 (TA)

Research: Gram Matrix approximation

Aug 22

fixing bugs in the Hilbert curve and the affinity propagation codes (still working on them)
revised the metric to be FN.SpaceRedcution product (Inspired from the Energy.Delay product used in the Energy-Efficient Computing )
cmpt300 Final grading

Aug 8

working on comparing LSH methods with Space filling Curve methods.
- for the low level comparing (matrix level using average FN): we get almost same average FN for both spectral hashing and H-curve.
- comparing in the high level (clustering error):
  - I am picking the optimal value of k for spectral hashing and choosing a widow width for H-curve that gives us same approximated matrix size, using the formula (winWidth=DataSize/2^k)
  - In the affinity propagation, using small value of self similarity (s(i,i)) we get small numbers of clusters and vise verse , so I am changing self similarity values between max and min to get same number of clusters for both methods, which will be the x-axis, on the other hand, the y-axis will be the average distance between each point and its Exemplar.
PS: I was sick for a week after I came back form China and then busy in setting up and grading assignments for cmpt300.

July 10 - July 24

Attending workshop on High Performance Computing at National University of Defense Technology, China.

July 8

Comparing our method using the "Locality Sensitive Hashing" to the method using "Hilbert Space filling Curve"
Initial results can be found at this address students/dameh/reports/dameh/summer11/H-CurveVSlsh.pdf

Jun 17

Report should be available here

Jun 03

Proposed new metric to be used instead of the FN ratio (more powerful to compare different families and to pick the optimal value of k)
More analysis on the spectral hashing
Writing up the report with the results to be sent soon

May 19

Spectral Hashing is Data dependent hashing and more powerful than the random projection, see this comparison examples in the 2D.
Random Projection Hashing
- Method:
  - Take random projections of data.
  - Quantize the projections to hash value’s bits.
- Disadvantage:
  - Data Independent Hashing.
- Advantage:
  - Strong theoretical guarantees.
  - Simple and fast to compute (linear time).
- Types:
  - Euclidean distance: two parameters to vary that have the opposite affect (number of random vectors k and quantization width w).
  - L1 distance: uncontrolled, as k increase error increase.
  - Cosine distance: more controlled than L1 as k increase.
Data-Dependent Hashing
- Advantage:
  - Data-distribution dependent; powerful than Random Projection.
- Disadvantage:
  - More complex to compute than Random Projection.
- Types:
  - Spectral hashing:
    - Compute PCA.
    - Calculate the k smallest single-dimension analytical EigenFunctions.
    - Threshold the analytical EigenFnction of zero to obtain hash value's bits.
  - KernelLSH:
    - The key idea is first embed the data into a lower space using the kernel G and then do LSH in the lower dimensional space.

May 5

Recap
- Problem: kernel methods' gram matrix is O(N^2), which is computationally and memory usage infeasible for large-scale datasets
- Previous Work:
  - Low rank matrix approximation:
    - using EigenDecomposition: Optimal solution
      - Problem: requires O(N^3) computation to construct the low rank matrix
      - Solution: use Nytsrom theorem to approximate the EigenDecomposition of the matrix which requires O(m^2N) where m << N.
      - Problem: In this methods in despite of reducing the computation complexity still we need O(N^2) to store the matrix, so the next methods reduce both computation and memory.
  - Efficient implementation for computing the kernel matrix:
    - Using Spatial Indexing:
      - using Z-order, H-order, kd-tress and other spatial indexing techniques to order the points in the space and then using sliding window where the kernel function computed only between points in that window.
      - Problems: requires O(NlogN) as preprocessing to order the points, fails for very high dimensional Data and hard to parallel
      - Solution: We proposed a method that works fine for large-scale and highly dimensional data that overcomes all pervious works drawbacks, easy to parallel, requires linear time preprocessing and reduce both computation and memory usage down form quadratic to subquadratic:
    - Using Spatial Hashing or Locality sensitive hashing: where the kernel function and the cluster methods works on the buckets individually.
      - LSH families :
        Random projection:
        Hamming distance, Euclidean Distance, Cosine Distance
        
        Problem for random projection, assume data uniformly distrubuted and it doesn’t take in account if data are in a kernel space, Solution:
        
        Spectral Hashing: overcomes previous families problems
        
        Kernelized LSH : Problem: more complex to compute
- In progress: Comparing different LSH families

Spring 2011 (TA/RA)

Courses:
- CMPT 885: Special Topics in Computer Architecture
- CMPT 886: Special Topics in Operating Systems

Fall 2010 (FAS-GF/RA)

Courses:
- CMPT 705: Design/Analysis Algorithms
- CMPT 820: Multimedia Systems

Worked on Gram matrix approximation, High Performance/large scale application case studies

Summer 2010 (TA)

Worked on large Scale Kernel methods

Spring 2010 (RA)

Courses:
- CMPT 768: Sound Synthesis

Worked on Locality sensitive hashing families and their applications and implementations

@@ Line 18: / Line 18: @@
 * currently I process 1M documents from PubMEd with the dimesion 141043, total size of this DataSet on HD about 250GB.
 * To Do: compare the performance using two different LSH families.
 === Spet 28 ===

Difference between revisions of "Private:progress-dameh"

Revision as of 02:20, 12 October 2011

Contents

Fall 2011 (RA)

Oct 12

Spet 28

Sept 12

Summer 2011 (TA)

Aug 22

Aug 8

July 10 - July 24

July 8

Jun 17

Jun 03

May 19

May 5

Spring 2011 (TA/RA)

Fall 2010 (FAS-GF/RA)

Summer 2010 (TA)

Spring 2010 (RA)