Fall 2011 and Spring 2012 (RA)

Jan 27

Implementing my method on cluster, report to be sent soon.

Dec 23

Finished the study of different methods, the analysis and the implementation on a single machine.

Nov 9

Scaled the Spectral Hashing.
Working on the US Census Data (1990) [number of features (dimensions) = 69 ]
run experiments on small DataSet (4k) using the 3 methods LSH Random Projection, LSH Spectral Hashing and Hilbert Curve
now running on large DataSet (1M) using LSH only (as H-curve doesn't work on large Scale)
experiment results to be added to the report

Oct 26

I am facing problems working on very high dimensional data (d~100000) [curse of dimensionality]
- I can't use small dataSets (1k points) to compare different methods (Hilbert curve fails for high dimensional but LSH still works)
- The concept of distance becomes less precise as the number of dimensions grows.
- There are other techniques to cluster the high-dimensional data (no usage of gram matrix) such as Subspace Clustering, Projected Clustering and Correlation Clustering.
I start working on different real DataSets with low dimension (~100)
Working on Scaling the Spectral Hashing by scaling the Principle Component Analysis function on matlab

Oct 12

Process Very-High dimensional and Very-Large Scale real DataSets
- I use document clustering, where each document i is commonly represented by a term-frequency vector xi = [x1i, x2i, ..., xDi], where D is the total number of keywords in the given document corpus, and xji is the number of occurrences of keyword j in document i. Keywords correspond to the words that remain after the pre-processing of stop-words removal and words stemming. Below N is the total number of documents.
DataSets: source: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words:
- NYTimes news articles:
  - orig source: ldc.upenn.edu
  - N=300000
  - D=102660
- PubMed abstracts: (PubMed comprises more than 21 million citations for biomedical literature from MEDLINE, life science journals, and online books)
  - orig source: www.pubmed.gov
  - N=8200000
  - D=141043
Currently I process 1M documents from PubMed with the dimesion 141043, total size of this DataSet on HD is about 250GB.
To Do: compare the performance using two different LSH families.

Spet 28

Updated report is here:
- /nsl/students/dameh/reports/doc/main.pdf
updated sections :
- IV. PERFORMANCE METRICS
- VI. RESULTS

Sept 12

Comparison Results for H-curve, LSH-Spectral hashing and LSH-Random Projection:
Metrics used for comparison:
- FN.SpaceReduction product Metric : Gram matrix level comparison which takes in account both the FN and the space reduction for the approximated gram matrix (as FN decreases space reduction increases)
  - to unify the comparison we set H-curve window_width = DataSets_size/2^k. where k is the number of the LSH hash bits (number of buckets 2^k), in this case we get same matrix size for both H-curve and LSH.
  - this metric gives us the optimal value of k (maximum).
  - LSH-Spectral hashing over performs LSH-Random projection. H-curve over performs both.
  - Graph results can be found at these addresses :
    - FN for different values of k: dameh/reports/dameh/fall11/FN.pdf
    - gram matrix size for different values of k: dameh/reports/dameh/fall11/Size.pdf
    - FN.SpaceReduction : dameh/reports/dameh/fall11/FSprod.pdf
- Average distance between each point and its Exemplar: which is clustering error metric
  - to unify the comparison we are comparing Avg distance against the number of Exemplars, (we change a parameter in the affinity propagation to control number of Exemplars, or its an initialization parameter for k-means clustering)
  - again, LSH-Spectral hashing over performs LSH-Random projection, and LSH-Spectral hashing over performs H-curve only for smaller number of clusters
  - Graph result can be found at this address: dameh/reports/dameh/fall11/ClusteringError.pdf
- over all, the advantage of LSH over H-cuve is the parallel property of LSH, where we can get a speedup up to 2^k for the LSH (each node takes care of each bucket)

Summer 2011 (TA)

Research: Gram Matrix approximation

Aug 22

fixing bugs in the Hilbert curve and the affinity propagation codes (still working on them)
revised the metric to be FN.SpaceRedcution product (Inspired from the Energy.Delay product used in the Energy-Efficient Computing )
cmpt300 Final grading

Aug 8

working on comparing LSH methods with Space filling Curve methods.
- for the low level comparing (matrix level using average FN): we get almost same average FN for both spectral hashing and H-curve.
- comparing in the high level (clustering error):
  - I am picking the optimal value of k for spectral hashing and choosing a widow width for H-curve that gives us same approximated matrix size, using the formula (winWidth=DataSize/2^k)
  - In the affinity propagation, using small value of self similarity (s(i,i)) we get small numbers of clusters and vise verse , so I am changing self similarity values between max and min to get same number of clusters for both methods, which will be the x-axis, on the other hand, the y-axis will be the average distance between each point and its Exemplar.
PS: I was sick for a week after I came back form China and then busy in setting up and grading assignments for cmpt300.

July 10 - July 24

Attending workshop on High Performance Computing at National University of Defense Technology, China.

July 8

Comparing our method using the "Locality Sensitive Hashing" to the method using "Hilbert Space filling Curve"
Initial results can be found at this address students/dameh/reports/dameh/summer11/H-CurveVSlsh.pdf

Jun 17

Report should be available here

Jun 03

Proposed new metric to be used instead of the FN ratio (more powerful to compare different families and to pick the optimal value of k)
More analysis on the spectral hashing
Writing up the report with the results to be sent soon

May 19

Spectral Hashing is Data dependent hashing and more powerful than the random projection, see this comparison examples in the 2D.
Random Projection Hashing
- Method:
  - Take random projections of data.
  - Quantize the projections to hash value’s bits.
- Disadvantage:
  - Data Independent Hashing.
- Advantage:
  - Strong theoretical guarantees.
  - Simple and fast to compute (linear time).
- Types:
  - Euclidean distance: two parameters to vary that have the opposite affect (number of random vectors k and quantization width w).
  - L1 distance: uncontrolled, as k increase error increase.
  - Cosine distance: more controlled than L1 as k increase.
Data-Dependent Hashing
- Advantage:
  - Data-distribution dependent; powerful than Random Projection.
- Disadvantage:
  - More complex to compute than Random Projection.
- Types:
  - Spectral hashing:
    - Compute PCA.
    - Calculate the k smallest single-dimension analytical EigenFunctions.
    - Threshold the analytical EigenFnction of zero to obtain hash value's bits.
  - KernelLSH:
    - The key idea is first embed the data into a lower space using the kernel G and then do LSH in the lower dimensional space.

May 5

Recap
- Problem: kernel methods' gram matrix is O(N^2), which is computationally and memory usage infeasible for large-scale datasets
- Previous Work:
  - Low rank matrix approximation:
    - using EigenDecomposition: Optimal solution
      - Problem: requires O(N^3) computation to construct the low rank matrix
      - Solution: use Nytsrom theorem to approximate the EigenDecomposition of the matrix which requires O(m^2N) where m << N.
      - Problem: In this methods in despite of reducing the computation complexity still we need O(N^2) to store the matrix, so the next methods reduce both computation and memory.
  - Efficient implementation for computing the kernel matrix:
    - Using Spatial Indexing:
      - using Z-order, H-order, kd-tress and other spatial indexing techniques to order the points in the space and then using sliding window where the kernel function computed only between points in that window.
      - Problems: requires O(NlogN) as preprocessing to order the points, fails for very high dimensional Data and hard to parallel
      - Solution: We proposed a method that works fine for large-scale and highly dimensional data that overcomes all pervious works drawbacks, easy to parallel, requires linear time preprocessing and reduce both computation and memory usage down form quadratic to subquadratic:
    - Using Spatial Hashing or Locality sensitive hashing: where the kernel function and the cluster methods works on the buckets individually.
      - LSH families :
        Random projection:
        Hamming distance, Euclidean Distance, Cosine Distance
        
        Problem for random projection, assume data uniformly distrubuted and it doesn’t take in account if data are in a kernel space, Solution:
        
        Spectral Hashing: overcomes previous families problems
        
        Kernelized LSH : Problem: more complex to compute
- In progress: Comparing different LSH families

Spring 2011 (TA/RA)

Courses:
- CMPT 885: Special Topics in Computer Architecture
- CMPT 886: Special Topics in Operating Systems

Fall 2010 (FAS-GF/RA)

Courses:
- CMPT 705: Design/Analysis Algorithms
- CMPT 820: Multimedia Systems

Worked on Gram matrix approximation, High Performance/large scale application case studies

Summer 2010 (TA)

Worked on large Scale Kernel methods

Spring 2010 (RA)

Courses:
- CMPT 768: Sound Synthesis

Worked on Locality sensitive hashing families and their applications and implementations

Private:progress-dameh

Contents

Fall 2011 and Spring 2012 (RA)

Jan 27

Dec 23

Nov 9

Oct 26

Oct 12

Spet 28

Sept 12

Summer 2011 (TA)

Aug 22

Aug 8

July 10 - July 24

July 8

Jun 17

Jun 03

May 19

May 5

Spring 2011 (TA/RA)

Fall 2010 (FAS-GF/RA)

Summer 2010 (TA)

Spring 2010 (RA)