
Publication

Efficient Learning for Hashing Proportional Data

Zhao Xu; Kristian Kersting; Christian Bauckhage
In: Mohammed Javeed Zaki; Arno Siebes; Jeffrey Xu Yu; Bart Goethals; Geoffrey I. Webb; Xindong Wu (Eds.). 12th IEEE International Conference on Data Mining (ICDM-2012), December 10-13, Brussels, Belgium, pages 735-744, IEEE Computer Society, 2012.

Abstract

Spectral hashing (SH) seeks compact binary codes of data points so that Hamming distances between codes correlate with data similarity. Quickly learning such codes typically boils down to principal component analysis (PCA). However, PCA is only justified for normally distributed data, and proportional data (normalized histograms) are not normally distributed: due to the sum-to-unity constraint, features that are as independent as possible will not all be uncorrelated. In this paper, we show that a linear-time transformation efficiently copes with sum-to-unity constraints: first, we select a small number K of diverse data points by maximizing the volume of the simplex spanned by these prototypes; second, we represent each data point by means of its cosine similarities to the K selected prototypes. This maximum volume hashing is sensible since each dimension in the transformed space is likely to follow a von Mises (vM) distribution, and, in very high dimensions, the vM distribution closely resembles a Gaussian distribution. This justifies employing PCA on the transformed data. Our extensive experiments validate this: maximum volume hashing outperforms spectral hashing and other state-of-the-art techniques.
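To make the two-step transformation concrete, here is a minimal sketch in Python/NumPy. It is not the authors' implementation: the greedy determinant-based prototype selection (a crude proxy for maximizing the simplex volume), the toy Dirichlet-generated histograms, and the use of scikit-learn's PCA with sign thresholding for the final binary codes are all illustrative assumptions.

```python
# Sketch of maximum volume hashing as described in the abstract (assumptions noted above).
import numpy as np
from sklearn.decomposition import PCA

def select_prototypes(X, K):
    """Greedily pick K rows of X that approximately maximize the volume
    spanned by the chosen prototypes, scored via the Gram determinant."""
    n = X.shape[0]
    # Seed with the point of largest norm.
    selected = [int(np.argmax(np.linalg.norm(X, axis=1)))]
    for _ in range(1, K):
        best_idx, best_vol = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            S = X[selected + [i]]
            # det(S S^T) is the squared volume of the parallelotope spanned
            # by the candidate set; we maximize it greedily (heuristic).
            vol = np.linalg.det(S @ S.T)
            if vol > best_vol:
                best_idx, best_vol = i, vol
        selected.append(best_idx)
    return X[selected]

def cosine_features(X, prototypes):
    """Represent each point by its cosine similarities to the K prototypes."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Pn = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return Xn @ Pn.T

# Toy proportional data: each row is a normalized histogram (sums to one).
rng = np.random.default_rng(0)
X = rng.dirichlet(alpha=np.ones(50), size=200)

K = 8
prototypes = select_prototypes(X, K)
Z = cosine_features(X, prototypes)                  # transformed representation
codes = PCA(n_components=4).fit_transform(Z) > 0    # sign of PCA projections as bits
```

The point of the transformation is that the cosine-similarity coordinates are better suited to a Gaussian assumption than the raw simplex-constrained histograms, so the standard PCA-based hashing step can be applied to Z without modification.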
