Circulant Binary Embeddings

\[ \def\gg{\mathbf{g}} \def\xx{\mathbf{x}} \def\yy{\mathbf{y}} \def\mD{\mathbf{D}} \def\mG{\mathbf{G}} \def\mW{\mathbf{W}} \def\mX{\mathbf{X}} \def\mY{\mathbf{Y}} \def\sgn{\mathrm{sgn}} \def\RR{\mathbb{R}} \def\cF{\mathcal{F}} \def\cN{\mathcal{N}} \def\cO{\mathcal{O}} \def\cS{\mathcal{S}} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\me}{e} \]

The research team at SigOpt works to provide the best Bayesian optimization platform for our customers. In our spare time, we also engage in research projects in a variety of other fields. This blog post highlights one of those recent projects which will be presented Tuesday February 6 at AAAI 2018 (2:00pm ET, Poster 1127, Jefferson room). For those who cannot attend that session, we discuss here the topic of embeddings of vectors and the computational gains available when using circulant binary embeddings.

General Embeddings

Embedding a vector consists of mapping it to a different space, with the goal of studying it in a lower dimension, finding a more fruitful representation of the vector, or transforming it to a structured space with desirable properties. Generally, an embedding can be expressed as a continuous function $f$ \begin{align*} f: \mX \to \mY \end{align*} where $\mX \in \RR^{d \times N}$ and $\mY \in \RR^{k \times N}$. Note that the input and output spaces are matrices with $N$ columns, to account for the fact that generally we are interested in embedding several vectors simultaneously. An example embedding might be a linear embedding, which would be written as \begin{align*} \mY = \mW^\top \mX \end{align*} where $\mW \in \RR^{d \times k}$. The figures below show some data in 2 dimensions and sample embeddings in 1 dimensions.


Figure 1: Linear embeddings of ${\RR^2 \to \RR}$, i.e., $d=2$ and $k=1$.
(left) Clusters of blue, red and orange points are embedded into spaces defined by the solid, dashed and dotted lines through orthogonal projection.
(right) The associated one dimensional embeddings, with a single tick mark to indicate the embedded value $y=0$.
(top right) $\mW=\begin{pmatrix}1\\0\end{pmatrix}$ (center right) $\mW=\begin{pmatrix}0\\1\end{pmatrix}$ (bottom right) $\mW=\begin{pmatrix}0.316\\0.949\end{pmatrix}$



Figure 2: The Chebyshev nodes (also called the Chebyshev-Gauss-Lobatto points) are an embedding of points (in blue) on a circle onto a diameter (in red) of that circle. These points are especially useful in approximation theory.


In applications such as hashing, it is commonly beneficial to consider random embeddings of vectors. The distribution of outcomes for randomly chosen embeddings on a set of input data can say something about the structure of the data.(1) Similarly, certain properties might be preseved in the randomly embedded versions of vectors with high probability, such as a sense of distance. The figure below demonstrates that random embeddings can preserve a sense of relative distance with high probability.


Figure 3: Random embeddings can preserve relative distance with high probability. (left) Four vectors in $\RR^2$. (right) Normalized histograms showing the densities of distances in embedded space between the vectors denoted in the upper right corner of each histogram. Clearly, there is much greater probability that the embedded versions of red and orange will be closer than of orange and black (roughly 99%).

Binary Embeddings

While the idea of embedding vectors in lower dimensional space is a valuable proposition, akin to dimensionality reduction, the idea most commonly finds its home in embedding vectors in a binary space. Essentially, the goal is to encode data in $\RR^d$ as a $k$-length binary vector with values in $\{-1, 1\}^k$.(2) The standard assumption is that all vectors have 2-norm equal to 1, and are on the $d-1$-sphere $\cS^{d-1}$.

This mapping is nonlinear, and thus the function defining it must involve more than a matrix-vector product. A standard mapping consists of applying the sign function to the result of the linear mapping \begin{align*} h(\xx) = \sgn(\mW^\top \xx). \end{align*} Here, $\sgn$ acts in an elementwise fashion. The figure below shows some sample binary embeddings of 2D vectors (which appear on the unit circle because they have been normalized.


Figure 4: Points on $\cS^1$, i.e., the unit circle, and some associated binary embeddings. (left) $\mW=\begin{pmatrix}0.316\\0.949\end{pmatrix}$ splits the points on the circle into having a 1-bit encoding. (center) $\mW=\begin{pmatrix}1&0\\0&1\end{pmatrix}$ produces the $k=2$ embedding based on the signs of the $x$, $y$ coordinates. (right) $\mW=\begin{pmatrix}1&0&0.316\\0&1&0.949\end{pmatrix}$ produces a 3-bit encoding consisting of the concatenation of the other two.


This concept of random projections to create binary embeddings can provide a valuable mechanism for storing and operating on data efficiently (here is a great article discussing the topic). Additionally, important theory exists regarding the quality of such embeddings by considering the relationship between distances in the physical space and the embedded space (related to the idea expressed in Figure 2).

The proximity of two vectors $\xx_i,\xx_j\in\cS^{d-1}$ can be judged most naturally by considering the angle between them, denoted $\theta_{\xx_i,\xx_j}$.(3) Probably the most natural way to talk about the proximity of two vectors in $\{-1,1\}^k$ is with the Hamming distance, $d_H(h(\xx_i), h(\xx_j))$, which measures the number of non-matching bits. (Jacques et al. 2013) showed that there is an important relationship between these two senses of distance for random binary embeddings.

Theorem 1. Given $0\lt\epsilon\lt1$ and $n$ vectors $\xx_i \in \mathcal{S}^{d - 1}$ for $1\leq i\leq n$, assume $k = \cO(\epsilon^{-2} \log n)$. Then, with probability at least $1 - \me^{-c \epsilon^2 k}$, \begin{align*} \left| d_{\textrm{H}}(h(\xx_i), h(\xx_j)) - \frac{1}{\pi}\theta_{\xx_i, \xx_j} \right| \leq \epsilon, \end{align*} for some constant ${c > 0}$.

This theorem says that, with high probability, vectors that are close in physical space will be close in embedded space. This is a beneficial result because computations in the $k$-dimensional bit vector space are often much more efficient than in the original space. Some analyses on data, such as clustering, which rely on distance computations benefit from this efficiency. The condition $k = \cO(\epsilon^{-2} \log n)$ is often referred to as the bit complexity: the number of bits required to achieve some desired accuracy in the binary space.

The computation of these embeddings is, however, a potentially costly process, primarily because a $\mW\in\RR^{d \times k}$ matrix full of (generally) normal random values needs to be created and $\mW^\top\mX$ must be computed. If computing the embedding costs more than the savings of working with the embedded version of the data, then there is a restricted benefit.(4) Initial attempts to attack this problem involved bilinear projections, along with other structured matrix strategies, to reduce the computational and/or storage cost.

Circulant Binary Embeddings

As indicated in the title, our research focused on enforcing a circulant structure on $\mW$ for efficient computation in a strategy named, unsurprisingly, circulant binary embeddings. The original article (Yu et al. 2014) presented the idea of restricting the complexity of the embedding matrix to $\mW^\top=\mG\mD$(5) where

  • $\mD\in\RR^{d\times d}$ is a diagonal matrix with $\pm1$ randomly chosen on the diagonal (a Rademacher sequence), and
  • $\mG\in\RR^{d\times d}$ is a circulant matrix defined from the vector $\gg$, which has standard normal distributed vectors.(6)

A circulant matrix $\mG\in\RR^{d\times d}$ is a special type of Toeplitz matrix defined using $\gg\in\RR^d$ such that \[ \gg=\begin{pmatrix}g_1\\g_2\\\vdots\\g_d\end{pmatrix},\qquad \mG=\begin{pmatrix}g_1&g_d&\cdots&g_2 \\ g_2&g_1&\cdots&g_3 \\ \vdots&\vdots&\ddots&\vdots \\ g_d&g_{d-1}&\cdots&g_1\end{pmatrix}. \] Circulant matrices have the benefit of being both cheaper to store and only requiring $\cO(d\log d)$ to compute matrix-vector products rather than $\cO(d^2)$. This can be done using the fast Fourier transform: $\mG\xx=\cF^{-1}\left(\cF(\gg)\odot\cF(\xx)\right)$, where $\odot$ denotes the elementwise product of two vectors. The full binary embedding function, then, is \[ h_{\text{circ}}(\xx) = \sgn\left(\cF^{-1}\left(\cF(\gg)\odot\cF(\mD\xx)\right)\right). \]

Our Article: Optimality of Circulant Binary Embeddings

The benefit of working with circulant matrices comes at a cost, however, because the $\mW$ matrix no longer has $kd$ independent random variables; its values are dependent because of the relationship between the columns of a circulant matrix. As a result, the Theorem 1 described above no longer directly applies.

One standard condition that is required for analysis of circulant binary embeddings is a small norm of all the vectors $\|\xx_i\|_\infty$. The articles below use that, and other assumptions, to study the impact of the reduction in random complexity from the unstructured case. They show that a greater bit complexity is needed to achieve the same probabilistic bound.

  • (Yu et al. 2015) was able the prove that $k=\cO(\epsilon^{-2}\log^2 n)$ would be sufficient.
  • (Oymak 2016) improved on that result to require only $k=\cO(\epsilon^{-3}\log n)$ many bits for the same accuracy.

In our presentation at AAAI (which can be viewed here) we prove that the optimal bit complexity $k=\cO(\epsilon^{-2}\log n)$ (associated with an unstructured $\mW$ in Theorem 1) can be achieved for circulant binary embeddings, subject to the same smallness of the $\xx_i$ vectors needed for the other bounds. In the figures below we show the error described in Theorem 1 decreasing as a function of $k$ and one application where the binary embedding can be used efficiently.


Figure 5: The errors (the difference in distance in physical and embedded space) of the unstructured and circulant binary embeddings decay like $k^{-1/2}$. This matches the expectation from Theorem 1. This experiment considered 10000 vectors with uniform random entries in a 10000 dimension space.



Figure 6: We consider $n=1000$ vectors in $d=512$ that have been drawn from 5 normally distributed clusters. We then consider the 100 nearest vectors in physical space as well as in the embedded space for increasingly large bit complexity $k$. We plot the average number of nearest neighbors found in embedded space that also appear in the original space, as well as the compression attainable by storing the vectors with a 1-byte data type rather than 8-bytes.


Conclusion

As computational tools for machine learning and data science evolve, it will become increasingly necessary to develop intelligent and efficient ways to manage and access data. Circulant binary embeddings are one of many strategies which can be used to compress data while retaining some measure of its original properties. Here at SigOpt, we are interested in understanding and contributing to research on this and other topics, as well as getting out to conferences and learning about the latest and greatest ideas primed to change the world. I hope you will stop by and see my (Jungtaek Kim) presentation at AAAI on Tuesday!

(1) This same logic of applying random matrix operations and studying the distribution of outcomes can play a useful role in estimating certain matrix quantities. One of my favorite articles from graduate school discussed just such an idea.

(2) The use of the term binary in computer science often refers to $\{0, 1\}$, not $\{-1, 1\}$, but the literature on binary embeddings sometimes uses the latter to simplify the notation involving the sign function.

(3) Some might argue that the most natural way to talk about the proximity of two vectors $\xx,\yy\in\cS^{d-1}$ is to just treat them as though they are in $\RR^d$ and measure the 2-norm (Euclidean distance). This is related to the angle, though, because $\|\xx\|=\|\yy\|=1$: $\|x-y\|^2 = 2(1 - \cos \theta_{\xx,\yy})$.

(4) That is not to say that there is no benefit. Research has shown the value in conducting feature extraction/analysis in the binary space, which may be beneficial regardless of the cost in acquiring it.

(5) Note that, in this circulant setting, $\mW^T\xx\in\RR^d$ so, assuming only $k \leq d$ bits are required, then any $k$ entries can be randomly chosen from the output to produce the $k$ dimensional embedding. This is the standard setting associated with binary embeddings, but if $k \gt d$ bits are required then multiple circulant embeddings can be computed until $k$ bits become available. Or, the vectors $\xx$ can be extended with all zeros to reach the desired bit length.

(6) The analysis in our article revolves around results involving matrices with normally distributed random components, but binary embeddings can be more coherently designed. (Yu et al. 2015), among other articles, discusses strategies for optimal binary embeddings designed to effectively embed vectors according to certain criteria.

Jungtaek is a Ph.D. student in computer science at POSTECH in South Korea. He is studying Bayesian optimization, hyperparameter optimization and automated machine learning. Jungtaek is working as an intern at SigOpt in the spring 2018 to gain experience in the San Francisco startup scene.
Mike McCourt, PhD