The Paper

Steffen Rendle introduced Factorization Machines a while ago in 2010, but I have only more recently come across it.

There is only a very short proof provided of Lemma 3.1 (page 3): “The model equation of a factorization machine can be computed in linear time .”

I decided to derive the proof myself and found some things interesting along the way. I wrote down an earful hoping that someone else finds it helpful!

The Proof

Let be a function with commutative operator (such as multiplication, dot product, Hadamard product, or simultaneously diagonalizable matrix multiplication 1) that follows the decomposition:

There are many cases in computer science where we want to compute something in a pairwise fashion between all and . One such case is linear combination of pairwise effects for the function . A naive description of the component wise operations that are the “product” between two elements can be computed as follows:

A standard calculation requires operations. However, if is also distributive over summation, we can expand and factorize this computation in operations. Let’s begin! $$ \sum\limits_{i=1}^n \sum\limits_{j=1}^n f(i, j) = u(0) \circ \left(\sum\limits_{i=1}^n s(i) \right)

  • u(1) \circ \left(\sum\limits_{i=1}^n s(i) \right)
  • \ldots + u(n) \circ \left(\sum\limits_{i=1}^n s(i) \right) = \left(u(0) + u(1) + \ldots + u(n) \right) \circ \left(\sum\limits_{i=1}^n s(i) \right) = \left(\sum\limits_{i=1}^n u(i) \right) \circ \left(\sum\limits_{i=1}^n s(i) \right) $$

This is really, really awesome. We have reduced from operations and operations to operations and only one lonely operation. This has reduced an unnecessary loop over , illustrating that this problem is actually less complex than it originally appeared 2.

Is this intuitive? I think so. Remember in elementary school, where it was way more fun to compute something like than ? This is a similar thing, where you win by factorizing. In this case, we can further cheat becomes it becomes clear when you write the expansion that is not a necessary variable. This is a case where pairwise interactions can factor out due to the combination of commutative and distributive properties of in such a beautiful way so we don’t care what the value of is. Amazing!

For a bit of a brain tease, let , . This would be like having each pair joust for highest value, and then taking the smallest of these values (perhaps a quasi-winner take all?): Due to our factorization proof, it would mean it is equivalent to computing: I find this somewhat intuitive; the largest simply depends on either the row or column value, not the combination.

What if we don’t care about self similarity ? We can have a smart factorization that significantly reduces the number of computations when self similarity is unimportant.

Due to the commutative property of , the operations are symmetric. The sum can be thought of as summing over all entries for a matrix of the results :


Due to the triangular nature of this matrix, we can separate into upper triangular, lower triangular, and diagonal matrices:

Due to the commutative nature of :

We can then decompose

Or rearranged:

Therefore, if all you care about is just the upper triangular:

Lets go back to a more verbose, lower level representation expanding the sums as follows:

We can simplify the values of the diagonal in :

Putting it all together, we get can compute the upper triangular pairwise interactions as follows:

As an example, when is and , this looks like the following:

There is something very beautiful about this transformation. It says that summing the upper triangular elements computed by can be decomposed into a one dimensional function that only iterates over the number of components .

Why is this useful?

Basically factorization reduces redundant computation.

When performs some really expensive operation like a dot product upon all pairwise combinations of elements, computing can be simplified to , where is the length of the vector.

This is exactly the case used in Factorization Machines, where we create a latent space from an input , . When modeling real world data, probably has some sublinear dependency on , perhaps . That’s much better than the naive !

In addition, this simplification is used as a tool to create other proofs, like showing that Spearman’s coefficient is a particular case of the general correlation coefficient.

  1. If anyone knows real world cases for simultaneously diagonalizable matrices, that would be pretty neat. 

  2. Philosophically, I think the Kolmogorov complexity of algorithms is related to the depth of conditional statements. I don’t have a strong proof of this but it feels like that each conditional is another layer in a state diagram for a program, and the number of states of the state diagram being some measure of Kolmogorov complexity. Here we have reduced a layer of conditional statements that would occur with the loop operation, meaning this has lower Kolmogorov complexity.