Descriptive Proof of Factorization with Factorization Machines
The Paper
Steffen Rendle introduced Factorization Machines back in 2010, but I have only recently come across them.
Only a very short proof is provided for Lemma 3.1 (page 3): “The model equation of a factorization machine can be computed in linear time $O(kn)$.”
I decided to derive the proof myself and found some interesting things along the way. I wrote down an earful here hoping that someone else finds it helpful!
The Proof
Let $f$ be a function with a commutative operator $\circ$ (such as multiplication, dot product, Hadamard product, or simultaneously diagonalizable matrix multiplication[^1]) that follows the decomposition:

$$f(i, j) = u(i) \circ s(j)$$
There are many cases in computer science where we want to compute something in a pairwise fashion between all $i$ and $j$. One such case is a linear combination of pairwise effects of the function $f$. The naive computation of all the pairwise “products” between elements can be written as follows:

$$\sum\limits_{i=1}^n \sum\limits_{j=1}^n f(i, j) = \sum\limits_{i=1}^n \sum\limits_{j=1}^n u(i) \circ s(j)$$
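As a minimal sketch of this naive approach (assuming plain multiplication for $\circ$; the function name `naive_pairwise_sum` is just illustrative):

```python
def naive_pairwise_sum(u, s, circ=lambda a, b: a * b):
    """Sum u[i] ∘ s[j] over all pairs with an explicit double loop: O(n^2) ∘ operations."""
    total = 0.0
    for ui in u:
        for sj in s:
            total += circ(ui, sj)
    return total

print(naive_pairwise_sum([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 90.0
```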
A standard calculation requires $O(n^2)$ operations. However, if $\circ$ is also distributive over summation, we can expand and factorize this computation into $O(n)$ operations. Let’s begin!

$$
\sum\limits_{i=1}^n \sum\limits_{j=1}^n f(i, j)
= u(1) \circ \left(\sum\limits_{j=1}^n s(j) \right)
+ u(2) \circ \left(\sum\limits_{j=1}^n s(j) \right)
+ \ldots
+ u(n) \circ \left(\sum\limits_{j=1}^n s(j) \right)
= \left(u(1) + u(2) + \ldots + u(n) \right) \circ \left(\sum\limits_{j=1}^n s(j) \right)
= \left(\sum\limits_{i=1}^n u(i) \right) \circ \left(\sum\limits_{j=1}^n s(j) \right)
$$
This is really, really awesome. We have reduced from $O(n^2)$ additions and $O(n^2)$ $\circ$ operations to $O(n)$ additions and only one lonely $\circ$ operation. This has removed an unnecessary loop over $j$, illustrating that this problem is actually less complex than it originally appeared[^2].
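To make the saving concrete, here is a sketch of the factorized version, checked against an inline naive double loop (again assuming plain multiplication for $\circ$):

```python
def factorized_pairwise_sum(u, s, circ=lambda a, b: a * b):
    """(sum_i u[i]) ∘ (sum_j s[j]): O(n) additions and a single ∘ operation."""
    return circ(sum(u), sum(s))

# Check against the naive double loop.
u = [1.0, 2.0, 3.0]
s = [4.0, 5.0, 6.0]
naive = sum(ui * sj for ui in u for sj in s)
assert abs(naive - factorized_pairwise_sum(u, s)) < 1e-9
```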
Is this intuitive? I think so. Remember in elementary school, where it was way more fun to compute something like $7 \times (2 + 3 + 5)$ than $7 \times 2 + 7 \times 3 + 7 \times 5$? This is a similar thing, where you win by factorizing. In this case, we can further cheat because it becomes clear when you write out the expansion that $j$ is not a necessary variable. This is a case where the pairwise interactions factor out, due to the combination of the commutative and distributive properties of $\circ$, in such a beautiful way that we don’t care what the value of $j$ is. Amazing!
For a bit of a brain tease, let $\circ = \max$ and replace the outer summation with a $\min$. This would be like having each pair joust for the highest value, and then taking the smallest of these values (perhaps a quasi-winner-take-all?):

$$\min_{i, j} \; \max\bigl(u(i), s(j)\bigr)$$

Due to our factorization proof (with $\max$ distributing over $\min$), it is equivalent to computing:

$$\max\Bigl(\min_{i} u(i), \; \min_{j} s(j)\Bigr)$$

I find this somewhat intuitive; the largest simply depends on either the row or column value, not the combination.
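A quick numerical sanity check of this identity (a sketch; the values are just random floats):

```python
import random

# min over all pairs of max(u[i], s[j]) should equal max(min(u), min(s)).
u = [random.random() for _ in range(20)]
s = [random.random() for _ in range(20)]

pairwise = min(max(ui, sj) for ui in u for sj in s)  # O(n^2): joust, then take the smallest
factored = max(min(u), min(s))                       # O(n): factorized form
assert pairwise == factored
```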
What if we don’t care about self-similarity $f(i, i)$? We can find a smart factorization that significantly reduces the number of computations when self-similarity is unimportant.
Due to the commutative property of $\circ$, the operations are symmetric. The sum can be thought of as summing over all entries of a matrix $M$ of the results $f(i, j)$:

$$\sum\limits_{i=1}^n \sum\limits_{j=1}^n f(i, j) = \sum\limits_{i=1}^n \sum\limits_{j=1}^n M_{ij}$$

where

$$M_{ij} = f(i, j) = u(i) \circ s(j)$$
We can separate this matrix $M$ into its strictly upper triangular, strictly lower triangular, and diagonal parts:

$$\sum\limits_{i=1}^n \sum\limits_{j=1}^n M_{ij} = \sum\limits_{i=1}^n \sum\limits_{j=i+1}^n M_{ij} + \sum\limits_{i=1}^n \sum\limits_{j=1}^{i-1} M_{ij} + \sum\limits_{i=1}^n M_{ii}$$
Due to the commutative nature of $\circ$, the upper and lower triangular sums are equal:

$$\sum\limits_{i=1}^n \sum\limits_{j=i+1}^n M_{ij} = \sum\limits_{i=1}^n \sum\limits_{j=1}^{i-1} M_{ij}$$
We can then decompose:

$$\sum\limits_{i=1}^n \sum\limits_{j=1}^n M_{ij} = 2 \sum\limits_{i=1}^n \sum\limits_{j=i+1}^n M_{ij} + \sum\limits_{i=1}^n M_{ii}$$
Or, rearranged:

$$\sum\limits_{i=1}^n \sum\limits_{j=i+1}^n M_{ij} = \frac{1}{2} \left( \sum\limits_{i=1}^n \sum\limits_{j=1}^n M_{ij} - \sum\limits_{i=1}^n M_{ii} \right)$$
Therefore, if all you care about is just the upper triangular sum:

$$\sum\limits_{i=1}^n \sum\limits_{j=i+1}^n f(i, j) = \frac{1}{2} \left( \sum\limits_{i=1}^n \sum\limits_{j=1}^n f(i, j) - \sum\limits_{i=1}^n f(i, i) \right)$$
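Here is a small numerical check of that identity, as a sketch: it assumes $u = s = v$ (so the matrix really is symmetric) and plain multiplication for $\circ$:

```python
# Check: the strict upper-triangular sum equals (full sum - diagonal) / 2.
v = [0.5, -1.0, 2.0, 3.5]
n = len(v)

upper = sum(v[i] * v[j] for i in range(n) for j in range(i + 1, n))  # direct O(n^2)
full = sum(v[i] * v[j] for i in range(n) for j in range(n))          # factorizes to (sum v)^2
diag = sum(v[i] * v[i] for i in range(n))
assert abs(upper - 0.5 * (full - diag)) < 1e-9
```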
Let’s go back to a more verbose, lower level representation, expanding the sums as follows:

$$\sum\limits_{i=1}^n \sum\limits_{j=i+1}^n M_{ij} = \frac{1}{2} \left( \sum\limits_{i=1}^n \sum\limits_{j=1}^n u(i) \circ s(j) - \sum\limits_{i=1}^n M_{ii} \right)$$
We can simplify the values of the diagonal in $M$:

$$\sum\limits_{i=1}^n M_{ii} = \sum\limits_{i=1}^n u(i) \circ s(i)$$
Putting it all together, using the linear-time factorization of the full double sum, we can compute the upper triangular pairwise interactions as follows:

$$\sum\limits_{i=1}^n \sum\limits_{j=i+1}^n f(i, j) = \frac{1}{2} \left( \left(\sum\limits_{i=1}^n u(i) \right) \circ \left(\sum\limits_{j=1}^n s(j) \right) - \sum\limits_{i=1}^n u(i) \circ s(i) \right)$$
As an example, when $\circ$ is the dot product and $u(i) = s(i) = v_i \in \mathbb{R}^k$ (as in a Factorization Machine), this looks like the following:

$$\sum\limits_{i=1}^n \sum\limits_{j=i+1}^n \langle v_i, v_j \rangle = \frac{1}{2} \sum\limits_{a=1}^k \left( \left(\sum\limits_{i=1}^n v_{i,a} \right)^2 - \sum\limits_{i=1}^n v_{i,a}^2 \right)$$
There is something very beautiful about this transformation. It says that summing the upper triangular elements computed by $f$ can be decomposed into a one dimensional function that only iterates over the number of components $k$.
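A sketch of that component-wise form, assuming hypothetical latent vectors stacked in an $n \times k$ NumPy array `V`; the direct $O(kn^2)$ sum of pairwise dot products should match the $O(kn)$ factorized form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 8
V = rng.normal(size=(n, k))  # hypothetical latent vectors v_1, ..., v_n in R^k

# Direct O(k n^2): sum of <v_i, v_j> over the strict upper triangle.
direct = sum(V[i] @ V[j] for i in range(n) for j in range(i + 1, n))

# Factorized O(k n): 1/2 * sum_a ((sum_i v_{i,a})^2 - sum_i v_{i,a}^2).
factorized = 0.5 * ((V.sum(axis=0) ** 2).sum() - (V ** 2).sum())

assert np.isclose(direct, factorized)
```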
Why is this useful?
Basically, factorization reduces redundant computation.
When $f$ performs some really expensive operation like a dot product upon all pairwise combinations of elements, computing all of them can be simplified from $O(kn^2)$ to $O(kn)$, where $k$ is the length of the vectors.
This is exactly the case used in Factorization Machines, where we create a latent space $V \in \mathbb{R}^{n \times k}$ from an input $x \in \mathbb{R}^n$, with $k \ll n$. When modeling real world data, $k$ probably has some sublinear dependency on $n$, perhaps $k \sim \log(n)$. That’s much better than the naive $O(kn^2)$!
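For the Factorization Machine pairwise term itself, one way to see it is to take $u(i) = s(i) = x_i v_i$, which folds the features into the factors and reuses the identity above. A sketch, with hypothetical `x` and `V`:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 4
x = rng.normal(size=n)        # input features
V = rng.normal(size=(n, k))   # latent factors, one k-dimensional vector per feature

# Naive O(k n^2) pairwise term: sum_{i<j} <v_i, v_j> x_i x_j.
naive = sum((V[i] @ V[j]) * x[i] * x[j] for i in range(n) for j in range(i + 1, n))

# Linear-time O(k n) form: fold the features into the factors, then factorize.
xv = V * x[:, None]  # row i is x_i * v_i
fast = 0.5 * ((xv.sum(axis=0) ** 2).sum() - (xv ** 2).sum())

assert np.isclose(naive, fast)
```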
In addition, this simplification is used as a tool to create other proofs, like showing that Spearman’s coefficient is a particular case of the general correlation coefficient.
[^1]: If anyone knows real world cases for simultaneously diagonalizable matrices, that would be pretty neat.
[^2]: Philosophically, I think the Kolmogorov complexity of algorithms is related to the depth of conditional statements. I don’t have a strong proof of this, but it feels like each conditional is another layer in a state diagram for a program, with the number of states of that state diagram being some measure of Kolmogorov complexity. Here we have removed a layer of conditional statements that would occur with the loop operation, meaning this has lower Kolmogorov complexity.