Machine Learning Articles of the Week: Network Dynamics with BuzzFeed and Quora, Unnecessary Distributed Machine Learning, and the Utility of Small Data

This week is a double whammy of large consumer focused organizations looking at network dynamics! Very exciting times to have so many interesting approaches being published each week.

The Emperor’s New Clothes: Distributed Machine Learning

When you can get workstations with 1TB of RAM, dozens of cores, and multiple GPUs for most datasets do you really need a distributed machine learning solution?

From my personal experiences most people write low-throughput code and use parallelization as the wrong tool to speed things up. However once there is a decoupling between what is creating the data and what is consuming it, like a host that saves events to S3 and another that reads S3 and creates features, it does make sense to go parallel to improve networking throughput between the two nodes or cluster. The reasons for having the decoupling typically come from building high availability services which somewhat is in conflict with single node workstations for machine learning.

Introducing Pound: Process for Optimizing and Understanding Network Diffusion

Twins Andrew and Adam Kelleher created a graph construction system called Pound which takes event data and creates a graph of nodes and edges which can be explored with network analysis tools. Of particular interest is understanding network diffusion of information cascades, visualized below by Adam Kelleher:

Pound Visualization

I believe this is at least loosely based on work by Jure Leskovec, and it’s very exciting to see where this work may continue.

Upvote Dynamics on the Quora Network

Quora is able to connect questions to experts who can answer them effectively. This explores a graph theoretic look at the Quora network dynamics by constructing a graph of users with an incremental algorithm to add new users and follow relationships. A sparsifying step is done to remove paths that take longer than some time threshold, and then longest path is computed. The author explored answer propagation dynamics through this metric finding some interesting results like that follower count for good answers significantly helps in the beginning of a post, as time goes on the low follow count users tend to converge.

How Not to Drown in Numbers

An interesting discussion about how useful small data is; things like qualitative studies or human insight are extremely valuable when a computer does not easily have access to the data. The author presents several real world cases where small data is useful even with “big data” approaches.