Yihong Wu
Counting Motifs with Graph Sampling
Abstract: Applied researchers often construct a network from data that
has been collected from a random sample of nodes, with the goal to
infer properties of the parent network from the sampled version. Two
of the most widely used sampling schemes are \emph{subgraph sampling},
where we sample each vertex independently with probability $p$ and
observe the subgraph induced by the sampled vertices, and
\emph{neighborhood sampling}, where we additionally observe the edges
between the sampled vertices and their neighbors.
In this talk, we study the problem of estimating the number of motifs
as induced subgraphs under both models from a statistical
perspective. We show that: for parent graph $G$ with maximal degree
$d$, for any connected motif $h$ on $k$ vertices, to count the number
of copies of $h$ in $G$, denoted by $s=\s(h,G)$, with a multiplicative
error of $\epsilon$, \begin{itemize} \item For subgraph sampling, the
optimal sampling ratio $p$ is $\Theta_{k}(\max\{
\ppth{s\epsilon^2}^{-\frac{1}{k}}, \frac{d^{k-1}}{s\epsilon^{2}} \})$,
which only depends on the size of the motif but \emph{not} its actual
topology. Furthermore, we show that Horvitz-Thompson type estimators
are universally optimal for any connected motifs.
\item For neighborhood sampling, we propose a family of
estimators, encompassing and outperforming the Horvitz-Thompson
estimator and achieving the sampling ratio $O_{k}(\min\{
(\frac{d}{s\epsilon^2})^{\frac{1}{k-1}},
\sqrt{\frac{d^{k-2}}{s\epsilon^2}}\})$, which again only depends on
the size of $h$. This is shown to be optimal for all motifs with at
most 4 vertices and cliques of all sizes. \end{itemize} The matching
minimax lower bounds are established using certain algebraic
properties of subgraph counts. These results allow us to quantify how
much more informative neighborhood sampling is than subgraph sampling,
as empirically verified by experiments on synthetic and real-world
data. We also address the issue of adaptation to the unknown maximum
degree, and study specific problems for parent graphs with additional
structures, e.g., trees or planar graphs. Extensions to counting
connected components will be briefly discussed.
This is joint work with Jason M. Klusowski (Yale).