Skip to content

Instantly share code, notes, and snippets.

@zhouzhuojie
Created August 2, 2013 13:26
Show Gist options
  • Save zhouzhuojie/6139862 to your computer and use it in GitHub Desktop.
Save zhouzhuojie/6139862 to your computer and use it in GitHub Desktop.
Former Ideas
List
--------------
* Statistics - MCMC
* Statistics - One long run vs Many short runs
* CS - Non-backtracking
* CS - Tree Sampling vs Graph Sampling
* Other things
Statistics - MCMC
-------------
#### 1. Asymptotic Variance is Key
Consider random walk sequence as a AR(1) series.
$$ \mu = E(X_n) = \rho E(X_{n-1}) + E(e_n) $$
And starting from stationary distribution:
$$ \sigma^2_{CLT} = \gamma_0 + 2\sum_{k=1}^\infty \gamma_k,$$
where $\gamma_k = Cov(Y_j, Y_{j+k})$,
and $Y_j$ is some attribute function of $X_j$: $Y_j = f(X_j)$
#### 2. Esitmate the Asymptotic Variance Using Batch Means
* Batch Mean Method
* Overlapping batch mean method
#### 3. Effective Sample Size
Roughly $\sqrt{\frac{1+\rho}{1-\rho}}$
Statistics - One long run vs Many short runs
----------------
#### Questions
1. Samples in the sample pool are **dependent**, should we 'thin' it?
* Yes, if space/storage is not computational or statistical efficient
* Yes, if $g(X_i)$ is computational expensive
* No, otherwise
* typical, for Online Social Network, we do not need to 'thin' the random walk samples.
2. Should we discard "burn-in"?
* Burn-ins are just pushing the initial distribution from a point {x} to {$P^nx$}
* See experiment table [here](https://docs.google.com/spreadsheet/ccc?key=0AmIBiL5nyHqXdHZOdE9vcnpOV01ON2pMcmtHUENTLVE&usp=sharing). Basically we do not need burn-in if we adopt one long run random walk.
3. Multiple short runs?
* "Diversity of the initial distribution". Still, there is no guarantee to improve statistical or computational efficiency.
* Likely to give indication of convergence faster
4. How long is enough?
* For one long run, just use the query budget.
Statistics - Rule of Thumb for Importance Sampling
----------------
#### Introduction
$$\hat{\mu} = \frac{1}{N}\sum_{k=1}^Nh(X_k)\frac{\pi(X_k)}{p(X_k)}$$
where
$h(\cdot)$ is the attribute function,
$\pi(\cdot)$ is the expected/target distribution density function,
$p(\cdot)$ is the sampling/obtained distribution density function. $p(\cdot)$ is easy to obtain, e.g. SRW: $p(v) = \frac{k_v}{2|E|}$
#### Ideal case for Important Sampling
$$p(X_k) \approx h(X_k)\pi(X_k)$$
Non-backtracking Random Walk
----------------------
#### Introduction
$P$ and $P'$ are two random walks' transition matrix. If:
$$
\begin{equation}
\begin{split}
P(j,i)P'(e_{ij}, e_{jk}) &= P(j,k)P'(e_{kj},e_{ji}) \\
P'(e_{ij}, e_{jk}) &\geq P(j,k)
\end{split}
\end{equation}
$$
then the random walk with $P'$ with have a unique stationary distribution $\pi'(e_{ij}) = \pi(i)P(i,j)$, and $\sigma'^2(f)\leq \sigma^2(f)$.
#### Design of the non-backtracking random walk
* Never go back from the income edge
* $P(j,i) = 1/d(j)$
* $P'(e_{ji}, e_{jk}) = 1/(d(j)-1)$
* $P'(e_{ji}, e_{jk}) \geq P(j,i)$
#### New Ideas
1. Pushing $P'(e_{ij}, e_{jk}) \geq P(j,k) $
* Divide the neighbors into two groups, $C_1$ and $C_2$
* Come in $C_1$ then go out to $C_2$, and vice versa.
* Although sometime it increase the probability of going out to another group, the probability $P'(e_{ji}, e_{jk})$ remains the same.
* $$ P'(e_{ji}, e_{jk}) = 0.5\cdot 0 + 0.5\cdot 1/(0.5\cdot d) = 1/d = P(j,k) $$
* Maybe some experiments?
CS - Tree Sampling vs Graph Sampling
--------------
#### Questions
1. Is tree sampling better than graph sampling?
2. Find better starting node for tree sampling?
3. On-the-fly drill down for tree sampling. How to accurately calculate the probability of a node being accessed?
#### Some observations
1. Adding random edges to a tree will increase the conductance of the graph.
2. By selecting the nodes that are connected to cross-cutting edges, can we design random walk that are much faster than original simple random walk?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment