KDD 19 Survey (Partial)

1, Deep Landscape Forecasting for Real-time Bidding Advertising

GitHub

Paper

Abstract

The emergence of real-time auctions in online advertising has drawn huge attention to modeling the market competition, i.e., bid landscape forecasting. The problem is formulated as forecasting the probability distribution of the market price for each ad auction. In consideration of the censorship issue caused by the second-price auction mechanism, many researchers have devoted their efforts to bid landscape forecasting by incorporating survival analysis from the medical research field. However, most existing solutions mainly focus on either counting-based statistics of segmented sample clusters, or learning a parameterized model based on heuristic assumptions about the distribution form. Moreover, they do not consider the sequential patterns of the features over the price space. In order to capture more sophisticated yet flexible patterns at a fine-grained level of the data, we propose a Deep Landscape Forecasting (DLF) model which combines deep learning for probability distribution forecasting and survival analysis for censorship handling. Specifically, we utilize a recurrent neural network to flexibly model the conditional winning probability w.r.t. each bid price. Then we conduct the bid landscape forecasting through the probability chain rule with strict mathematical derivations. And, in an end-to-end manner, we optimize the model by minimizing two negative likelihood losses with comprehensive motivations. Without any specific assumption about the distribution form of the bid landscape, our model shows great advantages over previous works in fitting various sophisticated market price distributions. In experiments on two large-scale real-world datasets, our model significantly outperforms state-of-the-art solutions under various metrics.
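
The core computation the abstract describes is the probability chain rule: the RNN emits, for each discretized price level, a conditional probability that the market price equals that level given that it has not been reached yet, and the full market-price distribution (and hence the winning probability for any bid) follows by multiplying survival terms. A minimal NumPy sketch of that step, with all names illustrative rather than taken from the DLF code:

import numpy as np

def market_price_distribution(hazard):
    """Probability chain rule: turn per-price conditional probabilities
    hazard[l] = P(market price = l | market price >= l) into the full
    distribution P(market price = l) and the winning-probability curve."""
    hazard = np.asarray(hazard, dtype=float)
    survival = np.cumprod(1.0 - hazard)                     # P(market price > l)
    prev_survival = np.concatenate(([1.0], survival[:-1]))  # P(market price >= l)
    pdf = hazard * prev_survival                            # P(market price = l)
    win_prob = 1.0 - prev_survival                          # P(win with a bid at level l)
    return pdf, win_prob

# Toy conditional probabilities; in DLF these come from the RNN's per-price outputs.
hazard = np.array([0.05, 0.10, 0.20, 0.30, 0.50, 0.80, 1.00])
pdf, win_prob = market_price_distribution(hazard)
print(pdf.round(3), pdf.sum().round(3))   # full market-price distribution (sums to 1)
print(win_prob.round(3))                  # winning probability at each bid level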

System requirements

  • Python 2.7.6
  • TensorFlow >= 1.3
  • GTX 1080Ti GPU × 1
  • Main Memory: 128GB
  • Intel(R) Core(TM) i7-6900K CPU

2, Kickscore (Pairwise Comparisons with Flexible Time-Dynamics)

GitHub

Paper

Installation

pip install kickscore

Abstract

Inspired by applications in sports where the skill of players or teams competing against each other varies over time, we propose a probabilistic model of pairwise-comparison outcomes that can capture a wide range of time dynamics. We achieve this by replacing the static parameters of a class of popular pairwise-comparison models by continuous-time Gaussian processes; the covariance function of these processes enables expressive dynamics. We develop an efficient inference algorithm that computes an approximate Bayesian posterior distribution. Despite the flexibility of our model, our inference algorithm requires only a few linear-time iterations over the data and can take advantage of modern multiprocessor computer architectures. We apply our model to several historical databases of sports outcomes and find that our approach a) outperforms competing approaches in terms of predictive performance, b) scales to millions of observations, and c) generates compelling visualizations that help in understanding and interpreting the data.
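
The model the abstract describes replaces static skill parameters with latent skill functions of time drawn from Gaussian processes, and scores each pairwise comparison by the skill gap at match time. A minimal, self-contained NumPy sketch of that idea (not the kickscore API; the kernel choice and the logistic outcome model here are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def sample_skill_trajectory(times, var=1.0, lscale=50.0):
    """Sample one latent skill curve from a zero-mean Gaussian process with a
    squared-exponential covariance (a stand-in for the paper's kernels)."""
    t = np.asarray(times, dtype=float)
    cov = var * np.exp(-0.5 * ((t[:, None] - t[None, :]) / lscale) ** 2)
    return rng.multivariate_normal(np.zeros(len(t)), cov + 1e-9 * np.eye(len(t)))

def win_probability(skill_a, skill_b):
    """Bradley-Terry style outcome model given the two skills at match time."""
    return 1.0 / (1.0 + np.exp(-(skill_a - skill_b)))

times = np.arange(0, 365, 7.0)              # weekly "match" times over one season
alice = sample_skill_trajectory(times)
bob = sample_skill_trajectory(times)
print(win_probability(alice[10], bob[10]))  # P(alice beats bob) in week 10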


3, TICC (Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data)

GitHub

Paper (not yet publicly available)

Abstract

TICC is a Python solver for efficiently segmenting and clustering a multivariate time series. It takes as input a T-by-n data matrix, a regularization parameter lambda, a smoothness parameter beta, a window size w, and a number of clusters k. TICC breaks the T timestamps into segments, where each segment belongs to one of the k clusters. The total number of segments is affected by the smoothness parameter beta. It does so by running an EM algorithm in which TICC alternately assigns points to clusters using a dynamic programming algorithm and updates the cluster parameters by solving a Toeplitz Inverse Covariance Estimation problem.

Installation

git clone https://github.com/davidhallac/TICC.git
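
A hedged usage sketch, assuming the repository exposes a TICC_solver.TICC class roughly as its README describes (a constructor taking the window size, number of clusters, lambda, and beta, and a fit method over an input file); the exact parameter names may differ in the current code:

# Sketch only: names follow the repository README from memory and may not match.
from TICC_solver import TICC

ticc = TICC(window_size=10,            # w: samples per subsequence window
            number_of_clusters=5,      # k
            lambda_parameter=0.11,     # sparsity of each cluster's inverse covariance
            beta=600,                  # switching penalty (segment smoothness)
            maxIters=100)

# "example_data.txt" is a placeholder for a T-by-n CSV of sensor readings.
cluster_assignment, cluster_MRFs = ticc.fit(input_file="example_data.txt")
print(cluster_assignment[:20])         # cluster label per timestamp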


4, TVGL

GitHub

Paper

Abstract

TVGL is a Python solver for inferring dynamic networks from raw time series data.

Installation

git clone https://github.com/davidhallac/TVGL.git


5, Deep Learning for Time Series Classification

GitHub

Paper


6, EpiDeep: Exploiting Embeddings for Influenza Forecasting

Code: not available

Paper

Abstract

Influenza leads to regular losses of lives annually and requires careful monitoring and control by health organizations. Annual influenza forecasts help policymakers implement effective countermeasures to control both seasonal and pandemic outbreaks. Existing forecasting techniques suffer from problems such as poor forecasting performance, lack of modeling flexibility, data sparsity, and/or lack of interpretability. We propose EpiDeep, a novel deep neural network approach for epidemic forecasting which tackles all of these issues by learning meaningful representations of incidence curves in a continuous feature space and accurately predicting future incidences, peak intensity, peak time, and onset of the upcoming season. We present extensive experiments on forecasting ILI (influenza-like illnesses) in the United States, leveraging multiple metrics to quantify success. Our results demonstrate that EpiDeep is successful at learning meaningful embeddings and, more importantly, that these embeddings evolve as the season progresses. Furthermore, our approach outperforms non-trivial baselines by up to 40%.


7, DevNet

Code

Paper

Abstract

Although deep learning has been applied to successfully address many data mining problems, relatively limited work has been done on deep learning for anomaly detection. Existing deep anomaly detection methods, which focus on learning new feature representations to enable downstream anomaly detection methods, perform indirect optimization of anomaly scores, leading to data-inefficient learning and suboptimal anomaly scoring. Also, they are typically designed as unsupervised learning due to the lack of large-scale labeled anomaly data. As a result, it is difficult for them to leverage prior knowledge (e.g., a few labeled anomalies) when such information is available, as in many real-world anomaly detection applications. This paper introduces a novel anomaly detection framework and its instantiation to address these problems. Instead of representation learning, our method achieves end-to-end learning of anomaly scores via neural deviation learning, in which we leverage a few (e.g., several to dozens of) labeled anomalies and a prior probability to enforce statistically significant deviations of the anomaly scores of anomalies from those of normal data objects in the upper tail. Extensive results show that our method can be trained substantially more data-efficiently and achieves significantly better anomaly scoring than state-of-the-art competing methods.
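
The deviation learning the abstract refers to boils down to a deviation loss: each sample's anomaly score is standardized against a reference set of scores drawn from a prior (a standard normal in the paper), unlabeled data are pulled toward the reference mean, and the few labeled anomalies are pushed beyond a margin in the upper tail. A minimal NumPy sketch of that loss (an illustration, not the repository code):

import numpy as np

def deviation_loss(scores, labels, margin=5.0, n_ref=5000, rng=None):
    """scores: scalar anomaly scores from a network; labels: 1 for labeled
    anomalies, 0 for (mostly normal) unlabeled data."""
    rng = rng or np.random.default_rng(0)
    ref = rng.standard_normal(n_ref)              # prior reference scores ~ N(0, 1)
    dev = (scores - ref.mean()) / ref.std()       # z-score deviation of each sample
    # normals are pulled toward the reference mean, anomalies pushed above the margin
    loss = (1 - labels) * np.abs(dev) + labels * np.maximum(0.0, margin - dev)
    return loss.mean()

scores = np.array([0.1, -0.2, 4.8, 0.3])          # toy network outputs
labels = np.array([0, 0, 1, 0])                   # one labeled anomaly
print(deviation_loss(scores, labels))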

System Requirements

  • python==3.6.6
  • keras==2.2.4
  • keras-applications==1.0.6
  • keras-preprocessing==1.0.5
  • tensorflow-gpu==1.10.0
  • scikit-learn==0.20.0
  • numpy==1.14.5
  • pandas==0.23.4
  • scipy==1.1.0
  • tensorboard==1.10.0

8, AtSNE

GitHub

Paper

Abstract

Visualization of high-dimensional data is a fundamental yet challenging problem in data mining. These visualization techniques are commonly used to reveal patterns in high-dimensional data, such as clusters and the similarity among clusters. Recently, some successful visualization tools (e.g., BH-t-SNE and LargeVis) have been developed. However, there are two limitations with them: (1) they cannot capture the global data structure well, so their visualization results are sensitive to initialization, which may cause confusion in data analysis; (2) they cannot scale to large-scale datasets, and they are not suitable for implementation on GPU platforms because their complex algorithm logic, high memory cost, and random memory access patterns lead to low hardware utilization. To address these problems, we propose a novel visualization approach named Anchor-t-SNE (AtSNE), which provides an efficient GPU-based visualization solution for large-scale and high-dimensional data. Specifically, we generate a number of anchor points from the original data and regard them as the skeleton of the layout, which holds the global structure information. We propose a hierarchical optimization approach to optimize the positions of the anchor points and the ordinary data points in the layout simultaneously. Our approach presents much better and more robust visual effects on 11 public datasets and achieves a 5 to 28 times speed-up on different datasets compared with the current state-of-the-art methods. In particular, we deliver a high-quality 2-D layout for a 20-million-point, 96-dimensional dataset within 5 hours, while the current methods fail to give results due to running out of memory.

Requirements

  • CUDA (8 or later), nvcc and cublas
  • GCC
  • faiss

Note: a Dockerfile is available.


9, Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network

GitHub

Paper

Abstract

Industry devices (i.e., entities) such as server machines, spacecraft, engines, etc., are typically monitored with multivariate time series, whose anomaly detection is critical for an entity's service quality management. However, due to the complex temporal dependence and stochasticity of multivariate time series, their anomaly detection remains a big challenge. This paper proposes OmniAnomaly, a stochastic recurrent neural network for multivariate time series anomaly detection that works robustly for various devices. Its core idea is to capture the normal patterns of multivariate time series by learning their robust representations with key techniques such as stochastic variable connection and planar normalizing flow, reconstruct the input data from the representations, and use the reconstruction probabilities to determine anomalies. Moreover, for a detected entity anomaly, OmniAnomaly can provide interpretations based on the reconstruction probabilities of its constituent univariate time series. The evaluation experiments are conducted on two public datasets from aerospace and a new server machine dataset (collected and released by us) from an Internet company. OmniAnomaly achieves an overall F1-score of 0.86 on the three real-world datasets, significantly outperforming the best-performing baseline method by 0.09. The interpretation accuracy of OmniAnomaly is up to 0.89.
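
The detection and interpretation steps the abstract describes reduce to simple post-processing of reconstruction probabilities: a timestamp is flagged when its total reconstruction log-probability falls below a threshold, and the explanation ranks the constituent dimensions by how poorly each was reconstructed. A small NumPy sketch of that post-processing, with all scores and thresholds illustrative:

import numpy as np

def detect_and_explain(log_recon_prob, threshold, top_k=3):
    """log_recon_prob: (T, n) per-timestamp, per-dimension reconstruction
    log-probabilities from a trained model. A timestamp is anomalous when its
    total log-probability falls below the threshold; the explanation is the
    set of dimensions that contribute least to it."""
    total = log_recon_prob.sum(axis=1)                 # anomaly score per timestamp
    anomalies = np.where(total < threshold)[0]
    explanations = {t: np.argsort(log_recon_prob[t])[:top_k] for t in anomalies}
    return anomalies, explanations

rng = np.random.default_rng(0)
log_p = rng.normal(-1.0, 0.3, size=(100, 8))           # toy reconstruction log-probs
log_p[42, [2, 5]] = -6.0                               # inject a poorly reconstructed point
anomalies, explanations = detect_and_explain(log_p, threshold=-12.0)
print(anomalies, explanations)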

System requirements

Python 3.5+


10, oi-VAE: Output Interpretable VAEs for Nonlinear Group Factor Analysis

GitHub

Paper

Abstract

Deep generative models have recently yielded encouraging results in producing subjectively realistic samples of complex data. Far less attention has been paid to making these generative models interpretable. In many scenarios, ranging from scientific applications to finance, the observed variables have a natural grouping. It is often of interest to understand systems of interaction amongst these groups, and latent factor models (LFMs) are an attractive approach. However, traditional LFMs are limited by assuming a linear correlation structure. We present an output interpretable VAE (oi-VAE) for grouped data that models complex, nonlinear latent-to-observed relationships. We combine a structured VAE comprised of group-specific generators with a sparsity-inducing prior. We demonstrate that oi-VAE yields meaningful notions of interpretability in the analysis of motion capture and MEG data. We further show that in these situations, the regularization inherent to oi-VAE can actually lead to improved generalization and learned generative processes.
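
The interpretability in oi-VAE comes from a sparsity-inducing prior on the mapping from each latent dimension into each group's generator, so that every latent dimension influences only a few groups. A small NumPy sketch of the corresponding group-wise penalty (an illustration of the idea, not the authors' code):

import numpy as np

def group_sparsity_penalty(latent_to_group_weights, lam=1.0):
    """latent_to_group_weights: dict of group name -> (d_group, d_latent) matrix.
    Penalizing the per-latent-dimension column norms encourages each latent
    dimension to feed only a few groups, which keeps the structure readable."""
    penalty = 0.0
    for W in latent_to_group_weights.values():
        penalty += np.linalg.norm(W, axis=0).sum()   # sum of column (per-latent-dim) norms
    return lam * penalty

rng = np.random.default_rng(0)
weights = {"left_arm": rng.normal(size=(6, 4)), "right_leg": rng.normal(size=(6, 4))}
print(group_sparsity_penalty(weights))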

Graph Neural Network

1, Must-read papers on GNN

GitHub


2, Predicting Path Failure In Time Evolving Graphs

GitHub

Paper

Abstract

In this paper we use a time-evolving graph which consists of a sequence of graph snapshots over time to model many real-world networks. We study the path classification problem in a time-evolving graph, which has many applications in real-world scenarios, for example, predicting path failure in a telecommunication network and predicting path congestion in a traffic network in the near future. In order to capture the temporal dependency and graph structure dynamics, we design a novel deep neural network named Long Short-Term Memory R-GCN (LRGCN). LRGCN considers temporal dependency between time-adjacent graph snapshots as a special relation with memory, and uses relational GCN to jointly process both intra-time and inter-time relations. We also propose a new path representation method named self-attentive path embedding (SAPE), to embed paths of arbitrary length into fixed-length vectors. Through experiments on a real-world telecommunication network and a traffic network in California, we demonstrate the superiority of LRGCN to other competing methods in path failure prediction, and prove the effectiveness of SAPE on path representation.
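
The self-attentive path embedding (SAPE) component maps a path of arbitrary length to a fixed-length vector. A minimal NumPy sketch of attention pooling over a path's node embeddings, which captures that length-to-fixed-size property (the paper's full SAPE is richer, combining a recurrent encoder with structured self-attention; all names here are illustrative):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def self_attentive_path_embedding(path_node_embeddings, w):
    """Pool a (length, d) sequence of node embeddings into one d-vector:
    score each node with a learned vector w, softmax the scores, and take
    the attention-weighted sum, so paths of any length map to a fixed size."""
    scores = path_node_embeddings @ w          # (length,) attention logits
    alpha = softmax(scores)                    # attention weights over path positions
    return alpha @ path_node_embeddings        # (d,) fixed-length path embedding

rng = np.random.default_rng(0)
d = 16
short_path = rng.normal(size=(3, d))           # node embeddings along a 3-hop path
long_path = rng.normal(size=(9, d))            # a 9-hop path maps to the same size
w = rng.normal(size=d)
print(self_attentive_path_embedding(short_path, w).shape,
      self_attentive_path_embedding(long_path, w).shape)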


3, SNAP System

Homepage

Download

Abstract

Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on the nodes and/or edges of the network. The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation. SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in November 2009. SNAP uses a general purpose STL (Standard Template Library)-like library, GLib, developed at the Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.


4, Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks

GitHub

Paper

Introduction

JODIE is a representation learning framework for temporal interaction networks. Given a sequence of entity-entity interactions, JODIE learns a dynamic embedding trajectory for every entity (as opposed to a single static embedding). These trajectories can then be used for various downstream machine learning tasks. JODIE is fast and makes accurate predictions for future-interaction and anomaly-detection tasks.

JODIE can be used for two broad categories of tasks:

  1. Temporal Link Prediction: Which two entities will interact next? Example applications are recommender systems and modeling network evolution.

  2. Temporal Node Classification: When does the state of a node change from normal to abnormal? Example applications are anomaly detection, ban prediction, dropout and churn prediction, and fraud and account compromise detection.
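
The trajectory idea above comes from projecting an entity's embedding forward in time between observed interactions; in the paper this projection scales the most recent embedding element-wise by a learned function of the elapsed time. A minimal NumPy sketch of that projection step under that reading, with toy values throughout:

import numpy as np

def project_embedding(last_embedding, elapsed_time, time_weights):
    """Project a user/item embedding forward by elapsed_time since its last
    interaction: element-wise scaling by (1 + elapsed_time * w), so the
    embedding drifts smoothly between observed interactions."""
    return (1.0 + elapsed_time * time_weights) * last_embedding

rng = np.random.default_rng(0)
emb = rng.normal(size=8)                 # embedding at the last interaction
w = rng.normal(scale=0.01, size=8)       # learned time-context vector (toy values)
for dt in (0.0, 10.0, 100.0):
    print(dt, project_embedding(emb, dt, w)[:3])   # trajectory drifts with elapsed time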

Setup

git clone https://github.com/srijankr/jodie.git
cd jodie/
pip install -r requirements.txt

chmod +x initialize.sh
./initialize.sh

chmod +x download_data.sh
./download_data.sh

5, ClusterGCN

GitHub

Paper

Abstract

Graph convolutional networks (GCNs) have been successfully applied to many graph-based applications; however, training a large-scale GCN remains challenging. Current SGD-based algorithms suffer from either a high computational cost that grows exponentially with the number of GCN layers, or a large space requirement for keeping the entire graph and the embedding of each node in memory. In this paper, we propose Cluster-GCN, a novel GCN algorithm that is suitable for SGD-based training by exploiting the graph clustering structure. Cluster-GCN works as follows: at each step, it samples a block of nodes associated with a dense subgraph identified by a graph clustering algorithm, and restricts the neighborhood search within this subgraph. This simple but effective strategy leads to significantly improved memory and computational efficiency while achieving test accuracy comparable to previous algorithms. To test the scalability of our algorithm, we create a new Amazon2M dataset with 2 million nodes and 61 million edges, which is more than 5 times larger than the previous largest publicly available dataset (Reddit). For training a 3-layer GCN on this data, Cluster-GCN is faster than the previous state-of-the-art VR-GCN (1523 seconds vs 1961 seconds) and uses much less memory (2.2GB vs 11.2GB). Furthermore, for training a 4-layer GCN on this data, our algorithm can finish in around 36 minutes, while all the existing GCN training algorithms fail to train due to out-of-memory issues. Furthermore, Cluster-GCN allows us to train much deeper GCNs without much time and memory overhead, which leads to improved prediction accuracy: using a 5-layer Cluster-GCN, we achieve a state-of-the-art test F1 score of 99.36 on the PPI dataset, while the previous best result was 98.71 by [16].
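
The training strategy described above amounts to: partition the nodes into clusters, and at each SGD step run the GCN only on one cluster's induced subgraph, so memory scales with the cluster rather than the full graph. A small NumPy sketch of one such step (a random partition stands in for the METIS clustering used in the paper, and the single GCN layer is untrained):

import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_feats, n_clusters = 1000, 16, 10

# Toy graph: sparse random adjacency, node features, and a random partition.
adj = (rng.random((n_nodes, n_nodes)) < 0.01).astype(float)
adj = np.maximum(adj, adj.T)
features = rng.normal(size=(n_nodes, n_feats))
partition = rng.integers(0, n_clusters, size=n_nodes)     # METIS in the real method
weight = rng.normal(scale=0.1, size=(n_feats, 4))         # one GCN layer's parameters

def gcn_layer_on_cluster(cluster_id):
    """Restrict propagation to the cluster's induced subgraph, normalize its
    adjacency, and apply one GCN layer; the full graph never has to fit in memory."""
    idx = np.where(partition == cluster_id)[0]
    sub_adj = adj[np.ix_(idx, idx)] + np.eye(len(idx))     # induced subgraph + self-loops
    deg = sub_adj.sum(axis=1)
    norm_adj = sub_adj / np.sqrt(np.outer(deg, deg))       # symmetric normalization
    return np.maximum(norm_adj @ features[idx] @ weight, 0.0)   # ReLU(A_hat X W)

for step in range(3):                                      # one cluster per SGD step
    cluster_id = rng.integers(n_clusters)
    hidden = gcn_layer_on_cluster(cluster_id)
    print(step, cluster_id, hidden.shape)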

System Requirements

OS: Ubuntu

  • Python == 3.5.2
  • networkx == 1.11
  • tqdm == 4.28.1
  • numpy == 1.15.4
  • pandas == 0.23.4
  • texttable == 1.5.0
  • scipy == 1.1.0
  • argparse == 1.1.0
  • torch == 0.4.1
  • torch-geometric == 0.3.1
  • metis == 0.2a.4
  • scikit-learn == 0.20

sudo apt-get install libmetis-dev


Uncategorized

Home page: Emily B. Fox
