As social networks have become a major source of information, predicting the outcome of information diffusion has become intriguing to both researchers and practitioners. By organizing and categorizing the joint efforts of numerous studies on popularity prediction, this survey presents a hierarchical taxonomy and helps to establish a systematic overview of popularity prediction methods for microblogs. Specifically, we uncover three lines of thought: the feature-based approach, time-series modelling, and the collaborative filtering approach, and analyze each in turn. Furthermore, we also categorize prediction methods based on their underlying rationale: whether they attempt to model the motivation of users or to monitor early responses. Finally, we put these prediction methods to the test by performing experiments on real-life data collected from the popular social networks Twitter and Weibo. We compare the methods in terms of accuracy, efficiency, timeliness and robustness. To the best of our knowledge, there is no prior survey aimed at microblog popularity prediction at the time of submission. By establishing a taxonomy and evaluation for the first time, we hope to provide an in-depth review of state-of-the-art prediction methods and point out directions for further research. Our evaluations show that time-series modelling has the advantage of high accuracy and the ability to improve over time. Feature-based methods using only temporal features perform nearly as well as those using all possible features, producing average results. This suggests that temporal features do have strong predictive power, and that this power is better exploited with time-series models. On the other hand, this implies that we know little about the future popularity of an item before it is posted, which may be a focus of further research.
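The finding that early temporal signals carry strong predictive power can be illustrated with a minimal log-linear temporal baseline — an illustrative sketch of this general family of models, not any specific surveyed method; the function names and the log-linear form are our own assumptions:

```python
import math

def fit_log_linear(early_counts, final_counts):
    """Fit log(final + 1) ~ a + b * log(early + 1) by ordinary least squares.
    A minimal temporal baseline: final popularity is predicted from an
    item's early-response count alone."""
    xs = [math.log(c + 1) for c in early_counts]
    ys = [math.log(c + 1) for c in final_counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def predict_final(a, b, early_count):
    """Map an early count to a predicted final count via the fitted line."""
    return math.exp(a + b * math.log(early_count + 1)) - 1
```

In practice the fit would be done per observation time on held-out training items; here the point is only that a two-parameter temporal model already yields usable predictions.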
Because it is based on the performance of entire social networks, anomaly analysis for evolving social networks generally ignores differences among the evolutionary behaviors of individual nodes, making it difficult to precisely identify the anomalous evolutionary behaviors of nodes (Aebn). Assuming that a node's evolutionary behavior, which generates and removes edges, normally follows stable evolutionary mechanisms, this study focuses on detecting and assessing Aebn whose evolutionary mechanisms deviate from their past mechanisms, and proposes a link prediction detection (LPD) method and a matrix perturbation assessment (MPA) method. LPD describes a node's evolutionary behavior by fitting its evolutionary mechanism, and designs indexes for edge generation and removal to evaluate the extent to which the evolutionary mechanism of a node's behavior can be fitted by a link prediction algorithm. Further, it detects Aebn by quantifying the differences among behavior vectors that characterize a node's evolutionary behaviors in different periods. In addition, MPA treats Aebn as a perturbation of the social network structure, and quantifies the effect of Aebn on that structure using matrix perturbation analysis. Extensive experiments on eight disparate real-world networks demonstrate that analyzing Aebn from the perspective of evolutionary mechanisms is important and beneficial. Compared with other state-of-the-art methods, LPD can effectively detect Aebn, and MPA can reasonably assess their effect.
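LPD's core idea — measuring how well a link-prediction mechanism fits a node's actual edge changes — can be sketched with a simple common-neighbors predictor. This is an assumed stand-in for illustration only; the paper's actual indexes and behavior vectors are more elaborate.

```python
def common_neighbors(adj, u, v):
    """Common-neighbor count between u and v (adj maps node -> neighbor set)."""
    return len(adj[u] & adj[v])

def fit_score(adj_old, adj_new, node):
    """How well a common-neighbors predictor explains the node's new edges:
    the fraction of actually-added edges that ranked in the predictor's
    top half of candidates. A low score flags a node whose edge-generation
    behavior deviates from this simple evolutionary mechanism."""
    candidates = [v for v in adj_old if v != node and v not in adj_old[node]]
    ranked = sorted(candidates,
                    key=lambda v: -common_neighbors(adj_old, node, v))
    top = set(ranked[: max(len(ranked) // 2, 1)])
    added = adj_new[node] - adj_old[node]
    if not added:
        return 1.0  # nothing to explain
    return len(added & top) / len(added)
```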
Predictive business process monitoring refers to the act of making predictions about the future state of ongoing cases of a business process, based on their incomplete execution traces and logs of historical (completed) traces. Motivated by the increasingly pervasive availability of fine-grained event data about business process executions, the problem of predictive process monitoring has received substantial attention in the past years. In particular, a considerable number of methods have been put forward to address the problem of outcome-oriented predictive process monitoring, which refers to classifying each ongoing case of a process according to a given set of possible outcomes -- e.g. Will the customer complain or not? Will an order be delivered, cancelled or withdrawn? Unfortunately, different authors have used different datasets, experimental settings, evaluation measures and baselines to assess their proposals, resulting in poor comparability and an unclear picture of the relative merits and applicability of different methods. To address this gap, this article presents a systematic review and taxonomy of outcome-oriented predictive process monitoring methods, and a comparative experimental evaluation of eleven representative methods using a benchmark covering twelve predictive process monitoring tasks based on four real-life event logs.
Kernel-based regression represents an important family of learning techniques for solving challenging regression tasks with non-linear patterns. Despite being studied extensively, most existing work suffers from two major drawbacks: (i) it is often designed for solving regression tasks in a batch learning setting, making it not only computationally inefficient but also poorly scalable in real-world applications where data arrives sequentially; and (ii) it usually assumes that a fixed kernel function is given prior to the learning task, which can result in poor performance if the chosen kernel is inappropriate. To overcome these drawbacks, this work presents a novel scheme of Online Multiple Kernel Regression (OMKR), which sequentially learns the kernel-based regressor in an online and scalable fashion, and dynamically explores a pool of multiple diverse kernels to avoid relying on a single poorly chosen kernel, thereby remedying the drawback of manual/heuristic kernel selection. The OMKR problem is more challenging than regular kernel-based regression tasks, since we have to determine on the fly both the optimal regressor for each individual kernel and the best combination of the multiple kernel regressors. We propose a family of OMKR algorithms for regression and discuss their application to time-series prediction tasks, including AR, ARMA and ARIMA time series. We develop novel approaches to make OMKR scalable for large datasets, countering the problems arising from an unbounded number of support vectors. We also explore the effect of kernel combination at the prediction level and at the representation level. Finally, we conduct extensive experiments to evaluate the empirical performance on both real-world regression and time-series prediction tasks.
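A minimal sketch of the multiple-kernel idea, assuming RBF kernels, kernel online gradient descent for each single-kernel regressor, and a Hedge-style multiplicative weight update for combining them (the paper's actual algorithms and budget-maintenance strategies are more involved):

```python
import math

class OnlineKernelRegressor:
    """Single-kernel online regressor via kernel online gradient descent
    on the squared loss, for scalar inputs and an RBF kernel."""
    def __init__(self, gamma, lr=0.5):
        self.gamma = gamma   # RBF bandwidth
        self.lr = lr         # learning rate
        self.sv = []         # support vector inputs
        self.alpha = []      # their coefficients

    def _k(self, a, b):
        return math.exp(-self.gamma * (a - b) ** 2)

    def predict(self, x):
        return sum(a * self._k(x, s) for a, s in zip(self.alpha, self.sv))

    def update(self, x, y):
        err = self.predict(x) - y          # gradient factor of squared loss
        self.sv.append(x)                  # unbounded SV set in this sketch
        self.alpha.append(-self.lr * err)

def omkr_predict_update(regressors, weights, x, y, eta=0.2):
    """One OMKR round: combine per-kernel predictions by current weights,
    then multiplicatively down-weight kernels with large loss (Hedge)."""
    preds = [r.predict(x) for r in regressors]
    combined = sum(w * p for w, p in zip(weights, preds)) / sum(weights)
    for i, r in enumerate(regressors):
        loss = (preds[i] - y) ** 2
        weights[i] *= math.exp(-eta * loss)   # penalize poorly fitting kernels
        r.update(x, y)
    return combined
```

The sketch keeps every support vector; the scalability techniques mentioned in the abstract exist precisely to bound that growth.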
Traditional clustering algorithms fail to produce human-like results when confronted with data of variable density, complex distributions or the presence of noise. We propose an improved graph-based clustering algorithm, Chameleon 2, which overcomes several drawbacks of state-of-the-art clustering approaches. We modified the internal cluster quality measure and added an extra step to ensure algorithm robustness. Our results show a significant positive impact on clustering quality, measured by Normalized Mutual Information, on 32 artificial datasets used in the clustering literature. Significant improvement was confirmed on real-world datasets as well. The performance of clustering algorithms such as DBSCAN is typically parameter-sensitive, and exhaustive manual parameter tuning is necessary to obtain a meaningful result. All hierarchical clustering methods are very sensitive to cutoff selection, and a human expert is often required to find the true cutoff for each clustering result. We present an automated cutoff selection method enabling the Chameleon 2 algorithm to generate high-quality clusterings without requiring human interaction.
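One common heuristic for automated cutoff selection — not necessarily the method Chameleon 2 uses, but illustrative of the problem it solves — is to cut the dendrogram at the largest gap between consecutive merge heights:

```python
def auto_cutoff(merge_heights):
    """Given the sorted merge heights of a hierarchical clustering,
    place the cutoff in the middle of the largest gap between
    consecutive heights. Merges above the cutoff are undone,
    which fixes the number of clusters without human input."""
    gaps = [(merge_heights[i + 1] - merge_heights[i], i)
            for i in range(len(merge_heights) - 1)]
    _, idx = max(gaps)  # widest gap wins
    return (merge_heights[idx] + merge_heights[idx + 1]) / 2
```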
In recent years, social networks have become important platforms for people to disseminate information. However, effective control is needed, by blocking links, when negative content such as rumors and misinformation spreads over the network. In this paper, we propose the Rumor Spread Minimization (RSM) problem, which removes an edge set from the network such that the rumor spread is minimized. We first prove that the objective function is not submodular. Then we propose a submodular lower bound and upper bound to approximate the objective function. Next, we develop a heuristic algorithm to calculate the original objective function. Furthermore, we reformulate our objective function as a Difference of Submodular (DS) function. Finally, we conduct experiments on real-world datasets to evaluate the proposed methods. The experimental results show that the upper and lower bounds are very close, which indicates their good quality, and that the proposed algorithm outperforms several comparative methods.
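A baseline greedy strategy for the edge-removal setting can be sketched as follows, assuming a deterministic reachability model of spread (the paper's objective and its DS-based algorithms are more sophisticated):

```python
from collections import deque

def reachable(adj, seeds, removed):
    """Nodes reachable from the rumor seed set, ignoring removed edges.
    adj maps each node to a list of out-neighbors."""
    seen = set(seeds)
    q = deque(seeds)
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if (u, v) not in removed and v not in seen:
                seen.add(v)
                q.append(v)
    return seen

def greedy_edge_blocking(adj, seeds, k):
    """Greedily remove up to k edges, each time picking the edge whose
    removal most shrinks the rumor's reach from the seeds."""
    edges = [(u, v) for u, nbrs in adj.items() for v in nbrs]
    removed = set()
    for _ in range(k):
        base = len(reachable(adj, seeds, removed))
        best, best_gain = None, 0
        for e in edges:
            if e in removed:
                continue
            gain = base - len(reachable(adj, seeds, removed | {e}))
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:
            break  # no edge reduces the spread further
        removed.add(best)
    return removed
```

Because the objective is not submodular, this greedy baseline carries no approximation guarantee — which is exactly why the paper develops submodular bounds around it.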
Expert networks are formed by a group of expert professionals with different specialties to collaboratively resolve specific queries posted to the network. In such networks, when a query reaches an expert who does not have sufficient expertise, the query needs to be routed to other experts for further processing until it is completely solved; therefore, query answering efficiency is sensitive to the underlying query routing mechanism being used. Among all possible query routing mechanisms, decentralized search, operating purely on each expert's local information without any knowledge of the network's global structure, represents the most basic and scalable routing mechanism, applicable to any network scenario, even dynamic networks. However, there is still a lack of fundamental understanding of the efficiency of decentralized search in expert networks. In this regard, we investigate decentralized search by quantifying its performance under a variety of network settings. Our key findings reveal the existence of network conditions under which decentralized search can achieve significantly short query routing paths (i.e., between O(log n) and O(log^2 n) hops, n: total number of experts in the network). Based on this theoretical foundation, we further study how the unique properties of decentralized search in expert networks are related to the anecdotal small-world phenomenon. In addition, we demonstrate that decentralized search is robust against estimation errors introduced by misinterpreting the required expertise levels. The developed performance bounds, confirmed by real datasets, are able to assist in predicting network performance and designing complex expert networks.
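The decentralized search mechanism — forwarding a query using only local neighbor information — can be sketched as greedy routing toward a target expertise level; the one-dimensional expertise model here is a simplifying assumption of this sketch:

```python
def decentralized_route(neighbors, expertise, start, target, max_hops=50):
    """Greedy decentralized search: at each expert, forward the query to
    the neighbor whose expertise is closest to the required level.
    Only local information (the current node's neighbor list) is used.
    Returns the routing path, or None if the query gets stuck."""
    path = [start]
    current = start
    for _ in range(max_hops):
        if expertise[current] == target:   # query solved at this expert
            return path
        nxt = min(neighbors[current],
                  key=lambda v: abs(expertise[v] - target))
        if abs(expertise[nxt] - target) >= abs(expertise[current] - target):
            return None  # local minimum: no neighbor is strictly closer
        current = nxt
        path.append(current)
    return None
```

On well-structured networks such greedy paths can be short; the paper's contribution is characterizing exactly when they fall between O(log n) and O(log^2 n) hops.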
Introduction to Special Issue on Interactive Data Exploration and Analytics (TKDD-IDEA)
Estimating distributions of labels associated with nodes (e.g., number of connections or citizenship of users in a social network) in large graphs via sampling is a vital part of the study of complex networks. Due to their low cost, sampling via random walks (RWs) has been proposed as an attractive solution to this task. Most RW methods assume either that the network is undirected or that walkers can traverse edges regardless of their direction. Some RW methods have been designed for directed networks where edges coming into a node are not directly observable. In this work, we propose Directed Unbiased Frontier Sampling (DUFS), a sampling method based on a large number of coordinated walkers, each starting from a node chosen uniformly at random. It is applicable to directed networks with invisible incoming edges because it constructs, in real time, an undirected graph consistent with the walkers' trajectories, and because it uses random jumps, which prevent walkers from being trapped. DUFS generalizes previous RW methods and is suited to undirected networks and to directed networks regardless of in-edge visibility. We also propose an improved estimator of vertex label distributions which combines information from the initial walker locations with subsequent RW observations. We evaluate DUFS, comparing it against other RW methods, investigating the impact of its parameters on estimation accuracy and providing practical guidelines for choosing them. In estimating out-degree distributions, DUFS yields significantly better estimates of the head than other methods, while matching or exceeding estimation accuracy on the tail.
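A simplified frontier-style sampler with random jumps, illustrating two ingredients DUFS builds on (degree-proportional walker selection and uniform jumps); this sketch omits DUFS's real-time undirected-graph construction and its improved estimator:

```python
import random

def frontier_sample(adj, n_walkers=10, steps=1000, jump_p=0.1, seed=0):
    """Maintain many walkers started uniformly at random. Each step:
    pick one walker with probability proportional to its current node's
    degree (frontier sampling), then move it to a random neighbor, or
    jump to a uniform node with probability jump_p (or when stuck).
    Returns the visited-node sequence for later reweighted estimation."""
    rng = random.Random(seed)
    nodes = list(adj)
    walkers = [rng.choice(nodes) for _ in range(n_walkers)]
    visits = []
    for _ in range(steps):
        weights = [max(len(adj[w]), 1) for w in walkers]
        i = rng.choices(range(n_walkers), weights=weights)[0]
        u = walkers[i]
        if rng.random() < jump_p or not adj[u]:
            v = rng.choice(nodes)      # random jump prevents trapping
        else:
            v = rng.choice(adj[u])     # ordinary random-walk step
        walkers[i] = v
        visits.append(v)
    return visits
```

In a full implementation, visit counts would be reweighted (e.g., by inverse degree) to unbias the label-distribution estimate.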
With the growing popularity of the Internet, social networks serving different functions have undergone unprecedented development in recent years. Users create different accounts, also called identities, in different social networks, but there is no intuitive connection between them. In order to link different identities that belong to the same user across social networks, we study the problem of identity matching across multiple social networks in this paper. We propose a three-step framework to examine the effectiveness of three kinds of features, i.e., profile features, user-generated content features and user relationship features. By exploiting different combinations of features in this framework, we find the smallest set of effective features that can identify users across social networks. Experiments on a real-world dataset show that our model can effectively match identities from different social networks.
In this paper, we focus on the problem of learning a Bayesian network over distributed data stored in a commodity cluster. Specifically, we address the challenge of computing the scoring function over distributed data in an efficient and scalable manner, which is a fundamental task during learning. While exact score computation can be done using MapReduce-style computation, our goal is to compute approximate scores much faster, with probabilistic error bounds and in a scalable manner. We propose a novel approach which is designed to achieve: (a) decentralized score computation using the principle of gossiping; (b) lower resource consumption via a probabilistic approach for maintaining scores using the properties of a Markov chain; and (c) effective distribution of tasks during score computation (on large datasets) by synergistically combining well-known hashing techniques. We conduct theoretical analysis of our approach in terms of convergence speed, and memory and network bandwidth consumption. We also discuss how our approach can efficiently recompute scores when new data become available. We conducted a comprehensive evaluation of our approach, comparing it with the MapReduce-style computation using datasets of different characteristics on a 16-node cluster. The evaluation results demonstrate that our approach can approximately compute the sufficient statistics needed for family scores up to 10 times faster than the MapReduce-style computation, while achieving very good accuracy (approximately 10% average relative error).
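The gossiping principle underlying the decentralized score computation can be illustrated with a minimal pairwise-averaging sketch (not the paper's actual protocol): every exchange preserves the global sum while driving all local values toward the mean, so each node converges to a global aggregate without a central coordinator.

```python
import random

def gossip_average(values, rounds=200, seed=0):
    """Pairwise averaging gossip: repeatedly pick two random nodes and
    replace both of their local values with the pair's average. Each
    exchange conserves the global sum, so every local value converges
    to the global mean using only peer-to-peer communication."""
    rng = random.Random(seed)
    vals = list(values)
    n = len(vals)
    for _ in range(rounds):
        i, j = rng.randrange(n), rng.randrange(n)
        avg = (vals[i] + vals[j]) / 2
        vals[i] = vals[j] = avg
    return vals
```

In the distributed-scoring setting, the gossiped quantities would be (approximate) sufficient statistics rather than raw values, but the convergence mechanism is the same.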