ACM Transactions on

Knowledge Discovery from Data (TKDD)

Latest Articles

Digger: Detect Similar Groups in Heterogeneous Social Networks

People participate in multiple online social networks, e.g., Facebook, Twitter, and Linkedin, and these social networks with heterogeneous social content and user relationship are named as heterogeneous social networks. Group structure widely exists in heterogeneous social networks, which reveals the evolution of human cooperation. Detecting... (more)

Algorithms for Online Influencer Marketing

Influence maximization is the problem of finding influential users, or nodes, in a graph so as to maximize the spread of information. It has many applications in advertising and marketing on social networks. In this article, we study a highly generic version of influence maximization, one of optimizing influence campaigns by sequentially selecting... (more)

PSP-AMS: Progressive Mining of Sequential Patterns Across Multiple Streams

Sequential pattern mining is used to find frequent data sequences over time. When sequential patterns are generated, the newly arriving patterns may not be identified as frequent sequential patterns due to the existence of old data and sequences. Progressive sequential pattern mining aims to find the most up-to-date sequential patterns given that... (more)


About TKDD 

ACM Transactions on Knowledge Discovery from Data (TKDD) publishes original archival papers in the area of knowledge discovery from data and closely related disciplines.  The majority of the papers that appear in TKDD is expected to address the logical and technical foundation of knowledge discovery and data mining.

Forthcoming Articles
Robust Regression via Heuristic Corruption Thresholding and Its Adaptive Estimation Variation

The presence of data noise and corruptions recently invokes increasing attention on Robust Least Squares Regression (RLSR), which addresses the fundamental problem that learns reliable regression coefficients when response variables can be arbitrarily corrupted. Until now, several important challenges still cannot be handled concurrently: 1) rigorous recovery guarantee of regression coefficients 2) difficulty in estimating the corruption ratio parameter; and 3) scaling to massive dataset. This paper proposes a novel Robust Least squares regression algorithm via Heuristic Corruption Thresholding (RHCT), that concurrently addresses all the above challenges. Specifically, the algorithm alternately optimizes the regression coefficients and estimates the optimal uncorrupted set via heuristic thresholding without corruption ratio parameter until it converges. Moreover, to improve the efficiency of corruption estimation in large-scale data, a robust regression algorithm based on adaptive corruption estimation variation (RACT) is proposed to determine the size of uncorrupted set in the adaptive search without iterating data samples exhaustively. We also prove that our algorithms benefit from strong guarantees analogous to those of state-of-the-art methods in terms of convergence rates and recovery guarantees. Extensive experiment demonstrates that the effectiveness of our new methods are superior to that of existing methods in the recovery of both regression coefficients and uncorrupted sets, with very competitive efficiency.

Taxonomy and Evaluation for Microblog Popularity Prediction

As social networks have become a major source of information, predicting the outcome of information diffusion has appeared intriguing both to researchers and practitioners. By organizing and categorizing the joint efforts of numerous studies on popularity prediction, this survey presents a hierarchical taxonomy and helps to establish a systematic overview of popularity prediction methods for microblog. Specifically, we uncover three lines of thoughts: the feature-based approach, time-series modelling, and the collaborative filtering approach and analyze them respectively. Furthermore, we also categorize prediction methods based on their underlying rationale: whether they attempt to model the motivation of users or monitor the early responses. Finally, we put these prediction methods to test by performing experiments on real-life data collected from popular social networks Twitter and Weibo. We compare the methods in terms of accuracy, efficiency, timeliness and robustness. As far as we are concerned, there is no precedented survey aimed at microblog popularity prediction at the time of submission. By establishing a taxonomy and evaluation for the first time, we hope to provide an in-depth review of state-of-the-art prediction methods and point out directions for further research. Our evaluations show that time-series modelling has the advantage of high accuracy and the ability to improve over time. The feature-based methods using only temporal features performs nearly as well as using all possible features, producing average results. This suggests that temporal features do have strong predictive power and that power is better exploited with time-series models. On the other hand, this implies that we know little about the future popularity of an item before it is posted, which may be the focus of further research.

Probabilistic Feature Selection and Classification Vector Machine

Sparse Bayesian learning is one of the state-of-the-art machine learning algorithms, which is able to make stable and reliable probabilistic predictions. However, some of these algorithms, e.g. probabilistic classification vector machine (PCVM) and relevant vector machine (RVM), are not capable of eliminating irrelevant noise features, which potentially not only increase the computational complexity, but also lead to performance degradation of trained models. To tackle this problem, in this paper, we propose a sparse Bayesian classifier which simultaneously selects the relevant samples and features. We name this classifier a probabilistic feature selection and classification vector machine (PFCVM), in which truncated Gaussian distributions are employed as both sample and feature priors. In order to derive the analytical solution for the proposed algorithm, we use Laplace approximation to calculate approximate posteriors and marginal likelihoods. Finally, we obtain the optimized parameters and hyperparameters by the type-II maximum likelihood method. The experiments on Waveform data set, EEG data sets and high dimensional data sets validate the performance of PFCVM under two criteria: accuracy of classification and efficacy of selected features. Finally, we analyze the generalization performance of PFCVM and derive a generalization error bound for PFCVM. Then by tightening the bound, we demonstrate the significance of the sparseness for the model.

Feature Selection via Transferring Knowledge Across Different Classes

The problem of feature selection has attracted considerable research interest in recent years. Supervised information is capable of significantly improving the quality of selected features. However, existing supervised feature selection methods all require that classes in the labeled data (source domain) and unlabeled data (target domain) to be identical, which may be too restrictive in many cases. In this paper, we consider a more challenging cross-class setting where the classes in these two domains are related but different, which has rarely been studied before. We propose a Cross-class Knowledge Transfer Feature Selection (CKTFS) framework which transfers the cross-class knowledge from the source domain to guide target domain feature selection. Specifically, high-level descriptions, i.e., attributes, are used as the bridge for knowledge transfer. To further improve the quality of the selected features, our framework jointly considers the tasks of cross-class knowledge transfer and feature selection. Experimental results on four benchmark datasets demonstrate the superiority of the proposed method.

DWE-Med: Dynamic Word Embeddings for Medical Domain

Recent advances in unsupervised language processing methods have created an opportunity to exploit massive text corpora for developing high-quality vector space representation (also known as word embeddings) of words. Towards this direction, practitioners have developed and applied several data driven embedding models with quite good rate of success. However, a drawback of these models lies in their premise of static context; wherein, the meaning of a word is assumed to remain the same over the period of time. This is limiting because it is known that the semantic meaning of a concept evolves over time. While such semantic drifts are routinely observed in almost all the domains; their effect is acute in domain such as biomedicine, where the semantic meaning of a concept changes relatively fast. To address this, in this study, we aim to learn temporally aware vector representation of medical concepts from the time-stamped text data, and in doing so provide a systematic approach to formalize the problem. More specifically, a dynamic word embedding based model that jointly learns the temporal characteristics of medical concepts and performs across time-alignment is proposed. Apart from capturing the evolutionary characteristics in an optimal manner, the model also factors in the implicit medical properties useful for a variety of bio-medical applications. Empirical studies conducted on two important bio-medical use cases validates the effectiveness of the proposed approach and suggests that the model not only learns quality embeddings but also facilitates intuitive trajectory visualizations.

Outcome-Oriented Predictive Process Monitoring: Review and Benchmark

Predictive business process monitoring refers to the act of making predictions about the future state of ongoing cases of a business process, based on their incomplete execution traces and logs of historical (completed) traces. Motivated by the increasingly pervasive availability of fine-grained event data about business process executions, the problem of predictive process monitoring has received substantial attention in the past years. In particular, a considerable number of methods have been put forward to address the problem of outcome-oriented predictive process monitoring, which refers to classifying each ongoing case of a process according to a given set of possible outcomes -- e.g. Will the customer complain or not? Will an order be delivered, cancelled or withdrawn? Unfortunately, different authors have used different datasets, experimental settings, evaluation measures and baselines to assess their proposals, resulting in poor comparability and an unclear picture of the relative merits and applicability of different methods. To address this gap, this article presents a systematic review and taxonomy of outcome-oriented predictive process monitoring methods, and a comparative experimental evaluation of eleven representative methods using a benchmark covering twelve predictive process monitoring tasks based on four real-life event logs.

Diffusion Prediction with Network Regularized Role-based User Representation Learning

In this paper, we aim at developing a user representation learning model to solve the information diffusion prediction problem in social media. The main idea is to project the diffusion users into a continuous latent space with the role-based (sender and receiver) representations. This representation learning model embeds the information of both observed diffusion cascades and explicit network into the user diffusion representations. It learns the role-based representations based on a cascade modeling objective that aims at maximizing the likelihood of observed cascades, and employs the matrix factorization objective of reconstructing structural proximities as a regularization on representations. The model is able to capture the unique diffusion characteristics of users in the representation space and handle the problem of the diffusion-related data. We evaluate the proposed model with two prediction strategies, i.e., diffusion simulation and diffusion ranking, on three real-world datasets. The experimental results demonstrate the better performances of the proposed model than state-of-the-art diffusion representation learning models and other popular graph-based methods.

Multi-task Crowdsourcing via an Optimization Framework

The unprecedented amounts of data have catalyzed the trend of combining human insights with machine learning techniques, which facilitate the use of crowdsourcing to enlist label information both effectively and efficiently. One crucial challenge in crowdsourcing is the diverse worker quality, which determines the accuracy of the label information provided by such workers. Motivated by the observations that same set of tasks are typically labeled by the same set of workers, we studied their behaviors across multiple related tasks and proposed an optimization framework for learning from task and worker dual heterogeneity. The proposed method uses a weight tensor to represent the workers' behaviors across multiple tasks, and seeks to find the optimal solution of the tensor by exploiting its structured information. Then, we propose an iterative algorithm to solve the optimization problem and analyze its computational complexity. To infer the true label of an example, we construct a worker ensemble based on the estimated tensor, whose decisions will be weighted using a set of entropy weight. We also prove that the gradient of the most time-consuming updating block is separable with respect to the workers, which leads to a randomized algorithm with faster speed. Moreover, we extend the learning framework to accommodate to the multi-class setting. Finally, we test the performance of our framework on several data sets, and demonstrate its superiority over state-of-the-art techniques.

Rumor Blocking through Online Link Deletion on Social Networks

In recent years, social networks have become important platforms for people to disseminate information. However, we need to take effective control by blocking links when negative contents such as rumor and misinformation distribute over the network. In this paper, we propose a Rumor Spread Minimization (RSM) problem that removes an edge set from network such that the rumor spread is minimized. We first prove the objective function is not submodular. Then we propose submodular lower-bound and upper-bound to approximate objective function. Next, we develop a heuristic algorithm to calculate original objective function. Furthermore, we reformulate our objective function as a Difference Submodular (DS) function. Finally, we conduct experiments on real-world datasets to evaluate our proposed methods. The experiment results show that the upper and lower bounds are very close which indicates the good quality of them. And the proposed algorithm outperforms some comparative methods.

Near-Optimal and Practical Algorithms for Graph Scan Statistics with Connectivity Constraints

One fundamental task in network analysis is detecting hotspots or anomalies in the network; that is, detecting subgraphs where there is signi cantly more activity than one would expect given historical data or some baseline process. Scan statistics is one popular approach used for anomalous subgraph detection. This methodology involves maximizing a score function over all connected subgraphs, which is a challenging computational problem. A number of heuristics have been proposed for these problems, but they do not provide any quality guarantees. Here, we propose a framework for designing algorithms for optimizing a large class of scan statistics for networks, subject to connectivity constraints. Our algorithms run in time that scales linearly on the size of the graph and depends on a parameter we call the e ective solution size, while providing rigorous approximation guarantees. In contrast, most prior methods have super-linear running times in terms of graph size. Extensive empirical evidence demonstrates the e ectiveness and e ciency of our proposed algorithms in comparison with state-of-the-art methods. Our approach improves on the performance relative to all prior methods, giving up to over 25% increase in the score. Further, our algorithms scale to networks with up to a million nodes, which is 1-2 orders of magnitude larger than all prior applications.

The Relationship between Online Social Network Ties and User Attributes

The distance between users has an effect on the formation of social network ties, but it is not the only or even the main factor. Knowing all the features that influence such ties is very important for many related domains such as location-based recommender systems and community and event detection systems for online social networks (OSNs). In recent years, researchers have analysed the role of user geo-location in OSNs. Researchers have also attempted to determine the probability of friendships being established based on distance, where friendship is not only a function of distance. However, some important features of OSNs remain unknown. In order to comprehensively understand the OSN phenomenon we also need to analyse users attributes. Basically, an OSN functions according to four main user properties: user geo-location, user weight, number of user interactions and user lifespan. The research presented here sought to determine whether the user mobility pattern can be used to predict users interaction behaviour. It also investigated whether. in addition to distance. the number of friends (known as user weight) interferes in social network tie formation. To this end, we analysed the above-stated features in three large-scale OSNs by using a two-to-two analysis method. We found that, regardless of a high degree freedom in user mobility, the fraction of the number of outside activities over the inside activity is a significant fraction that help us to address the user interaction behaviour. To the best of our knowledge, research has not been conducted elsewhere on this issue. We also present a high-resolution formula in order to improve the friendship probability function.

Performance Bounds of Decentralized Search in Expert Networks for Query Answering

Expert networks are formed by a group of expert-professionals with different specialties to collaboratively resolve specific queries posted to the network. In such networks, when a query reaches an expert who does not have sufficient expertise, this query needs to be routed to other experts for further processing until it is completely solved; therefore, query answering efficiency is sensitive to the underlying query routing mechanism being used. Among all possible query routing mechanisms, decentralized search, operating purely on each expert's local information without any knowledge of network global structure, represents the most basic and scalable routing mechanism, which is applicable to any network scenarios even in dynamic networks. However, there is still a lack of fundamental understanding of the efficiency of decentralized search in expert networks. In this regard, we investigate decentralized search by quantifying its performance under a variety of network settings. Our key findings reveal the existence of network conditions, under which decentralized search can achieve significantly short query routing paths (i.e., between O(log n) and O(log^2 n) hops, n: total number of experts in the network). Based on such theoretical foundation, we further study how the unique properties of decentralized search in expert networks is related to the anecdotal small-world phenomenon. In addition, we demonstrate that decentralized search is robust against estimation errors introduced by misinterpreting the required expertise levels. The developed performance bounds, confirmed by real datasets, are able to assist in predicting network performance and designing complex expert networks.

A Survey of Parallel Sequential Pattern Mining

With the growing popularity of shared resources, large volumes of complex data of different types are collected automatically. Traditional data mining algorithms generally have problems and challenges including huge memory cost, low processing speed, and inadequate hard disk space. As a fundamental task of data mining, sequential pattern mining (SPM) is used in a wide variety of real-life applications. However, it is more complex and challenging than other pattern mining tasks, i.e., frequent itemset mining and association rule mining, and also suffers from the above challenges when handling the large-scale data. To solve these problems, mining sequential patterns in a parallel or distributed computing environment has emerged as an important issue with many applications. In this paper, an in-depth survey of the current status of parallel sequential pattern mining (PSPM) is investigated and provided, including detailed categorization of traditional serial SPM approaches, and state of the art parallel SPM. We review the related work of parallel sequential pattern mining in detail, including partition-based algorithms for PSPM, Apriori-based PSPM, pattern growth based PSPM, and hybrid algorithms for PSPM, and provide deep description (i.e., characteristics, advantages, disadvantages and summarization) of these parallel approaches of PSPM. Some advanced topics for PSPM, including parallel quantitative / weighted / utility sequential pattern mining, PSPM from uncertain data and stream data, hardware acceleration for PSPM, are further reviewed in details. Besides, we review and provide some well-known open-source software of PSPM. Finally, we summarize some challenges and opportunities of PSPM in the big data era.

Identity Matching across Chinese Social Networks Based on Least Information

Along with the popular of Internet, social networks aiming at different functions has undergone an unprecedented development in recent years. Users create different accounts, which are also called identities, in different social networks. However, there is no intuitive connection between them. In order to link different identities which belong to the same user but in different social networks, we study to solve the problem of identity matching across multiple social networks in this paper. We proposed a three-step framework to exam the efficiency of three kinds of features, i.e. profile features, user generated content features and user relationship features. By exploiting different combinations of features in this framework, we find the least effective features which can identify users across social networks. Some experiments on real-world dataset show that our model can basically match the identities from different social networks.

Variant Grassmann Manifolds: a Representation Augmentation Method for Action Recognition

In classification tasks, classifiers trained with finite examples might generalize poorly to new data with unknown variance. For this issue, data augmentation is a successful solution where numerous artificial examples are added to training sets. In this paper, we focus on data augmentation for improving the accuracy of action recognition, where action videos are modeled by linear dynamical systems and approximately represented as linear subspaces. These subspace representations lie in a non-Euclidean space, named Grassmann manifold, containing points as orthonormal matrixes. It is our concern that poor generalization may result from the variance of manifolds when data come from different sources or classes. Thus we introduce infinitely many \textit{variant Grassmann manifolds} (VGM) subject to a known distribution, then represent each action video as different Grassmann points leading to augmented representation. Furthermore, a prior based on the stability of subspace bases is introduced, so the manifold distribution can be adaptively determined, balancing discrimination and representation. Experimental results of multi-class and multi-source classification show that VGM softmax classifiers achieve lower test error rates compared to methods with a single manifold.

Fast Approximate Score Computation on Large-Scale Distributed Data for Learning Multinomial Bayesian Networks

In this paper, we focus on the problem of learning a Bayesian network over distributed data stored in a commodity cluster. Specifically, we address the challenge of computing the scoring function over distributed data in an efficient and scalable manner, which is a fundamental task during learning. While exact score computation can be done using the MapReduce-style computation, our goal is to compute approximate scores much faster with probabilistic error bounds and in a scalable manner. We propose a novel approach which is designed to achieve: (a) decentralized score computation using the principle of gossiping; (b) lower resource consumption via a probabilistic approach for maintaining scores using the properties of a Markov chain; and (c) effective distribution of tasks during score computation (on large datasets) by synergistically combining well-known hashing techniques. We conduct theoretical analysis of our approach in terms of convergence speed, and memory and network bandwidth consumption. We also discuss how our approach is capable of efficiently recomputing scores when new data are available. We conducted a comprehensive evaluation of our approach and compared with the MapReduce-style computation using datasets of different characteristics on a 16-node cluster. The evaluation results demonstrate that our approach can approximately compute the sufficient statistics needed to compute family scores up to 10 times faster than the MapReduce-style computation while achieving very good accuracy (approximately 10% average relative error).

All ACM Journals | See Full Journal Index

Search TKDD
enter search term and/or author name