ACM Transactions on

Knowledge Discovery from Data (TKDD)

Latest Articles

Coupled Clustering Ensemble by Exploring Data Interdependence

Clustering ensembles combine multiple partitions of data into a single clustering solution. It is an effective technique for improving the quality of... (more)

Entity-Based Query Recommendation for Long-Tail Queries

Query recommendation, which suggests related queries to search engine users, has attracted a lot of attention in recent years. Most of the existing... (more)

Modeling Alzheimer’s Disease Progression with Fused Laplacian Sparse Group Lasso

Alzheimer’s disease (AD), the most common type of dementia, not only imposes a huge financial burden on the health care system, but also a... (more)

Stability and Robustness in Influence Maximization

In the well-studied Influence Maximization problem, the goal is to identify a set of k nodes in a social network whose joint influence on the network... (more)

Protecting Privacy in Trajectories with a User-Centric Approach

The increased use of location-aware devices, such as smartphones, generates a large amount of trajectory data. These data can be useful in several... (more)

FrauDetector+: An Incremental Graph-Mining Approach for Efficient Fraudulent Phone Call Detection

In recent years, telecommunication fraud has become more rampant internationally with the development of modern technology and global communication. Because of rapid growth in the volume of call logs, the task of fraudulent phone call detection is confronted with big data issues in real-world implementations. Although our previous work,... (more)

Large-Scale Adversarial Sports Play Retrieval with Learning to Rank

As teams of professional leagues are becoming more and more analytically driven, the interest in effective data management and access of sports plays... (more)

A General Embedding Framework for Heterogeneous Information Learning in Large-Scale Networks

Network analysis has been widely applied in many real-world tasks, such as gene analysis and targeted marketing. To extract effective features for... (more)

Enumerating Trillion Subgraphs On Distributed Systems

How can we find patterns from an enormous graph with billions of vertices and edges? The subgraph enumeration, which is to find patterns from a graph,... (more)

Event Analytics via Discriminant Tensor Factorization

Analyzing the impact of disastrous events has been central to understanding and responding to crises. Traditionally, the assessment of disaster impact... (more)

Semi-supervised Learning Meets Factorization: Learning to Recommend with Chain Graph Model

Recently, latent factor model (LFM) has been drawing much attention in recommender systems due to its good performance and scalability. However, existing LFMs predict missing values in a user-item rating matrix only based on the known ones, and thus the sparsity of the rating matrix always limits their performance. Meanwhile, semi-supervised... (more)


About TKDD 

ACM Transactions on Knowledge Discovery from Data (TKDD) publishes original archival papers in the area of knowledge discovery from data and closely related disciplines.  The majority of the papers that appear in TKDD is expected to address the logical and technical foundation of knowledge discovery and data mining.

Forthcoming Articles
Detecting and Assessing Anomalous Evolutionary Behaviors of Nodes in Evolving Social Networks

Based on the performance of entire social networks, anomaly analysis for evolving social networks generally ignores the otherness of the evolutionary behaviors of different nodes, such that it is difficult to precisely identify the anomalous evolutionary behaviors of nodes (Aebn). Assuming that a nodes evolutionary behavior that generates and removes edges normally follows stable evolutionary mechanisms, this study focuses on detecting and assessing Aebn whose evolutionary mechanisms deviate from their past mechanisms, and proposes a link prediction detection (LPD) method and a matrix perturbation assessment (MPA) method. LPD describes a nodes evolutionary behavior by fitting its evolutionary mechanism, and designs indexes for edge generation and removal to evaluate the extent to which the evolutionary mechanism of a nodes evolutionary behavior can be fitted by a link prediction algorithm. Further, it detects Aebn by quantifying the differences among behavior vectors that characterize the nodes evolutionary behaviors in different periods. In addition, MPA considers Aebn as a perturbation of the social network structure, and quantifies the effect of Aebn on the social network structure based on matrix perturbation analysis. Extensive experiments on eight disparate real-world networks demonstrate that analyzing Aebn from the perspective of evolutionary mechanisms is important and beneficial. Compared with other state-of-the-art methods, LPD can effectively detect Aebn, and MPA can reasonably assess their effect.

Large Scale Online Multiple Kernel Regression with Application to Time-Series Prediction

Kernel-based regression represents an important family of learning techniques for solving challenging regression tasks with non-linear patterns. Despite being studied extensively, most of the existing work suffers from two major drawbacks: (i) they are often designed for solving regression tasks in a batch learning setting, making them not only computationally inefficient and but also poorly scalable in real-world applications where data arrives sequentially; and (ii) they usually assume a fixed kernel function is given prior to the learning task, which could result in poor performance if the chosen kernel is inappropriate. To overcome these drawbacks, this work presents a novel scheme of Online Multiple Kernel Regression (OMKR), which sequentially learns the kernel-based regressor in an online and scalable fashion, and dynamically explore a pool of multiple diverse kernels to avoid suffering from a single fixed poor kernel so as to remedy the drawback of manual/heuristic kernel selection. The OMKR problem is more challenging than regular kernel-based regression tasks since we have to on-the-fly determine both the optimal kernel-based regressor for each individual kernel and the best combination of the multiple kernel regressors. We propose a family of OMKR algorithms for regression and discuss their application to time series prediction tasks including application to AR, ARMA and ARIMA time-series. We develop novel approaches to make OMKR scalable for large datasets, to counter the problems arising from an unbounded number of support vectors. We also explore the effect of kernel combination at prediction level and at the representation level. Finally, we conduct extensive experiments to evaluate the empirical performance on both real-world regression and times series prediction tasks.

Chameleon 2: An Improved Graph Based Clustering Algorithm

Traditional clustering algorithms fail to produce human-like results when confronted with data of variable density, complex distributions or presence of noise. We propose an improved graph-based clustering algorithm Chameleon 2 which overcomes several drawbacks of state-of-the-art clustering approaches. We modified internal cluster quality measure and added extra step to ensure algorithm robustness. Our results suggest significant positive impact to clustering quality measured by Normalized Mutual Information on 32 artificial datasets used in clustering literature. Significant improvement was confirmed on real-world datasets as well. The performance of clustering algorithms such as DBSCAN is typically parameter sensitive and exhaustive manual parameter tuning is necessary to obtain a meaningful result. All hierarchical clustering methods are very sensitive to cutoff selection and human expert is often required to find true cutoff for each clustering result. We present an automated cutoff selection method enabling Chameleon 2 algorithm to generate high-quality clustering without requiring human interaction.

Algorithms for Online Influencer Marketing

Influence maximization is the problem of finding influential users (or nodes) in a graph so as to maximize the spread of information. It has many applications in advertising and marketing on social networks. In this paper, we study a highly generic version of influence maximization, one of optimizing influence campaigns by sequentially selecting "spread seeds" from a set of influencers, a small subset of the node population, under the hypothesis that, in a given campaign, previously activated nodes remain persistently active throughout and thus do not yield further rewards. This problem is in particular relevant for an important form of online marketing, known as influencer marketing, in which the marketers target a sub-population of influential people, instead of the entire base of potential buyers. Importantly, we make no assumptions on the underlying diffusion model and we work in a setting where neither a diffusion network nor historical activation data are available. We call this problem online influencer marketing with persistence (in short, OIMP). We first discuss motivating scenarios and present our general approach. We introduce an estimator on the influencers' remaining potential -- the expected number of nodes that can still be reached from them -- and justify its strength to rapidly estimate the desired value, relying on real data gathered from Twitter. We then describe a novel algorithm, GT-UCB, relying on upper confidence bounds on the remaining potential. We show that our approach leads to high-quality spreads on both simulated and real datasets, even though it makes almost no assumptions on the diffusion medium. Importantly, it is orders of magnitude faster than state-of-the-art influence maximization methods, making it possible to deal with large-scale online scenarios.

Tensor Completion Algorithms in Big Data Analytics

Tensor completion is a problem of filling the missing or unobserved entries of partially observed tensors. Due to the multidimensional character of tensors in describing complex datasets, tensor completion algorithms and their applications have received wide attention and achievement in data mining, computer vision, signal processing, and neuroscience, etc. In this survey, we provide a modern overview of recent advances in tensor completion algorithms from the perspective of big data analytics characterized by diverse variety, large volume, and high velocity. Towards a better comprehension and comparison of vast existing advances, we summarize and categorize them into four groups including general tensor completion algorithms, tensor completion with auxiliary information (variety), scalable tensor completion algorithms (volume) and dynamic tensor completion algorithms (velocity). Besides, we introduce their applications on real-world data-driven problems and present an open-source package covering several widely used tensor decomposition and completion algorithms. Our goal is to summarize these popular methods and introduce them to researchers for promoting the research process in this field and give an available repository for practitioners. In the end, we also discuss some challenges and promising research directions in this community for future explorations.

PSP-AMS: Progressive mining of Sequential Patterns Across Multiple Streams

Sequential pattern mining is to find frequent data sequences with time. When sequential patterns are generated, the newly arriving patterns may not be identified as frequent sequential patterns due to the existence of old data and sequences. Progressive sequential pattern mining aims to find most up-to-date sequential patterns given that obsolete items will be deleted from the sequences. When sequences come with multiple data streams, it is difficult to maintain and update the current sequential patterns. Even worse, when we consider the sequences across multiple streams, previous methods could not efficiently compute the frequent sequential patterns. In this work, we propose an efficient algorithm PSP-AMS to address this problem. PSP-AMS uses an novel data structure PSP-MS-tree to insert new items, update current items, and delete obsolete items. By maintaining PSP-MS-tree, PSP-AMS efficiently finds the frequent sequential patterns across multiple streams. The experimental results show that PSP-AMS significantly outperforms previous algorithms for mining progressive sequential patterns across multiple streams on synthetic data as well as real data.

Robust Spectral Ensemble Clustering via Rank Minimization

Ensemble Clustering (EC) is an important topic for data cluster analysis. It aims to integrate multiple Basic Partitions (BPs) of a particular dataset into a consensus partition. Among the previous works, one promising and effective way is to transform ensemble clustering as a graph partitioning problem on the co-association matrix, which is a pair-wise similarity matrix summarized by all the BPs in essence. However, most existing EC methods directly utilize the co-association matrix, yet without considering various noises (\textit{e.g.}, the disagreement between different BPs and the outlier ones) that may exist in it. These noises can impair the cluster structure of a co-association matrix and thus mislead the final graph partitioning process. In this paper, we propose a novel Robust Spectral Ensemble Clustering (RSEC) approach to address this challenge. Specifically, we learn a Low-Rank Representation (LRR) for the co-association matrix to reveal its cluster structure and handle the noises; and meanwhile, we conduct spectral clustering on the learned representation to seek a consensus partition. These two steps are jointly performed in a unified optimization framework. In particular, during the optimizing process, we utilize consensus partition to iteratively enhance the block-diagonal structure of low-rank representation, in order to assist the graph partitioning. To solve RSEC, we first formulate it by using nuclear norm as a convex proxy to the rank function. Then, motivated by the recent advance in non-convex rank minimization, we further develop a non-convex model for RSEC and provide it a solution by the majorization-minimization Augmented Lagrange Multiplier (MM-ALM) algorithm. Experimental results on numerous real-world datasets demonstrate the effectiveness of our method over the state-of-the-art. Moreover, several impact factors that may affect the clustering performance of our approach are also explored extensively.

Introduction to Special Issue on Interactive Data Exploration and Analytics (TKDD-IDEA)

Characterizing Directed and Undirected Networks via Multidimensional Walks with Jumps

Estimating distributions of labels associated with nodes (e.g., number of connections or citizenship of users in a social network) in large graphs via sampling is a vital part of the study of complex networks. Due to their low cost, sampling via random walks (RWs) has been proposed as an attractive solution to this task. Most RW methods assume either that the network is undirected or that walkers can traverse edges regardless of their direction. Some RW methods have been designed for directed networks where edges coming into a node are not directly observable. In this work, we propose Directed Unbiased Frontier Sampling (DUFS), a sampling method based on a large number of coordinated walkers, each starting from a node chosen uniformly at random. It is applicable to directed networks with invisible incoming edges because it constructs, in real-time, an undirected graph consistent with the walkers trajectories, and due to the use of random jumps which prevent walkers from being trapped. DUFS generalizes previous RW methods and is suited for undirected networks and to directed networks regardless of in-edges visibility. We also propose an improved estimator of vertex label distributions which combines information from the initial walker locations with subsequent RW observations. We evaluate DUFS, comparing it against other RW methods, investigating the impact of its parameters on estimation accuracy and providing practical guidelines for choosing them. In estimating out-degree distributions, DUFS yields significantly better estimates of the head than other methods, while matching or exceeding estimation accuracy of the tail.

Identity Matching across Chinese Social Networks Based on Least Information

Along with the popular of Internet, social networks aiming at different functions has undergone an unprecedented development in recent years. Users create different accounts, which are also called identities, in different social networks. However, there is no intuitive connection between them. In order to link different identities which belong to the same user but in different social networks, we study to solve the problem of identity matching across multiple social networks in this paper. We proposed a three-step framework to exam the efficiency of three kinds of features, i.e. profile features, user generated content features and user relationship features. By exploiting different combinations of features in this framework, we find the least effective features which can identify users across social networks. Some experiments on real-world dataset show that our model can basically match the identities from different social networks.

DIGGER: Detect Similar Groups in Heterogeneous Social Networks

People participate in multiple online social networks, e.g., Facebook, Twitter, and Linkedin, and these social networks with heterogeneous social content and user relationship are named as heterogeneous social networks. Group structure widely exists in heterogeneous social networks, which reveals the evolution of human cooperation. Detecting similar groups in heterogeneous networks has a great significance for many applications, such as recommendation system and spammer detection, using the wealth of group information. Although promising, this novel problem encounters a variety of technical challenges, including incomplete data, high time complexity, and ground truth. To address the research gap and technical challenges, we take advantage of a ratio-cut optimization function to model this novel problem by the linear mixed-effects method and graph spectral theory. Based on this model, we propose an efficient algorithm called \textsc{Digger} to detect the similar groups in the large graphs. \textsc{Digger} consists of three steps, including measuring user similarity, construct a matching graph and detecting similar groups. We adopt several strategies to lower the computational cost and detail the basis of labeling the ground truth. We evaluate the effectiveness and efficiency of our algorithm on five different types of online social networks. The extensive experiments show that our method achieves 0.633, 0.723 and 0.675 in precision, recall and F1-measure, which significantly surpass the state-of-arts by 24.6$\%$, 14.6$\%$ and 19.7$\%$, respectively. The results demonstrate that our proposal can detect similar groups in heterogeneous networks effectively.

Differentiating Regularization Weights -- a Simple Mechanism to Alleviate Cold Start in Recommender Systems

Matrix factorization (MF) and its extended methodologies have been studied extensively in the last decade. Essentially, MF attempts to search for low-ranked matrices that can best approximate the known rating scores and maintain low Frobenius norm for the low-ranked matrices. Since the two objectives conflict with each other, the common practice is to assign the relative importance weights as the hyper-parameters to these objectives. The two low-ranked matrices returned by MF are interpreted as the latent factors of a user and the latent factors of an item that affect the rating of the user on the item. As a result, it is typical that, in the loss function, we assign a regularization weight $\lambda_p$ on the norms of the latent factors for all users, and another regularization weight $\lambda_q$ on the norms of the latent factors for all the items. We argue that, such a methodology probably over-simplifies the scenario. Alternatively, we probably should assign lower constraints to the latent factors associated with the items or users that reveal more information, and set higher constraints to the others. In this paper, we systematically study this topic. We found that such a simple technique improves the predictions of the MF-based approaches. Perhaps more importantly, this technique better predicts the ratings on the long-tail items, i.e., the items that were rated/viewed/purchased by few users. This suggests that this approach may partially remedy the cold-start issue. The proposed method is very simple and can be easily applied on various recommendation models, such as Factorization Machines (FM), Field-aware Factorization Machines (FFM), Factorizing Personalized Markov Chains (FPMC), Prod2Vec, Behavior2Vec, etc. We release the code for reproducibility.

Text Mining for Evaluating Authors' Birth and Death Years

This article presents a unique method in text and data mining for finding the era, i.e., mining temporal data, in which an anonymous author was living. Finding this era can assist in the examination of a fake document or extracting the time period in which a writer lived. The study and the experiments concern Hebrew, and in some parts, Aramaic and Yiddish rabbinic texts. The rabbinic texts are undated and contain no bibliographic sections, posing an interesting challenge. This work proposes algorithms using key phrases and key words that allow the temporal organization of citations together with linguistic patterns. Based on these key phrases, key words and the references, we established several types of "Iron-clad", Heuristic and Greedy rules for estimating the years of birth and death of a writer in an interesting classification task. Experiments were conducted on corpora including documents authored by 12, 24 and 36 rabbinic writers and demonstrated promising results.

All ACM Journals | See Full Journal Index

Search TKDD
enter search term and/or author name