Latest Articles

## Coupled Clustering Ensemble by Exploring Data Interdependence

A clustering ensemble combines multiple partitions of data into a single clustering solution. It is an effective technique for improving the quality of... (more)

## Entity-Based Query Recommendation for Long-Tail Queries

Query recommendation, which suggests related queries to search engine users, has attracted a lot of attention in recent years. Most of the existing... (more)

## Modeling Alzheimer’s Disease Progression with Fused Laplacian Sparse Group Lasso

Alzheimer’s disease (AD), the most common type of dementia, not only imposes a huge financial burden on the health care system, but also a... (more)

## Stability and Robustness in Influence Maximization

In the well-studied Influence Maximization problem, the goal is to identify a set of k nodes in a social network whose joint influence on the network... (more)

## Protecting Privacy in Trajectories with a User-Centric Approach

The increased use of location-aware devices, such as smartphones, generates a large amount of trajectory data. These data can be useful in several... (more)

## FrauDetector+: An Incremental Graph-Mining Approach for Efficient Fraudulent Phone Call Detection

In recent years, telecommunication fraud has become more rampant internationally with the development of modern technology and global communication. Because of rapid growth in the volume of call logs, the task of fraudulent phone call detection is confronted with big data issues in real-world implementations. Although our previous work,... (more)

## Large-Scale Adversarial Sports Play Retrieval with Learning to Rank

As teams of professional leagues are becoming more and more analytically driven, the interest in effective data management and access of sports plays... (more)

## A General Embedding Framework for Heterogeneous Information Learning in Large-Scale Networks

Network analysis has been widely applied in many real-world tasks, such as gene analysis and targeted marketing. To extract effective features for... (more)

## Enumerating Trillion Subgraphs On Distributed Systems

How can we find patterns from an enormous graph with billions of vertices and edges? Subgraph enumeration, the problem of finding patterns in a graph,... (more)

## Event Analytics via Discriminant Tensor Factorization

Analyzing the impact of disastrous events has been central to understanding and responding to crises. Traditionally, the assessment of disaster impact... (more)

## Semi-supervised Learning Meets Factorization: Learning to Recommend with Chain Graph Model

Recently, the latent factor model (LFM) has been drawing much attention in recommender systems due to its good performance and scalability. However, existing LFMs predict missing values in a user-item rating matrix only based on the known ones, and thus the sparsity of the rating matrix limits their performance. Meanwhile, semi-supervised... (more)

##### NEWS

ACM Transactions on Knowledge Discovery from Data (TKDD) publishes original archival papers in the area of knowledge discovery from data and closely related disciplines. The majority of the papers that appear in TKDD are expected to address the logical and technical foundations of knowledge discovery and data mining.

##### Forthcoming Articles
Algorithms for Online Influencer Marketing

Influence maximization is the problem of finding influential users (or nodes) in a graph so as to maximize the spread of information. It has many applications in advertising and marketing on social networks. In this paper, we study a highly generic version of influence maximization: optimizing influence campaigns by sequentially selecting "spread seeds" from a set of influencers, a small subset of the node population, under the hypothesis that, in a given campaign, previously activated nodes remain persistently active throughout and thus do not yield further rewards. This problem is particularly relevant for an important form of online marketing, known as influencer marketing, in which marketers target a sub-population of influential people instead of the entire base of potential buyers. Importantly, we make no assumptions on the underlying diffusion model, and we work in a setting where neither a diffusion network nor historical activation data are available. We call this problem online influencer marketing with persistence (in short, OIMP). We first discuss motivating scenarios and present our general approach. We introduce an estimator of an influencer's remaining potential -- the expected number of nodes that can still be reached from it -- and demonstrate its ability to rapidly estimate the desired value, relying on real data gathered from Twitter. We then describe a novel algorithm, GT-UCB, relying on upper confidence bounds on the remaining potential. We show that our approach leads to high-quality spreads on both simulated and real datasets, even though it makes almost no assumptions on the diffusion medium. Moreover, it is orders of magnitude faster than state-of-the-art influence maximization methods, making it possible to deal with large-scale online scenarios.
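The bandit machinery described here can be illustrated with a plain UCB1 index. This is a simplified stand-in for the paper's GT-UCB, not the authors' implementation: the `spread_fn` callback, the influencer names, and the reward bookkeeping are illustrative assumptions.

```python
import math
import random

def ucb_influencer_campaign(influencers, spread_fn, rounds, seed=0):
    """Sequentially pick one influencer per round using a UCB1-style index.

    spread_fn(k) must return the number of NEWLY activated nodes when
    influencer k is seeded (persistence: already-active nodes give no reward).
    """
    rng = random.Random(seed)
    counts = {k: 0 for k in influencers}
    totals = {k: 0.0 for k in influencers}
    history = []
    for t in range(1, rounds + 1):
        untried = [k for k in influencers if counts[k] == 0]
        if untried:
            choice = rng.choice(untried)  # play each arm once first
        else:
            # UCB1 index: empirical mean reward + exploration bonus
            choice = max(influencers, key=lambda k: totals[k] / counts[k]
                         + math.sqrt(2 * math.log(t) / counts[k]))
        reward = spread_fn(choice)
        counts[choice] += 1
        totals[choice] += reward
        history.append((choice, reward))
    return history
```

Under persistence, an influencer's remaining potential shrinks as its audience saturates, which is what the empirical means in the index end up tracking.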

Tensor Completion Algorithms in Big Data Analytics

Tensor completion is the problem of filling in the missing or unobserved entries of partially observed tensors. Because tensors naturally describe multidimensional, complex datasets, tensor completion algorithms and their applications have received wide attention and achieved notable results in data mining, computer vision, signal processing, and neuroscience. In this survey, we provide a modern overview of recent advances in tensor completion algorithms from the perspective of big data analytics, characterized by diverse variety, large volume, and high velocity. Towards a better comprehension and comparison of the vast existing advances, we summarize and categorize them into four groups: general tensor completion algorithms, tensor completion with auxiliary information (variety), scalable tensor completion algorithms (volume), and dynamic tensor completion algorithms (velocity). In addition, we introduce their applications to real-world data-driven problems and present an open-source package covering several widely used tensor decomposition and completion algorithms. Our goal is to summarize these popular methods, introduce them to researchers to promote progress in this field, and give practitioners a ready-to-use repository. Finally, we discuss some challenges and promising research directions in this community for future exploration.
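As a minimal illustration of the "general tensor completion" category, the sketch below fits a CP (CANDECOMP/PARAFAC) model to the observed entries of a 3-way tensor by gradient descent on the masked squared error. The rank, learning rate, and iteration count are illustrative assumptions, not prescriptions from any particular paper.

```python
import numpy as np

def cp_complete(T, mask, rank=2, iters=3000, lr=0.02, seed=0):
    """Fill missing entries of a 3-way tensor with a rank-R CP model,
    fit by gradient descent on the observed entries only (mask == 1)."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.standard_normal((I, rank)) * 0.1
    B = rng.standard_normal((J, rank)) * 0.1
    C = rng.standard_normal((K, rank)) * 0.1
    for _ in range(iters):
        X = np.einsum('ir,jr,kr->ijk', A, B, C)   # current reconstruction
        E = mask * (X - T)                        # error on observed cells only
        A -= lr * np.einsum('ijk,jr,kr->ir', E, B, C)
        B -= lr * np.einsum('ijk,ir,kr->jr', E, A, C)
        C -= lr * np.einsum('ijk,ir,jr->kr', E, A, B)
    return np.einsum('ir,jr,kr->ijk', A, B, C)    # dense completed tensor
```

The returned tensor is dense; its values at `mask == 0` positions are the model's imputations for the missing entries.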

PSP-AMS: Progressive mining of Sequential Patterns Across Multiple Streams

Sequential pattern mining finds data sequences that are frequent over time. When sequential patterns are generated, newly arriving patterns may not be identified as frequent due to the presence of old data and sequences. Progressive sequential pattern mining aims to find the most up-to-date sequential patterns, given that obsolete items are deleted from the sequences. When sequences arrive over multiple data streams, it is difficult to maintain and update the current sequential patterns. Even worse, when sequences span multiple streams, previous methods cannot efficiently compute the frequent sequential patterns. In this work, we propose an efficient algorithm, PSP-AMS, to address this problem. PSP-AMS uses a novel data structure, the PSP-MS-tree, to insert new items, update current items, and delete obsolete items. By maintaining the PSP-MS-tree, PSP-AMS efficiently finds frequent sequential patterns across multiple streams. Experimental results show that PSP-AMS significantly outperforms previous algorithms for mining progressive sequential patterns across multiple streams on both synthetic and real data.
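A heavily simplified sketch of the progressive setting (not the PSP-MS-tree itself, which the paper defines): keep subsequence counts over a sliding window of sequences, and subtract the counts of sequences that fall out of the window so obsolete patterns fade away. Only length-2 patterns are tracked here to keep the example short; the window size and support threshold are illustrative assumptions.

```python
from collections import Counter, deque
from itertools import combinations

class ProgressivePatterns:
    """Count length-2 subsequences over the last `window` sequences;
    sequences older than the window are evicted so obsolete patterns fade."""

    def __init__(self, window):
        self.window = window
        self.buffer = deque()
        self.counts = Counter()

    def _pairs(self, seq):
        # ordered pairs (a, b) with a occurring before b; set() dedups repeats
        return set(combinations(seq, 2))

    def add(self, seq):
        self.buffer.append(seq)
        self.counts.update(self._pairs(seq))
        if len(self.buffer) > self.window:        # evict the oldest sequence
            old = self.buffer.popleft()
            self.counts.subtract(self._pairs(old))

    def frequent(self, min_support):
        return {p for p, c in self.counts.items() if c >= min_support}
```

A pattern that was frequent in old sequences drops below the support threshold as soon as those sequences leave the window, which is the "most up-to-date patterns" behavior the progressive setting asks for.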

Robust Spectral Ensemble Clustering via Rank Minimization

Ensemble Clustering (EC) is an important topic for data cluster analysis. It aims to integrate multiple Basic Partitions (BPs) of a particular dataset into a consensus partition. Among previous works, one promising and effective way is to transform ensemble clustering into a graph partitioning problem on the co-association matrix, which is in essence a pairwise similarity matrix summarized from all the BPs. However, most existing EC methods utilize the co-association matrix directly, without considering the various noises (e.g., the disagreement between different BPs and outlier BPs) that may exist in it. These noises can impair the cluster structure of a co-association matrix and thus mislead the final graph partitioning process. In this paper, we propose a novel Robust Spectral Ensemble Clustering (RSEC) approach to address this challenge. Specifically, we learn a Low-Rank Representation (LRR) for the co-association matrix to reveal its cluster structure and handle the noises, and meanwhile conduct spectral clustering on the learned representation to seek a consensus partition. These two steps are jointly performed in a unified optimization framework. In particular, during the optimization process, we utilize the consensus partition to iteratively enhance the block-diagonal structure of the low-rank representation, in order to assist the graph partitioning. To solve RSEC, we first formulate it using the nuclear norm as a convex proxy for the rank function. Then, motivated by recent advances in non-convex rank minimization, we further develop a non-convex model for RSEC and solve it with the majorization-minimization Augmented Lagrange Multiplier (MM-ALM) algorithm. Experimental results on numerous real-world datasets demonstrate the effectiveness of our method over the state of the art. Moreover, several factors that may affect the clustering performance of our approach are explored extensively.
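The co-association construction this abstract builds on can be sketched directly. The two-way spectral step below uses a plain normalized-Laplacian Fiedler split rather than the paper's LRR-based formulation, and the restriction to two clusters is an illustrative simplification.

```python
import numpy as np

def coassociation(partitions):
    """Pairwise similarity: fraction of base partitions that put i and j
    in the same cluster."""
    n = len(partitions[0])
    S = np.zeros((n, n))
    for labels in partitions:
        L = np.asarray(labels)
        S += (L[:, None] == L[None, :]).astype(float)
    return S / len(partitions)

def spectral_two_way(S):
    """Two-way consensus partition from the sign of the Fiedler vector of
    the normalized graph Laplacian of the co-association matrix."""
    d = S.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(S)) - Dinv @ S @ Dinv
    vals, vecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    fiedler = vecs[:, 1]                 # eigenvector of 2nd-smallest eigenvalue
    return (fiedler > 0).astype(int)
```

Noise in a BP (such as the third partition below, which disagrees about one point) only attenuates the corresponding entries of the co-association matrix rather than flipping the consensus, which is the robustness intuition the matrix encodes.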

Sequential Feature Explanations for Anomaly Detection

In many applications, an anomaly detection system presents the most anomalous data instance to a human analyst, who then must determine whether the instance is truly of interest (e.g., a threat in a security setting). Unfortunately, most anomaly detectors provide no explanation of why an instance was considered anomalous, leaving the analyst with no guidance about where to begin the investigation. To address this issue, we study the problems of computing and evaluating sequential feature explanations (SFEs) for anomaly detectors. An SFE of an anomaly is a sequence of features, which are presented to the analyst one at a time (in order) until the information contained in the highlighted features is enough for the analyst to make a confident judgement about the anomaly. Since analyst effort is related to the amount of information considered in an investigation, an explanation's quality is related to the number of features that must be revealed to attain confidence. In this paper, we first formulate the problem of optimizing SFEs for a particular density-based anomaly detector. We then present both greedy algorithms and an optimal algorithm, based on branch-and-bound search, for optimizing SFEs. Finally, we provide a large-scale quantitative evaluation of these algorithms using a novel framework for evaluating explanations. The results show that our algorithms are quite effective and that our best greedy algorithm is competitive with optimal solutions.
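A minimal sketch of the greedy idea under an assumed per-feature Gaussian density model (the paper optimizes SFEs for a specific density-based detector; the diagonal-Gaussian model here is only illustrative): reveal features to the analyst in decreasing order of how surprising the instance's value is under the model of normal data.

```python
import math

def sequential_feature_explanation(x, mean, std):
    """Rank feature indices by how unlikely the instance's value is under a
    per-feature Gaussian fit to normal data; most surprising first."""
    def surprise(i):
        z = (x[i] - mean[i]) / std[i]
        # negative log-density of N(mean_i, std_i) at x[i], up to a constant
        return 0.5 * z * z + math.log(std[i])
    return sorted(range(len(x)), key=surprise, reverse=True)
```

The analyst would then inspect features in the returned order and stop as soon as the revealed values justify a confident judgement, so a good explanation front-loads the informative features.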

Identity Matching across Chinese Social Networks Based on Least Information

With the growing popularity of the Internet, social networks serving different functions have undergone unprecedented development in recent years. Users create different accounts, also called identities, on different social networks, but there is no intuitive connection between them. To link identities that belong to the same user across different social networks, we study the problem of identity matching across multiple social networks in this paper. We propose a three-step framework to examine the effectiveness of three kinds of features: profile features, user-generated content features, and user relationship features. By exploiting different combinations of features in this framework, we identify the smallest set of features that can still identify users across social networks. Experiments on a real-world dataset show that our model can match identities from different social networks.
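As a toy illustration of matching on very little information, the sketch below greedily pairs accounts across two networks using only username string similarity, i.e., a single profile feature. The similarity threshold, the greedy one-to-one rule, and the account names are illustrative assumptions.

```python
from difflib import SequenceMatcher

def match_identities(accounts_a, accounts_b, threshold=0.8):
    """Greedy one-to-one matching of accounts across two networks using
    only username string similarity."""
    pairs = []
    used = set()
    # score every cross-network pair, best matches first
    candidates = sorted(
        ((SequenceMatcher(None, a.lower(), b.lower()).ratio(), a, b)
         for a in accounts_a for b in accounts_b),
        reverse=True)
    for score, a, b in candidates:
        if score < threshold:
            break                      # remaining pairs are too dissimilar
        if a not in used and b not in used:
            pairs.append((a, b, round(score, 2)))
            used.update({a, b})
    return pairs
```

In the full framework, a pair surviving this cheap profile check could then be re-scored with content and relationship features before being accepted.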

DIGGER: Detect Similar Groups in Heterogeneous Social Networks

People participate in multiple online social networks, e.g., Facebook, Twitter, and LinkedIn; social networks with heterogeneous content and user relationships are called heterogeneous social networks. Group structure is widespread in heterogeneous social networks and reveals the evolution of human cooperation. Detecting similar groups in heterogeneous networks is of great significance for many applications, such as recommendation systems and spammer detection, which can exploit the wealth of group information. Although promising, this novel problem encounters a variety of technical challenges, including incomplete data, high time complexity, and the lack of ground truth. To address the research gap and these technical challenges, we model the problem with a ratio-cut optimization function, using the linear mixed-effects method and graph spectral theory. Based on this model, we propose an efficient algorithm called Digger to detect similar groups in large graphs. Digger consists of three steps: measuring user similarity, constructing a matching graph, and detecting similar groups. We adopt several strategies to lower the computational cost, and we detail how the ground truth is labeled. We evaluate the effectiveness and efficiency of our algorithm on five different types of online social networks. Extensive experiments show that our method achieves 0.633, 0.723, and 0.675 in precision, recall, and F1-measure, surpassing the state of the art by 24.6%, 14.6%, and 19.7%, respectively. The results demonstrate that our proposal can detect similar groups in heterogeneous networks effectively.
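A stripped-down sketch of the group-matching idea (not Digger itself): map one network's group members into the other network via a known identity mapping, then score group pairs by Jaccard overlap of their members. The identity mapping, the group names, and the similarity threshold are illustrative assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def similar_groups(groups_x, groups_y, mapping, tau=0.5):
    """Match groups across two networks: translate group X's members into
    network Y's namespace via `mapping`, then compare member overlap."""
    matches = []
    for gx_name, gx in groups_x.items():
        mapped = {mapping[u] for u in gx if u in mapping}
        for gy_name, gy in groups_y.items():
            s = jaccard(mapped, gy)
            if s >= tau:
                matches.append((gx_name, gy_name, round(s, 2)))
    return matches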

Differentiating Regularization Weights -- a Simple Mechanism to Alleviate Cold Start in Recommender Systems

Matrix factorization (MF) and its extended methodologies have been studied extensively in the last decade. Essentially, MF attempts to search for low-rank matrices that best approximate the known rating scores while maintaining low Frobenius norms for those matrices. Since the two objectives conflict with each other, the common practice is to assign relative importance weights to them as hyper-parameters. The two low-rank matrices returned by MF are interpreted as the latent factors of a user and the latent factors of an item that affect the rating of the user on the item. As a result, it is typical that, in the loss function, we assign a regularization weight $\lambda_p$ to the norms of the latent factors for all users, and another regularization weight $\lambda_q$ to the norms of the latent factors for all items. We argue that such a methodology probably over-simplifies the scenario. Alternatively, we should probably assign looser constraints to the latent factors associated with the items or users that reveal more information, and tighter constraints to the others. In this paper, we systematically study this topic. We find that such a simple technique improves the predictions of MF-based approaches. Perhaps more importantly, this technique better predicts the ratings on long-tail items, i.e., items that were rated/viewed/purchased by few users. This suggests that the approach may partially remedy the cold-start issue. The proposed method is very simple and can be easily applied to various recommendation models, such as Factorization Machines (FM), Field-aware Factorization Machines (FFM), Factorizing Personalized Markov Chains (FPMC), Prod2Vec, Behavior2Vec, etc. We release the code for reproducibility.
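The differentiated-weights idea can be sketched with a plain SGD matrix factorization in which each user's and item's regularization weight is scaled by the inverse square root of its rating count, so well-observed rows get looser constraints and sparse (long-tail) rows get tighter ones. The exact `1/sqrt(count)` scaling, the hyper-parameters, and the toy data are illustrative assumptions, not the paper's formula.

```python
import numpy as np

def mf_adaptive_reg(R, rank=2, epochs=500, lr=0.02, base_reg=0.1, seed=0):
    """SGD matrix factorization on a rating matrix with NaN for missing
    entries; per-user/per-item regularization scaled by 1/sqrt(#ratings)."""
    rng = np.random.default_rng(seed)
    users, items = R.shape
    P = rng.standard_normal((users, rank)) * 0.1   # user latent factors
    Q = rng.standard_normal((items, rank)) * 0.1   # item latent factors
    obs = np.argwhere(~np.isnan(R))                # (user, item) observed pairs
    n_u = np.array([np.sum(obs[:, 0] == u) for u in range(users)])
    n_i = np.array([np.sum(obs[:, 1] == i) for i in range(items)])
    reg_u = base_reg / np.sqrt(np.maximum(n_u, 1))  # more ratings -> looser
    reg_i = base_reg / np.sqrt(np.maximum(n_i, 1))
    for _ in range(epochs):
        for u, i in obs:
            err = R[u, i] - P[u] @ Q[i]
            P[u] += lr * (err * Q[i] - reg_u[u] * P[u])
            Q[i] += lr * (err * P[u] - reg_i[i] * Q[i])
    return P, Q
```

Replacing `reg_u`/`reg_i` with constants $\lambda_p$ and $\lambda_q$ recovers the standard uniform-weight loss the abstract argues against.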

Text Mining for Evaluating Authors' Birth and Death Years

This article presents a unique method in text and data mining for finding the era in which an anonymous author lived, i.e., mining temporal data. Finding this era can assist in examining a suspected forgery or in extracting the time period in which a writer lived. The study and the experiments concern Hebrew, and in parts Aramaic and Yiddish, rabbinic texts. The rabbinic texts are undated and contain no bibliographic sections, posing an interesting challenge. This work proposes algorithms that use key phrases and key words to organize citations temporally together with linguistic patterns. Based on these key phrases, key words, and references, we established several types of "Iron-clad", Heuristic, and Greedy rules for estimating a writer's years of birth and death in an interesting classification task. Experiments were conducted on corpora including documents authored by 12, 24, and 36 rabbinic writers and demonstrated promising results.
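The flavor of an "Iron-clad" rule can be sketched with a simple bound: a writer who cites a work dated to year Y must have been alive after Y, which lower-bounds the death year; assuming a maximum human lifespan then upper-bounds how early the writer could have been born. The lifespan constant and the dictionary format are illustrative assumptions, not the paper's rules.

```python
def era_bounds(citations, max_lifespan=90):
    """Bound a writer's era from dated works they cite.

    citations: mapping from cited-work name to its known year.
    Returns the latest year the writer is known to have been alive after,
    and the earliest plausible birth year under the lifespan assumption.
    """
    latest = max(citations.values())   # most recent work the writer cites
    return {"died_after": latest, "born_after": latest - max_lifespan}
```

Heuristic and greedy rules would then tighten these hard bounds using softer evidence such as linguistic patterns.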

