Latest Articles

## Coupled Clustering Ensemble by Exploring Data Interdependence

Clustering ensembles combine multiple partitions of data into a single clustering solution. It is an effective technique for improving the quality of...

## Entity-Based Query Recommendation for Long-Tail Queries

Query recommendation, which suggests related queries to search engine users, has attracted a lot of attention in recent years. Most of the existing...

## Modeling Alzheimer’s Disease Progression with Fused Laplacian Sparse Group Lasso

Alzheimer’s disease (AD), the most common type of dementia, not only imposes a huge financial burden on the health care system, but also a...

## Stability and Robustness in Influence Maximization

In the well-studied Influence Maximization problem, the goal is to identify a set of k nodes in a social network whose joint influence on the network...

## Protecting Privacy in Trajectories with a User-Centric Approach

The increased use of location-aware devices, such as smartphones, generates a large amount of trajectory data. These data can be useful in several...

## FrauDetector+: An Incremental Graph-Mining Approach for Efficient Fraudulent Phone Call Detection

In recent years, telecommunication fraud has become more rampant internationally with the development of modern technology and global communication. Because of rapid growth in the volume of call logs, the task of fraudulent phone call detection is confronted with big data issues in real-world implementations. Although our previous work,...

## Large-Scale Adversarial Sports Play Retrieval with Learning to Rank

As teams of professional leagues are becoming more and more analytically driven, the interest in effective data management and access of sports plays...

##### NEWS

ACM Transactions on Knowledge Discovery from Data (TKDD) publishes original archival papers in the area of knowledge discovery from data and closely related disciplines. The majority of the papers that appear in TKDD are expected to address the logical and technical foundation of knowledge discovery and data mining.

##### Forthcoming Articles

Division-by-q dichotomization for interval uncertainty reduction by cutting off equal parts from the left and right based on expert judgments under short-termed observations

A problem of reducing interval uncertainty is considered by an approach of cutting off equal parts from the left and right. The interval contains admissible values of an observed object's parameter. The object's parameter cannot be measured directly or deductively computed, so it is estimated by expert judgments. The task is to map a set of admissible values of the object's parameter (the initial interval) into a set of practicable values of this parameter. Redundant (irrelevant) values are removed according to the experts' judgments. Terms of observations are short, and the object's statistical data are poor. Statistical methods for reducing the interval uncertainty are unreliable because the term of the parameter's application tends to be very short. Thus, an algorithm for flexibly reducing interval uncertainty is designed by adjusting the parameter through expert procedures while allowing the cutting off to be controlled. The interval reduction ensues from the adjustment. While the parameter is adjusted forward, the interval is progressively narrowed after each expert procedure. The narrowing is performed via division-by-q dichotomization, cutting off the (1/q)-th parts from the left and right. If the current parameter's value falls outside the interval, forward adjustment is canceled. Then backward adjustment is executed, in which one of the endpoints is moved backwards. Rough (hard) and smooth (soft) backward moves are provided. If the current parameter's value belongs to the interval but is too close to either the left or the right endpoint, then this endpoint is not moved. The closeness is treated differently on each side according to the given relative tolerances. Adjustment is not executed when the current parameter's value lies within the interval and is simultaneously too close to both the left and right endpoints. If the current parameter's value is trapped like that a definite number of times in succession, the early stop fires. That definite number serves to reach statistical stability.
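
The forward narrowing step described in this abstract can be illustrated with a minimal sketch. It covers only the "cut off the (1/q)-th parts from the left and right" operation; the backward adjustment, the relative tolerances, and the early stop are not modeled. The function name `narrow_interval` and the requirement `q > 2` (so that a non-empty interval remains) are assumptions made for this illustration, not details taken from the paper.

```python
from typing import Tuple


def narrow_interval(left: float, right: float, q: float) -> Tuple[float, float]:
    """Cut off the (1/q)-th part of the interval length from each side.

    Hypothetical helper: the name and the q > 2 check are assumptions
    for this sketch, not part of the published algorithm.
    """
    if right <= left:
        raise ValueError("right endpoint must exceed left endpoint")
    if q <= 2:
        raise ValueError("q must exceed 2 so that a non-empty interval remains")
    cut = (right - left) / q  # length removed from each side
    return left + cut, right - cut


if __name__ == "__main__":
    interval = (0.0, 10.0)
    for step in range(3):
        interval = narrow_interval(interval[0], interval[1], q=5)
        print(step + 1, interval)
```

Under these assumptions, each step keeps the central (1 - 2/q) fraction of the current interval, so a larger q gives a gentler narrowing per expert procedure.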

Enumerating Trillion Subgraphs On Distributed Systems

How can we find patterns in an enormous graph with billions of vertices and edges? Subgraph enumeration, the task of finding patterns in a graph, is important for graph data analysis and has many applications, including analyzing the evolution of social networks, measuring the significance of motifs in biological networks, and observing the dynamics of the Internet. In particular, triangle enumeration, a special case of subgraph enumeration where the pattern is a triangle, has many applications such as identifying suspicious users in social networks, detecting web spam, and finding communities. However, recent networks are so large that most of the previous algorithms fail to process them. Recently, several MapReduce algorithms have been proposed to address such large networks; however, they suffer from massive amounts of shuffled data, resulting in very long processing times. In this paper, we propose scalable methods for enumerating trillions of subgraphs on distributed systems. We first propose PTE (Pre-partitioned Triangle Enumeration), a new distributed algorithm for enumerating triangles in enormous graphs by resolving the structural inefficiency of the previous MapReduce algorithms. PTE enumerates trillions of triangles in a billion-scale graph by decreasing three factors: the amount of shuffled data, total work, and network read. We also propose PSE (Pre-partitioned Subgraph Enumeration), a generalized version of PTE for enumerating subgraphs that match an arbitrary query graph. Experimental results show that PTE is 47 times faster than recent distributed algorithms on real-world graphs, and succeeds in enumerating more than 3 trillion triangles on the ClueWeb12 graph with 6.3 billion vertices and 72 billion edges. Furthermore, PSE successfully enumerates 265 trillion clique subgraphs with 4 vertices from a subdomain hyperlink network, running 49 times faster than the state-of-the-art distributed subgraph enumeration algorithm.
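
To make the triangle enumeration task concrete, here is a minimal single-machine sketch. It is not PTE or any distributed method from the abstract; it is the plain textbook approach of ordering vertices and checking pairs of higher-numbered neighbors, and the function name `enumerate_triangles` is chosen only for this illustration.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Iterator, Tuple


def enumerate_triangles(
    edges: Iterable[Tuple[Hashable, Hashable]]
) -> Iterator[Tuple[Hashable, Hashable, Hashable]]:
    """Yield each triangle (u, v, w) with u < v < w exactly once.

    Single-machine illustration only; vertex labels must be mutually comparable.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adjacency[u].add(v)
            adjacency[v].add(u)

    for u in sorted(adjacency):
        # Only look at neighbors larger than u so every triangle is
        # reported from its smallest vertex.
        higher = sorted(w for w in adjacency[u] if w > u)
        for i, v in enumerate(higher):
            for w in higher[i + 1:]:
                if w in adjacency[v]:
                    yield (u, v, w)


if __name__ == "__main__":
    sample_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
    print(list(enumerate_triangles(sample_edges)))
```

A sketch like this keeps the whole graph in memory, which is exactly what stops working at the billion-edge scale the abstract targets.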

A General Embedding Framework for Heterogeneous Information Learning in Large-scale Networks

Network analysis has been widely applied in many real-world tasks such as gene analysis and targeted marketing. To extract effective features for these analysis tasks, network embedding automatically learns a low-dimensional vector representation for each node, such that the meaningful topological proximity is well preserved. While embedding algorithms based on pure topological structure have attracted considerable attention, in practice nodes are often accompanied by other types of meaningful information such as node attributes, second-order proximity, and link directionality. A general framework for incorporating this heterogeneous information into network embedding could help in learning better vector representations. However, jointly embedding the geometrical structure and a distinct type of information remains challenging because of the heterogeneity. In addition, real-world networks often contain a large number of nodes, which places demands on the scalability of the embedding algorithms. To bridge the gap, in this paper, we propose a general embedding framework named Heterogeneous Information Learning in Large-scale networks (HILL) to accelerate the joint learning. It enables the simultaneous node proximity assessing process to be done in a distributed manner by decomposing the complex modeling and optimization into many simple and independent sub-problems. We validate the significant correlation between the heterogeneous information and topological structure, and illustrate the generalizability of HILL by applying it to perform attributed network embedding and second-order proximity learning. A variation is proposed for link directionality modeling. Experimental results on real-world networks demonstrate the effectiveness and efficiency of HILL.

Algorithms for Online Influencer Marketing

Influence maximization is the problem of finding influential users (or nodes) in a graph so as to maximize the spread of information. It has many applications in advertising and marketing on social networks. In this paper, we study a highly generic version of influence maximization, one of optimizing influence campaigns by sequentially selecting "spread seeds" from a set of influencers, a small subset of the node population, under the hypothesis that, in a given campaign, previously activated nodes remain persistently active throughout and thus do not yield further rewards. This problem is particularly relevant to an important form of online marketing, known as influencer marketing, in which the marketers target a sub-population of influential people, instead of the entire base of potential buyers. Importantly, we make no assumptions about the underlying diffusion model and we work in a setting where neither a diffusion network nor historical activation data are available. We call this problem online influencer marketing with persistence (in short, OIMP). We first discuss motivating scenarios and present our general approach. We introduce an estimator of the influencers' remaining potential -- the expected number of nodes that can still be reached from them -- and justify its ability to rapidly estimate the desired value, relying on real data gathered from Twitter. We then describe a novel algorithm, GT-UCB, relying on upper confidence bounds on the remaining potential. We show that our approach leads to high-quality spreads on both simulated and real datasets, even though it makes almost no assumptions about the diffusion medium. Importantly, it is orders of magnitude faster than state-of-the-art influence maximization methods, making it possible to deal with large-scale online scenarios.

Tensor Completion Algorithms in Big Data Analytics

Tensor completion is the problem of filling in the missing or unobserved entries of partially observed tensors. Because tensors are well suited to describing complex multidimensional datasets, tensor completion algorithms and their applications have received wide attention in data mining, computer vision, signal processing, neuroscience, and other fields. In this survey, we provide a modern overview of recent advances in tensor completion algorithms from the perspective of big data analytics, which is characterized by diverse variety, large volume, and high velocity. To support understanding and comparison of the many existing advances, we group them into four categories: general tensor completion algorithms, tensor completion with auxiliary information (variety), scalable tensor completion algorithms (volume), and dynamic tensor completion algorithms (velocity). In addition, we introduce their applications to real-world data-driven problems and present an open-source package covering several widely used tensor decomposition and completion algorithms. Our goal is to summarize these popular methods and introduce them to researchers to promote research in this field, and to provide a repository for practitioners. Finally, we discuss some challenges and promising research directions for future exploration.

Robust Spectral Ensemble Clustering via Rank Minimization

Ensemble Clustering (EC) is an important topic in cluster analysis. It aims to integrate multiple Basic Partitions (BPs) of a particular dataset into a consensus partition. Among previous works, one promising and effective approach is to transform ensemble clustering into a graph partitioning problem on the co-association matrix, which is in essence a pair-wise similarity matrix summarized from all the BPs. However, most existing EC methods use the co-association matrix directly, without considering the various kinds of noise (e.g., disagreement between different BPs and outlier BPs) that may exist in it. This noise can impair the cluster structure of a co-association matrix and thus mislead the final graph partitioning process. In this paper, we propose a novel Robust Spectral Ensemble Clustering (RSEC) approach to address this challenge. Specifically, we learn a Low-Rank Representation (LRR) for the co-association matrix to reveal its cluster structure and handle the noise; meanwhile, we conduct spectral clustering on the learned representation to seek a consensus partition. These two steps are jointly performed in a unified optimization framework. In particular, during the optimization process, we use the consensus partition to iteratively enhance the block-diagonal structure of the low-rank representation, in order to assist the graph partitioning. To solve RSEC, we first formulate it using the nuclear norm as a convex proxy for the rank function. Then, motivated by recent advances in non-convex rank minimization, we further develop a non-convex model for RSEC and solve it with the majorization-minimization Augmented Lagrange Multiplier (MM-ALM) algorithm. Experimental results on numerous real-world datasets demonstrate the effectiveness of our method over the state of the art. Moreover, several factors that may affect the clustering performance of our approach are also explored extensively.
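
As a small illustration of the co-association matrix mentioned in this abstract, the sketch below uses the common definition in which entry (i, j) is the fraction of basic partitions that place items i and j in the same cluster. This is an assumption about the construction; the abstract does not give the exact formula, and the low-rank and spectral steps of RSEC are not shown. The function name `co_association` is hypothetical.

```python
from typing import List, Sequence


def co_association(partitions: Sequence[Sequence[int]]) -> List[List[float]]:
    """Build a co-association matrix from basic partitions.

    partitions[p][i] is the cluster label of item i in partition p.
    Entry (i, j) of the result is the fraction of partitions in which
    items i and j share a label (assumed, commonly used definition).
    """
    if not partitions:
        raise ValueError("at least one partition is required")
    n = len(partitions[0])
    if any(len(labels) != n for labels in partitions):
        raise ValueError("all partitions must label the same number of items")

    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1

    total = len(partitions)
    return [[count / total for count in row] for row in counts]


if __name__ == "__main__":
    basic_partitions = [[0, 0, 1], [0, 1, 1]]
    for row in co_association(basic_partitions):
        print(row)
```

In this toy input the two partitions disagree about the middle item, which is the kind of disagreement the abstract treats as noise in the matrix.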

Sequential Feature Explanations for Anomaly Detection

In many applications, an anomaly detection system presents the most anomalous data instance to a human analyst, who then must determine whether the instance is truly of interest (e.g. a threat in a security setting). Unfortunately, most anomaly detectors provide no explanation about why an instance was considered anomalous, leaving the analyst with no guidance about where to begin the investigation. To address this issue, we study the problems of computing and evaluating sequential feature explanations (SFEs) for anomaly detectors. An SFE of an anomaly is a sequence of features, which are presented to the analyst one at a time (in order) until the information contained in the highlighted features is enough for the analyst to make a confident judgement about the anomaly. Since analyst effort is related to the amount of information that they consider in an investigation, an explanation's quality is related to the number of features that must be revealed to attain confidence. In this paper, we first formulate the problem of optimizing SFEs for a particular density-based anomaly detector. We then present both greedy algorithms and an optimal algorithm, based on branch-and-bound search, for optimizing SFEs. Finally, we provide a large scale quantitative evaluation of these algorithms using a novel framework for evaluating explanations. The results show that our algorithms are quite effective and that our best greedy algorithm is competitive with optimal solutions.

DIGGER: Detect Similar Groups in Heterogeneous Social Networks

People participate in multiple online social networks, e.g., Facebook, Twitter, and LinkedIn, and these social networks, with their heterogeneous social content and user relationships, are called heterogeneous social networks. Group structure exists widely in heterogeneous social networks and reveals the evolution of human cooperation. Detecting similar groups in heterogeneous networks is of great significance for many applications that use the wealth of group information, such as recommendation systems and spammer detection. Although promising, this novel problem encounters a variety of technical challenges, including incomplete data, high time complexity, and ground-truth labeling. To address the research gap and technical challenges, we take advantage of a ratio-cut optimization function to model this novel problem using the linear mixed-effects method and graph spectral theory. Based on this model, we propose an efficient algorithm called DIGGER to detect similar groups in large graphs. DIGGER consists of three steps: measuring user similarity, constructing a matching graph, and detecting similar groups. We adopt several strategies to lower the computational cost and detail the basis for labeling the ground truth. We evaluate the effectiveness and efficiency of our algorithm on five different types of online social networks. Extensive experiments show that our method achieves 0.633, 0.723, and 0.675 in precision, recall, and F1-measure, which significantly surpass the state of the art by 24.6%, 14.6%, and 19.7%, respectively. The results demonstrate that our proposal can detect similar groups in heterogeneous networks effectively.

Semi-supervised Learning Meets Factorization: Learning to Recommend with Chain Graph Model

Recently, the latent factor model (LFM) has been drawing much attention in recommender systems due to its good performance and scalability. However, existing LFMs predict missing values in a user-item rating matrix based only on the known ones, and thus the sparsity of the rating matrix always limits their performance. Meanwhile, semi-supervised learning (SSL) provides an effective way to alleviate the label (i.e., rating) sparsity problem by performing label propagation, which is mainly based on the smoothness insight on affinity graphs. However, graph-based SSL suffers from serious scalability and graph unreliability problems when applied directly to recommendation. In this paper, we propose a novel probabilistic chain graph model (CGM) to marry SSL with LFM. The proposed CGM is a combination of a Bayesian network and a Markov random field. The Bayesian network is used to model the rating generation and regression procedures, and the Markov random field is used to model the confidence-aware smoothness constraint between the generated ratings. Experimental results show that our proposed CGM significantly outperforms the state-of-the-art approaches in terms of four evaluation metrics, with a larger performance margin as data sparsity increases.
