ACM Transactions on Knowledge Discovery from Data (TKDD)

Latest Articles

Sequential Feature Explanations for Anomaly Detection

In many applications, an anomaly detection system presents the most anomalous data instance to a human analyst, who then must determine whether the... (more)

Digger: Detect Similar Groups in Heterogeneous Social Networks

People participate in multiple online social networks, e.g., Facebook, Twitter, and LinkedIn; such networks, with their heterogeneous social content and user relationships, are called heterogeneous social networks. Group structures exist widely in heterogeneous social networks and reveal the evolution of human cooperation. Detecting... (more)

Algorithms for Online Influencer Marketing

Influence maximization is the problem of finding influential users, or nodes, in a graph so as to maximize the spread of information. It has many applications in advertising and marketing on social networks. In this article, we study a highly generic version of influence maximization, one of optimizing influence campaigns by sequentially selecting... (more)

Robust Spectral Ensemble Clustering via Rank Minimization

Ensemble Clustering (EC) is an important topic in data cluster analysis. It aims to integrate multiple Basic Partitions (BPs) of a particular... (more)

PSP-AMS: Progressive Mining of Sequential Patterns Across Multiple Streams

Sequential pattern mining is used to find frequent data sequences over time. When sequential patterns are generated, the newly arriving patterns may not be identified as frequent sequential patterns due to the existence of old data and sequences. Progressive sequential pattern mining aims to find the most up-to-date sequential patterns given that... (more)

Tensor Completion Algorithms in Big Data Analytics

Tensor completion is a problem of filling the missing or unobserved entries of partially observed tensors. Due to the multidimensional character of... (more)

Text Mining for Evaluating Authors' Birth and Death Years

This article presents a unique method in text and data mining for finding the era (i.e., mining temporal data) in which an anonymous author was... (more)

Differentiating Regularization Weights -- A Simple Mechanism to Alleviate Cold Start in Recommender Systems

Matrix factorization (MF) and its extended methodologies have been studied extensively in the... (more)

Large Scale Online Multiple Kernel Regression with Application to Time-Series Prediction

Kernel-based regression represents an important family of learning techniques for solving challenging regression tasks with non-linear patterns.... (more)

Chameleon 2: An Improved Graph-Based Clustering Algorithm

Traditional clustering algorithms fail to produce human-like results when confronted with data of variable density or complex distributions, or in the presence of noise. We propose an improved graph-based clustering algorithm called Chameleon 2, which overcomes several drawbacks of state-of-the-art clustering approaches. We modified the internal... (more)

Characterizing Directed and Undirected Networks via Multidimensional Walks with Jumps

Estimating distributions of node characteristics (labels) such as number of connections or citizenship of users in a social network via edge and node... (more)

Detecting and Assessing Anomalous Evolutionary Behaviors of Nodes in Evolving Social Networks

Because it is based on the performance of entire social networks, anomaly analysis for evolving social networks generally ignores the otherness of the evolutionary... (more)


About TKDD 

ACM Transactions on Knowledge Discovery from Data (TKDD) publishes original archival papers in the area of knowledge discovery from data and closely related disciplines. The majority of the papers that appear in TKDD are expected to address the logical and technical foundations of knowledge discovery and data mining.

Forthcoming Articles
Robust Regression via Heuristic Corruption Thresholding and Its Adaptive Estimation Variation

The presence of data noise and corruption has recently drawn increasing attention to Robust Least Squares Regression (RLSR), which addresses the fundamental problem of learning reliable regression coefficients when response variables can be arbitrarily corrupted. Until now, several important challenges could not be handled concurrently: 1) rigorous recovery guarantees for the regression coefficients; 2) difficulty in estimating the corruption ratio parameter; and 3) scaling to massive datasets. This paper proposes a novel robust least squares regression algorithm via Heuristic Corruption Thresholding (RHCT) that addresses all of the above challenges concurrently. Specifically, the algorithm alternately optimizes the regression coefficients and estimates the optimal uncorrupted set via heuristic thresholding, without a corruption ratio parameter, until it converges. Moreover, to improve the efficiency of corruption estimation on large-scale data, a robust regression algorithm based on adaptive corruption-estimation variation (RACT) is proposed to determine the size of the uncorrupted set by adaptive search, without iterating over data samples exhaustively. We also prove that our algorithms enjoy strong guarantees analogous to those of state-of-the-art methods in terms of convergence rates and recovery guarantees. Extensive experiments demonstrate that our new methods outperform existing methods in recovering both regression coefficients and uncorrupted sets, with very competitive efficiency.
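The alternating scheme described in the abstract can be illustrated with a toy hard-thresholding loop (a generic sketch, not the authors' RHCT/RACT: it assumes a fixed trusted-set size via a `keep_frac` parameter, whereas the paper's contribution is estimating that size heuristically):

```python
import numpy as np

def robust_lsr_hard_threshold(X, y, keep_frac=0.9, n_iter=50):
    """Toy alternating scheme: fit least squares on a trusted subset,
    then re-select the samples with the smallest residuals."""
    n = X.shape[0]
    k = int(keep_frac * n)            # size of the assumed-uncorrupted set
    trusted = np.arange(n)            # start by trusting every sample
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w, *_ = np.linalg.lstsq(X[trusted], y[trusted], rcond=None)
        resid = np.abs(y - X @ w)
        new_trusted = np.argsort(resid)[:k]   # hard-thresholding step
        if set(new_trusted) == set(trusted):  # converged
            break
        trusted = new_trusted
    return w, trusted
```

Samples with grossly corrupted responses produce large residuals after the first fit and are dropped from the trusted set, after which the coefficients are re-estimated on (mostly) clean data.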

Probabilistic Feature Selection and Classification Vector Machine

Sparse Bayesian learning is among the state-of-the-art machine learning approaches, able to make stable and reliable probabilistic predictions. However, some of these algorithms, e.g., the probabilistic classification vector machine (PCVM) and the relevance vector machine (RVM), are not capable of eliminating irrelevant noise features, which not only potentially increases computational complexity but can also degrade the performance of trained models. To tackle this problem, in this paper we propose a sparse Bayesian classifier that simultaneously selects the relevant samples and features. We name this classifier the probabilistic feature selection and classification vector machine (PFCVM), in which truncated Gaussian distributions are employed as both sample and feature priors. In order to derive an analytical solution for the proposed algorithm, we use the Laplace approximation to calculate approximate posteriors and marginal likelihoods. We then obtain the optimized parameters and hyperparameters by the type-II maximum likelihood method. Experiments on the Waveform dataset, EEG datasets, and high-dimensional datasets validate the performance of PFCVM under two criteria: classification accuracy and efficacy of the selected features. Finally, we analyze the generalization performance of PFCVM and derive a generalization error bound; by tightening the bound, we demonstrate the significance of sparseness for the model.

Feature Selection via Transferring Knowledge Across Different Classes

The problem of feature selection has attracted considerable research interest in recent years. Supervised information can significantly improve the quality of selected features. However, existing supervised feature selection methods all require that the classes in the labeled data (source domain) and unlabeled data (target domain) be identical, which may be too restrictive in many cases. In this paper, we consider a more challenging cross-class setting in which the classes in these two domains are related but different, a setting that has rarely been studied before. We propose a Cross-class Knowledge Transfer Feature Selection (CKTFS) framework that transfers cross-class knowledge from the source domain to guide feature selection in the target domain. Specifically, high-level descriptions, i.e., attributes, are used as the bridge for knowledge transfer. To further improve the quality of the selected features, our framework jointly considers the tasks of cross-class knowledge transfer and feature selection. Experimental results on four benchmark datasets demonstrate the superiority of the proposed method.

Identifying Complements and Substitutes of Products: A Neural Network Framework Based on Product Embedding

Complements and substitutes are two typical product relationships that deserve consideration in online product recommendations. One of the key objectives of recommender systems is to promote cross-selling, which heavily relies on recommending the appropriate type of products in specific scenarios. Research on consumer behavior has shown that consumers usually prefer substitutes in the browsing stage but complements in the purchasing stage. Thus, it is of great importance to identify the complementary and substitutable relationships between products. In this paper, we design a neural network based framework that integrates the text content and additional information of online reviews to mine product relations. For text content, we utilize LDA topic modeling to represent products in a succinct form called an "embedding". To capture the semantics of complementary and substitutable relationships, we design a modeling process that transforms the product embeddings into semantic features and incorporates additional non-text factors of product reviews based on empirical observations. Two models are developed: the basic model considers only semantic features, while the multi-input model also incorporates additional information. Extensive experiments are conducted to verify the effectiveness of product embedding, semantic modeling, and the proposed neural network based framework. Moreover, the improvement of the multi-input model over the basic model increases with the level of data sparsity, validating the value of additional non-text information when dealing with sparse datasets.
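The LDA "embedding" step can be sketched with scikit-learn's LatentDirichletAllocation on toy review text (an illustrative assumption: the corpus, topic count, and preprocessing here are invented, not the paper's setup):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical per-product review text, one document per product
reviews = {
    "phone_a": "great battery long battery life fast bright screen",
    "phone_b": "battery drains fast screen cracked after a week",
    "case_a":  "sturdy case fits the phone well and protects the screen",
}

vec = CountVectorizer()
X = vec.fit_transform(reviews.values())     # bag-of-words counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
embeddings = lda.fit_transform(X)           # one topic-proportion row per product
# each row sums to 1: a product is a point on the topic simplex,
# so similar review content yields nearby embeddings
```

Downstream, such topic vectors (rather than raw text) would feed the semantic-feature modeling the abstract describes.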

Information Diffusion Prediction with Network Regularized Role-based User Representation Learning

In this paper, we aim to develop a user representation learning model to solve the information diffusion prediction problem in social media. The main idea is to project diffusion users into a continuous latent space with role-based (sender and receiver) representations. This representation learning model embeds the information of both observed diffusion cascades and the explicit network into the user diffusion representations. It learns the role-based representations with a cascade modeling objective that maximizes the likelihood of observed cascades, and employs a matrix factorization objective of reconstructing structural proximities as a regularizer on the representations. The model is able to capture the unique diffusion characteristics of users in the representation space and to handle the sparsity of diffusion-related data. We evaluate the proposed model with two prediction strategies, i.e., diffusion simulation and diffusion ranking, on three real-world datasets. The experimental results demonstrate that the proposed model outperforms state-of-the-art diffusion representation learning models and other popular graph-based methods.

Leveraging Label Specific Discriminant Mapping Features for Multi-Label Learning

As an important machine learning task, multi-label learning deals with the problem where each sample instance (feature vector) is associated with multiple labels simultaneously. Most existing approaches focus on manipulating the label space, such as exploiting correlations between labels and reducing label space dimension, while using an identical feature space throughout classification. One potential drawback of this traditional strategy is that each label might have its own specific characteristics, and using identical features for all labels cannot lead to optimized performance. In this paper, we propose an effective algorithm named LSDM, which leverages label-specific discriminant mapping features for multi-label learning to overcome this drawback. LSDM sets diverse ratio parameter values to conduct cluster analysis on the positive and negative instances of the same label. It reconstructs a label-specific feature space that includes both distance information and spatial topology information. Our experimental results show that combining these two kinds of information in the new feature representation better exploits the clustering results in the learning process. Because diverse combinations exist for the same label, we employ simplified linear discriminant analysis to efficiently find the optimal combination for each label and perform classification by querying the corresponding results. Comparison with state-of-the-art algorithms on 20 benchmark datasets clearly demonstrates the competitiveness of LSDM.

Multi-task Crowdsourcing via an Optimization Framework

The unprecedented amounts of data have catalyzed the trend of combining human insights with machine learning techniques, facilitating the use of crowdsourcing to enlist label information both effectively and efficiently. One crucial challenge in crowdsourcing is the diverse worker quality, which determines the accuracy of the label information workers provide. Motivated by the observation that the same set of tasks is typically labeled by the same set of workers, we study their behaviors across multiple related tasks and propose an optimization framework for learning from task and worker dual heterogeneity. The proposed method uses a weight tensor to represent the workers' behaviors across multiple tasks and seeks the optimal solution of the tensor by exploiting its structured information. We then propose an iterative algorithm to solve the optimization problem and analyze its computational complexity. To infer the true label of an example, we construct a worker ensemble based on the estimated tensor, whose decisions are weighted using a set of entropy weights. We also prove that the gradient of the most time-consuming updating block is separable with respect to the workers, which leads to a faster randomized algorithm. Moreover, we extend the learning framework to accommodate the multi-class setting. Finally, we test the performance of our framework on several datasets and demonstrate its superiority over state-of-the-art techniques.
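A stripped-down quality-weighted worker ensemble might look as follows (a generic sketch using agreement-with-majority log-odds weights; the paper's tensor estimation and entropy-weight construction are more involved):

```python
import numpy as np

def weighted_vote(labels):
    """labels: (n_workers, n_items) 0/1 matrix of crowd answers.
    Weight each worker by agreement with a provisional majority
    vote, then take a weighted re-vote."""
    majority = (labels.mean(axis=0) >= 0.5).astype(int)
    acc = (labels == majority).mean(axis=1)       # per-worker accuracy proxy
    w = np.clip(acc, 1e-6, 1 - 1e-6)
    w = np.log(w / (1 - w))                       # log-odds weights
    w = np.maximum(w, 0)                          # zero out below-chance workers
    scores = w @ labels
    return (scores >= w.sum() / 2).astype(int)    # weighted majority
```

Reliable workers dominate the final vote while random or adversarial workers receive near-zero weight, which is the intuition behind weighting ensemble decisions by estimated worker quality.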

Near-Optimal and Practical Algorithms for Graph Scan Statistics with Connectivity Constraints

One fundamental task in network analysis is detecting hotspots or anomalies in the network; that is, detecting subgraphs where there is significantly more activity than one would expect given historical data or some baseline process. Scan statistics is one popular approach for anomalous subgraph detection. This methodology involves maximizing a score function over all connected subgraphs, which is a challenging computational problem. A number of heuristics have been proposed for these problems, but they do not provide any quality guarantees. Here, we propose a framework for designing algorithms that optimize a large class of scan statistics for networks, subject to connectivity constraints. Our algorithms run in time that scales linearly with the size of the graph and depends on a parameter we call the effective solution size, while providing rigorous approximation guarantees. In contrast, most prior methods have super-linear running times in terms of graph size. Extensive empirical evidence demonstrates the effectiveness and efficiency of our proposed algorithms in comparison with state-of-the-art methods. Our approach improves on all prior methods, yielding up to a 25% increase in score. Further, our algorithms scale to networks with up to a million nodes, 1-2 orders of magnitude larger than all prior applications.
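To make the underlying optimization problem concrete, here is a naive greedy seed-expansion heuristic for a toy scan statistic (illustrative only: both the score function and the strategy are assumptions, and this is exactly the kind of guarantee-free heuristic the paper improves upon):

```python
def greedy_scan(adj, activity, seed, alpha=1.0):
    """Grow a connected subgraph S from `seed`, repeatedly adding the
    frontier node with the highest activity, for the toy score
    score(S) = total activity of S - alpha * |S|."""
    S = {seed}
    score = activity[seed] - alpha
    while True:
        frontier = {v for u in S for v in adj[u]} - S   # keeps S connected
        best = max(frontier, key=lambda v: activity[v], default=None)
        if best is None or activity[best] - alpha <= 0:
            break                                       # no improving move
        S.add(best)
        score += activity[best] - alpha
    return S, score
```

Such greedy growth can miss high-activity nodes hidden behind low-activity ones, which is why approximation guarantees over all connected subgraphs, as the paper pursues, are valuable.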

The Relationship between Online Social Network Ties and User Attributes

The distance between users affects the formation of social network ties, but it is not the only or even the main factor. Knowing all the features that influence such ties is very important for many related domains, such as location-based recommender systems and community and event detection systems for online social networks (OSNs). In recent years, researchers have analysed the role of user geo-location in OSNs. Researchers have also attempted to determine the probability of friendships being established based on distance, although friendship is not only a function of distance. However, some important features of OSNs remain unknown. In order to comprehensively understand the OSN phenomenon, we also need to analyse users' attributes. Basically, an OSN functions according to four main user properties: user geo-location, user weight, number of user interactions, and user lifespan. The research presented here sought to determine whether the user mobility pattern can be used to predict users' interaction behaviour. It also investigated whether, in addition to distance, the number of friends (known as user weight) affects social network tie formation. To this end, we analysed the above-stated features in three large-scale OSNs using a two-to-two analysis method. We found that, despite a high degree of freedom in user mobility, the ratio of outside activities to inside activities is a significant indicator of user interaction behaviour. To the best of our knowledge, this issue has not been researched elsewhere. We also present a high-resolution formula to improve the friendship probability function.

A Survey of Parallel Sequential Pattern Mining

With the growing popularity of shared resources, large volumes of complex data of different types are collected automatically. Traditional data mining algorithms face problems and challenges including huge memory cost, low processing speed, and inadequate hard disk space. As a fundamental task of data mining, sequential pattern mining (SPM) is used in a wide variety of real-life applications. However, it is more complex and challenging than other pattern mining tasks, e.g., frequent itemset mining and association rule mining, and also suffers from the above challenges when handling large-scale data. To solve these problems, mining sequential patterns in a parallel or distributed computing environment has emerged as an important issue with many applications. In this paper, we provide an in-depth survey of the current status of parallel sequential pattern mining (PSPM), including a detailed categorization of traditional serial SPM approaches and of state-of-the-art parallel SPM. We review the related work on parallel sequential pattern mining in detail, covering partition-based, Apriori-based, pattern-growth-based, and hybrid algorithms for PSPM, and provide a deep description (i.e., characteristics, advantages, disadvantages, and summarization) of these parallel approaches. Some advanced topics for PSPM, including parallel quantitative/weighted/utility sequential pattern mining, PSPM from uncertain data and stream data, and hardware acceleration for PSPM, are further reviewed in detail. In addition, we review some well-known open-source software for PSPM. Finally, we summarize some challenges and opportunities of PSPM in the big data era.

Identity Matching across Chinese Social Networks Based on Least Information

With the growing popularity of the Internet, social networks serving different functions have undergone unprecedented development in recent years. Users create different accounts, also called identities, in different social networks, but there is no intuitive connection between them. In order to link identities that belong to the same user but reside in different social networks, we study the problem of identity matching across multiple social networks in this paper. We propose a three-step framework to examine the effectiveness of three kinds of features: profile features, user-generated content features, and user relationship features. By exploiting different combinations of features in this framework, we find the least information, i.e., the smallest set of features, that can identify users across social networks. Experiments on a real-world dataset show that our model can match identities from different social networks.

Variant Grassmann Manifolds: a Representation Augmentation Method for Action Recognition

In classification tasks, classifiers trained on finite examples may generalize poorly to new data with unknown variance. Data augmentation, in which numerous artificial examples are added to the training set, is a successful remedy for this issue. In this paper, we focus on data augmentation for improving the accuracy of action recognition, where action videos are modeled by linear dynamical systems and approximately represented as linear subspaces. These subspace representations lie in a non-Euclidean space, named the Grassmann manifold, whose points are orthonormal matrices. Our concern is that poor generalization may result from the variance of manifolds when data come from different sources or classes. Thus, we introduce infinitely many variant Grassmann manifolds (VGM) subject to a known distribution, then represent each action video as different Grassmann points, leading to augmented representations. Furthermore, a prior based on the stability of subspace bases is introduced, so the manifold distribution can be adaptively determined, balancing discrimination and representation. Experimental results on multi-class and multi-source classification show that VGM softmax classifiers achieve lower test error rates than methods with a single manifold.
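The subspace representation itself can be sketched in a few lines: an orthonormal basis extracted by SVD is a point on a Grassmann manifold, and principal angles give a distance between such points (a minimal sketch of subspace representations in general, not the paper's linear-dynamical-system modeling or VGM augmentation):

```python
import numpy as np

def grassmann_point(frames, k=3):
    """Orthonormal basis of the top-k left singular subspace of a
    (feature_dim, n_frames) data matrix: a point on Gr(k, feature_dim)."""
    U, _, _ = np.linalg.svd(frames, full_matrices=False)
    return U[:, :k]

def grassmann_distance(A, B):
    """Distance from the principal angles between the two subspaces."""
    s = np.linalg.svd(A.T @ B, compute_uv=False)
    angles = np.arccos(np.clip(s, -1.0, 1.0))
    return np.linalg.norm(angles)
```

Two videos with similar dynamics yield nearby subspaces (small principal angles), so a classifier on the manifold can compare whole videos rather than individual frames.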

Leveraging Kernel-Incorporated Matrix Factorization for App Recommendation

The ever-increasing number of smartphone applications (apps) available on different app markets poses a challenge for personalized app recommendation. Conventional collaborative filtering-based recommendation methods suffer from sparse and binary user-app implicit feedback, which results in poor performance in discriminating user-app preferences. In this paper, we first propose two kernel-incorporated probabilistic matrix factorization models, which introduce app-categorical information to constrain the user and app latent features to be similar to their neighbors in the latent space. The two models are solved by Stochastic Gradient Descent (SGD) with a user-oriented negative sampling scheme. To further improve recommendation performance, we construct pseudo user-app ratings based on user-app usage information and propose a novel Kernelized Non-negative Matrix Factorization (KNMF) that incorporates non-negative constraints on the latent factors to predict user-app preferences. This model also leverages user-user and app-app similarities with respect to app-categorical information to mine the latent geometric structure of the pseudo-rating space. Using the Karush-Kuhn-Tucker (KKT) conditions, a Multiplicative Updating Rules (MUR)-based optimization is proposed for model learning, and its convergence is proved by introducing an auxiliary function. Experimental results on a real user-app installation and usage dataset show that our models perform comparably to the state-of-the-art baselines in terms of two ranking-oriented evaluation metrics.
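A bare-bones version of the first ingredient, matrix factorization trained by SGD with per-user negative sampling on binary implicit feedback, might look like this (a plain MF sketch without the kernel or non-negativity machinery of the paper; all hyperparameter values are assumptions):

```python
import numpy as np

def mf_sgd(n_users, n_items, positives, dim=8, lr=0.05, reg=0.01,
           epochs=200, neg_per_pos=2, seed=0):
    """Fit user/item factors so observed (user, item) pairs score near 1
    and sampled unobserved pairs score near 0 (squared loss + L2)."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.normal(size=(n_users, dim))   # user factors
    Q = 0.1 * rng.normal(size=(n_items, dim))   # item factors
    pos_set = set(positives)
    for _ in range(epochs):
        for u, i in positives:
            samples = [(i, 1.0)]
            while len(samples) <= neg_per_pos:  # draw unobserved items as negatives
                j = rng.integers(n_items)
                if (u, j) not in pos_set:
                    samples.append((j, 0.0))
            for item, r in samples:
                err = r - P[u] @ Q[item]
                P[u] += lr * (err * Q[item] - reg * P[u])
                Q[item] += lr * (err * P[u] - reg * Q[item])
    return P, Q
```

With sparse binary feedback, sampling a few negatives per positive is what lets the model learn to rank installed apps above unobserved ones instead of predicting 1 everywhere.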

Translations Diversification for Expert Finding: A Novel Clustering-based Approach

Expert finding, which addresses the task of retrieving and ranking the right people for the subject of a user query, has become an important research field that has attracted the attention of many researchers. The central challenge in the expert finding domain is to determine the strength of the relationship between query words and the documents written by candidate experts. Another major challenge in information retrieval is the vocabulary gap between queries and documents. In this study, a translation model based on clustering words in two spaces, a query space and a co-occurrence space, is proposed to overcome this problem. First, words that are semantically close are clustered in the query space, and then each cluster in this space is clustered again in the co-occurrence space. The representatives of each cluster in the co-occurrence space are considered a diverse subset of the parent cluster; in this way, the diversity of queries is guaranteed in the query space. Next, a probabilistic model based on the degree to which a word belongs to a cluster and the similarity of the cluster to the query in the query space is used to address the vocabulary gap. Finally, the translations corresponding to each query are used in conjunction with a blender model for expert finding. Experiments on a Stackoverflow dataset show the efficiency of the proposed methods for expert finding.

