ACM Transactions on Knowledge Discovery from Data (TKDD)

Latest Articles

Fast Approximate Score Computation on Large-Scale Distributed Data for Learning Multinomial Bayesian Networks

In this article, we focus on the problem of learning a Bayesian network over distributed data stored...

Taxonomy and Evaluation for Microblog Popularity Prediction

As social networks become a major source of information, predicting the outcome of information diffusion has appeared intriguing to both researchers...

Rumor Blocking through Online Link Deletion on Social Networks

In recent years, social networks have become important platforms for people to disseminate information. However, we need to take effective measures...

Outcome-Oriented Predictive Process Monitoring: Review and Benchmark

Predictive business process monitoring refers to the act of making predictions about the future state of ongoing cases of a business process, based on their incomplete execution traces and logs of historical (completed) traces. Motivated by the increasingly pervasive availability of fine-grained event data about business process executions, the...

Performance Bounds of Decentralized Search in Expert Networks for Query Answering

Expert networks are formed by a group of expert-professionals with different specialties to collaboratively resolve specific queries posted to the...

DWE-Med: Dynamic Word Embeddings for Medical Domain

Recent advances in unsupervised language processing methods have created an opportunity to exploit massive text corpora for developing high-quality vector space representations (also known as word embeddings) of words. Towards this direction, practitioners have developed and applied several data-driven embedding models with quite good rate of...


About TKDD 

ACM Transactions on Knowledge Discovery from Data (TKDD) publishes original archival papers in the area of knowledge discovery from data and closely related disciplines. The majority of the papers that appear in TKDD are expected to address the logical and technical foundations of knowledge discovery and data mining.

Forthcoming Articles
Robust Regression via Heuristic Corruption Thresholding and Its Adaptive Estimation Variation

The presence of data noise and corruption has recently drawn increasing attention to Robust Least Squares Regression (RLSR), which addresses the fundamental problem of learning reliable regression coefficients when response variables can be arbitrarily corrupted. Until now, several important challenges could not be handled concurrently: 1) rigorous recovery guarantees for the regression coefficients; 2) difficulty in estimating the corruption-ratio parameter; and 3) scaling to massive datasets. This paper proposes a novel Robust least squares regression algorithm via Heuristic Corruption Thresholding (RHCT) that addresses all of the above challenges concurrently. Specifically, the algorithm alternately optimizes the regression coefficients and estimates the optimal uncorrupted set via heuristic thresholding, without a corruption-ratio parameter, until it converges. Moreover, to improve the efficiency of corruption estimation on large-scale data, a robust regression algorithm based on adaptive corruption-estimation variation (RACT) is proposed to determine the size of the uncorrupted set by adaptive search, without exhaustively iterating over data samples. We also prove that our algorithms enjoy strong guarantees analogous to those of state-of-the-art methods in terms of convergence rates and recovery guarantees. Extensive experiments demonstrate that our new methods are superior to existing methods in recovering both regression coefficients and uncorrupted sets, with very competitive efficiency.
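The alternating scheme the abstract describes (fit coefficients on a trusted subset, then re-estimate that subset by thresholding residuals) can be illustrated with a minimal one-dimensional sketch. This is a toy illustration of the general idea, not the authors' RHCT algorithm; in particular, the fixed set size `k` is a stand-in for the paper's heuristically estimated corruption threshold.

```python
def robust_fit(xs, ys, k, iters=20):
    """Toy alternating robust regression: 1-D, no intercept.

    Alternates between (a) least-squares fit on the current
    "uncorrupted" set and (b) keeping the k samples with the
    smallest residuals, until the set stops changing.
    """
    idx = list(range(len(xs)))  # start by trusting every sample
    w = 0.0
    for _ in range(iters):
        # (a) least-squares slope on the current trusted set
        num = sum(xs[i] * ys[i] for i in idx)
        den = sum(xs[i] * xs[i] for i in idx)
        w = num / den
        # (b) keep the k samples with the smallest residuals
        order = sorted(range(len(xs)), key=lambda i: abs(ys[i] - w * xs[i]))
        new_idx = sorted(order[:k])
        if new_idx == idx:
            break
        idx = new_idx
    return w, idx
```

On data following `y = 2x` with one grossly corrupted response, the corrupted sample is expelled from the trusted set and the clean slope is recovered.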

Clustering Users by Their Mobility Behavioral Patterns

The immense stream of data from mobile devices during recent years enables one to learn more about human behavior and provide mobile phone users with personalized services. In this work, we identify groups of users who have the same movement patterns. We analyze trajectories of semantic locations to find users who have similar movement behavior or mobility lifestyle, even when they live in different areas. For this task, we propose a new grouping scheme that is called Lifestyle-Based Clustering (LBC). We represent the mobility movement of each user by a Markov model and calculate the Jensen-Shannon distances among pairs of users. The pairwise distances are represented in a similarity matrix, which is used for the clustering. To validate the unsupervised clustering task, we develop an entropy-based clustering measure, namely, an index that measures the homogeneity of mobility patterns within clusters of users. The analysis is validated on a real-world dataset that contains location-movements of 5,000 cellular phone users that were collected over a two-month period.
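The distance computation at the heart of the LBC scheme above can be shown in miniature. The hedged sketch below compares toy per-user distributions over semantic locations with the Jensen-Shannon distance; the paper applies the same distance to per-user Markov models, and the three users and their distributions here are purely illustrative.

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance (square root of the JS divergence, base-2 logs)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

# Toy "mobility" distributions over three semantic locations
# (e.g. home, work, other) -- illustrative stand-ins for the rows
# of per-user Markov transition matrices.
users = {
    "u1": [0.7, 0.2, 0.1],
    "u2": [0.6, 0.3, 0.1],  # lifestyle similar to u1
    "u3": [0.1, 0.2, 0.7],  # very different lifestyle
}
names = sorted(users)
# Pairwise-distance matrix, the input to the clustering step.
dist = {(a, b): js_distance(users[a], users[b]) for a in names for b in names}
```

The resulting symmetric matrix of pairwise distances is exactly the similarity structure that the clustering step consumes.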

Feature Selection via Transferring Knowledge Across Different Classes

The problem of feature selection has attracted considerable research interest in recent years. Supervised information is capable of significantly improving the quality of selected features. However, existing supervised feature selection methods all require the classes in the labeled data (source domain) and unlabeled data (target domain) to be identical, which may be too restrictive in many cases. In this paper, we consider a more challenging cross-class setting where the classes in these two domains are related but different, a setting that has rarely been studied before. We propose a Cross-class Knowledge Transfer Feature Selection (CKTFS) framework which transfers cross-class knowledge from the source domain to guide target-domain feature selection. Specifically, high-level descriptions, i.e., attributes, are used as the bridge for knowledge transfer. To further improve the quality of the selected features, our framework jointly considers the tasks of cross-class knowledge transfer and feature selection. Experimental results on four benchmark datasets demonstrate the superiority of the proposed method.

Identifying Complements and Substitutes of Products: A Neural Network Framework Based on Product Embedding

Complements and substitutes are two typical product relationships that deserve consideration in online product recommendations. One of the key objectives of recommender systems is to promote cross-selling, which heavily relies on recommending the appropriate type of products in specific scenarios. Research on consumer behavior has shown that consumers usually prefer substitutes in the browsing stage but complements in the purchasing stage. Thus, it is of great importance to identify the complementary and substitutable relationships between products. In this paper, we design a neural network based framework that integrates text content and additional information of online reviews to mine product relations. For text content, we utilize LDA topic modeling to represent products in a succinct form called an ``embedding''. To capture the semantics of complementary and substitutable relationships, we design a modeling process that transfers the product embeddings into semantic features and incorporates additional non-text factors of product reviews based on empirical observations. Two models are developed: the basic model considers only semantic features, while the multi-input model also incorporates additional information. Extensive experiments are conducted to verify the effectiveness of product embedding, semantic modeling, and the proposed neural network based framework. Moreover, the improvements of the multi-input model over the basic model increase with the level of data sparsity, validating the value of additional non-text information when dealing with sparse datasets.

Information Diffusion Prediction with Network Regularized Role-based User Representation Learning

In this paper, we aim at developing a user representation learning model to solve the information diffusion prediction problem in social media. The main idea is to project the diffusion users into a continuous latent space with role-based (sender and receiver) representations. This representation learning model embeds the information of both observed diffusion cascades and the explicit network into the user diffusion representations. It learns the role-based representations with a cascade modeling objective that maximizes the likelihood of observed cascades, and employs a matrix factorization objective of reconstructing structural proximities as a regularization on the representations. The model is able to capture the unique diffusion characteristics of users in the representation space and cope with problems in the diffusion-related data. We evaluate the proposed model with two prediction strategies, i.e., diffusion simulation and diffusion ranking, on three real-world datasets. The experimental results demonstrate that the proposed model outperforms state-of-the-art diffusion representation learning models and other popular graph-based methods.

Leveraging Label Specific Discriminant Mapping Features for Multi-Label Learning

As an important machine learning task, multi-label learning deals with the problem where each sample instance (feature vector) is associated with multiple labels simultaneously. Most existing approaches focus on manipulating the label space, such as exploiting correlations between labels and reducing label space dimension, while using an identical feature space in the process of classification. One potential drawback of this traditional strategy is that each label might have its own specific characteristics, and using identical features for all labels cannot lead to optimized performance. In this paper, we propose an effective algorithm named LSDM, which leverages label-specific discriminant mapping features for multi-label learning, to overcome this drawback. LSDM sets diverse ratio parameter values to conduct cluster analysis on the positive and negative instances of each label. It reconstructs a label-specific feature space which includes distance information and spatial topology information. Our experimental results show that combining these two parts of information in the new feature representation can better exploit the clustering results in the learning process. Because there are diverse combinations for each label, we employ simplified linear discriminant analysis to efficiently find the optimal one for each label and perform classification by querying the corresponding results. Comparison with state-of-the-art algorithms on a total of 20 benchmark datasets clearly demonstrates the competitiveness of LSDM.

Multi-task Crowdsourcing via an Optimization Framework

The unprecedented amounts of data have catalyzed the trend of combining human insights with machine learning techniques, which facilitates the use of crowdsourcing to enlist label information both effectively and efficiently. One crucial challenge in crowdsourcing is the diverse worker quality, which determines the accuracy of the label information provided by such workers. Motivated by the observation that the same set of tasks is typically labeled by the same set of workers, we study their behaviors across multiple related tasks and propose an optimization framework for learning from task and worker dual heterogeneity. The proposed method uses a weight tensor to represent the workers' behaviors across multiple tasks, and seeks the optimal solution of the tensor by exploiting its structured information. We then propose an iterative algorithm to solve the optimization problem and analyze its computational complexity. To infer the true label of an example, we construct a worker ensemble based on the estimated tensor, whose decisions are weighted using a set of entropy weights. We also prove that the gradient of the most time-consuming updating block is separable with respect to the workers, which leads to a randomized algorithm with faster speed. Moreover, we extend the learning framework to accommodate the multi-class setting. Finally, we test the performance of our framework on several datasets and demonstrate its superiority over state-of-the-art techniques.
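The entropy-weighted ensemble decision mentioned above can be sketched as follows. This is a plausible toy version only (the exact weighting scheme in the paper may differ): each worker is weighted by how far below maximum entropy its label distribution is, so a confident worker's vote can outweigh two near-random ones.

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a worker's label distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def entropy_weighted_vote(votes, dists, num_labels=2):
    """Aggregate votes, weighting each worker by (max entropy - its entropy).

    Low-entropy (confident) workers receive larger weights; a
    maximally uncertain worker receives weight zero.
    """
    max_h = math.log2(num_labels)
    scores = {}
    for vote, dist in zip(votes, dists):
        scores[vote] = scores.get(vote, 0.0) + (max_h - entropy(dist))
    return max(scores, key=scores.get)
```

With one confident worker voting 1 and two near-uniform workers voting 0, the weighted ensemble overrides the simple majority.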

Near-Optimal and Practical Algorithms for Graph Scan Statistics with Connectivity Constraints

One fundamental task in network analysis is detecting hotspots or anomalies in the network; that is, detecting subgraphs where there is significantly more activity than one would expect given historical data or some baseline process. Scan statistics is one popular approach used for anomalous subgraph detection. This methodology involves maximizing a score function over all connected subgraphs, which is a challenging computational problem. A number of heuristics have been proposed for these problems, but they do not provide any quality guarantees. Here, we propose a framework for designing algorithms for optimizing a large class of scan statistics for networks, subject to connectivity constraints. Our algorithms run in time that scales linearly with the size of the graph and depends on a parameter we call the effective solution size, while providing rigorous approximation guarantees. In contrast, most prior methods have super-linear running times in terms of graph size. Extensive empirical evidence demonstrates the effectiveness and efficiency of our proposed algorithms in comparison with state-of-the-art methods. Our approach improves on the performance of all prior methods, yielding increases of over 25% in the score. Further, our algorithms scale to networks with up to a million nodes, which is 1-2 orders of magnitude larger than all prior applications.
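The core problem the abstract describes (maximizing a score over connected subgraphs) can be shown with a brute-force toy example. Real scan statistics use more elaborate score functions (the simple "observed minus baseline" sum below is illustrative), and the paper's algorithms exist precisely to avoid this exhaustive enumeration; the sketch only fixes the search structure.

```python
from itertools import combinations

# Toy path graph with per-node observed counts and baseline expectations.
edges = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
observed = [5, 9, 8, 1]
baseline = [4, 4, 4, 4]

def connected(nodes):
    """Check connectivity of an induced subgraph via depth-first search."""
    nodes = set(nodes)
    stack, seen = [next(iter(nodes))], set()
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(edges[v] & nodes)
    return seen == nodes

def best_subgraph():
    """Exhaustively maximize a toy scan score over connected subgraphs."""
    best, best_score = None, float("-inf")
    for r in range(1, len(edges) + 1):
        for sub in combinations(edges, r):
            if connected(sub):
                score = sum(observed[v] - baseline[v] for v in sub)
                if score > best_score:
                    best, best_score = set(sub), score
    return best, best_score
```

Here nodes 1 and 2 carry the anomalous activity, and the best connected subgraph also absorbs node 0 (slightly elevated) while excluding node 3 (below baseline).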

The Relationship between Online Social Network Ties and User Attributes

The distance between users has an effect on the formation of social network ties, but it is not the only or even the main factor. Knowing all the features that influence such ties is very important for many related domains, such as location-based recommender systems and community and event detection systems for online social networks (OSNs). In recent years, researchers have analysed the role of user geo-location in OSNs. Researchers have also attempted to determine the probability of friendships being established based on distance, although friendship is not only a function of distance. However, some important features of OSNs remain unknown. In order to comprehensively understand the OSN phenomenon, we also need to analyse users' attributes. Basically, an OSN functions according to four main user properties: user geo-location, user weight, number of user interactions, and user lifespan. The research presented here sought to determine whether the user mobility pattern can be used to predict users' interaction behaviour. It also investigated whether, in addition to distance, the number of friends (known as user weight) influences social network tie formation. To this end, we analysed the above-stated features in three large-scale OSNs by using a two-to-two analysis method. We found that, despite a high degree of freedom in user mobility, the ratio of the number of outside activities to inside activities is a significant indicator that helps us characterize user interaction behaviour. To the best of our knowledge, research has not been conducted elsewhere on this issue. We also present a high-resolution formula to improve the friendship probability function.

A Survey of Parallel Sequential Pattern Mining

With the growing popularity of shared resources, large volumes of complex data of different types are collected automatically. Traditional data mining algorithms face problems and challenges including huge memory cost, low processing speed, and inadequate hard disk space. As a fundamental task of data mining, sequential pattern mining (SPM) is used in a wide variety of real-life applications. However, it is more complex and challenging than other pattern mining tasks, e.g., frequent itemset mining and association rule mining, and also suffers from the above challenges when handling large-scale data. To solve these problems, mining sequential patterns in a parallel or distributed computing environment has emerged as an important issue with many applications. In this paper, we provide an in-depth survey of the current status of parallel sequential pattern mining (PSPM), including a detailed categorization of traditional serial SPM approaches and state-of-the-art parallel SPM. We review the related work on parallel sequential pattern mining in detail, including partition-based algorithms for PSPM, Apriori-based PSPM, pattern-growth-based PSPM, and hybrid algorithms for PSPM, and provide a deep description (i.e., characteristics, advantages, disadvantages, and summarization) of these parallel approaches. Some advanced topics for PSPM, including parallel quantitative/weighted/utility sequential pattern mining, PSPM from uncertain data and stream data, and hardware acceleration for PSPM, are further reviewed in detail. In addition, we review some well-known open-source software for PSPM. Finally, we summarize some challenges and opportunities of PSPM in the big data era.

Road Network Construction with Complex Intersections based on Sparsely-Sampled Private Car Trajectory Data

A road network is a critical aspect of both urban planning and route recommendation. This paper proposes an efficient approach to build a fine-grained road network from sparsely-sampled private car trajectory data in complicated urban environments. In order to resolve the difficulties introduced by low-sampling-rate trajectory data, we concentrate sample points around intersections by utilizing the turning characteristics of the large-scale trajectory data to ensure accurate detection of intersections and road segments. For complex road networks, such as those including various overpasses and underpasses, we first layer intersections into major and minor ones, and then propose a simplified representation of intersections and a corresponding computable model based on the features of roads, which can significantly improve the accuracy of the detected road networks. In order to construct fine-grained road networks, we distinguish various types of intersections using direction information and detected turning limits. To the best of our knowledge, our map-building method is the first to infer road networks from large-scale private car trajectory data. Extensive evaluations are conducted on a real-world trajectory dataset from 1,345 private cars in Futian district, Shenzhen city, China. The results demonstrate the effectiveness of the proposed method. The constructed road map closely matches the one from the publicly edited map OpenStreetMap, especially in the locations of road intersections and road segments, with 92.4% of intersections within 20 meters and 96.3% of road segments within 12 meters.

Variant Grassmann Manifolds: a Representation Augmentation Method for Action Recognition

In classification tasks, classifiers trained with finite examples might generalize poorly to new data with unknown variance. For this issue, data augmentation is a successful solution in which numerous artificial examples are added to training sets. In this paper, we focus on data augmentation for improving the accuracy of action recognition, where action videos are modeled by linear dynamical systems and approximately represented as linear subspaces. These subspace representations lie in a non-Euclidean space, named the Grassmann manifold, whose points are orthonormal matrices. Our concern is that poor generalization may result from the variance of manifolds when data come from different sources or classes. Thus we introduce infinitely many variant Grassmann manifolds (VGM) subject to a known distribution, then represent each action video as different Grassmann points, leading to an augmented representation. Furthermore, a prior based on the stability of subspace bases is introduced, so the manifold distribution can be adaptively determined, balancing discrimination and representation. Experimental results on multi-class and multi-source classification show that VGM softmax classifiers achieve lower test error rates than methods with a single manifold.

Tensorizing Restricted Boltzmann Machine

The Restricted Boltzmann Machine (RBM) is a well-known model for feature extraction and can be used as an initializer for neural networks. When applying the classic RBM to multidimensional data such as 2D/3D tensors, one needs to vectorize such high-order data. Vectorization results in the curse of dimensionality and the loss of valuable spatial information. In particular, a large amount of memory is required by the commonly used fully-connected layers, making it hard to use the models on low-end devices and preventing their use with high-order data. In this paper, in order to apply the classic RBM to tensorial data directly, we propose a new tensorial RBM model parametrized by the tensor train decomposition (TTRBM). In this model, both input and hidden variables are in tensorial form, connected by a parameter matrix in tensor train format. The biggest advantage of the proposed TTRBM is that it obtains performance comparable to the classic RBM with much fewer model parameters and a faster training process. Experiments on three real-world applications (face reconstruction, handwritten digit recognition, and image super-resolution) demonstrate the advantages of TTRBM.

Leveraging Kernel-Incorporated Matrix Factorization for App Recommendation

The ever-increasing number of smartphone applications (apps) available on different app markets poses a challenge for personalized app recommendation. Conventional collaborative filtering-based recommendation methods suffer from sparse and binary user-app implicit feedback, which results in poor performance in discriminating user-app preferences. In this paper, we first propose two kernel-incorporated probabilistic matrix factorization models, which introduce app-categorical information to constrain the user and app latent features to be similar to their neighbors in the latent space. The two models are solved by Stochastic Gradient Descent (SGD) with a user-oriented negative sampling scheme. To further improve recommendation performance, we construct pseudo user-app ratings based on user-app usage information, and propose a novel Kernelized Non-negative Matrix Factorization (KNMF) that incorporates non-negativity constraints on the latent factors to predict user-app preferences. This model also leverages user-user and app-app similarities with regard to app-categorical information to mine the latent geometric structure in the pseudo-rating space. Adopting the Karush-Kuhn-Tucker (KKT) conditions, a Multiplicative Updating Rules (MUR)-based optimization is proposed for model learning, and its convergence is proved by introducing an auxiliary function. Experimental results on a real user-app installation usage dataset show that our models perform comparably to state-of-the-art baselines in terms of two ranking-oriented evaluation metrics.
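The multiplicative-update optimization mentioned in the abstract follows the same pattern as the classic Lee-Seung updates for non-negative matrix factorization, sketched below in plain Python. This is standard NMF, not the paper's kernelized KNMF variant; it only illustrates why multiplicative rules preserve non-negativity: each factor entry is multiplied by a ratio of non-negative quantities.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(V, k, iters=200, eps=1e-9):
    """Lee-Seung multiplicative updates for V ~ W @ H, all factors non-negative."""
    rng = random.Random(0)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), elementwise
        WH, Wt = matmul(W, H), transpose(W)
        num, den = matmul(Wt, V), matmul(Wt, WH)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        WH, Ht = matmul(W, H), transpose(H)
        num, den = matmul(V, Ht), matmul(WH, Ht)
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H
```

On an exactly rank-1 non-negative matrix, the updates recover a near-exact factorization, and both factors stay non-negative throughout.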

Translations Diversification for Expert Finding: A Novel Clustering-based Approach

Expert finding, the task of retrieving and ranking the right people for the subject of a user query, has become an important research field that has attracted the attention of many researchers. The most important challenge in the expert finding domain is to determine the strength of the relationship between query words and documents written by candidate experts. One of the most important challenges in information retrieval is the vocabulary gap between queries and documents. In this study, a translation model based on clustering words in two spaces, the query space and the co-occurrence space, is proposed to overcome this problem. First, words that are semantically close are clustered in the query space, and then each cluster in this space is clustered again in the co-occurrence space. The representatives of each cluster in the co-occurrence space are considered a diverse subset of the parent cluster. By this method, the diversity of queries is guaranteed in the query space. Next, a probabilistic model based on the degree to which a word belongs to a cluster and the similarity of the cluster to the query in the query space is used to address the vocabulary gap. Finally, the translations corresponding to each query are used in conjunction with a blender model for expert finding. Experiments on the StackOverflow dataset show the efficiency of the proposed methods for expert finding.
