ACM Transactions on Knowledge Discovery from Data (TKDD)

Latest Articles

Fast Approximate Score Computation on Large-Scale Distributed Data for Learning Multinomial Bayesian Networks

In this article, we focus on the problem of learning a Bayesian network over distributed data stored...

Taxonomy and Evaluation for Microblog Popularity Prediction

As social networks become a major source of information, predicting the outcome of information diffusion has appeared intriguing to both researchers...

Rumor Blocking through Online Link Deletion on Social Networks

In recent years, social networks have become important platforms for people to disseminate information. However, we need to take effective measures...

Outcome-Oriented Predictive Process Monitoring: Review and Benchmark

Predictive business process monitoring refers to the act of making predictions about the future state of ongoing cases of a business process, based on their incomplete execution traces and logs of historical (completed) traces. Motivated by the increasingly pervasive availability of fine-grained event data about business process executions, the...

Performance Bounds of Decentralized Search in Expert Networks for Query Answering

Expert networks are formed by a group of expert-professionals with different specialties to collaboratively resolve specific queries posted to the...

DWE-Med: Dynamic Word Embeddings for Medical Domain

Recent advances in unsupervised language processing methods have created an opportunity to exploit massive text corpora for developing high-quality vector space representations of words (also known as word embeddings). In this direction, practitioners have developed and applied several data-driven embedding models with quite good rate of...

Near-Optimal and Practical Algorithms for Graph Scan Statistics with Connectivity Constraints

One fundamental task in network analysis is detecting “hotspots” or “anomalies” in the network; that is, detecting...

Probabilistic Feature Selection and Classification Vector Machine

Sparse Bayesian learning is a state-of-the-art supervised learning algorithm that can choose a subset of relevant samples from the input data and make...

Variant Grassmann Manifolds: A Representation Augmentation Method for Action Recognition

In classification tasks, classifiers trained with finite examples might generalize poorly to new data with unknown variance. For this issue, data augmentation is a successful solution where numerous artificial examples are added to training sets. In this article, we focus on the data augmentation for improving the accuracy of action recognition,...

Leveraging Label-Specific Discriminant Mapping Features for Multi-Label Learning

As an important machine learning task, multi-label learning deals with the problem where each sample instance (feature vector) is associated with...


About TKDD 

ACM Transactions on Knowledge Discovery from Data (TKDD) publishes original archival papers in the area of knowledge discovery from data and closely related disciplines. The majority of the papers that appear in TKDD are expected to address the logical and technical foundations of knowledge discovery and data mining.

Forthcoming Articles

Time-Sync Video Tag Extraction Using Semantic Association Graph

Robust Regression via Heuristic Corruption Thresholding and Its Adaptive Estimation Variation

The presence of data noise and corruption has recently drawn increasing attention to Robust Least Squares Regression (RLSR), which addresses the fundamental problem of learning reliable regression coefficients when response variables can be arbitrarily corrupted. Until now, several important challenges could not be handled concurrently: 1) rigorous recovery guarantees for the regression coefficients; 2) difficulty in estimating the corruption ratio parameter; and 3) scaling to massive datasets. This paper proposes a novel Robust Least squares regression algorithm via Heuristic Corruption Thresholding (RHCT) that addresses all of these challenges concurrently. Specifically, the algorithm alternately optimizes the regression coefficients and estimates the optimal uncorrupted set via heuristic thresholding, without a corruption ratio parameter, until convergence. Moreover, to improve the efficiency of corruption estimation on large-scale data, a robust regression algorithm based on adaptive corruption-estimation variation (RACT) is proposed to determine the size of the uncorrupted set via adaptive search, without exhaustively iterating over data samples. We also prove that our algorithms enjoy guarantees analogous to those of state-of-the-art methods in terms of convergence rates and recovery guarantees. Extensive experiments demonstrate that our new methods outperform existing methods in recovering both regression coefficients and uncorrupted sets, with very competitive efficiency.
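
The alternating fit-then-threshold loop described in the abstract can be sketched as follows. This is a minimal illustration of the general idea, not the paper's RHCT algorithm; in particular, the threshold rule used here (twice the median absolute residual) is an assumption made for the example.

```python
import numpy as np

def robust_lsr(X, y, n_iter=30):
    """Illustrative robust least squares: alternately fit on the currently
    trusted set, then re-select the trusted set by thresholding residuals."""
    trusted = np.arange(len(y))          # start by trusting every sample
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta, *_ = np.linalg.lstsq(X[trusted], y[trusted], rcond=None)
        resid = np.abs(y - X @ beta)
        thresh = 2.0 * np.median(resid)  # heuristic threshold (assumption)
        new_trusted = np.flatnonzero(resid <= max(thresh, 1e-12))
        if np.array_equal(new_trusted, trusted):
            break                        # converged: trusted set is stable
        trusted = new_trusted
    return beta, trusted

# Usage: recover a slope of 3 when 10% of the responses are corrupted.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = X @ np.array([3.0])
y[:10] += 50.0                           # arbitrarily corrupt 10 responses
beta, trusted = robust_lsr(X, y)
```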

A Distance Measure for the Analysis of Polar Opinion Dynamics in Social Networks

Analysis of opinion dynamics in social networks plays an important role today. For predicting users' political preferences, it is particularly important to be able to analyze the dynamics of competing polar opinions, such as pro-Democrat vs. pro-Republican. While observing the evolution of polar opinions in a social network over time, can we tell when the network evolved abnormally? Furthermore, can we predict how the opinions of the users will change in the future? To answer such questions, it is insufficient to study individual user behavior, since opinions can spread beyond users' ego-networks. Instead, we need to consider the opinion dynamics of all users simultaneously and capture the connection between individuals' behavior and the global evolution pattern of the social network. In this work, we introduce the Social Network Distance (SND), a distance measure that quantifies the likelihood of one snapshot of a social network evolving into another snapshot under a chosen model of polar opinion dynamics. SND has the rich semantics of a transportation problem, yet is computable in time linear in the number of users and, as such, is applicable to large-scale online social networks. In our experiments with synthetic and Twitter data, we demonstrate the utility of our distance measure for anomalous event detection. It achieves a true positive rate of 0.83, twice as high as that of alternatives. When used for opinion prediction on Twitter data, SND's accuracy is 75.6%, which is 7.5% higher than that of the next best method.

Heterogeneous-length Text Topic Modeling for Reader-aware Multi-Document Summarization

Housing Demand Estimation based on Express Delivery Data

Housing demand estimation is an important topic in economic research. It benefits various applications, including real estate market regulation and urban planning, and is therefore crucial for both real estate investors and government administrators. Meanwhile, with the rapid development of the express delivery industry, the abundant information embedded in express delivery records helps researchers profile urban life patterns: the express delivery behavior of residents in a residential community can reflect housing demand to some extent. Although there have been studies on housing demand analysis, demand estimation remains difficult and under-explored. To this end, in this paper, we propose a systematic housing demand estimation method based on express delivery data. We first aggregate the express delivery records at the community scale with clustering methods and impute missing values in the records. Then, we extract various features from the resulting, less sparse dataset, considering both residential mobility probability and the attractiveness of residential communities. In addition, correlations between districts can influence the performance of our inference model, so we take the commonalities and differences across districts into consideration. With these features and inter-district correlations, we estimate housing demand with a multi-task learning method based on neural networks. Finally, experimental results on real-world data show that our model effectively estimates housing demand at the residential community level.

Maximally Correlated Principal Component Analysis Based on Deep Parameterization Learning

Dimensionality reduction is widely used to deal with high-dimensional data. As a well-known dimensionality reduction method, principal component analysis (PCA), which aims to find a low-dimensional representation of the original data, has been very successful, and many improved variants of PCA have been proposed. However, most PCA-based algorithms consider only the linear correlation of data features. In this paper, we propose a novel dimensionality reduction model called maximally correlated principal component analysis based on deep parameterization learning (MCPCADP), which takes nonlinear correlation into account through a deep parameterization framework. The new model explores nonlinear correlation by maximizing the Ky Fan norm of the covariance matrix of nonlinearly mapped data features. A new backpropagation algorithm for model optimization is derived. To assess the proposed method, we conduct experiments on both a synthetic dataset and several real-world datasets. The experimental results demonstrate that the proposed algorithm is comparable with several widely used algorithms.
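
For reference, the Ky Fan k-norm that appears in this objective is simply the sum of the k largest singular values of a matrix; for a covariance matrix (which is positive semi-definite) this equals the variance captured by the top k principal directions. A minimal sketch:

```python
import numpy as np

def ky_fan_norm(A, k):
    """Ky Fan k-norm: the sum of the k largest singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)   # singular values, descending
    return float(np.sum(s[:k]))

# For a diagonal matrix the singular values are the diagonal magnitudes,
# so the Ky Fan 2-norm of diag(3, 2, 1) is 3 + 2 = 5.
print(ky_fan_norm(np.diag([3.0, 2.0, 1.0]), 2))
```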

Clustering Users by Their Mobility Behavioral Patterns

The immense stream of data from mobile devices in recent years makes it possible to learn more about human behavior and to provide mobile phone users with personalized services. In this work, we identify groups of users who have the same movement patterns. We analyze trajectories of semantic locations to find users who have similar movement behavior or mobility lifestyle, even when they live in different areas. For this task, we propose a new grouping scheme called Lifestyle-Based Clustering (LBC). We represent the mobility of each user by a Markov model and calculate the Jensen-Shannon distances among pairs of users. The pairwise distances are represented in a similarity matrix, which is used for the clustering. To validate the unsupervised clustering task, we develop an entropy-based clustering measure, an index that measures the homogeneity of mobility patterns within clusters of users. The analysis is validated on a real-world dataset containing the location movements of 5,000 cellular phone users, collected over a two-month period.
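
The distance computation described above can be illustrated with a small sketch: summarize each user's first-order Markov behavior by an empirical transition distribution and compare users with the Jensen-Shannon divergence. This is an illustrative simplification, not the paper's exact model construction.

```python
from math import log2
from collections import Counter

def transition_dist(trajectory):
    """Empirical distribution over (from, to) location transitions."""
    pairs = Counter(zip(trajectory, trajectory[1:]))
    total = sum(pairs.values())
    return {p: c / total for p, c in pairs.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions
    (base-2 logs, so the value lies in [0, 1])."""
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in set(p) | set(q)}
    def kl(a):
        return sum(v * log2(v / m[k]) for k, v in a.items() if v > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Two users with different daily routines are maximally distant here,
# because their transition distributions have disjoint supports.
u1 = ["home", "work", "home", "work", "home"]
u2 = ["home", "gym", "home", "gym", "home"]
d_same = js_divergence(transition_dist(u1), transition_dist(u1))
d_diff = js_divergence(transition_dist(u1), transition_dist(u2))
```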

Continuous-Time Relationship Prediction in Dynamic Heterogeneous Information Networks

Online social networks, the World Wide Web, media and technological networks, and other types of so-called information networks are ubiquitous nowadays. These information networks are inherently heterogeneous and dynamic. They are heterogeneous as they consist of multi-typed objects and relations, and they are dynamic as they are constantly evolving over time. One of the challenging issues in such heterogeneous and dynamic environments is to forecast those relationships in the network that will appear in the future. In this paper, we try to solve the problem of continuous-time relationship prediction in dynamic and heterogeneous information networks. This implies predicting the time it takes for a relationship to appear in the future, given its features that have been extracted by considering both heterogeneity and temporal dynamics of the underlying network. To this end, we first introduce a feature extraction framework that combines the power of meta-path-based modeling and recurrent neural networks to effectively extract features suitable for relationship prediction regarding heterogeneity and dynamicity of the networks. Next, we propose a supervised non-parametric approach, called Non-Parametric Generalized Linear Model (NP-GLM), which infers the hidden underlying probability distribution of the relationship building time given its features. We then present a learning algorithm to train NP-GLM and an inference method to answer time-related queries. Extensive experiments conducted on synthetic data and three real-world datasets, namely Delicious, MovieLens, and DBLP, demonstrate the effectiveness of NP-GLM in solving the continuous-time relationship prediction problem vis-à-vis competitive baselines.

Feature Selection via Transferring Knowledge Across Different Classes

The problem of feature selection has attracted considerable research interest in recent years. Supervised information can significantly improve the quality of selected features. However, existing supervised feature selection methods all require that the classes in the labeled data (source domain) and unlabeled data (target domain) be identical, which may be too restrictive in many cases. In this paper, we consider a more challenging cross-class setting, in which the classes in the two domains are related but different; this setting has rarely been studied before. We propose a Cross-class Knowledge Transfer Feature Selection (CKTFS) framework that transfers cross-class knowledge from the source domain to guide feature selection in the target domain. Specifically, high-level descriptions, i.e., attributes, serve as the bridge for knowledge transfer. To further improve the quality of the selected features, our framework jointly considers the tasks of cross-class knowledge transfer and feature selection. Experimental results on four benchmark datasets demonstrate the superiority of the proposed method.

Identifying Complements and Substitutes of Products: A Neural Network Framework Based on Product Embedding

Complements and substitutes are two typical product relationships that deserve consideration in online product recommendation. One of the key objectives of recommender systems is to promote cross-selling, which heavily relies on recommending the appropriate type of product in a given scenario. Research on consumer behavior has shown that consumers usually prefer substitutes in the browsing stage and complements in the purchasing stage. Thus, it is of great importance to identify the complementary and substitutable relationships between products. In this paper, we design a neural network-based framework that integrates the text content and additional information of online reviews to mine product relations. For text content, we utilize LDA topic modeling to represent products in a succinct form called an ``embedding''. To capture the semantics of complementary and substitutable relationships, we design a modeling process that transforms the product embeddings into semantic features and incorporates additional non-text factors of product reviews based on empirical observations. Two models are developed: the basic model considers only semantic features, while the multi-input model also incorporates the additional information. Extensive experiments verify the effectiveness of product embedding, semantic modeling, and the proposed framework. Moreover, the improvement of the multi-input model over the basic model increases with the level of data sparsity, validating the value of additional non-text information when dealing with sparse datasets.

Multi-task Crowdsourcing via an Optimization Framework

Unprecedented amounts of data have catalyzed the trend of combining human insights with machine learning techniques, which facilitates the use of crowdsourcing to elicit label information both effectively and efficiently. One crucial challenge in crowdsourcing is the diverse worker quality, which determines the accuracy of the label information the workers provide. Motivated by the observation that the same set of tasks is typically labeled by the same set of workers, we study worker behavior across multiple related tasks and propose an optimization framework for learning from task and worker dual heterogeneity. The proposed method uses a weight tensor to represent the workers' behavior across multiple tasks and seeks the optimal tensor by exploiting its structured information. We then propose an iterative algorithm to solve the optimization problem and analyze its computational complexity. To infer the true label of an example, we construct a worker ensemble based on the estimated tensor, whose decisions are weighted using a set of entropy weights. We also prove that the gradient of the most time-consuming updating block is separable with respect to the workers, which leads to a faster randomized algorithm. Moreover, we extend the learning framework to accommodate the multi-class setting. Finally, we test the performance of our framework on several datasets and demonstrate its superiority over state-of-the-art techniques.

Extracting Drift-aware Actionable Knowledge from Social Networks using Structural Features

In conventional data mining methods, the output is either a description of the input data or a prediction about unseen data. Real-world problems, however, usually require interventions that alter the current data toward a desirable goal. Actionable knowledge discovery is a field of study developed specifically for this purpose, and many of its methods can extract actionable knowledge from social networks. In such networks, due to the dependencies among the underlying input data, extracted actions should be evaluated, since the changes suggested by the actions may not be described by the constructed predictive model. This necessitates refining the model to preserve the quality of the extracted actions. Our goal in this paper is to propose a new action mining method that incorporates an action evaluation process to overcome this problem, focusing specifically on social network data. Such data contains valuable information in the links of the network, where a change in some feature values may result in a chain of changes in others due to the dependencies conveyed by the links. We use a state-of-the-art structural feature extraction method to capture the information of the dependencies inside the network. Our proposed method iteratively updates the structural features, which are incorporated in the action extraction process. In this process, we thoroughly examine the effects of applying the actions by discovering possible changes in the network; we call this phenomenon "change propagation". Our experiments show that our method outperforms state-of-the-art methods in terms of action quality, with comparable efficiency.

Balancing Prediction Errors for Robust Sentiment Classification

Sentiment classification is a popular text mining task in which textual content (e.g., a message) is assigned a polarity label (typically positive or negative) reflecting the sentiment expressed in it. Sentiment classification is used widely in applications such as customer feedback analysis, where robustness and correctness of results are critical. In this paper, we highlight that prediction accuracy alone is not sufficient for assessing the performance of a sentiment classifier; it is also important that the classifier is not biased towards positive or negative polarity, distorting the distribution of positive and negative messages in the predictions. First, we propose a measure, called the Polarity Bias Rate, for quantifying this bias in a sentiment classifier. Second, we present two methods for removing this bias from the predictions of unsupervised and supervised sentiment classifiers. Our first method, called Bias-Aware Thresholding (BAT), shifts the decision boundary to control the bias in the predictions. Motivated by cost-sensitive learning, BAT is easily applicable to both lexicon-based unsupervised and supervised classifiers. Our second method, called Balanced Logistic Regression (BLR), introduces a bias-removing constraint into the standard logistic regression model. BLR is an automatic, bias-free, supervised sentiment classifier. We evaluate our methods extensively on six real-world datasets. The experiments involve two lexicon-based and two supervised sentiment classifiers and include evaluation on multiple train-test data sizes. The results show that bias in the predictions is controlled effectively. Furthermore, prediction accuracy is also increased in many cases, enhancing the robustness of sentiment classification.
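
The threshold-shifting idea behind BAT can be sketched as follows. This is not the paper's definition of BAT, only an illustration of the core mechanism: moving the decision boundary so that a target share of messages is labeled positive instead of using a fixed cutoff such as 0.5 (the `target_pos_rate` input is a hypothetical parameter for the example).

```python
def bias_aware_threshold(scores, target_pos_rate):
    """Pick a decision boundary so that the fraction of messages predicted
    positive matches a target rate, rather than using a fixed cutoff."""
    ranked = sorted(scores, reverse=True)
    n_pos = round(target_pos_rate * len(scores))
    if n_pos <= 0:
        return max(scores) + 1e-9      # nothing is predicted positive
    return ranked[n_pos - 1]           # score of the last positive message

# A classifier biased toward positive scores: a 0.5 cutoff would label
# four of six messages positive; a 0.5 target rate yields exactly three.
scores = [0.9, 0.8, 0.65, 0.6, 0.4, 0.3]
t = bias_aware_threshold(scores, target_pos_rate=0.5)
preds = [s >= t for s in scores]
```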

A Survey of Parallel Sequential Pattern Mining

With the growing popularity of shared resources, large volumes of complex data of different types are collected automatically. Traditional data mining algorithms face problems and challenges including huge memory cost, low processing speed, and inadequate disk space. As a fundamental task of data mining, sequential pattern mining (SPM) is used in a wide variety of real-life applications. However, it is more complex and challenging than other pattern mining tasks, e.g., frequent itemset mining and association rule mining, and it also suffers from the above challenges when handling large-scale data. To solve these problems, mining sequential patterns in a parallel or distributed computing environment has emerged as an important issue with many applications. In this paper, we provide an in-depth survey of the current status of parallel sequential pattern mining (PSPM), including a detailed categorization of traditional serial SPM approaches and of state-of-the-art parallel SPM. We review related work on PSPM in detail, including partition-based, Apriori-based, pattern-growth-based, and hybrid algorithms for PSPM, and provide an in-depth description (i.e., characteristics, advantages, disadvantages, and a summary) of these parallel approaches. Some advanced topics for PSPM, including parallel quantitative/weighted/utility sequential pattern mining, PSPM from uncertain and stream data, and hardware acceleration for PSPM, are further reviewed in detail. In addition, we review some well-known open-source software for PSPM. Finally, we summarize the challenges and opportunities of PSPM in the big data era.

Community Detection in Small Networks: A New Approach to Graph Partitioning

Road Network Construction with Complex Intersections based on Sparsely-Sampled Private Car Trajectory Data

A road network is a critical aspect of both urban planning and route recommendation. This paper proposes an efficient approach to building a fine-grained road network from sparsely-sampled private car trajectory data in a complicated urban environment. To resolve the difficulties introduced by low-sampling-rate trajectory data, we concentrate sample points around intersections by exploiting the turning characteristics of the large-scale trajectory data, ensuring accurate detection of intersections and road segments. For complex road networks, including various overpasses and underpasses, we first divide intersections into major and minor ones, and then propose a simplified representation of intersections and a corresponding computable model based on road features, which significantly improves the accuracy of the detected road networks. To construct fine-grained road networks, we distinguish various types of intersections using direction information and detected turning limits. To the best of our knowledge, ours is the first map-building method to infer road networks from large-scale private car trajectory data. Extensive evaluations are conducted on a real-world trajectory dataset from 1,345 private cars in the Futian district of Shenzhen, China. The results demonstrate the effectiveness of the proposed method. The constructed road map closely matches the publicly edited map OpenStreetMap, especially in the locations of road intersections and road segments: 92.4% of intersections are within 20 meters and 96.3% of road segments are within 12 meters.

On the Impact of Voice Encoding and Transmission on the Predictions of Speaker Warmth and Attractiveness

Modern human-computer interaction systems may not only interpret natural language but also detect speaker interpersonal characteristics in order to determine dialog strategies. This may be of high interest in fields such as telephone marketing or automatic voice-based interactive services. However, when such systems encounter signals transmitted over a communication network instead of clean speech, e.g., in call centers, speaker characterization accuracy might be impaired by the degradations that encoding and transmission introduce into the speech signal. This paper addresses a binary classification of high versus low warm–attractive (WAAT) speakers over different channel and encoding conditions. The ground truth is derived from ratings of clean speech collected in an extensive subjective test. Our results show that the AMR-WB+ codec permits good levels of classification accuracy, comparable to classification with clean, non-degraded speech. This is especially notable for a Random Forest-based classifier, which presents the best performance among the evaluated algorithms. The impact of different packet loss rates has been examined, whereas jitter effects have been found to be negligible.

Tensorizing Restricted Boltzmann Machine

The Restricted Boltzmann Machine (RBM) is a well-known model for feature extraction that can also serve as an initializer for neural networks. When applying the classic RBM to multidimensional data such as 2D/3D tensors, one needs to vectorize the high-order data. Vectorizing leads to very high dimensionality and the loss of valuable spatial information. In particular, the commonly used fully-connected layers require a large amount of memory, making it hard to use such models on low-end devices or with high-order data. In this paper, in order to apply the classic RBM to tensorial data directly, we propose a new tensorial RBM model parametrized by the tensor train decomposition (TTRBM). In this model, both input and hidden variables are in tensorial form, connected by a parameter matrix in tensor train format. The biggest advantage of TTRBM is that it obtains performance comparable to the classic RBM with far fewer model parameters and faster training. Experiments on three real-world applications, face reconstruction, handwritten digit recognition, and image super-resolution, demonstrate the advantages of TTRBM.
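
The parameter saving from a tensor-train parameterization can be seen by simple counting: a dense matrix between flattened tensors costs the product of all mode sizes, while TT format stores one small core per mode. The mode sizes and TT-rank below are illustrative assumptions, not values from the paper.

```python
def full_params(in_modes, out_modes):
    """Parameters in a dense matrix connecting the flattened tensors."""
    total_in = 1
    for m in in_modes:
        total_in *= m
    total_out = 1
    for n in out_modes:
        total_out *= n
    return total_in * total_out

def tt_params(in_modes, out_modes, rank):
    """Parameters in a tensor-train matrix with uniform internal rank.
    Core k has shape (r_{k-1}, m_k, n_k, r_k), with boundary ranks 1."""
    d = len(in_modes)
    ranks = [1] + [rank] * (d - 1) + [1]
    return sum(ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1]
               for k in range(d))

# e.g. a 1024x1024 connection factored into five modes of size 4, TT-rank 4:
dense = full_params([4] * 5, [4] * 5)   # 1,048,576 dense parameters
tt = tt_params([4] * 5, [4] * 5, 4)     # far fewer parameters in TT format
```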

Active Two Phase Collaborative Representation Classifier

The Sparse Representation Classifier (SRC), the Collaborative Representation Classifier (CRC), and the Two Phase Test Sample Sparse Representation (TPTSSR) classifier have been introduced in recent years. All these frameworks are supervised and passive in the sense that they cannot benefit from unlabeled data samples. In this article, inspired by active learning paradigms, we introduce an active collaborative representation classifier that can be used by these frameworks. More precisely, we are interested in the TPTSSR framework due to its good performance and reasonable computational cost. Our proposed Active Two Phase Collaborative Representation Classifier (ATPCRC) starts by predicting the labels of the available unlabeled samples. At the testing stage, two coding processes are carried out separately: one on the set of originally labeled samples and one on the whole set (original and predicted labels). The two types of class-wise reconstruction errors are blended in order to decide the class of any test image. Experiments conducted on four public image datasets show that the proposed ATPCRC can outperform the classic TPTSSR as well as many state-of-the-art methods that exploit labeled and unlabeled data samples.

Bursty Event Detection in Twitter Streams

In recent years, social media have become an invaluable source of information for both public and private organizations seeking to understand people's interests and the onset of new events. Twitter, especially, allows a fast spread of news about events happening in real time, which can contribute to situation awareness during emergencies as well as to understanding the trending topics of a period. This paper proposes an online algorithm that incrementally groups tweet streams into clusters. The approach summarizes the examined tweets into cluster centroids, maintaining a number of textual and temporal features that allow the method to effectively discover groups of interest on particular themes. Experiments on messages posted by users addressing different issues, and a comparison with state-of-the-art approaches, show that the method is capable of detecting discussions on topics of interest, and also of distinguishing bursty events revealed by a sudden spread of attention to messages published by users.
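
The incremental centroid-summary clustering described above can be sketched as follows, using textual features only; the paper's method also maintains temporal features, which are omitted in this simplified illustration.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def incremental_cluster(tweets, threshold=0.5):
    """Assign each incoming tweet to the most similar centroid, or open a
    new cluster; each centroid keeps running term counts as its summary."""
    centroids = []                      # list of Counter term summaries
    assignments = []
    for text in tweets:
        vec = Counter(text.lower().split())
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(vec, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(Counter(vec))
            assignments.append(len(centroids) - 1)
        else:
            centroids[best].update(vec)  # fold the tweet into the summary
            assignments.append(best)
    return assignments, centroids

tweets = ["earthquake hits city", "strong earthquake hits the city",
          "new phone release", "phone release event"]
assignments, centroids = incremental_cluster(tweets)
```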

Leveraging Kernel-Incorporated Matrix Factorization for App Recommendation

The ever-increasing number of smartphone applications (apps) available on different app markets poses a challenge for personalized app recommendation. Conventional collaborative filtering-based recommendation methods suffer from sparse and binary user-app implicit feedback, which results in poor performance in discriminating user-app preferences. In this paper, we first propose two kernel-incorporated probabilistic matrix factorization models, which introduce app-categorical information to constrain the user and app latent features to be similar to their neighbors in the latent space. The two models are solved by Stochastic Gradient Descent (SGD) with a user-oriented negative sampling scheme. To further improve recommendation performance, we construct pseudo user-app ratings based on user-app usage information and propose a novel Kernelized Non-negative Matrix Factorization (KNMF), which incorporates non-negativity constraints on the latent factors to predict user-app preferences. This model also leverages user-user and app-app similarities with regard to app-categorical information to mine the latent geometric structure of the pseudo-rating space. Adopting the Karush-Kuhn-Tucker (KKT) conditions, a Multiplicative Updating Rules (MUR)-based optimization is proposed for model learning, and its convergence is proved by introducing an auxiliary function. Experimental results on a real user-app installation and usage dataset show that our models perform comparably with state-of-the-art baselines in terms of two ranking-oriented evaluation metrics.

Translations Diversification for Expert Finding: A Novel Clustering-based Approach

Expert finding, the task of retrieving and ranking the right people for the subject of a user query, has become an important research field that has attracted the attention of many researchers. The central challenge in expert finding is to determine the strength of the relationship between query words and the documents written by a candidate expert; a related challenge throughout information retrieval is the vocabulary gap between queries and documents. In this study, a translation model based on clustering words in two spaces, a query space and a co-occurrence space, is proposed to overcome this problem. First, words that are semantically close are clustered in the query space, and then each cluster is clustered again in the co-occurrence space. The representatives of each cluster in the co-occurrence space are considered a diverse subset of the parent cluster; in this way, the diversity of translations is guaranteed in the query space. Next, a probabilistic model based on the degree to which a word belongs to a cluster and the similarity of the cluster to the query in the query space is used to address the vocabulary gap. Finally, the translations corresponding to each query are used in conjunction with a blending model for expert finding. Experiments on a Stackoverflow dataset show the efficiency of the proposed methods.
