ACM Transactions on

Knowledge Discovery from Data (TKDD)

Latest Articles

Active Two Phase Collaborative Representation Classifier

The Sparse Representation Classifier, the Collaborative Representation Classifier (CRC), and the Two Phase Test Sample Sparse Representation (TPTSSR)... (more)

Time-Sync Video Tag Extraction Using Semantic Association Graph

Time-sync comments (TSCs) reveal a new way of extracting the online video tags. However, such TSCs have lots of noises due to users’ diverse... (more)


About TKDD 

ACM Transactions on Knowledge Discovery from Data (TKDD) publishes original archival papers in the area of knowledge discovery from data and closely related disciplines.  The majority of the papers that appear in TKDD is expected to address the logical and technical foundation of knowledge discovery and data mining.

Forthcoming Articles
A Distance Measure for the Analysis of Polar Opinion Dynamics in Social Networks

Analysis of opinion dynamics in social networks plays an important role in today's life. For predicting users' political preference, it is particularly important to be able to analyze the dynamics of competing polar opinions, such as pro-Democrat vs. pro-Republican. While observing the evolution of polar opinions in a social network over time, can we tell when the network evolved abnormally? Furthermore, can we predict how the opinions of the users will change in the future? To answer such questions, it is insufficient to study individual user behavior, since opinions can spread beyond users' ego-networks. Instead, we need to consider the opinion dynamics of all users simultaneously and capture the connection between the individuals' behavior and the global evolution pattern of the social network. In this work, we introduce the Social Network Distance (SND)---a distance measure that quantifies the likelihood of evolution of one snapshot of a social network into another snapshot under a chosen model of polar opinion dynamics. SND has a rich semantics of a transportation problem, yet, is computable in time linear in the number of users and, as such, is applicable to large-scale online social networks. In our experiments with synthetic and Twitter data, we demonstrate the utility of our distance measure for anomalous event detection. It achieves a true positive rate of 0.83, twice as high as that of alternatives. When used for opinion prediction in Twitter data, SND's accuracy is 75.6%, which is 7.5% higher than that of the next best method.

Heterogeneous-length Text Topic Modeling for Reader-aware Multi-Document Summarization

Housing Demand Estimation based on Express Delivery Data

Housing demand estimation is an important topic in economic research field. It is beneficial and helpful for various applications including real estate market regulation and urban planning. Therefore, housing demand estimation is crucial for both real estate investors and government administrators. Meanwhile, with rapid development of express industry, abundant useful information embedding in express delivery records are helpful for researchers to profile urban life patterns. Express delivery behaviors of residents in a residential community can reflect housing demand to some extent. Although there have been studies focusing on the demand analysis of housing demand, it can not be estimated very well and is still under explored. To this end, in this paper, we propose a systematic housing demand estimation method based on express delivery data. In this work, we first aggregate the express delivery records at community scale with clustering methods and complete vacant values of express delivery records. Then, we extract various features from a less sparse dataset considering both residential mobility probability and attractiveness of residential communities. In addition, correlations between different districts can influence performances of our inference model, thus we take commonalities and differences of different districts into consideration. With the features and correlations between different districts being obtained, we estimate housing demand with a multi-task learning method based on neural networks. Finally, experimental results on real-world data prove that our model is effective to estimate housing demand at the residential community level.

Maximally Correlated Principle Component Analysis Based on Deep Parameterization Learning

Dimensionality reduction is widely used to deal with high dimensional data. As a famous dimensionality reduction method, principle component analysis (PCA) aiming at finding the low dimension feature of original data has made great successes, and many improved PCA have been proposed. However, most algorithms based on PCA only consider the linear correlation of data features. In this paper, we propose a novel dimensionality reduction model called maximally correlated principle component analysis based on deep parameterization learning (MCPCADP), which takes nonlinear correlation into account by using deep parameterization frame for the purpose of dimensionality reduction. The new model explores nonlinear correlation by maximizing Ky Fan norm of the covariance matrix of nonlinearly mapped data features. A new BP algorithm for model optimization is derived. In order to assess the proposed method, we conduct experiments on both synthetic database and several real-world databases. The experimental results demonstrate that the proposed algorithm is comparable with the several widely used algorithms.

Hybrid Crowd-Machine Wrapper Inference

Wrapper inference deals in generating programs to extract data from Web pages. Several supervised and unsupervised wrapper inference approaches have been proposed in the literature. On the one hand, unsupervised approaches produce erratic wrappers: whenever the sources do not satisfy underlying assumptions of the inference algorithm, their accuracy is compromised. On the other hand, supervised approaches produce accurate wrappers, but since they need training data, their scalability is limited. The recent advent of crowdsourcing platforms has opened new opportunities for supervised approaches, as they make possible the production of large amounts of training data with the support of workers recruited online. Nevertheless, involving human workers has monetary costs. We present an original hybrid crowd-machine wrapper inference system that offers the benefits of both approaches exploiting the cooperation of crowd workers and unsupervised algorithms. Based on a principled probabilistic model that estimates the quality of wrappers, humans workers are recruited only when unsupervised wrapper induction algorithms are not able to produce sufficiently accurate solutions.

Clustering Users by Their Mobility Behavioral Patterns

The immense stream of data from mobile devices during recent years enables one to learn more about human behavior and provide mobile phone users with personalized services. In this work, we identify groups of users who have the same movement patterns. We analyze trajectories of semantic locations to find users who have similar movement behavior or mobility lifestyle, even when they live in different areas. For this task, we propose a new grouping scheme that is called Lifestyle-Based Clustering (LBC). We represent the mobility movement of each user by a Markov model and calculate the Jensen-Shannon distances among pairs of users. The pairwise distances are represented in a similarity matrix, which is used for the clustering. To validate the unsupervised clustering task, we develop an entropy-based clustering measure, namely, an index that measures the homogeneity of mobility patterns within clusters of users. The analysis is validated on a real-world dataset that contains location-movements of 5,000 cellular phone users that were collected over a two-month period.

Continuous-Time Relationship Prediction in Dynamic Heterogeneous Information Networks

Online social networks, World Wide Web, media and technological networks, and other types of so-called information networks are ubiquitous nowadays. These information networks are inherently heterogeneous and dynamic. They are heterogeneous as they consist of multi-typed objects and relations, and they are dynamic as they are constantly evolving over time. One of the challenging issues in such heterogeneous and dynamic environments, is to forecast those relationships in the network that will appear in the future. In this paper, we try to solve the problem of continuous-time relationship prediction in dynamic and heterogeneous information networks. This implies predicting the time it takes for a relationship to appear in the future, given its features that have been extracted by considering both heterogeneity and temporal dynamics of the underlying network. To this end, we first introduce a feature extraction framework that combines the power of meta-path-based modeling and recurrent neural networks to effectively extract features suitable for relationship prediction regarding heterogeneity and dynamicity of the networks. Next, we propose a supervised non-parametric approach, called Non-Parametric Generalized Linear Model (NP-GLM), which infers the hidden underlying probability distribution of the relationship building time given its features. We then present a learning algorithm to train NP-GLM and an inference method to answer time-related queries. Extensive experiments conducted on synthetic data and three real-world datasets, namely Delicious, MovieLens, and DBLP, demonstrate the effectiveness of NP-GLM in solving the continuous-time relationship prediction problem vis-`a-vis competitive baselines.

Fine-Grained Air Quality Inference with Remote Sensing Data and Ubiquitous Urban Data

With the substantial economic growth, air quality has attracted more and more attention these years. Accessing real-time air quality information is inevitably important for both the public and the government. Unfortunately, the limited number of monitoring stations is unable to provide fine-grained air quality information in big cities like Beijing. One cost-effective approach is to infer air quality based on the records from the monitoring stations. However, existing inference techniques can hardly achieve high accuracy due to the severe data sparsity problem (e.g., only 0.2% data are known). We observe that remote sensing has been a high-quality data source. In this paper, we propose to integrate remote sensing data and ubiquitous urban data for air quality inference. There are two main challenges, i.e., data heterogeneity and incompleteness of the remote sensing data. In response to the challenges, we propose a two-stage inference approach. In the first stage, we use the remote sensing data and the meteorological data to infer and predict air quality with Artificial Neural Networks (ANN), which significantly reduces the percentage of empty cells in the air quality tensor. In the second stage, we introduce a tensor decomposition method to infer the complete set of air quality values. We leverage both the spatial features (i.e., road features and POI features) and the temporal features (i.e., meteorological features) as the constraints. A repetitive training framework is proposed to address the data sparsity problem for model training in the first stage. Experiments with real datasets show that our approach has a profound performance advantage over the state-of-the-art methods, such as U-Air.

Community Detection in Small Networks: A New Approach to Graph Partitioning

Self-adaptive Particle Swarm Optimization for Large-scale Feature Selection in Classification

Many evolutionary computation (EC) methods have been used to solve feature selection problems and they perform well on most small-scale feature selection problems. However, as the dimensionality of feature selection problems increases, the solution space increases exponentially. Meanwhile, there are more irrelevant features than relevant features in datasets, which leads to many local optima in the huge solution space. Therefore, the existing EC methods still suffer from the problem of stagnation in local optima on large-scale feature selection problems. Furthermore, large-scale feature selection problems with different datasets may have different properties. Thus, it may be of low performance to solve different large-scale feature selection problems with an existing EC method that has only one candidate solution generation strategy (CSGS). In addition, it is time-consuming to find a suitable EC method and corresponding suitable parameter values for a given large-scale feature selection problem if we want to solve it effectively and efficiently. In this paper, we propose a self-adaptive particle swarm optimization (SaPSO) algorithm for feature selection, particularly for large-scale feature selection. First, an encoding scheme for the feature selection problem is employed in the SaPSO. Second, three important problems relate to self-adaptive algorithms are investigated. After that, the SaPSO algorithm with a typical self-adaptive mechanism is proposed. The experimental results on 12 datasets show that the solution size obtained by the SaPSO algorithm is smaller than its EC counterparts on all datasets. The SaPSO algorithm performs better than its non-EC and EC counterparts in terms of classification accuracy not only on most training sets but also on most test sets. Furthermore, as the dimensionality of the feature selection problem increases, the advantages of SaPSO become more prominent. This highlights that the SaPSO algorithm is suitable for solving feature selection problems, particularly large-scale feature selection problems.

Probabilistic Mixture Model for Mapping the Underground Pipes

Buried pipes beneath our city are blood vessels that feed human civilization through the supply of water, gas, electricity, etc, and mapping the buried pipes has long been addressed as an issue. In this paper, a suitable coordinate of the detected area is established, the noisy GPR and GPS data are analyzed and normalized, and the pipeline is described mathematically. Based on these, the Probabilistic Mixture Model is proposed to map the buried pipes, which takes discrete noisy GPR and GPS data as the input and the accurate pipe locations and directions as the output. The proposed model consists of the Preprocessing, the Pipe Fitting algorithm, the Classification Fitting Expectation Maximization (CFEM) algorithm, and the Angle-limited Hough (Al-Hough) transform. The direction information of the detecting point is added into the measuring of the distance from a detecting point to nearby pipelines to handle some areas where the pipes are intersected or difficult to classify, the Expectation Maximization (EM) algorithm is upgraded to CFEM algorithm that is able to classify detecting points into different classes, and connect and fit multiple detecting points in each class to get accurate pipe locations and directions, and the Al-Hough transform provides reliable initializations for CFEM, to some extent, ensuring the convergence of the algorithm. The experimental results on the simulated and real-world datasets demonstrate the effectiveness of the proposed model.

On the Impact of Voice Encoding and Transmission on the Predictions of Speaker Warmth and Attractiveness

Modern human-computer interaction systems may not only be based on interpreting natural language but also on detecting speaker interpersonal characteristics in order to determine dialog strategies. This may be of high interest in different fields such as telephone marketing or automatic voice-based interactive services. However, when such systems encounter signals transmitted over a communication network instead of clean speech, e.g., in call centers, the speaker characterization accuracy might be impaired by the degradations caused in the speech signal by the encoding and communication processes. This paper addresses a binary classification of high versus low warm--attractive (WAAT) speakers over different channel and encoding conditions. The ground truth is derived from ratings given to clean speech extracted from an extensive subjective test. Our results show that the AMR-WB+ codec permits good levels of classification accuracy, comparable to the classification with clean, non-degraded speech. This is especially notable for the case of a Random Forest-based classifier, which presents the best performance among the set of evaluated algorithms. The impact of different packet loss rates have been examined, whereas jitter effects have been found to be negligible.

Krylov Subspace Approximation for Local Community Detection in Large Networks

Community detection is an important information mining task to uncover modular structures in large networks. For increasingly common large network data sets, global community detection is prohibitively expensive, and attention has shifted to methods that mine local communities, i.e. identifying all latent members of a particular community from a few labeled seed members. To address such semi-supervised mining task, we systematically develop a local spectral subspace-based community detection method, called LOSP. We define a family of local spectral subspaces based on Krylov subspaces, and seek a sparse indicator for the target community via an ?1 norm minimization over the Krylov subspace. Variants of LOSP depend on type of random walks with different diffusion speeds, type of random walks, dimension of the local spectral subspace and step of diffusions. The effectiveness of the proposed LOSP approach is theoretically analyzed based on Rayleigh quotients, and it is experimentally verified on a wide variety of real-world networks across social, production and biological domains, as well as on an extensive set of synthetic LFR benchmark datasets.

Bayesian Model Selection Approach to Multiple Change-points Detection with Non-Local Prior Distributions

We propose a Bayesian model selection (BMS) boundary detection procedure using non-local prior distributions for a sequence of data with multiple systematic mean changes. By using the non-local priors in the Bayesian model selection framework, the BMS method can effectively suppress the non-boundary spike points with large instantaneous changes. Further, we speed up the algorithm by reducing the multiple change points to a series of single change point detection problems. We establish the consistency of the estimated number and locations of the change points under various prior distributions. From both theoretical and numerical perspectives, we show that the non-local inverse moment prior leads to the fastest convergence rate in identifying the true change points on the boundaries. Extensive simulation studies are conducted to compare the BMS with existing methods, and our method is illustrated with application to the magnetic resonance imaging guided radiation therapy data.

Bursty Event Detection in Twitter Streams

Social media, in recent years, have become an invaluable source of information for both public and private organizations to enhance the comprehension of people interests and the onset of new events. Twitter, especially, allows a fast spread of news and events happening real-time that can contribute to situation awareness during emergency situations, but also to understand trending topics of a period. The paper proposes an online algorithm that incrementally groups tweet streams into clusters. The approach summarizes the examined tweets into the cluster centroid by maintaining a number of textual and temporal features that allow the method to effectively discover groups of interest on particular themes. Experiments on messages posted by users addressing different issues, and a comparison with state-of-the-art approaches show that the method is capable to detect discussions regarding topics of interest, but also to distinguish bursty events revealed by a sudden spreading of attention on messages published by users.

Density-friendly Graph Decomposition

Decomposing a graph into a hierarchical structure via k-core analysis is a standard operation in any modern graph-mining toolkit. k-core decomposition is a simple and efficient method that allows to analyze a graph beyond its mere degree distribution. More specifically, it is used to identify areas in the graph of increasing centrality and connectedness, and it allows to reveal the structural organization of the graph. Despite the fact that k-core analysis relies on vertex degrees, k-cores do not satisfy a certain, rather natural, density property. Simply put, the most central k-core is not necessarily the densest subgraph. This inconsistency between k-cores and graph density provides the basis of our study. We start by defining what it means for a subgraph to be locally-dense, and we show that our definition entails a nested chain decomposition of the graph, similar to the one given by k-cores, but in this case the components are arranged in order of increasing density. We show that such a locally-dense decomposition for a graph G=(V,E) can be computed in polynomial time. The running time of the exact decomposition algorithm is O(|V|2|E|) but is significantly faster in practice. In addition, we develop a linear-time algorithm that provides a factor-2 approximation to the optimal locally-dense decomposition. Furthermore, we show that the k-core decomposition is also a factor-2 approximation, however, as demonstrated by our experimental evaluation, in practice k-cores have different structure than locally-dense subgraphs, and as predicted by the theory, k-cores are not always well-aligned with graph density.

Computing top-k Closeness Centrality Faster in Unweighted Graphs

Given a connected graph G = (V, E), the closeness centrality of a vertex v is defined as n1/(£wV d(v,w)) . This measure is widely used in the analysis of real-world complex networks, and the problem of selecting the k most central vertices has been deeply analysed in the last decade. However, this problem is computationally not easy, especially for large networks: in the first part of the paper, we prove that it is not solvable in time O(|E|2µ) on directed graphs, for any constant µ > 0, under reasonable complexity assumptions. Furthermore, we propose a new algorithm for selecting the k most central nodes in a graph: we experimentally show that this algorithm improves significantly both the textbook algorithm, which is based on computing the distance between all pairs of vertices, and the state of the art. For example, we are able to compute the top k nodes in few dozens of seconds in real-world networks with millions of nodes and edges. Finally, as a case study, we compute the 10 most central actors in the IMDB collaboration network, where two actors are linked if they played together in a movie, and in the Wikipedia citation network, which contains a directed edge from a page p to a page q if p contains a link to q.

All ACM Journals | See Full Journal Index

Search TKDD
enter search term and/or author name