ACM Transactions on

Knowledge Discovery from Data (TKDD)

Latest Articles

Active Two Phase Collaborative Representation Classifier

The Sparse Representation Classifier, the Collaborative Representation Classifier (CRC), and the Two Phase Test Sample Sparse Representation (TPTSSR)... (more)

Time-Sync Video Tag Extraction Using Semantic Association Graph

Time-sync comments (TSCs) reveal a new way of extracting the online video tags. However, such TSCs have lots of noises due to users’ diverse... (more)

Maximally Correlated Principal Component Analysis Based on Deep Parameterization Learning

Dimensionality reduction is widely used to deal with high-dimensional data. As a famous dimensionality reduction method, principal component analysis... (more)

On the Impact of Voice Encoding and Transmission on the Predictions of Speaker Warmth and Attractiveness

Modern human-computer interaction systems may not only be based on interpreting natural language but... (more)

Housing Demand Estimation Based on Express Delivery Data

Housing demand estimation is an important topic in the field of economic research. It is beneficial and helpful for various applications including... (more)

Continuous-Time Relationship Prediction in Dynamic Heterogeneous Information Networks

Online social networks, World Wide Web, media, and technological networks, and other types of so-called information networks are ubiquitous nowadays.... (more)


About TKDD 

ACM Transactions on Knowledge Discovery from Data (TKDD) publishes original archival papers in the area of knowledge discovery from data and closely related disciplines.  The majority of the papers that appear in TKDD is expected to address the logical and technical foundation of knowledge discovery and data mining.

Forthcoming Articles
Hybrid Crowd-Machine Wrapper Inference

Wrapper inference deals in generating programs to extract data from Web pages. Several supervised and unsupervised wrapper inference approaches have been proposed in the literature. On the one hand, unsupervised approaches produce erratic wrappers: whenever the sources do not satisfy underlying assumptions of the inference algorithm, their accuracy is compromised. On the other hand, supervised approaches produce accurate wrappers, but since they need training data, their scalability is limited. The recent advent of crowdsourcing platforms has opened new opportunities for supervised approaches, as they make possible the production of large amounts of training data with the support of workers recruited online. Nevertheless, involving human workers has monetary costs. We present an original hybrid crowd-machine wrapper inference system that offers the benefits of both approaches exploiting the cooperation of crowd workers and unsupervised algorithms. Based on a principled probabilistic model that estimates the quality of wrappers, humans workers are recruited only when unsupervised wrapper induction algorithms are not able to produce sufficiently accurate solutions.

Fine-Grained Air Quality Inference with Remote Sensing Data and Ubiquitous Urban Data

With the substantial economic growth, air quality has attracted more and more attention these years. Accessing real-time air quality information is inevitably important for both the public and the government. Unfortunately, the limited number of monitoring stations is unable to provide fine-grained air quality information in big cities like Beijing. One cost-effective approach is to infer air quality based on the records from the monitoring stations. However, existing inference techniques can hardly achieve high accuracy due to the severe data sparsity problem (e.g., only 0.2% data are known). We observe that remote sensing has been a high-quality data source. In this paper, we propose to integrate remote sensing data and ubiquitous urban data for air quality inference. There are two main challenges, i.e., data heterogeneity and incompleteness of the remote sensing data. In response to the challenges, we propose a two-stage inference approach. In the first stage, we use the remote sensing data and the meteorological data to infer and predict air quality with Artificial Neural Networks (ANN), which significantly reduces the percentage of empty cells in the air quality tensor. In the second stage, we introduce a tensor decomposition method to infer the complete set of air quality values. We leverage both the spatial features (i.e., road features and POI features) and the temporal features (i.e., meteorological features) as the constraints. A repetitive training framework is proposed to address the data sparsity problem for model training in the first stage. Experiments with real datasets show that our approach has a profound performance advantage over the state-of-the-art methods, such as U-Air.

A Unified Framework with Multi-source Data for Predicting Passenger Demands of Ride Services

Ride-hailing applications have been offering convenient ride services for people in need. However, such applications still suffer from the issue of supply-demand disequilibrium, which is a typical problem for traditional taxi services. Effective predictions on passenger demands will alleviate the disequilibrium by avoiding dispatching cars to zero-demand areas and facilitate dynamic pricing. Existing studies of demand predictions mainly utilize a single model instead of a combination of several models, which is only based on trajectory data or orders of ride services or both of them. Meanwhile, they simply apply fixed-size grids to partition space and overlook a lot of geo-spatial information. In this paper, we present a unified framework with a new combined model for the demand prediction, using heterogeneous data and a road network-based method for the area partition, supported by additional geo-spatial information. We analyze and evaluate the performance of our combined model using the actual operational data from UCAR. The experimental results indicate that our model outperforms six other baselines by over 39% in Mean Absolute Error (MAE) and 59% in Root Mean Square Error (RMSE) on average.

Real-time Estimation of the Urban Air Quality with Mobile Sensor System

Recently, real-time air quality estimation has attracted more and more attention from all over the world, which is close to our daily life. With the prevalence of mobile sensors, there is an emerging way to monitor the air quality with mobile sensors on vehicles. Compared with traditional expensive monitor stations, mobile sensors are cheaper and more abundant, but observations from these sensors have unstable spatial and temporal distributions, which results in the existing model could not work very well on this type of data. In this paper, taking advantage of air quality data from mobile sensors, we propose an real-time urban air quality estimation method based on the Gaussian Process Regression for air pollution of the unmonitored areas, pivoting on the diffusion effect and the accumulation effect of air pollution. In order to meet the real-time demands, we propose a two-layer ensemble learning framework and a self-adaptivity mechanism to improve computational efficiency and adaptivity. We evaluate our model with real data from mobile sensor system located in Beijing, China. And the experiments show that our proposed model is superior to the state-of-art spatial regression methods in both precision and time performances.

Community Detection in Small Networks: A New Approach to Graph Partitioning

Towards an optimal outdoor advertising placement : when a budget constraint meets moving trajectories

In this paper we propose and study the problem of trajectory-driven influential billboard placement: given a set of billboards U (each with a location and a cost), a database of trajectories T and a budget L, find a set of billboards within the budget to influence the largest number of trajectories. One core challenge is to identify and reduce the overlap of the influence from different billboards to the same trajectories, while keeping the budget constraint into consideration. We show that this problem is NP-hard and present an enumeration based algorithm with (1 ? 1/e) approximation ratio. However, the enumeration would be very costly when |U | is large. By exploiting the locality property of billboards? influence, we propose a partition-based framework PartSel. PartSel partitions U into a set of small clusters, computes the locally influential billboards for each cluster, and merges them to generate the global solution. Since the local solutions can be obtained much more efficiently than the global one, PartSel would reduce the computation cost greatly; meanwhile it achieves a non-trivial approximation ratio guarantee. Then we propose a LazyProbe method to further prune billboards with low marginal influence, while achieving the same approximation ratio as PartSel. Next, we propose a branch-and -bound method to eliminate unnecessary enumerations in both PartSel and LazyProbe, as well as an aggregated index to speed up the computation of marginal influence. Experiments on real datasets verify the efficiency and effectiveness of our methods.

Self-adaptive Particle Swarm Optimization for Large-scale Feature Selection in Classification

Many evolutionary computation (EC) methods have been used to solve feature selection problems and they perform well on most small-scale feature selection problems. However, as the dimensionality of feature selection problems increases, the solution space increases exponentially. Meanwhile, there are more irrelevant features than relevant features in datasets, which leads to many local optima in the huge solution space. Therefore, the existing EC methods still suffer from the problem of stagnation in local optima on large-scale feature selection problems. Furthermore, large-scale feature selection problems with different datasets may have different properties. Thus, it may be of low performance to solve different large-scale feature selection problems with an existing EC method that has only one candidate solution generation strategy (CSGS). In addition, it is time-consuming to find a suitable EC method and corresponding suitable parameter values for a given large-scale feature selection problem if we want to solve it effectively and efficiently. In this paper, we propose a self-adaptive particle swarm optimization (SaPSO) algorithm for feature selection, particularly for large-scale feature selection. First, an encoding scheme for the feature selection problem is employed in the SaPSO. Second, three important problems relate to self-adaptive algorithms are investigated. After that, the SaPSO algorithm with a typical self-adaptive mechanism is proposed. The experimental results on 12 datasets show that the solution size obtained by the SaPSO algorithm is smaller than its EC counterparts on all datasets. The SaPSO algorithm performs better than its non-EC and EC counterparts in terms of classification accuracy not only on most training sets but also on most test sets. Furthermore, as the dimensionality of the feature selection problem increases, the advantages of SaPSO become more prominent. This highlights that the SaPSO algorithm is suitable for solving feature selection problems, particularly large-scale feature selection problems.

Probabilistic Mixture Model for Mapping the Underground Pipes

Buried pipes beneath our city are blood vessels that feed human civilization through the supply of water, gas, electricity, etc, and mapping the buried pipes has long been addressed as an issue. In this paper, a suitable coordinate of the detected area is established, the noisy GPR and GPS data are analyzed and normalized, and the pipeline is described mathematically. Based on these, the Probabilistic Mixture Model is proposed to map the buried pipes, which takes discrete noisy GPR and GPS data as the input and the accurate pipe locations and directions as the output. The proposed model consists of the Preprocessing, the Pipe Fitting algorithm, the Classification Fitting Expectation Maximization (CFEM) algorithm, and the Angle-limited Hough (Al-Hough) transform. The direction information of the detecting point is added into the measuring of the distance from a detecting point to nearby pipelines to handle some areas where the pipes are intersected or difficult to classify, the Expectation Maximization (EM) algorithm is upgraded to CFEM algorithm that is able to classify detecting points into different classes, and connect and fit multiple detecting points in each class to get accurate pipe locations and directions, and the Al-Hough transform provides reliable initializations for CFEM, to some extent, ensuring the convergence of the algorithm. The experimental results on the simulated and real-world datasets demonstrate the effectiveness of the proposed model.

Aspect Aware Learning for Aspect Category Sentiment Analysis

Aspect category sentiment analysis (ACSA) is an underexploited subtask in aspect level sentiment analysis. It aims to identify the sentiment of predefined aspect categories. The main challenge in ACSA comes from the fact that the aspect category may not occur in the sentence in most of the cases. For example, the review ``\emph{they have delicious sandwiches}'' positively talks about the aspect category ``\emph{food}'' in an implicit manner. In this paper, we propose a novel aspect aware learning framework for ACSA tasks. Our key idea is to exploit the interaction between the aspect category and the contents under the guidance of both sentiment polarity and predefined categories. To this end, we design a two-way memory network for integrating aspect aware learning (AAL) into the framework of sentiment classification. We further present two algorithms to incorporate the potential impacts of aspect categories. One is to capture the correlations between aspect terms and the aspect category like \emph{``sandwiches''} and \emph{``food''}. The other is to recognize the aspect category for sentiment representations like \emph{``food''} for \emph{``delicious''}. We conduct extensive experiments on two SemEval datasets. The results reveal the essential role of AAL in aspect category sentiment analysis by achieving the state-of-the-art performance.

Krylov Subspace Approximation for Local Community Detection in Large Networks

Community detection is an important information mining task to uncover modular structures in large networks. For increasingly common large network data sets, global community detection is prohibitively expensive, and attention has shifted to methods that mine local communities, i.e. identifying all latent members of a particular community from a few labeled seed members. To address such semi-supervised mining task, we systematically develop a local spectral subspace-based community detection method, called LOSP. We define a family of local spectral subspaces based on Krylov subspaces, and seek a sparse indicator for the target community via an ?1 norm minimization over the Krylov subspace. Variants of LOSP depend on type of random walks with different diffusion speeds, type of random walks, dimension of the local spectral subspace and step of diffusions. The effectiveness of the proposed LOSP approach is theoretically analyzed based on Rayleigh quotients, and it is experimentally verified on a wide variety of real-world networks across social, production and biological domains, as well as on an extensive set of synthetic LFR benchmark datasets.

Bayesian Model Selection Approach to Multiple Change-points Detection with Non-Local Prior Distributions

We propose a Bayesian model selection (BMS) boundary detection procedure using non-local prior distributions for a sequence of data with multiple systematic mean changes. By using the non-local priors in the Bayesian model selection framework, the BMS method can effectively suppress the non-boundary spike points with large instantaneous changes. Further, we speed up the algorithm by reducing the multiple change points to a series of single change point detection problems. We establish the consistency of the estimated number and locations of the change points under various prior distributions. From both theoretical and numerical perspectives, we show that the non-local inverse moment prior leads to the fastest convergence rate in identifying the true change points on the boundaries. Extensive simulation studies are conducted to compare the BMS with existing methods, and our method is illustrated with application to the magnetic resonance imaging guided radiation therapy data.

Density-friendly Graph Decomposition

Decomposing a graph into a hierarchical structure via k-core analysis is a standard operation in any modern graph-mining toolkit. k-core decomposition is a simple and efficient method that allows to analyze a graph beyond its mere degree distribution. More specifically, it is used to identify areas in the graph of increasing centrality and connectedness, and it allows to reveal the structural organization of the graph. Despite the fact that k-core analysis relies on vertex degrees, k-cores do not satisfy a certain, rather natural, density property. Simply put, the most central k-core is not necessarily the densest subgraph. This inconsistency between k-cores and graph density provides the basis of our study. We start by defining what it means for a subgraph to be locally-dense, and we show that our definition entails a nested chain decomposition of the graph, similar to the one given by k-cores, but in this case the components are arranged in order of increasing density. We show that such a locally-dense decomposition for a graph G=(V,E) can be computed in polynomial time. The running time of the exact decomposition algorithm is O(|V|2|E|) but is significantly faster in practice. In addition, we develop a linear-time algorithm that provides a factor-2 approximation to the optimal locally-dense decomposition. Furthermore, we show that the k-core decomposition is also a factor-2 approximation, however, as demonstrated by our experimental evaluation, in practice k-cores have different structure than locally-dense subgraphs, and as predicted by the theory, k-cores are not always well-aligned with graph density.

Computing top-k Closeness Centrality Faster in Unweighted Graphs

Given a connected graph G = (V, E), the closeness centrality of a vertex v is defined as n1/(£wV d(v,w)) . This measure is widely used in the analysis of real-world complex networks, and the problem of selecting the k most central vertices has been deeply analysed in the last decade. However, this problem is computationally not easy, especially for large networks: in the first part of the paper, we prove that it is not solvable in time O(|E|2µ) on directed graphs, for any constant µ > 0, under reasonable complexity assumptions. Furthermore, we propose a new algorithm for selecting the k most central nodes in a graph: we experimentally show that this algorithm improves significantly both the textbook algorithm, which is based on computing the distance between all pairs of vertices, and the state of the art. For example, we are able to compute the top k nodes in few dozens of seconds in real-world networks with millions of nodes and edges. Finally, as a case study, we compute the 10 most central actors in the IMDB collaboration network, where two actors are linked if they played together in a movie, and in the Wikipedia citation network, which contains a directed edge from a page p to a page q if p contains a link to q.

All ACM Journals | See Full Journal Index

Search TKDD
enter search term and/or author name