
ACM Transactions on Knowledge Discovery from Data (TKDD)

Latest Articles

Active Two Phase Collaborative Representation Classifier

The Sparse Representation Classifier, the Collaborative Representation Classifier (CRC), and the Two Phase Test Sample Sparse Representation (TPTSSR)...

Time-Sync Video Tag Extraction Using Semantic Association Graph

Time-sync comments (TSCs) reveal a new way of extracting online video tags. However, such TSCs contain a lot of noise due to users' diverse...

A Distance Measure for the Analysis of Polar Opinion Dynamics in Social Networks

Analysis of opinion dynamics in social networks plays an important role in modern life. For predicting users' political preference,...

Maximally Correlated Principal Component Analysis Based on Deep Parameterization Learning

Dimensionality reduction is widely used to deal with high-dimensional data. As a famous dimensionality reduction method, principal component analysis...

On the Impact of Voice Encoding and Transmission on the Predictions of Speaker Warmth and Attractiveness

Modern human-computer interaction systems may not only be based on interpreting natural language but...

Bursty Event Detection in Twitter Streams

Social media have in recent years become an invaluable source of information for both public and private organizations seeking to enhance their comprehension of people's interests and the onset of new events. Twitter, especially, allows a fast spread of news and events happening in real time, which can contribute to situation awareness during emergencies...

Heterogeneous-Length Text Topic Modeling for Reader-Aware Multi-Document Summarization

More and more user comments, such as tweets, are available, and they often contain user concerns. In order to meet the demands of users, a good summary...

Housing Demand Estimation Based on Express Delivery Data

Housing demand estimation is an important topic in the field of economic research. It is beneficial for various applications including...

Continuous-Time Relationship Prediction in Dynamic Heterogeneous Information Networks

Online social networks, the World Wide Web, media and technological networks, and other types of so-called information networks are ubiquitous nowadays...

Clustering Users by Their Mobility Behavioral Patterns

The immense stream of data from mobile devices during recent years enables one to learn more about human behavior and provide mobile phone users with...


About TKDD 

ACM Transactions on Knowledge Discovery from Data (TKDD) publishes original archival papers in the area of knowledge discovery from data and closely related disciplines. The majority of the papers that appear in TKDD are expected to address the logical and technical foundations of knowledge discovery and data mining.

Forthcoming Articles
CT LIS: Learning Influences and Susceptibilities through Temporal Behaviors

Social network users influence each other through their temporal behaviors, but how can the influence between users be quantified? Previous work directly defines an independent model parameter to capture the interpersonal influence between each pair of users. These models therefore need a parameter for every pair of users, and the resulting high-dimensional models easily fall into the overfitting problem. Moreover, such models do not consider how influences depend on each other when they are sent from the same user or received by the same user. We therefore propose a model that defines, for every user, a latent influence vector and a susceptibility vector, instead of defining influences on user pairs. Such low-dimensional representations naturally couple the interpersonal influences that involve the same user, thus reducing the model's complexity. Additionally, the model can easily incorporate the temporal information and sentiment polarities of users' messages. Finally, we conduct extensive experiments on real microblog data, showing that our model with such representations achieves better performance than the state-of-the-art and pairwise models, and that learning influences on sentiments further benefits performance.
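As a rough illustration of the low-dimensional parameterization described above, here is a minimal sketch. It assumes the pairwise influence is modeled as the inner product of the sender's influence vector and the receiver's susceptibility vector; the paper's actual estimator also incorporates temporal and sentiment information.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, dim = 100, 8                             # hypothetical sizes
influence = rng.normal(size=(n_users, dim))       # latent influence vector per user
susceptibility = rng.normal(size=(n_users, dim))  # latent susceptibility vector per user

def pairwise_influence(u, v):
    # influence exerted by user u on user v: inner product of u's influence
    # vector and v's susceptibility vector (assumed form, for illustration only)
    return influence[u] @ susceptibility[v]

# 2 * n_users * dim parameters instead of n_users**2 pairwise parameters,
# and all influences involving the same user are coupled through its vectors
print(pairwise_influence(3, 7))
```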

Hybrid Crowd-Machine Wrapper Inference

Wrapper inference deals with generating programs to extract data from Web pages. Several supervised and unsupervised wrapper inference approaches have been proposed in the literature. On the one hand, unsupervised approaches produce erratic wrappers: whenever the sources do not satisfy the underlying assumptions of the inference algorithm, their accuracy is compromised. On the other hand, supervised approaches produce accurate wrappers, but since they need training data, their scalability is limited. The recent advent of crowdsourcing platforms has opened new opportunities for supervised approaches, as they make it possible to produce large amounts of training data with the support of workers recruited online. Nevertheless, involving human workers has monetary costs. We present an original hybrid crowd-machine wrapper inference system that offers the benefits of both approaches by exploiting the cooperation of crowd workers and unsupervised algorithms. Based on a principled probabilistic model that estimates the quality of wrappers, human workers are recruited only when unsupervised wrapper induction algorithms are not able to produce sufficiently accurate solutions.

Multi-User Mobile Sequential Recommendation for Route Optimization

We enhance the original mobile sequential recommendation (MSR) model to address several key issues by introducing three new formulations: MMSR, MMSR-m, and MMSR-d. The MMSR model finds optimal routes for multiple users at different locations while disallowing overlapping recommended routes. To enrich the properties of the pick-up points, which originally include only locations and probabilities, we add pick-up capacities to their attributes: MMSR-m assigns the same pick-up capacity to every location, while MMSR-d allows the capacity to vary across locations. The models are solved numerically by parallel simulated annealing. The results confirm the superiority of our models and solution methods over several published benchmarks, especially in the more demanding high-dimensional settings. Our proposed push-point method further improves the parallel algorithms for MMSR-m and MMSR-d, allowing them to address more realistic city-level problems.
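The abstract does not spell out the annealing details; the following is a minimal, generic simulated-annealing skeleton of the kind used to search route assignments, with hypothetical `neighbor` and `score` callbacks supplied by the caller.

```python
import math
import random

def simulated_annealing(initial, neighbor, score, t0=1.0, cooling=0.995, steps=10_000):
    # generic maximization skeleton: `neighbor` proposes a perturbed set of
    # routes, `score` evaluates it; worse moves are accepted with a
    # probability that shrinks as the temperature cools
    state, best = initial, initial
    t = t0
    for _ in range(steps):
        cand = neighbor(state)
        delta = score(cand) - score(state)
        if delta > 0 or random.random() < math.exp(delta / t):
            state = cand
            if score(state) > score(best):
                best = state
        t *= cooling
    return best
```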

CFOF: A Concentration Free Measure for Anomaly Detection

We present a novel notion of outlier, called the Concentration Free Outlier Factor (CFOF). As a main contribution, we formalize the notion of concentration of outlier scores and theoretically prove that CFOF does not concentrate in Euclidean spaces of arbitrarily large dimensionality. To the best of our knowledge, no other data analysis measure related to the Euclidean distance comes with theoretical evidence of immunity to the concentration effect. We determine the closed form of the distribution of CFOF scores in arbitrarily large dimensionalities and show that the CFOF score of a point depends on its squared-norm standard score and on the kurtosis of the data distribution, thus providing a clear and statistically founded characterization of this notion. Moreover, we leverage this closed form to provide evidence that the definition does not suffer from the hubness problem affecting other measures in high dimensions. We prove that the number of CFOF outliers coming from each cluster is proportional to cluster size and kurtosis, a property that we call semi-locality. We leverage these theoretical findings to shed light on properties of well-known outlier scores. Indeed, we determine that semi-locality characterizes existing reverse nearest neighbor-based outlier definitions, thus clarifying the exact nature of their observed local behavior. We also formally prove that classical distance-based and density-based outliers concentrate both for bounded and unbounded sample sizes, and for fixed and variable values of the neighborhood parameter. Finally, we introduce the fast-CFOF algorithm for detecting outliers in large high-dimensional datasets. The algorithm has linear cost, supports multi-resolution analysis, and is embarrassingly parallel. Experiments highlight that the technique efficiently processes huge datasets, deals even with large values of the neighborhood parameter, avoids concentration, and obtains excellent accuracy.
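For intuition, here is a brute-force sketch of the score, assuming the "hard" definition CFOF(x) = min{k/n : x belongs to the k-nearest-neighbor sets of at least rho*n points}; the paper's fast-CFOF algorithm avoids this quadratic computation.

```python
import numpy as np

def cfof_scores(X, rho=0.5):
    # X: (n, d) data matrix; returns one score in (0, 1] per point
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # order[r, j] = index of the (r+1)-th nearest point to point j
    order = np.argsort(D, axis=0)
    ranks = np.empty((n, n), dtype=int)
    ranks[order, np.arange(n)] = np.arange(1, n + 1)[:, None]
    # CFOF(i): smallest k (as a fraction of n) such that point i lies within
    # the k nearest neighbors of at least rho * n points
    k_needed = np.sort(ranks, axis=1)[:, int(np.ceil(rho * n)) - 1]
    return k_needed / n

scores = cfof_scores(np.random.default_rng(0).normal(size=(200, 10)))
```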

Mining Rank Data

The problem of frequent pattern mining has been studied quite extensively for various types of data, including sets, sequences, and graphs. Somewhat surprisingly, another important type of data, namely rank data, has received very little attention in data mining so far. In this paper, we therefore address the problem of mining rank data, that is, data in the form of rankings (total orders) of an underlying set of items. More specifically, two types of patterns are considered, namely frequent rankings and dependencies between such rankings in the form of association rules. Algorithms for mining frequent rankings and frequent closed rankings are proposed and tested experimentally, using both synthetic and real data.
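As a toy illustration of the pattern types involved, the sketch below mines frequent pairwise preferences, the simplest kind of subranking, from a set of total orders. This is a deliberate simplification of the paper's frequent-ranking algorithms, using hypothetical data.

```python
from collections import Counter
from itertools import combinations

# rankings as tuples of items, most-preferred first (hypothetical data)
rankings = [("a", "b", "c"), ("a", "c", "b"), ("b", "a", "c"), ("a", "b", "c")]

def frequent_pairwise_preferences(rankings, minsup=0.5):
    # count how often item x is ranked above item y; the preference x > y is
    # "frequent" if it holds in at least a minsup fraction of all rankings
    counts = Counter()
    for r in rankings:
        for x, y in combinations(r, 2):   # x precedes y in ranking r
            counts[(x, y)] += 1
    n = len(rankings)
    return {p: c / n for p, c in counts.items() if c / n >= minsup}

print(frequent_pairwise_preferences(rankings))  # e.g., ('a', 'b') with support 0.75
```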

Fine-Grained Air Quality Inference with Remote Sensing Data and Ubiquitous Urban Data

With substantial economic growth, air quality has attracted growing attention in recent years. Access to real-time air quality information is vitally important for both the public and the government. Unfortunately, the limited number of monitoring stations is unable to provide fine-grained air quality information in big cities like Beijing. One cost-effective approach is to infer air quality based on the records from the monitoring stations. However, existing inference techniques can hardly achieve high accuracy due to the severe data sparsity problem (e.g., only 0.2% of the data are known). We observe that remote sensing has become a high-quality data source. In this paper, we propose to integrate remote sensing data and ubiquitous urban data for air quality inference. There are two main challenges: data heterogeneity and the incompleteness of the remote sensing data. In response, we propose a two-stage inference approach. In the first stage, we use the remote sensing data and the meteorological data to infer and predict air quality with Artificial Neural Networks (ANNs), which significantly reduces the percentage of empty cells in the air quality tensor. In the second stage, we introduce a tensor decomposition method to infer the complete set of air quality values. We leverage both the spatial features (i.e., road features and POI features) and the temporal features (i.e., meteorological features) as constraints. A repetitive training framework is proposed to address the data sparsity problem for model training in the first stage. Experiments with real datasets show that our approach has a pronounced performance advantage over state-of-the-art methods such as U-Air.

A Unified Framework with Multi-source Data for Predicting Passenger Demands of Ride Services

Ride-hailing applications have been offering convenient ride services for people in need. However, such applications still suffer from supply-demand disequilibrium, a typical problem for traditional taxi services as well. Effective prediction of passenger demands alleviates the disequilibrium by avoiding dispatching cars to zero-demand areas, and it facilitates dynamic pricing. Existing studies of demand prediction mainly utilize a single model, based on trajectory data, ride-service orders, or both, rather than a combination of several models. Meanwhile, they simply apply fixed-size grids to partition space and thereby overlook a lot of geo-spatial information. In this paper, we present a unified framework with a new combined model for demand prediction, using heterogeneous data and a road network-based method for area partition, supported by additional geo-spatial information. We analyze and evaluate the performance of our combined model using actual operational data from UCAR. The experimental results indicate that our model outperforms six other baselines by over 39% in Mean Absolute Error (MAE) and 59% in Root Mean Square Error (RMSE) on average.
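For reference, the two reported error measures are the standard ones; with \(\hat{y}_i\) the predicted and \(y_i\) the observed demand over \(n\) evaluation cells:

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}.
```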

In Search of a Stochastic Model for the E-news Reader

E-news readers have at their disposal an increasingly large set of news articles to read. Online newspaper sites use recommendation systems to predict and offer relevant articles to their users. Typically, these recommendation systems do not leverage the users' reading behavior. Knowing how a reader changes topics within a reading session may lead to fine-tuned recommendations. For example, after reading a certain number of sports items, it may be counterproductive to keep recommending other sports news. The motivation for this paper is the assumption that understanding user behavior when reading successive online news articles can help in developing recommendation systems. We propose five categories of stochastic models to describe this behavior, depending on how the previous reading history affects future topic choices. We instantiate these five classes with 33 different stochastic processes covering short-term memory, revealed preference, cumulative advantage, and geometric sojourn models. Our empirical study is based on large e-news datasets from two online newspapers. We collected data from more than 13 million users who generated more than 23 million reading sessions, each composed of the successive clicks of a user on the posted news. We reduce each user session to its sequence of news topics. The models were fitted and compared using the Akaike Information Criterion (AIC) and the Brier score. We found that the best models are those in which the user moves through topics influenced only by their most recent readings. Our models were also better at predicting the next reading than the recommendation systems currently used by these newspapers, showing that our models can improve user satisfaction.
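For reference, the two model-comparison criteria mentioned are standard. With \(\hat{L}\) the maximized likelihood, \(k\) the number of model parameters, and \(f_{t,c}\) the predicted probability that session step \(t\) has topic \(c\) (with \(o_{t,c} = 1\) if it actually does, 0 otherwise):

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\mathrm{Brier} = \frac{1}{N}\sum_{t=1}^{N}\sum_{c=1}^{C}\left(f_{t,c} - o_{t,c}\right)^2.
```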

Real-time Estimation of the Urban Air Quality with Mobile Sensor System

Real-time air quality estimation, which bears directly on our daily life, has recently attracted increasing attention around the world. With the prevalence of mobile sensors, monitoring air quality with sensors mounted on vehicles has become an emerging option. Compared with traditional, expensive monitoring stations, mobile sensors are cheaper and more abundant, but their observations have unstable spatial and temporal distributions, so existing models do not work well on this type of data. In this paper, taking advantage of air quality data from mobile sensors, we propose a real-time urban air quality estimation method based on Gaussian Process Regression for the air pollution of unmonitored areas, pivoting on the diffusion effect and the accumulation effect of air pollution. To meet real-time demands, we propose a two-layer ensemble learning framework and a self-adaptivity mechanism to improve computational efficiency and adaptivity. We evaluate our model with real data from a mobile sensor system located in Beijing, China. The experiments show that our proposed model is superior to state-of-the-art spatial regression methods in both precision and running time.
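A minimal sketch of the core estimation step, using scikit-learn's Gaussian process regression on synthetic sensor readings; the paper's full system adds the ensemble layer and the self-adaptivity mechanism on top of something like this.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# hypothetical mobile-sensor readings: (x, y) coordinates -> PM2.5 values
rng = np.random.default_rng(1)
coords = rng.uniform(0, 10, size=(200, 2))
pm25 = 50 + 10 * np.sin(coords[:, 0]) + rng.normal(0, 2, 200)

# RBF kernel captures spatial smoothness (diffusion); WhiteKernel models sensor noise
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(coords, pm25)

grid = np.array([[2.5, 7.0], [8.0, 1.5]])          # unmonitored locations
mean, std = gpr.predict(grid, return_std=True)     # estimate with uncertainty
```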

Community Detection in Small Networks: A New Approach to Graph Partitioning

Towards an Optimal Outdoor Advertising Placement: When a Budget Constraint Meets Moving Trajectories

In this paper we propose and study the problem of trajectory-driven influential billboard placement: given a set of billboards U (each with a location and a cost), a database of trajectories T, and a budget L, find a set of billboards within the budget that influences the largest number of trajectories. One core challenge is to identify and reduce the overlap of the influence from different billboards on the same trajectories, while taking the budget constraint into consideration. We show that this problem is NP-hard and present an enumeration-based algorithm with a (1 − 1/e) approximation ratio. However, the enumeration would be very costly when |U| is large. By exploiting the locality property of billboards' influence, we propose a partition-based framework, PartSel. PartSel partitions U into a set of small clusters, computes the locally influential billboards for each cluster, and merges them to generate the global solution. Since the local solutions can be obtained much more efficiently than the global one, PartSel reduces the computation cost greatly, while achieving a non-trivial approximation ratio guarantee. We then propose a LazyProbe method to further prune billboards with low marginal influence, while achieving the same approximation ratio as PartSel. Next, we propose a branch-and-bound method to eliminate unnecessary enumerations in both PartSel and LazyProbe, as well as an aggregated index to speed up the computation of marginal influence. Experiments on real datasets verify the efficiency and effectiveness of our methods.
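For intuition, here is a plain cost-benefit greedy baseline for the budgeted coverage problem. It illustrates marginal influence with overlap removed, but it is not the paper's PartSel or LazyProbe method and carries a weaker guarantee than the enumeration-based algorithm.

```python
def greedy_billboards(billboards, cost, influenced, budget):
    # billboards: iterable of ids; cost[b]: price of b;
    # influenced[b]: set of trajectory ids that billboard b reaches
    chosen, covered, spent = [], set(), 0.0
    remaining = set(billboards)
    while True:
        best, best_ratio = None, 0.0
        for b in remaining:
            if spent + cost[b] > budget:
                continue
            gain = len(influenced[b] - covered)    # marginal influence, overlap removed
            if cost[b] > 0 and gain / cost[b] > best_ratio:
                best, best_ratio = b, gain / cost[b]
        if best is None:                           # nothing affordable improves coverage
            return chosen, covered
        chosen.append(best)
        covered |= influenced[best]
        spent += cost[best]
        remaining.discard(best)
```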

Self-adaptive Particle Swarm Optimization for Large-scale Feature Selection in Classification

Many evolutionary computation (EC) methods have been used to solve feature selection problems, and they perform well on most small-scale ones. However, as the dimensionality of feature selection problems increases, the solution space grows exponentially. Meanwhile, datasets contain more irrelevant features than relevant ones, which leads to many local optima in the huge solution space. Therefore, existing EC methods still suffer from stagnation in local optima on large-scale feature selection problems. Furthermore, large-scale feature selection problems on different datasets may have different properties, so an existing EC method with only one candidate solution generation strategy (CSGS) may perform poorly across them. In addition, finding a suitable EC method and corresponding parameter values for a given large-scale feature selection problem is time-consuming if we want to solve it effectively and efficiently. In this paper, we propose a self-adaptive particle swarm optimization (SaPSO) algorithm for feature selection, particularly for large-scale feature selection. First, an encoding scheme for the feature selection problem is employed in the SaPSO. Second, three important problems related to self-adaptive algorithms are investigated. After that, the SaPSO algorithm with a typical self-adaptive mechanism is proposed. The experimental results on 12 datasets show that the solution size obtained by the SaPSO algorithm is smaller than that of its EC counterparts on all datasets. The SaPSO algorithm performs better than its non-EC and EC counterparts in terms of classification accuracy, not only on most training sets but also on most test sets. Furthermore, as the dimensionality of the feature selection problem increases, the advantages of SaPSO become more prominent. This highlights that the SaPSO algorithm is suitable for solving feature selection problems, particularly large-scale ones.
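A minimal sketch of the underlying representation: a standard binary PSO over feature masks with a sigmoid transfer function, without the self-adaptive CSGS selection that is the paper's contribution. Here `fitness` is a user-supplied scorer, e.g., cross-validated accuracy minus a size penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_pso(fitness, d, swarm=30, iters=100, w=0.7, c1=1.5, c2=1.5):
    # each particle is a 0/1 mask over d features; higher fitness is better
    X = rng.integers(0, 2, size=(swarm, d)).astype(float)   # positions (masks)
    V = rng.normal(0, 1, size=(swarm, d))                   # velocities
    pbest = X.copy()
    pbest_f = np.array([fitness(x) for x in X])
    g = pbest[np.argmax(pbest_f)].copy()                    # global best mask
    for _ in range(iters):
        r1, r2 = rng.random((swarm, d)), rng.random((swarm, d))
        V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (g - X)
        # sigmoid transfer: velocity determines the probability of a 1-bit
        X = (rng.random((swarm, d)) < 1.0 / (1.0 + np.exp(-V))).astype(float)
        f = np.array([fitness(x) for x in X])
        better = f > pbest_f
        pbest[better], pbest_f[better] = X[better], f[better]
        g = pbest[np.argmax(pbest_f)].copy()
    return g
```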

Interactive Recommendation with User-specific Deep Reinforcement Learning

In this paper, we study a multi-step interactive recommendation problem for explicit-feedback recommender systems. Different from existing works, we propose a novel user-specific deep reinforcement learning approach to the problem. We first model the problem of interactive recommendation for a target user as a user-specific Markov decision process (MDP). We then derive a multi-MDP reinforcement learning task, where each MDP represents the interactive recommendation process for a specific user. Learning the optimal policy for such a multi-MDP task is more challenging, as the MDPs for different users may vary remarkably in their state transitions. To handle this challenge, we construct user-specific latent states that connect the different MDPs by using matrix factorization (MF). We propose a user-specific deep Q-learning (UDQN) method to estimate the optimal policy based on the user-specific latent states of all MDPs. Further, we propose a biased UDQN method that explicitly models user-specific preference information by employing an additional bias parameter when estimating the Q-values for each user's MDP. The capability of our approach for interactive recommendation is validated by comprehensive experimental results and analysis.
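The Q-learning update underlying this family of methods is the standard one below; in UDQN the state s is a user-specific latent vector obtained from matrix factorization and Q is approximated by a neural network rather than a table:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right].
```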

Probabilistic Mixture Model for Mapping the Underground Pipes

Buried pipes beneath our cities are the blood vessels that feed human civilization with water, gas, electricity, etc., and mapping them has long been a challenging problem. In this paper, a suitable coordinate system for the surveyed area is established, the noisy GPR and GPS data are analyzed and normalized, and the pipeline is described mathematically. On this basis, a Probabilistic Mixture Model is proposed to map the buried pipes; it takes discrete noisy GPR and GPS data as input and produces accurate pipe locations and directions as output. The proposed model consists of preprocessing, a pipe-fitting algorithm, the Classification Fitting Expectation Maximization (CFEM) algorithm, and the Angle-limited Hough (Al-Hough) transform. The direction information of each detection point is incorporated into the distance measure from a detection point to nearby pipelines, in order to handle areas where pipes intersect or are difficult to classify. The Expectation Maximization (EM) algorithm is upgraded to the CFEM algorithm, which classifies detection points into different classes and then connects and fits the points in each class to obtain accurate pipe locations and directions, while the Al-Hough transform provides reliable initializations for CFEM, to some extent ensuring the convergence of the algorithm. Experimental results on simulated and real-world datasets demonstrate the effectiveness of the proposed model.

Local Overlapping Community Detection

Local community detection refers to finding the community that contains a given node based on local information, which becomes very meaningful when global information about the network is unavailable or expensive to acquire. Most work on local community detection focuses on finding non-overlapping communities. However, many real-world networks, such as social networks, contain overlapping communities. Given an overlapping node that belongs to multiple communities, the problem is to find all the communities to which it belongs according to local information. We propose a framework for local overlapping community detection that proceeds in three steps. First, find nodes in the multiple communities to which the given node belongs. Second, select representative nodes from the nodes obtained above, which tend to lie in different communities. Third, discover the communities to which these representative nodes belong. To demonstrate the effectiveness of the framework, we implement six versions of it. Experimental results demonstrate that the six implementations outperform previously existing algorithms.

A Pipeline Computing Method of SpTV for Three-Order Tensors on CPU and GPU

Tensors have drawn a great deal of attention in many applications, such as physics, engineering science, social networks, recommender systems, and other fields. Tensor decomposition is the key to exploring the intrinsic relationships within tensor data, and it involves many sparse tensor-vector multiplications (SpTV). We analyze a variety of storage formats for sparse tensors and develop a piecewise compression strategy to improve the storage efficiency of large sparse tensors. This compression strategy avoids storing a large number of empty slices and empty fibers, and thus greatly reduces the storage space requirements. A parallel algorithm for SpTV based on the compression format HOCFS is designed to greatly improve its computing performance on GPUs. Each tensor is cut into multiple slices to form a series of SpMV operations, which are executed with pipelined parallelism. The transmission time of the slices can be hidden through the pipeline to further optimize the performance of SpTV.
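One way to see how a three-order SpTV reduces to a series of SpMV operations, sketched with SciPy sparse matrices; the paper's GPU pipeline additionally overlaps slice transfers with computation, which this CPU sketch does not model.

```python
import numpy as np
import scipy.sparse as sp

def sptv(slices, v):
    # slices[k] is the k-th frontal slice X[:, :, k] stored as a CSR matrix;
    # computing y[:, k] = X[:, :, k] @ v is one SpMV per slice, so the slices
    # can be streamed through a pipeline independently of each other
    return np.column_stack([Xk @ v for Xk in slices])

# hypothetical three-order sparse tensor with 4 frontal slices
slices = [sp.random(1000, 500, density=0.01, format="csr") for _ in range(4)]
y = sptv(slices, np.random.rand(500))   # result: (1000, 4) dense matrix
```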

Aspect Aware Learning for Aspect Category Sentiment Analysis

Aspect category sentiment analysis (ACSA) is an underexploited subtask in aspect-level sentiment analysis. It aims to identify the sentiment of predefined aspect categories. The main challenge in ACSA comes from the fact that the aspect category may not occur in the sentence in most cases. For example, the review "they have delicious sandwiches" positively talks about the aspect category "food" in an implicit manner. In this paper, we propose a novel aspect aware learning framework for ACSA tasks. Our key idea is to exploit the interaction between the aspect category and the contents under the guidance of both sentiment polarity and predefined categories. To this end, we design a two-way memory network for integrating aspect aware learning (AAL) into the framework of sentiment classification. We further present two algorithms to incorporate the potential impacts of aspect categories. One is to capture the correlations between aspect terms and the aspect category, like "sandwiches" and "food". The other is to recognize the aspect category for sentiment representations, like "food" for "delicious". We conduct extensive experiments on two SemEval datasets. The results reveal the essential role of AAL in aspect category sentiment analysis by achieving state-of-the-art performance.

Krylov Subspace Approximation for Local Community Detection in Large Networks

Community detection is an important information mining task for uncovering modular structures in large networks. For increasingly common large network datasets, global community detection is prohibitively expensive, and attention has shifted to methods that mine local communities, i.e., identifying all latent members of a particular community from a few labeled seed members. To address this semi-supervised mining task, we systematically develop a local spectral subspace-based community detection method, called LOSP. We define a family of local spectral subspaces based on Krylov subspaces, and seek a sparse indicator for the target community via an ℓ1 norm minimization over the Krylov subspace. Variants of LOSP depend on the type of random walk used (with different diffusion speeds), the dimension of the local spectral subspace, and the number of diffusion steps. The effectiveness of the proposed LOSP approach is theoretically analyzed based on Rayleigh quotients, and it is experimentally verified on a wide variety of real-world networks across social, production, and biological domains, as well as on an extensive set of synthetic LFR benchmark datasets.
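A minimal sketch of the local spectral subspace construction, under assumed details: a few random-walk diffusion steps from the seed set, followed by a Krylov basis of the diffused vector. LOSP would then seek a sparse community indicator within this subspace via ℓ1 minimization, which is omitted here.

```python
import numpy as np

def local_spectral_basis(P, seeds, dim=4, steps=3):
    # P: (n, n) row-stochastic random-walk matrix of the graph
    n = P.shape[0]
    v = np.zeros(n)
    v[list(seeds)] = 1.0 / len(seeds)     # probability mass on the seed members
    for _ in range(steps):                # short diffusion from the seeds
        v = P.T @ v
    basis = [v]
    for _ in range(dim - 1):              # Krylov subspace span{v, P^T v, (P^T)^2 v, ...}
        basis.append(P.T @ basis[-1])
    Q, _ = np.linalg.qr(np.column_stack(basis))
    return Q                              # orthonormal basis of the local subspace
```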

Attention Models in Graphs: A Survey

Graph-structured data arise naturally in many different application domains. By representing data as graphs, we can capture entities (i.e., nodes) as well as their relationships (i.e., edges) with each other. Many useful insights can be derived from graph-structured data, as demonstrated by an ever-growing body of work focused on graph mining. However, in the real world, graphs can be both large, with many complex patterns, and noisy, which can pose a problem for effective graph mining. An effective way to deal with this issue is to incorporate "attention" into graph mining solutions. An attention mechanism allows a method to focus on task-relevant parts of the graph, helping it to make better decisions. In this work, we conduct a comprehensive and focused survey of the literature on the emerging field of graph attention models. We introduce three intuitive taxonomies to group existing work. These are based on problem setting (type of input and output), the type of attention mechanism used, and the task (e.g., graph classification, link prediction, etc.). We motivate our taxonomies through detailed examples and use each to survey competing approaches from a unique standpoint. Finally, we highlight several challenges in the area and discuss promising directions for future work.

Bayesian Model Selection Approach to Multiple Change-points Detection with Non-Local Prior Distributions

We propose a Bayesian model selection (BMS) boundary detection procedure using non-local prior distributions for a sequence of data with multiple systematic mean changes. By using non-local priors in the Bayesian model selection framework, the BMS method can effectively suppress non-boundary spike points with large instantaneous changes. Further, we speed up the algorithm by reducing the multiple change-point problem to a series of single change-point detection problems. We establish the consistency of the estimated number and locations of the change points under various prior distributions. From both theoretical and numerical perspectives, we show that the non-local inverse moment prior leads to the fastest convergence rate in identifying the true change points on the boundaries. Extensive simulation studies are conducted to compare the BMS with existing methods, and our method is illustrated with an application to magnetic resonance imaging-guided radiation therapy data.

Multi-label Punitive kNN with Self-Adjusting Memory for Drifting Data Streams

In multi-label learning, data may simultaneously belong to more than one class. When multi-label data arrives as a stream, the challenges associated with multi-label learning are joined by those of data stream mining, including the need for algorithms that are fast and flexible, able to match both the speed and evolving nature of the stream. This paper presents a punitive k nearest neighbors algorithm with a self-adjusting memory (MLSAMPkNN) for multi-label, drifting data streams. The memory adjusts in size to contain only the current concept and a novel punitive system identifies and penalizes errant data examples early, removing them from the window. By retaining and using only data that are both current and beneficial, MLSAMPkNN is able to adapt quickly and efficiently to changes within the data stream while still maintaining a low computational complexity. Additionally, the punitive removal mechanism offers increased robustness to various data-level difficulties present in data streams, such as class imbalance and noise. The experimental study compares the proposal to 24 algorithms using 30 multi-label datasets on six multi-label metrics, evaluation time, and memory consumption. The superior performance of the proposed method is validated through non-parametric statistical analysis, proving both high accuracy and low time complexity. MLSAMPkNN is shown to be a versatile classifier, capable of returning excellent performance in diverse data stream scenarios.
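A single-label sketch of the punitive windowing idea, under assumed simplifications: plain majority voting, one penalty point per wrong vote, and eviction after `max_penalty` points. The actual MLSAMPkNN handles multi-label data and adjusts the window size to the current concept.

```python
import numpy as np

class PunitiveKNN:
    # neighbors that vote for a wrong label collect penalty points and are
    # evicted early; otherwise the memory behaves as a FIFO window
    def __init__(self, k=3, window=200, max_penalty=3):
        self.k, self.window, self.max_penalty = k, window, max_penalty
        self.X, self.y, self.pen = [], [], []

    def _neighbors(self, x):
        d = np.linalg.norm(np.asarray(self.X) - np.asarray(x), axis=1)
        return np.argsort(d)[: self.k]

    def predict(self, x):
        if not self.X:
            return None
        votes = [self.y[i] for i in self._neighbors(x)]
        return max(set(votes), key=votes.count)

    def partial_fit(self, x, y_true):
        if self.X:
            for i in self._neighbors(x):
                if self.y[i] != y_true:          # punish errant neighbors
                    self.pen[i] += 1
        keep = [i for i, p in enumerate(self.pen) if p < self.max_penalty]
        keep = keep[max(0, len(keep) - (self.window - 1)):]   # cap window size
        self.X = [self.X[i] for i in keep] + [list(x)]
        self.y = [self.y[i] for i in keep] + [y_true]
        self.pen = [self.pen[i] for i in keep] + [0]
```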

Density-friendly Graph Decomposition

Decomposing a graph into a hierarchical structure via k-core analysis is a standard operation in any modern graph-mining toolkit. k-core decomposition is a simple and efficient method that allows one to analyze a graph beyond its mere degree distribution. More specifically, it is used to identify areas in the graph of increasing centrality and connectedness, and it reveals the structural organization of the graph. Despite the fact that k-core analysis relies on vertex degrees, k-cores do not satisfy a certain, rather natural, density property. Simply put, the most central k-core is not necessarily the densest subgraph. This inconsistency between k-cores and graph density provides the basis of our study. We start by defining what it means for a subgraph to be locally dense, and we show that our definition entails a nested chain decomposition of the graph, similar to the one given by k-cores, but in this case the components are arranged in order of increasing density. We show that such a locally-dense decomposition for a graph G = (V, E) can be computed in polynomial time. The running time of the exact decomposition algorithm is O(|V|^2 |E|), but it is significantly faster in practice. In addition, we develop a linear-time algorithm that provides a factor-2 approximation to the optimal locally-dense decomposition. Furthermore, we show that the k-core decomposition is also a factor-2 approximation; however, as demonstrated by our experimental evaluation, in practice k-cores have a different structure than locally-dense subgraphs, and as predicted by the theory, k-cores are not always well aligned with graph density.
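To make the density property concrete, the standard average-degree density of a vertex subset S (the usual notion in densest-subgraph work, used here as an assumed reference point) is

```latex
d(S) = \frac{|E(S)|}{|S|}, \qquad E(S) = \{\,(u,v) \in E : u, v \in S\,\},
```

and the contrast is that the innermost k-core only guarantees a minimum internal degree of k, which does not imply that it maximizes d(S).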

Computing top-k Closeness Centrality Faster in Unweighted Graphs

Given a connected graph G = (V, E), the closeness centrality of a vertex v is defined as (n − 1) / (Σ_{w ∈ V} d(v, w)). This measure is widely used in the analysis of real-world complex networks, and the problem of selecting the k most central vertices has been deeply analyzed in the last decade. However, this problem is computationally demanding, especially for large networks: in the first part of the paper, we prove that it is not solvable in time O(|E|^(2−µ)) on directed graphs, for any constant µ > 0, under reasonable complexity assumptions. Furthermore, we propose a new algorithm for selecting the k most central nodes in a graph: we experimentally show that this algorithm improves significantly on both the textbook algorithm, which is based on computing the distance between all pairs of vertices, and the state of the art. For example, we are able to compute the top k nodes in a few dozen seconds in real-world networks with millions of nodes and edges. Finally, as a case study, we compute the 10 most central actors in the IMDB collaboration network, where two actors are linked if they played together in a movie, and in the Wikipedia citation network, which contains a directed edge from a page p to a page q if p contains a link to q.
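The textbook baseline that the paper improves upon runs one BFS per vertex in the unweighted graph; a minimal sketch, with a hypothetical adjacency-list input:

```python
from collections import deque

def closeness(adj, v):
    # adj: dict mapping each node to a list of neighbors; BFS from v gives
    # the unweighted distances d(v, w) needed by the closeness formula
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def top_k(adj, k):
    # quadratic-time baseline: one BFS per vertex, then sort by score
    return sorted(adj, key=lambda v: closeness(adj, v), reverse=True)[:k]
```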

A Unified Framework of Sparse Online Learning

The amount of data in our society has been exploding in today's era of big data. In this paper, we address several open challenges of big data stream classification: high volume, high velocity, high dimensionality, high sparsity, and high class imbalance. Many existing studies in the data mining literature solve data stream classification tasks in a batch learning setting, which suffers from poor efficiency and scalability when dealing with big data. To overcome these limitations, this paper investigates an online learning framework for big data stream classification tasks. Unlike some existing online data stream classification techniques that are often based on first-order online learning, we propose a framework of Sparse Online Classification (SOC) for data stream classification, which includes some state-of-the-art first-order sparse online learning algorithms as special cases and allows us to derive a new, effective second-order online learning algorithm for data stream classification. In addition, we propose a new cost-sensitive sparse online learning algorithm by extending the framework to tackle online anomaly detection tasks where the class distribution of data can be very imbalanced. We also analyze the theoretical bounds of the proposed method, and finally we conduct an extensive set of experiments, in which encouraging results validate the efficacy of the proposed algorithms in comparison to a family of state-of-the-art techniques on a variety of data stream classification tasks.
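A minimal first-order instance of the kind of algorithm the SOC framework covers: an online hinge-loss gradient step followed by soft-thresholding to keep the weight vector sparse. This is a sketch, not the paper's second-order or cost-sensitive variants.

```python
import numpy as np

def soft_threshold(w, lam):
    # shrink each weight toward zero; small weights become exactly zero
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def sparse_online_classifier(stream, d, eta=0.1, lam=0.01):
    # stream yields (x, y) pairs with x a length-d feature vector, y in {-1, +1}
    w = np.zeros(d)
    for x, y in stream:
        if y * (w @ x) < 1.0:               # hinge-loss subgradient step
            w += eta * y * x
        w = soft_threshold(w, eta * lam)    # promote sparsity after each update
    return w
```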

High-Utility Itemset Mining with Effective Pruning Strategies

High utility itemset mining is an important data mining problem that considers profit factors, besides quantity, from the transactional database. It helps to find the most valuable products/items that are difficult to track using frequent itemset mining alone. An item with a high profit value might be rare in the transactional database and yet have tremendous importance. While many existing algorithms find high utility itemsets by generating comparatively large candidate sets, our main focus is to reduce the computation time significantly through the introduction of a pruning strategy. The pruning approach reduces the visits to unnecessary nodes in the search space and thereby the time taken by the algorithm. In this paper, we propose an algorithm that constructs the candidate sets in the form of a tree structure and traverses the itemsets with high transaction-weighted utility (HTWU). It uses a pruning strategy to reduce the computation time by avoiding visits to unnecessary nodes, shrinking the search space. It also significantly minimizes the transaction database generated at each node. Our experimental results show that the algorithm greatly reduces the execution time for high utility itemset mining.
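For intuition, the standard pruning quantity involved is the transaction-weighted utility: TWU(i) upper-bounds the utility of every itemset containing item i, so items whose TWU falls below the minimum-utility threshold can be discarded from the search space. A minimal sketch:

```python
from collections import defaultdict

def twu(transactions, profit):
    # transactions: list of dicts mapping item -> purchased quantity;
    # profit: dict mapping item -> unit profit.
    # TU(T) = sum of quantity * unit profit over items in transaction T;
    # TWU(i) = sum of TU(T) over all transactions T that contain item i.
    twu_map = defaultdict(float)
    for T in transactions:
        tu = sum(q * profit[i] for i, q in T.items())
        for i in T:
            twu_map[i] += tu
    return twu_map

transactions = [{"a": 2, "b": 1}, {"b": 3, "c": 1}, {"a": 1, "c": 2}]
profit = {"a": 5.0, "b": 2.0, "c": 1.0}
# items with twu(...)[i] < minutil can be pruned before tree traversal
print(dict(twu(transactions, profit)))
```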

