ACM Transactions on Knowledge Discovery from Data (TKDD)

Latest Articles

Fine-Grained Air Quality Inference with Remote Sensing Data and Ubiquitous Urban Data

Air quality has gained much attention in recent years and is of great importance to protecting people’s health. Due to the influence of...

Bayesian Model Selection Approach to Multiple Change-Points Detection with Non-Local Prior Distributions

We propose a Bayesian model selection (BMS) boundary detection procedure using non-local prior...

Real-Time Estimation of the Urban Air Quality with Mobile Sensor System

Real-time air quality estimation, which closely affects our daily life, has recently attracted more and more attention from all over the world. With the...

Self-Adaptive Particle Swarm Optimization for Large-Scale Feature Selection in Classification

Many evolutionary computation (EC) methods have been used to solve feature selection problems and they perform well on most small-scale feature...

Hybrid Crowd-Machine Wrapper Inference

Wrapper inference deals with generating programs to extract data from Web pages. Several supervised and unsupervised wrapper inference approaches have been proposed in the literature. On one hand, unsupervised approaches produce erratic wrappers: whenever the sources do not satisfy underlying assumptions of the inference algorithm, their accuracy is...

Krylov Subspace Approximation for Local Community Detection in Large Networks

Community detection is an important information mining task to uncover modular structures in large networks. For increasingly common large network...

Computing top-k Closeness Centrality Faster in Unweighted Graphs

Given a connected graph G=(V,E), where V denotes the set of nodes and E the set of edges of the graph, the length (that is, the number of...

Density-Friendly Graph Decomposition

Decomposing a graph into a hierarchical structure via k-core analysis is a standard operation in any modern graph-mining toolkit. k-core decomposition is a simple and efficient method that allows one to analyze a graph beyond its mere degree distribution. More specifically, it is used to identify areas in the graph of increasing centrality and...
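
As a minimal illustration of the standard k-core analysis mentioned above (not the density-friendly decomposition the article itself proposes), the following Python sketch computes core numbers by repeatedly peeling a node of minimum remaining degree. The function name `core_numbers` and the adjacency-dictionary input format are assumptions made for this example only.

```python
def core_numbers(adj):
    """Return the core number of every node of an undirected graph.

    adj maps each node to the set of its neighbours (hypothetical input format).
    This simple version is quadratic in the number of nodes.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel a node of minimum remaining degree.
        v = min(remaining, key=lambda u: degree[u])
        # The core number never decreases along the peeling order.
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core
```

For a triangle with one pendant node attached, the triangle nodes end up in the 2-core and the pendant node in the 1-core.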


About TKDD 

ACM Transactions on Knowledge Discovery from Data (TKDD) publishes original archival papers in the area of knowledge discovery from data and closely related disciplines. The majority of the papers that appear in TKDD are expected to address the logical and technical foundation of knowledge discovery and data mining.

Forthcoming Articles
Generalizing Long Short-Term Memory Network for Deep Learning from Generic Data

The Long Short-Term Memory (LSTM) network is a popular deep learning model, particularly useful for data with temporal correlation, such as texts, sequences, or time series data, thanks to its sought-after recurrent structures designed to capture temporal correlation. In this paper, we propose to generalize LSTM to generic machine learning tasks where the data used for training do not hold explicit temporal or sequential correlation. Our theme is to explore feature correlation in the original data and convert each instance into a synthetic sentence format by using a 2-gram probabilistic language model. More specifically, for each instance represented in the original feature space, our conversion first seeks to horizontally align the original features into a sequentially correlated feature vector, resembling the letter coherence within each single word. In addition, a vertical alignment is carried out to create multiple time points and simulate the sequential order of words (i.e., word correlation) within a sentence. The two-dimensional horizontal-and-vertical alignments not only ensure that feature correlations are maximally utilized, but also preserve the original feature values in the new representation. As a result, the LSTM model can be utilized to achieve good classification accuracy, even if the underlying data do not have explicit temporal or sequential correlation. Experiments on 20 generic datasets confirm that applying LSTM to generic data yields a clear performance gain compared to conventional machine learning methods. This research opens a new paradigm that allows LSTM deep learning to be broadly applied to generic machine learning tasks.

CT LIS: Learning Influences and Susceptibilities through Temporal Behaviors

How can we quantify influences between users, given that social network users influence each other in their temporal behaviors? Previous work has directly defined an independent model parameter to capture the interpersonal influence between each pair of users. To do so, these models need a parameter for each pair of users, which results in high-dimensional models that are prone to overfitting. Moreover, such models do not consider how influences depend on each other when they are sent from the same user or received by the same user. Therefore, we propose a model that defines parameters for every user with a latent influence vector and a susceptibility vector, as opposed to defining influences on user pairs. Such low-dimensional representations naturally cause the interpersonal influences involving the same user to be coupled with each other, thus reducing the model's complexity. Additionally, the model can easily consider the temporal information and sentimental polarities of users' messages. Finally, we conduct extensive experiments on real Microblog data, showing that our model with such representations achieves better performance than the state-of-the-art and pair-wise models, and that learning influences on sentiments can benefit performance.

Multi-User Mobile Sequential Recommendation for Route Optimization

We enhance the original mobile sequential recommendation (MSR) model to address several key issues by introducing three new forms, MMSR, MMSR-m and MMSR-d. To enrich the properties of the pick-up points, including the locations and probabilities, we add the pick-up capacities to the attributes of pick-up points. The MMSR model finds optimal routes for multiple users at different locations while disallowing overlapping recommended routes. The MMSR-m addresses the issue by assigning a pick-up capacity to each location while the MMSR-d model allows the pick-up capacity to vary at different locations. The MMSR model is numerically solved by parallel simulated annealing. The results confirmed the superiority of our model and the solution methods over several published benchmarks in the more demanding high-dimensional settings. Our proposed push-point method allows us to further improve our parallel algorithms for MMSR-m and MMSR-d to address more realistic city-level problems.

Recurrent Meta-Structure for Robust Similarity Measure in Heterogeneous Information Networks

Similarity measure is one of the fundamental tasks in heterogeneous information network analysis. It has been applied to many areas such as product recommendation, clustering and Web search. Most of the existing metrics can provide personalized services for users by taking a meta-path or meta-structure as input. However, these metrics may highly depend on the user-specified meta-path or meta-structure. In addition, users must know how to select an appropriate meta-path or meta-structure. In this paper, we propose a novel similarity measure in heterogeneous information networks, called Recurrent Meta-Structure-based Similarity (RMSS). The recurrent meta-structure as a schematic structure in heterogeneous information networks provides a unified framework for integrating all of the meta-paths and meta-structures, and can be constructed automatically by means of repetitively traversing the network schema. In order to formalize the semantics, the recurrent meta-structure is decomposed into several recurrent meta-paths and recurrent meta-trees, and we then define the commuting matrices of the recurrent meta-paths and meta-trees. All of these commuting matrices are combined together according to different weights. We propose two kinds of weighting strategies to determine the weights. The first is called the local weighting strategy, which depends on the sparsity of the commuting matrices, and the second is called the global weighting strategy, which depends on the strength of the commuting matrices. As a result, RMSS is defined by means of the weighted summation of the commuting matrices. Note that RMSS can also provide personalized services for users by means of the weights of the recurrent meta-paths and meta-trees. Experimental evaluations show that the proposed RMSS is robust and outperforms the existing metrics in terms of ranking and clustering tasks.

CFOF: A Concentration Free Measure for Anomaly Detection

We present a novel notion of outlier, called the Concentration Free Outlier Factor, or CFOF. As a main contribution, we formalize the notion of concentration of outlier scores and theoretically prove that CFOF does not concentrate in the Euclidean space for any arbitrarily large dimensionality. To the best of our knowledge, there are no other proposals of data analysis measures related to the Euclidean distance for which theoretical evidence of immunity to the concentration effect has been provided. We determine the closed form of the distribution of CFOF scores in arbitrarily large dimensionalities and show that the CFOF score of a point depends on its squared norm standard score and on the kurtosis of the data distribution, thus providing a clear and statistically founded characterization of this notion. Moreover, we leverage this closed form to provide evidence that the definition does not suffer from the hubness problem affecting other measures in high dimensions. We prove that the number of CFOF outliers coming from each cluster is proportional to cluster size and kurtosis, a property that we call semi-locality. We leverage theoretical findings to shed light on properties of well-known outlier scores. Indeed, we determine that semi-locality characterizes existing reverse nearest neighbor-based outlier definitions, thus clarifying the exact nature of their observed local behavior. We also formally prove that classical distance-based and density-based outliers concentrate both for bounded and unbounded sample sizes and for fixed and variable values of the neighborhood parameter. We introduce the fast-CFOF algorithm for detecting outliers in large high-dimensional datasets. The algorithm has linear cost, supports multi-resolution analysis, and is embarrassingly parallel. Experiments highlight that the technique is able to efficiently process huge datasets and to deal even with large values of the neighborhood parameter, to avoid concentration, and to obtain excellent accuracy.

Mining Rank Data

The problem of frequent pattern mining has been studied quite extensively for various types of data, including sets, sequences, and graphs. Somewhat surprisingly, another important type of data, namely rank data, has received very little attention in data mining so far. In this paper, we therefore address the problem of mining rank data, that is, data in the form of rankings (total orders) of an underlying set of items. More specifically, two types of patterns are considered, namely frequent rankings and dependencies between such rankings in the form of association rules. Algorithms for mining frequent rankings and frequent closed rankings are proposed and tested experimentally, using both synthetic and real data.
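
To make the notion of a frequent ranking concrete, here is a small Python sketch. It assumes, as an interpretation not spelled out in the abstract, that a pattern is a ranking of a subset of the items and that a full ranking supports the pattern when it orders those items the same way. The names `supports` and `support` are hypothetical.

```python
def supports(ranking, pattern):
    """True if the items of `pattern` occur in `ranking` in the same relative order."""
    position = {item: idx for idx, item in enumerate(ranking)}
    if any(item not in position for item in pattern):
        return False
    indices = [position[item] for item in pattern]
    return all(a < b for a, b in zip(indices, indices[1:]))


def support(rankings, pattern):
    """Fraction of rankings in the data set that support `pattern`."""
    return sum(supports(r, pattern) for r in rankings) / len(rankings)
```

Under this reading, a pattern would be called frequent when its support reaches a user-chosen threshold.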

A Unified Framework with Multi-source Data for Predicting Passenger Demands of Ride Services

Ride-hailing applications have been offering convenient ride services for people in need. However, such applications still suffer from the issue of supply-demand disequilibrium, which is a typical problem for traditional taxi services. Effective predictions of passenger demands will alleviate the disequilibrium by avoiding dispatching cars to zero-demand areas and will facilitate dynamic pricing. Existing studies of demand prediction mainly utilize a single model, based only on trajectory data, orders of ride services, or both, instead of a combination of several models. Meanwhile, they simply apply fixed-size grids to partition space and overlook a lot of geo-spatial information. In this paper, we present a unified framework with a new combined model for demand prediction, using heterogeneous data and a road network-based method for the area partition, supported by additional geo-spatial information. We analyze and evaluate the performance of our combined model using the actual operational data from UCAR. The experimental results indicate that our model outperforms six other baselines by over 39% in Mean Absolute Error (MAE) and 59% in Root Mean Square Error (RMSE) on average.

A Unified Multi-view Clustering Algorithm using Multi-objective Optimization Coupled with Generative Model

There is a large body of work on multi-view clustering which exploits multiple representations (or views) of the same input data for better convergence. These multiple views can come from multiple modalities (image, audio, text) or different feature subsets. Recently, multi-objective based multi-view clustering methods have surpassed the performance of single objective based multi-view clustering techniques. One key problem is that it is difficult to select a single solution from a set of alternative partitionings generated by multi-objective techniques on the final Pareto optimal front. In this paper, we propose a novel multi-objective based multi-view clustering framework which overcomes the problem of selecting a single solution in multi-objective based techniques. In particular, our proposed framework has three major components: (i) a multi-view based multi-objective algorithm, Multiview-AMOSA, for initial clustering of data points; (ii) a generative model for generating a combined solution having probabilistic labels; and (iii) the K-means algorithm for obtaining the final labels. As the first component, we have adopted a recently developed multi-view based multi-objective clustering algorithm to generate different possible consensus partitionings of a given dataset taking into account different views. A generative model is coupled with the first component to generate a single consensus partitioning after considering multiple solutions. It exploits the latent subsets of the non-dominated solutions obtained from the multi-objective clustering algorithm and combines them to produce a single probabilistic labeled solution. Finally, a simple clustering algorithm, namely K-means, is applied to the generated probabilistic labels to obtain the final cluster labels. Experimental validation of our proposed framework is carried out over several benchmark datasets belonging to three different domains: UCI datasets, search result clustering datasets, and patient stratification datasets. It shows an improvement of around 2%-4% over different evaluation metrics in comparison to state-of-the-art methods.

Evolutionary Classifier and Cluster Selection Approach for Ensemble Classification

Ensemble classifiers improve the classification performance by combining several classifiers using a suitable fusion methodology. Many ensemble classifier generation methods have been developed that allow the training of multiple classifiers on a single dataset. Random subspace is one such methodology, utilized by many state-of-the-art ensemble classifiers, that generates random subsamples from the input data and trains classifiers on different subsamples. Real-world datasets have randomness and noise in them; therefore, not all randomly generated samples are suitable for training. In this paper, we propose a novel particle swarm optimization-based approach to optimize the random subspace to generate an ensemble classifier. We first generate a random subspace by incrementally clustering input data and then optimize all generated data clusters. On all optimized data clusters, a set of classifiers is trained and is added to the pool. The pool of classifiers is then optimized, and an optimized ensemble classifier is generated. The proposed approach is tested on 12 benchmark datasets from the UCI repository and results are compared with current state-of-the-art ensemble classifier approaches. A statistical significance test is also conducted, and an analysis is presented.

In Search of a Stochastic Model for the E-news Reader

E-news readers increasingly have at their disposal a very large set of news articles to read. Online newspaper sites use recommendation systems to predict and to offer relevant articles to their users. Typically, these recommendation systems do not leverage the users' reading behavior. Knowing how the reader changes topics in a reading session may lead to fine-tuned recommendations. For example, after reading a certain number of sports items, it may be counter-productive to keep recommending other sports news. The motivation for this paper is the assumption that understanding user behavior when reading successive online news articles can help in developing recommendation systems. We propose five categories of stochastic models to describe this behavior depending on how the previous reading history affects the future choices of topics. We instantiated these five classes with 33 different stochastic processes covering short-term memory, revealed preference, cumulative advantage, and geometric sojourn models. Our empirical study is based on large datasets of E-news from two online newspapers. We collected data from more than 13 million users who generated more than 23 million reading sessions, each one composed of the successive clicks of the users on the posted news. We reduce each user session to the sequence of news topics read. The models were fitted and compared using the Akaike Information Criterion and the Brier Score. We found that the best models are those in which the user moves through topics influenced only by their most recent readings. Our models were also better at predicting the next reading than the recommendation systems currently used by these newspapers, showing that our models can improve user satisfaction.
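
The model comparison described above can be sketched in Python as follows. A first-order Markov chain over topics is used here purely as an assumed example of a short-memory model; it is not claimed to be one of the article's 33 processes. The helper names `fit_markov`, `aic`, and `brier_score` are hypothetical, while the AIC formula (2k - 2 ln L) and the multi-class Brier score are the standard definitions.

```python
from collections import Counter, defaultdict


def fit_markov(sessions):
    """Estimate first-order topic transition probabilities from reading sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for prev, nxt in zip(session, session[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for prev, row in counts.items()
    }


def aic(log_likelihood, num_params):
    """Akaike Information Criterion, 2k - 2 ln L; lower values are better."""
    return 2 * num_params - 2 * log_likelihood


def brier_score(predicted, observed_index):
    """Multi-class Brier score of one forecast over topics; lower is better.

    predicted is a list of topic probabilities and observed_index is the
    position of the topic that was actually read next.
    """
    return sum(
        (p - (1.0 if i == observed_index else 0.0)) ** 2
        for i, p in enumerate(predicted)
    )
```

Averaging `brier_score` over all next-topic forecasts, and computing `aic` from each fitted model's log-likelihood, gives two complementary ways to rank candidate models.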

Fast Parallel Algorithms for Counting and Listing Triangles in Big Graphs

Big graphs (networks) arising in numerous application areas pose significant challenges for graph analysts as these graphs grow to billions of nodes and edges and are prohibitively large to fit in the main memory. Finding the number of triangles in a graph is an important problem in the mining and analysis of graphs. In this paper, we present two efficient MPI-based distributed memory parallel algorithms for counting triangles in big graphs. The first algorithm employs overlapping partitioning and efficient load balancing schemes to provide a very fast parallel algorithm. The algorithm scales well to networks with billions of nodes and can compute the exact number of triangles in a network with 10 billion edges in 16 minutes. The second algorithm divides the network into non-overlapping partitions, leading to a space-efficient algorithm. Our results on both artificial and real-world networks demonstrate a significant space saving with this algorithm. We also present a novel approach that reduces communication cost drastically, making the algorithm both space- and runtime-efficient. Further, we demonstrate how our algorithms can be used to list all triangles in a graph and compute clustering coefficients of nodes. Our algorithm can also be adapted to a parallel approximation algorithm using an edge sparsification method.
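
For readers unfamiliar with the underlying problem, the following Python sketch counts triangles exactly on a single machine. It is a sequential baseline using a common degree-ordering technique, not the MPI-based partitioned algorithms of the article. The function name `count_triangles` and the edge-list input are assumptions, and node labels are assumed to be mutually comparable (for example, integers).

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as an iterable of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Rank nodes by (degree, label) and keep only neighbours of higher rank,
    # so that every triangle is discovered exactly once, from its lowest-ranked node.
    rank = {v: (len(adj[v]), v) for v in adj}
    higher = {v: {u for u in adj[v] if rank[u] > rank[v]} for v in adj}
    return sum(len(higher[u] & higher[v]) for u in adj for v in higher[u])
```

Replacing the final sum with a loop that yields each `(u, v, w)` for `w` in the intersection would list the triangles instead of counting them.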

Learning Distance Metrics from Probabilistic Information

The goal of metric learning is to learn a good distance metric that can capture the relationships among instances, and its importance has long been recognized in many fields. An implicit assumption in the traditional settings of metric learning is that the associated labels of the instances are deterministic. However, in many real-world applications, the associated labels come naturally with probabilities instead of deterministic values, which makes it difficult for the existing metric learning methods to work well in these applications. To address this challenge, in this article, we study how to effectively learn the distance metric from datasets that contain probabilistic information, and then propose several novel metric learning mechanisms for two types of probabilistic labels, i.e., the instance-wise probabilistic label and the group-wise probabilistic label. Compared with the existing metric learning methods, our proposed mechanisms are capable of learning distance metrics directly from the probabilistic labels with high accuracy. We also theoretically analyze the proposed mechanisms and conduct extensive experiments based on real-world datasets to verify the desirable properties of these mechanisms.

Community Detection in Small Networks: A New Approach to Graph Partitioning

Towards an Optimal Outdoor Advertising Placement: When a Budget Constraint Meets Moving Trajectories

In this paper we propose and study the problem of trajectory-driven influential billboard placement: given a set of billboards U (each with a location and a cost), a database of trajectories T and a budget L, find a set of billboards within the budget to influence the largest number of trajectories. One core challenge is to identify and reduce the overlap of the influence from different billboards to the same trajectories, while taking the budget constraint into consideration. We show that this problem is NP-hard and present an enumeration-based algorithm with a (1 - 1/e) approximation ratio. However, the enumeration would be very costly when |U| is large. By exploiting the locality property of billboards' influence, we propose a partition-based framework PartSel. PartSel partitions U into a set of small clusters, computes the locally influential billboards for each cluster, and merges them to generate the global solution. Since the local solutions can be obtained much more efficiently than the global one, PartSel would reduce the computation cost greatly; meanwhile it achieves a non-trivial approximation ratio guarantee. Then we propose a LazyProbe method to further prune billboards with low marginal influence, while achieving the same approximation ratio as PartSel. Next, we propose a branch-and-bound method to eliminate unnecessary enumerations in both PartSel and LazyProbe, as well as an aggregated index to speed up the computation of marginal influence. Experiments on real datasets verify the efficiency and effectiveness of our methods.
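
The problem statement above can be illustrated with a simple cost-effectiveness greedy heuristic in Python. This is a swapped-in baseline, not the article's enumeration-based algorithm, PartSel, or LazyProbe, and on its own it does not carry the (1 - 1/e) guarantee. Influence is modelled here as plain set coverage of trajectory identifiers, which is an assumption, as are the function name `greedy_billboards`, the input format, and the requirement that costs be positive.

```python
def greedy_billboards(billboards, budget):
    """Pick billboards greedily by marginal influence per unit cost.

    billboards maps a billboard name to (cost, set of influenced trajectory ids);
    costs are assumed to be positive. Returns the chosen names in selection order
    and the set of trajectories they influence.
    """
    chosen, covered, spent = [], set(), 0
    candidates = dict(billboards)
    while candidates:
        best, best_ratio = None, 0.0
        for name, (cost, trajectories) in candidates.items():
            if spent + cost > budget:
                continue  # would exceed the budget
            # Marginal gain counts only trajectories not already influenced,
            # which discounts overlap between billboards.
            ratio = len(trajectories - covered) / cost
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            break  # nothing affordable adds any influence
        cost, trajectories = candidates.pop(best)
        chosen.append(best)
        covered |= trajectories
        spent += cost
    return chosen, covered
```

The marginal-gain computation inside the loop is the step that indexing and pruning techniques such as those named in the abstract aim to make cheaper.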

Interactive Recommendation with User-specific Deep Reinforcement Learning

In this paper, we study a multi-step interactive recommendation problem for explicit-feedback recommender systems. Unlike existing work, we propose a novel user-specific deep reinforcement learning approach to the problem. We first model the problem of interactive recommendation for a target user as a user-specific Markov decision process (MDP). We then derive a multi-MDP reinforcement learning task, where each MDP represents the interactive recommendation process for a specific user. It is a more difficult challenge to learn the optimal policy for such a multi-MDP task, as the MDPs for different users may vary remarkably in state transitions. To handle this challenge, we construct user-specific latent states to connect different MDPs by using the technique of matrix factorization (MF). We propose a user-specific deep Q-learning (UDQN) method to estimate the optimal policy based on the user-specific latent states of all MDPs. Further, we propose a biased UDQN method to explicitly model user-specific information of preferences by employing an additional bias parameter when estimating the Q-values for each user's MDP. The capability of our approach to interactive recommendations is sufficiently validated by the comprehensive experimental results and analysis.

Local Overlapping Community Detection

Local community detection refers to finding the community that contains the given node based on local information, which becomes very meaningful when global information about the network is unavailable or expensive to acquire. Most work on local community detection focuses on finding non-overlapping communities. However, many real-world networks, such as social networks, contain overlapping communities. Given an overlapping node that belongs to multiple communities, the problem is to find the communities to which it belongs according to local information. We propose a framework for local overlapping community detection. The framework has three steps. First, find nodes in multiple communities to which the given node belongs. Second, select representative nodes from the nodes obtained above, which tend to be in different communities. Third, discover the communities to which these representative nodes belong. In addition, to demonstrate the effectiveness of the framework, we implement six versions of this framework. Experimental results demonstrate that the six implemented versions outperform the previously existing algorithms.

A Pipeline Computing Method of SpTV for Three-Order Tensors on CPU and GPU

Tensors have drawn a great deal of attention in many applications, such as physics, engineering science, social networks, recommender systems, and other fields. Tensor decomposition is the key to exploring the intrinsic data relationships of tensors. There are many sparse tensor and vector multiplications (SpTV) in tensor decomposition. We analyze a variety of storage formats of sparse tensors and develop a piecewise compression strategy to improve the storage efficiency of large sparse tensors. This compression strategy can avoid storing a large number of empty slices and empty fibers in sparse tensors, and thus the storage space requirements are greatly reduced. A parallel algorithm for the SpTV based on the compression format HOCFS is designed to greatly improve its computing performance on GPUs. Each tensor is cut into multiple slices to form a series of SpMV operations, which form the pipelined parallelism. The transmission time of the slices can be hidden through pipelined parallelism to further optimize the performance of the SpTV.
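
The idea of cutting a sparse tensor into slices and treating SpTV as a series of SpMV operations can be sketched sequentially in Python. This is only an illustration under assumed conventions: the tensor is stored as a coordinate dictionary, slices are taken along the first index, and the vector is contracted along the third index. It implements neither the HOCFS format nor any GPU pipelining, and the function names are hypothetical.

```python
from collections import defaultdict


def split_into_slices(entries):
    """Group the nonzeros of a three-order tensor {(i, j, k): value} by slice index i.

    Only non-empty slices are stored, so empty slices cost no memory.
    """
    slices = defaultdict(dict)
    for (i, j, k), value in entries.items():
        slices[i][(j, k)] = value
    return slices


def spmv(matrix, vector):
    """Sparse matrix-vector product for a matrix stored as {(row, col): value}."""
    out = defaultdict(float)
    for (row, col), value in matrix.items():
        out[row] += value * vector[col]
    return dict(out)


def sptv(entries, vector):
    """Compute Y[i][j] = sum over k of X[i, j, k] * vector[k], one SpMV per slice."""
    return {i: spmv(m, vector) for i, m in split_into_slices(entries).items()}
```

Because each slice is processed independently, the per-slice calls are the natural units that a pipelined implementation could overlap with data transfer.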

Aspect Aware Learning for Aspect Category Sentiment Analysis

Aspect category sentiment analysis (ACSA) is an underexploited subtask in aspect level sentiment analysis. It aims to identify the sentiment of predefined aspect categories. The main challenge in ACSA comes from the fact that the aspect category may not occur in the sentence in most of the cases. For example, the review "they have delicious sandwiches" positively talks about the aspect category "food" in an implicit manner. In this paper, we propose a novel aspect aware learning framework for ACSA tasks. Our key idea is to exploit the interaction between the aspect category and the contents under the guidance of both sentiment polarity and predefined categories. To this end, we design a two-way memory network for integrating aspect aware learning (AAL) into the framework of sentiment classification. We further present two algorithms to incorporate the potential impacts of aspect categories. One is to capture the correlations between aspect terms and the aspect category like "sandwiches" and "food". The other is to recognize the aspect category for sentiment representations like "food" for "delicious". We conduct extensive experiments on two SemEval datasets. The results reveal the essential role of AAL in aspect category sentiment analysis by achieving state-of-the-art performance.

A New Smooth Approximation to the Zero One Loss with a Probabilistic Interpretation

We examine a new form of smooth approximation to the zero one loss in which learning is performed using a reformulation of the widely used logistic function. Our approach is based on using the posterior mean of a novel generalized Beta-Bernoulli formulation. This leads to a generalized logistic function that approximates the zero one loss, but retains a probabilistic formulation conferring a number of useful properties. The approach is easily generalized to kernel logistic regression and easily integrated into methods for structured prediction. We present experiments in which we learn such models using an optimization method consisting of a combination of gradient descent and coordinate descent using localized grid search so as to escape from local minima. Our experiments indicate that optimization quality is improved when learning meta-parameters are themselves optimized using a validation set. Our experiments show improved performance relative to widely used logistic and hinge loss methods on a wide variety of problems ranging from standard UC Irvine and libSVM evaluation datasets to product review predictions and a visual information extraction task. We observe that the approach: 1) is more robust to outliers compared to the logistic and hinge losses; 2) outperforms comparable logistic and max margin models on larger scale benchmark problems; 3) when combined with a Gaussian-Laplacian mixture prior on parameters, the kernelized version of our formulation yields sparser solutions than Support Vector Machine classifiers; and 4) when integrated into a probabilistic structured prediction technique, our approach provides more accurate probabilities, yielding improved inference and increasing information extraction performance.

Treatment Effect Estimation via Differentiated Confounder Balancing and Regression

Treatment effect plays an important role in decision making in many fields, such as social marketing, healthcare, and public policy. The key challenge in estimating treatment effect in observational studies in the wild is to handle confounding bias induced by imbalance of the confounder distributions between treated and control units. Traditional methods remove confounding bias by re-weighting units with supposedly accurate propensity score estimation under the unconfoundedness assumption. Controlling high-dimensional variables may make the unconfoundedness assumption more plausible, but poses a new challenge for accurate propensity score estimation. One strand of recent literature seeks to directly optimize weights to balance confounder distributions, bypassing propensity score estimation. But existing balancing methods fail to do selection and differentiation among the pool of a large number of potential confounders, leading to possible underperformance in many high-dimensional settings. In this paper, we propose a data-driven Differentiated Confounder Balancing (DCB) algorithm to jointly select confounders, differentiate weights of confounders and balance confounder distributions for treatment effect estimation in high-dimensional settings in the wild. Besides, under some settings with heavy confounding bias, in order to further reduce the bias and variance of the estimated treatment effect, we propose a Regression Adjusted Differentiated Confounder Balancing (RA-DCB) algorithm based on our DCB algorithm by incorporating outcome regression adjustment. The synergistic learning algorithm we propose is more capable of reducing the confounding bias in many observational studies. To validate the effectiveness of our DCB and RA-DCB algorithms, we conduct extensive experiments on both synthetic and real datasets. The experimental results clearly demonstrate that our algorithms outperform the state-of-the-art methods. By incorporating regression adjustment, our RA-DCB algorithm achieves better performance than the DCB algorithm, especially under the settings with heavy confounding bias. Moreover, we show that the top features ranked by our algorithm generate accurate prediction of online advertising effect.

Attention Models in Graphs: A Survey

Graph-structured data arise naturally in many different application domains. By representing data as graphs, we can capture entities (i.e., nodes) as well as their relationships (i.e., edges) with each other. Many useful insights can be derived from graph-structured data, as demonstrated by an ever-growing body of work focused on graph mining. However, in the real world, graphs can be both large (with many complex patterns) and noisy, which can pose a problem for effective graph mining. An effective way to deal with this issue is to incorporate "attention" into graph mining solutions. An attention mechanism allows a method to focus on task-relevant parts of the graph, helping it to make better decisions. In this work, we conduct a comprehensive and focused survey of the literature on the emerging field of graph attention models. We introduce three intuitive taxonomies to group existing work. These are based on problem setting (type of input and output), the type of attention mechanism used, and the task (e.g., graph classification, link prediction). We motivate our taxonomies through detailed examples and use each to survey competing approaches from a unique standpoint. Finally, we highlight several challenges in the area and discuss promising directions for future work.
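
To make the idea of focusing on task-relevant parts of a graph concrete, here is a minimal sketch of one common form of attention: a node scores each neighbor with a dot product, normalizes the scores with a softmax, and aggregates neighbor features with the resulting weights. The survey covers many mechanisms; this particular scoring function, the names `softmax` and `attend`, and the toy vectors are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, neighbors):
    """Return attention weights over neighbors and the weighted feature sum."""
    # Dot-product relevance of each neighbor to the query node.
    scores = [sum(q * x for q, x in zip(query, n)) for n in neighbors]
    weights = softmax(scores)
    dims = len(query)
    context = [sum(w * n[j] for w, n in zip(weights, neighbors))
               for j in range(dims)]
    return weights, context


# Hypothetical node with two neighbors; the first is better aligned with it.
weights, context = attend([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
print(weights, context)
```

The neighbor whose features align with the query receives the larger weight, so it dominates the aggregated representation.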

Multi-label Punitive kNN with Self-Adjusting Memory for Drifting Data Streams

In multi-label learning, data may simultaneously belong to more than one class. When multi-label data arrives as a stream, the challenges associated with multi-label learning are joined by those of data stream mining, including the need for algorithms that are fast and flexible, able to match both the speed and evolving nature of the stream. This paper presents a punitive k-nearest-neighbors algorithm with a self-adjusting memory (MLSAMPkNN) for multi-label, drifting data streams. The memory adjusts in size to contain only the current concept, and a novel punitive system identifies and penalizes errant data examples early, removing them from the window. By retaining and using only data that are both current and beneficial, MLSAMPkNN is able to adapt quickly and efficiently to changes within the data stream while still maintaining a low computational complexity. Additionally, the punitive removal mechanism offers increased robustness to various data-level difficulties present in data streams, such as class imbalance and noise. The experimental study compares the proposal to 24 algorithms using 30 multi-label datasets on six multi-label metrics, evaluation time, and memory consumption. The superior performance of the proposed method is validated through non-parametric statistical analysis, which supports both its high accuracy and its low time complexity. MLSAMPkNN is shown to be a versatile classifier, capable of returning excellent performance in diverse data stream scenarios.
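
The sketch below illustrates the general idea of a windowed multi-label kNN that penalizes and removes stored examples whose labels disagree with newly arriving data. It is not MLSAMPkNN: the class name `PunitiveWindowKNN`, the error-count rule, and the fixed-size window are all assumptions, and the self-adjusting window size described in the abstract is not implemented.

```python
from collections import deque


class PunitiveWindowKNN:
    """Toy multi-label kNN over a sliding window with punitive removal."""

    def __init__(self, k=3, max_size=100, max_errors=2):
        self.k = k
        self.max_errors = max_errors
        self.window = deque(maxlen=max_size)

    def _neighbors(self, x):
        # k stored examples closest to x by squared Euclidean distance.
        def dist(item):
            return sum((a - b) ** 2 for a, b in zip(item["x"], x))
        return sorted(self.window, key=dist)[: self.k]

    def predict(self, x):
        near = self._neighbors(x)
        votes = {}
        for item in near:
            for label in item["labels"]:
                votes[label] = votes.get(label, 0) + 1
        # A label is predicted when more than half of the neighbors carry it.
        return {label for label, n in votes.items() if n > len(near) / 2}

    def learn(self, x, labels):
        labels = set(labels)
        # Penalize neighbors whose label set disagrees with the new example.
        for item in self._neighbors(x):
            if item["labels"] != labels:
                item["errors"] += 1
        # Drop examples that accumulated too many penalties.
        kept = (i for i in self.window if i["errors"] <= self.max_errors)
        self.window = deque(kept, maxlen=self.window.maxlen)
        self.window.append({"x": list(x), "labels": labels, "errors": 0})


model = PunitiveWindowKNN(k=1, max_size=10, max_errors=0)
model.learn((0.0, 0.0), {"a"})
model.learn((0.0, 0.1), {"b"})  # the older, conflicting example is removed
print(model.predict((0.0, 0.0)))
```

With `max_errors=0`, a single disagreement evicts a stored example, which mimics in an exaggerated way how stale examples can be purged after a concept change.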

A Unified Framework of Sparse Online Learning

The amount of data in our society has exploded in the era of big data. In this paper, we address several open challenges of big data stream classification, including high volume, high velocity, high dimensionality, high sparsity, and high class imbalance. Many existing studies in the data mining literature solve data stream classification tasks in a batch learning setting, which suffers from poor efficiency and scalability when dealing with big data. To overcome these limitations, this paper investigates an online learning framework for big data stream classification tasks. Unlike some existing online data stream classification techniques, which are often based on first-order online learning, we propose a framework of Sparse Online Classification (SOC) for data stream classification, which includes some state-of-the-art first-order sparse online learning algorithms as special cases and allows us to derive a new, effective second-order online learning algorithm for data stream classification. In addition, we propose a new cost-sensitive sparse online learning algorithm by extending the framework to tackle online anomaly detection tasks, where the class distribution of the data can be highly imbalanced. We also analyze the theoretical bounds of the proposed method, and finally conduct an extensive set of experiments, in which encouraging results validate the efficacy of the proposed algorithms in comparison to a family of state-of-the-art techniques on a variety of data stream classification tasks.
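
For readers unfamiliar with first-order sparse online learning, the following is a minimal sketch of one generic variant: an online hinge-loss gradient step followed by L1 soft-thresholding, which zeroes out small weights. It is not the SOC framework or its second-order algorithm; the function names, the learning rate `lr`, and the penalty `l1` are assumptions for illustration.

```python
def soft_threshold(value, amount):
    """Shrink value toward zero by amount, clipping at zero."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def sparse_online_update(weights, features, label, lr=0.1, l1=0.01):
    """One online step on a sparse example; label is +1 or -1.

    weights and features are dicts mapping feature index to value.
    """
    margin = label * sum(weights.get(i, 0.0) * v for i, v in features.items())
    if margin < 1.0:
        # Hinge loss is active: take a gradient step on the touched features.
        for i, v in features.items():
            weights[i] = weights.get(i, 0.0) + lr * label * v
    # L1 shrinkage keeps the model sparse by deleting tiny weights.
    for i in list(weights):
        shrunk = soft_threshold(weights[i], lr * l1)
        if shrunk == 0.0:
            del weights[i]
        else:
            weights[i] = shrunk
    return weights


model = sparse_online_update({}, {0: 1.0, 5: 0.05}, label=1, lr=0.1, l1=0.1)
print(model)  # only the strongly weighted feature survives the shrinkage
```

Storing weights in a dictionary means that memory grows with the number of non-zero weights, not with the full feature dimensionality.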

High-Utility Itemset Mining with Effective Pruning Strategies

High utility itemset mining is an important data mining problem that considers profit factors in addition to quantity in a transactional database. It helps to find the most valuable products or items, which are difficult to track using frequent itemset mining alone. An item may be rare in the transactional database yet carry a high profit value, which makes it highly important. Many existing algorithms for finding high utility itemsets generate comparatively large candidate sets; our main focus is to reduce the computation time significantly by introducing a pruning strategy. The pruning approach reduces both the number of unnecessary nodes visited in the search space and the time taken by the algorithm. In this paper, we propose an algorithm that constructs the candidate sets in the form of a tree structure and traverses itemsets with high transaction-weighted utility (HTWU). It uses a pruning strategy that avoids visiting unnecessary nodes, which shrinks the search space and reduces the computation time. It also significantly reduces the size of the transaction database generated at each node. Our experimental results show that it greatly reduces the execution time for high utility itemset mining.
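
The sketch below shows the standard transaction-weighted utility (TWU) bound that many high utility itemset miners use for pruning: an item whose TWU falls below the minimum utility cannot appear in any high utility itemset, so it can be discarded before the search. This is a generic illustration, not the tree-based algorithm of the paper; the function names, profit table, and transactions are hypothetical.

```python
def transaction_utility(transaction, profit):
    """Total utility of a transaction: sum of quantity times unit profit."""
    return sum(qty * profit[item] for item, qty in transaction.items())


def twu_per_item(transactions, profit):
    """TWU of an item: summed utility of all transactions containing it."""
    twu = {}
    for transaction in transactions:
        utility = transaction_utility(transaction, profit)
        for item in transaction:
            twu[item] = twu.get(item, 0) + utility
    return twu


def promising_items(transactions, profit, min_util):
    """Items that survive TWU pruning for the given minimum utility."""
    twu = twu_per_item(transactions, profit)
    return {item for item, value in twu.items() if value >= min_util}


# Hypothetical data: unit profits and item quantities per transaction.
profit = {"a": 5, "b": 1, "c": 2}
transactions = [{"a": 1, "b": 2}, {"b": 4, "c": 1}, {"c": 3}]

print(twu_per_item(transactions, profit))
print(promising_items(transactions, profit, min_util=10))
```

Because TWU is an upper bound on the utility of any itemset containing the item, pruning on it never discards a true high utility itemset, although it can keep candidates that later turn out to be low utility.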
