ACM Transactions on

Knowledge Discovery from Data (TKDD)

Latest Articles

Aspect Aware Learning for Aspect Category Sentiment Analysis

Aspect category sentiment analysis (ACSA) is an underexploited subtask in aspect level sentiment analysis. It aims to identify the sentiment of...

A Unified Framework with Multi-source Data for Predicting Passenger Demands of Ride Services

Ride-hailing applications have been offering convenient ride services for people in need. However, such applications still suffer from the issue of...

Interactive Recommendation with User-Specific Deep Reinforcement Learning

In this article, we study a multi-step interactive recommendation problem for explicit-feedback recommender systems. Different from the existing...


About TKDD 

ACM Transactions on Knowledge Discovery from Data (TKDD) publishes original archival papers in the area of knowledge discovery from data and closely related disciplines. The majority of the papers that appear in TKDD are expected to address the logical and technical foundations of knowledge discovery and data mining.

Forthcoming Articles

Digital Democracy: 25 Years Look Back, Look Ahead

AR2Net: An Attentive Neural Approach for Business Location Selection with Satellite Data and Urban Data

Business location selection is crucial to the success of businesses. Traditional approaches, such as manual surveys, investigate multiple factors, including foot traffic, neighborhood structure, and available workforce, which are typically hard to measure. In this paper, we propose to explore both satellite data (e.g., satellite images and nighttime light data) and urban data for the business location selection tasks of various businesses. We mine discriminative features from the two kinds of data and perform an empirical analysis to evaluate the direct relationship between the extracted features and the business popularity of locations. A novel neural network approach named R2Net is proposed to learn the deep interactions among features and predict the business popularity of locations. The approach is trained with a combined regression-and-ranking loss function to preserve both accurate popularity estimation and the ranking order of locations. To support location selection for multiple businesses, we propose the AR2Net approach with three attention modules, which enable the approach to focus on different latent features according to business type. Comprehensive experiments on real-world data verify the effectiveness of the satellite features and the superior performance of our models in terms of four metrics.

Generalizing Long Short-Term Memory Network for Deep Learning from Generic Data

The Long Short-Term Memory (LSTM) network is a popular deep learning model that is particularly useful for data with temporal correlation, such as texts, sequences, or time series, thanks to its recurrent structure designed to capture temporal correlation. In this paper, we propose to generalize LSTM to generic machine learning tasks where the training data hold no explicit temporal or sequential correlation. Our theme is to explore feature correlation in the original data and convert each instance into a synthetic sentence format by using a 2-gram probabilistic language model. More specifically, for each instance represented in the original feature space, our conversion first seeks to horizontally align the original features into a sequentially correlated feature vector, resembling the letter coherence within a single word. In addition, a vertical alignment is carried out to create multiple time points and simulate the sequential order of words (i.e., word correlation) within a sentence. The two-dimensional horizontal-and-vertical alignments not only ensure that feature correlations are maximally utilized, but also preserve the original feature values in the new representation. As a result, the LSTM model can achieve good classification accuracy even if the underlying data have no explicit temporal or sequential correlation. Experiments on 20 generic datasets confirm that applying LSTM to generic data yields a clear performance gain compared to conventional machine learning methods. This research opens a new paradigm that allows LSTM deep learning to be broadly applied to generic machine learning tasks.
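To make the horizontal-alignment idea concrete, the sketch below greedily chains features so that adjacent positions in the synthetic "sentence" are highly correlated. This is an illustrative stand-in for the paper's 2-gram language-model construction, assuming plain Pearson correlation as the pairing signal; the function name and the greedy rule are hypothetical.

```python
import numpy as np

def correlation_order(X):
    """Greedily order features so that adjacent features in the
    sequence are highly correlated (a stand-in for the 2-gram
    driven horizontal alignment described in the abstract)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(corr, -1.0)            # never pair a feature with itself
    d = X.shape[1]
    start = int(corr.max(axis=1).argmax())  # feature with the strongest single link
    order, used = [start], {start}
    while len(order) < d:
        cand = corr[order[-1]].copy()
        cand[list(used)] = -1.0             # mask already-placed features
        nxt = int(cand.argmax())
        order.append(nxt)
        used.add(nxt)
    return order

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# three features derived from one latent variable, plus one independent feature
X = np.hstack([base, base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1)), base * 2])
order = correlation_order(X)
print(order)  # e.g. a chain that keeps the correlated features 0, 1, 3 adjacent
```

In the toy data, features 0, 1 and 3 share a latent variable, so the greedy chain keeps them adjacent and pushes the independent feature to the end of the sequence.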

Linearization of Dependency and Sampling for Participation-based Betweenness Centrality in Very Large B-hypergraphs

A B-hypergraph, a generalization of the directed graph, consists of nodes and directed hyperedges. A directed hyperedge represents a relation from a set of source nodes to a single target node. We suggest one possible definition of betweenness centrality (BC) in B-hypergraphs, called PBC (Participation-based BC). The PBC score of a node is computed based on the number of shortest paths in which the node participates. This score can be calculated by using the dependency on the set of its outgoing hyperedges. In this paper, we focus on developing efficient computation algorithms for PBC. We first present an algorithm called ePBC for computing exact PBC scores of nodes, which has cubic time complexity. Since this algorithm can be used only for small-sized B-hypergraphs because of its cubic time complexity, we propose lPBC (linearized PBC), an approximation of ePBC. lPBC, whose error bound is guaranteed, is computed based on linearization of the dependency on a set of hyperedges. lPBC improves the computing time of ePBC by an order of magnitude (i.e., it requires quadratic time) while maintaining high accuracy. lPBC works well on small to medium-sized B-hypergraphs, but is not scalable enough for very large B-hypergraphs with more than a million hyperedges. To cope with such very large B-hypergraphs, we propose a very fast approximation algorithm, slPBC (sampling-based lPBC), that can probabilistically guarantee an approximation error by using the Rademacher complexity. We show through extensive experiments that lPBC and slPBC can efficiently approximate PBC scores in various real-world B-hypergraphs with reasonably good accuracy. The experimental results show that slPBC works efficiently even for a very large B-hypergraph.

Robust Drift Characterization from Event Streams of Business Processes

Process workers may vary the normal execution of a business process to adjust to changes in their operational environment, e.g. changes in workload, season or regulations. Changes may be simple, such as skipping an individual activity, or complex, such as replacing an entire procedure with another. Over time, these changes may negatively affect process performance; hence it is important to identify and understand them early on. As such, a number of techniques have been developed to detect process drifts, i.e. statistically significant changes in process behavior, from process event logs (offline) or event streams (online). However, detecting a drift without characterizing it, i.e. without providing explanations on its nature, is not enough to help analysts understand and rectify root causes for process performance issues. Existing approaches for drift characterization are limited to simple changes that affect individual activities. This paper contributes an efficient, accurate and noise-tolerant method for characterizing complex drifts affecting entire process fragments. The method, which works both offline and online, relies on two cornerstone techniques: one to automatically discover process trees from event streams (logs), the other to transform process trees using a minimum number of change operations. The operations identified are then translated into natural language statements to explain the change behind a drift. The method has been extensively evaluated on artificial and real-life datasets, and against a state-of-the-art baseline method.

Multi-User Mobile Sequential Recommendation for Route Optimization

We enhance the original mobile sequential recommendation (MSR) model to address several key issues by introducing three new forms: MMSR, MMSR-m, and MMSR-d. To enrich the properties of the pick-up points, including their locations and probabilities, we add pick-up capacities to the attributes of pick-up points. The MMSR model finds optimal routes for multiple users at different locations while disallowing overlapping recommended routes. The MMSR-m model addresses the capacity issue by assigning a pick-up capacity to each location, while the MMSR-d model allows the pick-up capacity to vary across locations. The MMSR model is numerically solved by parallel simulated annealing. The results confirm the superiority of our model and solution methods over several published benchmarks in the more demanding high-dimensional settings. Our proposed push-point method allows us to further improve our parallel algorithms for MMSR-m and MMSR-d to address more realistic city-level problems.
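As background on the solver, a bare-bones serial version of the simulated-annealing machinery could look like the sketch below, with plain route length standing in for the MMSR pick-up cost model. The actual solver is parallel and capacity-aware; all names and the cooling schedule here are illustrative assumptions.

```python
import math, random

def simulated_annealing(points, steps=20000, seed=0):
    """Generic simulated-annealing sketch of the kind used for the
    MMSR models: mutate a candidate route by swapping two pick-up
    points, and accept worse routes with a probability that decays
    as the temperature cools."""
    rng = random.Random(seed)
    def length(route):
        return sum(math.dist(points[route[i]], points[route[i + 1]])
                   for i in range(len(route) - 1))
    route = list(range(len(points)))
    best, best_len = route[:], length(route)
    for step in range(steps):
        temp = 1.0 * (1 - step / steps) + 1e-9       # linear cooling
        i, j = rng.sample(range(len(points)), 2)
        cand = route[:]
        cand[i], cand[j] = cand[j], cand[i]          # swap two pick-up points
        delta = length(cand) - length(route)
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            route = cand
            if length(route) < best_len:
                best, best_len = route[:], length(route)
    return best, best_len

points = [(0, 0), (5, 5), (0, 1), (5, 6), (0, 2)]    # two spatial clusters
best, best_len = simulated_annealing(points)
print(round(best_len, 2))  # near-optimal route over the two clusters
```

The paper runs many such chains in parallel and adds capacity constraints; this serial toy only shows the accept/reject mechanics.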

CFOF: A Concentration Free Measure for Anomaly Detection

We present a novel notion of outlier, called the Concentration Free Outlier Factor, or CFOF. As a main contribution, we formalize the notion of concentration of outlier scores and theoretically prove that CFOF does not concentrate in the Euclidean space for arbitrarily large dimensionality. To the best of our knowledge, there are no other proposals of data analysis measures related to the Euclidean distance for which theoretical evidence of immunity to the concentration effect has been provided. We determine the closed form of the distribution of CFOF scores in arbitrarily large dimensionalities and show that the CFOF score of a point depends on its squared norm standard score and on the kurtosis of the data distribution, thus providing a clear and statistically founded characterization of this notion. Moreover, we leverage this closed form to provide evidence that the definition does not suffer from the hubness problem affecting other measures in high dimensions. We prove that the number of CFOF outliers coming from each cluster is proportional to cluster size and kurtosis, a property that we call semi-locality. We leverage these theoretical findings to shed light on properties of well-known outlier scores. Indeed, we determine that semi-locality characterizes existing reverse nearest neighbor-based outlier definitions, thus clarifying the exact nature of their observed local behavior. We also formally prove that classical distance-based and density-based outliers concentrate both for bounded and unbounded sample sizes and for fixed and variable values of the neighborhood parameter. We introduce the fast-CFOF algorithm for detecting outliers in large high-dimensional datasets. The algorithm has linear cost, supports multi-resolution analysis, and is embarrassingly parallel.
Experiments highlight that the technique is able to efficiently process huge datasets, to handle even large values of the neighborhood parameter, to avoid concentration, and to obtain excellent accuracy.
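As background on the score itself: one common formulation defines the CFOF of a point as the smallest neighborhood fraction k/n such that the point appears among the k nearest neighbors of at least a fraction rho of all points. The naive quadratic-time sketch below follows that formulation; it is not the fast-CFOF algorithm, which avoids materializing the full distance matrix.

```python
import numpy as np

def cfof(X, rho=0.1):
    """Naive O(n^2 log n) CFOF: the score of x is the smallest
    neighborhood fraction k/n such that x appears among the k
    nearest neighbors of at least rho*n points."""
    n = X.shape[0]
    # pairwise Euclidean distances
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # rank[y, x] = position of x in y's nearest-neighbor ordering (1-based)
    order = np.argsort(D, axis=1)
    rank = np.empty_like(order)
    rows = np.arange(n)[:, None]
    rank[rows, order] = np.arange(1, n + 1)[None, :]
    # x is in NN_k(y) iff rank[y, x] <= k, so take the ceil(rho*n)-th
    # smallest rank that each point x receives across all y
    t = int(np.ceil(rho * n))
    ks = np.sort(rank, axis=0)[t - 1, :]
    return ks / n

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(50, 2)), [[8.0, 8.0]]])  # one far-away point
scores = cfof(X, rho=0.2)
print(scores.argmax())  # the isolated point (index 50) gets the top CFOF score
```

An isolated point ranks far in every other point's neighbor ordering, so a large k is needed before enough points count it as a neighbor, which is exactly what drives its score up.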

A Unified Multi-view Clustering Algorithm using Multi-objective Optimization Coupled with Generative Model

There is a large body of work on multi-view clustering, which exploits multiple representations (or views) of the same input data for better convergence. These multiple views can come from multiple modalities (image, audio, text) or different feature subsets. Recently, multi-objective based multi-view clustering methods have surpassed the performance of single-objective based multi-view clustering techniques. One key problem is that it is difficult to select a single solution from the set of alternative partitionings generated by multi-objective techniques on the final Pareto optimal front. In this paper, we propose a novel multi-objective based multi-view clustering framework which overcomes this solution-selection problem. In particular, our proposed framework has three major components: (i) a multi-view based multi-objective algorithm, Multiview-AMOSA, for initial clustering of data points; (ii) a generative model for generating a combined solution having probabilistic labels; and (iii) the K-means algorithm for obtaining the final labels. As the first component, we adopt a recently developed multi-view based multi-objective clustering algorithm to generate different possible consensus partitionings of a given dataset, taking into account different views. A generative model is coupled with the first component to generate a single consensus partitioning after considering multiple solutions. It exploits the latent subsets of the non-dominated solutions obtained from the multi-objective clustering algorithm and combines them to produce a single probabilistically labeled solution. Finally, a simple clustering algorithm, namely K-means, is applied to the generated probabilistic labels to obtain the final cluster labels.
Experimental validation of our proposed framework is carried out over several benchmark datasets belonging to three different domains: UCI datasets, search result clustering datasets, and patient stratification datasets. It shows an improvement of around 2%-4% over different evaluation metrics in comparison to state-of-the-art methods.

Real-time Transportation Prediction Correction using Reconstruction Error in Deep Learning

In online complex systems such as transportation systems, real-time traffic prediction is an important task. For several reasons, the prediction produced by an offline-built model can be unreliable, so a real-time prediction correction strategy is important in this situation. In this paper, we propose a prediction correction strategy using the reconstruction error in a deep neural network. The reconstruction error reflects the model's ability at feature representation and thus determines the fitness of an input instance to the model. We first build the relationship between reconstruction error and prediction error. From the perspective of the prediction interval, we demonstrate that the reconstruction error is positively related to the prediction interval; thus the prediction result is more reliable when the reconstruction error is smaller. We then propose two mechanisms of real-time prediction correction using the reconstruction error. The data-driven prediction correction approach selects several training instances whose reconstruction errors are similar to that of the current instance and uses their average prediction error to correct the prediction result. The model-driven approach builds several component deep neural networks during training; the component training set for each network is selected according to the reconstruction errors of the training instances. For a test instance, it first computes the reconstruction error of the sample in each component network and then averages the results, weighted by the reconstruction error and prediction interval. The model-driven approach is thus a reconstruction-error-based deep neural network ensemble. Finally, a series of experiments demonstrates that reconstruction-error-based prediction correction approaches are effective in several transportation prediction problems, including traffic flow prediction on roads, traffic flow prediction at entrance and exit stations, and travel time prediction.
Besides the high overall accuracy, our approach also provides many observations on using the reconstruction error in transportation prediction.
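The data-driven correction rule can be sketched in a few lines. Here a PCA projection stands in for the deep model's encoder, and the prediction errors are synthetic; the function name and every numeric choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def data_driven_correction(X_tr, err_tr, x, pred, components, k=10):
    """Sketch of the data-driven correction rule: find the k training
    instances whose reconstruction error is closest to the query's,
    and shift the prediction by their mean prediction error."""
    def recon_err(Z):
        proj = Z @ components.T @ components       # project onto subspace and back
        return np.linalg.norm(Z - proj, axis=-1)
    r_tr = recon_err(X_tr)
    r_x = recon_err(x[None, :])[0]
    nearest = np.argsort(np.abs(r_tr - r_x))[:k]   # similar reconstruction errors
    return pred - err_tr[nearest].mean()           # correct by their average error

rng = np.random.default_rng(3)
X_tr = rng.normal(size=(100, 5))
# a rank-2 subspace fitted on the training data (top-2 principal directions)
_, _, Vt = np.linalg.svd(X_tr - X_tr.mean(0))
components = Vt[:2]
err_tr = rng.normal(0.5, 0.05, size=100)           # synthetic prediction errors
corrected = data_driven_correction(X_tr, err_tr, X_tr[0], pred=3.0,
                                   components=components)
print(corrected)  # close to 3.0 - 0.5 = 2.5
```

In the paper, the reconstruction error comes from a deep autoencoder-style network rather than PCA, but the neighbor-by-reconstruction-error selection is the same idea.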

Fast, Accurate and Provable Triangle Counting in Fully Dynamic Graph Streams

Given a stream of edge additions and deletions, how can we estimate the count of triangles in it? If we can store only a subset of the edges, how can we obtain unbiased estimates with small variances? Counting triangles (i.e., cliques of size three) in a graph is a classical problem with applications in a wide range of research areas, including social network analysis, data mining, and databases. Recently, streaming algorithms for triangle counting have been extensively studied since they can naturally be used for large dynamic graphs. However, existing algorithms cannot handle edge deletions or suffer from low accuracy. Can we handle edge deletions while achieving high accuracy? We propose ThinkD, which accurately estimates the counts of global triangles (i.e., all triangles) and local triangles associated with each node in a fully dynamic graph stream with edge additions and deletions. Compared to its best competitors, ThinkD is (a) Accurate: up to 4.3X more accurate within the same memory budget, (b) Fast: up to 2.2X faster for the same accuracy requirements, and (c) Theoretically sound: always maintaining unbiased estimates with small variances. As an application, we use ThinkD to detect suddenly emerging dense subgraphs, and we show its advantages over state-of-the-art methods.
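The core sampling idea can be sketched compactly: keep each arriving edge with probability p, and scale the triangles each update closes within the sample by 1/p² so the running estimate stays unbiased in expectation. The class below is a simplified illustration in the spirit of ThinkD, not the authors' exact estimator, and all names are assumed.

```python
import random
from collections import defaultdict

class SampledTriangleCounter:
    """Simplified sketch of sampling-based triangle counting in a
    fully dynamic stream: each edge is kept with probability p, and
    every arriving addition/deletion updates a 1/p^2-scaled count of
    the triangles it closes inside the sample."""
    def __init__(self, p=0.3, seed=0):
        self.p = p
        self.adj = defaultdict(set)   # sampled adjacency lists
        self.estimate = 0.0
        self.rng = random.Random(seed)

    def _common(self, u, v):
        return len(self.adj[u] & self.adj[v])

    def add(self, u, v):
        self.estimate += self._common(u, v) / self.p ** 2
        if self.rng.random() < self.p:       # sample the new edge
            self.adj[u].add(v)
            self.adj[v].add(u)

    def delete(self, u, v):
        self.estimate -= self._common(u, v) / self.p ** 2
        self.adj[u].discard(v)               # drop the edge if it was sampled
        self.adj[v].discard(u)

counter = SampledTriangleCounter(p=1.0)      # p=1 keeps every edge: exact count
for e in [(0, 1), (1, 2), (0, 2), (2, 3), (0, 3)]:
    counter.add(*e)
print(counter.estimate)  # 2.0: triangles {0,1,2} and {0,2,3}
```

With p = 1 the sample is the whole graph and the mechanics are easy to verify; smaller p trades accuracy for memory, which is where ThinkD's unbiasedness and variance analysis come in.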

Pop Music Generation: from Melody to Multi-style Arrangement

Music plays an important role in our daily life. With the development of deep learning and modern generation techniques, researchers have done a great deal of work on automatic music generation. However, due to the special requirements of both melody and arrangement, most of these methods have limitations when applied to song generation. Some critical factors related to the quality of a song are not well addressed, such as chord progression, rhythm pattern, and musical style. In order to tackle these problems and ensure the harmony of multi-track music, in this paper we propose an end-to-end melody and arrangement generation framework to generate a melody track with several accompaniment tracks played by different instruments. To be specific, we first develop a novel Chord based Rhythm and Melody Cross-Generation Model (CRMCG) to generate melody with a chord progression. Then, we propose a Multi-Instrument Co-Arrangement Model (MICA) based on multi-task learning for multi-track music arrangement. Furthermore, to control the musical style of the arrangement, we design a Multi-Style Multi-Instrument Co-Arrangement Model (MSMICA) to learn musical style with adversarial training. Therefore, we can not only maintain the harmony of the generated music but also control its musical style for better utilization. Extensive experiments on a real-world dataset demonstrate the superiority and effectiveness of our proposed models.

Learning Distance Metrics from Probabilistic Information

The goal of metric learning is to learn a good distance metric that can capture the relationships among instances, and its importance has long been recognized in many fields. An implicit assumption in the traditional settings of metric learning is that the associated labels of the instances are deterministic. However, in many real-world applications, the associated labels come naturally with probabilities instead of deterministic values, which makes it difficult for the existing metric learning methods to work well in these applications. To address this challenge, in this article, we study how to effectively learn the distance metric from datasets that contain probabilistic information, and then propose several novel metric learning mechanisms for two types of probabilistic labels, i.e., the instance-wise probabilistic label and the group-wise probabilistic label. Compared with the existing metric learning methods, our proposed mechanisms are capable of learning distance metrics directly from the probabilistic labels with high accuracy. We also theoretically analyze the proposed mechanisms and conduct extensive experiments based on real-world datasets to verify the desirable properties of these mechanisms.

Budget-Constrained Real-Time Bidding Optimization: Multiple Predictors Make It Better

In this paper, we pursue a better solution to a promising problem, bidding strategy design, in the Real-Time Bidding (RTB) advertising environment. Under a budget constraint, the design of an optimal strategy for bidding on each incoming impression opportunity aims to acquire as many clicks as possible during an ad campaign. State-of-the-art bidding algorithms rely on a single predictor, the click-through rate predictor, to calculate the bidding value for each impression. This provides reasonable performance if the predictor has appropriate accuracy in predicting the probability of a user clicking. However, the classical methods usually fail to reach optimal results since the predictor can only give moderate accuracy. We improve the situation by incorporating an additional winning-price predictor into the bidding process. In this paper, an algorithm combining the powers of multiple prediction models is developed. It emerges from an analogy to the online stochastic knapsack problem, and the efficiency of the algorithm is also theoretically analyzed. Experiments conducted on real-world RTB datasets show that the proposed solution performs better with regard to both the number of clicks achieved and the effective cost per click under many different budget constraints.
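A toy two-predictor bidding rule illustrates why a winning-price predictor helps: it lets the bidder filter impressions by predicted clicks-per-dollar, an online-knapsack-style decision that a lone CTR predictor cannot make. This is an illustration under assumed names and constants, not the paper's algorithm.

```python
def make_bidder(budget, threshold):
    """Toy two-predictor bidding rule: bid slightly above the
    predicted winning price, but only when the predicted
    clicks-per-dollar efficiency clears a threshold and the
    remaining budget covers the expected payment."""
    state = {"spent": 0.0}
    def bid(ctr_pred, price_pred):
        efficiency = ctr_pred / max(price_pred, 1e-9)
        if efficiency < threshold or state["spent"] + price_pred > budget:
            return 0.0                       # skip this impression
        state["spent"] += price_pred         # assume we pay about the predicted price
        return price_pred * 1.05             # small margin over the predicted win price
    return bid

bid = make_bidder(budget=10.0, threshold=0.002)
print(bid(ctr_pred=0.01, price_pred=2.0))   # efficient impression: bids 2.1
print(bid(ctr_pred=0.001, price_pred=3.0))  # inefficient impression: bids 0.0
```

In the paper the threshold is not fixed but derived from the online stochastic knapsack analysis; the sketch only shows the role each predictor plays.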

Adaptive Local Linear Discriminant Analysis

Dimensionality reduction plays a significant role in high-dimensional data processing, and Linear Discriminant Analysis (LDA) is a widely used supervised dimensionality reduction approach. However, a major drawback of LDA is that it is incapable of extracting the local structure information which is crucial for handling multimodal data. In this paper, we propose a novel supervised dimensionality reduction method named Adaptive Local Linear Discriminant Analysis (ALLDA), which adaptively learns a k-NN graph from the data itself to extract the local connectivity of the data. Furthermore, the original high-dimensional data usually contain noise and redundant features, which has a negative impact on the evaluation of neighborhoods and degrades the subsequent classification performance. To address this issue, our method learns the similarity matrix and updates the subspace simultaneously, so that neighborhoods are evaluated in the optimal subspace where the noise has been removed. Through the optimal graph embedding, the underlying intra-class sub-manifolds of the data can be extracted precisely. Meanwhile, an efficient iterative optimization algorithm is proposed to solve the minimization problem. Promising experimental results on synthetic and real-world datasets demonstrate the effectiveness of the proposed method.

Novel Probabilistic Topic Modeling for Discovering Common and Distinctive Topics from Multiple Datasets

Probabilistic topic models have been extensively studied to discover hidden patterns in document corpora. However, rather than knowledge merely learned from a single data collection, many real-world application domains demand a comprehensive understanding of the relationships among various document collections. To address such needs, this paper proposes a model that can identify the common and discriminative aspects of multiple datasets. Specifically, our proposed method takes a Bayesian approach that represents each document as a combination of common topics (shared across all document sets) and distinctive topics (distributions over words that are exclusive to a particular dataset). Through extensive experiments, our method demonstrates its effectiveness compared with existing models, and confirms its utility as a practical tool for comparative thinking analysis in real-world cases.

Towards an optimal outdoor advertising placement: when a budget constraint meets moving trajectories

In this paper we propose and study the problem of trajectory-driven influential billboard placement: given a set of billboards U (each with a location and a cost), a database of trajectories T, and a budget L, find a set of billboards within the budget that influences the largest number of trajectories. One core challenge is to identify and reduce the overlap of influence from different billboards on the same trajectories, while taking the budget constraint into consideration. We show that this problem is NP-hard and present an enumeration-based algorithm with a (1 - 1/e) approximation ratio. However, the enumeration would be very costly when |U| is large. By exploiting the locality property of billboards' influence, we propose a partition-based framework, PartSel. PartSel partitions U into a set of small clusters, computes the locally influential billboards for each cluster, and merges them to generate the global solution. Since the local solutions can be obtained much more efficiently than the global one, PartSel greatly reduces the computation cost; meanwhile, it achieves a non-trivial approximation ratio guarantee. We then propose a LazyProbe method to further prune billboards with low marginal influence, while achieving the same approximation ratio as PartSel. Next, we propose a branch-and-bound method to eliminate unnecessary enumerations in both PartSel and LazyProbe, as well as an aggregated index to speed up the computation of marginal influence. Experiments on real datasets verify the efficiency and effectiveness of our methods.
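The underlying combinatorial problem is budgeted maximum coverage, for which the textbook baseline is a cost-effectiveness greedy: repeatedly pick the billboard with the best marginal trajectories-per-cost ratio. The sketch below is that baseline only, with hypothetical data; it is far simpler than the paper's PartSel and LazyProbe methods.

```python
def greedy_billboards(billboards, budget):
    """Cost-effectiveness greedy for budgeted trajectory coverage:
    repeatedly take the billboard with the highest marginal
    trajectories-covered-per-cost ratio that still fits the budget."""
    chosen, covered, spent = [], set(), 0
    remaining = dict(billboards)          # id -> (cost, set of trajectory ids)
    while remaining:
        def gain(item):
            _, (cost, traj) = item
            return len(traj - covered) / cost
        bid, (cost, traj) = max(remaining.items(), key=gain)
        remaining.pop(bid)
        if spent + cost > budget or not (traj - covered):
            continue                      # too expensive or no new coverage
        chosen.append(bid)
        covered |= traj
        spent += cost
    return chosen, covered, spent

billboards = {
    "b1": (4, {1, 2, 3, 4}),
    "b2": (2, {3, 4, 5}),
    "b3": (2, {6, 7}),
    "b4": (5, {1, 2, 3, 4, 5, 6, 7}),
}
chosen, covered, spent = greedy_billboards(billboards, budget=6)
print(chosen, len(covered), spent)  # ['b2', 'b3'], 5 trajectories, cost 4
```

Note that the greedy picks b2 and b3 (5 trajectories) and misses b4, which alone covers all 7 within budget. This kind of failure is exactly why the (1 - 1/e) guarantee in the paper requires the extra enumeration step rather than plain greedy.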

New algorithms of feature selection and big data assignment for CBR system integrated by Bayesian network

Under big data, the integrated system of case-based reasoning (CBR) and Bayesian network (BN) has exhibited great advantages in implementing intelligent engineering applications in many domains. To construct a more accurate hybrid model of CBR and BN, this paper proposes the Probability Change Measurement of Solution Parameters (PCMSP) algorithm, based on the Hadoop platform, to conduct feature selection. The PCMSP algorithm selects principal problem features according to their effects upon all solution features, measured by calculating the weighted relative probability change of all solution features caused by each problem feature. Thus the PCMSP algorithm works well under big data regardless of how complex the data types are or how large the data size is. To enhance the efficiency of the integrated system, this paper also proposes the Half-Division-Cross (HDC) algorithm to assign the computation task over big data. The HDC algorithm assigns big data by grouping all the problem parameters into many small sub-groups and then distributing the data covering the same sub-group of problem parameters to a slave node. The HDC algorithm can guarantee sufficient efficiency of the integrated system under big data no matter how large the number of problem parameters is. Finally, extensive experiments are conducted to validate the proposed method.

Inferring Lifetime Status of Point-of-Interest: A Multitask Multiclass Approach

A Point-of-Interest (POI) refers to a specific location that people may find useful or interesting. In modern cities, a large number of POIs emerge, grow, stabilize for a period, and finally disappear. The stages of this process (e.g., emerge, grow) are called the lifetime statuses of a POI. While a large body of research has focused on identifying and recommending POIs, there are few studies on inferring the lifetime status of POIs. Indeed, the predictive analytics of POI lifetime status can be helpful for various tasks, such as urban planning, business site selection, and real estate evaluation. In this paper, we propose a data-driven approach, named iPLTS, for inferring POI lifetime status using multiple data sources and a multitask learning framework. Specifically, we first define three types of POI lifetime status: booming, decaying, and stable. Then we formulate a serial classification problem to predict the successive lifetime statuses of POIs over time. Leveraging geographical data and human mobility data, we examine and integrate three aspects of features related to the prosperity of POIs: region popularity, region demands, and peer competitiveness. Next, since booming/decaying POIs are rare in practice, we perform stable class decomposition to relieve the imbalance between stable POIs and booming/decaying POIs. Moreover, a POI lifetime status classifier is developed by exploring the multitask learning framework as well as multiclass kernel-based vector machines. Finally, we perform extensive experiments using large-scale, real-world datasets of New York City. The experimental results validate the effectiveness of our proposed method in automatically inferring POI lifetime status.

Multi-task information bottleneck co-clustering for unsupervised cross-view human action categorization

The widespread adoption of low-cost cameras generates massive amounts of video recorded from different viewpoints every day. To cope with this vast amount of unlabeled and heterogeneous data, a new multi-task information bottleneck co-clustering (MIBC) approach is proposed to automatically categorize human actions in unlabeled cross-view video collections. Our motivation is that, if learning action categories from each view is seen as a single task, it is reasonable to assume that the tasks of learning action patterns from videos recorded by multiple cameras are dependent and inter-related, since the actions of the same subjects recorded synchronously from different camera viewpoints are complementary to each other. MIBC aims to transfer the shared view knowledge across multiple tasks (i.e., camera viewpoints) to boost the performance of each task. Specifically, MIBC involves two parts: 1) extracting action categories for each task by maintaining its own relevant information independently; and 2) compressing the feature representations of all tasks into a common feature space, which is utilized to capture the relatedness of the tasks and transfer the shared knowledge across different camera viewpoints. These two parts work simultaneously and can be solved by a novel co-clustering mechanism. Our experimental evaluation on several cross-view action collections shows that the MIBC algorithm outperforms existing state-of-the-art baselines.

Multiple Set Matching with Bloom Matrix and Bloom Vector

A Bloom filter is a space-efficient probabilistic data structure for checking elements' membership in a set. Given multiple sets, a standard Bloom filter is not sufficient when looking for the sets to which an element or a set of input elements belongs. An example case is searching documents with keywords in a large text corpus; this is essentially a multiple set matching problem where the input is one or more keywords and the result is a set of candidate documents. In this article, we solve this multiple set matching problem by proposing two efficient Bloom Multifilters, called Bloom Matrix and Bloom Vector. Both are space efficient and answer queries with a set of identifiers for multiple set matching problems. We show that the space efficiency can be optimized further according to the distribution of labels among multiple sets: Uniform and Zipf. While both structures are space efficient, Bloom Vector can efficiently exploit the Zipf distribution of data for further space reduction. Both structures are much more space efficient than the state-of-the-art alternative, Bloofi. Our results also highlight that Lookup operations on Bloom Matrix are significantly faster than on Bloom Vector and Bloofi.
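A toy version of the matrix idea makes the lookup semantics concrete: keep one bit-row per hash bucket and one bit-column per set, and answer a lookup by ANDing the rows a key hashes to. This is an assumed simplification of the paper's Bloom Matrix, with hypothetical parameters, and it inherits the usual Bloom guarantees: possible false positives, no false negatives.

```python
import hashlib

class BloomMatrix:
    """Toy Bloom Matrix sketch: one bit-row per hash bucket, one
    bit-column per set. Lookup ANDs the rows a key hashes to,
    yielding a bitmask of candidate sets."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.rows = [0] * m                 # each row is an integer bitmask over sets

    def _buckets(self, key):
        for i in range(self.k):             # k independent hashes via salting
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def insert(self, key, set_id):
        for b in self._buckets(key):
            self.rows[b] |= 1 << set_id

    def lookup(self, key):
        mask = ~0
        for b in self._buckets(key):
            mask &= self.rows[b]            # key must be present in every bucket row
        return {j for j in range(64) if mask >> j & 1}   # assumes at most 64 sets

bm = BloomMatrix()
for word, doc in [("bloom", 0), ("filter", 0), ("bloom", 1), ("matrix", 2)]:
    bm.insert(word, doc)
print(bm.lookup("bloom"))   # contains {0, 1}; lookup("matrix") contains {2}
```

A single query thus returns all candidate documents for a keyword in one pass, which is the multiple-set-matching behavior the abstract describes.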

Efficient Ridesharing Framework for Ride-Matching via Heterogeneous Network Embedding

In this paper, we present a network embedding approach for better understanding the ridesharing behaviour of passengers. A major issue in ridesharing is the accurate assignment of passengers to drivers, and how to maximize the number of rides shared between people belonging to different travel groups has become an increasingly popular research topic. Ride-matching faces two major challenges: scalability and sparsity. Contrary to existing approaches that merely depend on the proximity between passengers and drivers, we employ a heterogeneous network to learn the latent semantics of different choices in two types of ridesharing, and extract features in terms of user trajectories and sentiment. We propose RShareForm, a novel ridesharing framework which encodes not only the objects but also a variety of semantic relationships between them. This paper extends the existing skip-gram model to incorporate meta-paths over the proposed heterogeneous network. It allows diverse features to be used to search for similar participants and then ranks them to improve the quality of ride-matching. Extensive experiments on a large-scale dataset from DiDi in Chengdu, China show that by leveraging heterogeneous network embedding with meta-paths, RShareForm can significantly improve the accuracy of identifying participants for ridesharing over existing methods, including both meta-path guided similarity search methods and variants of embedding methods.

MP2 SDA: Multi-Party Parallelized Sparse Discriminant Learning

Sparse Discriminant Analysis (SDA) has been widely used to improve the performance of classical Fisher's Linear Discriminant Analysis in supervised metric learning, feature selection, and classification. With the increasing need for distributed data collection, storage, and processing, enabling sparse discriminant learning to embrace multi-party distributed computing environments has become an emerging research topic. This paper proposes a novel Multi-Party SDA algorithm, which can learn SDA models effectively without sharing any raw data or basic statistics among machines. The proposed algorithm 1) leverages the direct estimation of SDA to derive a distributed loss function for the discriminant learning, 2) parameterizes the distributed loss function with local/global estimates through bootstrapping, and 3) approximates a global estimate of the linear discriminant projection vector by optimizing the "distributed bootstrapping loss function" with gossip-based stochastic gradient descent. Experimental results on both synthetic and real-world benchmark datasets show that our algorithm matches the performance of the aggregated SDA, and significantly outperforms the most recent distributed SDA in terms of accuracy and F1-score.
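
The gossip-based optimization pattern can be sketched in a few lines. The toy below uses a plain 1-D least-squares objective rather than the paper's distributed bootstrapping loss for SDA; it only illustrates how parties alternate local SGD steps on private data with pairwise parameter averaging, never exchanging raw data:

```python
import random

rng = random.Random(1)
true_w = 2.0
# Four parties, each holding 50 private (x, y) points with y = 2x.
parties = [[(x, true_w * x) for x in (rng.uniform(-1, 1) for _ in range(50))]
           for _ in range(4)]
w = [0.0] * 4          # each party's local parameter estimate
lr = 0.1

for step in range(200):
    for i, data in enumerate(parties):
        x, y = rng.choice(data)
        grad = 2 * (w[i] * x - y) * x       # local SGD step on private data
        w[i] -= lr * grad
    i, j = rng.sample(range(4), 2)          # gossip: two random parties
    w[i] = w[j] = (w[i] + w[j]) / 2         # average their estimates

print(w)  # all four estimates converge near 2.0
```

Only parameter estimates cross machine boundaries, which is the property the paper builds on (its privacy analysis and bootstrapped loss are, of course, considerably more involved).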

Ranking from Crowdsourced Pairwise Comparisons via Smoothed Riemannian Optimization

The Social Internet of Things has recently become a promising paradigm for augmenting the capability of humans and devices connected in the network to provide services. In Social Internet of Things networks, crowdsourcing, which collects the intelligence of the human crowd, has served as a powerful tool for data acquisition and distributed computing. To support critical applications (e.g., recommendation systems and assessing the inequality of urban perception), in this paper we focus on collaborative ranking problems for user preference prediction from crowdsourced pairwise comparisons. Based on the Bradley-Terry-Luce (BTL) model, a maximum likelihood estimation (MLE) is proposed via a low-rank approach to estimate the underlying weight/score matrix and thereby predict the ranking list for each user. A novel regularized formulation with a smoothed surrogate of the elementwise infinity norm is proposed to address the unique challenge posed by the coupled non-smooth elementwise infinity-norm constraint and non-convex low-rank constraint in the MLE problem. We solve the resulting smoothed rank-constrained optimization problem by developing a Riemannian trust-region algorithm on quotient manifolds of fixed-rank matrices, which enjoys a superlinear convergence rate. Numerical results demonstrate the admirable performance and algorithmic advantages of the proposed method over state-of-the-art algorithms. Moreover, the proposed method outperforms state-of-the-art algorithms on large collaborative filtering datasets in both the success rate of inferring preferences and normalized discounted cumulative gain (NDCG).
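
The BTL model itself is easy to state: item i beats item j with probability exp(s_i) / (exp(s_i) + exp(s_j)). The sketch below fits a single score vector by plain gradient ascent on the BTL log-likelihood, using invented comparison counts; the paper's contribution is the much harder low-rank *matrix* version over many users, which this does not attempt:

```python
import math

# Each pair (winner, loser); counts are invented for the example.
comparisons = [(0, 1)] * 8 + [(1, 0)] * 2 + [(1, 2)] * 7 + [(2, 1)] * 3
s = [0.0, 0.0, 0.0]           # BTL scores, one per item

for _ in range(500):
    grad = [0.0] * 3
    for win, lose in comparisons:
        # P(win beats lose) under BTL
        p = 1.0 / (1.0 + math.exp(s[lose] - s[win]))
        grad[win] += 1 - p    # d log-likelihood / d s_win
        grad[lose] -= 1 - p
    for k in range(3):
        s[k] += 0.05 * grad[k]

print(s)  # s[0] > s[1] > s[2], matching the observed win rates
```

Since item 0 beats item 1 in 8 of 10 comparisons and item 1 beats item 2 in 7 of 10, the recovered scores order the items accordingly.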

Core Decomposition in Multilayer Networks: Theory, Algorithms, and Applications

Multilayer networks are a powerful paradigm to model complex systems, where various relations might occur among the same set of entities. Despite the keen interest in a variety of problems, algorithms, and analysis methods for this type of network, the problem of extracting dense subgraphs has remained largely unexplored. As a first step in this direction, in this work we study the problem of core decomposition of a multilayer network. Unlike the single-layer counterpart, in which cores are all nested into one another and can be computed in linear time, the multilayer context is much more challenging, as no total order exists among multilayer cores; rather, they form a lattice whose size is exponential in the number of layers. In this setting we devise three algorithms that differ in how they visit the core lattice and in their pruning techniques. We then move a step forward and study the problem of extracting only the maximal or, as we call them in this work, inner-most cores, i.e., the cores that are not dominated by any other core in terms of their index on all the layers. As inner-most cores are orders of magnitude fewer than the full set of cores, it is desirable to develop algorithms that effectively exploit the maximality property and extract inner-most cores directly. Moreover, we showcase an application of the multilayer core-decomposition tool to the problem of densest-subgraph extraction from multilayer networks. We introduce a definition of multilayer densest subgraph that trades off high density against the number of layers in which the high density holds, and show how multilayer core decomposition can be exploited to approximate this problem with quality guarantees. As further applications, we exploit multilayer core decomposition to speed up the extraction of frequent cross-graph quasi-cliques and to generalize the community-search problem to the multilayer setting.
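
One cell of the core lattice is simple to compute by peeling: a (k1, k2)-core of a two-layer graph keeps only vertices whose degree is at least k_l in every layer l. The sketch below (a naive illustration, not one of the paper's lattice-traversal algorithms) peels violators until a fixed point, on an invented two-layer graph:

```python
def multilayer_core(layers, ks):
    # layers: list of adjacency dicts; ks: per-layer degree threshold.
    # Repeatedly remove any vertex below its threshold in some layer.
    nodes = set().union(*(set(layer) for layer in layers))
    changed = True
    while changed:
        changed = False
        for layer, k in zip(layers, ks):
            for v in list(nodes):
                deg = sum(1 for u in layer.get(v, []) if u in nodes)
                if deg < k:
                    nodes.discard(v)
                    changed = True
    return nodes

# Two layers over the same vertices; 'd' is weakly connected.
layer1 = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}
layer2 = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"], "d": []}
print(sorted(multilayer_core([layer1, layer2], [2, 2])))  # ['a', 'b', 'c']
```

Enumerating all vectors (k1, ..., kL) this way is exponential in the number of layers, which is exactly why the paper's lattice-visiting and pruning strategies, and the direct extraction of inner-most cores, matter.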

CenEEGs: Valid EEG Selection for Classification

This paper explores valid brain electroencephalography (EEG) selection for EEG classification with different classifiers, which has rarely been addressed in previous studies and is mostly ignored by existing EEG processing methods and applications. Importantly, traditional selection methods are not able to select valid EEG signals for different classifiers. This paper focuses on source control-based valid EEG selection to reduce the impact of invalid EEG signals and to improve EEG-based classification performance for different classifiers. We introduce a novel centroid-based EEG selection approach named CenEEGs, which uses a scaling- and shifting-invariant similarity metric to measure the similarities of EEG signals and then applies a globally optimal centroid strategy to select valid EEG signals with respect to a similarity threshold. A detailed comparison with several state-of-the-art time series selection methods, using standard criteria on 8 EEG datasets, demonstrates the efficacy and superiority of CenEEGs for different classifiers.
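
The selection pattern can be illustrated with a simple stand-in metric: z-normalizing each signal makes plain correlation invariant to scaling and shifting, the signal with the highest total similarity serves as centroid, and signals below a similarity threshold to it are dropped as invalid. CenEEGs' actual metric and globally optimal centroid search are more elaborate; the signals and threshold below are invented:

```python
import math

def znorm(x):
    # Subtract the mean and divide by the standard deviation, so that
    # correlation becomes invariant to scaling and shifting.
    m = sum(x) / len(x)
    sd = math.sqrt(sum((v - m) ** 2 for v in x) / len(x)) or 1.0
    return [(v - m) / sd for v in x]

def similarity(x, y):
    return sum(a * b for a, b in zip(znorm(x), znorm(y))) / len(x)

def select_valid(signals, threshold):
    # Centroid = signal with the highest total similarity to all others.
    centroid = max(signals,
                   key=lambda s: sum(similarity(s, t) for t in signals))
    return [s for s in signals if similarity(s, centroid) >= threshold]

signals = [[1, 2, 3, 4], [2, 4, 6, 8], [10, 11, 12, 13], [4, 1, 3, 2]]
kept = select_valid(signals, 0.9)
print(len(kept))  # the three linearly increasing signals survive
```

Note how the second and third signals, though scaled and shifted versions of the first, are treated as identical to it, while the uncorrelated fourth signal is filtered out.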

Bradykinesia Recognition in Parkinson's Disease via Single RGB Video

Parkinson's disease is a progressive nervous system disorder afflicting millions of patients. Among its motor symptoms, bradykinesia is one of the cardinal manifestations. Experienced doctors are required for the clinical diagnosis of bradykinesia, but even they sometimes miss subtle changes, especially in the early stages of the disease. Therefore, developing auxiliary diagnostic methods that can automatically detect bradykinesia has received increasing attention. In this paper, we employ a two-stage framework for bradykinesia recognition based on video of patient movement. First, convolutional neural networks are trained to localize keypoints in each video frame. These time-varying coordinates form motion trajectories that represent the whole movement. From the trajectory, we then propose novel measurements, namely stability, completeness, and self-similarity, to quantify different motor behaviors. We also propose a periodic motion model called PMNet, in which an encoder-decoder structure learns a low-dimensional representation of a motion process. The compressed motion process and quantified motor behaviors are combined as inputs to a fully connected neural network. Unlike traditional means, our solution extends the application scenario beyond the hospital and can easily be transplanted to similar tasks. A commonly used clinical assessment serves as a case study. Experimental results based on real-world data validate the effectiveness of our approach for bradykinesia recognition.

Data Sharing via Differentially Private Coupled Matrix Factorization

We address the privacy-preserving data sharing problem in a distributed multiparty setting. In this setting, each data site owns a distinct part of a dataset, and the aim is to estimate the parameters of a statistical model conditioned on the complete data without any site revealing any information about the individuals in its own part. The sites want to maximize the utility of the collective data analysis while providing privacy guarantees for their own portion of the data as well as for each participating individual. Our first contribution is to classify these different privacy requirements as i) site-level and ii) user-level differential privacy, and to present formal privacy guarantees for the two cases under the model of differential privacy. To satisfy a stronger form of differential privacy, we use local differential privacy, a variant in which the sensitive data is perturbed with a randomized response mechanism prior to estimation. In this study, we assume that the data instances partitioned between the parties are arranged as matrices; a natural statistical model for this distributed scenario is coupled matrix factorization. We present two generic frameworks for privatizing Bayesian inference for coupled matrix factorization models that can guarantee the proposed differential privacy notions according to the privacy requirements of the model. To privatize Bayesian inference, we first exploit the connection between differential privacy and sampling from a Bayesian posterior via Stochastic Gradient Langevin Dynamics, and then derive an efficient coupled matrix factorization method. In the local privacy context, we propose two models with an additional privatization mechanism that achieves a stronger measure of privacy, and introduce a Gibbs sampling-based algorithm. We demonstrate that the proposed methods provide good prediction accuracy on synthetic and real datasets while adhering to the introduced privacy constraints.
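
The Langevin-dynamics ingredient is worth seeing in isolation: a gradient step on the log-posterior plus Gaussian noise of matching scale yields approximate posterior samples, and that injected noise is what the differential-privacy connection builds on. The toy below samples the posterior mean of a Gaussian with known unit variance and a flat prior, using full-data gradients for clarity; the paper's coupled-factorization gradients are far more complex but follow the same update rule:

```python
import random, math

rng = random.Random(0)
data = [rng.gauss(5.0, 1.0) for _ in range(200)]

theta, eps = 0.0, 0.001
samples = []
for t in range(3000):
    # Gradient of log p(data | theta) for unit-variance Gaussian likelihood
    grad = sum(x - theta for x in data)
    # Langevin step: half-step on the gradient plus N(0, eps) noise
    theta += 0.5 * eps * grad + rng.gauss(0.0, math.sqrt(eps))
    if t >= 1000:                     # discard burn-in
        samples.append(theta)

print(sum(samples) / len(samples))    # close to the sample mean of data
```

With a flat prior the posterior mean coincides with the sample mean, and the spread of the collected samples approximates the posterior standard deviation 1/sqrt(n).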

A Unified Framework of Sparse Online Learning

The amount of data in our society has been exploding in the era of big data. In this paper, we address several open challenges of big data stream classification: high volume, high velocity, high dimensionality, high sparsity, and high class imbalance. Many existing studies in the data mining literature solve data stream classification tasks in a batch learning setting, which suffers from poor efficiency and scalability when dealing with big data. To overcome these limitations, this paper investigates an online learning framework for big data stream classification tasks. Unlike existing online data stream classification techniques that are often based on first-order online learning, we propose a framework of Sparse Online Classification (SOC) for data stream classification, which includes some state-of-the-art first-order sparse online learning algorithms as special cases and allows us to derive a new, effective second-order online learning algorithm for data stream classification. In addition, we propose a new cost-sensitive sparse online learning algorithm by extending the framework to tackle online anomaly detection tasks, where the class distribution of data can be highly imbalanced. We also analyze the theoretical bounds of the proposed method, and finally conduct an extensive set of experiments in which encouraging results validate the efficacy of the proposed algorithms compared with a family of state-of-the-art techniques on a variety of data stream classification tasks.
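
A first-order sparse online classifier of the kind the framework generalizes can be sketched in a few lines: each arriving example triggers one logistic-loss gradient step, followed by soft-thresholding that shrinks small weights to exactly zero, so the model stays sparse as the stream flows. The stream, rates, and regularizer below are invented for the example (the paper's second-order and cost-sensitive variants go well beyond this):

```python
import math, random

def soft_threshold(w, lam):
    # l1 proximal step: shrink every weight toward zero, zeroing small ones
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in w]

rng = random.Random(42)
d, lr, lam = 20, 0.1, 0.05
w = [0.0] * d
for _ in range(3000):                      # simulated data stream
    x = [rng.gauss(0, 1) for _ in range(d)]
    y = 1 if x[0] + x[1] > 0 else -1       # only two informative features
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    g = -y / (1.0 + math.exp(margin))      # logistic loss derivative
    w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    w = soft_threshold(w, lr * lam)        # keep the model sparse

print(round(w[0], 2), round(w[1], 2))      # informative weights dominate
```

Because each update touches one example and the weight vector only, memory and per-example cost are constant, which is the property that makes the online setting scale where batch learning does not.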
