Long Short-Term Memory (LSTM) network is a popular deep learning model, particularly useful for data with temporal correlation, such as texts, sequences, or time series data, thanks to its well sought recurrent structures designed to capture temporal correlation. In this paper, we propose to generalize LSTM to generic machine learning tasks where data used for training do not hold explicit temporal or sequential correlation. Our theme is to explore feature correlation in the original data and convert each instance into a synthetic sentence format by using a 2-gram probabilistic language model. More specifically, for each instance represented in the original feature space, our conversion first seeks to horizontally align original features into a sequential correlated feature vector, resembling to the letter coherence within each single word. In addition, a vertical alignment is also carried out to create multiple time points and simulates sequential order of words (i.e. word correlation) within a sentence. The two dimensional horizontal-and-vertical alignments not only ensure feature correlations are maximally utilized, but also preserve the original feature values in the new representation. As a result, LSTM model can be utilized to achieve good classification accuracy, even if the underlying data do not have explicit temporal or sequential correlation. Experiments on 20 generic datasets confirm that applying LSTM to generic data has a clear performance gain, compared to conventional machine learning methods. This research opens a new paradigm to allow LSTM deep learning to be broadly applied to generic machine learning tasks.
We enhance the original mobile sequential recommendation (MSR) model to address several key issues by introducing three new forms, MMSR, MMSR-m and MMSR-d. To enrich the properties of the pick-up points including the locations and probabilities, we add the pick-up capacities to the attributes of pick-up points. The MMSR model finds optimal routes for multiple users at different locations while disallowing overlapping recommended routes. The MMSR-m addresses the issue by assigning a pick-up capacity to each location while the MMSR-d model allows the pick-up capacity to vary at different locations. The MMSR model is numerically solved by parallel simulated annealing. The results confirmed the superiority of our model and the solution methods over several published benchmarks at the more demanding high-dimensions. Our proposed push-point method allows to further improve our parallel algorithms for MMSR-m and MMSR-d to address more realistic city-level problems.
Similarity measure is one of the fundamental task in heterogeneous information network analysis. It has been applied to many areas such as product recommendation, clustering and Web search. Most of the existing metrics can provide personalized services for users by taking a meta-path or meta-structure as input. However, these metrics may highly depend on the user-specified meta-path or meta-structure. In addition, users must know how to select an appropriate meta-path or meta-structure. In this paper, we propose a novel similarity measure in heterogeneous information networks, called Recurrent Meta-Structure-based Similarity (RMSS). The recurrent meta-structure as a schematic structure in heterogeneous information networks provides a unified framework for integrating all of the meta-paths and meta-structures, and can be constructed automatically by means of repetitively traversing the network schema. In order to formalize the semantics, the recurrent meta-structure is decomposed into several recurrent meta-paths and recurrent meta-trees, and we then define the commuting matrices of the recurrent meta-paths and meta-trees. All of these commuting matrices are combined together according to different weights. We propose two kinds of weighting strategies to determine the weights. The first is called the local weighting strategy which depends on the sparsity of the commuting matrices, and the second is called the global weighting strategy which depends on the strength of the commuting matrices. As a result, RMSS is defined by means of the weighted summation of the commuting matrices. Note that RMSS can also provide personalized services for users by means of the weights of the recurrent meta-paths and meta-trees. Experimental evaluations show that the proposed RMSS is robust and outperforms the existing metrics in terms of ranking and clustering task.
We present a novel notion of outlier, called the Concentration Free Outlier Factor, or CFOF. As a main contribution, we formalize the notion of concentration of outlier scores and theoretically prove that CFOF does not concentrate in the Euclidean space for any arbitrary large dimensionality. To the best of our knowledge, there are no other proposals of data analysis measures related to the Euclidean distance for which it has been provided theoretical evidence that they are immune to the concentration effect. We determine the closed form of the distribution of CFOF scores in arbitrarily large dimensionalities and show that the CFOF score of a point depends on its squared norm standard score and on the kurtosis of the data distribution, thus providing a clear and statistically founded characterization of this notion. Moreover, we leverage this closed form to provide evidence that the definition does not suffer of the hubness problem affecting other measures in high dimensions. We prove that the number of CFOF outliers coming from each cluster is proportional to cluster size and kurtosis, a property that we call semi-locality. We leverage theoretical findings to shed lights on properties of well-known outlier scores. Indeed, we determine that semi-locality characterizes existing reverse nearest neighbor-based outlier definitions, thus clarifying the exact nature of their observed local behavior. We also formally prove that classical distance-based and density-based outliers concentrate both for bounded and unbounded sample sizes and for fixed and variable values of the neighborhood parameter. We introduce the fast-CFOF algorithm for detecting outliers in large high-dimensional dataset. The algorithm has linear cost, supports multi-resolution analysis, and is embarrassingly parallel. Experiments highlight that the technique is able to efficiently process huge datasets and to deal even with large values of the neighborhood parameter, to avoid concentration, and to obtain excellent accuracy.
There is a large body of work on multi-view clustering which exploits multiple representations (or views) of the same input data for better convergence. These multiple views can come from multiple modalities (image, audio, text) or different feature subsets. Recently, multi-objective based multi-view clustering methods have suppressed the performance of single objective based multi-view clustering techniques. One key problem is that it is difficult to select a single solution from a set of alternative partitionings generated by multi-objective techniques on the final Pareto optimal front. In this paper, we propose a novel multi-objective based multi-view clustering framework which overcomes the problem of selecting a single solution in multi-objective based techniques. In particular, our proposed framework has three major components: (i) multi-view based multi-objective algorithm, Multiview-AMOSA, for initial clustering of data points; (ii) a generative model for generating a combined solution having probabilistic labels; and (iii) K-means algorithm for obtaining the final labels. As the first component, we have adopted a recently developed multi-view based multi-objective clustering algorithm to generate different possible consensus partitionings of a given dataset taking into account different views. A generative model is coupled with the first component to generate a single consensus partitioning after considering multiple solutions. It exploits the latent subsets of the non-dominated solutions obtained from the multi-objective clustering algorithm and combines them to produce a single probabilistic labeled solution. Finally, a simple clustering algorithm, namely K-means, is applied to the generated probabilistic labels to obtain the final cluster labels. Experimental validation of our proposed framework is carried out over several benchmark datasets belonging to three different domains; UCI datasets, search result clustering datasets and patient stratification datasets and it shows an improvement of around 2%-4% over different evaluation metrics in comparison to state-of-the art methods.
Many real systems produce network data or highly interconnected data, which can be called information networks. These information networks form a critical component in modern information infrastructure, constituting a large graph data volume. The analysis of information network data covers several technological areas, among them OLAP technologies. OLAP is a technology that enables multi-dimensional and multi-level analysis on a large volume of data, providing aggregated data visualizations with different perspectives. This paper presents a literature review on the main applications of OLAP technology in the analysis of information network data. To achieve such goal, it shows a systematic review to list the works that apply OLAP technologies in graph data. It defines seven comparison criteria (Materialization, Network, Selection, Aggregation, Model, OLAP Operations, Analytics) to qualify the works found based on their functionalities. The works are analyzed according to each criterion and discussed to identify trends and challenges in the application of OLAP in the information network.
Ensemble classifiers improve the classification performance by combining several classifiers using suitable fusion mythology. Many ensemble classifier generation methods have been developed that allowed the training of multiple classifiers on a single dataset. As such random subspace is a common methodology utilized by many state-of-the-art ensemble classifiers that generate random subsamples from the input data and train classifiers on different subsamples. Real-world datasets have randomness and noise in them therefore, not all randomly generated samples are suitable for training. In this paper, we propose a novel particle swarm optimization-based approach to optimize the random subspace to generate an ensemble classifier. We first generate a random subspace by incrementally clustering input data and then optimize all generated data clusters. On all optimized data clusters, a set of classifiers is trained and is added to the pool. The pool of classifiers is then optimized, and an optimized ensemble classifier is generated. The proposed approach is tested on 12 benchmark datasets from the UCI repository and results are compared with current state-of-the-art ensemble classifier approaches. A statistical significance test is also conducted, and an analysis is presented.
In online complex systems like transportation system, an importation work is real-time traffic prediction. Due to several reasons, the prediction result derived from an offline-built model would be unreliable. A real-time prediction correction strategy would be important under this situation. In this paper, we propose the prediction correction strategy using the reconstruction error in the deep neural network. The reconstruction error can reflect the model?s ability on feature representation and then determine the fitness of an input data to the model. We first build the relationship between reconstruction error and prediction error. From the perspective of the prediction interval, we demonstrate that the reconstruction error is in positive relation with the prediction interval. Thus the prediction result is more reliable when the reconstruction error is smaller. Then we propose two mechanisms of real-time prediction correction using the reconstruction error. The data driven prediction correction approach selects several training instances with similar reconstruction errors to the current instance and using their average prediction error in correcting the prediction result. The model driven approach builds several component deep neural networks in training. The component training set for each network is selected according to the reconstruction error of training instances. For a predicting instance, it first computes the reconstruction error of the sample in each component network and then averages the results by the reconstruction error and prediction interval. The model driven approach is actually a reconstruction error based deep neural network ensemble approach. Finally, a series of experiments demonstrated that reconstruction error based prediction correction approaches are effective in several prediction problems in transportation including traffic flow prediction on road, traffic flow prediction in entrance & exit station and travel time prediction. Besides the high overall accuracy, our approach can also provide many observations of using the reconstruction error in transportation prediction.
Big graphs (networks) arising in numerous application areas pose significant challenges for graph analysts as these graphs grow to billions of nodes and edges and are prohibitively large to fit in the main memory. Finding the number of triangles in a graph is an important problem in the mining and analysis of graphs. In this paper, we present two efficient MPI-based distributed memory parallel algorithms for counting triangles in big graphs. The first algorithm employs overlapping partitioning and efficient load balancing schemes to provide a very fast parallel algorithm. The algorithm scales well to networks with billions of nodes and can compute the exact number of triangles in a network with 10 billion edges in 16 minutes. The second algorithm divides the network into non-overlapping partitions leading to a space-efficient algorithm. Our results on both artificial and real-world networks demonstrate a significant space saving with this algorithm. We also present a novel approach that reduces communication cost drastically leading the algorithm to both a space- and runtime-efficient algorithm. Further, we demonstrate how our algorithms can be used to list all triangles in a graph and compute clustering coefficients of nodes. Our algorithm can also be adapted to a parallel approximation algorithm using an edge sparsification method.
The goal of metric learning is to learn a good distance metric that can capture the relationships among instances, and its importance has long been recognized in many fields. An implicit assumption in the traditional settings of metric learning is that the associated labels of the instances are deterministic. However, in many real-world applications, the associated labels come naturally with probabilities instead of deterministic values, which makes it difficult for the existing metric learning methods to work well in these applications. To address this challenge, in this article, we study how to effectively learn the distance metric from datasets that contain probabilistic information, and then propose several novel metric learning mechanisms for two types of probabilistic labels, i.e., the instance-wise probabilistic label and the group-wise probabilistic label. Compared with the existing metric learning methods, our proposed mechanisms are capable of learning distance metrics directly from the probabilistic labels with high accuracy. We also theoretically analyze the proposed mechanisms and conduct extensive experiments based on real-world datasets to verify the desirable properties of these mechanisms.
Community Detection in Small Networks: A New Approach to Graph Partitioning
Dimensionality reduction plays a significant role in high-dimensional data processing, and Linear Discriminant Analysis (LDA) is a widely used supervised dimensionality reduction approach. However, a major drawback of LDA is that it is incapable of extracting the local structure information which is crucial for handling multimodal data. In this paper, we propose a novel supervised dimensionality reduction method named Adaptive Local Linear Discriminant Analysis (ALLDA), which adaptively learns a KNN graph from data themselves to extract the local connectivity of data. Furthermore, the original high-dimensional data usually contain noises and redundant features, which has a negative impact on the evaluation of neighborships and degrades the subsequent classification performance. To address this issue, our method learns the similarity matrix and updates the subspace simultaneously so that the neighborships can be evaluated in the optimal subspaces where the noises have been removed. Through the optimal graph embedding, the underlying sub-manifolds of data in intra-class can be extracted precisely. Meanwhile, an efficient iterative optimization algorithm is proposed to solve the minimization problem. Promising experimental results on synthetic and real-world data sets are provided to evaluate the effectiveness of proposed method.
Probabilistic topic models have been extensively studied to discover hidden patterns in the document corpus. However, rather than knowledge merely learned from a single data collection, many real-world application domains demand a comprehensive understanding of the relationships among various document collections. To address such needs, this paper proposes a model that can identify the common and discriminative aspects of multiple data sets. Speci cally, our proposed method takes a Bayesian approach that represents each document as a combination of common topics (shared across all document sets) and distinctive topics (distributions over words that are exclusive to a particular data set). Through extensive experiments, our method demonstrates its effectiveness compared all existing models, and con rms its utility as a practical tool for comparative thinking analysis in real world cases.
In this paper we propose and study the problem of trajectory-driven influential billboard placement: given a set of billboards U (each with a location and a cost), a database of trajectories T and a budget L, find a set of billboards within the budget to influence the largest number of trajectories. One core challenge is to identify and reduce the overlap of the influence from different billboards to the same trajectories, while keeping the budget constraint into consideration. We show that this problem is NP-hard and present an enumeration based algorithm with (1 ? 1/e) approximation ratio. However, the enumeration would be very costly when |U | is large. By exploiting the locality property of billboards? influence, we propose a partition-based framework PartSel. PartSel partitions U into a set of small clusters, computes the locally influential billboards for each cluster, and merges them to generate the global solution. Since the local solutions can be obtained much more efficiently than the global one, PartSel would reduce the computation cost greatly; meanwhile it achieves a non-trivial approximation ratio guarantee. Then we propose a LazyProbe method to further prune billboards with low marginal influence, while achieving the same approximation ratio as PartSel. Next, we propose a branch-and -bound method to eliminate unnecessary enumerations in both PartSel and LazyProbe, as well as an aggregated index to speed up the computation of marginal influence. Experiments on real datasets verify the efficiency and effectiveness of our methods.
A Point-of-Interest (POI) refers to a specific location that people may find useful or interesting. In modern cities, a large number of POIs emerge, grow, stabilize for a period, then finally disappear. The stages (e.g. emerge, grow, etc.) of this process are called lifetime statuses of a POI. While a large body of research has been focused on identifying and recommending POIs, there are few studies on inferring the lifetime status of POIs. Indeed, the predictive analytics of POI lifetime status can be helpful for various tasks, such as urban planning, business site selection, and real estate evaluation. In this paper, we propose a data-driven approach, named iPLTS, to inferring the POI lifetime status with multiple data sources as well as multitask learning framework. Specifically, we first define three types of POI lifetime status, i.e. booming, decaying and stable. Then we formulate a serial classification problem to predict the successive lifetime statuses of POIs over time. Leveraging geographical data and human mobility data, we examine and integrate three aspects of features related to the prosperity of POIs, i.e. region popularity, region demands, and peer competitiveness. Next, since the booming/decaying POIs are very rare in common, we perform stable class decomposition to relieve the imbalance between stable POIs and booming/decaying POIs. Moreover, a POI lifetime status classifier is developed by exploring the multitask learning framework as well as the multiclass kernel-based vector machines. Finally, we perform extensive experiments using large-scale and real-world datasets of New York City. The experimental results validate the effectiveness of our proposed method in automatic inferring POI lifetime status.
Local community detection refers to finding the community that contains the given node based on local information, which becomes very meaningful when global information about the network is unavailable or expensive to acquire. Most work on local community detection focuses on finding non-overlapping communities. However, many real-world networks contain overlapping communities like social networks. Given an overlapping node that belongs to multiple communities, the problem is to find communities to which it belongs according to local information. We propose a framework for local overlapping community detection. The framework has three steps. First, find nodes in multiple communities to which the given node belongs. Second, select representative nodes from nodes obtained above, which tends to be in different communities. Third, discover the communities to which these representative nodes belong. In addition, to demonstrate the effectiveness of the framework, we implement six versions of this framework. Experimental results demonstrate that the six implementations versions outperform the previously existing algorithms.
We examine a new form of smooth approximation to the zero one loss in which learning is performed using a reformulation of the widely used logistic function. Our approach is based on using the posterior mean of a novel generalized Beta-Bernoulli formulation. This leads to a generalized logistic function that approximates the zero one loss, but retains a probabilistic formulation conferring a number of useful properties. The approach is easily generalized to kernel logistic regression and easily integrated into methods for structured prediction. We present experiments in which we learn such models using an optimization method consisting of a combination of gradient descent and coordinate descent using localized grid search so as to escape from local minima. Our experiments indicate that optimization quality is improved when learning meta-parameters are themselves optimized using a validation set. Our experiments show improved performance relative to widely used logistic and hinge loss methods on a wide variety of problems ranging from standard UC Irvine and libSVM evaluation datasets to product review predictions and a visual information extraction task. We observe that the approach: 1) is more robust to outliers compared to the logistic and hinge losses; 2) outperforms comparable logistic and max margin models on larger scale benchmark problems; 3) when combined with Gaussian-Laplacian mixture prior on parameters the kernelized version of our formulation yields sparser solutions than Support Vector Machine classifiers; and 4) when integrated into a probabilistic structured prediction technique our approach provides more accurate probabilities yielding improved inference and increasing information extraction performance.
Treatment effect plays an important role on decision making in many fields, such as social marketing, healthcare, and public policy. The key challenge on estimating treatment effect in the wild observational studies is to handle confounding bias induced by imbalance of the confounder distributions between treated and control units. Traditional methods remove confounding bias by re-weighting units with supposedly accurate propensity score estimation under the unconfoundedness assumption. Controlling high-dimensional variables may make the unconfoundedness assumption more plausible, but poses new challenge on accurate propensity score estimation. One strand of recent literature seeks to directly optimize weights to balance confounder distributions, bypassing propensity score estimation. But existing balancing methods fail to do selection and differentiation among the pool of a large number of potential confounders, leading to possible underperformance in many high dimensional settings. In this paper, we propose a data-driven Differentiated Confounder Balancing (DCB) algorithm to jointly select confounders, differentiate weights of confounders and balance confounder distributions for treatment effect estimation in the wild high dimensional settings. Besides, under some settings with heavy confounding bias, in order to further reduce the bias and variance of estimated treatment effect, we propose a Regression Adjusted Differentiated Confounder Balancing (RA-DCB) algorithm based on our DCB algorithm by incorporating outcome regression adjustment. The synergistic learning algorithm we proposed is more capable of reducing the confounding bias in many observational studies. To validate the effectiveness of our DCB and RA-DCB algorithms, we conduct extensive experiments on both synthetic and real datasets. The experimental results clearly demonstrate that our algorithms outperform the state-of-the-art methods. By incorporating regression adjustment, our RA-DCB algorithm achieves better performance than DCB algorithm, especially under the settings with heavy confounding bias. Moreover, We show that the top features ranked by our algorithm generate accurate prediction of online advertising effect.
Multilayer networks are a powerful paradigm to model complex systems, where various relations might occur among the same set of entities. Despite the keen interest in a variety of problems, algorithms, and analysis methods in this type of network, the problem of extracting dense subgraphs has remained largely unexplored. As a first step in this direction, in this work we study the problem of core decomposition of a multilayer network. Unlike the single-layer counterpart in which cores are all nested into one another and can be computed in linear time, the multilayer context is much more challenging as no total order exists among multilayer cores; rather, they form a lattice whose size is exponential in the number of layers. In this setting we devise three algorithms which differ in the way they visit the core lattice and in their pruning techniques. We then move a step forward and study the problem of extracting only the maximal or, as we call them in this work, the inner-most cores, i.e., the cores that are not dominated by any other core in terms of their index on all the layers. As inner-most cores are orders of magnitude less than all the cores, it is desirable to develop algorithms that effectively exploit the maximality property and extract inner-most cores directly. Moreover, we showcase an application of the multilayer core-decomposition tool to the problem of densest-subgraph extraction from multilayer networks. We introduce a definition of multilayer densest subgraph that trades-off between high density and number of layers in which the high density holds, and show how multilayer core decomposition can be exploited to approximate this problem with quality guarantees. As further applications, we exploit multilayer core decomposition to speed-up the extraction of frequent cross-graph quasi-cliques and to generalize the community-search problem to the multilayer setting.
This paper explores valid brain electroencephalography (EEG) selection for EEG classification with different classifiers, which has been rarely addressed in previous studies and is mostly ignored by existing EEG processing methods and applications. Importantly, traditional selection methods are not able to select valid EEG signals for different classifiers. This paper focuses on a source control-based valid EEG selection to reduce the impact of invalid EEG signals and aims to improve EEG-based classification performance for different classifiers. We introduce a novel centroid-based EEG selection approach named CenEEGs which uses a scaling and shifting invariant similarity metric to measure similarities of EEG signals and then applies a globally optimal centroid strategy to select valid EEG signals with respect to a similarity threshold. A detailed comparison with several state-of-the-art time series selection methods by using standard criteria on 8 EEG datasets demonstrates the efficacy and superiority of CenEEGs for different classifiers.
Parkinson?s disease is a progressive nervous system disorder afflicting millions of patients. Among its motor symptoms, bradykinesia is one of the cardinal manifestations. Experienced doctors are required for the clinical diagnosis of bradykinesia, but sometimes they also miss subtle changes, especially in early stages of such disease. Therefore, developing auxiliary diagnostic methods that can automatically detect bradykinesia has received more and more attention. In this paper, we employ a two-stage framework for bradykinesia recognition based on the video of patient movement. First, convolution neural networks are trained to localize keypoints in each video frame. These time-varying coordinates form motion trajectories that represent the whole movement. From the trajectory, we then propose novel measurements, namely stability, completeness, self-similarity, to quantify different motor behaviors. We also propose a periodic motion model called PMNet. An encoder-decoder structure is applied to learn a low dimensional representation of a motion process. The compressed motion process and quantified motor behaviors are combined as inputs to a fully-connected neural network. Different from the traditional means, our solution extends the application scenario outside the hospital and can be easily transplanted to conduct similar tasks. A commonly used clinical assessment is served as a case study. Experimental results based on real-world data validate the effectiveness of our approach for bradykinesia recognition.
The amount of data in our society has been exploding in the era of big data today. In this paper, we address several open challenges of big data stream classification, including high volume, high velocity, high dimensionality, high sparsity, and high class-imbalance. Many existing studies in data mining literature solve data stream classification tasks in a batch learning setting, which suffers from poor efficiency and scalability when dealing with big data. To overcome the limitations, this paper investigates an online learning framework for big data stream classification tasks. Unlike some existing online data stream classification techniques that are often based on first-order online learning, we propose a framework of Sparse Online Classification (SOC) for data stream classification, which includes some state-of-the-art first-order sparse online learning algorithms as special cases and allows us to derive a new effective second-order online learning algorithm for data stream classification. In addition, we also propose a new cost-sensitive sparse online learning algorithm by extending the framework with application to tackle online anomaly detection tasks where class distribution of data could be very imbalanced. We also analyze the theoretical bounds of the proposed method, and finally conduct an extensive set of experiments, in which encouraging results validate the efficacy of the proposed algorithms in comparison to a family of state-of-the-art techniques on a variety of data stream classification tasks.