Real-world networks are often organized as modules or communities of similar nodes that serve as functional units. These networks are also rich in content, with nodes having distinguishing features or attributes. In order to discover a network's modular structure, it is necessary to take into account not only its links but also node attributes. We describe an information-theoretic method that identifies modules by compressing descriptions of information flow on a network. Our formulation introduces node content into the description of information flow, which we then minimize to discover groups of nodes with similar attributes that also tend to trap the flow of information. The method has several advantages: it is conceptually simple and does not require ad-hoc parameters to specify the number of modules or to control the relative contribution of links and node attributes to network structure. We apply the proposed method to partition real-world networks with known community structure. We demonstrate that adding node attributes helps recover the underlying community structure in content-rich networks more effectively than using links alone. In addition, we show that our method is faster and more accurate than alternative state-of-the-art algorithms.
Feature selection is important in many big data applications. There are at least two critical challenges. Firstly, in many applications, the dimensionality is extremely high, in millions, and keeps growing. Secondly, feature selection has to be highly scalable, preferably in an online manner such that each feature can be processed in a sequential scan. In this paper, we develop SAOLA, a Scalable and Accurate OnLine Approach for feature selection. With a theoretical analysis on bounds of the pairwise correlations between features, SAOLA employs novel online pairwise comparison techniques to address the two challenges and maintain a parsimonious model over time in an online manner. Furthermore, to tackle the dimensionality that arrives by groups, we extend our SAOLA algorithm, and then propose a novel group-SAOLA algorithm for online group feature selection. The group-SAOLA algorithm can online maintain a set of feature groups that is sparse at the level of both groups and individual features simultaneously. An empirical study using a series of benchmark real data sets shows that our two algorithms, SAOLA and group-SAOLA, are scalable on data sets of extremely high dimensionality, and have superior performance over the state-of-the-art feature selection methods.
One of the key obstacles in making learning protocols realistic in applications is the need to supervise them, a costly process that often requires hiring domain experts. We consider the framework to use the world knowledge as indirect supervision. World knowledge is general-purpose knowledge, which is not designed for any specific domain. Then the key challenges are how to adapt the world knowledge to domains and how to represent it for learning. In this paper, we provide an example of using world knowledge for domain dependent document clustering. We provide three ways to specify the world knowledge to domains by resolving the ambiguity of the entities and their types, and represent the data with world knowledge as a heterogeneous information network. Then we propose a clustering algorithm that can cluster multiple types and incorporate the sub-type information as constraints. In the experiments, we use two existing knowledge bases as our sources of world knowledge. One is Freebase, which is collaboratively collected knowledge about entities and their organizations. The other is YAGO2, a knowledge base automatically extracted from Wikipedia and maps knowledge to the linguistic knowledge base, WordNet. Experimental results on two text benchmark datasets (20newsgroups and RCV1) show that incorporating world knowledge as indirect supervision can significantly outperform the state-of-the-art clustering algorithms as well as clustering algorithms enhanced with world knowledge features.
Solving Inverse Frequent Itemset Mining with Infrequency Constraints via Large-Scale Linear Programs
Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees.
Parallel Field Ranking
Distributed Algorithms for Computing Very Large Thresholded Covariance Matrices
It is traditionally a challenge for home buyers to understand, compare and contrast the investment values of estates. While a number of estate appraisal methods have been developed to value real properties, the performances of these methods have been limited by the traditional data sources for estate appraisal. With the development of new ways of collecting estate-related mobile data, there is a potential to leverage geographic dependencies of estates for enhancing estate appraisal. Indeed, the geographic dependencies of the investment value of an estate can be from the characteristics of its own neighborhood (individual), the values of its nearby estates (peer), and the prosperity of the affiliated latent business area (zone). To this end, in this paper, we propose a geographic method, named ClusRanking, for estate appraisal by leveraging the mutual enforcement of ranking and clustering power. ClusRanking is able to exploit geographic individual, peer, and zone dependencies in a probabilistic ranking model. Specifically, we first extract the geographic utility of estates from geography data, estimate the neighborhood popularity of estates by mining taxicab trajectory data, and model the influence of latent business areas. Also, we fuse these three influential factors and predict real estate investment value. Moreover, we simultaneously consider individual, peer and zone dependencies, and derive an estate-specific ranking likelihood as the objective function. Furthermore, we propose an improved method named CR-ClusRanking by incorporating checkin information as a regularization term which reduces the performance volatility of estate ranking system. Finally, we conduct a comprehensive evaluation with the real estate related data of Beijing, and the experimental results demonstrate the effectiveness of our proposed methods.
Batch Mode Active Sampling based on Marginal Probability Distribution Matching
The goal of community detection algorithms is to identify densely-connected units within large networks. An implicit assumption is that all the constituent nodes belong equally to their associated community. As a result, to date, efforts have been primarily driven to identify communities as a whole, rather than understanding by how much an individual node belongs to its community. In this paper, we argue that the belongingness of nodes in a community is not uniform. We quantify the degree of belongingness of a vertex within a community by a new vertex-based metric called permanence. The central idea of permanence is based on the observation that the strength of membership of a vertex to a community depends upon two factors: (i) the distribution of the connectivity of the vertex to individual communities, and (ii) how tightly the vertex is connected internally. We present the formulation of permanence based on these two quantities. We demonstrate that compared to other existing metrics, the change in permanence is more commensurate to the level of perturbation in ground-truth communities. We discuss how permanence can help us understand and utilize the structure and evolution of communities in a network. We further show that permanence is an excellent metric for identifying communities. We show that the process of maximizing permanence (abbreviated as MaxPerm) produces meaningful communities that concur with the ground-truth community structure of the networks more accurately than eight other popular community detection algorithms. Finally, we provide mathematical proofs to demonstrate the correctness of finding communities by maximizing permanence. In particular, we show that the communities obtained by this method are (i) less affected by the changes in vertex-ordering, and (ii) more resilient to resolution limit, degeneracy of solutions and asymptotic growth of values.