Conference Program

Program Overview








To presenters: small changes might be made closer to the conference date *without* notice.

Long paper presentation: 25 mins (including Q&A)

Short paper presentation: 15 mins (including Q&A)


June 3, 2018 (Sunday)

08:00

08:30

Registration

 

 

Mayfair 1 (300)

Mayfair 2 (80)

Mayfair 3 (80)

Grosvenor 1 (40)

Grosvenor 2 (40)

08:30

10:00

Tutorial
Deep Learning for Biomedical Discovery and Data Mining
(Truyen Tran)

Workshop

BDM
6th Workshop on Biologically Inspired Data Mining Technique

Chair: Shafiq Alam, Gillian Dobbie

Workshop

ML4Cyber
Australasian Workshop on Machine Learning for Cyber-security

Chair: Jonathan Oliver, Jun Zhang

Workshop

BDASC
Big Data Analytics for Social Computing

Chair: Mark Western, Junbin Gao, Lin Wu, Michele Haynes, Yang Wang

Workshop

PAISI
The 12th Pacific Asia Workshop on Intelligence and Security Informatics

Chair: Michael Chau, G. Alan Wang, Hsinchun Chen

10:00

10:30

Coffee Break

10:30

12:00

Tutorial
Deep Learning for Biomedical Discovery and Data Mining
(Truyen Tran)

Workshop

BDM
6th Workshop on Biologically Inspired Data Mining Technique

Chair: Shafiq Alam, Gillian Dobbie

Workshop

ML4Cyber
Australasian Workshop on Machine Learning for Cyber-security

Chair: Jonathan Oliver, Jun Zhang

Workshop

BDASC
Big Data Analytics for Social Computing

Chair: Mark Western, Junbin Gao, Lin Wu, Michele Haynes, Yang Wang

 

 

 

Workshop

PAISI
The 12th Pacific Asia Workshop on Intelligence and Security Informatics

Chair: Michael Chau, G. Alan Wang, Hsinchun Chen

12:00

13:30

Lunch

13:30

15:30

Tutorial

Non-IID Recommender Systems in Practice with Modern AI Techniques
(Liang Hu, Longbing Cao, Songlei Jian)

Tutorial

Relevant Structure Search in Graph Databases: Methods and Applications
(Yuanyuan Zhu, Xin Huang)

Workshop
DaMEMO
Data Mining for Energy Modeling and Optimization
Chair: Irena Koprinska, Jeremiah Deng, Alicia Troncoso

Workshop
BDASC

Big Data Analytics for Social Computing
Chair: Mark Western, Junbin Gao, Lin Wu, Michele Haynes, Yang Wang


15:30

16:00

Coffee Break

16:00

18:00

Tutorial

Non-IID Recommender Systems in Practice with Modern AI Techniques
(Liang Hu, Longbing Cao, Songlei Jian)


Workshop
DaMEMO
Data Mining for Energy Modeling and Optimization
Chair: Irena Koprinska, Jeremiah Deng, Alicia Troncoso

Workshop
BDASC
Big Data Analytics for
Social Computing
Chair: Mark Western, Junbin Gao, Lin Wu, Michele Haynes, Yang Wang


18:00

20:00

Welcome Reception

June 4, 2018 (Monday)

08:00

08:30

Registration

08:30

09:00

Conference Opening (Mayfair)

09:00

10:00

Keynote Speech
Instance Spaces for Objective Assessment of Algorithms and Benchmark Test Suites (Professor Kate Smith-Miles)
Chair: Vincent S. Tseng

10:00

10:15

Coffee Break

 

 

Mayfair 1 (300)

Mayfair 2 (80)

Mayfair 3 (80)

Grosvenor (80)

10:15

11:55

Research
Classification and supervised machine learning (Part 1)
Chair: Anni Coden

Research
Healthcare, Bioinformatics and Related Topics (Application) (Part 1)
Chair: Wei Luo

Research
Human, Behaviour and Interactions (Application)
(Part 1)

Chair: Guandong Xu

Research

Opinion
Mining and Sentiment Analysis

Chair: LIM Ee-Peng

11:55

13:10

Lunch

13:10

14:50

Research
Classification and supervised machine learning (Part 2)
Chair: Bryn Jeffries

Research
Healthcare, Bioinformatics and Related Topics (Application) (Part 2)
Chair: Adriel Cheng

Research
Human, Behaviour and Interactions (Application)
(Part 2)

Chair: Elham Naghi Zadeh Kakhki

Research
Graphical models, latent variables and statistical methods(Part 1)
Chair: Zengchang Qin

14:50

15:00

Coffee Break

15:10

17:05

Research
Classification and supervised machine learning (Part 3)
Chair: Özgür Akgün

Research
Healthcare, Bioinformatics and Related Topics (Application) (Part 3)
Chair: Anni Coden

Research
Anomaly detection and analytics
Chair: Ke Deng

Research
Graphical models, latent variables and statistical methods (Part 2)
Chair: Zengchang Qin

June 5, 2018 (Tuesday)

08:30

09:00

Registration

09:00

10:00

Keynote Speech
Machine Learning @ Amazon
(Dr Rajeev Rastogi)
Chair: Geoff Webb

10:00

10:15

Coffee Break

10:15

11:55

Research
Representation Learning and Embedding (Part 1)
Chair: Xiangliang Zhang

Research
Semi-Structured Data and NLP (Part 1)
Chair: Yuefeng Li

Research
Spatial-Temporal, Time-series and Stream Mining (Part 1)
Chair: Xing Xie

Research
Feature Learning and Data Mining Process (Part 1)
Chair: Chowdhury Farhan Ahmed

11:55

13:10

Lunch

13:10

14:50

Research
Representation Learning and Embedding (Part 2)
Chair: Xiangliang Zhang

Research
Semi-Structured Data and NLP (Part 2)
Chair: Yuefeng Li

Research
Spatial-Temporal, Time-series and Stream Mining (Part 2)
Chair: Ke Deng

Research
Feature Learning and Data Mining Process (Part 2)
Chair: Bryn Jeffries

14:50

15:20

Coffee Break

15:20

17:00

Research
Social Network, Ubiquitous Data and Graph Mining (Part 1)
Chair: Chowdhury Farhan Ahmed

Research
Community Detection and Network Science
Chair: Shi Chuan

Research
Spatial-Temporal, Time-series and Stream Mining (Part 3)
Chair: Ke Deng

Research
Feature Learning and Data Mining Process (Part 3)
Chair: Sanguthevar Rajasekaran

19:00

22:45

Banquet
(Award Announcement, PAKDD19 introduction)

June 6, 2018 (Wednesday)

09:00

10:00

Keynote Speech
Continuous Machine Learning
(Professor Bing Liu)
Chair: Vincent S. Tseng

10:00

10:15

Coffee Break

10:15

11:55

Research
Deep Learning Theory and Applications in KDD (Part 1)
Chair: Lin Wu

Research
Clustering and Unsupervised Learning (Part 1)
Chair: Min-Ling Zhang

Research
Privacy-Preserving and Security
Chair: Jiuxin Cao

Research
Social Network, Ubiquitous Data and Graph Mining (Part 2)
Chair: Chowdhury Farhan Ahmed

11:55

13:10

Lunch

13:10

14:10

Data Competition
Data Competition Presentation and Award
Chair: Hugo Jair Escalante, Wei-Wei Tu

14:10

14:25

Coffee Break

14:25

16:05

Research
Deep Learning Theory and Applications in KDD (Part 2)
Chair: Lin Wu

Research
Clustering and Unsupervised Learning (Part 2)
Chair: Ye Zhu

Research
Recommendation and Data Factorization
Chair: Jeffrey Chan

Research
Social Network, Ubiquitous Data and Graph Mining (Part 3)
Chair: Juixin Cao

16:05

16:20

Conference Closing

 

 

Research Sessions


Classification and supervised machine learning (Part 1)
Classifier Risk Estimation under Limited Labeling Resources(abstract, pdf)
Anurag Kumar*, Carnegie Mellon University; Bhiksha Raj, Carnegie Mellon University
Evaluating a trained system is an important component of machine learning. Labeling test data for large scale evaluation of a trained model can be extremely time consuming and expensive. In this paper we propose strategies for estimating performance of a classifier using as little labeling resource as possible. Specifically, we assume a labeling budget is given and the goal is to get a good estimate of the classifier performance using the provided labeling budget. We propose strategies to get a precise estimate of classifier accuracy under this restricted labeling budget scenario. We show that these strategies can reduce the variance in estimation of classifier accuracy by a significant amount compared to simple random sampling (over 65% in several cases). In terms of labeling resource, the reduction in number of samples required (compared to random sampling) to estimate the classifier accuracy with only 1% error is high as 60% in some cases. Anurag Kumar*, Carnegie Mellon University; Bhiksha Raj, Carnegie Mellon University
Social Stream Classification with Emerging New Labels(abstract, pdf)
Xin Mu*, Nanjing university; Feida Zhu, Singapore Management University; Yue Liu, Singapore Management University; Ee-peng Lim, Singapore Management University; Zhi-Hua Zhou, Nanjing university
As an important research topic with well-recognized practical values, classification of social streams has been identified with increasing popularity with social data, such as the tweet stream generated by Twitter users in chronological order. A salient, and perhaps also the most interesting, feature of such user-generated content is its never-failing novelty, which, unfortunately, would challenge most traditional pre-trained classification models as they are built based on fixed label set and would therefore fail to identify new labels as they emerge. In this paper, we study the problem of classification of social streams with emerging new labels, and propose a novel ensemble framework, integrating an instance-based learner and a label-based learner by completely-random trees. The proposed framework can not only classify known labels in the multi-label scenario, but also detect emerging new labels and update itself in the data stream. Extensive experiments on real-world stream data set from Weibo, a Chinese micro-blogging platform, demonstrate the superiority of our approach over the state-of-the-art methods.Xin Mu*, Nanjing university; Feida Zhu, Singapore Management University; Yue Liu, Singapore Management University; Ee-peng Lim, Singapore Management University; Zhi-Hua Zhou, Nanjing university
Exploiting Anti-monotonicity of Multi-label Evaluation Measures for Inducing Multi-label Rules(abstract, pdf)
Michael Rapp, TU Darmstadt; Eneldo Loza Mencia*, TU Darmstadt; Johannes Fürnkranz, TU Darmstadt
Exploiting dependencies between labels is considered to be crucial for multi-label classification. Rules are able to expose label dependencies such as implications, subsumptions or exclusions in a human-comprehensible and interpretable manner. However, the induction of rules with multiple labels in the head is particularly challenging, as the number of label combinations which must be taken into account for each rule grows exponentially with the number of available labels. To overcome this limitation, algorithms for exhaustive rule mining typically use properties such as anti-monotonicity or decomposability in order to prune the search space. In the present paper, we examine whether commonly used multi-label evaluation metrics satisfy these properties and therefore are suited to prune the search space for multi-label heads.Michael Rapp, TU Darmstadt; Eneldo Loza Mencia*, TU Darmstadt; Johannes Fürnkranz, TU Darmstadt
Modeling Label Interactions in Multi-label Classification: A Multi-structure SVM Perspective(abstract, pdf)
Anusha Kasinikota, DXC Technology; Balamurugan Palaniappan*, IIT Bombay; Shirish Shevade, iisc
Multi-label classification has attracted much interest due to its wide applicability. Modeling label interactions and investigating their impact on classifier quality are crucial aspects of multi-label classification. In this paper, we propose a multi-structure SVM (called MSSVM) which allows the user to hypothesize multiple label interaction structures and helps to identify their importance in improving generalization performance. We design an efficient optimization algorithm to solve the proposed MSSVM. Extensive empirical evaluation provides fresh and interesting insights into the following questions: a) How do label interactions affect multiple performance metrics typically used in multi-label classification? b) Do higher order label interactions significantly impact a given performance metric for a particular dataset? c) Can we make useful suggestions on the label interaction structure? and d) Is it always beneficial to model label interactions in multi-label classification? Anusha Kasinikota, DXC Technology; Balamurugan Palaniappan*, IIT Bombay; Shirish Shevade, iisc
Sentiment Classification Using Neural Networks with Sentiment Centroids(abstract, pdf)
maoquan wang*, East China Normal University; chen shiyun, East China Normal University; Liang He, ECNU
Neural networks(NN) have demonstrated powerful ability to extract text features automatically for sentiment classification in recent years. Although semantic and syntactic features are well studied, global category information has been mostly ignored within the NN based framework. Samples with the same sentiment category should have similar vectors in represent space. Motivated by this, we propose a novel global sentiment centroids based neural framework, which incorporates the sentiment category features. The centroids assist NN to extract discriminative category features from a global perspective. We apply our approach to several real large-scale sentiment-labeled datasets, and the extensive experiments show that our model not only obtains more powerful sentiment feature representations, but also achieves some state-of-the-art results with a simple neural network structure. maoquan wang*, East China Normal University; chen shiyun, East China Normal University; Liang He, ECNU
Classification and supervised machine learning (Part 2)
Random Pairwise Shapelets Forest(abstract, pdf)
Jidong Yuan*, Beijing Jiaotong University; Zhihai Wang, Beijing Jiaotong University; Haiyang Liu, Beijing Jiaotong University; Mohan Shi, Beijing Jiaotong University
Shapelet is a discriminative subsequence of time series. An advanced time series classification method is to integrate shapelet with random forest. However, it shows several limitations. First, random shapelet forest requires a large training cost for split threshold searching. Second, a single shapelet provides limited information for only one branch of the decision tree, resulting in insufficient accuracy and interpretability. Third, randomized ensemble causes interpretability declining. For that, this paper presents Random Pairwise Shapelets Forest (RPSF). RPSF combines a pair of shapelets from different classes to construct random forest. It is more efficient due to omit of threshold search, and more effective due to including of additional information from different classes. Moreover, a discriminability metric, Decomposed Mean Decrease Impurity (DMDI), is proposed to identify influential region for every class. Extensive experiments show that RPSF improves the accuracy and training speed of shapelet forest. Case studies demonstrate the interpretability of our method.Jidong Yuan*, Beijing Jiaotong University; Zhihai Wang, Beijing Jiaotong University; Haiyang Liu, Beijing Jiaotong University; Mohan Shi, Beijing Jiaotong University
A Locally Adaptive Multi-Label k-Nearest Neighbor Algorithm(abstract, pdf)
Dengbao Wang, College of Computer and Information Science, Southwest University; Jingyuan Wang, Southwest University; Fei Hu, Southwest University; Li LI*, Southwest University; Xiuzhen Zhang, RMIT University
In the field of multi-label learning, ML-kNN is the first lazy learning approach and one of the most influential approaches. The main idea of it is to adapt k-NN method to deal with multi-label data, where maximum a posteriori rule is utilized to adaptively adjust decision boundary for each unseen instance. In ML-kNN, all test instances which get the same number of votes among k nearest neighbors have the same probability to be assigned a label, which may cause improper decision since it ignores the local difference of samples. Actually, in real world data sets, the instances with (or without) label l from different locations may have different numbers of neighbors with the label l. In this paper, we propose a locally adaptive Multi-Label k-Nearest Neighbor method to address this problem, which takes the local difference of samples into account. We show how a simple modification to the posterior probability expression, previously used in ML-kNN algorithm, allows us to take the local difference into account. Experimental results on benchmark data sets demonstrate that our approach has superior classification performance with respect to other kNN-based algorithms.Dengbao Wang, College of Computer and Information Science, Southwest University; Jingyuan Wang, Southwest University; Fei Hu, Southwest University; Li LI*, Southwest University; Xiuzhen Zhang, RMIT University
Classification With Reject Option Using Conformal Prediction(abstract, pdf)
Henrik Linusson*, University of Bor泻 Ulf Johansson, J��ping University; Tuwe L��r��J��ping University; Henrik Bostr��KTH Royal Institute of Technology
In this paper, we propose a practically useful means of interpreting the predictions produced by a conformal classifier. The proposed interpretation leads to a classifier with a reject option, that allows the user to limit the number of erroneous predictions made on the test set, without any need to reveal the true labels of the test objects. The method described in this paper works by estimating the cumulative error count on a set of predictions provided by a conformal classifier, ordered by their confidence. Given a test set and a user-specified parameter k, the proposed classification procedure outputs the largest possible amount of predictions containing on average at most k errors, while refusing to make predictions for test objects where it is too uncertain. We conduct an empirical evaluation using benchmark datasets, and show that we are able to provide accurate estimates for the error rate on the test set.Henrik Linusson*, University of Borås; Ulf Johansson, Jönköping University; Tuwe Löfström, Jönköping University; Henrik Boström, KTH Royal Institute of Technology
Target Learning: A Novel Framework to Mine Significant Dependencies for Unlabeled Data(abstract, pdf)
Limin Wang*, Jilin University; Shenglei Chen, Nanjing Audit University; Musa Mammadov, Federation University
To mine significant dependencies among predictive attributes, much work has been carried out to learn Bayesian netwrok classifiers BNC$_T$s from labeled training data set $T$. However, if BNC$_T$ does not capture the “right” dependencies that would be most relevant to unlabeled testing instance, that will result in performance degradation. To address this issue we propose a novel framework, called target learning, that takes each unlabeled testing instance as a target and builds an “unstable” Bayesian model BNC$_P$ for it. To make BNC$_P$ and BNC$_T$ complementary to each other and work efficiently in combination, the same learning strategy is applied to build them. Experimental comparison on 32 large data sets from UCI machine learning repository shows that, for BNCs with different degrees of dependence target learning always helps improve the generalization performance with minimal additional computation.Limin Wang*, Jilin University; Shenglei Chen, Nanjing Audit University; Musa Mammadov, Federation University
Automatic Chinese Reading Comprehension Grading by LSTM with Knowledge Adaptation(abstract, pdf)
Yuwei Huang, Beijing University of Chemical Technology;Beijing Advanced Innovation Center For Future Education,Beijing Normal University; Xi Yang*, Beijing Advanced Innovation Center For Future Education,Beijing Normal University; Fuzhen Zhuang, Institute of Computing Technology, Chinese Academy of Sciences; Lishan Zhang, Beijing Advanced Innovation Center For Future Education,Beijing Normal University; Shengquan Yu, Beijing Advanced Innovation Center For Future Education,Beijing Normal University
Owing to the subjectivity of graders and the complexity of assessment standard, grading is a tough problem in the field of education. This paper presents an algorithm for automatic grading of open-ended Chinese reading comprehension questions. Due to the high complexity of feature engineering and the lack of consideration for word order in frequency based word embedding models, we utilize long-short term memory recurrent neural network to extract semantic feature in student answers automatically. In addition, we also try to impose the knowledge adaptation from web corpus to student answers, and represent the students’ responses to vectors which are fed into the memory network. Along this line, the workload of teacher and the subjectivity in reading comprehension grading can both be reduced obviously. What’s more, the automatic grading methods for Chinese reading comprehension will be more thorough. The experimental results on five Chinese and two English data sets demonstrate the superior performance over compared baselines.Yuwei Huang, Beijing University of Chemical Technology;Beijing Advanced Innovation Center For Future Education,Beijing Normal University; Xi Yang*, Beijing Advanced Innovation Center For Future Education,Beijing Normal University; Fuzhen Zhuang, Institute of Computing Technology, Chinese Academy of Sciences; Lishan Zhang, Beijing Advanced Innovation Center For Future Education,Beijing Normal University; Shengquan Yu, Beijing Advanced Innovation Center For Future Education,Beijing Normal University
Data Mining with Algorithmic Transparency(abstract, pdf)
Yan Zhou*, UT Dallas; Yasmeen Alufaisan, UT Dallas; Murat Kantarcioglu, UT Dallas
In this paper, we investigate whether decision trees can be used to interpret a black-box classifier without knowing the learning algorithm and the training data. Decision trees are known for their transparency and high expressivity. However, they are also notorious for their instability and tendency to grow excessively large. We present a classifier reverse engineering model that outputs a decision tree to interpret the black-box classifier. There are two major challenges. One is to build such a decision tree with controlled stability and size, and the other is that probing the black-box classifier is limited for security and economic reasons. Our model addresses the two issues by simultaneously minimizing sampling cost and classifier complexity. We present our empirical results on four real datasets, and demonstrate that our reverse engineering learning model can effectively approximate and simplify the black box classifier.Yan Zhou*, UT Dallas; Yasmeen Alufaisan, UT Dallas; Murat Kantarcioglu, UT Dallas
Classification and supervised machine learning (Part 3)
Cost-Sensitive Reference Pair Encoding for Multi-Label Learning(abstract, pdf)
Yao-Yuan Yang, National Taiwan University; Kuan-Hao Huang, National Taiwan University; Chih-Wei Chang, Carnegie Mellon University; Hsuan-Tien Lin*, National Taiwan University
Label space expansion for multi-label classification (MLC) is a methodology that encodes the original label vectors to higher dimensional codes before training and decodes the predicted codes back to the label vectors during testing. The methodology has been demonstrated to improve the performance of MLC algorithms when coupled with off-the-shelf error-correcting codes for encoding and decoding. Nevertheless, such a coding scheme can be complicated to implement, and cannot easily satisfy a common application need of cost-sensitive MLC—adapting to different evaluation criteria of interest. In this work, we show that a simpler coding scheme based on the concept of a reference pair of label vectors achieves cost-sensitivity more naturally. In particular, our proposed cost-sensitive reference pair encoding (CSRPE) algorithm contains cluster-based encoding, weight-based training and voting-based decoding steps, all utilizing the cost information. Furthermore, we leverage the cost information embedded in the code space of CSRPE to propose a novel active learning algorithm for cost-sensitive MLC. Extensive experimental results verify that CSRPE performs better than state-of-the-art algorithms across different MLC criteria. The results also demonstrate that the CSRPE-backed active learning algorithm is superior to existing algorithms for active MLC, and further justify the usefulness of CSRPE. Yao-Yuan Yang, National Taiwan University; Kuan-Hao Huang, National Taiwan University; Chih-Wei Chang, Carnegie Mellon University; Hsuan-Tien Lin*, National Taiwan University
Fuzzy Integral Optimization with Deep Q-Network for EEG-based Intention Recognition(abstract, pdf)
Dalin Zhang*, UNSW Sydney; Lina Yao, UNSW; Sen Wang, Griffith University; Kaixuan Chen, UNSW Sydney; Zheng Yang, Tsinghua University; Boualem Benatallah, UNSW Sydney
Non-invasive brain-computer interface using electroencephalography (EEG) signals promises a convenient approach empowering humans to communicate with and even control the outside world only with intentions. Herein, we propose to analyze EEG signals using fuzzy integral with deep reinforcement learning optimization to aggregate two aspects of information contained within EEG signals, namely local spatio-temporal and global temporal information, and demonstrate its benefits in EEG-based human intention recognition tasks. The EEG signals are first transformed into a 3D format preserving both topological and temporal structures, followed by distinctive local spatio-temporal feature extraction by a 3D-CNN, as well as the global temporal feature extractions by an RNN. Next, a fuzzy integral with respect to the optimized fuzzy measures with deep reinforcement learning is utilized to integrate the two extracted information and makes a final decision. The proposed approach retains the topological and temporal structures of EEG signals and merges them in a more efficient way. Experiments on a public EEG-based movement intention dataset demonstrate the effectiveness and superior performance of our proposed method over a set of state-of-the-art counterparts and many baseline models.Dalin Zhang*, UNSW Sydney; Lina Yao, UNSW; Sen Wang, Griffith University; Kaixuan Chen, UNSW Sydney; Zheng Yang, Tsinghua University; Boualem Benatallah, UNSW Sydney
Heterogeneous Domain Adaptation Based on Class Decomposition Schemes(abstract, pdf)
Firat Ismailoglu, Maastricht University; Evgueni Smirnov*, Maastricht University; Ralf Peeters, Maastricht University; Shuang Zhou, Maastricht University; Pieter Collins , Maastricht University
This paper introduces a novel classification algorithm for heterogeneous domain adaptation. The algorithm projects both the target and source data into a common latent feature space of the class decomposition scheme used. The distinctive features of the algorithm are: (1) it does not impose any assumptions on the data other than sharing the same class labels; (2) it allows adaptation of multiple source domains at once; (3) it can help improve the topology of the projected data for class separability. The algorithm provides two built-in classification rules plus it allows applying any other classification model.Firat Ismailoglu, Maastricht University; Evgueni Smirnov*, Maastricht University; Ralf Peeters, Maastricht University; Shuang Zhou, Maastricht University; Pieter Collins , Maastricht University
A Deep Neural Spoiler Detection Model using a Genre-Aware Attention Mechanism(abstract, pdf)
Buru Chang, Korea University, Data mining & Information systems labortory; Hyunjae Kim, DMIS; RAE HYUN KIM, KOREA UNIVERSITY; daehan kim, korea university; Jaewoo Kang*, Korea University
The fast-growing volume of online activity and user-generated content increases the chances of users being exposed to spoilers. To address this problem, several spoiler detection models have been proposed. However, most of the previous models rely on hand-crafted domain-specific features, which limits the generalizability of the models. In this paper, we propose a new deep neural spoiler detection model that uses a genre-aware attention mechanism. Our model consists of a genre encoder and a sentence encoder. The genre encoder is used to extract a genre feature vector from given genres using a convolutional neural network. The sentence encoder is used to extract sentence feature vectors from a given sentence using a bi-directional gated recurrent unit. We also propose a genre-aware attention layer based on the attention mechanism that utilizes genre information for detecting spoilers which vary by genres. Using a sentence feature, our proposed model determines whether a given sentence is a spoiler. The experimental results on a spoiler dataset show that our proposed model which does not use hand-crafted features outperforms the state-of-the-art spoiler detection baseline models. We also conduct a qualitative analysis on the relations between spoilers and genres, and highlight the results through an attention weight visualization. Buru Chang, Korea University, Data mining & Information systems labortory; Hyunjae Kim, DMIS; RAE HYUN KIM, KOREA UNIVERSITY; daehan kim, korea university; Jaewoo Kang*, Korea University
Robust Semi-Supervised Learning on Multiple Networks with Noise(abstract, pdf)
Junting Ye*, Stony Brook University; Leman Akoglu, Carnegie Mellon University
Graph-regularized semi-supervised learning has been effectively used for classification when (i) data instances are connected through a graph, and (ii) labeled data is scarce. Leveraging multiple relations (or graphs) between the instances can improve the prediction performance, however noisy and/or irrelevant relations may deteriorate the performance. As a result, an effective weighing scheme needs to be put in place for robustness. In this paper, we propose iMUNE, a robust and effective approach for multi-relational graph-regularized semi-supervised classification, that is immune to noise. Under a convex formulation, we infer weights for the multiple graphs as well as a solution (i.e., labeling). We provide a careful analysis of the inferred weights, based on which we devise an algorithm that filters out irrelevant and noisy graphs and produces weights proportional to the informativeness of the remaining graphs. Moreover, iMUNE is linearly scalable w.r.t. the number of edges. Through extensive experiments on various real-world datasets, we show the effectiveness of our method, which yields superior results under different noise models, and under increasing number of noisy graphs and intensity of noise, as compared to a list of baselines and state-of-the-art approaches.Junting Ye*, Stony Brook University; Leman Akoglu, Carnegie Mellon University
Healthcare, BioInformatics and Related Topics (Application) (Part 1)
Corrosion Prediction on Sewer Networks with Sparse Monitoring Sites: A Case Study(abstract, pdf)
Jianjia Zhang*, CSIRO; Bin Li, Fudan University; Xuhui Fan, UNSW; Yang Wang, Data61, CSIRO; Fang Chen, CSIRO
Sewer corrosion is a widespread and costly issue for water utilities. Knowing the corrosion status of a sewer network could help the water utility to improve efficiency and save costs in sewer pipe maintenance and rehabilitation. However, inspecting the corrosion status of all sewer pipes is impractical. To prioritize sewer pipes in terms of corrosion risk, the water utility requires a corrosion prediction model built on influential factors that cause sewer corrosion, such as hydrogen sulphide (H$_2$S) and temperature. This work leverages a Bayesian nonparametric method, Gaussian Process, to integrate the physical model developed by domain experts, the sparse H$_2$S and temperature monitored records, and the sewer geometry to predict corrosion risk levels on the entire sewer network. Jianjia Zhang*, CSIRO; Bin Li, Fudan University; Xuhui Fan, UNSW; Yang Wang, Data61, CSIRO; Fang Chen, CSIRO
CAPED: Context-Aware Powerlet-based Energy Disaggregation(abstract, pdf)
Jingyue Gao, Peking University; Yasha Wang*, Peking University; Xu Chu, Peking University; Yuanduo He, Peking University; Ziqing Mao, Peking University
Energy disaggregation is the task of decomposing a household’s total electricity consumption into individual appliances, which becomes increasingly important in energy reservation research nowadays. In this paper, we propose a novel algorithm taking the context of disaggregation task into consideration. First, we design a new method to efficiently extract each appliance’s typical consumption patterns, i.e. powerlets. When performing the disaggregation task, we model it as an optimization problem and incorporate context information into the cost function. Experiments on two public datasets have demonstrated the superiority of our algorithm over the state-of-the-art work, achieving about 13.7% and 4.8% higher disaggregation accuracy respectively.Jingyue Gao, Peking University; Yasha Wang*, Peking University; Xu Chu, Peking University; Yuanduo He, Peking University; Ziqing Mao, Peking University
Rolling Forecasting Forward by Boosting Heterogeneous Kernels(abstract, pdf)
Di Zhang*, Communication University of China; Yunquan Zhang, State Key Lab of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Qiang Niu, Department of Mathematical Sciences, Xi’an Jiaotong-Liverpool University; Xingbao Qiu, China Mobile Communications Corporation
The problem discussed in this paper stems from a project of cellular network traffic prediction, the primary step of network planning striving to serve the continuously soaring network traffic best with limited resource. The traffic prediction emphasizes two aspects: 1) how to exploit the potential value of physical and electronic properties for tens of thousands of wireless stations, which may partly determine the allocation of traffic load in some intricate way; 2) the lack of sufficient and high-quality historical records, for the appropriate training of long-term predictions, further aggravated by frequent reconfigurations in daily operation. To solve this problem, we define a general framework to accommodate several variants of multi-step forecasting, via decomposing the problem into a series of single-step vector-output regression tasks. They can further be augmented by miscellaneous attributive information, in the form of boosted multiple kernels. Experiments on multiple telecom datasets show that the solution outperforms conventional time series methods on accuracy, especially for long horizons. Those attributes describing the macroscopic factors, such as the network type, topology, locations, are significantly helpful for longer horizons, whereas the immediate values in the near future are mainly determined by their recent records. Di Zhang*, Communication University of China; Yunquan Zhang, State Key Lab of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Qiang Niu, Department of Mathematical Sciences, Xi’an Jiaotong-Liverpool University; Xingbao Qiu, China Mobile Communications Corporation
IDLP: A Novel Label Propagation Framework For Disease Gene Prioritization(abstract, pdf)
Yaogong Zhang*, NanKai University
Prioritizing disease genes is trying to identify potential disease causing genes for a given phenotype, which can be applied to reveal the inherited basis of human diseases and facilitate drug development. Our motivation is inspired by label propagation algorithm and the false positive protein-protein interactions that exist in the dataset. To the best of our knowledge, the false positive protein-protein interactions have not been considered before in disease gene prioritization. We conduct extensive experiments over OMIM datasets, and our proposed method IDLP has demonstrated its effectiveness compared with eight state-of-the-art approaches.Yaogong Zhang*, NanKai University
Deep Learning for Forecasting Stock Returns in the Cross-Section(abstract, pdf)
Masaya Abe*, Nomura Asset Management Co.,Ltd.; Hideki Nakayama, The University of Tokyo
Many studies have been undertaken by using machine learning techniques, including neural networks, to predict stock returns. Recently, a method known as deep learning, which achieves high performance mainly in image recognition and speech recognition, has attracted attention in the machine learning field. This paper implements deep learning to predict one-month-ahead stock returns in the cross-section in the Japanese stock market and investigates the performance of the method. Our results show that deep neural networks generally outperform shallow neural networks, and the best networks also outperform representative machine learning models. These results indicate that deep learning shows promise as a skillful machine learning method to predict stock returns in the cross-section.Masaya Abe*, Nomura Asset Management Co.,Ltd.; Hideki Nakayama, The University of Tokyo
An interaction-enhanced feature selection algorithm(abstract, pdf)
Xiaochuan Tang*, University of Electronic Science and Technology of China; Yuanshun Dai, University of Electronic Science and Technology of China; Yanping Xiang, University of Electronic Science and Technology of China; Liang Luo, University of Electronic Science and Technology of China
Feature selection is a crucial pre-processing step in machine learning and data mining. A popular approach is based on information theoretic measures. Most of the existing methods used low-dimensional mutual information terms that are ineffective in detecting high-order feature interactions. To fill this gap, we employ higher-order interactions for feature selection. We first relax the assumptions of MI-based methods to allow for higher-order interactions. A direct calculation of the interaction terms is computationally expensive. We use four-dimensional joint mutual information, a computationally efficient measure, to estimate the interaction terms. We also use the `maximum of the minimum’ nonlinear approach to avoid the overestimation of feature significance. Finally, we arrive at an effective feature selection method that makes use of higher-order interactions. To evaluate the performance of the proposed method, we compare it with seven representative feature selection methods, including RelaxMRMR, JMIM, IWFS, CIFE, MIFS, MIM, and reliefF. Experimental results on eighteen benchmark data sets demonstrate that higher-order interactions are effective in improving MI-based feature selection.Xiaochuan Tang*, University of Electronic Science and Technology of China; Yuanshun Dai, University of Electronic Science and Technology of China; Yanping Xiang, University of Electronic Science and Technology of China; Liang Luo, University of Electronic Science and Technology of China
Healthcare, BioInformatics and Related Topics (Application) (Part 2)
Vine Copula-based Asymmetry and Tail Dependence Modeling(abstract, pdf)
Jia Xu*, University of Technology, Sydney; Longbing Cao, University of Technology Sydney
Financial variables such as asset returns in the massive market contain various hierarchical and horizontal relationships that form complicated dependence structures. Modeling these structures is challenging due to the stylized facts of market data. Many research works in recent decades showed that copula is an effective method to describe relations among variables. Vine structures were introduced to represent the decomposition of multivariate copula functions. However, the model construction of vine structures is still a tough problem owing to the geometrical data, conditional independent assumptions and the stylized facts. In this paper, we introduce a new bottom-to-up method to construct regular vine structures and applies the model to $12$ currencies over $16$ years as a case study to analyze the asymmetric and fat tail features. The out-of-sample performance of our model is evaluated by Value at Risk, a widely used industrial benchmark. Jia Xu*, University of Technology, Sydney; Longbing Cao, University of Technology Sydney
Detecting forged alcohol non-invasively through vibrational spectroscopy and machine learning(abstract, pdf)
James Large*, University of East Anglia; Kate Kemsley, The Quadram Institute; Nikolaus Wellner, The Quadram Institute; Ian Goodall, Scotch Whisky Research Institute; Anthony Bagnall, University of East Anglia
Alcoholic spirits are a common target for counterfeiting and adulteration, with potential costs to public health, the taxpayer and brand integrity. Current methods to authenticate spirits include examinations of superficial appearance and consistency, or require the tester to open the bottle and remove a sample. The former is inexact, while the latter is not suitable for widespread screening or for high-value spirits, which lose value once opened. We study whether non-invasive near infrared spectroscopy, in combination with traditional and time series classification methods, can correctly classify the alcohol content (a key factor in determining authenticity) of synthesised spirits sealed in real bottles. Such an experimental setup could allow for a portable, cheap to operate, and fast authentication device. We find that ethanol content can be classified with high accuracy, however methanol content proved difficult with the algorithms evaluated. James Large*, University of East Anglia; Kate Kemsley, The Quadram Institute; Nikolaus Wellner, The Quadram Institute; Ian Goodall, Scotch Whisky Research Institute; Anthony Bagnall, University of East Anglia
Research and Application of Mapping Relationship based on Learning Attention Mechanism(abstract, pdf)
Wanwan Jiang*, Shanghai University; Lingyu Xu, Shanghai University; Jie Yu, Shanghai University; Gaowei Zhang, Shanghai University
The study on the interactions between different or the same variables of financial markets is an interesting topic. Many efforts have been devoted to investigate this issue. However, there has been little work studying the relationship of the various attributes within the stock, while this relationship is essential for us to have a deeper understanding of stockӳ internal mechanisms. So in this paper, we explored using sequence-to-sequence model for extracting the relationship of arbitrarily two properties of the stock. We not only give a qualitative description of the relationship between stockӳ attributes, but also quantify the relationship through the model. The experimental results show that there are certain correlations between the internal attributes of the stock, among which the correlation between Close&%Tuv and %Chg&%Tuv are more prominent. In addition, we also conducted the anomaly detection on network public opinion information, and found out the starting points of abnormal events combined with the network news information. By comparing the starting points of the events and the changes in the relationship between stock attributes, we concluded that there is a certain regularity between them.Wanwan Jiang*, Shanghai University; Lingyu Xu, Shanghai University; Jie Yu, Shanghai University; Gaowei Zhang, Shanghai University
Human Identification via Unsupervised Feature Learning from UWB Radar Data(abstract, pdf)
Jie Yin*, The University of Sydney; Son Tran, The Australian E-Health Research Centre, CSIRO; Qing Zhang, The Australian E-Health Research Centre, CSIRO
This paper presents an automated approach to automatically distinguish the identity of multiple residents in smart homes. Without using any intrusive video surveillance devices or wearable tags, we achieve the goal of human identification through properly processing and analyzing the received signals from the ultra-wideband (UWB) radar installed in indoor environments. Because the UWB signals are very noisy and unstable, we employ unsupervised feature learning techniques to automatically learn local, discriminative features that can incorporate intra-class variations of the same identity, and yet reflect differences in distinguishing different human identities. The learned features are then used to train an SVM classifier and recognize the identity of residents. We demonstrate the effectiveness of our proposed solution through extensive experiments using real data collected in real-life situations. Our findings show that feature learning based on $K$-means clustering, coupled with whitening and pooling, achieves the highest accuracy, when only a limited quantity of training data is available. This shows that the proposed feature learning and classification framework combined with the UWB radar technology provides an effective solution to human identification in multi-residential smart homes.Jie Yin*, The University of Sydney; Son Tran, The Australian E-Health Research Centre, CSIRO; Qing Zhang, The Australian E-Health Research Centre, CSIRO
Prescriptive Analytics through Constrained Bayesian Optimization(abstract, pdf)
Haripriya Harikumar*, Deakin University; Santu Rana, Deakin University, Australia; Sunil Gupta, Deakin University, Australia; Thin Nguyen, Deakin University; Ramachandra Kaimal, Amrita University; Svetha Venkatesh, Deakin University
Prescriptive analytics leverages predictive data mining algorithms to prescribe appropriate changes to alter a predicted outcome of undesired class to a desired one. As an example, based on the conversation of a reformed addict on a message board, prescriptive analytics may predict the intervention required. We develop a novel prescriptive analytics solution by formulating a constrained Bayesian optimization problem to find the smallest change that we need to make on an actionable set of features so that with sufficient confidence an instance can be changed from an undesirable class to the desirable class. We use two public health dataset, multi-year CDC dataset on disease prevalence across the 50 states of USA and alcohol related data from Reddit to demonstrate the usefulness of our results.Haripriya Harikumar*, Deakin University; Santu Rana, Deakin University, Australia; Sunil Gupta, Deakin University, Australia; Thin Nguyen, Deakin University; Ramachandra Kaimal, Amrita University; Svetha Venkatesh, Deakin University
Healthcare, BioInformatics and Related Topics (Application) (Part 3)
Neighborhood Constraint Matrix Completion for Drug-Target Interaction Prediction(abstract, pdf)
Xin Fan*, Nankai University; Yuxiang Hong, Nankai University; Xiaohu Liu, NanKai University; Yaogong Zhang, NanKai University; MaoQiang Xie, NanKai Universiy
Identifying drug-target interactions is an important step in drug discovery, but only a small part of the interactions have been validated, and the experimental determination process is both expensive and time-consuming. Therefore, there is a strong demand to develop the computational methods, which can predict potential drug-target interactions to guide the experimental verification. In this paper, we propose a novel algorithm for drug-target interaction prediction, named Neighborhood Constraint Matrix Completion(NCMC). Different from previous methods, for existing drug-target interaction network, we exploit the low rank property of its adjacency matrix to predict new interactions. Moreover, with the rarity of known entries, we introduce the similarity information of drugs/targets, and propose the neighborhood constraint to regularize the unknown cases. Furthermore, we formulate the whole task into a convex optimization problem and solve it by a fast proximal gradient descent framework, which can quickly converge to a global optimal solution. Finally, we extensively evaluated our method on four real datasets, and NCMC demonstrated its effectiveness compared with the other five state-of-the-art approaches.Xin Fan*, Nankai University; Yuxiang Hong, Nankai University; Xiaohu Liu, NanKai University; Yaogong Zhang, NanKai University; MaoQiang Xie, NanKai Universiy
Detecting Hypopnea and Obstructive Apnea Events using Convolutional Neural Networks on Wavelet Spectrograms of Nasal Airflow(abstract, pdf)
Stephen McCloskey*, The University of Sydney; Rim Haidar, The University of Sydney; Irena Koprinska, University of Sydney; Bryn Jeffries, The University of Sydney
We present a novel approach for detecting hypopnea and obstructive apnea events during sleep, using a single channel nasal airflow from polysomnography recordings, applying a Convolutional Neural Network (CNN) to a 2-D image wavelet spectrogram of the nasal signal. We compare this approach to directly training a 1-D CNN on the raw nasal airflow signal. The evaluation was conducted on a large dataset consisting of 69,264 examples from 1,507 subjects. Our results showed that both approaches achieved good accuracy, with the 2-D CNN outperforming the 1-D CNN. The higher accuracy and the less complex architecture of the 2-D CNN show that converting biological signals into spectrograms and using them in conjunction with CNNs is a promising method for sleep apnea recognition.Stephen McCloskey*, The University of Sydney; Rim Haidar, The University of Sydney; Irena Koprinska, University of Sydney; Bryn Jeffries, The University of Sydney
Deep Ensemble Classifiers and Peer Effects Analysis for Churn Forecasting in Retail Banking(abstract, pdf)
Yuzhou Chen*, Southern Methodist University; Yulia Gel, The University of Texas at Dallas; Vyacheslav Lyubchich, UMCES; Todd Winship, Temenos Group
Modern customer analytics offers retailers a variety of unprecedented opportunities to enhance customer intelligence solutions by tracking individual clients and their peers and studying clientele behavioral patterns. While telecommunication providers have been actively utilizing peer network data to improve their customer analytics for a number of years, there yet exists a very limited knowledge on the peer effects in retail banking. We introduce modern deep learning concepts to quantify the impact of social network variables on bank customer attrition. Furthermore, we propose a novel deep ensemble classifier that systematically integrates predictive capabilities of individual classifiers in a meta level model, by efficiently stacking multiple predictions using convolutional neural networks. We evaluate our methodology in application to customer retention in a retail financial institution in Canada.Yuzhou Chen*, Southern Methodist University; Yulia Gel, The University of Texas at Dallas; Vyacheslav Lyubchich, UMCES; Todd Winship, Temenos Group
GBTM: Graph Based Troubleshooting Method for Handling Customer Cases Using Storage System Log(abstract, pdf)
Subhendu Khatuya*, IIT KHARAGPUR; Ajay Bakshi, NetApp; Jayanta Basak, NetApp; Niloy Ganguly, IIT Kharagpur; Bivas Mitra, Indian Institute of Technology Kharagpur
Present day computing environments comprise of different bits of hardware and software that are associated with each other in a complex way. Hence, in case of failures of such system, it is very difficult to detect the exact module which has caused the problem. In such a situation, an automated technique which can pin down to (at least) a set of modules that may be responsible for the failure would be very useful for support engineers – in this paper we exactly do that. We propose a graph based troubleshooting methodology exploring storage system logs (EMS) of around 4500 customer cases to troubleshoot customer problems. We provide a ranked list of modules to the support engineers which can significantly narrow down the troubleshooting process for around 95% cases with only 10% false positive rate where as the competing baseline MonitorRank covers only 74% cases with 23% false positive rate.
Fusion of Modern and Tradition: A Multi-Stage-Based Deep Network Approach for Head Detection(abstract, pdf)
Chih-Chieh Hung*, Tamkang university; Fu-Chun Hsu , The University of Melbourne
Detecting humans in video is becoming essential for monitoring crowd behavior. Head detection is proven as a promising way to realize detecting and tracking crowd. In this paper, a novel learning strategy, called Deep Motion Information Network (abbr. as DMIN) is proposed for head detection. The concept of DMIN is to borrow the traditional well-developed head detection approaches which are composed of multiple stages, and then replace each stages in the pipeline into a cascade of sub-deep-networks to simulate the function of each stage. This learning strategy can lead to many benefits such as preventing many trial and error in designing deep networks, achieving global optimization for each stage, and reducing the amount of training dataset needed. The proposed approach is validated using the PETS2009 dataset. The results show the proposed approach can achieve impressive speedup of the process in addition to significant improvement in recall rates. A very high F-score of 85% is achieved using the proposed network that is by far higher than other methods proposed in literature.Chih-Chieh Hung*, Tamkang university; Fu-Chun Hsu , The University of Melbourne
Learning Treatment Regimens from Electronic Medical Records(abstract, pdf)
Hung Hoang*, Japan Advanced Institute of Science and Technology; Tu Bao Ho, JAIST
Appropriate treatment regimens play a vital role in improving patients’ health status. Although some achievements have been made, few of the recent studies of learning treatment regimens have exploited different kinds of patient information due to the difficulty in adopting heterogeneous data to many data mining methods. Moreover, current studies seem too rigid with fixed intervals of treatment periods. To this end, this work proposes a generic data-driven framework which can derive group-treatment regimens from electronic medical records by utilizing a mixed-variate restricted Boltzmann machine and incorporating medical domain knowledge. We conducted experiments on coronary artery disease as a case study. The obtained results shows that the framework is promising and capable of assisting physicians in making clinical decisions.Hung Hoang*, Japan Advanced Institute of Science and Technology; Tu Bao Ho, JAIST
Human, Behaviour and Interactions (Application) (Part 1)
Mining POI Alias from Microblog Conversations(abstract, pdf)
Yihong Zhang*, Kyoto University; Lina Yao, UNSW
In location-based analysis for microblogs, it is important to know if two toponyms refer to the same point-of-interest (POI). However, existing online knowledge bases are often incomplete or inaccurate for toponym data, especially for those used in informal conversations. In this paper, we propose a method for extracting compatible toponyms from microblog conversations. We propose three compatibility measures, namely, geographical closeness, surface name similarity, and association similarity. We show that by combining these measures and using an algorithm for weight tuning, we can reach an high discovery accuracy of 77%. The finding of this paper can be useful for improving location-based analysis as well as extending existing knowledge bases.Yihong Zhang*, Kyoto University; Lina Yao, UNSW
DyPerm: Maximizing Permanence for Dynamic Community Detection(abstract, pdf)
Prerna Agarwal, IBM Research; Richa Verma, IIIT Delhi; Ayush Agarwal, IIIT Delhi; Tanmoy Chakraborty*, IIIT Delhi, India
In this paper, we propose DyPerm, the first dynamic community detection method which optimizes a novel community scoring metric, called permanence. DyPerm incrementally modifies the community structure by updating those communities where the editing of nodes and edges has been performed, keeping the rest of the network unchanged. We present strong theoretical guarantees to show how/why mere updates on the existing community structure leads to permanence maximization in dynamic networks, which in turn decreases the computational complexity drastically. Experiments on both synthetic and six real-world networks with given ground-truth community structure show that DyPerm achieves (on average) 35% gain in accuracy (based on NMI) compared to the best method among four baseline methods. DyPerm also turns out to be 15 times faster than its static counterpart. For the sake of reproducibility, we have released the anonymized codes and datasets at https://tinyurl.com/dyperm-code. Prerna Agarwal, IBM Research; Richa Verma, IIIT Delhi; Ayush Agarwal, IIIT Delhi; Tanmoy Chakraborty*, IIIT Delhi, India
Mining User Behavioral Rules from Smartphone Data through Association Analysis(abstract, pdf)
Iqbal Sarker*, Swinburne University of Technology; Flora Salim, RMIT University
The increasing popularity of smart mobile phones and their powerful sensing capabilities have enabled the collection of rich contex- tual information and mobile phone usage records through the device logs. This paper formulates the problem of mining behavioral association rules of individual mobile phone users utilizing their smartphone data. Association rule learning is the most popular technique to discover rules utilizing large datasets. However, it is well-known that a large proportion of association rules generated are redundant. This redundant production makes not only the rule-set unnecessarily large but also makes the de- cision making process more complex and ine ective. In this paper, we propose an approach that e ectively identi es the redundancy in associ- ations and extracts a concise set of behavioral association rules that are non-redundant. The e ectiveness of the proposed approach is examined by considering the real mobile phone datasets of individual users.Iqbal Sarker*, Swinburne University of Technology; Flora Salim, RMIT University
A Context-aware Evaluation Method of Driving Behavior(abstract, pdf)
Yikai Zhai, Beihang University; Tianyu Wo*, Beihang University; Xuelian Lin, Beihang University; Zhou Huang, Beihang University; Junyu Chen, Tsinghua University
As Uber-like chauffeured car services become more and more popular, many drivers have joined the market without special training. To ensure the safety and efficiency of transportation services, it is an important task to accurately evaluate the driving performance of individual driver. Most of the existing methods basically depend on the statistic of abnormal driving events extracted from individual vehicles. However, the occurrence of abnormal events can be affected by various factors, such as road conditions, time of day and weather. It can be bias to judge the driverӳ performance by merely counting the abnormal events without considering the driving context. In this paper, we analyze the influence of driving context over driving behaviors and propose a context-aware evaluation method. Instead of taking all the occurrence of driving events as the same, we adopt the TF-IDF to determine the risk weight of a driving event in a specific driving context. Based on the risk-weighted statistics, we evaluate the driving performance precisely and normalize it using the Z score model. An evaluation system is implemented. We evaluate the effectiveness of our method based on a real dataset with 3-year traces of 1000 drivers. The normalized score determined by our method have a greater correlation (0.611) with the accident records than that of the number of abnormal driving events (0.523).Yikai Zhai, Beihang University; Tianyu Wo*, Beihang University; Xuelian Lin, Beihang University; Zhou Huang, Beihang University; Junyu Chen, Tsinghua University
Measurement of Users’ Experience on Online Platforms from their Behavior Logs(abstract, pdf)
Deepali Jain, Adobe Research; Atanu R. Sinha*, Adobe Research; Deepali Gupta, IIT Delhi; Nikhil Sheoran, IIT Roorkee; Sopan Khosla, Adobe Research
Delivering effective, personalized user experience by online platforms calls for individualized measure. Explicit measurement of experience, as mostly practiced, takes the form of satisfaction scores obtained by asking questions to users. Obtaining response from every user is not feasible, the responses are conditioned on the questions, and provide only a snapshot, while experience is a journey. Instead, we measure experience values implicitly from users’ click actions (events), thereby measuring for every user and for every event. The latent experience values are obtained without-asking-questions, by combining a recurrent neural network (RNN) with value elicitation from event-sequence. The platform environment is modeled using an RNN, recognizing that a user’s sequence of actions has a temporal dependence structure. We then propose eliciting value of a user’s experience as a latent construct in this environment. We offer two methods: one based on rules crafted from marketing and consumer behavior theories, and another data-driven approach where we apply fixed point iteration, similar to the approach used in model-based reinforcement learning. Evaluation and comparison with baseline show that experience values by themselves provide a good basis for predicting conversion behavior, without feature engineering. Deepali Jain, Adobe Research; Atanu R. Sinha*, Adobe Research; Deepali Gupta, IIT Delhi; Nikhil Sheoran, IIT Roorkee; Sopan Khosla, Adobe Research
Human, Behaviour and Interactions (Application) (Part 2)
Mining Human Periodic Behaviors using Mobility Intention and Relative Entropy(abstract, pdf)
feng Yi*, Institute of Information Engineering, Chinese Academy of Sciences; Libo Yin, China Industrial Control Systems Cyber Emergency Response Team; Hui Wen, Institute of information Engineering, CAS; Hongsong Zhu, Institute of information Engineering, CAS; Limin Sun, Institute of Information Engineering, Chinese Academy of Sciences; Bo Jiang, IIE?CAS
Human periodic behaviors mining is essential to many applications, and existing research works show that human behaviors are periodic. However, existing human periodic works are reported with limited improvements in using periodicity of locations and unsatisfactory accuracy for oscillation of human periodic behaviors. To address these challenges, in this paper we propose a Mobility Intention and Relative Entropy (MIRE) model, in which the mobility intentions extracted from spatiotemporal data to characterize usersӨistory records. A new periodicity detection algorithm based on relative entropy is then proposed, and a rigorous proof that the maximal relative entropy corresponds to the correct periodicity is provided. The experimental results on real world datasets demonstrate that the proposed MIRE model can properly mining human periodic behaviors, and it significantly outperforms state-of-the-art periodicity detection algorithms.feng Yi*, Institute of Information Engineering, Chinese Academy of Sciences; Libo Yin, China Industrial Control Systems Cyber Emergency Response Team; Hui Wen, Institute of information Engineering, CAS; Hongsong Zhu, Institute of information Engineering, CAS; Limin Sun, Institute of Information Engineering, Chinese Academy of Sciences; Bo Jiang, IIE,CAS
Context-Uncertainty-Aware Chatbot Action Selection via Parameterized Auxiliary Reinforcement Learning(abstract, pdf)
Chuandong Yin*, The University of Melbourne; Rui Zhang, ” University of Melbourne, Australia”; Jianzhong Qi, The University of Melbourne; Yu Sun, University of Melbourne; Tan Tenglun, The University of Melbourne
We propose a context-uncertainty-aware chatbot and a reinforcement learning (RL) model to train the chatbot. The proposed model is named Parameterized Auxiliary Asynchronous Advantage Actor Critic (PA4C). We utilize a user simulator to simulate the uncertainty of usersҠutterance based on real data. Our PA4C model interacts with simulated users to gradually adapt to different usersҠutterance confidence in a conversation context. Compared with naive rule-based approaches, our chatbot trained via the PA4C model avoids hand-crafted action selection and is more robust to user utterance variance. The PA4C model optimizes conventional RL models with action parameterization and auxiliary tasks for chatbot training, which address the problems of a large action space and zero-reward states. We evaluate the PA4C model over training a chatbot for calendar event creation tasks. Experimental results show that our model outperforms the state-of-the-art RL models in terms of success rate, dialogue length, and episode reward.Chuandong Yin*, The University of Melbourne; Rui Zhang, ” University of Melbourne, Australia”; Jianzhong Qi, The University of Melbourne; Yu Sun, University of Melbourne; Tan Tenglun, The University of Melbourne
Learning Product Embedding from Multi-relational User Behavior(abstract, pdf)
Zhao Zhang*, PEKING UNIVERSITY; Weizheng Chen, PEKING UNIBERSITY; Xiaoxuan Ren, Peking University; Yan Zhang, Peking University
Network embedding is a very important method to learn low-dimensional representations of vertexes in networks, which is very useful in many tasks such as label classification and visualization. However, most existing network embedding methods can only learning embedding from single relational network, which only contains one type of edge relationship between two nodes. But in real world, especially in product network, many information is presented in multi-relational network. Based on user behavior, edges in product network have many types: “co-purchasing”, “co-viewing”, “view after purchasing” and so on. So, we propose a novel network embedding method aiming to embed multi-relational product network into a low-dimensional vector space. We show that our method leads to better performance on label classification and visualization tasks in product network.Zhao Zhang*, PEKING UNIVERSITY; Weizheng Chen, PEKING UNIBERSITY; Xiaoxuan Ren, Peking University; Yan Zhang, Peking University
Vulnerability Assessment of Metro Systems Based on Dynamic Network Structure(abstract, pdf)
Jun Pu, Computer Network Information Center, Chinese Academy of Sciences; Chuanren Liu, Drexel University; Jianghua Zhao, Computer Network Information Center, Chinese Academy of Sciences; Ke Han, Imperial College London; Yuanchun Zhou*, Computer Network Information Center, Chinese Academy of Sciences
Invulnerable metro systems are essential for the safety and efficiency of urban transportation services. Therefore, it is of significant interest to systematically assess the vulnerability of metro systems. To this end, in this paper, we assess the vulnerability of metro systems with a data-driven framework in which dynamic travel patterns are considered. Specifically, we use effective attack strategies based on the topology structure of metro networks. The network structure depends on not only connectivity among metro stations but also dynamic passenger flow patterns. Thus, two data-driven metrics, satisfaction rate (SR) and satisfaction rate with path cost (SRPC), are proposed to quantify the vulnerability of metro networks after our attack strategies. Finally, we conduct experiments on Shanghai metro system. The results indicate that the metro system is vulnerable to malicious attacks while it shows strong robustness to random failures. Our results also highlight weak-points and bottlenecks in the system, which may bear practical managerial implications for policymakers to improve the reliability and robustness of the metro systems and the public transportation services.Jun Pu, Computer Network Information Center, Chinese Academy of Sciences; Chuanren Liu, Drexel University; Jianghua Zhao, Computer Network Information Center, Chinese Academy of Sciences; Ke Han, Imperial College London; Yuanchun Zhou*, Computer Network Information Center, Chinese Academy of Sciences
Visual Relation Extraction via Multi-modal Translation Embedding Based Model(abstract, pdf)
Zhichao Li*, BUPT; Yuping Han, BUPT; Yajing Xu, Beijing University of Posts and Telecommunications; Sheng Gao, BUPT
Visual relation, such as “person holds dog” is an effective semantic unit for image understanding, as well as a bridge to connect computer vision and natural language. Recent work has been proposed to extract the object features in the image with the aid of respective textual description. However, very little work has been done to com- bine the multi-modal information to model the subject-predicate-object relation triplets to obtain deeper scene understanding. In this paper, we propose a novel visual relation extraction model named Multi-modal Translation Embedding Based Model to integrate the visual information and respective textual knowledge base. For that, our proposed model places objects of the image as well as their semantic relationships in t- wo di erent low-dimensional spaces where the relation can be modeled as a simple translation vector to connect the entity descriptions in the knowledge graph. Moreover, we also propose a visual phrase learning method to capture the interactions between objects of the image to en- hance the performance of visual relation extraction. Experiments are conducted on two real world datasets, which show that our proposed model can bene t from incorporating the language information into the relation embeddings and provide signi cant improvement compared to the state-of-the-art methods.Zhichao Li*, BUPT; Yuping Han, BUPT; Yajing Xu, Beijing University of Posts and Telecommunications; Sheng Gao, BUPT
Anomaly detection and analytics
Sub-trajectory- and Trajectory-Neighbor-based Outlier Detection over Trajectory Streams(abstract, pdf)
zhihua zhu*, Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences; di yao, Institute of Computing Technology, Chinese Academy of Sciences; jianhui huang, Institute of Computing Technology, Chinese Academy of Sciences; jingping bi, Institute of Computing Technology, Chinese Academy of Sciences; Hanqiang Li, National Defence Key Laboratory of Blind Processing of Signals, Chengdu, China
Precisely and efficiently anomaly detection over trajectory streams is critical for many real-time applications. However, due to the uncertainty and complexity of behaviors of objects over trajectory streams, this problem has not been well solved. In this paper, we propose a novel detection algorithm, called STN-Outlier, for real time applications, where a set of fine-grained behavioral features are extracted from the sub-trajectory instead of point and a novel distance function is designed to measure the behavior similarity between two trajectories. Additionally, an optimized framework(TSX) is introduced to reduce the CPU resources cost of STN-Outlier. The performance experiments demonstrate that STN-Outlier successfully captures more fine-grained behaviors than the state-of-the-art methods; besides, the TSX framework outperforms the baseline solutions in terms of the CPU time in all cases.zhihua zhu*, Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences; di yao, Institute of Computing Technology, Chinese Academy of Sciences; jianhui huang, Institute of Computing Technology, Chinese Academy of Sciences; jingping bi, Institute of Computing Technology, Chinese Academy of Sciences; Hanqiang Li, National Defence Key Laboratory of Blind Processing of Signals, Chengdu, China
An Unsupervised Boosting Strategy for Outlier Detection Ensembles(abstract, pdf)
Guilherme Campos*, UFMG; Arthur Zimek, University of Southern Denmark; Wagner Meira Jr., UFMG
Ensemble techniques have been applied to the unsupervised outlier detection problem in some scenarios. Challenges are the generation of diverse ensemble members and the combination of individual results into an ensemble. For the latter challenge, some methods tried to design smaller ensembles out of a wealth of possible ensemble members, to improve the diversity and accuracy of the ensemble (relating to the ensemble selection problem in classification). We propose a boosting strategy for combinations showing improvements on benchmark datasets.Guilherme Campos*, UFMG; Arthur Zimek, University of Southern Denmark; Wagner Meira Jr., UFMG
DeepAD: A Generic Framework based on Deep Learning for Time Series Anomaly Detection(abstract, pdf)
Teodora Sandra Buda*, IBM; Bora Caglayan, IBM ; Haytham Assem, IBM
This paper presents a generic anomaly detection approach for time-series data. Existing anomaly detection approaches have several drawbacks such as a large number of false positives, parameters tuning difficulties, the need for a labeled dataset for training, use-case restrictions, or difficulty of use. We propose DeepAD, an anomaly detection framework that leverages a plethora of time-series forecasting models in order to detect anomalies more accurately, irrespective of the underlying complex patterns to be learnt. Our solution does not rely on the labels of the anomalous class for training the model, nor for optimizing the threshold based on highest detection given the labels in the training data. We compare our framework against EGADS framework on real and synthetic data with varying time-series characteristics. Results show significant improvements on average of 25% and up to 40-50% in F-score, precision, and recall on the Yahoo Webscope Benchmark.Teodora Sandra Buda*, IBM; Bora Caglayan, IBM ; Haytham Assem, IBM
Anomaly detection technique robust to units and scales of measurement(abstract, pdf)
Sunil Aryal*, Federation University
Existing anomaly detection methods are sensitive to units and scales of measurement. Their performances vary significantly if feature values are measured in different units or scales. In many data mining applications, units and scales of feature values may not be known. This paper introduces a new anomaly detection technique using unsupervised stochastic forest, called ‘usfAD’, which is robust to units and scales of measurement. Empirical results show that it produces more consistent results than five state-of-the-art anomaly detection techniques across a wide range of synthetic and benchmark datasets.Sunil Aryal*, Federation University
Automated Explanations of User-expected Trends for Aggregate Queries(abstract, pdf)
Ibrahim Ibrahim*, University of Queensland; Xue Li, University of Queensland; Xin Zhao, University of Queensland; Sanad AL Maskari, University of Queensland; Abdullah Albarrak, University of Queensland; Yanjun Zhang, University of Queensland
Recently, a deeper level of data exploration has emerged enabling users to infer anomalies in their queries. This exploration level strives to explain why a particular anomaly exists within a query result by providing a set of explanations. These explanations are precisely a set of alterations, such that when applied on the original query cause anomalies to disappear. Trends are pattern changes in business applications generated based on SQL aggregated queries. Additionally, a user expected trend is a particular pattern change in data was supposedly happen based on businesses studies. In this paper, we generalize this process to automatically produce explanations for users expected trends. We propose User Trend Explanations (UTE) framework which provides insightful explanations by taking a set of user-specified points (called prospective trend), and finds a top explanation that produce this trend.We develop a notion of uniformity of a predicate on a given output, and implement a set of algorithms to search the data space efficiently and effectively. The key idea is harnessing the linear search space rather than the exponential space to enable accurate explanations that are possible with tuples.Our experiments on real datasets show significant improvements UTE provides when compared with state-of-the-art related algorithms.Ibrahim Ibrahim*, University of Queensland; Xue Li, University of Queensland; Xin Zhao, University of Queensland; Sanad AL Maskari, University of Queensland; Abdullah Albarrak, University of Queensland; Yanjun Zhang, University of Queensland
Social Spammer Detection: A Multi-Relational Embedding Approach(abstract, pdf)
Jun Yin, University of Technology Sydney; Zili Zhou, University of Technology Sydney; Shaowu Liu, University of Technology Sydney; Zhiang Wu*, Nanjing University of Finance and Economics; Guandong Xu, University of Technology Sydney, Australia
Since the relation is the main data shape of social networks, social spammer detection desperately needs a relation-dependent but content-independent framework. Some recent detection method transforms the social relations into a set of topological features, such as degree, k-core, etc. However, the multiple heterogeneous relations and the direction within each relation have not been fully explored for identifying social spammers. In this paper, we make an attempt to adopt the Multi-Relational Embedding (MRE) approach for learning latent features of the social network. The MRE model is able to fuse multiple kinds of different relations and also learn two latent vectors for each relation indicating both sending role and receiving role of every user, respectively. Experimental results on a real-world multi-relational social network demonstrate the latent features extracted by our MRE model can improve the detection performance remarkably.Jun Yin, University of Technology Sydney; Zili Zhou, University of Technology Sydney; Shaowu Liu, University of Technology Sydney; Zhiang Wu*, Nanjing University of Finance and Economics; Guandong Xu, University of Technology Sydney, Australia
Opinion Mining and Sentiment Analysis
Learning to Rank Items of Minimal Reviews using Weak Supervision(abstract, pdf)
Yassien Shaalan*, RMIT; Xiuzhen Zhang, RMIT University; Jeffrey Chan, RMIT University, Australia
Customer reviews and star ratings are widely used on Ecommerce and reviewing sites for the public to express their opinions. To help the online public make decisions, items (e.g., products, services, movies, books) are typically represented and ordered by an aggregated star rating from all reviews. Existing approaches simply average star ratings or use other statistical functions to aggregate star ratings. However, these approaches rely on the existence of large numbers of reviews to work effectively. On the other hand, many new items have few reviews. In this paper, we argue that at the core of review aggregation is ranking items, hence, we cast the problem of ranking a set of items as a learning to rank (L2R) problem to address the issue of reviews scarcity. We devise a rank-oriented loss function to directly optimize the ranking of groups of items. Standard L2R models require ranking labels for training, but item ranking ground-truth information is not always available. Therefore, we propose to aggregate star ratings for items with large numbers of reviews to automatically generate weak supervision ranking labels for training. We further propose to extract features from review contents, rating distributions and helpfulness information to train the ranking model. Extensive experiments on an Amazon dataset showed that our model is very effective compared to state-of-the-art heuristic aggregation approaches, regression and standard L2R approaches.Yassien Shaalan*, RMIT; Xiuzhen Zhang, RMIT University; Jeffrey Chan, RMIT University, Australia
Multimodal Mixture Density Boosting Network for Personality Mining(abstract, pdf)
Nhi Vo*, University of Technology Sydney; Shaowu Liu, University of Technology Sydney; Guandong Xu, University of Technology Sydney, Australia; Xuezhong He, University of Technology Sydney
Knowing people’s personalities is useful in various real-world applications, such as personnel selection. Traditionally, we have to rely on qualitative methodologies, e.g. surveys or psychology tests to determine a person’s traits. However, recent advances in machine learning have it possible to automate this process by inferring personalities from textual data. Despite of its success, text-based method ignores the facial expression and the way people speak, which can also carry important information about human characteristics. In this work, a personality mining framework is proposed to exploit all the information from videos, including visual, auditory, and textual perspectives. Using a state-of-art cascade network built on advanced gradient boosting algorithms, the result produced by our proposed methodology can achieve lower prediction errors than most current machine learning algorithms. Our multimodal mixture density boosting network especially perform well with small sample size datasets, which is useful for learning problems in psychology fields where big data is often not available.Nhi Vo*, University of Technology Sydney; Shaowu Liu, University of Technology Sydney; Guandong Xu, University of Technology Sydney, Australia; Xuezhong He, University of Technology Sydney
Identifying Singleton Spammers via Spammer Group Detection(abstract, pdf)
Dheeraj Kumar, Purdue University; Yassien Shaalan, RMIT; Xiuzhen Zhang*, RMIT University; Jeffrey Chan, RMIT University, Australia
Opinion spam is a well-recognized threat to the credibility of online reviews. Existing approaches to detecting spam reviews or spammers examine review content, reviewer behavior and reviewer-product network, and often operate on the assumption that spammers write at least several if not many fake reviews. On the other hand, spammers setup multiple sockpuppet IDs and write one-time, singleton spam reviews to avoid detection. It is reported that for most review sites, a large portion, sometimes over 90%, of reviewers are singletons (identified by the reviewer ID). Singleton spammers are difficult to catch due to the scarcity of behavioral clues. In this paper, we argue that the key to detect singleton spammers (and their fake reviews) is to detect group spam attacks by inferring the hidden collusiveness among them. To address the challenge of lack of explicit behavioral signals for singleton reviewers, we propose to infer the hidden reviewer-product associations by completing the review-product matrix by leveraging the product and review metadata and text. Experiments on three real-life Yelp datasets established that our approach can effectively detect singleton spammers via group detection, which are often missed by existing approaches.Dheeraj Kumar, Purdue University; Yassien Shaalan, RMIT; Xiuzhen Zhang*, RMIT University; Jeffrey Chan, RMIT University, Australia
Adaptive Attention Network for Review Sentiment Classification(abstract, pdf)
Chuantao Zong, Sun Yat-sen University; Wenfeng Feng, Sun Yat-sen University; Vincent W. Zheng, Advanced Digital Sciences Center; Hankz Hankui Zhuo*, Sun Yat-sen University
Document-level sentiment classification is an important NLP task. The state of the art shows that attention mechanism is particularly effective on document-level sentiment classification. Despite the success of previous attention mechanism, it neglects the correlations among inputs (e.g., words in a sentence), which can be useful for improving the classification result. In this paper, we propose a novel Adaptive Attention Network (AAN) to explicitly model the correlations among inputs. Our AAN has a two-layer attention hierarchy. It first learns an attention score for each input. Given each input’s embedding and attention score, it then computes a weighted sum over all the words’ embeddings. This weighted sum is seen as a “contex” embedding, aggregating all the inputs. Finally, to model the correlations among inputs, it computes another attention score for each input, based on the input embedding and the context embedding. These new attention scores are our final output of AAN. In document-level sentiment classification, we apply AAN to model words in a sentence and sentences in a review. We evaluate AAN on three public data sets, and show that it outperforms state-of-the-art baselines.Chuantao Zong, Sun Yat-sen University; Wenfeng Feng, Sun Yat-sen University; Vincent W. Zheng, Advanced Digital Sciences Center; Hankui Zhuo*, Sun Yat-sen University
Cross-Domain Sentiment Classification via A Bifurcated-LSTM(abstract, pdf)
Jinlong Ji*, Case Western Reserve University; Changqing Luo, Case Western Reserve University; Xuhui Chen, Case Western Reserve University; Lixing Yu, Case Western Reserve University; Pan Li, Case Western Reserve University
Sentiment classification plays a vital role in current online commercial transactions because it is critical to understand users’ opinions and feedbacks in businesses or products. Cross-domain sentiment classification can adopt a well-trained classifier from one source domain to other target domains, which reduces the time and efforts of training new classifiers in these domains. Existing cross-domain sentiment classification methods require data or other information in target domains in order to train their models. However, collecting and processing new corpora require very heavy workload. Besides, the data in target domains may be private and not always available for training. To address these issues, motivated by multi-task learning, we design a Bifurcated-LSTM which takes advantages of attention-based LSTM classifiers along with augmented dataset and orthogonal constraints. This Bifurcated-LSTM can extract domain-invariant sentiment features from the source domain to perform sentiment analysis in different target domains. We conduct extensive experiments on seven classic types of product reviews, and results show that our system leads to significant performance improvement.Jinlong Ji*, Case Western Reserve University; Changqing Luo, Case Western Reserve University; Xuhui Chen, Case Western Reserve University; Lixing Yu, Case Western Reserve University; Pan Li, Case Western Reserve University
Graphical models, latent variables and statistical methods (Part 1)
Probabilistic Topic and Role Model for Information Diffusion in Social Network(abstract, pdf)
Hengpeng Xu*, Nankai University; Jinmao Wei, Nankai University; Zhenglu Yang, Nankai University; Jianhua Ruan, University of Texas at San Antonio; Jun Wang, Nankai university
Information diffusion, which addresses the issue of how a piece of information spreads and reaches individuals in or between networks, has attracted considerable research attention due to its widespread applications, such as viral marketing and rumor control. However, the process of information diffusion is complex and its underlying mechanism remains unclear. An important reason is that social influence takes many forms and each form may be determined by various factors. One of the major challenges is how to capture all the crucial factors of a social network such as users’ interests (which can be represented as topics), users’ attributes (which can be summarized as roles), and users’ reposting behaviors in a unified manner to model the information diffusion process. To address the problem, we propose the joint information diffusion model (TRM) that integrates user topical interest extraction, role recognition, and information diffusion modeling into a unified framework. TRM seamlessly unifies the user topic role extraction, role recognition, and modeling of information diffusion, and then translates the calculations of individual level influence to the role-topic pairwise influence, which can provide a coarse-grained diffusion representation. Extensive experiments on two real-world datasets validate the effectiveness of our approach under various evaluation indices, which performs superior than the state-of-the-art models by a large margin.Hengpeng Xu*, Nankai University; Jinmao Wei, Nankai University; Zhenglu Yang, Nankai University; Jianhua Ruan, University of Texas at San Antonio; Jun Wang, Nankai university
Course-Specific Markovian Models for Grade Prediction(abstract, pdf)
Qian Hu*, George Mason University; Huzefa Rangwala, George Mason University
Over the past 15 years, the average six-year graduation rates for colleges and universities across the Unites States have remained stable at around 60%. This vehemently impacts society in terms of workforce development, national productivity and economic activity. Educational early-warning systems have been identified as an important approach to tackle this problem. The key to these systems are accurate grade prediction algorithms. In this paper we propose application of markovian models for the problem of predicting next-term student performance. Traditional approaches predict studentӳ grade in a course by using a subset of prior courses and content features. However, these models ignore the dynamic evolution of studentӳ knowledge states, which is a strong influence on studentӳ learning and performance. We developed course-specific Hidden Markov Models and Hidden Semi-markov Models for the problem of next-term grade prediction. Our experimental results on datasets from a large public university show that the proposed approaches outperform prior state-of-the-art methods. We show by a case study the application of these methods for early identification of at-risk students.Qian Hu*, George Mason University; Huzefa Rangwala, George Mason University
A Temporal Topic Model For Noisy Mediums(abstract, pdf)
Robert Churchill*, Georgetown University; Lisa Singh, Georgetown University; Christo Kirov, Georgetown University
Social media and online newspaper content is increasing rapidly. The goal of this work is to identify the topics associated with this content and understand the changing dynamics of these topics over time. We pro- pose Topic Flow Model (TFM), a graph theoretic temporal topic model that identifies topics as they emerge, and tracks them through time as they persist, diminish, and re-emerge. TFM identifies topic words by capturing the changing relationship strength of words over time, and offers solutions for dealing with flood words, i.e., domain specific words that pollute topics. We conduct an extensive empirical analysis of TFM on Twitter data, newspaper articles, and synthetic data, and find that the topic accuracy and signal to noise ratio of meaningful topic words are better than existing state of the art methods.Robert Churchill*, Georgetown University; Lisa Singh, Georgetown University; Christo Kirov, Georgetown University
A CRF-based Stacking Model with Meta-Features for Named Entity Recognition(abstract, pdf)
Shifeng Liu*, University of New South Wales; Yifang Sun, University of New South Wales; Wei Wang, University of New South wales; Xiaoling Zhou, University of New South Wales
Named Entity Recognition (NER) is a challenging task in Natural Language Processing, which attracts researchers for many years. Traditional methods for NER are based on handcrafted rules. Recently, machine learning based methods are introduced for the NER task and achieve the state-of-the-art performance. As an alternative way to handle the NER task, stacking, which combines a set of classifiers into one classifier, has not been well explored for the NER task. In this paper, we propose a stacking model for the NER task. We extend the original stacking model from both model and feature aspects. We use Conditional Random Fields as the level-1 classifier, and we also apply meta-features from global aspect and local aspect of the level-0 classifiers and tokens in our model. In the experiments, our model achieves the state-of-the-art performance on the CoNLL 2003 Shared task with an F1 score of 92.38%. We also show that all the proposed features are effective to our model. Our model also reveals its robustness when the performance of the level-0 classifiers is low.Shifeng Liu*, University of New South Wales; Yifang Sun, University of New South Wales; Wei Wang, University of New South wales; Xiaoling Zhou, University of New South Wales
Adding Missing Words to Regular Expressions(abstract, pdf)
Thomas Rebele*, Tꭩcom ParisTech; Katerina Tzompanaki, Cergy-Pontoise University; Fabian Suchanek, Telecom ParisTech
Regular expressions (regexes) are patterns that are used in many applications to extract words or tokens from text. However, even hand-crafted regexes may fail to match all the intended words. In this paper, we propose a novel way to generalize a given regex so that it matches also a set of missing (previously non-matched) words. Our method finds an approximate match between the missing words and the regex, and adds disjunctions for the unmatched parts appropriately. We show that this method can not just improve the precision and recall of the regex, but also generate much shorter regexes than baselines and competitors on various datasets.Thomas Rebele*, Télécom ParisTech; Katerina Tzompanaki, Cergy-Pontoise University; Fabian Suchanek, Telecom ParisTech
Graphical models, latent variables and statistical methods (Part 2)
Marrying Community Discovery and Role Analysis in Social Media via Topic Modeling(abstract, pdf)
Gianni Costa, ICAR-CNR; Riccardo Ortale*, ICAR-CNR
We explore the adoption of topic modeling to inform the seamless integration of community discovery and role analysis. For this purpose, we develop a new Bayesian probabilistic generative model of social media, according to which the observation of social links and textual contents is governed by novel and intuitive relationships among latent content topics, communities and roles. Variational inference under the devised model allows for exploratory, descriptive and predictive tasks, including the detection and interpretation of overlapping communities, roles and topics as well as the prediction of missing links. Extensive tests on real-world social media reveal a superior accuracy of the proposed model in comparison to state-of-the-art competitors, which substantiates the rationality of the motivating intuition. The experimental results are also insightfully inspected from a qualitative viewpoint.Gianni Costa, ICAR-CNR; Riccardo Ortale*, ICAR-CNR
Text Generation Based on Generative Adversarial Nets with Latent Variables(abstract, pdf)
heng wang, Beihang University; Zengchang Qin*, Beihang University; Tao Wan, Beihang University
In this paper, we propose a model using generative adversarial net (GAN) to generate realistic text. Instead of using standard GAN, we combine variational autoencoder (VAE) with generative adversarial net. The use of high-level latent random variables is helpful to learn the data distribution and solve the problem that generative adversarial net always emits the similar data. We propose the VGAN model where the generative model is composed of recurrent neural network and VAE. The discriminative model is a convolutional neural network. We train the model via policy gradient. We apply the proposed model to the task of text generation and compare it to other recent neural network based models, such as recurrent neural network language model and SeqGAN. We evaluate the performance of the model by calculating negative log-likelihood and the BLEU score. We conduct experiments on three benchmark datasets, and results show that our model outperforms other previous models. heng wang, Beihang University; Zengchang Qin*, Beihang University; Tao Wan, Beihang University
Geminio: Finding Duplicates in a Question Haystack(abstract, pdf)
sandya mannarswamy*, Conduent Labs ; saravanan chidambaram, HPE
Effective reuse of existing crowd-sourced intelligence present in Community Question Answering (CQA) forums requires efficient approaches for the problem of Duplicate Question Detection (DQD). Approaches which use standalone encoded representations of each of the questions in the question pair fail to use the cross-question interactions between the two questions which impacts their performance adversely. In this paper, we propose two new schemes for DQD task. Our first approach leverages semantic relations and our second approach utilizes fine grained word level interactions across the two question sentences. We achieve test accuracy of 75.7% and 77.8% with our first and second approaches respectively on a publicly available DQD data set, demonstrating that cross-question analysis information can help improve DQD task performance.sandya mannarswamy*, Conduent Labs ; saravanan chidambaram, HPE
Fast Converging Multi-armed Bandit Optimization using Probabilistic Graphical Model(abstract, pdf)
Chen Zhao*, Graduate School of Systems and Information Engineering, University of Tsukuba; Kohei Watanabe, Graduate School of Information Science and Technology, University of Tokyo; Bin Yang, Rakuten; Yu Hirate, Rakuten Institute of Technology
This paper designs a strategic model that can be used to optimize click-though rates (CTR) useful to recommender systems for profit. As a vital step of data prediction, approximating a function based on discrete samples is desirable when ground truth is not directly accessible. While interpolation algorithms are prevalent in modern machine learning such as regression and non-kernel SVMs, all of which are good at speculating functions with a finite number of parameters, they are, however, not proper options for fitting arbitrary functions with no explicit form. The major contribution of this paper consists of a semi-parametric graphical model complying with properties of the Gaussian Markov random field (GMRF) to approximate general functions that can be multivariate. Based upon model inference, this paper further investigates several policies commonly used in decision-making process to optimize the multi-armed bandit model (MAB) problem. The primary objective is to locate global optimum of an unknown function. In case of recommender systems, the proposed algorithm leads to maximum user clicks with rescheduled recommendation policy maintaining the lowest possible cost. Comparative experiments are conducted among a set of policies. Empirical evaluation suggests that Thompson sampling is the most suitable policy for the proposed algorithm. Chen Zhao*, Graduate School of Systems and Information Engineering, University of Tsukuba; Kohei Watanabe, Graduate School of Information Science and Technology, University of Tokyo; Bin Yang, Rakuten; Yu Hirate, Rakuten Institute of Technology
Leveraging Label Category Relationships in Multi-class Crowdsourcing(abstract, pdf)
YUAN JIN*, Monash University; Lan Du, Monash University; Ye Zhu, Deakin University; Mark Carman, Monash University
Current quality control methods for crowdsourcing largely account for variance in worker responses to items in terms of interactions between item difficulty and worker expertise. Few have investigated the correlations that can exist between the response labels. When the number of categories is large, semantic relationships between labels become indicative of how crowd-workers respond to items, with expert workers tending to respond with more related categories and with difficult items tending to see greater spread in the responses. Based on these observations, we propose a new statistical model which contains a latent real-valued matrix for capturing the relatedness of response categories alongside variables for worker expertise, item difficulty and the true labels. The model is easily extended with prior information in terms of a hierarchical structure over response labels. Experiments show that compared with numerous state-of-the-art baselines, our model (both with and without the prior knowledge) yields superior true label prediction performance on four new crowdsourcing datasets featuring large sets of label categories. YUAN JIN*, Monash University; Lan Du, Monash University; Ye Zhu, Deakin University; Mark Carman, Monash University
Embedding Knowledge Graphs Based on Transitivity and Asymmetry of Rules(abstract, pdf)
Mengya Wang*, Sun Yat-sen University; Erhu Rong, Sun Yat-sen University; Hankz Hankui Zhuo, Sun Yat-sen University; Huiling Zhu, Sun Yat-sen University
Representation learning of knowledge graphs encodes entities and relation types into a continuous low-dimensional vector space, learns embeddings of entities and relation types. Most existing methods only concentrate on knowledge triples, ignoring logic rules which contain rich background knowledge. Although there has been some work aiming at leveraging both knowledge triples and logic rules, they ignore the transitivity and asymmetry of logic rules. In this paper, we propose a novel approach to learn knowledge representations with entities and ordered relations in knowledges and logic rules. The key idea is to integrate knowledge triples and logic rules, and approximately order the relation types in logic rules to utilize the transitivity and asymmetry of logic rules. All entries of the embeddings of relation types are constrained to be non-negative. We translate the general constrained optimization problem into an unconstrained optimization problem to solve the non-negative matrix factorization. Experimental results show that our model significantly outperforms other baselines on knowledge graph completion task. It indicates that our model is capable of capturing the transitivity and asymmetry information, which is significant when learning embeddings of knowledge graphs.Mengya Wang*, Sun Yat-sen University; Erhu Rong, Sun Yat-sen University; Hankui Zhuo, Sun Yat-sen University; Huiling Zhu, Sun Yat-sen University
Representation Learning and Embedding (Part 1)
SIGNet: Scalable Embeddings for Signed Networks(abstract, pdf)
Mohammad Islam*, Virginia Tech; B. Aditya Prakash, Virginia Tech; Naren Ramakrishnan, Virginia Tech
Recent successes in word embedding and document embedding have motivated researchers to explore similar representations for networks and to use such representations for tasks such as edge prediction, node label prediction, and community detection. Such network embedding methods are largely focused on finding distributed representations for unsigned networks and are unable to discover embeddings that respect polarities inherent in edges. We propose SIGNet, a fast scalable embedding method suitable for signed networks. Our proposed objective function aims to carefully model the social structure implicit in signed networks by reinforcing the principles of social balance theory. Our method builds upon the traditional word2vec family of embedding approaches and adds a new targeted node sampling strategy to maintain structural balance in higher-order neighborhoods. We demonstrate the superiority of SIGNet over state-of-the-art methods proposed for both signed and unsigned networks on several real world datasets from different domains. In particular, SIGNet offers an approach to generate a richer vocabulary of features of signed networks to support representation and reasoning.Mohammad Islam*, Virginia Tech; B. Aditya Prakash, Virginia Tech; Naren Ramakrishnan, Virginia Tech
Sub2Vec: Feature Learning for Subgraphs(abstract, pdf)
Bijaya Adhikari*, Virginia Tech; Yao Zhang, Virginia Tech; Naren Ramakrishnan, Virginia Tech; B. Aditya Prakash, Virginia Tech
Network embeddings have become very popular in learning effective feature representations of networks. Motivated by the recent successes of embeddings in natural language processing, researchers have tried to find network embeddings in order to exploit machine learning algorithms for mining tasks like node classification and edge prediction. However, most of the work focuses on distributed representations of nodes that are inherently ill-suited to tasks such as community detection which are intuitively dependent on subgraphs. Here, we formulate subgraph embedding problem based on two intuitive properties of subgraphs and propose Sub2Vec, an unsupervised algorithm to learn feature representations of arbitrary subgraphs. We also highlight the usability of Sub2Vec by leveraging it for network mining tasks, like community detection and graph classification.We show that Sub2Vec gets significant gains over state-of-the-art methods. In particular, Sub2Vec offers an approach to generate a richer vocabulary of meaningful features of subgraphs for representation and reasoning.Bijaya Adhikari*, Virginia Tech; Yao Zhang, Virginia Tech; Naren Ramakrishnan, Virginia Tech; B. Aditya Prakash, Virginia Tech
Interaction Content Aware Network Embedding via Co-embedding of Nodes and Edges(abstract, pdf)
Linchuan Xu*, The Hong Kong Polytechnic University; Xiaokai Wei, Facebook Inc.; Jiannong Cao, The Hong Kong Polytechnic University; Philip S Yu, UIC
Network embedding has been increasingly employed in network analysis as it can learn node representations that encode the network structure resulting from node interactions. In this paper, besides the network structure, the interaction content within which each interaction arises is also embedded because it reveals interaction preferences of the two nodes involved, and interaction preferences are essential characteristics that nodes expose in the network environment. Specifically, we propose interaction content aware network embedding (ICANE) via co-embedding of nodes and edges. The embedding of edges is to learn edge representations that preserve the interaction content. Then the interaction content can be incorporated into node representations through edge representations. Comprehensive evaluation demonstrates ICANE outperforms five recent network embedding models in applications including visualization, link prediction and classification.Linchuan Xu*, The Hong Kong Polytechnic University; Xiaokai Wei, Facebook Inc.; Jiannong Cao, The Hong Kong Polytechnic University; Philip S Yu, UIC
MetaGraph2Vec: Complex Semantic Path Augmented Heterogeneous Network Embedding(abstract, pdf)
Daokun Zhang*, University of Technology Sydney; Jie Yin, The University of Sydney; Xingquan Zhu, Florida Atlantic University; Chengqi Zhang, University of Technology Sydney
Network embedding in heterogeneous information networks (HINs) is a challenging task, due to complications of different node types and rich relationships between nodes. As a result, conventional network embedding techniques cannot work on such HINs. Recently, metapath-based approaches have been proposed to characterize relationships in HINs but they are ineffective in capturing rich contexts and semantics between nodes for embedding learning, mainly because (1) metapath is a rather strict single path node-node relationship descriptor, which is unable to accommodate variance in relationships, and (2) only a small portion of paths can match the metapath, resulting in sparse context information for embedding learning. In this paper, we advocate a new metagraph concept to capture richer structural contexts and semantics between distant nodes. A metagraph contains multiple paths between nodes, each describing one type of relationships, so the augmentation of multiple metapaths provides an effective way to capture rich contexts and semantic relations between nodes. This greatly boosts the ability of metapath-based embedding techniques in handling very sparse HINs. We propose a new embedding learning algorithm, namely MetaGraph2Vec, which uses metagraph to guide the generation of random walks and to learn latent embeddings of multi-typed HIN nodes. Experimental results show that MetaGraph2Vec is able to outperform the state-of-the-art baselines in various heterogeneous network mining tasks such as node classification, node clustering, and similarity search.Daokun Zhang*, University of Technology Sydney; Jie Yin, The University of Sydney; Xingquan Zhu, Florida Atlantic University; Chengqi Zhang, University of Technology Sydney
Representation Learning and Embedding (Part 2)
Multi-Network User Identification via Graph-Aware Embedding(abstract, pdf)
Yang Yang, Nanjing University; Yi-Feng Wu, Nanjing University; De-Chuan Zhan*, Nanjing University; Yuan Jiang, Nanjing University
User identification is widely used in anomaly detection, recommendation system and so on. Previous approaches focus on extraction of features describing users, and the learners try to emphasize the differences between different user identities. However, one applicable user identification scenario occurs in the circumstance of social network, where features of users are not acquirable while only relationships between users are provided. In this paper, we aim at the later situation, i.e., the Network User Identification, where features of users cannot be extracted in social network applications. We consider the information limitation of the single network and focus on utilizing the multiple relationships between identities from multi-networks. Different from the existing common subspace methods in Cross-Network User Identification, we propose a more discriminative Graph-Aware Embedding (GAEM) method for modeling the relationships as well as the transformation between different social networks explicitly in one unified framework. As a consequence, we can get more accurate predictions of the user identities directly based on the learned transferring model with GAEM. The experimental evaluations on real-world data demonstrate the superiorities of our proposed method comparing to the state-of-the-art.Yang Yang, Nanjing University; Yi-Feng Wu, Nanjing University; De-Chuan Zhan*, Nanjing University; Yuan Jiang, Nanjing University
Knowledge-based Recommendation with Hierarchical Collaborative Embedding(abstract, pdf)
Zili Zhou, University of Technology Sydney; Shaowu Liu, University of Technology Sydney; Guandong Xu*, University of Technology Sydney, Australia; Xing Xie, Microsoft Research Asia; Jun Yin, University of Technology Sydney; Yidong Li, Beijing Jiaotong Univeristy, China; Wu Zhang, Shanghai University
Data sparsity is a common issue in recommendation systems, particularly collaborative filtering. In real recommendation scenarios, user preferences are often quantitatively sparse because of the application nature. To address the issue, we proposed a knowledge graph-based semantic information enhancement mechanism to enrich the user preferences. Specifically, the proposed Hierarchical Collaborative Embedding (HCE) model leverages both network structure and text info embedded in knowledge bases to supplement traditional collaborative filtering. The HCE model jointly learns the latent representations from user preferences, linkages between items and knowledge base, as well as the semantic representations from knowledge base. Experiment results on GitHub dataset demonstrated that semantic information from knowledge base has been properly captured, resulting improved recommendation performance.Zili Zhou, University of Technology Sydney; Shaowu Liu, University of Technology Sydney; Guandong Xu*, University of Technology Sydney, Australia; Xing Xie, Microsoft Research Asia; Jun Yin, University of Technology Sydney; Yidong Li, Beijing Jiaotong Univeristy, China; Wu Zhang, Shanghai University
DPNE: Differentially Private Network Embedding(abstract, pdf)
Depeng Xu, University of Arkansas; Shuhan Yuan, University of Arkansas; Xintao Wu*, University of Arkansas; Hai Phan, New Jersey Institute of Technology
Learning the low-dimensional representations of the vertices in a network can help users understand the network structure and perform other data mining tasks efficiently. Various network embedding approaches such as DeepWalk and LINE have been developed recently. However, how to protect the individual privacy in network embedding has not been exploited. It is challenging to achieve high utility as the sensitivity of stochastic gradients in random walks and that of edge sampling are very high, thus incurring high utility loss when applying Laplace mechanism and exponential mechanism to achieve differential privacy. In this paper, we develop a differentially private network embedding method (DPNE). In this method, we leverage the recent theoretical findings that network embedding methods such as DeepWalk and LINE are equivalent to factorization of some matrices derived from the adjacency matrix of the original network and apply objective perturbation on the objective function of matrix factorization. We evaluate the learned representations by our DPNE from three different real world datasets on two data mining tasks: vertex classification and link prediction. Experiment results show the effectiveness of DPNE. To our best knowledge, this is the first work on how to preserve differential privacy in network embedding.Depeng Xu, University of Arkansas; Shuhan Yuan, University of Arkansas; Xintao Wu*, University of Arkansas; Hai Phan, New Jersey Institute of Technology
A Generalization of Recurrent Neural Networks for Graph Embedding(abstract, pdf)
Xiao Han*, Beijing University of Posts and Telecommunications; Chunhong Zhang, Beijing University of Posts and Telecommunications; Chenchen Guo, Beijing University of Posts and Telecommunications; Yang Ji, Beijing University of Posts and Telecommunications
Due to the ubiquity of graphs, machine learning on graphs facilitates many AI systems. In order to incorporate the rich information of graphs into machine learning models, graph embedding has been developed, which seeks to preserve the graphs into low dimensional embeddings. Recently, researchers try to conduct graph embedding via generalizing neural networks on graphs. However, most existing approaches focus on node embedding, ignoring the heterogeneity of edges. Besides, the similarity relationship among random walk sequences has been rarely discussed. In this paper, we propose a generalization of Recurrent Neural Networks on Graphs (G-RNN) for graph embedding. More specifically, first we propose to utilize edge embedding and node embedding jointly to preserve graphs, which is of great significance in multi-relational graphs with heterogenous edges. Then we propose the definition of subgraph level high-order proximity to preserve the inter-sequence proximity into the embeddings. To verify the generalization of G-RNN, we apply it to the embedding of knowledge graph, a typical multi-relational graph. Empirically we evaluate the resulting embeddings on the tasks of link prediction and node classification. The results show that the embeddings learned by G-RNN are powerful on both tasks, producing better performance than the baselines.Xiao Han*, Beijing University of Posts and Telecommunications; Chunhong Zhang, Beijing University of Posts and Telecommunications; Chenchen Guo, Beijing University of Posts and Telecommunications; Yang Ji, Beijing University of Posts and Telecommunications
NE-FLGC: Network Embedding based on Fusing Local (first-order) and Global (second-order) network structure with node Content(abstract, pdf)
Hongyan Xu*, TianJin University
This paper studies the problem of Representation Learning for network with textual information, which aims to learn low dimensional vectors for nodes by leveraging network structure and textual information. Most existing works only focus on one aspect of network structure and cannot fuse network first-order proximity, second-order proximity and textual information. In this paper, we propose a novel network embedding method NE-FLGC: Network Embedding based on Fusing Local (first-order) and Global (second-order) network structure with node Content. Especially, we adopt context-enhance method that obtains node embedding by concatenating the vector of itself and the context vectors. In experiments, we compare our model with existing network embedding models on four real-world datasets. The experimental results demonstrate that NE-FLGC is stable and significantly outperforms state-of-the-art methods.Hongyan Xu*, TianJin University
Semi-Structured Data and NLP (Part 1)
Category Multi-Representation: A Unified Solution for Named Entity Recognition in Clinical Texts(abstract, pdf)
Jiangtao Zhang*, Tsinghua University; Juanzi Li, Tsinghua University; Shuai Wang, Tsinghua University; Yan Zhang, Tsinghua University; Yixin Cao, Tsinghua University; Lei Hou, Tsinghua University; Xiaoli Li, Institute for Infocomm Research , A*STAR, Singapore
Clinical Named Entity Recognition (CNER), the task of identifying the entity boundaries in clinical texts, is essential for many applications. Previous methods usually follow the traditional NER methods that heavily rely on language specific features (i.e. linguistics and lexicons) and high quality annotated data. However, due to the problem of Limited Availability of Annotated Data and Informal Clinical Texts, CNER becomes more challenging. In this paper, we propose a novel method that learn multiple representations for each category, namely category-multi-representation (CMR) that captures the semantic relatedness between words and clinical categories from different perspectives. The CMR is learned based on a large scale unannotated corpus and a small set of annotated data, which greatly alleviates the burden of human effort. Instead of the language specific features, our proposed method uses more evidential features without any additional NLP tools, and enjoys a lightweight adaption among languages. We conduct a series of experiments to verify our new CMR features can further improve the performance of NER significantly without leveraging any external lexicons. Jiangtao Zhang*, Tsinghua University; Juanzi Li, Tsinghua University; Shuai Wang, Tsinghua University; Yan Zhang, Tsinghua University; Yixin Cao, Tsinghua University; Lei Hou, Tsinghua University; Xiaoli Li, Institute for Infocomm Research , A*STAR, Singapore
A Heterogeneous Information Network Method for Entity Set Expansion in Knowledge Graph(abstract, pdf)
Xiaohuan Cao, Beijing University of Posts and Telecommunications; Chuan Shi*, Beijing University of Posts and Telecommunications; Yuyan Zheng, Beijing University of Posts and Telecommunications; Jiayu Ding, Beijing University of Posts and Telecommunications; Xiaoli Li, Institute for Infocomm Research , A*STAR, Singapore; Bin Wu, Beijing University of Posts and Telecommunications
Entity Set Expansion (ESE) is an important data mining task, e.g. query suggestion. It aims to expand an entity seed set to obtain more entities which have traits in common. Traditionally, text and Web information are widely used for ESE. Recently, some ESE methods employ Knowledge Graph (KG) to extend entities. However, these methods usually fail to sufficiently and efficiently utilize the rich semantics contained in KG. In this paper, we use the Heterogeneous Information Network (HIN) to represent KG, which would effectively capture hidden semantic relations between seed entities. However, the complex KG introduces new challenges for HIN analysis, such as generation of meta paths between entities and addressing ambiguity caused by multiple types of objects. To solve these problems, we propose a novel Concatenated Meta Path based Entity Set Expansion method (CoMeSE). With the delicate design of the concatenated meta path generation and multi-type-constrained meta path, CoMeSE can quickly and accurately detect important path features in KG. In addition, heuristic learning and PU learning are employed to learn the weights of extracted meta paths. Extensive experiments on real dataset show that the CoMeSE accurately and quickly expands the given small entity set. Xiaohuan Cao, Beijing University of Posts and Telecommunications; Chuan Shi*, Beijing University of Posts and Telecommunications; Yuyan Zheng, Beijing University of Posts and Telecommunications; Jiayu Ding, Beijing University of Posts and Telecommunications; Xiaoli Li, Institute for Infocomm Research , A*STAR, Singapore; Bin Wu, Beijing University of Posts and Telecommunications
Identifying In-App User Actions from Mobile Web Logs(abstract, pdf)
Bilih Priyogi*, RMIT University; Mark Sanderson, RMIT University; Flora Salim, RMIT University; Jeffrey Chan, RMIT University, Australia; Martin Tomko, University of Melbourne; Yongli Ren, RMIT University
We address the problem of identifying in-app user actions from Web access logs when the content of those logs is both encrypted (through HTTPS) and also contains automated Web accesses. We find that the distribution of time gaps between HTTPS accesses can distinguish user actions from automated Web accesses generated by the apps, and we determine that it is reasonable to identify meaningful user actions within mobile Web logs by modelling this temporal feature. A real-world experiment is conducted with multiple mobile devices running some popular apps, and the results show that the proposed clusteringbased method achieves good accuracy in identifying user actions, and outperforms the state-of-the-art baseline by 17.84%.Bilih Priyogi*, RMIT University; Mark Sanderson, RMIT University; Flora Salim, RMIT University; Jeffrey Chan, RMIT University, Australia; Martin Tomko, University of Melbourne; Yongli Ren, RMIT University
Harvesting Knowledge from Cultural Heritage Artifacts in Museums of India(abstract, pdf)
Abhilasha Sancheti, Adobe Research; Paridhi Maheshwari, Indian Institute of Technology Kanpur; Rajat Chaturvedi, Indian Institute of Technology Bombay; Anish Monsy, Indian Institute of Technology Guwahati; Tanya Goyal, University of Texas at Austin; Balaji Vasan Srinivasan*, Adobe Research
Recent efforts towards digitization of cultural heritage artifacts have resulted in a surge of information around these artifacts. However, the organization of these artifacts falls short with respect to accessing the facts across these entities. In this paper, we present a method to harvest the knowledge and form a knowledge graph from the digitized artifacts in the Museums of India repository via distant supervision to enable better accessibility of the facts and ability to extract new insights around the artifacts. Triples extracted from an open information extractor are first canonicalized to a standard taxonomy based on a metric-based scoring. Since a standard taxonomy is insufficient to capture all the relationships, we propose a sequential clustering based approach to add artifact specific relationships to the taxonomy (and to the knowledge graph). The graph is enriched by inferring missing facts based on a probabilistic soft logic approach seeded from a frequent item set framework. Human evaluation of the final knowledge graph showed an accuracy of 75% on par with knowledge bases like DBpedia.Abhilasha Sancheti, Adobe Research; Paridhi Maheshwari, Indian Institute of Technology Kanpur; Rajat Chaturvedi, Indian Institute of Technology Bombay; Anish Monsy, Indian Institute of Technology Guwahati; Tanya Goyal, University of Texas at Austin; Balaji Vasan Srinivasan*, Adobe Research
Query-based Automatic Training Set Selection for Microblog Retrieval(abstract, pdf)
khaled Albishre*, QUT; Yuefeng Li, Queensland University of Technology; Yue Xu, Queensland University of Technology, Australia
Typical pseudo-relevance feedback models assume that the top-ranked documents are the most relevant and use those documents to select feedback terms for query expansion. In real applications, however, short documents, such as microblogs, may not have enough information about the searched topic, thus increasing the chance that irrelevant documents will be included in the initial set of retrieved documents. This situation reduces a feedback model’s ability to capture information that is relevant to users’ needs, which makes determining the best documents for relevant feedback without requiring extra effort from the user a critical challenge. In this paper, we propose an innovative mechanism to automatically select useful feedback documents using a topic modeling technique to improve the pseudo-relevance feedback model’s effectiveness. The main idea behind the proposed model is to discover the latent topics in the top-ranked documents that allow for the exploitation of the correlation between terms in relevant topics. To capture discriminative terms for query expansion, we incorporated topical features into a relevance model that focuses on the temporal information in the selected set of documents. Empirical experimental results on TREC 2011-2013 microblog datasets show that our model significantly outperforms all state-of-the-art baseline models.khaled Albishre*, QUT; Yuefeng Li, Queensland University of Technology; Yue Xu, Queensland University of Technology, Australia
Semi-Structured Data and NLP (Part 2)
Distributed representation of multi-sense words: A loss driven approach(abstract, pdf)
Saurav Manchanda*, University of Minnesota, Twin Cities; George Karypis, University of Minnesota, Twin Cities
Word2Vec’s Skip Gram model is the current state-of-the-art approach for estimating the distributed representation of words. However, it assumes a single vector per word, which is not well-suited for representing words that have multiple senses. This work presents LDMI, a new model for estimating distributional representations of words. LDMI relies on the idea that, if a word carries multiple senses, then having a different representation for each of its senses should lead to a lower loss associated with predicting its co-occurring words, as opposed to the case when a single vector representation is used for all the senses. After identifying the multi-sense words, LDMI clusters the occurrences of these words to assign a sense to each occurrence. Experiments on the contextual word similarity task show that LDMI leads to better performance than competing approaches.
Active Blocking Scheme Learning for Entity Resolution(abstract, pdf)
Jingyu Shao*, ANU; Qing Wang, ANU
Blocking is an important part of entity resolution. It aims to improve the time eciency of entity resolution by grouping potentially matched records into the same block. Both supervised and unsupervised approaches have been proposed in the past; Nonetheless, they still have some limitations: either a large amount of labels are required or the blocking quality is hard to be guaranteed. To address these issues, in this paper, we propose a blocking scheme learning approach based on active learning techniques. With a limited label budget, our approach can learn a blocking scheme to generate high quality blocks. Two strategies called active sampling and active branching are proposed to select samples and generate the blocking scheme eciently. We experimentally verify that our approach outperforms several baseline approaches over four realworld datasets.Jingyu Shao*, ANU; Qing Wang, ANU
Mining Relations from Unstructured Content(abstract, pdf)
Ismini Lourentzou, University of Illinois at Urbana – Champaign; Alfredo Alba, IBM Research; Anni Coden, IBM Research; Anna Lisa Gentile*, IBM Research; Daniel Gruhl, IBM Research; Steve Welch, IBM Research
Extracting relations from unstructured Web content is a challenging task and for any new relation a significant effort is required to design, train and tune the extraction models. In this work, we investigate how to obtain suitable results for relation extraction with modest human efforts, relying on a dynamic active learning approach. We pro- pose a method to reliably generate high quality training/test data for relation extraction – for any generic user-demonstrated relation, starting from a few user provided examples and extracting valuable samples from unstructured and unlabeled Web content. To this extent we propose a strategy which learns how to identify the best order to human-annotate data, maximizing learning performance early in the process. We demonstrate the viability of the approach (i) against state of the art datasets for relation extraction as well as (ii) a real case study identifying text expressing a causal relation between a drug and an adverse reaction from various sources of Web content.Ismini Lourentzou, University of Illinois at Urbana – Champaign; Alfredo Alba, IBM Research; Anni Coden, IBM Research; Anna Lisa Gentile*, IBM Research; Daniel Gruhl, IBM Research; Steve Welch, IBM Research
Incorporating Word Embeddings into Open Directory Project based Large-scale Classification(abstract, pdf)
Kang-Min Kim, Korea University; Dinara Aliyeva, Korea University; Byung-Ju Choi, Korea University; Sangkeun Lee*, Korea University?Korea
Recently, implicit representation models, such as embedding or deep learning, have been successfully adopted to text classification task due to their outstanding performance. However, these approaches are limited to small- or moderate-scale text classification. Explicit representation models are often used in a large-scale text classification, like the Open Directory Project (ODP)-based text classification. However, the performance of these models is limited to the associated knowledge bases. In this paper, we incorporate word embeddings into the ODP-based large-scale classification. To this end, we first generate category vectors, which represent the semantics of ODP categories by jointly modeling word embeddings and the ODP-based text classification. We then propose a novel semantic similarity measure, which utilizes the category and word vectors obtained from the joint model and word embeddings, respectively. The evaluation results clearly show the efficacy of our methodology in large-scale text classification. The proposed scheme exhibits significant improvements of 10% and 28% in terms of macro-averaging F1-score and precision at k, respectively, over state-of-the-art techniques.Kang-Min Kim, Korea University; Dinara Aliyeva, Korea University; Byung-Ju Choi, Korea University; Sangkeun Lee*, Korea University,Korea
Inference of a Concise Regular Expression Considering Interleaving from XML Documents(abstract, pdf)
Xiaolan Zhang*, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Yeting Li, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Fanlin Cui, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences; Chunmei Dong, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences and University of Chinese Academy of Sciences; Haiming Chen, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences
XML schemas are useful in various applications. However, many XML documents in practice are not accompanied by a (valid) schema recommended by W3C. Therefore, it is essential to design efficient algorithms for schema inference. Each element in XML schema has its content model defined by a regular expression. Schema inference can be reduced to the inference of restricted regular expressions. In this paper, we focus on learning restricted regular expressions considering interleaving from a given set of XML documents. The new subclass proposed is so-called CHAin Regular Expression with Interleaving (ICHARE). Then we introduce the inference algorithm GenICHARE based on single occurrence automaton (SOA) and maximum independent set (MIS). The algorithm is proved to infer a descriptive ICHARE from a set of given sample. At last, based on the data set crawled from the Web, we compare the coverage proportion of ICHARE with other existing subclasses. Besides, we analyze the conciseness of regular expressions inferred by the algorithm GenICHARE based on DBLP. Experimental results show that ICHARE is more concise and useful in practice, and the inference algorithm is promising and effective.Xiaolan Zhang*, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Yeting Li, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Fanlin Cui, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences; Chunmei Dong, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences and University of Chinese Academy of Sciences; Haiming Chen, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences
Spatial-Temporal, Time-series and Stream Mining (Part 1)
Accelerating Adaptive Online Learning by Matrix Approximation(abstract, pdf)
Yuanyu Wan*, Nanjing University; Lijun Zhang, Nanjing University
Adaptive subgradient methods are able to leverage the second-order information of functions to improve the regret, and have become popular for online learning and optimization. According to the amount of information used, adaptive subgradient methods can be divided into diagonal-matrix version (ADA-DIAG) and full-matrix version (ADA-FULL). In practice, ADA-DIAG is the most commonly adopted rather than ADA-FULL, because ADA-FULL is computationally intractable in high dimensions though it has smaller regret when gradients are correlated. In this paper, we propose to apply techniques of matrix approximation to accelerate ADA-FULL, and develop two methods based on random projections. Compared with ADA-FULL, at each iteration, our methods reduce the space complexity from $O(d^2)$ to $O(\tau d)$ and the time complexity from $O(d^3)$ to $O(\tau^2 d)$ where $d$ is the dimensionality of the data and $\tau \ll d$ is the number of random projections. Experimental results about online convex optimization show that both methods are comparable to ADA-FULL and outperform other state-of-the-art algorithms including ADA-DIAG. Furthermore, the experiments of training convolutional neural networks show again that our method outperforms other state-of-the-art algorithms including ADA-DIAG.Yuanyu Wan*, Nanjing University; Lijun Zhang, Nanjing University
Cruising or Waiting: A Shared Recommender System for Taxi Drivers(abstract, pdf)
Xiaoting Jiang*, Shanghai Jiao Tong University; Yanyan Shen, Shanghai Jiao Tong University; Yanmin Zhu, Shanghai Jiao Tong University
Recent efforts have been made on mining mobility of taxi trajectories and developing recommender systems for taxi drivers. Existing systems focused on recommending seeking routes to the place with the highest passenger pick-up possibility. They mostly ignore that waiting at nearby taxi stands may also help increase the profit. Furthermore, the recommended results seldom consider potential competitions among drivers and real-time traffic. In this paper, we propose a shared recommender system for taxi drivers by including waiting as one kind of seeking policy. We model a seeking process as a Markov Decision Process, and propose a novel Q-learning algorithm to train the model based on massive trajectory data efficiently. During online recommendation, we update the model using feedbacks from drivers and recommend the optimal seeking policy by taking competitions among drivers and real-time traffic into account. Experimental results show that our system achieves better performance than the state-of-the-art approaches. Xiaoting Jiang*, Shanghai Jiao Tong University; Yanyan Shen, Shanghai Jiao Tong University; Yanmin Zhu, Shanghai Jiao Tong University
A Local Online Learning Approach for Non-linear Data(abstract, pdf)
Xinxing Yang*, Ant Financial Services Group; Jun Zhou, Ant Financial; PEILIN ZHAO, South China University of Technology; Cen Chen, Ant Financial Services Group; Chaochao Chen, Ant Financial; Xiaolong Li, Ant Financial
The efficiency and scalability of online learning methods make them a popular choice for solving the learning problems with big data and limited memory. Most of the existing online learning approaches are based on global models, which consider the incoming example as linear separable. However, this assumption is not always valid in practice. Therefore, local online learning framework was proposed to solve non-linear separable task without kernel modeling. Weights in local online learning framework are based on the first-order information, thus will significantly limit the performance of online learning. Intuitively, the second-order online learning algorithms, e.g., Soft Confidence-Weighted (SCW), can significantly alleviate this issue. Inspired by the second-order algorithms and local online learning framework, we propose a Soft Confidence-Weighted Local Online Learning(SCW-LOL) algorithm, which extends the single hyperplane SCW to the case with multiple local hyperplanes. Those local hyperplanes are connected by a common component and will be optimized simultaneously. We also examine the theoretical relationship between the single and multiple hyperplanes. The extensive experimental results show that the proposed SCW-LOL learns an online convergence boundary, overall achieving the best performance over almost all datasets, without any kernel modeling and parameter tuning.Xinxing Yang*, Ant Financial Services Group; Jun Zhou, Ant Financial; PEILIN ZHAO, Ant Finance Groups; Cen Chen, Ant Financial Services Group; Chaochao Chen, Ant Financial; Xiaolong Li, Ant Financial
Contextual Location Imputation for Confined WiFi Trajectories(abstract, pdf)
Elham Naghizade*, The University of Melbourne; Jeffrey Chan, RMIT University, Australia; Yongli Ren, RMIT University; Martin Tomko, The University of Melbourne
The analysis of mobility patterns from large-scale spatio-temporal datasets is key to personalised location-based applications. Datasets capturing user location are, however, often incomplete due to temporary failures of sensors, deliberate interruptions or because of data privacy restrictions. Effective location imputation is thus a critical processing step enabling mobility pattern mining from sparse data. To date, most studies in this area have focused on coarse location prediction at city scale. In this paper we aim to infer the missing location information of individuals tracked within structured, mostly confined spaces such as a university campus or a mall. Many indoor tracking datasets may be collected by sensing user presence via WiFi sensing and consist of timestamped associations with the network’s access points (APs). Such coarse location information imposes unique challenges to the location imputation problem. We present a contextual model that combines the regularity of individuals’ visits to enable accurate imputation of missing locations in sparse indoor trajectories. This model also considers implicit social ties to capture similarities between individuals, applying Graph-regularized Nonnegative Matrix Factorization (GNMF) techniques. Our findings suggest that people’ movement in confined spaces is largely habitual and their social ties plays a role in their less frequently visited locations.Elham Naghizade*, The University of Melbourne; Jeffrey Chan, RMIT University, Australia; Yongli Ren, RMIT University; Martin Tomko, The University of Melbourne
Spatial-Temporal, Time-series and Stream Mining (Part 2)
Low redundancy estimation of correlation matrices for time series using triangular bounds(abstract, pdf)
Erik Scharw夨ter*, Hasso Plattner Institute; Fabian Geier, HPI; Lukas Faber, Hasso Plattner Institute; Emmanuel M��r, Hasso Plattner Institute
The dramatic increase in the availability of large collections of time series requires new approaches for scalable time series analysis. Correlation analysis for all pairs of time series is a fundamental first step of analysis of such data, but is particularly hard for large collections of time series due to its quadratic complexity. State-of-the-art approaches focus on efficiently approximating correlations larger than a hard threshold or compressing fully computed correlation matrices in hindsight. In contrast, we aim at estimates for the full pairwise correlation structure without computing and storing all pairwise correlations. We introduce the novel problem of low redundancy estimation for correlation matrices to capture the complete correlation structure with as few parameters and correlation computations as possible. We propose a novel estimation algorithm that is very efficient and comes with formal approximation guarantees. Our algorithm avoids the computation of redundant blocks in the correlation matrix to drastically reduce time and space complexity of estimation. We perform an extensive empirical evaluation of our approach and show that we obtain high quality estimates with drastically reduced space requirements on a large variety of datasets.Erik Scharwächter*, Hasso Plattner Institute; Fabian Geier, HPI; Lukas Faber, Hasso Plattner Institute; Emmanuel Müller, Hasso Plattner Institute
Traffic Accident Detection with Spatiotemporal Impact Measurement(abstract, pdf)
Mingxuan Yue*, University of Southern California; Liyue Fan, University at Albany SUNY; Cyrus Shahabi, Computer Science Department. University of Southern California
Traffic incidents continue to cause a significant loss in deaths, injuries, and property damages. Reported traffic accident data contains a considerable amount of human errors, hindering the studies on traffic accidents. Several approaches have been developed to detect accidents using traffic data in real time. However, those approaches do not consider the spatiotemporal patterns inherent in traffic data, resulting in high false alarm rates. In this paper, we study the problem of traffic accident detection by considering multiple traffic speed time series collected from road network sensors. To capture the spatiotemporal impact of traffic accidents to upstream locations, we adopt Impact Interval Grouping (IIG), which compares real-time traffic speed with historical data, and generates impact intervals to determine the presence of accidents. Furthermore, we take a multivariate time series classification approach and extract three novel features to measure the severity of traffic accidents. We use real-world traffic speed and accident datasets in our empirical evaluation, and our solutions outperform state-of-the-art approaches in multivariate time series classification.Mingxuan Yue*, University of Southern California; Liyue Fan, University at Albany SUNY; Cyrus Shahabi, Computer Science Department. University of Southern California
MicroGRID: An Accurate and Efficient Real-Time Stream Data Clustering With Noise(abstract, pdf)
Zahir Tari*, RMIT University
Data stream clustering aims to produce clusters from a data-stream in a real-time. Many of existing algorithms focus however on solving a single problem, leaving anomalous noise in data streams at the wayside. This paper describes the MicroGRID approach to cluster data from single data-streams to handle noisy data streams, accurately identifying and separating noise-affected data points from outlier points. In particular, MicroGRID utilises a combination of micro cluster and grid-based prospectives, an approach that has not been attempted when clustering data-streams. The experimental results clearly show that that MicroGRID significantly outperforms the baseline methods: MicroGRID is up 87\% faster and up to 80\% more accurate clustering outputs.Zahir Tari*, RMIT University
UFSSF – An Efficient Unsupervised Feature Selection for Streaming Features(abstract, pdf)
Naif Almusallam*, RMIT; Zahir Tari, RMIT University; Jeffrey Chan, RMIT University, Australia; Adil Alharthi, Albaha University
Streaming features applications pose challenges for dimensionality reduction techniques, particularly for feature selection. Traditional feature selection techniques cannot be used in such an environment, as they do require the full feature space in advance in order to statistically deter- mine the representative features. A new algorithm, called Unsupervised Feature Selection for Streaming Features (UFSSF), is proposed to select representative features in streaming features applications without the need to know the features or class labels in advance. UFSSF extends the k-mean clustering algorithm to include linearly dependent similarity measures so as to incrementally decide whether to add the newly arrived feature to the existing set of representative features. Those features that are not representative are discarded. Naif Almusallam*, RMIT; Zahir Tari, RMIT University; Jeffrey Chan, RMIT University, Australia; Adil Alharthi, Albaha University
Online Clustering for Evolving Data Streams with Online Anomaly Detection(abstract, pdf)
Milad Chenaghlou*, The University of Melbourne; Masud Moshtaghi, University of Melbourne; Christopher Leckie, University of Melbourne; Mahsa Salehi, Monash
Clustering data streams is an emerging challenge with a wide range of applications in areas including Wireless Sensor Networks, the Internet of Things, finance and social media. In an evolving data stream, a clustering algorithm is desired to both (a) assign observations to clusters and (b) identify anomalies in real-time. Current state-of-the-art algorithms in the literature do not address feature (b) as they only consider the \textit{spatial} proximity of data, which results in (1) poor clustering and (2) poor demonstration of the temporal evolution of data in noisy environments. In this paper, we propose an online clustering algorithm that considers the \textit{temporal} proximity of observations as well as their spatial proximity to identify anomalies in real-time. It identifies the evolution of clusters in noisy streams, incrementally updates the model and calculates the minimum window length over the evolving data stream without jeopardizing performance. To the best of our knowledge, this is the first online clustering algorithm that identifies anomalies in real-time and discovers the temporal evolution of clusters. Our contributions are supported by synthetic as well as real-world data experiments.Milad Chenaghlou*, The University of Melbourne; Masud Moshtaghi, University of Melbourne; Christopher Leckie, University of Melbourne; Mahsa Salehi, Monash
Spatial-Temporal, Time-series and Stream Mining (Part 3)
An Incremental Dual nu-Support Vector Regression Algorithm(abstract, pdf)
Hang Yu*, University of Technology Sydney; Jie Lu, University of Technology Sydney; Guangquan Zhang, University of Technology Sydney
Support vector regression (SVR) has been a hot research topic for several years as it is an effective regression learning algorithm. Early studies on SVR mostly focus on solving large-scale problems. Nowadays, an increasing number of researchers are focusing on incremental SVR algorithms. However, these incremental SVR algorithms cannot handle uncertain data, which are very common in real life because the data in the training example must be precise. Therefore, to handle the incremental regression problem with uncertain data, an incremental dual nu-support vector regression algorithm (dual-v-SVR) is proposed. In the algorithm, a dual-v-SVR formulation is designed to handle the uncertain data at first, then we design two special adjustments to enable the dual-v-SVR model to learn incrementally: incremental adjustment and decremental adjustment. Finally, the experiment results demonstrate that the incremental dual-v-SVR algorithm is an efficient incremental algorithm which is not only capable of solving the incremental regression problem with uncertain data, it is also faster than batch or other incremental SVR algorithms.Hang Yu*, University of Technology Sydney; Jie Lu, University of Technology Sydney; Guangquan Zhang, University of Technology Sydney
Text Stream to Temporal Network – A Dynamic Heartbeat Graph to Detect Emerging Events on Twitter(abstract, pdf)
Zafar Saeed, Quaid i Azam University; Rabeeh Abbasi, Quaiz Azam University; Abida Sadaf, Quaid i Azam University; Imran Razzak, UTS; Guandong Xu*, University of Technology Sydney, Australia
Huge mounds of data are generated every second on the Internet. People around the globe publish and share information related to real-life events they experience every day. This provides a valuable opportunity to analyze the content of this information to detect real-life happenings, however, it is quite challenging task. Most of the existing methods focus on bursty features to highlight the significance of data entities, but ignore the fact that burstiness often dominates the other minor details which, sometimes, can be very important. Based on this fact, in this work, we propose a novel graph-based approach named the Dynamic Heartbeat Graph (DHG) that not only detects the events at an early stage, but also suppresses them in the upcoming adjacent data stream in order to highlight new emerging events. This characteristic makes the proposed method interesting and efficient in finding emerging events and related topics. The experiment results on real-life datasets (i.e. FA Cup Final and Super Tuesday 2012) show a considerable improvement in most cases, while time complexity remains very attractive.Zafar Saeed, Quaid i Azam University; Rabeeh Abbasi, Quaiz Azam University; Abida Sadaf, Quaid i Azam University; Imran Razzak, KSAU-HS; Guandong Xu*, University of Technology Sydney, Australia
Model the Dynamic Evolution of Facial Expression from Image Sequences(abstract, pdf)
Zhaoxin Huan*, Nanjing university; Lin Shang, Nanjing University
Facial Expression Recognition (FER) has been a challenging task for decades. In this paper, we model the dynamic evolution of facial expression by extracting both temporal appearance features and temporal geometry features from image sequences. To extract the pairwise feature evolution, our approach consists of two different models. The first model combines convolutional layers and temporal recursion to extract dynamic appearance features from raw images. While the other model focuses on the geometrical variations based on facial landmarks,in which a novel 2-distance representation and resample technique are also proposed. These two models are combined by weighted method in order to boost the performance of recognition. We test our approach on three widely used databases: CK+, Oulu-CASIA and MMI. The experimental results show that we achieve state-of-the-art accuracy. Moreover,our models have minor-setup parameters and can work for the variable-length frame sequences input, which is flexible in practical applications.Zhaoxin Huan*, Nanjing university; Lin Shang, Nanjing University
Unsupervised Disaggregation of Low Granularity Resource Consumption Time Series(abstract, pdf)
Pantelis Chronis*, University of Peloponnese; Spiros Skiadopoulos, University of Peloponnese; Giorgos Giannopoulos, Athena RC
Resource consumption is typically monitored at a single point that aggregates all activities of the household in one time series. A key task in resource demand management is disaggregation; an operation that decomposes such a composite time series in the consumption parts that construct it, thus, extracting detailed information about how and when resources were consumed. Current state-of-the-art disaggregation methods have two drawbacks: (a) they mostly work for frequently sampled time series and (b) they require supervision (that comes in terms of labelled data). In practice, though, sampling is not frequent and labelled data are often not available. With this problem in mind, in this paper, we present a method designed for unsupervised disaggregation of consumption time series of low granularity. Our method utilizes a stochastic model of resource consumption along with empirical findings on consumption types (e.g., average volume) to perform disaggregation. Experiments with real world resource consumption data demonstrate up to 85% Recall in identifying different consumption types. Pantelis Chronis*, University of Peloponnese; Spiros Skiadopoulos, University of Peloponnese; Giorgos Giannopoulos, Athena RC
STARS: Soft Multi-Task Learning for Activity Recognition from Multi-Modal Sensor Data(abstract, pdf)
Xi Liu*, Michigan State University; Pang-Ning Tan, MSU; Lei Liu, HP Labs
Human activity recognition from ubiquitous sensor data is an important but challenging classification problem for applications such as assisted living, energy management, and security monitoring of smart homes. In this paper, we present a soft probabilistic classification model for human activity recognition from multi-modal sensors in a smart home environment. The model employs a softmax multi-task learning approach to fit a joint model for all the rooms in the smart home, taking into account the diverse types of sensors available in different rooms. The model also learns the transitional dependencies between activities to improve its prediction accuracy. Experimental results on a real-world dataset showed that the proposed approach outperforms several baseline methods, including k-nearest neighbors, conditional random field, and standard multinomial logistic regression.Xi Liu*, Michigan State University; Pang-Ning Tan, MSU; Lei Liu, HP Labs
A refined MISD algorithm based on Gaussian process regression(abstract, pdf)
Feng Zhou*, Data61; Zhidong Li, Data61, CSIRO; Xuhui Fan, UNSW; Yang Wang, Data61, CSIRO; Arcot Sowmya, UNSW; Fang Chen, CSIRO
Time series data is a common data type in real life, and modelling of time series data along with its underlying temporal dynamics is always a challenging job. Temporal point process is an outstanding method to model time series data in domains that require temporal continuity, and includes homogeneous Poisson process, inhomogeneous Poisson process and Hawkes process. We focus on Hawkes process which can explain self-exciting phenomena in many real applications. In classical Hawkes process, the triggering kernel is always assumed to be an exponential decay function, which is inappropriate for some scenarios, so nonparametric methods have been used to deal with this problem, such as model independent stochastic de-clustering (MISD) algorithm. However, MISD algorithm has a strong dependence on the number of bins, which leads to underfitting for some bins and overfitting for others, so the choice of bin number is a critical step. In this paper, we innovatively embed a Gaussian process regression into the iterations of MISD to make this algorithm less sensitive to the choice of bin number. Feng Zhou*, Data61; Zhidong Li, Data61, CSIRO; Xuhui Fan, UNSW; Yang Wang, Data61, CSIRO; Arcot Sowmya, UNSW; Fang Chen, CSIRO
Feature Learning and Data Mining Process (Part 1)
Discovering High Utility Itemsets Based on the Artificial Bee Colony Algorithm(abstract, pdf)
Wei Song*, North China University of Technology; Chaomin Huang, North China University of Technology
Mining high utility itemsets (HUI) is an interesting research problem in data mining. Recently, evolutionary computation has attracted researchersҠattention, and based on the genetic algorithm and particle swarm optimization, new algorithms for mining HUIs have been proposed. In this paper, we propose a new algorithm called HUI mining based on the artificial bee colony algorithm (HUIM-ABC). In HUIM-ABC, a bitmap is used to transform the original data-base that represents a nectar source and three types of bee. In addition to an ef-ficient bitwise operation and direct utility computation, a bitmap can also be used for search space pruning. Furthermore, the size of discovered itemsets is used to generate new nectar sources, which has a higher chance of producing HUIs than generating new nectar sources at random. Extensive tests conducted on publicly available datasets show that the proposed algorithm outperforms existing state-of-the-art algorithms.Wei Song*, North China University of Technology; Chaomin Huang, North China University of Technology
A Scalable and Efficient Subgroup Blocking Scheme for Multidatabase Record Linkage(abstract, pdf)
Thilina Ranbaduge*, The Australian National University; Dinusha Vatsalan, Australian National University; Peter Christen, The Australian National University
Record linkage is a commonly used task in data integration to facilitate the identification of matching records that refer to the same entity from different databases. The scalability of multidatabase record linkage (MDRL) is significantly challenged with the increase of both the sizes and the number of databases that are to be linked. Identifying matching records across subgroups of databases is an important aspect in MDRL that has not been addressed so far. We propose a scalable subgroup blocking approach for MDRL that uses an efficient search over a graph structure to identify similar blocks of records that need to be compared across subgroups of multiple databases. We provide an analysis of our technique in terms of complexity and blocking quality. We conduct an empirical study on large real-world datasets that shows our approach is scalable with the size of subgroups and the number of databases, and outperforms an existing state-of-the-art blocking technique for MDRL.Thilina Ranbaduge*, The Australian National University; Dinusha Vatsalan, Australian National University; Peter Christen, The Australian National University
Efficient Feature Selection Framework For Digital Marketing Applications(abstract, pdf)
Wei Zhang*, Adobe; shiladitya bose, Adobe; Said Kobeissi, Adobe; Scott Tomko, Adobe; Chris Challis, adobe
Digital marketing strategies can help businesses achieve better Return on Investment(ROI). Big data and predictive modelling are key to identifying these specific customers. Yet the very rich and mostly irrelevant attributes(features) will adversely affect the predictive modelling performance, both computationally and qualitatively. So selecting relevant features is a crucial task for marketing applications. The feature selection process is very time consuming due to the large amount of data and high dimensionality of features. In this paper, we propose to reduce the computation time through regularizing the feature search process using expert knowledge. We also combine the regularized search with a generative filtering step, so we can address potential problems with the regularized search and further speed up the process. In addition, a progressive sampling and coarse to fine selection framework is built to further lower the space and time requirements. Wei Zhang*, Adobe; shiladitya bose, Adobe; Said Kobeissi, Adobe; Scott Tomko, Adobe; Chris Challis, adobe
Dynamic feature selection algorithm based on minimum vertex cover of hypergraph(abstract, pdf)
Xiaojun Xie, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; Xiaolin Qin*, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics
Feature selection is an important pre-processing step in many fields, such as data mining, machine learning and pattern recognition. This paper focuses on dynamically updating a subset of features with new samples arriving and provides a hypergraph model to deal with dynamic feature selection problem. Firstly, we discuss the relationship between feature selection of information system and minimum vertex cover of hypergraph, and feature selection is converted to a minimum vertex cover problem based on this relationship. Then, an algorithm for generating induced hypergraph from information system is presented, the induced hypergraph can be divided into two part: the original induced hypergraph and the added hypergraph with new samples arriving. Finally, a novel dynamic feature selection algorithm based on minimal vertex cover of hypergraph is proposed, and this algorithm only need a small amount of computation. Experiments show that the proposed method is feasible and highly effective.Xiaojun Xie, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; Xiaolin Qin*, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics
Feature Selection for Multiclass Binary Data(abstract, pdf)
Kushani Perera*, University of Melbourne
Feature selection in binary datasets is an important task in many real world machine learning applications such as document classification, genomic data analysis, and image recognition. Despite many algorithms available, selecting features that distinguish all classes from one another in a multiclass binary dataset remains a challenge. Furthermore, many existing feature selection methods incur unnecessary computation costs for binary data, as they are not specifically designed for binary data. We show that exploiting the symmetry and feature value imbalance of binary datasets, more efficient feature selection measures that can better distinguish the classes in multiclass binary datasets can be developed. Using these measures, we propose a greedy feature selection algorithm, CovSkew, for multiclass binary data. We show that CovSkew achieves high accuracy gain over baselines, upto ~40%, especially when the selected feature subset is small and has low computation costs.Kushani Perera*, University of Melbourne; Jeffrey Chan, RMIT University; Shanika Karunasekera, University of Melbourne
Feature Learning and Data Mining Process (Part 2)
Scalable Model-based Cascaded Imputation of Missing Data(abstract, pdf)
Jacob Montiel*, Télécom ParisTech, Université Paris-Saclay; Jesse Read, Ecole Polytechnique; Albert Bifet, Telecom ParisTech; Talel Abdessalem, Telecom ParisTech
Missing data is a common trait of real-world data that can negatively impact interpretability. In this paper, we present CASCADE IMPUTATION (CIM), an effective and scalable technique for automatic imputation of missing data. CIM is not restrictive on the characteristics of the data set, providing support for: Missing At Random and Missing Completely At Random data, numerical and nominal attributes, and large data sets including highly dimensional data sets. We compare CIM against well-established imputation techniques over a variety of data sets under multiple test configurations to measure the impact of imputation on the classification problem. Test results show that CIM outperforms other imputation methods over multiple test conditions. Additionally, we identify optimal performance and failure conditions for popular imputation techniques.Jacob Montiel*, Télécom ParisTech, Université Paris-Saclay; Jesse Read, Ecole Polytechnique; Albert Bifet, Telecom ParisTech; Talel Abdessalem, Telecom ParisTech
On Reducing Dimensionality of Labeled Data Efficiently(abstract, pdf)
Guoxi Zhang*, Kyoto University; Tomoharu Iwata, NTT Communication Science Laboratories; Hisashi Kashima, Kyoto University
We address the task of reducing dimensionality for labeled data. Particularly, we consider to improve class separation in latent space. Many existing nonlinear algorithms assume nonparametric distributions in latent space that are based on pair-wise distances. It is infeasible to compute or store pairwise distances of large data. In this paper, we propose a nonlinear method that assumes a parametric distribution in the latent space. The proposed method employs a spherical mixture model for the parametric distribution, and it collapses data points with the same label to the corresponding mixture components in the latent space. In this process only distances between data points and the centers of the components are computed, so the proposed method is more efficient than methods using pairwise distances. In our evaluation on five datasets, we demonstrate the efficacy and efficiency of the proposed method. In practice, the proposed method can be used to speedup k-NN classification or visualize data points with their class structure.
Using Metric Space Indexing for Complete and Efficient Record Linkage(abstract, pdf)
Özgür Akgün*, University of St Andrews; Alan Dearle, University of St Andrews; Graham Kirby, University of St Andrews; Peter Christen, The Australian National University
Record linkage is the process of identifying records that refer to the same real-world entities in situations where entity identifiers are unavailable. Records are linked on the basis of similarity between common attributes, with every pair being classified as a link or non-link depending on their similarity. Linkage is usually performed in a three-step process: first, groups of similar candidate records are identified using indexing, then pairs within the same group are compared in more detail, and finally classified. Even state-of-the-art indexing techniques, such as locality sensitive hashing, have potential drawbacks. They may fail to group together some true matching records with high similarity, or they may group records with low similarity, leading to high computational overhead. We propose using metric space indexing (MSI) to perform complete linkage, resulting in a parameter-free process combining indexing, comparison and classification into a single step delivering complete and efficient record linkage. An evaluation on real-world data from several domains shows that linkage using MSI can yield better quality than current indexing techniques, with similar execution cost, without the need for domain knowledge or trial and error to configure the process.Özgür Akgün*, University of St Andrews; Alan Dearle, University of St Andrews; Graham Kirby, University of St Andrews; Peter Christen, The Australian National University
Dimensionality Reduction via Community Detection in Small Sample Datasets(abstract, pdf)
Kartikeya Bhardwaj*, Carnegie Mellon University; Radu Marculescu, Carnegie Mellon University
Real world networks constructed from raw data are often characterized by complex community structures. Existing dimensionality reduction techniques, however, do not take such characteristics into account. This is especially important for problems with low number of samples where the curse of dimensionality is particularly significant. Therefore, in this paper, we propose FeatureNet, a novel community-based dimensionality reduction framework targeting small sample problems. To this end, we propose a new method to directly construct a network from high-dimensional raw data while explicitly revealing its hidden community structure; these communities are then used to learn low-dimensional features using a representation learning framework. We show the effectiveness of our approach on eight datasets covering application areas as diverse as handwritten digits, biology, physical sciences, NLP, and computational sustainability. Extensive experiments on the above datasets (with sizes mostly between 100 and 1500 samples) demonstrate that FeatureNet significantly outperforms (i.e., up to 40% improvement in classification accuracy) ten well-known dimensionality reduction methods like PCA, Kernel PCA, Isomap, SNE, t-SNE, etc.Kartikeya Bhardwaj*, Carnegie Mellon University; Radu Marculescu, Carnegie Mellon University
Topic-sensitive Influential Paper Discovery in Citation Network(abstract, pdf)
Xin Huang, Shanghai Jiao Tong University; Chang-An Chen, Shanghai Jiao Tong University; Changhuan Peng, Shanghai Jiao Tong University; Xudong Wu, Shanghai Jiao Tong University; Luoyi Fu, Shanghai Jiao Tong University; Xinbing Wang*, Shanghai Jiao Tong University
Discovering important papers in different academic topics is known as topic-sensitive influential paper discovery. Previous works mainly find the influential papers based on the structure of citation networks but neglect the text information, while the text of documents gives a more precise description of topics. In our paper, we creatively combine both topics of text and the influence of topics over citation networks to discover influential articles. The observation on three standard citation networks shows that the existence of citations between papers is related to the topic of citing papers and the importance of cited papers. Based on this finding, we introduce two parameters to describe the topic distribution and the importance of a document. We then propose MTID, a scalable generative model, which generates the network with these two parameters. The experiment confirms superiority of MTID over other topic-based methods, in terms of at least 50% better citation prediction in recall, precision and mean reciprocal rank. In discovering influential articles in different topics, MTID not only identifies papers with high citations, but also succeeds in discovering other important papers, including papers about standard datasets and the rising stars.Xin Huang, Shanghai Jiao Tong University; Chang-An Chen, Shanghai Jiao Tong University; Changhuan Peng, Shanghai Jiao Tong University; Xudong Wu, Shanghai Jiao Tong University; Luoyi Fu, Shanghai Jiao Tong University; Xinbing Wang*, Shanghai Jiao Tong University
Feature Learning and Data Mining Process (Part 3)
An Extended Random-Sets Model for Fusion-based Text Feature Selection(abstract, pdf)
Abdullah Alharbi*, QUT; Yuefeng Li, Queensland University of Technology; Yue Xu, Queensland University of Technology, Australia
Selecting features that represent a specific corpus is important for the success of many machine learning and text mining applications. In information retrieval (IR), fusion-based techniques have shown remarkable performance compared to traditional models. However, in text feature selection (FS), popular models do not consider the fusion of the taxonomic features of the corpus. This research proposed an innovative and effective extended random-sets model for fusion-based FS. The model fused scores of different hierarchal features to accurately weight the representative words based on their appearance across the documents in the corpus and in several latent topics. The model was evaluated for information filtering (IF) using TREC topics and the standard RCV1 dataset. The results showed that the proposed model significantly outperformed eleven state-of-the-art baseline models in six evaluation metrics.Abdullah Alharbi*, QUT; Yuefeng Li, Queensland University of Technology; Yue Xu, Queensland University of Technology, Australia
Attribute Reduction Based on Improved Information Gain Rate and Ant Colony Optimization(abstract, pdf)
Wei Jipeng*, Guilin University of Electronic Technology; Wen Yimin, Guilin University of Electronic Technology; Wei Qianjin, Guilin University of Electronic Technology
Solving minimal attribute reduction (MAR) in rough set theory is a NP-hard and nonlinear constrained combinatorial optimization problem. Ant colony optimization (ACO), a new intelligent computing method, takes strategies of heuristic search, which is characterized by a distributed and positive feedback and it has the advantage of excellent global optimization ability for handling combinatorial optimization problems. Having considered that the existing information entropy and information gain methods fail to help to select the optimal minimal attribute every time, this paper proposed a novel attribute reduction algorithm based on ACO. Firstly, the algorithm adopts an improved information gain rate as heuristic information. Secondly, each ant solves a problem of minimum attributes reduction and then conduct redundancy test to each selected attribute. Whatӳ more, redundant detection of all non-core attributes in the optimal solution will be perfomed in each generation. The result of the experiment on several datasets from UCI show that the proposed algorithms are more capable of finding the minimum attribute reduction and can faster converge and at the same time they can almost keep the classification accuracy, compared with the traditional attribute reduction based on ACO algorithm.Wei Jipeng*, Guilin University of Electronic Technology; Wen Yimin, Guilin University of Electronic Technology; Wei Qianjin, Guilin University of Electronic Technology
Efficient Approximate Algorithms for the Closest Pair Problem in High Dimensional Spaces(abstract, pdf)
Xingyu Cai, University of Connecticut; Sanguthevar Rajasekaran*, University of Connecticut; Fan Zhang, Zhejiang University
The Closest Pair Problem (CPP) is one of the fundamental problems that has a wide range of applications in data mining, such as unsupervised data clustering, user pattern similarity search, etc. A number of exact and approximate algorithms have been proposed to solve it in the low dimensional space. In this paper, we address the problem when the metric space is of a high dimension. For example, the drug-target or movie-user interaction data could contain as many as hundreds of features, etc. To solve this problem under the l2 norm distance, we present novel approximate algorithms.Our algorithms are based on the novel idea of projecting the points into the real line. We prove high probability bounds on the run time and accuracy for both of the proposed algorithms. Both algorithms are evaluated via comprehensive experiments and compared with existing best-known approaches. The experiments reveal that our proposed approaches outperform the existing methods.Xingyu Cai, University of Connecticut; Sanguthevar Rajasekaran*, University of Connecticut; Fan Zhang, Zhejiang University
Efficient Compression Technique for Sparse Sets(abstract, pdf)
rameshwar Pratap*, CMI; Ishan Sohony, Pune Institute of Computer Technology, Pune; Raghav Kulkarni, CMI
Recent growth in internet has generated large amount of data over web. Representations of most of such data are high-dimensional and sparse. Many fundamental subroutines of various data analytics tasks such as clustering, ranking, nearest neighbour scales poorly with the data dimension. In spite of significant growth in the computational power performing such computations on high dimensional data sets are infeasible, and at times impossible. Thus, it is desirable to investigate on compression algorithms that can significantly reduce dimension while preserving similarity between data objects. In this work, we consider the data points as sets, and use Jaccard similarity as the similarity measure. Pratap \textit{et. al.}~\cite{KulkarniP16} suggested a compression technique for high dimensional, sparse, binary data for preserving the Inner product and Hamming distance. In this work, we show that their algorithm also works well for Jaccard similarity. We present a theoretical analysis of compression bound and complement it with rigorous experimentation on synthetic and real-world datasets. We also compare our results with the state-of-the-art “min-wise independent permutation~\cite{BroderCFM98}”, and show that our compression algorithm achieves almost equal accuracy while significantly reducing the compression time and the randomness.rameshwar Pratap*, CMI; Ishan Sohony, Pune Institute of Computer Technology, Pune; Raghav Kulkarni, CMI
It pays to be Certain: Unsupervised Record Linkage via Ambiguity Minimization(abstract, pdf)
Anna Jurek*, Queen’s University; Deepak P, Queen’s University Belfast
Record linkage (RL) is a process of identifying records that refer to the same real-world entity. Many existing approaches to RL apply supervised machine learning (ML) techniques to generate a classification model that classifies a pair of records as either linked or non-linked. In such techniques, the labeled data helps guide the choice and relative importance to similarity measures to be employed in RL. Unsupervised RL is therefore a more challenging problem since the quality of similarity measures needs to be estimated in the absence of linkage labels. In this paper we propose a novel optimization approach to unsupervised RL. We define a scoring technique which aggregates similarities between two records along all attributes and all available similarity measures using a weighted sum formulation. The core idea behind our method is embodied in an objective function representing the overall ambiguity of the scoring across a dataset. Our goal is to iteratively optimize the objective function to progressively refine estimates of the scoring weights in the direction of lesser overall ambiguity. We have evaluated our approach on multiple real world datasets which are commonly used in the RL community. Our experimental results show that our proposed approach outperforms state-of-the-art techniques, while being orders of magnitude faster.Anna Jurek*, Queen’s University; Deepak P, Queen’s University Belfast
Community Detection and Network Science
Consensus Community Detection in Multilayer Networks using Parameter-free Graph Pruning(abstract, pdf)
Domenico Mandaglio, DIMES, University of Calabria; Alessia Amelio, DIMES, University of Calabria; Andrea Tagarelli*, DIMES, University of Calabria, IT
The clustering ensemble paradigm has emerged as an effective tool for community detection in multilayer networks, which allows for producing consensus solutions that are designed to be more robust to the algorithmic selection and configuration bias. However, one limitation is related to the dependency on a co-association threshold that controls the degree of consensus in the community structure solution. The goal of this work is to overcome this limitation with a new framework of ensemble-based multilayer community detection, which features parameter-free identification of consensus communities based on generative models of graph pruning that are able to filter out noisy co-associations. We also present an enhanced version of the modularity-driven ensemble-based multilayer community detection method, in which community memberships of nodes are reconsidered to optimize the multilayer modularity of the consensus solution. Experimental evidence on real-world networks confirms the beneficial effect of using model-based filtering methods and also shows the superiority of the proposed method on state-of-the-art multilayer community detection. Domenico Mandaglio, DIMES, University of Calabria; Alessia Amelio, DIMES, University of Calabria; Andrea Tagarelli*, DIMES, University of Calabria, IT
Community Discovery Based on Social Relations and Temporal-Spatial Topics in LBSNs(abstract, pdf)
Shuai Xu, Southeast University; Jiuxin Cao*, Southeast University; Xuelin Zhu, Southeast University; Yi Dong, Southeast University; Bo Liu, Southeast University
Community discovery is a comprehensive problem associating with sociology and computer science. The recent surge of Location-Based Social Networks (LBSNs) brings new challenges to this problem as there is no definite community structure in LBSNs. This paper tackles the multidimensional community discovery in LBSNs based on user check-in characteristics. Communities discovered in this paper satisfy two requirements: frequent user interaction and consistent temporal-spatial pattern. Firstly, based on a new definition of dynamic user interaction, two types of check-ins in LBSNs are distinguished. Secondly, a novel community discovery model called SRTST is conceived to describe the generative process of different types of check-ins. Thirdly, the Gibbs Sampling algorithm is derived for the model parameter estimation. In the end, empirical experiments on real-world LBSN datasets are designed to validate the performance of the proposed model. Experimental results show that SRTST model can discover multidimensional communities and it outperforms the state-of-the-art methods on various evaluation metrics.Shuai Xu, Southeast University; Jiuxin Cao*, Southeast University; Xuelin Zhu, Southeast University; Yi Dong, Southeast University; Bo Liu, Southeast University
A Unified Weakly Supervised Framework for Community Detection and Semantic Matching(abstract, pdf)
Xiao Liu*, Tianjin University; di jin, School of Computer Software, Tianjin University
Due to the sparsity of network, some community detection methods only based on topology often lead to relatively low accuracy. Although some methods have been proposed to improve the detection accuracy by using few known semi-supervised information or node content, the research of community detection not only pursues the enhancement of community accuracy, but also pays more attention to the semantic description for communities. In this paper, we proposed a unified nonnegative matrix factorization framework simultaneously for community detection and semantic matching by integrating both semi-supervised information and node content. The framework reveals two-fold community structures as well as their coupling relationship matrix, which helps to identify accurate community structure and at the same time assign specific semantic information to each community. Experiments on some real networks show that the framework is efficient to match each community with specific semantic information, and the performance are superior over the compared methods.Xiao Liu*, Tianjin University; di jin, School of Computer Software, Tianjin University
Tapping Community Memberships and Devising a Novel Homophily Modeling Approach for Trust Prediction(abstract, pdf)
Pulkit Parikh*, IIIT Hyderabad; Manish Gupta, Microsoft; Vasudeva Varma, IIIT Hyderabad
The prediction of trust relations between users of social networks is valuable for addressing the issue of finding credible information. Inferring trust is a challenging problem, since user-specified trust relations are highly sparse and follow a power-law degree distribution. In this paper, we explore utilizing community memberships for trust prediction in a principled manner. We also propose a novel method to model homophily that complements existing work. To the best of our knowledge, this is the first work that mathematically formulates an insight based on community memberships for unsupervised trust prediction. We propose and model the hypothesis that a user is more likely to develop a trust relation within the user’s community than outside it. Unlike existing work, our approach for encoding homophily directly links user-user similarities with the pair-wise trust model. We derive mathematical factors that model our hypothesis relating community memberships to trust relations and the homophily effect. Along with low-rank matrix factorization, they are combined into chTrust, the proposed multi-faceted optimization framework. Our experiments on the standard Ciao and Epinions datasets show that the proposed framework outperforms multiple unsupervised trust prediction baselines for all test user pairs as well as the low-degree segment, across evaluation settings.Pulkit Parikh*, IIIT Hyderabad; Manish Gupta, Microsoft; Vasudeva Varma, IIIT Hyderabad
Deep Learning Theory and Applications in KDD (Part 1)
Text-visualizing Neural Network Model: Understanding Online Financial Textual Data(abstract, pdf)
Tomoki Ito*, The University of Tokyo; Hiroki Sakaji, The University of Tokyo; Kota Tsubouchi, Yahoo Japan Corporation; ? ??, ????; Tatsuo Yamashita, Yahoo Japan Corporation
This study aims to visualize financial documents to swiftly obtain market sentiment information from these documents and determine the reason for which sentiment decisions are made. This type of visualization is considered helpful for nonexperts to easily understand technical documents such as financial reports. To achieve this, we propose a novel interpretable neural network (NN) architecture called gradient interpretable NN (GINN). GINN can visualize both the market sentiment score from a whole financial document and the sentiment gradient scores in concept units. We experimentally demonstrate the validity of text visualization produced by GINN using a real textual dataset. Tomoki Ito*, The University of Tokyo; Hiroki Sakaji, The University of Tokyo; Kota Tsubouchi, Yahoo Japan Corporation; 潔 和泉, 東京大学; Tatsuo Yamashita, Yahoo Japan Corporation
MIDA: Multiple Imputation using Denoising Autoencoders(abstract, pdf)
lovedeep gondara*, simon fraser university; Ke Wang, SFU
Missing data is a significant problem impacting all domains. State-of-the-art framework for minimizing missing data bias is multiple imputation, for which the choice of an imputation model remains nontrivial. We propose a multiple imputation model based on overcomplete deep denoising autoencoders. Our proposed model is capable of handling different data types, missingness patterns, missingness proportions and distributions. Evaluation on several real life datasets show our proposed model significantly outperforms current state-of-the-art methods under varying conditions while simultaneously improving end of the line analytics.lovedeep gondara*, simon fraser university; Ke Wang, SFU
Dual Control Memory Augmented Neural Networks for Treatment Recommendations(abstract, pdf)
Hung Le*, Deakin University; Truyen Tran, Deakin University; Svetha Venkatesh, Deakin University
We formulate the task of treatment recommendation as a sequence-to-sequence prediction model that takes the timeׯrdered medical history as input, and predicts a sequence of future clinical procedures and medications. It is built on the premise that an effective treatment plan may have long״erm dependencies from previous medical history. We approach the problem by using a memoryסugmented neural network, in particular, by leveraging the recent differentiable neural computer that consists of a neural controller and an external memory module. Differing from the original model, we use dual controllers, one for encoding the history followed by another for decoding the treatment sequences. In the encoding phase, the memory is updated as new input is read; at the end of this phase, the memory holds not only the medical history but also the information about the current illness. During the decoding phase, the memory is writeװrotected. The decoding controller generates a treatment sequence, one treatment option at a time. The resulting dual controller writeװrotected memoryסugmented neural network is demonstrated on the MIMIC-III dataset on two tasks: procedure prediction and medication prescription. The results show improved performance over both traditional bag-of-words and sequence-to-sequence methods.Hung Le*, Deakin University; Truyen Tran, Deakin University; Svetha Venkatesh, Deakin University
Denoising Time Series Data Using Asymmetric Generative Adversarial Networks(abstract, pdf)
Sunil Gandhi*, University Of Maryland Baltimore County; Tim Oates, University Of Maryland Baltimore County; Tinoosh Mohsenin, Nil; David Hairston, Nil
Denoising data is a preprocessing step for several time series mining algorithms. This step is especially important if the noise in data originates from diverse sources. Consequently, it is commonly used in biomedical applications that use Electroencephalography (EEG) data. In EEG data noise can occur due to ocular, muscular and cardiac activities. In this paper, we explicitly learn to remove noise from time series data without assuming a prior distribution of noise. We propose an online, fully automated, end-to-end system for denoising time series data. Our model for denoising time series is trained using unpaired training corpora and does not need information about the source of the noise or how it is manifested in the time series. We propose a new architecture called AsymmetricGAN that uses a generative adversarial network for denoising time series data. To analyze our approach, we create a synthetic dataset that is easy to visualize and interpret. We also evaluate and show the effectiveness of our approach on an existing EEG dataset.Sunil Gandhi*, University Of Maryland Baltimore County; Tim Oates, University Of Maryland Baltimore County; Tinoosh Mohsenin, Nil; David Hairston, Nil
Shared Deep Kernel Learning for Dimensionality Reduction(abstract, pdf)
Xinwei Jiang*, China University of Geosciences; Junbin Gao, University of Sydney, Australia; Xiaobo Liu, China University of Geosciences; Zhihua Cai, China University of Geosciences; Dongmei Zhang, China University of Geosciences; Yuanxing Liu, China University of Geosciences
Deep Kernel Learning (DKL) has been proven to be an effective method to learn complex feature representation by combining the structural properties of deep learning with the nonparametric flexibility of kernel methods, which can be naturally used for supervised dimensionality reduction. However, if limited training data are available its performance could be compromised because parameters of the deep structure embedded into the model are large and difficult to be efficiently optimized. In order to address this issue, we propose the Shared Deep Kernel Learning model by combining DKL with shared Gaussian Process Latent Variable Model. The novel method could not only bring the improved performance without increasing model complexity but also learn the hierarchical features by sharing the deep kernel. The comparison with some supervised dimensionality reduction methods and deep learning approach verify the advantages of the proposed model.Xinwei Jiang*, China University of Geosciences; Junbin Gao, University of Sydney, Australia; Xiaobo Liu, China University of Geosciences; Zhihua Cai, China University of Geosciences; Dongmei Zhang, China University of Geosciences; Yuanxing Liu, China University of Geosciences
Deep Learning Theory and Applications in KDD (Part 2)
CDSSD: Refreshing Single Shot Object Detection Using A Conv-Deconv Network(abstract, pdf)
Vijay Gabale*, Huew; Uma Sawant, IIT Bombay
Single shot multi-box object detectors have been recently shown to achieve state-of-the-art performance on object detection tasks. We extend the single shot detection (SSD) framework and propose a generic architecture using a deep convolution-deconvolution network. Our architecture does not rely on any pretrained network; and can be pretrained in an unsupervised manner for a given image dataset. Furthermore, we propose a novel approach to combine feature maps from both convolution and deconvolution layers to predict bounding boxes and labels with improved accuracy. Our framework, Conv-Deconv SSD (CDSSD), with its two key contributions ֠unsupervised learning and multi-layer confluence of convolution-deconvolution feature maps results in state-of-the-art performance while utilizing significantly less number of bounding boxes and improved identification of small objects. On 300 X 300 image inputs, we achieve 80.7% mAP on VOC07 and 78.1% mAP on VOC07+12 (1.7% to 2.8% improvement over StairNet, DSSD, SSD). CDSSD achieves 30.2% mAP on COCO performing at-par with R-FCN and faster-R-FCN, while working on smaller size input images. Furthermore, CDSSD matches SSD performance while utilizing 82% of data, and reduces the prediction time per image by 10%.Vijay Gabale*, Huew; Uma Sawant, IIT Bombay
Binary Classification of Sequences Possessing Unilateral Common Factor with AMS and APR(abstract, pdf)
Yujin Tang*, KDDI Research, Inc.; Kei Yonekawa, KDDI Research, Inc.; Mori Kurokawa, KDDI Research, Inc.; Shinya Wada, KDDI Research, Inc.; Kiyohito Yoshihara, KDDI Research, Inc.
Most real-world sequence data for binary classification tasks appear to possess unilateral common factor. That is, samples from one of the classes occur because of common underlying causes while those from the other class may not. We are interested in resolving these tasks using convolutional neural networks (CNN). However, due to both the technical specification and the nature of the data, learning a classifier is generally associated with two problems: 1) de fining a segmentation window size to sub-sequence for sufficient data augmentation and avoiding serious multiple-instance learning issue is non-trivial; 2) samples from one of the classes have common underlying causes and thus present similar features, while those from the other class can have various latent characteristics which can distract CNN in the learning process. We mitigate the first problem by introducing a random variable on sample scaling parameters, whose distribution’s parameters are jointly learnt with CNN and leads to what we call adaptive multi-scale sampling (AMS). To address the second problem, we propose activation pattern regularization (APR) on only samples with the common causes such that CNN focuses on learning representations pertaining to the common factor. We demonstrate the effectiveness of both proposals in extensive experiments on real-world datasets.Yujin Tang*, KDDI Research, Inc.; Kei Yonekawa, KDDI Research, Inc.; Mori Kurokawa, KDDI Research, Inc.; Shinya Wada, KDDI Research, Inc.; Kiyohito Yoshihara, KDDI Research, Inc.
Automating reading comprehension by generating question and answer pairs(abstract, pdf)
Vishwajeet Kumar*, IITB-Monash Research Academy; kireeti Boorla, Oracle; Yogesh Meena, IIT Bombay; Ganesh Ramkrishnan, IIT Bombay; Yuan-Fang Li , Monash University
Neural network-based methods represent the state-of-the-art in question generation from text. Existing work focuses on generating only questions from text without concerning itself with answer generation. Moreover, our analysis shows that handling rare words and generating the most appropriate question given a candidate answer are still challenges facing existing approaches. We present a novel two-stage process to generate question-answer pairs from the text. For the first stage, we present alternatives for encoding the span of the pivotal answer in the sentence using Pointer Networks. In our second stage, we employ sequence to sequence models for question generation, enhanced with rich linguistic features. Finally, global attention and answer encoding are used for generating the question most relevant to the answer. We linguistically analyze the role of each component in our framework and explain enhancements for each. This analysis is supported by extensive experimental evaluations. Using standard evaluation metrics as well as human evaluations, our experimental results validate the significant improvement in the quality of questions generated by our framework over the state-of-the-art. The technique presented here represents another step towards more automated reading comprehension assessment. We also present a live system\footnote{Demo of the system is available at \url{https://goo.gl/EYvBuZ}.} to demonstrate the effectiveness of our approach. Vishwajeet Kumar*, IITB-Monash Research Academy; kireeti Boorla, Oracle; Yogesh Meena, IIT Bombay; Ganesh Ramkrishnan, IIT Bombay; Yuan-Fang Li , Monash University
Emotion Classification with Data Augmentation Using Generative Adversarial Networks(abstract, pdf)
Xinyue Zhu*, BUPT; Yifan Liu, Beihang University; Jiahong Li, Beijing San Kuai Yun Technology Co., Ltd; Tao Wan, Beihang University; Zengchang Qin, Beihang University
It is a difficult task to classify images with multiple class labels using only a small number of labeled examples, especially when the label (class) distribution is imbalanced. Emotion classification is such an example of imbalanced label distribution, because some classes of emotions like disgusted are relatively rare comparing to other labels like happy or sad. In this paper, we propose a data augmentation method using generative adversarial networks (GAN). It can complement and complete the data manifold and find better margins between neighboring classes. Specifically, we design a framework using a CNN model as the classifier and a cycle-consistent adversarial networks (CycleGAN) as the generator. In order to avoid gradient vanishing problem, we employ the least-squared loss as adversarial loss. We also propose several evaluation methods on three benchmark datasets to validate GAN’s performance. Empirical results show that we can obtain 5%~10% increase in the classification accuracy after employing the GAN-based data augmentation techniques. Xinyue Zhu*, BUPT; Yifan Liu, Beihang University; Jiahong Li, Beijing San Kuai Yun Technology Co., Ltd; Tao Wan, Beihang University; Zengchang Qin, Beihang University
Trans2Vec: Learning Transaction Embedding via Items and Frequent Itemsets(abstract, pdf)
Dang Nguyen*, Deakin University; Tu Nguyen, “Deakin University, Australia”; Wei Luo, Deakin University; Svetha Venkatesh, Deakin University
Learning meaningful and effective representations for transaction data is a crucial prerequisite for transaction classification and clustering tasks. Traditional methods which use frequent itemsets (FIs) as features often suffer from the data sparsity and high-dimensionality problems. Several supervised methods based on discriminative FIs have been proposed to address these disadvantages, but they require transaction labels, thus rendering them inapplicable to real-world applications where labels are not given. In this paper, we propose an unsupervised method which learns low-dimensional continuous vectors for transactions based on information of both singleton items and FIs. We demonstrate the superior performance of our proposed method in classifying transactions on four datasets compared with several state-of-the-art baselines.Dang Nguyen*, Deakin University; Tu Nguyen, “Deakin University, Australia”; Wei Luo, Deakin University; Svetha Venkatesh, Deakin University
Detecting Complex Sensitive Information via Phrase Structure in Recursive Neural Networks(abstract, pdf)
Jan Neerbek*, University of Aarhus; Ira Assent, University of Aarhus; Peter Dolog, University of Aalborg
State-of-the-art sensitive information detection in unstructured data relies on the frequency of co-occurrence of keywords with sensitive seed words. In practice, however, this may fail to detect more complex patterns of sensitive information. In this work, we propose learning phrase structures that separate sensitive from non-sensitive documents in recursive neural networks. Our evaluation on real data with human labeled sensitive content shows that our new approach outperforms existing keyword based strategies.Jan Neerbek*, University of Aarhus; Ira Assent, University of Aarhus; Peter Dolog, University of Aalborg
Clustering and Unsupervised Learning (Part 1)
A Distance Scaling Method to Improve Density-based Clustering(abstract, pdf)
YE ZHU*, Deakin University; Kai Ming Ting, Federation University, Australia; Maia Angelova, Deakin University
Density-based clustering is able to find clusters of arbitrary sizes and shapes while effectively separating noise. Despite its advantage over other types of clustering, it is well-known that most density-based algorithms face the same challenge of finding clusters with varied densities. Recently, ReScale, a principled density-ratio preprocessing technique, enables a density-based clustering algorithm to identify clusters with varied densities. However, because the technique is based on one-dimensional scaling, it does not do well in datasets which require multi-dimensional scaling. In this paper, we propose a multi-dimensional scaling method, named DScale, which rescales based on the computed distance. It overcomes the key weakness of ReScale and requires one less parameter while maintaining the simplicity of the implementation. Our empirical evaluation shows that DScale has better clustering performance than ReScale for three existing density-based algorithms, i.e., DBSCAN, OPTICS and DP, on synthetic and real-world datasets.YE ZHU*, Deakin University; Kai Ming Ting, Federation University, Australia; Maia Angelova, Deakin University
Neighbourhood Contrast: A better means to detect clusters than density(abstract, pdf)
Bo Chen*, Monash University; Kai Ming Ting, Federation University, Australia
Most density-based clustering algorithms suffer from large density variations among clusters. This paper proposes a new measure called Neighbourhood Contrast (NC) as an better alternative to density in detecting clusters. The proposed NC admits all local density maxima, regardless of their densities, to have similar NC values. Due to this unique property, NC is a better means to detect clusters in a dataset with large density variations among clusters. We provide two applications of NC. First, replacing density with NC in the current state-of-the-art clustering procedure DP leads to significantly improved clustering performance. Second, we devise a new clustering algorithm called Neighbourhood Contrast Clustering (NCC) which does not require density or distance calculation and therefore has a linear time complexity in terms of dataset size. Our empirical evaluation shows that both NC-based methods outperform density-based methods including the current state-of-the-art.Bo Chen*, Monash University; Kai Ming Ting, Federation University, Australia
Clustering of Multiple Density Peaks(abstract, pdf)
borui cai*, deakin university; Guangyan Huang, Deakin University; yong xiang, deakin university
Density-based clustering, such as Density Peak Clustering (DPC) and DBSCAN, can find arbitrary shape clusters and have wide applications such as image processing, spatial data mining and text mining. In DBSCAN, a core point has density greater than a threshold, and can spread its cluster ID to other neighbour core points. However, core points selected by one cut/threshold are too coarse to segment fine clusters that are sensitive to densities. DPC resolves this problem by finding a data point with the peak density as centre to develop a fine cluster. Unfortunately, a DPC cluster that comprises only one centre may be too fine to form a natural cluster. In this paper, we provide a novel clustering of multiple density peaks (MDPC) to flexibly find clusters with arbitrary number of regional centres with local peak densities through extending DPC. In MDPC, we generate fine seed clusters containing single density peak, and form clusters with multiple density peaks by merging those clusters that are close to each other and have similar density distributions. Comprehensive experiments have been conducted on both synthetic and real-world datasets to demonstrate the accuracy and effectiveness of MDPC compared with DPC, DBSCAN and other base-line clustering algorithms.borui cai*, deakin university; guangyan huang, deakin university; yong xiang, deakin university
A New Local Density for Density Peak Clustering(abstract, pdf)
William Zhu*, University of Electronic Science and Technology of China
Density peak clustering is able to recognize clusters of arbitrary shapes, so it has attracted attention in academic community. However, existing density peak clustering algorithms prefer to select cluster centers from dense regions and thus easily ignore clusters from sparse regions. To solve this problem, we redefine the local density of a point as the number of points whose neighbors contain this point. This idea is based on our following finding: whether in dense clusters or insparse clusters, a cluster center would have a relatively high local density calculated by our new measure. Even in a sparse region, there may be some points with high local densities in our definition, thus one of these points can be selected to be the center of this region in subsequent steps and this region is then detected as a cluster.We apply our new definition to both density peak clustering and the combination of density peak clustering with agglomerative clustering. Experiments on benchmark datasets show the effectiveness of our methods.William Zhu*, University of Electronic Science and Technology of China
An Efficient Ranking-centered Density-based Document Clustering Method(abstract, pdf)
Wathsala Mohotti*, QUT; Richi Nayak, Queensland University of Technology, Brisbane
Document clustering is a popular method for discovering useful information from text data. This paper proposes an innovative hybrid document clustering method based on the novel concepts of ranking, density and shared neighborhood. We utilize ranked documents generated from a search engine to effectively build a graph of shared relevant documents. The high density regions in the graph are processed to form initial clusters. The clustering decisions are further refined using the shared neighborhood information. Empirical analysis shows that the proposed method is able to produce accurate and efficient solution as compared to relevant benchmarking methods.Wathsala Mohotti*, QUT; Richi Nayak, Queensland University of Technology, Brisbane
Clustering and Unsupervised Learning (Part 2)
Fast Manifold Landmarking using Locality Sensitive Hashing(abstract, pdf)
Zay Maung Maung Aye*, University of Melbourne; Kotagiri Ramamohanarao, The University of Melbourne; Benjamin Rubinstein, University of Melbourne
Manifold landmarks can approximately represent the low-dimensional nonlinear manifold structure embedded in high-dimensional ambient feature space. Due to the quadratic complexity of many learning algorithms in the number of training samples, selecting a sample subset as manifold landmarks has become an important issue for scalable learning. Unfortunately, state-of-the-art Gaussian process methods for selecting manifold landmarks themselves are not scalable to large datasets. In an attempt to speed up learning manifold landmarks, uniformly selected minibatch stochastic gradient descent is used by the state-of-the-art approach. Unfortunately, this approach only goes part way to making manifold learning tractable. We propose two adaptive sample selection approaches for gradient descent optimization, which can lead to better performance in accuracy and computational time. Our methods exploit the compatibility of locality sensitive hashing (via LSH and DBH) and the manifold assumption, thereby limiting expensive optimization to relevant regions of the data. Landmarks selected by our methods achieve superior accuracy than training the state-of-the-art learner with randomly selected minibatch. We also demonstrate that our methods can be used to find manifold landmarks without learning Gaussian processes at all, which leads to orders-of-magnitude speed up with only minimal decrease in accuracy.Zay Maung Maung Aye*, University of Melbourne; Kotagiri Ramamohanarao, The University of Melbourne; Benjamin Rubinstein, University of Melbourne
Equitable Conceptual Clustering using OWA operator(abstract, pdf)
Noureddine Aribi, University of Oran; Abdelkader Ouali, University of Caen Normandy; Yahia Lebbah, University of Oran; Samir Loudni*, University of Caen Normandy
We propose an equitable conceptual clustering approach based on multi-agent optimization. In the context of conceptual clustering, each cluster is represented by an agent having its own satisfaction and the problem consists in finding the best cumulative satisfaction while emphasizing a fair compromise between all individual agents. The fairness goal is achieved using an equitable formulation of the Ordered Weighted Averages (OWA) operator. Experiments performed on UCI datasets and on instances coming from real application ERP show that our approach efficiently finds clusterings of consistently high quality.Noureddine Aribi, University of Oran; Abdelkader Ouali, University of Caen Normandy; Yahia Lebbah, University of Oran; Samir Loudni*, University of Caen Normandy
Unsupervised Extremely Randomized Trees(abstract, pdf)
Kevin Dalleau*, LORIA; Malika Smail-Tabbone, LORIA; Miguel Couceiro, LORIA
In this paper we present a method to compute distances on unlabeled data, based on extremely randomized trees. This method, Unsupervised Extremely Randomized Trees, is used jointly with a novel randomized labeling scheme we describe here, and that we call AddCl3. Unlike existing methods such as AddCl1 and AddCl2, no synthetic instances are generated, thus avoiding an increase in the size of the dataset. The empirical study of this method shows that Unsupervised Extremely Randomized Trees with AddCl3 provides competitive results regarding the quality of resulting clusterings, while clearly outperforming previous similar methods in terms of running time.Kevin Dalleau*, LORIA; Malika Smail-Tabbone, LORIA; Miguel Couceiro, LORIA
Local Graph Clustering by Multi-network Random Walk with Restart(abstract, pdf)
Yaowei Yan*, The Pennsylvania State University; Dongsheng Luo, The Pennsylvania State University; Jingchao Ni, The Pennsylvania State University; Hongliang Fei, Baidu Big Data Lab; Wei Fan, Baidu Big Data Lab; John Yen, The Pennsylvania State University; Xiong Yu, Case Western Reserve University; Xiang Zhang, The Pennsylvania State University
Searching local graph clusters is an important problem in big network analysis. Given a query node in a graph, local clustering aims at finding a subgraph around the query node, which consists of nodes highly relevant to the query node. Existing local clustering methods are based on single networks that contain limited information. In contrast, the real data are always comprehensive and can be represented better by multiple connected networks (multi-network). To take the advantage of heterogeneity of multi-network and improve the clustering accuracy, we advance a strategy for local graph clustering based on Multi-network Random Walk with Restart (MRWR), which discovers local clusters on a target network in association with additional networks. For the proposed local clustering method, we develop a localized approximate algorithm (AMRWR) on solid theoretical basis to speed up the searching process. To the best of our knowledge, this is the first elaboration of local clustering on a target network by integrating multiple networks. Empirical evaluations show that the proposed method improves clustering accuracy by more than 10% on average with competently short running time, compared with the alternative state-of-the-art graph clustering approaches.Yaowei Yan*, The Pennsylvania State University; Dongsheng Luo, The Pennsylvania State University; Jingchao Ni, The Pennsylvania State University; Hongliang Fei, Baidu Big Data Lab; Wei Fan, Baidu Big Data Lab; John Yen, The Pennsylvania State University; Xiong Yu, Case Western Reserve University; Xiang Zhang, The Pennsylvania State University
Scalable Approximation Algorithm for Graph Summarization(abstract, pdf)
Maham Anwar Beg, Lahore University of Management Sciences; Muhammad Ahmad, Lahore University of Management Sciences; Arif Zaman, Lahore University of Management Sciences; Imdadullah Khan*, Lahore University of Management Sciences
Massive sizes of real-world graphs, such as social networks and web graph, impose serious challenges to process and perform analytics on them. These issues can be resolved by working on a small summary of the graph instead . A summary is a compressed version of the graph that removes several details, yet preserves it’s essential structure. Generally, some predefined quality measure of the summary is optimized to bound the approximation error incurred by working on the summary instead of the whole graph. All known summarization algorithms are computationally prohibitive and do not scale to large graphs. In this paper we present an efficient randomized algorithm to compute graph summaries with the goal to minimize reconstruction error. We propose a novel weighted sampling scheme to sample vertices for merging that will result in the least reconstruction error. We provide analytical bounds on the running time of the algorithm and prove approximation guarantee for our score computation. Efficiency of our algorithm makes it scalable to very large graphs on which known algorithms cannot be applied. We test our algorithm on several real world graphs to empirically demonstrate the quality of summaries produced and compare to state of the art algorithms. We use the summaries to answer several structural queries about original graph and report their accuracies.Maham Anwar Beg, Lahore University of Management Sciences; Muhammad Ahmad, Lahore University of Management Sciences; Arif Zaman, Lahore University of Management Sciences; Imdadullah Khan*, Lahore University of Management Sciences
Privacy-Preserving and Security
RIPEx: Extracting malicious IP addresses from security forums using cross-forum learning(abstract, pdf)
Joobin Gharibshah*, University of California, Riverside; Evangelos Papalexakis, UC Riverside; Michlais Michail Faloutsos, University of California, Riverside
Is it possible to extract malicious IP addresses reported in security forums in an automatic way? This is the question at the heart of our work. We focus on security forums, where security professionals and hackers share knowledge and information, and often report misbehaving IP addresses. So far, there have only been a few efforts to extract information from such security forums. We propose RIPEx, a systematic approach to identify and label IP addresses in security forums by utilizing a cross-forum learning method. In more detail, the challenge is twofold: (a) identifying IP addresses from other numerical entities, such as software version numbers, and (b) classifying the IP address as benign or malicious. We propose an integrated solution that tackles both these problems. A novelty of our approach is that it does not require training data for each new forum. Our approach does knowledge transfer across forums: we use a classifier from our source forums to identify seed information for training a classifier on the target forum. We evaluate our method using data collected from five security forums with a total of 31K users and 542K posts. First, RIPEx can distinguish IP address from other numeric expressions with 95% precision and above 93% recall on average. Second, RIPEx identifies malicious IP addresses with an average precision of 88% and over 78% recall, using our cross-forum learning. Our work is a first step towards harnessing the wealth of useful information that can be found in security forums.Joobin Gharibshah*, University of California, Riverside; Evangelos Papalexakis, UC Riverside; Michlais Michail Faloutsos, University of California, Riverside
Pattern-Mining based Cryptanalysis of Bloom Filters for Privacy-Preserving Record Linkage(abstract, pdf)
Peter Christen*, The Australian National University; Anushka Vidanage, The Australian National University; Thilina Ranbaduge, The Australian National University; Rainer Schnell, University Duisburg-Essen
Data mining projects increasingly require records about individuals to be linked across databases to facilitate advanced analytics. The process of linking records without revealing any sensitive or confidential information about the entities represented by these records is known as privacy-preserving record linkage (PPRL). Bloom filters are a popular PPRL technique to encode sensitive information while still enabling approximate linking of records. However, Bloom filter encoding can be vulnerable to attacks that can re-identify some encoded values from sets of Bloom filters. Existing attacks exploit that certain Bloom filters can occur frequently in an encoded database, and thus likely correspond to frequent plain-text values such as common names. We present a novel attack method based on a maximal frequent itemset mining technique which identifies frequently co-occurring bit positions in a set of Bloom filters. Our attack can re-identify encoded sensitive values even when all Bloom filters in an encoded database are unique. As our experiments on a real-world data set show, our attack can successfully re-identify values from encoded Bloom filters even in scenarios where previous attacks fail.Peter Christen*, The Australian National University; Anushka Vidanage, The Australian National University; Thilina Ranbaduge, The Australian National University; Rainer Schnell, University Duisburg-Essen
A Privacy Preserving Bayesian Optimization with High Efficiency(abstract, pdf)
Thanh Dai Nguyen*, Deakin University; Sunil Gupta, Deakin University, Australia; Santu Rana, Deakin University, Australia; Svetha Venkatesh, Deakin University
Bayesian optimization is a powerful machine learning technique for solving experimental design problems. With its use in industrial design optimization, time and cost of industrial processes can be reduced significantly. However, often the experimenters in industries may not have the expertise of optimization techniques and may require help from third-party optimization services. This can cause privacy concerns as the optimized design of an industrial process typically needs to be kept secret to retain its competitive advantages. To this end, we propose a novel Bayesian optimization algorithm that can allow the experimenters from an industry to utilize the expertise of a third-party optimization service in privacy preserving manner. Privacy of our proposed algorithm is guaranteed under a modern privacy preserving framework called Error Preserving Privacy, especially designed to maintain high utility even under the privacy restrictions. Using several benchmark optimization problems as well as optimization problems from real-world industrial processes, we demonstrate that the optimization efficiency of our algorithm is comparable to the non-private Bayesian optimization algorithm and significantly better than its differential privacy counterpart.Thanh Dai Nguyen*, Deakin University; Sunil Gupta, Deakin University, Australia; Santu Rana, Deakin University, Australia; Svetha Venkatesh, Deakin University
Randomizing SVM against Adversarial Attacks Under Uncertainty(abstract, pdf)
Yan Chen, Columbia University; Wei Wang, Beijing Jiaotong University; Xiangliang Zhang*, ” King Abdullah University of Science and Technology, Saudi Arabia”
Robust machine learning algorithms have been widely studied in adversarial environments where the adversary maliciously manipulates data samples to evade security systems. As an example, Support Vector Machines with robustness have been investigated to learn from poisoned training data. In this paper, we propose randomized SVMs against generalized adversarial attacks under uncertainty, through learning a classifier distribution rather than a single classifier in traditional robust SVMs. The randomized SVMs have advantages on better resistance against attacks while preserving high accuracy of classification, especially for non-separable cases. The experimental results demonstrate the effectiveness of our proposed models on defending against various attacks, including more aggressive attacks with uncertainty.Yan Chen, Columbia University; Wei Wang, Beijing Jiaotong University; Xiangliang Zhang*, ” King Abdullah University of Science and Technology, Saudi Arabia”
e-Distance Weighted Support Vector Regression(abstract, pdf)
Ge Ou*, Jilin University; Yan Wang, Jilin University; Lan Huang, CS Department, Jilin University; Wei Pang, University of Aberdeen; George Macleod Coghill, University of Aberdeen
We propose a novel support vector regression approach called e-Distance Weighted Support Vector Regression (e-DWSVR). e-DWSVR specifically addresses a challenging issue in support vector regression: how to deal with the situation when the distribution of the internal data in the e-tube is different from that of the boundary data containing support vectors. The proposed e-DWSVR optimizes the minimum margin and the mean of functional margin simultaneously to tackle this issue. To solve the new optimization problem arising from e-DWSVR, we adopt dual coordinate descent (DCD) with kernel functions for medium-scale problems and also employ averaged stochastic gradient descent (ASGD) to make e-DWSVR scalable to larger problems. We report promising results obtained by e-DWSVR in comparison with five popular regression methods on sixteen UCI benchmark datasets.Ge Ou*, Jilin University; Yan Wang, Jilin University; Lan Huang, CS Department, Jilin University; Wei Pang, University of Aberdeen; George Macleod Coghill, University of Aberdeen
Recommendation and Data Factorization
One for the Road: Recommending Male Street Attire(abstract, pdf)
Debopriyo Banerjee*, IIT Kharagpur; Niloy Ganguly, IIT Kharagpur; Krothapalli Sreenivasa Rao, IIT Kharagpur; Shamik Sural, Indian Institute of Technology Kharagpur
Growth of male fashion industry and escalating popularity of affordable street fashion wear has created a demand for the intervention of effective data analytics and recommender systems for male street wear. This motivated us to undertake extensive image collection of male subjects in casual wear and pose; assiduously annotate and carefully select discriminating features. We build up a classifier which predicts accurately the attractive quotient of an outfit. Further, we build a recommendation system – MalOutRec – which provides pointed recommendation of changing a part of the outfit in case the outfit looks unattractive (e.g. change the existing pair of trousers with a recommended one). We employ an innovative methodology that uses personalized pagerank in designing MalOutRec – experimental results show that it handsomely beats the metapath based baseline algorithm.Debopriyo Banerjee*, IIT Kharagpur; Niloy Ganguly, IIT Kharagpur; Krothapalli Sreenivasa Rao, IIT Kharagpur; Shamik Sural, Indian Institute of Technology Kharagpur
Context-aware Location Annotation on Mobility Records through User Grouping(abstract, pdf)
Yong Zhang, Beihang University; Hua Wei, Pennsylvania State University; Xuelian Lin*, Beihang University; Fei Wu, Pennsylvania State University; Zhenhui (Jessie) Li, Penn State University; Kaiheng Chen, Beihang University; Yuandong Wang, Beihang University; Jie Xu, University of Leeds
Due to the increasing popularity of location-based services, a massive volume of human mobility records have been generated. At the same time, the growing spatial context data provides us rich semantic information. Associating the mobility records with relevant surrounding contexts, known as the location annotation, enables us to understand the semantics of the mobility records and helps further tasks like advertising. However, the location annotation problem is challenging due to the ambiguity of surrounding contexts and the sparsity of personal data. To solve this problem, we propose a Context-Aware location annotation method through User Grouping (CAUG) to annotate locations with venues. This method leverages user grouping and venue categories to alleviate the data sparsity issue and annotates locations according to multi-view information (spatial, temporal and contextual) of multiple granularities. Through extensive experiments on a real-world dataset, we demonstrate that our method significantly outperforms other baseline methods.Yong Zhang, Beihang University; Hua Wei, Pennsylvania State University; Xuelian Lin*, Beihang University; Fei Wu, Pennsylvania State University; Zhenhui (Jessie) Li, Penn State University; Kaiheng Chen, Beihang University; Yuandong Wang, Beihang University; Jie Xu, University of Leeds
A Joint Optimization Method for Personalized Recommendation Result Diversification(abstract, pdf)
Xiaojie Wang*, The University of Melbourne; Jianzhong Qi, The University of Melbourne; Kotagiri Ramamohanarao, The University of Melbourne; Yu Sun, University of Melbourne; Bo Li, University of Illinois at Urbana׃hampaign; Rui Zhang, ” University of Melbourne, Australia”
In recommendation systems, items of interest are often classified into categories such as genres of movies. Existing research has shown that diversified recommendations can improve real user experience. However, most existing methods do not consider the fact that users’ levels of interest (i.e., user preferences) in different categories usually vary, and such user preferences are not reflected in the diversified recommendations. We propose an algorithm that considers user preferences for different categories when recommending diversified results, and refer to this problem as personalized recommendation diversification. In the proposed algorithm, a model that captures user preferences for different categories is optimized jointly toward both relevance and diversity. To provide the proposed algorithm with informative training labels and effectively evaluate recommendation diversity, we also propose a new personalized diversity measure. The proposed measure overcomes limitations of existing measures in evaluating recommendation diversity: existing measures either cannot effectively handle user preferences for different categories, or cannot evaluate both relevance and diversity at the same time. Experiments using two real-world datasets confirm the superiority of the proposed algorithm, and show the effectiveness of the proposed measure in capturing user preferences.Xiaojie Wang*, The University of Melbourne; Jianzhong Qi, The University of Melbourne; Kotagiri Ramamohanarao, The University of Melbourne; Yu Sun, University of Melbourne; Bo Li, University of Illinois at Urbana–Champaign; Rui Zhang, ” University of Melbourne, Australia”
Personalized Item-of-Interest Recommendation on Storage Constrained Smartphone based on Word Embedding Quantization(abstract, pdf)
Si-Ying Huang, National Chung Hsing University; Yung-Yu Chen, National Center for High-Performance Computing; Hung-Yuan Chen, ITRI; Chen Lun-Chi, National Center for High-Performance Computing; Yao-Chung Fan*, National Chung Hsing Universit
In recent years, word embedding models receive tremendous research attentions due to their capability of capturing textual semantics. This study investigates the issue of employing word embedding models into resource-limited smartphones for personalized item recommendation. The challenge lies in that the existing embedding models are often too large to fit into a resource-limited smartphones. One naive idea is to incorporate a secondary storage by residing the model in the secondary storage and processing recommendation with the secondary storage. However, this idea suffers from the burden of additional traffics. To this end, we propose a framework called Word Embedding Quantization (WEQ) that constructs an index upon a given word embedding model and stores the index on the primary storage to enable the use of the word embedding model on smartphones. One challenge for using the index is that the exact user profile is no longer ensured. However, we find that there are opportunities for computing the correct recommendation results by knowing only inexact user profile. In this paper, we propose a series of techniques that leverage the opportunities for computing candidates with the goal of minimizing the accessing cost to a secondary storage. Experiments are made to verify the efficiency of the proposed techniques, which demonstrates the feasibility of the proposed framework. Si-Ying Huang, National Chung Hsing University; Yung-Yu Chen, National Center for High-Performance Computing; Hung-Yuan Chen, ITRI; Chen Lun-Chi, National Center for High-Performance Computing; Yao-Chung Fan*, National Chung Hsing Universit
Social Network, Ubiquitous Data and Graph Mining (Part 1)
Topic-specific Retweet Count Ranking for Weibo(abstract, pdf)
Hangyu Mao*, Peking University; Yang Xiao, Peking University; Yuan Wang, Peking University; Jiakang Wang, Peking University; Zhen Xiao, Peking University
In this paper, we study *topic-specific* retweet count ranking problem in Weibo. Two challenges make this task nontrivial. Firstly, traditional methods cannot derive effective feature for tweets, because in topic-specific setting, tweets usually have too many shared contents to distinguish them. We propose a LSTM-embedded autoencoder to generate tweet features with the insight that any different prefixes of tweet text is a possible distinctive feature. Secondly, it is critical to fully catch the meaning of topic in topic-specific setting, but Weibo can provide little information about topic. We leverage real-time news information from Toutiao to enrich the meaning of topic, as more than 85% topics are headline news. We evaluate the proposed components based on ablation methods, and compare the overall solution with a recently-proposed tensor factorization model. Extensive experiments on real Weibo data show the effectiveness and flexibility of our methods.Hangyu Mao*, Peking University; Yang Xiao, Peking University; Yuan Wang, Peking University; Jiakang Wang, Peking University; Zhen Xiao, Peking University
Motif-Aware Diffusion Networks Inference(abstract, pdf)
Qi Tan, HKBU; Yang Liu, Hong Kong Baptist University; Jiming Liu*, Hong Kong Baptist University
Characterizing and understanding information diffusion over social networks play an important role in various real-world applications. In many scenarios, however, only the states of nodes can be observed while the underlying diffusion networks are unknown. Many methods have therefore been proposed to infer the underlying networks based on node observations. To enhance the inference performance, structural priors of the networks, such as sparsity, scale-free, and modularity, are often incorporated into the learning procedure. As the structural primitives of networks, network motifs occur frequently in many social networks, and play an essential role in describing the network structures and functionalities. However, to the best of our knowledge, no existing work exploits this kind of structural primitives in diffusion network inference. In order to address this unexplored yet important issue, in this paper, we propose a novel method called Motif-Aware Diffusion Networks Inference (MADNI), which aims to mine the motif profile from the node observations and infer the underlying network based on the mined motif profile. The mined motif profile and the inferred network are alternately refined until the learning procedure converges. Extensive experiments on both synthetic and real-world datasets validate the effectiveness of the proposed method.Qi Tan, HKBU; Yang Liu, Hong Kong Baptist University; Jiming Liu*, Hong Kong Baptist University
Tri-Fly: Distributed Estimation of Global and Local Triangle Counts in Graph Streams(abstract, pdf)
Kijung Shin*, Carnegie Mellon University; Mohammad Hammoud, Carnegie Mellon University; Euiwoong Lee, Carnegie Mellon University; Jinoh Oh, Adobe Systems; Christos Faloutsos, CMU, Pittsburgh
Given a graph stream, how can we estimate the number of triangles in it using multiple machines with limited storage? Counting triangles (i.e., cycles of length three) is a classical graph problem whose importance has been recognized in diverse fields, including data mining, social network analysis, and databases. Recently, for triangle counting in massive graphs, two approaches have been intensively studied. One approach is streaming algorithms, which estimate the count of triangles incrementally in time-evolving graphs or in large graphs only part of which can be stored. The other approach is distributed algorithms for utilizing computational power and storage of multiple machines. Can we have the best of both worlds? We propose Tri-Fly, the first distributed streaming algorithm for approximate triangle counting. Making one pass over a graph stream, Tri-Fly rapidly and accurately estimates the counts of global triangles and local triangles incident to each node. Compared to state-of-the-art single-machine streaming algorithms, Tri-Fly is (a) Accurate: yields up to 4.5X smaller estimation error, (b) Fast: runs up to 8.8X faster with linear scalability, and (c) Theoretically sound: gives unbiased estimates with smaller variances.Kijung Shin*, Carnegie Mellon University; Mohammad Hammoud, Carnegie Mellon University; Euiwoong Lee, Carnegie Mellon University; Jinoh Oh, Adobe Systems; Christos Faloutsos, CMU, Pittsburgh
WFSM-MaxPWS: An Efficient Approach for Mining Weighted Frequent Subgraphs from Edge-Weighted Graph Databases(abstract, pdf)
Md. Ashraful Islam, University of Dhaka; Chowdhury Farhan Ahmed, University of Dhaka; Carson Leung*, University of Manitoba; Calvin Hoi, University of Manitoba
Weighted frequent subgraph mining comes with an inherent challenge—namely, weighted support does not support the downward closure property, which is often used in mining algorithms for reducing the search space. Although this challenge attracted attention from several researchers, most existing works in this field use either affinity based pruning or alternative anti-monotonic weighting technique for subgraphs other than average edge-weight. In this paper, we propose an efficient weighted frequent subgraph mining algorithm called WFSM-MaxPWS. Our algorithm uses the MaxPWS pruning technique, which significantly reduces search space without changing subgraph weighting scheme while ensuring completeness. Our evaluation results on three different graph datasets with two different weight distributions (normal and negative exponential) showed that our WFSM-MaxPWS algorithm led to significant runtime improvement over the existing MaxW pruning technique (which is a concept for weighted pattern mining in computing subgraph weight by taking average of edge weights).Md. Ashraful Islam, University of Dhaka; Chowdhury Farhan Ahmed, University of Dhaka; Carson Leung*, University of Manitoba; Calvin Hoi, University of Manitoba
A Game-Theoretic Adversarial Approach to Dynamic Network Prediction(abstract, pdf)
Jia Li*, University of Illinois at Chicago; Brian Ziebart, UIC; Tanya Berger-Wolf, University of Illinois
Predicting the evolution of a dynamic network–|the addition of new edges and the removal of existing edges–is challenging. In part, this is because: (1) networks are often noisy; (2) various performance measures emphasize different aspects of prediction; and (3) it is not clear which network features are useful for prediction. To address these challenges, we develop a novel framework for robust dynamic network prediction using an adversarial formulation that leverages both edge-based and global network features to make predictions. We conduct experiments on five distinct dynamic network datasets to show the superiority of our approach compared to state-of-the-art methods.Jia Li*, University of Illinois at Chicago; Brian Ziebart, UIC; Tanya Berger-Wolf, University of Illinois
Social Network, Ubiquitous Data and Graph Mining (Part 2)
Targeted Influence Minimization in Social Networks(abstract, pdf)
Xinjue Wang, RMIT; Ke Deng*, RMIT University; Jianxin Li, The University of Western Australia; Xiaochun Yang, Northeast University; Jeffrey Xu Yu, Chinese University of Hong Kong; Christian Jensen, Aalborg University
An online social network can be used for the diffusion of malicious information like derogatory rumours, disinformation, hate speech, revenge pornography, etc. This motivates the study of influence minimization that aim to prevent the spread of malicious information. Unlike previous influence minimization work, this study considers the influence minimization in relation to a particular group of social network users, called targeted influence minimization. Thus, the objective is to protect a set of users, called target nodes, from malicious information originating from another set of users, called active nodes. This study also addresses two fundamental, but largely ignored, issues in different influence minimization problems: (i) the impact of a budget on the solution; (ii) robust sampling. To this end, two scenarios are investigated, namely unconstrained and constrained budget. Given an unconstrained budget, we provide an optimal solution; Given a constrained budget, we show the problem is NP-hard and develop a greedy algorithm with an (1-1/e)-approximation. More importantly, in order to solve the influence minimization problem in large, real-world social networks, we propose a robust sampling-based solution with a desirable theoretic bound. Extensive experiments using real social network datasets offer insight into the effectiveness and efficiency of the proposed solutions.Xinjue Wang, RMIT; Ke Deng*, RMIT University; Jianxin Li, The University of Western Australia; Xiaochun Yang, Northeast University; Jeffrey Xu Yu, Chinese University of Hong Kong, Hong Kong; Christian Jensen, Aalborg University
Maximizing Social Influence on Target Users(abstract, pdf)
Yu-Ting Wen*, National Chiao Tung University; Wen-Chih Peng, National Chiao Tung University, Taiwan; Hong-Han Shuai, National Chiao Tung University
Influence maximization has attracted a considerable amount of research work due to the explosive growth in online social networks. Existing studies of influence maximization on social networks aim at deriving a set of users (referred to as seed users) in a social network to maximize the expected number of users influenced by those seed users. However, in some scenarios, such as election campaigns and target audience marketing, the requirement of the influence maximization is to influence a set of specific users. This set of users is defined as the target set of users. In this paper, given a target set of users, we study the Target Influence Maximization (TIM) problem with the purpose of maximizing the number of users within the target set. We particularly focus on two important issues: 1) how to capture the social influence among users, and 2) how to develop an efficient scheme that offers wide influence spread on specified subsets. Experiment results on real-world datasets validate the performance of the solution for TIM using our proposed approaches.Yu-Ting Wen*, National Chiao Tung University; Wen-Chih Peng, National Chiao Tung University, Taiwan; Hong-Han Shuai, National Chiao Tung University
Team Expansion in Collaborative Environments(abstract, pdf)
Lun Zhao*, Nanjing University; Yuan Yao, Nanjing University; Guibing Guo, Northeastern University; Hanghang Tong, Arizona State University; Feng Xu, Nanjing University; Jian Lv, Nanjing University
In this paper, we study the team expansion problem in collaborative environments where people collaborate with each other in the form of a team, which might need to be expanded frequently by having additional team members during the course of the project. Intuitively, there are three factors as well as the interactions between them that have a profound impact on the performance of the expanded team, including (1) the task the team is performing, (2) the existing team members, and (3) the new candidate team member. However, the vast majority of the existing work either considers these factors separately, or even ignores some of these factors. In this paper, we propose a neural network based approach TECE to simultaneously model the interactions between the team task, the team members as well as the candidate team members. Experimental evaluations on real-world datasets demonstrate the effectiveness of the proposed approach.Lun Zhao*, Nanjing University; Yuan Yao, Nanjing University; Guibing Guo, Northeastern University; Hanghang Tong, Arizona State University; Feng Xu, Nanjing University; Jian Lv, Nanjing University
HashAlign: Hash-based Alignment of Multiple Graphs(abstract, pdf)
Wei Lee, University of Michigan; Mark Heimann*, University of Michigan; Shengjie Pan, University of Michigan; Kuan-yu Chen, University of Michigan; Danai Koutra, U Michigan
Fusing or aligning two or more networks is a fundamental building block of many graph mining tasks (e.g., recommendation systems, link prediction, collective analysis of networks). Most past work has focused on formulating the pairwise graph alignment problem as an optimization problem with varying constraints and relaxations. In this paper, we study the problem of multiple graph alignment (i.e., the problem of collectively aligning multiple graphs at once) and propose HASHALIGN, an efficient and intuitive hash-based framework for network alignment that leverages structural properties and other node and edge attributes (if available) simultaneously. We introduce a new construction of LSH families, as well as robust node and graph features that are tailored for this task. Our method quickly finds the alignment between multiple graphs while avoiding the all-pairwise-comparison problem by expressing all alignments in terms of a chosen ңenterҠgraph. Our extensive experiments on synthetic and real networks show that, on average, HASHALIGN is 2נfaster and 10 to 20% more accurate than the baselines in pairwise alignment, and 2נfaster while 50% more accurate in multiple graph alignment.Wei Lee, University of Michigan; Mark Heimann*, University of Michigan; Shengjie Pan, University of Michigan; Kuan-yu Chen, University of Michigan; Danai Koutra, U Michigan
Evaluating and Analyzing Reliability over Decentralized and Complex Networks(abstract, pdf)
Jaron Mar, University of Auckland; Jiamou Liu*, University of Auckland; Yanni Tang, College of Computer & Information Science, Southwest University; Wu Chen, College of Computer ? Information Science, Southwest University; Tianyi Sun, University of Auckland
In an increasingly interconnected and distributed world, the ability to ensure communications becomes pivotal in day-to-day operations. Given a network whose edges are prone to failures and disruptions, reliability captures the probability that traffic will reach a target location by traversing edges starting from a given source. This paper investigates reliability in decentralized and complex networks. To evaluate reliability, we introduce a multi-agent method that involves pathfinding agents to reduce the graph. Performance of this method is tested on scale-free and small-world networks as well as real-world spatial networks. We also investigate reliability score which aims to rank the capability of nodes in terms of traffic dissemination traffic across all nodes. Analysis over spatial networks indicates that the reliability score correlates with central and sub-central regions in a geographical region.Jaron Mar, University of Auckland; Jiamou Liu*, University of Auckland; Yanni Tang, College of Computer & Information Science, Southwest University; Wu Chen, College of Computer & Information Science, Southwest University; Tianyi Sun, University of Auckland
Social Network, Ubiquitous Data and Graph Mining (Part 3)
Efficient Exact and Approximate Algorithms for Computing Betweenness Centrality in Directed Graphs(abstract, pdf)
Mostafa Haghir Chehreghani*, Telecom Paristech; Albert Bifet, Telecom ParisTech; Talel Abdessalem, Telecom ParisTech
In this paper, first given a directed network G and a vertex r ? V (G), we propose a new exact algorithm to compute betweenness score of r. Our algorithm pre-computes a set RF(r), which is used to prune a huge amount of computations that do not contribute in the betweenness score of r. Then, for the cases where RF(r) is large, we present a randomized algorithm that samples from RF(r) and performs computations for only the sampled elements. We show that this algorithm provides an (, ?)-approximation of the betweenness score of r. Finally, we empirically evaluate our algorithms and show that they significantly outperform the most efficient existing algorithms, in terms of both running time and accuracy. Our experiments also show that our proposed algorithms can effectively compute betweenness scores of all vertices in a set of vertices.Mostafa Haghir Chehreghani*, Telecom Paristech; Albert Bifet, Telecom ParisTech; Talel Abdessalem, Telecom ParisTech
Forecasting Bitcoin Price with Graph Chainlets(abstract, pdf)
Cuneyt Akcora*, University of Texas at Dallas; Yulia Gel, The University of Texas at Dallas; Murat Kantarcioglu, UT Dallas; Asim Dey, University of Texas at Dallas
Over the last couple of years, Bitcoin cryptocurrency and the Blockchain technology that forms the basis of Bitcoin have witnessed a flood of attention. In contrast to fiat currencies used worldwide, the Bitcoin distributed ledger is publicly available by design. This facilitates observing all financial interactions on the network, and analyzing how the network evolves in time. We introduce a novel concept of chainlets, or Bitcoin subgraphs, which allows us to evaluate the local topological structure of the Bitcoin graph over time. Furthermore, we assess the role of chainlets on Bitcoin price formation and dynamics. We investigate the predictive Granger causality of chainlets and identify certain types of chainlets that exhibit the highest predictive influence on Bitcoin price and investment risk. Cuneyt Akcora*, University of Texas at Dallas; Yulia Gel, The University of Texas at Dallas; Murat Kantarcioglu, UT Dallas; Asim Dey, University of Texas at Dallas
Information Propagation Trees for Protest Event Prediction(abstract, pdf)
JEFFERY ANSAH*, University of South Australia, UniSA, Data Analytics Group; Wei Kang, University of South Australia; Lin Liu, University of South Australia; Jixue Liu, University of South Australia; Jiuyong Li, University of South Australia
Protest event prediction using information propagation from social media is an important but challenging problem. Despite the plethora of research, the implicit relationship between social media information propagation and real-world protest events is unknown. Given some information propagating on social media, how can we tell if a protest event will occur? What features of information propagation are useful and how do these features contribute to a pending protest event? In this paper, we address these questions by presenting a novel formalized propagation tree model that captures relevant protest information propagating as precursors to protest events. We present a viewpoint of information propagation as trees which captures both temporal and structural aspects of information propagation. We construct and extract structural and temporal features daily from propagation trees. We develop a matching scheme that maps daily feature values to protest events. Finally, we build a robust prediction model that leverages propagation tree features for protest event prediction. Extensive experiments conducted on Twitter datasets across states in Australia show that our model outperforms existing state-of-the-art prediction models with an accuracy of up to 89% and F1-score of 0.84. We also provide insights on the interpretability of our features to real-world protest events.JEFFERY ANSAH*, University of South Australia, UniSA, Data Analytics Group; Wei Kang, University of South Australia; Lin Liu, University of South Australia; Jixue Liu, University of South Australia; Jiuyong Li, University of South Australia
Predictive Social Team Formation Analysis via Feature Representation Learning(abstract, pdf)
Lo-Pang-Yun Ting, National Cheng Kung University; Cheng-Te Li, National Cheng Kung University; Kun-Ta Chuang*, National Cheng Kung University
Team formation is to find a group of experts covering required skills and well collaborating together. Existing studies suffer from two defects: cannot afford flexible designation of team members and do not consider whether the formed team is truly adopted in practice. In this paper, we propose the \textit{Predictive Team Formation} (PTF) problem. PTF provides the flexibility of designated members and delivers the prediction-based formulation to compose the team. We propose two methods by learning the feature representations of experts based on \textit{node2vec} \cite{node2vec}. One is \textit{Biased-n2v} that models the topic bias of each expert in the social network. The other is \textit{Guided-n2v} that refines the transition probabilities between skills and experts to guide the random walk in a heterogeneous graph. Experiments conducted on DBLP and IMDb datasets exhibit that our methods can significantly outperform the state-of-the-art optimization-based approaches in terms of prediction accuracy. We also reveal that the designated members with tight social connections can lead to better performance.Lo-Pang-Yun Ting, National Cheng Kung University; Cheng-Te Li, National Cheng Kung University; Kun-Ta Chuang*, National Cheng Kung University
Leveraging Local Interactions for Geolocating Social Media Users(abstract, pdf)
Mohammad Ebrahimi*, University of New South Wales; Elaheh ShafieiBavani, University of New South Wales; Raymond Wong, University of New South Wales; Fang Chen, University of New South Wales
Predicting the geolocation of social media users is one of the core tasks in many applications, such as rapid disaster response, targeted advertisement, and recommending local events. In this paper, we introduce a new approach for user geolocation that unifies users’ social relationships, textual content, and metadata. Our two key contributions are as follows: (1) We leverage semantic similarity between users’ posts to predict their geographic proximity. To achieve this, we train and utilize a powerful word embedding model over millions of tweets. (2) To deal with isolated users in the social graph, we utilize a stacking-based learning approach to predict users’ locations based on their tweets’ textual content and metadata. Evaluation on three standard Twitter benchmark datasets shows that our approach outperforms state-of-the-art user geolocation methods.Mohammad Ebrahimi*, University of New South Wales; Elaheh ShafieiBavani, University of New South Wales; Raymond Wong, University of New South Wales; Fang Chen, University of New South Wales
Utilizing sequences of touch gestures for user verification on mobile devices(abstract, pdf)
Liron Ben Kimon*, Ben Gurion University; Yisroel Mirsky, Ben Gurion University; Lior Rokach, Ben Gurion University; Bracha Shapira, Ben-Gurion University of the Negev
Smartphones have become ubiquitous in our daily lives; they are used for a wide range of tasks and store increasing amounts of personal data. To minimize risk and prevent misuse of this data by unauthorized users, access must be restricted to verified users. Current classification-based methods for gesture-based user verification only consider single gestures, and not sequences. In this paper, we present a method which utilizes information from sequences of touchscreen gestures, and the context in which the gestures were made using only basic touch features. To evaluate our approach, we built an application which records all the necessary data from the device (touch and contextual sensors which do not consume significant battery life). Using XGBoost on the collected data, we were able to classify between a legitimate user and the population of illegitimate users (imposters) with an average equal error rate (EER) of 4.78% and an average area under the curve (AUC) of 98.15%. Our method demonstrates that by considering only basic touch features and utilizing sequences of gestures, as opposed to individual gestures, the accuracy of the verification process improves significantly.Liron Ben Kimon*, Ben Gurion University; Yisroel Mirsky, Ben Gurion University; Lior Rokach, Ben Gurion University; Bracha Shapira, Ben-Gurion University of the Negev