Auto Byte

Science AI

### ArXiv Weekly: 10 NLP Papers You May Want to Read

[NLP paper 1/10]

Why you may want to read this: Newest paper from C.-C. Jay Kuo (Distinguished Professor of ECE and CS, University of Southern California).

### SBERT-WK: A Sentence Embedding Method by Dissecting BERT-based Word Models.

Bin Wang, C.-C. Jay Kuo

Sentence embedding is an important research topic in natural language processing (NLP) since it can transfer knowledge to downstream tasks. Meanwhile, a contextualized word representation, called BERT, achieves the state-of-the-art performance in quite a few NLP tasks. Yet, it is an open problem to generate a high quality sentence representation from BERT-based word models. It was shown in previous study that different layers of BERT capture different linguistic properties. This allows us to fusion information across layers to find better sentence representation. In this work, we study the layer-wise pattern of the word representation of deep contextualized models. Then, we propose a new sentence embedding method by dissecting BERT-based word models through geometric analysis of the space spanned by the word representation. It is called the SBERT-WK method. No further training is required in SBERT-WK. We evaluate SBERT-WK on semantic textual similarity and downstream supervised tasks. Furthermore, ten sentence-level probing tasks are presented for detailed linguistic analysis. Experiments show that SBERT-WK achieves the state-of-the-art performance. Our codes are publicly available.

[NLP paper 2/10]

### Scalable Neural Methods for Reasoning With a Symbolic Knowledge Base.

William W. Cohen, Haitian Sun, R. Alex Hofer, Matthew Siegler

We describe a novel way of representing a symbolic knowledge base (KB) called a sparse-matrix reified KB. This representation enables neural modules that are fully differentiable, faithful to the original semantics of the KB, expressive enough to model multi-hop inferences, and scalable enough to use with realistically large KBs. The sparse-matrix reified KB can be distributed across multiple GPUs, can scale to tens of millions of entities and facts, and is orders of magnitude faster than naive sparse-matrix implementations. The reified KB enables very simple end-to-end architectures to obtain competitive performance on several benchmarks representing two families of tasks: KB completion, and learning semantic parsers from denotations.

[NLP paper 3/10]

Why you may want to read this: Newest paper from Boi Faltings (Ecole Polytechnique Federale de Lausanne (EPFL)).

### GameWikiSum: a Novel Large Multi-Document Summarization Dataset.

Diego Antognini, Boi Faltings

Today's research progress in the field of multi-document summarization is obstructed by the small number of available datasets. Since the acquisition of reference summaries is costly, existing datasets contain only hundreds of samples at most, resulting in heavy reliance on hand-crafted features or necessitating additional, manually annotated data. The lack of large corpora therefore hinders the development of sophisticated models. Additionally, most publicly available multi-document summarization corpora are in the news domain, and no analogous dataset exists in the video game domain. In this paper, we propose GameWikiSum, a new domain-specific dataset for multi-document summarization, which is one hundred times larger than commonly used datasets, and in another domain than news. Input documents consist of long professional video game reviews as well as references of their gameplay sections in Wikipedia pages. We analyze the proposed dataset and show that both abstractive and extractive models can be trained on it. We release GameWikiSum for further research: https://github.com/Diego999/GameWikiSum.

[NLP paper 4/10]

Why you may want to read this: Newest paper from Boi Faltings (Ecole Polytechnique Federale de Lausanne (EPFL)).

### HotelRec: a Novel Very Large-Scale Hotel Recommendation Dataset.

Diego Antognini, Boi Faltings

Today, recommender systems are an inevitable part of everyone's daily digital routine and are present on most internet platforms. State-of-the-art deep learning-based models require a large number of data to achieve their best performance. Many datasets fulfilling this criterion have been proposed for multiple domains, such as Amazon products, restaurants, or beers. However, works and datasets in the hotel domain are limited: the largest hotel review dataset is below the million samples. Additionally, the hotel domain suffers from a higher data sparsity than traditional recommendation datasets and therefore, traditional collaborative-filtering approaches cannot be applied to such data. In this paper, we propose HotelRec, a very large-scale hotel recommendation dataset, based on TripAdvisor, containing 50 million reviews. To the best of our knowledge, HotelRec is the largest publicly available dataset in the hotel domain (50M versus 0.9M) and additionally, the largest recommendation dataset in a single domain and with textual reviews (50M versus 22M). We release HotelRec for further research: https://github.com/Diego999/HotelRec.

[NLP paper 5/10]

Why you may want to read this: Newest paper from Ali Farhadi (Associate Professor, Computer Science and Engineering, University of Washington).

### Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping.

Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, Noah Smith

Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing. This process, however, is often brittle: even with the same hyperparameter values, distinct random seeds can lead to substantially different results. To better understand this phenomenon, we experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds. We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials. Further, we examine two factors influenced by the choice of random seed: weight initialization and training data order. We find that both contribute comparably to the variance of out-of-sample performance, and that some weight initializations perform well across all tasks explored. On small datasets, we observe that many fine-tuning trials diverge part of the way through training, and we offer best practices for practitioners to stop training less promising runs early. We publicly release all of our experimental data, including training and validation scores for 2,100 trials, to encourage further analysis of training dynamics during fine-tuning.

[NLP paper 6/10]

Why you may want to read this: Newest paper from Jianfeng Gao (Microsoft Research, Redmond).

### The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding.

Xiaodong Liu, Yu Wang, Jianshu Ji, Hao Cheng, Xueyun Zhu, Emmanuel Awa, Pengcheng He, Weizhu Chen, Hoifung Poon, Guihong Cao, Jianfeng Gao

We present MT-DNN, an open-source natural language understanding (NLU) toolkit that makes it easy for researchers and developers to train customized deep learning models. Built upon PyTorch and Transformers, MT-DNN is designed to facilitate rapid customization for a broad spectrum of NLU tasks, using a variety of objectives (classification, regression, structured prediction) and text encoders (e.g., RNNs, BERT, RoBERTa, UniLM). A unique feature of MT-DNN is its built-in support for robust and transferable learning using the adversarial multi-task learning paradigm. To enable efficient production deployment, MT-DNN supports multi-task knowledge distillation, which can substantially compress a deep neural model without significant performance drop. We demonstrate the effectiveness of MT-DNN on a wide range of NLU applications across general and biomedical domains. The software and pre-trained models will be publicly available at https://github.com/namisan/mt-dnn.

[NLP paper 7/10]

Why you may want to read this: Newest paper from Srinivasan Parthasarathy (Professor of Computer Science and Engineering, The Ohio State University).

### Transfer Learning for Abstractive Summarization at Controllable Budgets.

Ritesh Sarkhel, Moniba Keymanesh, Arnab Nandi, Srinivasan Parthasarathy

Summarizing a document within an allocated budget while maintaining its major concepts is a challenging task. If the budget can take any arbitrary value and not known beforehand, it becomes even more difficult. Most of the existing methods for abstractive summarization, including state-of-the-art neural networks are data intensive. If the number of available training samples becomes limited, they fail to construct high-quality summaries. We propose MLS, an end-to-end framework to generate abstractive summaries with limited training data at arbitrary compression budgets. MLS employs a pair of supervised sequence-to-sequence networks. The first network called the \textit{MFS-Net} constructs a minimal feasible summary by identifying the key concepts of the input document. The second network called the Pointer-Magnifier then generates the final summary from the minimal feasible summary by leveraging an interpretable multi-headed attention model. Experiments on two cross-domain datasets show that MLS outperforms baseline methods over a range of success metrics including ROUGE and METEOR. We observed an improvement of approximately 4% in both metrics over the state-of-art convolutional network at lower budgets. Results from a human evaluation study also establish the effectiveness of MLS in generating complete coherent summaries at arbitrary compression budgets.

[NLP paper 8/10]

Why you may want to read this: Newest paper from Richard Dobson (Group lead and Professor in Bioinformatics, Kings College London).

### Identifying physical health comorbidities in a cohort of individuals with severe mental illness: An application of SemEHR.

Rebecca Bendayan, Honghan Wu, Zeljko Kraljevic, Robert Stewart, Tom Searle, Jaya Chaturvedi, Jayati Das-Munshi, Zina Ibrahim, Aurelie Mascio, Angus Roberts, Daniel Bean, Richard Dobson

Multimorbidity research in mental health services requires data from physical health conditions which is traditionally limited in mental health care electronic health records. In this study, we aimed to extract data from physical health conditions from clinical notes using SemEHR. Data was extracted from Clinical Record Interactive Search (CRIS) system at South London and Maudsley Biomedical Research Centre (SLaM BRC) and the cohort consisted of all individuals who had received a primary or secondary diagnosis of severe mental illness between 2007 and 2018. Three pairs of annotators annotated 2403 documents with an average Cohen's Kappa of 0.757. Results show that the NLP performance varies across different diseases areas (F1 0.601 - 0.954) suggesting that the language patterns or terminologies of different condition groups entail different technical challenges to the same NLP task.

[NLP paper 9/10]

Why you may want to read this: Newest paper from Kun Zhou (Professor of Computer Science, Zhejiang University), Ji-Rong Wen (Renmin University of China).

### Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting.

Kun Zhou, Wayne Xin Zhao, Yutao Zhu, Ji-Rong Wen, Jingsong Yu

Open-domain retrieval-based dialogue systems require a considerable amount of training data to learn their parameters. However, in practice, the negative samples of training data are usually selected from an unannotated conversation data set at random. The generated training data is likely to contain noise and affect the performance of the response selection models. To address this difficulty, we consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals and reduce the influence of noisy data. More specially, we consider a main-complementary task pair. The main task (\ie our focus) selects the correct response given the last utterance and context, and the complementary task selects the last utterance given the response and context. The key point is that the output of the complementary task is used to set instance weights for the main task. We conduct extensive experiments in two public datasets and obtain significant improvement in both datasets. We also investigate the variant of our approach in multiple aspects, and the results have verified the effectiveness of our approach.

[NLP paper 10/10]

Why you may want to read this: Newest paper from Haewoon Kwak (Senior Scientist, Qatar Computing Research Institute, Hamad Bin Khalifa University).

### FrameAxis: Characterizing Framing Bias and Intensity with Word Embedding.

Haewoon Kwak, Jisun An, Yong-Yeol Ahn

We propose FrameAxis, a method of characterizing the framing of a given text by identifying the most relevant semantic axes ("microframes") defined by antonym word pairs. In contrast to the traditional framing analysis, which has been constrained by a small number of manually annotated general frames, our unsupervised approach provides much more detailed insights, by considering a host of semantic axes. Our method is capable of quantitatively teasing out framing bias -- how biased a text is in each microframe -- and framing intensity -- how much each microframe is used -- from the text, offering a nuanced characterization of framing. We evaluate our approach using SemEval datasets as well as three other datasets and human evaluations, demonstrating that FrameAxis can reliably characterize documents with relevant microframes. Our method may allow scalable and nuanced computational analyses of framing across disciplines.

### ArXiv Weekly: 10 CV Papers You May Want to Read

[CV paper 1/10]

Why you may want to read this: Newest paper from Matthew A. Turk (President, TTIC), Rama Chellappa (Professor, ECE, Affiliate Professor, CS, UMD, College Park).

### Recognizing Families In the Wild (RFIW): The 4th Edition.

Joseph P. Robinson, Yu Yin, Zaid Khan, Ming Shao, Siyu Xia, Michael Stopa, Samson Timoner, Matthew A. Turk, Rama Chellappa, Yun Fu

Recognizing Families In the Wild (RFIW): an annual large-scale, multi-track automatic kinship recognition evaluation that supports various visual kin-based problems on scales much higher than ever before. Organized in conjunction with the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG) as a Challenge, RFIW provides a platform for publishing original work and the gathering of experts for a discussion of the next steps. This paper summarizes the supported tasks (i.e., kinship verification, tri-subject verification, and search & retrieval of missing children) in the evaluation protocols, which include the practical motivation, technical background, data splits, metrics, and benchmark results. Furthermore, top submissions (i.e., leader-board stats) are listed and reviewed as a high-level analysis on the state of the problem. In the end, the purpose of this paper is to describe the 2020 RFIW challenge, end-to-end, along with forecasts in promising future directions.

[CV paper 2/10]

Why you may want to read this: Newest paper from Witold Pedrycz (), MengChu Zhou (New Jersey Institute of Technology/King Abdulaziz).

### Residual-Sparse Fuzzy $C$-Means Clustering Incorporating Morphological Reconstruction and Wavelet frames.

Cong Wang, Witold Pedrycz, ZhiWu Li, MengChu Zhou, Jun Zhao

Instead of directly utilizing an observed image including some outliers, noise or intensity inhomogeneity, the use of its ideal value (e.g. noise-free image) has a favorable impact on clustering. Hence, the accurate estimation of the residual (e.g. unknown noise) between the observed image and its ideal value is an important task. To do so, we propose an \ell_0 regularization-based Fuzzy C-Means (FCM) algorithm incorporating a morphological reconstruction operation and a tight wavelet frame transform. To achieve a sound trade-off between detail preservation and noise suppression, morphological reconstruction is used to filter an observed image. By combining the observed and filtered images, a weighted sum image is generated. Since a tight wavelet frame system has sparse representations of an image, it is employed to decompose the weighted sum image, thus forming its corresponding feature set. Taking it as data for clustering, we present an improved FCM algorithm by imposing an \ell_0 regularization term on the residual between the feature set and its ideal value, which implies that the favorable estimation of the residual is obtained and the ideal value participates in clustering. Spatial information is also introduced into clustering since it is naturally encountered in image segmentation. Furthermore, it makes the estimation of the residual more reliable. To further enhance the segmentation effects of the improved FCM algorithm, we also employ the morphological reconstruction to smoothen the labels generated by clustering. Finally, based on the prototypes and smoothed labels, the segmented image is reconstructed by using a tight wavelet frame reconstruction operation. Experimental results reported for synthetic, medical, and color images show that the proposed algorithm is effective and efficient, and outperforms other algorithms.

[CV paper 3/10]

Why you may want to read this: Newest paper from Gareth Barker (Professor of Magnetic Resonance Physics, King's College London), Sebastian Ourselin (Professor of Healthcare Engineering, King's College London).

### Automated Labelling using an Attention model for Radiology reports of MRI scans (ALARM).

David A. Wood, Jeremy Lynch, Sina Kafiabadi, Emily Guilhem, Aisha Al Busaidi, Antanas Montvila, Thomas Varsavsky, Juveria Siddiqui, Naveen Gadapa, Matthew Townend, Martin Kiik, Keena Patel, Gareth Barker, Sebastian Ourselin, James H. Cole, Thomas C. Booth

Labelling large datasets for training high-capacity neural networks is a major obstacle to the development of deep learning-based medical imaging applications. Here we present a transformer-based network for magnetic resonance imaging (MRI) radiology report classification which automates this task by assigning image labels on the basis of free-text expert radiology reports. Our model's performance is comparable to that of an expert radiologist, and better than that of an expert physician, demonstrating the feasibility of this approach. We make code available online for researchers to label their own MRI datasets for medical imaging applications.

[CV paper 4/10]

Why you may want to read this: Newest paper from Ming-Hsuan Yang (University of California at Merced).

### Weakly-Supervised Semantic Segmentation by Iterative Affinity Learning.

Xiang Wang, Sifei Liu, Huimin Ma, Ming-Hsuan Yang

Weakly-supervised semantic segmentation is a challenging task as no pixel-wise label information is provided for training. Recent methods have exploited classification networks to localize objects by selecting regions with strong response. While such response map provides sparse information, however, there exist strong pairwise relations between pixels in natural images, which can be utilized to propagate the sparse map to a much denser one. In this paper, we propose an iterative algorithm to learn such pairwise relations, which consists of two branches, a unary segmentation network which learns the label probabilities for each pixel, and a pairwise affinity network which learns affinity matrix and refines the probability map generated from the unary network. The refined results by the pairwise network are then used as supervision to train the unary network, and the procedures are conducted iteratively to obtain better segmentation progressively. To learn reliable pixel affinity without accurate annotation, we also propose to mine confident regions. We show that iteratively training this framework is equivalent to optimizing an energy function with convergence to a local minimum. Experimental results on the PASCAL VOC 2012 and COCO datasets demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.

[CV paper 5/10]

Why you may want to read this: Newest paper from Ming-Ming Cheng (Professor, CS, Nankai University), Ming-Hsuan Yang (University of California at Merced).

### Model-Agnostic Structured Sparsification with Learnable Channel Shuffle.

Xin-Yu Zhang, Kai Zhao, Taihong Xiao, Ming-Ming Cheng, Ming-Hsuan Yang

Recent advances in convolutional neural networks (CNNs) usually come with the expense of considerable computational overhead and memory footprint. Network compression aims to alleviate this issue by training compact models with comparable performance. However, existing compression techniques either entail dedicated expert design or compromise with a moderate performance drop. To this end, we propose a model-agnostic structured sparsification method for efficient network compression. The proposed method automatically induces structurally sparse representations of the convolutional weights, thereby facilitating the implementation of the compressed model with the highly-optimized group convolution. We further address the problem of inter-group communication with a learnable channel shuffle mechanism. The proposed approach is model-agnostic and highly compressible with a negligible performance drop. Extensive experimental results and analysis demonstrate that our approach performs favorably against the state-of-the-art network pruning methods. The code will be publicly available after the review process.

[CV paper 6/10]

Why you may want to read this: Newest paper from Alan Yuille (Professor of Cognitive Science and Computer Science, Johns Hopkins University).

### When Radiology Report Generation Meets Knowledge Graph.

Yixiao Zhang, Xiaosong Wang, Ziyue Xu, Qihang Yu, Alan Yuille, Daguang Xu

[CV paper 7/10]

Why you may want to read this: Newest paper from Lei Zhang (Chair Professor, Dept. of Computing, The Hong Kong Polytechnic University).

### Directional Deep Embedding and Appearance Learning for Fast Video Object Segmentation.

Yingjie Yin, De Xu, Xingang Wang, Lei Zhang

Most recent semi-supervised video object segmentation (VOS) methods rely on fine-tuning deep convolutional neural networks online using the given mask of the first frame or predicted masks of subsequent frames. However, the online fine-tuning process is usually time-consuming, limiting the practical use of such methods. We propose a directional deep embedding and appearance learning (DDEAL) method, which is free of the online fine-tuning process, for fast VOS. First, a global directional matching module, which can be efficiently implemented by parallel convolutional operations, is proposed to learn a semantic pixel-wise embedding as an internal guidance. Second, an effective directional appearance model based statistics is proposed to represent the target and background on a spherical embedding space for VOS. Equipped with the global directional matching module and the directional appearance model learning module, DDEAL learns static cues from the labeled first frame and dynamically updates cues of the subsequent frames for object segmentation. Our method exhibits state-of-the-art VOS performance without using online fine-tuning. Specifically, it achieves a J & F mean score of 74.8% on DAVIS 2017 dataset and an overall score G of 71.3% on the large-scale YouTube-VOS dataset, while retaining a speed of 25 fps with a single NVIDIA TITAN Xp GPU. Furthermore, our faster version runs 31 fps with only a little accuracy loss. Our code and trained networks are available at https://github.com/YingjieYin/Directional-Deep-Embedding-and-Appearance-Learning-for-Fast-Video-Object-Segmentation.

[CV paper 8/10]

Why you may want to read this: Newest paper from Richard Socher (Chief Scientist at Salesforce), Steven C.H. Hoi (Managing Director of Salesforce Research Asia; IEEE Fellow; Assoc Professor at SMU).

### DivideMix: Learning with Noisy Labels as Semi-supervised Learning.

Junnan Li, Richard Socher, Steven C.H. Hoi

Deep neural networks are known to be annotation-hungry. Numerous efforts have been devoted to reducing the annotation cost when learning with deep networks. Two prominent directions include learning with noisy labels and semi-supervised learning by exploiting unlabeled data. In this work, we propose DivideMix, a novel framework for learning with noisy labels by leveraging semi-supervised learning techniques. In particular, DivideMix models the per-sample loss distribution with a mixture model to dynamically divide the training data into a labeled set with clean samples and an unlabeled set with noisy samples, and trains the model on both the labeled and unlabeled data in a semi-supervised manner. To avoid confirmation bias, we simultaneously train two diverged networks where each network uses the dataset division from the other network. During the semi-supervised training phase, we improve the MixMatch strategy by performing label co-refinement and label co-guessing on labeled and unlabeled samples, respectively. Experiments on multiple benchmark datasets demonstrate substantial improvements over state-of-the-art methods. Code is available at https://github.com/LiJunnan1992/DivideMix .

[CV paper 9/10]

Why you may want to read this: Newest paper from Kevin Bowyer (Schubmehl-Prein Family Professor of Computer Science and Engineering, University of …).

### Deep Learning-Based Feature Extraction in Iris Recognition: Use Existing Models, Fine-tune or Train From Scratch?.

Aidan Boyd, Adam Czajka, Kevin Bowyer

Modern deep learning techniques can be employed to generate effective feature extractors for the task of iris recognition. The question arises: should we train such structures from scratch on a relatively large iris image dataset, or it is better to fine-tune the existing models to adapt them to a new domain? In this work we explore five different sets of weights for the popular ResNet-50 architecture to find out whether iris-specific feature extractors perform better than models trained for non-iris tasks. Features are extracted from each convolutional layer and the classification accuracy achieved by a Support Vector Machine is measured on a dataset that is disjoint from the samples used in training of the ResNet-50 model. We show that the optimal training strategy is to fine-tune an off-the-shelf set of weights to the iris recognition domain. This approach results in greater accuracy than both off-the-shelf weights and a model trained from scratch. The winning, fine-tuned approach also shows an increase in performance when compared to previous work, in which only off-the-shelf (not fine-tuned) models were used in iris feature extraction. We make the best-performing ResNet-50 model, fine-tuned with more than 360,000 iris images, publicly available along with this paper.

[CV paper 10/10]

Why you may want to read this: Newest paper from Kyunghyun Cho (New York University, Facebook AI Research).

### An interpretable classifier for high-resolution breast cancer screening images utilizing weakly supervised localization.

Yiqiu Shen, Nan Wu, Jason Phang, Jungkyu Park, Kangning Liu, Sudarshini Tyagi, Laura Heacock, S. Gene Kim, Linda Moy, Kyunghyun Cho, Krzysztof J. Geras

Medical images differ from natural images in significantly higher resolutions and smaller regions of interest. Because of these differences, neural network architectures that work well for natural images might not be applicable to medical image analysis. In this work, we extend the globally-aware multiple instance classifier, a framework we proposed to address these unique properties of medical images. This model first uses a low-capacity, yet memory-efficient, network on the whole image to identify the most informative regions. It then applies another higher-capacity network to collect details from chosen regions. Finally, it employs a fusion module that aggregates global and local information to make a final prediction. While existing methods often require lesion segmentation during training, our model is trained with only image-level labels and can generate pixel-level saliency maps indicating possible malignant findings. We apply the model to screening mammography interpretation: predicting the presence or absence of benign and malignant lesions. On the NYU Breast Cancer Screening Dataset, consisting of more than one million images, our model achieves an AUC of 0.93 in classifying breasts with malignant findings, outperforming ResNet-34 and Faster R-CNN. Compared to ResNet-34, our model is 4.1x faster for inference while using 78.4% less GPU memory. Furthermore, we demonstrate, in a reader study, that our model surpasses radiologist-level AUC by a margin of 0.11. The proposed model is available online: https://github.com/nyukat/GMIC.

### ArXiv Weekly: 10 ML Papers You May Want to Read

[ML paper 1/10]

Why you may want to read this: Newest paper from Kyle Cranmer (New York University), Eilam Gross (Professor pf Physics, Weizmann Institute of Science).

### Set2Graph: Learning Graphs From Sets.

Hadar Serviansky, Nimrod Segol, Jonathan Shlomi, Kyle Cranmer, Eilam Gross, Haggai Maron, Yaron Lipman

Many problems in machine learning (ML) can be cast as learning functions from sets to graphs, or more generally to hypergraphs; in short, Set2Graph functions. Examples include clustering, learning vertex and edge features on graphs, and learning triplet data in a collection. Current neural network models that approximate Set2Graph functions come from two main ML sub-fields: equivariant learning, and similarity learning. Equivariant models would be in general computationally challenging or even infeasible, while similarity learning models can be shown to have limited expressive power. In this paper we suggest a neural network model family for learning Set2Graph functions that is both practical and of maximal expressive power (universal), that is, can approximate arbitrary continuous Set2Graph functions over compact sets. Testing our models on different machine learning tasks, including an application to particle physics, we find them favorable to existing baselines.

[ML paper 2/10]

Why you may want to read this: Newest paper from Bernhard Schölkopf (Director, Max Planck Institute for Intelligent Systems; and Distinguished Amazon Scholar).

### Algorithmic Recourse: from Counterfactual Explanations to Interventions.

Amir-Hossein Karimi, Bernhard Schölkopf, Isabel Valera

As machine learning is increasingly used to inform consequential decision-making (e.g., pre-trial bail and loan approval), it becomes important to explain how the system arrived at its decision, and also suggest actions to achieve a favorable decision. Counterfactual explanations -- "how the world would have (had) to be different for a desirable outcome to occur" -- aim to satisfy these criteria. Existing works have primarily focused on designing algorithms to obtain counterfactual explanations for a wide range of settings. However, one of the main objectives of "explanations as a means to help a data-subject act rather than merely understand" has been overlooked. In layman's terms, counterfactual explanations inform an individual where they need to get to, but not how to get there. In this work, we rely on causal reasoning to caution against the use of counterfactual explanations as a recommendable set of actions for recourse. Instead, we propose a shift of paradigm from recourse via nearest counterfactual explanations to recourse through minimal interventions, moving the focus from explanations to recommendations. Finally, we provide the reader with an extensive discussion on how to realistically achieve recourse beyond structural interventions.

[ML paper 3/10]

Why you may want to read this: Newest paper from Wolfram Burgard (Professor of Computer Science, University of Freiburg).

### SYMOG: learning symmetric mixture of Gaussian modes for improved fixed-point quantization.

Lukas Enderich, Fabian Timm, Wolfram Burgard

Deep neural networks (DNNs) have been proven to outperform classical methods on several machine learning benchmarks. However, they have high computational complexity and require powerful processing units. Especially when deployed on embedded systems, model size and inference time must be significantly reduced. We propose SYMOG (symmetric mixture of Gaussian modes), which significantly decreases the complexity of DNNs through low-bit fixed-point quantization. SYMOG is a novel soft quantization method such that the learning task and the quantization are solved simultaneously. During training the weight distribution changes from an unimodal Gaussian distribution to a symmetric mixture of Gaussians, where each mean value belongs to a particular fixed-point mode. We evaluate our approach with different architectures (LeNet5, VGG7, VGG11, DenseNet) on common benchmark data sets (MNIST, CIFAR-10, CIFAR-100) and we compare with state-of-the-art quantization approaches. We achieve excellent results and outperform 2-bit state-of-the-art performance with an error rate of only 5.71% on CIFAR-10 and 27.65% on CIFAR-100.

[ML paper 4/10]

Why you may want to read this: Newest paper from Ruslan Salakhutdinov (Associate Professor, Machine Learning Department, CMU), Louis-Philippe Morency (Associate professor,  Carnegie Mellon University).

### Learning Not to Learn in the Presence of Noisy Labels.

Liu Ziyin, Blair Chen, Ru Wang, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency, Masahito Ueda

Learning in the presence of label noise is a challenging yet important task: it is crucial to design models that are robust in the presence of mislabeled datasets. In this paper, we discover that a new class of loss functions called the gambler's loss provides strong robustness to label noise across various levels of corruption. We show that training with this loss function encourages the model to "abstain" from learning on the data points with noisy labels, resulting in a simple and effective method to improve robustness and generalization. In addition, we propose two practical extensions of the method: 1) an analytical early stopping criterion to approximately stop training before the memorization of noisy labels, as well as 2) a heuristic for setting hyperparameters which do not require knowledge of the noise corruption rate. We demonstrate the effectiveness of our method by achieving strong results across three image and text classification tasks as compared to existing baselines.

[ML paper 5/10]

Why you may want to read this: Newest paper from Jure Leskovec (Professor of Computer Science, Stanford University).

### Unifying Graph Convolutional Neural Networks and Label Propagation.

Hongwei Wang, Jure Leskovec

Label Propagation (LPA) and Graph Convolutional Neural Networks (GCN) are both message passing algorithms on graphs. Both solve the task of node classification but LPA propagates node label information across the edges of the graph, while GCN propagates and transforms node feature information. However, while conceptually similar, theoretical relation between LPA and GCN has not yet been investigated. Here we study the relationship between LPA and GCN in terms of two aspects: (1) feature/label smoothing where we analyze how the feature/label of one node is spread over its neighbors; And, (2) feature/label influence of how much the initial feature/label of one node influences the final feature/label of another node. Based on our theoretical analysis, we propose an end-to-end model that unifies GCN and LPA for node classification. In our unified model, edge weights are learnable, and the LPA serves as regularization to assist the GCN in learning proper edge weights that lead to improved classification performance. Our model can also be seen as learning attention weights based on node labels, which is more task-oriented than existing feature-based attention models. In a number of experiments on real-world graphs, our model shows superiority over state-of-the-art GCN-based methods in terms of node classification accuracy.

[ML paper 6/10]

Why you may want to read this: Newest paper from Jure Leskovec (Professor of Computer Science, Stanford University).

### Entity Context and Relational Paths for Knowledge Graph Completion.

Hongwei Wang, Hongyu Ren, Jure Leskovec

Knowledge graph completion aims to predict missing relations between entities in a knowledge graph. While many different methods have been proposed, there is a lack of a unifying framework that would lead to state-of-the-art results. Here we develop PathCon, a knowledge graph completion method that harnesses four novel insights to outperform existing methods. PathCon predicts relations between a pair of entities by: (1) Considering the Relational Context of each entity by capturing the relation types adjacent to the entity and modeled through a novel edge-based message passing scheme; (2) Considering the Relational Paths capturing all paths between the two entities; And, (3) adaptively integrating the Relational Context and Relational Path through a learnable attention mechanism. Importantly, (4) in contrast to conventional node-based representations, PathCon represents context and path only using the relation types, which makes it applicable in an inductive setting. Experimental results on knowledge graph benchmarks as well as our newly proposed dataset show that PathCon outperforms state-of-the-art knowledge graph completion methods by a large margin. Finally, PathCon is able to provide interpretable explanations by identifying relations that provide the context and paths that are important for a given predicted relation.

[ML paper 7/10]

Why you may want to read this: Newest paper from Jure Leskovec (Professor of Computer Science, Stanford University).

### Query2box: Reasoning over Knowledge Graphs in Vector Space using Box Embeddings.

Hongyu Ren, Weihua Hu, Jure Leskovec

Answering complex logical queries on large-scale incomplete knowledge graphs (KGs) is a fundamental yet challenging task. Recently, a promising approach to this problem has been to embed KG entities as well as the query into a vector space such that entities that answer the query are embedded close to the query. However, prior work models queries as single points in the vector space, which is problematic because a complex query represents a potentially large set of its answer entities, but it is unclear how such a set can be represented as a single point. Furthermore, prior work can only handle queries that use conjunctions (\wedge) and existential quantifiers (\exists). Handling queries with logical disjunctions (\vee) remains an open problem. Here we propose query2box, an embedding-based framework for reasoning over arbitrary queries with \wedge, \vee, and \exists operators in massive and incomplete KGs. Our main insight is that queries can be embedded as boxes (i.e., hyper-rectangles), where a set of points inside the box corresponds to a set of answer entities of the query. We show that conjunctions can be naturally represented as intersections of boxes and also prove a negative result that handling disjunctions would require embedding with dimension proportional to the number of KG entities. However, we show that by transforming queries into a Disjunctive Normal Form, query2box is capable of handling arbitrary logical queries with \wedge, \vee, \exists in a scalable manner. We demonstrate the effectiveness of query2box on three large KGs and show that query2box achieves up to 25% relative improvement over the state of the art.

[ML paper 8/10]

Why you may want to read this: Newest paper from Corinna Cortes (Google Research, NY), Mehryar Mohri (Professor of Computer Science and Mathematics, Courant Institute of Mathematical …).

Corinna Cortes, Giulia DeSalvo, Claudio Gentile, Mehryar Mohri, Ningshan Zhang

We present a new active learning algorithm that adaptively partitions the input space into a finite number of regions, and subsequently seeks a distinct predictor for each region, both phases actively requesting labels. We prove theoretical guarantees for both the generalization error and the label complexity of our algorithm, and analyze the number of regions defined by the algorithm under some mild assumptions. We also report the results of an extensive suite of experiments on several real-world datasets demonstrating substantial empirical benefits over existing single-region and non-adaptive region-based active learning baselines.

[ML paper 9/10]

Why you may want to read this: Newest paper from Tom Mitchell (E. Fredkin University Professor of Machine Learning, Carnegie Mellon University).

### Jelly Bean World: A Testbed for Never-Ending Learning.

Emmanouil Antonios Platanios, Abulhair Saparov, Tom Mitchell

Machine learning has shown growing success in recent years. However, current machine learning systems are highly specialized, trained for particular problems or domains, and typically on a single narrow dataset. Human learning, on the other hand, is highly general and adaptable. Never-ending learning is a machine learning paradigm that aims to bridge this gap, with the goal of encouraging researchers to design machine learning systems that can learn to perform a wider variety of inter-related tasks in more complex environments. To date, there is no environment or testbed to facilitate the development and evaluation of never-ending learning systems. To this end, we propose the Jelly Bean World testbed. The Jelly Bean World allows experimentation over two-dimensional grid worlds which are filled with items and in which agents can navigate. This testbed provides environments that are sufficiently complex and where more generally intelligent algorithms ought to perform better than current state-of-the-art reinforcement learning approaches. It does so by producing non-stationary environments and facilitating experimentation with multi-task, multi-agent, multi-modal, and curriculum learning settings. We hope that this new freely-available software will prompt new research and interest in the development and evaluation of never-ending learning systems and more broadly, general intelligence systems.

[ML paper 10/10]

Why you may want to read this: Newest paper from Jimmy Ba (University of Toronto).

### BatchEnsemble: an Alternative Approach to Efficient Ensemble and Lifelong Learning.

Yeming Wen, Dustin Tran, Jimmy Ba

Ensembles, where multiple neural networks are trained individually and their predictions are averaged, have been shown to be widely successful for improving both the accuracy and predictive uncertainty of single neural networks. However, an ensemble's cost for both training and testing increases linearly with the number of networks, which quickly becomes untenable. In this paper, we propose BatchEnsemble, an ensemble method whose computational and memory costs are significantly lower than typical ensembles. BatchEnsemble achieves this by defining each weight matrix to be the Hadamard product of a shared weight among all ensemble members and a rank-one matrix per member. Unlike ensembles, BatchEnsemble is not only parallelizable across devices, where one device trains one member, but also parallelizable within a device, where multiple ensemble members are updated simultaneously for a given mini-batch. Across CIFAR-10, CIFAR-100, WMT14 EN-DE/EN-FR translation, and out-of-distribution tasks, BatchEnsemble yields competitive accuracy and uncertainties as typical ensembles; the speedup at test time is 3X and memory reduction is 3X at an ensemble of size 4. We also apply BatchEnsemble to lifelong learning, where on Split-CIFAR-100, BatchEnsemble yields comparable performance to progressive neural networks while having a much lower computational and memory costs. We further show that BatchEnsemble can easily scale up to lifelong learning on Split-ImageNet which involves 100 sequential learning tasks.