# ArXiv Weekly Radiostation：本周NLP、CV、ML精选论文30篇（3.22-3.28）

### ArXiv Weekly: 10 NLP Papers You May Want to Read

[NLP paper 1/10]

Why you may want to read this: Newest paper from Quoc V. Le (Research Scientist, Google Brain), Christopher D. Manning (Professor of Computer Science and Linguistics, Stanford University).

### ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.

[NLP paper 2/10]

Why you may want to read this: Newest paper from Eric Xing (Professor of Machine Learning, Language Technology, Computer Science, Cargenie Mellon …).

### PathVQA: 30000+ Questions for Medical Visual Question Answering.

Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, Pengtao Xie

Is it possible to develop an "AI Pathologist" to pass the board-certified examination of the American Board of Pathology? To achieve this goal, the first step is to create a visual question answering (VQA) dataset where the AI agent is presented with a pathology image together with a question and is asked to give the correct answer. Our work makes the first attempt to build such a dataset. Different from creating general-domain VQA datasets where the images are widely accessible and there are many crowdsourcing workers available and capable of generating question-answer pairs, developing a medical VQA dataset is much more challenging. First, due to privacy concerns, pathology images are usually not publicly available. Second, only well-trained pathologists can understand pathology images, but they barely have time to help create datasets for AI research. To address these challenges, we resort to pathology textbooks and online digital libraries. We develop a semi-automated pipeline to extract pathology images and captions from textbooks and generate question-answer pairs from captions using natural language processing. We collect 32,799 open-ended questions from 4,998 pathology images where each question is manually checked to ensure correctness. To our best knowledge, this is the first dataset for pathology VQA. Our dataset will be released publicly to promote research in medical VQA.

[NLP paper 3/10]

Why you may want to read this: Newest paper from Mohammed J. Zaki (Professor, Computer Science Department, RPI).

### A Framework for Generating Explanations from Temporal Personal Health Data.

Jonathan J. Harris, Ching-Hua Chen, Mohammed J. Zaki

Whereas it has become easier for individuals to track their personal health data (e.g., heart rate, step count, food log), there is still a wide chasm between the collection of data and the generation of meaningful explanations to help users better understand what their data means to them. With an increased comprehension of their data, users will be able to act upon the newfound information and work towards striving closer to their health goals. We aim to bridge the gap between data collection and explanation generation by mining the data for interesting behavioral findings that may provide hints about a user's tendencies. Our focus is on improving the explainability of temporal personal health data via a set of informative summary templates, or "protoforms." These protoforms span both evaluation-based summaries that help users evaluate their health goals and pattern-based summaries that explain their implicit behaviors. In addition to individual users, the protoforms we use are also designed for population-level summaries. We apply our approach to generate summaries (both univariate and multivariate) from real user data and show that our system can generate interesting and useful explanations.

[NLP paper 4/10]

Why you may want to read this: Newest paper from Maarten de Rijke (University of Amsterdam).

### TLDR: Token Loss Dynamic Reweighting for Reducing Repetitive Utterance Generation.

Shaojie Jiang, Thomas Wolf, Christof Monz, Maarten de Rijke

Natural Language Generation (NLG) models are prone to generating repetitive utterances. In this work, we study the repetition problem for encoder-decoder models, using both recurrent neural network (RNN) and transformer architectures. To this end, we consider the chit-chat task, where the problem is more prominent than in other tasks that need encoder-decoder architectures. We first study the influence of model architectures. By using pre-attention and highway connections for RNNs, we manage to achieve lower repetition rates. However, this method does not generalize to other models such as transformers. We hypothesize that the deeper reason is that in the training corpora, there are hard tokens that are more difficult for a generative model to learn than others and, once learning has finished, hard tokens are still under-learned, so that repetitive generations are more likely to happen. Based on this hypothesis, we propose token loss dynamic reweighting (TLDR) that applies differentiable weights to individual token losses. By using higher weights for hard tokens and lower weights for easy tokens, NLG models are able to learn individual tokens at different paces. Experiments on chit-chat benchmark datasets show that TLDR is more effective in repetition reduction for both RNN and transformer architectures than baselines using different weighting functions.

[NLP paper 5/10]

Why you may want to read this: Newest paper from Wei Liu (PhD Student of Computer Science, UNC Chapel Hill).

### E2EET: From Pipeline to End-to-end Entity Typing via Transformer-Based Embeddings.

Michael Stewart, Wei Liu

Entity Typing (ET) is the process of identifying the semantic types of every entity within a corpus. In contrast to Named Entity Recognition, where each token in a sentence is labelled with zero or one class label, ET involves labelling each entity mention with one or more class labels. Existing entity typing models, which operate at the mention level, are limited by two key factors: they do not make use of recently-proposed context-dependent embeddings, and are trained on fixed context windows. They are therefore sensitive to window size selection and are unable to incorporate the context of the entire document. In light of these drawbacks we propose to incorporate context using transformer-based embeddings for a mention-level model, and an end-to-end model using a Bi-GRU to remove the dependency on window size. An extensive ablative study demonstrates the effectiveness of contextualised embeddings for mention-level models and the competitiveness of our end-to-end model for entity typing.

[NLP paper 6/10]

Why you may want to read this: Newest paper from Meeyoung Cha (KAIST).

### BaitWatcher: A lightweight web interface for the detection of incongruent news headlines.

Kunwoo Park, Taegyun Kim, Seunghyun Yoon, Meeyoung Cha, Kyomin Jung

In digital environments where substantial amounts of information are shared online, news headlines play essential roles in the selection and diffusion of news articles. Some news articles attract audience attention by showing exaggerated or misleading headlines. This study addresses the \textit{headline incongruity} problem, in which a news headline makes claims that are either unrelated or opposite to the contents of the corresponding article. We present \textit{BaitWatcher}, which is a lightweight web interface that guides readers in estimating the likelihood of incongruence in news articles before clicking on the headlines. BaitWatcher utilizes a hierarchical recurrent encoder that efficiently learns complex textual representations of a news headline and its associated body text. For training the model, we construct a million scale dataset of news articles, which we also release for broader research use. Based on the results of a focus group interview, we discuss the importance of developing an interpretable AI agent for the design of a better interface for mitigating the effects of online misinformation.

[NLP paper 7/10]

Why you may want to read this: Newest paper from Marco Baroni (ICREA Professor (Barcelona) and FAIR Research Scientist (Paris)).

### Rat big, cat eaten! Ideas for a useful deep-agent protolanguage.

Marco Baroni

Deep-agent communities developing their own language-like communication protocol are a hot (or at least warm) topic in AI. Such agents could be very useful in machine-machine and human-machine interaction scenarios long before they have evolved a protocol as complex as human language. Here, I propose a small set of priorities we should focus on, if we want to get as fast as possible to a stage where deep agents speak a useful protolanguage.

[NLP paper 8/10]

Why you may want to read this: Newest paper from Jiajun Bu (Zhejiang University).

### Matching Text with Deep Mutual Information Estimation.

Xixi Zhou, Chengxi Li, Jiajun Bu, Chengwei Yao, Keyue Shi, Zhi Yu, Zhou Yu,  (2) University of Californi,  Davis

Text matching is a core natural language processing research problem. How to retain sufficient information on both content and structure information is one important challenge. In this paper, we present a neural approach for general-purpose text matching with deep mutual information estimation incorporated. Our approach, Text matching with Deep Info Max (TIM), is integrated with a procedure of unsupervised learning of representations by maximizing the mutual information between text matching neural network's input and output. We use both global and local mutual information to learn text representations. We evaluate our text matching approach on several tasks including natural language inference, paraphrase identification, and answer selection. Compared to the state-of-the-art approaches, the experiments show that our method integrated with mutual information estimation learns better text representation and achieves better experimental results of text matching tasks without exploiting pretraining on external data.

[NLP paper 9/10]

Why you may want to read this: Newest paper from Maosong Sun (Professor of Computer Science and Technology, Tsinghua University).

### Generating Major Types of Chinese Classical Poetry in a Uniformed Framework.

Jinyi Hu, Maosong Sun

Poetry generation is an interesting research topic in the field of text generation. As one of the most valuable literary and cultural heritages of China, Chinese classical poetry is very familiar and loved by Chinese people from generation to generation. It has many particular characteristics in its language structure, ranging from form, sound to meaning, thus is regarded as an ideal testing task for text generation. In this paper, we propose a GPT-2 based uniformed framework for generating major types of Chinese classical poems. We define a unified format for formulating all types of training samples by integrating detailed form information, then present a simple form-stressed weighting method in GPT-2 to strengthen the control to the form of the generated poems, with special emphasis on those forms with longer body length. Preliminary experimental results show this enhanced model can generate Chinese classical poems of major types with high quality in both form and content, validating the effectiveness of the proposed strategy. The model has been incorporated into Jiuge, the most influential Chinese classical poetry generation system developed by Tsinghua University (Guo et al., 2019).

[NLP paper 10/10]

Why you may want to read this: Newest paper from Marcus Liwicki (Luleå University of Technology, EISLAB, Machine Learning, Sweden).

### Word2Vec: Optimal Hyper-Parameters and Their Impact on NLP Downstream Tasks.

Tosin P. Adewumi, Foteini Liwicki, Marcus Liwicki

Word2Vec is a prominent tool for Natural Language Processing (NLP) tasks. Similar inspiration is found in distributed embeddings for state-of-the-art (sota) deep neural networks. However, wrong combination of hyper-parameters can produce poor quality vectors. The objective of this work is to show optimal combination of hyper-parameters exists and evaluate various combinations. We compare them with the original model released by Mikolov. Both intrinsic and extrinsic (downstream) evaluations, including Named Entity Recognition (NER) and Sentiment Analysis (SA) were carried out. The downstream tasks reveal that the best model is task-specific, high analogy scores don't necessarily correlate positively with F1 scores and the same applies for more data. Increasing vector dimension size after a point leads to poor quality or performance. If ethical considerations to save time, energy and the environment are made, then reasonably smaller corpora may do just as well or even better in some cases. Besides, using a small corpus, we obtain better human-assigned WordSim scores, corresponding Spearman correlation and better downstream (NER & SA) performance compared to Mikolov's model, trained on 100 billion word corpus.

### ArXiv Weekly: 10 CV Papers You May Want to Read

[CV paper 1/10]

Why you may want to read this: Newest paper from Andrew Zisserman (University of Oxford).

### Compact Deep Aggregation for Set Retrieval.

Yujie Zhong, Relja Arandjelović, Andrew Zisserman

The objective of this work is to learn a compact embedding of a set of descriptors that is suitable for efficient retrieval and ranking, whilst maintaining discriminability of the individual descriptors. We focus on a specific example of this general problem -- that of retrieving images containing multiple faces from a large scale dataset of images. Here the set consists of the face descriptors in each image, and given a query for multiple identities, the goal is then to retrieve, in order, images which contain all the identities, all but one, \etc

To this end, we make the following contributions: first, we propose a CNN architecture -- {\em SetNet} -- to achieve the objective: it learns face descriptors and their aggregation over a set to produce a compact fixed length descriptor designed for set retrieval, and the score of an image is a count of the number of identities that match the query; second, we show that this compact descriptor has minimal loss of discriminability up to two faces per image, and degrades slowly after that -- far exceeding a number of baselines; third, we explore the speed vs.\ retrieval quality trade-off for set retrieval using this compact descriptor; and, finally, we collect and annotate a large dataset of images containing various number of celebrities, which we use for evaluation and is publicly released.

[CV paper 2/10]

Why you may want to read this: Newest paper from Xiangyu Zhang (Research Leader, Megvii Technology), Jian Sun (Chief Scientist | Managing Director of Research, Megvii (Face++)).

### Detection in Crowded Scenes: One Proposal, Multiple Predictions.

Xuangeng Chu, Anlin Zheng, Xiangyu Zhang, Jian Sun

We propose a simple yet effective proposal-based object detector, aiming at detecting highly-overlapped instances in crowded scenes. The key of our approach is to let each proposal predict a set of correlated instances rather than a single one in previous proposal-based frameworks. Equipped with new techniques such as EMD Loss and Set NMS, our detector can effectively handle the difficulty of detecting highly overlapped objects. On a FPN-Res50 baseline, our detector can obtain 4.9\% AP gains on challenging CrowdHuman dataset and 1.0\% \text{MR}^{-2} improvements on CityPersons dataset, without bells and whistles. Moreover, on less crowed datasets like COCO, our approach can still achieve moderate improvement, suggesting the proposed method is robust to crowdedness. Code and pre-trained models will be released at https://github.com/megvii-model/CrowdDetection.

[CV paper 3/10]

Why you may want to read this: Newest paper from Xiangyu Zhang (Research Leader, Megvii Technology), Jian Sun (Chief Scientist | Managing Director of Research, Megvii (Face++)).

### Learning Dynamic Routing for Semantic Segmentation.

Yanwei Li, Lin Song, Yukang Chen, Zeming Li, Xiangyu Zhang, Xingang Wang, Jian Sun

Recently, numerous handcrafted and searched networks have been applied for semantic segmentation. However, previous works intend to handle inputs with various scales in pre-defined static architectures, such as FCN, U-Net, and DeepLab series. This paper studies a conceptually new method to alleviate the scale variance in semantic representation, named dynamic routing. The proposed framework generates data-dependent routes, adapting to the scale distribution of each image. To this end, a differentiable gating function, called soft conditional gate, is proposed to select scale transform paths on the fly. In addition, the computational cost can be further reduced in an end-to-end manner by giving budget constraints to the gating function. We further relax the network level routing space to support multi-path propagations and skip-connections in each forward, bringing substantial network capacity. To demonstrate the superiority of the dynamic property, we compare with several static architectures, which can be modeled as special cases in the routing space. Extensive experiments are conducted on Cityscapes and PASCAL VOC 2012 to illustrate the effectiveness of the dynamic framework. Code is available at https://github.com/yanwei-li/DynamicRouting.

[CV paper 4/10]

Why you may want to read this: Newest paper from Cordelia Schmid (Research director INRIA ), Julien Mairal (Inria - Grenoble Rhône-Alpes).

### Selecting Relevant Features from a Universal Representation for Few-shot Classification.

Nikita Dvornik, Cordelia Schmid, Julien Mairal

Popular approaches for few-shot classification consist of first learning a generic data representation based on a large annotated dataset, before adapting the representation to new classes given only a few labeled samples. In this work, we propose a new strategy based on feature selection, which is both simpler and more effective than previous feature adaptation approaches. First, we obtain a universal representation by training a set of semantically different feature extractors. Then, given a few-shot learning task, we use our universal feature bank to automatically select the most relevant representations. We show that a simple non-parametric classifier built on top of such features produces high accuracy and generalizes to domains never seen during training, which leads to state-of-the-art results on MetaDataset and improved accuracy on mini-ImageNet.

[CV paper 5/10]

Why you may want to read this: Newest paper from Piotr Dollár (Facebook AI Research), Kaiming He (Research Scientist, Facebook AI Research (FAIR)), Ross Girshick (Research Scientist, Facebook AI Research (FAIR)), Alan Yuille (Professor of Cognitive Science and Computer Science, Johns Hopkins University).

### Are Labels Necessary for Neural Architecture Search?.

Chenxi Liu, Piotr Dollár, Kaiming He, Ross Girshick, Alan Yuille, Saining Xie

Existing neural network architectures in computer vision --- whether designed by humans or by machines --- were typically found using both images and their associated labels. In this paper, we ask the question: can we find high-quality neural architectures using only images, but no human-annotated labels? To answer this question, we first define a new setup called Unsupervised Neural Architecture Search (UnNAS). We then conduct two sets of experiments. In sample-based experiments, we train a large number (500) of diverse architectures with either supervised or unsupervised objectives, and find that the architecture rankings produced with and without labels are highly correlated. In search-based experiments, we run a well-established NAS algorithm (DARTS) using various unsupervised objectives, and report that the architectures searched without labels can be competitive to their counterparts searched with labels. Together, these results reveal the potentially surprising finding that labels are not necessary, and the image statistics alone may be sufficient to identify good neural architectures.

[CV paper 6/10]

Why you may want to read this: Newest paper from Leonidas Guibas (Professor of Computer Science, Stanford University).

### Learning 3D Part Assembly from a Single Image.

Yichen Li, Kaichun Mo, Lin Shao, Minhyuk Sung, Leonidas Guibas

Autonomous assembly is a crucial capability for robots in many applications. For this task, several problems such as obstacle avoidance, motion planning, and actuator control have been extensively studied in robotics. However, when it comes to task specification, the space of possibilities remains underexplored. Towards this end, we introduce a novel problem, single-image-guided 3D part assembly, along with a learningbased solution. We study this problem in the setting of furniture assembly from a given complete set of parts and a single image depicting the entire assembled object. Multiple challenges exist in this setting, including handling ambiguity among parts (e.g., slats in a chair back and leg stretchers) and 3D pose prediction for parts and part subassemblies, whether visible or occluded. We address these issues by proposing a two-module pipeline that leverages strong 2D-3D correspondences and assembly-oriented graph message-passing to infer part relationships. In experiments with a PartNet-based synthetic benchmark, we demonstrate the effectiveness of our framework as compared with three baseline approaches.

[CV paper 7/10]

Why you may want to read this: Newest paper from Larry S. Davis (Professor of Computer Science, University of Maryland).

### DeepStrip: High Resolution Boundary Refinement.

Peng Zhou, Brian Price, Scott Cohen, Gregg Wilensky, Larry S. Davis

In this paper, we target refining the boundaries in high resolution images given low resolution masks. For memory and computation efficiency, we propose to convert the regions of interest into strip images and compute a boundary prediction in the strip domain. To detect the target boundary, we present a framework with two prediction layers. First, all potential boundaries are predicted as an initial prediction and then a selection layer is used to pick the target boundary and smooth the result. To encourage accurate prediction, a loss which measures the boundary distance in the strip domain is introduced. In addition, we enforce a matching consistency and C0 continuity regularization to the network to reduce false alarms. Extensive experiments on both public and a newly created high resolution dataset strongly validate our approach.

[CV paper 8/10]

Why you may want to read this: Newest paper from Matthew Brown (Research Scientist, Google), Ming-Hsuan Yang (University of California at Merced).

### Rethinking Class-Balanced Methods for Long-Tailed Visual Recognition from a Domain Adaptation Perspective.

Muhammad Abdullah Jamal, Matthew Brown, Ming-Hsuan Yang, Liqiang Wang, Boqing Gong

Object frequency in the real world often follows a power law, leading to a mismatch between datasets with long-tailed class distributions seen by a machine learning model and our expectation of the model to perform well on all classes. We analyze this mismatch from a domain adaptation point of view. First of all, we connect existing class-balanced methods for long-tailed classification to target shift, a well-studied scenario in domain adaptation. The connection reveals that these methods implicitly assume that the training data and test data share the same class-conditioned distribution, which does not hold in general and especially for the tail classes. While a head class could contain abundant and diverse training examples that well represent the expected data at inference time, the tail classes are often short of representative training data. To this end, we propose to augment the classic class-balanced learning by explicitly estimating the differences between the class-conditioned distributions with a meta-learning approach. We validate our approach with six benchmark datasets and three loss functions.

[CV paper 9/10]

Why you may want to read this: Newest paper from Dacheng Tao (The University of Sydney).

### Distillating Knowledge from Graph Convolutional Networks.

Yiding Yang, Jiayan Qiu, Mingli Song, Dacheng Tao, Xinchao Wang

Existing knowledge distillation methods focus on convolutional neural networks~(CNNs), where the input samples like images lie in a grid domain, and have largely overlooked graph convolutional networks~(GCN) that handle non-grid data. In this paper, we propose to our best knowledge the first dedicated approach to {distilling} knowledge from a pre-trained GCN model. To enable the knowledge transfer from the teacher GCN to the student, we propose a local structure preserving module that explicitly accounts for the topological semantics of the teacher. In this module, the local structure information from both the teacher and the student are extracted as distributions, and hence minimizing the distance between these distributions enables topology-aware knowledge transfer from the teacher, yielding a compact yet high-performance student model. Moreover, the proposed approach is readily extendable to dynamic graph models, where the input graphs for the teacher and the student may differ. We evaluate the proposed method on two different datasets using GCN models of different architectures, and demonstrate that our method achieves the state-of-the-art knowledge distillation performance for GCN models.

[CV paper 10/10]

Why you may want to read this: Newest paper from Ig-Jae Kim (KIST).

### Cylindrical Convolutional Networks for Joint Object Detection and Viewpoint Estimation.

Sunghun Joung, Seungryong Kim, Hanjae Kim, Minsu Kim, Ig-Jae Kim, Junghyun Cho, Kwanghoon Sohn

Existing techniques to encode spatial invariance within deep convolutional neural networks only model 2D transformation fields. This does not account for the fact that objects in a 2D space are a projection of 3D ones, and thus they have limited ability to severe object viewpoint changes. To overcome this limitation, we introduce a learnable module, cylindrical convolutional networks (CCNs), that exploit cylindrical representation of a convolutional kernel defined in the 3D space. CCNs extract a view-specific feature through a view-specific convolutional kernel to predict object category scores at each viewpoint. With the view-specific feature, we simultaneously determine objective category and viewpoints using the proposed sinusoidal soft-argmax module. Our experiments demonstrate the effectiveness of the cylindrical convolutional networks on joint object detection and viewpoint estimation.

### ArXiv Weekly: 10 ML Papers You May Want to Read

[ML paper 1/10]

Why you may want to read this: Newest paper from Vipin Kumar (University of Minnesota).

### Inverse Problems, Deep Learning, and Symmetry Breaking.

Kshitij Tayal, Chieh-Hsin Lai, Vipin Kumar, Ju Sun

In many physical systems, inputs related by intrinsic system symmetries are mapped to the same output. When inverting such systems, i.e., solving the associated inverse problems, there is no unique solution. This causes fundamental difficulties for deploying the emerging end-to-end deep learning approach. Using the generalized phase retrieval problem as an illustrative example, we show that careful symmetry breaking on the training data can help get rid of the difficulties and significantly improve the learning performance. We also extract and highlight the underlying mathematical principle of the proposed solution, which is directly applicable to other inverse problems.

[ML paper 2/10]

Why you may want to read this: Newest paper from Trevor Darrell (Professor of Computer Science, UC Berkeley), Marcus Rohrbach (Research Scientist, Facebook AI Research).

### Adversarial Continual Learning.

Sayna Ebrahimi, Franziska Meier, Roberto Calandra, Trevor Darrell, Marcus Rohrbach

Continual learning aims to learn new tasks without forgetting previously learned ones. We hypothesize that representations learned to solve each task in a sequence have a shared structure while containing some task-specific properties. We show that shared features are significantly less prone to forgetting and propose a novel hybrid continual learning framework that learns a disjoint representation for task-invariant and task-specific features required to solve a sequence of tasks. Our model combines architecture growth to prevent forgetting of task-specific skills and an experience replay approach to preserve shared skills. We demonstrate our hybrid approach is effective in avoiding forgetting and show it is superior to both architecture-based and memory-based approaches on class incrementally learning of a single dataset as well as a sequence of multiple datasets in image classification. Our code is available at \url{https://github.com/facebookresearch/Adversarial-Continual-Learning}.

[ML paper 3/10]

Why you may want to read this: Newest paper from Klaus-Robert Müller (Professor for Machine Learning, TU Berlin, Germany and Korea University, Seoul, Korea …).

### Automatic Identification of Types of Alterations in Historical Manuscripts.

David Lassner, Anne Baillot, Sergej Dogadov, Klaus-Robert Müller, Shinichi Nakajima

Alterations in historical manuscripts such as letters represent a promising field of research. On the one hand, they help understand the construction of text. On the other hand, topics that are being considered sensitive at the time of the manuscript gain coherence and contextuality when taking alterations into account, especially in the case of deletions. The analysis of alterations in manuscripts, though, is a traditionally very tedious work. In this paper, we present a machine learning-based approach to help categorize alterations in documents. In particular, we present a new probabilistic model (Alteration Latent Dirichlet Allocation, alterLDA in the following) that categorizes content-related alterations. The method proposed here is developed based on experiments carried out on the digital scholarly edition Berlin Intellectuals, for which alterLDA achieves high performance in the recognition of alterations on labelled data. On unlabelled data, applying alterLDA leads to interesting new insights into the alteration behavior of authors, editors and other manuscript contributors, as well as insights into sensitive topics in the correspondence of Berlin intellectuals around 1800. In addition to the findings based on the digital scholarly edition Berlin Intellectuals, we present a general framework for the analysis of text genesis that can be used in the context of other digital resources representing document variants. To that end, we present in detail the methodological steps that are to be followed in order to achieve such results, giving thereby a prime example of an Machine Learning application the Digital Humanities.

[ML paper 4/10]

Why you may want to read this: Newest paper from David Karger (Professor of Computer Science, MIT).

### ARDA: Automatic Relational Data Augmentation for Machine Learning.

Nadiia Chepurko, Ryan Marcus, Emanuel Zgraggen, Raul Castro Fernandez, Tim Kraska, David Karger

Automatic machine learning (\AML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmentation. Automatic data augmentation involves finding new features relevant to the user's predictive task with minimal human-in-the-loop'' involvement.

We present \system, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented data set such that training a predictive model on this augmented dataset results in improved performance. Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join. We perform an extensive empirical evaluation of different system components and benchmark our feature selection algorithm on real-world datasets.

[ML paper 5/10]

Why you may want to read this: Newest paper from Mubarak Shah (Trustee Chair Professor of Computer Science, University of Central Florida).

### iTAML: An Incremental Task-Agnostic Meta-learning Approach.

Jathushan Rajasegaran, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Mubarak Shah

Humans can continuously learn new knowledge as their experience grows. In contrast, previous learning in deep neural networks can quickly fade out when they are trained on a new task. In this paper, we hypothesize this problem can be avoided by learning a set of generalized parameters, that are neither specific to old nor new tasks. In this pursuit, we introduce a novel meta-learning approach that seeks to maintain an equilibrium between all the encountered tasks. This is ensured by a new meta-update rule which avoids catastrophic forgetting. In comparison to previous meta-learning techniques, our approach is task-agnostic. When presented with a continuum of data, our model automatically identifies the task and quickly adapts to it with just a single update. We perform extensive experiments on five datasets in a class-incremental setting, leading to significant improvements over the state of the art methods (e.g., a 21.3% boost on CIFAR100 with 10 incremental tasks). Specifically, on large-scale datasets that generally prove difficult cases for incremental learning, our approach delivers absolute gains as high as 19.1% and 7.4% on ImageNet and MS-Celeb datasets, respectively.

[ML paper 6/10]

Why you may want to read this: Newest paper from Eamonn J. Keogh (Professor of Computer Science, University of California - Riverside).

### FastDTW is approximate and Generally Slower than the Algorithm it Approximates.

Renjie Wu, Eamonn J. Keogh

Many time series data mining problems can be solved with repeated use of distance measure. Examples of such tasks include similarity search, clustering, classification, anomaly detection and segmentation. For over two decades it has been known that the Dynamic Time Warping (DTW) distance measure is the best measure to use for most tasks, in most domains. Because the classic DTW algorithm has quadratic time complexity, many ideas have been introduced to reduce its amortized time, or to quickly approximate it. One of the most cited approximate approaches is FastDTW. The FastDTW algorithm has well over a thousand citations and has been explicitly used in several hundred research efforts. In this work, we make a surprising claim. In any realistic data mining application, the approximate FastDTW is much slower than the exact DTW. This fact clearly has implications for the community that uses this algorithm: allowing it to address much larger datasets, get exact results, and do so in less time. Our observation also has a more sobering lesson for the community. This work may serve as a reminder to the community to exercise more caution in uncritically accepting published results.

[ML paper 7/10]

Why you may want to read this: Newest paper from Quoc V. Le (Research Scientist, Google Brain).

### Meta Pseudo Labels.

Hieu Pham, Qizhe Xie, Zihang Dai, Quoc V. Le

Many training algorithms of a deep neural network can be interpreted as minimizing the cross entropy loss between the prediction made by the network and a target distribution. In supervised learning, this target distribution is typically the ground-truth one-hot vector. In semi-supervised learning, this target distribution is typically generated by a pre-trained teacher model to train the main network. In this work, instead of using such predefined target distributions, we show that learning to adjust the target distribution based on the learning state of the main network can lead to better performances. In particular, we propose an efficient meta-learning algorithm, which encourages the teacher to adjust the target distributions of training examples in the manner that improves the learning of the main network. The teacher is updated by policy gradients computed by evaluating the main network on a held-out validation set.

Our experiments demonstrate substantial improvements over strong baselines and establish state-ofthe-art performance on CIFAR-10, SVHN, and ImageNet. For instance, with ResNets on small datasets, we achieve 96.1% on CIFAR-10 with 4,000 labeled examples and 73.9% top-1 on ImageNet with 10% examples. Meanwhile, with EfficientNet on full datasets plus extra unlabeled data, we attain 98.6% accuracy on CIFAR-10 and 86.9% top-1 accuracy on ImageNet.

[ML paper 8/10]

Why you may want to read this: Newest paper from Philip H. S. Torr (Professor, University of Oxford).

### Data Parallelism in Training Sparse Neural Networks.

Namhoon Lee, Philip H. S. Torr, Martin Jaggi

Network pruning is an effective methodology to compress large neural networks, and sparse neural networks obtained by pruning can benefit from their reduced memory and computational costs at use. Notably, recent advances have found that it is possible to find a trainable sparse neural network even at random initialization prior to training; hence the obtained sparse network only needs to be trained. While this approach of pruning at initialization turned out to be highly effective, little has been studied about the training aspects of these sparse neural networks. In this work, we focus on measuring the effects of data parallelism on training sparse neural networks. As a result, we find that the data parallelism in training sparse neural networks is no worse than that in training densely parameterized neural networks, despite the general difficulty of training sparse neural networks. When training sparse networks using SGD with momentum, the breakdown of the perfect scaling regime occurs even much later than the dense at large batch sizes.

[ML paper 9/10]

Why you may want to read this: Newest paper from Kyunghyun Cho (New York University, Facebook AI Research).

### Understanding the robustness of deep neural network classifiers for breast cancer screening.

Witold Oleszkiewicz, Taro Makino, Stanisław Jastrzębski, Tomasz Trzciński, Linda Moy, Kyunghyun Cho, Laura Heacock, Krzysztof J. Geras

Deep neural networks (DNNs) show promise in breast cancer screening, but their robustness to input perturbations must be better understood before they can be clinically implemented. There exists extensive literature on this subject in the context of natural images that can potentially be built upon. However, it cannot be assumed that conclusions about robustness will transfer from natural images to mammogram images, due to significant differences between the two image modalities. In order to determine whether conclusions will transfer, we measure the sensitivity of a radiologist-level screening mammogram image classifier to four commonly studied input perturbations that natural image classifiers are sensitive to. We find that mammogram image classifiers are also sensitive to these perturbations, which suggests that we can build on the existing literature. We also perform a detailed analysis on the effects of low-pass filtering, and find that it degrades the visibility of clinically meaningful features called microcalcifications. Since low-pass filtering removes semantically meaningful information that is predictive of breast cancer, we argue that it is undesirable for mammogram image classifiers to be invariant to it. This is in contrast to natural images, where we do not want DNNs to be sensitive to low-pass filtering due to its tendency to remove information that is human-incomprehensible.

[ML paper 10/10]

Why you may want to read this: Newest paper from Aki Vehtari (Associate Professor, Aalto University).

### Preferential Batch Bayesian Optimization.

Eero Siivola, Akash Kumar Dhaka, Michael Riis Andersen, Javier Gonzalez, Pablo Garcia Moreno, Aki Vehtari

Most research in Bayesian optimization (BO) has focused on direct feedback scenarios, where one has access to exact, or perturbed, values of some expensive-to-evaluate objective. This direction has been mainly driven by the use of BO in machine learning hyper-parameter configuration problems. However, in domains such as modelling human preferences, A/B tests or recommender systems, there is a need of methods that are able to replace direct feedback with preferential feedback, obtained via rankings or pairwise comparisons. In this work, we present Preferential Batch Bayesian Optimization (PBBO), a new framework that allows to find the optimum of a latent function of interest, given any type of parallel preferential feedback for a group of two or more points. We do so by using a Gaussian process model with a likelihood specially designed to enable parallel and efficient data collection mechanisms, which are key in modern machine learning. We show how the acquisitions developed under this framework generalize and augment previous approaches in Bayesian optimization, expanding the use of these techniques to a wider range of domains. An extensive simulation study shows the benefits of this approach, both with simulated functions and four real data sets.