Quality Estimation and Adaptively Sparse Transformers -- João Graça and André Martins (11/19/2019)

This will be a talk by João Graça and André Martins of Unbabel.

Abstract:
In the first part of this talk, the CTO and founder of Unbabel, João Graça, will give an overview of the Unbabel translation pipeline, which combines MT and humans to achieve perfect-quality translation with increasing speed and quality and decreasing human effort. We will also present OpenKiwi, our open-source toolkit for translation quality estimation, which won a system demonstration award at ACL 2019 and obtained the best results in the WMT 2019 Quality Estimation shared task.

In the second part of the talk, André Martins, VP of AI Research, will describe recent work (ACL 2019 and EMNLP 2019) on sparse sequence-to-sequence models and adaptively sparse Transformers. Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention, with the multiple heads learning diverse types of word relationships. However, with standard softmax attention, all attention heads are dense, assigning non-zero weight to every context word. In this part of the talk, we will introduce the adaptively sparse Transformer, in which attention heads have flexible, context-dependent sparsity patterns. This sparsity is accomplished by replacing softmax with alpha-entmax, a differentiable generalization of softmax that allows low-scoring words to receive exactly zero weight. Moreover, we derive a method to automatically learn the alpha parameter, which controls the shape and sparsity of alpha-entmax, allowing attention heads to choose between focused and spread-out behavior. Our adaptively sparse Transformer improves interpretability and head diversity compared to softmax Transformers on machine translation datasets. Quantitative and qualitative analysis of our approach shows that heads in different layers learn different sparsity preferences and that their attention distributions are more diverse than those of softmax Transformers. Furthermore, at no cost in accuracy, sparsity in attention heads helps to uncover different head specializations. Joint work with Fábio Kepler, Gonçalo Correia, Ben Peters, Vlad Niculae, and the rest of the Unbabel AI team.
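For readers curious about how an attention weight can be exactly zero, the snippet below is a minimal NumPy sketch of sparsemax, the alpha = 2 special case of alpha-entmax. It is illustrative only, not the speakers' implementation: the adaptively sparse Transformer described in the talk uses general alpha-entmax with a learned, per-head alpha.

```python
# Minimal sketch of sparsemax (the alpha = 2 case of alpha-entmax).
# Unlike softmax, it can assign exactly zero probability to low-scoring entries.
import numpy as np

def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex."""
    z_sorted = np.sort(z)[::-1]             # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum     # entries kept in the support
    k_z = k[support][-1]                    # size of the support
    tau = (cumsum[support][-1] - 1) / k_z   # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)

scores = np.array([2.0, 1.5, 0.1, -1.0])
print(sparsemax(scores))                        # -> [0.75, 0.25, 0.0, 0.0]
print(np.exp(scores) / np.exp(scores).sum())    # softmax: every weight non-zero
```

With these example scores, sparsemax keeps only the two highest-scoring words and gives the rest exactly zero weight, whereas softmax spreads some probability mass over every word.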
