Data-driven Strategies for NMT (2/8/2022)
Lecture (by Graham Neubig):
- Data augmentation strategies
Language in 10: Bengali
Slides: MT Data Augmentation Slides
Discussion: Read one of the cited papers on data augmentation
- Reference: Data Augmentation for Low-Resource Neural Machine Translation (Fadaee et al. 2017)
- Reference: Handling Syntactic Divergence in Low-resource Machine Translation (Zhou et al. 2019)
- Reference: Generalized Data Augmentation for Low-resource Translation (Xia et al. 2019)
References:
- Tool: GIZA++
- Tool: fast_align
- Tool: awesome-align
- Reference: Generalized Data Augmentation for Low-resource Translation (Xia et al. 2019)
- Reference: Improving Neural Machine Translation Models with Monolingual Data (Sennrich et al. 2016)
- Reference: Understanding Back-Translation at Scale (Edunov et al. 2018)
- Reference: Iterative Back-Translation for Neural Machine Translation (Hoang et al. 2018)
- Reference: Meta Back-translation (Pham et al. 2021)
- Reference: Copied Monolingual Data Improves Low-Resource Neural Machine Translation (Currey et al. 2017)
- Reference: Data Augmentation for Low-Resource Neural Machine Translation (Fadaee et al. 2017)
- Reference: Unsupervised Machine Translation Using Monolingual Corpora Only (Lample et al. 2018)
- Reference: Handling Syntactic Divergence in Low-resource Machine Translation (Zhou et al. 2019)
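Several of the references above build on back-translation (Sennrich et al. 2016; Edunov et al. 2018; Hoang et al. 2018): target-side monolingual text is translated into the source language by a reverse model, and the synthetic source sentences are paired with the real targets to augment the parallel training data. A minimal sketch of the pipeline, where `reverse_model` is a placeholder for a trained target-to-source translation system:

```python
def back_translate(monolingual_tgt, reverse_model):
    """Create synthetic (source, target) pairs from target-side monolingual text.

    reverse_model: a callable mapping a target sentence to a source-language
    hypothesis (in practice, a trained tgt->src NMT system; here a stub).
    """
    synthetic_pairs = []
    for tgt_sentence in monolingual_tgt:
        src_hypothesis = reverse_model(tgt_sentence)  # tgt -> src translation
        # Pair the (possibly noisy) synthetic source with the clean real target.
        synthetic_pairs.append((src_hypothesis, tgt_sentence))
    return synthetic_pairs


def augment(parallel_data, monolingual_tgt, reverse_model):
    """Mix real parallel pairs with synthetic back-translated pairs."""
    return parallel_data + back_translate(monolingual_tgt, reverse_model)
```

For illustration only, a toy "reverse model" that reverses word order stands in for a real translation system:

```python
toy_reverse = lambda s: " ".join(reversed(s.split()))
augment([("a house", "ein Haus")], ["das Buch"], toy_reverse)
# one real pair plus one synthetic pair whose target side is real
```

The key property, emphasized in Sennrich et al. (2016), is that the target side of every synthetic pair is genuine monolingual text, so the decoder always trains on clean output even when the synthetic source is noisy.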