Course Schedule
Introduction
8/29 Class Introduction
Content:
- Introduction to Neural Networks
- Example Tasks and Their Difficulties
- What Neural Nets can Do To Help
Reading Material
- Highly Recommended: Goldberg Book Chapters 1-5 (this is a lot to read, but it covers basic concepts in neural networks that many people in the class may have covered already. If you're already familiar with neural nets, skim it. If not, please read carefully and ask the TAs/instructor lots of questions.)
- Reference: Deep Unordered Composition. (Iyyer et al.)
Slides: Class Intro Slides
Sample Code: Class Intro Code Examples
Lecture Video: Class Intro Lecture Video
8/31 A Simple (?) Exercise: Predicting the Next Word in a Sentence
Content:
- Computational Graphs
- Feed-forward Neural Network Language Models
- Measuring Model Performance: Likelihood and Perplexity
Reading Material
- Highly Recommended: Goldberg Book Chapters 8-9
- Reference: Goldberg Book Chapters 6-7 (because CS11-711 is a pre-requisite, I will assume you know most of this already, but it might be worth browsing for terminology, etc.)
- Reference: Maximum entropy (log-linear) language models. (Rosenfeld 1996)
- Reference: A Neural Probabilistic Language Model. (Bengio et al. 2003, JMLR)
- Reference: An Overview of Gradient Descent Algorithms. (Ruder 2016)
- Reference: The Marginal Value of Adaptive Gradient Methods. (Wilson et al. 2017)
- Reference: Stronger Baselines for Neural MT. (Denkowski and Neubig 2017)
- Reference: Using the Output Embedding. (Press and Wolf 2016)
Slides: LM Slides
Sample Code: LM Code Examples
Lecture Video: LM Lecture Video
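For concreteness, a minimal sketch (separate from the linked LM Code Examples) of how perplexity relates to average negative log-likelihood; the unigram probabilities and toy corpus below are made up purely for illustration.

```python
import math

# Toy unigram "language model"; the probabilities are made up for illustration.
unigram_probs = {"the": 0.5, "cat": 0.2, "sat": 0.2, "<unk>": 0.1}

def perplexity(tokens, probs):
    """Perplexity = exp(average negative log-likelihood per token)."""
    nll = sum(-math.log(probs.get(tok, probs["<unk>"])) for tok in tokens)
    return math.exp(nll / len(tokens))

corpus = "the cat sat on the cat".split()
print(perplexity(corpus, unigram_probs))  # lower is better
```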
Section 1: Models of Words
9/5 Distributional Semantics and Word Vectors
Content:
- Describing a word by the company that it keeps
- Counting and predicting
- Skip-grams and CBOW
- Evaluating/Visualizing Word Vectors
- Advanced Methods for Word Vectors
Reading Material
- Required Reading (for quiz): Goldberg Book Chapters 10-11
- Reference: WordNet
- Reference: Linguistic Regularities in Continuous Representations (Mikolov et al. 2013)
- Reference: t-SNE (van der Maaten and Hinton 2008)
- Reference: Visualizing w/ PCA vs. t-SNE (Derksen 2016)
- Reference: How to use t-SNE effectively (Wattenberg et al. 2016)
- Reference: Evaluating Word Embeddings (Schnabel et al. 2015)
- Reference: Morphology-based Embeddings (Luong et al. 2013)
- Reference: Character-based Embeddings (Ling et al. 2015)
- Reference: Subword-based Embeddings (Bojanowski et al. 2017)
- Reference: Multi-prototype Embeddings (Reisinger and Mooney 2010)
- Reference: Non-parametric Multi-prototype Embeddings (Neelakantan et al. 2014)
- Reference: Cross-lingual Embeddings (Faruqui et al. 2014)
- Reference: Retrofitting to Lexicons (Faruqui et al. 2015)
- Reference: Sparse Word Embeddings (Murphy et al. 2012)
- Reference: De-biasing Word Embeddings (Bolukbasi et al. 2016)
- Softmax Approximations: Negative Sampling, Hierarchical Softmax
- Parallel Training
- Tips for Training on GPUs
- Highly Recommended Reading: Notes on Noise Contrastive Estimation and Negative Sampling (Dyer 2014)
- Reference: Importance Sampling (Bengio and Senécal 2003)
- Reference: Noise Contrastive Estimation (Mnih and Teh 2012)
- Reference: Negative Sampling (Goldberg and Levy 2014)
- Reference: Mini-batching Sampling-based Softmax Approximations (Zoph et al. 2015)
- Reference: Class-based Softmax (Goodman 2001)
- Reference: Hierarchical Softmax (Morin and Bengio 2005)
- Reference: Error Correcting Codes (Dietterich and Bakiri 1995)
- Reference: Binary Code Prediction for Language (Oda et al. 2017)
- Bag of Words, Bag of n-grams, and Convolution
- Applications of Convolution: Context Windows and Sentence Modeling
- Stacked and Dilated Convolutions
- Structured Convolution
- Convolutional Models of Sentence Pairs
- Visualization for CNNs
- Required Reading (for quiz): Goldberg Book Chapter 13
- Reference: Time Delay Neural Networks (Waibel et al. 1989)
- Reference: Convolutional Neural Networks (LeCun et al. 1998)
- Reference: CNNs for Text (Collobert and Weston 2011)
- Reference: CNN for Modeling Sentences (Kalchbrenner et al. 2014)
- Reference: CNNs for Sentence Classification (Kim 2014)
- Reference: Dilated CNNs for Language Modeling (Kalchbrenner et al. 2016)
- Reference: Tree Convolution (Ma et al. 2015)
- Reference: Graph Convolution for Text (Marcheggiani and Titov 2017)
- Reference: Siamese Networks (Bromley et al. 1993)
- Reference: Convolutional Matching Model (Hu et al. 2014)
- Reference: Convolution + Sentence Pair Pooling (Yin and Schutze 2015)
- Reference: Understanding ConvNets (Karpathy 2016)
- Recurrent Networks
- Vanishing Gradient and LSTMs
- Strengths and Weaknesses of Recurrence in Sentence Modeling
- Pre-training for RNNs
- Required Reading (for quiz): Goldberg Book Chapter 14-15
- Other Reading: Goldberg Book Chapter 16 (will be covered in class)
- Reference: RNNs (Elman 1990)
- Reference: LSTM (Hochreiter and Schmidhuber 1997)
- Reference: Variants of LSTM (Greff et al. 2015)
- Reference: GRU (Cho et al. 2014)
- Reference: Pre-training RNNs (Dai and Le 2015)
- Reference: Visualizing Recurrent Nets (Karpathy et al. 2015)
- Reference: Learning Syntax from Translation (Shi et al. 2016)
- Reference: Learning Sentiment from LMs (Radford et al. 2017)
- Sentence Similarity
- Textual Entailment
- Paraphrase Identification
- Retrieval
- Highly Recommended Reading: Skip-thought Vectors (Kiros et al. 2015), good reference for several tasks
- Reference: Sentiment Treebank (Socher et al. 2013)
- Reference: Paraphrase Detection (Dolan et al. 2005)
- Reference: Paraphrase Detection w/ Matrix Factorization (Ji and Eisenstein 2013)
- Reference: Semantic Relatedness (Marelli et al. 2014)
- Reference: Manhattan LSTM (Mueller and Thyagarajan 2016)
- Reference: Recognizing Textual Entailment (Dagan et al. 2006)
- Reference: Stanford Natural Language Inference Corpus (Bowman et al. 2015)
- Reference: Multi-perspective Matching for NLI (Wang et al. 2017)
- Reference: Inference -> Generalization (Conneau et al. 2017)
- Reference: Text to Text Retrieval (Huang et al. 2013)
- Reference: Text to Image Retrieval (Socher et al. 2014)
- Reference: Resources on Locality Sensitive Hashing (Stack Overflow)
- Reference: Flickr 8k (Hodosh et al. 2013)
- Encoder-Decoder Models
- Conditional Generation and Search
- Ensembling
- Evaluation
- Types of Data to Condition On
- Required Reading (for quiz): Neural Machine Translation and Sequence-to-Sequence Models Chapter 7
- Reference: Recurrent Neural Translation Models (Kalchbrenner and Blunsom 2013)
- Reference: LSTM Encoder-Decoders (Sutskever et al. 2014)
- Reference: BLEU (Papineni et al. 2002)
- Reference: METEOR (Banerjee and Lavie 2005)
- Reference: Knowledge Distillation (Kim et al. 2016)
- Reference: Generation from Structured Data (Wen et al. 2015)
- Reference: Generation from Input+Tags (Zhou and Neubig 2017)
- Reference: Generation from Images (Karpathy and Li 2015)
- Reference: Generation from Recipe (Kiddon et al. 2016)
- Reference: Generation from TED Talks (Hoang et al. 2016)
- Attention
- What do We Attend To?
- Improvements to Attention
- Specialized Attention Varieties
- A Case Study: "Attention is All You Need"
- Required Reading (for quiz): Neural Machine Translation and Sequence-to-Sequence Models Chapter 8
- Reference: Attentional NMT (Bahdanau et al. 2015)
- Reference: Effective Approaches to Attention (Luong et al. 2015)
- Reference: Copying Mechanism (Gu et al. 2016)
- Reference: Attention-based Bias (Arthur et al. 2016)
- Reference: Attending to Images (Xu et al. 2015)
- Reference: Attending to Speech (Chan et al. 2015)
- Reference: Hierarchical Attention (Yang et al. 2016)
- Reference: Attending to Multiple Sources (Zoph and Knight 2015)
- Reference: Different Multi-source Strategies (Libovicky and Helcl 2017)
- Reference: Multi-modal Attention (Huang et al. 2016)
- Reference: Self Attention (Cheng et al. 2016)
- Reference: Attention is All You Need (Vaswani et al. 2017)
- Reference: Structural Biases in Attention (Cohn et al. 2015)
- Reference: Coverage Embedding Models (Mi et al. 2016)
- Reference: Interpretability w/ Hard Attention (Lei et al. 2016)
- Reference: Supervised Attention (Mi et al. 2016)
- Reference: Attention vs. Alignment (Koehn and Knowles 2016)
- Reference: Monotonic Attention (Yu et al. 2016)
- Reference: Convolutional Attention (Allamanis et al. 2016)
- The Structured Perceptron
- Structured Max-margin Objectives
- Simple Remedies to Exposure Bias
- Required Reading (for quiz): Goldberg Book Chapter 19-19.3
- Recommended Reading: Course in Machine Learning Chapter 17 (Daume)
- Reference: Conditional Random Fields (Lafferty et al. 2001)
- Reference: Structured Perceptron (Collins 2002)
- Reference: Structured Hinge Loss (Taskar et al. 2005)
- Reference: SEARN (Daume et al. 2006)
- Reference: DAgger (Ross et al. 2011)
- Reference: Dynamic Oracles (Goldberg and Nivre 2013)
- Reference: Training Neural Parsers w/ Dynamic Oracles (Ballesteros et al. 2016)
- Reference: Word Dropout (Gal and Ghahramani 2015)
- Reference: RAML (Norouzi et al. 2016)
- Why Local Independence Assumptions?
- Conditional Random Fields
- Required Reading (for quiz): Bidirectional LSTM-CRF Models for Sequence Tagging (Huang et al. 2015)
- Reference: Conditional Random Fields (Lafferty et al. 2001)
- Reference: An Introduction to CRFs (Sutton and McCallum 2011)
- Reference: Minimum Risk Training for Neural MT (Shen et al. 2016)
- Reference: Globally Normalized Networks (Andor et al. 2016)
- Reference: Reward Augmented Maximum Likelihood (Norouzi et al. 2016)
- Reference: Softmax Q-Distribution Estimation (Ma et al. 2016)
- Reference: End-to-end Sequence Labeling with BiLSTM-CNN-CRF (Ma et al. 2016)
- What is Transition-based Parsing?
- Shift-reduce Parsing w/ Feed-forward Nets
- Stack LSTM
- Transition-based Models for Phrase Structure
- A Simple Alternative: Linearized Trees
- Required Reading (for quiz): Dependency Parsing Jurafsky and Martin Chapter 14.1-14.4
- Reference: Shift-reduce Parsing (Yamada and Matsumoto 2003)
- Reference: Shift-reduce Parsing (Nivre 2003)
- Reference: Feature Engineering for Parsing (Zhang and Nivre 2011)
- Reference: Feed-forward Dependency Parsing (Chen and Manning 2014)
- Reference: Recursive RNNs (Socher et al. 2011)
- Reference: Tree-structured LSTM (Tai et al. 2015)
- Reference: Stack LSTM Dependency Parsing (Dyer et al. 2015)
- Reference: Shift-reduce Phrase Structure Parsing (Watanabe et al. 2015)
- Reference: Recurrent Neural Network Grammars (Dyer et al. 2016)
- Reference: Linearized Trees (Vinyals et al. 2015)
- What is Graph-based Parsing?
- Minimum Spanning Tree Parsing
- Structured Training and Other Improvements
- Dynamic Programming Methods for Phrase Structure Parsing
- Required Reading (for quiz): Graph-based Dependency Parsing Jurafsky and Martin Chapter 14.5-14.6
- Reference: Eisner Algorithm (Eisner 1996)
- Reference: Large-margin Training of Parsers (McDonald et al. 2005)
- Reference: Spanning-tree Algorithms (McDonald et al. 2005)
- Reference: Higher-order Dependency Parsing (Zhang and McDonald 2012)
- Reference: Graph-based Parsing w/ Neural Nets (Pei et al. 2015)
- Reference: BiLSTM Features for Graph-based Parsing (Kiperwasser and Goldberg 2016)
- Reference: Deep Bi-affine Attention (Dozat and Manning 2017)
- Reference: Probabilistic Parsing w/ Matrix Tree Theorem (Koo et al. 2007)
- Reference: Neural Probabilistic Parser (Ma and Hovy 2017)
- Reference: Neural CRF Parsing (Durrett and Klein 2015)
- Reference: Span-based Constituency Parsing (Stern et al. 2017)
- Reference: Inside-outside Recurrent Networks (Le and Zuidema 2014)
- Reference: Parsing as Language Modeling (Choe and Charniak 2016)
- Reference: Disentangling Reranking Effects (Fried et al. 2017)
- Combinatory Categorial Grammar and Lambda Calculus
- Graph-based Models of Semantics
- Shallow Semantics: Semantic Role Labeling
- Recommended Reading (no quiz this class, but material may be useful anyway): Jurafsky and Martin Chapters 17 and 18
- Reference: Geoquery (Zelle and Mooney 1996)
- Reference: Free917 (Cai and Yates 2013)
- Reference: Robocup (Wong and Mooney 2006)
- Reference: If This Then That (Quirk et al. 2015)
- Reference: Hearthstone Dataset/Latent Predictor Networks (Ling et al. 2016)
- Reference: Django Dataset (Oda et al. 2015)
- Reference: Sequence-to-sequence Semantic Parsing (Jia and Liang 2016)
- Reference: Sequence-to-tree Parsing (Dong and Lapata 2016)
- Reference: Interfacing with FreeBase (Dong et al. 2015)
- Reference: Learning from Weak Supervision (Guu et al. 2017)
- Reference: Syntax for Code Generation (Yin and Neubig 2017)
- Reference: Abstract Meaning Representation (Banarescu et al. 2013)
- Reference: Minimal Recursion Semantics (Copestake et al. 2005)
- Reference: Universal Conceptual Cognitive Annotation (Abend and Rappoport 2013)
- Reference: Dependency->Semantics (Reddy et al. 2017)
- Reference: CCG->Semantics (Zettlemoyer and Collins 2005)
- Reference: Supertagging w/ LSTMs (Vaswani et al. 2016)
- Reference: Neural Parsing for AMR (Damonte et al. 2017)
- Reference: Neural Parsing for AMR (Peng et al. 2017)
- Reference: Graph Parsing w/ Linearized Trees (Buys and Blunsom 2017)
- Reference: Graph Parsing w/ "Remote" Transition (Hershcovich et al. 2017)
- Reference: Semantic Role Labeling (Gildea and Jurafsky 2002)
- Reference: Neural Semantic Role Labeling (He et al. 2017)
- Generative vs. Discriminative, Deterministic vs. Random Variables
- Variational Autoencoders
- Handling Discrete Latent Variables
- Examples of Variational Autoencoders in NLP
- Required Reading (for quiz): Tutorial on Variational Auto-encoders (Doersch 2016)
- Reference: Variational Auto-encoders (Kingma and Welling 2014)
- Reference: Generating Sentences from a Continuous Space (Bowman et al. 2016)
- Reference: Problems w/ Optimizing Latent Variables (Chen et al. 2017)
- Reference: Convolution Decoders for VAE (Yang et al. 2017)
- Reference: Concrete Distribution (Maddison et al. 2017)
- Reference: Gumbel-Softmax (Jang et al. 2017)
- Reference: Variational Inference for Text Processing (Miao et al. 2016)
- Reference: Controllable Text Generation w/ VAE (Hu et al. 2017)
- Reference: Multi-space Variational Encoder-Decoders (Zhou and Neubig 2017)
- Reference: Recurrent Latent Variable Models (Chung et al. 2015)
- Reference: Language as a Latent Variable (Miao and Blunsom 2016)
- Reference: Emergence of Language in Multi-agent Games (Havrylov and Titov 2017)
- Reference: Natural Language Does Not Emerge Naturally (Kottur et al. 2017)
- What is Reinforcement Learning?
- Policy Gradient and REINFORCE
- Stabilizing Reinforcement Learning
- Value-based Reinforcement Learning
- Required Reading (for quiz): Deep Reinforcement Learning Tutorial (Karpathy 2016)
- Other Useful Reading: Reinforcement Learning Textbook (Sutton and Barto 2016)
- Reference: REINFORCE (Williams 1992)
- Reference: Co-training (Blum and Mitchell 1998)
- Reference: Adding Baselines (Dayan 1990)
- Reference: Sequence-level Training for RNNs (Ranzato et al. 2016)
- Reference: Experience Replay (Lin 1993)
- Reference: Neural Q Learning (Tesauro 1995)
- Reference: Intrinsic Reward (Schmidhuber 1991)
- Reference: Intrinsic Reward for Atari (Bellemare et al. 2016)
- Reference: Reinforcement Learning for Dialog (Young et al. 2013)
- Reference: End-to-end Neural Task-based Dialog (Williams and Zweig 2016)
- Reference: Neural Chat Dialog (Li et al. 2016)
- Reference: User Simulation for Learning in Dialog (Schatzmann et al. 2007)
- Reference: RL for Mapping Instructions to actions (Branavan et al. 2009)
- Reference: Deep RL for Mapping Instructions to Actions (Misra et al. 2017)
- Reference: RL for Text-based Games (Narasimhan et al. 2015)
- Reference: Incremental Prediction in MT (Grissom et al. 2014)
- Reference: Incremental Neural MT (Gu et al. 2017)
- Reference: RL for Information Retrieval (Narasimhan et al. 2016)
- Reference: RL for Query Reformulation (Nogueira and Cho 2017)
- Reference: RL for Coarse-to-fine Question Answering (Choi et al. 2017)
- Reference: RL for Learning Neural Network Structure (Zoph and Le 2016)
- (Generative) Adversarial Networks
- Where to use the Adversary?: Features vs. Outputs
- GANs on Discrete Outputs
- Adversaries on Discrete Inputs
- Required Reading (for quiz): GAN Tutorial (Goodfellow 2017)
- Reference: Generative Adversarial Nets (Goodfellow et al. 2014)
- Reference: Example of Fuzzy Outputs (Lotter et al. 2015)
- Reference: Improved Techniques for Training GANs (Salimans et al. 2016)
- Reference: SeqGAN (Yu et al. 2016)
- Reference: MT w/ GAN (Yang et al. 2017)
- Reference: MT w/ GAN (Wu et al. 2017)
- Reference: MT w/ Gumbel-Greedy Decoding (Gu et al. 2017)
- Reference: Dialog w/ GAN (Li et al. 2017)
- Reference: Perturbing Embeddings (Miyato et al. 2016)
- Reference: Adversarial Feature Learning for Domain Adaptation (Ganin et al. 2016)
- Reference: Adversarial Feature Learning for Bilingual Classification (Chen et al. 2016)
- Reference: Adversarial Feature Learning for Multilingual MT (Xie et al. 2017)
- Reference: Adversarial Feature Learning for Multi-task Classification (Liu et al. 2017)
- Reference: Adversarial Adaptation using Synthetic Data (Kim et al. 2017)
- Reference: Adversarial Feature Learning for Implicit Relation Classification (Qin et al. 2017)
- Reference: Professor Forcing (Lamb et al. 2016)
- Reference: Unsupervised Style Transfer for Text (Shen et al. 2017)
- Learning Features vs. Learning Structure
- Semi-supervised Learning Methods
- Unsupervised Learning Methods
- Design Decisions for Unsupervised Models
- Examples of Unsupervised Learning
- Interesting Reading (not required, no quiz) Linguistic Structure Prediction Chapter 4
- Reference: Unsupervised POS Induction w/ Word Embeddings (Lin et al. 2015)
- Reference: Unsupervised Neural Hidden Markov Models (Tran et al. 2016)
- Reference: Extracting Automata from RNNs (Giles et al. 1992)
- Reference: CRF Autoencoders (Ammar et al. 2014)
- Reference: Semi-supervised Prediction w/ Neural CRF Autoencoders (Zhang et al. 2017)
- Reference: Gated Convolution (Cho et al. 2014)
- Reference: Learning Grammar with RL (Yogatama et al. 2016)
- Reference: Learning to Compose Task-specific Tree Structures (Choi et al. 2017)
- Reference: Parsing w/ a Semantic Objective (Williams et al. 2017)
- Reference: What do RNN Grammars Learn About Syntax? (Kuncoro et al. 2017)
- Reference: Dependency Model with Valence (Klein and Manning 2004)
- Reference: Unsupervised Neural Dependency Parsing (Jiang et al. 2016)
- Reference: CRF Autoencoders for Unsupervised Dependency Parsing (Cai et al. 2017)
- Reference: Learning Language-level Features (Malaviya et al. 2017)
- Reference: Embedded Segmental k-means Models (Kamper et al. 2017)
- Reference: Speech Segmentation (Elsner and Shain 2017)
- Reference: Word Discovery w/ Encoder-decoder Models (Boito et al. 2017)
- Models of Coreference
- Discourse Parsing
- Required Reading (for quiz): 15 Years in Co-reference (Ng 2010)
- Reference: End-to-end Neural Coreference Resolution (Lee et al. 2017)
- Reference: Deep Reinforcement Learning for Entity Ranking (Clark and Manning 2016)
- Reference: Entity-level Representations (Clark and Manning 2016)
- Reference: Global Features for Coreference (Wiseman et al. 2016)
- Reference: Anaphoricity and Antecedent Features (Wiseman et al. 2015)
- Reference: Coref, success and challenges (Ng 2016)
- Reference: Discourse-driven LMs (Peng and Roth 2016)
- Reference: Sentence-level LSTMs for Script Inference (Pichotta and Mooney 2016)
- Reference: Easy Victories and Uphill Battles (Durrett and Klein 2013)
- Reference: Solving Hard Coreference Problems (Peng et al. 2015)
- Reference: Entity-centric Coref (Clark and Manning 2015)
- Reference: Modular Entity-centric Model (Haghighi and Klein 2010)
- Reference: Discourse Structure for Text Categorization (Ji and Smith 2017)
- Reference: Adversarial Implicit Discourse Relation Classification (Qin et al. 2017)
- Reference: Recursive Deep Models for Discourse (Li et al. 2014)
- Reference: Attention-based Hierarchical Discourse (Li et al. 2016)
- Reference: Representation Learning for Text-level Discourse (Ji and Eisenstein 2014)
- Reference: Pay Attention to the Ending (Cai et al. 2017)
- Reference: Discourse Language Models (Chaturvedi et al. 2017)
- Chat-based Dialog
- Task-based Dialog
- Interesting Reading (no quiz): Dialog Systems and Chatbots Jurafsky and Martin Chapter 29
- Reference: Data-driven Dialog Response Generation (Ritter et al. 2011)
- Reference: Neural Dialog Response Generation (Sordoni et al. 2015)
- Reference: Neural Dialog Response Generation (Shang et al. 2015)
- Reference: Neural Dialog Response Generation (Vinyals and Le 2015)
- Reference: Context is Helpful for MT (Matsuzaki et al. 2015)
- Reference: Context is Not So Helpful for MT (Jean et al. 2017)
- Reference: Hierarchical Model for Dialog Generation (Serban et al. 2016)
- Reference: Discourse-level VAE (Zhao et al. 2017)
- Reference: Diversity Promoting Objective (Li et al. 2016)
- Reference: How Not to Evaluate your Dialog System (Liu et al. 2016)
- Reference: DeltaBLEU (Galley et al. 2015)
- Reference: Adversarial Evaluation (Li et al. 2017)
- Reference: Automatic Turing Test (Lowe et al. 2017)
- Reference: Personality Generation for Dialog (Mairesse et al. 2007)
- Reference: Persona-based Neural Dialog (Li et al. 2016)
- Reference: Dialog Response Retrieval (Lee et al. 2009)
- Reference: Neural Dialog Response Retrieval (Nio et al. 2014)
- Reference: Smart Reply (Kannan et al. 2016)
- Reference: Language Generation for Dialog (Wen et al. 2015)
- Reference: Neural Nets for Spoken Language Understanding (Mesnil et al. 2015)
- Reference: Dialog State Tracking (Williams et al. 2013)
- Reference: Neural Dialog State Tracking (Henderson et al. 2014)
- Reference: End-to-end Dialog Control (Williams et al. 2017)
- What are Knowledge Graphs/Ontologies?
- Relation Extraction from Embeddings
- Learning Embeddings from Relations
- Required Reading (for quiz): Relation Extraction Jurafsky and Martin Chapter 21.2
- Reference: Relation Extraction Survey (Nickel et al. 2016)
- Reference: WordNet (Miller 1995)
- Reference: Cyc (Lenat 1995)
- Reference: DBPedia (Auer et al. 2007)
- Reference: YAGO (Suchanek et al. 2007)
- Reference: Babelnet (Navigli and Ponzetto 2010)
- Reference: Freebase (Bollacker et al. 2008)
- Reference: Wikidata (Vrandečić and Krötzsch 2014)
- Reference: Relation Extraction by Translating Embeddings (Bordes et al. 2013)
- Reference: Relation Extraction with Neural Tensor Networks (Socher et al. 2013)
- Reference: Relation Extraction by Translating on Hyperplanes (Wang et al. 2014)
- Reference: Relation Extraction by Representing Entities and Relations (Lin et al. 2015)
- Reference: Relation Extraction w/ Decomposed Matrices (Xie et al. 2017)
- Reference: Distant Supervision for Relation Extraction (Mintz et al. 2009)
- Reference: Relation Classification w/ Recursive NNs (Socher et al. 2012)
- Reference: Relation Classification w/ CNNs (Zeng et al. 2014)
- Reference: Joint Entity and Relation Embedding (Toutanova et al. 2015)
- Reference: Distant Supervision for Neural Models (Luo et al. 2017)
- Reference: Relation Extraction w/ Tensor Decomposition (Sutskever et al. 2009)
- Reference: Relation Extraction via KG Paths (Lao and Cohen 2010)
- Reference: Relation Extraction by Traversing Knowledge Graphs (Guu et al. 2015)
- Reference: Relation Extraction via Differentiable Logic Rules (Yang et al. 2017)
- Reference: Improving Embeddings w/ Semantic Knowledge (Yu et al. 2014)
- Reference: Retrofitting Word Vectors to Semantic Lexicons (Faruqui et al. 2015)
- Reference: Multi-sense Embedding with Semantic Lexicons (Jauhar et al. 2015)
- Reference: Antonymy and Synonym Constraints for Word Embedding (Mrksic et al. 2016)
- TBD
- No quiz
- Reference: MCTest (Richardson et al. 2013)
- Reference: RACE (Lai et al. 2017)
- Reference: SQuAD (Rajpurkar et al. 2016)
- Reference: TriviaQA (Joshi et al. 2017)
- Reference: Teaching Machines to Read and Comprehend (Hermann et al. 2015)
- Reference: Attention Sum (Kadlec et al. 2016)
- Reference: Attention over Attention (Cui et al. 2017)
- Reference: Bidirectional Attention Flow (Seo et al. 2017)
- Reference: Dynamic Coattention Networks (Xiong et al. 2017)
- Reference: Gated Attention Readers (Dhingra et al. 2017)
- Reference: Memory Networks (Weston et al. 2015)
- Reference: End-to-end Memory Networks (Sukhbaatar et al. 2015)
- Reference: Dynamic Memory Networks (Kumar et al. 2016)
- Reference: Learning to Stop Reading (Shen et al. 2017)
- Reference: Coarse-to-fine Question Answering (Choi et al. 2017)
- Reference: bAbI Dataset (Weston et al. 2015)
- Reference: NLP in Prolog (Pereira and Shieber 2002), Example Code
- Reference: A Thorough Examination of the CNN/Daily Mail Task (Chen et al. 2016)
- Reference: Adversarial Examples in SQuAD (Jia and Liang 2017)
- Identifying problems
- Debugging training time problems
- Debugging test time problems
- Interesting Reading
- Reference: Highway Networks (Srivastava et al. 2015)
- Reference: Residual Connections (He et al. 2015)
- Reference: Rethinking Generalization (Zhang et al. 2017)
- Reference: Marginal Value of Adaptive Gradient Methods (Wilson et al. 2017)
- Reference: Adam w/ Learning Rate Decay (Denkowski and Neubig 2017)
- Reference: Dropout (Srivastava et al. 2014)
- Reference: Recurrent Dropout (Gal and Ghahramani 2015)
- Reference: Minibatch Creation Strategies (Morishita et al. 2017)
- Reference: Decoding Problems (Koehn and Knowles 2017)
- Beam Search
- A* Search
- Search w/ Future Costs
- No Quiz
- Reference: Google’s Neural Machine Translation System (Wu et al. 2016)
- Reference: Multinomial Length Normalization (Eriguchi et al. 2016)
- Reference: Average Length Normalization (Cho et al. 2014)
- Reference: Mutual Information and Diverse Decoding (Li et al. 2016)
- Reference: Generating High-Quality and Informative Conversation Responses (Shao et al. 2017)
- Reference: Effective Inference for Generative Neural Parsing (Stern et al. 2017)
- Reference: Beam-Search Optimization (Wiseman et al. 2016)
- Reference: Continuous Beam Search (Goyal et al. 2017)
- Reference: A* Parsing (Klein et al. 2003)
- Reference: LSTM CCG Parsing (Lewis et al. 2014)
- Reference: Global Neural CCG Parsing (Lee et al. 2016)
- Reference: Learning to Decode for Future Success (Li et al. 2017)
- Reference: Generative Transition-based Dependency Parsing (Buys et al. 2015)
- Reference: Recurrent Neural Network Grammars (Dyer et al. 2016)
- Reference: Monte Carlo Tree Search (Kumagai et al. 2017)
- What is Multi-task Learning?
- Methods for Multi-task Learning
- Multi-task Objectives for NLP
- Required Reading (for quiz): Multi-task Learning in Neural Networks and Multi-task Objectives for NLP (Ruder 2017)
- Reference: Natural Language Processing from Scratch (Collobert et al. 2011)
- Reference: Regularization Techniques (Barone et al. 2017)
- Reference: Word Representations (Turian et al. 2010)
- Reference: Semi-supervised Sequence Learning (Dai and Le 2015)
- Reference: Gaze Prediction + Summarization (Klerke et al. 2016)
- Reference: Selective Transfer (Zoph et al. 2016)
- Reference: Soft Parameter Tying (Duong et al. 2015)
- Reference: Translation-based Encoder Pretraining (McCann et al. 2017)
- Reference: Bidirectional Language Model Pretraining (Peters et al. 2017)
- Reference: Pre-training for MT (Luong et al. 2015)
- Reference: Domain Adaptation via Feature Augmentation (Kim et al. 2016)
- Reference: Feature Augmentation w/ Tags (Chu et al. 2017)
- Reference: Unsupervised Adaptation (Long et al. 2015)
- Reference: Multilingual MT (Johnson et al. 2017)
- Reference: Multilingual MT (Ha et al. 2016)
- Reference: Teacher-student Multilingual NMT (Chen et al. 2017)
- Reference: Multiple Annotation Standards for Semantic Parsing (Peng et al. 2017)
- Reference: Multiple Annotation Standards for Word Segmentation (Chen et al. 2017)
- Reference: Modeling Annotator Variance (Guan et al. 2017)
- Reference: Different Layers for Different Tasks (Hashimoto et al. 2017)
- Reference: Polyglot Language Models (Tsvetkov et al. 2016)
- Reference: Many Languages One Parser (Ammar et al. 2016)
- Reference: Multilingual Relation Extraction (Lin et al. 2017)
Slides: Word Embedding Slides
Sample Code: Word Embedding Code Examples
Lecture Video: Word Embedding Lecture Video
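As a companion to the skip-gram/CBOW and evaluation topics above, a small illustrative sketch (separate from the linked Word Embedding Code Examples): it extracts the (center, context) pairs that skip-gram training uses and compares two vectors with cosine similarity. It assumes numpy, and the embeddings are random stand-ins rather than trained vectors.

```python
import numpy as np

def skipgram_pairs(tokens, window=2):
    """Yield the (center, context) pairs used to train skip-gram models."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

def cosine(u, v):
    """Cosine similarity, the usual way to compare word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sent = "you shall know a word by the company it keeps".split()
print(list(skipgram_pairs(sent, window=1))[:4])

# Random stand-in embeddings; real ones would come from training.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=16) for w in set(sent)}
print(cosine(emb["word"], emb["company"]))
```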
9/7 Why is word2vec So Fast?: Speed Tricks for Neural Nets
Content:
(Guest Lecture: Taylor Berg-Kirkpatrick)
Reading Material
Slides: Efficiency Slides
Sample Code: Efficiency Code Examples
Lecture Video: Efficiency Lecture Video
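A minimal sketch of the negative-sampling objective covered in this lecture, one of the tricks that makes word2vec fast. It assumes numpy, the vectors are random stand-ins for real parameters, and it is illustrative only (not the linked Efficiency Code Examples).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_center, v_context, v_negatives):
    """Negative sampling for one (center, context) pair:
    -log sigma(c.w) - sum_k log sigma(-n_k.w).
    Replaces the full softmax over the vocabulary with k+1 binary decisions."""
    pos = -np.log(sigmoid(np.dot(v_context, v_center)))
    neg = -np.sum(np.log(sigmoid(-v_negatives @ v_center)))
    return pos + neg

rng = np.random.default_rng(0)
d, k = 32, 5  # embedding size, number of negative samples
print(neg_sampling_loss(rng.normal(size=d), rng.normal(size=d),
                        rng.normal(size=(k, d))))
```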
Section 2: Models of Sentences
9/12 Convolutional Networks for Text
Content:
Reading Material
Slides: CNN Slides
Sample Code: CNN Code Examples
Lecture Video: CNN Lecture Video
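To make convolution and pooling over text concrete, a small numpy sketch (separate from the linked CNN Code Examples) of max-over-time pooling across n-gram windows in the style of Kim (2014); the dimensions and weights are arbitrary stand-ins.

```python
import numpy as np

def text_cnn_features(embeddings, filters, bias):
    """Slide each filter over every n-gram window of the sentence and
    max-pool over positions ("max-over-time" pooling)."""
    n, d = embeddings.shape           # sentence length x embedding dim
    f, width, _ = filters.shape       # num filters x window size x embedding dim
    windows = np.stack([embeddings[i:i + width].ravel()
                        for i in range(n - width + 1)])          # one row per window
    scores = np.tanh(windows @ filters.reshape(f, -1).T + bias)  # window x filter
    return scores.max(axis=0)         # one feature per filter

rng = np.random.default_rng(0)
emb = rng.normal(size=(7, 10))        # 7 words, 10-dim embeddings
filt = rng.normal(size=(4, 3, 10))    # 4 filters over trigram windows
print(text_cnn_features(emb, filt, np.zeros(4)).shape)  # -> (4,)
```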
9/14 Recurrent Networks for Sentence or Language Modeling
Content:
Reading Material
Slides: RNN Slides
Sample Code: RNN Code Examples
Lecture Video: RNN Lecture Video
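A minimal numpy sketch of the recurrence this lecture builds on, using the simple (Elman) RNN update rather than an LSTM; the weights and inputs are random stand-ins, and it is separate from the linked RNN Code Examples.

```python
import numpy as np

def elman_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One Elman RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16
W_xh = rng.normal(size=(d_hid, d_in))
W_hh = rng.normal(size=(d_hid, d_hid))
b_h = np.zeros(d_hid)

# Run the recurrence over a toy sequence; the final state summarizes it.
h = np.zeros(d_hid)
for x_t in rng.normal(size=(5, d_in)):
    h = elman_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)
```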
9/19 Using/Evaluating Sentence Representations
Content:
Reading Material
Slides: Sentence Representation Slides
Sample Code: Sentence Representation Code Examples
Lecture Video: Sentence Representation Video
Section 3: Sequence-to-sequence Models
9/21 Conditioned Generation
Content:
Reading Material
Slides: Conditional LM Slides
Sample Code: Conditional LM Code Examples
Lecture Video: Conditional LM Lecture Video
9/26 Attention
Content:
Reading Material
Slides: Attention Slides
Sample Code: Attention Code Examples
Lecture Video: Attention Lecture Video
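The core attention computation fits in a few lines; the numpy sketch below uses plain dot-product attention with random stand-ins for the encoder states and the decoder query, and is illustrative only (not the linked Attention Code Examples).

```python
import numpy as np

def attention(query, keys, values):
    """Dot-product attention: softmax(query . keys) gives a weight per source
    position; the output is the weighted sum of the values."""
    scores = keys @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over source positions
    return weights @ values, weights

rng = np.random.default_rng(0)
src_len, d = 6, 8
keys = values = rng.normal(size=(src_len, d))   # stand-in encoder states
query = rng.normal(size=d)                      # stand-in decoder state
context, weights = attention(query, keys, values)
print(weights.round(2), context.shape)
```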
Section 4: Structured Prediction Models
9/28 Search-based Structured Prediction
Content:
Slides: Structured Prediction Slides
Sample Code: Structured Prediction Code Examples
Lecture Video: Structured Prediction Lecture Video
10/3 Structured Prediction with Local Independence Assumptions
Content:
Slides: CRF Slides
Lecture Video: CRF Video
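A linear-chain CRF needs the log partition function, computed with the forward algorithm; below is a minimal numpy sketch of that computation. The emission and transition scores are random stand-ins, and start/stop transitions are omitted for brevity.

```python
import numpy as np

def logsumexp(a, axis=None):
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def crf_log_partition(emissions, transitions):
    """Forward algorithm for a linear-chain CRF: log of the sum of exp(score)
    over all tag sequences. emissions is (T, num_tags); transitions[i, j]
    scores tag i followed by tag j."""
    alpha = emissions[0]                       # log-scores of length-1 prefixes
    for t in range(1, len(emissions)):
        alpha = logsumexp(alpha[:, None] + transitions, axis=0) + emissions[t]
    return logsumexp(alpha)

rng = np.random.default_rng(0)
T, K = 5, 3  # sequence length, tag set size
print(crf_log_partition(rng.normal(size=(T, K)), rng.normal(size=(K, K))))
```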
Section 5: Syntactic/Semantic Parsing Models
10/5 Transition-based Parsing Models
Content:
Slides: Transition-based Parsing Slides
Sample Code: Transition-based Parsing Code Examples
Lecture Video: Transition-based Parsing Video
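The arc-standard transition system behind shift-reduce parsing can be illustrated without any neural scoring; the sketch below simply replays a given action sequence (the sentence and actions are made up) and leaves out the classifier that would predict the actions.

```python
def arc_standard(words, actions):
    """Replay arc-standard transitions: SHIFT moves a word from the buffer to
    the stack; LEFT/RIGHT pop a dependent and record a (head, dependent) arc."""
    stack, buffer, arcs = [], list(range(len(words))), []
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))
        elif act == "LEFT":               # top of stack heads the word below it
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif act == "RIGHT":              # word below the top heads the top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return [(words[h], words[d]) for h, d in arcs]

# Toy parse of "the cat sat down": "cat" heads "the"; "sat" heads "cat" and "down".
print(arc_standard("the cat sat down".split(),
                   ["SHIFT", "SHIFT", "LEFT", "SHIFT", "LEFT", "SHIFT", "RIGHT"]))
```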
10/10 Parsing with Dynamic Programs
Content:
Slides: DP Parsing Slides
Sample Code: DP Parsing Code Examples
Lecture Video: DP Parsing Lecture Video
10/12 Neural Semantic Parsing
Content:
Slides: Semantic Parsing Slides
Sample Code: Semantic Parsing Code Examples
Lecture Video: Semantic Parsing Lecture Video
Section 6: Advanced Learning Techniques
10/17 Latent Random Variable Models
Content:
Slides: Latent Variable Slides
Sample Code: Latent Variable Code Examples
Lecture Video: Latent Variable Lecture Video
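Two small pieces of the variational autoencoder objective, the reparameterization trick and the KL regularizer toward a standard normal, can be written down directly; the numpy sketch below uses made-up Gaussian parameters and omits the encoder/decoder networks (it is not the linked Latent Variable Code Examples).

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Reparameterization trick: sample z ~ N(mu, sigma^2) as mu + sigma * eps,
    keeping the sample differentiable with respect to mu and log_var."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), the regularizer in the VAE objective."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

rng = np.random.default_rng(0)
mu, log_var = rng.normal(size=4), rng.normal(size=4)
print(reparameterize(mu, log_var, rng))
print(kl_to_standard_normal(mu, log_var))
```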
10/19 Reinforcement Learning
Slides: Reinforcement Learning Slides
Sample Code: Reinforcement Learning Code Examples
Lecture Video: Reinforcement Learning Lecture Video
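A minimal sketch of the REINFORCE (policy gradient) update with a baseline, applied to a toy three-armed bandit; the reward values, baseline, and learning rate are arbitrary illustrative choices, and only numpy is assumed.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def reinforce_gradient(logits, action, reward, baseline):
    """REINFORCE (score-function) gradient for a categorical policy:
    (reward - baseline) * d log pi(action) / d logits."""
    grad_log_pi = -softmax(logits)
    grad_log_pi[action] += 1.0          # gradient of log-softmax at the chosen action
    return (reward - baseline) * grad_log_pi

rng = np.random.default_rng(0)
logits = np.zeros(3)                    # toy 3-action policy
for _ in range(200):                    # toy bandit where action 2 pays off most
    a = rng.choice(3, p=softmax(logits))
    r = [0.1, 0.2, 1.0][a] + rng.normal(scale=0.05)
    logits += 0.1 * reinforce_gradient(logits, a, r, baseline=0.4)
print(softmax(logits).round(2))         # typically ends up favoring action 2
```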
10/24 Adversarial Networks
Content:
Slides: Adversarial Slides
Lecture Video: Adversarial Lecture Video
10/26 Semi-supervised and Unsupervised Learning of Structure
Slides: Unsupervised/Semi-supervised Slides
Lecture Video: Unsupervised/Semi-supervised Lecture Video
Section 7: Models of Documents and Discourse
10/31 Coreference and Discourse Parsing
Slides: Document-level Model Slides
Lecture Video: Document-level Model Lecture Video
11/2 Models of Dialog
Slides: Dialog Slides
Lecture Video: Dialog Lecture Video
Section 8: Neural Networks and Knowledge
11/7 Learning from/for Knowledge Graphs
Slides: Knowledge Graph Slides
Lecture Video: Knowledge Graph Video
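Embedding-based knowledge graph models such as TransE (the "Translating Embeddings" reference above) score a triple by how close head + relation lands to tail; the numpy sketch below shows only that scoring function, with random stand-in embeddings and hypothetical entity/relation names.

```python
import numpy as np

def transe_score(head, relation, tail):
    """TransE-style score: a plausible triple (h, r, t) should have h + r
    close to t, so a smaller distance means a higher (less negative) score."""
    return -np.linalg.norm(head + relation - tail)

rng = np.random.default_rng(0)
ent = {e: rng.normal(size=8) for e in ["CMU", "Pittsburgh", "Tokyo"]}
rel = {"located_in": rng.normal(size=8)}

# With trained embeddings the true triple would outscore the corrupted one;
# here the vectors are random, so this only illustrates the scoring function.
print(transe_score(ent["CMU"], rel["located_in"], ent["Pittsburgh"]))
print(transe_score(ent["CMU"], rel["located_in"], ent["Tokyo"]))
```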
11/9 Machine Reading w/ Neural Nets
Slides: Machine Reading Slides
Lecture Video: Machine Reading Video
11/14 Debugging Neural Nets (for NLP)
Slides: Debugging Slides
Section 9: Search, Multi-lingual and Multi-task Learning
11/16 Advanced Search Algorithms
Slides: Search Slides
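Beam search, one of the topics of this lecture, keeps only the k highest-scoring partial hypotheses at each step; below is a small generic sketch in which a fixed toy distribution stands in for a real decoder (the vocabulary and probabilities are made up).

```python
import numpy as np

def beam_search(step_log_probs, vocab, beam_size=3, max_len=4, eos="</s>"):
    """Generic beam search: extend each hypothesis with every vocabulary item,
    then keep only the beam_size best by cumulative log-probability.
    step_log_probs(prefix) must return one log-probability per vocab item."""
    beams = [([], 0.0)]                       # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:  # finished hypotheses carry over
                candidates.append((tokens, score))
                continue
            for w, lp in zip(vocab, step_log_probs(tokens)):
                candidates.append((tokens + [w], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

vocab = ["a", "b", "</s>"]
toy_model = lambda prefix: np.log([0.5, 0.3, 0.2])  # same distribution at every step
for tokens, score in beam_search(toy_model, vocab):
    print(tokens, round(score, 2))
```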
11/21 Multi-task, Multi-lingual Learning Models
Slides: Multitask Slides
11/23 Thanksgiving -- NO CLASS
11/28 Multi-modal Learning
Content: (Guest Lecture: LP Morency)
Slides: Multimodal Slides