Efficient Natural Language and Speech Processing
(Models, Training and Inference)
This workshop introduces fundamental problems in natural language and speech processing that are of interest to the broader machine learning and deep learning community, with a focus on improving the efficiency of models, their training, and their inference. The workshop program offers an interactive platform for gathering experts and talent from academia and industry through invited keynote talks, panel discussions, paper submissions and reviews, poster and oral presentations, and a mentorship program.
This provides an opportunity to discuss and learn from each other, exchange ideas, build connections, and brainstorm potential solutions and future collaborations. The topics of this workshop are of interest to people working on general machine learning, deep learning, optimization, theory, and NLP & Speech applications.
Overview
Deep neural networks owe much of their success across tasks in natural language processing (NLP) and speech processing to heavy over-parameterization and very large amounts of training data. However, training or deploying these networks on devices, or even on cloud services, with limited memory and computational power can be very expensive and challenging. For instance, pre-trained language models (PLMs) such as GPT-3 have led to great breakthroughs in NLP, but running GPT-3, with more than 170 billion parameters trained on more than 500 GB of data, requires more than 10 Tesla V100 GPUs. Even so, improving NLP and speech models by increasing their number of parameters and incorporating more data remains very common practice in both domains. It is therefore of vital importance to invest in enhancing the efficiency of these models in terms of model architectures, training, and inference, from the different perspectives highlighted in this workshop. In this regard, we would like to share some unique and fundamental challenges with the NeurIPS community for consideration in future investigations.
Call for Papers
We encourage the NeurIPS community to submit their solutions, ideas, and ongoing work concerning data, model, training, and inference efficiency for NLP and speech processing. The scope of this workshop includes, but is not limited to, the following topics.
Efficient Pre-Training and Fine-Tuning. Pre-training is a very expensive process. Even a small modification to a model's configuration requires pre-training to be redone:
- Fast pre-training techniques, avoiding pre-training from scratch
- Multi-domain pre-training/fine-tuning and fast domain adaptation for pre-trained/fine-tuned models
- Multimodal pre-trained (e.g., text--speech) models
- Avoiding task-specific fine-tuning of pre-trained models
- New efficient architectures for pre-trained models
Model Compression. Neural model compression techniques such as quantization, pruning, layer decomposition and knowledge distillation (KD) aim at reducing the number of parameters of the models and improving their memory requirements or runtime efficiency:
- Impact of different compression techniques on the inductive biases learned by the original models
- Combined compression techniques for more efficient NLP and speech models
- Efficient KD for NLP and speech, efficient intermediate layer distillation, and teacher-free distillation
- Improving KD for large classification problems (e.g., text generation and machine translation with a very large number of output classes)
- Solving the capacity-gap problem and the search problem associated with finding the best checkpoint of the teacher
- Theory of KD (e.g., how does KD work?)
Efficient Training. How to improve the training speed of the NLP and speech models:
- Improving the optimizer for faster training
- Accelerated training of different tasks in NLP and speech
- Distributed training, federated learning and continual learning for NLP and speech tasks
Data Efficiency. Pre-trained models rely on a huge amount of unlabeled data, which makes training very sample-inefficient:
- Sample efficient training, training with less data, few-shot and zero-shot learning
- Sample efficient data-augmentation, identifying which training samples should be augmented
- Low-resource NLP and speech, considering training tasks with limited available data
Edge Intelligence. Running trained models on edge devices requires a conversion process to match the network to the hardware specifications:
- TinyML for NLP and speech on embedded systems
- Efficient conversion versus hardware-aware training
- Training on device
Submission Instructions
You are invited to submit your papers via our CMT submission portal. All submitted papers must be anonymized for double-blind review. We expect each paper to be reviewed by at least three reviewers. The content of the paper (excluding references and supplementary materials) should be no longer than 6 pages, strictly following the NeurIPS template style (which can be found here).
Authors can submit up to 100 MB of supplementary materials separately. Authors are highly encouraged to submit their code for reproducibility purposes. Although original submissions are preferred, submitted papers may also be already published work, ArXiv preprints, or work currently under submission elsewhere. Please make sure to indicate the complete list of conflicts of interest for all authors of your paper. To encourage higher-quality submissions, our sponsors are offering a Best Paper Award to qualifying outstanding original oral and poster presentations (upon nomination by the reviewers). Bear in mind that our workshop is non-archival, but the accepted papers will be hosted on the workshop website.
Important Dates:
- Submission Deadline:
September 22, 2021 AOE
- Uploading Supplementary Materials:
September 26, 2021 AOE
- Acceptance Notification:
October 23, 2021 AOE
- Camera-Ready Submission:
November 1, 2021 AOE
- Workshop Date: December 13, 2021
The accepted papers can be found here.
Confirmed Speakers
Prof. Mirella Lapata
University of Edinburgh
Prof. Luke Zettlemoyer
University of Washington (Facebook)
Prof. Kevin Duh
Johns Hopkins University
Dr. Boxing Chen
Alibaba
Prof. Sameer Singh
University of California, Irvine
Prof. Danqi Chen
Princeton University
Dr. Mohammad Norouzi
Google Brain
Prof. Yejin Choi
University of Washington (Allen Institute for AI)
Dr. Lu Hou
Huawei Noah's Ark Lab
Prof. Xu Sun
Peking University
Prof. Barbara Plank
IT University of Copenhagen
Prof. Samira Ebrahimi Kahou
ETS & MILA
Schedule (EST time zone - New York/Montreal/Toronto)
Title: Opening Speech
Presenter: Pascal Poupart
Bio:TBD
Abstract:TBD
Title: Continual Learning in Large-Scale Pre-Training
Presenter: Xu Sun
Bio:Xu Sun is an Associate Professor in the Department of Computer Science, Peking University. He received his Ph.D. from the University of Tokyo (2010), advised by Prof. Jun'ichi Tsujii. From 2010 to 2012, he worked at the University of Tokyo, Cornell University, and the Hong Kong Polytechnic University as a research fellow. He was a research intern at MSR-Redmond in 2009. His research focuses on natural language processing and machine learning, especially natural language generation and deep learning for language. He received the COLING 2018 Best Paper Award.
Abstract:Large-scale pre-training has enabled breakthroughs in natural language processing. However, the underlying large-scale models and data make research in this field hard to sustain. In this talk, I will introduce our recent work on continual learning in large-scale pre-training to improve the efficiency of pre-trained language models (from ICML 2021, AAAI 2021, etc.). For data-efficient continual learning for PLMs, this talk covers our work on addressing long-tailed data distributions with definitional data, and on making accurate behavioral modifications with low instance-wise side effects by limiting the changed parameters. For cost-effective search of PLM architectures, I will introduce our training-free neural architecture search method, based on the Gram matrix of instance gradients, which can find better fine-tuning architectures for PLMs. Continual learning offers vast opportunities for efficient PLM learning and applications, and new challenges remain to be resolved.
Title: Efficient Multi-lingual Neural Machine Translation
Presenter: Boxing Chen
Bio:Boxing Chen is a Senior Staff Algorithm Expert at the Machine Intelligence Lab of Alibaba Group. He works on natural language processing, mainly focusing on machine translation. Prior to Alibaba, he was a Research Officer at the National Research Council Canada (NRC). He has co-authored more than 80 papers in NLP conferences and journals and has served as an area chair for ACL and EMNLP. His teams have ranked first more than 20 times in various MT competitions.
Abstract:To support Alibaba's globalization, we developed a Multi-lingual Neural Machine Translation (MNMT) system that translates between 214 languages with one model. The main challenges of MNMT include model capacity, zero-shot translation, inference speed, and energy cost. We therefore performed several studies to make training and inference more efficient in both computation and energy while maintaining competitive translation quality. These include: (1) a language-aware, interlingua-based MNMT architecture; (2) improving zero-shot translation via joint training with denoising autoencoding; (3) speeding up decoding with a shallow decoder, shared decoder attention weights, and shortlist prediction; and (4) a new energy-efficient attention mechanism that replaces multiplications with binarized selective and addition operations.
Title: Compression and Acceleration of Pre-trained Language Models
Presenter: Lu Hou
Bio:Dr. Lu Hou is a researcher at the Speech and Semantics Lab of Huawei Noah's Ark Lab. She obtained her Ph.D. from the Hong Kong University of Science and Technology in 2019, under the supervision of Prof. James T. Kwok. Her current research interests include compression and acceleration of deep neural networks, natural language processing, and deep learning optimization.
Abstract:Recently, Transformer-based pre-trained models like BERT and GPT have achieved remarkable results on various natural language understanding tasks and even some computer vision and multi-modal tasks. However, these models have many parameters, hindering their deployment on edge devices or the cloud. In this talk, I will introduce some recent progress on how we alleviate these concerns in various deployment scenarios, during both inference and training. Specifically, I will discuss compression and acceleration methods using knowledge distillation, dynamic networks, and network quantization.
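To make one of these ingredients concrete, below is a minimal, hedged sketch of post-training symmetric 8-bit weight quantization in plain NumPy. It is a generic illustration rather than the specific quantization method presented in the talk, and the matrix size is an arbitrary assumption.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Toy example: quantize a random "weight matrix" and measure the reconstruction error.
w = np.random.randn(768, 768).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"mean absolute quantization error: {err:.6f}")
```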
Title: Summarization in Quantized Transformer Spaces
Presenter: Mirella Lapata
Bio:Mirella Lapata is Professor of Natural Language Processing in the School of Informatics at the University of Edinburgh. Her research focuses on getting computers to understand, reason with, and generate natural language. She is the first recipient (2009) of the British Computer Society and Information Retrieval Specialist Group (BCS/IRSG) Karen Spärck Jones Award, a Fellow of the Association for Computational Linguistics (ACL), and a Fellow of the Royal Society of Edinburgh (FRSE). She has also received best paper awards at leading NLP conferences and has served on the editorial boards of the Journal of Artificial Intelligence Research, the Transactions of the ACL, and Computational Linguistics. She was president of SIGDAT (the group that organises EMNLP) in 2018.
Abstract:Deep generative models with latent variables have become a major focus of NLP research over the past several years. These models have been used both for generating text and as a way of learning latent representations of text for downstream tasks. While much previous work uses continuous latent variables, discrete variables are attractive because they are more interpretable and typically more space-efficient. In this talk we consider learning discrete latent variable models with Quantized Variational Autoencoders, and show how these can be ported to the task of opinion summarization. We provide a clustering interpretation of the quantized space and a novel extraction algorithm to discover popular opinions among hundreds of reviews, a significant step towards opinion summarization of practical scope. We further demonstrate how this approach enables controllable summarization without further training, by utilizing properties of the quantized space to extract aspect-specific summaries.
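For readers unfamiliar with quantized latent spaces, the core operation in a quantized (VQ-style) autoencoder is snapping each encoder output to its nearest codebook vector, which is what gives the space its clustering interpretation. The sketch below shows only that lookup step; the shapes and codebook size are illustrative assumptions, not the summarization model from the talk.

```python
import numpy as np

def quantize_latents(z: np.ndarray, codebook: np.ndarray):
    """Map each latent vector in z (n, d) to its nearest codebook entry (k, d)."""
    # Pairwise squared distances between latents and codebook entries: shape (n, k)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)          # discrete code index per latent
    return codes, codebook[codes]         # indices and the quantized vectors

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 16))              # 5 encoder outputs, 16-dim (assumed)
codebook = rng.normal(size=(32, 16))      # 32 code vectors (assumed)
codes, z_q = quantize_latents(z, codebook)
print(codes)                              # cluster-like assignments per input
```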
Title: Data-Efficient Cross-Lingual Natural Language Processing
Presenter: Barbara Plank
Bio:Barbara Plank is Professor in Computer Science and Head of the MSc in Data Science at ITU (IT University of Copenhagen). She received her PhD in Computational Linguistics from the University of Groningen. She is interested in research at the crossroads of Natural Language Processing, Machine Learning, Cognitive Science and Vision, with a particular interest in learning under limited supervision and language variation (language shifts, domain shifts, genre shifts, etc.). Barbara has co-authored over 90 publications, including four best paper awards. She currently holds a DFF Sapere Aude research leader grant from the Independent Research Fund Denmark. She has co-organized international workshops and conferences (including EurNLP, NoDaLiDa and workshops on domain adaptation, weak supervision and computational social science). Barbara is currently a member of the advisory board of the European Association for Computational Linguistics (EACL), ACL publicity director, and vice-president of the Northern European Association for Language Technology (NEALT).
Abstract:NLP today depends on huge amounts of unlabeled data; however, for many scenarios, including low-resource languages and language varieties, we do not have access to labeled resources, and even unlabeled data may be scarce. In this talk, I will focus on data-efficient cross-lingual NLP. On the one hand, I will outline methods for transferring models to low-resource languages. On the other hand, I will argue for broader evaluation in cross-lingual learning that includes dimensions of language variation [1]. I will showcase this with some of our recent work, which includes NLP for Danish [4,2], cross-lingual task-oriented dialogue [2], and exploring genre as a weak supervision signal for cross-lingual dependency parsing [3].
References:
[1] Barbara Plank. What to do about non-standard (or non-canonical) language in NLP. In KONVENS 2016. Bochum, Germany.
[2] Rob van der Goot, Ibrahim Sharaf, Aizhan Imankulova, Ahmet Üstün, Marija Stepanović, Alan Ramponi, Siti Oryza Khairunnisa, Mamoru Komachi and Barbara Plank. From Masked-Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding. In NAACL 2021.
[3] Max Müller-Eberstein, Rob van der Goot and Barbara Plank. Genre as Weak Supervision for Cross-lingual Dependency Parsing. In EMNLP 2021.
[4] Barbara Plank, Kristian Nørgaard Jensen and Rob van der Goot. DaN+: Danish Nested Named Entities and Lexical Normalization. In COLING 2020.
Title: From model compression to self-distillation: a review
Presenter: Samira Ebrahimi Kahou
Bio:Samira is an Associate Professor at École de technologie supérieure/Mila, Adjunct Professor at McGill and a Canada CIFAR AI Chair. Before joining ÉTS, she was a postdoctoral fellow working with Professor Doina Precup at McGill/Mila. Before her postdoc, she was a researcher at Microsoft Research Montréal. She received her Ph.D. from Polytechnique Montréal/Mila in 2016 under the supervision of Professor Chris Pal. During her Ph.D. at Mila, she worked with Rich Caruana on model compression applied to deep convolutional neural networks and on knowledge distillation.
Abstract:In this short talk, she presents some of the major milestones in model compression and knowledge distillation, starting with the seminal work of Buciluǎ et al. She also covers applications of knowledge distillation in cross-modal learning, few-shot learning, reinforcement learning and natural language processing.
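As a minimal illustration of the knowledge-distillation objective this line of work builds on (temperature-scaled soft targets from a teacher blended with the usual hard-label loss), here is a hedged PyTorch sketch; the temperature, mixing weight, and toy logits are assumptions, not settings from any particular paper covered in the talk.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL term against the teacher with standard cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # rescale to compensate for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage with random logits for a 3-class problem.
s = torch.randn(8, 3, requires_grad=True)
t = torch.randn(8, 3)
y = torch.randint(0, 3, (8,))
print(distillation_loss(s, t, y).item())
```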
Title: A versatile and efficient approach to summarize speech into utterance-level representations
Presenter: Joao B Monteiro
Authors:Joao B Monteiro (Institut National de la Recherche Scientifique)*; Jahangir Alam (Computer Research Institute of Montreal (CRIM), Montreal (Quebec) Canada); Tiago H Falk (INRS-EMT)
Abstract:Time delay neural networks (TDNN) have become ubiquitous for voice biometrics and language recognition tasks relying on utterance-level speaker- or language-dependent representations. In this paper, we discuss directions to improve upon the conventional TDNN architecture to render it more generally applicable. More specifically, we explore the utility of performing pooling operations across different levels of the convolutional stack and further propose an approach to efficiently combine such set of representations. We show that the resulting models are more versatile, in the sense that a fixed architecture can be re-used across different tasks, and learned representations are more discriminative. Evaluations are performed across two settings: (1) two sub-tasks for spoofing attack detection, and (2) three sub-tasks for spoken language identification. Results show the proposed design yielding improvements over the original TDNN architecture, as well as other previously proposed methods.
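A rough sketch of the general idea of pooling utterance-level statistics from several depths of a convolutional stack and concatenating them is given below; the layer count and channel sizes are made-up assumptions, and this is not the authors' exact architecture.

```python
import torch

def stats_pool(x: torch.Tensor) -> torch.Tensor:
    """Mean + std pooling over time for features shaped (batch, channels, time)."""
    return torch.cat([x.mean(dim=-1), x.std(dim=-1)], dim=-1)

# Pretend these are feature maps taken from three depths of a TDNN/conv stack.
feats = [torch.randn(4, 64, 200), torch.randn(4, 128, 100), torch.randn(4, 256, 50)]
utterance_vec = torch.cat([stats_pool(f) for f in feats], dim=-1)
print(utterance_vec.shape)  # (4, 2 * (64 + 128 + 256))
```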
Title: Towards Zero and Few-shot Knowledge-seeking Turn Detection in Task-orientated Dialogue Systems
Presenter: Di Jin
Authors:Di Jin (Amazon Alexa AI)*; Shuyang Gao (Amazon); Seokhwan Kim (Amazon Alexa AI); Yang Liu (Amazon, Alexa AI); Dilek Z Hakkani-Tur (Amazon Alexa AI)
Abstract:Most prior work on task-oriented dialogue systems is restricted to supporting domain APIs. However, users may have requests that are out of the scope of these APIs. This work focuses on identifying such user requests. Existing methods for this task mainly rely on fine-tuning pre-trained models on large annotated data. We propose a novel method, REDE, based on adaptive representation learning and density estimation. REDE can be applied to zero/few-shot cases and quickly learns a high-performing detector that is comparable to the full-supervision setting with only a few shots, by updating fewer than 3K parameters. We demonstrate REDE's competitive performance on the DSTC9 Track 1 dataset and our newly collected test set.
Title: Consistent Accelerated Inference via Confident Adaptive Transformers
Presenter: Tal Schuster
Authors:Tal Schuster (MIT CSAIL)*; Adam Fisch (MIT); Tommi Jaakkola (MIT); Regina Barzilay (MIT CSAIL)
Abstract:We develop a novel approach for confidently accelerating inference in the large and expensive multilayer Transformers that are now ubiquitous in natural language processing (NLP). Amortized or approximate computational methods increase efficiency, but can come with unpredictable performance costs. In this work, we present CATs (Confident Adaptive Transformers), in which we simultaneously increase computational efficiency while guaranteeing a specifiable degree of consistency with the original model with high confidence. Our method trains additional prediction heads on top of intermediate layers, and dynamically decides when to stop allocating computational effort to each input using a meta consistency classifier. To calibrate our early prediction stopping rule, we formulate a unique extension of conformal prediction. We demonstrate the effectiveness of this approach on four classification and regression tasks.
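The generic early-exit mechanism that CATs refines can be sketched as follows: intermediate prediction heads produce outputs at every layer, and computation stops once a confidence rule fires. The fixed threshold below is a placeholder assumption standing in for the paper's calibrated, conformal stopping rule.

```python
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    """Toy stack of layers, each with its own prediction head."""
    def __init__(self, dim=64, num_layers=6, num_classes=2):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        self.heads = nn.ModuleList([nn.Linear(dim, num_classes) for _ in range(num_layers)])

    def forward(self, x, threshold=0.9):
        for depth, (layer, head) in enumerate(zip(self.layers, self.heads), start=1):
            x = torch.relu(layer(x))
            probs = head(x).softmax(dim=-1)
            if probs.max().item() >= threshold:   # stop early if confident enough
                return probs, depth
        return probs, depth                        # otherwise use the final layer

model = EarlyExitEncoder()
probs, used_layers = model(torch.randn(1, 64))
print(f"exited after {used_layers} layers")
```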
Title: Communication-Efficient Federated Learning for Neural Machine Translation
Presenter: Tanya Roosta
Authors:Tanya G Roosta (Amazon)*; Peyman Passban (Amazon); Ankit R Chadha (Amazon)
Abstract:Training neural machine translation (NMT) models in federated learning (FL) settings could be inefficient both computationally and communication-wise, due to the large size of translation engines as well as the multiple rounds of updates required to train clients and a central server. In this paper, we explore how to efficiently build NMT models in an FL setup by proposing a novel solution. In order to reduce the communication overhead, out of all neural layers we only exchange what we term "Controller" layers. Controllers are a small number of additional neural components connected to our pre-trained architectures. These new components are placed in between the original layers. They act as liaisons to communicate with the central server and learn minimal information that is sufficient to update clients. We evaluated the performance of our models on five datasets from different domains, translating from German into English. We noted that models equipped with Controllers perform on par with those trained in a centralized, non-FL setting. In addition, we observed a substantial reduction in the communication traffic of the FL pipeline, which is a direct consequence of using Controllers. Based on our experiments, Controller-based models are 6 times less expensive than their peers. This reduction is particularly important when we consider the number of parameters in large models, and it becomes even more critical when such parameters need to be exchanged over multiple rounds in FL settings.
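The Controller idea is reminiscent of small trainable modules inserted between frozen pre-trained layers, so that only those modules need to be exchanged with the server. Below is a minimal adapter-style sketch of that general pattern; the class name, dimensions, and bottleneck size are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class Controller(nn.Module):
    """Small bottleneck module placed between frozen pre-trained layers."""
    def __init__(self, dim=512, bottleneck=32):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual bottleneck

frozen_layer = nn.Linear(512, 512)
for p in frozen_layer.parameters():
    p.requires_grad = False                             # pre-trained weights stay fixed

controller = Controller()
out = controller(frozen_layer(torch.randn(2, 512)))

# Only the controller's (small) state dict would be sent to the FL server.
payload = controller.state_dict()
print(sum(v.numel() for v in payload.values()), "parameters exchanged")
```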
Title: Dynamic-TinyBERT: Further Enhance the Inference Efficiency of TinyBERT by Dynamic Sequence Length
Presenter: Shira Guskin
Authors:Shira Guskin (Intel)*; Moshe Wasserblat (Intel); Ke Ding (Intel); Gyuwan Kim ()
Abstract:Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized. TinyBERT addresses computational efficiency by self-distilling BERT into a smaller transformer representation with fewer layers and smaller internal embeddings. However, TinyBERT's performance drops when we reduce the number of layers by 50%, and drops even more abruptly when we reduce the number of layers by 75% for advanced NLP tasks such as span question answering. Additionally, a separate model must be trained for each inference scenario with its distinct computational budget. In this work we present Dynamic-TinyBERT, a TinyBERT model that utilizes sequence-length reduction and hyperparameter optimization for enhanced inference efficiency under any computational budget. Dynamic-TinyBERT is trained only once, performing on par with BERT and achieving an accuracy-speedup trade-off superior to other efficient approaches (up to 3.3x speedup with <1% loss drop). Upon publication, the code to reproduce our work will be open-sourced.
Title: CTR-BERT: Cost-effective knowledge distillation for billion-parameter teacher models
Presenter: TBD
Authors:Aashiq Muhamed (Amazon)*; Iman Keivanloo (Amazon); Sujan Perera (Amazon); James A Mracek (Amazon); Yi Xu (Amazon); Qingjun Cui (Amazon); Santosh Rajagopalan (Amazon); Belinda Zeng (Amazon); Trishul A Chilimbi (Amazon)
Abstract:While pre-trained large language models (LLMs) like BERT have achieved state-of-the-art results in several NLP tasks, their performance on tasks with additional grounding, e.g., with numeric and categorical features, is less studied. In this paper, we study the application of pre-trained LLMs to click-through-rate (CTR) prediction for product advertisement in e-commerce. This is challenging because the model needs to a) learn from language as well as tabular data features, b) maintain low latency (<5 ms) at inference time, and c) adapt to a constantly changing advertisement distribution. We first show that scaling the pre-trained language model to 1.5 billion parameters significantly improves performance over conventional CTR baselines. We then present CTR-BERT, a novel lightweight, cache-friendly, factorized model for CTR prediction that consists of twin-structured BERT-like encoders for text with a late-fusion mechanism for text and tabular features. We train the CTR-BERT model using cross-architecture knowledge distillation (KD) and empirically study the interaction between KD and distribution shift in this setting, by a) experimenting with pre-training, distillation pre-finetuning and fine-tuning strategies, and b) factorizing features based on their distribution-shift time scales, which allows the model to readily adapt and be re-trained. Finally, we show that CTR-BERT significantly outperforms a traditional CTR baseline with a 2.3% relative ROC-AUC lift in offline experiments and a 2% CTR lift in an online experiment.
Title: Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models
Presenter: Robert Logan
Authors:Robert L Logan (UC Irvine)*; Ivana Balazevic (University of Edinburgh); Eric Wallace (U.C. Berkeley); Fabio Petroni (Facebook AI Research); Sameer Singh (University of California, Irvine); Sebastian Riedel ()
Abstract:Prompting language models (LMs) with training examples and task descriptions has been seen as critical to recent successes in few-shot learning. In this work, we show that finetuning LMs in the few-shot setting can considerably reduce the need for prompt engineering. In fact, one can use null prompts, prompts that contain neither task-specific templates nor training examples, and achieve competitive accuracy to manually-tuned prompts across a wide range of tasks. While finetuning LMs does introduce new parameters for each downstream task, we show that this memory overhead can be substantially reduced: finetuning only the bias terms can achieve comparable or better accuracy than standard finetuning while only updating 0.1% of the parameters. All in all, we recommend finetuning LMs for few-shot learning as it is more accurate, robust to different prompts, and can be made nearly as efficient as using frozen LMs.
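A minimal sketch of the bias-only finetuning recipe the abstract refers to, i.e., freezing every parameter except the bias terms, is shown below; the stand-in model is a small MLP rather than a pre-trained language model.

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))

trainable = 0
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")   # train bias terms only
    if param.requires_grad:
        trainable += param.numel()

total = sum(p.numel() for p in model.parameters())
print(f"updating {trainable}/{total} parameters ({100 * trainable / total:.2f}%)")
```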
Title: Weight, Block or Unit? Exploring Sparsity Tradeoffs for Speech Enhancement on Tiny Neural Accelerators
Presenter: Marko Stamenovic
Authors:Marko Stamenovic (Bose)*; Li-Chia Yang (Bose); Nils Westhausen (University of Oldenburg); Carl Jensen (Bose); Alex Pawlicki (Bose)
Abstract:We explore network sparsification strategies with the aim of compressing neural speech enhancement (SE) down to an optimal configuration for a new generation of low-power, microcontroller-based neural accelerators (microNPUs). We examine three unique sparsity structures: weight pruning, block pruning and unit pruning; and discuss their benefits and drawbacks when applied to SE. We focus on the interplay between computational throughput, memory footprint and model quality. Our method supports all three sparsity structures above and jointly learns integer quantized weights along with sparsity, alleviating the need for tedious manual fine-tuning. Additionally, we demonstrate offline magnitude-based pruning of integer quantized models as a performance baseline. Although efficient speech enhancement is an active area of research, our work is the first to apply block pruning to SE and the first to address SE model compression in the context of microNPUs. Using weight pruning, we show that we are able to compress an already compact model's memory footprint by a factor of 42X, from 3.7 MB to 87 kB, while only losing 0.1 dB SDR in performance. We also show a computational speedup of 6.7X with a corresponding SDR drop of only 0.59 dB using block pruning.
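As a hedged illustration of the simplest structure compared in the paper, the sketch below performs offline magnitude-based block pruning of a single weight matrix; the block size and sparsity level are arbitrary assumptions, and the paper's method additionally learns sparsity jointly with quantized weights.

```python
import numpy as np

def block_prune(w: np.ndarray, block=8, sparsity=0.5) -> np.ndarray:
    """Zero out the lowest-L2-norm (block x block) tiles of a weight matrix."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    tiles = w.reshape(rows // block, block, cols // block, block)
    norms = np.sqrt((tiles ** 2).sum(axis=(1, 3)))        # one norm per tile
    cutoff = np.quantile(norms, sparsity)                  # keep the largest tiles
    mask = (norms >= cutoff)[:, None, :, None]
    return (tiles * mask).reshape(rows, cols)

w = np.random.randn(64, 64).astype(np.float32)
pruned = block_prune(w)
print("fraction of zero weights:", float((pruned == 0).mean()))
```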
Title: How to Win LMs and Influence Predictions: Using Short Phrases to Control NLP Models
Presenter: Sameer Singh
Bio:Dr. Sameer Singh is an Associate Professor of Computer Science at the University of California, Irvine (UCI) and an Allen AI Fellow at the Allen Institute for AI. He works primarily on the robustness and interpretability of machine learning algorithms, along with models that reason with text and structure for natural language processing. Sameer was a postdoctoral researcher at the University of Washington and received his PhD from the University of Massachusetts, Amherst. He has received the NSF CAREER award, was selected as a DARPA Riser, and received the UCI Distinguished Early Career Faculty award and the Hellman Faculty Fellowship. His group has received funding from the Allen Institute for AI, Amazon, NSF, DARPA, Adobe Research, Hasso Plattner Institute, NEC, Base 11, and FICO. Sameer has published extensively at machine learning and natural language processing venues and received conference paper awards at KDD 2016, ACL 2018, EMNLP 2019, AKBC 2020, and ACL 2020. (https://sameersingh.org/)
Abstract:Current NLP pipelines rely significantly on finetuning large pre-trained language models. Relying on this paradigm makes such pipelines challenging to use in real-world settings since massive task-specific models are neither memory- nor inference-efficient, nor do we understand how they fare in adversarial settings. This talk will describe our attempts to address these seemingly unrelated concerns by investigating how specific short phrases in the input can control model behavior. These short phrases (which we call triggers) will help us identify model vulnerabilities and introduce new paradigms of training models.
In the first part of the talk, I will focus on the adversarial setting. I will show how easy it is for adversaries to craft triggers that cause a target model to misbehave when the trigger appears in the input. In the second part of the talk, I will show how these triggers can also be used to “prompt” language models to act as task-specific models, providing a negligible-memory, no-learning way to create classifiers. I will end with a comprehensive study of the interplay between prompting and finetuning, providing some guidelines for effectively performing few-shot learning with large language models.
Title: Benchmarks for Multi-objective Hyperparameter Optimization
Presenter: Kevin Duh
Bio:Kevin Duh is a senior research scientist at the Johns Hopkins University Human Language Technology Center of Excellence (JHU HLTCOE). He is also an assistant research professor in the Department of Computer Science and a member of the Center for Language and Speech Processing (CLSP). His research interests lie at the intersection of Natural Language Processing and Machine Learning, in particular on areas relating to machine translation, semantics, and deep learning. Previously, he was assistant professor at the Nara Institute of Science and Technology (2012-2015) and research associate at NTT CS Labs (2009-2012). He received his B.S. in 2003 from Rice University, and PhD in 2009 from the University of Washington, both in Electrical Engineering.
Abstract:The speed, size, and accuracy of deep neural networks often depend on hyperparameters such as network depth and architecture type. Hyperparameter optimization and neural architecture search are promising techniques that help developers build the best-possible network under budget constraints. I will discuss the importance of building benchmarks to evaluate these techniques in a multi-objective way. By incorporating multiple objectives such as training time, inference speed, and model size into hyperparameter optimization, we ensure a more holistic evaluation of the entire model development and deployment process.
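To make the multi-objective framing concrete, here is a small, hedged example of extracting the Pareto-optimal (non-dominated) hyperparameter configurations when every objective is to be minimized; the objective columns and numbers are invented for illustration.

```python
import numpy as np

def pareto_front(points: np.ndarray) -> np.ndarray:
    """Return a boolean mask of non-dominated rows (all objectives minimized)."""
    n = len(points)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        # Row i is dominated if some other row is <= everywhere and < somewhere.
        dominates_i = np.all(points <= points[i], axis=1) & np.any(points < points[i], axis=1)
        if dominates_i.any():
            keep[i] = False
    return keep

# Columns: (1 - accuracy, inference latency in ms, model size in MB) for 5 configs.
configs = np.array([
    [0.10, 120.0, 400.0],
    [0.12,  40.0, 150.0],
    [0.12,  60.0, 150.0],   # dominated by the previous row
    [0.20,  15.0,  60.0],
    [0.25,  10.0,  30.0],
])
print(np.where(pareto_front(configs))[0])   # indices of Pareto-optimal configs
```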
Title: NLP with Synthetic Text
Presenter: Mohammad Norouzi
Bio:Mohammad Norouzi is a staff research scientist on the Google Brain team in Toronto. He is interested in self-supervised representation learning, generative models, and the use of generative models in advancing machine learning. He has contributed to several recent works on diffusion probabilistic models and was a co-developer of Google's neural machine translation system and of SimCLR for learning visual representations.
Abstract:Synthetic data is successfully used to train powerful machine learning models for computer vision and robotics, thanks to the availability of high-fidelity graphics and physics-based simulation. But can synthetic data be successfully used to improve natural language processing? In this talk, I will advocate for the use of large language models as a great source of synthetic text. I will review recent work on data augmentation for NLP and describe a general framework for NLP with synthetic text, called “Generate, Annotate, and Learn”. I will highlight a few key results on generating unlabeled text to improve semi-supervised learning and knowledge distillation, in addition to advancing GPT-3-style few-shot learning.
Title: Toward Efficient Training of Large Language Models with Balanced Conditional Compute
Presenter: Luke Zettlemoyer
Bio:Luke Zettlemoyer is a Professor in the Paul G. Allen School of Computer Science & Engineering at the University of Washington and a Research Scientist at Meta. His research focuses on empirical methods for natural language semantics, and involves designing machine learning algorithms, introducing new tasks and datasets, and, most recently, studying how to best develop self-supervision signals for pre-training. Honors include multiple paper awards, a PECASE award, and an Allen Distinguished Investigator Award. Luke received his PhD from MIT and was a postdoc at the University of Edinburgh.
Abstract:The trend of building ever larger language models has dominated much research in NLP over the last few years. However, we have reached a point where dense compute is difficult to scale further, and there is a need for new, more efficient model architectures. In this talk, I will cover our recent efforts on learning sparse mixtures of experts (MoEs) models, which have new explicitly balanced control mechanisms for allocating conditional compute. This includes BASE Layers, where the routing of experts to tokens is algorithmically assigned to ensure balanced scaling across compute nodes, and DEMix Layers, where we introduce new modular approaches for deterministic expert routing based on metadata that specifies the domain of the input text. Overall, our sparse approaches have significantly reduced cross-node communication costs and could possibly provide the next big leap in performance, although finding a version that scales well in practice remains an open challenge.
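A toy sketch of the deterministic, metadata-based routing described for DEMix layers is given below: each input is dispatched to the feed-forward expert matching its domain id, so no learned (and potentially unbalanced) router is needed. The sizes, domain vocabulary, and module structure are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DomainRoutedFFN(nn.Module):
    """One feed-forward expert per domain; routing is read off the metadata."""
    def __init__(self, dim=64, num_domains=4):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_domains)
        ])

    def forward(self, x: torch.Tensor, domain_ids: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        for d, expert in enumerate(self.experts):
            idx = (domain_ids == d).nonzero(as_tuple=True)[0]
            if idx.numel() > 0:
                out[idx] = expert(x[idx])   # only this expert runs for domain d
        return out

layer = DomainRoutedFFN()
x = torch.randn(8, 64)
domains = torch.tensor([0, 1, 1, 2, 3, 0, 2, 3])   # e.g., news, web, code, dialogue
print(layer(x, domains).shape)
```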
Title: Why We Want Contrastive Learning in Language Models
Presenter: Danqi Chen
Bio:Danqi Chen is an assistant professor of computer science at Princeton University and co-leads the Princeton NLP Group. Her research focuses on representation learning, knowledge gathering & reasoning, and developing practical systems for question answering and information extraction. Before joining Princeton, Danqi worked as a visiting scientist at Facebook AI Research. She received her Ph.D. from Stanford University (2018) and B.E. from Tsinghua University (2012), both in Computer Science.
Abstract:Contrastive learning aims to learn representations such that similar samples stay close to each other while dissimilar ones are far apart. Recently, it has achieved great success in self-supervised learning of visual representations and even surpassed its supervised counterparts. In this talk, I will argue why contrastive learning may provide new solutions in language model pre-training and fine-tuning. I will first describe our recent work SimCSE on how contrastive learning can be used with pre-trained language models to produce universal sentence representations. And then, I will discuss why contrastive learning can potentially lead to better pre-trained representations. I hope this talk can shed light on some limitations of pre-trained language representations as well as why contrastive learning is a great idea to tackle these problems.
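A minimal sketch of the in-batch contrastive (InfoNCE-style) objective behind SimCSE appears below: two encodings of the same sentence should be more similar, under temperature-scaled cosine similarity, than encodings of the other sentences in the batch. The random vectors stand in for actual sentence embeddings, and the temperature is an assumed value.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, temperature=0.05):
    """z1[i] and z2[i] are two views of sentence i; other rows act as negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / temperature            # (batch, batch) cosine similarities
    labels = torch.arange(z1.size(0))          # the positive pair is on the diagonal
    return F.cross_entropy(sim, labels)

# Toy usage: in SimCSE the two views come from two dropout passes of the encoder.
z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
print(in_batch_contrastive_loss(z1, z2).item())
```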
Title: Battling with Larger Models through Grounding and Searching
Presenter: Yejin Choi
Bio:Yejin Choi is the Brett Helsel Professor at the Paul G. Allen School of Computer Science & Engineering at the University of Washington and also a senior research manager at AI2, overseeing the Mosaic project. Her research interests include commonsense knowledge and reasoning, neural language (de-)generation, language grounding with vision and perception, and AI for social good. She is a co-recipient of the ACL Test of Time award in 2021, the CVPR Longuet-Higgins Prize (test-of-time award) in 2021, the AAAI Outstanding Paper Award (best paper award) in 2020, the Borg Early Career Award (BECA) in 2018, the inaugural Alexa Prize Challenge in 2017, IEEE AI's 10 to Watch in 2016, and the Marr Prize (best paper award) at ICCV 2013. She received her Ph.D. in Computer Science from Cornell University and her BS in Computer Science and Engineering from Seoul National University in Korea.
Abstract:Scale appears to be the winning recipe in today's leaderboards. And yet, extreme-scale neural models are still brittle and make errors that are nonsensical or even counterintuitive. In this talk, I will discuss how smaller models developed in academia can still have an edge over larger industry-scale models, if powered with grounding and searching. First, I will present MERLOT (and RESERVE) that can learn neural script knowledge from complex multimodal data and achieve new SOTA over a dozen multimodal benchmarks. Next, I will discuss NeuralLogic (and NeuralLogic A*) search algorithms that can integrate logic constraints to language model decoding so that smaller unsupervised models can win over larger supervised models for various constrained generation tasks.
Title: Panel Discussion
Presenters: Pascal Poupart, Ali Ghodsi, Luke Zettlemoyer, Sameer Singh, Kevin Duh, Yejin Choi, Lu Hou
Bio:TBD
Abstract:TBD
Title: Best Papers and Closing Remarks
Presenter: Pascal Poupart & Ali Ghodsi
Bio:TBD
Abstract:TBD
Time | Title | Presenter
08:00 AM - 08:10 AM | Opening Speech | Pascal Poupart
08:10 AM - 08:50 AM | Continual Learning in Large-Scale Pre-Training | Xu Sun
08:50 AM - 09:30 AM | Efficient Multi-lingual Neural Machine Translation | Boxing Chen
09:30 AM - 10:10 AM | Compression and Acceleration of Pre-trained Language Models | Lu Hou
10:10 AM - 10:20 AM | Break |
10:20 AM - 11:00 AM | Summarization in Quantized Transformer Spaces | Mirella Lapata
11:00 AM - 11:40 AM | Data-Efficient Cross-Lingual Natural Language Processing | Barbara Plank
11:40 AM - 12:20 PM | From model compression to self-distillation: a review | Samira Ebrahimi Kahou
12:20 PM - 01:00 PM | Break |
12:20 PM - 01:20 PM | Poster session |
01:20 PM - 01:25 PM | A versatile and efficient approach to summarize speech into utterance-level representations | Joao B Monteiro
01:25 PM - 01:30 PM | Towards Zero and Few-shot Knowledge-seeking Turn Detection in Task-orientated Dialogue Systems | Di Jin
01:30 PM - 01:35 PM | Consistent Accelerated Inference via Confident Adaptive Transformers | Tal Schuster
01:35 PM - 01:40 PM | Communication-Efficient Federated Learning for Neural Machine Translation | Tanya Roosta
01:40 PM - 01:45 PM | Dynamic-TinyBERT: Further Enhance the Inference Efficiency of TinyBERT by Dynamic Sequence Length | Shira Guskin
01:45 PM - 01:50 PM | CTR-BERT: Cost-effective knowledge distillation for billion-parameter teacher models | TBD
01:50 PM - 01:55 PM | Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models | Robert Logan
01:55 PM - 02:00 PM | Weight, Block or Unit? Exploring Sparsity Tradeoffs for Speech Enhancement on Tiny Neural Accelerators | Marko Stamenovic
02:00 PM - 02:40 PM | How to Win LMs and Influence Predictions: Using Short Phrases to Control NLP Models | Sameer Singh
02:40 PM - 03:20 PM | Benchmarks for Multi-objective Hyperparameter Optimization | Kevin Duh
03:20 PM - 04:00 PM | NLP with Synthetic Text | Mohammad Norouzi
04:00 PM - 04:10 PM | Break |
04:10 PM - 04:50 PM | Toward Efficient Training of Large Language Models with Balanced Conditional Compute | Luke Zettlemoyer
04:50 PM - 05:30 PM | Why We Want Contrastive Learning in Language Models | Danqi Chen
05:30 PM - 06:10 PM | Battling with Larger Models through Grounding and Searching | Yejin Choi
06:10 PM - 06:15 PM | Break |
06:15 PM - 07:00 PM | Panel Discussion | Pascal Poupart, Ali Ghodsi, Luke Zettlemoyer, Sameer Singh, Kevin Duh, Yejin Choi, Lu Hou
07:00 PM - 07:10 PM | Best Papers and Closing Remarks | Pascal Poupart & Ali Ghodsi
07:10 PM - 08:00 PM | Poster session |
Organizers
Mehdi Rezagholizadeh
Huawei Noah's Ark Lab
Lili Mou
University of Alberta
Yue Dong
McGill University & MILA
Pascal Poupart
University of Waterloo
Ali Ghodsi
University of Waterloo
Qun Liu
Huawei Noah's Ark Lab
Volunteers
Khalil Bibi
Huawei Noah's Ark Lab
Anderson Avila
Huawei Noah's Ark Lab
Technical Committee
- Pascal Poupart (UoWaterloo)
- Kevin Duh (Johns Hopkins University)
- Wulong Liu (Huawei Noah's Ark Lab)
- Bang Liu (UoMontreal)
- Di Jin (Amazon Alexa AI)
- Hamidreza Mahyar (McMaster University)
- Lili Mou (UoAlberta)
- Peyman Passban (Amazon)
- Prasanna Parthasarathi (McGill & MILA)
- Vahid Partovi Nia (Huawei Noah's Ark Lab)
- Yue Dong (McGill & MILA)
- Ivan Kobyzev (Huawei Noah's Ark Lab)
- Jad Kabbara (McGill & MILA)
- Aref Jafari (UoWaterloo)
- Ahmad Rashid (Huawei Noah's Ark Lab)
- Shailza Jolly (TU Kaiserslautern)
- Md. Akmal Haidar (Nuance Communications)
- Jingjing Xu (ByteDance)
- Vasileios Lioutas (UoBritish Columbia (UBC))
- Anderson R. Avila (Huawei Noah's Ark Lab)
- Malik H. Altakrori (McGill & MILA)
- Ali Vahdat (Thomson Reuters)
- Fattane Zarrinkalam (Thomson Reuters)
- Makesh S Narsimhan (McGill & MILA)
- Borna Jafarpour (Thomson Reuters)
- Shohreh Shaghaghian (Thomson Reuters)
- Ehsan Kamalloo (UoAlberta)
- Ali Saheb Pasand (UoWaterloo)
- Abbas Ghaddar (Huawei Noah's Ark Lab)
- Mehrdad Ganjeh (Ernst & Young (EY))
- Mingxuan Wang (ByteDance)
- Tanya Roosta (Amazon)
- Soheila Samiee (BASF)
- Yimeng Wu (Huawei Noah's Ark Lab)
- Marzieh Tahaei (Huawei Noah's Ark Lab)
- Habib Hajimolahoseini (Huawei Technologies)
- Mohammad Salameh (Huawei Technologies)
- Kira Aveline Selby (UoWaterloo)
- Mohammed Senoussaoui (Fluent.ai)
- M. Sarria-Paja (Universidad Santiago de Cali)
- Puneeth Saladi (Huawei Noah's Ark Lab)
- Flávio Ávila (Verisk Analytics)
- Tal Schuster (MIT)
- Irene Li (Yale)
- Shentong Mo (Carnegie Mellon University)
- Alpana Agarwal (Thapar University)
- Vinay Kumar (Thapar University)
- Shivani Malhotra (TIET Patiala)
- Iman Keivanloo (Amazon)
- Aashiq Muhamed (Amazon)
- Robert L. Logan IV (UC Irvine)
- Patrick Xia (Johns Hopkins University)
- Moshe Wasserblat (Intel)
- Guy Boudoukh (Intel)
- Ankit Chadha (Amazon)
- Khalil Bibi (Huawei Noah's Ark Lab)
- David Alfonso Hermelo (Huawei Noah's Ark Lab)
Sponsor