7 Leading Language Models For NLP In 2020

The introduction of transfer learning and pretrained language models in natural language processing (NLP) pushed forward the limits of language understanding and generation. Transfer learning and applying transformers to different downstream NLP tasks have become the main trend of the latest research advances.

At the same time, there is controversy in the NLP community regarding the research value of the huge pretrained language models occupying the leaderboards. While many AI experts agree with Anna Rogers’s statement that getting state-of-the-art results merely by using more data and computing power is not research news, other NLP opinion leaders point to positive aspects of the current trend, such as the possibility of exposing the fundamental limitations of the existing paradigm.

In any case, the latest improvements in NLP language models seem to be driven not only by massive boosts in computing capacity but also by the discovery of ingenious ways to lighten models while maintaining high performance.

To help you stay up to date with the latest breakthroughs in language modeling, we’ve summarized research papers featuring the key language models introduced recently.

Subscribe to our AI Research mailing list at the bottom of this article to be alerted when we release new summaries.

If you’d like to skip around, here are the papers we featured:

  1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  2. Language Models Are Unsupervised Multitask Learners
  3. XLNet: Generalized Autoregressive Pretraining for Language Understanding
  4. RoBERTa: A Robustly Optimized BERT Pretraining Approach
  5. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
  6. StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding
  7. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Important Pretrained Language Models

1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova

Original Abstract

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%.

Our Summary

A Google AI team presents a new cutting-edge model for Natural Language Processing (NLP) – BERT, or Bidirectional Encoder Representations from Transformers. Its design allows the model to consider the context from both the left and the right sides of each word. While being conceptually simple, BERT obtains new state-of-the-art results on eleven NLP tasks, including question answering, named entity recognition and other tasks related to general language understanding.

Top NLP Research Papers of 2018 Summarized By Mariya Yao TOPBOTS

What’s the core idea of this paper?

  • Training a deep bidirectional model by randomly masking a percentage of input tokens – thus avoiding cycles where words can indirectly “see themselves” (a minimal sketch of the masking procedure follows this list).
  • Also pre-training a sentence relationship model by building a simple binary classification task to predict whether sentence B immediately follows sentence A, thus allowing BERT to better understand relationships between sentences.
  • Training a very big model (24 Transformer blocks, 1024-hidden, 340M parameters) with lots of data (3.3 billion word corpus).
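
To make the two pre-training objectives more concrete, here is a minimal, illustrative Python sketch of how training examples might be constructed. It follows the masking scheme described in the BERT paper (about 15% of tokens selected as targets; 80% of those replaced with [MASK], 10% with a random token, 10% left unchanged), but the helper names and toy vocabulary are our own simplifications, not the released BERT code.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary for random replacements

def mask_tokens(tokens, mask_prob=0.15):
    """BERT-style masking: select ~15% of positions as prediction targets, then
    replace 80% of them with [MASK], 10% with a random token, and keep 10% unchanged."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = token                    # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK
            elif r < 0.9:
                inputs[i] = random.choice(VOCAB)
            # else: leave the token unchanged on purpose
    return inputs, labels

def next_sentence_example(sent_a, sent_b, corpus_sentences):
    """Next-sentence prediction: keep the true next sentence half the time (label 1),
    otherwise pair sentence A with a random sentence from the corpus (label 0)."""
    if random.random() < 0.5:
        return (sent_a, sent_b), 1
    return (sent_a, random.choice(corpus_sentences)), 0

tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(mask_tokens(tokens))
```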

What’s the key achievement?

  • Advancing the state-of-the-art for 11 NLP tasks, including:
    • getting a GLUE score of 80.4%, a 7.6% absolute improvement over the previous best result;
    • achieving an F1 score of 93.2 on SQuAD 1.1 and outperforming human performance by 2%.
  • Suggesting a pre-trained model, which doesn’t require any substantial architecture modifications to be applied to specific NLP tasks.

What does the AI community think?

What are future research areas?

  • Testing the method on a wider range of tasks.
  • Investigating the linguistic phenomena that may or may not be captured by BERT.

What are possible business applications?

  • BERT may assist businesses with a wide range of NLP problems, including:
    • chatbots for better customer experience;
    • analysis of customer reviews;
    • the search for relevant information, etc.

Where can you get implementation code?

2. Language Models Are Unsupervised Multitask Learners, by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever

Original Abstract

Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset – matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

Our Summary

In this paper, the OpenAI team demonstrates that pre-trained language models can be used to solve downstream tasks without any parameter or architecture modifications. They have trained a very big model, a 1.5B-parameter Transformer, on a large and diverse dataset that contains text scraped from 45 million webpages. The model generates coherent paragraphs of text and achieves promising, competitive or state-of-the-art results on a wide variety of tasks.

What’s the core idea of this paper?

  • Training the language model on the large and diverse dataset:
    • selecting webpages that have been curated/filtered by humans;
    • cleaning and de-duplicating the texts, and removing all Wikipedia documents to minimize overlapping of training and test sets;
    • using the resulting WebText dataset with slightly over 8 million documents for a total of 40 GB of text.
  • Using a byte-level version of Byte Pair Encoding (BPE) for input representation.
  • Building a very big Transformer-based model, GPT-2:
    • the largest model includes 1542M parameters and 48 layers;
    • the model largely follows the OpenAI GPT architecture with a few modifications (e.g., expanded vocabulary and context size, modified initialization, etc.).

What’s the key achievement?

  • Getting state-of-the-art results on 7 out of 8 tested language modeling datasets.
  • Showing quite promising results in commonsense reasoning, question answering, reading comprehension, and translation.
  • Generating coherent texts, for example, a news article about the discovery of talking unicorns.

What does the AI community think?

  • “The researchers built an interesting dataset, applying now-standard tools and yielding an impressive model.” – Zachary C. Lipton, an assistant professor at Carnegie Mellon University.

What are future research areas?

  • Investigating fine-tuning on benchmarks such as decaNLP and GLUE to see whether the huge dataset and capacity of GPT-2 can overcome the inefficiencies of unidirectional representations demonstrated by BERT.

What are possible business applications?

  • In terms of practical applications, the performance of the GPT-2 model without any fine-tuning is far from usable but it shows a very promising research direction.

Where can you get implementation code?

  • Initially, OpenAI decided to release only a smaller version of GPT-2 with 117M parameters. The decision not to release larger models was taken “due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale”.
  • In November, OpenAI finally released its largest 1.5B-parameter model. The code is available here.
  • Hugging Face has introduced a PyTorch implementation of the initially released GPT-2 model (a minimal text-generation sketch using this library follows this list).
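
For readers who want to try the released checkpoints, below is a minimal text-generation sketch that assumes the Hugging Face transformers library mentioned above is installed; the model name "gpt2" refers to the smallest publicly hosted checkpoint, and the prompt is just an example.

```python
# Requires: pip install transformers torch
from transformers import pipeline

# "gpt2" is the smallest checkpoint hosted by Hugging Face; larger variants such as
# "gpt2-xl" (the 1.5B-parameter model) can be substituted if enough memory is available.
generator = pipeline("text-generation", model="gpt2")

prompt = "In a shocking finding, scientists discovered a herd of unicorns"
outputs = generator(prompt, max_length=60, num_return_sequences=1, do_sample=True)
print(outputs[0]["generated_text"])
```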

3. XLNet: Generalized Autoregressive Pretraining for Language Understanding, by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le

Original Abstract

With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks including question answering, natural language inference, sentiment analysis, and document ranking.

Our Summary

The researchers from Carnegie Mellon University and Google have developed a new model, XLNet, for natural language processing (NLP) tasks such as reading comprehension, text classification, sentiment analysis, and others. XLNet is a generalized autoregressive pretraining method that leverages the best of both autoregressive language modeling (e.g., Transformer-XL) and autoencoding (e.g., BERT) while avoiding their limitations. The experiments demonstrate that the new model outperforms both BERT and Transformer-XL and achieves state-of-the-art performance on 18 NLP tasks.

TOP NLP 2019 - XLNet

What’s the core idea of this paper?

  • XLNet combines the bidirectional capability of BERT with the autoregressive technology of Transformer-XL:
    • Like BERT, XLNet uses a bidirectional context, which means it looks at the words before and after a given token to predict what it should be. To this end, XLNet maximizes the expected log-likelihood of a sequence with respect to all possible permutations of the factorization order (see the sketch after this list).
    • As an autoregressive language model, XLNet doesn’t rely on data corruption, and thus avoids BERT’s limitations due to masking – i.e., pretrain-finetune discrepancy and the assumption that unmasked tokens are independent of each other.
  • To further improve architectural designs for pretraining, XLNet integrates the segment recurrence mechanism and relative encoding scheme of Transformer-XL.
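
The permutation objective can be illustrated with a toy NumPy sketch of the attention mask it induces: under one sampled factorization order, each position may only attend to the positions that come earlier in that order. This is a simplification for intuition only; it omits XLNet's two-stream attention and Transformer-XL recurrence, and all names are ours, not the authors' code.

```python
import numpy as np

def permutation_attention_mask(seq_len, seed=0):
    """For one sampled factorization order z, the token at position z[t] may only
    attend to the positions that come earlier in z. Averaged over many sampled
    orders, every position gets to see both its left and right context."""
    rng = np.random.default_rng(seed)
    z = rng.permutation(seq_len)                   # a random factorization order
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for t in range(seq_len):
        for s in range(t):
            mask[z[t], z[s]] = True                # position z[t] can attend to z[s]
    return z, mask

order, mask = permutation_attention_mask(5)
print("sampled factorization order:", order)
print(mask.astype(int))                            # row i: which positions token i may attend to
```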

What’s the key achievement?

  • XLNet outperforms BERT on 20 tasks, often by a large margin.
  • The new model achieves state-of-the-art performance on 18 NLP tasks including question answering, natural language inference, sentiment analysis, and document ranking.

What does the AI community think?

  • The paper was accepted for oral presentation at NeurIPS 2019, the leading conference in artificial intelligence.
  • “The king is dead. Long live the king. BERT’s reign might be coming to an end. XLNet, a new model by people from CMU and Google outperforms BERT on 20 tasks.” – Sebastian Ruder, a research scientist at DeepMind.
  • “XLNet will probably be an important tool for any NLP practitioner for a while…[it is] the latest cutting-edge technique in NLP.” – Keita Kurita, Carnegie Mellon University.

What are future research areas?

  • Extending XLNet to new areas, such as computer vision and reinforcement learning.

What are possible business applications?

  • XLNet may assist businesses with a wide range of NLP problems, including:
    • chatbots for first-line customer support or answering product inquiries;
    • sentiment analysis for gauging brand awareness and perception based on customer reviews and social media;
    • the search for relevant information in document bases or online, etc.

Where can you get implementation code?

4. RoBERTa: A Robustly Optimized BERT Pretraining Approach, by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov

Original Abstract

Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

Our Summary

Natural language processing models have made significant advances thanks to the introduction of pretraining methods, but the computational expense of training has made replication and careful hyperparameter tuning difficult. In this study, Facebook AI and University of Washington researchers analyzed the training of Google’s Bidirectional Encoder Representations from Transformers (BERT) model and identified several changes to the training procedure that enhance its performance. Specifically, the researchers used a new, larger dataset for training, trained the model over far more iterations, and removed the next sentence prediction training objective. The resulting optimized model, RoBERTa (Robustly Optimized BERT Approach), matched the scores of the recently introduced XLNet model on the GLUE benchmark.

What’s the core idea of this paper?

  • The Facebook AI research team found that BERT was significantly undertrained and suggested an improved recipe for its training, called RoBERTa:
    • More data: 160GB of text instead of the 16GB dataset originally used to train BERT.
    • Longer training: increasing the number of iterations from 100K to 300K and then further to 500K.
    • Larger batches: 8K instead of 256 in the original BERT base model.
    • Larger byte-level BPE vocabulary with 50K subword units instead of character-level BPE vocabulary of size 30K.
    • Removing the next sentence prediction objective from the training procedure.
    • Dynamically changing the masking pattern applied to the training data (see the sketch after this list).
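
The difference between BERT's static masking and RoBERTa's dynamic masking can be shown with a short Python sketch; the simplified masking function and toy corpus below are our own illustration, not the released fairseq code.

```python
import random

def mask_once(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Simplified BERT-style masking: randomly replace ~15% of tokens with [MASK]."""
    return [mask_token if random.random() < mask_prob else t for t in tokens]

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["language", "models", "keep", "getting", "bigger"]]

# Static masking (original BERT): each sequence is masked once during preprocessing,
# so the model sees the same masked pattern in every epoch.
static_corpus = [mask_once(seq) for seq in corpus]

# Dynamic masking (RoBERTa): a fresh mask is generated every time a sequence is fed
# to the model, so each epoch exposes a different pattern of masked tokens.
for epoch in range(3):
    batch = [mask_once(seq) for seq in corpus]     # re-masked on the fly
    print(f"epoch {epoch}:", batch[0])
```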

What’s the key achievement?

  • RoBERTa outperforms BERT in all individual tasks on the General Language Understanding Evaluation (GLUE) benchmark.
  • The new model matches the recently introduced XLNet model on the GLUE benchmark and sets a new state of the art in four out of nine individual tasks.

What are future research areas?

  • Incorporating more sophisticated multi-task finetuning procedures.

What are possible business applications?

  • Big pretrained language frameworks like RoBERTa can be leveraged in the business setting for a wide range of downstream tasks, including dialogue systems, question answering, document classification, etc.

Where can you get implementation code?

  • The models and code used in this study are available on GitHub.

5. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut

Original Abstract

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.

Our Summary

The Google Research team addresses the problem of the continuously growing size of the pretrained language models, which results in memory limitations, longer training time, and sometimes unexpectedly degraded performance. Specifically, they introduce A Lite BERT (ALBERT) architecture that incorporates two parameter-reduction techniques: factorized embedding parameterization and cross-layer parameter sharing. In addition, the suggested approach includes a self-supervised loss for sentence-order prediction to improve inter-sentence coherence. The experiments demonstrate that the best version of ALBERT sets new state-of-the-art results on GLUE, RACE, and SQuAD benchmarks while having fewer parameters than BERT-large.

What’s the core idea of this paper?

  • It is not feasible to keep improving language models simply by making them larger, because of the memory limitations of available hardware, longer training times, and unexpected degradation of model performance as the number of parameters grows.
  • To address this problem, the researchers introduce the ALBERT architecture, which incorporates two parameter-reduction techniques (both sketched after this list):
    • factorized embedding parameterization, where the size of the hidden layers is separated from the size of vocabulary embeddings by decomposing the large vocabulary-embedding matrix into two small matrices;
    • cross-layer parameter sharing to prevent the number of parameters from growing with the depth of the network.
  • The performance of ALBERT is further improved by introducing the self-supervised loss for sentence-order prediction to address BERT’s limitations with regard to inter-sentence coherence.
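
A minimal PyTorch sketch of the two parameter-reduction ideas is shown below, with toy sizes chosen for illustration; it is not the released ALBERT implementation, and it assumes a reasonably recent PyTorch version (for the batch_first option of nn.TransformerEncoderLayer).

```python
import torch
import torch.nn as nn

V, E, H, L = 30000, 128, 768, 12   # vocab size, embedding size, hidden size, number of layers

class TinyALBERTEncoder(nn.Module):
    """Toy encoder illustrating ALBERT's two parameter-reduction techniques."""
    def __init__(self):
        super().__init__()
        # 1) Factorized embedding parameterization: a V x E lookup table followed by
        #    an E x H projection instead of a single V x H embedding matrix.
        self.embed = nn.Embedding(V, E)            # V * E parameters
        self.project = nn.Linear(E, H)             # roughly E * H parameters
        # 2) Cross-layer parameter sharing: one Transformer layer reused at every depth.
        self.shared_layer = nn.TransformerEncoderLayer(d_model=H, nhead=12, batch_first=True)

    def forward(self, token_ids):
        x = self.project(self.embed(token_ids))
        for _ in range(L):                         # the same weights are applied L times
            x = self.shared_layer(x)
        return x

print("factorized embedding parameters:", V * E + E * H)    # ~3.9M
print("unfactorized V x H embedding:   ", V * H)            # ~23M
model = TinyALBERTEncoder()
print(model(torch.randint(0, V, (2, 16))).shape)            # torch.Size([2, 16, 768])
```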

What’s the key achievement?

  • With the introduced parameter-reduction techniques, the ALBERT configuration with 18× fewer parameters and 1.7× faster training compared to the original BERT-large model achieves only slightly worse performance.
  • The much larger ALBERT configuration, which still has fewer parameters than BERT-large, outperforms all of the current state-of-the-art language models, achieving:
    • 89.4% accuracy on the RACE benchmark;
    • 89.4 score on the GLUE benchmark; and
    • an F1 score of 92.2 on the SQuAD 2.0 benchmark.

What does the AI community think?

  • The paper has been submitted to ICLR 2020 and is available on the OpenReview forum, where you can see the reviews and comments of NLP experts. The reviewers are mainly very appreciative of the presented paper.

What are future research areas?

  • Speeding up training and inference through methods like sparse attention and block attention.
  • Further improving the model performance through hard example mining, more efficient model training, and other approaches.

What are possible business applications?

  • The ALBERT language model can be leveraged in the business setting to improve performance on a wide range of downstream tasks, including chatbot performance, sentiment analysis, document mining, and text classification.

Where can you get implementation code?

  • The original implementation of ALBERT is available on GitHub.
  • A TensorFlow implementation of ALBERT is also available here.
  • A PyTorch implementation of ALBERT can be found here and here.

6. StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding, by Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Jiangnan Xia, Liwei Peng, Luo Si

Original Abstract 

Recently, the pre-trained language model, BERT (and its robustly optimized version RoBERTa), has attracted a lot of attention in natural language understanding (NLU), and achieved state-of-the-art accuracy in various NLU tasks, such as sentiment classification, natural language inference, semantic textual similarity and question answering. Inspired by the linearization exploration work of Elman [8], we extend BERT to a new model, StructBERT, by incorporating language structures into pre-training. Specifically, we pre-train StructBERT with two auxiliary tasks to make the most of the sequential order of words and sentences, which leverage language structures at the word and sentence levels, respectively. As a result, the new model is adapted to different levels of language understanding required by downstream tasks. The StructBERT with structural pre-training gives surprisingly good empirical results on a variety of downstream tasks, including pushing the state-of-the-art on the GLUE benchmark to 89.0 (outperforming all published models), the F1 score on SQuAD v1.1 question answering to 93.0, the accuracy on SNLI to 91.7.

Our Summary 

The Alibaba research team suggests extending BERT to a new StructBERT language model by leveraging word-level and sentence-level ordering. To capture the linguistic structures during the pre-training procedure, they extend the BERT model with the word structural objective and the sentence structural objective. As a result, the StructBERT model is forced to reconstruct the right order of words and sentences. The experiments demonstrate that the introduced model significantly advances the state-of-the-art results on a variety of natural language understanding tasks, including sentiment analysis and question answering.

StructBERT

What’s the core idea of this paper?

  • The researchers introduce the StructBERT model, which builds upon the BERT architecture, a multi-layer bidirectional Transformer network:
    • The suggested model extends BERT’s masked LM task by shuffling a certain number of tokens after word masking and requiring the model to predict their original order.
    • Furthermore, the model randomly shuffles the sentence order and predicts the next and the previous sentence as a new sentence prediction task.
    • The two auxiliary objectives are pre-trained together with the original masked LM objective in a unified model (see the sketch after this list).
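
Below is an illustrative Python sketch of how training examples for the two auxiliary objectives could be constructed; the span length, helper names, and toy sentence are our own assumptions based on the description above, not the authors' code.

```python
import random

def word_structural_example(tokens, span=3):
    """Word structural objective (simplified): shuffle one short span of tokens and
    ask the model to reconstruct the original order of that span."""
    inputs = list(tokens)
    start = random.randrange(0, len(tokens) - span + 1)
    window = inputs[start:start + span]
    random.shuffle(window)
    inputs[start:start + span] = window
    targets = tokens[start:start + span]           # the correct original order
    return inputs, (start, targets)

def sentence_structural_example(doc, i):
    """Sentence structural objective (simplified): pair sentence i with its next
    sentence, its previous sentence, or a random one, and label which case it is.
    Assumes 0 < i < len(doc) - 1."""
    choice = random.choice(["next", "previous", "random"])
    if choice == "next":
        pair = (doc[i], doc[i + 1])
    elif choice == "previous":
        pair = (doc[i], doc[i - 1])
    else:
        pair = (doc[i], random.choice(doc))
    return pair, choice

sentence = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
print(word_structural_example(sentence))
```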

What’s the key achievement?

  • StructBERT from Alibaba achieves state-of-the-art performance on different NLP tasks:
    • On the GLUE benchmark, it surpassed all published models on the average score and achieved the best results in 6 out of the 9 tasks.
    • On the SNLI dataset, StructBERT outperformed all existing approaches with a new state-of-the-art result of 91.7%.
    • On the SQuAD 1.1 question answering benchmark, the new model outperformed all published models except for XLNet with data augmentation.

What does the AI community think?

What are possible business applications?

  • Like other pretrained language models, StructBERT may assist businesses with a variety of NLP tasks, including question answering, sentiment analysis, document summarization, etc.

7. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

Original Abstract 

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.

Our Summary 

The Google research team suggests a unified approach to transfer learning in NLP, with the goal of setting a new state of the art in the field. To this end, they propose treating each NLP problem as a “text-to-text” problem. Such a framework allows using the same model, objective, training procedure, and decoding process for different tasks, including summarization, sentiment analysis, question answering, and machine translation. The researchers call their model a Text-to-Text Transfer Transformer (T5) and train it on a large corpus of web-scraped data, achieving state-of-the-art results on a number of NLP tasks.

T5 language model

What’s the core idea of this paper?

  • The paper has several important contributions:
    • Providing a comprehensive perspective on where the NLP field stands by exploring and comparing existing techniques.
    • Introducing a new approach to transfer learning in NLP by suggesting treating every NLP problem as a text-to-text task:
      • The model understands which task it should perform thanks to a task-specific prefix added to the original input sentence (e.g., “translate English to German:”, “summarize:”); see the usage sketch after this list.
    • Presenting and releasing a new dataset consisting of hundreds of gigabytes of clean web-scraped English text, the Colossal Clean Crawled Corpus (C4).
    • Training a large (up to 11B parameters) model, called Text-to-Text Transfer Transformer (T5) on the C4 dataset.
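
The text-to-text interface is easy to try with the publicly released checkpoints; the sketch below assumes the Hugging Face transformers library (plus sentencepiece) and the "t5-small" checkpoint, which are not part of the paper itself.

```python
# Requires: pip install transformers torch sentencepiece
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is expressed as plain input text with a task prefix, and the model's
# answer comes back as plain text decoded with the same tokenizer.
prompts = [
    "translate English to German: The house is wonderful.",
    "summarize: The Google research team proposes treating every NLP problem as a "
    "text-to-text task, so one model can handle translation, summarization, and more.",
]
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```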

What’s the key achievement?

  • The T5 model with 11 billion parameters achieved state-of-the-art performance on 17 out of 24 tasks considered, including:
    • a GLUE score of 89.7 with substantially improved performance on CoLA, RTE, and WNLI tasks;
    • an Exact Match score of 90.06 on the SQuAD dataset;
    • a SuperGLUE score of 88.9, which is a very significant improvement over the previous state-of-the-art result (84.6) and very close to human performance (89.8);
    • a ROUGE-2-F score of 21.55 on the CNN/Daily Mail abstractive summarization task.

What are future research areas?

  • Researching the methods to achieve stronger performance with cheaper models.
  • Exploring more efficient knowledge extraction techniques.
  • Further investigating the language-agnostic models.

What are possible business applications?

  • Even though the introduced model has billions of parameters and can be too heavy to be applied in the business setting, the presented ideas can be used to improve the performance on different NLP tasks, including summarization, question answering, and sentiment analysis.

Where can you get implementation code?

  • The pretrained models together with the dataset and code are released on GitHub.


Enjoy this article? Sign up for more AI research updates.

We’ll let you know when we release more summary articles like this one.

Source: https://www.topbots.com/leading-nlp-language-models-2020/?utm_source=rss&utm_medium=rss&utm_campaign=leading-nlp-language-models-2020
