
XLNet

A Julia-based implementation of XLNet: Generalized Autoregressive Pretraining for Language Understanding, built on Flux and JuliaText.


What is XLNet?

XLNet is a generalized autoregressive pretraining method for language understanding. The XLNet paper combines recent advances in NLP with innovative choices in how the language modelling problem is approached. When trained on a very large NLP corpus, the model achieves state-of-the-art performance on the standard NLP tasks that make up the GLUE benchmark.

XLNet is an auto-regressive language model which outputs the joint probability of a sequence of tokens based on the Transformer architecture with recurrence. Its training objective calculates the probability of a word token conditioned on all permutations of word tokens in a sentence, as opposed to just those to the left or just those to the right of the target token.
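
For context only (standard notation, not specific to this repository), a conventional left-to-right autoregressive language model maximizes the log-likelihood of the forward factorization:

```math
\log p_\theta(\mathbf{x}) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
```

XLNet keeps this autoregressive form but, as described in the permutation language modelling section below, takes the expectation of this likelihood over random permutations of the factorization order.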

What makes XLNet so special?

XLNet was proposed in 2019 by researchers at Carnegie Mellon University and Google Brain. Since:

  • autoregressive language models (e.g. GPT-2) are trained to encode only a unidirectional context and are therefore not effective at modeling deep bidirectional contexts, and
  • autoencoding models (e.g. BERT) suffer from a pre-train/fine-tune discrepancy caused by the artificial [MASK] tokens,

XLNet borrows ideas from both types of objectives while avoiding their limitations.

The result is a new objective called Permutation Language Modeling (PLM). By permuting the factorization order during training, bidirectional context information can be captured, making XLNet a generalized, order-aware autoregressive language model. No masking is required, so the dependencies among predicted tokens, which BERT's [MASK]-based objective ignores, are preserved. In addition, XLNet introduces two-stream self-attention to solve the problem that the standard parameterization would otherwise reduce the model to a bag-of-words predictor.

Additionally, XLNet employs Transformer-XL as its backbone, which gives it strong performance on language tasks involving long contexts.

Two versions of the XLNet model have been released:

  1. XLNet-Large, Cased: 24-layer, 1024-hidden, 16-heads
  2. XLNet-Base, Cased: 12-layer, 768-hidden, 12-heads

Their settings mirror those of the corresponding BERT models.
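
For illustration only, the two configurations can be written down as plain Julia named tuples; the field names below are hypothetical and are not identifiers exposed by this package:

```julia
# Hypothetical summary of the two released XLNet checkpoints.
# Field names are illustrative only, not part of this package's API.
xlnet_base  = (n_layer = 12, d_model = 768,  n_head = 12)
xlnet_large = (n_layer = 24, d_model = 1024, n_head = 16)
```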

XLNet (paper abstract): Empirically, XLNet outperforms BERT on 20 tasks and achieves state-of-the-art results on 18 tasks.


1. XLNet benefits from Auto-Regression and Auto-Encoding models

Autoregressive models such as GPT-2 keep a principled factorization of the joint probability and see the same kind of input at pre-training and fine-tuning time, but they condition each prediction on a one-directional context only. Autoencoding models such as BERT condition on context from both sides of the target, but they rely on artificial [MASK] tokens that never appear during fine-tuning and they predict the masked tokens independently of one another. XLNet's permutation objective keeps the autoregressive factorization while, in expectation over factorization orders, conditioning each prediction on context from both directions, so it combines the benefits of the two families.


2. Permutation Language Modelling

Specifically, for a sequence x of length T, there are T! different orders in which to perform a valid autoregressive factorization. Intuitively, if model parameters are shared across all factorization orders, then in expectation the model learns to gather information from all positions on both sides of each token. Let P_T be the set of all possible permutations of the index sequence [1, 2, …, T], and let z_t and z<t denote the t-th element and the first t−1 elements of a permutation z ∈ P_T. The permutation language modeling objective is then to maximize, over the model parameters θ, the expected log-likelihood E_{z ∼ P_T} [ Σ_{t=1}^{T} log p_θ(x_{z_t} | x_{z<t}) ]. For instance, assume we have an input sequence {I love my dog}.

In the upper left plot of the above figure, the factorization order is {3, 2, 4, 1}, so the probability of the sequence factorizes as p(x) = p(x_3) · p(x_2 | x_3) · p(x_4 | x_3, x_2) · p(x_1 | x_3, x_2, x_4).

  • For the third token, {my}: it cannot use the information of any other token, since it comes first in the factorization order, so only one arrow, from the starting token, points to the third token in the plot.

In the upper right plot of the figure, the factorization order is {2, 4, 3, 1}, so the probability of the sequence factorizes as p(x) = p(x_2) · p(x_4 | x_2) · p(x_3 | x_2, x_4) · p(x_1 | x_2, x_4, x_3).

  • Here, the third token, {my}, can use the information of the second and fourth tokens, because it comes after them in the factorization order; correspondingly, it cannot use the information of the first token. So in the plot, in addition to the arrow from the starting token, there are arrows from the second and fourth tokens pointing to the third token. The remaining two plots in the figure are interpreted the same way.

During training, for a fixed factorization order, XLNet is a unidirectional language model based on the Transformer decoder, and training proceeds as usual. But different factorization orders make the model see the words of a sentence in different orders, so although the model is unidirectional for any single order, it also learns bidirectional information about the sentence.

It is noteworthy that the sequence order is not actually shuffled; only the attention masks are changed to reflect the factorization order. With PLM, XLNet can model bidirectional context as well as the dependencies among the tokens of the sequence.
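
As a minimal sketch of this idea (assuming nothing about this repository's actual implementation), the factorization order can be turned into a boolean attention mask over the unshuffled sequence, where position i may attend to position j only if j comes earlier in the order:

```julia
# Minimal sketch, not this repository's implementation:
# build a permutation attention mask for a sequence of length T.
# mask[i, j] == true means position i may attend to position j.
function permutation_mask(order::Vector{Int})
    T = length(order)
    rank = similar(order)        # rank[i] = position of token i in the factorization order
    for (k, idx) in enumerate(order)
        rank[idx] = k
    end
    return [rank[j] < rank[i] for i in 1:T, j in 1:T]
end

# Factorization order {3, 2, 4, 1} for the example sequence {I, love, my, dog}:
# row 3 is all false, so token 3 ({my}) attends to no other token, as in the upper-left plot.
permutation_mask([3, 2, 4, 1])
```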


3. Two-Stream Self-Attention with Target-Aware Representation

In permutation language modeling, the representation used to predict the token at position z_t must encode the target position z_t but must not see the target's content x_{z_t}, while later predictions still need access to that content. A single set of hidden states cannot satisfy both requirements, and with the standard parameterization the predictive distribution would not depend on the target position at all, reducing the model to a bag-of-words predictor. XLNet therefore maintains two streams of hidden representations: a content stream, which, as in a standard Transformer, encodes both the position z_t and the content x_{z_t}, and a query stream, which sees the context x_{z<t} and the position z_t but not the content x_{z_t}. The query stream is used to predict the target token; the content stream provides context for subsequent positions.
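
Continuing the hedged sketch from the previous section (again, not this repository's implementation), the difference between the two streams can be expressed purely in terms of their attention masks: the content stream may attend to the target position itself, while the query stream may not.

```julia
# Illustrative only: attention masks for the two streams, given a factorization order.
# Builds on the hypothetical permutation_mask sketch above.
function two_stream_masks(order::Vector{Int})
    T = length(order)
    query_mask   = permutation_mask(order)                          # context only, target content hidden
    content_mask = query_mask .| [i == j for i in 1:T, j in 1:T]    # context plus the target token itself
    return (query = query_mask, content = content_mask)
end

two_stream_masks([3, 2, 4, 1])
```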


Work Checkpoints


Dated: 04-06-2021 (may not be updated)

  • Convert pre-trained weights to BSON
  • Create tokenizer: SentencePiece
  • Add as xlnet_tokenizer.jl
  • Transformer-XL encoder-decoder base with the features essential to XLNet
  • ...

Status

In progress

References

  1. XLNet: Generalized Autoregressive Pretraining for Language Understanding - arxiv.org
  2. Understanding XLNet - Borealis AI
  3. Understanding Language using XLNet with autoregressive pre-training - medium.com
  4. Sentence-Piece Subword Tokenizer - Google
  5. Permutation Language Modelling - LMU Munich