BERT: Bidirectional Encoder Representations from Transformers. It is bidirectional, so context on both the left and the right is taken into account, and the information a unidirectional model would lose from the other side is preserved.
Conclusion: As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create SOTA models for a wide range of tasks. Can: QA, MRC, NER. Can't: MT and other NLG tasks. [Why? Because it is not built as a left-to-right language model.]
Introduction
It stresses the importance of language-model pre-training and lists several kinds of tasks:
Sentence-level tasks such as NLI and paraphrasing, whose goal is to predict relationships between sentences by analyzing them as a whole.
Approach: use MLM ("Masked Language Model") as the pre-training objective, said to be inspired by the Cloze task [CBOW uses a similar idea]. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. This way, context from both sides can be used.
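To make the masking objective concrete, here is a minimal sketch of my own (simplified: the paper's actual procedure also keeps or randomly replaces some of the selected tokens instead of always masking them):

```python
# Sketch: randomly mask ~15% of input tokens; the model must recover the originals.
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Return the masked token sequence and the prediction targets."""
    masked, targets = [], []
    for tok in tokens:
        if tok not in ("[CLS]", "[SEP]") and random.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)      # the model must predict this original token
        else:
            masked.append(tok)
            targets.append(None)     # not part of the MLM loss
    return masked, targets

print(mask_tokens(["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]))
```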
The second task is NSP, Next Sentence Prediction: a "next sentence prediction" task that jointly pre-trains text-pair representations. How should this be understood? Specifically, when choosing the sentences A and B for each pretraining example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext). The reason for this is explained later in the paper; it is just mentioned here. The [CLS] output is labeled 0/1 according to whether B is the following sentence or not.
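As an illustration of the 50/50 rule, here is a sketch (my own, not the repository's actual data pipeline) assuming `documents` is a list of documents, each a list of sentence strings:

```python
# Sketch: build one NSP training pair following the 50/50 rule described above.
import random

def make_nsp_example(documents, doc_idx, sent_idx):
    """documents: list of documents, each a list of sentence strings."""
    sent_a = documents[doc_idx][sent_idx]
    if random.random() < 0.5 and sent_idx + 1 < len(documents[doc_idx]):
        sent_b, label = documents[doc_idx][sent_idx + 1], "IsNext"   # actual next sentence
    else:
        rand_doc = random.randrange(len(documents))                  # random sentence from the corpus
        sent_b, label = random.choice(documents[rand_doc]), "NotNext"
    return sent_a, sent_b, label
```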
[CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).
We primarily report results on two model sizes: BERT-Base (L=12, H=768, A=12, Total Parameters=110M) and BERT-Large (L=24, H=1024, A=16, Total Parameters=340M).
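For reference, the two sizes can be written out as Hugging Face `BertConfig` objects (an assumption on my part that the transformers library is used; L maps to `num_hidden_layers`, H to `hidden_size`, A to `num_attention_heads`):

```python
# Sketch: the two published model sizes expressed as transformers BertConfig objects.
from transformers import BertConfig

bert_base = BertConfig(num_hidden_layers=12, hidden_size=768, num_attention_heads=12)
bert_large = BertConfig(num_hidden_layers=24, hidden_size=1024, num_attention_heads=16,
                        intermediate_size=4096)   # feed-forward size = 4H
```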
To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence.
WordPiece embeddings with a 30,000-token vocabulary. The first token is always [CLS]; its final hidden state is used to represent the whole sequence for classification.
1. Segment (text) embedding: its values are learned automatically during training; it captures global semantic information of the text and is fused with the token-level semantics.
2. Position embedding: tokens appearing at different positions carry different semantic information (e.g. "我爱你" vs. "你爱我", "I love you" vs. "you love me"), so BERT attaches a distinct vector to each position to tell them apart.
Finally, BERT feeds the sum of the token, segment, and position embeddings into the model. In particular, English words are further split into finer-grained semantic units (WordPiece), e.g. "playing" becomes "play" + "##ing"; for Chinese, the authors do not segment the input into words and simply use single characters as the basic units.
Example tokenizer output (truncated): 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.', '[SEP]']
Notice how the word "embeddings" is represented: ['em', '##bed', '##ding', '##s']. The original word has been split into smaller subwords and characters. The two hash signs preceding some of these subwords are just the tokenizer's way of denoting that this subword or character is part of a larger word and preceded by another subword. So, for example, the '##bed' token is separate from the 'bed' token; the first is used whenever the subword 'bed' occurs within a larger word, and the second is used explicitly when the standalone token 'thing you sleep on' occurs.
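A minimal sketch of this WordPiece behaviour, assuming the Hugging Face transformers BertTokenizer (the tutorial quoted above uses a similar interface); the sentence is my own example:

```python
# Sketch: WordPiece tokenization with the transformers BertTokenizer (assumed installed).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "I want embeddings for this sentence."
tokens = tokenizer.tokenize(text)
print(tokens)
# ['i', 'want', 'em', '##bed', '##ding', '##s', 'for', 'this', 'sentence', '.']

ids = tokenizer.encode(text)                      # adds [CLS] and [SEP]
print(tokenizer.convert_ids_to_tokens(ids))
```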
Because BERT is a pretrained model that expects input data in a specific format, we will need:
special tokens to mark the beginning ([CLS]) and separation/end of sentences ([SEP])
tokens that conform to the fixed vocabulary used in BERT
token IDs from BERT’s tokenizer
mask IDs to indicate which elements in the sequence are tokens and which are padding elements [an engineering detail]
segment IDs used to distinguish different sentences [0 for sentence A, 1 for sentence B]
positional embeddings used to show token position within the sequence [encode position]
Luckily, this interface takes care of some of these input specifications for us so we will only have to manually create a few of them (we’ll revisit the other inputs in another tutorial).
Special tokens
BERT can take as input either one or two sentences, and expects special tokens to mark the beginning and end of each one:
2 Sentence Input:
[CLS] The man went to the store. [SEP] He bought a gallon of milk. [SEP]
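A sketch of producing this format programmatically, again assuming the Hugging Face BertTokenizer; it inserts [CLS]/[SEP] and also returns the segment (token type) IDs and attention mask listed above:

```python
# Sketch: encode the two-sentence example above with the transformers BertTokenizer.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer(
    "The man went to the store.",
    "He bought a gallon of milk.",
    padding="max_length",   # pad out to a fixed length
    max_length=32,
    truncation=True,
)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))  # [CLS] A ... [SEP] B ... [SEP] [PAD] ...
print(enc["token_type_ids"])   # segment IDs: 0 for sentence A, 1 for sentence B
print(enc["attention_mask"])   # 1 for real tokens, 0 for padding
```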
The NSP task is closely related to representation-learning objectives used in Jernite et al. (2017) and Logeswaran and Lee (2018). However, in prior work, only sentence embeddings are transferred to down-stream tasks, where BERT transfers all parameters to initialize end-task model parameters.
This is what distinguishes it from previous work.
Pre-training data
It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.
For each task, we simply plug in the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end. At the input, sentence A and sentence B from pre-training are analogous to (1) sentence pairs in paraphrasing, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in question answering, and (4) a degenerate text-∅ pair in text classification or sequence tagging. At the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.
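As a concrete illustration of feeding the [CLS] representation into an output layer, here is a minimal PyTorch sketch of my own using the transformers BertModel (not the paper's reference implementation):

```python
# Sketch: a classification head on top of the [CLS] representation.
import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    def __init__(self, num_labels: int, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)              # dropout kept at 0.1, as in the paper
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        cls_repr = outputs.last_hidden_state[:, 0]  # hidden state of the [CLS] token
        return self.classifier(self.dropout(cls_repr))
```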
Therefore, one good recipe is to pre-train for, say, 90,000 steps with a sequence length of 128 and then for 10,000 additional steps with a sequence length of 512. The very long sequences are mostly needed to learn positional embeddings, which can be learned fairly quickly. Note that this does require generating the data twice with different values of max_seq_length.
We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus. We use Adam with learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, learning rate warm-up over the first 10,000 steps, and linear decay of the learning rate.
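These settings map fairly directly onto a PyTorch/transformers sketch (an approximation: the paper's "Adam with L2 weight decay" is closest to AdamW in practice):

```python
# Sketch: the pre-training optimizer and schedule described above.
import torch
from transformers import BertConfig, BertForPreTraining, get_linear_schedule_with_warmup

model = BertForPreTraining(BertConfig())   # randomly initialized, BERT-Base-sized

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=10_000,
                                            num_training_steps=1_000_000)
# Per training step: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```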
The max_predictions_per_seq is the maximum number of masked LM predictions per sequence. You should set this to around max_seq_length * masked_lm_prob (the script doesn't do that automatically because the exact value needs to be passed to both scripts).
from: What is suitable max_predictions_per_seq size when max_seq_length set to 512? · Issue #516 · google-research/bert
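For example, with the default masking probability of 0.15 (the default in BERT's create_pretraining_data.py, not a value from this issue thread):

```python
# Quick arithmetic for the rule of thumb above.
import math

max_seq_length = 512
masked_lm_prob = 0.15   # default in BERT's create_pretraining_data.py
max_predictions_per_seq = math.ceil(max_seq_length * masked_lm_prob)
print(max_predictions_per_seq)   # 77
```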
For fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs. The dropout probability was always kept at 0.1. The optimal hyperparameter values are task-specific, but the following ranges of values work well across all tasks.
As we can see, the hyperparameters that are mainly tuned are the batch size, learning rate, and number of training epochs, while the dropout rate is kept fixed at 0.1.
Batch size: 16, 32
Learning rate (Adam): 5e-5, 3e-5, 2e-5
Number of epochs: 2, 3, 4
We also observed that large data sets (e.g., 100k+ labeled training examples) were far less sensitive to hyperparameter choice than small data sets. Fine-tuning is typically very fast, so it is reasonable to simply run an exhaustive search over the above parameters and choose the model that performs best on the development set.
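A sketch of that exhaustive search (`train_and_eval` is a hypothetical helper that fine-tunes BERT with the given settings and returns dev-set accuracy):

```python
# Sketch: exhaustive grid search over the recommended fine-tuning hyperparameters.
import itertools

batch_sizes = [16, 32]
learning_rates = [5e-5, 3e-5, 2e-5]
num_epochs = [2, 3, 4]

best = None
for bs, lr, ep in itertools.product(batch_sizes, learning_rates, num_epochs):
    dev_score = train_and_eval(batch_size=bs, learning_rate=lr, epochs=ep)  # hypothetical helper
    if best is None or dev_score > best[0]:
        best = (dev_score, {"batch_size": bs, "learning_rate": lr, "epochs": ep})

print("Best dev score:", best[0], "with", best[1])
```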
MNLI: sentence-pair classification (three-way). An entailment classification task: given a sentence pair, decide whether the second sentence is an entailment, a contradiction, or neutral with respect to the first.
QQP: sentence-pair binary classification over Quora question pairs; the goal is to decide whether two questions asked on Quora are equivalent. This is Community Question Answering territory (Zhihu is the same kind of platform; a shout-out here to Prof. Yang Min's work, which is excellent).
QNLI: sentence-pair binary classification (Question NLI). A version of the Stanford QA dataset converted into a binary task: positive examples pair a question with the sentence containing the answer; negative examples pair the question with a sentence from the same paragraph that does not contain the answer.
SST-2: single-sentence binary classification. Sentences extracted from movie reviews with human-annotated sentiment polarity.
CoLA: single-sentence binary classification. The goal is to judge whether an English sentence is grammatically acceptable.
STS-B: sentence-pair scoring. Pairs drawn from news headlines and other sources, annotated with a similarity score from 1 to 5.
MRPC: the Microsoft Research Paraphrase Corpus, containing sentence pairs automatically extracted from online news, with human annotations of whether the two sentences are semantically equivalent.
RTE: sentence-pair binary classification, a task similar to MNLI but with much less training data.
Note that we only report single-task fine-tuning results in this paper. A multitask fine-tuning approach could potentially push the performance even further. For example, we did observe substantial improvements on RTE from multi-task training with MNLI. [Direct quote from the Google paper; to me it reads like an empty remark.]
WNLI was not used.
NER
Correspondingly, the inputs for cases (a)-(d) are constructed as follows:
(a) Input: [CLS] sentence A [SEP] sentence B [SEP]; output: the [CLS] representation goes into a softmax.
(b) Input: [CLS] sentence A [SEP]; output: the [CLS] representation goes into a softmax.
(c) Input: [CLS] sentence A (question) [SEP] sentence B (paragraph) [SEP]; output: the token representations of sentence B go into a softmax.
(d) Input: [CLS] sentence A [SEP]; output: the token representations of sentence A go into a softmax.
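For case (c), here is a minimal sketch of feeding the token representations into a start/end softmax for span prediction (my own PyTorch illustration; transformers ships an equivalent BertForQuestionAnswering):

```python
# Sketch: a SQuAD-style span head over BERT's token representations.
import torch.nn as nn
from transformers import BertModel

class BertSpanHead(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.qa_outputs = nn.Linear(self.bert.config.hidden_size, 2)  # start/end logits

    def forward(self, input_ids, attention_mask, token_type_ids):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask,
                           token_type_ids=token_type_ids).last_hidden_state
        start_logits, end_logits = self.qa_outputs(hidden).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)  # softmax applied in the loss
```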
• GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words). [Different data sets.]
• GPT uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time; BERT learns [SEP], [CLS] and sentence A/B embeddings during pre-training. [SEP and CLS are used differently; the A/B embeddings exist because of NSP.]
• GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words. [Both are 1M steps, but BERT's batch size is larger (is this just showing off?).]
• GPT used the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific fine-tuning learning rate which performs the best on the development set. [The learning rate is tuned per task during fine-tuning.]