BERT

Created	@February 28, 2024
Tags	NN

Bidirectional Encoder Representations from Transformers

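BERT Base: L = 12, H = 768, A = 12, 110M parameters

BERT Large: L = 24, H = 1024, A = 16, 340M parameters

(L = number of Transformer layers, H = hidden size, A = number of attention heads)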
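
As a rough check on the parameter counts above, the sketch below tallies the weights of a BERT-style encoder from L and H. The vocabulary size (30,522), maximum sequence length (512), two segment types, feed-forward width of 4H, and the pooler layer are assumptions based on the commonly released uncased checkpoints, not values from these notes.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522,
                     max_positions=512, num_segments=2):
    """Approximate parameter count of a BERT-style encoder (assumed layout)."""
    layer_norm = 2 * hidden  # scale + bias
    # Token, position, and segment tables, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + num_segments) * hidden + layer_norm
    # Q, K, V, and output projections (weights + biases), then LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers with an assumed inner width of 4 * hidden, then LayerNorm.
    ffn_width = 4 * hidden
    feed_forward = ((hidden * ffn_width + ffn_width)
                    + (ffn_width * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden  # dense layer on the [CLS] vector
    return embeddings + num_layers * (attention + feed_forward) + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"Base: {base / 1e6:.1f}M, Large: {large / 1e6:.1f}M")
```

Under these assumptions the totals come out close to the 110M and 340M figures quoted above. The number of heads A does not appear in the count, because the heads split the hidden size H rather than add to it.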

1. What are BERT's two pre-training tasks, and what does each one do?

Masked language model (MLM): 15% of the input tokens are selected for prediction. Of these, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. The model is trained to recover the original tokens, which teaches it relations between tokens within a sequence using context from both directions.

Next sentence prediction (NSP): given two input sentences A and B, predict whether B actually follows A. Training pairs are 50% positive and 50% negative. This teaches the model relations between sentences.
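
The 15% / 80-10-10 masking rule can be sketched as follows. This is a simplified illustration, not BERT's actual data pipeline: it selects each token independently with probability 0.15 and does not treat special tokens such as [CLS] and [SEP] separately. The function name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, select_prob=0.15, rng=random):
    """Return (corrupted tokens, labels); labels are None at unselected positions."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # not selected: input unchanged, no prediction target
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, labels


rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
sentence = ["the", "cat", "sat", "on", "the", "mat"] * 200
corrupted, labels = mask_tokens(sentence, vocab, rng=rng)
selected = [i for i, lab in enumerate(labels) if lab is not None]
print(f"selected {len(selected)} of {len(sentence)} positions")
```

Every selected position gets a label, including positions where the input token was replaced at random or kept as is.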

2. When computing the MLM pre-training loss, which tokens contribute? All of the selected 15%, or only the tokens among them that were actually replaced with [MASK]?

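The loss is computed over all of the selected 15% of tokens, including those that were replaced with a random token or left unchanged, not only those replaced with [MASK]. The unselected 85% of tokens do not contribute.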

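3. When implementing the loss function, how do you make sure that tokens that were not selected for masking do not contribute to the loss?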

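Use a 0/1 mask over token positions (1 = selected for prediction, 0 = ignored), compute the per-token loss without reduction, and average it over the positions where the mask is 1:

import torch

loss_function = torch.nn.CrossEntropyLoss(reduction="none")  # keep one loss value per token

loss_mask = torch.as_tensor(mask, dtype=torch.float32)  # shape (batch, seq_len)

predictions = model(tokens)  # tokens are the input token ids; predictions assumed (batch, seq_len, vocab)

loss = loss_function(predictions.transpose(1, 2), labels)  # class dimension must come second

masked_loss = torch.sum(loss * loss_mask) / torch.sum(loss_mask)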
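
An alternative that should give the same value is to set the labels at unselected positions to an ignore value, which PyTorch's cross-entropy skips through its `ignore_index` argument (default -100). The sketch below compares the two on random logits standing in for a model's output; the shapes and the `selected` mask are made up for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, vocab_size = 2, 5, 11
logits = torch.randn(batch, seq_len, vocab_size)        # stand-in for model output
labels = torch.randint(0, vocab_size, (batch, seq_len))  # original token ids
selected = torch.tensor([[0, 1, 0, 0, 1],
                         [1, 0, 0, 0, 0]], dtype=torch.bool)

# Option A: per-token loss, then an explicit 0/1 mask.
per_token = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
loss_mask = selected.float()
loss_a = (per_token * loss_mask).sum() / loss_mask.sum()

# Option B: mark unselected labels with -100 and let cross_entropy skip them.
ignored_labels = labels.masked_fill(~selected, -100)
loss_b = F.cross_entropy(logits.reshape(-1, vocab_size),
                         ignored_labels.reshape(-1),
                         ignore_index=-100)

print(loss_a.item(), loss_b.item())
```

With the default mean reduction, option B averages only over the positions that are not ignored, so it matches the masked average in option A.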

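4. Why are BERT's three embeddings simply added together?

Because BERT needs the token's identity, its position, and its segment at the same time. The three embeddings encode, respectively, the token itself, its position in the sequence, and the segment (sentence A or B) it belongs to. All three have the same dimension H, so their sum is a single vector that carries all of this information into the model.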
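
One way to see why a plain sum is enough: adding the three looked-up rows gives the same result as concatenating the three one-hot index vectors and projecting them with a single stacked embedding matrix. The toy tables below are hypothetical values chosen so the arithmetic is easy to follow.

```python
def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]


def row_times_matrix(row, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(r * m[j] for r, m in zip(row, matrix))
            for j in range(len(matrix[0]))]


# Toy embedding tables: each row is a 3-dimensional embedding.
token_table = [[1, 0, 2], [0, 3, 1], [4, 1, 0], [2, 2, 2]]  # 4 tokens
position_table = [[10, 0, 0], [0, 10, 0], [0, 0, 10]]        # 3 positions
segment_table = [[100, 100, 100], [200, 200, 200]]           # 2 segments

token_id, position, segment = 2, 1, 0

# BERT-style input: element-wise sum of the three looked-up rows.
summed = [t + p + s for t, p, s in zip(token_table[token_id],
                                       position_table[position],
                                       segment_table[segment])]

# Equivalent view: concatenated one-hot vectors times one stacked matrix.
concat_one_hot = one_hot(token_id, 4) + one_hot(position, 3) + one_hot(segment, 2)
stacked_table = token_table + position_table + segment_table
projected = row_times_matrix(concat_one_hot, stacked_table)

print(summed, projected)  # both are [104, 111, 100]
```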

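5. What are BERT's advantages and disadvantages?

Advantages: bidirectional context gives richer semantic representations, and the pre-trained model transfers well to many downstream tasks.

Disadvantages: it is an encoder suited to classification and other understanding tasks, not a generator, so it cannot perform seq2seq tasks on its own.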

6. Which models do you know that address BERT's shortcomings?

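SpanBERT: changes BERT's pre-training. It masks contiguous spans of tokens instead of individual tokens, adds a span-boundary objective that predicts the masked span from the tokens at its boundaries, and drops next sentence prediction. This helps the model capture semantic relations in context, especially for span-selection tasks such as question answering and coreference resolution.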

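DistilBERT: a lightweight BERT obtained through knowledge distillation. A smaller student model with half as many Transformer layers is trained to reproduce the outputs of a full BERT teacher, which reduces model size and computation and speeds up inference in particular.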
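
The core of distillation can be sketched as a loss that pulls the student's softened output distribution toward the teacher's. This is only the soft-target term, with an illustrative temperature of 2.0 and made-up logits; DistilBERT's full training recipe combines it with further loss terms that are not shown here.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different class

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(loss_close, loss_far)
```

A student whose logits track the teacher's gets a loss near zero, while a student that prefers a different class is penalized more heavily.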

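7. How can BERT be used in generative models?

Seq2Seq model: use BERT as the encoder to turn the input sequence into contextual vectors, then feed those vectors to a decoder that generates the output sequence. This approach can be used for dialogue systems, machine translation, text summarization, and similar tasks.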

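GPT-style decoder: take the pre-trained BERT weights as initial parameters for a Transformer decoder and fine-tune it with left-to-right (causal) attention for text generation. This can be used for free-form text generation, article summarization, question answering, and similar tasks. Note that GPT itself is not derived from BERT: it is a separate decoder-only model pre-trained with a left-to-right language-modeling objective.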
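
Both approaches hinge on the difference between BERT's bidirectional self-attention and the left-to-right attention a decoder needs when it generates one token at a time. A minimal sketch of the two attention masks (the function name `attention_mask` is hypothetical):

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]


bidirectional = attention_mask(4, causal=False)  # encoder-style, as in BERT
left_to_right = attention_mask(4, causal=True)   # decoder-style, for generation

for row in left_to_right:
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

In the causal mask each position sees only itself and earlier positions, so the model cannot look at the tokens it is about to generate.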