LLM Notes



There are currently three mainstream subword tokenization algorithms: Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model.
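As a rough illustration of BPE, the first of the three algorithms, here is a toy sketch of how merge rules can be learned: count adjacent symbol pairs across the corpus, merge the most frequent pair, and repeat. The function name `learn_bpe` and the tiny word list are made up for this note. Production tokenizers also handle word-boundary markers, byte-level fallback, and tie-breaking in their own ways, none of which is shown here.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy sketch)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair with one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


print(learn_bpe(["low", "low", "lower", "lowest"], 2))
```

With this toy corpus the first two merges build up the shared stem "low" ("l"+"o", then "lo"+"w"), which is the behavior that lets BPE represent rare words as combinations of frequent pieces.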

Back in ancient times, before 2013, we usually encoded basic unigram tokens as sparse vectors of 1s and 0s, a process called one-hot encoding. word2vec improved things by replacing these sparse 1s and 0s with dense learned vectors (aka word embeddings). BERT improved things further by using transformers and self-attention heads to create contextual embeddings, where the vector for a token depends on the whole sentence around it.
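To make the one-hot vs. embedding contrast concrete, here is a small sketch in plain Python. The three-word vocabulary, the dimension of 4, and the random values standing in for learned weights are all assumptions for illustration; in word2vec the table would be trained, not random.

```python
import random

vocab = ["the", "cat", "sat"]  # toy vocabulary (assumption)
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Sparse encoding: a vector as long as the vocabulary with a single 1."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


# Dense embedding table: one row per word. Random values stand in for
# weights that a model such as word2vec would learn during training.
rng = random.Random(0)
dim = 4
embeddings = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in vocab]


def embed(word):
    """Dense encoding: look up the row for this word."""
    return embeddings[index[word]]


def one_hot_times_table(word):
    """Multiplying a one-hot vector by the table selects the same row."""
    h = one_hot(word)
    return [sum(h[i] * embeddings[i][j] for i in range(len(vocab))) for j in range(dim)]


print(one_hot("cat"))
print(embed("cat"))
```

The last helper shows why an embedding layer is usually implemented as a lookup: the matrix product with a one-hot vector gives the same row, but the lookup skips all the multiplications by zero.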


Distributed word representations: word embedding

  • word2vec

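The CBOW model predicts the current word w(t) given its surrounding context, much like a cloze (fill-in-the-blank) exercise in reading comprehension; the Skip-Gram model does the opposite, predicting the context given the current word w(t).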
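The difference between the two models shows up most clearly in how training examples are built from a sentence. The sketch below is a minimal illustration; the helper names and the `window` parameter are made up for this note, and real implementations add details such as subsampling of frequent words and randomly shrinking the window.

```python
def context_window(tokens, t, window):
    """Words within `window` positions of index t, excluding tokens[t] itself."""
    left = tokens[max(0, t - window):t]
    right = tokens[t + 1:t + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: input is the list of context words, target is the current word w(t)."""
    return [(context_window(tokens, t, window), tokens[t]) for t in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """Skip-Gram: input is the current word w(t), target is one context word."""
    return [
        (tokens[t], c)
        for t in range(len(tokens))
        for c in context_window(tokens, t, window)
    ]


tokens = ["the", "cat", "sat", "down"]
print(cbow_examples(tokens, window=1)[1])     # (['the', 'sat'], 'cat')
print(skipgram_examples(tokens, window=1)[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

CBOW produces one example per position, while Skip-Gram produces one example per (word, context word) pair, so Skip-Gram yields more training examples from the same text.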

For both models, word2vec provides two training frameworks for learning good word vectors quickly: Hierarchical Softmax and Negative Sampling.
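Of the two frameworks, Negative Sampling is the easier one to sketch. Instead of a softmax over the whole vocabulary, each training pair becomes a small binary classification problem: push the score of the true (center, context) pair up and the scores of a few randomly drawn "negative" words down. Below is a minimal sketch of the per-example loss in plain Python. The function names are made up for this note, and how the negatives are sampled (word2vec draws them from a smoothed unigram distribution) is not shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(center, positive, negatives):
    """Loss for one (center, context) pair with k sampled negative words.

    center    -- vector of the center word
    positive  -- output vector of the true context word
    negatives -- list of output vectors of sampled negative words
    """
    # Reward a high score for the true pair ...
    loss = -math.log(sigmoid(dot(center, positive)))
    # ... and a low score for each sampled negative word.
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(center, neg)))
    return loss


# Well-aligned vectors give a lower loss than uninformative (zero) vectors.
print(negative_sampling_loss([1.0, 0.0], [2.0, 0.0], [[-2.0, 0.0]]))
print(negative_sampling_loss([0.0, 0.0], [2.0, 0.0], [[-2.0, 0.0]]))
```

The cost per example grows with the number of negatives k rather than with the vocabulary size, which is where the speedup over a full softmax comes from.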

  • BERT(Bidirectional Encoder Representations from Transformers)



pip install tensorflow==2.12 tensor2tensor --no-cache-dir

Installing TensorFlow 1.15 requires Python 3.7 or earlier.


Democratizing Large Language Model Alignment

Aligning large language models (LLMs) with human preferences has proven to drastically improve usability and has driven rapid adoption as demonstrated by ChatGPT. Alignment techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) greatly reduce the required skill and domain knowledge to effectively harness the capabilities of LLMs, increasing their accessibility and utility across various domains.

Training datasets for open-source models


Large Transformer Model Inference Optimization