This section provides an overview of prominent pre-trained Large Language Models (LLMs), covering their architectures, training objectives, and key innovations in NLP.
Pre-Trained LLMs
The models below are state-of-the-art LLMs that have advanced Natural Language Understanding (NLU) and Natural Language Generation (NLG) and are frequently fine-tuned for specific downstream tasks.
General Purpose Models
- T5:
  - Architecture: Encoder-decoder trained with a unified text-to-text formulation.
  - Training Objective: Masked language modeling with span corruption.
  - Key Feature: Each masked span is replaced by a single sentinel token, which shortens sequences and speeds up training (see the sketch below).
  - Fine-tuning: Adapter layers can be added after pre-training for parameter-efficient adaptation.
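A rough illustration of span corruption in the T5 style: contiguous spans are dropped from the input, each is replaced by a single sentinel token, and the target reconstructs only the dropped spans. The function below is a toy sketch; the span lengths, corruption rate, and token names are illustrative rather than T5's exact noise schedule.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Toy T5-style span corruption: replace contiguous spans with sentinel tokens.

    Returns (inputs, targets) where each dropped span appears once in the target,
    prefixed by the same sentinel that replaced it in the input.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(tokens) * corruption_rate))
    inputs, targets = [], []
    i, sentinel_id, masked = 0, 0, 0
    while i < len(tokens):
        if masked < n_to_mask and rng.random() < corruption_rate:
            span_len = min(mean_span_len, len(tokens) - i)
            sentinel = f"<extra_id_{sentinel_id}>"
            inputs.append(sentinel)                      # one sentinel per span
            targets.append(sentinel)
            targets.extend(tokens[i:i + span_len])       # dropped span goes to the target
            i += span_len
            masked += span_len
            sentinel_id += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append("</s>")
    return inputs, targets

inp, tgt = span_corrupt("thank you for inviting me to your party last week".split())
print(inp)   # e.g. ['thank', 'you', '<extra_id_0>', 'me', ...]
print(tgt)   # e.g. ['<extra_id_0>', 'for', 'inviting', ..., '</s>']
```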
- GPT-3:
  - Architecture: Based on GPT-2, with alternating dense and locally banded sparse attention layers (sketched below).
  - Training: Larger batch sizes and lower learning rates are used as model size grows.
  - Key Feature: Model scale (175B parameters) yields strong few-shot, in-context performance.
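To make the dense-versus-sparse distinction concrete, the sketch below builds a full causal attention mask and a locally banded causal mask with NumPy. The window size is an arbitrary illustrative choice, not GPT-3's actual configuration.

```python
import numpy as np

def dense_causal_mask(seq_len):
    """Full causal mask: token i attends to all tokens j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def banded_causal_mask(seq_len, window=4):
    """Locally banded sparse causal mask: token i attends only to the
    most recent `window` tokens (j in [i - window + 1, i])."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        mask[i, max(0, i - window + 1): i + 1] = True
    return mask

# Alternating the two patterns across layers reduces attention cost in the
# sparse layers while keeping some layers fully dense.
print(dense_causal_mask(6).astype(int))
print(banded_causal_mask(6, window=3).astype(int))
```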
- mT5:
  - Architecture: A multilingual variant of T5 trained on the mC4 dataset.
  - Training Objective: Same text-to-text setup as T5, with a larger vocabulary to cover many languages.
  - Key Feature: Fine-tuning on a mix of English and non-English data improves multilingual performance.
- PanGu-α:
  - Architecture: Autoregressive decoder with an added query layer on top of the transformer stack.
  - Training Objective: Autoregressive language modeling, with the query layer dedicated to predicting the next token.
- CPM-2:
  - Architecture: Bilingual (Chinese-English) model with an MoE variant.
  - Training: Cost-efficient pre-training; both full fine-tuning and prompt fine-tuning are explored for downstream tasks.
  - Key Feature: Dynamically offloads parameters during inference for efficiency.
- ERNIE 3.0:
  - Architecture: Transformer-XL backbone with a modular design and continual multi-task learning.
  - Training Objective: A mix of understanding, generation, and knowledge-enhanced tasks, with a focus on Chinese NLP.
  - Key Feature: Incorporates knowledge extraction and understanding tasks into pre-training.
- Jurassic-1:
  - Architecture: Auto-regressive decoder in two variants: J1-Large (7.5B parameters) and J1-Jumbo (178B).
  - Key Feature: A depth-to-width balanced self-attention architecture and an improved tokenizer for faster prediction.
- HyperCLOVA:
  - Architecture: Similar to GPT-3, optimized for Korean language tasks.
- Yuan 1.0:
  - Architecture: Co-designed with its large-scale distributed training setup and trained on Chinese-language data.
  - Key Feature: Strong performance on multiple NLP tasks enabled by large-scale training.
- Gopher:
  - Architecture: A family of autoregressive models ranging from 44M to 280B parameters.
  - Key Feature: The 280B model outperforms the prior state of the art on roughly 81% of evaluated tasks, highlighting the importance of model scale.
- ERNIE 3.0 TITAN:
  - Architecture: A scaled-up version of ERNIE 3.0 with roughly 26x more parameters.
  - Key Feature: Adds credible and controllable generation tasks to improve factual consistency.
- GPT-NeoX-20B:
  - Architecture: Similar to GPT-3, with optimizations such as rotary positional embeddings (RoPE; sketched below) and parallel attention and feed-forward layers.
  - Key Feature: An open 20B model whose design choices improve training throughput at scale.
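Rotary positional embeddings encode position by rotating pairs of query/key channels through position-dependent angles, so dot products depend on relative offsets. Below is a minimal NumPy sketch of the rotate-half formulation; dimensions and frequencies are illustrative, not GPT-NeoX's exact setup.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary positional embeddings (RoPE) to x of shape (seq_len, dim).

    The channels are split into two halves that are rotated together as pairs,
    by an angle that grows with the token position, so relative offsets are
    preserved in query-key dot products.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # One frequency per channel pair, decaying geometrically as in RoPE.
    freqs = base ** (-np.arange(half) / half)              # (half,)
    angles = np.outer(np.arange(seq_len), freqs)           # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied to each (x1, x2) channel pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(8, 64)   # (seq_len=8, head_dim=64)
q_rot = rotary_embed(q)
print(q_rot.shape)           # (8, 64)
```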
- OPT:
  - Architecture: An open-sourced replication of GPT-3.
  - Key Feature: Uses dynamic loss scaling (sketched below) and restarts from earlier checkpoints to keep training stable.
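Dynamic loss scaling is a standard mixed-precision trick: gradients are computed on a scaled loss so small values do not underflow in fp16, and the scale is backed off whenever an overflow is detected. The class below is a schematic sketch of the idea, not OPT's actual training code.

```python
class DynamicLossScaler:
    """Toy dynamic loss scaler for mixed-precision training.

    Grow the scale after a run of stable steps; shrink it (and skip the
    optimizer step) whenever gradients overflow to inf/nan.
    """
    def __init__(self, init_scale=2.0**15, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._stable_steps = 0

    def update(self, found_overflow: bool) -> bool:
        """Returns True if the optimizer step should be skipped."""
        if found_overflow:
            self.scale *= self.backoff_factor   # back off on overflow
            self._stable_steps = 0
            return True
        self._stable_steps += 1
        if self._stable_steps % self.growth_interval == 0:
            self.scale *= self.growth_factor    # cautiously grow again
        return False

scaler = DynamicLossScaler()
skip = scaler.update(found_overflow=True)
print(skip, scaler.scale)   # True 16384.0
```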
- BLOOM:
  - Architecture: Causal decoder with innovations such as ALiBi positional biases (sketched below).
  - Key Feature: These choices enhance training stability and downstream performance.
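ALiBi replaces positional embeddings with a static, head-specific linear bias added to attention scores in proportion to query-key distance. A minimal sketch follows; the slope schedule mirrors the power-of-two pattern from the ALiBi paper, while everything else is simplified.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """ALiBi: per-head linear penalties added to causal attention scores.

    Head h penalizes attending from query i to key j by slope_h * (i - j),
    so distant tokens are progressively down-weighted without any learned
    positional embeddings. Slopes are geometric, e.g. 1/2, 1/4, ..., 1/256
    for num_heads = 8.
    """
    slopes = np.array([2.0 ** (-(h + 1) * 8.0 / num_heads) for h in range(num_heads)])
    positions = np.arange(seq_len)
    distance = positions[None, :] - positions[:, None]      # j - i (negative for past keys)
    distance = np.minimum(distance, 0)                       # future positions get no bonus
    return slopes[:, None, None] * distance[None, :, :]      # (heads, seq, seq), <= 0

bias = alibi_bias(seq_len=5, num_heads=4)
# scores = q @ k.T / sqrt(d) + bias[h]  before the causal mask and softmax
print(bias.shape)   # (4, 5, 5)
```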
- GLaM:
  - Architecture: Sparse mixture-of-experts (MoE) structure with only the top two experts activated per input token (sketched below).
  - Key Feature: Efficient use of computation, achieving better results with substantially less training energy.
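The sketch below shows top-2 expert routing of the kind used in sparse MoE layers: a router scores all experts for a token, but only the two highest-scoring experts are evaluated. Names and dimensions are illustrative, and load-balancing losses and expert-capacity limits are omitted.

```python
import numpy as np

def top2_moe_layer(x, gate_w, experts):
    """Toy top-2 MoE routing for a single token.

    x:       (d_model,) token representation
    gate_w:  (d_model, num_experts) router weights
    experts: list of callables, each mapping (d_model,) -> (d_model,)

    Only the two highest-scoring experts run, so compute per token stays
    roughly constant no matter how many experts exist in total.
    """
    logits = x @ gate_w                              # (num_experts,)
    top2 = np.argsort(logits)[-2:]                   # indices of the 2 best experts
    weights = np.exp(logits[top2])
    weights /= weights.sum()                         # softmax over the chosen 2
    return sum(w * experts[i](x) for w, i in zip(weights, top2))

d, n_experts = 16, 8
rng = np.random.default_rng(0)
gate_w = rng.standard_normal((d, n_experts))
experts = [(lambda W: (lambda v: np.tanh(v @ W)))(rng.standard_normal((d, d)))
           for _ in range(n_experts)]
y = top2_moe_layer(rng.standard_normal(d), gate_w, experts)
print(y.shape)   # (16,)
```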
- MT-NLG:
  - Architecture: A 530B-parameter autoregressive decoder.
  - Key Feature: Surpasses GPT-3 on several benchmarks.
- Chinchilla:
  - Architecture: Causal decoder optimized for compute-efficient training.
  - Key Feature: Establishes scaling laws indicating that model size and training tokens should grow roughly in proportion for compute-optimal performance (see the worked example below).
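As a rough worked example of compute-optimal allocation in the Chinchilla spirit, the snippet below uses two common approximations rather than the paper's fitted coefficients: training compute C ≈ 6·N·D FLOPs and roughly 20 training tokens per parameter.

```python
def compute_optimal(flop_budget, tokens_per_param=20.0):
    """Approximate Chinchilla-style allocation of a FLOP budget.

    Uses the rules of thumb C ~= 6 * N * D and D ~= 20 * N,
    so N ~= sqrt(C / (6 * 20)) and D = 20 * N.
    """
    n_params = (flop_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla's own budget (~5.8e23 FLOPs) lands near 70B params / 1.4T tokens.
n, d = compute_optimal(5.8e23)
print(f"params ~ {n/1e9:.0f}B, tokens ~ {d/1e12:.2f}T")
```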
- AlexaTM:
  - Architecture: Encoder-decoder; the encoder weights are kept frozen during the initial phase of pre-training.
  - Key Feature: A mixture of denoising and causal language modeling objectives improves learning.
- PaLM:
  - Architecture: Decoder-only model with parallel attention and feed-forward blocks.
  - Key Feature: Training optimized with SwiGLU activations (sketched below), RoPE embeddings, and multi-query attention.
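SwiGLU is a gated feed-forward variant: one linear projection is passed through the SiLU (Swish) nonlinearity and used to gate a second projection before the down-projection. A minimal sketch with illustrative dimensions (PaLM's actual layer sizes differ):

```python
import numpy as np

def silu(x):
    """SiLU / Swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: silu(x W_gate) elementwise-gates (x W_up),
    then the result is projected back to the model dimension."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d_model))                 # (seq_len=4, d_model)
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))
print(swiglu_ffn(x, w_gate, w_up, w_down).shape)      # (4, 8)
```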
- PaLM-2:
  - Architecture: A smaller successor to PaLM trained on higher-quality data, achieving better performance at reduced training and inference cost.
- U-PaLM:
  - Key Feature: Continues training PaLM with the UL2 mixture-of-denoisers objective, outperforming the baseline PaLM on various NLP tasks.
- UL2:
  - Architecture: Encoder-decoder trained with a mixture of denoisers: regular, sequential (prefix-LM), and extreme denoising (sketched below).
  - Key Feature: Outperforms T5 on many benchmarks through this unified training strategy.
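A schematic of mixture-of-denoisers training in the UL2 spirit: each example is assigned one of several denoising modes and tagged with a mode token. The corruption parameters and mode names below are illustrative, and the span-masking step itself is elided; this is not UL2's exact mixture.

```python
import random

# Illustrative denoiser modes in the spirit of UL2's mixture-of-denoisers:
#   [R]: regular span corruption (short spans, moderate rate)
#   [X]: extreme denoising (long spans and/or high corruption rate)
#   [S]: sequential denoising (prefix language modeling)
DENOISER_MODES = {
    "[R]": {"mean_span": 3,  "corrupt_rate": 0.15},
    "[X]": {"mean_span": 32, "corrupt_rate": 0.50},
    "[S]": {"prefix_lm": True},
}

def sample_denoising_task(tokens, rng=random):
    """Pick a denoiser mode for one example and describe the resulting task."""
    mode = rng.choice(list(DENOISER_MODES))
    cfg = DENOISER_MODES[mode]
    if cfg.get("prefix_lm"):
        # S-denoising: condition on a prefix, predict the continuation.
        split = len(tokens) // 2
        return mode, tokens[:split], tokens[split:]
    # R/X-denoising: corrupt spans (span masking omitted here for brevity).
    return mode, cfg, tokens

example = "large language models are trained on web scale corpora".split()
print(sample_denoising_task(example))
```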
- GLM-130B:
  - Architecture: Bidirectional model trained with autoregressive blank (mask) infilling (sketched below).
  - Key Feature: A bilingual English-Chinese model that outperforms GPT-3 on several benchmarks.
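In autoregressive blank infilling of the GLM style, a span is replaced by a mask token in the context, and the span is then generated left-to-right after the corrupted context. The sketch below builds one such training pair; the special-token names are illustrative, and GLM's 2D positional encoding is omitted.

```python
def blank_infilling_example(tokens, span_start, span_len):
    """Build a GLM-style blank-infilling training pair for one masked span.

    The corrupted context (which the model attends to bidirectionally) is
    followed by the span, generated left-to-right after a start token.
    """
    span = tokens[span_start:span_start + span_len]
    context = tokens[:span_start] + ["[MASK]"] + tokens[span_start + span_len:]
    model_input = context + ["[sop]"] + span       # teacher-forced input
    targets = span + ["[eop]"]                     # predicted autoregressively
    return model_input, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = blank_infilling_example(tokens, span_start=2, span_len=3)
print(inp)   # ['the', 'quick', '[MASK]', ..., '[sop]', 'brown', 'fox', 'jumps']
print(tgt)   # ['brown', 'fox', 'jumps', '[eop]']
```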
- LLaMA:
  - Architecture: Decoder-only models with an efficient causal attention implementation and reduced activation recomputation.
  - Key Feature: Highly efficient; the openly released weights make it a common base for fine-tuning.
- DeepSeek:
  - Architecture: Decoder-only model; the work centers on scaling laws for allocating model size and training data.
  - Key Feature: Systematic experiments on optimal batch size, learning rate, and compute budgets.
- DeepSeek-v2:
  - Architecture: Introduces Multi-head Latent Attention (MLA), which compresses the key-value cache into a low-rank latent to reduce inference costs.
  - Key Feature: The smaller KV cache enables faster inference throughput (see the sketch below).
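The core idea behind MLA is to cache one small latent vector per token and reconstruct keys and values from it on demand, instead of caching full per-head keys and values. The class below is a heavily simplified single-head sketch; dimensions are illustrative, and details such as the decoupled RoPE path are omitted.

```python
import numpy as np

class LatentKVCache:
    """Toy MLA-style cache: store a small latent per token, expand to K/V on use."""

    def __init__(self, d_model=64, d_latent=8, rng=np.random.default_rng(0)):
        self.w_down = rng.standard_normal((d_model, d_latent))   # compress hidden -> latent
        self.w_k_up = rng.standard_normal((d_latent, d_model))   # latent -> keys
        self.w_v_up = rng.standard_normal((d_latent, d_model))   # latent -> values
        self.latents = []                                        # what actually gets cached

    def append(self, h):
        """Cache only the compressed latent for a new token's hidden state h."""
        self.latents.append(h @ self.w_down)                     # (d_latent,)

    def keys_values(self):
        """Reconstruct full keys/values from the cached latents when attending."""
        c = np.stack(self.latents)                               # (seq_len, d_latent)
        return c @ self.w_k_up, c @ self.w_v_up                  # (seq_len, d_model) each

cache = LatentKVCache()
for h in np.random.default_rng(1).standard_normal((5, 64)):     # 5 decoded tokens
    cache.append(h)
k, v = cache.keys_values()
print(k.shape, v.shape)     # (5, 64) (5, 64) -- yet only 5 x 8 floats were cached
```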
Together, these models illustrate how pre-trained LLMs have evolved, their key innovations, and their contributions to NLP tasks. The recurring themes are architecture design, training methodology, and fine-tuning technique, with an emphasis on efficiency, scalability, and language-specific optimization.