Pre-trained Large Language Models

This section provides a comprehensive overview of notable pre-trained Large Language Models (LLMs): their architectures, training objectives, and the innovations each brought to NLP.

Pre-Trained LLMs

The models below are state-of-the-art LLMs that have advanced Natural Language Understanding (NLU) and Natural Language Generation (NLG), and they are frequently fine-tuned for specific downstream tasks.

General Purpose Models

  1. T5:

    • Architecture: Encoder-decoder with unified text-to-text training.
    • Training Objective: Masked language modeling with span corruption, where contiguous spans of tokens are masked together.
    • Key Feature: Each corrupted span is replaced by a single sentinel token, which shortens both input and target sequences and speeds up training (see the span-corruption sketch after this list).
    • Fine-tuning: Adapter layers are added after pre-training.
  2. GPT-3:

    • Architecture: Based on GPT-2, with alternating dense and locally banded sparse attention layers.
    • Training Objective: Autoregressive language modeling, trained with larger batch sizes and a lower learning rate as the model scales.
    • Key Feature: Model scale (175B parameters) improves performance.
  3. mT5:

    • Architecture: A multilingual version of T5 trained on the mC4 dataset.
    • Training Objective: Designed for multilingual tasks, it uses a larger vocabulary.
    • Key Feature: Fine-tuning using a mix of English and non-English data.
  4. PanGu-α:

    • Architecture: Autoregressive model with query layers for token prediction.
    • Training Objective: Focus on efficient token prediction.
  5. CPM-2:

    • Architecture: Bilingual model with MoE architecture.
    • Training Objective: Cost-efficient pre-training, using different fine-tuning strategies (complete and prompt fine-tuning).
    • Key Feature: Dynamic offloading of parameters for inference efficiency.
  6. ERNIE 3.0:

    • Architecture: Uses a Transformer-XL backbone with multi-task learning.
    • Training Objective: Focus on Chinese NLP tasks with modular architecture.
    • Key Feature: Incorporates knowledge extraction and understanding tasks.
  7. Jurassic-1:

    • Architecture: An auto-regressive model released in two variants: J1-Large (7B parameters) and J1-Jumbo (178B parameters).
    • Key Feature: Balanced self-attention architecture and improved tokenizer for faster predictions.
  8. HyperCLOVA:

    • Architecture: Similar to GPT-3, optimized for Korean language tasks.
  9. Yuan 1.0:

    • Architecture: Optimized for large-scale distributed training with Chinese language data.
    • Key Feature: Efficient performance on multiple NLP tasks through large-scale training.
  10. Gopher:

    • Architecture: A family of models ranging from 44M to 280B parameters.
    • Key Feature: Outperforms GPT-3 in 81% of tasks, highlighting the importance of model scale.
  11. ERNIE 3.0 TITAN:

    • Architecture: A larger version of ERNIE 3.0 with 26x more parameters.
    • Key Feature: Includes credible and controllable generation tasks for factual consistency.
  12. GPT-NeoX-20B:

    • Architecture: Similar to GPT-3, with optimizations such as rotary positional embeddings (RoPE; see the sketch after this list).
    • Key Feature: Performance tuning based on model size scaling and parallel computation.
  13. OPT:

    • Architecture: An open-source replication of GPT-3.
    • Key Feature: Uses dynamic loss scaling and early checkpoint restarts for training stability.
  14. BLOOM:

    • Architecture: Causal decoder with innovations such as ALiBi positional biases (see the sketch after this list).
    • Key Feature: Enhanced training stability and downstream performance.
  15. GLaM:

    • Architecture: Sparse Mixture-of-Experts (MoE) structure, with only two experts activated per input token (see the top-2 routing sketch after this list).
    • Key Feature: Efficient use of computation, achieving better results with reduced training energy.
  16. MT-NLG:

    • Architecture: 530B parameter model, surpasses GPT-3 in performance on several benchmarks.
  17. Chinchilla:

    • Architecture: Causal decoder model optimized for compute-efficient training.
    • Key Feature: Identifies scaling laws relating model size and training tokens for compute-optimal performance (see the worked example after this list).
  18. AlexaTM:

    • Architecture: Encoder-decoder, initially pre-trained with frozen encoder weights.
    • Key Feature: Denoising and causal language modeling improve learning.
  19. PaLM:

    • Architecture: Decoder-only model with a parallel attention/feed-forward block formulation.
    • Key Feature: Optimized training with SwiGLU activations, RoPE embeddings, and multi-query attention (see the RoPE and multi-query attention sketches after this list).
  20. PaLM-2:

    • Architecture: A smaller version of PaLM with improved fine-tuning and reduced training costs.
  21. U-PaLM:

    • Key Feature: Uses UL2 objective to outperform baseline models on various NLP tasks.
  22. UL2:

    • Architecture: Encoder-decoder trained with a mixture of denoisers (see the sketch after this list).
    • Key Feature: Outperforms T5 on many benchmarks thanks to this mixed denoising objective.
  23. GLM-130B:

    • Architecture: Bidirectional model trained with mask infilling.
    • Key Feature: Outperforms GPT-3 with bilingual capabilities.
  24. LLaMA:

    • Architecture: Decoder-only model with efficient causal attention and reduced activation recomputation.
    • Key Feature: Highly efficient with a focus on fine-tuning.
  25. DeepSeek:

    • Architecture: Focused on scaling laws for optimal model size and training data.
    • Key Feature: Experiments with optimal batch size, learning rate, and compute budgets.
  26. DeepSeek-v2:

    • Architecture: Introduces Multi-head Latent Attention (MLA) to reduce inference costs.
    • Key Feature: Higher inference throughput by compressing the key-value cache with MLA (see the sketch after this list).
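
A few of the techniques mentioned above can be made concrete with short sketches. First, T5-style span corruption (item 1): contiguous masked spans are collapsed into single sentinel tokens, which is what shortens the sequences and speeds up training. This is a minimal illustrative sketch, not T5's actual preprocessing code; the helper name and the simplified span sampling are assumptions.

```python
import random

def span_corrupt(tokens, corrupt_rate=0.15, mean_span_len=3, seed=0):
    """Toy T5-style span corruption: each corrupted span is replaced by a
    single sentinel token, so both the encoder input and the decoder target
    are much shorter than the original sequence."""
    rng = random.Random(seed)
    n_spans = max(1, int(len(tokens) * corrupt_rate / mean_span_len))
    starts = sorted(rng.sample(range(len(tokens) - mean_span_len), n_spans))

    inputs, targets, cursor, sid = [], [], 0, 0
    for s in starts:
        if s < cursor:                                  # skip overlapping picks in this toy version
            continue
        inputs.extend(tokens[cursor:s])
        sentinel = f"<extra_id_{sid}>"
        inputs.append(sentinel)                         # whole span collapsed to one sentinel
        targets.append(sentinel)
        targets.extend(tokens[s:s + mean_span_len])     # the model must reproduce the span
        cursor = s + mean_span_len
        sid += 1
    inputs.extend(tokens[cursor:])
    return inputs, targets

toks = "the quick brown fox jumps over the lazy dog today".split()
enc_in, dec_tgt = span_corrupt(toks)
print(enc_in)    # original tokens with one span replaced by <extra_id_0>
print(dec_tgt)   # the sentinel followed by the dropped tokens
```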
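
Rotary positional embeddings (RoPE), mentioned for GPT-NeoX-20B and PaLM, rotate pairs of query/key dimensions by a position-dependent angle so that attention scores depend on relative offsets. A minimal sketch using the common split-halves formulation; the base frequency of 10000 is the usual default, not a value taken from this article.

```python
import numpy as np

def apply_rope(x, base=10000.0):
    """Apply rotary positional embedding to a (seq_len, d) tensor, d even.
    Pairs of dimensions are rotated by position-dependent angles, so the
    query-key dot product depends only on the relative distance between tokens."""
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)        # one frequency per dimension pair
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)

    x1, x2 = x[:, :half], x[:, half:]                # the two halves form the rotated pairs
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).normal(size=(5, 8))     # 5 positions, head dimension 8
print(apply_rope(q).shape)                           # (5, 8)
```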
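
BLOOM's ALiBi replaces positional embeddings with a per-head linear penalty on attention logits that grows with the query-key distance. A minimal sketch, assuming a power-of-two number of heads so the standard slope formula applies:

```python
import numpy as np

def alibi_bias(n_heads, seq_len):
    """ALiBi-style bias: for head h, add -slope_h * (distance to the key)
    to every attention logit; no position embeddings are needed."""
    # Geometric slopes, matching the ALiBi recipe when n_heads is a power of two.
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])

    pos = np.arange(seq_len)
    rel = pos[None, :] - pos[:, None]       # 0 on the diagonal, negative for past keys
    # Future positions get positive values here, but they are removed by the causal mask.
    return slopes[:, None, None] * rel[None, :, :]   # (n_heads, seq, seq), added to logits

bias = alibi_bias(n_heads=8, seq_len=4)
print(bias[0])   # head 0: 0 on the diagonal, -0.5, -1.0, ... for older keys
```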
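
GLaM's sparse Mixture-of-Experts layer routes each token to its two highest-scoring experts, so only a small fraction of the parameters is active per token. Below is a minimal dense-Python sketch of top-2 routing; the gating and expert shapes are illustrative, and load-balancing losses and capacity limits used in real MoE layers are omitted.

```python
import numpy as np

def top2_moe(x, w_gate, experts):
    """Top-2 MoE routing sketch.
    x: (tokens, d_model), w_gate: (d_model, n_experts),
    experts: list of callables mapping (1, d_model) -> (1, d_model)."""
    logits = x @ w_gate                                   # (tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)            # softmax over experts

    top2 = np.argsort(-probs, axis=-1)[:, :2]             # indices of the 2 best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        e1, e2 = top2[t]
        g = probs[t, [e1, e2]]
        g = g / g.sum()                                    # renormalise the two gate values
        out[t] = g[0] * experts[e1](x[t:t+1])[0] + g[1] * experts[e2](x[t:t+1])[0]
    return out

# Toy usage: 4 tokens, d_model=8, 4 random linear "experts".
rng = np.random.default_rng(0)
d, n_exp = 8, 4
x = rng.normal(size=(4, d))
w_gate = rng.normal(size=(d, n_exp))
experts = [lambda h, W=rng.normal(size=(d, d)): h @ W for _ in range(n_exp)]
print(top2_moe(x, w_gate, experts).shape)   # (4, 8)
```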
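
Chinchilla's scaling result is often summarized by two rules of thumb: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and compute-optimal training uses roughly 20 tokens per parameter. A back-of-the-envelope sketch (the "6ND" and "20x" figures are the commonly quoted approximations, not exact fits from the paper):

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Given a compute budget C ~ 6 * N * D, with D ~ 20 * N at the
    compute-optimal point, solve for the parameter and token counts."""
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(5.76e23)   # roughly Chinchilla's training budget
print(f"params ~ {n/1e9:.0f}B, tokens ~ {d/1e12:.1f}T")   # ~70B params, ~1.4T tokens
```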
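
PaLM's multi-query attention gives every query head its own projection but shares a single key head and a single value head across all of them, which shrinks the key-value cache during decoding. A minimal NumPy sketch with illustrative projection shapes:

```python
import numpy as np

def multi_query_attention(x, wq, wk, wv, n_heads):
    """Multi-query attention sketch: n_heads query projections, but one
    shared key head and one shared value head for the whole layer."""
    seq, _ = x.shape
    d_head = wk.shape[1]
    q = (x @ wq).reshape(seq, n_heads, d_head)     # per-head queries
    keys, vals = x @ wk, x @ wv                    # single shared K and V

    scores = np.einsum('qhd,kd->hqk', q, keys) / np.sqrt(d_head)
    causal = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(causal, -1e9, scores)        # mask out future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    out = np.einsum('hqk,kd->qhd', weights, vals)  # (seq, n_heads, d_head)
    return out.reshape(seq, n_heads * d_head)

rng = np.random.default_rng(0)
d_model, n_heads, d_head, seq = 16, 4, 4, 6
out = multi_query_attention(rng.normal(size=(seq, d_model)),
                            rng.normal(size=(d_model, n_heads * d_head)),
                            rng.normal(size=(d_model, d_head)),
                            rng.normal(size=(d_model, d_head)),
                            n_heads)
print(out.shape)   # (6, 16) -- yet the KV cache holds only one head's K/V per token
```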
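
UL2's mixture of denoisers samples a different corruption objective per training example: regular span corruption (R), extreme corruption with long spans or high rates (X), and sequential prefix-LM denoising (S). The configuration values below are illustrative placeholders, not the paper's exact settings:

```python
import random

# Illustrative UL2-style denoiser mixture; rates and span lengths are
# placeholders, not the exact configuration from the UL2 paper.
DENOISERS = [
    {"name": "R", "mean_span": 3,  "corrupt_rate": 0.15},   # regular span corruption
    {"name": "X", "mean_span": 32, "corrupt_rate": 0.50},   # extreme corruption
    {"name": "S", "prefix_lm": True},                       # predict a suffix from a prefix
]

def sample_denoiser(rng=random):
    """Pick one denoising objective for the next training example."""
    return rng.choice(DENOISERS)

print(sample_denoiser())
```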
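
Finally, DeepSeek-v2's Multi-head Latent Attention (MLA) caches a low-rank latent per token instead of full per-head keys and values, re-expanding it at attention time. The arithmetic sketch below only illustrates the cache-size saving; the dimensions are made-up examples, not DeepSeek-v2's actual configuration:

```python
def kv_cache_sizes(n_heads, d_head, d_latent, seq_len):
    """Compare a standard per-head KV cache with an MLA-style latent cache.
    Standard attention stores K and V for every head and token; MLA stores
    one compressed latent vector per token and reconstructs K/V from it."""
    standard = seq_len * n_heads * d_head * 2     # K and V, per head, per token
    latent = seq_len * d_latent                   # one latent vector per token
    return standard, latent

std, lat = kv_cache_sizes(n_heads=40, d_head=128, d_latent=512, seq_len=4096)
print(f"standard: {std:,} values, latent: {lat:,} values ({std / lat:.0f}x smaller)")
```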

This section provides a detailed look at how various pre-trained models have evolved, their key innovations, and their contributions to improving NLP tasks. The focus is on architectures, training methodologies, and fine-tuning techniques, with an emphasis on efficiency, scalability, and language-specific optimizations.
