    Tag: Attention-Variants

    3 items with this tag.

    • Feb 19, 2025 · Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention · Attention-Variants
    • Feb 07, 2025 · Scalable-Softmax Is Superior for Attention · Attention-Variants
    • Feb 05, 2025 · DINT Transformer · Attention-Variants