aidevblogs
Uncovering Competency Gaps in Large Language Models and Their Benchmarks
arXiv CS.CL·arxiv.org·1 day ago·LLMs
TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior
arXiv CS.CL·arxiv.org·1 day ago·LLMs
Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles
arXiv CS.CL·arxiv.org·1 day ago·LLMs
Investigating Model Editing for Unlearning in Large Language Models
arXiv CS.CL·arxiv.org·1 day ago·LLMs
Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?
arXiv CS.CL·arxiv.org·1 day ago·LLMs
Semantic Deception: When Reasoning Models Can't Compute an Addition
arXiv CS.CL·arxiv.org·1 day ago·LLMs
EssayCBM: Rubric-Aligned Concept Bottleneck Models for Transparent Essay Grading
arXiv CS.CL·arxiv.org·1 day ago·LLMs
MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs
arXiv CS.CL·arxiv.org·1 day ago·LLMs
Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
arXiv CS.CL·arxiv.org·1 day ago·LLMs
How important is Recall for Measuring Retrieval Quality?
arXiv CS.CL·arxiv.org·1 day ago·LLMs
Architectural Trade-offs in Small Language Models Under Compute Constraints
arXiv CS.CL·arxiv.org·1 day ago·LLMs
Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation
arXiv CS.CL·arxiv.org·1 day ago·LLMs
Neural Probe-Based Hallucination Detection for Large Language Models
arXiv CS.CL·arxiv.org·1 day ago·LLMs
MultiMind at SemEval-2025 Task 7: Crosslingual Fact-Checked Claim Retrieval via Multi-Source Alignment
arXiv CS.CL·arxiv.org·1 day ago·LLMs
Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models
arXiv CS.CL·arxiv.org·1 day ago·LLMs
Automatic Replication of LLM Mistakes in Medical Conversations
arXiv CS.CL·arxiv.org·1 day ago·LLMs