Business & Funding · Tools & Code
GitLab Act 2
GitLab is restructuring operations in response to the agentic AI era, cutting its geographic footprint by up to 30% and reducing headcount. The move signals how established developer platforms are recalibrating for an AI-native workflow landscape, where distributed teams and traditional DevOps tooling face pressure from autonomous agents. This reshaping matters because GitLab's scale and public transparency reveal how infrastructure companies are repositioning: fewer regional outposts, likely consolidation around core markets, and strategic bets on which capabilities matter when agents handle more CI/CD and deployment tasks.
Simon Willison · 6d ago · 77
Business & Funding · Policy & Regulation
Ilya Sutskever Stands by His Role in Sam Altman’s OpenAI Ouster: ‘I Didn’t Want It to Be Destroyed’
Ilya Sutskever's courtroom defense of OpenAI during testimony about Sam Altman's 2023 removal signals a fracture in the narrative around that pivotal boardroom conflict. Despite departing the company months later, Sutskever's willingness to publicly oppose claims that OpenAI faced existential risk undercuts the internal governance dispute that nearly fractured the AI industry's most influential lab. His testimony reframes the ouster as a disagreement over organizational direction rather than a safety-driven intervention, reshaping how insiders understand the power dynamics and decision-making processes at frontier AI companies during moments of acute leadership tension.
WIRED - AI · 6d ago · 65
Products & Apps · Business & Funding
OpenAI just released its answer to Claude Mythos
OpenAI is positioning itself in the enterprise security market with Daybreak, a vulnerability-detection initiative built on its Codex Security agent. The system generates threat models from organizational codebases, identifies attack vectors, and automates vulnerability discovery before exploitation occurs. This move signals OpenAI's pivot toward infrastructure-layer AI products that compete less on raw capability and more on specialized, defensible workflows. For enterprises, the play matters: automated security scanning powered by LLM reasoning could reshape how development teams approach threat assessment, though effectiveness claims remain unvalidated in the wild.
The Verge - AI · 6d ago · 69
Business & Funding
GM just laid off hundreds of IT workers to hire those with stronger AI skills
General Motors is restructuring its IT workforce to prioritize AI competency, cutting legacy positions while hiring specialists in generative AI development, data engineering, cloud infrastructure, and prompt engineering. This reflects a broader corporate shift where traditional tech roles face displacement as enterprises race to embed AI capabilities across operations. The move signals that AI skills now command premium hiring power even within mature industrial companies, reshaping talent markets beyond pure-play tech firms.
TechCrunch - AI · 6d ago · 65
Products & Apps · Business & Funding
Here’s what Mira Murati’s AI company is up to
Thinking Machines, Mira Murati's post-OpenAI venture, is developing interaction models designed to enable natural human-AI collaboration through continuous multimodal input streams. This represents a strategic pivot toward conversational, real-time AI systems that operate across audio and video simultaneously, positioning the startup to compete in the emerging space of embodied and always-on AI assistants. The approach signals growing industry consensus that next-generation value lies not in static model capability but in seamless, continuous interaction paradigms.
The Verge - AI · 6d ago · 69
Business & Funding · Products & Apps
OpenAI Launches AI Consulting Company, Following Anthropic
OpenAI is establishing a dedicated consulting division to help enterprises navigate AI deployment challenges, mirroring Anthropic's earlier move into services. This signals a strategic pivot by frontier labs toward capturing implementation revenue alongside model licensing, recognizing that capability alone doesn't guarantee adoption. The consulting play addresses a real market gap: enterprises struggle with integration, fine-tuning, and organizational change management. For insiders, this reflects growing competition for enterprise wallet share and suggests AI vendors now view advisory services as table stakes in the B2B stack, not an afterthought.
AI Business · May 11 · 61
Hardware & Infra · Policy & Regulation
Data center used 30 million gallons of water without initially paying
A major data center consumed 30 million gallons of water without initially compensating local authorities, exposing the hidden infrastructure costs of AI scaling. The incident underscores a critical tension in the AI industry: massive computational demands require enormous water resources for cooling, yet regulatory frameworks and payment mechanisms lag behind deployment velocity. This raises questions about whether AI companies can self-regulate resource consumption or whether governments must impose stricter environmental accountability before the next generation of models launches.
Ars Technica - AI · May 11 · 69
Opinion & Analysis
Quoting James Shore
James Shore argues that AI coding agents must deliver proportional reductions in maintenance burden to justify productivity gains, not just speed boosts. The core thesis: if an LLM doubles code output, maintenance costs must halve, or teams face compounding long-term liabilities. This reframes the ROI calculus for enterprise AI adoption away from raw velocity metrics toward total-cost-of-ownership, challenging the prevailing narrative that faster code generation alone justifies agent deployment.
Simon Willison · May 11 · 77
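The arithmetic behind the "doubles output, costs must halve" claim can be sketched in a few lines. This is a minimal illustration that assumes total maintenance burden scales linearly as code volume times per-unit maintenance cost. That linear model and the function names are assumptions made for this example, not Shore's own formulation.

```python
def relative_maintenance_burden(output_multiplier: float,
                                unit_cost_multiplier: float) -> float:
    """Total maintenance burden relative to the pre-agent baseline.

    Assumes (hypothetically) that burden = code volume * per-unit cost,
    so multiplying both factors gives the new burden relative to 1.0.
    """
    return output_multiplier * unit_cost_multiplier


def break_even_unit_cost(output_multiplier: float) -> float:
    """Per-unit maintenance cost multiplier that keeps total burden flat."""
    if output_multiplier <= 0:
        raise ValueError("output_multiplier must be positive")
    return 1.0 / output_multiplier


if __name__ == "__main__":
    # Doubling output with unchanged per-unit cost doubles the burden.
    print(relative_maintenance_burden(2.0, 1.0))
    # Doubling output requires halving per-unit cost to break even.
    print(break_even_unit_cost(2.0))
```

Under this simplified model, any per-unit cost multiplier above `1 / output_multiplier` means the total burden grows with every release.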
Opinion & Analysis
Your AI Use Is Breaking My Brain
Jason Koebler's analysis reframes the AI saturation problem beyond the 'Dead Internet' trope, introducing 'Zombie Internet' to describe the cognitive friction of navigating spaces where human and machine-generated content are now indistinguishable. The piece argues that widespread AI deployment has created a filtering burden that exhausts users and is subtly reshaping how humans themselves write online. This touches on a critical but underexplored externality: as AI-generated text becomes ambient, the mental cost of verification and the erosion of authentic voice become infrastructure-level problems that affect platform viability and user trust.
Simon Willison · May 11 · 77
Tools & Code · Opinion & Analysis
Using LLM in the shebang line of a script
Simon Willison documents a clever pattern for executing plain English text files as LLM commands by leveraging shebang lines and LLM's fragment system. The technique treats natural language as executable code, collapsing the boundary between prose and computation. This reflects a broader shift in developer tooling where LLMs become first-class interpreters in Unix pipelines, enabling rapid prototyping and reducing friction between human intent and system execution. The pattern signals how LLM-native workflows are embedding themselves into foundational developer practices.
Simon Willison · May 11 · 72
Policy & Regulation · Business & Funding
The EU wants to regulate AI but needs OpenAI and Anthropic to let regulators through the door
Europe's AI regulatory framework faces a critical enforcement gap: OpenAI has voluntarily granted the EU Commission access to GPT-5.5 Cyber for security audits, but Anthropic remains resistant after multiple regulatory meetings without granting inspection rights to its Mythos model. This divergence exposes a structural vulnerability in the EU's oversight strategy, which lacks legal teeth to compel frontier labs to submit systems for review. The asymmetry signals that regulatory credibility now hinges on corporate goodwill rather than binding authority, reshaping how Europe can actually enforce the AI Act's safety requirements.
The Decoder · May 11 · 80
Research · Models & Releases
ELF: Embedded Language Flows
Researchers propose Embedded Language Flows (ELF), a diffusion model architecture that operates primarily in continuous embedding space rather than discrete token space, only discretizing at the final step. This challenges the dominant paradigm where language diffusion models work directly over tokens, mirroring the continuous-space success of image and video generation. The approach suggests that flow-based methods can match or exceed discrete diffusion performance on language tasks with minimal architectural overhead, potentially reshaping how generative language models are designed beyond autoregressive and masked-prediction approaches.
arXiv cs.CL · May 11 · 62
Research
Variational Inference for Lévy Process-Driven SDEs via Neural Tilting
Researchers have developed a neural exponential tilting framework that extends variational inference to Lévy-driven stochastic differential equations, bridging a long-standing gap in Bayesian modeling. Traditional approaches either sacrifice scalability through Monte Carlo rigor or rely on Gaussian assumptions that miss discontinuities and heavy tails. This work matters for practitioners in finance, climate modeling, and safety-critical systems where extreme events dominate risk. The technique reweights Lévy measures within a learned variational family, enabling tractable inference over jump processes at neural-network speed. Success here could reshape how uncertainty quantification handles non-Gaussian phenomena in high-stakes domains.
arXiv cs.LG · May 11 · 58
Research · Models & Releases
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO addresses a critical constraint in deploying sparse mixture-of-experts models on resource-limited devices by matching dense transformer performance within identical parameter budgets. The architecture combines differentiable ReLU routing with learnable expert scaling and introduces NormSiLU activation to reduce the storage and memory-access overhead that typically makes MoE models impractical for edge deployment. This work matters because it directly tackles the gap between MoE's theoretical efficiency gains and real-world on-device constraints, potentially unlocking efficient inference for mobile and embedded systems without sacrificing model quality.
arXiv cs.CL · May 11 · 62
Research
Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime
Researchers have formalized how transformer token distributions evolve during inference using mean-field theory and multi-particle system analysis. The work proves that attention mechanisms cause token representations to rapidly concentrate onto a lower-dimensional manifold defined by key-query-value projections, remaining stable for practical inference windows. This theoretical foundation matters for practitioners because it explains why transformers compress information so effectively and provides mathematical tools to predict failure modes in long-context scenarios where metastability breaks down.
arXiv cs.LG · May 11 · 58
Research · Models & Releases
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
Researchers propose SLIM, a framework that treats external skills for language model agents as dynamic variables rather than static toolsets. The insight challenges a core assumption in agentic AI: that skills either persist indefinitely or get absorbed into the model's weights. Instead, optimal skill composition varies by task and training stage, suggesting agents should actively manage which capabilities to activate. This reframes how we think about scaling agent capabilities beyond model parameters, with implications for efficient deployment and skill reuse across diverse problem domains.
arXiv cs.CL · May 11 · 58
Research · Tools & Code
Optimal and Scalable MAPF via Multi-Marginal Optimal Transport and Schrödinger Bridges
Researchers have reformulated multi-agent path finding as a multi-marginal optimal transport problem, collapsing an exponentially complex search space into a tractable linear program. The breakthrough leverages Schrödinger bridges to scale the approach to real-world robot coordination tasks while guaranteeing collision-free, space-time non-overlapping solutions. This bridges classical operations research with modern probabilistic methods, offering AI systems a principled way to coordinate large swarms without exponential blowup, relevant to autonomous logistics, warehouse automation, and distributed robotics.
arXiv cs.LG · May 11 · 58
Research · Tools & Code
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
WildClawBench addresses a critical gap in agent evaluation by moving beyond synthetic sandboxes to test language and vision models in production-grade environments. The benchmark comprises 60 real-world tasks running inside Docker containers with actual CLI tools rather than mocked APIs, each requiring 20+ tool calls over roughly 8 minutes of execution. This shift from short-horizon, final-answer validation to long-horizon, runtime-faithful assessment matters because it exposes whether deployed agents can handle the messy complexity of actual work. For teams building or deploying agentic systems, the benchmark signals that synthetic metrics no longer suffice for credibility.
arXiv cs.CL · May 11 · 62
Research · Tools & Code
Equivariant Reinforcement Learning for Clifford Quantum Circuit Synthesis
Researchers have developed an equivariant neural network architecture that learns to synthesize Clifford quantum circuits through reinforcement learning, with a key innovation: the learned policy generalizes across different qubit counts without retraining. This addresses a fundamental challenge in quantum circuit optimization by embedding symmetry constraints directly into the network design, enabling a single model to handle variable problem sizes. The approach combines curriculum learning from random walks with symplectic matrix representations, advancing the intersection of deep learning and quantum computing where generalization across hardware scales remains a critical bottleneck for practical deployment.
arXiv cs.LG · May 11 · 58
Research
Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients
Researchers propose k-step policy gradients to address a fundamental limitation in reinforcement learning: standard policy gradient methods optimize greedily based only on immediate one-step returns, causing them to converge to suboptimal solutions when policy classes are restricted. The new approach couples randomness across multiple timesteps to escape these local optima, with theoretical guarantees that performance approaches the optimal deterministic policy exponentially as k increases. This work matters for practitioners deploying RL in constrained settings, from robotics to dialogue systems, where restricted policy classes are common but myopic optimization has historically limited performance ceilings.
arXiv cs.LG · May 11 · 58
Research · Tools & Code
DataMaster: Towards Autonomous Data Engineering for Machine Learning
A new research direction tackles a structural bottleneck in ML systems: as model architectures and training procedures plateau toward commodity status, data quality and composition emerge as the primary lever for performance gains. This work proposes autonomous agents that handle the full data engineering pipeline, from external dataset discovery through cleaning and transformation, without touching the underlying learning algorithm. The approach matters because it decouples data optimization from model development, potentially letting practitioners squeeze more value from fixed compute budgets and standardized training recipes. For teams operating under resource constraints, this signals a shift in where competitive advantage concentrates.
arXiv cs.LG · May 11 · 62
Research · Tools & Code
Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers
Researchers have moved beyond empirical red-teaming by formalizing how guardrail classifiers can certify safety guarantees. The key insight shifts verification from discrete input space to the classifier's learned representation layer, where harmful prompts cluster into certifiable convex regions. By leveraging the monotonicity of sigmoid heads, the team derives closed-form soundness proofs without approximation, addressing a critical gap in production LLM safety: testing shows promise, but deployed systems lack mathematical guarantees. This matters for anyone shipping guardrails at scale, as formal verification could become table stakes for enterprise and regulated deployments.
arXiv cs.LG · May 11 · 62
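One way to see why sigmoid monotonicity permits a closed-form check over a convex region is the sketch below. It assumes a linear head followed by a sigmoid, and a region given as the convex hull of a finite set of representation vectors. Both assumptions, and every name in the code, are illustrative and not taken from the paper. Because the logit is affine, its minimum over the hull is attained at a vertex, and because the sigmoid is increasing, the minimum score over the whole region is the sigmoid of that minimum logit.

```python
import math
from typing import Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def certify_region_flagged(
    vertices: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
    threshold: float = 0.5,
) -> Tuple[bool, float]:
    """Check that every point in the convex hull of `vertices` scores >= threshold.

    Hypothetical setup: the guardrail head is sigmoid(w . h + b). An affine
    function attains its minimum over a convex hull at one of the vertices,
    and sigmoid is monotonically increasing, so the smallest vertex logit
    yields an exact lower bound on the score over the whole region.
    """
    if not vertices:
        raise ValueError("at least one vertex is required")
    logits = [sum(w * x for w, x in zip(weights, v)) + bias for v in vertices]
    lower_bound = sigmoid(min(logits))
    return lower_bound >= threshold, lower_bound


if __name__ == "__main__":
    region = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    certified, bound = certify_region_flagged(region, weights=(2.0, 2.0), bias=-1.0)
    print(certified, round(bound, 3))
```

If the check returns `True`, no representation inside the region can fall below the flagging threshold under these assumptions. A `False` result only means that this particular region is not certified.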
Research · Models & Releases
RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
Meta researchers propose RubricEM, a reinforcement learning framework that treats evaluation rubrics as structural primitives for training research agents on open-ended tasks. Rather than relying on verifiable ground-truth rewards, the system decomposes policy execution into rubric-aligned stages, uses rubric feedback to guide reflection, and builds reusable memory from failed trajectories. This addresses a critical gap in post-training: how to scale RL beyond tasks with checkable answers to long-horizon reasoning work like report synthesis and evidence evaluation. The approach signals growing focus on making RL practical for frontier agent systems where traditional reward signals collapse.
arXiv cs.CL · May 11 · 62
Research · Tools & Code
V4FinBench: Benchmarking Tabular Foundation Models, LLMs, and Standard Methods on Corporate Bankruptcy Prediction
V4FinBench addresses a critical gap in financial AI evaluation by releasing over one million company-year records from Central European economies, enabling rigorous testing of tabular foundation models and LLMs on bankruptcy prediction under realistic class imbalance. The dataset's scale and multi-horizon design matter because most public benchmarks remain orders of magnitude smaller, forcing researchers to rely on paywalled alternatives or synthetic data. This release lets the community stress-test whether foundation models trained on general text outperform specialized tabular methods on high-stakes financial forecasting, a question with direct implications for how financial institutions should allocate compute and model selection budgets.
arXiv cs.LG · May 11 · 58
Opinion & Analysis · Policy & Regulation
Three things in AI to watch, according to a Nobel-winning economist
Daron Acemoglu, the 2024 Nobel laureate in economics, has emerged as a critical voice challenging Silicon Valley's AI narrative. His recent work questions whether current AI deployment models deliver genuine productivity gains or concentrate wealth without broad economic benefit. His perspective matters because it reframes how policymakers and investors should evaluate AI's societal ROI, moving beyond hype cycles toward measurable impact on labor markets and inequality. This positions economic scrutiny as a counterweight to techno-optimism in shaping AI regulation and corporate strategy.
MIT Technology Review - AI · May 11 · 77
Research
Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking
Vision-language models routinely generate plausible outputs driven by text priors alone, with images playing no role in the prediction. This 'visual ungroundedness' defeats existing confidence metrics because they cannot distinguish between image-informed and image-agnostic reasoning. BICR addresses this by training a lightweight probe on contrastive hidden states extracted from frozen LVLMs under two conditions: normal inference with images present, and inference with images blacked out. The method surfaces whether a model's confidence reflects genuine visual grounding or mere language pattern matching, a critical diagnostic for production deployments where hallucination risk is high.
arXiv cs.CL · May 11 · 62
Research · Tools & Code
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Researchers have developed a training-free diagnostic framework that resolves a critical blind spot in on-policy distillation, a technique increasingly used to train reasoning models with dense token-level supervision. The work moves beyond aggregate metrics to pinpoint exactly when teacher guidance helps or hurts individual predictions, and whether optimal teacher selection should vary token-by-token. This addresses a practical bottleneck for teams scaling reasoning models: current evaluation requires expensive training runs that obscure failure modes. The framework's per-token, per-question resolution enables faster iteration on distillation strategies without costly experimentation, directly impacting how efficiently labs can optimize reasoning model training.
arXiv cs.LG · May 11 · 62
Research · Hardware & Infra
LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
Recommendation models at scale face a precision-efficiency tradeoff that differs fundamentally from language models. While FP8 arithmetic has unlocked speedups across GPU hardware, recommendation systems resist direct quantization due to numerical sensitivity in embedding operations and communication bottlenecks during distributed training. LoKA proposes a co-designed kernel and algorithmic framework to make low-precision arithmetic viable for this workload class, addressing a gap where infrastructure gains haven't translated to production adoption. Success here unlocks efficiency gains across e-commerce, ads, and ranking systems that process billions of daily inferences.
arXiv cs.LG · May 11 · 58
Research
Neural Weight Norm = Kolmogorov Complexity
A new theoretical result connects neural network regularization to fundamental computer science, proving that weight decay implicitly optimizes for Kolmogorov complexity in fixed-precision regimes. The finding bridges deep learning practice with Solomonoff's universal prior, suggesting weight decay naturally biases networks toward simpler, more generalizable solutions. This explains a long-standing empirical mystery about why a decades-old regularization technique remains effective across modern architectures, and implies the choice of norm matters less than the sparsity it induces. The result matters for interpretability and inductive bias design, offering theoretical grounding for why neural networks generalize.
arXiv cs.LG · May 11 · 72
Research · Tools & Code
Neural at ArchEHR-QA 2026: One Method Fits All: Unified Prompt Optimization for Clinical QA over EHRs
Neural's ArchEHR-QA submission demonstrates a modular approach to clinical question answering over electronic health records, using DSPy's MIPROv2 optimizer to automatically tune prompts and few-shot examples across four interdependent stages. The method chains question interpretation, evidence retrieval, answer generation, and grounding validation, with self-consistency voting across stochastic runs to reduce hallucination. This work signals growing maturity in applying LLM optimization frameworks to high-stakes medical QA, where faithful grounding and evidence traceability are non-negotiable, and suggests prompt engineering at scale can compete with task-specific fine-tuning in regulated domains.
arXiv cs.CL · May 11 · 52
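Self-consistency voting, mentioned in the summary above, can be illustrated with a plain majority vote over answers from repeated stochastic runs. This is a generic sketch and not the submission's actual pipeline: the normalization step, the tie-breaking rule (first answer seen wins), and the function name are all assumptions chosen for the example.

```python
from collections import Counter
from typing import Sequence, Tuple


def self_consistency_vote(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the most frequent answer and the fraction of runs agreeing with it.

    Answers are compared after trimming whitespace and lowercasing, which is
    an illustrative normalization. Ties go to the answer that appeared first,
    because max() returns the first maximal key in insertion order.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    winner = max(counts, key=counts.get)
    return winner, counts[winner] / len(normalized)


if __name__ == "__main__":
    # Hypothetical outputs from three sampled runs of the same question.
    runs = ["Yes", "yes ", "no"]
    print(self_consistency_vote(runs))
```

The agreement fraction can double as a rough confidence signal: a low value suggests the sampled runs disagree and the answer may need review.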