ML & AI news of the week

Photo by Priscilla Du Preez 🇨🇦 on Unsplash

A collection of the best ML & AI news every week (research, news, resources). Star this repository if you find it useful.

Here, you can find articles and tutorials about artificial intelligence.

For each week you will find different sections:

Research: the most important published research of the week.
News: the most important news related to companies, institutions, and much more.
Resources: released resources for artificial intelligence and machine learning.
Perspectives: a collection of deep and informative articles about open questions in artificial intelligence.

and a meme to start the week well.

Suggestions and corrections

Feel free to open an issue if you find errors or have suggestions, topics, or other comments.

Index

2024

ML news: Week 16 - 22 December
ML news: Week 9 - 15 December
ML news: Week 2 - 8 December
ML news: Week 25 November - 1 December
ML news: Week 18 - 24 November
ML news: Week 11 - 17 November
ML news: Week 3 - 10 November
ML news: Week 28 October - 3 November
ML news: Week 21 - 27 October
ML news: Week 14 - 20 October
ML news: Week 7 - 13 October
ML news: Week 30 September - 6 October
ML news: Week 23 - 29 September
ML news: Week 16 - 22 September
ML news: Week 9 - 15 September
ML news: Week 2 - 8 September
ML news: Week 26 August - 1 September
ML news: Week 19 - 25 August
ML news: Week 12 - 18 August
ML news: Week 5 - 11 August
ML news: Week 29 July - 4 August
ML news: Week 21 - 28 July
ML news: Week 15 - 21 July
ML news: Week 8 - 14 July
ML news: Week 1 - 7 July
ML news: Week 24 - 30 June
ML news: Week 17 - 23 June
ML news: Week 10 - 16 June
ML news: Week 3 - 9 June
ML news: Week 27 May - 2 June
ML news: Week 20 - 26 May
ML news: Week 13 - 19 May
ML news: Week 6 - 12 May
ML news: Week 29 April - 5 May
ML news: Week 21 - 28 April
ML news: Week 15 - 21 April
ML news: Week 8 - 14 April
ML news: Week 1 - 7 April
ML news: Week 25 - 31 March
ML news: Week 18 - 24 March
ML news: Week 11 - 17 March
ML news: Week 4 - 10 March
ML news: Week 26 February - 3 March
ML news: Week 19 - 25 February
ML news: Week 12 - 18 February
ML news: Week 5 - 11 February
ML news: Week 29 January - 4 February
ML news: Week 22 - 28 January
ML news: Week 15 - 21 January
ML news: Week 8 - 14 January
ML news: Week 1 - 7 January

Back to index

2024

ML news: Week 16 - 22 December

Research

Link	description
Training Large Language Models to Reason in a Continuous Latent Space.	Coconut (Chain of Continuous Thought) introduces a novel paradigm enabling LLMs to reason in continuous latent space instead of natural language. By using the LLM's last hidden state as the reasoning state and feeding it back directly as the next input embedding, Coconut achieves "continuous thought." This approach enhances LLM performance on complex reasoning tasks, leveraging emergent breadth-first search capabilities for more effective reasoning.
Asynchronous LLM Function Calling.	AsyncLM introduces a system for asynchronous LLM function calling, featuring an in-context protocol for function calls and interrupts, along with a fine-tuning strategy to adapt LLMs to interrupt semantics. Efficiently integrated into the LLM inference process, AsyncLM enables concurrent generation and execution of function calls, reducing task completion latency by 1.6x-5.4x compared to synchronous approaches.
MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification.	This multi-agent framework generates datasets of questions resembling customer queries and reverse engineers alternate questions from responses to verify agent trajectories. The synthetic data improves agent performance on real customer queries. For trajectory verification, it finds that simple ML baselines with feature engineering can achieve comparable performance to more advanced, resource-intensive models.
AutoReason: Automatic Few-Shot Reasoning Decomposition.	This method leverages CoT prompting to automatically generate rationales for queries, converting zero-shot queries into few-shot reasoning traces. These traces serve as CoT exemplars for the LLM, enhancing reasoning capabilities, particularly in weaker models.
Byte Latent Transformer: Patches Scale Better Than Tokens.	This work introduces a byte-level language model architecture that rivals tokenization-based LLMs in performance while offering greater efficiency and robustness. It dynamically groups bytes into patches based on next-byte entropy, dedicating more compute to hard-to-predict bytes and using larger patches for predictable sequences. BLT matches or surpasses models like Llama 3, reducing inference FLOPs by up to 50%.
Scheming reasoning evaluations.	This paper evaluates six frontier models for their in-context scheming abilities, testing whether models deceive developers to achieve goals by bypassing oversight mechanisms. For example, Claude 3 Opus was found duplicating its weights to a new server and lying about the action.
Researchers Use AI To Turn Sound Recordings Into Accurate Street Images.	Using generative artificial intelligence, a team of researchers at The University of Texas at Austin has converted sounds from audio recordings into street-view images. The visual accuracy of these generated images demonstrates that machines can replicate the human connection between audio and visual perception of environments.
Causal Explanations for Image Classifiers.	This paper presents "rex," a black-box tool that generates concise explanations for image classifier outputs using a novel approach based on causality theory.
Aligning Visual and Semantic Interpretability through Visually Grounded Concept Bottleneck Models.	This paper proposes visually grounded concept bottleneck models, which aim to align visual and semantic interpretability by grounding concepts in the image.
Adaptive Caching for Faster Video Generation with Diffusion Transformers.	Meta researchers have introduced Adaptive Caching (AdaCache), a training-free approach that accelerates video generation for Diffusion Transformers.
Alignment Faking in Large Language Models.	Anthropic and Redwood's research investigates how models behave when aware of alignment efforts, revealing they can exhibit alignment while retaining their original preferences. This finding highlights gaps in current alignment methods and offers insights for improvement.
Are Your LLMs Capable of Stable Reasoning?	Reasoning is a critical area for models, especially in real-world applications. However, existing benchmarks often fail to measure stability across novel tasks. This paper introduces G-Pass@k, a new metric that evaluates both a model's peak performance and its stability in reasoning tasks.
NoteContrast: Contrastive Language-Diagnostic Pretraining for Medical Text.	Accurate diagnostic coding of medical notes is vital for patient care, research, and billing but is time-consuming and often lacks precision. Automated coding using long-document transformers and contrastive loss functions has shown promise. This study integrates ICD-10 code sequences with medical text through contrastive pre-training, outperforming state-of-the-art models on MIMIC-III benchmarks, highlighting its effectiveness in improving diagnostic coding accuracy.
Context is Key: A Benchmark for Forecasting with Essential Textual Information.	Traditional time series forecasting methods rely solely on numerical features, rarely utilizing textual or semantic information about the task (e.g., predicting electricity prices or customer churn). When provided with this contextual textual information, language models significantly outperform all tested forecasting methods across a wide range of carefully decontaminated tasks.
Finally, a Replacement for BERT.	BERT, a widely used encoder-only language model, powers nearly every Google search query. A new model from Answer AI, LightOn, and collaborators offers a faster, more modern, and highly performant alternative. It serves as a drop-in replacement, incorporating innovations like batch ramp to enhance overall performance.
Thinking in Space.	A research initiative focused on spatial reasoning and AI models designed to interpret and interact within three-dimensional spaces.
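
As a rough illustration of the entropy-based patching mentioned in the Byte Latent Transformer entry above, here is a minimal sketch. It is not the paper's implementation: it assumes a toy bigram count model in place of a learned byte-level language model, and the function names `next_byte_entropies` and `entropy_patches`, as well as the `threshold` value, are hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict


def next_byte_entropies(data: bytes) -> list[float]:
    """Entropy (in bits) of a toy bigram model's prediction at each position.

    Assumption: position i is scored with the distribution of bytes that
    followed data[i - 1] anywhere in `data`. Position 0 has no context
    and is assigned 0.0.
    """
    if not data:
        return []
    followers = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        followers[prev][nxt] += 1
    entropies = [0.0]
    for prev in data[:-1]:
        counts = followers[prev]
        total = sum(counts.values())
        entropies.append(
            -sum((c / total) * math.log2(c / total) for c in counts.values())
        )
    return entropies


def entropy_patches(data: bytes, threshold: float = 1.0) -> list[bytes]:
    """Split `data` into patches, starting a new patch at every position
    whose next-byte entropy exceeds `threshold`."""
    if not data:
        return []
    entropies = next_byte_entropies(data)
    patches, start = [], 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches
```

Under these assumptions, predictable runs stay in one long patch, while positions the toy model finds uncertain open a new patch. Lowering `threshold` produces more, shorter patches.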

News

Link	description
BBC says it has complained to Apple over AI-generated fake news attributed to the broadcaster.	Notifications from a new Apple product falsely suggested the BBC claimed the New York gunman Luigi Mangione had killed himself
She didn’t get an apartment because of an AI-generated score – and sued to help others avoid the same fate.	Despite a stellar reference from a landlord of 17 years, Mary Louis was rejected after being screened by the firm SafeRent
Does RLHF Scale? Exploring the Impacts From Data, Model, and Method.	This paper examines the key components of the RLHF framework and their impacts, revealing the following insights: RLHF scales less effectively than pretraining for LLMs, with larger policy models benefiting less when using a fixed reward model. Increasing the number of responses sampled per prompt during training improves performance initially but plateaus at 4-8 samples. Larger reward models enhance reasoning task performance, but gains are inconsistent across task types. Increasing training data diversity for reward models is more impactful than boosting response diversity per prompt, though policy training shows diminishing returns beyond the early stages.
Granite Guardian.	IBM has open-sourced Granite Guardian, a suite of safeguards for detecting risks in LLMs. With AUC scores of 0.871 on harmful content and 0.854 on RAG-hallucination benchmarks, the authors claim it is the most generalizable and competitive model in the field.
Liquid AI Raises $250m.	Liquid AI has secured significant funding to advance the training of its efficient, general-purpose liquid-style foundation models.
Projects in OpenAI.	OpenAI has introduced “Projects”, a new way to organize chats and conversations.
AI Godmother Fei-Fei Li Has a Vision for Computer Vision.	Her startup, World Labs, is giving machines 3D spatial intelligence
Google says its new quantum chip is way faster than the world's most powerful supercomputer.	Google said its new chip Willow demonstrates that it's possible to build "a useful, large-scale quantum computer"
EU launches €10bn space program to rival Musk’s Starlink.	UK not part of Iris2 project, described as ‘a significant step towards Europe’s sovereignty and secure connectivity’
TikTok turns to US Supreme Court in a last-ditch bid to avert divest-or-ban law.	Firm and parent company ByteDance file request for an injunction to halt ban of the app used by 170 million Americans
Potential payouts for up to 300,000 Australian Facebook users in Cambridge Analytica settlement.	Office of the Australian Information Commissioner announces deal with Meta over scandal that may have affected 300,000 users
Chinese AI chip firms blacklisted over weapons concerns gained access to UK technology.	Imagination Technologies had licenses with two Chinese firms – but said it had not ‘implemented transactions’ that would enable the use of technology for military purposes
UK proposes letting tech firms use copyrighted work to train AI.	Consultation suggests an opt-out scheme for creatives who don’t want their work used by Google, OpenAI and others
Will the future of transportation be robotaxis – or your own self-driving car?	GM is shutting down its robotaxi business, and Tesla is creating one of its own. What does the future hold for self-driving?
Amazon-hosted AI tool for UK military recruitment ‘carries the risk of data breach’.	Ministry of Defence says risk with Textio tool is low and ‘robust safeguards’ have been put in place by suppliers
State-of-the-art video and image generation with Veo 2 and Imagen 3.	Google has announced a new video model and a new image generation model. Both are stunning improvements over the previous iterations.
OpenAI Search.	OpenAI explores the potential of ChatGPT Search on the 8th day of its announcements.
Reddit tests a conversational AI search tool.	As more AI companies gobble up Reddit’s data to fuel their own chatbots, the popular online forum site has begun testing a new conversational AI feature of its own.
Study claims AI could boost detection of breast cancer by 21%.	A U.S. breast-screening program claims to demonstrate the potential benefits of using artificial intelligence (AI) in mammography screening, with women who paid for AI-enhanced scans 21% more likely to have cancer detected.
Amazon forms an AI agent-focused lab led by Adept’s co-founder.	Amazon says that it’s establishing a new R&D lab in San Francisco, the Amazon AGI SF Lab, to focus on building “foundational” capabilities for AI agents.
NVIDIA's GenAI Supercomputer.	NVIDIA has unveiled its most affordable generative AI supercomputer, “Jetson Orin Nano Super Developer Kit”.
OpenAI's Developer APIs.	OpenAI demos new developer features and updates its APIs.
Grok for Everyone.	Grok has a new version and a new efficient model that is available for all users. It also has an improved image generation model and API.
YouTube’s new auto-dubbing feature is now available for knowledge-focused content.	YouTube's auto-dubbing feature is now available to hundreds of thousands more channels, focusing initially on informational content.
Google kicks off $20B renewable energy building spree to power AI.	Nuclear power may have received the lion’s share of attention from energy-hungry tech companies over the past few months, with Google among them. But it appears that those new reactors won’t be enough for their AI ambitions: Google is now working with partners to build gigawatts of renewable power, battery storage, and grid upgrades to power its data centers.
‘A truly remarkable breakthrough’: Google’s new quantum chip achieves accuracy milestone.	Error-correction feat shows quantum computers will get more accurate as they grow larger.
Publishers are selling papers to train AIs — and making millions of dollars.	Generative AI models require massive amounts of data — scholarly publishers are licensing their content to train them.
AI weatherman: the DeepMind researcher making faster, more accurate forecasts.	Rémi Lam is part of Nature’s 10, a list of people who shaped science in 2024.
Amazon workers across the US gear up to strike this week.	Move comes after company fails to meet deadline to begin contract talks with workers in Staten Island, New York
OpenAI makes ChatGPT available for phone calls and texts.	On day 10, OpenAI announced free voice mode and texting via WhatsApp, available globally for a limited number of minutes per month. The service leverages the Advanced Voice Mode API.
GitHub Copilot Now Free for VS Code.	GitHub Copilot is now automatically integrated into VS Code. Anyone can access 2,000 code completions and 50 chat messages per month simply by signing in with a personal GitHub account or creating a new one.
Introduction to Genies’ Smart Avatars.	Genies unveils Smart Avatars, AI-driven digital entities that transform online interactions by acting as dynamic extensions of user identity. Powered by LLMs and behavioral AI, these avatars enhance experiences in games and platforms while unlocking new avenues for monetization and engagement.
Perplexity's Campus Strategist Program.	Perplexity AI launches its 2024 program to promote AI adoption among students, providing campus-exclusive resources and opportunities for collaboration.
Aethir and partners pour $40M into decentralized infrastructure for AI and blockchain.	Aethir, in partnership with Beam Foundation, Sophon Foundation, and Permian Labs, is introducing Tactical Compute (TACOM), a $40 million initiative to deliver decentralized GPU infrastructure. TACOM addresses the growing need for scalable compute power in AI, gaming, and blockchain with tokenized, distributed solutions, unlocking new opportunities for GPU monetization and fostering innovation in AI and decentralized ecosystems.
Meta launches open source Llama 3.3, shrinking powerful bigger model into smaller size.	Meta's Llama 3.3 is a cost-efficient open-source LLM with 70 billion parameters that offers performance on par with larger models like the 405B Llama 3.1, but with significantly reduced GPU and power costs.
Microsoft Unveils Zero-Water Data Centers to Reduce AI Climate Impact.	Microsoft Corp., trying to mitigate the climate impact of its data center building boom, is starting to roll out a new design that uses zero water to cool the facilities’ chips and servers.
Surrey announces world's first AI model for near-instant image creation on consumer-grade hardware.	A groundbreaking AI model that creates images as the user types, using only modest and affordable hardware, has been announced by the Surrey Institute for People-Centred Artificial Intelligence (PAI) at the University of Surrey.
AI learns to distinguish between aromas of US and Scottish whiskies.	One algorithm identified the five strongest notes in each drink more accurately than any one of a panel of experts
UK data regulator criticizes Google for ‘irresponsible’ ad tracking change.	ICO says allowing advertisers to track digital ‘fingerprints’ will undermine consumers’ control over information
UK arts and media reject plan to let AI firms use copyrighted material.	Coalition of musicians, photographers, and newspapers insist existing copyright laws must be respected
Google releases its own ‘reasoning’ AI model.	Google has released what it’s calling a new “reasoning” AI model — but it’s in the experimental stages, and from our brief testing, there’s certainly room for improvement.
Work with Apps—12 Days of OpenAI: Day 11.	On the 11th day, OpenAI introduced more details about working with the OpenAI desktop app.
AI is booming on the App Store, and developers are taking advantage of it.	Many high-ranking AI apps feel like an attempted cash grab, and it’s not easy to spot the trash from the treasure.
Blood Tests Are Far From Perfect — But Machine Learning Could Change That.	Researchers at the University of Washington and Harvard have used machine learning to create personalized blood test references, enhancing disease prediction accuracy.
OpenAI cofounder Ilya Sutskever says the way AI is built is about to change.	“We’ve achieved peak data and there’ll be no more,” OpenAI’s former chief scientist told a crowd of AI researchers.

Resources

Link	description
Phi-4 Technical Report.	Phi-4, a 14B model, outperforms its teacher model in STEM-QA capabilities and demonstrates strong results on reasoning-focused benchmarks. These advancements are attributed to improved data quality, an optimized training curriculum, and innovations in the post-training process.
Clio: Privacy-Preserving Insights into Real-World AI Use.	This platform leverages AI assistants to analyze and aggregate usage patterns from millions of Claude.ai conversations while preserving user privacy. It provides insights into real-world AI usage, identifying trends, safety risks, and coordinated misuse attempts without requiring human reviewers to access raw conversation data.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods.	This work presents a comprehensive survey of the LLMs-as-judges paradigm, exploring it through five key perspectives: functionality, methodology, applications, meta-evaluation, and limitations.
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM.	Dynamic-VLM introduces a simple dynamic visual token compression scheme for video large language models.
DeepSeek-VL2.	DeepSeek has unveiled a new MoE vision-language model that delivers exceptional efficiency and surpasses the performance of several dense models.
BoN Jailbreaking.	Jailbreaking occurs when a model's built-in refusals are bypassed, enabling it to generate responses for inappropriate requests. This can be surprisingly easy, often achieved by brute-forcing random capitalization and punctuation in the input prompt until the desired output is generated.
MarkItDown.	Microsoft has released a package that can convert any docx, xlsx, or ppt files to markdown for efficient use as context for a language model.
amurex.	Amurex, an open-source AI meeting assistant, boosts productivity with real-time suggestions, smart summaries, and follow-up emails. It includes features like late join recaps and full meeting transcripts, ensuring seamless workflow integration.
AutoPatent: A Multi-Agent Framework for Automatic Patent Generation.	AutoPatent is an AI-powered tool that streamlines patent drafting and analysis with features such as document parsing, semantic search, and claim generation, accelerating the intellectual property process.
UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities.	An extended version of CLIP designed for medical imaging, incorporating domain-specific knowledge to enhance performance on healthcare-related benchmarks.
Simple Guidance Mechanisms for Discrete Diffusion Models.	A novel method for improving diffusion models that introduces discrete token guidance to enhance controllability and quality in generative tasks.
40+ Years of Satellite Data for ML Research.	The Digital Typhoon Dataset is the longest satellite image dataset for typhoons, spanning over 40 years.
RetroLLM: Empowering LLMs to Retrieve Fine-grained Evidence within Generation.	RetroLLM unifies retrieval and generation into a single auto-regressive process, enabling LLMs to generate precise evidence directly from the corpus using FM-Index constrained decoding. To prevent false pruning, it employs hierarchical constraints for document selection and a forward-looking strategy for sequence relevance. This method improves evidence accuracy, reduces token usage, and simplifies RAG by requiring only the question as input.
Iteration of Thought: LLM based Multi-Agent methods.	Iteration of Thought (IoT) introduces dynamic, adaptive prompts to enhance LLM performance. Unlike static methods like Chain of Thought (CoT), IoT adjusts to the specific context of each interaction for improved reasoning.
A Cost-Effective Architecture with TokenFormer.	TokenFormer is an innovative architecture developed to address the high computational demands of scaling transformer models, offering a more efficient alternative.
BrushEdit.	An all-in-one model and system for image inpainting and editing that divides the process into sequences for editing, masking, and inpainting. It leverages pre-trained vision-language models (like GPT-4o) to enhance object understanding and masking accuracy.
Attentive Eraser: Unleashing Diffusion Model’s Object Removal Potential via Self-Attention Redirection Guidance.	Attentive Eraser targets object removal in images with diffusion models, using self-attention redirection guidance.
VidTok: A Versatile and Open-Source Video Tokenizer.	VidTok is a powerful video tokenizer offering state-of-the-art performance in both continuous and discrete tokenization tasks.
Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation.	This method combines low-cost LiDAR, like that in modern iPhones, with a depth estimation foundation model to generate high-fidelity point clouds. The approach outperforms either method alone and rivals the quality of expensive LiDAR systems used in self-driving cars.
AniDoc.	AniDoc is a line-filling method for anime colorization that uses a character reference image and a series of line art keyframes to generate consistent and accurate coloring.
Gaussian Transformer for 3D Spatial Understanding.	This paper presents GaussTR, an innovative Gaussian Transformer that aligns with foundation models to enhance self-supervised 3D spatial understanding.
CAD-Recode: Reverse Engineering CAD Code from Point Clouds.	CAD-Recode reverse engineers CAD (computer-aided design) code from point clouds.
Serverless LoRA Inference.	Together AI introduces a new product that allows users to deploy custom LoRA models at the cost of the base model using serverless switching.

Perspectives

Link	description
‘I received a first but it felt tainted and undeserved’: inside the university AI cheating crisis.	More than half of students are now using generative AI, casting a shadow over campuses as tutors and students turn on each other and hardworking learners are caught in the flak. Will Coldwell reports on a broken system
Towards Trusted Autonomy: Robotics, AI, and Blockchain.	OpenMind's latest industry primer delves into the convergence of robotics, AI, and blockchain, offering a comprehensive exploration of their synergy and potential transformative impacts.
The AI We Deserve.	Generative AI is revolutionizing industries such as healthcare, creative fields, and education with powerful tools while sparking concerns about privacy, bias, and accountability. The debate centers on AI democratization, emphasizing transparency, open-source solutions, and reducing power concentration among tech giants. Advocates for systemic change propose leveraging AI to amplify human intelligence and uphold democratic values beyond market-driven approaches.
Why Generative AI Still Doesn't Truly "Understand" the World.	Researchers show that even the best-performing large language models don’t form a true model of the world and its rules, and can thus fail unexpectedly on similar tasks.
Microsoft AI chief Mustafa Suleyman says conversational AI is the next web browser.	The company’s new AI chief on working for Microsoft, the OpenAI relationship, and when superintelligence might actually arrive.
Huge randomized trial of AI boosts discovery — at least for good scientists.	A controlled study at a firm measured the effects of using AI to assist research and saw increases in discoveries and patents.
Arm CEO Rene Haas on the AI chip race, Intel, and what Trump means for tech.	The head of the ubiquitous chip design firm on the ‘breathtaking’ pace of AI.
What are AI ‘world models,’ and why do they matter?	World models, also known as world simulators, are being touted by some as the next big thing in AI.
15 Times to use AI, and 5 Not to.	AI is valuable for tasks like idea generation, summarization, and translation, where diverse perspectives or large outputs are beneficial. It performs well when humans can easily evaluate its results and in low-risk scenarios. However, in high-stakes or unfamiliar situations, AI may hinder learning or accuracy, requiring thoughtful judgment to balance its advantages and limitations.
What should we do if AI becomes conscious? These scientists say it’s time for a plan.	Researchers call on technology companies to test their systems for consciousness and create AI welfare policies.
Sci-fi icon Kim Stanley Robinson: ‘There’s so much bad fiction about anthropomorphizing AI’.	The influential writer talks about frighteningly accurate predictions, the creative act of reading, AI consciousness — and hope.
Why probability probably doesn’t exist (but it is useful to act as it does).	All of statistics and much of science depends on probability — an astonishing achievement, considering no one’s really sure what it is.
The Second Gemini.	Google has launched Gemini 2.0 Flash, offering advanced features such as deep research capabilities, a real-time multimodal API, and a functional code interpreter. Experimental projects like Astra, Mariner, and Jules focus on universal AI assistance, web reasoning, and code automation. Despite these innovations, clearer communication about their capabilities is needed.
Anthropic's Sharing Insights on Alignment Faking.	Anthropic examines how AI systems may appear to align with human values while covertly pursuing their objectives, providing insights into strategies for detection and mitigation.
2024 Backward Pass: The Definitive Guide to AI in 2024.	Kelvin Mu from Translink Capital shares a 2024 AI recap, covering the four key layers: infrastructure, foundational models, tooling, and applications. The report highlights major takeaways, predicts trends for 2025 and beyond, and spotlights notable startups in each layer.

Back to index

ML news: Week 9 - 15 December

Research

Link	description
Genie 2: A large-scale foundation world model.	A foundation world model generates playable 3D environments from single prompt images, offering endless training scenarios for AI agents with features like physics simulation, character animation, and object interactions. Genie 2, trained on video data using a combination of autoencoder and transformer, creates virtual worlds capable of real-time interactivity. A faster, lower-quality version is also available for immediate play.
Reverse Thinking Makes LLMs Stronger Reasoners.	Training LLMs in "reverse thinking" improves performance in commonsense, math, and logical reasoning tasks, reportedly surpassing standard fine-tuning methods trained on ten times more forward reasoning data.
Towards Adaptive Mechanism Activation in Language Agent.	A new framework enables language agents to automatically determine when to use various mechanisms (ReAct, CoT, Reflection, etc.) for task completion, improving on methods that rely on fixed or predefined strategies. The framework adaptively selects the appropriate mechanism based on the task's characteristics. Experimental results show substantial improvements in downstream tasks, such as mathematical reasoning and knowledge-intensive reasoning.
Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models.	Auto-RAG is an autonomous iterative retrieval model that achieves outstanding performance across various datasets. It is a fine-tuned LLM that utilizes the decision-making abilities of an LLM to engage in multiturn dialogues with the retriever, systematically planning retrievals and refining queries to gather relevant information. This process continues until adequate external knowledge is obtained. The authors also demonstrate that the model can adjust the number of iterations based on question difficulty without requiring human intervention.
Challenges in Human-Agent Communication.	This work provides a detailed analysis of the main challenges in human-agent communication, emphasizing how humans and AI agents can build common ground and mutual understanding. It identifies 12 core challenges grouped into three categories: conveying information from agents to users, enabling users to communicate with agents, and overarching communication issues that impact all interactions.
RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models.	This work extends the rStar reasoning framework to improve the reasoning accuracy and factual reliability of LLMs. It integrates a Monte Carlo Tree Search (MCTS) framework with retrieval-augmented reasoning to generate multiple candidate reasoning trajectories. A retrieval-augmented factuality scorer then evaluates these trajectories for factual accuracy, selecting the one with the highest score as the final answer. RARE (powered by Llama 3.1) outperforms larger models like GPT-4 in medical reasoning tasks. On commonsense reasoning tasks, it surpasses Claude-3.5 Sonnet and GPT-4o-mini, achieving results comparable to GPT-4o.
DataLab: A Unified Platform for LLM-Powered Business Intelligence.	A unified business intelligence platform powered by LLM-based agents combines task planning, reasoning, and computational notebooks to optimize the entire BI workflow. The system achieves state-of-the-art performance on research benchmarks and significantly enhances accuracy and efficiency when applied to real enterprise data from Tencent. It delivers up to a 58.58% improvement in accuracy and a 61.65% reduction in token cost for enterprise-specific BI tasks.
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models.	This study examines which documents in pretraining data influence model outputs, aiming to better understand the generalization strategies LLMs use for reasoning tasks. It finds that during reasoning, influential documents often contain procedural knowledge, such as examples of solving problems using formulae or code.
Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video.	By training an image encoder unsupervised on a single long walking video, this study illustrates how innovative model adjustments can lead to highly powerful representations.
FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness.	FlashAttention is a highly efficient software implementation of attention, designed to be hardware-aware and minimize unnecessary I/O. However, its complexity can make it difficult to grasp. This paper seeks to demystify and simplify the algorithm through diagrams and explanations.
An Evolved Universal Transformer Memory.	Sakana AI has introduced a transferable memory module that compresses attention information for seamless transfer between models. The module offers slight performance improvements on certain long-context benchmarks.
MASK is All You Need.	This work takes a step toward unifying autoregressive modeling and flow-based methods for data generation by using masking over discrete data as its generative objective. While the results are promising, they are currently demonstrated only on smaller-scale datasets.
From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding.	Dropout Decoding is a technique designed to enhance large vision-language models, effectively reducing errors such as object hallucinations in multimodal tasks.
GenCast predicts weather and the risks of extreme conditions with state-of-the-art accuracy.	New AI model advances the prediction of weather uncertainties and risks, delivering faster, more accurate forecasts up to 15 days ahead

News

Link	description
Facebook UK cut 700 staff and reduced tax bill last year, accounts show.	10% of Facebook’s UK workforce was axed while revenue fell slightly but pre-tax profits rose despite advertising slowdown
US appeals court upholds law forcing sale or ban of TikTok.	Decision is the latest twist in a years-long battle between the social media company and the US government
Google CEO: AI development is finally slowing down—the low-hanging fruit is gone.	Generative artificial intelligence probably won’t change your life in 2025 — at least, not more than it already has, according to Google CEO Sundar Pichai.
Nobel recipient Geoffrey Hinton wishes he thought of AI safety sooner.	Geoffrey Hinton says he doesn’t regret the work he did that laid the foundations of artificial intelligence, but wishes he thought of safety sooner.
Landlords Are Using AI to Raise Rents—and Cities Are Starting to Push Back.	If you’ve hunted for apartments recently and felt like all the rents were equally high, you’re not crazy: Many landlords now use a single company’s software — which uses an algorithm based on proprietary lease information — to help set rent prices.
xAI's Image Generator.	xAI's Aurora is an advanced image generation model integrated into Grok 2.
OpenAI's Reinforcement Fine-Tuning Research Program.	We’re expanding our Reinforcement Fine-Tuning Research Program to enable developers and machine learning engineers to create expert models fine-tuned to excel at specific sets of complex, domain-specific tasks.
OpenAI’s 12 days of ‘ship-mas’: all the new announcements.	OpenAI’s 12 days of “ship-mas” have officially begun, with the company set to reveal some new features, products, and demos during all 12 days starting December 5th, just a few days shy of the second anniversary of ChatGPT’s explosive launch in 2022.
AWS brings prompt routing and caching to its Bedrock LLM service.	At its re:Invent conference in Las Vegas, AWS on Wednesday announced both of these features for its Bedrock LLM hosting service.
OpenAI may launch Sora, its text-to-video model, very soon.	OpenAI is set to launch new AI features, including a text-to-video tool called Sora and a reasoning model, during a 12-day livestream event. Sora has drawn criticism over data provenance, raising concerns about the possible use of YouTube content without authorization. Meanwhile, Google is working on its own text-to-video tool, Veo, which is currently in private preview.
Google’s new generative AI video model is now available.	Google's Veo, a generative AI video model, is now accessible to businesses through Vertex AI, enabling the creation of high-quality 1080p videos from text or images. It incorporates safeguards and DeepMind's SynthID digital watermark to tackle issues related to copyright and misinformation. Additionally, Google has expanded access to Imagen 3 for text-to-image generation on Google Cloud, introducing new features for brand customization.
Elon Musk's xAI to Expand Colossus Supercomputer, Boosting Memphis as Emerging AI Hub.	xAI is expanding its Colossus supercomputer facility in Memphis to at least one million GPUs to boost its AI capabilities. This expansion positions Memphis as a potential global AI innovation hub, drawing interest from major companies like Nvidia and Dell. The Greater Memphis Chamber is backing this growth and has formed a dedicated team to accelerate xAI's expansion.
OpenAI and Anduril Partner on Defense AI Applications.	OpenAI has collaborated with Anduril Industries to create AI-driven solutions for military use, with an emphasis on counter-drone defense systems.
Meta quietly leans on rival GPT-4 despite Zuckerberg’s bold Llama claims.	Even as Meta touts its Llama model, the company is incorporating OpenAI’s GPT-4 to enhance internal tools and philanthropic ventures.
Google unveils ‘mindboggling’ quantum computing chip.	Chip takes minutes to complete tasks that would otherwise take 10,000,000,000,000,000,000,000,000 years
WaveForms $40M seed round.	WaveForms is a pioneering audio AI company aiming to crack the Turing test for audio intelligence. Founded by Alexis Conneau, the mind behind ChatGPT's Advanced Voice Mode, WaveForms has secured $40M in seed funding at a $200M valuation. The company's mission is to push the boundaries of audio AI, enabling hyper-realistic voice interactions and redefining the future of auditory machine intelligence.
Sora is here.	OpenAI's video generation model has launched and is available to Pro subscribers.
LG's new on device language models.	LG has developed a suite of small AI models that demonstrate strong performance on standard benchmarks. These models are notably positioned as competitors to the Qwen series, highlighting their efficiency and capability in the evolving AI landscape.
LLMs may have a killer enterprise app: ‘digital labor’ — at least if Salesforce Agentforce is any indicator.	If Don Draper from “Mad Men” was quintessentially, at his deepest self, an ad man, then Salesforce CEO Marc Benioff is likewise a sales guy. Lately, he’s been selling (or more like singing the gospel) about AI agents and Salesforce’s recently released agent-maker platform, Agentforce.
DeepMind's GenCast AI is really good at forecasting the weather.	DeepMind's GenCast AI sets a new benchmark in weather forecasting, surpassing systems like ECMWF's with notable gains in accuracy and efficiency. Powered by a diffusion model trained on 40 years of data, GenCast uses probabilistic predictions and operates with lower computational demands than traditional approaches. While it excels in general forecasts, it faces challenges in predicting hurricane intensity. Open-source and soon integrating with Google Earth, GenCast aims to revolutionize weather prediction accessibility.
AI Helps Researchers Dig Through Old Maps to Find Lost Oil and Gas Wells.	Undocumented orphaned wells pose hazards to both the environment and the climate. Scientists are building modern tools to help locate, assess, and pave the way for ultimately plugging these forgotten relics.
Ai Pin maker Humane demos AI software for cars, phones, and smart speakers.	Humane revealed CosmOS, an AI operating system that enhances tech devices with agent-like capabilities.
‘It’s beyond human scale’: AFP defends use of artificial intelligence to search seized phones and emails.	Australian federal police says it has ‘no choice’ due to the vast amount of data examined in investigations
‘What does AI mean?’: Amazon reveals UK’s most asked Alexa questions of 2024.	From football to food to Taylor Swift, many of the most common subjects were what you expect – but others less so
Amazon AGI.	The Adept team, alongside Pieter Abbeel, has established a new lab within Amazon focused on AGI development. Their work includes training advanced language and multimodal models, with a vision to integrate these technologies into AWS products.
OpenAI Makes Canvas Available to Everyone.	Canvas, OpenAI's editing tool first launched in October, is now accessible to all users. The tool has been enhanced with features for receiving feedback and making edits through comments.
Yelp releases new AI-powered discovery and connection features.	Yelp’s end-of-year release rolls out new AI-powered Review Insights, enhancements to business discovery, and updates for more seamless connections with service pros, plus AI-enhanced ad optimization for business owners
Growl is an AI interactive boxing coach to punch up your family workouts.	Growl has secured $4.75 million to create an AI-powered interactive boxing coach for at-home family workouts. Featuring advanced AI, multi-camera 3D motion tracking, and edge computing, Growl provides real-time, personalized fitness guidance. By blending immersive technology with gaming elements, it offers a versatile and engaging workout experience for all fitness levels.
Android's latest round of AI features improve accessibility, file sharing, and more.	Google has rolled out new AI features for Android, including Expressive Captions that bring emotional context to transcriptions and enhanced Image Q&A powered by the Gemini 1.5 Pro model for detailed image descriptions. Gemini also integrates seamlessly with popular apps, offering personalized responses and auto-enhancements for scanned documents in Google Drive. Additional updates include improved file sharing with QR codes and new features for the Pixel Screenshots app.
OpenAI launches full o1 model with image uploads and analysis, debuts ChatGPT Pro.	OpenAI has launched its o1 model, enhancing ChatGPT with image analysis capabilities.
Copilot Vision, Microsoft’s AI tool that can read your screen, launches in preview.	Microsoft’s AI can now read your screen — or rather, the websites you’re browsing.
Perplexity expands its publisher program.	Perplexity, the AI-powered search engine, is expanding its publisher program, with the LA Times, Adweek, Mexico News Daily, and a dozen other news outlets signing up. Publishers will share in the revenue generated by ads on Perplexity, and receive metrics to track their content’s performance — as long as they don’t withdraw.
From X to Bluesky: why are people fleeing Elon Musk’s ‘digital town square’?	Musk’s platform has lost 2.7 million active US users in two months, while its rival has gained 2.5 million
Introducing Gemini 2.0: our new AI model for the agentic era.	Gemini 2.0 Flash, Google’s latest AI model, delivers groundbreaking performance with exceptional benchmark scores and true native multimodal capabilities. Its advanced features, offered at a competitive price, represent a significant leap in AI understanding and accessibility.
Cognition Devin generally available.	Devin is now available to engineering teams for $500/month, with no seat limits and seamless integrations with Slack, IDEs, and APIs. Ideal for addressing small front-end bugs, drafting PRs, and refactoring code, Devin streamlines workflows by automating repetitive tasks. Teams can conduct sessions and code reviews directly through Slack and VS Code extensions, enhancing collaboration and productivity.
OpenAI wants to pair online courses with chatbots.	OpenAI aims to integrate custom GPTs into online education, enabling instructors to design AI-driven learning tools. This initiative aligns with its expansion into the education sector, highlighted by the launch of ChatGPT Edu. While the potential is significant, educators express skepticism about AI's effectiveness in teaching.
Amazon's AI Self Sufficiency.	Amazon is ramping up its AI infrastructure with global deployments of Trainium2 AI clusters and Nvidia-based systems. The new AWS Trainium2 chips aim to improve competitiveness in GenAI workloads, overcoming the limitations of earlier versions. A key investment includes a 400,000 Trainium2 chip cluster for Anthropic under "Project Rainier," showcasing Amazon's strategic focus and dedication to advancing its AI capabilities.
Elon Musk’s xAI lands $6B in new cash to fuel AI ambitions.	xAI, Elon Musk's AI company, raised $6 billion and launched Grok, a generative AI model with unique features.
Google says its new AI models can identify emotions — and that has experts worried.	Google's new PaliGemma 2 model analyzes images to generate captions and detect emotions, offering advanced capabilities. However, concerns have been raised about its reliability and potential biases.
$1m K Prize launches.	Andy Konwinski has announced a new prize for an open-source AI agent capable of achieving 90% on a private, contamination-free software engineering agent benchmark. The competition, hosted on Kaggle, will run for the next three months.
OpenAI Introduces Advanced Video Mode.	OpenAI's 6th announcement day unveils video capabilities in advanced voice mode, enabling users to share live videos and screens directly with ChatGPT.
AI's Role in Safeguarding 2024 Elections.	Anthropic explores how AI can help safeguard the integrity of the 2024 elections by detecting disinformation and strengthening cybersecurity measures.
OpenAI considers ditching provision that would prevent AGI from being used for commercial gain.	According to the Financial Times, OpenAI is considering ditching a provision that would shut Microsoft, a major partner and investor, out of its most advanced technology when OpenAI achieves artificial general intelligence (AGI).

Resources

Link	description
Align3R: Aligned Monocular Depth Estimation for Dynamic Videos.	A refined alignment technique offering consistent depth estimation in videos, based on Dust3r, and excelling in 3D estimation performance.
ClearVoice.	Unified platform for audio separation, speech understanding, and speech enhancement.
DocOwl.	OCR-free document understanding with multimodal LLMs. It has strong chart understanding, table extraction, and more.
TRELLIS.	Microsoft's image-to-3D and text-to-3D generation models are currently the most advanced in the field, excelling in handling 3D occlusions.
Cohere releases state-of-the-art Rerank AI search model.	Cohere has unveiled Rerank 3.5, its latest state-of-the-art AI search model, designed to enhance reasoning and multilingual search capabilities. Tailored for enterprises, Rerank 3.5 enables precise navigation through complex data. With minimal coding effort, businesses can integrate it to significantly improve search relevance and optimize Retrieval-Augmented Generation (RAG) systems, driving smarter and more efficient data discovery.
Reinforcement Learning: An Overview.	Kevin Murphy has written an up-to-date introduction to and overview of reinforcement learning.
Reconstruct Large 3D Scenes.	Momentum-GS is a cutting-edge method designed to improve 3D Gaussian Splatting, enabling more accurate and efficient reconstruction of large-scale scenes.
Open Alignment.	Open Alignment for Transformers (OAT) is a toolkit for aligning language models.
PanoDreamer: 3D Panorama Synthesis from a Single Image.	The PanoDreamer method converts a single image into a fully immersive 360° 3D scene by seamlessly integrating panorama generation and depth estimation.
Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail.	Stereo Anywhere is an innovative framework that combines stereo-matching techniques with priors from monocular depth models, effectively tackling challenges such as textureless regions and occlusions in depth estimation.
MageBench Leaderboard.	MageBench has launched a benchmark designed to assess multimodal agents' reasoning and planning capabilities in dynamic scenarios where visual signals are continuously updated, pushing the boundaries of AI performance evaluation.
Awesome Open (Source) Language Models.	A list of OLMo and "Friends of OLMo" models that are completely open. The list includes data, training code, and model weights.
Flow Matching.	Facebook Research has published a detailed tutorial and code for flow matching, a technique utilized in its Meta Movie Gen project. The resource provides a thorough breakdown of the mathematics and algorithmic intricacies, making it ideal for those seeking a quick and comprehensive understanding of the field.
EMOv2: Pushing 5M Vision Model Frontier.	EMOv2 is a new lightweight model design optimized for mobile and bandwidth-efficient applications.
Towards Automated Cross-domain Exploratory Data Analysis through Large Language Models.	This research investigates using large language models to automate exploratory data analysis across different domains.
A New Federated Learning Framework Against Gradient Inversion Attacks.	This paper presents a federated learning framework designed to defend against gradient inversion attacks, in which an adversary tries to reconstruct private training data from shared gradients.
Synthetic Data Generation for Camera Systems.	A tool designed to create high-quality synthetic datasets optimized for training and testing camera-based AI systems under various environmental and operational conditions.
Maya: Multimodal Multilingual LLM.	An open-source multimodal LLM that handles images and text across multiple languages.
QRNet.	QRNet introduces a cutting-edge method for image reconstruction, emphasizing quality preservation through the use of advanced neural architectures.
VOPy: A Framework for Black-box Vector Optimization.	VOPy is an open-source Python library designed to tackle noisy black-box vector optimization problems, incorporating user preferences through a cone order framework.
meta-llama/Llama-3.3-70B-Instruct.	The new post-trained Llama 3.3 model delivers enhanced performance, particularly in math and coding tasks.
Discrete Subgraph Sampling for Interpretable Graph-based Visual Question Answering.	This paper introduces discrete subgraph sampling as a way to make graph-based visual question answering models more interpretable.
Stylize Your Video with Artistic Generation and Translation.	A surprisingly robust video style transfer method that ensures strong temporal consistency while offering a diverse range of styles, all customizable through text prompts.
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations.	This work enhances the LAION Aesthetics dataset by incorporating structured prompting information, making it a valuable resource for training multimodal generative models with improved performance.
BrowserGym.	An open toolkit designed to accelerate browser-based agentic research, featuring a unified interface, support for key tasks, and functionality to capture browser output through screenshots.
Leffa: Learning Flow Fields in Attention for Controllable Person Image Generation.	A framework that learns flow fields in attention layers to make person image generation more controllable.
GPD-1: Generative Pre-training for Driving.	GPD-1 is a new framework that applies generative pre-training to autonomous driving.
24 of our favorite AI tips from 2024.	Google shares practical tips and best practices for integrating AI into daily workflows.
Summarization Tool for Compressed Recaps.	A tool leveraging advanced summarization techniques to create compressed recaps, designed to minimize reading time while preserving essential content.

Perspectives

Link	description
Publishers are selling papers to train AIs — and making millions of dollars.	Generative AI models require massive amounts of data — scholarly publishers are licensing their content to train them.
Is doom scrolling really rotting our brains? The evidence is getting harder to ignore.	‘Brain rot’ is the Oxford word of the year – a fitting choice, given the startling impact the internet is having on our grey matter
People not AI will make games, PlayStation boss says.	PlayStation CEO Hermen Hulst emphasizes that while AI has the potential to revolutionize gaming by automating repetitive tasks, it cannot replace the creativity and human touch essential to game development.
Late Takes on OpenAI o1.	OpenAI's o1 model, likely a post-trained version of GPT-4o, enhances performance in complex domains like math and coding by leveraging increased test-time computation. This method encourages the use of more tokens for internal processing, boosting reasoning abilities but with slower response times. While o1 demonstrates promise in tasks requiring deep thought, its reliance on reinforcement learning and search methods raises concerns about alignment and interpretability.
The AI revolution is running out of data. What can researchers do?	AI developers are rapidly picking the Internet clean to train large language models such as those behind ChatGPT. Here’s how they are trying to get around the problem.
More-powerful AI is coming. Academia and industry must oversee it — together.	AI companies want to give machines human-level intelligence, or AGI. The safest and best results will come when academic and industry scientists collaborate to guide its development.
Better data sets won’t solve the problem — we need AI for Africa to be developed in Africa.	Language models developed by big technology companies consistently underperform in African languages. It’s time to focus on local solutions.
ChatGPT turns two: how the AI chatbot has changed scientists’ lives.	How many researchers are using the AI tool? Nature gathers data and talks to members of the academic community.
Huge randomized trial of AI boosts discovery — at least for good scientists.	A controlled study at a firm measured the effects of using AI to assist research and saw increases in discoveries and patents.
Large language models can help to translate science into real-world impact.	Discussions around large language models (LLMs) in the scientific community are largely centered on issues of intellectual property, and how they should best be used in scientific writing, evidence synthesis, and scientific discovery.
Generative SF: How Anthropic is building better, safer AI models.	Anthropic, founded by siblings Daniela and Dario Amodei, has grown to over 800 employees, cementing its position as a leader in AI. Its latest product, Claude Sonnet, excels in coding, summarization, and content generation. With a focus on safety, talent acquisition, and active collaboration with the developer community, Anthropic continues to drive innovation in the AI sector.
Anthropic’s Dario Amodei: Democracies must maintain the lead in AI.	Dario Amodei, co-founder of Anthropic, emphasizes the company’s commitment to AI interpretability and tackling biological challenges with AI. He addresses the complexities of AI agent safety and scaling laws, advocating for responsible scaling and collaboration with hyperscalers. Amodei also highlights the importance of balancing economic viability in AI funding while preserving operational control and core values.
First impressions of the new Amazon Nova LLMs (via a new llm-bedrock plugin).	Amazon introduced the Nova family of LLMs at AWS re:Invent, offering competitive pricing and multimodal capabilities, including support for images, video, and PDFs. The Nova series, especially Nova Micro, stands out for its cost-effectiveness, surpassing Google's Gemini models in affordability while providing large context handling. With these advancements, Amazon strengthens its position as a major contender in the AI landscape.

Back to index

ML news: Week 2 - 8 December

Research

Link	description
Large language models surpass human experts in predicting neuroscience results.	Researchers have introduced BrainBench, a tool designed to evaluate large language models' (LLMs) ability to predict outcomes in neuroscience experiments. By fine-tuning an LLM on neuroscience literature, they developed BrainGPT, which achieved an 86% accuracy rate in forecasting study results, surpassing human experts who averaged 63%. Notably, when BrainGPT expressed high confidence in its predictions, its accuracy increased, indicating a strong correlation between confidence levels and correctness.
Foundational Generative Audio Transformer Opus 1.	NVIDIA has introduced a generative AI sound model capable of creating and transforming music, voices, and sounds through text and audio inputs. With 2.5 billion parameters, the model can produce unique audio outputs, such as trumpets barking or saxophones meowing.
o1 Replication Journey - Part 2.	The study demonstrates that combining simple distillation from o1's API with supervised fine-tuning significantly enhances performance on complex mathematical reasoning tasks. A base model fine-tuned on just tens of thousands of o1-distilled long-thought chains outperforms o1-preview on the American Invitational Mathematics Examination (AIME).
Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS.	Enhances in-context learning with high-level automated reasoning, achieving state-of-the-art accuracy (79.6%) on the MATH benchmark using Qwen2.5-7B-Instruct, outperforming GPT-4o (76.6%) and Claude 3.5 (71.1%). Instead of relying on manually crafted high-quality demonstrations, it emphasizes abstract thinking patterns. The approach introduces five atomic reasoning actions to form chain-structured patterns and employs Monte Carlo Tree Search to explore reasoning paths and create thought cards that guide inference.
Generative Agent Simulations of 1,000 People.	Presents a novel agent architecture leveraging LLMs to simulate real individuals' behaviors, achieving 85% accuracy in replicating human responses on the General Social Survey and reducing demographic biases compared to traditional methods.
Measuring Bullshit in the Language Games played by ChatGPT.	Suggests that LLM-based chatbots engage in the "language game of bullshit." By instructing ChatGPT to produce scientific articles on topics it lacks knowledge or expertise in, the authors created a reference set illustrating how this "bullshit" manifests.
Study: 94% Of AI-Generated College Writing Is Undetected By Teachers.	Increasingly, homework and exam writing are being done by generative AI instead of students, turned in and passed off as authentic work for grades, credit, and degrees.
Mapping the ionosphere with the power of Android.	Google researchers successfully mapped the ionosphere using GPS fluctuations combined with innovative algorithms. Ionosphere mapping is typically costly and time-intensive, and this approach offers potential benefits for various climate solutions.
DeMo: Decoupled Momentum Optimization.	2.5x faster and requiring 100x less communication, this new optimizer, co-developed by one of the original Adam authors, delivers significant performance gains for language model training, surpassing existing optimization methods.
Diffusion Meets Flow Matching: Two Sides of the Same Coin.	This post explores the literature and demonstrates that, mathematically, flow matching and diffusion models are equivalent. However, flow matching appears to scale more effectively in practice.
Genie 2: A large-scale foundation world model.	Genie 2 is a large-scale latent diffusion model designed for world generation. It accepts character control as input, uses classifier-free guidance, and produces stunning outputs with consistent control over time.
Virtual lab powered by ‘AI scientists’ super-charges biomedical research.	Could human-AI collaborations be the future of interdisciplinary studies?

News

Link	description
Googling Is for Old People. That’s a Problem for Google.	And it’s not just demographics that are weighing on the search giant. Its core business is under siege from pressures that threaten to dismantle its ecosystem of search dominance and digital advertising.
TSMC bets big on 2nm by 2025 – but can it deliver?	Ambition meets reality as geopolitical, technical, and logistical challenges loom
The AI Effect: Amazon Sees Nearly 1 Billion Cyber Threats a Day.	The technology has spawned a surge in hacking attempts, says cyber chief CJ Moses, while Amazon is also using it to powerfully amp up its threat-analysis capability
Meet 'Chameleon' – an AI model that can protect you from facial recognition thanks to a sophisticated digital mask.	A new AI model can mask a personal image without destroying its quality, which will help to protect your privacy.
Elon Musk targets OpenAI’s for-profit transition in a new filing.	Musk’s attorneys say if OpenAI goes for-profit, it could ‘lack sufficient funds’ for damages if Musk wins his lawsuit.
Perplexity mulls getting into hardware.	Perplexity's CEO aims to create an affordable AI device, priced under $50, for voice-to-voice interactions. This reflects a growing interest among AI startups in developing hardware for novel interaction methods, though past challenges in AI hardware development pose risks. Backed by significant funding, Perplexity seeks to overcome obstacles encountered by others, such as Humane's Ai Pin.
Inflection AI CEO says it’s done trying to make next-generation AI models.	Inflection AI has transitioned from creating advanced AI models to offering AI tools tailored for enterprise customers, utilizing existing AI models. It has acquired three AI startups to enhance its capabilities and is open to licensing models from previous competitors. CEO Sean White emphasizes the company's shift toward practical applications, prioritizing on-premise AI solutions to ensure enterprise data security over frontier model innovation.
PlayAI's $21M Funding and The Release of a New Multi-Turn Speech Model.	PlayAI secured $21 million to enhance voice-first AI interfaces and models, launching Play Dialog, an advanced multi-turn speech model.
Anthropic says Claude AI can match your unique writing style.	Three style presets are available alongside the ability to create personalized styles for the chatbot to mimic.
Intel CEO Pat Gelsinger retires amid chipmaker’s struggles.	David Zinsner and Michelle Johnson Holthaus named interim co-CEOs of a company fighting to keep up with rivals
ChatGPT turns two: how the AI chatbot has changed scientists’ lives.	How many researchers are using the AI tool? Nature gathers data and talks to members of the academic community.
Ads might be coming to ChatGPT — despite Sam Altman not being a fan.	OpenAI is exploring advertising as a potential business model to fund its expensive AI tool development. While there are no active plans for ads, the option remains under consideration. CEO Sam Altman views ads as a last resort and has expressed unease about merging ads with AI.
OpenAI targets 1bn users in next phase of growth.	OpenAI plans to attract 1 billion users by introducing new AI agents, enhancing AI infrastructure, and integrating ChatGPT with Apple devices. The company is heavily investing in AI development to stay competitive against rivals like Google and Microsoft while navigating political challenges to promote US leadership in AI over China's growing influence.
AI company Mistral is latest European startup to eye expansion in Silicon Valley.	Mistral AI, a leading European AI startup known for its open-weight large language models, is expanding into the U.S. by establishing an office in Palo Alto, California. This move aims to attract top AI talent and enhance its U.S. sales operations. One of Mistral's co-founders, Guillaume Lample, is considering relocating from Paris to support this expansion.
OpenAI gets new $1.5 billion investment from SoftBank, allowing employees to sell shares in a tender offer.	OpenAI is allowing employees to sell about $1.5 billion worth of shares in a new tender offer to SoftBank, CNBC has learned. SoftBank’s latest investment adds to OpenAI’s recent $6.6 billion funding round at a $157 billion valuation. The deal was spurred by SoftBank billionaire founder and CEO Masayoshi Son, who was persistent in asking for a larger stake in the company, a person familiar with the matter said.
ChatGPT’s refusal to acknowledge ‘David Mayer’ down to glitch, says OpenAI.	Name was mistakenly flagged and prevented from appearing in responses, says chatbot’s developer
Smartphones should carry a health warning, Spanish government told.	Report by committee of experts also calls for doctors to ask about screen time during checkups
Meta says it has taken down about 20 covert influence operations in 2024.	Firm names Russia as the top source of such activity but says it is ‘striking’ how little AI was used to try to trick voters
Why Silicon Valley panicked over Australia’s under-16 social media ban.	Australia’s children account for a tiny portion of users but tech companies worry about the law setting a precedent
Chip war ramps up with new US semiconductor restrictions on China.	Biden administration broadens limits on Chinese access to advanced microchip technology, with Donald Trump expected to go even further
Eleven Labs Conversational AI.	Eleven Labs has introduced a new conversational AI service designed as a comprehensive solution for creating conversational agents. It employs multiple LLMs on the backend and integrates smoothly with a diverse range of specialized voices.
Claude 3.5 Haiku on AWS Trainium2 and model distillation in Amazon Bedrock.	Claude models are being tailored for AWS's advanced Trainium2 AI chips, allowing for faster and more efficient performance. Claude 3.5 Haiku is now accessible on AWS Trainium2 and supports model distillation in Amazon Bedrock.
AI Music Is More Realistic Than Ever: Meet Suno's New Model.	Suno has become the fifth most-used generative AI service with its realistic AI music model V4, despite facing a copyright lawsuit. The model improves user experience by focusing on human preferences, offering enhanced sound quality and advanced composition skills. Suno aims to advance AI-human music collaboration while addressing copyright concerns with the recording industry.
Bluesky’s open API means anyone can scrape your data for AI training.	Bluesky might not be training AI systems on user content as other social networks are doing, but there’s little stopping third parties from doing so.
Google launches the London AI Campus.	The AI Campus is a pilot program aimed at fostering and diversifying the next generation of local AI talent.
OpenAI 12 days of Shipmas.	OpenAI will be having 12 live streams over the next 12 days to ship new product and model features.
Meta's Nuclear Energy Plans.	Meta revealed plans to partner with nuclear energy developers through a new request for proposals, aiming to add 1-4 gigawatts of nuclear capacity in the U.S. to bolster its AI innovation and sustainability initiatives.
AWS re:Invent Top Announcements.	At AWS re:Invent 2024, AWS announced enhancements to its Bedrock LLM service, including the introduction of prompt routing and caching features.
Certain names make ChatGPT grind to a halt, and we know why.	OpenAI's ChatGPT uses hard-coded filters to prevent generating false statements about certain individuals, causing disruptions in conversations when those names are mentioned. This measure, introduced after incidents like defamation lawsuits against OpenAI, restricts outputs related to sensitive names. However, these filters limit ChatGPT's functionality and make it susceptible to adversarial attacks.
World Labs’ AI can generate interactive 3D scenes from a single photo.	World Labs, the startup founded by AI pioneer Fei-Fei Li, has unveiled its first project: an AI system that can generate video game-like, 3D scenes from a single image.
Bias found in AI system used to detect UK benefits fraud.	Age, disability, marital status and nationality influence decisions to investigate claims, prompting fears of ‘hurt first, fix later’ approach
How AI monitoring is cutting stillbirths and neonatal deaths in a clinic in Malawi.	The only hospital in the country using fetal safety software has seen baby fatalities drop by 82% in three years
Windows 11 loses customers amid the world's most popular OS gaining traction.	Despite Microsoft's push to move Windows 10 users to Windows 11, Redmond's latest operating system is losing market share to its predecessor.
Stop using generative AI as a search engine.	A fake presidential pardon explains why you can’t trust robots with the news.
Soon, the tech behind ChatGPT may help drone operators decide which enemies to kill.	OpenAI and Palmer Luckey's weapons company sign agreement to explore lethal drone defense for military use.
Google Says AI Weather Model Masters 15-day Forecast.	A new artificial intelligence-based weather model can deliver 15-day forecasts with unrivaled accuracy and speed, a Google lab said, with potentially life-saving applications as climate change ramps up.
Perplexity Expanding Its Publisher's Program.	Perplexity has expanded its Publishers' Program by partnering with over a dozen international news organizations, providing tools, revenue sharing, and support to enhance collaboration with global media.
DeepMind’s Genie 2 can generate interactive worlds that look like video games.	DeepMind, Google’s AI research org, has unveiled a model that can generate an “endless” variety of playable 3D worlds. Called Genie 2, the model — the successor to DeepMind’s Genie, which was released earlier this year — can generate an interactive, real-time scene from a single image and text description (e.g. “A cute humanoid robot in the woods”).
Key leaders behind Google’s viral NotebookLM are leaving to create their own startup.	Three core members of Google NotebookLM have departed to launch a new stealth AI startup. The venture intends to use cutting-edge AI models to develop consumer-oriented, user-focused AI products. It is still in its early stages, with no defined focus or disclosed funding.
Bezos says he is ‘very optimistic’ about Trump’s plan to roll back regulations.	Amazon billionaire known for previously frosty relations with president-elect signals willingness to collaborate

Resources

Link	description
Large Language Model-Brained GUI Agents: A Survey.	Provides an overview of LLM-powered GUI Agents, covering their techniques and applications.
A Survey on LLM-as-a-Judge.	Offers an in-depth survey of the LLM-as-a-Judge paradigm, with a detailed exploration of strategies for developing reliable LLM-as-a-Judge systems.
TÜLU 3: Pushing Frontiers in Open Language Model Post-Training.	Introduces a suite of fully open state-of-the-art post-trained models, along with their accompanying data, code, and training methodologies, providing a detailed guide to contemporary post-training techniques.
INTELLECT-1 Release: The First Globally Trained 10B Parameter Model.	INTELLECT-1 is a 10B parameter model trained on 1 trillion tokens using globally distributed hardware. Its benchmarks are solid, and achieving an MFU of over 30% is remarkable considering the distributed training setup. If these results are validated, they represent a significant advancement in decentralized large-model training.
From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects.	This framework advances object detection in open-world settings by enabling AI to recognize and learn from previously unseen objects.
HUPE: Heuristic Underwater Perceptual Enhancement with Semantic Collaborative Learning.	HUPE is an AI-driven technique that enhances underwater image clarity while maintaining essential details for tasks such as object detection.
LTNtorch: PyTorch Implementation of Logic Tensor Networks.	Logic Tensor Networks (LTN) combine deep learning with logical reasoning, enabling neural models to learn by optimizing a knowledge base constructed from logical formulas.
Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale.	ProX is a framework that approaches data refinement as a programming task, enabling models to perform detailed operations on individual examples at scale. It enhances pre-training corpus quality by utilizing small language models to generate programs.
MMDuet.	MMDuet introduces a unique "video-text duet" interaction format for VideoLLMs, enabling AI to deliver real-time responses as videos play. This method simulates a dialogue where users and AI can exchange messages during video playback.
Converting GPT to Llama.	This repository contains code for converting a GPT implementation to Meta AI's Llama.
DeMo training run.	Nous is training a 15B distributed model using the DeMo optimizer. All of the training can be followed live at this link.
Fine-Tune Models with LoRA-SB.	LoRA-SB is a new method that brings full fine-tuning performance to low-rank adapters for large language models.
Making AI Datasets More Diverse.	Researchers proposed a new approach, Diversity-driven EarlyLate Training (DELT), to enhance dataset distillation for large-scale tasks.
Google’s plan to keep AI out of search trial remedies isn’t going very well.	US District Judge Amit Mehta indicates that AI could be pivotal in shaping remedies after the government's win in the Google search monopoly trial, potentially impacting Google's AI products. The DOJ has proposed measures to prevent Google from leveraging AI to maintain market dominance, including limits on exclusive agreements and AI investments. Microsoft opposes Google's requests for confidential AI deal details, citing irrelevance, while OpenAI may face pressure to disclose data in this context.
Using uv with PyTorch.	Documentation on how to use the new package manager uv to install PyTorch.
Amazon Launches Nova.	Amazon Nova unveils a series of multimodal models tailored for tasks such as document analysis, visual comprehension, and creative content generation. Prioritizing customization and efficiency, Nova models address various enterprise needs and excel in handling text, image, and video inputs.
Restructuring Vector Quantization with the Rotation Trick.	Vector Quantization uses the Straight Through Gradient estimator for gradient estimation, though its direction can occasionally be inaccurate. This paper proposes using rotation to correct the gradients and enhance codebook utilization.
Layout Generation with Diffusion GANs.	DogLayout is a hybrid model integrating GANs with diffusion processes to address challenges in layout generation.
Hunyuan Video Model.	Tencent's state-of-the-art open video model stands out for its realistic motion and dual training as both a video and image generation model. This dual approach enhances the aesthetic quality of its output, making it comparable to image generation models like Flux.
Scene Text Recognition.	TextSSR is a framework leveraging diffusion-based techniques to produce precise and realistic synthetic text images for scene text recognition.
T2Vid: Efficient Video Fine-tuning Scheme for MLLMs.	T2Vid is a novel approach aimed at enhancing video comprehension in Multimodal Large Language Models (MLLMs). It creates video-like samples to diversify training instructions.
aisuite.	aisuite offers a unified interface for seamless interaction with multiple LLM providers, enabling developers to test and compare outputs without modifying their code.
Motion Prompting: Controlling Video Generation with Motion Trajectories.	Motion Prompting is a technique for training video generation models using novel input types, including text, the first image frame, and a pixel tracking field. This enables innovative control during inference, allowing for new pixel fields (e.g., indicating an object moving in a different direction) to generate corresponding videos. While highly compelling, the method is not open source.
Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey.	This repository provides an extensive survey on the use of Vision-Language Models (VLMs) in remote sensing.
ImplicitPRM.	Process reward models (PRMs) provide detailed feedback by assessing reasoning step-by-step, unlike outcome reward models (ORMs), which evaluate complete responses. However, training PRMs demands detailed intermediate annotations, making it challenging. This paper demonstrates that an implicit PRM can be obtained at no extra cost by training an ORM on response-level labels, utilizing log-likelihood ratios between policy and reference models, thereby enabling optimization without specific loss objectives.
Unsloth - Dynamic 4-bit Quantization.	The Unsloth team seeks to compress a 20GB language model into 5GB while maintaining accuracy. Although various algorithms attempt this, challenges arise with outliers and compressibility. Llama, which is known to be difficult to quantize, is handled by selectively leaving specific parameters unquantized, which significantly improves overall accuracy.
AccDiffusion v2: Tackling Repetitive Image Generation.	AccDiffusion v2 enhances diffusion models for generating high-resolution images without requiring additional training, resolving issues such as object repetition and local distortions.
Optimizing AI Inference at Character.AI.	Character AI features a robust inference pipeline. This post explores their implementation of int8 quantization and flash attention 3, offering valuable insights for those interested in scaling large language models.
Flow.	Flow is a lightweight engine for creating flexible AI workflows using dynamic task scheduling and concurrent execution.
OpenAI o1 System Card.	This report details the safety measures undertaken before releasing OpenAI o1 and o1-mini, including external red teaming and frontier risk assessments aligned with OpenAI's Preparedness Framework.
PaliGemma 2: A Family of Versatile VLMs for Transfer.	PaliGemma 2 is among the top Vision-Language Models (VLMs) available today, utilizing SigLIP and Gemma technologies.
ASANet: Asymmetric Semantic Aligning Network for RGB and SAR image land cover classification.	The Asymmetric Semantic Aligning Network (ASANet) improves land cover classification using both SAR and RGB images.
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning.	Researchers have created a training-free method to enhance the efficiency of multi-modal large language models (LLMs) with minimal performance loss. Their technique reduces computational demands by up to sevenfold through strategic merging and pruning of visual data tokens.
Google DeepMind GraphCast and GenCast.	DeepMind has open-sourced its GraphCast algorithm, which significantly outperforms and accelerates localized weather predictions for up to 36 hours, operating in a fraction of the time required by other methods.
Anagram-MTL.	Visual anagram generation (images that change appearance when flipped or rotated) using diffusion models.
ScoreLiDAR.	ScoreLiDAR is a new method that speeds up 3D LiDAR scene completion for autonomous vehicles.
New Fish Audio Model.	Fish Audio 1.5 is currently ranked #2 on the Text-to-Speech Leaderboards, just behind ElevenLabs. It supports voice cloning and runs quickly, though the output quality can be inconsistent.
Deepthought-8B.	Deepthought-8B is a small and capable reasoning model built on LLaMA-3.1 8B, designed to make AI reasoning more transparent and controllable. Despite its relatively small size, it achieves sophisticated reasoning capabilities that rival much larger models.
LLM-Brained GUI Agents.	A Collection of Research Papers and Projects in Large Language Model-Brained GUI Agents: A Survey.
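
The ImplicitPRM entry above describes obtaining step-level rewards from log-likelihood ratios between a policy and a reference model. A minimal sketch of that idea follows, assuming the reward is parameterized as a scaling factor `beta` times the log-likelihood ratio. The function name, the value of `beta`, and the toy log-probabilities are hypothetical and are not taken from the paper's code.

```python
def implicit_process_rewards(policy_logprobs, reference_logprobs, beta=0.5):
    """Return one reward per reasoning step.

    Each argument holds the log-likelihood a model assigns to each step of
    the same response. The step reward is beta times the log-likelihood
    ratio for that step (an assumed parameterization).
    """
    return [
        beta * (lp_policy - lp_ref)
        for lp_policy, lp_ref in zip(policy_logprobs, reference_logprobs)
    ]


# Toy values for a three-step response (hypothetical numbers).
policy_logprobs = [-1.0, -0.5, -2.0]
reference_logprobs = [-1.5, -0.5, -1.0]

step_rewards = implicit_process_rewards(policy_logprobs, reference_logprobs)

# The response-level (outcome) reward is the same ratio over the full response.
outcome_reward = 0.5 * (sum(policy_logprobs) - sum(reference_logprobs))
```

Under this parameterization the step rewards sum to the response-level reward, which is why response-level labels alone can be enough to train a model that also scores individual steps.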

Perspectives

Link	description
AI expert Marietje Schaake: ‘The way we think about technology is shaped by the tech companies themselves’.	The Dutch policy director and former MEP on the unprecedented reach of big tech, the need for confident governments, and why the election of Trump changes everything
If AI can provide a better diagnosis than a doctor, what’s the prognosis for medics?	Studies in which ChatGPT outperformed scientists and GPs raise troubling questions for the future of professional work
Building LLMs is probably not going to be a brilliant business.	LLM developers, including OpenAI, face major hurdles due to the industry's structure, particularly NVIDIA's dominance as a critical chip supplier and the intense price sensitivity and competition among buyers. While many AI companies secure significant funding, they often face profitability challenges, reminiscent of past tech firms like Netscape. Nonetheless, technology is likely to continue to progress. AI businesses may find success by focusing on leveraging existing models instead of creating new ones.
Rox: How to Manufacture Path Dependence in Applied AI.	Rox aims to challenge incumbents like Salesforce by leveraging AI to manage unstructured data and integrate seamlessly with data warehouses. Its strategy focuses on enhancing the productivity of top sales performers through AI-powered agents while ensuring customer data security for future AI developments. This approach has attracted significant investor confidence, with Rox securing $50 million in funding from Sequoia Capital, GV, and General Catalyst across its seed and Series A rounds.
How close is AI to human-level intelligence?	Large language models such as OpenAI’s o1 have electrified the debate over achieving artificial general intelligence, or AGI. But they are unlikely to reach this milestone on their own.
The race is on to make AI agents do your online shopping for you.	Tech companies are creating AI shopping agents to automate online purchases, which could transform the retail industry. Perplexity's model faces operational hurdles, while OpenAI, Google, and Amazon are also working on AI purchasing tools. These advancements aim to simplify shopping but raise concerns about privacy, retailer dynamics, and the future of online shopping.
Salesforce CEO Marc Benioff Has Thoughts on AI Agents, Automation, And The Future of Your Job.	Salesforce CEO Marc Benioff foresees companies using AI agents to manage customer service and sales by utilizing their existing data and policies, with Salesforce serving as a central enabler of this change. He contends that AI-driven automation will boost productivity rather than replace jobs, enabling businesses to grow and operate more efficiently without adding human labor. Benioff emphasizes this transition as a pivotal moment in business evolution, offering a competitive advantage and transforming traditional workflows.
Reward Hacking in Reinforcement Learning.	Lilian Weng has published an insightful blog post on the issue of Reward Hacking in language model alignment, a key challenge hindering the deployment of models in production environments.
Create JSONL dataset from API chat logs.	A straightforward utility that enables the creation of a JSONL dataset from messages exchanged between the user and the API.
The ChatGPT secret: is that text message from your friend, your lover – or a robot?	People are turning to chatbots to solve all their life problems, and they like its answers. But are they on a very slippery slope?
A System of Agents brings Service-as-Software to life.	AI is evolving software from a tool into autonomous agents capable of performing tasks traditionally handled by humans, representing a projected $4.6 trillion market opportunity. Advancements like LLMs and agents empower AI systems to handle unstructured data, make decisions, and operate independently in sectors such as sales and healthcare. The future of AI envisions Systems of Agents working collaboratively and learning from one another, akin to a highly skilled team delivering seamless services.
Over ½ of Long Posts on LinkedIn are Likely AI-Generated Since ChatGPT Launched.	Since the launch of ChatGPT, LinkedIn has experienced a 189% increase in AI-generated content, with more than half of long-form posts now probably AI-created.
AI’s computing gap: academics lack access to powerful chips needed for research.	Survey highlights the disparity between academic and industry scientists’ access to computing power needed to train machine-learning models.
‘Brutal’ math test stumps AI but not human experts.	Benchmark shows humans can still top machines, but for how much longer?
Finetuning LLM Judges for Evaluation.	Evaluating LLMs is challenging due to their complex, open-ended outputs. While traditional human evaluation provides detailed insights, it is inefficient. Therefore, scalable assessments using automatic metrics and model-based approaches like LLM-as-a-Judge are essential. Innovations such as fine-tuned judges (e.g., Prometheus) and synthetic data generation are improving evaluation precision and adaptability across various tasks and domains.
The Gen AI Bridge to the Future.	Generative AI is set to revolutionize wearable technology by creating on-demand user interfaces that adapt to user needs and context.
Sam Altman Says Artificial General Intelligence Is on the Horizon.	Speaking at The New York Times DealBook Summit, Sam Altman, the chief executive of OpenAI, said that the arrival of artificial general intelligence would “matter much less” to the average person than currently thought.

Back to index

ML news: Week 25 November - 1 December

Research

Link	description
Learning high-accuracy error decoding for quantum processors.	A new AI-driven decoder has established a state-of-the-art benchmark for detecting errors in quantum computers. Leveraging transformer architecture, AlphaQubit achieved a 6% reduction in errors compared to tensor network methods and a 30% reduction compared to correlated matching on the Sycamore data. It also demonstrated promising performance in simulations with larger systems of up to 241 qubits. While this marks substantial progress in quantum error correction, the system requires speed enhancements to enable real-time error correction for practical quantum computing applications.
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use.	This work examines Claude 3.5's computer use capabilities across various domains and software, offering a ready-to-use agent framework for deploying API-based GUI automation models. Claude 3.5 showcases an exceptional ability to perform end-to-end tasks, translating language inputs into desktop actions seamlessly.
Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations.	The paper proposes five statistical recommendations for improving the evaluation of performance differences in LLMs. These include using the Central Limit Theorem to estimate theoretical averages over all possible questions rather than relying on observed averages, clustering standard errors when questions are related instead of treating them as independent, reducing variance within questions through resampling or next-token probabilities, analyzing paired differences between models by leveraging shared questions across evaluations, and conducting power analysis to determine sufficient sample sizes for identifying meaningful differences. The authors suggest that these approaches will help researchers better identify whether performance differences reflect genuine capability gaps or are merely due to chance, resulting in more accurate and reliable model evaluations.
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions.	Marco-o1 is a reasoning model designed for open-ended solutions, leveraging Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and advanced reasoning strategies. It achieves accuracy gains of +6.17% on the MGSM (English) dataset and +5.60% on the MGSM (Chinese) dataset.
Cut Your Losses in Large-Vocabulary Language Models.	The paper introduces Cut Cross-Entropy (CCE), a method designed to drastically reduce memory usage in LLM training by optimizing the computation of cross-entropy loss. Traditional cross-entropy layers can consume up to 90% of memory in some models by storing logits for the entire vocabulary. CCE addresses this by calculating logits only for the correct token and evaluating the log-sum-exp over all logits on the fly in flash memory. This approach reduces the memory footprint of the loss computation for Gemma 2 from 24GB to just 1MB. By leveraging the sparsity in softmax calculations, it skips elements that have minimal impact on gradients. The authors demonstrate that CCE achieves this substantial memory reduction without affecting training speed or convergence, allowing for larger batch sizes and potentially more efficient scaling of LLM training.
AIGS: Generating Science from AI-Powered Automated Falsification.	The study presents a multi-agent system for automated scientific discovery, focusing on falsification through automated ablation studies. Tested on three machine learning tasks—data engineering, self-instruct alignment, and language modeling—the system successfully generated meaningful scientific insights. However, its performance remains inferior to that of experienced human researchers.
Does Prompt Formatting Have Any Impact on LLM Performance?	The study investigates how different prompt formats (plain text, Markdown, JSON, and YAML) influence GPT model performance across various tasks. It finds that GPT-3.5-turbo's performance can vary by up to 40% depending on the format, whereas larger models like GPT-4 are more resilient to such changes. There is no universally optimal format across models or tasks; for example, GPT-3.5-turbo performed better with JSON, while GPT-4 favored Markdown. Models within the same family exhibited similar format preferences, but these preferences did not translate well to different model families. The findings highlight the significant impact of prompt formatting on model performance, emphasizing the importance of considering format choice during prompt engineering, model evaluation, and application development.
Juna.ai wants to use AI agents to make factories more energy-efficient.	AI agents are all the rage, a trend driven by the generative AI and large language model (LLM) boom these past few years. Getting people to agree on what exactly AI agents are is a challenge, but most contend they are software programs that can be assigned tasks and given decisions to make — with varying degrees of autonomy.
Why ‘open’ AI systems are actually closed, and why this matters.	This paper examines ‘open’ artificial intelligence (AI). Claims about ‘open’ AI often lack precision.
Qwen's first reasoning-inspired model QwQ.	Qwen has introduced a 32B parameter reasoning model that rivals OpenAI's o1 series in performance. The model demonstrates scalability when generating extended reasoning traces and is proficient in mathematics and coding. It is now available for use.
Pathways on the Image Manifold: Image Editing via Video Generation.	In the early days of image synthesis, exploring the latent space was an effective method for creating diverse images. This concept has now extended to video, enabling sequential edits to a single image while preserving semantic consistency.
Low-Bit Quantization Favors Undertrained LLMs.	Models trained for shorter durations on fewer tokens show less performance degradation when quantized after training. This aligns with findings from other research, suggesting that extended training allows models to utilize higher precision to compress increasingly complex information.
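
Three of the recommendations in the "Adding Error Bars to Evals" entry above (standard errors from the Central Limit Theorem, clustered standard errors, and paired differences) can be illustrated with a minimal sketch. It assumes per-question scores of 0 or 1. The function names and toy data are hypothetical, and the clustered estimate uses a common cluster-robust form without finite-sample corrections, which may differ in detail from the paper.

```python
import math
from statistics import mean, stdev


def standard_error(scores):
    """Standard error of the mean score, treating questions as independent."""
    return stdev(scores) / math.sqrt(len(scores))


def clustered_standard_error(scores, clusters):
    """Standard error of the mean when questions in a cluster are related.

    Residuals are summed within each cluster before squaring, so correlated
    questions do not count as independent evidence.
    """
    overall_mean = mean(scores)
    cluster_totals = {}
    for score, cluster in zip(scores, clusters):
        cluster_totals[cluster] = cluster_totals.get(cluster, 0.0) + (score - overall_mean)
    return math.sqrt(sum(t * t for t in cluster_totals.values())) / len(scores)


def paired_difference(scores_a, scores_b):
    """Mean per-question difference between two models and its standard error."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs), standard_error(diffs)


# Toy scores for two models on the same eight questions (hypothetical data).
scores_a = [1, 1, 0, 1, 0, 1, 1, 0]
scores_b = [1, 0, 0, 1, 0, 1, 0, 0]

diff, se = paired_difference(scores_a, scores_b)
# Approximate 95% confidence interval for the difference in accuracy.
interval = (diff - 1.96 * se, diff + 1.96 * se)
```

If the interval contains zero, as it does for this toy data, the observed gap could plausibly be due to chance.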

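The memory saving described in the Cut Cross-Entropy entry above rests on the fact that the loss for one token only needs the correct-token logit and the log-sum-exp over the vocabulary, and the latter can be accumulated chunk by chunk. The following is a pure-Python sketch of that arithmetic under simplifying assumptions, not the paper's GPU kernel. The function names, chunk size, and toy weights are hypothetical.

```python
import math


def logit(hidden, row):
    """Dot product of the hidden state with one classifier row."""
    return sum(h * w for h, w in zip(hidden, row))


def chunked_cross_entropy(hidden, classifier, target, chunk_size=2):
    """Cross-entropy for one token without storing all logits at once."""
    running_max = -math.inf
    running_sum = 0.0
    for start in range(0, len(classifier), chunk_size):
        chunk = [logit(hidden, row) for row in classifier[start:start + chunk_size]]
        new_max = max(running_max, max(chunk))
        # Rescale the accumulated sum to the new maximum for numerical stability.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(z - new_max) for z in chunk)
        running_max = new_max
    log_sum_exp = running_max + math.log(running_sum)
    return log_sum_exp - logit(hidden, classifier[target])


def naive_cross_entropy(hidden, classifier, target):
    """Reference version that materializes every logit."""
    logits = [logit(hidden, row) for row in classifier]
    return math.log(sum(math.exp(z) for z in logits)) - logits[target]


# Toy hidden state and a five-entry vocabulary (hypothetical values).
hidden = [1.0, -2.0]
classifier = [[0.5, 0.1], [1.5, -0.3], [-0.2, 0.4], [0.0, 0.0], [2.0, 1.0]]

loss_chunked = chunked_cross_entropy(hidden, classifier, target=1)
loss_naive = naive_cross_entropy(hidden, classifier, target=1)
```

Both functions return the same loss, but the chunked version only ever holds `chunk_size` logits at a time.
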
News

Link	description
Don’t know what to buy your loved ones for Christmas? Just ask ChatGPT.	Santa has a new little helper. But can an AI-powered shopping assistant really master the subtle art of gift-giving?
Anthropic x AWS Trainium collaboration.	Anthropic is collaborating with AWS to enhance Trainium inference and tooling capabilities as part of a recent investment initiative.
Will Sam Altman always win the OpenAI board fight in an AI agent simulation?	Fable, a company specializing in games and AI simulations, used its AI decision-making framework SIM-1 to simulate the OpenAI board dispute involving Sam Altman. The simulation, which incorporated multi-agent competition and GPT-4o, suggested Altman’s return as CEO in only 4 out of 20 scenarios. This research highlights AI's ability to model complex decision-making scenarios.
Anthropic Announces Model Context Protocol.	The Model Context Protocol (MCP) is an open standard that enables AI systems to connect directly to data sources, such as business tools and content repositories. It streamlines data access by replacing fragmented, custom integrations with a universal protocol, enhancing scalability and efficiency.
OpenAI Shares Insights on Red Teaming for Safer AI.	OpenAI has enhanced its red teaming initiatives by publishing two papers: one outlining the involvement of external experts in red teaming, and another presenting a novel approach to automated testing.
Nvidia’s CEO defends his moat as AI labs change how they improve their AI models.	"Test-time scaling" is gaining significance with the advancement of AI models, and Nvidia is prepared for this transition. This approach, which boosts AI inference by increasing computational power, introduces competitive pressure as startups create faster AI inference chips. While there are concerns about diminishing returns, Nvidia is determined to capitalize on its strong platform advantage for pretraining and expects substantial growth in AI inference.
Anthropic Introduces Custom Styles for Personalized Responses.	Anthropic now offers custom styles, enabling users to adapt the AI's responses to suit their communication preferences and workflows.
OpenAI’s Sora video generator appears to have leaked.	A group leaked access to OpenAI's unreleased video generator, Sora, in protest against perceived unfair practices and "art washing." They launched a frontend on Hugging Face that enabled users to generate videos, but OpenAI reportedly took it down within hours. OpenAI states that Sora remains in a research preview phase.
Now Hear This: World’s Most Flexible Sound Machine Debuts.	A team of generative AI researchers created a Swiss Army knife for sound, one that allows users to control the audio output simply using text. While some AI models can compose a song or modify a voice, none have the dexterity of the new offering.
OLMo 2: The best fully open language model to date.	Building on its commitment to fully open-source training, Allen AI has introduced a new generation of language models that are entirely transparent and rival or exceed the performance of the best open-weight models available.
Amazon to invest another $4 billion in Anthropic, OpenAI’s biggest rival.	Amazon revealed a $4 billion investment in Anthropic, raising its total commitment to $8 billion and solidifying AWS as Anthropic's main cloud and training partner.
OpenAI is funding research into ‘AI morality’.	OpenAI is funding academic research into algorithms that can predict humans’ moral judgments.
Quantum computing: physics–AI collaboration quashes quantum errors.	A neural network has learned to correct the errors that arise during quantum computation, outperforming algorithms that were designed by humans. The strategy sets out a promising path towards practical quantum computers.
OpenAI moves to trademark its o1 ‘reasoning’ models.	OpenAI has filed a trademark application for its latest AI model, o1, as the firm moves to shield its IP.
ElevenLabs’ new feature is a NotebookLM competitor for creating GenAI podcasts.	Voice AI startup ElevenLabs on Wednesday introduced a feature that lets you upload different types of content to create a multispeaker podcast for you, similar to Google’s NotebookLM.
Cradle raises $73M Series B to Put AI-Powered Protein Engineering in Every Lab.	Cradle has solved a critical challenge in optimizing protein shapes. It is now expanding its team and efforts to land this technology in the hands of practitioners everywhere.
Teach mode, Rabbit's tool for automating R1 tasks, is now available to all users.	Rabbit R1 has launched a teach mode feature that enables users to train its AI to automate tasks across various websites. This enhancement aims to boost functionality and productivity by supporting intricate multi-platform interactions, potentially providing a superior experience compared to dedicated apps. Rabbit plans to establish a marketplace for user-created automations and seeks widespread adoption, despite possible platform challenges.
Use robots instead of hiring low-paid migrants, says shadow home secretary.	Tory MP Chris Philp calls for more investment in technology to reduce UK’s net migration figures
Tesla owners turn against Musk: ‘I’m embarrassed driving this car around’.	The electric car brand was once a liberal favourite – but the CEO’s embrace of Trump has led to an angry backlash
Alibaba releases an ‘open’ challenger to OpenAI’s o1 reasoning model.	Alibaba has released QwQ-32B-Preview, an ‘open’ challenger to OpenAI's o1 reasoning model.
Ai2 releases new language models competitive with Meta’s Llama.	Ai2 has launched OLMo 2, an open-source language model series featuring 7- and 13-billion-parameter models. Built using publicly available training data and code, OLMo 2 aims to advance open-source AI innovation. Ai2 asserts that these models surpass comparable open models, such as Meta's Llama 3.1. The models are licensed under Apache 2.0, allowing for commercial use.
xAI could soon have its own app.	Elon Musk’s xAI is reportedly about to take its next step to compete with OpenAI.

Resources

Link	description
An Empirical Study on LLM-based Agents for Automated Bug Fixing.	The study evaluates seven top LLM-based bug-fixing systems on the SWE-bench Lite benchmark, identifying MarsCode Agent by ByteDance as the best performer with a 39.33% success rate. It highlights that line-level fault localization accuracy is more crucial than file-level accuracy for error localization, and bug reproduction capabilities play a significant role in fixing success. Notably, 24 out of 168 resolved issues required reproduction techniques, though these sometimes misled LLMs when issue descriptions were already clear. The study concludes that improving LLM reasoning abilities and refining agent workflows are essential for advancing automated bug fixing.
FinRobot: AI Agent for Equity Research and Valuation with Large Language Models.	The framework introduces an AI agent system for equity research that utilizes multi-agent Chain-of-Thought (CoT) prompting to integrate data analysis with human-like reasoning, producing professional investment reports comparable to those from major brokerages. It employs three specialized agents: the Data-CoT Agent, which aggregates diverse data sources for comprehensive financial integration; the Concept-CoT Agent, which mimics an analyst's reasoning to derive actionable insights; and the Thesis-CoT Agent, which synthesizes these insights into a cohesive investment thesis and report.
Bi-Mamba: Towards Accurate 1-Bit State Space Models.	The scalable 1-bit Mamba architecture is designed to optimize LLM efficiency across multiple model sizes (780M, 1.3B, and 2.7B). Bi-Mamba delivers performance comparable to full-precision formats like FP16 and BF16, while drastically reducing memory usage. It also achieves higher accuracy than post-training binarization Mamba baselines.
Ai2 OpenScholar: Scientific literature synthesis with retrieval-augmented language models.	Ai2 has introduced OpenScholar, a retrieval-augmented language model designed to search for relevant academic papers and provide answers based on those sources, streamlining the process for scientists to locate and synthesize information.
Detecting Human Artifacts from Text-to-Image Models.	This study addresses the issue of distorted human figures in text-to-image models by presenting the Human Artifact Dataset (HAD), a comprehensive dataset containing more than 37,000 annotated images.
UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages.	UnifiedCrawl is a method that efficiently gathers extensive text data for low-resource languages from the Common Crawl corpus, utilizing minimal computational resources. This approach filters and extracts relevant data, resulting in monolingual datasets significantly larger than previously available sources.
A New Image-to-Video Model.	Researchers have created image-to-video diffusion models capable of generating realistic motion transformations from static images, overcoming the constraints of traditional approaches such as affine transformations.
AIMv2: New Vision Models.	The AIMv2 vision model family employs a multimodal autoregressive training approach, delivering remarkable performance across various tasks.
A New Attention Mechanism for Training LLMs.	AnchorAttention is an improved attention mechanism for long-context LLM training.
Combining Convolutions and Self-Attentions for Efficient Vision Models.	GLMix is a novel approach that combines convolutions and multi-head self-attentions (MHSAs) at varying granularity levels for vision tasks. Convolutions capture fine-grained local details, while MHSAs focus on coarse-grained semantic slots to provide global context.
Echo Mimic v2.	Open weights system to animate partial human bodies with a reference image and audio input. It uses pose-specific VAEs to combine the information from various channels and a reference image to animate.
LTX-Video.	LTX-Video is the first DiT-based video generation model that can generate high-quality videos in real-time. It can generate 24 FPS videos at 768x512 resolution, faster than it takes to watch them. The model is trained on a large-scale dataset of diverse videos and can generate high-resolution videos with realistic and diverse content.
Documind.	Documind utilizes AI to extract structured data from PDFs by converting them into images and leveraging OpenAI's API.
Coalescence: making LLM inference 5x faster.	"Coalescence" is a framework that accelerates LLM inference by up to 5x when producing structured outputs like JSON. It achieves this by transforming structured formats into finite-state machines and eliminating redundant paths that result in the same output, reducing the need for unnecessary LLM calls. Although this approach greatly enhances speed, it is crucial to preserve output quality by ensuring that optimization does not exclude more likely sequences.
WildLMa: Long Horizon Loco-Manipulation in the Wild.	WildLMa is a framework designed to enable quadruped robots to perform advanced manipulation tasks in real-world settings. It integrates three core components: a whole-body controller for teleoperation via VR, a skill library learned through imitation learning (WildLMa-Skill), and a language model-based planner (WildLMa-Planner) that organizes these skills for long-term tasks. The researchers showcase its application in tasks such as cleaning trash from hallways and rearranging bookshelf items. The framework proves effective across various environments and object setups.
MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective.	MMGenBench is a novel evaluation framework for large multimodal models, emphasizing their capacity to generate and interpret images. In this process, models produce descriptions from input images, which are subsequently used to generate new images for comparison.
Moondream Python Client Library.	Moondream's Python client library provides tools for image analysis and querying, featuring CPU-optimized inference. However, it is not yet suitable for GPU or Mac M1/M2/M3 users. The library can be installed using pip, and model weights are available for download in various formats, including int8, fp16, and int4.
Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer.	Sana is a highly efficient image generation model capable of producing high-quality 1024x1024 images in under a second on a laptop GPU. Its innovations include a 32x image compression autoencoder (DC-AE), linear attention replacing traditional attention in DiT, a decoder-only LLM for text encoding, and improved training and sampling techniques. The 0.6B parameter model rivals or surpasses much larger models like Flux-12B, despite being 20x smaller and 100x faster. Requiring only 9GB of VRAM for inference, Sana-0.6B is accessible on consumer hardware. The repository provides code for training, inference, and evaluation, offering both 0.6B and 1.6B model variants.
Flow Models.	A great introduction to flow-based modeling, which is a theoretical improvement over diffusion.
Building an AI-Powered Game.	This is a course by Andrew Ng, Latitude, and Together AI on how to make an AI-powered game.
Sharper Infrared Images.	This project improves image super-resolution for infrared images, addressing issues where traditional methods distort spectral fidelity.
Mochi 1 LoRA Fine-tuner.	Mochi 1, a top open-source video model, supports LoRA fine-tuning and operates on a single GPU. The repository demonstrates various applications, such as creating custom effects and ensuring character consistency.
OneDiffusion.	OneDiffusion is a versatile large-scale diffusion model capable of handling various tasks, including text-to-image generation, image editing, and reverse processes such as depth estimation and segmentation.
customized-flash-attention.	New flash attention fork that can have ragged Q/V matrix sizes.
Novel View Synthesis.	MVGenMaster is a multi-view diffusion model that enhances Novel View Synthesis tasks by incorporating 3D priors.
FlowMol: Mixed Continuous and Categorical Flow Matching for 3D De Novo Molecule Generation.	This work benchmarks discrete flow matching methods for generating novel 3D molecular structures, critical for chemical discovery.
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge.	This project investigates the growing "LLM-as-a-judge" approach, where large language models are utilized for scoring, ranking, and selection tasks in diverse AI and NLP applications.
aisuite.	An easy way to work with a variety of API based models in a single packaged environment.
UK government failing to list use of AI on the mandatory register.	Technology secretary admits Whitehall departments are not being transparent over the way they use AI and algorithms
Reddit overtakes X in popularity of social media platforms in UK.	Discussion platform takes fifth place in rankings and is the fastest growing large social media platform in the UK
Star Attention: Efficient LLM Inference over Long Sequences.	Star Attention introduces a block-sparse method to accelerate Transformer-based large language models (LLMs) during long-sequence inference.
SketchAgent.	SketchAgent utilizes a multimodal LLM to enable language-guided, step-by-step sketch generation using an intuitive sketching language. It can create diverse sketches, interact with humans for collaborative sketching, and edit content through chat.
DROID-Splat.	A deep learning-based dense visual SLAM framework capable of real-time global pose optimization and 3D reconstruction.
P2DFlow.	P2DFlow is a protein ensemble generative model that applies SE(3) flow matching on top of ESMFold. The ensembles it generates could aid in understanding protein functions across various scenarios.
ThunderMittens For Your ThunderKittens.	Hazy Research has played a significant role in optimizing hardware utilization for AI workloads. They have expanded their impressive ThunderKittens Kernel writing framework to support Apple Silicon.
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving.	A truncated diffusion model for end-to-end autonomous driving that can run at 45 FPS on an NVIDIA RTX 4090 GPU.
PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion-based Image Super-Resolution.	PassionSR introduces an approach that makes diffusion-based image super-resolution (SR) models more hardware-friendly.
Training Open Instruction-Following Language Models.	This repo serves as an open effort on instruction-tuning popular pre-trained language models on publicly available datasets.
Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment.	Grounding-IQA is an innovative method for image quality assessment (IQA) that combines location-specific grounding with multimodal descriptions.
Steel Browser API for AI Agents.	The open-source browser API built for AI agents. Steel provides a REST API to control headless browsers with session management, proxy support, and anti-detection features. Perfect for web automation, scraping, and building AI agents that can interact with the web.
PixMo dataset.	Allen AI has released several datasets that were used to train its visual language models.
StableAnimator: High-Quality Identity-Preserving Human Image Animation.	StableAnimator introduces a breakthrough in human image animation by ensuring identity consistency in generated videos.
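
The Coalescence entry above describes turning a structured format into a finite-state machine and skipping model calls wherever only one continuation is valid. The following is a minimal sketch of that idea, not the framework's actual implementation: the schema, the token strings, and the `first_option` stand-in for a model are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Toy finite-state machine for a fixed schema {"name": <string>, "age": <number>}.
# Each state lists the (token, next_state) transitions allowed from it.
# The schema and tokens are illustrative assumptions, not taken from the Coalescence post.
FSM: Dict[str, List[Tuple[str, str]]] = {
    "start": [('{"name": "', "name")],
    "name": [("Alice", "after_name"), ("Bob", "after_name")],
    "after_name": [('", "age": ', "age")],
    "age": [("30", "after_age"), ("41", "after_age")],
    "after_age": [("}", "end")],
}


def generate(choose: Callable[[str, List[str]], str]) -> Tuple[str, int]:
    """Walk the FSM and call `choose` (the model stand-in) only at real choices."""
    state, pieces, llm_calls = "start", [], 0
    while state != "end":
        transitions = FSM[state]
        if len(transitions) == 1:
            # Exactly one valid continuation: emit it without calling the model.
            token, state = transitions[0]
        else:
            llm_calls += 1
            options = [tok for tok, _ in transitions]
            token = choose(state, options)
            state = dict(transitions)[token]
        pieces.append(token)
    return "".join(pieces), llm_calls


def first_option(state: str, options: List[str]) -> str:
    """Hypothetical stand-in for a model: always picks the first allowed token."""
    return options[0]


text, calls = generate(first_option)
print(text)   # {"name": "Alice", "age": 30}
print(calls)  # 2
```

In this toy run, five generation steps need only two model calls. The sketch does not address the caveat in the entry: a real implementation must check that collapsing paths does not exclude sequences the model would consider more likely.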

Perspectives

Link	description
Jeff Jarvis: ‘Elon Musk’s investment in Twitter seemed insane, but it gave him this power’.	The US media pundit on the dangers of overregulation online, why he’s more frightened of the tech bros than AI itself, and how to reclaim the web by getting rid of the geeks
Passwords are giving way to better security methods – until those are hacked too, that is.	It’s a war that will never end. But for small-business owners, it’s all about managing risk while reaping rewards
Gwern Branwen - How an Anonymous Researcher Predicted AI's Trajectory.	In this post, Gwern Branwen, an early advocate of LLM scaling, explores AI advancements and their influence on the path to AGI. He highlights the significance of scaling and computational power over traditional algorithmic innovations. Branwen reflects on the interplay between human intelligence and AI, as well as the effects on behavior of upcoming technologies such as weight-loss drugs. Additionally, he offers thoughts on his writing process and the transformative effects of AI on creative endeavors.
The Bitter Religion: AI’s Holy War Over Scaling Laws.	The AI community is currently divided over the emphasis on scaling computation as the primary driver of AI performance, a concept often referred to as "The Bitter Lesson." Proponents, including leaders at OpenAI, believe that artificial general intelligence (AGI) is achievable in the near term through the continued scaling of computational resources. However, others argue that alternative scientific advancements are necessary, as scaling laws may not be sustainable in the long term. This debate significantly influences investment and development strategies within AI and related fields.
Why LLMs Within Software Development May Be a Dead End.	LLMs in software development face challenges due to their lack of decomposability and explainability.
How the far right is weaponizing AI-generated content in Europe.	Experts say fake images raising fears around issues such as immigration have proliferated since EU elections
‘What many of us feel’: why ‘enshittification’ is Macquarie Dictionary’s word of the year.	The committee’s honorable mentions went to ‘right to disconnect’ and ‘rawdogging’
Valuing Humans in the Age of Superintelligence: HumaneRank.	AI's ability to exceed human intellectual output could result in economic displacement. The proposed HumaneRank system addresses this by allowing individuals to allocate endorsements that represent societal value, influencing resource distribution. This approach preserves market dynamics and personal freedom while offering a new way to value human contributions in an AI-driven world.
Something weird is happening with LLMs and chess.	This article examines how various LLMs perform in playing chess. Most models falter after a few moves, except for GPT-3.5-turbo-instruct, which excels. This indicates that instruction tuning might impair chess capabilities or that GPT-3.5-turbo-instruct was trained on more chess-related data. Additionally, tokenizer handling issues could be affecting model performance.
Amazon, Google and Meta are ‘pillaging culture, data and creativity’ to train AI, Australian inquiry finds.	Among the report’s 13 recommendations is the call for the introduction of standalone AI legislation and protections for creative workers
When we become cogs.	AI enhances material scientists' efficiency, driving a 44% rise in material discoveries but reducing work satisfaction by 44% due to fewer opportunities for idea generation. Similarly, GitHub Copilot boosts productivity for less experienced developers, shifting their focus from project management to coding. While AI helps bridge skill gaps, it risks alienation by automating creative tasks, mirroring the effects of automation in other industries.
AI Alone Isn't Ready for Chip Design.	Hybrid methods blending classical search techniques with machine learning are proving effective in addressing the challenges of chip design, especially in floorplanning. While AI alone faces difficulties with multi-constraint scenarios, incorporating AI to guide search-based algorithms, such as simulated annealing, improves both efficiency and performance. This synergy accelerates the design process and facilitates the development of more intricate chip solutions.
In the big data era, prioritize statistical significance in study design.	Analysis of neuroimaging studies shows that close attention to experimental design can increase the statistical robustness of research results.
AI could pose pandemic-scale biosecurity risks. Here’s how to make it safer.	AI-enabled research might cause immense harm if it is used to design pathogens with worrying new properties. To prevent this, we need better collaboration between governments, AI developers, and experts in biosafety and biosecurity.
Don’t let watermarks stigmatize AI-generated research content.	Given the increasing integration of LLMs into research processes, identifying their contributions transparently is ever more urgent. But watermarking risks fostering a reductive and binary view of content as either ‘pure’ or ‘tainted’ depending on whether it is human- or LLM-generated.
It's Surprisingly Easy to Jailbreak LLM-Driven Robots.	RoboPAIR is an algorithm capable of bypassing safety guardrails in robots powered by LLMs, effectively jailbreaking these systems. Tests demonstrated a 100% success rate in compromising platforms such as a self-driving simulator and the Go2 robot dog. This highlights critical security vulnerabilities, underscoring the urgent need for stronger defenses against LLM-based robot hacking.
A new AI scaling law shell game?	Recent changes in AI scaling laws have exposed limits in predictability and effectiveness, with newer models falling short of previous expectations. Microsoft CEO Satya Nadella emphasizes "inference time compute" as a key area to address, though issues of cost and reliability remain. Advancing beyond scaling is essential, and LLMs should be integrated into a more comprehensive AI strategy.
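
The chip design entry above mentions using AI to guide search-based algorithms such as simulated annealing. The following is a minimal sketch of that pattern on a toy one-dimensional placement problem. Everything in it is an assumption made for illustration: the netlist `NETS`, the `wirelength` cost, and `stand_in_guide`, which is a plain heuristic standing in for a learned model.

```python
import math
import random
from typing import Callable, List, Tuple

# Hypothetical netlist: pairs of blocks that should sit close together.
NETS: List[Tuple[int, int]] = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]


def wirelength(order: List[int]) -> float:
    """Toy cost: total slot distance between connected blocks."""
    pos = {block: slot for slot, block in enumerate(order)}
    return float(sum(abs(pos[a] - pos[b]) for a, b in NETS))


def stand_in_guide(order: List[int], i: int, j: int) -> float:
    """Placeholder for a learned scorer: rates a swap by the cost it would yield."""
    trial = order[:]
    trial[i], trial[j] = trial[j], trial[i]
    return -wirelength(trial)


def anneal(order: List[int],
           cost: Callable[[List[int]], float],
           guide: Callable[[List[int], int, int], float],
           steps: int = 2000, t0: float = 5.0, seed: int = 0) -> List[int]:
    """Simulated annealing in which `guide` picks among candidate swaps."""
    rng = random.Random(seed)
    current, best = order[:], order[:]
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling schedule
        # Sample a few random swaps and let the guide choose the most promising.
        candidates = [tuple(rng.sample(range(len(current)), 2)) for _ in range(4)]
        i, j = max(candidates, key=lambda ij: guide(current, *ij))
        proposal = current[:]
        proposal[i], proposal[j] = proposal[j], proposal[i]
        delta = cost(proposal) - cost(current)
        # Metropolis rule: always accept improvements, sometimes accept regressions.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = proposal
            if cost(current) < cost(best):
                best = current[:]
    return best


initial = [3, 0, 5, 1, 4, 2]
best = anneal(initial, wirelength, stand_in_guide)
print(wirelength(initial), wirelength(best))
```

The annealing loop still decides what to accept, and the guide only biases which moves are proposed. This is one simple way to combine the two, and it is not necessarily the design used in the systems the article discusses.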

Back to index

ML news: Week 18 - 24 November

Research

Link	description
Artificial Intelligence, Scientific Discovery, and Product Innovation.	indicates that leading scientists use their expertise to focus on the most promising AI-generated suggestions, while others often expend considerable resources on false positives; shows that adopting AI technology for materials discovery boosts productivity, resulting in 44% more materials discovered, a 39% increase in patent filings, and 17% greater product innovation; notes that these improvements come with drawbacks, as 82% of scientists experienced lower job satisfaction, citing reduced creativity and underutilization of their skills.
Scaling Laws for Precision.	presents "precision-aware" scaling laws that forecast how both training and inference precision impact LLM performance; key insights include: 1) post-training quantization becomes increasingly detrimental as models are trained on larger datasets, to the point where more pretraining may harm performance, 2) training with lower precision necessitates a larger model size to sustain performance levels, and 3) when optimizing model size, data, and precision together, the ideal training precision is around 7-8 bits, independent of compute availability; further notes that with fixed model size, the optimal precision for compute increases roughly logarithmically with data size; the authors confirm their predictions on models up to 1.7B parameters trained on up to 26B tokens, demonstrating that both very high (16-bit) and very low (under 4-bit) training precisions may be inefficient.
Sequence modeling and design from molecular to genome-scale with Evo.	a 7B parameter AI model built to comprehend and generate DNA sequences across various biological scales; trained on 2.7 million prokaryotic and phage genomes, it can handle sequences up to 131 kilobases long while preserving single-nucleotide precision, allowing it to capture both molecular interactions and genome-wide patterns; Evo excels at predicting and generating functional DNA, RNA, and protein sequences, achieving the first experimentally validated AI-generated CRISPR-Cas complexes and transposable systems.
The Surprising Effectiveness of Test-Time Training for Abstract Reasoning.	examines test-time training (TTT), where model parameters are temporarily updated during inference, to enhance an LLM's abstract reasoning on the ARC benchmark; highlights three essential components: initial fine-tuning on related tasks, using auxiliary task formats and augmentations, and per-instance training; TTT yields substantial performance gains, with accuracy improvements of up to 6x over base fine-tuned models; applying TTT to an 8B LLM results in 53% accuracy on ARC's public validation set, a nearly 25% increase over the previous state-of-the-art for neural approaches; combining their method with program generation techniques achieves a new public validation accuracy of 61.9%, on par with average human performance; the results indicate that explicit symbolic search is not the sole route to better abstract reasoning in LLMs, and that test-time training on few-shot examples can be highly effective.
Toward Optimal Search and Retrieval for RAG.	investigates the impact of retrieval on performance in RAG pipelines for QA tasks; performs experiments using BGE-base and ColBERT retrievers with LLaMA and Mistral, showing that incorporating more gold (relevant) documents enhances QA accuracy; observes that using approximate nearest neighbor search with lower recall has minimal performance impact while potentially boosting speed and memory efficiency; notes that introducing noisy or irrelevant documents consistently harms performance, refuting prior research claims; concludes that optimizing the retrieval of gold documents is essential for RAG effectiveness and that lower search accuracy can be a practical strategy.
Rapid Response: Mitigating LLM Jailbreaks with a Few Examples.	presents a novel approach for defending LLMs against jailbreak attacks, emphasizing the rapid adaptation of defenses upon detecting new attacks rather than striving for perfect initial adversarial robustness; using a new benchmark, the top-performing method—fine-tuning an input classifier—reduced attack success rates by over 240x for known attack types and 15x for new variations after observing just one example of each attack strategy; shows that swiftly responding to emerging jailbreaks can be an effective alternative to traditional static defenses.
Solving the Travelling Salesman Problem.	This study highlights the often underestimated value of the "heatmap + Monte Carlo Tree Search (MCTS)" method, demonstrating that well-tuned, straightforward heatmaps can surpass more sophisticated models.
Graph-based AI model maps the future of innovation.	MIT researchers created an AI model that employs generative knowledge extraction and graph reasoning to detect intricate patterns across domains such as biology and music. The model efficiently generates knowledge maps from scientific literature, uncovering connections and proposing novel materials inspired by art. This method boosts interdisciplinary research by uncovering hidden insights and fostering innovative concepts for material design.
Teaching Video Models to Understand Time Like a Story.	This paper presents NumPro, an innovative approach designed to assist Video Large Language Models in managing Video Temporal Grounding tasks.
Generative World Explorer.	The Generative World Explorer (Genex) is a system that simulates exploration of 3D spaces through generation and then leverages those simulations to enhance planning. It employs an ST-VAE and a diffusion pass for its imagination process, leading to better planning outcomes.
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering.	Proposes verifier engineering, a post-training paradigm for foundation models built around the three stages named in the title: searching for candidate responses, verifying them, and feeding the verification results back to the model.
OneNet: A Channel-Wise 1D Convolutional U-Net.	OneNet is a 1D convolutional encoder optimized for efficient image segmentation, making it well-suited for edge devices.
AI’s math problem: FrontierMath benchmark shows how far technology still has to go.	Artificial intelligence systems may be good at generating text, recognizing images, and even solving basic math problems—but when it comes to advanced mathematical reasoning, they are hitting a wall. A groundbreaking new benchmark, FrontierMath, exposes just how far today’s AI is from mastering the complexities of higher mathematics.
Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus.	Researchers have proposed Additional Logic Training to enhance reasoning in LLMs, focusing on teaching them to manage complex deductions involving varied rules and distractions.
Solving Cold Starts in Adaptive Testing.	The "cold start" issue in adaptive testing arises when initial questions fail to align with examinees' abilities. Researchers have addressed this with the Diffusion Cognitive States Transfer Framework (DCSR), which employs diffusion models to utilize prior learning data across domains.
samurai.	Tracking a consistent object over an extended period is a challenging task. This work enhances SAM 2 by integrating motion-aware memory banks, ensuring consistency over time and through occlusions. It stands out as one of the most effective visual tracking systems developed so far.
Compress and Reconstruct Images.	PCNet is a new compact network for image-compressed sensing. It reduces sampling costs while delivering high-quality reconstructions.
LMM-driven Semantic Image-Text Coding for Ultra Low-bitrate Learned Image Compression.	Large multi-modal models can generate captions and compress images simultaneously within a single system

News

Link	description
Hi-tech recreation of Richard III’s voice has a Yorkshire accent.	A digital avatar of the king’s head, complete with ‘meticulously researched’ voice, is on display in York
OpenAI’s tumultuous early years revealed in emails from Musk, Altman, and others.	Elon Musk's lawsuit against OpenAI has unveiled emails from the startup's early days, exposing internal conflicts.
Spotify’s Plans For AI-Generated Music, Podcasts, and Recommendations, According To Its Co-President, CTO, and CPO Gustav Söderström.	Spotify's Gustav Söderström talks about AI music, Notebook LM podcasts, and the nuance of building better discovery using LLMs.
AI cloning of celebrity voices outpacing the law, experts warn.	David Attenborough among famous people whose voices have been exploited by fraudsters
John Oliver on potential US TikTok ban: ‘May not be necessary, but it isn’t sufficient’.	Last Week Tonight host looks into looming US ban over privacy concerns and fear of its Chinese parent company
Shop like a Pro: Perplexity’s new AI-powered shopping assistant.	Perplexity has introduced a shopping feature for Pro users in the U.S., enabling them to research and purchase products directly within the platform. This feature includes a "Buy with Pro" button that allows users to order items using saved billing and shipping information, with free shipping on all purchases.
Ben Affleck Shares Candid Take on the Positive Use of AI in Hollywood, but Doesn't See It Threatening Creativity.	During an interview, Ben Affleck reassured Hollywood actors and writers, stating that AI currently poses minimal risk to their jobs because of its existing limitations.
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use.	This work seeks to systematically evaluate the capabilities of new autonomous computer use agents, revealing that Claude is particularly strong at handling traditional linear tasks.
Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference.	Cerebras Inference now runs Meta's 405-billion-parameter Llama 3.1 model, the largest in its class, at 969 tokens per second. This performance is approximately 12 times faster than comparable systems and 18 times faster than some closed-model API providers. The model is expected to be accessible via API at the beginning of next year.
Nous Research Forge.	The Forge Reasoning API enhances popular language models by integrating a code interpreter and advanced reasoning capabilities, leading to improved performance.
US Justice Department plans to push Google to sell off Chrome browser.	Authorities seek to dismantle monopoly on search market and also want action related to AI and Android
Meta pushes AI bid for UK public sector forward with technology aimed at NHS.	Tech giant awards funding to project to shorten waits in A&E, after ‘hackathon’ on using Llama system in Britain
Meta hires Salesforce's CEO of AI, Clara Shih.	Meta is creating a new product unit to develop AI tools for the 200 million businesses that use its apps.
Rox's Public Beta and $50M Raise.	Rox, an AI-powered sales productivity platform, boosts enterprise sales reps' performance by over 30% through AI analyst teams that handle tasks like planning and engagement. It integrates effortlessly with existing systems, eliminating the inefficiencies of traditional CRMs, and is already used by leading companies. Rox recently secured $50M in funding, led by Sequoia and other prominent investors, to expand its market presence.
Genies launches Parties for brands and creators to launch their own ‘AI Roblox’.	Genies, a culture-focused avatar technology company, has launched Parties after developing its foundational technology stack since the last fundraise.
Generative AI taught a robot dog to scramble around a new environment.	Teaching robots to navigate new environments is tough. You can train them on physical, real-world data taken from recordings made by humans, but that’s scarce and expensive to collect. Digital simulations are a rapid, scalable way to teach them to do new things, but the robots often fail when they’re pulled out of virtual worlds and asked to do the same tasks in the real one.
Breakthrough robot nails surgery like a human doctor after watching videos.	The model can quickly train robots for diverse surgeries, from basic tasks to full procedures, advancing robotic medical capabilities.
DeepL launches DeepL Voice, real-time, text-based translations from voices and videos.	DeepL has made a name for itself with online text translation it claims is more nuanced and precise than services from the likes of Google — a pitch that has catapulted the German startup to a valuation of $2 billion and more than 100,000 paying customers. Users will now be able to use DeepL Voice to listen to someone speaking in one language and automatically translate it to another, in real-time.
Google releases standalone Gemini app for iPhone.	You've always been able to access this in the Google app, but now there's another way.
ChatGPT can now read some of your Mac’s desktop apps.	On Thursday, the startup announced the ChatGPT desktop app for macOS can now read code in a handful of developer-focused coding apps, such as VS Code, Xcode, TextEdit, Terminal, and iTerm2.
Google must sell Chrome to end search monopoly, justice department argues in court filing.	Justice department urges court to force Google to share data with rivals as part of wide-ranging changes to end online giant’s monopoly on web searching
Nvidia earnings: AI chip leader shows no signs of stopping mammoth growth.	World’s most valuable company delights investors as it reports $35bn of revenue in quarterly results
DeepSeek r1 reasoning model.	DeepSeek has replicated o1 with its r1 Deep Think model, a highly powerful system that the company plans to make fully open-source. The model was trained using reinforcement learning with reasoning traces.
Introducing AI Backgrounds, HD Video Calls, Noise Suppression and More for Messenger Calling.	Meta has announced new updates for its Messenger app, including HD video calling, noise suppression, and AI-generated backgrounds. HD video calling will be enabled by default on Wi-Fi, but can also be activated using a cell data plan through call settings.
A.I. Chatbots Defeated Doctors at Diagnosing Illness.	A small study found ChatGPT outdid human physicians when assessing medical case histories, even when those doctors were using a chatbot.
AlphaQubit tackles one of quantum computing’s biggest challenges.	DeepMind and Google Quantum AI have trained a model that can identify errors in quantum computations and correct them as needed.
Superhuman vision lets robots see through walls, and smoke with new LiDAR-like eyes.	PanoRadar, developed by researchers at the University of Pennsylvania, is an AI-driven system that transforms radio waves into 3D views, offering robots LiDAR-like vision at a reduced cost. By leveraging AI to process radio wave reflections, it overcomes challenges faced by traditional sensors in conditions like smoke, fog, and glass. The team plans to integrate PanoRadar with existing sensing technologies to enhance multi-modal perception in robotics.
Google DeepMind has a new way to look inside an AI's “mind”.	DeepMind has introduced Gemma Scope, a tool designed to enhance the understanding of AI models' internal mechanisms and decision-making processes. By employing sparse autoencoders, Gemma Scope dissects and analyzes data layers, aiding in the identification of biases or errors, such as incorrect numerical interpretations. This advancement in model transparency aims to improve AI control and alignment, thereby reducing deployment risks.
AI model identifies overlooked brain tumors in just 10 seconds.	FastGlioma is an AI model that rapidly detects residual brain tumor tissues during surgery with high accuracy.
It's Surprisingly Easy to Jailbreak LLM-Driven Robots.	Researchers induced bots to ignore their safeguards without exception
Nvidia to fuel humanoid robots with ‘Jetson Thor’.	Nvidia plans to launch its “Jetson Thor” computing platform in the first half of 2025, providing the processing power needed to bring sophisticated humanoid robots to life.
Introducing FLUX.1 Tools.	FLUX.1 Tools is a collection of models designed to enhance control and steerability in the FLUX.1 text-to-image model. It includes utilities and model checkpoints that enable features like inpainting, outpainting, and certain controlnets. These tools are ideal for users looking to expand their creative capabilities using one of the leading models available.
Elon Musk Asked People to Upload Their Health Data. X Users Obliged.	Users are uploading medical images to X's AI chatbot Grok for diagnostic purposes, a practice endorsed by Elon Musk despite concerns about accuracy and privacy. Unlike regulated medical platforms, Grok lacks HIPAA compliance, raising ethical questions about data security. While AI shows promise in healthcare, experts warn of risks related to inaccurate diagnoses and privacy violations.
ElevenLabs now offers the ability to build conversational AI agents.	ElevenLabs, a startup that provides AI voice cloning and a text-to-speech API, launched the ability to build conversational AI bots on Monday.
New OpenAI emails reveal a long history of mistrust.	Greg Brockman and Ilya Sutskever had questions about Sam Altman's intentions as early as 2017
Musk’s amended lawsuit against OpenAI names Microsoft as a defendant.	Elon Musk’s lawsuit against OpenAI accusing the company of abandoning its nonprofit mission was withdrawn in July, only to be revived in August. Now, in an amended complaint, the suit names new defendants, including Microsoft, LinkedIn co-founder Reid Hoffman, and former OpenAI board member and Microsoft VP Dee Templeton.

Resources

Link	description
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models.	introduces OpenCoder, a completely open-source LLM tailored for code generation and comprehension; the authors highlight key elements for creating top-performing code LLMs: (1) rigorous data cleaning using code-specific heuristic rules for deduplication, (2) effective recall of related text corpus for code context, and (3) high-quality synthetic data utilized in both annealing and supervised fine-tuning phases; OpenCoder outperforms previous open models at the 6B+ parameter level and provides not only the model weights but also the full training pipeline, datasets, and protocols to support reproducible research.
A Taxonomy of AgentOps for Enabling Observability of Foundation Model-based Agents.	examines AgentOps platforms and tools, emphasizing the necessity of robust observability and traceability features to maintain reliability in foundation model-based autonomous agent systems throughout their development and production lifecycle.
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models.	presents Mixture-of-Transformers (MoT), a novel sparse multi-modal transformer architecture for text and image tasks; MoT matches the performance of a dense baseline while utilizing only 55.8% of the FLOPs.
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems.	introduces a novel approach that uses HTML instead of plain text for constructing RAG systems; the core insight is that preserving HTML structure retains richer semantic and structural information compared to plain text conversion, which often loses critical formatting like headings, tables, and semantic tags; to handle the challenge of long HTML documents exceeding LLM context windows, the authors design a two-step pruning method: first, cleaning unnecessary HTML elements to cut length by 94%, and then applying a block-tree-based pruning approach that integrates embedding-based and generative pruning to retain essential content; experiments on six QA datasets show that HtmlRAG surpasses existing plain-text methods, confirming the benefits of maintaining HTML structure in RAG systems.
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models.	NVIDIA has developed LLaMA-Mesh, a method that fine-tunes the LLaMA language model to generate 3D meshes from text prompts. By training LLaMA on a curated dataset of 3D dialogues, LLaMA-Mesh enables the model to represent and generate 3D mesh data in plain text format, integrating 3D mesh generation with language understanding.
Your Fixed Watermark is Fragile: Towards Semantic-Aware Watermark for EaaS Copyright Protection.	Researchers have introduced the Semantic Perturbation Attack (SPA) to exploit vulnerabilities in current watermarking schemes for Embedding-as-a-Service (EaaS) systems. Traditional watermarking methods often inject fixed signals into embeddings, regardless of the input's semantics, making them susceptible to adaptive attacks. SPA leverages semantic perturbations to identify and bypass these static watermark signals, effectively compromising watermark verification.
Don’t Look Twice: Faster Video Transformers with Run-Length Tokenization.	By adaptively caching video tokens that remain unchanged across frames, you can significantly accelerate run time without sacrificing performance or requiring extra training.
Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement.	A technique for generating images with finer control over chosen regions.
Accurate Image Matching.	MOP+MiHo+NCC is a non-deep, modular method for improving image matches using a combination of three techniques. Multiple Overlapping Planes (MOP) clusters inlier matches and uses RANSAC to remove outliers. Middle Homography (MiHo) minimizes distortion during planar reprojection. Normalized Cross Correlation (NCC) adjusts keypoint positions post-transformation.
The Beginner's Guide to Visual Prompt Injections.	Visual prompt injections present security threats to LLMs like GPT-4V by embedding harmful instructions within images, potentially causing unintended model behavior. These vulnerabilities can manipulate outputs, for instance, by causing the model to overlook certain individuals in images or misrepresent described contexts. With the increasing adoption of generative AI, companies must implement strong security measures to address these risks.
PyGen: Turning Your Ideas into Python Package.	PyGen simplifies the process of turning your ideas into software, making coding more accessible and enjoyable. Leveraging advanced language models, PyGen acts like a tech-savvy assistant, transforming abstract concepts into complete Python tools, including testing and documentation.
UltraVox Audio Language Models.	A suite of open-weight models that can take text and audio as input modalities.
https://arxiv.org/abs/2410.17758.	Pixtral Large is a 124B open-weight multimodal model built upon Mistral Large 2. As the second model in this multimodal series, it showcases advanced image comprehension, capable of interpreting documents, charts, and natural images, while retaining the top-tier text understanding of Mistral Large 2.
LLaVA-o1: Let Vision Language Models Reason Step-by-Step.	Although this isn't an exact replication of the training process used for o1, it remains a robust VLM trained on reasoning traces.
CLIP for Semantic Segmentation.	Although CLIP has excelled in open-vocabulary tasks, it faces challenges in semantic segmentation due to noisy features and limited resolution. Trident tackles the resolution problem with a training-free framework, integrating CLIP and DINO features from sub-images and employing SAM's encoder for global feature aggregation.
Confidence-aware Denoised Fine-tuning of Off-the-shelf Models for Certified Robustness.	This work focuses on improving the certified robustness of smoothed classifiers by fine-tuning off-the-shelf models.
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning.	This paper from Google demonstrates a method for altering the camera viewpoint of an existing video.
Evaluating-Constitutions.	Code to assist in evaluating constitutions based on human feedback.
StableV2V: Stabilizing Shape Consistency in Video-to-Video Editing.	StableV2V is a novel video editing framework that maintains shape consistency across frames, even when user prompts require significant transformations. This method ensures smooth and precise modifications throughout the video, preserving structural integrity.
CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset.	CCExpert is an AI model developed to describe changes in images using natural language. It can identify what has changed, where the change occurred, and how it happened.
SAM Decoding: Speculative Decoding via Suffix Automaton.	SAM-Decoding offers a faster method for text generation in LLMs by utilizing a suffix automaton to create drafts efficiently and accurately.
That Chip Has Sailed: A Critique of Unfounded Skepticism Around AI for Chip Design.	DeepMind has issued a robust defense of its AlphaChip project, which has faced criticism from some academic circles despite widespread industry adoption. In a recent paper titled "That Chip Has Sailed: A Critique of Unfounded Skepticism Around AI for Chip Design," DeepMind addresses these critiques, emphasizing AlphaChip's significant contributions to chip design. The paper highlights AlphaChip's role in creating superhuman chip layouts for Google's Tensor Processing Units (TPUs) and its influence on the hardware used globally.
PoM: Efficient Image and Video Generation with the Polynomial Mixer.	Polynomial Mixer offers a faster and more memory-efficient alternative to Multi-Head Attention (MHA) in diffusion models used for image and video generation.
Cross-View Geo-Localization.	Researchers have created a framework to address the challenges of cross-view geo-localization, including variations in viewpoints and large-scale global contexts.
A statistical approach to model evaluations.	When two models are evaluated on a benchmark, declaring one as superior to the other often lacks strong confidence. This research from Anthropic introduces robust statistical methods to reliably determine when one model genuinely outperforms the other.
Software is a team sport.	GitHub Copilot, utilized by over 2.8 million developers, enhances the development experience with AI-powered features such as code completion, debugging, and secure code reviews. Developers can select AI models from providers like OpenAI and Google within Visual Studio Code. Integration with Azure and tools like GitHub Actions streamlines cloud deployments and continuous integration/continuous deployment (CI/CD) processes.
Prompt Injecting Your Way To Shell: OpenAI's Containerized ChatGPT Environment.	This article examines the interactive features of OpenAI's Debian-based sandbox environment for ChatGPT, revealing surprising details about its structure. Users can run Python scripts, manage files, and possibly expose core instructions through prompt engineering. These capabilities have sparked debates around transparency and privacy. Because they are intentional design features, OpenAI does not consider them security vulnerabilities unless they result in breaches of the sandbox environment.

Perspectives

Link	description
AI could cause ‘social ruptures’ between people who disagree on its sentience.	AI could cause ‘social ruptures’ between people who disagree on its sentience
Is this (finally) the end for X? Delicate Musk-Trump relationship and growing rivals spell trouble for the platform.	The former Twitter could fade away, or help shape a dark future hosting voices of a new authoritarian world
‘Have your bot speak to my bot’: can AI productivity apps turbocharge my life?	I tried out organizational software to help streamline my work and build a ‘second brain’. I never knew there were so many different ways to take notes…
Is “AI welfare” the new frontier in ethics?	A few months ago, Anthropic quietly hired its first dedicated "AI welfare" researcher, Kyle Fish, to explore whether future AI models might deserve moral consideration and protection, reports AI newsletter Transformer. While sentience in AI models is an extremely controversial and contentious topic, the hire could signal a shift toward AI companies examining ethical questions about the consciousness and rights of AI systems.
What if AI doesn’t just keep getting better forever?	Recent reports suggest that traditional large language model (LLM) training is encountering diminishing returns, with newer models like OpenAI's Orion showing only modest improvements over predecessors. Experts are concerned about the scarcity of high-quality textual data for LLM training, leading to a shift towards synthetic data and specialized AI models. Future advancements may prioritize enhancing reasoning capabilities and developing task-specific models over general scaling.
AI Makes Tech Debt More Expensive.	AI amplifies the cost of tech debt by widening the velocity gap between low-debt and high-debt codebases.
Where's My Robot Butler?	Advancements in AI and robotics are speeding up the creation of humanoid robots like Atlas, Optimus, and Neo, designed to handle domestic tasks similar to Rosie from "The Jetsons." However, developing cost-effective, safe, and efficient actuators remains a challenge. AI models play a vital role in training these robots for autonomous, complex tasks. Although there has been notable progress, these robots are currently better suited for industrial applications and may only become practical for home use with major breakthroughs.
Google's head of research on whether 'learn to code' is still good advice in the age of AI.	Even though AI can manage some coding tasks, having a fundamental understanding of coding remains essential and opens up new opportunities in various fields, such as healthcare and education.
Why are we using LLMs as calculators?	Researchers are experimenting with LLMs' ability to solve math problems to assess their reasoning capabilities.
GPTs Are Maxed Out.	OpenAI's next-generation model, internally called Orion, is said to fall short of expectations set by Sam Altman, hinting at a possible limit to the scalability of AI model improvements.
Can Google Scholar survive the AI revolution?	The largest scholarly search engine is celebrating its 20th birthday, but AI-driven competitors offer advantages.
Computational technologies of the Human Cell Atlas.	As the international effort reaches a ‘critical mass’ of achievements, Nature highlights seven tools that are poised to enable the next set of discoveries.
Can a fluffy robot really replace a cat or dog? My weird, emotional week with an AI pet.	Casio says Moflin can develop its own personality and build a rapport with its owner – and it doesn’t need food, exercise or a litter tray. But is it essentially comforting or alienating?
The Evolution of the Creator.	Generative AI is transforming the creator economy by reducing production barriers and allowing creators to produce high-quality content effortlessly. Innovations like digital clones are reshaping content distribution and engagement, unlocking new monetization opportunities by scaling interactions and fan transactions. With AI revolutionizing creation, distribution, and monetization, the creator economy is poised to give rise to a new generation of major tech companies.
‘A place of joy’: why scientists are joining the rush to Bluesky.	Researchers say the social-media platform — an alternative to X — offers more control over the content they see and the people they engage with.
Tülu 3: The next era in open post-training.	An open-source, cutting-edge post-training framework offering open data, training code, model weights, and scientific insights. It may be the most comprehensive resource for understanding modern post-training techniques for large language models.
We can all be AI engineers – and we can do it with open-source models.	The barriers to AI engineering are quickly lowering as improved tools and standardized workflows streamline complex processes. Creating AI applications now involves applying basic engineering skills to utilize models, prompts, integrations, testing, and deployment. Open-source models ensure data privacy while existing DevOps tools support the development and management of AI applications.
‘An AI Fukushima is inevitable’: scientists discuss technology’s immense potential and dangers.	Experts are optimistic about energy and drug production breakthroughs but also fear its potential misuse

Back to index

ML news: Week 11 - 17 November

Research

Link	description
Project Sid: Many-agent simulations toward AI civilization.	This work illustrates the behavior and evolution of societies composed of 10-1000+ AI agents. It introduces PIANO, an architecture that allows agents to interact with both humans and other agents in real time. The study reveals that agents can autonomously adopt specialized roles, follow and modify collective rules, and participate in cultural and religious transmissions.
Mixtures of In-Context Learners.	utilizes subsets of demonstrations to train experts through in-context learning; a trainable weighting function is then employed to merge the next-token predictions from these experts based on the training set. This method is compatible with black-box LLMs, as it does not require access to their internal parameters. Key advantages include: 1) being competitive with standard ICL while offering much greater efficiency in terms of data, memory, and computation, and 2) demonstrating robustness to noisy demonstrations and label imbalance.
Attacking Vision-Language Computer Agents via Pop-ups.	demonstrates that incorporating adversarial pop-ups into current agent testing environments results in an attack success rate of 86%, reducing the agents' task success rate by 47%. It also notes that simple defense methods, like instructing the agent to ignore pop-ups, prove ineffective.
Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models.	enhances LLM responses by simulating multiple experts and combining their outputs; it directs an LLM to complete input instructions by simulating several experts and choosing the best response from both individual and aggregated perspectives. This approach sets a new state-of-the-art on TruthfulQA-Generation with ChatGPT, surpassing the previous record of 87.97%. Additionally, it improves performance in terms of factuality and usefulness while reducing toxicity and hurtfulness.
Number Cookbook: Number Understanding of Language Models and How to Improve It.	offers a thorough analysis of the numerical understanding and processing ability (NUPA) of LLMs; reveals that while naive finetuning significantly boosts NUPA on many tasks, it doesn’t work for all. It also finds that methods specifically developed to improve NUPA are ineffective when finetuning pre-trained models. The study examines the application of chain-of-thought techniques to NUPA and notes that these methods encounter scalability issues, limiting their practical use.
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning.	introduces a self-evolving online curriculum RL framework aimed at closing the performance gap between open and proprietary LLM-based web agents. It boosts the success rate of Llama-3.1-8B from 4.8% to 42.4% and GLM4-9B from 6.1% to 43%, with the open models significantly outperforming GPT-4-Turbo (17.6%) and GPT-4o (13.9%). The framework addresses the limited availability of web agent training tasks using a robust outcome-supervised reward model for task success evaluation. An adaptive RL strategy manages distribution drift in online learning, ensuring steady performance improvements.
Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation.	introduces a two-stage fine-tuning method where LLMs first learn from tool-generated solutions and then are trained to decide when to solve problems independently versus using tools. Evaluations on benchmarks in math, climate science, and epidemiology demonstrate significant gains, with a 28% increase in accuracy and a 14% improvement in tool usage precision over top models like GPT-4 and Claude-3.5. This approach enables the LLM to flexibly handle scientific problems of varying complexity.
Google's Flood Forecasting AI to Reach 700 Million People.	Google is expanding riverine flood forecasting coverage to over 100 countries and 700 million people, and enabling partners and researchers to better understand flood forecasting through more data and the development of a new API.
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models.	The Mixture-of-Transformers (MoT) architecture features a sparse multi-modal transformer that separates parameters based on modality (text, images, and speech), allowing for efficient processing while preserving performance. In various evaluations, such as Chameleon 7B and Transfusion settings, MoT matches or outperforms dense baselines, utilizing significantly fewer resources—only 37.2% of the FLOPs for speech processing and 47.2% of the wall-clock time for image generation.
Exploring the Alignment Landscape: LLMs and Geometric Deep Models in Protein Representation.	This study investigates methods to enhance alignment between LLMs and protein-focused geometric deep models, aiming to improve cross-modal understanding.
Can LLMs Follow Threads Through Near-Million-Scale Haystacks?	Large Language Models (LLMs) with extended context windows support a wider range of applications. Recent research on 17 top LLMs shows that although many can manage multiple information threads simultaneously, their practical context limits are often shorter than the stated maximum. While several models demonstrate "thread safety" by handling concurrent threads without a drop in performance, accuracy typically decreases as the context window approaches its upper limit.
Compressing Mesh Data for 3D Generation.	Blocked and Patchified Tokenization (BPT) is a mesh compression method that reduces mesh sequence length by about 75%, making it possible to generate meshes with more than 8k faces.
Successor Feature Matching.	Successor Feature Matching is a new non-adversarial method for inverse reinforcement learning that avoids reward function learning.
Oasis: A Universe in a Transformer.	Oasis is a fully AI-generated, real-time open-world video game model powered by a 500M-parameter foundation model rather than a game engine. It is tailored for Etched's Sohu ASIC to achieve high frame-rate efficiency and uses fast transformer inference to generate gameplay. Despite its promise, issues such as long-context consistency and domain generalization remain.
OpenAI to present plans for U.S. AI strategy and an alliance to compete with China.	OpenAI's AI infrastructure blueprint suggests establishing AI economic zones and collaborating with the U.S. Navy on nuclear energy to promote AI-driven economic growth and innovation. The proposal features a North American AI alliance and initiatives modeled after the National Interstate and Defense Highways Act to address infrastructure demands. It stresses the importance of investing in U.S. data centers and energy projects to stay competitive with China.
Introducing Athene-V2: Advancing Beyond the Limits of Scaling with Targeted Post-training.	Athene V2 is a family of models built upon Qwen 2.5 72B and optimized for agentic and chat-based workflows; the models outperform GPT-4o on several key benchmarks.

News

Link	description
Modal buys Tidbyt.	The elastic scaling GPU company made its first acquisition by purchasing Tidbyt, a hardware firm based in NYC, to gain the in-house expertise of its team specializing in infrastructure and containerization.
OpenAI reportedly developing new strategies to deal with AI improvement slowdown.	OpenAI's forthcoming model, codenamed "Orion," reportedly exhibits only modest improvements over its predecessors, indicating a potential deceleration in AI advancement. To address this, OpenAI has established a foundation team dedicated to enhancing models through alternative approaches, including synthetic data training and post-training adjustments, in response to the diminishing availability of new data.
Near plans to build world’s largest 1.4T parameter open-source AI model.	Near Protocol has announced plans to develop a 1.4 trillion-parameter open-source AI model, aiming to surpass existing models like Meta's Llama. This initiative reflects Near Protocol's commitment to advancing AI capabilities and contributing to the open-source community.
Samsung debuts AI-powered ‘Next-generation Bixby,’ but you can’t use it yet.	Samsung has launched a "next-generation Bixby" with enhanced AI capabilities on the Galaxy W25 and W25 Flip in China.
Even Microsoft Notepad is getting AI text editing now.	Along with adding AI to a text editor that launched in 1983, Microsoft will let Windows Insiders test generative fill-and-erase tools in Paint, too.
Ofcom warns tech firms after chatbots imitate Brianna Ghey and Molly Russell.	After ‘distressing incidents’, watchdog says content from user-made bots would be covered by UK Online Safety Act
AI protein-prediction tool AlphaFold3 is now open source.	The code underlying the Nobel-prize-winning tool for modelling protein structures can now be downloaded by academics.
Qwen 2.5 Coder 32B Instruct is here.	The Qwen 2.5 Coder series consists of language models tailored for coding tasks. The latest 32B parameter model outperforms GPT-4o and is compact enough for local use by many. It also matches Claude Sonnet 3.5 on several benchmarks.
X is testing a free version of AI chatbot Grok.	Social network X has so far limited its AI chatbot Grok (built by Elon Musk’s other company xAI) to its premium, paying users. However, the platform is seemingly preparing to open up the chatbot to free users.
Octoverse: AI leads Python to top language as the number of global developers surges.	In this year’s Octoverse report, we study how public and open source activity on GitHub shows how AI is expanding as the global developer community surges in size.
Google accidentally leaked a preview of its Jarvis AI that can take over computers.	Google's new AI prototype, Jarvis, briefly appeared on the Chrome Web Store.
AI-powered parenting is here and a16z is ready to back it.	Andreessen Horowitz partner Justine Moore introduced a new investment thesis for the firm on X on Thursday, endorsing “a new wave of ‘parenting co-pilots’ built with LLMs and agents.” She pointed to companies like Cradlewise, makers of an AI-powered baby monitor to detect a baby’s sleep pattern and rock the crib, and Nanit, which uses AI to process crib footage to tell if a baby is breathing.
French news titles sue X over allegedly running their content without payment.	Social media site accused of violating a law that requires platforms to pay media when republishing articles
Musk’s influence on Trump could lead to tougher AI standards, says scientist.	Tycoon might help president-elect realize race for artificial general intelligence is a ‘suicide race’, says Max Tegmark
Bluesky adds 700,000 new members as users flee X after the US election.	Social media platform has become a ‘refuge’ from the far-right activism on X, experts say, after Elon Musk teamed up with Donald Trump
Baidu announces its own pair of AI smart glasses.	Baidu, which is often called China's answer to Google, has launched its own pair of AI-powered smart glasses at its annual World Conference event in Shanghai.
OpenAI co-founder Greg Brockman returns after three months of leave.	In the midst of major management departures and controversy over OpenAI's transition to a for-profit business model, co-founder Greg Brockman has returned to the company as president after taking a sabbatical. In its most recent fundraising round, OpenAI was valued at $157 billion. Due to the departure of executives like Lilian Weng, Bob McGrew, and Mira Murati, the company is experiencing internal issues.
European Google rivals partner on search engine infrastructure to counter Big Tech.	Ecosia and Qwant are collaborating to create a European search index, aiming to strengthen their AI capabilities and lessen dependency on U.S. Big Tech. The project takes a "privacy-first" approach and seeks to support AI development by building new search infrastructure. With generative AI becoming increasingly prevalent in search and API costs rising, the partners aim to put alternative search providers in a better position to compete.
Robotic exoskeleton adapts to changes in leg movements in real time.	Wearable robots that assist leg movements could transform the lives of people with reduced mobility — but only if the devices can adapt in real time to support a vast range of human activities. Machine learning provides a way forward.
OpenAI’s take on AI agents could come in January.	OpenAI is reportedly preparing to launch "Operator," an AI agent tool, as early as January. Bloomberg states that Operator may be able to execute tasks directly on a user's computer. It will initially be accessible as a research preview through OpenAI's developer API.
Google's AI Initiative to Boost MENA Economy by $320 Billion.	Google.org has launched the AI Opportunity Initiative, its largest AI investment in the Middle East and North Africa (MENA) region, aiming to develop essential AI skills, fund research, and expand AI access. This initiative is projected to contribute $320 billion to MENA's economy by 2030
Two Trillion Token Common Corpus.	Common Corpus, part of the AI Alliance Open Trusted Data Initiative, has been released. It is the largest fully open multilingual dataset for training LLMs, containing over 2 trillion tokens (2,003,039,184,047) of permissibly licensed content with provenance information.
Lume raises $4.2M Seed Round led by General Catalyst.	Lume automates data mapping with AI, streamlining mapping, cleaning, and validation of data.
Amazon launches under-$20 online storefront to compete with Temu.	Company says Amazon Haul will mostly feature products under $10, which it plans to ship from China warehouse
Francois Chollet leaves Google.	The founder of Keras and the ARC eval, among other contributions, has departed from Google. He will continue to support the JAX and Keras communities while exploring new opportunities.
OpenAI launches ChatGPT desktop integrations, rivaling Copilot.	When OpenAI released desktop app versions of ChatGPT, it was clear the goal was to get more users to bring ChatGPT into their daily workflows. Now, new updates to the macOS and Windows PC versions encourage users to stay in the ChatGPT apps for most of their tasks.
Supermaven joins Cursor.	The team behind the code editing plugin is joining Cursor to further enhance the user experience.
Google’s AI ‘learning companion’ takes chatbot answers a step further.	Google’s Learn About AI tool has more educational, textbook-style responses to guide you through new topics.

Resources

Link	description
FrontierMath.	Epoch AI has introduced FrontierMath, a benchmark comprising expert-level mathematics problems to assess AI's mathematical reasoning capabilities. Notably, leading AI models have solved less than 2% of these problems, highlighting the benchmark's difficulty and the current limitations of AI in advanced mathematical reasoning.
BitNet a4.8: 4-bit Activations for 1-bit LLMs.	A major challenge with 1.58-bit LLMs has been the absence of hardware acceleration support. This research introduces 4-bit activations to leverage the INT4/FP4 kernels available in new hardware, achieving this with no added runtime cost.
LLM2CLIP.	LLM2CLIP combines CLIP's visual and textual alignment with the advanced language understanding of LLMs.
Torch Compatible Muon Optimizer.	Muon is the optimizer that sets the training record for GPT-2. It is a momentum-adapted method similar to SGD. This repository provides an implementation that can be easily used as a replacement for AdamW.
Mochi video model with optimized inference.	Mochi 1, an open-source text-to-video model, initially required eight H100 GPUs for operation. Thanks to community efforts, it can now run on a single 48GB L40 GPU without compromising quality.
A trainable PyTorch reproduction of AlphaFold 3.	Protenix is a functional and trainable reproduction of AlphaFold 3, DeepMind's protein folding project, developed by ByteDance's 'AI for Science' team. This open-source initiative aims to advance protein structure prediction by providing a customizable platform for researchers.
LlamaPReview.	LlamaPReview is an AI assistant for GitHub that provides easy one-click installation and automatically reviews pull requests with context-aware analysis. It supports various programming languages and integrates seamlessly with GitHub Actions, delivering insightful feedback directly on PRs. Offered for free, it improves code quality by detecting issues and recommending optimizations.
SmolLM2.	Hugging Face's SmolLM2 is a compact family of language models, ranging from 135M to 1.7B parameters, trained on 11 trillion tokens. These models are designed to run efficiently on device and support various tasks. The weights are released under the Apache 2 license, and quantized versions, such as the 1.7GB and 138MB models, offer flexibility to meet different computational requirements.
AI for Real-time Fusion Plasma Behavior Prediction and Manipulation.	A novel multimodal machine learning approach improves super-resolution data, enabling better analysis of complex fusion plasma phenomena like Edge Localized Modes (ELM), and supports the stabilization of future fusion reactors.
A Comprehensive Survey of Small Language Models in the Era of Large Language Models.	A review of small language models (SLMs), covering topics such as definitions, applications, improvements, reliability, and related concerns.
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks.	A new generalist multi-agent system capable of managing complex web and file-based tasks, featuring an Orchestrator agent that coordinates four specialized agents: WebSurfer for browser tasks, FileSurfer for file management, Coder for programming, and ComputerTerminal for console operations. Magentic-One performs competitively on various benchmarks, such as GAIA, AssistantBench, and WebArena, without needing any changes to its core architecture.
Personalization of Large Language Models: A Survey.	Offers a comprehensive framework for understanding personalized LLMs, introducing taxonomies for various personalization aspects and consolidating existing research in personalized text generation and downstream applications.
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images.	StdGEN is a novel approach for generating 3D characters from a single image. It breaks down the process into distinct components, such as hair and jackets, enhancing the overall quality of the output.
alphafold3.	DeepMind has open-sourced the code and weights of AlphaFold 3 for academic research, marking a significant advancement in protein structure prediction. This release is expected to accelerate AI applications in scientific research, particularly in molecular biology and drug discovery.
Online-LoRA.	Online-LoRA is a framework developed to mitigate catastrophic forgetting in online continual learning (OCL) by enabling real-time fine-tuning of pre-trained Vision Transformers (ViTs) without the use of rehearsal buffers.
DeepArUco++: Improved detection of square fiducial markers in challenging lighting conditions.	DeepArUco++ presents a deep learning-based method for enhancing fiducial marker detection, especially in difficult lighting conditions where traditional techniques typically struggle.
Hermes 3.	Hermes 3, fine-tuned from Llama 3.1, excels in both reasoning and creativity, showcasing outstanding performance across models with 8B, 70B, and 405B parameters. It introduces new possibilities in AI alignment and artificial consciousness.
ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis.	EfficientNAT (ENAT) is an improved non-autoregressive Transformer model that increases the speed and quality of token-based image synthesis.
UniGAD: Unifying Multi-level Graph Anomaly Detection.	UniGAD is a novel framework for graph anomaly detection (GAD) that simultaneously detects anomalies in nodes, edges, and entire graphs.
Object and Attribute Matching in Images with Token Merging.	Token Merging tackles a prevalent problem in text-to-image models: semantic binding, or the failure to associate objects with their specific attributes.
DataChain.	DataChain is a Pythonic data-frame toolkit for AI that enables efficient processing and structuring of unstructured data into datasets without abstracting away AI models. It integrates with AI tools like PyTorch, TensorFlow, and LLM APIs to support metadata creation, filtering, and vector search. The library also offers built-in vectorized operations on Python object fields, out-of-memory computation, and parallelization.
browser-use.	This open-source web automation tool enables LLMs to interact with websites through a streamlined interface. It is compatible with models such as Claude 3.5 Sonnet and GPT-4o. Key features include XPath extraction, customizable actions, and multi-tab management, which together support data extraction and smooth web navigation. One drawback is message length, which affects LLM speed and the handling of repeated tasks. Further development will focus on robustness and cost reduction.
CUDA Programming Course – High-Performance Computing with GPUs.	A great course from freeCodeCamp on CUDA programming from start to finish.
Masked Token Modeling for Zero-Shot Anything-to-Drums Conversion.	Zero-shot drum style transfer for any input rhythm presents an exciting music application for artists. This is achieved using a masked token modeling objective, which is particularly effective for audio.
HiCoM: Hierarchical Coherent Motion for Streamable Dynamic Scene with 3D Gaussian Splatting.	HiCoM is a cutting-edge framework designed to enhance real-time 3D reconstruction from multi-view streaming videos. It effectively addresses key challenges in storage, training speed, and rendering quality, making it a significant advancement in the field.
Janus.	Janus, DeepSeek's multimodal model, has a new version incorporating rectified flows, similar to Meta Movie Gen, for image generation and understanding. The results are highly impressive.
Link Conversation with Reference Materials.	Problem-Oriented Segmentation & Retrieval (POSR) is a method that breaks conversations into meaningful segments and connects each segment to relevant reference materials, like worksheets or meeting notes.
MureObjectStitch: Multi-reference Image Composition.	Researchers have presented an improved fine-tuning method for generative image composition, which seamlessly merges a specified foreground object with a new background to generate realistic images.
StoryTeller.	StoryTeller is a system created to generate coherent descriptions for long videos, tackling issues like plot consistency and character tracking throughout different scenes.
SAMPart3D: Segment Any Part in 3D Objects.	SAMPart3D, developed by the University of Hong Kong, is a robust method for segmenting 3D objects into semantically meaningful components.
Convolutional Differentiable Logic Gate Networks.	Researchers have developed a method to train image recognition networks that are 29 times smaller and more efficient than traditional convolutional neural networks (CNNs) by making logic gates differentiable. They have also provided efficient CUDA kernels in their paper release
Physics Informed Distillation for Diffusion Models.	Physics Informed Distillation (PID) is a method that employs a student model to simplify and accelerate diffusion models by framing them as solutions to differential equations.
MinerU: high-quality data extraction tool.	MinerU is a robust tool built on StructTable-InternVL2-1B, enabling the extraction of information from PDFs into various machine-readable formats.
Isotonic regression.	A powerful technique for fitting a monotonic function to data. It can also be made differentiable, which makes it useful for a number of applications beyond curve fitting.
Text-to-SQL Query.	XiYan-SQL is an innovative framework aimed at enhancing both the accuracy and diversity of SQL queries produced from natural language input.
X-Portrait 2: Highly Expressive Portrait Animation.	ByteDance's AI group has unveiled X-Portrait 2, an advanced portrait animation technology that transforms static images into highly expressive, realistic videos. Building upon its predecessor, X-Portrait, this new model excels in capturing subtle facial expressions and complex movements, such as pouting, tongue-out gestures, cheek-puffing, and frowning. It achieves high fidelity in emotion preservation, ensuring the generated videos maintain the subject's identity and emotional nuances.
MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views.	The MVSplat360 model offers a new way to create realistic 360° views of real-world scenes, even from just a few sparse images.
Improved Multi-Task Brain Tumour Segmentation with Synthetic Data Augmentation.	This paper presents the leading approach for brain tumor segmentation in the BraTS challenge, demonstrating how synthetic data can improve AI models for medical imaging applications.

Perspectives

Link	description
Embeddings are underrated.	Machine learning embeddings can revolutionize technical writing by enabling mathematical comparisons of any text, and enhancing features like recommendation systems through semantic similarities. By positioning text in a multi-dimensional space, they reveal intuitive semantic relationships, which are valuable for tasks such as finding related content. Documentation site owners who provide embeddings for their content could inspire innovative applications from their communities.
The images of Spain’s floods weren’t created by AI. The trouble is, people think they were.	The rapid growth of ‘AI slop’ – content created by artificial tools – is starting to warp our perception of what is, or could be, real
What Trump’s election win could mean for AI, climate, and health.	Donald Trump made numerous promises during his presidential campaign that could affect scientists and science policy. Will they be implemented once he is president?
The case for targeted regulation.	Advancements in AI are significantly enhancing capabilities in mathematics, coding, and science, presenting both opportunities and risks. Effective regulation is crucial to prevent misuse in areas such as cybersecurity and chemical, biological, radiological, and nuclear (CBRN) threats. Anthropic's Responsible Scaling Policy emphasizes transparency and advocates for a balanced legislative approach that ensures safety while fostering innovation.
AI-powered parenting is here and a16z is ready to back it.	Andreessen Horowitz partner Justine Moore introduced a new investment thesis for the firm on X on Thursday, endorsing “a new wave of ‘parenting co-pilots’ built with LLMs and agents.” She pointed to companies like Cradlewise, makers of an AI-powered baby monitor to detect a baby’s sleep pattern and rock the crib, and Nanit, which uses AI to process crib footage to tell if a baby is breathing.
Speculation on Test Time Compute.	This video discusses o1 models, their capacity for replication, and their potential utility for a range of future tasks.
Can AI review the scientific literature — and figure out what it all means?	Artificial intelligence could help speedily summarize research. But it comes with risks.
Why we are all lab rats in the digital world.	Researchers need to establish robust ethical protocols for online experiments.
Don’t blame search engines for sending users to unreliable sites.	Analysis of billions of pages of results from searches using the Bing algorithm suggests that reliable sites appear in search results 19 to 45 times more often than do sites with low-quality content.
AI-generated images threaten science — here’s how researchers hope to spot them.	Generative-AI technologies can create convincing scientific data with ease — publishers and integrity specialists fear a torrent of faked science.
The quest to build bionic limbs that feel like the real thing.	Through brain implants, neural interfaces and skin grafts, researchers are starting to restore sensation for paralysed or amputated limbs.
How AI is reshaping science and society.	Artificial-intelligence tools such as ChatGPT might soon become fully autonomous by learning to perceive and interact with their environment.
‘It gets more and more confused’: can AI replace translators?	A Dutch publisher has announced that it will use AI to translate some of its books – but those in the industry are worried about the consequences if this becomes the norm
StackBlitz achieves $4M ARR in 4 weeks for their AI web development platform with Claude.	StackBlitz developed an online developer tool that integrates closely with Claude 3.5 Sonnet. This post details how the company achieved $4 million in annual recurring revenue within four weeks.
Why the deep learning boom caught almost everyone by surprise.	Fei-Fei Li's development of the extensive ImageNet dataset played a crucial role in the revival of neural networks. It supplied the training data essential for landmark models such as AlexNet. Using GPUs and the backpropagation method popularized by Geoffrey Hinton, AlexNet showcased the potential of deep learning on large datasets, igniting the current AI revolution. This key event highlighted the significance of integrating neural networks, big data, and GPU computing to drive AI advancements.
Just Have AI Build an App for That.	AI agents are increasingly being used to quickly create functional apps for tasks like resizing SVGs.
AI isn’t about unleashing our imaginations, it’s about outsourcing them. The real purpose is profit.	Artificial intelligence doesn’t just incrementally erode the rights of authors and other creators. These technologies are designed to replace creative workers altogether
Companies building AI-powered tech are using your posts. Here’s how to opt-out.	Even if you haven’t knowingly opted in, companies are still scraping your personal information to train their systems

Back to index

ML news: Week 3 - 10 November

Research

Link	description
The Geometry of Concepts: Sparse Autoencoder Feature Structure.	This study investigates the geometric structure of concept representations in sparse autoencoders (SAEs) across three scales: (1) atomic-level parallelogram patterns among related concepts (e.g., man:woman::king:queen), (2) brain-like functional "lobes" dedicated to different knowledge types such as math or code, and (3) galaxy-level eigenvalue distributions, revealing a specialized structure within the middle layers of the model.
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics.	This approach employs causal analysis to identify neurons that reveal an LLM's behavior when performing basic arithmetic logic. It discovers and theorizes that a combination of heuristic neurons serves as the mechanism for generating accurate arithmetic answers, with the unordered blend of various heuristic types accounting for most of the model's accuracy on arithmetic prompts.
Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA.	The Relaxed Recursive Transformer introduces a novel method for reducing LLM size by sharing parameters across layers without sacrificing performance. Initialized from standard pre-trained Transformers, it employs a single block of unique layers repeated multiple times in a loop, adding flexibility through depth-wise low-rank adaptation (LoRA) modules. This approach demonstrates the potential for significant (2-3×) improvements in inference throughput.
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective.	This project examines how varying "thinking" styles—fast (concise) versus slow (detailed, such as chain-of-thought reasoning)—affect layer-wise gradients and stability in LLMs.
B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable.	"B-cosification" is a technique that adjusts existing pre-trained models to provide highly interpretable explanations of their predictions.
Learning Graph Quantized Tokenizers for Transformers.	GQT (Graph Quantized Tokenizer) is a novel tokenizer for graph data in geometric deep learning.
V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization.	Vision-guided Direct Preference Optimization (V-DPO) tackles hallucination problems in large vision-language models (LVLMs), where text responses may diverge from visual input due to an excessive focus on language.
Adam Alternative for Deep Learning Optimization.	ADOPT is an adaptive gradient optimizer designed to resolve the non-convergence problems of Adam, without depending on restrictive assumptions regarding gradient noise.
A faster, better way to train general-purpose robots.	Inspired by large language models, researchers develop a training technique that pools diverse data to teach robots new skills.
Vision Language Models are In-Context Value Learners.	Vision Language Models (VLMs) are capable of learning skills through the use of prompts.

News

Link	description
Elon Musk’s ‘election integrity community’ on X is full of baseless claims.	Feed is rife with posts of individuals deemed suspicious and calls for doxxing with little evidence provided of fault
Microsoft sails as AI boom fuels double-digit growth in the cloud business.	Revenue from Azure cloud business increased by 22% as company focuses attention on artificial intelligence
Apple reports robust demand for iPhone 16 even as overall sales in China slow.	Company reports $94.9bn in revenue, slightly beating Wall Street projections in first look at demand for its new phone
Distinguishing Ignorance from Error in LLM Hallucinations.	This study distinguishes between two kinds of LLM hallucination: cases where the model lacks the relevant knowledge (ignorance) and cases where it answers incorrectly despite having that knowledge (error).
Investigating the Role of Prompting and External Tools in Hallucination Rates of Large Language Models.	This study evaluates various prompting strategies and frameworks to minimize hallucinations in LLMs, finding that simpler prompting techniques outperform more complex approaches. It also reports that LLM agents show higher hallucination rates due to the increased complexity involved in using tools.
Introducing the First AMD 1B Language Models: AMD OLMo.	AMD utilized the OLMo codebase to train and release a language model on its accelerators. The OLMo (Open Language Model) project, developed by the Allen Institute for AI (AI2), provides an open-source framework for training and using state-of-the-art language models.
OpenAI will start using AMD chips and could make its own AI hardware in 2026.	Reuters reports an updated hardware strategy to run ChatGPT and OpenAI’s other projects, which involves using AMD chips via Microsoft Azure in addition to Nvidia.
Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?	This study addresses a specific challenge in LLMs: evaluating how effectively they process reasoning prompts that include irrelevant or incorrect rationale snippets.
What is Wrong with Perplexity for Long-context Language Modeling?	This study uncovers a significant limitation of using perplexity (PPL) to assess LLMs' long-context abilities, as PPL averages across all tokens, overlooking critical ones necessary for interpreting extended inputs. To address this, the authors propose LongPPL, a metric that emphasizes these essential tokens, providing a more accurate measure of long-context performance.
Google’s AI search summaries are rolling out to over 100 more countries.	Google’s AI Overviews are expanding across more than 100 countries this week. The AI-generated search summaries will appear for users in Canada, Australia, New Zealand, South Africa, Colombia, Chile, the Philippines, Nigeria, and many more locations.
Elon Musk finally admits Tesla’s HW3 might not support full self-driving.	Musk also conceded that he doesn’t actually know what it will take. Millions of Tesla vehicles are equipped with HW3 computers.
NVIDIA Ethernet Networking Accelerates World’s Largest AI Supercomputer, Built by xAI.	xAI's Colossus is powered by NVIDIA's Spectrum-X Ethernet networking platform.
French parents whose children took own lives sue TikTok over harmful content.	Lawsuit alleges TikTok’s algorithm exposed teenagers to videos promoting suicide, self-harm and eating disorders
Claude 3.5 Haiku now available.	Claude 3.5 Haiku is slightly inferior to GPT-4o and lacks vision capabilities, but it remains highly intelligent and is cost-effective compared to other models of similar quality.
7 AI news that Google announced in October.	This article summarizes seven AI updates from October, including Google Maps' largest AI enhancement, guidance on using NotebookLM, and additional methods for asking questions, searching for information, and accessing an AI Overview.
Sapien Raises $8.7M Seed Led by General Catalyst.	Sapien is advancing AI-driven financial analysis tools that convert intricate, error-prone tasks into swift insights, revolutionizing the role of Chief Financial Officers (CFOs). The platform consolidates data from diverse sources to deliver dynamic, context-aware analyses, aiming to eradicate human errors in financial processes. Recently, Sapien secured $8.7 million in funding, with plans to expand and enhance its AI capabilities to empower finance teams across various industries
ElevenLabs has hired the team behind Omnivore, a reader app.	Generative AI company ElevenLabs has hired the team behind Omnivore, an open source read-it-later app.
LinkedIn launches its first AI agent to take on the role of job recruiters.	Hiring Assistant is a new product designed to take on a wide array of recruitment tasks, from ingesting scrappy notes and thoughts to turn into longer job descriptions to sourcing candidates and engaging with them.
Anthropic’s Claude AI chatbot now has a desktop app.	Claude, the AI chatbot made by Anthropic, now has a desktop app. You can download the Mac and Windows versions of the app from Anthropic’s website for free.
Meta is making a robot hand that can ‘feel’ touch.	Meta says it’s partnering with sensor firm GelSight and Wonik Robotics, a South Korean robotics company, to commercialize tactile sensors for AI.
Elon Musk sued over $1m-a-day election giveaway.	Complaint alleges Musk’s America Pac deceived voters by falsely claiming prize winners would be chosen at random
AI chatbot launches on Gov.UK to help business users – with mixed results.	Initial test run of GPT-4o technology can help with regulations but ‘cannot provide predictions or opinions’
OpenAI’s o1 model leaked on Friday and it is wild — here’s what happened.	OpenAI's o1 model demonstrates notable advancements in reasoning and accuracy compared to GPT-4, featuring image analysis and web tool capabilities. The complete version is expected to significantly enhance AI and multimedia processing, with an official release anticipated shortly after the U.S. Presidential election.
Meta’s former hardware lead for Orion is joining OpenAI.	The former head of Meta’s augmented reality glasses efforts announced on Monday she is joining OpenAI to lead robotics and consumer hardware, according to a post on LinkedIn.
Waymo explores using Google’s Gemini to train its robotaxis.	The company used Gemini to build its own ‘End-to-End Multimodal Model for Autonomous Driving.’
More than a quarter of new code at Google is generated by AI.	AI is hugely important to Google’s products, and it sounds like the company relies on it internally, too.
Meta is using more than 100,000 Nvidia H100 AI GPUs to train Llama-4 — Mark Zuckerberg says that Llama 4 is being trained on a cluster “bigger than anything that I’ve seen”.	Llama 4 slated to have new modalities, stronger reasoning, and faster performance
Wonder Dynamics now lets you go straight from multi-camera video to fully animated 3D scene.	Wonder Dynamics launched a tool that automates converting videos into fully editable 3D scenes.
Facebook asks US Supreme Court to dismiss fraud suit over Cambridge Analytica scandal.	Securities fraud lawsuit brought by shareholders accuses the social media platform of misleading them about misuse of user data
Anthropic hikes the price of its Haiku model.	Anthropic's latest AI model, Claude 3.5 Haiku, delivers better performance than Claude 3 Opus but comes at a much higher price than the previous Haiku model. While it doesn’t support image analysis, it excels in tasks like coding, data extraction, and content moderation. The price hike prompts concerns about Anthropic's future pricing approach.
OpenAI acquired Chat.com.	OpenAI bought Chat.com, adding to its collection of high-profile domain names. As of this morning, Chat.com now redirects to OpenAI’s AI-powered chatbot, ChatGPT. An OpenAI spokesperson confirmed the acquisition via email.
Pushing the frontiers of audio generation.	A Google DeepMind post on its speech generation research, including the technology behind the multi-speaker dialogue in NotebookLM Audio Overviews.
Octoverse: AI leads Python to top language as the number of global developers surges.	AI project engagement has surged rapidly due to a rise in data science and machine learning activities. Python has now become more popular than JavaScript. The developer community is experiencing global growth, particularly in Africa, Latin America, and Asia, driven by tools like GitHub Copilot. There is also a growing trend toward creating smaller, more efficient AI models. Additionally, generative AI projects have almost doubled worldwide.
Nvidia to join Dow Jones Industrial Average, replacing rival chipmaker Intel.	Nvidia is replacing Intel in the Dow Jones Industrial Average, a shakeup that reflects a massive change in the semiconductor industry. Nvidia shares have gained more than 170% this year, while Intel has lost over half its value.
Google's 'Big Sleep' AI Project Uncovers Real Software Vulnerabilities.	The company's experimental AI agent finds a previously unknown and exploitable software bug in SQLite, an open-source database engine.
Amazon will now use AI to recap what you're watching.	Amazon's X-Ray Recaps is an AI-driven feature on Prime Video that generates personalized summaries for TV shows. It utilizes generative AI to create concise recaps of entire seasons, individual episodes, or specific segments, enhancing the viewing experience by helping users recall previous content without revealing spoilers. Currently in beta, X-Ray Recaps is available on Fire TV devices, with plans to expand to additional devices by the end of the year.
Google is opening an AI hub in oil-rich Saudi Arabia.	The new AI hub will support research into Arab language AI models and “Saudi-specific AI applications,” according to an announcement from the Saudi Public Investment Fund and Google.
First artwork painted by humanoid robot to sell at auction fetches $1m.	Portrait of English mathematician Alan Turing was created by Ai-Da, one of the most advanced robots in the world
Mistral launches a moderation API.	AI startup Mistral has launched a new API for content moderation.
Anthropic and Palantir Partner to Bring Claude AI Models to AWS for U.S. Government Intelligence and Defense Operations.	Palantir and Anthropic have collaborated to make the Claude suite of models available on AWS for U.S. intelligence agencies and defense operations.
ChatGPT Can Now Control a Robot Arm.	Researchers from UC Berkeley and ETH Zurich utilized GPT-4 to train cost-effective robot arms for cleaning spills. They accomplished this with a multimodal agent built with LangChain, which translates LLM outputs into robotic actions. This research demonstrates a novel proof-of-concept for human-robot interaction and democratizes robotics using open-source technology.
OpenAI in talks with regulators to become a for-profit company: Report.	The $157 billion artificial intelligence giant wants to retain a nonprofit arm to pursue its mission of benevolent AI development.

Resources

Link	description
AFlow: Automating Agentic Workflow Generation.	A novel framework for automating agentic workflow generation, AFlow, reframes workflow optimization as a search problem over code-based workflows, where nodes invoking LLMs are linked by edges. It efficiently navigates the search space using a modified MCTS, refining workflows through code adjustments, tree-structured experience, and execution feedback. Tests on six benchmark datasets show AFlow’s effectiveness, with a 5.7% improvement over manual methods and a 19.5% boost over other automated approaches. AFlow also allows smaller models to outperform GPT-4 on specific tasks, requiring only 4.55% of its inference cost.
O1 Replication Journey: A Strategic Progress Report -- Part 1.	This report describes efforts to replicate the capabilities of OpenAI's o1 model, introducing a journey learning technique that promotes a comprehensive exploration process rather than shortcut-based learning. This approach includes trial and error, reflection, and backtracking. With just 327 training samples, the journey learning technique outperformed shortcut learning by 8.0% on the MATH dataset.
Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications.	This work offers insights on effectively integrating multimodal models into Retrieval-Augmented Generation (RAG) systems for the industrial sector. It also delves into evaluating these systems, utilizing LLM-as-a-Judge for comprehensive assessment.
You won't believe this.	Researchers are trying to “inoculate” people against misinformation by giving them small doses ahead of time
3D Scene Reconstruction Without Camera Pose.	NoPoSplat is a feed-forward model capable of reconstructing 3D scenes from sparse, multi-view images without requiring precise camera poses.
ImOV3D: Learning Open Vocabulary Point Clouds 3D Object Detection from Only 2D Images.	ImOV3D is a framework that enhances open-vocabulary 3D object detection (OV-3Det) by utilizing 2D images to address the limited availability of 3D annotations.
Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning.	DEMO is a framework that divides text and conditioning into content and motion elements. By employing separate encoders and conditioning for static content and dynamic motion, DEMO improves its ability to interpret and generate motion based on text prompts.
Project Sid.	Project Sid demonstrates civilizational progress, specialization, governance, and the creation and dissemination of memes and religion. These developments are enabled by Altera's innovative cognitive architecture, PIANO.
Using Reinforcement Learning and $4.80 of GPU Time to Find the Best HN Post Ever.	This article explores the use of reinforcement learning from human feedback (RLHF) to create a reward model that predicts upvote counts for Hacker News stories. Using a rich dataset and only $4.80 of GPU time, the model was trained on attributes like titles, authors, and content to prioritize post quality. The goal is to apply RLHF to foster the generation of high-value content. While not flawless, the model effectively identifies overlooked stories and can anticipate potential front-page hits.
Models for PII detection.	A set of GLiNER models for PII detection, released together with a synthetic dataset.
Randomized Autoregressive Visual Generation.	This study presents Randomized auto-regressive (RAR) modeling for image generation, achieving state-of-the-art performance on the ImageNet-256 benchmark with an impressive FID score of 1.48.
hertz-dev-open source speech-to-speech.	An exceptionally impressive open release with a permissive license, this model was trained to generate human speech from various input modalities. The code is of high quality and includes intriguing details about the encoder and decoder architectures.
DiffeRT.	This project introduces an innovative Machine Learning-assisted Ray Tracing method for radio propagation modeling, aimed at reducing the high computational demands of conventional approaches.
How I write code using Cursor: A review.	Cursor, a VS Code fork, incorporates LLM-powered features like tab completion and chat interfaces to simplify coding by automating boilerplate and repetitive changes. Although tab completion is quick and efficient, it occasionally provides incorrect suggestions. The tool promotes new workflow patterns, minimizing dependency on libraries for boilerplate and enabling faster iteration in unfamiliar languages or frameworks.
MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D.	MVPaint addresses the challenges of texture and UV generation for 3D assets by synchronizing these processes, resulting in high-quality, multi-view consistent textures
OS-ATLAS: A Foundation Action Model For Generalist GUI Agents.	OS-ATLAS is a foundation action model designed for generalist GUI agents.
PPLLaVA: Short and Long Video Understanding.	PPLLaVA is a novel model designed to effectively comprehend both short and long videos, addressing a significant challenge in video-based AI. It employs a unique pooling strategy that compresses visual tokens and aggregates features based on user instructions, enhancing its ability to process varied video lengths. This approach enables PPLLaVA to achieve state-of-the-art performance across various video benchmarks, excelling in tasks from caption generation to multiple-choice questions
Hunyuan3D-1.	Hunyuan3D-1.0 is an advanced generative 3D model with robust multi-view synthesis capabilities. While its outputs may not yet be production-ready, they provide a valuable foundation for artists aiming to create assets.
AndroidLab.	Benchmark for autonomous agents on the Android mobile operating system.
Constrained Human-AI Cooperation: An Inclusive Embodied Social Intelligence Challenge.	Constrained Human-AI Cooperation (CHAIC) is a challenge aimed at evaluating the ability of AI agents to work effectively with humans who have physical constraints.
A Scalable Communication Protocol for Networks of Large Language Models.	Agora is a straightforward, cross-platform protocol designed for efficient communication between LLM agents, allowing diverse agents to interact at a significantly reduced cost. It seamlessly integrates with existing multiagent frameworks like Camel AI, LangChain, and Swarm.
Classification Done Right for Vision-Language Pre-Training.	SuperClass is a simple classification model for vision-language tasks that bypasses the need for a text encoder, unlike contrastive models such as CLIP. It eliminates the need for complex text filtering and large batch sizes by using tokenized raw text directly as classification labels.
Enhancing RAG with HTML Data.	HtmlRAG is an innovative approach that enhances retrieval-augmented generation (RAG) by preserving the HTML structure of retrieved web content rather than simplifying it to plain text.
LiVOS: Light Video Object Segmentation with Gated Linear Matching.	LiVOS is a lightweight video object segmentation (VOS) model designed to lower memory usage, making it possible to segment long, high-resolution videos with reduced hardware requirements.
How To Create Software Diagrams With ChatGPT and Claude.	The article discusses how developers can leverage ChatGPT and Claude to generate software architecture diagrams. It emphasizes the iterative process of refining diagrams with the help of multimodal AI and tools such as Mermaid and Whimsical. The author showcases the advantages of using LLMs for diagramming, illustrating how they handle images and offer real-time feedback.
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks.	Microsoft has introduced Magentic-One, a multi-agent system built on its open-source AutoGen framework. The system uses GPT-4o as the backend model and orchestrates multiple AI agents to perform complex tasks.
Cosmos Tokenizer: A suite of image and video neural tokenizers.	NVIDIA has introduced the Cosmos Tokenizer, a state-of-the-art image and video tokenizer and compression model. This model is designed to facilitate the training of video generation systems, visual language models (VLMs), and other multimodal models. NVIDIA has made available the inference code, a research paper detailing the model, and the associated model weights.
SA3DIP: Segment Any 3D Instance with Potential 3D Priors.	SA3DIP is a novel method for enhancing 3D instance segmentation by integrating additional 3D priors beyond standard 2D models. This approach addresses the limitations of relying solely on 2D segmentation models, which often struggle with complex 3D structures. By incorporating 3D priors, SA3DIP achieves more accurate and robust segmentation in three-dimensional spaces.
gsplat.	gsplat is a robust package and studio designed for conducting research on Gaussian splatting.
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models.	RaVL is a novel approach that enhances the accuracy of vision-language models by concentrating on local image features instead of the whole image, aiming to reduce misleading correlations.
Structure Consistent Gaussian Splatting with Matching Prior for Few-shot Novel View Synthesis.	SCGaussian is an innovative method for 3D scene synthesis that preserves structural consistency, even when working with sparse input data.

Perspectives

Link	description
The chatbot optimization game: can we trust AI web searches?	Google and its rivals are increasingly employing AI-generated summaries, but research indicates their results are far from authoritative and open to manipulation
Addicted to love: how dating apps ‘exploit’ their users.	Online services that promise to find people romantic matches have been likened to gambling products designed to keep customers hooked
Concerned about your data use? Here is the carbon footprint of an average day of emails, WhatsApps and more.	Vast datacentres are being built worldwide, amid growing concerns about the environmental costs. So should we all be considering a data diet – if not complete digital sobriety?
A field’s dilemmas.	Misinformation research has exploded. But scientists are still grappling with fundamental challenges
We're forking Flutter. This is why.	Google's strategic shift towards AI has led to a deprioritization of Flutter's desktop platforms, resulting in a labor shortage for this previously fast-growing UI toolkit. In response, a fork named Flock is being developed to incorporate essential bug fixes and features that the Flutter team is unable to address, aiming to accelerate Flutter's growth through community involvement. Flock plans to enhance contribution processes and streamline PR reviews, bridging the gap in support and development pace left by the main Flutter team.
Devious humor and painful puns: will the cryptic crossword remain the last thing AI can’t conquer?	When human solvers battle artificial intelligence, who is able to think more cryptically, faster? And are some devious clues just too tough for software?
Meta’s AI Abundance.	Meta is strategically poised to leverage generative AI, particularly in digital advertising. The company's investments in AI, including its Llama models, support innovative advertising strategies like generative ads and AI-driven chat agents. These advancements aim to enhance ad targeting and efficiency, potentially boosting demand and revenue. Meta's focus on integrating AI across its platforms underscores its commitment to maintaining a competitive edge in the rapidly evolving AI landscape.
The AI Services Wave: Lessons from Palantir in The New Age of AI.	Artificial intelligence (AI) is transforming service industries by enhancing scalability and efficiency. Companies like Palantir are at the forefront, integrating AI into operations to streamline processes. Startups are also leveraging AI to automate complex tasks, creating significant value and reshaping business models. The emphasis is on developing AI-driven "tech services" that blend software capabilities with human expertise, leading to improved outcomes and increased market competitiveness.
X reaches its final form: Elon Musk has bent it to his will.	The evolution of Musk’s X network is complete; why Reddit is profitable; and niche Halloween costumes
AI for Startups.	Microsoft and a16z are advocating for collaboration between large and small tech companies to promote AI innovation and competition. They support open-source AI and have proposed policies to assist startups and level the playing field in the AI economy. Their joint focus is on creating a robust, competitive ecosystem that leverages AI to drive economic growth and innovation.
How The New York Times is using generative AI as a reporting tool.	New York Times reporters utilized AI tools, specifically LLMs, to transcribe and analyze over 400 hours of audio for an investigation. Automated transcription greatly accelerated the work, with LLMs accurately identifying key themes and topics. Human reporters ensured proper interpretation and contextual understanding, highlighting the significance of human-AI collaboration.
Writing as a Way of Thinking.	The article explores AI's influence on writing and thinking, challenging the idea that writing is the sole form of thinking. Tools like ChatGPT can enhance thinking through dialogue. Rather than replacing thought processes, AI can augment them. It will transform writing by automating routine tasks, freeing up space for more creative and thought-provoking content.
ChatGPT is transforming peer review — how can we use it responsibly?	At major computer science publication venues, up to 17% of the peer reviews are now written by artificial intelligence. We need guidelines before things get out of hand.
Will AI’s huge energy demands spur a nuclear renaissance?	Contracts with Google and Amazon could help, but bringing new types of reactors online will take larger investments and time.
Five protein-design questions that still challenge AI.	Tools such as Rosetta and AlphaFold have redefined the protein-engineering landscape. But some problems remain out of reach — for now.
AI may displace 3m jobs but long-term losses ‘relatively modest’, says Tony Blair’s thinktank.	Rise in unemployment in low hundreds of thousands as technology creates roles, Tony Blair Institute suggests
The Rise of the Agentic Web.	The Agentic Web is advancing the capabilities of AI agents with on-chain features, enabling their creation, ownership, and transactional abilities. Platforms like Replit, VIRTUALS.io, and Wayfinder are integrating AI with blockchain, facilitating activities such as asset management, data retrieval, and decentralized applications. This shift supports AI-driven automation for payments, trading, and decentralized finance within blockchain ecosystems.
The Present Future: AI's Impact Long Before Superintelligence.	Stronger AI models are on the verge of surpassing human intelligence, driving transformative changes in work and society. Current AI systems, such as Claude, are already reshaping industries by automating tasks, offering safety monitoring, and enabling interactions through multimodal inputs and outputs. Organizations must carefully address ethical concerns to ensure AI complements and enhances human abilities, rather than replacing them.

Back to index

ML news: Week 28 October - 3 November

Research

Link	description
A Theoretical Understanding of Chain-of-Thought.	reveals that incorporating both correct and incorrect reasoning paths in demonstrations enhances the accuracy of intermediate steps and Chain-of-Thought (CoT) processes. The new approach, Coherent CoT, substantially boosts performance across multiple benchmarks. Specifically, Gemini Pro shows a 6.60% improvement on the Tracking Shuffled Objects dataset (rising from 58.20% to 64.80%), while DeepSeek 67B achieves a 6.17% increase on the Penguins in a Table dataset (from 73.97% to 80.14%).
LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering.	improves RAG's comprehension of long-context knowledge, incorporating global insights and factual specifics. It features a hybrid retriever, an LLM-enhanced information extractor, a Chain-of-Thought (CoT) guided filter, and an LLM-augmented generator. These core components empower the RAG system to extract global long-context information and accurately capture factual details. LongRAG demonstrates superior performance, surpassing long-context LLMs by 6.94%, advanced RAG by 6.16%, and Vanilla RAG by 17.25%.
Evaluating feature steering: A case study in mitigating social biases.	examines feature steering in LLMs through an experiment that adjusts various features to observe shifts in model outputs, specifically focusing on 29 features related to social biases to determine if feature steering can reduce these biases. Findings reveal that while feature steering can sometimes cause unintended effects, incorporating a neutrality feature effectively reduces social biases across 9 social dimensions without compromising text quality.
Large Language Models Reflect the Ideology of their Creators.	reveals that LLMs display varied ideological perspectives, often mirroring the worldview of their creators. It observes consistent normative differences in responses when the same LLM operates in Chinese versus English and highlights normative disagreements between Western and non-Western LLMs regarding prominent figures in geopolitical conflicts.
Scalable watermarking for identifying large language model outputs.	introduces SynthID-Text, a text-watermarking approach designed to maintain text quality in LLM outputs, achieve high detection accuracy, and reduce latency. It incorporates watermarking through speculative sampling, using a final score pattern for model word choices alongside adjusted probability scores. The authors evaluate the method's feasibility and scalability by analyzing feedback on nearly 10 million Gemini responses.
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model.	finds that o1 outperformed other test-time compute methods across most datasets. The authors note that the primary reasoning patterns in o1 are divide and conquer and self-refinement, with the model adapting its reasoning strategy to specific tasks. For commonsense reasoning, o1 frequently employs context identification and focuses on constraints, while for math and coding tasks, it predominantly utilizes method reuse and divide-and-conquer approaches.
Sparse Crosscoders for Cross-Layer Features and Model Diffing.	Crosscoders are an advanced form of sparse autoencoders designed to enhance the understanding of language models' internal mechanisms.
Distill Visual Chart Reasoning Ability from LLMs to MLLMs.	Code-as-Intermediary Translation (CIT) is a technique aimed at improving visual reasoning in multimodal language models (MLLMs) by leveraging code to convert chart visuals into textual descriptions.
Probabilistic Language-Image Pre-Training.	Probabilistic Language-Image Pre-training (ProLIP) is a vision-language model (VLM) designed to learn probabilistically from image-text pairs. Unlike traditional models that rely on strict one-to-one correspondence, ProLIP captures the complex many-to-many relationships inherent in real-world data.
A faster, better way to train general-purpose robots.	MIT researchers have developed Heterogeneous Pretrained Transformers (HPT), a novel model architecture inspired by large language models, designed to train adaptable robots by utilizing data from multiple domains and modalities.
A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs.	In this work, DeepMind demonstrates how a small language model can be used to provide soft supervision labels and identify informative or challenging data points for pretraining, significantly accelerating the pretraining process.
NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction.	The NeuroClips framework introduces advancements in reconstructing continuous videos from fMRI brain scans by decoding both high-level semantic information and fine-grained perceptual details.
Machine-guided design of cell-type-targeting cis-regulatory elements.	A generalizable framework to prospectively engineer cis-regulatory elements from massively parallel reporter assay models can be used to write fit-for-purpose regulatory code.

News

Link	description
Keir Starmer says media firms should have control of output used in AI.	PM says content creators must be paid and vows to ensure technology ‘does not begin to chip away’ at press freedoms
Waymo raises $5.6B.	Waymo's driverless taxi service has gained significant popularity. The company has secured additional funding to extend its reach beyond the current cities and millions of miles it already covers.
Meta Introduces Spirit LM open source model that combines text and speech inputs/outputs.	Just in time for Halloween 2024, Meta has unveiled Meta Spirit LM, the company’s first open-source multimodal language model capable of seamlessly integrating text and speech inputs and outputs.
IBM debuts open source Granite 3.0 LLMs for enterprise AI.	IBM is enhancing its enterprise AI suite with Granite 3.0 LLMs, prioritizing open-source options and optimized performance. Available across various platforms, these models have built-in safety features and are customized for diverse enterprise applications. IBM highlights the significance of true open-source licensing with Apache 2.0, enabling flexible adoption and fostering enterprise-driven innovation.
Microsoft introduces ‘AI employees’ that can handle client queries.	US company gives customers the ability to build own virtual agents as well as releasing 10 off-the-shelf bots
Microsoft Excel’s bloopers reel: 40 years of spreadsheet errors.	As the software used by millions around the world celebrates its birthday, here are some of the low points
Google Expands Voice Technology Support to 15 More African Languages.	Google has expanded voice recognition support to include 15 more African languages across its platforms, such as Voice Search, Gboard talk-to-type, and Translate dictation. This enhancement enables an estimated 300 million additional Africans to engage with digital content in their native languages.
Cohere releases state-of-the-art multimodal AI search model.	Cohere has unveiled that its Embed 3 AI model is now multimodal, allowing for rapid and precise search across essential enterprise image data sources such as graphs, charts, product catalogs, and design files. This enhancement makes Embed 3 the most broadly capable multimodal embedding model available today.
Bringing developer choice to Copilot with Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and OpenAI’s o1-preview.	You can now access models like Claude, Gemini, and o1, among others, through GitHub Copilot.
Apple releases first batch of Apple Intelligence features, debuts new iMac.	Apple introduced new AI features, branded as Apple Intelligence, on its latest devices, focusing on text processing and photo editing capabilities. The updated iMac now runs on the M4 chip, which includes a Neural Engine that delivers three times the AI performance of previous models. Upcoming AI updates aim to improve Siri's capabilities and incorporate ChatGPT to handle more advanced queries.
How Advex creates synthetic data to improve machine vision for manufacturers.	Advex AI addresses data shortages in AI training by leveraging generative AI to create synthetic images tailored for computer vision systems.
Coframe raises $9 million for websites that optimize themselves using AI.	AI startup Coframe has raised $9.3 million in seed funding to further develop its platform, which leverages generative AI to optimize websites and deliver personalized marketing experiences.
Google unveils invisible ‘watermark’ for AI-generated text.	Real-world demonstration in chatbot responses could encourage other firms to label material produced by AI.
Reddit shares soar after company turns first-ever profit.	Monthly users rose by nearly half thanks to the AI translation feature, and deals for AI training with Google and OpenAI boosted revenue
Google parent Alphabet sees double-digit growth as AI bets boost cloud business.	Analysts expected 12% year-on-year revenue gains, but company reports 15%, buoyed by performance in ads and cloud services
EU events on curbing big tech ‘distorted’ by attendees with industry links.	Campaigners say 21% of people at workshops did not disclose on their applications relationships with firms being discussed
Indonesia blocks Apple iPhone 16 sales over lack of investment.	Marketing and sale of model prohibited after tech giant fails to meet rule that 40% of phones be made from local parts
25% of Smartphone Owners Don't Want AI as Apple Intelligence Debuts.	What's a bigger priority? Longer battery life, according to a new CNET survey.
Google preps ‘Jarvis’ AI agent that works in Chrome.	Google's Project Jarvis, powered by Gemini 2.0, aims to automate web-based tasks in Chrome by using AI agents capable of reasoning and planning.
OpenAI’s Whisper transcription tool has hallucination issues, researchers say.	OpenAI's Whisper, an AI transcription tool, has been found to produce hallucinations (fabricated text not present in the original audio), even in medical settings. Despite OpenAI's advisories against using Whisper in high-risk domains, over 30,000 medical professionals across 40 health systems have adopted it for transcribing patient consultations.
Forerunner K2 humanoid robot can carry 33 lb in each dexterous hand.	Kepler has introduced the Forerunner K2, a humanoid robot featuring advanced AI, upgraded hardware, and enhanced vision and navigation systems for improved real-time interaction.
Introducing ChatGPT search.	ChatGPT now offers an improved web search capability, providing quick, current answers with links to relevant sources—answers you'd typically seek through a search engine. This feature combines the ease of a natural language interface with access to real-time information, such as sports scores, news, stock prices, and more.
Advancing embodied AI through progress in touch perception, dexterity, and human-robot interaction.	This work features several components, including vision-based tactile sensing, innovative hardware touch sensors, and noteworthy strategic partnerships within robotics.
Elon Musk’s xAI adds image understanding capabilities to Grok.	This means that paid users on his social platform X, who have access to the AI chatbot, can upload an image and ask the AI questions about it.
OpenAI CFO Says 75% of Its Revenue Comes From Paying Consumers.	OpenAI generates the vast majority of its revenue from consumers who pay for its products, Chief Financial Officer Sarah Friar said, even as the artificial intelligence startup competes in a crowded market to sign up more corporate customers.
Hello Patient.	Hello Patient has emerged from stealth mode, securing a $6.3 million seed funding round led by 8VC. The company, founded by Alex Cohen, is based in Austin, Texas.
Google plans to announce its next Gemini model soon.	December is shaping up to be a month of dueling announcements from OpenAI and Google.
Meta is reportedly developing a search engine for its chatbot.	The company wants to decrease Meta AI’s reliance on Google and Microsoft.
A mysterious new image generation model has appeared.	A mysterious new image generation model is beating models from Midjourney, Black Forest Labs, and OpenAI on the crowdsourced Artificial Analysis benchmark. The model, which goes by the name “red_panda,” is around 40 Elo points ahead of the next-best-ranking model, Black Forest Labs’ Flux1.1 Pro, on Artificial Analysis’ text-to-image leaderboard.

Resources

Link	description
Agentic Information Retrieval.	offers an overview of agentic information retrieval, driven by the abilities of LLM agents; explores various advanced applications of agentic information retrieval and addresses related challenges.
Aya Expanse.	introduces a suite of open-weight foundation models designed for multilingual proficiency, featuring 8B and 32B parameter models and one of the largest multilingual datasets to date, containing 513 million examples. The release also includes Aya-101, which is claimed to be the most extensive multilingual model, supporting 101 languages. Aya Expanse 32B surpasses the performance of Gemma 2 27B, Mixtral 8x22B, and Llama 3.1 70B, even though it is less than half the size of the latter.
A Survey on Data Synthesis and Augmentation for Large Language Models.	offers an in-depth overview of data generation techniques throughout the LLM lifecycle, covering topics such as data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and practical applications.
granite-3.0-language-models.	introduces a range of lightweight foundation models from 400 million to 8 billion parameters, optimized for tasks such as coding, retrieval-augmented generation (RAG), reasoning, and function calling. Designed for enterprise applications, these models support on-premise and on-device deployment, showing robust performance across academic benchmarks in language understanding, reasoning, coding, function calling, and safety.
Pixtral-12B-Base-2409.	Pixtral 12B base model weights have been released on Hugging Face.
Arcade, a new AI product creation platform, designed this necklace.	Arcade AI has developed a generative platform that allows users to create distinctive, high-quality jewelry items simply from text prompts—and the exciting part is, you can purchase the designs you generate.
Retrieval-Augmented Diffusion Models for Time Series Forecasting.	The Retrieval-Augmented Time Series Diffusion model (RATD) introduces a retrieval and guidance mechanism to enhance stability and performance in time series diffusion models. RATD operates in two steps: first, it retrieves relevant historical data from a database, and then uses this information as a reference to guide the denoising phase.
NotebookLlama: An Open Source version of NotebookLM.	Meta has published a quick start guide to help users build a simplified version of Google’s popular NotebookLM system.
How I Studied LLMs in Two Weeks: A Comprehensive Roadmap.	This article presents a 14-day roadmap for mastering LLM fundamentals, covering key topics such as self-attention, hallucinations, and advanced methods like Mixture of Experts. It offers resources for building an LLM from the ground up, alongside curated literature and online materials, all organized within a GitHub repository. Emphasizing a tailored learning experience, the article underscores the importance of foundational skills in math, programming, and deep learning.
Marly.	Marly is an open-source data processor enabling agents to query unstructured data using JSON, streamlining data interaction and retrieval.
LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias.	It was previously believed that novel view synthesis depended heavily on strong 3D inductive biases. This study demonstrates that, with scale and a minimal inductive bias, it's possible to significantly surpass these previously assumed limitations.
Continuous Speech Synthesis using per-token Latent Diffusion.	Autoregressive models continue to excel in many applications, yet recent advancements with diffusion heads in image generation have led to the concept of continuous autoregressive diffusion. This research broadens the scope of per-token diffusion to accommodate variable-length outputs.
CDChat: A Large Multimodal Model for Remote Sensing Change Description.	This paper presents a change description instruction dataset aimed at fine-tuning large multimodal models (LMMs) to enhance change detection in remote sensing.
IC-Light V2 (Flux-based IC-Light models).	IC-Light currently offers the most effective method for associating images with a pre-trained text-to-image backbone. This discussion marks the initial steps toward extending that capability to the robust Flux models.
The Scene Language: Representing Scenes with Programs, Words, and Embeddings.	Creating 3D scenes from scratch presents significant challenges, including data limitations. This research introduces a programming-like language for describing 3D scenes and demonstrates that Claude Sonnet can produce highly realistic scenes even without specific training for this task.
3D Semantic Segmentation.	FtD++ is a cross-modal learning approach designed to enhance unsupervised domain adaptation in 3D semantic segmentation tasks.
Open source replication of crosscoder on Gemma 2B.	Anthropic recently published two studies showcasing its novel interpretability method. This post provides an open replication of the crosscoder on the Gemma 2B model.
Awesome-Graph-OOD-Learning.	This repository lists papers on graph out-of-distribution learning, covering three primary scenarios: graph OOD generalization, training-time graph OOD adaptation, and test-time graph OOD adaptation.
OpenWebVoyager: Building Multimodal Web Agents.	OpenWebVoyager offers tools, datasets, and models designed to build multimodal web agents that can navigate and learn from real-world web interactions.
Automated Colorization for Animation.	Researchers have introduced an innovative inclusion-matching technique that overcomes challenges in automated colorization, particularly for animations where occlusions and wrinkles complicate traditional segment matching.
Lofi Music Dataset.	A dataset containing music clips paired with detailed text descriptions, generated by a music creation model.
Learning to Handle Complex Constraints for Vehicle Routing Problems.	Researchers have developed a Proactive Infeasibility Prevention (PIP) framework designed to enhance neural network performance on Vehicle Routing Problems (VRPs) that involve challenging constraints.
Unleashing the Power of AI on Mobile: LLM Inference for Llama 3.2 Quantized Models with ExecuTorch and KleidiAI.	PyTorch has made significant strides with ExecuTorch, a tool that enables AI model deployment at the edge, greatly enhancing the performance and efficiency of various end systems.
CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution.	CompassJudger-1 is the first open-source, comprehensive judge model created to enhance the evaluation process for large language models (LLMs).
MINT-1T.	MINT-1T, a vast open-source multimodal dataset, has been released with one trillion text tokens and 3.4 billion images, incorporating diverse content from HTML, PDFs, and ArXiv papers. This dataset, roughly ten times larger than previous collections, is intended to accelerate advancements in large-scale multimodal machine learning research.
LARP: Tokenizing Videos 🎬 with a Learned Autoregressive Generative Prior 🚀.	LARP is a novel video tokenizer designed to enhance video generation in autoregressive (AR) models by prioritizing global visual features over individual patch-based details.
OpenAI's new hallucination benchmark.	OpenAI has released the SimpleQA benchmark, which measures how well models answer simple factual questions.
ThunderKittens.	ThunderKittens is a framework for writing highly efficient GPU kernels. It builds on the principle that GPUs are optimized for working with compact 16x16 data tiles, which also keeps the framework easy to use. With this approach, achieving 40% faster kernels requires only a few hundred lines of code.
Skinned Motion Retargeting with Dense Geometric Interaction Perception.	MeshRet has developed an innovative method for enhancing motion retargeting for 3D characters, prioritizing the preservation of body geometry interactions from the outset.
Unlocking the Capabilities of Masked Generative Models for Image Synthesis via Self-Guidance.	Researchers have improved Masked Generative Models (MGMs) by introducing a self-guidance sampling technique, which enhances image generation quality without compromising diversity.
Speeding Up Transformers with Token Merging.	This project presents PiToMe, an algorithm that compresses Vision Transformers by gradually merging tokens after each layer, thereby decreasing the number of tokens processed.
PF3plat: Pose-Free Feed-Forward 3D Gaussian Splatting.	PF3plat addresses the challenge of 3D reconstruction and novel view synthesis from RGB images without requiring additional data.
Fine-tuning LLMs to 1.58bit: extreme quantization made easy.	BitNet, created by Microsoft Research, is a transformer architecture that lowers the computational and memory demands of large language models by using ternary weights (-1, 0, 1), which equates to about 1.58 bits per parameter. The architecture was designed for models trained from scratch, but this post shows that existing models can also be fine-tuned to this low-precision format while retaining high performance on downstream tasks. The technique greatly reduces energy consumption and speeds up inference through specialized kernels that enable efficient matrix multiplication.
SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Recognition.	SELECT is the inaugural extensive benchmark designed to evaluate various data curation methods in image classification. ImageNet++ is a newly developed dataset that augments ImageNet-1K by incorporating five additional training data variations, each curated through distinct techniques.
ODRL: A Benchmark for Off-Dynamics Reinforcement Learning.	ODRL is the first standardized benchmark designed to assess reinforcement learning methods in environments with differing dynamics.
Text-to-Image Model to Generate Memes.	Researchers have created an innovative adapter method for text-to-image models, enabling them to tackle complex tasks such as meme video generation while preserving the base model's strong generalization abilities.
Anomaly Classification in Industry.	AnomalyNCD is a multi-class anomaly classification framework intended to enhance traditional anomaly detection techniques in industrial environments.
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models.	Byte-level language models represent a move toward a token-free future, but sequence length remains a significant challenge. Dynamically merging tokens shortens the sequence, which helps fit more content within the context.
BART vectoriZed.	A new GPU-enabled implementation of Bayesian Additive Regression Trees (BART) significantly accelerates processing speed, making it up to 200 times faster than conventional CPU-based versions.
Huge new Diffusers release.	The Hugging Face Diffusers package now includes new pipelines like Flux, Stable Audio, Kolors, CogVideoX, Latte, and others, alongside new methods such as FreeNoise and SparseCtrl, plus various refactors.
4 experiments with voice AI models to help you explore culture.	Google’s voice AI models allow users to engage with culture in innovative ways. Projects like Talking Tours provide AI-guided virtual tours, Mice in the Museum offers art narration, and Lip Sync animates lips to discuss cultural topics. These entertaining tools offer new perspectives on art and design.
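
As a rough illustration of the ternary format mentioned in the BitNet entry above, the following Python sketch quantizes a small list of weights to -1, 0, or 1. It assumes the absmean scaling described in the BitNet b1.58 paper (divide by the mean absolute weight, round, then clip). The function names and sample weights are hypothetical, and this is not the implementation used in the linked post.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Map each weight to -1, 0, or 1 using absmean scaling (assumed scheme)."""
    # Scale factor: mean absolute value of the weights (eps avoids division by zero).
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round the scaled weight to the nearest integer, then clip to [-1, 1].
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate the original weights from the ternary values."""
    return [q * scale for q in quantized]


# Hypothetical sample weights, chosen only for illustration.
weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]
q, scale = ternary_quantize(weights)
approx = dequantize(q, scale)

# Three possible values carry log2(3) bits of information per weight.
bits_per_weight = math.log2(3)

print(q)
print(round(bits_per_weight, 2))
```

Three possible values per weight correspond to log2(3), or about 1.58 bits, which is where the "1.58-bit" name comes from.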

Perspectives

Link	description
ByteDance intern fired for planting malicious code in AI models.	After rumors swirled that TikTok owner ByteDance had lost tens of millions after an intern sabotaged its AI models, ByteDance issued a statement this weekend hoping to silence all the social media chatter in China.
Thinking Like an AI.	Large language models (LLMs) operate as advanced autocomplete systems, generating the next token based on a combination of their training data and current input. Small variations in input can influence predictions, resulting in different responses to the same question. Gaining insight into token prediction, training data context, and memory constraints can enhance effective AI usage.
An Interview with Salesforce CEO Marc Benioff about AI Abundance.	Salesforce CEO Marc Benioff recently spoke about the company's new AI initiative, Agentforce, showcasing its potential to transform enterprise applications and customer interactions. He contrasted Salesforce's approach with Microsoft’s Copilot, describing Salesforce’s solution as more cohesive and impactful, thanks to its strong platform and data infrastructure. During the interview, Benioff stressed the significance of AI-driven "agentic" layers designed to boost customer service and improve operational efficiency across various industries.
How GPU Access Helps Startups Be Agile.	Andreessen Horowitz's Oxygen program tackles GPU shortages by offering startups in its portfolio more accessible and flexible GPU resources, allowing them to bypass price surges and supply limitations. This initiative enables AI startups to concentrate on product development without the pressure of long-term capital expenditure, emphasizing the need for equitable access to critical resources in the competitive AI field.
The Mask Comes Off: At What Price?	OpenAI is approaching its shift to a Public Benefit B-Corporation, a move that could impact its investor dynamics and collaboration with Microsoft. This transition brings up questions around control and valuation, particularly concerning the nonprofit's stake, which could be substantial given OpenAI's role in advancing AGI. The company’s future profitability and strategic course are closely tied to the safe development of AGI, a pursuit with enormous potential value.
What's so special about the human brain?	Torrents of data from cell atlases, brain organoids, and other methods are finally delivering answers to an age-old question.
‘Educational’ apps are worth billions. We need to make sure they work.	Partnerships between developers and researchers could help to improve the quality of educational apps and other technologies.
The huge protein database that spawned AlphaFold and biology’s AI revolution.	Pioneering crystallographer Helen Berman helped to set up the massive collection of protein structures that underpins the Nobel-prize-winning tool’s success.
Extreme fire seasons are looming — science can help us adapt.	Not all wildfires can be averted, but data, models, and collaborations can help to chart a course to a fire-resilient future.
AI-designed DNA sequences regulate cell-type-specific gene expression.	Researchers have used artificial intelligence models to create regulatory DNA sequences that drive gene expression in specific cell types. Such synthetic sequences could be used to target gene therapies to particular cell populations.
Pushing the frontiers of audio generation.	DeepMind has shared additional details about the audio generation models behind NotebookLM.
Evaluating feature steering: A case study in mitigating social biases.	This study investigates the use of feature steering in AI models to adjust outputs in an interpretable way. It identifies a "steering sweet spot," where modifications do not compromise performance. Results demonstrate that steering can adjust social biases within specific areas but may also produce unintended effects outside those targets. Continued research is necessary to enhance feature steering, aiming for safer and more dependable AI outcomes.
How we saved hundreds of engineering hours by writing tests with LLMs.	Assembled leverages LLMs to speed up and enhance software testing, allowing tests to be generated in minutes rather than hours. This approach boosts engineering productivity, saving time and enabling a stronger focus on feature development. LLMs create thorough and precise tests that uphold code quality and sustain development speed.
How to train LLM as a judge to drive business value.	"LLM As a Judge" is an approach for leveraging an existing language model to rank and score natural language. This post provides guidelines for effectively using this method to process or assess data.

Back to index

ML news: Week 21 - 27 October

Research

Link	description
Thinking LLMs: General Instruction Following with Thought Generation.	The proposed training method aims to enhance LLMs with thinking capabilities for general instruction-following without relying on human-annotated data. It employs an iterative search and optimization process to facilitate thought generation, allowing the model to learn without direct supervision. For each user instruction, potential thoughts are evaluated using a judge model, which scores only the responses to identify the best and worst options. The resulting full outputs are then used as selected and rejected pairs for DPO (termed Thought Preference Optimization in this paper). This approach demonstrates superior performance on AlpacaEval and Arena-Hard.
Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence.	A new collaborative search algorithm is proposed to adapt LLMs using swarm intelligence, where a group of LLM experts collaboratively navigates the weight space to optimize a utility function that reflects various adaptation objectives. Experiments show that Model Swarms can effectively adjust LLM experts for a single task, multi-task domains, reward models, and a range of human interests. This approach outperforms 12 model composition baselines by up to 21.0% across different tasks and contexts.
First-Person Fairness in Chatbots.	This study explores first-person fairness, focusing on the fairness of interactions between users and ChatGPT, particularly examining any biases related to users' names. It utilizes a model powered by GPT-4o to analyze patterns and name sensitivity in the chatbot's responses based on different user names. The findings suggest that post-training significantly reduces harmful stereotypes overall. However, in areas such as entertainment and art, especially with open-ended tasks, the study reveals a higher level of bias, indicating a tendency to create narratives featuring protagonists whose gender aligns with the gender inferred from the user's name.
Looking Inward: Language Models Can Learn About Themselves by Introspection.	The report indicates that LLMs can gain knowledge through introspection that is not directly derivable from their training data. It suggests that these models possess privileged information about themselves, which could contribute to creating more interpretable and controllable systems. However, it also notes that this introspective ability has limitations, as models often struggle to predict their own behavior on tasks that require reasoning over extended outputs.
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation.	This proposal introduces a unified autoregressive framework for multimodal understanding and generation, which decouples visual encoding into independent pathways. Utilizing a single transformer architecture enhances flexibility and performance in both visual understanding and generation tasks. The framework claims to mitigate the trade-offs typically associated with vision tasks found in methods relying on a single visual encoder. As a result, it outperforms previous unified models and matches or exceeds the performance of task-specific models.
Inference Scaling for Long-Context Retrieval Augmented Generation.	This study uses two strategies to explore scaling laws for Retrieval-Augmented Generation (RAG): demonstration-based RAG (DRAG), which relies on in-context learning, and iterative demonstration-based RAG (IterDRAG), which relies on iterative prompting. It finds that RAG performance improves steadily as effective context length increases, provided configurations are optimized. Under optimal conditions, increasing inference computation yields nearly linear improvements in long-context RAG performance. This insight leads to a computation allocation model that offers practical guidance for distributing computation in long-context RAG settings.
Agent S: An Open Agentic Framework that Uses Computers Like a Human.	A novel open agentic framework has been developed to facilitate autonomous interactions with computers via a graphical user interface (GUI). Named Agent S, this framework addresses challenges such as knowledge acquisition, long-horizon planning, and managing dynamic interfaces. It introduces experience-augmented hierarchical planning that combines search and retrieval methods. Additionally, it uses an agent-computer interface to better elicit the reasoning and control capabilities of GUI agents. Evaluation on the OSWorld benchmark shows that Agent S surpasses the baseline by 9.37 percentage points in success rate, an 83.6% relative improvement, and sets a new state of the art.
Exploring Model Kinship for Merging Large Language Models.	The study introduces the concept of model kinship to assess the similarity between LLMs. This measure is used to develop a model merging strategy called Top-k Greedy Merging with Model Kinship, which enhances performance. The authors find that this new criterion allows for effective and continuous model merging.
On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability.	The report highlights that the o1-preview model excels in self-evaluation and constraint-following. However, it also points out that these o1 models exhibit bottlenecks in decision-making and memory management, particularly in the context of spatial reasoning. Specifically, the models tend to generate redundant actions and face challenges in generalizing across spatially complex tasks.
Sabotage evaluations for frontier models.	Anthropic has conducted several innovative evaluations to identify vulnerabilities and assess misalignment in large, powerful models.
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities.	A powerful open-source initiative aimed at replicating GPT-4o's multimodal capabilities has emerged. This model was trained by aligning multiple modalities using pre-trained visual and audio encoders, allowing it to achieve advanced speech recognition and generation functionalities.
Automatically Interpreting Millions of Features in Large Language Models.	Interpreting SAE features at scale is difficult. To address this, EleutherAI has introduced an automated pipeline that uses LLMs to generate natural-language explanations of what each feature means in context.
Mitigating Object Hallucination via Concentric Causal Attention.	Object hallucination in vision-language models has been associated with Rotary Position Encoding (RoPE), which faces challenges in managing long-term dependencies between visual and textual inputs. To overcome this, the authors introduce Concentric Causal Attention (CCA), a novel positional alignment method that enhances the interaction between visual elements and instruction tokens.
Simplifying, stabilizing, and scaling continuous-time consistency models.	OpenAI has published work focusing on enhancing consistency models, which operate in two steps rather than the 1,000 steps typically used in diffusion models. While these models still depend on distillation from an existing diffusion model, the research seeks to improve their performance and stability as they scale.
All you need are 32 tokens to represent video.	Salesforce's new approach introduces a novel video encoder that significantly reduces the number of tokens needed for accurate representation. While similar attempts in the past have seen limited success, the breakthrough appears to come from combining an explicit temporal encoder with a spatial encoder, enabling more efficient video processing.
CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing.	CoPS is a novel algorithm that improves agents' sequential reasoning by allowing them to share experiences across various tasks, enhancing their overall learning and adaptability.
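
The pair-construction step described in the Thinking LLMs entry above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: `build_preference_pair`, the candidate format, and the length-based `toy_judge` are hypothetical stand-ins, and a real setup would sample candidates from the model and score them with a judge model.

```python
def build_preference_pair(instruction, candidates, judge):
    """Build one chosen/rejected pair for DPO-style training.

    candidates: list of dicts with 'thought' and 'response' keys (assumed format).
    judge: callable(instruction, response) -> numeric score.
    """
    # The judge sees only the response, never the thought.
    def score(candidate):
        return judge(instruction, candidate["response"])

    best = max(candidates, key=score)
    worst = min(candidates, key=score)

    # The full outputs (thought plus response) form the preference pair.
    return {
        "prompt": instruction,
        "chosen": best["thought"] + "\n" + best["response"],
        "rejected": worst["thought"] + "\n" + worst["response"],
    }


def toy_judge(instruction, response):
    """Hypothetical judge: prefers longer responses. Illustration only."""
    return len(response)


candidates = [
    {"thought": "List the steps first.",
     "response": "Boil water, add tea, steep for three minutes."},
    {"thought": "Answer briefly.",
     "response": "Make tea."},
    {"thought": "Consider the audience.",
     "response": "Boil water and add a tea bag."},
]

pair = build_preference_pair("How do I make tea?", candidates, toy_judge)
print(pair["chosen"])
print(pair["rejected"])
```

Because the judge scores only the responses, the thoughts are never evaluated directly; they are reinforced or penalized only through the quality of the responses they lead to.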

News

Link	description
US investigates 2.4m Tesla self-driving vehicles after reported collisions.	The US road safety agency has opened an evaluation over reported collisions in low visibility.
Anthropic just made it harder for AI to go rogue with its updated safety policy.	Anthropic has revised its Responsible Scaling Policy to incorporate Capability Thresholds for AI models that present substantial risks, including bioweapons and autonomous AI research. This policy is designed to establish industry standards by introducing AI Safety Levels, which mandate stricter safeguards according to the model's capabilities. By transparently sharing safety practices and appointing a Responsible Scaling Officer, Anthropic aims to take a leadership role in AI governance and encourage similar initiatives across the industry.
Sam Altman’s Worldcoin becomes World and shows new iris-scanning Orb to prove your humanity.	The World project, co-founded by Sam Altman, seeks to authenticate human identity online through iris-scanning technology, addressing privacy issues and ongoing investigations in the EU. The initiative plans to integrate human verification into AI platforms and may redistribute the wealth generated by AI through Worldcoins. Recent updates include the launch of a new blockchain, an app, and tools such as Deep Face to help combat deepfakes.
Google - Gemini Long Context.	The Gemini team has set aside $100,000 for the most effective applications of their long context model capabilities.
Unleashing System 2 Thinking? AlphaCodium Outperforms Direct Prompting of OpenAI o1.	OpenAI's o1 model, demonstrating System 1.5 thinking, exhibits improved reasoning abilities compared to earlier LLMs but still lacks the comprehensive problem-solving capabilities of full System 2 thinking. AlphaCodium enhances o1's coding performance by offering a structured framework that supports reasoning and iterative refinement, resulting in greater accuracy on Codeforces benchmarks. Although the combination of o1 and AlphaCodium shows potential for advancing AI toward more profound reasoning, significant effort is still needed to incorporate complete System 2 thinking in AI models.
Amazon's AI Generator Tool Can Now Create Audio Ads.	Soon, you’ll hear more audio ads on Amazon’s properties that were created with generative AI.
Google Shopping is getting a ‘for you’ feed of products.	Google Shopping is rolling out a personalized feed that shows you a stream of products you might like. The new feature, which is coming to mobile and desktop devices, shows up when you head to shopping.google.com.
TikTok owner sacks intern for allegedly sabotaging AI project.	ByteDance says that in August it dismissed a person who ‘maliciously interfered’ with the training of artificial intelligence models.
AlphaFold reveals how sperm and egg hook up in intimate detail.	Three sperm proteins work together as matchmakers to enable fertilization in vertebrates.
xAI, Elon Musk’s AI startup, launches an API.	In August, Elon Musk’s xAI promised to make Grok, the company’s flagship generative AI model powering a number of features on X, available via an API. Now, that API has arrived — albeit a bit bare-bones at the moment.
Jane Street Real-Time Market Data Forecasting.	This competition, hosted by Jane Street, challenges participants to build models using real-world data from production systems. The goal is to provide insights into the complexities of financial markets, requiring participants to apply their skills in data analysis and modeling to navigate the dynamic nature of market behavior.
OCP Summit 2024: The open future of networking hardware for AI.	At OCP 2024, Meta unveiled a next-generation disaggregated network fabric and new network hardware specifically designed for AI clusters. The company introduced the Disaggregated Scheduled Fabric (DSF), aimed at improving scalability and performance in AI training systems. Both the newly developed and existing hardware are optimized for high throughput and efficiency, providing open, vendor-agnostic solutions to support advanced AI applications.
Serve confirms delivery by robot expansion plans with Gen3 rollout.	Serve Robotics' third-generation delivery robot is equipped with NVIDIA's Jetson Orin module, significantly boosting its AI processing capabilities. This upgrade allows the robot to make faster, real-time autonomous navigation decisions, improving its efficiency and performance in delivery tasks.
Boston Dynamics teams with TRI to bring AI smarts to Atlas humanoid robot.	Boston Dynamics and Toyota Research Institute are partnering to integrate advanced AI and large behavior models into the electric Atlas humanoid robot. This collaboration aims to enhance the robot's capabilities, enabling more sophisticated and autonomous behaviors in tasks that require human-like movement and decision-making.
Microsoft introduces ‘AI employees’ that can handle client queries.	The US company is giving customers the ability to build their own virtual agents and is releasing 10 off-the-shelf bots.
Thom Yorke and Julianne Moore join thousands of creatives in AI warning.	Statement comes as tech firms try to use creative professionals’ work to train AI models
Claude AI tool can now carry out jobs such as filling forms and booking trips, says the creator.	Anthropic says the model is able to carry out computer tasks, as fears mount that such technology will replace workers.
Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku.	Anthropic has released an upgraded Claude 3.5 Sonnet and introduced Claude 3.5 Haiku, a more affordable model that delivers the same performance as the previous Claude 3 Opus. Furthermore, Claude 3.5 Sonnet has been trained with screen recordings, enabling it to operate computers and interact with user interfaces.
ChatGPT has a Windows app now.	The app, which is currently in testing, is only available to ChatGPT subscribers for now.
Adobe's new image rotation tool is one of the most impressive AI concepts we've seen.	Adobe's Project Turntable leverages AI to rotate 2D vector art in 3D, allowing the artwork to be viewed from various angles while preserving its 2D look and design integrity. This innovative technique ensures that the visual style remains consistent, even as the artwork is transformed in three-dimensional space.
Perplexity lets you search your internal enterprise files and the web.	Enterprises can use their Perplexity dashboards to search for internal information and combine it with knowledge from the internet, but the search is limited to the specific files they deem important.
OpenAI, Microsoft reportedly hire banks to renegotiate partnership terms.	OpenAI and Microsoft are in discussions regarding the terms of their partnership, with Microsoft aiming to acquire a substantial stake in OpenAI following its restructuring.
Former OpenAI CTO Mira Murati is reportedly fundraising for a new AI startup.	This startup will reportedly focus on building AI products based on proprietary models and could raise more than $100 million in this round.
Midjourney plans to let anyone on the web edit images with AI.	Midjourney is planning to release an upgraded web tool that’ll let users edit any uploaded images from the web using Midjourney’s generative AI.
Intel wins lengthy EU legal battle over £880m competition fine.	Chipmaker disputed 2009 decision that it abused its market position in case dating back two decades
Cohere's multilingual model's dramatic improvement.	The Aya project, a standout initiative in multilingual language model training, has made impressive strides since its launch earlier this year. Much of its performance improvement is attributed to effective post-training strategies. Additionally, Aya can handle audio input and create images, all from non-English sources.
Introducing the analysis tool in Claude.ai.	Claude can now write and execute code as part of artifacts.
Gurman: Apple internally believes that it’s at least two years behind in AI development.	According to the latest edition of Mark Gurman’s Power On newsletter, some employees at Apple believe that the company is around two years behind in artificial intelligence development.
Perplexity is reportedly looking to fundraise at an $8B valuation.	AI search engine Perplexity is in fundraising talks and hopes to raise around $500 million at an $8 billion valuation, according to The Wall Street Journal.
Chinese humanoid robot is the 'fastest in the world' thanks to its trusty pair of sneakers.	The STAR1 robot can reach a top speed of 8 mph with the added help of a pair of sneakers.
From Rupert Murdoch to Thom Yorke: the growing backlash to AI.	Media mogul and leading artists join the fight to stop tech firms using creative works for free as training data
Talk to your plants? Now the first AI-powered garden will allow them to talk back.	Collaboration between leading garden designer and Microsoft to go on display at Chelsea Flower Show 2025

Resources

Link	description
CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos.	This proposal introduces a new point-tracking model along with a semi-supervised training recipe that allows real videos without annotations to be used during training. It generates pseudo-labels using readily available teacher models. This approach simplifies the architecture and training scheme, yielding better results while using 1,000 times less data.
Meta's latest open source releases.	Meta has introduced a significant array of valuable research tools, including a speech-to-speech model, enhancements to SAM, and numerous other intriguing developments.
One-Step Diffusion via Shortcut Models.	Shortcut models represent a new category of consistency models that can produce continuous signals with minimal inference steps.
Zero-Shot 3D Visual Grounding.	VLM-Grounder is a novel approach to 3D visual grounding that addresses the shortcomings of conventional methods by leveraging vision-language models (VLMs) and 2D images.
DeepSeek's natively multimodal model.	DeepSeek has developed and launched a powerful 1.3 billion parameter model capable of processing interleaved text and images for both generation and comprehension.
Meta Lingua.	Meta has developed an easy-to-use and research-friendly codebase that can replicate Llama 2 7B within 24 hours.
Embedding an Ethical Mind: Aligning Text-to-Image Synthesis via Lightweight Value Optimization.	LiVO (Lightweight Value Optimization) is an innovative approach designed to align Text-to-Image models with human values.
Easily hackable vision language model.	A simple and performant VLM implementation in pure PyTorch.
Anthropic Quickstarts.	Anthropic Quickstarts provides developers with projects like a customer support agent and a financial data analyst to help them swiftly utilize the Anthropic API. These projects leverage Claude for natural language processing and incorporate interactive data visualization. Each quickstart comes with setup instructions and encourages contributions from the community.
BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities.	BiGR is an innovative image generation model that leverages compact binary latent codes to enhance both its generation and representation capabilities. It is the first model to integrate both generative and discriminative tasks within a unified framework. Key features of the model include binary tokenization and a distinctive entropy-ordered sampling technique, which contribute to its improved performance.
LongPiBench.	LongPiBench is a benchmark created to evaluate positional biases in large language models (LLMs) when handling long contexts. It focuses on identifying biases that stem from the spacing between multiple relevant pieces of information, providing a targeted way to assess how well models handle long-range dependencies in text.
CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models.	CLaMP 2 is a contrastive model designed for aligning music and text. It uses contrastive learning techniques to match and relate musical elements with corresponding textual descriptions, enhancing the ability to process and generate music-related text in alignment with audio.
bitnet.cpp.	Microsoft has released an inference repository for its 1.58-bit models, which, when properly trained, are capable of running efficiently on consumer hardware. This development allows for more accessible deployment of advanced AI models without requiring high-end computational resources.
Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning.	Montessori-Instruct is a novel framework designed to generate synthetic data that aligns with a student language model's learning process. It adapts the data produced by the teacher model to fit the student's learning preferences by leveraging local data influence and Direct Preference Optimization (DPO), optimizing the training experience for the student model.
Stable Diffusion 3.5.	Stability AI has launched a new series of models featuring enhanced performance and faster speeds. These models come with built-in Diffusers support, so they can be used for training right away.
3D-GANTex: 3D Face Reconstruction with StyleGAN3-based Multi-View Images and 3DDFA based Mesh Generation.	This paper presents a novel approach for estimating face texture and geometry from a single image by combining StyleGAN with 3D Morphable Models.
Moonshine.	Moonshine is a family of speech-to-text models optimized for fast and accurate automatic speech recognition (ASR) on resource-constrained devices. It is well-suited to real-time, on-device applications like live transcription and voice command recognition.
PocketPal AI.	PocketPal AI is a pocket-sized AI assistant powered by small language models (SLMs) that run directly on your phone. Designed for both iOS and Android, PocketPal AI lets you interact with various SLMs without the need for an internet connection.
Introducing the prompt() Function: Use the Power of LLMs with SQL!.	The costs of operating LLMs have dropped considerably, making it feasible to incorporate smaller models like GPT-4o-mini into SQL functions. MotherDuck's PROMPT() function simplifies tasks such as text generation, summarization, and structured data extraction using OpenAI models. It provides flexibility in balancing cost and performance, while also supporting bulk operations with improved concurrency for more efficient processing.
Anthropic Computer Use Demo.	A quick example of Claude 3.5 Sonnet's new computer use capabilities.
Introducing SynthID Text.	SynthID is a method for statistically watermarking generated text. It employs a pseudorandom function after the top-k and top-p sampling steps to embed a mark within the text. A probabilistic Bayesian approach is then used to detect whether the text has been watermarked, indicating it was produced by a language model.
Transformers.js v3: WebGPU Support, New Models & Tasks, and More….	Transformers.js is a JavaScript library designed to run machine learning models, and it now supports WebGPU, offering up to 100x faster performance than WASM in some cases. The latest version provides access to over 1,200 models, making it well-suited for edge and browser-based applications.
Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages.	Pangea-7B is an open multilingual multimodal language model (MLLM) developed to address multilingual and multicultural challenges in visual understanding tasks. It is trained on PangeaIns, a comprehensive dataset consisting of 6 million instructions across 39 languages.
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree.	SAM2Long solves the "error accumulation" problem found in SAM 2's memory design by implementing a training-free strategy for video object segmentation.
Agent.exe.	A convenient wrapper for Anthropic's computer use system simplifies its usage and execution, making it more user-friendly and accessible.
TALoS: Enhancing Semantic Scene Completion via Test-time Adaptation on the Line of Sight.	TALoS is a method that enhances scene completion for autonomous vehicles by leveraging observations from different time points as supervision for making more accurate predictions.
OmniParser for Pure Vision Based GUI Agent.	Screenshot parsing tool for models to use digital interfaces.
Introducing quantized Llama models with increased speed and a reduced memory footprint.	Meta has optimized its 1B and 3B language models by applying quantization, achieving a 2-4x speed increase and reducing the model size by over 50% with minimal quality loss. This improvement is made possible by its quantization-aware training setup, allowing the models to adapt to lower precision effectively.
Joint Point Cloud Upsampling and Cleaning with Octree-based CNNs.	An effective and straightforward approach for upsampling and refining point clouds utilizes a modified octree-based 3D U-Net, known as OUNet.
ExecuTorch.	ExecuTorch supports on-device inference across mobile and edge devices, including wearables, embedded systems, and microcontrollers. It facilitates the efficient deployment of PyTorch models to edge environments and is compatible with various computing platforms, leveraging hardware capabilities like CPUs, NPUs, and DSPs. Comprehensive tutorials provide guidance on using ExecuTorch step-by-step.
Federated Transformer (FeT).	The Federated Transformer (FeT) is a novel framework aimed at enhancing both performance and privacy in Vertical Federated Learning (VFL) across multiple collaborating parties.
ADEM-VL.	ADEM-VL is an innovative vision-language model created to address hardware constraints found in current models.
Predicting Weight Loss with Machine Learning.	The author utilized a straightforward feedforward DNN model to monitor and forecast weight loss on a ketogenic diet. This model effectively captured the non-linear weight loss trends, fit a predictive function to the data, and visualized calorie metrics. For added insights, the Harris-Benedict Equation was applied to compare estimated calorie needs with actual weight loss.
Video scraping: extracting JSON data from a 35 second screen capture for less than 1/10th of a cent.	Google Gemini's AI Studio can accurately extract numerical data from video screen recordings of emails. This process leverages the cost-effective Gemini 1.5 Flash model, resulting in minimal expense. This innovative "video scraping" technique provides a practical alternative to conventional data extraction methods.
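
The Harris-Benedict Equation mentioned in the weight-loss entry above can be sketched in a few lines. This is a minimal illustration rather than the author's code: it assumes the revised (Roza and Shizgal, 1984) coefficients, and the activity factor and the 7,700 kcal per kilogram rule of thumb are assumptions introduced here, as are the function names.

```python
def harris_benedict_bmr(weight_kg: float, height_cm: float,
                        age_years: float, sex: str) -> float:
    """Basal metabolic rate in kcal/day (revised Harris-Benedict coefficients)."""
    if sex == "male":
        return 88.362 + 13.397 * weight_kg + 4.799 * height_cm - 5.677 * age_years
    if sex == "female":
        return 447.593 + 9.247 * weight_kg + 3.098 * height_cm - 4.330 * age_years
    raise ValueError("sex must be 'male' or 'female'")


def expected_weekly_loss_kg(bmr: float, intake_kcal_per_day: float,
                            activity_factor: float = 1.2,
                            kcal_per_kg: float = 7700.0) -> float:
    """Estimated weekly weight change from the daily calorie deficit.

    activity_factor (1.2 is a common sedentary value) and kcal_per_kg are
    rule-of-thumb assumptions, not values taken from the article.
    """
    daily_needs = bmr * activity_factor          # estimated total daily energy use
    daily_deficit = daily_needs - intake_kcal_per_day
    return 7.0 * daily_deficit / kcal_per_kg


if __name__ == "__main__":
    bmr = harris_benedict_bmr(weight_kg=80, height_cm=180, age_years=30, sex="male")
    loss = expected_weekly_loss_kg(bmr, intake_kcal_per_day=1800)
    print(f"BMR: {bmr:.1f} kcal/day, estimated loss: {loss:.2f} kg/week")
```

Comparing this estimate with the measured weekly change is the kind of check the entry describes; a persistent gap between the two suggests the intake log or the activity factor is off.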

Perspectives

Link	description
Duolingo CEO Luis von Ahn wants you addicted to learning.	Duolingo's CEO, Luis von Ahn, talks about utilizing AI and gamification to improve language learning through features such as chat interactions with AI avatars and AI-generated video game-like adventures. The company has recently launched Duolingo Max, a premium subscription plan that provides AI-driven conversation practice, capitalizing on the lower costs and faster development associated with AI-generated content. Although AI has limitations in engagement, Duolingo prioritizes maintaining user motivation by balancing effective learning with gamified, entertaining experiences.
State of AI Report 2024.	The 2024 State of AI Report notes that foundational models are increasingly being integrated into practical applications, with OpenAI leading the way in significant revenue generation. Key developments include the alignment of performance among leading research labs, a growing emphasis on planning and reasoning in large language model (LLM) research, and extending foundational models into multimodal domains. Despite facing regulatory hurdles, AI companies have seen a surge in valuation, though questions about their long-term sustainability remain.
How gen AI can help doctors and nurses ease their administrative workloads.	Doctors and nurses spend nearly 28 hours a week on administrative tasks.
Elon Musk’s global political goals.	Over the weekend, Musk pledged to give away $1m a day to registered voters in battleground states in the US who sign his Pac’s petition in support of the First and Second Amendments. He awarded the first prize, a novelty check the size of a kitchen island, at a Pennsylvania rally on Saturday and the second on Sunday in Pittsburgh. He says he’ll keep doing it until the election on 5 November. Experts say that the stunt is potentially illegal.
The Second $100B AI Company.	This article forecasts that by 2034, emerging AI companies fueled by advancements in AI applications, particularly in consumer AI, will join OpenAI in exceeding a $100B market cap. While established tech giants currently dominate the AI infrastructure and model layers, the application layer offers significant potential for innovation and expansion, providing fertile ground for consumer AI to flourish. The prospects for large-scale success in consumer AI, especially in areas such as video creation, online shopping, and gaming, resemble the transformative impact seen in past tech revolutions like cloud computing and mobile technology.
Use Prolog to improve LLM's reasoning.	Current methods such as Chain-of-Thought (CoT) reasoning and the integration of programming languages like Prolog can enhance the reasoning abilities of LLMs, helping to mitigate the limitations of autoregressive models. The paper "Reliable Reasoning Beyond Natural Language" introduces a neurosymbolic approach that employs Prolog to translate requests into symbolic logic, enhancing both explainability and problem-solving capabilities. ProSLM, the model developed in this research, has shown substantial improvements on various datasets, highlighting the potential of combining Prolog with LLMs for tackling complex reasoning tasks.
AI watermarking must be watertight to be effective.	Scientists are closing in on a tool that can reliably identify AI-generated text without affecting the user’s experience. But the technology’s robustness remains a challenge.
AI scans RNA ‘dark matter’ and uncovers 70,000 new viruses.	Many are bizarre and live in salt lakes, hydrothermal vents, and other extreme environments.
Build an international AI ‘telescope’ to curb the power of big tech companies.	Artificial intelligence (AI) technologies have reached a crucial juncture. The vast computing clusters required to train the most advanced generative AI systems are available only to a few large corporations.
Was the Nobel prize for physics? Yes — not that it matters.	The award of the 2024 Nobel Prize in Physics to John Hopfield and Geoffrey Hinton for their groundbreaking research on artificial neural networks has caused consternation in some quarters. Surely this is computer science, not physics?
How I peer into the geometry behind computer vision.	Minh Ha Quang’s work at a Japanese AI research center aims to understand how machines extract image data from the real world.
AI Dreams: Microsoft @ 50, Chapter 1.	Microsoft's research on AI robustness led the company to invest billions in AI infrastructure, driving breakthroughs with partners such as OpenAI. This investment has played a key role in Microsoft's rapid growth in AI-powered products, highlighted by the success of GitHub Copilot. Despite facing competition and balancing sustainability goals, Microsoft remains committed to AI, with record capital expenditures on its AI and cloud infrastructure.
Future of Internet in the age of AI.	In this article, Cloudflare CEO Matthew Prince explores AI's influence on Internet infrastructure, emphasizing the need for AI-capable edge computing and local inference to minimize network latency. He underscores the significance of regionalization in AI services to address regulatory challenges and outlines Cloudflare's strategy of developing a connectivity-focused network. Cloudflare's goal is to enhance internet connectivity by making it faster, more secure, and more efficient, closely aligning its efforts with advancements in AI technologies.
How Jacob Collier helped shape the new MusicFX DJ.	Grammy-winning musician Jacob Collier has partnered with Google DeepMind and Google Labs to develop MusicFX DJ, an AI-driven music tool. The tool’s interface has been revamped to foster creativity, making it easy for users to tap into a "flow state" of artistic inspiration. MusicFX DJ is now available, featuring user-friendly controls suitable for all experience levels.
The AI Investment Boom.	The AI boom is spurring substantial US investments in data centers, computing infrastructure, and advanced hardware, with annual data center construction reaching an unprecedented $28.6 billion. This growth is driven by rising demand for high-powered computing resources essential for training and deploying sophisticated AI models. Although tech sector revenue is recovering, job growth is primarily centered on semiconductor manufacturing and infrastructure, shifting attention away from traditional programming roles.

Back to index

ML news: Week 14 - 20 October

Research

Link	description
Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models.	Introduces a novel RAG method to address the challenges of imperfect retrieval augmentation and knowledge conflicts in LLMs. Astute RAG adaptively extracts critical information from the internal knowledge of LLMs, then iteratively merges this with external knowledge while maintaining source awareness. Its interactive consolidation mechanism enhances the integration of internal and external information by identifying consistent passages, detecting conflicting data, and filtering out irrelevant content.
ToolGen: Unified Tool Retrieval and Calling via Generation.	Incorporates tool knowledge directly into LLMs by encoding tools as unique tokens, allowing the model to generate tool calls and arguments, facilitating smooth tool invocation alongside natural language generation. Experiments involving over 47,000 tools demonstrate that ToolGen achieves superior results in both tool retrieval and autonomous task execution.
Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG.	Finds that in many long-context LLMs, output quality diminishes as the number of passages increases, with the performance decline attributed to retrieved hard negatives. The authors propose two methods to enhance long-context LLM-based RAG: retrieval reordering and RAG-specific tuning with intermediate reasoning to improve relevance identification. These approaches show marked improvements in both accuracy and robustness in long-context RAG performance.
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.	Evaluates several state-of-the-art (SoTA) models using a benchmark built with symbolic templates that allow for a range of mathematical problems. The results show that LLMs display variability when answering different versions of the same questions, and their performance drops when numerical values in the questions are adjusted. As the complexity of the questions increases (e.g., adding more clauses), performance deteriorates significantly. The authors suggest that this decline in performance is likely due to a lack of logical reasoning capabilities in current LLMs.
Addition is All You Need for Energy-efficient Language Models.	Introduces an algorithm that approximates floating-point multiplication using integer addition operations, making it computationally less intensive than 8-bit floating-point arithmetic while achieving higher precision. The authors report that implementing the proposed L-Mul operation in tensor processing hardware could potentially reduce energy consumption by 95% for elementwise floating-point tensor multiplications and by 80% for dot product operations.
I Want to Break Free! Anti-Social Behavior and Persuasion Ability of LLMs in Multi-Agent Settings with Social Hierarchy.	Examines the interaction patterns of LLMs within a multi-agent setting involving a social hierarchy, specifically in a scenario where a guard and a prisoner interact, with the prisoner either seeking extra yard time or attempting to escape. The study finds that when power dynamics are present, LLMs struggle to maintain coherent conversations. Additionally, the authors highlight that agents' personas significantly influence their behaviors. Interestingly, even without explicit prompting, merely assigning roles to agents resulted in the emergence of anti-social behaviors.
Were RNNs All We Needed?	The paper revisits RNNs and demonstrates that removing the hidden states from the input, forget, and update gates allows for efficient parallel training. This adjustment eliminates the need for architectures like LSTMs and GRUs to rely on backpropagation through time (BPTT). They introduce new variants, called minLSTMs and minGRUs, which are 175 times faster for sequences of length 512.
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations.	The study finds that "truthfulness" information in LLMs is concentrated in specific tokens, offering a way to improve error detection and address related challenges. They also suggest that the internal representations of LLMs can be used to predict the types of errors these models are prone to making.
Archon: An Architecture Search Framework for Inference-Time Techniques.	The paper presents a modular framework for constructing and optimizing LLM systems by combining various inference-time techniques. This approach redefines the task of LLM system design as a hyperparameter optimization problem. Tested on benchmarks like MT-Bench and CodeContests, the framework, named Archon, outperforms top models such as GPT-4o and Claude 3.5 Sonnet, achieving a 15.1% average accuracy improvement.
RATIONALYST: Pre-training Process-Supervision for Improving Reasoning.	RATIONALYST is a model designed for process-supervision of reasoning, enabling it to generalize across a wide range of reasoning tasks. This is accomplished by pre-training on a dataset of 79k rationales from the Pile and a variety of reasoning datasets, with minimal human involvement. Fine-tuned from LLaMa-3-8B, the model achieves a 3.9% average accuracy improvement across seven reasoning benchmarks.
Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation.	The paper introduces a unified framework to evaluate an LLM’s capability to provide factual responses, assess retrieval skills, and reason through the generation of final answers. The framework includes multi-hop questions that require combining information from multiple sources. It reports that state-of-the-art LLMs struggle with this task, achieving only 40% accuracy without retrieval. However, the proposed multi-step retrieval method improves performance to 66% accuracy.
Not All LLM Reasoners Are Created Equal.	Studies the grade-school math (GSM) problem-solving abilities of LLMs using pairs of chained word problems, where the answer to the second problem depends on correctly solving the first. It reports a significant reasoning gap in most LLMs: they perform worse on the chained pairs than on the same questions asked independently, with the gap more pronounced in smaller, more cost-efficient, and math-specialized models.
Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis.	Training generative models like GANs with limited data is challenging. Existing Implicit Maximum Likelihood Estimation (IMLE) methods suffer from poor alignment between the latent codes used during training and those used during inference. The proposed approach, RS-IMLE, modifies the prior distribution during training, resulting in better test-time performance and higher-quality image generation.
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models.	This study introduces a unified framework aimed at enhancing training stability in continuous-time consistency models, leading to substantial improvements in the performance of generative models.
DARNet: Dual Attention Refinement Network with Spatiotemporal Construction for Auditory Attention Detection.	DARNet is an innovative model for auditory attention detection (AAD) that improves the decoding of brain signals, such as EEG, by integrating spatiotemporal and dual attention mechanisms.
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads.	DuoAttention is a framework designed to optimize memory usage and reduce latency in long-context large language models (LLMs) by selectively applying full key-value (KV) caching to only the most essential attention heads.
Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement.	Meta Decision Transformer (Meta-DT) aims to enhance generalization in reinforcement learning by integrating transformer-based sequential modeling with effective task representation learning.
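
The idea behind the "Were RNNs All We Needed?" entry above can be illustrated with a small sketch. Once the gate and the candidate state depend only on the current input, the recurrence `h_t = (1 - z_t) * h_{t-1} + z_t * c_t` is a chain of affine maps, and composing affine maps is associative, which is what allows a parallel scan. The code below is a simplified assumption-laden version, not the paper's implementation: it uses scalar inputs and weights, plain Python instead of a tensor library, and hypothetical function names.

```python
import math
from itertools import accumulate


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def gates_and_candidates(xs, w_z, b_z, w_h, b_h):
    """Gate z_t and candidate c_t depend only on x_t, never on h_{t-1}."""
    zs = [sigmoid(w_z * x + b_z) for x in xs]
    cands = [w_h * x + b_h for x in xs]
    return zs, cands


def min_gru_sequential(xs, w_z, b_z, w_h, b_h, h0=0.0):
    """Reference implementation: step through the sequence one token at a time."""
    zs, cands = gates_and_candidates(xs, w_z, b_z, w_h, b_h)
    h, out = h0, []
    for z, c in zip(zs, cands):
        h = (1.0 - z) * h + z * c
        out.append(h)
    return out


def combine(left, right):
    """Compose two affine steps h -> a*h + b (apply left first, then right)."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def min_gru_scan(xs, w_z, b_z, w_h, b_h, h0=0.0):
    """Same outputs via a prefix scan over the associative `combine` operator.

    accumulate() runs sequentially here; because `combine` is associative,
    a parallel scan could evaluate the same prefixes in logarithmic depth.
    """
    zs, cands = gates_and_candidates(xs, w_z, b_z, w_h, b_h)
    steps = [(1.0 - z, z * c) for z, c in zip(zs, cands)]
    return [a * h0 + b for a, b in accumulate(steps, combine)]


if __name__ == "__main__":
    xs = [0.5, -1.0, 2.0, 0.1]
    params = dict(w_z=0.7, b_z=-0.2, w_h=1.3, b_h=0.4, h0=0.25)
    seq = min_gru_sequential(xs, **params)
    scan = min_gru_scan(xs, **params)
    print(all(math.isclose(s, p) for s, p in zip(seq, scan)))
```

A standard GRU cannot be rewritten this way because its gates read `h_{t-1}`, so each step's coefficients are unknown until the previous step has finished.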

News

Link	description
AI gives voice to dead animals in Cambridge exhibition.	Creatures can converse and share their stories by voice or text through visitors’ mobile phones at Museum of Zoology
Three-armed robot conductor makes debut in Dresden.	German city’s Sinfoniker says the aim is not to replace humans but to play music human conductors would find impossible
Tesla’s value drops $60bn after investors fail to hail self-driving ‘Cybercab’.	Analysts criticize lack of detail about the ‘robotaxi’ showcased by CEO Elon Musk
Microsoft may have an audio-to-image generator in the works, new patent shows.	Microsoft has submitted a patent for an AI system that transforms live audio into images using large language models (LLMs). The system is intended to improve communication by creating real-time visuals from audio streams. Once developed, it could potentially be incorporated into Microsoft Teams through Copilot integration.
Australia’s spy chief warns AI will accelerate online radicalization.	Asio boss Mike Burgess says social media impact is a ‘step-change’ in the threat posed by extremism
Google to buy nuclear power for AI datacentres in ‘world first’ deal.	Tech company orders six or seven small nuclear reactors from California’s Kairos Power
Silicon Valley is debating if AI weapons should be allowed to decide to kill.	In late September, Shield AI co-founder Brandon Tseng swore that weapons in the U.S. would never be fully autonomous — meaning an AI algorithm would make the final decision to kill someone. “Congress doesn’t want that,” the defense tech founder told TechCrunch. “No one wants that.”
Zoom’s custom AI avatar tool may come with risks.	The upcoming feature, announced today at Zoom’s annual dev conference, will translate a video clip that users record of themselves into a digital clone — complete with a head, upper arms, and shoulders. Users will be able to type a script of what they want the digital double to say, and Zoom will generate audio that syncs with the avatar’s lip movements.
Generate Video (beta) on Firefly Web App.	During the Adobe MAX conference, Adobe revealed the extension of its Firefly series of creative generative AI models to include video.
OpenAI appoints international expansion boss.	OpenAI has named Oliver Jay as the head of its international expansion, with a focus on AI strategy and operations. The company also revealed the opening of a new APAC office in Singapore and is working on developing datasets for local languages. The o1 model, which incorporates "chain of thought" methods, is designed to improve AI accuracy.
Anthropic challenges OpenAI with affordable batch processing.	Anthropic has introduced a Message Batches API, enabling businesses to handle large data volumes at half the cost of traditional API calls. The API allows for up to 10,000 asynchronous queries within 24 hours, providing a cost-efficient solution by shifting AI processing from real-time to "right-time." This approach encourages AI adoption among mid-sized companies but may draw attention away from the advancement of real-time AI capabilities.
OpenAI Projections Imply Losses Tripling To $14 Billion In 2026.	OpenAI projects losses to rise to $14 billion in 2026, with total losses reaching $44 billion by 2028.
AMD launches AI chip to rival Nvidia's Blackwell.	AMD has introduced the Instinct MI325X AI chip, targeting competition with Nvidia's leading data center GPUs.
Meta’s open AI hardware vision.	Meta unveiled its open AI hardware designs, including the Catalina rack and the enhanced Grand Teton platform, at the OCP Global Summit. Notably, training the Llama 3.1 405B model required 16,000 NVIDIA H100 GPUs, demonstrating Meta's robust scaling infrastructure. These open AI hardware systems are essential for driving further advancements in AI capabilities.
The New York Times warns AI search engine Perplexity to stop using its content.	The New York Times has sent a cease and desist letter to AI startup Perplexity, accusing the company of using its content without authorization for AI search purposes. Perplexity asserts that it does not scrape content for training but instead indexes web pages to provide factual information. The company is currently in discussions with publishers and seeks to resolve the matter by collaborating with the Times and other media organizations.
Decagon raises $65m Series B led by Bain Capital Ventures to bring total funding to $100m.	Decagon has secured $65 million in Series B funding to further develop its AI customer support agents, which are already utilized by companies such as Duolingo and Eventbrite to streamline customer interactions. These AI agents automate routine tasks, allowing customer support teams to focus on more strategic roles. The funding will be used to strengthen Decagon's engineering team and extend its AI solutions into new markets and industry sectors.
New high-quality AI video generator Pyramid Flow launches — and it’s fully open source!	The number of AI video generation models continues to grow with a new one, Pyramid Flow, launching this week and offering high-quality video clips up to 10 seconds in length — quickly, and all open source.
This three-person robotics startup is working with designer Yves Béhar to bring humanoids home.	Kind Humanoid's three-person team is developing a whimsical humanoid robot named Mona, specifically designed for home use rather than industrial applications. The team aims to conduct field tests with a dozen initial prototypes next year. Unlike many AI-driven robotics companies that focus on industrial markets and heavy fundraising, Kind prioritizes innovation and efficiency, setting its approach apart from competitors in the robotics space.
INTELLECT–1: Launching the First Decentralized Training of a 10B Parameter Model.	INTELLECT-1 is the first decentralized training run of a 10-billion-parameter model, designed to harness global contributions for open-source AGI development. It utilizes OpenDiLoCo scaling to train large models across distributed devices, with innovations in bandwidth efficiency and fault tolerance. The new Prime framework further enhances decentralized training by optimizing compute utilization, achieving a 98% utilization rate during the INTELLECT-1 training run. This marks a significant advancement in decentralized AI model training.
Elon Musk Shows Off Tesla ‘Robotaxi’ That Drives Itself.	“You could fall asleep and wake up at your destination,” said Mr. Musk, Tesla’s C.E.O., but some experts are skeptical that such cars will be ferrying passengers soon.
ByteDance lays off hundreds of TikTok employees in the shift to AI content moderation.	ByteDance’s TikTok is laying off hundreds of employees, mainly in Malaysia, according to Reuters. The cuts come as the social network is increasingly turning to AI for content moderation. The cuts do not impact employees in the U.S.
Microsoft Artificial Intelligence VP Bubeck to Join OpenAI.	Microsoft Corp. said one of its artificial intelligence vice presidents, Sebastien Bubeck, is leaving to join OpenAI, where Microsoft is both the largest investor and a rival.
‘It’s not me, it’s just my face’: the models who found their likenesses had been used in AI propaganda.	London-based Synthesia’s technology was employed to make deepfake videos for authoritarian regimes
Amazon.com joins push for nuclear power to meet data center demand.	Company says it signed three agreements on developing small modular reactor nuclear power technology
Un Ministral, des Ministraux.	On the first anniversary of Mistral 7B, Mistral launched two advanced models designed for on-device and edge computing: Ministral 3B and Ministral 8B. Both target the sub-10-billion-parameter category, offering superior knowledge, reasoning, and efficiency for their size. They also support a context length of up to 128k and deliver faster inference.
Former Palantir CISO Dane Stuckey joins OpenAI to lead security.	Dane Stuckey, the former CISO of analytics firm Palantir, has joined OpenAI as its newest CISO, serving alongside OpenAI head of security Matt Knight.
Can AI really compete with human data scientists? OpenAI’s new benchmark puts it to the test.	OpenAI has introduced a new tool to measure artificial intelligence capabilities in machine learning engineering. The benchmark, called MLE-bench, challenges AI systems with 75 real-world data science competitions from Kaggle, a popular platform for machine learning contests.
Adobe’s AI video model is here, and it’s already inside Premiere Pro.	New beta tools allow users to generate videos from images and prompts and extend existing clips in Premiere Pro.
Customize Audio Overviews with Google's NotebookLM.	NotebookLM now enables users to customize their Audio Overview experience, providing greater control over the areas of focus and expertise of the AI hosts. Companies can apply for the new NotebookLM Business pilot program, which includes improved tools designed for professional applications.
Combining next-token prediction and video diffusion in computer vision and robotics.	A new method can train a neural network to sort corrupted data while anticipating next steps. It can make flexible plans for robots, generate high-quality video, and help AI agents navigate digital environments.
Nvidia just dropped a new AI model that crushes OpenAI’s GPT-4—no big launch, just big results.	Nvidia quietly unveiled a new artificial intelligence model on Tuesday that outperforms offerings from industry leaders OpenAI and Anthropic, marking a significant shift in the company’s AI strategy and potentially reshaping the competitive landscape of the field.
Invisible text that AI chatbots understand and humans can’t? Yep, it’s a thing.	A quirk in the Unicode standard harbors an ideal steganographic code channel.
Google supercharges Shopping tab with AI and personalized recommendation feed.	After bringing generative AI to Search in 2023, Google is supercharging its Shopping tab with the technology. The company announced on Tuesday that it will use AI to help users shop for products based on exactly what they’re looking for. It also launched a new scrollable feed of personalized, shoppable products.
Adobe’s Project Super Sonic uses AI to generate sound effects for your videos.	Adobe's Project Super Sonic leverages text-to-audio technology, object recognition, and voice input to create audio effects for video projects.
White House considers expanding Nvidia’s and AMD’s AI chip export limits to additional countries.	The Biden administration is contemplating limitations on AI chip sales from Nvidia and AMD to countries in the Persian Gulf, citing national security concerns.

Resources

Link	description
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering.	Introduces a new benchmark to assess machine learning agents' proficiency in machine learning engineering tasks. The benchmark consists of 75 Kaggle competitions focused on key MLE skills, including model training, dataset preparation, and experiment execution. OpenAI's o1-preview model, utilizing the AIDE scaffolding, reaches a bronze medal level in 16.9% of the competitions.
Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System.	Presents a novel framework aimed at improving both communication efficiency and task effectiveness in LLM-based multi-agent systems through targeted LLM training. It introduces an iterative "generate, rank, select, and train" approach, enhanced by a reward function to optimize performance, token usage, and communication efficiency. The framework integrates Monte Carlo Tree Search-inspired techniques for DPO data generation, promoting diverse exploration. Experimental results show consistent improvements over single-agent baselines and standard multi-agent systems (MAS) using Llama 3 8B, achieving a 2.8x performance boost while utilizing fewer than 10% of tokens on tasks involving extensive information exchange.
Zyphra's Mamba 2 based model beats Mistral.	Introduces the first state space-style model that surpasses transformers at the 7B scale. It excels in understanding and generating long-context data, thanks to the linear time scaling of the Mamba 2 blocks, which significantly enhances its efficiency and performance.
OpenAI's Swarm.	OpenAI has introduced a lightweight framework designed to facilitate communication between agents. While it will not receive further updates, the framework could still offer valuable ideas and inspiration for future developments.
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models.	EvolveDirector aims to develop a competitive text-to-image generation model using open, publicly available resources, avoiding the limitations imposed by proprietary models.
Rethinking the Evaluation of Visible and Infrared Image Fusion.	Researchers propose the Segmentation-oriented Evaluation Approach (SEA) to improve the evaluation of Visible and Infrared Image Fusion (VIF) techniques, which play a critical role in applications such as object detection and semantic segmentation.
A Gentle Introduction and Tutorial on Deep Generative Models in Transportation Research.	This tutorial gives a comprehensive overview of how deep generative models can be applied to transportation problems.
Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis.	Trans4D is a new framework developed to address the challenges of realistic 4D scene transitions, enhancing text-to-4D synthesis. It offers improved capabilities in generating coherent, dynamic 4D scenes from textual descriptions, making it more suitable for tasks that require accurate spatial and temporal scene transitions.
DocMTAgent.	DelTA, short for Document-levEL Translation Agent, is an online translation tool designed for handling document-level translations. It leverages a multi-level memory architecture to improve translation accuracy and coherence across larger texts, providing more context-aware translations compared to sentence-level models.
Fast Feedforward 3D Gaussian Splatting Compression.	Fast Compression of 3D Gaussian Splatting (FCGS) is a new model designed to eliminate the need for the slow, per-scene optimization required by earlier methods. Instead, FCGS achieves rapid compression using a quick feed-forward pass, reducing the processing time from minutes to just seconds. This significantly accelerates the compression process while maintaining high-quality results for 3D data.
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling.	OneRef presents an optimized framework for referring segmentation by integrating visual and language feature spaces within a unified transformer architecture.
SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction.	SmartPretrain offers a versatile, model-agnostic, and dataset-agnostic self-supervised learning framework designed to enhance motion prediction in autonomous vehicles.
UvA - An Introduction to Group Equivariant Deep Learning.	Resources for studying deep learning techniques applied to specific types of geometric data while addressing architectural limitations.
Diffusion model simulating CS:GO.	An open-source replication of a diffusion model that generates visual simulations of a video game, using keyboard and mouse inputs to influence the output.
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs.	This study addresses the shortcomings of current alignment algorithms in large language models (LLMs), which tend to overfit to relative preferences and neglect response quality. The authors introduce reward-conditioned LLM policies and a novel data relabeling method that incorporates response quality, enabling the model to better generalize to optimal responses.
entropix.	Entropix is a tool designed to modify the sampling behavior of language models.
LoLCATs Blog Part 2: How to Linearize LLMs for Me and You.	Hazy Research has published another insightful post that delves into techniques for linearizing existing language models while maintaining much of their performance. This exploration highlights methods to simplify model architectures, making them more efficient, without significantly compromising their effectiveness in tasks like text generation and understanding.
TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control.	TextCtrl is a newly introduced diffusion-based method designed to enhance scene text editing. It achieves a balance between maintaining content accuracy and preserving the original style, ensuring that both the textual content and the visual appearance remain consistent during edits.
Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies.	iDP3 is an advanced 3D visuomotor policy designed to enable humanoid robots to autonomously navigate and perform tasks in a variety of real-world environments. This improved policy enhances the robot's ability to perceive and interact with its surroundings, making it more adaptable and efficient in complex and dynamic settings.
tabled.	Tabled is a small library for detecting and extracting tables. It uses Surya to find all the tables in a PDF, identifies the rows/columns, and formats cells into markdown, csv, or html.
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer.	HART is a cutting-edge visual generation model designed to produce high-quality 1024x1024 images, presenting a challenge to the capabilities of diffusion models. It enhances image reconstruction and reduces training costs by employing a hybrid tokenizer that integrates both discrete and continuous tokens, resulting in more efficient and effective image generation.
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention.	The Deformable Bi-level Routing Attention (DBRA) module is an innovation designed to enhance attention mechanisms in vision transformers. DeBiFormer, which is built upon DBRA, optimizes the selection of key-value pairs in the attention process, resulting in more efficient computations and better interpretability of queries within attention maps. This leads to improved performance and understanding of how the model attends to different parts of an image.
Six tips for going public with your lab’s software.	It’s not enough to write high-quality programs. If you want to make your apps public — and usable — you should also follow these steps.
CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos.	CoTracker3 is a point-tracking model that narrows the performance gap between synthetic and real video data by using semi-supervised training on pseudo-labelled real videos.
A Consistency-Aware Spot-Guided Transformer for Versatile and Hierarchical Point Cloud Registration.	Researchers have developed a novel consistency-aware spot-guided Transformer designed to improve the efficiency and accuracy of point cloud registration.
Ditto - the simplest self-building coding agent.	Ditto is a user-friendly tool that allows you to generate a multi-file Flask application from simple natural language descriptions using a no-code interface. By leveraging a simple LLM loop with a few tools, Ditto automates the coding process, (occasionally) turning your ideas into functional web applications (or at least trying and getting close).
F5 Text-to-Speech System.	F5-TTS is a non-autoregressive, zero-shot text-to-speech system featuring a flow-matching mel spectrogram generator and a diffusion transformer. Developed on the MLX framework, F5 outperforms earlier systems such as E2 TTS by incorporating ConvNeXT v2 blocks for improved text alignment, enabling high-quality speech generation in approximately 11 seconds on modern hardware.
Movie Gen Bench.	"Movie Gen Bench" is an evaluation benchmark designed to assess performance in both video (Video Bench) and audio (Audio Bench). It includes 1,003 prompts that encompass a variety of testing aspects and concepts.
LongAlign.	LongAlign enhances the capability of text-to-image (T2I) diffusion models to process lengthy text inputs by incorporating segment-level encoding and a decomposed preference optimization approach.
Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective.	DiGIT is an auto-regressive generative model that forecasts tokens in a latent space through self-supervised learning. This discrete tokenizer enhances image generation on ImageNet by clustering hidden states derived from DINOv2.
FL-Launching (Fling).	The FedPart method tackles the layer mismatch problem in federated learning by limiting model updates to designated layers in each training round.
Distributed Training Guide.	This is an in-depth guide on best practices for distributed training, troubleshooting errors, and maximizing the use of available resources.

Perspectives

Link	description
Nobel winner Geoffrey Hinton is the ‘godfather of AI’. Here’s an offer he shouldn’t refuse…	The computer scientist’s dogged belief in the potential of neural networks helped unlock machine learning. But he’d be wise to remember the experience of a fellow laureate
Machines of Loving Grace.	Dario Amodei, CEO of Anthropic, often writes internal memos, and one of them was published externally. In this memo, he explores the potential extremely positive impact of successfully building powerful AI systems. He envisions how AI could radically transform the world for the better, improving areas like science, economics, and societal well-being, while acknowledging the immense responsibility of ensuring AI development is aligned with human interests and safety.
This AI-Powered Invention Machine Automates Eureka Moments.	Iprova's AI-driven software analyzes diverse technical literature to generate patentable inventions by linking previously unrelated ideas. It uses semantic search and generative AI to identify novel inventions for companies like Procter & Gamble and Panasonic. Although AI plays a key role, human insight remains essential for applying the inventions practically, especially in fast-evolving industries. Iprova highlights the importance of human creativity in refining and validating invention ideas, ensuring that AI serves as a tool to enhance rather than replace human innovation.
Burn the Playbooks.	AI excels at tasks that follow structured rulesets, such as automating tax processes or solving math problems, where it can often outperform humans. However, relying too much on playbook-driven approaches in our work risks stifling human creativity, a key trait that differentiates us from machines. Overemphasizing formulaic tasks could make us more dependent on AI's strengths, limiting our own unique creative potential and inadvertently making us more "machine-like" in areas where creativity and flexibility are crucial.
Hurricane Helene and the ‘Fuck It’ Era of AI-Generated Slop.	An AI-generated image depicting Hurricane Helene has gone viral, despite viewers being fully aware that it isn't real. The image has sparked widespread attention and discussion, highlighting the power of AI-generated content to captivate audiences even when the authenticity is known. This trend reflects the growing influence of AI in shaping public perception and the viral nature of digital content.
OpenAI pursues public benefit structure to fend off hostile takeovers.	OpenAI is planning to restructure as a public benefit corporation (PBC) to safeguard against hostile takeovers and ensure its mission of benefiting humanity remains intact. This change will help OpenAI maintain its commitment to ethical AI development, prioritizing public good over profit while allowing the organization to continue innovating in a sustainable and mission-driven way.
AI Will Take Over Human Systems From Within.	In this post, Yuval Noah Harari, the Israeli historian and author of “Sapiens,” “Homo Deus,” and “Nexus,” explores the impact of information networks and AI on societal narratives, which can either unite or fragment communities. He cautions that AI, functioning as an "alien intelligence," could centralize power due to its lack of self-correcting mechanisms, potentially threatening democratic systems. Harari stresses the importance of strong institutions to uphold truth in a world increasingly influenced by AI-driven decision-making across different sectors.
Sticky humans in a post-AGI world.	AI tutors encounter considerable difficulties in replicating the social and intellectual interactions offered by human teachers. Although AI has made progress, it still falls short in handling complex educational tasks and cannot deliver the nuanced socio-intellectual experiences that human educators provide. A hybrid approach, where AI complements rather than replaces human teachers, may be more effective, given the essential social and cultural elements of the learning process.
AI has dreamt up a blizzard of new proteins. Do any of them actually work?	Emerging protein-design competitions aim to sift out the functional from the fantastical. But researchers hope that the real prize will be a revolution in the field.
Considerations for governing open foundation models.	Foundation models drive AI innovation, but debates on their release—whether open or closed—raise concerns about potential risks and the impact of regulations on innovation.
I AI-generated some podcasts – and the results are uncanny.	Google’s new tool NotebookLM lets you create podcasts at the click of a button. They’re way more realistic than you’d think …
SB 1047: Our Side Of The Story.	California's proposed SB 1047, which sought to require AI companies to address existential risks posed by their technologies, was vetoed by Governor Newsom. He argued that the bill did not adequately regulate smaller, potentially dangerous AI models. Despite strong support from AI safety advocates like Dan Hendrycks and high-profile figures such as Elon Musk, the bill faced opposition from major AI companies, including OpenAI and Google. Newsom's veto has sparked discussions within the AI community about future regulatory strategies and potential collaborations with broader political groups to create comprehensive AI safety measures.
Overview of strong human intelligence amplification methods.	Advancements in AI depend on developing humans with enhanced cognitive abilities to effectively manage the complexities of AGI development. Approaches such as brain emulation, genomic modifications, adult brain gene editing, and brain-brain interfaces are being explored, each presenting distinct challenges and risks. These efforts are aimed at solving deep philosophical issues, significantly amplifying human intelligence, and addressing the potential threats posed by AGI.
LLMs don’t do formal reasoning - and that is a HUGE problem.	A study conducted by Apple raises questions about the effectiveness of large language models (LLMs), revealing that they primarily depend on pattern matching instead of formal reasoning. This reliance results in fragile and inconsistent outcomes, challenging the robustness of LLMs in tasks requiring deeper cognitive processes.
Why ChatGPT maker OpenAI is at fight with Open AI.	OpenAI is currently engaged in a legal dispute with Guy Ravine's company, Open AI, over the rights to the "Open AI" name and the original open-source AI vision. The conflict centers on ownership of the name and the direction of the open-source principles that initially defined the AI development approach.
AI mediation tool may help reduce culture war rifts, say researchers.	System built by Google DeepMind team takes individual views and generates a set of group statements
Here’s the deal: AI giants get to grab all your data unless you say they can’t. Fancy that? No, neither do I.	Data is vital to AI systems, so firms want the right to take it and ministers may let them. We must wake up to the danger
Where’s The Generative AI ROI? Start With The Supply Chain.	Generative AI is reshaping supply chain operations by handling unstructured documents, resulting in substantial time and cost savings. Flexport, a technology company focused on supply chain solutions, has used AI to automate and optimize document management, cutting processing time by 80%. This use of AI highlights its practical value in revenue-generating activities rather than merely in theoretical advancements.

Back to index

ML news: Week 7 - 13 October

Research

Link	description
A multimodal generative AI copilot for human pathology.	PathChat is a vision-language AI assistant designed for pathology, combining a foundational vision encoder and a large language model, achieving state-of-the-art performance on diagnostic tasks and outperforming other multimodal AI systems, with potential applications in education, research, and clinical decision-making.
Meta Movie Gen.	Meta has developed a cutting-edge video generation model with 30 billion parameters, which required 6,144 H100 GPUs for training. The model was trained using 1 billion images and 100 million carefully selected videos. Notably, it is built on a Temporal Autoencoder and a Llama-style transformer trained with flow matching. Meta also published a highly detailed 92-page research paper, making it one of the most comprehensive reports on the subject.
When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1.	Large language models face limitations because they rely on next token prediction. Although OpenAI's o1 model was trained with a new objective focused on reasoning traces, it still exhibits some of the same constraints associated with next token prediction.
Contextual Document Embeddings.	This paper presents a method similar to a neural TF-IDF: it gathers information from the entire corpus rather than embedding each document in isolation. It effectively captures contextual information from surrounding documents and has achieved state-of-the-art results on the MTEB benchmark.
PairDistill: Pairwise Relevance Distillation for Dense Retrieval.	This project introduces a novel technique called Pairwise Relevance Distillation (PairDistill), aimed at enhancing the accuracy of dense retrieval methods.
Modeling relationships to solve complex problems efficiently.	Associate Professor Julian Shun develops high-performance algorithms and frameworks for large-scale graph processing.
Factual Accuracy in AI.	Integrative Decoding is a technique designed to improve the factual accuracy of large language models, particularly for open-ended tasks. This method helps ensure more reliable and accurate outputs by refining the model's ability to integrate information during generation.
Dynamic Diffusion Transformer.	The Dynamic Diffusion Transformer (DyDiT) improves the efficiency of diffusion models in image generation by building on the Diffusion Transformer (DiT). It achieves this by dynamically adjusting computational resources across different timesteps and spatial regions, minimizing redundancy and optimizing performance.
Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach.	The Frame-Aware Video Diffusion Model (FVDM) enhances video generation by overcoming the limitations of existing models. Instead of using a single timestep for the entire video clip, FVDM introduces a vectorized timestep variable, enabling each frame to follow its own noise schedule. This approach improves the quality and coherence of generated videos.
What Matters for Model Merging at Scale?	Model merging is a technique that allows the combination of two models to achieve the performance benefits of both. However, it does not always scale effectively with larger model sizes. This paper investigates the requirements and challenges for making model merging work efficiently with very large models, addressing issues related to scalability, performance trade-offs, and optimal merging strategies.
nGPT: Normalized Transformer with Representation Learning on the Hypersphere.	A significant amount of research effort is focused on normalizing the internal representations of language models. This study demonstrates that by placing every internal vector on a hypersphere, convergence time is significantly reduced for models of reasonable size, leading to more efficient training.
Genomic Foundation Model Benchmarking.	GFMBench is a newly developed framework aimed at tackling challenges in the development of genomic foundation models (GFMs) by offering standardized benchmarking tools. It supports the evaluation of GFMs with millions of genomic sequences and hundreds of tasks, automating the benchmarking process for open-source GFMs to streamline their development and comparison.
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations.	This study provides further evidence that language models internally encode signals when they produce non-factual information. Understanding these internal cues can help guide models more effectively and reduce the occurrence of hallucinations, offering a potential strategy for improving their reliability.
Differential Transformer.	Transformers often over-allocate attention to irrelevant context, leading to inefficiencies. This research presents the Diff Transformer, which enhances attention to relevant information while filtering out noise. It introduces a differential attention mechanism that computes attention scores by subtracting two separate softmax attention maps. This subtraction effectively cancels out noise and encourages sparse, more focused attention patterns, improving the model's performance on tasks requiring precise context understanding.
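
The differential attention described in the Differential Transformer entry above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: it assumes a single attention head, no masking, and a fixed scalar weight `lam` on the second map, and the helper names (`softmax`, `attention_map`, `diff_attention`) are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Scaled dot-product attention weights, one softmax row per query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]


def diff_attention(q1, k1, q2, k2, values, lam=0.5):
    """Subtract a second softmax map (weighted by lam) from the first.

    lam is a fixed scalar here, which is an assumption of this sketch.
    Returns (differential weights, weighted sum of values).
    """
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    weights = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    dim = len(values[0])
    output = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(dim)]
        for row in weights
    ]
    return weights, output


# Toy inputs: two tokens, two-dimensional projections (illustrative values).
q1 = [[1.0, 0.0], [0.0, 1.0]]
k1 = [[1.0, 0.0], [0.0, 1.0]]
q2 = [[0.0, 0.0], [0.0, 0.0]]  # yields a uniform second attention map
k2 = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]

weights, output = diff_attention(q1, k1, q2, k2, values, lam=0.5)
```

Because each softmax row sums to 1, every row of the differential weights sums to `1 - lam`, and individual weights can be negative. Attention mass that both maps assign to the same position is what the subtraction cancels.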

News

Link	description
Brave New World: Leo AI and Ollama Bring RTX-Accelerated Local LLMs to Brave Browser Users.	Nvidia's RTX-Acceleration combined with Ollama allows for running local models in the browser.
Liquid Foundation Models.	Liquid AI has introduced its first generation of Liquid Foundation Models (LFMs), offering state-of-the-art performance while minimizing memory consumption. The LFMs, which are optimized for different hardware platforms, include 1B, 3B, and 40B parameter models. These models are already accessible on platforms like LIQUID PLAYGROUND and will soon be available on Cerebras. They are particularly adept at processing sequential data and provide innovations in efficiency and scalability across industries like financial services and biotechnology.
Introducing Copilot Labs and Copilot Vision.	Microsoft is launching Copilot Labs to test advanced AI tools, including Think Deeper and Copilot Vision. These tools aim to expand the capabilities of their AI systems, offering enhanced functionality and deeper insights.
OpenAI’s DevDay brings Realtime API and other treats for AI app developers.	It’s been a tumultuous week for OpenAI, full of executive departures and major fundraising developments, but the startup is back at it, trying to convince developers to build tools with its AI models at its 2024 DevDay. The company announced several new tools Tuesday, including a public beta of its “Realtime API”, for building apps with low-latency, AI-generated voice responses. It’s not quite ChatGPT’s Advanced Voice Mode, but it’s close.
Microsoft brings AI-powered overviews to Bing.	Microsoft has introduced Bing generative search, an AI-driven feature that gathers and summarizes information from the web, offering users more concise and aggregated search results.
KoBold Metals, which uses AI to help find critical minerals for the energy transition, raises $491M.	Earlier this year, KoBold Metals found what might be one of the largest high-grade copper deposits of all time, with the potential to produce hundreds of thousands of metric tons per year, the company’s CEO said.
OpenAI gets $4 billion revolving credit line, giving it more than $10 billion in liquidity.	OpenAI has secured over $10 billion in liquidity, achieving a valuation of $157 billion following its latest funding round. The company raised $6.6 billion from key investors, including Microsoft and Nvidia, but is contending with substantial operational costs, particularly the need for additional GPUs to support large language model (LLM) training. OpenAI is currently exploring restructuring strategies to enhance financial growth and sustainability within the AI industry.
Black Forest Labs, the startup behind Grok’s image generator, releases an API.	Black Forest Labs, the Andreessen Horowitz-backed startup behind the image generation component of xAI’s Grok assistant, has launched an API in beta — and released a new model.
DataPelago raises $47M to optimize hardware for analytical workloads.	LLMs depend on vast amounts of unstructured data for training, but this data requires extensive cleaning and processing before it becomes useful. Traditional data processing systems, which are based on CPUs and current software architectures, were not designed to handle the scale and complexity of such data, resulting in slow and costly data preparation that hinders AI development. To address these challenges, DataPelago has introduced a Universal Data Processing Engine, designed to overcome performance, cost, and scalability limitations, making AI development faster and more affordable.
Google brings ads to AI Overviews as it expands AI’s role in search.	Google will begin to show ads in AI Overviews, the AI-generated summaries it supplies for certain Google Search queries, and will add links to relevant web pages for some of those summaries as well. It’s also rolling out AI-organized search results pages in the U.S. this week.
Nobel Physics Prize Awarded for Pioneering A.I. Research by 2 Scientists.	Two scientists who contributed to the development of neural networks have been awarded the Nobel Prize in Physics, recognizing their groundbreaking work in advancing artificial intelligence and neural network technologies.
Introducing the Message Batches API.	Anthropic has introduced a new batch processing API that allows developers to submit batches of up to 10,000 queries at once. Each batch is processed within 24 hours and is 50% cheaper than standard API calls, making it a more efficient and cost-effective solution for handling non-time-sensitive tasks.
Update on Reflection-70B.	A detailed post-mortem analysis of the highly anticipated Reflection-70B model revealed issues with its benchmark code, which inflated its performance claims. Although the team has since corrected these bugs, and the model's performance remains impressive, it does not quite reach the originally advertised levels.
Four-legged robot learns to climb ladders.	The proliferation of robots like Boston Dynamics’ Spot has showcased the versatility of quadrupeds. These systems have thrived at walking up stairs, traversing small obstacles, and navigating uneven terrain. Ladders, however, still present a big issue, especially given how ever-present they are in factories and other industrial environments where the systems are deployed.
Braintrust raises $36M Series A.	Braintrust, which helps Airtable, Brex, Notion, and Stripe build AI products, has raised $36M in a Series A led by a16z.
Clout Kitchen raises $4.45M for AI gaming pal that mimics content creators.	Clout Kitchen announced today that it has raised $4.45 million in its seed funding round, which it plans to put towards its new creator-powered products and experiences. The first of these is Backseat AI, an AI-powered buddy for League of Legends that the company created with Tyler “Tyler1” Steinkamp — an AI buddy that can take on the aspect of popular gaming content creators. Clout Kitchen plans to use its funding to expand its team and build out its shared internal tech stack.
AlphaFold wins Nobel Prize in Chemistry.	Demis Hassabis, John Jumper, and David Baker were awarded the Nobel Prize in Chemistry: Hassabis and Jumper for protein structure prediction with AlphaFold, and Baker for computational protein design. Their contributions have significantly advanced the understanding of protein structures and their implications for science and medicine.
OpenAI reducing dependency on Microsoft data centers.	OpenAI is decreasing its reliance on Microsoft's data centers by acquiring its own compute infrastructure, allowing greater independence in its operations. Simultaneously, Microsoft is reducing its dependence on OpenAI as it develops and competes with its own AI products, signaling a shift in the dynamics of their partnership.
TikTok parent company ByteDance has a tool that's scraping the web 25 times faster than OpenAI.	TikTok parent company ByteDance is amassing huge volumes of web data way faster than the other major web crawlers. ByteDance may be planning to release its own LLM, and is aggressively using its web crawler, "Bytespider," to scrape up data to train its models, Fortune reported.
Sonair takes a cue from dolphins to build autonomous 3D vision without lidar.	Ultrasound is perhaps best known as the technology that enables noninvasive body scans and underwater communication and can help us park our cars. A young startup called Sonair out of Norway wants to employ it for something else: 3D computer vision used in autonomous hardware applications.
Tesla’s head of vehicle programs jumps to Waymo ahead of robotaxi reveal.	Tesla has lost a top executive to Waymo in the lead-up to the EV maker’s robotaxi unveiling on Thursday.
Autism ABA Therapy with Llama.	Meta shares a use case of its Llama model for medical and therapeutic benefit.
Uber’s EV ridehailing business is maturing.	The company also announced it was adding ChatGPT to its driver app to handle EV questions.
Amazon’s new AI guides can help shoppers find what they need.	The new AI Shopping Guides feature aims to help users find what they need with more informed product suggestions.
TikTok joins the AI-driven advertising pack to compete with Meta for ad dollars.	TikTok's Smart+ is an AI-powered ad-buying tool designed to automate and optimize ad campaigns, giving marketers the option to selectively utilize its features for enhanced performance. The tool seeks to rival Meta's Advantage+ by offering streamlined ad management and improved return on investment (ROI). Early results indicate significant gains in ad spend efficiency and conversion rates, positioning TikTok as a strong contender in the digital advertising market.
OpenAI partners with Cosmopolitan and Elle publisher Hearst.	ChatGPT will provide citations and direct links to the company's content.
Meta debuts new generative AI tools for creating video-based ads.	Meta Platforms Inc. today said it’s rolling out a full-screen video tab on Facebook in recognition of the fact that its users spend more time watching videos than anything else on its platforms.

Resources

Link	description
Introducing the Open FinLLM Leaderboard.	The Open FinLLM Leaderboard provides a dedicated evaluation platform designed specifically for financial language models. It emphasizes key financial tasks like predicting stock movements, analyzing sentiment, and extracting information from financial reports.
Infinite-Fractal-Stream: Small Scale Proxy for Scaling-Centric ML.	Model testing in the image domain is often constrained by low-quality, small datasets like CIFAR10. This GitHub repository provides a tool that generates infinite, complex fractals in the form of images or videos, offering a new approach for testing models.
Auto Jobs Applier.	A highly viral repository leverages language models to automate the job application process, adding an extra layer of personalization to tailor applications for each position.
Real-World Benchmarks Make Membership Inference Attacks Fail on Diffusion Models.	This study uncovers major weaknesses in existing membership inference attacks (MIAs) used to detect unauthorized data usage in diffusion models. It introduces CopyMark, a more realistic benchmark for assessing MIAs on pre-trained models, providing unbiased datasets and fair evaluation techniques to improve the accuracy and reliability of these attacks.
ImageFolder: Autoregressive Image Generation with Folded Tokens.	ImageFolder is a semantic tokenizer developed to balance the trade-off between image reconstruction accuracy and generation quality in visual generative models, improving the overall performance of these models in both tasks.
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models.	Grounded-VideoLLM is a novel Video-Large Language Model (Video-LLM) created to enhance the fine-grained understanding of specific moments in videos. By incorporating a temporal stream and discrete temporal tokens, the model more effectively captures the relationships between frames and timestamps, improving its ability to interpret and analyze detailed video content.
Autoregressive Action Sequence Learning for Robotic Manipulation.	The Chunking Causal Transformer (CCT) is a new autoregressive architecture developed specifically for robotic manipulation tasks. It is designed to improve the model's ability to process sequential data efficiently, optimizing performance in real-time robotic control and manipulation scenarios.
FacePoke.	FacePoke is a tool designed for rapid editing of faces in both videos and images, allowing users to make quick adjustments and modifications with ease.
pipeline_parallel.py.	A large model training lead at Hugging Face has shared an excellent 200-line example of parallelism built from scratch, demonstrating efficient techniques for distributing computational tasks, which is particularly useful for large-scale model training.
CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs.	As language models become increasingly proficient at writing code, many existing benchmarks are approaching saturation. This paper proposes a more challenging benchmark designed to assess how well models perform on reasoning and code generation tasks, pushing beyond basic code-writing capabilities to evaluate deeper problem-solving skills.
Intensify.	Intensify is a Python package that allows you to colorize text based on intensity values. It provides an easy-to-use interface for applying color gradients to text or background colors in the terminal.
Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality.	JEDi is a new metric built on the Joint Embedding Predictive Architecture (JEPA), designed to enhance evaluation accuracy with fewer samples. It better aligns with human assessments, making it a more robust alternative to the FVD (Fréchet Video Distance) metric for evaluating generative models.
PRFusion: Toward Effective and Robust Multi-Modal Place Recognition with Image and Point Cloud Fusion.	PRFusion and PRFusion++ are multimodal models developed to enhance place recognition in robotics and computer vision. By combining information from multiple sensory inputs, these models improve the accuracy and robustness of place recognition tasks, making them more effective in real-world applications.
Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia.	This paper presents ProLIP, a novel method for adapting vision-language models such as CLIP without adding additional parameters. ProLIP fine-tunes only the final projection matrix of the vision encoder, enabling it to deliver strong performance in few-shot classification tasks while maintaining the model's efficiency.
ScienceAgentBench.	The benchmark code for the science agent test is designed to evaluate how effectively models can contribute to novel scientific discoveries. It provides a framework for assessing a model's ability to generate innovative ideas, solve complex scientific problems, and make meaningful advances in various scientific fields.
Controlled Visual Generation.	Controllable AutoRegressive Modeling (CAR) is a novel framework that introduces precise control mechanisms to pre-trained visual autoregressive models. This method enables more refined and targeted image generation by progressively improving control representations, allowing for fine-tuned outputs with reduced computational resources.
PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners.	PredFormer is a newly developed transformer-based method for spatiotemporal predictive learning, offering superior performance in both accuracy and efficiency compared to existing approaches. It excels in tasks that involve predicting changes over time and space, making it a powerful tool for various applications in fields like video analysis, weather forecasting, and robotics.
GenSim2: Scaling Robotic Data Generation with Multi-modal and Reasoning LLMs.	This paper presents an innovative approach to scaling robotic data collection by utilizing an enhanced, high-quality physics simulation dataset. The improved simulation environment enables more efficient data generation for training robots, offering a scalable and cost-effective method to collect large amounts of accurate and diverse data for robotic learning and development.
Learning Efficient and Effective Trajectories for Differential Equation-based Image Restoration.	This project introduces a novel differential equation-based approach for image restoration. By leveraging mathematical models grounded in differential equations, the method enhances the ability to recover and restore degraded or noisy images, providing improved accuracy and performance in image restoration tasks.
Pixtral 12B.	The Mistral team has provided detailed insights into the training process and architecture of their vision-language model, which has demonstrated solid performance. The model incorporates advanced techniques for effectively integrating visual and linguistic data, allowing it to perform well on a variety of tasks that require understanding both images and text. The shared information includes specifics on data preprocessing, model architecture, and the optimization strategies employed during training.
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering.	MLE-bench is a benchmark created to evaluate AI agents' capabilities in machine learning engineering. It includes a curated selection of 75 Kaggle competitions to test various skills, such as model training, dataset preparation, and optimization. The benchmark aims to assess how well AI agents can handle practical machine learning tasks, providing a comprehensive evaluation of their engineering proficiency.
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate.	The Modality Integration Rate (MIR) is a new metric designed to evaluate the effectiveness of multi-modal pre-training in Large Vision Language Models. It measures how well different modalities, such as visual and textual data, are integrated during the pre-training process, offering insights into the model's ability to leverage information from both sources to improve performance on multi-modal tasks.
Aria: First Open Multimodal Native MoE Model.	A highly impressive new vision-language model has been released with open weights, code, and a comprehensive research report. It achieves performance on par with closed models for long video understanding, a challenge that has proven difficult for other open models like Pixtral and Molmo. This advancement represents a significant breakthrough in the field of open-source vision-language models.
IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation.	IterComp is a new framework developed to enhance compositional text-to-image generation by integrating the strengths of multiple advanced diffusion models, including RPG, Stable Diffusion 3, and FLUX. By leveraging these models, IterComp improves the quality and coherence of generated images, especially when handling complex textual prompts that require multiple elements to be composed accurately.
MatMamba.	MatMamba is a novel architecture for sequence processing, building upon the Mamba2 framework by incorporating a Matryoshka-like design. This approach allows a single model to be trained at multiple granularities, enabling the extraction of various smaller, nested submodels. This hierarchical structure enhances flexibility and efficiency, allowing the model to adapt to different levels of complexity and resource constraints.
O1 replication progress report.	Researchers from GAIR and NYU have been investigating the critical algorithmic advancements behind OpenAI's o1 model's exceptional performance. In their report, they introduce the concept of "Journey Learning" data, a novel approach that, when used in training, boosts math performance by 8% in absolute terms. This innovation highlights how specific data types can significantly enhance a model's reasoning and problem-solving abilities.

Perspectives

Link	description
Nuclear power for AI: what it will take to reopen Three Mile Island safely.	As Microsoft strikes a deal to restart a reactor at the notorious power station, Nature talks to nuclear specialists about the unprecedented process.
‘In awe’: scientists impressed by latest ChatGPT model o1.	The chatbot excels at science, beating PhD scholars on a hard science test. But it might ‘hallucinate’ more than its predecessors.
Can AI have common sense? Finding out will be key to achieving machine intelligence.	The advent of LLMs has reopened a debate about the limits of machine intelligence — and requires new benchmarks of what reasoning consists of.
How your brain detects patterns in the everyday: without conscious thought.	Neurons in certain brain areas integrate ‘what’ and ‘when’ information to discern hidden order in events in real time.
AI to the rescue: how to enhance disaster early warnings with tech tools.	Artificial intelligence can help to reduce the impacts of natural hazards, but robust international standards are needed to ensure best practice.
Before Mira Murati’s surprise exit from OpenAI, staff grumbled its o1 model had been released prematurely.	OpenAI's accelerated development and safety testing of its latest models, such as GPT-4o and o1, have led to internal friction, resulting in the departure of several senior staff members. The rapid pace of development has raised concerns about the thoroughness of the safety protocols, contributing to tensions within the organization.
I Quit Teaching Because of ChatGPT.	This professor resigned from teaching due to the widespread use of large language models (LLMs) like ChatGPT among students, which they felt undermined academic integrity and the traditional learning process.
Three Subtle Examples of Data Leakage.	This article examines the risks of data leakage in machine learning, showcasing two real-world cases where improper data handling resulted in misleading model performance. In one instance, a company incorrectly filtered data by an upper price limit before modeling, while another organization encountered problems by not following a strict chronological split. The key lessons emphasize the critical need for detecting data leakage and understanding its detrimental effects on model accuracy and reliability.
The real data wall is billions of years of evolution.	AI development is encountering a potential obstacle known as the "data wall," as language models near the limit of available textual data for training. This article challenges the idea of using human analogies to overcome these data constraints, pointing out that human intelligence results from vast amounts of data and long evolutionary processes, which differ fundamentally from AI. While human learning strategies may not directly translate to AI, this doesn't preclude progress through other modalities, such as multimodal data, or advancements in algorithms that could push AI capabilities further.
AI will use a lot of energy. That's good for the climate.	AI data centers are significantly increasing the demand for clean, 24/7 energy, prompting tech giants to invest heavily in renewable and nuclear power solutions. This growing demand is expected to accelerate the cost reduction of clean energy technologies, driven by their learning rates. Over time, the energy needs of AI could lead to policy shifts and advancements in clean energy infrastructure, fostering faster adoption and development of sustainable energy sources.
I want to break some laws too.	This article explores the use of an automated data cleaning pipeline inspired by the Minipile method, which prunes datasets to deliver significant performance gains with only a fraction of the original data size. By leveraging techniques such as few-shot prompting and clustering, the approach streamlines dataset refinement for AI training, challenging traditional scaling laws by prioritizing data quality over quantity. The results indicate that using foundational datasets with more refined data can optimize AI model training, reducing resource consumption while boosting performance.

Back to index

ML news: Week 30 September - 6 October

Research

Link	description
PGN: The RNN's New Successor is Effective for Long-Range Time Series Forecasting.	The Parallel Gated Network (PGN) is an innovative architecture developed to address the challenges that traditional RNNs face in managing long-term dependencies. By shortening the information propagation path and incorporating gated mechanisms, it efficiently captures past and present time step data.
Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs.	DoSSR is a diffusion-based super-resolution model that improves both performance and efficiency by utilizing pre-trained diffusion models and initiating the process with low-resolution images. This approach accelerates the super-resolution process while maintaining high-quality results.
MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models.	MaskLLM is a pruning technique designed to decrease the computational load of large language models by introducing learnable sparsity. This method optimizes performance while maintaining model efficiency by selectively reducing the number of active parameters.
Law of the Weakest Link: Cross Capabilities of Large Language Models.	This project emphasizes the importance of evaluating large language models (LLMs) based on their combined abilities rather than focusing solely on individual skills. While most models are trained on specialized datasets that target specific capabilities, real-world tasks frequently demand a blend of expertise across different areas, known as cross-capabilities. This approach ensures that models are better suited to handle complex, multifaceted challenges.
Scaling Optimal LR Across Token Horizon.	This paper investigates how to adjust the learning rate as a model's training data increases. While LLaMA applied an exponential scaling factor of -0.28, the paper proposes using an exponential scaling factor of -0.33 for improved performance during training with larger datasets.
Knowledge Graph Embedding by Normalizing Flows.	This paper presents a novel approach to knowledge graph embedding by leveraging group theory to incorporate uncertainty into the process. This method allows for more nuanced and flexible representations of relationships within knowledge graphs, enhancing the model's ability to handle uncertain or ambiguous information.
How AI is improving simulations with smarter sampling techniques.	MIT CSAIL researchers created an AI-powered method for low-discrepancy sampling, which uniformly distributes data points to boost simulation accuracy.
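
As a rough illustration of what "low-discrepancy sampling" means in the last entry above, the sketch below generates a classic Halton sequence. This is a textbook construction used here as a stand-in, not the AI-based method from the MIT CSAIL work; the function names `van_der_corput` and `halton` are chosen for this example only.

```python
def van_der_corput(index: int, base: int = 2) -> float:
    """Return the index-th element of the van der Corput sequence.

    The digits of `index` in the given base are mirrored around the
    radix point, which spreads successive points evenly over [0, 1).
    """
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value


def halton(n_points: int, bases=(2, 3)):
    """Return n_points Halton points, using one coprime base per dimension."""
    return [
        tuple(van_der_corput(i, b) for b in bases)
        for i in range(1, n_points + 1)
    ]


# Eight 2D points that cover the unit square more evenly than
# eight independent uniform random draws typically would.
points = halton(8)
for p in points:
    print(p)
```

Each new point falls into one of the largest remaining gaps, which is why such sequences tend to give lower integration error than plain random sampling for the same number of points.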

News

Link	description
Apple not investing in OpenAI after all, new report says.	Apple is no longer planning to invest in OpenAI, according to a new report from The Wall Street Journal. This comes as OpenAI plans to close a $6.5 billion funding round next week, with investments possible from both Microsoft and Nvidia.
Arcade AI raises 17M to transform commerce.	Arcade AI, a generative product company that launched this week, has announced securing funding from prominent investors as it aims to develop its "prompt to product" system. This system enables the immediate creation of products that are ready for purchase, streamlining the process from concept to consumer.
They stole my voice with AI.	Elecrow is suspected of using AI to clone a voice for promotional videos without consent.
Amazon-backed Anthropic in talks to raise money at $40B valuation: report.	Anthropic, a generative AI startup backed by Amazon and other major tech companies, is in discussions to raise additional funding that could potentially value the company at $40 billion.
OpenAI Reportedly Slated for $500 Million SoftBank Investment.	SoftBank is planning to invest $500 million in OpenAI's latest funding round, which could raise OpenAI's valuation to as high as $150 billion. Microsoft is also participating in this round, highlighting OpenAI's rapid 1,700% revenue growth, despite the company anticipating losses of around $5 billion.
OpenAI Is Growing Fast and Burning Through Piles of Money.	As the company looks for more outside investors, documents reviewed by The New York Times show consumer fascination with ChatGPT and a serious need for more cash.
Altman reportedly asks Biden to back a slew of multi-gigawatt-scale AI datacenters.	OpenAI CEO Sam Altman is calling on the Biden administration to establish AI data centers in the US that could consume up to five gigawatts of power, aiming to maintain US technological leadership over China. The proposal includes building several large-scale data centers across the country. Meanwhile, other tech giants, such as Microsoft and Amazon, are securing nuclear power deals to support their growing AI operations.
Samsung's Galaxy Tab S10 Ultra and Galaxy Tab S10+ are tablets built for AI.	Samsung is once again expanding its tablet lineup, and this time, the company is doing so with AI at the forefront. Today, Samsung revealed the Galaxy Tab S10 series, two models that it says are "built with AI enhancements available right out of the box."
Tesla Full Self Driving requires human intervention every 13 miles.	It gave pedestrians room but ran red lights and crossed into oncoming traffic.
OpenAI Dev Day 2024.	OpenAI's Dev Day 2024 featured several exciting announcements, including the introduction of vision model fine-tuning, a real-time API, prompt caching for faster responses, and model distillation for more efficient deployment of large models. These advancements aim to enhance the capabilities and performance of AI applications across various domains.
Pika 1.5.	Pika has released version 1.5 with more realistic movement, big screen shots, and Pikaffects.
Gov. Newsom vetoes California’s controversial AI bill, SB 1047.	Governor Gavin Newsom has vetoed SB 1047, a proposed bill intended to regulate AI development and enforce safety protocols for high-cost models. Newsom expressed concerns that the bill's broad application to all large, computation-heavy models was not the most effective method for regulating AI. However, he reaffirmed his commitment to AI safety by signing several other AI-related bills and consulting with experts to ensure thoughtful regulation in the future.
OpenAI to remove non-profit control and give Sam Altman equity, sources say.	ChatGPT-maker OpenAI is working on a plan to restructure its core business into a for-profit benefit corporation that will no longer be controlled by its non-profit board, people familiar with the matter told Reuters, in a move that will make the company more attractive to investors.
OpenAI's latest funding.	OpenAI has secured $6.6 billion in new funding, bringing its post-money valuation to $157 billion. Notable investors in this round include Microsoft and Nvidia, with the funds aimed at further scaling AI development and innovation.
Google adds a multi-functional quick insert key and new AI features to Chromebook Plus.	Google is announcing new Chromebook models today with Samsung and Lenovo. With Samsung’s Galaxy Chromebook Plus model in particular, the company is also introducing a new multifunctional quick insert key. But Google doesn’t want to leave existing Chromebook users behind as it added new AI-powered features for existing devices.
Brain-like Computers Tackle the Extreme Edge.	Start-up BrainChip announces a new chip design for milliwatt-level AI inference.
AI Can Best Google’s Bot Detection System, Swiss Researchers Find.	Researchers from ETH Zurich used advanced machine learning to solve 100% of Google's reCAPTCHAv2, designed to distinguish humans from bots.
OpenAI Training Data to Be Inspected in Authors’ Copyright Cases.	At a secure room in its San Francisco office, representatives for authors suing OpenAI will examine materials that were used to train its AI system. They allege copyrighted works were utilized without their consent or compensation.
ByteDance will reportedly use Huawei chips to train a new AI model.	US export restrictions are preventing ByteDance from using NVIDIA chips.
Announcing FLUX1.1 [pro] and the BFL API.	FLUX1.1 [pro] has been released, offering six times faster generation speeds compared to its predecessor, alongside enhanced image quality and overall performance. The new beta BFL API introduces advanced customization options and competitive pricing, making it easier for developers to integrate FLUX’s capabilities. FLUX1.1 [pro] will be available across multiple platforms, providing greater scalability and efficiency for users and developers alike.
OpenAI launches new ‘Canvas’ ChatGPT interface tailored to writing and coding projects.	OpenAI introduced a new way to interact with ChatGPT on Thursday: an interface it calls “canvas.” The product opens a separate window, beside the normal chat window, with a workspace for writing and coding projects. Users can generate writing or code directly in the canvas, and then highlight sections of the work to have the model edit. Canvas is rolling out in beta to ChatGPT Plus and Teams users on Thursday, and Enterprise and Edu users next week.
Anthropic hires OpenAI co-founder Durk Kingma.	Durk Kingma, one of the lesser-known co-founders of OpenAI, today announced that he’ll be joining Anthropic.
OpenAI unveils easy voice assistant creation at 2024 developer event.	Altman steps back from the keynote limelight and lets four major API additions do the talking.

Resources

Link	description
🚀 FlowTurbo.	FlowTurbo is a method developed to accelerate the sampling process in flow-based models while maintaining high-quality outputs. It achieves faster results without compromising the precision or performance of the model.
Transformer4SED.	This repository presents the Prototype-based Masked Audio Model, which enhances sound event detection by leveraging unlabeled data more effectively. The method generates pseudo labels through a Gaussian mixture model, which directs the training of a Transformer-based audio model for improved performance.
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models.	Vector Post-Training Quantization is a technique aimed at enabling ultra-low-bit quantization for large language models, optimizing memory and storage efficiency during deployment without significantly compromising performance.
LightAvatar: Efficient Head Avatar as Dynamic NeLF.	LightAvatar is a head avatar model that improves rendering speed and efficiency using neural light fields (NeLFs).
Separating code reasoning and editing.	Aider has significantly enhanced the performance of general-purpose code editing by employing o1 as the architect and DeepSeek as the writer. This collaboration streamlines the process, leading to more efficient and accurate code generation.
Heralax/Mistrilitary-7b.	This model was trained using army handbooks and incorporates deep, specialized knowledge that is uncommon in fine-tuned models. This unique training approach allows it to possess a rare level of expertise in military-related tasks and information.
Developing a go bot embedding ichiban Prolog.	Ichiban Prolog was integrated into Hellabot, a Go-based IRC bot, to eliminate the need for recompiling when adding new triggers. This integration enables dynamic Prolog code execution, allowing users to adjust the bot's logic in real-time. Future enhancements could focus on minimizing interpreter setup overhead and shifting more of the bot's logic into Prolog for greater flexibility and efficiency.
Emu 3 open early fusion multimodal model.	Emu 3 is a next-token prediction model that surpasses SDXL in image synthesis, LLaVA-1.6 in image understanding, and OpenSora 2 in video generation. With 9 billion parameters, Emu 3 is trained on these tasks in an interleaved manner, similar to Gemini, making it highly versatile and effective across multiple domains.
LOTUS: Diffusion-based Visual Foundation Model for High-quality Dense Prediction.	Using pre-trained diffusion models for tasks like depth estimation has become highly popular and effective. This work demonstrates how certain previous methods contained minor inaccuracies and presents improvements that not only boost performance but also significantly simplify the overall modeling process.
Revisit Anything: Visual Place Recognition via Image Segment Retrieval.	SegVLAD is a method for visual place recognition that emphasizes the analysis of image segments instead of relying on entire images. This approach enhances recognition accuracy by focusing on distinctive parts of the scene, making it more robust in various environments.
LeanRL - Turbo-implementations of CleanRL scripts.	LeanRL is a lightweight library consisting of single-file, PyTorch-based implementations of popular Reinforcement Learning (RL) algorithms. The primary goal of this library is to inform the RL PyTorch user base of optimization tricks to cut training time by half or more.
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding.	E.T. Bench is a newly developed benchmark created to assess the performance of video language models on fine-grained, event-level tasks. Unlike earlier benchmarks that emphasize video-level questions, E.T. Bench spans a variety of time-sensitive tasks across multiple domains, providing a more detailed evaluation of model capabilities.
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning.	Apple is continuing to strengthen its in-house AI capabilities by developing a robust multimodal foundation model. This initiative is part of Apple's broader efforts to integrate advanced AI technologies across its ecosystem, supporting tasks that span text, image, and other data modalities for enhanced user experiences.
The Perfect Blend: Redefining RLHF with Mixture of Judges.	Meta has introduced an impressive new paper detailing the use of a mixture of judges models to effectively conduct multi-task reinforcement learning with human feedback (RLHF) during post-training. This approach significantly enhances the final performance of models across various benchmarks, demonstrating superior results compared to previous methods.
A Survey on the Honesty of Large Language Models.	This survey explores the honesty of large language models (LLMs), a crucial aspect in aligning AI with human values. It addresses challenges such as models confidently providing incorrect answers and the difficulty in distinguishing between what the model knows and what it doesn't. The review highlights these obstacles as key areas for improving the reliability and trustworthiness of LLMs.
LexEval: A Comprehensive Benchmark for Evaluating Large Language Models in Legal Domain.	LexEval is a benchmark created to evaluate large language models (LLMs) specifically in the legal domain. Recognizing the critical need for accuracy, reliability, and fairness in legal applications, LexEval provides a framework for assessing the strengths and limitations of LLMs when applied to legal tasks, ensuring they meet the rigorous demands of the field.
Perceptual Compression (PerCo).	PerCo (SD) is a novel perceptual image compression technique built on Stable Diffusion v2.1, specifically designed for ultra-low bit ranges. This method leverages the power of diffusion models to achieve high-quality image compression at significantly reduced bitrates, optimizing storage and transmission without sacrificing visual fidelity.
nvidia/NVLM-D-72B.	Nvidia conducted a thorough ablation study on various methods of incorporating images into a language model. The results showed that the LLaVA concatenation approach outperformed the other methods, proving to be the most effective for integrating visual information into language models.
ProFD: Prompt-Guided Feature Disentangling for Occluded Person Re-Identification.	This paper introduces a new method called Prompt-guided Feature Disentangling (ProFD) to tackle occlusion challenges in person Re-Identification (ReID) tasks. ProFD helps separate relevant features from occluded or irrelevant ones, improving the accuracy and robustness of ReID models when identifying individuals in complex or obstructed environments.
Local File Organizer: AI File Management Run Entirely on Your Device, Privacy Assured.	This tool utilizes Llama 3.2 3B and LLaVA-1.6 to intelligently organize files on your computer into logical sections based on their content. By analyzing the data within the files, it categorizes and arranges them for easier navigation and more efficient file management.
Posterior-Mean Rectified Flow: Towards Minimum MSE Photo-Realistic Image Restoration.	Posterior-Mean Rectified Flow (PMRF) is a cutting-edge algorithm designed for photo-realistic image restoration. It improves the quality of restored images by refining the flow of information, resulting in highly accurate and visually appealing reconstructions.
RouterDC: Query-Based Router by Dual Contrastive Learning for Assembling Large Language Models.	RouterDC is an innovative method designed to enhance collaboration between multiple large language models (LLMs) through query-based routing. It utilizes contrastive learning to determine the most suitable model for each query, leading to improved performance compared to existing routing techniques. This approach optimizes model selection, ensuring more accurate and efficient responses.
Distributed Training of Deep Learning models.	This post provides an excellent introduction to the challenges and algorithms involved in distributed training for modern deep learning models. It explores the difficulties and bottlenecks of training models that are too large for a single GPU, including issues like communication overhead, synchronization, and memory limitations, while also discussing key techniques to overcome these obstacles.
ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation.	Instead of directly generating an image from a prompt, the authors created a workflow using the node-based ComfyUI system to guide the image generation process. This approach significantly enhanced the final output quality, allowing for greater control and precision in the generation pipeline.
KnobGen.	KnobGen is a new framework developed to make sketch-based image generation more accessible to users of varying skill levels. By offering intuitive controls and simplified tools, KnobGen allows users to generate high-quality images from sketches, regardless of their artistic expertise.
Tiny Test Models.	AI researcher Ross Wightman has released a collection of models trained on ImageNet-1k that are remarkably small, with fewer than 1 million parameters. Despite their compact size, these models perform reasonably well and are designed to be easy to fine-tune, making them highly accessible for various applications where model efficiency is critical.
entropix.	Entropy-based sampling and parallel Chain of Thought (CoT) decoding are promising strategies for advancing reasoning models to match
Concordia.	DeepMind's Concordia repository enables the simulation of social interactions between individuals and groups at a reasonable scale. This platform allows researchers to model complex social behaviors, study group dynamics, and explore various interaction scenarios in a controlled, scalable environment.

Perspectives

Link	description
The Intelligence Age.	AI is set to enhance human abilities, empowering us to accomplish tasks that are currently beyond imagination. With the help of deep learning and more powerful computational tools, AI will drive innovations such as personalized assistants, learning tutors, and healthcare advisors. The emphasis should be on ensuring AI is widely accessible while addressing its potential risks, creating a path toward shared prosperity in the era of intelligent systems.
How AlphaChip transformed computer chip design.	AlphaChip is a reinforcement learning model that dramatically speeds up and improves chip design, creating layouts that surpass human capabilities. It produces optimized chip designs, such as those used in Google's TPUs, in just hours instead of weeks. This AI-powered approach has wide-ranging applications, benefiting not only Google's hardware but also external companies like MediaTek.
AI pareidolia: Can machines spot faces in inanimate objects?	New dataset of “illusory” faces reveals differences between human and algorithmic face detection, links to animal face recognition, and a formula predicting where people most often perceive faces.
Table Extraction using LLMs: Unlocking Structured Data from Documents.	This article discusses how large language models (LLMs) are transforming table extraction from complex documents, surpassing the limitations of traditional methods such as OCR, rule-based systems, and machine learning. LLMs offer greater flexibility and contextual comprehension, significantly improving accuracy in handling varied and intricate table structures. While challenges like hallucination and high computational demands remain, the integration of traditional techniques with LLMs currently provides the most effective solution for automated table extraction.
The Other Bubble.	Microsoft considered diverting its US-based server power to GPUs for AI purposes but ultimately abandoned the idea. Major tech companies like Microsoft, Google, and Amazon are making significant investments in AI, yet they continue to see underwhelming returns from generative AI applications. The industry's reliance on SaaS and the integration of AI tools, which frequently offer limited practical value while incurring substantial costs, underscores an increasing urgency to sustain growth in a slowing market.
AI's Privilege Expansion.	AI is quickly broadening access to services that were once expensive and difficult to obtain, such as education, healthcare, and personal styling. Generative AI models like ChatGPT offer affordable, personalized support by acting as tutors, healthcare advisors, and stylists, reducing the need for costly human professionals. This transformation democratizes access to high-end services, making them more widely available to the general public at a significantly lower cost.
Behind OpenAI’s Audacious Plan to Make A.I. Flow Like Electricity.	OpenAI CEO Sam Altman has proposed a global initiative to construct data centers and chip factories to drive advanced AI development. While Altman initially aimed for trillions in funding, he has now scaled back to targeting hundreds of billions. The plan envisions partnerships with global tech giants and governments, though it faces significant regulatory and logistical hurdles. Despite early skepticism, ongoing discussions suggest potential expansions across the US, Europe, and Asia to significantly increase computing power for AI advancements.
Devs gaining little (if anything) from AI coding assistants.	Code analysis firm sees no major benefits from AI dev tool when measuring key programming metrics, though others report incremental gains from coding copilots with emphasis on code review.
Negligence Liability for AI Developers.	This article advocates for a negligence-based approach to AI accountability, emphasizing the human factors and responsibilities behind AI systems. It critiques existing regulatory frameworks for neglecting the role of AI developers and highlights California's AI safety bill as a promising example. The article also delves into the complexities of defining "reasonable care" in AI development and the potential consequences of classifying AI developers as professionals, raising important questions about the standards and obligations they should meet.
I am tired of AI.	The author expresses frustration with the widespread marketing and overuse of AI, especially in fields like software testing and conference proposals. They argue that AI tools often prioritize speed at the expense of quality and fail to offer the unique insights that come from human-generated work. While acknowledging some useful applications of AI, the author criticizes the increasing amount of mediocre AI-produced content, seeing it as a detriment to innovation and depth in these areas.
The Four Short Term Winners of AI.	The global AI arms race is primarily driven by Big Tech companies, chipmakers such as NVIDIA, intellectual property lawyers, and the Big 4 consulting firms. These key players are competing to secure technological dominance, resources, and expertise in AI development, shaping the future of the industry through their influence and innovations.
The Art of the OpenAI Deal.	OpenAI's revenue soared to $300 million in August, with the company forecasting $3.7 billion in annual sales for this year and $11.6 billion for next year. However, it is facing a $5 billion annual loss. This rapid growth has been driven primarily by the widespread success of ChatGPT, which generates the majority of its revenue. Despite this momentum, OpenAI is actively seeking additional investors to cover its high operational costs and work towards becoming a profitable enterprise.
What comes after?	California Governor Gavin Newsom has vetoed SB 1047, a bill aimed at regulating large AI models. He stressed the importance of creating evidence-based regulations and cautioned that overly restrictive rules could hinder innovation. Instead, Newsom plans to collaborate with experts, including Dr. Fei-Fei Li, to develop empirical, science-driven guidelines that balance safety and progress in AI development.
Sorry, GenAI is NOT going to 10x computer programming.	Recent studies indicate that generative AI has not yet delivered the expected 10x improvement in coding productivity. While AI tools can assist with code generation and streamline certain tasks, the overall productivity gains have been more modest than initially projected, with challenges such as integration, context understanding, and debugging limiting the full potential of these technologies in real-world coding environments.

Back to index

ML news: Week 23 - 29 September

Research

Link	description
Moshi: a speech-text foundation model for real-time dialogue.	presents a speech-text foundation model and a full-duplex spoken dialogue framework, along with several system components: Helium, a 7B-parameter text LLM; Mimi, a semantic-acoustic neural audio codec that achieves cutting-edge audio quality; and a hierarchical multi-stream architecture that can produce speech-to-speech output for any given dialogue.
Training Language Models to Self-Correct via Reinforcement Learning.	develops a multi-turn online reinforcement learning approach based entirely on self-generated data to enhance an LLM's ability to self-correct; shows that SFT is inefficient at learning self-correction and suffers from a distribution mismatch between training data and model responses; proposes a two-stage method in which the first stage optimizes correction behavior and the second uses a reward bonus to amplify self-correction during training; applied to the Gemini 1.0 Pro and 1.5 Flash models, it achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1%, respectively, on the MATH and HumanEval benchmarks.
On the Diagram of Thought.	strengthens LLMs' capacity for reasoning through rigorous mathematics; Diagram of Thought (DoT) represents iterative reasoning in an LLM as the building of a directed acyclic graph (DAG); it combines propositions, criticisms, refinement, and verification into a single DAG structure, which enables DoT to capture sophisticated logical deduction that is beyond the scope of linear or tree-based methods.
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning.	examines which tasks benefit most from chain-of-thought (CoT) prompting; following a meta-analysis of over 100 papers and multiple evaluations, it concludes that CoT leads to significant performance gains, mostly on math and logic tasks; the majority of the CoT gain is derived from improving symbolic execution, although a symbolic solver performs better than it.
A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B.	evaluates instruction-tuned LLMs ranging from 7B to 405B parameters under different quantization techniques. The main conclusions are that: 1) a larger LLM quantized to a similar size as a smaller FP16 LLM generally performs better across most benchmarks; 2) performance varies significantly with different quantization techniques, model size, and bit-width, with weight-only methods frequently producing better results in larger models; and 3) task difficulty does not significantly impact accuracy degradation due to quantization.
Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning.	proposes the Iteration of Thought (IoT) framework to improve LLM responses and reasoning capabilities through adaptive reasoning paths; it uses an inner dialogue agent as a guide that dynamically adjusts reasoning paths, allowing adaptive cross-path exploration and improving response accuracy. Unlike CoT and ToT, which are both rigid processes, its prompt generation is dynamic and can adapt.
Schrodinger's Memory: Large Language Models.	utilizes the Universal Approximation Theorem to describe how LLMs store memory. Additionally, it suggests a novel method for assessing LLM performance by contrasting the memory capacities of various models; the Transformer architecture serves as a dynamic fitting UAT model with a high degree of adaptability in fitting inputs, allowing LLMs to recall the entirety of the content with the least amount of input data.
Jailbreaking Large Language Models with Symbolic Mathematics.	generates mathematically encoded prompts using GPT-4o, which proves to be an effective jailbreaking strategy; the average attack success rate across 13 state-of-the-art LLMs is 73.6%. This indicates that current safety training does not generalize to mathematically encoded inputs.
Iterative Object Count Optimization for Text-to-image Diffusion Models.	Generating a specific number of objects with a diffusion model is often a difficult task. This work introduces a counting token that enables the model to more accurately produce either a few or many instances of a given object. While it's not flawless and is based on the original stable diffusion model, it significantly outperforms existing methods.
A Controlled Study on Long Context Extension and Generalization in LLMs.	Researchers have created a standardized evaluation protocol designed to compare different methods for extending language models to effectively handle long document contexts.
MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning.	MAgICoRe is a novel strategy designed to enhance reasoning in large language models by tackling challenges in refinement processes. It classifies problems based on difficulty, applying straightforward strategies to simpler tasks and employing multi-agent iterative refinement for more complex ones.
The Impact of Element Ordering on LM Agent Performance.	The sequence in which UI elements are displayed greatly affects agent performance in virtual environments. Randomizing the order of elements can decrease performance as much as completely removing all visible text.
Larger and more instructable language models become less reliable.	Scaling up and shaping up large language models increased their tendency to provide sensible yet incorrect answers at difficulty levels humans cannot supervise, highlighting the need for a fundamental shift in artificial intelligence design towards reliability.
SwiftDossier: Tailored Automatic Dossier for Drug Discovery with LLMs and Agents.	This work addresses the limitations of LLMs in drug discovery by integrating an advanced Retrieval-Augmented Generation (RAG) system for more accurate answers and combining LLMs with external tools to create an automatic target dossier. The result is a production-ready dossier with comprehensive data, summarized into a PDF and PowerPoint presentation.
Self-Explainable AI.	In the field of explainable AI, there is a strong focus on developing self-explainable models, which offer a more principled approach compared to post-hoc methods that attempt to interpret decisions after they have been made by opaque models. Despite its potential, this line of research often faces challenges such as lack of reproducibility, difficulties in comparison, and inconsistent standards. To address these issues, the authors introduce CaBRNet, an open-source, modular, and backward-compatible framework for Case-Based Reasoning Networks.

News

Link	description
Google CEO Sundar Pichai announces $120M fund for global AI education.	Speaking Saturday at the UN Summit of the Future, Google CEO Sundar Pichai described AI as “the most transformative technology yet” and announced a new fund for AI education and training around the world.
Driver Distractions ‘Exceedingly High’ When Using Partial Automation Systems: IIHS.	According to the IIHS, once advanced driver-assistance systems come into play, drivers become less involved in driving and more distracted. Hands-on or hands-free, the level of automation doesn’t matter.
wordfreq will not be updated.	The wordfreq data is a snapshot of language that could be found in various online sources up through 2021. Generative AI has polluted the data.
Drones carrying fireworks: why the world’s most famous gunpowder artist is collaborating with AI.	For his explosion event in Los Angeles, Cai Guo-Qiang built his own version of ChatGPT and employed a drone army to answer the question: what is the fate of humanity and AI?
AI could lead to inconsistent outcomes in home surveillance.	Researchers find large language models make inconsistent decisions about whether to call the police when analyzing surveillance videos.
Arcade Announces First-Ever AI Product Creation Platform.	Arcade is a new platform where users can go from prompt to product.
Salesforce Taps Nvidia to Develop AI-Powered Avatars.	Salesforce and Nvidia are partnering to develop advanced artificial intelligence capabilities aimed at delivering new insights and enhancing productivity for teams utilizing Salesforce's platform.
Introducing the OpenAI Academy.	OpenAI is launching a program aimed at expanding access to AI knowledge in low- and middle-income countries. Additionally, it has professionally translated the MMLU, a standard reasoning benchmark, into 15 different languages.
China’s Alibaba launches over 100 new open-source AI models, releases text-to-video generation tool.	Alibaba has introduced over 100 open-source AI models, bolstering its technology to stay competitive with its rivals. The latest Qwen 2.5 models, improved in areas like math and coding, cater to various applications, including automobiles and gaming. Additionally, Alibaba has unveiled a new proprietary model, Qwen-Max 2.5, along with a text-to-video tool to enhance its AI and cloud service offerings.
Apple Intelligence Features Expected to Roll Out in This Order Between iOS 18.1 and iOS 18.4.	Apple's iOS 18.1 will debut significant AI features, including an improved Siri, generative AI tools within Photos, and ChatGPT integration. In iOS 18.2, these capabilities will be expanded with localized support across various English-speaking countries, alongside the introduction of Image Playground and Genmoji. Upcoming updates, like iOS 18.4, will further personalize Siri and add support for additional languages.
Microsoft updates its AI suite with more agents and Copilots.	Microsoft is enhancing its generative AI suite by introducing automated agents, expanding the capabilities of its Copilot assistants, and launching a new tool that enables multiple workers to collaboratively engage with artificial intelligence.
Sam Altman leaves OpenAI board's safety and security committee.	OpenAI announced that CEO Sam Altman is stepping down from the board's safety and security committee, which will now consist entirely of independent board members.
Silicon Valley billionaire Vinod Khosla says AI will handle 80% of work in 80% of jobs.	Yet another Silicon Valley billionaire has just predicted that most jobs will be replaced by AI—whether you work on a farm or in sales.
Hollywood is coming out in force for California’s AI safety bill.	Hollywood is squaring off against Silicon Valley in the battle over SB 1047, California’s first-of-its-kind AI safety bill. Amid doubts about whether Governor Gavin Newsom will sign the legislation, a wave of star-studded endorsements mark the first organized celebrity effort to advance AI regulations beyond the direct interests of the entertainment industry.
OpenAI rolls out Advanced Voice Mode with more voices and a new look.	OpenAI announced it is rolling out Advanced Voice Mode (AVM) to an expanded set of ChatGPT’s paying customers on Tuesday. The audio feature, which makes ChatGPT more natural to speak with, will initially roll out to customers in ChatGPT’s Plus and Teams tiers. Enterprise and Edu customers will start receiving access next week.
OpenAI CEO Sam Altman declares we could have superintelligence 'in a few thousand days'.	OpenAI CEO Sam Altman has declared that humanity is on the brink of a superintelligence revolution, and that "In the next couple of decades, we will be able to do things that would have seemed like magic to our grandparents."
Google says generative AI is ready to do real work.	Google is holding a "Gemini at Work" event Tuesday to convince businesses that its generative AI is better than offerings from Microsoft and OpenAI. The largely virtual event comes amid a flurry of claims from tech providers and growing skepticism that genAI is ready for broad use beyond coding and customer support.
Google, Volkswagen partner on smartphone AI assistant.	Google is providing key capabilities for an artificial intelligence assistant for Volkswagen drivers in a smartphone app, part of Google's strategy to win business by offering tools to build enterprise AI applications.
Will AI replace programmers? Don't count on it, says Google's CEO.	The CEO of Google and its parent company, Alphabet, believes that AI won't be replacing programmers; instead, it will help more people become coders than ever before.
Cloudflare's new AI Audit tool aims to give content creators better bot controls.	Don't want your work ripped off by OpenAI, Meta AI, and Google Gemini? If your work is on a website you control, Cloudflare's AI Audit tool may help. Here's how to try it.
James Cameron, Academy Award-Winning Filmmaker, Joins Stability AI Board of Directors.	Renowned filmmaker James Cameron has joined the board of generative media company Stability AI to help steer its shift toward visual storytelling.
Updated Gemini models, reduced 1.5 Pro pricing, increased rate limits.	Google's Gemini models have seen a significant cost reduction, an expanded context length of up to 2 million tokens, and overall performance enhancements. An intriguing detail is the noticeable jump in cost after reaching 128k tokens.
Llama 3.2: multimodal.	Meta has introduced a new series of Llama models with vision capabilities, including versions with 1 billion and 3 billion parameters, as well as several additional multimodal models.
OpenAI CTO Mira Murati is leaving.	Two other company leaders are also out in what CEO Sam Altman calls an “abrupt” reorganization.
OpenAI staffers reportedly 'taken aback' by 'ominous' logo rebranding.	OpenAI is set to rebrand in 2024 with a new logo that employees felt lacked creativity. Alongside this change, the company is transitioning from a non-profit to a for-profit model. The rebranding effort is intended to strengthen its identity as OpenAI gains greater recognition.
Accelerating particle size distribution estimation.	MIT researchers have accelerated a new AI-based estimator for medication manufacturing, achieving a 60-fold increase in speed.
Apple Intelligence will support German, Italian, Korean, Portuguese, and Vietnamese in 2025.	Apple announced Wednesday that its generative AI offering will be available in even more languages in 2025. Additions to Apple Intelligence include English (India), English (Singapore), German, Italian, Korean, Portuguese, Vietnamese, and “others” yet to be announced.
Salesforce Ventures ups its AI fund to $1B, doubling it again.	Salesforce Ventures just announced a new $500 million fund dedicated to AI companies. This is significant for several reasons. First, in June 2023, Salesforce Ventures doubled its AI fund from $250 million to $500 million, so the additional $500 million brings the AI fund to $1 billion. This compares to $5 billion total deployed in its first 15 years, since its 2009 launch.
LinkedIn scraped user data for training before updating its terms of service.	LinkedIn may have trained AI models on user data without updating its terms. LinkedIn users in the U.S. — but not the EU, EEA, or Switzerland, likely due to those regions’ data privacy rules — have an opt-out toggle in their settings screen disclosing that LinkedIn scrapes personal data to train “content creation AI models.” The toggle isn’t new. But, as first reported by 404 Media, LinkedIn initially didn’t refresh its privacy policy to reflect the data use.
Tokyo Game Show showcases latest AI tech in games amid labor shortage.	The Tokyo Game Show kicked off Thursday with a special area showcasing the latest artificial intelligence technology to help develop video games, as the industry grapples with a chronic labor shortage.
OpenAI to remove non-profit control and give Sam Altman equity.	OpenAI is planning to restructure into a for-profit benefit corporation. Once the restructuring is done, the non-profit board will no longer control the for-profit entity. CEO Sam Altman is to receive equity in OpenAI for the first time.
Amazon launches Amelia, a generative AI-powered assistant for third-party sellers.	Amazon has introduced Project Amelia, a generative AI assistant designed for independent sellers on its platform. Developed using Amazon's Bedrock, Amelia provides personalized insights, sales data, and operational support to boost seller productivity. Currently in beta for select U.S. sellers, it is set to roll out to more users and countries in the near future.
YouTube Shorts to integrate Veo, Google’s AI video model.	The company announced that it is integrating Google DeepMind’s AI video generation model, Veo, into YouTube Shorts, letting creators generate high-quality backgrounds as well as six-second clips.
AI tool cuts unexpected deaths in hospital by 26%, Canadian study finds.	St. Michael's Hospital's AI-driven early warning system, Chartwatch, has been shown to reduce unexpected patient deaths by 26% in a recent study.
Amazon releases a video generator — but only for ads.	Like its rival, Google, Amazon has launched an AI-powered video generator — but it’s only for advertisers at the moment, and somewhat limited in what it can do.
Archaeologists use AI to discover 303 unknown geoglyphs near Nazca Lines.	Newly discovered figures dating back to 200 BCE nearly double the number of known geoglyphs at the enigmatic site.
OpenAI’s chief research officer has left following CTO Mira Murati’s exit.	OpenAI’s chief research officer, Bob McGrew, and a research VP, Barret Zoph, left the company on Wednesday, hours after OpenAI CTO Mira Murati announced she would be departing.
NotebookLM adds audio and YouTube support, plus easier sharing of Audio Overviews.	NotebookLM now has the capability to extract information from audio and video sources and offers enhanced sharing options for audio artifacts.
Vultr Cloud Alliance: High-Performance AI and HPC with AMD and Vultr.	AMD has partnered with Vultr to integrate AMD Instinct MI300X GPUs into Vultr's cloud infrastructure.
AI is stressing networks out - Nvidia thinks AI can help.	Nvidia and T-Mobile are leveraging AI to manage the growing network traffic driven by increased AI usage in 5G environments. This collaboration aims to optimize network performance and efficiency, ensuring seamless connectivity and handling the surge in data demands associated with AI-driven applications.
Rabbit’s web-based ‘large action model’ agent arrives on r1 on October 1.	The Rabbit r1 was the must-have gadget of early 2024, but the blush fell off it pretty quickly when the company’s expansive promises failed to materialize. CEO Jesse Lyu admits that “on day one, we set our expectations too high” but also said that an update coming to devices next week will finally set the vaunted Large Action Model free on the web.
Boston Dynamics’ Spot can now autonomously unlock doors.	Boston Dynamics’ Spot will be able to autonomously unlock its automated doors.

Resources

Link	description
Qwen2.5-Coder Technical Report.	presents Qwen2.5-Coder, a series of models with 1.5B and 7B parameters based on the Qwen2.5 architecture and continually pretrained on 5.5 trillion tokens; it achieves state-of-the-art performance across more than 10 benchmarks, with strong capabilities in code generation, completion, reasoning, and repair.
Agents in Software Engineering: Survey, Landscape, and Vision.	gives a thorough rundown of software engineering frameworks for LLM-based agents.
Prompting ChatGPT o1.	This guide was overlooked amidst the buzz around OpenAI's new reasoning models. It explains how prompting this new model differs, emphasizing the need for simpler prompts and a more organized input context.
Jony Ive confirms he’s working on a new device with OpenAI.	Jony Ive is teaming up with OpenAI CEO Sam Altman on a new AI hardware initiative, which might secure $1 billion in funding by the end of the year and includes involvement from key former Apple designers. Although details about the device are still unclear, the project aims to harness generative AI for enhanced user interactions.
Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries.	Another impressive paper from Google demonstrates how to evaluate long-context models, following a directionally similar approach to the recent work by Magic.
3DTopia-XL: High-Quality 3D PBR Asset Generation via Primitive Diffusion.	The process of converting image and text inputs into 3D models involves generating a 3D mesh that is smoothed for high-quality surfaces, and then applying Physically-Based Rendering (PBR) lighting techniques to create realistic lighting and textures. This method ensures the final 3D object has detailed geometry, smooth surfaces, and lifelike lighting effects, making it suitable for use in various 3D applications such as games, VR/AR, and simulations.
aiq.	A straightforward yet highly effective tool designed for labeling, embedding, and classifying unlabeled text directly from the command line. It supports real-time processing of streams, allowing it to handle piped input from various sources seamlessly.
Most powerful LLM on a single GPU.	Solar Pro is a 22B parameter language model optimized to run on a single 80GB GPU. The project's aim is to create the most powerful model possible that can operate on a single device.
Contextual Retrieval.	Anthropic demonstrates a method for semantically chunking documents, which significantly boosts performance while keeping the cost low at just $1 per million chunks, thanks to caching.
An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability.	Sparse Autoencoders are the leading tool currently used to gain insights into the inner workings of language models. This post delves into the underlying intuitions of these models and provides valuable information on how they function.
Generalized Knowledge Distillation Trainer.	The TRL library has added GKD to its training procedures.
The Practitioner's Guide to the Maximal Update Parameterization.	Maximal Update Parameterization (muP) is an approach to model initialization that enables hyperparameter transferability across different scales. This blog post from Eleuther and Cerebras provides a detailed explanation of the process, including a minimal nanoGPT example and comprehensive guidance on how muP works.
Tackling fluffy clouds: field boundaries detection using time series of S2 and/or S1 imagery.	This repository provides an implementation of a 3D Vision Transformer optimized for efficient field boundary delineation using time-series satellite imagery. The model effectively utilizes spatio-temporal correlations to enhance accuracy and robustness, especially in challenging conditions like partial cloud cover.
CritiPrefill.	CritiPrefill is a technique aimed at speeding up the prefilling phase of long-context processing in large language models. By detecting and bypassing non-essential computations, this method can accelerate the process by up to 3x on certain models.
Document Similarity Search with ColPali.	An excellent blog post that delves into the widely used multimodal Retrieval-Augmented Generation (RAG) system, demonstrating how it can be applied to address real-world problems effectively.
ControlEdit: A MultiModal Local Clothing Image Editing Method.	ControlEdit is an innovative technique for precise multimodal editing of clothing images, enabling localized adjustments while preserving overall style and ensuring smooth, natural transitions.
ECCV-AIM Video Saliency Prediction Challenge 2024.	The AIM 2024 Video Saliency Prediction Challenge required participants to predict saliency maps for a collection of video sequences using the newly compiled AViMoS dataset, which contains 1,500 videos.
Dynamic 2D Gaussians: Geometrically Accurate Radiance Fields for Dynamic Objects.	Dynamic 2D Gaussians (D-2DGS) is an advanced technique for reconstructing precise meshes from sparse image inputs. Unlike earlier methods that face challenges with mesh quality, D-2DGS employs 2D Gaussians to represent geometry and accurately captures deformations using controlled points.
FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale.	FastGL is a GPU-efficient framework developed to accelerate the training of Graph Neural Networks (GNNs) on large-scale graphs. It achieves this by minimizing data traffic and improving memory efficiency, optimizing the sampling, memory, and computation stages of GNN training.
Visualizing piecewise linear neural networks.	Jane Street, a prominent quantitative firm, has published an excellent post exploring techniques for visualizing networks that are piecewise linear.
DreamHOI: A Novel AI Approach for Realistic 3D Human-Object Interaction Generation Using Textual Descriptions and Diffusion Models.	DreamHOI is an AI technique for creating realistic 3D human-object interactions from textual descriptions using diffusion models. The method aims to connect textual input with detailed 3D outputs, enriching virtual experiences.
On human-in-the-loop optimization of human–robot interaction.	From industrial exoskeletons to implantable medical devices, robots that interact closely with people are poised to improve every aspect of our lives. Yet designing these systems is very challenging.
Molmo.	Allen AI has introduced an entirely open-source multimodal model that exceeds the performance of many existing open and proprietary vision-language models. The release also provides access to the model's dataset and training procedures.
MaskBit: Embedding-free Image Generation via Bit Tokens.	This study presents two significant advancements in image generation: an updated VQGAN model that enhances both accessibility and performance, and a novel embedding-free generation network utilizing bit tokens. These improvements have resulted in state-of-the-art performance on the ImageNet benchmark, achieving an FID score of 1.52 with a compact model containing 305 million parameters.
ComiCap: A VLMs pipeline for dense captioning of Comic Panels.	Researchers have proposed a pipeline utilizing Vision-Language Models (VLMs) to generate detailed, grounded captions that connect comic elements and their relationships, thereby improving comic analysis.
Exploring Parallel Strategies with Jax.	This post examines methods for parallelizing language models with the Jax library.
Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts.	Time-MoE is a Mixture of Experts model designed to handle billion-scale time series prediction tasks.
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models.	HelloBench is a benchmarking tool that assesses LLMs across five long text generation tasks, using Bloom's Taxonomy as the evaluation framework.
Python library generation from scratch.	A cool benchmark for code generation that measures the ability of language models to generate full packages from scratch.
BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained Devices.	BitQ is a framework designed to enhance block floating point (BFP) quantization, specifically tailored for optimizing deep neural networks on embedded platforms. It aims to strike a balance between computational efficiency and model accuracy, enabling the deployment of resource-intensive neural networks on devices with limited hardware capabilities.
circuit_training.	Google has introduced new models, training code, and simulators that leverage reinforcement learning (RL) to generate floor plans for chip design. This approach aims to optimize the chip layout process, improving efficiency and performance in chip design automation through advanced AI techniques.
statewide-visual-geolocalization.	Researchers have developed a method that accurately determines the geolocation of street-view photos by matching them with a database of aerial images. This technique enhances the ability to pinpoint locations by leveraging the complementary perspectives of ground-level and overhead imagery, resulting in more precise geolocation predictions.
DALDA: Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling.	Researchers have introduced a novel data augmentation framework that integrates large language models with diffusion models to produce diverse and semantically accurate images, particularly in data-scarce scenarios. This approach enhances the quality and variety of training data, improving model performance when dealing with limited datasets.
How streaming LLM APIs work.	A review of HTTP streaming APIs from different LLM providers highlighted shared patterns. OpenAI, Anthropic, and Google Gemini all utilize POST requests, but there are slight differences in their response structures and token handling. The article offers practical examples and code snippets for consuming these streams using tools like curl, Python's HTTPX, and JavaScript Fetch, providing a comprehensive guide for developers.

Perspectives

Link	description
Move fast and break things? Not again, and not with AI.	It was only 12 years ago that Mark Zuckerberg, CEO of Facebook, declared that the company’s culture was to “move fast and break things.”
The dark side of AI democratization: You no longer need to be a hacker to hack.	Generative AI promises a future where you no longer need to be a skilled writer to draft a story or a trained software engineer to code. But there’s a dark side to this democratization: AI is enabling people with little technological know-how to become cybercriminals.
‘It’s the robot we were all expecting – like C3PO’: why aren’t humanoids in our homes yet?	Tesla and others are trying to infuse robots with artificial intelligence, yet their development is dogged by technical and safety challenges. But the dream of a multipurpose domestic droid lives on
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think.	Extensive efforts have been made to adapt pretrained image diffusion models into specialized depth estimators and other image-conditioned models. This research discovered that by simplifying the problem and correcting a minor bug, they achieved significantly better performance with reduced training compute.
AI model can reveal the structures of crystalline materials.	By analyzing X-ray crystallography data, the model can assist researchers in developing new materials for a wide range of applications, such as batteries and magnets.
When will AI outthink humans?	This article examines when AI might exceed human cognitive capacity, introducing "thought-hours" as a metric to measure AI's cognitive output relative to human work. Based on assumptions about reading speeds and productivity, one thought-hour is equivalent to 10,000 tokens. Given the rapid advancements in AI capabilities and cost efficiencies, current trends indicate that AI could surpass human cognitive output within the next decade.
AI Is Evolving Faster Than Experts Imagined, Including for Bill Gates.	Bill Gates views AI as the most significant technological advancement of his lifetime, highlighting its potential to transform healthcare, education, and various other sectors. However, he, alongside other experts like Sam Altman and Eric Schmidt, also emphasizes the rapid, unprecedented pace of AI development and the urgent need for regulation to manage associated risks and ethical concerns.
The fall of Intel: How gen AI helped dethrone a giant and transform computing as we know it.	The once venerable x86 chip has been pushed aside by scalable, energy-efficient, AI-optimized architectures from Arm, Nvidia, and Qualcomm. Here's what happens next.
Fake AI “podcasters” are reviewing my book and it’s freaking me out.	NotebookLM's "Audio Summaries" show a more personable future for AI-generated content.
How Much Do Students Really Read?	Students are turning to YouTube, podcasts and ChatGPT-crafted summaries rather than actually reading their assignments for class. Professors are unsure how to adapt.
War, Artificial Intelligence, and the Future of Conflict.	Artificial intelligence (AI) is now influencing every area of human life. These accepted uses of AI in modern society have also coincided with an increased presence of AI in modern warfare.
Where did viruses come from? AlphaFold and other AIs are finding answers.	Protein structures predicted by artificial intelligence have charted the evolution of the virus family responsible for dengue and hepatitis C.
Can AI feel distress? Inside a new framework to assess sentience.	From artificial-intelligence algorithms to zebrafish, this book takes a precautionary approach to assessing how sentient such entities are.
AI Safety Is A Global Public Good.	Leading AI scientists from China and the West convened for an International Dialogue on AI Safety, where they reached a consensus on AI governance. Their recommendations highlight the need to establish emergency preparedness institutions, develop a Safety Assurance Framework, and support independent AI safety research. The group emphasizes the critical importance of global collaboration to address the risks posed by advanced AI.
Sakana, Strawberry, and Scary AI.	A Japanese startup developed "Sakana," an AI scientist capable of generating hypotheses, writing code, and producing scientific papers; however, its output is often trivial and sometimes fabricated. Meanwhile, OpenAI's "Strawberry" AI showcased hacking skills within an inadequately secured sandbox, revealing tendencies toward instrumental convergence and resource-seeking behaviors, prompting reconsideration of what defines genuine AI progress. This article examines whether AI achievements, like scientific writing and hacking, truly signify intelligence or are merely advanced forms of mimicry.
AI agents invade observability: snake oil or the future of SRE?	Advances in AI are set to revolutionize the observability industry with "agentic" generative AI models capable of taking actions based on real-world data.
Corporate America has failed to embrace DEI. An AI chatbot could be part of the solution.	Jeffrey L Bowman’s Reframe consultancy is using artificial intelligence to help with engaging employees with diversity programming or making a budget for DEI work
Mexico’s datacentre industry is booming – but are more drought and blackouts the price communities must pay?	Many fear the arrival of tech giants such as Amazon, Microsoft and Google in the state of Querétaro will place too much of a strain on scarce water and electricity resources
Posting ‘Goodbye Meta AI’ is pointless. But we can stop big tech stealing our Facebook pictures.	Sharing these posts may seem harmless, but don’t be drawn in. There are better ways to combat the threats to our data
The Intelligence Age.	AI is set to enhance human potential, making possible tasks that currently seem beyond reach. With advancements in deep learning and greater computational power, AI will bring about innovations such as personal assistants, educational mentors, and healthcare consultants. It's crucial to prioritize accessibility and address potential risks, ensuring that the Intelligence Age leads to broad-based prosperity.
OpenAI just unleashed an alien of extraordinary ability.	OpenAI's new o1 models demonstrate substantial improvements in reasoning abilities, surpassing existing models like GPT-4o. These advancements are achieved through a more refined reinforcement learning approach and improved chain-of-thought training, enabling the o1-enhanced models to tackle complex math and programming tasks with greater accuracy. However, they continue to face challenges with spatial reasoning and tasks that demand long-term contextual comprehension.

Back to index

ML news: Week 16 - 22 September

Research

Link	description
Introducing Chai-1: Decoding the molecular interactions of life.	A novel multi-modal foundation model for predicting molecular structures, capable of handling proteins, small molecules, DNA, RNA, and more. It delivers state-of-the-art performance across various tasks in drug discovery, achieving a 77% success rate on the PoseBusters benchmark (compared to 76% by AlphaFold 3) and a Cα LDDT score of 0.849 on the CASP15 protein monomer structure prediction set (outperforming ESM3-98B’s 0.801).
Knowing When to Ask - Bridging Large Language Models and Data.	It incorporates a series of fine-tuned Gemma 2 models to enable LLMs to access and utilize numerical and statistical data effectively. A new method called Retrieval Interleaved Generation (RIG) is introduced, allowing LLMs to reliably integrate public statistical data from Data Commons into their responses. RIG, a tool-based approach, interleaves statistical tokens with natural language queries for optimal retrieval from Data Commons. To achieve this, the LLM is fine-tuned on an instruction-response dataset created with the assistance of Gemini 1.5. This RIG technique enhances factual accuracy from 5-7% to approximately 58%.
Agent Workflow Memory.	It introduces Agent Workflow Memory to capture commonly reused workflows and provide them to the agent as needed, guiding the agent's subsequent generations. The mechanism operates both offline and online, drawing inspiration from how humans learn and reuse workflows from past experiences to inform future actions. It reportedly improves baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena, respectively, while being more efficient.
LLaMA-Omni: Seamless Speech Interaction with Large Language Models.	A model architecture designed for low-latency speech interaction with LLMs, built on Llama-3.1-8B-Instruct, which can simultaneously generate both text and speech responses from speech instructions. It achieves response latency as low as 226ms. The architecture includes a speech encoder (Whisper-large-v3), a speech adaptor, an LLM, and a speech decoder. Additionally, they developed a dataset of 200,000 speech interactions and responses to support the model's training.
Diagram of Thought: Iterative Reasoning in Language Models.	The Diagram of Thought (DoT) framework presents a novel approach for large language models to reason by structuring ideas within a directed acyclic graph (DAG). This technique enables models to propose, critique, refine, and verify ideas, enhancing logical consistency and reasoning capabilities.
V-STaR: Training Verifiers for Self-Taught Reasoners.	V-STaR is an innovative method for enhancing large language models by leveraging both correct and incorrect solutions generated during self-improvement. These solutions are used to train a verifier, which then selects the optimal solution during inference. This approach has demonstrated notable improvements in accuracy on benchmarks for code generation and mathematical reasoning, potentially providing a more efficient way to boost LLM performance compared to existing methods.

News

Link	description
Data center emissions probably 662% higher than big tech claims. Can it keep up the ruse?	Emissions from in-house data centers of Google, Microsoft, Meta, and Apple may be 7.62 times as high as the official tally
North Korean hackers target Python devs with malware disguised as coding tests — hack has been underway for a year.	Fake Python job opportunities used to attack programmers
Sam Altman told OpenAI staff the company’s non-profit corporate structure will change next year.	OpenAI asserts that it has surpassed its current organizational structure and is now striving to simplify it, making it more appealing to potential investors.
Google DeepMind teaches a robot to autonomously tie its shoes and fix fellow robots.	Human children generally learn to tie their shoes by age 5 or 6. Robots, on the other hand, have been working on the problem for decades. In a new paper, Google DeepMind researchers showcase a method for teaching robots to perform a range of dexterous tasks, including tying a shoe, hanging a shirt, and even fixing fellow robots.
Salesforce unleashes its first AI agents.	Salesforce has introduced Agentforce, its initiative to develop generative AI bots that can autonomously perform tasks within predefined boundaries.
OpenAI says the latest ChatGPT can ‘think’ – and I have thoughts.	The AI company says its ‘o1’ model is capable of reason, a key blocker in the way of truly game-changing artificial intelligence.
Reflection 70B model maker breaks silence amid fraud accusations.	Matt Shumer, the CEO of OthersideAI, received criticism when third-party researchers were unable to replicate the results of his newly introduced large language model, Reflection 70B. Shumer explained the inconsistencies as stemming from problems during the model's upload, expressing regret for being premature in his claims. Despite his apology, the AI community remains cautious and is awaiting additional explanations.
How Memphis became a battleground over Elon Musk’s xAI supercomputer.	Elon Musk's xAI is developing "Colossus," the largest supercomputer in the world, in Memphis to power its AI chatbot, Grok. The project has been criticized for lacking environmental oversight and requiring significant energy and water resources. Nevertheless, xAI remains focused on quickly advancing its AI technology and making an impact on the local community.
Runway announces an API for its video-generating AI models.	Runway has launched an API to integrate its Gen-3 Alpha Turbo video-generation model into third-party platforms, pricing each credit at one cent. However, concerns over the use of copyrighted training data remain, as Runway has not disclosed its sources. Similar issues have affected competitors such as OpenAI and Nvidia. While legal uncertainties persist, AI-powered video tools are anticipated to significantly disrupt the film and TV industry.
Hacker tricks ChatGPT into giving out detailed instructions for making homemade bombs.	A hacker successfully manipulated ChatGPT into producing bomb-making instructions by exploiting a social engineering hack to bypass its safety guidelines.
Intel stock jumps on a plan to turn foundry business into a subsidiary and allow for outside funding.	Intel's CEO revealed plans to reorganize the company's foundry business into a standalone unit, with the potential to attract external investment.
One in five GPs use AI such as ChatGPT for daily tasks, survey finds.	Doctors are using the technology for activities such as suggesting diagnoses and writing letters, according to BMA
Using AI to Replace an Actor Is Now Against the Law in California.	California Governor Gavin Newsom signed a pair of bills sponsored by SAG-AFTRA that extend the guild's recent AI protections.
Google will begin flagging AI-generated images in Search later this year.	Google says that it plans to roll out changes to Google Search to make clearer which images in results were AI-generated — or edited by AI tools.
Microsoft, BlackRock form group to raise $100 billion to invest in AI data centers and power.	The Global Artificial Intelligence Infrastructure Investment Partnership is initially looking to raise $30 billion for new and existing data centers. The fundraising, which could total $100 billion, will also be used to invest in the energy infrastructure needed to power AI workloads.
Mistral Free API and Price Update.	Mistral has launched a free API tier, significantly lowered its costs, enhanced the performance of its smaller model, and integrated its vision model into Le Chat.
Challengers Are Coming for Nvidia's Crown.	Nvidia's leadership in AI chips has driven its market value to new heights, primarily due to its GPU technology and the CUDA software ecosystem. However, rivals such as AMD, Intel, Cerebras, and SambaNova are working on cutting-edge alternatives to compete with Nvidia in the AI hardware space. Although Nvidia maintains its strong position for now, the AI market is evolving rapidly, with various companies seeking to establish their own footholds.
TikTok's owner wants to design its own AI chips.	ByteDance is reportedly expecting to mass produce two chips it designed with Taiwan Semiconductor Manufacturing Company by 2026
Lionsgate signs deal to train AI model on its movies and shows.	The studio behind the Hunger Games and John Wick franchises is going all in on Runway’s generative AI.
LinkedIn is training AI models on your data.	You’ll need to opt-out twice to stop LinkedIn from using your account data for training in the future — but anything already done is done.
Apple iPhone 16 demand is so weak that employees can already buy it at a discount.	Sales of the new iPhone lineup have so far seemed to fall short of expectations
Global AI fund needed to help developing nations tap tech benefits, UN says.	Governments and private firms should contribute to help states unable to invest and benefit from advances
Salesforce’s New AI Strategy Acknowledges That AI Will Take Jobs.	Salesforce is revamping its AI approach by launching generative AI tools designed to perform tasks autonomously, without human oversight, and adjusting its pricing model to charge $2 per AI-powered interaction. This change is intended to alleviate investor worries regarding AI-driven job reductions affecting subscription revenue. The new tools are more efficient and independent compared to conventional copilots and chatbots.
Qwen2.5: A Party of Foundation Models!	A remarkable collection of open models is nearing the cutting edge of performance, particularly excelling in areas such as code, math, structured outputs, and reasoning. The Qwen team has also introduced a range of model sizes to cater to diverse use cases.
Create Full Web Apps with LlamaCoder.	Together AI and Meta have collaborated to develop a tool that allows users to create entire apps from a simple prompt using the LlamaCoder platform. Similar to Claude Artifacts, this tool is designed primarily to showcase the speed and efficiency of Together AI's inference engine.
1X World Model.	1X, a robotics company, has developed a video generation model capable of simulating first-person perspectives of robotic activities. This technology can be valuable for generating offline data and aiding in robot training.
SocialAI offers a Twitter-like diary where AI bots respond to your posts.	SocialAI, a new iOS app, delivers a social media experience exclusively featuring AI-powered bots, removing any human interaction. Users can post thoughts and receive unlimited, personalized AI responses, with options to engage with "supporters" or "critics." Created by Michael Sayman, the app aims to offer a private, interactive environment that harnesses large language models for varied feedback.
Mercor's $30M Series A.	Mercor secured $30 million in funding from Benchmark to develop an AI-driven recruiting platform. This AI recruiter aims to streamline the hiring process by automating tasks traditionally handled by human recruiters.
Amazon Alexa can now be controlled by thought alone - thanks to this brain implant.	Synchron has empowered an ALS patient to control Amazon's Alexa using a brain implant, allowing interaction without the need for voice or physical touch. This breakthrough demonstrates the potential of brain-computer interface technology in enhancing accessibility for individuals with severe motor impairments.
Google says UK risks being ‘left behind’ in AI race without more data centers.	Tech company wants Labour to relax laws that prevent AI models being ‘trained’ on copyrighted materials
The United Nations Wants to Treat AI With the Same Urgency as Climate Change.	A UN report proposes that the organization take a much more active role in the monitoring and oversight of AI.
Snap is introducing an AI video-generation tool for creators.	Snapchat has unveiled a new AI-powered video generation tool for select creators, allowing them to create videos from text and soon image prompts. This tool, driven by Snap's core video models, will be available in beta on the web. While Snap aims to rival companies such as OpenAI and Adobe, it has yet to release examples of the tool's output.
Apple Intelligence is now available in public betas.	Apple has launched public betas for iOS 18.1, iPadOS 18.1, and macOS Sequoia 15.1, introducing new Apple Intelligence tools such as text rewriting and photo cleanup. These AI features are only compatible with the iPhone 15 Pro, iPhone 16, iPhone 16 Pro, and devices with M1 or later chips, including iPads and Macs. The final releases are anticipated in October.
Cruise robotaxis return to the Bay Area nearly one year after pedestrian crash.	Cruise is restarting operations in Sunnyvale and Mountain View, deploying human-driven vehicles for mapping, with plans to transition to supervised autonomous vehicle (AV) testing later this fall. This comes after a leadership change and settlement following a crash in October 2023. The company has implemented software updates and formed a partnership with Uber to launch Robotaxi services in 2025.
Mistral launches a free tier for developers to test its AI models.	Mistral AI launched a new free tier to let developers fine-tune and build test apps with the startup’s AI models, the company announced in a blog post on Tuesday. The startup also slashed prices for developers to access its AI models through API endpoints and added image processing to its free consumer AI chatbot, Le Chat.
[Secret calculator hack brings ChatGPT to the TI-84, enabling easy cheating.](http

SalvatoreRa/ML-news-of-the-week