Explaining the behavior of trained neural networks remains an intriguing puzzle, especially as these models grow larger and more sophisticated. Like other scientific challenges throughout history, reverse engineering how artificial intelligence systems work requires a substantial amount of experimentation: generating hypotheses, manipulating behavior, and even dissecting large networks to examine individual neurons. To date, most successful experiments have relied heavily on human supervision. Explaining every computation in models at the scale of GPT-4 and beyond will almost certainly require more automation – perhaps even the use of AI models themselves.

To make this timely endeavor possible, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a novel approach that uses AI models to conduct experiments on other systems and explain their behavior. Their method uses agents built from pre-trained language models to create intuitive explanations for computations inside trained networks.

At the center of this strategy is the “Automated Interpretability Agent” (AIA), which is designed to mimic the experimental processes of a scientist. Interpretability agents plan and conduct tests on other computing systems, ranging in scale from single neurons to entire models, and offer explanations of those systems in various forms: language descriptions of what a system does and where it fails, and code that reproduces the system’s behavior. Unlike existing interpretability methods that passively classify or summarize examples, the AIA actively engages in hypothesis generation, experimental testing, and iterative learning, refining its understanding of other systems in real time.
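In rough outline, that experimental loop alternates between designing tests and revising an explanation. The sketch below illustrates the idea with a trivial stand-in for the agent (a real AIA is a language model); every name here is illustrative, not the paper's actual API.

```python
# Minimal sketch of an interpretability agent's hypothesis-test loop.
# The "agent" is a trivial stand-in; all names are illustrative only.

def interpret(system, propose_inputs, revise_hypothesis, rounds=3):
    """Alternate between designing experiments and refining an explanation."""
    history = []                                      # (input, output) pairs seen so far
    hypothesis = None
    for _ in range(rounds):
        tests = propose_inputs(history, hypothesis)   # design experiments
        history += [(x, system(x)) for x in tests]    # run them on the black box
        hypothesis = revise_hypothesis(history)       # update the explanation
    return hypothesis

# Toy black box and stand-in agent abilities.
black_box = lambda x: x % 3 == 0                      # hidden behavior to explain
batches = iter([[1, 2, 3], [6, 7], [9, 10]])
propose = lambda hist, hyp: next(batches, [])
revise = lambda hist: sorted(x for x, y in hist if y)

print(interpret(black_box, propose, revise))  # inputs the system fired on
```

A real AIA would replace `propose` and `revise` with language-model calls that reason over the accumulated observations.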

The AIA method is complemented by the new “Function Interpretation and Description” (FIND) benchmark, a testbed of functions resembling computations inside trained networks, along with descriptions of their behavior. A key challenge in assessing the quality of descriptions of real-world network components is that descriptions are only as good as their explanatory power: researchers do not have access to ground-truth labels for units or descriptions of learned computations. FIND addresses this long-standing problem in the field by providing a reliable standard for evaluating interpretability methods: explanations of functions (e.g., those produced by an AIA) can be evaluated against the function descriptions in the benchmark.

For example, FIND contains synthetic neurons designed to mimic the behavior of real neurons in language models, some of which are selective for individual concepts such as “ground transportation.” AIAs are given black-box access to synthetic neurons and design inputs (such as “tree,” “luck,” and “car”) to test a neuron’s response. After determining that a synthetic neuron produces higher responses for “car” than for other inputs, an AIA might design more fine-grained tests to distinguish the neuron’s selectivity for cars from its response to other modes of transportation, such as planes and boats. When the AIA produces a description such as “This neuron is selective for road transportation, and not for air or sea travel,” that description is evaluated against the ground-truth description of the synthetic neuron in FIND (“selective for ground transportation”). The benchmark can then be used to compare the capabilities of AIAs with those of other methods in the literature.
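The selectivity test described above amounts to comparing a neuron's mean response across input categories. The toy below illustrates this; the neuron and word lists are made up for illustration, whereas FIND's actual synthetic neurons are built on top of language models.

```python
# Toy synthetic neuron and the kind of selectivity probe an AIA might run.
# The neuron, scores, and word lists are hypothetical illustrations.

ROAD = {"car", "bus", "truck", "train"}

def synthetic_neuron(word):
    """Toy ground-transport neuron: high response for road vehicles."""
    return 0.9 if word in ROAD else 0.1

def mean_response(words):
    return sum(synthetic_neuron(w) for w in words) / len(words)

road = mean_response(["car", "bus", "truck"])
air_sea = mean_response(["plane", "boat", "helicopter"])
control = mean_response(["tree", "luck"])

# If road responses dominate both comparison sets, the description
# "selective for road transportation, not air or sea travel" is supported.
print(road > air_sea and road > control)  # True
```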

Sarah Schwettmann PhD ’21, co-lead author of a paper on the new work and a research scientist at CSAIL, emphasizes the advantages of this approach. “The AIAs’ ability to autonomously generate and test hypotheses can surface behaviors that would otherwise be difficult for scientists to detect. It is remarkable that language models, when equipped with tools for probing other systems, are capable of this kind of experimental design,” says Schwettmann. “Clean, simple benchmarks with ground-truth answers have been a major driver of more general capabilities in language models, and we hope FIND can play a similar role in interpretability research.”

Automation of interpretability

Large language models continue to enjoy celebrity status in the tech world. Recent advances in LLMs have highlighted their ability to perform complex reasoning tasks across diverse domains. The team at CSAIL recognized that, given these capabilities, language models could serve as the backbone of generalized agents for automated interpretability. “Historically, interpretability has been a very complex field,” says Schwettmann. “There is no one-size-fits-all approach; most procedures are very specific to the individual questions we might have about a system and to individual modalities such as vision or language. Existing approaches to labeling individual neurons inside vision models have required training specialized models on human data, with these models performing only this single task. Interpretability agents built from language models could provide a general interface for explaining other systems – synthesizing results across experiments, integrating across different modalities, and even discovering new experimental techniques at a very fundamental level.”

As we enter a regime in which the models doing the explaining are themselves black boxes, external evaluations of interpretability methods become increasingly important. The team’s new benchmark addresses this need with a suite of functions of known structure that mimic behaviors observed in the wild. The functions in FIND span a wide range of domains, from mathematical reasoning to symbolic operations on strings to synthetic neurons built from word-level tasks. The dataset of interactive functions is constructed procedurally: real-world complexity is introduced into simple functions by adding noise, composing functions, and simulating biases. This allows interpretability methods to be compared in a setting that translates to real-world performance.
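Procedural construction of this kind can be sketched as composing random primitives and corrupting the output with noise. The primitives and noise model below are assumptions for illustration, not the benchmark's actual code.

```python
# Illustrative sketch of procedurally building FIND-style functions by
# composing simple primitives and adding noise. All choices are hypothetical.
import math
import random

PRIMITIVES = [math.sin, abs, lambda x: x ** 2, lambda x: x - 3.0]

def make_function(rng, noise=0.05):
    """Compose two randomly chosen primitives, then add Gaussian noise."""
    f, g = rng.choice(PRIMITIVES), rng.choice(PRIMITIVES)
    def h(x):
        return f(g(x)) + rng.gauss(0.0, noise)
    return h

rng = random.Random(0)
fn = make_function(rng)
print(round(fn(2.0), 3))  # a noisy evaluation of the hidden composition
```

An interpretability method is then asked to recover a description of `fn` from input-output queries alone.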

In addition to the function dataset, the researchers introduced an innovative evaluation protocol to assess the effectiveness of AIAs and existing automated interpretability methods. The protocol covers two kinds of tasks. For tasks that require replicating the function in code, the evaluation compares the AI-generated estimates directly against the original ground-truth functions. For tasks that involve describing functions in natural language, the evaluation is more complex: accurately assessing the quality of these descriptions requires an automated understanding of their semantic content. To address this challenge, the researchers built a dedicated “third-party” language model, which is tasked with evaluating the accuracy and coherence of the natural-language descriptions provided by the AI systems and comparing them against the behavior of the ground-truth function.
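The code-replication half of the protocol reduces to scoring an estimated function against the ground truth on sampled inputs. The sketch below shows that idea with hypothetical names and a toy pair of functions; the natural-language half, which uses a judge language model, is not shown.

```python
# Sketch of scoring a reconstructed function against the ground truth.
# Names, functions, and the tolerance are illustrative assumptions.

def agreement(ground_truth, estimate, inputs, tol=1e-6):
    """Fraction of test inputs where the estimate matches the ground truth."""
    hits = sum(abs(ground_truth(x) - estimate(x)) <= tol for x in inputs)
    return hits / len(inputs)

truth = lambda x: 2 * x + 1                      # hidden benchmark function
guess = lambda x: 2 * x + 1 if x >= 0 else 0.0   # AIA's reconstructed code

xs = [-2, -1, 0, 1, 2]
print(agreement(truth, guess, xs))  # 0.6: correct only on x >= 0
```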

FIND enables this kind of evaluation, and shows that we are still a long way from fully automating interpretability. Although AIAs outperform existing interpretability approaches, they still fail to accurately describe almost half of the functions in the benchmark. Tamar Rott Shaham, co-lead author of the study and a postdoctoral researcher at CSAIL, notes: “Although this generation of AIAs is effective at describing high-level functionality, they still often overlook finer-grained details, particularly in function subdomains with noise or irregular behavior. This likely stems from insufficient sampling in these areas. One issue is that the AIAs’ effectiveness can be hampered by their initial exploratory data. To counter this, we tried guiding the AIAs’ exploration by initializing their search with specific, relevant inputs, which significantly improved interpretation accuracy.” This approach combines new AIA methods with earlier techniques that use precomputed examples to initiate the interpretation process.
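The seeding idea can be pictured as a greedy search that starts from known-relevant inputs rather than sampling blindly. The toy system and expansion rule below are hypothetical stand-ins for that process.

```python
# Sketch of seeding exploration with relevant inputs, then expanding
# around the strongest response. All names here are illustrative.

def seeded_exploration(system, seeds, neighbors, rounds=2):
    """Greedy probe: start from seed inputs, expand around the strongest."""
    frontier = list(seeds)
    history = []
    for _ in range(rounds):
        scored = [(system(x), x) for x in frontier]
        history.extend(scored)
        best = max(scored)[1]          # strongest response this round
        frontier = neighbors(best)     # explore near it next round
    return max(history)[1]             # input with the peak response overall

# Toy system with a response peak at 7; neighbors step one unit either way.
peak = lambda x: -abs(x - 7)
step = lambda x: [x - 1, x + 1]
print(seeded_exploration(peak, [0, 5, 10], step))  # homes in near the peak
```

Starting from the seed closest to the true peak lets the search concentrate its limited queries where they matter, which mirrors the accuracy gains the researchers report.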

The researchers are also developing a toolkit to improve AIAs’ ability to conduct more precise experiments on neural networks, in both black-box and white-box settings. The toolkit aims to equip AIAs with better tools for selecting inputs and with refined hypothesis-testing capabilities for more nuanced and accurate neural network analysis. The team is also tackling practical challenges in AI interpretability, focusing on determining the right questions to ask when analyzing models in real-world scenarios. Their goal is to develop automated interpretability procedures that could ultimately help people audit systems – e.g., for autonomous driving or facial recognition – to diagnose potential failure modes, hidden biases, or surprising behaviors before deployment.

Watching the watchers

The team envisions one day developing near-autonomous AIAs that can audit other systems, with human scientists providing oversight and guidance. Advanced AIAs could devise new kinds of experiments and questions, potentially going beyond what human scientists initially consider. The focus is on expanding AI interpretability to more complex behaviors, such as entire neural circuits or subnetworks, and on predicting inputs that could lead to undesired behaviors. This development represents a significant step forward in AI research, aimed at making AI systems more understandable and reliable.

“A good benchmark is a powerful tool for tackling difficult challenges,” says Martin Wattenberg, a computer science professor at Harvard University who was not involved in the study. “It’s wonderful to see this sophisticated benchmark for interpretability, one of the key challenges in machine learning today. I’m particularly impressed with the automated interpretability agent the authors created. It’s a kind of interpretability jiu-jitsu, turning AI back on itself to advance human understanding.”

Schwettmann, Rott Shaham, and their colleagues presented their work at NeurIPS 2023 in December. Additional MIT co-authors, all members of CSAIL and the Department of Electrical Engineering and Computer Science (EECS), include graduate student Joanna Materzynska, graduate student Neil Chowdhury, Shuang Li PhD ’23, Assistant Professor Jacob Andreas, and Professor Antonio Torralba. David Bau, an assistant professor at Northeastern University, is another co-author.

The work was supported, in part, by the MIT-IBM Watson AI Lab, Open Philanthropy, an Amazon Research Award, Hyundai NGV, the US Army Research Laboratory, the US National Science Foundation, the Zuckerman STEM Leadership Program, and a Viterbi Fellowship.

This article was originally published at news.mit.edu