This is a linkpost for https://www.convergenceanalysis.org/research/training-data-attribution-tda-examining-its-adoption-use-cases
Note: This report was conducted in June 2024 and is based on research originally commissioned by the Future of Life Foundation (FLF). The views and opinions expressed in this document are those of the authors and do not represent the positions of FLF.
This report investigates Training Data Attribution (TDA): its potential importance for, and tractability in, reducing extreme risks from AI. TDA techniques aim to identify the training data points that most strongly influence specific model outputs. They are motivated by the question: how would the model's behavior change if one or more data points were removed from or added to the training dataset?
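To make the motivating question concrete, here is a minimal sketch of the counterfactual behind TDA: retrain without each training point and measure how a prediction changes. It is a toy illustration only, not the report's method. The one-parameter model, the function names `fit_slope` and `leave_one_out_scores`, and the data are all assumptions made for this example.

```python
def fit_slope(data, l2=0.0):
    """Fit y ~= w * x by least squares (optional ridge penalty l2); return w."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + l2)


def leave_one_out_scores(data, x_query, l2=0.0):
    """Per training point: prediction with the point minus prediction without it."""
    base = fit_slope(data, l2) * x_query
    scores = []
    for i in range(len(data)):
        reduced = data[:i] + data[i + 1:]  # retrain with point i removed
        scores.append(base - fit_slope(reduced, l2) * x_query)
    return scores


# Hypothetical toy data: three points near y = x, plus one mislabeled point.
train = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (3.0, 9.0)]
scores = leave_one_out_scores(train, x_query=2.0)
most_influential = max(range(len(train)), key=lambda i: abs(scores[i]))

for i, s in enumerate(scores):
    print(f"point {i} {train[i]}: score {s:+.4f}")
print("most influential index:", most_influential)
```

A positive score means the point pushes the query prediction up, and a negative score means it pulls the prediction down. In this toy run the mislabeled point (index 3) receives the largest score in magnitude. The sketch retrains once per training point, which is why exact retraining of this kind does not scale to LLM-sized datasets.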
Report structure:
- First, we discuss how plausible it is, and how much effort it would take, to bring existing TDA research from its current state to an efficient and accurate TDA inference tool that can be run on frontier-scale LLMs. Next, we discuss the numerous research benefits AI labs can expect from using such TDA tooling.
- Then, we discuss a key outstanding bottleneck that would prevent such TDA tooling from being publicly accessible: AI labs’ willingness to disclose their training data. We suggest ways AI labs may work around these limitations, and discuss the willingness of governments to mandate such access.
- Assuming that AI labs willingly provide access to TDA inference, we then discuss what high-level societal benefits you might see. We list and discuss a series of policies and systems that may be enabled by TDA. Finally, we present an evaluation of TDA’s potential impact on mitigating large-scale risks from AI systems.
Key takeaways from our report:
- Modern TDA techniques can be categorized into three main groups: retraining-based, representation-based (or input-similarity-based), and gradient-based. Recent research has found that gradient-based methods (using influence functions) are the most likely path to practical TDA.
- The most efficient approach to conducting TDA with influence functions today has training costs on par with pre-training an LLM. It has significantly higher (but feasible) storage costs than the LLM itself, and somewhat higher per-inference costs.
- Based on these estimates, TDA no longer appears infeasible to run on frontier LLMs with enterprise levels of compute and storage. However, these techniques have not been tested on larger models, and the accuracy of these optimized TDA techniques on large models is unclear.
- Compressed-gradient TDA can plausibly already be used on fine-tuned models, which have orders of magnitude fewer training examples and parameters (on the order of millions or billions rather than hundreds of billions).
- Achieving efficient and accurate TDA on frontier models will likely take 2 to 5 years, depending largely on specific incremental research results and on the funding and number of researchers allocated to the space.
- Efficient TDA techniques will likely have a substantial positive impact on AI research and LLM development, including the following effects:
  - Mitigating the prevalence of hallucinations and false claims
  - Identifying training data that produces poor results (bias, misinformation, toxicity) and improving data filtering / selection
  - Shrinking overall model size / improving efficiency
  - Improved interpretability & alignment
  - Improved model customization and editing
- AI labs are likely already well-incentivized to invest in TDA research efforts because of the benefits to AI research.
- Public access to TDA tooling on frontier AI models is limited primarily by the unwillingness / inability of AI labs to publicly share their training data.
- AI labs currently have strong incentives to keep their training data private, as publishing such data would have negative outcomes such as:
  - Reduced competitive advantages from data curation
  - Increased exposure to legal liabilities from data collection
  - Violating privacy or proprietary data requirements
- AI labs may be able to avoid these outcomes by selectively permitting TDA inference on certain training examples, or returning sources rather than the exact training data.
- Governments are highly unlikely to mandate public access to training data.
- If AI labs willingly provided public access to TDA, you could expect the following benefits, among others:
  - Preventing copyrighted data usage
  - Improved fact checking / content moderation
  - Impacts on public trust and confidence in LLMs
  - Accelerated research by external parties
  - Increased accountability for AI labs
- AI labs appear largely disincentivized to provide access to TDA inference, as many of the public benefits are disadvantageous for them.
- Governments are highly unlikely to mandate public access to TDA.
- It seems plausible that certain AI labs may expose TDA as a feature, but that the majority would prefer to use it privately to improve their models.
- Several systems that could be enabled by efficient TDA include:
  - Providing royalties to data providers / creators
  - Automated response improvement / fact-checking
  - Tooling for improving external audits of training data
  - Content attribution tooling for LLMs, though it is unlikely to replace systems reliant on RAG
- We believe that the most promising benefit of TDA for AI risk mitigation is its potential to improve the technical safety of LLMs via interpretability.
- There are some societal / systemic benefits from TDA, and these benefits may be a small contributing factor to reducing some sources of risk. However, we don’t think they move the needle significantly on reducing large-scale AI risks.
- TDA may meaningfully improve AI capabilities research, which might actually increase large-scale risk.
- TDA may eventually be highly impactful in technical AI safety and alignment efforts. We’d consider TDA’s potential impact on technical AI safety to be in a similar category to supporting mechanistic interpretability research.
Executive summary: Training Data Attribution (TDA) is a promising but underdeveloped tool for improving AI interpretability, safety, and efficiency, though its public adoption faces significant barriers due to AI labs' reluctance to share training data.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.