Search

Search for author "Goldowsky-Dill, Nicholas": 8 results in total

Search Results

1. Detecting Strategic Deception Using Linear Probes

2. Open Problems in Mechanistic Interpretability

3. Towards evaluations-based safety cases for AI scheming

4. Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

5. The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

6. Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

7. Localizing Model Behavior with Path Patching
