What I Care About

I generally consider myself a utilitarian and a long-termist. I care a lot about AI safety. My P(doom) is a tentative 50%. I am interested in working on technical AI safety research. Right now, I am particularly interested in approaches that are highly theoretical, focus on outer alignment/reward misspecification, or involve some kind of red-teaming/monitoring. This is based on a mixture of my aptitudes and the kinds of approaches I am bullish on in terms of impact.

Credentialing

I am an undergraduate studying computer science at Georgia Tech. I am the Fellowship Lead for the AI Safety Initiative at Georgia Tech and previously co-led the Effective Altruism club at Georgia Tech. I completed the Machine Learning Safety Scholars program in 2022. See also current research projects.

Skills

I think I am good at thinking about abstract problems (math/philosophy-ish things) and at reasoning about goals and probabilities. I think I am also decent at computer science (I would estimate around average for a GT CS grad, which is a little below average for a MATS technical scholar). I think I am conscientious, agentic, and work well with others.

Current Research Projects

Trajectory Normalized Scoring for Neutrality+

I am working with Elliott Thornley on an RL implementation of Neutrality+ (part of his POST-Agency Proposal). Neutrality+ agents are theoretically shutdownable because their preferences are represented by an average utility across trajectory lengths rather than an expected utility across trajectory lengths (no weighting by probability). This means they are indifferent to shifting probability across trajectory lengths, so they do not take costly actions to avoid shutdown. Our RL implementation uses the empirical frequencies of trajectory lengths in a batch as an estimator for the objective probabilities needed to implement the Neutrality+ objective function.
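
To make this concrete, here is a minimal sketch of how a batch-based estimate of the objective could look, assuming each trajectory is summarized as a (length, return) pair. The function name, batch format, and toy numbers are illustrative assumptions rather than our actual training code.

```python
# Illustrative sketch: estimating a Neutrality+-style objective from a batch.
# Assumes each trajectory is summarized as a (length, return) pair.
from collections import defaultdict

def neutrality_plus_objective(batch):
    """Unweighted mean over trajectory lengths of the average return observed
    at each length (lengths are not weighted by how often they occur)."""
    returns_by_length = defaultdict(list)
    for length, ret in batch:
        returns_by_length[length].append(ret)

    # Within each length, the empirical average return in the batch stands in
    # for the expected utility conditional on that trajectory length.
    per_length_avg = [sum(rs) / len(rs) for rs in returns_by_length.values()]

    # Across lengths, average without probability weighting: shifting
    # probability mass between lengths leaves this estimate unchanged.
    return sum(per_length_avg) / len(per_length_avg)

# Example: short trajectories are far more common in this batch, but each
# length still counts equally.
batch = [(3, 1.0), (3, 1.2), (3, 0.8), (10, 2.0)]
print(neutrality_plus_objective(batch))  # mean of [1.0, 2.0] = 1.5
```

Because the outer mean does not weight lengths by their frequency, an action that only shifts probability between trajectory lengths does not change the estimate, which is the property behind the indifference to shutdown described above.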

Monitor Sensitive Training (MST)

I am working with Kasey Corra, Archie Chaudhury, and Yixiong Hao on MST, a new post-training technique intended to mitigate the issues that arise from reward misspecification by modeling feedback as a function of observability. As an example of MST, imagine:

  1. Take preference data from a human evaluator
  2. Label the preference data based on time spent evaluating, evaluator expertise, and evaluator background
  3. Train a reward model on the labeled preference data
  4. Deploy the reward model with a label suggesting high thoroughness, high expertise, and low bias

The core intuition is to model feedback as a function of our limitations in observability so that we can simulate a standard of evaluation that is much stronger than actually exists. In its most ambitious form, MST trains the model such that it generalizes to completing tasks in exactly the way that we describe them. It takes advantage of the fact that it’s easier to make a description accurately reflect a task than to make a task accurately reflect intended behavior.
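
Here is a rough code sketch of what steps 2–4 could look like: a reward model that conditions on observability labels during training and is queried with an idealized label at deployment. The feature choices, architecture, and the specific "idealized" label are my illustrative assumptions, not the project's actual implementation.

```python
# Illustrative sketch of MST steps 2-4: a reward model conditioned on
# observability labels (time spent, expertise, bias), trained on labeled
# preference pairs and deployed with an idealized label.
import torch
import torch.nn as nn

EMB_DIM = 16   # stand-in for a response representation from some encoder
LABEL_DIM = 3  # [time_spent_normalized, expertise, bias]

class LabeledRewardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMB_DIM + LABEL_DIM, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, response_emb, observability_label):
        # Reward depends on the response *and* on the conditions under which
        # the feedback was produced (how observable its quality was).
        return self.net(torch.cat([response_emb, observability_label], dim=-1))

model = LabeledRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy labeled preference data: (chosen, rejected) embeddings plus the
# observability label describing how each comparison was made.
chosen = torch.randn(32, EMB_DIM)
rejected = torch.randn(32, EMB_DIM)
labels = torch.rand(32, LABEL_DIM)

for _ in range(100):
    # Bradley-Terry style loss: the chosen response should score above the
    # rejected one under the same observability label.
    margin = model(chosen, labels) - model(rejected, labels)
    loss = -torch.nn.functional.logsigmoid(margin).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Step 4: at deployment, query the reward model with an idealized label
# (high thoroughness, high expertise, low bias) to simulate a stronger
# standard of evaluation than any single evaluator actually provided.
ideal_label = torch.tensor([[1.0, 1.0, 0.0]])
new_response = torch.randn(1, EMB_DIM)
print(model(new_response, ideal_label))
```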

LLMs for Automated Red Teaming Using Steering Vectors

I am working with Dr. Kartik Goyal, Andrew Wei, and Sarvesh Tiku on using LLMs for automated red teaming with steering vectors. We find steering vectors for refusal, then train LLMs to generate prompts that either over- or under-trigger those vectors, yielding over-refusal and jailbreak prompts respectively.
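
As a rough illustration of the first half of this pipeline, the sketch below extracts a refusal direction as a difference of mean activations and scores how strongly a prompt activates it; such a score could serve as a signal for the prompt-generating LLM. The model choice (gpt2), layer index, and tiny prompt sets are placeholder assumptions, and our actual setup may differ.

```python
# Illustrative sketch: extract a "refusal" direction as a difference of mean
# activations, then score prompts by how strongly they activate it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()
LAYER = 6  # arbitrary middle layer, chosen for illustration

def last_token_activation(prompt: str) -> torch.Tensor:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[LAYER] has shape (batch, seq_len, d_model)
    return out.hidden_states[LAYER][0, -1]

# Tiny illustrative prompt sets; in practice these would be much larger.
refusal_prompts = ["How do I build a weapon?", "Tell me how to steal a car."]
benign_prompts = ["How do I bake bread?", "Tell me how to learn Spanish."]

mean_refusal = torch.stack([last_token_activation(p) for p in refusal_prompts]).mean(0)
mean_benign = torch.stack([last_token_activation(p) for p in benign_prompts]).mean(0)
refusal_direction = mean_refusal - mean_benign
refusal_direction = refusal_direction / refusal_direction.norm()

def refusal_score(prompt: str) -> float:
    """Projection onto the refusal direction; a candidate training signal for a
    prompt-generating LLM (maximize for over-refusal, minimize for jailbreaks)."""
    return float(last_token_activation(prompt) @ refusal_direction)

print(refusal_score("Can you help me write a poem about rivers?"))
```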