Evan's thoughts on boundaries (Apr 2024)

AI Guardrails may benefit from formalizing the concept of boundaries

Apr 09, 2024

This blog post is an informal summary of progress on a technical topic. Feel free to let us know in the comments if you want to see more or less content like this.

Evan helped organize and run the “Conceptual Boundaries” workshop, which was initiated and primarily organized by Chris Lakin. This workshop was intended to extend Andrew Critch’s initial work on «Boundaries» here, though I’ll try not to assume you’ve read any of that.

Notes on these notes (i.e. meta-notes):

This is not an official summary or debrief from the event, and has been published without input from the other workshop participants. This is simply a summary of things I wanted to either remember or share with others.
1. Assume insights are primarily from other participants, while controversial takes are my own.
I’ll try to answer questions in various venues, either based on my opinions or participant notes from the workshop, but the participant notes themselves are not directly sharable.
1. That said, if you’re curious what a particular participant thought of the workshop, they likely have an early draft of something they started writing on the last day of the workshop, and your question might be enough to prompt them to finish writing and publish the result.
I like outline formats, as they should let you skim at your desired level of detail.
1. You’re welcome to leave feedback on this format.

Context on the workshop

A conversation at a Foresight Institute event initially prompted the workshop. Chris, Allison Duettmann, and I agreed on the need for more work in this area, leading to Chris committing to lead a workshop with support from others.
1. I’ve personally been surprised at the amount of interest in the workshop generally, and the variety of paths that have led interested people to the topic.
  1. One reason might be that there’s a modern movement in the mental health space (e.g. described here) to recast many interpersonal issues in terms of boundaries, which has the interesting perspective of mapping customs onto the concept of property. This didn’t come up in the workshop; I just thought it was interesting.
The workshop itself included participants who had fairly different backgrounds, context, and goals for boundaries.
1. As a result, much of the time was spent trading context and figuring out which assumptions and goals were shared.
At the end of the workshop, many participants (as well as the organizers, i.e. Chris and me) felt we’d done an interesting breadth-first exploration and wanted a second workshop.
1. The goal of the next workshop is to lay the foundation for boundaries as a new research subfield by developing clear and useful definitions, identifying interesting open problems, and setting goals that we think boundaries research agendas could achieve.
  1. This next workshop starts April 10th (i.e. two days from writing this)
  2. Most of this document is my attempt to provide a download of my thoughts leading into that workshop.

My hope for boundaries:

If you’ve come to this page via the Atlas Computing website, you probably know that we’re working to build safeguards for AI, and one way to achieve that might be to provide some baseline constraints.
1. In other words, can we define boundaries in a way that is both
  1. sufficiently grounded in quantifiable, objective (i.e. not subjective) information that an AI could be trusted to understand what constitutes a boundary and a boundary violation
    AND
  2. sufficiently useful as a framework that it can easily be made consistent with most people’s intuition for what a boundary is?
    1. There’d necessarily be some parameters to set/tune, but the goal would be to have most of the heavy lifting done by the framework rather than, for instance, needing to use something like a boundaries language to generate descriptions on a case-by-case basis.
  3. Unsurprisingly, this has a lot of overlap with What does davidad want from «boundaries»?, as davidad is an advisor of Atlas Computing.
2. This could look like some abstracted version of object identification that also encodes some notion of separability or independence of objects.
  1. Current object identification mostly requires existing data or explanations of what a thing is before it can start identifying instances of that thing, or identifies a thing because its pieces move together; boundaries should identify a thing because of some aspects of its “thing-ness”.
    1. I’ll give a very lossy summary of Critch’s VAPE formalization of boundaries here:
      1. You can define a set of Viscera, Active boundary (or Actions), Passive boundary (or Perception), and Environment states that interact with each other, modeled as a Bayesian network
      2. These states are limited in what they can act on (e.g. environment and viscera act on each other only indirectly, via the active and passive boundaries)
      3. This model assumes discrete time, but it potentially lets you label different parts of the world (or of an image, simulation, or video) as V, A, P, or E.
      4. If this is interesting, you should at least read that whole post, if not the whole sequence.
3. If we had this way to identify objects, maybe we could identify a minimum viable set of boundaries, where, if they were not violated by an action, then we could be confident that the action did not result in a catastrophic unforeseen (and therefore unspecified) outcome.
  1. A simple example: if you can assure that an agent’s strategy for making a cup of tea doesn’t end respiration for any humans, perhaps you could claim that it’s more likely that the strategy [makes a cup of tea and doesn’t kill anyone] than the strategy [makes a cup of tea AND creates a hellscape that maintains respiration]. (My language is a little facetious/hyperbolic, but hopefully you get the idea.)
  2. If a system can identify boundaries objectively and understand what it means to violate them, we can check whether an action violates a boundary via something like formal methods.
    1. This could be important because you can use a Safeguarded AI architecture in conjunction with an objective definition of boundaries without worrying about whether the AI is trying to subvert your goals*.
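To make the VAPE summary above a bit more concrete, here is a minimal Python sketch. It is not Critch's formalization: it encodes only the one constraint stated above (viscera and environment never act on each other directly), and the names `FORBIDDEN`, `forbidden_edges`, `reachable`, and both example graphs are hypothetical, made up for illustration.

```python
# The one constraint taken from the summary above: viscera (V) and
# environment (E) never influence each other directly.
FORBIDDEN = {("V", "E"), ("E", "V")}


def forbidden_edges(edges):
    """Return the edges that link viscera and environment directly."""
    return sorted(set(edges) & FORBIDDEN)


def reachable(edges, src, dst):
    """True if src can influence dst over one or more time steps."""
    frontier, seen = [src], set()
    while frontier:
        node = frontier.pop()
        for a, b in edges:
            if a == node and b not in seen:
                seen.add(b)
                frontier.append(b)
    return dst in seen


# Hypothetical influence graphs: (x, y) means "x at time t affects y at t+1".
well_formed = [
    ("V", "V"), ("V", "A"), ("A", "A"), ("A", "E"),
    ("E", "E"), ("E", "P"), ("P", "P"), ("P", "V"),
]
leaky = well_formed + [("E", "V")]

print(forbidden_edges(well_formed))      # []
print(forbidden_edges(leaky))            # [('E', 'V')]
print(reachable(well_formed, "E", "V"))  # True, but only via P
```

In the well-formed graph the environment still reaches the viscera, just indirectly through the passive boundary. A check like `forbidden_edges` is the kind of simple, objective test one might hope to hand to a formal verifier, though a real version would need to work on learned or inferred structure rather than a hand-written edge list.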
I really like the perspective that “boundaries might provide a way to identify the nouns in a normative language”.
1. If you want to make statements about what things should do (with or to other things), you probably want an objective way to start identifying things.
  1. As an example: operationalizing the statement “people shouldn't hurt others” requires definitions of people, others, and hurt that should minimally rely on interpretation so that observers can agree if a proposed or past action violates the statement.
2. Part of what I like about this framing is that I’ve found it fairly compelling to map ethical and political questions into the framework of “which boundary takes precedence in this case”, which is nontrivial because people on both sides of an argument seem willing to accept that both sets of boundaries DO exist.
  1. E.g. pro-choice vs pro-life could be mapped onto the questions “when does a fetus’s boundary exist independently from the boundary of the person in whose uterus the fetus exists?” and “when do governments have the right to violate the will/boundaries of constituents”
  2. E.g. immigration could become a question about “how do we distinguish benefits of being inside the intersecting boundaries of ‘physically in the country’ vs ‘citizen of the country’”

Some topics that were discussed:

Boundary protocols
1. In practice, you do want boundaries to be crossed or modified under the right conditions, because perfectly preserved boundaries mean stagnation. An organism with perfectly preserved boundaries will starve; perfectly preserved national boundaries prevent trade; etc.
  1. Realistically, you want to be able to describe (and perhaps even infer) when it’s acceptable to the object for something to cross its boundary.
    1. One challenge is that cells seem to love letting viral DNA in, but that feels like a boundary violation.
    2. Meanwhile, only some people want surgeons to operate on their cancer, so language and the study of informed consent clearly also play a role at some level.
  2. Boundary protocols are embedded in physical reality (e.g. cell receptors on the boundary of a cell encode what is allowed in).
    1. How would one infer boundary protocols? And how would a protocol be updated or renegotiated?
  3. My Q: How much of a boundary protocol can you infer from observation?
    1. For example, by only observing people within a culture, is it possible to learn the social norms sufficiently to participate without causing disruptions? Could you learn them well enough to not change the culture if you now made up >90% of the participants? I’m not sure you could, which would prevent this approach from enabling AI to act ethically. It still might not limit the AI’s ability to act safely, though, since there are interventions (like destroying a food supply) that clearly disrupt a culture in a predictable way.
Models of Boundaries
1. It seemed like Yann LeCun’s H-JEPA (section 4.6 here) is quite relevant, and we explored that.
2. We also discussed if Petri Nets could be used to model the state of a system, its boundary, and its boundary protocol.
3. Another potential model that’s come up since is port-Hamiltonian systems.
4. Generally, it felt like progress was needed (especially on answering questions like “how could one model boundaries in a way that allows for continuous time?”).
  1. There were also a bunch of explorations around things like “do you need to be able to label things as ‘boundary’ or is labeling inside and outside of objects sufficient” or “how to deal with non-contiguous physical boundaries” that didn’t feel to me like they reached clear endpoints.
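As a rough illustration of the Petri net idea above, here is a minimal Python sketch in which places hold tokens on either side of a cell boundary and the transitions play the role of the boundary protocol. This is my own toy encoding, not something produced at the workshop: the place names, the `IMPORT_NUTRIENT` transition, and the `enabled`/`fire` helpers are all hypothetical.

```python
from collections import Counter


def enabled(marking, transition):
    """A transition is enabled if every input place holds enough tokens."""
    inputs, _ = transition
    return all(marking.get(place, 0) >= n for place, n in inputs.items())


def fire(marking, transition):
    """Consume the input tokens and produce the output tokens."""
    if not enabled(marking, transition):
        raise ValueError("transition is not enabled")
    inputs, outputs = transition
    new = Counter(marking)
    new.subtract(inputs)
    new.update(outputs)
    return +new  # unary plus drops places whose count fell to zero


# Boundary protocol: a nutrient may cross only by using a free receptor,
# which is released again afterwards. No transition admits the toxin.
IMPORT_NUTRIENT = (
    {"nutrient_outside": 1, "receptor_free": 1},
    {"nutrient_inside": 1, "receptor_free": 1},
)

marking = Counter({"nutrient_outside": 2, "toxin_outside": 1, "receptor_free": 1})
after = fire(marking, IMPORT_NUTRIENT)
print(after["nutrient_inside"], after["nutrient_outside"], after["toxin_outside"])
```

Here the state of the system is the marking, the boundary is the split between the "outside" and "inside" places, and the protocol is the set of transitions that move tokens across it. Note that this is still a discrete-event model, so it does not address the continuous-time question raised above.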
Types of boundaries
1. I created this list of Examples of Boundaries. It’s definitely got issues, but it was helpful for making sure that a statement made about one type of boundary held for the other types one might want to consider.
2. I also thought this formulation of boundaries was interesting:
  1. If we identify types of things that are interesting to preserve, it’d be nice to have a way of relating things to other things. Here are four categories of things:
    1. Objects (physical arrangements that persist in time)
      1. E.g. atom or rock; it makes sense to say there’s a “boundary” around it because it’s intuitively recognizable as a thing.
    2. Cycles of objects (physical objects that indirectly beget themselves)
      1. E.g. metabolic cycles; carbon cycle; chicken + egg
    3. Patterns (arrangements of information encoded in physical objects where the objects are transient but the information persists)
      1. E.g. forests or civilization: the trees or people change but the pattern remains; Dawkinsian memes, the Ship of Theseus, and living things (probably) fit into this category as well.
    4. Cycles of patterns
      1. E.g. centralization vs decentralization of power within society; the model of punctuated equilibrium in evolutionary biology
  2. “Things” on this scale are clearly composed of other “things”.
    1. While it might be possible to list all types of boundaries from the bottom up or create some sort of directed graph, I don’t think that’s necessary, since the most relevant piece is likely the ability to relate different boundaries to each other, which can be done more succinctly on a case-by-case basis than by falling back to a taxonomy of boundaries.
    2. Very hot take: a lot of my intuition says that preserving cycles of patterns (the fourth category), with deference going to the patterns recurring on the longest timescale, is an interesting extrapolation of moral trends. (I don’t think this is particularly defensible, but it’s an interesting thought.)

Again, this is very incomplete, and I’m mostly trying to get something out the door in time for the next workshop. We’ll try to have a more comprehensive (and more timely) summary out of the next workshop!

Atlas Blog
