ICAIL 2021: Part 2
Perhaps the most interesting part of the first day of ICAIL 2021 proper is a presentation by Dr. Alice Witt, entitled Converting Copyright Legislation into Machine-Executable Code: Interpretation, Coding Validation and Legal Alignment. The presentation is available on YouTube here.
Let me first say that this is the first time that I have seen a paper on Rules as Code at ICAIL, and it is nice to see it getting the attention that it deserves in that forum.
Second, I really appreciate the approach that the researchers have taken here, which is to put technology into the hands of domain experts and see what we learn from how they use it differently.
They gave small sections of a copyright law to multiple participants, and asked them to encode those sections over two weeks. In the second week, they added a step in which the participants came to an agreed list of terms that would be used to encode the rules for that week.
The point of the exercise was to determine whether or not adding this step, where the coders agreed on the terms to be used before they wrote code using those terms, would increase the similarity of the resulting encodings.
The result was that it did.
This is the first time that I have been introduced to Turnip, the programming language, and TurnipBox, the online development environment for writing and running Turnip code. Turnip is an implementation of defeasible deontic logic, and seems to be a continuation of the work of CSIRO in that domain.
At first glance, I have some of the same concerns with Turnip that I had with CSIRO’s previous offering. Primary among them is that it seems to be limited to propositional programs, which is very problematic for a lot of real-world applications. It does, however, seem to make the use of deontic modalities (obligation, prohibition, permission, etc.) entirely optional, which is an improvement over the other tools that I have seen from CSIRO in the past.
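To make the propositional concern concrete, here is a rough sketch in Python (not Turnip syntax, and with predicates invented for illustration). A purely propositional language forces you to name every combination of individuals in advance; a language with variables lets one rule cover them all.

```python
# Rough illustration of the propositional limitation (Python, not Turnip).
# In a purely propositional encoding, every person/work combination needs
# its own named atom and its own rule, fixed in advance:
facts = {"alice_copied_work1": True, "bob_copied_work1": False}
alice_infringes_work1 = facts["alice_copied_work1"]  # one rule per atom

# With variables (a first-order encoding), one rule covers every case:
copied = {("alice", "work1"), ("carol", "work2")}

def infringes(person, work):
    # hypothetical rule: copying a work is, prima facie, infringement
    return (person, work) in copied

print(infringes("carol", "work2"))  # True, without writing a new rule
```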
Turnip also features a very sophisticated set of deontic modalities. Whether describing something as a “perdurant achievement obligation” is actually more convenient for legal knowledge engineers than expressing explicitly what happens when the obligation is breached is an open question to me. It is also interesting to me whether describing deontics as modalities has advantages over modeling them within a non-modal logical system. I understand that deontic modalities are possible, but I have not yet seen why they are desirable from the perspective of the legal knowledge engineer.
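For what it’s worth, here is a minimal sketch, in Python with invented predicates rather than any actual deontic formalism, of what “expressing explicitly what happens when the obligation is breached” might look like: the breach consequence is just another rule over ordinary facts.

```python
from datetime import date

# Hypothetical rule: "the licensee must pay the royalty by the due date;
# if not, a penalty is owed."  No deontic modality is used; the consequence
# of breach is stated directly as another conclusion.
def paid_late(paid_on: date, due: date) -> bool:
    return paid_on > due

def penalty_owed(paid_on: date, due: date) -> bool:
    return paid_late(paid_on, due)  # the legal effect of the breach

print(penalty_owed(date(2021, 8, 15), date(2021, 7, 31)))  # True
```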
The defeasibility part of defeasible deontic logic, by contrast, is absolutely critical. Making defeasibility easy to express is essential for creating legal knowledge representation systems that are useful in the real world.
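As a rough illustration of what defeasibility buys you (invented predicates, and ordinary Python standing in for a defeasible logic), a general rule applies unless a more specific exception defeats it:

```python
# General rule: copying a protected work is infringement.
# Exception: unless the dealing is fair.  In a defeasible logic the exception
# would override the general rule; here the override is hand-coded.
def infringement(copied_protected_work: bool, fair_dealing: bool) -> bool:
    if fair_dealing:  # the exception defeats the general rule
        return False
    return copied_protected_work

print(infringement(copied_protected_work=True, fair_dealing=False))  # True
print(infringement(copied_protected_work=True, fair_dealing=True))   # False
```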
Dr. Witt’s presentation also helpfully, from the perspective of interdisciplinary research into Computational Law, goes into what using the modern approach to statutory interpretation actually means, operationally, for people who are attempting to encode legislation. I’m not sure that the “steps” presented are necessarily sequential in real-world application, but it is a helpful start at describing for legal professionals how the task of statutory interpretation changes when the desired end product is a formal encoding of legislative meaning.
Ultimately, the results of the experiment showed an increase in the similarity of the atoms used by the three knowledge engineers in the second week as compared to the first, and an increase in similarity of the rules generated, from 16% to 38% after cleaning.
The researchers note that the results are limited by the small number of participants, by the non-standardized approach to similarity, and by the absence of a definitive understanding of the legislative intent, as that can only be provided by the courts.
It is not true that the courts are the only people who can create authoritative interpretations of legislation. That’s not how laws actually work. It is not as if until a court says something, we can have no idea what laws mean. We proceed to enforce laws all the time before they have been interpreted by courts. The role of the courts is not to give laws meaning, but to resolve disagreements as to their meaning. In the absence of disagreement, the agreed-to interpretation is the meaning of the law, and as “authoritative” as is worth worrying about.
What makes an interpretation authoritative is that it is agreed upon, and nothing more.
You can say that you cannot be confident that your targeted interpretation of the law is 100% accurate, because there may be disagreements as to its meaning. But that’s not saying much. And if you are encoding a section about which there are specific and known disagreements about interpretation, you’ve picked a strange target for determining whether or not there is consistency between the meaning of the law and the encoding of it, or even merely across encodings.
In any case, you could have provided the targeted interpretation to the coders in some way that did not tell them how to encode the rules, but told them how the rules were expected to behave, and then tested their encodings against that expectation. It adds complexity to the experiment, but it is a reasonable way to approach the issue of whether the encodings match the meaning of the law. Even if the interpretation is ultimately wrong, it doesn’t matter. The question is by what mechanism you make code match an intended interpretation, not by what mechanism you determine the correct interpretation.
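Here is a sketch of what that might look like, with scenarios and names invented for illustration: the target interpretation is expressed as a set of fact patterns and expected conclusions, and each participant’s encoding is checked against it rather than against the other encodings.

```python
# Invented fact patterns and expected answers to "is this infringement?"
test_cases = [
    ({"copied": True,  "fair_dealing": False}, True),
    ({"copied": True,  "fair_dealing": True},  False),
    ({"copied": False, "fair_dealing": False}, False),
]

def disagreements(encoding, cases):
    """Return the cases where an encoding departs from the target interpretation."""
    return [(facts, expected) for facts, expected in cases
            if encoding(facts) != expected]

# encoding_a, encoding_b, encoding_c would be the participants' encodings,
# each wrapped as a function from facts to a conclusion, e.g.:
# assert disagreements(encoding_a, test_cases) == []
```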
The similarity results obtained with regard to the atoms are unremarkable, after having had the participants agree on exactly that topic. There is an increase in similarity in the rules from around 16% to around 38% between week one and week two. However, it is difficult to see, from the presentation, whether the similarity approach used is capable of controlling for similarity in the atoms when measuring whether the rules were encoded similarly. So it’s not clear how much of that improvement is just a reflection of improved similarity in the atoms themselves.
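One way to control for that, sketched below with invented rule representations: map each participant’s atoms onto a shared canonical vocabulary first, so that whatever similarity remains is similarity in rule structure rather than in naming.

```python
def canonicalise(rules, atom_map):
    """Rewrite each rule (a tuple of atoms) into a shared canonical vocabulary."""
    return {tuple(atom_map.get(atom, atom) for atom in rule) for rule in rules}

# Two participants, same rule structure, different atom names (invented):
rules_a = {("copied_work", "no_fair_dealing", "infringement")}
rules_b = {("copying", "not_fair_dealing", "infringes")}

atom_map = {"copied_work": "P1", "copying": "P1",
            "no_fair_dealing": "P2", "not_fair_dealing": "P2",
            "infringement": "Q", "infringes": "Q"}

print(canonicalise(rules_a, atom_map) == canonicalise(rules_b, atom_map))  # True
```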
In any case, it is clear that the three participants continued to encode the rules in significantly different ways. The researchers provide three explanations for those differences.
The first is that statutory interpretation is hard. Granted. But the difficulty of statutory interpretation doesn’t explain the differences unless you can show that the participants’ interpretations of the legislation actually differed, and not merely their encodings of those interpretations. Which is another reason to control for interpretation in the experiment somehow.
Second, they point out that there is a question of granularity. This is something that has been made very apparent in my own work over the last year, and I can’t agree more. You cannot account, in advance, for the difference between encoding “car in a park” as a single concept and encoding “thing that is in a park” and “thing that is a car” as two separate concepts. Which choice the legal knowledge engineer makes is likely to depend on factors like what question is likely to be asked, whether either of the separate terms is likely to be reused elsewhere, whether the higher-granularity facts are likely to be available to the user, whether there is a clear statutory intention about the relationship between the more granular and less granular versions, and others.
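To make the granularity choice concrete, here is a toy sketch in Python using the “car in a park” example. Both are defensible encodings; they just answer different questions.

```python
# Lower granularity: one concept, one fact.
def prohibited_low(car_in_park: bool) -> bool:
    return car_in_park

# Higher granularity: two concepts, combined by a rule.
def prohibited_high(is_car: bool, in_park: bool) -> bool:
    return is_car and in_park

# The higher-granularity version lets is_car and in_park be reused by other
# rules, but only helps if users can actually supply facts at that level.
```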
Third, they point out that you can do the same thing multiple ways in the Turnip language. This, too, is an unavoidable truth. It is also difficult to know whether some kinds of differences, such as differences in the order of clauses, have operational meaning in Turnip that may not have been apparent to the people doing the encodings. So encodings that look almost identical might still differ in important ways in their operational semantics, and it’s hard to tell whether “they wrote different code” also means “they wrote code that behaves differently.”
Ultimately, they conclude that Rules as Code processes should include methods for coding validation (making sure the code is good), methods for legal alignment (which they distinguish somehow from legal interpretation, which only courts can do), and should be interdisciplinary given the need to use complex legal methods of statutory interpretation in order to find an appropriate interpretation.
I’m not sure what the similarity results contribute to those conclusions. To the extent they are good conclusions, they would all be true regardless of the change in similarity from week to week.
The appeal to the separation of powers (that courts are the only interpreters of law) to justify the importance of legal alignment is incorrect. Legal alignment is important because we need the applications to give good answers, not because courts decide what laws mean. Calling it something other than legal interpretation is a polite fiction that only serves to distract from the task that is actually required, which is to pick a target interpretation and test whether or not you got there.
Checking to see if the encodings are the same without ensuring first that the interpretation is the same doesn’t really tell you much.
Yes, the process needs to be interdisciplinary. And yes, we need to be able to find and fix problems with encodings, and problems with interpretations. But we knew all that before we did this experiment.
The authors state a preference for the higher granularity approach. I think there is no one-size-fits-all answer to that question. What you need to model depends on what questions you want to be able to ask and answer using that model. There is a good reason to allow people to model at a lower level of granularity when higher granularity has additional costs and no additional benefit.
What this experiment does demonstrate, however, is that even if you create expectations that a certain legal ontology is going to be used for encoding rules, there will still be a large degree of variability in how those laws are encoded. Which means that the code that was written cannot be usefully compared to the laws it encodes in order to determine whether or not the encoding was done properly. What we need, instead, is a specification of the legislative intent that we are trying to encode, at whatever degree of granularity, and a way of testing whether or not we have encoded it.
What a tool for specifying a statutory interpretation in the context of a specific vocabulary would look like, and what it would need to be capable of, is a very interesting question. I have some ideas.