Strict Semantic Nullability

At a glance

Identifier: wg#1410
Stage: RFC0: Strawman
Champion: @leebyron
PR: -
Related:
- #1048 (Null-Only-On-Error / Semantically-Non-Null type (asterisk))
- #1065 (SemanticNonNull type (null only on error))
- #1073 (Be strict about error paths format)
- SemanticNullability (Semantic Nullability)

Timeline

WG discussion created on 2023-10-05 by leebyron

This is a follow up to #1394 based on a discussion in the Oct 2023 WG meeting.

Future of nullability in GraphQL is strict semantic nullability.

High level overview:

We introduce the concept of a "Semantically Nullable" type modifier ? which describes a type as strictly allowing return of semantic null values.

We introduce a schema directive @strictNullability to resolve how to interpret null values.

GraphQL nullability historical rationale

GraphQL field types default to being nullable with a modifier ! to indicate non-nullability. Why?

First, we want to preserve future evolution of schema. It’s often the case that when first designing schemas that nullability and changes to it over time aren’t deeply considered. It turns out that it’s safe to convert a nullable field to a non-nullable one, but not the other way around. Thus the default is nullable. Defaults matter, and GraphQL’s default prioritizes allowing for future change.

Callout on safe field type changes

A field type change is "safe" when the new type describes a "subset" of the previous type. Changing Dog to Pet is not safe because a historical client made assumptions about Dog values, and won't know what to do with a Bird. Changing from Pet to Dog is safe because historical clients are ready to handle any Pet, and won't be surprised by exclusively receiving Dog values. Similarly changing String! to String is not safe because a historical client is not ready to accept null and might NPE. Changing String to String! is safe because the historical client was ready to accept null and just will happen to no longer ever receive one.

Second, we want to assume that anything can fail anywhere, and minimize disruption. A GraphQL field may be resolved by connecting to a service, and if that fails, a null is returned in the result (and also the error is included alongside the data in the response as well). Using the non-null modifier demands that field never returns null such that if an error occurs during resolution that it “bubbles” to instead have the parent field return null. This is nice in that it provides a strict guarantee of non-nullability, but not nice in that it’s destructive and that sibling fields which may have resolved normally are disposed as a result. As a result we provide guidance to use non-null ! types sparingly.

A very specific example covering both of these two reasons is considering what happens as a system evolves. Perhaps at first you have a simple application monolith with a single DB. A table column is non nullable so you imagine the resulting GraphQL field isn’t nullable either. However in the future you build a dedicated service for a subset of that table, and now resolving that field could fail to reach the service and result in null. A future change to architecture created the possibility for error, and thus null.

Implicit in this understanding of nullability is that a field type does not make it possible to differentiate between interpreting a null value as “this field is actually the value null” or “this field encountered an error and we have no data to return”. Ideally we can differentiate this both in the Schema, to describe which of these two interpretations are possible, and in the response, to describe which of the two interpretations has occurred for that specific resolution.

Or put more candidly: a GraphQL field is not actually "nullable", it is "ambiguously nullable". Ambiguity hurts!

Callout on terminology for null in GraphQL

Semantic null: A null value returned which describes the actual value of the field.

Error null: A null value returned which describes an error state.

Ambiguous null: A null value returned which describes one of the above two states without a way to differentiate which is the case.

The specific way this hurts is that clients must be able to differentiate between these two cases. First (schema) to generate useful type definitions, where the ambiguity requires us to generate nullable types everywhere, which is awful ergonomics. Then (result) to know whether to interpret a null value as a semantic null or handle it as an error null. Today clients must look in the "errors" part of the result to see if an error exists at that field, but how to interpret the absence of an error isn’t clear if it isn’t known if semantic null type was allowed in the first place.

So where do we go from here? How do we resolve this ambiguity?

Annotate semantic nullability: ?

Today we can describe a field’s type normally field: String or use a non-nullable type modifier, field: String!.

I propose introducing a "semantically nullable" modifier: field: String? (referring to this now as "nullable" to be terse).

If a field type is nullable (String?), that means that null values are in fact semantically allowed. For a client to know the difference between semantic null vs an error, they can now confidently look to the errors result. If an error exists in the array for this field then the null was the result of an error, and if not then it is in fact a semantic null.

This leaves an unmodified type (String) remaining as “ambiguously null”.

Callout on exact type definitions

Type! → Type (no null values allowed)

Type? → Type | SemanticNull | ErrorNull (differentiation must be possible)

Type → Type | AmbiguousNull (differentiation isn't always possible)

Now we have a way to describe some fields as specifically allowing semantic null and we have a mechanism (errors result) to differentiate that from an error null.

Now that a nullable modifier exists, to make this truly useful, we would next want to interpret unmodified field: Type as “null only on error” (related RFC) and resolve the ambiguity. How can we do this this safely, in a backwards compatible way?

A strict nullability schema

The schema can next include a directive (exposed as a new boolean in introspection) called @strictNullability. This directive tells clients that they should interpret unmodified field types (field: String) as semantic null not being a valid value and that any null value in a the data result should be interpreted as a field error, regardless of whether the error portion of the result includes an entry for that field.

Callout on exact type definition when @strictNullability is set

Type! → Type (no null values allowed)

Type? → Type | SemanticNull | ErrorNull (differentiation must be possible)

Type → Type | ErrorNull (differentiation unnecessary)

With both changes in effect, a schema has removed ambiguous null as a potential result from the service overall. Clients know the types possible in the schema and can interpret and differentiate the result accordingly.

Edit: added after @benjie's feedback below

Additionally, the introduction of @strictNullability now requires that an error is included in the error list if an unmodified field: Type returns null. It will do this by changing the execution behavior through the same mechanism as NonNull types in Value Completion. Importantly, these errors would not bubble.

Execution behavior (value completion) does not change for nullable types (Type?) since null continues to be allowed.

This means that execution behavior could change in a subtle manner. The result of the "data" field will remain unchanged (what was a null, remains a null), however the "errors" list could appear in some responses it previously did not. This could potentially be breaking when sending responses to a client which discards responses that include any error (unfortunately common for older clients).

Here is the specific case of this scenario explained via an example:

A field returns a value which is not meant to be semantically nullable, however the resolver is known to fail often. This service knows it has a client which throws out responses that include field errors, so it does not raise a field error from the resolver even though that would have been the semantically correct thing to do. Because the field is known fails often and the service decides that failure is not a big deal and they would like the client to use the rest of the data, they simply return null to indicate failure instead. While this is semantically incorrect, it produced the outcome they were looking for.

When migrating an existing service to @strictNullability that also needs to preserve backwards compatibility for clients which discard full responses if there are any errors, fields that return null to indicate an error should be typed as Type? instead of Type - they should be declared nullable, since that is an accurate typing of the schema design choice that was made.

End Edit

How to adopt this incrementally?

For existing schemas adopting this feature, they will be in an incremental state where "semantically nullable" modifiers (?) are incrementally added to resolve some ambiguity, and in this state the schema does not yet apply @strictNullability.

Once this migration is complete and a service has added all true semantically nullable modifiers to field types, then the @strictNullability directive is added.

Alternative incremental migration strategy

First, convert all field types to Nullable and apply @strictNullability at the same time, then incrementally remove the Nullable types from fields which are known to never be nullable.

While uglier, this would be safer for avoiding breaking changes if a service is unsure what values are possibly returned and concerned about the impact of introducing new field errors.

In the duration between a client beginning to use nullable type modifiers but before applying @strictNullability, clients can decide how to use code generation and result interpretation. Either:

A. Ignore the nullable type modifiers and see no change.

B. Unsafely assume "strictNullability" is enabled and accept the risk of being wrong.

C. Assume strict nullability in a locally incremental way by annotating each fragment for strict nullability typegen and interpretation in coordination with rolling out the nullable types on the service/schema side.

Most will do A, and that's fine - it's the preferred path if the migration will be quick and they prefer to just look ahead. Some will do B, and that's fine for small or high-communication teams where you can trust the wrongness risk. Relay and other sophisticated clients will do C, where they allow large teams to adopt this over time.

Let’s look at the effects. Does this break things?

Say a historical schema with many clients has now adopted nullable types and the @strictNullable modifier, what happens to backwards and forwards compatibility?

First of all, new clients no longer see “ambiguous” nulls. The schema now describes if a null is or is not semantically a valid value from the schema’s field type, and we know how to differentiate semantic null from error null (either because Type where null definitionally indicates an error, or Type? where if an error result for the field exists it is an error null, otherwise it is semantic null).

Edit after @benjie's comment

Even without knowledge of the schema, a client can accurately use the "errors" list in the response to know which null values represent errors and which are values, since an error null is always accompanied by an error in the list.

The application of @strictNullability is potentially breaking in an edge case that can be mitigated by use of Nullable types. Execution results are always unchanged for the "data" response, any client which exclusively looks at this part of the response will see no change at all. After applying @strictNullability unmodified types must include an error in the list for a null value. Clients which consider "errors" in the response could see new errors if a service was invalidly returning null from a field not marked nullable.

Historical clients are unchanged because critically this has not changed the way the executor works in any way. No field which used to return a null value no longer does or vice versa. No new errors are being emitted in the errors result. Error handling behavior is unchanged. This has exclusively changed the schema to be more descriptive in how to interpret existing results.

An important subtle point is that a @strictNullability service may return a null value from an unmodified field type without a resulting error payload. Modern clients now know to interpret this as the field failing to resolve an error (error null) and not a semantic null. Historical clients will continue to interpret this as ambiguous null. Introducing a new error payload where there wasn't one previously would have been unsafe. Some clients throw out any result payload with any error. (Wat?! See the FAQ below)

End edit

What about forward compatibility?

In a @strictNullability service/schema, you might still begin by introducing a field with an unmodified type field: Type, and while it's still true that later changing this to field: Type! remains safe, once a schema is strict, later changing this to field: Type? is in fact not safe.

However, I am less concerned about this for two reasons:

The primary reason schema designers are tripped up by this forward compat issue is not missing semantic null, it's missing error null. They fail to anticipate future changes in their underlying architecture introducing new places for errors to occur, and this proposal includes error nulls as a possibility in the default unmodified type.

Given the proliferation of type-safe languages today (not the case in 2012) it's likely that strict nullability is a first class design consideration for anyone with this directive enabled. If it's not, well then this is an opt-in directive and this schema design "footgun" is at least one that schema owners are opting themselves into rather than being surprised by. The default without-directive state will remain Type → Type | AmbiguousNull, which remains fine for less sophisticated services and clients.

FAQ: Should we then continue to suggest use of NonNull (!)?

Yes, but far less often. It's still used sparingly but it implies something which the service guarantees will never produce a null, including an error null. That's still useful in some scenarios (obj identifiers).

But generally most will use this a lot less with a more familiar ? available to them.

~~FAQ: How is it okay for a @strictNullability field to return null without a matching error in the "errors" array?~~

EDIT: This section no longer applies, but leaving here for posterity

Currently, a field returning an ambiguous null could mean one of three things:

There is a matching error in the "errors" array response, therefore it is certainly the result of an error.

Otherwise there is not an matching error - what does that mean?

If the intent was that this field should in fact allow semantic null values, then that's probably what this meant, but we have no way to know for absolute certain since the Schema can't yet declare whether semantic null is a possible expected value (the goal of this proposal!)

Otherwise this could be the result of a failure to load the data that's just missing an matching error.

Wait, what? How is a missing matching error possibly spec compliant?

According to the section on Handling Field Errors if a field error occurs then an error must be added to the errors list. This could happen because the resolver simply failed (threw Exception, return Result<Error>, etc), it could also happen because it returned a value that failed to coerce (was the wrong type, null for a Non-Null modifier, etc). This all implies that if a field failed to return the wrong type of value or failed to return at all that it is a field error and thus must have an error entry.

So how could this a field returning an error null not have a matching error in the list? Well, the field resolver happened to simply return null, which is totally allowed by the executor and schema. It did this not because semantic null was the right value, but just because services are weird sometimes and this is how they decided to represent a failure condition. And this is allowed... and ambiguous 🤷

So what do we do about this? We have two options:

Option A: A @strictNullability service always produces an error for nulls

We amend Value Completion so that in strict mode such that if a resolver returns null, and it isn't explicitly a Nullable type, then we throw a field error.

Pros:

Asserts that the resolver returns a correctly typed value, and when it does not (because null we assume is semantic null and not valid for a strict unmodified field: Type)

Guarantees that every error null has an matching error in the errors list.

Cons:

It's potentially breaking.

This introduces a new error which didn't exist before. Since lots of historical clients decided to simply reject any result which had an "errors" and try again, it's entirely possible that the service had made this strange choice not because they didn't know better, but because they considered the failure non-fatal and safe to omit the value. If they had thrown an error instead the client would have treated it too seriously and thrown out the whole thing. This was unfortunately a common pattern for a long time.

This breaking change can be mitigated, but only with careful guidance! Since the directive isn't applied by default, adding this to the spec is definitely not breaking. BUT you can't simply add the directive and expect no breaking changes! You must first move every field resolver that returns null to be a Nullable type! If that is true, then adding the directive introduces no change and no thus no breakage.

Option B: Do nothing.

No changes to the executor at all. Existing behavior persists.

Pros:

It's not breaking!

It sure is easy to implement

Cons:

It allows this non-obvious behavior to continue, and specifically means that in the case of an error null you're not guaranteed to have more information describing why. This is particularly bad for clients which seek to interpret error null vs semantic null in their response parsers without requiring knowledge of the schema.

Had we been starting from scratch, I'd definitely do option A (and I'd also not make strict mode, I'd just have done this from the start - agreeing with @dschafer's comment below). The guarantee of having error info is strictly better, and we'd just have built better clients.

But alas, I think our Guiding Principles point us to option B.

Also, while the spec can choose to do nothing, GraphQL libraries and services can always choose to be stricter than the spec itself. We've left plenty of room in allowing resolvers to be a "internal function" for GraphQL libraries to decide what is best.

I would be totally comfortable with a non-normative note in the spec suggesting that GraphQL libraries may choose option A, but for historical reasons we don't enforce it and it's still spec compliant to not.

Also, I suspect the cost of not having an error in the list guarantee is quite low. In @strictNullability we don't need it to know that a field has in fact failed. If a client wanted to get this guarantee back they could always fill in the gaps and produce a generic error locally that says something akin to "this field unexpectedly returned null"

Strict Semantic Nullability

At a glance

Timeline

Future of nullability in GraphQL is strict semantic nullability.

GraphQL nullability historical rationale

Annotate semantic nullability: `?`

A strict nullability schema

How to adopt this incrementally?

Let’s look at the effects. Does this break things?

What about forward compatibility?

FAQ: Should we then continue to suggest use of NonNull (`!`)?

FAQ: How is it okay for a `@strictNullability` field to return `null` without a matching error in the `"errors"` array?

At a glance​

Timeline​

Future of nullability in GraphQL is strict semantic nullability.

GraphQL nullability historical rationale​

Annotate semantic nullability: ?​

A strict nullability schema​

How to adopt this incrementally?​

Let’s look at the effects. Does this break things?​

What about forward compatibility?​

FAQ: Should we then continue to suggest use of NonNull (!)?​

FAQ: How is it okay for a @strictNullability field to return null without a matching error in the "errors" array?​

At a glance

Timeline

GraphQL nullability historical rationale

Annotate semantic nullability: `?`

A strict nullability schema

How to adopt this incrementally?

Let’s look at the effects. Does this break things?

What about forward compatibility?

FAQ: Should we then continue to suggest use of NonNull (`!`)?

FAQ: How is it okay for a `@strictNullability` field to return `null` without a matching error in the `"errors"` array?