One Serialization Class to Rule Them All

In Haskell, the predominant way to serialize data is to use a type class. Libraries such as aeson, binary, cereal, cborg, protobuf, msgpack, capnproto, avro, thrift, xml, yaml, toml, bson, etc. etc. typically provide a type class that you can implement for your data type to encode and decode to and from a given format. In general, I find that code written for a particular domain is content to provide anywhere between one and three data representations for their data types.

For example, a web application might have a data type for a user that looks like this:

data User = User
  { userId :: Int
  , userName :: Text
  , userEmail :: Text
  , userCreatedAt :: UTCTime
  , userUpdatedAt :: UTCTime
  }

In this hypothetical application. We might also want to send it over the wire to a client, so we would write a ToJSON and FromJSON instance for it. We might also want to store it in a cache where we care mostly about serialization speed and don't care about backwards compatibility, so we would write a Store instance for it. We might later find that we want to provide a stable serialization representation for it when used in a worker queue, so we would write a Protobuf instance for it.

Depending on the application, it's also not uncommon to end up with multiple representations of the same data type, especially for common types provided by libraries. For example, loads of libraries support serialization of (), Bool, Int, Text, ByteString, UTCTime, due to their ubiquity and generally simple serialization formats.

For a given domain that you control, it's often simplest to just pick one serialization format and stick with it. But for libraries, it's often desirable to support multiple serialization formats so that users can choose the one that best fits their needs. Wouldn't it be nice if, as a library author, you could let consumers of your library hande you a list of serialization formats that they want to use, and you could just pick the best one for each data type?

In an ideal world, we could imagine writing something like this:

type RawPayload = RawPayload
  { payloadData :: ByteString
  -- ^ The serialized data
  , payloadMetadata :: Map Text ByteString
  -- ^ Metadata about the payload, such as the encoding type
  }

-- | We want to be able to serialize & deserialize the parameters of a function using an arbitrary serialization format.
class Codec fmt a where
  -- | Similar to a content-type header, this is a string that identifies the format of the payload.
  -- it will be set on the 'encoding' metadata field of the payload.
  encodingType :: fmt -> Proxy a -> ByteString
  messageType :: fmt -> a -> ByteString
  default messageType :: (Typeable a) => fmt -> a -> ByteString
  messageType _ _ = C.pack $ show $ typeRep (Proxy @a)
  encodePayload :: fmt -> a -> ByteString
  decode :: fmt -> RawPayload -> Either String a

encode :: forall fmt a. Codec fmt a => fmt -> a -> RawPayload
encode fmt x = RawPayload 
  (encodePayload fmt x) 
  (Map.fromList 
    [ ("encoding", (encodingType fmt (Proxy @a)))
    , ("messageType", messageType fmt x)
    ]
  )

Here, we've defined a Codec type class that can be implemented for any serialization format. In order to understand details about the actual payload in question, we also have the RawPayload carry metadata with it. This is similar to how HTTP headers include a Content-Type and Content-Encoding header to inform clients how they should interact with the contents of a request body.

Now, let's define a few instances of this type class:

data JSON = JSON

instance (Typeable a, Aeson.ToJSON a, Aeson.FromJSON a) => Codec JSON a where
  encodingType _ _ = "json/plain"
  encodePayload _ x = BL.toStrict $ Aeson.encode x
  decode _ = Aeson.eitherDecodeStrict' . inputPayloadData

data Binary = Binary

instance Codec Binary ByteString where
  encodingType _ _ = "binary/plain"
  encodePayload _ x = x
  decode _ = Right . inputPayloadData


data Protobuf = Protobuf

instance (Message a) => Codec Protobuf a where
  messageType _ x = encodeUtf8 $ messageName $ pure x
  encodingType _ _ = "binary/protobuf"
  encodePayload _ x = encodeMessage x
  decode _ = decodeMessage . inputPayloadData

These instances come from the aeson, binary, and proto-lens libraries, respectively. Now, let's imagine a composite codec that can choose the best codec for a given type. Let's say that it's desirable for our situation to use direct binary serialization where possible, followed by protobuf, followed by JSON. We can define a composite codec like this:

data Composite (codecs :: [Type]) where
  CompositeNil :: Composite '[]
  CompositeCons :: codec -> Composite codecs -> Composite (codec ': codecs)

defaultCodec :: Composite '[Binary, Protobuf, JSON]
defaultCodec = CompositeCons Binary $ CompositeCons Protobuf $ CompositeCons JSON CompositeNil

Here we have a fairly standard heterogeneous list. What would the Codec instance for this look like?

Let's start with the base caes. Obviously, an empty list of codecs can't encode anything, so we'll make that a type error.

instance TypeError (('ShowType a) ':<>: 'Text " is not supported by any of the provided codecs") => Codec (Composite '[]) a where
  messageType = error "unreachable"
  encodingType = error "unreachable"
  encodePayload = error "unreachable"
  decode _ _ = Left "No recognized codec for this type"

Now, how do we handle the case where we have a non-empty list of codecs? We want to try each codec in order until we find one that works.

This is where things get tricky. We can easily write an instance that ensures that a value of type a is supported by the first codec in the list, but it's not really what we want:

instance Codec codec a => Codec (Composite (codec ': codecs)) a where
  messageType (CompositeCons codec _) = messageType codec
  encodingType (CompositeCons codec _) = encodingType codec
  encodePayload (CompositeCons codec _) = encodePayload codec
  decode (CompositeCons codec _) = decode codec

This instance will always pick the first codec in the list, even if it doesn't support the type. We want to be able to try each codec in order until we find one that works.

The fundamental issue is that Haskell doesn't natively admit a way to express "try this constraint, and if it fails, try this other constraint". With a few language extensions and a lot of boilerplate, however, we can eventually come with something like this:

class c || d where
  resolve :: (c => r) -> (d => r) -> r
infixr 2 ||

This is a type class on constraints themselves with a single method, resolve. In theory, this type class would resolve the value on the left-hand side using the provided instance c, if the constraint c was satisfied, otherwise it would try to resolve the constraint d instead. On a very very manual basis, this sort of works! Supposing that we wanted to dispatch on whether or not a type provides a Show instance, we could write an instance like this:

instance (Show Int || d) where resolve = \r _ -> r

For things that don't admit an instance, we can provide a constraint d that makes it impossible for the constraint c to be satisfied, and call the resolve method with the right-hand argument:

instance d => (Show (a -> b) || d) where resolve = \_ r -> r

🤯 The fact that we can do this at all is absolutely wild to me. Unfortunately, we can't generalize this process to write something like this:

instance Show a => (Show a || d) where resolve = \r _ -> r

To use this technique, we would have to write an instance for every. single. type. that we wanted to support. Given that the original goal was to reduce boilerplate and manual type wrangling for the user, this is obviously a big step backward. As Matt Parsons puts it:

The problem comes down to where methods live, like, at the runtime level. an Object in JS or Java contains, in addition to its attributes, a lookup table for all the methods that can operate on it. And Haskell just doesn't have that. So we can't ask a type "Can you respond to this message?" because Haskell objects just don't respond to messages (aside from, like, record field labels, lol).

GHC's instance stuff is like a global hashmap of Type -> MethodDict, and you can do Typeable stuff to get the dict at compile time -

foo :: Typeable a => a -> Value
foo a = case cast a :: Maybe Int of
 Just i -> toJSON i -- GHC can statically insert the dict here
 Nothing -> mempty

Or you can grab the MethodDict using Dict :: c => Dict c

But there's not a nice way to smuggle a Dict (c a) inside of an a, allowing you to call methods on it

– Matt Parsons

OK, so blog post is over, right? We can't do it, so we're done. Thanks for coming to my TED talk.

Well, we have one means of redress left. There is a GHC plugin called if-instance, that implements a magic (||) type class for us that works exactly as we want it to by satisfying the constraint on the left-hand side if it can, and otherwise satisfying the constraint on the right-hand side at compile time without us having to write any instances. This is exactly what we want! All we have to do is add a pragma to the top modules that use the (||) type class.

{-# OPTIONS_GHC -fplugin=IfSat.Plugin #-}

Let's turn back to our Codec instance. Now that we have the (||) type class, we can write an instance that tries each codec in order until it finds one that works:

instance TypeError (('ShowType a) ':<>: 'Text " is not supported by any of the provided codecs") => Codec (Composite '[]) a where
  messageType = error "unreachable"
  encodingType = error "unreachable"
  encodePayload = error "unreachable"
  decode _ _ = Left "No recognized codec for this type"

instance (Codec fmt a || Codec (Composite codecs) a) => Codec (Composite (fmt ': codecs)) a where
  messageType fmt = dispatch @(Codec fmt a) @(Codec (Composite codecs) a)
    (case fmt of CompositeCons codec _ -> messageType codec) 
    (case fmt of CompositeCons _ codecs -> messageType codecs)
  encodingType fmt = dispatch @(Codec fmt a) @(Codec (Composite codecs) a)
    (case fmt of CompositeCons codec _ -> encodingType codec) 
    (case fmt of CompositeCons _ codecs -> encodingType codecs)
  encodePayload fmt = dispatch @(Codec fmt a) @(Codec (Composite codecs) a)
    (case fmt of CompositeCons codec _ -> encodePayload codec)
    (case fmt of CompositeCons _ codecs -> encodePayload codecs)
  decode fmt payload = dispatch @(Codec fmt a) @(Codec (Composite codecs) a)
    (case fmt of 
      CompositeCons codec codecs -> if Just (encodingType codec (Proxy @a)) == payload.inputPayloadMetadata Map.!? "encoding"
        then decode codec payload
        else ifSat @(Codec (Composite codecs) a)
          (decode codecs payload)
          (Left "No codec for this type supports this payload")
    )
    (case fmt of CompositeCons _ codecs -> decode codecs payload)

For each method, we use the dispatch function from if-instance to try the codec on the left-hand side, and if it fails, try the list of codecs on the right-hand side.

Let's try it out!

> encode defaultCodec (1 :: Int)
RawPayload {inputPayloadData = "foo", inputPayloadMetadata = fromList [("encoding","binary/plain"),("messageType","ByteString")]}

> newtype JsonInt = JsonInt Int deriving (ToJSON, FromJSON)
> encode defaultCodec (JsonInt 1)
RawPayload {inputPayloadData = "1", inputPayloadMetadata = fromList [("encoding","json/plain"),("messageType","JsonInt")]}

We provide the basic mechanism needed to perform content negotiation between a set of known serialization formats backed by their respective libraries.

For a more complete example, check out the Temporal.Payload module in the hs-temporal-sdk library. This module underpins RPC-style argument serialization and deserialization for the Haskell client for the Temporal workflow engine.

Deep thanks go to:

  • Noah Luck Easterly– for his writing on implementing constraint unions by hand in Haskell, which proved that this was even possible in the first place.
  • Sam Derbyshire– for eliminating the deep drudgery of writing constraint unions by hand with his if-instance library, and for his extremely prompt bugfix when I reported an issue with it.