Designing a JSON serializer

Posted by Matthias Noback

Workshop utilities

For the workshops that I organize, I often need some "utilities" that will do the job, but are as simple as possible. Examples of such utilities are:

  • Something to dispatch events with.
  • Something to serialize and deserialize objects with.
  • Something to store and load objects with.
  • Something to use with event sourcing (an event store, an event-sourced repository).

I put several of these tools in a code base called php-workshop-tools. These utilities should be small, and very easy to understand and use. They will be different from the frameworks and libraries most workshop participants usually use, but they should offer more or less the same functionality. While designing these tools, I was constantly looking for the golden mean:

  • Make the tool generic, but don't support every imaginable use case.
  • Add support for proper dependency injection, but provide static/singleton/function-based helpers.

I have already described my thoughts about usability versus proper design of utility objects in another post: The case for singleton objects, façades, and helper functions. In this post I'd like to look closer at some of the design considerations for the JSON serializer I eventually came up with.

The serializer utility class hasn't become part of the workshop tools code base. It lives in its own code base, as I thought it deserves its own project (with a few tweaks it might some day become really useful outside of my workshops). I called the serializer "naive serializer" as I believed it to be a bit dumb at first.

Use cases, requirements

In my applications I use serialization mainly to:

  1. Take some plain text structured data (e.g. JSON, XML) and transform it into some object that I can use in my application. Such an object has its own type and predefined properties, but it has no behavior. I use it to let data travel to a deeper layer in my application. In other words, such an object is a Data Transfer Object (DTO).
  2. Take some event objects and serialize them, in order to persist them in an event store or publish them to a queue.
  3. Take some view object (which is also a DTO) and serialize it as a response to some API query.

Considering only these use cases allowed me to keep the naive-serializer pretty simple. It doesn't need to have some of the features that other major serializers have, like:

  1. Using getters or setters to retrieve or update values.

    I don't perform (de)serialization directly from or to entities. That means "domain invariants" won't have to be protected while deserializing, as this will be done properly by the domain objects themselves at a later stage (as if an anemic domain model could perform this kind of protection anyway). Data Transfer Objects don't have to "protect" themselves, so we can just copy the data right into them. I do recommend using a schema validator though, to make sure the plain text data you provide has the right structure.

  2. Resolving subclasses using discriminator maps, etc.

    This is a pretty invasive feature that comes with a lot of trouble in terms of code complexity. I think there are valid use cases for it, but I assume that they are not within the most common ones (at least not in my career).

  3. Support for custom "handlers", for \DateTime objects, etc.

    Other serializers have this feature because, for one, PHP's built-in \DateTime[Immutable] objects can be serialized to JSON:

    {"date":"2017-07-11 08:14:31.379653","timezone_type":3,"timezone":"UTC"}
    

    But this string cannot be deserialized by simply instantiating an empty instance of \DateTime[Immutable] and then repopulating its attributes.
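    To make the mismatch concrete, here is a small sketch using only built-in PHP classes. It assumes a fixed date in UTC. The variable names are illustrative only.

    ```php
    <?php
    // A fixed instant in UTC, so the result is reproducible.
    $date = new \DateTimeImmutable('2017-07-11 08:14:31.379653', new \DateTimeZone('UTC'));

    // json_encode() exposes "date", "timezone_type" and "timezone" ...
    $data = json_decode(json_encode($date), true);

    // ... but the class declares no properties that reflection could repopulate.
    $declared = (new \ReflectionClass(\DateTimeImmutable::class))->getProperties();

    // Rebuilding the value means calling the constructor with the decoded data,
    // which is the kind of type-specific handler code a serializer would need.
    $rebuilt = new \DateTimeImmutable($data['date'], new \DateTimeZone($data['timezone']));
    ```

    The encoded form is therefore not a plain copy of the object's state, and restoring it requires code that knows about this particular type.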

    I believe this is a bit weird, but I also believe it can be easily circumvented (and should be anyway, in my opinion), by not using \DateTime[Immutable] as a primitive type in your domain model. I prefer to use a wrapper value object like:

    final class Timestamp
    {
        /**
         * @var string
         */
        private $timestamp;
    
        private function __construct(string $timestamp)
        {
            $this->timestamp = $timestamp;
        }
    
        public static function fromDateTimeImmutable(\DateTimeImmutable $timestamp): Timestamp
        {
            return new self($timestamp->format(\DateTime::ATOM));
        }
    
        public function asDateTimeImmutable(): \DateTimeImmutable
        {
            return \DateTimeImmutable::createFromFormat(\DateTime::ATOM, $this->timestamp);
        }
    
        public function __toString(): string
        {
            return $this->timestamp;
        }
    }
    

    This approach forces you to think of an internal value for the object that uniquely determines the value it represents, which I find beneficial to the design of the object itself.

    If you don't allow complicated values like \DateTime[Immutable], the (de)serialization algorithm becomes much simpler as you won't need to write or allow any custom handler code anymore.
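    As a rough illustration of how simple the algorithm can become, the sketch below copies the single string property of a shortened Timestamp class out of the object and back in, using reflection. The helpers extractProperties() and hydrate() are hypothetical names chosen for this example. They are not the API of the naive serializer.

    ```php
    <?php
    final class Timestamp
    {
        /** @var string */
        private $timestamp;

        private function __construct(string $timestamp)
        {
            $this->timestamp = $timestamp;
        }

        public static function fromDateTimeImmutable(\DateTimeImmutable $timestamp): Timestamp
        {
            return new self($timestamp->format(\DateTime::ATOM));
        }

        public function __toString(): string
        {
            return $this->timestamp;
        }
    }

    // Hypothetical helper: read every declared property, private ones included.
    function extractProperties($object): array
    {
        $data = [];
        foreach ((new \ReflectionObject($object))->getProperties() as $property) {
            $property->setAccessible(true); // required before PHP 8.1, has no effect afterwards
            $data[$property->getName()] = $property->getValue($object);
        }

        return $data;
    }

    // Hypothetical helper: create an instance without calling its constructor,
    // then copy the decoded values straight into its properties.
    function hydrate(string $className, array $data)
    {
        $reflection = new \ReflectionClass($className);
        $object = $reflection->newInstanceWithoutConstructor();
        foreach ($data as $name => $value) {
            $property = $reflection->getProperty($name);
            $property->setAccessible(true);
            $property->setValue($object, $value);
        }

        return $object;
    }

    $original = Timestamp::fromDateTimeImmutable(new \DateTimeImmutable('2017-07-11T08:14:31+00:00'));
    $json = json_encode(extractProperties($original));
    $copy = hydrate(Timestamp::class, json_decode($json, true));
    ```

    Because the object's state is a single string, no type-specific code is involved: the same two helpers would work for any object that holds only scalar values.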

  4. Custom configuration (e.g. annotations) to indicate the type of a property, like this:

    /**
     * @Serializer\Type("string")
     * @var string
     */
    private $foo;
    

    After dropping the support for custom handlers (see the previous point), it's now easy to limit the possible types for properties. In fact, we can limit the list of supported types to those already supported and recognized by PHP, or slightly broader, those used in @var and @return annotations recognized by PHPDocumentor. By the way, there is an accompanying library implementing type resolving for @var annotations, which turned out to be very useful for my own project: phpdocumentor/reflection-docblock.

    This allowed me to drop the need for extra configuration on top of existing @var annotations. In fact, I started relying on those, forcing users to add them, assuming many developers already do. Since we won't be able to serialize a value of type resource for example, this limits the list of supported property types to:

    • null
    • scalar (string, int, float, bool)
    • user-defined classes
    • arrays where every value is of the same type (maps or lists)
    • and any combination of the above

    If PHP ever comes with support for type declarations for properties, I assume that the need for relying on @var annotations will disappear.
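    A minimal sketch of reading the type from a @var annotation follows. It uses a plain regular expression as a stand-in for phpdocumentor/reflection-docblock, so it ignores namespaces, union types and descriptions. The class Example and the helper declaredType() are hypothetical names.

    ```php
    <?php
    final class Example
    {
        /**
         * @var string
         */
        public $name;

        /**
         * @var Example[]
         */
        public $children = [];
    }

    // Hypothetical helper: return the type named in a property's @var annotation,
    // or fail with a descriptive exception when the annotation is missing.
    function declaredType(\ReflectionProperty $property): string
    {
        $docComment = $property->getDocComment();
        if ($docComment === false || !preg_match('/@var\s+(\S+)/', $docComment, $matches)) {
            throw new \LogicException(sprintf(
                'Property "%s" of class "%s" needs a @var annotation',
                $property->getName(),
                $property->getDeclaringClass()->getName()
            ));
        }

        return $matches[1];
    }

    $nameType = declaredType(new \ReflectionProperty(Example::class, 'name'));
    $childrenType = declaredType(new \ReflectionProperty(Example::class, 'children'));
    ```

    Here $nameType holds "string" and $childrenType holds "Example[]". A trailing "[]" is enough to tell a deserializer that every element of the array has the same type.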

  5. Custom configuration (e.g. annotations) to indicate which values should be included or excluded, like this:

    /**
     * @Serializer\Exclude("ALL")
     */
    class Foo
    {
        /**
         * @Serializer\Include
         */
        private $bar;
    }
    

    In practice this kind of configuration is often used when objects are a bit confused about the roles they play. It's the same for form validation groups by the way. Most often objects like these are either anemic domain entities, or they are fulfilling both command and query responsibilities. I recommend using different DTOs for different write and read-related use cases anyway, so I didn't want to make this part of the specification. If you find yourself wanting to skip some object-internal properties, like caches, etc. you can simply create another object, one that contains none of the tricky stuff, and serialize that.
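    The following sketch shows the "create another object" approach under simple assumptions. ProductCatalog and ProductView are hypothetical classes invented for this example, and json_encode() stands in for the serializer.

    ```php
    <?php
    // Hypothetical object that mixes data with an internal cache.
    final class ProductCatalog
    {
        private $prices = ['apple' => 120, 'pear' => 95];
        private $lookupCache = [];

        public function priceOf(string $name): int
        {
            if (!isset($this->lookupCache[$name])) {
                $this->lookupCache[$name] = $this->prices[$name];
            }

            return $this->lookupCache[$name];
        }
    }

    // Hypothetical view DTO: it holds only what the response needs,
    // so there is nothing left to exclude.
    final class ProductView
    {
        /** @var string */
        public $name;

        /** @var int */
        public $priceInCents;

        public function __construct(string $name, int $priceInCents)
        {
            $this->name = $name;
            $this->priceInCents = $priceInCents;
        }
    }

    $catalog = new ProductCatalog();
    $view = new ProductView('apple', $catalog->priceOf('apple'));

    // The cache never reaches the output because it is not part of the view object.
    $json = json_encode($view);
    ```

    The decision about what to expose is made in ordinary code that builds the view object, instead of in serializer configuration.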

  6. Custom configuration (e.g. annotations) to indicate how the name of a property should be converted to the name of a JSON object identifier, like this:

    /**
     * @Serializer\NamingStrategy("SnakeCase")
     */
    class Foo
    {
        /**
         * @Serializer\SerializedName("bazzzz")
         */
        private $bar;
    }
    

    In practice, such a feature is used for some kind of information hiding, where we don't want to expose the names of our properties. I think it's better to just rename the property anyway, or create a new object with the right property names after all. By the way, if you like "snake case", you can always name your properties in that style too. So the "naive serializer" simply uses the real property names and provides no transformation options.

Implementing the serializer

All of the reasoning that went into the project so far allowed me to define the following list of design guidelines:

  • Users shouldn't be forced to add custom configuration to their existing classes.
  • Users shouldn't need to write any supporting code.
  • The solution should take care of as few edge cases as possible.
  • The solution should be as small as possible, without becoming useless (<=100 LOC).
  • The solution should warn the user about its limitations using descriptive exceptions.

This list was pretty helpful, as it kept me focused. Whenever I had to make a decision while writing the code, I could use these guidelines to do the right thing.

I defined a class with all the cases I wanted to support:

final class SupportedCases
{
    /**
     * @var string
     */
    public $a;

    /**
     * @var int
     */
    public $b;

    /**
     * @var SupportedCases[]
     */
    public $c = [];

    /**
     * @var bool
     */
    public $d;

    /**
     * @var float
     */
    public $e;
}

Then I defined that the expected JSON output should be:

{
    "a": "a",
    "b": 1,
    "c": [
        {
            "a": "a1",
            "b": 2,
            "c": [],
            "d": null,
            "e": null
        }
    ],
    "d": true,
    "e": 1.23
}

Deserializing the JSON data should result in an object equal to the one we just serialized and this of course is a perfect starting point for a unit test (maybe more like a component test). After I implemented the "happy path", I added some checks and assertions here and there to make the serializer fail in more explicit and developer-friendly ways.
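The round trip described above could be sketched as follows. This is not the actual code of the naive serializer: toPlainData() and fromPlainData() are hypothetical helpers, a regular expression stands in for phpdocumentor/reflection-docblock, and class names are assumed to live in the global namespace.

```php
<?php
final class SupportedCases
{
    /** @var string */
    public $a;
    /** @var int */
    public $b;
    /** @var SupportedCases[] */
    public $c = [];
    /** @var bool */
    public $d;
    /** @var float */
    public $e;
}

// Hypothetical helper: turn an object graph into arrays and scalars.
// get_object_vars() is called from outside the class, so it only sees public properties.
function toPlainData($value)
{
    if (is_object($value)) {
        $value = get_object_vars($value);
    }

    return is_array($value) ? array_map('toPlainData', $value) : $value;
}

// Hypothetical helper: rebuild a value of the type named in a @var annotation.
function fromPlainData(string $type, $data)
{
    if ($data === null) {
        return null;
    }
    if (substr($type, -2) === '[]') {
        $itemType = substr($type, 0, -2);
        return array_map(function ($item) use ($itemType) {
            return fromPlainData($itemType, $item);
        }, $data);
    }
    if (in_array($type, ['string', 'int', 'float', 'bool'], true)) {
        settype($data, $type);
        return $data;
    }
    if (!class_exists($type)) {
        throw new \LogicException(sprintf('Unsupported type "%s"', $type));
    }

    $reflection = new \ReflectionClass($type);
    $object = $reflection->newInstanceWithoutConstructor();
    foreach ($reflection->getProperties() as $property) {
        if (!preg_match('/@var\s+(\S+)/', (string) $property->getDocComment(), $matches)) {
            throw new \LogicException(sprintf('Missing @var for property "%s"', $property->getName()));
        }
        $property->setAccessible(true); // required before PHP 8.1, has no effect afterwards
        $property->setValue($object, fromPlainData($matches[1], $data[$property->getName()] ?? null));
    }

    return $object;
}

$child = new SupportedCases();
$child->a = 'a1';
$child->b = 2;

$original = new SupportedCases();
$original->a = 'a';
$original->b = 1;
$original->c = [$child];
$original->d = true;
$original->e = 1.23;

$json = json_encode(toPlainData($original));
$copy = fromPlainData(SupportedCases::class, json_decode($json, true));
```

Comparing $copy with $original using == is then the core assertion of the test. The two exceptions hint at where the more developer-friendly failure messages would go.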

Conclusion

I wanted to write about this project, not because I want you to use the serializer (I guess you shouldn't; something I didn't do is performance optimization for example). I simply wanted to describe how a bit of thinking can help drastically limit the scope of a project. Why are frameworks and libraries (for serialization, for forms, for persistence) so big and complicated? Because they want or need to support every imaginable use case. I find it very inspiring that the Doctrine team is currently dropping support for several features, in order to ease maintenance, and keep the library more focused. Removing features like detaching and merging entities will make the code much simpler. And removing support for Yaml configuration will prevent future bugs (apparently there has been a lot of trouble with it in the past).

I also wanted to describe some of the ways in which design issues on the user's side can lead to complicated feature requirements. Once you've fixed design issues like confused object roles, anemic domain models, lack of CQRS, etc. you won't need most of the special features which existing serializers are offering. You can keep the "happy path" of the code pretty clean, reducing code complexity. As a result, maintenance will be easier. In fact, you won't have too many users complaining about things that don't work, particularly if you throw clear exceptions when the library is used in an inappropriate way.

Finally, I experienced again that imposing some limitations can make you more creative. I have encountered this principle in several creative/artistic contexts before. In this case it was: aim for less than 100 lines of code (LOC). Of course, code quality shouldn't be sacrificed for LOC. Also, you should always keep questioning the limitations themselves. But aiming for a small solution helped me cut a lot of waste from the code. Only after all the tests were green and the edge cases had been covered did I allow myself some breathing space and expand the code a bit to improve readability. The result is - I think - a pretty simple, clean, yet useful serializer.
