Martijn Vogten

The basic mistake all ORMs make and how to fix it

TL;DR Current ORM implementations only support one representation of an entity: the entity itself. As this representation contains all possible associations, the persistence layer needs to deal with a virtually infinite, potentially circular graph of objects. If we allow simplified (association-free) representations as well, we get the building blocks for defining bounded, cycle-free subgraphs. I call this approach ‘model-driven ORM’. With model-driven ORM we can implement business logic with the same expressive domain models as with conventional ORM, but without its problems and pitfalls.

Introduction

The goal of Object Relational Mapping (ORM) is to make it easier to work with relational databases from application code. ORM has its origins in the Java Enterprise Edition framework, but the concept has become popular in many programming languages, from strongly typed object-oriented Enterprise Software (Java, C#) to scripting languages like PHP or JavaScript/TypeScript.

The main idea is to model the data in the database as objects in the programming language’s memory. Application code can define data as domain objects or entities and populate and manipulate them using the methods of these objects. A persistence layer is responsible for storing objects in the database, and maintain consistency of the object graph. Popular implementations are Hibernate (implementing the JPA standard) for Java, Entity Framework for .net/C#, or typeORM for Typescript.

There has been a lot of debate about the actual usefulness of ORM as all current implementations invariably introduce a number of problems and pitfalls in practice:

the N + 1 query problem
session management issues (LazyInitializationException and the Open session in view anti-pattern, for Hibernate/Java)
data consistency issues with collections (bidirectional references in particular)
identity generation and object equality pitfalls

If you want to know more about these problems and their impact in real-world scenario’s I can highly recommend ‘High-performance Java Persistence’ by Vlad Mihalcea [Mihalcea] who has made a hands-on guide on how to deal with each of them. Although the book is Java/Hibernate specific, it provides some in-depth explanations that apply to any language or framework using conventional ORM.

The problems above have one thing in common: they are all related to either the complexity or the size of the object graph. If we can simplify the object graph, we can simplify the persistence layer and fix most or all of these problems and pitfalls.

Below I will:

show how these problems are all related to cyclic references in the entity definitions,
provide a practical example of data inconsistencies with conventional ORM and demonstrate their impact on system stability,
explain how we can remove the cycles from any domain model using alternative representations of the entities,
introduce model-driven ORM, and show how to use cycle-free entity representations to remove complexity from the persistence layer to build performant and robust applications, and finally,
demonstrate the application of model-driven ORM to a complex domain model using a working example of the Cargo System from Eric Evans’ classic on Domain Driven Design [Evans].

1. Complications due to cyclic references in entities

Suppose we have a relational database that contains two tables: employee and department. In conventional ORM we would map these tables to entity classes as follows:

class Employee {
	Long id;
	String name;
	String email;
	Department department;
}

class Department {
	Long id;
	String name;
	List<Employee> employees;
}

However, this representation has problems:

Because Department and Employee reference each other, we cannot load either of them from the database without doing some extra work: if we fetch a department we get a list of employees, each of which contains a department. These departments must all resolve to the same object in memory, because department.getEmployees().get(i).getDepartment() must return the very same department object. Ensuring this is called deduplication (identity resolution in Microsoft Entity Framework). This is a consequence of the object graph being cyclic: entities can reference each other, either directly or via other entities.

Many-to-many associations

Things get more complicated if we have a many-to-many link in the cycle. Suppose we have a database with users and usergroups, and any user can be a member of multiple groups. (The database will have a linktable to specify the groups a user belongs to). In conventional ORM we would map these tables as follows:

class User {
	Long id;
	String name;
	String email;
	List<UserGroup> groups;
}

class UserGroup {
	Long id;
	String name;
	List<User> users;
}

Now, if we want to fetch a user and its groups from the database, we will also need to fetch the users of those groups (because each group has a getUsers() method). And the groups those users belong to, etcetera. In practice, this would mean that – depending on how many groups users are in – we would need to load a large chunk of all users and groups or perhaps even all. (The latter is already the case if users are all a member of at least one group and there is one user who is a member of all groups.)

Lazy loading

A workaround for this problem is a technique called lazy loading. With this approach, each association can be configured to load either lazily or eagerly. In the example above we would configure the groups and users associations as ‘lazy’. When fetching a user from the database, the persistenct layer will then skip the usergroups collection, but add an empty collection stub instead. Only when application code actually accesses the usergroup collection with a call to getGroups(), a query is triggered to fetch the usergroups from the database.

In order to make lazy loading possible, all entities must be connected in some way to the same transaction (or a Unit of work [Fowler]). If not, data consistency is not guaranteed if multiple users access the same data concurrently. Also, because we need to deduplicate lazily loaded entities, all previously loaded entities must become part of a transaction-level cache or session (Hibernate calls this the ‘first-level cache’, which is the same concept as the DbContext in Microsoft’s Entity Framework).

The transaction-level cache

Keeping a transaction-level cache is not a problem in a read-only context. However, when updates happen to the entities, keeping the collections in the object graph consistent becomes a challenge. For example, in the Employee/Department example above, when changing the department property of an Employee entity (by calling the setDepartment() method), this will affect:

any query for Employees that filters on the department property (the department_id column in the employee table),
the employees collection of two departments: when changing the department from department A to department B, the employee has to be removed from the employees collection of department A, and added to the employees collection of department B.

Note that the employees collections can also count as a query for Employees, since they are always retrieved using a query. However, I still want to mention them separately, because it’s easy to think of a scenario where either of them (or both) will exist in the transaction-level cache, since we need to fetch a Department in order to call the setDepartment() function of the Employee in the first place.

To keep all collections (the query result and the one-to-many or many-to-many associations) consistent with the changes to the entities, the persistence layer has two options:

implement filtering of objects in memory – effectively mirroring the exact functionality of JOIN…ON or WHERE criteria of the relational database. The persistence layer can then check whether any entity still matches the conditions of each query or collection, and add or remove the entity where appropriate.
save all pending changes to the database (a so-called ‘flush’) and requery the collection from the database. Effectively, the persistence layer relies on the relational database for filtering entities.

The first option requires implementing the filtering functionality of an object database, and restricting the where conditions and join conditions to functions supported by both the in-memory filtering and the database filtering engines. This is not practical or possible to do for one relational database engine, let alone for the range of database engines that any popular ORM system must support.

Data consistency has become a performance trade-off

This means we are stuck with the second option: Flushing the changes and requerying collections from the database. This is what most persistence layers implement, however it comes at a performance cost. Effectively, this means that calling a find() method for a certain type of entity also triggers update statements for any changes to entities in the cache of that same type. Moreover, keeping the collections of that entity (like the employees collection of the Department) in sync requires invalidating all collections of that entity and requerying them when they are accessed.

This highlights why Hibernate does not guarantee a consistent object graph. Consequently, application code must correct the object graph as needed, despite lacking specifications on where and how these inconsistencies arise. Additionally, there is no indication when an inconsistency is created. These inconsistencies can however lead to high-impact and difficult-to-detect bugs.

If you want to see how such inconsistencies can play out in the employee-department example, follow along with the steps below. Otherwise, you can skip to the fix.

2. Data inconsistencies in a practical example

Imagine a developer working on an application that stores employee data. The developer is tasked with implementing the following scenario:

add a new employee to a department
send out an automatic email with the new list of employees of that department

Disclaimer: The experienced JPA programmer will (hopefully) know how to add application code to fix the inconsistency in the first place. Due to the extremely simple object graph we have here, the fix is also simple. However, what I want to show is how problematic the inconsistencies can be. Also, I shall argue that as the scenario’s become more complex it will be harder to keep the inconsistencies out.

The developer might set this up as follows:

void addEmployeeToDepartment(String firstName, String lastName, String departmentName) {
	runInTransaction(() -> {
		Department department = departmentRepository.findByName(departmentName);
		
		Employee newEmployee = new Employee(firstName, lastName);
		newEmployee.setDepartment(department);
		employeeRepository.save(newEmployee);
		
		sendEmailWithEmployeeList(departmentName);
	});
}

void sendEmailWithEmployeeList(String departmentName) {
	Department department = departmentRepository.findByName(departmentName);
	List<Employee> employeesList = department.getEmployees();
	String emailbody = "Employee list of department " + departmentName + ": \n";
	for (Employee emp : employeesList) {
		emailbody += emp.getLastName() + ...
	}
	sendEmail(emailBody);
}

A working implementation of this example is available on Github. Each of the steps below are actual commits in that repository, so if you want to follow along, you may want to check out the initial commit first (tagged ‘initial’).

Although the program may look correct at first sight, something is broken: When we test it, we get an email without the new employee in it.

Applying a fix that looks OK..

What did the developer miss here? Both sendEmailWithEmployeeList and addEmployeeToDepartment work correctly when invoked separately. Stepping through the program and following the database query logs simultaneously reveals that the save() of the new employee did not yet result in a database INSERT command. This leads the developer to conclude that the findByName call in the sendEmailWithEmployeeList function did not ‘see’ the new employee.

A plausible attempt to fix the method could be to force an INSERT using the flush() method of the EntityManager. So the developer adds entityManager.flush() after the save()..

void addEmployeeToDepartment(String firstName, String lastName, String departmentName) {
	runInTransaction(() -> {
		Department department = departmentRepository.findByName(departmentName);
		
		Employee newEmployee = new Employee(firstName, lastName);
		newEmployee.setDepartment(department);
		employeeRepository.save(newEmployee);
		
		entityManager.flush(); // Need to force save to get the employee to show up in the email :(
		
		sendEmailWithEmployeeList(departmentName);
	});
}

And voila, the query log now shows the INSERT statement as a result of the flush(), and the program works correctly: the email contains the new name.

In reality however, a potential data inconsistency is not resolved: the application code has now become unstable.

…but actually makes the system unstable

After the application has been in production for some time, a new requirement needs to be implemented: Before adding an employee, check if they are already in the department.

After adding the check

void addEmployeeToDepartment(String firstName, String lastName, String departmentName) {
	runInTransaction(() -> {
		Department department = departmentRepository.findByName(departmentName);

		Employee newEmployee = new Employee(firstName, lastName);

		if (department.getEmployees().stream().filter(existing -> haveSameName(existing, newEmployee)).count() > 0) {
			throw new RuntimeException("There is already an employee with that name");
		}
		
		newEmployee.setDepartment(department);
		employeeRepository.save(newEmployee);
		
		entityManager.flush(); // Need to force save to get the employee to show up in the email :(
		
		sendEmailWithEmployeeList(departmentName);
	});
}

and running the integration tests, the developer is surprised to find out that the list does no longer include the new employee. By adding a trivial change in one place the code breaks in another place.

Wrapping up: the ‘right’ fix

In order to understand what went wrong we need to realize that the second call of findByName (in sendEmailWithEmployeeList) does execute a SELECT query but the result is deduplicated, i.e. replaced by a cached Department object: the exact same object that was returned by the first call (in addEmployeeToDepartment). The newly added check for existing employees with the same name caused the getEmployees() method to query the list of employees before the new employee got inserted (in the flush() call). Without the check the getEmployees() method would query the employee records after the insert had actually happened.

The right fix (the fix that would have kept the system stable) is explained as a note on page 30 of version 3.1 of the JPA specification:

“Note that it is the application that bears responsibility for maintaining the consistency of runtime relationships—for example, for insuring that the “one” and the “many” sides of a bidirectional relationship are consistent with one another when the application updates the relationship at runtime.”

(Bidirectional means that employee has a getDepartment and department has a getEmployees method.)

Which means that, instead of just setting the “one” side

newEmployee.setDepartment(department);

we need to maintain consistency of the graph by also adding the new Employee on the “many” side, the Department:

newEmployee.setDepartment(department);
department.getEmployees().add(newEmployee);

If the developer had implemented this initially, the application code would have functioned correctly from the start, and the flush() call would not have been required.

One might argue that educating developers on maintaining graph consistency would suffice. However, this approach still leaves issues when refactoring the domain model. For instance, suppose the Department entity initially lacked a getEmployees() method, and we decide to refactor the code to add it, making the reference bidirectional. We would then need to review the entire codebase for any invocations of employee.setDepartment() to ensure corresponding updates to the Department entity. This requirement is not immediately apparent from the refactoring premise, and missing any invocation of employee.setDepartment() would introduce the inconsistencies described earlier.

Observations

A data inconsistency was introduced even in a very basic scenario.
The system is fragile: the application developer must ensure the graph’s consistency without direct indications of its current state or when it becomes inconsistent.
The developer is not guided towards a correct solution: We applied an incorrect fix (using flush() instead of correcting the inconsistency), and there was no indication of the mistake. As far as anyone was concerned, the application appeared to work correctly.

3. Removing the cycles

As demonstrated earlier, permitting cycles in the object graph imposes several demanding requirements on the persistence layer:

it must deduplicate all entities that are loaded over the course of a database transaction
it must support delayed (lazy) loading of parts of the object graph

Combined, these requirements can only be met by implementing a transaction-level cache, and even a mature ORM implementation like Hibernate cannot guarantee data consistency in this context.

So what happens if we remove the cycles? Can we do without deduplication and lazy loading if we limit the domain model to cycle-free representations only?

One way of removing the cycles is to think of it as making an object graph serializable. Let’s say we would need a JSON representation of an Employee or Department in the first example above:

when serializing Employee, we would probably omit the employees relation from Department.
when serializing Department, we would probably omit the department relation from Employee.

One way to achieve this is by moving the relations out of the base classes and into special subclasses:

class BaseEmployee {
	Long id;
	String name;
	String email;
}

class EmployeeDetail extends BaseEmployee {
	BaseDepartment department;
}

class BaseDepartment {
	Long id;
	String name;
}

class DepartmentWithEmployees extends BaseDepartment {
	List<BaseEmployee> employees;
}

Now we have:

two relation-free representations of the entities: BaseEmployee and BaseDepartment.
two serializable type definitions: EmployeeDetail and DepartmentWithEmployees.

In order to get the JSON representation of an Employee, we can use EmployeeDetail, and to get a Department as JSON, we can use DepartmentWithEmployees.

Note that although we have introduced multiple representations of the entities, we could do this without repeating ourselves: thanks to inheritance, each field is still defined only once (the code is still ‘DRY’).

The benefits of cycle-free representations

Besides being serializable, EmployeeDetail has other nice properties: it maps directly to a SELECT query: The table of the base class is the FROM clause, and each relation maps to a JOIN:

SELECT 
	employee.id,
	employee.name,
	employee.email,
	department.id,
	department.name 
FROM employee
	LEFT JOIN department ON department.id = employee.department_id

Note that we don’t need to manually write that SELECT query: it’s quite feasible to build a function that will translate any cycle-free type definition (plus some metadata annotations for table names, ID-fields, etc.) to a query.

To get the results as a list of objects, we need a second function to convert the table-shaped resultset data into a list of instances of that same type. [Evans] refers to object graphs like the one EmployeeDetail forms together with its referenced instances (BaseDepartment) as aggregates. Constructing an aggregate from a resultset is not trivial, but it can be done with one generic implementation, and using the same type information that we used earlier to construct the SELECT query.

If we put everything together and specify where-conditions (and order-by clauses) using method chaining, we get something like:

List<EmployeeDetail> employees = db.query(EmployeeDetail.class)
	.where("employee.name = ?", name)
	.execute();
String departmentName = employees.get(0).getDepartment().getName();

Now we can reap the benefits of ORM, i.e. we can work with relational data as objects. But since the object graph is bounded and cycle-free, we don’t need deduplication or lazy loading. And because we don’t need deduplication, we can have as many (or as few) representations of the entities as we like.

Other interesting representations

There are other useful representations of the entities. For example, imagine we need to show a combobox in the user interface for selecting a department. For that we need a list of objects with just the ID and name of every department. I shall call these Ref objects because they are a bit like a bibliographic reference: just the data to identify the book or article being referenced. Another way to think of this is as a rich ID: The ID of the referenced entity plus a name or displayname (and maybe a location (country) etc.) that will make it identifiable for users:

class DepartmentRef {
	Long id;
	String name;
}

The same for employees (think of this as a business card):

class EmployeeRef {
	Long id;
	String name;
	String email;
	DepartmentRef department;
}

Note how we can use DepartmentRef here: We do want to show the name of the department on a business card, but we don’t need any more details of the department.

Depending on the task at hand, we can mix and match the representations to create a subgraph (query) of the data we need. For example, when adding a new employee, we want to specify their department and their manager, and perhaps a job title. But we certainly don’t want to edit the properties of that department or the properties of the manager. We would typically use an EmployeeDetail object like:

class EmployeeDetail {
	Long id;
	String name;
	String email;
	String jobTitle;
	DepartmentRef department;
	EmployeeRef manager;
}

The same type can be used for editing the employee properties later, and it could also be a good fit for showing a list of employees: We might want to show the department name and the name of the manager (perhaps including a mailto link), but we don’t need any more details of the associated entities.

4. Introducing model-driven ORM

I call the approach outlined above ‘model-driven ORM’ to contrast with conventional ORM, which I like to call ‘table-driven ORM’.

With conventional ORM there is only one representation of an entity, and this representation also doubles as the table-definition (hence ‘table-driven ORM’). This means we can only define data accesses one time (at the schema level), and in table-sized units. It can be challenging to find a single definition that works for all tasks that the application has to implement.

On the other hand, the different representations introduced above enable easily specifying a ‘model for data access’ for the task at hand. Hence the term ‘model-driven ORM’. In practice, model-driven ORM has advantages similar to those of working with plain SQL (being able to specify which associations and which columns to load, stateless sessions, data consistency guarantees), but it is true Object Relational Mapping that lets us express queries and domain model application logic using objects only.

4-1. Think queries

The representations above allow the application code to specify which data to load beforehand, using a type definition like EmployeeDetail. This means we can make decisions that are tailored to the needs of the task at hand.

Imagine for a moment that we are not using ORM at all, but interact with the database using plain SQL queries. When developing a new part of application code that implements a certain feature, we would first establish which data is needed for the task, and then decide:

if we can reuse an existing query (or queries)
need to make a variation of an existing query, or
if we need to write a new query.

The same holds true for model-driven ORM, except:

we don’t need to write the query. It is generated from the type definition of the entity representation.
we have more options for reuse. We can easily combine (parts of) existing entity representations by subclassing them and adding in only the associations we need.

4-2. Stateless sessions

With model-driven ORM we can easily specify which data to load beforehand, so we don’t need lazy loading or deduplication at the session level (deduplication happens at the query level). This means the domain model objects are not connected to a session. There are no queries happening ‘under the hood’.

The session is effectively stateless, and we have to explicitly call repository functions to save changes to the database:

runIntransaction(() -> {
	EmployeeDetail employee = employeeRepository.findById(employeeId);
	employee.setEmail("john@example.org");
	employeeRepository.save(employee); // <- explicit save
});

4-3. Queries, updates and inserts are all explicit and immediate

Having a stateless session means that all interactions with the database are immediate and there are no cached objects: Calling a finder method will always trigger a SELECT query. Changes to the data happens with UPDATE or INSERT statements that are immediately executed. Since there is no hidden state or cache, we don’t have to worry that objects loaded from the database would not reflect existing changes in the object graph (a problem commonly referred to as data aliasing). Updates are explicit and immediate, so any object we load later will naturally contain those changes.

This means we are free to choose any model for updating the database. We can do it using plain UPDATE or INSERT commands, with simple utility functions, or using the entity representations. For example, if we want to change the price of a product, we could do

db.executeUpdate(conn, "UPDATE product SET price = ? WHERE product_id = ?", 10.0, 1);

// or, using a utility function
db.update(connection, "product", Map.of("price", 10.0), 1);

// or, the ORM way
Product product = db.findById(Product.class, 1L);
product.setPrice(10.0);
db.update(product);

We have multiple options for implementing the db.update(Product product) function. In a simple form, it would omit collection properties (one-to-many and many-to-many associations) and build an update statement containing all fields in the “product” table. In a more complex form it could dirty-check fields and apply cascading strategies to the collections. Dirty-checking and cascading is also relatively easy to implement compared to conventional ORM, because the object graph is bounded and cycle-free. We could for example save a (serialized) copy of the original data in memory and use that to identify the changes to the object graph.

4-4. Easy query building with optimal performance

Using types in the programming language (classes) to represent a SELECT query makes it easy to specify order-by and where-clauses or additional join conditions. Because each field in a class must be unique, we can use field names as aliases. So for example in this query:

class Department {
	Long id;
	String name;
	List<EmployeeRef> employees;
}

class Employee {
	Long id;
	String firstName;
	String lastName;
	DepartmentRef department;
}

If we want to specify that the Employees collection of a Department should be ordered alphabetically by name, we can do this using the field name employees.

db.query(Department.class).addOrderBy("{employees.lastName} ASC");
// Or find all employees in a department:
db.query(Employee.class).addWhere("{department.name} = ?", "Finance");

There are more benefits to using types as query definitions:

Types are known compile-time, which means query definitions are also available compile-time. This opens up opportunities for eliminating runtime overhead by generating code that prepares the query and code that maps resultsets to objects.
We can use the type-checks of the compiler to ensure that application code does not access any data that is not in the query.
Because we are not restricted to classes for defining representations, we could also use interfaces which (in Java and C#) will allow multiple inheritance.

5. Proof of concept: the Cargo System

Using the techniques outlined above, it’s always possible to remove cycles. But in order to do so, we need to change the domain model. How does this affect the application code? Will it be as expressive as with conventional ORM?

Let’s take a look at a somewhat more extensive example: The Cargo System from Eric Evans’ classic book Domain Driven Design. There is an official implementation available by Swedish software consulting company Citerus, that is built on top of Spring Boot and JPA (Hibernate).

Here are the entities of the Cargo System domain model:

classDiagram
	class Cargo {
		Long id
		String trackingId
		Location origin
		RouteSpecification routeSpecification
		List~Leg~ itinerary
		Delivery delivery
	}

	class Location {
		Long id
		String unlocode
		String name
	}

	class RouteSpecification {
		Location origin
		Location destination
		Instant arrivalDeadline
	}

	class Leg {
		Long id
		Voyage voyage
		Location loadLocation
		Instant loadTime
		Location unloadLocation
		Instant unloadTime
	}

	class Voyage {
		Long id
		String voyageNumber
		List~CarrierMovement~ carrierMovements
	}

	class CarrierMovement {
		Long id
		Location arrivalLocation
		Location departureLocation
		Instant arrivalTime
		Instant departureTime
	}

	class Delivery {
		Boolean misdirected
		Instant eta
		Instant calculatedAt
		Boolean isUnloadedAtDestination
		RoutingStatus routingStatus
		HandlingActivity nextExpectedActivity
		TransportStatus transportStatus
		Voyage currentVoyage
		Location lastKnownLocation
		HandlingEvent lastEvent
	}

	class HandlingActivity {
		HandlingEvent.Type type
		Location location
		Voyage voyage
	}

	class HandlingEvent {
		Long id
		Voyage voyage
		Location location
		Cargo cargo
		Instant completionTime
		Instant registrationTime
		HandlingEvent.Type type
	}
	Cargo --> Location
	Cargo --o RouteSpecification
	Cargo ..> "many" Leg
	Cargo --o Delivery
	RouteSpecification --> Location
	Leg --> Voyage
	Leg --> Location
	Voyage ..> "many" CarrierMovement
	CarrierMovement --> Location
	Delivery --o HandlingActivity
	Delivery --> Voyage
	Delivery --> Location
	Delivery --> HandlingEvent
	HandlingActivity --> Location
	HandlingActivity --> Voyage
	HandlingEvent --> Voyage
	HandlingEvent --> Location
	HandlingEvent ..> Cargo

And there are four repositories, one for each aggregate:

CargoRepository
HandlingEventRepository
VoyageRepository
LocationRepository

Refactoring the Cargo System to model-driven ORM

As Hibernate does not support multiple entity representations for the same database table we cannot use it for model-driven ORM. Instead I shall use PojoQuery, a library I initially wrote in 2013 that maps objects to queries like we saw above. We shall try to change as little as possible to the application configuration or the database schema. The result is available on Github.

Removing the cycles

Model-driven ORM requires cycle-free representations. There is one cycle in the diagram:

Cargo.delivery -> Delivery.lastEvent -> HandlingEvent.cargo -> Cargo

We can remove the cycle using a Ref-object for Cargo:

class CargoRef {
	Long id;
	String trackingId;

	public CargoRef(id, trackingId) {
		...
	}
	// (getters and identity implementation omitted)
}

To get a CargoRef instance for a Cargo object, we add a utility method to Cargo:

class Cargo {
	...
	public CargoRef getRef() {
		return new CargoRef(id, trackingId);
	}
	...
}

Cargo entities are never accessed through any of the other entities except for finding or displaying a tracking ID. This means we can update all entity fields referencing Cargo to use CargoRef. All we need to change in the application code is to pass cargo.getRef() into the HandlingEvent constructor instead of the cargo entity itself. Both of these changes are trivial.

Remove more unused associations

We shall remove one more unused association using the same approach. The Voyage entity internals (the carrier movements) are never accessed through any of the other entities. These associations are only used to get the voyage number. So here we can apply a similar Ref-object, a VoyageRef with two fields: id and voyageNumber.

class VoyageRef {
	Long id;
	String voyageNumber;

	public VoyageRef(id, voyageNumber) {
		...
	}
	// (getters and identity implementation omitted)
}

Using the PojoQuery library to implement the repositories

For implementing the repositories we cannot use Hibernate since it will not allow us to configure Cargo and CargoRef with the same database table (as shown above, this is a fundamental limitation of conventional ORM, related to deduplication). Luckily, the PojoQuery library supports embedding entities analogous to Hibernate’s @Embedded annotation, which is what we need here for Delivery, RouteSpecification and HandlingActivity. After adding PojoQuery annotations to the entities we can implement the CargoRepository like so (the adapter class CargoDatabase wraps Spring Boot’s transaction management, so all queries are executed within the current transaction):

public class CargoRepositoryImpl implements CargoRepository {

  @Autowired
  CargoDatabase db;

  @Override
  public Cargo find(TrackingId trackingId) {
    List<Cargo> list = db.query(Cargo.class)
        .addWhere("{cargo.trackingId} = ?", trackingId.toString())
        .execute();
    return list.size() > 0 ? list.get(0) : null;
  }

  @Override
  public List<Cargo> getAll() {
    return db.query(Cargo.class).execute();
  }

  @Override
  public void store(Cargo cargo) {
    if (cargo.id() == null) {
      db.insert(cargo);
      insertLegRecords(cargo);
    } else {
      db.update(cargo);
      db.update(SqlExpression.sql("DELETE FROM leg WHERE cargo_id = ?", cargo.id()));
      insertLegRecords(cargo);
    }
  }

  private void insertLegRecords(Cargo cargo) {
    for (Leg leg : cargo.itinerary().legs()) {
      Map<String, Object> legValues = PojoQuery.extractValues(Leg.class, leg);
      legValues.remove("id");
      legValues.put("cargo_id", cargo.id());
      db.insert("leg", legValues);
    }
  }

  @Override
  public TrackingId nextTrackingId() {
    return db.doReturningWork(conn -> {
      String select = db.isMySQL() ? 
          "SELECT UPPER(SUBSTR(CAST(UUID() AS CHAR(38)), 1, 8))" 
          :
          // HSQLDB
          "SELECT UPPER(SUBSTR(CAST(UUID() AS VARCHAR(38)), 0, 9)) AS id FROM (VALUES(0))";

      List<String> result = DB.queryColumns(conn, select).get(0);
      return new TrackingId(result.get(0));
    });
  }
}

As you can see, this looks a lot like a conventional ORM implementation of the same repository, except for the clunkiness of storing the Leg records. It should not be too hard to implement a generic function that stores one-to-many collections more elegantly. With or without that function and with or without dirty checking, the idea is the same: Application code asks for a Cargo aggregate using a find or getAll method, does its work and then calls the store() method of the repository when it is done.

The same is true for retrieving data: The repository is free to fetch the graph in whichever way it likes. Although the object graph is bounded and can always be fetched with just one SELECT query, there may be reasons to split the process up in steps, especially when the object graph gets large.

Integration tests and a Hibernate bug

Next, to make the integration tests pass, we need to occasionally call store() (replacing Hibernate’s flush()) and rewrite some Hibernate-specific code. After this, only one test was failing, which was the CargoRepository integration test.

Interestingly, this test was actually expecting an erroneous null value for voyageNumber:

	// The code above creates voyages and a cargo, then assigns an itinerary with three legs.
	List<Leg> legs = cargo.itinerary().legs();
	assertThat(legs).hasSize(3)
		.extracting("voyage.voyageNumber", "loadLocation", "unloadLocation")
		.containsExactly(
			Tuple.tuple(null, HONGKONG, NEWYORK), // <-- voyageNumber is null ???
			Tuple.tuple("0200T", NEWYORK, DALLAS),
			Tuple.tuple("0300A", DALLAS, HELSINKI));

Apparently, a bug in the Hibernate persistence layer was triggered and the author of the integration test was not able to work around it in another way. After changing the null value to the correct voyage number all tests pass.

The updated Cargo System domain model and JSON sample data of the aggregates

Here is the updated class diagram of the Cargo System domain model with the new CargoRef and VoyageRef classes:

classDiagram
	direction TD
	class CargoRef {
		Long id
		String trackingId
	}

	class Cargo {
		Long id
		String trackingId
		Location origin
		RouteSpecification routeSpecification
		List~Leg~ itinerary
		Delivery delivery
	}

	class HandlingEvent {
		Long id
		VoyageRef voyage
		Location location
		CargoRef cargo
		Instant completionTime
		Instant registrationTime
		HandlingEvent.Type type
	}

	class Leg {
		Long id
		VoyageRef voyage
		Location loadLocation
		Instant loadTime
		Location unloadLocation
		Instant unloadTime
	}

	class Location {
		Long id
		String unlocode
		String name
	}

	class RouteSpecification {
		Location origin
		Location destination
		Instant arrivalDeadline
	}

	class CarrierMovement {
		Long id
		Location arrivalLocation
		Location departureLocation
		Instant arrivalTime
		Instant departureTime
	}

	class Delivery {
		Boolean misdirected
		Instant eta
		Instant calculatedAt
		Boolean isUnloadedAtDestination
		RoutingStatus routingStatus
		HandlingActivity nextExpectedActivity
		TransportStatus transportStatus
		VoyageRef currentVoyage
		Location lastKnownLocation
		HandlingEvent lastEvent
	}

	class HandlingActivity {
		HandlingEvent.Type type
		Location location
		VoyageRef voyage
	}

	class Voyage {
		Long id
		String voyageNumber
		List~CarrierMovement~ carrierMovements
	}

	class VoyageRef {
		Long id
		String voyageNumber
	}
	CargoRef ..|> Cargo
	VoyageRef ..|> Voyage
	HandlingEvent --> CargoRef
	HandlingEvent --> VoyageRef
	Cargo --o Delivery
	Delivery --o HandlingActivity
	Delivery --> VoyageRef
	Delivery --> Location
	Delivery --> HandlingEvent
	Cargo --> Location
	Cargo --> RouteSpecification
	Cargo --> "many" Leg
	RouteSpecification --> Location
	Leg --> VoyageRef
	Leg --> Location
	Voyage --> "many" CarrierMovement
	CarrierMovement --> Location
	HandlingActivity --> Location
	HandlingActivity --> VoyageRef
	HandlingEvent --> Location
	style VoyageRef fill:#9f6
	style CargoRef fill:#9f6

With cycles removed, we can now show some sample data as JSON objects. This may be more illustrative than the diagram above. There are four repositories in the sample, corresponding to the four aggregates: Location, Voyage, Cargo and HandlingEvent:

Example JSON data for a Location

{
	"id": 1,
	"unlocode": "JNTKO",
	"name": "Tokyo"
} as Location

Example JSON data for a Voyage aggregate

{
	"id": 1,
	"voyageNumber": "0100S",
	"carrierMovements": [
		{
			"id": 1,
			"arrivalLocation": {
				"id": 5,
				"unlocode": "CNHGH",
				"name": "Hangzhou"
			} as Location,
			"departureLocation": {
				"id": 6,
				"unlocode": "CNHKG",
				"name": "Hongkong"
			} as Location,
			"arrivalTime": "2008-10-03T14:30:00Z",
			"departureTime": "2008-10-01T12:00:00Z"
		} as CarrierMovement,
		{
			"id": 2,
			"arrivalLocation": {
				"id": 1,
				"unlocode": "JNTKO",
				"name": "Tokyo"
			} as Location,
			"departureLocation": {
				"id": 5,
				"unlocode": "CNHGH",
				"name": "Hangzhou"
			} as Location,
			"arrivalTime": "2008-10-06T06:15:00Z",
			"departureTime": "2008-10-03T21:00:00Z"
		} as CarrierMovement,
		{
			"id": 3,
			"arrivalLocation": {
				"id": 13,
				"unlocode": "AUMEL",
				"name": "Melbourne"
			} as Location,
			"departureLocation": {
				"id": 1,
				"unlocode": "JNTKO",
				"name": "Tokyo"
			} as Location,
			"arrivalTime": "2008-10-12T11:30:00Z",
			"departureTime": "2008-10-06T11:00:00Z"
		} as CarrierMovement,
		{
			"id": 4,
			"arrivalLocation": {
				"id": 9,
				"unlocode": "USNYC",
				"name": "New York"
			} as Location,
			"departureLocation": {
				"id": 13,
				"unlocode": "AUMEL",
				"name": "Melbourne"
			} as Location,
			"arrivalTime": "2008-10-23T23:10:00Z",
			"departureTime": "2008-10-14T12:00:00Z"
		} as CarrierMovement
	] as CarrierMovement[]
} as Voyage

Example JSON data for a HandlingEvent aggregate

{
	"id": 2,
	"voyage": {
		"id": 1,
		"voyageNumber": "0100S"
	} as VoyageRef,
	"location": {
		"id": 6,
		"unlocode": "CNHKG",
		"name": "Hongkong"
	} as Location,
	"cargo": {
		"id": 1,
		"trackingId": "ABC123"
	} as CargoRef,
	"completionTime": "2009-03-02T00:00:00Z",
	"registrationTime": "2025-03-08T19:33:38.706497Z",
	"type": "LOAD"
} as HandlingEvent

Example JSON data for a Cargo aggregate

{
	"id": 1,
	"trackingId": "ABC123",
	"origin": {
		"id": 6,
		"unlocode": "CNHKG",
		"name": "Hongkong"
	} as Location,
	"routeSpecification": {
		"origin": {
			"id": 6,
			"unlocode": "CNHKG",
			"name": "Hongkong"
		} as Location,
		"destination": {
			"id": 10,
			"unlocode": "FIHEL",
			"name": "Helsinki"
		} as Location,
		"arrivalDeadline": "2009-03-15T01:00:00Z"
	} as RouteSpecification,
	"delivery": {
		"misdirected": false,
		"eta": "2009-03-12T01:00:00Z",
		"calculatedAt": "2025-03-05T20:27:38.387446Z",
		"isUnloadedAtDestination": false,
		"routingStatus": "ROUTED",
		"nextExpectedActivity": {
			"type": "LOAD",
			"location": {
				"id": 9,
				"unlocode": "USNYC",
				"name": "New York"
			} as Location,
			"voyage": {
				"id": 2,
				"voyageNumber": "0200T"
			} as VoyageRef
		},
		"transportStatus": "IN_PORT",
		"lastKnownLocation": {
			"id": 9,
			"unlocode": "USNYC",
			"name": "New York"
		} as Location,
		"lastEvent": {
			"id": 3,
			"voyage": {
				"id": 1,
				"voyageNumber": "0100S"
			} as VoyageRef,
			"location": {
				"id": 9,
				"unlocode": "USNYC",
				"name": "New York"
			} as Location,
			"cargo": {
				"id": 1,
				"trackingId": "ABC123"
			} as CargoRef,
			"completionTime": "2009-03-05T01:00:00Z",
			"registrationTime": "2025-03-05T20:27:37.684125Z",
			"type": "UNLOAD"
		} as HandlingEvent
	} as Delivery,
	"itinerary": [
		{
			"id": 6,
			"voyage": {
				"id": 3,
				"voyageNumber": "0300A"
			} as VoyageRef,
			"loadLocation": {
				"id": 3,
				"unlocode": "USDAL",
				"name": "Dallas"
			} as Location,
			"loadTime": "2009-03-09T01:00:00Z",
			"unloadLocation": {
				"id": 10,
				"unlocode": "FIHEL",
				"name": "Helsinki"
			} as Location,
			"unloadTime": "2009-03-12T01:00:00Z"
		} as Leg,
		{
			"id": 5,
			"voyage": {
				"id": 2,
				"voyageNumber": "0200T"
			} as VoyageRef,
			"loadLocation": {
				"id": 9,
				"unlocode": "USNYC",
				"name": "New York"
			} as Location,
			"loadTime": "2009-03-06T01:00:00Z",
			"unloadLocation": {
				"id": 3,
				"unlocode": "USDAL",
				"name": "Dallas"
			} as Location,
			"unloadTime": "2009-03-08T01:00:00Z"
		} as Leg,
		{
			"id": 4,
			"voyage": {
				"id": 1,
				"voyageNumber": "0100S"
			} as VoyageRef,
			"loadLocation": {
				"id": 6,
				"unlocode": "CNHKG",
				"name": "Hongkong"
			} as Location,
			"loadTime": "2009-03-02T01:00:00Z",
			"unloadLocation": {
				"id": 9,
				"unlocode": "USNYC",
				"name": "New York"
			} as Location,
			"unloadTime": "2009-03-05T01:00:00Z"
		} as Leg
	] as Leg[]
} as Cargo

Performance

The CargoRepository queries the Cargo aggregate in one SELECT query. The Cargo aggregate is somewhat complex, see the JSON above for an example: It embeds RouteSpecification and Delivery, both of which contain many-to-one associations to VoyageRef and Location. Cargo also contains an Itinerary: a collection of Leg records, each of which contains three many-to-one associations. Therefore, the query is quite big, with 15 joins and 50 fields.

SELECT
 `cargo`.id AS `cargo.id`,
 `cargo`.tracking_id AS `cargo.trackingId`,
 `origin`.id AS `origin.id`,
 `origin`.unlocode AS `origin.unlocode`,
 `origin`.name AS `origin.name`,
 `routeSpecification.origin`.id AS `routeSpecification.origin.id`,
 `routeSpecification.origin`.unlocode AS `routeSpecification.origin.unlocode`,
 `routeSpecification.origin`.name AS `routeSpecification.origin.name`,
 `routeSpecification.destination`.id AS `routeSpecification.destination.id`,
 `routeSpecification.destination`.unlocode AS `routeSpecification.destination.unlocode`,
 `routeSpecification.destination`.name AS `routeSpecification.destination.name`,
 `cargo`.spec_arrival_deadline AS `routeSpecification.arrivalDeadline`,
 `itinerary`.id AS `itinerary.id`,
 `itinerary.voyage`.id AS `itinerary.voyage.id`,
 `itinerary.voyage`.voyage_number AS `itinerary.voyage.voyageNumber`,
 `itinerary.loadLocation`.id AS `itinerary.loadLocation.id`,
 `itinerary.loadLocation`.unlocode AS `itinerary.loadLocation.unlocode`,
 `itinerary.loadLocation`.name AS `itinerary.loadLocation.name`,
 `itinerary`.load_time AS `itinerary.loadTime`,
 `itinerary.unloadLocation`.id AS `itinerary.unloadLocation.id`,
 `itinerary.unloadLocation`.unlocode AS `itinerary.unloadLocation.unlocode`,
 `itinerary.unloadLocation`.name AS `itinerary.unloadLocation.name`,
 `itinerary`.unload_time AS `itinerary.unloadTime`,
 `cargo`.misdirected AS `delivery.misdirected`,
 `cargo`.eta AS `delivery.eta`,
 `cargo`.calculated_at AS `delivery.calculatedAt`,
 `cargo`.unloaded_at_dest AS `delivery.isUnloadedAtDestination`,
 `cargo`.routing_status AS `delivery.routingStatus`,
 `cargo`.next_expected_handling_event_type AS `delivery.nextExpectedActivity.type`,
 `delivery.nextExpectedActivity.location`.id AS `delivery.nextExpectedActivity.location.id`,
 `delivery.nextExpectedActivity.location`.unlocode AS `delivery.nextExpectedActivity.location.unlocode`,
 `delivery.nextExpectedActivity.location`.name AS `delivery.nextExpectedActivity.location.name`,
 `delivery.nextExpectedActivity.voyage`.id AS `delivery.nextExpectedActivity.voyage.id`,
 `delivery.nextExpectedActivity.voyage`.voyage_number AS `delivery.nextExpectedActivity.voyage.voyageNumber`,
 `cargo`.transport_status AS `delivery.transportStatus`,
 `delivery.currentVoyage`.id AS `delivery.currentVoyage.id`,
 `delivery.currentVoyage`.voyage_number AS `delivery.currentVoyage.voyageNumber`,
 `delivery.lastKnownLocation`.id AS `delivery.lastKnownLocation.id`,
 `delivery.lastKnownLocation`.unlocode AS `delivery.lastKnownLocation.unlocode`,
 `delivery.lastKnownLocation`.name AS `delivery.lastKnownLocation.name`,
 `delivery.lastEvent`.id AS `delivery.lastEvent.id`,
 `delivery.lastEvent.voyage`.id AS `delivery.lastEvent.voyage.id`,
 `delivery.lastEvent.voyage`.voyage_number AS `delivery.lastEvent.voyage.voyageNumber`,
 `delivery.lastEvent.location`.id AS `delivery.lastEvent.location.id`,
 `delivery.lastEvent.location`.unlocode AS `delivery.lastEvent.location.unlocode`,
 `delivery.lastEvent.location`.name AS `delivery.lastEvent.location.name`,
 `delivery.lastEvent.cargo`.id AS `delivery.lastEvent.cargo.id`,
 `delivery.lastEvent.cargo`.tracking_id AS `delivery.lastEvent.cargo.trackingId`,
 `delivery.lastEvent`.completion_time AS `delivery.lastEvent.completionTime`,
 `delivery.lastEvent`.registration_time AS `delivery.lastEvent.registrationTime`,
 `delivery.lastEvent`.type AS `delivery.lastEvent.type`
FROM cargo AS `cargo`
 LEFT JOIN location AS `origin` ON `cargo`.origin_id = `origin`.id
 LEFT JOIN location AS `routeSpecification.origin` ON `cargo`.spec_origin_id = `routeSpecification.origin`.id
 LEFT JOIN location AS `routeSpecification.destination` ON `cargo`.spec_destination_id = `routeSpecification.destination`.id
 LEFT JOIN leg AS `itinerary` ON `cargo`.id = `itinerary`.cargo_id
 LEFT JOIN voyage AS `itinerary.voyage` ON `itinerary`.voyage_id = `itinerary.voyage`.id
 LEFT JOIN location AS `itinerary.loadLocation` ON `itinerary`.load_location_id = `itinerary.loadLocation`.id
 LEFT JOIN location AS `itinerary.unloadLocation` ON `itinerary`.unload_location_id = `itinerary.unloadLocation`.id
 LEFT JOIN location AS `delivery.nextExpectedActivity.location` ON `cargo`.next_expected_location_id = `delivery.nextExpectedActivity.location`.id
 LEFT JOIN voyage AS `delivery.nextExpectedActivity.voyage` ON `cargo`.next_expected_voyage_id = `delivery.nextExpectedActivity.voyage`.id
 LEFT JOIN voyage AS `delivery.currentVoyage` ON `cargo`.current_voyage_id = `delivery.currentVoyage`.id
 LEFT JOIN location AS `delivery.lastKnownLocation` ON `cargo`.last_known_location_id = `delivery.lastKnownLocation`.id
 LEFT JOIN handling_event AS `delivery.lastEvent` ON `cargo`.last_event_id = `delivery.lastEvent`.id
 LEFT JOIN voyage AS `delivery.lastEvent.voyage` ON `delivery.lastEvent`.voyage_id = `delivery.lastEvent.voyage`.id
 LEFT JOIN location AS `delivery.lastEvent.location` ON `delivery.lastEvent`.location_id = `delivery.lastEvent.location`.id
 LEFT JOIN cargo AS `delivery.lastEvent.cargo` ON `delivery.lastEvent`.cargo_id = `delivery.lastEvent.cargo`.id

Performance with conventional ORM: the N + 1 problem is actually a 6N + 1 problem

Let’s compare this with the conventional ORM approach: Which queries are executed when we run the original cargo system with Hibernate as the persistence layer? To get a fair comparison we need to trigger any lazy loading so we will open the admin page that lists all Cargo records: http://localhost:8080/dddsample/admin/list. We get the following query initially:

SELECT c1_0.id,
       c1_0.calculated_at,
       c1_0.current_voyage_id,
       c1_0.eta,
       c1_0.unloaded_at_dest,
       c1_0.last_event_id,
       c1_0.last_known_location_id,
       c1_0.misdirected,
       c1_0.next_expected_location_id,
       c1_0.next_expected_handling_event_type,
       c1_0.next_expected_voyage_id,
       c1_0.routing_status,
       c1_0.transport_status,
       c1_0.origin_id,
       c1_0.spec_arrival_deadline,
       c1_0.spec_destination_id,
       c1_0.spec_origin_id,
       c1_0.tracking_id
FROM   cargo c1_0 

And then, immediately after this query, 5 additional queries for each Cargo:

SELECT he1_0.id,
       he1_0.cargo_id,
       he1_0.completion_time,
       l1_0.id,
       l1_0.NAME,
       l1_0.unlocode,
       he1_0.registration_time,
       he1_0.type,
       he1_0.voyage_id
FROM   handling_event he1_0
       LEFT JOIN location l1_0
              ON l1_0.id = he1_0.location_id
WHERE  he1_0.id = ? 

SELECT v1_0.id,
       v1_0.voyage_number
FROM   voyage v1_0
WHERE  v1_0.id = ?

SELECT l1_0.id,
       l1_0.NAME,
       l1_0.unlocode
FROM   location l1_0
WHERE  l1_0.id = ?

SELECT l1_0.id,
       l1_0.NAME,
       l1_0.unlocode
FROM   location l1_0
WHERE  l1_0.id = ?

SELECT v1_0.id,
       v1_0.voyage_number
FROM   voyage v1_0
WHERE  v1_0.id = ? 

Apparently, the Hibernate sample application is not performance optimized, as we notice that already in the repository getAll() the N + 1 query problem appears. It requires up to 5 (repeating voyage and location entities are reused and queried only once) additional queries per cargo record, effectively a 5N + 1 query problem. Note that this already happens in the repository call as these 5 associations are fetched eagerly but the persistence layer does not apply join clauses for them.

Additionally, when mapping the list of Cargo entities to CargoRoutingDTO’s (in CargoRoutingDTOAssembler), a second N + 1 query problem appears as the leg query is triggered for each Cargo entity.

SELECT i1_0.cargo_id,
       i1_0.id,
       ll1_0.id,
       ll1_0.NAME,
       ll1_0.unlocode,
       i1_0.load_time,
       ul1_0.id,
       ul1_0.NAME,
       ul1_0.unlocode,
       i1_0.unload_time,
       v1_0.id,
       v1_0.voyage_number
FROM   leg i1_0
       LEFT JOIN location ll1_0
              ON ll1_0.id = i1_0.load_location_id
       LEFT JOIN location ul1_0
              ON ul1_0.id = i1_0.unload_location_id
       LEFT JOIN voyage v1_0
              ON v1_0.id = i1_0.voyage_id
WHERE  i1_0.cargo_id = ? 

So, we get:

1 query for the N Cargo records
N Handling Event queries with one join
up to 4N queries for Locations and Voyages referenced by the Cargo records
N queries with 3 joins for the itinerary of each Cargo.

If we assume paginated results with a page size of 20 (N = 20), we would have 1 + 20 + 80 + 20 = up to 121 queries. Fixing this will require adding Hibernate-specific configuration to the finder method, like specifying an HQL/JPQL query to force Hibernate to use joins instead of lazy loading.

Optimizing the getAll() method with model-driven ORM

With the model-driven ORM approach the getAll() method triggered just one query, but it may need some optimization. The itinerary property of Cargo is a ‘many’-association (a list of Leg records) and is therefore a multiplier for the size of the resultset returned by the relational database. If there are 10 Leg records for a cargo, that will result in 10 rows in the resultset with the Cargo record repeated 10 times. When showing a list of Cargo, we won’t need the full itinerary. Let’s introduce a CargoSummary representation for showing lists of Cargo, and make Cargo a subclass so we can move the itinerary property there:

class CargoSummary {
	Long id
	String trackingId
	Location origin
	RouteSpecification routeSpecification
}

class Cargo extends CargoSummary {
	List<Leg> itinerary;
}

Next, we change the getAll() method in the repository interface to return CargoSummary:

List<CargoSummary> getAll();

The query for CargoSummary contains many-to-one associations only, so it will have exactly one resultset record for each Cargo. It does still contain 11 joins, but 9 of these are just one or two columns from the Voyage and Location tables. This should be pretty fast on any database engine.

SELECT
 `cargo`.id AS `cargo.id`,
 `cargo`.tracking_id AS `cargo.trackingId`,
 `origin`.id AS `origin.id`,
 `origin`.unlocode AS `origin.unlocode`,
 `origin`.name AS `origin.name`,
 `routeSpecification.origin`.id AS `routeSpecification.origin.id`,
 `routeSpecification.origin`.unlocode AS `routeSpecification.origin.unlocode`,
 `routeSpecification.origin`.name AS `routeSpecification.origin.name`,
 `routeSpecification.destination`.id AS `routeSpecification.destination.id`,
 `routeSpecification.destination`.unlocode AS `routeSpecification.destination.unlocode`,
 `routeSpecification.destination`.name AS `routeSpecification.destination.name`,
 `cargo`.spec_arrival_deadline AS `routeSpecification.arrivalDeadline`,
 `cargo`.misdirected AS `delivery.misdirected`,
 `cargo`.eta AS `delivery.eta`,
 `cargo`.calculated_at AS `delivery.calculatedAt`,
 `cargo`.unloaded_at_dest AS `delivery.isUnloadedAtDestination`,
 `cargo`.routing_status AS `delivery.routingStatus`,
 `cargo`.next_expected_handling_event_type AS `delivery.nextExpectedActivity.type`,
 `delivery.nextExpectedActivity.location`.id AS `delivery.nextExpectedActivity.location.id`,
 `delivery.nextExpectedActivity.location`.unlocode AS `delivery.nextExpectedActivity.location.unlocode`,
 `delivery.nextExpectedActivity.location`.name AS `delivery.nextExpectedActivity.location.name`,
 `delivery.nextExpectedActivity.voyage`.id AS `delivery.nextExpectedActivity.voyage.id`,
 `delivery.nextExpectedActivity.voyage`.voyage_number AS `delivery.nextExpectedActivity.voyage.voyageNumber`,
 `cargo`.transport_status AS `delivery.transportStatus`,
 `delivery.currentVoyage`.id AS `delivery.currentVoyage.id`,
 `delivery.currentVoyage`.voyage_number AS `delivery.currentVoyage.voyageNumber`,
 `delivery.lastKnownLocation`.id AS `delivery.lastKnownLocation.id`,
 `delivery.lastKnownLocation`.unlocode AS `delivery.lastKnownLocation.unlocode`,
 `delivery.lastKnownLocation`.name AS `delivery.lastKnownLocation.name`,
 `delivery.lastEvent`.id AS `delivery.lastEvent.id`,
 `delivery.lastEvent.voyage`.id AS `delivery.lastEvent.voyage.id`,
 `delivery.lastEvent.voyage`.voyage_number AS `delivery.lastEvent.voyage.voyageNumber`,
 `delivery.lastEvent.location`.id AS `delivery.lastEvent.location.id`,
 `delivery.lastEvent.location`.unlocode AS `delivery.lastEvent.location.unlocode`,
 `delivery.lastEvent.location`.name AS `delivery.lastEvent.location.name`,
 `delivery.lastEvent.cargo`.id AS `delivery.lastEvent.cargo.id`,
 `delivery.lastEvent.cargo`.tracking_id AS `delivery.lastEvent.cargo.trackingId`,
 `delivery.lastEvent`.completion_time AS `delivery.lastEvent.completionTime`,
 `delivery.lastEvent`.registration_time AS `delivery.lastEvent.registrationTime`,
 `delivery.lastEvent`.type AS `delivery.lastEvent.type` 
FROM cargo AS `cargo`
 LEFT JOIN location AS `origin` ON `cargo`.origin_id = `origin`.id
 LEFT JOIN location AS `routeSpecification.origin` ON `cargo`.spec_origin_id = `routeSpecification.origin`.id
 LEFT JOIN location AS `routeSpecification.destination` ON `cargo`.spec_destination_id = `routeSpecification.destination`.id
 LEFT JOIN location AS `delivery.nextExpectedActivity.location` ON `cargo`.next_expected_location_id = `delivery.nextExpectedActivity.location`.id
 LEFT JOIN voyage AS `delivery.nextExpectedActivity.voyage` ON `cargo`.next_expected_voyage_id = `delivery.nextExpectedActivity.voyage`.id
 LEFT JOIN voyage AS `delivery.currentVoyage` ON `cargo`.current_voyage_id = `delivery.currentVoyage`.id
 LEFT JOIN location AS `delivery.lastKnownLocation` ON `cargo`.last_known_location_id = `delivery.lastKnownLocation`.id
 LEFT JOIN handling_event AS `delivery.lastEvent` ON `cargo`.last_event_id = `delivery.lastEvent`.id
 LEFT JOIN voyage AS `delivery.lastEvent.voyage` ON `delivery.lastEvent`.voyage_id = `delivery.lastEvent.voyage`.id
 LEFT JOIN location AS `delivery.lastEvent.location` ON `delivery.lastEvent`.location_id = `delivery.lastEvent.location`.id
 LEFT JOIN cargo AS `delivery.lastEvent.cargo` ON `delivery.lastEvent`.cargo_id = `delivery.lastEvent.cargo`.id

Optimizing performance

Using the fix above we may have removed a potential bottleneck, but the getAll() repository method is still wasting a lot of memory and bandwidth: It queries a large part of the Cargo aggregate but the actual list that is shown to the user (the list.html template) only contains 4 columns: TrackingId, Origin, Destination and Routed (routing status). The minimal DTO needed for that would look like this:

class CargoListDTO {
	String trackingId;
	String origin; // unlocode
	String destination; // unlocode
	RoutingStatus status;
}

Using PojoQuery, we can turn the same DTO into a query and fetch it directly:

@Table("cargo")
class CargoListDTO {
	@FieldName("tracking_id")
	String trackingId;
	@FieldName("spec_origin_id")
	Location origin; // unlocode
	@FieldName("spec_destination_id")
	Location destination; // unlocode
	@FieldName("routing_status")
	RoutingStatus status;
}

List<CargoListDTO> cargoList = db.query(CargoListDTO.class).execute();

(If you worry about Location objects leaking into the presentation layer, you can define a LocationDTO or query the unlocode directly by specifying the joins using annotations.)

Now we have optimized memory and bandwidth usage of the cargo list to the absolute minimum, and we did so within the context of ORM (i.e. using only objects and types) and without reverting to workarounds like writing HQL/JPQL queries, projections or specifying a fetch graph.

The Cargo System: Wrapping up

With the two new entity representations and the repository changes in place we have a working system with all integration tests (and unit tests) passing. We could do this without any changes to the database schema, and with only a few trivial changes to the domain model. Also, we have options to optimize performance within the context of ORM.

Conclusion

With one fix – banning cycles from entity definitions – we can avoid the common problems and pitfalls of Object Relational Mapping:

No N + 1 query problem because we don’t need lazy loading
No session management issues because deduplication happens at the query level
No data consistency issues with collections, since there is no cache of entities: Updates and inserts are immediate and the relational database guarantees consistent query results.
No issues with identity generation and object equality: Since inserts are immediate, autogenerated ID values are available right away. Deduplication is only necessary for query results and can always be based on the primary key values.

Conventional ORM and model-driven ORM side by side

Conventional ORM	Model-driven ORM
Entities define the database schema	Entity representations define data access
Data access is defined at the schema level, fetch whole entities at a time	Data access is defined for the task at hand, per column and association
Each type represents a table	Each type represents a query
Allows cycles in type definitions	Requires cycle-free type definitions
Lazy loading	Data is always loaded eagerly
Deduplication at transaction level (cache)	Deduplication at query level
Automatic cache flushes before queries, at end of transaction	Explicit and immediate inserts and updates
First-level cache	No caching or proxying of entities
Entities are proxied and connected to the database session	Entities are plain objects and can be treated as plain data
Application code needs to fix data consistency	Data consistency is guaranteed by the relational database
LazyInitializationException	Type-safe data access boundaries

I hope to have shown how the model-driven approach can make Object Relational Mapping more powerful and less problematic. The PojoQuery libary in its current form is sufficient for the Cargo System and it may have enough functionality to support more complex domain models and applications.

Next steps

I hope to get some feedback on the ideas above (and maybe find a better name?), and use that to refine the concept and perhaps build a reference implementation with a nice Getting started guide.

Thank you!

Thank you for reading this far! I would like to hear your thoughts. As I don’t have a discussion platform set up yet, please email me at martijn@vogten.nl, or you can file a Github issue in the PojoQuery project.

References

[Mihalcea] “High-Performance Java Persistence”, by Vlad Mihalcea, 2024, LeanPub

[Fowler] “Patterns of Enterprise Application Architecture”, by Martin Fowler, with Dave Rice, Matthew Foemmel, Edward Hieatt, Robert Mee, and Randy Stafford, 2002, Addison-Wesley

[Evans] “Domain-Driven Design: Tackling Complexity in the Heart of Software”, by Eric Evans, 2003, Addison-Wesley