Nested traversal in Spring Data derived queries: the underscore rule, the distinct-join trap, the same-collection trap, projections, EntityGraph, streams.

The single-entity toolkit in part one covered every keyword Spring Data offers when the query stays on the root entity. The interesting questions start when the predicate has to traverse into an associated entity. Order has many OrderItems. How do you find all orders whose items match some condition, and how do you do it without surprising yourself with the resulting SQL?

Derived queries can do nested traversal. They use a syntax you may have seen and not understood: the underscore. This post covers that syntax, the duplicate-parent problem it creates, projections, entity graphs, streams, and the moment when you should stop reaching for derived queries entirely.

The underscore rule

When Spring Data parses a method name like findByItemsProductSku, it has a choice. It can read itemsProductSku as a flat property on Order, or as the nested path items.productSku. The parser is greedy: it tries the longest property name first, then steps back one camelCase token at a time and tries again. If itemsProductSku happens to exist as a property on Order, the parser stops there and never traverses.

The underscore forces the split. findByItems_ProductSku is unambiguous: split at the underscore, traverse from items into productSku, never confuse it with anything else. Use the underscore on every nested traversal. It is explicit, it survives refactoring, and it is the syntax the Spring Data team recommends.

Traversing a collection

List<Order> findByItems_ProductSku(String sku);

The generated JPQL is roughly:

SELECT o FROM Order o
  JOIN o.items i
WHERE i.productSku = :sku

The join is an inner join. Orders without items will not appear in the result. If you have an order with zero OrderItems and you want it included anyway, you need a left join, which derived queries do not produce. Switch to @Query.

The duplicate parent problem, then and now

Join a one-to-many in SQL and the database returns one row per matching child. If an Order has three FULFILLED items, the raw SQL result for a fulfilled-items predicate has that order three times. This is the classic duplicate-parent trap.

If you are on Hibernate 5 or older, the JPA result reflects the SQL: the List<Order> contains ORD-003 three times. The historical fix is the Distinct keyword:

List<Order> findDistinctByItems_Status(OrderItemStatus status);

That emits SELECT DISTINCT o. Each parent appears once. Easy.

If you are on Hibernate 6 or newer (and you are, because Spring Boot 3 and 4 ship with it), the story changed. Hibernate 6 deduplicates entity-returning query results in memory by primary key. The SQL still returns the same multiplied rows, but by the time the List reaches your code each parent appears once. The Distinct keyword is no longer needed to make the Java result correct.

Distinct still matters in three cases. First, projections (interface, record, raw column) are not entities and are not deduplicated, so a DTO projection over a one-to-many join produces row-per-child unless you add Distinct. Second, the SQL itself is still wasteful without DISTINCT: the database transfers N rows that Hibernate then collapses to 1 entity. On a large result set this is real bandwidth and CPU. Third, count queries (countBy) operate at the SQL level and need countDistinct if you want a count of unique parents, not unique join rows. So Distinct is no longer a correctness fix for entity queries on modern Hibernate, but it is still the right choice for efficiency and for non-entity returns.

Combining parent and child predicates

List<Order> findByStatusAndItems_Status(
    OrderStatus orderStatus,
    OrderItemStatus itemStatus);

Generated JPQL:

SELECT o FROM Order o
  JOIN o.items i
WHERE o.status = :orderStatus
  AND i.status = :itemStatus

One join, two predicates: one on the parent, one on the joined child. This is the natural shape. Same duplicate-parent caveat applies: add Distinct if you want each parent once.

The trap: two predicates on the same collection

This is the part that catches everyone. Consider:

List<Order> findDistinctByItems_StatusAndItems_QuantityGreaterThan(
    OrderItemStatus status, int quantity);

What you read this as: orders where some item is FULFILLED, and some item has quantity greater than N. The items could be different.

What you get: orders where some single item is both FULFILLED and has quantity greater than N. Spring Data emits one join, not two. Both predicates apply to the same joined row.

SELECT DISTINCT o FROM Order o
  JOIN o.items i
WHERE i.status = :status
  AND i.quantity > :quantity

If your data model needs the looser semantic (any FULFILLED item AND any over-quantity item, possibly different items), you need two separate joins. Spring Data will not write that for you. Use a Specification, or write a @Query that does EXISTS twice with two correlated subqueries. The same rule applies to any pair of predicates that both traverse through the same collection.

Existence and counting through a join

boolean existsByItems_ProductSku(String sku);
long countDistinctByItems_Status(OrderItemStatus status);

existsBy compiles to a cheap SELECT EXISTS(SELECT 1 FROM ...) with no entity hydration. countDistinctBy emits COUNT(DISTINCT o.id) and gives you the number of distinct parents that have at least one matching child. Either is the right shape for these questions; findDistinctBy... .size() is the wrong shape (it loads every row).

Projections, in three flavors

You do not always want the full entity. You want a few columns and a quick render. Spring Data has three projection styles.

Interface-based projections. Declare an interface with getters. Spring returns proxies that implement it.

public interface OrderSummary {
    Long getId();
    String getOrderNumber();
    BigDecimal getTotalAmount();
    OrderStatus getStatus();
}

List<OrderSummary> findSummaryByStatus(OrderStatus status);

The generated SELECT projects only the columns the interface declares. No entity, no first-level cache pollution, no lazy associations. Cheap and read-only.

Class-based projections (records). Declare a record matching the columns. Spring uses the canonical constructor.

public record OrderDto(
    Long id,
    String orderNumber,
    String customerEmail,
    BigDecimal totalAmount,
    OrderStatus status,
    Instant createdAt) {}

List<OrderDto> findDtoByStatus(OrderStatus status);

Records are the cleanest fit because the constructor parameter names match the entity property names, and Spring Data maps them by name. The result is immutable and serializable. This is usually the shape you want for HTTP responses.

Dynamic projections. Pass Class<T> at call time and Spring Data picks the projection per call.

<T> List<T> findByCustomerEmail(String customerEmail, Class<T> type);

Now the same method backs multiple consumers: a list view that wants OrderSummary, an export job that wants the full entity, a CSV builder that wants OrderDto. One query method, three shapes. Useful for sharing query logic across read paths.

EntityGraph: eager-load a collection in the derived query

@EntityGraph(attributePaths = {"items"})
List<Order> findWithItemsByCustomerEmail(String customerEmail);

This attaches a fetch hint to the derived query. Hibernate emits a single JOIN FETCH for the items collection so the parents and their children come back in one round trip. Outside the persistence context, calling order.getItems() does not trigger another query, and never throws LazyInitializationException.

EntityGraph plus a one-to-many JOIN FETCH plus pagination is a trap. Hibernate has to either fetch all rows into memory and paginate in Java (with a warning in the log) or refuse to combine them. For paginated reads, fetch the parent page first, then load the children with a separate WHERE id IN (...) query, or use entity sub-select fetching.

Streaming results

@Transactional(readOnly = true)
public void processPaidOrders() {
    try (Stream<Order> stream = repo.streamByStatus(OrderStatus.PAID)) {
        stream.forEach(this::process);
    }
}

A Stream return type gives you a cursor-backed iteration. The database holds an open cursor; Hibernate reads rows in batches as the stream advances. Two strict requirements: the stream must be consumed inside an open transaction, and it must be closed (use try-with-resources). For a batch job that processes a hundred thousand rows, streaming uses constant memory regardless of result size.

When to leave derived queries behind

Derived queries cover most repository methods you will ever write. The cases where they do not work are clear:

Dynamic predicates. If the predicate changes based on which fields the caller filled in (a search form), use Specification. Spring Data composes specifications with and, or, and not as code, not as a method name.

Parenthesized boolean logic. Derived queries have no parens. findByAOrBAndC is always A OR (B AND C). If you need (A OR B) AND C, switch to @Query or Specification.

Two predicates on the same collection that should match different rows. Already covered. Use Specification with two joins, or a @Query with two EXISTS subqueries.

Complex joins or subqueries. Group by, having, window functions, lateral joins. None of these have a derived-query syntax. Write the JPQL or the native SQL.

Performance. If your read path is performance-critical and entity hydration is the bottleneck, a projection might fix it. If not, drop to JdbcTemplate or a native query. Repository methods are an abstraction. Sometimes the right answer is to step around them.

Companion code

The repo at github.com/umur/spring-data-derived-queries implements every method in this post and tests it against a Postgres container via Testcontainers. OrderRepositoryNestedIT covers nested traversal and the same-collection trap. OrderRepositoryProjectionIT covers interface, record, and dynamic projections, plus the EntityGraph case.

The two parts together describe everything Spring Data derived queries can do on a single aggregate. Dynamic SQL, parenthesized boolean logic, and two-join semantics on a single collection are all outside what derived queries can express. Inside those limits, the syntax stays concise. Queries are validated at startup. The methods read like English. That is a good trade for most repository code.

Spring Data Derived Queries: Crossing Boundaries