The Detail-Backfill Pattern or DataLoader Before It Was Cool
While optimization is often premature and, therefore, the root of all evil[1], using an efficient design for data access is choosing a good path rather than paving a bad one. A common task for APIs is to retrieve a “page” of master rows and their detail rows (e.g. Invoice and Items), as well as a total count of the master rows that match the given criteria. Here are three ways to get the data from a standard SQL database:
Option 1: for each row of the master table query result, execute a query per detail table to get the detail rows for that master row (also known, derisively, as N+1) #
Pro #
- initiative design
- allows easy pagination of master rows
Con #
- slow
- lots of little queries burden SQL server and network
- data might change during the process if the option does not include a potentially huge transaction (because the last query returns long after the first one)
Option 2: join all the detail rows to the master rows in one query #
Pro #
- only one query
- easy (if slow) pagination
Con #
- slow to the point of stopping access to the tables
- slow pagination
- cartesian number of rows from multiple details
- code must deal with duplicate master-row columns
Option 3: get the master rows with one query and all the details for those rows in one other query per detail table #
Pro #
- one detail query for each detail table
- allows easy pagination of master rows
Con #
- unintuitive
- requires code to merge the detail data under the master data
The effects of these three implementations can affect the usage, if not the interface, of the function
Pseudo-code Examples #
Option 1: Run a Query on the Detail for each Master
DataQuery = "SELECT *"
CountQuery = "SELECT COUNT(*) AS count"
DataQuery & CountQuery += " FROM table T1 WHERE T1.status = 'active'"
DataQuery += " LIMIT @startOffset, @pageSize
Result = empty Array
execute(CountQuery)
if (CountQuery.count <= startOffset) OR (pageSize == 0) OR
(startOffset > CountQuery.count)
return CountQuery.count, Result
execute(DataQuery)
for each DataQuery.row
store master column data to new CurrentMaster
append CurrentMaster to Result
DetailQuery = "SELECT * from detail T2 \
WHERE T2.master_id = @CurrentMaster.id
execute(DetailQuery)
for each DetailQuery.row
append detail column data to CurrentMaster.DetailArray
return CountQuery.count, Result
Option 2: SQL Join the Master and Detail
This code is for one detail table. Every additional table adds a join to the query
CountQuery = "SELECT COUNT(*) AS count FROM table T1"
DataQuery = "SELECT * FROM table T1 \
JOIN detail T2 ON (T1.id = T2.master_id)"
DataQuery & CountQuery += "WHERE T1.status = 'active'"
DataQuery += "ORDER BY T1.id, T2.id"
Result = empty Array
execute(CountQuery)
if CountQuery.count == 0
return CountQuery.count, Result
CurrentMaster = null
execute(DataQuery)
MasterRowIndex = 0
for each DataQuery.row
if (CurrentMaster == null) OR (CurrentMaster.id != DataQuery.id)
MasterRowIndex += 1
if MasterRowIndex > startOffset + pageSize
break
store master column data to new CurrentMaster
if MasterRowIndex > startOffset
append CurrentMaster to Result
if MasterRowIndex > startOffset
append detail column data to CurrentMaster.DetailArray
return CountQuery.count, Result
Because joining the Master and Detail tables can produce 1 - n rows for each master row, one cannot paginate using the query LIMIT clause. One must load enough rows to ensure that one can return a page-full of rows
Option 3: Manage the Join between Master and Detail in Code
CountQuery = "SELECT COUNT(*) AS count FROM table T1"
DataQuery = "SELECT * FROM table T1"
DataQuery & CountQuery += "WHERE T1.status = 'active'"
DataQuery += " LIMIT @startOffset, @pageSize",
Result = empty Array
execute(CountQuery)
if (CountQuery.count <= startOffset) OR (pageSize == 0) OR
(startOffset > CountQuery.count)
return CountQuery.count, Result
execute(DataQuery)
for each DataQuery.row
append master column data to Result
MasterIds = the id values for all master rows in Result
DetailQuery = "SELECT * from detail T2 \
WHERE T2.master_id IN @MasterIds \
ORDER BY T2.master_id"
CurrentMaster = null
execute DetailQuery
for each DetailQuery.row
if (CurrentMaster == null) OR
(CurrentMaster.id != DetailQuery.master_id)
CurrentMaster = find Result[] for DetailQuery.master_id
append detail column data to CurrentMaster.DetailArray
return CountQuery.count, Result
One can iterate Result to gather the MasterIds or save them in a list as one fetches the master rows
To find an item in Result[] for a master id in order to add a detail result, one can search Result linearly or index it depending on the possible size of Result, memory cost, etc.
[1]: “…; premature optimization is the root of all evil (or at least most of it) in programming.” - Donald Knuth