SecDevOps.comSecDevOps.com
Why ‘Store Together, Access Together’ Matters for Your Database

Why ‘Store Together, Access Together’ Matters for Your Database

The New Stack(3 weeks ago)Updated 3 weeks ago

When your application needs several pieces of data at once, the fastest approach is to read them from a single location in a single call. In a document database, developers can decide what is stored...

When your application needs several pieces of data at once, the fastest approach is to read them from a single location in a single call. In a document database, developers can decide what is stored together, both logically and physically. Fragmentation has never been beneficial for performance. In databases, the proximity of data — on disk, in memory or across the network — is crucial for scalability. Keeping related data together allows a single operation to fetch everything needed, reducing disk I/O, memory cache misses and network round-trips, thereby making performance more predictable. The principle “store together what is accessed together” is central to modeling in document databases. Yet its purpose is to allow developers to control the physical storage layout, even with flexible data structures. Understanding the core principle of data locality is essential today, especially as many databases emulate document databases or offer similar syntax on top of SQL. In contrast, SQL databases were designed for data independence — allowing users to interact with a logical model separate from the physical implementation managed by a database administrator. Today, the trend is not to separate development and operations, allowing faster development cycles without the complexity of coordinating multiple teams or shared schemas. Avoiding the separation into logical and physical models further simplifies the process. Understanding the core principle of data locality is essential today, especially as many databases emulate document databases or offer similar syntax on top of SQL. To qualify as a document database, it’s not enough to accept JSON documents with a developer-friendly syntax. The database must also preserve those documents intact in storage so that accessing them has predictable performance. Whether they expose a relational or document API, it is essential to know if your objective is data independence or data locality. Why Locality Still Matters in Modern Infrastructure Modern hardware still suffers from penalties for scattered access. Hard disk drives (HDDs) highlighted the importance of locality because seek and rotational latency are more impactful than transfer speed, especially for online transactional processing (OTLP) workloads. While solid state drives (SSDs) remove mechanical delays, random writes remain expensive, and cloud storage adds latency due to network access to storage. Even in-memory access isn’t immune: on multisocket servers, non-uniform memory access (NUMA) causes varying access times depending on where the data was loaded into memory by the first access, relative to the CPU core that processes it later. Scale-out architecture further increases complexity. Vertical scaling — keeping all reads and writes on a single instance with shared disks and memory — has capacity limits. Large instances are expensive, and scaling them down or up often requires downtime, which is risky for always-on applications. More nodes increase the likelihood of distributed queries, in which operations that once hit local memory must now traverse the network, introducing unpredictable latency. Data locality becomes critical with scale-out databases. For example, you might need your maximum instance size for Black Friday but would have to scale up progressively in the lead-up, incurring downtime as usage increases. Without horizontal scalability, you end up provisioning well above your average load “just in case,” as in on-premises infrastructures sized years in advance for occasional peaks — something that can be prohibitively costly in the cloud. Horizontal scaling allows adding or removing nodes without downtime. However, more nodes increase the likelihood of distributed queries, in which operations that once hit local memory must now traverse the network, introducing unpredictable latency. Data locality becomes critical with scale-out databases. To create scalable database applications, developers should understand storage organization and prioritize single-document operations for performance-critical transactions. CRUD functions (insert, find, update, delete) targeting a single document in MongoDB are always handled by a single node, even in a sharded deployment. If that document isn’t in memory, it can be read from disk in a single I/O operation. Modifications are applied to the in-memory copy and written back as a single document during asynchronous checkpoints, avoiding on-disk fragmentation. In MongoDB, the WiredTiger storage engine stores each document’s fields together in contiguous storage blocks, allowing developers to follow the principle “store together what is accessed together.” By avoiding cross-document joins, such as the $lookup operation in queries, this design helps prevent scatter-gather operations internally, which promotes consistent performance. This supports predictable performance regardless of document size, update frequency or cluster scale. The Relational Promise: Physical Data Independence For developers working with NoSQL databases, what I exposed above seems obvious: There is one single data model — the domain model — defined in the application, and the database stores exactly that model. The MongoDB data modeling workshop defines a database schema as the physical model that describes how the data is organized in the database. In relational databases, the logical model is typically independent of the physical storage model, regardless of the data type used, because they serve different purposes. SQL developers work with a relational model that is mapped to their object model via object relational mapping (ORM) tooling or hand-coded SQL joins. The models and schemas are normalized for generality, not necessarily optimized for specific application access patterns. The goal of the relational model was to serve online interactive use by non-programmers and casual users by providing an abstraction that hides physical concerns. This includes avoiding data anomalies through normalization and enabling declarative query access without procedural code. Physical optimizations, like indexes, are considered implementation details. You will not find CREATE INDEX in the SQL standard. In practice, a SQL query planner chooses access paths based on statistics. When writing JOIN clauses, the order of tables in the FROM clause should not matter. The SQL query planner reorders based on cost estimates. The database guarantees logical consistency, at least in theory, even with concurrent users and internal replication. The SQL approach is database-centric: rules, constraints and transactional guarantees are defined in the relational database, independent of specific use cases or table sizes. Today, most relational databases sit behind applications. End users rarely interact with them directly, except in analytical or data science contexts. Applications can enforce data integrity and handle code anomalies, and developers understand data structures and algorithms. Nonetheless, relational database experts still advise keeping constraints, stored procedures, transactions, and joins within the database. The physical storage remains abstracted — indexes, clustering, and partitions are administrator-level, not application-level, concepts, as if the application developers were like the non-programmer casual users described in the early papers about relational databases. Codd’s Rules Still Apply to SQL/JSON Documents Because data locality matters, some relational databases have mechanisms to enforce it internally. For example, Oracle has long supported “clustered tables” for co-locating related rows from multiple columns, and more recently offers a choice for JSON storage as either binary JSON (OSON, Oracle’s native binary JSON) or decomposed relational rows (JSON-relational duality views). However, those physical attributes are declared and deployed in the database using a specific data definition language (DDL) and are not exposed to the application developers. This reflects Codd’s “independence” rules: Rule 8: Physical data independence Rule 9: Logical data independence Rule 10: Integrity independence Rule 11: Distribution independence Rules 8 and 11 relate directly to data locality: The user is not supposed to care whether data is physically together or distributed. The database is opened to users who ignore the physical data model, access paths and algorithms. Developers do not know what is replicated, sharded or distributed across multiple data centers. Where the SQL Abstraction Begins to Weaken In practice, no relational database perfectly achieves these rules. Performance tuning often requires looking at execution plans and physical data layouts. Serializable isolation is rarely used due to scalability limitations of two-phase locking, leading developers to fall back to weaker isolation levels or to explicit locking (SELECT ... FOR UPDATE). Physical co-location mechanisms — hash clusters, attribute clustering — exist, but are difficult to size and maintain optimally without precise knowledge of access patterns. They often require regular data reorganization as updates can fragment it again. The normalized model is inherently application-agnostic, so optimizing for locality often means breaking data independence ( denormalizing, maintaining materialized views, accepting stale reads from replicas, disabling referential integrity). With sharding, constraints like foreign keys and unique indexes generally cannot be enforced across shards. Transactions must be carefully ordered to avoid long waits and deadlocks. Even with an abstraction layer, applications must be aware of the physical distribution for some operations. The NoSQL Pivot: Model for Access Patterns As data volumes and latency expectations grow, a different paradigm has emerged: give developers complete control rather than an abstraction with some exceptions. NoSQL databases adopt an application-first approach: The physical model matches...

Source: This article was originally published on The New Stack

Read full article on source →

Related Articles