Data Modeling in the Age of JSON and Schema-on-Read

Author: Kent Graziano

Snowflake Thought Leadership

With the rise of mobile devices and applications, as well as the Internet of Things, the number and variety of data sources has grown exponentially over the past decade. IDC recently estimated that 175 zettabytes of data will have been created by 2025. 

For businesses to use all this data to drive growth, they must first be able to understand it. Effective data modeling is critical to making these massive new data streams and formats usable. Data modeling helps define the structure and semantics of the data, so business users and data scientists can properly query, manipulate, and analyze it.

Businesses that employ Schema-on-Write methodology know the importance of data modeling. In Schema-on-Write, which has been the standard method for loading data to be stored since the inception of relational databases in the 1970s, a database designer creates a structure, or “schema,” for the data before it is loaded, or “written,” into the system. 

With the more recent advent of the Schema-on-Read methodology, in which the goal is to load data into the system as quickly as possible and without upfront design and modeling, data modeling has taken a backseat. Yet it remains no less important.

What Is Schema-on-Read?

Schema-on-Read first rose to popularity among software developers, because it shortened the delivery time for working applications. Schema-on-Read allows applications to store data in semi-structured formats such as JavaScript Object Notation (JSON) and to rapidly iterate on it without breaking the database behind the applications. This was a boon to software developers, but it posed a challenge for the data scientists and data warehouse users who wanted to create reports from the data or analyze it. 
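A minimal sketch of why this approach is so flexible for developers (the record and field names here are illustrative, not from the original): records with different shapes can sit side by side in the same JSON store, and any schema handling is deferred to read time.

```python
import json

# Two versions of a hypothetical application "event" record stored together;
# the newer record adds a "device" field without any schema migration.
events = [
    json.loads('{"event": "login", "user": "ana"}'),
    json.loads('{"event": "login", "user": "ben", "device": "mobile"}'),
]

# Readers cope with the missing field at query time (schema-on-read).
devices = [e.get("device", "unknown") for e in events]
print(devices)  # ['unknown', 'mobile']
```

This is exactly the property that speeds up application delivery, and exactly the property that pushes structural work downstream to the analysts who later need a consistent shape to report on.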

As flexible and fast as Schema-on-Read is, it also requires that data be transformed into an understandable relational model of some kind in order to allow business users to make sense of it. As it turns out, semi-structured data can be transformed into a relational model by applying data modeling best practices.

With JSON documents, it is relatively easy to identify the elements needed to create a relational database within them, including key:value pairs, nested keys, and arrays. A data warehouse architect need only be familiar with the notation of these document types to discern their structure and develop a model that can be queried within a data warehouse. Using the same principles of data design, semi-structured data such as JSON can be transformed into any model: third normal form (3NF), a star schema, or a Data Vault.
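As a hedged sketch of that mapping (all document and column names below are hypothetical), the following Python snippet takes a JSON document containing scalar key:value pairs, a nested key, and an array, and flattens it into rows for two relational tables, roughly a normalized parent/child split:

```python
import json

# Illustrative JSON document: scalar key:value pairs, a nested
# "address" key, and an "orders" array.
doc = json.loads("""
{
  "customer_id": 42,
  "name": "Acme Corp",
  "address": {"city": "Denver", "state": "CO"},
  "orders": [
    {"order_id": 1, "total": 99.50},
    {"order_id": 2, "total": 12.00}
  ]
}
""")

# Parent table row: scalar fields, with nested keys flattened into columns.
customer_row = {
    "customer_id": doc["customer_id"],
    "name": doc["name"],
    "address_city": doc["address"]["city"],
    "address_state": doc["address"]["state"],
}

# Child table rows: one row per array element, each carrying the
# parent's key as a foreign key.
order_rows = [
    {"customer_id": doc["customer_id"], **order}
    for order in doc["orders"]
]

print(customer_row)
print(order_rows)
```

The same pattern generalizes: nested keys become additional columns (or a related table), and each array becomes a child table keyed back to its parent, which is the starting point for a 3NF, star schema, or Data Vault design.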

By examining the content and forms within these semi-structured documents, a data modeler can transform them into actionable, useful data sets that businesses can leverage to generate value. Doing so is worth the effort: Getting a handle on how to treat semi-structured data greatly increases the amount and variety of data available to the business, from mobile application data to consumer sentiment data, since social media data such as tweets and hashtags is typically semi-structured. Businesses can use this data to inform business decisions, optimize processes, and create a competitive advantage.

Regardless of the data model the business chooses and the tools it employs to implement the model, the goal remains the same: an actionable model that can be queried, understood, and used to derive business value from an ever-increasing stream of data. 

Learn More About Data Modeling for JSON

For a deeper technical look at turning semi-structured data into a viable relational data model, check out our eBook: “Data Modeling Best Practices: How to Convert Schema-on-Read into Schema-on-Write.”

More about Data Modeling: Supporting Multiple Data Modeling Approaches with Snowflake