FEB 23, 2023/6 min readIndustry Solutions

Ingestion of Healthcare Pricing Transparency Data Files Natively on Snowflake

Hospital price transparency helps Americans know the cost of a hospital item or service before receiving it. Starting January 1, 2021, each hospital operating in the United States is required to provide clear, accessible pricing information online about the items and services they provide in two ways: as a comprehensive machine-readable file with all items and services listed, and in a display of shoppable services in a consumer-friendly format.

Health plan price transparency helps consumers know the cost of a covered item or service before receiving care. As of July 1, 2022, group health plans and issuers of group or individual health insurance must post prices of covered items and services. This pricing information can be used by third parties, such as researchers and app developers, to help better inform consumers about the costs associated with their healthcare. Under the final Centers for Medicare & Medicaid Services (CMS) rules, a plan or issuer must disclose in-network negotiated rates and billed but out-of-network rates allowed through two machine-readable files posted on an internet website. More requirements went into effect on January 1, 2023, and as of January 1, 2024, a plan or issuer will be required to provide additional access to pricing information and enhance consumers' ability to shop for the healthcare that best meet their needs.

The overarching goal of these rules is to support a market-driven healthcare system by giving consumers the information they need to make informed decisions about their healthcare and related purchases. This enables customers to evaluate healthcare options and make cost-conscious decisions, reducing surprise out-of-pocket costs for health care services. It also creates a competitive dynamic that can narrow price differences for the same services in the same healthcare markets, foster innovation by providing industry the information necessary to support informed, price-conscious consumers in the healthcare market; and, over time, potentially lower overall healthcare costs.

Pricing transparency file formats and complexity

The CMS guidelines state that the files are to be hosted on an HTTPS website and should be in one of the approved formats (JSON/XML/YAML). Health plan providers are to update these files once a month.

However, the guidelines do not restrict file size. Given the amount of data that needs to be shared at each billing code/provider/plan level, the files created by health plans are often multi-GB in size. In fact, managed healthcare and insurance company Cigna warns these files potentially could be 1 TB in size.

Data published by health plans in one of the approved formats usually has a single consolidated JSON record with many nested objects, including the provider-negotiated in-network rates, plan type, and billing code level within that single JSON. Parsing and looping through those nested JSON structures to arrive at flattened information in tabular structure is cumbersome and complex. It is time-consuming for processors to loop through the nested JSON and unpack in-network negotiated rates and establish relationships with provider/billing code/plan types. After parsing, data often needs to be stored in an optimized schema/table format to enable ease of use for various analytic purposes.

Here is an example of single JSON with bundled in-network negotiated rates: in-network-rates-bundle-single-plan-sample.json

Processing these complex JSON files every month requires large computing infrastructure that can be scaled up as the files are made available, then scaled down once the processing is done.

Snowflake’s solution

Typically, when ingesting a large file into the data lake, the approach is:

Parse and break the file into smaller files outside of Snowflake
Ingest them into Snowflake (DataStore)
Aggregate the various segments once all the data is ingested and rebuild the original structure establishing relationships

This workaround is required as it is not possible to load very large files into memory and process them in the Snowpark environment. Workaround processes, as described above, are often complex, time- consuming, and require multiple technologies or infrastructures to make them work.

To address this complexity, Snowflake has recently introduced dynamic file access capability with Snowpark (currently in private preview). It allows the Python program running natively in the Snowflake Snowpark environment to read the file dynamically in chunks, process them, and ingest the data into Snowflake without having to upload the entire file into Snowflake’s internal stage and load it into memory for processing.

Snowflake’s solution to ingesting very large healthcare pricing transparency data files.

In the above solution approach, the pricing transparency JSON file is hosted in a cloud storage bucket and is referenced through an external stage on Snowflake. A custom Snowpark Python stored procedure is used to build a directed acyclic graph (DAG) of interconnected tasks that are executed in parallel. Each of these tasks (Snowpark Python code) reads the specific segments of the JSON file using the dynamic file access capability of Snowflake; it reads data from large files in a streaming fashion without loading the entire file into memory. These specific segments are then stored as smaller parquet files on cloud storage, referenced as external tables on Snowflake, and combined to result in structured data entities that hold pricing transparency data for analytics purposes. Each task records audit information indicating success and failures (if any) for traceability.

Conclusion

Ingesting, parsing, processing, and generating meaningful insights from healthcare pricing transparency data requires two key challenges to be addressed: ingestion of large JSON files and structuring the data in an easy-to-query model that helps generate negotiated price-related insights across various business dimensions. This solution addresses the challenges associated with ingestion of large and complex JSON files, simplifying the technology stack leveraging Snowflake’s native capabilities and providing elastic scalability as monthly files from each healthcare payer are processed with near-zero maintenance of the underlying compute and storage infrastructure. Once the data is ingested, Snowflake’s JSON handling capabilities and highly performant query engine makes accessing negotiated price data across various dimensions simple and efficient.

To learn more about Snowpark, watch our quick two-minute explainer video, or for more in-depth information, watch our partner demo. If you’re a Snowflake customer and would like to learn more, please contact your sales lead to schedule a live demo. Please note: Snowpark dynamic file access capability is in private preview until spring of 2023.

Learn more about the authors

Venkatesh Sekar

Principal Data Cloud Architect

Murali Gandhirajan

Global Healthcare and Life Sciences CTO

Ingestion of Healthcare Pricing Transparency Data Files Natively on Snowflake