MAY 27, 2026/6 min readOpen Source

Graph Queries Across Billions of Rows of Scattered Data with Postgres and Apache AGE™

Postgres engines now have access to more data than ever. With extensions like pg_lake, you can connect Postgres to gobs of files in object storage like csv, json, Apache Parquet™ and Apache Iceberg™.

But having access to data in object storage and being able to aggregate data in object storage are two different things. This blog walks through the Postgres extension, Apache AGE™, that makes working with huge files of data sets much friendlier through graph relationships.

Why graph matters for data lakes

Let's consider a healthcare network with providers, patients, facilities and referral chains. The analytical questions are straightforward:

What's the total billed amount per region?
Which patients have the highest spend?
What's the average claim by specialty?

SQL on Iceberg handles all of these beautifully. But if you need to know: "Which in-network providers are referring patients to out-of-network providers through chains of intermediaries, and what's the dollar impact?" Those questions have two parts: a graph traversal (find the referral chains) and an analytical aggregation (sum the costs). Neither a pure graph database nor a pure analytical engine can answer it alone. You need both, working together, in the same query.

This is where data lakes need graphs.

Why Apache AGE

Apache AGE is a PostgreSQL extension that adds openCypher graph query support directly inside Postgres. There are other graph databases out there — Neo4j, Amazon Neptune, TigerGraph — but AGE has a unique advantage for the modern data platform — it runs inside PostgreSQL.

We've recently seen customers increasingly using Apache AGE for a couple of reasons:

No data movement: Your Iceberg tables and your graph live in the same database. You don't extract, transform and load (ETL) data out of your lake into a separate graph database, keep it in sync and maintain two systems. You build your graph from your lake tables.
SQL + Cypher together: AGE graph queries return standard PostgreSQL result sets. That means you can wrap a Cypher query in a CTE and join it against your Iceberg tables in the same statement. Graph output becomes just another subquery.
One connection, one transaction: Lots of good operational simplicity here - like application connection, security, backups stay just one process.

Graph example: A healthcare network

Let's walk through a realistic scenario to show how cool this is. Imagine a healthcare platform where the data lives as Iceberg tables on Amazon S3 — claims, providers, patients, facilities, referrals, insurance plans, regions. This is a typical data set where files land in object storage from several upstream systems.

Load the Postgres extensions. Note that pg_lake runs with a little sidecar, the pgduck_server, to handle the Iceberg read/write.

CREATE EXTENSION pg_lake_spatial CASCADE;
CREATE EXTENSION age;

The Iceberg tables look like normal Postgres tables. You create them with USING iceberg and they're backed by Parquet on S3:

CREATE TABLE claims USING iceberg AS
SELECT claim_id, patient_id, provider_id,
      claim_date, service_type, billed_amount, paid_amount
FROM raw_claims;

CREATE TABLE providers USING iceberg AS
SELECT provider_id, provider_name, specialty,
      facility_id, network_status, avg_visit_cost
FROM provider_source;

Standard SQL works exactly as you'd expect — top spenders, cost by region, average claim by specialty. And you can easily use Postgres to join data from multiple Iceberg and Parquet tables.

Now let's build a graph from the same data. AGE lets you loop over your Iceberg tables and create nodes and edges for each table.

This creates nodes.

SELECT create_graph('healthcare');

-- Create Provider nodes from the Iceberg table
DO $$
DECLARE r RECORD;
BEGIN
   FOR r IN SELECT provider_id, provider_name, specialty,
                   network_status FROM providers
   LOOP
       EXECUTE format(
           'SELECT * FROM cypher(''healthcare'', $q$
               CREATE (:Provider {provider_id: %s, name: %s,
                       specialty: %s, network_status: %s})
           $q$) AS (v agtype)',
           quote_literal(r.provider_id), quote_literal(r.provider_name),
           quote_literal(r.specialty), quote_literal(r.network_status)
       );
   END LOOP;
END;
$$ LANGUAGE plpgsql;

You'll follow a similar pattern for the other tables like patients, facilities, regions and plans.

Then you'll add edges. In graph terms, edges (also called relationships) are the connections between nodes. If nodes are the "things" in your data — providers, patients, facilities — then edges are the "verbs" that describe how those things relate to each other: this provider referred to that provider, this patient visited that doctor. Edges can also carry their own properties (like a referral date or a reason), so they're not just links — they encode the context of the relationship.

For example, this code below adds REFERRED_TO edges from the referrals Iceberg table:

-- Create REFERRED_TO edges from the referrals Iceberg table
DO $$
DECLARE r RECORD;
BEGIN
   FOR r IN SELECT referring_provider_id, referred_to_provider_id,
                   referral_date, referral_reason FROM referrals
   LOOP
       EXECUTE format(
           'SELECT * FROM cypher(''healthcare'', $q$
               MATCH (p1:Provider {provider_id: %s}),
                     (p2:Provider {provider_id: %s})
               CREATE (p1)-[:REFERRED_TO {referral_date: %s,
                       reason: %s}]->(p2)
           $q$) AS (e agtype)',
           quote_literal(r.referring_provider_id),
           quote_literal(r.referred_to_provider_id),
           quote_literal(r.referral_date),
           quote_literal(r.referral_reason)
       );
   END LOOP;
END;
$$ LANGUAGE plpgsql;

Each edge represents a referral from one provider to another. It loops over the referrals Iceberg table and for each row, connects the referring provider node to the referred-to provider node with a directed relationship. The edge also carries referral_date and reason as properties, so you can later filter or analyze referral chains by when or why they happened.

When you're creating the procedures to create the edges and nodes, you're building a graph inside Postgres itself to know how and where to find this data. This map is actually stored on disk local to Postgres. There is some data duplication but only the data you're using for nodes and edges, the other columns can just stay in Iceberg.

Figure 1: Apache Age Cypher Graph relationships

Apache AGE versus a traditional CTE

Let's make this tangible. Suppose you need to answer:

"Find all referral chains where an in-network provider refers (directly or through intermediaries) to an out-of-network provider, then calculate the total billed amount for claims at those out-of-network providers."

In pure SQL, you'd need a recursive common table expression (CTE) with cycle detection to walk the referral chains:

WITH RECURSIVE referral_chain AS (
   -- Base case: direct referrals from in-network providers
   SELECT
       r.referring_provider_id AS origin_id,
       r.referred_to_provider_id AS current_id,
       p_from.network_status AS origin_status,
       p_to.network_status AS current_status,
       ARRAY[r.referring_provider_id, r.referred_to_provider_id] AS path,
       1 AS depth
   FROM referrals r
   JOIN providers p_from ON r.referring_provider_id = p_from.provider_id
   JOIN providers p_to ON r.referred_to_provider_id = p_to.provider_id
   WHERE p_from.network_status = 'In-Network'

   UNION ALL

   -- Recursive step: follow the chain, with cycle detection
   SELECT
       rc.origin_id,
       r.referred_to_provider_id AS current_id,
       rc.origin_status,
       p_next.network_status AS current_status,
       rc.path || r.referred_to_provider_id,
       rc.depth + 1
   FROM referral_chain rc
   JOIN referrals r ON rc.current_id = r.referring_provider_id
   JOIN providers p_next ON r.referred_to_provider_id = p_next.provider_id
   WHERE rc.depth < 4
     AND r.referred_to_provider_id != ALL(rc.path)  -- cycle detection
),
oon_endpoints AS (
   SELECT DISTINCT origin_id, current_id AS oon_provider_id
   FROM referral_chain
   WHERE current_status = 'Out-of-Network'
)
SELECT
   p_in.provider_name AS referring_doctor,
   p_oon.provider_name AS out_of_network_doctor,
   count(*) AS claims_at_oon,
   round(sum(c.billed_amount)::numeric, 2) AS total_billed_oon,
   round(sum(c.paid_amount)::numeric, 2) AS total_paid_oon,
   round(sum(c.billed_amount - c.paid_amount)::numeric, 2) AS cost_gap
FROM oon_endpoints oe
JOIN providers p_in ON oe.origin_id = p_in.provider_id
JOIN providers p_oon ON oe.oon_provider_id = p_oon.provider_id
JOIN claims c ON c.provider_id = oe.oon_provider_id
GROUP BY p_in.provider_name, p_oon.provider_name
ORDER BY total_billed_oon DESC
LIMIT 10;

This is 40+ lines of SQL. The UNION ALL materializes every intermediate row at every depth. If your referral graph has high fan-out, the recursive step can explode in memory before producing a single result. On a data set of even moderate size, this query could time out or exhaust work_mem.

With Apache AGE, the same question becomes:

WITH oon_referrals AS (
   SELECT * FROM cypher('healthcare', $$
       MATCH (p1:Provider {network_status: 'In-Network'})
             -[:REFERRED_TO*1..4]->
             (p2:Provider {network_status: 'Out-of-Network'})
       RETURN p1.provider_id AS in_net_id,
              p1.name AS in_net_doc,
              p2.provider_id AS oon_id,
              p2.name AS oon_doc
   $$) AS (in_net_id agtype, in_net_doc agtype,
           oon_id agtype, oon_doc agtype)
)
SELECT
   oon.in_net_doc::text AS referring_doctor,
   oon.oon_doc::text AS out_of_network_doctor,
   count(*) AS claims_at_oon,
   round(sum(c.billed_amount)::numeric, 2) AS total_billed_oon,
   round(sum(c.paid_amount)::numeric, 2) AS total_paid_oon,
   round(sum(c.billed_amount - c.paid_amount)::numeric, 2) AS cost_gap
FROM oon_referrals oon
JOIN claims c ON c.provider_id = trim(both '"' from oon.oon_id::text)
GROUP BY oon.in_net_doc, oon.oon_doc
ORDER BY total_billed_oon DESC
LIMIT 10;

The Cypher MATCH traverses variable-length paths with property filters with just four lines of code here. Because we created the nodes and edges ahead of time, we don't need a huge union or tons of memory to find this. AGE handles cycle detection internally. The graph engine's list storage makes the traversal efficient regardless of fan-out. And the result is dropped into a standard CTE so Postgres can join against the claims Iceberg table.

Closing thoughts

The infrastructure required is surprisingly minimal for a really complex graph system here. PostgreSQL, Apache AGE, pg_lake and S3-compatible object storage. That's the whole stack. It runs locally on a laptop with Docker and it scales to production on the same architecture.

The data lake Postgres pattern has solved some pretty major data storage problems. Open table formats solved the interoperability problem. But the query problem — asking relationship questions across analytical data — can be a bit of a gap.

Running pg_lake and Apache AGE in the same PostgreSQL instance fills that gap. Your data stays in Iceberg on S3. Your graph is built from that same data, in the same database. And the queries that matter most — the ones that combine graph traversals with analytical aggregations — are just SQL with a Cypher subquery.

One database. No data movement. Just Postgres, being extensible enough to be everything you need.

Learn more about the author

Elizabeth Garrett Christensen

Developer Advocate