Penguin Random House is the UK’s leading book publisher, bringing 2,000 new titles to market each year, and, as part of Bertelsmann Media Group, is an influential force in global publishing, operating across six continents. Penguin Random House is dedicated to enabling talented people from all walks of life to tell their stories, making books for everyone, because a book can change anyone.

Snowflake caught up with Pete Williams, Director of Data and Online at Penguin Random House (PRH), to hear how they have transformed data handling at the publishing giant to tackle a number of legacy issues and enable data maturity within the business.

How do you solve a problem like Big Data?

To fulfil its mission and maintain its leading position in publishing, Penguin Random House set out to develop its competitive advantage by transforming its data solutions and leveraging its vast data assets. “Although PRH has been very successful using tried and tested operating methods, there’s a great opportunity to exploit our data assets to maintain and grow our position in an ever more competitive market,” says Williams.

Although PRH’s legacy on-premises Netezza infrastructure collected and stored many terabytes of mainly transactional data, that data was made available primarily to sales and finance teams through BI tools such as Business Objects and Tableau. The fixed capacity and specific architecture meant a significant reliance on manual coding to orchestrate the extraction, transformation and loading of data, which left the platform unable to scale or to tackle wider opportunities beyond basic BI. Many other parts of the organisation needed access to data as well, and the constraints of the legacy solution led to the creation of departmental data and analytics silos. With no single source of truth for the business, trust in the data on offer suffered.

Williams continues, “We had a BI team which brought data into the organisation, but it turns out other teams in other parts of the business were also bringing in some of that same data. They were storing it in different places, turning it into different sets of composite measures and providing a different view on performance.” It was clear to Williams that if PRH was to make rapid, confident decisions, it was vital to consolidate the silos so that everybody drew their data from a single source of truth.

Such a necessity is common in business as data becomes an ever more significant factor in success. As Williams notes, “every business I have worked in seems to be a cottage industry of busy people manually finding and stitching together information – when their job is actually to use that data to make decisions. That manual effort always leads to mistakes or points of failure.”

Growing in data maturity

To implement that single source of truth and eliminate points of failure, it was essential for Williams and his team to embed proper data governance from the outset, and they put a Snowflake architecture at the heart of it.

Snowflake provides the storage for Penguin Random House’s new data model, which is organised into three separate layers: a historical archive that contains raw data, a data vault base for integration and standardisation, and a presentation layer from which data can be extracted for the day-to-day workings of the business. Replacing manual coding and maintenance with automated data replication means data capture into Snowflake is now fast and scalable.
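
To make the three-layer idea concrete, here is a minimal in-memory sketch in Python. It is an illustration only and not PRH’s implementation: the record fields (isbn, units), the source names, the sample ISBN and every function name are assumptions invented for this example, and plain Python structures stand in for Snowflake tables. It shows raw records landing untouched in an archive, being standardised onto a hashed business key in a simplified data-vault-style layer, and then being summarised in a presentation view.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Platform:
    """Hypothetical stand-in for the three layers described above."""
    archive: list = field(default_factory=list)     # layer 1: raw records, stored as received
    hubs: dict = field(default_factory=dict)        # layer 2: hashed key -> standardised business key
    satellites: list = field(default_factory=list)  # layer 2: attributes tied to a hub, with their source


def hash_key(business_key: str) -> str:
    """Standardise a business key, then hash it so every source maps to the same hub."""
    normalised = business_key.strip().upper()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def land(platform: Platform, source: str, record: dict) -> None:
    """Layer 1: keep an unmodified copy of the record, tagged with where it came from."""
    platform.archive.append({"source": source, "payload": dict(record)})


def integrate(platform: Platform) -> None:
    """Layer 2: rebuild hubs and satellites from the archive (assumed fields: isbn, units)."""
    platform.hubs.clear()
    platform.satellites.clear()
    for row in platform.archive:
        payload = row["payload"]
        key = hash_key(payload["isbn"])
        platform.hubs.setdefault(key, payload["isbn"].strip().upper())
        platform.satellites.append(
            {"hub_key": key, "source": row["source"], "units": int(payload["units"])}
        )


def present(platform: Platform) -> dict:
    """Layer 3: one agreed figure per title, whichever source supplied the rows."""
    totals: dict = {}
    for sat in platform.satellites:
        isbn = platform.hubs[sat["hub_key"]]
        totals[isbn] = totals.get(isbn, 0) + sat["units"]
    return totals


# Two made-up sources report the same title with slightly different formatting.
platform = Platform()
land(platform, "retailer_a", {"isbn": "9780000000001", "units": "3"})
land(platform, "retailer_b", {"isbn": " 9780000000001 ", "units": 2})
integrate(platform)
print(present(platform))  # {'9780000000001': 5}
```

Because the archive is never modified, the middle and presentation layers can be rebuilt from it at any time, which is one reason a layered design of this kind suits a single source of truth.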

“We never imagined this architecture could get us here this quickly. We had our centralized data platform built within three months, and when I look back at some of the things we were doing before and how long it took us to do them, it’s just incredible how quick we have grown in data maturity.”

Williams was also keen to stress the value of ‘data literacy’ as central to unlocking the full opportunity which the platform provides, always asking ‘how can people better consume the data?’:

“It’s about helping people ask better questions. As they ask one question and we solve that, they’ll then ask another ten – and before we know it, we’re really, really motoring”.

The truth makes for better business

The new enterprise data platform at Penguin Random House, powered by Snowflake, seamlessly scales to meet business needs. Whereas previously only sales and finance teams had access to data through BI tools, PRH can now automate the capture and processing of more data than ever before, freeing the data team to focus on analysis and on operationalizing the insights gained to benefit the business.

“We think we’ve already saved about 300 hours per year in the creation of a simple sales report based on Amazon data and we think more than 800 users are going to benefit from that. We’ve unlocked new capabilities and calculate preliminary savings of around 5000 hours per year – and we’re still early in the journey.”