Data warehousing: why the need for flexibility is an inflexible truth
As the global business landscape becomes increasingly digitalised, and new technologies like 5G drive the rapid expansion of the Internet of Things (IoT), the amount of data created each day is growing exponentially. Business intelligence and research firm Raconteur found this year that, on an average day, 500mn tweets, 65bn WhatsApp messages and 294bn emails are sent, while four petabytes of data are created on Facebook and 5bn searches are made online. By 2025, it’s estimated that 463 exabytes of data will be created each day globally – the equivalent of 212,765,957 DVDs per day.
To keep pace and stay afloat, modern businesses need to gather, store, analyse and draw insights from a mind-bending amount of raw data. Determining what information is valuable, how to extract it and where to keep it are the challenges that every business in the current landscape must overcome. This landscape, however, is changing so fast that today’s solutions can be outdated within as little as six months. To keep up, enterprises are increasingly moving towards third-party data management and storage solutions hosted in the cloud, for the flexibility and access to leading-edge technology that they provide.
Gigabit magazine spoke with experts in the data warehousing space to gauge the state of the evolving data warehousing industry, and why flexibility is at the heart of leading modern solutions. But first…
What is a data warehouse?
The differences between a database and a data warehouse aren’t immediately obvious. Both contain data. Both databases and data warehouses are what’s called ‘relational’ data systems; they each store data in a structured format, using rows and columns. Where they differ is the purposes they serve. Also affecting the market are data lakes, which are newer, and solve different problems in a slightly different way.
A database stores current transactions and enables quick, easy access to specific transactions for ongoing business processes, known as Online Transaction Processing (OLTP).
Data warehouses, on the other hand, present a consolidated view of either a physical or logical data repository collected from various systems, according to Panoply. They are best at correlating data from existing systems (for example, product inventory stored in one system and purchase orders for a specific customer stored in another), and are mostly used for online analytical processing (OLAP), which uses complex queries to analyse transactions rather than process them.
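The contrast between the two workloads can be made concrete with a small sketch. It uses Python's built-in sqlite3 module purely for illustration; the table names, columns and sample rows are hypothetical, and both tables sit in a single in-memory database here, whereas a real warehouse would consolidate them from separate source systems.

```python
import sqlite3

# Hypothetical sample data; one in-memory database stands in for two source systems.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (product_id INTEGER PRIMARY KEY, name TEXT, stock INTEGER);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT,
                         product_id INTEGER, quantity INTEGER);
    INSERT INTO inventory VALUES (1, 'router', 40), (2, 'switch', 15);
    INSERT INTO orders VALUES
        (100, 'acme', 1, 5),
        (101, 'acme', 2, 10),
        (102, 'globex', 1, 20);
""")

def fetch_order(order_id):
    """OLTP-style access: look up one specific transaction by its key."""
    return conn.execute(
        "SELECT customer, product_id, quantity FROM orders WHERE order_id = ?",
        (order_id,),
    ).fetchone()

def demand_vs_stock():
    """OLAP-style access: correlate all orders with inventory and aggregate."""
    return conn.execute(
        """
        SELECT i.name, SUM(o.quantity) AS ordered, i.stock
        FROM orders AS o
        JOIN inventory AS i ON i.product_id = o.product_id
        GROUP BY i.product_id, i.name, i.stock
        ORDER BY i.name
        """
    ).fetchall()

print(fetch_order(101))     # a single row, found by primary key
print(demand_vs_stock())    # one summary row per product
```

The first function touches a single row, which is the access pattern a transactional database is tuned for. The second scans and joins every row to answer a question about the business as a whole, which is the kind of query a warehouse is designed to serve.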
Lastly, a data lake is a newer, highly scalable storage system that holds structured and unstructured data in its original form and format, rather than organising it into rows and columns like a database or warehouse. A data lake does not require planning or prior knowledge of the data analysis needed – it assumes that analysis will happen later, on demand.
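This store-now, analyse-later approach is often described as applying structure at read time. The following is a minimal sketch of the idea, assuming raw records land as JSON strings; the record contents and the read_with_schema helper are hypothetical and stand in for whatever query tooling a real data lake provides.

```python
import json

# Raw records kept exactly as they arrived: shapes differ, nothing is validated on write.
raw_lake = [
    '{"source": "sensor", "device": "pump-1", "temp_c": 71.5}',
    '{"source": "crm", "customer": "acme", "note": "renewal due"}',
    '{"source": "sensor", "device": "pump-2", "temp_c": 64.0, "site": "north"}',
]

def read_with_schema(lines, source, fields):
    """Apply a structure only when reading: filter by source, project the chosen fields."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if record.get("source") == source:
            # Fields missing from a record come back as None instead of failing.
            rows.append(tuple(record.get(field) for field in fields))
    return rows

temperatures = read_with_schema(raw_lake, "sensor", ["device", "temp_c"])
print(temperatures)
```

Nothing about the eventual analysis had to be decided when the records were written; a different question later would simply pass a different source and field list over the same raw data.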
Jean-Michel Franco, Senior Director of Data Governance, Talend
“According to a TDWI and Talend survey, the top reasons companies migrate to a cloud data warehouse are: a flexible cost model, to take advantage of cloud features, faster performance and to migrate existing products to cloud. The on-premises data warehouse business is shrinking inexorably. Most new customer data warehouses under construction today are being built in the cloud (most commonly Snowflake, AWS Redshift, Azure SQL Data Warehouse, or Google BigQuery).
“Putting your data repository in the cloud is simply better. It’s faster, more scalable, with zero install time, you can go live in minutes, and it’s always up-to-date. Nearly every single company looking for a new data warehouse or a new data lake will choose a cloud-based data repository.”
Rob Lamb, Chief Technology Officer, Dell Technologies, UK
“There are fundamental differences between data lakes and data warehousing, and some challenges arise from confusion over terminology and usage. Data lakes and data warehouses are both used for storing data, but they are not the same. A data lake is a large pool of raw data set for future extraction and analysis – it needs to be searchable, but that may be the extent of tooling provided.
“The oil and gas industry was an early adopter of data lakes to land data for use cases such as minimising unplanned downtime and improving safety. A data warehouse is a repository for structured data supported by a combination of processes and tools to prepare data for a specific purpose. For example, warehousing is essential for the healthcare industry as it utilises it to strategise and predict outcomes, generate patients’ treatments and share data with medical aid services.”
Lamb has worked for Dell for almost a decade, watching the global explosion of data and working to support the expansion of cloud infrastructure from cutting-edge, niche technology to the foundation of modern digital society.
“There has been a shift towards the use of cloud for data warehouse architecture in recent years as the services and capabilities have matured,” he continues. “There are three primary drivers for organisations looking at cloud for data warehousing:
The inability to handle the speed and volume of multi-source data, especially IoT data;
The inability to find a single technological solution to collect, store, and organise data from disparate sources;
The inability to handle Big Data projects with a single database.
“The challenge is managing these data sources and only integrating the valuable data into the data warehouse.”
Walter Heck, CTO, HeleCloud, Netherlands
“The more data businesses gather, the more information they have at their disposal. In a digital world, this is a great asset. But, with more data comes more responsibility. Businesses process and store thousands, millions, sometimes even billions of transactions each day, all of which need to be managed securely and effectively. The ability to store large quantities of data is being made increasingly possible by creating data warehouses,” says Heck, who took on his current role at the Amazon Web Services (AWS) Advanced Consulting Partner in August.
Heck has seen data warehouses grow dramatically in both size and complexity over the past year. He notes that the trend is spurring a large number of enterprises to closely investigate the possibilities of new generations of cloud and data management infrastructure, particularly those backed by machine learning and AI, which allow companies to manipulate and understand their data more accurately.
The change, Heck believes, could not have come at a better time.
“Despite widespread talk of digital transformation, many companies across the globe still do not optimally use the data available to them. This is because data tends to sit undiscovered in silos across these businesses,” he explains. “That said, businesses are starting to wake up to this reality. As such, we are likely to see organisations start organising their approach to managing data. This is a good thing. With the introduction of 5G and the evolution of edge computing, data volumes are likely to explode to unprecedented levels in the next few years. This means that data warehousing needs to be flexible enough to scale based on volume as well as integrate the many different data types for analysis.”
A flexible future in the cloud
The mass migration of the modern enterprise to the cloud may even see CTOs and digital executives move their organisations beyond the concept of the data centre altogether. Rather than warehouses for storing data, solutions that provide even more immediate access as a flexible service are increasingly in demand among industry leaders. Regardless, the days of on-premises legacy systems are ending, and companies need to look ahead if they expect to survive and thrive in a world where the accumulated digital universe is predicted to expand from 4.4 zettabytes in 2013 to more than 44 zettabytes in 2020. Data is the future, and in the future only the flexible will survive.