“Digital transformation” has been the buzzword for some time, and the phrase continues to signal the importance of using new technology as a catalyst for improvement in the enterprise. New ways of doing things have been introduced, and nowhere is this more apparent than in how data is now collected and used for business intelligence and analytics.
The advent of Big Data many years ago brought about huge excitement in these areas. The recognition that there is more data to be collected and used in the enterprise saw the emergence of technologies that facilitated the ingestion of all types of data, their storage in distributed file systems, the ability to scale out easily to accommodate more data, and the various means of getting at this data. But there was a problem.
While the ability to capture and store all types of data, including unstructured data, seemed to be a panacea, it quickly became apparent that:
- Most business data is structured
- Everybody knows SQL
- The relational model is popular
- Dimensional modeling works
While it is true that the Big Data “data lake” has the potential to open up more insights because of the volume and variety of its data, real-world use cases have shown that actionable data almost always comes in the form of SQL-interfaced, relational data. This is why the Data Warehouse never really went away.
But the modern data warehouse is a vastly different animal from the traditional data warehouse of years gone by. For a data warehousing platform to be called modern and a true agent of digital transformation, it must have the following attributes:
- Support for any data locality (local disk, Hadoop, private and public cloud data).
- In-database advanced analytics.
- Ability to handle native data types such as spatial, time-series and/or text.
- Ability to run new analytical workloads including machine learning, geospatial, graph and text analytics.
- Deployment-agnostic operation, including on-premises, private cloud, and public cloud.
- Query optimization for big data.
- Complex query formation.
- Massively parallel processing based on the model, not just sharding.
- Workload management.
- Load balancing.
- Scaling to thousands of simultaneous queries.
- Full ANSI SQL and beyond.
- MPP data warehouse able to run seamlessly on-premises and in public or private clouds, with a much-expanded mission compared with previous designs.
- Primarily based on open source projects with strong communities behind them.
- Supporting both data science computation and preservation and publishing of data science models.
- In-database analytics and data science libraries. The alternative is to run machine learning algorithms against Hadoop or cloud repositories and then move the results to another platform for further analysis and presentation (visualization, dimensional models for scenario planning, etc.).
- Able to support cost-based query optimizations on polymorphic data, while delaying analysis of the data structure until runtime. [1]
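To make the "in-database analytics" attribute a little more concrete, here is a minimal sketch of the idea: the database computes the aggregates a simple least-squares line needs, and only five numbers leave it instead of every row. It uses Python's built-in SQLite purely as a stand-in; it is not an MPP warehouse, and the table name `readings`, its columns `x` and `y`, and both function names are hypothetical choices made for illustration. A real platform would more likely expose this kind of work through its own analytics libraries.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, made-up table (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (x REAL, y REAL)")
    # Synthetic points that lie exactly on the line y = 2x + 1.
    rows = [(float(x), 2.0 * x + 1.0) for x in range(10)]
    conn.executemany("INSERT INTO readings (x, y) VALUES (?, ?)", rows)
    return conn


def fit_line_in_database(conn):
    """Fit y = slope * x + intercept using aggregates computed by the database.

    Only five aggregate values cross the database boundary; the individual
    rows are never fetched by the client.
    """
    n, sx, sy, sxx, sxy = conn.execute(
        "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM readings"
    ).fetchone()
    # Closed-form ordinary least squares for a single predictor.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


if __name__ == "__main__":
    connection = build_demo_db()
    print(fit_line_in_database(connection))
```

The design point is where the computation runs, not the regression itself: the same pattern of pushing the work to the data is what spares a team from exporting results to a second platform for further analysis.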
As you can see, a Hadoop Big Data implementation and the modern Data Warehouse, combined, can become the all-encompassing data platform and single source of truth of an enterprise.
With that said, the best open-source-based, modern data warehousing platform in the digital landscape today is Pivotal Greenplum.
In a succeeding blog post, we will discuss the many features that make Pivotal Greenplum the best data platform for data-driven digital transformation.
Notes:
[1] Neil Raden, "The Data Warehouse in the Age of Digital Transformation"