Greenplum and the Power of One
“Did I disappoint you? Or leave a bad taste in your mouth? You act like you never had love, and you want me to go without.”
So goes a passage from the chart-topping song “One”, off U2’s hugely successful album Achtung Baby.
It could also pass for the sentiment of the many organizations that jumped into a Hadoop Big Data romance, only to be met with heartbreaking disappointment. But why was Hadoop such a heartbreaker?
1. Hadoop’s performance with small datasets leaves much to be desired. Its file-level processing is well suited to batch handling of large datasets but ill suited to smaller, interactive queries.
2. Hadoop is weak on real-time analytics. As mentioned in point 1, Hadoop is a batch-oriented system, and response times are often unacceptable for real-time workloads. It’s true that layering Spark and Kafka on top can offer some remedy, but the complexity of configuring and maintaining them can be quite harrowing.
3. Hadoop is limited in its deployment options. Most Hadoop implementations are on-premise and have not fully embraced the cloud renaissance.
4. Administration-wise, Hadoop’s infrastructure can be a pain in the bottom. Core tooling for replication, adding nodes, directory and partition creation, performance tuning, workload management, data distribution, and the like is minimal and often requires add-ons. Not to mention the headache that is disaster recovery.
All for One, One for All
It could also be argued that the need to piece together so many disparate software components just to stand up a data analytics platform was a key factor in the decline of Hadoop-based implementations.
Imagine needing to learn a big chunk of these technologies just to get the ball rolling! What if the sprawling patchwork of components that a Hadoop stack once demanded could be consolidated into a single, unified analytics platform? That, in a word, is the “One” that Greenplum offers.
Does Your Organization Have the Pivotal X-Factor?
“No, you can’t always get what you want. You can’t always get what you want. You can’t always get what you want.”
Thus echoed the famous Rolling Stones tune of 1969. The principle may be true in many cases, but when it comes to the Data Integration needed for turning up actionable business insights, getting at all the Data Sources that you want can actually be an “always” kinda thing. Let me explain.
What is Data Integration?
To put it simply, you want access to all the data that are relevant to your organization—data that will enable you to do the analytics that drive business-building decisions. Having access to the data, or Data Integration, can mean two things:
You haul the data off from various data sources into a central data hub, or…
You directly interface with the data in place (where they reside) from a data access platform.
The first involves your classic ETL (Extract-Transform-Load) paradigm; it may also include ELT (Extract-Load-Transform), mostly used for unstructured data, and real-time, event-driven streaming. The second is what has become a buzzword nowadays: Data Federation, also known as Data Virtualization.
What if I told you that you don’t have to buy separate Data Integration tools to accomplish these two ways of getting at data? What if I told you that all you need is the X-Factor…the Pivotal X-Factor!
What is Greenplum PXF?
A primary feature of the Greenplum Modern Data Analytics Platform, one that makes it a cut above the rest, is the Greenplum Platform Extension Framework (PXF). Greenplum PXF provides a JDBC connector that can reach any relational data source with a JDBC driver. It also has HDFS, Hive, and HBase connectors for all the major commercial Hadoop distributions (Cloudera, Hortonworks Data Platform, MapR) as well as generic Apache Hadoop. Is your data up on cloud object stores? PXF also provides connectors for Text, Avro, JSON, Parquet, AvroSequenceFile, and SequenceFile data on Azure, Google Cloud Storage, MinIO, and S3.
By defining External Tables that point to these data sources, you are able to retrieve and modify external data from the comforts of your own Greenplum home, as it were, by merely issuing SQL statements that behave as if the tables were actually local. And given Greenplum’s MPP architecture, access to these outside-the-fence data is lightning fast! How does inserting 71 million+ records from SQL Server into a 3-segment Greenplum cluster in 13 to 18 minutes sound?!
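To make the cloud angle concrete, here is a sketch of an external table over Parquet data in S3. The s3:parquet profile and LOCATION syntax follow the PXF documentation; the bucket, path, server name, and columns here are purely illustrative:

CREATE EXTERNAL TABLE pxf_s3_trips (
    trip_id int,
    fare    numeric
)
LOCATION ('pxf://my-bucket/trips?PROFILE=s3:parquet&SERVER=mys3')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

Querying pxf_s3_trips then reads the Parquet files directly from object storage, in parallel across the Greenplum segments.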
A Simple Example
Let’s say we have data that we want to do analytics on residing in two different kinds of databases, one SQL Server and the other Postgres. And suppose some important information also resides in CSV files provided by some department.
The traditional way of acquiring the data from these three different sources is to use a separate ETL tool to define jobs that physically transfer the data into destination tables. While this works, it commits you to the data you have already ingested; any mixing and matching of relationships can only happen after the fact, and at the cost of physically moving data.
What if you wanted the flexibility of first querying the data in these distinct data sources individually, and then together, establishing relationships between your SQL Server, Postgres, and CSV data without first having to load them physically into your data repository?
With Greenplum PXF (the Pivotal X-Factor!), you can do just that!
So these are your three data sources: a TestTable in SQL Server, a dimairline dimension table in Postgres, and address data in CSV files.
A bit of configuration happens under the hood first: each JDBC data source is described to PXF in a named server directory containing a jdbc-site.xml file (typically $PXF_BASE/servers/sqlserver/jdbc-site.xml, depending on your PXF version). A minimal sketch of what such a file might contain, with the host, database name, and credentials here being purely illustrative:
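<configuration>
    <property>
        <name>jdbc.driver</name>
        <value>com.microsoft.sqlserver.jdbc.SQLServerDriver</value>
    </property>
    <property>
        <name>jdbc.url</name>
        <value>jdbc:sqlserver://mssqlhost:1433;databaseName=DICT_Test</value>
    </property>
    <property>
        <name>jdbc.user</name>
        <value>pxf_user</value>
    </property>
    <property>
        <name>jdbc.password</name>
        <value>changeme</value>
    </property>
</configuration>

With the server settings in place (and pushed out to all hosts with pxf cluster sync), you define External Tables over the particular tables in SQL Server and Postgres that you want to access: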
SQL Server External Table (PXF)
CREATE EXTERNAL TABLE pxf_sqlserver_TestTable (
    ID     int,
    Name   varchar(50),
    Source varchar(20)
)
LOCATION ('pxf://DICT_Test.dbo.TestTable?PROFILE=Jdbc&SERVER=sqlserver')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
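The Postgres side follows the same pattern. A sketch, assuming a PXF server named postgres and a dimairline table in the public schema; only the airlineid column is implied by the join further down, so the second column is illustrative:

Postgres External Table (PXF)

CREATE EXTERNAL TABLE pxf_postgres_dimairline (
    airlineid   int,
    airlinename varchar(100)  -- illustrative; columns must match the source table
)
LOCATION ('pxf://public.dimairline?PROFILE=Jdbc&SERVER=postgres')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');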
For accessing the CSV file, we also define an External Table, but with the use of another awesome Greenplum innovation called gpfdist (which will be the subject of another blog):
CSV External Table (gpfdist)
CREATE EXTERNAL TABLE ext_csv_address (
    id      int,
    address varchar(100),
    status  varchar(10)
)
LOCATION ('gpfdist://gpmdw:8080/*.csv')
FORMAT 'CSV' (DELIMITER ',');
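This assumes a gpfdist process is already running on the host named in the LOCATION clause (gpmdw here), serving the directory that holds the CSV files. A typical invocation might look like this, with the data directory being illustrative:

gpfdist -d /data/csv_files -p 8080 -l /tmp/gpfdist.log &

The Greenplum segments then pull file contents from gpfdist in parallel, rather than funneling everything through the master.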
And with just one query, you can retrieve the data from all three data sources, relating them to each other as you please:
SELECT *
FROM pxf_sqlserver_TestTable A           -- SQL Server
RIGHT JOIN pxf_postgres_dimairline B     -- Postgres
       ON A.id = B.airlineid
LEFT JOIN ext_csv_address C              -- CSV file
       ON B.airlineid = C.id
ORDER BY A.id
LIMIT 10;
This is the Data Federation or Data Virtualization use case. You can also use the External Tables as a simple way to implement ETL, by inserting from them into your pre-defined Greenplum tables, like so:
INSERT INTO gp_sqlserver_TestTable
SELECT * FROM pxf_sqlserver_TestTable;
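This presumes the target table already exists in Greenplum. A sketch of a matching definition, where the column list mirrors the external table and the DISTRIBUTED BY key (which tells Greenplum how to spread rows across segments) is an illustrative choice:

CREATE TABLE gp_sqlserver_TestTable (
    id     int,
    name   varchar(50),
    source varchar(20)
)
DISTRIBUTED BY (id);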
Take note that this insertion into Greenplum is done in parallel across all segment hosts, giving it lightning-fast load performance (71M+ records in 13 to 18 minutes!). If transformations on the data are required, creating user-defined functions in Greenplum should do the trick, as sketched below.
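A minimal sketch of that ELT pattern, using a hypothetical clean-up function (the function name and logic are illustrative):

-- Hypothetical transformation: trim whitespace, normalize capitalization
CREATE OR REPLACE FUNCTION clean_name(raw varchar) RETURNS varchar AS $$
    SELECT initcap(trim(raw));
$$ LANGUAGE sql IMMUTABLE;

-- Transform on the way in: still a single parallel INSERT ... SELECT
INSERT INTO gp_sqlserver_TestTable
SELECT id, clean_name(name), source
FROM pxf_sqlserver_TestTable;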
So there you have it. With Greenplum PXF, you can always get at the data that you want, enabling your organization to turn data into actionable insights, fast. You do want your business to stick around as long as the Rolling Stones, don’t you? 😉