Greenplum and the Power of One

“Did I disappoint you?
Or leave a bad taste in your mouth?
You act like you never had love
And you want me to go without”

So goes a passage in the chart-topping song, “One”, off of U2’s very successful Achtung Baby album.

It could also pass for the sentiment of many organizations who jumped in on a Hadoop Big Data romance, only to be met with heart-breaking disappointment.

But why was it such a heart-breaker?

Some Reasons

1. Hadoop’s performance with small datasets leaves much to be desired. Its file-level processing is good for batch handling of large datasets but ill-suited to smaller, interactive queries.

2. Hadoop is weak on real-time analytics. As mentioned in point 1, Hadoop is a batch-oriented implementation, and response times are often unacceptable. It’s true that Spark and Kafka on top can offer some remedy to this predicament, but the complexity of configuration and maintenance can be quite harrowing.

3. Hadoop is limited in its deployment options. Most Hadoop implementations are on-premise and have not fully embraced the cloud renaissance.

4. Administration-wise, Hadoop’s infrastructure can be a pain in the bottom. Core tools for replication, adding nodes, directory and partition creation, performance tuning, workload management, data distribution, and the like are minimal and often require add-ons. Not to mention the headache that is disaster recovery.

All for One, One for All

It could also be argued that the need to piece together so many disparate software components to come up with a data analytics platform was a key determinant in the decline of Hadoop-based implementations.

Imagine needing to learn a big chunk of these technologies just to get the ball rolling!

What if what once looked like this:

Can be turned into this:

With the Greenplum Data Platform, the need to learn and implement countless software products is done away with, since everything a modern data analytics platform needs is already packaged within the platform (See Why is Greenplum the Best Choice for a Cloud Data Warehouse? and Why the Data Warehouse Is Here to Stay).

With just ONE data analytics platform, your organization can begin to harness the benefits promised by Hadoop—and more—minus the heartache.

Greenplum is the ONE.

Does Your Organization Have the Pivotal X-Factor?

“No, you can’t always get what you want. You can’t always get what you want. You can’t always get what you want.”

Thus echoed the famous Rolling Stones tune of 1969. The principle may be true in many cases, but when it comes to the Data Integration needed for turning up actionable business insights, getting at all the Data Sources that you want can actually be an “always” kinda thing. Let me explain.

What is Data Integration?

To put it simply, you want access to all the data that are relevant to your organization—data that will enable you to do the analytics that drive business-building decisions. Having access to the data, or Data Integration, can mean two things:

    • You haul the data off from various data sources into a central data hub, or…
    • You directly interface with the data in-place (where they reside) from a data access platform

The first involves your classic ETL (Extract-Transform-Load) paradigm; it may also include ELT (Extract-Load-Transform), mostly used for unstructured data, as well as real-time, event-driven streaming. The second goes by the buzzwords of the day: Data Federation or Data Virtualization.

What if I told you that you don’t have to buy separate Data Integration tools to accomplish these two ways of getting at data? What if I told you that all you need is the X-Factor…the Pivotal X-Factor!

What is Greenplum PXF?

A primary feature of the Greenplum Modern Data Analytics Platform, one that makes it a cut above the rest, is the Greenplum Platform Extension Framework (PXF). Greenplum PXF provides data connectors to all relational data sources supported by JDBC. It also has HDFS, Hive, and HBase connectors to all major commercial Hadoop distributions like Cloudera, Hortonworks Data Platform, and MapR, along with generic Apache Hadoop distributions. Is your data up on Cloud object stores? PXF provides connectors to Text, Avro, JSON, Parquet, AvroSequenceFile, and SequenceFile data on Azure, Google Cloud, MinIO, and S3 object stores as well.

By defining External Tables that point to these data sources, you are able to retrieve and modify external data from the comforts of your own Greenplum home, as it were, by merely issuing SQL statements that behave as if the tables were actually local. And given Greenplum’s MPP architecture, access to these outside-the-fence data is lightning fast! How does inserting 71 million+ records from SQL Server into a 3-segment Greenplum cluster in 13 to 18 minutes sound?!

A Simple Example

Let’s say we have data that we want to do analytics on residing in two different kinds of databases, one SQL Server and the other Postgres. And suppose some important information also resides in CSV files provided by some department.

The traditional way of acquiring the data from these three different data sources is to use a separate ETL tool to define jobs that would physically transfer the data to destination tables. While this is OK, it leaves you committed to the data that you have already ingested, and any mixing and matching of relationships can only be done after the fact, at the cost of physically transferring data.

What if you wanted the flexibility of first querying the data in these distinct data sources individually, and then together, establishing relationships between your SQL Server, Postgres, and CSV data without needing to load them physically first into your data repository? 

With Greenplum PXF (the Pivotal X-Factor!), you can do just that!

So these are your three different data sources:

After configuring PXF JDBC settings under the hood, you define External Tables to the particular tables in SQL Server and Postgres that you want to access:
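Those under-the-hood JDBC settings live in a per-server PXF configuration directory. As a minimal sketch for the `postgres` server referenced in the LOCATION clauses (the host, database name, and credentials here are placeholders, not from the original):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- $PXF_CONF/servers/postgres/jdbc-site.xml -->
<configuration>
    <property>
        <name>jdbc.driver</name>
        <value>org.postgresql.Driver</value>
    </property>
    <property>
        <name>jdbc.url</name>
        <value>jdbc:postgresql://pg-host:5432/flightdb</value>
    </property>
    <property>
        <name>jdbc.user</name>
        <value>pxfuser</value>
    </property>
    <property>
        <name>jdbc.password</name>
        <value>changeme</value>
    </property>
</configuration>
```

A sibling directory (e.g., `$PXF_CONF/servers/sqlserver/`) would hold an analogous file for the SQL Server source, matching the `SERVER=sqlserver` parameter.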

SQL Server External Table (PXF)

CREATE EXTERNAL TABLE pxf_sqlserver_TestTable
(
      ID int,
      Name varchar(50),
      Source varchar(20)
)
LOCATION ('pxf://DICT_Test.dbo.TestTable?PROFILE=Jdbc&SERVER=sqlserver')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

Postgres External Table (PXF)

CREATE EXTERNAL TABLE pxf_postgres_dimairline 
(
      airlineid smallint,
      airlinename varchar(95)
)
LOCATION ('pxf://public.dimairline?PROFILE=Jdbc&SERVER=postgres')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

For accessing the CSV file, we also define an External Table but with the use of another awesome Greenplum innovation called gpfdist (which will be the subject of another blog):

CSV External Table (gpfdist)

CREATE EXTERNAL TABLE ext_csv_address
(
      id int,
      address varchar(100),
      status varchar(10)
)
LOCATION ('gpfdist://gpmdw:8080/*.csv')
FORMAT 'CSV' ( DELIMITER ',' );
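For that external table to resolve, a gpfdist process must be serving the CSV directory on host gpmdw at the port named in the LOCATION clause. A typical invocation might look like this (the data directory and log path are assumptions):

```
# Serve all files in /data/csv on port 8080, logging to /tmp/gpfdist.log
gpfdist -d /data/csv -p 8080 -l /tmp/gpfdist.log &
```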

And with just one query, you can retrieve the data from all three data sources, relating them to each other as you please:

SELECT * FROM
pxf_sqlserver_TestTable A              -- SQL Server
RIGHT JOIN pxf_postgres_dimairline B   -- Postgres
      ON A.id = B.airlineid
LEFT JOIN ext_csv_address C            -- CSV file
      ON B.airlineid = C.id
ORDER BY A.id LIMIT 10;


This is the Data Federation or Data Virtualization use case. You can also use the External Tables as a simple way to implement ETL by inserting to your pre-defined Greenplum tables from them, like so:

INSERT INTO gp_sqlserver_TestTable
SELECT * FROM pxf_sqlserver_TestTable;

Take note that this insertion into Greenplum is done in parallel across all segment hosts, giving it lightning-fast load performance (71+ million records in 13 to 18 minutes!).

If transformations on the data are required, then creating user-defined functions in Greenplum should do the trick.
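As a sketch of that approach (the function name and the cleanup rule here are illustrative, not part of the original example):

```sql
-- Hypothetical cleanup function: trim whitespace and upper-case a name field
CREATE OR REPLACE FUNCTION clean_name(raw text)
RETURNS text AS $$
    SELECT upper(trim(raw));
$$ LANGUAGE sql IMMUTABLE;

-- Apply the transformation while loading from the external table
INSERT INTO gp_sqlserver_TestTable
SELECT id, clean_name(name), source
FROM pxf_sqlserver_TestTable;
```

Because the function is applied inside the INSERT ... SELECT, the transformation still runs in parallel on the segments, so you keep the load performance described above.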

Parting Words

So there you have it. With Greenplum PXF, you can always get at the data that you want, enabling your organization to transform data into actionable insights, fast. You do want your business to stick around as long as the Rolling Stones, don’t you? 😉