
Greenplum and the Power of One

Exist Software Labs

“Did I disappoint you?
Or leave a bad taste in your mouth?
You act like you never had love
And you want me to go without”

So goes a passage from the chart-topping song “One”, off U2’s hugely successful Achtung Baby album.

It could also pass for the sentiment of many organizations who jumped in on a Hadoop Big Data romance, only to be met with heart-breaking disappointment.

But why was it such a heart-breaker?

Some Reasons

1. Hadoop’s performance with small datasets leaves much to be desired. Its file-level processing is good for batch handling of large datasets but constrained for smaller, interactive querying.

2. Hadoop is weak on real-time analytics. As mentioned in point 1, Hadoop is more of a batch kind of implementation, and response times are often unacceptable. It’s true that Spark and Kafka on top can offer some remedy to this predicament but the complexity of configuration and maintenance can be quite harrowing.

3. Hadoop is limited in its deployment options. Most Hadoop implementations are on-premise and have not fully embraced the cloud renaissance.

4. Administration-wise, Hadoop’s infrastructure can be a pain in the bottom. Core tools for replication, adding nodes, directory and partition creation, performance tuning, workload management, data distribution, etc., are minimal and often require add-ons. Not to mention the headache that is disaster recovery.

All for One, One for All

It could also be argued that the need to piece together so many disparate software components in order to come up with a data analytics platform was a key determinant in the decline of Hadoop-based implementations.

Imagine needing to learn a big chunk of these technologies just to get the ball rolling!

What if that once-sprawling collection of components could be collapsed into a single platform?

With the Greenplum Data Platform, the need to learn and implement countless pieces of software is done away with, since everything a modern data analytics platform needs is already packaged within the platform (see Why is Greenplum the Best Choice for a Cloud Data Warehouse? and Why the Data Warehouse Is Here to Stay).

With just ONE data analytics platform, your organization can begin to harness the benefits promised by Hadoop—and more—minus the heartache.

Greenplum is the ONE.

Does Your Organization Have the Pivotal X-Factor?

Exist Software Labs

“No, you can’t always get what you want. You can’t always get what you want. You can’t always get what you want.”

Thus echoed the famous Rolling Stones tune of 1969. The principle may be true in many cases, but when it comes to the Data Integration needed for turning up actionable business insights, getting at all the Data Sources that you want can actually be an “always” kinda thing. Let me explain.

What is Data Integration?

To put it simply, you want access to all the data that are relevant to your organization—data that will enable you to do the analytics that drive business-building decisions. Having access to the data, or Data Integration, can mean two things:

    • You haul the data off from various data sources into a central data hub, or…
    • You directly interface with the data in-place (where they reside) from a data access platform

The first involves your classic ETL (Extract-Transform-Load) paradigm, and may also include ELT (Extract-Load-Transform), mostly used when dealing with unstructured data, and real-time, event-driven streaming. The second is what has become a buzzword nowadays: Data Federation, or Data Virtualization.

What if I told you that you don’t have to buy separate Data Integration tools to accomplish these two ways of getting at data? What if I told you that all you need is the X-Factor…the Pivotal X-Factor!

What is Greenplum PXF?

A primary feature of the Greenplum Modern Data Analytics Platform, one that makes it a cut above the rest, is the Greenplum Platform Extension Framework (PXF). Greenplum PXF provides data connectors to all relational data sources supported by JDBC. It also has HDFS, Hive, and HBase connectors to all major commercial Hadoop distributions like Cloudera, Hortonworks Data Platform, and MapR, along with generic Apache Hadoop distributions. Is your data up on Cloud object stores? PXF provides connectors to Text, Avro, JSON, Parquet, AvroSequenceFile, and SequenceFile data on Azure, Google Cloud, Minio, and S3 object stores as well.
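To make the object-store side concrete, here is a hedged sketch of an external table over CSV data in S3 using the PXF `s3:text` profile (the bucket, path, column list, and the server name `s3` are all hypothetical placeholders for your own configuration):

```sql
-- Hypothetical example: an external table over CSV files in an S3 bucket.
-- Assumes a PXF server named 's3' has been configured with valid credentials.
CREATE EXTERNAL TABLE pxf_s3_sales
(
      id int,
      amount numeric(10,2),
      sale_date date
)
LOCATION ('pxf://my-bucket/sales/?PROFILE=s3:text&SERVER=s3')
FORMAT 'CSV' (DELIMITER ',');
```

Once defined, the table is queried with ordinary SQL, just like the JDBC-backed examples below.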

By defining External Tables that point to these data sources, you are able to retrieve and modify external data from the comforts of your own Greenplum home, as it were, by merely issuing SQL statements that behave as if the tables were actually local. And given Greenplum’s MPP architecture, access to this outside-the-fence data is lightning fast! How does inserting 71 million+ records from SQL Server into a 3-segment Greenplum cluster in 13 to 18 minutes sound?!

A Simple Example

Let’s say we have data that we want to do analytics on residing in two different kinds of databases, one SQL Server and the other Postgres. And suppose some important information also resides in CSV files provided by some department.

The traditional way of acquiring the data from these three different data sources is to use a separate ETL tool to define jobs that would physically transfer the data to destination tables. While this is OK, it leaves you committed to the data that you have already ingested and any mixing and matching of relationships can only be done after the fact and at the cost of physically transferring data.

What if you wanted the flexibility of first querying the data in these distinct data sources individually, and then together, establishing relationships between your SQL Server, Postgres, and CSV data without needing to load them physically first into your data repository? 

 With Greenplum PXF (the Pivotal X-Factor!), you can do just that!

So these are your three different data sources:

After configuring PXF JDBC settings under the hood, you define External Tables to the particular tables in SQL Server and Postgres that you want to access:
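Those “under the hood” JDBC settings live in a per-server configuration directory (`$PXF_CONF/servers/<server-name>/jdbc-site.xml`, or `$PXF_BASE` in newer PXF versions). A minimal sketch for the `sqlserver` server used below; the host, port, database name, and credentials are placeholders for your environment:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>jdbc.driver</name>
        <value>com.microsoft.sqlserver.jdbc.SQLServerDriver</value>
    </property>
    <property>
        <name>jdbc.url</name>
        <value>jdbc:sqlserver://mssql-host:1433;databaseName=DICT_Test</value>
    </property>
    <property>
        <name>jdbc.user</name>
        <value>pxf_user</value>
    </property>
    <property>
        <name>jdbc.password</name>
        <value>changeme</value>
    </property>
</configuration>
```

The matching JDBC driver JAR must also be placed where PXF can load it, and the PXF service synced across the cluster.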

SQL Server External Table (PXF)

CREATE EXTERNAL TABLE pxf_sqlserver_TestTable
(
      ID int,
      Name varchar(50),
      Source varchar(20)
)
LOCATION ('pxf://DICT_Test.dbo.TestTable?PROFILE=Jdbc&SERVER=sqlserver')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

Postgres External Table (PXF)

CREATE EXTERNAL TABLE pxf_postgres_dimairline 
(
      airlineid smallint,
      airlinename varchar(95)
)
LOCATION ('pxf://public.dimairline?PROFILE=Jdbc&SERVER=postgres')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

For accessing the CSV file, we also define an External Table but with the use of another awesome Greenplum innovation called gpfdist (which will be the subject of another blog):

CSV External Table (gpfdist)

CREATE EXTERNAL TABLE ext_csv_address
(
      id int,
      address varchar(100),
      status varchar(10)
)
LOCATION ('gpfdist://gpmdw:8080/*.csv')
FORMAT 'CSV' ( DELIMITER ',' );
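For completeness, the gpfdist daemon referenced in the LOCATION clause would be started on the host serving the files, along these lines (the directory and log file path are illustrative; the port matches the `gpfdist://gpmdw:8080` URL above):

```shell
# Serve CSV files from /data/csv on port 8080 for the Greenplum segments to pull.
# -d: directory to serve, -p: listen port, -l: log file
gpfdist -d /data/csv -p 8080 -l /tmp/gpfdist.log &
```

Because every segment pulls data from gpfdist directly, loads scale with the cluster rather than funneling through the master.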

And with just one query, you can retrieve the data from all three data sources, relating them to each other as you please:

SELECT * FROM
pxf_sqlserver_TestTable A              -- SQL Server
RIGHT JOIN pxf_postgres_dimairline B   -- Postgres
      ON A.id = B.airlineid
LEFT JOIN ext_csv_address C            -- CSV file
      ON B.airlineid = C.id
ORDER BY A.id LIMIT 10;


This is the Data Federation or Data Virtualization use case. You can also use the External Tables as a simple way to implement ETL by inserting to your pre-defined Greenplum tables from them, like so:

INSERT INTO gp_sqlserver_TestTable
SELECT * FROM pxf_sqlserver_TestTable;

Take note that this insertion into Greenplum is done in parallel across all segment hosts, thereby giving it lightning-fast load performance (71+ million records in 13 to 18 minutes!).

If transformations on the data are required, then creating user-defined functions in Greenplum should do the trick.
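For instance, a simple transformation can be wrapped in a SQL function and applied during the load. A hedged sketch, reusing the external table defined earlier; the function name and the target table’s column list are hypothetical:

```sql
-- Hypothetical transformation: trim and uppercase names during the load.
CREATE OR REPLACE FUNCTION clean_name(raw text) RETURNS text AS $$
      SELECT upper(trim(raw));
$$ LANGUAGE sql IMMUTABLE;

INSERT INTO gp_sqlserver_TestTable (id, name, source)
SELECT id, clean_name(name), source
FROM pxf_sqlserver_TestTable;
```

Because the function is applied inside the INSERT … SELECT, the transformation still runs in parallel on the segments.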

Parting Words

So there you have it. With Greenplum PXF, you can always get at the data that you want, enabling your organization to transform data into actionable insights quickly. You do want your business to stick around as long as the Rolling Stones, don’t you? 😉

Why is Greenplum the Best Choice for a Cloud Data Warehouse?

Exist Software Labs


The Best MADP

Data is the drivetrain of digital transformation and the enterprise with the ability to tap into all possible data sources in order to gain actionable insights is at a key advantage.

In order to gain this advantage, a Modern Analytics Data Platform (MADP) is required. What are the attributes of an MADP that make it the technological foundation of digital transformation?

Greenplum ranks high in every one of these attributes, ensuring the enterprise continuous access to valuable insights.

In fact, Gartner has ranked Greenplum as the No. 1 open-source Data Warehouse platform for 2019, with only the very costly Teradata and Oracle above it.

This combination of being a premier MADP and no-comparison cost-effectiveness makes Greenplum the leading choice for most enterprises seeking data-driven digital transformation.


Moving the Data Warehouse to the Cloud

There are many benefits to moving your enterprise data warehouse to the cloud beyond the more common advantages of reduced cost and simpler management, administration, and tuning.

The following are some of the more salient benefits:

1. Vertical and horizontal scalability – With the influx of ever-increasing volumes and varieties of data comes the need to add processing and storage capacity to your existing data warehouse infrastructure in a quick and agile manner. This also includes the ability to scale out and add more nodes as the number of users increases.

2. Drastically-reduced start-up and operating costs – The risk of investing millions of dollars in on-premise machines or appliances only to have them become outdated in a number of years is eliminated with the cloud’s pay-only-for-what-you-use-when-you-use-it model.

3. Agile feature enhancement – Advances in data analytics call for products that are quick to adapt to these new features. The cloud infrastructure allows for seamless integration of new functionalities behind the scenes.

4. Top-notch support – Access to 24/7 support from a team of experts keeps downtime to a minimum, allowing for stellar SLA fulfillment.

5. Security – Since the top cloud providers are required to meet strict security standards set by health, financial, and government entities, you can be assured that your data is kept safe, making it easier to attain certifications like ISO27001, SOC2, GDPR, HIPAA, and PCI. Authorization, authentication, logging, and auditing are basic to all these platforms.


Greenplum in the Cloud 

Pivotal Greenplum is available on the three major cloud service providers: AWS, Azure, and GCP.

Greenplum on AWS

  • Same Pivotal Greenplum software as on-premises or cloud installation
  • Secure Deployment with Product Review from Amazon
  • GP Browser included (web-based SQL query tool)
  • Optional Installer makes installing additional components such as MADlib and Command Center easy!
  • Self Healing automates node recovery without administrative intervention
  • Snapshot Utility automates instant and non-blocking database backups
  • Optimized Deployment for Performance using Best Practices
  • Development to Production Deployments via AWS Cloud Formation
  • PgBouncer Connection Pooler included and preconfigured
  • Upgrade Utility notifies and automates cluster upgrades
  • Disaster Recovery via copied Snapshots simplifies and reduces cost for a DR solution

Greenplum on Azure

  • Same Pivotal Greenplum software as on-premises or cloud installation
  • Secure Deployment with Product Review from Microsoft
  • GP Browser included (web-based SQL query tool)
  • Optional Installer makes installing additional components such as MADlib and Command Center easy!
  • Self Healing automates node recovery without administrative intervention
  • Optimized Deployment for Performance using Best Practices
  • Development to Production Deployments via Azure Resource Manager Deployment
  • PgBouncer Connection Pooler included and preconfigured
  • Upgrade Utility notifies and automates cluster upgrades
  • Snapshot Utility automates instant and non-blocking database backups

Greenplum on GCP

  • Same Pivotal Greenplum software as on-premises or cloud installation
  • Secure Deployment with Product Review from Google
  • GP Browser included (web-based SQL query tool)
  • Optional Installer makes installing additional components such as MADlib and Command Center easy!
  • Self Healing automates node recovery without administrative intervention
  • Optimized Deployment for Performance using Best Practices
  • Development to Production Deployments via Google Deployment Manager
  • PgBouncer Connection Pooler included and preconfigured
  • Upgrade Utility notifies and automates cluster upgrades

For a more detailed presentation on Greenplum on AWS, watch this:

https://tanzu.vmware.com/content/webinars/apr-2-the-enterprise-data-science-warehouse-greenplum-on-aws

Why the Data Warehouse Is Here to Stay

Exist Software Labs

The buzzword has been “digital transformation” and the phrase continues to announce the importance of leveraging new technology as the catalyst of improvement in the enterprise. New ways of doing things have been introduced and this is no less apparent in how data is now collected and used for business intelligence and analytics.

The advent of Big Data many years ago brought about huge excitement in these areas. The recognition that there is more data to be collected and used in the enterprise saw the emergence of technologies that facilitated the ingestion of all types of data, their storage in distributed file systems, the ability to scale out easily to accommodate more data, and the various means of getting at this data. But there was a problem.

While the ability to capture and store all types of data, including unstructured data, seemed to be a panacea, it became immediately apparent that:

  • Most business data is structured   
  • Everybody knows SQL
  • The relational model is popular
  • Dimensional modeling works

While it is true that the Big Data “data lake” has the potential of opening up more insights due to the volume and variety of data, real-world use cases have shown that actionable data almost always came in the form of SQL-interfaced, relational data. And this is why the Data Warehouse never really went away.

But the modern data warehouse is a vastly different animal than the traditional data warehouse of years gone by. For a data warehousing platform to be called modern and a true agent of digital transformation, it must have the following attributes:

  • Support any data locality (local disk, Hadoop, private and public cloud data).
  • In-database advanced analytics.
  • Ability to handle native data types such as spatial, time-series and/or text.
  • Ability to run new analytical workloads including machine learning, geospatial, graph and text analytics.
  • Deployment agnostic including on-premises, private and public cloud.
  • Query optimization for big data.
  • Complex query formation.
  • Massively parallel processing based on the model, not just sharding.
  • Workload management.
  • Load balancing.
  • Scaling to thousands of simultaneous queries.
  • Full ANSI SQL and beyond.
  • MPP data warehouse able to run seamlessly on-premises, public or private clouds, with a much-expanded mission from previous designs.
  • Primarily based on open source projects with strong communities behind them.
  • Supporting both data science computation and preservation and publishing of data science models.
  • In-database analytics and data science libraries. The alternative is running machine learning algorithms against Hadoop or cloud repositories, but needing to move results to another platform for further analysis and presentation (visualization, dimensional models for scenario planning, etc.)
  • Able to support cost-based query optimizations on polymorphic data, while delaying analysis of the data structure until runtime.[1]

As you can see, a Hadoop Big Data implementation and the modern Data Warehouse, combined, can become the all-encompassing data platform and single source of truth of an enterprise.

With that said, the best open source-based, modern data warehousing platform in the digital landscape today is Pivotal Greenplum.

In a succeeding blog post, we will discuss the many features that make Pivotal Greenplum the best data platform for data-driven digital transformation.

 

Notes:
[1] Neil Raden, The Data Warehouse in the Age of Digital Transformation