Greenplum Summit Week 3: How to Get Started with a Modern Data Warehouse

The idea of a data warehouse isn’t new. Many enterprises have used them for years. What is new: the data landscape in 2020. The amount of data is exploding, and the use cases requiring real-time data analysis are growing just as fast.

Talks from Week 3 cover two tracks: how to make the move to a modern data warehouse in the first place; and how to keep your modern system optimized over time. Here are some of our favorite moments.

(You can watch all the sessions from Weeks 1, 2, and 3 on-demand in the VMware Learning Zone.)

Essential Greenplum advice from the leaders at Morgan Stanley

Learning from others is our favorite part of the Summit. So we just had to start this recap with the last session: an interview with Ailun Qin, VP at Morgan Stanley.

In this engaging interview, Ailun passes along invaluable best practices to folks just starting their journey with Greenplum. Here are just a few of her insights.

Benchmark everything to get a baseline measure of performance. Then run benchmark tests regularly. If you can’t measure it, you can’t manage it. That’s why Ailun recommends initial and regular benchmarking as part of your modern data warehouse effort. She goes on to cite gpcheckperf as a terrific tool to help.
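
To make the "baseline, then regular runs" idea concrete, here is a minimal sketch that compares a new set of benchmark numbers against a stored baseline and flags anything that got noticeably slower. The metric names, the sample numbers, and the 10% tolerance are illustrative assumptions, not values from the talk or from any Greenplum tool's output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Regression:
    metric: str
    baseline: float
    current: float
    change: float  # fractional drop relative to the baseline


def find_regressions(baseline, current, tolerance=0.10):
    """Return metrics whose current value fell more than `tolerance` below baseline.

    Both arguments map a metric name to a higher-is-better number
    (for example, throughput in MB/s). Metrics missing from `current`
    or with a non-positive baseline are skipped.
    """
    regressions = []
    for metric, base_value in baseline.items():
        if metric not in current or base_value <= 0:
            continue
        drop = (base_value - current[metric]) / base_value
        if drop > tolerance:
            regressions.append(Regression(metric, base_value, current[metric], drop))
    return regressions


# Hypothetical numbers, for illustration only.
baseline = {"disk_write_mb_s": 1000.0, "disk_read_mb_s": 1200.0, "network_mb_s": 900.0}
latest = {"disk_write_mb_s": 980.0, "disk_read_mb_s": 900.0, "network_mb_s": 910.0}

for r in find_regressions(baseline, latest):
    print(f"{r.metric}: {r.baseline:.0f} -> {r.current:.0f} ({r.change:.0%} slower)")
```

Keeping the baseline in version control next to the script is one simple way to make later runs comparable.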

Alerts will help you run Greenplum at scale. Just make sure each alert has an action item. You’ll soon be running Greenplum at petabyte scale. Alerts are a scalable, feasible way you can stay on top of system behavior that may warrant deeper investigation. Just make sure that you set up each alert message with an action item that’s likely to be useful to the engineer who receives it. Otherwise, it’s just noise and a big distraction!
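
One way to enforce the "every alert has an action item" rule is to make it impossible to define an alert without one. The sketch below is a hypothetical data structure, not part of Greenplum or any monitoring product; the field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    condition: str  # human-readable description of what triggers the alert
    action: str     # what the receiving engineer should do first

    def __post_init__(self):
        # Reject alerts that would only be noise for the person on call.
        if not self.action.strip():
            raise ValueError(f"alert {self.name!r} has no action item")

    def message(self):
        return f"[{self.name}] {self.condition}. Action: {self.action}"


# Hypothetical alert definition.
alert = Alert(
    name="disk_usage",
    condition="Disk usage on a segment host crossed the agreed threshold",
    action="Check for runaway spill files, then plan cleanup or expansion",
)
print(alert.message())
```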

Automation should be your top priority. And probably your second and third priority too. Ailun makes automation the top priority for her team day after day. When it comes to patch upgrades, and keeping the system in good working order, you need to be able to perform these tasks at scale, on 100 nodes (or more) at the same time.
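
As a rough illustration of what "100 nodes at the same time" can look like in a script, the sketch below fans one task out to every host in parallel and records a per-host outcome. The `task` callable is a placeholder for whatever your team actually runs (a patch step, a health check); the host names and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_all_hosts(hosts, task, max_workers=32):
    """Run task(host) for every host in parallel and collect the outcomes.

    Returns a dict mapping each host to ("ok", result) or ("error", message),
    so one failing host does not hide the results from the others.
    """
    def attempt(host):
        try:
            return host, ("ok", task(host))
        except Exception as exc:  # report the failure instead of aborting the batch
            return host, ("error", str(exc))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(attempt, hosts))


def example_task(host):
    # Placeholder: a real task would connect to the host and do some work.
    return f"checked {host}"


# Hypothetical host names.
outcomes = run_on_all_hosts([f"sdw{i}" for i in range(1, 101)], example_task)
failed = [host for host, (status, _) in outcomes.items() if status == "error"]
print(f"{len(outcomes) - len(failed)} hosts ok, {len(failed)} failed")
```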

You’re running Greenplum as a managed service for your users. Organize your people and processes accordingly. Greenplum is used for mission-critical tasks, so it’s vital that you give the system the attention it deserves. That means “follow the sun” support for a global company, with on-call engineers in North America, EMEA, and APJ. Ailun also recommends establishing a service level agreement with the business that outlines expectations for system uptime, incident response times, and escalation rules. She also says that you need a variety of specialized engineers behind the scenes to deliver for the business. IT operations, hardware engineers, and DBAs all have a role to play.

Always refreshing to hear a customer's (Morgan Stanley) Ailun Qin's perspective on Greenplum.

#data #GreenplumSummit @VMwareTanzu @VMware @Greenplum pic.twitter.com/g77Cq9eSAG
— Ji Lim (@jilim3) August 26, 2020

Ailun closes with three bits of advice for folks starting their Greenplum journey. Automate everything. Have a proactive mindset; strive to find and resolve issues before your users do. And finally, love your job!

Well said, Ailun!

Your incumbent data warehouse was built for a different time. Things have changed.

Jacque Istok and Ivan Novick open Week 3 with a look at traditional data warehouses. These systems have served organizations well, when processes were mostly linear and ETL ruled the day.

Now? Ivan says “there is a complex, more dynamic data environment” in play. You need a modern data warehouse that can handle an explosion of use cases, and a sprawl of data sources. Open source projects like Hadoop and Spark offer some utility, but many organizations deem them incomplete solutions.

This is where Greenplum steps up to offer breakthrough utility with remarkable cost savings and efficiency.

Start your modern data warehouse evaluation by examining data ingest features.

Ivan Novick also observed in the “Welcome” talk that data processing is fairly straightforward…once you’ve cleaned up all your data. Of course, getting to that point is the hard part!

Data preparation, or data wrangling as Ivan calls it, is a real slog, especially with data streaming in from so many places. Greenplum shines in this scenario, with powerful data ingestion capabilities that hook into your existing data sources. A picture is worth a thousand words:

Ready to migrate to Greenplum? Effective utilities make it easier than you think.

Most IT leaders stuck with a traditional data warehouse know they need to upgrade. (The shortcomings of their status quo are often plain as day.) Once you make the call to modernize to a system like Greenplum, how do you make this migration a success?

A successful modernization starts by unlocking the relationships within your data. You need to understand how tables, reports, and queries are linked…and how they will map to your new modern data warehouse.

Sounds like a complex job, right? It is, but Robert Scott, CTO of Eon Collective, has done much of the work for you!

Robert joined Jacque to demonstrate a mix of methodology and tooling to help make migrations a success. In this talk, Robert says the metadata of your current systems is an underappreciated part of your migration success.

He went on to show a demo of the Eon Collective’s ADEPT Asset Manager tool that works with a variety of data sources (Oracle, Netezza, Informatica, etc.). The utility examines these systems in depth, analyzing the tables that need to be moved. The output is a “map” that shows the relationship between these sources, and actionable guidance on how to restructure and migrate data, reports, and ETL processes.

Perhaps best of all, Robert explains that the utility creates new insights that allow for a reduction in effort and cost for data modernization. This can dramatically simplify your project.

Get ready to upgrade to Greenplum 6. We have lots of practical tools to get you there.

Speaking of neat utilities! Nirali Sura and Randy Williard join the Summit to review all the tools at your disposal to help you make the move to Greenplum 6.

Gpbackup, gpcopy, and gptransfer will prove useful during your upgrade. Nirali also unveils a new project: gpupgrade.

Gpupgrade does what the name suggests: the tool helps you upgrade from GPDB5 to GPDB6. Use gpupgrade, and perform your upgrade faster, with smaller storage requirements.

How does it work? Nirali explains it in 5 steps:

Pre-upgrade. Install gpupgrade and the latest GPDB6 binaries.
Initialize. Here, gpupgrade initialize will initialize the target cluster and run checks to verify the health of the source cluster.
Execute. Then gpupgrade execute will upgrade the primary and master segments.
Finalize. From there, gpupgrade finalize will upgrade standby and mirror segments. It will also update the data directories and port of the clusters.
Post-upgrade. Now, you can perform validation scripts to verify the performance of the upgraded cluster.

Nirali adds two other important details: steps 2, 3, and 4 will require downtime. And you can roll back after step 2 and again after step 3 if things don’t go as expected.
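
For planning purposes, the five steps and the two details above can be written down as data. The sketch below only restates what the talk says; the class and function names are hypothetical, and the commands are listed without flags because the talk does not give any.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UpgradeStep:
    number: int
    name: str
    command: Optional[str]      # gpupgrade subcommand named in the talk, if any
    requires_downtime: bool
    can_roll_back_after: bool


# Values transcribed from the five steps and the two details above.
STEPS = [
    UpgradeStep(1, "Pre-upgrade", None, False, False),
    UpgradeStep(2, "Initialize", "gpupgrade initialize", True, True),
    UpgradeStep(3, "Execute", "gpupgrade execute", True, True),
    UpgradeStep(4, "Finalize", "gpupgrade finalize", True, False),
    UpgradeStep(5, "Post-upgrade", None, False, False),
]


def downtime_window(steps):
    """Return the (first, last) step numbers that require downtime."""
    numbers = [s.number for s in steps if s.requires_downtime]
    return (min(numbers), max(numbers))


def last_rollback_point(steps):
    """Return the last step after which a rollback is still possible."""
    return max(s.number for s in steps if s.can_roll_back_after)


first, last = downtime_window(STEPS)
print(f"Plan downtime from step {first} through step {last}.")
print(f"Rollback is possible up to the end of step {last_rollback_point(STEPS)}.")
```

The practical takeaway: decide whether to proceed or roll back before running finalize.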

Gpupgrade is in beta now; we don’t recommend it for production usage just yet. That said, we are looking for beta testers – please contact us if you’re interested!

OK, that’s a look at the tools. What about the method to perform your upgrade? Randy outlines three popular choices:

Upgrade in-place. Run the new version of Greenplum on the same hosts that run your 4.x environment. You’ll need at least 40% available space on all hosts. Gpcopy will be a useful tool for you here.
Upgrade on new infrastructure. This is a popular option when you need to accommodate growth in your environment, and want newer, faster hardware to power your deployment. Once again, gpcopy is essential.
Upgrade on the same system. This involves a backup/wipe/restore workflow. Because of the “wipe” component, use this option as a last resort only! It may be your only option if no new environment is available AND system capacity is beyond 60%. (Just make sure your backup is valid before considering this path.)
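
The three choices boil down to two questions: is a new environment available, and how much free space do the current hosts have? Here is a minimal sketch of that decision using the 40% free-space figure quoted above. Preferring new infrastructure whenever it is available is an assumption of this sketch, not a rule from the talk, and the function name is hypothetical.

```python
def choose_upgrade_method(new_hosts_available, free_space_fraction):
    """Pick one of the three upgrade paths described above.

    free_space_fraction is the smallest fraction of free disk space across
    all hosts, between 0.0 and 1.0.
    """
    if not 0.0 <= free_space_fraction <= 1.0:
        raise ValueError("free_space_fraction must be between 0.0 and 1.0")
    if new_hosts_available:
        # Assumption of this sketch: take the new hardware when it exists.
        return "new infrastructure"
    if free_space_fraction >= 0.40:
        # At least 40% free on every host, so both versions can coexist.
        return "in-place"
    # No new environment and capacity beyond 60%.
    return "backup/wipe/restore (last resort; validate the backup first)"


print(choose_upgrade_method(new_hosts_available=False, free_space_fraction=0.45))
```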

Randy closes his part of the talk with these useful recommendations:

The Greenplum team is here to help you perform a successful upgrade!

DBAs, you have a tough job. Greenplum Command Center gives you superpowers.

Database administrators, we feel you. You’re under pressure to deliver top-notch performance for all your users. To do this effectively, you need tools that empower you to adjust how workloads run on your Greenplum clusters throughout the business day.

Greenplum Command Center was built for this purpose. Joy Chen and Ning Fu showcase new updates to the module for Greenplum 5.2 and up. The lead feature: resource groups.

Use resource groups to set CPU, memory, and concurrent transaction limits. They are a fantastic way to prioritize how finite compute resources are allocated to your workloads. You can even move workloads to higher (or lower) resource groups in real-time based on the overall demands on your system!
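
Outside of Command Center, resource groups are defined with SQL. The sketch below builds such a statement from the three limits mentioned above. The attribute names CONCURRENCY, CPU_RATE_LIMIT, and MEMORY_LIMIT follow the CREATE RESOURCE GROUP syntax of Greenplum 5 and 6 and should be checked against the documentation for your version; the group name, the values, and the validation ranges are this sketch's own assumptions.

```python
import re


def create_resource_group_sql(name, concurrency, cpu_rate_limit, memory_limit):
    """Build a CREATE RESOURCE GROUP statement from three limits.

    cpu_rate_limit and memory_limit are percentages. The range checks are
    conservative sanity checks for this sketch, not Greenplum's exact rules.
    """
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"invalid resource group name: {name!r}")
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    for label, value in (("cpu_rate_limit", cpu_rate_limit), ("memory_limit", memory_limit)):
        if not 1 <= value <= 100:
            raise ValueError(f"{label} must be a percentage between 1 and 100")
    return (
        f"CREATE RESOURCE GROUP {name} WITH ("
        f"CONCURRENCY={concurrency}, "
        f"CPU_RATE_LIMIT={cpu_rate_limit}, "
        f"MEMORY_LIMIT={memory_limit});"
    )


# Hypothetical group for reporting workloads.
print(create_resource_group_sql("rg_reporting", concurrency=10, cpu_rate_limit=20, memory_limit=25))
```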

Joy and Ning go on to show the latest capabilities of Command Center, including a rules engine that allows you to automatically place workloads in a given resource group according to custom conditions.
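
Conceptually, a rules engine of this kind is an ordered list of conditions, each tied to a target resource group. The sketch below illustrates the idea only; it is not Command Center's API, and the rule conditions, query attributes, and group names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Rule:
    description: str
    matches: Callable[[Dict[str, Any]], bool]
    resource_group: str


def assign_resource_group(query, rules, default="default_group"):
    """Return the resource group of the first rule that matches the query."""
    for rule in rules:
        if rule.matches(query):
            return rule.resource_group
    return default


# Hypothetical rules, evaluated in order.
RULES = [
    Rule("ETL role runs in the batch group",
         lambda q: q.get("role") == "etl", "batch"),
    Rule("Long-running queries are moved to a lower-priority group",
         lambda q: q.get("runtime_seconds", 0) > 600, "low_priority"),
]

print(assign_resource_group({"role": "analyst", "runtime_seconds": 900}, RULES))
```

First-match-wins keeps the behavior easy to reason about: the order of the list is the priority of the rules.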

Use Command Center to bring the power of Greenplum to the workloads that need it the most, when they need it!

Don’t Let Your Cluster Performance Suffer from Complex Queries.

When you’re running Greenplum at scale, you will invariably have analysts that run complex queries. How can you ensure that even these queries return results in a reasonable amount of time?

Sergey and Dmitry from Luxms VC have an answer. They walked us through a recent customer case study, an enterprise running Greenplum with 200 TB of data. The majority of queries performed just fine, but one query in particular was quite involved. These were some of the query parameters:

9 source tables
5 CTE queries
Window Function
UNION ALL
About 20 JOINS
GROUP BY 10 columns
CASE statements
Biggest tables: 5.5B + 1.6B records

Wow. This query would take 30 minutes to process. Clearly, there is a productivity impact. Would it be possible to get the query run time down to 1 minute? As it happens, yes!

The solution: get a useful subset of the data on a dedicated system. Sergey and Dmitry explain that Dremio, a new open-source technology, is tailor-made for this scenario.

The ultimate solution for the customer was a combination of Greenplum, Dremio, and the MPP BI tool developed and implemented by Luxms. This is shown below.

The results for this client were stellar. The time to run this query was reduced from 30 minutes to 1 minute! The error rate is quite low as well. Consider a similar architecture if you have heavy queries with sluggish performance.

You’ll need crisp lifecycle operations to keep your modern data warehouse humming. Kubernetes can help.

Need to spin up Greenplum clusters quickly for some experimentation? As we noted in week 1, Kubernetes is a terrific option. Karen Huddleston says Kubernetes can simplify ongoing operations for Greenplum as well. In particular, the container orchestrator makes expansions and minor upgrades a breeze.

Let’s start with expansion. Just a single YAML change will automatically provision new segments, generate a gpexpand-config file, and run gpexpand to add the segments to the cluster.

When it comes to upgrades, Karen suggested simply following the usual Kubernetes workflow:

Load new Greenplum images into your container registry.
Run helm upgrade to upgrade the GP4K operator.
Run kubectl delete and kubectl apply on your manifest to recreate your cluster.

This workflow upgrades components and OS packages as well.
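
The helm and kubectl parts of this workflow can be scripted. The sketch below only assembles the command lines so you can review them before running anything; the release name, chart path, and manifest file name are hypothetical placeholders, and loading images into your registry is left out because it depends on your registry tooling.

```python
def upgrade_commands(release, chart_path, manifest_path):
    """Return the helm and kubectl invocations for the upgrade, in order."""
    return [
        ["helm", "upgrade", release, chart_path],     # upgrade the operator
        ["kubectl", "delete", "-f", manifest_path],   # remove the running cluster
        ["kubectl", "apply", "-f", manifest_path],    # recreate it on the new images
    ]


# Hypothetical release name and paths.
for command in upgrade_commands("greenplum-operator", "operator/", "my-gp-instance.yaml"):
    print(" ".join(command))
```

Each inner list can be passed to `subprocess.run(command, check=True)` once you are happy with the output.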

Join Us on September 23 for the Final Session of Greenplum Summit 2020

There’s just one session left – don’t miss out! Registration is free and easy.

Sept 23: AI, Neural Networks, and the Future of Analytics (register)