Introduction to Greenplum Architecture

This is the first article in the Greenplum Kernel series. The series contains ten articles in total, each explaining a different Greenplum module in depth. Today's topic is the Greenplum architecture. Before we discuss Greenplum's architecture, let's first look at what a database management system is.

Database management system

The database management system was invented to meet the demand for effective data management and querying. Before database management systems existed, data was stored in plain files. In the following example, the two tables hold bar information and the sales records of different bars. If these two kinds of information are stored in files, we can compute each bar's beer sales with two nested for loops.

Although the code for this operation is very short, its algorithmic complexity is high. Beyond this inefficiency, file-based storage has other problems:

  • Data consistency: for example, a sales record may reference a bar that does not appear in the bar file, leaving invalid data behind;
  • Record modification: updating variable-length attributes such as strings is particularly difficult to maintain;
  • Query complexity, redundancy, and inefficiency: every query requires writing a dedicated traversal program, which is expensive in both running time and space;
  • High technical requirements for anyone who wants to use the data;
  • And many other issues beyond those listed above.

So the database management system came into being. Many people say "database" as shorthand for "database management system", but they are actually two different concepts. A database is a collection of related data, effectively organized for efficient storage and query. A database management system (DBMS) is the software that manages the data in a database: it provides not only basic interfaces for inserting, deleting, and updating data, but also an efficient query language. Data can be modeled in different ways, including document-based, object-based, and key-value models; a DBMS whose underlying data model is the relational model is called a relational database management system, often shortened to relational database. Greenplum is one such relational database management system.
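
To make the contrast with file storage concrete, here is a minimal SQL sketch of the bar example. The table and column names are assumptions based on the figures described above, not the article's actual schema. The two nested for loops over files collapse into one declarative query:

    -- Hypothetical schema: bars(name, city), sells(bar, beer, sale_date).
    -- Beer sales quantity per bar, replacing the two nested file loops:
    SELECT s.bar, count(*) AS beers_sold
    FROM sells s
    JOIN bars b ON b.name = s.bar
    GROUP BY s.bar;

The DBMS chooses the access paths and join strategy itself, so no hand-written traversal program is needed.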

A relational database involves several important technical points: how to define and store data, how to ensure data integrity (including data consistency, transactions, and concurrency control), and how to support a user-friendly query interface.

Greenplum Overall Architecture

Next, let's look at how Greenplum solves these problems, starting with its overall architecture. Greenplum is an open-source distributed database based on Postgres. Topologically, it is a database cluster composed of single-node Postgres instances, but it is not limited to that: Greenplum provides a unified database interface that makes users feel like they are using a single database, and it implements many optimizations for cluster processing.

In terms of physical topology, Greenplum is a typical master-segment architecture. A Greenplum cluster is usually composed of one Master node, one Standby Master node, and multiple Segment nodes, all interconnected through a high-speed network. The Master is the entry point of the whole database: end users connect to the Master to run queries, and the Standby Master provides high availability for it. Segment nodes are worker nodes, and the data lives on the segments; mirror segments provide high availability for the segments.

Each box in the figure below is a physical machine. To get the best performance out of each machine, multiple segment processes can be flexibly deployed on one node. During query processing, when the Master node receives a query statement from a user, it compiles and optimizes the query, generates a parallel query plan, and distributes the plan to the segment nodes for execution. When the segments finish executing, the results are sent back to the Master node and finally presented to the user.

After understanding the architecture of Greenplum, let's look at how data is stored on it. Each physical machine typically has multiple disks, each disk is mounted as a separate file system, and each of them holds multiple data partitions. Greenplum organizes data using the following strategies:

  • First, rows are distributed evenly across the segments according to a configured distribution policy. Greenplum supports hash distribution, random distribution, and, new in Greenplum 6, replicated distribution. This operation is called data sharding;
  • Then, on each node, the data can be further divided into smaller subsets through partitioning, which lowers the cost of scanning at query time. This operation is called data partitioning (see the DDL sketch after this list).
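
As a sketch of what these two steps look like in Greenplum DDL (table and column names are illustrative; sale_date stands in for the figure's "date" column, since date is a type name):

    -- Hash distribution spreads rows across segments by hashing "bar";
    -- range partitioning then splits each segment's data by month.
    CREATE TABLE sells (
        bar       text,
        beer      text,
        sale_date date
    )
    DISTRIBUTED BY (bar)
    PARTITION BY RANGE (sale_date)
    (
        START (date '2020-01-01') INCLUSIVE
        END   (date '2021-01-01') EXCLUSIVE
        EVERY (interval '1 month')
    );

    -- Alternative policies: DISTRIBUTED RANDOMLY,
    -- or DISTRIBUTED REPLICATED (new in Greenplum 6).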

Let's revisit the "bar" example above. On the left is data sharding: the sells table is distributed by one column ("bar"), so its rows are spread across different nodes. On each node, the data is then partitioned by "date" into small sub-tables. For the query statement shown in the figure, only the blue parts need to be scanned to answer it, which greatly reduces the amount of data read.
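
The query in the figure is not reproduced here, but a query of the following shape would behave as described: the date predicate prunes partitions, so only the matching sub-tables (the blue parts) are scanned:

    -- Only the June 2020 partitions of "sells" need to be read.
    SELECT beer, count(*)
    FROM sells
    WHERE sale_date >= date '2020-06-01'
      AND sale_date <  date '2020-07-01'
    GROUP BY beer;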

Next, let's take a look at Greenplum's service processes. In the following example, there is a Master node on the left and two Segment nodes on the right. From the topological perspective, Greenplum is a database cluster composed of single-node Postgres instances, so there are three Postgres processes in the example. When a client issues a request, it connects to Greenplum's Master node through Postgres's libpq protocol. When the postmaster process on the Master accepts the connection request, it creates a child process (the QD) to handle all of that client's query requests. The QD shares data with the original Master process through shared memory.

Then come the connections between the QD and the individual segments. The QD can be seen as a client of each segment: it sends connection requests to all segment nodes and establishes a connection with each segment through the libpq protocol. When the postmaster process on a segment accepts the QD's connection request, it likewise creates a child process (a QE) to handle subsequent query requests. QD stands for query dispatcher, the distributor; QE stands for query executor, which processes queries. All the QD and QE processes work together to complete the query request sent by the client.

With the preparatory work done, how does a query actually run? After the client sends a query to the QD process on the Master, the QD processes the received query: it parses the original query statement, has the optimizer generate a distributed query plan, and sends the plan to the QE processes on each segment through the libpq protocol. The QD and the QEs on each segment also establish interconnect connections. Note that libpq is used mainly for control commands and returning results, while the interconnect is used for internal data communication.

Each QE process executes the query subtask assigned to it and returns its results to the QD. QEs also exchange data with one another through the interconnect; there are no libpq links between them. Between QE and QD, the libpq protocol is used to update and manage state, including error handling. Finally, the QD assembles the collected query results and returns them to the client through libpq.

Consider a more complex query: scan two tables first, then join them. The whole query is now divided into two layers. The first layer reads the data and exchanges it between segments; the second layer joins the tuples that share the same join key and finally returns the join result to the client. PostgreSQL uses a process model, creating a single process for each query. To improve CPU utilization, Greenplum parallelizes query execution between operators within a segment: a single user query can have multiple QEs on one segment, and those QEs exchange data with each other through the interconnect.

Storage management

Having covered the architecture of Greenplum, let's look at its main functional modules, starting with storage management. Textbooks often describe the storage pyramid: the higher a level sits, the smaller its capacity but the faster it is; the lower a level sits, the larger its capacity but the slower it is. Memory and everything above it is volatile storage: fast, but prone to losing data. Non-volatile storage below it is slower, but does not easily lose data.

As you may remember from university operating systems courses, programs access files through a cache or buffer. Greenplum uses a shared buffer area as an intermediate memory layer: backend processes deal directly with this middle layer, and the buffers in shared memory are exchanged with the disk files.

So what happens in the shared buffer? Greenplum organizes data in blocks. The dashed square in the figure is the shared memory area. A mapping table maps a block number to its shared buffer block. If the buffered content is found to be invalid, the block is loaded into the shared buffer through file operations in the lower layer; if it is valid, it is read directly.

Within each file block, Greenplum stores variable-length data from back to front. The page header contains fixed-length item pointers that record the in-block offsets of the actual data at the end of the block, so the n-th tuple can be located quickly through its pointer. Tuple data grows from the back toward the front, item IDs grow from the front toward the back, and the contiguous free space in the middle is preserved to reduce fragmentation.
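
You can inspect this layout yourself with PostgreSQL's pageinspect extension. It is a Postgres contrib module, and whether it ships with a given Greenplum build may vary, so treat this as an illustration of the Postgres-derived heap page format:

    -- Requires the pageinspect contrib module.
    CREATE EXTENSION IF NOT EXISTS pageinspect;

    -- Item pointers (lp) grow from the front of the block; lp_off shows the
    -- tuple data placed from the back. Block 0 of "sells" is examined here.
    SELECT lp, lp_off, lp_len, t_xmin, t_xmax
    FROM heap_page_items(get_raw_page('sells', 0));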

In Greenplum, concurrent reads can proceed together, and reads and writes can also proceed at the same time. To allow reading and writing simultaneously, Greenplum provides a multi-version concurrency control mechanism, MVCC, to improve concurrency. When a tuple is inserted, two hidden fields are saved with it: xmin and xmax, which record the IDs of the inserting transaction and the deleting transaction, respectively.

In the following example, figure (c) deletes the B0 tuple, so the xmax value 18 is written into that tuple. If transaction 17 is still running, the deletion is invisible to it, because transaction 17 started before transaction 18; B0 therefore remains visible to transaction 17, which does not consider the data deleted. For transaction 19, B0 has been deleted. An UPDATE is split into two operations, DELETE and INSERT, as in figure (e). If two writes hit the same table, and even the same tuple, at the same time, they are handled according to the isolation level. When a deleted tuple is no longer visible to any currently running transaction, VACUUM reclaims its disk space.
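
The xmin and xmax fields are ordinary system columns and can be selected directly; a quick way to watch MVCC at work (table and values are from the hypothetical example above):

    -- Show which transaction created each row version and which deleted it.
    SELECT xmin, xmax, * FROM bars;

    -- An UPDATE behaves as DELETE + INSERT: the old version gets an xmax,
    -- and a new version appears with a fresh xmin. Values are illustrative.
    UPDATE bars SET city = 'Oakland' WHERE name = 'B0';
    SELECT xmin, xmax, * FROM bars;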

Index

An index speeds up locating tuples: it maps a value to the physical location where the tuple is stored, so the tuple can be read directly from that location. For example, in the figure below, the index records that the tuple with city 'San Francisco' is the second tuple in file block 453. That said, an index scan is not always faster than a sequential scan. For example, indexing a gender column makes little sense, because roughly half of the rows would be scanned anyway. Deciding when to use a sequential scan and when to use an index is the query optimizer's job. Beyond that: how is the index structure maintained? What types of indexes are there? How is concurrency controlled? Is an index necessarily better than a scan? We will share these topics in the index session of the "Kernel Series Live", so stay tuned.
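
A sketch of the trade-off (index and column names are hypothetical, assuming the figure's table has a city column):

    -- A B-tree index on a selective column lets the optimizer use an index scan.
    CREATE INDEX bars_city_idx ON bars (city);
    EXPLAIN SELECT * FROM bars WHERE city = 'San Francisco';

    -- For a low-selectivity predicate (e.g. a gender column), the optimizer
    -- will usually still prefer a sequential scan, since about half of the
    -- table would be visited anyway.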

Query execution

We can now insert, delete, update, and query tuples, but that alone is far from enough. To make the system easier to use and more powerful, Greenplum provides an execution engine. When a query runs, the SQL statement first passes through the parser, which turns the SQL string into a structured tree; the optimizer then produces the most efficient execution plan; and the executor runs the plan to produce the query result.

The SQL statement in the following example joins two tables. In Greenplum, the executor runs the query plan through iterators: from top to bottom, one tuple at a time. Each executor node returns the next tuple upward and pulls the next tuple from below. A statement may have multiple possible query plans, such as the sequential scan and index scan mentioned earlier, and the query optimizer is smart enough to choose the cheapest one. In the example, the bottom level scans the two tables. After the scans, to perform the join, the Bars table must be redistributed on the tuples' name column, so that the Bars and Sells tuples satisfying the join condition are gathered together. Since Sells is already distributed by bar, it does not need to be redistributed. Finally, after the projection, the results are gathered to the QD node by the Gather Motion at the top of the plan.
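
In EXPLAIN output, such a plan appears roughly as follows. The plan shape is reconstructed from the description above and is illustrative, not verbatim Greenplum output:

    EXPLAIN
    SELECT b.name, s.beer
    FROM bars b JOIN sells s ON s.bar = b.name;

    -- Illustrative plan shape on a two-segment cluster:
    --  Gather Motion 2:1  (slice2)                 -- results gathered to the QD
    --    ->  Hash Join
    --          Hash Cond: (s.bar = b.name)
    --          ->  Seq Scan on sells               -- already distributed by bar
    --          ->  Hash
    --                ->  Redistribute Motion 2:2   -- bars re-shuffled on name
    --                      ->  Seq Scan on bars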

One problem with an index scan is that the order given by the index rarely matches the physical order of the tuples in the file, resulting in random disk access.

One solution is Greenplum's CLUSTER operation, which reorders the tuples in the file according to the index order, so that scanning in index order accesses the disk sequentially.
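
The syntax is shared with PostgreSQL; a sketch (the index name is illustrative, and CLUSTER applies to heap tables):

    -- Rewrite "sells" in the physical order of the date index, so that a
    -- subsequent index-ordered scan reads the disk sequentially.
    CREATE INDEX sells_date_idx ON sells (sale_date);
    CLUSTER sells USING sells_date_idx;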

Another solution, useful especially when there are multiple conditions, is bitmap-based scanning. In the following example, there is an index on "date" and another on "beer". For each condition, a bitmap can be built from its index: the bitmap records which tuples satisfy that condition, marking each satisfying tuple with a 1. The bitmaps for the two conditions are then combined with a bitmap AND operation to obtain the final bitmap of tuples to scan. With bitmap scanning, the file is accessed in sequential order.
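
With an index on each column, a query combining both predicates can show up as a BitmapAnd node in the plan. A hedged sketch, reusing the date index created above; the plan fragment is illustrative:

    CREATE INDEX sells_beer_idx ON sells (beer);

    EXPLAIN
    SELECT * FROM sells
    WHERE sale_date = date '2020-06-01' AND beer = 'IPA';

    -- Illustrative plan fragment:
    --  Bitmap Heap Scan on sells
    --    ->  BitmapAnd
    --          ->  Bitmap Index Scan on sells_date_idx
    --          ->  Bitmap Index Scan on sells_beer_idx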

Greenplum has three main join implementations. The first is Nested Loop Join, similar to the file-based approach mentioned earlier: two nested loops match the inner and outer scans against each other and return the matching results. A common variant has the inner loop use an index instead of a sequential scan, making execution more efficient.

The second is Merge Join, which has two phases. In the first phase, the tuples to be joined are sorted on the join keys; in the second, a merge pass over the sorted tuples produces the matches.

The third is Hash Join. In a hash join, one table serves as the probe (lookup) side, while the other, smaller table is built into a hash table; if the hash table is small enough, it fits in memory. The probe side is scanned, and each of its tuples is checked against the hash table: a match is returned directly, otherwise the tuple is skipped. But what if the hash table is too large to fit in memory? Which tuples should be spilled to external storage, and how are the spilled tuples matched? We will find the answers in the live broadcast of "Greenplum Kernel Revealed: Execution Engine" on May 22.
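
To compare the three join methods experimentally, the planner's enable_* settings can be toggled per session. These are standard Postgres planner settings and are also present in Greenplum; note that Greenplum's ORCA optimizer may not honor them, so this is a sketch for the Postgres-based planner only:

    -- Disable hash and merge join to force a nested loop, then inspect the plan.
    SET enable_hashjoin  = off;
    SET enable_mergejoin = off;
    EXPLAIN SELECT * FROM bars b JOIN sells s ON s.bar = b.name;

    RESET enable_hashjoin;
    RESET enable_mergejoin;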

Transactions and logs

If a Greenplum process crashes while modifying files, how is data consistency guaranteed? A classic problem from database courses: transfer 100 from account A to account B. If the system crashes and restarts right after account A is decreased by 100, can we end up with A decreased by 100 but B never increased by 100? Greenplum guarantees the atomicity of such operations through transactions.
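
A minimal sketch of the transfer as a transaction (the accounts table is hypothetical):

    -- Either both updates take effect or neither does; a crash between the
    -- two UPDATEs rolls the whole transfer back on restart.
    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE owner = 'A';
    UPDATE accounts SET balance = balance + 100 WHERE owner = 'B';
    COMMIT;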

Another problem is isolation between transactions. Suppose one transaction handles the transfer while another adds interest to the bank accounts (here, 2% interest). If the transfer transaction and the interest transaction run at the same time and their operations interleave in the wrong order, as in the figure, there is trouble: in the end, 2 dollars have gone missing! Transaction isolation solves this kind of headache. These topics will be explained in detail in the transactions session of this year's kernel live series.
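
A hedged sketch of the fix: run the interest transaction at a stricter isolation level, so the conflicting interleaving is detected rather than silently losing money (in Postgres-style MVCC, one of the conflicting transactions receives a serialization error and can be retried):

    BEGIN ISOLATION LEVEL REPEATABLE READ;
    UPDATE accounts SET balance = balance * 1.02;  -- add 2% interest
    COMMIT;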

Every write changes memory first and reaches disk later. In the following example, A is first modified to 23 in memory. If the system crashes after the commit but before the change reaches disk, the modification to A is lost: on restart, A still equals 12, because the change was never written back. Of course, requiring every modification to be written to disk immediately would prevent this, but at great cost in efficiency. The logging facility provided by Greenplum solves the problem: the log records the database's modifications in detail, is written sequentially, and provides the concept of a logical timeline, so it is far more efficient than random disk access.

As mentioned earlier, Greenplum's data lives on multiple segments, so a write must either succeed on all segments or fail on all of them. What is unacceptable is partial success: some segments written, others not, leaving the data in an inconsistent state. This calls for a classic algorithm: two-phase commit. As the name suggests, it runs in two phases. In the first (prepare) phase, all nodes vote on whether they can commit. If every node replies yes, the transaction is committed in the second phase; if even one node does not reply yes, all nodes roll back.
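
Greenplum drives two-phase commit internally between the QD and the segments, so users never issue these commands in normal operation; but PostgreSQL's explicit 2PC statements illustrate the protocol's two phases:

    -- Requires max_prepared_transactions > 0.
    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE owner = 'A';
    PREPARE TRANSACTION 'transfer_1';  -- phase 1: persist the vote to commit

    COMMIT PREPARED 'transfer_1';      -- phase 2: commit after all vote yes
    -- If any participant votes no:  ROLLBACK PREPARED 'transfer_1';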

Let's look at how errors are handled:

  1. If the error occurs in the first phase and some nodes cannot commit, the second phase simply rolls all nodes back.
  2. If all nodes replied yes in the first phase but something fails in the second, the distributed transaction manager (DTM) retries the commit on the failed nodes until it succeeds. The DTM stores all commit information in the log, and the transaction's state can be recovered from that information to determine what remains to be done.

If you are interested in Greenplum, a good first step toward becoming a contributor is to download the Greenplum source code from GitHub. After downloading it, compile the source with debug information enabled and optimization turned off.

Once the source code is on your machine, you can browse it with whatever code analysis and reading tool you prefer; I personally use cscope to search the source. The more important directories in the source tree are shown in the figure below, and each will be covered in the corresponding chapter of the live course.