大数据论文:Google BigTable

Bigtable: A Distributed Storage System for Structured Data
BigTable:结构化数据的一种分布式存储系统


摘要

    Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

        BigTable是一种分布式存储系统,可用于管理海量规模的结构化数据:通常是分布在数千台服务器上的PB级别的数据。Google的很多项目都会将数据存储在BigTable中,包括Web索引、谷歌地球和谷歌财经。这些应用对BigTable提出的需求差异非常大,无论是在数据量的规模上(从URL到网页到卫星成像)还是在响应时间上(从后台的批量处理到实时数据服务)。虽然这些需求的差异很大,但是BigTable成功地为所有的这些Google产品提供了一种灵活的、高性能的解决方案。在本文中,我们会论述由BigTable提供的简单数据模型,它使得客户端能够动态控制数据的分布和格式,我们还会论述BigTable的设计和实现方法。

1   前言

    Over the last two and a half years we have designed, implemented, and deployed a distributed storage system for managing structured data at Google called Bigtable. Bigtable is designed to reliably scale to petabytes of data and thousands of machines. Bigtable has achieved several goals: wide applicability , scalability , high performance, and high availability . Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of demanding workloads, which range from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users. The Bigtable clusters used by these products span a wide range of configurations, from a handful to thousands of servers, and store up to several hundred terabytes of data.

        在过去的两年半时间里,我们已经设计、实现和部署了一种用于管理结构化数据的分布式存储系统,我们在Google称其为BigTable。BigTable被设计为能够可靠地扩展至PB级别的数据规模,并且能够部署在数千台机器上。BigTable已经实现了以下几个目标:广泛的适用性、可伸缩性、高性能和高可用性。有超过60个Google产品和项目正在使用BigTable,包括谷歌分析、谷歌财经、Orkut、个性化搜索、Writely和谷歌地球。这些产品对BigTable的工作负载需求迥然不同,有的是需要高吞吐量的批量处理任务,有的是对延迟敏感的终端用户数据服务。这些产品使用的BigTable集群的配置也有着很大的差异,有的只需要几台服务器,有的则需要上千台服务器,需要存储几百TB的数据。

    In many ways, Bigtable resembles a database: it shares many implementation strategies with databases. Parallel databases and main-memory databases have achieved scalability and high performance, but Bigtable provides a different interface than such systems. Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.

        BigTable在很多方面都和数据库非常相似:它使用了数据库的很多实现策略。并行数据库和内存数据库已经具备了可伸缩性和高性能,但是BigTable提供了一个不同于这两个系统的接口。BigTable不支持完整的关系数据模型,但是,它向客户端提供了一种简单数据模型,它使得客户端能够动态控制数据的布局和格式[1],并且客户端还能够推测出保存在底层存储设备的数据的位置属性[2]。数据会通过行名和列名进行索引,行名和列名可以是任意字符串。虽然客户端经常会将各种形式的结构化和半结构化数据序列化为字符串,但是BigTable还是会将这些数据当作未经解释的字符串。通过对数据模式的仔细选择,客户端可以控制数据的位置属性。最后,客户端可以通过BigTable的模式参数动态地控制将数据存储在内存中,还是硬盘上。

    Section 2 describes the data model in more detail, and Section 3 provides an overview of the client API. Section 4 briefly describes the underlying Google infrastructure on which Bigtable depends. Section 5 describes the fundamentals of the Bigtable implementation, and Section 6 describes some of the refinements that we made to improve Bigtable’s performance. Section 7 provides measurements of Bigtable’s performance. We describe several examples of how Bigtable is used at Google in Section 8, and discuss some lessons we learned in designing and supporting Bigtable in Section 9. Finally , Section 10 describes related work, and Section 11 presents our conclusions.

        第2节会更加详细地描述数据模型,第3节会概述客户端的API接口。第4节会简短地描述BigTable依赖的Google底层基础设施。第5节会描述BigTable实现的基本原理,第6节会描述我们为了改善BigTable的性能而做的一些优化改进。第7节会给出BigTable性能的测量结果。在第8节中,我们会描述Google如何使用BigTable的几个示例。在第9节中,我们会论述在设计和支持BigTable时学到的一些经验和教训。最后,第10节会描述相关的工作,第11节会给出我们的结论。

2   数据模型

    A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

        BigTable是一种稀疏的、分布式的、持久化的多维有序映射(sorted map)[3]。这个映射通过行关键字、列关键字和时间戳进行索引,映射中的每个值都是一个未经解释的字节数组。

    We settled on this data model after examining a variety of potential uses of a Bigtable-like system. As one concrete example that drove some of our design decisions, suppose we want to keep a copy of a large collection of web pages and related information that could be used by many different projects; let us call this particular table the Webtable. In Webtable, we would use URLs as row keys, various aspects of web pages as column names, and store the contents of the web pages in the contents: column under the timestamps when they were fetched, as illustrated in Figure 1.

        在仔细分析一个类似BigTable系统的各种潜在的用途之后,我们决定使用这种数据模型。有一个具体的示例驱使我们做出了一些设计决策,假设有很多不同的项目需要使用大量网页及其相关信息的一份拷贝数据,那么我们可以将这些拷贝数据存储在一张名为Webtable的表中。在Webtable表中,我们将会将URL作为行关键字,将网页的各种特征作为列名,并且会将网页的内容存储在contents:列中,并且将获取这些网页的时间戳作为标识[4],如图-1所示:

简单数据模型

Figure 1: A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN's home page is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.

图-1:存储网页数据的一张示例表的片段。行名是一个反转的URL。contents列族包含网页内容,anchor列族包含引用这个网页的任意锚链接的文本。Sports Illustrated和MY-look的主页都会引用CNN的主页,因此,这行会包含名为anchor:cnnsi.com和anchor:my.look.ca的两个列。每个锚链接的单元格都具有一个版本[5],而contents列则具有三个版本,分别由时间戳t3、t5和t6标识。
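
为了更直观地理解上述数据模型,下面给出一段示意性的C++代码,用嵌套的std::map来模拟“行关键字 → 列关键字 → 时间戳 → 值”这种多维有序映射(这只是概念示意,并非BigTable的真实实现,其中的类型别名均为本文为说明而假设的):

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>

// 概念示意:BigTable 可以看作 (行关键字, 列关键字, 时间戳) -> 字符串 的有序映射。
// 时间戳按降序排列,以便最先读到最新的版本(与第 2.3 节的描述一致)。
using Timestamp = int64_t;
using Cell      = std::map<Timestamp, std::string, std::greater<Timestamp>>;
using Row       = std::map<std::string, Cell>;   // 列关键字 -> 单元格的多个版本
using Table     = std::map<std::string, Row>;    // 行关键字 -> 行(按字典序排序)

int main() {
  Table webtable;
  // 以图-1中的 Webtable 为例:行关键字是反转的 URL
  webtable["com.cnn.www"]["contents:"][3] = "<html>...</html>";
  webtable["com.cnn.www"]["contents:"][5] = "<html>...</html>";
  webtable["com.cnn.www"]["anchor:cnnsi.com"][9] = "CNN";

  // 读取 contents: 列的最新版本(时间戳最大的版本排在最前面)
  const Cell& contents = webtable["com.cnn.www"]["contents:"];
  std::cout << "latest timestamp: " << contents.begin()->first << std::endl;
  return 0;
}
```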

2.1   行

    The row keys in a table are arbitrary strings (currently up to 64KB in size, although 10-100 bytes is a typical size for most of our users). Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row), a design decision that makes it easier for clients to reason about the system's behavior in the presence of concurrent updates to the same row.

        一张表中的行关键字可以是任意字符串(虽然大多数用户通常只会使用10-100字节,但是目前最大支持64KB的字符串)。对单个行关键字标识的数据的读写操作都是原子性的(不管这行中有多少个不同的列正在被读写),这个设计决策使得当同一行发生并发更新时,客户端能够更容易地推断系统的行为。

    Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines. Clients can exploit this property by selecting their row keys so that they get good locality for their data accesses. For example, in Webtable, pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs. For example, we store data for maps.google.com/index.html under the key com.google.maps/index.html. Storing pages from the same domain near each other makes some host and domain analyses more efficient.

        BigTable会根据行关键字的字典顺序维护数据。表的行区间是可以动态分区的。每个行区间都被称为一个数据片(tablet),它是数据分布和负载均衡的基本单位。因此,读取较短行区间的效率会更高,通常只要和少数几台机器进行通信即可。通过选择合适的行关键字,客户端可以充分利用这项特性,这样它们就能在访问数据时获得更好的数据位置信息。例如,在Webtable表中,通过反转URL的主机名部分的方式,可以将属于相同域名的网页聚集在一起,组织成连续的行。例如,我们会将maps.google.com/index.html的数据存储在com.google.maps/index.html关键字标识的行中。将属于相同域名的网页数据存储在连续的区域内会使得某些主机和域名分析方法更加高效。
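
下面用一小段示意性的C++代码说明“反转URL主机名”这种行关键字的构造方式,其中的ReverseHostRowKey函数是为说明而假设的辅助函数,并非BigTable提供的API:

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// 示意:将 URL 的主机名部分反转,得到类似 com.google.maps/index.html 的行关键字,
// 从而让同一域名下的网页在字典序上聚集在连续的行中。
std::string ReverseHostRowKey(const std::string& url) {
  std::string::size_type slash = url.find('/');
  std::string host = url.substr(0, slash);
  std::string path = (slash == std::string::npos) ? "" : url.substr(slash);

  // 按 '.' 拆分主机名,再逆序拼接
  std::vector<std::string> parts;
  std::stringstream ss(host);
  std::string part;
  while (std::getline(ss, part, '.')) parts.push_back(part);

  std::string key;
  for (auto it = parts.rbegin(); it != parts.rend(); ++it) {
    if (!key.empty()) key += '.';
    key += *it;
  }
  return key + path;
}

int main() {
  // 输出: com.google.maps/index.html
  std::cout << ReverseHostRowKey("maps.google.com/index.html") << std::endl;
  return 0;
}
```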

2.2   列族

    Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type (we compress data in the same column family together). A column family must be created before data can be stored under any column key in that family; after a family has been created, any column key within the family can be used. It is our intent that the number of distinct column families in a table be small (in the hundreds at most), and that families rarely change during operation. In contrast, a table may have an unbounded number of columns.

        将列关键字进行分组而得到的集合称为列族(column family),它是访问控制的基本单位。存储在相同列族中的所有数据通常都是相同类型的(我们会将相同列族中的数据压缩在一起)。必须先创建列族,然后才能在这个列族的任何列关键字下存储数据;列族创建之后,就可以使用这个列族中的任何列关键字了。根据我们的设计意图,一张表中不同列族的数量不能太多(最多几百个),并且列族在运行期间很少会改变。相比之下,一张表的列数可能是无限多的。

    A column key is named using the following syntax: family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings. An example column family for the Webtable is language, which stores the language in which a web page was written. We use only one column key in the language family, and it stores each web page's language ID. Another useful column family for this table is anchor; each column key in this family represents a single anchor, as shown in Figure 1. The qualifier is the name of the referring site; the cell contents is the link text.

        为列关键字命名的语法规则如下所示:列族:限定符(family:qualifier)。列族的名称必须是可打印的字符串,但是限定符可以是任意字符串。Webtable表有一个名为language的示例列族,它用于存储网页所使用的编写语言。我们只会用到language列族中的一个列关键字,它会存储每个网页的语言ID。这个表的另一个有用的列族是anchor,这个列族中的每个列关键字都表示一个锚链接,如图-1所示。限定符就是引用站点的名称,单元格的内容就是链接文本。

    Access control and both disk and memory accounting are performed at the column-family level. In our Webtable example, these controls allow us to manage several different types of applications: some that add new base data, some that read the base data and create derived column families, and some that are only allowed to view existing data (and possibly not even to view all of the existing families for privacy reasons).

        访问控制,以及磁盘和内存的使用统计都是在列族的层面进行的。在我们的Webtable示例中,这些控制方式使得我们能够管理几种不同类型的应用:有些应用可以添加新的基础数据,有些应用可以读取基础数据和创建衍生列族,有些应用只能查看已有的数据(甚至有可能因为隐私的原因,不能查看所有的已有列族)。

2.3   时间戳

    Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. Bigtable timestamps are 64-bit integers. They can be assigned by Bigtable, in which case they represent “real time” in microseconds, or be explicitly assigned by client applications. Applications that need to avoid collisions must generate unique timestamps themselves. Different versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read first.

        BigTable中的每个单元格都可以包含同一份数据的多个版本,这些版本是通过时间戳进行索引的。BigTable的时间戳是64位的整数。时间戳可以由BigTable赋值,在这种情况下,它们表示精确到微秒的“实时”时间;客户端应用也可以显式地给时间戳赋值。需要避免版本冲突的应用必须自己生成具有唯一性的时间戳。一个单元格的不同版本的数据是按照时间戳的降序存储的,这样最新的版本就会被最先读取。

    To make the management of versioned data less onerous, we support two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically. The client can specify either that only the last n versions of a cell be kept, or that only new-enough versions be kept (e.g., only keep values that were written in the last seven days).

        为了降低管理版本化数据的繁琐性,我们为每个列族添加了两个设置参数,它们会通知BigTable自动对单元格版本进行垃圾收集。客户端可以指定只保留最后n个版本的单元格数据,或者只保留足够新的版本的数据(例如,只保留最近7天写入的数据)。
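
下面的C++代码粗略地演示了这两种垃圾收集设置的效果(为了演示方便,这里把“只保留最后n个版本”和“只保留足够新的版本”两个条件放在同一个策略里;GcPolicy等名称均为假设,并非BigTable的真实接口):

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>

// 示意:按列族级别的垃圾收集设置,清理一个单元格的历史版本。
struct GcPolicy {
  int max_versions;       // 只保留最后 n 个版本
  int64_t min_timestamp;  // 只保留时间戳不早于该值的版本(即"足够新"的版本)
};

using Versions = std::map<int64_t, std::string, std::greater<int64_t>>;  // 时间戳降序

void GarbageCollect(Versions& cell, const GcPolicy& policy) {
  int kept = 0;
  for (auto it = cell.begin(); it != cell.end();) {
    bool keep = (kept < policy.max_versions) && (it->first >= policy.min_timestamp);
    if (keep) { ++kept; ++it; } else { it = cell.erase(it); }
  }
}

int main() {
  Versions cell = {{100, "v3"}, {90, "v2"}, {10, "v1"}};
  GarbageCollect(cell, GcPolicy{3, 50});  // 最多 3 个版本,且时间戳 >= 50
  for (const auto& [ts, v] : cell) std::cout << ts << " -> " << v << "\n";
  return 0;
}
```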

    In our Webtable example, we set the timestamps of the crawled pages stored in the contents: column to the times at which these page versions were actually crawled. The garbage-collection mechanism described above lets us keep only the most recent three versions of every page.

        在我们的Webtable示例中,我们将存储在contents:列中的已爬取网页的时间戳设置为这些网页版本实际被爬取的时间。上文描述的垃圾收集机制使得我们能够只保存每个网页的最近三个版本的数据。

3   API

    The Bigtable API provides functions for creating and deleting tables and column families. It also provides functions for changing cluster, table, and column family metadata, such as access control rights.

        BigTable的API提供了用于创建和删除表和列族的功能。它还提供了用于修改集群、表和列族元数据的功能,例如访问控制权限。

    Client applications can write or delete values in Bigtable, look up values from individual rows, or iterate over a subset of the data in a table. Figure 2 shows C++ code that uses a RowMutation abstraction to perform a series of updates. (Irrelevant details were elided to keep the example short.) The call to Apply performs an atomic mutation to the Webtable: it adds one anchor to www.cnn.com and deletes a different anchor.

        客户端应用可以在BigTable中写入或删除数据,还可以从单行查询数据,或者遍历表中的某个数据子集。图-2展示了一段使用RowMutation抽象类来实现一系列更新操作的C++代码。(为了保持示例的简洁,我们省略了无关的细节。)调用Apply函数便会对Webtable表执行一次原子性的修改操作:它会为www.cnn.com添加一个锚链接,然后删除另一个锚链接。

写入BigTable


Figure 2: Writing to Bigtable.
图-2:写入BigTable
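
论文原文的图-2是一段C++代码,这里根据论文内容将其整理如下(仅为示意,省略了错误处理,Table、RowMutation等为论文中BigTable客户端库的抽象):

```cpp
// Open the table(打开表)
Table *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old anchor(写入一个新锚链接,并删除一个旧锚链接)
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);
```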

    Figure 3 shows C++ code that uses a Scanner abstraction to iterate over all anchors in a particular row. Clients can iterate over multiple column families, and there are several mechanisms for limiting the rows, columns, and timestamps produced by a scan. For example, we could restrict the scan above to only produce anchors whose columns match the regular expression anchor:*.cnn.com, or to only produce anchors whose timestamps fall within ten days of the current time.

        图-3展示的C++代码使用Scanner的抽象类来遍历一个特定行中的所有锚链接。客户端可以遍历多个列族,并且还可以利用多种机制对一次扫描操作产生的行、列和时间戳进行过滤。例如,我们可以限制上述的扫描操作,使其只产生列名匹配正则表达式anchor:*.cnn.com的锚链接,或者只产生时间戳落在距当前时间十日之内的锚链接。

读取BigTable


Figure 3: Reading from Bigtable.
图-3:读取BigTable
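
论文原文的图-3同样是一段C++代码,这里根据论文内容整理如下(仅为示意,Scanner、ScanStream等为论文中BigTable客户端库的抽象):

```cpp
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  // 依次打印行名、列名、时间戳和单元格的值
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}
```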

    Bigtable supports several other features that allow the user to manipulate data in more complex ways. First, Bigtable supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key . Bigtable does not currently support general transactions across row keys, although it provides an interface for batching writes across row keys at the clients. Second, Bigtable allows cells to be used as integer counters. Finally, Bigtable supports the execution of client-supplied scripts in the address spaces of the servers. The scripts are written in a language developed at Google for processing data called Sawzall. At the moment, our Sawzall-based API does not allow client scripts to write back into Bigtable, but it does allow various forms of data transformation, filtering based on arbitrary expressions, and summarization via a variety of operators.

        BigTable还支持一些其他的特性,它们使得用户能够以更加复杂的方式来处理数据。首先,BigTable支持单行事务,它可以对存储在单个行关键字下的数据执行原子性的读取-修改-写入的操作序列。虽然BigTable为客户端提供了一个用于跨行关键字的批量写入接口,但是目前它还不支持跨行关键字的通用事务处理。其次,BigTable可以将单元格用作整数计数器。最后,BigTable可以在服务器的地址空间中执行客户端提供的脚本。可以使用由Google开发的一种名为Sawzall的编程语言来编写用于处理数据的脚本。现在,我们的基于Sawzall的API并不允许客户端脚本将数据写回至BigTable,但是它支持各种形式的数据转换、基于任意表达式的数据过滤和通过各种运算符实现的数据汇总。

    Bigtable can be used with MapReduce, a framework for running large-scale parallel computations developed at Google. We have written a set of wrappers that allow a Bigtable to be used both as an input source and as an output target for MapReduce jobs.

        BigTable可以和MapReduce一起使用,MapReduce是Google开发的一种运行大规模并行计算的框架。我们已经开发了一些包装器工具,既可以将BigTable作为MapReduce任务的数据输入源,也可以将BigTable作为MapReduce任务的数据输出目标。

4   构建块

    Bigtable is built on several other pieces of Google infrastructure. Bigtable uses the distributed Google File System (GFS) to store log and data files. A Bigtable cluster typically operates in a shared pool of machines that run a wide variety of other distributed applications, and Bigtable processes often share the same machines with processes from other applications. Bigtable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.

        BigTable是建立在其他几种Google的基础设施之上的。BigTable使用分布式的Google文件系统(GFS)来存储日志和数据文件。BigTable集群通常在一个共享的机器池中运行,这个机器池还会运行各种各样其他的分布式应用,BigTable的进程经常要和其他应用的进程共享使用同一台机器。BigTable依赖于一个集群管理系统,这个系统可用于任务调度、共享机器的资源管理、机器故障的处理和机器状态的监控。

    The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then reading the appropriate block from disk. Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disk.

        Google的SSTable文件格式可用于存储BigTable内部的数据。SSTable提供了一种持久化的、不可变的有序图,而图又是一种实现关键字和值映射的数据结构,关键字和值都可以是任意字节串。可以对SSTable进行如下的操作:查询和一个指定关键字相关联的值,以及在一个指定的关键字范围之内遍历所有的关键字/值的配对。从内部看,每个SSTable都包含一系列的数据块(通常,每个数据块的大小为64KB,但这是可以配置的)。SSTable使用数据块索引(存储在SSTable的末端)来定位数据块,当打开SSTable时,便会将索引载入内存。每次查询都可以通过一次磁盘搜索完成:我们首先会对内存中的索引执行一次二分搜索,找到合适的数据块,然后从磁盘读取这个合适的数据块。也可以选择将SSTable完整地映射至内存中,这样我们在查询和扫描数据时就不用访问磁盘了。
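
下面的C++代码粗略地示意了“先在内存索引上做二分搜索定位数据块,再读取该数据块”的查询过程(IndexEntry、FindBlock等名称均为本文为说明而假设的,并非SSTable的真实实现):

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// 示意:SSTable 打开时把数据块索引载入内存,查询时先在索引上做二分搜索定位数据块,
// 随后只需一次磁盘读取即可取回该数据块。
struct IndexEntry {
  std::string last_key;  // 该数据块中最大的关键字
  uint64_t offset;       // 数据块在文件中的偏移量
  uint64_t size;         // 数据块大小(默认约 64KB,可配置)
};

// 返回可能包含 key 的数据块在索引中的下标;找不到则返回 -1。
int FindBlock(const std::vector<IndexEntry>& index, const std::string& key) {
  auto it = std::lower_bound(
      index.begin(), index.end(), key,
      [](const IndexEntry& e, const std::string& k) { return e.last_key < k; });
  if (it == index.end()) return -1;
  return static_cast<int>(it - index.begin());
}

int main() {
  std::vector<IndexEntry> index = {
      {"apple", 0, 65536}, {"mango", 65536, 65536}, {"zebra", 131072, 65536}};
  int block = FindBlock(index, "banana");
  std::cout << "key 'banana' may reside in block " << block << std::endl;  // 输出 1
  return 0;
}
```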

    Bigtable relies on a highly-available and persistent distributed lock service called Chubby. A Chubby service consists of five active replicas, one of which is elected to be the master and actively serve requests. The service is live when a majority of the replicas are running and can communicate with each other. Chubby uses the Paxos algorithm to keep its replicas consistent in the face of failure. Chubby provides a namespace that consists of directories and small files. Each directory or file can be used as a lock, and reads and writes to a file are atomic. The Chubby client library provides consistent caching of Chubby files. Each Chubby client maintains a session with a Chubby service. A client’s session expires if it is unable to renew its session lease within the lease expiration time. When a client’s session expires, it loses any locks and open handles. Chubby clients can also register callbacks on Chubby files and directories for notification of changes or session expiration.

        BigTable还依赖于一种高可用的、持久化的分布式锁服务,称为Chubby。Chubby服务由五个活动的副本组成,其中的一个副本会被选为主副本(Master),负责主动处理请求。只有当大多数的副本正常运行,并且能够互相通信时,Chubby服务才是可用的。当遇到故障时,Chubby会使用Paxos算法来保证副本的一致性。Chubby提供了一个命名空间,这个命名空间由一系列的目录和小文件组成。每个目录或文件都可以当作一个锁,对一个文件的读写操作都是原子性的。Chubby客户端的程序库为Chubby文件提供了一致性缓存。每个Chubby客户端都维护了一个与Chubby服务的会话。如果在会话租约的过期时间之内,客户端没有续签它的会话租约,那么这个客户端的会话就会过期。当某个客户端的会话过期时,它就会丢失所有的锁和打开的句柄。Chubby客户端还可以在Chubby的文件和目录上注册回调函数,当文件和目录发生改变,或者会话过期时,回调函数便会通知客户端。

    Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data (see Section 5.1); to discover tablet servers and finalize tablet server deaths (see Section 5.2); to store Bigtable schema information (the column family information for each table); and to store access control lists. If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable. We recently measured this effect in 14 Bigtable clusters spanning 11 Chubby instances. The average percentage of Bigtable server hours during which some data stored in Bigtable was not available due to Chubby unavailability (caused by either Chubby outages or network issues) was 0.0047%. The percentage for the single cluster that was most affected by Chubby unavailability was 0.0326%.

        BigTable使用Chubby完成以下的几个任务:确保任何时候最多只有一个活动主机;存储BigTable数据的自引导指令的位置(请参考第5.1节);查找数据片服务器,并且在数据片服务器失效时进行善后(请参考第5.2节);存储BigTable的模式信息(每张表的列族信息);以及存储访问控制列表(ACL)。如果Chubby有较长的一段时间不能访问,那么BigTable就会变得不可用。最近,我们在横跨11个Chubby实例的14个BigTable集群中测量了这种影响。由于Chubby不可用而导致存储在BigTable中的部分数据不可用的平均比率是0.0047%(原因可能是Chubby停止运行或者网络问题)。单个集群受Chubby不可用的影响的最大百分比是0.0326%。

5   实现方法

    The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers. Tablet servers can be dynamically added (or removed) from a cluster to accommodate changes in workloads.

        BigTable的实现方法包含三个主要组件:一个链接至每个客户端中的程序库、一个主机服务器,以及很多个数据片服务器(Tablet Server)。可以动态地添加(或移除)集群中的数据片服务器,这样便能适应工作负载的变化。

    The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage collection of files in GFS. In addition, it handles schema changes such as table and column family creations.

        主机(Master)负责数据片服务器的数据片分配、数据片服务器的新增和失效检测、数据片服务器的负载均衡,以及GFS文件的垃圾收集。另外,它还会处理对模式的相关修改操作,例如创建表和列族。

    Each tablet server manages a set of tablets (typically we have somewhere between ten to a thousand tablets per tablet server). The tablet server handles read and write requests to the tablets that it has loaded, and also splits tablets that have grown too large.

        每个数据片服务器都会管理一系列的数据片(通常,我们的每台数据片服务器都有大约数十个至上千个数据片)。数据片服务器会处理针对已加载数据片的读写请求,并且还会将已经增长过大的数据片拆分为多个较小的数据片。

    As with many single-master distributed storage systems, client data does not move through the master: clients communicate directly with tablet servers for reads and writes. Because Bigtable clients do not rely on the master for tablet location information, most clients never communicate with the master. As a result, the master is lightly loaded in practice.

        如同很多单主机的分布式存储系统一样,客户端的数据都不会经过主机:当进行读写操作时,客户端会直接和数据片服务器展开通信。因为BigTable的客户端不需要依赖于主机就能取得数据片的位置信息,大多数的客户端从来不会与主机进行通信。在实际应用中,主机的负载是很轻的。

    A Bigtable cluster stores a number of tables. Each table consists of a set of tablets, and each tablet contains all data associated with a row range. Initially, each table consists of just one tablet. As a table grows, it is automatically split into multiple tablets, each approximately 100-200 MB in size by default.

        一个BigTable集群会存储很多张表。每张表都由一系列的数据片组成,每个数据片都包含与某个行区间相关联的所有数据。最初,每张表都只有一个数据片。随着表数据的增长,初始的数据片会自动拆分为多个数据片,每个数据片的大小默认为100-200MB。

5.1   数据片的位置

    We use a three-level hierarchy analogous to that of a B+-tree to store tablet location information (Figure 4).

        我们使用一个三级层次结构的、类似于B+树的数据结构来存储数据片的位置信息(如图-4所示)。

数据片位置的层次结构


Figure 4: Tablet location hierarchy.
图-4:数据片位置的层次结构

    The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. Each METADATA tablet contains the location of a set of user tablets. The root tablet is just the first tablet in the METADATA table, but is treated specially – it is never split – to ensure that the tablet location hierarchy has no more than three levels.

        第一层是一个存储在Chubby中的文件,它包含了根数据片(Root Tablet)的位置信息。根数据片包含了一个特殊的METADATA(元数据)表中的所有数据片的位置信息。每个METADATA数据片都包含了一组用户数据片的位置信息。根数据片正好就是METADATA表中的第一个数据片,但是对它的处理比较特殊 —— 它从来不会拆分 —— 这样才能确保数据片位置的层次结构不会超过三级。

    The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row. Each METADATA row stores approximately 1KB of data in memory. With a modest limit of 128 MB METADATA tablets, our three-level location scheme is sufficient to address 2^34 tablets (or 2^61 bytes in 128 MB tablets).

        METADATA表会将一个数据片的位置信息存储在一个行关键字之下,而这个行关键字是由这个数据片的所在表的标识符和这个数据片的最后一行编码而成的。每个METADATA行都会在内存中存储大约1KB的数据。如果将METADATA数据片的大小限制为适度的128MB,那么我们的三级位置模式足以寻址2^34个数据片(若按每个数据片128MB计算,即2^61字节的数据)。

    The client library caches tablet locations. If the client does not know the location of a tablet, or if it discovers that cached location information is incorrect, then it recursively moves up the tablet location hierarchy. If the client’s cache is empty, the location algorithm requires three network round-trips, including one read from Chubby. If the client’s cache is stale, the location algorithm could take up to six round-trips, because stale cache entries are only discovered upon misses (assuming that METADATA tablets do not move very frequently). Although tablet locations are stored in memory, so no GFS accesses are required, we further reduce this cost in the common case by having the client library prefetch tablet locations: it reads the metadata for more than one tablet whenever it reads the METADATA table.

        客户端程序库会缓存数据片的位置信息。如果客户端不知道某个数据片的位置信息,或者它发现缓存的位置信息是不正确的,那么它就会沿着数据片位置的层次结构逐级向上查找。如果客户端的缓存是空的,那么寻址算法需要三个来回的网络通信才能完成寻址,其中包括一次对Chubby的读取操作。如果客户端的缓存数据失效了,那么寻址算法最多可能需要六个来回的网络通信才能完成寻址,因为失效的缓存条目只有在缓存命中失败时才会被发现[6](这里假设METADATA数据片不会频繁地移动)。虽然数据片的位置信息是存储在内存中的,不需要访问GFS文件系统,但是通过让客户端程序库预取数据片的位置信息,我们还可以在常见情况下进一步减少这种开销:客户端每次读取METADATA表时,都会读取多个数据片的元数据。
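
下面的C++代码示意了客户端三级寻址与本地缓存配合的大致流程(其中的类型、函数和服务器名均为说明用的假设,实际实现还涉及缓存失效后逐级向上重试等细节):

```cpp
#include <iostream>
#include <map>
#include <optional>
#include <string>

// 示意:三级寻址(Chubby 文件 -> 根数据片 -> METADATA 数据片 -> 用户数据片),
// 并用一个本地缓存避免重复寻址。
struct TabletLocation { std::string tablet_server; };

class LocationCache {
 public:
  std::optional<TabletLocation> Get(const std::string& row_key) const {
    auto it = cache_.find(row_key);
    if (it == cache_.end()) return std::nullopt;
    return it->second;
  }
  void Put(const std::string& row_key, const TabletLocation& loc) { cache_[row_key] = loc; }
 private:
  std::map<std::string, TabletLocation> cache_;
};

// 假设的三次网络访问:读 Chubby、读根数据片、读 METADATA 数据片。
TabletLocation ResolveFromHierarchy(const std::string& row_key, int& round_trips) {
  round_trips += 1;  // 从 Chubby 文件读取根数据片的位置
  round_trips += 1;  // 在根数据片中查找对应的 METADATA 数据片
  round_trips += 1;  // 在 METADATA 数据片中查找用户数据片的位置
  return TabletLocation{"tabletserver-17"};  // 假设的服务器名
}

int main() {
  LocationCache cache;
  int round_trips = 0;
  const std::string row = "com.cnn.www";
  if (!cache.Get(row)) {
    cache.Put(row, ResolveFromHierarchy(row, round_trips));
  }
  std::cout << "cold cache round trips: " << round_trips << std::endl;  // 3
  return 0;
}
```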

    We also store secondary information in the METADATA table, including a log of all events pertaining to each tablet (such as when a server begins serving it). This information is helpful for debugging and performance analysis.

        我们还会在METADATA表中存储次级信息,包括每个数据片所有的事件日志(例如,某台服务器何时为数据片提供服务)。这种信息对于调试和性能分析非常有帮助。

5.2   数据片的分配

    Each tablet is assigned to one tablet server at a time. The master keeps track of the set of live tablet servers, and the current assignment of tablets to tablet servers, including which tablets are unassigned. When a tablet is unassigned, and a tablet server with sufficient room for the tablet is available, the master assigns the tablet by sending a tablet load request to the tablet server.

        在任何时候,每个数据片只能分配至一个数据片服务器。主机(Master)会记录当前有哪些活跃的数据片服务器、当前有哪些数据片分配至哪些数据片服务器,以及哪些数据片尚未被分配。当有一个数据片尚未被分配,并且刚好有一个数据片服务器具备足够的空间能够存储这个数据片,那么主机便会向这个数据片服务器发送一个加载请求,这样便能将这个数据片分配至这个服务器了。

    Bigtable uses Chubby to keep track of tablet servers. When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory. The master monitors this directory (the servers directory) to discover tablet servers. A tablet server stops serving its tablets if it loses its exclusive lock: e.g., due to a network partition that caused the server to lose its Chubby session. (Chubby provides an efficient mechanism that allows a tablet server to check whether it still holds its lock without incurring network traffic.) A tablet server will attempt to reacquire an exclusive lock on its file as long as the file still exists. If the file no longer exists, then the tablet server will never be able to serve again, so it kills itself. Whenever a tablet server terminates (e.g., because the cluster management system is removing the tablet server’ s machine from the cluster), it attempts to release its lock so that the master will reassign its tablets more quickly.

        BigTable使用Chubby跟踪记录数据片服务器的状态。当一个数据片服务器启动时,它会在一个特定的Chubby目录中创建一个具有唯一名称的文件,然后便能获取一个该文件的排它锁。主机会监控这个目录(servers目录),这样便能发现新加入的数据片服务器。如果一个数据片服务器丢失了自己的排它锁,那么它便会停止数据片服务:例如,由于网络分区而导致这台服务器丢失了自己的Chubby会话。(Chubby提供了一种有效的机制,使得数据片服务器不用耗费网络流量就可以检查自己是否仍然持有排它锁。)只要上述文件仍然存在,数据片服务器将会尝试重新获取这个文件的排它锁。如果这个文件不再存在了,那么这个数据片服务器将不能再次提供服务,因此它便会停止运行。当一个数据片服务器终止运行时(例如,因为集群管理系统正在将这个数据片服务器的机器从集群中移除),它会尝试释放自己的排它锁,这样主机将可以更加快速地重新分配它的数据片。

    The master is responsible for detecting when a tablet server is no longer serving its tablets, and for reassigning those tablets as soon as possible. To detect when a tablet server is no longer serving its tablets, the master periodically asks each tablet server for the status of its lock. If a tablet server reports that it has lost its lock, or if the master was unable to reach a server during its last several attempts, the master attempts to acquire an exclusive lock on the server’s file. If the master is able to acquire the lock, then Chubby is live and the tablet server is either dead or having trouble reaching Chubby, so the master ensures that the tablet server can never serve again by deleting its server file. Once a server’s file has been deleted, the master can move all the tablets that were previously assigned to that server into the set of unassigned tablets. To ensure that a Bigtable cluster is not vulnerable to networking issues between the master and Chubby, the master kills itself if its Chubby session expires. However, as described above, master failures do not change the assignment of tablets to tablet servers.

        主机(Master)负责检测数据片服务器何时不再提供数据片服务,并且会尽快将这些数据片进行重新分配。为了检测数据片服务器何时不再提供数据片服务,主机会定期地向每个数据片服务器轮询它的排它锁状态。如果某个数据片服务器报告它已经丢失了自己的排它锁,或者如果主机最近几次尝试和某个数据片服务器进行通信都不成功,那么主机便会尝试获取这个服务器文件的排它锁。如果主机能够获取这个排它锁,那么就说明Chubby是正常运行的,而数据片服务器要么是宕机了,要么是和Chubby之间有通信故障,因此,主机就会删除这个数据片服务器的文件,确保这个数据片服务器再也不能提供服务了。一旦删除了数据片服务器的文件,主机就会将先前分配给这个数据片服务器的所有数据片变为未分配的状态。为了确保主机和Chubby之间的网络问题不会影响BigTable集群的正常运行,如果主机的Chubby会话已过期,那么它便会停止自身的运行。然而,正如上文所述,主机故障不会改变数据片到数据片服务器的分配状态。

    When a master is started by the cluster management system, it needs to discover the current tablet assignments before it can change them. The master executes the following steps at startup. (1) The master grabs a unique master lock in Chubby, which prevents concurrent master instantiations. (2) The master scans the servers directory in Chubby to find the live servers. (3) The master communicates with every live tablet server to discover what tablets are already assigned to each server. (4) The master scans the METADATA table to learn the set of tablets. Whenever this scan encounters a tablet that is not already assigned, the master adds the tablet to the set of unassigned tablets, which makes the tablet eligible for tablet assignment.

        当集群管理系统启动一个主机时,在修改数据片分配状态之前,它需要先了解当前的数据片分配状态。主机会在启动阶段执行以下步骤:(1)主机会从Chubby获取一个唯一的主机锁,用于防止并发的主机实例化。(2)主机会扫描Chubby的servers目录,找到活跃的服务器。(3)主机会和每个活跃的数据片服务器进行通信,获取每台服务器上的数据片分配状态信息。(4)主机会扫描METADATA表,获取数据片的集合信息。当这种扫描操作遇到一个尚未分配的数据片时,主机便会将这个数据片添加至未分配数据片集合,这样便使得这个数据片符合数据片分配的条件。

    One complication is that the scan of the METADATA table cannot happen until the METADATA tablets have been assigned. Therefore, before starting this scan (step 4), the master adds the root tablet to the set of unassigned tablets if an assignment for the root tablet was not discovered during step 3. This addition ensures that the root tablet will be assigned. Because the root tablet contains the names of all METADATA tablets, the master knows about all of them after it has scanned the root tablet.

        可能会遇到一种复杂的情况:如果METADATA数据片尚未被分配,那么主机是不能扫描METADATA表的。因此,在开始扫描(步骤-4)之前,如果在步骤-3的执行过程中发现没有分配根数据片(Root Tablet),那么主机就会将根数据片添加至未分配的数据片集合中。这个额外操作能够确保根数据片一定会被分配。因为,根数据片包含了所有METADATA数据片的名称,待主机扫描根数据片之后,主机就会获得所有METADATA数据片的名称了。

    The set of existing tablets only changes when a table is created or deleted, two existing tablets are merged to form one larger tablet, or an existing tablet is split into two smaller tablets. The master is able to keep track of these changes because it initiates all but the last. Tablet splits are treated specially since they are initiated by a tablet server. The tablet server commits the split by recording information for the new tablet in the METADATA table. When the split has committed, it notifies the master. In case the split notification is lost (either because the tablet server or the master died), the master detects the new tablet when it asks a tablet server to load the tablet that has now split. The tablet server will notify the master of the split, because the tablet entry it finds in the METADATA table will specify only a portion of the tablet that the master asked it to load.

        已有数据片的集合只有在以下情况下才会发生改变:创建或删除表、两个已有的数据片合并为一个较大的数据片,或者一个已有的数据片被拆分为两个较小的数据片。除了最后一种情况之外,其他的改变都是由主机发起的,所以主机能够跟踪记录这些改变。数据片的拆分操作需要特殊对待,因为它是由数据片服务器发起的。数据片服务器会在METADATA表中记录新数据片的信息,以此提交这次拆分操作。当拆分操作提交之后,数据片服务器便会通知主机。如果这个拆分通知丢失了(可能是因为数据片服务器或主机宕机了),那么当主机要求某个数据片服务器加载这个已经被拆分的数据片时,主机便能检测到新的数据片。这个数据片服务器会重新向主机通知这次拆分,因为它在METADATA表中找到的数据片条目只覆盖了主机要求它加载的数据片的一部分。

5.3   数据片的服务

    The persistent state of a tablet is stored in GFS, as illustrated in Figure 5. Updates are committed to a commit log that stores redo records. Of these updates, the recently committed ones are stored in memory in a sorted buffer called a memtable; the older updates are stored in a sequence of SSTables. To recover a tablet, a tablet server reads its metadata from the METADATA table. This metadata contains the list of SSTables that comprise a tablet and a set of a redo points, which are pointers into any commit logs that may contain data for the tablet. The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.

        数据片的持久化状态存储在GFS中,如图-5所示。更新操作会被提交至一个提交日志文件中,这个文件用于存储重做记录(Redo Record)。在这些更新操作中,最近提交的更新操作存储在内存中的一个排序缓冲区中,这个缓冲区被称为memtable;较老的更新操作存储在一系列的SSTable中。若要恢复一个数据片,则数据片服务器就需要从METADATA表中读取它的元数据。数据片的元数据包含了组成一个数据片的SSTable列表,以及一系列的重做点(Redo Point),这些重做点指向可能含有这个数据片的数据的提交日志。数据片服务器会将SSTable的索引读取至内存中,然后通过执行这些重做点之后提交的所有更新操作就可以重建memtable了。

数据片的表示方法


Figure 5: Tablet Representation
图-5:数据片的表示方法
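
下面的C++代码粗略地示意了上文描述的恢复过程中“只重放重做点之后的日志记录来重建memtable”这一步(RedoRecord等类型和字段均为本文为说明而假设的):

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// 示意:数据片恢复时,先从元数据中取得 SSTable 列表与重做点,
// 再把重做点之后提交的日志记录重放到 memtable 中。
struct RedoRecord {
  int64_t sequence;  // 提交日志中的序列号
  std::string row_key;
  std::string column;
  std::string value;
};

using Memtable = std::map<std::string, std::map<std::string, std::string>>;

Memtable RecoverMemtable(const std::vector<RedoRecord>& commit_log, int64_t redo_point) {
  Memtable memtable;
  for (const RedoRecord& r : commit_log) {
    if (r.sequence > redo_point) {  // 只重放重做点之后的更新
      memtable[r.row_key][r.column] = r.value;
    }
  }
  return memtable;
}

int main() {
  std::vector<RedoRecord> log = {
      {1, "com.cnn.www", "anchor:a.com", "old"},  // 重做点之前,已固化到 SSTable 中
      {2, "com.cnn.www", "anchor:b.com", "CNN"},  // 重做点之后,需要重放
  };
  Memtable m = RecoverMemtable(log, /*redo_point=*/1);
  std::cout << "recovered cells: " << m["com.cnn.www"].size() << std::endl;  // 1
  return 0;
}
```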

    When a write operation arrives at a tablet server, the server checks that it is well-formed, and that the sender is authorized to perform the mutation. Authorization is performed by reading the list of permitted writers from a Chubby file (which is almost always a hit in the Chubby client cache). A valid mutation is written to the commit log. Group commit is used to improve the throughput of lots of small mutations. After the write has been committed, its contents are inserted into the memtable.

        当一个数据片服务器收到一条写入操作时,它就会检查这个写入操作的格式是否正确,以及发送方是否具有执行修改操作的权限。数据片服务器会从一个Chubby文件中读取许可的写入者列表(这个文件几乎总是会命中Chubby的客户端缓存),这样就能对写入操作的发送方进行身份验证了。一次有效的修改操作会先写入提交日志文件。我们会采用分组提交的方式来提高大量小规模修改操作的吞吐量。当写入操作提交之后,它的内容便会插入至memtable中。

    When a read operation arrives at a tablet server, it is similarly checked for well-formedness and proper authorization. A valid read operation is executed on a merged view of the sequence of SSTables and the memtable. Since the SSTables and the memtable are lexicographically sorted data structures, the merged view can be formed efficiently.

        当一个数据片服务器收到一条读取操作时,和处理写入操作类似,它会检查这个读取操作的格式是否正确,以及发送方是否具有执行读取操作的权限。一次有效的读取操作会读取一个由SSTable序列和memtable组成的合并视图。因为SSTable和memtable都是按字典排序的数据结构,所以可以高效地生成合并视图。
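
下面的C++代码用按字典序排序的std::map来示意“在memtable和SSTable序列的合并视图上执行读取”的思路(仅为概念示意,真实实现会使用归并迭代器等更高效的方式):

```cpp
#include <iostream>
#include <map>
#include <optional>
#include <string>
#include <vector>

// 示意:读取操作在 memtable 和一系列 SSTable 组成的"合并视图"上执行,
// 较新的数据会覆盖较旧的数据。这里用按字典序排序的 std::map 模拟这两种结构。
using SortedKv = std::map<std::string, std::string>;

std::optional<std::string> MergedLookup(const SortedKv& memtable,
                                        const std::vector<SortedKv>& sstables,  // 从新到旧
                                        const std::string& key) {
  if (auto it = memtable.find(key); it != memtable.end()) return it->second;
  for (const SortedKv& sst : sstables) {
    if (auto it = sst.find(key); it != sst.end()) return it->second;
  }
  return std::nullopt;
}

int main() {
  SortedKv memtable = {{"row2", "new-value"}};
  std::vector<SortedKv> sstables = {{{"row1", "v1"}, {"row2", "old-value"}}};
  std::cout << *MergedLookup(memtable, sstables, "row2") << std::endl;  // new-value
  return 0;
}
```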

    Incoming read and write operations can continue while tablets are split and merged.

        当数据片进行分拆和合并时,数据片服务器仍然可以继续处理收到的读取操作和写入操作。

5.4   空间压缩

    As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. Incoming read and write operations can continue while compactions occur.

        随着写入操作的执行,memtable的尺寸也会不断增大。当memtable的尺寸达到一个阈值时,这个memtable就会被冻结,然后会创建一个新的memtable,然后会将这个被冻结的memtable转换成一个SSTable,最后将其写入至GFS文件系统。这种次级压缩(Minor Compaction)处理有两个目的:它能够减少数据片服务器的内存使用率,如果这个服务器宕机了,它还能减少恢复期间需要从提交日志读取的数据总量。当空间压缩正在进行时,数据片服务器仍然可以继续处理收到的读取操作和写入操作。
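
下面的C++代码粗略地示意了次级压缩的状态转换:memtable达到阈值后被冻结并转换为一个不可变的SSTable,同时创建一个新的memtable(仅为内存中的概念示意,省略了写入GFS等细节,阈值等参数为演示而假设):

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// 示意:memtable 达到阈值时触发一次次级压缩。
using SortedKv = std::map<std::string, std::string>;

struct TabletState {
  SortedKv memtable;
  std::vector<SortedKv> sstables;  // 次级压缩产生的 SSTable,从新到旧
  size_t memtable_threshold = 2;   // 为演示而设的极小阈值

  void Write(const std::string& key, const std::string& value) {
    memtable[key] = value;
    if (memtable.size() >= memtable_threshold) MinorCompaction();
  }

  void MinorCompaction() {
    sstables.insert(sstables.begin(), memtable);  // 冻结并转换为不可变的 SSTable
    memtable.clear();                             // 创建新的 memtable 继续接收写入
  }
};

int main() {
  TabletState tablet;
  tablet.Write("a", "1");
  tablet.Write("b", "2");  // 触发一次次级压缩
  tablet.Write("c", "3");
  std::cout << "sstables: " << tablet.sstables.size()
            << ", memtable entries: " << tablet.memtable.size() << std::endl;  // 1, 1
  return 0;
}
```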

    Every minor compaction creates a new SSTable. If this behavior continued unchecked, read operations might need to merge updates from an arbitrary number of SSTables. Instead, we bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.

        每次的次级压缩处理都会创建一个新的SSTable。如果这种情况不加限制地持续下去,那么读取操作可能就需要合并来自任意数量的SSTable的更新数据。因此,我们会定期在后台执行一次合并压缩(Merging Compaction),借此限制这类文件的数量。合并压缩操作会读取若干个SSTable和memtable的内容,然后写出一个新的SSTable。一旦合并压缩操作结束,就可以将作为输入的SSTable和memtable丢弃了。

    A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction. SSTables produced by non-major compactions can contain special deletion entries that suppress deleted data in older SSTables that are still live. A major compaction, on the other hand, produces an SSTable that contains no deletion information or deleted data. Bigtable cycles through all of its tablets and regularly applies major compactions to them. These major compactions allow Bigtable to reclaim resources used by deleted data, and also allow it to ensure that deleted data disappears from the system in a timely fashion, which is important for services that store sensitive data.

        如果合并压缩会将所有的SSTable都合并重写至一个新的SSTable中,那么这次的合并压缩就被称为一次主干压缩(Major Compaction)。由非主干压缩产生的SSTable可能含有特殊的删除条目,这些删除条目会消除仍然活跃的、较老的SSTable中的被删除数据。另一方面,主干压缩也会产生一个SSTable,但是它不会包含删除信息或被删除数据。BigTable会循环扫描它所有的数据片,并且会定期地对它们进行主干压缩。这些主干压缩操作使得BigTable能够回收被删除数据使用的资源,并且还能够确保被删除的数据能够及时地从系统中消失[7],这对于存储敏感数据的服务来说是非常重要的。

6   优化

    The implementation described in the previous section required a number of refinements to achieve the high performance, availability, and reliability required by our users. This section describes portions of the implementation in more detail in order to highlight these refinements.

        在前一节中描述的实现方法还需要进行一系列的优化才能达到我们的用户对高性能、高可用性和高可靠性的要求。本节会更加详细地描述BigTable的实现方法,着重强调以下的优化方法。

6.1   本地分组

    Clients can group multiple column families together into a locality group. A separate SSTable is generated for each locality group in each tablet. Segregating column families that are not typically accessed together into separate locality groups enables more efficient reads. For example, page metadata in Webtable (such as language and checksums) can be in one locality group, and the contents of the page can be in a different group: an application that wants to read the metadata does not need to read through all of the page contents.

        客户端可以将多个列族集中在一起,形成一个本地分组(locality group)。对每个数据片(Tablet)中的每个本地分组都会生成一个单独的SSTable。将通常不会一起访问的列族分割为若干个独立的本地分组,这样可以提高读取操作的效率。例如,Webtable中的页面元数据(例如语言和校验和)可以存储在一个本地分组中,网页内容可以存储在另一个不同的本地分组中:一个只想要读取元数据的应用程序不需要读取所有的网页内容。

    In addition, some useful tuning parameters can be specified on a per-locality group basis. For example, a locality group can be declared to be in-memory. SSTables for in-memory locality groups are loaded lazily into the memory of the tablet server. Once loaded, column families that belong to such locality groups can be read without accessing the disk. This feature is useful for small pieces of data that are accessed frequently: we use it internally for the location column family in the METADATA table.

        另外,还可以以本地分组为单位,指定一些有用的调优参数。例如,可以将一个本地分组声明为在内存中存储。存储在内存中的本地分组的SSTable会被延迟加载至数据片服务器(Tablet Server)的内存之中。一旦加载完成,那么读取属于这个本地分组的列族时就不用访问磁盘了。这种特性对于需要频繁访问的小段数据非常有用:在BigTable内部,我们将METADATA表中存放位置信息的列族(location列族)设置为在内存中存储。

6.2   压缩

    Clients can control whether or not the SSTables for a locality group are compressed, and if so, which compression format is used. The user-specified compression format is applied to each SSTable block (whose size is controllable via a locality group specific tuning parameter). Although we lose some space by compressing each block separately, we benefit in that small portions of an SSTable can be read without decompressing the entire file. Many clients use a two-pass custom compression scheme. The first pass uses Bentley and McIlroy’s scheme, which compresses long common strings across a large window. The second pass uses a fast compression algorithm that looks for repetitions in a small 16 KB window of the data. Both compression passes are very fast–they encode at 100-200 MB/s, and decode at 400-1000 MB/s on modern machines.

        客户端能够控制是否要压缩某个本地分组的SSTable,如果需要压缩,那么还能够指定需要使用的压缩格式。用户指定的压缩格式会应用于每个SSTable数据块(通过一个本地分组专用的调优参数可以控制SSTable数据块的大小)。虽然每个数据块单独压缩会损失一些压缩空间[8],但是这样做的好处是,如果我们只需要读取某个SSTable的一小部分数据,那么就不必解压缩整个文件了。很多客户端都使用一种两遍式的自定义压缩方案。第一遍会使用Bentley和McIlroy的方案,这个方案会在一个较大的窗口内压缩较长的公共字符串。第二遍会使用一种快速的压缩算法,这种算法会在一个16 KB的小窗口中查找重复的数据。这两遍压缩都非常快,在现代的机器上,它们的编码速率能达到100-200 MB/s,解码速率能达到400-1000 MB/s。

    Even though we emphasized speed instead of space reduction when choosing our compression algorithms, this two-pass compression scheme does surprisingly well. For example, in Webtable, we use this compression scheme to store Web page contents. In one experiment, we stored a large number of documents in a compressed locality group. For the purposes of the experiment, we limited ourselves to one version of each document instead of storing all versions available to us. The scheme achieved a 10-to-1 reduction in space. This is much better than typical Gzip reductions of 3-to-1 or 4-to-1 on HTML pages because of the way Webtable rows are laid out: all pages from a single host are stored close to each other. This allows the Bentley-McIlroy algorithm to identify large amounts of shared boilerplate in pages from the same host. Many applications, not just Webtable, choose their row names so that similar data ends up clustered, and therefore achieve very good compression ratios. Compression ratios get even better when we store multiple versions of the same value in Bigtable.

        当我们选择压缩算法时,虽然我们更看重速度而不是压缩后的空间大小,但是这种两遍式的压缩方案有着令人吃惊的出色表现。例如,在Webtable中,我们会使用这种压缩方案来存储网页内容。在一次实验中,我们在一个已压缩的本地分组中存储了海量的文档。为了达到实验的目的,我们只存储了每个文档的一个版本,而没有存储所有可用的版本。这种压缩方案达到了10:1的空间压缩比率。这要比Gzip在HTML网页上通常只能达到的3:1或4:1的压缩率好得多,其原因在于Webtable行的布局方式:所有来自于同一个主机的网页都会存储在相邻的位置。这就使得Bentley-McIlroy算法能够在来自于相同主机的网页中识别出大量相同的模板内容。不仅仅是Webtable,很多应用程序都会选择合适的行名,使得相似的数据能够聚集在一起,因此就能达到非常高的压缩率。当我们在BigTable中存储相同数据的多个版本时,压缩率甚至会更高。

6.3   通过缓存提高读取性能

    To improve read performance, tablet servers use two levels of caching. The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. The Block Cache is a lower-level cache that caches SSTables blocks that were read from GFS. The Scan Cache is most useful for applications that tend to read the same data repeatedly. The Block Cache is useful for applications that tend to read data that is close to the data they recently read (e.g., sequential reads, or random reads of different columns in the same locality group within a hot row).

        为了改善读取性能,数据片服务器(Tablet Server)可以使用两层缓存。扫描缓存(Scan Cache)是一种较高层的缓存,用于缓存SSTable接口向数据片服务器代码返回的键值(Key-Value)对。数据块缓存是一种较底层的缓存,用于缓存从GFS(Google文件系统)读取的SSTable数据块。对于经常需要重复读取相同数据的应用程序来说,扫描缓存非常有用。对于经常需要读取最近读取过的数据(例如,顺序读取,或者随机读取一个热点行的相同本地分组中的不同的列)附近的数据的应用程序来说,数据块缓存非常有用。

6.4   布隆过滤器[9]

    As described in Section 5.3, a read operation has to read from all SSTables that make up the state of a tablet. If these SSTables are not in memory, we may end up doing many disk accesses. We reduce the number of accesses by allowing clients to specify that Bloom filters should be created for SSTables in a particular locality group. A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair. For certain applications, a small amount of tablet server memory used for storing Bloom filters drastically reduces the number of disk seeks required for read operations. Our use of Bloom filters also implies that most lookups for non-existent rows or columns do not need to touch disk.

        正如第5.3节所述,一次读取操作必须读取构成一个数据片状态的所有SSTable的数据。如果这些SSTable不在内存中,我们可能会执行很多次磁盘访问。在一个特定的本地分组中,客户端可以为指定的SSTable创建布隆过滤器,这样我们便能减少磁盘访问的次数。布隆过滤器使得我们能够查询某个SSTable是否包含一个指定的行/列对的任意数据。对于某些特定的应用程序来说,只需要使用数据片服务器的少量内存来存储布隆过滤器,就可以大大地减少读取操作所需的磁盘寻道次数。我们对于布隆过滤器的使用方法也意味着大多数对不存在的行或列的查找操作不需要接触到磁盘。
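
下面是一个极简的布隆过滤器C++示意实现,演示“读取前先询问某个SSTable是否可能包含指定的行/列对”的用法(位数、哈希个数等参数仅为演示而假设,并非BigTable实际使用的参数):

```cpp
#include <bitset>
#include <functional>
#include <iostream>
#include <string>

// 示意:为某个本地分组的 SSTable 维护一个布隆过滤器,读取前先询问
// "这个 SSTable 是否可能包含指定的行/列对",从而避免不必要的磁盘访问。
class BloomFilter {
 public:
  void Add(const std::string& key) {
    for (size_t i = 0; i < kNumHashes; ++i) bits_.set(Hash(key, i));
  }
  bool MayContain(const std::string& key) const {
    for (size_t i = 0; i < kNumHashes; ++i)
      if (!bits_.test(Hash(key, i))) return false;  // 一定不存在
    return true;                                    // 可能存在(允许误报)
  }
 private:
  static constexpr size_t kNumBits = 1024;
  static constexpr size_t kNumHashes = 3;
  size_t Hash(const std::string& key, size_t seed) const {
    return std::hash<std::string>{}(key + '#' + std::to_string(seed)) % kNumBits;
  }
  std::bitset<kNumBits> bits_;
};

int main() {
  BloomFilter filter;
  filter.Add("com.cnn.www|anchor:cnnsi.com");  // 以"行|列"作为键
  std::cout << filter.MayContain("com.cnn.www|anchor:cnnsi.com") << std::endl;  // 1
  std::cout << filter.MayContain("com.cnn.www|anchor:unknown") << std::endl;    // 大概率为 0
  return 0;
}
```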

6.5   提交日志的实现方法

    If we kept the commit log for each tablet in a separate log file, a very large number of files would be written concurrently in GFS. Depending on the underlying file system implementation on each GFS server, these writes could cause a large number of disk seeks to write to the different physical log files. In addition, having separate log files per tablet also reduces the effectiveness of the group commit optimization, since groups would tend to be smaller. To fix these issues, we append mutations to a single commit log per tablet server, co-mingling mutations for different tablets in the same physical log file.

        如果我们将每个数据片的提交日志保存在一个独立的日志文件中,那么就有可能会在GFS中并发写入海量的文件。根据每台GFS服务器上的底层文件系统的实现方法,这些写入操作可能会导致大量的磁盘寻道,这样才能写入至不同的日志文件中。另外,每个数据片都有独立的日志文件也会降低分组提交(Group Commit)的优化效率,因为分组有可能较小。为了解决这些问题,我们设置每个数据片服务器都只有一个提交日志文件,然后将修改操作的日志以追加的方式写入同一个日志文件中,因此一个实际的日志文件会混合存储多个数据片的修改操作的日志。

    Using one log provides significant performance benefits during normal operation, but it complicates recovery. When a tablet server dies, the tablets that it served will be moved to a large number of other tablet servers: each server typically loads a small number of the original server’s tablets. To recover the state for a tablet, the new tablet server needs to reapply the mutations for that tablet from the commit log written by the original tablet server. However, the mutations for these tablets were co-mingled in the same physical log file. One approach would be for each new tablet server to read this full commit log file and apply just the entries needed for the tablets it needs to recover. However, under such a scheme, if 100 machines were each assigned a single tablet from a failed tablet server, then the log file would be read 100 times (once by each server).

        使用单个日志能够显著提高正常操作的性能,但是会使恢复操作复杂化。当一个数据片服务器(Tablet Server)宕机时,由它加载的数据片将会被移动至很多其他的数据片服务器上:每个服务器通常只会加载原来的服务器的少量数据片。为了恢复某个数据片的状态,新的数据片服务器需要从由原来的数据片服务器写入的提交日志(Commit Log)中重新执行这个数据片的修改操作。然而,这些数据片的修改操作信息是混合存储在同一个实际的日志文件中的。一种可行的方法是让每个新的数据片服务器都读取完整的提交日志文件,然后只重放它需要恢复的数据片所对应的日志条目。然而,在这种方案下,如果有100台机器,每台机器都从同一个失效的数据片服务器分配到一个数据片,那么这个日志文件将会被读取100次(每台服务器读取一次)。

    We avoid duplicating log reads by first sorting the commit log entries in order of the keys <table, row name, log sequence number>. In the sorted output, all mutations for a particular tablet are contiguous and can therefore be read efficiently with one disk seek followed by a sequential read. To parallelize the sorting, we partition the log file into 64 MB segments, and sort each segment in parallel on different tablet servers. This sorting process is coordinated by the master and is initiated when a tablet server indicates that it needs to recover mutations from some commit log file.

        为了避免重复读取日志,我们首先会按照关键字<表、行名、日志序列号>的顺序,对提交日志的条目进行排序。在排序输出中,某个特定数据片的所有修改操作信息都会连续存放在一起。因此,只需要一次磁盘寻道,然后再执行一次连续的读取操作,就可以高效地读取日志文件了。为了并行化排序操作,我们会将日志文件划分为多个64 MB大小的分段,然后在不同的数据片服务器上以并行的方式排序每个日志分段。这个排序过程是由主机(Master)进行协调的,当某个数据片服务器表明自己需要从某个提交日志文件恢复数据片的修改操作时,就会开始执行这个过程。
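
下面的C++代码示意了按<表、行名、日志序列号>对提交日志条目进行排序的做法(LogEntry的字段名为本文为说明而假设的):

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <tuple>
#include <vector>

// 示意:恢复前先按 <表, 行名, 日志序列号> 对提交日志条目排序,
// 使同一个数据片的修改操作在排序输出中连续存放。
struct LogEntry {
  std::string table;
  std::string row_name;
  int64_t log_sequence;
  std::string mutation;
};

void SortForRecovery(std::vector<LogEntry>& entries) {
  std::sort(entries.begin(), entries.end(), [](const LogEntry& a, const LogEntry& b) {
    return std::tie(a.table, a.row_name, a.log_sequence) <
           std::tie(b.table, b.row_name, b.log_sequence);
  });
}

int main() {
  std::vector<LogEntry> log = {
      {"Webtable", "com.cnn.www", 7, "set anchor"},
      {"Other",    "rowA",        3, "set x"},
      {"Webtable", "com.cnn.www", 5, "set contents"},
  };
  SortForRecovery(log);
  for (const auto& e : log)
    std::cout << e.table << " " << e.row_name << " " << e.log_sequence << "\n";
  return 0;
}
```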

    Writing commit logs to GFS sometimes causes performance hiccups for a variety of reasons (e.g., a GFS server machine involved in the write crashes, or the network paths traversed to reach the particular set of three GFS servers is suffering network congestion, or is heavily loaded). To protect mutations from GFS latency spikes, each tablet server actually has two log writing threads, each writing to its own log file; only one of these two threads is actively in use at a time. If writes to the active log file are performing poorly , the log file writing is switched to the other thread, and mutations that are in the commit log queue are written by the newly active log writing thread. Log entries contain sequence numbers to allow the recovery process to elide duplicated entries resulting from this log switching process.

        将提交日志写入至GFS有时可能会导致性能抖动,这是由多方面原因造成的(例如,参与写入操作的某台GFS服务器宕机了;或者,通往特定的三台GFS服务器的网络路径出现了拥塞或负载过重)。为了避免GFS的延迟飙升影响修改操作的执行,每个数据片服务器实际上都会有两个日志写入线程,每个线程都会写入至各自的日志文件;并且在任何时刻,这两个写入线程只有一个处于工作状态。如果当前活动的日志文件写入性能很差,那么就会切换至另一个日志写入线程,由这个新切换的线程将提交日志队列中的修改操作信息写入至日志文件中。日志条目中包含序列号,这使得恢复过程能够忽略由这种日志切换过程产生的重复条目。

6.6   数据片恢复提速

    If the master moves a tablet from one tablet server to another, the source tablet server first does a minor compaction on that tablet. This compaction reduces recovery time by reducing the amount of uncompacted state in the tablet server’s commit log. After finishing this compaction, the tablet server stops serving the tablet. Before it actually unloads the tablet, the tablet server does another (usually very fast) minor compaction to eliminate any remaining uncompacted state in the tablet server’s log that arrived while the first minor compaction was being performed. After this second minor compaction is complete, the tablet can be loaded on another tablet server without requiring any recovery of log entries.

        如果主机(Master)要将某个数据片从一个数据片服务器移动至另一个,那么源数据片服务器首先会对这个数据片执行一次次级压缩(Minor Compaction)。这种压缩操作能够减少数据片服务器的提交日志中的未压缩状态的数据总量,从而能够减少恢复操作需要的时间。在这次压缩操作结束之后,这个数据片服务器便会停止提供这个数据片的服务。在它实际卸载这个数据片之前,这个数据片服务器还会执行另一次(通常会很快)次级压缩,这样便能降低在执行首次次级压缩时,又在这个数据片服务器的提交日志中产生的所有未压缩状态的数据总量。在第二次次级压缩完成之后,另一个数据片服务器便可以加载这个数据片了,而且不需要对日志条目执行任何恢复操作。

6.7   利用不可变性

    Besides the SSTable caches, various other parts of the Bigtable system have been simplified by the fact that all of the SSTables that we generate are immutable. For example, we do not need any synchronization of accesses to the file system when reading from SSTables. As a result, concurrency control over rows can be implemented very efficiently. The only mutable data structure that is accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.

        除了SSTable的缓存之外,我们生成的SSTable实际上都是不可变的,利用这种特性,还可以对BigTable系统的其他各个部分进行简化。例如,当读取SSTable的数据时,我们不必对文件系统的访问进行任何同步操作。因此,可以非常高效地实现对行的并发控制。memtable是唯一一个能够被读取和写入操作同时访问的可变数据结构。为了减少读取memtable时的竞争,我们对memtable的每一行都采用写时复制(Copy-On-Write)的机制,这样便使得读取和写入操作能够并行执行。

    Since SSTables are immutable, the problem of permanently removing deleted data is transformed to garbage collecting obsolete SSTables. Each tablet’s SSTables are registered in the METADATA table. The master removes obsolete SSTables as a mark-and-sweep garbage collection over the set of SSTables, where the METADATA table contains the set of roots.

        因为SSTable都是不可变的,所以永久移除被删除数据的问题就转化为对已废弃的SSTable进行垃圾收集的问题。每个数据片的SSTable都会在METADATA表中注册。主机会采用“标记-清除(Mark-And-Sweep)”的方式对整个SSTable集合进行垃圾收集,从而移除已废弃的SSTable,其中METADATA表保存着这次垃圾收集的根集合。

    Finally, the immutability of SSTables enables us to split tablets quickly. Instead of generating a new set of SSTables for each child tablet, we let the child tablets share the SSTables of the parent tablet.

        最后,SSTable的不可变性使得我们能够快速地拆分数据片。我们不必为每个子数据片(Child Tablet)生成一个新的SSTable集合,只需要让这些子数据片共享父数据片(Parent Tablet)的SSTable即可。

7   性能评估

    We set up a Bigtable cluster with N tablet servers to measure the performance and scalability of Bigtable as N is varied. The tablet servers were configured to use 1 GB of memory and to write to a GFS cell consisting of 1786 machines with two 400 GB IDE hard drives each. N client machines generated the Bigtable load used for these tests. (We used the same number of clients as tablet servers to ensure that clients were never a bottleneck.) Each machine had two dual-core Opteron 2 GHz chips, enough physical memory to hold the working set of all running processes, and a single gigabit Ethernet link. The machines were arranged in a two-level tree-shaped switched network with approximately 100-200 Gbps of aggregate bandwidth available at the root. All of the machines were in the same hosting facility and therefore the round-trip time between any pair of machines was less than a millisecond.

        我们建立了一个BigTable集群,它包含N个数据片服务器(Tablet Server),这样便能评估集群的性能和可扩展性,而这两个指标会随着N变化而变化。每个数据片服务器都配置了1GB内存,它们会将数据写入至一个由1786台机器组成的GFS存储单元中,每台机器都具有两块400GB的IDE硬盘。我们会使用N台客户端机器为这些测试产生BigTable的工作负载。(我们使用的客户端的数量和数据片服务器的数量相同,这样能够确保客户端不会成为性能瓶颈。)每台机器都具有两块主频为2 GHz的Opteron双核处理器,以及足以保存所有正在运行进程的工作集数据的物理内存,而且还有一块千兆以太网卡。我们使用这些机器组成了一个双层树状的交换网络,根结点具有大约100-200 Gbps的聚合带宽。所有的机器都在相同的托管设施中,因此,任意两台机器之间的往返通信时间应当都小于1毫秒。

    The tablet servers and master, test clients, and GFS servers all ran on the same set of machines. Every machine ran a GFS server. Some of the machines also ran either a tablet server, or a client process, or processes from other jobs that were using the pool at the same time as these experiments.

        数据片服务器、主机(Master)、测试客户端和GFS服务器都在相同的一套机器上运行。每台机器都会运行一个GFS服务器。除此之外,某些机器还会运行其他的进程,有的会运行数据片服务器,有的会运行客户端进程,有的会运行其他任务的进程(在本次实验期间,这些任务也会同时使用相同的机器池)。

    R is the distinct number of Bigtable row keys involved in the test. R was chosen so that each benchmark read or wrote approximately 1 GB of data per tablet server.

        R是本次测试涉及的BigTable行关键字的不同数量。我们会小心地选择R的值,确保每次基准测试都会对每个数据片服务器读取或写入大约1 GB的数据。

    The sequential write benchmark used row keys with names 0 to R – 1. This space of row keys was partitioned into 10N equal-sized ranges. These ranges were assigned to the N clients by a central scheduler that assigned the next available range to a client as soon as the client finished processing the previous range assigned to it. This dynamic assignment helped mitigate the effects of performance variations caused by other processes running on the client machines. We wrote a single string under each row key. Each string was generated randomly and was therefore uncompressible. In addition, strings under different row key were distinct, so no cross-row compression was possible. The random write benchmark was similar except that the row key was hashed modulo R immediately before writing so that the write load was spread roughly uniformly across the entire row space for the entire duration of the benchmark.

        顺序写入(Sequential Write)基准测试使用的行关键字的命名规则为0至R-1。我们将行关键字的存储空间划分为10N个大小相同的区间。有一个中央调度器会将这些区间分配给N个客户端,分配规则为:一旦某个客户端将上次分配给它的区间处理完成之后,中央调度器就会将下一个可用区间分配给这个客户端。这种动态分配的方式有助于减轻由于在客户端机器上运行的其他进程造成的性能变化的影响。我们会在每个行关键字下写入一个单独的字符串。每个字符串都是随机生成的,因此也就是不可压缩的。另外,不同行关键字下的字符串都是不同的,因此跨行压缩是不可能的。随机写入(Random Write)基准测试也是相似的,但是在写入之前会采用按R取模的方式对行关键字进行哈希计算,借此就可以在基准测试的整个过程中,将写入负载大致地均匀分布在整个行空间中。
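
下面的C++代码示意了这两种基准测试生成行关键字的差别:顺序写入直接使用0至R-1,随机写入则先对行关键字做哈希再按R取模(具体的哈希方式为本文假设,论文并未给出实现细节):

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

// 示意:顺序写入基准测试依次使用 0..R-1 作为行关键字;
// 随机写入基准测试在写入前先对行关键字做哈希并按 R 取模,
// 从而把写入负载大致均匀地分散到整个行空间。
int main() {
  const uint64_t R = 1000000;  // 行关键字的总数(示例值)
  for (uint64_t i = 0; i < 5; ++i) {
    std::string sequential_key = std::to_string(i);                 // 顺序写入使用的关键字
    uint64_t hashed = std::hash<std::string>{}(sequential_key) % R; // 随机写入使用的关键字
    std::cout << sequential_key << " -> " << hashed << std::endl;
  }
  return 0;
}
```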

    The sequential read benchmark generated row keys in exactly the same way as the sequential write benchmark, but instead of writing under the row key, it read the string stored under the row key (which was written by an earlier invocation of the sequential write benchmark). Similarly, the random read benchmark shadowed the operation of the random write benchmark.

        顺序读取(Sequential Read)基准测试生成行关键字的方式和顺序写入基准测试完全相同,但是它是读取存储在行关键字下的字符串(这个字符串是稍早执行顺序写入基准测试时写入的),而不是在行关键字下写入字符串。同样地,随机读取(Random Read)基准测试的操作和随机写入基准测试是对应的。

    The scan benchmark is similar to the sequential read benchmark, but uses support provided by the Bigtable API for scanning over all values in a row range. Using a scan reduces the number of RPCs executed by the benchmark since a single RPC fetches a large sequence of values from a tablet server.

        扫描(Scan)基准测试类似于顺序读取基准测试,但是它使用能够在一个行区间扫描所有值的API,这个API是由BigTable提供的。扫描操作能够减少基准测试执行RPC(Remote Procedure Call,远程过程调用)调用的次数,因为每次执行RPC调用都能从一个数据片服务器获取大量的值。

    The random reads (mem) benchmark is similar to the random read benchmark, but the locality group that contains the benchmark data is marked as in-memory, and therefore the reads are satisfied from the tablet server’s memory instead of requiring a GFS read. For just this benchmark, we reduced the amount of data per tablet server from 1 GB to 100 MB so that it would fit comfortably in the memory available to the tablet server.

        随机读取(内存)(Random Read (mem))基准测试类似于随机读取基准测试,但是包含基准测试数据的本地分组会被标记为在内存中存储,因此,这种基准测试会直接从数据片服务器的内存中读取数据,而不需要从GFS中读取数据。针对这种基准测试,我们会将每个数据片服务器的数据总量从1 GB降低至100 MB,这样便可以将基准测试数据全部加载至数据片服务器的内存中了。

    Figure 6 shows two views on the performance of our benchmarks when reading and writing 1000-byte values to Bigtable. The table shows the number of operations per second per tablet server; the graph shows the aggregate number of operations per second.

        图-6展示了两张此次基准测试的性能视图,此次基准测试会对BigTable读取和写入1000字节大小的值。图-6中的表格显示了每个数据片服务器(Tablet Server)每秒的操作次数,图-6中的图表显示了每秒的总操作次数。

数据片服务器的数量


Figure 6: Number of 1000-byte values read/written per second. The table shows the rate per tablet server; the graph shows the aggregate rate.

图-6:每秒1000字节值的读取/写入次数。表格显示了每个数据片服务器的执行速率,图表显示了总执行速率。

7.1   单个数据片服务器的性能

    Let us first consider performance with just one tablet server. Random reads are slower than all other operations by an order of magnitude or more. Each random read involves the transfer of a 64 KB SSTable block over the network from GFS to a tablet server, out of which only a single 1000-byte value is used. The tablet server executes approximately 1200 reads per second, which translates into approximately 75 MB/s of data read from GFS. This bandwidth is enough to saturate the tablet server CPUs because of overheads in our networking stack, SSTable parsing, and Bigtable code, and is also almost enough to saturate the network links used in our system. Most Bigtable applications with this type of an access pattern reduce the block size to a smaller value, typically 8KB.

        首先,我们应当考虑只有一个数据片服务器时的性能。随机读取要比所有其他的操作要慢一个数量级,甚至更多。每次随机读取都会将一个64 KB的SSTable数据块通过网络从GFS传输至一个数据片服务器,但是我们只使用其中1000字节的数据。数据片服务器每秒大约会执行1200次读取操作,也就是每秒从GFS读取大约75 MB的数据。这种带宽足以占满数据片服务器的CPU资源,因为网络协议栈、SSTable解析和BigTable代码都会造成大量的开销,同样几乎也会占满我们系统中使用的网络链路资源。大多数采用这种访问模式的BigTable应用程序都会将数据块的大小减小至一个较小的值,通常为8KB。

    Random reads from memory are much faster since each 1000-byte read is satisfied from the tablet server’s local memory without fetching a large 64 KB block from GFS.

        从内存随机读取要快得多,因为每次操作都是从数据片服务器的本地内存中直接读取1000字节的数据,而不用从GFS获取一个较大的64 KB的数据块。

    Random and sequential writes perform better than random reads since each tablet server appends all incoming writes to a single commit log and uses group commit to stream these writes efficiently to GFS. There is no significant difference between the performance of random writes and sequential writes; in both cases, all writes to the tablet server are recorded in the same commit log.

        随机写入和顺序写入的性能要比随机读取好一些,因为每个数据片服务器都会将所有的入站写入操作追加至一个提交日志(Commit Log)的尾部,并且会采用分组提交的方式,将数据以流的方式高效地写入至GFS中。随机写入和顺序写入的性能没有显著的差异。在这两种情况下,这个数据片服务器的所有写入操作都会记录在相同的提交日志中。

    Sequential reads perform better than random reads since every 64 KB SSTable block that is fetched from GFS is stored into our block cache, where it is used to serve the next 64 read requests.

        顺序读取的性能要比随机读取好一些,因为从GFS获取的每个64 KB的SSTable数据块都会存储在我们的数据块缓存(Block Cache)中,后续的64次读取请求就可以使用缓存中的这些数据了。

    Scans are even faster since the tablet server can return a large number of values in response to a single client RPC, and therefore RPC overhead is amortized over a large number of values.

        扫描操作甚至会更快,因为当数据片服务器对一个客户端的RPC调用作出响应时,它会返回大量的数据,所以RPC的开销就会分摊至大量的数据上。

7.2   性能扩展

    Aggregate throughput increases dramatically, by over a factor of a hundred, as we increase the number of tablet servers in the system from 1 to 500. For example, the performance of random reads from memory increases by almost a factor of 300 as the number of tablet servers increases by a factor of 500. This behavior occurs because the bottleneck on performance for this benchmark is the individual tablet server CPU.

        随着我们将系统中的数据片服务器(Tablet Server)逐渐从1个增加至500个,系统的聚合吞吐量会有着显著的提高,增长倍率甚至超过了100。例如,当数据片服务器的数量增加至500个时,从内存中随机读取的性能提高了几乎300倍。因为这种基准测试的性能瓶颈是单个数据片服务器的CPU,所以才会有这样的性能提升。

    However, performance does not increase linearly. For most benchmarks, there is a significant drop in per-server throughput when going from 1 to 50 tablet servers. This drop is caused by imbalance in load in multiple server configurations, often due to other processes contending for CPU and network. Our load balancing algorithm attempts to deal with this imbalance, but cannot do a perfect job for two main reasons: rebalancing is throttled to reduce the number of tablet movements (a tablet is unavailable for a short time, typically less than one second, when it is moved), and the load generated by our benchmarks shifts around as the benchmark progresses.

        然而,性能并不是线性增长的。对于大多数的基准测试来说,当数据片服务器的数量从1增加至50时,每台服务器的吞吐量都会有一种显著下跌的现象。这种下跌现象是由于多台服务器配置的负载不均衡造成的,通常是由于其他进程竞争CPU和网络资源而导致的。我们的负载均衡算法会尝试解决这种负载不均衡的问题,但是却不能完美地解决,这是由于以下两个主要原因造成的:这种算法会限制再平衡的次数,从而减少移动数据片的次数(当移动某个数据片时,这个数据片会在短时间内不可用,通常小于1秒);随着基准测试的进行,由基准测试产生的负载也会不断变化。

    The random read benchmark shows the worst scaling (an increase in aggregate throughput by only a factor of 100 for a 500-fold increase in number of servers). This behavior occurs because (as explained above) we transfer one large 64KB block over the network for every 1000-byte read. This transfer saturates various shared 1 Gigabit links in our network and as a result, the per-server throughput drops significantly as we increase the number of machines.

        基准测试的结果显示,随机读取的性能扩展最差(服务器的数量增加了500倍,而聚合吞吐量却仅仅提高了100倍)。因为(正如上文中的解释)当我们每次读取1000字节的数据时,都会通过网络传输一个64KB的大数据块,所以才会发生这种现象。这种网络传输会占满我们网络中的各种共享的千兆网络链路,随着我们逐渐增加机器的数量,这样便会导致每个服务器的吞吐量显著地下跌。
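
    A rough calculation shows why the shared 1 Gbit/s links become the limit: every 1000-byte read moves a 64 KB block, so the block size and the link speed bound the aggregate random-read rate. The snippet below only does that arithmetic; treating a few such shared links as the bottleneck is an illustration, since the paper does not describe the network topology.

        #include <cstdio>

        // Arithmetic behind the random-read scaling limit: the 64 KB block per
        // 1000-byte read and the shared 1 Gbit/s links are from the paper; the
        // exact topology is not, so this is only an order-of-magnitude estimate.
        int main() {
          const double block_bytes = 64.0 * 1024.0;     // bytes moved per random read
          const double useful_bytes = 1000.0;           // bytes actually used
          const double link_bytes_per_sec = 1e9 / 8.0;  // one 1 Gbit/s link

          std::printf("read amplification: ~%.0fx\n", block_bytes / useful_bytes);
          std::printf("one saturated link sustains ~%.0f random reads/s\n",
                      link_bytes_per_sec / block_bytes);  // roughly 1900 reads/s
          return 0;
        }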

8   实际应用

    As of August 2006, there are 388 non-test Bigtable clusters running in various Google machine clusters, with a combined total of about 24,500 tablet servers. Table 1 shows a rough distribution of tablet servers per cluster. Many of these clusters are used for development purposes and therefore are idle for significant periods. One group of 14 busy clusters with 8069 total tablet servers saw an aggregate volume of more than 1.2 million requests per second, with incoming RPC traffic of about 741 MB/s and outgoing RPC traffic of about 16 GB/s.

        截止至2006年8月,有388个非测试用途的BigTable集群运行在Google的各种机器集群中,总共大约有24500个数据片服务器。表-1展示了每个集群中数据片服务器数量的大致分布情况。这些集群中有很多用于开发目的,因此会在相当长的时间内处于空闲状态。通过观察一个由14个繁忙集群组成的分组(总共包含8069个数据片服务器)可以发现,其总请求量超过每秒120万次,入站的RPC网络流量大约为741 MB/s,出站的RPC网络流量大约为16 GB/s。

数据片服务器在集群中的分布情况


Table 1: Distribution of number of tablet servers in Bigtable clusters.
表-1:BigTable集群中的数据片服务器数量的分布情况。

    Table 2 provides some data about a few of the tables currently in use. Some tables store data that is served to users, whereas others store data for batch processing; the tables range widely in total size, average cell size, percentage of data served from memory, and complexity of the table schema. In the rest of this section, we briefly describe how three product teams use Bigtable.

        表-2展示了几个当前正在使用的表的一些相关数据。有些表存储的是用户需要使用的数据,而其他表存储的数据则用于批处理;这些表在总大小、平均单元格大小、从内存中读取数据的比例和表模式的复杂度等方面都有所不同。在本节的其他部分,我们会简短地描述以下三个产品团队是如何使用BigTable的。

生产环境中使用的BigTable特性


Table 2: Characteristics of a few tables in production use. Table size (measured before compression) and # Cells indicate approximate sizes. Compression ratio is not given for tables that have compression disabled.
表-2:在生产环境中使用的一些表的特性。表的大小(Table Size)(在压缩前测量)和单元格数量(# Cells)均为近似值。对于禁用压缩的表,不给出压缩比率(Compression Ratio)。

8.1   谷歌分析

    Google Analytics (analytics.google.com) is a service that helps webmasters analyze traffic patterns at their web sites. It provides aggregate statistics, such as the number of unique visitors per day and the page views per URL per day , as well as site-tracking reports, such as the percentage of users that made a purchase, given that they earlier viewed a specific page.

        谷歌分析(Google Analytics)(analytics.google.com)是一种帮助网站站长分析网站流量模式的服务。它提供聚合的统计数据,例如每日独立访客(Unique Visitor)的数量和每个URL的每日页面浏览(Page View)次数;除此之外,还提供网站跟踪报告,例如在稍早浏览过某个特定页面的用户中,最终完成购买的用户所占的比率。

    To enable the service, webmasters embed a small JavaScript program in their web pages. This program is invoked whenever a page is visited. It records various information about the request in Google Analytics, such as a user identifier and information about the page being fetched. Google Analytics summarizes this data and makes it available to webmasters.

        为了使用这项服务,网站站长必须在他们的网页中嵌入一小段JavaScript程序。当这些网页被访问时,这个程序就会被调用。它会将网络请求的各种有关信息记录在谷歌分析中,例如用户标识符和获取网页的有关信息。谷歌分析会汇总这些数据,然后提供给网站站长。

    We briefly describe two of the tables used by Google Analytics. The raw click table (~200 TB) maintains a row for each end-user session. The row name is a tuple containing the website’s name and the time at which the session was created. This schema ensures that sessions that visit the same web site are contiguous, and that they are sorted chronologically. This table compresses to 14% of its original size.

        我们会简短地描述谷歌分析使用的两张表。原始点击(Raw Click)表(大约200 TB)的每一行都存储了一个终端用户的会话。行名是一个多元组,包含网站的名称和这个会话的创建时间。这种模式能够确保访问相同网站的会话存储在相邻的地方,它们也是按照时间的先后顺序排列的。这张表能够压缩至原始尺寸的14%。
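
    The paper only states that the row name is the tuple (website name, session creation time); the encoding below is a hypothetical way to realize the two properties the schema needs, namely that sessions for one site are contiguous and chronologically ordered. The fixed-width timestamp and the separator character are assumptions.

        #include <cstdint>
        #include <cstdio>
        #include <string>

        // Hypothetical row-key encoding for the raw click table. Zero-padding the
        // timestamp keeps lexicographic order equal to chronological order, and
        // prefixing with the site name keeps one site's sessions contiguous.
        std::string RawClickRowKey(const std::string& site, int64_t session_start) {
          char ts[32];
          std::snprintf(ts, sizeof(ts), "%020lld",
                        static_cast<long long>(session_start));
          return site + "#" + ts;  // '#' is an assumed separator
        }

        int main() {
          std::printf("%s\n", RawClickRowKey("example.com", 1136102400).c_str());
          std::printf("%s\n", RawClickRowKey("example.com", 1136188800).c_str());
          return 0;
        }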

    The summary table (~20 TB) contains various predefined summaries for each website. This table is generated from the raw click table by periodically scheduled MapReduce jobs. Each MapReduce job extracts recent session data from the raw click table. The overall system’s throughput is limited by the throughput of GFS. This table compresses to 29% of its original size.

        汇总(Summary)表(大约20 TB)包含每个网站的各种预定义的汇总信息。一个定时调度的MapReduce任务能够根据原始点击表的数据生成汇总表的数据。MapReduce任务每次运行时都会从原始点击表中提取出最近的会话数据。系统的整体吞吐量受限于GFS的吞吐量。这张表能够压缩至原始尺寸的29%。

8.2   谷歌地球

    Google operates a collection of services that provide users with access to high-resolution satellite imagery of the world’s surface, both through the web-based Google Maps interface (maps.google.com) and through the Google Earth (earth.google.com) custom client software. These products allow users to navigate across the world’s surface: they can pan, view, and annotate satellite imagery at many different levels of resolution. This system uses one table to preprocess data, and a different set of tables for serving client data.

        谷歌运营着一系列服务,用户能够通过基于Web的谷歌地图(Google Maps)界面(maps.google.com)和谷歌地球(Google Earth)(earth.google.com)的定制客户端软件,访问地球表面的高分辨率卫星图像。这些产品使用户能够浏览世界各地的地表:他们可以在很多不同的分辨率等级下对卫星图像进行平移、查看和标注。这个系统使用一张表来预处理数据,并使用另外一组表来向客户端提供数据服务。

    The preprocessing pipeline uses one table to store raw imagery. During preprocessing, the imagery is cleaned and consolidated into final serving data. This table contains approximately 70 terabytes of data and therefore is served from disk. The images are efficiently compressed already, so Bigtable compression is disabled.

        预处理流水线会使用一张表来存储原始的成像数据。在预处理过程中,会清理和合并成像数据,然后形成最终的服务数据。这张表包含大约70 TB的数据,因此存储在磁盘上。图像已经经过了有效的压缩,因此禁用了BigTable的压缩功能。

    Each row in the imagery table corresponds to a single geographic segment. Rows are named to ensure that adjacent geographic segments are stored near each other. The table contains a column family to keep track of the sources of data for each segment. This column family has a large number of columns: essentially one for each raw data image. Since each segment is only built from a few images, this column family is very sparse.

        成像数据表中的每一行都对应于一个单独的地理区域。行名能够确保相邻地理区域的数据保存在邻近的存储空间中。这张表包含一个列族(Column Family),用于记录每个地理区域的数据源。这个列族包含很多列:基本上每个原始图像数据对应一列。因为每个地理区域都是由很少的几张图片构成的,所以这个列族是非常稀疏的。

    The preprocessing pipeline relies heavily on MapReduce over Bigtable to transform data. The overall system processes over 1 MB/sec of data per tablet server during some of these MapReduce jobs.

        预处理流水线高度依赖于基于BigTable的MapReduce来转换数据。在运行某些MapReduce任务期间,整个系统中每个数据片服务器(Tablet Server)的数据处理速度超过1 MB/s。

    The serving system uses one table to index data stored in GFS. This table is relatively small (~500 GB), but it must serve tens of thousands of queries per second per datacenter with low latency. As a result, this table is hosted across hundreds of tablet servers and contains in-memory column families.

        这个服务系统会使用一张表来索引存储在GFS中的数据。这张表相对较小(~500 GB),但是它必须在保证较低延迟的前提下,为每个数据中心提供每秒处理数万次查询请求的服务。因此,会有数百个数据片服务器承载这张表的数据,并且包含存储在内存中的列族。

8.3   个性化搜索

    Personalized Search (www.google.com/psearch) is an opt-in service that records user queries and clicks across a variety of Google properties such as web search, images, and news. Users can browse their search histories to revisit their old queries and clicks, and they can ask for personalized search results based on their historical Google usage patterns.

        个性化搜索(Personalized Search)(www.google.com/psearch)是一种选择性使用的服务,它可以记录用户的查询和点击,涉及各种各样的谷歌服务(例如,网络搜索、图像和新闻)。用户可以浏览他们的搜索历史,再次访问他们先前的查询和点击,他们也可以基于先前使用Google的习惯模式,定制个性化搜索结果。

    Personalized Search stores each user’s data in Bigtable. Each user has a unique userid and is assigned a row named by that userid. All user actions are stored in a table. A separate column family is reserved for each type of action (for example, there is a column family that stores all web queries). Each data element uses as its Bigtable timestamp the time at which the corresponding user action occurred. Personalized Search generates user profiles using a MapReduce over Bigtable. These user profiles are used to personalize live search results.

        个性化搜索会将每个用户的数据存储在BigTable中。每个用户都有一个唯一的userid,并且会被分配一个以该userid命名的行。所有的用户动作都存储在一张表中。BigTable会为每种类型的动作保留一个独立的列族(例如,会使用一个列族存储所有的Web查询)。每个数据元(Data Element)都会将相应用户动作发生的时间用作其BigTable时间戳。个性化搜索会通过一个基于BigTable的MapReduce任务生成用户配置文件。这些用户配置文件可用于个性化实时的搜索结果。
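
    A minimal sketch of the schema described above: one row per userid, one column family per action type, and the time of the action used as the Bigtable timestamp. The Mutation type and the "query" family name are hypothetical stand-ins, not the actual client API.

        #include <cstdint>
        #include <cstdio>
        #include <string>

        // Hypothetical mutation type used only to show where the pieces of the
        // Personalized Search schema go; it simply prints what would be written.
        struct Mutation {
          std::string row_key;  // the row is named by the userid
          void Set(const std::string& column, const std::string& value,
                   int64_t timestamp_usec) {
            std::printf("row=%s col=%s ts=%lld value=%s\n", row_key.c_str(),
                        column.c_str(), static_cast<long long>(timestamp_usec),
                        value.c_str());
          }
        };

        void RecordWebQuery(const std::string& userid, const std::string& query,
                            int64_t action_time_usec) {
          Mutation m{userid};
          // One column family per action type; "query" is an assumed family name.
          m.Set("query:web", query, action_time_usec);
        }

        int main() {
          RecordWebQuery("user1234", "bigtable paper", 1156377600000000LL);
          return 0;
        }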

    The Personalized Search data is replicated across several Bigtable clusters to increase availability and to reduce latency due to distance from clients. The Personalized Search team originally built a client-side replication mechanism on top of Bigtable that ensured eventual consistency of all replicas. The current system now uses a replication subsystem that is built into the servers.

        个性化搜索的数据会被复制到多个BigTable集群中,这样不仅可以提高可用性,而且还能降低由于与客户端距离较远而造成的延迟。个性化搜索团队最初在BigTable之上构建了一种客户端侧的复制机制,用于确保所有副本的最终一致性。当前的系统则使用一种内建于服务器中的复制子系统。

    The design of the Personalized Search storage system allows other groups to add new per-user information in their own columns, and the system is now used by many other Google properties that need to store per-user configuration options and settings. Sharing a table amongst many groups resulted in an unusually large number of column families. To help support sharing, we added a simple quota mechanism to Bigtable to limit the storage consumption by any particular client in shared tables; this mechanism provides some isolation between the various product groups using this system for per-user information storage.

        个性化搜索的存储系统的设计使得其他团队也可以在他们自己的列中添加新的每用户(Per-User)信息,现在有很多其他需要存储每用户配置选项和设置的Google服务也在使用这个系统。在很多团队之间共享一张表导致了异常多的列族。为了更好地支持共享,我们为BigTable添加了一种简单的配额机制[10],用于限制任意特定客户端在共享表中造成的存储消耗;这种机制在使用这个系统存储每用户信息的各个产品团队之间提供了一定的隔离。

9   经验教训

    In the process of designing, implementing, maintaining, and supporting Bigtable, we gained useful experience and learned several interesting lessons.

        在设计、实现、维护和支持BigTable的过程中,我们获得了很多有用的经验,并且学到了一些有趣的教训。

    One lesson we learned is that large distributed systems are vulnerable to many types of failures, not just the standard network partitions and fail-stop failures assumed in many distributed protocols. For example, we have seen problems due to all of the following causes: memory and network corruption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems that we are using (Chubby for example), overflow of GFS quotas, and planned and unplanned hardware maintenance. As we have gained more experience with these problems, we have addressed them by changing various protocols. For example, we added checksumming to our RPC mechanism. We also handled some problems by removing assumptions made by one part of the system about another part. For example, we stopped assuming a given Chubby operation could return only one of a fixed set of errors.

        我们学到的其中一个经验教训便是:大型分布式系统很容易受到许多类型的故障的影响,而不仅仅是标准的网络分区和很多分布式协议所假定的"遇错即停(Fail-Stop)"[11]故障。例如,我们遇到过由于以下原因而导致的问题:内存和网络损坏、较大的时钟偏移、机器挂起、长时间的和不对称的网络分区、我们正在使用的其他系统(例如Chubby)的缺陷、GFS配额溢出,以及计划内和计划外的硬件维护。随着我们在这些问题上积累了更多的经验,我们通过修改各种协议来解决它们。例如,我们为RPC(远程过程调用)机制添加了校验和(Checksum)功能。我们还通过移除系统的某一部分对另一部分所作的假设来解决一些问题。例如,我们不再假设一个给定的Chubby操作只会返回一组固定错误码中的某一个值。
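
    As an illustration of the RPC checksumming mentioned above, the sketch below attaches a checksum to each payload on the sending side and re-verifies it on receipt, so corrupted data is detected rather than acted upon. The FNV-1a hash and the Envelope type are arbitrary stand-ins for whatever the real RPC layer uses.

        #include <cstdint>
        #include <cstdio>
        #include <string>

        // End-to-end check of the kind described in the text: the sender attaches
        // a checksum, the receiver recomputes it, and any mismatch flags memory or
        // network corruption. FNV-1a is an arbitrary stand-in checksum.
        uint64_t Checksum(const std::string& payload) {
          uint64_t h = 1469598103934665603ULL;
          for (unsigned char c : payload) { h ^= c; h *= 1099511628211ULL; }
          return h;
        }

        struct Envelope { std::string payload; uint64_t checksum; };

        Envelope Send(const std::string& payload) {
          return {payload, Checksum(payload)};
        }

        bool Verify(const Envelope& e) { return Checksum(e.payload) == e.checksum; }

        int main() {
          Envelope e = Send("mutation: row=abc set family:col=value");
          std::printf("intact: %s\n", Verify(e) ? "yes" : "no");
          e.payload[0] ^= 0x1;  // simulate a single-bit corruption in transit
          std::printf("after corruption: %s\n", Verify(e) ? "yes" : "no");
          return 0;
        }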

    Another lesson we learned is that it is important to delay adding new features until it is clear how the new features will be used. For example, we initially planned to support general-purpose transactions in our API. Because we did not have an immediate use for them, however, we did not implement them. Now that we have many real applications running on Bigtable, we have been able to examine their actual needs, and have discovered that most applications require only single-row transactions. Where people have requested distributed transactions, the most important use is for maintaining secondary indices, and we plan to add a specialized mechanism to satisfy this need. The new mechanism will be less general than distributed transactions, but will be more efficient (especially for updates that span hundreds of rows or more) and will also interact better with our scheme for optimistic cross-data-center replication.

        我们学到的另一个经验教训便是:在清楚地知道新功能将如何被使用之前,应当推迟添加这些新功能,这一点非常重要。例如,我们最初计划在我们的API中支持通用的事务处理。然而,因为我们并不会立即用到这些功能,所以我们没有实现它们。现在,我们在BigTable上运行了很多实际的应用程序,我们已经能够检查它们的实际需求,并且发现大多数应用程序只需要单行事务处理。在人们提出需要分布式事务处理的场合中,最重要的用途便是维护二级索引,因此我们计划增加一种专用的机制来满足这种需要。这种新机制的通用性要比分布式事务处理差一些,但是效率会更高(特别是在更新操作涉及数百行甚至更多数据的时候),并且还能与我们的乐观(Optimistic)跨数据中心(Cross-Data-Center)复制方案更好地配合。

    A practical lesson that we learned from supporting Bigtable is the importance of proper system-level monitoring (i.e., monitoring both Bigtable itself, as well as the client processes using Bigtable). For example, we extended our RPC system so that for a sample of the RPCs, it keeps a detailed trace of the important actions done on behalf of that RPC. This feature has allowed us to detect and fix many problems such as lock contention on tablet data structures, slow writes to GFS while committing Bigtable mutations, and stuck accesses to the METADATA table when METADATA tablets are unavailable. Another example of useful monitoring is that every Bigtable cluster is registered in Chubby. This allows us to track down all clusters, discover how big they are, see which versions of our software they are running, how much traffic they are receiving, and whether or not there are any problems such as unexpectedly large latencies.

        在维护BigTable的过程中,我们还获得了一个具有实践意义的经验:合适的系统级监控是非常重要的(即不但要监控BigTable自身,而且还要监控使用BigTable的客户端进程)。例如,我们扩展了我们的RPC系统,使得它会对RPC调用进行采样,并详细记录代表这些RPC所执行的重要动作。这项功能使得我们能够检测和修复很多问题,例如:数据片(Tablet)数据结构上的锁争用问题、提交BigTable修改操作时写入GFS速度过慢的问题,以及当METADATA数据片不可用时访问METADATA表被阻塞的问题。有用的监控的另一个例子是,每个BigTable集群都会在Chubby中注册。这使得我们能够追踪所有的集群,了解集群的规模、集群正在运行的软件版本、集群接收的流量有多少,以及是否存在诸如预料之外的超大延迟之类的问题。

    The most important lesson we learned is the value of simple designs. Given both the size of our system (about 100,000 lines of non-test code), as well as the fact that code evolves over time in unexpected ways, we have found that code and design clarity are of immense help in code maintenance and debugging. One example of this is our tablet-server membership protocol. Our first protocol was simple: the master periodically issued leases to tablet servers, and tablet servers killed themselves if their lease expired. Unfortunately, this protocol reduced availability significantly in the presence of network problems, and was also sensitive to master recovery time. We redesigned the protocol several times until we had a protocol that performed well. However, the resulting protocol was too complex and depended on the behavior of Chubby features that were seldom exercised by other applications. We discovered that we were spending an inordinate amount of time debugging obscure corner cases, not only in Bigtable code, but also in Chubby code. Eventually, we scrapped this protocol and moved to a newer simpler protocol that depends solely on widely-used Chubby features.

        对于我们来说,最重要的经验便是简单设计的价值。考虑到我们系统的规模(大约100000行非测试代码),以及代码会随着时间以各种难以预料的方式不断演化,我们发现代码和设计的清晰度非常有助于代码的维护和调试。其中的一个例子便是我们的数据片服务器(Tablet Server)成员协议。我们的第一个协议非常简单:主机(Master)会定期向数据片服务器发放租约,如果数据片服务器的租约过期了,那么它们就会中止自身的运行。不幸的是,当发生网络故障时,这个协议会明显地降低可用性,并且还很容易受到主机恢复时间的影响。我们多次重新设计这个协议,直到得到一个表现良好的协议。然而,最终的协议过于复杂,并且依赖于Chubby中很少被其他应用程序使用的功能特性的行为。我们发现我们花费了大量的时间来调试一些隐晦的边界情况,不仅仅是BigTable代码中的,还有Chubby代码中的。最后,我们废弃了这个协议,转而采用一个更新、更简单的协议,这个协议只依赖于Chubby中被广泛使用的功能特性。
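
    The first, simple membership protocol described above can be pictured in a few lines of code: the master grants a lease, the tablet server keeps renewing it, and it stops serving once the lease has expired. This is only an illustrative reconstruction; the duration and the structure are assumptions.

        #include <chrono>
        #include <cstdio>

        // Sketch of the original lease-based membership protocol: a tablet server
        // that cannot renew its lease before expiry stops serving (in the real
        // system it kills itself). The 10-second TTL is an assumed value.
        using Clock = std::chrono::steady_clock;

        struct TabletServerLease {
          Clock::time_point expiry;
          void Renew(std::chrono::seconds ttl) { expiry = Clock::now() + ttl; }
          bool Expired() const { return Clock::now() >= expiry; }
        };

        int main() {
          TabletServerLease lease;
          lease.Renew(std::chrono::seconds(10));  // lease granted by the master
          if (lease.Expired()) {
            std::printf("lease expired: stop serving tablets and exit\n");
          } else {
            std::printf("lease valid: keep serving tablets\n");
          }
          return 0;
        }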

10   相关工作

    The Boxwood project has components that overlap in some ways with Chubby, GFS, and Bigtable, since it provides for distributed agreement, locking, distributed chunk storage, and distributed B-tree storage. In each case where there is overlap, it appears that the Boxwood’s component is targeted at a somewhat lower level than the corresponding Google service. The Boxwood project’s goal is to provide infrastructure for building higher-level services such as file systems or databases, while the goal of Bigtable is to directly support client applications that wish to store data.

        Boxwood项目的一些组件在某些方面和Chubby、GFS、BigTable有重叠之处,因为它提供了分布式一致性协商(Distributed Agreement)、锁、分布式块存储和分布式B-树存储。在每一处有重叠的地方,Boxwood的组件似乎都定位于比对应的Google服务更底层的层次。Boxwood项目的目的是为构建更高级别的服务(例如文件系统或数据库)提供基础设施,而BigTable的目的则是直接支持那些希望存储数据的客户端应用程序。

    Many recent projects have tackled the problem of providing distributed storage or higher-level services over wide area networks, often at “Internet scale”. This includes work on distributed hash tables that began with projects such as CAN, Chord, Tapestry, and Pastry. These systems address concerns that do not arise for Bigtable, such as highly variable bandwidth, untrusted participants, or frequent reconfiguration; decentralized control and Byzantine fault tolerance are not Bigtable goals.

        最近有很多项目致力于解决在广域网(Wide Area Network)上提供分布式存储或更高层服务的问题,这些项目通常是"互联网规模(Internet Scale)"的。这其中包括了始于CAN、Chord、Tapestry和Pastry等项目的分布式哈希表方面的工作。这些系统所要解决的问题并不会在BigTable中出现,例如高度可变的带宽、不可信的参与者或频繁的重新配置;另外,去中心化控制和拜占庭容错(Byzantine Fault Tolerance)[12]也不是BigTable的设计目标。

    In terms of the distributed data storage model that one might provide to application developers, we believe the key-value pair model provided by distributed B-trees or distributed hash tables is too limiting. Key-value pairs are a useful building block, but they should not be the only building block one provides to developers. The model we chose is richer than simple key-value pairs, and supports sparse semi-structured data. Nonetheless, it is still simple enough that it lends itself to a very efficient flat-file representation, and it is transparent enough (via locality groups) to allow our users to tune important behaviors of the system.

        就可能提供给应用程序开发者的分布式数据存储模型而言,我们认为分布式B-树或分布式哈希表所提供的键值对(Key-Value Pair)模型有很大的局限性。键值对是一种很有用的构建单元,但是它们不应当是提供给开发者的唯一构建单元。我们选用的模型要比简单的键值对丰富得多,并且还支持稀疏的半结构化数据。虽然如此,它仍然足够简单,可以非常高效地用平面文件(Flat File)来表示;同时,它还足够透明(通过本地分组(Locality Group)),使得我们的用户能够对系统的重要行为进行调优。

    Several database vendors have developed parallel databases that can store large volumes of data. Oracle’s Real Application Cluster database uses shared disks to store data (Bigtable uses GFS) and a distributed lock manager (Bigtable uses Chubby). IBM’s DB2 Parallel Edition is based on a shared-nothing architecture similar to Bigtable. Each DB2 server is responsible for a subset of the rows in a table which it stores in a local relational database. Both products provide a complete relational model with transactions.

        有几个数据库厂商也开发出了一些能够存储海量数据的并行数据库。Oracle的“真正应用集群(RAC:Real Application Cluster)”数据库使用共享磁盘来存储数据(BigTable使用GFS),以及一个分布式锁管理器(BigTable使用Chubby)。IBM的DB2数据库的并行版本基于一种类似于BigTable的无共享(Shared-Nothing)架构。每台DB2服务器都负责处理一张表中的行的一个子集,它们会将这些子集的数据存储在一个本地的关系型数据库中。这两种产品都提供了一种完整的关系模型,并且支持事务处理。

    Bigtable locality groups realize similar compression and disk read performance benefits observed for other systems that organize data on disk using column-based rather than row-based storage, including C-Store and commercial products such as Sybase IQ, SenSage, KDB+, and the ColumnBM storage layer in MonetDB/X100. Another system that does vertical and horizontal data partitioning into flat files and achieves good data compression ratios is AT&T’s Daytona database. Locality groups do not support CPU cache-level optimizations, such as those described by Ailamaki.

        有些其他的系统也使用基于列而不是基于行的存储方式来组织磁盘上的数据,包括C-Store,以及一些商用产品,例如Sybase IQ、SenSage、KDB+和MonetDB/X100中的ColumnBM存储层。BigTable的本地分组(Locality Group)能够实现与上述系统相似的压缩和磁盘读取性能收益。另一个能够将数据垂直和水平切分为平面文件(Flat File)并获得优秀数据压缩率的系统是AT&T的Daytona数据库。本地分组不支持CPU缓存级别的优化,例如Ailamaki等人所描述的那些优化。

    The manner in which Bigtable uses memtables and SSTables to store updates to tablets is analogous to the way that the Log-Structured Merge Tree stores updates to index data. In both systems, sorted data is buffered in memory before being written to disk, and reads must merge data from memory and disk.

        BigTable使用memtable和SSTable存储数据片(Tablet)的更新操作,这种方式类似于日志结构合并树(LSM-Tree:Log Structured Merge Tree)存储索引数据的更新操作的方式。在将已排序的数据写入磁盘之前,这两个系统都会将这些数据缓存在内存之中,因此读取操作必须合并内存和磁盘中的数据。
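
    The read path implied by this design can be sketched in a few lines: look in the in-memory memtable first, then in the on-disk sorted runs from newest to oldest. This is a generic LSM-style illustration, not Bigtable's actual code, and it ignores deletions, versions, and block caching.

        #include <cstdio>
        #include <map>
        #include <string>
        #include <vector>

        // Generic LSM-style lookup: recent writes live in a sorted in-memory
        // buffer (memtable); older data lives in immutable sorted runs (SSTables).
        // A read consults the memtable first, then the runs from newest to oldest.
        using SortedRun = std::map<std::string, std::string>;

        bool Lookup(const std::string& key, const SortedRun& memtable,
                    const std::vector<SortedRun>& sstables_newest_first,
                    std::string* value) {
          auto it = memtable.find(key);
          if (it != memtable.end()) { *value = it->second; return true; }
          for (const SortedRun& sst : sstables_newest_first) {
            auto hit = sst.find(key);
            if (hit != sst.end()) { *value = hit->second; return true; }
          }
          return false;
        }

        int main() {
          SortedRun memtable = {{"row2", "new"}};
          std::vector<SortedRun> sstables = {{{"row1", "v1"}, {"row2", "old"}}};
          std::string v;
          if (Lookup("row2", memtable, sstables, &v)) {
            std::printf("row2 -> %s\n", v.c_str());  // prints the memtable value
          }
          return 0;
        }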

    C-Store and Bigtable share many characteristics: both systems use a shared-nothing architecture and have two different data structures, one for recent writes, and one for storing long-lived data, with a mechanism for moving data from one form to the other. The systems differ significantly in their API: C-Store behaves like a relational database, whereas Bigtable provides a lower level read and write interface and is designed to support many thousands of such operations per second per server. C-Store is also a “read-optimized relational DBMS”, whereas Bigtable provides good performance on both read-intensive and write-intensive applications.

        C-Store和BigTable具有很多相同的特性:这两个系统都使用一种无共享的架构,并且都具有两种不同的数据结构,一种适用于近期写入的数据,另一种适用于存储长期使用的数据,它们还实现了一种将这两种数据结构相互转换的机制。这两个系统在它们的API方面有着明显的差异:C-Store的行为和关系型数据库较为相似,而BigTable提供了一个更加底层的读写接口,每台使用这个接口的服务器能够达到每秒数千次的读写操作速度。C-Store还是一种“读取优化的关系型数据库管理系统(DBMS)”,而BigTable为读取密集和写入密集的应用程序都提供了优秀的性能。

    Bigtable’s load balancer has to solve some of the same kinds of load and memory balancing problems faced by shared-nothing databases. Our problem is somewhat simpler: (1) we do not consider the possibility of multiple copies of the same data, possibly in alternate forms due to views or indices; (2) we let the user tell us what data belongs in memory and what data should stay on disk, rather than trying to determine this dynamically; (3) we have no complex queries to execute or optimize.

        BigTable的负载均衡器必须解决无共享数据库所面临的某些相同类型的负载均衡和内存均衡问题。我们的问题要相对简单一些:(1)我们不考虑同一份数据存在多个副本的可能性,而这些副本可能由于视图或索引的原因以不同的形式存在;(2)我们让用户来告诉我们哪些数据应当存储在内存中、哪些数据应当存储在磁盘上,而不是尝试动态地作出判断;(3)我们没有需要执行或优化的复杂查询。

11   结论

    We have described Bigtable, a distributed system for storing structured data at Google. Bigtable clusters have been in production use since April 2005, and we spent roughly seven person-years on design and implementation before that date. As of August 2006, more than sixty projects are using Bigtable. Our users like the performance and high availability provided by the Bigtable implementation, and that they can scale the capacity of their clusters by simply adding more machines to the system as their resource demands change over time.

        我们已经论述了BigTable,它是谷歌用于存储结构化数据的一种分布式系统。从2005年4月开始,BigTable就已经投入生产使用了,在此之前,我们花费了大约7个人年来设计和实现BigTable。截止至2006年8月,有超过60个项目正在使用BigTable。我们的用户对于BigTable提供的性能和高可用性非常满意,随着时间的推移,这些用户对于系统资源的需求会越来越大,他们此时只需要简单地向系统中添加更多的机器,便能够扩展系统集群的容量。

    Given the unusual interface to Bigtable, an interesting question is how difficult it has been for our users to adapt to using it. New users are sometimes uncertain of how to best use the Bigtable interface, particularly if they are accustomed to using relational databases that support general-purpose transactions. Nevertheless, the fact that many Google products successfully use Bigtable demonstrates that our design works well in practice.

        由于BigTable提供的接口并不常见,一个有趣的问题就是:我们的用户适应BigTable的接口会有多困难?新用户有时会不确定如何才能最有效地使用BigTable的接口,特别是当他们已经习惯于使用支持通用事务处理的关系型数据库时。虽然如此,但事实表明,有很多的谷歌产品都成功地使用了BigTable,这就说明我们的设计在实践中行之有效。

    We are in the process of implementing several additional Bigtable features, such as support for secondary indices and infrastructure for building cross-data-center replicated Bigtables with multiple master replicas. We have also begun deploying Bigtable as a service to product groups, so that individual groups do not need to maintain their own clusters. As our service clusters scale, we will need to deal with more resource-sharing issues within Bigtable itself.

        目前,我们正在为BigTable实现一些额外的功能,例如支持二级索引,以及用于构建具有多个主副本(Master Replica)的跨数据中心(Cross-Data-Center)复制BigTable的基础设施。我们也已经开始将BigTable作为一种服务部署给各个产品团队使用,这样各个团队就不需要维护他们自己的集群了。随着服务集群的不断扩展,我们将需要在BigTable自身内部处理更多的资源共享问题。

    Finally, we have found that there are significant advantages to building our own storage solution at Google. We have gotten a substantial amount of flexibility from designing our own data model for Bigtable. In addition, our control over Bigtable’s implementation, and the other Google infrastructure upon which Bigtable depends, means that we can remove bottlenecks and inefficiencies as they arise.

        最后,我们发现,构建谷歌自己的存储解决方案带来了很多优势。通过为BigTable设计我们自己的数据模型,我们也获得了极大的灵活性。另外,我们控制着BigTable的整个实现过程,以及BigTable依赖的其他的谷歌基础设施,这就意味着,当出现性能瓶颈或效率低下的问题时,我们能够及时地解决这些问题。

致谢

    We thank the anonymous reviewers, David Nagle, and our Shepherd Brad Calder, for their feedback on this paper. The Bigtable system has benefited greatly from the feedback of our many users within Google. In addition, we thank the following people for their contributions to Bigtable: Dan Aguayo, Sameer Ajmani, Zhifeng Chen, Bill Coughran, Mike Epstein, Healfdene Goguen, Robert Griesemer, Jeremy Hylton, Josh Hyman, Alex Khesin, Joanna Kulik, Alberto Lerner, Sherry Listgarten, Mike Maloney, Eduardo Pinheiro, Kathy Polizzi, Frank Yellin, and Arthur Zwiegincew.

        我们感谢匿名评审者、David Nagle,以及我们的论文指导人(Shepherd)Brad Calder,他们为本文提供了很多反馈意见。BigTable系统还从谷歌内部众多用户的反馈中获益匪浅。另外,我们还要感谢为BigTable作出贡献的下列人员:Dan Aguayo、Sameer Ajmani、Zhifeng Chen、Bill Coughran、Mike Epstein、Healfdene Goguen、Robert Griesemer、Jeremy Hylton、Josh Hyman、Alex Khesin、Joanna Kulik、Alberto Lerner、Sherry Listgarten、Mike Maloney、Eduardo Pinheiro、Kathy Polizzi、Frank Yellin和Arthur Zwiegincew。


注释

[1]  对于BigTable而言,数据是没有格式的。从数据库的角度来看,BigTable的数据没有固定的模式(Schema),由用户自行定义模式。

[2]  位置属性可以理解为一种树状结构,具有相同前缀的数据存储在邻近的位置。在读取数据时,可以将这些数据一次性读取出来。

[3]  图(Map)由关键字(Key,又称为键)和值(Value)组成,图中的每个条目都是一个键值对(Key-Value Pair)。

[4]  也就是按照获取时间的不同,存储了多个版本的网页数据。

[5]  BigTable使用时间戳(Timestamp)标识列的版本,此处使用t8和t9分别标识两个锚链接的版本。

[6]  假设METADATA的数据片没有被频繁移动,那么其中的三次通信发现缓存过期,另外三次更新缓存数据。

[7]  实际上是回收资源。数据被删除之后,它占有的资源并不能马上被释放,需要进行垃圾收集(Garbage Collection)之后才能重复使用。

[8]  相比起对整个SSTable进行压缩,分块压缩的压缩率较低。

[9]  布隆过滤器(BF:Bloom Filter)是一种哈希算法,由一个很长的二进制向量和系列的随机映射函数构成,可以参考:
      http://baike.baidu.com/view/4526149.htm

[10]  可以参考OpenStack的资源配额机制。

[11]  遇错即停(Fail-Stop),指的是一旦系统出错就停止运行,不输出任何数据或出错信息;遇错速报(Fail-Fast),指的是系统出错之后并不立即停止运行,而是尽可能快地报告或返回错误信息,然后再停止运行。

[12]  又称为拜占庭将军问题(Byzantine Failures),这是由莱斯利·兰伯特提出的点对点通信中的基本问题。所谓拜占庭失效是指一方向另一方发送消息,另一方没有收到,发送方也无法确认消息确实丢失的情形。在容错的分布式计算中,拜占庭失效可以是分布式系统中算法执行过程中的任意一个错误。这些错误被统称为“崩溃失效”和“发送与遗漏失效”。当拜占庭失效发生时,系统可能会做出任何不可预料的反应。