
Interview with Eric Redmond, Author of Seven Databases in Seven Weeks


#### Part 1: Introduction and the CAP theorem

Kevin: OK, so maybe (let's talk about) something about your background. Why are you all of a sudden interested in databases?

Kevin: OK, so you actually did some research on databases in university.

Eric: Whoa, whoa (let's) back up. My name is not on any papers. I definitely wouldn't go as far as saying I did research. I was an undergrad that helped with a lot of this research though.

Kevin: OK, just to be accurate on the record.

Dingding: This is Dingding. So (about) your book Seven Databases in Seven Weeks, I'm just curious why it's seven, not six and not eight?

Kevin: That actually is a very nice book and very mind-expanding. I think your book will be following a very similar pattern: just seven representative databases in different genres.

So when you graduated, did you start as a database guy?

Eric: Sure. There's a lot of theory to know about distributed systems in general and those that manage data in particular. A lack of knowledge of these details has led people to make wild claims like 100% up-time guarantees or, what I think is even crazier, that CAP doesn't matter. You're starting to hear that trope quite often now. Really, the cornerstone of all distributed databases is the CAP theorem: C stands for consistency, A stands for availability and P is partition-tolerance.

I'll start with partition-tolerance first. This means a system that can tolerate a network partition, meaning lost network packets, which is always a possibility. Effectively, partition-tolerance means your system is distributed. If you have two computers on a network where one tries to communicate with the other, that signal can be lost. If it can be lost, that's a network partition. In the simplest case, where you have two computers and one cable running between them, if you take a pair of scissors and cut that cable, you now have two computers that are effectively their own little sub-networks because they can't talk to each other any more. You can heal that partition eventually (take a bunch of electrical tape and fix that cable) but the possibility is always there. So as far as the P goes, partition-tolerance is not optional if it's a distributed system.

So the layman's definition of the CAP theorem is: between consistency, availability and partition-tolerance, you can only have two of them. But what it actually says is that if you have a distributed system, you have to choose between consistency and availability. Not to belabour the P part, but P really is the keystone of all distributed-system problems. If it weren't for the possibility of partitions, creating a consistent and available distributed database would actually be trivial. For a concept of such painful importance, you would think it would deserve some fancy Greek letter like ∏, but no, it's just P.

Now, I sort of glossed over consistency and availability, so I'll try to explain with a quick little story about why you can only have one of the two. Imagine three men sitting at a bar, and the bartender gives each of them a whiskey. If you walk up to any of the men and ask what his last drink was, he would of course say whiskey. Now let's say one of the three men gets up to go to the bathroom. In distributed-systems speak, he's been partitioned from the group. He can't communicate with the others because he's in the hallway walking to the bathroom. While he's gone, the bartender gives the two men still sitting at the bar a beer, but now we have a problem: depending on whom I ask, either a man at the bar or the one on the way to the bathroom, I'll get a different answer to the question of what was your most recent drink. So the three men would give inconsistent answers, and what you want is a system where no matter whom you ask, you always get the same answer. That's what consistency means. But the man on his way to the bathroom could say: I'm in a minority here, I can't reach consensus with my friends, so I'll just refuse to answer. In other words, he's now unavailable. He's unavailable to answer your question. Not that he's down or he's crashed or anything.
It's a conscious decision on his part not to answer, because that way he's not inconsistent: he's just not answering, and that's the decision that must be made. You can either be consistent but unavailable, or you can be available but inconsistent in the face of a partition, but you can't be both at the same time. So that's sort of my parable of the CAP theorem and why you have to pick consistency or availability but can't have both.
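To make the parable concrete, here is a minimal Python sketch of the trade-off (the class and the drinkers' names are invented for illustration): a partitioned node can either keep answering with possibly stale data, or refuse to answer until the partition heals.

```python
# Toy model of the bar parable: three "nodes" remember the last drink served.
class Drinker:
    def __init__(self, name):
        self.name = name
        self.last_drink = None
        self.partitioned = False  # True once he walks off to the bathroom

    def pour(self, drink):
        self.last_drink = drink

    def ask(self, prefer_consistency):
        if self.partitioned and prefer_consistency:
            return None  # refuses to answer: consistent but unavailable
        return self.last_drink  # answers, possibly stale: available

bar = [Drinker("Al"), Drinker("Bo"), Drinker("Cy")]
for man in bar:
    man.pour("whiskey")

bar[2].partitioned = True      # one man walks to the bathroom
for man in bar[:2]:
    man.pour("beer")           # the bartender serves the remaining two

# Choosing availability: the partitioned man gives a stale, inconsistent answer.
print([man.ask(prefer_consistency=False) for man in bar])  # ['beer', 'beer', 'whiskey']
# Choosing consistency: he refuses to answer, i.e. he is unavailable.
print([man.ask(prefer_consistency=True) for man in bar])   # ['beer', 'beer', None]
```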

Kevin: So using your analogy, if that man chooses to be available, then his answer may not be correct.

Eric: Right, it could be inconsistent, because the last drink he had was whiskey, but (for) the other guys the last drink was beer, so they would give you inconsistent answers if you ask the wrong one. The problem is you never know which one you're asking. Now remember I said the man can decide that he's in a minority and can't reach consensus, so he'll just refuse to answer. This is actually how some databases like Mongo work: they will vote on a master, and the master is the one that gets to communicate. But a master can only be elected to answer questions if it's in a majority. So if a network partition occurs, the minority has just lost the ability to be elected. In the CAP theorem sense, Mongo could hypothetically remain consistent. I'm not going to speak to whether it will remain consistent, because that's a technical issue and, working for a competitor, I probably shouldn't go into any sort of rant about whether it's consistent, but it could be.
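The majority rule Eric describes fits in one line of Python; this tiny sketch (the function name is mine) shows why two sides of a partition can never both elect a master.

```python
# A partition can elect a master only if it holds a strict majority of the
# cluster; two disjoint partitions can't both have one, so at most one side
# keeps answering and the system can stay consistent.
def can_elect_master(partition_size: int, cluster_size: int) -> bool:
    return partition_size > cluster_size // 2

print(can_elect_master(2, 3))  # True: the majority side stays writable
print(can_elect_master(1, 3))  # False: the minority side goes unavailable
```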

Kevin: OK. But it can't be both consistent and available.

Eric: Right. That said, that doesn't mean that a system can't be neither. It could be neither consistent nor available and there are databases that are like that.

Kevin: What's the point of that?

Kevin: OK. When you talk about availability and consistency, it seems like most databases will optimize for one of them. Is it like a continuum between the two?

#### Part 2: Relational Databases

Kevin: OK. I think this is a good time for us to dive into the different genres of databases and see where they fall in this theorem, and also what trade-offs they've made. Or in other words, when picking a database, why would you want to choose one over another? We'll start with relational databases because they've been around forever. Where do you think relational databases fall in the CAP theorem?

Kevin: But latency is not part of the CAP trade-off.

Kevin: You talked about master/slave and replication-type scenarios. What about sharding a relational database? I know it's kind of ... You can also do that, but it comes with trade-offs.

Kevin: When we do get to the specific parts, we'll ask you to give us some examples of use cases and how to choose. So we're still on relational databases?

Eric: Oh yes.

Kevin: So out of the open-source relational databases, I know you have experience with Postgres, so that's sort of the front-runner nowadays.

Eric: That's my front-runner. I'll be honest: in the past three or four years, I haven't really been involved much in the relational database world. I pretty much spent my lifetime (10,000 hours) in front of my computer on relational databases, and it's not something I'm interested in going back to. I still think they're amazing, and I still think relational databases will solve the majority of people's problems. I think far too many people rush to alternatives for no good reason. But I definitely still throw out Postgres because I just prefer the design, and again, this is entirely my biased opinion. It's the one I'm most comfortable with. My knowledge of its internals peaked around 2004, and at the time they were far cleaner and much more flexible than something like MySQL: you could actually write your own indexes rather painlessly. They had just implemented it so you could write your own B-tree-style indexes with even less trouble, almost like flexible scripting. I just enjoyed the flexibility and the community.

#### Part 3: NoSQL databases

Dingding: So, you can build a distributed relational database with distributed nodes: you can have hundreds of Postgres nodes and just build an upper layer to make it distributed. What's your opinion of that versus a NoSQL solution?

Eric: It's interesting. If you think about that architecture and compare it with the architecture of something like Mongo, you'll see a lot of similarities: you have a master node that replicates, you have your Mongo server that acts as a router, and you have your configuration servers that track where different keys go and tell Mongo that this shard goes to this replica set. That's very similar to what you would do manually in a relational database. But again, your problem comes down to the CAP theorem. It comes down to the fact that you still have to make a consistency-availability trade-off. You can use something like Postgres, but you don't really get all the benefits anymore. You can't really easily or realistically join across a network, for example. Really, the power of relational databases comes down to the fact that you can normalize all of your data structures, and if it's normalized, you can query it in pretty much any way you want, and more specifically join. You can join values and create new tables and relations that you get back as a response. You have effectively lost the ability to do both of those once you start distributing your relational database, unless you're just doing straight-up replication. I know people do this a lot, but sometimes I get the feeling they do it just because they don't know any better. If you're doing that, I would recommend trying something else. Try a system that was designed to be distributed from the ground up, rather than shoehorning a system that was not designed to be distributed into becoming a distributed system.

The important thing to note about all of the databases we've mentioned so far is that, with enough code, you can make them all do anything if you're willing to hack around them enough. The question is how much effort you really want to put into designing your own kind of database ecosystem, which is effectively what you're doing. You also have to think about operational costs. Yes, you can create these custom clusters of relational databases to do the things you want, but now you effectively have to worry about whether a given node is a master, a secondary, a configuration server or a router. You've offloaded your development costs and put them squarely on your operational costs. This is something developers don't often think about when they're coding: coding is the cheapest and fastest phase of the process. The more expensive and much longer part of it is operations. Assuming you're successful as a developer, someone is going to be running and maintaining that code for years. If you design it in such a way that it's complex to scale and maintain, you've effectively just given some future person a whole lot of work.

If you're not distributing, just use a relational database. They're easy and they have a ton of research behind them. SQL is an amazing language; it's the most successful language ever. Think of any language that's been around that long: you're talking about C and SQL in some form. Obviously not the SQL specification itself, but the whole relational concept, which is more than 40 years old. If you're distributing, go after a NoSQL solution.
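As a rough illustration of what hand-sharding gives up, here is a minimal Python sketch (the node names are invented): once rows are hashed across nodes, a query touching rows on different shards can no longer be answered with a single SQL join; the application has to fetch from each shard and combine the results itself.

```python
import hashlib

SHARDS = ["pg-node-0", "pg-node-1", "pg-node-2"]  # hypothetical Postgres nodes

def shard_for(key: str) -> str:
    # Hash the key and map it to one of the shards deterministically.
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Related rows can land on different nodes, so no single node can join them.
print(shard_for("user:42"))
print(shard_for("order:977"))
```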

#### Part 4: Document Databases and Column Databases

Kevin: Talking about NoSQL, a lot of people equate it to document databases. That seems to be the most popular type nowadays. So why don't we start with document databases? What are the trade-offs, maybe in the CAP theorem context?

Kevin: So because you have a distributed system, it's easier to ship your queries or the algorithms themselves to the data than to get pieces of data from each place and try to join them together to get a result.

Eric: Yes, it's just a matter of reducing the amount of data going over the wire. When you're dealing with big data, you very often have to invert your way of thinking, and this is one of those cases where you invert the order of operations you're normally used to. It's one of those inversions I mentioned before: very often you don't necessarily think about how you're going to query the data. You just put it in there and worry about querying it later. That obviously has its own cost. You're going to pay these costs anyway; in a relational database, you just pay them in a very different way: you pay them upfront, in the design phase. How many times have you built an application on a relational database where you start at a whiteboard drawing tables, saying okay, we're going to want to do this and this is what our schema is going to look like, and then you start coding and say, oh crap, I need a different table, or I need a join table to sit in between these because the join needs another value hanging off of it, so I need to create a whole new table? These design decisions are very trivial when you're dealing with a lot of these NoSQL solutions, but the querying is more difficult.

Kevin: Yeah, I guess the strength of relational databases is that querying is really powerful and really easy, whereas once you start distributing data, you start to face the CAP theorem. I still want to talk a little more about Mongo because that seems to be the popular one. We talked about availability and some of the trade-offs, but from a data modelling point of view, if you compare a relational database with Mongo's data model, I feel like Mongo is very opinionated. If you know exactly how you want to traverse the data, then Mongo is perfect. In your example, a person and a cat: maybe the cat has toys. If you know you'll always go from person to cat to get toys, and you'd never just aggregate toys on their own, then it's perfect. You can always go that way. But you're losing flexibility. You may not anticipate that you'll need to do another kind of scan.

Eric: Yes. The flexibility you lose is the query flexibility of a relational database. That said, no database has the query flexibility of a relational database. Relational databases designed SQL as a query language: it's a declarative language, it's structured, and that's exactly its strength, largely its biggest strength. And relational databases are the only ones where you have a somewhat structured schema. In column-oriented databases like HBase, you define your schema upfront. Cassandra is another: it's topologically Dynamo-based like Riak, but its data structure is column-oriented like HBase, where you define what your column families are. They do have a little more flexibility than a relational database. The constraint in a relational database is that you put one value per row, for example, whereas in a column-oriented data store you can have as many individual discrete values as you want without adding rows: you're just adding values, because data is stored in columns rather than rows.

A simple example would be a wiki, where the key might be the title of the wiki page and you might have multiple revisions. In a relational database, unless you denormalized it, you would just say, okay, my page table will have one column for the title and one column for the contents of the page. What you'll find is the title never changes, so you'll just be replicating it a lot, while the contents of the page change quite often. Whereas in a column data store, you would have just one column family, page; the title would never change, that would be one column, and another column would be the contents of the page, which would change. When you do a query, it's a row, but it's almost like a pseudo-row: you're just saying okay, give me the most recent title and the most recent page contents. You actually can get a lot out of this: you can give values a time to live, which is very nice. There's a reason Facebook's messaging system runs on HBase. The ability for messages to have a time to live is something I presume they're able to leverage, and it also scales like crazy.
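Here is a hedged sketch of that wiki example against HBase, using the happybase Python client (it assumes a running HBase Thrift server on localhost; the table, column family and row names are invented for the example):

```python
import happybase

conn = happybase.Connection("localhost")  # assumes a local HBase Thrift server

# One column family, "page", keeping three versions per cell and expiring
# cells after 30 days -- the built-in time-to-live Eric mentions.
conn.create_table("wiki", {"page": dict(max_versions=3,
                                        time_to_live=30 * 24 * 3600)})
table = conn.table("wiki")

# The row key is the page title; the title column is written once,
# while the contents column accumulates revisions.
table.put(b"CAP_theorem", {b"page:title": b"CAP theorem",
                           b"page:contents": b"Pick two of three..."})
table.put(b"CAP_theorem", {b"page:contents": b"Really: C or A, given P."})

# The "pseudo-row": the most recent title plus the most recent contents.
print(table.row(b"CAP_theorem"))
# Older revisions of the contents are still there, up to max_versions.
print(table.cells(b"CAP_theorem", b"page:contents", versions=3))
```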

Kevin: Right. I guess if you do this on a SQL database, it'll be non-trivial, or you'll have to write a lot of code to make it work like that.

Eric: Yes, or you'll have to have some sort of custom extension. I wouldn't be surprised if something like Postgres had a time-to-live extension. Generally what you'll do is just timestamp a value, and then as part of the query you'll say give me this range, and then maybe manually delete everything that's outside of that range. That would be one trivial way of doing it, but it's nice to have these things built in, that's for sure. Again, like I said before, with most of these databases, with enough code you can make them do anything, but how much code do you want to write?
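The manual timestamp pattern Eric describes fits in a few lines. This sketch uses Python's stdlib sqlite3 so it runs anywhere, but the same SQL pattern applies to Postgres with its own timestamp functions (table and column names are invented):

```python
import sqlite3, time

TTL_SECONDS = 60  # values older than this are considered expired

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE messages (body TEXT, created_at REAL)")
db.execute("INSERT INTO messages VALUES (?, ?)", ("hello", time.time()))
db.execute("INSERT INTO messages VALUES (?, ?)", ("stale", time.time() - 3600))

cutoff = time.time() - TTL_SECONDS
# Reads only query the live range...
live = db.execute("SELECT body FROM messages WHERE created_at > ?",
                  (cutoff,)).fetchall()
print(live)  # [('hello',)]
# ...and a periodic job manually deletes everything outside it.
db.execute("DELETE FROM messages WHERE created_at <= ?", (cutoff,))
```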

Dingding: When we talked about document DBs, you mentioned some column DBs like HBase and Cassandra. A document DB to me is similar to a column DB but with fewer limitations and more improvements. So what's your opinion on these two and the differences between them?

Eric: Actually, a document DB is much closer to a key-value store in this way, because Mongo or Couch, whether it's queries or views, by default isn't actually indexing the values. The only thing that's indexed is the id, which is just a key look-up. If you query against a column that's not indexed, it's just a full table scan, and the same is true with Mongo and Couch. The same is not true with something like HBase, where the data is sparsely ordered: you're effectively always scanning. I guess you can do key-value look-ups as well, but generally you scan ranges of values. To index, you'll effectively index manually.

Again, depending on how much code you want to write, you can index with a key-value store as well. Take Redis, for example, which is a key-value store: if you have a deeply nested value and you want to be able to query by certain values, you can just create another key type that acts as a look-up. For example, say you have a person:social_security_number key that contains all of the person data we talked about earlier in our Mongo example, searched by last name. You can just create a key like last:last_name and have it point to the correct person:social_security_number, and then it's just a look-up. You can do this with a key-value store. In relational databases, obviously, to do it effectively you need some sort of ability to scan, so you could effectively write your own B-tree if you can't scan. It depends on how much code you want to write, but generally speaking, I actually find much more similarity between a key-value store and a document data store than between a column-oriented data store and a document data store.
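Eric's manual-index pattern can be sketched in a few lines with the redis-py client (this assumes a local Redis server; the key names mirror his person:social_security_number and last:last_name example, and the data is invented):

```python
import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

# The primary record, keyed by social security number.
r.hset("person:123-45-6789", mapping={"first": "Eric", "last": "Redmond"})

# The hand-built "index": a set of primary keys per last name.
r.sadd("last:Redmond", "person:123-45-6789")

# Querying by last name is now two cheap look-ups instead of a full scan.
for key in r.smembers("last:Redmond"):
    print(r.hgetall(key))
```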

Kevin: I would say that as well. I feel like one way to look at document databases is that they're also key-value stores, but the value can be nested.

Eric: In Riak, for example, all values are opaque, meaning it doesn't care what the value is, so you can actually put JSON as a value inside Riak. It has a secondary indexing feature, so if you want to index against some of those values, you just create an index and query against it. That would make it closer to Mongo in that respect, as far as queryability. Or if you want to use MapReduce, you can do that too, in which case it'll be closer to Couch, although Couch is considerably more efficient in the way it builds its views, because it pre-builds them and keeps them updated using partial MapReduce as you make updates. You're absolutely right. There's much more similarity than difference in that respect.
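For comparison with the Redis sketch above, here is roughly what the same look-up could look like with Riak's secondary indexes, using the official riak Python client (this assumes a local Riak node with secondary indexes enabled; bucket, key and index names are invented):

```python
import riak

client = riak.RiakClient()           # assumes a local Riak node
bucket = client.bucket("people")

# The value is opaque JSON; the *_bin index makes the last name queryable.
person = bucket.new("123-45-6789", data={"first": "Eric", "last": "Redmond"})
person.add_index("last_bin", "Redmond")
person.store()

# Query the index instead of scanning every key.
for key in bucket.get_index("last_bin", "Redmond"):
    print(bucket.get(key).data)
```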

#### Part 5: Key-value Stores

Kevin: We've talked about document databases and column databases already, and you mentioned Mongo, Couch, HBase and Cassandra. Let's move to key-value stores. We've already touched on Redis. Maybe you can compare Redis and Riak, those two key-value stores? What are the characteristics of each, and so on?

Kevin: We talked about Redis in the caching context. It's mostly reads. I've seen some people use Redis as an intermediary to their database, so instead of every operation touching the database, they can just put it in memory and it's fast.

Kevin: So does the replication happen automatically?

Eric: Well, Multi-Datacenter Replication is actually the one thing we charge for. Single-datacenter replication is built in. When you write a value to Riak, it will by default replicate it to three nodes. This is tunable; you can set the value to whatever you want. We usually recommend 3: it's sufficient for most cases. That means two out of three of that value's nodes could go down and you'd still have the data available. Then there's Multi-Datacenter Replication, which means you have multiple datacenters that replicate to each other to keep themselves in sync, by various means. It's very tunable. This is for a lot of reasons. One is data locality: you can have a cluster in the U.S. and a cluster in China, with data local to your Chinese customers, because they can access it faster, versus other data that's local to your U.S. customers. Or you can have two datacenters and one is just used for backup, or whatever other reasons people have for choosing to replicate across multiple datacenters.

Kevin: By the way, I'm just curious, are you aware of the companies using Riak in China?

Eric: I'm not. Actually, that's definitely something I'm very interested in. If anyone's interested in helping spread the word about Riak in China, I'd love to hear from them, or if anybody knows of any companies in China that are using Riak, I would love to hear from them. We are a company that was founded in the U.S. All of our first customers were in the U.S., then we went to Europe, and six months ago we opened an office in Japan, so we're slowly spreading internationally. Any way to speed up that trend would be amazing, because as far as I know we haven't really spread to South America either. We may be everywhere; I just haven't heard about it.

Kevin: Is Riak itself an open-source database?

Kevin: I was just thinking, because when it comes to China, a lot of things are big data, and it's typical that you walk into a bank and you very easily have 10,000,000 or 100,000,000 customers. Coming back to Riak: you said Riak automatically replicates data records. Does that hurt its consistency, or how does it make that choice?

Eric: As we mentioned previously, the CAP theorem is kind of a spectrum: consistency and availability are in some ways tunable. It doesn't have to be fixed. This comes back to what I mentioned about Peter Bailis and PBS (Probabilistically Bounded Staleness: how eventual is eventual consistency, and how consistent is eventual consistency): you can tune your eventual consistency. You can tune your availability and your consistency, and in some ways your latency as well.

The way Riak does it is with three values called N, R and W. N is the number of nodes you will eventually replicate a value to; by default, it's 3. W is the number of nodes that must return a successful affirmative before you return to the client and say yes, this write worked. The Riak node you write to will coordinate the replication. So if you set W to 2, then even though eventually all three nodes will have the value replicated, it will only wait until two of them have returned a success before it says the write is successful. R is the last one, which is the same thing but for reads: you'll attempt to read from all three nodes, but if you set R to 2, only two of them need to return a success for you to get a result. You're not going to wait for all three.

Now, if you want to make a more consistent system, the consistency can be either write consistency or read consistency. So for example, say I have my N value and I want to be certain that my write has been replicated to all three nodes: I can set W to 3 and say don't even return until every replica has the value. At this point you can be pretty certain that successive reads are going to contain that value, but you've of course slowed down. You've paid in latency and availability, because if one of those three replicas is down, your write will fail: what you told it is I need all three to work, so if one of them is down and only two are up, it fails.

That's where the quorum comes in. A quorum means that if your R (read) value plus your W (write) value is greater than N (the number of nodes), then you're probably going to be consistent, assuming everything goes according to plan, because there's going to be an overlap. It's an easy thought experiment: say you have nodes A, B and C, you write successfully to nodes A and C, and you read successfully from nodes B and C. Then even though, of the nodes you read, only C has the most recent value, you've still got the most recent value and you know it. There are details here; I'm not going to go into all the reasons why that's not truly consistent in the linearizable sense, but it's consistent enough for a highly available system, which is the whole point.

You can flip it over: if you want to write quickly and don't care to wait, you can set W to 1 and say just write to one node, I don't care which, and then return. Your latency goes down because you're not waiting for all three replicas: whichever replica wins the return race, you're done. Then when you do a read, if you want the most recent value, you can set R to the number of nodes, so you're effectively reading from all of them, because at least one of them is going to have the most recent write. But you've slowed down your read, so again you've paid in latency, and if one of those nodes is down, you've paid in availability.
I'm oversimplifying, because there's actually a little more detail to it than that: there are durable writes, primary writes, primary reads. Riak by default does something, which Dynamo does as well, called a sloppy quorum: if a node you would normally write to or read from isn't available, it will go to a secondary node, the next one on the list. It's not a strict quorum, which would simply fail; it actually tries to do that write anyway and elects another node to act as temporary storage. It's like going on vacation and having your neighbour collect all your mail while you're away; once you come back, your neighbour hands it all to you. In Riak, this is called hinted handoff. Once the node has rejoined, after a crash or a network partition, all of that data gets given back to the primary node, the one that should have had it all along. It's tricks like this that allow Riak to be a highly available system. This is why you get the kind of crazy up-times that you don't necessarily get with databases that choose consistency over this kind of availability.
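The R + W > N overlap argument can be checked exhaustively for a small cluster. Here is a minimal pure-Python sketch (function and node names invented) confirming that every write set and read set of quorum size must share at least one node:

```python
from itertools import combinations

NODES = {"A", "B", "C"}  # N = 3, as in Eric's thought experiment

def overlap_guaranteed(r: int, w: int) -> bool:
    # True iff every possible write set of size w intersects every
    # possible read set of size r, so some read always sees the write.
    return all(set(ws) & set(rs)
               for ws in combinations(NODES, w)
               for rs in combinations(NODES, r))

print(overlap_guaranteed(r=2, w=2))  # True:  2 + 2 > 3, reads overlap writes
print(overlap_guaranteed(r=1, w=2))  # False: 1 + 2 = 3, a read can miss the write
```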

Kevin: I guess the use case for Riak is when the business requirement is that availability is paramount. There's no way this can go down; money is at stake, or ...

Kevin: So DynamoDB is hosted, right? It's Database-as-a-Service, where you can just spin it up, but Riak is a little different: it's not hosted.

Kevin: One scenario for this kind of infrastructure: if your infrastructure is solely based on EC2, then that's great; if you have infrastructure that's not on Amazon's cloud, then you'll pay extra latency to retrieve the data.

Kevin: It's coming up. It's the next one. But I want to ask one last question about Riak because I think you're most familiar with it. Other than e-commerce sites such as Amazon, what are other use cases or scenarios that can benefit from high availability?

Eric: There are all sorts. We have video game companies using our stuff as the back-end for user data, session data, etc., or even for switching devices: on a lot of games you can play on multiple devices, so you can save state on one device, pick up another and continue playing. These kinds of things run at a very large scale. They seem trivial, but they're very hard to get right, and you can't really have a lot of latency in those cases. So it goes beyond simple shopping carts. There's also Riak CS, which is great for asset storage: people store videos and images. As long as your values are small, you can store anything in Riak. Like I said, values are opaque, so a value can be an image, just bytes encoded however you like. It could just be small thumbnails if you'd like, although I would recommend (Riak) CS, but you could even use regular Riak.
