At Google, our mission is organizing all of the world's
information. We use literally thousands of different data
formats to represent networked messages between servers, index
records in repositories, geospatial datasets, and more. Most of
these formats are structured, not flat. This raises an
important question: How do we encode it all?
XML? No, that wouldn't work. As nice as XML is, it isn't
going to be efficient enough for this scale. When all of your
machines and network links are running at capacity, XML is an
extremely expensive proposition. Not to mention, writing code
to work with the DOM tree can sometimes become unwieldy.
Do we just write the raw bytes of our in-memory data
structures to the wire? No, that's not going to work either.
When we roll out a new version of a server, it almost always
has to start out talking to older servers. New servers need to
be able to read the data produced by old servers, and vice
versa, even if individual fields have been added or removed.
When data on disk is involved, this is even more important.
Also, some of our code is written in
Java or Python, so we need a portable solution.
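
To make the versioning problem concrete, here is a rough sketch that uses Python's struct module as a stand-in for dumping an in-memory structure to the wire. The record layouts and the helper names (write_v2, read_v1, old_server_can_read) are invented for illustration and are not taken from any real server.

```python
import struct

# Hypothetical fixed layouts: version 2 of the server adds a third field.
V1_LAYOUT = "<ii"   # record id, age
V2_LAYOUT = "<iii"  # record id, age, zip code (new in version 2)


def write_v2(record_id, age, zip_code):
    """New server: dump the record exactly as it is laid out in memory."""
    return struct.pack(V2_LAYOUT, record_id, age, zip_code)


def read_v1(data):
    """Old server: expects exactly the version 1 layout, nothing more."""
    return struct.unpack(V1_LAYOUT, data)


def old_server_can_read(data):
    """Report whether the old reader accepts the given bytes."""
    try:
        read_v1(data)
        return True
    except struct.error:
        # The buffer size no longer matches the layout the old code expects.
        return False
```

In this sketch the old reader accepts bytes written with its own layout but rejects anything produced by write_v2, because a raw dump carries no information that would let it skip the extra field.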
Do we write hand-coded parsing and serialization routines
for each data structure? Well, we used to. Needless to say,
that didn't last long. When you have tens of thousands of
different structures in your code base that need their own
serialization formats, you simply cannot write them all by
hand.
Instead, we developed Protocol Buffers. Protocol Buffers
allow you to define simple data structures in a special
definition language, then compile them to produce classes to
represent those structures in the language of your choice.
These classes come complete with heavily optimized code to
parse and serialize your message in an extremely compact
format. Best of all, the classes are easy to use: each field
has simple "get" and "set" methods, and once you're ready,
serializing the whole thing to or parsing it from a byte array
or an I/O stream just takes a single method call.
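
As a minimal sketch of what using such a class feels like, the hand-written Person class below stands in for compiler output. It is not real generated code: the class, its field numbers, and the method names (set_name, get_id, serialize, parse) are assumptions made for this example, and the byte layout is only loosely modeled on a tagged, variable-length integer encoding.

```python
def _encode_varint(n):
    """Encode a non-negative int in 7-bit groups, low group first."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low)
            return bytes(out)


def _decode_varint(data, pos):
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


class Person:
    """Hypothetical stand-in for a class a schema compiler might emit."""

    def __init__(self):
        self._name = ""
        self._id = 0

    def set_name(self, value):
        self._name = value

    def get_name(self):
        return self._name

    def set_id(self, value):
        self._id = value

    def get_id(self):
        return self._id

    def serialize(self):
        """Single call: turn the whole message into bytes."""
        name = self._name.encode("utf-8")
        # Each field is prefixed with (field number << 3) | wire type.
        # Wire type 2 = length-prefixed bytes, wire type 0 = varint.
        return (_encode_varint((1 << 3) | 2) + _encode_varint(len(name)) + name
                + _encode_varint((2 << 3) | 0) + _encode_varint(self._id))

    @classmethod
    def parse(cls, data):
        """Single call: rebuild a message, ignoring field numbers it does not know."""
        msg = cls()
        pos = 0
        while pos < len(data):
            key, pos = _decode_varint(data, pos)
            field_no, wire_type = key >> 3, key & 0x7
            if wire_type == 0:
                value, pos = _decode_varint(data, pos)
            elif wire_type == 2:
                length, pos = _decode_varint(data, pos)
                value = data[pos:pos + length]
                pos += length
            else:
                raise ValueError("wire type not handled in this sketch")
            if field_no == 1 and wire_type == 2:
                msg._name = value.decode("utf-8")
            elif field_no == 2 and wire_type == 0:
                msg._id = value
            # Any other field number is skipped rather than rejected.
        return msg


person = Person()
person.set_name("Ada")
person.set_id(150)
data = person.serialize()
xml = "<person><name>Ada</name><id>150</id></person>"
print(len(data), "bytes encoded;", len(xml), "bytes as comparable XML")
```

Because every field carries its own number, a reader built this way can skip fields added by a newer writer, which is the kind of compatibility the server rollout scenario calls for.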
OK, I know what you're thinking: "Yet another IDL?" Yes, you
could call it that. But IDLs in general have earned a
reputation for being hopelessly complicated. On the other hand,
one of Protocol Buffers' major design goals is simplicity. By
sticking to a simple lists-and-records model that solves the
majority of problems and resisting the desire to chase
diminishing returns, we believe we have created something that
is powerful without being bloated. And, yes, it is very fast:
at least an order of magnitude faster than XML.
And now, we're making Protocol Buffers available to the Open
Source community. We have seen how effective a solution they
can be for certain tasks, and we want more people to be able to
take advantage of and build on this work. Take a look at the
documentation, download the code and let us know what you
think.