Apache Druid vs Apache Pinot
In our previous blogs, we covered what Apache Druid and Apache Pinot are. Today, we are going to look at the similarities and differences between the two, along with the use cases, to help you understand which scenarios favor Druid and which favor Pinot.
To summarize, Apache Druid is an open-source, real-time database that powers modern analytics applications with OLAP queries on event data. Apache Pinot, on the other hand, is a real-time distributed OLAP datastore built to answer OLAP queries with low latency.
But there’s so much more to both Apache Druid and Apache Pinot. So let’s get started!
Similarities between Apache Druid & Apache Pinot
Apache Druid and Apache Pinot are fundamentally similar because both store data and process queries on the same nodes, deviating from the decoupled architecture of BigQuery. Both have their own data storage format with indexes that is tightly coupled with their query processing engine, and neither supports moving large volumes of data between nodes. As a result, queries run faster in both than in Big Data processing systems like Presto, Hive, and Spark, even when those systems read data stored in columnar formats such as Parquet or Kudu.
To make columnar compression more efficient (saving resources) and indexes more aggressive (speeding up queries), Druid and Pinot do not support point updates and deletes. Both support Lambda-style streaming and batch ingestion, including real-time ingestion from Kafka. Both systems also run at large scale: Metamarkets runs a Druid cluster of approximately ten thousand CPU cores, and one Pinot cluster has thousands of machines.
This brings us to our next section –
Comparison of Apache Druid and Apache Pinot
Although Druid and Pinot have similar architectures, each has some notable features that the other lacks. Most of those features could be added to the other system with some additional effort. Still, one difference between the two is too big to be eliminated: the implementation of segment management in the master nodes, called the “Coordinator” in Druid and the “Controller” in Pinot.
“But why is this difference too big to be eliminated?” you may ask.
So let’s help you understand.
Segment Management in Druid
The “Coordinator” in Druid and the “Controller” in Pinot are not themselves responsible for persisting the metadata about the data segments in the cluster, or the current mapping between segments and the query processing nodes on which those segments are loaded. This information is persisted in ZooKeeper.
Unlike Pinot, Druid additionally persists this information in the SQL database that has to be provided to set up a Druid cluster. Some of the benefits this offers are:
- Less data stored in ZooKeeper: ZooKeeper holds only minimal information about the mapping from a segment id to the list of query processing nodes on which the segment is loaded. The remaining metadata, such as the list of dimensions and metrics and the size of the segment, is stored in the SQL database.
- Old data retrieval: When data segments are evicted from the cluster because they have become too old, they are offloaded from the query processing nodes and the metadata about them is removed from ZooKeeper. However, they are not removed from deep storage or from the SQL database, so the data can easily be retrieved if it is needed for reporting.
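To make the split between the two stores more concrete, here is a minimal Python sketch. It is only an illustration of the idea described above, not Druid's actual code: a plain dictionary stands in for ZooKeeper, an in-memory SQLite database stands in for the SQL metadata store, and the table schema, function names, and segment ids are all hypothetical.

```python
import sqlite3

# Stand-in for ZooKeeper: segment id -> list of query processing nodes.
zookeeper = {}

# Stand-in for the SQL metadata store (this schema is an assumption).
sql = sqlite3.connect(":memory:")
sql.execute(
    "CREATE TABLE segments ("
    "id TEXT PRIMARY KEY, dimensions TEXT, metrics TEXT, size_bytes INTEGER)"
)


def publish_segment(segment_id, dimensions, metrics, size_bytes, nodes):
    """Store full metadata in SQL and only the node mapping in ZooKeeper."""
    sql.execute(
        "INSERT INTO segments VALUES (?, ?, ?, ?)",
        (segment_id, ",".join(dimensions), ",".join(metrics), size_bytes),
    )
    zookeeper[segment_id] = list(nodes)


def evict_segment(segment_id):
    """Remove the segment from the serving layer; SQL metadata is kept."""
    zookeeper.pop(segment_id, None)


def known_segments():
    """Every segment that can still be located through the SQL metadata."""
    return [row[0] for row in sql.execute("SELECT id FROM segments ORDER BY id")]


publish_segment("events_2023-01", ["country", "device"], ["clicks"],
                1_000_000, ["node-a", "node-b"])
publish_segment("events_2024-01", ["country", "device"], ["clicks"],
                2_000_000, ["node-b"])

# The old segment is too old to serve, so it leaves the "ZooKeeper" mapping.
evict_segment("events_2023-01")
```

After the eviction, the old segment is no longer in the ZooKeeper stand-in, but it is still listed by known_segments(), which is what makes later retrieval from deep storage possible.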
Segment Management in Pinot
As opposed to Druid, which implements all segment management logic itself and communicates with ZooKeeper through Apache Curator, Pinot delegates a big share of cluster and segment management logic to the Apache Helix framework. Using Helix lets Pinot developers focus on other parts of the system. Having been tested in different conditions, Helix might have fewer bugs than the logic that Druid implements itself. But Helix can also constrain Pinot to the bounds of its framework.
Now that we have covered segment management in Druid and Pinot, it’s time to look at the feature differences that developers could replicate in the other system if required (both Druid and Pinot are written in Java).
Pluggable Druid and Opinionated Pinot
Because Druid has been developed and used by many organizations, it has gained support for multiple interchangeable options for almost every dedicated service or part. Druid can use Kafka, RabbitMQ, Flink, Samza, Storm, Spark, and so on as a source for real-time data ingestion; HDFS, Google Cloud Storage, Amazon S3, etc. as deep storage; and Kafka, Graphite, or Druid itself as a sink for Druid cluster metrics.
As Pinot was built at LinkedIn to meet its own requirements, it doesn’t offer as much choice. In Pinot, you can only use Kafka for real-time data ingestion, and Amazon S3 or HDFS as deep storage.
Pinot’s Predicate Pushdown
If the data in Kafka is partitioned by dimension keys at ingestion time, Pinot produces segments that carry the partitioning information. When a query with a predicate on that dimension arrives, the broker node filters segments upfront, so that fewer segments and fewer query processing nodes need to be hit.
In Druid, however, data sources are partitioned by time, and each data ingestion method has to hold a write lock on a particular time range while the data is loading. So no two methods can operate on the same time range of the same data source at the same time. Druid currently has no predicate pushdown on brokers.
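The broker-side filtering that Pinot performs can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Pinot's implementation: the partition count, the toy partition_of function (which stands in for Kafka's key hashing), and the Segment class are all made up for the example.

```python
from dataclasses import dataclass

NUM_PARTITIONS = 4  # assumed number of Kafka partitions


def partition_of(key: str) -> int:
    """Deterministic toy partitioner standing in for Kafka's key hashing."""
    return sum(key.encode("utf-8")) % NUM_PARTITIONS


@dataclass(frozen=True)
class Segment:
    name: str
    partition: int  # partition that the segment's rows were consumed from
    server: str     # query processing node that hosts the segment


def prune(segments, dimension_value):
    """Keep only the segments that can contain rows for this dimension value."""
    target = partition_of(dimension_value)
    return [s for s in segments if s.partition == target]


# One segment per partition, spread over two hypothetical servers.
segments = [Segment(f"seg-{p}", p, f"server-{p % 2}") for p in range(NUM_PARTITIONS)]

# A query such as "WHERE user_id = 'user-42'" only needs one partition.
selected = prune(segments, "user-42")
servers = {s.server for s in selected}
```

Because each segment remembers which partition it came from, the broker can discard three of the four segments before any query processing node is contacted.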
Pinot’s Better Data Format and Query Execution
In Pinot, the inverted index is optional, whereas in Druid it is mandatory, even though it is sometimes unnecessary and consumes a lot of space. Pinot also records the min and max values of numeric columns per segment and supports data sorting out of the box, while in Druid data sorting can only be arranged manually. Better data sorting in Pinot means better compression, which reduces space consumption and improves query performance.
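To see why sorting helps compression and why per-segment min and max values help queries, consider this small Python sketch. It uses plain run-length encoding as a stand-in for columnar compression in general; the function names and sample values are hypothetical, and neither system's real storage format is shown here.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def can_skip(segment_min, segment_max, query_low, query_high):
    """A segment cannot match a range filter if the two ranges do not overlap."""
    return segment_max < query_low or segment_min > query_high


column = ["us", "de", "us", "fr", "de", "us", "fr", "de"]

unsorted_runs = run_length_encode(column)        # 8 runs, nothing is saved
sorted_runs = run_length_encode(sorted(column))  # 3 runs: de x3, fr x2, us x3
```

The sorted column needs 3 runs instead of 8. In the same spirit, a segment whose recorded values span 10 to 20 can be skipped entirely for a filter on the range 30 to 40, without reading any of its rows.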
Druid’s Smarter Segment Assignment Algorithm
Pinot’s algorithm assigns a segment to the query processing node that has the fewest segments loaded at the given time. The algorithm Druid uses is much more sophisticated. Druid takes the table and time of every segment into account and applies a complex formula to calculate a final score, according to which the query processing nodes are ranked and the best node is chosen for the new segment. This reportedly brought an improvement of roughly 30% to 40% in query processing speed in production.
That was all for the comparison of Apache Druid and Apache Pinot. Both have pros and cons, some of which can be overcome with development effort. Still, before opting for either of the two, it’s important to consider which one serves your use case better. And if you want to implement either of them in your business, you should hire developers with relevant experience in the field.
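The difference between the two strategies can be illustrated with a toy Python sketch. Note that the cost function below is an assumption made up for this example and is not Druid's actual formula: it simply penalizes placing a segment on a node that already holds segments of the same table that are close in time, on the reasoning that such segments tend to be queried together. The node names, tables, and the decay_days parameter are hypothetical.

```python
import math


def least_loaded(nodes):
    """Pinot-style choice: the node with the fewest segments (ties by name)."""
    return min(nodes, key=lambda n: (len(nodes[n]), n))


def placement_cost(new_segment, existing, decay_days=7.0):
    """Toy cost: same-table segments that are close in time cost more."""
    table, day = new_segment
    cost = 0.0
    for other_table, other_day in existing:
        if other_table == table:
            cost += math.exp(-abs(day - other_day) / decay_days)
    return cost


def cost_based(nodes, new_segment):
    """Druid-style choice: the node with the lowest placement cost."""
    return min(nodes, key=lambda n: (placement_cost(new_segment, nodes[n]), n))


# Each segment is (table, day number); node contents are made up.
nodes = {
    "node-a": [("clicks", 100), ("clicks", 101)],
    "node-b": [("views", 100), ("views", 101), ("views", 102)],
}
new_segment = ("clicks", 102)

simple_choice = least_loaded(nodes)            # "node-a": it has fewer segments
scored_choice = cost_based(nodes, new_segment)  # "node-b": no nearby "clicks" data
```

The least-loaded rule puts the new segment next to two adjacent segments of the same table, so a query over days 100 to 102 would hit a single node. The cost-based rule spreads those segments out, which lets such a query run on both nodes in parallel.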