Exploring Druid
Topics covered
- Druid Overview
- Functionalities available
- Single node installation
- Loading data to Druid
What is Druid
Apache Druid is an open source distributed data store. Druid’s core design combines ideas from data warehouses, timeseries databases, and search systems to create a unified system for real-time analytics for a broad range of use cases. Druid merges key characteristics of each of the 3 systems into its ingestion layer, storage format, querying layer, and core architecture.
Key features of Druid include:
Column-oriented storage
Druid stores and compresses each column individually, and reads only the columns needed for a particular query, which supports fast scans, rankings, and groupBys.
Native search indexes
Druid creates inverted indexes for string values for fast search and filter.
Streaming and batch ingest
Out-of-the-box connectors for Apache Kafka, HDFS, AWS S3, stream processors, and more.
Flexible schemas
Druid gracefully handles evolving schemas and nested data.
Time-optimized partitioning
Druid intelligently partitions data based on time, so time-based queries are significantly faster than in traditional databases.
SQL support
In addition to its native JSON-based query language, Druid speaks SQL over either HTTP or JDBC (see the curl example after this feature list).
Horizontal scalability
Druid has been used in production to ingest millions of events/sec, retain years of data, and provide sub-second queries.
Easy operation
Scale up or down by just adding or removing servers, and Druid automatically rebalances. Fault-tolerant architecture routes around server failures.
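For example, SQL can be posted to the router over HTTP with curl. The sketch below is illustrative only and assumes a router listening on its default port 8888 and the "crawl" datasource that is loaded later in this post:
curl -X POST http://localhost:8888/druid/v2/sql \
  -H 'Content-Type: application/json' \
  -d '{"query": "SELECT url_host_tld, COUNT(*) AS cnt FROM crawl GROUP BY url_host_tld ORDER BY cnt DESC LIMIT 10"}'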
Druid Installation and Configuration
wget http://apache.mirrors.hoobly.com/druid/0.18.1/apache-druid-0.18.1-bin.tar.gz
tar -zxvf apache-druid-0.18.1-bin.tar.gz
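The archive extracts into a directory named apache-druid-0.18.1; the launch scripts referenced below are run from inside it:
cd apache-druid-0.18.1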
Druid includes a set of reference configurations and launch scripts for single-machine deployments:
nano-quickstart
micro-quickstart
small
medium
large
xlarge
Single server reference configurations
Nano-Quickstart: 1 CPU, 4GB RAM
- Launch command:
bin/start-nano-quickstart
- Configuration directory:
conf/druid/single-server/nano-quickstart
Micro-Quickstart: 4 CPU, 16GB RAM
- Launch command:
bin/start-micro-quickstart
- Configuration directory:
conf/druid/single-server/micro-quickstart
Small: 8 CPU, 64GB RAM (~i3.2xlarge)
- Launch command:
bin/start-small
- Configuration directory:
conf/druid/single-server/small
Medium: 16 CPU, 128GB RAM (~i3.4xlarge)
- Launch command:
bin/start-medium
- Configuration directory:
conf/druid/single-server/medium
Large: 32 CPU, 256GB RAM (~i3.8xlarge)
- Launch command:
bin/start-large
- Configuration directory:
conf/druid/single-server/large
X-Large: 64 CPU, 512GB RAM (~i3.16xlarge)
- Launch command:
bin/start-xlarge
- Configuration directory:
conf/druid/single-server/xlarge
Starting a single-node large Druid cluster:
./bin/start-single-server-large
Common issues:
Cannot start up because port 2181 is already in use.
Cannot start up because port 8888 is already in use.
To resolve these, update 2181 to 2182 and 8888 to 8889 in the files below:
bin/verify-default-ports
conf/zk/zoo.cfg
conf/druid/single-server/large/router/runtime.properties
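One way to make these edits is with a quick in-place substitution (a minimal sketch using GNU sed; it assumes the default port numbers appear literally in the listed files):
sed -i 's/2181/2182/g; s/8888/8889/g' bin/verify-default-ports conf/zk/zoo.cfg conf/druid/single-server/large/router/runtime.properties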
To load data into Druid, use the following command:
bin/post-index-task --file quickstart/tutorial/crawl-index-hadoop.json --url http://0.0.0.0:8081
quickstart/tutorial/crawl-index-hadoop.json is an indexing spec (JSON) that describes the schema of the input file, the location of the file, the column used for segmentation, connection details, etc.
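Once the task is submitted, its progress can be followed in the Druid web console or through the Overlord API. A quick check (assuming the coordinator-overlord is on its default port 8081, as in the command above):
curl http://localhost:8081/druid/indexer/v1/runningTasks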
Below is a sample indexer JSON for an open-source web-crawl data set. For this use case, I converted the data to JSON format and loaded it into the HDFS location /data/druid/.
{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "crawl",
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "content_charset",
              "content_digest",
              "content_languages",
              "content_mime_detected",
              "content_mime_type",
              "fetch_status",
              "url",
              "url_host_2nd_last_part",
              "url_host_3rd_last_part",
              "url_host_4th_last_part",
              "url_host_5th_last_part",
              "url_host_name",
              "url_host_private_domain",
              "url_host_private_suffix",
              "url_host_registered_domain",
              "url_host_registry_suffix",
              "url_host_tld",
              "url_path",
              "url_port",
              "url_protocol",
              "url_query",
              "url_surtkey",
              "warc_filename",
              "warc_record_length",
              "warc_record_offset",
              "warc_segment",
              { "name": "added", "type": "long" },
              { "name": "deleted", "type": "long" },
              { "name": "delta", "type": "long" }
            ]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "fetch_time"
          }
        }
      },
      "metricsSpec" : [],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2012-09-12/2020-09-13"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "/data/druid/"
      }
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "forceExtendableShardSpecs" : true,
      "jobProperties" : {
        "fs.default.name" : "hdfs://ip-172-31-23-142.us-west-2.compute.internal:8020",
        "fs.defaultFS" : "hdfs://ip-172-31-23-142.us-west-2.compute.internal:8020",
        "dfs.datanode.address" : "172.31.23.142",
        "dfs.client.use.datanode.hostname" : "true",
        "dfs.datanode.use.datanode.hostname" : "true",
        "yarn.resourcemanager.hostname" : "172.31.23.142",
        "yarn.nodemanager.vmem-check-enabled" : "false",
        "mapreduce.map.java.opts" : "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
        "mapreduce.job.user.classpath.first" : "true",
        "mapreduce.reduce.java.opts" : "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
        "mapreduce.map.memory.mb" : 1024,
        "mapreduce.reduce.memory.mb" : 1024
      }
    }
  },
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.8.5"]
}
Load Time
17.9 GiB of JSON data (1,186,221 records) was loaded into the Druid large single-server deployment in about 10 minutes of execution time.
Read Performance
Reading Parquet files
This Apache Druid module extends Druid Hadoop based indexing to ingest data directly from offline Apache Parquet files.
Note: If using the parquet-avro parser for Apache Hadoop based indexing, druid-parquet-extensions depends on the druid-avro-extensions module, so be sure to include both.
The druid-parquet-extensions module provides the Parquet input format, the Parquet Hadoop parser, and the Parquet Avro Hadoop Parser with druid-avro-extensions. The Parquet input format is available for native batch ingestion, and the other two parsers are for Hadoop batch ingestion. Please see the corresponding docs for details.
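To make the Parquet parsers available, both extensions need to be on Druid's extension load list. A minimal sketch, assuming the single-server large configuration used earlier (keep whatever entries your distribution already lists and append the two Parquet/Avro entries):
In conf/druid/single-server/large/_common/common.runtime.properties:
druid.extensions.loadList=["druid-hdfs-storage", "druid-parquet-extensions", "druid-avro-extensions"]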
References:
https://druid.apache.org/technology
https://druid.apache.org/docs/latest/development/extensions-core/parquet.html