Exploring Druid
Topics covered
- Druid Overview
- Functionalities available
- Single node installation
- Loading data to Druid
What is Druid
Apache Druid is an open source distributed data store. Druid’s core design combines ideas from data warehouses, timeseries databases, and search systems to create a unified system for real-time analytics for a broad range of use cases. Druid merges key characteristics of each of the 3 systems into its ingestion layer, storage format, querying layer, and core architecture.
Key features of Druid include:
Column-oriented storage
Druid stores and compresses each column individually, and reads only the columns needed for a particular query, which supports fast scans, rankings, and groupBys.
Native search indexes
Druid creates inverted indexes for string values for fast search and filter.
Streaming and batch ingest
Out-of-the-box connectors for Apache Kafka, HDFS, AWS S3, stream processors, and more.
Flexible schemas
Druid gracefully handles evolving schemas and nested data.
Time-optimized partitioning
Druid intelligently partitions data based on time, so time-based queries are significantly faster than in traditional databases.
SQL support
In addition to its native JSON-based query language, Druid speaks SQL over either HTTP or JDBC (see the curl example after this feature list).
Horizontal scalability
Druid has been used in production to ingest millions of events/sec, retain years of data, and provide sub-second queries.
Easy operation
Scale up or down by just adding or removing servers, and Druid automatically rebalances. Fault-tolerant architecture routes around server failures.
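For example, SQL can be posted to the router over HTTP with curl. The sketch below is illustrative only and assumes a router listening on its default port 8888 and the "crawl" datasource that is loaded later in this post:
curl -X POST http://localhost:8888/druid/v2/sql \
  -H 'Content-Type: application/json' \
  -d '{"query": "SELECT url_host_tld, COUNT(*) AS cnt FROM crawl GROUP BY url_host_tld ORDER BY cnt DESC LIMIT 10"}'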
Druid Installation and Configuration
wget http://apache.mirrors.hoobly.com/druid/0.18.1/apache-druid-0.18.1-bin.tar.gz
tar -zxvf apache-druid-0.18.1-bin.tar.gz
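The archive extracts into a directory named apache-druid-0.18.1; the launch scripts referenced below are run from inside it:
cd apache-druid-0.18.1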
Druid includes a set of reference configurations and launch scripts for single-machine deployments:
nano-quickstart
micro-quickstart
small
medium
large
xlarge
Single server reference configurations
Nano-Quickstart: 1 CPU, 4GB RAM
- Launch command:
bin/start-nano-quickstart
- Configuration directory:
conf/druid/single-server/nano-quickstart
Micro-Quickstart: 4 CPU, 16GB RAM
- Launch command:
bin/start-micro-quickstart
- Configuration directory:
conf/druid/single-server/micro-quickstart
Small: 8 CPU, 64GB RAM (~i3.2xlarge)
- Launch command:
bin/start-small
- Configuration directory:
conf/druid/single-server/small
Medium: 16 CPU, 128GB RAM (~i3.4xlarge)
- Launch command:
bin/start-medium
- Configuration directory:
conf/druid/single-server/medium
Large: 32 CPU, 256GB RAM (~i3.8xlarge)
- Launch command:
bin/start-large
- Configuration directory:
conf/druid/single-server/large
X-Large: 64 CPU, 512GB RAM (~i3.16xlarge)
- Launch command:
bin/start-xlarge
- Configuration directory:
conf/druid/single-server/xlarge
Starting a single-node large Druid cluster:
./bin/start-single-server-large
Common issues:
Cannot start up because port 2181 is already in use.
Cannot start up because port 8888 is already in use.
To resolve these, update 2181 to 2182 and 8888 to 8889 in the files below:
bin/verify-default-ports
conf/zk/zoo.cfg
conf/druid/single-server/large/router/runtime.properties
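One way to make these edits is with a quick in-place substitution (a minimal sketch using GNU sed; it assumes the default port numbers appear literally in the listed files):
sed -i 's/2181/2182/g; s/8888/8889/g' bin/verify-default-ports conf/zk/zoo.cfg conf/druid/single-server/large/router/runtime.properties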
To load data into Druid, use the following command:
bin/post-index-task --file quickstart/tutorial/crawl-index-hadoop.json --url http://0.0.0.0:8081
quickstart/tutorial/crawl-index-hadoop.json is an indexing spec (JSON) that describes the schema of the input file, the location of the file, the column used for segmentation, connection details, etc.
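Once the task is submitted, its progress can be followed in the Druid web console or through the Overlord API. A quick check (assuming the coordinator-overlord is on its default port 8081, as in the command above):
curl http://localhost:8081/druid/indexer/v1/runningTasks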
Below is a sample indexer JSON for an open-source web-crawl data set. For this use case, I converted the data to JSON format and loaded it into the HDFS location /data/druid/.
{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "crawl",
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "content_charset",
              "content_digest",
              "content_languages",
              "content_mime_detected",
              "content_mime_type",
              "fetch_status",
              "url",
              "url_host_2nd_last_part",
              "url_host_3rd_last_part",
              "url_host_4th_last_part",
              "url_host_5th_last_part",
              "url_host_name",
              "url_host_private_domain",
              "url_host_private_suffix",
              "url_host_registered_domain",
              "url_host_registry_suffix",
              "url_host_tld",
              "url_path",
              "url_port",
              "url_protocol",
              "url_query",
              "url_surtkey",
              "warc_filename",
              "warc_record_length",
              "warc_record_offset",
              "warc_segment",
              { "name": "added", "type": "long" },
              { "name": "deleted", "type": "long" },
              { "name": "delta", "type": "long" }
            ]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "fetch_time"
          }
        }
      },
      "metricsSpec" : [],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2012-09-12/2020-09-13"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "/data/druid/"
      }
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "forceExtendableShardSpecs" : true,
      "jobProperties" : {
        "fs.default.name" : "hdfs://ip-172-31-23-142.us-west-2.compute.internal:8020",
        "fs.defaultFS" : "hdfs://ip-172-31-23-142.us-west-2.compute.internal:8020",
        "dfs.datanode.address" : "172.31.23.142",
        "dfs.client.use.datanode.hostname" : "true",
        "dfs.datanode.use.datanode.hostname" : "true",
        "yarn.resourcemanager.hostname" : "172.31.23.142",
        "yarn.nodemanager.vmem-check-enabled" : "false",
        "mapreduce.map.java.opts" : "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
        "mapreduce.job.user.classpath.first" : "true",
        "mapreduce.reduce.java.opts" : "-Duser.timezone=UTC -Dfile.encoding=UTF-8",
        "mapreduce.map.memory.mb" : 1024,
        "mapreduce.reduce.memory.mb" : 1024
      }
    }
  },
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.8.5"]
}
Load Time
17.9 GiB of JSON data (1,186,221 records) was loaded into the Druid large single-server deployment in about 10 minutes of execution time.
Read Performance
Reading Parquet files
This Apache Druid module extends Druid Hadoop based indexing to ingest data directly from offline Apache Parquet files.
Note: If using the parquet-avro parser for Apache Hadoop based indexing, druid-parquet-extensions depends on the druid-avro-extensions module, so be sure to include both.
The druid-parquet-extensions module provides the Parquet input format, the Parquet Hadoop parser, and the Parquet Avro Hadoop Parser with druid-avro-extensions. The Parquet input format is available for native batch ingestion, and the other two parsers are for Hadoop batch ingestion. Please see the corresponding docs for details.
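To make the Parquet parsers available, both extensions need to be on Druid's extension load list. A minimal sketch, assuming the single-server large configuration used earlier (keep whatever entries your distribution already lists and append the two Parquet/Avro entries):
In conf/druid/single-server/large/_common/common.runtime.properties:
druid.extensions.loadList=["druid-hdfs-storage", "druid-parquet-extensions", "druid-avro-extensions"]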
References:
https://druid.apache.org/technology
https://druid.apache.org/docs/latest/development/extensions-core/parquet.html