Getting Started

A quick way to get familiar with Pinot is to run the Pinot examples. The examples can be run either by compiling the code or by running the prepackaged Docker images.

To demonstrate Pinot, let’s start a simple one-node cluster, along with the required ZooKeeper instance. This demo setup also creates a table, generates some Pinot segments, and uploads them to the cluster to make them queryable.

All of the setup is automated, so the only thing required at the beginning is to start the demonstration cluster.

Compiling the code

One can also run the Pinot demonstration by checking out the code on GitHub, compiling it, and running it. Compiling Pinot requires JDK 8 or later and Apache Maven 3.

  1. Check out the code from GitHub (https://github.com/apache/incubator-pinot)
  2. With Maven installed, run mvn install package -DskipTests -Pbin-dist in the directory in which you checked out Pinot.
  3. Make the generated scripts executable:
cd pinot-distribution/target/apache-pinot-incubating-<version>-SNAPSHOT-bin/apache-pinot-incubating-<version>-SNAPSHOT-bin; chmod +x bin/*.sh

Trying out Offline quickstart demo

To run the demo with compiled code:
bin/quick-start-offline.sh

Once the Pinot cluster is running, you can query it by going to http://localhost:9000/query/

You can also query Pinot using the REST API or the Java client. These are outside the scope of this introduction; the reference documentation for the Pinot client APIs is in the Executing queries via REST API on the Broker section.
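
For example, a PQL query can be posted directly to the broker over HTTP with curl. This is a minimal sketch; it assumes the quick-start broker is listening on port 8000 (the port used by the quick-start demos later in this guide), which may differ in other setups:

$ curl -H "Content-Type: application/json" -X POST \
    -d '{"pql":"SELECT count(*) FROM baseballStats LIMIT 0"}' \
    http://localhost:8000/query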

Pinot uses PQL, a SQL-like query language, to query data. Here are some sample queries:

/*Total number of documents in the table*/
SELECT count(*) FROM baseballStats LIMIT 0

/*Top 5 run scorers of all time*/
SELECT sum('runs') FROM baseballStats GROUP BY playerName TOP 5 LIMIT 0

/*Top 5 run scorers of the year 2000*/
SELECT sum('runs') FROM baseballStats WHERE yearID=2000 GROUP BY playerName TOP 5 LIMIT 0

/*Top 10 run scorers after 2000*/
SELECT sum('runs') FROM baseballStats WHERE yearID>=2000 GROUP BY playerName TOP 10 LIMIT 0

/*Select playerName,runs,homeRuns for 10 records from the table and order them by yearID*/
SELECT playerName,runs,homeRuns FROM baseballStats ORDER BY yearID LIMIT 10

The full reference for the PQL query language is present in the PQL section of the Pinot documentation.

Trying out Realtime quickstart demo

Pinot can ingest data from streaming sources such as Kafka.

To run the demo with compiled code:
bin/quick-start-realtime.sh

Once started, the demo will start Kafka, create a Kafka topic, and create a realtime Pinot table. Pinot will then start ingesting events from the Kafka topic into the table. The demo also starts a consumer that reads events from the Meetup API and pushes them into the Kafka topic that was created, so that new Meetup events show up in Pinot as they arrive.

To show new events appearing, one can run SELECT * FROM meetupRsvp ORDER BY mtime DESC LIMIT 50 repeatedly, which shows the last events that were ingested by Pinot.
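
For instance, that query can be issued from the command line with the pinot-admin.sh script (assuming the quick-start broker is on port 8000, as in the other examples in this guide):

$ bin/pinot-admin.sh PostQuery -brokerPort 8000 -query "SELECT * FROM meetupRsvp ORDER BY mtime DESC LIMIT 50"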

Experimenting with Pinot

Now we have a quick-start Pinot cluster running locally. Below are step-by-step instructions for adding a simple table to the Pinot system, uploading a segment, and querying the segment.

Suppose we have a transcript in CSV format containing students’ basic info and their scores for each subject.

studentID  firstName  lastName  gender  subject  score
200        Lucy       Smith     Female  Maths    3.8
200        Lucy       Smith     Female  English  3.5
201        Bob        King      Male    Maths    3.2
202        Nick       Young     Male    Physics  3.6

Along with the CSV data file, we will also need a separate CSV record reader config file in JSON format.

First, however, we will create a working directory called getting-started (in this example, it is on Desktop), and create two additional directories within getting-started called data and config.

Note that we also create a variable for the working directory, called WORKING_DIR, which later commands refer to.

$ mkdir getting-started
$ WORKING_DIR=/Users/host1/Desktop/getting-started
$ cd $WORKING_DIR
$ mkdir data
$ mkdir config

We will create the transcript CSV file in data, and the CSV config file in config.

$ touch data/test.csv
$ touch config/csv-record-reader-config.json

The test.csv file should look like this, with no header line at the top:

200,Lucy,Smith,Female,Maths,3.8
200,Lucy,Smith,Female,English,3.5
201,Bob,King,Male,Maths,3.2
202,Nick,Young,Male,Physics,3.6
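
One way to populate the file from the shell, while still in the working directory, is with a here-document (any text editor works just as well):

$ cat > data/test.csv << EOF
200,Lucy,Smith,Female,Maths,3.8
200,Lucy,Smith,Female,English,3.5
201,Bob,King,Male,Maths,3.2
202,Nick,Young,Male,Physics,3.6
EOF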

Instead of using a header line, we will use the CSV config JSON file csv-record-reader-config.json to specify the header:

{
  "header":"studentID,firstName,lastName,gender,subject,score",
  "fileFormat":"CSV"
}

In order to set up a table, we need to specify the schema of this transcript in transcript-schema.json, which we will store in config:

$ touch config/transcript-schema.json

transcript-schema.json should look like this:

{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    {
      "name": "studentID",
      "dataType": "STRING"
    },
    {
      "name": "firstName",
      "dataType": "STRING"
    },
    {
      "name": "lastName",
      "dataType": "STRING"
    },
    {
      "name": "gender",
      "dataType": "STRING"
    },
    {
      "name": "subject",
      "dataType": "STRING"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "score",
      "dataType": "FLOAT"
    }
  ]
}

To upload the schema, we navigate to the bin directory of the Pinot distribution we built earlier (the cd below is run from the directory in which you compiled Pinot) and use the command below:

$ VERSION=0.2.0
$ cd ./pinot-distribution/target/apache-pinot-incubating-$VERSION-SNAPSHOT-bin/apache-pinot-incubating-$VERSION-SNAPSHOT-bin/bin
$ ./pinot-admin.sh AddSchema -schemaFile $WORKING_DIR/config/transcript-schema.json -exec
Executing command: AddSchema -controllerHost [controller_host] -controllerPort 9000 -schemaFilePath /Users/host1/Desktop/getting-started/config/transcript-schema.json -exec
Sending request: http://[controller_host]:9000/schemas to controller: [controller_host], version: 0.2.0-SNAPSHOT-68092ab9eb83af173d725ec685c22ba4eb5bacf9
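
To confirm the schema was registered, you can list the schemas known to the controller through its REST API (assuming the controller is running locally on port 9000); transcript should appear in the response:

$ curl http://localhost:9000/schemas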

Then, we need to specify the table config in another JSON file (also stored in config), which links the schema to the table:

$ touch $WORKING_DIR/config/transcript-table-config.json

transcript-table-config.json should look like this:

{
  "tableName": "transcript",
  "segmentsConfig" : {
    "replication" : "1",
    "schemaName" : "transcript",
    "segmentAssignmentStrategy" : "BalanceNumSegmentAssignmentStrategy"
  },
  "tenants" : {
    "broker":"DefaultTenant",
    "server":"DefaultTenant"
  },
  "tableIndexConfig" : {
    "invertedIndexColumns" : [],
    "loadMode"  : "HEAP",
    "lazyLoad"  : "false"
  },
  "tableType":"OFFLINE",
  "metadata": {}
}

Then upload the table config to the Pinot cluster:

$ ./pinot-admin.sh AddTable -filePath $WORKING_DIR/config/transcript-table-config.json -exec
Executing command: AddTable -filePath /Users/host1/Desktop/getting-started/config/transcript-table-config.json -controllerHost [controller_host] -controllerPort 9000 -exec
{"status":"Table transcript_OFFLINE successfully added"}

At this point, the directory tree for our getting-started should look like this:

|-- getting-started
    |-- data
    |   |-- test.csv
    |-- config
        |-- csv-record-reader-config.json
        |-- transcript-schema.json
        |-- transcript-table-config.json

In order to upload our data to the Pinot cluster, we need to convert our CSV file into a Pinot Segment, which will be put in a new directory $WORKING_DIR/test2:

$ ./pinot-admin.sh CreateSegment -dataDir $WORKING_DIR/data -format CSV -outDir $WORKING_DIR/test2 -tableName transcript -segmentName transcript_0 -overwrite -schemaFile $WORKING_DIR/config/transcript-schema.json -readerConfigFile $WORKING_DIR/config/csv-record-reader-config.json
Executing command: CreateSegment  -generatorConfigFile null -dataDir /Users/host1/Desktop/getting-started/data -format CSV -outDir /Users/host1/Desktop/getting-started/test2 -overwrite true -tableName transcript -segmentName transcript_0 -timeColumnName null -schemaFile /Users/host1/Desktop/getting-started/config/transcript-schema.json -readerConfigFile /Users/host1/Desktop/getting-started/config/csv-record-reader-config.json -enableStarTreeIndex false -starTreeIndexSpecFile null -hllSize 9 -hllColumns null -hllSuffix _hll -numThreads 1
Accepted files: [file:/Users/host1/Desktop/getting-started/data/test.csv]
Finished building StatsCollector!
Collected stats for 4 documents
Created dictionary for STRING column: studentID with cardinality: 1, max length in bytes: 4, range: null to null
Created dictionary for STRING column: firstName with cardinality: 3, max length in bytes: 4, range: Bob to Nick
Created dictionary for STRING column: lastName with cardinality: 3, max length in bytes: 5, range: King to Young
Created dictionary for FLOAT column: score with cardinality: 4, range: 3.2 to 3.8
Created dictionary for STRING column: gender with cardinality: 2, max length in bytes: 6, range: Female to Male
Created dictionary for STRING column: subject with cardinality: 3, max length in bytes: 7, range: English to Physics
Start building IndexCreator!
Finished records indexing in IndexCreator!
Finished segment seal!
Converting segment: /Users/host1/Desktop/getting-started/test2/transcript_0_0 to v3 format
v3 segment location for segment: transcript_0_0 is /Users/host1/Desktop/getting-started/test2/transcript_0_0/v3
Deleting files in v1 segment directory: /Users/host1/Desktop/getting-started/test2/transcript_0_0
Driver, record read time : 1
Driver, stats collector time : 0
Driver, indexing time : 0
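
Before uploading, you can inspect the generated segment on disk; with the options used above, it is written under $WORKING_DIR/test2:

$ ls $WORKING_DIR/test2/transcript_0_0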

Once we have the Pinot Segment, we can upload it to our cluster:

$ ./pinot-admin.sh UploadSegment -segmentDir $WORKING_DIR/test2/
Executing command: UploadSegment -controllerHost [controller_host] -controllerPort 9000 -segmentDir /Users/host1/Desktop/getting-started/test2/
Compressing segment transcript_0_0
Uploading segment transcript_0_0.tar.gz
Sending request: http://[controller_host]:9000/v2/segments to controller: [controller_host], version: 0.2.0-SNAPSHOT-68092ab9eb83af173d725ec685c22ba4eb5bacf9

You did it! Now we can query the data in Pinot.

To get the total number of rows in the table:

$ ./pinot-admin.sh PostQuery -brokerPort 8000 -query "select count(*) from transcript"
Executing command: PostQuery -brokerHost [controller_host] -brokerPort 8000 -query select count(*) from transcript
Result: {"aggregationResults":[{"function":"count_star","value":"4"}],"exceptions":[],"numServersQueried":1,"numServersResponded":1,"numSegmentsQueried":1,"numSegmentsProcessed":1,"numSegmentsMatched":1,"numDocsScanned":4,"numEntriesScannedInFilter":0,"numEntriesScannedPostFilter":0,"numGroupsLimitReached":false,"totalDocs":4,"timeUsedMs":7,"segmentStatistics":[],"traceInfo":{}}

To get the average score of subject Maths:

$ ./pinot-admin.sh PostQuery -brokerPort 8000 -query "select avg(score) from transcript where subject = \"Maths\""
Executing command: PostQuery -brokerHost [controller_host] -brokerPort 8000 -query select avg(score) from transcript where subject = "Maths"
Result: {"aggregationResults":[{"function":"avg_score","value":"3.50000"}],"exceptions":[],"numServersQueried":1,"numServersResponded":1,"numSegmentsQueried":1,"numSegmentsProcessed":1,"numSegmentsMatched":1,"numDocsScanned":2,"numEntriesScannedInFilter":4,"numEntriesScannedPostFilter":2,"numGroupsLimitReached":false,"totalDocs":4,"timeUsedMs":33,"segmentStatistics":[],"traceInfo":{}}

To get the average score for Lucy Smith:

$ ./pinot-admin.sh PostQuery -brokerPort 8000 -query "select avg(score) from transcript where firstName = \"Lucy\" and lastName = \"Smith\""
Executing command: PostQuery -brokerHost [controller_host] -brokerPort 8000 -query select avg(score) from transcript where firstName = "Lucy" and lastName = "Smith"
Result: {"aggregationResults":[{"function":"avg_score","value":"3.65000"}],"exceptions":[],"numServersQueried":1,"numServersResponded":1,"numSegmentsQueried":1,"numSegmentsProcessed":1,"numSegmentsMatched":1,"numDocsScanned":2,"numEntriesScannedInFilter":6,"numEntriesScannedPostFilter":2,"numGroupsLimitReached":false,"totalDocs":4,"timeUsedMs":67,"segmentStatistics":[],"traceInfo":{}}