Elasticsearch in 5 Minutes

What is Elasticsearch?

Elasticsearch is a search engine used for quickly running queries on large amounts of data.

While Elasticsearch can function as a data store, it's often used alongside systems such as Hadoop and MongoDB to provide near real-time full-text search.

Elasticsearch was first released in 2010. It has since been widely adopted for enterprise search and competes with Apache Solr for analytical querying in the big data space.

Companies like Netflix and LinkedIn use Elasticsearch to provide real-time search results on massive amounts of data.

Why use Elasticsearch?

Elasticsearch is well suited to near real-time search on big data. It inherently supports distributed cluster computing and supplements data stores like HDFS and MongoDB.

Elasticsearch is most popular for full-text search. If your application is an e-commerce website with millions of products to sell, Elasticsearch can help your customers find products much faster than alternative methods.

How does Elasticsearch work?

Elasticsearch works by building indexes over the data you store. Rather than scanning every document at query time, an index maps each term to the documents that contain it, so query results can be retrieved efficiently.

Elasticsearch exposes a RESTful interface for both querying and storing information. You interact with it by sending HTTP requests such as GET, POST and PUT, usually with JSON request bodies.

Since Elasticsearch is Java-based, it runs on any platform with a supported JVM.

Getting started with Elasticsearch

To use Elasticsearch, download the latest version from the official website. After extracting the tar/zip file, you can run Elasticsearch from the project's /bin folder.

Elasticsearch uses a RESTful interface for creating indexes and populating data. For example, to create an index you simply call:

PUT http://localhost:9200/users

Creating an index

This will create a users index in Elasticsearch. Note that Elasticsearch listens on localhost:9200 by default, but this is easily configurable.
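
The same call can be made from code. The following is a minimal sketch using only Python's standard library; the helper names build_create_index_request and send are hypothetical, and it assumes an unsecured local node at the default address. The request is built and printed but not sent, so the snippet runs without a server.

```python
import json
import urllib.request

# Assumption: a local, unsecured node at the default address.
BASE_URL = "http://localhost:9200"


def build_create_index_request(index_name, base_url=BASE_URL):
    """Build (but do not send) the PUT request that creates an index."""
    return urllib.request.Request(url=f"{base_url}/{index_name}", method="PUT")


def send(request):
    """Send a request and decode the JSON response (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_create_index_request("users")
print(request.get_method(), request.full_url)
# To create the index against a running node, call: send(request)
```

Keeping request construction separate from sending makes the request easy to inspect before it touches a cluster.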

Populating data

To populate the index with data, you could call:

POST http://localhost:9200/users/_bulk

Request Body

{"index":{"_index":"users","_id":"1"}}
{"name":"Sam","age":36}
{"index":{"_index":"users","_id":"2"}}
{"name":"Sara","age":32}

Notice how we submit a POST request to /users/_bulk. This _bulk endpoint is part of the Elasticsearch REST API. It expects newline-delimited JSON: each action and each document sits on a single line of its own, and the body must end with a newline.
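
Writing a bulk body by hand is error-prone, so it is usually generated. Here is a small sketch, assuming the documents are held in a dictionary keyed by id; the function name build_bulk_body is hypothetical. It only produces the request body and does not contact Elasticsearch.

```python
import json


def build_bulk_body(index_name, documents):
    """Serialize {id: document} pairs into a newline-delimited _bulk body."""
    lines = []
    for doc_id, document in documents.items():
        # Action line: index the document that follows under this id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself, kept on a single line.
        lines.append(json.dumps(document))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


users = {
    "1": {"name": "Sam", "age": 36},
    "2": {"name": "Sara", "age": 32},
}
body = build_bulk_body("users", users)
print(body, end="")
```

The resulting string would be sent as the body of the POST to /users/_bulk with a Content-Type of application/x-ndjson.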

Elasticsearch automatically creates field mappings, a feature known as dynamic mapping, based on the fields and value types in the JSON documents you index.
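
To give a feel for how value types drive the mapping, here is a deliberately simplified sketch. It is not Elasticsearch's implementation: the function infer_field_type is hypothetical, and it ignores date detection, arrays and nested objects. It only mirrors the default rules that JSON booleans map to boolean, whole numbers to long, fractional numbers to float, and strings to text.

```python
def infer_field_type(value):
    """Guess the field type a JSON value would receive (simplified)."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        return "text"
    return None  # null values do not create a field


def infer_mapping(document):
    """Apply infer_field_type to every top-level field of a document."""
    return {field: infer_field_type(value) for field, value in document.items()}


print(infer_mapping({"name": "Sam", "age": 36}))    # {'name': 'text', 'age': 'long'}
print(infer_mapping({"name": "Sam", "age": "36"}))  # {'name': 'text', 'age': 'text'}
```

The second call shows why it matters whether a number is quoted: a quoted age is treated as text, which affects sorting and range queries.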

Searching data

Using the Search API, you can easily query data stored in Elasticsearch:

GET http://localhost:9200/_search?q=name:Sam

This GET request searches every index for documents whose name field matches Sam.
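
As a sketch of how an application might build such searches, the snippet below constructs the URL with Python's standard library and also shows a roughly equivalent request body using a match query. The helper names are hypothetical, and the local address is an assumption; nothing is sent over the network.

```python
import json
from urllib.parse import urlencode

# Assumption: a local node at the default address.
BASE_URL = "http://localhost:9200"


def build_search_url(field, value, index=None, base_url=BASE_URL):
    """Build a URI search; leave index unset to search every index."""
    path = f"/{index}/_search" if index else "/_search"
    # urlencode percent-escapes the colon, which the server decodes again.
    query_string = urlencode({"q": f"{field}:{value}"})
    return f"{base_url}{path}?{query_string}"


def build_match_query(field, value):
    """Build the request-body form of the search: a match query on one field."""
    return json.dumps({"query": {"match": {field: value}}})


print(build_search_url("name", "Sam"))
print(build_search_url("name", "Sam", index="users"))
print(build_match_query("name", "Sam"))
```

Passing index="users" restricts the search to a single index, which is usually what an application wants once it has more than one.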

That's it! While Elasticsearch comes with a rich API for searching, aggregating, etc., this covers the basics of how Elasticsearch works and how it can be used with your application.

You can find many more examples of querying data and configuring Elasticsearch here.

Elasticsearch: a deeper dive...

What is Lucene?

Elasticsearch is built on the Apache Lucene project. Lucene is a Java-based library for running fast searches. Lucene is free, open-source and actively maintained by the Apache Software Foundation.

The library leverages searchable indexes instead of searching text directly to provide superior performance in data retrieval. This has made Lucene fundamental to the evolution of internet-based search engines.

Lucene works by creating inverted indexes to organize data more efficiently. Lucene takes a page-centric data structure and inverts it into a keyword-centric structure. This post has some really good information on how Lucene works and why it plays such a fundamental role in modern-day search engines.
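
The idea of an inverted index can be illustrated in a few lines. This is a toy sketch, not Lucene's actual data structure: it lowercases and splits on whitespace, whereas a real analyzer also handles punctuation, stemming and scoring. The function names and sample documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase term to the sorted ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    term_sets = [set(index.get(term, [])) for term in query.lower().split()]
    return sorted(set.intersection(*term_sets)) if term_sets else []


documents = {
    1: "fast full text search",
    2: "distributed search engine",
    3: "fast distributed storage",
}
index = build_inverted_index(documents)
print(index["search"])                    # [1, 2]
print(search(index, "fast distributed"))  # [3]
```

A lookup touches only the postings for the query terms, so its cost depends on how many documents match rather than on how many documents exist.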

How does Elasticsearch compare to other NoSQL data stores like MongoDB?

Elasticsearch and MongoDB are both non-relational data stores. They store data in JSON-like objects called documents, and these documents don't have enforced relationships with other indexes or collections of data. While this can make transactional processing more involved, a non-relational data store excels at read operations because it doesn't have to perform the numerous table joins characteristic of an RDBMS.

While Elasticsearch could be used to store data in much the same way as MongoDB, it's more commonly used to supplement data stored in MongoDB. Many applications rely on MongoDB for storage and on Elasticsearch for indexing and faster querying.

How does Elasticsearch compare to Apache Solr?

Apache Solr is another search platform based on Lucene. While Elasticsearch and Solr share many similarities, they differ in community support and ease of use. Asaf Yigal does a really good job explaining the key differences between the two in his article "Solr vs. Elasticsearch: Who’s The Leading Open Source Search Engine?"

Should I use Elasticsearch in my project?

If your application needs to support near real-time search queries, Elasticsearch is a good fit. It is particularly beneficial if you work in a distributed environment, as it inherently supports sharding. This makes it a popular supplement to HDFS and the Hadoop ecosystem as well.
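
Conceptually, sharding assigns each document to one shard by hashing a routing value (the document id by default) and taking the result modulo the number of primary shards. The sketch below illustrates that idea only. The function pick_shard is hypothetical, and CRC32 is a stand-in chosen because it ships with Python; Elasticsearch uses its own hash function and a slightly more involved formula.

```python
import zlib


def pick_shard(routing_key, num_primary_shards):
    """Map a routing key to a shard number: hash it, then take the modulo."""
    # Stand-in hash (assumption): CRC32 is deterministic and in the standard
    # library, but it is not the hash Elasticsearch uses.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


# Route 1,000 hypothetical document ids across 5 shards and count them.
shard_counts = {}
for doc_id in range(1, 1001):
    shard = pick_shard(str(doc_id), num_primary_shards=5)
    shard_counts[shard] = shard_counts.get(shard, 0) + 1
print(sorted(shard_counts.items()))
```

Because the same key always hashes to the same shard, any node can work out where a document lives without consulting a central lookup table.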

Conclusion

Elasticsearch excels at full-text search on large data sets. Use Elasticsearch if you rely on the near real-time querying of big data. Remember that Elasticsearch is easy to install and exposes a rich RESTful interface for storing, manipulating, and querying data.

Your thoughts?