Getting started with ElasticSearch

You have surely heard the taglines “Data is gold” or “Data is oil”! If not, you have heard them now. The notion is that with the right type and volume of data, you can pull out very valuable insights to support your business/IT goals. This data might come from your own applications, log files, social media, blogs, online news media, etc. Data is everywhere. And when you have that data, you want to search through it for useful information. That is where search engines come to the rescue. I will cover one such search engine: ElasticSearch.

Search engines come to the aid in various scenarios. A couple of common scenarios are:

  • You have too much incoming data and you really only care about a subset. And to make matters more interesting, let's assume that the data happens to be either unstructured or semi-structured in nature.
  • You have application data that you would like to access quickly. Similar to indexes on database tables, you want to index the data for faster look-up. For example, you have various forms and would like to provide users with a keyword-based search that returns links to all forms containing the matching keyword.

Now you could argue for putting the data into an RDBMS and using SQL. If your data is well structured, reasonably small in volume, and you know the exact query fields, then you could do that. Otherwise you might need to consider a search engine based solution.

In the open source world, the Apache Lucene library is the de facto search engine. Lucene is a very mature product and has been around for a while now. Various projects have attempted to build a higher level abstraction on top of Lucene. Two such products are Apache Solr and ElasticSearch. Both sit on top of Lucene and add features like failover, sharding, replication, etc. This article will focus on ElasticSearch.

In previous articles I have used the data from http://fec.gov/disclosurep/PDownload.do. I would suggest downloading one of the state-specific data sets since that will be a lot smaller. Of course, feel free to use ALL.zip, which is the national data set, though loading it will take a bit longer.

Installing ElasticSearch is very easy. Simply download the appropriate archive from http://www.elasticsearch.org/download/. Unzip the content to a folder. Go to the root folder and type in ‘bin/elasticsearch -f’ to start the server. The ‘-f’ switch ensures that it runs in the foreground so you can see log messages from the server. Take out the switch and it will run in the background. To make using ElasticSearch easier, install the elasticsearch-head plugin from http://mobz.github.com/elasticsearch-head/. ElasticSearch comes with many sensible default configurations already set, making it easier to get started.

To verify if the server is running, open a browser and type in http://localhost:9200/.

If you have the curl utility, then running ‘curl http://localhost:9200/’ should get you the same response as in the browser.

The data format coming back is JSON. Get familiar with JSON (if you are not already). The way you interact with ElasticSearch is via JSON. If you installed the plugin, then go to the URL http://localhost:9200/_plugin/head/. This web interface lets you do some basic administration as well as common tasks such as browsing your indexes. Especially useful is the ‘Any Request’ tab (4th from the left). This lets you execute queries and view results. Here is a screenshot with a sample query and response.

[Screenshot ES-1: a sample query and its response in the ‘Any Request’ tab]

The ElasticSearch query is to the left and the result is to the right. Both are in JSON format. A sample fec.gov data set from the 2012 presidential elections is included in the root of the GitHub Maven project. Let's first load the data into ElasticSearch. Once you have cloned the GitHub project, load it up in an IDE. I use SpringSource Eclipse STS. Go to the class LoadDataDriver.java and update the path to point to your file location. Now run the class. I did not get around to making this runnable via the command line.

I use my own custom CSV parser to parse the contents of the CSV. The jar file is called fft-0.8.jar and is in the root of the source code folder. Please run the following Maven command to add the jar to your local repository. This is a mandatory step before you proceed. Someday I will add the jar to a real repository…until then you will have to install it yourself.

And now here is LoadDataDriver.java

Once you have the data loaded, go to the ElasticSearch HEAD plugin URL and run the following query. The query gets a list of contributions for Romney, sorted by contribution amount and filtered for amounts >= 2000. Ignore the facets portion; it is a topic for a different time, and you can remove it from the query if you prefer. For the sake of this article just know that the statistical facet provides some aggregate information. The size parameter ensures that ElasticSearch returns at most 100 rows, and the offset (the from parameter) indicates where in the result set to start returning documents.

You could use this for paging through a large set, but beware that the contents of the response will change if your index is being updated often. If you need reliable scrolling over a set of documents (with no changes to the set as you scroll), either use ElasticSearch’s scrolling feature or extract the unique ids with one query, stash them someplace and paginate over that. In the latter case each page request would take its list of ids from the cached set and then query ElasticSearch with just those ids to get the rest of the data.
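To make the size/offset and "stash the ids" ideas a bit more concrete, here is a rough sketch in plain Java that only builds the JSON request bodies. Treat it as an illustration under assumptions, not as the query used in the project: the field names candNm and contbReceiptAmt, the class and method names, and the bool/match/range shape of the query are all made up for this example, so adjust them to your own index mapping. The candidate name is pasted into the JSON without escaping, which is only safe for simple values like the one used here.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch only. candNm and contbReceiptAmt are ASSUMED field names,
// not necessarily the ones in the project's index mapping.
class ContributionQueries {

    // Search body: match the candidate, keep amounts >= minAmount,
    // sort by amount descending. "from" is the offset, "size" the page length.
    static String searchBody(String candidate, int minAmount, int page, int size) {
        int from = page * size; // zero-based page number -> offset
        return "{\"from\":" + from + ",\"size\":" + size + ","
            + "\"query\":{\"bool\":{\"must\":["
            + "{\"match\":{\"candNm\":\"" + candidate + "\"}},"
            + "{\"range\":{\"contbReceiptAmt\":{\"gte\":" + minAmount + "}}}]}},"
            + "\"sort\":[{\"contbReceiptAmt\":{\"order\":\"desc\"}}]}";
    }

    // Slice one page out of a list of document ids that was stashed earlier.
    // Returns an empty list when the page lies past the end.
    static List<String> pageOfIds(List<String> cachedIds, int page, int size) {
        int start = Math.min(page * size, cachedIds.size());
        int end = Math.min(start + size, cachedIds.size());
        return cachedIds.subList(start, end);
    }

    // Body that asks for exactly the given ids via the ids query.
    static String idsBody(List<String> ids) {
        String values = ids.stream()
            .map(id -> "\"" + id + "\"")
            .collect(Collectors.joining(","));
        return "{\"size\":" + ids.size()
            + ",\"query\":{\"ids\":{\"values\":[" + values + "]}}}";
    }

    public static void main(String[] args) {
        // First page of 100 contributions of 2000 or more.
        System.out.println(searchBody("Romney", 2000, 0, 100));

        // Second page (two ids per page) from a stashed id list.
        List<String> stashed = List.of("id-1", "id-2", "id-3", "id-4", "id-5");
        System.out.println(idsBody(pageOfIds(stashed, 1, 2)));
    }
}
```

Either printed body can be pasted into the ‘Any Request’ tab of the HEAD plugin. With the stashed-ids approach the set of documents stays fixed while you page, at the cost of keeping the id list around yourself.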

If you are using Linux, then get the curl command line utility; running the following should give you the same JSON response as in the browser example above.

Be careful…this might return a lot of data. Pipe it to a file and view it in your favorite JSON editor.
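If you would rather not depend on curl, the same kind of request can be sent over plain HTTP from Java. The sketch below uses the JDK's built-in HttpClient (Java 11 or later), not the ElasticSearch Java API. It posts a JSON body to the _search endpoint and writes the response straight to a file. The class name, the response.json file name and the match_all sample query are placeholders chosen for this example, and the server is assumed to be at http://localhost:9200.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Sketch only: posts a query to ElasticSearch over HTTP and saves the reply.
class SearchToFile {

    // Build a POST to <baseUrl>/_search carrying the JSON query body.
    static HttpRequest searchRequest(String baseUrl, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_search"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
            .build();
    }

    // Send the request and write the response body into the target file.
    static Path saveResponse(HttpRequest request, Path target)
            throws IOException, InterruptedException {
        HttpResponse<Path> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofFile(target));
        return response.body();
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder query: any 10 documents from any index.
        String body = "{\"query\":{\"match_all\":{}},\"size\":10}";
        try {
            Path saved = saveResponse(
                searchRequest("http://localhost:9200", body),
                Path.of("response.json"));
            System.out.println("Wrote " + saved.toAbsolutePath());
        } catch (IOException e) {
            System.err.println("Request failed (is the server running?): " + e);
        }
    }
}
```

Writing to a file rather than the console keeps a large response manageable, and you can open response.json in whatever JSON viewer you like.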

If you are using Java, then you can use the ElasticSearch Java API. The library jar can be found in the install folder under the lib sub-folder. But we use Maven here, so refer to the pom.xml for the relevant dependency config (if you care). The Java API is very elegant (IMHO) and quite easy to use. The code is in DataLoaderImpl.java. An extract is noted below. Just run this class directly to see the output. Modify as you please.

 

 

There is so much more to talk about with ElasticSearch. Maybe another article, another day. The source code can be found at https://github.com/thomasma/elasticsearch-cmpgn-contribs.
