Indexing your Amazon S3 bucket with Elasticsearch

I continue my Elasticsearch journey with a new plugin release today …

So, your company uses Amazon S3 as a storage backend for internal documentation? Or you’re running a Web application where users can upload and share files and content backed by S3? Now you want (or need) to have the whole stuff indexed and searchable using a “mind-blowing search engine” (say Elasticsearch ;-))? Well, the solution might be the Amazon S3 River plugin for ES, released today.

Main features

So what does this plugin do? Here are the features of this first release:

  • Connect to your S3 bucket using AWS credentials,
  • Scan only the changes since the last scan for better efficiency,
  • Filter documents based on folder path (no restriction on depth; you can use a path such as Work/Archives/2012/Project1/docs/),
  • Filter documents to include using wildcard expressions, such as *.doc or *.pdf,
  • Filter documents to exclude, also using wildcard expressions, such as *.avi or *.zip (of course, exclusions are computed first),
  • Index document content and document metadata (because it is based on the Attachment plugin),
  • Support MS Office, OpenOffice, Google documents and many other formats (full list here),
  • Support scan frequency configuration,
  • Support bulk indexing for optimization.
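
To make the filtering rules above concrete, here is a minimal Python sketch of how such a decision could be taken for each S3 object. It is not the plugin's actual code (the plugin is written in Java); the function name should_index and all of its parameters are hypothetical, and I assume here that wildcards are matched against the file name only, and that an empty include list accepts everything.

```python
from datetime import datetime, timezone
from fnmatch import fnmatchcase


def should_index(key, last_modified, last_scan,
                 path_prefix="", includes=(), excludes=()):
    """Decide whether an S3 object should be (re)indexed.

    Hypothetical helper, for illustration only; it does not
    reproduce the plugin's real implementation.
    """
    # Folder filter: keep only keys located under the configured path.
    if path_prefix and not key.startswith(path_prefix):
        return False
    # Incremental scan: skip objects not modified since the last scan.
    if last_scan is not None and last_modified <= last_scan:
        return False
    # Assumption: wildcards apply to the file name, not the full key.
    name = key.rsplit("/", 1)[-1]
    # Exclusions are computed first, so they win over inclusions.
    if any(fnmatchcase(name, pattern) for pattern in excludes):
        return False
    # Assumption: an empty include list means "accept everything".
    if includes and not any(fnmatchcase(name, p) for p in includes):
        return False
    return True


if __name__ == "__main__":
    last_scan = datetime(2013, 1, 1, tzinfo=timezone.utc)
    modified = datetime(2013, 2, 1, tzinfo=timezone.utc)
    folder = "Work/Archives/2012/Project1/docs/"
    for key in (folder + "report.pdf", folder + "video.avi", "Other/notes.doc"):
        decision = should_index(key, modified, last_scan, folder,
                                includes=["*.doc", "*.pdf"],
                                excludes=["*.avi", "*.zip"])
        print(key, decision)
```

Checking exclusions before inclusions means a file matching both lists is always skipped, which is the behaviour described in the feature list.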

The project

The project is naturally hosted on GitHub: https://github.com/lbroudoux/es-amazon-s3-river. The plugin can be installed like any standard Elasticsearch plugin, using the bin/plugin -install command. Everything you need for installation and configuration should be on the project front page.

Restriction

As a disclaimer: while developing this plugin, I discovered an Elasticsearch limitation: loaded plugins are not isolated from each other and share the same resources (this is because plugin libraries are added to the main ClassLoader, as you can see here). As a consequence, using this new plugin in conjunction with the previously released Google Drive River plugin is not possible (the Amazon and Google libraries use conflicting versions of Apache http-client). I’ll tackle this subject in the forthcoming days if I have enough time.

As usual, do not hesitate to give me your feedback through comments on this post, issues on the GitHub project or tweets (@lbroudoux)!

8 thoughts on “Indexing your Amazon S3 bucket with Elasticsearch”

  1. I’ve been experimenting with the Amazon S3 river plugin to connect my S3 bucket to my ES instance on EC2. I’m using version 0.0.1 of the plugin. I see that the content of the file is completely garbled, although the data is actually in JSON format. What am I doing wrong?

  2. Hi,
    I am trying to use the Amazon S3 river plugin to populate data from S3 into Elasticsearch. I am using Elasticsearch 1.6.0. It looks like I am missing something here, as data is not getting indexed immediately. Can you please explain how the bulk processor is timed? Also, it starts deleting indexes after some time.

    Is there an email address where I can have a quick chat with you?
    Thanks,
    dharmendra

    1. Hi @dharmendra,

      The S3 river plugin actually supports version 1.4.x of Elasticsearch. The latest releases, 1.4.0 and 1.4.1, have been tested with ES 1.4.2.

      I’m not sure I’ll update the plugin to the latest ES versions, because of the deprecation of the river mechanism (see https://www.elastic.co/blog/deprecating-rivers).

      I’d advise you to test with an older version of ES if you can afford it (i.e. no need for the latest bleeding-edge features), or to look directly for another solution such as Logstash if you want to prepare for ES 2.x.

      Regards,
