I continue my Elasticsearch journey with a new plugin release today…
So, your company uses Amazon S3 as a storage backend for internal documentation? Or you're running a web application where users can upload and share files and content backed by S3? And now you want (or need) to have the whole stuff indexed and searchable using a “mind-blowing search engine” (say Elasticsearch ;-))? Well, the solution might be the Amazon S3 River plugin for ES released today.
So what does this plugin do? Here are the features for this first release:
- Connect to your S3 bucket using AWS Credentials,
- Scan only changes from last scan for better efficiency,
- Filter documents based on folder path (no restriction on the depth level, you can use such path as
- Filter documents to include using wildcard expressions, such as *.doc or *.pdf,
- Filter documents to exclude, also using wildcard expressions, such as *.avi or *.zip (of course, exclusions are computed first),
- Index document content and document metadata (because it is based on the Attachment plugin),
- Support MS Office, OpenOffice, Google documents and many other formats (full list here),
- Support scan frequency configuration,
- Support bulk indexing for optimization
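To make the filtering rules above more concrete, here is a minimal sketch of how path, exclusion and inclusion filters could be combined, with exclusions evaluated first. This is not the plugin's actual code: the class name `S3KeyFilter`, its constructor parameters and the `globToRegex` helper are all hypothetical, and I assume here that wildcards are matched against the file name only (the last segment of the S3 key).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper illustrating the filtering order; not taken from the plugin. */
public class S3KeyFilter {
    private final String pathPrefix;            // folder path to scan, may be null
    private final List<Pattern> includes = new ArrayList<>();
    private final List<Pattern> excludes = new ArrayList<>();

    public S3KeyFilter(String pathPrefix, List<String> includeGlobs, List<String> excludeGlobs) {
        this.pathPrefix = pathPrefix;
        for (String glob : includeGlobs) includes.add(globToRegex(glob));
        for (String glob : excludeGlobs) excludes.add(globToRegex(glob));
    }

    /** Translates a simple wildcard ('*' and '?') into a regular expression. */
    static Pattern globToRegex(String glob) {
        StringBuilder sb = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') sb.append(".*");
            else if (c == '?') sb.append('.');
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString());
    }

    /** Returns true if the S3 object key should be indexed. */
    public boolean accept(String key) {
        // 1. The key must live under the configured folder path, at any depth.
        if (pathPrefix != null && !key.startsWith(pathPrefix)) return false;
        // Assumption: wildcards apply to the file name, not the full key.
        String name = key.substring(key.lastIndexOf('/') + 1);
        // 2. Exclusions are computed first: any match rejects the document.
        for (Pattern p : excludes) {
            if (p.matcher(name).matches()) return false;
        }
        // 3. With no inclusion rule, everything that is left is accepted.
        if (includes.isEmpty()) return true;
        // 4. Otherwise at least one inclusion must match.
        for (Pattern p : includes) {
            if (p.matcher(name).matches()) return true;
        }
        return false;
    }
}
```

With a filter built from the prefix `docs/`, the inclusions `*.doc` and `*.pdf` and the exclusions `*.zip` and `draft*`, a key like `docs/a/b/report.pdf` would be accepted, while `docs/draft.pdf` would be rejected because the exclusion wins over the inclusion.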
The project is hosted on GitHub here: https://github.com/lbroudoux/es-amazon-s3-river. The plugin can be installed like any standard Elasticsearch plugin by using the
bin/plugin -install command. Everything you need for installation and configuration should be available on the project front page.
A disclaimer: while developing this plugin, I discovered an Elasticsearch limitation: loaded plugins are not isolated from each other and share the same resources (this is because plugin libraries are added to the main ClassLoader, as you can see here). As a consequence, using this new plugin in conjunction with the previously released Google Drive River plugin is not possible (the Amazon and Google libraries use conflicting versions of Apache http-client). I'll tackle this issue in the coming days if I find enough time.
As usual, do not hesitate to give me your feedback through comments on this post, issues on the GitHub project or tweets (@lbroudoux)!