A river plugin for Elasticsearch that indexes Google Drive

Hi there,

I blogged a few weeks ago about a test run I did with Elasticsearch and Kibana3 (now just Kibana; the ‘3’ has since been dropped ;-)). It was so much fun and so pleasant to work with them that I wanted to go further and start digging into Elasticsearch.

After a few days of scratching my head and looking around the ES plugin ecosystem, I got the idea of writing a Google Drive river to actually learn from the trenches. So I am happy to announce the first release of this Elasticsearch plugin, which lets you index the content of a Google Drive with ES!

Main features

So what does this plugin do? Here are the features of this first release:

  • Connect to Google Drive in ‘offline’ mode using OAuth 2 (you do not need to be logged in to your Google account, you only need to authorize the plugin to access it),
  • Scan only the changes since the last scan for better efficiency,
  • Filter documents based on folder path (only 1 level for the moment),
  • Filter documents to include using wildcard expressions, such as *.doc or *.pdf,
  • Filter documents to exclude, also using wildcard expressions, such as *.avi or *.zip (of course, exclusions are computed first),
  • Index document content and document metadata (because it builds on the Attachment plugin),
  • Support MS Office, OpenOffice, Google documents and many other formats (full list here),
  • Support scan frequency configuration,
  • Support bulk indexing for optimization
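
To make the include/exclude behaviour a bit more concrete, here is a minimal sketch in Python of how such a filter could work. This is not the plugin's actual code (the plugin itself is not written in Python): the function name should_index and its parameters are hypothetical, and treating an empty include list as "accept everything" is my assumption. It only illustrates the rule that exclusions are evaluated before inclusions.

```python
from fnmatch import fnmatchcase


def should_index(name, includes=(), excludes=()):
    """Return True if a file name passes the include/exclude wildcard filters.

    Hypothetical helper for illustration only, not the plugin's real API.
    """
    # Exclusions are checked first, so an excluded file is never indexed,
    # even if it also matches an include pattern.
    if any(fnmatchcase(name, pattern) for pattern in excludes):
        return False
    # Assumption: with no include patterns, everything not excluded passes.
    if not includes:
        return True
    # Otherwise the name must match at least one include pattern.
    return any(fnmatchcase(name, pattern) for pattern in includes)
```

With includes set to *.doc and *.pdf and excludes set to *.zip, a file named report.pdf would be indexed while archive.zip would be skipped. A file matching both lists is skipped, because the exclusion check runs first.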

The project

The project is naturally hosted on GitHub: https://github.com/lbroudoux/es-google-drive-river. The plugin can be installed like any standard Elasticsearch plugin using the bin/plugin -install command. Everything you need for installation and configuration should be on the project front page.

Some features are still missing and some could be improved, but the basics should work well and fast. Want to give it a try? Or help with ideas, tests or contributions? Do not hesitate to give me your feedback. I’ll keep digging into Elasticsearch over the coming weeks, months… who knows!?

4 thoughts on “A river plugin for Elasticsearch that indexes Google Drive”

  1. Google Drive apps are really just a website that loads files into the browser from Google Drive. I was thinking that, since the main project file is just an XML document, it could be loaded into a page’s DOM tree and displayed in the page in a readable format (that is how the main GUI starts things, isn’t it?).
    As for editing RTF files, Google Drive can also import and export RTF files directly, without needing a separate conversion program.
    Of course, the export functions would need a lot of care, but simple creation and editing of the basic structure should work fine.

  2. Hello there. Awesome job with the plug-in. Very nice. I have a question for you… Have you seen an issue where you create an index for S3 and, after a few days, it is still empty although there are hundreds of records on your S3 path? We’re still developing, so every time I need to recreate the index, I have this issue. (Only on S3)
