# data.grandlyon.com indexer
    
This collection of Python scripts indexes into [Elasticsearch](https://www.elastic.co/) both the metadata and the data available from the [data.grandlyon.com](http://data.grandlyon.com) platform. Metadata are obtained from the [GeoNetwork](https://www.geonetwork-opensource.org/) "q" API; when available, data are obtained from the Web Feature Service, in the GeoJSON format. The various modules exchange information through [RabbitMQ](https://www.rabbitmq.com/), with messages serialized using [MessagePack](https://msgpack.org/). For the time being, (some of) these scripts are not really generic: they carry opinions closely tied to the [data.grandlyon.com](http://data.grandlyon.com) platform, in particular to the way metadata are originally entered.
    
The most "tedious" part of the workflow is the heuristic detection of data types, which eventually ensures that all the data values are cast to the "smallest" data type that can represent every value occurring within a given dataset: `int` is "smaller" than `float`, which is "smaller" than `string`. Datetimes are detected as well. [MongoDB](https://www.mongodb.com/) is used to store all the documents that need to be analyzed in order to compile the catalog containing all the field-type pairs.
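The widening logic described above can be sketched as follows. This is not the project's actual implementation, just a minimal illustration of the idea: each value is assigned the narrowest label it fits, and a field's final type is the "widest" label seen across all of its values (a single non-numeric entry forces the whole field to `string`).

```python
from datetime import datetime

def detect_type(value: str) -> str:
    """Return the narrowest type label able to represent `value`."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        pass
    try:
        # accepts ISO 8601 strings such as "2021-05-01T12:00:00"
        datetime.fromisoformat(value)
        return "datetime"
    except ValueError:
        pass
    return "string"

def widest_type(values) -> str:
    """Fold detect_type over all values, keeping the widest label:
    int < float < string; a homogeneous datetime column stays datetime,
    while a mixed datetime/number column degrades to string."""
    labels = {detect_type(v) for v in values}
    if labels <= {"datetime"}:
        return "datetime"
    if "string" in labels or "datetime" in labels:
        return "string"
    if "float" in labels:
        return "float"
    return "int"
```

For instance, `widest_type(["1", "2.5"])` yields `"float"`, while `widest_type(["1", "abc"])` yields `"string"`.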
    
Some "editorial metadata" are added to the raw (meta)data before the documents are actually inserted into Elasticsearch (cf. the `doc-indexer` module).
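Conceptually, this enrichment step amounts to attaching an extra block of fields to each document before indexing. The sketch below is only illustrative: the field names (`editorial-metadata`, `indexed-at`, `source-platform`) are hypothetical, not the ones actually used by the `doc-indexer` module.

```python
from datetime import datetime, timezone

def add_editorial_metadata(doc: dict) -> dict:
    """Return a copy of `doc` carrying a hypothetical 'editorial' block,
    as could be done before sending the document to Elasticsearch."""
    enriched = dict(doc)  # leave the original document untouched
    enriched["editorial-metadata"] = {
        "indexed-at": datetime.now(timezone.utc).isoformat(),
        "source-platform": "data.grandlyon.com",
    }
    return enriched
```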
    
A simplified view of the entire workflow is provided by the attached [draw.io](https://www.draw.io) diagram.
    
## How-to

0. Generate a `config.yaml` file, using the `config.template.yaml` file as a template; customize the `docker-compose.yml` file, if needed.
1. Run `docker-compose build`.
2. Run `docker-compose up -d rabbitmq`.
3. Run `docker-compose up -d mongo`.
4. [optional] Run `docker-compose up -d mongo-express`.
5. Run `docker-compose up metadata-getter`.
6. Run `docker-compose up metadata-processor`.
7. Run `docker-compose up doc-enricher`.
8. Run `docker-compose up docs-to-mongodb`.
9. Run `docker-compose up field-type-detector`.
10. Run `docker-compose up doc-processor`.
11. Run `docker-compose up doc-indexer`.
12. Run `docker-compose up reindexer`.

N.B.: Steps 6-12 can also be performed at the same time, in separate terminals.
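For step 0, a `config.yaml` might look roughly like the fragment below. All keys and values here are hypothetical, shown only to suggest the kind of settings involved; refer to `config.template.yaml` for the actual structure expected by the scripts.

```yaml
# Hypothetical example only — see config.template.yaml for the real keys.
rabbitmq:
  host: rabbitmq
  port: 5672
mongo:
  uri: mongodb://mongo:27017
  database: indexer
elasticsearch:
  host: http://elasticsearch:9200
```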
    
    
    ## TODO
    
    * implementing the authenticated access to (meta)data sources; extracting a small sample of restricted access datasets out of the "full" documents, to be used as a teaser for the not-yet-authorized user
    * incremental updates
    * the field type detection takes a lot of time: can it be optimized?
    * logging, reporting
    * testing, testing, testing, ...
    * etc.