Alessandro Cerioni authored
Updated es_template in order to avoid indexing some fields as full-text.

data.grandlyon.com indexer

This collection of Python scripts indexes into Elasticsearch both the metadata and the data available from the data.grandlyon.com platform. Metadata are obtained from the GeoNetwork "q" API; when available, data are obtained from the Web Feature Service, in the GeoJSON format. The various modules exchange information through RabbitMQ, with messages serialized using MessagePack. For the time being, some of these scripts are not fully generic: they carry opinions closely tied to the data.grandlyon.com platform, in particular to the way metadata are originally entered.
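As a rough sketch of the inter-module exchange, each stage hands a document to the next as a plain dict serialized with MessagePack. The keys below are hypothetical, not the actual message schema used by the scripts; only the packb/unpackb calls reflect the real msgpack API.

```python
import msgpack

# Hypothetical payload: a header identifying the pipeline stage, plus the
# document body being passed along. The real scripts define their own schema.
message = {
    'header': {'session_id': 'abc123', 'step': 'doc-enricher'},
    'body': {'uuid': 'some-dataset-uuid', 'properties': {'gid': 1}},
}

payload = msgpack.packb(message, use_bin_type=True)  # bytes on the wire
decoded = msgpack.unpackb(payload, raw=False)        # dict on arrival

assert decoded == message
```

The serialized bytes would then be published to a RabbitMQ queue and unpacked by the consuming module on the other side.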

The most "tedious" part of the workflow is the heuristic detection of data types, which eventually ensures that every data value is cast to the "smallest" data type able to represent all the values occurring for a given field across the various datasets: int is "smaller" than float, which is "smaller" than string. Datetimes are detected as well. MongoDB is used to store all the documents that need to be analyzed in order to compile the catalog containing all the field-type pairs.
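The widening logic can be sketched as follows. This is a minimal illustration of the idea, not the project's actual detector: the date formats, type names, and combination rules are assumptions.

```python
from datetime import datetime
from functools import reduce

def detect(value: str) -> str:
    """Guess the narrowest type able to represent one raw value (a sketch)."""
    # Try a couple of common date formats first (hypothetical choice).
    for fmt in ('%Y-%m-%d', '%Y-%m-%dT%H:%M:%S'):
        try:
            datetime.strptime(value, fmt)
            return 'datetime'
        except ValueError:
            pass
    try:
        int(value)
        return 'int'
    except ValueError:
        pass
    try:
        float(value)
        return 'float'
    except ValueError:
        return 'string'

def widen(a: str, b: str) -> str:
    """Smallest common type for two detected types: int < float < string."""
    if a == b:
        return a
    if {a, b} == {'int', 'float'}:
        return 'float'
    return 'string'  # any other mix falls back to full-text string

# The field's final type is the widest type needed across all its values:
field_type = reduce(widen, map(detect, ['1', '2.5', '3']))  # → 'float'
```

Running `detect` over every value of every field, then folding with `widen`, yields the field-type catalog mentioned above.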

Some "editorial metadata" are added to the raw (meta)data before the documents are actually inserted into Elasticsearch (cf. the "doc-indexer" module).
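Conceptually, the enrichment keeps the raw document untouched and attaches the editorial fields alongside it. The field names below (`editorial-metadata`, `isOpenAccess`, ...) are hypothetical placeholders, not the project's actual schema.

```python
def enrich(doc: dict, editorial: dict) -> dict:
    """Return a copy of the raw document with editorial metadata attached
    under a dedicated key, leaving the original document unmodified."""
    out = dict(doc)
    out['editorial-metadata'] = editorial  # hypothetical key name
    return out

raw = {'uuid': 'abc', 'title': 'Parks'}
enriched = enrich(raw, {'isOpenAccess': True, 'lastUpdate': '2019-01-01'})
```

Keeping editorial fields under their own key avoids collisions with field names coming from the source datasets.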

A simplified view of the entire workflow is provided by the attached draw.io diagram.

How-to

  1. Generate a config.yaml file, using the config.template.yaml file as a template; customize the docker-compose.yml file, if needed.
  2. Run docker-compose build.
  3. Run docker-compose up -d rabbitmq.
  4. Run docker-compose up -d mongo.
  5. [optional] docker-compose up -d mongo-express.
  6. Run docker-compose up metadata-getter.
  7. Run docker-compose up metadata-processor.
  8. Run docker-compose up doc-enricher.
  9. Run docker-compose up docs-to-mongodb.
  10. Run docker-compose up field-type-detector.
  11. Run docker-compose up doc-processor.
  12. Run docker-compose up doc-indexer.
  13. Run docker-compose up reindexer.

N.B.: Steps 6-12 can also be performed at the same time, in separate terminals.

TODO

  • implement authenticated access to (meta)data sources; extract a small sample of restricted-access datasets out of the "full" documents, to be used as a teaser for not-yet-authorized users
  • incremental updates
  • the field type detection takes a lot of time: can it be optimized?
  • logging, reporting
  • testing, testing, testing, ...
  • etc.