    # data.grandlyon.com indexer
    
    
    This collection of Python scripts allows one to index into [Elasticsearch](https://www.elastic.co/) both the metadata and the data available from the [data.grandlyon.com](http://data.grandlyon.com) platform. Metadata are obtained from the [GeoNetwork](https://www.geonetwork-opensource.org/) "q" API; when available, data are obtained from a PostGIS database, in the GeoJSON format. For the time being, only geographical data stored in PostGIS are indexed. The various modules exchange information through [RabbitMQ](https://www.rabbitmq.com/). Messages are serialized using [MessagePack](https://msgpack.org/). Some of the scripts are not at all generic, as they carry some opinions that are closely related to the [data.grandlyon.com](http://data.grandlyon.com) platform, in particular to the way metadata are originally entered.
    
    The most "tedious" part of the workflow concerns the heuristic detection of data types, which ultimately ensures that all data values are cast to the "smallest" data type that can represent all the values occurring within the various datasets: `int` is "smaller" than `float`, which is "smaller" than `string`. Datetimes are detected as well.
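The "smallest type" rule described above can be illustrated with a short sketch. This is not the project's actual heuristic: the function names, the `TYPE_RANK` table, the ISO 8601 datetime check, and the fallback to `str` when datetimes are mixed with other values are all assumptions made for illustration.

```python
from datetime import datetime

# Ordered from "smallest" to "largest"; this ranking is an assumption that
# mirrors the int < float < string ordering described in the text.
TYPE_RANK = {"int": 0, "float": 1, "str": 2}


def detect_type(value):
    """Return the smallest type name able to represent a single raw value."""
    text = str(value).strip()
    try:
        int(text)
        return "int"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        pass
    try:
        # Only ISO 8601-like strings are recognised in this sketch.
        datetime.fromisoformat(text)
        return "datetime"
    except ValueError:
        return "str"


def smallest_common_type(values):
    """Return the smallest type able to represent every value of a field."""
    detected = {detect_type(v) for v in values}
    if detected == {"datetime"}:
        return "datetime"
    if "datetime" in detected:
        # Assumption: datetimes mixed with anything else fall back to strings.
        return "str"
    # An empty input has no evidence either way; default to the largest type.
    return max(detected, key=TYPE_RANK.__getitem__, default="str")
```

For instance, a field holding `"1"` and `"2.5"` would be cast to `float`, while a single non-numeric value anywhere in the field would push it to `str`.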
    
    
    Some "editorial metadata" are added to raw (meta)data before actually inserting documents into Elasticsearch (cf. the "doc-indexer" module).
    
    
    Here is a simplified overview of the entire workflow.
    
    ![Indexer workflow diagram](./doc/data-grandlyon-com-indexer-workflow-drawio.png)
    
    0. Generate a `config.yaml` file, using the `config.template.yaml` file as a template; customize the `docker-compose.yml` and `docker-compose-tools.yml` files, if needed.
    1. Run `docker-compose build`
    
    2. Run `docker-compose --compatibility up -d`
    
    3. [optional] Run `docker-compose -f docker-compose-tools.yml up delete-queues`
    4. [optional] Run `docker-compose -f docker-compose-tools.yml up delete-indices`
    
    5. Wait!
    
    6. `$ curl -X GET http://<the_hostname_where_the_API_is_running>:<the_API_listening_port>/uuid/<the_uuid_of_a_given_dataset|all>[?force=true]`
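The URL used in step 6 can also be assembled programmatically. The sketch below only builds the URL from the pattern shown in that step. The function name `build_index_url` and the example host and port are hypothetical, and nothing here is part of the project's code.

```python
from urllib.parse import quote, urlencode


def build_index_url(host, port, uuid="all", force=False):
    """Build the URL queried in step 6 for one dataset UUID, or for "all"."""
    url = f"http://{host}:{port}/uuid/{quote(uuid, safe='')}"
    if force:
        # Mirrors the optional "?force=true" suffix shown in step 6.
        url += "?" + urlencode({"force": "true"})
    return url
```

The resulting string can then be passed to any HTTP client, for example `urllib.request.urlopen(build_index_url("localhost", 8000))`, where `localhost` and `8000` are placeholders.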
    
    ## Aliases migration
    
    This project also includes a script that allows one to migrate aliases from one instance of Elasticsearch to another.
    
    Usage example:
    
    ```shell
    python tools/alias_copier.py --src-es https://<source-host>:443 --dst-es https://<destination-host>:443 --src-idx <src-index> --dst-idx <dst-index> --skip <ex: preprod>
    ```
    
    Prefixes or suffixes can be added to the alias names with `--prepend` and `--append`, respectively.
    
    It is possible to skip the copy of aliases including a particular string. The argument takes a list of strings: `--skip bar foo`.
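To make the combined effect of `--prepend`, `--append` and `--skip` concrete, here is a minimal sketch of the renaming logic those options suggest. It is not taken from `tools/alias_copier.py`: the function `plan_alias_copy` and its behaviour (substring matching for skipped aliases, plain string concatenation for prefixes and suffixes) are assumptions based on the descriptions above.

```python
def plan_alias_copy(aliases, prepend="", append="", skip=()):
    """Map each source alias to its destination name, dropping skipped ones."""
    plan = {}
    for alias in aliases:
        # Assumption: an alias is skipped if it contains any of the strings.
        if any(fragment in alias for fragment in skip):
            continue
        plan[alias] = f"{prepend}{alias}{append}"
    return plan
```

With `skip=["preprod"]` and `prepend="new-"`, an alias named `data.preprod` would be left out, while `data.prod` would be copied as `new-data.prod`.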
    
    The complete list of arguments can be displayed by running the following command:

    ```shell
    python tools/alias_copier.py --help
    ```
    
    
    ## Tests
    
    Install pytest
    
        pip install pytest
    
    Run the tests
    
        python -m pytest
    
    
    * producing indexation reports out of log messages (cf. the branches `Denis_clean_full_datalogger_31Oct` and `Denis_full_datalogs_Stack_October_31`)
    
    * indexing non-geographical data
    * rendering the code less opinionated / more generic
    
    * removing dead code (ex.: `deferred_count`, present in various workers)
    * periodically cleaning up the working directory
    * adding a `health` endpoint to the API, which should at least check that
      * no (meta)data in Elasticsearch is older than N hours/days/... (depending on the expected behaviour);
      * Elasticsearch did not enter the read-only state
        *N.B.*: should Elasticsearch enter such a state, writes can be re-enabled by issuing the following command from a shell: `$ curl -XPUT -H "Content-Type: application/json" http(s)://<es_host>:<es_port>/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'`
    * upgrading to Elasticsearch 7.x
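The two checks proposed for the `health` endpoint in the list above could look roughly like the following. This is only a sketch of a feature that does not exist yet: the function names are hypothetical, the `settings` argument is assumed to have the nested shape of an Elasticsearch "get settings" response already parsed from JSON, and the way the last update time is obtained is left out entirely.

```python
from datetime import datetime, timedelta, timezone


def read_only_indices(settings):
    """List indices whose settings carry the read_only_allow_delete block."""
    blocked = []
    for index_name, body in settings.items():
        blocks = body.get("settings", {}).get("index", {}).get("blocks", {})
        # The flag may arrive as a string ("true") or as a boolean.
        if str(blocks.get("read_only_allow_delete", "false")).lower() == "true":
            blocked.append(index_name)
    return sorted(blocked)


def is_stale(last_update, max_age, now=None):
    """Tell whether the most recent (meta)data update is older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - last_update > max_age


def health_report(settings, last_update, max_age, now=None):
    """Combine both checks into a single report dictionary."""
    blocked = read_only_indices(settings)
    stale = is_stale(last_update, max_age, now)
    return {
        "healthy": not blocked and not stale,
        "read_only_indices": blocked,
        "stale": stale,
    }
```

Keeping the checks as pure functions over already-fetched values makes them easy to test without a running Elasticsearch instance; the API layer would only need to fetch the settings and the latest timestamp, then call `health_report`.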