diff --git a/.gitignore b/.gitignore
index 712be594beabcb336729eebdcad76c25059743f4..30c48abaadbd09174facc8f1de574f4bb7fe23af 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,2 +1,3 @@
 site
 venv
+.DS_Store
\ No newline at end of file
diff --git a/docs/components/indexers/metadata-and-data.md b/docs/components/indexers/metadata-and-data.md
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..3af88f40ece1a13d308d8140da471f8d1a70f7ba 100644
--- a/docs/components/indexers/metadata-and-data.md
+++ b/docs/components/indexers/metadata-and-data.md
@@ -0,0 +1,20 @@
+# Metadata and data Indexer
+
+## Known design flaws
+
+### metadata `fields` might not always be up to date
+
+The indexer uses a catalog of fields associated with their types. This catalog is generated at the first indexation. When a field of a dataset has only null values, the indexer can't guess the correct type for that field, so it assigns it a Null type.
+
+When a value is later added to that field, the indexer tries to convert the field to its catalog type before pushing the document to Elasticsearch (doc processor). As it can't convert to a Null type, it launches a new analysis to update the fields/types catalog. Once this is done, the indexation can resume as normal.
+
+However, as the catalog update is launched by the doc processor, the metadata for that dataset has already been reindexed (see diagram of the process). As a consequence, the `fields` field of the metadata, which contains the list of fields and their types for that dataset, won't be updated before the next indexation.
+
+### `fields.list` not up to date after a new field is added in a dataset
+
+The indexer generates a file which lists, for each dataset, its fields and their types. This file is generated once at the first indexation and again at every regeneration of the catalog, which happens when a field is not found in that catalog. The flaw here is that if a field is added to a dataset but its name already exists in another dataset, the catalog won't be regenerated.
+
+The known options for working around this issue at the moment are:
+* use a name that doesn't exist in another dataset
+* [WARNING] delete the catalog files to force the generation of new ones. This might lead to indexation errors in Elasticsearch: if a field type changes, say from int to string, Elasticsearch won't accept the incoming data. In that case, refer to the last option.
+* create a new index
\ No newline at end of file
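
Both flaws described in the added document can be reproduced with a small model. The sketch below is not the indexer's code: `guess_type`, `build_fields_list`, and `field_is_known` are hypothetical names, and the assumption that the "field not found" check looks at field names across all datasets is inferred from the description, not confirmed by the source.

```python
from typing import Any, Dict, List

# Hypothetical shape of the generated file: dataset name -> {field name -> type}.
FieldsList = Dict[str, Dict[str, str]]


def guess_type(values: List[Any]) -> str:
    """Guess a type from the non-null values; an all-null field gets "null"."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "null"  # first flaw: no value to infer a real type from
    return type(non_null[0]).__name__


def build_fields_list(datasets: Dict[str, Dict[str, List[Any]]]) -> FieldsList:
    """Analyse every dataset once and record each field with its guessed type."""
    return {
        dataset: {name: guess_type(values) for name, values in fields.items()}
        for dataset, fields in datasets.items()
    }


def field_is_known(fields_list: FieldsList, field_name: str) -> bool:
    """Assumed lookup: a field counts as known if ANY dataset already has it."""
    return any(field_name in fields for fields in fields_list.values())


# Catalog generated at the first indexation.
fields_list = build_fields_list(
    {
        "a": {"id": [1, 2], "price": [1.5]},
        "b": {"id": [3], "note": [None]},
    }
)

# Dataset "b" later gains a "price" field. The name already exists in "a",
# so no regeneration is triggered and the entry for "b" stays stale.
regenerate = not field_is_known(fields_list, "price")
```

In this model `regenerate` is `False` even though `fields_list["b"]` has no `price` entry, which is the second flaw. Keying the lookup on the (dataset, field) pair instead of the field name alone would be one way to avoid it, at the cost of more frequent regenerations.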