From f22d9d661f8d2645a7519adb601ee0d6590adbf2 Mon Sep 17 00:00:00 2001
From: Fabien Forestier <fforestier@MacBookAir.local>
Date: Fri, 26 Jun 2020 16:03:13 +0200
Subject: [PATCH] Add flaws section to the metadata-and-data indexer

---
 .gitignore                                    |  1 +
 docs/components/indexers/metadata-and-data.md | 20 +++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/.gitignore b/.gitignore
index 712be59..30c48ab 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,2 +1,3 @@
 site
 venv
+.DS_Store
\ No newline at end of file
diff --git a/docs/components/indexers/metadata-and-data.md b/docs/components/indexers/metadata-and-data.md
index e69de29..3af88f4 100644
--- a/docs/components/indexers/metadata-and-data.md
+++ b/docs/components/indexers/metadata-and-data.md
@@ -0,0 +1,20 @@
+# Metadata and data Indexer
+
+## Known design flaws
+
+### Metadata `fields` might not always be up to date
+
+The indexer uses a catalogue of fields associated with their type. This catalogue is generated at the first indexation. When a field of a dataset has only null values, the indexer cannot guess the correct type for that field, so it associates a Null type with it.
+
+When a value is later added to that field, the indexer tries to convert the field to its catalogued type before pushing the document to Elasticsearch (doc processor). Since it cannot convert to a Null type, it launches a new analysis to update the fields/types catalogue. Once this is done, the indexation resumes as normal.
+
+However, because the catalogue update is launched by the doc processor, the metadata for that dataset has already been reindexed (see the diagram of the process). As a consequence, the `fields` field of the metadata, which contains the list of fields and their type for that dataset, will not be updated before the next indexation.
+
+### `fields.list` not up to date after a new field is added to a dataset
+
+The indexer generates a file that lists, for each dataset, its fields and their associated type. This file is generated once at the first indexation and at every regeneration of the catalogue, which happens when a field is not found in that catalogue. The flaw here is that if a field is added to a dataset but its name already exists in another dataset, the catalogue will not be regenerated.
+
+The known options to work around this issue at the moment are:
+* use a field name that does not exist in any other dataset
+* [WARNING] delete the catalogue files to force the generation of new ones. This might lead to indexation errors in Elasticsearch: if a field's type changes, say from int to string, Elasticsearch will not accept the incoming data. In that case, refer to the last option.
+* create a new index
\ No newline at end of file
--
GitLab
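
A minimal Python sketch of the first flaw described in the patch, for readers unfamiliar with the catalogue mechanism. All names here (`guess_type`, `build_catalogue`, `process_document`, `CatalogueOutdated`) are hypothetical stand-ins for the indexer's real components, not its actual API; the point is only to show the sequence of events: an all-null field is catalogued with a Null type, a later non-null value forces a re-analysis, and that re-analysis happens after the dataset's metadata has already been reindexed.

```python
class CatalogueOutdated(Exception):
    """Raised when a document cannot be converted with the current catalogue."""


def guess_type(values):
    """Guess a field's type from sample values; an all-null column yields 'null'."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "null"
    if all(isinstance(v, int) for v in non_null):
        return "int"
    return "string"


def build_catalogue(rows):
    """First indexation (or re-analysis): map each field name to its guessed type."""
    fields = {key for row in rows for key in row}
    return {f: guess_type([row.get(f) for row in rows]) for f in fields}


def process_document(doc, catalogue):
    """Doc processor: convert values to their catalogued type before pushing to ES."""
    converted = {}
    for field, value in doc.items():
        unknown_field = field not in catalogue
        null_typed = catalogue.get(field) == "null" and value is not None
        if unknown_field or null_typed:
            # The catalogue cannot type this value: a re-analysis is needed
            # before indexation can resume.
            raise CatalogueOutdated(field)
        field_type = catalogue[field]
        converted[field] = int(value) if field_type == "int" and value is not None else value
    return converted


if __name__ == "__main__":
    first_batch = [{"name": "a", "score": None}, {"name": "b", "score": None}]
    catalogue = build_catalogue(first_batch)  # 'score' is catalogued as 'null'

    try:
        process_document({"name": "c", "score": 3}, catalogue)
    except CatalogueOutdated:
        # The re-analysis fixes the catalogue, but by this point the dataset's
        # metadata (including its `fields` list) has already been reindexed,
        # so it stays stale until the next indexation.
        catalogue = build_catalogue(first_batch + [{"name": "c", "score": 3}])
        print(catalogue)  # e.g. {'name': 'string', 'score': 'int'}
```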