Problem Statement
When upgrading a FOLIO system by calling the Okapi tenant install API with a list of modules to upgrade, an operator may choose to specify the loadReference=true tenant parameter. This may be required, for example, for the tenant to take advantage of a new controlled vocabulary specified by a FOLIO SIG, or to get an update to an existing controlled vocabulary. As currently implemented in most modules, this will cause the module to attempt to load all reference data (not just new data). New records will be created if needed, and existing records (matched by UUID) will be overlaid.
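For orientation, a request of this kind might be assembled roughly as follows. This is a minimal sketch that only builds the URL and body and sends nothing; the Okapi address, the tenant id, and the module id with its version are placeholders chosen for illustration.

```python
import json
from urllib.parse import urlencode

OKAPI_URL = "http://localhost:9130"  # placeholder address, an assumption
TENANT = "diku"                      # placeholder tenant id


def build_install_request(module_ids, load_reference=True):
    """Return (url, json_body) for a tenant install call."""
    # The tenant parameter travels as a single query value, so its "=" is
    # percent-encoded as %3D by urlencode.
    flag = "true" if load_reference else "false"
    query = urlencode({"tenantParameters": f"loadReference={flag}"})
    url = f"{OKAPI_URL}/_/proxy/tenants/{TENANT}/install?{query}"
    body = json.dumps([{"id": m, "action": "enable"} for m in module_ids])
    return url, body


# "mod-users-99.0.0" is a made-up module id, used only as an example.
url, body = build_install_request(["mod-users-99.0.0"])
print(url)
print(body)
```

Leaving the parameter out altogether is the alternative an operator has today; there is no middle ground between loading all reference data and loading none.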
This can lead to the following issues:
- If the tenant has altered or deleted any of the reference data loaded by the module when it was first enabled (which is possible in some cases both using the reference UI and using the module APIs), any changes will be overwritten with the system defaults, and deleted records will be re-created.
- If the reference data have data constraints (for example, the requirement that a particular property be unique), and the tenant has created a new reference record which causes a conflict with incoming reference data, the upgrade will fail, and the system will be left in an inconsistent state.
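Both issues can be reproduced with a small in-memory stand-in for a module's reference table. This is only a sketch under simplifying assumptions: a dict replaces the database, short ids replace UUIDs, and the load_reference function is hypothetical rather than taken from any module.

```python
class UniqueViolation(Exception):
    """Raised when two records share a value that must be unique."""


def load_reference(table, incoming, unique_key):
    """Upsert incoming records by id, enforcing uniqueness of unique_key."""
    for record in incoming:
        for other_id, other in table.items():
            if other_id != record["id"] and other[unique_key] == record[unique_key]:
                raise UniqueViolation(
                    f"{unique_key}={record[unique_key]!r} already used by {other_id}"
                )
        table[record["id"]] = dict(record)  # create, or overlay the existing record


SHIPPED = [{"id": "g1", "group": "staff"}, {"id": "g2", "group": "faculty"}]

# Issue 1: local edits and deletions are undone by a reload.
tenant = {}
load_reference(tenant, SHIPPED, "group")   # initial enable
tenant["g1"]["group"] = "employees"        # tenant renames a group
del tenant["g2"]                           # tenant deletes a group
load_reference(tenant, SHIPPED, "group")   # upgrade with loadReference=true
overwritten = tenant["g1"]["group"]        # back to the shipped value
recreated = "g2" in tenant                 # the deleted record is back

# Issue 2: a locally created record conflicts with incoming data.
tenant2 = {
    "g1": {"id": "g1", "group": "employees"},
    "x1": {"id": "x1", "group": "faculty"},  # created by the tenant
}
try:
    load_reference(tenant2, SHIPPED, "group")
    failed = False
except UniqueViolation:
    failed = True
# g1 was already overlaid before g2 failed, so the load is half applied.
partially_applied = tenant2["g1"]["group"] == "staff" and "g2" not in tenant2
```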
For more background, see: https://discuss.folio.org/t/reference-data-and-upgrades/2858; see also TC/Defining data types in FOLIO for automatic upgrades
Examples
- The reference data provided by mod-inventory-storage include labels in English. In order to provide labels in the local language, users must update the records (either in the UI or using the API). These labels are overwritten on upgrade.
- The reference data provided by mod-users include patron groups. It is possible to manage those data in the users settings UI (and in fact it is perfectly reasonable to want to customize patron groups). Patron groups have a unique constraint on the "group" property, so if the tenant has created a patron group that conflicts with incoming reference data, the upgrade will fail.
- The reference data provided by mod-inventory-storage includes a location hierarchy for Københavns Universitet. Locations can be managed in the tenant settings UI. It is very likely that a tenant will wish to remove the provided location hierarchy and create their own. Locations have unique constraints on the name and code. If the tenant creates a new location that conflicts with incoming data, the upgrade will fail.
Underlying issues:
We see this as a three-fold problem:
- The upgrade fails if there's a problem with loadReference=true, and leaves the system in an inconsistent state. It should not fail; it should report errors and continue.
  - Note: This is worse for non-RMB modules, which ignore the loadReference parameter and do their own thing.
- There is no way to get an output of what would be modified. The upgrade steps are all launched through Okapi as fire-and-forget, so the upgrade either fails or completes, with no interactive component and no ability to simulate it.
- There are no clear boundaries between the different types of data packaged with FOLIO, reference data vs. sample data. E.g. service points for DIKU tenant are loaded as reference data. A side effect is that tenants must edit the reference data according to local needs.
  - This is a pitfall for upgrades: modifications made by users (which they are encouraged and enabled to make through the GUI) can either break the upgrade or be overwritten, so this is a potential data-loss scenario.
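The missing "what would be modified" output from the second point could look something like the following. It is a hedged sketch of a dry run, not an existing Okapi or module feature; the function name, the record shapes, and the service point names are all invented for illustration.

```python
def plan_reference_load(current, incoming):
    """Classify incoming records without writing anything.

    current maps id -> record as stored for the tenant;
    incoming is the list of records shipped with the module.
    """
    plan = {"create": [], "overwrite": [], "unchanged": []}
    for record in incoming:
        existing = current.get(record["id"])
        if existing is None:
            plan["create"].append(record["id"])
        elif existing != record:
            plan["overwrite"].append(record["id"])  # a local change would be lost
        else:
            plan["unchanged"].append(record["id"])
    return plan


current = {
    "sp1": {"id": "sp1", "name": "Main desk"},  # renamed by the tenant
    "sp2": {"id": "sp2", "name": "Online"},     # still as shipped
}
incoming = [
    {"id": "sp1", "name": "Circ Desk 1"},
    {"id": "sp2", "name": "Online"},
    {"id": "sp3", "name": "Circ Desk 2"},       # new in this release
]
plan = plan_reference_load(current, incoming)
print(plan)
```

An operator could review such a plan before deciding whether to pass loadReference=true at all.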
We do have UIs for editing Reference Data. Tenants are able, encouraged, and need to modify some of the reference data according to local circumstances. Upgrades should be able to preserve these changes while bringing in system-supplied updates.
Overall goal:
Upgrades have to be a fire-and-forget mechanism, without worrying about deleted custom data, outdated system data and so on.
The end result should not leave the system in an inconsistent or unusable state. Upgrades should continue unless failures encountered are truly fatal to the process and it cannot continue without a particular step succeeding.
Minor errors may happen and should be logged clearly for remediation by the system administrator.
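The intended behaviour, stopping only on truly fatal failures and logging the rest, can be sketched as below. The step names, the is_fatal flag, and the logger name are assumptions made for this example; nothing here describes how Okapi or any module currently runs an upgrade.

```python
import logging

logger = logging.getLogger("reference-upgrade")  # hypothetical logger name


class FatalUpgradeError(Exception):
    """Raised only when the upgrade cannot sensibly continue."""


def run_upgrade(steps):
    """Run (name, callable, is_fatal) steps and collect non-fatal failures."""
    failures = []
    for name, step, is_fatal in steps:
        try:
            step()
        except Exception as exc:
            if is_fatal:
                raise FatalUpgradeError(name) from exc
            # Minor error: record it for the administrator and carry on.
            logger.warning("upgrade step %r failed: %s", name, exc)
            failures.append((name, str(exc)))
    return failures


completed = []


def load_patron_groups():
    raise ValueError("duplicate group 'faculty'")


steps = [
    ("migrate schema", lambda: completed.append("schema"), True),
    ("load patron groups", load_patron_groups, False),
    ("load locations", lambda: completed.append("locations"), False),
]
failures = run_upgrade(steps)
```

Here the patron group conflict is reported, and the location load still runs afterwards.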
Currently FOLIO delivers Reference Data and Sample Data, and these are not clearly distinguished. We would like to see:
- A more rigorous definition of reference data.
- The system should treat the base reference records as immutable, and introduce local updates as overlays on top of the base (like a customized view of the record for the tenant).
What follows is one possible model.
Proposal:
First, make a clear distinction between sample data and what we call 'Reference data' at the moment.
Second, split what we now call 'Reference data' into 3 layers or categories with different degrees of protection:
Name | Usage | Overwritten on upgrade | Immutable (towards user) | Example of data stored |
---|---|---|---|---|
System data | Core information the module cannot work without | YES | YES | PORT=9135; DEPENDENCY=mod_login, mod_permission |
Predefined Optional Working Set (POWS) | Predefined users/groups, schemas, $STUFF to work with | On admin request (LoadReferenceData=true) | YES | schema_book{ } |
Custom data | Can override any given reference & system data, as well as introducing new values. Whatever the user changes is stored here. This layer will hold sample data, since it's essentially real data for a fictional tenant. | NO (or only if the admin wishes to start with a clean slate, like apt-get purge on Debian). To load example data (like diku), introduce another switch (LoadExampleData=true). | NO | schema_book{ }; PORT=10587; User: JohnDoe (Admin) |
--- Result --- | This is not a layer, just what the module works with in the end | no layer | no layer | PORT=10587; DEPENDENCY=mod_login, mod_permission; schema_book{ }; User: JohnDoe (Admin) |
System data is meant to consist only of information without which the module fails critically. It may be overridden by POWS or the custom overlay, but it must never be unavailable; a network port to use is one example.
Predefined Optional Working Set (POWS) is a set of data that helps in using the module. Omitting it is possible, although basic functions of the module may not be available or may not work in the expected way (e.g. default user groups & permissions).
Custom data is a layer which keeps all user changes. It also hosts any example data that is loaded (like diku), and it must not be altered during upgrades (notable exception: the admin explicitly wishes to overwrite EVERYTHING changed by their users, which empties the entire layer). Example data, once introduced, is not updated during upgrades either, to prevent any changes to custom settings.
This introduces a transparent layered structure, in which the user sees data from all layers within the settings screen and may alter everything there, although the changes are only saved to the custom data layer. Modules should look for information in the following order, stopping at the first hit:
- Custom data layer
- POWS layer
- System data layer
Modules apply custom settings/data first, falling back to the POWS in case of missing information and further falling back to system data if necessary.
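This lookup order maps naturally onto a chained dictionary. The sketch below uses Python's collections.ChainMap with the example values from the table; the flat key/value layout is an assumption, and the contents of schema_book are left empty because the table does not give them.

```python
from collections import ChainMap

# The three layers, using the example values from the table above.
system = {"PORT": 9135, "DEPENDENCY": ["mod_login", "mod_permission"]}
pows = {"schema_book": {}}  # schema contents are not specified in the table
custom = {"PORT": 10587, "User": "JohnDoe (Admin)"}

# Lookups search custom first, then POWS, then system data.
effective = ChainMap(custom, pows, system)
port_seen_by_module = effective["PORT"]       # the custom value wins
dependency = effective["DEPENDENCY"]          # falls through to system data

# Writes through the ChainMap land in the first mapping only, so a user
# edit is stored in the custom layer and the base layers stay untouched.
effective["PORT"] = 10588
custom_port_after_edit = custom["PORT"]
system_port_after_edit = system["PORT"]

# Removing the override reveals the system default again.
del effective["PORT"]
port_after_reset = effective["PORT"]
```

One consequence of this design is that "reset to default" needs no copy of the shipped value: deleting the custom entry is enough.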
This way updating critical information and even POWS is possible without breaking the updating process, since both layers are untouched by users. Data which is stored in the custom layer may break some newly introduced features, making meaningful logging and error reporting towards the user necessary.
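An upgrade under this model could then replace the two base layers wholesale and only report on the custom layer. The following is a rough sketch under assumed data shapes; the upgrade function, the layer dict, and the patron_groups key are hypothetical.

```python
def upgrade(layers, new_system, new_pows):
    """Swap in new base layers; never modify the custom layer.

    Returns the upgraded layers plus warnings for custom entries that now
    hide a changed default, so an administrator can review them.
    """
    warnings = []
    for key in layers["custom"]:
        old_default = layers["pows"].get(key, layers["system"].get(key))
        new_default = new_pows.get(key, new_system.get(key))
        if old_default != new_default:
            warnings.append(f"custom value for {key!r} hides an updated default")
    upgraded = {
        "system": dict(new_system),
        "pows": dict(new_pows),
        "custom": layers["custom"],  # carried over unchanged
    }
    return upgraded, warnings


layers = {
    "system": {"PORT": 9135},
    "pows": {"patron_groups": ["staff", "faculty"]},
    "custom": {"PORT": 10587, "patron_groups": ["staff", "student"]},
}
upgraded, warnings = upgrade(
    layers,
    new_system={"PORT": 9135},
    new_pows={"patron_groups": ["staff", "faculty", "graduate"]},
)
print(warnings)
```

The upgrade itself cannot fail on tenant edits here, because it never writes to the layer that holds them; the warnings are the error reporting the paragraph above asks for.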
11 Comments
Ingolf Kuss
I imagine:
system data / reference data
(= a minimal set of data needed to operate the system. Can not be changed, but may be overlaid)
(immutable, not overwritten on upgrade; but will be changed (by the system) if an upgrade is made)
examples:
patron groups: faculty, staff, graduate, undergraduate
pre-loaded resource types: text, sound, other
1 example loan policy (loan period: 60 days, grace period: 7 days, ...)
Instance status types : catalogued, batch loaded, not catalogued
a default location (e.g. "City Campus")
default language, locale, time zone
...
examples for system data that we do not even see in the UI:
... ?
custom data / overlaid reference data
(local modifications of reference data: deletions, overlays, additions)
(not overwritten on upgrade; changeable by the user/tenant)
(not necessary to run the system - but needed by the tenant!)
Examples:
Institution doesn't distinguish between grad and undergrad students, so it deactivates ("deletes") "undergraduate" and renames (overlays) "graduate" to "student".
Tenant adds additional resource types: cd, map, electronic .....
Additional loan periods; standard loan period changed to 30 days
Renaming the standard location, adding more locations
Change system language to "Italian", change locale, time zone
...
sample data
large sets of data loaded only once, when the module is installed for the first time.
Illustrative examples to populate the database.
May be needed for tests; maybe even performance tests.
Needed for demonstration purposes (showing the system to a new, potential client)
Will be essentially deleted when the system becomes productive.
Will not be upgraded. Can be changed by the tenant / user.
Examples:
. anonymized, sample user data ("Xenia Sample") with fictitious names and addresses
. large sets of inventory data (but not the real data which the tenant holds)
. a larger set of material types
. a larger set of contributor types
. a larger set of identifier types
...
Jason Root
I agree that there should be some sample data category. This is very useful from a development and testing point of view.
I also want to point out that a big part of this is that the upgrade process should not fail if any of the included reference or sample data has been changed or removed. This will involve some extra tooling or features from Okapi, I believe, as these requests to enable new module versions for a tenant and load/use/upgrade whatever data happen at the tenant API there.
Lastly, if there is some desire to get away from the "fire and forget" method of upgrades, we need to make that clear. I am a fan of better visibility during the upgrade process. Some sort of validation of what is going to be run, or something declarative from Okapi saying "this is what is going to take place: foobar".
Jason Root
Data type examples, for some context, that I propose:
System data to include: A default Job profile needed for functionality from data-import, permissions for modules, service accounts for parts of the system to function/interact with other parts of the system (pub-sub)
Reference data/overlaid data to include: Lookup tables
Sample data to include: A default tenant and its locations, example patron groups, example loan policies, example circulation policies, example permission sets for staff/users of Folio
Custom data to include: What tenant and locations/service points the customer/operator define, custom material types, organizational specific patron groups
Johannes Drexl
As sample data is just something that is introduced by a user of another library, it's already there - within the custom data layer:
"Custom data is a layer which keeps all users changes. It is also host to any example data which is loaded (like diku)..."
Anne L. Highsmith
Johannes Drexl
Stephen Pampell
I solicited feedback from Christie Thomas on this document/discussion. This is what she sent me:
Other comments -
The working definition of reference data includes dataset management for mm that overlap with what is being discussed in the entity management working group. Some of these external entities that are already being used in FOLIO for reference / lookups are resource type / instance type and format. Entities that we will require being able to reference / look up in the future include personal names, organizational names, subjects, works, etc. Because management of these entities will require much of the same management as the external reference data already in FOLIO, there may be a common long term solution. The entity management working group is in the middle of initial visioning and documenting phase right now.
Anne L. Highsmith
A response to Christie's points (via Stephen)
Point #2 "Some reference data should not be updated by FOLIO on upgrade at all as these should be intentionally managed by the local institution. In Inventory, examples of this are statistical codes and instance status." I completely agree that existing values for certain reference data should not be updated by FOLIO on upgrade, and statistical codes and instance status are good examples of that. I would mention as a clarification, however, that existing records would need to be modified if the underlying schema were modified. If an additional data element were added to a location record, for instance, all location records would need to be modified on upgrade to add that data element. Perhaps this is obvious, but I felt that it was worth saying.
Nassib Nassar
The underlying problem here, in my view, is that FOLIO UUIDs are not globally unique, because for example it supports reference data being altered by a single tenant. In fact, the UUIDs are only unique per tenant. I think it would be a worthwhile goal for FOLIO to support UUIDs being globally unique, if at all possible.
Anne L. Highsmith
Aren't UUIDs, or at least version 4 uuids, globally unique by definition? FOLIO was my 1st introduction to UUIDs, so I read up on them in Wikipedia, and here is what the article says about potential collisions:
This number is equivalent to generating 1 billion UUIDs per second for about 85 years. A file containing this many UUIDs, at 16 bytes per UUID, would be about 45 exabytes; this is many times larger than the largest databases currently in existence, which are on the order of hundreds of petabytes. … Thus, the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion.
I remember being advised early on to use uuid v. 4 when generating uuids to put in records for this reason and have always done so.
Nassib Nassar
If the same UUID is used for records that can vary their content independently, then in effect the UUID is not globally unique.