
Problem Statement

When upgrading a FOLIO system by calling the Okapi tenant install API with a list of modules to upgrade, an operator may choose to specify the loadReference=true tenant parameter. This may be required, for example, for the tenant to take advantage of a new controlled vocabulary specified by a FOLIO SIG, or to get an update to an existing controlled vocabulary. As currently implemented in most modules, this will cause the module to attempt to load all reference data (not just new data). New records will be created if needed, and existing records (matched by UUID) will be overlaid.
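For orientation, a request of this kind can be assembled roughly as follows. This is a minimal sketch and not a definitive client: the install path and the tenantParameters query parameter follow the Okapi guide as we understand it, while the Okapi URL, tenant ID, and module ID are placeholders. The helper name build_install_request is hypothetical, and nothing is actually sent.

```python
import json
from urllib.parse import quote, urlencode


def build_install_request(okapi_url, tenant, module_ids, load_reference=True):
    """Return (url, body) for a tenant install call; nothing is sent."""
    params = {}
    if load_reference:
        # Tenant parameters are passed as key=value pairs in one query value.
        params["tenantParameters"] = "loadReference=true"
    url = f"{okapi_url}/_/proxy/tenants/{quote(tenant)}/install"
    if params:
        url += "?" + urlencode(params)
    # One entry per module to enable (or upgrade) for the tenant.
    body = json.dumps([{"id": m, "action": "enable"} for m in module_ids])
    return url, body


# Placeholder values, not taken from any real deployment.
url, body = build_install_request(
    "http://localhost:9130", "diku", ["mod-users-99.0.0"]
)
print(url)
print(body)
```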

This can lead to the following issues:

  • If the tenant has altered or deleted any of the reference data loaded by the module when it was first enabled (which is possible in some cases both using the reference UI and using the module APIs), any changes will be overwritten with the system defaults, and deleted records will be re-created.
  • If the reference data have data constraints (for example, the requirement that a particular property be unique), and the tenant has created a new reference record which causes a conflict with incoming reference data, the upgrade will fail, and the system will be left in an inconsistent state.
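Both issues can be reproduced with a small in-memory model. The sketch below is an illustration only: it is not the loading code of any FOLIO module, the record IDs are shortened stand-ins for UUIDs, and the function overlay_by_uuid and the exception UniqueViolation are hypothetical names.

```python
class UniqueViolation(Exception):
    """Raised when a record would duplicate a value that must be unique."""


def overlay_by_uuid(table, incoming, unique_field):
    """Naive load: create or overwrite every incoming record, matched by id."""
    for record in incoming:
        for existing in table.values():
            same_value = existing[unique_field] == record[unique_field]
            if same_value and existing["id"] != record["id"]:
                raise UniqueViolation(record[unique_field])
        table[record["id"]] = dict(record)  # silently replaces local edits


# Hypothetical shipped patron groups.
shipped = [
    {"id": "u1", "group": "staff"},
    {"id": "u2", "group": "graduate"},
]

# Case 1: the tenant renamed "staff" and deleted "graduate".
first = {"u1": {"id": "u1", "group": "Personale"}}
overlay_by_uuid(first, shipped, "group")
print(first["u1"]["group"])  # the local rename has been overwritten
print("u2" in first)         # the deleted record has been re-created

# Case 2: a locally created group collides with shipped data.
second = {"local-1": {"id": "local-1", "group": "graduate"}}
try:
    overlay_by_uuid(second, shipped, "group")
    failed = False
except UniqueViolation:
    failed = True
print(failed, sorted(second))  # "u1" was written before the failure
```

In the second case the load stops halfway: one shipped record has been applied and the other has not, which is the inconsistent state described above.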

For more background, see: https://discuss.folio.org/t/reference-data-and-upgrades/2858; see also TC/Defining data types in FOLIO for automatic upgrades

Examples

  • The reference data provided by mod-inventory-storage include labels in English. In order to provide labels in the local language, users must update the records (either in the UI or using the API). These labels are overwritten on upgrade.
  • The reference data provided by mod-users include patron groups. It is possible to manage those data in the users settings UI (and in fact it is perfectly reasonable to want to customize patron groups). Patron groups have a unique constraint on the "group" property, so if the tenant has created a patron group that conflicts with incoming reference data, the upgrade will fail.
  • The reference data provided by mod-inventory-storage include a location hierarchy for Københavns Universitet. Locations can be managed in the tenant settings UI. It is very likely that a tenant will wish to remove the provided location hierarchy and create their own. Locations have unique constraints on the name and code. If the tenant creates a new location that conflicts with incoming data, the upgrade will fail.

Underlying issues:

We see this as a three-fold problem:

  1. The upgrade fails if there's a problem with loadReference=true and leaves the system in an inconsistent state. It should not fail; it should report errors and continue.
    1. Note: The situation is worse for non-RMB modules, which ignore the loadReference parameter and do their own thing.
  2. There is no way to get a report of what would be modified. The upgrade steps are all launched through Okapi as fire-and-forget, so the upgrade either fails or completes, with no interactive component and no ability to simulate.
  3. There are no clear boundaries between the different types of data packaged with FOLIO, in particular reference data vs. sample data. For example, service points for the DIKU tenant are loaded as reference data. A side effect is that tenants must edit the reference data according to local needs.
    1. This is a pitfall for upgrades: modifications by users (which the GUI enables and encourages) can either break the upgrade or be overwritten, so this is a potential data-loss scenario.
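To make the second point concrete, the kind of report that is missing today could look like the output of the sketch below. It is an assumption-laden illustration, not an existing Okapi or module feature: plan_reference_load is a hypothetical function, the IDs are shortened stand-ins for UUIDs, and each record is checked against the current table only, without modelling conflicts between incoming records.

```python
def plan_reference_load(table, incoming, unique_field):
    """Classify incoming records without modifying the table."""
    plan = {"create": [], "overwrite": [], "unchanged": [], "conflict": []}
    for record in incoming:
        clash = any(
            row[unique_field] == record[unique_field]
            and row["id"] != record["id"]
            for row in table.values()
        )
        current = table.get(record["id"])
        if clash:
            plan["conflict"].append(record["id"])
        elif current is None:
            plan["create"].append(record["id"])
        elif current != record:
            plan["overwrite"].append(record["id"])
        else:
            plan["unchanged"].append(record["id"])
    return plan


# Hypothetical tenant state and shipped patron groups.
current_groups = {
    "u1": {"id": "u1", "group": "Personale"},        # renamed locally
    "u3": {"id": "u3", "group": "faculty"},          # untouched
    "local-1": {"id": "local-1", "group": "graduate"},  # created locally
}
shipped_groups = [
    {"id": "u1", "group": "staff"},
    {"id": "u2", "group": "graduate"},
    {"id": "u3", "group": "faculty"},
    {"id": "u4", "group": "undergraduate"},
]
plan = plan_reference_load(current_groups, shipped_groups, "group")
print(plan)
```

An operator could review such a plan before deciding whether to run the upgrade with loadReference=true at all.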

We do have UIs for editing Reference Data. Tenants are able, encouraged, and need to modify some of the reference data according to local circumstances. Upgrades should be able to preserve these changes while bringing in system-supplied updates.

Overall goal:

Upgrades have to be a fire-and-forget mechanism: the operator should not have to worry about deleted custom data, outdated system data, and so on.

The end result should not leave the system in an inconsistent or unusable state. Upgrades should continue unless failures encountered are truly fatal to the process and it cannot continue without a particular step succeeding.

Minor errors may happen and should be logged clearly for remediation by the system administrator.
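A loader with this behaviour could be sketched as follows. This is one possible shape under stated assumptions, not an existing FOLIO implementation: load_tolerantly is a hypothetical function, the IDs are shortened stand-ins for UUIDs, and a unique-constraint conflict is treated as a minor, non-fatal error.

```python
import logging

log = logging.getLogger("reference-load")


def load_tolerantly(table, incoming, unique_field):
    """Apply each record independently; log and skip the ones that conflict."""
    report = {"applied": [], "skipped": []}
    for record in incoming:
        # Find an existing record with the same unique value but another id.
        clash = next(
            (row["id"] for row in table.values()
             if row[unique_field] == record[unique_field]
             and row["id"] != record["id"]),
            None,
        )
        if clash is not None:
            log.warning("skipping %s: %s=%r is already used by %s",
                        record["id"], unique_field,
                        record[unique_field], clash)
            report["skipped"].append(record["id"])
            continue
        table[record["id"]] = dict(record)
        report["applied"].append(record["id"])
    return report


# Hypothetical tenant state with one locally created group.
groups = {"local-1": {"id": "local-1", "group": "graduate"}}
incoming_groups = [
    {"id": "u1", "group": "staff"},
    {"id": "u2", "group": "graduate"},  # conflicts with local-1
    {"id": "u3", "group": "faculty"},
]
report = load_tolerantly(groups, incoming_groups, "group")
print(report)
```

Note that this only addresses the failure mode. Records that do match by ID are still overwritten, which is what the layered model proposed below is meant to avoid.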

Currently FOLIO delivers Reference Data and Sample Data, and these are not clearly distinguished. We would like to see:

  1. A more rigorous definition of reference data.
  2. A system that treats the base reference records as immutable and introduces local updates as overlays on top of the base (like a customized view of the record for the tenant).

What follows is one possible model.

Proposal:

First, make a clear distinction between sample data and what we call 'Reference data' at the moment.

Second, split what we now call 'Reference data' into 3 layers or categories with different degrees of protection:

| Name | Usage | Overwritten on upgrade | Immutable (towards user) | Example of data stored |
| System data | Core information the module can not work without | YES | YES | PORT=9135; DEPENDENCY=mod_login, mod_permission |
| Predefined Optional Working Set (POWS) | Predefined users/groups, schemas, $STUFF to work with | On admin request (LoadReferenceData=true) | YES | schema_book{ title(string,128); subtitle(string,256); authors(string,128); publisher(string,128); release_date(integer,4); blurb(string,32768) } |
| Custom data | Can override any given reference & system data, as well as introducing new values. Whatever the user changes is stored here. This layer will hold sample data, since it's essentially real data for a fictional tenant. | NO (or only if the admin wishes to start with clean slate, like apt-get purge on Debian). To load example data (like diku), introduce another switch (LoadExampleData=true) | No | schema_book{ uuid(integer,64); subtitle(NA); authors(string,256); release_date(string,64) }; PORT=10587; User: JohnDoe (Admin) |
| --- Result --- | this is not a layer, just what the module works with in the end | no layer | no layer | PORT=10587; DEPENDENCY=mod_login, mod_permission; schema_book{ uuid(integer,64); title(string,128); authors(string,256); publisher(string,128); release_date(string,64); blurb(string,32768) }; User: JohnDoe (Admin) |

System data is meant to consist only of information whose absence causes the module to fail critically. It may be overridden by POWS or the custom overlay, but it must always be available, like a network port to use.

Predefined Optional Working Set (POWS) is a set of data that helps in using the module. Omitting it is possible, although basic functions of the module may not be available or may not work in the expected way (e.g. default user groups & permissions).

Custom data is a layer which keeps all user changes. It also hosts any example data which is loaded (like diku) and must not be altered during upgrades (notable exception: the admin explicitly wishes to overwrite EVERYTHING changed by their users, which will empty the entire layer). Example data, once introduced, is likewise not updated during upgrades, to prevent any changes to custom settings.

This introduces a transparent layered structure, in which the user sees data from all layers within the settings screen and may alter everything there, although the changes are only saved to the custom data layer. Modules should look for information in the following order, stopping at the first hit:

  1. Custom data layer
  2. POWS layer
  3. System data layer

Modules apply custom settings/data first, falling back to the POWS in case of missing information and further falling back to system data if necessary. 
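The lookup order can be sketched with the example values from the table above. This is one interpretation rather than a specification: plain values are resolved by first hit, while structured values such as schema_book are merged field by field, which is what the Result row of the table suggests. The function effective and the REMOVED marker (standing in for the table's "NA") are hypothetical.

```python
REMOVED = None  # assumption: how a custom overlay hides a field of the base

system = {"PORT": 9135, "DEPENDENCY": ["mod_login", "mod_permission"]}
pows = {
    "schema_book": {
        "title": "string,128",
        "subtitle": "string,256",
        "authors": "string,128",
        "publisher": "string,128",
        "release_date": "integer,4",
        "blurb": "string,32768",
    }
}
custom = {
    "PORT": 10587,
    "schema_book": {
        "uuid": "integer,64",
        "subtitle": REMOVED,
        "authors": "string,256",
        "release_date": "string,64",
    },
}

# Lookup order: custom data first, then POWS, then system data.
layers = [custom, pows, system]


def effective(key, layers):
    """Return the value a module would see for `key` across all layers."""
    hits = [layer[key] for layer in layers if key in layer]
    if not hits:
        raise KeyError(key)
    if not isinstance(hits[0], dict):
        return hits[0]  # plain value: the first hit wins
    merged = {}
    for value in reversed(hits):  # apply the lowest layer first
        merged.update(value)
    return {k: v for k, v in merged.items() if v is not REMOVED}


print(effective("PORT", layers))
print(effective("DEPENDENCY", layers))
print(effective("schema_book", layers))
```

The custom port hides the system port, the dependency list falls through to system data, and the resulting schema has no subtitle field but keeps the fields the tenant never touched.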

This way updating critical information and even POWS is possible without breaking the updating process, since both layers are untouched by users. Data which is stored in the custom layer may break some newly introduced features, making meaningful logging and error reporting towards the user necessary.
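An upgrade under this model might look like the sketch below. It assumes each layer is a simple key/value mapping, and it picks one possible reporting rule as an illustration: warn about shipped values that changed in the new release but stay hidden behind a custom override. The function upgrade_layers and all values are hypothetical.

```python
import logging

log = logging.getLogger("layer-upgrade")


def upgrade_layers(layers, new_system, new_pows):
    """Replace the shipped layers wholesale; never write to the custom layer."""
    old_shipped = {**layers["system"], **layers["pows"]}
    new_shipped = {**new_system, **new_pows}
    upgraded = {
        "system": dict(new_system),
        "pows": dict(new_pows),
        "custom": layers["custom"],  # carried over untouched
    }
    # Shipped values that changed but remain hidden by a custom override.
    shadowed = sorted(
        key for key in layers["custom"]
        if key in new_shipped and old_shipped.get(key) != new_shipped[key]
    )
    for key in shadowed:
        log.warning("update to %r is hidden by a custom override", key)
    return upgraded, shadowed


before = {
    "system": {"PORT": 9135},
    "pows": {"groups": ["staff", "graduate"]},
    "custom": {"PORT": 10587},
}
after, shadowed = upgrade_layers(
    before,
    new_system={"PORT": 9140},
    new_pows={"groups": ["staff", "graduate", "faculty"]},
)
print(after["pows"]["groups"])
print(after["custom"], shadowed)
```

Because the upgrade never writes to the custom layer, it has nothing to conflict with; the warning is what gives the administrator a starting point for remediation.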




11 Comments

  1. I imagine:

    system data / reference data

    (= a minimal set of data needed to operate the system. Can not be changed, but may be overlaid)

    (immutable, not overwritten on upgrade; but will be changed (by the system) if an upgrade is made)

    examples:

    patron groups: faculty, staff, graduate, undergraduate

    pre-loaded resource types: text, sound, other

    1 example loan policy (loan period: 60 days, grace period: 7 days, ...)

    Instance status types : catalogued, batch loaded, not catalogued

    a default location (e.g. "City Campus")

    default language, locale, time zone

    ...

    examples for system data that we do not even see in the UI:

    ... ?


    custom data / overlaid reference data

    (local modifications of reference data: deletions, overlays, additions)

    (not overwritten on upgrade; changeable by the user/tenant)

    (not necessary to run the system - but needed by the tenant!)

    Examples:

    Institution doesn't distinguish between grad and undergrad students, so it deactivates ("deletes") "undergraduate" and renames (overlays) "graduate" to "student".

    Tenant adds additional resource types: cd, map, electronic .....

    Additional loan periods; standard loan period changed to 30 days

    Renaming the standard location, adding more locations

    Change system language to "Italian", change locale, time zone

    ...


    sample data

    large sets of data loaded only once, when the module is installed for the first time.

    Illustrative examples to populate the database.

    May be needed for tests; maybe even performance tests.

    Needed for demonstration purposes (showing the system to a new, potential client)

    Will be essentially deleted when the system becomes productive.

    Will not be upgraded. Can be changed by the tenant / user.

    Examples:

    . anonymized, sample user data ("Xenia Sample") with fictitious names and addresses

    . large sets of inventory data (but not the real data which the tenant holds)

    . a larger set of material types

    . a larger set of contributor types

    . a larger set of identifier types

    ...

  2. I agree that there should be some sample data category. This is very useful from a development and testing point of view.


    I also want to point out that a big part of this is that the upgrade process should not fail if any of the included reference or sample data has changed or been removed. I believe this will involve some extra tooling or features from Okapi, as these requests to enable new module versions for a tenant and load/use/upgrade whatever data happen at the tenant API there.


    Lastly, if there is some desire to get away from the "fire and forget" method of upgrades, we need to make that clear. I am a fan of better visibility during the upgrade process. Some sort of validation of what is going to be run, or something declarative from Okapi saying "this is what is going to take place: foobar".

  3. Data type examples, for some context, that I propose:


    System data to include: A default Job profile needed for functionality from data-import, permissions for modules, service accounts for parts of the system to function/interact with other parts of the system (pub-sub)


    Reference data/overlaid data to include: Lookup tables


    Sample data to include: A default tenant and its locations, example patron groups, example loan policies, example circulation policies, example permission sets for staff/users of Folio


    Custom data to include: What tenant and locations/service points the customer/operator defines, custom material types, organization-specific patron groups

  4. As sample data is just something that is introduced by a user of another library, it's already there - within the custom data layer:

    "Custom data is a layer which keeps all users changes. It is also host to any example data which is loaded (like diku)..."

    1. I think the proposal has a problem with terminology. The use of 'reference data' in the document makes it difficult for me to evaluate, because I (and I'm guessing most other librarians, SIG members, etc.) have had a clear mental picture to this point of what constitutes reference data – anything that is created and updated through the Settings app.  And we need to be able to update virtually anything created through the Settings app. So I strongly suggest you find another word for what the proposal now calls reference data, and shift the term 'reference data' to what the document currently calls 'custom data'.
    2.  The statement "The system should treat the base reference records as immutable" is an example of why I think the terminology is confusing. Based on experience, librarians will expect to be able to change reference records as needed; calling them 'immutable' introduces problems.
    3. In the table, the current category of 'reference data' contains some confusing examples. What kind of predefined users/groups are there that would be considered immutable? On the other hand, under what conditions would a tenant want to change a schema, since schemas are so integral to the code?
      1. I changed the terminology to make it more specific, I hope. 
      2. Should be solved by #1
      3. I tried to think ahead, following experience of around 18 years of system administration. Say the current reference data for mod_xy splits people into four groups, because the person who wrote the module has this setup at his workplace. Name them 'bachelors', 'masters', 'graduates' and 'staff'. Now library A wants to implement their setup, which only consists of 'staff' and 'clients', so deleting (as in 'not being displayed any longer') the three other original groups is feasible. A couple of years later, due to changed circumstances, they have some books which will only be lent to graduates, so they create the group 'graduates', because nobody remembers that this group already existed in the first place. Although undoing the deletion is also possible, the old group brings its own set of permissions, which would have to be changed either way, so creating the new group 'graduates' is easier for the admin by any measure. Nobody wants this new group gone or reshaped with the next upgrade of mod_xy, nor does anybody want a breaking upgrade process. Thus the only option for retaining the custom changes while upgrading the POWS layer is to keep them separate, which is the proposal. And schemata don't have to be integral to the code; look at most SQL databases. If schemata are integral to the code at the moment, that is fine, but they don't have to be, so I tried to think of one that might at least be understandable.
  5. I solicited feedback from Christie Thomas on this document/discussion. This is what she sent me:

    1. UUIDs should not be used as match to update reference data by default - if there is an external identifier for an entry from the source data set, that should be used because the UUID may not be consistent across FOLIO instances. 
    2. Some reference data should not be updated by FOLIO on upgrade at all as these should be intentionally managed by the local institution. In Inventory, examples of this are statistical codes and instance status. 
    3. Data model should allow for the identification of other properties associated with the entry to be cached locally and updated with the upgrade. 
    4. Data model should allow for local properties in an entry that are not wiped out on update of an entry. 
    5. Data model should allow for local entries that are not maintained by external data set and are not wiped out on upgrade. 


    Other comments - 

    The working definition of reference data includes dataset management for mm that overlap with what is being discussed in the entity management working group. Some of these external entities that are already being used in FOLIO for reference / lookups are resource type / instance type and format. Entities that we will require being able to reference / look up in the future include personal names, organizational names, subjects, works, etc. Because management of these entities will require much of the same management as the external reference data already in FOLIO, there may be a common long term solution. The entity management working group is in the middle of initial visioning and documenting phase right now.

    1. A response to Christie's points (via Stephen)

      Point #2 "Some reference data should not be updated by FOLIO on upgrade at all as these should be intentionally managed by the local institution. In Inventory, examples of this are statistical codes and instance status." I completely agree that existing values for certain reference data should not be updated by FOLIO on upgrade, and statistical codes and instance status are good examples of that. I would mention as a clarification, however, that existing records would need to be modified if the underlying schema were modified. If an additional data element were added to a location record, for instance, all location records would need to be modified on upgrade to add that data element. Perhaps this is obvious, but I felt that it was worth saying.



  6. The underlying problem here, in my view, is that FOLIO UUIDs are not globally unique, because for example it supports reference data being altered by a single tenant.  In fact, the UUIDs are only unique per tenant.  I think it would be a worthwhile goal for FOLIO to support UUIDs being globally unique, if at all possible.

    1. Aren't UUIDs, or at least version 4 uuids, globally unique by definition? FOLIO was my 1st introduction to UUIDs, so I read up on them in Wikipedia, and here is what the article says about potential collisions:

      This number is equivalent to generating 1 billion UUIDs per second for about 85 years. A file containing this many UUIDs, at 16 bytes per UUID, would be about 45 exabytes; this is many times larger than the largest databases currently in existence, which are on the order of hundreds of petabytes. Thus, the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion.

      I remember being advised early on to use uuid v. 4 when generating uuids to put in records for this reason and have always done so.

  7. If the same UUID is used for records that can vary their content independently, then in effect the UUID is not globally unique.