OEP-4 MMS5 quadstore cleanup
State | Discussion |
GitHub Project Board | |
Slack Channel | |
Created | Aug 22, 2023 |
Authors | @Doris Lam @Blake Regalia |
Motivation
Because mms5 layer 1 uses named graphs in the underlying quadstore to manage metadata and model data for orgs/repos/cm, some graphs eventually become obsolete or orphaned as references to them are removed in the course of using the mms5 api. Leaving these graphs in the DB (some data graphs can have millions of triples) can degrade query performance, depending on the quadstore used; for example, Neptune changes how it executes queries depending on statistics about the triples in the db. Apart from performance, it is also undesirable to leave unused graphs around.
The current ‘model load’ operation produces an orphaned graph each time it’s invoked: it sets the newly loaded graph as the head of the branch it’s loaded to, but the previous staging graph doesn’t get deleted. As more functionality gets added to the api (e.g. delete endpoints), more graphs can become orphaned in a similar manner.
Proposal
Add a background process that checks the named graphs of a given mms5 cluster namespace for ‘orphaned’ status and deletes the ones that are orphaned. This process needs to understand how mms5 organizes these graphs and how they’re referenced in cluster/org/repo metadata. If a graph is no longer reachable via the metadata, then it’s safe to delete.
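The reachability rule above can be sketched as a mark-and-sweep pass. This is only an illustration under stated assumptions: the graph names, the `references` mapping, and the function name are hypothetical, and in practice the references would come from a query against the mms5 metadata graphs rather than from an in-memory dict.

```python
from collections import deque


def find_orphaned_graphs(all_graphs, root_graphs, references):
    """Return the graphs that cannot be reached from any root graph.

    all_graphs:  set of every named graph in the cluster namespace
    root_graphs: graphs that are always live (e.g. cluster-level metadata)
    references:  dict mapping a graph to the graphs its metadata points at
    """
    reachable = set()
    queue = deque(g for g in root_graphs if g in all_graphs)
    while queue:
        graph = queue.popleft()
        if graph in reachable:
            continue
        reachable.add(graph)  # mark
        for target in references.get(graph, ()):
            if target in all_graphs and target not in reachable:
                queue.append(target)
    return set(all_graphs) - reachable  # sweep candidates


# Hypothetical example: a model load moved the branch head from v1 to v2,
# so nothing references the v1 graph any more.
all_graphs = {
    "urn:example:cluster-meta",
    "urn:example:repo-meta",
    "urn:example:model-v1",
    "urn:example:model-v2",
}
references = {
    "urn:example:cluster-meta": ["urn:example:repo-meta"],
    "urn:example:repo-meta": ["urn:example:model-v2"],
}
orphaned = find_orphaned_graphs(all_graphs, ["urn:example:cluster-meta"], references)
```

Here `orphaned` contains only the v1 graph. Whether the real check is done in application code like this or pushed into a single query against the quadstore is an open design choice.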
This process would be separate from the api operations, because determining whether a graph should be deleted during an api operation can get complicated, and the graph deletion itself can take a long time (minutes). Separating it relieves the current implementation and future api implementations of graph cleanup, so they can focus on managing the metadata.
The process should only need to know the cluster namespace and quadstore location, and may also provide a way to trigger a run (it can run periodically, be triggered by the layer1 service, or both). It shouldn’t need a user-facing api, since this is meant to be invisible to the user (much like jvm garbage collection).
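One way to combine periodic runs with on-demand triggering is sketched below. It is a minimal illustration, not a proposed design: the class name `CleanupTrigger`, its methods, and the `run_cleanup` callable are all hypothetical, and it only covers the single-process case.

```python
import threading


class CleanupTrigger:
    """Runs a cleanup callable periodically, or earlier when triggered."""

    def __init__(self, run_cleanup, interval_seconds):
        self._run_cleanup = run_cleanup      # hypothetical cleanup entry point
        self._interval = interval_seconds
        self._wake = threading.Event()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self._thread.start()

    def trigger(self):
        """Request a run now instead of waiting for the next interval."""
        self._wake.set()

    def stop(self):
        self._stop.set()
        self._wake.set()  # wake the loop so it can exit promptly
        self._thread.join()

    def _loop(self):
        while True:
            # Returns when triggered or when the interval elapses.
            self._wake.wait(timeout=self._interval)
            self._wake.clear()
            if self._stop.is_set():
                return
            self._run_cleanup()
```

Triggers that arrive while a run is in progress are coalesced into a single follow-up run, which seems reasonable for a cleanup task that always works from the current metadata.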
Performance Impact
Graph deletion can affect quadstore performance while it runs, and the resulting removal of unused graphs can in turn affect query performance. The size of both effects depends on the quadstore used and on how often and when the cleanup is run. Overall, running cleanup during downtime and reducing the number of unused triples should have a positive performance impact.
Implementation and maintainability
The query for orphaned graphs would have a dependency on the mms5 ontology: changes to the ontology would require updates to the query. Current api operations that create a new graph may also need to change, in order to prevent a newly created graph from being flagged as orphaned before any metadata can be established (e.g. by adding special metadata indicating that a graph should not be deleted, with some expiry timestamp).
The process itself can be implemented as a separate service (in its own github repo) or as a background thread in the existing layer1 service. More discussion is needed to compare the pros and cons (interdependencies, means of triggering a run, what happens if layer1 has multiple instances deployed, separation of concerns, and monitoring).
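The expiry-timestamp idea could look roughly like the following. This is a sketch under assumptions: the `keep_until` mapping stands in for whatever protection metadata the api would write when it creates a graph, and the names and graph IRIs are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def graphs_safe_to_delete(orphaned, keep_until, now):
    """Drop orphaned graphs that are still inside their protection window.

    orphaned:   set of graphs found unreachable from the metadata
    keep_until: dict mapping a graph to the time its protection expires
    now:        current time (passed in so the check is easy to test)
    """
    return {
        graph
        for graph in orphaned
        if graph not in keep_until or keep_until[graph] <= now
    }


# Hypothetical example: a staging graph was just created by a model load
# and is protected for 30 minutes while its metadata is being written.
now = datetime(2023, 8, 22, 12, 0, tzinfo=timezone.utc)
orphaned = {"urn:example:model-v1", "urn:example:staging-new"}
keep_until = {"urn:example:staging-new": now + timedelta(minutes=30)}

safe = graphs_safe_to_delete(orphaned, keep_until, now)
safe_later = graphs_safe_to_delete(orphaned, keep_until, now + timedelta(hours=1))
```

At `now` only the old model graph is eligible. An hour later the protection has expired, so a staging graph that never received metadata (for example after a failed load) becomes eligible as well.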
Deployment options
Same as current layer1 deployment (primarily docker/kube)