
Moving where tableversion data is stored #11

Open
billgeo opened this issue Sep 1, 2016 · 25 comments

@billgeo
Contributor

billgeo commented Sep 1, 2016

Not sure if this is best practice or a good idea, but would it be possible to move table_version data into the schema of the table that has been versioned?

e.g. move the table version data, metadata about the table, and functions specific to that table into the schema that the table originates from.

The main benefit would be to allow backups/restores of individual schemas that preserve the versions and the metadata about the versions. At the moment you have to backup/restore the entire database, I believe.

@palmerj
Contributor

palmerj commented Sep 1, 2016

All possible and makes sense but that's a large change. What's the driver?

@billgeo
Contributor Author

billgeo commented Sep 1, 2016

Potential future issues. Joe is currently backing up the AIMS and ROADS schemas independently. If we ever expanded that to include admin boundaries, or turned on table versioning for any AIMS or ROADS tables, we'd run into this problem.

@billgeo
Contributor Author

billgeo commented Sep 1, 2016

Also, @dwsilk just said roads is currently versioned. So it would be difficult to resolve now if we had to do a restore for roads. Maybe there are other ways to get around this problem? A ver_repair() function that can repair the restored table to some specified date or version number?

@palmerj palmerj self-assigned this Jan 5, 2017
@palmerj
Contributor

palmerj commented Jan 6, 2017

Proposal

Place versioned-table-specific objects into the same schema as the source versioned table. This will be done by creating a table-specific metadata table to record the revision metadata (IDs, descriptions, dates), and by moving the revision data table and the data access functions into that schema. The additional metadata table is required to remove the need for the global table_version.revision tables. The new layout would look something like:

{schema}.{table}
{schema}.{table}____revision
{schema}.{table}____revision_metadata

These 3 tables can then be backed up and restored independently.

I also propose that the table_version.versioned_tables table gets removed and a view created in its place. The view definition would look up the database table catalogue and return tables matching the {schema}.{table}____revision/metadata name format.
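
As a rough illustration, such a view might look something like this (a sketch only; the exact catalogue query, suffix matching and column list are assumptions, not a final definition):

CREATE VIEW table_version.versioned_tables AS
SELECT n.nspname AS schema_name,
       regexp_replace(c.relname, '____revision$', '') AS table_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'            -- ordinary tables only
  AND c.relname ~ '____revision$'; -- revision data tables identify versioned tables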

NOTE: This is a major change that will result in a new major version release i.e 2.0

Changes

Supporting isolated backups of versioned tables means we can no longer support global version ID management. One of the main side effects of this change is that we can no longer support function calls such as:

table_version.ver_get_modified_tables(p_revision integer)
table_version.ver_get_revision(...)
table_version.ver_get_revisions(...)
table_version.ver_create_revision()
table_version.ver_get_last_revision()
table_version.ver_complete_revision()
table_version.ver_delete_revision()

The following new table specific functions will be created:

table_version.ver_create_revision(p_schema name, p_table name, p_comment text, p_revision_time timestamp DEFAULT now(), p_schema_change boolean DEFAULT false) RETURNS BOOL

table_version.ver_complete_revision(p_schema name, p_table name, p_comment text, p_revision_time timestamp DEFAULT now(), p_schema_change boolean DEFAULT false) RETURNS BOOL

table_version.ver_get_table_revisions(p_schema name, p_table name) RETURNS TABLE(..)
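
For example (hypothetical usage, based on the signature above), listing all revisions recorded for a single table:

SELECT * FROM table_version.ver_get_table_revisions('foo', 'bar');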

However the table_version.ver_get_modified_tables function can't be replaced in any sane way, so it will be removed. I think this change won't have a large impact, as I believe neither LDS nor internal LINZ systems are using it.

Also, to keep table-specific objects in the same schema, the table revision and diff data access functions will be renamed and moved:

{schema}.{table}_get_table_diff(p_revision1 integer, p_revision2 integer) RETURNS TABLE(..)
{schema}.{table}_get_table_revision(p_revision integer) RETURNS TABLE(..)

Example

Create table foo.bar:

CREATE SCHEMA foo;

CREATE TABLE foo.bar
(
  id INTEGER NOT NULL PRIMARY KEY,
  baz TEXT
);

SELECT table_version.ver_enable_versioning('foo', 'bar');

This will then create the following auxiliary tables:

CREATE TABLE foo.bar____revision
(
  _revision_created INTEGER NOT NULL, -- references foo.bar____revision_metadata.id; not enforced as a foreign key for performance reasons
  _revision_expired INTEGER, -- references foo.bar____revision_metadata.id; not enforced as a foreign key for performance reasons
  id INTEGER NOT NULL,
  baz TEXT,
  CONSTRAINT "pkey_foo_bar_revision" PRIMARY KEY (_revision_created, id)
);

CREATE TABLE foo.bar____revision_metadata
(
  id SERIAL PRIMARY KEY,
  revision_time TIMESTAMP NOT NULL DEFAULT now(),
  start_time TIMESTAMP NOT NULL DEFAULT clock_timestamp(),
  user_name TEXT NOT NULL DEFAULT current_user,
  comment TEXT
);

and the following table-specific access functions:

foo.bar_get_table_diff(p_revision1 integer, p_revision2 integer) RETURNS TABLE(..)
foo.bar_get_table_revision(p_revision integer) RETURNS TABLE(..)
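
For example (hypothetical revision IDs), the changes between revisions 2 and 3, and the full state of the table at revision 3, could then be fetched with:

SELECT * FROM foo.bar_get_table_diff(2, 3);
SELECT * FROM foo.bar_get_table_revision(3);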

and the following trigger and function:

CREATE TRIGGER foo_bar_revision_trg
  AFTER INSERT OR UPDATE OR DELETE
  ON foo.bar
  FOR EACH ROW
  EXECUTE PROCEDURE foo.bar_revision_changes();
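
The trigger function body isn't spelled out in this proposal; as a minimal sketch (assumed behaviour, heavily simplified, and using a hypothetical helper to fetch the in-progress revision ID), it would do something like:

CREATE OR REPLACE FUNCTION foo.bar_revision_changes() RETURNS trigger AS $$
DECLARE
    -- hypothetical helper: the ID of the revision opened by ver_create_revision()
    v_rev INTEGER := table_version._ver_get_current_revision('foo', 'bar');
BEGIN
    IF TG_OP IN ('UPDATE', 'DELETE') THEN
        -- expire the previously current version of the row
        UPDATE foo.bar____revision
           SET _revision_expired = v_rev
         WHERE id = OLD.id AND _revision_expired IS NULL;
    END IF;
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        -- record the new version of the row
        INSERT INTO foo.bar____revision (_revision_created, _revision_expired, id, baz)
        VALUES (v_rev, NULL, NEW.id, NEW.baz);
    END IF;
    RETURN NULL; -- AFTER trigger, return value is ignored
END;
$$ LANGUAGE plpgsql;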

Creating and completing a revision will now be done with table-specific calls:

SELECT table_version.ver_create_revision('foo', 'bar', 'My revision xxx');
...
do some edits
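-- e.g. a hypothetical edit that this revision would capture:
UPDATE foo.bar SET baz = 'amended' WHERE id = 1;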
...
SELECT table_version.ver_complete_revision('foo', 'bar');

@dwsilk
Member

dwsilk commented Jan 10, 2017

This looks great for simplifying backup/restore of individual tables and avoiding the creation of a single schema with mixed _revision data.

At the moment, it is convenient that a revision can relate to multiple tables because

  • it's very easy to determine what changed inside one 'revision' to a related set of data
  • it's very easy to determine the coincident state of two related tables at any point in time

With this change, a single 'revision' to a related set of data will require a revision ID for each table, with no relationship between them other than the same start_time if the changes occurred in one transaction. Finding what the state of table_x was when table_y was at revision_n will be a bit trickier.
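
For example, finding that coincident state might need something like this (a sketch only, with hypothetical table names and revision IDs, assuming the revision_time column from the proposed metadata table):

SELECT x.id AS table_x_revision
FROM foo.table_x____revision_metadata x
WHERE x.revision_time <= (SELECT y.revision_time
                          FROM foo.table_y____revision_metadata y
                          WHERE y.id = 5)  -- revision_n, hypothetical
ORDER BY x.revision_time DESC
LIMIT 1;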

Have I got that right?

@palmerj
Contributor

palmerj commented Jan 10, 2017

Thanks @dwsilk, interesting. I think we have a conflict in feature requests, and I can't see a nice simple implementation that can support both independent table backups and multi-table single-revision management.

Do you have an application requirement for this multi-table functionality right now? If so, you could solve that problem at the application level by recording the last revision ID per table in some sort of set. If revisions are created and managed in a transaction you shouldn't lose any functionality.
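
A minimal sketch of such a set (a hypothetical application-level table, not part of the extension):

CREATE TABLE app_revision_set
(
  set_id      INTEGER NOT NULL, -- one logical multi-table revision
  schema_name NAME    NOT NULL,
  table_name  NAME    NOT NULL,
  revision_id INTEGER NOT NULL, -- the per-table revision ID created in the same transaction
  PRIMARY KEY (set_id, schema_name, table_name)
);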

@dwsilk
Member

dwsilk commented Jan 10, 2017

Yes, that would work. I do have this requirement now, and would need to make changes simultaneously with the release of table_version 2.0 to production.

@palmerj
Contributor

palmerj commented Jan 11, 2017

Ok, thanks @dwsilk. @billgeo, how do you want to handle this in the context of AIMS? I was thinking of implementing changes #15, #18 and #19 as part of the 2.0 release to make the BDE processor faster and more robust.

@billgeo
Contributor Author

billgeo commented Jan 11, 2017

We'd need to test it on the test server, the same as for the BDE processor and Roads. Other than that, I don't think it should have an effect on AIMS if it upgrades properly and is tested?

@palmerj
Contributor

palmerj commented Jan 11, 2017

Ok, I will roll out the new version in a branch, get @imincik to package it up for the testing PPA, and leave it up to you to test and deploy to production.

@palmerj
Contributor

palmerj commented Jan 11, 2017

Actually, I think this might have an effect on the AIMS backend code. I think there are stored procedures that use the table_version.ver_xxx API. We should review this, but it should only be a simple change.

@dwsilk
Member

dwsilk commented Feb 12, 2017

So each versioned table will have these additional objects within the same schema:

{schema}.{table}____revision
{schema}.{table}____revision_metadata
{schema}.{table}_get_table_diff(p_revision1 integer, p_revision2 integer) RETURNS TABLE(..)
{schema}.{table}_get_table_revision(p_revision integer) RETURNS TABLE(..)

I wonder if it would be cleaner if these objects were still stored in a separate schema, {schema}____revision, instead of storing them within the source schema. Or would a revision_ prefix make them easier to differentiate than the suffixes? Do you have any thoughts on this @imincik?

It would also be nice if these two functions could be dynamic and remain in the table_version schema (taking a table name as an argument):

{schema}.{table}_get_table_diff(p_revision1 integer, p_revision2 integer) RETURNS TABLE(..)
{schema}.{table}_get_table_revision(p_revision integer) RETURNS TABLE(..)

But it seems that is not really possible.

Aside from those fairly minor points, it would be good to see these changes go ahead.

@palmerj
Contributor

palmerj commented Feb 13, 2017

Thanks for the comment @dwsilk 👍

I wonder if it would be cleaner if these objects were still stored in a separate schema, {schema}____revision, instead of storing them within the source schema. Or would a revision_ prefix make them easier to differentiate than the suffixes? Do you have any thoughts on this @imincik?

This somewhat seems to defeat the purpose of having all the table and revision data in an easy-to-backup place with a tool like pg_dump. I believe it's quite common to back up by schema, for example.

In saying this, I did consider another design option of per-schema revisioning, to support application-level versioning and tracking changes to one logical group across a revision (which solves your initially stated issue/requirement). In this design, like the current state, the version IDs would be shared across all tables within the schema. In addition, there would be a group difference function that returns the changes across all tables within a revision. This design of course means that consistent backups can only occur at the schema level. One of the issues with this design is that the group revision difference function would likely cause performance and semantic calling issues. So I went for the simple per-table version model, which allows the lowest level of granularity for revisioning and consistent backup.

Maybe with the removal of the group revision difference function from this design I'm open either way - but I do want the extension to remain general purpose. @imincik do you have a view? From my understanding of our current extension usage, a per-schema revisioning design is ok.

It would also be nice if these two functions could be dynamic and remain in the table_version schema (taking a table name as an argument):

{schema}.{table}_get_table_diff(p_revision1 integer, p_revision2 integer) RETURNS TABLE(..)
{schema}.{table}_get_table_revision(p_revision integer) RETURNS TABLE(..)

Do you have a specific reason for still wanting the functions in the table_version schema?

But it seems that is not really possible.

Unfortunately it's not, without having a function return SETOF RECORD - but in that case the caller needs to declare the returned columns and datatypes. It's for this reason that the original design created a static function per table.
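
To illustrate (with a hypothetical generic function): a SETOF RECORD function forces every caller to spell out the result shape at the call site, e.g.

SELECT *
FROM table_version.ver_get_table_revision('foo', 'bar', 3) -- hypothetical generic form
     AS t(id integer, baz text); -- caller must declare the columns and types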

@palmerj palmerj closed this as completed Feb 13, 2017
@dwsilk
Member

dwsilk commented Feb 13, 2017

@palmerj did you mean to close this?

I agree with the per-table model instead of per-schema / per-application. While the other option you considered suits the roads use of table_version (and I actually wrote an anonymous function to do the group differencing you're describing, which was useful for testing), it seems unnecessary for our other uses of table_version, and keeping it general purpose is important.

This somewhat seems to defeat the purpose of having all the table and revision data in an easy-to-backup place with a tool like pg_dump.

The proposal makes it easy to back up both table and revision data, but not so easy to separate them.

So I'm saying that if you turned on revisioning for tables in roads, you would have revision data created in roads____revision. You could then use pg_dump to dump the current state of your database tables with -n roads, or, if you wanted revision data too, with -n roads -n roads____revision.

With the proposed design, the same thing would require an --exclude-table-data pattern to remove revisions from a pg_dump of roads, or a number of -t switches. And then -n roads to do both.

There may also be use cases for backing up current data and revision data at different frequencies, particularly with large schemas where revision data grows rapidly. A more modular design would be helpful for that too?

It doesn't worry me too much - it's just a big design change, so I want to make sure that it is well considered.

Do you have a specific reason for still wanting the functions in the table_version schema?

Just for the same reason that it's nice that we can have one table_version.ver_get_table_revisions() function, rather than one for each table. Less clutter / duplication. I understand the limitation though!

@palmerj
Contributor

palmerj commented Feb 13, 2017

@palmerj did you mean to close this?

Nope sorry!

@palmerj palmerj reopened this Feb 13, 2017
@palmerj
Contributor

palmerj commented Feb 14, 2017

it seems unnecessary for our other uses of table_version and keeping it general purpose is important.

Ok great.

The proposal makes it easy to back up both table and revision data, but not so easy to separate them.

I'm not sure of the requirement to separate the main table from the revision data table? If the functional requirement to enable revisioning on a table has been made, then you dump both. The main table will also have trigger references to the ____revision table, which makes dumping the main table on its own awkward as well.

With the proposed design, the same thing would require an --exclude-table-data pattern to remove revisions from a pg_dump on roads, or a number of -t switches. And then -n roads to do both.

I really only see dumping per table (three tables per revisioned table in this case), per schema or per database as the options in operational databases.

There may also be use cases for backing up current data and revision data at different frequencies

I don't really understand the requirement for this.

particularly with large schema where revision data grows rapidly.

IMHO if the table is growing rapidly then more disk is required. I don't think we should get into the game of designing a complex solution for this. Partitioning (with some data going to cheaper disk) or some sort of bespoke old-data purge could be considered. But really, the way the revision data model is designed, it's pretty hard to archive old versions.

Just for the same reason that it's nice that we can have one table_version.ver_get_table_revisions() function, rather than one for each table. Less clutter / duplication. I understand the limitation though!

Yip, would be nice but not possible.

@dwsilk
Member

dwsilk commented Feb 14, 2017

I'm not sure of the requirement to separate the main table from the revision data table?

There are a couple of things that come to mind.

  1. Subsetting a schema for testing
    If I want to develop / test a report / application that references versioned tables, I may not care about the revision tables. Like https://github.com/linz/bde_test_dataset?

  2. Documenting a schema from source
    Documenting every single ____revision and ____revision_metadata table and every table_diff function seems unnecessary. They all serve the same purpose (for different tables). So I'd handle table and revision data differently.

I still think this is pretty minor, and would be happy to see the proposal go ahead if you feel that it is the better approach.

I don't really understand the requirement for [backing up table and revision data at different frequencies].

OK cool - sounds sensible. I haven't had much exposure to that part of database management.

@imincik
Contributor

imincik commented Feb 14, 2017

@palmerj , @dwsilk ,

I like the idea of having the metadata tables and functions in the schema to which they belong, and maintaining the revision ID per schema. From a design point of view it is OK, and it is a good feature for LINZ and also for other users.

@palmerj
Contributor

palmerj commented Feb 15, 2017

I like the idea of having the metadata tables and functions in the schema to which they belong, and maintaining the revision ID per schema. From a design point of view it is OK, and it is a good feature for LINZ and also for other users.

Hi @imincik, do you have a reason for liking this idea? Is it to remove the duplication of functions and metadata tables required under the current proposal? Do you like the idea of a group changeset function? I'm ok to implement this, but I just wanted to understand your reasons a little more, so any new design proposal can accommodate the issues.

@dwsilk
Member

dwsilk commented Feb 19, 2017

@imincik could you please clarify, are you supporting:

  1. Original Proposal: metadata tables and functions are created in the same schema as the tables that are being versioned, and revision IDs are maintained independently for each table
  2. Jeremy's Other Consideration: metadata tables and functions are created in the same schema as the tables that are being versioned, and revision IDs are maintained across all tables in each schema
  3. Daniel's Proposal: metadata tables and functions are created in a separate schema (related to the original schema by name), and revision IDs are maintained for each table

..or something else? It's not clear to me.

@imincik
Contributor

imincik commented Feb 20, 2017

@imincik could you please clarify, are you supporting:

@dwsilk , I am voting for "Jeremy's Other Consideration"

Hi @imincik do you have a reason for liking this idea?

Because it gives us the possibility to logically maintain, back up and replicate data based on schema. I can imagine a few use cases for this.

If you want to maintain your data as one group, then you have only one schema. If you are creating more schemas, you want to logically maintain your data in multiple groups.

@palmerj
Contributor

palmerj commented Feb 20, 2017

Ok. I will re-adjust the proposal as a wiki page and then get both of your reviews again. Cheers all!

@stale

stale bot commented Oct 6, 2018

This issue has been automatically marked as stale as there has not been any activity for some time. The issue will be closed in 14 days if there is no further activity.

@stale stale bot added the Stale label Oct 6, 2018
@dwsilk dwsilk removed the Stale label Oct 7, 2018
@palmerj
Contributor

palmerj commented Oct 8, 2018

We need to review what we want to do with this change.

@billgeo
Contributor Author

billgeo commented Jun 8, 2022

Not sure we have much of a driver to do this at this stage. Labelling as won't fix for now.
