
Import data without a primary key #212

Closed
craigds opened this issue Aug 17, 2020 · 1 comment

@craigds
Member

craigds commented Aug 17, 2020

At present it's not possible to import data that doesn't have a primary key. This needs to work somehow so sno can track changes to third-party data effectively.

We already have a --primary-key option in the import command, for the situation where the data actually does have a unique identifier field and it is just not marked as such. But we need to do better so we can track changes to other data.

Naively, a primary key could be a hash of the feature data plus a sequential integer (to de-duplicate where there are duplicate features). However, a GPKG working copy requires an integer primary key, so GPKG working copies will also need to store a map from the hash-based PK to a working-copy integer PK.
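
To make the naive idea concrete, here is a rough sketch of a hash-plus-counter key and the extra integer map a GPKG working copy would need. Everything in it is an assumption for illustration: the function names, the use of SHA-1 over canonical JSON as the feature hash, and the `<hash>-<n>` key format are not taken from sno.

```python
import hashlib
import json
from collections import Counter


def hash_based_pks(features):
    """Return one string key per feature: '<sha1 of contents>-<occurrence index>'.

    Hypothetical helper; the hash input (canonical JSON) is an assumption.
    """
    seen = Counter()
    pks = []
    for feature in features:
        payload = json.dumps(feature, sort_keys=True).encode("utf-8")
        digest = hashlib.sha1(payload).hexdigest()
        # The counter de-duplicates features whose contents are identical.
        pks.append(f"{digest}-{seen[digest]}")
        seen[digest] += 1
    return pks


def working_copy_pk_map(pks):
    """Map each hash-based key to an integer key for a GPKG working copy."""
    return {pk: i for i, pk in enumerate(pks, start=1)}
```

Two identical features get keys that differ only in the trailing counter, and the working copy sees the integers 1, 2, 3... in import order.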

(This ticket to be expanded with a more exact plan)

@olsen232
Collaborator

In the simplest case, every feature encountered is just assigned a primary key from the sequence 1, 2, 3...
However, this mapping from each feature to its primary key is stored as metadata in the imported dataset, so that if
the same (or similar) data is reimported, the same primary keys can be assigned to each feature. The
reimport reuses primary keys for any features that are unchanged, and any new primary keys that must be
assigned are appended to this mapping in the same manner.

Note that reimporting depends only on the data to be imported and the stored primary key metadata - local edits to
imported features have no effect on how the data is reimported.
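
The import and reimport behaviour described above could be sketched roughly as follows. This is an illustration only, not sno's implementation: `feature_hash` and `assign_primary_keys` are hypothetical names, hashing canonical JSON with SHA-1 is an assumption, and so is the choice never to reuse the keys of features that have disappeared from the source.

```python
import hashlib
import json
from collections import defaultdict


def feature_hash(feature):
    """Hypothetical hash: SHA-1 of the feature serialised as canonical JSON."""
    payload = json.dumps(feature, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def assign_primary_keys(features, pk_to_hash):
    """Assign a primary key to each feature, reusing keys from earlier imports.

    pk_to_hash is the stored {primary key -> feature hash} mapping. It is
    updated in place with any newly assigned keys. Only the incoming data and
    this mapping are consulted, so local edits play no part.
    Returns a list of (primary key, feature) pairs in input order.
    """
    # Group stored keys by hash so that duplicate features each claim one key.
    available = defaultdict(list)
    for pk in sorted(pk_to_hash):
        available[pk_to_hash[pk]].append(pk)

    next_pk = max(pk_to_hash, default=0) + 1
    result = []
    for feature in features:
        h = feature_hash(feature)
        if available[h]:
            pk = available[h].pop(0)  # unchanged feature: reuse its old key
        else:
            pk = next_pk  # new or changed feature: append to the sequence
            next_pk += 1
            pk_to_hash[pk] = h
        result.append((pk, feature))
    return result
```

On a first import the mapping starts empty, so the features simply receive 1, 2, 3... On a reimport, a feature whose contents changed no longer matches any stored hash and is treated as new.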

For the sake of efficiency, the entire feature is not stored in the mapping, but only a hash of its contents.
Since multiple features with the same contents may be imported, the mapping to be stored has the structure:

{feature hash -> [list of primary keys]}

In fact, the inverse mapping is stored, since primary keys are unique and it therefore has a simpler structure:

{primary key -> feature hash}
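
The two forms carry the same information, so either can be rebuilt from the other. A small sketch of the conversion in both directions, with hypothetical function names:

```python
from collections import defaultdict


def to_hash_lookup(pk_to_hash):
    """Rebuild {feature hash -> [primary keys]} from the stored flat mapping."""
    hash_to_pks = defaultdict(list)
    for pk in sorted(pk_to_hash):
        hash_to_pks[pk_to_hash[pk]].append(pk)
    return dict(hash_to_pks)


def to_stored_form(hash_to_pks):
    """Flatten {feature hash -> [primary keys]} into {primary key -> feature hash}."""
    return {pk: h for h, pks in hash_to_pks.items() for pk in pks}
```

The flat form needs no nested lists, and a duplicate primary key is impossible by construction because each key appears once as a dictionary key.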

This is stored in $DATASET_PATH/meta/generated-pks.json, along with the column-schema of the new primary key:

{
  "primaryKeySchema": {
    "id": "ad068414-3a04-45ab-851d-bfa5104c60d6",
    "name": "generated-pk",
    "dataType": "integer",
    "primaryKeyIndex": 0,
    "size": 64
  },
  "generatedPrimaryKeys": {
    "1": "181e23cf3a3c5e74254707687c4be2b5b02dbf63",
    "2": "021ac25fcf4dafc72053f84d2b87ec5662adcb83",
    "3": "8e775122edbdd367c8d383fffeabf6580de485fd",
    ...
  }
}
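
For illustration, a file in this shape could be read and written as below. This sketch assumes the dataset path is an ordinary directory on disk, which may not match how sno actually stores dataset metadata, and both function names are hypothetical. JSON object keys are always strings, so the primary keys are converted back to integers on load.

```python
import json
from pathlib import Path


def load_generated_pks(dataset_path):
    """Return (primaryKeySchema, {int primary key -> feature hash})."""
    path = Path(dataset_path) / "meta" / "generated-pks.json"
    if not path.exists():
        return None, {}  # first import: nothing stored yet
    doc = json.loads(path.read_text(encoding="utf-8"))
    pk_to_hash = {int(pk): h for pk, h in doc["generatedPrimaryKeys"].items()}
    return doc["primaryKeySchema"], pk_to_hash


def save_generated_pks(dataset_path, schema, pk_to_hash):
    """Write the schema and mapping back, with keys in ascending order."""
    path = Path(dataset_path) / "meta" / "generated-pks.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    doc = {
        "primaryKeySchema": schema,
        "generatedPrimaryKeys": {
            str(pk): pk_to_hash[pk] for pk in sorted(pk_to_hash)
        },
    }
    path.write_text(json.dumps(doc, indent=2), encoding="utf-8")
```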
