Import data without a primary key #212

craigds · 2020-08-17T22:47:29Z

At present it's not possible to import data that doesn't have a primary key. This needs to work somehow so sno can track changes to third-party data effectively.

We already have a --primary-key option in the import command, for the situation where the data actually does have a unique identifier field and it is just not marked as such. But we need to do better so we can track changes to other data.

Naively, a primary key could be a hash of the feature data plus a sequential integer (to disambiguate duplicate features). However, a GPKG working copy requires an integer primary key, so GPKG working copies would also need to store a map from the hash-based PK to a working-copy integer PK.
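
To make the naive idea concrete, here is a rough sketch. It is only an illustration, not sno code: the hashing scheme (SHA-1 over sorted-key JSON) and the names `naive_keys` and `working_copy_pk_map` are assumptions made up for this example.

```python
import hashlib
import json
from collections import Counter


def naive_keys(features):
    """Yield (hash_pk, feature) pairs, where hash_pk is '<sha1 hex>-<n>'.

    Assumption: a feature is a plain dict and is hashed via canonical JSON.
    The counter n distinguishes features with identical contents.
    """
    seen = Counter()
    for feature in features:
        payload = json.dumps(feature, sort_keys=True).encode("utf-8")
        digest = hashlib.sha1(payload).hexdigest()
        yield f"{digest}-{seen[digest]}", feature
        seen[digest] += 1


def working_copy_pk_map(hash_pks):
    """Map each hash-based PK to an integer PK, as a GPKG working copy needs."""
    return {hash_pk: i for i, hash_pk in enumerate(hash_pks, start=1)}
```

Two identical features get the same digest but different suffixes, so every key stays unique, and the second function supplies the extra hash-to-integer map that a GPKG working copy would have to carry.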

(This ticket to be expanded with a more exact plan)

olsen232 · 2020-12-14T21:41:53Z

In the simplest case, every feature encountered is just assigned a primary key from the sequence 1, 2, 3...
However, this mapping from each feature to its primary key is stored as metadata in the imported dataset, so that if
the same (or similar) data is reimported, the same primary keys can be assigned to each feature. The
reimport reuses primary keys for any features that are unchanged, and any new primary keys that must be
assigned are appended to this mapping in the same manner.

Note that reimporting depends only on the data to be imported and the stored primary key metadata - local edits to
imported features have no effect on how the data is reimported.

For the sake of efficiency, the entire feature is not stored in the mapping, but only a hash of its contents.
Since multiple features with the same contents may be imported, the mapping to be stored has the structure:

{feature hash -> [list of primary keys]}

In fact, the inverse mapping is stored, since it has a simpler structure (primary keys are unique):

{primary key -> feature hash}
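
The two shapes carry the same information, so the list-valued form can be rebuilt from the stored one whenever it is needed. A minimal sketch, with the hypothetical helper name `invert` and short placeholder strings standing in for real hashes:

```python
from collections import defaultdict


def invert(pk_to_hash):
    """Rebuild {feature hash -> [primary keys]} from {primary key -> feature hash}.

    Keys are visited in ascending order so each list comes out sorted.
    """
    hash_to_pks = defaultdict(list)
    for pk in sorted(pk_to_hash):
        hash_to_pks[pk_to_hash[pk]].append(pk)
    return dict(hash_to_pks)


# Features 1 and 3 have identical contents, so they share a hash.
print(invert({1: "aaa", 2: "bbb", 3: "aaa"}))
```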

This is stored in $DATASET_PATH/meta/generated-pks.json, along with the column-schema of the new primary key:

{
  "primaryKeySchema": {
    "id": "ad068414-3a04-45ab-851d-bfa5104c60d6",
    "name": "generated-pk",
    "dataType": "integer",
    "primaryKeyIndex": 0,
    "size": 64
  },
  "generatedPrimaryKeys": {
    "1": "181e23cf3a3c5e74254707687c4be2b5b02dbf63",
    "2": "021ac25fcf4dafc72053f84d2b87ec5662adcb83",
    "3": "8e775122edbdd367c8d383fffeabf6580de485fd",
    ...
  }
}
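
As a rough illustration of how a reimport could use the `generatedPrimaryKeys` mapping, the sketch below reuses a stored key when a feature's hash is already known and appends a new key otherwise. It is not the actual implementation: `feature_hash` and `assign_pks` are hypothetical names, and hashing canonical JSON with SHA-1 is an assumption about how feature contents are hashed.

```python
import hashlib
import json


def feature_hash(feature):
    # Assumption: features are plain dicts, hashed as sorted-key JSON.
    payload = json.dumps(feature, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def assign_pks(features, generated_pks):
    """Assign a primary key to every feature in an import or reimport.

    generated_pks has the shape of "generatedPrimaryKeys" in the file,
    {"1": "<hash>", ...}, and is updated in place with any new keys.
    """
    # Stored keys not yet claimed during this run, grouped by hash.
    available = {}
    for pk in sorted(generated_pks, key=int):
        available.setdefault(generated_pks[pk], []).append(int(pk))
    next_pk = max((int(pk) for pk in generated_pks), default=0) + 1

    assigned = []
    for feature in features:
        h = feature_hash(feature)
        if available.get(h):
            pk = available[h].pop(0)   # unchanged feature: reuse its key
        else:
            pk = next_pk               # new or changed feature: append
            next_pk += 1
            generated_pks[str(pk)] = h
        assigned.append((pk, feature))
    return assigned
```

Because only the incoming features and the stored mapping are consulted, local edits to previously imported features play no part in the result, which matches the note above about reimporting.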

craigds added this to the 0.6 milestone Aug 17, 2020

craigds mentioned this issue Aug 17, 2020

import: Add a way to replace a full existing dataset #112

Closed

olsen232 modified the milestones: 0.6, 0.7 Dec 8, 2020

olsen232 mentioned this issue Dec 8, 2020

Generate PKs while importing, for sources that lack them. #326

Merged

3 tasks

olsen232 closed this as completed Dec 14, 2020

olsen232 mentioned this issue Dec 16, 2020

Optionally detect similar features during PK-less reimport. #336

Merged

3 tasks
