feat(api): add /import endpoint to support bulk importing json data #390

Silthus · 2024-02-07T11:58:18Z

Hey Openbeta Team :)

As discussed in OpenBeta/open-tacos#1031 it would be great to have a simple page where users can bulk import data using a simple, validated json schema.

This PR adds the required functionality to the backend, enabling the integration of a bulk import in the frontend. The frontend is also actively being worked on by @l4u532.

What this PR does:

it changes from apollo-server to apollo-server-express allowing the integration of additional http endpoints without changing the graphql api. You can read more about the swapping of the packages here: https://www.apollographql.com/docs/apollo-server/v3/integrations/middleware/#swapping-out-apollo-server
it adds a new endpoint a /import for authorized users, allowing the import of a defined json schema. The schema will be defined in the PR for the frontend.
the data is inserted using the standard functions used by the mutations of the graphql api ensuring data validity.
I also cleaned up some of the tests and removed the timeouts waiting for the changes to emit and swapped the database for an in memory database everywhere, getting rid of the required docker instance for testing
everything is end2end tested and was developed using a test first approach

Let me know what you think. I will happily make any adjustments to the PR.

Silthus · 2024-02-08T07:29:55Z

Thanks @vnugent for the feedback on Discord about discussing the pros and cons of using the existing GraphQL API vs. creating a new REST endpoint first.
For documentation purposes, I will continue the discussion here.

First off: you are absolutely right to discuss the pros and cons before going forward with such a change. You built such an amazing product and we are very happy to hear your thoughts and feedback about the change we are about to implement.
Frankly, we were so deep into coding, that we forgot to share our train of thought. So here is the pro and con list we went through before deciding to go with the extra /import endpoint. We are eager to hear your thoughts about it and more than willing to change things if you disagree or have concerns.

REST vs GraphQL for Bulk-Import of Data

[Draft] Example import.json

{
  "areas": [
    {
      "areaName": "Utah",
      "countryCode": "us",
      "children": [
        {
          "areaName": "Southeast Utah",
          "children": [
            {
              "areaName": "Indian Creek",
              "children": [
                {
                  "areaName": "Supercrack Buttress",
                  "climbs": [
                    {
                      "name": "The Key Flake",
                      "grade": "5.10",
                      "fa": "unknown",
                      "type": {
                        "trad": true
                      },
                      "safety": "UNSPECIFIED",
                      "metadata": {
                        "lnglat": {
                          "type": "Point",
                          "coordinates": [
                            -109.54552,
                            38.03635
                          ]
                        },
                        "left_right_index": 1
                      },
                      "content": {
                        "description": "Cool off-width that requires off-width and face skills.",
                        "protection": "Anchors hidden up top. Need 80m to make it all the way down.",
                        "location": "Opposite keyhole flake. Obvious right leaning offwidth that starts atop 20 ft boulder."
                      }
                    },
                    {
                      "name": "Incredible Hand Crack",
                      "grade": "5.10",
                      "fa": "Rich Perch, John Bragg, Doug Snively, and Anne Tarver,  1978",
                      "type": {
                        "trad": true
                      },
                      "safety": "UNSPECIFIED",
                      "metadata": {
                        "lnglat": {
                          "type": "Point",
                          "coordinates": [
                            -109.54552,
                            38.03635
                          ]
                        },
                        "left_right_index": 2
                      },
                      "content": {
                        "description": "Route starts at the top of the trail from the parking lot to Supercrack Buttress.",
                        "protection": "Cams from 2-2.5\". Heavy on 2.5\" (#2 Camalot)",
                        "location": ""
                      },
                      "pitches": [
                        {
                          "pitchNumber": 1,
                          "grade": "5.10",
                          "disciplines": {
                            "trad": true
                          },
                          "length": 100,
                          "boltsCount": 0,
                          "description": "A classic hand crack that widens slightly towards the top. Requires a range of cam sizes. Sustained and excellent quality."
                        },
                        {
                          "pitchNumber": 2,
                          "grade": "5.9",
                          "disciplines": {
                            "trad": true
                          },
                          "length": 30,
                          "boltsCount": 0,
                          "description": "Easier climbing with good protection. Features a mix of crack sizes. Shorter than the first pitch but equally enjoyable."
                        }
                      ]
                    }
                  ],
                  "gradeContext": "US",
                  "metadata": {
                    "isBoulder": false,
                    "isDestination": false,
                    "leaf": true,
                    "lnglat": {
                      "type": "Point",
                      "coordinates": [
                        -109.54552,
                        38.03635
                      ]
                    },
                    "bbox": [
                      -109.54609091005857,
                      38.03590033981814,
                      -109.54494908994141,
                      38.03679966018186
                    ]
                  },
                  "content": {
                    "description": ""
                  }
                }
              ],
              "metadata": {
                "lnglat": {
                  "type": "Point",
                  "coordinates": [
                    -109.5724044642857,
                    38.069429035714286
                  ]
                }
              },
              "content": {
                "description": "Indian Creek is a crack climbing mecca in the southeastern region of Utah, USA. Located within the [Bears Ears National Monument](https://en.wikipedia.org/wiki/Bears_Ears_National_Monument)."
              }
            }
          ]
        }
      ]
    }
  ]
}

Bulk Importing on the Server Side Using a New REST Endpoint

This is done using a separate endpoint, swapping out the apollo-server package against the apollo-server-express package (see here for more details).

Pro

everything is done in a single transaction, rolling back all changes if one operation fails, and only committed if everything has run successfully 1x
does not break if large amounts of data are uploaded
no queuing mechanism needed for handling the import of several area trees
proven mass upload mechanism battle tested with mountain project import
new fields in the database are automatically exposed without the need to adjust the GQL
potentially faster (however this was not tested and is just speculation)

Cons

little work is needed in the core to swap out apollo-server against apollo-server-express
authentication needs to be handled separately, although sharing most of the auth logic
adds complexity by introducing an additional endpoint in the api

Importing via GQL

This would be done by processing the json document on the client constructing a GraphQL query and sending it to the server.

Pro

requires little to no server side changes
uses the same permissions and auth as the current frontend and api
no need to swap out the server engine

Cons

every mutation runs as a single transaction
- failing a mutation deep down in the tree leaves the import only partially completed
- partial imports cannot be repeated as mutations that already executed will fail
- could leave imports in an uncompleted state
additional non type safe logic is required to construct the correct GraphQL query, which is potentially unstable
unsure how very large GraphQL queries are handled by the server, using a REST endpoint on the other hand has been battle tested with large data
changing the GrapQL api requires adjustment of the JSON parser

Conclusion

We believe that doing the little bit of extra work and implementing a custom endpoint for importing bulk data is the right way to go.

The main reason for that is the transaction safety ensuring that a single .json file can be imported, using multiple retries to fix possible errors, without breaking the import and leaving the database in a partial state.

The second biggest reason we prefer the server side import is the additional complexity it brings to changing the GraphQL endpoints. By separating it from the regular API, a new contributor or existing collaborator does not need to touch the frontend of the bulk importer when changing the GraphQL API. Also we don't really know how it would behave with large amounts of data.

Plans for the Future

Here are some additional ideas and thoughts we had regarding the import feature, which also influenced our decision:

Export -> Edit -> Import Workflow

The idea is to add an export function that allows exporting selected areas or regions with individual properties into a single json file which then can be edited and uploaded using the import page.
This would allow for a very easy way to update existing data. The export would be similar to the existing openbeta-export, except that it produces a single json file in the schema required by the import.

Data Cleanup

One issue with Openbeta is the quality of data. Adding such an export and import workflow could allow individual, non developers, to start cleaning up data in a larger scope.

adds OpenBeta/open-tacos#1031

vnugent · 2024-02-10T17:28:56Z

@Silthus I'd be more comfortable if we still use GQL as the interface (not introducing REST in this PR). You can create a new GQL bulk mutation and reuse the all DB operations in import-json.ts. There's no need to call existing update APIs.

Silthus · 2024-02-10T17:33:30Z

OK, I can understand why you are more comfortable with sticking to the GraphQL only API.

But I am not quite sure what you mean. Can you share a small example snippet or link to some docs? I just need to direction to head towards and will change the bulk import accordingly.

vnugent · 2024-02-10T18:41:44Z

You can piggyback Area GQL. I'm using updateAreaSortingOrder mutation as an example.

Add a new mutation definition to AreaEdit.gql, ex bulkUpload or whatever.

You'll need to explicitly define the input param. This is the schema/structure of the bulk json object. This helps with param validation and provides schema for GQL explorers.
Optionally, you can also create a Custom scalar to encapsulate the bulk upload payload. Example of existing custom scalar

Add a GQL mutation mapping to JS model (analogous to REST 'controller')

https://github.com/OpenBeta/openbeta-graphql/blob/develop/src/graphql/area/AreaMutations.ts#L72
You can make DB calls at this layer but we delegate DB operations to 'models', subclasses of Apollo DataSource. I'd create a new subclass of MutableAreaDatasource, maybe call it BulkDataSource, and move the logic you have in import-json.ts there.

Silthus · 2024-02-11T06:53:12Z

@vnugent thanks that helped a lot!

I adjusted the code accordingly, the only thing left to do are some failing tests where the input doesn't match up.

Let me know what you think of the changes.

I intentionally left the exchange of the apollo-server-express package in there to decouple the tests from a real running server listening on a port on localhost. This significantly speeds up testing and reduces complexity without any disadvantages.

vnugent

I have a small request to check both nullness and defined. Other than that we're good to merge.

src/model/BulkImportDataSource.ts

Silthus · 2024-02-15T04:35:05Z

I have a small request to check both nullness and defined. Other than that we're good to merge.

Thanks I learned something again today :)

Silthus added 9 commits February 8, 2024 08:32

feat(db): add json bulk import method

5b0649e

feat(import): add required default fields

613f549

docs: update db schema description of ancestors

a32db97

fix: do not infer gradecontext; require it

7913b82

feat(db): add bulk import of json data

4e9f8f4

adds OpenBeta/open-tacos#1031

feat(server): migrate to apollo-server-express

904566f

feat(server): add /import/json endpoint

c2c2f49

test: improve change testing and remove timeouts

38587d5

fix(auth): correctly return user ctx

8a76076

Silthus force-pushed the feature/bulk-import branch from d1a297c to 8a76076 Compare February 8, 2024 07:33

feat: add editor role requirement to import endpoint

d09167a

Silthus force-pushed the feature/bulk-import branch from f7a472c to 7ca922c Compare February 11, 2024 06:16

Silthus marked this pull request as draft February 11, 2024 06:26

feat(import): rework bulk import to use gql api

6addf8c

Silthus force-pushed the feature/bulk-import branch from 7ca922c to 6addf8c Compare February 11, 2024 07:35

fix(import): corecctly handle id conversion

f49ec3e

Silthus marked this pull request as ready for review February 12, 2024 09:49

l4u532 added 5 commits February 12, 2024 11:35

update descriptions in AreaEdit.gql

f47c017

update descriptions in AreaEdit.gql

ae6bfe1

update descriptions in AreaEdit.gql

f71d92a

update descriptions in AreaEdit.gql

4ef8609

update descriptions in AreaEdit.gql

a10a048

vnugent requested changes Feb 14, 2024

View reviewed changes

src/model/BulkImportDataSource.ts Outdated Show resolved Hide resolved

Silthus requested a review from vnugent February 15, 2024 04:34

fix: use juggling null checks instead of strict

aeb6e49

Silthus force-pushed the feature/bulk-import branch from 520265f to aeb6e49 Compare February 15, 2024 04:44

fix(gql): remove unused but required fields in climbs

44015e5

vnugent merged commit d912a16 into OpenBeta:develop Feb 15, 2024
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(api): add /import endpoint to support bulk importing json data #390

feat(api): add /import endpoint to support bulk importing json data #390

Silthus commented Feb 7, 2024

Silthus commented Feb 8, 2024

vnugent commented Feb 10, 2024

Silthus commented Feb 10, 2024

vnugent commented Feb 10, 2024

Silthus commented Feb 11, 2024

vnugent left a comment

Silthus commented Feb 15, 2024

feat(api): add /import endpoint to support bulk importing json data #390

feat(api): add /import endpoint to support bulk importing json data #390

Conversation

Silthus commented Feb 7, 2024

Silthus commented Feb 8, 2024

REST vs GraphQL for Bulk-Import of Data

Bulk Importing on the Server Side Using a New REST Endpoint

Pro

Cons

Importing via GQL

Pro

Cons

Conclusion

Plans for the Future

Export -> Edit -> Import Workflow

Data Cleanup

vnugent commented Feb 10, 2024

Silthus commented Feb 10, 2024

vnugent commented Feb 10, 2024

Silthus commented Feb 11, 2024

vnugent left a comment

Choose a reason for hiding this comment

Silthus commented Feb 15, 2024