Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(api): add /import endpoint to support bulk importing json data #390

Merged
merged 19 commits into from
Feb 15, 2024

Conversation

Silthus
Copy link
Contributor

@Silthus Silthus commented Feb 7, 2024

Hey Openbeta Team :)

As discussed in OpenBeta/open-tacos#1031 it would be great to have a simple page where users can bulk import data using a simple, validated json schema.

This PR adds the required functionality to the backend, enabling the integration of a bulk import in the frontend. The frontend is also actively being worked on by @l4u532.

What this PR does:

  • it changes from apollo-server to apollo-server-express allowing the integration of additional http endpoints without changing the graphql api. You can read more about the swapping of the packages here: https://www.apollographql.com/docs/apollo-server/v3/integrations/middleware/#swapping-out-apollo-server
  • it adds a new endpoint a /import for authorized users, allowing the import of a defined json schema. The schema will be defined in the PR for the frontend.
  • the data is inserted using the standard functions used by the mutations of the graphql api ensuring data validity.
  • I also cleaned up some of the tests and removed the timeouts waiting for the changes to emit and swapped the database for an in memory database everywhere, getting rid of the required docker instance for testing
  • everything is end2end tested and was developed using a test first approach

Let me know what you think. I will happily make any adjustments to the PR.

@Silthus
Copy link
Contributor Author

Silthus commented Feb 8, 2024

Thanks @vnugent for the feedback on Discord about discussing the pros and cons of using the existing GraphQL API vs. creating a new REST endpoint first.
For documentation purposes, I will continue the discussion here.


First off: you are absolutely right to discuss the pros and cons before going forward with such a change. You built such an amazing product and we are very happy to hear your thoughts and feedback about the change we are about to implement.
Frankly, we were so deep into coding, that we forgot to share our train of thought. So here is the pro and con list we went through before deciding to go with the extra /import endpoint. We are eager to hear your thoughts about it and more than willing to change things if you disagree or have concerns.

REST vs GraphQL for Bulk-Import of Data

[Draft] Example import.json
{
  "areas": [
    {
      "areaName": "Utah",
      "countryCode": "us",
      "children": [
        {
          "areaName": "Southeast Utah",
          "children": [
            {
              "areaName": "Indian Creek",
              "children": [
                {
                  "areaName": "Supercrack Buttress",
                  "climbs": [
                    {
                      "name": "The Key Flake",
                      "grade": "5.10",
                      "fa": "unknown",
                      "type": {
                        "trad": true
                      },
                      "safety": "UNSPECIFIED",
                      "metadata": {
                        "lnglat": {
                          "type": "Point",
                          "coordinates": [
                            -109.54552,
                            38.03635
                          ]
                        },
                        "left_right_index": 1
                      },
                      "content": {
                        "description": "Cool off-width that requires off-width and face skills.",
                        "protection": "Anchors hidden up top. Need 80m to make it all the way down.",
                        "location": "Opposite keyhole flake. Obvious right leaning offwidth that starts atop 20 ft boulder."
                      }
                    },
                    {
                      "name": "Incredible Hand Crack",
                      "grade": "5.10",
                      "fa": "Rich Perch, John Bragg, Doug Snively, and Anne Tarver,  1978",
                      "type": {
                        "trad": true
                      },
                      "safety": "UNSPECIFIED",
                      "metadata": {
                        "lnglat": {
                          "type": "Point",
                          "coordinates": [
                            -109.54552,
                            38.03635
                          ]
                        },
                        "left_right_index": 2
                      },
                      "content": {
                        "description": "Route starts at the top of the trail from the parking lot to Supercrack Buttress.",
                        "protection": "Cams from 2-2.5\". Heavy on 2.5\" (#2 Camalot)",
                        "location": ""
                      },
                      "pitches": [
                        {
                          "pitchNumber": 1,
                          "grade": "5.10",
                          "disciplines": {
                            "trad": true
                          },
                          "length": 100,
                          "boltsCount": 0,
                          "description": "A classic hand crack that widens slightly towards the top. Requires a range of cam sizes. Sustained and excellent quality."
                        },
                        {
                          "pitchNumber": 2,
                          "grade": "5.9",
                          "disciplines": {
                            "trad": true
                          },
                          "length": 30,
                          "boltsCount": 0,
                          "description": "Easier climbing with good protection. Features a mix of crack sizes. Shorter than the first pitch but equally enjoyable."
                        }
                      ]
                    }
                  ],
                  "gradeContext": "US",
                  "metadata": {
                    "isBoulder": false,
                    "isDestination": false,
                    "leaf": true,
                    "lnglat": {
                      "type": "Point",
                      "coordinates": [
                        -109.54552,
                        38.03635
                      ]
                    },
                    "bbox": [
                      -109.54609091005857,
                      38.03590033981814,
                      -109.54494908994141,
                      38.03679966018186
                    ]
                  },
                  "content": {
                    "description": ""
                  }
                }
              ],
              "metadata": {
                "lnglat": {
                  "type": "Point",
                  "coordinates": [
                    -109.5724044642857,
                    38.069429035714286
                  ]
                }
              },
              "content": {
                "description": "Indian Creek is a crack climbing mecca in the southeastern region of Utah, USA. Located within the [Bears Ears National Monument](https://en.wikipedia.org/wiki/Bears_Ears_National_Monument)."
              }
            }
          ]
        }
      ]
    }
  ]
}

Bulk Importing on the Server Side Using a New REST Endpoint

This is done using a separate endpoint, swapping out the apollo-server package against the apollo-server-express package (see here for more details).

Pro

  • everything is done in a single transaction, rolling back all changes if one operation fails, and only committed if everything has run successfully 1x
  • does not break if large amounts of data are uploaded
  • no queuing mechanism needed for handling the import of several area trees
  • proven mass upload mechanism battle tested with mountain project import
  • new fields in the database are automatically exposed without the need to adjust the GQL
  • potentially faster (however this was not tested and is just speculation)

Cons

  • little work is needed in the core to swap out apollo-server against apollo-server-express
  • authentication needs to be handled separately, although sharing most of the auth logic
  • adds complexity by introducing an additional endpoint in the api

Importing via GQL

This would be done by processing the json document on the client constructing a GraphQL query and sending it to the server.

Pro

  • requires little to no server side changes
  • uses the same permissions and auth as the current frontend and api
  • no need to swap out the server engine

Cons

  • every mutation runs as a single transaction
    • failing a mutation deep down in the tree leaves the import only partially completed
    • partial imports cannot be repeated as mutations that already executed will fail
    • could leave imports in an uncompleted state
  • additional non type safe logic is required to construct the correct GraphQL query, which is potentially unstable
  • unsure how very large GraphQL queries are handled by the server, using a REST endpoint on the other hand has been battle tested with large data
  • changing the GrapQL api requires adjustment of the JSON parser

Conclusion

We believe that doing the little bit of extra work and implementing a custom endpoint for importing bulk data is the right way to go.

The main reason for that is the transaction safety ensuring that a single .json file can be imported, using multiple retries to fix possible errors, without breaking the import and leaving the database in a partial state.

The second biggest reason we prefer the server side import is the additional complexity it brings to changing the GraphQL endpoints. By separating it from the regular API, a new contributor or existing collaborator does not need to touch the frontend of the bulk importer when changing the GraphQL API. Also we don't really know how it would behave with large amounts of data.

Plans for the Future

Here are some additional ideas and thoughts we had regarding the import feature, which also influenced our decision:

Export -> Edit -> Import Workflow

The idea is to add an export function that allows exporting selected areas or regions with individual properties into a single json file which then can be edited and uploaded using the import page.
This would allow for a very easy way to update existing data. The export would be similar to the existing openbeta-export, except that it produces a single json file in the schema required by the import.

Data Cleanup

One issue with Openbeta is the quality of data. Adding such an export and import workflow could allow individual, non developers, to start cleaning up data in a larger scope.

@vnugent
Copy link
Contributor

vnugent commented Feb 10, 2024

@Silthus I'd be more comfortable if we still use GQL as the interface (not introducing REST in this PR). You can create a new GQL bulk mutation and reuse the all DB operations in import-json.ts. There's no need to call existing update APIs.

@Silthus
Copy link
Contributor Author

Silthus commented Feb 10, 2024

OK, I can understand why you are more comfortable with sticking to the GraphQL only API.

But I am not quite sure what you mean. Can you share a small example snippet or link to some docs? I just need to direction to head towards and will change the bulk import accordingly.

@vnugent
Copy link
Contributor

vnugent commented Feb 10, 2024

You can piggyback Area GQL. I'm using updateAreaSortingOrder mutation as an example.

  1. Add a new mutation definition to AreaEdit.gql, ex bulkUpload or whatever.
  • You'll need to explicitly define the input param. This is the schema/structure of the bulk json object. This helps with param validation and provides schema for GQL explorers.
  • Optionally, you can also create a Custom scalar to encapsulate the bulk upload payload. Example of existing custom scalar
  1. Add a GQL mutation mapping to JS model (analogous to REST 'controller')

@Silthus Silthus marked this pull request as draft February 11, 2024 06:26
@Silthus
Copy link
Contributor Author

Silthus commented Feb 11, 2024

@vnugent thanks that helped a lot!

I adjusted the code accordingly, the only thing left to do are some failing tests where the input doesn't match up.

Let me know what you think of the changes.

I intentionally left the exchange of the apollo-server-express package in there to decouple the tests from a real running server listening on a port on localhost. This significantly speeds up testing and reduces complexity without any disadvantages.

@Silthus Silthus marked this pull request as ready for review February 12, 2024 09:49
Copy link
Contributor

@vnugent vnugent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a small request to check both nullness and defined. Other than that we're good to merge.

src/model/BulkImportDataSource.ts Outdated Show resolved Hide resolved
@Silthus Silthus requested a review from vnugent February 15, 2024 04:34
@Silthus
Copy link
Contributor Author

Silthus commented Feb 15, 2024

I have a small request to check both nullness and defined. Other than that we're good to merge.

Thanks I learned something again today :)

@vnugent vnugent merged commit d912a16 into OpenBeta:develop Feb 15, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants