Migration from pulp to createrepo-agent #972

Open
cottsay opened this issue Aug 18, 2022 · 20 comments

@cottsay
Member

cottsay commented Aug 18, 2022

Summary

This ticket tracks migrating the RPM repository management tool used in ros_buildfarm from pulp to a new purpose-built tool called createrepo-agent.

Background

RPM repository metadata consists of a collection of XML files which reside in a subdirectory of the repository root. The root document, repomd.xml, can be signed using a GPG key. Unlike Debian metadata, which uses a "clearsign" signature, repomd.xml.asc is a "detached" signature. Any modification to the contents of the repository typically results in changes to each of the ~5 XML files and the signature.
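
To make that layout concrete, here is a minimal Python sketch that lists the metadata files referenced by a repomd.xml. The sample document is hand-written and abbreviated (real files also carry checksums, sizes, and timestamps), and the `list_metadata_files` helper and the file names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# XML namespace used by repomd.xml.
REPO_NS = {"repo": "http://linux.duke.edu/metadata/repo"}

# Abbreviated, hand-written sample; the href values are made up.
SAMPLE_REPOMD = """\
<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <revision>1660000000</revision>
  <data type="primary">
    <location href="repodata/aaa-primary.xml.gz"/>
  </data>
  <data type="filelists">
    <location href="repodata/bbb-filelists.xml.gz"/>
  </data>
  <data type="other">
    <location href="repodata/ccc-other.xml.gz"/>
  </data>
</repomd>
"""


def list_metadata_files(repomd_xml: str) -> dict:
    """Map each metadata type to the path of the file that holds it."""
    root = ET.fromstring(repomd_xml)
    return {
        data.get("type"): data.find("repo:location", REPO_NS).get("href")
        for data in root.findall("repo:data", REPO_NS)
    }


for kind, href in list_metadata_files(SAMPLE_REPOMD).items():
    print(f"{kind}: {href}")
```

Because repomd.xml points at every other file, a client that fetches it and then fetches the referenced files is reading several documents that all have to belong to the same revision.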

Pulp is a general-purpose content management solution with robust plugins specifically targeted at RPMs. It leverages postgresql, redis, Django, and stores payload data in a CAS. It is written in Python, and uses several daemon processes to implement different roles to service different types of requests.

Motivation for this change

Pulp is a very powerful content management tool, but it is extremely heavyweight and complex. Implementing the required queries to perform package invalidation (as is required by ros_buildfarm) means that we must perform import operations serially, and performance at our scale has become unsustainable. Central to our performance problems is that metadata generation in Pulp is far too slow.

Additionally, the way RPM repository metadata is hosted is inherently prone to races when updating metadata that clients may be simultaneously downloading, because several separate files must be updated together. Pulp has no mitigation for this problem, and it is causing jobs to occasionally fail to download repository metadata.
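
For illustration only, one general way to narrow this window is to write the new, uniquely named data files first and swap in repomd.xml last with an atomic rename, leaving the older data files in place for clients that already fetched the previous repomd.xml. The Python sketch below shows that ordering. `publish_metadata` and its arguments are hypothetical, and this is not a description of what Pulp or createrepo-agent does internally.

```python
import os
import tempfile
from pathlib import Path


def publish_metadata(repodata: Path, data_files: dict, repomd: bytes) -> None:
    """Write the data files first, then replace repomd.xml in one step."""
    repodata.mkdir(parents=True, exist_ok=True)

    # 1. New data files get unique names, so they never overwrite files
    #    that an older repomd.xml still points at.
    for name, payload in data_files.items():
        (repodata / name).write_bytes(payload)

    # 2. Stage the new repomd.xml next to the final path, then rename it
    #    over the old one. os.replace is atomic when both paths are on the
    #    same filesystem, so readers see either the old or the new file.
    fd, tmp_name = tempfile.mkstemp(dir=repodata, prefix=".repomd.")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(repomd)
    os.chmod(tmp_name, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp_name, repodata / "repomd.xml")
```

A detached signature file would still be a second file to update, so this ordering narrows the race rather than removing it entirely.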

Another problem with our current solution is that the serialization of repository operations is tightly coupled to Jenkins, making it difficult to experiment with other orchestration and execution solutions.

After analyzing the performance problems we're currently experiencing with Pulp, it was decided that a new tool should be created which can solve several of the problems holding us back today.

Overview of createrepo-agent

High-level features:

  • Background process which keeps metadata in memory so that it doesn't need to be re-read for each change - only written.
  • Integrated change queue which not only ensures that simultaneous operations do not overwrite each other, but also batches all pending changes in the same metadata write operation.
  • No system provisioning beyond installation of the tool - existing repositories can be used or new ones created as necessary.
  • Process for keeping old metadata files (other than the top-level repomd.xml) and retiring them once they are unlikely to be requested.
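
The batching behavior of the change queue can be sketched in a few lines of Python. This is a toy model, not createrepo-agent's actual implementation, and the names (`BatchingQueue`, `submit`, `flush`) are invented for illustration: changes accumulate in a thread-safe queue, and a single flush drains everything pending into one metadata write.

```python
import queue


class BatchingQueue:
    """Collect pending changes and hand them to the writer in one batch."""

    def __init__(self, write_metadata):
        self._pending = queue.Queue()  # safe to fill from several threads
        self._write_metadata = write_metadata

    def submit(self, change):
        """Record a change without touching the metadata on disk."""
        self._pending.put(change)

    def flush(self):
        """Drain everything queued so far and write it in a single pass."""
        batch = []
        while True:
            try:
                batch.append(self._pending.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._write_metadata(batch)
        return len(batch)


writes = []
changes = BatchingQueue(writes.append)
changes.submit("add foo-1.0-1.x86_64.rpm")
changes.submit("add bar-2.0-1.x86_64.rpm")
changes.submit("remove foo-0.9-1.x86_64.rpm")
changes.flush()  # three changes, one metadata write
```

The point of the design is that the cost of regenerating metadata is paid once per flush rather than once per package.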

Roll out process

See #972 (comment)

@cottsay
Member Author

cottsay commented Aug 25, 2022

Discussion point: "upstream" repositories

In the Debian repositories, we're listing upstream repositories in a configuration file on the repo host which the import_upstream jobs pull from. This is how we get packages from the "bootstrap" repositories.

In Pulp, this is implemented by creating specifically named repository entities which can be synchronized from the upstream URL and then synchronized between pulp distributions just like a ROS distro sync. Though this is implemented, we aren't actually using it and the "bootstrap" repository for RHEL is empty.

The sync feature implemented in createrepo-agent will allow us to pull repository content from any remote URL just the same as a local one, so it can be used to implement this scenario. The tricky thing here is how we deal with storing the list of upstream repositories needed to invoke the operation.

The easiest way to align with how this is done on the debian side is simply to write out a script to a known location on the repo host and invoke that script in the import_upstream job. The script would contain a list of createrepo-agent sync invocations for each upstream repository.
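
As a rough sketch of what that list of invocations could look like, the Python snippet below renders one command line per upstream repository. The `--sync` and `--arch` options and the destination path mirror the invocation used in the deployment checklist later in this thread. The upstream URL, the `UPSTREAMS` structure, and the `sync_command` helper are hypothetical.

```python
import shlex

# Hypothetical upstream list; the URL is a placeholder, not a real repository.
UPSTREAMS = [
    {
        "url": "http://example.com/ros_bootstrap/rhel/8/",
        "arches": ["SRPMS", "x86_64"],
        "dest": "/var/repos/rhel/building/8/",
    },
]


def sync_command(url, arches, dest):
    """Build the argument list for one createrepo-agent sync invocation."""
    return [
        "createrepo-agent",
        f"--sync={url}",
        *[f"--arch={arch}" for arch in arches],
        dest,
    ]


# Print one shell-quoted command per upstream repository.
for upstream in UPSTREAMS:
    print(shlex.join(sync_command(**upstream)))
```

Keeping the list as data and rendering the commands from it would make it easy to write the script out during deployment, the same way the reprepro configuration is handled.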

At the moment, I'm considering simply dropping import_upstream support for RPM repositories since we just don't use the feature. For the most part, it is better in the long-term to put packages directly into Fedora/EPEL which aren't actual ROS packages.


CONCLUSION: Implement upstream repositories using createrepo-agent, and keep importing from the existing ros_bootstrap repository. We are not currently using this feature and have no immediate plans to, but we might someday.

@tfoote
Member

tfoote commented Aug 25, 2022

For the most part, it is better in the long-term to put packages directly into Fedora/EPEL which aren't actual ROS packages.

I generally agree with this. My one worry about this is how quickly can we get out releases? Is it bottlenecked on you uploading them?

@cottsay
Member Author

cottsay commented Aug 25, 2022

...how quickly can we get out releases?

In general, there are usually ~36 hrs of overhead and ~7 days of "baking" in the testing repositories. The latter can be expedited automatically if a few community members test the update manually and report their findings.

Is it bottlenecked on you uploading them?

Yes, though the robotics SIG has the ability to push updates to these packages as well. We (Open Robotics) don't currently have any automation, documentation, or tooling for updating RPM packages to the bootstrap repository as it is, so it would probably still need to go through me. Rather than spending resources on standing that up, I'd rather focus on getting more team members into the Fedora robotics SIG who can push the packages and test them for faster update turnarounds.

Great questions, BTW.

@nuclearsandwich
Contributor

The high level write-up is superb and will be great for folks who aren't as familiar with the internals of the buildfarm or aren't at a scale where they're feeling our pain.

One question: this looks like there's a hard cut between createrepo-agent and pulp. A. Is that the case or is it just not clear from the plan that both systems will be run in parallel during an initial phase? B. If there's a hard cut to createrepo-agent, am I forgetting a conversation where you convinced me to go ahead with that? 😅


I think it is worth trying to build out a detailed checklist of steps either in this issue or one on the private config repository. Two examples (both on private config repos) are the original pulp deployment and the migration of the ROS build farm to Ubuntu Xenial. Writing out the exhaustive lists helps identify step ordering (and possible conflicts between what is expected to happen first) as well as providing a safety net during the migration when adrenaline impedes critical thinking.


Discussion point: "upstream" repositories

I do not think that we need to block this deployment on having an import-upstream feature in the createrepo-agent. But I do not think that we can leave import_upstream support unimplemented or forego having a bootstrap repo for RPM repositories entirely.

I do agree with Scott that our steady state should be pushing changes to infrastructure upstream. The release cadence of Debian and Ubuntu doesn't enable that, which creates potential conflicts between the upstream and ROS Infra-provided versions of the infrastructure packages. Those conflicts can be avoided by publishing directly to Fedora project archives (EPEL is a Fedora project).

My one worry about this is how quickly can we get out releases? Is it bottlenecked on you uploading them?

I haven't been following the Fedora releases very closely, but my understanding is that the major messy issues with infra packages are caught and settled by people using the packages from the ROS repos on Debian/Ubuntu swiftly enough that our Fedora infra people (cottsay alone, at present) can update the pending releases without having to release the intermediate duds. A good portion of those never make it to Fedora in the first place, thanks to the intrepid community running out of the ROS repos on our more widely adopted platforms.

My main concern is that the bootstrap repo has other uses beyond distributing the latest ROS infrastructure packages, some legacy which we're trying to move away from and some still valid.

As far as legacy goes, we've used the bootstrap repository to pull in packages that are not available upstream in order to provide them for ROS. With Jammy in particular, however, we've limited this to just very closely associated projects like Gazebo and Colcon, where we work directly with upstream and providing packages via the ROS bootstrap repository is primarily a matter of ensuring that consistent versions are available and in use.

There are also still packages, like those provided by commercial DDS vendors, which we would need to publish but which aren't suitable for Fedora Project repositories.

Lastly, because of the way rosdep and bloom interact, we may also need the bootstrap repo to create equivs / empty packages for soft dependencies that are only available on, for example amd64 but not arm64, which rosdep doesn't model even if the package is available in EPEL on another platform.

We (Open Robotics) don't currently have any automation, documentation, or tooling for updating RPM packages to the bootstrap repository as it is, so it would probably still need to go through me. Rather than spending resources on standing that up, I'd rather focus on getting more team members into the Fedora robotics SIG who can push the packages and test them for faster update turnarounds.

I agree that getting more of us active in the Fedora Robotics SIG is the primary focus but I also don't see how documentation and automation hurt that aim if we assume that we'll always have some term rotation among ROS infra members. (I misread the above as documentation about how to contribute updates to the Robotics SIG not the current RPM bootstrap repo).

I think we can revisit implementation details for an RPM ros_bootstrap repo in future discussions; I agree that time doesn't need to be spent there now.

The latter can be expedited automatically if a few community members test the update manually and report their findings.

Going along with expanding our involvement in the Robotics SIG, I do think it's important to be careful that we don't unintentionally create a bloc here that acts in concert to try and override the usual mechanisms for review. But if we get genuine, distributed, community input on issues, that's super valuable.

@cottsay
Member Author

cottsay commented Aug 29, 2022

Thanks for your detailed thoughts, @nuclearsandwich.

...looks like there's a hard cut between createrepo-agent and pulp...Is that the case...?

That's correct. The primary driver for this is that our implementations using Pulp and createrepo-agent use entirely different credentials - and even types of credentials - from each other. Supporting both in parallel would mean wiring new credentials into each ros_buildfarm_config build file. If this is what you'd like to see, I can update the PR, but we'll have to introduce some temporary configuration parameters for providing the legacy/Pulp credentials.

...am I forgetting a conversation where you convinced me to go ahead with that?

You're probably not missing anything, I probably just overlooked or forgot about it. Discussing it here (in writing) is good.

I think it is worth trying to build out a detailed checklist...

Sure, I can do that. Expect a follow-up comment on this issue.

I do not think that we can leave import_upstream support unimplemented or forego having a bootstrap repo for RPM repositories entirely.

The existing 'sync' scenario in createrepo-agent can handle what we need here. My hesitation mostly stemmed from shoehorning it into the same place in our workflows as the import jobs for reprepro. Namely, the fact that "upstream" repositories are specified as part of deployment and not part of configuration is...less than ideal. It is absolutely technically possible for me to make this work the same for the RPM repositories, but I wanted to weigh it against what we'd like to see.

It requires very little effort to stand this up with the goal of feature parity with our use of reprepro here. I'll make that happen, now that my hesitations have been heard.

@nuclearsandwich
Contributor

That's correct. The primary driver for this is that our implementations using Pulp and createrepo-agent use entirely different credentials - and even types of credentials - from each other. Supporting both in parallel would mean wiring new credentials into each ros_buildfarm_config build file. If this is what you'd like to see, I can update the PR, but we'll have to introduce some temporary configuration parameters for providing the legacy/Pulp credentials.

I'm fine with transitional configuration options in principle but I also respect the work you've put into this and the confidence that you have that a hard-cut is not a huge risk. I think we should discuss synchronously the pros and cons of a parallel deployment versus a hard-cut with a working revert path together with @clalancette and then adopt the approach we prefer coming out of that.

@nuclearsandwich
Contributor

We actually just finished a discussion where we decided to pursue the parallel deployment approach. It's more work than a hard cut but the value in recovering from unforeseen issues makes it worthwhile.

@cottsay
Member Author

cottsay commented Aug 30, 2022

I've updated all of the necessary branches to support the import_upstream job as well as running Pulp in parallel with createrepo-agent.

@cottsay
Member Author

cottsay commented Aug 30, 2022

Pre-Deployment Checklist

Deployment Checklist

  • Comment on discourse thread stating that downtime is starting
  • Place build.ros2.org in shutdown mode
  • Create EBS snapshot of repo.ros2.org host in EC2
    - ros2-buildfarm-repo-host-backup-2022-09-20
    - ros2-buildfarm-repo-storage-backup-2022-09-20
  • Merge production buildfarm configuration changes: Update credentials used for uploaded RPM packages ros2/ros_buildfarm_config#248
  • Merge production cookbook changes: Switch from Pulp to createrepo-agent for RPMs cookbook-ros-buildfarm#116
  • Re-deploy repo.ros2.org
    • $ sudo -Es
    • $ cd /root/ros-buildfarm-deployment && ./configure
    • $ createrepo-agent --sync=http://127.0.0.1/rhel_pulp/building/8/ --arch=SRPMS --arch=x86_64 /var/repos/rhel/building/8/
      ❌ This command failed. It seems that the curl handler in createrepo_c is tossing connection errors. I'm not sure what's to blame here, but these commands should have a sufficiently similar result:
        $ rsync --recursive --times --delete --itemize-changes rsync://127.0.0.1:1234/ros-building-rhel-8-SRPMS/ /var/repos/rhel/building/8/SRPMS/
        $ rsync --recursive --times --delete --exclude=debug --itemize-changes rsync://127.0.0.1:1234/ros-building-rhel-8-x86_64/ /var/repos/rhel/building/8/x86_64/
        $ rsync --recursive --times --delete --itemize-changes rsync://127.0.0.1:1234/ros-building-rhel-8-x86_64-debug/ /var/repos/rhel/building/8/x86_64/debug/
      
  • Merge changes to buildfarm code: Switch from Pulp to createrepo-agent for RPMs #976
  • Deploy buildfarm changes:
    • python3 -m ros_buildfarm.scripts.release.generate_release_maintenance_jobs https://raw.githubusercontent.com/ros2/ros_buildfarm_config/ros2/index.yaml rolling rhel
    • python3 -m ros_buildfarm.scripts.release.generate_release_maintenance_jobs https://raw.githubusercontent.com/ros2/ros_buildfarm_config/ros2/index.yaml humble rhel
      ❗ Note: No substantial changes
    • python3 -m ros_buildfarm.scripts.release.generate_release_maintenance_jobs https://raw.githubusercontent.com/ros2/ros_buildfarm_config/ros2/index.yaml galactic rhel
      ❗ Note: No substantial changes
  • Trigger release job reconfiguration jobs on Jenkins:
  • Bring build.ros2.org out of shutdown mode
  • Comment on discourse thread stating that downtime has finished

When there are no in-progress jobs, we can then use commands like this to look for differences between the repositories:

dnf repodiff --refresh --repofrompath=old,http://repo.ros2.org/rhel_pulp/building/8/x86_64/ --repofrompath=new,http://repo.ros2.org/rhel/building/8/x86_64/ --repo-old=old --repo-new=new --compare-arch --archlist=x86_64
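
If the same comparison is wanted for more than one repository, the command line can be generated rather than retyped. The Python sketch below rebuilds the command above from its parts. The `repodiff_command` helper is hypothetical, and treating the middle path component as a swappable "stage" assumes the other repositories follow the same URL layout as building, which this thread does not confirm.

```python
def repodiff_command(stage, os_version="8", arch="x86_64",
                     host="http://repo.ros2.org"):
    """Build a dnf repodiff command comparing the Pulp and new repositories."""
    old = f"{host}/rhel_pulp/{stage}/{os_version}/{arch}/"
    new = f"{host}/rhel/{stage}/{os_version}/{arch}/"
    return [
        "dnf", "repodiff", "--refresh",
        f"--repofrompath=old,{old}",
        f"--repofrompath=new,{new}",
        "--repo-old=old",
        "--repo-new=new",
        "--compare-arch",
        f"--archlist={arch}",
    ]


# Reproduces the command shown above for the building repository.
print(" ".join(repodiff_command("building")))
```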

@nuclearsandwich
Contributor

Am I right in thinking that Foxy is intentionally omitted from the above checklists because it has no RPM builds?

@nuclearsandwich
Contributor

When there are no in-progress jobs, we can then use commands like this to look for differences between the repositories:
...

We can also use repodiff to check the testing and main repositories as long as there's no in-progress sync right?

@cottsay
Member Author

cottsay commented Aug 30, 2022

Am I right in thinking that Foxy is intentionally omitted from the above checklists because it has no RPM builds?

Right, there is no reconfigure job at all because there are no build files for RHEL or Fedora.

We can also use repodiff to check the testing and main repositories as long as there's no in-progress sync right?

Absolutely.

@nuclearsandwich
Contributor

Man this is gonna start to feel like code review. Can we DevOps so hard we version control our checklists?

  • For the sake of completeness may I suggest adding the maintenance downtime discourse announcement and status page updates to the checklist
  • Likewise I think it's worth making EBS snapshot backups of at least the repo host if not also the Jenkins host before beginning the redeploy as a cover-your-butt.

Beyond that I think this checklist is bang on and can stand as the plan. We'll pick a date later on once some of the other parts of this process are through review. Thanks for taking the time to build it up!

@cottsay
Member Author

cottsay commented Aug 31, 2022

maintenance downtime discourse announcement

Sure, added.

EBS snapshot backups of at least the repo host

Also added - good call. I really don't think a Jenkins host snapshot will help us - we don't even need to deploy to that host. All of the job configurations are managed by ros_buildfarm anyway, so we have a robust rollback path that I feel safer using than restoring a whole host backup.

Please find some time in the next week to review the three linked PRs in the pre-deployment checklist.

@nuclearsandwich
Contributor

Please find some time in the next week to review the three linked PRs in the pre-deployment checklist.

Definitely! I focused on the plan first but I'll be sure to do those next.

@nuclearsandwich
Contributor

Please find some time in the next week to review the three linked PRs in the pre-deployment checklist.

Definitely! I focused on the plan first but I'll be sure to do those next.

First pass reviews completed on each linked PR.

@cottsay
Member Author

cottsay commented Sep 7, 2022

First pass reviews completed on each linked PR.

Many thanks - I believe I've addressed this round of feedback. I think that after this next round, we should set a date.

@nuclearsandwich
Contributor

Many thanks - I believe I've addressed this round of feedback. I think that after this next round, we should set a date.

Agreed and all ✔️! I put it on Monday's agenda to pick a date.

@cottsay
Member Author

cottsay commented Sep 19, 2022

Cross-linking the discourse post in this thread.

tl;dr - this is scheduled to happen tomorrow (2022-9-19).

@cottsay
Member Author

cottsay commented Sep 21, 2022

The initial deployment has concluded, but I'd like to keep this ticket open until we've removed Pulp completely.
