Copr

Copr is a build system for 3rd-party packages.

Production instances:

  • Frontend

  • Backend

  • Package signer

    • copr-keygen.cloud.fedoraproject.org

  • Dist-git

    • copr-dist-git.fedorainfracloud.org

Devel instances also exist, but you do not need to care about them, only about the production instances above.

Contact Information

Owner

msuchy (mirek)

Contact

#fedora-admin, #fedora-buildsys

Location

Fedora Cloud

Purpose

Build system

This document

This document provides condensed information allowing you to keep Copr alive and working. For more sophisticated business processes, please see https://docs.pagure.org/copr.copr/maintenance_documentation.html

TROUBLESHOOTING

Almost every problem with Copr is due to a problem with spawning builder VMs, or with processing the action queue on the backend.

VM spawning/termination problems

Try to restart the copr-backend service:

$ ssh root@copr-be.cloud.fedoraproject.org
$ systemctl restart copr-backend

If this doesn’t solve the problem, try to follow logs for some clues:

$ tail -f /var/log/copr-backend/{vmm,spawner,terminator}.log

As a last-resort option, you can terminate all builders and let copr-backend throw away all information about them. This will obviously interrupt all running builds; they will be rescheduled:

$ ssh root@copr-be.cloud.fedoraproject.org
$ systemctl stop copr-backend
$ cleanup_vm_nova.py
$ redis-cli
> FLUSHALL
$ systemctl start copr-backend

Sometimes OpenStack cannot handle spawning too many VMs at the same time, so it is safer to lower the worker count. On copr-be.cloud.fedoraproject.org edit:

vi /etc/copr/copr-be.conf

and change:

group0_max_workers=12

to "6". Start the copr-backend service and, some time later, raise the value back to the original. Copr automatically detects the change in the config and increases the number of workers.
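The temporary change can be scripted; here is a minimal sketch, demonstrated on a scratch copy of the file (on the real backend you would edit /etc/copr/copr-be.conf in place and restart copr-backend):

```shell
# Demonstration on a scratch copy; on copr-be the file is
# /etc/copr/copr-be.conf and you would restart copr-backend afterwards.
cfg=$(mktemp)
printf 'group0_max_workers=12\n' > "$cfg"

# Halve the worker count while OpenStack struggles with VM spawning.
sed -i 's/^group0_max_workers=.*/group0_max_workers=6/' "$cfg"
grep '^group0_max_workers' "$cfg"    # group0_max_workers=6

# Later, restore the original value; Copr picks the change up.
sed -i 's/^group0_max_workers=.*/group0_max_workers=12/' "$cfg"
grep '^group0_max_workers' "$cfg"    # group0_max_workers=12
rm -f "$cfg"
```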

The set of aarch64 VMs isn’t maintained by OpenStack, but by Copr’s backend itself. Steps to diagnose:

$ ssh root@copr-be.cloud.fedoraproject.org
[root@copr-be ~][PROD]# systemctl status resalloc
● resalloc.service - Resource allocator server
...

[root@copr-be ~][PROD]# less /var/log/resallocserver/main.log

[root@copr-be ~][PROD]# su - resalloc

[resalloc@copr-be ~][PROD]$ resalloc-maint resource-list
13569 - aarch64_01_prod_00013569_20190613_151319 pool=aarch64_01_prod tags=aarch64 status=UP
13597 - aarch64_01_prod_00013597_20190614_083418 pool=aarch64_01_prod tags=aarch64 status=UP
13594 - aarch64_02_prod_00013594_20190614_082303 pool=aarch64_02_prod tags=aarch64 status=STARTING
...

[resalloc@copr-be ~][PROD]$ resalloc-maint ticket-list
879 - state=OPEN tags=aarch64 resource=aarch64_01_prod_00013569_20190613_151319
918 - state=OPEN tags=aarch64 resource=aarch64_01_prod_00013608_20190614_135536
904 - state=OPEN tags=aarch64 resource=aarch64_02_prod_00013594_20190614_082303
919 - state=OPEN tags=aarch64
...

Be careful when there is a resource in the STARTING state. If so, check /usr/bin/tail -F -n +0 /var/log/resallocserver/hooks/013594_alloc. Copr takes tickets from the resalloc server; if resources fail to spawn, the ticket numbers stay unassigned to an appropriately tagged resource for a long time.
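To spot the STARTING resources quickly, you can filter the listing; a sketch over a sample of the output above (on copr-be you would pipe the real resalloc-maint resource-list output instead of the here-document):

```shell
# Print the names of resources stuck in STARTING; replace the
# here-document with `resalloc-maint resource-list` on copr-be.
cat <<'EOF' | awk '$NF == "status=STARTING" {print $3}'
13569 - aarch64_01_prod_00013569_20190613_151319 pool=aarch64_01_prod tags=aarch64 status=UP
13594 - aarch64_02_prod_00013594_20190614_082303 pool=aarch64_02_prod tags=aarch64 status=STARTING
EOF
# prints: aarch64_02_prod_00013594_20190614_082303
```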

If that happens (it shouldn't) and there is an inconsistency between resalloc's database and the actual state on the aarch64 hypervisors (ssh copr@virthost-aarch64-os0{1,2}.fedorainfracloud.org and use virsh there to inspect the VM states), use resalloc-maint resource-delete, resalloc ticket-close, or psql commands to fix up resalloc's DB.

Backend Troubleshooting

Information about the status of Copr backend services:

systemctl status copr-backend*.service

Utilization of workers:

ps axf

Worker processes change their $0 to show which task they are working on and on which builder.

To list which VM builders are tracked by copr-vmm service:

/usr/bin/copr_get_vm_info.py

Appstream builder troubleshooting

Appstream builder is painfully slow when running on a repository with a huge number of packages. See https://github.com/hughsie/appstream-glib/issues/301 . You might need to disable it for some projects:

$ ssh root@copr-be.cloud.fedoraproject.org
$ cd /var/lib/copr/public_html/results/<owner>/<project>/
$ touch .disable-appstream
# You should probably also delete existing appstream data because
# they might be obsolete
$ rm -rf ./appdata

Backend action queue issues

First, check the number of not-yet-processed actions. If that number isn't zero and isn't decreasing reasonably fast (say, a single action takes longer than 30s), there might be a problem. Logs for the action dispatcher can be found in:

/var/log/copr-backend/action_dispatcher.log

Check that there is no stuck process under the Action dispatch parent process in the pstree -a copr output.
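One way to flag stuck children is to compare their elapsed times against the ~30s threshold above. A sketch over sample `ps -o pid=,etimes=,comm=` output; the PIDs and lines below are made up for illustration, and on copr-be you would feed in real ps output for the children of the Action dispatch parent found via pstree:

```shell
# Flag action processes running longer than 30 seconds. The sample
# lines are hypothetical; on copr-be feed in real output of
#   ps -o pid=,etimes=,comm= --ppid <PID of the Action dispatch parent>
cat <<'EOF' | awk '$2 > 30 {print "possibly stuck:", $1, "("$2"s)"}'
4321 12 copr-backend
4322 95 copr-backend
EOF
# prints: possibly stuck: 4322 (95s)
```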

Deploy information

Using playbooks and rbac:

$ sudo rbac-playbook groups/copr-backend.yml
$ sudo rbac-playbook groups/copr-frontend-cloud.yml
$ sudo rbac-playbook groups/copr-keygen.yml
$ sudo rbac-playbook groups/copr-dist-git.yml

The copr-setup.txt manual (https://pagure.io/copr/copr/blob/master/f/copr-setup.txt) is severely outdated, but there is no up-to-date alternative. We should extract the useful information from it, put it here in the SOP or into https://docs.pagure.org/copr.copr/maintenance_documentation.html, and then throw copr-setup.txt away.

The copr-backend service (which spawns several processes) should run on the backend. The backend spawns VMs in Fedora Cloud. You cannot log in to those machines directly. You have to:

$ ssh root@copr-be.cloud.fedoraproject.org
$ su - copr
$ copr_get_vm_info.py
# find IP address of the VM that you want
$ ssh root@172.16.3.3

Instances can be easily terminated in https://fedorainfracloud.org/dashboard

Order of start up

When reprovisioning, start the copr-keygen and copr-dist-git machines first (in any order). Then you can start copr-be. Well, you can start it sooner, but make sure that the copr-* services are stopped.

The copr-fe machine is completely independent and can be started at any time. If the backend is stopped, the frontend will just queue jobs.

Logs

Backend

  • /var/log/copr-backend/action_dispatcher.log

  • /var/log/copr-backend/actions.log

  • /var/log/copr-backend/backend.log

  • /var/log/copr-backend/build_dispatcher.log

  • /var/log/copr-backend/logger.log

  • /var/log/copr-backend/spawner.log

  • /var/log/copr-backend/terminator.log

  • /var/log/copr-backend/vmm.log

  • /var/log/copr-backend/worker.log

And several logs for non-essential features, such as copr_prune_results.log, hitcounter.log, and cleanup_vms.log, that you shouldn't worry about.

Frontend

  • /var/log/copr-frontend/frontend.log

  • /var/log/httpd/access_log

  • /var/log/httpd/error_log

Keygen

  • /var/log/copr-keygen/main.log

Dist-git

  • /var/log/copr-dist-git/main.log

  • /var/log/httpd/access_log

  • /var/log/httpd/error_log

Services

Backend

  • copr-backend

    • copr-backend-action

    • copr-backend-build

    • copr-backend-log

    • copr-backend-vmm

  • redis

  • lighttpd

All the copr-backend-*.service units are configured to be part of copr-backend.service, so e.g. to restart all of them, just restart copr-backend.service.

Frontend

  • httpd

  • postgresql

Keygen

  • signd

Dist-git

  • httpd

  • copr-dist-git

PPC64LE Builders

Builders for PPC64LE are located at rh-power2.fit.vutbr.cz, and anyone with access to the buildsys SSH key can log in with it as msuchy@rh-power2.fit.vutbr.cz.

There are these commands in bin/ (XX is 26-29):

  • destroy-all.sh: destroys all VMs and reinits them

  • reinit-vmXX.sh: copies the VM image from a template

  • virsh-destroy-vmXX.sh: destroys a VM

  • virsh-start-vmXX.sh: starts a VM

  • get-one-vm.sh: starts one VM and returns its IP (this is used in Copr playbooks)

In case of a big queue of PPC64 tasks, simply call bin/destroy-all.sh; it will destroy the stuck VMs and copr-backend will spawn new ones.

Ports opened for public

Frontend:

Port  Protocol  Service  Reason
22    TCP       ssh      Remote control
80    TCP       http     Serving Copr frontend website
443   TCP       https    Serving Copr frontend website

Backend:

Port  Protocol  Service  Reason
22    TCP       ssh      Remote control
80    TCP       http     Serving build results and repos
443   TCP       https    Serving build results and repos

Distgit:

Port  Protocol  Service  Reason
22    TCP       ssh      Remote control
80    TCP       http     Serving cgit interface
443   TCP       https    Serving cgit interface

Keygen:

Port  Protocol  Service  Reason
22    TCP       ssh      Remote control

Resources justification

Copr currently uses the following resources.

Frontend

  • RAM: 2G (out of 4G) and some swap

  • CPU: 2 cores (3400MHz) with load 0.92, 0.68, 0.65

Most of the memory is eaten by PostgreSQL, followed by Apache. The CPU usage is also dominated by those two services, but in the reversed order.

We can't settle for any instance that provides less than 2G RAM; ideally, we need 3G+. A 2-core CPU is good enough.

  • Disk space: 17G for system and 8G for pgsqldb directory

If needed, we can clean up the database directory of old dumps and backups and get down to around 4G of disk space.

Backend

  • RAM: 5G (out of 16G)

  • CPU: 8 cores (3400MHz) with load 4.09, 4.55, 4.24

The backend takes care of spinning up builders, running Ansible playbooks on them, running createrepo_c (on big repositories), and so on. Copr uses two queues: one for builds, which are delegated to the OpenStack builders, and one for actions. Actions are processed directly by the backend, so they can spike the load. We would ideally keep the computing power we have now. Maybe we can go lower than 16G RAM, possibly down to 12G.

  • Disk space: 30G for the system, 5.6T (out of 6.8T) for build results

Currently, we have 1.3T of backup data that is going to be deleted soon, but nevertheless, we cannot go any lower on storage. Disk space is a long-term issue for us; we have to make a lot of compromises just to survive our daily growth (around 10G of new data per day). Many features are blocked by not having enough storage. We cannot go any lower, and we also cannot go on much longer with the current storage.

Distgit

  • RAM: ~270M (out of 4G), but climbs to ~1G when busy

  • CPU: 2 cores (3400MHz) with load 1.35, 1.00, 0.53

Personally, I wouldn't downgrade the machine too much. We can possibly live with 3G RAM, but I wouldn't go any lower.

  • Disk space: 7G for system, 1.3T dist-git data

We already employ a lot of aggressive cleaning strategies on our dist-git data, so we can't go any lower than what we have.

Keygen

  • RAM: ~150M (out of 2G)

  • CPU: 1 core (3400MHz) with load 0.10, 0.31, 0.25

We are basically running just signd and httpd here, both with minimal resource requirements. The memory usage is topped by systemd-journald.

  • Disk space: 7G for system and ~500M (out of ~700M) for GPG keys

We are slowly pushing the GPG key storage to its limit, so in the case of migrating copr-keygen somewhere, we would like to scale it up to at least 1G.