Releases: fgci-org/fgci-ansible
v1.2.0 - Jelly Jamaican
General
- #115 - began cleaning up and simplifying package installation
Role Updates
- fgci-install: host_vars, if you have them, are now synced to the install node too
- #114 ansible-pull success/failures are now sent to cassini - it's possible to make a dashboard showing the status of ansible-pull
- it's possible to disable the ansible-pull cronjob (and the above annotations)
- cuda role is run before slurm
- slurm: some TRES improvements from @jabl
- yum: new variable global_packages introduced as another way to install packages
- fail2ban: --check should work now
- users: a fairly large update from @khappone that makes it figure out which user to use. In the future we could remove some tasks in the playbook and just put the users role first.
- yum-cron: it's now possible to configure yum-cron settings - for example to exclude java
v1.1.0 - Industrious Indigo
General updates
In the fgci-ansible git repo, the stable branch is now the master branch. If you want bleeding edge, use the devel branch. In either case, use the same branch for your ansible-push as for ansible-pull.
- minor change of order of some roles in playbooks, this is due to dependencies
- smart fact gathering in ansible - should speed up ansible runs
- allow setting resolv.conf search domain and other options
- arc-frontend updated to contain the NorduGrid 2015 chain
- many changes to the slurm role, mostly related to GPU and GRES
New Roles
- idmapd
- open-vm-tools (for VMware guests)
v1.0.0 - Humble Habu
General Updates
Role versions are now frozen in requirements.yml
This means that an updated role is not picked up instantly by the clusters. Someone has to manually update requirements.yml to point to a more recent commit or tag. The reason is to make things more stable and prepare for production.
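As an illustration of this pinning, a requirements.yml entry can carry a version field that names a tag or a commit. This is only a sketch: the role URL is one that appears in these notes, while the name and version values are hypothetical placeholders, not the pins actually used by fgci-ansible.

```yaml
# requirements.yml (sketch) - each role is pinned so clusters do not pick up new commits by accident
- src: https://github.com/jabl/ansible-role-pam
  name: ansible-role-pam      # hypothetical local name
  version: v1.0.0             # hypothetical tag; a full commit hash can be used instead
```

Moving to a newer role release then means editing version and reinstalling the roles, for example with ansible-galaxy install -r requirements.yml (adding --force to replace an already installed copy).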
The order in which some roles are executed has changed.
ansible-pull on compute nodes now pulls the files from the install node - being a bit nicer to github
NAT rules are now in the ferm firewall role as well as in the ip_forwarder role.
New tools:
- upgradeAndReboot.yml
  - Might not work for every cluster but
- gitmirror_conf_generate.py
  - Used by git-mirror to have a mirror of every role on the install node
New Variables
- slurm_nodelist and slurm_partitionlist
- ib_ip_addr: a variable in the hosts file that defines an IB IP address
New Roles
Most of these are not enabled by default, but can be enabled by setting some variables. See each role for more details.
- https://github.com/cmprescott/ansible-role-autofs
- https://github.com/mhakala/ansible-role-adauth
- https://github.com/jabl/ansible-role-pam
- https://github.com/mhakala/ansible-role-lustre_client
- https://github.com/CSC-IT-Center-for-Science/ansible-nsswitch
- https://github.com/jabl/ansible-role-gitmirror
Role Updates
- Fix serial console for compute nodes
- Improve partitioning on login/compute nodes
- only generate ed25519 ssh host keys
v0.9.4 - Gracious Green Snake
Production branch merged with master.
General Updates
fgci-ansible
- new variable to configure IB interfaces: network_ib_interfaces
- moved common network variables to group_vars/all/all.yml
- renamed variable ip_address to int_ip_addr in tasks and templates. This is also reflected in the inventory file "hosts".
- internal network IPs are now in hosts file
- Removing unused playbooks (after_provisioning.yml and all_nodes.yml)
- site.yml now has the reinstall and reboot playbooks commented out, allowing a harmless run
- nfs_exports: duplicated shares to cover all internal subnets (Ethernet and IB)
- fgci-meta-* metapackages are installed
- sync compute.yml and local.yml playbooks
- install and run smartd on compute nodes too
- Configure ib from dual to single port (ansible-role-rdma)
- admin node now installs a metapackage from FGCI7 repo
- new role: ansible-role-pdsh-machines for issue
- tools/reboot_computes.yml - Reboot without errors
- added ansible-role-cvmfs in nfs.yml - This will allow running performance tools available in CVMFS from NFS node
- if installed, NetworkManager gets removed
- FGCI7 repo added to login and nfs nodes
New Roles
ansible-role-serial-console
mhakala/ansible-role-pam
cmprescott/ansible-role-autofs
jtyr/ansible-nsswitch
mhakala/ansible-role-lustre_client
Role Updates
ansible-role-slurm
- Only create the slurm user/group on the NIS server
- cgroups: /sys/fs/cgroup is now mounted by systemd and not slurm
ansible-role-rdma
- add options to disable managing rdma and opensm services
- set the "lspci" shell commands to changed_when: False, so they never show up as "changed" in ansible.
- Since memlock unlimited is set by ansible-system-limits, the copy rdma.conf is disabled by default.
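The changed_when setting mentioned above might look like the following task. This is a minimal sketch rather than the role's actual task: the lspci pipeline, the task name and the registered variable name are assumptions chosen for illustration.

```yaml
# Sketch of a read-only hardware check; not copied from ansible-role-rdma
- name: Look for InfiniBand devices            # hypothetical task name
  shell: lspci | grep -i infiniband            # assumed command
  register: ib_devices                         # hypothetical variable name
  changed_when: False                          # a query never changes the host
  failed_when: ib_devices.rc not in [0, 1]     # grep exits 1 when nothing matches
```

Without changed_when: False, a shell task is reported as "changed" on every run, which hides real changes in the play recap.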
ansible-role-nis
- make nscd management optional and add state/enabled variables
- same variable for NetworkManager
- make the NISDOMAIN task a bit more readable
- skip stop and disable NetworkManager when NetworkManager isn't installed
ansible-role-nfs
- make nfs service optional
ansible-role-pxe_bootstrap
- Allow serial port to be chosen, default to ttyS0 on boot.py PXE script
ansible-role-sshd-host-keys
- renamed variable ip_address to int_ip_addr in tasks and templates. This is also reflected in the inventory file "hosts".
ansible-role-pxe_config
- renamed variable ip_address to int_ip_addr in tasks and templates. This is also reflected in the inventory file "hosts".
ansible-role-yum
- the proxy address is now given as a URL
ansible-role-fgci-bash
- log module usage to syslog
ansible-role-squid
- squid service now gets started and enabled in the task
ansible-role-users
- new feature: add / remove pubkeys to the root user
ansible-role-fgci-install
- permissions updated to 0755 for ansible-pull-script.sh
- Added some missing default variables
- Adding random sleep interval into ansible-pull-script.sh
- Replacing hard-coded delays by random time configurable by ansible_pull_sleep var
- new variable: fgci_node_type. If it is:
  - service: sets up dhcpd and copies the ansible var files and ansible-pull-script to the service/install node
  - compute: pushes the ansible-pull-script.sh to the compute node
- add variable ansible_pull_branch (defaults to production)
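The fgci-install variables named above could be set in group_vars roughly as follows. Only the variable names and the production default come from these notes; the sleep value, and the assumption that it is an upper bound in seconds, are illustrative guesses that should be checked against the role's defaults.

```yaml
# group_vars sketch for a group of compute nodes
fgci_node_type: compute          # "service" or "compute"
ansible_pull_branch: production  # default according to these notes
ansible_pull_sleep: 60           # hypothetical value; unit and exact meaning are assumed
```

Randomising the delay keeps many nodes from pulling at the same moment.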
v0.9.3 - Forest Flame Snake
Minor update.
Production branch merged with master.
General Updates
- Removed prep_local.yml - it was not safe.
- ansible-pull-script is now in https://github.com/CSC-IT-Center-for-Science/ansible-role-fgci-install
- Some IP variables moved out of group_vars into the hosts file
- compute.yml and local.yml synced (fgci-repo role missing)
New Roles
- adauth from https://github.com/mhakala/ansible-role-adauth - only added to requirements.yml for now
- pdsh-machines - creates /etc/machines and installs pdsh and dependencies
Role Updates
- network_interfaces - possibility to configure IB interfaces (specifically settings for partitions and connected_mode)
- smartd now runs on compute nodes too
- rdma - configure interfaces as Ethernet
- slurm:
  - configuration of cgroups
  - CI syntax testing
  - slurm repo version is a variable
- nhc: drain/reboot node if nvidia-smi exits with return code 255 or 17
- fgci-bash: log module usage to syslog of local node
- rsyslog: ship logs from every node with "lmod" tag to central_log_host
- arc-frontend: new variable "init_griduser_accts" which defaults to False. This needs to be set to True to create the grid user accounts.
- nis: new variable: nis_manage_nsswitch. If this is set to False, the nsswitch file is not managed. Intended for the adauth role.
- users: it's now possible to add ssh pubkey to root user too (not doing that for FGCI)
- flowdock: fixed tags.
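The two opt-in switches in the list above (init_griduser_accts for arc-frontend and nis_manage_nsswitch for nis) could be set like this in a site's variables. A sketch only: where the variables live (group_vars versus host_vars) is an assumption.

```yaml
# site variables sketch
init_griduser_accts: True     # defaults to False; True creates the grid user accounts
nis_manage_nsswitch: False    # leave the nsswitch file alone, e.g. when the adauth role is used
```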
v.0.9.2 - Elegant Egg-eater
Minor update.
Production branch has been synced up with test and master.
General Updates
- ansible-pull updates
  - the address to the install node is now templated in both local.yml and ansible-pull-script.sh
- travis/testing works again
- ansible performance tuning - thanks to @jabl for the contribution
  - more forks and "pipelining = True"
- prep_local.yml helper script is removed
Updates to roles
- ansible-role-slurm
  - address to accountinghost is now a variable - defaults to service node
- ansible-role-rsyslog on admin-node does not write remote logs into /var/log/messages anymore
- ansible-role-fgci-bash
  - a way to install only some specific *sh login scripts
- ansible-role-users
  - key options for authorized keys were added
- ansible-role-rdma
  - an option to configure some ports as Ethernet
- ansible-role-flowdock
  - also send site name in the message
- ansible-role-pxe_config
  - make CSCfi/ansible-role-pxe_config#4 thanks @jabl for the issue.
  - we are no longer making .local addresses; the names are now nodename.int and nodename-ib.int.siteDomain
- ansible-role-yum
  - always install libselinux-python - needed for machines with selinux enabled
- ansible-role-squid
  - uses handlers to restart squid on config change. CSCfi/ansible-role-squid#1
- ansible-role-postfix
  - install postfix first
New Roles
- fgci-repo
  - installs the fgci repo
- yum-cron-2
  - installs a working yum-cron configuration
  - named yum-cron-2 to not conflict with the old one
v0.9.1 - Delirious Dwarf Boa
This is a minor release, no new roles.
- collectd: remove some tags
- installing some more programs: rsync, bash-completion, wget, nfs-utils, pdsh
- some guidelines for contributions
- ansible-pull is now set by default to pull the "production" branch. This is controlled with the ansible_pull_branch variable, which can also be a tag.
- fixes to ssh host key generation and ordering
- dns role runs after dnsmasq role on the install node
v0.9.0 - Credulous Cobra
158 commits to the fgci-ansible repository since v0.3.0, 29 days ago
Updates
- Slurm 15.08.4
- All group_vars are now moved into examples/group_vars. These need to be copied to fgci-ansible/group_vars/
- Travis-ci is now running on most roles
- We started to use https://waffle.io/CSC-IT-Center-for-Science/fgci-ansible/ for managing all the github repositories and issues
- NIS user management
- We've reinstalled the test cluster many times during this period to iron out the problems
- Nodes are now configuring themselves with the help of ansible-pull.
- To make ansible-pull work there are a few new caveats
  - If there is a change to group_vars, the ansible hosts file, ssh keys or ansible-pull-script.sh, these need to be updated on the install node.
- Site.yml is updated, it's not enough to just run that to reinstall the cluster. For complete instructions for using this and all the roles - see https://confluence.csc.fi/display/FGCI/FGCI+Cluster+deployment+and+installation+guide
- Passwords are now disabled on admin accounts.
- SSH access is restricted.
- Login node is now the NAT gateway
- PDSH is now installed
- Hosts file is now updated to contain many more entries (also .local and -ib style of the compute nodes)
- Lmod is now installed
- Rsyslog improvements - it's now possible to decide with a variable which programs log to the admin node.
- Firewall rules have been cleaned up.
- Host_vars have been removed
- virt-manager is now usable on the admin node
- Many variables have been consolidated.
- resolv.conf should now point to the right addresses (it's not the same everywhere)
- Various ordering issues spotted during reinstall exercises - tasks trying to write to NFS before NFS is installed for example. These should mostly be OK now.
- sysctl/ulimits now also set on compute nodes.
- rdma/IB configured on nfs node.
New Roles
- ansible-role-nhc
  - node health checker for slurm
- ansible-role-dell
  - installs dell management tools as well as Dell System Updater - for firmware updates
  - this includes a racadm.sh script which can help somewhat in managing the iDRACs
- ansible-role-postfix
  - configures a relayhost and sets IPv4 only
- ansible-role-sshd-host-keys
  - SSH keys are now managed. There are roles for install node also in nfs.yml about the ssh keys
Tools
- tools/pullReqs.sh - galaxy installs all the ansible repos
- tools/diff_group_vars.sh - to help with checking for new updates in the examples/
- the old FGI reinstall script is now also available on the install node
v.0.3.0 - Benign Boa
Updates
209 commits to fgci-ansible repository since v0.2.0 (does not include updates in roles in other repositories)
More yum repos have been added
A few more settings have been added to the examples/ directory
The hosts file has been updated - the compute nodes now need some variables set, such as MAC address.
Slurm is 15.08.03
Travis-ci for fgci-ansible repo now spawns a CentOS7 docker container and runs various ansible tasks.
site.yml is now updated to have all except the compute nodes.
There's also an after_provision.yml - this is for roles on the admin node that need to run after the install node is up
New roles
chronological order from commits:
- ansible-role-system-limits
  - adds some settings into /etc/sysctl.d and /etc/security/limits.d
- ansible-role-arc-client
- ansible-role-yum
  - installs IB things and on compute nodes templates/overwrites /etc/yum.conf and adds proxy
- ansible-role-fail2ban
  - installed only on login-node
- ansible-role-flowdock
  - sends notification to the FGCI channel in flowdock
- ansible-role-collectd
  - sends metrics - also ipmi metrics if it's a physical machine
- ansible-role-nis
  - install node runs a NIS server. Grid, compute and login nodes are clients
- ansible-role-dhcp_server
- ansible-role-dnsmasq
- ansible-role-pxe_config
- ansible-role-pxe_bootstrap
  - the dhcp_server, dnsmasq and pxe_* roles together reinstall a node over PXE/DHCP
- ansible-role-nfs
  - setup exports
- ansible-role-nfs_mount
  - setup mounts
- ansible-role-arc-frontend
  - installs and configures an ARC CE
- ansible-role-cuda
  - installs CUDA yum repo and installs cuda
Tools
- reboot_computes.yml
- reinstall_computes.yml
v.0.2.0 - Altruistic Anaconda
This is an example of how we could make releases / release notes.
General updates
- "ansible-playbook -u myfgciusername -i hosts site.yml" now mostly works
  - usernames aren't distributed to the login-node yet, so some manual hacks are still needed
- Added travis-ci for some FGCI roles :) https://travis-ci.org/CSC-IT-Center-for-Science/
  - This passes but it doesn't actually test anything at all (not even syntax checking)
- Refactoring of group_vars/
- fgci-examples/ directory has some example files - this is where new cluster admins should copy files from to start creating the config files that ansible uses.
New roles
- admin-role-aliases
  - Configures basic /etc/aliases and runs newaliases
- ansible-role-cvmfs
  - Need to fix pointer to proxy
- ansible-role-fgci-login
  - Only installs EPEL for now
- ansible-role-rdma
  - Basically only installs it and some udev rules for /dev/infiniband/
- ansible-role-ferm-firewall
  - Not 100% sure that all these hosts should have NAT, nor that the NAT rules are good.
  - SSH blocks all out so far at least
- ansible-role-ntp
- ansible-role-rsyslog-client
  - Ships logs to a remote rsyslog server
- ansible-role-slurm
  - Defaults should not be trusted - we should set them to something better. At least think about:
    - FirstJobId
    - All the spool/state dirs - /tmp is probably really bad
  - Population of nodes / partitions is not working
- ansible-role-sshd
  - A remote role - restricts ssh access to root/admins where appropriate
- ansible-role-users
  - Adds some CSC admin-users (can be disabled)
- ansible-role-yum-cron
  - Enables auto-download cron
Updates to existing roles:
- provision_vm
  - Provisions VMs on the install node.