Installing Cloud Pak for Business Automation 21 for a customer.
In my last blog, we’d successfully got a demo system up and running.
We next started to install Cloud Pak for Business Automation for a customer. Initially we planned to install version 19, but then version 20 came along, which bundled extra IBM common services. At that point, though, there weren’t that many extra services.
OpenShift or OKD?
The customer was still working out their strategy on Red Hat OpenShift. They had two OpenShift environments but quickly realised that we couldn’t install onto them, because they had neither the licenses nor the hardware available to cope with the workload.
A decision was made to try the upstream version of OpenShift, called OKD (the open-source, community version without support from Red Hat). OKD requires no licenses, so the customer could extend the environment and create the additional worker nodes needed for the deployment.
The customer wanted to run their containers through security checking using a product from Aqua Security. The product scans for security vulnerabilities and missing fixes, and ensures there are no viruses or other malware in the containers being deployed. To do that, they wanted to deploy using their own container registry.
To achieve this, we needed to take all the containers that IBM hosts in its cloud registry and copy them into Artifactory, the customer’s on-site container registry.
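The shape of that mirroring job is simple: for every image in IBM’s registry, copy it to the same path in the local one. Here’s a rough sketch of how you might generate the copy commands with `skopeo` – the local registry host and the image reference are illustrative placeholders, and the real image list comes with IBM’s packaging for the release:

```shell
# Sketch: generate "skopeo copy" commands to mirror each IBM-hosted image
# into a local registry. LOCAL_REGISTRY and the example image reference are
# illustrative; adjust to your own Artifactory host and IBM's image list.
LOCAL_REGISTRY="artifactory.example.com/cp4ba"

mirror_cmd() {
  src="$1"                 # e.g. cp.icr.io/cp/cp4a/fncm/cpe:sometag
  path="${src#*/}"         # drop the source registry host, keep the image path
  echo "skopeo copy --all docker://${src} docker://${LOCAL_REGISTRY}/${path}"
}

# Print the mirror command for one (illustrative) image reference
mirror_cmd "cp.icr.io/cp/cp4a/fncm/cpe:latest"
```

Piping the generated commands through `sh` would perform the actual copies, assuming `skopeo` is installed and you are logged in to both registries.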
We were only deploying a small subset of the containers – concentrating on Content, not the process side, ODM (Operational Decision Manager) or Advanced Document Processing. But all of the containers have to be mirrored. And… the scripts that IBM provides for doing that mirroring only work against OpenShift!
Using the scripts, we very quickly failed to get the images into the local registry. However, one member of the technical team persevered and manually downloaded the images and populated Artifactory, as we wouldn’t be able to deploy to Test or Production without the Aqua Security checks.
We went around in circles for quite a while. Eventually, the customer said, because it’s going to be just a development environment, more like a sandbox environment (you wouldn’t put a production load on OKD), we could just use direct links into IBM Cloud.
You need a token key from IBM to get access to the containers – proof that you are licensed to run the software. You need that key whether you’re copying the images to your local registry or deploying them directly into OpenShift or OKD. But because the customer would be deploying directly into their OKD environment, the images wouldn’t pass through their local registry and its Aqua Security scanning. So they got dispensation to deploy directly into OKD, and we started the process.
The customer bought their licenses. IBM normally issues licenses through Passport Advantage – the old way of downloading the traditional setups, the .exe installs. For the containers deployed into OpenShift and OKD, however, IBM has created a registry that is accessed with a token: a secure password with lots of characters.
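For illustration, that token is typically registered in the cluster as an image pull secret. A sketch of the command involved – the secret name and namespace are my own placeholders, while `cp.icr.io` with username `cp` is IBM’s convention for its entitled registry (check the details for your own entitlement):

```shell
# Sketch: register the IBM entitlement key as a docker-registry pull secret.
# "ibm-entitlement-key" and "cp4ba-dev" are placeholder names; cp.icr.io with
# username "cp" is IBM's entitled-registry convention.
pull_secret_cmd() {
  key="$1"
  ns="$2"
  echo "oc create secret docker-registry ibm-entitlement-key --docker-server=cp.icr.io --docker-username=cp --docker-password=${key} -n ${ns}"
}

# Print the command for a placeholder key and namespace
pull_secret_cmd "<entitlement-key>" "cp4ba-dev"
```

Running the printed command against a live cluster creates the secret the deployment then references when pulling images.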
So, we started, basing what we did on the sample version 20 deployments we had done in the Insight 2 Value office. We got what’s called a CR (Custom Resource) YAML file, which contains all of the configuration – the pointers to the databases and to the customer’s Active Directory. This had all been done remotely by a colleague tasked with working with me on the deployments.
It took a few iterations, but we managed to get the deployment done. We got the database administrators to create the Oracle database and the tablespaces needed for the deployment. There were four of them – the Global Configuration Database (GCD), the Object Store, the Navigator configuration, and one for the User Management Services (UMS).
Once we had access and those tablespaces were created, the deployment could use them. The deployment actually creates the tables inside those tablespaces, fully configured – you just provide the blank databases. We also had users created for administration and for using the system, and these went into the configuration.
Changing the default settings to use less resource
To restrict the amount of CPU and memory that was going to be used, we changed the default settings.
Instead of the default of two pods for each of the different areas – Content Platform Engine, Navigator, Content Search Services (CSS) and GraphQL – we changed the configuration so there would be only one pod for each, which makes it a much lighter touch.
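As an illustration of the kind of change this involves, the replica counts sit in the CR YAML. The field paths below are indicative only – they vary between releases, so check the custom resource reference for your version before copying them:

```yaml
# Indicative only - exact field paths differ between releases.
spec:
  ecm_configuration:
    cpe:
      replica_count: 1      # default is 2
    css:
      replica_count: 1
    graphql:
      replica_count: 1
  navigator_configuration:
    replica_count: 1
```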
We discovered we could change the configuration for the Cloud Pak – the old P8 part of the system. However, we found nothing that allows you to change what’s used by IBM common services. I raised a case with IBM tech support and tech sales to find out if anything could be done. The response was: this is what it’s going to use. So whatever system you deploy, there’s not a lot you can do about the processing power IBM common services needs.
There were a couple of other interesting things. Firstly, it needs a lot of volumes. NFS is used for the volume mounts – it’s the Linux equivalent of Windows file sharing. On a Windows machine connected to a server you’ll typically see an X: or H: drive that lives on the server; NFS shares are the Linux equivalent, and the containers use them to keep their persistence. If you later upgrade to the next version of the software, the persistence is still on those volumes – the containers are self-contained and hold no configuration themselves; that all lives on the external drives/shares. When you create the pods, the volumes are mapped to certain paths inside the container so it knows where all of the configuration is held.
Initially, the customer decided that they were going to manually create all of those NFS shares. I think there were 14 of them. Then the customer went away on holiday for a week and another guy was assigned to the task. We were still having an issue and we couldn’t work out what the problem was. Then I got access to the OKD environment. Rather than the customer sharing his screen and showing me what he could see, I actually had the chance to prod around the files. I discovered that there was a typo in the YAML.
For one of the database users, the username had been pasted in rather than the password, which meant authentication was failing and stopping the deployment going forward.
The operator tries to be very clever. As I said earlier, it creates the database tables it needs and all of the configuration files, based on what you put in. You give it the database server name, the usernames and passwords, and the connection string to the server. Then it does all of the configuration just as if you were the person running the setup .exe and configuring it by hand.
It does, though, produce lots of log lines while it’s doing the deployment.
What’s occurring in the log file?
It’s a little bit of an art to try and find out what’s going wrong.
A tip is to go into the log file and search for the word “fatal”. That usually gives you an indication of what’s going wrong.
I use two methods:
1. Get the log file onto a local machine. Open it in a text editor (I use Visual Studio Code) and search for “start ICP4A”, which shows you when the operator starts its cycle.
2. Below that, search for “fatal”. That shows you where the first errors occur in the log file. The operator loops continually, so you’ll see the same error again and again.
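The two steps above can be sketched as a small shell helper, assuming you’ve saved the operator log to a file first (the deployment name and namespace in the comment are illustrative):

```shell
# Sketch: find the first failure in the operator's most recent reconcile cycle.
# First save the log locally (deployment name/namespace are illustrative):
#   oc logs deployment/ibm-cp4a-operator -n <namespace> > operator.log

first_fatal() {
  # keep only the lines after the last "start ICP4A" marker,
  # then print the first line mentioning "fatal"
  awk '/start ICP4A/ { buf = "" } { buf = buf $0 "\n" } END { printf "%s", buf }' "$1" \
    | grep -i 'fatal' | head -n 1
}
```

Because the operator loops, restricting the search to the last cycle avoids chasing stale copies of the same error.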
So that’s where I found the problem with the username and password.
A bug or maybe a feature?
I discovered an interesting bug – or maybe it’s a feature of the deployment? If a deployment fails, you can’t use that project name again.
The way OpenShift and OKD work is that you create a project (actually a namespace in Kubernetes speak, but Red Hat calls it a project). You deploy into that project, but if the deployment fails and gets into a loop it can’t self-correct, you can’t simply delete or clear the project down and reuse it. I think it’s because of the clever things the operator does in the background – adding cluster roles and all sorts of other things under the bonnet. Unless you know exactly what was created, it’s hard to clear it all down to reuse the same project.
The guy I was working with was on holiday, and his stand-in didn’t have the time to create all of the NFS shares, so we used a managed NFS client instead: a storage class that maps to a single NFS share and, within it, creates multiple volumes to use as persistent volumes in the deployment. So instead of manually creating 14 persistent volume claims, you create the one NFS share and tell the operator to deploy using that storage class.
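One community provisioner that does exactly this single-share, many-volumes trick is nfs-subdir-external-provisioner, installable via Helm. A sketch of the install command – the NFS server, export path and storage class name are placeholder values:

```shell
# Sketch: one NFS export backing many persistent volumes via the community
# nfs-subdir-external-provisioner Helm chart. Server, path and class name
# below are placeholders for illustration.
# helm repo add nfs-subdir-external-provisioner \
#   https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
NFS_SERVER="nfs01.example.com"
NFS_PATH="/exports/cp4ba"

helm_cmd() {
  echo "helm install nfs-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner --set nfs.server=${NFS_SERVER} --set nfs.path=${NFS_PATH} --set storageClass.name=managed-nfs-storage"
}

helm_cmd
```

With the provisioner running, any PVC that requests the named storage class gets a subdirectory carved out of the one NFS export automatically.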
So, we did all of that… and the system actually deployed. It took four or five elapsed weeks to get it working, with us learning a lot along the way. Much of that time was spent waiting for the database users, only to discover their passwords had expired during the long conversation about whether to use OpenShift or OKD. The long-term savings in managing and upgrading the system will more than offset the effort it took to get this far.
So there we have it. A working system on a customer site using OKD.
Next time, I will talk about demo and enterprise deployments.
If you enjoyed this blog, why not sign up below to receive an email when the next blog is published?