Transcript of "Mission decommission: parting with a platform for good"

ADAM: Good morning everyone, thank you for joining us for what promises to be a really insightful and engaging webinar, as Reuben takes us on a journey through how to part with a platform for good, and some of the lessons we’ve learned along the way.

Just some really quick introductions. I’m Adam and I’m a Client Partner at Made Tech, working under our Health division. Reuben is a Lead Developer at Made Tech but will introduce himself fully in a moment.

Some really quick housekeeping about today’s webinar. The session is going to be recorded so we can share it afterwards. There will also be an opportunity to ask questions. If you can use the Q & A box rather than the chat box, that would be brilliant. At the end of the session I will pose these questions for Reuben to answer. At the end of the session we will also be sending out feedback forms. If you could respond to these that would be greatly appreciated.

Without further ado, I will hand you over to Reuben.

REUBEN: Cool, thanks Adam. Hello. As Adam said, my name is Reuben. I am a Lead Engineer at Made Tech. Probably since you are here, you will know what Made Tech do but in case not, we are a consultancy that help the public sector to use technology to make a positive impact on society.

I have been here for the best part of a year and really enjoyed seeing that impact in action. Some of the work that Made Tech has done in the past, just in the year that I’ve been here, has been really awesome and valuable.

Personally, I like learning new things, I like understanding how stuff works, coming up with solutions to problems. Outside of work you will catch me trying to exercise some form of control over my children or watching sports on TV, and occasionally out for a run.

At work, what do I do, or what do I favour? I care about employing good practices, empowering and unblocking teams, being a servant leader in that kind of role, trying to do things right with a test-first approach. I’ve worked quite a bit on legacy transformations, but more recently, in my current role as an architect, I’ve been looking at using diagrams as code and documentation as code to facilitate architecture transparency.

That’s not really what today is about. Today is about decommissioning a platform. Some learnings from a project I recently contributed to. The agenda for today is planning for decommission – what do you need to think about, what considerations are there. We’ve got some specifics around a case study, a real-life example of what we went through, some problems we experienced.

Data was a big part of this as you will hear. Also, lessons learned and how we have applied them into the next decommissioning, as well as generally building software and what we can learn from this process.

Before we talk about the planning, I’d like to take us through a little timeline of how this whole project lifecycle evolved. So, Coronavirus happened, and the government invested in this company that Made Tech worked with and provided a service for. A results matching service that needed to be a little bit shoehorned in to fit its needs. Then Coronavirus was really taking off, it became a pandemic, so the service needed to be resilient. That wasn’t really what it was built for, so what was the answer?

Well, it was quite simply a bit of vertical scaling and moving it to the cloud – quite a tactical solution. But then Coronavirus was getting worse and more prevalent. We had some big scaling ups to do, lots of tests to process, and it wasn’t really fit for purpose. We needed to scale horizontally and have a slightly different approach to the results handling. What was the answer to that? That’s where Made Tech came on board, to help build a serverless platform.

I refer to these as platforms A, B and C. These are the terms we will use throughout the presentation. Platform B is the main one I will talk about today. A really key aspect of Made Tech’s involvement in helping to build this third, serverless platform was that it didn’t target feature parity. It was more about what it needed to provide. Platform B had far more functionality in it, far more behaviours, but we just needed to do what we needed to do for results matching coronavirus tests.

Work went on, and by about March 2021 the serverless platform had been done for LFTs, the Lateral Flow Tests. The work carried on throughout that year: PCRs were still going through Platform B, and we brought those onto Platform C. Then we started to look at how to migrate antibodies. So the three test types that we were using for Covid were on Platform B. Two of them had come across, and then we started to look at antibodies. Once that was done, the idea was that Platform B could be decommissioned.

But then in March last year, there was an announcement that the antibody tests were going to stop. Therefore, we didn’t need Platform B anymore after a certain date in July, or maybe it was the end of June. I can’t quite remember the precise date.

So, we also at that time had a team rotation on the project. The Made Tech team changed around, we had some new people come in. We had five new staff including myself come in at that time, and two of the other members had rolled off at the end of that period. So, there were quite a lot of new faces, without a lot of domain knowledge. It slipped by us a little bit that the announcement of antibody tests stopping meant we could stop the work on Platform C, but we did notice it eventually. So, we could abandon that plan, and it meant that we had a date in the future when we could stop processing antibodies on Platform B. Platform B had kind of reached end of life.

That is really what this presentation is about. Made Tech had helped to build this serverless platform. We didn’t have a lot of domain knowledge on Platform B. New team came in with not much domain knowledge, a lot of data unknowns, a lot of risks associated with it. What did we do?

Well, we carried on and then by September last year Platform B was fully switched off, with all the data requirements met. As a little bonus, in some respects, to put the decommissioning learnings to bed, we also had to decommission, or help to manage the decommission for Platform C. The NHS made a strategic move to use a new provider on their results matching services, which we will talk about a bit as well because it’s quite important when you are thinking about how we decommission.

That’s the timeline of events, hopefully it sets the context quite nicely.

So, planning the decommission, some of the things to think about. We decided to have a planning phase, a discovery kick-off and iteration, because we didn’t have that great a knowledge about the platform. We wanted to understand it more. We needed some meetings to understand what the requirements were from the stakeholders and so on. There are two kinds of users to think about with the considerations. The people that aren’t in the team, what do they need from the decommissioning? What do the stakeholders need, and what are their priorities? What does done look like for them?

There’s internal – people from the company that is currently providing the service. Same sorts of questions. What do they think done looks like? Do any of their teams depend on your packages? Have you stood something up that they are using? Are you maybe the last team using a service the company is paying for, like a VPN gateway or a CI build server, or a licence for JetBrains or something, who knows?

Then the end user is probably the most important in some ways. How are they going to be affected by decommissioning? Do they need re-routing, do they need emailing, notifying? Is the user need still there, or is a new service replacing it? So, to take that back to our example, in Platform B the user need wasn’t there anymore because there were no antibody tests. For Platform C, the new service was replacing it, so was that transparent to the public?

Also, some more data considerations around if you are giving it to the new provider, people whose data is being sent over need to be notified, and that sort of thing.

Then the second set of users, which is the team, that’s us as technologists: what did we need to know, and how are we going to get to something that we can call done? For us, and this will be different for every decommissioning really, the three main things were understanding the path to production, because we didn’t have that knowledge of the technology stack: what services are involved, what repositories get deployed, what credentials you need, environment settings, secret management tools, all that sort of thing. Dependencies, the less visible things: what dependencies are involved between the application and the infrastructure? What modules depend on something else? Is everything automated? Is everything provisioned by code, or are there some manual steps involved?

Lastly, the data restrictions. What do we do about all the data stores? What does the contract say, what does GDPR say? Do we need to keep the data, or should we keep the data? How do we meet the organisational policies? What are the organisational policies?

So, we used the discovery phase for answering a lot of those questions. Then we got the plan on the right-hand side of this slide. Our decisions on the various phases of decommissioning were firstly to save money. That was, as often is, a primary consideration for stakeholders. Also, don’t delete any data until decisions have been made. So, we have all this data, we need to decommission and turn stuff off, but we weren’t allowed, I suppose, to delete any data until we had been given the green light by the NHS and by the business.

These decisions are our decisions for this discovery but it could be different in the future. It’s about getting those priorities and understanding the considerations.

So, saving the money; unbuilding it slowly, because we didn’t want to accidentally delete any data stores; and then finally deleting everything towards the end, once any data had been transferred or copied, or a decision had been made to keep it or put it into cold storage, or whatever it might be.

This is a couple of examples of what we got out of the discovery phase. These were some of the dependencies of the modules. Lots of arrows and boxes, there’s no need to dive into the depths of it, but it is how things got deployed and what they depended on, what the blast radius of removing them was. There are dotted lines on there, which come out of the environment into more like a subscription level layer for the cloud provider.

Some of these modules called into shared infrastructure components as well. Three or four of these are all VM-based components, so we couldn’t just delete the module because that would have affected everything at the same time.

So, this is the sort of artefact we were looking at producing for us as engineers. Then we’ve got the data. We spent a long time trying to work out what all of the data dependencies were. We had a slightly more stakeholder-friendly version of this for distributing with the business and with the NHS. There are all sorts of different types of data in there. There are relational databases, document databases, persisted data. There are user action logs, vulnerability assessments. There is network security group (NSG) flow auditing, there’s outbound report storage, application logging and VM monitoring. So, a whole raft of everything you would think of, really. All of which we had to identify, explain, write down and understand, to work out at what point each could be dealt with.
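
As an illustration of how that inventory work might start – the talk doesn’t show the actual tooling, so the resource types and resource group name below are assumed placeholders – a minimal Azure CLI sketch for enumerating the data-bearing resources could look like this:

```bash
#!/usr/bin/env bash
# Minimal sketch (assumed, not the project's actual tooling): list the
# data-bearing resource types in a resource group so each one can be added
# to the data inventory. Resource group name and types are placeholders.
set -euo pipefail

RESOURCE_GROUP="platform-b-prod"   # placeholder

for type in \
  "Microsoft.Storage/storageAccounts" \
  "Microsoft.Sql/servers/databases" \
  "Microsoft.DocumentDB/databaseAccounts" \
  "Microsoft.OperationalInsights/workspaces"; do
  echo "== ${type} =="
  az resource list \
    --resource-group "${RESOURCE_GROUP}" \
    --resource-type "${type}" \
    --query "[].{name:name, location:location}" \
    --output table
done
```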

Knowing all of the dependencies, knowing what modules, what components created each of these data points, we could start to look at how to unpick the platform and take things away slowly. We also had time in our favour. There wasn’t a great urgency to do this. We knew that decisions were going to take a long time. We had a decent level of resource on the team, although we didn’t have a lot of domain experience. We treated the project as a learning experience. We had a couple of new engineers come in to understand how a decommissioning would work, how infrastructure as code works, how we could remove elements through Terraform, for example, or whatever.

These last two slides, the data and the dependency diagram, tie back to what we wanted to achieve in the discovery phase.

So, onto what we did in this example, through the decommission itself, and how we helped to manage it.

The first thing we spoke about were dependencies. We did find a couple. We weren’t sure that we had uncovered every dependency, and you never quite are, I think. We found a couple, and we didn’t want to end up like Jarmanda in this diagram, starting a fire somewhere, trying to put it out and then starting a fire somewhere else, and end up going around in circles chasing our tails.

So, once we understood the dependency we had identified, we added some regression tests to give us confidence, in an automated fashion. And also to potentially highlight any dependencies we might have missed.

For some of the team, that was a good experience of how to implement this sort of thing. We had one which was an Azure DevOps pipeline, which replicated some of the functionality we needed to retain, then it checked the resources were still there for Platform C. It gave us that knowledge that we hadn’t bitten off too much, or taken off too much at a time, and everything was continuing to work as we expected.

One of those was a storage account and one was a more infrastructure level thing.
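
To give a flavour of the kind of automated check described here – the real pipeline ran in Azure DevOps and isn’t reproduced in the talk, so the resource IDs below are hypothetical – a minimal sketch might be:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a regression check: fail the pipeline if a resource
# that Platform C still depends on has gone missing. Resource IDs are
# placeholders, not the project's real ones.
set -euo pipefail

REQUIRED_RESOURCE_IDS=(
  "/subscriptions/<sub-id>/resourceGroups/platform-c-prod/providers/Microsoft.Storage/storageAccounts/platformcshared"
  "/subscriptions/<sub-id>/resourceGroups/platform-c-prod/providers/Microsoft.Network/virtualNetworks/platform-c-vnet"
)

for id in "${REQUIRED_RESOURCE_IDS[@]}"; do
  if az resource show --ids "${id}" --output none; then
    echo "OK: ${id}"
  else
    echo "MISSING: ${id}" >&2
    exit 1
  fi
done
```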

Onto the next bit then. We put in these regression tests before we removed anything because we didn’t want to jeopardise ourselves. The next bit was to save money, right. We found out through some cost analysis that 85% of the cost of the platform was just on VMs. So, we managed to save that money super quickly, just by scaling the instances that we knew could go down to zero, to zero. One or two of them we kept at one instance because without them we would have lost access to certain other areas.

Very quickly, we had identified that these were provisioned by infrastructure as code, and we could scale them down to zero without deleting anything. Phase one was complete quite early and we got that confidence and trust from the business.
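
In the project this change went through the infrastructure as code pipeline, but as a rough sketch of the equivalent operation – the resource names here are placeholders – scaling a VM scale set to zero without deleting anything looks something like this:

```bash
# Rough sketch only; in practice this change went through the
# infrastructure-as-code pipeline rather than ad-hoc CLI calls.
# Resource group and scale set names are placeholders.

# Scale the instance count to zero; the scale set and its data stores remain.
az vmss scale \
  --resource-group "platform-b-prod" \
  --name "results-processing-vmss" \
  --new-capacity 0

# Confirm nothing is left running (and nothing was deleted).
az vmss list-instances \
  --resource-group "platform-b-prod" \
  --name "results-processing-vmss" \
  --output table
```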

Now I want to share an experience of what didn’t work so well. It would take us back to the dependency diagram. We did what we thought was a good amount of legwork on understanding the dependencies. We see here – hopefully the colours are ok here – we see the pink ones coming through as a higher level thing and then the green ones being a lower level dependency.

We would expect that removing websites isn’t going to impact messaging, because the networking and monitoring environment modules were the only things they had in common – they were shared dependencies. So, we thought that would be fine, but it wasn’t.

The reason we wanted to keep one of them was that there was a web app that worked a bit like a support tool. We wanted to keep it so that we could continue to drop in to do manual checks on data, so that we could respond to queries on the legacy data if we needed to, handle FOI requests, and check that we hadn’t inadvertently corrupted or changed the firewall on a data store, that sort of thing.

So, that was deployed into an app service that was provisioned by the websites component. We thought removing messaging wouldn’t affect this, but as it turned out there were some really convoluted behaviours in the pipeline. More specifically, one of the Octopus deployment steps ran some shell scripts that used the Azure CLI to get the public IP of the messaging load balancer, to do some authorisation. So, when there were no VMs there, there was nowhere for that request to go. Then you couldn’t log in to the support tool. We did identify that problem early, so while it was a problem, it wasn’t as big a problem as it may have been. But it is an example of how you can’t ever quite find out everything, and also of how, by doing your due diligence, putting in some regression tests and having a good test plan to build your confidence, you can identify these things sooner. So, we didn’t ever reach a phase where we had lost some data, and then we wanted to log in and couldn’t because of x, y and z.
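
The exact script isn’t shown in the talk, but the fragile step was roughly of this shape – the names are placeholders and the detail is a reconstruction:

```bash
# Hypothetical reconstruction of the fragile deployment step: fetch the
# public IP of the messaging load balancer so the release can authorise
# against it. All names are placeholders.
FRONTEND_IP_ID=$(az network lb show \
  --resource-group "platform-b-prod" \
  --name "messaging-lb" \
  --query "frontendIpConfigurations[0].publicIpAddress.id" \
  --output tsv)

MESSAGING_IP=$(az network public-ip show \
  --ids "${FRONTEND_IP_ID}" \
  --query "ipAddress" \
  --output tsv)

# Once the messaging VMs and their load balancer are gone, these lookups fail
# and everything downstream of this step (the support-tool login included)
# fails with them.
echo "Authorising against ${MESSAGING_IP:?messaging load balancer not found}"
```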

Then on to phases two and three. This was just an ongoing balancing act really. The three bits in green here were the three things that we wanted to achieve as a team. We had this guiderail of not deleting anything unduly, reducing cost, which we spoke about, and maintaining the data integrity. We wanted to make the most of the opportunity of managing the decommission here, without being super time-boxed on this project. We wanted to maximise the learning so that the team were upskilled. We had a new team, and we also had a few junior developers involved, so that we would be better prepared for the future and they could challenge themselves.

We wanted to maximise the speed, where we could, as fast as was sensible but without overshooting data decisions. That increased learning would make us quicker in the future, as hopefully we would find out with Platform C. And we wanted to reduce complexity. We didn’t want to make it harder than it needed to be.

We didn’t want to gold plate at the expense of learning or speed if we didn’t have to. So, there were opportunities and costs with that balancing act. Again, with time being on our side, we had a great chance for pairing and knowledge transfer, understanding the processes that go into decommissioning.

We had siloed knowledge on Terraform, for example, which for those that don’t know is an infrastructure as code tool. It’s got a domain-specific language associated with it. We also uncovered some technologies that no one in the team had met before. No one had used Octopus, for example, for continuous deployment and release management. None of us had used Caraf for running the applications.

When it was an individual or two who had a bit of knowledge, we spent a bit of time sharing that across, spending the time making sure we tried to do it right, rather than just deleting a resource manually, for example in the cloud. Because it stuck more closely with our principles, and it gave the team, especially our new engineers, a greater understanding of the software delivery lifecycle, and what a good code base looks like. Basically, reflecting reality. There are not many things more frustrating than code not telling the truth about the software.
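
As a sketch of what “doing it right” through the code can look like – the module name here is a placeholder, and the project’s real pipeline also involved Octopus steps not shown – a targeted removal with Terraform, rather than a manual delete in the portal, might be:

```bash
# Sketch with a placeholder module name: remove a component through the
# code so the change is reviewable and repeatable, instead of deleting the
# resource by hand in the portal.

# See what the module actually manages before touching anything.
terraform state list | grep "module.websites"

# Produce a reviewable destroy plan limited to that module.
terraform plan -destroy -target="module.websites" -out="remove-websites.tfplan"

# Apply only after review (and once the regression tests are green).
terraform apply "remove-websites.tfplan"
```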

Then the costs around this, we sometimes misplaced a bit of effort. We focused a little bit too hard on some things, so we had to make sure we came out of the rabbit hole, stuck our head above the parapet to make sure we were spending that effort correctly.

Sometimes, that hampered speed. But by doing it often we kept on track there. The best knowledge that we had at that time was that we were doing the right thing. Then this approach to unbuilding it slowly coupled with not having data decisions – which we haven’t really spoken about yet – meant it did become a bit fragmented. We could only unbuild certain aspects of each module, and we had to keep the data parts. It actually did increase complexity from time to time.

Picking up a ticket might introduce two or three new dependencies. So, we had to be a bit pragmatic about what to do there.

So, the data considerations, to try and follow on from that last point a little bit more. We’ve got this cycle that we spotted around the data. All of it starts with discovering new data, really, and all of it ends up with a more complex backlog.

I’ll talk about the relationships here a tiny bit. We didn’t have a lot of domain knowledge on this platform. For each data store we uncovered, we had to provide an analysis and recommend what we thought its importance was and how long-lived it should be. What the retention periods were, where it fitted in to the contracts and so on. Ultimately, the decisions had to be made by the NHS and the internal stakeholders. That leads to an intimidating first step. As many people probably know, when it comes to data especially, people are a bit hesitant to take the first step. We attempted to be proactive with a recommendation to get the conversation moving, but the burden of that responsibility is quite high, right? Kicking off that conversation, and getting people talking about things that no-one ever seems to be quite a hundred percent sure on, is hard.

There is a complicated decision process as well. There are a lot of people involved, a lot of potential legal ramifications around data in this decommissioning in particular. We had to be compliant with GDPR, with FOI requests. There is also the public inquiry going on in relation to Covid. So enough has to be available for that, when asked. And there are long waits. This was a two-way relationship with the decision process, but there is a lot of to-ing and fro-ing. And every time you discover more data you’ve got to make that first step again. You’ve got to feed it into the decision process. All of those things create the long waits. The long waits led us to reduced options because we had to be selective in what we could decommission. The choices became a bit limited.

We had maybe done eighty percent of the first module, and seventy percent of the next, and then sixty of the next. As the data dependencies grew, we could only actually take out a certain amount each time. Each time we reduced the scope of a story, we had a new story to go in, which was a new dependency, and that helped to bloat the backlog.

So, there was a lot of splitting into two, or splitting into three. We left a very coherent and understandable trail of destruction behind us, but it wouldn’t have needed to be there if decisions could have been made sooner.

We reached a point where there was nothing that could be turned off without deleting data, and we didn’t know what data could be deleted. So we chose in this instance to be proactive about what we could do, looking at all the types of data stores we had and deciding on the options that were available. One thing we thought about was that we could export everything and then we could delete it. If we exported all the data somewhere, then we could turn everything off and the data would be somewhere else.

We didn’t quite reach this point, but this was where our thinking was going.

We set up a series of spikes to identify methods of data transfer for each of the data stores. So, storage accounts, databases – SQL or Mongo – and Log Analytics workspaces. There were a few considerations there: what’s the type of data that is going, and how accessible does it need to be? Does it need to be readily accessible? Do you need a key to get there? Is it encrypted data? If it is encrypted data, how do you provide the decryption key? It’s no good transferring that data if nobody can access it, or if people who shouldn’t be able to access it can get there. Once we knew what the storage was and where it should go, or the options there, we started to think about how it could happen.
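
To make one of those options concrete for the storage-account case – the talk doesn’t prescribe a tool, so this is just an assumed example – a server-side copy between storage accounts with azcopy, where the URLs and SAS tokens are placeholders and access at the receiving end is exactly the kind of question raised above:

```bash
# Assumed example for the storage-account case: server-side copy of a
# container to the receiving storage account using short-lived SAS tokens.
# URLs and tokens are placeholders; who holds the keys to read the data at
# the other end is part of the accessibility question.
azcopy copy \
  "https://platformbarchive.blob.core.windows.net/results?<source-sas>" \
  "https://retentionstore.blob.core.windows.net/platform-b?<destination-sas>" \
  --recursive
```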

Some of these methods are limited by the size of the datasets, some of them are limited by security. Some of them aren’t necessarily compatible with each other. An example of this was we knew there was a party who had a way of importing .bak files for their secure databases, but Azure SQL exports .bacpac files, so we worked through a process with them to understand how they could adapt that process to a more cloud-friendly file format.
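
For the Azure SQL side of that example, an export looks roughly like the following – the server, database and credential values are placeholders – and the output is a .bacpac, which is what had to be reconciled with the receiving party’s .bak-based process:

```bash
# Rough sketch with placeholder names and credentials: export an Azure SQL
# database to blob storage. The result is a .bacpac, not a .bak, which is
# why the import process on the receiving side needed adjusting.
az sql db export \
  --resource-group "platform-b-prod" \
  --server "platform-b-sql" \
  --name "results" \
  --admin-user "sqladmin" \
  --admin-password "<sql-admin-password>" \
  --storage-key-type StorageAccessKey \
  --storage-key "<storage-account-key>" \
  --storage-uri "https://platformbarchive.blob.core.windows.net/exports/results.bacpac"
```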

Once we had outlined all of these, we knew what we could do with every piece of data, and we did get the data decisions in, so we could delete some parts without having to transfer it, and then other bits we did transfer.

I will take us on to some of our lessons learned, here. I did mention a few times about this Platform C. We tried to take some of these learnings from that couple of months to work to our advantage for Platform C. The biggest one being to try and get some data decisions as early as possible. A lot of them were re-applicable, so it was a very similar process, with very similar data stored around patients and results and so on. So, the pattern was similar, which was helpful for getting decisions soon. We did start the Platform C decommissioning with the data decisions beforehand, which meant that we could tackle it in a more modular way.

We also had that knowledge from the spikes about how to transfer, so we knew, for the ones that we needed to keep, how we were going to do that. And the team had been through the process of decommissioning, so that was a lot easier for them the second time around.

I actually rolled off the project before the Platform C decommissioning, so I was really encouraged to see how ably they took on that process of managing the decommission. So, we had a smoother ride of it, the bloated backlog didn’t happen because we didn’t discover new data stores. We knew everything upfront; it was a really streamlined, lean process.

These are a few boxes on things to think about, I suppose – things that we learned here that we will be asking next time. Requesting data decisions in advance, we’ve spoken about that a lot already, but it helps with planning. Exploring dependencies as early as possible. I touched on that; there are technical dependencies obviously, but there are non-technical or non-code dependencies as well, around authorisation or access to tools, for example. We didn’t have Octopus logins to start with. There were areas of the cloud platform that we had to export data to that we didn’t have subscription-level permission to access. Then there is exporting data upfront. I think that’s an option if your data stores are really tightly coupled with your application and you need to turn it off. You could think about exporting it all and then managing the application turning off separately.
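
One cheap way to surface the access side of those dependencies early – this is an assumed example rather than something from the talk – is to list what roles the deployment identity actually holds before the work starts:

```bash
# Assumed example: check which role assignments the deployment identity
# holds, so missing permissions (destination subscriptions, and so on)
# show up during planning rather than mid-decommission.
# The assignee ID is a placeholder.
az role assignment list \
  --assignee "<pipeline-service-principal-id>" \
  --all \
  --query "[].{role:roleDefinitionName, scope:scope}" \
  --output table
```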

Undertaking the discovery, obviously we talked about that today. I think proper planning can really help with a successful delivery. It is the bit toward the start really; identifying user needs and the most important aspects of the decommission provides really good guiderails for the work. Not explicitly on here, but it helps to build trust with the business as well. You know what you’re talking about, you provide a plan, you set it all out.

Sharing knowledge as early and as often as possible, especially with unknowns. I think this is especially important when the learning curve is steep. We were uncovering a lot around the domain, quite a bit around the technologies involved. Also, if one person knows it, you want to share that as soon as possible because if they are ill, or have holiday booked or whatever, then it becomes a problem.

I think also this is especially important if the user needs are being replaced by a new service. This was the case in the Platform C decommission, about sharing that knowledge and those requirements, and what happens to everything early.

In this example for Platform C, there was a dual running period. While we were doing Platform B, we were working with the new provider for Platform C about what requirements there were, and how they should be thinking about managing the data and what we needed to send.

Working in the open. Again, it’s similar to sharing knowledge, but being transparent on the goals and on the progress. Keep the business involved. We had a brilliant BA working with us who had real trust in the team, really confident in our ability. He could go to the business comfortable and confident that it was being managed well. Highlight blockers, keep documentation up to date. One thing that did happen a bit in the Platform B decommission was copies of copies of copies of Excel documents. We probably all felt that pain, I expect.

Do what it takes to give you confidence. In this case we built the regression tests. It really empowered the team, especially the more junior members, to have that confidence, to let them look forwards rather than look backwards all the time. If we got a notification that a build had failed, or that the automated pipeline had failed, with the regression testing in place we knew we had to look at rectifying it.

Giving thought to decommissioning when building. We’ve got another slide on this just coming up, as our big thing for building services in the future. Separation of concerns makes people’s lives easier when the time comes to switch off a service. When you are building a deployment pipeline, don’t skip building a tear-down pipeline as well.
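
As a sketch of what pairing a tear-down with the deployment pipeline could look like – the environment handling here is assumed, not taken from the project – something as simple as this goes a long way:

```bash
#!/usr/bin/env bash
# Assumed sketch of a tear-down entry point to sit alongside the deployment
# pipeline: the same code that builds an environment should be able to take
# it away again. The workspace-per-environment convention is a placeholder.
set -euo pipefail

ENVIRONMENT="${1:?usage: teardown.sh <environment>}"

terraform init
terraform workspace select "${ENVIRONMENT}"

# Produce a reviewable destroy plan for everything in the environment's
# state, then apply it.
terraform plan -destroy -out="teardown-${ENVIRONMENT}.tfplan"
terraform apply "teardown-${ENVIRONMENT}.tfplan"
```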

This takes us to the final slide. The key takeaway that I have taken and tried to share with people at Made Tech in particular. You might be sitting, standing, or watching here today, working on building and improving a live service, and think turning it off is someone else’s problem in x years’ time. It probably will be, but imagine for a moment that it might be you that needs to do it in five years’ time – you’re still there, but you’ve been doing something else for five years. Would you have all those answers? Would you remember everything? Do you know how to reset that particular thing? Are there pockets of code in there that you can’t wait to forget about because it’s a bit of a spider web, or there are massive dependencies inside it? So, an easier life for future you, is what that is about.

I think also that decommissioning isn’t necessarily the same as retiring, or as large scale as we tend to think, traditionally. I’ve come to think that it can be quite small fry as well. It can be replacing a service or a module, or using a new integration. We still need to plan for how we can switch those in and out with ease.

If you’ve got a bunch of integrations in either direction, people integrating with you or you integrating with others, how hard would it be for you, or worse for someone else, really, to try and turn some aspect of that off?

So, less coupling. If we have decommissioning in mind, and separation of concerns in mind, we can make the codebase more modular. One of the engineers on the team was really passionate about separating the data from the application pipelines, for example. It will lead to the codebase being a bit more modular.

There is a cost sensitivity thing here. We always want to be aware of costs. If we make something easy to turn off it’s going to save the business money, and you’re going to get a pat on the back, or maybe a pay rise, we’ll see. But it saves time. If it’s easier to do it will be quicker to do. If it’s quicker to do it saves money. As well, obviously, as all of the costs associated with running the software itself.

It also gives a bit more flexibility. Technology is always changing, always becoming better. It’s kind of a living organism. So, if you want to change your repository layer, you want to use a new data store, you want to change your document generator or your library – swapping things in and out with decommissioning in mind, changing elements of your integrations – it makes your life a bit easier. You are a bit more supple.

I think that is my last slide. I think we are ready for questions and comments. Do you want to come back in, Adam?

ADAM: That was excellent, thanks for that. I have certainly learned an awful lot in the last half hour or so. Yes, there’s so much more to consider in decommissioning than I first thought. We do have some questions, so are you ok to be in the hotseat for a couple more minutes, Reuben?

One of the questions is, “What would you say is the biggest challenge in this project, thinking about the balancing act slide, and all of the things that had to be considered?”

REUBEN: I think the biggest challenge really was nailing down the data decisions. In some respects for what we wanted to achieve, it was helpful that they took time because we wanted to treat it as a learning experience. But it was quite tricky to pick the right work to do, without knowing what those decisions were. Also, for me personally, I mentioned it a few times, but we had a fairly green team. Making sure they had the right level of autonomy and the right level of support. I can’t ask them to alter an Octopus deploy pipeline if they haven’t got access to Octopus. So that kind of aspect was also quite tricky for me. But overall, the data was the hardest part.

ADAM: That makes sense. There were a couple of other questions as you were speaking. Planning is a crucial part of the success of this project, and I know there were lots of learnings along the way. What would you say was the biggest challenge whilst planning the project, as well?

REUBEN: I think planning-wise, it was just finding that time to have those discussions with the internal stakeholders, and understanding what good looked like for them. I identified those two groups, the users that aren’t us and the users that are us. With the discovery in our example, the public was quite an easy one because the NHS was doing all that correspondence, it wasn’t a service anymore.

I didn’t have all the relationships with the stakeholders internally to know what they wanted, so we had to build those quite quickly. I had a very definite view of what I wanted because I didn’t know how we deployed the code. I was really interested in the path to production, and identifying the dependencies. Hopefully, that answers it.

ADAM: Absolutely. Final question – I really liked your slide around building to decommission. Would you say that taking this approach helps promote overall best practice development?

REUBEN: Definitely, yes. I can’t see it going against any principles that I know of, of good software development. I have been in the situation myself where it is not like that. I think that’s one of the things that opened my eyes a bit during it. How do we make decommissioning easier for people in the future? We don’t know how long something is going to last. I don’t think it goes against any good practices, I think it’s always going to be good, and it lends itself nicely to modularisation and cohesive, decoupled code.

ADAM: Brilliant. That’s excellent. Again, thanks very much Reuben. We are going to be wrapping up the webinar now. If there are any further follow up questions or comments that you have, please feel free to reach out to us, either via our marketing email, LinkedIn or we’ve got lots of information on the website.

Just a reminder that we will be sending out a feedback form as well. If you could please respond to this with any feedback you have from today’s session, that would be greatly appreciated. Thank you very much.
