The first step in this project was figuring out how we were going to reimage the devices in the stores from Windows to Linux. We wanted a unified approach that would work for every store and wouldn't allow the people performing the installs to modify it. We settled on using PXE to launch a network-bootable version of Clonezilla.
Once this decision was made, we created the system images that we wanted to place on the systems and verified that the PXE boot process would work and that Clonezilla would re-image a system properly. Once that was successful, I decided to automate the PXE process even further. The PXE protocol lets you control which boot configuration a given system receives. Because we had many different models of hardware, we had to make sure the image built for a given model was the one applied to it. I created a PowerShell script that would scan the network and save the MAC addresses of all the devices it found. Using those MAC addresses, I added a function to the script that modified the PXE configuration files so that Clonezilla would launch with settings specific to each model. These configurations also started the imaging process automatically once Clonezilla had loaded on the system.
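To give a sense of how that worked, below is a minimal sketch of the idea. It assumes a PXELINUX-style server, where a machine-specific config file named after the MAC address (01-aa-bb-cc-dd-ee-ff) takes priority over the default; the share path, model lookup, and Clonezilla boot arguments are illustrative, not the exact ones we used.

    # Sketch: write a model-specific Clonezilla PXE entry for each discovered MAC address.
    # Assumes a PXELINUX-style config directory; paths and boot arguments are illustrative.
    $pxeRoot = '\\pxeserver\tftpboot\pxelinux.cfg'    # hypothetical TFTP share

    # Pull MAC/IP pairs from the ARP neighbor cache on the scanning machine.
    $neighbors = Get-NetNeighbor -AddressFamily IPv4 |
        Where-Object { $_.State -ne 'Unreachable' -and $_.LinkLayerAddress }

    foreach ($n in $neighbors) {
        # PXELINUX looks for a per-machine file named 01-<mac, lowercase, dash-separated>.
        $macFile = '01-' + ($n.LinkLayerAddress.ToLower() -replace '[:-]', '-')

        # The model lookup is illustrative; in practice it came from our inventory data.
        $model = 'store-register-model-a'

        $configLines = @(
            'DEFAULT clonezilla',
            'LABEL clonezilla',
            "  KERNEL clonezilla/$model/vmlinuz",
            "  APPEND initrd=clonezilla/$model/initrd.img boot=live ocs_live_run=`"ocs-sr -batch restoredisk $model-image sda`""
        )
        Set-Content -Path (Join-Path $pxeRoot $macFile) -Value $configLines
    }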
During testing, we found that our store systems needed to be on a specific BIOS version that exposed the PXE settings. Luckily, our systems were newer UEFI machines whose firmware settings were exposed through Windows WMI, which allowed us to change BIOS settings using just PowerShell. We could also download the correct BIOS version and use PowerShell to install it silently.
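The exact WMI classes for firmware settings vary by manufacturer, so the snippet below is only a sketch of the approach, using Lenovo's publicly documented classes as an example; the setting name and the silent-install switch for the BIOS updater are illustrative.

    # Sketch: change a firmware setting from PowerShell through the vendor's WMI interface.
    # Class names differ by manufacturer; Lenovo's classes are used here as an example.
    $setSetting  = Get-WmiObject -Namespace root\wmi -Class Lenovo_SetBiosSetting
    $saveSetting = Get-WmiObject -Namespace root\wmi -Class Lenovo_SaveBiosSettings

    # Enable network/PXE boot (the setting name is illustrative and varies per model).
    $setSetting.SetBiosSetting('NetworkBoot,Enable') | Out-Null
    $saveSetting.SaveBiosSettings() | Out-Null

    # The BIOS update itself was a silent run of the vendor's flash utility,
    # something along the lines of (the switch is illustrative):
    # Start-Process -FilePath 'C:\Staging\BIOSUpdate.exe' -ArgumentList '/s' -Wait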
Now that PXE was working, we needed to work on deploying our resources out to the store systems. Fortunately, every store had one system that was not going to be migrated to Linux. Figuring out each store's hardware inventory was another issue. Our deployment software, Tanium, is great for getting information about individual systems, but not for grouping them the way we needed. My solution was to reuse the model-identification logic from my PXE script to determine which images should be downloaded to the PXE system. Using PowerShell, I created a script that identified which images needed to be downloaded and handled the downloads themselves, including logic for dealing with connection issues. This allowed the script to keep running and resume a download from where it left off, which saved us from having to restart downloads completely whenever something happened to the network connection.
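A minimal sketch of that resume logic is below, assuming the image files are served over HTTP(S) from a server that honors Range requests; the URL, destination path, and retry interval are hypothetical, and details such as verifying a finished file against a manifest are omitted.

    # Sketch: a download that appends to an existing partial file instead of starting over.
    # Assumes the server honors HTTP Range requests; URL and paths are hypothetical.
    function Get-FileWithResume {
        param([string]$Url, [string]$Destination)

        $existing = 0
        if (Test-Path $Destination) { $existing = (Get-Item $Destination).Length }

        $request = [System.Net.HttpWebRequest]::Create($Url)
        if ($existing -gt 0) { $request.AddRange($existing) }   # pick up where we stopped

        $response  = $request.GetResponse()
        $inStream  = $response.GetResponseStream()
        $outStream = [System.IO.File]::Open($Destination, 'Append', 'Write')
        try {
            $buffer = New-Object byte[] (1MB)
            while (($read = $inStream.Read($buffer, 0, $buffer.Length)) -gt 0) {
                $outStream.Write($buffer, 0, $read)
            }
        }
        finally {
            $outStream.Close(); $inStream.Close(); $response.Close()
        }
    }

    # Keep retrying on connection problems; each attempt resumes from the partial file.
    $finished = $false
    while (-not $finished) {
        try {
            Get-FileWithResume -Url 'https://images.example.com/store-register-model-a.zip' `
                               -Destination 'C:\Staging\store-register-model-a.zip'
            $finished = $true
        }
        catch { Start-Sleep -Seconds 30 }
    }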
Tanium also had a limitation: it couldn't natively track the download progress from this script. My solution was to have the script calculate the percentage of the download completed and save that value in a JSON file, and then configure Tanium to read that file. From that we could track the progress of downloads. During this time, we were gathering data on how long the downloads were taking to complete. The minimum store configuration was around 75 GB, so there was a lot of data to transfer; even on our corporate network it could take over an hour and a half. Some stores have very poor internet connections, and this was a huge concern. After doing some research, I tested creating a CDN in our Azure cloud. All our files are stored in Azure, and integrating a CDN drastically cut down on the download time: from our corporate network it went down to about 20 minutes in total. The one challenge was that, because the CDN caches the files to make them faster to download, we would need to flush the CDN cache whenever we updated the files. Flushing the cache was not always immediate, so we had to plan out when we would make changes.
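For the progress tracking, the script simply wrote a small status file that a custom Tanium sensor could read back; the field names, paths, and expected size below are illustrative.

    # Sketch: summarize download progress in a JSON file for a Tanium sensor to read.
    # The 75 GB figure, paths, and field names are illustrative.
    $expectedBytes   = 75GB
    $downloadedBytes = (Get-ChildItem 'C:\Staging' -Recurse -File |
                        Measure-Object -Property Length -Sum).Sum

    [pscustomobject]@{
        PercentComplete = [math]::Round(($downloadedBytes / $expectedBytes) * 100, 1)
        BytesDownloaded = $downloadedBytes
        LastUpdated     = (Get-Date).ToString('s')
    } | ConvertTo-Json | Set-Content -Path 'C:\Staging\progress.json'

The CDN cache flush itself can also be scripted (Azure exposes a purge operation for CDN endpoints), but because a purge takes time to propagate, we still had to plan changes rather than push them on demand.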
After getting the download and reimaging process to work reliably every time, we started documenting the install process for the people who would be doing the actual installs. In the end, I was able to create an instructions document that almost anyone with a little hands-on PC knowledge could follow. The process had been streamlined to the point that the person doing the migration really only had to check that certain milestones in the document were happening. There were screenshots for every menu and literally every step, so the user could quickly see how things should look. As with every project, issues did crop up. If it was something with the process, we corrected it and updated the instructions. If it was an odd issue that was hard to replicate, we noted down everything we could about it and added it to a section for potential issues and solutions. We never ran into an issue that we couldn't solve or work around easily, because we had done so much work beforehand to make the process as simple as possible.
My first step in this project was to upgrade the test stores that we had access to. I first made sure to take system images of the Windows 7 systems so that I could reset them to a known good configuration. Then I went through the process of manually upgrading the operating system by downloading the current Windows 10 ISO and running through the GUI to perform the upgrade. Once I knew that the upgrade worked, I looked into starting the upgrade from the command line. After researching the proper switches for the upgrade, I tested those as well.
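The exact command line isn't reproduced here, but an unattended in-place upgrade launched from the extracted Windows 10 media looks roughly like the following; these are standard setup.exe switches, though the combination we settled on may have differed.

    # Sketch: kick off an unattended in-place upgrade from the extracted Windows 10 media.
    # These are standard setup.exe switches; the exact set used in production may have differed.
    & 'C:\Win10Media\setup.exe' /auto upgrade /quiet /noreboot /dynamicupdate disable /compat ignorewarning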
After getting the initial upgrade working and somewhat automated, I needed to address the post-upgrade tasks. Initially, after the upgrade, Windows went through the OOBE (Out-of-Box Experience). This required a lot of manual work to get through, which would not have been acceptable if we wanted to keep help desk call volume down. Looking further at the operating system post-upgrade, certain OS settings had been changed that affected the systems to the point where our in-house POS software would not function. Researching the upgrade process further, I found two places where I could insert PowerShell scripts to automate changing the OS settings back to what we required.
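Windows Setup exposes hooks for exactly this kind of work, for example the /PostOOBE switch or a SetupComplete.cmd script that runs after the upgrade finishes. The snippet below is a sketch of the sort of settings script that would be wired into such a hook; the specific registry value and service are examples, not the actual settings our POS software required.

    # Sketch: the sort of post-upgrade settings script a hook like /PostOOBE would run.
    # The registry value and service below are examples, not our actual required settings.

    # Re-apply a policy value that an upgrade can reset.
    $key = 'HKLM:\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU'
    if (-not (Test-Path $key)) { New-Item -Path $key -Force | Out-Null }
    Set-ItemProperty -Path $key -Name 'NoAutoRebootWithLoggedOnUsers' -Value 1 -Type DWord

    # Make sure a service the line-of-business software depends on starts automatically.
    Set-Service -Name 'Spooler' -StartupType Automatic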
Testing this automation was successful, and I was able to fully upgrade a system with little to no manual intervention. The next step was figuring out how to deliver the ISO to the systems remotely. Fortunately, we had a tool called Tanium. This tool has many features, but the ones that were most helpful were its ability to gather system information from all the PCs and its ability to push files and run scripts remotely. To prepare, I copied all the files from the Windows 10 ISO into a zip file that also contained all the scripts I needed. Once that was prepared, we could upload the zip file to our Tanium server.
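Building that package is straightforward in PowerShell; the sketch below mounts the ISO, copies its contents alongside the scripts, and zips everything up. Paths are illustrative, and note that Compress-Archive in Windows PowerShell has a documented per-file size limit of about 2 GB, so a large install.wim may require a different archiving tool.

    # Sketch: package the upgrade media plus scripts into one zip for Tanium to distribute.
    # Paths are illustrative. Compress-Archive in Windows PowerShell caps individual files
    # at roughly 2 GB, so an install.wim over that size needs another archiver.
    $iso = 'C:\ISOs\Win10_x64.iso'
    New-Item -ItemType Directory -Force -Path 'C:\Build\Win10Media' | Out-Null

    Mount-DiskImage -ImagePath $iso
    $vol = Get-DiskImage -ImagePath $iso | Get-Volume
    Copy-Item -Path ("{0}:\*" -f $vol.DriveLetter) -Destination 'C:\Build\Win10Media' -Recurse
    Dismount-DiskImage -ImagePath $iso

    Copy-Item -Path 'C:\Build\Scripts\*' -Destination 'C:\Build\Win10Media' -Recurse
    Compress-Archive -Path 'C:\Build\Win10Media\*' -DestinationPath 'C:\Build\Win10Upgrade.zip'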
Once all the files and scripts were set, we tested the whole upgrade process through Tanium in our test stores. Our test stores are at the corporate office for Little Caesars and have gigabit internet speeds, which allowed us to test multiple times very quickly to see how repeatable and reliable the process was. After successful testing in the test stores, we used corporate stores located relatively close to the office for our first real-world test; if something went wrong, we could quickly respond to the store and make sure it was up and running before the store opened. Our first set of stores were all successful and didn't require us to go on site. There were some things, like system settings, that we didn't catch during the development of this process that did need to be changed, but that was easy enough to do remotely, and we could simply update the post-upgrade scripts to take care of it.
Once the testing phase was complete, our next task was to figure out a schedule for tackling the 15,000+ systems that needed to be upgraded. One thing we had to take into account was the bandwidth at the stores for downloading the 5 GB zip file for the upgrade. Some stores were running on internet connections as slow as 5 Mbps, and we also didn't want to take up any bandwidth during store hours, so we scheduled the downloads to take place only after store hours. Once the files were staged to most of the systems, we needed to decide which systems to upgrade per deployment. Since there were multiple systems in each store that needed to be upgraded, we decided to do only one device per store per day, and only about 500 stores maximum per day. This gave us and the stores a buffer: if the one system had an issue during the upgrade, it would not affect store operations, and if there happened to be a widespread issue on a given day, the total damage would be minimal and the help desk would have the capacity to respond.
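As a rough sanity check on those numbers: 5 GB is about 40,000 megabits, so on a 5 Mbps connection the transfer alone takes roughly 8,000 seconds, a little over two hours, before any protocol overhead or competing traffic. That is exactly why the staging had to happen outside store hours.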
After rolling the upgrade out this way, we completed the 15,000+ systems in about three months. Originally, management wanted to image hard drives with Windows 10 and have contractors go to each store and simply replace the drives. That would have taken much longer than the development and deployment time of my remote process and would have cost much more money. With my process, Little Caesars saved over 3 million dollars. Additionally, we did not have to rely on third-party contractors who would have required training and would have disrupted the stores during operating hours.