I recently ran into an issue where the Forefront Identity Manager (FIM) Service was stopped after a server restart, even though the FIM Synchronization Service had started. As you would expect, this caused various errors with the User Profile Service Application. I was unsure why this was happening; after all, the service is set to automatic startup. Fortunately, this was an easy issue to resolve: simply start the service on the server hosting the User Profile Synchronization Service. (Note: it is best practice to stop the service via Central Admin and then start it again.) However, starting it up and moving on was not good enough for me. I have a curious mind, and I always like to know why something is occurring. So, of course, I dug further into this and tried to reproduce the issue.
[su_row][su_column size="2/3"]Let me provide some background. I had been installing Windows Updates on my production farm. Once all the updates were installed, I had to reboot the servers. Since this was a production farm, I wanted to reboot only one server at a time to prevent downtime. So I started with the APP2 server (the one hosting the FIM services), followed by WFE2, then APP1, WFE1, and finally the SQL server. When I went through my validations, everything appeared fine. The next day, I performed my routine checks and found that the User Profile Incremental Sync timer job had been failing. That is when I noticed that the FIM Service was not started, which was causing the timer job failure. I started it back up so that the farm was fully operational again, and then I began to investigate.[/su_column]
Prior to rebooting the servers, I knew the FIM service was started. So either a Windows update had changed some of the service's settings, or the service had failed to start for some reason. I checked the service settings: it was set to Automatic, with the farm account as the "Log On As" user. This was configured the same as it had always been, so the problem wasn't due to changes to the service settings. I had to dig deeper. So, I did what every IT investigator does: checked the Event Viewer logs. In there, I found an entry that essentially said the service had failed to connect to the database. I thought this was interesting. It could not connect to the database, yet the SQL server and service were running, and I did not notice any issues anywhere else. Thinking back to the server reboots, the server hosting the FIM service was the first to be rebooted, with the SQL server being last. When I looked further back in the Event logs, I noticed that the service had started at one point after the reboot. So why was it stopped now?
So now I began trying to reproduce the issue. On my test farm, I came up with three possible scenarios for rebooting:
Scenario 1: The server hosting the FIM services is rebooted, while the SQL server is up.
Scenario 2: The SQL server is rebooted, while the server with the FIM services is up.
Scenario 3: Both servers are rebooted around the same time.
Upon testing these scenarios, I found the following results:
- In Scenario 1, the FIM services start immediately and stay running with no issues.
- In Scenario 2, the FIM service stops as soon as the SQL server goes offline.
- In Scenario 3, the results went either way, depending on how the servers came back online. If the FIM services try to start before the SQL service has started, they end up stopped. If the SQL service starts before the FIM services try to start, the FIM services start without issues.
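The outcomes above can be summarized in a small Python model. This is only a sketch of the behavior observed in these tests, not of how FIM works internally; the function name and the event names (`sql_down`, `sql_up`, `fim_host_boot`) are made up for illustration.

```python
def fim_running_after(events, sql_up=True, fim_running=True):
    """Replay reboot events and report whether the FIM service ends up running.

    Hypothetical event names:
      "sql_down"      - the SQL server goes offline
      "sql_up"        - the SQL server is back online
      "fim_host_boot" - the FIM host finishes rebooting and tries to start FIM
    """
    for event in events:
        if event == "sql_down":
            sql_up = False
            fim_running = False  # observed: FIM stops as soon as SQL goes offline
        elif event == "sql_up":
            sql_up = True  # observed: a stopped FIM service stays stopped
        elif event == "fim_host_boot":
            fim_running = sql_up  # automatic start only succeeds if SQL is reachable
        else:
            raise ValueError(f"unknown event: {event}")
    return fim_running


print(fim_running_after(["fim_host_boot"]))                         # Scenario 1: True
print(fim_running_after(["sql_down", "sql_up"]))                    # Scenario 2: False
print(fim_running_after(["sql_down", "fim_host_boot", "sql_up"]))   # Scenario 3, FIM first: False
print(fim_running_after(["sql_down", "sql_up", "fim_host_boot"]))   # Scenario 3, SQL first: True
print(fim_running_after(["fim_host_boot", "sql_down", "sql_up"]))   # original reboot order: False
```

The last line replays the production reboot order described earlier (FIM host first, SQL server last), which matches what the Event logs showed: the service started after its own reboot and was stopped later.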
After finding these results, I tried to think of the best way to prevent this issue from occurring again. There are really only two possible solutions. The first is to change the FIM services to a delayed automatic start. Unfortunately, this would not help with Scenario 2. The other is to change my reboot process to start with the SQL server (or at least to restart the SQL server before the FIM server). The FIM service would still stop when the SQL server goes down, but it would start back up when the server hosting it is rebooted afterward.
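The second option boils down to a simple ordering rule, sketched here in Python. The function name is hypothetical, and `"SQL"` is just a placeholder for the database server's real name; the other names are the servers from this farm.

```python
def plan_reboot_order(servers, sql_server):
    """Return the same servers with the SQL server moved to the front.

    Rebooting one server at a time in this order means SQL is back online
    before the FIM host reboots and tries to start its services.
    """
    if sql_server not in servers:
        raise ValueError(f"{sql_server!r} is not in the server list")
    return [sql_server] + [s for s in servers if s != sql_server]


# The order used during the update window, with the SQL server last.
original_order = ["APP2", "WFE2", "APP1", "WFE1", "SQL"]
print(plan_reboot_order(original_order, "SQL"))
# ['SQL', 'APP2', 'WFE2', 'APP1', 'WFE1']
```

This only helps if each server is fully back online before the next one is rebooted, as in the one-at-a-time process described above.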
Ever since my encounter with the FIM service not starting after a full farm reboot, I have taken extra care to reboot my servers in the right order. From time to time I still forget to reboot the SQL server before the FIM server, and end up noticing the various errors that arise when the FIM service is not running. I have also started checking that the correct SharePoint services are running after a reboot, so that I catch the issue sooner and head off any errors that might otherwise pop up. Hopefully you can learn from my mistakes, understand a little more about why this occurs, and take away some preventative actions you can begin to implement.
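A post-reboot check like this can be scripted. Below is a minimal Python sketch that asks the Windows `sc query` command for each service's state and lists the ones that are not running. The service names `FIMService` and `FIMSynchronizationService` are assumptions on my part, so confirm them in the Services console on your own server; the helper names are made up for illustration.

```python
import subprocess

# Assumed Windows service names for the two FIM services; verify on your server.
FIM_SERVICES = ("FIMService", "FIMSynchronizationService")


def parse_sc_state(sc_output):
    """Return the state word (e.g. 'RUNNING') from `sc query` output, or None."""
    for line in sc_output.splitlines():
        if line.strip().startswith("STATE"):
            # Typical line: "STATE              : 4  RUNNING"
            _, _, rest = line.partition(":")
            parts = rest.split()
            if len(parts) >= 2:
                return parts[1]
    return None


def query_service(name):
    """Run `sc query <name>` and return its text output (Windows only)."""
    result = subprocess.run(["sc", "query", name], capture_output=True, text=True)
    return result.stdout


def stopped_services(names=FIM_SERVICES, query=query_service):
    """Return the services whose reported state is anything other than RUNNING."""
    return [name for name in names if parse_sc_state(query(name)) != "RUNNING"]
```

Running `stopped_services()` on the server hosting the FIM services after a reboot returns an empty list when both are up. The script only reports the problem; starting the service is still best done through Central Admin, as noted at the top of this post.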