Building a Highly Available Splunk Deployment Server Cluster (POC)

🌸 1. Overview

In many Splunk environments, the Deployment Server (DS) is an essential part of the setup, even though it’s often overlooked. While it doesn’t directly handle data like indexers or search heads, it plays a key role in keeping everything organized and running smoothly.

You might have heard the saying:

“If the DS is down, nothing breaks immediately.”

While this is somewhat true, having the DS down for an extended period can cause some real headaches. For instance, if we change client configurations directly on the clients, or point them at another DS that was never set up as part of a cluster, the original DS won’t recognize the new applications, checksums, or changes made in the serverclass.conf file. It may then revert all the configurations we’ve worked hard to put in place.

To help tackle these common challenges, a big question arises:

Can we create a redundant Deployment Server setup that is simple, transparent, and predictable?

This proof of concept (POC) dives into that question and gathers insights from the official Splunk documentation on setting up a Deployment Server Cluster, with a few tweaks of our own. 🪄

Friendly reminder: If you’re not familiar with Keepalived, it might be helpful to do a little research first or check out another similar option. We’ll be using a VIP in the deploymentclient.conf for seamless operation. And if you want to know more about Keepalived, check out my friend’s blog series on Understanding Keepalived for High Availability.

If you’re interested in setting up a solid redundancy structure for your Splunk environment, you might want to check out my previous blog series on Splunk Cluster Manager Redundancy. This blog is designed to help you create a basic Distributed Clustered Deployment (M2) with cluster manager redundancy. Plus, I’ve included some handy configuration files for you to reference along the way.


2. Deployment Server Cluster

Since the release of version 9.2, Splunk Enterprise has supported running multiple deployment servers against shared directories: deployment-apps (the deployable apps themselves) and client_events (per-client status and checksum events).

To get your deployment server (DS) cluster up and running, the client_events and deployment-apps directories must be shared between all deployment servers. The simplest approach, as the documentation explains, is to use the same mount points for these directories on every DS.

Once everything is shared and set up correctly, your deployment servers will be able to work together to manage the clients smoothly, and passive deployment servers can view the status of active clients through the same files and directories.
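
Conceptually, the layout this POC builds looks like the sketch below; the paths match the steps later in this post:

   # One shared volume, bind-mounted to the same Splunk paths on both DS1 and DS2
   /mnt/shared_apps/deployment-apps  ->  /opt/splunk/etc/deployment-apps
   /mnt/shared_apps/client_events    ->  /opt/splunk/var/log/client_events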


🗂️ 3. Environment Overview

Welcome to my proof of concept (POC)! The aim here isn’t to create a full-blown enterprise-grade load balancer solution. Instead, we’re keeping it simple: two deployment servers behind a Keepalived-managed VIP, backed by a shared directory.

In a nutshell: We want our agents to always communicate through the same address, and we should be able to lose one Deployment Server without any hiccups.

To keep things realistic yet easy to manage, I’ve put together the following POC architecture for you.


3.1. Prerequisites

Before we dive in, please note that this setup is designed for macOS (ARM). If you’re using a different operating system, you might need to tweak a few steps.

For example, we’ll be using UTM for emulation, but other VM software can be handy if you’re on Windows.

Here’s what you’ll need, laid out in the architecture below:

Figure 1. Splunk POC architecture.

You can find all the configuration files and apps I mention in this post in the GitHub Repository. Here’s a simplified look at the repository structure:

splunk_ds_ha/
├── README.md
├── s1_configurations/
│   ├── etc_system_local/
│   │   ├── inputs.conf
│   │   └── server.conf
│   └── etc_apps/
│       └── poc_all_deploymentclient/
├── ds1_configurations/
│   ├── etc_system_local/
│   │   ├── server.conf
│   │   ├── serverclass.conf
│   │   └── user-seed.conf
│   └── etc_apps/
│       ├── poc_all_search_base/
│       ├── poc_all_search_outputs/
│       └── poc_full_license_server/
├── ds2_configurations/
│   ├── etc_system_local/
│   │   ├── server.conf
│   │   ├── serverclass.conf
│   │   └── user-seed.conf
│   └── etc_apps/
│       ├── poc_all_search_base/
│       ├── poc_all_search_outputs/
│       └── poc_full_license_server/
└── deployment-apps/
    ├── poc_all_deploymentclient/
    ├── poc_all_indexes/
    └── poc_all_search_base/


All the configuration files referenced here are stored in that repository. I’ve kept things concise so that the blog stays engaging.

Now, let’s roll up our sleeves and start building something awesome! 🌸

3.2. Setup: step by step

Here’s a high-level flow of what we’ll be doing in this POC. Each step is designed to be straightforward and easy to replicate. Let’s get started!

  1. Let’s get started by installing Splunk Enterprise on your local host!
    • First off, go ahead and install Splunk Enterprise.
    • Once that’s done, you’ll want to copy all the relevant apps and configurations from our GitHub Repository. Just be sure to change the placeholder targetUri in the deploymentclient.conf file to the correct VIP on your UTMs (see the sketch at the end of this step).
    • For the configurations, take the files from the s1_configurations/etc_system_local directory and place them in $SPLUNK_HOME/etc/system/local/. Then, move the directories from s1_configurations/etc_apps to $SPLUNK_HOME/etc/apps/.

              splunk_ds_ha/
              ...
              ├── s1_configurations/
              │   ├── etc_system_local/
              │   │   ├── inputs.conf
              │   │   └── server.conf
              │   └── etc_apps/
              │       └── poc_all_deploymentclient/
              ...
      
      
    • Start Splunk and verify that it is fully operational. This instance will later act as a deployment client.
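
    For reference, the poc_all_deploymentclient app boils down to a deploymentclient.conf like the sketch below. The stanza names are the standard ones from the deploymentclient.conf spec; the VIP placeholder is whatever address you give Keepalived later:

       # poc_all_deploymentclient/local/deploymentclient.conf
       [deployment-client]

       [target-broker:deploymentServer]
       # All clients talk to the VIP, never to an individual DS
       targetUri = <DS-VIP-PLACEHOLDER>:8089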
  2. Let’s create a shared directory (shared_apps) on the host!

    • First things first, we need to set up a shared directory that both of our deployment servers can access.
    • Quick Reminder: Make sure to adjust the permissions so the splunk user can read and write to it.
    • Inside this shared directory, let’s create two folders:

         shared_apps/
         ├── deployment-apps
         └── client_events
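
    On the macOS host that can be as simple as the line below (the location is an assumption on my part; put it wherever suits you):

         mkdir -p ~/shared_apps/deployment-apps ~/shared_apps/client_events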
      
  3. Now, let’s set up two Ubuntu VMs in UTM

    • Go ahead and create two Ubuntu 24.04.3 VMs, naming them DS1 and DS2.
    • In UTM, head over to Settings > Sharing and configure it to point both VMs to the same shared directory (shared_apps) on the host.
    • Once we start the VMs, we’ll mount the shared directory on both DS1 and DS2 by running these commands:

       # Create the mount point and register the UTM 9p share in /etc/fstab
       sudo mkdir -p /mnt/shared_apps
       echo "share /mnt/shared_apps 9p trans=virtio,version=9p2000.L,rw,_netdev,nofail,auto" | sudo tee -a /etc/fstab
       sudo systemctl daemon-reload
       sudo mount -a
       # Confirm the share is mounted, then hand ownership to the splunk user
       df -h
       sudo chown -R splunk:splunk /mnt/shared_apps
      
    • Finally, let’s check the /mnt/shared_apps directory together to ensure it’s visible and that you have the necessary read and write permissions.
  4. Set up splunk users and install Splunk (Don’t start yet!).

    First off, let’s get Splunk installed! You can choose between the tgz or package installation methods. Depending on your choice, you may or may not need to create a splunk user.

    On both DS1 and DS2:

    • If you’re using the tgz method, create the splunk user first:
             sudo useradd -d /opt/splunk splunk
             sudo usermod -s /bin/bash splunk
    
    • Install Splunk Enterprise 10.0.2 using either the tgz or package method.
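
    For the tgz route, the install is roughly the following; treat the tarball name as a placeholder for whatever you downloaded from splunk.com:

             # Extract under /opt and hand the tree to the splunk user
             sudo tar -xzf splunk-10.0.2-<build>-<arch>.tgz -C /opt
             sudo chown -R splunk:splunk /opt/splunk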

    Important: Please do not start Splunk just yet! If you do, it might create some files and directories that we’ll need to clean up later for the deployment server (DS) cluster configuration.

    Troubleshooting Tip: If Splunk starts too early, it can generate unwanted files in the following locations, which you can clean up at this step (see the commands after this list):

    • $SPLUNK_HOME/etc/deployment-apps/_splunk_ds_info/
    • $SPLUNK_HOME/var/log/client_events/
    • It may also create the file $SPLUNK_HOME/var/run/serverclass.xml.
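
    If you do need that cleanup, it’s roughly the following (run with Splunk stopped, as the splunk user):

             rm -rf /opt/splunk/etc/deployment-apps/_splunk_ds_info/
             rm -rf /opt/splunk/var/log/client_events/
             rm -f /opt/splunk/var/run/serverclass.xml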
  5. Bind shared directories into Splunk paths.

    As we move forward, note that the /opt/splunk/var/log/client_events/ directory won’t exist yet, and that’s totally okay!

    On both deployment servers:

    • Bind deployment apps:
        sudo mount --bind /mnt/shared_apps/deployment-apps /opt/splunk/etc/deployment-apps
      
    • Bind client events:
        sudo mkdir -p /opt/splunk/var/log/client_events
        sudo mount --bind /mnt/shared_apps/client_events /opt/splunk/var/log/client_events
      

    Now, copy the deployment-apps/ directory from the repository directly into the shared_apps/deployment-apps/ directory.

    Important: Make sure to update any placeholder URLs in the files with the correct ones.
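
    One more note: plain mount --bind calls don’t survive a reboot. If you want the binds persistent, an /etc/fstab approach like the one below should work; this is my addition, not something the Splunk docs require:

        # /etc/fstab
        /mnt/shared_apps/deployment-apps  /opt/splunk/etc/deployment-apps    none  bind  0  0
        /mnt/shared_apps/client_events    /opt/splunk/var/log/client_events  none  bind  0  0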

  6. Apply DS-specific configurations.

    Head over to the GitHub Repository and find the configurations for your deployment servers:

          splunk_ds_ha/
          ...
          ├── ds1_configurations/
          │   ├── etc_system_local/
          │   │   ├── server.conf
          │   │   ├── serverclass.conf
          │   │   └── user-seed.conf
          │   └── etc_apps/
          │       ├── poc_all_search_base/
          │       ├── poc_all_search_outputs/
          │       └── poc_full_license_server/
          ├── ds2_configurations/
          │   ├── etc_system_local/
          │   │   ├── server.conf
          │   │   ├── serverclass.conf
          │   │   └── user-seed.conf
          │   └── etc_apps/
          │       ├── poc_all_search_base/
          │       ├── poc_all_search_outputs/
          │       └── poc_full_license_server/
          ...
    

    The most important configuration file here is serverclass.conf, which you’ll find under both etc_system_local/ directories. Remember, these two files need to be identical, and they contain the key option for this POC:

       [global]
       syncMode = sharedDir
    

    Important: Don’t forget to edit any placeholder URLs in those files!
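
    A quick sanity check that the two copies really are identical, just a checksum comparison (my own habit, not a required step):

       md5sum /opt/splunk/etc/system/local/serverclass.conf   # run on both DS1 and DS2; the hashes must match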

  7. Start Splunk on both Deployment Servers

    Now it’s time to bring Splunk to life!

    • Start Splunk on both DS1 and DS2.
    • Verify that both instances start cleanly without issues.
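
    If you installed from the tgz, one common way to do that is the line below; the first start accepts the license and seeds the admin account from user-seed.conf:

       sudo -u splunk /opt/splunk/bin/splunk start --accept-license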
  8. Install and configure keepalived

    Let’s ensure high availability with Keepalived!

    • Install keepalived on both DS1 and DS2.
             sudo apt update
             sudo apt install -y keepalived
    
    • Next, create a health-check script on both deployment servers. Keepalived runs it periodically and demotes a node when the check fails:
             sudo tee /etc/keepalived/check_splunk.sh >/dev/null <<'EOF'
             #!/usr/bin/env bash
             set -e
     
             SPLUNK_HOME="/opt/splunk"
     
             # Exit 0 only while splunkd is up; any other exit code makes
             # keepalived apply the vrrp_script weight and demote this node
             $SPLUNK_HOME/bin/splunk status | grep -q "splunkd is running"
             EOF
     
             sudo chmod +x /etc/keepalived/check_splunk.sh
    
    • On DS1, set up Keepalived:
             sudo tee /etc/keepalived/keepalived.conf >/dev/null <<EOF
             global_defs {
                 router_id DS1
             }
     
             vrrp_script chk_splunk {
                 script "/etc/keepalived/check_splunk.sh"
                 interval 5
                 timeout 5
                 fall 2
                 rise 2
                 weight -80
             }
     
             vrrp_instance VI_1 {
                 state MASTER
                 interface IFACE_PLACEHOLDER
                 virtual_router_id 51
                 priority 150
                 advert_int 1
                 # note: keepalived honors nopreempt only when the initial state
                 # is BACKUP; kept here to match the POC, but be aware of that caveat
                 nopreempt
     
                 authentication {
                     auth_type PASS
                     auth_pass SplunkDS123!
                 }
     
                 unicast_src_ip DS1_IP_PLACEHOLDER
                 unicast_peer {
                     DS2_IP_PLACEHOLDER
                 }
     
                 virtual_ipaddress {
                     VIP_PLACEHOLDER/24 dev IFACE_PLACEHOLDER
                 }
     
                 track_script {
                     chk_splunk
                 }
             }
             EOF
    
    • On DS2, do the same but with slightly different settings (BACKUP state, lower priority):
             sudo tee /etc/keepalived/keepalived.conf >/dev/null <<EOF
             global_defs {
                 router_id DS2
             }
     
             vrrp_script chk_splunk {
                 script "/etc/keepalived/check_splunk.sh"
                 interval 5
                 timeout 2
                 fall 3
                 rise 3
                 weight -20
             }
     
             vrrp_instance VI_1 {
                 state BACKUP
                 interface IFACE_PLACEHOLDER
                 virtual_router_id 51
                 priority 100
                 advert_int 1
                 nopreempt
     
                 authentication {
                     auth_type PASS
                     auth_pass SplunkDS123!
                 }
     
                 unicast_src_ip DS2_IP_PLACEHOLDER
                 unicast_peer {
                     DS1_IP_PLACEHOLDER
                 }
     
                 virtual_ipaddress {
                     VIP_PLACEHOLDER/24 dev IFACE_PLACEHOLDER
                 }
     
                 track_script {
                     chk_splunk
                 }
             }
             EOF
    
    • Decide on a Virtual IP (VIP) and find out your VMs’ IP addresses and network interface (for example with ip a).
    • Substitute your actual values for <INTERFACE-PLACEHOLDER>, <DS1-PLACEHOLDER>, <DS2-PLACEHOLDER>, and <DS-VIP-PLACEHOLDER>, then run the following commands on both DS:
             sudo sed -i \
             -e "s/IFACE_PLACEHOLDER/<INTERFACE-PLACEHOLDER>/g" \
             -e "s/DS1_IP_PLACEHOLDER/<DS1-PLACEHOLDER>/g" \
             -e "s/DS2_IP_PLACEHOLDER/<DS2-PLACEHOLDER>/g" \
             -e "s/VIP_PLACEHOLDER/<DS-VIP-PLACEHOLDER>/g" \
             /etc/keepalived/keepalived.conf
    
             sudo systemctl enable --now keepalived
             sudo systemctl status keepalived --no-pager
    
    • After starting Keepalived, verify which node currently owns the VIP (the health check script above demotes a node whenever splunkd stops):

             ip -4 addr show dev <INTERFACE-PLACEHOLDER>
     
    • Test VIP reachability, as in the sketch below.
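
    A minimal reachability check, run from the host or from the peer node; the curl call simply confirms that splunkd’s management port answers behind the VIP (use whatever admin credentials you seeded via user-seed.conf):

             ping -c 3 <DS-VIP-PLACEHOLDER>
             # -k skips certificate verification, since Splunk ships a self-signed cert by default
             curl -k https://<DS-VIP-PLACEHOLDER>:8089/services/server/info -u admin:'<your-password>'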

    Finally, you should be able to see your clients in the Splunk UI under Settings > Agent Management. Please be patient if it takes some time for the apps and clients to appear! 🐣

  9. Create a serverclass from the DS1 UI (Settings > Agent Management > Server classes)

    Remember to follow the Splunk naming conventions as a best practice. 🐣

    • Create a serverclass (for example: sc_s1_apps).
    • Assign apps using a wildcard such as poc_*, and configure them with restart behavior (roughly the serverclass.conf sketch after this list).
    • Add the local Splunk instance as a deployment client.
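
    For reference, what the UI writes out should be roughly equivalent to the serverclass.conf sketch below; the whitelist value is a placeholder for your client’s host name or IP:

       [serverClass:sc_s1_apps]
       whitelist.0 = <client-hostname-or-ip>

       [serverClass:sc_s1_apps:app:poc_*]
       restartSplunkd = true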
  10. Verify application deployment and shared state

    At this stage:

    • Apps should be deployed to the Splunk instance. Keep in mind that you might need to be patient due to the restart option.

    Next, you should also see the same client in the DS2 UI! 🥳

    Optional Checks: You can verify client events and checksum information under the shared_apps/client_events/ directory to confirm there are no unexpected changes.

    Note: Changes made through the UI are automatically shared. However, CLI-based changes will require a reload on the DS where those changes were made.
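
    For those CLI-based changes, the reload is the standard deployment server reload, run on the DS where you edited the files:

       /opt/splunk/bin/splunk reload deploy-server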

  11. Test failover

    • Stop Splunk on DS1.
    • Verify that the VIP moves to DS2 using the command we discussed previously:
      ip -4 addr show dev <INTERFACE-PLACEHOLDER>
    

    From the DS2 logs and the S1 UI, check that:

    • The client remains connected.
    • No app redeployment occurs if checksums are unchanged.

    To verify, you can check the client_event files. For instance, after the DS1 → DS2 failover, my appevents_* file shows only the two events below for the poc_all_search_base app, indicating that DS2 did not redeploy it:

      {"datetime":"01-10-2026 11:11:55.158 +0300","log_level":"INFO","component":"DSappevent","data":{"key":"638ad0c520c118da61f4c96683287adc_sc_s1_apps_poc_all_search_base","clientId":"638ad0c520c118da61f4c96683287adc","serverClassName":"sc_s1_apps","appName":"poc_all_search_base","action":"Download","failedReason":"''","checksum":"12890411723215581178","result":"Ok","timestamp":1768032715}}
    
      {"datetime":"01-10-2026 11:11:55.158 +0300","log_level":"INFO","component":"DSappevent","data":{"key":"638ad0c520c118da61f4c96683287adc_sc_s1_apps_poc_all_search_base","clientId":"638ad0c520c118da61f4c96683287adc","serverClassName":"sc_s1_apps","appName":"poc_all_search_base","action":"Install","failedReason":"''","checksum":"12890411723215581178","result":"Ok","timestamp":1768032715}}
    

    Alternatively, you can use the search below to check whether DS2 reloads the apps again:

      index=_internal host="poc-ds2"
      | rex field=_raw "issueReload\":(?<issue_reload>[^,]+)"
      | search issue_reload=*
    

    If there’s an issue during this process, I promise you’ll know something isn’t working! 🔮

  12. Recover DS1
    • Start Splunk on DS1
    • Verify VIP behavior
            ip -4 addr show dev <INTERFACE-PLACEHOLDER>
    
    • Confirm both DS instances are healthy and consistent

4. Troubleshooting

Here are some friendly reload tips in case you encounter issues. If you face the error Failed to reload. This deployment server cannot acquire the lock file '/opt/splunk/etc/deployment-apps/_splunk_tmp/_splunk_ds_lock.lock', here’s what you can do:

  1. Stop Splunk.
  2. Add the option below:
       # $SPLUNK_HOME/etc/splunk-launch.conf
       OPTIMISTIC_ABOUT_FILE_LOCKING = 1
    
  3. Clean the lock file.
    • Option 1: Run $SPLUNK_HOME/bin/splunk clean locks
    • Option 2: Manually delete the shared_apps/deployment-apps/_splunk_tmp/_splunk_ds_lock.lock file.
  4. Start Splunk again.

A Quick Note: It’s recommended to switch from a CIFS filesystem to an NFS filesystem.

This is a known side effect of implementing this feature, but there’s a workaround with the setting mentioned above. Just be cautious, as using this setting might lead to some data loss, so read up on it carefully before making any changes.

Hope this helps, and feel free to reach out if you need more assistance! 😊


5. Summary & Final thoughts

What did we achieve with this POC?

✅ No single point of failure for the Deployment Server
✅ Agents consistently connect to the same address (VIP)
✅ Smooth failover without agent reconfiguration
✅ Shared state keeps our configurations in sync
✅ Simple and easy-to-observe behavior

So, when one Deployment Server goes down, the VIP simply moves to the other node and the clients keep checking in, with no redeployments and no reconfiguration.

Exactly what we aimed for!

Just a quick note: this is still a POC, not an enterprise-grade load-balancing solution.

But that being said, it is simple, transparent, and predictable.

If you’re aware of the trade-offs, this pattern can be incredibly helpful!


What could we explore next?

💬 I’d love to hear your thoughts, ideas, or questions! How do you tackle similar challenges?

Feel free to connect with me on LinkedIn or drop a comment on my blog.

Thanks for reading! And remember, if you’re diving into Splunk internals, you’re definitely not alone! 😊

Until next time, Happy Splunking! 👋🔍


References: