DC/OS stands for Data Center Operating System. Essentially, it allows you to manage multiple machines or nodes as if they were a single pool of resources. Unlike a traditional operating system, DC/OS spans every machine in a network and aggregates their resources, maximizing utilization by distributed applications. Think of DC/OS as an operating system like CentOS, RHEL or Ubuntu, but one that goes further: it uses a unified API to manage multiple systems, such as containers and distributed services, in the cloud or on-premises. It automates resource management, schedules process placement, facilitates inter-process communication, and simplifies the installation and management of distributed services. A web interface and a command-line interface (CLI) make it easy to manage and monitor the cluster and its services remotely.
DC/OS comes in two flavors: an enterprise version, which includes advanced features for security, compliance, multitenancy, networking and storage and is backed by 24×7, SLA-governed support; and an open-source distributed operating system built on the Apache Mesos distributed systems kernel. Both manage multiple machines in the cloud or on-premises from a single interface. The open-source version also includes service discovery, the Universe package repository for different frameworks, CLI and GUI support for management, and volume support for persistent storage.
DC/OS is built on top of Apache Mesos, the open-source distributed orchestrator for container and non-container workloads alike. Mesos is a cluster manager that hides the complexity of running applications on a shared pool of servers; it shares resources across application frameworks using schedulers and executors. You can think of Mesos as being like the Linux kernel.
On top of that kernel, DC/OS acts as a:
Cluster manager – master and agent node/task management and execution
Container platform – a runtime environment and tooling for containerized workloads
Operating system – a single set of APIs that abstracts away the individual machines
Two-Level Scheduling
DC/OS addresses this by separating resource management from task scheduling. Mesos manages CPU, memory, disk, and GPU resources, while task placement is delegated to higher-level schedulers that are more aware of their tasks’ specific requirements and constraints. This model, known as two-level scheduling, enables multiple workloads to be colocated efficiently.
Marathon Framework
Marathon acts as a core component of DC/OS, giving you a production-grade, well-tested scheduler that is capable of orchestrating both containerized and non-containerized workloads. With Marathon you can operate at extreme scale, scheduling tens of thousands of tasks across thousands of nodes. Highly configurable, declarative application definitions let you enforce advanced placement constraints with node, cluster, and grouping affinities.
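As a minimal sketch (the app ID and command are illustrative, and it assumes python3 is available on the agents), a declarative Marathon app definition with a placement constraint might look like this:

    {
      "id": "/example/web",
      "cmd": "python3 -m http.server 8080",
      "cpus": 0.5,
      "mem": 128,
      "instances": 3,
      "constraints": [["hostname", "UNIQUE"]]
    }

The ["hostname", "UNIQUE"] constraint tells Marathon to place at most one instance per node.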
For the DC/OS cluster we are going to use AWS EC2 instances to create and manage the nodes. The diagram below shows what this architecture looks like.
We will use the DC/OS Universal Installer Terraform module to create and maintain the infrastructure components. First, we set up Terraform on the system from which we are going to deploy the DC/OS cluster. On-premises and other cloud installation steps can be found here (https://docs.d2iq.com/mesosphere/dcos/2.1/installing/).
In this setup we are going to use the AWS cloud platform and the DC/OS Universal Installer 0.3 for DC/OS version 2.1.
The Terraform configuration requires an SSH key to be generated before deployment. We need to make sure the public key exists at the path referenced by “ssh_public_key_file”.
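For example, a key pair can be generated with ssh-keygen (the path is illustrative):

    # generate a 4096-bit RSA key pair with no passphrase
    ssh-keygen -t rsa -b 4096 -f ~/.ssh/dcos-key -N ""

The matching public key, ~/.ssh/dcos-key.pub, is then the value to point “ssh_public_key_file” at.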
As we are using the open-source version of DC/OS, we do not need a license key. Now we can proceed to deployment.
Deploy DC/OS Cluster
Create a demo directory.
Create the Terraform file with a name that suits your deployment. In this example, we are going to create a file called dcos.tf. Make sure the file ends with the extension .tf.
Via this Terraform template we are going to launch a cluster with 1 master node, 2 private agent nodes, 1 public agent node and 1 default bootstrap node.
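A minimal dcos.tf along the lines of the Universal Installer examples might look like this (the cluster name and key path are placeholders; check the module documentation for the full option list):

    module "dcos" {
      source  = "dcos-terraform/dcos/aws"
      version = "~> 0.3.0"

      cluster_name        = "dcos-demo"
      ssh_public_key_file = "~/.ssh/dcos-key.pub"
      admin_ips           = ["${data.http.whatismyip.body}/32"]

      num_masters        = 1
      num_private_agents = 2
      num_public_agents  = 1

      dcos_version = "2.1.0"
    }

    # look up our current public IP so only we can reach the admin endpoints
    data "http" "whatismyip" {
      url = "http://whatismyip.akamai.com/"
    }

    output "masters-ips" {
      value = "${module.dcos.masters-ips}"
    }

    output "cluster-address" {
      value = "${module.dcos.masters-loadbalancer}"
    }

    output "public-agents-loadbalancer" {
      value = "${module.dcos.public-agents-loadbalancer}"
    }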
The cluster’s admin endpoints will only be accessible from your current public IP address, which the template discovers via the “whatismyip” lookup.
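With dcos.tf in place, the standard Terraform workflow performs the deployment:

    terraform init           # download the dcos-terraform modules and providers
    terraform plan -out=plan.out
    terraform apply plan.out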
After the installation, the cluster can be managed through the UI, CLI and API. The cluster will generate:
EC2 Instances:
1 Bootstrap node – bootstraps the other nodes according to their role (via Ansible).
N Master nodes – to manage the cluster API and resources.
N Public agent nodes – to sit behind the external load-balancer.
N Private agent nodes – to sit behind the internal load-balancer.
ELB:
1 Admin DC/OS UI load-balancer (ALB).
1 External public-facing load-balancer (NLB) – you can map your custom DNS here.
1 Internal load-balancer (NLB) – for internal use within the VPC.
Security groups to manage traffic; in particular, access to the Admin LB is controlled through a whitelist of IP addresses.
DC/OS UI Login:
Open the Admin LB endpoint obtained from the output of the Universal Installer (Terraform).
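If you used output names like those in the dcos.tf sketch above, the endpoint can be read back with:

    terraform output cluster-address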
First, log in with an OpenID provider (Google or Microsoft account).
Some default useful URLs:
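On a default installation these typically include (replace <cluster-address> with your Admin LB endpoint):

    DC/OS UI:  https://<cluster-address>/
    Mesos:     https://<cluster-address>/mesos
    Marathon:  https://<cluster-address>/marathon
    Exhibitor: https://<cluster-address>/exhibitor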
DC/OS by default comes with Marathon installed to manage stateless, long-running applications.
Marathon supports Docker and Universal Container Runtime.
DC/OS also provides non-container application execution as Jobs, using the Metronome framework: you can directly execute binaries or packaged executables by supplying simple commands.
Instructions for installing the DC/OS CLI can be found by clicking on the “cluster-name” in the upper-right corner of the DC/OS UI.
Sample steps for Linux:
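Roughly following the documented steps (verify the download URL against your cluster version):

    # download the DC/OS CLI binary and put it on the PATH
    curl https://downloads.dcos.io/cli/releases/binaries/dcos/linux/x86-64/latest/dcos -o dcos
    chmod +x dcos
    sudo mv dcos /usr/local/bin
    # point the CLI at the cluster and log in
    dcos cluster setup https://<cluster-address>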
CLI commands:
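For example (the app ID is illustrative):

    dcos marathon app add app.json          # deploy the app defined below
    dcos marathon app list                  # list deployed apps
    dcos marathon app show /example/myapp   # inspect a deployed app definition
    dcos task                               # list running tasks
    dcos marathon app remove /example/myapp # tear the app down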
Sample application file: app.json
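A minimal sketch of such a file (the image and ports are illustrative):

    {
      "id": "/example/myapp",
      "instances": 2,
      "cpus": 0.5,
      "mem": 128,
      "container": {
        "type": "DOCKER",
        "docker": { "image": "nginx:latest" },
        "portMappings": [
          { "containerPort": 80, "hostPort": 0, "name": "http" }
        ]
      },
      "networks": [{ "mode": "container/bridge" }]
    }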
Check the “Services” tab in the UI and explore the service management operations like start, stop, scale, terminate, configuration history, logs, etc.
Note that Services and Groups support “Quota”, which we can use to define CPU and memory resource allocation limits.
Service discovery enables you to address applications independently of where they are running in the cluster, which is particularly useful in cases where applications may fail and be restarted on a different host.
DC/OS provides two options for service discovery:
Mesos-DNS:
Resolves internal service host and port information through an SRV record lookup.
Example: dig srv _appname._tcp.marathon.mesos
Named Virtual IPs:
Allows you to assign name/port pairs to your apps, which means you can give your apps meaningful names with a predictable port, by adding a VIP label to the port definitions in the app’s JSON.
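For example, a VIP label attached to a port definition (the name and port are illustrative):

    "portDefinitions": [
      {
        "port": 0,
        "protocol": "tcp",
        "labels": { "VIP_0": "/myapp:5555" }
      }
    ]

The app is then reachable cluster-wide at myapp.marathon.l4lb.thisdcos.directory:5555, regardless of where its tasks run.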
From the DC/OS UI navigate to the “Catalog” tab.
DC/OS provides a large variety of predefined, ready-to-use tools through the “Catalog”, including packages like Jenkins, Prometheus, Grafana, MongoDB, Redis and OpenVPN.
As our application is deployed on the DC/OS Marathon framework, we will make our frontend service web-accessible through a public load-balancer.
We will use Marathon-LB, a Python-based utility backed by HAProxy.
Search for “marathon-lb” and select it from the community packages. Install it with the default configuration.
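Alternatively, the same package can be installed from the CLI:

    dcos package install marathon-lb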
By default, Marathon-LB attaches to the external load-balancer and deploys itself in the public agent nodes.
Define the HAProxy labels (key:value pairs) for the application in its JSON manifest.
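A typical label set for exposing an app externally looks like this (the vhost is a placeholder for your own domain):

    "labels": {
      "HAPROXY_GROUP": "external",
      "HAPROXY_0_VHOST": "myapp.example.com"
    }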
After deploying the application with the above configuration, map your DNS records to the external load-balancer CNAME.
Please refer to the official doc for more information:
https://docs.d2iq.com/mesosphere/dcos/services/marathon-lb/1.14/mlb-reference/
External user accounts: External user accounts exist for users who want to use their Google, GitHub, or Microsoft credentials. DC/OS never receives or stores the passwords of external users.
Local user accounts: Local user accounts exist for users who want to create a user account within DC/OS. Usernames and password hashes are stored in the IAM database.
Service accounts: A machine interacting with DC/OS should always log in through a service account to obtain an authentication token; do not use a username/password-based login in that case.
Let’s add a local user with a simple username and password.
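On open-source DC/OS, one way to do this is through the IAM HTTP API (the uid and password here are illustrative; this assumes a logged-in CLI session):

    # create a local user "appscaler" via the IAM API
    curl -k -X PUT \
      -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
      -H "Content-Type: application/json" \
      -d '{"password": "S3cretPassw0rd", "description": "autoscaler user"}' \
      "$(dcos config show core.dcos_url)/acs/api/v1/users/appscaler"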
To install the application auto-scaler, navigate to the “Catalog”, then search for and select “marathon-autoscaler”.
In the “Autoscaler” service configuration, define “Marathon App” with the specific app ID, such as “/example/appid”, and set the “Userid” and “Password” we created in the previous step.
Deploy the autoscaler for that specific Marathon application.
For more details refer to: https://github.com/mesosphere/marathon-autoscale
Types of metrics:
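Node – host-level metrics (CPU, memory, disk, network) from every agent and master.
Container – per-container resource usage collected from the cgroups.
Application – custom metrics emitted by the application itself, as described next.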
An application can export custom metrics to the DC/OS metrics service via the standard environment variables STATSD_UDP_HOST and STATSD_UDP_PORT, which are injected into each task; the application sends its metrics to that StatsD address.
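As a minimal sketch, a task could emit a StatsD counter from the shell (the metric name is illustrative, and this assumes a netcat build with UDP support):

    # increment the counter "myapp.requests" via StatsD over UDP
    echo "myapp.requests:1|c" | nc -u -w1 "$STATSD_UDP_HOST" "$STATSD_UDP_PORT"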
Alternatively, we can serve metrics in Prometheus format by exposing a suitable endpoint. We need to add at least one port with the label DCOS_METRICS_FORMAT=prometheus.
Example: "labels": { "DCOS_METRICS_FORMAT": "prometheus" }
CLI to test metrics:
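For example (the task and node IDs come from dcos task and dcos node; these subcommands are available in DC/OS 1.12 and newer):

    dcos task                                 # find the task ID
    dcos task metrics summary <task-id>       # CPU/mem/disk summary for a task
    dcos task metrics details <task-id>       # all metrics for a task
    dcos node metrics details <mesos-node-id> # node-level metrics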
Note: DC/OS 1.12 and newer versions use Telegraf to collect and process metrics data; Telegraf provides a plugin-driven architecture.
From the “Catalog” tab in the UI, search for and install “dcos-monitoring”, which includes Prometheus agent/server and Grafana.
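Or, assuming the package name as it appears in the Catalog, from the CLI:

    dcos package install dcos-monitoring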
Access URL for Monitoring:
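With the default service name, the UIs are typically proxied through Admin Router at:

    Prometheus: https://<cluster-address>/service/dcos-monitoring/prometheus
    Grafana:    https://<cluster-address>/service/dcos-monitoring/grafana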
For more details refer to: https://docs.d2iq.com/mesosphere/dcos/2.1/metrics/
If you have a question about DC/OS deployment on the AWS cloud, be sure to let us know by getting in touch using the form below.