Scale ML on Your Local Clusters with Ray
TL;DR: Run machine learning workloads on local on-premise clusters with Ray.
Ray is a simple and robust framework for building and running distributed applications. Ray provides many libraries for accelerating machine learning, such as RLlib, Tune, Serve, and SGD, and handles all the involved complexities, such as scheduling tasks across multiple machines, recovering from failures, and efficient data transfer, while keeping things simple.
Ray Cluster Launcher (Autoscaler)
Ray ships with a cluster launcher, which automatically performs autoscaling based on resource usage. The cluster launcher can be configured to launch cloud clusters, for example, on AWS, Azure, GCP, and Kubernetes. While these options are widely used, multiple users might need to launch instances in their local on-premise clusters. The cluster launcher needs to be updated for such instances. Fortunately, the Ray cluster launcher has a simple API that makes adding other cluster node providers simple. For example, recently, the support of Staroid instances was added to the cluster launcher.
In this blog post, we overview the on-premise cluster launcher, which allows launching and managing private on-premise clusters. The on-premise cluster launcher supports two modes of operation: 1) manually managed, i.e., the user explicitly specifies the head and worker IPs, and 2) automatically managed, i.e., the user only specifies a coordinator address to a coordinating server that automatically coordinates its head and worker IPs, while making sure to provide isolation of resources between different users.
Getting Started with Launching On-Premise Clusters
To get started with launching on-premise clusters follow the following instructions:
(1) Install the most recent nightly Ray version (pip install -U [link to wheel])
).
(2) Copy example-full.yaml and fill out the ssh user, private key, and other user customized fields if necessary. Besides, the provider section is required and should be filled based on the use case:
To manually manage a cluster, you need to provide the head IP and the list of worker IPs under the provider section in the YAML file. For example:
provider:
type: local
head_ip: 123.456.78.90:1234
worker_ips: [123.456.78.91, 123.456.78.92]
Here is an example of a filled YAML file.
To automatically manage a cluster:
a) Copy coordinator_server.py to your local directory.
b) Open a tmux and run:python coordinator_server.py --ips <list of node ips> --port <PORT>
Provide a list of comma-separated node IPs (e.g.,123.123.12.12,456.456.45.45
) that corresponds to the available local nodes and a port of your choice (e.g., 1234
). This step will print the coordinator address needed in the next step. An example output:Running on prem coordinator server on address <Host:PORT>
.
c) Use the printed coordinator address to fill out the coordinator address field in the YAML file, under the provider section (remove the head and worker IPs fields). For example:
provider:
type: local
coordinator_address: 123.456.789.01:1234
Here is an example of a filled YAML file.
(3) Run ray up -y example-full.yaml
to launch your on-premise Ray cluster. Here is an example output:
Once the cluster starts, you can attach to a remote SSH shell on the head node by running ray attach example-full.yaml
. To use this Ray cluster, either run ray.init(address=”auto”)
in your Python script or set up the RAY_ADDRESS
shell environment variable to auto
.
Conclusion
In this blog post, I showed you how you can scale your machine learning workloads to run on local on-premise clusters with Ray.
If you have questions, feedback, or suggestions, please get in touch by filing an issue on the Ray repository or the Ray Slack.
If you’re interested in working on Ray and helping to shape the future of distributed computing, join us at Anyscale! We’re hiring.