Distributed Computing

Launches worker processes via the specified cluster manager.

For example, Beowulf clusters are supported via a custom cluster manager implemented in the package ClusterManagers.jl.

The number of seconds a newly launched worker waits for connection establishment from the master can be specified via variable JULIA_WORKER_TIMEOUT in the worker process’s environment. Relevant only when using TCP/IP as transport.

To launch workers without blocking the REPL, or the containing function if launching workers programmatically, execute addprocs in its own task.

Examples

# On busy clusters, call `addprocs` asynchronously
t = @async addprocs(...)

# Utilize workers as and when they come online
if nprocs() > 1   # Ensure at least one new worker is available
   ....   # perform distributed execution
end

# Retrieve newly launched worker IDs, or any error messages
if istaskdone(t)   # Check if `addprocs` has completed to ensure `fetch` doesn't block
    if nworkers() == N
        new_pids = fetch(t)
    else
        fetch(t)
    end
  end

addprocs(machines; tunnel=false, sshflags=``, max_parallel=10, kwargs...) -> List of process identifiers

Add processes on remote machines via SSH. Requires julia to be installed in the same location on each node, or to be available via a shared file system.

machines is a vector of machine specifications. Workers are started for each specification.

A machine specification is either a string machine_spec or a tuple - (machine_spec, count).

machine_spec is a string of the form [user@]host[:port] [bind_addr[:port]]. user defaults to current user, port to the standard ssh port. If [bind_addr[:port]] is specified, other workers will connect to this worker at the specified bind_addr and port.

count is the number of workers to be launched on the specified host. If specified as :auto it will launch as many workers as the number of CPU threads on the specific host.

Keyword arguments:

tunnel: if true then SSH tunneling will be used to connect to the worker from the master process. Default is false.
sshflags: specifies additional ssh options, e.g. sshflags=`-i /home/foo/bar.pem`
max_parallel: specifies the maximum number of workers connected to in parallel at a host. Defaults to 10.
dir: specifies the working directory on the workers. Defaults to the host’s current directory (as found by pwd())
enable_threaded_blas: if true then BLAS will run on multiple threads in added processes. Default is false.
exename: name of the julia executable. Defaults to "$(Sys.BINDIR)/julia" or "$(Sys.BINDIR)/julia-debug" as the case may be.
exeflags: additional flags passed to the worker processes.
topology: Specifies how the workers connect to each other. Sending a message between unconnected workers results in an error.
- topology=:all_to_all: All processes are connected to each other. The default.
- topology=:master_worker: Only the driver process, i.e. pid 1 connects to the workers. The workers do not connect to each other.
- topology=:custom: The launch method of the cluster manager specifies the connection topology via fields ident and connect_idents in WorkerConfig. A worker with a cluster manager identity ident will connect to all workers specified in connect_idents.
lazy: Applicable only with topology=:all_to_all. If true, worker-worker connections are setup lazily, i.e. they are setup at the first instance of a remote call between workers. Default is true.

Environment variables :

If the master process fails to establish a connection with a newly launched worker within 60.0 seconds, the worker treats it as a fatal situation and terminates. This timeout can be controlled via environment variable JULIA_WORKER_TIMEOUT. The value of JULIA_WORKER_TIMEOUT on the master process specifies the number of seconds a newly launched worker waits for connection establishment.