Binghamton University has two major HPC clusters, but both run SLURM in user space. If an organization needs to run a heavy computation on the fly, quickly and temporarily, significant setup is required.
What it does
Searches the local network for nodes the user can log in to, with authorization built on top of SSH. Commands are distributed via TCP sockets and protobuf messages, and state stays consistent because every node is expected to share the same NFS mount.
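A sketch of the wire framing such a design could use: each serialized command is length-prefixed before being written to the socket, so the receiver knows where one message ends and the next begins. The helper names are my own, and the raw bytes here stand in for a serialized protobuf message; this is not the project's actual code.

```python
import socket
import struct

def send_msg(sock: socket.socket, payload: bytes) -> None:
    # Length-prefix the serialized message (e.g. a protobuf Command)
    # with a 4-byte big-endian length header.
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_msg(sock: socket.socket) -> bytes:
    # Read the 4-byte header, then exactly that many payload bytes.
    header = _recv_exact(sock, 4)
    (length,) = struct.unpack("!I", header)
    return _recv_exact(sock, length)

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    # recv() may return fewer bytes than asked for; loop until we have n.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf += chunk
    return buf
```

The received bytes would then be handed to protobuf's parser to recover the command object.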
How I built it
A TCP socket server plus rudimentary Python.
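A minimal sketch of what the manager side of such a server could look like, assuming a simple hand-one-command-per-connection protocol (the function name and protocol details are illustrative, not the project's actual implementation):

```python
import socket
import threading

def start_manager(commands, host="127.0.0.1"):
    """Hand one shell command to each worker that connects, in order."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, 0))            # port 0: let the OS pick a free port
    srv.listen()
    port = srv.getsockname()[1]

    def loop():
        for cmd in commands:
            conn, _addr = srv.accept()
            with conn:
                # Newline-terminated so the worker can read line-by-line.
                conn.sendall(cmd.encode() + b"\n")
        srv.close()

    threading.Thread(target=loop, daemon=True).start()
    return port
```

A worker would connect, read its command, run it, and report back.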
Challenges I ran into
The socket pipe kept breaking when a peer disconnected mid-send. I just wrapped it in a try/except and retried.
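A hedged sketch of that try/except approach: catch the broken-pipe family of errors on send, reopen the connection, and try again. The make_conn factory and the retry count are illustrative assumptions.

```python
def send_with_retry(make_conn, payload: bytes, retries: int = 3):
    """Send `payload`, reopening the connection if the pipe breaks."""
    last_err = None
    for _ in range(retries):
        sock = make_conn()              # fresh connection each attempt
        try:
            sock.sendall(payload)
            return sock                 # success: hand back the live socket
        except (BrokenPipeError, ConnectionResetError) as err:
            last_err = err
            sock.close()                # drop the dead socket and retry
    raise ConnectionError("send failed after retries") from last_err
```

This trades a little wasted work (re-sending the whole payload) for simplicity, which fits a hackathon timeline.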
Accomplishments that I'm proud of
Nodes are matched via regex, which makes it easy to assemble sizeable clusters from patterned hostnames.
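A sketch of what regex-based node selection could look like, assuming hostnames follow a pattern such as nodeNN (the function name and example pattern are mine):

```python
import re

def match_nodes(pattern: str, hostnames):
    """Select cluster nodes whose names fully match the given regex."""
    rx = re.compile(pattern)
    return [h for h in hostnames if rx.fullmatch(h)]
```

For example, the pattern `node\d+` would pull every numbered compute node out of a mixed list while skipping login or storage hosts.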
What I learned
Synchronization across multiple processes and nodes, and how to get a process running on a remote server without keeping an active session open.
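One common way to launch a remote process that outlives the SSH session is nohup with fully redirected stdio, so the remote shell can return immediately. A sketch, assuming key-based (passwordless) SSH; the helper names and the log path are my own:

```python
import subprocess

def detach_command(host: str, cmd: str) -> list:
    """Build an ssh invocation that starts `cmd` on `host` detached from
    the session: nohup ignores SIGHUP, output goes to a log file, stdin
    comes from /dev/null, and `&` backgrounds it so ssh exits at once."""
    remote = f"nohup {cmd} >/tmp/user_cluster.log 2>&1 </dev/null &"
    return ["ssh", host, remote]

def run_detached(host: str, cmd: str) -> None:
    # Assumes key-based SSH auth to `host` is already set up.
    subprocess.run(detach_command(host, cmd), check=True)
```

Without the redirections, ssh would hang waiting for the remote command's open file descriptors even though the command is backgrounded.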
What's next for User Cluster
Implement arguments that alias to utilities such as ulimit, for capping memory and CPU usage. A proper algorithm for choosing nodes, rather than just cycling through them for each command. Detaching the manager from the shell, for p
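One way the planned ulimit-style arguments could map onto Python's resource module, sketched below; the flag names and the mapping are entirely hypothetical, since this feature is not built yet.

```python
import resource

# Hypothetical mapping from planned CLI flags to ulimit-style limits.
LIMIT_FLAGS = {
    "--max-mem": resource.RLIMIT_AS,    # virtual memory, in bytes
    "--max-cpu": resource.RLIMIT_CPU,   # CPU time, in seconds
}

def apply_limits(flags: dict) -> None:
    """Apply resource limits; intended to run in the child process
    just before exec'ing the user's command on a worker node."""
    for flag, value in flags.items():
        res = LIMIT_FLAGS[flag]
        resource.setrlimit(res, (value, value))  # soft = hard = value
```

Because setrlimit applies per-process, the manager would call this in each spawned worker rather than on itself.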