Workflow configuration import and validation for AliECS
Mentored By: Teo Mrnjavac
General Overview
The ALICE Experiment Control System (AliECS) is a distributed system based on state-of-the-art cluster management and microservices technologies that have recently emerged in the distributed computing ecosystem. AliECS is being built from scratch in Go and takes advantage of Apache Mesos for its cluster resource management capabilities, with the goal of controlling tens of thousands of processes over hundreds of nodes.
In preparation for LHC Run 3 starting Spring 2021, the ALICE Experiment at CERN LHC is undergoing a major upgrade, which includes a new computing system called O² (Online-Offline). A workflow template is a file (YAML) which describes a set of data-driven processes to run and control throughout the O²/FLP cluster at LHC Point 2.
For developers of processing software, the high-level interface is called DPL (O² Data Processing Layer). This component is able to generate dumps of data-driven workflows (i.e. files describing a set of processes to run and how they talk to each other), which currently cannot directly be imported into AliECS.
Until now, workflow templates had to be handcrafted to the specifications of the DPL dump, which was time-consuming and did not follow a formal schema. The goal of this project was to develop a converter tool that receives a DPL dump and outputs the required number of task templates along with a single workflow template.
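The one-to-many shape of that conversion can be sketched in a few lines of Go. This is only an illustration, not the converter itself: the dump layout (a JSON object with a `processes` list), the `dplDump`, `taskTemplate` and `workflowTemplate` types, and the `convertDump` function are all hypothetical names chosen for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dplProcess and dplDump describe a hypothetical, heavily simplified dump
// layout. The real DPL dump format is not reproduced here.
type dplProcess struct {
	Name    string `json:"name"`
	Command string `json:"command"`
}

type dplDump struct {
	Processes []dplProcess `json:"processes"`
}

// taskTemplate and workflowTemplate are likewise illustrative only.
type taskTemplate struct {
	Name    string
	Command string
}

type workflowTemplate struct {
	Name  string
	Tasks []string // names of the task templates the workflow refers to
}

// convertDump parses a dump and returns one task template per process plus
// a single workflow template that references all of them.
func convertDump(workflowName string, raw []byte) ([]taskTemplate, workflowTemplate, error) {
	var dump dplDump
	if err := json.Unmarshal(raw, &dump); err != nil {
		return nil, workflowTemplate{}, fmt.Errorf("parsing dump: %w", err)
	}

	tasks := make([]taskTemplate, 0, len(dump.Processes))
	wf := workflowTemplate{Name: workflowName}
	for _, p := range dump.Processes {
		tasks = append(tasks, taskTemplate{Name: p.Name, Command: p.Command})
		wf.Tasks = append(wf.Tasks, p.Name)
	}
	return tasks, wf, nil
}

func main() {
	// Made-up input with two processes.
	raw := []byte(`{"processes":[
		{"name":"producer","command":"producer-cmd"},
		{"name":"consumer","command":"consumer-cmd"}]}`)

	tasks, wf, err := convertDump("demo-workflow", raw)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%d task templates, workflow %q references %v\n", len(tasks), wf.Name, wf.Tasks)
}
```

Returning the task templates and the workflow template together keeps the two outputs consistent: every name listed in the workflow corresponds to exactly one generated task template.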
The following image shows a typical data flow pipeline from generating DPL dumps to using `coconut` and starting environments:
Project Goals
The following were the core goals defined in the project proposal:
Goal | Result |
---|---|
Develop a tool to convert a DPL dump into workflow and task templates | ✅ |
Define formal schemata that these templates adhere to | ✅ |
Develop a validation tool to verify if templates adhere to the said schemata | ✅ |
Beyond the above, `workflow.Graft()` was also implemented. This function allows us to convert a fresh DPL dump on the fly and append its contents to an existing workflow template.
Had GSoC not ended so soon, I would have liked to work on:
- Adding commit hooks to validate all templates uploaded to AliceO2Group/ControlWorkflows
- Preserving custom ordering of marshaled YAML elements
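To make the grafting idea mentioned above more concrete, here is a minimal sketch on a plain in-memory tree. It does not reproduce the signature or behaviour of `workflow.Graft()`: the `role` type and the `find` and `graft` helpers are hypothetical, and the roles to append are assumed to have been converted from a dump already.

```go
package main

import (
	"errors"
	"fmt"
)

// role is a hypothetical, simplified node of a workflow template tree.
type role struct {
	Name     string
	Children []*role
}

var errNoParent = errors.New("parent role not found")

// find returns the first role with the given name using a depth-first
// search, or nil if no such role exists.
func find(r *role, name string) *role {
	if r == nil {
		return nil
	}
	if r.Name == name {
		return r
	}
	for _, child := range r.Children {
		if hit := find(child, name); hit != nil {
			return hit
		}
	}
	return nil
}

// graft appends the converted roles beneath the named parent, leaving the
// existing children and their order untouched.
func graft(root *role, parent string, converted []*role) error {
	target := find(root, parent)
	if target == nil {
		return fmt.Errorf("%w: %q", errNoParent, parent)
	}
	target.Children = append(target.Children, converted...)
	return nil
}

func main() {
	// Made-up existing workflow with a single child role.
	root := &role{Name: "existing-workflow", Children: []*role{{Name: "host-1"}}}
	// Made-up roles standing in for a freshly converted dump.
	fresh := []*role{{Name: "producer"}, {Name: "consumer"}}

	if err := graft(root, "host-1", fresh); err != nil {
		fmt.Println(err)
		return
	}
	for _, child := range root.Children[0].Children {
		fmt.Println(child.Name)
	}
}
```

Appending rather than rebuilding the tree is what keeps the existing part of the template intact, which is the property that matters when the template was written by hand.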
Highlights
- Before we began work on the project, we needed a name for the utility that would house the above features. Given that `AliceO2Group/Control` (the repository for AliECS) already had a couple of tools called `coconut` and `peanut`, we decided that `walnut` was an appropriate name. `walnut` stands for the Workflow Administration and LiNting UTility.
- Twice during GSoC, I presented my work to fellow members of the ALICE community. These were encapsulated in:
Challenges and Lessons
- Using hard-coded paths to marshal complicated types like `aggregatorRole` and `iteratorRole` resulted in overly complicated `MarshalYAML()` functions. Calling `MarshalYAML()` on each of the type's constituent fields, and thereby reusing the custom marshalers we had already defined, proved to be a cleaner and more elegant approach. Converting each result to a `map[string]interface{}` and iterating over its key-value pairs, adding them to the result as we go, gave us exactly what we wanted in a much smaller package.
- During marshaling of YAML files, `omitempty` proved unreliable in the case of slices. Further research revealed that the slices were not empty but rather held `nil` values. You can read more about this here.
- Before we could begin work on `workflow.Graft()`, we needed a way to load existing workflow templates into `walnut`. We couldn't use the `UnmarshalYAML` methods already defined, since doing so would lose all ordering and comments present in the workflow template. Using the `yaml.Node` implementation solved this problem, allowing us to insert new elements while preserving the ordering and comments.
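The first two lessons can be illustrated with a small, standard-library-only sketch. Several things here are assumptions: `roleBase` and `aggregator` are simplified stand-ins for the constituent fields of a composite role, `mergeMarshaled` is a hypothetical helper, and the `yamlMarshaler` interface is declared locally with the usual `MarshalYAML() (interface{}, error)` method set so that no YAML library is needed. The slice pitfall is shown with `encoding/json`, whose `omitempty` likewise skips only nil or zero-length slices, rather than with the YAML library used in the project.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// yamlMarshaler mirrors the method set of a custom YAML marshaler. It is
// declared locally so that this sketch needs only the standard library.
type yamlMarshaler interface {
	MarshalYAML() (interface{}, error)
}

// roleBase and aggregator are hypothetical, simplified constituent fields.
type roleBase struct{ Name string }

func (r roleBase) MarshalYAML() (interface{}, error) {
	return map[string]interface{}{"name": r.Name}, nil
}

type aggregator struct{ Roles []string }

func (a aggregator) MarshalYAML() (interface{}, error) {
	return map[string]interface{}{"roles": a.Roles}, nil
}

// aggregatorRole is composed of the two fields above.
type aggregatorRole struct {
	roleBase
	aggregator
}

// mergeMarshaled calls each part's own marshaler and copies the resulting
// key-value pairs into a single map.
func mergeMarshaled(parts ...yamlMarshaler) (map[string]interface{}, error) {
	merged := make(map[string]interface{})
	for _, part := range parts {
		out, err := part.MarshalYAML()
		if err != nil {
			return nil, err
		}
		fields, ok := out.(map[string]interface{})
		if !ok {
			return nil, fmt.Errorf("expected a map from MarshalYAML, got %T", out)
		}
		for key, value := range fields {
			merged[key] = value
		}
	}
	return merged, nil
}

// MarshalYAML reuses the marshalers of the constituent fields instead of
// spelling out every key by hand.
func (r aggregatorRole) MarshalYAML() (interface{}, error) {
	merged, err := mergeMarshaled(r.roleBase, r.aggregator)
	if err != nil {
		return nil, err
	}
	return merged, nil
}

// taskList demonstrates the slice pitfall using encoding/json.
type taskList struct {
	Tasks []*string `json:"tasks,omitempty"`
}

// encode returns the JSON form of t, or the error text if encoding fails.
func encode(t taskList) string {
	out, err := json.Marshal(t)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(out)
}

func main() {
	role := aggregatorRole{
		roleBase:   roleBase{Name: "example-role"},
		aggregator: aggregator{Roles: []string{"child-a", "child-b"}},
	}
	merged, err := role.MarshalYAML()
	fmt.Println(merged, err)

	// A nil slice is dropped by omitempty.
	fmt.Println(encode(taskList{}))
	// A slice holding a single nil element has length 1, so it is kept.
	fmt.Println(encode(taskList{Tasks: []*string{nil}}))
}
```

If key collisions between the constituent fields are possible, later parts overwrite earlier ones in this sketch, so the order of the arguments to `mergeMarshaled` matters.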
Conclusion
- Over the 90 days of Google Summer of Code, I submitted:
  - 19 pull requests with a total of 138 commits
  - 3,600+ additions to `Control`
- I spent ~300 hours working on this project, which is around 25 hours/week
Overall, GSoC has been a phenomenal learning experience for me. The knowledge I gained is not limited to programming: I have also learned how to work in a team and how to present our work in a structured and digestible fashion for the other engineers who depend on it. I’m immensely grateful to CERN, Google and, most of all, my mentor, Teo Mrnjavac.