
GSoC 2020 @ CERN - Part II
Sunday, November 01, 2020

Workflow configuration import and validation for AliECS

General Overview

The ALICE Experiment Control System (AliECS) is a distributed system built on state-of-the-art cluster management and microservices technologies that have recently emerged in the distributed computing ecosystem. AliECS is being built from scratch in Go and uses Apache Mesos for cluster resource management, with the goal of controlling tens of thousands of processes across hundreds of nodes.

In preparation for LHC Run 3, starting in Spring 2021, the ALICE Experiment at the CERN LHC is undergoing a major upgrade, which includes a new computing system called O² (Online-Offline). A workflow template is a YAML file that describes a set of data-driven processes to run and control throughout the O²/FLP cluster at LHC Point 2.

For developers of processing software, the high-level interface is called DPL (O² Data Processing Layer). This component is able to generate dumps of data-driven workflows (i.e. files describing a set of processes to run and how they talk to each other), which currently cannot directly be imported into AliECS.

Until now, workflow templates had to be handcrafted to match the specifications of a DPL dump, a time-consuming process that followed no formal schema. The goal of this project was to develop a converter tool that takes a DPL dump and outputs the required number of task templates along with a single workflow template.
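To make the shape of that conversion concrete, here is a minimal sketch of the idea. It is not walnut's actual code: the `dplDump`, `taskTemplate` and `workflowTemplate` types, their fields, and the `convert` function are hypothetical simplifications, and the sample input is an assumed, stripped-down dump rather than real DPL output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dplDump is a hypothetical, heavily simplified view of a DPL dump.
// The field names are assumptions for illustration, not the real format.
type dplDump struct {
	Workflow []struct {
		Name string `json:"name"`
	} `json:"workflow"`
}

// taskTemplate and workflowTemplate are hypothetical stand-ins for the
// files a converter would write out.
type taskTemplate struct {
	Name string
}

type workflowTemplate struct {
	Name  string
	Roles []string // one role per task, in dump order
}

// convert parses a dump and returns one task template per process,
// plus a single workflow template that references all of them.
func convert(name string, dump []byte) ([]taskTemplate, workflowTemplate, error) {
	var d dplDump
	if err := json.Unmarshal(dump, &d); err != nil {
		return nil, workflowTemplate{}, fmt.Errorf("parsing DPL dump: %w", err)
	}
	wf := workflowTemplate{Name: name}
	tasks := make([]taskTemplate, 0, len(d.Workflow))
	for _, p := range d.Workflow {
		tasks = append(tasks, taskTemplate{Name: p.Name})
		wf.Roles = append(wf.Roles, p.Name)
	}
	return tasks, wf, nil
}

func main() {
	dump := []byte(`{"workflow": [{"name": "producer"}, {"name": "consumer"}]}`)
	tasks, wf, err := convert("demo", dump)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d task templates, workflow %q with roles %v\n", len(tasks), wf.Name, wf.Roles)
}
```

The point of the sketch is only the fan-out: many processes in one dump become many task templates, tied together by exactly one workflow template.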

Intended conversion inputs and outputs

The following image shows a typical data flow pipeline from generating DPL dumps to using coconut and starting environments:

Data-driven workflow pipeline

Project Goals

The following were the core goals defined in the project proposal:

  • Develop a tool to convert a DPL dump into workflow and task templates
  • Define formal schemata that these templates adhere to
  • Develop a validation tool to verify that templates adhere to these schemata

Beyond the above, workflow.Graft() was also implemented. This function allows us to convert a fresh DPL dump on the fly and append its contents to an existing workflow template. Had GSoC not ended so soon I would’ve liked to work on:

Highlights

  • Before we began work on the project, we needed a name for the utility that would house the above features. Given that AliceO2Group/Control (the repository for AliECS) already had a couple of tools called coconut and peanut, we decided that walnut was an appropriate name. walnut stands for the Workflow Administration and LiNting UTility.
  • Twice during GSoC, I had to present my work to fellow members of the ALICE community. These were encapsulated in:

Challenges and Lessons

  • Using hard-coded paths to marshal complicated types like aggregatorRole and iteratorRole resulted in overly complicated MarshalYAML() functions. Calling MarshalYAML() on each constituent field instead, and thereby reusing the custom marshalers we had already defined, proved to be a cleaner and more elegant approach. Type-asserting each result to a map[string]interface{} and iterating over its key-value pairs, adding them to the output as we go, gave us exactly what we wanted in a much smaller package.
  • When marshaling YAML files, omitempty appeared to be unreliable for slices. Further research revealed that the slices in question were not empty but rather held nil values, so they were never omitted. You can read more about this here.
  • Before we could begin work on workflow.Graft(), we needed a way to load existing workflow templates into walnut. We couldn’t use the UnmarshalYAML methods we had already defined, since that would lose all ordering and comments present in the workflow template. Using the yaml.Node implementation solved this problem, allowing us to insert new elements while preserving both ordering and comments.
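The omitempty pitfall from the second point above is easy to reproduce in isolation. The sketch below uses the standard library's encoding/json purely as a stand-in for the YAML encoder, so that it has no external dependencies. It also assumes that the slice elements are pointers. The `role` type and the `encode` and `compact` helpers are hypothetical and do not come from walnut.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// role is a hypothetical type; only the omitempty tag matters here.
type role struct {
	Name  string    `json:"name"`
	Roles []*string `json:"roles,omitempty"`
}

// encode marshals a role and returns the result as a string.
func encode(r role) string {
	out, err := json.Marshal(r)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(out)
}

// compact drops nil entries so that omitempty can take effect.
func compact(in []*string) []*string {
	var out []*string
	for _, p := range in {
		if p != nil {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// A slice of length 1 is not "empty", even if its only element is nil.
	holdsNil := role{Name: "a", Roles: []*string{nil}}
	fmt.Println(encode(holdsNil)) // {"name":"a","roles":[null]}

	// After removing nil entries the slice has length 0 and is omitted.
	holdsNil.Roles = compact(holdsNil.Roles)
	fmt.Println(encode(holdsNil)) // {"name":"a"}
}
```

In other words, omitempty looks at the length of the slice, not at what its elements contain, so filtering out nil entries before marshaling is one way to get the expected output.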

Conclusion

  • Over the 90 days of Google Summer of Code, I submitted:
    • 19 pull requests with a total of 138 commits
    • 3,600+ additions to Control
  • I spent ~300 hours working on this project which is around 25 hours/week

Overall, GSoC has been a phenomenal learning experience for me. The knowledge I gained is not limited to programming: I have also learned how to work in a team and how to present our work in a structured and digestible fashion for the other engineers who depend on it. I’m immensely grateful to CERN, Google and, most of all, my mentor, Teo Mrnjavac.

My experience with GSoC 2020 @ CERN: Part 0, Part I