

2025-03 12th issue - Pack your Data

Welcome

Welcome to the Equinox Math and Data Quarterly Newsletter! With decreasing heat in the southern hemisphere and snowdrops sprouting in the north, MaRDI services have been gaining momentum and are now appearing on the Portal. One of these is of particular interest to us in this newsletter issue: MaPS, the MaRDI Packaging System. This cute new service enables you to package your software environment so that referees of your paper and code can replicate your exact setup on their system (no matter the hemisphere) with just a couple of simple commands: highly efficient and very useful for peer review. Want to know more? Check out our main article and the Data Date video interview below.

Illustration by Ariel Kahtan, licensed under CC BY-SA 4.0.

Before introducing MaPS in more detail (and hopefully convincing you to use it in the future), we would like to ask you about your current favorite packaging system. The link will take you to our one-question-multiple-choice survey:

What is your favorite packaging system?

In December, we asked you about your favorite algorithm in mathematics. Of the 15 algorithms offered, only 6 received votes. Gradient Descent was by far the most popular (30% of all votes). In addition to those on the list, two further algorithms were mentioned: Buchberger's algorithm and Q-Learning.

Short Stories on MaRDI

Two short stories on the MaRDI Packaging System

1. Reproducing and reviewing results

Dr. Alba Numbs is a researcher in numerics and high-performance computing. She has just received a request to peer-review an article submitted to a journal she regularly works with. The editor sent her the authors’ manuscript as a PDF file. The article interests Dr. Numbs since it aligns with her research domain. In the paper, the authors introduce a new algorithm and describe it broadly, outlining the key ideas at the core of their innovation. They also claim to have run several tests comparing their new algorithm with established methods, obtaining specific performance advantages for their system. All looks good, but Dr. Numbs would like to test the algorithms to verify the authors’ claims.

More often than she would like to admit, when refereeing articles on new algorithms she has only been able to do some “static analysis” of the code: she reviewed the logic of the algorithm as described in the paper but could not run it in practice. In some cases, the authors do not provide any implementation to begin with, and code is not part of the submission. The authors certainly have their own implementations, but they publish only the overall idea of the algorithm, not the code. Dr. Numbs believes this practice should be avoided in academic papers. Fortunately, in this case the authors do have a public repository with the source code.

However, it is often challenging for a third party to install all the dependencies and libraries needed to run a program that has not been polished enough to be considered production-ready. Research programs are usually just proof-of-concept demonstrations, written without much attention to robustness or stability, and with even less attention to user-friendly installation.

In the paper received by Dr. Numbs, the authors included a URL for the GitHub repository, which was indeed online. So Dr. Numbs browsed the code online for a bit and then cloned the repository on her computer to test it. That is when the trouble started. The program contained some scripts in Julia. The authors state they used release version 0.9, but Dr. Numbs has version 1.11 installed on her computer. Her version is newer, and she doesn’t want to downgrade her installation for testing, but if an error occurs, she has no way of ruling out that it is caused by a version incompatibility. The program also has other dependencies: a few command-line programs for utility functions, a bunch of scripts in Python and Go, and some C++ source code that has to be compiled. A build script calls all these dependencies, which, Dr. Numbs deduces, must be installed on the authors’ computers.

Dr. Numbs was very reluctant to install all these compilers, interpreters, and libraries on her computer just to test someone else’s program. She would probably not use many of these tools again, so they would only pollute her system, not to mention all the time she would need to find specific versions of the required software and troubleshoot the resulting errors. She still remembered last year's “computer apocalypse”, when she was asked to review eight software-related articles at the same time for the proceedings of a conference on numerical analysis. She ended up manually installing so many conflicting programs and libraries, which overwrote each other, that she could not test all the submissions, and she lost several days of work trying to fix everything. Eventually, she had to format the hard drive and reinstall her operating system from scratch to clean up the mess on her laptop. This time she was much more cautious, and the more technical issues she found, the more she was tempted to give up and just evaluate the printed paper with a pencil.

Then she noticed that the authors mentioned using the new MaPS system from MaRDI, which promises to solve exactly this problem. MaPS allows users to create and share runtimes, a kind of container package similar to a virtual machine or a Docker container, but tailored specifically to academic researchers.

Dr. Numbs read the documentation and decided to give it a try. She only had to install the MaPS command-line program on her Linux computer. Then she listed all the available runtimes in the MaRDI repositories and, sure enough, the runtime corresponding to her assigned paper was there. She then deployed the runtime, which means that MaPS automatically downloaded all the necessary files. A directory tree appeared in Dr. Numbs's home folder, mimicking the filesystem of the operating system that the authors had packaged, including all the programs, compilers, and libraries that the authors used in their work.

Dr. Numbs could then run the runtime, meaning that she got a shell on the virtual system* as if she were there (like logging into a system via SSH). Once in that virtual environment, she listed the files and, sure enough, found the same files she could get from the GitHub repository. She could then follow the tests and benchmarks from the paper by trying them out live, almost like following a tutorial. MaPS allows editing any file within the runtime, so Dr. Numbs saved her own notes and the results of her tests inside it. Programs within the runtime cannot change anything in the host operating system, which makes it safe, but the host OS can place files in the guest filesystem and retrieve files from it.

Dr. Numbs was delighted with the system. She could not only reproduce all the results claimed in the paper but also try out her own examples, including difficult cases, and probe the limitations of the new algorithm. This greatly improved her ability to evaluate the implementation of the algorithm and, even more, to give feedback on it. For instance, she found a bug in the code that appeared in some edge cases. To preserve her anonymity as a referee, she decided to report it to the authors via editorial correspondence instead of opening a GitHub issue. She thought that the bug was fixable, so she encouraged the authors to fix it before publication.

After a few days of working on the review and a couple of reports and answers exchanged anonymously with the authors, she was done with this project. She decided to keep the runtime on her computer for archiving purposes, since it did not occupy a lot of space and she had some notes and test data of her own in it. She could, however, delete it completely and download it again from scratch if she needed it in the future. It is indeed a useful system for archiving, since it ensures that the code can still be executed in the future, even on different host systems. Dr. Numbs was most happy that her work computer was not polluted with any installed program or library. The only exception was the MaPS command-line tool, which, she suspected, she would use again.

2. Packing your data

Bernard Vir is a PhD student in computational biology, studying protein folding problems. He works with his advisor in a quite prolific research group. He is in his second year and has been getting used to the workflow in his laboratory. He receives experimental data and feeds it to machine learning models, training a system to predict protein folds and some of their biological consequences. He needed to get familiar with the software tools his advisor and the group use. There was a core model, programmed in R, using programs and tools dating back 15 years that a now-retired professor started. Bernard’s advisor uses it too, so they need to keep their work compatible with that legacy model for practical purposes. Another colleague in his department made a more modern version of the system in R, but it is not fully compatible. Nevertheless, a substantial part of the group uses the new version. On top of that, Bernard prefers using TensorFlow and Python for most of the machine learning tasks. Then there are graphical interfaces, accessed via the web, made in Node and JavaScript. Bindings and interoperability layers make everything work together, but the system is tricky to install and get working.

One of the tasks assigned to Bernard is to organize and clean up all the software systems in his lab as part of a broader Research Data Management Plan. He decides to use MaPS. The group publishes about 6-7 papers per year (across all its researchers), and the majority of these papers contain some software simulation or computational results. Of course, there is quite a bit of discussion in the department, with some people wanting to keep the legacy toolchain and others wanting to use the more modern one. Since he did not want to force people to change their tools, Bernard decided to create several base runtimes in MaPS for them.

When creating a new runtime from scratch (initializing a runtime), MaPS proposes a fairly minimal Debian Linux image by default. You can initialize a runtime and then start it in sandbox mode, meaning that any changes you make are intended to be exported. Bernard created two runtimes, one called Proti1 with the legacy core libraries and another called Proti2 with the modern libraries. Each of these runtimes has its own tools, interpreters, and libraries. Each researcher in the group who authors a paper can then take either of the two runtimes and extend it with the code developed in their research, using R, Python, or any system of their preference.

Researchers do their daily coding and testing in their own (host) operating systems, but anything that is meant to be associated with a publication must eventually be packed in a MaPS runtime and tested in the guest OS. Every time a researcher in the group is about to publish a paper, they must freeze a runtime by doing a commit, so that its state is saved. This runtime will stay associated with the article. To share runtimes more easily, you can upload the runtime to a remote repository. Bernard contacted the MaRDI team to set up their own MaPS repository at their university** for the runtimes of their research group, since they expect to add 6-7 runtimes per year. The permanent link to the repository is included in the published article, so article and runtime stay connected. An open-access version of the article is included in the runtime for convenience.

This new system allows all the members of Bernard’s research group to easily exchange their code together with a runnable environment, regardless of their preference for the legacy or the modern core model. Readers of the articles will also benefit from access to the same runtimes the authors used. Any runtime the group produces must be accompanied by a metadata description of which tools are installed in it and how it was set up, so that the toolchain can be replicated outside that runtime if necessary. Finally, the repository will also serve as an archive and backup library, in addition to the git repository they already use.
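
To illustrate what such a metadata description could contain, here is a minimal Python sketch that records the versions of a few tools and the setup steps as JSON. It is not part of MaPS and does not use any MaPS command or file format; the function names `tool_version` and `describe_runtime`, the field names, the example tool list, and the setup step are all assumptions made for this example.

```python
import json
import shutil
import subprocess
import sys
from datetime import datetime, timezone


def tool_version(command, version_flag="--version"):
    """Return the first line a tool prints for its version flag.

    Returns None when the tool is not found on the PATH.
    """
    if shutil.which(command) is None:
        return None
    result = subprocess.run(
        [command, version_flag],
        capture_output=True,
        text=True,
        check=False,
        timeout=30,
    )
    # Some tools print their version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


def describe_runtime(name, tools, setup_steps):
    """Build a JSON-serializable description of the current environment."""
    return {
        "runtime": name,
        "created": datetime.now(timezone.utc).isoformat(),
        "tools": {tool: tool_version(tool) for tool in tools},
        "setup_steps": list(setup_steps),
    }


if __name__ == "__main__":
    # Hypothetical example: the tool list and setup step are placeholders.
    description = describe_runtime(
        name="Proti2",
        tools=["python3", "R", "node"],
        setup_steps=["hypothetical: installed the modern core libraries"],
    )
    json.dump(description, sys.stdout, indent=2)
    print()
```

Run inside the environment to be described, the script prints a small JSON document that could be stored next to the code. Tools that are not installed are recorded as `null`, which makes gaps in the toolchain visible.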

Bernard explains the system to his department by organizing some training sessions. While this is a new tool to learn, they all agree that they need something better than their current situation. They also have other tech training sessions in which newcomers (and not-so-new researchers) learn how to use Git, LaTeX, data management protocols, and other tools and good practices to handle their research data. Having clear answers and good tools to address these issues makes all the researchers’ lives a bit easier.

You can find all the information about MaPS and try it out on the project's official page. For all your technical questions, you can refer to the main developer of MaPS, Aaruni Kaushik.

* It is possible to automatically run a program or script immediately after launching a runtime; look for the manifest file in the documentation. This can be useful in some contexts when the packaged program is interactive. For non-interactive scripts, the user may find the shell interface more useful for exploring the scripts and running them manually.

** Currently, only one MaPS repository exists, hosted at http://repo.oscar-system.org/. Multiple, decentralized repositories are possible if needed, but the default one does not require extra configuration. Contact the MaPS team to ask more about it.

Data Dates

The video is available under the CC BY 4.0 license. You are free to share and adapt it, provided you credit the author (MaRDI).

In Conversation with Aaruni Kaushik

Aaruni Kaushik, the principal developer of the MaRDI packaging system MaPS, introduces the service and gives a demo. In the interview with Christiane Görgen, they also discuss its usefulness, future features, and how it will be rolled out to the community.

A best practice example of FAIRifying mathematics

In its first year, this newsletter sported a whole series of issues discussing how to make your own mathematics findable, accessible, interoperable, and reusable, including how to find, access, latch onto, and reuse existing mathematical results. MaRDI has now supported a beautiful mathematical library, the small phylogenetic trees, in the process of becoming FAIR. This database was set up in the early 2000s as part of the vanguard movement to make mathematical results available online in a unified format. These days, however, the website display has sadly become outdated, the code documentation falls short of modern standards, and it is hard for an outsider to judge whether the content is up to date with developments in the field of algebraic phylogenetics.

To tackle these issues, a team of original maintainers, a user and expert mathematician, a data steward, and an information specialist has come together. In workshops and discussions with all interest groups, MaRDI, software developers, and maintainers of similar databases, they shaped the following plan: to set up a modern new website, move the mathematical results into a software package, and provide a report on lessons learned. This three-fold strategy has proven hugely successful.

The new version of the small phylogenetic trees library is now available at algebraicphylogenetics.org. Its source code is available under an MIT license and hosted long-term on GitLab. All data on the website is set up to be as machine-readable as possible, with displayed results imported from the software package, pictures of small phylogenetic trees provided in TikZ, and data serialized for download. The software package AlgebraicPhylogenetics.jl builds on the interoperable computer algebra system OSCAR, which is based on Julia; it is well documented and provides all data on small phylogenetic trees in an immediately usable format. The project is community-based and makes collaboration easier than the original solution did.

An entertaining best-practice report documents which decisions were made and why, presents a research-data management plan for the whole project, and describes the obstacles regarding funding, hosting, and career trajectories that were encountered and overcome.

MaRDI services up and running

With the start of the new year, the MaRDI Portal's landing page also got a new look. Go check out portal.mardi4nfdi.de, and you will find a comprehensive overview of the 15 MaRDI services that are currently up and running. All of these services have their own documentation page with a description, maintainer info, version type, and web links. Some, like the HelpDesk, have been in existence since MaRDI's launch; others, like the MaRDI station for outreach or the consultancy for best practices, are new.

Tools include MaPS (the subject of our main article), the MediaWiki Math Search Extension for semantic formulae, the RDMO plugin MaRDMO as well as MaRDIFlow for documenting and realizing workflows in mathematical modeling, MaRDI Open Interfaces for numerical solvers, the interoperable R-based application mlr3 for open-source machine learning, and the .mrdi file format for serialization.

Databases include MathAlgoDB and MathModDB for mathematical algorithms and models, respectively, with the underlying (MaRDI) knowledge graph and its query service.

Thus, the MaRDI Portal now provides an entry point for users to contact the service they are most interested in directly.

MaRDI meets information specialists round two

In mid-March, eighteen data stewards, librarians, and mathematicians met in Leipzig for the second installment of MaRDI's workshop series tailored to information specialists. This time around, the noon-to-noon event featured talks by re3data representative Robert Ulrich from the KIT library and ex-MaRDI employee Christian Himpe of the University and State Library of Münster, as well as demonstrations by Marco Reidelbach and Aurela Shehu showcasing the MaRDI services MaRDMO and MathModDB. These and the newest developments of the MaRDI Portal sparked a lot of interest in the audience, leading to discussions about metadata formats, the technology stack behind mathematical databases, and the automated generation of wiki pages for mathematics. Björn Schembera and Christiane Görgen complemented these discussions with a very first presentation of MaRDI's new train-the-trainer program, which adds maths-specific content and insight into the mathematical research process to a regular RDM curriculum, and collected valuable feedback. In particular, the first two workshop days then gave rise to a large number of possible barcamp topics, with training, features of the MaRDI Portal, and the role of pure mathematics and reproducibility being the most popular.

NFDI4friends

NFDI network meeting "AI as an enabler for science"

The third edition of the NFDI network meeting, bringing together all consortia active in the Berlin-Brandenburg region, will take place on May 21st at the Weizenbaum Institute in Berlin. The event will feature a combination of invited presentations and demonstrations from consortia and the broader community, providing a platform for discussions and networking.

More information:

in English

New lecture series on social media data

The working group Social Media Data in Research Practice, an initiative of NFDI4Culture in collaboration with BERD@NFDI, KonsortSWD, and Text+, continues its Show & Tell lecture and discussion series. In 2025, the series will focus on the handling of social media data in research on right-wing extremism and democracy studies. The first Zoom session was on February 28; upcoming dates are April 25 and June 20.

More information:

in English

Services Roadshow by Base4NFDI

These two-day online sessions focus on the basic services currently being developed within the NFDI community. Day 1 (May 22) is aimed at infrastructure providers, while Day 2 (May 27) is designed for researchers and users. Participants can expect presentations from various NFDI consortia, case demonstrations, and Q&A sessions.

More information:

in English

Data Week Leipzig 2025

Data Week Leipzig 2025 will take place from 10 to 13 June 2025 and will bring together experts from science, business, and society to explore the diverse perspectives of data and its applications. The event will feature workshops, training sessions, and hackathons. Many sessions will take an interdisciplinary approach, focusing on key topics such as data, digitalization, and artificial intelligence.

More information:

in German
in English