# Math & Data Quarterly

## News and insights into the realm of mathematical research data

**1st issue**

Welcome to the very first issue of the MaRDI (Mathematical Research Data Initiative) Newsletter. Research data in mathematics comes in many different flavors: papers, formulae, theorems, code, scripts, notebooks, software, models, simulated and experimental datasets, libraries of math objects with properties of interest... In short, the list is as long as mathematical research data is diverse.

Unfortunately, there is no straightforward or standard way to make these digital objects available for future generations of researchers. Availability, however, is not the only concern. In an ideal world, mathematical research data would be

**FAIR:** **F**indable, **A**ccessible, **I**nteroperable, and **R**eusable.

MaRDI is part of the German National Research Data Infrastructure (NFDI) and is dedicated to building infrastructure that makes mathematical research data FAIR. Work on solutions for some of the major problems we face today started last year, from understanding the state-of-the-art technology of a field, all the way along the research pipeline, to establishing standards for peer review. As part of this process, it is especially important for us to engage you, the mathematics community, early on, so have a look at the list of our upcoming workshops! This issue of the Newsletter is dedicated to the F in FAIR: to findability and what this means for mathematics.

licensed under CC BY-NC-SA 4.0.

We explore two aspects of what Findable means. First, we will focus on how to find data created by other researchers and then we discuss how to make sure your own data is findable for the math community. In each newsletter, we will also publish an episode of our interview series on math and data: "Data Dates", introduce you to the people behind the MaRDI project, and offer some reading recommendations on the topic.

### Have you ever…

- tried searching for a formula?
- seen a reference to a homepage which is long gone?
- put code on your personal webpage because you didn't know how and where else to publish it?
- browsed through the publications of your coauthor's coauthors looking for that one result which you almost remembered but not quite?
- not been able to find something you needed to keep going in the research direction you fancied?

**Then you are not alone!**

To find out where people search for math data, we ask you to answer our very short multiple-choice survey:

**Where do you look for mathematical research data?**

You will see the results right after submitting your answer.

### How to find data?

**Abstract**

On an almost infinite Internet, where do you go to find research data? Which are the "hubs" concentrating resources? What is MaRDI proposing to help on the Findability front?

**Data and FAIR principles**

Modern science, including mathematics, relies increasingly on data. We use the word "data" in a broad sense: it includes literature (books and articles), but also databases of experimental data, simulation-generated data, taxonomies (exhaustive listings of the examples of a given category of objects), workflows and frameworks (for instance software stacks with all the programs used in a research project), etc. Even a single formula can be considered a piece of data.

To set up good practices in the scientific community, a group of researchers published the so-called FAIR principles in 2016. Those principles are Findability, Accessibility, Interoperability, and Reusability.

In this short article we introduce the Findability principle, with a focus on the mathematical sciences and on the related efforts under way in the MaRDI project.

For a more comprehensive introduction to the FAIR principles, you can visit the Go-Fair portal.

**Findability**

Findability is the first of the FAIR principles, and it is also the most basic one: if you can't find some data, you can't re-use it in any way, and it is as if it did not exist.

When we try to find data, we may face two situations: either we know that something exists and we are looking for it specifically; or we don't know exactly the result we want, and we look for anything related to a search term. In the first case, rather than finding that data, our problem is locating it somewhere in the physical or virtual space. In the second case, our problem is to examine all the data available (in a certain catalog) for a certain characteristic that we are interested in.

Both problems are solved by using a few tools: Firstly, each piece of data needs to have a unique reference or identification, so that we can build lookup tables for the location of each dataset. Secondly, together with the ID, we need other metadata that describes the data with some useful information (type, subject, authors, etc). Thirdly, we need to build comprehensive catalogs that gather all the metadata of the datasets, and to build search engines, which are algorithms to retrieve things from the catalogs.

Thus, the Findability principle can be expanded to some concrete recommendations:

- (Meta)data is assigned a globally unique and persistent identifier.
- Data is described with rich metadata.
- Metadata clearly and explicitly includes the identifier of the data it describes.
- (Meta)data is registered or indexed in a searchable resource.
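The three tools above (identifier, metadata, searchable catalog) can be illustrated with a minimal Python sketch. It is only a toy model under stated assumptions: the class names `Record` and `Catalog`, the identifier `example:0001`, the author name, and the `example.org` location are all hypothetical placeholders, not part of any MaRDI service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Metadata describing one dataset; the identifier is part of the metadata."""
    identifier: str            # globally unique, persistent ID (placeholder here)
    title: str
    authors: tuple[str, ...]
    keywords: tuple[str, ...]
    location: str              # where the data currently lives


class Catalog:
    """A searchable resource in which metadata records are registered."""

    def __init__(self) -> None:
        self._by_id: dict[str, Record] = {}

    def register(self, record: Record) -> None:
        # Identifiers must be unique, so reject a second use of the same one.
        if record.identifier in self._by_id:
            raise ValueError(f"identifier already in use: {record.identifier}")
        self._by_id[record.identifier] = record

    def locate(self, identifier: str) -> str:
        """First situation: we know the item exists and need its location."""
        return self._by_id[identifier].location

    def search(self, term: str) -> list[Record]:
        """Second situation: we look for anything related to a search term."""
        needle = term.lower()
        return [
            r for r in self._by_id.values()
            if needle in r.title.lower()
            or any(needle in k.lower() for k in r.keywords)
        ]


catalog = Catalog()
catalog.register(Record(
    identifier="example:0001",                      # hypothetical identifier
    title="Icosahedral grid of the Earth",
    authors=("A. Author",),                         # hypothetical author
    keywords=("grid", "climate"),
    location="https://example.org/data/0001.zip",   # hypothetical location
))
```

Here `catalog.locate("example:0001")` covers the first situation (a lookup table from identifier to location), while `catalog.search("grid")` covers the second (scanning the metadata for a characteristic of interest).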

**Where people go**

For mathematical articles, books, and other classically published works, a reference includes title, author, year, etc. While this is easily usable and readable by a human, it is not always consistent in format, and it does not provide a means to locate and access that information.

The two de facto standard catalogs that collect mathematical literature, and also assign a unique identifier, are Mathematical Reviews (archived in MathSciNet by the American Mathematical Society) and Zentralblatt Mathematik (archived in zbMath by FIZ Karlsruhe - Leibniz Institute). Those unique identifiers (MR number and Zbl number) are helpful for referencing a piece of mathematical literature, and those catalogs are a main tool for finding works in a specific math domain, but they are much less comprehensive when it comes to other data (databases, media, online resources, etc). They also have the drawback that authors can't control the existence or the metadata of an entry. In addition, MathSciNet is a subscription-based service, although MR Lookup offers limited services to non-subscribers. zbMath became zbMATH Open in 2021 and requires no subscription.

Another notable mention is arXiv, which is a de-facto standard platform for pre-publications. Here the actual paper is offered publicly (which concerns the Accessibility principle), and any work in arXiv also gets a unique ID and can be found via the catalog search. However, the focus is on literature and there is very limited support for datasets related to a paper.

One of the most standardized identifiers for online resources is the Digital Object Identifier (DOI), which references any digital object. Unlike a URL, the DOI is linked to a particular object and not to the server or website where it is hosted. The DOI website resolves the DOI to the most up-to-date URL for accessing the data, so the DOI also serves as a locator in addition to being a unique identifier. Usually, publishers assign a DOI to new publications, but authors can also obtain a DOI from other registration agencies. Some open repositories offer free DOI registration (for example Zenodo, see below).
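As a small illustration of the DOI acting as both identifier and locator, the following Python sketch turns a DOI written in several common ways into the address of the public resolver at doi.org. The function names and the validation pattern are assumptions made for this example; the pattern is a loose heuristic, not the official DOI syntax.

```python
import re

DOI_RESOLVER = "https://doi.org/"

# Loose heuristic (an assumption of this sketch): "10.", a numeric
# registrant code, a slash, and a non-empty suffix without spaces.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes people commonly write in front of the bare DOI.
KNOWN_PREFIXES = ("https://doi.org/", "http://doi.org/", "doi.org/", "doi:")


def normalize_doi(raw: str) -> str:
    """Strip a leading resolver URL or 'doi:' label and return the bare DOI."""
    doi = raw.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"does not look like a DOI: {raw!r}")
    return doi


def resolver_url(raw: str) -> str:
    """Return the resolver address that redirects to the object's current URL."""
    return DOI_RESOLVER + normalize_doi(raw)


print(resolver_url("doi:10.5281/zenodo.6538815"))
```

Opening the resulting address in a browser lets the resolver redirect to wherever the object is hosted today, which is what keeps a DOI reference valid when a hosting server changes.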

When it comes to non-literature data, the landscape is much patchier. zbMath has a sister project, swMath, which is a catalog of mathematical software packages (computer algebra, numerics, etc) together with cross-references to the articles in zbMath that cite them. zbMath also features a full-text search of formulas, which is being improved within the MaRDI framework.

For pure databases (experimental data, simulation data, etc), there is not yet a universally accepted repository in mathematics, in contrast with other fields where one can find standardized catalogs of proteins or genetic encodings, for example.

Zenodo is a general purpose repository for open data, which hosts quite a few mathematical datasets. See our article "Publishing on open repositories" where we talk more about Zenodo.

So far the focus is very strongly on publications, not on results and mathematical research data in general. You may, for example, want to find a theorem or a formula rather than a publication: a specific formula related to Bessel functions, say, or a specific representation of a specific group. There exist some special, curated collections that are also findable. Examples include the SuiteSparse Matrix Collection and the On-Line Encyclopedia of Integer Sequences (OEIS).

**What is MaRDI proposing concerning Findability?**

Unfortunately, most data-based mathematical research is still published either without the datasets or with the datasets hosted only on university servers and accessible only through the personal websites of the researchers involved.

The MaRDI project aims to, on the one hand, provide the necessary infrastructure to properly publish data (according to the FAIR principles), and on the other hand, to spread awareness within the math research community of the problems and proposed solutions that publishing data entails.

Here we will name just a few of the initiatives related to the Findability principle. You can read the funded proposal of MaRDI here.

The Scientific Computing Task Area (TA2) is preparing a benchmark framework to compare existing and new algorithms and methods for solving specific problems. For instance, there are several dozen methods to solve a linear system Ax=b, with different performance and different technology stacks, depending on the size of the matrix A, whether it is sparse or dense, whether we look for exact or approximate solutions, etc. So far there is no centralised catalog where a "user" (for instance a computational biologist) can go to choose the best method for their particular application. This catalog and benchmark will make finding symbolic and numerical algorithms much easier, and it aspires to be a major reference when looking for such algorithms.

The tool for this is a knowledge graph of numerical algorithms. A knowledge graph is an abstract representation of a set of concepts, objects, events, or anything related to a domain of study, as nodes, together with formal relations between them (edges), that can be read unambiguously by humans or computers. The biggest collective effort to build a knowledge graph is Wikidata. In this case, the nodes of the graph will be the algorithms themselves as concepts, but also papers related to them, software packages implementing them, benchmarks, and connections to other databases. It will then be possible to navigate the knowledge graph to find semantic information, such as which algorithms extend a given one, where we can find implementations, how they perform in comparison, etc.
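To make the idea of nodes and typed edges concrete, here is a toy Python sketch that stores a graph as subject-relation-object triples and answers two of the questions mentioned above. It is not the MaRDI knowledge graph: the class `KnowledgeGraph` and the relation names `extends`, `solves`, and `implemented_in` are assumptions made for this illustration.

```python
from collections import defaultdict


class KnowledgeGraph:
    """A tiny triple store: nodes are strings, edges are named relations."""

    def __init__(self) -> None:
        self._out = defaultdict(set)   # (subject, relation) -> objects
        self._in = defaultdict(set)    # (relation, object)  -> subjects

    def add(self, subject: str, relation: str, obj: str) -> None:
        self._out[(subject, relation)].add(obj)
        self._in[(relation, obj)].add(subject)

    def objects(self, subject: str, relation: str) -> list[str]:
        """Follow an edge forwards, e.g. 'where is X implemented?'."""
        return sorted(self._out.get((subject, relation), ()))

    def subjects(self, relation: str, obj: str) -> list[str]:
        """Follow an edge backwards, e.g. 'which algorithms extend X?'."""
        return sorted(self._in.get((relation, obj), ()))


kg = KnowledgeGraph()
kg.add("preconditioned conjugate gradient", "extends", "conjugate gradient")
kg.add("conjugate gradient", "solves", "symmetric positive definite linear system")
kg.add("conjugate gradient", "implemented_in", "SciPy")
kg.add("GMRES", "solves", "nonsymmetric linear system")
kg.add("GMRES", "implemented_in", "SciPy")

# Which algorithms extend the conjugate gradient method?
print(kg.subjects("extends", "conjugate gradient"))
# Where can we find an implementation of GMRES?
print(kg.objects("GMRES", "implemented_in"))
```

Because every edge carries an explicit relation name, the same structure can be read by a person browsing it and by a program querying it, which is the property the paragraph above describes.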

Another MaRDI effort aimed at Findability is the so-called Mathematical Entity Linking (MathEL), a way to extract and compare conceptual information from mathematical formulas. A particular equation (for instance the Klein-Gordon equation or the Einstein field equations of general relativity) can be expressed in many different forms: variables can be named differently, notations for derivatives, tensors, etc. may differ, groupings and substitutions can occur, and so on. The MathEL sub-project aims to retrieve the conceptual information of formulas, to propose annotation standards for introducing semantic information into formulas (for instance by referencing a Wikidata node or another knowledge graph node), to mine large corpora of data (for instance the zbMath catalog or the arXiv repository), and to create user interfaces, such as question-answering engines, for retrieving concept and source information.

We can share a sneak peek into the formula search engine that is under development. This phony wiki page contains a couple of math formulas. This search portal should be able to find those formulas when queried in the search box. You can try " V=4/3 \pi r^3 " or " V=\frac{4}{3} \pi r^3 " and it will find the test formula. Also, with " V = 4/3 \pi ?s^3 " you can find variable substitutions. However, other common rewritings are not yet recognized, such as " V = \frac{4\pi}{3} r^3 ". Eventually it will be possible to find formulas whose LaTeX expressions use operators with the same meaning but different notation.

In subsequent articles we will present other tasks being carried out within MaRDI that are more aligned with the other FAIR principles (for instance open interfaces, or descriptions of workflows).
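The behaviour described above can be imitated with a few lines of Python. This is emphatically not how the MaRDI search engine works; it is a toy sketch that rewrites simple `\frac{a}{b}` as `a/b`, splits the formula into tokens, and treats a `?name` token in the query as a wildcard for exactly one token. The function names, and the assumption that the test formula is stored as `V=\frac{4}{3} \pi r^3`, are ours.

```python
import re

# \frac{a}{b} with brace-free arguments; nested fractions are out of scope.
FRAC = re.compile(r"\\frac\{([^{}]*)\}\{([^{}]*)\}")
# A token is a LaTeX command, a ?wildcard, or any single non-space character.
TOKEN = re.compile(r"\\[A-Za-z]+|\?[A-Za-z]+|\S")


def tokenize(latex: str) -> list[str]:
    """Rewrite simple fractions as a/b, then split into tokens."""
    return TOKEN.findall(FRAC.sub(r"\1/\2", latex))


def matches(query: str, formula: str) -> bool:
    """True if the query equals the formula token by token, where each
    ?name wildcard stands for one token and is used consistently."""
    q, f = tokenize(query), tokenize(formula)
    if len(q) != len(f):
        return False
    bindings: dict[str, str] = {}
    for q_tok, f_tok in zip(q, f):
        if q_tok.startswith("?"):
            if bindings.setdefault(q_tok, f_tok) != f_tok:
                return False
        elif q_tok != f_tok:
            return False
    return True


stored = r"V=\frac{4}{3} \pi r^3"   # assumed form of the test formula

print(matches(r"V=4/3 \pi r^3", stored))             # same formula, inline fraction
print(matches(r"V = 4/3 \pi ?s^3", stored))          # wildcard for the variable
print(matches(r"V = \frac{4\pi}{3} r^3", stored))    # reordered factors
```

The first two queries match and the third does not, because after normalization the tokens `\pi` and `/` appear in a different order. Recognizing such rewritings needs an understanding of the formula's meaning rather than its token sequence, which is the harder problem MathEL addresses.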

**Abstract**

We had some data from a project. How can we prepare it according to FAIR principles? Follow us in this attempt at being FAIR on the first try.

**Publishing data in open repositories**

This is a short story on how we set up data in a public repository (almost painlessly).

We are IMAGINARY, a math communication association and part of the MaRDI consortium, and creating math exhibitions is our main activity. In a recent project about climate change, we gathered some data about Earth grids. The latitude and longitude grid is the most common way of creating a reference system on the Earth, but there are other ways to divide the Earth into small regions that can be addressed and used, which are particularly useful in numerical models.

We contacted a couple of climate researchers, who prepared for us the sets of geographic nodes and edges that make up those grids. Then, one of our collaborators took that data and converted it into a 3D-printable model by adding thickness to the edges and checking the structural integrity of the ensemble, so that it could become a physical object. Finally, a 3D printing company made the objects and we used them in our exhibition.

Being a company committed to Free and Open Source licenses, we explored how to make that data FAIR.

First, this dataset is not worth a research article in a journal, since it is not a scientific advance. At the same time, it is specific enough to be (probably) the first time this data is gathered and processed this way, and it could be useful to others to re-use.

**Git (GitHub, GitLab)**

Since we are dealing with software files, the most convenient platform for developing and publishing them is GitHub. Git is the most popular version control software, and any organization of code should start there. GitHub and GitLab are probably the most popular platforms for hosting Git projects. However, as a publishing tool, such a platform should be considered almost a kind of personal website (you can actually host and serve a Git repository on your own server), and it is a live working tool, so the published data can change at any time. GitHub does not offer a guarantee of stability, a standardized identifier, or a good way to search for and find your data. Also, it keeps a record of all previous states, so all the dirty work is out in public.

Our GitHub page was our collaboration tool within the team. It was not intended as a publication method; we simply left it publicly available.

**Zenodo**

Zenodo is a public, general-purpose open access repository. It is used by researchers in any scientific area to make a copy of their works findable and accessible to the public. Those works can be articles or books, either as pre-prints or already published by traditional publishing houses, but also databases, data files, images, or any digital asset that their research relies upon.

Zenodo can issue a Digital Object Identifier (DOI) if the work does not already have one. In this case, the DOI contains the string "zenodo" (for instance 10.5281/zenodo.6538815).

Zenodo is hosted by the CERN infrastructure and funded by the European Commission amongst others.

Creating our entry on Zenodo was not difficult.

Firstly, we created an account. A valid email is enough, but you can link it to your ORCID identifier to uniquely determine the author(s).

Secondly, we made a new upload draft. You can choose the type of document (publication, poster, dataset, image, video, software, physical object, etc) and fill in the form with title, authors, publication date (can be in the past), description, and several other fields.

For the authors, we added the ORCID of those who had it. We also used "IMAGINARY" as an author even if it is not a physical person but a company.

We requested a new DOI since we did not have any. The DOI can be "reserved" during the draft process, so you know it in advance (and can use it in the documents you prepare).

For the actual content, we used a zip file with the master branch of the GitHub repository. You can also link your Zenodo account to your GitHub account, so that whenever you make a "release" on GitHub, a snapshot is automatically published on Zenodo.

Finally, we submitted the draft. Once published, you can't add, delete or modify the files associated with a DOI (that is the point of the DOI), but you can make new versions with a new DOI. Thus, you must double- and triple-check before submitting. In case you make an erroneous submission, the only fix is to write an email to Zenodo administrators explaining yourself and asking for a fix.
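We filled in the web form by hand, but the same information can be prepared as structured metadata. The Python sketch below only assembles such a record and sends nothing over the network. The field names (`upload_type`, `creators`, `orcid`, `prereserve_doi`) follow our reading of Zenodo's deposit API and should be checked against its current documentation; the title, description, personal name, and ORCID are placeholders, not the values of our actual entry.

```python
import json


def build_deposit(title, description, creators, upload_type="dataset"):
    """Assemble the metadata for a new draft.

    `creators` is a list of (name, orcid) pairs; orcid may be None,
    for instance when the author is an organization.
    """
    creator_entries = []
    for name, orcid in creators:
        entry = {"name": name}
        if orcid is not None:
            entry["orcid"] = orcid
        creator_entries.append(entry)
    return {
        "metadata": {
            "title": title,
            "upload_type": upload_type,
            "description": description,
            "creators": creator_entries,
            # Ask for the DOI to be reserved while the entry is still a draft.
            "prereserve_doi": True,
        }
    }


payload = build_deposit(
    title="Earth grids (placeholder title)",
    description="Nodes and edges of several Earth grids (placeholder text).",
    creators=[
        ("IMAGINARY", None),                     # organization, no ORCID
        ("Doe, Jane", "0000-0000-0000-0000"),    # placeholder person and ORCID
    ],
)
body = json.dumps(payload, indent=2)
print(body)
```

Keeping the metadata in a file next to the data has a practical benefit given the warning above: the record can be reviewed and corrected by the whole team before anything is submitted and frozen under a DOI.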

**In Conversation with Cedric Villani**

In the first episode of the interview series Data Date, Cedric Villani joins Christiane Görgen for a brief exchange of thoughts about Math & Data.

**OpenML hackathon at Dagstuhl castle**

Sebastian Fischer and Oleksandr Zadorozhnyi, of the MaRDI task area Statistics and Machine Learning, participated in an OpenML hackathon held in late March at the headquarters of the Leibniz Center for Informatics at Dagstuhl, Germany.

OpenML is an open-source platform for sharing datasets, algorithms, experiments, and results. The hackathon was initiated by Bernd Bischl, one of the key players behind OpenML and a Co-Spokesperson in MaRDI. Researchers from other parts of Germany, France, the Netherlands, Poland, and Slovenia were present to discuss topics such as data quality on OpenML, an extension of its established services to new data formats, and new computational tasks.

The review article "Datasheets for datasets" prompted fruitful exchanges on how to improve data and metadata quality in the future. In particular, support for non-tabular data formats such as images was discussed and will now be added by transitioning from the Attribute-Relation File Format (ARFF) to Parquet. The eight task types available so far, including regression, classification, and clustering, will be extended with new tasks that are typical for graphical modeling. As this is one of the main use cases and an important topic for both Sebastian and Oleksandr, they discussed several points with Jan van Rijn: the problem of estimating the structure of a graphical model from a given dataset, how to embed it into the current set of tasks available on OpenML, the addition of different evaluation measures or criteria for model selection, and the storage of graph-specified datasets within the OpenML framework. The evaluation measures and model selection criteria allow estimated graphs to be compared with a given ground truth, a procedure that is not normally part of the ML workflow.

Sebastian also presented their collaborative work with Michael Lang on the mlr3oml R package. This package connects the OpenML platform to the open-source machine learning mlr3 package in R, another crucial aspect of the MaRDI task area.

The hackathon was rounded out with social activities like a walk through the forest. The good weather aside, special thanks need to be given to Joaquin Vanschoren, the OpenML founder, whose supply of water to the whole group during the hike was the other reason why everyone made it back to the castle in good spirits!

All in all the week in Wadern was a pleasant and fruitful one for all the participants.

We will also be introducing you to the people who shape MaRDI with their expertise and vision for mathematical research data. They will appear in a series of "Making MaRDI" interviews available via our Twitter account. Stay tuned!

**Call for seed funds 2023**

These funds support scientists from all fields of research within engineering, relating to the development and implementation of innovative ideas in data management. The grant is equivalent to the funding of a full-time doctoral position for one year. If necessary, the funding can be split between project partners.

More information in English or in German.

To learn about the Nationale Forschungsdateninfrastruktur, the community of which MaRDI is just one small part, read the 2021 article by Nathalie Hartl, Elena Wössner, and York Sure-Vetter in Informatik Spektrum. See doi.org/10.1007/s00287-021-01392-6

Christiane Görgen and Claudia Fevola explain in a short review article the role repositories can play in the MaRDI infrastructure. They use MathRepo as an example, a small math research-data repository hosted at the Max Planck Institute for Mathematics in the Sciences in Leipzig. See arxiv.org/abs/2202.04022

The interim report of the European Commission Expert Group on FAIR data discusses how to turn FAIR into reality. See doi.org/10.2777/1524

Thomas Koprucki and Karsten Tabelow have been two of the driving forces in the early stages of MaRDI. Together with Ilka Kleinod they discussed mathematical models as an important type of mathematical research data in a 2016 article for the Proceedings in Applied Mathematics and Mechanics: https://doi.org/10.1002/pamm.201610458

**2nd issue**

Welcome to the very first issue of the MaRDI (Mathematical Research Data Initiative) Newsletter. Research data in mathematics comes in many different flavors: papers, formulae, theorems, code, scripts, notebooks, software, models, simulated and experimental datasets, libraries of math objects with properties of interest... In short, the list is as long as mathematical research data is diverse.

Unfortunately, there is no straightforward or standard way to make these digital objects available for future generations of researchers. Availability, however, is not the only concern. In an ideal world, mathematical research data would be

**FAIR:** **F**indable, **A**ccessible, **I**nteroperable, and **R**eusable.

MaRDI is a part of the German National Research Data Infrastructure (NFDI) and it is dedicated to building infrastructures to make mathematical research data FAIR. Work on solutions for some of the major problems we face today started last year; from understanding the state-of-the-art technology of a field all the way along the research pipeline to establishing standards for peer review. As part of this process it is especially important for us to engage you, the mathematics community, early on so have a look at the list of our upcoming workshops! This issue of the Newsletter is dedicated to the F in FAIR: to findability and what this means for mathematics.

licensed under CC BY-NC-SA 4.0.

We explore two aspects of what Findable means. First, we will focus on how to find data created by other researchers and then we discuss how to make sure your own data is findable for the math community. In each newsletter, we will also publish an episode of our interview series on math and data: "Data Dates", introduce you to the people behind the MaRDI project, and offer some reading recommendations on the topic.

### Have you ever…

- tried searching for a formula?
- seen a reference to a homepage which is long gone?
- put code on your personal webpage because you didn't know how and where else to publish it?
- browsed through the publications of your coauthor's coauthors looking for that one result which you almost remembered but not quite?
- not been able to find something you needed to keep going into the research direction you fancied?

**
Then you are not alone!**

To find out, where people search for math data, we ask you to answer our very short multiple-choice survey:

**Where do you look for mathematical research data?**

You will see the results right after submitting your answer.

### How to find data?

**Abstract**

On an almost infinite Internet, where do you go to find research data? Which are the "hubs" concentrating resources? What is MaRDI proposing to help on the Findability front?

**Data and FAIR principles**

Modern science, including mathematics, relies increasingly on data. With the word "data", we mean a broad significance, including literature (books and articles), but also databases of experimental data, simulation-generated data, taxonomies (exhaustive listings of the examples of a given category of objects), workflows and frameworks (for instance software stacks with all the programs used in a research project), etc. Even a single formula can be considered a piece of data.

To set up good practices in the scientific community, a group of researchers published in 2016 the so called FAIR principles. Those principles are Findablilty, Accessibility, Interoperability, and Reusability.

In this short article we will introduce the Findability principle, with a focus on mathematical sciences, and connected with the efforts that are being developed in the MaRDI project.

For a more comprehensive introduction to the FAIR principles, you can visit the Go-Fair portal.

**Findability**

Findability is the first of the FAIR principles; and it is also the most basic one, because if you can't find some data, you can't re-use it in any way, it is as if it did not existed.

When we try to find data, we may face two situations: either we know that something exists and we are looking for it specifically; or we don't know exactly the result we want, and we look for anything related to a search term. In the first case, rather than finding that data, our problem is locating it somewhere in the physical or virtual space. In the second case, our problem is to examine all the data available (in a certain catalog) for a certain characteristic that we are interested in.

Both problems are solved by using a few tools: Firstly, each piece of data needs to have a unique reference or identification, so that we can build lookup tables for the location of each dataset. Secondly, together with the ID, we need other metadata that describes the data with some useful information (type, subject, authors, etc). Thirdly, we need to build comprehensive catalogs that gather all the metadata of the datasets, and to build search engines, which are algorithms to retrieve things from the catalogs.

Thus, the Findability principle can be expanded to some concrete recommendations:

- (Meta)data is assigned a globally unique and persistent identifier.
- Data is described with rich metadata.
- Metadata clearly and explicitly includes the identifier of the data it describes.
- (Meta)data is registered or indexed in a searchable resource.

**Where people go**

For mathematical articles, books, and other classically published works, a reference includes title, author, year, etc. While this is easily usable and readable by a human, it is not always consistent in format, and it does not provide a mean to locate and access that information.

The two de-facto standard catalogs that collect mathematical literature, and also assign a unique identifier, are the Mathematical Reviews (archived in MathSciNet by the American Mathematical Society) and the ZentralBlatt Mathematik (archived in zbMath by the FIZ Karlsruhe - Leibniz Institute). Those unique identifiers (MR number and Zb number) are helpful to reference a piece of mathematical literature, and those catalogs are a main tool to find works in a specific math domain, but those catalogs are much less comprehensive when it comes to other data (databases, media, online resources, etc). It also has the drawback that the authors can't control the existence or the metadata of an entry, and MathSciNet is a subscription-based service (MR Lookup offers limited services to non-subscribers. As of 2021, ZbMath became zbMATH-open and requires no subscription).

Another notable mention is arXiv, which is a de-facto standard platform for pre-publications. Here the actual paper is offered publicly (which concerns the Accessibility principle), and any work in arXiv also gets a unique ID and can be found via the catalog search. However, the focus is on literature and there is very limited support for datasets related to a paper.

One of the most standardized identifiers for online resources is the Digital Object Identifier (DOI), which references any digital object. Unlike a URL, the DOI is linked to a particular file and not to the server or website where it is hosted. The DOI website resolves the DOI number to the most up-to-date URL to access the data, so the DOI also serves as a locator in addition to being a unique identifier. Usually, publishers assign a DOI to new publications, but authors can also obtain a DOI in other registration agencies. Some open repositories offer free DOI registration (for example Zenodo, see below).

When it comes to non-literature data, the panorama is much coarser. zbMath has a sister project,

swMath, which is a catalog of mathematical software packages (computer algebra, numerics, etc) and a cross-reference record of their citations articles in zbMath. zbMath also features a full-text search of formulas, which is improving within the MaRDI framework.

For pure databases (experimental data, simulations data, etc), there is not yet a universally accepted repository in mathematics, in contrast with other fields where one can find standardized catalogs of proteins or genetic encodings, for example.

Zenodo is a general purpose repository for open data, which hosts quite a few mathematical datasets. See our article "Publishing on open repositories" where we talk more about Zenodo.

TK: Focus very strong on publications not on results and math research data in general. E.g. you want to find a theorem or formula rather than a publication, E.g. you want to find specific formula e.g.. related to bessel functions, or a specfic representation of a specifc group. Threre exist some special, curated collections, which are also findable. Examples include SuiteSparse Matrix collection or the On-line enceclopedia of interger sequences (OEIS)

**What is MaRDI proposing concerning Findability?**

Unfortunately, most data-based mathematical research is still published either without the datasets, or the datasets are only hosted in university servers accessible only through personal websites of the researchers involved.

The MaRDI project aims to, on the one hand, provide the necessary infrastructure to properly publish data (according to the FAIR principles), and on the other hand, to spread awareness within the math research community of the problems and proposed solutions that publishing data entails.

Here we will name just a few of the initiatives related to the Findability principle. You can read the funded proposal of MaRDI here.

The Scientific Computing Task Area (TA2) is preparing a benchmark framework to compare existing and new algorithms and methods to solve specific problems. For instance, there are several dozens of methods to solve a linear system Ax=b, with different performance and different technology stack, depending on the size of the matrix A, if is is sparse or dense, if we look for exact or approximate solutions, etc. So far there is no centralised catalog where a "user" (for instance a computational biologist) can go to choose the best method for his, or her, particular application. This catalog and benchmark will make finding symbolic and numerical algorithms much easier, and it aspires to be a major reference when looking for such algorithms.

The tool for this is building a knowledge graph of numerical algorithms. A knowledge graph is an abstract representation of a set of concepts, objects, events, or anything related to a domain of study, as nodes, and formal relations between them (edges) that can be read by humans or computers unambiguously. The biggest collective effort to guild a knowledge graph is WikiData. In this case, nodes in this graph will be the algorithms themselves as concepts, but also papers related to them, software packages implementing them, benchmarks, and connections to other databases. It will then be possible to navigate the knowledge graph to find semantical information, such as which algorithms extend a given one, where can we find implementations, how do they perform in comparison, etc.

Another MaRDI effort aimed at Findability is the so-called Mathematical Entity Linking (MathEL), a way to extract and compare conceptual information from mathematical formulas. A particular equation (for instance the Klein-Gordon equation, the General Relativity equation, etc.) can be expressed in many different forms: variables can be named differently; notations for derivatives, tensors, etc. may differ; groupings and substitutions can occur. The MathEL sub-project aims to retrieve the conceptual information of formulas, to propose annotation standards for introducing semantic information into formulas (for instance by referencing a Wikidata node or another knowledge graph node), to mine large corpora of data (for instance the Zb catalog or the arXiv repository), and to create user interfaces to retrieve concept and source information, such as question-answering engines.

We can share a sneak peek into the formula search engine that is under development. This phony wiki page contains a couple of math formulas. This search portal should be able to find those formulas when queried in the search box. You can try " V=4/3 \pi r^3 " or " V=\frac{4}{3} \pi r^3 " and it will find the test formula. Also, with " V = 4/3 \pi ?s^3 " you can find variable substitutions. However, other common rewritings are not yet recognized, such as " V = \frac{4\pi}{3} r^3 ". Eventually it will be possible to find formulas whose LaTeX expressions use operators with the same meaning but different notation.

In subsequent articles we will present other tasks being carried out within MaRDI that are more aligned with the other FAIR principles (for instance open interfaces, or descriptions of workflows).
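To give a feeling for why this is hard, here is a deliberately naive sketch of formula matching. It is our own toy code, not how the search engine under development works: it rewrites simple fractions as a/b, splits the LaTeX string into tokens, and treats ?name as a wildcard for a single-letter variable.

```python
import re
from typing import Dict, List

# Matches a \frac whose two arguments contain no nested braces.
_FRAC = re.compile(r"\\frac\{([^{}]*)\}\{([^{}]*)\}")
# Tokens: LaTeX commands, ?wildcards, integers, or any other single character.
_TOKEN = re.compile(r"\\[A-Za-z]+|\?[A-Za-z]+|\d+|\S")


def tokenize(latex: str) -> List[str]:
    """Rewrite simple fractions as a/b, then split into tokens."""
    previous = None
    while previous != latex:  # repeat so nested fractions unwrap inside-out
        previous = latex
        latex = _FRAC.sub(r"\1/\2", latex)
    return _TOKEN.findall(latex)


def matches(query: str, formula: str) -> bool:
    """Token-by-token comparison with consistent wildcard bindings."""
    q, f = tokenize(query), tokenize(formula)
    if len(q) != len(f):
        return False
    bindings: Dict[str, str] = {}
    for q_tok, f_tok in zip(q, f):
        if q_tok.startswith("?"):
            # A wildcard may only stand for a single-letter variable,
            # and must stand for the same one everywhere in the query.
            if not (len(f_tok) == 1 and f_tok.isalpha()):
                return False
            if bindings.setdefault(q_tok, f_tok) != f_tok:
                return False
        elif q_tok != f_tok:
            return False
    return True


stored = r"V=\frac{4}{3} \pi r^3"
same_notation = matches(r"V=4/3 \pi r^3", stored)
with_wildcard = matches(r"V = 4/3 \pi ?s^3", stored)
regrouped = matches(r"V = \frac{4\pi}{3} r^3", stored)
```

The first two queries match, while the regrouped fraction does not, because a purely syntactic comparison has no notion that the two expressions are mathematically equal. Recognizing such rewritings requires working with the meaning of the formula rather than its spelling.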

**Abstract**

We had some data from a project. How can we prepare it according to FAIR principles? Follow us in this attempt at being FAIR on the first try.

**Publishing data in open repositories**

This is a short story on how we set up data in a public repository (almost painlessly).

We are IMAGINARY, a math communication association, part of the MaRDI consortium, and we do math exhibitions as our main activity. In a recent project about climate change, we gathered some data about Earth grids. The latitude and longitude grid is the most common way of creating a reference system on the Earth, but there are other ways to divide the Earth into small regions that can be addressed and used, which are particularly useful in numerical models.

We contacted a couple of climate researchers, who prepared for us the sets of geographic nodes and edges that make up those grids. Then, one of our collaborators took that data and converted it into a 3D-printable model by adding thickness to the edges and checking the structural integrity of the ensemble, so it could be a physical object. Finally, a 3D printing company made the objects and we used them in our exhibition.

Being a company committed to Free and Open Source licenses, we tested how to make that data FAIR.

First, this dataset is not worth a research article in a journal, since it is not a scientific advance. At the same time, it is specific enough that this is probably the first time such data has been gathered and processed this way, and others could find it useful to re-use.

**Git (GitHub, GitLab)**

Since we are dealing with software files, the most convenient platform on which to develop and publish is GitHub. Git is the most popular version control software, and any organization of code should start there. GitHub and GitLab are probably the most popular platforms to host Git projects. However, as a publishing tool, such a platform should be considered almost a kind of personal website (you can actually host and serve a Git repository on your own server). It is a live, working tool, so the published data can change at any time. GitHub does not offer a guarantee of stability, a standardized identifier, or a good way to search for and find your data. Also, it keeps a record of all the previous states, so all the dirty work is in public view.

Our GitHub page was our collaboration tool within the team. It was not intended as a publication method; it just happened that we left it publicly available.

**Zenodo**

Zenodo is a general-purpose, open-access public repository. It is used by researchers in any scientific area to make a copy of their works findable and accessible to the public. Those works can be articles or books, either as pre-prints or already published by traditional publishing houses, but also databases, data files, images, or any digital asset that their research relies upon.

Zenodo can issue a Digital Object Identifier (DOI) if the work does not already have one. In this case, the DOI contains the string "zenodo" (for instance 10.5281/zenodo.6538815).

Zenodo is hosted on CERN infrastructure and funded by the European Commission, amongst others.
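A DOI becomes a clickable link by prefixing it with the doi.org resolver. The short sketch below builds that link and recognizes DOIs of the Zenodo form shown above; the helper names are our own, and the pattern is inferred from the example DOI rather than taken from an official specification.

```python
import re
from typing import Optional

# Pattern inferred from the example DOI above (an assumption, not a spec).
ZENODO_DOI = re.compile(r"^10\.5281/zenodo\.(\d+)$")


def doi_url(doi: str) -> str:
    """Turn a bare DOI into a resolvable link via the doi.org resolver."""
    return f"https://doi.org/{doi}"


def zenodo_suffix(doi: str) -> Optional[str]:
    """Return the numeric suffix of a Zenodo-style DOI, or None otherwise."""
    match = ZENODO_DOI.match(doi.strip())
    return match.group(1) if match else None


link = doi_url("10.5281/zenodo.6538815")
suffix = zenodo_suffix("10.5281/zenodo.6538815")
not_zenodo = zenodo_suffix("10.1007/s00287-021-01392-6")
```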

Creating our entry on Zenodo was not difficult.

Firstly, we created an account. A valid email is enough, but you can link it to your ORCID identifier to uniquely determine the author(s).

Secondly, we made a new upload draft. You can choose the type of document (publication, poster, dataset, image, video, software, physical object, etc.) and fill in the form with title, authors, publication date (which can be in the past), description, and several other fields.

For the authors, we added the ORCID of those who had it. We also used "IMAGINARY" as an author even if it is not a physical person but a company.

We requested a new DOI since we did not have any. The DOI can be "reserved" during the draft process, so you know it in advance (and can use it in the documents you prepare).

For the actual content, we used a zip file with the master branch of the GitHub repository. You can also link your Zenodo account to your GitHub account, so that whenever you make a "release" on GitHub, a snapshot is automatically published on Zenodo.

Finally, we submitted the draft. Once published, you can't add, delete or modify the files associated with a DOI (that is the point of the DOI), but you can make new versions with a new DOI. Thus, you must double- and triple-check before submitting. In case you make an erroneous submission, the only fix is to write an email to Zenodo administrators explaining yourself and asking for a fix.
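We filled in the web form by hand, but the same fields can also be prepared as a JSON document for Zenodo's REST API. The sketch below only assembles such a document and does not contact Zenodo. The field names follow our reading of the Zenodo deposit API documentation and should be checked against it before use; the title, description, date, author name, and ORCID are placeholders, not our real record.

```python
import json
from typing import Dict, List, Optional


def make_creator(name: str, orcid: Optional[str] = None) -> Dict[str, str]:
    """One author entry; the ORCID is optional."""
    creator = {"name": name}
    if orcid:
        creator["orcid"] = orcid
    return creator


def make_deposit_metadata(title: str, description: str,
                          creators: List[Dict[str, str]],
                          publication_date: str) -> Dict[str, dict]:
    """Assemble the metadata for a dataset upload draft.

    Field names are assumed from the Zenodo deposit API documentation.
    """
    return {
        "metadata": {
            "upload_type": "dataset",
            "title": title,
            "description": description,
            "creators": creators,
            "publication_date": publication_date,  # ISO date, may be in the past
            "prereserve_doi": True,  # ask for the DOI while still a draft
        }
    }


# Placeholder values for illustration only.
payload = make_deposit_metadata(
    title="Example Earth grid dataset",
    description="Nodes and edges of Earth grids, plus 3D-printable models.",
    creators=[
        make_creator("Doe, Jane", orcid="0000-0000-0000-0000"),
        make_creator("IMAGINARY"),  # an organization listed as an author
    ],
    publication_date="2022-05-01",
)
body = json.dumps(payload, indent=2)
```

Keeping the metadata in a file like this, next to the data, also makes it easier to double-check everything before the final, irreversible submission.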

**In Conversation with Cedric Villani**

In the first episode of the interview series Data Date, Cedric Villani joins Christiane Görgen for a brief exchange of thoughts about Math & Data.

**OpenML hackathon at Dagstuhl castle**

Sebastian Fischer and Oleksandr Zadorozhnyi, of the MaRDI task area Statistics and Machine Learning, participated in an OpenML hackathon held in late March at the headquarters of the Leibniz Center for Informatics at Dagstuhl, Germany.

OpenML is an open-source platform for sharing datasets, algorithms, experiments, and results. The hackathon was initiated by Bernd Bischl, one of the key players behind OpenML and a Co-Spokesperson in MaRDI. Researchers from other parts of Germany, France, the Netherlands, Poland, and Slovenia were present to discuss topics such as data quality on OpenML, an extension of its established services to new data formats, and new computational tasks.

The review article "Datasheets for datasets" prompted fruitful exchanges on the future improvement of data and metadata quality. In particular, support for non-tabular data formats such as images was discussed and will now be introduced by transitioning from the attribute-relation file format (ARFF) to Parquet. The eight task types available so far, including regression, classification, and clustering, will be extended with new tasks that are typical for graphical modeling. As this is one of the main use cases and an important topic for both Sebastian and Oleksandr, they discussed several points with Jan van Rijn: the problem of estimating the structure of a graphical model from a given dataset, how to embed it into the current set of tasks available on OpenML, the addition of different evaluation measures or criteria for model selection, and the storage of graph-specified datasets within the OpenML framework. The evaluation measures and criteria for model selection allow for the comparison of estimated graphs to some given ground truth, a procedure that is not normally part of the ML workflow.
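To illustrate what comparing an estimated graph to a ground truth can look like, here is a small sketch using three common measures for undirected graphs: edge precision, edge recall, and the number of differing edges (a Hamming-type distance). These are our own illustrative choices; the text above does not specify which measures were discussed for OpenML.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Edge = FrozenSet[str]


def edge_set(pairs: Iterable[Tuple[str, str]]) -> Set[Edge]:
    """Store undirected edges as frozensets so (a, b) equals (b, a)."""
    return {frozenset(pair) for pair in pairs}


def compare_graphs(estimated: Set[Edge],
                   truth: Set[Edge]) -> Tuple[float, float, int]:
    """Return (precision, recall, number of differing edges)."""
    correct = len(estimated & truth)
    # Convention used here: an empty set counts as perfectly precise/recalled.
    precision = correct / len(estimated) if estimated else 1.0
    recall = correct / len(truth) if truth else 1.0
    differing = len(estimated ^ truth)  # missing edges plus extra edges
    return precision, recall, differing


truth = edge_set([("A", "B"), ("B", "C"), ("C", "D")])
estimated = edge_set([("B", "A"), ("B", "C"), ("A", "D")])
precision, recall, differing = compare_graphs(estimated, truth)
```

In this example two of the three estimated edges are correct, one true edge is missed, and one spurious edge is added, so precision and recall are both 2/3 and two edges differ.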

Sebastian also presented their collaborative work with Michael Lang on the mlr3oml R package. This package connects the OpenML platform to the open-source machine learning mlr3 package in R, another crucial aspect of the MaRDI task area.

The hackathon was rounded out with social activities like a walk through the forest. The good weather aside, special thanks need to be given to Joaquin Vanschoren, the OpenML founder, whose supply of water to the whole group during the hike was the other reason why everyone made it back to the castle in good spirits!

All in all, the week in Wadern was a pleasant and fruitful one for all the participants.

We will also be introducing you to the people who shape MaRDI with their expertise and vision for mathematical research data. They will appear in a series of "Making MaRDI" interviews available via our Twitter account. Stay tuned!

**Call for seed funds 2023**

These funds support scientists from all fields of research within engineering, relating to the development and implementation of innovative ideas in data management. The grant is equivalent to the funding of a full-time doctoral position for one year. If necessary, the funding can be split between project partners.

More information in English or in German.

To learn about the Nationale Forschungsdateninfrastruktur, the community of which MaRDI is just one small part, read the 2021 article by Nathalie Hartl, Elena Wössner, and York Sure-Vetter in Informatik Spektrum. See doi.org/10.1007/s00287-021-01392-6

Christiane Görgen and Claudia Fevola explain in a short review article the role repositories can play in the MaRDI infrastructure. They use MathRepo as an example, a small math research-data repository hosted at the Max Planck Institute for Mathematics in the Sciences in Leipzig. See arxiv.org/abs/2202.04022

The interim report of the European Commission Expert Group on FAIR data discusses how to turn FAIR into reality. See doi.org/10.2777/1524

Thomas Koprucki and Karsten Tabelow have been two of the driving forces in the early stages of MaRDI. Together with Ilka Kleinod they discussed mathematical models as an important type of mathematical research data in a 2016 article for the Proceedings in Applied Mathematics and Mechanics: https://doi.org/10.1002/pamm.201610458
