Math & Data Quarterly
News and insights into the realm of mathematical research data
Welcome to the fifth issue of the MaRDI Newsletter on mathematical research data. In the first four issues, we focused on the FAIR principles. Now we move to a topic that makes use of FAIR data and also implements the FAIR principles in data infrastructures. So without further ado, let me introduce you to the ultimate use case of FAIR data: Knowledge Graphs.
by Ariel Cotton, licensed under CC BY-SA 4.0.
Knowledge graphs are very natural: they represent information much as we humans do. They come in handy when you want to avoid redundancy in storing data (as often happens with tabular methods), and also for complex dataset queries.
This newsletter issue offers some insight into the structure of knowledge, examples of knowledge graphs (including some specific to MaRDI), an interview with a knowledge-graph expert, and news and announcements related to research data.
In the last issue, we asked how long it would take you to find and understand your own research data. These are the results:
Now we ask you for specific challenges when searching for mathematical data. You may choose from the multiple-choice options or enter something else you faced.
Click to enter your challenges!
You will be taken to the results page automatically after submitting your answer. Additionally, the current results can be accessed here.
The knowledge ladder
We are not sure exactly how humans store knowledge in their brains, but we certainly pack concepts into units and then relate those conceptual units to one another. For example, if asked to list animals, nobody recites an alphabetical list (unless you explicitly train yourself to remember such a list). Instead, you start the list with something familiar, like a dog; then you recall that a dog is a pet animal, and you list other pet animals like cat or canary. Then you recall that a canary is a bird, and you list other birds, like eagle, falcon, owl… When you run out of birds, you recall that birds fly in the air, which is one environmental medium. Another such medium is water, and this prompts you to start listing fish and sea animals. This suggests that we can represent human knowledge in the form of a mathematical graph: concepts are nodes, and relationships are edges. This structure is also ingrained in language, which is how humans communicate and store knowledge. All languages in the world, across all cultures, have nouns, verbs, and adjectives, and establish relationships through sentences. Almost every language organizes sentences around a subject, a verb, and an object (in some order: SVO, SOV, VSO, etc.). The subject and the object are typically nouns or pronouns; the verb often expresses a relationship. A sentence like “my mother is a teacher” encodes the following knowledge: the person “my mother” is node 1, “teacher” is node 2, and “has as a job” is a relational edge from node 1 to node 2. Also, there is a node 3, the person “me”, and a relationship “is the mother of” from node 1 to node 3 (which implies a reciprocal relationship “is a child of” from node 3 to node 1).
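To make this concrete, here is a minimal sketch in Python (using the rdflib library; the URIs are invented purely for illustration) of how the sentence above becomes subject-predicate-object triples in a graph:

```python
# Minimal sketch: "my mother is a teacher" as triples, with invented example URIs.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace for this toy example
g = Graph()

g.add((EX.mother, EX.hasJob, EX.teacher))   # node 1 --"has as a job"--> node 2
g.add((EX.mother, EX.isMotherOf, EX.me))    # node 1 --"is the mother of"--> node 3
g.add((EX.me, EX.isChildOf, EX.mother))     # reciprocal edge from node 3 to node 1

print(g.serialize(format="turtle"))         # the same graph in human-readable Turtle
```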
With this construction, we can hope to have an abstract representation of human knowledge that we can store, retrieve, and search with a computer. But not all data automatically gives knowledge, and raw knowledge is not all you may need to solve a problem. This distinction, sometimes referred to as the “knowledge ladder”, is illustrated in the image above, although the terminology has not been universally agreed upon. In this ladder, data are raw input values that we have collected with our senses or with a sensor device. Information is data tagged with meaning: I am a person, that person is called Mary, teaching is a job, this thing I see is a dog, this list of numbers records daily temperatures in Honolulu. Knowledge is achieved when we find relationships between bits of information: Mary is my mother, Mary’s job is teacher, these animals live together and compete for food; pressure, temperature, and volume in a gas are related by the gas law PV=nRT. Insight is discernment: singling out the information that is useful for your purpose from the rest, and finding seemingly unrelated concepts that behave alike. Finally, wisdom is understanding the connections between concepts; it is the ability to explain step by step how concept A relates to concept B. From this point of view, “research” means knowing and understanding all the portions of human knowledge that fall within or close to your domain, and then enlarging the graph with more nodes and edges, for which you need both insight and wisdom.
The advent of knowledge graphs
Knowledge graphs (KG) as a theoretical construction have been discussed in information theory, linguistics, and philosophy for at least five decades, but it is only in this century that computers have allowed us to implement algorithms and data retrieval at a practical and massive scale. Google introduced its own knowledge graph in 2012; you may be familiar with it. When you look up a person, a place, etc. in Google, there is a small box to the right that displays some key information, such as the birth date and achievements of a person, or the opening times of a shop. This information is not a snippet from a website; it is information collected from many sources and packed into a node of a graph. Those nodes are then linked together by some affinity relationship. For instance, if you look up “Agatha Christie”, you will see an “infobox” with her birth date, death date, a short description extracted from Wikipedia, a photograph… and also a list of “People also search for” that will bring you to her family relatives, such as Archibald Christie, or to other British authors of the same genre, such as Arthur Conan Doyle.
But probably the biggest effort to bring all human knowledge into structured data is Wikidata. Wikidata is a sister project of Wikipedia. Wikipedia aims to gather all human knowledge in the form of encyclopedic articles, that is, as non-structured, human-readable data. Wikidata, by contrast, is a knowledge graph. It is a directed labeled graph, made of triples of the form subject (node) - predicate (edge) - object (node). The nodes and edges are labeled; in fact, they carry a whole list of attributes.
The Wikidata graph is not designed to be used directly by humans. It is designed for retrieving information automatically, to be a “base of truth” that can be relied on. For instance, it can be used to check automatically that all language editions of Wikipedia state basic facts correctly (birthplace, list of authored books…), and it can be used by external services (such as Google and other search engines or voice assistants) to offer correct and verifiable answers to queries.
In practice, nodes are pages, for instance, this one for Agatha Christie. The page lists some “statements”, which are the labeled edges to other nodes: for example, that she is an instance of human, that her native language is English, or that her field of work is crime novel, detective literature, and others. If we compare that page with the Agatha Christie entry in the English Wikipedia, clearly the latter contains more information, and the Wikidata page is less convenient for a human to read. Potentially, all the ideas described with English sentences in Wikipedia could be represented by relationships in the Wikidata graph, but this task is tedious and difficult for a human, and AI systems are not yet sufficiently developed to make this conversion automatically.
In the backend, Wikidata is stored in relational SQL databases (the same Mediawiki software as used in Wikipedia), but the graph model is that of subject-predicate-object triples as defined in the web standard RDF (Resource Description Framework). This graph structure can be explored and queried with the language SPARQL (SPARQL Protocol and RDF Query Language). Note that we usually use the verb “query”, as opposed to “search”, when we want to retrieve information from a graph, database, or other structured source of information.
Thus, one can access the Wikidata information in several ways. First, one can use the web interface to access single nodes. The web interface has a search function that allows one to look up pages (nodes) that contain a certain search string. However, it is much more insightful to get information that takes advantage of the graph structure, that is, querying for nodes that are connected to some topic by a particular predicate (statement), or that have a particular property. For Wikidata, we have two main tools: direct SPARQL queries, and the Scholia plug-in tool.
The web interface and API at query.wikidata.org allow you to send queries in the SPARQL language. This is the most powerful search option; you can browse the examples on that site. The output can be a list, a map, a graph, etc. There is a query-builder help function, but essentially it requires some familiarity with the SPARQL language. Scholia, on the other hand, is a plug-in tool that helps query and visualize the Wikidata graph. For instance, searching for “covid-19” via Scholia offers a graph of related topics, a list of authors and recent publications on the topic, organizations, etc., in different visual forms.
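As a small illustration, the following Python sketch (using the SPARQLWrapper library) sends a query to query.wikidata.org asking for works authored by Agatha Christie; the identifiers Q35064 (Agatha Christie) and P50 (author) are the Wikidata ones at the time of writing.

```python
# Example SPARQL query against Wikidata: works whose author (P50) is Agatha Christie (Q35064).
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
endpoint.setQuery("""
SELECT ?work ?workLabel WHERE {
  ?work wdt:P50 wd:Q35064 .                                   # author = Agatha Christie
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["workLabel"]["value"])
```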
Knowledge graphs, artificial intelligence, and mathematics
Knowledge graphs are a hot research area in connection with Artificial Intelligence. On the one hand, there is the challenge of creating a KG from a natural-language text (for instance, in English). While detecting grammar and syntax rules (subject, verb, object) is relatively doable, creating a knowledge graph requires encoding the semantics, that is, the meaning of the sentence. In the example from a few paragraphs above, “my mother is a teacher”, to extract the semantics we need the context of who “me” is (who is saying the sentence), we need to check whether we already know the person “my mother” (her name, some kind of identifier), etc. The node for that person can live in a small KG with family or contextual information, while “teacher” can be part of a more general KG of common concepts.
In the case of mathematics, extracting a KG from natural language is a tremendous challenge, unfeasible with today’s techniques. Take a theorem statement: it contains definitions, hypotheses, and conclusions, and each one has a different context of validity (the conclusion is only valid under the hypotheses, but that is what you need to prove). Then imagine that you start your proof by contradiction (reductio ad absurdum), so you have several sentences that are valid under the assumption that the hypotheses of the theorem hold, but not the conclusion. At some point, you want to find a contradiction with your previous knowledge, thus proving the theorem. The current knowledge-graph paradigm is simply not suitable for following this type of argument. The closest thing to structured data for theorems and proofs are formal languages in logic, and there are practical implementations such as the Lean theorem prover. Lean is a programming language that can encode symbolic manipulation rules for expressions. A proof by algebraic manipulation of a mathematical expression can therefore be described as a list of manipulations applied to an original expression (move a term to the other side of the equal sign, raise the second index in this tensor using a metric…). Writing proofs in Lean can be tedious, but it has the benefit of being automatically verifiable by a machine: there is no need for a human referee. Of course, we are still far from an AI that can check the validity of a proof written in ordinary natural language without human intervention, let alone figure out proofs of conjectures on its own. On the other hand, a dependency graph of theorems, derived in a logical chain from some axioms, is something that a knowledge graph like the MaRDI KG would be suitable to encode.
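To give a taste of what machine-checkable mathematics looks like, here is a tiny sketch in Lean 4 (core syntax only, no particular library assumed): a contraposition step of the kind used in proofs by contradiction, verified entirely by the proof checker.

```lean
-- From p → q and ¬q, derive ¬p: assuming p yields q, contradicting ¬q.
-- Every step is checked by Lean; no human referee is needed for its correctness.
theorem contrapose (p q : Prop) (hpq : p → q) (hnq : ¬ q) : ¬ p :=
  fun hp => hnq (hpq hp)
```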
In any case, structured knowledge (in the form of a KG or other forms, such as databases) is a fundamental piece in providing AI systems with a source of truth. Recent advances in the field of generative AI include the famous conversation bot ChatGPT and other Large Language Models (LLMs), which are impressive in the sense that they can generate grammatically correct text with meaningful sentences while keeping track of a conversation. However, these systems are famous for not being able to distinguish truth from falsehood (to be precise, the AI is trained with text data that is assumed to be mostly true, but it cannot make any logical deductions). If we ask an AI for the biography of a nonexistent person, it may simply invent one in trying to fulfill the task. If we contradict the AI with a flat assertion, it will probably just accept our input despite its previous answer. Currently, conversational AI systems are not capable of rebutting false claims by providing evidence. However, in the near future, a conversational AI with access to a knowledge base (KG, database, or other) will likely be able to process queries and generate answers in natural language, but also to check for verified facts and to present relevant information extracted from the knowledge base. An example in this direction is the Wolfram Alpha plug-in for ChatGPT. With some enhanced algorithms to traverse and explore a knowledge graph, we may witness AI systems stepping up from Knowledge to Insight, or further up the ladder.
One of the mottos of MaRDI is “Your Math is Data”. Indeed, from an information theory perspective, all mathematical results (theorems, proofs, formulas, examples, classifications) are data, and some mathematicians also use experimental or computational data (statistical datasets, algorithms, computer code…). MaRDI intends to create the tools, the infrastructure, and the cultural shift to manage and use all research data efficiently. In order to climb up the “knowledge ladder” from Data to Information and Knowledge, the Data needs to be structured, and knowledge graphs are one excellent tool for that goal.
AlgoData
Several initiatives within MaRDI are based on knowledge graphs. A first example is AlgoData (requires MaRDI / ORCID credentials), a knowledge graph of numerical algorithms. In this KG, the main entities (nodes) are algorithms that solve particular problems (such as solving linear systems of equations or integrating differential equations). Other entities in the graph are supporting information for the algorithms, such as articles, software (code), or benchmarks. For example, we want to encode that algorithm 1 solves problem X, is described in article Y, is implemented in software Z, and scores p points on benchmark W. A use case would be querying for algorithms that solve a particular type of problem, comparing the candidates using certain benchmarks, and retrieving the code to be used (ideally, interoperable with your system setup).
AlgoData has a well-defined ontology. An ontology (from the Greek, loosely, “study or discourse of the things that exist”) is the set of concepts relevant to your domain. For instance, on an e-commerce site, “article”, “client”, “shopping cart”, or “payment method” are concepts that need to be defined and included in the implementation of the e-commerce platform. For knowledge graphs, the list would include all types of nodes and all labels for the edges and other properties. In general-purpose knowledge graphs, such as Wikidata, the ontology is huge, and for practical purposes the user (human or machine) relies on search/suggestion algorithms to identify the property that best fits their intention. In contrast, for specific-purpose knowledge graphs, such as AlgoData, a reduced and well-defined ontology is possible and preferable, as it simplifies the overall structure and the search mechanisms.
The ontology of AlgoData (as of June 2023, under development) is the following:
Classes:
Algorithm, Benchmark, Identifiable, Problem, Publication, Realization, Software.
Object Properties:
analyzes, applies, documents, has component, has subclass, implements, instantiates, invents, is analyzed in, is applied in, is component of, is documented in, is implemented by, is instance of, is invented in, is related to, is solved by, is studied in, is subclass of, is surveyed in, is tested by, is used in, solves, specializedBy, specializes, studies, surveys, tests, uses.
Data Properties:
has category, has identifier.
We can display this ontology as a graph:
Currently, AlgoData implements two search functions: a “Simple search” that matches words in the content, and a “Graph search” that queries for nodes in the graph satisfying certain conditions on their connections. The main AlgoData page gives a sneak preview of the system (these links are password protected, but MaRDI team members and any researcher with a valid ORCID identifier can access them).
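To give an idea of what such a graph search could look like under the hood, here is a purely hypothetical sketch in Python: the endpoint URL and predicate names are invented (loosely modelled on the ontology listed above) and do not reflect AlgoData's actual schema or API.

```python
# Hypothetical AlgoData-style graph search: algorithms that solve a given class of
# problem and are tested by some benchmark. URL and predicates are placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://algodata.example.org/sparql")   # placeholder URL
endpoint.setQuery("""
PREFIX algo: <https://algodata.example.org/ontology#>
SELECT ?algorithm ?benchmark WHERE {
  ?algorithm algo:solves     ?problem ;
             algo:isTestedBy ?benchmark .
  ?problem   algo:hasCategory "linear systems" .
}
""")
endpoint.setReturnFormat(JSON)
# results = endpoint.query().convert()   # would only work against a real endpoint
```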
A project closely related to AlgoData is the Model Order Reduction Benchmark (MORB) and its Ontology (MORBO). This sub-project focuses on creating benchmarks for model order reduction algorithms (model order reduction is a standard technique in mathematical modeling used to reduce the simulation time for large-scale systems) and has its own knowledge graph and ontology tailored to this problem. More information can be found on the MOR Wiki and the MaRDI TA2 page.
The MaRDI portal and knowledge graph
The main output from the MaRDI project will also be based on a knowledge graph. The MaRDI Portal will be the entry point to all services and resources provided by MaRDI. The portal will be backed by the MaRDI knowledge graph, a big knowledge graph scoped to all mathematical research data. You can already have a sneak peek to see the work in progress.
The architecture of the MaRDI knowledge graph follows that of Wikidata and is compatible with it. In fact, many entries of Wikidata have been imported into the MaRDI KG and vice versa. The MaRDI knowledge graph will also integrate many other resources from open knowledge, thus leveraging many existing projects. A non-exhaustive list would include:
- The MaRDI AlgoData knowledge graph described above.
- Other MaRDI knowledge graphs, such as the MORWiki or the graph of Workflows with other disciplines.
- The zbMATH Open repository of reviews of mathematical publications.
- The swMATH Open database of mathematical software.
- The NIST Digital Library of Mathematical Functions (DLMF).
- The CRAN repository of R packages.
- Mathematical publications in arXiv.
- Mathematical publications in Zenodo.
- The OpenML platform of Machine Learning projects.
- Mathematical entries from Wikidata.
- Entries added manually by users.
The MaRDI Portal does not intend to replace any of those projects, but to link all those openly available resources together in a big knowledge graph of greater scope. As of June 2023, the MaRDI KG has about 10 million triples (subject-predicate-object as in the RDF format). As with Wikidata, the ontology is too big to be listed, and it is described within the graph itself (e.g. the property P2 is the identifier for functions from the DLMF database).
Let us see some examples of entries in the MaRDI KG. A typical entry node in the MaRDI KG (in this example, the program ggplot2) is very similar to a Wikidata entry. This page is a human-friendly interface, but we can also get the same information in machine-readable formats such as RDF or JSON.
For the end user, it is probably more useful to query the graph for connections. As with Wikidata, we can query the MaRDI knowledge graph directly in SPARQL. Enabling the Scholia plug-in to work with the MaRDI KG is a work in progress; currently, the beta MaRDI-Scholia instance queries against Wikidata.
Some queries that are available in the MaRDI KG but not in Wikidata are, for instance, queries for formulas in the DLMF: here, formulas that use the gamma function, or formulas that contain sine and tangent functions (the corpus of the database is still small, but it illustrates the possibilities). Wikidata can nevertheless also query for symbols in formulas.
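Programmatically, querying the MaRDI KG looks just like querying Wikidata. The following Python sketch is illustrative only: the endpoint URL and prefixes are placeholders, and only the property P2 (the DLMF identifier mentioned above) is taken from the text; consult the MaRDI Portal for the actual query service.

```python
# Illustrative sketch of a MaRDI-KG-style query: items carrying a DLMF identifier
# (property P2, as described above). Endpoint URL and prefixes are placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://portal.mardi4nfdi.de/query/sparql")  # placeholder
endpoint.setQuery("""
SELECT ?item ?dlmfId WHERE {
  ?item wdt:P2 ?dlmfId .     # P2: identifier for functions from the DLMF database
}
LIMIT 10
""")
endpoint.setReturnFormat(JSON)
# results = endpoint.query().convert()   # run against the live endpoint when available
```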
The MaRDI KG is still in an early stage of development, and not ready for public use (all the examples cited are illustrative only). Once the KG begins to grow, mostly from open knowledge sources, the MaRDI team will improve it with some “knowledge building” techniques.
One such technique is the automated retrieval of structured information. For instance, the bibliographic references in an article are structured information, since they follow one of a few formats and there are standards for them (BibTeX, zbMATH or MR numbers, …).
Another technique is link inference. This addresses the problem of low connectivity in graphs made by importing sub-graphs from multiple third-party sources, which may result in very few links between the sub-graphs. For instance, an article citing some references and a GitHub repository citing the same references are likely talking about the same topic. These inferences can then be reviewed by a human if necessary.
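A toy sketch of the idea, in Python (hypothetical identifiers and thresholds, not MaRDI's actual algorithm): two items imported from different sources whose cited references overlap strongly become a candidate link, to be confirmed later by a human.

```python
# Toy link-inference sketch: propose links between imported items whose cited
# references (here, DOIs) overlap. Data and threshold are invented for illustration.
def propose_links(items, threshold=0.5):
    """items: dict mapping item id -> set of cited DOIs; returns candidate links."""
    candidates = []
    ids = list(items)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            refs_a, refs_b = items[a], items[b]
            if not refs_a or not refs_b:
                continue
            overlap = len(refs_a & refs_b) / min(len(refs_a), len(refs_b))
            if overlap >= threshold:
                candidates.append((a, b, overlap))   # for later human review
    return candidates

items = {
    "arxiv:2301.00001":    {"10.1000/ref1", "10.1000/ref2", "10.1000/ref3"},
    "github:example/repo": {"10.1000/ref2", "10.1000/ref3"},
    "zenodo:1234567":      {"10.1000/ref9"},
}
print(propose_links(items))   # links the arXiv article and the GitHub repository
```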
Another enhancement would be to improve search in natural language so that more complex queries can be made in plain English without the need to use SPARQL language.
The latest developments of the MaRDI Portal and its knowledge graph will be presented at a mini-symposium at the forthcoming DMV annual meeting in Ilmenau in September 2023.
- Knowledge ladder: Steps on which information can be classified, from the rawest to the most structured and useful. Depending on the author, these steps can be enumerated as Data, Information, Knowledge, Insight, Wisdom.
- Data: raw values collected from measurements.
- Information: Data tagged with its meaning.
- Knowledge: Pieces of information connected together with causal or other relationships.
- Knowledge base: A set of resources (databases, dictionaries…) that represent Knowledge (as in the previous definition).
- Knowledge graph: A knowledge base organized in the form of a mathematical graph.
- Insight: Ability to identify relevant information from a knowledge base.
- Wisdom: Ability to find (or create) connections between information points, using existing or new knowledge relationships.
- Ontology: Set of all the terms and relationships relevant to describe your domain of study. In a knowledge graph, the types of nodes and edges that exist, with all their possible labels.
- RDF (Resource Description Framework): A web standard to describe graphs as triples (subject - predicate - object).
- SPARQL (SPARQL Protocol and RDF Query Language): A language to send queries (information retrieval/manipulation requests) to graphs in RDF format.
- Wikipedia: a multi-language online encyclopedia based on articles (non-structured human-readable text).
- Wikidata: an all-purpose knowledge graph intended to host data relevant to multiple Wikipedias. As a byproduct, it has become a tool to develop the semantic web, and it acts as a glue between many diverse knowledge graphs.
- Semantic web: a proposed extension of the web in which the content of a website (its meaning, not just the text strings) is machine-readable, to improve search engines and data discovery.
- Mediawiki: the free and open-source software that runs Wikipedia, Wikidata, and also the MaRDI portal and knowledge graph.
- Scholia: A plug-in software for Mediawiki to enhance the visualization of data queries to a knowledge graph.
- AlgoData: a knowledge graph for numerical algorithms, part of the MaRDI project.
In Conversation with Daniel Mietchen
In this episode of Data Dates, Daniel and Tabea talk about knowledge graphs. They touch on the general concept, how it can help you find the proverbial needle in the haystack, and specific challenges involving mathematical structures. In addition, we also hear about the MaRDI knowledge graph and what it brings to mathematicians.
Leibniz MMS Days
The 6th Leibniz MMS Days, organized by the Leibniz Network "Mathematical Modeling and Simulation (MMS)", took place this year from April 17 to 19 in Potsdam at the Leibniz Institute for Agricultural Engineering and Bioeconomy. A small MaRDI delegation, consisting of Thomas Koprucki, Burkhard Schmidt, Anieza Maltsi, and Marco Reidelbach, made their way to Potsdam to participate.
This year's MMS Days placed a special emphasis on "Digital Twins and Data-Driven Simulation," "Computational and Geophysical Fluid Dynamics," and "Computational Material Science," which were covered in individual workshops. There was also a separate session on research data and its reproducibility, in which Thomas introduced the MaRDI consortium with its goals and vision and promoted two important MaRDI services of the future, AlgoData and ModelDB, two knowledge graphs for documenting algorithms and mathematical models. Marco concluded the session by providing insight into the MaRDMO plugin, which links established software in research data management with the different MaRDI services, thus enabling FAIR documentation of interdisciplinary workflows. The presentation of ModelDB was met with great interest among the participants and was the subject of lively discussions afterwards and in the following days. Some aspects from these discussions have already been incorporated into the further design of ModelDB.
In addition to the various presentations, staff members of the institute gave, during a guided tour, a brief insight into its different fields of activity, such as the optimal design of packaging and the use of drones in the field. The highlight of the tour was a visit to the 18-meter wind tunnel, which is used to study flows in and around agricultural facilities. So MaRDI actually got to know its first cowshed, albeit in miniature.
MaRDI RDM Barcamp
MaRDI, supported by the Bielefeld Center for Data Science (BiCDaS) and the Competence Center for Research Data at Bielefeld University, will host a Barcamp on research-data management in mathematics on July 4th, 2023, at the Center for Interdisciplinary Research (ZiF) in Bielefeld.
More information:
- in English
Working group on Knowledge Graphs
The NFDI working group aims to promote the use of knowledge graphs in all NFDI consortia, to facilitate cross-domain data interlinking and federation following the FAIR principles, and to contribute to the joint development of tools and technologies that enable the transformation of structured and unstructured data into semantically reusable knowledge across different domains. You can sign up to the mailing list of the working group here.
Knowledge graphs in other NFDI consortia can be found for instance at the NFDI4Culture KG (for cultural heritage items) or at the BERD@NFDI KG (for business, economic, and related data items).
More information:
- in English
NFDI-MatWerk Conference
The 1st NFDI-MatWerk Conference, aimed at developing a common vision of digital transformation in materials science and engineering, will take place from 27 to 29 June 2023 as a hybrid conference. You can still book your ticket for either on-site or online participation (online tickets are even free of charge).
More information:
- in English
Open Science Barcamp
The Barcamp is organized by the Leibniz Strategy Forum Open Science and Wikimedia Deutschland. It is scheduled for 21 September 2023 in Berlin and is open to everybody interested in discussing, learning more about, and sharing experiences on practices in Open Science.
More information:
- in English
- The department of computer science at Stanford University offers this graduate-level research seminar, which includes lectures on knowledge graph topics (e.g., data models, creation, inference, access) and invited lectures from prominent researchers and industry practitioners.
The lecture notes are available as a 73-page PDF document, divided into chapters:
https://web.stanford.edu/~vinayc/kg/notes/KG_Notes_v1.pdf
and additionally as video playlist:
https://www.youtube.com/playlist?list=PLDhh0lALedc7LC_5wpi5gDnPRnu1GSyRG
- Video lecture on knowledge graphs by Prof. Dr. Harald Sack. It covers basic graph theory, centrality measures, and the importance of a node.
https://www.youtube.com/watch?v=TFT6siFBJkQ
- The Working Group (WG) Research Ethics of the German Data Forum (RatSWD) has set up the internet portal “Best Practice for Research Ethics”. It bundles information on the topic of research ethics and makes it accessible.
https://www.konsortswd.de/en/ratswd/best-practices-research-ethics/
Welcome to the fourth MaRDI Newsletter! This time we will investigate the fourth and final FAIR principle: Reusability. We consider the R in FAIR to capture the ultimate aim of sustainable and efficient handling of research data, that is, to make your digital maths objects reusable for others and to reuse their results in order to advance science. In the words of the scientific computing community, we want mathematics to stand on the shoulders of giants rather than build on quicksand.
licensed under CC BY-NC-SA 4.0.
To achieve this, we need to make sure every tiny piece in a chain of results is where it should be, seamlessly links to its predecessors and subsequent results, is true, and is allowed to be embedded in the puzzle we try to solve. This last point is crucial, so we dedicate our main article in this issue of the newsletter to the topic of documentation, verifiability, licenses, and community standards for mathematical research data. We also feature some nice pure-maths examples we made for Love Data Week, report on the first MaRDI workshop for researchers in theoretical fields who are new to FAIR research data management, and entertain you with surveys and news from the world of research data.
To get into the mood of the topic, here is a question for you:
If you need to (re)use research data you created some time ago, how much time would you need to find and understand it? Would you have the data at your fingertips, or would you have to search for it for several days?
You will be taken to the results page automatically after submitting your answer, where you can find out how long other researchers would take. Additionally, the current results can be accessed here.
On the shoulders of giants
The famous quote from Newton, “If I have seen further, it is by standing on the shoulders of giants”, usually refers to how science is built on top of previous knowledge, with researchers basing their results on the works of scientists who came before them. One could reframe it by saying that scientific knowledge is reusable. This is a fundamental principle in the scientific community: once a result is published, anyone can read it, learn how it was achieved, and then use it as a basis for further research. Reusing knowledge is also ingrained in the practice of scientific research as the basis of verifiability. In the natural sciences, the scientific method demands that experimental data back your claims. In mathematical research, the logical construction demands a mathematical proof of your claims. This means that, for good scientific practice, your results must be verifiable by other researchers, and this verification requires reusing not only the mental processes but also the data and tools used in the research.
Research data must be as reusable as the results and publications they support. From the perspective of modern, intensively data-driven science, this demand poses some challenges. Some barriers to reusability are technical, because of incompatibilities of standards or systems, and this problem is largely covered in the Interoperability principle of FAIR. But other problems such as poor documentation or legal barriers can be even bigger obstacles than technical inconveniences.
Reuse of research data is the ultimate goal of FAIR principles. The first three principles (Findable, Accessible, Interoperable) are necessary conditions for effective reuse of data. What we list here as “Reusability” requirements are all the remaining conditions, often more subjective or harder to evaluate, that appeal to the final goal of having a piece of research data embedded in a new chain of results.
To be precise, the Reusability principle requires data and metadata to be richly characterised with descriptors and attributes. Anyone potentially interested in reusing the data should easily find out if that data is useful for their purposes, how it can be used, how it was obtained, and any other practical concerns for reusing it. In particular, data and metadata should be:
- associated with detailed provenance.
- released with a clear and accessible data usage license.
- broadly aligned with agreed community standards of its discipline.
Documentation
It is essential for researchers to acknowledge that the research data they generate is a first-class output of their scientific research and not only a private sandbox that helps them produce some public results. Hence, research data needs to be curated with reusability in mind, documenting all details (even some that might seem irrelevant or trivial to its authors) related to its source, scope, or use. In data management, we use the term “provenance” to describe the story and rationale behind data. Why it exists, what problem it was addressing, how it was gathered, transformed, stored, and used… all this information might be relevant for a third party who encounters the data for the first time and has to judge whether or not it is relevant for them.
For experimental data, it is important to document exactly what the purpose of the experiment was, which protocol was followed to gather the data, who did the fieldwork (in case contact information is needed), which variables were recorded, how the data is organized, which software was used, which version of the dataset it is, etc. As an antithesis of the ideal situation, imagine that you, as a researcher, find out about an article that uses some statistical data that you think you could reuse, or that you want to look at as a referee. The data is easily available, and it is in a format that you can read. The data, however, is confusing. The fields in the tables have cryptic names such as “rgt5” and “avgB” that are not defined anywhere, leaving you to guess their meaning. Units of measure are missing. Some records are marked as “invalid” without any explanation of the reason and without making clear whether those records were used in calculations or not. Derived data is calculated from a formula, but the implementation in the spreadsheet is slightly but significantly different from the formula in the article. If you re-run the code, the results are thus a bit different from those stated in the article. At some point, you try to contact the authors, but the contact data is outdated, or it is unclear which of the several authors can help with the data (you can picture such a scene in this animated short video). Note that in this scenario the research data might have been perfectly Findable, Accessible, and technically good and Interoperable, but without attention to the Reusability requirements, the whole purpose of FAIR data is defeated.
In computer-code data, documentation and good community development practices are non-trivial issues the industry has been addressing for a long time. Communities of programmers concerned by these problems have developed tools and protocols that solve, mitigate, or help manage these issues. Ideally, scientists working on scientific computing should learn and follow those good practices for code management. For instance, package managers for standard libraries, version control systems, continuous integration schemes, automated testing, etc., are standard techniques in the computer industry. While not using any of these techniques and just releasing source code in zip files might not break F-A-I principles, it will make reuse and community development much more difficult.
Documenting algorithms is especially important. Algorithms frequently use tricks, constants that get hard-coded, code patterns that come from standard recipes, parts that handle exceptional cases… Most often, even very well-commented code is not enough to understand the algorithm, and a scientific paper is published to explain how the algorithm works. The risk is a mismatch between the article that explains the algorithm and the released, production-ready code that implements it. If the code implements something similar to, but not exactly, what is described in the article, there is a gap where mistakes can enter. Close integration between the paper and the code is crucial to spare the newcomer from having to rework how the described algorithm translates into code.
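One lightweight practice (a generic illustration, not a MaRDI prescription; the article and equation numbers are fictitious) is to reference the paper directly in the code, so that every hard-coded constant points back to the equation that justifies it:

```python
# Generic illustration: the docstring and comments cite the (fictitious) article and
# equation numbers, so readers can check the code against the paper line by line.
import math

def relaxation_parameter(h):
    """Optimal SOR relaxation parameter, Eq. (3.7) of Doe & Roe (2023).

    The constant 2.0 is not a tuning knob: it comes from the stability bound
    derived in Section 3.2 of the article. `h` is the mesh width.
    """
    return 2.0 / (1.0 + math.sin(math.pi * h))   # Eq. (3.7), transcribed verbatim
```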
Verifiability
As we introduced above, independent verification is a pillar of scientific research, and verification cannot happen without reusability of all necessary research data. MaRDI puts a special effort into enabling verification of data-driven mathematical results, by building FAIR tools and exchange platforms for the fields of computer algebra, numerical analysis, and statistics and machine learning.
An interesting example arises in computer algebra research. In that field, output results are often as valuable in themselves as the programs that produced them. For instance, classifications and lists are valuable by themselves (see, for example, the LMFDB or MathDB sites for some classification projects). Once such a list is found, it can be stored and reused for other purposes without any need to revisit the algorithm that produced it. Hence, the focus is normally on the reusability of the output, while the reusability of the sources is forgotten. This neglects the provenance of the data: how it was created and which techniques were used to find it. This entails serious risks. Firstly, it is essential to verify that the list is correct (since a lot of work will be carried out assuming it is). Secondly, it is often the case that later research needs a slight variation of the list offered in the first place, so researchers need to modify parameters or characteristics of the algorithm to create a modified list.
In the case of numerical analysis, the outputs are algorithms, usually aimed at reuse by users, often in the form of computing packages or libraries. However, several different algorithms may compete on accuracy, speed, hardware requirements, etc., so the “verification” process gets replaced by a series of benchmarks that can rate an algorithm in different categories and verify its performance. We described in the previous newsletter how MaRDI would like to make numerical algorithms easier to reuse and to benchmark them in different environments.
As for statistical data, our Interoperability issue of the newsletter describes how MaRDI curates datasets with “ground truths”: facts that are known for sure independently of the data and that allow new statistical tools applied to the data to be validated. In this case, reusing these new statistical tools in new studies increases the corpus of cases where the tool has been successfully used, making each reuse part of the validation process.
Licenses
We also discussed licenses in our Accessibility issue. Let’s recall that FAIR principles do not prescribe free / open licenses, although those licenses are the best way to allow unrestricted reusability. However, FAIR principles do require a clear statement of the license that applies, be it restrictive or permissive.
Even within free/open licenses, the choice is wide and tricky. In software, open-source licenses (e.g. the MIT and Apache licenses) refer to the fact that the source code must be provided to the user. Those are amongst the most permissive because, with the code, one can study, run, or modify it. In contrast, free-software licenses (e.g. the GPL) carry some restrictions and an ethical/ideological load. For instance, many free licenses include copyleft, which means that any derived work must keep the same license, effectively preventing a company from bundling this software in a proprietary package that is not free software.
In creative works (texts, images…), the Creative Commons licenses are the standard legal tool to explicitly allow redistribution of works. There are several variants, ranging from almost no restrictions (CC0 / Public domain), to including clauses for attribution (CC-BY, attribution), sharing with the same license (CC-SA, share alike), or restricting commercial use (CC-NC, non-commercial) or derivative works (CC-ND, non-derivative), and any compatible combination. For databases, the Open Database License (ODbL) is a widely used open license, along with CC.
The following diagram shows how you can determine which CC license would be appropriate for you to use:
Note that CC-ND is not an open license, and CC-NC is subject to interpretation of the term “non-commercial,” which can pose problems. While CC licenses have been defended in court in many jurisdictions, there are always legal details that can pose issues. For instance, the CC0 license intends to waive all rights over a work, but in some jurisdictions there are rights (such as authorship recognition) that cannot be waived. Other details concern the license versions. The latest CC version is 4.0, and it intends to be valid internationally without the need to “port” or adapt it to each jurisdiction, but each CC version has its own legal text and thus provides slightly different legal protection. Please note that this survey article does not provide legal advice; you can find all the legal text and human-readable summaries on the CC website.
In general, the best policy for open science is to use the least restrictive license that suits your needs and, with very few exceptions, not to add or remove clauses to modify a license. Reusing and combining content implies that newly generated content needs a license compatible with those of the parts that were used. This can become complicated or impossible the more restrictions those licenses carry (for instance, with interpretations of commercial interest or copyleft demands). Also, licenses and user agreements can conflict with other policies, such as data privacy; see an example in the Data Date interview in this newsletter.
Community
Perhaps the most condensed form of the Reusability principle would be “do as the community does or needs”, since it is a goal-focused principle: if the community is reusing and exchanging data successfully, keep those policies; if the community struggles with a certain point, act so that reuse can happen.
MaRDI takes a practical approach to this, studying the interaction between and within the mathematics community and other research communities and the industry. We described this “collaboration with other disciplines” in the last newsletter, and we highlighted the concept of “workflow” as the object of study, that is, the theoretical frameworks, the experimental procedures, the software tools, the mathematical techniques, etc. used by a particular research community. By studying the workflows in concrete focus communities, we expect to significantly increase and improve their reuse of mathematical tools, while also setting methods that will apply to other research communities as well.
MaRDI’s most visible output will be the MaRDI Portal, which will give access to a myriad of FAIR resources via federated repositories, organized cohesively in knowledge graphs. MaRDI services will not only facilitate the reusability of research data for mathematicians and researchers in other fields alike, but also be a vivid example of best practices in research life. This portal will be a gigantic endeavor to organize FAIR research data, a giant on whose shoulders tomorrow’s scientists can stand. We strive for MaRDI to establish a new data culture in the mathematical research community and in all disciplines it relates to.
In Conversation with Elisabeth Bergherr
In this episode of Data Dates, Elisabeth and Christiane talk about reusability and the use of licenses in interdisciplinary statistical research, students' thesis, and teaching.
Love Data Week
Love Data Week is an international week of actions to raise awareness for research data and research data management. As part of this initiative, MaRDI created an interactive website that allows you to play around with various mathematical objects and learn interesting facts about their file formats.
Research data in discrete math
In mid-March, the MaRDI outreach task area hosted the first research-data workshop for rather theoretical mathematicians in discrete math, geometry, combinatorics, computational algebra, and algebraic geometry. These communities are not covered by MaRDI's topic-specific task areas but form an important part of the German mathematical landscape, in particular with the initiative for a DFG priority program whose applicants co-organized the event. A big crowd of over sixty participants spent two days in Leipzig discussing automated recognition of Ramanujan identities with Peter Paule, machine-learned Hodge numbers with Yang-Hui He, and Gröbner bases for locating photographs of dragons with Kathlén Kohn. Michael Joswig led a panel focusing on the future of computers in discrete mathematics research and the importance of human intuition. Antony Della Vecchia presented file formats for mathematical databases, and Tobias Boege encouraged the audience to reproduce published results in a hands-on session, with participants finding pitfalls even in the simplest exercise. In the final hour, young researchers took the stage to present their areas of expertise, the research data they handle, and their take-away messages from this workshop: to follow your interests, keep communicating with your peers and scientists from other disciplines, and make sure your research outputs are FAIR for yourself and others. This program made for a very lively atmosphere in the lecture hall and was complemented by engaging discussions on mathematicians as pattern-recognition machines, how mathematics might be a bit late to the party in terms of software, whether humans will be obsolete soon, and the hierarchy of difficulty in mathematical problems.
Conference on Research Data Infrastructure
The Conference will take place September 12th – 14th, 2023, in Karlsruhe (Germany). There will be disciplinary tracks and cross-disciplinary tracks.
Abstract submissions deadline: April 21, 2023
More information:
- in English
IceCube - Neutrinos in Deep Ice
This code competition aims to identify which direction neutrinos detected by the IceCube neutrino observatory came from. PUNCH4NFDI is focused on particle, astro-, astroparticle, hadron, and nuclear physics, and is supporting this ML challenge.
Deadline: April 23, 2023
More Information:
- in English
Open Science Radio
Get an overview of all NFDI consortia funded to date, and gain an insight into the development of the NFDI, its organizational structure, and goals in the 2-hour Open Science Radio episode interviewing Prof. Dr. York Sure-Vetter, the current director of the NFDI.
Listen:
- in English
The DMV, in cooperation with the KIT library, maintains a free self-study course on good scientific practice in mathematics, including notes on the FAIR principles. (Register here to subscribe to the free course.)
Edmund Weitz of the University of Hamburg recorded an entertaining chat about mathematics with ChatGPT (in German).
Remember our interview about accessibility with Johan Commelin in the second MaRDI Newsletter? The Xena Project is "an attempt to show young mathematicians that essentially all of the questions which show up in their undergraduate courses in pure mathematics can be turned into levels of a computer game called Lean". It has published a blog post highlighting very advanced maths that can now be understood using the interactive theorem prover Lean that Johan told us about.
On March 14, the International Day of Mathematics was celebrated worldwide. You can relive the celebration through the live blog, which also includes two video sessions with short talks for a general audience—one with guest mathematicians and one with the 2022 Fields Medal laureates. This year, the community was asked to create Comics. Explore the featured gallery and a map with all of the mathematical comic submissions worldwide.
Welcome to the third issue of the MaRDI Newsletter on mathematical research data, and happy holidays! We give you a brief snapshot of the world of interoperability. This is the third, and perhaps one of the most challenging, of the FAIR principles: it is very topic-dependent and much more technical than, say, findability. Its key question is: how do you seamlessly hand a digital object from one researcher to another?
licensed under CC BY-NC-SA 4.0.
We discuss the meaning and implications of interoperability in a number of mathematical disciplines, interview an expert on scientific software, report on workshops that have happened in the mathematical research-data universe, and much more.
We encounter different systems almost everywhere in our lives, both professionally and in everyday situations. Not all of them seem to be interoperable. For example, a navigation app will not be able to interpret equations, and it might not be trivial to ask Mathematica to compile your Julia computations. Think of any two systems—what would a marriage of the two look like? (We understand marriage here as establishing the basis for communication and exchange.)
If you could choose two systems you would like to get married, which ones would you choose?
Did you choose a perfect match in the survey above? You can add more anytime...
Interoperability: Let's play together
In our previous newsletters, we have covered the Findability and Accessibility principles of FAIR research data. Those are the basic principles that give researchers awareness of the existence of research data and access to it. In contrast, the remaining two principles, Interoperability and Reusability, relate to what can be done with that data, or rather to its quality. They have more profound implications for the interactions of the research community as a whole.
Research is almost never conducted in isolation. Researchers build on top of other researchers’ findings, combine different sources with their own insights, and use plenty of tools and methods developed by others. Here we will focus on some technical (and less technical) requirements to make this research community possible: Interoperability.
Interoperability is the capacity to combine pieces from different sources to work together. Standards in science and industry, such as measuring units or the shape of plug connectors, are designed for interoperability. In research, a simple example is language. Most scientific research is nowadays written and published in English. While there may be valid reasons to use other languages (in specific disciplines, in outreach, to foster exchanges in a particular cultural group…), the reality is that using a single lingua franca for scientific research enables comprehension and use of any scientific publication to all researchers. This creates a necessity for researchers to learn and use the English language as part of their research (and life) skills. When it comes to computers, plenty of standards respond to the need for interoperable data, such as file formats or computer languages (pdf, LaTeX, …), some having more success than others.
For research data, interoperability is crucial to enable a research community to collaborate and interact. Interoperability means using a standard set of vocabulary and data models that give a good and agreed representation of the type of research data in question. This effectively sets a standard for data communication. Then each researcher can adapt their tools and methods to process data within those standards.
To be precise, FAIR principles provide a framework for interoperable research data:
- Data and metadata must use a knowledge representation (ontologies, data models) that is shared, broadly applicable, and accessible.
- Such knowledge representation must be itself FAIR.
- When data and metadata reference other data and metadata, their relationship must be qualified (e.g. data X uses algorithm Y in such a way, data Z is derived from dataset W by applying such filtering).
In information science, an ontology is the set of all relevant concepts and relationships for a particular domain. This can be an enumeration or represented by a knowledge graph where nodes are concepts (think of nouns), and edges are qualifiers (think of verbs). This theoretical reflection of the nature of your research data is fundamental to developing useful standards that enable practical interoperability.
The MaRDI project actually devotes a significant part of its efforts to improving the interoperability (and reusability) of research data. Here we provide a brief summary of these interoperability efforts.
Computer Algebra
Computer Algebra concerns calculations on abstract mathematical objects, such as groups, rings, polynomials, manifolds, polytopes, etc. Computations are generally exact (no numerical approximations). Typical use cases of computer algebra are enumeration problems, for instance, finding a list of all graphs with certain properties. For such abstract objects, even the data representation is already non-trivial, so researchers often build on top of specific frameworks called Computer Algebra Systems (CAS) that implement these data types and methods. Such CASes can be of broad scope, like Mathematica, Maple, Magma, SageMath, OSCAR, etc., or they can focus on a specific domain, like GAP (group theory), Singular (algebraic geometry), or Polymake (polytopes and other combinatorial objects). A desirable goal would be a common data format that allows interoperability between different software systems without loss of CAS information, enabling files to be parsed and functions to be called from one system to another. This is obviously not an easy task. On the one hand, some of those CASes (e.g. Mathematica, Maple…) are proprietary; their focus is not purely on math research, and they also provide tools used in other fields such as engineering or education. Interoperability approaches that use anything other than their provided APIs will therefore likely fail. On the other hand, the specific-purpose CASes such as GAP, Singular, or Polymake (incidentally, all three originated at and are maintained by German universities and researchers close to MaRDI) are open source and can be used stand-alone, but are also integrated into broader CASes such as SageMath (Python-based) or OSCAR (Julia-based). Turning these specific systems into broad-purpose CASes while also retaining state-of-the-art algorithms from the latest research is already a great success story.
The goals for MaRDI in Computer Algebra are to document and establish workflows, data formats, and guidelines on how to set up databases. By ‘workflows’ we mean the process of generating or retrieving data, setting up an experiment, and obtaining conclusions; documenting a workflow implies recording the exact versions of the software (and possibly hardware) used, as well as the tech stack (from the operating system to the languages, interpreters, and libraries used). This has benefits such as enabling verification of the results and making further reuse easier. It also provides clear guidelines on which software can be used together, replaced, or mixed, and therefore helps evaluate its interoperability.
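As a minimal sketch of what recording such a tech stack could look like (plain Python, not an actual MaRDI tool; the package list is arbitrary), one can snapshot the operating system, interpreter, and library versions alongside the experiment:

```python
# Minimal sketch: record the environment of a computational workflow as
# machine-readable metadata, so the computation can later be reproduced.
import json
import platform
import sys
from importlib.metadata import PackageNotFoundError, version

def environment_snapshot(packages):
    """Collect OS, interpreter, and package versions for a workflow record."""
    snapshot = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "packages": {},
    }
    for name in packages:
        try:
            snapshot["packages"][name] = version(name)
        except PackageNotFoundError:
            snapshot["packages"][name] = "not installed"
    return snapshot

# Example: document the libraries a numerical experiment depends on.
print(json.dumps(environment_snapshot(["numpy", "scipy"]), indent=2))
```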
Documenting and establishing data formats means going a step further in interoperability: not only describing which software or data format the current work adheres to, but actually making a system-agnostic description of the data. For instance, if we are using a particular ring of polynomials in several variables with coefficients in a particular field, the data description should make clear how we store and operate on the elements of such a ring. Typically, this will follow a data format from a particular CAS, but having an independent description will enable other CASes to implement a compatibility layer to reuse the data. This will become even more relevant when implementing new abstract structures. Eventually, the goal is that any CAS wishing to support a particular data format can implement a compatibility layer based on the data description. This is called data serialization, as the goal is to translate internal data structures into a text description, which can be exchanged with another system and de-serialized there, that is, turned into the data structure of the new system with the same semantic information but possibly a different implementation. The MaRDI team is implementing this data serialization in OSCAR, but the goal is to have a system-agnostic specification.
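The following toy sketch illustrates the serialization idea (it is not the actual OSCAR/MaRDI format): a polynomial in Q[x, y] is described in a system-agnostic way, turned into text, and rebuilt on the receiving side in whatever structure that system prefers.

```python
# Toy serialization sketch (NOT the real OSCAR/MaRDI format): a polynomial in
# Q[x, y] described as exponent vectors plus coefficients, with ring metadata.
import json
from fractions import Fraction

# 3/2 * x^2 * y - 7 * y^3, described independently of any CAS
poly = {
    "ring": {"base_field": "QQ", "variables": ["x", "y"]},
    "terms": [
        {"exponents": [2, 1], "coefficient": "3/2"},
        {"exponents": [0, 3], "coefficient": "-7"},
    ],
}

text = json.dumps(poly)   # serialize: internal structure -> exchangeable text

def deserialize(data):
    """Rebuild the polynomial in the receiving system's own structure; here,
    simply a dict mapping exponent tuples to Fraction coefficients."""
    obj = json.loads(data)
    return {tuple(t["exponents"]): Fraction(t["coefficient"]) for t in obj["terms"]}

print(deserialize(text))   # {(2, 1): Fraction(3, 2), (0, 3): Fraction(-7, 1)}
```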
Finally, documenting computer algebra databases will, among other benefits in findability, enable a comprehensive picture of the different systems and the compatibility layers needed to have interoperability amongst them.
Scientific computing
Numerical algorithms are central to scientific computing. Their approximations to exact mathematical quantities come with inherent inexactness and error propagation, due to the finite precision of the underlying data structures. This contrasts with the abstract and exact objects used in Computer Algebra. Typical examples are linear solvers (Ax=b) for different types of matrices (big, small, huge, sparse, dense, stochastic…), or numerical integration methods for ODEs or PDEs. Numerical algorithms are closely associated with applied mathematics, and performance or scalability are relevant factors for choosing one method over another. We already described in the Findability article that MaRDI is building a knowledge graph for those numerical algorithms, together with benchmarks, supporting articles for theoretical background, and other features. But the goal goes beyond creating such a graph just to find algorithms: MaRDI also pursues the ambitious goal of developing an infrastructure that makes all these algorithms interoperable.
Researchers implement their algorithms in programming languages such as MATLAB (which is proprietary), or C/C++, Julia, Python, etc, possibly with extension libraries. To achieve interoperability between different numerical methods, MaRDI proposes a three-component architecture (driver - connector - implementor). For a particular algorithm, the implementor is the piece of software that contains the actual existing algorithm in whatever language or framework its author used. The driver is a high-level calling function that contains the semantics of the data, but not the implementation of the algorithm. The same data model can then be used by drivers of different numerical algorithms, even if their implementations use completely different technologies, thus enabling an interoperable ecosystem. The prototypes of those drivers are being proposed and defined by the MaRDI team. The missing critical piece is the connector, which mediates between the driver and the implementor and needs to be developed for each algorithm, likely in collaboration with the original author. The MaRDI team is implementing some examples, but the goal is that in the future, any researcher developing numerical algorithms can use their preferred technology stack and then easily implement a connector to standard driver functions.
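A minimal sketch of this separation of concerns is shown below, using NumPy's dense solver as a stand-in "implementor". The function and class names are invented for illustration; MaRDI's actual driver specifications may look quite different.

```python
import numpy as np

# Implementor: the existing algorithm in whatever technology its author chose.
# Here we simply reuse NumPy's dense solver as a stand-in.
def numpy_dense_solver(matrix, rhs):
    return np.linalg.solve(matrix, rhs)

# Connector: adapts the driver's semantic data model to the implementor's
# calling convention (written once per algorithm).
class DenseSolverConnector:
    def run(self, problem):
        A = np.asarray(problem["A"], dtype=float)
        b = np.asarray(problem["b"], dtype=float)
        return {"x": numpy_dense_solver(A, b)}

# Driver: a high-level call that only knows the semantic data model
# ("a linear system Ax = b"), not the implementation behind it.
def solve_linear_system(problem, connector):
    return connector.run(problem)

problem = {"A": [[4.0, 1.0], [1.0, 3.0]], "b": [1.0, 2.0]}
solution = solve_linear_system(problem, DenseSolverConnector())
print(solution["x"])  # approximately [0.0909, 0.6364]
```

Swapping in another solver, for example a sparse or iterative one written in C++ or Julia, would only require a new connector; the driver call and the problem description stay the same.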
The benchmark comparison between algorithms (planned for the knowledge graph) actually requires this interoperability architecture, so that the same test can be executed by different algorithms under equal conditions, without any need to adapt the data to fit a particular tech framework.
Statistics and Machine Learning
Typical research data usage in statistics or machine learning involves big experimental datasets, frequently coming from other domains. Good examples of this are genetic data or financial data. These datasets contain valuable information that researchers try to extract using statistics or AI techniques. In statistics, for instance, a typical goal is to create a model, that is, to describe the joint probability distribution of all the variables in terms of the individual probability distributions of each variable. This means understanding the dependencies between the variables.
A problem often faced by statisticians who develop new theoretical methods to extract information from experimental data is that there is only a very limited collection of suitable datasets on which they can test these methods. It is difficult to obtain curated data from interdisciplinary teams before the statistical tools are proven useful and robust, which leaves researchers with limited choices for running tests. The most valuable information in curated data includes “ground truths”, that is, relationships between variables that are known externally to the experimental data, via expert knowledge from another field. For instance, in a macroeconomic study, some variables can be related or independent, or their relationship may depend on the presence of a third indicator variable, or even more complex interactions. We may know some of these interactions from government policies or strategies which are not reflected directly in the data. For the statistician, such a "ground truth" is very useful to validate the algorithm used to fit the model. A goal for MaRDI is to collect a broader, curated list of datasets that statisticians can use to test and validate modeling techniques. These datasets need to be cleaned and ready to be used by standard statistical packages (that is, to be interoperable), and to have useful annotated “ground truths” attached to the data for use by interdisciplinary teams. Besides this data collection, MaRDI aims to be a leading example of quality curated data, so that experimentalists can adhere to those quality standards.
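To make the idea of an annotated "ground truth" concrete, here is a toy sketch of how known relations attached to a dataset could be used to score a fitted model. The variables, relations, and scoring are invented for illustration and are not a MaRDI data format.

```python
# Toy example: a dataset annotated with externally known ("ground truth")
# relations between variables, used to validate an estimated structure.
# Variable names and relations are invented for illustration.
ground_truth = {
    ("interest_rate", "inflation"): "dependent",
    ("interest_rate", "rainfall"): "independent",
    ("exports", "exchange_rate"): "dependent",
}

# Pretend output of some structure-learning algorithm run on the same data.
estimated = {
    ("interest_rate", "inflation"): "dependent",
    ("interest_rate", "rainfall"): "dependent",   # a mistake
    ("exports", "exchange_rate"): "dependent",
}

# Fraction of annotated relations that the fitted model recovers correctly.
hits = sum(estimated.get(pair) == label for pair, label in ground_truth.items())
print(f"recovered {hits}/{len(ground_truth)} known relations")
```

The annotation travels with the dataset, so any statistician can validate a new fitting algorithm against the same externally known facts.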
Another goal concerns machine learning (ML) algorithms. The community around ML extends far beyond mathematicians (software developers, data scientists, ML engineers…), and therefore the frameworks used are very diverse. TensorFlow and Torch are two popular tools in the industry, but there are many others. The language R is well suited for statistics and data science, and also for machine learning. An initiative to bring cohesion and interoperability to this software ecosystem is mlr3 (machine learning for the R language), which MaRDI is using and extending. The mlr3 project brings different R packages together (often based on or operating on other frameworks), providing unified naming conventions and a full suite of tools (learners, benchmarks, analyzers, importers/exporters, …), making R and mlr3 a competitive integrated framework for ML.
We can see a couple of examples of how MaRDI is bridging interoperability gaps in this field. A first example: in machine learning (as in the statistics case we saw earlier), there is a great need for more quality datasets (training, evaluation…). OpenML is a web service that allows sharing of datasets and ML tasks within the ML community. MaRDI is helping to build mlr3oml, an interoperability interface between mlr3 and OpenML. MaRDI also builds and stores “curated quality datasets” in OpenML that can be used for testing and benchmarking, and also as a model of good practices.
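Programmatic access is what makes such a service usable from any framework. As a sketch, the openml Python client can pull a dataset roughly as shown below (the R package mlr3oml plays the analogous role on the mlr3 side); the dataset ID is arbitrary and the exact call signature may differ between client versions.

```python
import openml

# Fetch a dataset by its OpenML ID (61 is the classic "iris" set;
# any other ID works the same way).
dataset = openml.datasets.get_dataset(61)

# Retrieve the data plus the default prediction target as annotated on OpenML.
X, y, categorical, names = dataset.get_data(
    target=dataset.default_target_attribute
)
print(dataset.name, X.shape, y.name)
```

Because the dataset and its annotations live on OpenML rather than inside one framework, the same data can feed an mlr3 benchmark, a scikit-learn pipeline, or a hand-written experiment.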
A second example: many learning algorithms in ML are treated as black boxes; they come from different ML techniques and have different implementations. However, a significant part of these algorithms come from neural-network techniques that share common characteristics: architecture, loss function, optimizer… The package mlr3torch, being developed with MaRDI, aims to “open” some of those black boxes, giving greater control over those components.
Cooperation with other disciplines
MaRDI strives to bring together mathematical methods and the people who use them. Today this collaboration requires much more than a common spoken language and publishing in international journals: shared data languages are crucial. MaRDI aims to understand and document how researchers in disciplines other than mathematics use (or would like to use) mathematical research data. Hence, the “interoperability” between mathematics and other fields is key. Over the past year, MaRDI has collected a series of case studies from other NFDI (the German National Research Data Infrastructure program) consortia, other research groups, and also from industry, documenting through a series of templates how they work and use research data. The key concept is the “workflow”, meaning the documentation of the whole process: setting up a theoretical framework, the hypotheses to scrutinize, the experimental model, data acquisition, technical equipment, metadata association, data processing, software used, data analysis techniques, extraction of results, publications… everything that is directly related to data management, but also its research context. Several examples of workflows can be found on the MaRDI portal TA4 page. Currently, the collected information is textual, highlighting the data acquisition process (and its metadata) and the mathematical model used. In the future, both the (meta)data and the model will be formalized by means of ontologies and model pathway diagrams (graphs) to enable further uses of the research data, such as reproducing results, replacing methods and techniques with newer or more performant ones, or enabling reuse by other researchers.
Looking at the case studies, one can observe that most researchers implement “island solutions” adapted to their specific needs, even if those solutions may be very professional and optimized. There is great potential to increase interoperability and exchange. MaRDI aims to bring about a change in mathematical data management and analysis to support researchers, in the belief that such a shift will be broadly welcomed within the research community.
MaRDI portal
The MaRDI portal will be the single entry point to all the MaRDI services and resources collected by the different task areas. The portal team is currently building a knowledge graph of mathematical research data by retrieving information from other sources (for instance, Wikidata; swMATH for documenting mathematical software; package repositories to improve the granularity of information on some mathematical software; zbMATH Open to retrieve publications; etc.). This requires a lot of interoperability work using the respective APIs, since the volume of data is not manageable by hand. Some automation and AI techniques are being considered to support this process. In due time, all the different MaRDI teams will start delivering their outputs, and the portal team will manage their integration into the portal. For instance, the knowledge graph of numerical algorithms will be integrated into the knowledge graph of the MaRDI portal. The statistical dataset collections will also be described as entities in the MaRDI knowledge graph, and so on. In a sense, the portal needs to create interoperability layers between the internal task areas of MaRDI.
All in all, the interoperability principle is an enabling condition for building and strengthening a community. That is the driving goal of all the efforts from MaRDI that we described here. This enabling condition turns into an actual collaboration when the data is reused across different projects and researchers, which will be the topic of our fourth article in this series, about Reusability.
In Conversation with Ulrike Meier Yang
In the third episode of the interview series Data Date, Ulrike and Christiane talk about mathematical research data in the xSDK project, the importance of guidelines, three levels of interoperability, and automated testing.
MaRDI annual workshop 2022
In mid-November, the whole MaRDI team met at WIAS in Berlin for its second annual workshop. The kickoff in Leipzig one year earlier had provided an enthusiastic start for the consortium and for building infrastructure for mathematical research data in Germany. The slogan at the time was to spend the coming twelve months doing two things: listening (zuhören) and simply getting started (einfach anfangen)!
Now the team looked back, recapped, and planned for the second year and further into the future. Over the course of three days, approximately forty people met in person, with some participating online, to first present each task area's updates and vision, then discuss current issues in interactive small-group BarCamps, and finally decide on the upcoming route. The event was kicked off with a keynote talk by Martin Grötschel, who stressed the importance of following a bottom-up process and pointed out potential pitfalls for such projects, drawn from his own experience. This was followed by NFDI's Cord Wiljes describing potential benefits of cross-consortial collaborations. There was plenty of lively discussion centered around possible career paths of women in maths and data, and around how MaRDI could live up to the central expectations of the Portal, link knowledge graphs, best deal with the very diverse mathematical research data in management plans, and build a community. BarCamps developed ideas and new work packages, like the setting up of an editorial team for the Portal. Throughout the workshop, many participants filled in self-designed bingo sheets to collect #MaRDI_buzzwords. The long and pleasant days were rounded off by a visit to the computer-games museum and a conference dinner. At the end of the workshop, the MaRDI team concluded that the coming year would best be spent building on the previous "listening and getting started" and now focusing on two further tasks: networking with the community (vernetzen) and collaborating (zusammenarbeiten) within the consortium. This will link MaRDI's expertise across different institutions and will ensure that the resulting services reach and engage with potential users early on, making them truly useful for the working mathematician.
MaRDI Movies
The first in this series of short, entertaining, and informative videos is called 'Mardy, the happy math rabbit'. Follow Mardy through the pitfalls of reproducing software results: An introduction to software review in mathematics by Jeroen Hanselmann.
MOM workshop on MaRDI, OSCAR, and MATHREPO
In November, MaRDI's task area for Computer Algebra invited their community to ZIB and TU Berlin for the "MOM workshop on MaRDI, OSCAR and MATHREPO". Over the course of two days, some twenty people met in person to discuss how to deal with databases, polytopes, triangulations, graded rings, polynomials, Gröbner bases, finite point configurations, and the like. Particularly important were questions on how to save an object, where to store it long-term, how to seamlessly interact with databases, and how to reproduce a computation.
The MaRDI organisers presented serialisation and workflow efforts and led an exercise in reproducibility in which the participants were asked to rerun published research outputs. Some could be redone quite well, others were not so easy to reproduce. A number of examples came from the mathematical research-data repository MathRepo, co-maintained by MaRDI's Tabea Bacher. The awarding of the FAIRest MathRepo page of 2022 was also part of the workshop. A jury of interested workshop participants took a closer look at the contributions previously nominated by the audience and judged them according to the FAIR principles. The highly deserved winner was Tobias Boege from Aalto University for his entry on Selfadhesivity in Gaussian conditional independence structures. In addition to providing very good documentation, he found a way to make huge amounts of his research data FAIRly available, an unusually difficult problem, by compressing files and using the MPDL repository Keeper as a long-term storage solution.
Alheydis Geiger from the Max Planck Institute for Mathematics in the Sciences, Leipzig, presented a user story of OSCAR. In her paper, she and her collaborators combined different computer algebra systems, such as OSCAR, Macaulay 2, Magma, Julia, Polymake, Singular, and more, to investigate self-dual matroids from canonical curves. The Graded Ring Database was introduced in a talk by Alexander M. Kasprzyk from the University of Nottingham, focusing on the mathematical meaning of the research data in the database as well as on technical and accessibility matters.
In a final session, researchers split into two smaller groups for discussion. The first group collected both computer algebra and general software systems used by the participants and discussed which system is best suited for which research questions. The other group discussed technical peer reviewing: how it can be done and why it is necessary (for more on technical peer reviewing, watch the MaRDI Movie Mardy, the happy math rabbit).
MaRDI Workshop on scientific computing—A platform to discuss the “HOW”
From October 26 to 28, 2022, the first MaRDI Workshop on Scientific Computing took place at WWU, Münster. About 40 people from the scientific computing community and from MaRDI came together to learn and talk about research data in three densely packed days of exchange.
The introductory talk by Thomas Koprucki on MaRDI was followed by blocks of talks on topics such as workflows and reproducibility, ontologies and knowledge graphs, and benchmarks. Ten invited speakers presented their projects: for example, Ulrike Meier Yang (see video interview above) introduced the extreme-scale scientific software development kit xSDK, Benjamin Uekermann presented preCICE, a general-purpose simulation coupling interface, Andrea Walther talked about 40 years of developing ADOL-C, a package for automatic differentiation of algorithms, and Tyrone Rees presented FitBenchmarking, an open-source tool for comparing data analysis software.
As one of the main goals of the organizers was to bring together researchers from the scientific computing community and related disciplines to learn from different projects and related expertise, speakers were encouraged to present work in progress, open problems, or personal experiences; not only to talk about the "WHAT" but also to share the "HOW". This concept worked out, as was noticeable both in the coffee breaks, which were characterized by lively conversations, and in the afternoon of October 27th, which was devoted entirely to discussions. There were several discussion groups focused on a variety of topics, such as workflows and reproducibility, knowledge graphs, research software, benchmarks, training and awareness, ... The training and awareness group discussed how to deal with software that is not associated with a paper (there are some journals that might publish on such topics, but it is difficult to get the recognition deserved) and which career level is best approached for research data management topics. After the discussion in groups, the results were presented to everyone. One of the ideas that was discussed a lot when the groups reconvened was the possibility of providing better job security for software engineers by making them permanent employees of universities and having the projects they work on pay the university for their services.
Mario Ohlberger, co-spokesperson of MaRDI and co-organizer of the workshop, said there was great feedback for the event. The workshop created a new platform for exchange and generated many new impulses for MaRDI. Many participants had never been to such a workshop before and were happy to find others who are passionate about the same topics and willing to exchange ideas.
Digital Humanities meet Mathematics (DiHMa.Lab)
The first session of DiHMa.Lab took place in September with a workshop organized jointly by the Ada Lovelace Center for Digital Humanities and MaRDI’s interdisciplinary task area, TA4. Over the course of two days, about thirty people from archeology, philology, literary sciences, history, cultural studies, research-data management, and of course mathematics came together in this hybrid event to identify and discuss various interconnections, exchange experiences, and come up with ideas on how to improve the cooperation with and understanding of each other's research. The main focus of the workshop was to engage with both NFDI consortia—NFDI4Memory, NFDI4Objects, Text+, NFDI4Culture, KonsortSWD, MaRDI—and institutes involved in social sciences and humanities research, and to familiarize everyone with the methods, problems, questions, and research data of the represented fields.
To that end, researchers presented examples of (mathematical) research data and their handling in various projects from the digital humanities. For instance, Nataša Djurdjevac Conrad (ZIB) talked about a project in which the spreading of wool-bearing sheep in ancient times was analyzed using agent-based models. Christoph von Tycowicz (ZIB) presented instances of geometric morphometrics used to determine installation sites of ancient sundials or changing facial expressions during the aging process. Tom Hanika (Uni Kassel) and Robert Jäschke (IBI - HU Berlin) spoke about formal concept analysis and order theory and how they can be applied and yield interesting results when analyzing literary works or art.
What these projects have in common is that they avoid black-box situations, where a method is applied without really knowing how it works, which makes it a matter of chance whether the results are interpreted in a fitting manner. In order to obtain reliable results, mathematicians need to understand the complex questions and data arising in the digital humanities, and researchers from the digital humanities need to be careful in applying mathematical methods and understand them first, so as to be able to choose “the right” method and to correctly interpret the results. Achieving that enables successful collaborations and contributes entirely new mathematical questions, which in turn open up rich sources for novel questions in the digital humanities.
All in all, it was a very successful workshop, resulting in the idea of DiHMa.Lab establishing a “marketplace for methods” where digital-humanities questions could be posted and liked by mathematicians, who would preferably also propose a method. Moreover, the participants were very open, accommodating, and interested in the topics and concerns of the different fields, eager to learn new methods, to see what is possible if “we” join forces, and what new questions arise.
New consortia and an initiative for basic services
On November 4, the Joint Science Conference (GWK) decided to fund seven additional consortia as well as an initiative for the realization of cross-consortia basic services Base4NFDI within the framework of the National Research Data Infrastructure (NFDI). As in the two previous years, the decision by the GWK follows the recommendations of the NFDI expert panel appointed by the German Research Foundation (DFG).
More information:
- in German
International Love Data Week 2023
Love Data Week is an international celebration of data, hosted by the Inter-university Consortium for Political and Social Research (ICPSR), that takes place every year during the week of Valentine's day (in 2023: February 13 - 17). Universities, nonprofit organizations, government agencies, corporations, and individuals around the world are encouraged to host and participate in data-related events and activities held either online or in-person locally. The theme this year is Data: Agent of Change.
More information:
- in English
In October, the Netherlands hosted the "1st international conference on FAIR digital objects", with over 150 professionals signing the Leiden Declaration on FAIR Digital Objects. This is deemed to be "an opportunity for all of us working in research, technology, policy and beyond to support an unprecedented effort to further develop FAIR digital objects, open standards and protocols, and increased reliability and trustworthiness of data".
A group of MaRDI team members together with external experts have written a new article highlighting the status quo, the needs and challenges of research-data management plans for mathematics: a preprint is already available here.
The ICPSR published a guide to data preparation and archiving in 2020. Even though addressed to social scientists, the presented guidelines can be applied to any field.
The "Making MaRDI" Twitter series we announced in the previous Newsletter has been launched and integrated into the website. There are currently four profiles presenting the work that Karsten Tabelow, Tabea Bacher, Christian Himpe, and Ilka Agricola carry out in the consortium.
Welcome to the second issue of the MaRDI Newsletter. In each newsletter, we talk about various research-data themes that might be of interest to the mathematical community, in particular finding data that is relevant to advance your research, ensuring other people can access your files, solving the difficult problem of managing files between coauthors, and preserving your results such that your peers can build their research on those.
The FAIR principles for sustainable research-data management are important to us, so we present them individually in a series of articles. This issue of the Newsletter is dedicated to the A in FAIR: accessibility and what this means for mathematics.
licensed under CC BY-NC-SA 4.0.
In each newsletter, we also publish an episode of our interview series "Data Dates", tell you about an event that happened in the MaRDI universe, and offer some reading recommendations on FAIR topics.
In our last newsletter issue, we asked you to enter 3 methods you commonly use to search for or find mathematical research data. Here are the results of that survey:
Share your accessibility nightmare (or a success story)!
We will feature a selection of your stories in an upcoming newsletter (anonymously).
FAIR access to research data
Access to research information is the most fundamental principle for spreading science across the scientific community and society. Publishing and making research results available is a cornerstone of research. This, however, is not free of issues. On the one hand, some research is private, restricted within industry, or protected by intellectual property. On the other hand, other barriers exist when accessing data, in the form of technical incompatibilities, paywalls, bad metadata, or simply incomplete data.
The Accessibility principle of FAIR data is the idea that all the relevant data connected to a research result should be properly available. This concerns which data is available, to whom it is accessible, how it is technically stored and retrieved, and how it is classified and managed. This principle is rooted in the scientific fundament of reproducibility and verifiability: other researchers should be able to repeat and independently verify the published results. While this is especially important in the experimental sciences, it also applies to the domain of mathematics.
The FAIR principles state that research data is Accessible when it respects the following recommendations:
- The data is accessible over the internet, possibly after authentication and authorization. The means of access (protocols) must be open, free, and universal, and those protocols must include authentication and authorization whenever necessary.
- The metadata must be available together with the data, and it must persist even after the data is no longer available.
It is important to note the "possibly after authentication and authorization" clause. It is a common misunderstanding that FAIR accessibility implies access free of cost or under open licenses. That is not the case. Free-of-cost publication and open licenses fall into the domain of the Open Access principles. While FAIR and Open Access have points in common, we will see examples where non-open-access databases can be FAIR, and open-access articles and research data that are not FAIR because metadata or appropriate protocols are missing.
Standards and protocols are a fundamental element in FAIR accessible data. Many tasks, especially those that are repeated in the same way, are performed much more efficiently by machines than by humans. That is why computers are very important when dealing with research data, too. In terms of accessibility, any storage location would ideally provide interfaces where machines can automatically access research data, also referred to as Application Programming Interfaces or APIs.
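For example, a script can query a repository's REST API and get machine-readable metadata back as JSON, with no web browser involved. The sketch below uses Zenodo's public records endpoint; the query term and the fields read from the response are chosen for illustration and may need adjusting against the current API documentation.

```python
import requests

# Search a public repository for records matching a keyword and read
# basic metadata from the JSON response.
response = requests.get(
    "https://zenodo.org/api/records",
    params={"q": "mathematical research data", "size": 3},
    timeout=30,
)
response.raise_for_status()

for hit in response.json()["hits"]["hits"]:
    meta = hit["metadata"]
    print(meta["title"], "->", hit["links"]["self"])
```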
The research data behind the articles
Let us look at three stories of fictional mathematicians who use research data as a fundamental part of their work. They handle different types of data (databases, classifications, source code, articles...), which can also have different origins (produced by themselves or by a third party). They face different challenges in keeping their research data FAIR.
Alice is a mathematician working in computational algebra. She makes intensive use of software, but in her published articles, she often uses sentences such as "using software XX, we can see that...". Her scripts in the form of source code, software packages, toolchains, and her computed results are research data that, if omitted from the published results, are not FAIR data, making her results difficult to validate or replicate. She is aware of this problem and wants to solve it, so she sets up a server in her math department with her source code files, and she mentions that those files exist on her personal website; maybe she even puts the URL to the code in her articles. However, she has changed universities several times, thus changing her servers and websites, and many files and projects related to older articles are now lost. In order to be fully FAIR compliant, she needs to ensure that the data is bound to a metadata reference and to the research article, that it is accessible through standard internet protocols, and to plan for a long-term archive that does not disappear when she changes her job position. Ideally, she would assign a DOI to the source code and host it in a long-term archive (e.g., Zenodo, GitHub, MathRepo, or others). Furthermore, she needs to make the code Interoperable and Reusable, which we will discuss in forthcoming issues. The MaRDI project aims to help mathematicians in this situation improve their FAIR data management.
Alice also participates in a collaborative project to classify all instances of her favorite algebraic objects. She and other colleagues have set up an online catalog listing all the known examples, the invariants they use to classify, and bibliographical information. At the moment, this catalog contains a few hundred items; Alice and her team will need to provide download options, filters, and means to retrieve information from the database beyond the graphical web interface. They will need to provide the results in formats that can be further processed with standard tools. That is, they will need an API and standardized formats to allow other researchers to use that database effectively in their own research projects.
Bob and Charlie are mathematicians modelling biological processes. Bob models tumor growth in human cancer, and Charlie neurological activity in animals. They handle three types of data: experimental specimen data in the form of databases that they receive from a partner or third party, model data in the form of source code that they develop, and result data in the form of articles they publish.
For Bob, primary data comes mainly from patients in hospitals. For obvious privacy reasons, Bob cannot directly access that primary data. Instead, he relies on organizations that offer anonymized databases publicly available for research (for example, the National Cancer Institute). Parts of these databases are totally anonymous and can be given open access. Other records contain detailed genetic information that, by its nature, could be used to identify the patient. Those databases have authenticated access, and researchers can only access them after being identified and committing to respect standard good practices in handling medical data. Thus, even if the access is restricted to identified and authenticated people, the data can be FAIR.
For Charlie, keeping his research data FAIR is tricky. He partners with some laboratories that have the appropriate resources to collect data from animals. Since obtaining this experimental data is expensive, the laboratory keeps some rights of use, and Charlie has to sign a "Data Use Agreement" contract. This allows him to use the data only for the declared purpose, and he is unable to redistribute it. In this case, the data would not be FAIR. However, the laboratory agrees to release the data for public use after two or three articles have been published from that source, as they consider that the data has already yielded enough results. From that moment, the data could be considered FAIR. Some websites collect already released databases (e.g., International Brain Lab) or collect data directly from laboratories for researchers' use (e.g., Human Connectome Project).
Bob and Charlie transform the databases they obtain, develop and apply models. They then write and publish articles. It is increasingly common that journals in the modelling field require the source code to be available. Bob and Charlie, like most researchers, use GitHub, but they have other options as we mentioned with Alice. Additionally, interdisciplinary fields with large communities often have collaborative and open-science platforms where many researchers collaborate in large distributed teams (e.g., COMOB, Allen Institute). In those projects, FAIR principles are a basic need. Concerning accessibility, all the data must be perfectly identified by its metadata. Accessibility has to be transparent to the researchers so the source code of their models can retrieve and process the data in a single step. All the platforms mentioned above have high standards of FAIR-ness and offer APIs based on open standards.
Accessibility and Open Access
It is important to distinguish between the "Accessibility" FAIR principle and the "Open Access" practice.
The open-access philosophy states that research data and especially research results (articles) should be available online, free of charge and of other barriers. This is usually achieved using open licenses such as Creative Commons or similar ones.
The open access movement arose in the context of articles and scientific literature at the end of the 90s and the beginning of the 2000s, in the dawn of the internet era. The new technologies (publishing online, print-on-demand, easier distribution...) lowered the cost of publication dramatically, but at the same time, some editorial houses kept increasing their fees to access scientific journals and started practices such as "bundling" to force libraries to buy subscriptions in bulk. In our academic system, researchers are pressured to publish in prestigious, high-impact journals, since their academic evaluation depends highly on publication metrics. Most often, journals do not remunerate the authoring of scientific articles. Furthermore, researchers often peer-review articles for free, with the incentive of gaining status in their research field. Under those circumstances, the role and the business model of the traditional editorial houses started to be questioned. For several years, discontent grew in the scientific community. Some researchers proposed boycotts (e.g., Tim Gowers against Elsevier), while others defended revolutionary tactics (e.g., Aaron Swartz's Guerrilla Open Access Manifesto) that brought shadow sites to the forefront. These sites offered free and unrestricted access to vast amounts of scientific literature (e.g., Sci-Hub, LibGen), but unauthorized by the copyright holders and thus unlawful in many jurisdictions. In parallel, pre-publication sites such as arXiv, which make access to scientific articles free and open, have gained much popularity. It is nowadays common to find on arXiv pre-release versions (after peer review and with the final layout) almost identical to the journal-published articles. Other authors avoid journals altogether and publish only on arXiv (with the consequences this entails, such as loose or absent review and a lack of certifiable merit).
More recently, the open access movement has brought new journals and editorial practices that guarantee access to research articles at no cost. For instance, the Public Library of Science (PLOS) is a non-profit publishing house that advocates for Open Access, releasing all its published articles under Creative Commons licenses. In turn, PLOS popularized the practice of pay-to-publish, a scheme that moves the publication fees to the authors or their institutions. While this model is defended by many researchers and publishers, regrettably some deceptive journals exploit it by charging authors publication fees without any quality check or review of the submitted articles. The growing tendency, however, is towards low-cost journals published only online, whose small publication costs can be covered by universities and institutions.
The FAIR principles as described above do not, in essence, interfere with the open access practice, and they do not prescribe open licenses. FAIR is focused on all research data in general, not only articles, and it keeps its recommendations limited to technical aspects such as protocols and APIs and the presence of metadata.
However, the choice of a license for the data does impact the degree of FAIR-ness. While the Findability principle is quite independent of the chosen license, the Accessibility principle is heavily affected by it. Open licenses allow the data to be redistributed, making the access infrastructure more resilient, durable, and decentralized; they remove barriers and make the right to use the data more effective. The choice of license has an even bigger effect on the Reusability principle, with its "legal" as well as technical and architectural requirements.
FAIR data and open access are intertwined practices, and researchers need to consider both perspectives, especially in light of developing trends and policies. Recently, the U.S. government issued a memorandum (Ensuring Free, Immediate, and Equitable Access to Federally Funded Research) to all federal agencies establishing immediate access at no cost to all U.S.-funded research. This means that all research paid for with public money must be released in an open format, free of charge. This memorandum includes research data, such as research databases and other primary sources of information. Similar policies can be expected soon in the E.U. countries. Although not yet a binding policy, the European Commission already supports FAIR principles.
MaRDI's proposal concerning Accessibility
One of the main MaRDI outputs is our portal, which will help researchers find and access mathematical research data. The portal itself does not create a new gigantic repository collecting all mathematical research data. Instead, it facilitates the creation of a network of federated domain-specific repositories, making already existing projects more connected, interoperable, and accessible from a single entry point.
The efforts of MaRDI are, on the one hand, geared towards fulfilling the technical needs of this network of federated repositories: creating APIs and setting standard formats and protocols to access information through the MaRDI portal. On the other hand, MaRDI aims to spread the FAIR culture amongst researchers by providing training on the practices and tools that will improve their data management.
In order to enable standardized retrieval of mathematical research data and their metadata, i.e. to make mathematical research data accessible to machines, the MaRDI consortium has decided to set up an API during the five-year funding period (see p.37, 53 of the proposal). This API will be integrated into the MaRDI Portal, the envisioned one-stop contact point for mathematical research data for the scientific community, by FIZ Karlsruhe and Zuse Institute Berlin.
Take as an example the API of zbMath Open, which has similarities to our portal. zbMath Open is a reviewing service for articles in pure and applied mathematics, where you can find 4.4 million bibliographic entries with reviews or abstracts of scholarly literature in mathematics. It has developed an open API offering the bibliographic metadata of each contribution. You can use this in different ways: to provide references for Wikipedia or MathOverflow, for so-called data-driven decision making, or even for plagiarism detection (see, for instance, this article).
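As a rough sketch of what such machine-readable access enables: a small script can query a bibliographic API and work directly with the returned metadata. The endpoint path and parameter names below are assumptions made for illustration, not the documented zbMATH Open interface; consult the official API documentation for the real one.

```python
import requests

# Hypothetical query against a bibliographic API in the style of zbMATH Open.
# The URL path and parameter names are assumed for illustration only.
API_URL = "https://api.zbmath.org/v1/document/_search"

response = requests.get(
    API_URL,
    params={"query": "knowledge graph", "page_size": 5},
    timeout=30,
)
response.raise_for_status()

# A client (a reference manager, a plagiarism checker, a statistics script)
# would then read the bibliographic metadata from the JSON response.
for record in response.json().get("results", []):
    print(record.get("title"), record.get("year"))
```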
In Conversation with Johan Commelin
In the second episode of the interview series Data Date, Johan and Christiane talk about mathematical research data in the Lean project, the importance of Github, accessibility in this context, and connected knowledge graphs.
Pizza and Data at StuKon22
Who would have thought that Pizza and Data go so well together? Very well as we found out at the DMV Student Conference in early August that was held at the MPI MiS in Leipzig.
Three days of StuKon saw presentations of Bachelor and Master theses from 13 of the participating students, as well as talks and workshops on possible career paths for mathematicians held by representatives of banks, academia, insurance, consulting, and cybersecurity firms.
The first evening was planned by MaRDI. StuKon participants were invited to enjoy their slices of delicious pizza while talking about their experiences with research data. Tabea Bacher gave a short presentation on MaRDI in a cozy relaxed atmosphere. She introduced the FAIR principles and the participants were challenged with the very broad concept of mathematical research data encompassing proofs, formulae, code, simulation data, collections of mathematical objects, graphs, visualizations, papers and any other digital object arising in research. Some of the common difficulties in (mathematical) research data were illustrated by an example from her own work.
Participants were then encouraged to talk to one another about their experiences and what they would want or need from a MaRDI service. Ideas, problems, and questions were illustrated on self-designed postcards that were briefly presented after this very educational dinner. From this, three recurring concerns were identified.
The need for a formula finder ranked high on the list of concerns raised by the students; this was also mentioned in the last MaRDI Newsletter. The second problem brought up was research being published in a language not mastered by the researcher who wants to build on it: it has to be translated first. One could argue that the translation could be done with available tools, or that one should not bother with a translation at all. Yet translated articles are not made publicly available and often remain on personal computers, so the next interested party has to repeat the process for themselves. Wouldn’t it be nice to have a service that collected translations of articles and excerpts and made them accessible? If only to determine whether a paper really holds the information you need. And last but not least, the students felt that theses that expand on and explain a research paper or proof in detail should be linked to that paper or proof, respectively. These are often Bachelor or Master theses that are rarely published on university servers, let alone somewhere else. The students felt that if these were linked to a dense proof or paper, it would help readers understand the research better, or at least more easily, and give context to the problem.
While other issues were raised, these were the main points discussed by the StuKon participants. As the organisers, we feel that it is important to include the next generation of mathematicians in the discussion on the FAIRness of research data. It seems that everybody left with MaRDI stuck in their heads. Hopefully they will remember it as a place to consult, and possibly contribute to, in their future research careers.
image credit: Bernd Wannenmacher
The Future of Digital Infrastructures for Mathematical Research
At the DMV Annual Meeting (2022-09-12 – 09-16), we hosted a MaRDI-Mini-Symposium: "The Future of Digital Infrastructures for Mathematical Research". As mathematics becomes increasingly digital and algorithms, proof assistants, and digital databases become more and more involved in mathematical research, questions arise on handling the mathematical research data that accumulates alongside a publication: storage, accessibility, reusability, and quality assurance. Speakers shared their experience with existing solutions and their visions and plans on how a well-developed integrated infrastructure can further facilitate mathematical research.
The slides of all talks can be accessed via the MaRDI-website.
NFDI4Culture Music Award
This award, presented in two different categories, is given by the musicological community in NFDI4Culture and intends to recognize music-related or musicological projects and undertakings. Applications may be submitted by 30 September 2022. The funds (up to 3000 EUR) associated with the award are earmarked for expenses that contribute to the goals of NFDI4Culture and must be used by the end of the year 2023.
More information:
FAIR4Chem Award: The FAIRest dataset in chemistry!
This award is given for published chemistry research datasets that best meet the FAIR principles and thus make a significant contribution to increasing transparency in research and the reuse of scientific knowledge. NFDI4Chem will award the FAIRest dataset with prize money of 500 €, supported by the Fonds der Chemischen Industrie (FCI). Submission deadline is November 15, 2022.
More information:
On the first Monday of every month at 4 pm, the NFDI hosts a live InfraTalk on YouTube. Here, participants of the individual consortia talk about important topics to a general audience; for instance, Harald Sack spoke on Knowledge Graphs (March 7, 2022).
https://www.youtube.com/playlist?list=PL08nwOdK76QlnmEB659qokiWN3AC-kqFS
Danish librarians have set up "How to FAIR: a Danish website to guide researchers on making research data more FAIR" https://doi.org/10.5281/zenodo.3712065. On accessibility, they say "Conducting research is often a team effort. Even before collecting the data, it is important to consider who will get access to the data, under which conditions, and what permissions they will have." and provide lots of use cases from all across the sciences: https://www.howtofair.dk/how-to-fair/access-to-data/
FDM Thüringen's Research Data Scarytales promises to "take you on an eerie journey and show you in short stories what scary consequences mistakes in data management can have". The multi-player game consists of stories based on real events and is designed to help you avoid potential pitfalls and traps in your research data management plan.
Welcome to the very first issue of the MaRDI (Mathematical Research Data Initiative) Newsletter. Research data in mathematics comes in many different flavors: papers, formulae, theorems, code, scripts, notebooks, software, models, simulated and experimental datasets, libraries of math objects with properties of interest... In short, the list is as long as mathematical research data is diverse.
Unfortunately, there is no straightforward or standard way to make these digital objects available for future generations of researchers. Availability, however, is not the only concern. In an ideal world, mathematical research data would be
FAIR: Findable, Accessible, Interoperable, and Reusable.
MaRDI is part of the German National Research Data Infrastructure (NFDI) and is dedicated to building infrastructures that make mathematical research data FAIR. Work on solutions for some of the major problems we face today started last year, from understanding the state-of-the-art technology of a field all the way along the research pipeline to establishing standards for peer review. As part of this process, it is especially important for us to engage you, the mathematics community, early on, so have a look at the list of our upcoming workshops!
This issue of the Newsletter is dedicated to the F in FAIR: to findability and what this means for mathematics.
licensed under CC BY-NC-SA 4.0.
We explore two aspects of what Findable means. First, we will focus on how to find data created by other researchers and then we discuss how to make sure your own data is findable for the math community.
In each newsletter, we will also publish an episode of our interview series on math and data: "Data Dates", introduce you to the people behind the MaRDI project, and offer some reading recommendations on the topic.
Have you ever…
- tried searching for a formula?
- seen a reference to a homepage that is long gone?
- put code on your personal webpage because you didn't know how and where else to publish it?
- browsed through the publications of your coauthor's coauthors looking for that one result that you almost remembered but not quite?
- not been able to find something you needed to keep going into the research direction you fancied?
Then you are not alone!
To find out where people search for math data, we ask you to answer our very short multiple-choice survey:
Where do you look for mathematical research data?
You will see the results here or right after submitting your answer.
How to find research data?
In the near-infinite resource that is the World Wide Web, where do you find your research data? Where are the “hubs” that concentrate these resources? And how does MaRDI propose to help with the Findability challenges?
Data and FAIR principles
Modern science, including mathematics, relies increasingly on research data. Research data is the factual material required to verify research findings; in mathematics, this can also be the knowledge written up in an article.
Types of research data include literature, such as books and articles, databases of experimental data, simulation-generated data, taxonomies (exhaustive listings of the examples of a given category of objects), workflows, and frameworks (for instance, software stacks with all the programs used in a research project), etc. Even a single formula can be considered research data. To set up good practices in the scientific community, Wilkinson et al. published the FAIR Guiding Principles for scientific data management and stewardship. These principles are Findability, Accessibility, Interoperability, and Reusability.
In this article, we will introduce the Findability principle, with a focus on mathematical sciences, in connection with the infrastructure that is being developed by MaRDI.
For more information about what research data is and how to manage it (especially for researchers in German-speaking countries), you can visit Forschungsdaten.info (in German). For a comprehensive introduction to the FAIR principles, you can visit the GO FAIR portal.
Findability
Findability is the first of the FAIR principles; it is also the most basic one, because if you can't find some data, you can't re-use it in any way; it is as if it did not exist.
When we try to find (research) data, we may face two situations: either we know that something exists and we are looking for it specifically, or we don't know exactly what we want and we look for anything related to a search term. In the first case, rather than finding that data, our problem is locating it somewhere in the physical or virtual space. In the second, our problem is to examine all the data available (in a certain catalog) for a certain characteristic that we are interested in.
Both problems can be solved using a few tools. Firstly, each piece of data needs a unique reference or identifier, so that we can build lookup tables for the location of each dataset. Secondly, together with the ID, we need other metadata that describes the data with useful information (type, subject, authors, etc). Thirdly, we need to build comprehensive catalogs that gather all the metadata of the datasets, and search engines, which are algorithms to retrieve entries from those catalogs; a miniature sketch of these ingredients follows the recommendations below.
Thus, the Findability principle can be concretized to the following recommendations:
- (Meta)data is assigned a globally unique and persistent identifier.
- Data is described with rich metadata.
- Metadata clearly and explicitly includes the identifier of the data it describes.
- (Meta)data is registered or indexed in a searchable resource.
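In miniature, the identifier + metadata + catalog + search-engine picture behind these recommendations could look like the sketch below; all identifiers, records, and URLs are invented for illustration.

```python
# A tiny "catalog": unique identifiers mapped to descriptive metadata.
# All identifiers and records here are invented for illustration.
catalog = {
    "10.5281/zenodo.0000001": {
        "type": "dataset",
        "title": "Census of cubic graphs up to 20 vertices",
        "subject": "graph theory",
        "authors": ["A. Researcher"],
        "location": "https://example.org/datasets/cubic-graphs",
    },
    "10.5281/zenodo.0000002": {
        "type": "software",
        "title": "Sparse linear solver benchmarks",
        "subject": "numerical linear algebra",
        "authors": ["B. Developer"],
        "location": "https://example.org/software/solver-bench",
    },
}

def search(catalog, term):
    """A very small 'search engine': scan the metadata for a term."""
    term = term.lower()
    return [
        identifier
        for identifier, meta in catalog.items()
        if term in meta["title"].lower() or term in meta["subject"].lower()
    ]

print(search(catalog, "graph"))                          # locate by topic
print(catalog["10.5281/zenodo.0000002"]["location"])     # locate by identifier
```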
The classical approach to searching and finding data has been dominated by the publication paradigm: you look for a specific publication, or for any publication related to a certain topic, that will contain the information you are interested in. However, in reality, you often want to find a theorem, a formula, or any concrete piece of information rather than a publication. For instance, a specific expression of a Bessel function, a particular representation of a given group, or the proof that certain differential equations have unique solutions. This approach requires re-thinking how we structure and manage research data. Next, we discuss the available places to find research data and then the MaRDI proposal for such a comprehensive approach.
Where to look for research data
For mathematical articles, books, and other classically published works, a reference includes title, author, year, etc. While this is easily usable and readable by a human, it is not always consistent in format and it does not provide a means to locate and access that information. The two de-facto standard catalogs that collect mathematical literature and also assign a unique identifier are:
- The ZentralBlatt Mathematik (unique identifier: Zb number), archived in zbMath by the FIZ Karlsruhe - Leibniz Institute and
- The Mathematical Reviews (unique identifier: MR number), archived in MathSciNet by the American Mathematical Society.
While these unique identifiers are helpful in referencing a piece of mathematical literature, and these platforms are useful in finding works in a specific math domain, their catalogs are much less comprehensive when it comes to other research data (databases, media, online resources, etc). They also have the drawback that authors cannot control the existence or the metadata of an entry, and MathSciNet is a subscription-based service*.
Another notable mention is arXiv, which is a de-facto standard platform for pre-publications. Here the actual paper is offered publicly, thus making it Accessible. Furthermore, any work in arXiv also gets a unique ID and can be found via the catalog search. The focus here is also on literature, although there is limited support for datasets related to a paper. When it comes to non-literature research data, the panorama is much coarser. swMath, a sister project to zbMath, is a catalog of mathematical software packages (computer algebra, numerics, etc) and a cross-referencing record of the zbMath articles that cite them. zbMath also features a full-text search for formulas, which is being improved within the MaRDI framework.
There are also general-purpose identifiers and catalogs for data. One of the most standardized identifiers for online resources is the Digital Object Identifier (DOI), which can reference any digital object. Unlike a URL, the DOI is linked to a particular object and not to the server or website where it is hosted. The DOI website resolves the DOI number to the most up-to-date URL to access the data, so the DOI also serves as a locator in addition to being a unique identifier. Usually, publishers assign a DOI to new publications, but authors can also obtain a DOI from other registration agencies. Some open repositories offer free DOI registration. For instance, Zenodo is a general-purpose repository for open data, which hosts quite a few mathematical research datasets. See our article "Publishing on open repositories" where we talk more about Zenodo.
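Because a DOI resolves to the current location of the object, a script can simply follow it, and, where the registration agency supports content negotiation, even request citation metadata directly. The sketch below uses the DOI of the "How to FAIR" guide cited elsewhere in this newsletter archive and assumes such content negotiation is available for it.

```python
import requests

DOI = "10.5281/zenodo.3712065"  # the "How to FAIR" guide cited elsewhere here

# Following the DOI returns the current landing page of the object.
landing = requests.get(f"https://doi.org/{DOI}", allow_redirects=True, timeout=30)
print(landing.url)

# Many registration agencies also support content negotiation, returning
# citation metadata instead of HTML (assumed to be available for this DOI).
meta = requests.get(
    f"https://doi.org/{DOI}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
if meta.ok:
    record = meta.json()
    print(record.get("title"), record.get("publisher"))
```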
Currently, for pure research databases (experimental data, simulation data, etc), there is no universally accepted repository in mathematics. There are a few curated collections of mathematical objects, such as the Online Encyclopedia of Integer Sequences (OEIS), the SuiteSparse Matrix Collection, and the NIST Digital Library of Mathematical Functions. In reality, many researchers rely on open repositories for access to data. Unfortunately, in contrast to biological repositories, where researchers can find standardized catalogs of proteins or genetic encodings, mathematical catalogs are neither general-purpose nor very interoperable.
MaRDI's proposal concerning Findability
Unfortunately, most data-based mathematical research is still published either without the datasets, or the datasets are hosted on university servers accessible only through personal websites of the researchers involved.
MaRDI aims to, on the one hand, provide the necessary ground infrastructure to properly publish research data in federated repositories (using standards and practices according to the FAIR principles), and on the other, it plans to spread awareness within the math research community on the problems and proposed solutions that publishing research data entails.
Here we will name a few of the initiatives related to the Findability principle.
The Scientific Computing Task Area (TA2) is preparing a benchmark framework to compare existing and new algorithms and methods for solving specific problems. For instance, there are several dozen methods to solve a linear system Ax=b, with different performance characteristics and technology stacks, depending on the size of the matrix A, whether it is sparse or dense, whether we look for exact or approximate solutions, etc. So far there is no centralized catalog where a "user" (for instance a computational biologist) can go to choose the best method for their particular application. This catalog and benchmark will make finding symbolic and numerical algorithms much easier, and it aspires to be a major reference when looking for such algorithms.
MaRDI's tool for building this catalog is a knowledge graph of numerical algorithms. A knowledge graph is an abstract representation of a set of concepts, objects, events, or anything related to a domain of study, as nodes, together with formal relations between them (edges) that can be read unambiguously by humans or computers. The biggest collective effort to build a knowledge graph is Wikidata. In this mathematical knowledge graph, nodes will be the algorithms themselves as concepts, but also papers related to them, software packages implementing them, benchmarks, and connections to other databases. It will then be possible to navigate the knowledge graph to find semantic information, such as which algorithms extend a given one, where implementations can be found, how they perform in comparison, etc.
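Public knowledge graphs such as Wikidata can already be queried programmatically today, which gives an idea of what navigating the MaRDI graph could look like. The following Python sketch is our own illustration: it sends a SPARQL query to the public Wikidata endpoint (not to the MaRDI graph, which is still under construction) and asks for a few items recorded as instances of "algorithm" (Wikidata item Q8366).

```python
import requests

ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata query service

# SPARQL query: items that are an "instance of" (P31) "algorithm" (Q8366),
# together with their English labels.
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q8366 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "knowledge-graph-demo/0.1 (newsletter example)"},
)
for row in response.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"], "->", row["item"]["value"])
```

In the planned MaRDI graph, analogous queries could ask which algorithms extend a given one or which software packages implement it.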
Another MaRDI effort aimed at Findability is Mathematical Entity Linking (MathEL), a way to extract and compare conceptual information from mathematical formulas. The concept behind a particular equation (for instance the Klein-Gordon equation or the Einstein field equations of general relativity) can be expressed in many different forms: variables can be named differently, notations for derivatives or tensors may differ, and groupings and substitutions can occur. The MathEL sub-project aims to retrieve the conceptual information of formulas, to propose annotation standards for introducing semantic information into formulas (for instance referencing a Wikidata node or another knowledge-graph node), to mine large corpora of research literature (for instance the zbMath catalog or the arXiv repository), and to create user interfaces to retrieve concept and source information, such as question-answering engines.
To illustrate this, here is a sneak peek into the MaRDI portal, currently under development, which will integrate the MathWebSearch engine as a MediaWiki component. The formula search can find wiki pages on the MaRDI portal based on formula expressions written in LaTeX. This test wiki page contains a couple of math formulas, and the search portal should be able to find them when queried in the search box. With the TeX and BaseX configuration, you can try an input like " V=4/3 \pi r^3 " or " V=\frac{4}{3} \pi r^3 " and it will find the wiki page with the test formula. Also, with " V = 4/3 \pi ?s^3 " you can find variable substitutions. Other common rewritings, such as " V = \frac{4\pi}{3} r^3 ", are not yet recognized, but the core search engine is under active development. The same engine is used in the zbMath formula search. Plans for MaRDI include making entities in a Wikibase knowledge graph findable through formula search.
In subsequent articles, we will present other tasks being carried out within MaRDI** that exemplify the other FAIR principles (for instance open interfaces, or descriptions of workflows).
* MR Lookup offers limited services to non-subscribers. As of 2021, zbMath became zbMATH Open and requires no subscription.
**The funded MaRDI proposal can be accessed here.
Taking some data from a project, we try to prepare it according to the FAIR principles. Follow us in our attempt to make it FAIR on the first try.
Publishing research data in open repositories
We are IMAGINARY, a math communication association and part of the MaRDI consortium, and our main activity is developing and organizing math exhibitions. Using data that we collected about Earth grids for one of our recent projects on climate change, we will take you through how we, almost painlessly, set up the data in a public repository.
Our latest exhibition is the "10-minute museum on the climate crisis mathematics", where we describe mathematical modeling and places where maths is used in climate science. We all know that the latitude and longitude grid is the most common way of creating a reference system on the Earth. Did you know there are other ways to divide the Earth into small regions that can be particularly useful in numerical models?
Quite excited by this, we contacted a couple of climate researchers who were able to prepare for us the sets of geographic nodes and edges that make those grids. Then another one of our collaborators took that data and converted it into a 3D-printable model by adding thickness to the edges and checking the structural integrity of the ensemble so that it could be a physical object. Finally, a 3D printing company made the objects that we used in our exhibition.
As this dataset was not used in a way that contributed to existing knowledge, it was not suitable for publication in a journal. However, it occurred to us that the data we had gathered and processed was niche and specific enough to be a basis for others to re-use and build on.
Being a company committed to Free and Open Source licenses, we wanted to make the data not only available but FAIR as well.
Git (GitHub, GitLab)
Since we were dealing with software files, the most convenient platform for publishing and development was GitHub. Git is an efficient version-control system, and any organized code project should start there. GitHub and GitLab are probably the most popular platforms for hosting Git projects. However, as a publishing tool, a repository is almost a kind of personal website (in fact, you can host and serve a Git repository on your own server), and it is a live, working tool. This means that the published data can change at any time. GitHub does not offer, by default, a guarantee of stability (although there are archiving options), a standardized identifier, or a good way to search and find your data. Also, it keeps a record of all previous versions, so all your dirty work is public.
Our GitHub page was our collaboration tool within the team. It was not intended as a publication method; it just happened that we left it publicly available. Having data available somewhere does not automatically make it FAIR. We wanted to have an identifier associated with it, and we knew that some repositories offered that.
Zenodo
Zenodo is one such open-access, general-purpose repository. It is hosted on CERN's infrastructure and funded in part by the European Commission. Researchers in any scientific area use it to make a copy of their work findable and accessible to the public. These works can be articles or books (pre-prints or, in some cases, works already published by traditional publishing houses), but also databases, data files, images, or any other digital asset that their research relies upon.
Zenodo offers a Digital Object Identifier (DOI) if the work does not already have one. In that case, the DOI contains the string "zenodo", for instance 10.5281/zenodo.6538815.
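Since the doi.org service resolves a DOI to whatever its current landing page is, recovering that page programmatically is a one-liner. A minimal Python sketch using the DOI above:

```python
import requests

doi = "10.5281/zenodo.6538815"

# doi.org redirects to the landing page currently registered for this DOI.
response = requests.get(f"https://doi.org/{doi}", allow_redirects=True)
print(doi, "currently resolves to", response.url)
```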
This was a perfect fit for our data and as a bonus, creating our entry on Zenodo was not difficult!
Firstly, we created an account. A valid email address is all you need. You can also link it to your ORCID to identify the author(s) uniquely.
Secondly, we made a new upload draft. You can choose the type of document (publication, poster, dataset, image, video, software, physical object, etc.) and fill in the form with the title, authors, publication date (can be in the past), description, and several other fields.
For the authors, we added the ORCID of those who had it. We also used "IMAGINARY" as an author, even though it was not a physical person but a company.
We requested a new DOI since we did not have any. The DOI can be "reserved" during the draft process, so you know it in advance and can use it in the documents you prepare.
For the actual content, we used a zip file with the master branch of the GitHub repository. You can also link your Zenodo account to your GitHub account so that whenever you make a "release" in GitHub, a snapshot is automatically published in Zenodo.
Finally, we submitted the draft. Take note: once published, you cannot add, delete, or modify the files associated with a DOI; that is the main point of a DOI. You would have to publish new versions with a new DOI. Thus, we recommend that you double- and triple-check before clicking submit. In case you make an erroneous submission, you can write an email to the Zenodo administrators for help.
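If you prefer scripting to the web form, Zenodo also exposes a REST deposition API. The sketch below is only an outline of that route (we used the web interface ourselves); the token, file name, and metadata values are placeholders, and the endpoints follow the public Zenodo developer documentation.

```python
import requests

TOKEN = "YOUR-ZENODO-TOKEN"  # placeholder: a personal access token from your Zenodo account
API = "https://zenodo.org/api/deposit/depositions"

# 1. Create an empty draft deposition.
draft = requests.post(API, params={"access_token": TOKEN}, json={}).json()

# 2. Upload a file into the draft's file bucket.
with open("earth_grids.zip", "rb") as fp:
    requests.put(
        f"{draft['links']['bucket']}/earth_grids.zip",
        params={"access_token": TOKEN},
        data=fp,
    )

# 3. Attach the descriptive metadata (title, authors, type of upload, ...).
metadata = {
    "metadata": {
        "title": "3D-printable Earth grids",
        "upload_type": "dataset",
        "description": "Geographic nodes and edges for several Earth grids.",
        "creators": [{"name": "IMAGINARY"}],
    }
}
requests.put(f"{API}/{draft['id']}", params={"access_token": TOKEN}, json=metadata)

# 4. Publishing is irreversible (see above), so it is left commented out here:
# requests.post(f"{API}/{draft['id']}/actions/publish", params={"access_token": TOKEN})
```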
Wikipedia / Wikidata
We now had an identifier that makes our data easy to find if you already have it, or if you happen to search in Zenodo's search box. But we wanted to increase our Findability further. We needed to include our data in places where people often look for information, and Wikipedia / Wikidata are the perfect places for that.
Wikipedia is the universally known collaborative encyclopedia. With more than 6 million articles in English, it is easy to find an article relating to your data. However, before advertising your data on Wikipedia by editing general-interest articles, you must be familiar with the core principles of Wikipedia content: Neutral point of view, Verifiability, and No original research. That is to say, only link to research and data published elsewhere, and do not hijack articles for self-promotion.
In our case, we found an article on Discrete global grid. Since our work provides an example of such grids, it could be of general interest. Additionally, as there are no other examples of 3D-printable grids that we are aware of, we decided to add a link in the "External references" section.
We then had a look at Wikidata. Wikidata is the data backbone of Wikipedia. In contrast with Wikipedia, which is made of articles, Wikidata is made of entries; every entry can be an object, an abstract concept, a person, a feeling, a math research article... essentially anything. Every entry lists properties of the item in a structured form. It is human-readable but also designed to be machine-readable, meaning that one day an AI or search engine could obtain knowledge from this enormous database, which aspires to structure all human knowledge. As such, it is a suitable place to catalog research data. Many researchers index their articles there (listing title, authors, DOI...), as well as databases, models, etc. But many don't, so it is not yet a comprehensive research (or general) catalog. It is also less intuitive as a search tool than Wikipedia (there is no full text to read), and it can be challenging to retrieve useful information by hand.
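One way around searching by hand is to query Wikidata programmatically through the standard MediaWiki API. A small Python sketch (our own illustration, using the term that eventually led us to the right entry):

```python
import requests

# Entity search on Wikidata via the MediaWiki API action "wbsearchentities".
response = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "wbsearchentities",
        "search": "discrete global grid",
        "language": "en",
        "format": "json",
    },
)
for entity in response.json().get("search", []):
    print(entity["id"], "-", entity.get("label"), "-", entity.get("description", ""))
```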
In our case, searching for "Earth grid" produced nothing, while "Earth system grid" brought us to the US Department of Energy portal, and we learned that "Grid in Earth sciences" is the title of a specific published article. We finally found the Wikidata entry on "Discrete Global Grid" (linked from the Wikipedia article), which is about the concept but contains little information. We could have created a Wikidata entry and listed our data as an instance (example) of a Discrete Global Grid, but we found that our 3D data would have more context in the Wikipedia article. Therefore, we decided not to put our reference in Wikidata.
After asking some colleagues, we found that a more typical use case would be the following: A published research article uses a dataset. Then a Wikipedia page references the published article as a source. By creating a reference in Wikipedia, an entry in Wikidata is created. Then a (different) entry in Wikidata representing the dataset is linked to the entry representing the published article. This way, there is a path from Wikipedia to the research data referenced in Wikidata. Hopefully, eventually, the dataset is used in other publications (referenced in other Wikipedia pages) and Wikidata can keep track of all the works derived from that dataset.
Assessing the FAIRness
At this point, we were wondering: how can we tell if our data is really FAIR? How well did we do? Fortunately, there is also a tool to assess that!
The Automated FAIR Data Assessment Tool from the FAIRsFAIR initiative accepts any working reference, a DOI for instance, and tries to determine its FAIRness from its metadata. It generates a summarized report with individual scores and a final global mark. Luckily for us, Zenodo handles that metadata quite well and makes it available via the HTML code of the Zenodo page itself.
So how did we do? On a scale from 0 to 3, our overall score is "moderate", or 2.
To improve that score, we could have edited the metadata and added more details; however, some of this is still a feature under development in Zenodo (e.g., support for the Citation File Format), and it may be a bit cumbersome to edit that metadata on other platforms.
Conclusion
Overall, we were satisfied with this experiment in making our data FAIR. The GitHub workflow is a bit difficult to learn, but it is nowadays part of software development, and an added benefit is that it can integrate into FAIR workflows. Zenodo was a success: easy to use, it takes care of most of the metadata, and it provides free DOIs. Wikipedia is not difficult, but you need to keep your interest in gaining visibility from undermining the general interest of an encyclopedia. As for Wikidata, we concluded that it is not suited to our use case (although it might be for other research data). Finally, the FAIR data assessment tool is great not only for evaluating but also for educating on good practices and improving your FAIRness. There are probably many more tools and hints for us to discover, but so far the trip was not a hard one to make.
We hope that reading about our experience encourages you to re-evaluate, and to improve, the FAIRness of your own data.
In Conversation with Cédric Villani
In the first episode of the interview series Data Date, Cédric Villani joins Christiane Görgen for a brief exchange of thoughts about Math & Data.
OpenML hackathon at Dagstuhl castle
Sebastian Fischer and Oleksandr Zadorozhnyi, of the MaRDI task area Statistics and Machine Learning, participated in an OpenML hackathon held in late March at the headquarters of the Leibniz Center for Informatics at Dagstuhl, Germany.
OpenML is an open-source platform for sharing datasets, algorithms, experiments, and results. The hackathon was initiated by Bernd Bischl, one of the key players behind OpenML and a Co-Spokesperson in MaRDI. Researchers from other parts of Germany, France, the Netherlands, Poland, and Slovenia were present to discuss topics such as data quality on OpenML, an extension of its established services to new data formats, and new computational tasks.
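OpenML can also be used programmatically. Besides the mlr3oml R package mentioned below, there is an official Python client; the following sketch (the dataset and task IDs are illustrative examples only, not MaRDI data) fetches a dataset and a task from the platform.

```python
import openml  # official OpenML Python client (pip install openml)

# Fetch a dataset by its OpenML ID and load it as a dataframe.
dataset = openml.datasets.get_dataset(61)  # 61 is the classic "iris" dataset
X, y, _, attribute_names = dataset.get_data(target=dataset.default_target_attribute)
print(dataset.name, X.shape, "target:", dataset.default_target_attribute)

# Tasks (classification, regression, clustering, ...) are first-class objects too.
task = openml.tasks.get_task(59)  # a supervised classification task defined on iris
print(task.task_type, "on dataset", task.dataset_id)
```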
The review article "Datasheets for datasets" provided fruitful exchanges on future improvements of data and metadata quality. In particular, support for non-tabular data formats such as images was discussed and will now be enabled by transitioning from the attribute-relation file format (ARFF) to Parquet. The eight types of tasks available so far, including regression, classification, and clustering, will be extended with new tasks typical for graphical modeling. As this is one of the main use cases and an important topic for both Sebastian and Oleksandr, they discussed with Jan van Rijn the problem of estimating graphical-model structure from a given dataset, its embedding into the current set of tasks available on OpenML, the addition of different evaluation measures and criteria for model selection, and the storage of graph-specified datasets within the OpenML framework. These evaluation measures and criteria for model selection allow estimated graphs to be compared with a given ground truth, a procedure that is not normally part of the ML workflow.
Sebastian also presented his collaborative work with Michael Lang on the mlr3oml R package. This package connects the OpenML platform to mlr3, an open-source machine-learning package in R and another crucial component of the MaRDI task area.
The hackathon was rounded off with social activities such as a walk through the forest. Good weather aside, special thanks must go to Joaquin Vanschoren, the OpenML founder, whose supply of water to the whole group during the hike was the other reason why everyone made it back to the castle in good spirits!
All in all, the week in Wadern was a pleasant and fruitful one for all the participants.
We will also be introducing you to the people who shape MaRDI with their expertise and vision for mathematical research data. They will appear in a series of "Making MaRDI" interviews available via our Twitter account. Stay tuned!
Call for seed funds 2023
These funds support scientists from all fields of research within engineering in developing and implementing innovative ideas in data management. The grant is equivalent to the funding of a full-time doctoral position for one year. If necessary, the funding can be split between project partners.
More information:
- To learn about the Nationale Forschungsdateninfrastruktur, the community of which MaRDI is just one small part, read the 2021 article by Nathalie Hartl, Elena Wössner, and York Sure-Vetter in Informatik Spektrum. See doi.org/10.1007/s00287-021-01392-6
- Christiane Görgen and Claudia Fevola explain in a short review article the role repositories can play in the MaRDI infrastructure. They use MathRepo as an example, a small math research-data repository hosted at the Max Planck Institute for Mathematics in the Sciences in Leipzig. See arxiv.org/abs/2202.04022
- The interim report of the European Commission Expert Group on FAIR data discusses how to turn FAIR into reality. See doi.org/10.2777/1524
- Thomas Koprucki and Karsten Tabelow have been two of the driving forces in the early stages of MaRDI. Together with Ilka Kleinod they discussed mathematical models as an important type of mathematical research data in a 2016 article for the Proceedings in Applied Mathematics and Mechanics: doi.org/10.1002/pamm.201610458
Our Newsletter "Math & Data Quarterly" is prepared by our partner IMAGINARY. You can unsubscribe easily at any time.