Data, Metadata and Paradata

I have had some interesting discussions about what constitutes metadata these days. What is really the difference between metadata and data? While participating in ESOF today, I came across the term paradata, which was new to me. What is that, and how does it differ from data and metadata?

Data

Data refers to raw facts, observations, measurements, or representations of information in various formats, such as numbers, text, images, audio, or video. It is the foundation of knowledge and is essential for making informed decisions, conducting research, and gaining insights across various domains.

In this context, I will limit myself to thinking about data as something stored in a digital format, such as numbers or text. This includes images, audio, etc., that, technically speaking, are also represented as numbers or text or in binary form.

Metadata

There are many ways to define metadata, but here I will follow a suggestion of breaking it down into four categories:

Descriptive metadata: information about the data’s content, meaning, and context. Examples: titles, summaries, keywords, abstracts, and subject classifications.
Structural metadata: describes the organization, arrangement, and relationships between different components of a dataset or information resource. Examples: information on how the data is structured, including the hierarchy, sequence, and interdependencies of its elements.
Administrative metadata: information related to the administrative aspects of data management. Examples: data ownership, access rights, security, versioning, provenance, and data management policies.
Technical metadata: describes the technical characteristics of data, including its file format, size, encoding, resolution, compression, and software dependencies. Examples: technical requirements and capabilities for accessing, processing, and preserving the data.

These can typically be stored next to the data in a structured form, for example as a title and a description.
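To make this concrete, here is a minimal sketch of what such a structured record could look like when grouped by the four categories above. The field names, the example values, and the idea of saving it as a JSON "sidecar" file are my own assumptions for illustration, not an established metadata standard.

```python
import json


def build_metadata(title, description, keywords, columns,
                   owner, license_name, file_format, size_bytes):
    """Group metadata fields into the four categories listed above.

    All field names here are hypothetical; real schemas differ.
    """
    return {
        "descriptive": {
            "title": title,
            "description": description,
            "keywords": keywords,
        },
        "structural": {"columns": columns},
        "administrative": {"owner": owner, "license": license_name},
        "technical": {"format": file_format, "size_bytes": size_bytes},
    }


# Made-up example values for a small tabular dataset
record = build_metadata(
    title="Example recording",
    description="Short test recording used for illustration.",
    keywords=["example", "test"],
    columns=["time", "x", "y"],
    owner="Example Lab",
    license_name="CC BY 4.0",
    file_format="text/csv",
    size_bytes=2048,
)

# Serialize so the record can be saved in a file next to the data
sidecar = json.dumps(record, indent=2)
print(sidecar)
```

The point of the grouping is only that each category answers a different question about the same file: what it is, how it is organized, who manages it, and what is needed to open it.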

Paradata

What is left to describe now that we have defined data and metadata? Well, all the contextual stuff that is necessary to understand how data has been collected and processed. Wikipedia defines paradata as data about the process by which data were collected. It is a relatively new term, first introduced in a paper on measuring survey quality by Mick P. Couper.
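As a rough illustration of "data about the process", the sketch below logs each step that a dataset goes through, combining a few structured fields with a free-text note. The class names, fields, and example steps are hypothetical and only meant to show the idea; they do not come from Couper's paper or any particular tool.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class ProcessStep:
    """One thing that happened to the data (hypothetical structure)."""
    action: str      # e.g. "collected", "resampled"
    tool: str        # what was used to do it
    note: str = ""   # free text: the things one tends to take for granted
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class ParadataLog:
    """A running log of how one dataset was created and processed."""
    dataset: str
    steps: list = field(default_factory=list)

    def record(self, action, tool, note=""):
        step = ProcessStep(action, tool, note)
        self.steps.append(step)
        return step


# Made-up example: two steps in the life of a dataset
log = ParadataLog("example_recording.csv")
log.record("collected", "hypothetical sensor rig",
           note="Sensor was restarted halfway through the session.")
log.record("resampled", "custom script",
           note="Downsampled because the original rate was unnecessarily high.")

# asdict() turns the whole log into plain dicts and lists, ready for saving
print(asdict(log))
```

The structured fields are there for machines, while the notes are where the human-readable context ends up.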

In his keynote lecture at ESOF, Isto Huvila at Uppsala University talked about results from his ERC project CApturing Paradata for documenTing data creation and Use for the REsearch of the future (CAPTURE) where they had worked with archaeologists and cultural heritage scholars on developing better documentation practices.

I found his approach compelling. Focusing on paradata shows how the FAIR principles (findable, accessible, interoperable, reusable) can be implemented in practice, rather than being treated only as a mandatory requirement from funders. After all, if research data is going to be useful, it is the researchers who will have to create the relevant documentation (paradata), and other researchers will have to understand and use that documentation for something meaningful.

A significant challenge is that different researchers have different needs in different situations. It is impossible and meaningless to capture absolutely everything about a dataset. The point is to capture “just enough” information for it to be helpful. The difficulty is that the data creator and a potential future data user may need different information to make sense of the data. That resonates with my experience as a data creator and a user. I often forget to document things I take for granted, but that other users may need to make sense of my data. On the other hand, I often end up not using other people’s data because there are things I don’t understand about the dataset.

One of the findings from Huvila’s group is that a lot of paradata typically already exists in various sources. The problem is that it is scattered and unstructured. On the other hand, today’s focus on developing data management plans (DMPs) and structured metadata risks throwing away meaningful paradata because it does not fit into existing categories.

In the article Improving the Usefulness of Research Data with Better Paradata, Huvila et al. suggest the following points for making useful paradata (and making the paradata useful):

Formal paradata is good for computers while the best paradata does not necessarily look like ’data’ at all for its human users
Documentation and documented processes need to align with each other not least because they shape each other
Useful documentation needs to consider the scholarship in its complete complexity
Be comprehensive as the usefulness and eventual uses of paradata are likely to be impossible to anticipate

This all sounds good, but it remains to be figured out how this can be done in practice.

In sum

To sum up this little blog post, I will paraphrase Huvila when arguing that there may be enough metadata (data about data) but generally too little paradata (data on the processes of its creation, curation, and use).
