Reusable: (Meta)data are associated with detailed provenance

The tale

The elf Rherek hurried home to the castle as if running on a gust of magic wind.

– “I got it, I got it, I got it”, he shouted almost out of breath.

– “Got what…?” said Jimko, the data wizard. “Take a deep breath, relax and tell me.”

– “I … got … the right recipe for turning water into gold”, he whispered. “We need a giant with three heads. And this chest shows how to conjure up such a giant.”

– “Marvellous”, said Jimko. “Let’s look at the chest immediately.”

They investigated the chest. The content looked like playing cards. However, they were all mixed up.

– “Now …” said Jimko. “This looks perfect. It is in line with what we believe to be the right steps. However, it is all mixed up, so we have no clue in which order to take the steps. And it gets even worse …”

– “How is that?” asked Rherek. “Can’t we just try it in different sequences? We must succeed eventually.”

– “Or blow up the castle. If you do anything wrong with the giant, it will turn more evil than a Buffingor witch on a bad day!”

– “But, but, but … can’t we ask those who created the cards?” Rherek asked warily.

– “We can… But the chest contains no trace of who created the cards. Finding them would be a stroke of luck. Still, go back and look for any evidence of who created the chest”, said Jimko despairingly.

Rherek ran off. But neither he nor Jimko was at all convinced that further investigation would make it possible to understand the cards.

The truth

Much of the value of data lies in the ability of a machine or a human to judge its origin, and from that to evaluate whether the data is reusable in a new context. This includes knowing how the data was created, by whom, and with which type of equipment. Has the data been processed, or is it raw data? If it was processed, what was the workflow? And so on. This is quite similar to the methods section of a paper or article, and you can refer to this type of documentation from your data set. However, keep in mind that such a document might not be machine-readable.

Remember to include provenance about who you are, and state how you would like the data set to be cited or credited if it is used elsewhere.
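To make the idea concrete, here is a minimal sketch of a provenance record that both a human and a machine can read. It assumes a plain Python dictionary serialised as JSON. Every field name and value is invented for illustration and does not follow any particular standard; in practice you would probably map these fields to the metadata schema your repository uses.

```python
import json

# Illustrative provenance record. All field names and values are
# assumptions made up for this sketch, not part of any standard.
provenance = {
    "creator": {"name": "Rherek the Elf", "affiliation": "The Castle"},
    "method": "Cards copied by hand from the chest",
    "equipment": "Quill and parchment",
    "processing": {
        "is_raw": False,  # the data has been processed
        "steps": ["sorted cards into order", "numbered each card"],
    },
    "preferred_citation": "Rherek. Cards for conjuring a three-headed giant.",
}


def to_json(record):
    """Serialise a provenance record to indented, key-sorted JSON text."""
    return json.dumps(record, indent=2, sort_keys=True)


provenance_json = to_json(provenance)
print(provenance_json)
```

Stored next to the data set, a file like this answers the who, how, and with-what questions without requiring anyone to read a methods section first.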

The easiest way to get started is to think of yourself as a re-user of your own data. But before you do so, you must clear your head of everything you know about the data set. What details would you need in order to evaluate and trust a given data set? If this is hard to imagine, try finding other people’s data sets, and see whether you think they have enough provenance.
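The re-user's questions can also be turned into a simple checklist. The sketch below assumes a hypothetical list of required fields (REQUIRED_FIELDS is an invention for this example, not an agreed standard) and reports which ones a given record leaves unanswered.

```python
# Hypothetical checklist of provenance details a re-user might expect.
REQUIRED_FIELDS = (
    "creator",
    "method",
    "equipment",
    "processing",
    "preferred_citation",
)


def missing_provenance(record):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# A record much like the chest of cards: a method, but no usable creator.
incomplete = {"method": "Copied from playing cards", "creator": ""}
print(missing_provenance(incomplete))
```

An empty result would suggest that the basic questions are covered. It cannot tell you whether the answers are detailed enough to be trusted, which still takes a human reader.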