Metrics, feedback systems and evaluation. Text of a keynote given at a NISO virtual conference, February 2019

How did we get here? Is there a clear path forward?

The skills to understand how well we’re performing, to reflect on the implications and to act on the insights are essential to the human state. Our ability to abstract these thoughts, and to express them – whether in language or numbers – has given us the ability to communicate all of these processes through time and space; to compare, to evaluate and to judge performance. Research evaluation is one small part of the human condition, and one in which that expression of performance is increasingly communicated through the abstraction of numbers. But in the middle of all of these data, we mustn’t lose sight of the essential humanity of our endeavour, an essence that – perhaps – can’t be abstracted. While a path is emerging, there are challenges to be faced in the future, and balances to be found between abstracted data and human need.

    1. Humanity, from its earliest days, has always wanted to understand and control.

      The earliest artefacts that our ancestors have left us tell us about their preoccupations. At a basic level, they hunted, they ate, they drank, they cooked, they lived. But the biggest, the most substantial artefacts are those that they left to mark their lives – to celebrate their dead ancestors’ achievements – and to understand their relationships with the natural world.

      Wherever you stand on this planet, the sun dominates the seasons and the day. It tells you when to hunt, when to move, when to harvest. And up in the sky, the stars appear to rotate around us, providing information about the seasons and our location on the earth. It is not surprising that all the evidence from antiquity tells of humanity’s obsession with astral bodies: either passively, measuring the sun’s passage through the year; or actively, attempting to appease the gods to influence their behaviour and to guarantee their reappearance. All natural phenomena that affect humanity were subject to analysis through divinity. From the Bronze Age onwards there was a tradition of casting money and other offerings into water – often at crossing places and junctions, and at places where flooding might occur. It’s testimony to the resilience of tradition that the habit persists to this day, despite most people never having heard of Belisima, or Achelous, or Vedenemo – three of the many gods who would have been appeased by such offerings.

      Whether good or bad, or merely decorous, habits persist. And these stretch from throwing a few cents into a wishing well, to using the H-index to compare researchers in different fields and generations. It takes real energy to change these habits, and that change needs to be supported by theory and investigation.

 

  • In the absence of direct control or accepted theory, humanity uses hypothesis, metaphor and abstraction to develop understanding and control.

    Our ancient ancestors would not have had the working hypotheses of gas, or rock, or gravity, or space – which we routinely invoke to explain the rise and fall of tidal waters. But as generations lived by the sea, hunted by it (and in it), and depended on it, they would have known about these rhythms and the effect that the moon had on their environment. The ancient Greeks knew about this relationship, but talked about it in terms of the moon attracting the water (in an animistic manner – in other words, capriciously and knowingly). Galileo tried to offer a mechanical explanation (he didn’t like forces he didn’t understand), whereas Kepler was happy to make that conceptual jump into the unknown. It wasn’t until Newton that we had a functioning theory that allowed for proper scientific work on the lunar effect on the tides. Plate tectonics is another excellent example. In my dad’s school geography textbook (from the 50s), the explanation for the evolving crust is entirely wrong. And it couldn’t have been right: the mechanisms had not been discovered – or at least they hadn’t been accepted as a consensus. As an interesting side note, it has been suggested that the political turmoil of the mid-20th century held back this consensus by several decades.

    All of this is not to dismiss humanity’s attempts to explain cause and effect, or to measure it, or to influence it. It is to observe that in the absence of known theory, we do our best with the tools that are available to us. We work towards a consensus – hopefully using scientific methods – and (hopefully) we achieve a broadly accepted theoretical framework.

    As an illustration of where we are, and how we got here, I want to consider two areas, both of which I feel have some relationship with how research metrics are being developed and used. The first comes from ideas of human management, or human resource management. The second comes from engineering: the science of feedback.

  • The modern roots of management theory
    Life used to be so simple. In order to motivate people, you’d show them an example of what to do, and then punish them if they didn’t raise their game. Take the example of General Drusus, whose life is immortalized in the Drusus Stone, in Germany. He was only 30 at his death, and this enormous monument supposedly celebrated his achievements in bringing peace to that part of the Roman Empire.

    In modern terms, we might present chemists with the example of (say) the Nobel Laureate Greg Winter, whose discoveries enabled modern cancer treatments using monoclonal antibodies, and founded an industry worth hundreds of millions of dollars. And, having shown them the example, threaten to defund their labs if they didn’t behave appropriately.

    This may sound far-fetched, but there are analogies to be found in the present time – and in scholarly research. While I won’t mention countries – I have valued friends and colleagues – one country I am familiar with examined the importance of international collaborations in what they felt were comparable countries. Seeing some trends, they obliged researchers to increase the number of international collaborations (no matter how or why) under threat of defunding. Although the collaborations increased, it doesn’t look as if this … experiment … was particularly successful.

    However, in terms of how we manage our fellow humans in commerce and industry, and how we support them to develop their performance, we have – generally – come a long way since Roman times. Albeit mostly in the last 100 years.

    It took until the beginning of the twentieth century for industry to start examining personnel management seriously. It didn’t emerge from any moral or ethical drive; rather, it was pragmatically born of the economic and population crises that followed the world wars. It was driven by the need to rebuild countries, and to accommodate emerging labour organizations, democracy, and social ambition.

    The first formal attempts at understanding the reflexive, thoughtful human in the workplace – as compared to the “unit of production” approach of Taylorism – were explored in the 1950s. These were based on scientific hypotheses about inheritance and inherent, unchangeable qualities, inspired by the behavioural sciences and psychological theories of the time. The 1960s and 70s saw the introduction of more individualistic, goal-oriented approaches. For the first time, the subject – the employee, the human – became able to reflect on their own performance.

    And over the last two decades we have seen the growth of 360 appraisal: evaluation and feedback of the individual embedded in a complex network of interdependencies.

    Personally, I remember the transition well. The start of my career was marked by line managers telling me how well I’d done. For a few years after that, I was asked how well I felt I was doing. And now, for the last few years I’m asked – how well does the company accommodate you, what can they do better. (I’m sure that’s not just at Digital Science – although it’s a great place to work!)

    A Brief History of Performance Management

    In the last 100 years, then, we have come a long way. We’ve come from a combination of “be like this” and “do as I say”, via “you are what you are, and that’s a fixed quantity”, to a much more sophisticated, reflexive concept: “how do you fit in a system”, “how do we accommodate you inside this complex network”. In short, we have abandoned a STEM-like, faux “scientific” approach in favour of a more discourse-focussed, human-centred, experiential process.

  • The development of feedback science
    The opposite trend can be observed in the fields of engineering, computer science and mathematics. The notion of a system that receives feedback on its own performance – and responds accordingly and dynamically – is at the heart of any performance management system.
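
    To illustrate the idea in its simplest form, here is a toy sketch (not any historical mechanism, just a minimal proportional controller nudging a machine’s speed towards a setpoint):

        # A machine whose speed we want to hold at a target value.
        target_speed = 100.0
        speed = 60.0
        gain = 0.3  # how strongly the controller reacts to the error

        for step in range(10):
            error = target_speed - speed   # feedback: measure our own performance
            speed += gain * error          # respond accordingly, and dynamically
            print(f"step {step}: speed = {speed:.1f}")
        # The speed converges towards 100 as the error shrinks on each cycle.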

    The field was founded in Ancient Greek technology; it was developed by Arabic cultures and finally flourished in the industrial revolution of the 1700s. Time was always the driver – for those first 1500 years, humanity was obsessed with accurate time-keeping. In the 1700s, feedback mechanisms became essential to governing the speed of steam engines and mills, allowing them to go from experimental novelty to essential factory equipment.

    Engineers began to use the mathematics of feedback science to predict and develop mechanisms as part of the system, rather than deploying them in an ad hoc manner to control unruly machines. We see the genesis of hypothesis-driven research in this field, rather than trial-and-error experimentation. In the 1950s, Soviet scientists made huge theoretical breakthroughs to support their space programme, and maths and computer science have since combined to give us all miniaturized devices with more positional accuracy than was conceivable only a few years ago.

    We can see, then, two very different approaches to feedback, correction and evaluation: an approach to managing humans that becomes more humane over the decades (as a more dogmatic, “scientific” approach fails to produce rewards); and an approach best suited to systems (even systems that involve humans) that takes a rigorous, theory-based approach to control.

    So how do these apply to the “business” or “industry” of research?

 

 

 

  • The growth of research as an investment business.
    I think that we have to be willing to view one of the contexts of research evaluation as part of the feedback loop of “research as a business”.

    In business, people expect a return on their investment. This might be expressed as hard cash, or notions of increased wealth, or as narratives generated by the business. Over the centuries, we have been flexible.

    As a relatively early example of a funded researcher, Charles Babbage appears to have devoted more of his time to asking for money and explaining where it had gone than to actually working on his machines. John Harrison – who invented the first clock sufficiently accurate to compute longitude at sea – was supported financially by the British Government, who stood to gain massively from the increased navigational efficiency of their fleet. As a side note, it’s worth observing that the Government refused to accept that he had performed well enough to merit winning the equivalent of over $3M that they had established as a prize, and that Harrison had to resort to any number of tactics to maintain a financial lifeline. Researcher and funder fall out over results. The sun never sets on that one.

    Today, research is a well-funded industry. Digital Science’s Dimensions application has indexed $1.4 trillion of research funding, and a vast set of outputs coming from that funding: 100 million publications; nearly 40 million patents; half a million clinical trials; and a similar number of policy documents. You could be crude, take one number, divide it by another, and come to some conclusions about productivity – but that “analysis” would be unlikely to be helpful in any context. People have probably done worse in the pursuit of understanding research.

  • The emergence of metrics-focussed/centred research evaluation.
    According to researchers Kate Williams and Jonathan Grant, one of the most decisive steps towards a metrics-centred view of research evaluation almost happened in Australia, in 2005. The proposal was explicitly based on a political commitment to strengthen links between industry and universities. The proposed Research Quality Framework focussed on the broader impact of research, as well as its quality. The plan was eventually abandoned, largely due to political change. Nevertheless, it was hugely influential on the UK’s proposal to replace its Research Assessment Exercise with a system based on quantitative metrics. One particular obstacle that came up (according to Williams and Grant) was the explicit “steering of researchers and universities”. However, the UK finally adopted its new framework – the Research Excellence Framework, or REF – although the impact portion was a much-reduced percentage: initially set at 20%, rising to 25% in 2017.
  • The growth of metrics; the response to metrics
    Every action has an equal and opposite reaction.
    The movement towards greater reliance on metrics to provide the feedback and evaluation components in the research cycle has inspired appropriate responses. Whether DORA, the Leiden Manifesto, or the Responsible Metrics movement, we see clear positions forming on what may be seen as appropriate and inappropriate, responsible or irresponsible. Clearly there are a couple of candidates that often get identified as big issues. Use of the Journal Impact Factor as a way of understanding a researcher’s performance is, absolutely, numerically illiterate. The H-index is clearly biased towards certain fields, later-stage researchers, fields with higher rates of self-citation, people who don’t take career breaks, and – therefore – men.
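
    As a reminder of how that index works (and why it can only ever grow as a career lengthens), here is a minimal sketch. The numbers are invented purely for illustration:

        # h-index: the largest h such that at least h papers have at least h citations each.
        def h_index(citations):
            ranked = sorted(citations, reverse=True)
            # Count the ranks (1, 2, 3, ...) at which a paper still has >= rank citations.
            return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

        # Hypothetical early-career and later-career researchers. Citations only
        # accumulate over time, so a longer career can only push h upwards.
        early_career = [12, 9, 7, 4]
        later_career = [30, 25, 20, 15, 12, 9, 7, 4, 3, 2]

        print(h_index(early_career), h_index(later_career))  # 4 7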

    For me, there is an interesting dichotomy. Take a hypothetical metric that relates a number of citations to a number of papers. The simplest way to construct it is to divide the former by the latter – and that is probably what is most commonly done. It’s certainly well understood by the vast majority of people. And yet, it’s highly misguided. That simple maths works well if you have an approximate balance between highly and lowly cited documents – a case that simply never happens in citation data, where we always have a large number of low-performing documents and a small number of high-performing documents. Using such a simple but misleading piece of maths results in us concluding that the vast majority of documents are “below average”. Which is supremely unhelpful.
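
    To make that concrete, here is a small sketch with entirely hypothetical citation counts: a typical long-tailed pattern, with many lowly cited papers and one outlier.

        # Hypothetical citation counts for ten papers.
        citations = [0, 1, 1, 2, 2, 3, 3, 4, 5, 120]

        # The "simple" metric: total citations divided by the number of papers.
        arithmetic_mean = sum(citations) / len(citations)
        below_average = sum(1 for c in citations if c < arithmetic_mean)

        print(f"arithmetic mean = {arithmetic_mean:.1f}")                       # 14.1
        print(f"papers 'below average' = {below_average} of {len(citations)}")  # 9 of 10

    The outlier drags the mean far above anything a typical paper achieves, so nine of the ten papers end up “below average”.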

    The Leiden Manifesto elegantly observes that:
    “Simplicity is a virtue in an indicator because it enhances transparency. But simplistic metrics can distort the record (see principle 7). Evaluators must strive for balance — simple indicators true to the complexity of the research process.”

    My experience is that while nearly everyone is happy with “divide one number by another”, as soon as we introduce some better mathematical practice – for example, calculating the exponential of the arithmetic mean of the natural logs of the citations, to reduce the effect of the small number of highly cited articles – people’s eyes glaze over. Even if this does result in an average value that is much “fairer” and “more responsible” than the arithmetic mean.
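
    As a rough sketch of that kind of calculation (using the same hypothetical counts as above; shifting every count by +1 to cope with zero-citation papers is my own workaround here, not a universal standard):

        import math

        citations = [0, 1, 1, 2, 2, 3, 3, 4, 5, 120]

        # Exponentiate the arithmetic mean of the natural logs (a geometric-style mean).
        # log(0) is undefined, so shift every count by +1 and shift the result back.
        logs = [math.log(c + 1) for c in citations]
        geometric_mean = math.exp(sum(logs) / len(logs)) - 1

        arithmetic_mean = sum(citations) / len(citations)
        print(f"arithmetic mean = {arithmetic_mean:.1f}")  # 14.1 - dominated by the outlier
        print(f"geometric mean  = {geometric_mean:.1f}")   # ~3.3 - closer to a 'typical' paper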

    Finding this balance – between accessibility and fairness – is all the more critical when it comes to considering the changing population of people who are using metrics. Every week, on various email lists, we see people posting messages akin to “Hi, I’m reasonably new to metrics, but my head of library services just made me responsible for preparing a report … and how do I start?”

    This was brought into sharp relief at a recent meeting in London, when the organizers were considering the number of institutions in the UK – 154 degree-awarding bodies – versus the number of “known experts” in the field. There’s a real disparity – and you should bear in mind that the UK is probably the leader in metrics-focussed institutional analysis. Initiatives such as the LIS-Bibliometrics events, under the watchful eye of Dr Lizzie Gadd, and the Metrics Toolkit are essential components in supporting the education and engagement of the new body of research assessment professionals. We can’t assume that the users of our metrics are expert in the detail of, and background to, all calculations and data.

 

  2. Research is a human endeavour.
    However, I want to move away from a debate on the detail: I know there are many experts who are going to follow me, who are very well positioned to discuss more nuanced areas of metrics. Let’s focus on a bigger question: what are we trying to achieve with research metrics and evaluation? Are there two different things going on here?

    We are engaged in a human endeavour. For example, researching Alzheimer’s. What strategies are useful: are we trying to cure, prevent, slow down, or ameliorate? For widespread populations, within families, or for an individual? What funding works, what drugs? What areas should be de-invested in – perhaps just for the present time? Are there any effective governmental policies that can help shift the curve? For me, a key part of the work in the field of metrics and evaluation is trying to understand the extremely complex relationships and interdependencies within topics – the human task that we have set ourselves; the fields in which we work.

    And separate from that, we have other questions: how well is a lab or a funder or a researcher or a journal performing? These are questions about human performance and productivity, albeit of humans operating in a complex system. And, because these are mostly human artefacts, they are capable of reflection and change. They have ambitions and desires and motivations. The type of approach needed to support humans, and their desire to fulfil their ambitions, is quite different from that needed to understand the shape, direction and dynamics of a field of knowledge.

If I am to make a prediction…
At the beginning of the presentation, I described two other areas where feedback and evaluation have been a crucial factor in the development of human performance and system efficiency: the increasingly human-centric analysis practised by corporations in the pursuit of increased excellence; and the great theoretical, mathematical and computational breakthroughs that have revolutionized all forms of technology.

At the current time, it feels as if we are standing at a fork. On the one hand, we could use big data, network theory, advanced visualizations, AI and so on to really dig into research topics, to throw up new ideas and insights into the performance of this particular area of human society – a revolution similar to that underway in online retail, or automotive guidance.

And on the other hand, we have the increasing impact of research metrics on individual humans, and the need to be acceptable to the broadest possible slice of our community.

These two things are not the same as each other. They require different data, they offer different conclusions.

Even now, driving my fairly ordinary car, if I were presented with the data being processed by the car’s computer as it keeps the wheels spinning at an optimum rate, I would be unable to think about anything else. Dividing one number by another, or just “counting some stuff”, is perfectly fine on one level. But it’s entirely inadequate for understanding the nuances or trends within research.

When I think about the analytics that are possible with a modern approach to research metrics, I often think of the work of Professor Chaomei Chen at Drexel. Chaomei has been working for several years on deep analysis of the full text of research articles. His goal is to map the progress of a topic as it goes from uncertainty (“it is suggested that virus A is implicated in condition B”) to certainty (“B is caused by A”). The technology is grounded in a number of theoretical approaches, which Chaomei can present using highly informative visualizations.

While these visualizations can support qualitative statements about the role that individuals, laboratories or journals play, that is not their purpose. They are designed to illuminate the trends, status and progression of topic-based work.

When it comes to looking at individual humans within research, I think there is another revolution that will come about.

For years, we have been accustomed to thinking that metrics are something that happens to researchers; or (if you work in a research office) something that you do to yourself. The world is changing, and the new generation of researchers will be much more aware of their own standing, their own profiles, their own strengths and their own ambitions. This is, after all, the selfie generation – and if the current massive trend towards sharing, collaboration and open access was inspired by the Napster generation (a high-school graduate when Napster was launched is now in her late 30s), then we are going to see a far more self-aware and self-reflective population of researchers in 20 years than we’ve been accustomed to.

The recent push towards “profiles” and the use of “baskets” (or “buckets”) of metrics is absolutely compatible with this generation, and is a start. We should be prepared for more of the same: and that includes investing in some of the concepts that we see in Human Resources (or “Talent Management”, as we now see it called) – for example, 360 reviews. Why shouldn’t a researcher be asking hard questions of a funder’s support? Or of a journal’s likelihood of promoting the research in the media? Or of the prospects of a promotion in a lab?

In conclusion, I am extremely optimistic about the state of metrics. It seems that the conversations and movements are heading in the right direction – but both sides would benefit from more conversations about the purpose – and limitations – of the data-driven approach.
