Bad science

I became aware of a couple of interesting opinion pieces in the academic literature this week, both via Twitter.

The first is titled “Benchmarking Open Access Science Against Good Science” (a “Commentary”) by Lindenmayer and Likens, published in the Bulletin of the Ecological Society of America, ref. via Sean Ulm. The second is apparently an editorial (no author given) at The Economist, titled “Trouble at the lab” (open access). I consider both to be well worth reading, and more or less right on the money. I’ll summarize the first one briefly here for those without access.

The authors’ principal point is that scientists who use publicly available data sets in their studies need to be very careful with their analyses to avoid coming to wrong conclusions. The basic reason for this: there are often details and subtleties to such data that need to be thoroughly understood, but which are often not. They state:

Our extensive experience from a combined 80 years of collecting empirical data is that large data sets are often nuanced and complex, and appropriate analysis of them requires intimate knowledge of their context and substance to avoid making serious mistakes in interpretation. We therefore suggest that it is essential that those intending to use large, composite open-access data sets must work in close collaboration with those responsible for gathering those data sets.

Then they really unload on a certain class of scientists:

There is also the emerging issue of a generation of what we term here as “parasitic” scientists who will never be motivated to go and gather data because it takes real effort and time and it is simply easier to use data gathered by others. The pressure to publish and extraordinary levels of competition at universities and other institutions (Lindenmayer and Likens 2011) will continue to positively select for such parasitic scientists. This approach to science again has the potential to lead to context-free, junk science. More importantly, it may create massive disincentives for others to spend the considerable time and effort required to collect new data.

It’s not every day you see such harsh things said in academic journals, and they could have avoided use of “parasitic”, but their point is well founded.


6 thoughts on “Bad science”

  1. Thanks for linking to these essays, Jim.

    I’ll admit that reading them left a bitter taste in my mouth because I am one of the parasitic scientists they are referring to. Even though I agree with them, I won’t stop working with pre-existing data sets for the foreseeable future. I suppose this makes me a hypocrite.

    Lindenmayer and Likens were obviously resorting to hyperbole to get their message across. Unfortunately, they left out two huge points to consider: (1) many of the questions being answered using these big open-access data are beyond the capabilities and resources of single researchers (PhD students and post docs) and even most research teams. (2) Researchers working on these open data are adding value by sifting through and interpreting the mass of raw data, complementing these data with custom-written simulations that try and match pattern to process and even adding additional information to pre-existing data.

    While one could argue that my first point isn’t really a worthy justification, it is a reality of science today. Why would I (or anyone else) forfeit the chance to work on important, general questions during my research just because I didn’t gather the data myself? How will the data I can reasonably gather during my PhD compare to those waiting on a server somewhere? I doubt I could match the data collected over multiple years by huge research teams.

    It would be a waste of resources for me to re-gather inferior data…

    Personally, I think the solution isn’t to dismiss recycled data. Instead, we should just place, say, a 5-8 year embargo on all public data. During this period, the owner of the data has the exclusive right to analyse and publish this data. This would ensure that all the “low hanging fruit” has already been plucked by the time the data is made publicly available and it would create a real disincentive to the true parasitic researchers who are just trying to squeeze out some easy papers. After 5-8 years, it will take real creativity and expertise to extract novel patterns from a well-known dataset and, by that time, all the limitations of the data should be out in the open.

    The rest of their opinions are spot on – it is important to familiarise yourself with the data, whether you collected it yourself or downloaded it from a server (How was it collected, what was actually being measured… etc). Also, it is non-negotiable that you have a decent understanding of the system being examined (it is reckless to work on an unfamiliar system solely because you have the data already).

    All the negative issues of open data can be fixed by good scientific practice and ethics…there is no need to toss the baby out with the bathwater.

    • My longish answer got accidentally deleted so I’ll just say great comment Falko. You’re seeing both sides of the issue better than L&L IMO, even if they do have an important point. Another point is that some large data sets are not collected by individual researchers or teams, but by governmental agencies. Those are fair game for anybody to analyze. And I think they wouldn’t be half as upset if those using others’ data were careful about understanding all the ins and outs of those data, which will often require close collaboration with the collectors. This would reduce the analytical errors being made.

      My second publication was a letter contesting the claims of a paper in Geophysical Research Letters whose authors used exactly the two data sets I used in my dissertation. The authors clearly thought they could get away with what was a shoddy analysis leading to a wrong principal conclusion, because few GRL readers would even know of the existence of these data, let alone the critical details thereof. There was just no way I was going to let that pass. I ought to tell the full story on that at some point, because it was not pretty.

  2. Strong language from Lindenmayer and Likens. I strongly agree that people should be very careful when using data sets collected by other researchers. Collaboration, or at least checking with the original authors, is a good idea to avoid misuse and misunderstanding.

    On the other hand, L&L criticize doing science backwards. However, when human behavior is being studied, it appears that backwards science is one way of doing a truly blind study. The researchers gathering the data can’t unintentionally insert bias on a given matter, if they don’t know that somebody else will come later and use their dataset to study that matter.

    The Economist article isn’t really an editorial. Those are published as ‘leaders’, and there is also a leader to go with this topic. It is also their tradition to not publish the authors’ names.

    • Welcome carrot eater, I remember you from comments at Rabett Run.
      Interesting and important point in paragraph 2 and thanks for that clarification in paragraph 3.

  3. I was really disappointed by this paper. Is good science better progressed by vitriolic*, data-free** arguments from authority*** that are poorly aimed**** and divisive*****?

    * “parasitic scientists” on “fishing trips”
    ** “in our discipline of ecology, there is an increasing number of examples … where substantially flawed papers are being published [CITATION MISSING], in part because authors had limited or no understanding of the data sets”
    *** “Our extensive experience from a combined 80 years of collecting empirical data”.
    **** “open access” is much, much more than shared data
    ***** “the emerging issue of a generation of what we term here as “parasitic” scientists”

    • I’d say “vitriolic” is definitely an over-statement, and a number of people are way over-sensitive about this and related issues. I’m one of those “parasitic ecologists” since much of my work involves the use of data collected, maintained and databased by others, but I don’t get upset by what they said, because they do have an underlying point. And that point is that you have to be very careful with what you conclude when you do analyses with data that you yourself were not involved in collecting, because you may not fully understand said data’s limitations and legitimate applications. With that I agree about 1000%.

      Having said that, I nevertheless take your points as basically valid–they need to lay out their arguments much better than they did, and knock it off with the potentially inflammatory language.
