Scientific prose

Having recently moved country, with all the attendant difficulties imposed by the language barrier, I have begun to think a little about communication, how we inform one another of our thoughts and make ourselves understood. As a scientist I am particularly interested in the style of writing in scientific prose, since this at the same time needs both to convey very complex, novel ideas, and to be understood by the international scientific community, many of whose members do not have English as a first language. I also feel that, ideally, a good scientific paper should be at least somewhat understandable to the interested layperson, particularly in areas of science such as medicine where personal, corporate and state decisions will be based on published research. I do confess, however, that when writing my papers I write for scientists and not the general public!

Personally I can find reading scientific papers, even within my own field, more challenging than many other styles of English. Why is this? I can think of at least four explanations:

  1. Unfamiliarity of language. If I am informed that “the slithy toves did gyre and gimble in the wabe”¹, I don’t have a clue what any of those words mean. (Although, curiously, I can identify their parts of speech…)
  2. Unfamiliarity of concepts. I know what the words “closed”, “metric” and “space” mean in everyday parlance, but it takes a bit of effort to recollect from my undergraduate days what exactly a “closed metric space” is.
  3. Complexity of language. Scientific papers typically contain many long complex and/or compound sentences such as this: “We show that when mass loss is slow, systems of two planets that are marginally stable can become unstable to close encounters, while for three planets the timescale for close approaches decreases significantly with increasing mass ratio.” (Debes & Sigurdsson, 2002) Here a short main clause “we show” introduces a huge subordinate clause 35 words long, which itself is composed of three nested levels of clauses.
  4. Complexity of concepts. The above quotation is describing the effects of a star losing mass (as happens before it becomes a white dwarf) on any orbiting planets. Whether these planets are stable depends on their masses relative to the star’s, which increase as the star loses mass. Systems of two and three planets behave somewhat differently, but in both cases the question is whether, or on what timescale, the planets will approach each other closely. This complex set of phenomena, and the causal relationships between them, are described by the authors in the quoted sentence.

Clearly, to a large degree 1 and 2 are interdependent. Wovon man nicht sprechen kann, darüber muss man schweigen (“Whereof one cannot speak, thereof one must be silent”). We must have the necessary vocabulary to begin to discuss anything, for if we do not have words to describe something, how can we discuss it? On the other hand, scientific language tends to take familiar words and give them a very precise meaning. For example, one cannot properly understand a closed metric space simply by referring to the everyday meanings of the three component words. Increasing technical vocabulary and concepts go hand in hand when studying any field, sport as much as science, and together present one barrier to understanding a text. This is unavoidable, since it is not possible in any paper to explain all the technical terms used: that is a matter for the textbooks! It is, however, a reasonable assumption that anyone (scientist or layperson) interested in reading a paper will have some knowledge of the requisite background, and it is common to introduce less well-known terms. In my field, for example, it is customary to inform readers that a “debris disc” is an extra-Solar analogue of our Solar System’s Asteroid and Kuiper Belts: any professional astronomer, or layperson with an interest in astronomy, will understand that.

What I’ve been wondering is whether 3 is so necessarily dependent on 4; and, if so, whether that is a barrier to, or maybe rather an aid to, understanding. I.e., does the semantic complexity of scientific concepts—complex chains of causation, conditionals and counterfactuals, caveats etc.—necessarily entail a concomitant syntactic complexity when they are expressed in natural language? And if so, is this a hindrance to understanding them, or would lots of shorter, simpler sentences be harder to understand than one equivalent long one? To give an example, could the sentence from Debes & Sigurdsson quoted above: “We show that when mass loss is slow, systems of two planets that are marginally stable can become unstable to close encounters, while for three planets the timescale for close approaches decreases significantly with increasing mass ratio.” be rewritten something like this: “Some systems of two planets are marginally stable. We show that slow mass loss can render them unstable to close encounters. Systems of three planets have a timescale for close approaches to occur. We show that slow mass loss decreases this timescale significantly.” This is more verbose (43 words against 37), as there is some repetition (e.g., of “we show”), and breaking up the sentences means more pronouns had to be introduced. Question to you, readers: Do you find the original sentence, or my rewritten sentences, easier to understand?

While pondering that, it’s worth comparing scientific prose to other factual English prose written to inform at a relatively high level. To do this, I compared two texts: First, the abstract and introduction from the Debes & Sigurdsson paper from which the above sentence was taken; this had the triple advantage of being (1) close to hand, (2) related to my research, and (3) despite my complaints about the difficulty of scientific writing, actually a very well-written paper. Second, the opening paragraphs of “The History of Madrid” chapter from my Dorling Kindersley travel guide “Madrid”; this has the advantage of needing to introduce new vocabulary itself, in the form of Spanish words, persons etc., in the same manner as the introduction to a scientific paper.

The simplest measure of linguistic complexity might be to simply count the number of words in the sentences. This I did for each text, removing sentences containing semi- or full-colons (as they could just as easily be parsed as compound sentences or strings of simple sentences). For the scientific paper, I ignored any references in parentheses such as “…similar to the Solar system (Duncan & Lissauer 1998).”, but counted references essential to the meaning of a sentence such as “Duncan & Lissauer (1997) … found that…”, in this case, as 4 words.
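For anyone who would rather not count by hand, the procedure can be roughly sketched in Python. This is only an illustration under my own assumptions, not a record of how the counts in this post were made: the function names, the regular expressions and the sample text are all invented for the example, and the sentence splitter is deliberately naive (it breaks on full stops, so abbreviations would confuse it).

```python
import re
import statistics

# Hypothetical pattern: a parenthesised citation such as "(Duncan & Lissauer 1998)".
# It needs a letter before the year, so a bare "(1997)" is left in place and
# "Duncan & Lissauer (1997)" still counts as four words.
CITATION = re.compile(r"\s*\([^()]*[A-Za-z][^()]*\d{4}[a-z]?\)")

def sentence_lengths(text):
    """Return the word count of each sentence that survives the filters."""
    # Remove parenthetical citations before splitting into sentences.
    text = CITATION.sub("", text)
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Drop sentences containing a colon or semicolon.
    kept = [s for s in sentences if ":" not in s and ";" not in s]
    # A "word" here is simply a run of non-whitespace characters.
    return [len(s.split()) for s in kept]

# Invented sample text, only to show the call.
sample = (
    "Planets are stable, similar to the Solar system (Duncan & Lissauer 1998). "
    "Duncan & Lissauer (1997) found that orbits change slowly. "
    "Some results are odd: we ignore them."
)
lengths = sentence_lengths(sample)
print(lengths)
print("%.1f +/- %.1f words" % (statistics.mean(lengths), statistics.stdev(lengths)))
```

The spread printed here is the sample standard deviation, which is one plausible reading of a "+/-" figure; a real text would need a more careful sentence splitter than this one.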

There were 32 sentences in the Debes & Sigurdsson text, and 31 in the Dorling Kindersley. The former had an average length of 30.2 +/- 8.9 words, the latter 21.9 +/- 7.4. While the sentences in the scientific text are longer, with such a range of lengths the difference is probably not statistically significant, and I will need more samples of texts to say definitively whether the scientific paper’s sentences are longer. I should also draw from more than one source, since there might be great variations of individual style between authors.

On the other hand, just because a sentence is long, does that make it complicated? The sentence “The big red car, the small green car, and the huge black bus stopped at the traffic lights”, containing 18 words, I would say is less complicated than “The car which was big and red, and the bus indicating to turn left, stopped when they wanted”, despite having the same number of words. The former is a simple sentence with only one main clause, the latter a complex sentence with one main and two subordinate clauses, as well as a participial phrase. To attempt to measure this sort of complexity, I counted the number of finite verbs (those that can stand alone in a clause or sentence) in each sentence. This gives the average number of clauses per sentence, although it misses out constructions such as participial phrases, infinitive phrases, gerunds and the like. The results were: Debes & Sigurdsson, 2.25 +/- 1.14 finite verbs per sentence; Dorling Kindersley, 1.94 +/- 1.06 finite verbs per sentence. Again, we see a hint that the scientific text is indeed more complicated.

Although this needs to be confirmed with a larger study, this suggests that scientific prose may be more syntactically complex than other informative factual writing. Whether this is a hindrance to understanding, and whether it is practical to simplify our language, are of course further questions to explore…

Oh, and I just read through the draft of this, and Damn! do I write some long sentences!

Edit (25/09/11 17:46 UT): A t-test tells me that the difference in mean sentence length between the two texts is statistically significant (p=0.00016, about three and a half sigma), but the difference in the number of finite verbs is not (p=0.26).
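For the curious, a test of this kind can be run from the summary numbers alone. The sketch below is a guess at the sort of calculation involved, not the exact one behind the figures above: it assumes the "+/-" values are sample standard deviations, it uses Welch's unequal-variance form of the t-test (the pooled Student form would be an equally reasonable choice), and the function names are my own. The p-value is obtained by integrating the t distribution numerically so that nothing beyond the standard library is needed.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def welch_t_test(mean1, sd1, n1, mean2, sd2, n2, steps=2000):
    """Welch's t-test from summary statistics; returns (t, df, two-sided p)."""
    a, b = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(a + b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    # Integrate the density from 0 to |t| with Simpson's rule (steps must be even).
    h = abs(t) / steps
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    p = 1.0 - 2.0 * (h / 3.0) * total
    return t, df, p

# Summary statistics quoted in the post: mean, spread, number of sentences.
length = welch_t_test(30.2, 8.9, 32, 21.9, 7.4, 31)
verbs = welch_t_test(2.25, 1.14, 32, 1.94, 1.06, 31)
print("sentence length: t=%.2f, df=%.1f, p=%.5f" % length)
print("finite verbs:    t=%.2f, df=%.1f, p=%.2f" % verbs)
```

Because the inputs are rounded summary values rather than the raw counts, the output should be read as a consistency check on the quoted p-values rather than a reproduction of them.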

¹ OK, I know this isn’t science, but it’s a good example. Try this from last week’s Nature: “In contrast with previous assumptions, we report here that the nascent antizyme polypeptide is the relevant polyamine sensor that operates in cis to negatively regulate upstream RFS on the polysomes…”