Clarity-related text stats for scientific manuscripts (LaTeX).
© 2015 Paweł Korus
Scientific writing toolkit (SWTK) is a Python-based scientific manuscript analysis tool that facilitates clear writing. It computes numerous clarity-related statistics and detects common problems in English writing, such as overuse of weak verbs, excessive use of adverbs, and passive voice.
SWTK generates a self-contained interactive HTML report with appropriate highlights in the text. An example report can be downloaded here (save the file to your computer, and open it from there). This report uses the default template; you can customize the CSS template to your liking.
I designed the toolkit with LaTeX manuscripts in mind; you can expect equations and citations alike to get out of your way while proofreading. Since I did not intend to implement a full-fledged TeX parser, there are some basic guidelines that you should follow for best results. You can easily integrate SWTK into your document compilation workflow. If you wish (or if the included LaTeX parser does not work for you), you can use either Markdown or plaintext documents.
Text analysis relies on the Natural Language Toolkit (NLTK). A full list of currently available analysis modules is featured below. If you do not see what you need, let me know or just add it yourself; the toolkit can be easily extended with plug-ins. For more information on the principles of good writing, refer to the classic book The Elements of Style or one of the available MOOCs (see examples below).
This project was inspired by two online apps: Hemingway and Expresso. Both are online editors, intended for general-purpose writing, and as such did not suit my LaTeX writing workflow well.
Learn More:
- W. Strunk, The Elements of Style (book), http://www.gutenberg.org/ebooks/37134
- W. Zinsser, On Writing Well: The Classic Guide to Writing Nonfiction (book)
- Writing in the Sciences (MOOC), https://www.coursera.org/course/sciwrite
- Academic English: Writing (MOOC), https://www.coursera.org/specializations/academic-english
Alternatives:
- Hemingway
- Expresso
Libraries:
- Natural Language Toolkit (NLTK)
The toolkit is published under the MIT License. This means it is provided for free, but without warranties of any kind.
Get a copy: download the latest version as a ZIP archive or clone the repository:
> git clone https://github.com/pkorus/swtk.git
Install the Natural Language Toolkit (you'll need Python 2.7 first):
> sudo pip install -U nltk
Install the required nltk packages:
- Punkt Tokenizer Models
- Maximum Entropy Treebank POS Tagger
- Averaged Perceptron Tagger (depending on the version of NLTK)
> python -m nltk.downloader punkt
> python -m nltk.downloader maxent_treebank_pos_tagger
> python -m nltk.downloader averaged_perceptron_tagger
SWTK uses the default POS tagger from NLTK. This guarantees that you are using the recommended tagging engine, but the default might change across different versions of NLTK. Note that the taggers can differ not only in classification accuracy, but also in processing time; e.g., the Maximum Entropy Treebank POS tagger is significantly faster than the Averaged Perceptron Tagger. If the default tagger does not work for you, consider switching to a different one.
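For reference, the standalone snippet below (not part of SWTK) shows which tags nltk.pos_tag produces in your NLTK version and compares them against an explicitly instantiated Averaged Perceptron Tagger (available in NLTK 3.1+); the sample sentence is arbitrary:

import nltk
from nltk.tag import PerceptronTagger

tokens = nltk.word_tokenize("The proposed method is evaluated on real data.")

# Default tagger used by nltk.pos_tag (version-dependent).
print(nltk.pos_tag(tokens))

# Explicitly instantiated averaged perceptron tagger.
tagger = PerceptronTagger()
print(tagger.tag(tokens))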
The toolkit is intended for command line usage. If you prefer something more automatic, see the notes below on integration with a typical document compilation workflow. For a full list of command line options, run swtk-analyzer with the --help argument.
> python swtk-analyzer.py [options] filename.(tex|md|txt)
Most important arguments:
Command Line Option | Description |
---|---|
-o filename.html | output filename; use - for stdout |
-e / --external | externalize CSS / JavaScript; by default, they are embedded in the document |
-f / --floats | enable floats: also extract and check the captions of figures and tables |
-m / --math | enable experimental math support |
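For example, to include float captions in the analysis and write the report to report.html (the file names are illustrative):

> python swtk-analyzer.py -f -o report.html manuscript.tex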
The appearance of the report and the user interface is defined by a CSS stylesheet (./data/default.css). You can customize it directly, or supply a different one via command line options (-s). You can also hack your way into the report behavior (./data/default.js or -j).
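For instance, to keep the CSS and JavaScript as external files and use a custom stylesheet (file names are illustrative):

> python swtk-analyzer.py -e -s mystyle.css -o report.html manuscript.tex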
Integration with Kile
Integration with latexmk
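The original notes for these integrations are not reproduced here. As a minimal sketch (not necessarily the recommended setup), the analyzer can simply be chained after a successful compilation, e.g.:

> latexmk -pdf manuscript.tex && python swtk-analyzer.py -o report.html manuscript.tex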
The toolkit is designed for working with technical documents written in LaTeX; commonly used special tokens (references, citations, equations, etc.) will be replaced with stubs, e.g., \cite{label} will become [1]. Equations can optionally be rendered in the text using MathJax; refer to the command line options for more details.
The toolkit implements a rudimentary LaTeX parser. Do not expect it to handle complicated documents with custom commands. If the included parser does not work for you, consider using Markdown or plaintext (see below).
The following things should generally work (an illustrative snippet follows this list):
- document title (\title{})
- authors (\author{} or \name{})
- abstract (\begin{abstract}...\end{abstract})
- sections (\section{}, \subsection{}, etc.)
- paragraphs
- captions of floats (\begin{figure} and \begin{table}); enabled separately via command line options
- non-nested enumerations (\begin{itemize} and \begin{enumerate})
- removal of citations, references, and text formatting commands (emph, textbf, text)
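As an illustration only (not taken from the original documentation; the content and labels are made up), a manuscript fragment structured like this, with important commands separated by empty lines, should parse cleanly:

\title{A Hypothetical Example}

\author{A. Author}

\begin{abstract}
We summarize the contribution in two sentences. The method is simple.
\end{abstract}

\section{Introduction}

This paragraph cites prior work \cite{smith2014} and emphasizes \emph{clarity}.

\begin{itemize}
\item first point
\item second point
\end{itemize}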
Possibly important ignored items include:
- other environments (e.g., center, algorithm)
- document inclusion commands (e.g., input or include)
- clustered commands; the parser treats consecutive non-empty lines as a single block and parses only the first command.
E.g., from the following snippet:
\section{Introduction}
\label{sec:introduction}
only the first command (section) will be parsed. Make sure to separate important commands with an empty line, e.g.,
\appendices
\section{}
Math support is experimental. If enabled, the HTML report will reference MathJax via a CDN, so Internet connectivity is required to view the math in the reports.
Inline equations should be enclosed with $ (a single dollar sign). Separate (displayed) equations should be enclosed in \begin{equation} or \begin{equation*}. Do not expect more complex equations to render properly (e.g., when using the \begin{cases} environment), unless a future update of MathJax fixes this.
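For illustration (the formula and symbols are made up), both of the following forms should render:

The residual $r = y - Ax$ should be small.

\begin{equation}
\hat{x} = \arg\min_{x} \| y - A x \|_2^2
\end{equation}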
The toolkit implements a rudimentary Markdown parser:
- lines starting with #{1,5} will be converted to section titles;
- a block of lines starting with - or a number ([0-9]+\.) will be interpreted as an enumeration (nested enums are not supported);
- format strings will be removed (_, ** and *);
- inline code will be replaced with [code];
- links will be replaced with the link name.
Other content (e.g., images, tables, code blocks) is currently not supported and should not be present in the analyzed document.
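As a made-up illustration, input along these lines is handled by the parser:

# Introduction

This paragraph uses *emphasis*, **strong text** and `inline code`, plus a [link](https://example.com).

- first item
- second item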
Plaintext is the simplest option available. One-line paragraphs without a period will be treated as section titles (top level only). A block of lines starting with - will be treated as an enumeration. Everything else will be treated as raw text.
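A made-up illustration:

Introduction

One-line titles like the one above start a new section, while this line is ordinary paragraph text.

- first item
- second item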
Other formats are not supported. If you wish, you may implement the necessary parser yourself, or use a document conversion tool, e.g., Pandoc.
Analysis Tool | Status | Description |
---|---|---|
Basic document statistics | stable | basic stats like character / word count, etc. |
Part of speech statistics | stable, needs improvement | the number of verbs, nouns, modals, adverbs, etc. |
Extra long and short sentences | stable | self-explanatory |
Weak verbs | stable | weak overused verbs, e.g., is, has, does |
Filter words & phrases | stable | vague words and phrases typically used in spoken language; offers a suggested correction where possible |
Passive voice | stable | self-explanatory |
Frequent bigrams | stable | frequently used pairs of words |
Frequent trigrams | stable | frequently used triples of words |
Frequent acronyms | stable | frequently used acronyms, checks if they are defined in the manuscript |
Rare words | prototype | highlights rare words based on a provided dictionary (by default, 5,000 words from the Brown corpus) |
Buried verbs | prototype | finds verbs that are too far from the subject |
Verbs used as nouns | - | |
Sentence difficulty estimation | - | |
Unreferenced floats | - | identifies figures & tables not referenced in the text |
The toolkit provides a simple plug-in architecture. All you need to do is derive a new class from Plugin and provide suitable processing methods:
process_text(self, paper)
process_token(self, token)
process_sentence(self, sentence)
These methods give you access to the paper, token, and sentence objects, respectively. Naturally, you can also access the sentences and tokens from the process_text method; the remaining two are just for convenience. Once you are done with processing individual tokens/sentences, you need to finish your work in the finalize(self, paper) method, which is called at the end by the plugin manager. It is responsible for generating the final report of your analysis (the report should be appended to the paper.reports list).
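A minimal sketch of such a plug-in is shown below. The imports are omitted because the module paths are project-specific; the token attribute name and the Report constructor arguments are assumptions based on this description and the example further below, so consult the bundled plug-ins for the exact API.

# Hypothetical plug-in that counts and highlights the word "very".
class VeryCounter(Plugin):

    def __init__(self):
        self.count = 0

    def process_token(self, token):
        # The attribute holding the raw word is assumed to be 'word'.
        if token.word.lower() == 'very':
            self.count += 1
            # Leading '_' keeps the highlight disabled by default.
            token.reports.append('_veryWord')

    def finalize(self, paper):
        # Report(label, details), following the example further below.
        report = Report('Very counter', ['"very" used {} times'.format(self.count)])
        report.summary = '{} occurrences of "very"'.format(self.count)
        # Report the class name without the leading underscore.
        report.css_classes = ['veryWord']
        paper.reports.append(report)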
An example report for the POS tagger plugin is shown below:
The final report is represented by the Report class, which contains the following attributes:
Attribute | Description |
---|---|
label | human-readable name for your report. |
details | the main content of your report (HTML text): use a list to have its items automatically wrapped in an <ul> tag (unordered list); use a string to directly specify the HTML of your report. |
help | a short note about what your plugin does (optional); a string with HTML content. |
summary | a short summary of the main results of your plugin; use a string; make it as compact as possible. |
css_classes | a list of all CSS classes that your plugin has used; this list is used for two purposes: to generate a "toggle all" button (for your plugin's label), and to generate the CSS classes in the HTML code; the list can contain either a dedicated CSS object (basically a name and color specification) or a string (in this case, the toolkit will generate the highlight colors automatically). The first option gives you the flexibility to change highlight colors dynamically (see the bigrams plugin for an example). |
In order to highlight selected words or sentences, just append the CSS class name to the reports list (an attribute of both the Sentence and Token classes). The system uses the convention that classes beginning with an underscore (_) are disabled. This is handled automatically by the toolkit; you just need to make sure that your initial assignment is correct, e.g., append the class name _weakVerb if you want the highlight to be disabled by default, otherwise append weakVerb. When reporting the classes to the toolkit (the css_classes attribute of your report), use class names without the leading underscore (i.e., report weakVerb in the previous example).
By default, the toolkit will provide a toggle button for turning on/off the highlights made by individual plugins (once the css_classes attribute is set in the report). Toggling individual highlights is also supported, but requires explicit instructions. When generating the string list with report details, use Plugin.toggle_button_generator to wrap the list items in the necessary HTML code. The method expects a tuple with the item text and a corresponding CSS class that should be toggled.
A brief example:
# Example output of text analysis
reportItems = [('word_1',10), ('word_2',8)]
css_mapping = {'word_1': 'cssClass1', 'word_2': 'cssClass2'}
# Report generation (core of the finalize method)
detailedReport = [('{} : {}'.format(k,v), css_mapping[k]) for (k,v) in reportItems]
report = Report('Rare words', Plugin.toggleButtonGenerator(detailedReport))
report.css_classes = css_mapping.values()
paper.reports.append(report)
It is possible to use the results provided by other plugins in your own analysis. Some plugins (e.g., the text statistics plugin or the POS tagger) provide paper-wide stats that you can use (paper.stats). You can also see the results by examining the reports for individual sentences or tokens (words); e.g., the passive voice analysis module will attach a passiveVerb CSS class to the reports field of the token, and a passiveVoice class to the sentence. Other modules might change other attributes, like the pos_tag. Take a look at the individual modules to see what they provide.
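For instance (a sketch; the class and attribute names follow the description above, and the leading-underscore variant is checked in case the module disables its highlights by default), a sentence processor that reuses the passive voice annotations could look like this:

def process_sentence(self, sentence):
    # Count sentences flagged by the passive voice module
    # (assumes that module has already run and that self.passive_count
    # was initialized in __init__).
    if 'passiveVoice' in sentence.reports or '_passiveVoice' in sentence.reports:
        self.passive_count += 1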
In order to make sure that the data you need is available, configure your plug-in to run later than its dependencies. The run_priority attribute determines the execution order (lower numbers are started earlier); a short example follows the list below. The run priority applies within three run scopes:
- text processors are executed first,
- sentence processors are executed second,
- token processors are executed last.
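A minimal sketch (the attribute placement and the numeric value are assumptions; pick a value larger than your dependencies' and check the bundled plug-ins for the exact convention):

class MyAnalyzer(Plugin):
    # Lower numbers run earlier within the same scope, so a dependency
    # such as the POS tagger should use a smaller value than this plug-in.
    run_priority = 10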