Machine Learning evaluation study for SYRAS (Systematic Review Assistant)

My original goal for SYRAS, the Systematic Review Assistant, was to automate the laborious process of reviewing thousands of scientific article references and judging their relevance to a particular topic of study. During the initial development of the software, once I had a basic application up and running, I conducted a detailed evaluation of various statistical techniques and machine-learning algorithms to see which would best solve this specific problem.

I’d like to present what I learned and discovered during that process. It will take a few articles, now that I’ve had time to review the hundreds of pages of notes I took during the voyage. The topics I’d like to cover are:

  1. Overall orientation: identifying the problem and potential solutions.
  2. Tips: things you should know before starting an ML project.
  3. Natural Language Processing: NLP-specific Machine Learning has some extra complexities requiring pre-processing steps before more general algorithms can be used.
  4. Algorithms: identification of algorithm suitability, from off-the-shelf to custom designs and ensembles.
  5. Hyper-parameters: the difficulties of dealing with these frustrating fudge-factors, in theory and in practice.
  6. Statistics: it’s vital to understand how to prepare and score experiments, avoiding common mistakes which can invalidate your results. This covers randomisation, cross-validation, normalisation and accuracy measures.
  7. Systematic Review specifics: the workflow of the application introduces some quirks into the learning problem, including incremental learning and horribly unbalanced corpora.
  8. The performance results: evaluation of algorithm performance and tuning.

Some of the topics I will cover may seem obvious to those experienced in ML projects, but I made many newbie mistakes during this study, which cost me a lot of time and effort. While the hands-on learning was valuable and concrete, I wouldn’t want to waste the same time again, so this is also a guide, to myself and others, on how to approach such a project in future.

Step 1: Identify the problem and potential solutions.

To kick off any project, I recommend you ensure you truly know what the problem is. This might sound obvious, but I made a couple of major u-turns during the ML evaluation because I had delayed fully designing the final product workflow until after I had proved that the technology could even provide a solution. More specifically, I wanted to prove a natural language classifier could predict the classifications made by a human researcher – if it wasn’t possible, then my bold claim was invalid and the project was impossible.

My first mistake was to jump in too quickly on this premise and not first explore and understand the end-user workflow and their requirements and challenges (user-centred design). It would turn out that there were other opportunities and challenges lurking which would change the entire approach to the application solution.

For example, it emerged that there were two different possible uses of the application: a) a classifier which could automatically complete the job of the researcher by labelling articles for them (the holy grail), or b) a search assistant which could relieve the tedium of the screening process by helping identify the best articles more quickly. These are very different goals with different user experiences, and they would employ different algorithms.

Another real-world complexity, which arose catastrophically late in the project (the very end), related to the scientific validity of the application – essentially a business requirement rather than a feature. For usage a) above to be allowed in a scientific review, where the machine is effectively performing the review, the software/algorithm/product would have to be extensively validated, peer-reviewed, locked down in function, accepted by overseeing institutions such as Cochrane, and understood by the journal editorial review statisticians who would eventually be judging the validity of the studies using the product. While this cannot be a show-stopper if I want to make such a system, it was a bit of a dead-stop at the beta phase when I tried to get scientists to use it!

During the evaluation I’d explored usage b), which is more of a morale-booster for the human than a replacement: the researcher still has to complete the review themselves, reading every single abstract of perhaps 5,000 articles. And while it starts off better, the tail end of the review becomes ever more boring and barren – as the search engine has (hopefully) pushed all the relevant articles to the front. It would be like soldiering on to “Page 500” of the Google results even though you hadn’t seen anything interesting since page 324. Compare this to the random but even distribution you get without the tool, and perhaps the researcher would prefer not to have the assistant!

So does the product have legs? I still believe so, and so I invested some time white-boarding various workflows with different researchers to better understand their needs. While writing this, I feel I probably could do more of that stakeholder knowledge transfer – in fact you possibly can’t do too much!

Defining the problem

The main reason I am so keen on this product idea is that Systematic Reviews are a seemingly perfect natural scenario for supervised learning in ML. We have a human researcher willing to read and uniformly classify 5,000 nicely structured documents in a database while we watch – what more data could an AI ask for!

The scenario therefore has a specific quirk which I call incremental supervised learning, i.e. the ML has to learn on the fly while the user is operating the system and will, at some point during this process, become knowledgeable enough to help out. It’s possible the system could repeatedly test itself (even do CV folds and parameter tuning!) until it was confident that it understood the topic before butting in, thereby avoiding a “clippy moment”.
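As a rough illustration of that “wait until confident” idea, here is a minimal test-then-train sketch in plain Python. It is not the SYRAS implementation: ToyWordCounter, ConfidenceGate, the window size and the threshold are all hypothetical names and values chosen for the example, and the toy learner simply stands in for whatever real classifier is used. Each article is predicted before its human label is revealed, and the assistant only switches on once its recent hit rate is high enough.

```python
from collections import Counter, deque


class ToyWordCounter:
    """Hypothetical stand-in learner: compares how often an article's words
    were seen in included versus excluded articles."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}

    def learn(self, text, label):
        self.counts[label].update(text.lower().split())

    def predict(self, text):
        words = text.lower().split()
        included = sum(self.counts[True][w] for w in words)
        excluded = sum(self.counts[False][w] for w in words)
        return included > excluded


class ConfidenceGate:
    """Keeps the assistant silent until its recent predictions agree with
    the human often enough (window and threshold are assumptions)."""

    def __init__(self, learner, window=20, threshold=0.8):
        self.learner = learner
        self.recent = deque(maxlen=window)  # rolling record of hits/misses
        self.threshold = threshold

    def observe(self, text, human_label):
        # Test-then-train: predict first, then learn from the human label.
        self.recent.append(self.learner.predict(text) == human_label)
        self.learner.learn(text, human_label)

    def ready(self):
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and sum(self.recent) / len(self.recent) >= self.threshold
```

Plain agreement rate is used here only to keep the sketch short. With a heavily unbalanced corpus, a learner that always says “exclude” would score well on it, so a real gate would probably need a measure that accounts for the rare “include” class.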

Defining the classifier (usage a.), in simple terms of input – output:

Input:

  • Article title – short but variable length
  • Article abstract – longer variable length
  • Metadata including: date, authors, keywords, journal, citation graph
  • Binary Label (training)
  • [Possibly] other hints, reason for choice, tags, “strength” of choice

Output:

  • Binary label: mutually exclusive choice, include/exclude
  • [Possibly] breakdown of probabilities, topics, relations etc.
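To make the shape of the problem concrete, the classifier's input and output could be sketched as two small Python data classes. The class and field names below (Article, Prediction, training_pairs) are my own illustrative choices, not the SYRAS schema, and the citation graph and optional hints are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Article:
    """Classifier input: one article reference (hypothetical field names)."""
    title: str                       # short, variable length
    abstract: str                    # longer, variable length
    authors: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    journal: Optional[str] = None
    year: Optional[int] = None
    label: Optional[bool] = None     # True = include, None = not yet screened


@dataclass
class Prediction:
    """Classifier output: the binary decision plus an optional probability."""
    include: bool
    probability: Optional[float] = None


def training_pairs(articles: List[Article]) -> List[Tuple[str, bool]]:
    """Return (text, label) pairs for the articles a human has already labelled."""
    return [
        (f"{a.title} {a.abstract}", a.label)
        for a in articles
        if a.label is not None
    ]
```

Concatenating title and abstract into one string is just one possible choice. Keeping them as separate fields would allow them to be weighted differently later.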

Ignoring the optional possibilities for the moment, the problem is of natural language binary classification. Knowing this already limits the solution research to a subset of the plethora of techniques and frameworks.

For the search assistant (usage b.), the input is the same as above – if we assume the same workflow while the assistant silently observes and learns.

Output:

  • Threshold of confidence, to suppress/enable the results
  • Ranked list of article references (document IDs), sorted by relevance to the articles so far positively labelled
  • [Possibly] articles similar to individual articles to allow “exploration search”
  • [Possibly] metadata on the search results, why they were presented, topics discovered, confidence etc.
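A minimal sketch of the ranked-list output, assuming a deliberately naive approach: pool the words of every article included so far into one profile, then sort the unscreened articles by cosine similarity to it. The function names and the min_score cut-off are hypothetical, and a real system would use proper NLP pre-processing and weighting rather than raw word counts.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Very crude tokenisation: lowercase and split on whitespace."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_unscreened(included_texts, unscreened, min_score=0.0):
    """Rank unscreened articles against the pooled words of included ones.

    included_texts: texts the researcher has labelled "include" so far.
    unscreened: dict mapping document ID to text.
    min_score: assumed cut-off; results at or below it are suppressed.
    Returns a list of (document ID, score), best first.
    """
    profile = Counter()
    for text in included_texts:
        profile.update(bag_of_words(text))
    scored = [
        (cosine(profile, bag_of_words(text)), doc_id)
        for doc_id, text in unscreened.items()
    ]
    return [
        (doc_id, score)
        for score, doc_id in sorted(scored, reverse=True)
        if score > min_score
    ]
```

Pooling all the included articles into a single profile is the “averaging” complication in its simplest form: if the included articles cover several distinct sub-topics, one averaged profile may represent none of them well.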

This feels less well-defined. While it certainly has some search aspects, it’s not exactly a classic search engine because a) the input isn’t a search query, b) the learning is incremental, not pre-calculated, and c) the document-similarity aspect will be complicated by averaging, clustering and graph-traversal. It’s possible this problem is more closely represented by a sales recommendations engine or social graph, but with the complexities of NLP layered on top. I would therefore exercise caution here and try to simplify the proposal.

Bottom Up vs. Top Down Design? Meet in the middle.

Before designing any solution it’s important to get familiar with the available tools and materials. In the young and evolving ML domain this is even more vital as the various frameworks, like Scikit Learn or Tensor Flow, have quite specific strengths and weaknesses. Not only do they provide different tools, like cross-validation or parameter tuning, but some of them only support certain algorithms or have better versions of those algorithms. Some frameworks are entirely online API-oriented paid cloud services, whereas others are installable software which you host and maintain yourself. This entirely practical choice could dictate the functionality you have at your disposal.

SYRAS was intended to be open-source and “free” for use by any academic institution or individual who could arrange hosting, or self-host, so this effectively ruled out any lock-in to paid cloud APIs, such as the hosted services built around Tensor Flow. While it’s possible to configure the software, the user-experience of having to register for highly technical services like ML APIs would, in my opinion, render the application too hard to get up and running for the average researcher.

This is an area of great concern for an academic FOSS product – deciding on the “stack” will greatly influence the uptake of the application. Can it be hosted easily? Can it be containerised with Docker or desktopised with Electron? Will academic IT departments be able to easily manage and maintain a platform? Will it scale? Would I be able to host a version cost-effectively and charge a nominal fee for usage?

Underneath each framework is one or more programming languages, like Python, R or JavaScript. Deciding on the framework therefore aligns you with a language, which in turn identifies you with a certain community of developers and related tools. The Python language has become very popular with ML and NLP enthusiasts, so there are more tools available in that language. There are more answers and examples on Stack Overflow for Python and more new code repositories popping up.

Early on in the development of SYRAS, I had performed a refresher review of web application technologies – unrelated to the ML aspects. I chose a MEAN-ish stack of ExpressJS and MongoDB but later came to find communicating between JavaScript and Python challenging. First of all they “can’t” simply talk to each other, so you need some form of IPC, IO, shared files, or an adapter like Pyro. Then you find their preferred data structures are very different: JS tools implicitly favour JSON, which is alien to Python libraries such as SciKit Learn, leaving you sourcing or writing adapters and converters. Even with these communication pipelines theoretically in place, it becomes difficult to deploy and scale such intricately intertwined subsystems – especially considering the horsepower needed for some ML tasks.
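To give a flavour of the adapter code involved, here is a small sketch of turning a JSON array of article records (as a JavaScript app might export them) into the parallel lists of texts and labels that Python ML libraries typically expect. The field names title, abstract and label, and the "include"/"exclude" values, are assumptions made for the example rather than the actual SYRAS document format.

```python
import json


def json_to_training_data(payload):
    """Convert a JSON array of article records into (texts, labels).

    Records without a label (not yet screened) are skipped.
    Labels are encoded as 1 for "include" and 0 for anything else.
    """
    records = json.loads(payload)
    texts, labels = [], []
    for record in records:
        if record.get("label") is None:
            continue  # unscreened article: nothing to learn from yet
        title = record.get("title", "")
        abstract = record.get("abstract", "")
        texts.append(f"{title} {abstract}".strip())
        labels.append(1 if record["label"] == "include" else 0)
    return texts, labels
```

The conversion itself is trivial. The cost lies in keeping such adapters in step with the schema on the JavaScript side as both halves of the system evolve.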

I learned that an API-oriented design is therefore important to establish early on. It’s likely you will need to dabble in microservices, or at least a service-oriented architecture (SOA), when developing an ML application, to help “lubricate” your high-level design and give you access to a wider variety of tools now and in the future. This may feel like a burden and slow the design process down, but it should pay off for a serious project. I was keen to get started so I bodged a few things together, but as the system grew and crystallised around these hacks, they became harder to refactor into a more professional solution.

The lesson here is to experiment with a variety of tools, platforms and frameworks in a lightweight manner (Agile “spikes”) before finalising the overall application architecture. You have top-down user requirements and bottom-up capabilities – they have to meet in the middle.

In retrospect I half did this right. I evaluated the following tools and only feel like I over-invested in half of them.

  • Natural JS – this initially seemed promising, but lacked depth. I spent a lot of time embedding its NLP processing capabilities into my initial testing and evaluation framework, but eventually replaced all that with the SciKit Learn equivalents.
  • Tensor Flow – I didn’t spend too much time on TF as it wasn’t compatible with the business model as discussed above, and had fewer NLP tools than SKL.
  • Gensim – I realised too late that this is a good educational tool but probably not intended to be part of a real-world application. I spent too much time evaluating the performance of the algorithms in detail, which I duplicated in SKL.
  • SciKit Learn (SKL) – this has an immense set of tools for both general ML and NLP. While some might say it’s still an educational tool, I found it to be professional enough to be part of a deployable product.
  • Word2Vec – I didn’t look too deeply into this. While it seems relevant to some of the NLP goals of the project it lacks the variety of ML tools SKL provides.
  • NLTK – similar to Word2Vec: powerful NLP tools, but I found the installation and dependencies too complex for a professional product. (Perhaps Docker would mask this.)
  • Various others: I looked at many other smaller packages which, if compatible, could end up in the final product, but it’s difficult to compete with SKL which seems to have almost everything you need, and an extendable pipeline.

API Oriented Design (SOA)

I ended up with a set of APIs (which are still under development) which isolate the strengths and weaknesses of the various frameworks. I kept the ML part separate from the “web app” which became more of a traditional GUI.

  • Web application – serving the user-facing app: Express JS
    • Assistant Plugins – connecting to the backing APIs on various events.
  • Web application REST API – serving data for the app, e.g. user accounts, auth, projects, screening, etc. : Express JS, OpenAPI, MongoDB
  • Document Classification REST API – the prediction service: Python, Flask, SciKit Learn
  • Document Search REST API – the incremental similarity service: Python, Flask, SciKit Learn
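To show how thin the boundary between the web app and the ML side can be, here is a sketch of a prediction endpoint. The real services use Flask. To keep this example dependency-free I have swapped in a bare WSGI callable from the standard library, and predict_include is a hypothetical placeholder for the actual model. The JSON field names are likewise assumptions.

```python
import json
from io import BytesIO


def predict_include(text):
    """Hypothetical placeholder for the real model; returns a made-up score."""
    return 1.0 if "trial" in text.lower() else 0.0


def classify_app(environ, start_response):
    """Minimal WSGI endpoint: POST an article as JSON, get a decision back."""
    if environ.get("REQUEST_METHOD") != "POST":
        start_response("405 Method Not Allowed", [("Content-Type", "application/json")])
        return [b'{"error": "POST required"}']
    length = int(environ.get("CONTENT_LENGTH") or 0)
    article = json.loads(environ["wsgi.input"].read(length) or b"{}")
    text = f"{article.get('title', '')} {article.get('abstract', '')}"
    score = predict_include(text)
    body = json.dumps({"include": score >= 0.5, "score": score}).encode("utf-8")
    start_response(
        "200 OK",
        [("Content-Type", "application/json"), ("Content-Length", str(len(body)))],
    )
    return [body]


def call_locally(app, article):
    """Exercise the endpoint in-process, without starting a server."""
    payload = json.dumps(article).encode("utf-8")
    environ = {
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(payload)),
        "wsgi.input": BytesIO(payload),
    }
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)
```

The same callable can be served for manual testing with the standard library's wsgiref.simple_server.make_server. The point of the sketch is only that the Express side needs to know nothing about Python beyond a URL and a small JSON contract.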


To summarise:

  • Know your user
  • Know your tools
  • Invest lightly but broadly (gestalt!)
  • Plan deployment early

The next article will cover some more detailed tips and tricks “I wish I’d known before starting”, like fixing bugs before spending weeks grid-searching parameters.
