Machine Learning tips I learned making SYRAS Systematic Review Assistant

This is part two of an eight part brain-dump of things I learned developing the initial version of the Machine Learning (ML) aspect of SYRAS the Systematic Review Assistant.

A collection of Machine Learning (ML) tips, gotchas and reminders to future self.

In part one, I emphasised the importance of scoping the problem domain fully, understanding your users and use-cases, identifying potential frameworks early in the design process to understand platform and language-specific dependencies, plus how a modular API-oriented design can help you avoid problems when stitching together various parts of a software product from specific ML libraries.

In this article I will cover a variety of other issues I faced, lessons I learned and tricks I evolved to help develop and evaluate NLP ML algorithms. In no particular order…

  1. Ensure corpus and query processing are identical
  2. Revoking research results due to bugs and changes
  3. Fixing bugs before grid testing
  4. Establish a naming convention for trial variant results
  5. The framework tail wags the solution dog, unfortunately.
  6. Preparing and caching cross-validation data-sets
  7. Watch StdDev and “break open” CV averages with high variance, to find the cause
  8. Actually use the pipeline architecture of SKL and all the tools

Ensure corpus and query processing are identical

Natural Language Processing (NLP) requires a number of pre-processing steps before most Machine Learning (ML) techniques can be used. Lemmatisation unifies word inflections such as plurals, tense, case, number or gender down to a single base form, which can then be statistically counted as the same word for semantic purposes. Tokenisation is a surprisingly complicated job once you consider punctuation, compound words, case-insensitivity and other language specifics. Some ML algorithms also require statistical scaling – normalisation, Gaussian distributions, removing mean offsets from vectors – mathematical housework which sometimes has parameters of its own.
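To make that concrete, here is a minimal sketch of the tokenisation and lemmatisation steps using NLTK – an illustration only, as NLTK is my assumption here rather than a statement of what SYRAS uses. Whatever toolkit you pick, the preprocess() function below is the kind of thing that must be applied identically to the corpus and to every query:

# A minimal sketch of tokenisation + lemmatisation with NLTK (illustrative only).
# Assumes the NLTK "punkt" and "wordnet" resources have already been downloaded.
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lower-case, tokenise, drop punctuation and numbers, then lemmatise each token.
    tokens = nltk.word_tokenize(text.lower())
    return [lemmatizer.lemmatize(tok) for tok in tokens if tok.isalpha()]

print(preprocess("The studies were randomised trials"))
# e.g. ['the', 'study', 'were', 'randomised', 'trial']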

When you use an NLP technique to process a corpus or library of text during training, you must use exactly the same pre-processing pipeline and parameters when preparing the query during testing or in production. Frameworks like SKL make this easier, but if you roll your own, it’s easy to forget.

NLP preprocessing is usually designed to convert human text into something that can be represented in a multi-dimensional vector space, which can then be used like a spatial database to find statistical – and hopefully semantic – patterns with clever maths. That mapping between ASCII words and complex N-dimensional vectors is critical – it’s everything. If it’s not performed the same way every time, your query phrases or documents will be “projected” into the wrong part of the search space.

A concrete example of this in SKL is to keep hold of the vectoriser which you fitted on the corpus and use it for the query transforms.

The preprocessing:

# Fit the vectoriser on the corpus once, and keep a reference to it for later.
self.vectorizer = TfidfVectorizer(... YOUR PARAMETERS ... )
self.X_tfidf = self.vectorizer.fit_transform(self.corpus.data)

The search:

# Re-use the SAME fitted vectoriser to project the query into the corpus space.
doc_x = self.vectorizer.transform([query])
sims = cosine_similarity(doc_x, self.X_tfidf)
rank = list(reversed(numpy.argsort(sims[0])))

This caught me out at one point, which meant I had to invalidate a huge number of test results. Worse, I wasn’t sure how far back this problem had existed, which meant I possibly couldn’t trust any of my previous test calibrations – nightmare!

Revoking research results due to bugs and changes

The previous tip outlined one example of a design flaw leading to bogus results. This can also happen for other reasons: general bugs, deployment mistakes, misconfigurations, incorrect assumptions. As any coder knows, there are endless ways to get it wrong.

While developing a typical application you can fix a bug, set the task tracker to “done” and move on – it’s fixed.

However, when performing long-running benchmarks of an ML algorithm – across combinations of parameters, different corpora or algorithmic variants – a bug could invalidate all previously recorded results. Finding a bug therefore poses a difficult challenge in managing your experimental results and the integrity of any statistics compiled from those data. In the worst-case scenario you would have to discard all previous results, re-benchmark your system and begin the exploration of parameters and variants from scratch. This could be an epic problem if the bug is discovered late in a trial, so you may be tempted to argue that the bug’s effect was limited to a subset of your results and only replicate that part.

One problem I found with this in practice was simply the volume of historical records I had kept over the constant evolution of the code. It was difficult to point to a chart in a 200-page journal and say “this was from version 20.1 of algorithm 12.b with parameters x, y and z” – and therefore know it was immune to the bug, phew!

Over time I did develop detailed “tagging” of the results to ensure traceability (see below), but it wasn’t bullet proof.

Automation of the testing process is the only sure-fire way to achieve repeatable results, and it makes re-running experiments easier too. Of course, it requires more investment up-front to fully automate any testing process, but if you are serious and want to be professional then it’s going to pay off. Automation must include code version management, build, configuration and deployment (whatever that means specifically for your system) to ensure the application tested is the same piece of software when it runs again in the future. It must also include the train-and-test cycle: data presentation, results gathering, parametric sweeps, cross-validation controls and metrics, to ensure the experiment performed on the software is also the same when run again in the future.
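As an illustration of what stamping each run might look like, here is a hypothetical sketch – the helpers current_commit() and record_run() are my own naming for this example, not SYRAS code – that records the git commit and configuration alongside the metrics of every benchmark run:

# Hypothetical sketch: tag every benchmark run with the code version and
# configuration that produced it, one JSON line per run.
import json
import subprocess
import time

def current_commit():
    # Assumes the experiment is launched from inside a git working copy.
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

def record_run(config, metrics, out_path="results.jsonl"):
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "commit": current_commit(),
        "config": config,    # algorithm, parameters, corpus, CV folds, etc.
        "metrics": metrics,  # whatever the run produced
    }
    with open(out_path, "a") as fh:
        fh.write(json.dumps(record) + "\n")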

This can be quite a challenge, but it frees you from the fear of trashing hours or days of manually gathered results – and from the temptation to optimistically ignore the consequences of bugs.

Tools such as Excessiv have been designed for this purpose. In retrospect I left it a bit late, thinking “yeah, I’ll integrate with that later”, and ended up manually developing all the features of such a product just to get to the end of my initial evaluation. Guess what: when you build a custom test framework, you also write bugs which can invalidate your results. So while building everything from the ground up was a great experience, and I now feel I deeply understand the process, I would certainly advise future-me to buy something like Excessiv and automate my tests from day one.

Fixing bugs before grid testing

You will spend a lot of time and effort exploring the various dimensions of your search space and proposed solution. There are tools to help you do this, as mentioned, but it’s still a lot of work and you will generate a lot of resulting data. I sometimes went too far and wide before realising there was a basic bug in my algorithm (or supporting code), which meant its destructive effect had been multiplied across all of those results.

I have since found a few margin notes in my journal saying things like “hmm, this doesn’t smell right – investigate fix this before continuing…”.

A margin note is NOT enough! Stop right there and address that spidey-sense. Do not continue until the statistical smell has been identified. Do not run 4 x 10 x 100 x 250 tests – that’s a million results to have to clean up later.

At the risk of self-shaming, this was a retrospectively humorous entry from my dev journal. I record history this way, so I am not doomed to repeat it.

2017-11-11 Re-working LSA bugs

After writing the above, I tried one last pass on kNN – 1NN. This resulted in some suspiciously high true-positive results (like 100%). I realised I had never got around to removing the test cases from the corpus (after initial trials getting the Python-JS API working). A quick analysis showed the system was indeed picking the test case from the corpus most of the time. While looking into it I also discovered more bugs, which has led me to return to basics and re-work the Python before exploring the parameters again.

I’ve wasted a huge amount of time running these trials on buggy code! The lesson learned is it’s vital to get the underlying system perfect before running hundreds of trials.

1) Forgetting to remove the test cases from the corpus.
2) “maybe” label conversion hack was wrong in kNN – needs addressing in the corpus.
3) Unnecessarily re-calculating the similarity matrix, led to the code running 100x slower.
4) 1NN repeat results are strange, later repeats find the same doc from the query doc, but earlier queries don’t. It seems the model is affected by the queries, or the case shuffling is wrong.
5) Tokenisation of the query is different from tokenisation of the initial corpus.
6) Lemmatisation was not implemented.

Establish a naming convention for trial variant results

I described above one reason why you’ll want to be able to correlate your results to your code. While the exact way you do this will probably depend greatly on your methods, framework, platform and so on, I recommend you put as much thought into this as early as possible.

The kind of variations you will want to record might be:

  • Code revision/commit hash
  • Your algorithm name/identifier
  • Preprocessing options (lemma, stopwords, tokeniser)
  • Number of training/eval/test cases
  • Corpus name, size, subset, prep
  • Number of Cross Validation folds
  • Balancing
  • Evaluation metrics (i.e. what you are studying or optimising)

The list could go on, but in short it’s anything you are varying or will vary in future.

The latter is what tripped me up. I would establish a naming convention, but then later discover I wanted to vary a new option I didn’t previously know about, like scaling, or balancing of positive and negative training cases. This meant all previous tests had implicitly assumed “no scaling”, or possibly a default scaling had been applied. From this point on I would add a scaling tag to the results, but to be able to compare previous results with future ones, I’d need to retrofit the assumed parameter to all previous results.

You may never achieve this perfectly, but some hindsight here will help you prepare.

e.g. I shortened this to Alg5.1-swE-4000trc_1cv_u for quick visual scanning, but it would be preceded by a more formal table explaining what each part means.
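For illustration, a tag like that could be generated automatically from the run parameters. The following is a hypothetical sketch – the field names and abbreviations are made up for the example, not the exact SYRAS scheme:

# Hypothetical sketch: build a short result tag from the experiment parameters.
def result_tag(params):
    parts = [
        params["algorithm"],              # e.g. "Alg5.1"
        "sw" + params["stopwords"],       # which stopword list was used
        f"{params['train_cases']}trc",    # number of training cases
        f"{params['cv_folds']}cv",        # cross-validation folds
        params["balance"],                # e.g. "u" for unbalanced
    ]
    return "-".join(parts)

print(result_tag({
    "algorithm": "Alg5.1", "stopwords": "E",
    "train_cases": 4000, "cv_folds": 1, "balance": "u",
}))
# -> Alg5.1-swE-4000trc-1cv-u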

The framework tail wags the solution dog, unfortunately.

As a professional software engineer, I prefer to design abstract algorithms and systems and then go looking for lower-level solutions which can provide the components to support and implement the design. Many of the systems I’ve designed and built have outlived their implementation specifics – i.e. they’ve been migrated from framework to framework, or even ported across languages. I’ve always believed the value, or IPR, is in the algorithm, not the implementation.

Unfortunately, in practical ML today (i.e. in its youth), the platforms and frameworks are inconsistent in their coverage of the various utilities you need. There is not much competition between them, because they all do different things well. They are also very “big bricks” and will necessarily implement very large, high-level portions of your system design.

So your system design will be enormously influenced by the capabilities of the framework you adopt. You would have trouble seamlessly porting a finished application to another system – especially if you want to keep the same performance results.

I think at this time, this is just how it is. So choose your framework carefully and trial it before finalising your design – or even committing to fulfilling your stakeholder requirements.

Preparing and caching cross-validation data-sets

Once you’ve fully automated your testing you may find yourself generating huge amounts of data, both input and output. One of the explorations I did was on the effect of balancing training and test cases – some algorithms are highly sensitive to the ratios of positive and negative cases and require a 50/50 balance. In real life, I know the systematic review dataset may be highly unbalanced – e.g. a 5–10% positive rate. This led me to take an existing pre-labelled corpus of 5,000 documents and split it into various schemes from 10%/90% up to 50%/50%, in several increments.
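As a sketch of that kind of subsampling – an assumed implementation for illustration, not the SYRAS code – rebalancing a labelled corpus down to a target positive rate might look like this:

# Minimal sketch: subsample a labelled corpus to a target positive rate,
# e.g. to compare a 10%/90% training set against a 50%/50% one.
import numpy as np

def rebalance(labels, positive_rate, seed=0):
    """Return indices of a subset whose positive fraction is ~positive_rate."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    # Keep all positives, then take just enough negatives to hit the ratio.
    n_neg = int(len(pos) * (1 - positive_rate) / positive_rate)
    idx = np.concatenate([pos, rng.choice(neg, size=min(n_neg, len(neg)), replace=False)])
    rng.shuffle(idx)
    return idx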

Combining this with Cross-Validation (CV) folds of tens or higher, I ended up with hundreds of thousands or millions of files quite quickly. Even on an SSD these take time to generate.

So I implemented a train/test case caching system with a naming convention: it detected whether a given combination already existed and, if not, generated it. This allowed me to run any test variant with its prerequisites automatically prepared, and it also meant re-runs of tests ran very much quicker.

While you do have to be cautious when caching randomised data, if you are sure you are not defeating the randomisation, caching these datasets speeds up testing by many orders of magnitude, facilitating quicker, deeper testing.
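A hypothetical sketch of the caching idea: derive a deterministic file name from the split parameters and only generate the dataset if it isn’t already on disk. The parameter names and on-disk format here are illustrative:

# Hypothetical sketch: cache generated train/test splits keyed by their parameters.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("cv_cache")

def cached_split(params, generate):
    # params: dict describing corpus, balance scheme, fold, random seed, etc.
    # generate: callable that builds and returns the dataset if not cached.
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()
    path = CACHE_DIR / (key + ".json")
    if path.exists():
        return json.loads(path.read_text())
    dataset = generate(params)
    path.write_text(json.dumps(dataset))
    return dataset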

Watch StdDev and “break open” CV averages with high variance, to find the cause

When you utilise cross-validation, you run the same experiment many times on random subsets of cases and average the results to get a more reliable figure, avoiding any quirks of a particular split. I can’t recommend this highly enough – I wasted time early on agonising over tiny optimisation details only to find they were ghost artefacts of specific test or training cases.

However, averages lose data. The average of the sequence 2,2,2,2 is 2, but so is the average of 3,1,4,0. These individual results are clearly very different, so it’s important to keep an eye on the variance or standard deviation of your cross-validation averaged results.

In the example above, 2,2,2,2 represents an algorithm which is very stable in its output over various test cases – presumably a good thing. The second sequence, however, represents wildly varying results which are never actually “right”. So if you ignored the variance of these averages and only saw the CV output, you might not realise how unreliable the second algorithm is.
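In SKL this is easy to check, because cross_val_score returns the per-fold scores rather than just the average. A minimal sketch – the model and the synthetic data are placeholders, not the SYRAS setup:

# Minimal sketch: report per-fold scores alongside the CV mean and std dev.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)

print("per-fold:", np.round(scores, 3))   # "break open" the average
print(f"mean={scores.mean():.3f}  std={scores.std():.3f}")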

I made sure the standard deviation was always plotted visually on my output charts. I had written my own reports, so the mistake of omitting “error bars” or the like would have been entirely my own – a more professional reporting tool may have included them by default.

If I saw a high variance, I would “break open” the CV – run it so I could see the individual results and play around to find the pattern behind why it was so lumpy. Sometimes I found my algorithm had two behaviours, sometimes good and sometimes bad. Sometimes it would get stuck on something: a particular query, or a local minimum.

An unstable algorithm won’t be a good user experience. Even if it’s working on average, your users are individual cases and will expect consistency. So while CV is a good tool, it can mask some real-world performance requirements.

Actually use the pipeline architecture of SKL and all the tools

Finally: stand on the shoulders of giants. The amazing people behind SciKit Learn (and other tools) have put a huge amount of effort and thinking into their frameworks.

Perhaps I’m weird in wanting to build it all myself and most people will simply use their methodology anyway. After a lot of time (wasted?) constructing my own processing pipeline I came to understand how elegant the SKL approach is, utilising some real Python magic in handling the multi-dimensional datasets so fluidly.

RTFM for more information here: https://scikit-learn.org/stable/modules/compose.html#pipeline
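For example, a minimal pipeline chains the vectoriser and a classifier into one object, so the query-time transform can never drift away from the training-time one. The classifier and the toy data below are placeholders of my own choosing:

# Minimal sketch of an SKL Pipeline: preprocessing and model fitted as one unit.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

train_texts = ["randomised controlled trial of drug A",
               "a qualitative interview study",
               "double blind placebo controlled study",
               "an opinion piece on healthcare policy"]
train_labels = [1, 0, 1, 0]   # 1 = include in the review, 0 = exclude

pipe.fit(train_texts, train_labels)                          # fits the vectoriser AND the classifier
print(pipe.predict(["placebo controlled trial of drug B"]))  # the SAME vectoriser is applied to the query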

If I were to start again, I would design my components around their pipeline interface, so they could be plugged straight into this. This is an example of the tail-wagging issue from earlier, and of my reluctance to get into bed with said dog.

At the end of the day, it depends what you want to be doing: coding deeply and endlessly or training an ML algorithm fast and effectively even if you don’t fully understand all the details. I feel that ML has got to the level where it’s pointless to even try to understand the entire architecture, so you have to let go, trust the framework and let the results speak for themselves.


In the next article (3 of 8), I’ll dig into the Natural Language Processing (NLP)-specific aspects of the Machine Learning algorithms I developed and evaluated.
