Natural Language Processing (NLP) Aspects of Machine Learning

This is part three of an eight-part brain-dump of things I learned developing the initial version of the Machine Learning (ML) aspect of SYRAS, the Systematic Review Assistant.

One of the things which seems obvious in retrospect, but which hampered my initial research into ML tools, techniques and frameworks, was the distinction between general ML and natural-language-specific processing.

There are many mathematical techniques available which can be used to represent and solve problems but some problem spaces are harder to squeeze into a statistical model. Core algorithms like Bayes, LSI, SVM or regressions and broader techniques like clustering, classification and search are mostly applied directly to numbers. Turning words and sentences from various languages and different styles of writing or speech into numbers adds a great deal of complexity to the system.

There are some frameworks out there which are NLP-specific from end to end. These aim to provide a one-stop shop for language-oriented applications. The alternative is to combine NLP pre-processing and pipe the statistical representations produced into more generic ML algorithms. Using an NLP framework should be easier, but may have limited depth of features or tweak-ability. The latter DIY approach will give you more options but would be more difficult for a beginner.

The situation is the same for most problem spaces, as it would be rare for raw data to be perfectly ready for ingestion by an ML pipeline. For example, computer vision applications such as facial recognition, image similarity search or self-driving cars require a great deal of number crunching to process raw images or video frames into a representation suitable for the ML tools. Audio processing, financial prediction, weather modelling or chemical modelling would all have their own quirks, common data formats and pre-processing tricks.

When you are researching techniques or tools, the trick is to understand which domain the tool chain was originally intended and developed for and whether that matches your problem domain, or whether there are components which are generic enough and well designed to be used independently. It’s quite possible there might be a great classification algorithm buried in a vision tool which would work for text – given the right transformations. But usually for NLP it’s much easier to find language-oriented tools.

Some of the concepts you will need to be familiar with, to transform text into statistics for general ML are outlined below.

  • Vectors, word embeddings
  • NGrams
  • Bigrams, compound words (e.g. self-concept)
  • TF-IDF Preprocessing
  • Lemmatisation (general case-reduction)
  • Stop Words
  • Metadata: recovering the original text (after working on tokens/data)

Vectors, word embeddings, dimensions

Before getting into language quirks, it’s important you understand the concept of what has more recently become known as “word embedding”, but has been known for decades variously as vectorisation, multi-dimensional semantic cluster analysis, N-dimensional maps and the like. I studied it at university in the 1990s and was fascinated with the idea of plotting all the words in the world in a vast N-dimensional matrix according to some impossibly contrived semantic properties and hoping some sense would arise from their locations.

I always preferred the word “co-ordinate” to the now-preferred “vector”; it just makes more sense to me to imagine a word “at” position [3, 67, 23, 300] which is “near” to [4, 64, 21, 290], like in a giant galactic map. I think the difference is that a vector is more of a general direction and distance, whereas a co-ordinate is always measured from the origin point of 0,0.

Anyway, the process of vectorising your language (corpus, document or search query) is required because most later ML stages operate on multi-dimensional representations. Usually this involves analysing word frequencies in relation to one another and using some tried-and-tested maths techniques (e.g. Bayesian) to extract some meaning out of the original text. These frequencies or relationships are the numbers the ML algorithms can “understand” – but they are nowhere near operating on the actual text or the language constructs you would hold in your mind, like subject/object or even noun phrases.

It feels to me as if most NLP/ML tools and techniques have settled on this small subset of text processing and seem to have almost forgotten that they aren’t even trying to parse the text, e.g. via grammars to form lexical sentence structures, let alone trying to attach semantics. I can only assume it was all too hard and intricate and the brute-force statistical giants won on average – and that was good enough to get an edge in business.

I went along happily with many of the currently “standard” NLP techniques for a while, until, after fully understanding the maths underneath, I realised how limited the approach is compared to the AI research I studied in the 1990s. For example, the vectorisation we were exploring was one where the dimensions were semantic concepts like “activity”, “size” or “reality”. Combining this with grammatical sentence parsing into tree structures could allow you to explore the relationships between words or phrases in a sentence, or between different sentences, in terms of semantic similarities and differences. It gets very complex and taxing very quickly, but the aims are much higher than just counting occurrences of words in a flat corpus. An example is the Bag of Words (BOW) concept common in some modern approaches, which completely destroys all grammatical structure into a soup of words and then tries to analyse them. It seems crazy to purposely lose that amount of information, but apparently it’s easier to do the statistics with huge CPUs than to try to construct an intricate model of “real” linguistic-semantic modelling.

Perhaps there are alternate vectorisation techniques out there, but the only one I came across over and over again was quite simply counting frequencies of words in documents, or entire corpora – to give highly averaged topical guesswork. Perhaps now we have the computing horsepower, it’s time to explore other “embedding” techniques.

TF-IDF Preprocessing

To put a name to a face, the de-facto vectorisation technique I was referring to above is TF-IDF. This is the main tool for turning lots of words into lots of numbers. I won’t explain it from scratch here as other people have already done that better than I could.

As I’ve mentioned, it’s an amazingly simple but effective technique for extracting topical statistics from any natural language (it could actually be applied to almost any dataset). The TF simply counts the cards, and the IDF places the bets – together they determine the keywords probably relevant to each document in a corpus.
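
For the curious, here is roughly what that looks like in practice – a minimal sketch using SciKit Learn’s TfidfVectorizer on a few invented toy documents, printing the highest-weighted terms per document:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy

# Toy corpus, invented purely for illustration.
corpus = [
    "self-concept in children and adolescents",
    "a study of reading ability in children",
    "adult self-concept and general wellbeing",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)              # documents x vocabulary, sparse
terms = numpy.array(vectorizer.get_feature_names_out())

# The highest-weighted terms are the "probably relevant" keywords per document.
for i, text in enumerate(corpus):
    weights = X[i].toarray().ravel()
    print(text, "->", list(terms[weights.argsort()[::-1][:3]]))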

Because of the workflow of a Systematic Review I wanted to be able to do what I called incremental processing – i.e. to be able to add a new unlabelled document into the corpus and fold it into an existing set of statistics (vectorise it). I found this didn’t work very well in practice, so I had to rework the application to pre-process the entire corpus in advance, which is the much more common approach. The incremental approach was so uncommon that I had trouble finding references on whether the maths could even handle it.

I was also tripped up by certain libraries (like the JS Natural library) including stop-word processing and tokenisation quirks in the TF-IDF stage. I hadn’t expected it to be doing anything other than TF-IDF. This is an example of an NLP all-rounder trying to be helpful, but when you are doing a very thorough study you want control over all the aspects.

TF-IDF also introduces hyper-parameters in the form of cut-off thresholds which can be difficult to anticipate as they are not always “normalised” figures, so the optimal thresholds could vary between datasets.

Scaling and normalisation are another consequence of TF-IDF which will affect your results. I found the scaling options in SKL’s TF-IDF affected the performance of the Support Vector Machine (SVM) algorithm in interesting ways. I also wasted time experimenting with normalisation stages which were really having no further effect! This is one of the risks of the DIY approach – there are so many combinations that the documentation of each stage cannot possibly explain whether your mathematical pipeline is valid. For example, LSA requires a normalised distribution, which TF-IDF provides, but as LSA can be used on other datasets, the LSA docs probably won’t discuss TF-IDF (unless in a more closed framework).
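
As a rough sketch of where those knobs live in SKL (the parameter values are purely illustrative, not recommendations, and train_texts/train_labels/test_texts stand in for your own corpus split), the thresholds and scaling options all sit on the vectoriser, and whatever you pick is then baked into the space the SVM sees:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

vectorizer = TfidfVectorizer(
    min_df=2,           # cut-off: ignore terms appearing in fewer than 2 documents
    max_df=0.8,         # cut-off: ignore terms appearing in more than 80% of documents
    norm="l2",          # scale each document vector to unit length
    sublinear_tf=True,  # use 1 + log(tf) instead of raw term counts
)
X_train = vectorizer.fit_transform(train_texts)   # train_texts/train_labels: your own data
classifier = SVC(kernel="linear").fit(X_train, train_labels)

# The test/query side must reuse the same fitted vectoriser (more on this in part two).
predictions = classifier.predict(vectorizer.transform(test_texts))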

NGrams

The concept of n-grams is common and appears to be taken for granted in most NLP libraries. They are just words, or pairs, triplets, quadruplets etc. depending on the value of N. I see them as a colossal fudge – following on from my disappointment above at the loss of complex parsing. Instead of parsing with grammars, we now just take nearby pairs of words and analyse their occurrence frequencies, in a “bag of ngrams” presumably. You could in theory combine a 1-gram, 2-gram and 3-gram representation in an ensemble if you have the horsepower.
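
In SKL terms that combination is just the ngram_range option on the vectoriser – a quick sketch (the sentence is invented):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat"]

# ngram_range=(1, 3) keeps unigrams, bigrams and trigrams in one bag,
# e.g. "cat", "cat sat" and "cat sat on" all become features.
vectorizer = CountVectorizer(ngram_range=(1, 3))
vectorizer.fit(docs)
print(sorted(vectorizer.get_feature_names_out()))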

While I concede again that the brute-force statistical approach has “won”, I can’t help visualising how wrong a bigram is when it happens to span two relatively distant grammatical branches.

There are also “skip grams” which I think represent a middle ground between grammatical parsing and statistical analysis. They allow a representation of the relationships between all combinations of nearby words which theoretically could glean the grammar from the text to some degree. Perhaps in modern informal text and speech this will be as accurate as you could ever get.

Bigrams for real compound words (e.g. self-concept)

Related to the above concept of n-grams (which are a purely statistical trick), is the fact that in reality most languages have some type of compound word structure. This can pose problems in NLP.

I fortunately hit this quite early as my initial corpus was about “self-concept” in children, so I was able to explore and test it.

If your tokeniser is set to strip out punctuation, then you need to be aware that “self-concept” will turn into two words, “self concept”, which will then be processed independently. The two words on their own are not necessarily of interest; our study was very specifically about self-concept. Lemmatisation (below) would blur this even more, as “selfish” might turn into “self” and “conceptual” into “concept”, and we really weren’t interested in those words.

Will the above statistical n-gram analysis recover this loss and discover a correlation between these two words? Does it even understand the difference between “self concept” and “concept self” in n-gram terms? The latter would have come from a very different source.

Of course, these days the hyphen is a slowly dying feature of English (one of my favourite complaints). So perhaps again we need to turn to the magic of statistics.
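
If you do care about keeping compound words intact, one workaround in SKL is to override the default token_pattern so the hyphen counts as part of a word – the regex below is just one possible choice, a sketch rather than a recommendation:

from sklearn.feature_extraction.text import CountVectorizer

text = ["self-concept development in children"]

default = CountVectorizer()   # default token pattern splits on the hyphen
hyphen_aware = CountVectorizer(token_pattern=r"(?u)\b\w[\w-]*\w\b")

print(default.fit(text).get_feature_names_out())       # 'self' and 'concept' appear as separate tokens
print(hyphen_aware.fit(text).get_feature_names_out())  # 'self-concept' is kept as one token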

Lemmatisation and stemming (grouping inflections)

The need for grouping word inflections is another necessary evil which should be used with care and evaluated for how it is helping your application – depending on the language you are processing. Most languages have some level of word variation, such as plurals, genders and tense, but some have other forms leading to large numbers of inflections.

This is important to vectorisation and further topic analysis as it will greatly affect (reduce) the number of dimensions or topics produced. When you are doing what I call brute-force statistics, before feeding into ML vector-based algorithms, you will certainly want to group inflected words into a base form before calculating the frequencies or relationships, or else the figures will be greatly diluted and the representation fragmented.

If you are going to use the “topics” later and re-present them to the end user (discussed below), then you should favour lemmatisation over stemming, as it will produce real “keywords” the user will understand, rather than made-up stems like “frequen”.

I can only imagine how complicated and full of fudge-factors, lookups and heuristics lemmatisation algorithms must be. I really wouldn’t care to write one myself, but sourcing a good one is vital.
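
NLTK, for example, ships both stemmers and a WordNet-based lemmatiser, so a quick side-by-side comparison is only a few lines – the outputs noted below are typical, not guaranteed for every word:

# NLTK is one readily available source (requires the WordNet data:
# import nltk; nltk.download("wordnet")).
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "ponies", "children"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))

# Stems like "studi" and "poni" are fine for internal statistics, but the
# lemmas "study", "pony" and "child" are presentable back to an end user.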

Stop Words

Another relatively contentious issue (to me) is the concept of discarding “boring words we don’t care about”. Annoyingly the concept works very well, but I always feel the maths should take care of this implicitly and we shouldn’t have another set of fudge-factors, language-specific exceptions and effectively a hyper-parameter which needs tuning.

My unanswered question is: why doesn’t TF-IDF intrinsically get rid of “a” and “the”? Seeing as they appear in all texts, they should be weighted to the bottom and optionally thresholded out. But in practice, adding the clumsy “stop word processing” option usually improved your results, so you tick the box and move on.
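
You can see both halves of this in SKL: the IDF does push ubiquitous words to the bottom, but never to zero, whereas the stop_words tick-box removes them outright. A small sketch with invented documents:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the mat was red",
]

vec = TfidfVectorizer().fit(docs)
idf = dict(zip(vec.get_feature_names_out(), vec.idf_))
print(idf["the"], idf["dog"])        # "the" gets the minimum IDF weight, but not zero

vec_sw = TfidfVectorizer(stop_words="english").fit(docs)
print("the" in vec_sw.vocabulary_)   # False - removed before any counting happens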

Metadata, for recovering the original text

After converting your text into data via a pipeline of tokenising, lemmatising, summarising, vectorising, componentising and multi-dimensional embedding your ML algorithm eventually comes up with some insights and perhaps classifies something or finds some related topic or document. But how do you reverse this entire process to get back to the original language with which you are meant to present the output to the user?

I found in many of the frameworks this step was difficult to achieve. I think it shows how academic some of the tools have become and how far they are from being real world “product components”. After all the whole job of an ML application is to deliver meaningful results back to a human (at some point), it seems almost humorous that this is an afterthought. I can imagine several product managers I know chuckling at “how developers are” in this respect – when you present a 50-dimensional vector back to them as their search results, even though underneath the system worked perfectly and had in fact perfectly identified their favourite kind of cat video from billions of training cases.

As far as I could tell, it’s mostly up to you to preserve a certain thread of metadata throughout the transformation process to be able to “tag” the data back to the source documents, words, topics or whatever level you will be correlating with.

While looking for answers, I found more questions, so I wrote a specific case of how I did this in Gensim on Stack Overflow: https://stackoverflow.com/a/46659931/209288.

Later I did a similar thing in SciKit Learn – I built an API around the classifier/search which took document UUIDs and tracked them through the analysis, so that “neighbour” documents could be retrieved by UUID. Then later on in the GUI/application layer, the actual text, title and abstract of the document, and whatever other screening labels or comments had been added, could be retrieved and presented to the user.

I recommend implementing your own UUID, and keeping maps to any of the internal tool’s own IDs, should they provide them. Don’t expose or rely on implementation specific IDs else you may not be able to swap out components later, and it could present security risks.
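
As a minimal sketch of that idea (names invented, and assuming the documents come from your own store with a UUID already attached): keep your UUID list in the same order as the rows of the matrix, and translate neighbour indices back to UUIDs at the very end.

import numpy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# documents is assumed to be a list of dicts like {"uuid": "...", "text": "..."}.
uuids = [d["uuid"] for d in documents]   # row i of X corresponds to uuids[i]
texts = [d["text"] for d in documents]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

def similar_documents(query_uuid, top_n=5):
    """Return the UUIDs of the documents most similar to query_uuid."""
    i = uuids.index(query_uuid)
    sims = cosine_similarity(X[i], X).ravel()
    ranked = numpy.argsort(sims)[::-1]
    return [uuids[j] for j in ranked if j != i][:top_n]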


Those are some of the NLP aspects I had to pay attention to, which I eventually separated in my mind from understanding the specifics of the generalised ML algorithms I also evaluated – which are coming up in the next part of this series.


Machine Learning tips I learned making SYRAS Systematic Review Assistant

This is part two of an eight-part brain-dump of things I learned developing the initial version of the Machine Learning (ML) aspect of SYRAS, the Systematic Review Assistant.

A collection of Machine Learning (ML) tips, gotchas and reminders to future self.

In part one, I emphasised the importance of scoping the problem domain fully, understanding your users and use-cases, identifying potential frameworks early in the design process to understand platform and language-specific dependencies, plus how a modular API-oriented design can help you avoid problems when stitching together various parts of a software product from specific ML libraries.

In this article I will cover a variety of other issues I faced, lessons I learned and tricks I evolved to help develop and evaluate NLP ML algorithms. In no particular order…

  1. Ensure corpus and query processing are identical
  2. Revoking research results due to bugs and changes
  3. Fixing bugs before grid testing
  4. Establish a naming convention of trial variant results
  5. The framework tail wags the solution dog, unfortunately.
  6. Preparing and caching cross-validation data-sets
  7. Watch StdDev and “break open” CV averages with high variance, to find the cause
  8. Actually use the pipeline architecture of SKL and all the tools

Ensure corpus and query processing are identical

Natural Language Processing (NLP) requires a number of pre-processing steps before most Machine Learning (ML) techniques can be used. Lemmatisation is an option to unify word inflections such as plurals, tense, case, number or gender down to a single stem, which can then be statistically counted as the same word for semantic purposes. Tokenisation is a surprisingly complicated job considering punctuation, compound words, case-insensitivity and other language specifics. Some ML algorithms also require statistical scaling, such as normalisation, fitting to Gaussian distributions or removing mean offsets from vectors – mathematical housework which sometimes has parameters of its own.

When you use an NLP technique to process a corpus or library of text during training, you must use exactly the same pre-processing pipeline and parameters when preparing the query during testing or in production. Frameworks like SKL make this easier, but if you roll your own, it’s easy to forget.

NLP preprocessing usually is designed to convert human text into something that can be represented as a multi-dimensional vector space, which can then be used like a spatial database to find statistical – and hopefully semantic – anomalies with clever maths. So that mapping between ASCII words and complex N-dimensional vectors is critical – it’s everything. If it’s not performed the same every time, then your query phrases or documents will be “projected” into the wrong part of the search space.

A concrete example of this in SKL is to keep hold of the vectoriser which you prepared with the corpus and use it for the query transforms.

The preprocessing:

self.vectorizer = TfidfVectorizer(... YOUR PARAMETERS ... )
self.X_tfidf = self.vectorizer.fit_transform(self.corpus.data)

The search:

doc_x = self.vectorizer.transform([query])   # re-use the fitted vectoriser; transform once, never re-fit
sims = cosine_similarity(doc_x, self.X_tfidf)
rank = list(reversed(numpy.argsort(sims[0])))

This caught me out at one point, which meant I had to invalidate a huge number of test results. Worse, I wasn’t sure how far back this problem had existed, which meant I possibly couldn’t trust any of my previous test calibrations – nightmare!

Revoking research results due to bugs and changes

The previous tip outlined one example of a design flaw leading to bogus results. This can also happen due to other reasons: general bugs, deployment mistakes, misconfigurations, incorrect assumptions. As any coder knows, there are endless ways to get it wrong.

While developing a typical application you can fix a bug, set the task tracker to “done” and move on – it’s fixed.

However, when performing long-running benchmarks of an ML algorithm across combinations of parameters, different corpora or algorithmic variants, a bug could invalidate all previously recorded results. Finding a bug therefore poses a difficult challenge in managing your experimental results and the integrity of any statistics which were compiled from those data. In the worst-case scenario you would have to discard all previous results, re-benchmark your system and begin the exploration of parameters and variants from scratch. This could be an epic problem if the bug is discovered late in a trial, so you may be tempted to see if the bug’s effect was limited to a subset of your results and only replicate that part.

One problem I found with this in practice was simply the number of historical records I had kept over the constant evolution of the code. It was difficult to point to a chart in a 200-page journal and say “this was from version 20.1 of algorithm 12.b with parameters x, y and z – therefore it was immune to the bug, phew!”

Over time I did develop detailed “tagging” of the results to ensure this traceability (see below), but it wasn’t bullet-proof.

Automation of the testing process is the only sure-fire way to achieve repeatable results, and it also makes them easier to gather. Of course, it requires more investment up-front to fully automate any testing process, but if you are serious and want to be professional then it’s going to pay off. Automation must include code version management, build, configuration and deployment (whatever that means specifically to your system) to ensure the application tested is the same piece of software when it runs again in the future. Automation must also include the train and test cycle: data presentation, results gathering, parametric sweeps, cross-validation controls and metrics, to ensure the experiment performed on the software is also the same when run again in the future.

This can be quite a challenge but it means you will be freed of the fear of the pain of trashing hours or days of manually gathered results and thus the temptation of optimistically ignoring the consequences of bugs.

Tools such as Excessiv have been designed for this purpose. In retrospect I left it a bit late, thinking “Yeah, I’ll integrate with that later” but I needed to manually develop all the features of this product just to get to the end of my initial evaluation. Guess what – when you build a custom test framework you also make bugs which can invalidate your results. So while building everything from the ground up was a great experience and now I feel I deeply understand the process, I would certainly advise future-me to buy something like Excessiv to automate my tests from day 1.

Fixing bugs before grid testing

You will spend a lot of time and effort exploring the various dimensions of your search space and proposed solution. There are tools to help you do this, as mentioned, but it’s still a lot of work and you will generate a lot of resulting data. I sometimes went too far and wide before realising there was a basic bug in my algorithm (or supporting code), which meant my bug had been multiplied in its destructive effect.

I have since found a few margin notes in my journal saying things like “hmm, this doesn’t smell right – investigate fix this before continuing…”.

A margin note is NOT enough! Stop right there and address that spidey-sense. Do not continue until the statistical smell has been identified. Do not run 4 x 10 x 100 x 250 tests – that’s a million results to have to clean up later.

At the risk of self-shaming, this was a retrospectively humorous entry from my dev journal. I record history this way, so I am not doomed to repeat it.

2017-11-11 Re-working LSA bugs

After writing the above, I tried one last pass on kNN – 1NN. This resulted in some suspiciously high true-positive results (like 100%). I realised I had never got around to removing the test cases from the corpus (after initial trials getting the Python-JS API working). A quick analysis showed the system was indeed picking the test case from the corpus most of the time. While looking into it I also discovered more bugs, which led me to return to basics and re-work the Python before exploring the parameters again.

I’ve wasted a huge amount of time running these trials on buggy code! The lesson learned is it’s vital to get the underlying system perfect before running hundreds of trials.

1) Forgetting to remove the test cases from the corpus.
2) “maybe” label conversion hack was wrong in kNN – needs addressing in the corpus.
3) Unnecessarily re-calculating the similarity matrix, led to the code running 100x slower.
4) 1NN repeat results are strange: later repeats find the same doc from the query doc, but earlier queries don’t. It seems the model is affected by the queries, or the case shuffling is wrong.
5) Tokenisation of query is different from tokenisation of initial corpus.
6) Lemmatisation was not implemented.

Establish a naming convention of trial variant results

I described above one reason why you’ll want to be able to correlate your results to your code. While the exact way you do this will probably depend greatly on your methods, framework, platform and so on, I recommend you put as much thought into this as early as possible.

The kind of variations you will want to record might be:

  • Code revision/commit hash
  • Your algorithm name/identifier
  • Preprocessing options (lemma, stopwords, tokeniser)
  • Number of training/eval/test cases
  • Corpus name, size, subset, prep
  • Number of Cross Validation folds
  • Balancing
  • Evaluation metrics (i.e. what you are studying or optimising)

The list could go on, but in short it’s anything you are varying or will vary in future.

The latter is what tripped me up. I would establish a naming convention, but then later discover I wanted to vary a new option I didn’t previously know about, like scaling or the balancing of positive and negative training cases. This meant all previous tests had assumed “no scaling”, or possibly that a default scaling was being applied. From this point on I would add a scaling tag to the results, but to be able to compare previous results to future ones, I’d need to retrofit the assumed parameter to all previous results.

You may never achieve this perfectly, but some hindsight here will help you prepare.

e.g. I shortened this to Alg5.1-swE-4000trc_1cv_u for quick visual scanning, but it would be preceded by a more formal table explaining the meaning.
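
A helper along these lines (field names and abbreviations are invented for illustration) keeps the tag generated from the same parameter dictionary that drives the trial, so it can’t drift out of sync with what was actually run:

def result_tag(params):
    """Build a short, scannable identifier for a trial variant.

    Every field has an explicit default, so when a new variable is added
    later, older results can still be compared under the assumed value.
    """
    return "{alg}-sw{sw}-{cases}trc_{folds}cv_{bal}".format(
        alg=params.get("algorithm", "Alg0"),
        sw="E" if params.get("stopwords", True) else "0",
        cases=params.get("training_cases", 0),
        folds=params.get("cv_folds", 1),
        bal=params.get("balancing", "u"),   # e.g. "u" for unbalanced
    )

print(result_tag({"algorithm": "Alg5.1", "training_cases": 4000}))
# -> Alg5.1-swE-4000trc_1cv_u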

The framework tail wags the solution dog, unfortunately.

As a professional software engineer, I prefer to design abstract algorithms and systems and then go looking for lower-level solutions which can help provide the components to support and implement the design. Many of the systems I’ve designed and built have outlived their implementation specifics – i.e. they’ve been migrated from framework to framework or even sometimes ported across languages. I’ve always believed the value, or IPR, is in the algorithm, not the implementation.

Unfortunately, in practical ML today (i.e. in its youth), the platforms and frameworks are inconsistent in their coverage of the various utilities you need. There is not much competition between them because they all do different things well. They are also very “big bricks” and will necessarily implement very large and high-level portions of your system design.

So your system design will be enormously influenced by the capabilities of the framework you adopt. You would have trouble seamlessly porting a finished application to another system – especially if you want to keep the same performance results.

I think at this time, this is just how it is. So choose your framework carefully and trial it before finalising your design – or even committing to fulfilling your stakeholder requirements.

Preparing and caching cross-validation data-sets

Once you’ve fully automated your testing you may find yourself generating huge amounts of data, both input and output. One of the explorations I did was on the effect of balancing training and test cases – some algorithms are highly sensitive to the ratios of positive and negative cases and require a 50/50 balance. In real life, I know the systematic review dataset may be highly unbalanced – e.g. a 5–10% positive rate. This led me to take an existing pre-labelled corpus of 5,000 documents and split it into various schemes from 10%/90% through 50%/50% in various increments.

Combining this with Cross-Validation (CV) folds of tens or higher, I ended up with hundreds of thousands or millions of files quite quickly. Even on an SSD these take time to generate.

So I implemented a test/train case caching system with a naming convention which detected whether an existing combination was available and, if not, generated one. This allowed me to run any test variant with the prerequisites automatically prepared, and it also meant re-runs of tests ran very much quicker.

While you do have to be cautious with caching randomised data, if you are sure you are not defeating the randomisation effect, caching these datasets speeds up testing by many orders of magnitude, facilitating quicker, deeper testing.
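
The sketch below shows the shape of the caching idea (paths and helper names are invented): the cache key is derived from every parameter that affects the split, including the random seed, so re-using a file can never silently change the experiment.

import hashlib, json, os
import numpy

CACHE_DIR = "cv_cache"   # illustrative location

def cached_split(params, generate):
    """Return a cached train/test split for `params`, generating it if absent.

    `params` is a dict such as {"corpus": "abc5000", "positive_rate": 0.1,
    "fold": 3, "seed": 42}; `generate` is your own (slow) split-building function.
    """
    key = hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()[:12]
    path = os.path.join(CACHE_DIR, key + ".npz")
    if os.path.exists(path):
        cached = numpy.load(path, allow_pickle=True)
        return cached["train"], cached["test"]
    train, test = generate(params)            # sampling, balancing, shuffling...
    os.makedirs(CACHE_DIR, exist_ok=True)
    numpy.savez(path, train=train, test=test)
    return train, test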

Watch StdDev and “break open” CV averages with high variance, to find the cause

When you utilise cross-validation, you run the same experiment many times on random subsets of cases and average the results to get a more reliable figure, avoiding any quirks. I can’t recommend this highly enough – I wasted time early on agonising over tiny optimisation details only to find they were ghost artefacts of specific test or training cases.

However, averages lose data. The average of the sequence 2,2,2,2 is 2, but that’s also the average of 3,1,4,0. These individual results are clearly very different, so it’s important to keep an eye on the variance or standard deviation of your cross-validation averaged results.

In the example above, 2,2,2,2 represents an algorithm which is very stable in its output over various test cases – presumably a good thing. The second, however, has wildly varying results which are never actually “right”. So if you ignored the variance of these averages and simply saw the CV output, you might not realise the unreliable nature of the second algorithm.

I made sure the standard deviation was always plotted on my output charts. I had written my own reports, so possibly the mistake of omitting “error bars” or the like was entirely my own – a more professional reporting tool may have included this by default.

If I saw a high variance, I would “break open” the CV – run it so you can see the individual results and play around to see if you can spot the pattern of why it’s so lumpy. Sometimes I found my algorithm had two behaviours – sometimes good, sometimes bad. Sometimes it might get stuck on something: a particular query or a local minimum.
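
In SKL this is only a couple of lines, since cross_val_score hands back the individual fold scores anyway – clf, X and y below stand in for your own classifier and data:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X, y, cv=10)   # one score per fold

print("mean %.3f" % scores.mean())
print("std  %.3f" % scores.std())    # plot this as error bars on your charts
print("folds", scores.round(3))      # "break open" the average: clusters of good
                                     # and bad folds reveal the two behaviours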

An unstable algorithm won’t be a good user experience. Even if it’s working on average, your users are individual cases and will expect consistency. So while CV is a good tool, it can mask some real-world performance requirements.

Actually use the pipeline architecture of SKL and all the tools

Finally: stand on the shoulders of giants. The amazing people behind SciKit Learn (and other tools) have put a huge amount of effort and thinking into their frameworks.

Perhaps I’m weird in wanting to build it all myself and most people will simply use their methodology anyway. After a lot of time (wasted?) constructing my own processing pipeline I came to understand how elegant the SKL approach is, utilising some real Python magic in handling the multi-dimensional datasets so fluidly.

RTFM for more information here: https://scikit-learn.org/stable/modules/compose.html#pipeline

If I was to start again, I would design my components with their pipeline interface, so they could be plugged in to this. This is an example of the tail-wagging avoidance from earlier, and my reluctance to get into bed with said dog.
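
For the record, “plugging in” looks roughly like this – the stages and parameter grid below are illustrative, not a recommendation:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),   # any custom component exposing fit/transform slots in here
    ("clf", LinearSVC()),
])

# One grid search then sweeps the hyper-parameters of every stage at once,
# and the whole pipeline is the single object you fit, pickle and deploy.
grid = GridSearchCV(pipeline, {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__sublinear_tf": [True, False],
    "clf__C": [0.1, 1, 10],
}, cv=5)
# grid.fit(texts, labels)   # texts/labels being your own corpus and screening labels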

At the end of the day, it depends what you want to be doing: coding deeply and endlessly or training an ML algorithm fast and effectively even if you don’t fully understand all the details. I feel that ML has got to the level where it’s pointless to even try to understand the entire architecture, so you have to let go, trust the framework and let the results speak for themselves.


In the next article (3 of 8), I’ll dig into the Natural Language Processing (NLP)-specific aspects of the Machine Learning algorithms I developed and evaluated.


Machine Learning evaluation study for SYRAS (Systematic Review Assistant)

My original goal for SYRAS, the Systematic Review Assistant, was to try to automate the laborious process of reviewing thousands of scientific article references while selecting their relevance to a particular topic of study. During the initial development of the software, once I had a basic application up and running, I conducted a detailed and extensive evaluation of various statistical techniques and machine-learning algorithms to see which one would best provide a solution for the specific problem.

I’d like to present what I learned and discovered during that process which will take a few articles, now I’ve had time to review the hundreds of pages of notes I took during the voyage. The topics I’d like to cover are:

  1. Overall orientation: identifying the problem and potential solutions.
  2. Tips: things you should know before starting an ML project.
  3. Natural Language Processing: NLP-specific Machine Learning has some extra complexities requiring pre-processing steps before more general algorithms can be used.
  4. Algorithms: identification of algorithm suitability, from off-the-shelf to custom designs and ensembles.
  5. Hyper-parameters: the difficulties of dealing with these frustrating fudge-factors, in theory and in practice.
  6. Statistics: it’s vital to understand how to prepare and score experiments, avoiding common mistakes which can invalidate your results. Using randomisation, cross-validation, normalisation and accuracy measures.
  7. Systematic Review specifics: the workflow of the application introduces some quirks into the learning problem, including incremental learning and horribly unbalanced corpora.
  8. The performance results: evaluation of algorithm performance and tuning.

Some of the topics I will cover may seem obvious to those experienced in ML projects, but I made so many newbie mistakes during this study which cost me a lot of time and effort. While the hands-on learning was valuable and concrete, I wouldn’t want to waste the same time in the future, so this is also a guide-to-self and others on how to approach a project in future.

Step 1: Identify the problem and potential solutions.

To kick off any project, I recommend you ensure you truly know what the problem is. This might sound obvious, but I made a couple of major U-turns during the ML evaluation because I had delayed fully designing the final product workflow until after I had proved that the technology could even provide a solution. More specifically, I wanted to prove a natural language classifier could predict the classifications made by a human researcher – if that wasn’t possible, then my bold claim was invalid and the project was impossible.

My first mistake was to jump in too quickly on this premise and not first explore and understand the end-user workflow and their requirements and challenges (user-centred design). It would turn out that there were other opportunities and challenges lurking which would change the entire approach to the application solution.

For example, it emerged that there were two different possible uses of the application: a) a classifier which could automatically complete the job of the researcher by labelling articles for them (the holy grail), or b) a search assistant which could relieve the tedium of the screening process by helping identify the best articles more quickly. These are very different goals, with different user-experiences, and would employ different algorithms.

Another real-world complexity which arose catastrophically late in the project (the very end) related to the scientific validity of the application – essentially a business requirement rather than a feature. For usage a) above to be allowed in a scientific review, where the machine is effectively performing the review, the software/algorithm/product would have to be extensively validated, peer-reviewed, locked down in function, accepted by overseeing institutions such as Cochrane, and understood by the journal editorial review statisticians who would eventually be judging the validity of the studies using the product. While this cannot be a show-stopper if I want to make such a system, it was a bit of a dead-stop at the beta phase when I tried to get scientists to use it!

During the evaluation I’d explored usage b), which is more of a morale-booster to the human than a replacement, but the researcher still has to complete the review themselves, reading every single abstract of perhaps 5,000 articles. And while it starts off better, the tail end of the review becomes ever more boring and barren – as the search engine has (hopefully) pushed all the relevant articles to the front. It would be like soldiering on to “page 500” of the Google results even though you hadn’t seen anything interesting since page 324. Compare this to the previously random but even distribution without the tool, and perhaps the researcher would prefer not to have the assistant!

So does the product have legs? I still believe so, and so I invested some time white-boarding various workflows with different researchers to better understand their needs. While writing this, I feel I probably could do more of that stakeholder knowledge transfer – in fact you possibly can’t do too much!

Defining the problem

The main reason I am so keen on this product idea is that Systematic Reviews are a seemingly perfect natural scenario for supervised learning in ML. We have a human researcher willing to read and uniformly classify 5,000 nicely structured documents in a database while we watch – what more data could an AI ask for!

The scenario therefore has a specific quirk, which I call incremental supervised learning: the ML has to learn on the fly while the user is operating the system, and will at some point during this process become knowledgeable enough to help out. It’s possible the system could repeatedly self-test itself (even do CV folds and parameter tuning!) until it was confident that it understood the topic before butting in – avoiding a “Clippy moment”.

Defining the classifier (usage a.), in simple terms of input – output:

Inputs:

  • Article title – short but variable length
  • Article abstract – longer variable length
  • Metadata including: date, authors, keywords, journal, citation graph
  • Binary Label (training)
  • [Possibly] other hints, reason of choice, tags, “strength” of choice

Outputs:

  • Binary label: mutually-exclusive choice: include/exclude
  • (Possibly) Breakdown of probabilities, topics, relations etc.

Ignoring the optional possibilities for the moment, the problem is of natural language binary classification. Knowing this already limits the solution research to a subset of the plethora of techniques and frameworks.

For the search assistant (usage b.), the input is the same as above – if we assume the same workflow while the assistant silently observes and learns.

Outputs:

  • Threshold of confidence, to suppress/enable the results
  • Ranked list of article references (document IDs), sorted by relevance to the articles so far positively labelled
  • [Possibly], articles similar to individual articles to allow “exploration search”
  • [Possibly], metadata on the search results, why they were presented, topics discovered, confidence etc.

This feels less well-defined. While it certainly has some search aspects, it’s not exactly a classic search engine because a) the input isn’t a search query, b) the learning is incremental not pre-calculated, and c) the document-similarity aspect will be complicated by averaging, clustering and graph-traversal. It’s possible this problem is more closely represented by a sales recommendation engine or social graph, but with the complexities of NLP over the top. I would therefore exercise caution here and try to simplify the proposal.

Bottom Up vs. Top Down Design? Meet in the middle.

Before designing any solution it’s important to get familiar with the available tools and materials. In the young and evolving ML domain this is even more vital as the various frameworks, like Scikit Learn or Tensor Flow, have quite specific strengths and weaknesses. Not only do they provide different tools, like cross-validation or parameter tuning, but some of them only support certain algorithms or have better versions of those algorithms. Some frameworks are entirely online API-oriented paid cloud services, whereas others are installable software which you host and maintain yourself. This entirely practical choice could dictate the functionality you have at your disposal.

SYRAS was intended to be open-source and “free” for use by any academic institution or individual who could arrange hosting, or self-host, so this effectively ruled out any lock-in to paid APIs like Tensor Flow. While it’s possible to configure the software, the user experience of having to register for highly technical services like ML APIs would, in my opinion, render the application too hard to get up and running for the average researcher.

This is an area of great concern for an academic FOSS product – deciding on the “stack” will greatly influence the uptake of the application. Can it be hosted easily? Can it be containerised with Docker or desktopised with Electron? Will academic IT departments be able to easily manage and maintain a platform? Will it scale? Would I be able to host a version cost-effectively and charge a nominal fee for usage?

Underneath each framework is one or more programming languages, like Python, R or Javascript. Deciding on the framework therefore aligns you with a language which identifies you with a certain community of developers and related tools. The Python language has become very popular with ML and NLP enthusiasts, so there are more tools available in that language. There are more answers and examples on Stack Overflow for Python and more new code repositories popping up.

Early on in the development of SYRAS, I had performed a refresher review of web application technologies – unrelated to the ML aspects. I chose a MEAN-ish stack of ExpressJS and MongoDB but later came to find communicating between JavaScript and Python challenging. First of all they “can’t” simply talk to each other, so you need some form of IPC, IO, shared files, or an adapter like Pyro. Then you find their preferred data structures are very different with JS tools implicitly favouring JSON which is alien to Python libraries such as SciKit Learn, leaving you sourcing or writing adapters and converters. Even with these communication pipelines theoretically in place, it becomes difficult to deploy and scale such intricately intertwined subsystems – especially considering the horsepower needed for some ML tasks.

I learned that an API-oriented design is therefore important to establish early on. It’s likely you will need to dabble in microservices, or at least a service-oriented architecture (SOA), when developing an ML application, to help “lubricate” your high-level design and give you access to a wider variety of tools now and in the future. This may feel like a burden and slow the design process down, but it should pay off for a serious project. I was keen to get started so I bodged a few things together, but as the system grew and crystallised around these hacks, they became harder to refactor into a more professional solution.

The lesson here is to experiment with a variety of tools, platforms and frameworks in a lightweight manner (Agile “spikes”) before finalising the overall application architecture. You have top-down user requirements and bottom-up capabilities – they have to meet in the middle.

In retrospect I half did this right. I evaluated the following tools and only feel like I over-invested in half of them.

  • Natural JS – this initially seemed promising, but lacked depth. I spent a lot of time embedding its NLP processing capabilities into my initial testing and evaluation framework, but eventually replaced all that with the SciKit Learn equivalents.
  • Tensor Flow – I didn’t spend too much time on TF as it wasn’t compatible with the business model discussed above, and it had fewer NLP tools than SKL.
  • Gensim – I realised too late that this is a good educational tool but probably not intended to be part of a real-world application. I spent too much time evaluating the performance of the algorithms in detail, which I duplicated in SKL.
  • SciKit Learn (SKL) – this has an immense set of tools for both general ML and NLP. While some might say it’s still an educational tool, I found it to be professional enough to be part of a deployable product.
  • Word2Vec – I didn’t look too deeply into this. While it seems relevant to some of the NLP goals of the project it lacks the variety of ML tools SKL provides.
  • NLTK – similar to Word2Vec: powerful NLP tools, but I found the installation and dependencies too complex for a professional product. (Perhaps Docker would mask this.)
  • Various others: I looked at many other smaller packages which, if compatible, could end up in the final product, but it’s difficult to compete with SKL which seems to have almost everything you need, and an extendable pipeline.

API Oriented Design (SOA)

I ended up with a set of APIs (which are still under development) which isolate the strengths and weaknesses of the various frameworks. I kept the ML part separate from the “web app” which became more of a traditional GUI.

  • Web application – serving the user-facing app: Express JS
    • Assistant Plugins – connecting to the backing APIs on various events.
  • Web application REST API – serving data for the app, e.g. user accounts, auth, projects, screening, etc. : Express JS, OpenAPI, MongoDB
  • Document Classification REST API – the prediction service (sketched below): Python, Flask, SciKit Learn
  • Document Search REST API – the incremental similarity service: Python, Flask, SciKit Learn
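
As a rough sketch of what the Document Classification service looks like (assuming a vectoriser and classifier trained offline and saved with joblib; the file paths and route names are invented):

from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
vectorizer = joblib.load("model/vectorizer.joblib")
classifier = joblib.load("model/classifier.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"documents": [{"uuid": "...", "text": "..."}, ...]}
    docs = request.get_json()["documents"]
    X = vectorizer.transform([d["text"] for d in docs])
    labels = classifier.predict(X)
    return jsonify([
        {"uuid": d["uuid"], "include": bool(label)}
        for d, label in zip(docs, labels)
    ])

if __name__ == "__main__":
    app.run(port=5001)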

 

To summarise:

  • Know your user
  • Know your tools
  • Invest lightly but broadly (gestalt!)
  • Plan deployment early

The next article will cover some more detailed tips and tricks “I wish I’d known before starting”, like fixing bugs before spending weeks grid-searching parameters.


Local development certificates done properly

Yesterday I finally lost my patience with the developer’s eternal problem of having to skip the untrusted self-signed certificate interstitial warning screens, and so I decided to solve it properly. The problem has got worse recently due to a combination of changes in the world and my projects:

  1. Containerisation – in Docker containers, “localhost” is no longer useful as a service hostname, as every service has its own localhost! So I have migrated towards using local host names, .local or .dev patterns etc.
  2. Chrome has long made special exceptions for “localhost” to allow invalid certificates, which eases development pains. This is now lost due to 1).
  3. Chrome doesn’t resolve .localhost domain extensions which are spoofed in a container’s /etc/hosts file. I recently hit this problem at work, while migrating Selenium tests to Docker. So again we had to migrate our test domains to something.local instead of .localhost.
  4. Chrome since about v60-ish now requires a subject alternative name (SAN) in a certificate and won’t respect a fallback to a local canonical name.
  5. Whenever my local Node.js application crashes with an exception, Chrome forgets the fact I’ve whitelisted the certificate. It took me a while to realise this was happening, but I now see it’s 100% correlated. This has led to a gradual fear of crashes (probably a good thing).

Being RSI conscious and generally lazy, having to click three more things each dev-cycle is a major pain. Also the fact that Browsersync stops working every time too makes it even more work to resume – involving double-clicking the failed GET in the network tab, and accepting the same warning message in a temporary browser window.

After reading the Chrome team’s official vision, and with my recent experience in setting up a personal CA for a business to allow them to use client-certificates for authenticating remote staff, I decided the best, most permanent solution is to use a personal CA added to the trust root of the local host, which signs the local cert – instead of self-signing.

While the process is actually easy to perform, I found as usual the historical complexities of the OpenSSL command line options and configuration files, plus the lack of a single article on this specific approach, meant I had to fiddle around for a while to get it right.

Things that made it tricky:

  • You cannot add a SAN on the openssl command line, so you must use a config file.
  • The overlaps between the config file sections used by “req” and “x509” are not immediately obvious.
  • People helpfully offer solutions using “openssl req -x509” which can do everything in one pass but it can only self-sign certificates.
  • The config file section names are themselves configurable, so there are differences across the examples and tutorials.

I started from this simplified gist and my knowledge of the fabulously professional and detailed OpenSSL DIY-CA guide by Jamie Nguyen (aimed at production quality so the setup is a bit over-complex for local dev).

The configuration file mysite.local.conf, which contains sections for both the “req -config” option and the “x509 -extfile” option, is as follows.

The important part – generating a CSR with SANs using the “req” command and then having the certificate signed by the CA using the “x509” command – is that for “req” the config line “req_extensions = v3_ca” tells it to find the extensions section with the SAN, whereas for “x509” it’s the “-extensions v3_ca” option which points to the same section in the “-extfile” file. This took me a while to get right – although it seems obvious now!

[req]
default_bits = 2048
prompt = no
default_md = sha256
distinguished_name = req_distinguished_name
req_extensions = v3_ca

[req_distinguished_name]
C = AU
ST = SYDNEY
L = NSW
O = MyOrganisation
OU = MyOrganisation Unit
CN = *.local

[v3_ca]
basicConstraints = CA:FALSE
keyUsage = digitalSignature, keyEncipherment
subjectAltName = @alternate_names

# I added localhost wildcards for general compatibility with older projects.
[alternate_names]
DNS.1 = localhost
DNS.2 = *.localhost
DNS.3 = *.local
DNS.4 = mysite.local

The process – to create a CA and create a certificate for the local site with the above SANs:

1. Make root CA Key

openssl genrsa -des3 -out rootCA.key 4096
> secret

2. Create and self-sign the CA Root Certificate

openssl req -x509 -new -nodes -key rootCA.key -sha256 -days 1024 -out rootCA.crt -config mysite.local.conf

Once that’s done, this CA can be used to issue many site certificates.

3. Make local site key

openssl genrsa -out mysite.local.key 2048

4. Make certificate request (CSR)

openssl req -new -key mysite.local.key -out mysite.local.csr -config mysite.local.conf

5. Generate and sign the site certificate.

openssl x509 -req -in mysite.local.csr -CA rootCA.crt -CAkey rootCA.key -CAcreateserial \
  -out mysite.local.crt -days 500 -sha256 -extfile mysite.local.conf -extensions v3_ca
> secret

6. Install the certificates.

The critical step which makes all the difference to this approach is getting your host to trust the personal CA. This is slightly different for Windows, Mac and Unix, or iOS, but the principle is the same – you add it to the list of “trusted roots”.

On a Mac, you can open a .crt file, which will add it into your Keychain Access, under Login Keychain > Certificates category.

Then you must open the “Trust” section and set it to “Always Trust”.

Once you have done this any site certificate signed by this fake CA will be trusted by your entire machine – in theory all applications, all browsers and the O/S itself. This is important to developers using various tools – IDEs, Postman, CURL or other API clients. It’s not just a specific solution for Chrome (like startup flags).

After copying the site certificate and key into place in my ExpressJS app and restarting, Chrome finally shows the valid “Secure” icon! No more interstitials…

If you inspect the certificate in Chrome, you should see both the site cert and the CA cert are both valid and trusted.

(Screenshots: the site certificate, with the SANs visible under Details, and the trusted CA certificate.)

I also had to add it to Browsersync configuration:

browserSync.init({
  https: {
    key: "../ssl/mysite.local.key",
    cert: "../ssl/mysite.local.crt"
  },
});

Browsersync also started working again on Gulp restart.

Now this is working and stable, it really has made a difference to my workflow – just one less level of friction and annoyance in my daily grind. I’ll have to find something else to get annoyed about now.


The Toast Test

A long time ago my friend’s mum’s toaster broke, so she sent it back to the manufacturer. They serviced it and returned it to her with a note saying it was all working, have a great day! She plugged it in and tried it out, but it was still broken and didn’t toast the bread.

She rang the company and got through to a support technician who said the toaster was working perfectly when they tested it. Confused, she pressed for more details about how they had tested the product. He went into great detail about the impressive battery of tests they had performed using all manner of high-tech analytical equipment. They’d measured so many voltage reference points, run diagnostic routines on the logic chips, and checked the inductive load on the transformer and so on. All tests passed 100% so they gave it a quick polish and popped it back in the post.

“But did you make any toast?”, she asked him.

Well of course not, he explained. This is a hi-tech, manufacturing clean-room grade area – they can’t have food or crumbs lying around, that would just be crazy. And who’s going to go out and buy all the bread?

The Toast Test has become something of a legend in the various companies I have worked at over the years. I usually introduce the story and it becomes a very simple term for what software engineers (myself included) sometimes forget to do when they’ve spent years adopting the best practices of module testing, unit-testing, TDD and so forth. At the end of a long, hard project it’s so easy to forget to return to the original brief after you’ve got 100% code coverage and all the tests are green. Even terms like integration testing or system testing, although technically appropriate, don’t quite have the simple, singular goal of the Toast Test.

Just: try using the product to do the one main thing it’s supposed to do – like a user would.

 


Introducing Systematic Review Assistant

I’ve been busy. Since July last year all my spare time has been taken up with a new initiative which I am now proud to launch into “early beta”: a systematic review assistant to help perform the laborious task of classifying thousands of scientific articles.

Some time ago I watched my professor partner reviewing 5,000 documents manually and boldly claimed I could quickly make an AI to do it for her. I wasn’t exactly wrong, but it was a little bit harder than I thought. The document classification side is partly “just a search engine”, but some interesting problems and opportunities emerged from the specific workflow required by the structured review process – such as blinding, teamwork, comparisons and corpus preparation artefacts. These led me to explore a myriad of ML and statistical techniques in various languages and packages. I even did an AI nano-degree MOOC as a refresher.

I have 400 pages of notes, so I will publish some articles on what I learned and how I built the prototype, ranging from classifier algorithm evaluation trials, what I learned from SciKit Learn, designing APIs, designing a scalable application architecture (SOA) under ExpressJS, managing large corpora in MongoDB, integrating Node apps with Python, so much async, so many Promises, abusing Mocha for science, Dockerizing micro-services and just how truly amazing Naive Bayes was for its time. I said I’ve been busy.

I’m now at “Step 3” of the original plan, and am looking for help to take it forward:

a) Alpha testers: scientists who need to do a systematic review, and who are willing to collaborate on helping polish the system to become more of an off-the-shelf product. It’s not finished but I feel it’s already a useful tool. I need feedback from real users to continue iterating the product. Leave a comment to get in touch.

b) Developers: who want to get involved in taking it forward. Check out the repo here: https://github.com/scipilot/sysrev-assist and drop me a line.

Here is the project description from the primary repository:

What is this project?

The aim of this project is to provide a set of tools to help undertake scientific systematic reviews more easily and quickly so they are more likely to be performed at all.

Some scientists such as Ben Goldacre believe systematic reviews are one of the most important steps forward for progressing science itself. “Systematic reviews are one of the great ideas of modern thought. They should be celebrated.” [Bad Science p99]. Academic organisations such as The Cochrane Collaboration already provide rules and guidance, services and tools, but due to gaps in this support during the lengthy review process, systematic reviews are still perceived to be so difficult or laborious that they are not performed as often as they should be.

There are commercial offerings such as Covidence and Zotero which offer a well-established range of functionality, some specialised to particular fields. While these are certainly powerful tools, commercial products are sometimes challenging to acquire during research projects.

Our goals:

  1. to provide free and open software to the science community
  2. to develop intelligent assistants to automate the laborious aspects of collation and screening
  3. to establish a community of open-source developers to broaden the creativity and support base

What are Systematic Reviews?

“Systematic reviews are a type of literature review that collects and critically analyzes multiple research studies or papers, using methods that are selected before one or more research questions are formulated, and then finding and analyzing studies that relate to and answer those questions in a structured methodology.[1] They are designed to provide a complete, exhaustive summary of current literature relevant to a research question. Systematic reviews of randomized controlled trials are key in the practice of evidence-based medicine,[2] and a review of existing studies is often quicker and cheaper than embarking on a new study.” https://en.wikipedia.org/wiki/Systematic_review

Systematic reviews can provide the data needed for a meta-analysis, or they can be used as a preparatory stage of any research project to assess the current state of a specific scientific topic.

A typical scenario might be summarised as follows:

  1. Journal database article search (sourcing articles by keyword, reference/citations)
  2. Systematic Review Process (filtering thousands down to dozens)
  3. Data extraction (of experimental methods, results/statistics)
  4. Meta-analysis and/or further research.

We are focussing on step 2, which itself has a fairly complex set of stages. As mentioned above Cochrane provide good tools and services for steps 1 and 4 but only guidance on how to do 2 and 3, which is up to each researcher to perform.

Outline approach

  1. Step 1 is to implement a basic web application which can perform a basic review, including article data import, screening process, collaboration and result data export. The application must have a “plugin” architecture to enable future additions.
  2. Step 2 is to research and develop potential solutions to the perceived roadblocks. For example could screening 5,000 documents be assisted by a natural-language AI? Or could the initial citation/reference searches be improved?
  3. Step 3 is to widen out the project to collaboration by international, academic developers who will have their own ideas and challenges.

History

The project was originally initiated around 2016 by Dr Nic Badcock and his team (Dept. Cognitive Science, Macquarie University, Sydney, Australia). While performing many systematic reviews, they developed their own software in R and Matlab. The R libraries help to automate the complex task of ingesting and processing articles exported from various online journal database archives. The Matlab GUI allows researchers to screen thousands of articles, keeping track of ratings and comments, while keeping the researcher focused and productive. This first version has been used successfully in several collaborative projects.

In early 2017 Pip Jones (programmer, scipilot.org) teamed up with Nic with the idea of adding an AI assistant to the screening process. After considering the scalability of installing and supporting the Matlab-based GUI and limitations of Matlab itself for this purpose, it was decided to first build a web-based GUI which was compatible with the existing R libraries and data-file formats. The MatLab GUI’s information-architecture was duplicated into an Express JS prototype application with a wireframe front-end.

In 2017 a small research fund was granted by Macquarie University to help kickstart the web project.

An initial machine learning (ML) assistant has been added to the web-based GUI after extensive evaluation of various classification techniques. This is implemented as a separate REST API backed by a Python application which utilises the SciKit-Learn libraries. The API provides classification and search services using fairly standard vectorisation (word-embedding) models, term frequency (TF-IDF), principal component analysis for dimensionality reduction (PCA/LSI/LSA) and nearest-neighbour (KNN) correlation. The REST API could be used independently from the web application.
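
For the curious, the shape of that pipeline is roughly as follows. This is a minimal sketch of mine using SciKit-Learn, not the actual SYRAS source – the documents and parameters are purely illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

# A handful of made-up abstracts standing in for a real corpus.
docs = [
    "effects of bilingualism on working memory in children",
    "working memory training and reading comprehension outcomes",
    "a systematic review of dyslexia screening tools",
    "screening methods for reading difficulties in young children",
]

vectoriser = TfidfVectorizer(stop_words="english")
tfidf = vectoriser.fit_transform(docs)        # sparse TF-IDF term vectors

svd = TruncatedSVD(n_components=2)            # LSA/PCA-style reduction to a few latent dimensions
reduced = svd.fit_transform(tfidf)

knn = NearestNeighbors(n_neighbors=2, metric="cosine")
knn.fit(reduced)

# Which documents sit nearest to the first abstract?
distances, indices = knn.kneighbors(reduced[:1])
print(indices, distances)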

While the initial ML algorithm in place (in the beta version) does have a little special-sauce in the ranking algorithm, it predominantly uses very basic, standard techniques. This is because, after a long evaluation study, I found not much improvement in classification accuracy (F-score) in this specific problem domain. I felt it was best left as simple and performant as possible, without some of the quirks exhibited by the more complex algorithms.

There is huge and exciting scope for improvement in the assistant concept, but it’s not necessarily in the classification and search accuracy of the algorithms.  Buy me a beer if you want to know more.

Posted in Ideas, Projects | Tagged , , , , | Leave a comment

Dusted off – Troposphere word cloud renderer

Troposphere is a word-cloud rendering plugin built in Javascript on Canvas using Fabric.

https://github.com/scipilot/troposphere

It has a few artistic options such as the “jumbliness”, “tightness” and “cuddling” of the words, how they scale with word-rankings, colour and brightness. It also has an in-page API to connect with UI controls.

I developed this visualisation in my spare-time a few years ago, while I was building a product codenamed “Tastemine” which was originally conceived by Emily Knox, the social media manager at Deepend Sydney. This proprietary application was designed to collect “stories” – posts, comments and likes on our client’s Facebook pages with analytical options ranging from keyword performance tracking through to sentiment analysis. After scraping and histogramming keywords from this (and other potential sources), we added the option to render the result in Troposphere as a playful way to present the data visually.

While ownership of a page enables you to access much more detailed data, it’s surprising how much brand identity you can collect from public stories on company pages. We used this to great effect, walking into pitches with colourful (if slightly corny) renditions of the social zeitgeist happening right now on their social media. Some embarrassing truths and some pleasant surprises were amongst the big friendly letters.

I think the scaling of the word frequencies is one of the most compelling aspects of word clouds. Often I’d hear clients cooing over seeing their target brand-message keywords standing out in large bold letters. Whether it truly proves success or not is questionable, but the big words had spoken. We certainly used it to visualise success trends over time, as messages we were marketing became larger in the social chatter over time.


It was, at the time of writing, the only pure-HTML5 implementation, with visual inspiration taken from existing Java and other backend cloud generators such as Wordle, which didn’t have APIs and so couldn’t be used programmatically. It was quite challenging to perfect – it still occasionally crashes some letters together and takes some time to render large clouds. I made huge leaps in optimisation and accuracy during development, so it may be possible to perfect it further with improved collision detection and masking algorithms.

It was built in the very early days of HTML5 Canvas when we were all lamenting the loss of the enormous Flash toolchain. It felt like 1999 all over again, programming in “raw” primitives with no hope of finding libraries for sophisticated and well-optimised kerning, tweening, animating or transforming – and forget physics. These were both sad and exciting times – we were in a chasm between the death of Flash and the establishment of a proper groundswell of Javascript. At the time Fabric was one of the contenders for a humanising layer over the raw Canvas API, handling polyfills and normalisation plus a mini-framework which actually had all sorts of strange scoping quirks.

One of my dear friends, and at the time a rising-star developer, Lucy Minshall was suffering more than most from the sudden demise of Flash – being a Flash developer. I chose this project as training material for her as it was a good transition example from the bad old ways of evil proprietary Adobe APIs to the brave new future following the Saints of WHATWG. It also contained some really classic programming problems, difficult maths and required a visual aesthetic – perfect for a talented designer-turned-Flash-developer like Lucy. Who cares what language you’re writing in anyway – it’s the algorithms that matter!

The most interesting and difficult part of the project was “cuddling” the words, as Lucy and I came to call it with endless mirth. This was the idea of fitting the word shapes together like Tetris so they didn’t overlap. Initially I implemented a box model where the square around each glyph couldn’t intersect with another bounding box. That was easy! Surely it wouldn’t be so hard to swap that layout strategy for one that respected the glyph outlines?

While I can’t remember all the possibilities I tried (there were lots, utter failures, false-starts, weak algorithms and CPU toasters) a few of the techniques which stuck are still interesting.

The main layout technique (inspiration source sadly lost) was placing the words in a spiral from the centre outwards. This really helped with both the placement algorithm – to get a “cloud” shape – and the visual appeal and pseudo-randomness, considering people don’t like truly random things.
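
As a rough illustration of the spiral placement (the original is Javascript on Canvas; this is just a Python sketch of mine, using the simple bounding-box test rather than glyph-level cuddling): walk an Archimedean spiral outwards from the centre and take the first position where the new word doesn’t hit anything already placed.

import math

def overlaps(a, b):
    """Axis-aligned bounding boxes as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def place_word(width, height, placed, cx=0.0, cy=0.0, step=2.0):
    """Return an (x, y, w, h) box for a new word of the given size."""
    theta = 0.0
    while True:
        r = step * theta                      # Archimedean spiral: radius grows with angle
        x = cx + r * math.cos(theta) - width / 2
        y = cy + r * math.sin(theta) - height / 2
        candidate = (x, y, width, height)
        if not any(overlaps(candidate, box) for box in placed):
            return candidate
        theta += 0.1                          # small angular step keeps the spiral dense

placed = []
for w, h in [(120, 40), (80, 30), (80, 30), (60, 20)]:   # word sizes in pixels, largest first
    placed.append(place_word(w, h, placed))
print(placed)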

Another technique I borrowed from my years as a Windows C/C++ programmer in the “multimedia” days was bit-blitting and double/triple buffering. This was a pleasant Canvas surprise, as bitmap operations were pretty new to Flash at the time and felt generally impossible on the web. The operations used to test whether words were overlapping involved some visually distressing artefacts with masks and colour inversions and so on, so I needed to do that stuff off-screen. Also, for performance purposes I only needed to scan the intersecting bounding boxes of the words, so copying small areas to a secondary canvas for collision detection was much more efficient. Fortunately Canvas allows you to do these kinds of raster operations (browser willing) even though it’s mainly a vector-oriented API.
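
In NumPy terms (standing in for the off-screen canvas – this is only my sketch of the idea, not the library’s code), the collision test looks something like this. The key point is that only the intersection of the two bounding boxes is ever inspected.

import numpy as np

def masks_collide(mask_a, pos_a, mask_b, pos_b):
    """mask_* are 2-D boolean arrays of rendered glyph pixels; pos_* are (x, y) offsets."""
    ax, ay = pos_a
    bx, by = pos_b
    ah, aw = mask_a.shape
    bh, bw = mask_b.shape

    # Intersection rectangle of the two bounding boxes, in global coordinates.
    left, right = max(ax, bx), min(ax + aw, bx + bw)
    top, bottom = max(ay, by), min(ay + ah, by + bh)
    if left >= right or top >= bottom:
        return False                          # the boxes don't even touch

    # Slice just the overlapping region out of each mask and AND them together.
    sub_a = mask_a[top - ay:bottom - ay, left - ax:right - ax]
    sub_b = mask_b[top - by:bottom - by, left - bx:right - bx]
    return bool(np.any(sub_a & sub_b))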

Producing natural looking variations in computer programs often suffers from the previously mentioned problem that true randomness and human perception of randomness are two very different things. People are so good at recognising patterns in noisy environments, that you have to purposely smooth out random clusters to avoid people having religious crises.

During this project I produced a couple of interesting random variants which I simply couldn’t find in the public domain at the time. The randomisation I developed is based around the normal distribution (bell curve), cut off at around three standard deviations to prevent wild outliers, instead of at the usual min–max. The problem with typical random numbers over many iterations is you get a “flat line” of equal probabilities between the min and max, like a square plateau. This isn’t normal! Say your minimum is 5 and max is 10: over time you’ll get many 5.01s but never a single 4.99. In real life, everything is a normal distribution! Really you want to ask an RNG for a centre point and a standard deviation to give the spread. I was pretty surprised (after coming up with the idea) that I couldn’t find anything, in any language, implementing it. I’d been working on government-certified RNGs recently, and had even interfaced with radioactive-decay-driven RNGs in my youth, so I believed I was relatively well versed in the topic. So I reached for my old maths textbooks and did it myself – with some tests – and visualisations of course!
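
A minimal sketch of that idea (in Python here for brevity – the original was Javascript): ask for a centre and a spread, and simply re-roll anything that lands beyond three standard deviations.

import random

def gaussian_clipped(centre, stdev, max_sigmas=3.0):
    """Normally-distributed random number, re-drawn if it falls outside +/- max_sigmas."""
    while True:
        value = random.gauss(centre, stdev)
        if abs(value - centre) <= max_sigmas * stdev:
            return value

# e.g. jumble each word's rotation around 0 degrees with a spread of 10,
# so ~99.7% of draws land within +/-30 and the rare outliers are re-rolled.
angles = [gaussian_clipped(0, 10) for _ in range(5)]
print(angles)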

Having bell-curve weighted random numbers really helped give a soothing natural feel to the “jumbliness” of the words and to the spread of the generated colour palettes. It’s an effect that’s difficult to describe – it has to be seen (A/B tested) to be appreciated. I wonder if they are secretly used in other areas of human-computer relations.

Performance was one of the biggest – or at least longest – challenges. In fact it never ended. I was never totally happy with how hot my laptop got on really big clouds, with all the rendering options turned on. Built into the library are some performance statistics and – you guessed it – meta-visualisation tools in the form of histograms of processing time over word size.

I also experimented with sub-pixel vs. whole-pixel rendering but didn’t find the optimisation gains some people swore by, when rounding to true pixels.

After a lot of hair pulling, there were some really fun moments when a sudden head-slap moment led to a reworking of the collision detection algorithm (the main CPU hog) which gave us a huge leap in performance. I’m sure there are still many optimisations to make, and I’d be happy to accept any input – hence why I’ve open-sourced it after all this time.

While tag clouds may be the Mullets of the Internet, programming them almost certainly contributes to baldness.


Posted in Dusted Off | Tagged , , | Leave a comment

Trialling the ELK stack from Elastic

As part of my ongoing research into big-data visualisation and infrastructure and application management tools, it came time to give ELK a test run to check if it’s suited to my needs. I’ve already looked at a few others (which I will detail in another article), and so far haven’t found something suitable for an SMB to collate and process both “live” information from applications, e.g. from existing databases or APIs, combined with “passive” information from log files.

Some of the applications I work with are modifiable so we can take advantage of APIs to push event-driven data out to analytics or monitoring platforms, but some legacy components are just too hard to upgrade thus log-scraping could be the only viable option. I already use tools at various stack levels such as New Relic (APM), Datadog/AWS (infrastructure), Google Analytics (web/user) and my special interest: custom business-event monitoring. Ideally these various sources could be combined to produce some extremely powerful emergent information, but the tools at these different levels are often specific to the needs of that level such as hardware metrics vs. user events and thus difficult to integrate.

My Experiences Trialling Elastic 

It seemed from an initial look that the Elastic stack was flexible and agnostic enough to be able to provide any of these aspects. But would it be a jack-of-all and master-of-none?

To give it a go, I looked at the three main components separately at first. Simply put:

  1. LogStash – data acquisition pipeline
  2. Elastic Search – search engine & API
  3. Kibana – visualisations dashboard

They provide hosted services but I didn’t feel like committing just yet and didn’t want to be rushed in a limited-time trial, so I downloaded and installed the servers locally. I mentally prepared myself for hours of installation and dependency hell after my experiences with Graphite and Datalab.

But – these were my only notes during the fantastically quick set-up:

  • Too easy to setup, built in Java,  it just ran on my macbook!
  • Tutorial: input from my local apache logs files -> elasticsearch, processed really quick!
  • Logstash Grok filter would be key to parsing our log files…

I just unzipped it and ran it and it worked. I know that’s how the world should work, but this is the first time I’ve experienced that for years.

Interest piqued, I decided to run through the tutorials and then move on to setting up a real-world log import scenario for a potential client. I noted down the things I discovered, hopefully they will help other people on a similar first journey. At least it will help me when I return to this later and have predictably forgotten it all.

LogStash – data acquisition pipeline

I ran through the Apache logs tutorial, after completing the basic demos.

The default index of logstash-output-elasticsearch is “logstash-%{+YYYY.MM.dd}”, which is not mentioned in the tutorials. Thus all Apache logs are indexed under this, hence the default search like http://localhost:9200/logstash-2016.07.11/_search?q=response=200

I don’t think this will be useful in reality – having an index for every day, but I guess we’ll get to that later. Furthermore the timestamp imported is today’s date, i.e. the time of the import, not the time parsed from the logs. [I will address this later, below]

Interesting initial API calls to explore:

http://localhost:9200/_cat/indices?v – all (top level) indices, v = verbose, with headers.

http://localhost:9200/_cat/health?v   – like Apache status

http://localhost:9200/_cat/ – all top level information

Grok – import parser

Grok is one of the most important filter plugins – enabling you to parse any log file format, standard or custom. So I quickly tried to write an info trace log grok filter for some legacy logs I often analyse manually and thus know very well. This would make it easier to evaluate the quality and depth of the tools – “how much more will these tools let me see?”

My first noobish attempt was an “inline” pattern in the Grok configuration. A toe in the water.

filter {
    grok {
        # example 01/03/2016 23:59:43.15 INFO: SENT: 11:59:43 PM DEVICE:123:<COMMAND>
        match => { "message" => "%{DATESTAMP:date} %{WORD:logType}: %{WORD:direction}: %{TIME} %{WORD:ampm} DEVICE:%{INT:deviceId}:%{GREEDYDATA:command}" }
    }
}

I found it hard to debug at first: it seemed it wasn’t importing, but I saw no errors – because it was actually working! It took me a little while to get the trial-and-error configuration debug cycle right. Tips:

  • This grok-debug tool was good.
  • This one also.
  • Core Grok patterns reference is vital
  • A regex cheat sheet also helps, as Grok is built on it.
  • Start Logstash with -v to get verbose log output, or even more with --debug
  • Restart logstash when you make config changes (duh)
  • The config is not JSON. Obvious, but this kept catching me out because most other aspects you’ll need to learn simultaneously are in JSON. (What’s with the funny PHP-looking key => values – are they a thing?)

OK. I got my custom log imported fairly easily – but how do I access it?

Elastic Search – search engine & API

During my first 5 minutes going through the ES tutorials, I noted:

  • Search uses a verbose JSON DSL via POST (not so handy for quick browser hackery)
  • However I found you can do quick GET queries via mini-language
  • Search properties: http://localhost:9200/logstash-2016.07.12/_search?q=deviceId:1234&response=200&pretty
  • To return specific properties e.g.: &_source=logType,direction,command
  • Scores – this is search-engine stuff (relevance, probabilities, distances) as opposed to SQL “perfect” responses.
  • Aggregates (like GROUP) for stats, didn’t try them, POST only and I’m lazy today, but they look good. Would probably explore them more in Kibana

Great – the REST API looks really good, easily explorable with every feature I had hoped for and more. The “scores” aspect made me realise that this isn’t just a data API, this is a proper search engine too, with interesting features such as fuzziness and Levenshtein distances. I hadn’t really thought of using that – from a traditional data accuracy perspective this seemed all a bit too gooey, but perhaps there will be a niche I could use it for.
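
For example, I believe a terms aggregation over the deviceId field would look something like the following – a sketch using Python’s requests library against the tutorial index, untested, so treat the details as assumptions rather than gospel.

import requests

body = {
    "size": 0,  # we only want the aggregation buckets, not the matching documents
    "aggs": {
        "logs_per_device": {
            "terms": {"field": "deviceId"}  # count log lines per device
        }
    }
}

resp = requests.post("http://localhost:9200/logstash-2016.07.12/_search", json=body)
for bucket in resp.json()["aggregations"]["logs_per_device"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])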

Kibana – for visualisations

  • Download “installed” tar.gz – again worked perfectly, instantly.
  • Ran on http://0.0.0.0:5601, it set itself up
  • Created default index pattern on logstash-* and instantly saw all the data from above import.

So again great, this was “too easy” to get up and running, literally within 5 minutes I was exploring the data from the Apache tutorial in wonderful technicolor.

After a broad sweep I was feeling good about this stack, so I felt it was time to go a bit deeper.

Mappings

It seems the next (it should have been first!) major task is to map the fields to types, else they all end up as fieldname.raw as strings.

But… you cannot change mappings on existing data – so you must set them up first! However you can create a new index, and re-index the data… somehow, but I found it easier to start again for the moment.

I couldn’t figure out (or there isn’t) a mini-language for a GET to create mappings, so I used the example curl commands, which weren’t as annoying as I’d thought they’d be – except that I do use my browser URL history as my memory, and it’s just a bit harder to re-edit and re-use SSH shell histories than in-browser URLs.

curl -XPUT http://localhost:9200/ra-info-tracelog -d '
{
  "mappings": {
    "log": {
      "properties": {
        "date": { "type" : "date" },
        "deviceId": { "type" : "integer" }
      }
    }
  }
}
';

 

Getting the date format right…

The legacy server, from which these logs came, doesn’t use a strict datetime format (sigh), and  Logstash was erroring.

# example 01/03/2016 23:59:43.15

Initially I tried to write a “Custom Pattern”, but then I found the Grok date property format should be able to handle it, even with only 2 fractional-second digits (the default is 3). To figure this out, I had to manually traverse the tree of patterns in the core library from DATESTAMP down through its children. This was actually a good exercise to learn how the match patterns work – very much like an NLP grammar definition (my AI degree is all coming back to me).

So why is Grok erroring when the pattern is correct?

It took me a while to realise it’s because the Grok DATESTAMP pattern is just a regexp to parse the message data into pieces but is more permissive than the default date field mapping specification in the subsequent Elasticsearch output stage. So it tokenises the date syntactically, but it’s the field mapping which then tries to interpret it semantically which fails.

OK, so I felt I should write a custom property mapping to accommodate the legacy format.

    "date": {
        "type" : "date" ,
        "format": "dd/MM/yyyy HH:mm:ss.SS"
    },

Mild annoyance alert: To do these changes I had to keep re-creating the indexes and changing the output spec, restarting Logstash and changing my debug query URLs. So it’s worth learning how to re-index data, or (when doing it for real) get this right first in an IA scoping stage.

Tip: I debugged this by pasting very specific single logs, one line at a time, into the test.log file which Logstash is monitoring. Don’t just point it at a huge log file!

So many formats!

The date mappings are in yet another language/specification/format/standard called Joda. At this point I started to feel a little overwhelmed with all the different formats you need to learn to get a job done. I don’t mind learning a format, but I’m already juggling three or four new syntaxes and switching between them when I realise I need to move a filtering task to a different layer is an off-putting mix of laborious and confusing.

For example I just learned how to make optional matches in Grok patterns, but I can’t apply it here and do “HH:mm:ss.SS(S)?” to cope with log-oddities, which is a frustrating dead-end for this approach. So I have to look again at all the other layers to see how I can resolve this with a more flexible tool.

OK, once the date mapping works… it all imports successfully.

Creating Kibana visualisation

To use this time field you create a new index pattern in Kibana>Settings>Indices and select the “date” field above as the “Time-field name”, otherwise it will use the import time  as the timestamp – which won’t be right when we’re importing older logs. (It will be almost right if logs are being written and processed immediately but this won’t be accurate enough for me).

I loaded in a few hundred thousand logs, and viewed them immediately in Kibana… which looks good! There are immediately all the UI filters from my dreams, it looks like it will do everything I want.

But there’s an easier way!

The Date filter allows you to immediately parse a date from a field into the default @timestamp field. “The date filter is especially important for sorting events and for backfilling old data.” which is exactly what I’m doing.

So I made a new filter config:

filter {
    grok {
        match => { "message" => "%{DATESTAMP:date} ..." }
    }
    date {
        match => ["date", "dd/MM/yyyy HH:mm:ss.SS"]
    }
}

And it turned up like this (note: without using the field mapping above, so the redundant “date” property here is just a string). Also note the UTC conversion, which was a little confusing at first, especially as I unwittingly chose to test a log across the rare 29th of February!

    "@timestamp" : "2016-02-29T13:00:23.600Z",
    "date" : "01/03/2016 00:00:23.60",

The desired result was achieved: this showed up in Kibana instantly, without having to specify a custom time-field name.

I got it wrong initially, but at least that helped me to understand what these log-processing facilities are saving you from having to do later (possibly many times).

Making more complex patterns

The legacy log format I’m trialling has typical info/warning/error logs, but each type also has a mix of a few formats for different events. To break down these various log types, you need to implement a grammar tree of expressions in custom Grok Patterns.

The first entries should be the component “elements”, such as verbose booleans or enumerations:

    ONOFF (?:On|Off)
    LOG_TYPE (?:INFO|WARNING|ERROR|DEBUG)

If a log file has various entry types – like sent/received, connection requests and other actions – then the next entries should match the various log-line variants, composed of those custom elements and any standard patterns from the core libraries.
e.g.

# e.g. 01/02/2016 01:02:34.56 INFO: Connection request: device:1234 from 123.45.56.78:32770 ()
INFO_CONNECTION_REQUEST %{DATESTAMP:date} %{LOG_TYPE:logType}: Connection request: device:%{INT:deviceId} from %{HOSTPORT:deviceIP} \(%{HOSTNAME}\)

Then finally you have one log-line super-type which matches any log-line variants
e.g.:

INFO_LINE %{INFO_COMMAND}|%{INFO_CONNECTION_REQUEST}|%{INFO_CONNECTION_CLOSED}|%{INFO_STATUS}

Again the tools mentioned above were crucial in diagnosing the records which arrive in Elasticsearch tagged with _grokparsefailure while you are developing these patterns.

For annoyingly “flexible” legacy log formats I found these useful:

  • Optional element: ( exp )? – e.g. (%{HOSTNAME})?
  • Escape brackets: \( – e.g. \(%{INT:userId}\)
  • Non-capturing group: (?: exp ) – e.g. (?:INFO|WARNING|ERROR|DEBUG)

Differentiating log variants in the resulting data

Next I wanted to be able to differentiate which log-line variant each log had actually matched. This turned out to be harder than I had thought. There doesn’t seem to be a mechanism within the regular-expression matching capabilities of the Grok patterns to, say, “set a constant” when a specific pattern matches.

The accepted method is to use logic in the pipeline configuration file plus the ability to add_tag or add_field in the Grok configuration. This approach is sadly a bit wet (not DRY), as you have to repeat the common configuration options for each variant. I tried to find other solutions, but currently I haven’t resolved the repetitions.
e.g.

grok {
    match => { "message" => "%{INFO_CONNECTION_CLOSED}" }
    patterns_dir => ["mypatterns"]
    add_tag => [ "connection" ]
}
grok {
    match => { "message" => "%{RA_INFO_LINE}" }
    patterns_dir => ["mypatterns"]
}

However this can also result in a false _grokparsefailure tag, because the two configurations are run sequentially regardless of a match. So if the first one matches, the second will fail.

One solution is to use logic to check the results of the match as you progress.

    grok {
        match => { "message" => "%{INFO_CONNECTION_CLOSED}" }
        patterns_dir => ["mypatterns"]
        add_tag => [ "connection" ]
    }
    if ("connection" not in [tags]) {
        grok {
            match => { "message" => "%{INFO_LINE}" }
            patterns_dir => ["mypatterns"]
        }
    }

This works well, and for these log-line variants, I’m now getting a “connection” tag, which can enable API queries/Kibana to know to expect a totally different set of properties for items in the same index. I see this tag as a kind of “classname” – but I don’t know yet if I’m going down the right road with that OO thought train!

    "@timestamp" : "2016-02-29T13:00:25.520Z",
    "logType" : [ "INFO", "INFO" ],
    "deviceId" : [ "123", "123" ],
    "connection_age" : [ "980", "980" ],
    "tags" : [ "connection" ]

Another method is to “pre-parse” the message and only perform certain groks for specific patterns. But again it still feels like this is duplicating work from the patterns.

    if [message] =~ /took\s\d+/ { grok { ... } }

Even with the conditional in place above, the first filter technically fails before the second one succeeds. This means the first failure will still add a “_grokparsefailure” tag to an eventually successful import!

The final workaround is to manually remove the failure tags in all but the last filter:

    grok {
        match => { "message" => "%{INFO_CONNECTION_CLOSED}" }
        add_tag => [ "connection" ]
        # don't necessarily fail yet...
        tag_on_failure => [ ]
    }

So while I am still very impressed with the ELK stack, I am starting to see that coping with real-world complexities isn’t straightforward and can lead to some relatively hacky and unscalable techniques due to the limited configuration language. It’s these details that will sway people from one platform to another, but it’s difficult to find those sticking points until you’ve really fought with it – as Seraph so wisely put it.

Loading up some “big” data

Now I was ready to import a chunk of old logs and give it a good test run. I have a lot – a lot – of potential archive data going back years. It seemed to import fairly quickly: even on my 6-year-old MacBook Pro, Logstash chewed through 200,000 logs into Elasticsearch in a few minutes. (I know this is tiny data, but it’s all I had in my clipboard at the time.) I’m looking forward to testing millions of logs on a more production-tuned server and benchmarking it with proper indexing and schema set up.

Heading back to Kibana, I was able to explore the data more thoroughly now it’s a bit more organised. The main process goes through:

  1. data discovery
  2. to making visualisations
  3. and then arranging them on dashboards.

This process is intuitive and exactly what you want to do. You can explore the data by building queries with the help of the GUI, or hand-craft some of it with more knowledge of the Elasticsearch API, then you can save these queries for re-use later in the visualisation tools.

Even in the default charts, I instantly saw some interesting patterns including blocks of missing data which looked like a server outage, unusual spikes of activity, and the typical camel-humps of the weekend traffic patterns. These patterns are difficult to spot in the raw logs, unless you have Cipher eyes.

I had a quick look at the custom visualisations, particularly the bar-charts, and found you can quite easily create sub-groups from various fields and I started to realise how powerful the post-processing capabilities of Kibana could be in slicing up the resulting data further.

Summary thoughts

In summary I feel the ELK stack can certainly do what I set out to achieve – getting business value out of gigabytes of old logs and current logs without having to modify legacy servers. I feel it could handle both infrastructure level monitoring and the custom business-events both stored in logs and fired from our APIs and via MQs.

The component architecture and exposed REST API is also flexible enough to be able to easily feed into other existing data-processing pipelines instead of Kibana, including my latest pet-project Logline which visualises mashups of event-driven logs from various sources using the Vis.org Timeline.

Next steps

I feel I’m now ready to present this back to the folks at the organisations I consult for and confidently offer it as a viable solution. It offers the tools for building a business intelligence analysis platform and, with the addition of monitoring tools such as Watcher, could potentially bring that post-rational intelligence into real time.

Beyond that – the next step could even be predicting the future, but that’s another story.

Posted in articles | Tagged , , , , , | Leave a comment

PHPUnit-Selenium2 Cheat Sheet

My PHPUnit-Selenium2 Cheat Sheet

Here are a few snippets of how I’ve achieved various tasks, some tricks and patterns in phpunit/phpunit-selenium v2.0 – targeting Selenium2. I’ll try to keep this updated with more techniques over time.

Screenshots

I wrote this small hook to make screenshots automatic, like they used to be. Of course you may want to put a timestamp in the filename, but I usually only want the last problem.

/**
 * PhpUnitSelenium v1 used to have automatic screenshot as a feature, in v2 you have to do it "manually".
 */
public function onNotSuccessfulTest(Exception $e){
 file_put_contents(__DIR__.'/../../out/screenshots/screenshot1.png', $this->currentScreenshot());

 parent::onNotSuccessfulTest($e);
}

Waiting for stuff

An eternal issue in automated testing is latency and timeouts. Particularly problematic in anything other than a standard onclick-pageload cycle, such as pop-up calendars or a JS app. Again I felt the move from Selenium1 to 2 made this much more clumsy, so I wrote this simple wrapper for the common wait pattern boilerplate.

/**
 * Utility method to wait for an element to appear.
 *
 * @param string $selector
 * @param int    $timeout milliseconds wait cap, after which you'll get an error
 */
protected function waitFor($selector, $timeout=self::WAIT_TIMEOUT_MS){
 $this->waitUntil(function(PHPUnit_Extensions_Selenium2TestCase $testCase) use($selector){
  try {
   $testCase->byCssSelector($selector);
  } catch (PHPUnit_Extensions_Selenium2TestCase_WebDriverException $e) {
   return null;
  }
  return true;
 }, $timeout);
}

Checking for the right page, reliably.

If something goes wrong in a complex test with lots of interactions, it’s important to fail fast  – for example if the wrong page loads, nothing else will work very well. So I always check the page being tested is the right page. To do this reliably, not using content or design-specific elements, I add a <body> tag “id” attribute to every page (you could use body class if you’re already using that styling technique but I tend to separate my QA tagging from CSS dependencies). Then I added this assertion to my base test case.

/**
 * We use <body id="XXX"> to identify pages reliably.
 * @param $id
 */
protected function assertBodyIDEquals($id){
 $this->assertEquals($id, $this->byCssSelector('body')->attribute('id'));
}

Getting Value

The ->value() method was removed in Selenium v2.42.0. The replacement method is to use $element->attribute('value') [source]

// old way
//$sCurrentStimulus = $this->byName('word_index')->value();
// new way
$sCurrentStimulus = $this->byName('word_index')->attribute('value');
// I actually use this now:
$sCurrentStimulus = $this->byCssSelector('input[name=word_index]')->attribute('value');

However ->value() was also a mutator (setter), which ->attribute() is not. So if you want to update a value, people say you have to resort to injecting JavaScript into the page, which I found somewhat distasteful. Luckily, however, this is not the case for the “value” attribute specifically: according to the source code, it’s only the GET which was removed from ->value().

JSON Wire Protocol only supports POST to /value now. To get the value of an element GET /attribute/:name should be used

So I can carry on doing this, presumably until the next update breaks everything.

$this->byName('u_first_name')->value(GeneralFixtures::VAlID_SUBJECT_USERNAME);


General Page Tests

I have one test suite that just whips through a list of all known pages on a site and scans them for errors, a visual regression smoke test for really stupid errors. It’s also easy to drop a call to this method in at the beginning of any test. When I spot other visual errors occurring, I can add them to the list.

/**
 * Looks for in-page errors.
 */
protected function checkErrors() {
 $txt = $this->byTag('body')->text();
 $src = $this->source();

 // Removed: This false-positives on the news page.
 //$this->assertNotContains('error', $this->byTag('body')->text());

 // Standard CI errors
 $this->assertNotContains('A Database Error Occurred', $txt);
 $this->assertNotContains('404 Page Not Found', $txt);
 $this->assertNotContains('An Error Was Encountered', $txt);
 // PHP errors
 $this->assertNotContains('Fatal error:', $txt);
 $this->assertNotContains('Parse error:', $txt);

 // the source might have hidden errors, but then it also might contain the word error? false positive?
 // This false-positives on the user form (it must contain validation error text!)
 //$this->assertNotContains('error', $this->source());
 $this->assertNotContains('xdebug-error', $src); // XDebug wrapper class

}

 

Posted in articles | Tagged , , , | Leave a comment

MOTIf v2.0 – responsive redesign

After 8 years the MOTIf website was starting to show its age, visually at least.

While I have performed regular technical updates to keep it browser-compatible and future-proofed, we made a fixed-layout decision (rather than fluid) in 2007 and so it has never worked well on these newfangled smartphones and phablet whatnots. Sadly though, the main driver for the recent redesign was actually a need to distance ourselves from some unscrupulous people tacitly claiming the site was their own! We decided it was time to rebrand the site, and introduce the key people in the team on a new “About Us” page – the site had previously had something of an air of mystery behind it, for… reasons (as the kids say nowadays).

So I thought it was time for a complete front-end rebuild, and dusted off everything I learned while working at Deepend building what were cutting-edge responsive sites (three or four years ago now). We put huge effort into pioneering in this field, and even built our own front-end framework/reset/bootstrap.

Seeing as I’m working voluntarily on the site now, my time is a scarce resource, so I decided to stand on the giants’ shoulders of Twitter Bootstrap – replacing Blueprint, which had served the site well since 2007. Blueprint was great as a reset and grid system, but came before responsive design had been invented and would have required an m-site (remember those?). TBS 4 is about to come out but it’s not even in RC yet, so I chose TBS 3, which I’m relatively familiar with. (The only thing I don’t like about TBS is it comes with “style” which you have to get rid of, rather than it being a purely vanilla reset and grid framework.)

One of the great tools we used at Deepend was BrowserSync, which upgrades you into the robot octopus required for responsive testing on multiple devices. It automatically reloads the pages after you’ve edited the source, but also syncs the navigation and even scrolling across all devices – it’s quite amazing to see it working.


While pondering a new front-end build, I realised I’ve now changed allegiances from Grunt to Gulp. I was a great fan of Grunt, so the transition was hesitant, but there is a certain beauty and simplicity to the concept of Gulp in which I’m more keen to invest time (than learning more ad-hoc config formats). I’ve been using it recently with a node.js/redis application (SciWriter – coming soon!) and it just feels more like an integral part of the system, being in Javascript and allowing interoperability with the server codebase if required. Also the logo is far less frightening.

I was pleased to see there is now an official version of Bootstrap with SASS (rather than the previous third-party version), as I’m more a fan of SASS than LESS. To be honest I can’t remember the details of why now, but after a couple of years of trying both in dozens of projects at Deepend we all plumped for SASS as the marginally superior platform.

To get SASS building in Gulp, I ditched my previous ally Compass for gulp-ruby-sass. I found it relatively tricky to wire up the SASS build as the twbs/bootstrap-sass documentation has myriad options including combinations of Rails, Bower, Compass, Node, Sprockets, Mincer, Rake… aagh what! But after thinking it through and a short walk around the block I found gulp-ruby-sass was the right choice for me – as I am using Bower and Gulp.

Once the set of dependencies and technologies were chosen, the actual install ended up quite straightforward:

  • update/install Ruby, Node, NPM etc.
  • install Bootstrap with Bower
  • install Gulp with Node NPM
  • install Browsersync and gulp-ruby-sass into Gulp

I set up a src folder in the site with some new .gitignore entries for bower_components, sass_cache and node_modules, and then created a JS and CSS build in the gulpfile.

As I am migrating an existing site, I decided to use the SCSS format (rather than SASS). The great thing about SCSS being a superset of CSS is that I could just drop the original 2007 motif.css (designed over Blueprint) into the src/scss directory and start migrating to the new site. I much prefer a format closer to CSS and I am not much of a fan of oversimplified syntax transpilers such as Coffeescript. It just feels like yet another language to learn, and takes your knowledge further from the true W3C stack – all for a few braces?

Now I was ready to splice the BS3 “starter” template header into the site’s header view template, fiddle around a little with the JS/CSS imports and see what the site looked like for fun… I was actually pretty amazed to see the site looked relatively intact and was already responsive! I believe this is testament to the semantic markup approach of both BS and my previous work on the site – the old and new CSS didn’t conflict directly, but intermingled relatively harmlessly.

Now the job was to go through the original CSS and HTML finding any specific classes (and div structures of course) for Blueprint or my custom elements, like rounded corners from years before border-radius. (I did chuckle when the major browsers finally implemented border-radius and box-shadow – just in time for flat design.) This was the “easy but long” task, after the quick wins of importing such power from all these great frameworks.

I am truly appreciative of tools such as Bootstrap, Gulp, Bower and SASS. Over the 30+ years I’ve been developing I have implemented similar frameworks or solutions for myself or my teams, before they existed publicly. I know how hard they are to get right. It’s a real pleasure to use well designed tools built by people who really know what they’re used for. Plus it’s a relief not to have to build it myself again as languages shift in and out of fashion! (Ah the memories, that old Perl CMS… countless templating systems… the time we cleverly named “Deepstrap” then immediately regretted Googling the name for trademarks.)

Getting BrowserSync to work perfectly took a couple of attempts. I saw the “inbuilt server” wasn’t useful to me as I have a CMS and backend and it only serves flat HTML. So I tried the proxy, but it replaced all my nice SEO URLs and local domain with simple IP addresses, which defeated the routing. So I eventually built the snippet injection into my application itself – i.e. my web application is now “Browsersync Aware”.

To do this, I first added a controller parameter to enable browserSync in a session, but then also configured it to be always-on in the DEV deployment (avoiding having to enable it in many devices, but still allowing occasional debugging in production). My body template is now rendered thus:

<body <?= isset($body_id) ? 'id="'.$body_id.'"' : '' ?>>
<?= isset($browserSync) ? '<script async src="//'.$_SERVER['SERVER_NAME'].':3000/browser-sync/browser-sync-client.2.9.11.js"></script>'  : '' ?>

The gulpfile is still evolving, but this is how it currently works. Everything is built on-change via watch and deployed directly to the site directories.

// MOTIf Front-end src build - Gulp file

// Define base folders
var src = 'src';
var dest = '..';

var gulp = require('gulp');
var concat = require('gulp-concat');
var rename = require('gulp-rename');
var uglify = require('gulp-uglify');
var sass = require('gulp-ruby-sass');
var debug = require('gulp-debug');
var browserSync = require('browser-sync').create();


// JS build
gulp.task('scripts', function() {
 return gulp.src(src+'/js/*.js')
  //  .pipe(debug({title: 'debugjs:'}))
  .pipe(concat('main.js'))
  .pipe(rename({suffix: '.min'}))
  .pipe(uglify())
  .pipe(gulp.dest(dest+'/js'))
  .pipe(browserSync.stream())
  ;
});

// CSS build
gulp.task('sass', function() {
 //return sass(src+'/scss/**/*.scss', {verbose: false})/* NB: glob fixed a frustrating "0 items" problem! */
 return sass(src+'/scss/styles.scss', {verbose: false})// prevent multi-compile of includes in this folder - it's a pure tree.
  .on('error', function (err) {
   console.error('Error!', err.message);
  })
  //.pipe(debug({title: 'debugsass:'}))
  .pipe(rename({suffix: '.min'}))
  .pipe(gulp.dest(dest+'/css'))
  .pipe(browserSync.stream())
  ;
});

// hawtcher bee watcher
gulp.task('watch', function() {
 browserSync.init({
  notify: false // the "connected to browsersync" message gets in the way of the nav!
 });
 gulp.watch(src+'/js/*.js', ['scripts']);
 gulp.watch(src+'/scss/*.scss', ['sass']);
 gulp.watch('../system/application/views/**/*.php').on('change', browserSync.reload);
 gulp.watch(src+'/images/**/*').on('change', browserSync.reload); // no image build step here, just reload
});

// Go! (BrowserSync is started by the 'watch' task above)
gulp.task('default', ['scripts', 'sass', 'watch']);

Another useful responsive developer tool is the Chrome device simulator, which performs viewport and user-agent spoofing. However be warned that it doesn’t accommodate the extra cruft the actual device browsers incur, such as the address bar, tabs, status bar etc., so the actual viewports will be significantly smaller. Real device testing is still the only way to be sure, but paid services such as BrowserStack can also help automate this.


There’s still a way to go with the redesign. I’ve only redesigned the public-facing pages, not the inner areas where the tests are done, but I’m pretty pleased to bring the site (almost) up to date, and therefore to allow the experimenters to administer these tests on more convenient devices.

With over 5,000 registered professionals and 12,000 children tested so far, the site has gradually become a valuable resource to many teachers and clinicians. I want to ensure it’s kept usable and useful into the future, for the next generation of kids who’ll need help with learning to read and write.


 

Posted in Projects | Tagged , | Leave a comment