This is part three of an eight-part brain-dump of things I learned developing the initial version of the Machine Learning (ML) aspect of SYRAS, the Systematic Review Assistant.
One of the things which seems obvious in retrospect, but which hampered my initial research into ML tools, techniques and frameworks, was the distinction between general ML and natural-language-specific processing.
There are many mathematical techniques available which can be used to represent and solve problems, but some problem spaces are harder to squeeze into a statistical model than others. Core algorithms like Bayes, LSI, SVM or regressions, and broader techniques like clustering, classification and search, are mostly applied directly to numbers. Turning words and sentences from various languages and different styles of writing or speech into numbers adds a great deal of complexity to the system.
There are some frameworks out there which are NLP-specific from end to end. These aim to provide a one-stop shop for language-oriented applications. The alternative is to combine NLP pre-processing and pipe the statistical representations produced into more generic ML algorithms. Using an NLP framework should be easier, but may have limited depth of features or tweak-ability. The latter DIY approach will give you more options but would be more difficult for a beginner.
The situation is the same for most problem spaces, as it would be rare for raw data to be perfectly ready for ingestion by an ML pipeline. For example, computer vision applications such as facial recognition, image similarity search or self-driving cars require a great deal of number crunching to process raw images or video frames into a representation suitable for the ML tools. Audio processing, financial prediction, weather modelling or chemical modelling would all have their own quirks, common data formats and pre-processing tricks.
When you are researching techniques or tools, the trick is to understand which domain the tool chain was originally intended and developed for, and whether that matches your problem domain – or whether there are components which are generic enough and well designed enough to be used independently. It’s quite possible there might be a great classification algorithm buried in a vision tool which might work for text – given the right transformations. But usually for NLP it’s much easier to find language-oriented tools.
Some of the concepts you will need to be familiar with, to transform text into statistics for general ML are outlined below.
- Vectors, word embeddings
- Bigrams, compound words (e.g. self-concept)
- TF-IDF Preprocessing
- Lemmatisation (general case-reduction)
- Stop Words
- Metadata, recovering the original text (after working on tokens/data)
Vectors, word embeddings, dimensions
Before getting into language quirks, it’s important you understand the concept of what has more recently become known as “word embedding”, but has been known for decades variously as vectorisation, multi-dimensional semantic cluster analysis, N-dimensional maps and the like. I studied it at university in the 1990s and was fascinated with the idea of plotting all the words in the world in a vast N-dimensional matrix according to some impossibly contrived semantic properties and hoping some sense would arise from their locations.
I always preferred the word co-ordinate to the now-preferred vector; it just makes more sense to me to imagine a word “at” position [3, 67, 23, 300] which is “near” to [4, 64, 21, 290], like in a giant galactic map. I think the difference is that a vector is more of a general direction and distance, whereas a co-ordinate is always measured from the origin point of 0,0.
Anyway, the process of vectorising your language (corpus, document or search query) is required because most later ML stages operate on multi-dimensional representations. Usually this involves analysing word frequencies in relation to one another and using some tried-and-tested maths techniques (e.g. Bayesian) to extract some meaning out of the original text. These frequencies or relationships are the numbers the ML algorithms can “understand” – but they are nowhere near operating on the actual text or the language constructs you would hold in your mind, like subject/object or even noun-phrases.
It feels to me as if most NLP/ML tools and techniques have settled on this small subset of text processing and seem to have almost forgotten about the fact they aren’t even trying to parse the text e.g. via grammars and form lexical sentence structures, let alone try to attach semantics. I can only assume it was all too hard and intricate and the brute-force statistical giants won on average – and that was good enough to get an edge in business.
I went along happily with many of the currently “standard” NLP techniques for a while, until, after fully understanding the maths underneath, I realised how limited the approach is compared to the AI research I studied in the 1990s. For example, the vectorisation we were exploring was one where the dimensions were semantic concepts like “activity”, “size” or “reality”. Combining this with grammatical sentence parsing into tree structures could allow you to explore the relationships between words or phrases in a sentence, or between different sentences, in terms of semantic similarities and differences. It gets very complex and taxing very quickly, but the aims are much higher than just counting occurrences of words in a flat corpus. An example is the Bag of Words (BOW) concept common in some modern approaches, which completely destroys all grammatical structure into a soup of words and then tries to analyse them. It seems crazy to purposely lose that amount of information, but it seems it’s easier to do the statistics with huge CPUs than to try to construct an intricate model of “real” linguistic-semantic modelling.
Perhaps there are alternate vectorisation techniques out there, but the only one I came across over and over again was quite simply counting frequencies of words in documents, or entire corpora – to give highly averaged topical guesswork. Perhaps now we have the computing horsepower, it’s time to explore other “embedding” techniques.
TF-IDF preprocessing

To put a name to a face, the de-facto vectorisation technique I was referring to above is TF-IDF. This is the main tool for turning lots of words into lots of numbers. I won’t explain it from scratch here as other people have already done that better than I could.
As I’ve mentioned, it’s an amazingly simple but effective technique for extracting topical statistics from any natural language (it could actually be applied to almost any dataset). The TF simply counts cards, and the IDF places the bets – together they determine the keywords probably relevant to each document in a corpus.
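As a rough illustration of the card-counting and bet-placing, here is a minimal pure-Python sketch of plain TF-IDF. Real libraries such as scikit-learn layer smoothing and normalisation on top of this basic formula, so treat it as the idea, not any particular implementation:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Plain TF-IDF weights for every term in every document.
    corpus: a list of documents, each a list of tokens."""
    n_docs = len(corpus)
    df = Counter()                  # document frequency per term
    for doc in corpus:
        df.update(set(doc))         # count each term once per document
    weights = []
    for doc in corpus:
        tf = Counter(doc)           # raw term counts ("counting cards")
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()   # the IDF "places the bets"
        })
    return weights

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the weather is mild".split(),
]
w = tf_idf(corpus)
print(w[0]["the"])  # 0.0 -- "the" is in every document, so raw IDF zeroes it
print(w[0]["mat"])  # the top keyword of document 0: unique to that document
```

Note how the weighting falls straight out of the counts: “mat” outranks “cat” in the first document simply because “cat” also appears elsewhere in the corpus.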
Because of the workflow of a Systematic Review, I wanted to be able to do what I called incremental processing – i.e. to be able to add a new unlabelled document into the corpus and fold it into an existing set of statistics (vectorise it). I found this didn’t work very well in practice, so I had to rework the application to pre-process the entire corpus in advance, which is much more common. The approach was so uncommon that I had trouble finding references on whether the maths could even handle it.
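A tiny example shows why incremental vectorisation is awkward: adding even one unrelated document changes the document frequencies, and therefore the IDF weight of every existing term, so previously computed vectors go stale. (The helper below is illustrative, not from any particular library.)

```python
import math

def idf(term, corpus):
    """Plain inverse document frequency of a term over a tokenised corpus."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

corpus = [["cat", "mat"], ["dog", "bone"], ["cat", "dog"]]
before = idf("cat", corpus)        # log(3/2)

corpus.append(["fish", "bowl"])    # one new, totally unrelated document...
after = idf("cat", corpus)         # ...and "cat"'s IDF shifts to log(4/2)

print(before, after)  # every pre-computed vector is now out of date
```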
I was also tripped up by certain libraries (like the JS Natural library) including stop-word processing and tokenisation quirks in the TF-IDF stage. I hadn’t expected it to be doing anything other than TF-IDF. This is an example of an NLP all-rounder trying to be helpful, but when you are doing a very thorough study you want control over all the aspects.
TF-IDF also introduces hyper-parameters in the form of cut-off thresholds which can be difficult to anticipate as they are not always “normalised” figures, so the optimal thresholds could vary between datasets.
Scaling and normalisation are another consequence of TF-IDF which will affect your results. I found the scaling options in scikit-learn’s TF-IDF affected the performance of the Support Vector Machine (SVM) algorithm in interesting ways. I also wasted time experimenting with normalisation stages which were really having no further effect! This is one of the risks of the DIY approach – there are so many combinations that the documentation of each stage cannot possibly explain whether your mathematical pipeline is valid. For example, LSA requires a normalised distribution, which TF-IDF provides, but as LSA can be used on other datasets, the LSA docs probably won’t discuss TF-IDF (unless in a more closed framework).
N-grams

The concept of n-grams is common and appears to be taken for granted in most NLP libraries. They are just single words, or pairs, triplets, quadruplets etc. of adjacent words, depending on the value of N. I see them as a colossal fudge – following on from my disappointment above at the loss of complex parsing. Instead of parsing with grammars, we now just take nearby pairs of words and analyse their occurrence frequencies, in a “bag of n-grams” presumably. You could in theory combine a 1-gram, 2-gram and 3-gram representation in an ensemble if you have the horsepower.
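To make the fudge concrete, generating n-grams really is just sliding a window over the token list – a sketch:

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
print(ngrams(tokens, 1))  # unigrams: ('the',), ('quick',), ...
print(ngrams(tokens, 2))  # bigrams: ('the', 'quick'), ('quick', 'brown'), ...
print(ngrams(tokens, 3))  # trigrams: ('the', 'quick', 'brown'), ...
```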
While I concede again that the brute-force statistical approach has “won”, I can’t help visualising how wrong a bigram can be when it happens to span two relatively distant grammatical branches.
There are also “skip grams” which I think represent a middle ground between grammatical parsing and statistical analysis. They allow a representation of the relationships between all combinations of nearby words which theoretically could glean the grammar from the text to some degree. Perhaps in modern informal text and speech this will be as accurate as you could ever get.
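A minimal sketch of one common definition of skip-bigrams – pairs of tokens at most k words apart, so k = 0 reduces to ordinary bigrams (library implementations vary in the details):

```python
def skip_bigrams(tokens, k):
    """Pairs of tokens at most k words apart (k=0 gives ordinary bigrams)."""
    return [
        (tokens[i], tokens[j])
        for i in range(len(tokens))
        for j in range(i + 1, min(i + 2 + k, len(tokens)))
    ]

tokens = "the cat sat down".split()
print(skip_bigrams(tokens, 0))  # plain bigrams only
print(skip_bigrams(tokens, 1))  # also pairs that "skip" one intervening word
```

The k = 1 output includes pairs like ('the', 'sat'), which is how skip-grams capture some longer-range structure that strict bigrams miss.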
Bigrams for real compound words (e.g. self-concept)
Related to the above concept of n-grams (which are a purely statistical trick), is the fact that in reality most languages have some type of compound word structure. This can pose problems in NLP.
I fortunately hit this quite early as my initial corpus was about “self-concept” in children, so I was able to explore and test it.
If your tokeniser is set to strip out punctuation, then you need to be aware that “self-concept” will turn into the two words “self concept”, which will then be processed independently. The two words on their own are not necessarily of interest when the study is very specifically about self-concept. Lemmatisation (below) would blur this even more, as selfish might turn into self and conceptual into concept, and we really weren’t interested in those words.
Will the above statistical n-gram analysis recover this loss and discover a correlation between these two words? Does it even understand the difference between “self concept” and “concept self” in n-gram terms? The latter would have come from a very different source.
Of course, these days the hyphen is a slowly dying feature of English (one of my favourite complaints). So perhaps again we need to turn to the magic of statistics.
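One way to sidestep the problem, assuming a regex-based tokeniser, is to let the token pattern keep internal hyphens – a sketch:

```python
import re

text = "Self-concept in children: a study of self-esteem"

# A naive tokeniser that treats all punctuation as a separator
naive = re.findall(r"[a-z]+", text.lower())

# A pattern that keeps internal hyphens as part of the word
hyphen_aware = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

print(naive)         # ... 'self', 'concept', ... -- the compound is destroyed
print(hyphen_aware)  # ... 'self-concept', ...    -- the compound survives
```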
Lemmatisation and stemming (grouping inflections)
The need to group word inflections is another necessary evil which should be applied with care, and evaluated for how much it is helping your application – depending on the language you are processing. Most languages have some level of word variation such as plurals, genders and tenses, but some have other forms leading to large numbers of inflections.
This is important to vectorisation and further topic analysis as it will greatly affect (reduce) the number of dimensions or topics produced. When you are doing what I call brute-force statistics, before feeding into ML vector-based algorithms, you will certainly want to group inflected words into a base form before calculating the frequencies or relationships, or else the figures will be greatly diluted and the representation fragmented.
If you are going to use the “topics” later and be re-presenting them to the end user (discussed below), then you should favour lemmatisation over stemming as it will produce real “keywords” the user will understand, rather than made up stems like “frequen”.
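The difference is easy to see with toy versions of each – a deliberately crude suffix-stripper versus a small lookup table. Real stemmers and lemmatisers (NLTK, spaCy, etc.) are far more sophisticated; these rules and names are purely illustrative:

```python
# A crude suffix-stripping "stemmer": fast, but produces non-words
SUFFIXES = ("tly", "ies", "ing", "s")

def crude_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

# A toy lookup "lemmatiser": maps inflections to real dictionary words
LEMMA_LOOKUP = {"frequently": "frequent", "studies": "study", "running": "run"}

def lookup_lemma(word):
    return LEMMA_LOOKUP.get(word, word)

print(crude_stem("frequently"))    # 'frequen' -- not a word you can show a user
print(lookup_lemma("frequently"))  # 'frequent' -- a presentable keyword
```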
I can only imagine how complicated and full of fudge-factors, lookups and heuristics lemmatisation algorithms must be. I really wouldn’t care to write one myself, but sourcing a good one is vital.
Stop words

Another relatively contentious issue (to me) is the concept of discarding “boring words we don’t care about”. Annoyingly, the concept works very well, but I always feel the maths should take care of this implicitly and we shouldn’t need another set of fudge-factors, language-specific exceptions and, effectively, a hyper-parameter which needs tuning.
My unanswered question is: why doesn’t TF-IDF intrinsically get rid of “a” and “the”? Seeing as they appear in all texts, they should be weighted to the bottom and optionally thresholded out. But in practice, adding the clumsy “stop word processing” option usually improved my results, so you tick the box and move on.
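Part of the answer, I suspect, is IDF smoothing. The raw IDF of a word present in every document really is zero, but the smoothed variant many libraries apply by default (scikit-learn, for instance, uses ln((1+N)/(1+df)) + 1) never reaches zero, so ubiquitous words still leak some weight into every vector:

```python
import math

n_docs, df = 1000, 1000  # a word like "the", present in all 1000 documents

raw_idf = math.log(n_docs / df)                   # 0.0 -- weighted away entirely
smoothed = math.log((1 + n_docs) / (1 + df)) + 1  # 1.0 -- still contributes

print(raw_idf, smoothed)
```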
Metadata, for recovering the original text
After converting your text into data via a pipeline of tokenising, lemmatising, summarising, vectorising, componentising and multi-dimensional embedding your ML algorithm eventually comes up with some insights and perhaps classifies something or finds some related topic or document. But how do you reverse this entire process to get back to the original language with which you are meant to present the output to the user?
I found in many of the frameworks this step was difficult to achieve. I think it shows how academic some of the tools have become and how far they are from being real-world “product components”. After all, the whole job of an ML application is to deliver meaningful results back to a human (at some point), so it seems almost humorous that this is an afterthought. I can imagine several product managers I know chuckling at “how developers are” in this respect – when you present a 50-dimensional vector back to them as their search results, even though underneath the system worked perfectly and had in fact perfectly identified their favourite kind of cat video from billions of training cases.
As far as I could tell, it’s mostly up to you to preserve a certain thread of metadata throughout the transformation process to be able to “tag” the data back to the source documents, words, topics or whatever level you will be correlating with.
While looking for answers, I found more questions, so I wrote a specific case of how I did this in Gensim on Stack Overflow: https://stackoverflow.com/a/46659931/209288.
Later I did a similar thing in SciKit Learn – I built an API around the classifier/search which took document UUIDs and tracked them through the analysis, so that “neighbour” documents could be retrieved by UUID. Then later on in the GUI / application layer, the actual text, title and abstract of the document, and whatever other screening labels or comments had been added, could be retrieved and presented to the user.
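A sketch of the idea, with hypothetical document data and a naive dot-product search standing in for the real classifier – the point is only the UUID thread running from input to output:

```python
import uuid

# Hypothetical documents keyed by our own UUIDs (titles and vectors invented)
documents = {
    str(uuid.uuid4()): {"title": "Self-concept in children", "vector": [1.0, 0.0]},
    str(uuid.uuid4()): {"title": "Self-esteem and schooling", "vector": [0.9, 0.1]},
    str(uuid.uuid4()): {"title": "Weather modelling", "vector": [0.0, 1.0]},
}

# Freeze an ordering so row i of the matrix maps back to a UUID
index_to_uuid = list(documents)
matrix = [documents[u]["vector"] for u in index_to_uuid]

def nearest(query, matrix):
    """Index of the row with the highest dot product against the query."""
    scores = [sum(q * x for q, x in zip(query, row)) for row in matrix]
    return scores.index(max(scores))

# The ML stage returns a row index; the metadata thread turns it back into text
hit = index_to_uuid[nearest([1.0, 0.0], matrix)]
print(documents[hit]["title"])
```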
I recommend implementing your own UUIDs and keeping maps to any of the internal tool’s own IDs, should they provide them. Don’t expose or rely on implementation-specific IDs, else you may not be able to swap out components later, and it could present security risks.
Those are some of the NLP aspects I had to pay attention to, which I eventually separated in my mind from the specifics of the generalised ML algorithms I also evaluated – coming up in the next part of this series.