Trusted by more than 1,800 merchants across the globe (six continents and counting!), Findify is a veritable linguistic expert. From Hebrew to Japanese, Hungarian to Norwegian, experts at Findify have molded their Site Search and Personalization tools to work across more than 20 languages.
Not all languages were created equal, however. Some are intensely complex and require careful decompounding to be searched well. Swedish is a particularly good example.
While Findify is known today as a global provider, the company was founded in Sweden. As a result, a large segment of the Findify client base is still located in this corner of Scandinavia.
Like most smaller languages, Swedish has far fewer resources online than more widely spoken languages such as English, Spanish, French and German, and so there are relatively few open-source solutions that properly analyze and handle the Swedish language.
In true Findify fashion, the company decided to build their own scalable solution. Rather than a narrow fix applicable only to Swedish, they came up with a methodology with broad applicability, and perfected Swedish along the way.
English search
When you search for “hand cream” in English, your search query is typically piped through a set of normalization filters:
- Tokenize: for English, it’s just split on whitespace and punctuation
- Downcase: so the lowercase and UPPERCASE and MiXedCase queries will be the same
- Remove stopwords: so “plåster för barn” (plasters for children) will be “plåster barn”
- Stem: remove word endings, so “hands” will become “hand”
The process is quite simple for the English language, and there are a lot of out-of-the-box technical solutions for it: ElasticSearch and Apache Lucene are popular and widely used examples.
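As a rough illustration (not Findify’s production code), the whole English pipeline above can be sketched in a few lines of Python; the tiny stopword list and the suffix-stripping “stemmer” here are toy stand-ins for real analyzer components such as those in Lucene or ElasticSearch:

```python
import re

# Toy stand-ins for real analyzer components (e.g. Lucene/Elasticsearch filters):
# a minimal sketch of the tokenize -> downcase -> stopwords -> stem pipeline.
STOPWORDS = {"for", "the", "a", "an", "and", "of"}

def naive_stem(token: str) -> str:
    """Extremely simplified stemmer: strip a few common English endings."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(query: str) -> list[str]:
    tokens = re.findall(r"\w+", query.lower())           # tokenize + downcase
    tokens = [t for t in tokens if t not in STOPWORDS]   # remove stopwords
    return [naive_stem(t) for t in tokens]               # stem

print(analyze("Hand Creams for the hands"))  # -> ['hand', 'cream', 'hand']
```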
Why is English different from other Germanic languages?
Germanic languages have the concept of the “compound word”, in which multiple dependent words are merged into one long super-word. This behavior can be observed even in English (example: basketball), but it’s not nearly as frequent as in Swedish. So “hand cream” becomes “handcreme”, which makes the English-style search flow a bit more complicated:
- How can you tokenize the word stream, if there are no spaces in between?
- What if the whole sentence is a single word? What about spårvagnsaktiebolagsskensmutsskjutarefackföreningspersonalbeklädnadsmagasinsförrådsförvaltarens?
The general idea behind compound-word analysis in search is to leverage spell-checker dictionaries to split words into parts and then index those parts as usual. ElasticSearch even has some support for this in German, but not for Swedish. And even for German the splitting was quite aggressive and could split a word like “skyscraper” into “sky+crap”, which could lead to quite unexpected search results.
Academic approach
There is quite a lot of academic research happening around the Swedish language, so we checked out a couple of the most cited articles about decompounding in Germanic languages. Most of them can be summarized into a single algorithm:
- For a compound word “ansiktsmask” (face mask), generate all possible split combinations: an+siktsmask, ansikts+mask, ansiktsm+ask, ansiktsma+sk.
- Score them by the probability that this split is correct.
- Take the highest-scoring one as correct, if its score is above a threshold.
- Recursively repeat steps 1-3 for the left and right parts of the split.
Which seems quite trivial at first sight.
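A minimal sketch of that algorithm in Python might look like the following; `score_split` is whatever scoring function you plug in (the next sections discuss how to build one), and the names and threshold are illustrative, not Findify’s actual code:

```python
from typing import Callable, Iterator

def candidate_splits(word: str, min_part: int = 2) -> Iterator[tuple[str, str]]:
    """Step 1: generate every (left, right) split with both parts long enough."""
    for i in range(min_part, len(word) - min_part + 1):
        yield word[:i], word[i:]

def decompound(word: str,
               score_split: Callable[[str, str], float],
               threshold: float = 0.5) -> list[str]:
    """Steps 2-4: score all candidates, keep the best one if it clears the
    threshold, then recurse into both halves."""
    scored = [(score_split(left, right), left, right)
              for left, right in candidate_splits(word)]
    if not scored:
        return [word]
    best_score, left, right = max(scored)
    if best_score < threshold:
        return [word]               # not confident enough: keep the word whole
    return (decompound(left, score_split, threshold)
            + decompound(right, score_split, threshold))
```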
Scoring the split candidates
The simplest approach would be to score these candidates by word frequency: for su+nshine, sun+shine, suns+hine and sunsh+ine it’s quite clear that sun+shine is the most probable one.
But this approach quickly produced quite a lot of false-positive splits. The word “thermometer”, for example, should not be split at all, yet “thermo” and “meter” are both very frequent words!
So instead of piling special cases onto the algorithm, we decided to trust the machine and build an ML model that attempts to learn the best scoring rule from the training data.
Machine learning Swedish compounds
To ask the machine to learn something, you first need a lot of examples: training data. For Swedish compounds, we found two suitable datasets:
- The Folkets Dictionary, which contains split information for some known words
- The Talbanken dataset, which also contains this data.
The main problem with these datasets is that they contain only positive samples (“this word is a compound”). We can treat all other words as negative samples (“every other word is non-compound”), but that’s not always true: other words may also be compounds that simply haven’t been marked as such yet. As we had no other data at this stage, we still trained the model as is, using the following input features:
- Word frequency of left, right parts and the whole word (for both normalized and non-normalized forms)
- Length of word parts
- Part-of-speech for all of the parts
- Left and right part suffix frequency
In total there were about 28 input features for each split candidate (which expanded into a vector of 1,282 numbers in the end). From the ML point of view, the whole dataset looked like this:
| Split | Correct? | Left freq | Right freq | Left length | Right length | … |
| --- | --- | --- | --- | --- | --- | --- |
| su+nshine | no | 0.00092% | 0% | 2 | 6 | … |
| sun+shine | yes | 0.0095% | 0.00064% | 3 | 5 | … |
| suns+hine | no | 0.00014% | 0% | 4 | 4 | … |
So the model can try to build the (hopefully) perfect splitting rule based on the feature columns of the table above. After training, for each split the model emits a number: the estimated probability that the split is correct.
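As a hedged illustration of what that training step might look like, here is a tiny sketch using scikit-learn’s gradient boosting; the feature set is reduced to four columns and the classifier choice is just one reasonable option, not necessarily the model Findify used:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Each row is one split candidate; columns mirror the table above
# (left/right frequency in percent, left/right part length), simplified.
X = np.array([
    [0.00092, 0.0,     2, 6],   # su+nshine
    [0.0095,  0.00064, 3, 5],   # sun+shine
    [0.00014, 0.0,     4, 4],   # suns+hine
    # ... thousands more labelled candidates in the real dataset ...
])
y = np.array([0, 1, 0])  # 1 = correct split, 0 = incorrect

model = GradientBoostingClassifier().fit(X, y)

# For a candidate, the model emits the probability that the split is correct.
prob_correct = model.predict_proba(X[[1]])[0, 1]
```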
The problem is that the resulting model is not always completely sure whether a split is correct. See the histogram of the score distribution below:
As you can see, when the score is close to 1.0 (or to 0.0), the model is almost sure that the split is correct (or not possible). But what about all the values in the middle?
Leveraging human power
The more good training data you have, the better the model. So to further improve it, we generated 3,000 split candidates the model was not really sure about and asked a couple of native Swedish speakers to help label them.
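Selecting those “uncertain” candidates can be sketched as follows; the score band and names are illustrative assumptions, since the exact selection rule Findify used isn’t stated:

```python
def uncertain_candidates(scored_splits, low=0.3, high=0.7, limit=3000):
    """Pick the split candidates the model is least sure about, for human labelling.

    `scored_splits` is an iterable of (word, left, right, score) tuples produced
    by the model; the (low, high) band is an illustrative choice."""
    middle = [s for s in scored_splits if low <= s[3] <= high]
    # The closer the score is to 0.5, the less sure the model is.
    middle.sort(key=lambda s: abs(s[3] - 0.5))
    return middle[:limit]
```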
We also expanded the dataset: if sun+shine is a compound, then all its forms, like sun+shines and sun+shining, are also compounds.
After we carried out this set of improvements, the model was much more reliable at detecting the proper splits:
But one question still remained: which threshold value defines a proper level of confidence that a split is correct?
Choosing the threshold
Formally, we are doing binary classification here: the model emits a confidence score between 0 and 1, and we need to choose the threshold above which a word is considered a compound.
Once the threshold is chosen, all classification results fall into four categories:
| | Classified as compound | Classified as non-compound |
| --- | --- | --- |
| Compound | True positive | False negative |
| Non-compound | False positive | True negative |
And the numbers can look like this:
| | Classified as compound | Classified as non-compound |
| --- | --- | --- |
| Compound | 73% | 8% |
| Non-compound | 6% | 17% |
The share of correctly classified compounds out of all words classified as compound is called precision. And recall is the share of correctly classified compounds out of all actual compounds. By moving the threshold up and down you can balance precision against recall.
There is also a standard way to pick the best combination of precision and recall, called the F1 score (their harmonic mean). In our case the best F1 was around 0.97, which corresponded to a threshold of 0.88.
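Picking such a threshold boils down to a standard precision-recall sweep; a minimal sketch with scikit-learn (variable names are illustrative, and the held-out labels and scores are assumed to come from the evaluation set):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_threshold(y_true, y_score):
    """Return the threshold that maximizes F1 on held-out data.

    y_true: 1 for real compounds, 0 otherwise; y_score: model probabilities."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # F1 is the harmonic mean of precision and recall at each threshold.
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    best = np.argmax(f1[:-1])   # the last precision/recall point has no threshold
    return thresholds[best], f1[best]
```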
Word forms
But being able to split (or not split) a word into its parts is still only the first step toward proper Swedish language support:
- Swedish compounds may have the -s- connector in between parts.
- The left part may lose or change its ending when it’s part of a compound word.
So to properly support Swedish, we also need an extensive dictionary of word forms, so that for a word like “shining” we know the lemma (base) form is “shine”. For Swedish, we used the same Folkets/Talbanken datasets, plus the SALDO morphological dataset for all the word forms, which further improved our ability to properly search Swedish.
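In practice this boils down to a large lookup table from word form to lemma, plus handling of the -s- connector between compound parts; a simplified sketch (the dictionary entries below are only a hint of what the Folkets/Talbanken/SALDO data provide):

```python
# A tiny excerpt of a form -> lemma dictionary; the real one is built from
# the Folkets, Talbanken and SALDO datasets.
LEMMAS = {
    "tabletter": "tablett",
    "värk": "värk",
    "huvud": "huvud",
}

def normalize_part(part: str) -> str:
    """Map a compound part to its lemma, stripping a trailing -s- connector."""
    if part in LEMMAS:
        return LEMMAS[part]
    if part.endswith("s") and part[:-1] in LEMMAS:   # e.g. "värks" -> "värk"
        return LEMMAS[part[:-1]]
    return part
```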
Results
For one of our Swedish merchants, Apohem, we were able to deliver search results for complicated queries:
In the screenshot above, for example, a search for “huvudvärkstabletter” (headache tablets) is correctly decompounded into its parts. And even though there is no exact match for the word “huvudvärkstabletter” in these products, we were still able to match products containing the separate words “huvud”, “värk” and “tabletter”.
This piece was written by Findify’s Machine Learning expert Roman Grebennikov. Findify is a leading provider of AI-powered site search and personalization software. For more information on Findify’s powerful ecommerce tool, which includes solutions such as Personalized Search, Smart Collections, and Recommendations, book a demo here.