Use presage for improving text prediction in Maliit

asked 2014-10-19 03:10:45 +0200

Cristian Magherusan-Stanciu

I think the text prediction could be greatly improved by using Presage instead of the current XT9 engine.

Here's a demo of what it can do.

Comments

Presage should be tested with Sailfish OS as soon as possible. If the prediction engine is good, the virtual keyboard could be made smaller.

vattuvarg (2014-10-19 10:04:30 +0200)

I did not know about this project. I gather it is open source, so using it would mean fewer royalties to pay. Even if the prediction quality were no better, supporting it would pay off in the long run.

mikelima (2014-10-21 10:37:33 +0200)

answered 2017-12-10 16:57:48 +0200

This post is a wiki. Anyone with karma >75 is welcome to improve it.

updated 2017-12-10 22:23:04 +0200

ljo

I have put together an input predictor plugin for SFOS based on Presage: https://github.com/martonmiklos/sailfishos-presage-predictor

An ARM build is available on OpenRepos: https://openrepos.net/content/martonmiklos/presage-based-text-predictor

At the moment only Hungarian and Swedish ngram databases and keyboard layouts are on OpenRepos. Generating and packaging ngram databases and layouts is not a big deal; we just need a proper source for them.

I am still polishing it, and if anyone has expertise in this area, feel free to contribute, even with tips. [@ljo working on this.]

For example, I have no clue what kind of source is best for generating the ngram database. (I have used novels and some GNOME translation texts for Hungarian, but they usually lack the first-person singular/plural forms of verbs, and so on.)

Also, to make it perfect, some work still needs to be done on Presage itself. I think I will need to add a feature or custom predictor that is case sensitive on the first letter and matches only a single word. I am thinking about feeding it a database of last names and town names.
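To make the idea in the paragraph above concrete, here is a minimal Python sketch of a case-sensitive, single-word prefix predictor. It is not Presage code and does not use the Presage API: the class name `ProperNounPredictor`, its `predict` method, and the sample names are all hypothetical, and reading "first letter case sensitive" as "only suggest when the typed prefix starts with a capital" is an assumption.

```python
from bisect import bisect_left


class ProperNounPredictor:
    """Hypothetical helper, not part of Presage: case-sensitive prefix lookup."""

    def __init__(self, names):
        # A sorted, de-duplicated list allows binary search on the prefix.
        self._names = sorted(set(names))

    def predict(self, prefix, limit=3):
        # Assumption: only react to a capitalised prefix,
        # so typing "bu" never suggests "Budapest".
        if not prefix or not prefix[0].isupper():
            return []
        start = bisect_left(self._names, prefix)
        matches = []
        for name in self._names[start:]:
            # Entries sharing a prefix are adjacent in sorted order.
            if not name.startswith(prefix):
                break
            matches.append(name)
            if len(matches) == limit:
                break
        return matches


# Sample data is invented for illustration only.
predictor = ProperNounPredictor(["Budapest", "Buda", "Debrecen", "Szeged", "Szabo"])
print(predictor.predict("Bu"))  # capitalised prefix: names are suggested
print(predictor.predict("bu"))  # lower-case prefix: nothing is suggested
```

The lookup looks only at the word being typed and ignores the preceding context, which is one way to read "only matches a single word". A real predictor would also need some ranking, for example by how often a name has been used.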

It feels a bit slower than XT9 at the moment, but this might be due to the size of the ngram database.

So if you have any expertise in this field, please let me know: I would like to put together an "ngram database generation best practices" document to be able to support more and more languages.


Comments

Great, great work. Thank you very much. Just a thought from someone ignorant in this field: would it be possible to use the language dictionaries from aspell, hunspell, or similar to generate the ngrams? Or should we modify Presage to use their databases directly?

Damien Caliste (2017-12-10 17:17:45 +0200)

@Damien Caliste, yes, such dictionaries could be used as one of three sources for Presage predictions, but not for ngram generation, since that requires words in running order, which a word list does not provide. The primary source is the ngram file generated from actual texts. The second is your actual usage, which provides additional ngram frequency information that is persisted. The third is word lists derived from aspell, hunspell, etc., plus recently used words. A word list in the required format can be given in the Presage configuration file; the default is /usr/share/dict/words. But for Presage to be useful you need an ngram database generated from mixed sources, complemented by your actual language usage. Right now I'm experimenting with optimal sizes and configuration settings.

ljo (2017-12-10 21:22:07 +0200)
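The point in the reply above, that ngram generation needs words in running order, can be illustrated with a small Python sketch. It is not Presage's own tooling; the function name `count_ngrams`, the crude tokenizer, and the sample sentences are assumptions made for illustration.

```python
import re
from collections import Counter


def count_ngrams(text, n):
    """Count n-grams over running text, keeping the original word order."""
    # Crude tokenizer (assumption): lower-case runs of word characters.
    # It ignores sentence boundaries, so some n-grams span two sentences.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Invented sample text, for illustration only.
running_text = "I am going home. I am going to work. You are going home."
bigrams = count_ngrams(running_text, 2)
print(bigrams.most_common(3))

# The same vocabulary as an alphabetical word list, one word per line.
word_list = "\n".join(sorted(set(re.findall(r"\w+", running_text.lower()))))
print(count_ngrams(word_list, 2).most_common(3))
```

From the running text, pairs such as ("am", "going") are counted more than once, which is the frequency signal a predictor can use. From the alphabetical word list every pair occurs exactly once and only reflects sort order, so it says nothing about which word tends to follow which.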

Ah, ok thanks for the detailed explanation. Got it.

Damien Caliste (2017-12-10 22:07:34 +0200)

First of all, thank you very much for it! This is an exciting project that surely makes SFOS better!

If we start with English, it looks like there is a corpus, COCA, that can help us: https://www.corpusdata.org/ . I presume we can feed this to the generator. For other languages, we would probably need a similar dataset. A good start would be to contact a language institute or lab that works on the language in question; they should know whether such a corpus exists for your language. For Estonian, I think I know whom to contact. But before we start contacting everyone, it would be great to get a few databases sorted, so that we have experience with the process and know what to ask for. As English does have such a database freely accessible, should we start with that?

There also seem to be freely available n-gram databases for a few languages at https://www.ngrams.info/ . Not sure whether it's the same format, though.

rinigus (2017-12-11 08:53:43 +0200)

@rinigus, thank you from all of us. Miklós, the main developer, indicated on GitHub that with this early release it might make strategic/pedagogical sense to hold back on the number of languages (but correct me if my impression is wrong). I am currently working on best practices for adding language resources, which will hopefully be useful when the app/library is ready for more people and languages to join in. For now Hungarian and Swedish are good enough for initial testing, but I could of course release an English ngram database, which would undoubtedly attract a lot more people.

ljo (2017-12-11 11:01:51 +0200)
