The Árni Magnússon Institute's language processing website can be used to process Icelandic text.
The tools we make available here are:
One of the basic tasks in text processing is to divide the text into units, usually sentences and tokens. Errors made at this initial stage of data preparation persist throughout the entire processing pipeline. The tool that performs this task is called a tokenizer.
The tokenizer used here is called Tokenizer and is developed by Miðeind. It is released as a Python package and available on the Python Package Index (PyPI). More information about the program can be found here.
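To make the task concrete, here is a deliberately naive sketch of splitting text into sentences and tokens using only Python's standard library. It is not Miðeind's Tokenizer and does not use its API; the function names `split_sentences` and `tokenize` and both regular expressions are assumptions made up for this illustration.

```python
import re

# Naive sentence boundary (an assumption for this sketch): ., ! or ?
# followed by whitespace and an uppercase letter.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-ZÁÉÍÓÚÝÞÆÖÐ])")

# A token is either a run of word characters or a single punctuation mark.
TOKEN = re.compile(r"\w+|[^\w\s]")


def split_sentences(text):
    """Split text into sentences at the naive boundary defined above."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence):
    """Split one sentence into word and punctuation tokens."""
    return TOKEN.findall(sentence)


text = "Hesturinn hljóp heim. Hann var þreyttur!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sentence_tokens in tokens:
    print(sentence_tokens)
```

A rule this simple will mis-split text that contains abbreviations followed by a capitalized word, and it knows nothing about dates, numbers or URLs. Handling such cases is the reason to rely on a dedicated tokenizer rather than a hand-written pattern.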
A Part-of-Speech (PoS) tagger reads in text and tags each token with a text string that indicates the word's part of speech and other grammatical features, such as case, grammatical gender and tense.
The PoS tagger used here is ABLTagger 3.0. It is maintained by the Language and Voice Technology Lab at Reykjavik University (CADIA-LVL). It builds on the Bi-LSTM tagger ABLTagger 1.0, which was originally developed by Steinþór Steingrímsson, Örvar Kárason and Hrafn Loftsson in the spring of 2019.
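As an illustration of how a tag string can encode grammatical features, the sketch below decodes a noun tag. It assumes a positional layout in which the first character gives the word class, followed by gender, number and case; this layout, the letter codes and the function name `decode_noun_tag` are assumptions for the example, and the tagset actually produced by ABLTagger may differ in detail.

```python
# Assumed letter codes for a positional noun tag (illustrative only).
GENDER = {"k": "masculine", "v": "feminine", "h": "neuter"}
NUMBER = {"e": "singular", "f": "plural"}
CASE = {"n": "nominative", "o": "accusative", "þ": "dative", "e": "genitive"}


def decode_noun_tag(tag):
    """Decode a four-character noun tag into named grammatical features."""
    if len(tag) < 4 or tag[0] != "n":
        raise ValueError(f"not a noun tag in the assumed layout: {tag!r}")
    return {
        "pos": "noun",
        "gender": GENDER.get(tag[1], "unknown"),
        "number": NUMBER.get(tag[2], "unknown"),
        "case": CASE.get(tag[3], "unknown"),
    }


features = decode_noun_tag("nkee")
print(features)
```

Under these assumptions, the tag `nkee` describes a masculine singular noun in the genitive, which is the reading one would expect for a form such as hests.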
On this platform, the PoS-tagged text is passed to a lemmatizer, which records the base form (lemma) of each word (e.g. hestur for hests).
Word lemmas are retrieved using Nefnir, developed by Jón Friðrik Daðason.
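The input and output of this step can be sketched with a toy lookup table. This is not how Nefnir works internally and does not use its API; the table `LEMMAS`, the function `lemmatize` and the tag strings are hypothetical and serve only to show why the lemmatizer receives both the word form and its PoS tag.

```python
# Hypothetical lookup table: (word form, PoS tag) -> lemma.
# The tag strings are placeholders for whatever the tagger emits.
LEMMAS = {
    ("hests", "nkee"): "hestur",
    ("hesti", "nkeþ"): "hestur",
}


def lemmatize(tagged_tokens):
    """Attach a lemma to each (word, tag) pair.

    Falls back to the lowercased word form when the pair is unknown.
    """
    result = []
    for word, tag in tagged_tokens:
        lemma = LEMMAS.get((word.lower(), tag), word.lower())
        result.append((word, tag, lemma))
    return result


lemmatized = lemmatize([("Hests", "nkee"), ("hesti", "nkeþ")])
print(lemmatized)
```

Keying the lookup on the tag as well as the word form lets identical spellings with different grammatical readings map to different lemmas.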
Skiptir is a hyphenation tool that inserts hyphenation points into the words of a text.