Natural Language Processing Platform
The Árni Magnússon Institute for Icelandic Studies

About this platform

The Árni Magnússon Institute's language processing website can be used to process Icelandic text.
The tools we make available here are:

  • Tokenizer - Tokenizer from Miðeind ehf
  • PoS Tagger - ABLTagger from the Language and Voice Technology Lab at Reykjavik University
  • Lemmatizer - Nefnir by Jón Friðrik Daðason
  • Hyphenation Tool - Skiptir from The Árni Magnússon Institute

A more detailed description of the tools and their functionality can be found below.

Tokenizer

One of the basic tasks in text processing is to divide the text into units, usually sentences and tokens. Errors made at this initial stage of data preparation persist throughout the entire processing pipeline. The tool that performs this task is called a tokenizer.

The tokenizer used here is called Tokenizer and is developed by Miðeind. It is released as a Python package and available on the Python Package Index (PyPI). More information about the program can be found here.
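
To make the task concrete, the following is a minimal sketch of sentence and token splitting using only Python's standard library. It is not the Miðeind Tokenizer and does not use its API; the function names (split_sentences, tokenize, segment) and the splitting rules are assumptions chosen purely for illustration.

```python
import re

# Naive illustration only; this is NOT how the Miðeind Tokenizer works.
# Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Assumed rule: a token is a run of word characters or one punctuation mark.
TOKEN = re.compile(r"\w+|[^\w\s]")


def split_sentences(text):
    """Split text into sentences at whitespace that follows ., ! or ?."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence):
    """Split one sentence into word and punctuation tokens."""
    return TOKEN.findall(sentence)


def segment(text):
    """Return a list of sentences, each given as a list of tokens."""
    return [tokenize(s) for s in split_sentences(text)]


if __name__ == "__main__":
    for tokens in segment("Hesturinn hljóp. Hvar er hann?"):
        print(tokens)
```

A rule this simple breaks on abbreviations: split_sentences("Þetta er t.d. gott.") cuts the sentence in two after "t.d.". Mistakes of this kind are what carry over into tagging and lemmatization, which is why a dedicated tokenizer is used instead.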

PoS Tagger

A Part-of-Speech (PoS) tagger reads in text and tags each token with a text string that indicates the word's part of speech and other grammatical features, such as case, grammatical gender and tense.

The PoS tagger used here is ABLTagger 3.0. It is maintained by the Language and Voice Technology Lab at Reykjavik University (CADIA-LVL). It builds on the Bi-LSTM tagger ABLTagger 1.0, which was originally developed by Steinþór Steingrímsson, Örvar Kárason and Hrafn Loftsson in the spring of 2019.

Lemmatizer

On this platform, PoS-tagged words are sent to a lemmatizer, which reads in PoS-tagged text and lemmatizes it, i.e. records the base form (lemma) of each word (e.g. hestur for hests).

Word lemmas are retrieved using Nefnir, developed by Jón Friðrik Daðason.
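
As a rough illustration of why the lemmatizer takes PoS-tagged input, the sketch below maps a (word, tag) pair to a lemma with tag-specific suffix rules. It is a toy example, not Nefnir's code or API: the RULES table, the function names and the tag string "nkee" (assumed here to mark a masculine noun in the genitive singular) are all hypothetical.

```python
# Toy sketch only; NOT Nefnir's implementation or API.
# Hypothetical suffix rules keyed by PoS tag: (suffix to strip, suffix to add).
# The tag "nkee" and the single rule below are illustrative assumptions.
RULES = {
    "nkee": [("s", "ur")],
}


def lemmatize(word, tag):
    """Return a lemma for word given its tag, or the word itself if no rule fits."""
    for old, new in RULES.get(tag, []):
        if word.endswith(old):
            return word[: len(word) - len(old)] + new
    return word  # fall back to the surface form


def lemmatize_tagged(tagged):
    """Attach a lemma to every (word, tag) pair in a tagged sentence."""
    return [(word, tag, lemmatize(word, tag)) for word, tag in tagged]


if __name__ == "__main__":
    print(lemmatize_tagged([("hests", "nkee")]))
```

The tag selects which rules may apply, so the same word ending can be treated differently depending on the grammatical analysis the tagger assigned.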

Hyphenation Tool

Skiptir is a hyphenation tool that inserts hyphenation points into the words of a text.