Morfessor file types

Morfessor FlatCat 1.0 style text model

The recommended format for long-term storage of Morfessor FlatCat 1.0 models is as a compressed plain-text analysis of the corpus (model.segmentation.gz) and a separate hyper-parameter file (parameters.txt). In semi-supervised training the annotated corpus is not included in these files, so the annotation file must be stored alongside the analyzed corpus. The plain-text format ensures that the model is human-readable for inspection, and that later versions of Morfessor FlatCat are able to load the model.

Model file

Specification:

<int><space><CONSTRUCTION>/<CATEGORY>[<space>+<space><CONSTRUCTION>/<CATEGORY>]*

Example:

10 kahvi/STM + kakku/STM
5 kahvi/STM + kilo/STM + n/SUF
24 kahvi/STM + ko/ZZZ + ne/ZZZ + emme/SUF
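
The line format above is simple enough to read with a few lines of Python. The following is a minimal sketch, not the loader used by Morfessor FlatCat itself; the function name parse_flatcat_line and its keyword arguments are hypothetical, and the separators are assumed to be the defaults shown in the specification.

```python
def parse_flatcat_line(line, construction_sep=" + ", category_sep="/"):
    """Parse '<count> <morph>/<cat> + ...' into (count, [(morph, cat), ...]).

    Hypothetical helper for illustration; not part of the Morfessor API.
    """
    count_str, analysis = line.strip().split(" ", 1)
    morphs = []
    for part in analysis.split(construction_sep):
        # rpartition splits on the LAST separator, so a morph that itself
        # contains the separator character keeps its full text.
        morph, sep, category = part.rpartition(category_sep)
        if not sep:
            raise ValueError("missing category tag in %r" % part)
        morphs.append((morph, category))
    return int(count_str), morphs


count, morphs = parse_flatcat_line("5 kahvi/STM + kilo/STM + n/SUF")
```

For a compressed model file, the lines can be obtained by opening the file with gzip.open in text mode and passing each line to the helper.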

Parameter file

Specification:

<KEY><colon><tab><VALUE>

Example:

corpusweight:   1.0
min-perplexity-length:  4
perplexity-threshold:   10.0
perplexity-slope:       1.0
length-threshold:       3.0
type-perplexity:        False
length-slope:   2.0
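
As an illustration of the key/value layout, the sketch below reads such lines into a dictionary. It is an assumption-laden example rather than the FlatCat implementation: parse_parameters is a hypothetical name, values are kept as strings because the expected type of each parameter is not stated here, and any whitespace after the colon is accepted.

```python
def parse_parameters(lines):
    """Read '<KEY>:<tab><VALUE>' lines into a dict of strings.

    Hypothetical helper for illustration; not part of the Morfessor API.
    """
    params = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError("expected '<KEY>:<tab><VALUE>', got %r" % line)
        params[key.strip()] = value.strip()
    return params


params = parse_parameters(["corpusweight:\t1.0", "type-perplexity:\tFalse"])
```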

Binary model

Warning

Pickled models are sensitive to bitrot. Incompatibilities between Python versions can sometimes prevent loading a model that was stored by a different version. In addition, future versions of Morfessor are not guaranteed to be able to load models saved by older versions.

For short-term development use, a binary model may be more convenient. The binary model is generated by pickling the FlatcatModel object. This ensures that all training data, annotation data, weights and other hyper-parameters are exactly the same as when the model was saved.
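
The round trip that pickling provides can be sketched with the standard library alone. The dictionary below is only a stand-in for a trained model object; no FlatcatModel is constructed, and the keys are made up for illustration.

```python
import pickle

# Stand-in for a trained model; a real FlatcatModel is NOT used here.
model_state = {
    "corpusweight": 1.0,
    "segmentations": [(10, ["kahvi", "kakku"])],
}

# Serialize the whole object to bytes, then restore it.
blob = pickle.dumps(model_state, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)
```

The restored object is an equal but separate copy of the original, which is what makes the binary form convenient for resuming work within the same Python and Morfessor versions.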

Morfessor 1.0 style text model

Morfessor FlatCat is initialized using Morfessor 1.0 style text models. These files consist of one segmentation per line, preceded by a count, where the constructions are separated by ‘ + ’. In short, they are identical to a FlatCat text model without the category tags.

Specification:

<int><space><CONSTRUCTION>[<space>+<space><CONSTRUCTION>]*

Example:

10 kahvi + kakku
5 kahvi + kilo + n
24 kahvi + kone + emme
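
A reader for this format might look like the sketch below. The name parse_segmentation_line is hypothetical, and the construction_sep keyword merely mirrors the idea of a configurable construction separator; it is not the Morfessor loader.

```python
def parse_segmentation_line(line, construction_sep=" + "):
    """Parse '<count> <construction> + ...' into (count, [constructions]).

    Hypothetical helper for illustration; not part of the Morfessor API.
    """
    count_str, analysis = line.strip().split(" ", 1)
    return int(count_str), analysis.split(construction_sep)


count, constructions = parse_segmentation_line("24 kahvi + kone + emme")
```

With the wrong separator the analysis comes back as a single unsplit item, which is the situation a loader would want to warn about.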

You can load files with a slightly different format by using the format arguments --compound-separator and --construction-separator. (The arguments --analysis-separator and --category-separator are useful for annotations and category-tagged analyses, respectively.) If you load a file using the wrong construction separator, you may see the error message:

#################### WARNING ####################
The input does not seem to be segmented.
Are you using the correct construction separator?

Text corpus file

A text corpus file is a free-format text file. Each line is split into compounds using the compound-separator (default <space>). The compounds are then split into atoms using the atom-separator. A compound can occur multiple times, and every occurrence is counted.

Example:

kahvikakku kahvikilon kahvikilon
kahvikoneemme kahvikakku
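
The counting described above can be sketched as follows. Both function names are hypothetical. Treating each character as an atom when no atom separator is given is an assumption made for this example, not a statement about the Morfessor default.

```python
from collections import Counter


def count_compounds(lines, compound_sep=" "):
    """Count every occurrence of every compound in a free-format corpus."""
    counts = Counter()
    for line in lines:
        for compound in line.strip().split(compound_sep):
            if compound:  # skip empty strings from blank lines or double separators
                counts[compound] += 1
    return counts


def split_atoms(compound, atom_sep=None):
    """Split a compound into atoms; ASSUMPTION: characters are atoms if atom_sep is None."""
    return list(compound) if atom_sep is None else compound.split(atom_sep)


corpus = ["kahvikakku kahvikilon kahvikilon", "kahvikoneemme kahvikakku"]
counts = count_compounds(corpus)
```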

Word list file

A word list corpus file contains one compound per line, possibly preceded by a count. If the same word occurs in multiple entries, their counts are summed. If no count is given, a count of one is assumed (per entry).

Specification:

[<int><space>]<COMPOUND>

Example 1:

10 kahvikakku
5 kahvikilon
24 kahvikoneemme

Example 2:

kahvikakku
kahvikilon
kahvikoneemme
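
The summing rule can be illustrated with a short sketch. read_word_list is a hypothetical name, and the example assumes that a leading run of digits followed by a space is always a count; a compound that is itself a number would need extra handling.

```python
from collections import Counter


def read_word_list(lines):
    """Read '[<int><space>]<COMPOUND>' lines and sum counts per compound."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        first, sep, rest = line.partition(" ")
        if sep and first.isdigit():
            counts[rest] += int(first)  # explicit count
        else:
            counts[line] += 1  # no count given: one per entry
    return counts


counts = read_word_list(["10 kahvikakku", "5 kahvikilon", "kahvikakku"])
```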

Annotation file

Each line of an annotation file contains one compound followed by one or more annotations of that compound. The separators between the annotations (default ‘, ’) and between the constructions (default ‘ ’) are configurable.

Annotations can also be category-tagged, by appending a slash ‘/’ and the category to each morph. If category tags are used, all morphs within the file must be tagged.

Specification:

<compound> <analysis1construction1>[ <analysis1constructionN>]*[, <analysis2construction1>[ <analysis2constructionN>]*]*

<compound> <analysis1construction1>/<analysis1category1>[ <analysis1constructionN>/<analysis1categoryN>]*[, <analysis2construction1>/<analysis2category1>[ <analysis2constructionN>/<analysis2categoryN>]*]*

Example:

kahvikakku kahvi kakku, kahvi kak ku
kahvikilon kahvi kilon
kahvikoneemme kahvi konee mme, kah vi ko nee mme

kahvikakku kahvi/STM kakku/STM, kahvi/STM kak/SUF ku/SUF
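
Both the plain and the category-tagged variants can be read by one small routine. The sketch below is illustrative only: parse_annotation_line is a hypothetical name, the defaults mirror the separators described above, and the compound is assumed to be separated from its analyses by whitespace.

```python
def parse_annotation_line(line, analysis_sep=", ", construction_sep=" ",
                          category_sep="/"):
    """Parse one annotation line into (compound, [[(morph, category), ...], ...]).

    Untagged morphs get category None. Hypothetical helper, not the Morfessor API.
    """
    compound, analyses_part = line.strip().split(None, 1)
    analyses = []
    for analysis in analyses_part.split(analysis_sep):
        morphs = []
        for token in analysis.strip().split(construction_sep):
            morph, sep, category = token.rpartition(category_sep)
            # With no separator present, rpartition leaves the text in the
            # last slot, so fall back to the whole token as the morph.
            morphs.append((morph, category) if sep else (token, None))
        analyses.append(morphs)
    return compound, analyses


tagged = parse_annotation_line(
    "kahvikakku kahvi/STM kakku/STM, kahvi/STM kak/SUF ku/SUF")
plain = parse_annotation_line("kahvikilon kahvi kilon")
```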