Python library interface to Morfessor FlatCat

Morfessor FlatCat 1.0 provides a library interface for integration into other Python applications. The public members are documented below and should remain relatively stable between Morfessor FlatCat versions. Private members are documented in the code and can change at any time between releases.

The classes are documented below.

IO class

class flatcat.io.FlatcatIO(encoding=None, construction_separator=u' + ', comment_start=u'#', compound_separator=u'\s+', analysis_separator=u', ', category_separator=u'/', strict=True)

Definition for all input and output files. Also handles all encoding issues.

The only state this class has is the separators used in the data. Therefore, the same class instance can be used for initializing multiple files.

Extends Morfessor Baseline data file formats to include category tags.

read_annotations_file(file_name, construction_sep=u' ', analysis_sep=None)

Read an annotations file.

Each line has the format: <compound> <constr1> <constr2>... <constrN>, <constr1>...<constrN>, ...

Returns a defaultdict mapping a compound to a list of analyses.
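
As a sketch of the documented line format, a minimal stand-alone parser might look like the following. This is illustrative only, not the library's implementation; the sample word and its analyses are invented.

```python
from collections import defaultdict

def parse_annotation_line(line, construction_sep=' ', analysis_sep=', '):
    # Documented format: <compound> <constr1> ... <constrN>, <constr1> ...
    compound, _, rest = line.strip().partition(construction_sep)
    return compound, [tuple(analysis.split(construction_sep))
                      for analysis in rest.split(analysis_sep)]

annotations = defaultdict(list)
compound, analyses = parse_annotation_line('uncommonly un common ly, uncommon ly')
annotations[compound].extend(analyses)
# annotations['uncommonly'] == [('un', 'common', 'ly'), ('uncommon', 'ly')]
```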

read_any_model(file_name)

Read a complete model in either binary or tarball format. This method can NOT be used to initialize from a Morfessor 1.0 style segmentation.

read_combined_file(file_name, annotation_prefix=u'<', construction_sep=u' ', analysis_sep=u', ')

Reads a file that combines unannotated word tokens and annotated data. The formats are the same as for files containing only one of the mentioned types of data, except that lines with annotations are additionally prefixed with a special symbol.

read_corpus_file(file_name)

Read one corpus file.

For each compound, yield (1, compound, compound_atoms). After each line, yield (0, "\n", ()).
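
For illustration, a consumer of this yielded shape might look like the following; the stream values are invented, not produced by the library.

```python
# A hypothetical stream in the documented (count, compound, atoms) shape,
# as might be produced for the corpus line "this line":
stream = [
    (1, 'this', ('t', 'h', 'i', 's')),
    (1, 'line', ('l', 'i', 'n', 'e')),
    (0, '\n', ()),
]

# Count 1 marks a word token; count 0 marks the end of a line.
words = [compound for (count, compound, _) in stream if count > 0]
# words == ['this', 'line']
```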

read_corpus_files(file_names)

Read one or more corpus files.

For each compound found, yield (1, compound, compound_atoms).

read_corpus_list_file(file_name)

Read a corpus list file.

Each line has the format: <count> <compound>

Yield tuples (count, compound, compound_atoms) for each compound.
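
A stand-alone sketch of parsing one such line (illustrative, not the library's code; treating the compound's characters as its atoms is an assumption that holds for word segmentation):

```python
def parse_list_line(line):
    # Documented format: <count> <compound>
    count, compound = line.split(None, 1)
    return int(count), compound, tuple(compound)

entry = parse_list_line('15 uncommonly')
# entry == (15, 'uncommonly', ('u', 'n', 'c', 'o', 'm', 'm', 'o', 'n', 'l', 'y'))
```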

read_corpus_list_files(file_names)

Read one or more corpus list files.

For each compound found, yield (count, compound, compound_atoms).

read_parameter_file(file_name)

Read learned or estimated parameters from a file

read_segmentation_file(file_name)

Read a segmentation file. See the docstring of write_segmentation_file for the file format.

read_tarball_model_file(file_name, model=None)

Read model from a tarball.

write_formatted_file(file_name, line_format, data, data_func, newline_func=None, output_newlines=False, output_tags=False, construction_sep=None, analysis_sep=None, category_sep=None, filter_tags=None, filter_len=3)

Writes a file in the specified format.

Formatting is flexible: even formats that cannot be read by FlatCat can be specified.

write_lexicon_file(file_name, lexicon)

Write to a Lexicon file all constructions and their emission counts.

write_segmentation_file(file_name, segmentations, construction_sep=None, output_tags=True, comment_string=u'')

Write segmentation file.

File format (single line): <count> <construction1><cat_sep><category1><cons_sep>...<constructionN><cat_sep><categoryN>
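
A minimal formatter for this line format, as a sketch (not the library's implementation; the count, morphs, and category tags are invented, and the default separators are taken from the FlatcatIO signature above):

```python
def format_segmentation_line(count, analysis,
                             construction_sep=' + ', category_sep='/'):
    # analysis: sequence of (construction, category) pairs
    tagged = construction_sep.join(
        construction + category_sep + category
        for (construction, category) in analysis)
    return '{0} {1}'.format(count, tagged)

line = format_segmentation_line(15, [('un', 'PRE'), ('common', 'STM'), ('ly', 'SUF')])
# line == '15 un/PRE + common/STM + ly/SUF'
```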
class flatcat.io.TarGzMember(arcname, tarmodel)

File-like object that writes itself into the tarfile on closing

class flatcat.io.TarGzModel(filename, mode)

A wrapper to hide the ugliness of the tarfile API.

Both TarGzModel itself and the method newmember are context managers: writing a model requires a nested with statement.

members()

Generates the (name, contents) pairs for each file in the archive.

The contents are in the form of file-like objects. The files are generated in the order in which they appear in the archive; the recipient must be able to handle them in an arbitrary order.

newmember(arcname)

Adds a new member to the .tar.gz archive.

Parameters: arcname – the name of the file within the archive.
Returns: a file-like object into which the contents can be written. This is a context manager: use a "with" statement.
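
The write-itself-on-close pattern can be demonstrated with the standard library alone. The sketch below is a simplified analog of TarGzMember built on tarfile and io.BytesIO, not the library's code; the archive member name and contents are invented.

```python
import io
import tarfile

class TarMember(io.BytesIO):
    """File-like object that writes itself into the tarfile on closing
    (a simplified stdlib analog of flatcat.io.TarGzMember)."""
    def __init__(self, arcname, tar):
        super().__init__()
        self._arcname = arcname
        self._tar = tar

    def close(self):
        # Record the accumulated size, rewind, and add the buffer
        # to the archive before releasing it.
        info = tarfile.TarInfo(self._arcname)
        info.size = self.tell()
        self.seek(0)
        self._tar.addfile(info, self)
        super().close()

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w:gz') as tar:
    member = TarMember('params.txt', tar)
    member.write(b'corpusweight: 1.0\n')
    member.close()  # this is the point where the data enters the archive

buf.seek(0)
with tarfile.open(fileobj=buf, mode='r:gz') as tar:
    contents = tar.extractfile('params.txt').read()
# contents == b'corpusweight: 1.0\n'
```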

Model classes

Morfessor 2.0 FlatCat variant.

class flatcat.flatcat.FlatcatModel(morph_usage=None, forcesplit=None, nosplit=None, corpusweight=1.0, use_skips=False, ml_emissions_epoch=-1)

Morfessor FlatCat model class.

Parameters:
  • morph_usage – A MorphUsageProperties object describing how the usage of a morph affects the category.
  • forcesplit – Force segmentations around the characters in the given list. The same value should be used in Morfessor Baseline or other initialization, to guarantee consistent results.
  • nosplit – Prevent splitting between character pairs matching this regular expression. The same value should be used in Morfessor Baseline or other initialization, to guarantee consistent results.
  • corpusweight – Multiplicative weight for the (unannotated) corpus cost.
  • use_skips – Randomly skip frequently occurring constructions to speed up online training. Has no effect on batch training.
  • ml_emissions_epoch – The number of epochs of resegmentation using Maximum Likelihood estimation for emission probabilities, instead of using the morph property based probability. These are performed after the normal training. Default -1 means do not switch over to ML estimation.
add_annotations(annotations, annotatedcorpusweight=None)

Adds data to the annotated corpus.

add_corpus_data(segmentations, freqthreshold=1, count_modifier=None)

Adds the given segmentations (with counts) to the corpus data. The new data can be either untagged or tagged.

If the added data is untagged, you must call viterbi_tag_corpus to tag the new data.

You should also call reestimate_probabilities and consider calling initialize_hmm.

Parameters:
  • segmentations – Segmentations of format: (count, (morph1, morph2, ...)) where the morphs can be either strings or CategorizedMorphs.
  • freqthreshold – discard words that occur less than given times in the corpus (default 1).
  • count_modifier – function for adjusting the counts of each word.
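
The effect of freqthreshold and count_modifier can be sketched as a pre-filtering step over the input segmentations. The dampening function and the sample data below are invented for illustration; they are not the library's defaults.

```python
import math

# (count, morphs) pairs in the documented input shape:
segmentations = [(15, ('un', 'common', 'ly')),
                 (1, ('rare', 'word'))]

freqthreshold = 2

def count_modifier(count):
    # Dampen raw counts logarithmically (an invented choice).
    return int(round(math.log(count + 1, 2)))

kept = [(count_modifier(count), morphs)
        for (count, morphs) in segmentations
        if count >= freqthreshold]
# kept == [(4, ('un', 'common', 'ly'))]
```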
cost_breakdown(segmentation, penalty=0.0, index=0)

Returns breakdown of costs for the given tagged segmentation.

cost_comparison(segmentations, retag=True)

Diagnostic function. (Re)tag the given segmentations, calculate their cost and return the sorted breakdowns of the costs. Can be used to analyse reasons for a segmentation choice.

generate_focus_samples(num_sets, num_samples)

Generates subsets of the corpus by weighted sampling.

get_cost()

Return current model encoding cost.

get_lexicon()

Returns morphs in lexicon, with emission counts

get_params()

Returns a dict of hyperparameters.

initialize_baseline(min_difference_proportion=0.005)

Initialize emission and transition probabilities, without changing the segmentation, using Viterbi EM on a previously added segmentation produced by a Morfessor Baseline model (see add_corpus_data).

initialize_hmm(min_difference_proportion=0.005)

Initialize emission and transition probabilities without changing the segmentation.

num_compounds

Compound (word) types

num_constructions

Construction (morph) types

rank_analyses(choices)

Choose the best analysis of a set of choices.

Note that the call and return signatures differ from those in Morfessor Baseline: this method is more versatile.

Parameters:choices – a sequence of AnalysisAlternative(analysis, penalty) namedtuples. The analysis must be a sequence of CategorizedMorphs, (segmented and tagged). The penalty is a float that is added to the cost for this choice. Use 0 to disable.
Returns:A sorted (by cost, ascending) list of SortedAnalysis(cost, analysis, index, breakdown) namedtuples.
cost : the contribution of this analysis to the corpus cost.
analysis : as in the input.
breakdown : a CostBreakdown object, for diagnostics.
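
The shape of the call and return values can be illustrated with stand-in namedtuples. The definitions below only mirror the names given in this documentation (the real ones live inside flatcat), and the costs are invented:

```python
from collections import namedtuple

# Illustrative stand-ins matching the documented field names:
AnalysisAlternative = namedtuple('AnalysisAlternative', ['analysis', 'penalty'])
SortedAnalysis = namedtuple('SortedAnalysis',
                            ['cost', 'analysis', 'index', 'breakdown'])

choices = [AnalysisAlternative(('un', 'common', 'ly'), 0.0),
           AnalysisAlternative(('uncommon', 'ly'), 0.0)]

# rank_analyses(choices) returns SortedAnalysis tuples sorted by cost,
# ascending; two invented costs show the shape of the result:
results = sorted([SortedAnalysis(12.3, choices[0].analysis, 0, None),
                  SortedAnalysis(9.8, choices[1].analysis, 1, None)],
                 key=lambda sa: sa.cost)
best = results[0]
# best.analysis == ('uncommon', 'ly')
```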
reestimate_probabilities()

Re-estimates model parameters from a segmented, tagged corpus.

theta(t) = arg min { L( theta, Y(t), D ) }

set_focus_sample(set_index)

Select one pregenerated focus sample set as active.

set_params(params)

Sets hyperparameters to loaded values.

toggle_callbacks(callbacks=None)

Callbacks are not saved in the pickled model, because pickle is unable to restore instance methods. If you need callbacks in a loaded model, you have to re-add them after loading.

train_batch(min_iteration_cost_gain=0.0025, min_epoch_cost_gain=0.005, max_epochs=5, max_iterations_first=1, max_iterations=1, max_resegment_iterations=1, max_shift_distance=2, min_shift_remainder=2)

Perform batch training.

Parameters:
  • min_iteration_cost_gain – Do not repeat iteration if the gain in cost was less than this proportion. No effect if max_iterations is 1. Set to None to disable.
  • min_epoch_cost_gain – Stop before max_epochs, if the gain in cost of the previous epoch was less than this proportion. Set to None to disable.
  • max_epochs – Maximum number of training epochs.
  • max_iterations_first – Number of iterations of each operator, in the first epoch.
  • max_iterations – Number of iterations of each operator, in later epochs.
  • max_resegment_iterations – Number of resegment iterations in any epoch.
  • max_shift_distance – Limit on the distance (in characters) that the shift operation can move a boundary.
  • min_shift_remainder – Limit on the shortest morph allowed to be produced by the shift operation.
train_online(data, count_modifier=None, epoch_interval=10000, max_epochs=None, result_callback=None)

Adapt the model in online fashion.

violated_annotations()

Yields all segmentations that have an associated annotation but would currently not be naturally segmented in a way that is included in the annotation alternatives.

viterbi_analyze_list(corpus)

Convenience wrapper around viterbi_analyze for a list of word strings or segmentations with attached counts. Segmented input can be with or without tags. This function can be used to analyze previously unseen data.

viterbi_tag_corpus()

(Re)tags the corpus segmentations using viterbi_tag.

words_with_morph(morph)

Diagnostic function. Returns all segmentations using the given morph. Format: (index_to_segmentations, count, analysis)

class flatcat.flatcat.FlatcatLexiconEncoding(morph_usage)

Extends LexiconEncoding to include the encoding cost of morph usage (context) features.

Parameters:morph_usage – A MorphUsageProperties object, or something that quacks like it.
clear()

Resets the cost variables. Use before fully reprocessing a segmented corpus.

class flatcat.flatcat.FlatcatEncoding(morph_usage, lexicon_encoding, weight=1.0)

Class for calculating the encoding costs of the grammar and the corpus. Also stores the HMM parameters.

tokens: the number of emissions observed. boundaries: the number of word tokens observed.

clear_emission_cache()

Clears the cache for emission probability values. Use if an incremental change invalidates cached values.

clear_emission_counts()

Resets emission counts and costs. Use before fully reprocessing a tagged segmented corpus.

clear_transition_cache()

Clears the cache for transition probability values. Use if an incremental change invalidates cached values.

clear_transition_counts()

Resets transition counts, costs and cache. Use before fully reprocessing a tagged segmented corpus.

get_cost()

Override for the Encoding get_cost function.

This is P( D_W | theta, Y )

log_emissionprob(category, morph, extrazero=False)

-Log of posterior emission probability P(morph|category)

log_transitionprob(prev_cat, next_cat)

-Log of transition probability P(next_cat|prev_cat)

logtransitionsum()

Returns the term of the cost function associated with the transition probabilities. This term is recalculated on each call to get_cost, as the transition matrix is small and each segmentation change is likely to modify a large part of the transition matrix, making cumulative updates unnecessary.

transit_emit_cost(prev_cat, next_cat, morph)

Cost of transitioning from prev_cat to next_cat and emitting the morph.
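
Since the model's costs are negative log probabilities, a combined transition-and-emission cost is a sum of the two component costs. The probabilities below are invented, purely to illustrate the arithmetic:

```python
import math

p_transition = 0.25  # hypothetical P(next_cat | prev_cat)
p_emission = 0.01    # hypothetical P(morph | category)

# Adding negative log probabilities is equivalent to multiplying
# the probabilities themselves:
transit_emit = -math.log(p_transition) + -math.log(p_emission)
assert math.isclose(transit_emit, -math.log(p_transition * p_emission))
```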

update_emission_count(category, morph, diff_count)

Updates the number of observed emissions of a single morph from a single category, and the logtokensum (which is category independent). Updates logcondprobsum.

Parameters:
  • category – name of category from which emission occurs.
  • morph – string representation of the morph.
  • diff_count – the change in the number of occurrences.
update_transition_count(prev_cat, next_cat, diff_count)

Updates the number of observed transitions between categories. OBSERVE! Clearing the cache is left to the caller.

Parameters:
  • prev_cat – The name (not index) of the category transitioned from.
  • next_cat – The name (not index) of the category transitioned to.
  • diff_count – The change in the number of transitions.
class flatcat.flatcat.FlatcatAnnotatedCorpusEncoding(corpus_coding, weight=None)

Class for calculating the cost of encoding the annotated corpus

get_cost()

Returns the cost of encoding the annotated corpus

modify_contribution(morph, direction)

Removes or re-adds the complete contribution of a morph to the cost function. The contribution must be removed using the same probability value as was used when adding it, making the ordering of operations important.

reset_contributions()

Recalculates the contributions of all morphs.

set_counts(counts)

Sets the counts of emissions and transitions occurring in the annotated corpus to precalculated values.

transition_cost()

Returns the term of the cost function associated with the transition probabilities. This term is recalculated on each call to get_cost, as the transition matrix is small and each segmentation change is likely to modify a large part of the transition matrix, making cumulative updates unnecessary.

update_counts(counts)

Updates the counts of emissions and transitions occurring in the annotated corpus, building on earlier counts.

update_weight()

Update the weight of the Encoding by taking the ratio of the corpus boundaries and annotated boundaries. Does not scale by corpus weight, unlike Morfessor Baseline.

Code examples for using the library interface

Initialize a semi-supervised model from a given segmentation and annotations

import flatcat

io = flatcat.FlatcatIO()
morph_usage = flatcat.categorizationscheme.MorphUsageProperties()
model = flatcat.FlatcatModel(morph_usage, corpusweight=1.0)
model.add_corpus_data(io.read_segmentation_file('segmentation.txt'))
model.add_annotations(io.read_annotations_file('annotations.txt'),
                      annotatedcorpusweight=1.0)
model.initialize_hmm()

The model is now ready to be trained.

Segmenting new data using an existing model

First printing only the segmentations, followed by the analysis with morph categories.

import flatcat

io = flatcat.FlatcatIO()
model = io.read_binary_model_file('model.pickled')
words = ['words', 'segmenting', 'morfessor', 'categories', 'semisupervised']

for word in words:
    print(model.viterbi_segment(word))

for word in words:
    print(model.viterbi_analyze(word))