neurox.data

Submodules:

neurox.data.loader

Loading functions for activations, input tokens/sentences and labels

This module contains functions to load activations as well as source files with tokens and labels. Functions that support tokenized data are also provided.

neurox.data.loader.load_activations(activations_path, num_neurons_per_layer=None, is_brnn=False)[source]

Load extracted activations.

Parameters
  • activations_path (str) – Path to the activations file. Can be of type t7, pt, acts, json or hdf5

  • num_neurons_per_layer (int, optional) – Number of neurons per layer, used to compute the total number of layers. This is only necessary in the case of t7/pt/acts activations.

  • is_brnn (bool, optional) – If the model used to extract activations was bidirectional (default: False)

Returns

  • activations (list of numpy.ndarray) – List of sentence representations, where each sentence representation is a numpy matrix of shape [num tokens in sentence x concatenated representation size]

  • num_layers (int) – Number of layers. This is usually representation_size/num_neurons_per_layer. Divide by 2 again if the model was bidirectional.
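
The shape conventions above can be illustrated with a toy representation (a sketch using only NumPy, not the library itself; the array contents are made up):

```python
import numpy as np

# Toy "sentence representation": 4 tokens, 2 layers of 3 neurons each,
# concatenated per token along the second axis as described above.
num_neurons_per_layer = 3
sentence = np.zeros((4, 2 * num_neurons_per_layer))

# num_layers is usually representation_size / num_neurons_per_layer ...
num_layers = sentence.shape[1] // num_neurons_per_layer
print(num_layers)  # 2

# ... divided by 2 again if the model was bidirectional
is_brnn = True
effective_layers = num_layers // 2 if is_brnn else num_layers
print(effective_layers)  # 1
```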

neurox.data.loader.filter_activations_by_layers(train_activations, test_activations, filter_layers, rnn_size, num_layers, is_brnn)[source]

Filter activations so that they only contain specific layers.

Useful for performing layer-wise analysis.

Warning

This function is deprecated and will be removed in future versions.

Parameters
  • train_activations (list of numpy.ndarray) – List of sentence representations from the train set, where each sentence representation is a numpy matrix of shape [NUM_TOKENS x NUM_NEURONS]. The method assumes that neurons from all layers are present, with the number of neurons in every layer given by rnn_size

  • test_activations (list of numpy.ndarray) – Similar to train_activations but with sentences from a test set.

  • filter_layers (str) – A comma-separated string of the form “f1,f2,f10”. “f” indicates a “forward” layer while “b” indicates a “backward” layer in a bidirectional RNN. If the activations are from a different kind of model, set is_brnn to False and provide only “f” entries. The number next to “f” is the 1-indexed layer number, so “f1” corresponds to the embedding layer and so on.

  • rnn_size (int) – Number of neurons in every layer.

  • num_layers (int) – Total number of layers in the original model.

  • is_brnn (bool) – Boolean indicating if the neuron activations are from a bidirectional model.

Returns

  • filtered_train_activations (list of numpy.ndarray) – Filtered train activations

  • filtered_test_activations (list of numpy.ndarray) – Filtered test activations

Notes

For bidirectional models, the method assumes that the internal structure is as follows: forward layer 1 neurons, backward layer 1 neurons, forward layer 2 neurons …
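
The filtering semantics above can be sketched with a toy re-implementation (a hypothetical helper written for illustration, not the library function), assuming the f1, b1, f2, b2, … layout just described:

```python
import numpy as np

def filter_layers_sketch(activations, filter_layers, rnn_size, is_brnn):
    """Illustrative layer filtering: parse entries like "f1" or "b2"
    into column ranges and keep only those columns per sentence."""
    spans = []
    for entry in filter_layers.split(","):
        direction, layer = entry[0], int(entry[1:])
        if is_brnn:
            # layout: f1, b1, f2, b2, ... with rnn_size neurons each
            offset = 2 * (layer - 1) + (1 if direction == "b" else 0)
        else:
            offset = layer - 1
        spans.append((offset * rnn_size, (offset + 1) * rnn_size))
    return [
        np.concatenate([sent[:, s:e] for s, e in spans], axis=1)
        for sent in activations
    ]

# One 2-token sentence from a bidirectional model, 2 layers of 2 neurons:
# columns are [f1 f1 b1 b1 f2 f2 b2 b2]
sent = np.arange(16, dtype=float).reshape(2, 8)
out = filter_layers_sketch([sent], "f1,b2", rnn_size=2, is_brnn=True)
print(out[0].shape)  # (2, 4)
```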

neurox.data.loader.load_aux_data(source_path, labels_path, source_aux_path, activations, max_sent_l, ignore_start_token=False)[source]

Load word-annotated text-label pairs represented as sentences, where activation extraction was performed on tokenized text. This function loads the source text, tokenized source text, target labels, and activations, and tries to make them perfectly parallel, i.e. the number of tokens in line N of source matches the number of tokens in line N of target, and the number of tokens in line N of source_aux matches the number of activations at index N. The method will delete non-matching activation/source/source_aux/target tuples, up to a maximum of 100 before failing. It will also ignore sentences longer than the provided maximum. The activations will be modified in place.

Warning

This function is deprecated and will be removed in future versions.

Parameters
  • source_path (str) – Path to the source text file, one sentence per line

  • labels_path (str) – Path to the annotated labels file, one sentence per line corresponding to the sentences in the source_path file.

  • source_aux_path (str) – Path to the source text file with tokenization, one sentence per line

  • activations (list of numpy.ndarray) – Activations returned from loader.load_activations

  • max_sent_l (int) – Maximum length of sentences. Sentences containing more tokens will be ignored.

  • ignore_start_token (bool, optional) – Ignore the first token. Useful if there are line position markers in the source text.

Returns

tokens – Dictionary containing three lists, source, source_aux and target. source contains all of the sentences from source_path that were not ignored. source_aux contains all tokenized sentences from source_aux_path. target contains the parallel set of annotated labels.

Return type

dict

neurox.data.loader.load_data(source_path, labels_path, activations, max_sent_l, ignore_start_token=False, sentence_classification=False)[source]

Load word-annotated text-label pairs represented as sentences. This function loads the source text, target labels, and activations, and tries to make them perfectly parallel, i.e. the number of tokens in line N of source matches the number of tokens in line N of target, which in turn matches the number of activations at index N. The method will delete non-matching activation/source/target pairs, up to a maximum of 100 before failing. It will also ignore sentences longer than the provided maximum. The activations will be modified in place.

Parameters
  • source_path (str) – Path to the source text file, one sentence per line

  • labels_path (str) – Path to the annotated labels file, one sentence per line corresponding to the sentences in the source_path file.

  • activations (list of numpy.ndarray) – Activations returned from loader.load_activations

  • max_sent_l (int) – Maximum length of sentences. Sentences containing more tokens will be ignored.

  • ignore_start_token (bool, optional) – Ignore the first token. Useful if there are line position markers in the source text.

  • sentence_classification (bool, optional) – Flag to indicate if this is a sentence classification task, where every sentence actually has only a single activation (e.g. [CLS] token’s activations in the case of BERT)

Returns

tokens – Dictionary containing two lists, source and target. source contains all of the sentences from source_path that were not ignored. target contains the parallel set of annotated labels.

Return type

dict
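
The alignment rule described above can be sketched with toy data (an illustration of the semantics, not the library code; the sentences, labels, and drop-after-100 behavior are simplified):

```python
import numpy as np

source = ["the cat sat", "hello world", "a very long sentence here"]
labels = ["DT NN VB", "UH", "DT RB JJ NN RB"]
activations = [np.zeros((3, 4)), np.zeros((2, 4)), np.zeros((5, 4))]
max_sent_l = 4

kept = []
for src, lab, act in zip(source, labels, activations):
    src_toks, lab_toks = src.split(), lab.split()
    if len(src_toks) != len(lab_toks) or len(src_toks) != act.shape[0]:
        continue  # non-matching triple: dropped (the real loader fails after 100)
    if len(src_toks) > max_sent_l:
        continue  # longer than max_sent_l: ignored
    kept.append((src_toks, lab_toks))

print(len(kept))  # 1
```

Here the second sentence is dropped because its label count does not match its token count, and the third because it exceeds max_sent_l.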

neurox.data.loader.load_sentence_data(source_path, labels_path, activations)[source]

Load sentence-annotated text-label pairs. This function loads the source text, target labels, and activations, and tries to make them perfectly parallel, i.e. the number of tokens in line N of source matches the number of activations at index N. The method will delete non-matching activation/source pairs. The activations will be modified in place.

Parameters
  • source_path (str) – Path to the source text file, one sentence per line

  • labels_path (str) – Path to the annotated labels file, one sentence per line corresponding to the sentences in the source_path file.

  • activations (list of numpy.ndarray) – Activations returned from loader.load_activations

Returns

tokens – Dictionary containing two lists, source and target. source contains all of the sentences from source_path that were not ignored. target contains the parallel set of annotated labels.

Return type

dict

neurox.data.representations

Utility functions to manage representations.

This module contains functions that will help in managing extracted representations, specifically on sub-word based data.

neurox.data.representations.bpe_get_avg_activations(tokens, activations)[source]

Aggregates activations by averaging assuming BPE-based tokenization.

Given loaded tokens data and activations, this function aggregates activations based on tokenized text. BPE-based tokenization is assumed, with every non-terminal subword ending with “@@”. The activations are aggregated by averaging over subwords.

Warning

This function is deprecated and will be removed in future versions.

Parameters
  • tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.

  • activations (list of numpy.ndarray) – Activations returned from loader.load_activations.

Returns

activations – Subword aggregated activations corresponding to one per actual token found in the untokenized text.

Return type

list of numpy.ndarray
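
The averaging scheme described above can be sketched as follows (a toy illustration of the “@@” convention, not the library code; the token strings and activation values are made up):

```python
import numpy as np

def bpe_average_sketch(subword_tokens, acts):
    """Group subwords into words (every non-terminal subword ends with
    "@@") and average each word's activation vectors."""
    groups, current = [], []
    for i, tok in enumerate(subword_tokens):
        current.append(i)
        if not tok.endswith("@@"):  # terminal subword closes the word
            groups.append(current)
            current = []
    return np.stack([acts[g].mean(axis=0) for g in groups])

# "unbelievable" -> "un@@ believ@@ able", "cats" -> "cat@@ s"
toks = ["un@@", "believ@@", "able", "cat@@", "s"]
acts = np.array([[0.0], [2.0], [4.0], [1.0], [3.0]])
print(bpe_average_sketch(toks, acts))  # [[2.], [2.]]
```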

neurox.data.representations.bpe_get_last_activations(tokens, activations, is_brnn=True)[source]

Aggregates activations by picking the last subword assuming BPE-based tokenization.

Given loaded tokens data and activations, this function aggregates activations based on tokenized text. BPE-based tokenization is assumed, with every non-terminal subword ending with “@@”. The activations are aggregated by picking the last subword of any given word.

Warning

This function is deprecated and will be removed in future versions.

Parameters
  • tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.

  • activations (list of numpy.ndarray) – Activations returned from loader.load_activations.

  • is_brnn (bool, optional) – Whether the model from which activations were extracted was bidirectional. Only applies for RNN models.

Returns

activations – Subword aggregated activations corresponding to one per actual token found in the untokenized text.

Return type

list of numpy.ndarray
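
The last-subword scheme can be sketched in the same toy setting as the averaging case (an illustration, not the library code):

```python
import numpy as np

def bpe_last_sketch(subword_tokens, acts):
    """Keep only the activation of each word's final subword; a token
    not ending in "@@" is the terminal subword of its word."""
    picked = [acts[i] for i, tok in enumerate(subword_tokens)
              if not tok.endswith("@@")]
    return np.stack(picked)

toks = ["un@@", "believ@@", "able", "cat@@", "s"]
acts = np.array([[0.0], [2.0], [4.0], [1.0], [3.0]])
print(bpe_last_sketch(toks, acts))  # [[4.], [3.]]
```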

neurox.data.representations.char_get_avg_activations(tokens, activations)[source]

Aggregates activations by averaging assuming character-based tokenization.

Given loaded tokens data and activations, this function aggregates activations based on character-tokenized text. The activations are aggregated by averaging over characters.

Warning

This function is deprecated and will be removed in future versions.

Parameters
  • tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.

  • activations (list of numpy.ndarray) – Activations returned from loader.load_activations.

Returns

activations – Character aggregated activations corresponding to one per actual token found in the untokenized text.

Return type

list of numpy.ndarray

neurox.data.representations.char_get_last_activations(tokens, activations, is_brnn=True)[source]

Aggregates activations by picking the last character assuming character-based tokenization.

Given loaded tokens data and activations, this function aggregates activations based on character-tokenized text. The activations are aggregated by picking the last character of any given word.

Warning

This function is deprecated and will be removed in future versions.

Parameters
  • tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.

  • activations (list of numpy.ndarray) – Activations returned from loader.load_activations.

  • is_brnn (bool, optional) – Whether the model from which activations were extracted was bidirectional. Only applies for RNN models.

Returns

activations – Character aggregated activations corresponding to one per actual token found in the untokenized text.

Return type

list of numpy.ndarray

neurox.data.representations.sent_get_last_activations(tokens, activations)[source]

Gets the summary vector for the input sentences.

Given loaded tokens data and activations, this function picks the final token’s activations for every sentence, essentially giving summary vectors for every sentence in the dataset. This is mostly applicable for RNNs.

Note

Bidirectionality is currently not handled in the case of BiRNNs.

Parameters
  • tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.

  • activations (list of numpy.ndarray) – Activations returned from loader.load_activations.

Returns

activations – Summary activations corresponding to one per actual sentence in the original text.

Return type

list of numpy.ndarray
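
Picking the final token's activations as a sentence summary can be sketched directly with NumPy (a toy illustration of the behavior described above, not the library code):

```python
import numpy as np

# Two toy "sentences": 3 tokens and 2 tokens, each with 2 neurons.
sentences = [np.arange(6, dtype=float).reshape(3, 2),
             np.arange(4, dtype=float).reshape(2, 2)]

# One summary vector per sentence: the last token's row.
summaries = [sent[-1] for sent in sentences]
print(summaries[0])  # [4. 5.]
```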

Module contents: