neurox.data

Submodules:

neurox.data.annotate

Given a list of sentences, their activations, and a pattern, create a binary labeled dataset based on that pattern, where the pattern can be a regular expression, a list of words, or a function. For example, one can create a binary dataset of years vs. not-years (“2004” vs. “this”) by specifying a regular expression that matches the pattern of a year. The program extracts positive-class examples based on the provided filter and treats the rest of the examples as negative-class examples. The output of the program is a word file, a label file and an activation file.

neurox.data.annotate.annotate_data(source_path, activations_path, binary_filter, output_prefix, output_type='hdf5', decompose_layers=False, filter_layers=None)[source]

Given a set of sentences, per-word activations, a binary_filter and an output_prefix, creates binary labeled data and saves it to disk. A binary filter can be a set of words, a regex object or a function.

Parameters
  • source_path (str) – Path to a text file with one sentence per line

  • activations_path (str) – Path to the file containing sentence-wise activations

  • binary_filter (a set of words, a regex object or a function) – Filter that defines the positive class

  • output_prefix (str) – Prefix of the output files that will be saved as the output of this script

  • output_type (str, optional) – Format in which to save the activations, e.g. 'hdf5' or 'json' (default: 'hdf5')

  • decompose_layers (bool, optional) – If True, save each layer’s activations in a separate file

  • filter_layers (str, optional) – Comma-separated list of layers to save

Returns

Return type

Saves a word file, a binary label file and their activations

Example

annotate_data(source_path, activations_path, re.compile(r'^\w\w$')) selects words of exactly two characters as the positive class.

annotate_data(source_path, activations_path, {'is', 'can'}) selects occurrences of 'is' and 'can' as the positive class.
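The three filter kinds can be sketched as follows; `matches_filter` is a hypothetical helper for illustration, not part of the neurox API:

```python
import re

def matches_filter(token, binary_filter):
    """Return True if the token belongs to the positive class.

    Mirrors the three filter kinds described above: a set of words,
    a compiled regex, or an arbitrary predicate function.
    """
    if isinstance(binary_filter, set):
        return token in binary_filter
    if isinstance(binary_filter, re.Pattern):
        return binary_filter.match(token) is not None
    if callable(binary_filter):
        return binary_filter(token)
    raise ValueError("binary_filter must be a set, a compiled regex or a function")

# Year vs. not-year example from the module description
tokens = ["this", "is", "2004", "ok"]
year_filter = re.compile(r"^\d{4}$")
labels = ["positive" if matches_filter(t, year_filter) else "negative" for t in tokens]
```

Everything not matched by the filter falls into the negative class, matching the behavior described above.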

neurox.data.loader

Loading functions for activations, input tokens/sentences and labels

This module contains functions to load activations as well as source files with tokens and labels. Functions that support tokenized data are also provided.

neurox.data.loader.load_activations(activations_path, num_neurons_per_layer=None, is_brnn=False)[source]

Load extracted activations.

Parameters
  • activations_path (str) – Path to the activations file. Can be of type t7, pt, acts, json or hdf5

  • num_neurons_per_layer (int, optional) – Number of neurons per layer; used to compute the total number of layers. This is only necessary in the case of t7/pt/acts activations.

  • is_brnn (bool, optional) – If the model used to extract activations was bidirectional (default: False)

Returns

  • activations (list of numpy.ndarray) – List of sentence representations, where each sentence representation is a numpy matrix of shape [num tokens in sentence x concatenated representation size]

  • num_layers (int) – Number of layers. This is usually representation_size/num_neurons_per_layer. Divide again by 2 if the model was bidirectional.
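The layer-count arithmetic described above can be illustrated with a small sketch (the helper name `infer_num_layers` is hypothetical, not a neurox function):

```python
def infer_num_layers(representation_size, num_neurons_per_layer, is_brnn=False):
    """Compute the number of layers from the concatenated representation size.

    Follows the rule above: divide by neurons per layer, and divide
    again by 2 for a bidirectional model.
    """
    num_layers = representation_size // num_neurons_per_layer
    if is_brnn:
        num_layers //= 2
    return num_layers
```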

neurox.data.loader.filter_activations_by_layers(train_activations, test_activations, filter_layers, rnn_size, num_layers, is_brnn)[source]

Filter activations so that they only contain specific layers.

Useful for performing layer-wise analysis.

Warning

This function is deprecated and will be removed in future versions.

Parameters
  • train_activations (list of numpy.ndarray) – List of sentence representations from the train set, where each sentence representation is a numpy matrix of shape [NUM_TOKENS x NUM_NEURONS]. The method assumes that neurons from all layers are present, with the number of neurons in every layer given by rnn_size

  • test_activations (list of numpy.ndarray) – Similar to train_activations but with sentences from a test set.

  • filter_layers (str) – A comma-separated string of the form “f1,f2,f10”. “f” indicates a “forward” layer while “b” indicates a “backward” layer in a bidirectional RNN. If the activations are from a different kind of model, set is_brnn to False and provide only “f” entries. The number next to “f” is the layer number, 1-indexed, so “f1” corresponds to the embedding layer and so on.

  • rnn_size (int) – Number of neurons in every layer.

  • num_layers (int) – Total number of layers in the original model.

  • is_brnn (bool) – Boolean indicating if the neuron activations are from a bidirectional model.

Returns

  • filtered_train_activations (list of numpy.ndarray) – Filtered train activations

  • filtered_test_activations (list of numpy.ndarray) – Filtered test activations

Notes

For bidirectional models, the method assumes that the internal structure is as follows: forward layer 1 neurons, backward layer 1 neurons, forward layer 2 neurons …
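Under the layer layout described in the note above, the column selection can be sketched as follows; this is an illustrative re-implementation under those layout assumptions, not the deprecated neurox function itself:

```python
import numpy as np

def select_layers(activations, filter_layers, rnn_size, is_brnn):
    """Slice out the neuron columns for entries like "f1,b2".

    Assumes the bidirectional layout described above: forward layer 1,
    backward layer 1, forward layer 2, backward layer 2, ...
    """
    columns = []
    for entry in filter_layers.split(","):
        direction, layer = entry[0], int(entry[1:]) - 1  # layers are 1-indexed
        if is_brnn:
            # each layer occupies 2 * rnn_size columns (forward then backward)
            start = 2 * layer * rnn_size + (rnn_size if direction == "b" else 0)
        else:
            start = layer * rnn_size
        columns.extend(range(start, start + rnn_size))
    return [sentence[:, columns] for sentence in activations]
```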

neurox.data.loader.load_aux_data(source_path, labels_path, source_aux_path, activations, max_sent_l, ignore_start_token=False)[source]

Load word-annotated text-label pairs represented as sentences, where activation extraction was performed on tokenized text. This function loads the source text, the tokenized source text, the target labels, and the activations, and tries to make them perfectly parallel, i.e. the number of tokens in line N of source matches the number of tokens in line N of target, and the number of tokens in source_aux matches the number of activations at index N. The method deletes non-matching activation/source/source_aux/target tuples, up to a maximum of 100, before failing. The method also ignores sentences longer than the provided maximum. The activations are modified in place.

Warning

This function is deprecated and will be removed in future versions.

Parameters
  • source_path (str) – Path to the source text file, one sentence per line

  • labels_path (str) – Path to the annotated labels file, one sentence per line corresponding to the sentences in the source_path file.

  • source_aux_path (str) – Path to the source text file with tokenization, one sentence per line

  • activations (list of numpy.ndarray) – Activations returned from loader.load_activations

  • max_sent_l (int) – Maximum length of sentences. Sentences containing more tokens will be ignored.

  • ignore_start_token (bool, optional) – Ignore the first token. Useful if there are line position markers in the source text.

Returns

tokens – Dictionary containing three lists, source, source_aux and target. source contains all of the sentences from source_path that were not ignored. source_aux contains all tokenized sentences from source_aux_path. target contains the parallel set of annotated labels.

Return type

dict

neurox.data.loader.load_data(source_path, labels_path, activations, max_sent_l, ignore_start_token=False, sentence_classification=False)[source]

Load word-annotated text-label pairs represented as sentences. This function loads the source text, target labels, and activations, and tries to make them perfectly parallel, i.e. the number of tokens in line N of source matches the number of tokens in line N of target, and also matches the number of activations at index N. The method deletes non-matching activation/source/target pairs, up to a maximum of 100, before failing. The method also ignores sentences longer than the provided maximum. The activations are modified in place.

Parameters
  • source_path (str) – Path to the source text file, one sentence per line

  • labels_path (str) – Path to the annotated labels file, one sentence per line corresponding to the sentences in the source_path file.

  • activations (list of numpy.ndarray) – Activations returned from loader.load_activations

  • max_sent_l (int) – Maximum length of sentences. Sentences containing more tokens will be ignored.

  • ignore_start_token (bool, optional) – Ignore the first token. Useful if there are line position markers in the source text.

  • sentence_classification (bool, optional) – Flag to indicate if this is a sentence classification task, where every sentence actually has only a single activation (e.g. [CLS] token’s activations in the case of BERT)

Returns

tokens – Dictionary containing two lists, source and target. source contains all of the sentences from source_path that were not ignored. target contains the parallel set of annotated labels.

Return type

dict
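The alignment policy described above (keep only sentences whose token counts agree, fail after too many mismatches) can be sketched in pure Python; `align_parallel_data` is a hypothetical stand-in for the internal logic, with activations represented as lists of per-token vectors:

```python
def align_parallel_data(source, target, activations, max_mismatches=100):
    """Keep only sentences where source tokens, target labels and
    activation rows all agree in count; drop the rest, failing after
    max_mismatches drops (mirrors the policy described above)."""
    kept_source, kept_target, kept_acts = [], [], []
    dropped = 0
    for src, tgt, act in zip(source, target, activations):
        if len(src.split()) == len(tgt.split()) == len(act):
            kept_source.append(src)
            kept_target.append(tgt)
            kept_acts.append(act)
        else:
            dropped += 1
            if dropped > max_mismatches:
                raise ValueError("too many mismatched sentences")
    return {"source": kept_source, "target": kept_target}, kept_acts
```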

neurox.data.loader.load_sentence_data(source_path, labels_path, activations)[source]

Loads sentence-annotated text-label pairs. This function loads the source text, target labels, and activations, and tries to make them perfectly parallel, i.e. sentence N of source corresponds to the activation at index N. The method deletes non-matching activation/source pairs. The activations are modified in place.

Parameters
  • source_path (str) – Path to the source text file, one sentence per line

  • labels_path (str) – Path to the annotated labels file, one sentence per line corresponding to the sentences in the source_path file.

  • activations (list of numpy.ndarray) – Activations returned from loader.load_activations

Returns

tokens – Dictionary containing two lists, source and target. source contains all of the sentences from source_path that were not ignored. target contains the parallel set of annotated labels.

Return type

dict

neurox.data.representations

Utility functions to manage representations.

This module contains functions that will help in managing extracted representations, specifically on sub-word based data.

neurox.data.representations.bpe_get_avg_activations(tokens, activations)[source]

Aggregates activations by averaging assuming BPE-based tokenization.

Given loaded tokens data and activations, this function aggregates activations based on tokenized text. BPE-based tokenization is assumed, with every non-terminal subword ending with “@@”. The activations are aggregated by averaging over subwords.

Warning

This function is deprecated and will be removed in future versions.

Parameters
  • tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.

  • activations (list of numpy.ndarray) – Activations returned from loader.load_activations.

Returns

activations – Subword aggregated activations corresponding to one per actual token found in the untokenized text.

Return type

list of numpy.ndarray
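The “@@” subword-averaging scheme can be sketched as follows, assuming one numpy activation matrix per sentence; `average_bpe_subwords` is an illustrative helper, not the neurox function:

```python
import numpy as np

def average_bpe_subwords(subword_tokens, activations):
    """Average subword rows into one row per full word.

    Follows the convention above: every non-terminal subword ends
    with "@@", so a token without "@@" closes the current word.
    """
    groups, current = [], []
    for i, token in enumerate(subword_tokens):
        current.append(i)
        if not token.endswith("@@"):  # terminal subword closes the word
            groups.append(current)
            current = []
    return np.stack([activations[g].mean(axis=0) for g in groups])
```

Picking the last subword instead (as in bpe_get_last_activations) would simply replace the mean with `activations[g[-1]]`.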

neurox.data.representations.bpe_get_last_activations(tokens, activations, is_brnn=True)[source]

Aggregates activations by picking the last subword assuming BPE-based tokenization.

Given loaded tokens data and activations, this function aggregates activations based on tokenized text. BPE-based tokenization is assumed, with every non-terminal subword ending with “@@”. The activations are aggregated by picking the last subword for any given word.

Warning

This function is deprecated and will be removed in future versions.

Parameters
  • tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.

  • activations (list of numpy.ndarray) – Activations returned from loader.load_activations.

  • is_brnn (bool, optional) – Whether the model from which activations were extracted was bidirectional. Only applies for RNN models.

Returns

activations – Subword aggregated activations corresponding to one per actual token found in the untokenized text.

Return type

list of numpy.ndarray

neurox.data.representations.char_get_avg_activations(tokens, activations)[source]

Aggregates activations by averaging assuming Character-based tokenization.

Given loaded tokens data and activations, this function aggregates activations based on character-tokenized text. The activations are aggregated by averaging over characters.

Warning

This function is deprecated and will be removed in future versions.

Parameters
  • tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.

  • activations (list of numpy.ndarray) – Activations returned from loader.load_activations.

Returns

activations – Character aggregated activations corresponding to one per actual token found in the untokenized text.

Return type

list of numpy.ndarray

neurox.data.representations.char_get_last_activations(tokens, activations, is_brnn=True)[source]

Aggregates activations by picking the last character assuming character-based tokenization.

Given loaded tokens data and activations, this function aggregates activations based on character-tokenized text. The activations are aggregated by picking the last character for any given word.

Warning

This function is deprecated and will be removed in future versions.

Parameters
  • tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.

  • activations (list of numpy.ndarray) – Activations returned from loader.load_activations.

  • is_brnn (bool, optional) – Whether the model from which activations were extracted was bidirectional. Only applies for RNN models.

Returns

activations – Character aggregated activations corresponding to one per actual token found in the untokenized text.

Return type

list of numpy.ndarray

neurox.data.representations.sent_get_last_activations(tokens, activations)[source]

Gets the summary vector for the input sentences.

Given loaded tokens data and activations, this function picks the final token’s activations for every sentence, essentially giving summary vectors for every sentence in the dataset. This is mostly applicable for RNNs.

Note

Bidirectionality is currently not handled in the case of BiRNNs.

Parameters
  • tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.

  • activations (list of numpy.ndarray) – Activations returned from loader.load_activations.

Returns

activations – Summary activations corresponding to one per actual sentence in the original text.

Return type

list of numpy.ndarray
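The summary-vector extraction amounts to taking the last row of each sentence's activation matrix; a minimal sketch (helper name hypothetical):

```python
import numpy as np

def last_token_summaries(activations):
    """One summary vector per sentence: the final token's activations,
    as described above for RNN-style models."""
    return [sentence[-1, :] for sentence in activations]
```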

neurox.data.utils

neurox.data.utils.save_files(words, labels, activations, output_prefix, output_type='hdf5', decompose_layers=False, filter_layers=None)[source]

Save word and label files in text format, and activations in the specified format (default: hdf5)

Parameters
  • words (list) – A list of words

  • labels (list) – A list of labels for every word

  • activations (list) – A list of word-wise activations

  • output_prefix (string) – Prefix of the output files

  • output_type (string, optional) – Format in which to save the activations (default: 'hdf5')

  • decompose_layers (bool, optional) – If True, save each layer’s activations in a separate file

  • filter_layers (string, optional) – Comma-separated list of layers to save

Returns

Return type

Save word, label and activation files

neurox.data.writer

Representations Writers

Module with various writers for saving representations/activations. Currently, two file types are supported:

  1. hdf5: This is a binary format, and results in smaller overall files. The structure of the file is as follows:

    • sentence_to_idx dataset: Contains a single json string at index 0 that maps sentences to indices

    • Indices 0 through N-1 datasets: Each index corresponds to one sentence. The value of the dataset is a tensor with dimensions num_layers x sentence_length x embedding_size, where embedding_size may include multiple layers

  2. json: This is a human-readable format. There is some loss of precision, since each activation value is saved using 8 decimal places. Concretely, this results in a jsonl file, where each line is a json string corresponding to a single sentence. The structure of each line is as follows:

    • linex_idx: Sentence index

    • features: List of tokens (with their activations)

      • token: The current token

      • layers: List of layers

        • index: Layer index (does not correspond to original model’s layers)

        • values: List of activation values for all neurons in the layer
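The per-sentence json structure described above can be sketched as follows; the field names come from the docs, while the builder function itself is a hypothetical illustration:

```python
import json

def sentence_to_json_line(sentence_idx, tokens, activations):
    """Build one jsonl line with the documented structure.

    `activations` is assumed to be a per-token list of per-layer value
    lists; rounding mirrors the 8-decimal-place precision noted above.
    """
    features = []
    for token, layers in zip(tokens, activations):
        features.append({
            "token": token,
            "layers": [
                {"index": i, "values": [round(v, 8) for v in layer]}
                for i, layer in enumerate(layers)
            ],
        })
    return json.dumps({"linex_idx": sentence_idx, "features": features})
```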

The writers also support saving activations from specific layers only, using the filter_layers argument. Since activation files can be large, an additional option for decomposing the representations into layer-wise files is also provided.

class neurox.data.writer.ActivationsWriter(filename, filetype=None, decompose_layers=False, filter_layers=None)[source]

Bases: object

Class that encapsulates all available writers.

This is the only class that should be used by the rest of the library.

filename

Filename for storing the activations. May not be used exactly if decompose_layers is True.

Type

str

filetype

An optional hint for the filetype. The file type will be detected automatically from the filename if none is supplied.

Type

str

decompose_layers

Set to true if each layer’s activations should be saved in a separate file.

Type

bool

filter_layers

Comma separated list of layer indices to save.

Type

str

__init__(filename, filetype=None, decompose_layers=False, filter_layers=None)[source]

Initialize self. See help(type(self)) for accurate signature.

open()[source]

Method to open the underlying files. Will be called automatically by the class instance when necessary.

write_activations(sentence_idx, extracted_words, activations)[source]

Method to write a single sentence’s activations to file

close()[source]

Method to close the underlying files.

static get_writer(filename, filetype=None, decompose_layers=False, filter_layers=None)[source]

Method to get the correct writer based on filename and filetype
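The extension-based detection used by the writer factory can be sketched like this; the sketch returns a type string rather than a writer instance, and the helper name is hypothetical:

```python
def detect_filetype(filename, filetype=None):
    """Illustrate the dispatch rule described above: an explicit
    filetype hint wins; otherwise infer from the file extension."""
    if filetype is not None:
        return filetype
    if filename.endswith(".hdf5"):
        return "hdf5"
    if filename.endswith(".json"):
        return "json"
    raise NotImplementedError("filetype not supported")
```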

static add_writer_options(parser)[source]

Method to return argparse arguments specific to activation writers

class neurox.data.writer.ActivationsWriterManager(filename, filetype=None, decompose_layers=False, filter_layers=None)[source]

Bases: neurox.data.writer.ActivationsWriter

Manager class that handles decomposition and filtering.

Decomposition requires multiple writers (one per file) and filtering requires processing the activations to remove unneeded layer activations. This class sits on top of the actual activations writer to manage these operations.

__init__(filename, filetype=None, decompose_layers=False, filter_layers=None)[source]

Initialize self. See help(type(self)) for accurate signature.

open(num_layers)[source]

Method to open the underlying files. Will be called automatically by the class instance when necessary.

write_activations(sentence_idx, extracted_words, activations)[source]

Method to write a single sentence’s activations to file

close()[source]

Method to close the underlying files.

class neurox.data.writer.HDF5ActivationsWriter(filename)[source]

Bases: neurox.data.writer.ActivationsWriter

__init__(filename)[source]

Initialize self. See help(type(self)) for accurate signature.

open()[source]

Method to open the underlying files. Will be called automatically by the class instance when necessary.

write_activations(sentence_idx, extracted_words, activations)[source]

Method to write a single sentence’s activations to file

close()[source]

Method to close the underlying files.

class neurox.data.writer.JSONActivationsWriter(filename)[source]

Bases: neurox.data.writer.ActivationsWriter

__init__(filename)[source]

Initialize self. See help(type(self)) for accurate signature.

open()[source]

Method to open the underlying files. Will be called automatically by the class instance when necessary.

write_activations(sentence_idx, extracted_words, activations)[source]

Method to write a single sentence’s activations to file

close()[source]

Method to close the underlying files.

Module contents: