neurox.data¶
Submodules:
neurox.data.annotate¶
Given a list of sentences, their activations and a pattern, create a binary labeled dataset based on the pattern, where the pattern can be a regular expression, a list of words, or a function. For example, one can create a binary dataset of years vs. not-years (2004 vs. this) by specifying a regular expression that matches the pattern of a year. The program extracts positive class examples based on the provided filter and treats the rest of the examples as negative class examples. The output of the program is a word file, a label file and an activation file.
-
neurox.data.annotate.
annotate_data
(source_path, activations_path, binary_filter, output_prefix, output_type='hdf5', decompose_layers=False, filter_layers=None)[source]¶ Given a set of sentences, per-word activations, a binary_filter and an output_prefix, creates binary data and saves it to disk. A binary filter can be a set of words, a regex object or a function
- Parameters
source_path (str) – Path to a text file with one sentence per line
activations_path (str) – Path to the activations file
binary_filter (a set of words, a regex object, or a function) –
output_prefix (str) – Prefix of the output files that will be saved as the output of this script
- Returns
Saves a word file, a binary label file and their activations
Example
annotate_data(source_path, activations_path, re.compile(r'^\w\w$')) selects words of exactly two characters as the positive class
annotate_data(source_path, activations_path, {'is', 'can'}) selects occurrences of 'is' and 'can' as the positive class
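The dispatch over the three supported filter kinds can be sketched as follows. This is a simplified stand-in for what the annotation step does per token, not the library's internal implementation; `matches_filter` is a hypothetical helper name.

```python
import re

def matches_filter(token, binary_filter):
    """Return True if a token belongs to the positive class.

    Simplified stand-in for the dispatch over the three supported
    filter kinds: a set of words, a compiled regex, or a function.
    """
    if isinstance(binary_filter, set):
        return token in binary_filter
    if isinstance(binary_filter, re.Pattern):
        return binary_filter.match(token) is not None
    if callable(binary_filter):
        return binary_filter(token)
    raise NotImplementedError("filter must be a set, a regex, or a function")

# Positive class: 4-digit years, as in the years vs. not-years example
year_filter = re.compile(r"^\d{4}$")
tokens = ["In", "2004", "this", "happened"]
labels = ["positive" if matches_filter(t, year_filter) else "negative"
          for t in tokens]
```

Every token not matched by the filter falls into the negative class, which is why the resulting dataset is binary by construction.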
neurox.data.loader¶
Loading functions for activations, input tokens/sentences and labels
This module contains functions to load activations as well as source files with tokens and labels. Functions that support tokenized data are also provided.
-
neurox.data.loader.
load_activations
(activations_path, num_neurons_per_layer=None, is_brnn=False)[source]¶ Load extracted activations.
- Parameters
activations_path (str) – Path to the activations file. Can be of type t7, pt, acts, json or hdf5
num_neurons_per_layer (int, optional) – Number of neurons per layer, used to compute the total number of layers. This is only necessary in the case of t7/pt/acts activations.
is_brnn (bool, optional) – If the model used to extract activations was bidirectional (default: False)
- Returns
activations (list of numpy.ndarray) – List of sentence representations, where each sentence representation is a numpy matrix of shape [num tokens in sentence x concatenated representation size]
num_layers (int) – Number of layers. This is usually representation_size/num_neurons_per_layer, divided by 2 again if the model was bidirectional
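The relationship between the returned shapes and num_layers can be illustrated with hypothetical numbers (a 13-layer model with 768 neurons per layer is an assumption for this sketch, not something the loader requires):

```python
import numpy as np

# Hypothetical sizes: 13 layers (embedding + 12), 768 neurons per layer,
# concatenated per token. Two sentences of 5 and 8 tokens.
num_neurons_per_layer = 768
activations = [np.zeros((5, 13 * 768)), np.zeros((8, 13 * 768))]

# num_layers is recovered exactly as described above:
representation_size = activations[0].shape[1]
num_layers = representation_size // num_neurons_per_layer
# For a bidirectional model, divide by 2 again:
# num_layers = representation_size // num_neurons_per_layer // 2
```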
-
neurox.data.loader.
filter_activations_by_layers
(train_activations, test_activations, filter_layers, rnn_size, num_layers, is_brnn)[source]¶ Filter activations so that they only contain specific layers.
Useful for performing layer-wise analysis.
Warning
This function is deprecated and will be removed in future versions.
- Parameters
train_activations (list of numpy.ndarray) – List of sentence representations from the train set, where each sentence representation is a numpy matrix of shape [NUM_TOKENS x NUM_NEURONS]. The method assumes that neurons from all layers are present, with the number of neurons in every layer given by rnn_size
test_activations (list of numpy.ndarray) – Similar to train_activations but with sentences from a test set.
filter_layers (str) – A comma-separated string of the form "f1,f2,f10". "f" indicates a "forward" layer while "b" indicates a backward layer in a Bidirectional RNN. If the activations are from a different kind of model, set is_brnn to False and provide only "f" entries. The number next to "f" is the layer number, 1-indexed, so "f1" corresponds to the embedding layer and so on.
rnn_size (int) – Number of neurons in every layer.
num_layers (int) – Total number of layers in the original model.
is_brnn (bool) – Boolean indicating if the neuron activations are from a bidirectional model.
- Returns
filtered_train_activations (list of numpy.ndarray) – Filtered train activations
filtered_test_activations (list of numpy.ndarray) – Filtered test activations
Notes
For bidirectional models, the method assumes that the internal structure is as follows: forward layer 1 neurons, backward layer 1 neurons, forward layer 2 neurons …
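For a unidirectional model, where layers are simply laid out one after another inside each sentence matrix, the layer slicing can be sketched as below. This is a simplified illustration of the idea only; the real function additionally handles the forward/backward interleaving described in the note above, and `filter_layers_forward` is a hypothetical name.

```python
import numpy as np

def filter_layers_forward(activations, layers, rnn_size):
    """Keep only the given 1-indexed layers from each sentence matrix.

    Simplified sketch for a unidirectional model: each
    [num_tokens x num_layers*rnn_size] matrix stores layer 1's
    neurons first, then layer 2's, and so on.
    """
    filtered = []
    for sentence in activations:
        columns = []
        for layer in layers:  # e.g. [1, 3] for "f1,f3"
            start = (layer - 1) * rnn_size
            columns.append(sentence[:, start:start + rnn_size])
        filtered.append(np.concatenate(columns, axis=1))
    return filtered

# 3 layers of 4 neurons each; keep layers 1 and 3
acts = [np.arange(24).reshape(2, 12)]
out = filter_layers_forward(acts, [1, 3], rnn_size=4)
```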
-
neurox.data.loader.
load_aux_data
(source_path, labels_path, source_aux_path, activations, max_sent_l, ignore_start_token=False)[source]¶ Load word-annotated text-label pair data represented as sentences, where activation extraction was performed on tokenized text. This function loads the source text, tokenized source text, target labels, and activations and tries to make them perfectly parallel, i.e. the number of tokens in line N of the source matches the number of tokens in line N of the target, and the number of tokens in source_aux matches the number of activations at index N. The method deletes non-matching activation/source/source_aux/target pairs, up to a maximum of 100 before failing. The method also ignores sentences longer than the provided maximum. The activations are modified in place.
Warning
This function is deprecated and will be removed in future versions.
- Parameters
source_path (str) – Path to the source text file, one sentence per line
labels_path (str) – Path to the annotated labels file, one sentence per line corresponding to the sentences in the source_path file.
source_aux_path (str) – Path to the tokenized source text file, one sentence per line
activations (list of numpy.ndarray) – Activations returned from loader.load_activations
max_sent_l (int) – Maximum length of sentences. Sentences containing more tokens will be ignored.
ignore_start_token (bool, optional) – Ignore the first token. Useful if there are line position markers in the source text.
- Returns
tokens – Dictionary containing three lists, source, source_aux and target. source contains all of the sentences from source_path that were not ignored. source_aux contains all tokenized sentences from source_aux_path. target contains the parallel set of annotated labels.
- Return type
dict
-
neurox.data.loader.
load_data
(source_path, labels_path, activations, max_sent_l, ignore_start_token=False, sentence_classification=False)[source]¶ Load word-annotated text-label pair data represented as sentences. This function loads the source text, target labels, and activations and tries to make them perfectly parallel, i.e. the number of tokens in line N of the source matches the number of tokens in line N of the target, and also matches the number of activations at index N. The method deletes non-matching activation/source/target pairs, up to a maximum of 100 before failing. The method also ignores sentences longer than the provided maximum. The activations are modified in place.
- Parameters
source_path (str) – Path to the source text file, one sentence per line
labels_path (str) – Path to the annotated labels file, one sentence per line corresponding to the sentences in the source_path file.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations
max_sent_l (int) – Maximum length of sentences. Sentences containing more tokens will be ignored.
ignore_start_token (bool, optional) – Ignore the first token. Useful if there are line position markers in the source text.
sentence_classification (bool, optional) – Flag to indicate if this is a sentence classification task, where every sentence actually has only a single activation (e.g. [CLS] token’s activations in the case of BERT)
- Returns
tokens – Dictionary containing two lists, source and target. source contains all of the sentences from source_path that were not ignored. target contains the parallel set of annotated labels.
- Return type
dict
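The parallelism invariant the loader enforces can be sketched as below. This is a minimal illustration of the alignment idea only, with `align_parallel` a hypothetical name; the real loader additionally handles max_sent_l, start-token stripping, and sentence classification.

```python
def align_parallel(source, target, activations, max_mismatches=100):
    """Keep only triples whose token counts agree.

    Sketch of the invariant described above: a sentence survives only
    if its source tokens, target labels, and per-token activations all
    have the same length; too many mismatches is treated as an error.
    """
    kept = []
    dropped = 0
    for src, tgt, act in zip(source, target, activations):
        if len(src) == len(tgt) == len(act):
            kept.append((src, tgt, act))
        else:
            dropped += 1
            if dropped > max_mismatches:
                raise ValueError("too many mismatching sentences")
    return kept

source = [["the", "cat"], ["a", "dog", "ran"]]
target = [["DT", "NN"], ["DT", "NN"]]            # second sentence misaligned
acts = [[[0.1], [0.2]], [[0.3], [0.4], [0.5]]]   # per-token activation rows
pairs = align_parallel(source, target, acts)
```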
-
neurox.data.loader.
load_sentence_data
(source_path, labels_path, activations)[source]¶ Loads sentence-annotated text-label pairs. This function loads the source text, target labels, and activations and tries to make them perfectly parallel, i.e. the number of tokens in line N of the source matches the number of activations at index N. The method deletes non-matching activation/source pairs. The activations are modified in place.
- Parameters
source_path (str) – Path to the source text file, one sentence per line
labels_path (str) – Path to the annotated labels file, one sentence per line corresponding to the sentences in the source_path file.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations
- Returns
tokens – Dictionary containing two lists, source and target. source contains all of the sentences from source_path that were not ignored. target contains the parallel set of annotated labels.
- Return type
dict
neurox.data.representations¶
Utility functions to manage representations.
This module contains functions that will help in managing extracted representations, specifically on sub-word based data.
-
neurox.data.representations.
bpe_get_avg_activations
(tokens, activations)[source]¶ Aggregates activations by averaging assuming BPE-based tokenization.
Given loaded token data and activations, this function aggregates activations based on tokenized text. BPE-based tokenization is assumed, with every non-terminal subword ending with "@@". The activations are aggregated by averaging over subwords.
Warning
This function is deprecated and will be removed in future versions.
- Parameters
tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations.
- Returns
activations – Subword aggregated activations corresponding to one per actual token found in the untokenized text.
- Return type
list of numpy.ndarray
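The averaging rule described above can be sketched as follows, assuming the "@@" continuation convention. This is an illustrative stand-in, not the library's implementation, and `bpe_average` is a hypothetical name.

```python
import numpy as np

def bpe_average(subword_tokens, sentence_activations):
    """Average subword activations into one vector per word.

    Every non-terminal subword ends with "@@", so a word's rows run
    from some subword up to and including the first subword without
    the marker.
    """
    word_vectors, group = [], []
    for token, row in zip(subword_tokens, sentence_activations):
        group.append(row)
        if not token.endswith("@@"):  # last subword of the current word
            word_vectors.append(np.mean(group, axis=0))
            group = []
    return np.stack(word_vectors)

# "un@@ believ@@ able news" tokenizes back to two words
tokens = ["un@@", "believ@@", "able", "news"]
acts = np.array([[1.0], [2.0], [3.0], [5.0]])
out = bpe_average(tokens, acts)
```

The companion "last subword" strategy below differs only in keeping the final row of each group instead of the mean.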
-
neurox.data.representations.
bpe_get_last_activations
(tokens, activations, is_brnn=True)[source]¶ Aggregates activations by picking the last subword assuming BPE-based tokenization.
Given loaded token data and activations, this function aggregates activations based on tokenized text. BPE-based tokenization is assumed, with every non-terminal subword ending with "@@". The activations are aggregated by picking the last subword of any given word.
Warning
This function is deprecated and will be removed in future versions.
- Parameters
tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations.
is_brnn (bool, optional) – Whether the model from which activations were extracted was bidirectional. Only applies to RNN models.
- Returns
activations – Subword aggregated activations corresponding to one per actual token found in the untokenized text.
- Return type
list of numpy.ndarray
-
neurox.data.representations.
char_get_avg_activations
(tokens, activations)[source]¶ Aggregates activations by averaging assuming Character-based tokenization.
Given loaded token data and activations, this function aggregates activations based on character-tokenized text. The activations are aggregated by averaging over characters.
Warning
This function is deprecated and will be removed in future versions.
- Parameters
tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations.
- Returns
activations – Character aggregated activations corresponding to one per actual token found in the untokenized text.
- Return type
list of numpy.ndarray
-
neurox.data.representations.
char_get_last_activations
(tokens, activations, is_brnn=True)[source]¶ Aggregates activations by picking the last character, assuming Character-based tokenization.
Given loaded token data and activations, this function aggregates activations based on character-tokenized text. The activations are aggregated by picking the last character of any given word.
Warning
This function is deprecated and will be removed in future versions.
- Parameters
tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations.
is_brnn (bool, optional) – Whether the model from which activations were extracted was bidirectional. Only applies to RNN models.
- Returns
activations – Character aggregated activations corresponding to one per actual token found in the untokenized text.
- Return type
list of numpy.ndarray
-
neurox.data.representations.
sent_get_last_activations
(tokens, activations)[source]¶ Gets the summary vector for the input sentences.
Given loaded tokens data and activations, this function picks the final token’s activations for every sentence, essentially giving summary vectors for every sentence in the dataset. This is mostly applicable for RNNs.
Note
Bidirectionality is currently not handled in the case of BiRNNs.
- Parameters
tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations.
- Returns
activations – Summary activations corresponding to one per actual sentence in the original text.
- Return type
list of numpy.ndarray
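The summary-vector extraction described above amounts to keeping the final row of each sentence matrix, which can be sketched as:

```python
import numpy as np

# Each sentence is a [num_tokens x num_neurons] matrix; the summary
# vector for a sentence is simply its final token's activation row.
activations = [
    np.array([[1.0, 2.0], [3.0, 4.0]]),  # 2-token sentence
    np.array([[5.0, 6.0]]),              # 1-token sentence
]
summaries = [sentence[-1] for sentence in activations]
```

Note that, per the warning above, this picks the last row regardless of direction, which is why bidirectionality is not handled for BiRNNs.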
neurox.data.utils¶
-
neurox.data.utils.
save_files
(words, labels, activations, output_prefix, output_type='hdf5', decompose_layers=False, filter_layers=None)[source]¶ Save word and label files in text format, and activations in the specified format (hdf5 by default)
- Parameters
words (list) – A list of words
labels (list) – A list of labels for every word
activations (list) – A list of word-wise activations
output_prefix (str) – Prefix of the output files
- Returns
Saves word, label and activation files
neurox.data.writer¶
Representations Writers
Module with various writers for saving representations/activations. Currently, two file types are supported:
hdf5: This is a binary format, and results in smaller overall files. The structure of the file is as follows:
- sentence_to_idx dataset: Contains a single json string at index 0 that maps sentences to indices
- Indices 0 through N-1 datasets: Each index corresponds to one sentence. The value of the dataset is a tensor with dimensions num_layers x sentence_length x embedding_size, where embedding_size may include multiple layers
json: This is a human-readable format. There is some loss of precision, since each activation value is saved using 8 decimal places. Concretely, this results in a jsonl file, where each line is a json string corresponding to a single sentence. The structure of each line is as follows:
- linex_idx: Sentence index
- features: List of tokens (with their activations)
  - token: The current token
  - layers: List of layers
    - index: Layer index (does not correspond to the original model's layers)
    - values: List of activation values for all neurons in the layer
The writers also support saving activations from specific layers only, using the filter_layers argument. Since activation files can be large, an additional option for decomposing the representations into layer-wise files is also provided.
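A single line of the jsonl format described above can be built with the standard library as a sketch (the concrete tokens and values here are made up for illustration):

```python
import json

# Hypothetical one-sentence record following the jsonl structure
# described above; activation values are rounded to 8 decimal places,
# which is where the loss of precision comes from.
record = {
    "linex_idx": 0,
    "features": [
        {
            "token": "Hello",
            "layers": [
                {"index": 0, "values": [round(0.123456789, 8), round(-1.0, 8)]},
            ],
        }
    ],
}
line = json.dumps(record)  # one line of the jsonl file
```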
-
class
neurox.data.writer.
ActivationsWriter
(filename, filetype=None, decompose_layers=False, filter_layers=None)[source]¶ Bases:
object
Class that encapsulates all available writers.
This is the only class that should be used by the rest of the library.
-
filename
¶ Filename for storing the activations. May not be used exactly if decompose_layers is True.
- Type
str
-
filetype
¶ An additional hint for the filetype. This argument is optional; the file type will be detected automatically from the filename if none is supplied.
- Type
str
-
decompose_layers
¶ Set to True if each layer's activations should be saved in a separate file.
- Type
bool
-
filter_layers
¶ Comma-separated list of layer indices to save.
- Type
str
-
__init__
(filename, filetype=None, decompose_layers=False, filter_layers=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
open
()[source]¶ Method to open the underlying files. Will be called automatically by the class instance when necessary.
-
write_activations
(sentence_idx, extracted_words, activations)[source]¶ Method to write a single sentence’s activations to file
-
-
class
neurox.data.writer.
ActivationsWriterManager
(filename, filetype=None, decompose_layers=False, filter_layers=None)[source]¶ Bases:
neurox.data.writer.ActivationsWriter
Manager class that handles decomposition and filtering.
Decomposition requires multiple writers (one per file) and filtering requires processing the activations to remove unneeded layer activations. This class sits on top of the actual activations writer to manage these operations.
-
__init__
(filename, filetype=None, decompose_layers=False, filter_layers=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
open
(num_layers)[source]¶ Method to open the underlying files. Will be called automatically by the class instance when necessary.
-
-
class
neurox.data.writer.
HDF5ActivationsWriter
(filename)[source]¶ Bases:
neurox.data.writer.ActivationsWriter
-
open
()[source]¶ Method to open the underlying files. Will be called automatically by the class instance when necessary.
-
-
class
neurox.data.writer.
JSONActivationsWriter
(filename)[source]¶ Bases:
neurox.data.writer.ActivationsWriter
-
open
()[source]¶ Method to open the underlying files. Will be called automatically by the class instance when necessary.
-
Module contents: