Representations Extractor for transformers toolkit models.

Given a file of input sentences and a transformers model, this module extracts representations from all layers of the model. The script supports aggregation over sub-words created by the tokenization of the provided model.

Can also be invoked as a script as follows:

python -m neurox.data.extraction.transformers_extractor

get_model_and_tokenizer(model_desc, device='cpu', random_weights=False)

Automatically get the appropriate transformers model and tokenizer based on the model description.

Parameters

  • model_desc (str) – Model description; can either be a model name like bert-base-uncased, a comma-separated list indicating <model>,<tokenizer> (since 1.0.8), or a path to a trained model

  • device (str, optional) – Device to load the model on, cpu or gpu. Default is cpu.

  • random_weights (bool, optional) – Whether the weights of the model should be randomized. Useful for analyses where one needs an untrained model.


Returns

  • model (transformers model) – An instance of one of the transformers.modeling classes

  • tokenizer (transformers tokenizer) – An instance of one of the transformers.tokenization classes
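The <model>,<tokenizer> convention for model_desc can be sketched with a small helper; parse_model_desc is a hypothetical name used here for illustration and is not part of this module's API:

```python
def parse_model_desc(model_desc):
    """Split a model description into model and tokenizer names.

    Follows the documented convention: a lone name ("bert-base-uncased")
    is used for both model and tokenizer, while "<model>,<tokenizer>"
    names them separately. Hypothetical helper, for illustration only.
    """
    if "," in model_desc:
        model_name, tokenizer_name = model_desc.split(",", 1)
    else:
        model_name = tokenizer_name = model_desc
    return model_name, tokenizer_name
```

For example, parse_model_desc("bert-base-uncased") returns ("bert-base-uncased", "bert-base-uncased"), while a comma-separated description yields distinct model and tokenizer names.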

aggregate_repr(state, start, end, aggregation)

Function that aggregates activations/embeddings over a span of subword tokens. It is usually called once per word. For example, given the sentence:

This is an example

which is tokenized by BPE into:

this is an ex @@am @@ple

The function should be called 4 times:

aggregate_repr(state, 0, 0, aggregation)
aggregate_repr(state, 1, 1, aggregation)
aggregate_repr(state, 2, 2, aggregation)
aggregate_repr(state, 3, 5, aggregation)

Returns a zero vector if end is less than start, i.e. the request is to aggregate over an empty slice.
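The behaviour described above can be sketched in a few lines of numpy (an illustrative re-implementation, not the module's actual source; note that end is inclusive, matching the example calls above):

```python
import numpy as np

def aggregate_repr(state, start, end, aggregation):
    """Aggregate activations over the subword span [start, end] (inclusive).

    state: [num_layers x num_subwords x layer_dim].
    Returns a [num_layers x layer_dim] matrix, or zeros when end < start.
    Illustrative sketch of the documented behaviour, not the library source.
    """
    num_layers, _, layer_dim = state.shape
    if end < start:
        # Empty span: return a zero vector, as documented.
        return np.zeros((num_layers, layer_dim))
    if aggregation == "first":
        return state[:, start, :]
    if aggregation == "last":
        return state[:, end, :]
    if aggregation == "average":
        return state[:, start : end + 1, :].mean(axis=1)
    raise ValueError(f"Unknown aggregation: {aggregation}")
```

In the BPE example above, the word "example" spans subwords 3 to 5, so aggregate_repr(state, 3, 5, 'average') returns the mean of those three subword vectors in every layer.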

Parameters

  • state (numpy.ndarray) – Matrix of size [NUM_LAYERS x NUM_SUBWORD_TOKENS_IN_SENT x LAYER_DIM]

  • start (int) – Index of the first subword of the word being processed

  • end (int) – Index of the last subword of the word being processed

  • aggregation ({'first', 'last', 'average'}) – Aggregation method for combining subword activations


Returns

word_vector – Matrix of size [NUM_LAYERS x LAYER_DIM]

Return type

numpy.ndarray

extract_sentence_representations(sentence, model, tokenizer, device='cpu', include_embeddings=True, aggregation='last', tokenization_counts={})

Get representations for one sentence.

extract_representations(model_desc, input_corpus, output_file, device='cpu', aggregation='last', output_type='json', random_weights=False, ignore_embeddings=False, decompose_layers=False, filter_layers=None)
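One plausible way the layer-selection flags could interact is sketched below; select_layers is a hypothetical helper, and the assumptions that layer 0 holds the embedding layer and that filter_layers is a comma-separated string of layer indices are mine, not stated on this page:

```python
import numpy as np

def select_layers(activations, ignore_embeddings=False, filter_layers=None):
    """Illustrative layer filtering for [NUM_LAYERS x TOKENS x DIM] activations.

    Assumptions (not confirmed by the documentation above): layer 0 is the
    embedding layer, and filter_layers is a comma-separated index string.
    """
    if ignore_embeddings:
        # Drop the embedding layer, keeping only transformer layers.
        activations = activations[1:]
    if filter_layers is not None:
        # Keep only the requested layer indices, in the order given.
        keep = [int(i) for i in filter_layers.split(",")]
        activations = activations[keep]
    return activations
```

For a 12-layer model plus embeddings (13 layers in total), ignore_embeddings=True leaves 12 layers, and filter_layers="1,2,3" keeps exactly three.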

Module contents: