neurox.data.extraction

Submodules:

neurox.data.extraction.transformers_extractor

Representations Extractor for transformers toolkit models.

Module that, given a file with input sentences and a transformers model, extracts representations from all layers of the model. The script supports aggregating over the subwords produced by the model's tokenizer.

Can also be invoked as a script as follows:

python -m neurox.data.extraction.transformers_extractor

neurox.data.extraction.transformers_extractor.get_model_and_tokenizer(model_desc, device='cpu', random_weights=False)

Automatically gets the appropriate transformers model and tokenizer based on the model description.

Parameters
  • model_desc (str) – Model description; can be a model name like bert-base-uncased, a comma-separated pair of the form <model>,<tokenizer> (since 1.0.8), or a path to a trained model.

  • device (str, optional) – Device to load the model on, cpu or gpu. Default is cpu.

  • random_weights (bool, optional) – Whether the weights of the model should be randomized. Useful for analyses where one needs an untrained model.

Returns

  • model (transformers model) – An instance of one of the transformers.modeling classes

  • tokenizer (transformers tokenizer) – An instance of one of the transformers.tokenization classes
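
The comma-separated form of model_desc described above can be illustrated with a small sketch. The helper name split_model_desc is hypothetical and not part of neurox; the fallback of reusing the model name as the tokenizer name when no comma is present is an assumption made for illustration, not a statement about the library's internals.

```python
def split_model_desc(model_desc):
    """Hypothetical helper (not part of neurox): split a model description
    of the form '<model>,<tokenizer>' into its two parts.

    Assumption: when no comma is present, the same name or path is used
    for both the model and the tokenizer.
    """
    parts = [part.strip() for part in model_desc.split(",")]
    if len(parts) == 1:
        return parts[0], parts[0]
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(
        "Expected '<model>' or '<model>,<tokenizer>', got: %r" % model_desc
    )


# A plain model name maps to the same name for both parts.
model_name, tokenizer_name = split_model_desc("bert-base-uncased")

# A comma-separated pair names the model and tokenizer separately.
# The path 'path/to/model' is a placeholder.
custom_model, custom_tokenizer = split_model_desc("path/to/model,bert-base-uncased")
```

Either resulting string would then be something that can be loaded as a model or a tokenizer, respectively.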

neurox.data.extraction.transformers_extractor.aggregate_repr(state, start, end, aggregation)

Function that aggregates activations/embeddings over a span of subword tokens. This function will usually be called once per word. For example, if we had the sentence:

This is an example

which is tokenized by BPE into:

this is an ex @@am @@ple

The function should be called four times, once per word, with start and end given as inclusive subword indices:

aggregate_repr(state, 0, 0, aggregation)
aggregate_repr(state, 1, 1, aggregation)
aggregate_repr(state, 2, 2, aggregation)
aggregate_repr(state, 3, 5, aggregation)

Returns a zero vector if end is less than start, i.e. the request is to aggregate over an empty slice.

Parameters
  • state (numpy.ndarray) – Matrix of size [NUM_LAYERS x NUM_SUBWORD_TOKENS_IN_SENT x LAYER_DIM]

  • start (int) – Index of the first subword of the word being processed

  • end (int) – Index of the last subword of the word being processed

  • aggregation ({'first', 'last', 'average'}) – Aggregation method for combining subword activations

Returns

word_vector – Matrix of size [NUM_LAYERS x LAYER_DIM]

Return type

numpy.ndarray
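
The behaviour documented above can be sketched with NumPy as follows. This is a minimal illustration written from the description in this section, not the library's own code; the name aggregate_repr_sketch and the toy state array are assumptions made for the example.

```python
import numpy as np


def aggregate_repr_sketch(state, start, end, aggregation):
    """Illustrative re-implementation based on the documented behaviour.

    state: array of shape [NUM_LAYERS x NUM_SUBWORD_TOKENS_IN_SENT x LAYER_DIM]
    start, end: inclusive subword indices of the word being processed
    """
    if end < start:
        # Empty slice: return a zero vector per layer.
        return np.zeros((state.shape[0], state.shape[2]), dtype=state.dtype)
    if aggregation == "first":
        return state[:, start, :]
    if aggregation == "last":
        return state[:, end, :]
    if aggregation == "average":
        # end is inclusive, so the slice runs to end + 1.
        return state[:, start:end + 1, :].mean(axis=1)
    raise ValueError("Unknown aggregation: %r" % aggregation)


# Toy input (assumed for illustration): 2 layers, 6 subword tokens, 3 dims.
state = np.arange(2 * 6 * 3, dtype=np.float64).reshape(2, 6, 3)

# Word spans for "this is an ex @@am @@ple": three single-token words
# and one word covering subword tokens 3 to 5.
word_spans = [(0, 0), (1, 1), (2, 2), (3, 5)]
word_vectors = [
    aggregate_repr_sketch(state, start, end, "average")
    for start, end in word_spans
]
```

Each entry of word_vectors has shape [NUM_LAYERS x LAYER_DIM], so stacking the entries along a new axis gives one vector per word and layer.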

neurox.data.extraction.transformers_extractor.extract_sentence_representations(sentence, model, tokenizer, device='cpu', include_embeddings=True, aggregation='last', tokenization_counts={})

Gets representations for one sentence.

neurox.data.extraction.transformers_extractor.extract_representations(model_desc, input_corpus, output_file, device='cpu', aggregation='last', output_type='json', random_weights=False, ignore_embeddings=False, decompose_layers=False, filter_layers=None)

neurox.data.extraction.transformers_extractor.main()

Module contents: