What is this Dataset?

Transformers Concept Net is a collection of latent concepts derived from five transformer models: BERT-cased, RoBERTa, XLNet, XLM-RoBERTa, and ALBERT. The dataset was created by clustering contextualized word representations from these models and then annotating the resulting clusters with the help of ChatGPT. Its goal is to aid the understanding and analysis of deep transformer models. We used a 250K-sentence subset of the WMT News 2018 dataset as the basis for concept discovery and extracted 600 concepts per layer for each of the five models.
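To make the discovery step concrete, here is a minimal, hypothetical sketch of clustering contextualized token representations from one layer of a model. It assumes the Hugging Face transformers and scikit-learn libraries; the model choice, layer index, cluster count, and clustering algorithm here are illustrative assumptions, not the paper's exact pipeline.

```python
# Illustrative sketch: collect contextualized token representations from one
# layer of a pre-trained model, then cluster them into candidate concepts.
# Assumptions: bert-base-cased, layer 9, agglomerative clustering, 2 clusters
# for this toy input (the full dataset uses 600 concepts per layer).
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import AgglomerativeClustering

sentences = ["The senator met the governor.", "Berlin is a city in Germany."]
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)

tokens, vectors = [], []
with torch.no_grad():
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt")
        hidden = model(**enc).hidden_states[9]  # representations from layer 9
        for tok, vec in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()),
                            hidden[0]):
            if tok not in ("[CLS]", "[SEP]"):  # skip special tokens
                tokens.append(tok)
                vectors.append(vec.numpy())

# Group token representations into latent concepts.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(vectors)
for tok, lab in zip(tokens, labels):
    print(lab, tok)
```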

Example concepts: "Political Figures" (ALBERT, layer 9) and "Geographic Locations in California" (BERT, layer 12).

Please check out the EMNLP'23 paper to view more concepts and to learn more about how the dataset was created and annotated.
  • 39K Latent Concepts
  • 5 Transformer Models
  • ChatGPT Annotations
  • Growing Dataset

Download Links

The dataset is available in splits for each model:
bert-base-cased, albert-base-v1, roberta-base, xlm-roberta-base, and xlnet-base-cased.
Each zip file contains the concepts for every layer of the corresponding model, the annotations for those concepts, and the sentences from which the contextual representations were derived before concept discovery.
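A quick way to inspect a downloaded split is to list the archive's contents. This is a minimal sketch; the zip file name below is an assumption based on the split names above, and the internal layout should be checked against the actual archive.

```python
# Hypothetical example: list the per-layer concept, annotation, and sentence
# files inside a downloaded split. "bert-base-cased.zip" is an assumed name.
import zipfile

with zipfile.ZipFile("bert-base-cased.zip") as zf:
    for name in zf.namelist():
        print(name)  # inspect the archive layout before loading files
```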

The TransformersConceptNet repository provides code for interactively browsing the dataset, the annotations, and the context for each concept.

Citation

If you find this dataset useful in your own work, please cite the following paper:

@inproceedings{mousi2023llms,
    title = "Can LLMs Facilitate Interpretation of Pre-trained Language Models?",
    author = "Mousi, Basel and Durrani, Nadir and Dalvi, Fahim",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    publisher = "Association for Computational Linguistics",
    url = "https://browse.arxiv.org/pdf/2305.13386.pdf"
}