What is this Dataset?

BERT Concept Net is a dataset of latent concepts learned within the representations of BERT. The goal of this dataset is to complement existing human-defined concepts such as linguistic and semantic properties (part-of-speech tags, syntactic tags, WordNet, etc.). The concepts are discovered in an unsupervised fashion and annotated using a semi-supervised method.

Figure: Birds, from SEM:animal:land_animal

Figure: Proper nouns, from SEM:origin:europe:germany

Please check out the ICLR'22 paper for more details on how the dataset was curated and annotated. The labels themselves can also be explored below.
  • 635K sentences
  • 219 unique concepts
  • Hierarchical labels
  • 1.7M annotated tokens
  • Growing dataset

Download Links

The dataset is available here as a gzipped tar archive: bert-concept-net_v1.tgz (44M).
After downloading the package, the tar utility can be used to extract the data into the current directory:
tar xvzf bert-concept-net_v1.tgz
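
For pipelines where calling the tar utility is inconvenient, the same extraction can be sketched with Python's standard tarfile module. This is a minimal sketch, not part of the released scripts: the helper name extract_archive is hypothetical, and it assumes the archive has already been downloaded locally.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive into dest_dir.

    Returns the list of member names found in the archive.
    (Hypothetical helper for illustration, not part of the dataset's scripts.)
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters that reject
            # unsafe members such as absolute paths.
            archive.extractall(path=dest, filter="data")
        else:
            archive.extractall(path=dest)
    return names


# Example usage, assuming the file sits in the working directory:
# extract_archive("bert-concept-net_v1.tgz")
```

The returned member names give a quick view of what the package contains before reading the README.md.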

The accompanying README.md provides detailed instructions on the format of the dataset and the associated scripts to filter the dataset.

The raw annotations are also available as an archive: annotations-release.zip
The code used to process the data and obtain the original clusters before annotation will be made available soon.

Data Exploration

Click on any of the segments to zoom in and see finer concepts.
Click in the center to zoom back out to the parent concepts.
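
The same parent/child navigation can be approximated in code. The sketch below assumes that the hierarchy is encoded in the label string itself, with levels separated by colons (as in SEM:entertainment:sport:basketball); that reading is inferred from the labels shown on this page, and the function names are hypothetical.

```python
def concept_ancestors(label, sep=":"):
    """Return every level of a hierarchical label, from coarsest to finest.

    Assumes levels are separated by `sep` (an inference from the label strings).
    """
    parts = label.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]


def parent_concept(label, sep=":"):
    """Return the parent of a label, or None for a top-level label."""
    head, found, _ = label.rpartition(sep)
    return head if found else None


for level in concept_ancestors("SEM:entertainment:sport:basketball"):
    print(level)
```

Under this assumption, "zooming out" from a concept corresponds to calling parent_concept repeatedly until it returns None.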

Samples

Some samples from the dataset. Each sentence is shown as space-separated tokens, followed by the tokens that carry labels; all other tokens are unlabeled (N/A).

Former NBA player Hickson facing armed robbery charge in August .
  • NBA: SEM:named_entity, SEM:entertainment:sport:basketball, POS:proper-noun
  • August: SEM:time:months_of_year

The mood was more muted at Emirates Stadium at the start of Arsene Wenger 's long goodbye to Arsenal .
  • Emirates: SEM:entertainment:sport, SEM:origin:europe:uk
  • Arsenal: SEM:entertainment:sport, SEM:origin:europe:uk

He had publicly tried to distance himself from it ...
  • publicly: LEX:suffix:ly, POS:adverb
  • ...: LEX:dots

The US ten-year Treasury bond return has jumped
  • ten-year: SEM:time:timeframe, POS:adjective, LEX:hyphenated

Alam often critiques the government
  • Alam: POS:proper-noun, SEM:named_entity:person, SEM:demography:muslim_name, LEX:case:title_case
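
To make the token-level annotation concrete, here is a minimal sketch that pairs each token of one sample with its labels and keeps only the labeled tokens. It assumes one label string per token, with N/A marking unlabeled tokens and commas separating multiple labels, as the samples suggest. The in-memory lists are an illustration only; the actual file format is described in the README.md, and labeled_tokens is a hypothetical helper.

```python
def labeled_tokens(tokens, labels, empty="N/A"):
    """Pair tokens with their labels and drop unlabeled tokens.

    `labels` holds one string per token; multiple labels are comma-separated.
    (Illustrative representation, not the dataset's on-disk format.)
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    result = []
    for token, label in zip(tokens, labels):
        if label == empty:
            continue
        result.append((token, [part.strip() for part in label.split(",")]))
    return result


tokens = "The US ten-year Treasury bond return has jumped".split()
labels = [
    "N/A",
    "N/A",
    "SEM:time:timeframe, POS:adjective, LEX:hyphenated",
    "N/A",
    "N/A",
    "N/A",
    "N/A",
    "N/A",
]
print(labeled_tokens(tokens, labels))
```

A single token can therefore carry semantic (SEM), part-of-speech (POS), and lexical (LEX) labels at the same time.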

Citation

If you find this dataset useful in your own work, please cite the following paper:

@inproceedings{dalvi2020discovering,
  title = {Discovering Latent Concepts Learned in {BERT}},
  author = {Dalvi, Fahim and
            Khan, Abdul and
            Alam, Firoj and
            Durrani, Nadir and
            Xu, Jia and
            Sajjad, Hassan},
  booktitle = {International Conference on Learning Representations},
  year = {2022},
  url = {https://openreview.net/forum?id=POTMtpYI1xH}
}