dirty_cat.datasets.get_ken_embeddings

dirty_cat.datasets.get_ken_embeddings(types=None, *, exclude=None, embedding_table_id='all_entities', embedding_type_id=None, pca_components=None, suffix='')

Download Wikipedia embeddings by type.

More details on the embeddings can be found on https://soda-inria.github.io/ken_embeddings/.

Parameters:
types : str, optional

Substring pattern used to filter the entity types: every type whose name contains the substring is kept. Must be written in lowercase. If None, all types are kept.

exclude : str, optional

Type of embeddings to exclude from the types search.

embedding_table_id : str, default='all_entities'

Table of embedded entities from which to extract the embeddings. Get the supported tables with get_ken_table_aliases(). It is also possible to pass a custom figshare ID.

embedding_type_id : str, optional

Figshare ID of the file containing the type of embeddings. Get the supported tables with get_ken_types(). Ignored unless a custom embedding_table_id is provided.

pca_components : int, optional

Dimension of the space onto which the embeddings are projected by a principal component analysis. If None, the default dimension of the embeddings (200) is kept.

suffix : str, optional, default=''

Suffix to add to the column names of the embeddings.

Returns:
DataFrame

The embeddings of the Wikipedia entities of the specified types.
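A minimal usage sketch, built only from the signature and parameter descriptions above. The argument values ("music", "label", 20, "_music") and the helper name load_music_embeddings are illustrative assumptions, and treating exclude as a substring pattern is also an assumption. The call is wrapped in a function because it downloads data and needs dirty_cat, pyarrow and network access.

```python
# Keyword arguments for get_ken_embeddings; all values are illustrative choices.
ken_kwargs = {
    "types": "music",       # keep entity types whose name contains "music"
    "exclude": "label",     # assumed to drop types matching this pattern
    "pca_components": 20,   # project the default 200 dimensions down to 20
    "suffix": "_music",     # appended to the embedding column names
}


def load_music_embeddings(**overrides):
    """Hypothetical helper: download the embeddings as a DataFrame.

    Requires dirty_cat, pyarrow and network access when called.
    """
    # Imported lazily so that defining this helper does not require dirty_cat.
    from dirty_cat.datasets import get_ken_embeddings

    return get_ken_embeddings(**{**ken_kwargs, **overrides})
```

Calling load_music_embeddings() would then return the DataFrame described above; passing pca_components=None as an override keeps the default embedding dimension.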

See also

get_ken_table_aliases()

Get the supported aliases of embedded entities tables.

get_ken_types()

Helper function to search for entity types.

dirty_cat.fuzzy_join()

Join two tables (dataframes) based on approximate column matching.

dirty_cat.FeatureAugmenter

Transformer to enrich a given table via one or more fuzzy joins to external resources.

Notes

The files are read in Parquet format, so pyarrow must be installed for this function to run correctly.

The types parameter filters the entity types by the given substring pattern. For example, with “music” as input, every type containing this string is included (e.g. “wikicat_musician_from_france”, “wikicat_music_label”, etc.). Passing an exact type name (e.g. “wikicat_rock_music_bands”) is possible, but the result may be incomplete, since some relevant bands may belong to other, similar types. To search the available types, use the get_ken_types() function.
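The substring matching described in this note can be illustrated without downloading anything. The sketch below is not the library's implementation: filter_types is a hypothetical stand-in, the last type name in the list is made up, and treating exclude as a substring pattern is an assumption.

```python
# Example type names; the first three appear in the note above,
# the last one is made up for illustration.
available_types = [
    "wikicat_musician_from_france",
    "wikicat_music_label",
    "wikicat_rock_music_bands",
    "wikicat_french_painters",
]


def filter_types(type_names, types=None, exclude=None):
    """Hypothetical stand-in for the type filter (substring matching)."""
    kept = []
    for name in type_names:
        if types is not None and types not in name:
            continue  # name does not contain the requested substring
        if exclude is not None and exclude in name:
            continue  # name matches the pattern to exclude (assumed behavior)
        kept.append(name)
    return kept


# A broad pattern keeps every type containing "music".
broad = filter_types(available_types, types="music")
# An exact type name keeps only that type, which may miss related ones.
narrow = filter_types(available_types, types="wikicat_rock_music_bands")
# Combining a pattern with an exclusion.
without_labels = filter_types(available_types, types="music", exclude="label")
```

Here the broad pattern keeps three types while the exact name keeps a single one, which is why an exact type name may leave out relevant entities.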

References

For more details, see Cvetkov-Iliev, A., Allauzen, A. & Varoquaux, G.: Relational data embeddings for feature enrichment with background information.
