sosia.Original

class sosia.Original(scientist, treatment_year, first_year_margin=2, pub_margin=0.2, cits_margin=0.2, coauth_margin=0.2, affiliations=None, period=None, first_year_search='ID', eids=None, refresh=False, sql_fname=None)

Representation of a scientist for whom to find a control scientist.

Parameters:
  • scientist (str, int or list of str or int) – Scopus Author ID, or list of Scopus Author IDs, of the scientist to find a control scientist for.
  • treatment_year (str or numeric) – Year of the event. The control scientist will be matched on trends and characteristics of the original scientist up to this year.
  • first_year_margin (numeric (optional, default=2)) – Number of years by which the search for authors publishing around the original scientist’s year of first publication should be extended in both directions.
  • pub_margin (numeric (optional, default=0.2)) – The left and right margin for the number of publications to match possible matches and the scientist on. If the value is a float, it is interpreted as a percentage of the scientist’s number of publications and the resulting value is rounded up. If the value is an integer, it is interpreted as a fixed number of publications.
  • cits_margin (numeric (optional, default=0.2)) – The left and right margin for the number of citations to match possible matches and the scientist on. If the value is a float, it is interpreted as a percentage of the scientist’s number of citations and the resulting value is rounded up. If the value is an integer, it is interpreted as a fixed number of citations.
  • coauth_margin (numeric (optional, default=0.2)) – The left and right margin for the number of coauthors to match possible matches and the scientist on. If the value is a float, it is interpreted as a percentage of the scientist’s number of coauthors and the resulting value is rounded up. If the value is an integer, it is interpreted as a fixed number of coauthors.
  • affiliations (list (optional, default=None)) – A list of Scopus affiliation IDs. If provided, sosia conditions the match procedure on affiliation with these IDs in the treatment year.
  • period (int (optional, default=None)) – An additional period prior to the publication year on which to match scientists. Note: If the value is larger than the publication range, period is reset to None.
  • first_year_search (str (optional, default="ID")) – How to determine characteristics of possible control scientists in the first year of publication. Mode “ID” uses Scopus Author IDs only. Mode “name” will select relevant profiles based on their surname and first name but only when “period” is not None. Select this mode to counter potential incompleteness of author profiles.
  • eids (list (optional, default=None)) – A list of Scopus EIDs of the publications of the scientist you want to find a control for. If provided, the scientist’s properties and the control group are set based on this list of publications instead of the list of publications obtained from the Scopus Author ID.
  • refresh (boolean (optional, default=False)) – Whether to refresh cached results (if they exist) or not. If int is passed, results will be refreshed if they are older than that value in number of days.
  • sql_fname (str (optional, default=None)) – The path of the SQLite database to connect to. If None, will use the path specified in config.ini.
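
The float-versus-integer rule shared by pub_margin, cits_margin and coauth_margin can be sketched as follows. This is a minimal illustration of the rule as described above, not sosia's own code; the helper name margin_bounds and the sample numbers are hypothetical.

```python
import math


def margin_bounds(value, margin):
    """Return (lower, upper) bounds around `value` (hypothetical helper).

    A float margin is read as a share of `value` and rounded up;
    an integer margin is read as a fixed count.
    """
    if isinstance(margin, float):
        width = math.ceil(value * margin)
    else:
        width = margin
    return value - width, value + width


# A scientist with 37 publications and the default margin of 0.2
pub_bounds = margin_bounds(37, 0.2)
# A scientist with 412 citations and a fixed margin of 50 citations
cits_bounds = margin_bounds(412, 50)
```

Under this reading, passing 1 and passing 1.0 are not equivalent: the integer allows a deviation of one unit, while the float allows a deviation of 100 percent.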
define_search_group(stacked=False, verbose=False, refresh=False)

Define search_group.

Parameters:
  • stacked (bool (optional, default=False)) – Whether to combine searches in few queries or not. Cached files will most likely not be reusable. Set to True if you query in distinct fields or you want to minimize API key usage.
  • verbose (bool (optional, default=False)) – Whether to report on the progress of the process.
  • refresh (bool (optional, default=False)) – Whether to refresh cached results (if they exist) or not.
define_search_sources(verbose=False)

Define .search_sources.

Within the list of search sources sosia will search for matching scientists. A search source is of the same main field as the original scientist, the same types (journal, conference proceeding, etc.), and must not be related to fields alien to the original scientist.

Parameters:
  • verbose (bool (optional, default=False)) – Whether to report on the progress of the process.
find_matches(stacked=False, verbose=False, refresh=False)

Find matches within search_group based on five criteria:

  1. Started publishing in about the same year
  2. Has about the same number of publications in the treatment year
  3. Has about the same number of coauthors in the treatment year
  4. Has about the same number of citations in the treatment year
  5. Works in the same field as the scientist’s main field
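
The matching criteria can be sketched as a single predicate over two profiles. This is an illustrative simplification, not sosia's implementation: the Profile class, the helper names and the sample values are hypothetical, and the field criterion is reduced to equality of main-field ASJC codes.

```python
import math
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical summary of a scientist in the treatment year."""
    first_year: int
    num_publications: int
    num_coauthors: int
    num_citations: int
    main_field: int  # ASJC code


def _within(value, target, margin):
    # Float margins are shares of the target (rounded up), ints are fixed
    width = math.ceil(target * margin) if isinstance(margin, float) else margin
    return abs(value - target) <= width


def is_match(cand, orig, first_year_margin=2, pub_margin=0.2,
             cits_margin=0.2, coauth_margin=0.2):
    """Check a candidate against the original on all criteria."""
    return (
        abs(cand.first_year - orig.first_year) <= first_year_margin
        and _within(cand.num_publications, orig.num_publications, pub_margin)
        and _within(cand.num_coauthors, orig.num_coauthors, coauth_margin)
        and _within(cand.num_citations, orig.num_citations, cits_margin)
        and cand.main_field == orig.main_field
    )


original = Profile(2005, 40, 25, 500, 2002)
close = Profile(2006, 44, 22, 450, 2002)  # inside every margin
far = Profile(2006, 44, 22, 900, 2002)    # far too many citations
```

Here is_match(close, original) holds, while is_match(far, original) fails on the citation criterion alone.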

Parameters:
  • stacked (bool (optional, default=False)) – Whether to combine searches in few queries or not. Cached files will most likely not be reusable. Set to True if you query in distinct fields or you want to minimize API key usage.
  • verbose (bool (optional, default=False)) – Whether to report on the progress of the process.
  • refresh (bool (optional, default=False)) – Whether to refresh cached results (if they exist) or not. If int is passed and stacked=False, results will be refreshed if they are older than that value in number of days.
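
The two accepted forms of refresh (a boolean, or an integer age in days) can be told apart as in the sketch below. The function needs_refresh and its age_days argument are hypothetical and only illustrate the documented behaviour; how sosia determines the age of a cached result is not shown here.

```python
def needs_refresh(refresh, age_days):
    """Decide whether a cached result should be refreshed (hypothetical).

    `refresh` is either a bool or a maximum age in days.
    """
    # bool is a subclass of int in Python, so it must be checked first
    if isinstance(refresh, bool):
        return refresh
    # Integer: refresh only if the cache is older than that many days
    return age_days > refresh
```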

Notes

Matches are available through property .matches.

get_publication_languages(refresh=False)

Parse languages of published documents.

inform_matches(fields=None, verbose=False, refresh=False, stop_words=None, **tfidf_kwds)

Add information to matches to aid in selection process.

Parameters:
  • fields (iterable (optional, default=None)) – Which information to provide. Allowed values are “first_year”, “num_coauthors”, “num_publications”, “num_citations”, “country”, “language”, “reference_sim”, “abstract_sim”. If None, will use all available fields.
  • verbose (bool (optional, default=False)) – Whether to report on the progress of the process.
  • refresh (bool (optional, default=False)) – Whether to refresh cached results (if they exist) or not. If int is passed, results will be refreshed if they are older than that value in number of days.
  • stop_words (list (optional, default=None)) – A list of words that should be filtered in the analysis of abstracts. If None, uses the list of English stopwords by nltk, augmented with numbers and punctuation.
  • tfidf_kwds (keywords) – Parameters to pass to TfidfVectorizer from the sklearn package for abstract vectorization. Not used when “abstract_sim” is not in fields. See https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html for possible values.

Notes

Matches including corresponding information are available through property .matches.

Raises:fields – If fields contains invalid keywords.
active_year

The scientist’s most recent year with publication(s) in or before the provided year.

affiliation_id

The affiliation ID (as string) of the scientist’s most frequent affiliation in or before the active year.

citations

The citations of the scientist until the provided year.

citations_period

The citations of the scientist during the given period.

coauthors

Set of coauthors of the scientist on all publications until the provided year.

coauthors_period

Set of coauthors of the scientist on all publications during the given period.

country

Country belonging to the affiliation defined in affiliation_id.

fields

The fields of the scientist until the provided year, estimated from the sources (journals, books, etc.) she published in.

first_name

The scientist’s first name.

first_year

The scientist’s year of first publication.

language

The language(s) the scientist published in.

main_field

The scientist’s main field of research, as tuple in the form (ASJC code, general category).

The main field is the field with the most publications, provided it is not Multidisciplinary (ASJC code 1000). In case of an equal number of publications, preference is given to non-general fields (those whose ASJC code ends in a digit other than 0).
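
The selection rule can be sketched as follows. This is a simplified reading of the description above, not sosia's code: the function pick_main_field is hypothetical, it takes one ASJC code per publication, and it returns only the code rather than the (ASJC code, general category) tuple.

```python
from collections import Counter


def pick_main_field(asjc_codes):
    """Pick the most frequent ASJC code, skipping Multidisciplinary (1000)."""
    counts = Counter(code for code in asjc_codes if code != 1000)
    if not counts:
        return None
    # Most publications first; on ties, codes not ending in 0 win
    return min(counts, key=lambda code: (-counts[code], code % 10 == 0))


# Tie between a general code (2000) and a specific one (2002)
tied = pick_main_field([2000, 2000, 2002, 2002, 1000, 1000, 1000])
# Clear majority for the general code
general = pick_main_field([2000, 2000, 2000, 2002])
```

In the first call the specific code 2002 wins the tie even though 1000 occurs most often; in the second, 2000 wins on count. Ties between two non-general codes are not covered by the description and fall back to input order in this sketch.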

matches

List of Scopus IDs or list of namedtuples representing matches of the original scientist in the treatment year.

Notes

Property is initiated via .find_matches().

name

The scientist’s complete name.

organization

The name belonging to the affiliation defined in affiliation_id.

publications

List of the scientist’s publications.

publications_period

The publications of the scientist published during the given period.

search_group

The set of authors that might be matches to the scientist. The set contains the intersection of all authors publishing in the treatment year as well as authors publishing around the year of first publication. Some authors with too many publications in the treatment year and authors having published too early are removed.

Notes

Property is initiated via .define_search_group().
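
The construction described above amounts to a set intersection followed by two removals. The sketch below assumes the four author sets are already known; the function name and its arguments are hypothetical, and the Scopus queries that would produce these sets are not shown.

```python
def build_search_group(treatment_year_authors, first_year_authors,
                       too_many_pubs, too_early):
    """Combine author ID sets into a search group (hypothetical helper)."""
    # Authors must publish in the treatment year AND around the first year
    group = set(treatment_year_authors) & set(first_year_authors)
    # Drop authors with too many publications or too early a start
    return group - set(too_many_pubs) - set(too_early)


group = build_search_group(
    treatment_year_authors={"A1", "A2", "A3", "A4"},
    first_year_authors={"A2", "A3", "A4", "A5"},
    too_many_pubs={"A3"},
    too_early={"A4"},
)
```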

search_sources

The set of sources (journals, books) comparable to the sources the scientist published in until the treatment year. A source is comparable if it belongs to the scientist’s main field but not to fields alien to the scientist, and if its type is among the types of the sources in the scientist’s main field in which she published.

Notes

Property is initiated via .define_search_sources().
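
The three comparability conditions can be sketched as a filter over candidate sources. The Source class, the function comparable_sources and the sample data are hypothetical; the sketch treats "alien" fields as any field the scientist has not published in, which is one possible reading of the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical record for a journal, book series, etc."""
    source_id: int
    type: str
    fields: frozenset  # ASJC codes assigned to the source


def comparable_sources(candidates, main_field, own_fields, own_types):
    """Return the IDs of sources comparable to the scientist's own."""
    own_fields = set(own_fields)
    return {
        s.source_id for s in candidates
        if main_field in s.fields      # belongs to the main field
        and s.fields <= own_fields     # no field alien to the scientist
        and s.type in own_types        # same type as her own sources
    }


candidates = [
    Source(1, "journal", frozenset({2002})),
    Source(2, "journal", frozenset({2002, 3300})),   # 3300 is alien
    Source(3, "bookseries", frozenset({2002})),      # wrong type
    Source(4, "journal", frozenset({1400})),         # not in main field
]
result = comparable_sources(candidates, 2002, {2002, 1400}, {"journal"})
```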

sources

The Scopus IDs of sources (journals, books) in which the scientist published.

subjects

The subject areas of the scientist’s publications.

surname

The scientist’s surname.