Reference¶
References to individual classes and their functions, methods, and properties.
sosia.Original¶
- class sosia.Original(scientist: str | int | list[str | int], match_year: str | int, eids: list[str | int] | None = None, refresh: bool | int = False, db_path: str | Path | None = None, log_path: str | Path | None = None, verbose: bool | None = False)[source]¶
Representation of a scientist for whom to find matches, i.e. control scientists (= Original).
- Parameters:
scientist (str, int or list of str or int) – Scopus Author ID, or list of Scopus Author IDs, of the scientist to find a control scientist for.
match_year (str or numeric) – Year in which the comparison takes place. Control scientist will be matched on trends and characteristics of the original scientist up to this year.
eids (list (optional, default=None)) – A list of Scopus EIDs of the publications to consider. If it is provided, all properties will be derived from them, and the control group is based on them. If None, will use all research-type publications published until the match year.
refresh (boolean (optional, default=False)) – Whether to refresh cached results (if they exist) or not. If int is passed, results will be refreshed if they are older than that value in number of days.
db_path (str or pathlib.Path (optional, default=None)) – The path of the SQLite database to connect to. If None, will default to ~/.cache/sosia/main.sqlite. Will be created if the database doesn’t exist.
log_path (str or pathlib.Path (optional, default=None)) – The path of the log file using logging. If None, will default to ~/.cache/sosia/sosia.log.
verbose (bool (optional, default=False)) – Whether to report on the initialization process.
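The two readings of refresh (a boolean, or a maximum cache age in days) can be sketched as follows. This is only an illustration of the rule described above, not sosia's own code: needs_refresh and its arguments are hypothetical names.

```python
from datetime import datetime, timedelta


def needs_refresh(refresh, cached_at, now=None):
    """Decide whether a cached result should be refreshed (hypothetical helper)."""
    now = now or datetime.now()
    # Check bool first: bool is a subclass of int in Python.
    if isinstance(refresh, bool):
        return refresh
    # An int is read as the maximum allowed cache age in days.
    return (now - cached_at) > timedelta(days=refresh)
```

With refresh=10, a result cached 30 days ago would be refreshed, while one cached 6 days ago would be reused.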
- define_search_sources(verbose: bool = False, mode: Literal['narrow', 'wide'] = 'narrow') Self [source]¶
Define search sources related to the Original.
Search sources are the set of sources where sosia will search for possible candidates. Search sources are of the same types (journal, conference proceeding, etc.) as those the Original published in, and are related to the main field (ASJC-4) of the Original.
- Parameters:
verbose (bool (optional, default=False)) – Whether to report on the progress of the process.
mode (str (optional, default="narrow")) – Accepted values: “narrow”, “wide”. A “narrow” definition of search sources excludes search sources that are also associated with fields (ASJC-4) not among the fields of the Original. A “wide” definition includes all those sources.
Notes
Search sources are available through property .search_sources.
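The difference between the “narrow” and “wide” modes can be illustrated with a small sketch. It assumes a plain mapping from source ID to a set of ASJC-4 codes and ignores the filter on source type; the function name, the argument names, and the example data are all hypothetical and do not reflect sosia's internals.

```python
def select_search_sources(source_fields, main_field, original_fields, mode="narrow"):
    """Pick search sources from a source -> ASJC-4 mapping (hypothetical helper)."""
    if mode not in ("narrow", "wide"):
        raise ValueError('mode must be "narrow" or "wide"')
    # Every source tagged with the Original's main field qualifies at first.
    selected = {src for src, fields in source_fields.items() if main_field in fields}
    if mode == "narrow":
        allowed = set(original_fields)
        # Drop sources that are also tagged with a field the Original lacks.
        selected = {src for src in selected if set(source_fields[src]) <= allowed}
    return selected


# Made-up source IDs and ASJC-4 codes, for illustration only.
source_fields = {101: {2002}, 102: {2002, 1402}, 103: {2002, 3312}, 104: {1402}}
narrow = select_search_sources(source_fields, 2002, {2002, 1402}, mode="narrow")
wide = select_search_sources(source_fields, 2002, {2002, 1402}, mode="wide")
```

Here source 103 is kept only in “wide” mode, because it is also tagged with a field (3312) that is not among the Original's fields.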
- filter_candidates(first_year_margin: int | None = None, pub_margin: float | int | None = None, coauth_margin: float | int | None = None, cits_margin: float | int | None = None, same_discipline: bool | None = False, verbose: bool = False, refresh: bool | int = False) None [source]¶
Find matches within candidates based on up to five criteria:
1. Work mainly in the same discipline (as of date of retrieval)
2. Started publishing in about the same year
3. Have about the same number of publications in the match year
4. Have about the same number of coauthors in the match year
5. Have about the same number of citations in the match year
- Parameters:
first_year_margin (numeric (optional, default=None)) – The left and right margin for year of first publication to match candidates and the scientist on. If the value is not given, sosia will not filter on the first year of publication.
pub_margin (numeric (optional, default=None)) – The left and right margin for the number of publications to match candidates and the scientist on. If the value is a float, it is interpreted as percentage of the scientist’s number of publications and the resulting value is rounded up. If the value is an integer, it is interpreted as fixed number of publications. If the value is not given, sosia will not filter on the number of publications.
coauth_margin (numeric (optional, default=None)) – The left and right margin for the number of coauthors to match candidates and the scientist on. If the value is a float, it is interpreted as percentage of the scientist’s number of coauthors and the resulting value is rounded up. If the value is an integer, it is interpreted as fixed number of coauthors. If the value is not given, sosia will not filter on the number of coauthors.
cits_margin (numeric (optional, default=None)) – The left and right margin for the number of citations to match candidates and the scientist on. If the value is a float, it is interpreted as percentage of the scientist’s number of citations and the resulting value is rounded up. If the value is an integer, it is interpreted as fixed number of citations. If the value is not given, sosia will not filter on the number of citations.
same_discipline (boolean (optional, default=False)) – Whether to restrict candidates to the same main discipline (ASJC2) as the original scientist or not.
verbose (bool (optional, default=False)) – Whether to report on the progress of the process.
refresh (bool or int (optional, default=False)) – Whether to refresh cached results (if they exist) or not. If int is passed, results will be refreshed if they are older than that value in number of days.
Notes
Matches are available through property .matches.
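How a margin turns into a lower and upper bound can be sketched as below. The helper margin_bounds is hypothetical, and the sketch assumes that a float margin is a fraction (0.2 meaning 20 percent); the documentation above does not state the scale, so treat that as an assumption.

```python
from math import ceil


def margin_bounds(value, margin):
    """Return (lower, upper) bounds around value, or None if no filter applies."""
    if margin is None:
        # No margin given: the criterion is not used for filtering.
        return None
    if isinstance(margin, float):
        # Assumption: a float is a fraction of the scientist's own value,
        # and the resulting width is rounded up.
        width = ceil(margin * value)
    else:
        # An integer is a fixed width.
        width = margin
    return value - width, value + width
```

For a scientist with 23 publications, a margin of 0.2 gives a width of 5 (4.6 rounded up), while a margin of 3 gives a width of exactly 3.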
- get_publication_languages(refresh: bool = False) Self ¶
Parse languages of published documents.
- identify_candidates_from_sources(first_year_margin: int, frequency: int | None = None, stacked: bool = False, verbose: bool = False, refresh: bool | int = False) Self [source]¶
Define a search group of authors based on their publication activity in the search sources between the first year and the match year.
- Parameters:
first_year_margin (int) – The left margin for year of first publication to identify match candidates.
frequency (int (optional, default=None)) – The maximum gap in number of years between publications of suitable candidates, i.e. the average frequency with which they publish in these sources. If not given, will take the average frequency (rounded up) of the Original: max[1, (match_year - first_year) / number of publications]. Must not be smaller than the first_year_margin. To find candidates, the method considers chunks of consecutive volumes, and requires candidates to publish at least once in each (!) of these chunks. That is, it generates sets of authors publishing in the search sources during specific years, and considers only the intersection of these. If the last chunk is smaller than half the target chunk size, it will be merged with the previous chunk.
stacked (bool (optional, default=False)) – Whether to combine searches in few queries or not. Cached files will most likely not be reusable. Set to True if you query in distinct fields or to minimize API key usage.
verbose (bool (optional, default=False)) – Whether to report on the progress of the process.
refresh (bool or int (optional, default=False)) – Whether to refresh cached results (if they exist) or not. If int is passed, results will be refreshed if they are older than that value in number of days.
Notes
If candidates have been identified, they are accessible through property .candidates.
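The default frequency and the chunk-and-intersect logic described for frequency can be sketched in plain Python. This is a simplified illustration under stated assumptions, not sosia's implementation: all three function names are hypothetical, the chunks start at first_year (the first_year_margin is ignored), and authors_by_year stands in for the results of the actual source searches.

```python
from math import ceil


def default_frequency(first_year, match_year, n_publications):
    """Average publication gap of the Original, rounded up, at least 1."""
    return max(1, ceil((match_year - first_year) / n_publications))


def chunk_years(first_year, match_year, frequency):
    """Split the years into consecutive chunks of `frequency` years."""
    years = list(range(first_year, match_year + 1))
    chunks = [years[i:i + frequency] for i in range(0, len(years), frequency)]
    # A trailing chunk shorter than half the target size joins its predecessor.
    if len(chunks) > 1 and len(chunks[-1]) < frequency / 2:
        last = chunks.pop()
        chunks[-1].extend(last)
    return chunks


def candidates_from_chunks(chunks, authors_by_year):
    """Keep only authors who appear at least once in every chunk."""
    groups = []
    for chunk in chunks:
        authors = set()
        for year in chunk:
            authors |= authors_by_year.get(year, set())
        groups.append(authors)
    return set.intersection(*groups) if groups else set()
```

For example, with years 2010 to 2020 and a frequency of 5, the single leftover year 2020 is shorter than half a chunk and is therefore merged into the 2015 to 2019 chunk.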
- inform_matches(fields: Iterable | None = None, verbose: bool = False, refresh: bool | int = False) None [source]¶
Add information to matches to aid in selection process.
- Parameters:
fields (iterable (optional, default=None)) – Which information to provide. Allowed values are “first_name”, “surname”, “first_year”, “last_year”, “num_coauthors”, “num_publications”, “num_citations”, “subjects”, “affiliation_country”, “affiliation_id”, “affiliation_name”, “affiliation_type”, “language”, “num_cited_refs”. If None, will use all available fields.
verbose (bool (optional, default=False)) – Whether to report on the progress of the process.
refresh (bool or int (optional, default=False)) – Whether to refresh cached results (if they exist) or not. If int is passed, results will be refreshed if they are older than that value in number of days.
Notes
Matches including corresponding information are available through property .matches.
- Raises:
ValueError – If fields contains invalid keywords.
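The validation of fields might look roughly like the sketch below. The allowed values are copied from the parameter description above; the helper check_fields and the wording of the error message are hypothetical.

```python
ALLOWED_FIELDS = (
    "first_name", "surname", "first_year", "last_year", "num_coauthors",
    "num_publications", "num_citations", "subjects", "affiliation_country",
    "affiliation_id", "affiliation_name", "affiliation_type", "language",
    "num_cited_refs",
)


def check_fields(fields=None):
    """Return the fields to use, raising ValueError on unknown keywords."""
    if fields is None:
        # None means: use all available fields.
        return list(ALLOWED_FIELDS)
    invalid = [f for f in fields if f not in ALLOWED_FIELDS]
    if invalid:
        raise ValueError(f"Invalid fields: {', '.join(invalid)}")
    return list(fields)
```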
- property affiliation_country: str | None¶
The country of the scientist’s affiliation.
- property affiliation_id: str | None¶
The affiliation ID (as string) of the scientist’s most frequent affiliation in the active year.
- property affiliation_name: str | None¶
The name of the scientist’s affiliation.
- property affiliation_type: str | None¶
The type of the scientist’s affiliation.
- property candidates: list[int] | None¶
The set of authors that might be matches to the scientist. The set contains the intersection of all authors publishing in the treatment year as well as authors publishing around the year of first publication. Some authors with too many publications in the treatment year and authors having published too early are removed.
Notes
Property is initiated via .identify_candidates_from_sources().
- property citations: int | None¶
The citation count of the scientist until the provided year.
- property coauthors: list¶
Sorted list of coauthors of the scientist on all publications until the comparison year.
- property fields: set | list | tuple¶
The fields of the scientist until the provided year, estimated from the sources (journals, books, etc.) she published in.
- property first_name: str | None¶
The scientist’s first name.
- property first_year: int¶
The scientist’s year of first publication.
- property language: str | None¶
The language(s) the scientist published in.
- property last_year: int¶
The scientist’s most recent year with publication(s) before the match year (which may be the same).
- property main_field: tuple¶
The scientist’s main field of research, as tuple in the form (ASJC code, general category).
The main field is the field with the most publications, provided it is not Multidisciplinary (ASJC code 1000). In case of an equal number of publications, preference is given to non-general fields (those whose ASJC code ends in a digit other than 0).
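The selection rule for the main field can be sketched as follows, assuming one ASJC-4 code per publication as input. The helper pick_main_field is hypothetical and returns only the code, not the (ASJC code, general category) tuple of the actual property.

```python
from collections import Counter


def pick_main_field(asjc_codes):
    """Return the most frequent ASJC code, skipping Multidisciplinary (1000)."""
    counts = Counter(code for code in asjc_codes if code != 1000)
    if not counts:
        return None
    # Rank by publication count first; on ties prefer non-general codes,
    # i.e. codes whose last digit is not 0.
    return max(counts, key=lambda code: (counts[code], code % 10 != 0))
```

With two publications each in 2000 and 2002, the sketch returns 2002, since 2000 is a general field.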
- property matches: list | None¶
List of Scopus IDs or list of namedtuples representing matches of the original scientist in the treatment year.
Notes
Property is initiated via .filter_candidates().
- property name: str | None¶
The scientist’s complete name.
- property publications: set | list | tuple¶
List of the scientist’s publications.
- property search_sources: set | list | tuple | None¶
The set of sources comparable to those of the Original.
A source (journal, book, etc.) is comparable if it belongs to the Original’s main field, and if the types of the sources are those the Original publishes in.
Notes
Property is initiated via .define_search_sources().
- property sources: list | tuple¶
The Scopus IDs of sources (journals, books, etc.) in which the scientist published.
- property subjects: set | list | tuple¶
The subject areas of the scientist’s publications.
- property surname: str | None¶
The scientist’s surname.
sosia.get_field_source_information¶
- sosia.get_field_source_information(verbose: bool = False) None [source]¶
Download two files from sosia-dev/sosia-data repository:
1. List of Scopus source IDs with additional information
2. Mapping of sources to ASJC codes
- Parameters:
verbose (bool (optional, default=False)) – Whether to report on the progress of the process.
sosia.make_database¶
- sosia.make_database(fname: Path | None = None, verbose: bool = False, drop: bool = False) None [source]¶
Make SQLite database with predefined tables and keys.
- Parameters:
fname (pathlib.Path (optional, default=None)) – The path of the SQLite database to connect to. If None, will default to ~/.cache/sosia/main.sqlite.
verbose (boolean (optional, default=False)) – Whether to report on the progress of the process.
drop (boolean (optional, default=False)) – If True, deletes and recreates all tables in cache (irreversible).