Change Log¶

0.6.1¶

2023-11-05

2023-04-23

Drop usage of configuration file and recommend project-specific databases.
In class Scientist(), rename properties: `.country -> .affiliation_country, .affiliation -> .affiliation_name.
In class Scientist(), create property `.affiliation_type.
In .create_fields_sources_list(), make use of parameter “verbose”.
Make retrieval of affiliation related information robust to missing information (404 error).
Pass on “refresh” parameter from inform_matches().
Use pyproject.toml for packaging, drop pbr (PEP 621).
Use XDG compliant file storage for support files in ~/.cache/sosia/.
Improve various methods and functions for stability and speed.

2022-01-20

In .inform_matches(), remove abstract similarity and reference list similarity computation and corresponding keywords “abstract_sim” and “reference_sim”; do not require nltk and scikit-learn anymore.
In .inform_matches(), add “num_cited_refs” as number of jointly cited references up until provided year.
Increase robustness to Scopus server problems.
Allow to refresh downloaded results when using very large stacked source-based searches.
Fix bug with integer conversion when using pandas > 1.1.5.
Require pybliometrics >= 3.2.0.

2020-12-08

2020-12-04

End support for Python 3.5.
Reorganize config.ini.
In Original(), add parameters “sql_name” and “first_year_search”, and rename parameters: “year”: “treatment_year”, “year_margin”: “first_year_margin”, “search_affiliations”: “affiliations”.
In Original().find_matches(), remove parameter “ignore_first_id”.
In Original(), change default values for parameters: “year_margin”: 0.2, “pub_margin”: 0.2, “cits_margin”: 0.2, “coauth_margin”: 0.2.
Rename function create_cache() to make_database().
Rename tables in MySQL database: author_size becomes author_pubs, author_cits_size becomes author_ncits.
In Original().find_matches() remove parameters “information”, “stop_words”, “tfidf_kwds”; always create a plain list.
Add property .matches to Original().
Create new method Original().inform_matches() to add additional information to matches.
Use externally provided list of sources and their fields.
Remove unused property Original().city.
Raise warning if there are too few publications to determine a field.
Allow integer values for “refresh” in all instances, require pybliometrics >= 2.7.
Require numpy.
Fix bug originating from missing reference EIDs.
Fix bug originating from missing source IDs.
Improve documentaton, add tutorial.
Add citation dunder.

2020-03-17

Update docs w.r.t. the usage of pybliometrics.
Add support for Python 3.8 and Python 3.9.
Add missing required package lxml and require sklearn>=0.22.1.
Correct verbose output of .find_matches() w.r.t. completeness of reference lists and abstracts of matches.
Check for existence of the search group in .find_machtes().
In .find_matches(), fix bug when attempting to compute the cosine similarity when reference lists or abstracts are completely missing.
In .get_publication_language(), fix bug resulting from bad downloads of abstracts.
In Original(), fix bug resulting from unclean source ID information in Scopus search results.

2019-11-26

Introduce internal SQLite database to store results from stacked queries.
Comply with pybliometrics 2.2 or higher to make use of integrity_fields.
In Original(), add parameter “num_citations” to filter on the number of citations as well.
In Original(), add paramater “period” to allow for matching on information derived from used-provided period only.
In Original(), add parameter “search_affiliations” to enable subsetting on matches from a list of specific affiliations.
Add “num_citations” to information of matches.
Introduce internal config file.
Attempt to download most recent sources list from scopus.com during .create_fields_sources_list().
In find_country(), add “refresh” parameter and fix bugs related to wrong views and not continuing the search.
In get_main_field() return most common 2-digit ASJC code and most commont 4-digit ASJC field.
Attempt to extract URL for the Scopus source list via webscraping.
In find_matches(), provide only desired information.
In find_matches(), fix bug with missing references or abstracts.
Use decorators for methods.

2019-02-21

Introduce new class Scientist() to be used as parent class of Original() and others.
Add property language to Scientist() (and thus Original() and matches).
Allow setting of properties of Original() and Scientist().
Make the provision of additional information of matches optional.
Add source names to .sources and .search_sources.
Re-Download abstract if language information aren’t present.
In Original(), allow list of Author IDs.
In Original(), add optional parameter to condition retrieved information on list of EIDs.
Exclude “multidisciplinary” from list of fields for main field determination.
Give preference to non-general field during main field determination.
Use most recent affiliation to identify country.
Filter focal researcher from .search_group.
Fix bug in computation of relevant year range.
Fix bug resulting from missing source IDs.
Fix bug resulting in redundant counts during .define_search_group().
Use pair-wise tfidf-vectorization (instead of group-wise vectorization).
When the focal’s publications are empty, do not compute similiarity measures.
In clean_abstracts(), remove copyright statement in next-to-last sentence as well.
Use error messages from scopus to for case-specific error handling.
Simplify functions and classes and refactor internally.
Make stacked_query robust to group as list of int.
Enable internal function query_journal() to perform stacked queries.

2018-11-25

Activate method chaining.
Outsource functions compute_cosine, print_progress, margin_range, raise_non_empty and query to utils.py and add tests for them.
Fix bugs in .define_search_group(), margin_range() and _get_refs().
Implement clean_abstracts() to remove copyright statements from abstracts.
Use sklearn’s default settings for tfidf-vectorization of abstracts by default.
Simplify code.

2018-11-23