Match Authors in Scopus automatically with sosia

sosia (Italian for doppelgänger) finds researchers who are similar to a given researcher. Use the matched researchers as controls in Diff-in-Diff analyses. sosia is developed and described by econometricians for scientists of science.

sosia does not pre-compute annual characteristics to find controls. Instead, sosia searches the entire Scopus database via pybliometrics. Configure both sosia and pybliometrics, and let sosia find a match for you.


Install sosia from PyPI using the console or command line interpreter:

$ pip install sosia

In Python, set up sosia (and, if necessary, pybliometrics) and search for similar scientists using their Scopus Author Profile IDs.

>>> import sosia
>>> sosia.create_fields_sources_list()  # Necessary only once
>>> sosia.make_database()  # Necessary only once
>>> stefano = sosia.Original(55208373700, 2019)  # Scopus ID and year
>>> stefano.define_search_sources()  # Sources similar to scientist
>>> stefano.define_search_group()  # Authors publishing in similar sources
>>> stefano.find_matches()  # Find matches satisfying all criteria
>>> print(stefano.matches)
['55022752500', '55810688700', '55824607400']
>>> stefano.inform_matches()  # Optional step to provide additional information
>>> print(stefano.matches[0])
Match(ID='55022752500', name='Van der Borgh, Michel', first_name='Michel',
surname='Van der Borgh', first_year=2012, num_coauthors=6, num_publications=5,
num_citations=33, num_coauthors_period=6, num_publications_period=5,
num_citations_period=33, subjects=['BUSI', 'COMP', 'SOCI'], country='Netherlands',
affiliation_id='60032882', affiliation='Eindhoven University of Technology,
Department of Industrial Engineering & Innovation Sciences', language='eng',
reference_sim=0.0, abstract_sim=0.1217)
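
To illustrate what "satisfying all criteria" can mean, here is a minimal, self-contained sketch of margin-based filtering on characteristics such as first year, number of publications and number of coauthors. It does not call sosia or Scopus and is not sosia's implementation: the Candidate type, the helper functions, the margin names and all values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for an author's characteristics in the treatment year."""
    author_id: str
    first_year: int
    num_publications: int
    num_coauthors: int


def within(value, target, margin):
    """Return True if value lies within +/- margin of target."""
    return abs(value - target) <= margin


def filter_matches(original, candidates, first_year_margin=1,
                   pub_margin=1, coauth_margin=1):
    """Keep the IDs of candidates close to the original on every criterion."""
    return [
        c.author_id for c in candidates
        if c.author_id != original.author_id
        and within(c.first_year, original.first_year, first_year_margin)
        and within(c.num_publications, original.num_publications, pub_margin)
        and within(c.num_coauthors, original.num_coauthors, coauth_margin)
    ]


# Made-up values for illustration only; these are not real Scopus profiles.
original = Candidate("A0", first_year=2012, num_publications=5, num_coauthors=6)
candidates = [
    Candidate("A1", 2012, 5, 6),    # identical on all criteria
    Candidate("A2", 2013, 6, 5),    # within a margin of 1 everywhere
    Candidate("A3", 2005, 40, 30),  # far more senior
    Candidate("A4", 2012, 5, 12),   # too many coauthors
]
matched = filter_matches(original, candidates)
print(matched)  # ['A1', 'A2']
```

A candidate is kept only if it passes every check, so widening one margin can add matches but never remove them.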

Full reference:

Original(scientist, treatment_year[, …]) Representation of a scientist for whom to find a control scientist.


If sosia helped you get data for research, please cite our corresponding paper:

Citing the paper supports the development of sosia because it justifies devoting resources to it. It also signals that you created your control group in a transparent and replicable way.
