Downloads

zip Download all data

txt README.txt

csv media-data-average-opinion-us.csv

csv survey-data-favorability-study-1.csv

csv survey-data-favorability-study-3.csv

csv survey-data-trust-analytical-sample.csv

csv survey-data-trust-descriptive-sample-12to20.csv

csv survey-data-trust-descriptive-sample-18to20.csv

pdf survey-questionnaires.pdf

This dataset encompasses three distinct sets of data analyzed in the study: the survey data on favorability to the US, the survey data on trust in Americans, and the social media data. It is also available at yuxie.com and Princeton DataSpace. License and reuse restrictions: CC BY-NC-SA 4.0.

Survey Data on Favorability to the US

The first part of the dataset comprises the analysis in Study 1 and Study 3.


The analysis in Study 1 uses data from three surveys: the Social Attitude Questionnaire of Urban and Rural Residents (SAQURR) in 2019 and 2020 (N=3,408), the COVID-19 Multi-Wave Study (CMWS) between 2020 and 2022 (N=38,613), and the Survey on Living Conditions (SLC) in 2023 (N=2,596). The Chinese and English versions of the survey questionnaires are provided in survey-questionnaires.pdf.


Study 1 uses individual-level data appended from the three surveys, which is provided in survey-data-favorability-study-1.csv. The data includes Chinese favorability scores toward the US, survey sources, the year and month of the interview, demographic information of respondents, and the survey weights.


The analysis in Study 3 involves a subsample from Northwest China in SAQURR (N=1,880), which is provided in survey-data-favorability-study-3.csv. The data includes Chinese favorability scores toward the US and seven other countries or regions. To assess the comparability of the control group (respondents interviewed in December 2019) and the treatment group (respondents interviewed in April 2020) in the quasi-experimental design, we provide background information on sex, education, and age.


Survey Data on Trust in Americans

The second part of the dataset provides information used in Study 4, involving the CFPS data, Baidu Index data, and the COVID-19 cases and deaths data.


The China Family Panel Studies (CFPS), conducted by Peking University, is a nationally representative, longitudinal, comprehensive, and biennial social survey started in 2010. The outcome of interest in Study 4 is trust in Americans measured in the 2020 CFPS, incorporating the baseline trust from the 2018 CFPS. We confined the sample to respondents who indicated their level of trust in Americans in both the 2018 and 2020 waves (N=17,497). survey-data-trust-descriptive-sample-18to20.csv reports the trust level in 2018 and 2020 and the changes in between. As a supplementary analysis, we also used all respondents aged 16 or above in each wave of the CFPS since 2012 to document the changes in Chinese trust in Americans from 2012 to 2020 (survey-data-trust-descriptive-sample-12to20.csv).


For the regression analysis, we provide the subsample of those who have the “potential” to decrease trust (baseline trust scored above 0) and have complete information on location and interview date (N=11,430). They were interviewed at some point over the 23 weeks spanning from July 2020 to December 2020.


We measure Chinese public attention to the pandemic in the US using the Baidu Index. Baidu is the largest search engine in China, and the Baidu Index provides query-based data that reflects the daily search intensity of keywords entered into Baidu. We applied a logarithmic transformation to the Baidu Index scores for keywords such as “美国疫情” (“pandemic in the US”), “疫情” (“pandemic”) and “中美贸易战” (“Sino-US trade war”) to quantify public attention to these issues.


Our analysis in Study 4 also involves the COVID-19 cases and deaths data obtained from the Oxford COVID-19 Government Response Tracker. We used two measures with logarithmic transformation: the daily number of confirmed cases and the daily number of deaths occurring one day before the 2020 CFPS interview date. Due to the time difference between China and the US, these statistics are possibly the most up-to-date information available to the survey respondents who closely follow US news.
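The one-day lag and log transformation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `lagged_log_count`, the natural-log base, the handling of zero or missing counts, and the counts themselves are all hypothetical and are not taken from the study's code.

```python
import math
from datetime import date, timedelta


def lagged_log_count(daily_counts, interview_date, lag_days=1):
    """Natural log of the count recorded lag_days before the interview date.

    daily_counts maps datetime.date -> count. Returns None when the lagged
    date is missing or the count is not positive (the log is undefined).
    How the study treats such cases is not stated; this is an assumption.
    """
    lagged_date = interview_date - timedelta(days=lag_days)
    count = daily_counts.get(lagged_date)
    if count is None or count <= 0:
        return None
    return math.log(count)


# Made-up counts, for illustration only.
new_cases = {date(2020, 7, 14): 1000, date(2020, 7, 15): 2000}

# A respondent interviewed on 2020-07-15 is matched to the 2020-07-14 count.
value = lagged_log_count(new_cases, date(2020, 7, 15))
```

A respondent interviewed on the first day of the series has no lagged value, so the sketch returns None rather than guessing.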


survey-data-trust-analytical-sample.csv collects variables used for the regression analysis, including the trust in Americans in 2018 and 2020, demographic variables, and location details (province) from the CFPS, along with the merged data of Baidu Index and the COVID-19 cases and deaths data. Key variable meanings are explained below.


  • Variable name: Meaning
  • trust_americans: Trust in Americans in 2020
  • trust_parents: Trust in parents in 2020
  • trust_neighbors: Trust in neighbors in 2020
  • trust_doctors: Trust in doctors in 2020
  • trust_officials: Trust in officials in 2020
  • trust_americans_18: Trust in Americans in 2018
  • trust_parents_18: Trust in parents in 2018
  • trust_neighbors_18: Trust in neighbors in 2018
  • trust_doctors_18: Trust in doctors in 2018
  • trust_officials_18: Trust in officials in 2018
  • increase: Trust in Americans increased from 2018 to 2020 (binary)
  • logUS_pandemic: Logged Baidu Search Index score of "pandemic in US"
  • logpandemic: Logged Baidu Search Index score of "pandemic"
  • logtrade_war: Logged Baidu Search Index score of "Sino-American trade war"
  • logUS_case_new: Logged number of new COVID-19 cases in the US one day ago
  • logUS_death_new: Logged number of new COVID-19 related deaths in the US one day ago
  • age: Age
  • age2: Age squared
  • married: Married
  • male: Male
  • hs_above: Completed senior high school or a higher level of education
  • uhukou: Urban hukou
  • internet: Internet user
  • student: In full-time education, including undergraduate and postgraduate education
  • employed: In full- or part-time paid employment or self-employed
  • weekend: Interviewed at the weekend
  • logUS_pandemic_lag1: Logged Baidu Search Index score of "pandemic in US" one day ago
  • logUS_pandemic_lag2: Logged Baidu Search Index score of "pandemic in US" two days ago
  • logUS_pandemic_lag3: Logged Baidu Search Index score of "pandemic in US" three days ago
  • logUS_pandemic_lead1: Logged Baidu Search Index score of "pandemic in US" one day later
  • logUS_pandemic_lead2: Logged Baidu Search Index score of "pandemic in US" two days later
  • logUS_pandemic_lead3: Logged Baidu Search Index score of "pandemic in US" three days later
  • week: Week indicator
  • provcd18: Province indicator
  • date_N15: Indicates that at least 15 respondents were interviewed on a given day

Social Media Data

The third dataset is provided to depict trends in attitudes toward the US in Study 2. The data is collected from 53,949,720 posts containing US-related keywords (美国, 灯塔国, 美利坚, 米国, 美帝) from January 1, 2016, to November 28, 2023, on the Chinese social media platform Weibo, which is similar to Twitter. The substantial size gives us a high level of confidence that this dataset captures prevalent viewpoints on Chinese social media. Each post was labeled with an attitude score toward the US on a five-point scale: -2 (most unfavorable), -1 (somewhat unfavorable), 0 (neutral), 1 (somewhat favorable), and 2 (most favorable). We then fine-tuned a large language model, BERT, on these annotations for two tasks. The first task was a binary classification to determine whether a Weibo post conveyed attitudes toward the US. The second task was a regression to predict the attitude score.


The daily attitude averaging across all users is provided in media-data-average-opinion-us.csv, smoothed using a 540-day sliding window to filter out minor fluctuations.
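For readers who want to see what sliding-window smoothing does to a daily series, here is a minimal sketch. It assumes a trailing window (the mean of the current day and the preceding days); the text does not say whether the 540-day window is trailing or centered, and `trailing_mean` is a hypothetical helper, not the authors' code. The toy series uses a window of 2 so the result is easy to check by hand.

```python
from collections import deque


def trailing_mean(values, window):
    """Mean of the last `window` values up to and including each position.

    Early positions, where fewer than `window` values exist, average over
    whatever is available so far.
    """
    buffer = deque()
    total = 0.0
    smoothed = []
    for value in values:
        buffer.append(value)
        total += value
        if len(buffer) > window:
            total -= buffer.popleft()  # drop the value that left the window
        smoothed.append(total / len(buffer))
    return smoothed


# Toy daily scores, for illustration only.
daily = [1.0, 2.0, 3.0, 4.0]
smoothed = trailing_mean(daily, window=2)
```

With `window=540` the same function would average roughly a year and a half of daily values, which is why short-lived swings disappear from the published series.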


For additional information on the processing of the data, please refer to the Supplementary Materials.


Data Publisher

The COVID-19 Multi-Wave Study (CMWS) and the Survey on Living Conditions (SLC) are conducted by the Population Development Studies Center, Renmin University of China. The Social Attitude Questionnaire of Urban and Rural Residents (SAQURR) is conducted by the Institute of Psychology of the Chinese Academy of Sciences. The China Family Panel Studies (CFPS) is conducted by the Institute of Social Science Survey, Peking University. The Weibo data is owned by Sina.

Citation

Please cite this paper if you use this dataset for research purposes.

Xie, Y., Yang, F., Huang, J., He, Y., Zhou, Y., Qian, Y., Cai, W., Zhou, J. Declining Chinese Attitudes toward the United States amid COVID-19 (2024). DOI: TBD

zip Download all data

csv Chinese-descent-scientists-destination.csv: List of 25,202 Chinese-descent scientists with their respective discipline labels and destination country or region. Scientists migrating to China mainland, Hong Kong and Taiwan are recorded separately.

csv Chinese-descent-scientists-destination-count.csv: Number of Chinese-descent scientists who migrated to China, categorized by year, discipline, and stage (junior/experienced). Due to the small sample size, scientists labeled in the "Statistics" discipline were excluded from the count.

This dataset encompasses two distinct sets of data analyzed in the study, namely Asian American Scholar Forum survey data and Microsoft Academic Graph bibliometrics data.

This data is available at yuxie.com and Princeton DataSpace.

Survey data:

The first part of the dataset comprises survey data collected from the Asian American Scholar Forum survey. To protect the privacy of the survey respondents, the raw survey data are confidential and are not publicly released. Researchers interested in obtaining access to the data are encouraged to contact the authors directly for an authorized copy. The summary statistics derived from the survey data can be found in the Supplementary Materials and are sufficient to replicate the results presented in this paper.

Bibliometrics data:

The second part of the dataset involves bibliometric data obtained from the Microsoft Academic Graph, which indexed 208,440,142 scientists from 27,077 institutions authoring 205,203,354 scientific publications dated until December 2021. The database was sourced from the publicly available snapshot retrieved from OpenAlex in early 2022, after Microsoft Academic Graph announced retirement in Dec 2021.

We identified Chinese-descent scientists by their surnames. We first collected 832 common Chinese surnames from Wikipedia, including those in Chinese characters and romanized names, in Hanyu Pinyin (the system of Chinese romanization mostly used by mainland Chinese scientists) and Wade-Giles (the system mostly used by Cantonese-speaking and Taiwanese scientists). We searched for those surnames in the authors' full names recorded in Microsoft Academic Graph to identify Chinese-descent scientists. This method misses Chinese-descent scientists who have changed their surnames (usually women after marriage), leading to an undercount. To retain a high degree of reliability in individual identification, we removed scientists with a gap of more than 5 years between consecutive publications, which we believed were false results in which Microsoft Academic Graph's name disambiguation algorithm incorrectly merged multiple individuals. We ended up with 25,202 Chinese-descent scientists who first published with US affiliations, later dropped those US affiliations, and subsequently published at least one paper affiliated with China.
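The publication-gap filter can be illustrated with a short sketch. The function `has_long_gap`, the author keys, and the publication years below are hypothetical; the only detail taken from the description is the rule that a gap of more than 5 years between consecutive publications removes a scientist.

```python
def has_long_gap(publication_years, max_gap=5):
    """True if any two consecutive publication years are more than max_gap apart."""
    years = sorted(set(publication_years))
    return any(later - earlier > max_gap for earlier, later in zip(years, years[1:]))


# Made-up records: hypothetical author key -> publication years.
records = {
    "author_a": [2001, 2003, 2004],  # largest gap is 2 years: kept
    "author_b": [1995, 2002, 2003],  # 7-year gap: likely a merged identity, removed
}

kept = {author: years for author, years in records.items() if not has_long_gap(years)}
```

A gap of exactly 5 years passes the filter in this sketch, following the "more than 5 years" wording.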

We used the Google Maps API to parse all 27,077 institution names in Microsoft Academic Graph and retrieved their country labels. This let us label every Chinese-descent scientist's working country in any publishing year. Specifically, we focused on Chinese-descent scientists leaving the US, i.e., those who were trained in the US (first paper affiliated in the US) and who subsequently moved from the US to China (i.e., stopped using US affiliations and started to use Chinese affiliations). For each such scientist, we determined the year range of all papers affiliated in the US and affiliated in China, and annotated the leaving year as the year of the first subsequent paper after the scientist's most recent use of a US affiliation. This was more accurate than simply using the last year with a US affiliation, which might produce false positives by counting Chinese-descent scientists who are still based in the US.
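The leaving-year rule can be sketched as below. This is an illustration rather than the authors' implementation: the function name `leaving_year`, the `(year, country)` tuple layout, the country codes, and the choice to require a strictly later year are all assumptions.

```python
def leaving_year(papers):
    """Year of the first paper published after the most recent US-affiliated paper.

    papers is a list of (year, country) tuples. Returns None when the
    scientist has no US-affiliated paper, or has published nothing after
    the latest one (i.e., may still be based in the US).
    """
    us_years = [year for year, country in papers if country == "US"]
    if not us_years:
        return None
    last_us_year = max(us_years)
    later_years = [year for year, _ in papers if year > last_us_year]
    return min(later_years) if later_years else None


# Made-up publication histories, for illustration only.
moved = [(2010, "US"), (2012, "US"), (2015, "CN"), (2016, "CN")]
still_in_us = [(2010, "US"), (2011, "CN"), (2013, "US")]

moved_year = leaving_year(moved)
still_year = leaving_year(still_in_us)
```

The second history shows why the rule waits for a subsequent paper: a scientist whose latest paper still carries a US affiliation gets no leaving year at all.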

We further identified two groups of interest among US-based Chinese-descent scientists: "junior" scientists—those who had published their first papers in the US, started publishing with Chinese affiliations within 5 years thereafter, and finally left the US within 7 years thereafter; and "experienced" scientists—those who had published over 25 papers in their whole career and outperformed 97% of scientists.

For additional information on the processing of the survey data and bibliometric data, please refer to the Supplementary Materials.

Data Publisher

The survey data is administered by the Asian American Scholar Forum. The bibliometrics data is published by Microsoft under the Open Data Commons Attribution License (ODC-By).

Citation

Please cite this paper if you use this dataset for research purposes.

Yu Xie, Xihong Lin, Ju Li, Qian He, Junming Huang, Caught in the Crossfire: Fears of Chinese-American Scientists, Proceedings of the National Academy of Sciences, 120 (27) e2216248120 (2023). DOI: 10.1073/pnas.2216248120

Please also cite Microsoft Academic Graph if you use their data.

Arnab Sinha et al., An Overview of Microsoft Academic Service (MAS) and Applications, in Proceedings of the 24th International Conference on World Wide Web (WWW'15 Companion), ACM, New York, NY, 243-246 (2015). DOI: 10.1145/2740908.2742839

DataSpace Download all data at Princeton DataSpace.

This dataset includes estimated sentiments of The New York Times articles on China in eight topics from 1970 to 2019, and a time series of public attitudes toward China aggregated from surveys.

(1) Estimated sentiments of The New York Times articles on China in eight topics from 1970 to 2019

We estimate sentiments of The New York Times articles on China with a three-stage procedure. First, two human coders annotate 873 randomly selected articles with a total of 18,598 paragraphs as expressing either positive, negative, or neutral sentiment in each of eight topics (ideology, government & administration, democracy, economic development, marketization, welfare and well-being, globalization, and culture). We treat irrelevant articles as neutral sentiment. Secondly, we fine-tune a natural language processing model BERT (Bidirectional Encoder Representations from Transformers) with the human-coded labels. The model uses a deep neural network with 12 layers. It accepts paragraphs (i.e., word sequences of no more than 128 words) as input and outputs a probability for each category. We end up with two binary classifiers for each topic for a grand total of 16 classifiers: an assignment classifier that determines whether a paragraph expresses sentiment in a given topic domain and a sentiment classifier that then distinguishes positive and negative sentiment in a paragraph classified as belonging to a given topic domain. Thirdly, we run the 16 trained classifiers on each paragraph in our corpus and assign category probabilities to every paragraph. We then use the probabilities of all the paragraphs in an article to determine the article's overall sentiment category (i.e., positive, negative, or neutral) in every topic.


 

Files by topic (each topic lists, in order: estimated sentiment on paragraphs, estimated sentiment on news articles, estimated daily sentiment):

  • Ideology: topic-0-paragraph-pred.tsv.gz, topic-0-article-pred.tsv.gz, topic-0-trend.tsv
  • Government & administration: topic-1-paragraph-pred.tsv.gz, topic-1-article-pred.tsv.gz, topic-1-trend.tsv
  • Democracy: topic-2-paragraph-pred.tsv.gz, topic-2-article-pred.tsv.gz, topic-2-trend.tsv
  • Economic development: topic-3-paragraph-pred.tsv.gz, topic-3-article-pred.tsv.gz, topic-3-trend.tsv
  • Marketization: topic-4-paragraph-pred.tsv.gz, topic-4-article-pred.tsv.gz, topic-4-trend.tsv
  • Welfare & well-being: topic-5-paragraph-pred.tsv.gz, topic-5-article-pred.tsv.gz, topic-5-trend.tsv
  • Globalization: topic-6-paragraph-pred.tsv.gz, topic-6-article-pred.tsv.gz, topic-6-trend.tsv
  • Culture: topic-7-paragraph-pred.tsv.gz, topic-7-article-pred.tsv.gz, topic-7-trend.tsv


  • Paragraph sentiment file columns:

  • url: url of an article (string)

  • date: date of an article (YYYY-MM-DD)

  • article_id: unique ID we assign to an article (int). This is for internal use only and has no association with The New York Times

  • paragraph_id: zero-based index of a paragraph in an article (int)

  • assignment_prediction_score: probability that this paragraph expresses a positive or negative sentiment toward China on a certain topic (float). A value close to 1 means that this paragraph is very likely to express a positive or negative sentiment. A value close to 0 means that this paragraph is very unlikely to express a positive or negative sentiment, i.e., it is neutral or irrelevant.

  • sentiment_prediction_score: probability that this paragraph expresses a positive sentiment toward China on a certain topic (float). A value close to 1 means that this paragraph is very likely to express a positive sentiment. A value close to 0 means that this paragraph is very likely to express a negative sentiment. This value is not meaningful when assignment_prediction_score is close to zero.
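As a rough illustration of how the two paragraph-level scores combine, the sketch below turns them into a single label. The 0.5 cut-off, the function `paragraph_label`, the presence of a header row, and the sample rows are assumptions made for this example; only the column names come from the list above.

```python
import csv
import io


def paragraph_label(assignment_score, sentiment_score, threshold=0.5):
    """Return 1 (positive), -1 (negative), or 0 (neutral or irrelevant).

    The sentiment score is consulted only when the assignment score says
    the paragraph expresses a sentiment on the topic at all.
    """
    if assignment_score < threshold:
        return 0
    return 1 if sentiment_score >= threshold else -1


# Made-up rows in the documented column layout (tab-separated, header assumed).
SAMPLE_TSV = (
    "url\tdate\tarticle_id\tparagraph_id\t"
    "assignment_prediction_score\tsentiment_prediction_score\n"
    "https://example.com/a\t1990-01-01\t1\t0\t0.92\t0.10\n"
    "https://example.com/a\t1990-01-01\t1\t1\t0.05\t0.80\n"
    "https://example.com/a\t1990-01-01\t1\t2\t0.75\t0.95\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t")
labels = [
    paragraph_label(
        float(row["assignment_prediction_score"]),
        float(row["sentiment_prediction_score"]),
    )
    for row in reader
]
```

The second sample row has a high sentiment score but a low assignment score, so it is labeled 0, matching the note that the sentiment score carries no information in that case.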


  • Article sentiment file columns:

  • url: url of an article (string)

  • ss1_prediction: estimated sentiment of an article on a certain topic of China (int). 0 if this article is estimated to express a neutral sentiment on the topic or is irrelevant to it. 1 if this article is estimated to express a positive sentiment. -1 if this article is estimated to express a negative sentiment.


  • Daily sentiment file columns:

  • date: date (YYYY-MM-DD)

  • num_articles: number of The New York Times articles on this date (int)

  • num_positive_articles: number of The New York Times articles that are estimated to express positive sentiments on a certain topic of China.

  • num_negative_articles: number of The New York Times articles that are estimated to express negative sentiments on a certain topic of China.
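One simple way to summarize a row of the daily file is the share of positive articles minus the share of negative articles. The sketch below computes that quantity; it is one possible summary chosen for illustration, not necessarily the measure used in the papers, and `net_sentiment` and the counts are hypothetical.

```python
def net_sentiment(num_articles, num_positive_articles, num_negative_articles):
    """(positive - negative) / total articles for one day, in [-1, 1].

    Returns None for days without any articles, where the ratio is undefined.
    """
    if num_articles == 0:
        return None
    return (num_positive_articles - num_negative_articles) / num_articles


# Made-up counts: 10 articles, 2 positive and 5 negative on the topic.
score = net_sentiment(10, 2, 5)
```

Articles that are neutral or irrelevant to the topic count toward the denominator only, so they pull the score toward zero.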


The pretrained model files of BERT can be downloaded from Google's GitHub repository. Our settings to fine-tune the model are here.

(2) Public attitude aggregated from surveys on China

This time series is aggregated from 101 cross-sectional surveys from 1974 to 2019 that asked relevant questions about attitudes toward China. Values range from -100% to 100%, with 1974 as the baseline (= 0). Years with values above zero show a more favorable attitude than in 1974. Years with values below zero show a less favorable attitude than in 1974, with the lowest level of -24% in 1976. The time series is estimated with a 95% confidence interval, as in aggregated-survey.tsv. The detailed method is described in Donghui Wang, Yu Xie, and Junming Huang, Trend Analysis with Pooled Data from Different Survey Series: The Latent Attitude Method, Sociological Methodology (2023).

  • aggregated-survey.tsv columns:

  • year: year (int)

  • Estimates: aggregated attitude value (float)

  • ul: upper bound of 95% confidence interval (float)

  • ll: lower bound of 95% confidence interval (float)
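A minimal sketch of reading a file in this layout and finding the least favorable year follows. The sample rows are made up, and the header row and the numeric scale (fractions rather than percentages) are assumptions about the real aggregated-survey.tsv; `read_series` is a hypothetical helper.

```python
import csv
import io

# Made-up rows for illustration; only the column names come from the list above.
SAMPLE_TSV = (
    "year\tEstimates\tul\tll\n"
    "1974\t0.00\t0.05\t-0.05\n"
    "1975\t-0.10\t-0.02\t-0.18\n"
    "1976\t-0.20\t-0.11\t-0.29\n"
)


def read_series(text):
    """Parse the tab-separated series into a list of typed dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {
            "year": int(row["year"]),
            "estimate": float(row["Estimates"]),
            "upper": float(row["ul"]),
            "lower": float(row["ll"]),
        }
        for row in reader
    ]


series = read_series(SAMPLE_TSV)

# Year with the least favorable estimated attitude.
lowest = min(series, key=lambda row: row["estimate"])

# Years whose whole 95% interval lies below the 1974 baseline of zero.
below_baseline = [row["year"] for row in series if row["upper"] < 0]
```

Checking the upper bound against zero is a quick way to see which years differ from the baseline by more than the estimated uncertainty.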

Citation

Junming Huang, Gavin G. Cook and Yu Xie. Large-scale quantitative evidence of media impact on public opinion toward China. Humanities and Social Sciences Communications, 8, 181 (2021). DOI: 10.1057/s41599-021-00846-2.

Junming Huang, Gavin G. Cook and Yu Xie. Between reality and perception: the mediating effects of mass media on public opinion toward China. Chinese Sociological Review, 53 (5), 431-450 (2021). DOI: 10.1080/21620555.2021.1980720.

zip Download New York Times bestsellers statistics data. The top 20 rankings for hardcover fiction and nonfiction by week and isbn-10.

Citation

Burcu Yucesoy, Xindi Wang, Junming Huang and Albert-László Barabási, Success in books: a big data approach to bestsellers. EPJ Data Science 7, 7 (2018). DOI: 10.1140/epjds/s13688-018-0135-y

zip Download the source code of the proposed IMRank algorithm.

Citation

Suqi Cheng, Hua-Wei Shen, Junming Huang, Wei Chen and Xue-Qi Cheng, IMRank: Influence Maximization via Finding Self-Consistent Ranking, Proceedings of the 37th international ACM SIGIR conference on Research and development in Information Retrieval (SIGIR'14), Gold Coast, Australia (2014). DOI: 10.1145/2600428.2609592

zip Download the source code of the proposed StaticGreedy algorithm.

Citation

Suqi Cheng, Hua-Wei Shen, Junming Huang and Xue-Qi Cheng, StaticGreedy: solving the scalability-accuracy dilemma in influence maximization, Proceedings of the 22nd ACM Conference on Information and Knowledge Management (CIKM'13), San Francisco, USA (2013). DOI: 10.1145/2505515.2505541

github Download the data of Chinese cuisine and related code, originally downloaded from Meishijie and cleaned by Yuxiao Zhu on Mar 18, 2019.

Citation

Yu-Xiao Zhu, Junming Huang, Zi-Ke Zhang, Qian-Ming Zhang, Tao Zhou, Yong-Yeol Ahn, Geography and similarity of regional cuisines in China, PLOS One, 8(11): e79161 (2013). DOI: 10.1371/journal.pone.0079161

zip Download all data

csv goodreads.300k.collections.csv (1.4GB): 31,130,083 Book collection records. Columns: user id, item (book) id, collecting timestamp, rating, reserved, category.

csv goodreads.300k.edges.csv (136MB): 9,922,981 Following links between users. Columns: followed user id, follower user id. User IDs are continuous integers.

csv goodreads.300k.items.csv (37MB): Information on 1,928,141 items (books). Columns: item id, rating, popularity, category.

The dataset was crawled from www.goodreads.com and anonymized by Junming Huang. You are free to use the dataset in academic research but may not redistribute it without written permission.

Citation

Junming Huang, Xue-Qi Cheng, Hua-Wei Shen, Tao Zhou and Xiaolong Jin, Exploring Social Influence via Posterior Effect of Word-of-Mouth Recommendations, Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM'12), Seattle, USA (2012). DOI: 10.1145/2124295.2124365

Junming Huang, Xue-Qi Cheng, Jiafeng Guo, Hua-Wei Shen and Kun Yang, Social Recommendation with Interpersonal Influence, Proceedings of the 19th European Conference on Artificial Intelligence (ECAI'10), Amsterdam, The Netherlands (2010). DOI: 10.3233/978-1-60750-606-5-601