Junming Huang

This dataset encompasses three distinct sets of data analyzed in the study, namely the survey data on favorability to the US, the survey data on trust in Americans, and the social media data. It is also available at yuxie.com and Princeton DataSpace. Licenses/restrictions placed on the data, or limitations of reuse: CC BY-NC-SA 4.0.

Survey Data on Favorability to the US

The first part of the dataset comprises the analysis in Study 1 and Study 3.

The analysis in Study 1 uses data from three surveys: the Social Attitude Questionnaire of Urban and Rural Residents (SAQURR) in 2019 and 2020 (N=3,408), the COVID-19 Multi-Wave Study (CMWS) between 2020 and 2022 (N=38,613), and the Survey on Living Conditions (SLC) in 2023 (N=2,596). The Chinese and English versions of the survey questionnaires are provided in survey-questionnaires.pdf.

Study 1 uses individual-level data appended from the three surveys, which is provided in survey-data-favorability-study-1.csv. The data includes Chinese favorability scores toward the US, survey sources, the year and month of the interview, demographic information of respondents, and the survey weights.

The analysis in Study 3 involves a subsample from Northwest China in SAQURR (N=1,880), which is provided in survey-data-favorability-study-3.csv. The data includes Chinese favorability scores toward the US and seven other countries or regions. To assess the comparability of the control group (respondents interviewed in December 2019) and the treatment group (respondents interviewed in April 2020) in the quasi-experimental design, we provide background information on sex, education, and age.

Survey Data on Trust in Americans

The second part of the dataset provides information used in Study 4, involving the CFPS data, Baidu Index data, and the COVID-19 cases and deaths data.

The China Family Panel Studies (CFPS), conducted by Peking University, is a nationally representative, longitudinal, comprehensive, and biennial social survey started in 2010. The outcome of interest in Study 4 is trust in Americans measured in the 2020 CFPS, incorporating the baseline trust from the 2018 CFPS. We confined the sample to respondents who indicated their level of trust in Americans in both the 2018 and 2020 waves (N=17,497). survey-data-trust-descriptive-sample-18to20.csv reports the trust level in 2018 and 2020 and the changes in between. As a supplementary analysis, we also used all respondents aged 16 or above in each wave of the CFPS since 2012 to document the changes in Chinese trust in Americans from 2012 to 2020 (survey-data-trust-descriptive-sample-12to20.csv).

For the regression analysis, we provide the subsample of respondents who have the “potential” to decrease trust (baseline trust scored above 0) and have complete information on location and interview date (N=11,430). They were interviewed at some point over the 23 weeks spanning from July 2020 to December 2020.

We measure Chinese public attention to the pandemic in the US using the Baidu Index. The Baidu Index provides query-based data that reflects the daily search intensity of keywords entered into Baidu, the largest search engine in China. We applied a logarithmic transformation to the Baidu Index scores for keywords such as “美国疫情” (“pandemic in the US”), “疫情” (“pandemic”) and “中美贸易战” (“Sino-US trade war”) to quantify public attention to these issues.

Our analysis in Study 4 also involves the COVID-19 cases and deaths data obtained from the Oxford COVID-19 Government Response Tracker. We used two measures with logarithmic transformation: the daily number of confirmed cases and the daily number of deaths occurring one day before the 2020 CFPS interview date. Due to the time difference between China and the US, these statistics are possibly the most up-to-date information available to the survey respondents who closely follow US news.
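
The one-day lag described above can be sketched as follows. This is only an illustration: the dictionary of daily counts and its numbers are hypothetical, the natural logarithm is an assumption (the base is not stated here), and returning None for missing or non-positive days is also an assumption rather than the authors' documented rule.

```python
import math
from datetime import date, timedelta

def logged_lagged_count(daily_counts, interview_date, lag_days=1):
    """Return the log of the count recorded `lag_days` before the interview.

    Returns None when that day is missing or non-positive. How such days
    were handled in the study is not stated, so this is an assumption.
    """
    target = interview_date - timedelta(days=lag_days)
    count = daily_counts.get(target)
    if count is None or count <= 0:
        return None
    return math.log(count)  # natural log is an assumption

# Hypothetical daily new-case counts keyed by date (illustrative numbers only).
us_new_cases = {
    date(2020, 7, 14): 60000,
    date(2020, 7, 15): 65000,
}

# A respondent interviewed on 16 July 2020 is matched to the 15 July count.
value = logged_lagged_count(us_new_cases, date(2020, 7, 16))
```

The same helper with a different lag_days would give the two-day and three-day lags; a negative value would give the leads.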

survey-data-trust-analytical-sample.csv collects variables used for the regression analysis, including the trust in Americans in 2018 and 2020, demographic variables, and location details (province) from the CFPS, along with the merged data of Baidu Index and the COVID-19 cases and deaths data. Key variable meanings are explained below.

Variable name	Meaning
trust_americans	Trust in Americans in 2020
trust_parents	Trust in parents in 2020
trust_neighbors	Trust in neighbors in 2020
trust_doctors	Trust in doctors in 2020
trust_officials	Trust in officials in 2020
trust_americans_18	Trust in Americans in 2018
trust_parents_18	Trust in parents in 2018
trust_neighbors_18	Trust in neighbors in 2018
trust_doctors_18	Trust in doctors in 2018
trust_officials_18	Trust in officials in 2018
increase	Trust in Americans increased from 2018 to 2020 (binary)
logUS_pandemic	logged Baidu Search Index score of "pandemic in US"
logpandemic	logged Baidu Search Index score of "pandemic"
logtrade_war	logged Baidu Search Index score of "Sino-American trade war"
logUS_case_new	logged number of new COVID-19 cases in the US one day ago
logUS_death_new	logged number of new COVID-19 related deaths in the US one day ago
age	Age
age2	Age squared
married	Married
male	Male
hs_above	Completed senior high school or a higher level of education
uhukou	Urban hukou
internet	Internet user
student	In full-time education, including undergraduate and postgraduate education
employed	In full- or part-time paid employment or was self-employed
weekend	Interviewed at weekend
logUS_pandemic_lag1	logged Baidu Search Index score of "pandemic in US" one day ago
logUS_pandemic_lag2	logged Baidu Search Index score of "pandemic in US" two days ago
logUS_pandemic_lag3	logged Baidu Search Index score of "pandemic in US" three days ago
logUS_pandemic_lead1	logged Baidu Search Index score of "pandemic in US" one day later
logUS_pandemic_lead2	logged Baidu Search Index score of "pandemic in US" two days later
logUS_pandemic_lead3	logged Baidu Search Index score of "pandemic in US" three days later
week	Week indicator
provcd18	Province indicator
date_N15	Indicating at least 15 respondents are interviewed on a given day

Social Media Data

The third dataset is provided to depict trends in attitudes toward the US in Study 2. The data is collected from 53,949,720 posts containing US-related keywords (美国, 灯塔国, 美利坚, 米国, 美帝) from January 1, 2016, to November 28, 2023, on the Chinese social media platform Weibo, which is similar to Twitter. The substantial size gives us a high level of confidence that this dataset encompasses prevalent viewpoints on Chinese social media. Each post was labeled with an attitude score toward the US on a five-point scale: -2 (most unfavorable), -1 (somewhat unfavorable), 0 (neutral), 1 (somewhat favorable), and 2 (most favorable). Subsequently, we fine-tuned a large language model, BERT, using these annotations for two tasks. The first task was a binary classification to determine whether a Weibo post conveyed attitudes toward the US. The second task was a regression model to predict the attitude score.

The daily attitude averaged across all users is provided in media-data-average-opinion-us.csv, smoothed using a 540-day sliding window to filter out minor fluctuations.
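
As a rough illustration of sliding-window smoothing, the sketch below computes a trailing moving average over a list of daily values. Whether the published series uses a trailing or a centered window, and how the first days are treated, is not stated here, so both choices are assumptions; the function name and the toy input are hypothetical.

```python
def sliding_mean(values, window):
    """Trailing moving average over `window` points.

    Early positions average only the values available so far
    (an assumption; the original edge handling is not documented here).
    """
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that has just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed

# Toy daily series; the real data would use window=540.
daily = [0.0, 1.0, 2.0, 3.0, 4.0]
smoothed = sliding_mean(daily, window=3)
```

A long window such as 540 days removes short-lived swings at the cost of delaying the point at which a genuine shift becomes visible.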

For additional information on the processing of the data, please refer to the Supplementary Materials.

Data Publisher

The COVID-19 Multi-Wave Study (CMWS) and the Survey on Living Conditions (SLC) are conducted by the Population Development Studies Center, Renmin University of China. The Social Attitude Questionnaire of Urban and Rural Residents (SAQURR) is conducted by the Institute of Psychology of the Chinese Academy of Sciences. The China Family Panel Studies (CFPS) is conducted by the Institute of Social Science Survey, Peking University. The Weibo data is owned by Sina.

Citation

Please cite this paper if you use this dataset for research purposes.

Xie, Y., Yang, F., Huang, J., He, Y., Zhou, Y., Qian, Y., Cai, W., Zhou, J. Declining Chinese Attitudes toward the United States amid COVID-19 (2024). DOI: TBD

Download all data

Chinese-descent-scientists-destination.csv: List of 25,202 Chinese-descent scientists with their respective discipline labels and destination country or region. Scientists migrating to China mainland, Hong Kong and Taiwan are recorded separately.

Chinese-descent-scientists-destination-count.csv: Number of Chinese-descent scientists who migrated to China, categorized by year, discipline, and stage (junior/experienced). Due to the small sample size, scientists labeled in the "Statistics" discipline were excluded from the count.

This dataset encompasses two distinct sets of data analyzed in the study, namely Asian American Scholar Forum survey data and Microsoft Academic Graph bibliometrics data.

This data is available at yuxie.com and Princeton DataSpace.

Survey data:

The first part of the dataset comprises survey data collected from the Asian American Scholar Forum survey. Out of respect for the privacy of the survey respondents, the raw survey data have been designated as confidential and are deemed inappropriate for public disclosure. Researchers interested in obtaining access to the data are encouraged to contact the authors directly for an authorized copy. Nonetheless, the summary statistics derived from the survey data can be found in the Supplementary Materials and are sufficient to replicate the results presented in this paper.

Bibliometrics data:

The second part of the dataset involves bibliometric data obtained from the Microsoft Academic Graph, which indexed 208,440,142 scientists from 27,077 institutions authoring 205,203,354 scientific publications dated until December 2021. The database was sourced from the publicly available snapshot retrieved from OpenAlex in early 2022, after Microsoft Academic Graph announced retirement in Dec 2021.

We identified Chinese-descent scientists by their surnames. We first collected 832 common Chinese surnames from Wikipedia, including those in Chinese characters and their romanized forms in Hanyu Pinyin (the system of Chinese romanization mostly used by mainland Chinese scientists) and Wade-Giles (the system mostly used by Cantonese-speaking and Taiwanese scientists). We then searched for those surnames in the authors' full names recorded in Microsoft Academic Graph. This method misses Chinese-descent scientists who have changed their surnames (usually women after marriage), leading to an undercount. To retain a high degree of reliability in individual identification, we removed scientists with a gap of more than 5 years between consecutive publications, which we believed to be false matches in which Microsoft Academic Graph's name disambiguation algorithm incorrectly merged multiple individuals. We ended up with 25,202 Chinese-descent scientists who had their first publications with US affiliations, later dropped their US affiliations, and subsequently published at least one paper affiliated with China.
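
The publication-gap filter can be sketched as below. This is a minimal illustration that assumes each scientist is represented by a list of publication years; the function name and the year-level granularity are assumptions, not the authors' code.

```python
def has_long_gap(pub_years, max_gap=5):
    """True if any two consecutive publication years differ by more than max_gap."""
    years = sorted(set(pub_years))
    return any(later - earlier > max_gap
               for earlier, later in zip(years, years[1:]))

# Hypothetical publication-year histories keyed by an illustrative author ID.
histories = {
    "author_a": [2001, 2003, 2004, 2008],   # largest gap is 4 years: kept
    "author_b": [1995, 1996, 2010, 2011],   # 14-year gap: likely merged identities
}

# Keep only profiles without a suspiciously long silence.
kept = [author for author, years in histories.items()
        if not has_long_gap(years)]
```

A gap of exactly 5 years is kept under this reading, since only gaps of more than 5 years are removed.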

We used the Google Maps API to parse all 27,077 institution names in Microsoft Academic Graph and retrieved their country labels. This let us label every Chinese-descent scientist's working country in any publishing year. Specifically, we focused on Chinese-descent scientists leaving the US, i.e., those who were trained in the US (first paper affiliated with the US) and who subsequently moved from the US to China (i.e., stopped using US affiliations and started to use Chinese affiliations). For each such scientist, we recorded the year range of all his/her papers affiliated with the US and with China, and annotated his/her leaving year as the year of his/her first subsequent paper after his/her most recent use of a US affiliation. This was more accurate than simply using his/her last year with a US affiliation, which might produce false positives by counting Chinese-descent scientists who are still based in the US.
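
The leaving-year rule can be illustrated with the sketch below. It assumes each paper is reduced to a (year, country) pair and that "subsequent" means a strictly later year; the function name, the country codes, and the sample records are hypothetical.

```python
def leaving_year(papers):
    """Year of the first paper published after the last US-affiliated paper.

    `papers` is a list of (year, country_code) pairs. Returns None when the
    scientist has no US paper or no paper after the last US one, so a
    scientist whose most recent paper is US-affiliated is not counted as
    having left.
    """
    us_years = [year for year, country in papers if country == "US"]
    if not us_years:
        return None
    last_us = max(us_years)
    later = [year for year, _ in papers if year > last_us]
    return min(later) if later else None

# Hypothetical records (illustrative only).
moved = [(2005, "US"), (2008, "US"), (2011, "CN"), (2012, "CN")]
still_in_us = [(2015, "US"), (2020, "US")]

moved_year = leaving_year(moved)
still_year = leaving_year(still_in_us)
```

Using the last US year directly would assign 2020 to the second scientist, which is the kind of false positive the rule above avoids.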

We further identified two groups of interest among US-based Chinese-descent scientists: "junior" scientists—those who had published their first papers in the US, started publishing with Chinese affiliations within 5 years thereafter, and finally left the US within 7 years thereafter; and "experienced" scientists—those who had published over 25 papers in their whole career and outperformed 97% of scientists.

For additional information on the processing of the survey data and bibliometric data, please refer to the Supplementary Materials.

Data Publisher

The survey data is administered by the Asian American Scholar Forum. The bibliometrics data is published by Microsoft under the Open Data Commons Attribution License (ODC-By).

Citation

Please cite this paper if you use this dataset for research purposes.

Yu Xie, Xihong Lin, Ju Li, Qian He, Junming Huang, Caught in the Crossfire: Fears of Chinese-American Scientists, Proceedings of the National Academy of Sciences, 120 (27) e2216248120 (2023). DOI: 10.1073/pnas.2216248120

Please also cite Microsoft Academic Graph if you use their data.

Arnab Sinha et al., An Overview of Microsoft Academic Service (MAS) and Applications, in Proceedings of the 24th International Conference on World Wide Web (WWW'15 Companion), ACM, New York, NY, 243-246 (2015). DOI: 10.1145/2740908.2742839

Download all data at Princeton DataSpace.

This dataset includes estimated sentiments of The New York Times articles on China in eight topics from 1970 to 2019, and a time series of public attitudes toward China aggregated from surveys.

(1) Estimated sentiments on The New York Times on China in eight topics from 1970 to 2019

We estimate sentiments of The New York Times articles on China with a three-stage procedure. First, two human coders annotate 873 randomly selected articles with a total of 18,598 paragraphs as expressing either positive, negative, or neutral sentiment in each of eight topics (ideology, government & administration, democracy, economic development, marketization, welfare and well-being, globalization, and culture). We treat irrelevant articles as expressing neutral sentiment. Second, we fine-tune a natural language processing model, BERT (Bidirectional Encoder Representations from Transformers), with the human-coded labels. The model uses a deep neural network with 12 layers. It accepts paragraphs (i.e., word sequences of no more than 128 words) as input and outputs a probability for each category. We end up with two binary classifiers for each topic, for a grand total of 16 classifiers: an assignment classifier that determines whether a paragraph expresses sentiment in a given topic domain, and a sentiment classifier that then distinguishes positive from negative sentiment in a paragraph classified as belonging to that topic domain. Third, we run the 16 trained classifiers on each paragraph in our corpus and assign category probabilities to every paragraph. We then use the probabilities of all the paragraphs in an article to determine the article's overall sentiment category (i.e., positive, negative, or neutral) in every topic.

Topic	Estimated sentiment on paragraphs	Estimated sentiment on news articles	Estimated daily sentiment
Ideology	topic-0-paragraph-pred.tsv.gz	topic-0-article-pred.tsv.gz	topic-0-trend.tsv
Government & administration	topic-1-paragraph-pred.tsv.gz	topic-1-article-pred.tsv.gz	topic-1-trend.tsv
Democracy	topic-2-paragraph-pred.tsv.gz	topic-2-article-pred.tsv.gz	topic-2-trend.tsv
Economic development	topic-3-paragraph-pred.tsv.gz	topic-3-article-pred.tsv.gz	topic-3-trend.tsv
Marketization	topic-4-paragraph-pred.tsv.gz	topic-4-article-pred.tsv.gz	topic-4-trend.tsv
Welfare & well-being	topic-5-paragraph-pred.tsv.gz	topic-5-article-pred.tsv.gz	topic-5-trend.tsv
Globalization	topic-6-paragraph-pred.tsv.gz	topic-6-article-pred.tsv.gz	topic-6-trend.tsv
Culture	topic-7-paragraph-pred.tsv.gz	topic-7-article-pred.tsv.gz	topic-7-trend.tsv

Paragraph sentiment file columns:
url: url of an article (string)
date: date of an article (YYYY-MM-DD)
article_id: unique ID we assign to an article (int). This is for inner use only, and it has no association with The New York Times
paragraph_id: zero-based index of a paragraph in an article (int)
assignment_prediction_score: probability that this paragraph expresses a positive or negative sentiment toward China on a certain topic (float). A value close to 1 means that this paragraph is very likely to express a positive or negative sentiment. A value close to 0 means that this paragraph is very unlikely to express a positive or negative sentiment, i.e., it is neutral or irrelevant.
sentiment_prediction_score: probability that this paragraph expresses a positive sentiment toward China on a certain topic (float). A value close to 1 means that this paragraph is very likely to express a positive sentiment. A value close to 0 means that this paragraph is very likely to express a negative sentiment. This value is not meaningful when assignment_prediction_score is close to zero.
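
One way to read the two scores together is sketched below. The 0.5 cut-off and the function name are assumptions made for illustration; the thresholds actually used to derive the article-level labels are not stated here.

```python
def paragraph_label(assignment_score, sentiment_score, threshold=0.5):
    """Map the two classifier probabilities to -1, 0, or 1.

    The sentiment score is consulted only when the assignment score says
    the paragraph expresses sentiment on the topic. The 0.5 threshold is
    an illustrative assumption.
    """
    if assignment_score < threshold:
        return 0  # neutral or irrelevant; sentiment score is ignored
    return 1 if sentiment_score >= threshold else -1

# Illustrative score pairs (not taken from the data files).
labels = [
    paragraph_label(0.05, 0.99),  # sentiment score ignored
    paragraph_label(0.90, 0.80),
    paragraph_label(0.90, 0.10),
]
```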

Article sentiment file columns:
url: url of an article (string)
ss1_prediction: estimated sentiment of an article on a certain topic of China (int). 0 if this article is estimated to express a neutral sentiment on a certain topic of China, or is irrelevant to that topic. 1 if this article is estimated to express a positive sentiment. -1 if this article is estimated to express a negative sentiment.

Daily sentiment file columns:
date: date (YYYY-MM-DD)
num_articles: number of The New York Times articles on this date (int)
num_positive_articles: number of The New York Times articles that are estimated to express positive sentiments on a certain topic of China.
num_negative_articles: number of The New York Times articles that are estimated to express negative sentiments on a certain topic of China.
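
As a hedged usage sketch, the daily counts can be turned into a net sentiment share per day. The net-share measure below is an illustration only, not a statistic defined by the authors, and the sample text assumes the file carries a header row with the column names listed above.

```python
import csv
import io

# Hypothetical rows in the documented column layout (illustrative values only).
sample_tsv = (
    "date\tnum_articles\tnum_positive_articles\tnum_negative_articles\n"
    "1990-01-01\t4\t1\t2\n"
    "1990-01-02\t0\t0\t0\n"
)

def net_sentiment_by_day(tsv_text):
    """Return {date: (positive - negative) / total}, skipping days without articles."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        total = int(row["num_articles"])
        if total == 0:
            continue  # avoid dividing by zero on days with no coverage
        positive = int(row["num_positive_articles"])
        negative = int(row["num_negative_articles"])
        result[row["date"]] = (positive - negative) / total
    return result

net = net_sentiment_by_day(sample_tsv)
```

For a real file, the text could be read with open("topic-0-trend.tsv", encoding="utf-8").read() before calling the function.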

The pretrained model files of BERT can be downloaded from Google's GitHub repository. Our settings to fine-tune the model are here.

(2) Public attitude aggregated from surveys on China

This time series is aggregated from 101 cross-sectional surveys from 1974 to 2019 that asked relevant questions about attitudes toward China. Values range from -100% to 100%, with the year 1974 as the baseline (0). Years with attitudes above zero show a more favorable attitude than that in 1974. Years with attitudes below zero show a less favorable attitude than that in 1974, with the lowest level of -24% in 1976. The time series is estimated with a 95% confidence interval, as in aggregated-survey.tsv. The detailed method is described in Donghui Wang, Yu Xie, and Junming Huang, Trend Analysis with Pooled Data from Different Survey Series: The Latent Attitude Method, Sociological Methodology (2023).

aggregated-survey.tsv columns:
year: year (int)
Estimates: aggregated attitude value (float)
ul: upper bound of 95% confidence interval (float)
ll: lower bound of 95% confidence interval (float)
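
A small sketch of how the interval columns might be used: it lists the years whose 95% confidence interval excludes the 1974 baseline of zero. The sample rows, the assumption of a header row, and the units of the values are illustrative; only the column names come from the description above.

```python
import csv
import io

# Hypothetical rows in the documented column layout (illustrative values only).
sample_tsv = (
    "year\tEstimates\tul\tll\n"
    "1974\t0.00\t0.05\t-0.05\n"
    "1976\t-0.24\t-0.10\t-0.38\n"
)

def years_differing_from_baseline(tsv_text):
    """Return the years whose confidence interval does not contain zero."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    flagged = []
    for row in reader:
        lower, upper = float(row["ll"]), float(row["ul"])
        if lower > 0 or upper < 0:
            flagged.append(int(row["year"]))
    return flagged

flagged_years = years_differing_from_baseline(sample_tsv)
```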

Citation

Junming Huang, Gavin G. Cook and Yu Xie. Large-scale quantitative evidence of media impact on public opinion toward China. Humanities and Social Sciences Communications, 8, 181 (2021). DOI: 10.1057/s41599-021-00846-2.

Junming Huang, Gavin G. Cook and Yu Xie. Between reality and perception: the mediating effects of mass media on public opinion toward China. Chinese Sociological Review, 53 (5), 431-450 (2021). DOI: 10.1080/21620555.2021.1980720.

Download New York Times bestsellers statistics data: the top 20 rankings for hardcover fiction and nonfiction, by week and ISBN-10.

Citation

Burcu Yucesoy, Xindi Wang, Junming Huang and Albert-László Barabási, Success in books: a big data approach to bestsellers. EPJ Data Science 7, 7 (2018). DOI: 10.1140/epjds/s13688-018-0135-y

Download the source code of the proposed IMRank algorithm.

Citation

Suqi Cheng, Hua-Wei Shen, Junming Huang, Wei Chen and Xue-Qi Cheng, IMRank: Influence Maximization via Finding Self-Consistent Ranking, Proceedings of the 37th international ACM SIGIR conference on Research and development in Information Retrieval (SIGIR'14), Gold Coast, Australia (2014). DOI: 10.1145/2600428.2609592

Download the source code of the proposed StaticGreedy algorithm.

Citation

Suqi Cheng, Hua-Wei Shen, Junming Huang and Xue-Qi Cheng, StaticGreedy: solving the scalability-accuracy dilemma in influence maximization, Proceedings of the 22nd ACM Conference on Information and Knowledge Management (CIKM'13), San Francisco, USA (2013). DOI: 10.1145/2505515.2505541

Download the data of Chinese cuisine and related code, originally downloaded from Meishijie and cleaned by Yuxiao Zhu on Mar 18, 2019.

Citation

Yu-Xiao Zhu, Junming Huang, Zi-Ke Zhang, Qian-Ming Zhang, Tao Zhou, Yong-Yeol Ahn, Geography and similarity of regional cuisines in China, PLOS One, 8(11): e79161 (2013). DOI: 10.1371/journal.pone.0079161

Download all data

goodreads.300k.collections.csv (1.4GB): 31,130,083 Book collection records. Columns: user id, item (book) id, collecting timestamp, rating, reserved, category.

goodreads.300k.edges.csv (136MB): 9,922,981 Following links between users. Columns: followed user id, follower user id. User IDs are continuous integers.

goodreads.300k.items.csv (37MB): Information on 1,928,141 items (books). Columns: item id, rating, popularity, category.

The dataset was crawled from www.goodreads.com and anonymized by Junming Huang. You are free to use the dataset in academic research but are not allowed to redistribute it without written permission.

Citation

Junming Huang, Xue-Qi Cheng, Hua-Wei Shen, Tao Zhou and Xiaolong Jin, Exploring Social Influence via Posterior Effect of Word-of-Mouth Recommendations, Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM'12), Seattle, USA (2012). DOI: 10.1145/2124295.2124365

Junming Huang, Xue-Qi Cheng, Jiafeng Guo, Hua-Wei Shen and Kun Yang, Social Recommendation with Interpersonal Influence, Proceedings of the 19th European Conference on Artificial Intelligence (ECAI'10), Amsterdam, The Netherlands (2010). DOI: 10.3233/978-1-60750-606-5-601

Downloads