Finding the “Right Stuff”

A Case Study on the Use of Epi-Search Technology and JStor’s TextAnalyzer

The goal of this study is to illustrate how Epi-Search can be used to find highly relevant related research materials, given an article, book chapter, or whitepaper as the initial item around which further readings are sought.

Because this study was conducted in pursuit of a Sage Ocean Grant, it seemed only natural to make use of the related Sage whitepaper, “The Ecosystem of Technologies for Social Science Research,” as the initial target.

Three retrieval approaches are compared: the TextAnalyzer service available at JStor; the current Epi-Search approach found at http://FindRelatedBooks.com; and the enhanced version of Epi-Search used to prepare the related-materials sections accompanying the Warren McCulloch papers in Issue 1, Volume 21, of the journal E:CO (Emergence: Complexity and Organization). For all three approaches, the retrieved materials are restricted to those found and accessible at JStor (for consistency, all results were also restricted to publication dates from 2000 to the present).

The JStor approach is quite simple: upload the paper and let the software do the work.

The current Epi-Search is almost as simple: copy and paste the paper and let the software do the work. Click the results button labelled “Google Books.” In the resulting Google search box (pre-filled with search terms), add “site:jstor.org ” at the very beginning and switch the bottom button from “BOOKS” to “ALL.” This addition limits Google’s results to materials available at JStor.
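The site-restriction step can be sketched mechanically. The function name and sample terms below are illustrative assumptions, not part of Epi-Search itself; the sketch only shows how prepending Google’s site: operator narrows results to jstor.org:

```python
from urllib.parse import urlencode

def jstor_restricted_search_url(terms):
    # Prepend Google's site: operator so that results are limited to
    # materials hosted at jstor.org, exactly as the manual step does.
    query = "site:jstor.org " + " ".join(terms)
    return "https://www.google.com/search?" + urlencode({"q": query})

url = jstor_restricted_search_url(["social", "science", "research", "tools"])
```

The same restriction applies to any of the query variants described later in this study.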

The updated Epi-Search approach is more sophisticated. Before the query is submitted to the software, the researcher prepares a “lexical profile” consisting of three elements, as shown below:

Lexical Profile of Sage Whitepaper

The Ecosystem of Technologies for Social Science Research

Element #1

The Author’s Abstract or Introduction

The growth in digitally borne data, combined with increasingly accessible means of developing software, has resulted in a proliferation of software to support the research lifecycle. There is now a range of software and tools custom-built for very specific tasks, and the tools supporting common research methods have improved and expanded. Moreover, progress in machine learning models—especially around natural language processing, speech recognition, and the application of graph and network theory—has led to an explosion in new tools and has enabled social science researchers to borrow tools and technologies from other disciplines. The availability and accessibility of new technologies for research is promising. But how can researchers and educators keep up with the changing landscape of tools and software? This challenge became apparent in a survey we conducted in 2016 with close to ten thousand researchers in the social sciences, who told us that the pace of change was an obstacle to teaching new methods to students (Figure 1; Metzler, Kim, Allum, & Denman, 2016). Moreover, the rapid evolution of tools for big data research in particular was seen as a barrier to researchers looking to move into the new and growing field of computational social science (Figure 2). Through subsequent interviews with researchers and students, we gained an understanding of the challenges facing social scientists who want to prepare themselves for a more data-intensive future in research. In response to this and the 2016 survey results, SAGE Publishing launched the SAGE Ocean initiative,1 with the mission to support social science by equipping social scientists with the skills, tools, and resources they need to work with big data and new technology.
Over a period of 10 months, SAGE Ocean reviewed 418 tools and software packages used by social science researchers, which we sourced from research papers, tools directories, company databases like Crunchbase, Wikipedia, researcher and lab blogs, and other websites. We were interested to find out more about:

- How researchers discover tools
- How researchers decide which tools to adopt for their research
- How tool developers fund and maintain their tools
- How developers are recognised for their efforts
- What role software development plays within the academic ecosystem

We explored the various features of these tools and technologies, as well as the key people and organisations that supported their development. We conducted detailed analyses of tools for text annotation, recruiting and surveying research participants, and collecting and analysing social media data. From this work, SAGE Ocean has built a Research Tools Directory2 to help researchers navigate the landscape of tools and software, and launched a Concept Grant scheme3 to support the builders of tools and software for social science research. We will continue this research and share our findings as we expand our list of tools for research. We believe this insight and knowledge is vital for a future in which more research is carried out with the help of technology, and in which researchers may increasingly become tool builders themselves.

Element #2

Word Cloud

academic, access, analysis, annotation, available, blog, challenges, code, collected, com, community, companies, computational, data, developed, digital, figure, free, funded, grants, https, media, number, ocean, open, org, organisations, packages, papers, project, research, retrieved, sage, sagepub, science, social, software, source, support, survey, sustainability, teams, technologies, text, tools, university, used, work, www, years

Element #3

Concept Extractions

Research, Researchers, science, Series, students, Creators, Technologies, Tools, Hu, Methods

Tools, Duca, Creators, Research, cleaning, communities, researchers, ecosystem, annotation, software

Tools, Researchers, Software, Tool, Data, Technologies, Science, Annotation, Packages, Developers


The main idea behind the lexical profile is to create a weighted set of "meaning vectors" with just the requisite variety to attract relevant related results when processed with NLP, LSA, and LDA techniques.
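Epi-Search’s internal pipeline is not documented here. As a minimal sketch of the “meaning vector” idea under that caveat, a bag-of-words cosine similarity (which LSA and LDA refine with weighting, dimensionality reduction, and topic modelling) might look like:

```python
import math
from collections import Counter

def term_vector(text):
    # Bag-of-words term-frequency vector; real systems layer TF-IDF
    # weighting, LSA (dimensionality reduction), or LDA (topic
    # modelling) on top of raw counts like these.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # Cosine of the angle between two sparse term vectors: 1.0 for
    # identical word distributions, 0.0 for no shared terms.
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

profile = term_vector("social science research tools software data")
candidate = term_vector("tools and software for social science research")
score = cosine_similarity(profile, candidate)
```

A candidate document scoring high against the combined profile is, under this simplified model, a plausible “related reading.”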

Element #1, the author’s abstract or introduction, is taken straight from the initial text, with figures, boxes, and excess punctuation removed. Element #2, the word cloud, is prepared by uploading the full text of the initial paper into word cloud software, extracting the top fifty terms, and, if necessary, cleaning the list of nonsense items. Element #3, the concept extractions, is prepared by inserting the full text into the query box at FindRelatedBooks.com and capturing the keywords suggested for both a Google Scholar query (found by examining the results marked “from the web”) and a Google Books query. This is supplemented by a list of related keywords produced by a software keyword extractor (in this case, cortical.io was used).
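The word-cloud step (Element #2) amounts to a frequency ranking. A rough sketch follows; the stop-word list is a placeholder assumption, since word-cloud tools ship their own, and the resulting list would still be cleaned by hand for nonsense items as described above:

```python
import re
from collections import Counter

# Placeholder stop-word list; actual word-cloud software applies its own.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is",
             "that", "we", "with", "by", "on", "as"}

def top_terms(full_text, n=50):
    # Tokenise to lowercase alphabetic runs, drop stop words, and keep
    # the n most frequent terms - the raw material of a word cloud.
    words = re.findall(r"[a-z]+", full_text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(n)]

terms = top_terms("Tools and software for social science research. "
                  "Research tools support researchers.", n=5)
```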

When all three elements are combined, the resulting lexical profile is submitted to the query box at FindRelatedBooks.com. Once again, click the results button labelled “from the web,” then click the link to Google. In the resulting Google search box (pre-filled with search terms), add “site:jstor.org ” at the very beginning. This addition limits Google’s results to materials available at JStor.
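The combination step itself is mechanical. In the sketch below, simple concatenation stands in for whatever weighting Epi-Search applies internally (which is not documented in this study), and the sample inputs are abbreviated:

```python
def lexical_profile(abstract, word_cloud, concept_extractions):
    # Join the three elements into one query text, separated by blank
    # lines; internal weighting of the elements is not modelled here.
    return "\n\n".join([abstract.strip(),
                        " ".join(word_cloud),
                        " ".join(concept_extractions)])

profile = lexical_profile(
    "The growth in digitally borne data has resulted in a proliferation "
    "of software to support the research lifecycle.",
    ["tools", "software", "research", "data", "social", "science"],
    ["Tools", "Researchers", "Software", "Technologies"],
)
```

The resulting text is what gets pasted into the FindRelatedBooks.com query box.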

Results

All three queries produce relevant material. The “best” and most relevant results are provided by the enhanced Epi-Search process.

JStor TextAnalyzer

The initial result is an HTML page which looks like this:

Listing the results for easier readability:

Current Epi-Search

Enhanced Epi-Search

Discussion

Of course, a researcher could simply input the title of the whitepaper into Google and again limit the results to JStor by adding “site:jstor.org” in front of the search terms. This results in:

These results seem rather biased towards “ecology” – most likely due to the word “Ecosystem” in the whitepaper title. One of the serious disadvantages of general searches such as Google is that the entailments of words are drawn from general language use and not from the specific interests of the querying researcher. By contrast, the lexical profile approach used in the modified FindRelatedBooks.com adds just enough context so that the entailments better reflect the paper from which the query words are drawn.

Using JStor as the target corpus does not mean the researcher is stuck using the internal JStor search engine. Google allows searches of JStor material. The JStor results can be improved by taking the concepts which TextAnalyzer has extracted and using them in a Google search (restricted with the term “site:jstor.org”).

This suggests that there is merit in the TextAnalyzer concept extractions but less merit in JStor’s internal search engine.

We note that both versions of Epi-Search also return results from the ISCE Library (4,000+ books centered on systems, complexity, philosophy, and organizations). These too seem highly relevant (though some are pre-2000):

By contrast, results from Google Books are more tangential and are best used by a researcher trying to write about the topics in the whitepaper and to locate them within a context composed of current book-level work. This suggests (and we have other evidence to support it) that aiming Epi-Search at a restricted corpus (e.g., JStor or the ISCE Library) works well for obtaining focused results, while aiming it at large corpora yields more tangentially relevant retrievals. Either way, the software has two additional uses: 1) a researcher can input their work and discover related material which perhaps should be addressed but of which they were previously unaware, and 2) that same input can reveal related but tangential areas of exploration for query expansion and “added readings.”

FindRelatedBooks.com, in its various forms, is a powerful tool of which far too few researchers are aware.

For more information about epi-search and FindRelatedBooks.com please contact Michael Lissack (michael.lissack@gmail.com or 617-710-9565) President, American Society for Cybernetics