<h1 id="learning-by-reading">Why is learning from digital texts still a challenge?</h1>
<p>Ebooks have been around <a href="https://www.theguardian.com/books/2002/jan/03/ebooks.technology">since the late 70s with Project Gutenberg, and became a commercially available format as soon as the internet enabled monetary exchanges in the 90s</a>. By 1998, US libraries were already distributing free ebooks through their websites, and technologies from companies like OverDrive have been supporting that effort since the late 80s. Fast forward to the 2000s, when Amazon and Apple brought digital reading to the masses with simpler ereaders and smarter phones. Yet the shift has been remarkably slow. It isn’t a Kodak moment: <a href="https://www.forbes.com/sites/ellenduffer/2019/05/28/readers-still-prefer-physical-books/?sh=4371e34c1fdf">print books are still popular</a>, and digital versions remain their less-flexible equivalents.</p>
<p>We’ve certainly enjoyed the weightlessness: we can carry all the books we want, everywhere, without backpacks full of heavy textbooks breaking our backs. But that alone isn’t enough to make us read on our devices all the time. Print books aren’t vintage; they are still very much the preferred option 40 years after the first books went electronic. When digital photography came about, it made it easy for anyone to become an artist, lowering the barrier to entering the field. Digital books haven’t yet delivered an impact of that scale.</p>
<p>Arguably, lower pricing and instant access are a bonus, and that helped in 2020. Tax incentives (in the UK, for example) and the pandemic’s restrictions on physical retail boosted the digital books market, while print sales declined more than in previous years <a href="https://www.theguardian.com/books/2020/nov/14/pandemic-drives-ebook-and-audiobook-sales-by-uk-publishers-to-all-time-high-covid">with shops closed</a>. 2020 was especially catalytic for etextbooks. Working and studying from home engendered an almost prescribed growth in digital textbook sales, one that is unlikely to be sustained, because learning by reading digitally continues to be a frustrating experience.</p>
<h2 id="all-you-can-read">All you can read</h2>
<p>We expect digital books to be priced lower than print, given the absence of a physical product; that is achievable and has little impact on the underlying business model. We purchase books for our Kindle or Kobo, via Hive, or even through a publisher’s direct-to-consumer (D2C) subscription like that of <a href="https://www.oreilly.com/">O’Reilly</a>. We share them with family and friends, much as we would a print book. The author still gets the royalties (approximately) associated with one sale.</p>
<p>With textbooks, affordability has been a core value proposition of all the new technologies aggregating them, and they have succeeded. <a href="https://www.mckinsey.com/industries/public-and-social-sector/our-insights/the-future-of-textbooks#">Renting textbooks went mainstream in 2008</a>; according to the <a href="https://www.nacs.org/student-spending-on-course-materials-continues-to-decline">National Association of College Stores, student spending on textbooks</a> and course materials in the US and Canada decreased from ~$700 on average in 2008 to ~$400 in 2020.</p>
<p>Digital library alternatives that offer access to at least half a million books and textbooks for under $20 a month, like <a href="https://www.perlego.com/">Perlego</a> (2016), and that sometimes let you publish and sell your own, like <a href="https://www.scribd.com/">Scribd</a> (2007), are forcing the publishing industry to rethink author royalties. While they simplify and integrate the reader’s experience, charging the price of a book or less per month, they merely replicate physical reading and have done little to transform it.</p>
<p>A similar push to disrupt the author royalty model comes from B2B digital library subscriptions, like <a href="https://bibliu.com/">BibliU</a> (2015) and <a href="https://www.kortext.com/">Kortext</a> (2013). Given that the buyer is the institution and the bundle price is driven by its size, pricing per book starts to resemble the exotic derivatives market. A library purchase gets a digital copy of the textbook that, unlike your Kindle copy, can be used by multiple students at the same time.</p>
<p>An intro textbook in <a href="https://opensyllabus.org/result/title?id=51539639743">Calculus</a>, for example, would be used across all STEM majors by at least half of the first-year students (just over one thousand at a mid-size university), who can all borrow the digital copy via BibliU or Kortext. The cover price is an important incentive for the author. Writing a textbook is <a href="https://eric.ed.gov/?id=EJ1146434">a far more complex and time-consuming</a> project than a non-fiction book or any other piece of writing that academics are accustomed to, argues the American psychologist Robert J. Sternberg. To make an all-inclusive textbook subscription model successful, the author’s compensation must go through some form of creative destruction (assuming print is off the table), unless the price of an etextbook purchased by an institution is allowed to vary with the estimated number of users who will borrow it (<a href="https://www.theguardian.com/education/2021/jan/29/price-gouging-from-covid-student-ebooks-costing-up-to-500-more-than-in-print">which isn’t working very well</a> right now).</p>
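<p>To make the pricing tension concrete, here is a toy calculation in R. All the numbers are hypothetical illustrations, not actual publisher or platform terms:</p>
<pre><code class="language-r">
# Toy royalty comparison (all figures hypothetical, not real publisher terms)
cover_price  = 150    # assumed cover price of the etextbook
royalty_rate = 0.10   # assumed royalty rate
borrowers    = 1000   # first-years borrowing the digital copy

# If every borrower had bought their own copy:
royalty_individual = cover_price * royalty_rate * borrowers   # $15,000

# Flat institutional licence: one "sale", any number of borrowers:
royalty_flat = cover_price * royalty_rate                     # $15

# Usage-scaled licence: price grows with estimated borrowers,
# here with an assumed steep volume discount:
price_scaled   = cover_price * borrowers^0.5                  # ~$4,743
royalty_scaled = price_scaled * royalty_rate                  # ~$474

c(individual = royalty_individual, flat = royalty_flat, scaled = royalty_scaled)
</code></pre>
<p>However the discount curve is drawn, the gap between the first and second numbers is the creative destruction referred to above.</p>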
<p>Digital apps from <a href="https://www.overdrive.com/">OverDrive</a> add to this conundrum in book (as opposed to textbook) library lending, sending the idea of late fees into oblivion. The company reports that its users borrowed “<a href="https://librarytechnology.org/pr/25805">430 million ebooks, audiobooks and digital magazines in the past 12 months [2020], a 33% increase over 2019</a>”, while more than 20 thousand new libraries and schools signed up.</p>
<p>With $200M raised among them, these technologies have made strides with affordability. They have even tried to improve and personalize your reading experience, with functionality that ranges from changing the font and background to highlighting the text, adding notes, text-to-speech, bookmarking, dictionaries, chapter summaries and other neat elements. These affordances replicate those of physical reading, sometimes even delighting the user with instant word searches. That works quite well when reading for pleasure; when learning, however, our brains collect information spatially. We tend to remember really well where fragments of information are located on the page and what that page looks like. We lose that ability when we read on Kortext or Perlego, and even more so with all the apps that train you to read for speed, <a href="https://www.theguardian.com/technology/2017/apr/08/speed-reading-apps-can-you-really-read-novel-in-your-lunch-hour">like Spritz and Spreeder</a>.</p>
<h2 id="second-generation-enhancements">Second generation enhancements</h2>
<p>In 2015, after four years of research on how college students learn with and from textbooks, a group of academics at Harvard developed <a href="https://perusall.com/">Perusall</a>. Perusall isn’t addressing affordability but rather the effectiveness of student reading; the application focuses on the job to be done: learning. It enables learning from book chapters, articles and other materials through social interactivity, adding the lecturer back into the process. In other words, students can share notes and highlights, start discussions in the margin, and answer each other’s questions about the material, and the instructor has all this data to help them decide when to engage.</p>
<p><a href="https://glose.com/">Glose</a> has been taking social reading to the mainstream in France and Europe since 2014, and was just <a href="https://techcrunch.com/2021/01/14/medium-acquires-social-book-reading-app-glose/">acquired by Medium</a>, the content publishing platform, an indication that social reading is already an expected feature of any reading interface. The apps fixing the affordability challenge (BibliU, Kortext, Scribd, Perlego, Classoos) also offer social reading functionality. The approach to social reading that fires me up even more, in fact, is that of <a href="https://reading.supply/">reading.supply</a>. While it doesn’t support books and textbooks, it does a much better job of helping you develop your knowledge while reading socially, with the very neat option of manually building knowledge graphs.</p>
<p>Independent and group notetaking while reading online is neither new nor unusual. With all the content on the internet, bookmarking and annotation evolved quite organically. <a href="https://www.diigo.com/">Diigo</a> (2005), <a href="https://web.hypothes.is/">hypothes.is</a> (2011), and <a href="https://getmemex.com/">Memex</a> (2015) are just a few of the tools that enable this functionality and are popular with universities. Memex builds on the idea that we learn from reading if we can build a knowledge graph, and supports the user in organizing the information from bookmarks and online annotations in that form. This isn’t something I’ve seen in any of the digital textbook applications.</p>
<h2 id="can-you-read-it-all-and-learn-from-it">Can you read it all and learn from it?</h2>
<p>While notetaking has been linked with <a href="https://www.atlantis-press.com/proceedings/icat2e-17/25868789">improved learning outcomes both in print and digital</a>, there are very <a href="https://www.tandfonline.com/doi/abs/10.1080/19345747.2015.1105894?journalCode=uree20">few studies that test its benefits for college students, and even fewer that look at digital course texts</a>. These suggest that guided notes, along with self-questioning and immediate summaries, have the strongest and most consistent effect on student learning. Furthermore, we know that learning happens through spaced repetition, breaking down the content, returning to previous chapters to review, and flipping through the pages of later chapters to inquire.</p>
<p>These affordances are slowly being picked up by reading technologies. <a href="https://readwise.io/">Readwise</a> tried to crack spaced repetition for Kindle readers. <a href="https://www.mindstone.com">Mindstone</a>, launched in February 2021, promises features based entirely on the science of learning. It organizes your reading and notes, compounds them into a knowledge graph, and lets you set up reminders to support spaced repetition.</p>
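<p>For readers unfamiliar with the mechanics, a minimal R sketch of an expanding-interval schedule gives the flavour of spaced repetition. This is a generic textbook-style heuristic, not the actual algorithm behind Readwise or Mindstone:</p>
<pre><code class="language-r">
# Generic expanding-interval scheduler (illustrative only; not the
# algorithm of any particular app). Each successful review multiplies
# the gap until the next one, so reviews arrive just before forgetting.
schedule_reviews = function(first_gap_days = 1, factor = 2.5, n_reviews = 6) {
  gaps = first_gap_days * factor^(0:(n_reviews - 1))   # 1, 2.5, 6.25, ...
  Sys.Date() + cumsum(round(gaps))                     # calendar dates
}

schedule_reviews()
# reviews roughly 1, 2.5, 6, 16, 39 and 98 days apart
</code></pre>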
<p><a href="https://link.springer.com/article/10.1007%2Fs11251-007-9016-7">Retention and comprehension can also be improved with associated visual cues</a> (like pictures and videos) in multiple media formats, especially when reading in a second language. Several textbook aggregator apps are adding such materials to their platforms, but they are not yet an integral part of the main reading experience.</p>
<p><a href="https://www.goerudite.com/">Erudite</a> doesn’t yet support textbooks, but it’s transforming the reading experience for college students. Powered by several algorithms, the app breaks the text down into largely self-contained sections. It incorporates highlights, note taking, spaced repetition, self-questioning and recall; instead of rushing to finish the content, you read to internalize as much of the knowledge as you can.</p>
<p>Most textbook products are not like Erudite or Mindstone: they focus on distributing the digital artefact in an affordable, all-inclusive way rather than on the job the student needs done. They are not selling the ‘learning from textbooks’ experience; they exquisitely replicate the analogue at a discount, with increasingly better affordances. Yet delivering a product or tool that increases the quantity and quality of learning from textbooks will require redefining the problem, and a step away from the delivery of text towards the delivery of a guide to learning.</p>
<h1 id="tools-directory">Developing a comprehensive directory of tools and technologies for social science research methods</h1>
<blockquote>
<p>This post originally appeared on <a href="https://forrt.org/educators-corner/003-developing-tools/">FORRT’s Educators Corner</a></p>
</blockquote>
<p>The search for and exploration of tools and technologies in social science research is often not part of the class curriculum in the way that the systematic review of literature is. This, sadly, leaves the budding researcher at a disadvantage, in my opinion. Early in their research careers, students mostly rely on their supervisor or peers for advice on the tools they use, which is still a very limited sample. Yet with strides in technological development, researchers can choose from a growing number and variety of tools for social science methods, rising from within the discipline itself, borrowed from other disciplines, or coming from the commercial sector.</p>
<p>Starting from this premise, we decided to build a <a href="https://ocean.sagepub.com/research-tools-directory#categories">tools directory</a> for social scientists: a single place where any researcher or student can find the right tool for their needs. In this piece, I explain how the tools directory was developed and how it can be used by educators, researchers and students.</p>
<h2 id="developing-the-tools-directory">Developing the tools directory</h2>
<p>The initial list was based on software tools and tech platforms that we knew were popular among social science researchers, because we’ve commissioned books about them or they have been prominent within the community. We continued to ask academics and to look through papers and other lists, like the <a href="http://dirtdirectory.org/">DiRT Directory</a> from the digital humanities, the <a href="https://wiki.digitalmethods.net/Dmi/ToolDatabase">Digital Methods Initiative</a> and <a href="https://sourceforge.net/">SourceForge</a>. Soon enough, the directory was growing out of control. What we thought would be a simple scroll-down page, organised in a few basic categories, was no longer serving its purpose.</p>
<p>With around three hundred different software packages and tools that we knew were used by some or many social science researchers in their work, a new challenge was becoming apparent: a paradox of choice. On one hand, it was increasingly clear why academics often rely solely on recommendations from their peers when choosing a tool. On the other, we knew we needed to explore how one would choose the right tool from a list, and ultimately how to teach others to find the tool that fits their own purposes rather than simply recommending one they’ve used.</p>
<p>As the list grew, we enlisted the help of a few master’s students and started collecting more data: who built these tools; were they free or paid; what cluster of similar tools did they belong to; when were they built; based on the information available, could we tell whether they were up to date, scaling, or abandoned; could we find papers that cited these tools; were the creators recommending a citation; and so on.</p>
<p>When we hit 400 software packages/tools, we knew we had to promote this list and share it in a way that researchers would actually stumble upon, with the opportunity to reference it in a lecture or paper. So we wrote a <a href="https://uk.sagepub.com/en-gb/eur/technologies-for-social-science-research">whitepaper summarizing the big trends in the development of tools and tech for social science research</a>. We learned that both commercial and non-commercial tools are popular within the social sciences, but the ones that last longer and are more successful look beyond the discipline and almost always have a person or team dedicated to raising funds or expanding the community of users and contributors.</p>
<p>At 400 software packages/tools, we were still not sure the list was big enough. We then focused on specific methods and researched all the tools available to carry out each method or task within the research process. We looked at the evolution of technologies for that method in particular, as well as how it fits within the development of the method itself. We call these ‘deep dives’. We’ve done deep dives on tools for annotation, tools for transcription, surveying tools, and tools for studying social media data, and we kept finding more software applications within each of these areas. These deep dives proved quite useful, as they let us share slightly more comprehensive sub-lists of tools that could be used in different modules. We now have 543 tools on the list, and the number keeps growing.</p>
<h2 id="how-to-use-the-tools-directory">How to use the tools directory</h2>
<p>The full directory is currently available on our <a href="https://sagepublishing.github.io/sage_tools_social_science/">GitHub repository</a> as a csv file. We decided to host it on GitHub so that we can update the directory when we come across new tools or after deep dives, ensure it’s always available for others to reuse in its most up-to-date form, and enable instructors, students and researchers to add tools that might be missing.</p>
<p>Educators teaching research methods or preparatory courses for students’ theses could present the full tools directory to students, so they are more flexible in finding the right tools for their needs and future projects. Students can browse through the list and filter it to find the tool most appropriate for a research project they are initiating. For example, a student transcribing interviews might look at the transcription tools to find alternatives. Similarly, educators teaching a more specialized course, such as an introduction to text mining, data visualization, or social data mining, or running online experiments, could filter out a sub-list of tools focused on that method. They could then share this sub-list as part of the course reference materials or assignments.</p>
<p><img src="https://forrt.org/educators-corner/003-developing-tools/featured.png" alt="Fig. 1. The spread of 543 tools and technologies across methods and techniques." title="Fig. 1. The spread of 543 tools and technologies across methods and techniques." /></p>
<p>Fig. 1. The spread of 543 tools and technologies across methods and techniques.</p>
<p><img src="https://forrt.org/educators-corner/003-developing-tools/fig2.png" alt="Fig. 2. Filtering to find transcription tools. A student or instructor could filter by column F (the Competitive cluster, which contains the method/technique/task/area that we used to categorize the tool) to get a sub-list of tools that could be broadly used for a particular process. If the cluster is too broad, the student can look through the technique (column E), which breaks it down further. For example, for social media tools, the technique would include analysis, collection, visualisation etc. If looking for more recent tools, one can filter by the year the tool was launched (column M); or if the student is interested in something that is free, they can check the charges (column N)." title="Fig. 2. Filtering to find transcription tools." /></p>
<p>Fig. 2. Filtering to find transcription tools. A student or instructor could filter by column F (the Competitive cluster, which contains the method/technique/task/area that we used to categorize the tool) to get a sub-list of tools that could be broadly used for a particular process. If the cluster is too broad, the student can look through the technique (column E), which breaks it down further. For example, for social media tools, the technique would include analysis, collection, visualisation etc. If looking for more recent tools, one can filter by the year the tool was launched (column M); or if the student is interested in something that is free, they can check the charges (column N).</p>
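<p>If you prefer to filter the csv programmatically rather than in a spreadsheet, a few lines of R produce the same sub-lists. The file name and the exact column headers below are assumptions based on the column descriptions above, so check them against the current csv in the repository:</p>
<pre><code class="language-r">
# Sketch: filtering the tools directory csv in R. The file name and the
# column names ("Competitive.cluster", "Technique", "Year.launched",
# "Charges") are assumed from the figure captions above; adjust them to
# match the header of the csv in the GitHub repository.
tools = read.csv("sage_tools_social_science.csv", stringsAsFactors = FALSE)

# A sub-list of transcription tools (column F in the spreadsheet view):
transcription = subset(tools, Competitive.cluster == "Transcription")

# Free tools for collecting social media data, launched since 2015:
recent_free = subset(tools, Competitive.cluster == "Social media" &
                            Technique == "Collection" &
                            Year.launched >= 2015 &
                            Charges == "Free")
head(recent_free)
</code></pre>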
<p>While the csv file that contains the tools directory might be easy to update and share, we acknowledge that it might not be that easy to use within a classroom. We are experimenting with a variety of ways to better display and navigate the directory, without losing the ease of updating it.</p>
<p>In 2019 we did our first deep dive, into the tools for social data science, to support our <a href="https://campus.sagepub.com/collecting-social-media-data">SAGE Campus course on collecting social media data</a>. We created a sub-list to share for this course to help learners find the software that might be most appropriate for their own project, especially given the variety of social media platforms available. To render <a href="https://airtable.com/shrux4hYwNG1cOyjK/tbl9NiGq87agePZ2M?backgroundColor=cyan&viewControls=on">the sub-list</a> in a friendlier way, we used the free version of Airtable, a no-code app for relational databases with a colorful, modern interface. Students navigate to this page (Fig. 3) to see the sub-list in a single table. They can then find the right tool for their social media project by selecting the platform they want to collect data from (Twitter, Instagram, Facebook etc.), whether they are happy to pay or are looking for something free, and the type of task they want to perform: collecting the data, analysis, or visualization. Once they have a filtered list, they can also look through the academic papers we’ve linked where each tool has been used, to explore the tools’ potential further.</p>
<p><img src="https://forrt.org/educators-corner/003-developing-tools/fig3.png" alt="Fig. 3. Screenshot of the sub-list containing social media tools via the free version of airtable. Similar to working with a csv file (as in Fig. 2), this interface lets the student filter the list down to narrow the choices for a tool they could use to either collect or analyse their data. This interface is web-based, and has a more inviting user experience than working with a csv file. A student can easily see the categories of tools, filter by multiple terms or concepts linked within each of the columns." title="Fig. 3. Screenshot of the sub-list containing social media tools via the free version of airtable. Similar to working with a csv file (as in Fig. 2), this interface lets the student filter the list down to narrow the choices for a tool they could use to either collect or analyse their data. This interface is web-based, and has a more inviting user experience than working with a csv file. A student can easily see the categories of tools, filter by multiple terms or concepts linked within each of the columns." /></p>
<p>Fig. 3. Screenshot of the sub-list containing social media tools via the free version of airtable. Similar to working with a csv file (as in Fig. 2), this interface lets the student filter the list down to narrow the choices for a tool they could use to either collect or analyse their data. This interface is web-based, and has a more inviting user experience than working with a csv file. A student can easily see the categories of tools, filter by multiple terms or concepts linked within each of the columns.</p>
<p>We envision this sub-list of social media tools as a starting point: it helps the learner filter down based on a limited number of criteria, such as the task to be achieved (collection, analysis), the social media platform that’s integrated, and the fees.</p>
<p>We’ve reused the same <a href="https://socialmediatools.pory.app/">sub-list of social media tools with a different interface</a> (pory.io, currently in beta) to render the list as something more akin to a catalogue of records that the student can search and filter. This rendering was used in a bootcamp on getting started with social media research. As with the Airtable rendering, a student can filter based on the task they want to achieve and then click into each tool to get more information and explore which one would work best.</p>
<p><img src="https://forrt.org/educators-corner/003-developing-tools/fig4.png" alt="Fig. 4: Screenshot of the sub-list of social media tools rendered into a catalogue via the pory.io app. The user experience on this interface is friendlier than working with a table as in Fig. 2 & 3. A student can filter the list by the type of tool, which is immediately visible; for example they might be looking for tools to support their data collection. They can then use the search box to enter key terms and narrow down the list further, a process that is more familiar. The student can also browse the list of tools by opening the individual cards to find more information (see next figure)." title="Fig. 4: Screenshot of the sub-list of social media tools rendered into a catalogue via the pory.io app." /></p>
<p>Fig. 4: Screenshot of the <a href="https://socialmediatools.pory.app/">sub-list of social media tools</a> rendered into a catalogue via the pory.io app. The user experience on this interface is friendlier than working with a table as in Fig. 2 & 3. A student can filter the list by the type of tool, which is immediately visible; for example they might be looking for tools to support their data collection. They can then use the search box to enter key terms and narrow down the list further, a process that is more familiar. The student can also browse the list of tools by opening the individual cards to find more information (see next figure).</p>
<p><img src="https://forrt.org/educators-corner/003-developing-tools/fig5.png" alt="Fig. 5: Once the student filters a list of tools, they can click on each card to get further information about each tool. Currently this includes a brief description, the platform supported, whether it’s free or not, and several academic papers that have used this tool." title="Fig. 5: Once the student filters a list of tools, they can click on each card to get further information about each tool." /></p>
<p>Fig. 5: Once the student filters a list of tools, they can click on each card to get further information about each tool. Currently this includes a brief description, the platform supported, whether it’s free or not, and several academic papers that have used this tool.</p>
<p>Airtable and pory.io have different affordances for rendering the sub-lists of tools, and our experience so far is that both have been useful. We are hoping to learn more from these experiments, to understand the student’s journey as well as the data that would inform their exploration process.</p>
<p>The social media tools sub-list was part of a deep dive that we carried out in 2019. Since then, we dived into <a href="https://sagepublishing.github.io/sage_tools_social_science/2019/11/11/surveying-tools.html">surveying tools</a> and <a href="https://sagepublishing.github.io/sage_tools_social_science/2020/01/20/text-mining.html">text mining</a>. We have not created separate sub-lists for these, and encourage instructors to try other ways of representing these tools within their courses. If you are teaching text mining in the social sciences, for example, you can point your students to this <a href="https://sagepublishing.github.io/sage_tools_social_science/2020/01/20/text-mining.html">overview of the text mining tools available</a> (Fig. 6 & Fig. 7) and share a sub-list of the tools directory filtered for text mining with your students.</p>
<p><img src="https://forrt.org/educators-corner/003-developing-tools/fig6.png" alt="Fig. 6: Screenshot of the Text Mining section, an overview of tools available." title="Fig. 6: Screenshot of the Text Mining section, an overview of tools available." /></p>
<p>Fig. 6: Screenshot of the Text Mining section, an overview of tools available.</p>
<p><img src="https://forrt.org/educators-corner/003-developing-tools/fig7.png" alt="Fig. 7: Text mining tools and technologies based on the process they support." title="Fig. 7: Text mining tools and technologies based on the process they support." /></p>
<p>Fig. 7: Text mining tools and technologies based on the process they support.</p>
<h2 id="going-forward">Going forward</h2>
<p>Going forward, we are quite interested in finding out what criteria people use to filter down to their top tools, so we can build the list out and continuously add the data that helps academics and students find the tools that best fit their project.</p>
<p>We understand that lists follow some form of hype cycle: there is a lot of work at the start and some engagement from the community, and then the whole project slowly dies and is forgotten. It becomes pretty unusable, because at the pace of research and technology, many of the tools go out of date and many new ones pop up. Someone must be dedicated to updating the list, and for now we have that covered. Since the publication of the whitepaper in November 2019, we’ve added at least 100 more tools, mostly focusing on text and data mining. While it’s relatively easy to come across new tools, the hardest bit is updating the ones already on the list, and that’s where we are open to suggestions from the community. The list, with updates to the whitepaper, is available in this <a href="https://sagepublishing.github.io/sage_tools_social_science/">GitHub repository</a>.</p>
<p>Finally, the <em>locus</em> of software tools and technologies within the research ecosystem remains a big challenge. Software tools have yet to gain recognition as research output. That is why, among other reasons, software tools are rarely cited or referenced in papers. This is not only bad for the <a href="https://www.slideshare.net/danielskatz/citation-and-reproducibility-in-software">reproducibility of research; it also makes it difficult for other researchers to weigh in and compare different tools</a> used for similar studies. We aim to promote and include the suggested citation of the tools on our list, and strongly encourage anyone to use <a href="https://citeas.org/">https://citeas.org</a> when unsure how to give credit.</p>
<p>We remain active and are continuously thinking of better ways to present and re-architect the information about software tools and technologies we’ve gathered, to make it easier to navigate and explore. We hope these materials will help you and your students become more aware of the diversity of tools and technologies, and will open new and potentially easier avenues for deciding on the best software tool to use for your research.</p>
<h1 id="experiments-tools">The challenges of running social science experiments from home - and 14 tools that can help</h1>
<blockquote>
<p>This post originally appeared on <a href="https://ocean.sagepub.com/blog/">SAGE Ocean</a></p>
</blockquote>
<p>Never have the social and behavioral sciences been as critical as during the COVID-19 pandemic and the associated lockdowns. More data was in demand, and it was demanded instantly, along with the most robust analysis and policy recommendations, which meant that classic research methods needed some creativity to transition to a socially distanced world. Data collection methods have been adapting not just to answer pressing questions about the <a href="http://blog.ukdataservice.ac.uk/covid-19-social-surveys/">impact of COVID</a> on <a href="https://digest.bps.org.uk/2020/03/26/how-psychology-researchers-are-responding-to-the-covid-19-pandemic/">individuals</a> and society, but also <a href="https://www.frontiersin.org/articles/10.3389/fpsyg.2020.01786/full">accelerating the ways in which they could be carried out digitally</a>.</p>
<p>Digital methods are not new to the social sciences: surveys have been going digital since the 90s, with Mechanical Turk, Prolific, Call for Participants and even social media marketing more recently enhancing the digital recruitment of participants. The growth of social media to cover 3.6B of the world’s population has allowed academics both to use the data from these platforms for research and to augment sample snowballing methods. In more recent years, trends around misinformation and fake news online have prompted teams of academics to develop online games such as <a href="https://www.getbadnews.com/#intro">GetBadNews</a> and <a href="https://hoaxy.iuni.iu.edu/">Hoaxy</a> that teach the average internet user about misinformation, and that also function as a way to test theories of learning and retention.</p>
<p>A growing number of social science researchers are shifting to digital methods, but it’s not an easy task, and this has been even more evident in lockdown. Some research methods are challenging (or even impossible) to run digitally, for example, experiments that can only be done in person or research that focuses on <a href="https://ourworldindata.org/internet">groups or regions that are hardly online</a>. Even the fully digital VR experiments that we wrote about in another <a href="https://ocean.sagepub.com/blog/virtual-reality-the-future-of-experimental-social-research">blog</a> are difficult when the hardware is not available or can only be accessed in a lab. So what can you do if you want to continue doing social science experiments and scale them up while working from home?</p>
<p>We’ve selected 14 software tools that you can start using immediately to run your social or behavioral experiments online. Some come with integrated recruitment, others leave that bit to you; some are for asynchronous experiments, others let you run your experiment with multiple users at the same time; some work for small groups, others for thousands or more. And for those of you who want to go fully digital and forget about recruiting, we’ve included two bonus tools you can use to design and run simulated or computational experiments.</p>
<p>We spoke to some of these tools’ creators about what they think are the biggest challenges of running experiments remotely. Read their answers here or <a href="https://ocean.sagepub.com/blog/tools-and-technology/challenges-of-running-social-science-experiments-from-home-and-14-tools-to-help#softwaretools">jump straight to the list of tools</a>.</p>
<h3 id="claudia-von-bastian-of-tatool">Claudia von Bastian of <a href="http://www.tatool.ch/">Tatool</a></h3>
<p>The main challenges of conducting experiments remotely are <strong>portability</strong> (running experiments across platforms, browsers, and devices), <strong>usability</strong> (creating unambiguous instructions to ensure that participants are actually doing what you want them to do), and <strong>quality</strong> (having checks in place to assess the quality of participant-generated data sets).</p>
<h3 id="chris-wickens-of-otree">Chris Wickens of <a href="https://www.otree.org/">oTree</a></h3>
<p>I think a big challenge of running online multiplayer experiments is <strong>dropouts</strong>. In a lab you can ensure that people play at the same time and that they all complete the experiment. Online this is much harder. With a multiplayer experiment, there is the risk that some participants will be stuck waiting for someone who dropped out (or lost their internet connection, etc).</p>
<h3 id="ting-qian-of-finding-five">Ting Qian of <a href="https://www.findingfive.com/">FindingFive</a></h3>
<p>We are hearing from researchers that the biggest challenge in shifting to online experiments has been the <strong>lack of time and technical resources</strong> for (re)creating their experiments online. Further, since researchers can no longer invite participants to the lab and have them complete the task on a standardized lab computer, another concern is accounting for <strong>variability</strong> across participants’ devices. For instance, without adequate corrections, mouse tracking results can look vastly different when participants use a mouse as opposed to a laptop trackpad.</p>
<h3 id="jo-evershed-of-gorilla">Jo Evershed of <a href="https://gorilla.sc/">Gorilla</a></h3>
<p>One concern for people new to online research is <strong>maintaining data quality</strong> when conducting online experiments. Thankfully, once you unpack all the different elements that can impact data quality, it’s not that much different to maintaining data quality in the lab. Dr Jenni Rodd gave an excellent lecture on this at our BeOnline2020 conference, the video can be viewed <a href="https://gorilla.sc/support/blog/data-jenni-rodd">here</a>.</p>
<h3 id="jason-radford-of-volunteer-science">Jason Radford of <a href="https://volunteerscience.com/">Volunteer Science</a></h3>
<p>There are two big challenges to running online experiments: <strong>Choosing the right software</strong> and <strong>recruiting subjects</strong>. Different fields have different experimental traditions, so if you’re working across multiple areas then it can be hard to find the right software. When it comes to recruiting subjects, the issue is usually a lack of funding, or having to figure out where to find volunteers. It’s for these reasons that we designed Volunteer Science to be discipline agnostic, and to give you the choice of building your own participant pool, syncing with Mechanical Turk, or tapping into our existing pool of tens of thousands of volunteers.</p>
<p><a href="http://pebl.sourceforge.net/"><strong>PEBL</strong></a> is free software specifically designed for use in psychology. It lets you design your own experiments or use any of the ready-made ones. It’s also a great tool to use in your teaching, since you can build and exchange experiments freely.</p>
<p><a href="http://www.tatool.ch/"><strong>Tatool</strong></a> is an open-source, easy-to-use tool for experts as well as newbies. You can either download the software or use the web version. You’d have to recruit your own participants, but they can access the tool from anywhere and any browser. It also has the option of running your experiments offline, though you’d probably need to be in a lab for that.</p>
<p><a href="http://www.expfactory.org/about"><strong>Experiment Factory</strong></a> is another open-source tool that offers a collection of experiments and the ability to integrate recruitment with Mechanical Turk. It’s still in beta, but you can sign up or reach out to the team behind it at the <a href="http://poldracklab.stanford.edu/">Poldrack Lab</a>, Stanford University.</p>
<p><a href="https://gorilla.sc/"><strong>Gorilla</strong></a> is one of the more well-known commercial solutions for designing and running online experiments in the behavioral sciences, and it integrates participant recruitment. Gorilla is great for research and for teaching!</p>
<p><a href="https://pstnet.com/products/e-prime/"><strong>E-Prime</strong></a> is a relatively comprehensive software package for behavioral research. The company has some pretty sweet hardware tools and sensors as well, but you’d probably need to meet participants face to face to set those up.</p>
<p><a href="https://www.otree.org/"><strong>oTree</strong></a> is an open-source platform commonly used to design experiments in economics, among other disciplines, with the ability to run multi-player strategy experiments. oTree also integrates with Mechanical Turk.</p>
<p><a href="https://lioness-lab.org/"><strong>Lioness</strong></a> is a free web-based platform for designing and running your experiments that is super popular in economics as well. The team behind it works really hard to enable researchers to run experiments with simultaneous participants who are incentivized to stay attentive and not drop out!</p>
<p><a href="https://www.psytoolkit.org/"><strong>PsyToolKit</strong></a> is a free-to-use toolkit for demonstrating, programming, and running cognitive-psychological experiments and surveys, including personality tests. It’s great for doing research and for teaching.</p>
<p><a href="https://www.findingfive.com/"><strong>FindingFive</strong></a> is one of the newer tools for running your experiments online. It has some pretty cool features, like blocking anyone who tries to take the experiment twice, and it makes it easy to set up exclusion criteria or prerequisites when you recruit your participants via Mechanical Turk.</p>
<p><a href="https://lab.js.org/"><strong>Lab.js</strong></a> is also among the newer players in this space. It’s free and open source, you don’t need any coding skills, and it has great resources for getting started with running an experiment or teaching about it.</p>
<p><a href="https://nodegame.org/"><strong>nodeGame</strong></a> has a variety of features. It’s web-based and works on mobile too. It’s free, can scale to thousands of users participating in an experiment at the same time, and even lets you replace humans with simulated bots. I’d say pretty sweet!</p>
<p><a href="http://empirica.ly/"><strong>Empirica</strong></a> is an open source tool, quite similar to nodeGame as both are developed in JavaScript, and lets you scale your experiments to thousands of users interacting simultaneously. Their mission is to help researchers easily iterate on sophisticated experimental designs.</p>
<p><a href="https://volunteerscience.com/"><strong>Volunteer Science</strong></a> is an online experiments tool that enables researchers to run live and longitudinal experiments, with simultaneous or asynchronous participants. It’s excellent to use in the lab, from home or in the classroom. Read Jason’s recommendations on <a href="https://ocean.sagepub.com/blog/collecting-social-media-data-for-research">how to recruit your participants</a> in a pandemic.</p>
<p><a href="http://breadboard.yale.edu/"><strong>Breadboard</strong></a> is a software platform developed by a group of researchers at Yale that supports researchers in designing and running experiments with participants on networks. The platform also supports the recruitment.</p>
<h2 id="bonus-tools-for-computational-experiments">Bonus: Tools for computational experiments</h2>
<p><a href="http://www.wings-workflows.org/"><strong>Wings</strong></a> is a semantic workflow system that assists scientists with the design of computational experiments.</p>
<p><a href="https://codalab.org/"><strong>CodaLab</strong></a> is an ecosystem for conducting computational research in a more efficient, reproducible, and collaborative manner. There are two aspects of CodaLab: worksheets and competitions. Worksheets let you design and run your machine learning experiments in the cloud; you can then follow up by publishing the experiment as a markdown file or executable paper. The competitions section is for hosting and participating in competitions in various areas that require computational tools and experiments.</p>
<h1 id="unbundling-teaching">Unbundling the remote quantitative methods academic: Coolest tools to support your teaching</h1>
<blockquote>
<p>This post originally appeared on <a href="https://ocean.sagepub.com/blog/">SAGE Ocean</a></p>
</blockquote>
<p>This year’s lockdown challenged the absolute core of higher education and accelerated, or <a href="https://er.educause.edu/articles/2020/3/the-difference-between-emergency-remote-teaching-and-online-learning">rather imposed, the adoption of digital tooling</a> to fully replace the interactivity of the physical classroom. And while other industries suffered losses, the edtech space flourished, with <a href="https://news.crunchbase.com/news/back-to-school-edtech-vc-funding-reaches-4-1b-so-far-this-year/">funding for edtech almost doubling in the first half of 2020 versus 2019</a>. Even before the pandemic, lecturers were starting to feel overwhelmed by the amount of choice available to support their teaching. More funding just meant more hype and more tools working on similar or slightly improved solutions, making it even harder and more time-consuming to find and adapt them in a rush.</p>
<p>Below, I take a look at several tools and startups that are already supporting many of you in teaching quantitative research methods, and at some cool new tools you could use to enhance your classroom.</p>
<h2 id="adding-interactivity">Adding interactivity</h2>
<p>A challenge when teaching remotely is that you can’t ask the class to raise their hands or shout out words and phrases. Perhaps you use a whiteboard to explain concepts, or want to involve the class in generating ideas. All of these are possible with the tools in this category while you sit comfortably in your living room, although you still might not be able to gauge the mood or see the sparkle in your students’ eyes, as you can when teaching in an auditorium.</p>
<p><a href="https://www.mentimeter.com/">Mentimeter</a> offers a simple way to run polls and ask multiple-choice or word-entry questions; you can show the results live.</p>
<p><a href="https://ideaboardz.com/">IdeaBoardz</a> and <a href="https://en.linoit.com/">Lino</a> are both web apps for ideation and for collaboratively writing sticky notes, good for the more hands-on activities.</p>
<p><a href="https://canvas.apps.chrome/">Canvas from Chrome</a>, <a href="https://jamboard.google.com/">Jamboard from Google</a>, and <a href="https://awwapp.com/">Aww</a> are web-based whiteboards; all three are quite basic and collaborative, with some differing functionality, like slides and sticky notes on Jamboard.</p>
<h2 id="tools-to-create-better-video-and-other-types-of-content">Tools to create better video and other types of content</h2>
<p>You are probably using Zoom, Google Meet, MS Teams, or another conferencing tool to deliver or record your lectures. They all have different features and, bar some security issues, Zoom has definitely been the favorite. If you are looking for something to #FixTheInternet, check out <a href="https://us.meething.space/">Meething</a> from the Mozilla Builders Incubator.</p>
<p>With <a href="https://go.playposit.com/">PlayPosit</a> and <a href="https://edpuzzle.com/">EdPuzzle</a>, you can add quizzes and notes inside your video recording for asynchronous learning.</p>
<p>The coolest app to come out this year, from the co-founders of <a href="https://www.coursera.org/">Coursera</a>, is a complete rebuild of the live, remote video lecture called <a href="https://www.engageli.com/">engageli</a>. The app comes with engagement stats and other embedded functionality that helps you monitor your entire session when body language and drowsy eyes aren’t there to gauge.</p>
<p>If you are thinking about accessibility and different modes of learning, there is something to help here as well. <a href="https://www.avid.fm/">Avid.fm</a> and <a href="https://www.alpeaudio.com/">Alpe</a> are already working with several academics to develop audio courseware solutions. And if you are feeling inspired, try <a href="https://gosynth.com/">GoSynth</a> to add four-minute audio snippet explainers to your course materials.</p>
<p>For those who had some VR experience before the lockdowns and have a relatively small class of students who own headsets, do explore <a href="https://www.wondavr.com/">WondaVR</a>; you can easily use it to create some very exciting, alternative content.</p>
<h2 id="enhancing-the-learning-experience-for-your-students">Enhancing the learning experience for your students</h2>
<p>Now, I know what you might say: updating or changing the LMS you are using is a gargantuan task and not within your remit. But if you do have the time AND the idea of enhancing your students’ learning experience through classroom discussions (now online) sounds like your cup of tea, then have a look at these amazing tools.</p>
<p><a href="https://parlayideas.com/">Parlay</a> and <a href="https://www.packback.co/">PackBack</a> are exclusively solving the challenge of classroom discussions gone digital. <a href="https://aula.education/">Aula</a> and <a href="https://www.eduflow.com/">eduflow</a> add that on top of their slick LMS functionality. <a href="https://app.peerscholar.com/">PeerScholar</a> goes one step further and includes peer-to-peer reviews, so classroom assignments can be iterated on before they go to you, the instructor, for grading. <a href="https://www.kritik.io/">Kritik</a> adds team-based learning capabilities.</p>
<h2 id="tools-for-running-labs">Tools for running labs</h2>
<p>While the other categories of tooling I’ve discussed so far are useful across many disciplines, teaching labs, that is, teaching students how to use statistical tools and programming, is particular to the quantitative social sciences and is the most challenging aspect to carry out online. When physically in the same room, you walk through the rows of students and can quickly spot if any of them are stuck but aren’t saying anything. At that point, you or your teaching assistants jump in and help without disturbing the rest of the class.</p>
<p>I wish I could share some tools that can help with that challenge, but alas, I’ve found nothing yet.</p>
<p>There are, however, some ways to make the lab experience a bit easier for both you and your students. Leaving the practical tips for another blog, right now you could try <a href="https://observablehq.com/">Observable</a>, a web-based collaborative notebook for building dashboards and running ad-hoc, visual data explorations. If that’s intimidating and you mostly use R and Excel spreadsheets with modest datasets, check out <a href="https://www.jamovi.org/">jamovi</a>, an open-source, friendlier way to do stats.</p>
<p>For the student who is always in search of the newest tools, the edge cases, or one more trending language to learn, point them towards <a href="https://juliahub.com/docs/Pluto/OJqMt/0.7.4/">Pluto.jl</a>, a notebook for Julia.</p>
<p>Finally, when teaching how to set up and run online experiments, <a href="https://volunteerscience.com/">Volunteer Science</a> and <a href="https://lab.js.org/">lab.js</a> are among the tools that turn that ordeal into quite a smooth experience. If you want to delve further into this space, <a href="https://ocean.sagepub.com/blog/tools-and-tech/moving-your-behavioral-research-online">we’ve written more about tools for online experiments earlier in the year</a>.</p>
<h2 id="more-resources">More resources</h2>
<p><a href="https://www.panopen.com/">PanOpen</a> is a courseware solution for open education resources, and you can also find relevant materials on <a href="https://www.oercommons.org/">OER Commons</a>. If you are not up for searching or adopting a new tool and are just looking for materials from other quantitative or computational social science courses, I’ve put together a <a href="https://danielagduca.github.io/teaching_materials/">list</a> (that you can download as a csv) of links to videos, slides, websites, github repos, and blogs.</p>
<p>On <a href="https://forrt.org/">FORRT</a> you can find resources and pedagogies to help you integrate the concepts of open science and reproducibility in your courses.</p>
<p><a href="https://ropensci.org/">rOpenSci</a> is a great place to find the most relevant and carefully vetted R packages for research. And if you are feeling super adventurous, have a look <a href="https://ropensci.org/">through the 543 software tools and packages</a> that we’re tracking for any kind of analysis and data collection you can think of in the social sciences.</p>
<p>In case you are still feeling like your remote and online teaching could use a boost, there are two resources that I would absolutely recommend: the online courses from <a href="https://www.metadocencia.org/en/">MetaDocencia</a>, and the <a href="https://us.corwin.com/en-us/nam/the-distance-learning-playbook-for-college-and-university-instruction/book276590?_ga=2.108943170.1934251811.1609870984-926146671.1607529470">Distance Learning Playbook for College and University Instruction</a>. Tweet me about the tools that have helped support your teaching this challenging year.</p>
<hr />
<p>Originally published on <a href="https://ocean.sagepub.com/blog">SAGE Ocean</a>.</p>
<h1 id="data-viz-covid">Turning COVID-19 into a data visualization exercise for your students</h1>
<blockquote>
<p>This post originally appeared on <a href="https://ocean.sagepub.com/blog/tools-and-tech/turning-covid-19-into-a-data-visualization-exercise-for-your-students">SAGE Ocean</a></p>
</blockquote>
<p>We will emerge from this pandemic with a better understanding of the world and an improved ability to teach others about it. For now, we need to be continuously analyzing the data and thinking about the lessons we can learn and apply. Here’s how you can join in!</p>
<p>At SAGE, we have been working with academics on improving and sharing teaching resources, especially for quantitative and computational methods in the social sciences. Besides the <a href="https://er.educause.edu/articles/2020/3/the-difference-between-emergency-remote-teaching-and-online-learning">mass remote and emergency teaching experiment happening right now</a>, one of the positives we can already identify and reuse to improve learning in methods courses is the glut of data visualizations. The great advantage here is that all these visualizations are produced (almost always) from the same raw input, yet tell a variety of different stories. What better way to explain to students the different uses and impact of visualizations, and the use of different tools, than examples based on the same data?</p>
<p>For this blog, we thought we would make a start by collating the variety of plots and multi-panels, grouped by the tools and skills required to create them. We’ve also included further resources for the types of visuals we discuss, or introductory materials for the tools used to create them. We hope these will be useful for teachers and students who want to learn more or use different visualization examples in their methods courses.</p>
<h2 id="1-mapping-the-raw-numbers-to-follow-live-data">1. Mapping the raw numbers to follow live data</h2>
<figure class="">
<img src="/docs/posts_images/johns-hopkins-dashboard.png" alt="Johns Hopkins Coronavirus Resource Dashboard screenshot taken on 4/1/2020" /><figcaption>
Johns Hopkins Coronavirus Resource Dashboard screenshot taken on 4/1/2020
</figcaption></figure>
<p>While it’s an impressive effort to pull together live data from various sources, and the dashboard makes it almost effortless to follow the spread of the virus based on the reported numbers of infected people across the world, it is only that. It is not easy to draw many conclusions from these types of dashboards, and the red bubbles across the world can be visually misleading: in more densely populated areas, larger absolute numbers suggest wider spread when that may be inaccurate. This is pretty much like harvesting the wheat and selling it by the ton. You’ve got the wheat grains out of the field and into the barn, which you know is useful, but there isn’t much you can do with it if you don’t have a mill and some knowledge of making flour, and potentially yeast, for something more easily consumable, like bread.</p>
<p><strong>Pros</strong>: Interactive, can be live, multi-panel, high-level view of the raw figures.</p>
<p><strong>Cons</strong>: Raw counts are more useful when scaled by location or, in this case, population; requires standard reporting across all geo locations, otherwise missing data is hard to visualize.</p>
<p><strong>Live map</strong>: <a href="https://coronavirus.jhu.edu/map.html">here</a>.</p>
<p><strong>Data</strong> available in this <a href="https://github.com/CSSEGISandData/COVID-19">GitHub repository</a>.</p>
<p><strong>Cite as</strong>: Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real-time. Lancet Infect Dis; published online Feb 19. https://doi.org/10.1016/S1473-3099(20)30120-1.</p>
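<p>If you’d rather play with the underlying numbers than the dashboard, a few lines of pandas will get you started. Here’s a minimal sketch, assuming the repository still publishes the global confirmed-cases time series CSV at the path below (the layout has changed before, so check the repository first):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal sketch: plot cumulative confirmed cases for a few countries
# from the Johns Hopkins CSSE time series. The file path is an
# assumption; check the GitHub repository if it has moved.
import pandas as pd
import matplotlib.pyplot as plt

URL = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
       "csse_covid_19_data/csse_covid_19_time_series/"
       "time_series_covid19_confirmed_global.csv")

df = pd.read_csv(URL)

# One row per country: sum the province-level rows, keep the date columns
cases = df.groupby("Country/Region").sum(numeric_only=True)
cases = cases.drop(columns=["Lat", "Long"], errors="ignore")
cases.columns = pd.to_datetime(cases.columns)

for country in ["Italy", "Spain", "United Kingdom"]:
    cases.loc[country].plot(label=country)

plt.ylabel("Cumulative confirmed cases")
plt.legend()
plt.show()
</code></pre></div></div>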
<h4 id="resources">Resources:</h4>
<ul>
<li>
<p>Going one step further and making your dashboard a bit more useful from <a href="https://informationisbeautiful.net/visualizations/covid-19-coronavirus-infographic-datapack/">information is beautiful</a>, and with many more details from <a href="https://ourworldindata.org/coronavirus">Max Roser and team</a>, as we are learning the best ways to convey live data to an increasingly worried world.</p>
</li>
<li>
<p>Noting that these numbers are based on positive tests, and different countries ramped up or de-escalated testing differently, this <a href="https://fivethirtyeight.com/features/coronavirus-case-counts-are-meaningless/">FiveThirtyEight article</a> estimates various scenarios.</p>
</li>
<li>
<p>Similar dashboards can be created with <a href="https://public.tableau.com/profile/covid.19.data.resource.hub#!/vizhome/COVID-19Cases_15840488375320/COVID-19Cases">Tableau Public</a>.</p>
</li>
<li>
<p>Other dashboards and panels that you could easily create without advanced coding skills with <a href="https://blog.datawrapper.de/coronaviruscharts/">Datawrapper</a> (including the famous cumulative cases by country from first known patients).</p>
</li>
<li>
<p><a href="https://learn.arcgis.com/en/">Learn ArcGIS</a> and notes on <a href="https://www.esri.com/arcgis-blog/products/product/mapping/mapping-coronavirus-responsibly/">mapping covid-19 with ArcGIS</a>.</p>
</li>
</ul>
<h2 id="2-using-r-and-shiny-to-interact-with-the-visualizations">2. Using R and Shiny to interact with the visualizations</h2>
<figure class="">
<img src="/docs/posts_images/r-shiny-covid-1.png" alt="Credit Joachim Gassen - https://joachim-gassen.github.io/tidycovid19/" /><figcaption>
Credit Joachim Gassen - https://joachim-gassen.github.io/tidycovid19/
</figcaption></figure>
<figure class="">
<img src="/docs/posts_images/r-shiny-covid-2.png" alt="Credit Tinu Schneider, 2020. Code is on Github. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License" /><figcaption>
Credit Tinu Schneider, 2020. Code is on Github. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
</figcaption></figure>
<p>The beauty of using R and being really proficient with it is that you can quickly put together an interactive web interface for others to play with. This can be done with the open-source R package — <a href="https://rstudio.com/products/shiny/">Shiny</a>. For example, the most widely shared graphs in the news have been the ones around flattening the curve and aligning the trajectories of the virus spreading by country. But as you may know, a visualization can be sensationalized depending on the defaults that are set. With this Shiny app from <a href="https://joachim-gassen.github.io/">Joachim Gassen</a>, you can move the dials and choose the variables to be displayed. Similarly, with this other <a href="https://tinu.shinyapps.io/Flatten_the_Curve/">Shiny app</a> from <a href="https://github.com/tinu-schneider">Tinu Schneider</a>, you can adjust the defaults and see how the curve could flatten.</p>
<p>Going one step further, you can also use Shiny to create interactive simulations, like <a href="https://alhill.shinyapps.io/COVID19seir/">this one</a> from <a href="https://twitter.com/alison_l_hill">Alison Hill</a>, a research fellow at Harvard, looking at the spread and the healthcare capacity. She also included a useful tutorial.</p>
<figure class="">
<img src="/docs/posts_images/r-shiny-covid-2.png" alt="Credit Alison Hill. Simulation shows modelling COVID-19 spread vs healthcare capacity" /><figcaption>
Credit Alison Hill. Simulation shows modelling COVID-19 spread vs healthcare capacity
</figcaption></figure>
<p><strong>Pros</strong>: Open-source, can be replicated, interactive, can adjust the defaults</p>
<p><strong>Cons</strong>: Requires some coding experience in R</p>
<p><strong>Data</strong> and associated code available <a href="https://joachim-gassen.github.io/tidycovid19/">here</a> for the trajectories by country, <a href="https://github.com/tinu-schneider/Flatten_the_Curve">here</a> for flattening the curve, and <a href="https://github.com/alsnhll/SEIR_COVID19">here</a> for the hospital capacity simulations.</p>
<h4 id="resources-1">Resources:</h4>
<ul>
<li>
<p>RStudio blog on using R, tidyverse packages and RECON libraries <a href="https://rviews.rstudio.com/2020/03/05/covid-19-epidemiology-with-r/">to compile and visualize covid19 data</a>.</p>
</li>
<li>
<p>Maja Zaloznik’s Intro to R. Access the <a href="https://majazaloznik.github.io/2019-03-sage/presentations/2019-03-12-Intro-toR-webinar.html#1">slides</a> or watch the <a href="https://www.youtube.com/watch?v=al0rqd7jT3U&feature=youtu.be">webinar</a>.</p>
</li>
<li>
<p>The Carpentries’ <a href="https://datacarpentry.org/r-socialsci/">R for Social Scientists</a> training, which includes data viz with ggplot.</p>
</li>
<li>
<p>Cool <a href="https://youtu.be/h29g21z0a68">intro workshop to using ggplot in R</a>.</p>
</li>
<li>
<p>A train-the-trainer session on how to teach <a href="https://github.com/rstudio-education/teach-shiny">Shiny from RStudio</a>.</p>
</li>
<li>
<p>More examples of visualizations in <a href="https://shiny.rstudio.com/gallery/">R from RStudio</a> and <a href="https://www.r-graph-gallery.com/">Rittman Mead</a>, and a <a href="https://rstudio.cloud/learn/primers/3">primer</a> on data visualization also from RStudio.</p>
</li>
<li>
<p>Improve your data visualizations with more recipes and loads of examples from this <a href="https://r-graphics.org/">R Graphics Cookbook</a>.</p>
</li>
<li>
<p>Full course materials using R: <a href="https://jjmedinaariza.github.io/modelling_book/">Modeling Criminological Data</a> from the University of Manchester or this workshop on <a href="https://rcatlord.github.io/GSinR/">Getting Started with R</a> from Réka Solymosi, Henry Partridge and Sam Langton.</p>
</li>
</ul>
<h2 id="3-using-python-with-matplotlib-to-visualize-tweets">3. Using python with matplotlib to visualize tweets</h2>
<p>Yes, these graphs require much more work and a team of about nine researchers to collect the data, conduct analyses and visualize it properly. The Computational Story Lab at the University of Vermont collected tweets in more than 20 languages related to COVID-19 and used a variety of tools to get to these visualizations: unix, matplotlib, mongodb, gitlab, and ‘an exceedingly small batch of artisanal matlab by the artist Peter Sheridan Dodds’, @peterdodds (according to Chris Danforth). The easy-to-digest summary is <a href="https://twitter.com/compstorylab/status/1243659358107467782">here</a>, and the <a href="http://compstorylab.org/covid19ngrams/">full paper with code</a> is on gitlab.</p>
<p><strong>Pros</strong>: Can do advanced and multi-panel visualizations.</p>
<p><strong>Cons</strong>: You need the skills to use these tools.</p>
<p><strong>Data</strong> available in <a href="https://gitlab.com/compstorylab/covid19ngrams/">gitlab</a>.</p>
<p><strong>Cite as</strong>: Alshaabi, T., Minot, J.R., Arnold, M.V., Adams, J.L., Dewhurst, D.R., Reagan, A.J., Muhamad, R., Danforth, C.M., & Dodds, P.S. (2020). How the world’s collective attention is being paid to a pandemic: COVID-19 related 1-gram time series for 24 languages on Twitter. https://arxiv.org/abs/2003.12614.</p>
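<p>You won’t reproduce their analysis from a blog post, but to give a flavour of the matplotlib side, here is a minimal multi-panel sketch in the same spirit, using synthetic daily counts rather than the real 1-gram time series:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal multi-panel sketch in the spirit of the Story Lab plots,
# using synthetic daily mention counts for a few terms (not real data).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
days = np.arange(90)
terms = ["virus", "lockdown", "vaccine"]

fig, axes = plt.subplots(len(terms), 1, sharex=True, figsize=(6, 6))
for ax, term, peak in zip(axes, terms, (20, 45, 75)):
    # fake a rise-and-fall in daily mentions around a different peak day
    counts = rng.poisson(lam=50 * np.exp(-((days - peak) ** 2) / 200) + 5)
    ax.plot(days, counts)
    ax.set_ylabel(term)

axes[-1].set_xlabel("day")
fig.suptitle("Daily mentions per term (synthetic data)")
plt.show()
</code></pre></div></div>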
<h4 id="resources-2">Resources:</h4>
<ul>
<li>
<p>How to use <a href="https://towardsdatascience.com/gather-all-the-coronavirus-data-with-python-19aa22167dea">python to gather COVID-19 data</a>.</p>
</li>
<li>
<p>A basic <a href="https://github.com/DavidBeavan/coronavirus_covid-19/blob/master/coronavirus_covid-19_england_map.ipynb">mapping of #covid19 cases in the UK with python</a>.</p>
</li>
<li>
<p>SAGE Campus: <a href="https://campus.sagepub.com/introduction-to-python-for-social-scientists?_ga=2.177553061.169058779.1610221959-926146671.1607529470">Introduction to Python for social scientists</a>.</p>
</li>
<li>
<p>Phillip Brooker’s <a href="https://uk.sagepub.com/en-gb/eur/programming-with-python-for-social-scientists/book259581?_ga=2.188207656.169058779.1610221959-926146671.1607529470">Programming with Python for Social Scientists</a>.</p>
</li>
<li>
<p>All <a href="https://python-graph-gallery.com/">types of charts with Python</a> explained.</p>
</li>
<li>
<p><a href="https://www.dataschool.io/python-pandas-tips-and-tricks/">100 tips and tricks for working with pandas</a> from Data School.</p>
</li>
<li>
<p><a href="https://towardsdatascience.com/simple-ways-to-improve-your-matplotlib-b64eebccfd5">Simple ways to improve your matplotlib</a> from Kimberly Fessel.</p>
</li>
<li>
<p><a href="https://jakevdp.github.io/PythonDataScienceHandbook/04.00-introduction-to-matplotlib.html">Visualizing with matplotlib</a> excerpt and code from O’Reilly Python Data Science Handbook by Jake VanderPlas.</p>
</li>
</ul>
<h2 id="4-visualizing-predictions-and-simulations-advanced">4. Visualizing predictions and simulations (advanced)</h2>
<p>This requires a whole other post, but I wanted to mention a few examples we came across that caught our attention and that we thought could be a good way to entice anyone to learn advanced simulation methods (a minimal sketch of the kind of model behind these follows the list):</p>
<ul>
<li>
<p>Going deep with the <a href="https://advances.sciencemag.org/content/6/5/eaav6971">Bayesian scientist extracting patterns from real and synthetic models</a>.</p>
</li>
<li>
<p>On flattening the curve, this <a href="http://gabgoh.github.io/COVID/index.html">interactive plot</a> uses RK4 for the numerical integration and is built with Svelte.</p>
</li>
</ul>
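<p>To make the modelling a little less abstract, here is a minimal SEIR sketch in Python with scipy (my own toy model, not the one behind any of the examples above), comparing the infection curve under two contact rates:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy SEIR model: compare the infectious curve under two contact rates
# to illustrate 'flattening the curve'. Parameter values are illustrative.
import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt

def seir(y, t, beta, sigma=1 / 5.2, gamma=1 / 10):
    # beta: contact rate; sigma: 1/incubation period; gamma: 1/infectious period
    s, e, i, r = y
    return [-beta * s * i, beta * s * i - sigma * e, sigma * e - gamma * i, gamma * i]

t = np.linspace(0, 300, 1000)
y0 = [0.999, 0.001, 0.0, 0.0]  # initial S, E, I, R as population fractions

for beta, label in [(0.5, "no distancing"), (0.25, "distancing")]:
    s, e, i, r = odeint(seir, y0, t, args=(beta,)).T
    plt.plot(t, i, label=label)

plt.xlabel("days")
plt.ylabel("infectious fraction of the population")
plt.legend()
plt.show()
</code></pre></div></div>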
<h2 id="more-resources">More resources:</h2>
<p>Before you build another graph, especially for an ongoing event, where communication of risk and uncertainty is critical to saving lives, we definitely recommend considering some data visualization basics, for example:</p>
<ul>
<li>
<p>These <a href="https://medium.com/nightingale/ten-considerations-before-you-create-another-chart-about-covid-19-27d3bd691be8">tips</a> from Amanda Makulec.</p>
</li>
<li>
<p>Andy Kirk’s <a href="https://www.methodspace.com/data-visualization-series-with-andy-kirk/">recommendations for better data visualizations</a>, or his brilliant chartmaker for picking the right tool.</p>
</li>
<li>
<p><a href="https://builtin.com/data-science/data-visualization-lessons-pandemic">Visualizations can illuminate, but they can also be misleading</a>. A discussion on visualizing uncertainty across all COVID-19 charts.</p>
</li>
</ul>
<p>A final point on data visualizations from our friends at <a href="https://www.addtwodigital.com/">ADDTWO</a>: although it seems trivial, the main thing students struggle with is the last step - making their graph look ‘polished’! Why is this so hard? Accurate representations are critical, and many students can pick the right chart type, but data visualizations are stories: the design, the colors and the sizes on a graph matter just as much. Sometimes, the tools you use are either limited or not as user friendly on the design-and-polish steps. The team at ADDTWO recommends exporting (whenever possible) your visuals as .svg and further sharpening the design in a vector editor (Adobe Illustrator, Figma and others).</p>
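<p>If your tool has a Python interface, that export step is a one-liner; a minimal matplotlib example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Save a figure as SVG so it can be polished further in a vector
# editor such as Illustrator or Figma.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(["A", "B", "C"], [3, 7, 5])
fig.savefig("chart.svg", format="svg", bbox_inches="tight")
</code></pre></div></div>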
<p>Which other data visualization examples will you be using for your next workshop or module? What are your top tips?</p>Daniela DucaThis post originally appeared on SAGE OceanFrom preprocessing to text analysis: 80 tools for mining unstructured data2020-01-20T00:00:00+00:002020-01-20T00:00:00+00:00https://danielagduca.github.io/tools%20&%20technology/text-mining-tools<p>Text mining techniques have become critical for social scientists working with large scale social data, be it Twitter collections to track polarization, party documents to understand opinions and ideology, or news corpora to study the spread of misinformation.</p>
<h2 id="overview">Overview</h2>
<p>Researchers and developers within the public and private sectors have been making strides in this space, and especially so in the past year. The improvements to the <a href="https://www.quantamagazine.org/machines-beat-humans-on-a-reading-test-but-do-they-understand-20191017">representation of text with models like BERT from Google and OpenAI’s GPT have been the talk of the town</a> in computational linguistics since <a href="https://openreview.net/pdf?id=rJ4km2R5t7">they beat a benchmark for natural language understanding faster than predicted</a>.</p>
<p>Meanwhile, political scientists like <a href="https://www.semanticscholar.org/paper/Discovery-of-Treatments-from-Text-Corpora-Fong-Grimmer/3cbc814f6e42ac6c6cc2700f26ce3d7354a00150">Justin Grimmer and colleagues</a> are combining experimental methods with computational text analysis to infer the features or pieces of text that are most likely to affect our voting behaviors. Others are using platforms like IRaMuTeQ and Hyperbase that require no coding skills to run large-scale text analysis projects.</p>
<p><img src="https://sagepublishing.github.io/sage_tools_social_science/docs/images/text-mining2.png" alt="Text mining techniques" /></p>
<p>In the infographic below, we identify more than 80 different apps, software packages, and libraries for R, Python and MATLAB that are used by social science researchers at different stages in their text analysis project. We focused almost entirely on statistical, quantitative and computational analysis of text, although some of these tools could be used to explore texts for qualitative purposes.</p>
<p><img src="https://sagepublishing.github.io/sage_tools_social_science/docs/images/text-mining1.png" alt="Text mining infographic" /></p>
<p><em>An infographic of text mining tools in the social sciences.</em></p>
<p><img src="https://sagepublishing.github.io/sage_tools_social_science/data/images/text_mining_techniques.png" alt="Text mining infographic, techniques" /></p>
<p><em>Word cloud for most common techniques.</em></p>
<h2 id="key-takeaways">Key takeaways</h2>
<h4 id="most-tools-are-free-but-high-performance-tools-require-coding-skills">Most tools are free, but high-performance tools require coding skills.</h4>
<p>More than 70% (92 out of 130) of the tools we’ve identified for text cleaning, preprocessing, enriching, and all kinds of analysis are free to use, and a handful provide free trial periods. The free and/or open-source libraries and packages such as scikit-learn, spaCy, gensim, quanteda and NLTK are high performance, i.e. their outputs are as good as, if not better than, those of some of the paid-for options and the open-source no-code options. In other words, the more you want to get out of your corpus, the more comfortable you need to be with R or Python in order to find and use these packages, especially if you want to apply transformers and language representation models to your dataset.</p>
<p><img src="https://sagepublishing.github.io/sage_tools_social_science/data/images/text_mining_per_year_and_charge.png" alt="Text mining tools over time" /></p>
<p><em>Graph showing tools and packages for text mining by charge 1962-2019</em></p>
<p><img src="https://sagepublishing.github.io/sage_tools_social_science/data/images/text_mining_per_year_and_diversity.png" alt="Text mining tools over time and gender diversity" /></p>
<p><em>Graph showing text mining tools and packages launched per year 1962-2019, including those with women in leadership teams, where data is available.</em></p>
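<p>To give a taste of what ‘free and high performance, but needs code’ means in practice, here is a minimal scikit-learn sketch that trains a serviceable baseline text classifier on the 20 Newsgroups dataset in about a dozen lines:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal baseline text classifier with scikit-learn: TF-IDF features
# plus logistic regression on the 20 Newsgroups dataset.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

clf = make_pipeline(
    TfidfVectorizer(stop_words="english", max_features=20000),
    LogisticRegression(max_iter=1000),
)
clf.fit(train.data, train.target)
print(accuracy_score(test.target, clf.predict(test.data)))
</code></pre></div></div>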
<h4 id="most-tools-that-do-not-require-coding-skills-are-relatively-old">Most tools that do not require coding skills are relatively old.</h4>
<p>A suite of free and some paid applications are available for researchers who don’t code, such as Voyant, Lexi&co, IRaMuTeQ, Hyperbase, Mallet, and Orange Text and Data Mining. Besides Voyant, which was launched seven years ago, the other software was developed from the 1990s to the early 2000s, when coding was not as widespread a skill as it is today. However, some of the statistical analysis these tools offer is remarkable.</p>
<h4 id="there-is-an-increasing-number-of-apps-with-great-user-interfaces-and-some-of-them-free-which-enable-you-to-enrich-your-corpus">There is an increasing number of apps with great user interfaces, and some of them free, which enable you to enrich your corpus.</h4>
<p>An important step, primarily for those who build their own packages and analysis tools, is to enrich the corpus. The most common task is part-of-speech tagging. We noted that researchers increasingly need to annotate samples of their corpus in order to train a topic modeling or classification algorithm. This, combined with the booming chatbot market and the need of large businesses to sort through their documents, is driving the development of paid web apps and open-source packages for text labeling. More than 10 tools were launched in the past three years alone: Explosion AI, the developer behind spaCy, launched prodi.gy; Amazon released SageMaker Ground Truth to integrate with Mechanical Turk and other human-in-the-loop services like iMerit. The most active one is probably doccano; it’s free to use and in just one year it grew to 24 contributors. We’ve invested in TagWorks, which integrates with Mechanical Turk and offers a more hierarchical annotation schema.</p>
<h4 id="the-most-time-consuming-bit-is-cleaning-and-preprocessing">The most time-consuming bit is cleaning and preprocessing.</h4>
<p>Whilst you can preprocess and reduce your text with a few of these tools (for example Orange, IRaMuTeQ, Hyperbase, scikit-learn, MathWorks Text Analytics Toolbox, NLTK, quanteda), you still need to format and clean your corpus before you load it in. We’ve heard from many researchers that their biggest pain point and frustration is cleaning and doing some of the pre-processing. Main reasons being:</p>
<ul>
<li>
<p>it takes much longer than expected, at least three times what they spend on the fun part (analysis!);</p>
</li>
<li>
<p>they prefer not to teach cleaning and preprocessing and leave that for workshops and working groups; and</p>
</li>
<li>
<p>they almost never go back to this part of the process after they’ve run their analysis, although they acknowledge that testing how sensitive their analysis is to the preprocessing decisions they’ve taken could add an extra layer of confidence in their outputs.</p>
</li>
</ul>
<p>There are just about a handful of tools to help with converting file formats: PDFMiner is a Python parser and analyzer for PDF documents and can convert them into HTML, but the most common is AntFileConverter from Laurence Anthony, which converts PDFs and DOCs into plain text. textclean is a neat, open-source collection of tools for cleaning and normalizing text documents in R. If you are working with existing text datasets from the web, like the 20 Newsgroups or the Penn Treebank, you still need to do some work before you fit them to your analysis algorithms, and there is a package in Python that can simplify this step.</p>
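<p>As a small illustration of this step, here is a minimal Python sketch that pulls plain text out of a PDF with pdfminer.six and applies some light normalization. The cleaning rules here are placeholders; every corpus needs its own decisions:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal sketch of the unglamorous step: PDF to plain text, then some
# light normalization. The cleaning rules are illustrative only.
import re
from pdfminer.high_level import extract_text  # pip install pdfminer.six

raw = extract_text("paper.pdf")  # path to your own PDF

text = raw.lower()
text = re.sub(r"-\n", "", text)    # rejoin words hyphenated at line breaks
text = re.sub(r"\s+", " ", text)   # collapse whitespace and newlines
text = re.sub(r"[^a-z0-9 .,;:?!'-]", "", text)  # drop stray symbols

with open("paper.txt", "w", encoding="utf-8") as f:
    f.write(text)
</code></pre></div></div>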
<h4 id="once-you-master-some-of-these-tools-they-will-save-you-time">Once you master some of these tools, they will save you time.</h4>
<p>One thing is certain: there are plenty of software applications, libraries and packages that can help support your large scale text analysis project. You can start with the easier-to-use ones like Orange and move on to applying argument analysis algorithms and language models to your growing corpus. We’ve got <a href="https://campus.sagepub.com/introduction-to-text-mining-for-social-scientists">a course to get you started</a>.</p>
<h2 id="annotated-text-corpora">Annotated text corpora</h2>
<p>When working with text mining tools or learning how to use them, the biggest problem is finding a ready-to-use corpus. In many instances, you’d need a readily labeled one to test, especially if you don’t have the time to do the annotations yourself or the money to crowdsource the task before you work on your actual corpus. Here are 10 sources of (publicly available and free) labeled text corpora to get you started:</p>
<ul>
<li><a href="http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html">Reuters newswire in 1987</a> indexed by category, aka Reuters-21578, contains 21,578 news articles, though only about 12 thousand are manually indexed across 135 categories; best for training classification algorithms.</li>
<li><a href="http://qwone.com/~jason/20Newsgroups/">The 20 Newsgroups dataset</a> contains close to 20 thousand documents categorized across 20 groups; best for training on classification and clustering.</li>
<li><a href="http://mpqa.cs.pitt.edu/corpora/mpqa_corpus">MPQA Opinion Corpus</a> contains under one thousand news articles and other documents that are annotated manually for opinions, beliefs, emotions, speculations</li>
<li><a href="https://tabfact.github.io/">This</a> corpus contains about 16 thousand annotated wikipedia tables to study fact verification.</li>
<li><a href="https://nlp.stanford.edu/sentiment/code.html">Stanford labeled Rotten Tomatoes dataset</a> for sentiment analysis, includes paper and code.</li>
<li><a href="http://ai.stanford.edu/~amaas/data/sentiment/">Stanford 25 thousand labeled and 25 thousand test datasets with IMDB movie reviews</a> for sentiment analysis.</li>
<li>The <a href="http://help.sentiment140.com/for-students">training data for Sentiment140</a> is a collection of just under 200 thousand labeled tweets for sentiment analysis.</li>
<li>An <a href="https://github.com/sarnthil/unify-emotion-datasets">aggregated corpus of more than 10 different sources</a>, including tweets, news articles. Blogs, dialogues,, mapped to a unified tagging schema for emotion classification resulting in more than 20 thousand statements for 6 different emotions.</li>
<li><a href="http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/">SMS Spam Collection</a> contains just over 5 thousand English mobile text messages labelled according to whether they are spam or not.</li>
<li><a href="https://archive.ics.uci.edu/ml/datasets/Paper+Reviews">Dataturks A set of 405 mostly Spanish reviews for academic papers</a> submitted to an international computing conference, with the reviewers’ scores, and another set of scores labeled by readers of the reviews.</li>
</ul>
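<p>For instance, the Reuters-21578 collection at the top of the list is only a few lines away if you go through NLTK, which ships a cleaned subset of it:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Load NLTK's subset of Reuters-21578 and peek at the topic labels.
import nltk

nltk.download("reuters")  # fetches the corpus on first run
from nltk.corpus import reuters

print(len(reuters.fileids()))     # number of documents in NLTK's subset
print(reuters.categories()[:10])  # a few of the topic labels
doc = reuters.fileids()[0]
print(reuters.categories(doc), reuters.raw(doc)[:200])
</code></pre></div></div>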
<p>You can also check the <a href="https://dataturks.com/projects/trending">trending projects on Dataturks</a>, which lists classified and labeled text datasets in multiple languages. Similarly, <a href="https://www.tagtog.net/-datasets">tagtog have a running list</a> of public projects across domains. The <a href="http://www.nactem.ac.uk/resources.php">National Centre for Text Mining in the UK releases corpora for text mining</a> for social sciences but also STEM research, some of which are annotated for sentiment and entities. Many NLP developers also keep track of useful datasets for machine learning, many on GitHub and loads on Kaggle. I recently came across this <a href="https://www.datasetlist.com/">very neat list</a> that includes multiple formats for multiple tasks and information about the license. However, if you are looking for a real challenge, then explore <a href="https://trec.nist.gov/data.html">TREC datasets</a> from the National Institute of Standards and Technology in the US.</p>Daniela DucaText mining techniques have become critical for social scientists working with large scale social data, be it Twitter collections to track polarization, party documents to understand opinions and ideology, or news corpora to study the spread of misinformation.What I learned from mining researcher questions2020-01-17T00:00:00+00:002020-01-17T00:00:00+00:00https://danielagduca.github.io/tools%20&%20technology/mining-questions<p>I decided to get into text mining. Considering that I have basic programming skills, and I want to do some text mining for academic purposes, plus I’ve already looked at more than 80 tools researchers use…</p>
<p>How difficult can it be? I only need to find a question, some text, and pick a good enough tool.</p>
<p>My natural inclination is to do some research on research, and text mining could help me scale this up. I picked the ResearchGate discussion forum for my analysis, and my question: what types of questions do researchers ask most commonly?</p>
<h2 id="my-idea-for-a-method">My idea for a method…</h2>
<p>collect the questions posted to researchgate over a few months, then use Orange Text and Data Mining (no coding skills needed) to cluster the corpus and see which clusters have the most questions.</p>
<h3 id="collecting-and-cleaning-the-data">Collecting and cleaning the data:</h3>
<ol>
<li>I set up a <strong>daily scraper</strong> with import.io, which would crawl my personal ResearchGate account for “Questions we think you can answer”. Now, of course, these are tailored to my ‘skills’, so I would expect questions in the entrepreneurship/intrapreneurship/innovation/business/technology/strategy/research methods space. I am also hoping the list will be biased towards social sciences and tools in the social sciences, as these were the types of questions I’d already answered on the platform.</li>
</ol>
<figure class="">
<img src="/docs/posts_images/research_gate.png" alt="ResearchGate" /><figcaption>
ResearchGate
</figcaption></figure>
<ol>
<li>After 100 days of imports, I aggregated the data to obtain 5324 instances (i.e. rows, each containing a question), which came down to 654 after removing duplicates (a pandas sketch of this step follows below).</li>
</ol>
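<p>As promised, a minimal pandas sketch of the aggregation and deduplication step; the file pattern and the column name are hypothetical:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Stack the daily scrape files and drop repeated questions.
# The file pattern and the 'question' column are hypothetical.
import glob
import pandas as pd

frames = [pd.read_csv(path) for path in glob.glob("scrapes/day_*.csv")]
all_rows = pd.concat(frames, ignore_index=True)       # e.g. 5324 rows
unique = all_rows.drop_duplicates(subset="question")  # e.g. 654 rows
unique.to_csv("questions_deduped.csv", index=False)
</code></pre></div></div>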
<h3 id="the-analysis">The analysis:</h3>
<ol>
<li>I chose to mine my corpus with <a href="https://orange.biolab.si/">Orange</a> in order to make progress quickly. Given my basic coding skills, a no-code app seemed like the perfect option. I uploaded my corpus and applied some preprocessing steps. I knew I had to do this, because academics told me <a href="https://ocean.sagepub.com/blog/tools-and-tech/from-preprocessing-to-text-analysis-80-tools-for-mining-unstructured-data">that’s the most boring bit of the process</a>. Then followed about 3 hours of playing around:</li>
</ol>
<figure class="">
<img src="/docs/posts_images/orange_messy.png" alt="Orange workflows" /><figcaption>
messy, yes!
</figcaption></figure>
<ol>
<li>I finally got somewhere. Best feeling ever. I ignored a few of the fields I had scraped thinking they would be useful, such as the number of reads and the action. That left me with just the title of the discussion or question, a short description (which is not always the full descriptive text, but a good first paragraph at least), and the tags inserted by the user. From the 654 documents (individual, deduplicated questions scraped over 100 days from my ResearchGate feed), Orange processed 6145 words, and since I chose uni- and bigrams, that meant 4781 tokens. I applied <strong>bag-of-words</strong> to convert my corpus into numbers (aka vectors of word counts); a scikit-learn sketch of this step follows below. Other ways to word-embed should also be possible (I noticed an option for adding your own Python script, so word2vec should be easy to do, though I’m unsure about any of the transformers).</li>
</ol>
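<p>For the curious, here is a minimal scikit-learn equivalent of that bag-of-words step, on a toy corpus rather than my actual ResearchGate scrape:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Bag-of-words with unigrams and bigrams, mirroring the Orange step.
from sklearn.feature_extraction.text import CountVectorizer

questions = [
    "What is the best way to measure innovation?",
    "What is the difference between innovation and invention?",
    "How do I analyse survey data on entrepreneurship?",
]

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(questions)  # rows: documents, columns: token counts
print(X.shape)
print(vectorizer.get_feature_names_out()[:10])
</code></pre></div></div>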
<p>Ideally, my next step would be to do some clustering to get a sense of the most common topics.</p>
<p>But first, to give you an idea of the corpus, here’s a word cloud. As I expected, the majority of the questions I get contain words like <em>innovation, business, management, entrepreneurship</em>… Those were my self-attributed skills.</p>
<figure class="">
<img src="/docs/posts_images/orange_cloud.png" alt="Word clouds with Orange Text and Data Mining." /><figcaption>
Word clouds with Orange Text and Data Mining.
</figcaption></figure>
<p>At this point, I realize (again) that my document classification may not produce the useful results I was hoping for, i.e. I will not be able to correctly infer the most common questions. For that I will need a different data set, one that is not tailored to me, but a rather more ‘random’ sample. For now, I have to make do and will go ahead with the analysis, to learn more about Orange and my current corpus.</p>
<ol>
<li>I am finally ready to <strong>cluster</strong> the 654 questions. In order to do that, I use a step that measures the (cosine) <strong>distances</strong> between the documents, or rows in my case. Since I applied the simplest word-to-numbers transformation (bag-of-words) in the previous step, the distances that Orange computes are a reflection of the presence or absence and frequency of all words in each row. Once I have these distances, I can apply <strong>hierarchical clustering</strong>, and I picked a depth of 10 (I don’t have a robust reason for it; open to discussing alternatives). A scipy version of these two steps follows below.</li>
</ol>
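<p>And a minimal scipy version of these two steps, again on a toy corpus:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Cosine distances between bag-of-words rows, then hierarchical
# clustering: a scripted version of the Orange steps described above.
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import CountVectorizer

questions = [
    "What is the best way to measure innovation?",
    "What is the difference between innovation and invention?",
    "How do I analyse survey data on entrepreneurship?",
    "How do I analyse interview transcripts?",
]

X = CountVectorizer(ngram_range=(1, 2)).fit_transform(questions).toarray()
distances = pdist(X, metric="cosine")               # pairwise cosine distances
tree = linkage(distances, method="average")         # hierarchical clustering
labels = fcluster(tree, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
</code></pre></div></div>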
<p>Having reflected a bit more about the simplicity of my analysis so far, I would expect questions on completely different topics to be closer together and clustered within the same group if they <em>look</em> similar. For example, questions that start in the same way and have multiple (not very important) words that repeat, like ‘what is the best way to do ….’ or ‘what is the difference between …’ or ‘how do I..’ will probably be closer, even if they ask about another topic, i.e. the keywords are different.</p>
<p>And yay! This is exactly what happens. At a first glance, scrolling through the clusters I see this:</p>
<figure class="">
<img src="/docs/posts_images/orange_clustering.png" alt="Hierarchical clustering with Orange Text and Data Mining." /><figcaption>
Hierarchical clustering with Orange Text and Data Mining.
</figcaption></figure>
<p>This is still useful though, because I could potentially cluster questions by the type of help the researcher needs, rather than the content or discipline. How would I do that… comments welcome!</p>
<h2 id="final-workflow">Final workflow</h2>
<figure class="">
<img src="/docs/posts_images/orange_final.png" alt="Text mining with Orange." /><figcaption>
Text mining with Orange.
</figcaption></figure>
<p>I was able to identify 68 different clusters with anywhere from 3 to about 50 questions in each; some make more sense than others. I would need to follow up with a more qualitative approach to really understand what’s going on. This short workflow is just the start of my analysis and, as many social science researchers would say, could help sort through the corpus, but is not very likely to provide publishable results.</p>
<h2 id="thoughts-and-questions">Thoughts and questions</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> Take-away 1: Scraping was so unbelievably easy!
</code></pre></div></div>
<p>Scraping has definitely gotten easier, and you don’t need to know how to set up complex crawlers. As long as you know which websites you want to scrape, you can easily set these up with import.io or other services like this. It’s easy because you can visually change the things that you want to collect, and the scraper sets it all up for you in csv or other structured formats. For me, this meant that there was almost no cleaning I needed to do because I selected exactly the bits of the page I wanted scraped, no more no less.</p>
<p>It also helps to be in the UK: the copyright exception for text mining means I can scrape and analyse content for non-commercial purposes.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> Take-away 2: A no-code platform doesn’t mean you don’t need any coding skills. In fact, you cannot do much if you don’t know at least how to construct a the workflow from a developer’s perspective.
</code></pre></div></div>
<p>Orange was quite useful. And whilst I did not spend time learning/improving my Python scripting skills, I still needed to spend time understanding how a script would have been built. The workflow in Orange followed what I believe is a proper coder’s workflow. This is both good and bad. It’s good, because now I only need to learn the actual code/commands, as I already understand the sequence. It’s bad, because it creates a little slump in the process: it takes time to figure out and is not intuitive for the non-coder.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> Take-away 3: It’s so beautiful.
</code></pre></div></div>
<p>For my next mini-text-mining project, I am going to explore the questions and discussions where academics are asking for help in quantitative and computational social science.</p>
<p>What else should I do with this type of corpus? Please add in the comments, and if you used Orange or another tool, I am very curious about your workflows!</p>Daniela DucaI decided to get into text mining. Considering that I have basic programming skills, and I want to do some text mining for academic purposes, plus I’ve already looked at more than 80 tools researchers use…How to find and use academic research if you don’t have access2020-01-02T00:00:00+00:002020-01-02T00:00:00+00:00https://danielagduca.github.io/tools%20&%20technology/find-academic-research<p>Perhaps because I work for an academic publisher, but whenever I meet a founder, or a product manager, they always ask me what’s the latest academic research for their particular problem space and where or how they can find it. I get quite excited because, for me, this is the ultimate evidence of the benefits of open science, or the global push to make all publicly funded academic research freely available to all. But having all this research available does not mean it is easily discoverable and accessible. People, like my founder and product manager friends, have to figure out how to find this research, understand it and make proper use of the results for their own work. And I’ve got good news, there are loads of tools already to help you get closer to some useful insights!</p>
<blockquote>
<p>In this post, I will share my tactics and the apps I use.</p>
</blockquote>
<h2 id="1-to-get-started">1. To get started</h2>
<p>Without a doubt, the first stop is Google and <a href="https://scholar.google.com/">Google Scholar</a>. These are good for getting a quick look into a topic and checking the few somewhat relevant articles on page 1. However, the results never convince me, and I take these with large pinches of salt. There are too many duplicates and too few options to dig much further into the huge list. You can get an idea of the volume of research (although this is almost never useful), and filter by date. Other than that, you can browse the pages and manually collect some of the papers that spark your interest (based on title and citation). I’ve used the metadata on Google Scholar for papers I’ve previously identified as important, but only as my optimistic measure of impact. In my experience, the citation figures on Google Scholar are the most generous.</p>
<p>Some alternatives:</p>
<h3 id="1findr">1findr</h3>
<p>I use a free service from <a href="https://www.1science.com/about-us-2/">1science</a> for really high-level numbers and volumes out of a collection of 120 million articles. A few months ago, I was working on a presentation about social science research, and wanted to come up with some estimates of the number of ongoing projects and active academics. I thought I’d start with the volume of papers published per year and see how that figure changed over time. After a few trials, I found this <a href="https://1findr.1science.com/search?query=domain%3A%28%22Economic%20%26%20Social%20Sciences%22%20OR%20%22Arts%20%26%20Humanities%22%29">free search tool</a> to work best. I could filter down by discipline and immediately see the growth over the last 50 years. To further see how many of these mention the specific tools I was investigating, I just filtered by keyword. Really brilliant!</p>
<h3 id="microsoft-academic">Microsoft Academic</h3>
<p><a href="https://academic.microsoft.com/home">Microsoft Academic</a> has a collection of more than 200 million publications (books, papers and patents) and allows a more complex search in contrast to Google Scholar. Beside the fact that it’s always really slow on my wifi, I do like the way the results are organized. It’s perfect for starting on a new subject. Almost always you get the best results on influential authors, institutions and conferences. My absolute wish list for this results page would be to see a few more numbers. For example, I want to know the estimated volumes: how many papers on this or that parent/child topic, or from this or that academic, or this or that institution. They have graphics but without any numbers and you have to (extra) click into each. Also, I don’t need to see the topics twice, on the left and on the right, but I do want to get an idea of disciplines these papers are coming from.
</p>
<h3 id="semantic-scholar">Semantic Scholar</h3>
<p><a href="http://semanticscholar.com/">Semantic scholar</a> is by far my current favorite and where I spend most of my research time. I have used it prolifically, even though it has yet to grow its coverage in the social sciences and humanities (anecdotal and personal view). Launched by Allen Institute for AI, semantic scholar is constantly machine reading, extracting useful info and mapping over 170 million academic papers into (what I think is) a knowledge graph. Their algorithms are pretty sweet imho: the results pages never dissapoint. I use semantic scholar in 3 ways:</p>
<ol>
<li>I collect the volumes of academic work on my topic of interest, and more specifically, I look at individual papers’ citation breakdowns — papers citing methods or results. Best thing: you can filter for results with the full paper.</li>
<li>I explore the authors, because it helps me understand who has influenced whom.</li>
<li>And I dig into the slides section!</li>
</ol>
<h3 id="irisai">IRIS.ai</h3>
<p><a href="http://iris.ai/">IRIS</a> is another service based on some clever topic modelling algorithms, although it is most useful when you already have a paper to start with. The tool machine reads more than 130 million open access papers, and once you feed it an article, it would break it down by topic or theme and you pull in relevan papers for each of these in a map or easy to explore visual. It’s coverage is better for STEM, but I can always get some interesting insights for my own research.</p>
<h3 id="academiclabs">AcademicLabs</h3>
<p><a href="https://www.academiclabs.co/">Academic Labs</a> is great for finding labs and research groups that are working on specific areas. I’ve got more on matchmaking tools and how to look for academics to collaborate with, especially if you are a non-academic, in this blog.</p>
<p>What none of these platforms have, and I think it’s a big miss both for academics and non-academics, are the books and content that is not peer reviewed. I know science journalists and my entrepreneur or product manager friends would always want to find that opinion piece, article, blog or non-fiction book that cites this research. To do this, I would either look at the personal pages of the academics and see if they have a ‘media’ section, or look for the <a href="https://www.altmetric.com/">Altmetric</a> score (some publishers have this) and click into the different social media icons. It is a work-around. I have big hopes for <a href="http://questproject.eu/about/">QUEST</a>.</p>
<h2 id="2-to-find-a-very-specific-method-application-or-result">2. To find a very specific method, application or result</h2>
<p>This summer, for one of our projects at <a href="https://ocean.sagepub.com/">SAGE Ocean</a>, we explored causal inference, and specifically a relatively new method called convergent cross mapping, which builds upon a few well-known causal theories. My goal was to get an idea of its uptake in the social sciences. I wanted to look for the ‘turning point’ paper, or the paper where the method is fully described and, supposedly, the one cited in any research that applies it. After I’d used all the different platforms I mentioned earlier to drill down into the subject of causal inference and convergent cross mapping, I found a few papers that seemed to fully apply this technique. Next, I read the abstracts to understand which one was the ‘turning point’ paper. Their abstracts would talk about using the method, and I presumed that a peek at the ‘background’ section would give me some information about the previous work and specifically who developed convergent cross mapping, or where it was best applied. So I needed the full paper.</p>
<h2 id="3-when-i-get-to-an-article-that-i-really-want-to-read-but-thats-behind-a-paywall-i-normally-go-for-these-alternatives">3. When I get to an article that I really want to read but that’s behind a paywall, I normally go for these alternatives:</h2>
<ul>
<li>Use the <a href="https://unpaywall.org/">unpaywall plugin</a> — this searches through a database of open access papers and pre-prints to find a free and full-text version</li>
<li>Use the <a href="https://openaccessbutton.org/">Open Access Button</a> — you can paste the link, title, or DOI of the paper and, again, this searches through a database of available full-text papers to find a freely available one; this is an ‘ethical’ alternative to sci-hub</li>
<li>If I cannot find anything, I will search for the authors, reach out to them and ask for a version of the paper that they could share; I especially appreciate it when they have some slides or a blog that explains their key findings and methodology with less jargon. Academics are amazing!</li>
</ul>
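<p>And if you prefer to script the lookup, Unpaywall also exposes a free REST API keyed by DOI. A minimal sketch (the field names are as I remember the API documenting them, so double-check against the current docs):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Look up a free full-text location for a DOI via the Unpaywall API.
# Field names (is_oa, best_oa_location, url_for_pdf) as per my reading
# of the public API docs; verify against the current documentation.
import requests

DOI = "10.1126/science.1227079"  # the 'turning point' CCM paper below
EMAIL = "you@example.com"        # Unpaywall asks for a contact email

resp = requests.get(f"https://api.unpaywall.org/v2/{DOI}", params={"email": EMAIL})
record = resp.json()
oa = record.get("best_oa_location") or {}
print(record.get("is_oa"), oa.get("url_for_pdf") or oa.get("url"))
</code></pre></div></div>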
<p>It didn’t take me more than one hour to find that ‘turning point’ paper — Sugihara, George; et al. (26 October 2012). “<a href="http://www.uvm.edu/~cdanfort/csc-reading-group/sugihara-causality-science-2012.pdf">Detecting Causality in Complex Ecosystems</a>” (PDF). Science. 338 (6106): 496–500. doi:10.1126/science.1227079. PMID 22997134.</p>
<p>But my work was not done. This paper is published and categorized as environmental sciences. I still wanted to understand the uptake of convergent cross mapping in social sciences. This is when I turned to my favorite semantic scholar. I found the paper, and filtered the citations to just those that cite the method. Unfortunately, there was no easy way to figure out which papers were in the social sciences, so I had to literally read through the titles and journals and make a judgement on the discipline. Out of more than 300, I found 5. It was a long afternoon.</p>
<p>If you do have access though, what took me more than 1 hour can be done in 3 minutes with <a href="https://www.webofknowledge.com/">Web of Science</a>: just click on ‘analyze results’ in the right-hand corner, and you get a visual with numbers by discipline.</p>
<h2 id="more-alternatives-and-how-to-keep-up-to-date-with-latest-research">More alternatives and how to keep up to date with latest research</h2>
<p>To keep up to date with a specific subject or area of research, you can set up alerts and follow the academics that are considered influential on the subject. Both of these are possible on all the platforms: Google Scholar, Microsoft Academic, Semantic Scholar… And now there are even more alternatives:</p>
<ul>
<li><a href="https://www.sparrho.com/">Sparrho</a> is like pinterest but for academic papers.</li>
<li><a href="https://www.meta.org/">Meta</a>, although it only focuses on biomedical science for now; and actually, in my opinion a better way to explore your biomed questions (and the literature) is <a href="https://sci.ai/">sci.ai</a> (beta) or <a href="https://www.causaly.com/">causaly</a> (paid).</li>
<li><a href="https://www.morressier.com/">Morressier</a> is great for latest research presented at conferences.</li>
<li><a href="https://ooir.org/">OOIR</a> is a must for the social and political science enthusiasts, it’s all the latest and trending papers. For pre-prints (papers not yet peer-reviewed and published, with full text available) in social sciences, use SocArXiv.</li>
<li><a href="https://www.mysciencework.com/">MyScienceWork</a> is similar to many of the search tools, and does allow for filtering full-text; they also have add-on and paid solutions.</li>
<li><a href="https://paperswithcode.com/">Papers with code</a> is an excellent web tool for all those looking into the data science and/or computational science space; along with <a href="https://arxiv.org/">arXiv</a> of course. And if you are already a prolific user of arXiv, then get the <a href="https://fermatslibrary.com/librarian">Librarian extension</a> to explore and find all the cited full papers.</li>
<li>If you are a researcher (even if not in academia), you can always use and explore <a href="https://loop.frontiersin.org/">Loop</a>.</li>
<li>I recently had a demo from <a href="https://sci-brain.com/">SCI-BRAIN</a> (paid), it’s fabulous if you’re interested in the influential academics, although they have yet to add sources beyond arXiv.</li>
</ul>
<p>I am sure there are loads more, and if you know or use any other exploration and discovery tools, do share!</p>Daniela DucaPerhaps because I work for an academic publisher, but whenever I meet a founder, or a product manager, they always ask me what’s the latest academic research for their particular problem space and where or how they can find it. I get quite excited because, for me, this is the ultimate evidence of the benefits of open science, or the global push to make all publicly funded academic research freely available to all. But having all this research available does not mean it is easily discoverable and accessible. People, like my founder and product manager friends, have to figure out how to find this research, understand it and make proper use of the results for their own work. And I’ve got good news, there are loads of tools already to help you get closer to some useful insights!Using and developing software in social science and humanities research2019-10-18T00:00:00+00:002019-10-18T00:00:00+00:00https://danielagduca.github.io/tools%20&%20technology/software-social-science<blockquote>
<p>This blog was originally published on <a href="https://ocean.sagepub.com/blog/using-and-developing-software-in-social-science-and-humanities-research">SAGE Ocean</a></p>
</blockquote>
<p>Looking at software for research in social sciences is one of the key areas within SAGE Ocean. In a <a href="https://us.sagepub.com/sites/default/files/compsocsci.pdf">2016 white paper on who is doing computational social science</a>, we asked social science researchers about using and sharing software and code for working with big data. Over the last year, we’ve been exploring the different software and technologies used by social science researchers for a variety of purposes, like surveying, <a href="https://ocean.sagepub.com/blog/no-more-tradeoffs-the-era-of-big-data-content-analysis-has-come">text annotation</a>, <a href="https://ocean.sagepub.com/blog/how-to-run-an-online-experiment">online experiments</a>, <a href="https://ocean.sagepub.com/blog/whos-disrupting-transcription-in-academia">transcription</a>, text mining, tools for social media research (see the work on <a href="https://ocean.sagepub.com/blog/collecting-social-media-data-for-research">collecting data</a>, <a href="https://ocean.sagepub.com/blog/social-scientists-working-with-linkedin-data">linkedin</a>, <a href="https://ocean.sagepub.com/blog/social-media-data-in-research-a-review-of-the-current-landscape">landscape of tools</a>, <a href="https://ocean.sagepub.com/blog/how-researchers-around-the-world-are-making-use-of-weibo-data">weibo</a> ) and <a href="https://ocean.sagepub.com/blog/matchmaking-tools-augmenting-the-relationship-between-research-and-industry">finding industry partners</a>. We’ve picked out some of the trends from 400+ tools in this space and are drafting a white paper that we’ll share in the next few weeks.</p>
<p>Whilst analyzing these tools and investigating researchers’ challenges, both with developing and using the software, we wanted to do a virtual raise-your-hand with our community to understand the extent to which software and code is being used and developed in the social sciences right now. So we ran the research software survey developed by Simon Hettrick at the <a href="https://software.ac.uk/">Software Sustainability Institute</a>. Simon has already surveyed researchers across faculties <a href="http://bit.ly/SotonSurvey19PIS">at Southampton</a> twice (<a href="https://slides.com/simonhettrick/software-in-southampton#/">preview the results</a>), and is working with a growing list of universities that are interested in understanding how much support their own researchers need around software development. Simon’s aim is to open up all the data in <a href="https://github.com/softwaresaved/local-and-regional-software-surveys">this GitHub repository</a>, so others can re-use and aggregate it.</p>
<h2 id="what-did-we-find">What did we find?</h2>
<p>More than three-quarters of the respondents think that software is important or critical to their research. While 85% use software, only 10% have developed their own software. I was surprised by this figure, considering that close to half of the tools we found were developed within universities or by individual researchers (more details in the upcoming white paper). Although Simon’s data covers researchers from a single university, <a href="https://slides.com/simonhettrick/software-in-southampton#/">his numbers show</a> that across disciplines, there are only 33% that develop their own software. This could indicate that social sciences and humanities are not that far behind!</p>
<p>What is more reassuring is that over 20% of the respondents said they hired someone specifically to develop software. Ideally, they would prefer to recruit a developer from an institutional pool of technical experts, which means that universities should look into the needs of researchers across all disciplines when it comes to software. Interestingly, close to a quarter of the respondents included costs for software development in their funding requests, another positive sign that central pools of research software engineers may be in demand.</p>
<h2 id="method-and-respondents">Method and respondents</h2>
<p>We published a link to fill in the survey in our <a href="https://ocean.sagepub.com/big-data-newsletter-july-2019">Big Data Newsletter</a> in July this year. Our newsletter goes out to about 300 thousand subscribers worldwide, and within just a couple of days, we got 149 responses, a relatively small but promising sample. We assumed that most of our subscribers are from social science and humanities related faculties (85%) and are either teaching or doing research within a university (91%). Most respondents are receiving grants from a national or international research council (50%), some receive funding from a charitable fund (15%) or their own university (14%).</p>
<p>Unfortunately, where the questions were not compulsory, only 14 or 15 people answered, so I left these out of the current analysis. I manually cleaned and aggregated the faculty and funding type, as many respondents elected to fill in ‘other’ and there were too many types as a result. I also aggregated the role based on the closest equivalent to what the respondents filled in.</p>
<p>And yes, we did have a lucky £50 gift card winner, who can boast all about it to their colleagues and more widely, if they chose to 😁.</p>
<p>Raw and cleaned data is available <a href="https://github.com/softwaresaved/local-and-regional-software-surveys/tree/master/data/sage-ocean-social-science">here</a>.</p>
<p>You can read a detailed report of our findings in our new white paper. Download it <a href="https://uk.sagepub.com/en-gb/eur/technologies-for-social-science-research">here</a>.</p>Daniela DucaThis blog was originally published on SAGE OceanA light introduction to research data management2019-10-01T00:00:00+00:002019-10-01T00:00:00+00:00https://danielagduca.github.io/tools%20&%20technology/intro-research-data<h2 id="demystifying-the-terms">Demystifying the terms</h2>
<h4 id="research-data">Research data</h4>
<p>Although a contested term, research data represents almost anything and everything that is raw and could be analysed for research. It can be (non-exhaustively) text, numbers, images, tables, databases, corpora, etc. You will most likely hear arts and some humanities researchers claiming they don’t produce data in their research, but the argument is that even a recording of their process, from inspiration to output, constitutes data. The outputs from one research project could become the data of another researcher. A selected number of pages from a corpus, or an aggregated list of sources, can also be data. In the social sciences, research data could be: all the raw interview files and transcripts, the survey responses and the actual survey questions, the aggregated social media data, the corpora to be analyzed, etc.</p>
<h4 id="research-data-management">Research data management</h4>
<p>RDM normally refers to all the processes from the creation of research data to its stewardship, to ensure the data retains value. Many universities, if not all, will appoint someone to support all research staff with the management of research data. This can be anything from a small part of one person’s role to a team of 5 or more supporting RDM. There is… some debate… as to which part of the university should be responsible for the management of research data: in some, the library has taken an active role, in others the research support department, and in others even IT infrastructure. What this sometimes means is that even though all these different teams of people are championing RDM, none are undertaking decision-making roles. The library may want to run RDM infrastructures, but they can only run trainings, because they may depend on the IT skillset to implement certain services. And IT may not feel these are at the top of the priority list, hence much delay with implementation. Moreover, there are no real incentives for researchers to take care of their data post-publication or post-grant-end-date. A big discourse in this space is around culture change being the major reason for RDM’s slow uptake. Tools for RDM include DMP Online (a checklist for researchers who want to comply with institutional and funder data policies). DMP Online was originally developed by the Digital Curation Centre in the UK, and is now a joint partnership with the California Digital Library. There are a number of other checklists and tools that support offices can use to ensure compliance with RDM policies, such as the DMA dashboard. To find out more about research data management, the <a href="https://rdmtoolkit.jisc.ac.uk">Jisc RDM toolkit is a good start</a>, as it is the most recent and links back to a variety of other links and resources.</p>
<h4 id="preservation">Preservation</h4>
<p>Refers to the processes of safeguarding digital data in order to keep it accessible and readable even when the file formats or software used to open these files are obsolete. So, for example, if you did some interviews back in the 1990s and saved all the responses in the old 1995 Microsoft Word file format, you will probably be unable to open these files now. A good preservation system would ‘read’ the file, identify the format, extract all metadata (file specs), check for a variety of errors and save the original 1995 Word file, along with a copy which would continuously be updated to the newest file format, so that you can open and read the contents of that file. There are a variety of file formats, and some research projects use very specific software and produce unique formats that are harder to preserve. In these cases, preservation software will automatically convert the copy of the original file into a ‘common’ format (most of the time an open format). For instance, image files will be converted to JPEG2000 (open format) or TIFF (proprietary). Even though proprietary formats can be more robust, they carry more risk for the long term (the company goes bust, nobody runs any updates, and the format becomes obsolete). You may be wondering who decides how and when preservation software converts the files, etc. There is a host of preservation experts who work together to produce standards and rules for different file formats. The National Archives in the UK runs PRONOM, a file format registry where anyone can submit new file formats and requirements for their preservation. The Library of Congress in the US also runs a couple of initiatives to identify file formats and guide their long-term sustainability. The most important thing to know: OAIS, the reference model/standard that contains all the recommendations for how to implement a preservation process. Examples of preservation software infrastructure: Preservica, Archivematica. To find out more about digital preservation, the Digital Preservation Coalition’s <a href="https://www.dpconline.org/handbook">handbook is a good start</a>.</p>
<h4 id="archiving">Archiving</h4>
<p>Refers to the process of placing data that is no longer needed into long-term storage, most often offline and in a different location. Archived data is not easily accessible and would require manual retrieval. Unlike preservation, archived data remains unchanged over the years, in the original file formats. The most common software that universities use for archiving is Arkivum. Where RDM is well implemented, institutions will combine an archival and a preservation system to ensure the long-term sustainability and accessibility of the content, potentially with a repository interface where the researcher uploads the data.</p>
<h4 id="curation">Curation</h4>
<p>The difference between data management and curation is quite subtle. Data management refers to the stewardship of data throughout its lifecycle, whereas curation deals more with selecting the data that will have value in the future, making it accessible to other communities of researchers, and ensuring its long-term preservation. When people refer to data curation, they usually mean higher-level questions such as: there is too much data and we don’t know which of it to preserve; if you ask researchers, they will always say theirs is the most important data; how do we decide how many years it should be curated for; is it 10 years after last access or 10 years from creation…</p>
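<p>That last question is easy to state in code, which is partly why the choice of policy matters: the two readings can give review dates years apart. A trivial sketch, with policy names of my own invention:</p>
<pre><code class="language-python">
from datetime import date, timedelta

TEN_YEARS = timedelta(days=365 * 10)

def review_date(created: date, last_accessed: date, policy: str) -> date:
    """Date at which a dataset comes up for a retention review."""
    if policy == "from_creation":
        return created + TEN_YEARS
    if policy == "after_last_access":  # resets every time someone uses the data
        return last_accessed + TEN_YEARS
    raise ValueError(f"unknown policy: {policy}")

# A dataset created in 2010 but last accessed in 2018:
print(review_date(date(2010, 5, 1), date(2018, 5, 1), "from_creation"))      # 2020-04-28
print(review_date(date(2010, 5, 1), date(2018, 5, 1), "after_last_access"))  # 2028-04-28
</code></pre>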
<h4 id="repository">Repository</h4>
<p>A repository (within the research data management context) is software that enables individuals and institutions to ‘save’ data. You will have heard of figshare and Zenodo; these are probably the most popular repositories among researchers. However, there are more than 2k research data repositories worldwide (<a href="https://www.re3data.org">registry</a>). Most of them are based on or built upon these software packages: Dryad (funded by the NSF), EPrints (funded by the EPSRC), DSpace, Dataverse, Fedora, Samvera, etc. Very rarely, if ever, do repositories perform any preservation or archiving. A repository simply allows the researcher to save a copy of the data, which they can choose to make private or publicly available, and most repository software can mint a DOI (digital object identifier). Universities may choose to have a repository for publications and one for data, or a combined data and publications repository, but most often just a publications repository. Since RDM is not yet a top-10 priority, most university policies encourage researchers to publish their data in their disciplinary or funder repository, and only when they cannot find an appropriate one among the 2k+ available worldwide should they ask for the institution’s help. Institutional repositories are quite handy when it comes to the REF or any other type of reporting, because research support staff can easily pull stats on the volume and impact of research from them. However, institutional repositories are still quite patchy and require a lot of manual involvement. Worth noting that many universities in the UK and South Africa (about 30, to my knowledge from 2017) have subscribed to the institutional offer of figshare (researchers love figshare). The Wellcome Trust data repository also runs on figshare.</p>
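<p>To give a feel for how thin the repository layer is, here is a minimal sketch of a deposit via Zenodo’s REST API, as I understand it from their documentation; the endpoints and fields may have changed since I last used them, and the token, file name and metadata values are all placeholders.</p>
<pre><code class="language-python">
import requests

ZENODO = "https://zenodo.org/api"
TOKEN = "your-personal-access-token"  # created under Zenodo account settings

# 1. Create an empty deposition (a draft record)
r = requests.post(f"{ZENODO}/deposit/depositions",
                  params={"access_token": TOKEN}, json={})
r.raise_for_status()
dep_id = r.json()["id"]

# 2. Attach the data file
with open("interviews.csv", "rb") as fp:
    requests.post(f"{ZENODO}/deposit/depositions/{dep_id}/files",
                  params={"access_token": TOKEN},
                  data={"name": "interviews.csv"}, files={"file": fp})

# 3. Describe it, then publish; Zenodo mints the DOI on publication
requests.put(f"{ZENODO}/deposit/depositions/{dep_id}",
             params={"access_token": TOKEN},
             json={"metadata": {"title": "Interview transcripts, 1995",
                                "upload_type": "dataset",
                                "description": "Anonymised interview transcripts.",
                                "creators": [{"name": "Doe, Jane"}]}})
requests.post(f"{ZENODO}/deposit/depositions/{dep_id}/actions/publish",
              params={"access_token": TOKEN})
</code></pre>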
<h4 id="ir">IR</h4>
<p>IR refers to institutional repository and can be both for publications and/or data.</p>
<h4 id="open-research-data">Open research data</h4>
<p>Refers to the global initiative to make data accessible and reusable. You will hear many refer to the FAIR principles in relation to open data. FAIR stands for findable, accessible, interoperable and reusable. The principles were first drafted in 2014 by FORCE11, a network of researchers, librarians, archivists, etc. that has grown organically. There are a dozen or so metrics underpinning these principles that can be used to evaluate the FAIRness of data. Another important development in this space was the Horizon 2020 Open Data Pilot, which allowed grant applicants to volunteer to make their data open; the next EC funding programme will require all grant applicants to open their data. In the UK, most funders under UKRI require data to be open. If it weren’t for Brexit, there might have been more progress in this space, as the former Universities minister set up a task force to review the open research data infrastructures in the UK and recommend policies and a way forward. In Europe, there have also been a few reviews of data infrastructures; this is the <a href="http://www.knowledge-exchange.info/event/federated-RD-infrastructure">latest</a>, and there is an older, quite well-known report on research infrastructures in the <a href="http://bib.irb.hr/datoteka/559510.spb42_RI_DigitalHumanities.pdf">digital humanities</a>.</p>
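<p>The metrics are mostly checklist-like, so a toy version fits in a few lines. This is my own simplification, not the official metrics, and the field names are invented for illustration:</p>
<pre><code class="language-python">
def fair_check(record: dict) -> dict:
    """Toy FAIRness checks over a repository record (invented field names)."""
    return {
        "findable": bool(record.get("doi")),                  # persistent identifier
        "accessible": record.get("access") in {"open", "embargoed"},
        "interoperable": record.get("schema") in {"datacite", "dublin_core", "ddi"},
        "reusable": bool(record.get("licence")),              # explicit usage licence
    }

print(fair_check({"doi": "10.1234/x", "access": "open", "schema": "datacite"}))
# {'findable': True, 'accessible': True, 'interoperable': True, 'reusable': False}
</code></pre>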
<h4 id="metadata">Metadata</h4>
<p>Refers to the set of fields that describe a dataset, including but not limited to: title, year, authors, description, discipline, etc. There are a few standards (minimum required fields that are enough to describe a dataset and ensure it can be understood and reused by someone else without contacting the author): <a href="http://dublincore.org">Dublin Core</a> and the <a href="https://schema.datacite.org">DataCite schema</a> are probably the most used. Figshare, for example, built its selection of fields on the DataCite schema. Some metadata standards are disciplinary; the most used in the social sciences, for example, is the Data Documentation Initiative (DDI). Normally, repositories will let users enter the metadata manually, and there are some initiatives and enhancements of certain repository software to populate some fields automatically. In a preservation system, however, a lot of the metadata, especially that relating to the provenance of the file, its format and technical specs, will be detected automatically by the software.</p>
<h4 id="identifiers">Identifiers</h4>
<p>Numerical and textual sequences for data (DOI), funders (FundRef), institutions, and researchers (ORCID).</p>
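<p>Putting metadata and identifiers together, this is roughly what the mandatory fields of a DataCite-style record look like, sketched as a Python dict; the exact field names and nesting vary between schema versions, and all the values are placeholders:</p>
<pre><code class="language-python">
# The DataCite kernel's mandatory properties: identifier, creator, title,
# publisher, publication year and resource type.
record = {
    "identifier": {"identifierType": "DOI", "identifier": "10.1234/example.5678"},
    "creators": [{
        "creatorName": "Doe, Jane",
        "nameIdentifier": {"scheme": "ORCID",
                           "value": "0000-0000-0000-0000"},  # placeholder ORCID
    }],
    "titles": [{"title": "Interview transcripts, 1995"}],
    "publisher": "Example University",
    "publicationYear": "2019",
    "resourceType": {"resourceTypeGeneral": "Dataset"},
}
</code></pre>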
<h2 id="challenges-and-barriers-to-curating-data">Challenges and barriers to curating data</h2>
<p>In no particular order:</p>
<ul>
<li>language and terminology are not well understood</li>
<li>policies across funders, institutions and publishers around data are weak and quite vague (open data doesn’t mean it has to be always accessible; it could also mean ‘send me an email and I will forward it’)</li>
<li>there is no way to check whether data practices are well implemented</li>
<li>to enable more and better data sharing, researchers need to go through some serious ‘cultural changes’; researchers cite the time needed to clean up data and the lack of attribution as major barriers, and many believe they will lose future publications</li>
<li>data is quite patchy and metadata is not good enough to enable reuse</li>
<li>infrastructure is poor: although a host of services exist to support different parts of data management and curation, most are not interoperable, meaning people waste time uploading the same data into multiple services</li>
<li>a variety of training programmes are available, but skills are lacking among researchers, who have little interest in or time for pursuing them</li>
<li>person-identifiable data poses big risks for open data</li>
<li>tracking the costs of managing research data is hard: researchers don’t include them in grants to remain competitive (game theory); research support doesn’t always have a separate budget; and many services are acquired across departments with no central budget to cost RDM activities against, all of which makes it hard to quantify the spending on managing research data (and hence the risk of losing it)</li>
</ul>
<h2 id="why-rdm-is-important">WHY RDM is important</h2>
<ul>
<li>Critically, to minimize data loss; it can be catastrophically bad: check out <a href="https://www.bbc.co.uk/news/uk-scotland-glasgow-west-27556659">library burns</a></li>
<li>it is one of the aspects that can improve research efficiency and integrity, and help address the reproducibility crisis</li>
<li>it’s great for building reputation, enabling reuse of your data, and widening dissemination and impact</li>
<li>potentially new relationships and collaborations if data is shared</li>
</ul>
<h2 id="research-data-services">Research data services</h2>
<p>Note: these are mostly used within universities.</p>
<p>A good research data service will have a few components (imho):</p>
<ul>
<li>Training</li>
<li>Researcher depositing data</li>
<li>Research support preserving and archiving data</li>
<li>And all would be interoperable and linked to publications and projects seamlessly</li>
</ul>
<p>A number of universities are trying to provide parts of this service; many have established research data support roles or even teams. The first step all universities take is to define their research data policy. Commonly, this policy puts the responsibility for the curation of data on the researcher. But we’ve seen some changes recently, with institutions and national structures trying to support this piece.</p>
<p>In the UK, there is an expectation for RD services to become part of the national infrastructure: Jisc’s massive <a href="https://www.jisc.ac.uk/rd/projects/research-data-shared-service">research data shared service project</a> has 18 pilot institutions.</p>
<p>In the Netherlands, DANS offers the following core services as part of the national infrastructure, but not for free:</p>
<ul>
<li>DataverseNL for short-term data management</li>
<li>EASY for long-term archiving</li>
<li>NARCIS, the Netherlands’ national portal for research information.</li>
</ul>
<p>In the US, most RD services are provided by individual institutions.</p>
<p>A couple of years ago, I looked at a variety of software and tried to work out which combinations would make for a good and reasonably comprehensive research data service. Here are a few I came up with (not an exhaustive list):</p>
<p>DANS - DataVerse and EASY recently partnered with Dryad
<strong>Strengths</strong>: tried and tested, repository+archiving
<strong>Weaknesses</strong>: NL/EU focused, not clear how much preservation they do, may be expensive</p>
<p>Ex-Libris Esploro (pilots: University of Iowa, Lancaster University, University of Miami, University of Oklahoma, University of Sheffield)
<strong>Strengths</strong>: two universities in the UK including Lancaster; UCL has Ex-Libris Primo; end-to-end research service
<strong>Weaknesses</strong>: as of 2018, it was a work in progress</p>
<p>Figshare with partner*
<strong>Strengths</strong>: analytics, user experience, user support, storage bundles, large files; researchers like figshare
<strong>Weaknesses</strong>: relies on a link, or expects faculty to upload twice (to figshare and to the preservation system); very few examples of repository+preservation</p>
<p>Mendeley Data (Elsevier) with partner*
<strong>Strengths</strong>: integrated with Pure; may be developing a preservation system (?)
<strong>Weaknesses</strong>: may not be able to work with Ex-Libris (a competitor) or Symplectic Elements</p>
<p>*Best partner: Preservica
<strong>Strengths</strong>: can run on premises and use UCL’s local storage; admin analytics
<strong>Weaknesses</strong>: expensive, training costs extra, not very user friendly</p>
<h2 id="events-reports-and-other-indicators-of-growing-interest-in-this-area">Events, reports, and other indicators of growing interest in this area:</h2>
<ul>
<li>2003: NIH expects the data from projects funded at $500k or more to be accessible and open</li>
<li>2005: NSF expects researchers to share their data and other raw materials, including software</li>
<li>2011: Research funders in the UK developed common principles on data policy</li>
<li>2012: Royal Society published its ‘<a href="https://royalsociety.org/~/media/policy/projects/sape/2012-06-20-saoe.pdf">Science as an open enterprise</a>’ report championing policies around open data, one of the most cited reports underpinning the move towards open science in the UK</li>
<li>2013: <a href="https://opendatacharter.net/g8-open-data-charter/">G8 Open Data Charter</a> and Technical Annex promotes sharing data</li>
<li>2013: <a href="https://www.rd-alliance.org/">Research Data Alliance</a>, a global network of research data experts (now close to 5k), was launched as a bottom up approach to develop research data infrastructure</li>
<li>2014: <a href="https://blogs.plos.org/everyone/2014/02/24/plos-new-data-policy-public-access-data-2/">PLOS open data policy</a>, a lot of pushback</li>
<li>2014: first draft of the FAIR data principles at a FORCE11 workshop; these were set in stone and launched in 2016</li>
<li>2016: a number of UK’s leading research organisations developed and signed the <a href="https://www.ukri.org/files/legacy/documents/concordatonopenresearchdata-pdf/">Concordat on Open Research Data</a> built on 10 principles</li>
<li>2016: IRUSdata UK, metrics for data usage, COUNTER compliant</li>
<li>2017: European Open Science Cloud (still a bit of a mess) will potentially integrate data services along other research infrastructures and build interoperability</li>
<li>2018: UK Open Research Data Taskforce recommendations for building infrastructure: <a href="https://www.gov.uk/government/publications/open-research-data-task-force-final-report">Realising the potential launched to public Feb 2019</a></li>
<li>research data discovery initiatives in <a href="https://researchdata.ands.org.au">Australia</a> and the <a href="https://www.jisc.ac.uk/rd/projects/uk-research-data-discovery">UK</a>: these are aggregations of institutional repositories that enable searching across them.</li>
</ul>
<h2 id="organizations-that-are-quite-visible-in-the-rdm-and-open-data-space">Organizations that are quite visible in the RDM and open data space</h2>
<p><strong>National</strong>: Jisc, DANS, ANDS, NIH, NSF, The National Archives, NEH</p>
<p><strong>Libraries</strong>: British Library, California Digital Library, Library of Congress</p>
<p><strong>Others</strong>: CNI, DataCite, Open Knowledge Foundation, Digital Preservation Coalition, Digital Curation Centre, UC3, Force11</p>
<p><strong>Leading Universities</strong>: Edinburgh University, Universities of California</p>