Presentation on Alveo for Digital Humanities Australasia 2014

Creative Commons License
This work by Peter Sefton, Steve Cassidy, Dominique Estival, Jared Berghold & Denis Burnham is licensed under a Creative Commons Attribution 4.0 International License.

This presentation about the HCS vLab was delivered by Peter Sefton at Digital Humanities Australasia 2014 in Perth.

The Human Communication Science Virtual Laboratory: building on HCSNet (an ARC research network)

 


The Human Communication Science Virtual Laboratory (HCS vLab) is a UWS-led project funded by the National eResearch Collaboration Tools and Resources project (NeCTAR), an Australian Government Super Science project. It brings together almost 50 active researchers from 16 institutions and is one of 10 such Virtual Labs across Australia. The project builds upon two previous UWS-administered projects, the 2000-member Human Communication Science Network (HCSNet, ARC, RN0460284) and the 30-investigator, 12-institution Big Australian Speech Corpus (ARC LIEF, LE100100211), and upon the ANDS-funded Australian National Corpus project led by Griffith University.

The main purpose of the HCS vLab is to provide an environment that will foster inter-disciplinary research in Human Communication Science (HCS). While HCS is a broad field which encompasses speech science, speech technology, computer science, language technology, behavioural science, linguistics, music science, phonetics, phonology, sonics and acoustics, research is often conducted in isolation within each discipline. Too often the data sets used in research are difficult to share between researchers and even more between disciplines; tools are rarely shared across disciplines. HCS research in Australia, and the development of successful real-life applications, demands a new model of research, beyond that of the isolated desk/lab/university bound research environment. The HCS vLab environment aims to eliminate the waste involved in repeated unshared analyses, provide the impetus for new collaborations, encourage new tool-data combinations, and improve scientific replicability by moving data and tools as well as the analyses conducted with these into an easily accessible, shared environment.

Architecturally, the HCS vLab comprises a repository for heterogeneous data under a standardised metadata framework based on RDF, providing discovery services that allow researchers to create data sets that can be fed to a wide variety of research tools via a rich Application Programming Interface (API). Another major component is a workflow engine which allows data to be fed through a series of processing steps which can be stored and re-used. The HCS vLab will also orchestrate the creation of virtual environments including virtual servers pre-loaded with both a set of tools and data, as well as virtual High Performance Computing clusters.

This presentation will first cover the architecture of the HCS vLab and give examples of its use across different kinds of data (text, audio, video) with a variety of tools. These include both ‘point and click’ pre-configured tools and a range of full programming environments in which data can be automatically marshalled for further processing. Examples include the Python-based Natural Language Toolkit (NLTK) for text processing and EMU-R on the R-stats platform for speech processing and analysis. We will present in more detail the variety of corpora that have been made accessible and discuss the tools that are available for analysing these data sets, emphasising the novel use of some of these. The presentation will then report on experiences with new kinds of interdisciplinary research and demonstrate some research scenarios.

We will also discuss the potential for this approach and architecture to be adopted more generally in the digital humanities world, showing how new data and tools can be imported into the Virtual Lab environment, and how the tools can be used on data anywhere.


Biography

Dominique Estival has a PhD in Linguistics and extensive experience in academic research and commercial project management for Language Processing in the USA, Europe and Australia, including as NLP Team Leader for R&D at Syrinx Speech Systems, a Sydney speech recognition company developing automated telephone dialogue systems; Senior Research Scientist for natural language technologies, human-computer interfaces and multi-lingual processing with the DSTO (Defence Science & Technology Organisation); and Senior Manager for language processing research for US-government-funded and commercial projects at Appen P/L, a company providing speech and language databases for language applications. At UWS, Estival is the Project Manager of the Big ASC Project, establishing the audio-visual AusTalk corpus of Australian English, and of the HCS vLab. She is a founding member of the Australasian Language Technology Association (ALTA) and in 2008 established the Australian Computational and Linguistics Olympiad (OzCLO).

Steve Cassidy is a Computer Scientist whose research covers the creation, management and exploitation of language resources. He is the main author of the Emu speech database system which is widely used in the creation and analysis of spoken language data for acoustic phonetics research. He has been involved in the standardisation of tools and formats for the exchange of language resources starting with his work on Emu and more recently as an invited expert on the ISO TC 37 working groups on annotation interchange formats and query languages for linguistic data. Cassidy is the Product Owner for the HCS vLab, acting as a conduit between the development team and prospective users around Australia as well as ensuring interoperability with related international efforts.

Peter Sefton is the Manager for eResearch at UWS. Before that he ran the Software Research and Development Laboratory at the Australian Digital Futures Institute at USQ. Following a PhD in computational linguistics, he has gained extensive experience in the higher education sector in leading the development of IT and business systems to support both learning and research. At USQ, Sefton was involved in the development of institutional repository infrastructure in Australia via the federally funded RUBRIC project and was a senior advisor to the CAIRSS repository support service from 2009 to 2011. He oversaw the creation of one of the core pieces of research data management infrastructure to be funded by the Australian National Data Service, consulting widely with libraries, IT, research offices and eResearch departments at a variety of institutions. The resulting Open Source research data catalogue application ReDBOX is now widely deployed at Australian universities. At UWS, Peter leads a team working with key stakeholders to implement university-wide eResearch infrastructure, including an institutional data repository, and collaborates widely with research communities on specific research challenges. His research interests include repositories, digital libraries, and the use of The Web in scholarly communication.

Denis Burnham is inaugural Director of MARCS Institute, UWS (1999-present) and President of the Australasian Speech Science and Technology Association (ASSTA, 2002-present). He conducts research in speech perception (auditory-visual, cross-language, infants, children, adults), special speech registers (to infants, pets, computers, foreigners), language development and literacy, human-machine interaction, and corpus management; has been continuously funded by the Australian Research Council since 1986; and has run various large projects, most recently this HCS vLab, the Big Australian Speech Corpus (Big ASC), the Human Communication Science research network, the Thinking Head Project, and the Seeds of Literacy Dyslexia project.

Jared Berghold has a research and development background in computer visualisation, interactivity and enterprise architecture. Jared has practiced software engineering for over eight years and prior to Intersect worked at iCinema, a research centre at UNSW, on interdisciplinary projects with a focus on interactive and immersive narrative systems. He also worked at Avolution and CiSRA, the Australian research and development lab for Canon.

Jared has a BE/BA (Hons) in Computer Systems and International Studies from the University of Technology, Sydney.

Funding

HCS vLab acknowledges funding from the NeCTAR project (http://www.nectar.org.au). NeCTAR is an Australian Government project conducted as part of the Super Science initiative and financed by the Education Investment Fund.

 


The HCS vLab is funded by NeCTAR, a body set up by the Australian Government as part of the Super Science initiative and financed by the Education Investment Fund.

Outline

  • Project background
  • Show and tell – what is this vLab? Demo of Phase 1
  • How to get involved?
  • Phase 2
  • Feedback – Questions?

 

 


The aim of this presentation is to introduce the virtual lab project, show what it can do now and talk about its potential for the future, and let Digital Humanities practitioners in Australia know how they can get involved.

Contributing Partners

 


University of Western Sydney, Macquarie University, the Australian National University, University of Canberra, Flinders University, University of Melbourne, University of Sydney, University of Tasmania, University of New South Wales, University of Western Australia, RMIT, University of New England, LaTrobe University, NICTA (National ICT Australia), ASSTA (Australasian Speech Science and Technology Association).

Development partner: Intersect

Intersect Development Team:

Ilya Anisimoff, Jared Berghold, David Clarke, Georgina Edwards, Karen El-Azzi, Gabriel Gasser, Matthew Hillman, Chris Kenward, Nasreen Sharique, Kali Waterford, Elyse Wise, Marc Ziani de Ferranti, Shuqian Hon, Sean Lin, Theeban Soundararajan, Vincent Tran, Stanley Hon, Pierre Estephan, Simon Yin

 

Watch the video!

 


A good way to get an understanding of what the virtual lab can do is to have a look at this video by Steve Cassidy from Macquarie University: http://hcsvlab.org.au/2013/10/hcs-vlab-version-2-0-screencast/

Data Discovery

 


The laboratory has a data-discovery portal which houses a growing number of data sets. We originally called these ‘corpora’ but this turned out to be a problem, as (a) many people don’t know what a corpus is or how to tell their corpus from their corpora and (b) it is specific to some disciplines, so these are now known as Collections.

All the current collections in the lab require researchers to agree to a license, usually via web-click, although some require an offline contract. This is not Open Data in the normal sense, but given the terms under which most of the collections were collected this is the best we can do to make data available as broadly as possible.

Discover data via metadata facets

 


The discovery interface is very similar to that found on many DH services: it has faceted browsing and full-text search for resources. But this is not the main or sole function of the lab; the purpose is to allow people to DO things with data.
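As a rough illustration of what faceted browsing amounts to, the following Python sketch counts the values of one metadata field over a few made-up item records and then narrows the set by choosing one value. The records, field names and helper functions are invented for this example and are not taken from the HCS vLab itself.

```python
from collections import Counter

# Hypothetical item metadata; ids, field names and values are invented.
ITEMS = [
    {"id": "item-1", "collection": "A", "type": "audio"},
    {"id": "item-2", "collection": "A", "type": "text"},
    {"id": "item-3", "collection": "B", "type": "audio"},
]


def facet_counts(items, field):
    """Count how many items carry each value of one metadata field."""
    return Counter(item[field] for item in items if field in item)


def filter_by_facet(items, field, value):
    """Keep only the items whose metadata field equals the chosen value."""
    return [item for item in items if item.get(field) == value]


if __name__ == "__main__":
    # The counts are what a facet sidebar would display next to each value.
    print(facet_counts(ITEMS, "type"))
    # Clicking a facet value corresponds to filtering on it.
    print([item["id"] for item in filter_by_facet(ITEMS, "type", "audio")])
```

A real discovery service would delegate this counting and filtering to a search index rather than looping over records in memory.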

Architecture: Discovery leads to data

 


The lab has a data-storage component which can look after large amounts of heterogeneous data, with any kind of digital resource stored against an item.

Including various media

 


This is an example of speech data from the Mitchell and Delbridge corpus (data set) collected in the 1960s.

Post discovery: compile your own stable “Item Lists”

 


Researchers can create their own sets of working data from items in the repository: the discovery interface can be used to define and save an item list.

Item lists are stable, reusable and in future will be citable; they are NOT like saved searches. A saved search in something like the National Library’s Trove service might yield a different set of items on different days for the same search or API call.

Item lists are the key to re-doable research workflows, as they allow the same stable data set to be run through multiple processes. They are available via the web interface and via the API.
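The difference between an item list and a saved search can be sketched in a few lines of Python. This is only an illustration of the idea: the `ItemList` class, the `run_search` helper and the toy repository are hypothetical and do not reflect how the lab stores its data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class ItemList:
    """A named, fixed snapshot of item identifiers."""
    name: str
    item_ids: Tuple[str, ...]


def run_search(repository: Dict[str, str],
               predicate: Callable[[str], bool]) -> Tuple[str, ...]:
    """Evaluate a query against the repository as it is right now."""
    return tuple(sorted(i for i, text in repository.items() if predicate(text)))


# A toy repository mapping invented item ids to their text.
repository = {"item-1": "good day", "item-2": "good night"}
has_good = lambda text: "good" in text

# Freezing the current search result gives a list that will not change.
frozen = ItemList("good-items", run_search(repository, has_good))

# New data arrives later: re-running the search now returns three items,
# while the frozen item list still names the original two.
repository["item-3"] = "good grief"
```

Running the same analysis over `frozen.item_ids` next month would process exactly the same items, which is the property that makes a workflow re-doable.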

Concordance: a tool run on an Item List

 


Steve’s demo shows some examples of simple, broadly useful text analysis tools that are built in to the HCS vLab data site.
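For readers who have not met a concordance before, here is a minimal keyword-in-context sketch in Python. It is not the tool built into the lab; the `concordance` function, its tokenisation rule and the `width` parameter are assumptions made for illustration.

```python
import re


def concordance(text, keyword, width=2):
    """Return (left context, match, right context) for each hit of keyword.

    Tokens are lower-cased runs of word characters; width is the number of
    tokens of context kept on each side.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    sample = "The cat sat on the mat. The dog sat too."
    for left, match, right in concordance(sample, "sat"):
        # Align the keyword in a column, as concordance displays usually do.
        print(f"{left:>12} [{match}] {right}")
```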

API Access: Get a key

 


All the data and discovery services available in the HCS vLab are available via the API, the Application Programming Interface. This is one of the key features of the lab; coupled with the stable, reusable Item Lists we looked at above, it is the key to opening up data in new ways.

Copy-paste access to data

 


Programmers are able to use Item Lists in custom code. The discovery interface makes this as easy as copy and paste (assuming that you’ve installed the software libraries you need in your programming interface).

The API respects the access control of data collections in the lab, via a per-user private API key.

The ultimate aim of the lab is to make sure that you can do everything via the API.
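A per-user key is typically sent along with each request. The Python sketch below shows one way that could look using only the standard library. The server address, the `/item_lists/` URL layout and the `X-API-KEY` header name are assumptions made for this example, so copy the actual values from the lab's own interface rather than from here.

```python
import json
import urllib.request

# Placeholder values: substitute the address and key shown in your account.
BASE_URL = "https://vlab.example.org"  # hypothetical server address
API_KEY = "YOUR-PRIVATE-API-KEY"       # hypothetical key


def build_item_list_request(item_list_id, base_url=BASE_URL, api_key=API_KEY):
    """Build (but do not send) an authenticated request for one item list."""
    url = f"{base_url}/item_lists/{item_list_id}"  # assumed URL layout
    headers = {
        "X-API-KEY": api_key,          # assumed header name for the key
        "Accept": "application/json",  # ask for a JSON response
    }
    return urllib.request.Request(url, headers=headers)


def fetch_item_list(item_list_id):
    """Send the request and decode the JSON body (needs a real server and key)."""
    with urllib.request.urlopen(build_item_list_request(item_list_id)) as response:
        return json.load(response)
```

Keeping the key in a header rather than in the URL means it is less likely to end up in shared links or logs, which matters because the key carries the user's access rights to licensed collections.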

Workflow: Galaxy

 


Like other NeCTAR virtual labs, HCS vLab uses the Galaxy workflow engine to allow researchers to construct research workflows that can be run over different data sets, and share them with others. In phase two of the project we will work on reusability and reproducibility with research teams.

Chain processes on Item Lists – eg [Tokenizer] -> [Frequency List]

 


This simple demo shows how an Item List is used in Galaxy, where multiple operations can be performed on data with each step feeding its result into the next one. This allows for more flexibility than the simple ‘canned’ tools built into the HCS vLab data web site.
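The chaining idea can be sketched outside Galaxy as two small Python functions, where the output of the first becomes the input of the second. This is an illustration only: the function names and the tokenisation rule are assumptions, and the Galaxy tools used in the demo may behave differently.

```python
import re
from collections import Counter


def tokenize(text):
    """Step 1: split text into lower-case word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens):
    """Step 2: turn a token stream into (token, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def run_chain(documents):
    """Feed every document through the tokenizer, then into the frequency step."""
    tokens = []
    for document in documents:
        tokens.extend(tokenize(document))
    return frequency_list(tokens)


if __name__ == "__main__":
    # In the lab the documents would come from an Item List; these are made up.
    for token, count in run_chain(["The cat sat.", "The cat slept."]):
        print(token, count)
```

Because each step only depends on the output of the previous one, either step can be swapped for another tool without touching the rest of the chain.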

A similar approach will be used for audio and video data allowing researchers to create item lists, set some parameters and run analytical processes, resulting in new data sets or graphical plots.

 


This particular workflow processes audio data using the PsySound3 software toolkit (which was developed by Densil Cabrera, Emery Schubert and others to analyse sound recordings using physical and psychoacoustical algorithms). Given any audio recording as input, this workflow will perform an FFT (Fast Fourier Transform), Hilbert transform and Sound Level Meter analysis on the audio file and plot a graph for each one.

 


The second screenshot shows the plot resulting from one of the analyses performed – the FFT. The input audio file used in this example came from the Mitchell-Delbridge corpus (a database containing the recordings of Australian English as spoken by 7736 students at 330 schools across Australia, mostly collected in 1960).
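To give a feel for what an FFT step computes, the following Python sketch implements a textbook recursive radix-2 FFT and uses it to find the dominant frequency of a synthetic tone. It is not the PsySound3 implementation, and the signal is generated in code rather than taken from the corpus; the function names, sample rate and tone frequency are all chosen for illustration.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; the length must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])  # transform of the even-indexed samples
    odd = fft(samples[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def peak_frequency(samples, sample_rate):
    """Return the frequency in Hz of the strongest non-DC spectral bin."""
    spectrum = fft(samples)
    half = len(samples) // 2  # bins above this mirror the lower half
    magnitudes = [abs(value) for value in spectrum[:half]]
    peak_bin = max(range(1, half), key=lambda k: magnitudes[k])
    return peak_bin * sample_rate / len(samples)


# A synthetic 1000 Hz sine tone sampled at 8000 Hz (illustrative values).
SAMPLE_RATE = 8000
N = 256
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(N)]

if __name__ == "__main__":
    print(peak_frequency(tone, SAMPLE_RATE))
```

Plotting the magnitudes against frequency would give a graph of the same kind as the one in the screenshot, with a single spike for this pure tone.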

 


The lab also allows annotation of all kinds of materials including text and time-based media. At the moment this is accessible via the API.

Reuse

 


This project uses a number of open source components to scaffold the lab, including:

Hydra, which is a Ruby on Rails framework wrapping:

Apache Solr and Blacklight

The Duraspace Fedora Commons repository

Apache Solr, via the Blacklight project

Workflow courtesy of Galaxy: “Data intensive biology for everyone.” 🙂


 


Call out to the partner network! If you are affiliated with one of the institutions above then we need your help to test the laboratory. We have money in the budget for testing by Higher Degree Research Students and potentially other testers. It’s worth getting to know the lab because of the huge potential it offers for new kinds of research, bringing together data that has not been joined up before with tools that have been hard to access and use. Research is changing: funders are requesting that data management plans be in place, and signalling that Open Access to data is likely to be mandated alongside Open Access to publications. The lab aims to assist researchers in embracing this change.

Photo credit: http://nla.gov.au/nla.pic-an7697018-3

 


This is a picture of a dog*.

But what is it doing in this presentation?

(No it’s not a virtual lab, it’s just a lab).

At the presentation I (Peter) asked the audience what they could tell about this dog from the picture. There were two interesting answers. Firstly, it is male, which might be of interest in a biological virtual lab. Secondly, someone said it is ‘happy’, which might be of interest in a Human Communication Science lab. A set of images like this may be an appropriate addition to the lab, for studying how people react to non-human faces. Denis Burnham, who leads this project, is a psychologist and has been exploring ways that Alveo could be used to store re-usable sets of stimuli used in experiments, which are typically collected for a particular study and not made available for re-use.

The idea of looking at dog pictures is something I made up, but in phase two of the virtual lab, starting mid 2014, one of the tasks is to set up a board to approve the addition of new data sets. They will be able to answer the question: dog pictures or no?

* (it’s Daniel de Byl’s dog Merlin, and he took the photo which is used here with permission).

Phase two

 


Phase two of this project is about realizing the potential of the lab. We know what we will be doing to

We’ll be:

* Promoting its use to researchers and research communities via a variety of outreach activities

* Supporting the lab via a combination of the UWS Service Desk and the AERO eResearch body

* Continuing development of new features

* And most importantly, we’ll be working on a sustainable model for the future. (Can it live-on through grants? Subscriptions? Partnerships with other similar projects?)

Finally: User feedback

“I really liked using the system and the instructions were very easy to use and the system easy to navigate. […] This platform would be very useful for my research.”

–Tester

 


“It seemed pretty easy to use after I got used to the two platforms. I would certainly like to use it in my future research!”

“It is looking very good. Lots of possible uses and a nice interface.”

“The platform is easy to use and has the great potential to help with Linguistic research and wide applications in other areas…”

“The system seemed to be quite user-friendly. As first I was relying on the manual, however when the manual became more streamlined with less details, the system was still easy to follow.”

“This is a powerful tool and I think it is pretty good.”

“I think it’s quite easy to use. […] Generally the platform is very clearly organised, and user-friendly.”

“The platform overall is very good.”

“Very nice platform with great user interface!”

“A very promising and impressive setup so far!”

“I’m impressed with the platform – it’s smooth and the interface is very intuitive.”
