Successful proposal for NeCTAR Virtual Lab funding: Above and Beyond Speech, Language and Music: A Virtual Lab for Human Communication Science (HCSvLab) by The University of Western Sydney is licensed under a Creative Commons Attribution-NoDerivs 3.0 Unported License.
[This is the successful proposal document for the Virtual
Lab for Human Communication Science (HCS vLab), with some financial and personnel
details redacted. We’re putting it here as an introduction to the project and
to kick off the project blog – Peter Sefton 2012-12-13]
Section A. Header Details
A.1 Program and Proposal Title
Program: Virtual Laboratory
Title: Above and Beyond Speech, Language and
Music: A Virtual Lab for Human Communication Science (HCS vLab)
University of Western Sydney
Professor Denis Burnham
Director, Marcs Institute
University of Western Sydney
Locked Bag 1797, Penrith South DC NSW 2751
(02) 9772 6681
(02) 9772 6040
A.2.2 Participating Organisations
Organisation / Group Name
System Administration Communications
Application maintenance and further development
A.2.3 Project Funding Summary
EIF Funds Requested
Total Co-investment Offered
Section B. Proposal Summary
Australia plays a strong and prominent international role in Human Communication Science
(HCS), encompassing research in speech science, speech technology, computer
science, language technology, behavioural science, linguistics, music science,
phonetics, phonology, and sonics and acoustics. However, this position is in
Research Effort and Expertise: In Australia we have excellent
researchers analysing their own corpora of data using their own analysis
tools in relative isolation. Yes, these researchers meet and share their
knowledge at national/international conferences, but (a) relatively
infrequently and (b) in discipline-centred
and Inefficiency: Research conducted in isolation entails local
unshared mark-up or augmentation of local data sets, and inefficient repetition
of search, information retrieval, annotation, and analysis using tools that are
usually home-grown, inaccessible (e.g., idiosyncratic command line execution)
In order to keep abreast of the modern pace
of research, to leverage the available yet unrealised interdisciplinary
challenges and opportunities, to go beyond the isolated
desk-PC-lab-university-bound model of research – a quantum leap in the way in which we do our Human Communication
Science is absolutely essential. We must move into a research environment that
will eradicate the waste involved in repeated unshared analyses; ignite the
research spark that affords the serendipity of new tool-corpus combinations;
and dramatically improve scientific replicability by moving local and
idiosyncratic desktop-based tools and data to an easy access, in-the-cloud,
public, replicable environment that standardises, defines, and captures
procedures and data output (see, e.g., www.myexperiment.org/). We
must connect HCS researchers, and their desks, computers, labs, and
universities in order to build upon the achievements so far, produce emergent
research knowledge, and instil this approach in our new interdisciplinary PhD
Fortunately, but not without a good deal of
forward planning, the common ground for
such a Virtual Lab has been carefully prepared. The 2004-2009
ARC-funded UWS-administered Human Communication Science Network (HCSNet) identified
a community of over 1000 Australian Human Communication Science (HCS)
researchers, engaged that community in over 60 different workshops, seminars
and conferences, integrated that community with like international communities,
the success rate of Australian HCS grant applications, including
significant multi-site, interdisciplinary projects such as the ‘Thinking Head’
and the ‘Big Australian Speech Corpus’. The focus that binds this
community is the manner in which humans communicate with each other and with
computers and machines via codified means — speech and text, music and
The HCSNet experience made
it clear that there is a thriving HCSNet community with vast potential for cross-disciplinary
research connectivity (i) to provide new insights into old problems by
approaching them from different disciplinary perspectives or with a hitherto
untried method, and (ii) to apply novel combinations of old ideas or methods of
analysis from different disciplines to new problems (see for instance a
compendium of 30 HCSNet research papers, Dale, Burnham & Stevens, 2011). On
the other hand, the HCSNet experience has also made it abundantly clear that
one of the main impediments to the quantum leap that is required for HCS
research to bloom is the difficulty for a researcher from one discipline to
apply the tools and techniques of another discipline, or to explore data
collected under one paradigm via a completely different analytical perspective.
HCSNet created the intellectual space between universities, disciplines, paradigms
and methods for HCS research to flourish. Above
and Beyond HCSNet, the ‘HCS vLab’ will create the virtual space for
infrastructure that enables easy access to shared tools and data, and overcomes
the resource limitations of individual desktops. It will be a virtual
laboratory that makes it possible for HCS researchers from a diverse range of
disciplines to access an amalgamation of existing data sets (corpora)
and selected analytical tools collected and generated from their midst and
then, in this project, put into the cloud. Most
importantly, it will enable the guided use of workflow tools and options
to allow researchers to cross disciplinary boundaries.
Consider the Music Researcher who wishes to
analyse auditory-visual cues that Indonesian singers provide to elicit
particular emotions in their audience. Our researcher has in mind some
instances of particular songs, but then where to? How can auditory-visual
records of appropriate songs be found, the words transcribed, and the phonetic
nuances annotated and aligned with the auditory-visual record and the emotion
in the face at any particular time? In HCS vLab a consolidated cross-corpus search
would produce candidate songs (probably mainly from the PARADISEC corpus in this case); a
transcription tool such as Transcriber would supply the text, to be analysed
with the ParGram grammar for
Indonesian; the DeMoLib tool would supply the emotion information; and EMU would afford
a prosodic analysis (usually reserved for speech) of the songs. Without HCS
vLab our researcher would have had to access PARADISEC, run a search, look around for tools to analyse the results,
and would probably be unaware of the range of tools that would be of use. With HCS
vLab our researcher has searched a complex of corpora, and been guided through
a complex of analytical tools and provided with a rich multimodal description
of relevant songs. Moreover, the resultant metadata can be stored, as can the
sequence of operations, and later be applied to other songs; the HCS vLab will
house an ever-expanding enriched data set that will endure, not only for this
researcher but for any of the national and international HCS researchers to use
and augment. This example demonstrates various strengths of the vLab –
multimodal, cross-cultural data and analyses, affective and lexical material,
and focus on unique features of the Asia-Pacific region. The same process can
be applied to any of the data available through the HCS vLab, with similar or
quite different emphases. While this example is somewhat esoteric, more
mainstream applications include research supporting automatic speech
recognition (taxi ordering, directory assistance etc.), hearing aids and
cochlear implants, interactive learning programs for children with learning
disabilities, automatic melody recognition, forensic determination of origin
and background of particular accents, computer-based information retrieval
based on music, speech, sounds, or visual patterns, and psycholinguistic
studies of second language learning and pedagogy.
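The idea that the sequence of operations can be stored and later applied to other songs can be pictured with a minimal Python sketch. Everything in it is hypothetical: the `Workflow` class, the step names and the toy functions merely stand in for real tools such as Transcriber or EMU, and nothing here is drawn from an actual HCS vLab design.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# A step takes an item's annotation dictionary and returns an enriched copy.
Item = Dict[str, Any]
Step = Callable[[Item], Item]


def transcribe(item: Item) -> Item:
    # Placeholder for a transcription tool: the "audio" here is already text.
    return {**item, "words": str(item["audio"]).split()}


def count_words(item: Item) -> Item:
    # Placeholder analysis step that relies on the transcription step.
    return {**item, "n_words": len(item["words"])}


@dataclass
class Workflow:
    """Records named steps so the same sequence can be replayed later."""
    steps: List[str] = field(default_factory=list)
    registry: Dict[str, Step] = field(default_factory=dict)

    def add(self, name: str, func: Step) -> "Workflow":
        self.registry[name] = func
        self.steps.append(name)
        return self

    def run(self, item: Item) -> Item:
        # Apply every recorded step in order, carrying annotations forward.
        for name in self.steps:
            item = self.registry[name](item)
        return item


workflow = Workflow().add("transcribe", transcribe).add("count_words", count_words)
first = workflow.run({"id": "song-1", "audio": "la la la"})
second = workflow.run({"id": "song-2", "audio": "do re mi fa"})
print(workflow.steps, first["n_words"], second["n_words"])
```

Because the workflow stores only the ordered step names and a registry of functions, the same recorded sequence can be run on a second song without the researcher re-assembling it by hand.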
The HCS vLab framework will be built by
Intersect, a leading eResearch organisation in NSW with a proven track record
in building research software infrastructure. Intersect will be responsible for
the technical design of the vLab and will manage the software development
lifecycle. The Intersect Project Manager will report to the HCS vLab Project
Manager; the project will be governed by the HCS vLab Steering Committee
consisting of a blend of senior stakeholders from partner organisations along
with a number of technical advisors. The vLab will be delivered in 3
development releases over 14 months, and will incorporate 11 tools and 7 corpora.
HCS vLab will be built upon the ground prepared by HCSNet and the raw
cross-disciplinary potential that HCSNet seeded. Corpora have been collected
and tools have been developed by a range of technically-skilled HCS
researchers, and each begs for a common ground, but due to the local needs of
individual researchers and the impossibility of one lab or even one institution
providing the required person- and computational-power these corpora and tools
are isolated on PCs and servers around the country. HCS vLab will:
integrate 7 existing corpora,
e.g., the 100TB ClueWeb corpus; the 3000-hour ARC-LIEF-funded AusTalk
audio-visual speech corpus, the ANDS-funded Australian National Corpus (text
& speech), the PARADISEC corpus of speech, text & song in indigenous
& endangered languages, and the AMC corpus of Australian music.
integrate 11 existing tools, e.g., Natural Language
Tool Kit (NLTK) for text analytics; EMU for speech analysis and interactive
waveform labelling; PsySound3 with physical and psychoacoustic algorithms for
sound & music analysis; Johnson-Charniak parsers to generate parse trees;
& DeMoLib, for lip-tracking video analysis.
These integrations are only possible because of the cohesion of a >1200-strong HCS
research community who eagerly await the development of a vLab that will allow
full realisation of the collaborative power of HCS research; a community that
includes: Bruce Croft & Mark Sanderson (RMIT), top Information Retrieval
(IR) researchers in the world; Denis Burnham & Catherine Stevens (UWS,
speech & music experts, respectively), leaders on large ARC projects –
HCSNet, the Thinking Head, and for Burnham, the AusTalk corpus; Janet Fletcher,
Andy Butcher & Marija Tabain (UniMelb, Flinders, LaTrobe), world experts in
Australian indigenous languages; Steve Cassidy, Mark Johnson, Robert Dale
(Macq), and Steven Bird (UniMelb) who are at the hub of language technology
research in Australia; and Michael Wagner (UC), Roberto Togneri, Mohammed
Bennamoun (UWA) & young stars Trent Lewis (Flinders) & Roland Göcke
(UC), who are pioneering AV face and voice analysis.
Over and above the realisation of a heuristic HCS research environment, HCS vLab will be:
accessible: the variety of HCS tools will be
accessible to non-technical researchers via workflow tools, stored protocols,
and interactive GUIs, while retaining capacity for more sophisticated analyses.
the generic HCS vLab infrastructural
support will allow incorporation of HCS corpora from various platforms (e.g.,
Windows, Mac OS, Linux), and interoperability with other major systems
at home, e.g., the already-funded
NeCTAR Humanities National Infrastructure (HuNI) vLab, and
internationally by virtue of our Product Owner’s intimate domain knowledge and
wide collaborations.
sustainable: 13 universities, 3 organisations, and 47
key investigators have provided $423K in cash and $1.9M in-kind (1.7 times greater than
the request to NeCTAR; with 5 Gold, 5 Silver, 4 Bronze
& 2 Other members – see our Partner Model, Appendix E) with partner cash
supporting sustained operational development and development of capabilities
and reach including future plug-in of additional tools and corpora, and
specialist user support well beyond the formal conclusion of HCS vLab construction.
Without the HCS vLab the promise of HCSNet and the
careful priming of the HCS community will have been in vain. The HCS vLab
is the natural progression for a strong, receptive and ambitious research
community that is on the move and wants to keep moving. It will create the
avenue by which HCS research can make a quantitative and qualitative leap to
enhanced capability, collaboration and output, to travel well beyond the
geographical confines of individual labs, and well above the disciplinary
confines of speech, language, music, or sonics alone into an interdisciplinary
and heuristic Human Communication Science cloud-space.
for Public Release
Applications in automatic speech recognition (taxi ordering, directory
assistance etc.), hearing aids and cochlear implants, interactive learning
programs for children with learning disabilities, automatic melody recognition,
forensic determination of origin and background of particular accents,
computer-based information retrieval based on music, speech, sounds, or visual
patterns, psycholinguistic studies of second language learning and pedagogy all
depend upon research in Human Communication Science (HCS). Human Communication Science encompasses
the areas of speech science, speech technology, computer science, language
technology, behavioural science, linguistics, music science, phonetics,
phonology, and sonics and acoustics. In turn HCS research depends upon datasets
(corpora) of speech, music, text, faces, sounds, and specialised tools by which
to search, analyse and annotate these data. Australia boasts a strong and
active community of HCS researchers who have developed a wealth of corpora and
tools relevant to HCS research. However, these researchers tend to analyse their
own corpora of data using their own analysis tools in relative isolation. Yes,
these researchers meet and share their knowledge at national/international
conferences, but (a) relatively infrequently and (b) in discipline-centred
While HCS research in Australia is
blooming, especially due to the highly successful Australian Research Council
funded HCS Network from 2005-2009, and related research projects, research conducted
in isolation entails inefficient repetition of analysis of local data sets. HCS
research in Australia, and successful further real-life applications, requires going
beyond the isolated desk-PC-lab-university-bound model of research into a new research
environment. Such an environment will eradicate the waste involved in repeated
unshared analyses; ignite the research spark that affords the serendipity of
new tool-corpus combinations; and dramatically improve scientific replicability
by moving corpora and tools and the analyses conducted with these into an easy
access, shared, in-the-cloud, public, replicable environment.
The HCS virtual Laboratory (HCS vLab) will
connect HCS researchers, their desks, computers, labs, and universities and so
accelerate HCS research and produce emergent knowledge that comes from novel
application of previously unshared tools to analyse previously difficult to
access data sets. The HCS vLab infrastructure will overcome resource
limitations of individual desktops; allow easy access to shared tools and data;
and provide the guided use of workflow tools and options to allow researchers
to cross disciplinary boundaries.
The HCS vLab will be:
accessible to non-technical
researchers via workflow tools, stored protocols, and interactive GUIs, while
retaining capacity for more sophisticated analyses;
by incorporating HCS corpora from various platforms (e.g., Windows, Mac
OS, Linux), and ensuring compatibility with other major systems in
Australia and internationally by virtue of our Product Owner’s intimate domain
knowledge and wide collaborations; and
sustainable: 13 universities, 3 organisations, and 47
key investigators have provided $423K in cash and $1.9M in-kind to support
sustained operational development and development of capabilities and reach
including future plug-in of additional tools and corpora, and specialist user
support well beyond the formal conclusion of HCS vLab construction.
The HCS vLab will create the avenue by which HCS
research can make a quantitative and qualitative leap to enhanced capability,
collaboration and output, to travel well beyond the geographical confines of
individual labs, and well above the disciplinary confines of speech, language,
music, or sonics alone into an interdisciplinary and heuristic Human
Communication Science cloud-space.
HCS vLab will be built by and
for the Human Communication Science research community. The Human Communication
Science community was formally established with the creation of HCSNet, and HCS
researchers are active in various professional organisations as well as in
Universities and Research Organisations, as set out below.
1. The Network for Human Communication Science (HCSNet)
The HCSNet community is a
broad-reaching interdisciplinary mix of researchers from right across
Australia, whose research spans speech, language and music and sonics. The
community banded together and obtained funds from the Australian Research Council (ARC) for an ARC Research
Network (RN0460284, 2004-2009, $2M, see http://www.hcsnet.edu.au/), a network that greatly
facilitated research and research collaboration in the community of more than 1200
HCS researchers. HCSNet aimed to build Australia’s reputation as a
leader in communication science and technology via advances in its priority
areas of ‘Speech’, ‘Effective Human-Computer Interfaces’, ‘Next Generation
Search Technology’, ‘Human Communication Disorders’, and ‘Human and Machine
Perception and Action’. This succeeded: of the 47 researchers on this Virtual
Lab bid, all but the 5 researchers who were not in Australia at the time were
HCSNet members; and as a result of collaborations formed and projects hatched
in the more than 60 HCSNet workshops and conferences during the life of HCSNet, the HCSNet community spawned and
continues to incubate various multidisciplinary multi-institutional projects,
such as the following:
The Thinking Head (ARC/NH&MRC Special Initiatives, TS0669874,
2006-2012, thinkinghead.edu.au/) – Research on auditory-visual
speech, dialog, speech and speaker recognition, human-machine interaction,
avatar and robot development. 11 Chief Investigators (CIs) from 6 Australian
universities; 4 Partner Investigators (PIs) from 3 international institutions.
Forensic Voice Comparison (ARC Linkage, LP100200142, 2010-2013) –
Making demonstrably valid and reliable forensic voice comparison a practical
everyday reality in Australia. 3 UNSW CIs, partners in Spain and China,
industry partners including Australian Federal Police (AFP), National Institute
of Forensic Science Australia (NIFS), NSW and Queensland Police and others.
The Big ASC
(Australian Speech Corpus, ARC LIEF, 2010-2012, LE100100211, austalk.edu.au/) – Large (1000 speakers x
3 hours speech) auditory-visual speech corpus from 17 sites across Australia.
29 CIs, 11 Australian
universities in every state of Australia, 1 international PI.
DADA-HCS (ARC SRI e-Research Support,
2005-2006, SR0567319) – Distributed Access & Data Annotation
for Human Communication Sciences. 9 CIs, 5 Australian universities and institutions.
Other funded projects involving
HCS researchers preceded HCSNet and set the scene for the growing zeitgeist in
human communication science, e.g., See Hear! The Multimodal Recording and
Analysis Facility – new interfaces for analysis of complex visual
and auditory scenes, and creation of a research tool for sound and music
analysis (ARC LIEF: LE0668448, 2006, 12 CIs from 4 Australian
Such projects have cemented
HCS cross-disciplinary links by focussing research effort and harnessing
cross-disciplinary research expertise. The result is a mature Human Communication
Science research community au fait with the approaches and strengths of
other disciplines, but a community yearning for a vehicle to transport these
approaches to new lands.
2. The Australasian Speech Science & Technology Association (ASSTA)
ASSTA (est. 1988) is the peak
speech science and technology body in Australasia, and the meeting ground for
engineers, computer scientists, cognitive scientists, psycholinguists, language
technologists, phoneticians, linguists, forensic speech scientists, and speech
pathologists via instruments such as its biennial Australasian Speech Science
& Technology conference and its international counterpart, Interspeech,
which ASSTA hosted in 2008. ASSTA has financially backed HCS endeavours
(HCSNet; Forensic Voice Comparison; the Big ASC above), and will financially
and academically support the current proposal. Along with other professional
bodies (see B.8), ASSTA supports HCS research in Australia via research and
conference funding (especially for early career researchers).
3. HCS Community members in Universities and Research Organisations
University of Western Sydney (UWS): UWS, and in particular the
Marcs Institute, has played a leading role in the establishment of the HCS community; it was the lead institution on
the HCSNet, Thinking Head, See Hear!, and Big ASC projects (see above), and
provided and continues to provide cash and infrastructure support for these and
like projects. UWS is committed to eResearch, and is a strong supporter of
interdisciplinarity and eResearch initiatives. UWS will provide $175K cash and
$528K in-kind and ongoing support for the project, support that can only
facilitate HCS vLab uptake among HCS researchers. UWS and Marcs Institute will
continue to play a leading role in maintaining the HCS community through its
contribution to research, its lead role in this project, and its promotion of the
use of HCS vLab. Marcs Institute, elevated
from a UWS Centre to one of its 4 Institutes in 2011, is led by the
Project Lead on this bid, Prof Denis Burnham. Marcs comprises 51 researchers and 24 higher degree students, who
conduct behavioural, neuroscience, and computational research on human-human
and human-machine communication in normal, heightened and degraded contexts in
5 programs: Speech & Language, Music Cognition & Action, Bioelectronics
& Neuroscience, Multisensory Processing, and Human-Machine Interaction; and
in ARC FOR code areas 1701, 1702, 2004, 1904, 0801, 0906, and 0903. Marcs has
current public funding of $6,664,807, comprising 6 ARC Discovery grants, 1
Discovery Early Career Research Award grant, 1 ARC/NHMRC Special Research
Initiative, and 1 ARC Linkage Infrastructure, Equipment and Facilities grant
across a range of areas that will use and promote the HCS vLab in areas such as
auditory-visual speech and
cognitive processing; speech perception, regional accents, reading and language
acquisition; reverse engineering of the brain; acoustic factors in music
perception; human-machine interaction; and corpus studies. Moreover, Marcs has established collaborations with
researchers from psychology, linguistics, music, education, computer science,
engineering, and various industry partners, in over 20 major research
institutes and over 30 additional individuals in Australasia, North America,
the UK, and Europe, which will further add to the reach of HCS vLab to the HCS
Further Universities and Research Organisations: Macquarie
University and the lead institution, UWS,
have a long history of collaboration and project co-leadership. The convenor of
HCSNet, Prof Robert Dale, directs the Macquarie Centre for
Language Technology and was a senior investigator in the Thinking Head, Big ASC
and DADA projects; A/Prof Felicity Cox and HCS vLab Product Owner, Associate
Prof Steve Cassidy are major players in the Big ASC project. Together they make Macquarie
a major node of HCS research. Other universities in the project all have a
history of involvement in HCS research through HCSNet, one or more of the above
multi-disciplinary projects, and in their own HCS projects. These HCS community
universities contribute to HCS research in the specialist areas as set out
hereafter. ANU: Phonetics,
Indigenous Languages; Canberra:
Speech Forensics, AV Speech and Speaker Recognition; Flinders: Computer Science and AV speech; Melbourne: Engineering, Phonetics; Sydney: Indigenous Song, Music, Speech Pathology; Tasmania: Cognitive Science,
Psychology; UNSW: Speech Science, Music, Emotion; UWA: Speech and Speaker
Recognition; RMIT: Information
Retrieval (IR) for very large databases, dialog and multi-agent systems, models
in AI and computer science; UNE:
Psycholinguistics, regional languages, logic in child language; LaTrobe: Language diversity, minority
languages, Australian indigenous languages, data-oriented and theory-oriented
approaches; NICTA: Machine
Learning and NLP.
The wide geographical and disciplinary spread of the 14 partner research
organisations across Australia, along with the active research profiles of the
47 individual researchers in the bid, with their ongoing Higher Degree Research (HDR) student load and numerous significant national and international
collaborations, provide a strong scaffold of support and uptake for this
1. Intersect Australia
Intersect Australia Ltd is a not-for-profit company limited by guarantee, owned and
funded by its members, the universities in NSW, state government departments,
and other organisations undertaking research in NSW.
Intersect has a strategic focus on national research infrastructure. Intersect is a
member of The National Computational Infrastructure (NCI), and the Australian
Access Federation (AAF). Intersect has undertaken and is undertaking many
projects deploying data capture and management solutions for the Australian
National Data Service (ANDS). The software Intersect develops integrates with
infrastructure provided through these bodies.
Since its establishment, Intersect has demonstrated that it is one of Australia’s
leading eResearch organisations in having the capability and capacity for
undertaking eResearch projects.
Capacity and Capabilities
Intersect has approximately 50 staff. It has
established a capacity and capability to develop, deploy and support
substantial and complex eResearch infrastructure that is unique in Australia.
This capability is built on a company culture which emphasises a focus on the
client and on engineering excellence. Intersect has built a team that delivers
eResearch solutions on time, on budget and of value.
Intersect’s Engineering Division brings together
many years of commercial experience in developing large scale IT systems across
many sectors such as academia, government, banking and enterprise security
tools. The team of 30 staff includes user interface designers, specialist test
engineers, software and systems engineers and project managers.
Intersect’s Services Division staff have
backgrounds in publicly funded research, commercial research and development,
and commercial information technology service provision. The team of eleven
staff is responsible for outreach and engagement before, during and after
development commences. They provide capability to carry out stakeholder
management, requirements gathering, and product ownership.
The Operations Division, comprising five staff,
has been centrally involved in systems integration projects and the
transitioning of projects from development to commissioning and ongoing
Track record and relevant experience
Intersect has undertaken and successfully
delivered many eResearch projects (at the time of writing approximately 25),
including projects with development and integration budgets in excess of $1
million. These projects provide solutions and infrastructure to research
efforts across a range of disciplines. These projects include:
analysis and integration projects for many NCRIS capability areas (e.g. AMMRF, PHRN, AAL)
analysis and development projects for non-university research bodies (ANSTO, NSW Office of Environment and Heritage)
software development projects funded through PfC capabilities (e.g. 11 ANDS data capture projects for 4 universities)
software development projects funded directly by our membership (e.g. Rainfall)
strategic software development projects funded by Intersect (e.g. Genomic Data Repository, Australian Schizophrenia Research Bank)
In the vocabulary of NeCTAR, a number of these
projects would fit within the parameters of eResearch tools (e.g. ANDS data
capture projects, ASRB) or virtual laboratories (e.g. PHRNi). Intersect is
currently delivering six ANDS-funded projects for its members.
Please see the attached letters of support for Intersect’s track record as a
Approach to quality standards
Intersect has not sought formal certification under any standards (e.g. ISO 9000).
Intersect follows a three part method for achieving quality:
“Say what you are going to do”. Starting from
the concept stage of this project, and continuing throughout the project, Intersect
keeps the customer informed of what they are doing and how they are doing it.
Intersect has processes covering Consultation, Business Analysis, Project
Management and Software Engineering. Intersect works with clients to tailor
these to suit their needs and the project’s needs.
“Do what you said you were going to do”. Intersect
follows their processes. If issues are encountered then Intersect talks to the
customer to find agreeable solutions.
“Prove it”. Intersect keeps the smallest quality
record possible, as documented in a quality management plan.
This plan is written during the elaboration stage of development (see Item B.15), in
conjunction with the stakeholders and NeCTAR.
Support and warranty mechanisms
Intersect issues a formal 3-month warranty for all projects; the warranty commences after
user-acceptance testing has completed. During this period all defects are fixed
at no cost to the customer. Intersect’s on-going defect rate is less than one
new defect discovered per month, across 25 deployed systems. In practice, the
defect rate has been low enough that Intersect has fixed the majority of
defects that come to light after the warranty period has expired. Additional
feature requests are carried out on a fee-for-service basis.
RMIT. The Information
Retrieval (IR) group in the School of Computer Science & IT at RMIT
University is recognised as an international leader in the development of
search engines: producing 28 A or A* journal/conference papers in the last ERA
period. The ISAR group has extensive experience building open source search
engines, creating the Zettair and MG systems, both widely used. In a separate
strand of research, the group also evaluates search engine effectiveness. Based
on citations to its last five years of publication outputs, the RMIT IR group
was placed in the world’s top 15 IR research groups by Microsoft’s Academic
Search (http://academic.research.microsoft.com/); and No. 1 in Australia. Three
members of the RMIT group will contribute.
NICTA. NICTA is
Australia’s centre of excellence for ICT research, with over 600 researchers and
students, including many world leaders in their fields. NICTA’s mission is to
deliver outstanding ICT research outcomes, and to create wealth for Australia
through the application of that research. NICTA performs research in a number
of areas, including Machine Learning and Control and Signal Processing, which
include outstanding researchers in Language Technology and Text Retrieval.
NICTA’s Engineering and Technology Development team is expert at transitioning
research into innovative technology solutions to real problems.
RMIT/NICTA Software Engineering. An important
part of the HCS vLab’s toolkit will be the ability to assemble “processing
pipelines” involving multiple tools processing the same data sequentially.
Based on the combined expertise of the NICTA and the RMIT staff involved in
this project, a Research Engineer will be employed via sub-contract to help the
Intersect engineering team build a flexible component architecture for the HCS vLab
compatible with UIMA, an emerging standard for wrapping components for
processing language, speech, video, and other unstructured data.
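A processing pipeline of this kind, in which several tools work on the same data one after another, can be sketched in a few lines of Python. This is an illustration only and does not use the UIMA API: the `Document`, `Tokeniser`, `CapitalTagger` and `run_pipeline` names are hypothetical, and the two toy components merely stand in for real speech or language tools. The point it shows is that each component reads the annotations left by earlier components and adds its own layer to a shared container.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol, Tuple

# An annotation is a (start, end, label) span over the document text.
Span = Tuple[int, int, str]


@dataclass
class Document:
    """Shared container: the raw text plus named layers of annotations."""
    text: str
    layers: Dict[str, List[Span]] = field(default_factory=dict)


class Component(Protocol):
    """Anything with a process() method can be plugged into the pipeline."""
    def process(self, doc: Document) -> None: ...


class Tokeniser:
    """Adds a 'token' layer by splitting the text on whitespace."""
    def process(self, doc: Document) -> None:
        spans: List[Span] = []
        pos = 0
        for word in doc.text.split():
            start = doc.text.index(word, pos)
            pos = start + len(word)
            spans.append((start, pos, word))
        doc.layers["token"] = spans


class CapitalTagger:
    """Reads the 'token' layer and adds a 'capitalised' layer."""
    def process(self, doc: Document) -> None:
        doc.layers["capitalised"] = [
            span for span in doc.layers["token"] if span[2][:1].isupper()
        ]


def run_pipeline(doc: Document, components: List[Component]) -> Document:
    # Each component sees the annotations left by the ones before it.
    for component in components:
        component.process(doc)
    return doc


doc = run_pipeline(Document("Speech and Music"), [Tokeniser(), CapitalTagger()])
print(doc.layers["capitalised"])
```

Keeping the components behind one small interface is the design choice that makes new tool-corpus combinations cheap: a further tool only needs a `process` method that reads and writes layers on the shared document.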
Operational Organisation Profile
Intersect Australia Ltd is also the Operational Organisation.
Capacity and capabilities
Intersect provides hosting, operations, outreach, L&D and support for our members’
and affiliates’ eResearch needs. Intersect hosts its production infrastructure
with commercial hosting partners ‘ac3’. The hosting is located at the Global
Switch data centre in Ultimo. Intersect hosting partners provide managed
services, including backup, system monitoring and logging, and core network
connectivity for all services. The Intersect systems administration team
performs network management, user-level support and troubleshooting, and
general systems administration. Intersect has an access agreement with AARNet;
all systems hosted by Intersect are “on-net”. Issues with services hosted at
ac3 are notified to the head of Intersect’s systems administration team.
Operations, advocacy, L&D and support are built on a
full-time team of ten eResearch Analysts, an HPC specialist, a data management
specialist and five systems administrators. The team is currently responsible for:
Operations of and merit-based allocation to HPC
facilities, both through Intersect’s own McLaren service, hosted at ac3, and
Intersect’s partner share of NCI.
Hosting data and applications on behalf of Intersect
members’ researchers. We host off-the-shelf applications, customized
open-source applications, bespoke software developed by Intersect, data
accessible via the DataFabric, and data management systems. Hosted systems
include Confluence, Jira, OpenClinica and OpenCDMS.
Hosting hardware on behalf of Intersect clients,
for example the SAX Institute.
Providing first-tier support for HPC users, as
well as users of national services such as the ARCS Data-Fabric, ARCS Grid
Computing Service, and AAF’s authentication services.
Intersect’s team of Analysts has a background in publicly funded research, commercial research
and development, and commercial information technology. They provide both
one-off and ongoing assistance to research groups, and combine experience
across a large number of research disciplines.
The system administration team brings together commercial- and research-based
experience supporting and integrating administrative as well as research
Intersect has applied to the RDSI NoDe program and has a plan to build and operate an
RDSI node. Intersect has applied to the NeCTAR Research Cloud program, and has
a plan to build and operate a research cloud node. Both services will operate
alongside our existing infrastructure, hosted in commercial hosting
Track record and relevant experience
Intersect provides operational support for HPC,
valued at approximately $1m annually, to its membership. Over the last three
years, Intersect has provided support directly to more than 100 research
projects comprising hundreds of researchers. In the last year Intersect
provided specialist ongoing support, in the form of scripting, troubleshooting,
compilation of software, and design of experiments, to approximately 30
Intersect hosts applications or application
spaces for approximately 30 groups across seven institutions. Hosting is
provided through a combination of in-house infrastructure and in partnership
Systems administrators at Intersect have been
involved in the roll-out and support of national services, through most of the
PfC programs, including ARCS, the AAF and ANDS.
Intersect’s eResearch Analysts provide ongoing
support comprising engagement and outreach, issue management, and learning and
development to hundreds of researchers each year.
Please see the attached letters of support for Intersect’s track record as an
Approach to quality standards
Intersect has systems and procedures in place that provide quality control and assurance
to its customers. The tools used in these systems and procedures include:
‘Jira’ to raise and track external and internal issues; ‘Nagios’ to automate
service monitoring and raise alerts; ‘Cacti’ for usage trend analysis and
reporting; ‘Splunk’ for log management, troubleshooting and forensics; and
‘Confluence’ to document and manage procedures and processes.
There is an end-to-end process for raising and resolving
support issues, including a process by which support issues are prioritized and
Support and warranty mechanisms
Intersect issues a formal 3-month warranty for all projects; the warranty commences after
user-acceptance testing has completed. During this period all defects are fixed
at no cost to the customer. In practice, the defect rate has been low enough
that Intersect has fixed the majority of defects that come to light after the
warranty period has expired. Additional feature requests are carried out on a
The systems administration team provides user support for
hosted services. The systems administration team works with the customer to
configure the initial setup of their service. This work includes making a
decision on the appropriate hosting arrangements (e.g. local, Intersect,
commercial) based on the required level of service.
After commissioning, issues are tracked using Jira, and issues are triaged into
support and defect cases. Defects are escalated to the engineering team.
Responsibility for each support issue is assigned to a case manager, who looks
after the reporter until the issue is either resolved or escalated.
All services are monitored automatically using Nagios.
Outages and anomalies are reported to the systems administration team, and
emails are sent to the owners of the service using mailman.
Scheduled outages are negotiated with the customer as they
are required e.g. for upgrade of hardware or maintenance releases of software.
Services are promoted through Intersect’s website,
newsletter, marketing collateral and outreach program. Outreach activities
include promotion of services, including through an Intersect-maintained tools
register (soon to be integrated with CAUDIT’s eResearch portal), as well as
providing one-on-one assistance to researchers deciding on and using services
in their research.
Training is supported through Intersect’s learning and development
(L&D) program. Existing L&D material has been developed for services
including: interactive and self-paced training courses (covering e.g. HPC,
Google’s REFINE); written material available through our website (e.g. guides
to Evo, Collaborative Authoring); and an emerging program of web-casts
demonstrating concepts and usage of tools (e.g. Subversion).
In addition to UWS (project leadership and management; access to the
AusTalk (BigASC) (Burnham) and AMC (Dean, in association with
John Davis, CEO of the Australian Music Centre) databases; and the
ParseEval (Shaw) tool), the following institutions or groups will be involved.
Details of involvement and contribution are given in Part D3.
Intersect – Project development, business analysis, and co-investment.
Macquarie U – Specialists in Phonetics and in Language
Technology. Access to the EMU, AusNC (Cassidy), and the Johnson-Charniak parser
(Johnson) tools; adaptation of the audio aspects of AusTalk (Cox).
U Melbourne – Specialists in Phonetics,
Linguistics and Engineering. Adaptation and testing of the NLTK (Bird),
PARADISEC tools and the speech component of the PARADISEC corpus (Thieberger).
U – Specialists in Music and Speech Pathology. Adaptation
of the PARADISEC (Music aspects) corpus (Barwick); user testing and feedback
(Arciuli); adaptation and testing of PsySound3 (Cabrera).
– Specialists in Phonetics and Indigenous Languages. Adaptation of
Indigenous languages and text aspects of PARADISEC corpus (Simpson), adaptation
of the Indonesian corpus (Arka, Mistica), user testing and feedback (Ishihara).
U – Specialists in Computer Science and Auditory-Visual (AV) Speech. User
testing and feedback (Powers, Lewis); adaptation of AV aspects of AusTalk and
advice on AV aspects of the project (Lewis).
– Specialists in Speech Science, Music, Emotion in speech and music, and
in Forensic Speech Science. User testing and feedback (Epps, Ambikairajah, Cabrera, Emery).
UWA – Specialists in Robust Speech, Speaker
Recognition, and 3D audio-visual speech and speaker recognition. Adaptation and
testing of the modifications to HTK (Togneri); Visual and 3D processing for
recognition (Bennamoun); user testing and feedback, audio-visual feature
processing advice (Togneri, Bennamoun).
Canberra – Specialists in AV speech, Automatic Speech Recognition,
Forensics. Adaptation of AVOZES corpus and of the DeMoLib liptracker tool
(Goecke); User testing and feedback (Wagner, Goecke).
– Psycholinguistics and reading studies. User testing and feedback
RMIT – Information Retrieval and Natural Language Processing (Sanderson,
UNE – Psycholinguistics,
regional languages, logic in child language. User testing and feedback (Khlentzos, and the Language and Cognition Research Centre).
LaTrobe – Language diversity, minority
languages, data-oriented and theory-oriented approaches. Integration
post-project of the VisLab for the remote use of scientific instruments
and imaging of scientific data (Schembri); User testing and feedback (Tabain).
NICTA – Specialists in Machine
Learning and Control and Signal Processing, including Language Technology and
Text Retrieval. Support for interoperability with UIMA, the Unstructured Information Management Architecture, an
emerging standard for wrapping components for processing language, speech,
video, etc. data (Cavedon,
AusNC Inc. – The Australian National Corpus – Adaptation
of AusNC corpora and tools for HCS vLab; contribution of expertise on the licensing of corpora for online use and the
core technical platform developed to ingest corpus data and meta-data into a
unified online format (Haugh, Cassidy, Goddard).
ASSTA – the Australasian Speech Science and Technology Association – Peak
body on speech science research in Australasia. Promotion of the HCS vLab
through electronic bulletins, the ASSTA Newsletter and the biennial Speech
Science and Technology (SST) conference; advice from speech science and technology
experts as required; liaison and HCS vLab promotion with international
counterpart, the International Speech Communication Association (ISCA).
Director, Professor Denis Burnham, Inaugural Director (1999- ) of Marcs
Institute, UWS, conducts research in behavioural and speech science with collaborators from music cognition,
linguistics, phonetics, engineering, computer science and creative arts. He is
President, Australasian Speech Science & Technology Association (ASSTA,
2002- ); Member, ISCA (International Speech Communication Association)
International Advisory Council and Interspeech Steering Committee; and Co-Founder, Auditory-Visual Speech Perception Association. Burnham has held over 30
externally-funded grants (over 20 as Leader). He has led ARC Linkage projects
with industry partners Australian Caption Centre and Cochlear Ltd, and large
interdisciplinary projects: the $2M 5-year ARC Research Network on Human Communication
Science (HCSNet) (Triumvirate CI); the $3.4M 5-year ARC and NH&MRC Special
Initiative Thinking Head (Leader); the ARC $1M Big Australian Speech Corpus
(Big ASC); and most recently Seeds of Literacy, a 5-year $750K ARC Discovery.
Project Manager, Dr Dominique Estival has a PhD in Linguistics and extensive experience
in academic research and commercial project management for Language Processing.
Following research, industry and academic positions in the USA, Europe and
Australia, she took up Project Management roles: Team Leader, R&D, Syrinx
Speech Systems, a Sydney speech recognition company developing automated
telephone dialogue systems; Senior Research Scientist, Natural Language technologies,
human-computer interfaces and multi-lingual processing with the DSTO (Defence
Science & Technology Organisation); and Senior Manager, Projects and
Research, managing language processing research for US-government-funded and
commercial projects at Appen P/L, a company providing speech and language
databases for language applications. Estival is currently Project Manager, the
Big ASC (Australian Speech Corpus) at UWS, where she has managed rollout of
software and hardware to 17 Australian sites for AV recording of the AusTalk
corpus. Estival is a founding member of the Australasian Language Technology
Association (ALTA) and in 2008 established the Australian Computational and
Linguistics Olympiad (OzCLO). Dr Estival has a wealth of experience in academia and
industry, including project management of large collaborative projects, and
will therefore be employed at the Manager level.
Associate Professor Steve Cassidy, Macquarie University is a Computer Scientist whose
research covers the creation, management and exploitation of language
resources. Cassidy has a PhD in Cognitive Science and has worked in both
Linguistics and Computer Science departments. He is the main author of the Emu
speech database system which is widely used in the creation and analysis of
spoken language data for acoustic phonetics research. He has been involved in
the standardisation of tools and formats for the exchange of language resources
starting with his work on the Emu system and more recently as an invited expert
on the ISO TC 37 working groups on annotation interchange formats and query
languages for linguistic data. He has been instrumental in establishing the
Australian National Corpus as an umbrella organisation to manage language
resources in Australia and is an active collaborator with similar projects in
the US and Europe. Cassidy will act
as Product Owner for this project and as such will act as a conduit between the
development team and the prospective users around Australia as well as ensuring
that the product is interoperable with related international efforts.
eResearch Analyst, Peter Bugeia (Intersect, UWS) is the eResearch Analyst for the
University of Western Sydney. He has 27 years’ IT experience across a wide range
of industries including medicine, banking, finance and media. He has worked in
commercial, not-for-profit and public sectors and has held various roles from
Senior Software Engineer and Test Manager to Project Manager, Enterprise
Architect and Business Analyst.
Intersect Project Manager. Georgina Edwards is Intersect’s technical development
manager assigned to manage the software development aspects of the
project. She has over 10 years’ experience in commercial software design
and development, and has worked in banking and finance as well as eResearch. Edwards
has a BE (Hons) in IT & Telecommunications from the University of Adelaide.
She has experience building web applications in a range of languages, and is
also an experienced agile practitioner. Edwards will work closely with the
Project Manager and Product Owner to ensure proper and effective integration
between the stakeholder community and the software engineering team. Edwards
will also work with Intersect’s management team, comprising Dr Ian Gibson (CEO),
Rodney Harrison (Engineering Manager), Dr Joe Thurbon (Services
Manager) and Shane Youl (IT Manager), to ensure that the development and
operations of the project are staffed appropriately and executed efficiently.
The HCS vLab will constitute a collaborative environment for access and analysis of human
communications data by HCS tools. It will provide resources to create new
annotations for existing data, and a space for researchers to store new data
and tools for use by the research communities. The overall structure of the
environment is shown in Figure 1. The HCS vLab is designed to make use of
national infrastructure – including data storage, discovery and research
computing services. It incorporates existing eResearch tools adapted to work on
shared infrastructure, and orchestrated by a workflow engine with both web and
command line interfaces.
Estival will be Project Manager and will be working with A/Prof Steve Cassidy,
the Product Owner. The Project Manager will be part of the Steering Committee
and have oversight of the development undertaken at both Intersect and
The Virtual Laboratory will provide researchers with an integrated environment in
which to select and perform analysis on Corpora through a suite of
pre-installed tools. The HCS vLab will be designed for use by both IT-technical
and non-IT-technical researchers, with user interaction through a Web
interface. General functionalities of the HCS vLab are as follows:
Users can browse lists of corpora containing Human Communications
data and of pre-installed HCS Tools,
including corpora from the already-funded NeCTAR Virtual Lab, HuNI (the
Humanities Networked Infrastructure) (see Tools and Corpora below, and also B.18).
Users can select either a single corpus for search and analysis or
several corpora to perform a federated search. Some users will also be able to select
and add their own data sets for search and analysis.
Users then select one or more tools by which to analyse the selected
data. The system will display runtime options for the selected tools. Users
will then be able to choose and save the most appropriate options for their
analysis. The system will ensure only valid options are selectable.
Once one or more corpora or data sets have been selected and tool
options have been chosen, the user can invoke execution of the tools and the vLab
will run the tools in their execution environment. Some tools may be configured
to run in multiple execution environments, and a special HPC execution
environment will be available for compute-intensive tools (Intersect has
provided an initial in-kind allocation of HPC computing time for the HCS vLab).
During execution, vLab will copy and/or make available data and files to the
selected tool transparently to the user.
The user will be able to monitor and control execution as it proceeds,
and terminate if necessary; the user will be able to request a change in the
computing resources assigned to an executing tool.
Once execution is complete, the user will be able to view the results
through the Web interface.
Tools will either automatically add results to the Annotation and
Record-Level Metadata Stores and/or the user will make these updates manually
through an Annotation Service. Annotations and metadata will be private to the
originating researcher until the researcher chooses to make them publicly
The UWS Research Data Catalogue, managed by the UWS library (established
with the help of ANDS Metadata Stores and Seeding the Commons programmes) will
play a key role in describing and disseminating descriptions of corpora, tools
(as services) and annotations (new data sets).
A command line interface
will also be available for users.
For Desktop Tools, the user
will be able to interact with the application once it has been initiated.
The user will be able to
chain the execution of Tools together, and capture and share these workflows.
As some of the Corpora data
are sensitive, Corpora will only be available to researchers who are
appropriately authorised for that particular corpus. This will be checked
through the “User Registration” service.
Users will be able to
request the addition of Tools and Corpora to the vLab, and to store their own
private data, subject to the level of commitment to the Virtual Lab from their
organization as described in the Partner Model, with appropriate service
support and regulation for authority.
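Two of the behaviours listed above, restricting users to valid tool options and restricting corpora to authorised researchers, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `OptionSpec`, `validate_options` and `authorised_corpora` are hypothetical names, the option `window` and its allowed values are invented, and the real checks would be made by the vLab's own services (for example the "User Registration" service).

```python
from dataclasses import dataclass
from typing import Any, Dict, Set


@dataclass(frozen=True)
class OptionSpec:
    """Allowed values for one runtime option of a tool (hypothetical)."""
    name: str
    allowed: frozenset


def validate_options(specs: Dict[str, OptionSpec], chosen: Dict[str, Any]) -> Dict[str, Any]:
    # Reject unknown options and values outside the allowed set.
    for name, value in chosen.items():
        if name not in specs:
            raise ValueError(f"unknown option: {name}")
        if value not in specs[name].allowed:
            raise ValueError(f"invalid value {value!r} for option {name}")
    return dict(chosen)


def authorised_corpora(requested: Set[str], granted: Set[str]) -> Set[str]:
    # Only corpora the user has been authorised for may be selected.
    denied = requested - granted
    if denied:
        raise PermissionError(f"not authorised for: {sorted(denied)}")
    return requested


# Invented example data: one option with two allowed values, and a user
# who has been granted access to a single corpus.
specs = {"window": OptionSpec("window", frozenset({5, 10}))}
options = validate_options(specs, {"window": 5})
corpora = authorised_corpora({"AusTalk"}, granted={"AusTalk"})
print(options, corpora)
```

Raising an error on the first invalid choice mirrors a Web form that simply refuses to offer invalid values; a command line interface could reuse the same checks.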
The HCS vLab, in particular the Workflow Engine, will be built around the Galaxy open
source workflow management system (see also B.11). The proposed Technical
Architecture of the HCS vLab is as follows:
An instance of the Workflow
engine will be run on a virtual machine in the Intersect, NeCTAR, or other
The Standard Execution
Environments will be pre-configured virtual machines that have one or more
tools installed. The HPC Execution Environment is likely to be a standard
operating environment with no virtualisation.
The Tool & Corpora
Definitions, Record-Level Metadata Store and Annotation Store will be
cloud-based databases which are available to all virtual machines that form
part of the vLab.
The Corpora will live on
the Intersect RDSI Node, or some other RDSI node, closely located to HPC
Within an execution
environment, tools will be run in either the foreground or in batch mode,
depending on what is appropriate for the tool and the environment.
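The choice of execution environment and run mode described above could be expressed as a small decision function. The sketch below is hypothetical: the `compute_intensive` and `interactive` flags are assumed to come from the Tool & Corpora Definitions, the rule that HPC work always runs in batch mode is an assumption, and treating HTK as compute-intensive is for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Environment(Enum):
    STANDARD_VM = "standard"  # pre-configured virtual machine with tools installed
    HPC = "hpc"               # standard operating environment, no virtualisation


class Mode(Enum):
    FOREGROUND = "foreground"
    BATCH = "batch"


@dataclass(frozen=True)
class Tool:
    name: str
    compute_intensive: bool  # assumption: recorded in the tool definition
    interactive: bool        # assumption: desktop-style tools need a foreground session


def plan_execution(tool: Tool) -> Tuple[Environment, Mode]:
    # Compute-intensive tools go to the HPC environment, others to a standard VM.
    env = Environment.HPC if tool.compute_intensive else Environment.STANDARD_VM
    # Assumption: HPC work is always queued; only interactive tools on a
    # standard VM run in the foreground.
    if env is Environment.HPC or not tool.interactive:
        mode = Mode.BATCH
    else:
        mode = Mode.FOREGROUND
    return env, mode


print(plan_execution(Tool("HTK", compute_intensive=True, interactive=False)))
print(plan_execution(Tool("EMU", compute_intensive=False, interactive=True)))
```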
Tools and Corpora
The HCS vLab will integrate existing corpora that house human
communications data, consisting of
language and music data, in the three most common modes in which these are
represented – audio, auditory-visual and text. The corpora to be made
available in this project are all corpora which our participant members have
either established or are caretakers for, so we have direct access to these.
They are presented below in the order in which they will be incorporated into
the framework (see also B.13) and further details regarding platforms, UIs,
input and output etc. are provided in Appendix B. The order of incorporation of
these corpora into the framework is chosen by consideration of the joint
factors of maturity of each corpus and the amount of work required to adapt
them for incorporation into the HCS vLab.
PARADISEC (the Pacific and Regional Archive for Digital Sources in
Endangered Cultures, including Indigenous languages, music, and speech) (5.1TB);
AusTalk AV speech corpus from the BigASC project (7TB);
the Australian National
Corpus (incorporating the Australian Corpus
of English (ACE), Australian Radio Talkback (ART), AustLit, Braided Channels,
Corpus of Oz Early English (COOEE), Email Australia, Griffith Corpus of Spoken
English (GCSAusE), International Corpus of English (Australia contribution is
ICE-AUS), the Mitchell & Delbridge corpus, and the Monash Corpus of Spoken
AVOZES visual speech corpus (15GB);
Australian Music Centre archive (extremely large collection of sound and text: over 30,000 items by 530 artists);
Colloquial Jakartan Indonesian corpus (audio and text 32.5TB);
The HCS vLab will also integrate existing HCS tools for the analysis of
music, speech and written text and make them accessible to non-technical
researchers, while maintaining a command line functionality for more
sophisticated analyses. Nine of these are tools that have been developed by our
and these are listed below in the order in which they will be incorporated, by
consideration of the joint factors of maturity of the tool, the amount of work
required to adapt the tool for incorporation into the HCS vLab and the order in
which corpora will be incorporated (see B.13).
EOPAS (PARADISEC tool) for interlinear text and media analysis;
NLTK (Natural Language Toolkit) for text analytics with
EMU for search, speech analysis,
interactive labelling of spectrograms and waveforms;
AusNC Tools: KWIC, Concordance, Word Count,
statistical summary and statistical analysis on a user-defined subset of
Johnson-Charniak parsers, to generate full parse
trees for text sentences;
ParseEval, tool to evaluate the syllabic parse
of consonant clusters;
HTK – modifications, a patch to HTK (Hidden Markov Model Toolkit, http://htk.eng.cam.ac.uk/) to enable missing data recognition;
DeMoLib software for video analysis; and
PsySound3 (physical and psychoacoustical
algorithms) of complex visual and auditory scenes.
ParGram (grammar for Indonesian).
The INDRI tool for information retrieval with
large data sets.
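To give non-specialist readers a feel for one of the text analyses named in this list, the sketch below shows a keyword-in-context (KWIC) listing in plain Python. It is not the AusNC implementation; the `kwic` function, its `width` parameter and the sample sentence are assumptions made for illustration.

```python
from typing import List


def kwic(text: str, keyword: str, width: int = 2) -> List[str]:
    """Return each occurrence of keyword with `width` words of context either side."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines


for line in kwic("the dog chased the cat and the cat ran", "cat"):
    print(line)
```

A production tool would work on tokenised, annotated corpus text rather than whitespace-split strings, but the output shape is the same: one line per hit, with the keyword marked.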
For each of the corpora
and tools listed above, there is an HCS expert who will work in-kind for 3 weeks
(15 effort days) with the Project Manager, the Product Owner and Intersect to
incorporate these corpora and tools into the HCS vLab (see B.13).
Collaboration with HuNI (an already funded NeCTAR Virtual Lab)
The Virtual Lab will query the HuNI virtual
lab using a protocol to be negotiated (e.g. OAI-PMH or Atom) for information
about corpora known to HuNI which may be of use to VL users. Appropriate
corpora with sufficient metadata can then be loaded into the HCS vLab and used
with the suite of HCS vLab Tools. Resulting data will be stored in the HCS vLab Annotation
store, and the existence of the new data advertised back to HuNI via an
appropriate mechanism. Metadata exchange will use appropriate
standards, such as OAI-PMH with discipline and corpus-appropriate metadata
schemas (EAC-CPF for information about parties, OLAC for linguistics
resources, MARC for bibliographic description).
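If OAI-PMH were the protocol chosen, a harvest of corpus descriptions might look roughly like the Python sketch below. The `ListRecords` verb, the `metadataPrefix` argument and the XML namespaces are part of OAI-PMH 2.0 and Dublin Core; everything else is assumed for illustration: the endpoint `https://example.org/oai`, the sample response, and the use of the generic `oai_dc` format rather than the OLAC, EAC-CPF or MARC schemas mentioned above. No network request is made; the response is parsed from a literal string.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    # Build a ListRecords request for the given repository base URL.
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def record_titles(response_xml: str) -> dict:
    """Map each record identifier to its list of Dublin Core titles."""
    root = ET.fromstring(response_xml)
    titles = {}
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        titles[identifier] = [t.text for t in record.iter(f"{{{DC_NS}}}title")]
    return titles


# Invented sample response, trimmed to the elements used above.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:corpus-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example corpus</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://example.org/oai"))
print(record_titles(SAMPLE))
```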
Figure 2 shows a potential workflow where an
HCS vLab user is able to acquire data from HuNI, transcribe it, and publish it
so that a HuNI user can access the original HuNI object along with the
transcript using one of the HCS web-based tools, giving them access to new data
that has been created in the HCS vLab environment. Similarly, HuNI users will
be able to access key HCS vLab tools, such as time-aligned transcription tools,
which save data in standard reusable formats, rather than using ad hoc
solutions as is currently often the case.
Figure 2: A potential scenario
involving interoperable metadata, tool and data exchange
between HCS vLab and the HuNI
RESEARCH COMMUNITY NEEDS & BENEFITS
Target Research Community
The primary target research community to use the HCS vLab is
encompassed by HCSNet (see B.2), an Australian community with >1200 active members
from Speech, Language, Music
and Sonics areas and around 45,000 international members. More specifically,
the international music community that wants to have the AMC not only more easily
available but open to different kinds of searches is estimated at around
10,000. The research communities that would benefit from Text and Speech
corpora and associated tools can be estimated at 20,000 (linguists, speech
scientists, behavioural scientists and language technologists). For video and
visual analysis, the international community is estimated at around 15,000
researchers and is steadily growing.
The Australian and international HCS community
also intersects and overlaps with various professional bodies as set out below
and, via these, with their international counterparts.
The Australasian Speech Science and Technology Association
(ASSTA, est. 1972, www.assta.org/) is the peak
body for speech scientists in Australia and New Zealand and is a cash
contributor to this bid. ASSTA has various Corporate Members (Appen Butler
Hill, Cochlear Pty Ltd., the HEARing CRC, Spectral Dynamics) and 110 members
from the disciplines of engineering, computer science, cognitive science, psycholinguistics,
language technology, phonetics, linguistics, forensic speech science, and
speech pathology. ASSTA runs the biennial Speech
Science & Technology conference, provides HDR student travel assistance
for international conference presentations, and funds speech
science and technology research events and initiatives. ASSTA has close ties to
ISCA, the International Speech Communication Association, which attracts >
1000 registrants to its annual conference, Interspeech. ASSTA members make up
the core set of researchers working in the various manifestations of speech
science in Australasia. Almost all the members were active participants in
HCSNet, and ASSTA members are at the forefront of grant-getting, publication,
and PhD supervision in speech science in Australasia.
The Australian Linguistics Society (ALS, est. 1967, www.als.asn.au/) with more than 450 members, is the peak organisation for
linguists in Australia. It runs the annual ALS conference and a biennial
Australian Linguistic Institute (ALI). Many ALS members participated in HCSNet
activities. The main international counterpart is the LSA (Linguistic Society
of America) with around 4,500 members, and there are individual Linguistics
organisations in many countries around the world, with an estimated combined
total of 10,000.
The Australasian Language Technology Association (ALTA, est.
2002, http://www.alta.asn.au/) with around
240 members promotes research in Computational Linguistics and Natural Language
Processing. It is a founding regional organisation of the Asian Federation of
Natural Language Processing, runs the annual ALTA workshops,
and manages funds for OzCLO (the Australian Computational and Linguistics
Olympiad). Its main international counterpart is the ACL (Association for
Computational Linguistics), with around 5000 members and there are more
specialised organisations, such as AMTA (Association for Machine
Translation in the Americas), EAMT (European Association for MT), AFNLP (Asian
Federation of Natural Language Processing), with an estimated combined total of
10,000 researchers around the world. Most ALTA members participated in HCSNet activities.
The Australian Music and Psychology Society (AMPS, est.
1996, www.ampsociety.org.au/), with 200 members, is
a member of the Asia-Pacific Society for the Cognitive Sciences of Music
(APSCOM) and has links with ESCOM, the European
Society for the Cognitive Sciences of Music, and SMPC,
the (US) Society for Music Perception and Cognition. AMPS runs
workshops, seminars and conferences, and provides HDR student travel assistance
for international presentations, e.g., at the biennial International Conference
on Music Perception and Cognition which AMPS hosted in Sydney in 2002. Many
AMPS members participated in HCSNet activities.
Needs and Impact
The HCS research community has produced a number of corpora and
repositories for human communications data and many HCS tools to
manipulate, process and analyse these data. But the use of and access to these
data is hampered by two constraints.
First, as in so many other disciplines,
the amount of data available is growing
exponentially, making it increasingly infeasible for any one researcher or
research laboratory to maintain up-to-date locally stored datasets. Even where
the storage capacity to do this exists at a given institution, the many-ways
replication that local copying strategies encourage only introduces version
inconsistency problems that outweigh the advantages of redundancy. Cloud-based
storage, with appropriate backup procedures, is widely accepted as the best way
forward, ensuring that all users see the same data. Similarly, cloud-based
analytical tools relieve individual researchers and sites of the need to
maintain up-to-date versions of software, outsourcing software infrastructure
tasks that are not, and should not be, core business for a researcher.
The second problem is unique to interdisciplinary
endeavours. It is already well-recognized that a major impediment to
interdisciplinary interaction is the fact that we ‘speak different languages’: two
researchers aiming to cross a disciplinary divide need to take the time to
understand how each uses terminology in often subtly different ways. A related
phenomenon is present where ‘the rubber hits the road’ in terms of actual
software and data usage: although a researcher in one discipline may have much
to gain, and much to offer, from the use of tools and corpora developed in
another discipline, all too often the hurdles to making this a reality are
immense, posing a learning curve that has characteristics not dissimilar to the
culture shock one faces when moving to a new country. Cloud-based tools and
support cannot erase the difficulties here, but they provide an opportunity for
interfacing and modularising that make it easier to overcome them.
A further key concern that our proposal addresses is that of replicability. At a time when conventional science is subject to
disruptive forces—debates over open access models, attacks on the peer
review process, and a sense of public distrust—this element of the
scientific process remains indisputably indispensable. But replicability has
always been hard, and gets harder as experiments require ever-more technically
sophisticated tools, and make use of ever-larger data sets. This problem has
been recognized in Big Science, whether dealing with web click histories,
consumer purchasing patterns or astronomical data. But it is all too easy to
overlook the importance of replicability in sciences, like Human Communication
Science, that rely on ‘mid-size’ data sets. Our HCS vLab proposal addresses
this problem head-on, by providing a cloud-based platform that supports
user-defined experimental workflows using standardised public-domain data sets
and cloud-friendly revisions of existing desktop analysis tools. In this regard,
our aim is to develop and provide a world-leading best-practice platform for
scientific replicability in the human communication sciences.
To address these issues we need a connected repository for
data and integrated access to tools. The HCS community will benefit from (i)
easy access to data collected in other disciplines, e.g., speech and video
researchers accessing the indigenous language PARADISEC data, linguists
accessing the AusTalk transcripts, and the speech/language components of music
in the Australian Music Corpus; and (ii) the use of tools developed in
neighbouring disciplines to process and manipulate their own data, e.g.,
linguistic tools to process the language components of the AMC, visual analysis
and lip-tracking tools to analyse the video components of AusTalk, and musical
and acoustic analysis tools to examine rhythm and melody in speech corpora.
The potential impact of making these
tools and resources easily available to the widely distributed Australian
research community is vast. This is perhaps most obvious where the resources
and tools have a direct connection to commercially-valuable technological
developments, as in the areas of speech and language technology, and music
processing software. Australia’s small size means that we will always struggle
to compete against the major players in the US and Europe, and increasingly in
Asia, but HCS vLab will enable a pooling of resources, data and tools that will
encourage a higher degree of collaboration amongst our researchers, and allow
us to do more with less.
Less obvious are the more niche areas of research that are all too often
ignored in the push for short-term technological wins. Here, HCS vLab provides
an opportunity to enfranchise and strengthen more isolated research activities.
(U Melbourne) has built a corpus of recordings and time-aligned transcripts of South Efate, a language
from Vanuatu. This rich set of material could be accessed by others, but
currently there is no platform to make it available. It requires streaming
media linked to texts with an annotation module to allow researchers interested
in prosody, narrative structure or musicology to access and annotate the
material, safe in the knowledge that both the primary material and the new
annotations will all have persistent locations.
many legacy audio speech and music recordings that are not annotated beyond a
gross one-line description per audio-tape. In some cases it is not even clear
what language is represented. Annotation would be greatly facilitated by
crowd-sourcing; HCS vLab would provide an online space with an easy-to-use GUI
where native language speakers and other researchers could access and annotate
There is a wealth of Aboriginal language dictionary and text material
that once formed the Aboriginal Studies Electronic Data Archive (ASEDA) digital
archive collected between the late 80s and early 2009 by the Australian
Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS). The
establishment of the HCS vLab would make possible collaborative work with
AIATSIS to make the ASEDA digital archive (also available as AUSTLANG) more visible.
There is an increasing awareness that the temporal dimension of speech
encodes a substantial amount of information about linguistic structure.
Extracting this information requires experimental methods for acquiring high
temporal resolution speech movement data, software for data analysis and
computational modelling tools for linking the data to linguistic structure.
However, there is no accessible platform by which such analyses can be effected,
no standards for analysis have been established, and many of the analytical tools
remain inaccessible to the broader research community. HCS vLab will provide
access to Shaw’s ParseEval, a tool
that allows the syllable structure to be ‘read’ from speech data. Other tools,
such as DemoLib, will provide
analysis of audio-visual speech.
Much has been written and said about the imperative
to make research data widely available, especially when its creation is
publicly funded; but the commitments signed up to in ARC proposals are more
honoured in the breach than in the observance. This is in no small part due to
the difficulty of the mechanics of making data available. HCS vLab will provide
a platform for researchers in the human communication sciences that overcomes
this problem, allowing increased leverage of existing investments and, in the
process, making the data accessible to a wide range of existing tools in a
streamlined fashion. For instance, with regard to the AusNC, HCS vLab will
provide web-based tools for the display and analysis of annotations on
linguistic material; natural language processing tools for text; automatic
tools to support the generation and classification of annotations from audio
and textual material; and tools for transcoding and streaming of audio and
video in HTML5-ready format. HCS vLab will also make available the Ethnographic
E-Research Online Presentation System (EOPAS, www.eopas.org) to
present interlinear glossed text and media for other corpora. In addition, with the assistance of ongoing
infrastructure support from UWS after the project is completed and HCS vLab is
established, there are new tools that could be incorporated into HCS vLab, e.g.,
with the Australian Music Centre, an IR (information retrieval) tool for
music and other acoustic data; and musical scores
in XML form for musical academic research.
The Australian HCS community responded positively to the formation of HCSNet by
attending the 60 HCSNet workshops and seminars, by an increase in successful
large grants in the HCS area (see B.2) and by significant high impact journal
publications (see the top 30 papers to come out of HCSNet in Dale, R., Burnham,
D. & Stevens, C.J. (2011) Human
Communication Science: A Compendium). Through HCSNet, the ARC has already
invested in the development of a strong interdisciplinary community that has
been widely recognized overseas; HCS vLab is an opportunity to both build upon,
and reach far beyond HCSNet to provide that community with the tools and
resources that it needs to leverage our distributed research capacity. The user
base is ready and waiting. The impact of the HCS vLab on HCS research in
Australia will be significant, far-reaching and sustainable. It will take HCS research
capability to the next level, beyond the individual modalities of speech,
language and music, and above that which can be accomplished on the ground in
individual centres and institutes.
the opportunity for applying novel combinations of old ideas or methods of
analysis from different disciplines to new problems. For instance, in the HCS
Compendium, Butavicius and Lee (2011) describe a multi-disciplinary approach
involving visual perception, human-computer interaction and cognitive modelling
to the problem of assisting users to find relevant information in very large
data sets; and Copland et al. (2011) married an existing psycholinguistic
semantic priming task with fMRI to address the role of dopamine on
neurotransmitters in semantic processing. However, the HCSNet experience also
made it abundantly clear that one of the main impediments to such quantum leaps
in HCS research was the difficulty for a researcher from one discipline to
apply the tools and techniques of another discipline, or to explore data
collected under one paradigm via a completely different analytical perspective.
As HCSNet proved to be a model for inter-disciplinary collaboration, the
HCS vLab will be an exemplar for other research communities.
The impact and benefits of the HCS vLab activity will be tracked via the UWS
Service Desk. The following measures will be used to report on utilisation and
uptake by the research community, measured through usage counts (number of
users logged in, number of tools used, number of queries) and user surveys
distributed and collected at each of the 3 vLab phases (see B.13):
Project / developmental measures:
functional tests performed by researchers
researchers who have participated in requirements gathering and testing
researchers with login accounts
number of actual researcher logins to vLab
counts: e.g. searches performed, invocation of tools, annotations generated and committed to
the annotation store.
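The usage counts listed above could be derived from a simple event log. The following is a minimal sketch only, assuming a hypothetical log of (user, event type) pairs; the event names and the `summarise_usage` helper are illustrative assumptions and are not part of the proposal.

```python
from collections import Counter


def summarise_usage(events):
    """Summarise a usage log.

    `events` is an iterable of (user_id, event_type) tuples.
    The event type names used here are hypothetical.
    """
    events = list(events)
    counts = Counter(event_type for _, event_type in events)
    return {
        # Distinct researchers who actually logged in at least once
        "distinct_users_logged_in": len(
            {user for user, event_type in events if event_type == "login"}
        ),
        "logins": counts["login"],
        "searches": counts["search"],
        "tool_invocations": counts["tool_invocation"],
        "annotations_committed": counts["annotation_commit"],
    }


sample_log = [
    ("alice", "login"),
    ("alice", "search"),
    ("alice", "tool_invocation"),
    ("bob", "login"),
    ("bob", "annotation_commit"),
    ("alice", "login"),
]
print(summarise_usage(sample_log))
```

Counting distinct users separately from raw logins keeps "researchers with login accounts who actually used the system" apart from overall activity volume.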
Once the tools described in
section B.7 have been incorporated, other well-known public access tools, such
as ELAN (professional tool for the
creation of complex annotations on video and audio resources); CLAN (Computerized Language
ANalysis, for Conversational Analysis); PRAAT (scientific software program for the analysis of
speech in phonetics); and The Field Linguist’s Toolbox (data management and
analysis tool for field linguists), will be considered for addition to the HCS
vLab. See Section B.18 for a list of projects which will be interoperable with
the HCS vLab and for a list of additional corpora and tools which project
partners already plan to add.
Project partners have
already indicated their intention to add the following:
ExSite9 – a tool for
efficient adding of metadata for digital research data as it is collected in
NABU – a catalog system for
research collections with the ability to provide streaming access to media
objects as well as providing a management environment.
These tools could work together
with EOPAS: 1) ExSite9 for assembling and linking data prior to submission to
repository; 2) NABU for managing the data repository and providing access to it
where possible; 3) EOPAS for providing fine-grained online access to annotated
media in the repository
UCBN – a broadcast news
database with audio-visual video sequences (both text dependent and text
independent subsets). There are around 20 speakers and total footage is around
6 hours of recording (about 100 MB).
MultimodalFusionLibrary – a
software library in C# for audio visual processing with several pre-processing,
feature extraction, learning and classification algorithms mainly for
stereoscopic and 3D video sequences.
The AV face cover corpus through our MOU with the BBfor2 (Bayesian Biometrics for
Forensics) Euro-funded project (BBfor2 is funded by the EC as a Marie-Curie ITN-project
(FP7-PEOPLE-ITN-2008) under Grant Agreement number 238803.)
The ABC corpus of infant-directed
The flexible HCS vLab framework
coupled with the ongoing support by the UWS eResearch Unit will facilitate the
inclusion of future corpora and tools. For example:
A new (awarded 2012) Discovery project
(Best, Shaw, w/ PIs Hay, Foulkes, Docherty, Evans: ‘You came here TO DIE?!’) will collect audio recordings of
a carefully-constructed corpus of words and phrases in 5 English regional
accents (AusE, NZEng, Cockney, York and Newcastle-upon-Tyne). That corpus, plus
earlier sets of Jamaican and American English materials collected in a current
ARC Discovery project at UWS ‘How strict is the Mother Tongue?’ will be added
as a database in HCS vLab. The pending project includes plans to run
computational modelling on that corpus to determine relationships among
pronunciations of words in the 5 accents, and to relate those modelled
parameters to real-human word recognition across accents.
A current ARC Discovery
project at UWS (DP110105123,
2011-2015, The Seeds of Literacy) will generate a corpus of caretaker speech to
100 children at 3-monthly intervals from when they are 6-month-old infants to
5-year-old children and this will be added as a database in HCS vLab.
A recently awarded (2012)
ARC LIEF grant ‘A Living Archive of Australian Indigenous Languages’ through ANU is designed to
digitise vernacular literature from NT Aboriginal schools, and the HCS vLab
would provide a suitable home for these data.
At ANU there are plans to augment a current fledgling corpus of Spanish
with new data in order to conduct comparative research. Again the HCS vLab
would provide a suitable home for these data.
In addition, a number of international
labs will welcome HCS vLab, for example:
The New Zealand Institute for Language, Brain and Behaviour (NZILBB), a
close partner of Marcs Institute, has both databases and tools they have
developed for recordings of NZ English and NZ Maori and Maorian-English, which
could be included as part of HCS vLab in the future.
The University of Southern California
is currently developing Electromagnetic Articulograph and real-time dynamic
Magnetic Resonance Imaging speech databases and tools, and members of that
project, Shri Narayanan and Mike Proctor, are some of the many international
collaborators of Australian HCS researchers.
The NSF-funded project ‘The Language Application Grid: A Framework for Rapid Adaptation
and Reuse’, led by Nancy Ide of Vassar College, will provide a grid-based development environment for
building natural language processing pipelines, facilitating application
building and experimentation. As an international collaborator, the HCS
vLab Product Owner, Steve Cassidy, will ensure that
the Language Application Grid is compatible with HCS vLab infrastructure
through the use of shared data standards and application interfaces.
ANDS – Australian Research Data Commons
HCS vLab will be integrated with the Australian Research Data Commons. The
project will manage collections of research data, and enable the selective
export of descriptions of the data collections via OAI-PMH.
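To illustrate what selective export via OAI-PMH looks like from a harvester's side, the sketch below builds a standard `ListRecords` request URL. The endpoint URL and the set name are hypothetical; `oai_dc` is the metadata format every OAI-PMH repository must support, and a harvester such as ANDS may request a different prefix.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL.

    `set_spec` restricts the harvest to one set (selective export);
    `from_date` restricts it to records changed since that date.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint and set name, for illustration only.
url = build_list_records_url(
    "https://repository.example.org/oai", set_spec="corpora"
)
print(url)
```

Exposing each collection as an OAI-PMH set is one way the "selective" part of the export could be realised, since harvesters then choose which sets to request.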
AAF – Authentication
HCS vLab will provide authentication against the AAF through the SAML2
protocol. This will allow appropriately authorized individuals to authenticate
using their institutional credentials. Where that is inadequate, dual
authentication systems will be used, as outlined in the Federation’s technical
documentation. Where possible, international collaborators will be
authenticated in the system using the existing reciprocal arrangements between
the AAF and its international partner organizations.
NCI – National Computational Infrastructure
HCS vLab will integrate with NCI by submitting jobs to the HPC system.
Authentication will be based on the credentials of the user logged in to the
proposed system. SSH will be used as the protocol. Access to the HPC system
assumes that the user of the proposed system has CPU time allocated to them on
the HPC system, for example through the national Merit Allocation Scheme.
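A job submission over SSH, as described above, might be assembled along the following lines. This is a sketch under stated assumptions: the host name, user name and job script path are hypothetical, and the use of a PBS-style `qsub` command on the remote system is an assumption rather than something the proposal specifies. The command is only constructed here, not executed.

```python
import shlex


def build_submit_command(user, host, job_script):
    """Return the argument list for submitting a job script over SSH.

    Assumes the remote HPC system accepts `qsub <script>` (PBS-style);
    this is an illustrative assumption.
    """
    # ssh passes the remote command through a shell on the far side,
    # so the script path is quoted to survive spaces or metacharacters.
    remote_command = f"qsub {shlex.quote(job_script)}"
    return ["ssh", f"{user}@{host}", remote_command]


# Hypothetical values, for illustration only.
command = build_submit_command("jsmith", "hpc.example.org", "jobs/align.pbs")
print(command)
```

The resulting list could be handed to `subprocess.run(command, check=True)` by a service acting on behalf of the logged-in user, whose own allocation on the HPC system would be charged.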
RDSI – Research Data Storage Infrastructure
HCS vLab will host human communications datasets of national significance, as
described in items 2 and 3 of this section. We plan to host these datasets on
RDSI on the Intersect node. We anticipate strong alignment between this project
and the RDSI ReDs programme. We are planning to host this project on the
proposed Intersect Research Cloud node, which is to be co-located with the
proposed Intersect RDSI node, and Intersect’s existing infrastructure.
HCS vLab will be built using the OpenStack Cloud API, to allow integration with
the NeCTAR Research Cloud.
HCS vLab acknowledges the Galaxy-GDR Integration NeCTAR project which will provide
designs and insights that will be useful to this project, and vice versa.
The Australian National Corpus
The Australian National Corpus was
funded by the Australian National Data Service (ANDS) to collect together a
number of existing language corpora on a single technical platform that unified
the many data and meta-data formats and provided a uniform set of interfaces to
browsing and searching these data sets. The ANDS project developed an ingestion
process that could be used with a wide variety of data types including text,
audio and video collections, bringing them into a single standardised data
store that is then exposed via a web interface. Another outcome of the project
is a legal and ethical framework for sharing language resources within the
research community. The Australian National Corpus platform will provide a
starting point for the storage of data and meta-data in the HCS vLab. It will
be extended to allow ingestion of new data types and to support federated
storage of resources and meta-data.
The Humanities Networked Infrastructure (HuNI)
The Humanities Networked Infrastructure (HuNI) is one of the NeCTAR Virtual
Labs being established. The HCS vLab and HuNI are complementary to each other
and a compatible feed pipeline will be built
between the two (see B.18).
The authority structure for the HCS vLab project reflects the partnership model as
described in Appendix E. Two key governance groups will be formed: a Steering Committee and a Stakeholder Group. A number of key
project roles will interact with these groups, as described below.
The Steering Committee is a small, senior
group with overall responsibility for and authority over the project and its
resources. The Steering Committee is charged with ensuring the project achieves
its objectives, that stakeholders receive the benefits of the HCS vLab, and
that these benefits are sustainable into the future. An important aspect is that the
Steering Committee is responsible for trading off between the interests of the
Stakeholder Group when necessary. The Committee consists of:
Director – Professor Denis Burnham, UWS
eResearch Manager – Dr Peter Sefton, UWS
Partner Representatives – Representatives from the major partners in the
members: other stakeholder representatives have been identified as having
particular expertise across areas of HCS and will be called upon for
particular meetings as required.
A representative from NeCTAR will be invited to attend as an observer.
The Committee will meet monthly. The HCS vLab Project Manager (Dr Dominique
Estival) and the Intersect Project Manager (Georgina Edwards) will be required
to attend Steering Committee meetings to report progress and resolve issues. The Steering Committee will be provided expert
technical advice through the Product Owner and the Intersect Project Manager. The
Steering Committee is expected to become the Executive User Group once the
project has completed.
The Project Director is the Chair
of the Steering Committee, and is responsible for overall direction of the
project and liaising with UWS executive.
The HCS vLab Project Manager will be
responsible for overall project planning, tracking and control, change
management, stakeholder communication both within and external to the project,
vLab marketing, user support, and acceptance testing. The PRINCE2 project
methodology will be used, leveraging off templates available through UWS
Information Technology Services’ Project Management Office. These methods will
be tailored somewhat to integrate with Intersect’s agile Scrum software
development method. The vLab Project Manager will work with the Steering
Committee to establish key success factors, metrics and project quality gates
for assessing project performance.
The Intersect Project Manager will
be responsible for software development and co-ordinating all requirements
analysis and software development activities associated with the project.
Intersect will be responsible for producing the solution design, a working vLab
application, and for integrating tools and corpora into the vLab. Intersect
employs an agile approach to software development – see Appendix C for
The RMIT / NICTA Software Developer will
be responsible for building software interfaces which enable the vLab to
interoperate with other applications and vLabs as described in B.7
The Product Owner will be
the main point of interface between the software engineering team and members
of the stakeholder working group, as per Intersect’s software methods which are
described in Appendix C. The Product Owner (see B.6) will be responsible for
the injection of domain-specific information into the framework, and will work
closely with the vLab Project Manager to gain input from external stakeholders
as and when required.
The Stakeholder Group comprises as wide a collection of representatives of end-users as
possible to ensure the widest possible engagement. Importantly, this group will
provide subject matter expertise in defining the functional requirements for
the vLab, and will participate in user testing. The primary stakeholder group
has already been identified: they are the HCS Tool Authors, HCS Corpus
Caretakers and HCS academic experts
from across the country, who will interact with the Intersect software
engineering team through the Product Owner. They will, along with academic
experts in particular areas of HCS and their doctoral students, test releases
of the HCS vLab (see Section B.15 for a more detailed description). There are
currently 47 academics (see Appendix A) who are committed to participating in
the Stakeholder Group. Further stakeholders will be nominated during the
elaboration phase, and will include members of the Steering Committee.
Scale, Key Deliverables and Acceptance Criteria
Please see Part D3.2 for more detail. A summary
is given below:
12 elapsed months
Total contingent effort
190 effort months
FTE (during project)
Funding Requested from
The effort estimates have been built according to Intersect standard
A standard 20% contingency has been applied to the estimate.
Project risk profile is Moderate. Overall mitigation strategy is to
build the system Design To Cost
FTE is a combination of project management personnel, product ownership,
software developers, testers, researcher developers,
and user representatives.
$423,000 of the co-investment is in cash. This will be used to part-fund
the Product Owner role, and for additional software development.
is from the following partners: UWS,
Macquarie, Melbourne, Sydney, ANU, Flinders, RMIT, NICTA, UNSW, UWA, UNE, UC,
LaTrobe, UTas, ASSTA, AusNC Inc.
and Acceptance Criteria
The 7 corpora, and the 11 HCS Tools developed by HCS
members to be made available in the HCS vLab project (see also B.7) will be
incorporated at different phases (N=3) of the project in an order determined by
consideration of the joint factors of maturity of the corpus and the amount of
work required to adapt the corpus for incorporation into the HCS vLab.
Table 1: Key Deliverables and Acceptance Criteria
Project plan; Quality plan; User
HCS vLab Architecture
Tested by collaborators and HDRs
VLab V1: Standard Execution
Basic Workflow platform: User
RDSI; Tool and Corpora Data.
Tested by collaborators and HDRs
Browse corpora and read data
Use tools on data
Signed off by Steering Committee;
Service Desk operational;
Impact and benefits tracking in
Tested by collaborators and HDRs
Execution Environment: NeCTAR
Workflow: Security and Access
UWS Research Data Catalogue;
Additional corpora (AusNC, AVOZES,
Tested by collaborators and HDRs
Tested by collaborators and HDRs
Post-implementation Review (PIR)
Name of Service/Deliverable
Date of deployment for
Date of deployment as
Sub-contract signed, project started
Elaboration Phase Complete
HCS vLab Version 1 Operational
HCS vLab Version 2 Operational
HCS vLab Version 3 Operational
Final Admin Closure
Application Support and Maintenance, User Group to Dec 2014
Application Support and Maintenance, User Group to Dec 2015
A detailed description of Intersect’s project approach is in the attached
document “Intersect Software Development Process” (Appendix C) and this is a
summary of that document. Intersect’s approach to running the project rests on
two principles: 1) Ongoing communication with the governance group, reporting
to them and soliciting input; and 2) Continual integration of a tested and
A project goes through four
stages: Concept; Elaboration; Development and Deployment. These four stages
will be spread across the 14 months of the HCS vLab project as set out below. The
Stakeholders Group and the Steering Committee will be involved in all four
stages of the project, with the Project Manager and the Product Owner ensuring
communication between the Steering Committee, the Stakeholder Group and the
Intersect development team and overseeing the integration and testing of the
HCS Corpora and Tools. HCS
end-user testers, drawn from the Stakeholder Group, will be responsible for
testing the software after functionality has been developed.
Specifically, for the HCS vLab, the Stakeholder Group will comprise:
Tool Authors: The author or developer of each of the 11 HCS
tools to be integrated in the HCS vLab will be responsible for adapting their
particular tool to the HCS vLab environment, with the assistance of the Product
Owner interacting with the Developer, Intersect.
Corpus Caretakers: For each of the 7 corpora to be
incorporated, at least one HCS Corpus Caretaker (for instance AusTalk and
PARADISEC have 2 and 3 developers respectively) will be responsible for
adapting their particular corpus to the HCS vLab environment, with the
assistance of the Product Owner interacting with the Developer, Intersect.
Sprint Testers: The HCS Tool Authors and HCS Corpus Caretakers
will be joined by another 26 academic experts in particular HCS disciplines to provide
in-kind co-investment of 3 occasions of 2 days of testing the HCS vLab functionalities
during the sprints in the development phase – see 3 below. These people
will be deployed when tools and/or corpora of interest in their particular area
have been incorporated (see Appendix B for a list of the tools and corpora). A
total of 15 Higher Degree Research (HDR) students from the 15 co-investing universities
and research institutions (see Appendix A) will be assigned to test the functionalities
of the HCS vLab during the sprints in the development phase – see 3
below. To provide continuity, each HDR student will act as tester for the three
consecutive sprints in a particular development phase (see Figure 1 below) and
then as a tester in the first sprint of the next development stage.
1. Concept Stage
In the concept stage, the development team and the stakeholders come to an
agreement on a concise definition of the key problem to be solved. This
statement identifies the nature of the problem, the group whom the problem
impacts, the impact, and the properties of a solution. The outcome of this
stage is a problem statement. The nature of the NeCTAR RfP means that proposed
projects are well into the concept stage by the time they are proposed.
The key outcome of the concept stage is an agreement on the problem statement.
2. Elaboration Stage
The elaboration stage is used to bootstrap the development stage. During this
stage, the artefacts used to monitor and steer the project are created,
including the initial user stories, a product backlog and a burn-up chart. In
addition, the team evaluates and settles on the key technology choices for the
The elaboration phase concludes when the team can articulate the key technical
risks and approaches to removing those risks; and when the team can identify
the key roles on the project (especially key stakeholders) and key constraints
of the project (e.g. dates). Key to the agile process is that these decisions
may change at a later stage.
At the end of this stage, a project management plan and a quality management plan
have been developed.
3. Development Stage
The development phase consists of two-week sprints where the team works to complete
the stories in that sprint. Usually planning, execution and review overlap, as
shown below. The defining characteristic of the development stage is that at
the end of each sprint, we have a potentially shippable increment of the
product. If the customer wishes, we can deploy this to production systems for
Figure 3: Illustration of when planning,
executing and reviewing of sprints occurs
Prior to the start of a sprint, the sprint is ‘planned’. A set of stories from the
backlog is selected and elaborated. This elaboration includes defining detailed
acceptance criteria: these criteria define when the story is ‘done’ (i.e.
acceptable to the end-user and integrated into the product). The selected
stories are the ones that will be implemented during the sprint. During the
sprint, the sprint is ‘executed’. Selected stories are developed, tested and
accepted by the stakeholders. At the end of each sprint the sprint is
‘reviewed’ and we have a demonstration of the functionality developed during
the sprint. Progress is then evaluated in terms of stories implemented, stories
remaining and budget spent.
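The end-of-sprint evaluation described above (stories implemented, stories remaining and budget spent) amounts to a running tally, which is also the data behind a burn-up chart. The sketch below shows one way to compute it; the function name, the figures and the per-sprint record format are hypothetical and not taken from Intersect's process documentation.

```python
def sprint_progress(total_stories, budget, sprints):
    """Return cumulative progress after each sprint.

    `sprints` is a list of (stories_completed, cost) pairs, one per
    sprint, in chronological order.
    """
    done = 0
    spent = 0
    report = []
    for number, (completed, cost) in enumerate(sprints, start=1):
        done += completed
        spent += cost
        report.append({
            "sprint": number,
            "stories_implemented": done,
            "stories_remaining": total_stories - done,
            "budget_spent_pct": round(100 * spent / budget, 1),
        })
    return report


# Hypothetical figures, for illustration only.
for row in sprint_progress(40, 100_000, [(5, 10_000), (8, 12_000)]):
    print(row)
```

Comparing the share of stories implemented with the share of budget spent after each sprint gives an early signal of whether scope needs to be traded off, which fits a design-to-cost strategy.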
Subsequent to the completion of a sprint’s review, the user-testing team is responsible
for testing the product against acceptance tests, and reporting defects found.
During subsequent sprints, reported defects are assessed and planned along with
other user stories.
The contribution of the partners to the development of the HCS vLab is as follows:
15 days of effort to help Intersect
integrate each tool or corpus contributed into the HCS vLab.
30 days of effort involved in vLab
requirements gathering and user testing.
3 days of effort involved in project
30 days of effort involved in user group
meetings and post-project user support activities (Jan 2014 – Dec 2015).
At the end of the last sprint in the Development
phase, we have a fully functioning, integrated and tested piece of software.
Unlike traditional waterfall development, there is no ‘final testing’ or
‘acceptance testing’ phase – the software has been tested by its users throughout
the project. During the Deployment phase the system is deployed to production
in accordance with the hosting arrangements.
Quality Control and Acceptance testing are
integrated in Intersect’s development approach, and are ongoing from the start
of the project. This mitigates the risks of “big-bang” integration and
acceptance testing. Part of the quality control is integrated with other
processes, including test-driven development and writing acceptance tests
before implementing a user story. Additional testing (e.g. non-functional
requirements, risk-mitigation testing) is performed based on the quality
management plan (see above).
For each user story implemented during
development, quality assurance is managed by interaction between the Product
Owner (as a representative of the governance group), the developer responsible
for that story, and the senior test-engineer on the project.
During planning for the sprint, the Product
Owner is responsible for defining the acceptance criteria for the story, with
support from a test engineer.
During the execution of the sprint, the
developer responsible for the implementation of the story defines automated
unit tests, to guard against regressions. Prior to the completion of the
sprint, the test engineer is responsible for signoff of the user story against
the acceptance criteria. Where possible, the acceptance tests are automated,
using frameworks such as Cucumber.
During the review of the sprint, the Product
Owner (representing the governance group) validates that the stories
implemented during the sprint have the correct functionality. Subsequent to the
demonstration, the end-user testers are responsible for end-user testing of the
system as implemented to date, as described above.
A deliverable is comprised of multiple user
stories. Formally, the Product Owner, acting as a proxy for the governance
group, is responsible for the acceptance of a deliverable. This responsibility
comprises: ensuring that a set of user stories covering the deliverable are
defined, elaborated, tested and reviewed.
Wherever possible, commissioning testing is combined with
Acceptance testing. That is, we deploy the system early and often, and
acceptance tests are run against a production system. This will be the NeCTAR
Research Cloud, as available, including the lead node at the University of
Melbourne during development.
Risk and Issue Management
This project has an external dependency on the
Mitigation: The project has the ability to use Intersect
This project has an external dependency on the
Mitigation: The project has the ability to use Intersect
This project depends on further development of
Mitigation: The project members will engage with the
This project depends on further development of
Mitigation: In most cases, there will be close collaboration
This project proposes a distributed
Mitigation: In addition to regular video-conferences
possible risks include:
to manage and arbitrate data access, including privacy concerns
on a lack of standards, or tools being integrated using non-standard formats
position for tools being integrated
research project requirements, making it difficult to arrive at a consolidated
position during the execution of the project
Approach to Risk Management
The HCS vLab project will maintain a risk register, recording the impact, likelihood
and mitigation strategy for each risk as it is identified. The risk register
will form part of the regular reporting to both the governance body and NeCTAR.
For technical risks, risk level is
accounted for in prioritising user stories: riskier stories are,
other things being equal, attempted before less risky ones. For
staffing-related risks, the use of Intersect
as a development partner with significant capacity ameliorates the risk of
departure of key staff. For external/dependency
risks, responsibility for identification and management of the risk profile
of the project rests with the Project Manager.
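As an illustration only, the prioritisation rule described above can be sketched in Python. The 1-to-5 scoring scales, the impact-times-likelihood risk level, the business_value field and all class and function names are assumptions made for this sketch; they are not part of the proposal or of Intersect's actual process.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Risk:
    """One entry in a hypothetical risk register."""
    description: str
    impact: int       # assumed scale: 1 (minor) to 5 (severe)
    likelihood: int   # assumed scale: 1 (rare) to 5 (almost certain)
    mitigation: str

    @property
    def level(self) -> int:
        # Assumed scoring rule: risk level = impact x likelihood.
        return self.impact * self.likelihood


@dataclass
class UserStory:
    """A backlog item; business_value and risk_level are illustrative fields."""
    title: str
    business_value: int
    risk_level: int = 0


def prioritise(stories: List[UserStory]) -> List[UserStory]:
    """Order stories by business value (highest first).

    Among stories of equal value, the riskier story comes first, which is
    one reading of "other things being equal, riskier stories first".
    """
    return sorted(stories, key=lambda s: (-s.business_value, -s.risk_level))
```

For example, given two stories of equal business value, prioritise places the one with the higher risk_level first, so that technical uncertainty is confronted early in the project.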
There is a significant move towards
interoperability of language resources and tools in the international arena as
national and international groups collaborate more frequently on large scale
projects. There are three primary areas of focus for standardisation that are
relevant to this project: meta-data, annotation standards and tool
Meta-data standardisation has long been
a focus of the language resources community, with well-established standards
such as OLAC and IMDI used to describe resources in various
meta-data repositories. More recent efforts have established the ISOcat
data category registry that is used to register vocabularies used at various
levels in language resources. A significant EU effort just getting underway is META-NET
(Multilingual Europe Technology Alliance) which aims to establish a platform
for resource sharing around Europe and has already made significant
contributions relating to meta-data management. The current ANDS-funded Australian National Corpus project is developing a
hybrid meta-data model suitable for describing language corpora which can be
exported to the ANDS Research Data Collection via the RIF-CS
standard vocabulary. The HCS vLab will make ARDC-compliant use of RIF-CS at the
collection level, through automatic export of selected data set descriptions to
ARDC as well as by allowing import of data that others have ‘advertised’ on the
The standardisation of annotation
formats and the semantics of various annotation schemes has been a focus of the
ISO TC 37 (Terminology and Other Language and Content Resources) working
groups. These groups have established, for example, standards for morphosyntactic
annotation (MAF), a Lexical Markup Framework for use in dictionary-like
resources (LMF) and an interchange format for annotation (LAF/GrAF). These
formats and standards are now gradually being adopted by projects around the
world and will provide for increased interoperability between language
resources. Project member Steve Cassidy is an active member of a number of ISO
TC 37 working groups and was recently invited to help establish a working group
to standardise query languages for language resources.
Internationally: A number of projects in the EU and US aim to develop standard
web-service architectures for defining and managing work-flows that process
audio or textual resources using tools such as parsers, taggers, speech
recognisers, etc. One important
project that we are associated with is the US NSF-funded project “The
Language Application Grid: A Framework for Rapid Adaptation and Reuse” managed
by Nancy Ide and James Pustejovsky which names the Australian National Corpus
as one of a number of international collaborators. The HCS vLab will work closely with these
international partners to ensure that we are building compatible and
NICTA will contribute support
for interoperability with UIMA, the Unstructured Information Management
Architecture, an emerging standard for wrapping components for processing
language, speech, video, and other unstructured data. Dr. Verspoor (NICTA) was
a member of the OASIS standard committee for UIMA and is an editor of the
The Humanities Networked
Infrastructure (HuNI) is a NeCTAR Virtual Lab established in the first round of funding.
HuNI is primarily concerned with developing sustainable, relevant and enabling
infrastructure for Australian humanities researchers and cultural custodians
and it involves researchers and custodians working with cultural datasets. There
is some overlap in the kinds of data being managed by each network and we
believe that there is significant scope for HuNI to benefit from some of the
workflows that will be developed within HCS vLab. We will work closely with
HuNI to ensure that data can be exchanged where appropriate to maximise the
value of this infrastructure to the audiences of both virtual laboratories.
University is making
available the Humanities and Social Sciences Visualisation Laboratory (VisLab).
VisLab allows for the remote use of scientific instruments and imaging of
scientific data, creating a capability for interactive research collaboration,
visualisation and imaging. For instance, the city of Melbourne is a research
hub in the phonetics of endangered languages, yet the largest speech research
facilities are at the University of Western Sydney and Macquarie University in
Sydney. Using the VisLab, phonetics staff members and PhD students in Melbourne
would have access to expensive equipment such as the 3D Carstens EMA machine at
UWS (worth approximately $100K), without having to travel to Sydney for the
fiddly and time-consuming trial-and-error stage of data acquisition. Such
virtual access will save resources in the long term, since errors are less
likely in the data collection stage if there has been sufficient time for
DETAILS IN ATTACHMENT D3 MILESTONE AND FUNDING TEMPLATE
SERVICES AND SUPPORT
The service levels are set out in Table 4. Please refer to B.20 for
further details of these services.
Table 4: Service level that will be offered for each service
24×7 365 days per year, with
For automated workflows: as
For manual workflows:
Virtual Service Desk
Initial response: next
Initial response: next
Operations and User Support
With reference to the services in Table 4, the following
operations and user support will be provided:
Intersect will provide the application hosting support
service for the HCS vLab application. Each service hosted by Intersect has an
individual who is ultimately responsible for the provision of that service
– the service owner. Services hosted by Intersect are monitored and
backed up, and user support is provided during business hours. All services are
automatically monitored for availability using a ‘shallow monitoring’ approach
(“Is the service alive?”). Where appropriate, services are monitored using a
‘deep monitoring’ approach (“Is the service responding sensibly?”). Unexpected
outages are:
published via a mailing list, the makeup of
the mailing list being maintained by the service owner;
escalated to the systems administration team.
Planned outages, including for upgrades of software and hardware, are requested by the
systems administration team. These outages are publicised and negotiated with
the user base of the service by the service owner.
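The distinction between ‘shallow’ and ‘deep’ monitoring can be illustrated with a minimal Python sketch. This is not Intersect's monitoring tooling; the function names, the use of HTTP, the 200-status rule and the expected-text check are all assumptions chosen for illustration.

```python
import urllib.error
import urllib.request
from typing import Callable, Tuple

# A probe returns (HTTP status code, response body) or raises OSError
# when the service cannot be reached at all.
Probe = Callable[[], Tuple[int, str]]


def http_probe(url: str, timeout: float = 5.0) -> Probe:
    """Build a probe for a (hypothetical) service URL."""
    def probe() -> Tuple[int, str]:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status, resp.read().decode("utf-8", errors="replace")
        except urllib.error.HTTPError as err:
            # The server answered, but with an error status.
            return err.code, err.read().decode("utf-8", errors="replace")
    return probe


def shallow_check(probe: Probe) -> bool:
    """'Is the service alive?': any answer at all counts as alive."""
    try:
        probe()
        return True
    except OSError:
        return False


def deep_check(probe: Probe, expected_text: str) -> bool:
    """'Is the service responding sensibly?': assumed here to mean a
    200 status and a body containing a known marker string."""
    try:
        status, body = probe()
    except OSError:
        return False
    return status == 200 and expected_text in body
```

A service that returns an error page would pass shallow_check but fail deep_check, which is the case that deep monitoring is intended to catch.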
The following business services have been identified:
User Registration, Research Data Annotation
Service, Tool Execution Service, Federated Search, Corpora Management and
Commissioning Service, Tool Management and Commissioning Service, Workflow
Capture Service, ANDS Publication Service.
UWS Marcs will be responsible for servicing
these requests. The project is aiming to implement a self-service model.
However, it is likely that not all services will be fully automated; some
services may require human co-ordination. The requests will originate from
application workflows. Service levels are as specified in B.19.
Virtual Service Desk
UWS and Intersect will jointly maintain a virtual service desk for users. Intersect
will provide a single point of access for end-users of the service to lodge and
track issues and other service requests. UWS Marcs will provide tier-1 (first
response and triage) and tier-2 (business workflow and problem resolution)
support to users. UWS and Intersect will provide tier-3 (software defect
resolution) support for the vLab framework proper, whereas the tools themselves
will be supported by contributing project partners. Jira will be used to track
all issues and service requests. UWS is committed to support the service desk
The application support service (e.g., account creation and
access control, troubleshooting) will be carried out by UWS Marcs tier-1
support. In addition to the Virtual Service Desk, UWS Marcs and Intersect will
provide learning and development (L&D) and outreach programs. Support for
the project during its operation will include:
and delivery of training modules for the use of the outcomes of the project
through UWS’ and Intersect’s learning and development activities,
Publicity and on-going assistance with the outcomes of the project,
through Intersect’s outreach activities. This work is distributed amongst the
team of 9 research analysts.
The HCS vLab will be operationalised throughout the course of the project and is
expected to attract a broad and active user group both within the research
communities, and by external users. After the project is completed, UWS Marcs,
with its partners, will take responsibility for supporting and hosting the HCS
vLab. As researcher requirements are expected to be diverse with new corpora
and new tools becoming available, there will be strong drivers in place for
further development and improvement to occur. UWS Marcs will:
Explore opportunities to
expand the HCS vLab service desk.
Negotiate with its partners
to secure additional funding for software development ($150,000 has already
Engage and influence UWS to
provide further operational and application support for the HCS vLab.
Explore opportunities to
develop a transactional cost model for external parties to gain access to the HCS
Explore opportunities to
develop a data mining and analysis service for external parties.
Apply for further
infrastructure grants to expand the HCS vLab further.
Facilitate the development
of training courses for adding tools and corpora to the HCS vLab –
on-line and face to face.
Commercialise the system
and make it available to new and emerging research centres and external
commercial enterprises for a fee. UWS would commit to reinvesting funds back
into the system and establish a product user group.
Hold conferences and forums
to further the concept of the HCS vLab. Resulting funds will be reinvested.
There are already very strong links between the HCS vLab partners and the research
community (see B.2) and it is expected that the infrastructure built will
encourage the integration of future databases and analysis tools. Several
partners are already planning to add corpora and tools to the HCS vLab (see B.9
and B.10). In addition, A/Prof. Drew Khlentzos, from the Language and Cognition
Research Centre at UNE, has proposed the integration of a new logic database
comprising core logical principles governing the main logical operators that
are expressed in most (if not all) of the world’s languages, which should be
available for incorporation into the HCS vLab within the next 2 years.
Licensing and Access
Intersect currently makes software available under an
open source licence. Intersect’s IP in NeCTAR projects will be made available
to Australian publicly funded researchers under the same conditions.
All the corpora and tools in the HCS vLab will be
available to researchers and users under appropriate licensing, according to
guidelines published by the Australian National Data Service.
There are three periods during which customer satisfaction is solicited by Intersect.
During the concept and elaboration stages (see B.15), the governance and stakeholder
groups represent the customer. They are engaged frequently and in depth throughout
this phase by the developer, Intersect. Project plans and quality management
plans are reviewed by the group, as is the initial list of user stories that
defines the initial direction of the project. There is no separate process by
which satisfaction is measured – it emerges from the collaborative nature
of the project.
During the development and deployment stages (see B.15), again the governance and
stakeholder groups represent the customer. The stakeholder group is engaged via
sprint-end demonstrations and end-user testing, and can affect the course of
the project during sprint planning. Rather than attempt to measure
satisfaction, Intersect will engage continuously, and allow the customer
regular opportunities to effect change in the project’s direction. The Project
Manager reports to the Steering Committee monthly. The governance group is
invited to provide input at that point, including their level of satisfaction.
After deployment of every project carried out by Intersect, the Development Team,
Product Owner and Project Manager conduct a ‘lessons learned’ forum.
Representatives from the Steering Committee can attend and provide input. At
this forum, the outputs of the project and the process by which they were
derived are critically analysed.
months, the governance group are consulted for their satisfaction with the
process and outputs of the project. Intersect uses a standard form to solicit
the group’s input. Formal channels, as outlined above, go some way to measuring
customer satisfaction. Intersect also makes active use of a variety of informal
channels to solicit feedback, including consulting DVCRs or PVCRs and CIOs of
universities making use of the service, attending university eResearch
committee meetings, and through the outreach conducted by eResearch analysts.
These informal channels provide an important complement to the formal ones.
engagement under Intersect’s development model requires that the stakeholder
group is broadly representative of the eventual end-user community. Further,
the model requires a significant investment of time from members of the
stakeholder group over the period of the project.
Date first required
Milestones or Deliverables dependent on that capability
Seeding The Commons Project SC20
and AusNC Inc.
and Support for integrating the Aus NC corpora and tool sets